OCR Studio

OCR Studio is a React + FastAPI app for image/PDF OCR, with a CLI for terminal-based OCR. It uses GLM-OCR via mlx-vlm on Apple Silicon.

Highlights

  • GLM-OCR prompt modes: plain_ocr, table, formula
  • Image OCR and multi-page PDF OCR
  • Output formats for PDF: json, markdown, html, docx
  • CLI for running OCR directly from the terminal (no server required)
  • Web UI with React frontend and FastAPI backend
  • Apple Silicon friendly (no CUDA/NVIDIA requirement)

Architecture

Two ways to use OCR Studio:

CLI:    ./ocr-studio image|pdf  -->  mlx-vlm (in-process)

Web:    React UI  -->  FastAPI API  -->  mlx_vlm.server (port 8080)
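In the web path, the FastAPI backend forwards each OCR request to mlx_vlm.server's /chat/completions endpoint. A minimal sketch of what such a payload might look like, assuming an OpenAI-style chat-completions schema with the image inlined as a base64 data URL (the backend's actual field names may differ):

```python
# Sketch of an OpenAI-style chat/completions payload for mlx_vlm.server.
# Assumption: field names follow the common image_url convention; the real
# backend may structure the request differently.
import base64

def build_payload(image_bytes, prompt, model="mlx-community/GLM-OCR-bf16"):
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```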

Quick Start (Apple M2 / M3)

1) Set up the environment

A single virtualenv (.venv) holds both mlx-vlm and the backend/CLI dependencies. The install script handles everything:

./scripts/install-mlx-vlm.sh install

This creates .venv and installs mlx-vlm, PyMuPDF, python-docx, markdown, Pillow, and other required packages.

2) Use the CLI (no server needed)

The CLI loads the model in-process — no separate server to start:

# OCR a single image
./ocr-studio image photo.png

# OCR with table mode
./ocr-studio image scan.png --mode table

# Batch OCR multiple images to a directory
./ocr-studio image *.png --output results/

# OCR a PDF to markdown
./ocr-studio pdf document.pdf --format markdown --output result.md

# OCR a PDF to DOCX
./ocr-studio pdf document.pdf --format docx --output result.docx

# Quiet mode — only OCR output, no progress on stderr
./ocr-studio -q image photo.png > result.txt

See CLI Reference below for full details.

3) Use the Web UI (optional)

The .venv environment from step 1 already includes most dependencies. Install the remaining backend packages into it:

.venv/bin/python -m pip install -r backend/requirements.txt

Configure the app:

cp .env.example .env

Start all services:

./scripts/start-local.sh start

Check status:

./scripts/start-local.sh status

  • Frontend: http://localhost:3000
  • Backend: http://localhost:8000

Backend env keys:

  • GLM_OCR_API_URL (default http://localhost:8080/chat/completions)
  • GLM_OCR_MODEL (default mlx-community/GLM-OCR-bf16)
  • GLM_OCR_API_KEY (optional)
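A sketch of how the backend is presumed to read these keys with the documented defaults (the actual handling lives in the backend code and may use a settings library instead):

```python
# Sketch: reading the backend env keys with the defaults listed above.
import os

GLM_OCR_API_URL = os.getenv("GLM_OCR_API_URL",
                            "http://localhost:8080/chat/completions")
GLM_OCR_MODEL = os.getenv("GLM_OCR_MODEL", "mlx-community/GLM-OCR-bf16")
GLM_OCR_API_KEY = os.getenv("GLM_OCR_API_KEY") or None  # optional
```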

CLI Reference

./ocr-studio <command> [options]
| Global flag | Description |
| --- | --- |
| `-q, --quiet` | Suppress progress messages on stderr |

image — OCR image files

./ocr-studio image <files...> [--mode MODE] [--model MODEL] [--output PATH]
| Flag | Description | Default |
| --- | --- | --- |
| `files` | One or more image files (positional) | required |
| `--mode` | `plain_ocr`, `table`, or `formula` | `plain_ocr` |
| `--model` | Hugging Face model ID | `mlx-community/GLM-OCR-bf16` |
| `--output` | File or directory (stdout if omitted) | stdout |

When --output is a directory, each input file produces a .txt file in that directory. When it is a file path, all results are written to that single file.
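That rule can be sketched as a small helper (illustrative only; the real logic in cli/ocr_cli.py may differ):

```python
# Sketch of the --output rule: directory -> one .txt per input,
# file path -> all results in that single file.
from pathlib import Path

def resolve_output(output: str, input_file: str) -> Path:
    out = Path(output)
    if out.is_dir():
        # Per-input .txt file inside the directory, named after the input.
        return out / (Path(input_file).stem + ".txt")
    # Otherwise everything is written to this single file.
    return out
```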

pdf — OCR a PDF

./ocr-studio pdf <file> [--mode MODE] [--format FMT] [--model MODEL] [--output PATH] [--dpi DPI]
| Flag | Description | Default |
| --- | --- | --- |
| `file` | PDF file (positional) | required |
| `--mode` | `plain_ocr`, `table`, or `formula` | `plain_ocr` |
| `--format` | `json`, `markdown`, `html`, or `docx` | `markdown` |
| `--model` | Hugging Face model ID | `mlx-community/GLM-OCR-bf16` |
| `--output` | Output file (stdout if omitted; required for docx) | stdout |
| `--dpi` | PDF rendering resolution | 144 |
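The `--dpi` value determines rendered pixel dimensions: PDF pages are measured in points (1 pt = 1/72 inch), so rendering at a given DPI scales the page by dpi/72. A quick sketch of the arithmetic:

```python
# Map a PDF page size in points to pixel dimensions at a given DPI.
# 1 point = 1/72 inch, so the zoom factor is dpi / 72.
def page_pixels(width_pt: float, height_pt: float, dpi: int = 144) -> tuple[int, int]:
    zoom = dpi / 72
    return round(width_pt * zoom), round(height_pt * zoom)
```

At the default 144 dpi, a US Letter page (612 x 792 pt) renders at 1224 x 1584 px; doubling `--dpi` quadruples the pixel count, and with it the per-page OCR cost.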

Configuration

.env.example:

# API
API_HOST=0.0.0.0
API_PORT=8000

# Frontend
FRONTEND_PORT=3000

# GLM OCR proxy
GLM_OCR_API_URL=http://localhost:8080/chat/completions
GLM_OCR_MODEL=mlx-community/GLM-OCR-bf16
GLM_OCR_API_KEY=

# Upload
MAX_UPLOAD_SIZE_MB=100

Frontend proxy target (optional):

  • By default Vite proxies /api to http://localhost:8000
  • Override with VITE_PROXY_TARGET if backend is elsewhere

API

POST /api/ocr

Form fields:

  • image (required)
  • mode (plain_ocr|table|formula)

Response:

{
  "success": true,
  "text": "...",
  "raw_text": "...",
  "image_dims": { "w": 1024, "h": 768 },
  "metadata": { "mode": "plain_ocr" }
}
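The endpoint can be called with only the Python standard library. A sketch, assuming the backend runs on localhost:8000 as configured above (the multipart helper is illustrative, not part of the project):

```python
# Sketch: POST an image to /api/ocr using only the standard library.
import json
import mimetypes
import urllib.request
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f"Content-Type: {ctype}\r\n\r\n").encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def ocr_image(path, mode="plain_ocr", base_url="http://localhost:8000"):
    with open(path, "rb") as f:
        data = f.read()
    body, ctype = build_multipart({"mode": mode}, "image", path, data)
    req = urllib.request.Request(
        f"{base_url}/api/ocr", data=body,
        headers={"Content-Type": ctype}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```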

POST /api/process-pdf

Form fields:

  • pdf_file (required)
  • mode (plain_ocr|table|formula)
  • output_format (json|markdown|html|docx)
  • dpi (optional int)

Development

Backend sanity check:

python3 -m py_compile backend/*.py
python3 -m py_compile cli/ocr_cli.py

Backend tests (if virtualenv is set up):

.venv/bin/python -m pytest backend/tests -q

Frontend build:

cd frontend
npm install
npm run build

Full local smoke test:

./scripts/smoke-test-local.sh

Start/stop local app services:

./scripts/start-local.sh check
./scripts/start-local.sh start
./scripts/start-local.sh status
./scripts/start-local.sh stop

Install mlx-vlm and CLI dependencies into .venv:

./scripts/install-mlx-vlm.sh install
./scripts/install-mlx-vlm.sh check

Troubleshooting

  • CLI: the model is loaded into memory on every invocation (~10-30 s), so each run pays a startup cost; per-image and per-page processing after that is fast. For batch work, pass multiple files in one command to amortize the load time.
  • Web: the first MLX request can be slow due to warm-up and shader compilation.
  • During model warm-up, OCR requests may briefly return 503. Retry after a few seconds.
  • If the upstream OCR server is unavailable, /health still responds, but OCR calls return upstream errors.
  • Large bundle warnings from Vite are non-blocking; the production build still succeeds.
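The warm-up 503s mentioned above can be absorbed with a small retry wrapper around any client call; a sketch (names and defaults are illustrative):

```python
# Retry a callable on retriable HTTP errors (e.g. 503 during model warm-up).
# urllib.error.HTTPError subclasses OSError and carries a .code attribute.
import time

def with_retry(call, retries=5, delay=2.0, retriable=(503,)):
    for attempt in range(retries):
        try:
            return call()
        except OSError as exc:
            code = getattr(exc, "code", None)
            if code in retriable and attempt < retries - 1:
                time.sleep(delay)  # give the model time to finish warming up
                continue
            raise
```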

License

Licensed under MIT. See LICENSE.
