OCR Studio

OCR Studio is a React + FastAPI app for image/PDF OCR, with a CLI for terminal-based OCR. It uses GLM-OCR via mlx-vlm on Apple Silicon.

Highlights

  • GLM-OCR prompt modes: plain_ocr, table, formula
  • Image OCR and multi-page PDF OCR
  • Output formats for PDF: json, markdown, html, docx
  • CLI for running OCR directly from the terminal (no server required)
  • Web UI with React frontend and FastAPI backend
  • Apple Silicon friendly (no CUDA/NVIDIA requirement)

Architecture

Two ways to use OCR Studio:

CLI:    ./ocr-studio image|pdf  -->  mlx-vlm (in-process)

Web:    React UI  -->  FastAPI API  -->  mlx_vlm.server (port 8080)
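In the web path, the FastAPI backend forwards each OCR request to mlx_vlm.server's /chat/completions endpoint. A minimal sketch of what such a payload might look like, assuming an OpenAI-style chat-completions schema with the image inlined as a base64 data URL (the backend's actual field names may differ):

```python
# Sketch of an OpenAI-style chat/completions payload for mlx_vlm.server.
# Assumption: field names follow the common image_url convention; the real
# backend may structure the request differently.
import base64

def build_payload(image_bytes, prompt, model="mlx-community/GLM-OCR-bf16"):
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```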

Quick Start (Apple M2 / M3)

1) Set up the environment

A single virtualenv (.venv) holds both mlx-vlm and the backend/CLI dependencies. The install script handles everything:

./scripts/install-mlx-vlm.sh install

This creates .venv and installs mlx-vlm, PyMuPDF, python-docx, markdown, Pillow, and other required packages.

2) Use the CLI (no server needed)

The CLI loads the model in-process — no separate server to start:

# OCR a single image
./ocr-studio image photo.png

# OCR with table mode
./ocr-studio image scan.png --mode table

# Batch OCR multiple images to a directory
./ocr-studio image *.png --output results/

# OCR a PDF to markdown
./ocr-studio pdf document.pdf --format markdown --output result.md

# OCR a PDF to DOCX
./ocr-studio pdf document.pdf --format docx --output result.docx

# Quiet mode — only OCR output, no progress on stderr
./ocr-studio -q image photo.png > result.txt

See CLI Reference below for full details.

3) Use the Web UI (optional)

The .venv environment from step 1 already includes most dependencies. Install the remaining backend packages into it:

.venv/bin/python -m pip install -r backend/requirements.txt

Configure the app:

cp .env.example .env

Start all services:

./scripts/start-local.sh start

Check status:

./scripts/start-local.sh status

  • Frontend: http://localhost:3000
  • Backend: http://localhost:8000

Backend env keys:

  • GLM_OCR_API_URL (default http://localhost:8080/chat/completions)
  • GLM_OCR_MODEL (default mlx-community/GLM-OCR-bf16)
  • GLM_OCR_API_KEY (optional)
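A sketch of how the backend is presumed to read these keys with the documented defaults (the actual handling lives in the backend code and may use a settings library instead):

```python
# Sketch: reading the backend env keys with the defaults listed above.
import os

GLM_OCR_API_URL = os.getenv("GLM_OCR_API_URL",
                            "http://localhost:8080/chat/completions")
GLM_OCR_MODEL = os.getenv("GLM_OCR_MODEL", "mlx-community/GLM-OCR-bf16")
GLM_OCR_API_KEY = os.getenv("GLM_OCR_API_KEY") or None  # optional
```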

CLI Reference

./ocr-studio <command> [options]
| Global flag | Description |
| --- | --- |
| `-q, --quiet` | Suppress progress messages on stderr |

image — OCR image files

./ocr-studio image <files...> [--mode MODE] [--model MODEL] [--output PATH]
| Flag | Description | Default |
| --- | --- | --- |
| `files` | One or more image files (positional) | required |
| `--mode` | `plain_ocr`, `table`, or `formula` | `plain_ocr` |
| `--model` | Hugging Face model ID | `mlx-community/GLM-OCR-bf16` |
| `--output` | File or directory (stdout if omitted) | stdout |

When --output is a directory, each input file produces a .txt file in that directory. When it is a file path, all results are written to that single file.
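That rule can be sketched as a small helper (illustrative only; the real logic in cli/ocr_cli.py may differ):

```python
# Sketch of the --output rule: directory -> one .txt per input,
# file path -> all results in that single file.
from pathlib import Path

def resolve_output(output: str, input_file: str) -> Path:
    out = Path(output)
    if out.is_dir():
        # Per-input .txt file inside the directory, named after the input.
        return out / (Path(input_file).stem + ".txt")
    # Otherwise everything is written to this single file.
    return out
```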

pdf — OCR a PDF

./ocr-studio pdf <file> [--mode MODE] [--format FMT] [--model MODEL] [--output PATH] [--dpi DPI]
| Flag | Description | Default |
| --- | --- | --- |
| `file` | PDF file (positional) | required |
| `--mode` | `plain_ocr`, `table`, or `formula` | `plain_ocr` |
| `--format` | `json`, `markdown`, `html`, or `docx` | `markdown` |
| `--model` | Hugging Face model ID | `mlx-community/GLM-OCR-bf16` |
| `--output` | Output file (stdout if omitted; required for docx) | stdout |
| `--dpi` | PDF rendering resolution | 144 |
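The `--dpi` value determines rendered pixel dimensions: PDF pages are measured in points (1 pt = 1/72 inch), so rendering at a given DPI scales the page by dpi/72. A quick sketch of the arithmetic:

```python
# Map a PDF page size in points to pixel dimensions at a given DPI.
# 1 point = 1/72 inch, so the zoom factor is dpi / 72.
def page_pixels(width_pt: float, height_pt: float, dpi: int = 144) -> tuple[int, int]:
    zoom = dpi / 72
    return round(width_pt * zoom), round(height_pt * zoom)
```

At the default 144 dpi, a US Letter page (612 x 792 pt) renders at 1224 x 1584 px; doubling `--dpi` quadruples the pixel count, and with it the per-page OCR cost.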

Configuration

.env.example:

# API
API_HOST=0.0.0.0
API_PORT=8000

# Frontend
FRONTEND_PORT=3000

# GLM OCR proxy
GLM_OCR_API_URL=http://localhost:8080/chat/completions
GLM_OCR_MODEL=mlx-community/GLM-OCR-bf16
GLM_OCR_API_KEY=

# Upload
MAX_UPLOAD_SIZE_MB=100

Frontend proxy target (optional):

  • By default Vite proxies /api to http://localhost:8000
  • Override with VITE_PROXY_TARGET if backend is elsewhere

API

POST /api/ocr

Form fields:

  • image (required)
  • mode (plain_ocr|table|formula)

Response:

{
  "success": true,
  "text": "...",
  "raw_text": "...",
  "image_dims": { "w": 1024, "h": 768 },
  "metadata": { "mode": "plain_ocr" }
}
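The endpoint can be called with only the Python standard library. A sketch, assuming the backend runs on localhost:8000 as configured above (the multipart helper is illustrative, not part of the project):

```python
# Sketch: POST an image to /api/ocr using only the standard library.
import json
import mimetypes
import urllib.request
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f"Content-Type: {ctype}\r\n\r\n").encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def ocr_image(path, mode="plain_ocr", base_url="http://localhost:8000"):
    with open(path, "rb") as f:
        data = f.read()
    body, ctype = build_multipart({"mode": mode}, "image", path, data)
    req = urllib.request.Request(
        f"{base_url}/api/ocr", data=body,
        headers={"Content-Type": ctype}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```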

POST /api/process-pdf

Form fields:

  • pdf_file (required)
  • mode (plain_ocr|table|formula)
  • output_format (json|markdown|html|docx)
  • dpi (optional int)

Development

Backend sanity check:

python3 -m py_compile backend/*.py
python3 -m py_compile cli/ocr_cli.py

Backend tests (if virtualenv is set up):

.venv/bin/python -m pytest backend/tests -q

Frontend build:

cd frontend
npm install
npm run build

Full local smoke test:

./scripts/smoke-test-local.sh

Start/stop local app services:

./scripts/start-local.sh check
./scripts/start-local.sh start
./scripts/start-local.sh status
./scripts/start-local.sh stop

Install mlx-vlm and CLI dependencies into .venv:

./scripts/install-mlx-vlm.sh install
./scripts/install-mlx-vlm.sh check

Troubleshooting

  • CLI: the model is loaded into memory on every invocation (~10-30 s), so each run pays a startup cost; per-image and per-page processing after that is fast. For batch work, pass multiple files in one command to amortize the load time.
  • Web: the first MLX request can be slow due to warm-up and shader compilation.
  • During model warm-up, OCR requests may briefly return 503. Retry after a few seconds.
  • If the upstream OCR server is unavailable, /health still responds, but OCR calls return upstream errors.
  • Large bundle warnings from Vite are non-blocking; the production build still succeeds.
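The warm-up 503s mentioned above can be absorbed with a small retry wrapper around any client call; a sketch (names and defaults are illustrative):

```python
# Retry a callable on retriable HTTP errors (e.g. 503 during model warm-up).
# urllib.error.HTTPError subclasses OSError and carries a .code attribute.
import time

def with_retry(call, retries=5, delay=2.0, retriable=(503,)):
    for attempt in range(retries):
        try:
            return call()
        except OSError as exc:
            code = getattr(exc, "code", None)
            if code in retriable and attempt < retries - 1:
                time.sleep(delay)  # give the model time to finish warming up
                continue
            raise
```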

License

Licensed under MIT. See LICENSE.
