OCR Workbench

OCR Workbench is a ready-to-use framework for easily comparing popular OCR libraries in Python. It abstracts away the individual setup and usage details of each library, allowing you to focus on evaluating results on your data rather than spending time implementing each method yourself.

Simply provide a collection of PDF files and compare all libraries using a single script.

Supported Libraries

Currently, the following libraries are supported:

Docling (with Tesseract, EasyOCR, RapidOCR, surya, and Granite)
MinerU
Marker
Azure Document Intelligence
LightOnOCR-2-1B
Chandra OCR 2

This selection focuses on benchmarking open source libraries against the proprietary Azure Document Intelligence service.

Features

Running experiments with the above libraries via a single script
Automatic conversion of all provided PDF files in data/input into markdown using all methods
Hardware acceleration using CPU, MPS and CUDA
Time and CPU memory tracking for each method

Qualitative Evaluation on 4 Sample Documents

Using our script, we produced OCR outputs for four example documents and compared them in terms of OCR quality as well as resource consumption.

The following publicly available PDFs were used:

Information About Coca-Cola Volume Growth
Handwriting Sample from NIST Special Database 19 (the sample image was saved as a PDF file)
2020 Annual Report Midwest Food Bank
RKI: Epidemiologisches Bulletin (German)

Speed is measured on a MacBook Air M4 for CPU and MPS, and on an NVIDIA RTX 5090 GPU (32 GB VRAM) for CUDA. Memory usage is measured only once, on CPU. OCR quality is subjectively graded and compared based on the markdown output stored in data/output/<ocr-method>/<file-name>.md.

Summary

The following table summarizes the comparison of all methods on all PDFs. It shows extraction quality (Excellent, Very good, Good, Medium, Poor), GPU extraction speed in seconds, and cost per page when running on an NVIDIA RTX 5090 GPU rented on runpod.io for 89 ct/hour. We did not carry out any runtime optimizations for any method.

OCR-Library	Coca-Cola	NIST Handwriting	World Food Bank	RKI Bulletin (German)	Cost / page
LightOnOCR-2-1B	🏆 Excellent (192s)	🏆 Excellent (23s)	🏆 Excellent (191s)	🏆 Excellent (416s)	0.5 ct
Chandra OCR 2	🏆 Excellent (376s)	🏆 Excellent (120s)	🏆 Excellent (476s)	🏆 Excellent (863s)	1.4 ct
Document Intelligence	🟢 Very good (8s)	🟢 Good (5s)	🟢 Very good (14s)	🏆 Excellent (10s)	1 ct
Docling - suryaocr	🟢 Very good (31s)	🔴 Poor (8s)	🟢 Good (49s)	🟢 Very good (270s)	0.17 ct
Docling - RapidOCR	🟢 Good (12s)	🔴 Poor (4s)	🟢 Good (28s)	🟢 Very good (52s)	0.06 ct
MinerU	🟢 Good (42s)	🔴 Poor (16s)	🟢 Good (60s)	🟢 Very good (88s)	0.17 ct
marker	🟢 Good (29s)	🔴 Poor (5s)	🟢 Good (35s)	🟡 Medium (143s)	0.11 ct
Docling - Granite	🔴 Poor (343s)	🟡 Medium (108s)	🔴 Poor (171s)	🟢 Good (597s)	1.16 ct
Docling - EasyOCR	🟡 Medium (37s)	🔴 Poor (7s)	🟡 Medium (41s)	🟢 Very good (95s)	0.11 ct
Docling - Tesseract	🔴 Poor (32s)	🔴 Poor (10s)	🔴 Poor (50s)	🟢 Good (101s)	0.13 ct
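
The cost-per-page column follows from GPU time and the hourly rental price. The snippet below is a minimal sketch of that arithmetic, assuming the cost is the total GPU time multiplied by the hourly rate and divided by the number of pages processed. The function name is hypothetical, and the page counts of the sample PDFs are not stated here, so the page count is left as a parameter.

```python
def cost_per_page_ct(gpu_seconds: float, pages: int,
                     rate_ct_per_hour: float = 89.0) -> float:
    """Return the GPU rental cost per page in cents.

    gpu_seconds: total GPU processing time for the documents
    pages: number of pages processed (not stated in this README)
    rate_ct_per_hour: rental price in cents per hour (89 ct/hour above)
    """
    if pages <= 0:
        raise ValueError("pages must be positive")
    hours = gpu_seconds / 3600.0
    return hours * rate_ct_per_hour / pages
```

For example, one hour of GPU time spread over 100 pages gives cost_per_page_ct(3600, 100), which is 0.89 ct per page.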

The open-weights models LightOnOCR-2-1B and Chandra OCR 2 yield the best results. For LightOnOCR-2-1B this is notable: at 0.5 ct per page it costs half as much as Azure Document Intelligence (1 ct per page) while delivering better extraction quality on three of the four documents. The proprietary Azure Document Intelligence yields the second-best results and is the fastest method, at least when compared against an NVIDIA RTX 5090 GPU. RapidOCR was the fastest open source alternative and therefore the cheapest method in our experiments.

Especially when extraction speed does not matter as much (e.g., in offline computation settings), open source methods can be a much cheaper alternative. Moreover, more powerful GPUs (H100 or better) could be used to catch up with Document Intelligence speed. Frameworks like vLLM and specifically compiled modules like FlashAttention can then be used to further optimize inference speed.

Per-document results can be found in the Qualitative Evaluation Details section below.

Why do we still care about OCR?

Scanned documents are essentially collections of images stored in PDF format. While easy to view and share, their text content is not machine-readable by default. Extracting this text requires specialized machine learning techniques. Optical Character Recognition (OCR) converts text in images into structured, machine-readable data that can be reliably used for tasks such as document search, analysis, and automated processing.

In this repository, we compare open source OCR engines against proprietary ones. We also include VLM-based approaches.

Installation

Since the dependencies of different OCR libraries can have conflicts, we use a separate Python environment per OCR library. Each one uses uv as a dependency manager. Make sure to install uv before moving on.

Set up the respective environment using:

cd <environment-name>
uv sync

where <environment-name> is one of docling_environment, marker_environment, mineru_environment, docint_environment, lighton_environment, chandra_environment.
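
If you want to set up several environments in one go, the two commands above can be looped over. The following is a minimal sketch, assuming it is run from the repository root and that uv is on the PATH; the helper names sync_plan and sync_all are hypothetical and not part of this repository.

```python
import subprocess
from typing import List, Sequence, Tuple


def sync_plan(environments: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Return (working directory, command) pairs, one per environment."""
    return [(env, ["uv", "sync"]) for env in environments]


def sync_all(environments: Sequence[str]) -> None:
    """Run `uv sync` inside each environment directory, stopping on failure."""
    for cwd, cmd in sync_plan(environments):
        # Equivalent to: cd <environment-name> && uv sync
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling sync_all(["docling_environment", "marker_environment"]) would then set up those two environments one after the other.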

Docling with tesseract

If you want to run docling with tesseract, tesseract needs to be installed:

# Ubuntu:
sudo apt install tesseract-ocr-all
# macOS (via Homebrew):
brew install tesseract
brew install tesseract-lang

Additionally, the correct path for the tesseract data directory needs to be set in docling_environment/config.json. See https://tesseract-ocr.github.io/tessdoc/Installation.html for an explanation. If you do not wish to use tesseract, simply remove it from ocr_engines in docling_environment/config.json.
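
Removing tesseract from ocr_engines can also be scripted. The sketch below assumes that ocr_engines is a top-level list of engine names in the JSON file; the actual layout of docling_environment/config.json is not shown here, so treat the structure and the helper name remove_engine as assumptions.

```python
import json
from pathlib import Path
from typing import List


def remove_engine(config_path: str, engine: str = "tesseract") -> List[str]:
    """Drop `engine` from the `ocr_engines` list and write the file back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: `ocr_engines` is a top-level list of strings.
    engines = config.get("ocr_engines", [])
    config["ocr_engines"] = [e for e in engines if e.lower() != engine.lower()]
    # All other settings in the file are kept unchanged.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config["ocr_engines"]
```

A call such as remove_engine("docling_environment/config.json") returns the remaining engine names.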

Azure Document Intelligence

In order to use Azure Document Intelligence, you need to set up an account on Microsoft Azure and create a Document Intelligence resource. Then place the endpoint URL and API key in docint_environment/config.json.

Running experiments

Place some PDF files to be parsed in data/input.

Then run:

bash run_ocr_experiments.sh -a <accelerator> -e <environment>

where <accelerator> is one of cpu, mps, cuda and <environment> is one of docling, marker, mineru, docint.

By default, the script processes all PDF files in data/input.

If you want to run a single experiment instead, run the following:

bash run_ocr_experiments.sh -i <input-file> -a <accelerator> -e <environment>
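
To run several environments back to back, the invocations above can be driven from Python. This is a minimal sketch that only uses the -a, -e and -i flags and the accelerator and environment values listed above; build_command and run_all are hypothetical helper names, and the script is assumed to be called from the repository root.

```python
import subprocess
from typing import List, Optional

ACCELERATORS = ("cpu", "mps", "cuda")
ENVIRONMENTS = ("docling", "marker", "mineru", "docint")


def build_command(accelerator: str, environment: str,
                  input_file: Optional[str] = None) -> List[str]:
    """Build the argument list for run_ocr_experiments.sh."""
    if accelerator not in ACCELERATORS:
        raise ValueError(f"unknown accelerator: {accelerator}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    cmd = ["bash", "run_ocr_experiments.sh"]
    if input_file is not None:
        # Restrict the run to a single PDF instead of all of data/input.
        cmd += ["-i", input_file]
    cmd += ["-a", accelerator, "-e", environment]
    return cmd


def run_all(accelerator: str) -> None:
    """Run every listed environment on all PDFs in data/input."""
    for environment in ENVIRONMENTS:
        subprocess.run(build_command(accelerator, environment), check=True)
```

Validating the values before launching avoids starting a long run with a mistyped accelerator or environment name.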

The markdown output is stored in data/output/<ocr-method>/<file-name>.md.
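
Since every method writes to its own folder, comparing the results for one document means collecting the same file name from each folder. The following sketch does that under the path layout stated above; collect_outputs is a hypothetical helper, not something shipped with this repository.

```python
from pathlib import Path
from typing import Dict


def collect_outputs(file_name: str,
                    output_root: str = "data/output") -> Dict[str, str]:
    """Map each OCR method to its markdown output for `file_name`.

    Looks for <output_root>/<ocr-method>/<file_name>.md and skips
    methods that have not produced an output for this file.
    """
    results: Dict[str, str] = {}
    root = Path(output_root)
    if not root.is_dir():
        return results
    for method_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        markdown_file = method_dir / f"{file_name}.md"
        if markdown_file.is_file():
            results[method_dir.name] = markdown_file.read_text(encoding="utf-8")
    return results
```

The returned dictionary can then be printed side by side or passed to a diff tool for manual grading.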

In order to visualize CPU memory over time, run:

source <environment>/.venv/bin/activate
mprof plot data/output/<ocr-method>/<file-name>_mem_cpu.dat
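
If you only need the peak value rather than a plot, the .dat file can be read directly. The sketch below assumes the file follows the plain-text format written by mprof from the memory_profiler package, where each sample is a line of the form "MEM <MiB> <timestamp>"; peak_memory_mib is a hypothetical helper name.

```python
def peak_memory_mib(dat_path: str) -> float:
    """Return the largest memory sample (in MiB) found in an mprof .dat file."""
    peak = 0.0
    with open(dat_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            # Assumed sample format: MEM <memory in MiB> <unix timestamp>.
            # Other lines (for example the CMDLINE header) are ignored.
            if len(parts) >= 2 and parts[0] == "MEM":
                peak = max(peak, float(parts[1]))
    return peak
```

Dividing the result by 1024 gives a value in GiB that can be compared with the memory columns in the tables below.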

For almost every environment, there exists an additional config.json file with some preconfigured defaults. You can change it based on your needs.

Qualitative Evaluation Details

Information About Coca-Cola Volume Growth

OCR-Library	Extraction Quality	Speed [seconds]	CPU Memory Usage
Docling - Tesseract	Poor (misses text, confuses table entries, doesn't read checkboxes correctly)	CPU: 34 MPS: 34 GPU: 32	3 GB
Docling - EasyOCR	Medium (reads text well, confuses some table entries, doesn't read checkboxes correctly)	CPU: 129 MPS: 52 GPU: 37	11.4 GB
Docling - RapidOCR	Good (reads text well, gets table entries correct, doesn't read checkboxes correctly)	CPU: 161 MPS: 160 GPU: 12	6.4 GB
Docling - suryaocr	Very good (reads text well, gets table entries correct, gets most checkboxes correct)	CPU: 369 MPS: 337 GPU: 31	3.7 GB
Docling - Granite	Poor (misses a large share of text, gets table entries correct, misses checkboxes)	CPU: 1564 MPS: 313 GPU: 343	5.8 GB
marker	Good (reads text well, confuses some table entries, gets most checkboxes correct)	CPU: 229 MPS: 212 GPU: 29	11.8 GB
MinerU	Good (reads text well, gets table entries correct, doesn't read checkboxes correctly)	CPU: 160 MPS: 50 GPU: 42	4.3 GB
Document Intelligence	Very good (reads text well, gets table entries correct, gets some checkboxes correct)	8	70 MB (processing happens in cloud)
LightOnOCR-2-1B	Excellent (reads text well, gets table entries correct, gets all checkboxes correct)	CPU: 1828 MPS: 1993 GPU: 192	15.7 GB
Chandra OCR 2	Excellent (reads text well, gets table entries correct, gets all checkboxes correct)	CPU: X MPS: X GPU: 376	OOM on MacBook Air M4

NIST Handwriting Sample

OCR-Library	Extraction Quality	Speed [seconds]	CPU Memory Usage
Docling - Tesseract	Poor (mistakes most text for images)	CPU: 3 MPS: 4 GPU: 10	1.4 GB
Docling - EasyOCR	Poor (mistakes most text for images)	CPU: 12 MPS: 6 GPU: 7	11.2 GB
Docling - RapidOCR	Poor (mistakes most text for images)	CPU: 18 MPS: 16 GPU: 4	2.9 GB
Docling - suryaocr	Poor (mistakes most text for images)	CPU: 48 MPS: 46 GPU: 8	1.8 GB
Docling - Granite	Medium (recognizes around half the text correctly)	CPU: 164 MPS: 7 GPU: 108	1.5 GB
marker	Poor (mistakes half of the form for an image, reads out remaining text well)	CPU: 31 MPS: 39 GPU: 5	7.8 GB
MinerU	Poor (misses text and makes mistakes, does not align captions with content well)	CPU: 25 MPS: 16 GPU: 16	4.6 GB
Document Intelligence	Good (reads all handwriting and text well, does not align captions with contents well)	5	46 MB (processing happens in cloud)
LightOnOCR-2-1B	Excellent (reads all handwriting and text well, aligns all form contents perfectly)	CPU: 170 MPS: 168 GPU: 23	12.4 GB
Chandra OCR 2	Excellent (reads all handwriting and text well, aligns all form contents perfectly)	CPU: X MPS: X GPU: 120	OOM on MacBook Air M4

World Food Bank 2020 Annual Report

OCR-Library	Extraction Quality	Speed [seconds]	CPU Memory Usage
Docling - Tesseract	Poor (reads text well, mistakes table of contents for an image, gets double column layout mostly correct, mistakes tables for images)	CPU: 30 MPS: 29 GPU: 50	8 GB
Docling - EasyOCR	Medium (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables mostly correct)	CPU: 166 MPS: 54 GPU: 41	13 GB
Docling - RapidOCR	Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct)	CPU: 227 MPS: 200 GPU: 28	6.5 GB
Docling - suryaocr	Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct)	CPU: 370 MPS: 358 GPU: 49	7.8 GB
Docling - Granite	Poor (reads text well but often swallows it, gets table of contents correct, gets double column layout sometimes correct, misses tables)	CPU: 838 MPS: 127 GPU: 171	1.8 GB
marker	Good (reads text well, misses page numbers in table of contents, gets double column layout mostly correct, gets table entries correct)	CPU: 193 MPS: 168 GPU: 35	11.2 GB
MinerU	Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct)	CPU: 263 MPS: 130 GPU: 60	4.4 GB
Document Intelligence	Very good (reads text well, gets table of contents correct, gets double column layout mostly correct, gets table entries correct)	14	64 MB (processing happens in cloud)
LightOnOCR-2-1B	Excellent (reads text well, gets table of contents correct, gets double column layout correct, gets table entries correct)	CPU: 2440 MPS: 1881 GPU: 191	15 GB
Chandra OCR 2	Excellent (reads text well, gets table of contents correct, gets double column layout correct, gets table entries correct)	CPU: X MPS: X GPU: 476	OOM on MacBook Air M4

RKI: Epidemiologisches Bulletin

OCR-Library	Extraction Quality	Speed [seconds]	CPU Memory Usage
Docling - Tesseract	Good (reads text well, gets table of contents correct, does not structure table well)	CPU: 70 MPS: 68 GPU: 101	7.1 GB
Docling - EasyOCR	Very good (reads text well, gets table of contents correct, gets table entries correct)	CPU: 241 MPS: 108 GPU: 95	14 GB
Docling - RapidOCR	Very good (reads text well, gets table of contents correct, gets table entries correct)	CPU: 542 MPS: 511 GPU: 52	6.4 GB
Docling - suryaocr	Very good (reads text well, gets table of contents correct, gets table entries correct)	CPU: 712 MPS: 704 GPU: 270	8.2 GB
Docling - Granite	Good (reads text well, gets table of contents correct, misses some tables)	CPU: 153 MPS: 152 GPU: 597	1.7 GB
marker	Medium (reads text well, gets table of contents correct, mixes up table entries)	CPU: 479 MPS: 617 GPU: 143	12.1 GB
MinerU	Very good (reads text well, gets table of contents partially correct, gets table entries correct)	CPU: 544 MPS: 156 GPU: 88	4.8 GB
Document Intelligence	Excellent (reads text well, gets table of contents correct, gets table entries correct)	10	96 MB (processing happens in cloud)
LightOnOCR-2-1B	Excellent (reads text well, gets table of contents correct, gets table entries correct)	CPU: 4187 MPS: 3987 GPU: 416	14.9 GB
Chandra OCR 2	Excellent (reads text well, gets table of contents correct, gets table entries correct)	CPU: X MPS: X GPU: 863	OOM on MacBook Air M4
