joneswack/ocr-workbench

OCR Workbench

OCR Workbench is a ready-to-use framework for easily comparing popular OCR libraries in Python. It abstracts away the individual setup and usage details of each library, allowing you to focus on evaluating results on your data rather than spending time implementing each method yourself.

Simply provide a collection of PDF files and compare all libraries using a single script.

Supported Libraries

Currently, the following libraries are supported:

This selection focuses on benchmarking open source libraries against proprietary Azure Document Intelligence.

Features

  • Running experiments with the above libraries via a single script
  • Automatic conversion of all provided PDF files in data/input into markdown using all methods
  • Selectable compute backend: CPU, MPS or CUDA
  • Time and CPU memory tracking for each method

Qualitative Evaluation on 4 Sample Documents

Using our script, we produced OCR outputs for four example documents and compared them in terms of OCR quality as well as resource consumption.

The following publicly available PDFs were used:

Speed is measured on a MacBook Air M4 for CPU and MPS, and on an NVIDIA RTX 5090 GPU (32 GB VRAM) for CUDA. Memory usage is measured only once for CPU. OCR quality is subjectively graded and compared based on the markdown output stored in data/output/<ocr-method>/<file-name>.md.

Summary

The following table summarizes the comparison of all methods on all PDFs. Extraction quality (Excellent, Very good, Good, Medium, Poor) and GPU extraction speed in seconds are shown, as well as cost per page when running on an NVIDIA RTX 5090 GPU hosted on runpod.io for 89ct/hour. We did not carry out any specific runtime optimizations for any method.

| OCR-Library | Coca-Cola | NIST Handwriting | World Food Bank | RKI Bulletin (German) | Cost / page |
| --- | --- | --- | --- | --- | --- |
| LightOnOCR-2-1B | 🏆 Excellent (192s) | 🏆 Excellent (23s) | 🏆 Excellent (191s) | 🏆 Excellent (416s) | 0.5 ct |
| Chandra OCR 2 | 🏆 Excellent (376s) | 🏆 Excellent (120s) | 🏆 Excellent (476s) | 🏆 Excellent (863s) | 1.4 ct |
| Document Intelligence | 🟢 Very good (8s) | 🟢 Good (5s) | 🟢 Very good (14s) | 🏆 Excellent (10s) | 1 ct |
| Docling - suryaocr | 🟢 Very good (31s) | 🔴 Poor (8s) | 🟢 Good (49s) | 🟢 Very good (270s) | 0.17 ct |
| Docling - RapidOCR | 🟢 Good (12s) | 🔴 Poor (4s) | 🟢 Good (28s) | 🟢 Very good (52s) | 0.06 ct |
| MinerU | 🟢 Good (42s) | 🔴 Poor (16s) | 🟢 Good (60s) | 🟢 Very good (88s) | 0.17 ct |
| marker | 🟢 Good (29s) | 🔴 Poor (5s) | 🟢 Good (35s) | 🟡 Medium (143s) | 0.11 ct |
| Docling - Granite | 🔴 Poor (343s) | 🟡 Medium (108s) | 🔴 Poor (171s) | 🟢 Good (597s) | 1.16 ct |
| Docling - EasyOCR | 🟡 Medium (37s) | 🔴 Poor (7s) | 🟡 Medium (41s) | 🟢 Very good (95s) | 0.11 ct |
| Docling - Tesseract | 🔴 Poor (32s) | 🔴 Poor (10s) | 🔴 Poor (50s) | 🟢 Good (101s) | 0.13 ct |

The open-weights models LightOnOCR-2-1B and Chandra OCR 2 yield the best results. In the case of LightOnOCR-2-1B this is impressive, since it means that an open-weights model can be used to save costs without sacrificing extraction quality! The proprietary Azure Document Intelligence yields the second-best results and has the highest speed, at least when compared against an NVIDIA RTX 5090 GPU. RapidOCR was the fastest open source alternative and therefore the cheapest model in our experiments.

Especially when extraction speed does not matter as much (e.g., in offline computation settings), open source methods can be a much cheaper alternative. Moreover, more powerful GPUs (H100 or better) could be used to catch up with Document Intelligence speed. Frameworks like vLLM and specifically compiled modules like FlashAttention can then be used to further optimize inference speed.
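To make the cost column more concrete, the relationship between GPU time, hourly rate and page count can be sketched as follows. This is a minimal sketch and not necessarily how the figures in the table were computed: the helper name `cost_per_page_ct` and the page count of 10 are assumptions for illustration; only the 89 ct/hour rate and the 192 s timing are taken from the text above.

```python
def cost_per_page_ct(gpu_seconds: float, pages: int,
                     rate_ct_per_hour: float = 89.0) -> float:
    """Return the GPU rental cost per page in cents (hypothetical helper)."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    # Convert seconds to hours, multiply by the hourly rate, spread over pages.
    return gpu_seconds / 3600.0 * rate_ct_per_hour / pages


# 192 s of GPU time; the 10-page document length is an assumed value.
example = cost_per_page_ct(gpu_seconds=192, pages=10)
print(f"{example:.2f} ct per page")
```

Because the cost scales linearly with GPU seconds, any runtime optimization translates directly into a proportionally lower cost per page.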

Per-document details can be found in the Qualitative Evaluation Details section below.

Why do we still care about OCR?

Scanned documents are essentially collections of images stored in PDF format. While easy to view and share, their text content is not machine-readable by default. Extracting this text requires specialized machine learning techniques. Optical Character Recognition (OCR) converts text in images into structured, machine-readable data that can be reliably used for tasks such as document search, analysis, and automated processing.

In this repository, we compare open source OCR engines against proprietary ones. We also include approaches based on vision language models (VLMs).

Installation

Since the dependencies of different OCR libraries can have conflicts, we use a separate Python environment per OCR library. Each one uses uv as a dependency manager. Make sure to install uv before moving on.

Set up the respective environment using:

cd <environment-name>
uv sync

where <environment-name> is one of docling_environment, marker_environment, mineru_environment, azure_environment, lighton_environment.

Docling with tesseract

If you want to run docling with tesseract, tesseract needs to be installed:

# Ubuntu:
sudo apt install tesseract-ocr-all
# Via brew on macOS:
brew install tesseract
brew install tesseract-lang

Additionally, the correct path for the tesseract data directory needs to be set in docling_environment/config.json. See https://tesseract-ocr.github.io/tessdoc/Installation.html for an explanation. If you do not wish to use tesseract, simply remove it from ocr_engines in docling_environment/config.json.

Azure Document Intelligence

In order to use Azure Document Intelligence, you need to set up an account on Microsoft Azure and create a Document Intelligence resource. Then place the endpoint URL and API key in docint_environment/config.json.

Running experiments

Place some PDF files to be parsed in data/input.

Then run:

bash run_ocr_experiments.sh -a <accelerator> -e <environment>

where <accelerator> is one of cpu, mps, cuda and <environment> is one of docling, marker, mineru, docint.

By default, the script processes all PDF files in data/input.

If you want to run a single experiment instead, run the following:

bash run_ocr_experiments.sh -i <input-file> -a <accelerator> -e <environment>

The markdown output is stored in data/output/<ocr-method>/<file-name>.md.
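If you prefer to drive the invocations above from Python, for example to loop over several environments, a minimal sketch could look like this. The function names `build_command` and `run_all` are hypothetical and not part of the repository; the flags `-a`, `-e` and `-i` and the allowed values are the ones listed above.

```python
import subprocess

ACCELERATORS = ("cpu", "mps", "cuda")
ENVIRONMENTS = ("docling", "marker", "mineru", "docint")


def build_command(accelerator, environment, input_file=None):
    """Build the argument list for run_ocr_experiments.sh (hypothetical helper)."""
    if accelerator not in ACCELERATORS:
        raise ValueError(f"unknown accelerator: {accelerator}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    cmd = ["bash", "run_ocr_experiments.sh"]
    if input_file is not None:
        # Restrict the run to a single PDF instead of all files in data/input.
        cmd += ["-i", input_file]
    cmd += ["-a", accelerator, "-e", environment]
    return cmd


def run_all(accelerator):
    """Run every environment in turn; stops at the first failing run."""
    for environment in ENVIRONMENTS:
        subprocess.run(build_command(accelerator, environment), check=True)


# Print the commands instead of executing them.
for environment in ENVIRONMENTS:
    print(" ".join(build_command("cpu", environment)))
```

Calling `run_all("cuda")` from the repository root would then execute the four runs sequentially.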

In order to visualize CPU memory over time, run:

source <environment>/.venv/bin/activate
mprof plot data/output/<ocr-method>/<file-name>_mem_cpu.dat
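If you only need the peak value rather than a plot, the .dat file can also be read directly. The sketch below assumes the usual mprof text layout, where each sample is a line of the form `MEM <MiB> <timestamp>` and other lines (such as `CMDLINE ...`) are metadata; the function names and the sample data are made up for illustration.

```python
def parse_mprof(text):
    """Return a list of (timestamp, memory_in_MiB) samples from mprof output."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        # Assumed format: "MEM <memory in MiB> <unix timestamp>"
        if len(parts) == 3 and parts[0] == "MEM":
            samples.append((float(parts[2]), float(parts[1])))
    return samples


def peak_memory_mib(samples):
    """Return the highest recorded memory value in MiB."""
    if not samples:
        raise ValueError("no MEM samples found")
    return max(mem for _, mem in samples)


# Made-up sample content, not taken from a real run.
sample = """CMDLINE python run.py
MEM 100.5 1700000000.0
MEM 3072.0 1700000001.0
MEM 2048.25 1700000002.0
"""
samples = parse_mprof(sample)
print(f"peak: {peak_memory_mib(samples):.1f} MiB over {len(samples)} samples")
```

For a real run, pass the contents of data/output/<ocr-method>/<file-name>_mem_cpu.dat to `parse_mprof` instead of the sample string.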

For almost every environment, there exists an additional config.json file with some preconfigured defaults. You can change it based on your needs.

Qualitative Evaluation Details

Information About Coca-Cola Volume Growth

| OCR-Library | Extraction Quality | Speed [seconds] | CPU Memory Usage |
| --- | --- | --- | --- |
| Docling - Tesseract | Poor (misses text, confuses table entries, doesn't read checkboxes correctly) | CPU: 34, MPS: 34, GPU: 32 | 3 GB |
| Docling - EasyOCR | Medium (reads text well, confuses some table entries, doesn't read checkboxes correctly) | CPU: 129, MPS: 52, GPU: 37 | 11.4 GB |
| Docling - RapidOCR | Good (reads text well, gets table entries correct, doesn't read checkboxes correctly) | CPU: 161, MPS: 160, GPU: 12 | 6.4 GB |
| Docling - suryaocr | Very good (reads text well, gets table entries correct, gets most checkboxes correct) | CPU: 369, MPS: 337, GPU: 31 | 3.7 GB |
| Docling - Granite | Poor (misses great share of text, gets table entries correct, misses checkboxes) | CPU: 1564, MPS: 313, GPU: 343 | 5.8 GB |
| marker | Good (reads text well, confuses some table entries, gets most checkboxes correct) | CPU: 229, MPS: 212, GPU: 29 | 11.8 GB |
| MinerU | Good (reads text well, gets table entries correct, doesn't read checkboxes correctly) | CPU: 160, MPS: 50, GPU: 42 | 4.3 GB |
| Document Intelligence | Very good (reads text well, gets table entries correct, gets some checkboxes correct) | 8 | 70 MB (processing happens in cloud) |
| LightOnOCR-2-1B | Excellent (reads text well, gets table entries correct, gets all checkboxes correct) | CPU: 1828, MPS: 1993, GPU: 192 | 15.7 GB |
| Chandra OCR 2 | Excellent (reads text well, gets table entries correct, gets all checkboxes correct) | CPU: X, MPS: X, GPU: 376 | OOM on MacBook Air M4 |

NIST Handwriting Sample

| OCR-Library | Extraction Quality | Speed [seconds] | CPU Memory Usage |
| --- | --- | --- | --- |
| Docling - Tesseract | Poor (mistakes most text for images) | CPU: 3, MPS: 4, GPU: 10 | 1.4 GB |
| Docling - EasyOCR | Poor (mistakes most text for images) | CPU: 12, MPS: 6, GPU: 7 | 11.2 GB |
| Docling - RapidOCR | Poor (mistakes most text for images) | CPU: 18, MPS: 16, GPU: 4 | 2.9 GB |
| Docling - suryaocr | Poor (mistakes most text for images) | CPU: 48, MPS: 46, GPU: 8 | 1.8 GB |
| Docling - Granite | Medium (recognizes around half the text correctly) | CPU: 164, MPS: 7, GPU: 108 | 1.5 GB |
| marker | Poor (mistakes half of the form for image, reads out remaining text well) | CPU: 31, MPS: 39, GPU: 5 | 7.8 GB |
| MinerU | Poor (misses text and makes mistakes, does not align captions with content well) | CPU: 25, MPS: 16, GPU: 16 | 4.6 GB |
| Document Intelligence | Good (reads all handwriting and text well, does not align captions with contents well) | 5 | 46 MB (processing happens in cloud) |
| LightOnOCR-2-1B | Excellent (reads all handwriting and text well, aligns all form contents perfectly) | CPU: 170, MPS: 168, GPU: 23 | 12.4 GB |
| Chandra OCR 2 | Excellent (reads all handwriting and text well, aligns all form contents perfectly) | CPU: X, MPS: X, GPU: 120 | OOM on MacBook Air M4 |

World Food Bank 2020 Annual Report

| OCR-Library | Extraction Quality | Speed [seconds] | CPU Memory Usage |
| --- | --- | --- | --- |
| Docling - Tesseract | Poor (reads text well, mistakes table of content for image, gets double column layout mostly correct, mistakes tables for images) | CPU: 30, MPS: 29, GPU: 50 | 8 GB |
| Docling - EasyOCR | Medium (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables mostly correct) | CPU: 166, MPS: 54, GPU: 41 | 13 GB |
| Docling - RapidOCR | Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct) | CPU: 227, MPS: 200, GPU: 28 | 6.5 GB |
| Docling - suryaocr | Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct) | CPU: 370, MPS: 358, GPU: 49 | 7.8 GB |
| Docling - Granite | Poor (reads text well but often swallows it, gets table of contents correct, gets double layout sometimes correct, misses tables) | CPU: 838, MPS: 127, GPU: 171 | 1.8 GB |
| marker | Good (reads text well, misses page numbers in table of contents, gets double column layout mostly correct, gets table entries correct) | CPU: 193, MPS: 168, GPU: 35 | 11.2 GB |
| MinerU | Good (reads text well, gets table of contents mostly correct, gets double column layout mostly correct, gets tables correct) | CPU: 263, MPS: 130, GPU: 60 | 4.4 GB |
| Document Intelligence | Very good (reads text well, gets table of contents correct, gets double column layout mostly correct, gets table entries correct) | 14 | 64 MB (processing happens in cloud) |
| LightOnOCR-2-1B | Excellent (reads text well, gets table of contents correct, gets double column layout correct, gets table entries correct) | CPU: 2440, MPS: 1881, GPU: 191 | 15 GB |
| Chandra OCR 2 | Excellent (reads text well, gets table of contents correct, gets double column layout correct, gets table entries correct) | CPU: X, MPS: X, GPU: 476 | OOM on MacBook Air M4 |

RKI: Epidemiologisches Bulletin

| OCR-Library | Extraction Quality | Speed [seconds] | CPU Memory Usage |
| --- | --- | --- | --- |
| Docling - Tesseract | Good (reads text well, gets table of contents correct, does not structure table well) | CPU: 70, MPS: 68, GPU: 101 | 7.1 GB |
| Docling - EasyOCR | Very good (reads text well, gets table of contents correct, gets table entries correct) | CPU: 241, MPS: 108, GPU: 95 | 14 GB |
| Docling - RapidOCR | Very good (reads text well, gets table of contents correct, gets table entries correct) | CPU: 542, MPS: 511, GPU: 52 | 6.4 GB |
| Docling - suryaocr | Very good (reads text well, gets table of contents correct, gets table entries correct) | CPU: 712, MPS: 704, GPU: 270 | 8.2 GB |
| Docling - Granite | Good (reads text well, gets table of contents correct, misses some tables) | CPU: 153, MPS: 152, GPU: 597 | 1.7 GB |
| marker | Medium (reads text well, gets table of contents correct, mixes up table entries) | CPU: 479, MPS: 617, GPU: 143 | 12.1 GB |
| MinerU | Very good (reads text well, gets table of contents partially correct, gets table entries correct) | CPU: 544, MPS: 156, GPU: 88 | 4.8 GB |
| Document Intelligence | Excellent (reads text well, gets table of contents correct, gets table entries correct) | 10 | 96 MB (processing happens in cloud) |
| LightOnOCR-2-1B | Excellent (reads text well, gets table of contents correct, gets table entries correct) | CPU: 4187, MPS: 3987, GPU: 416 | 14.9 GB |
| Chandra OCR 2 | Excellent (reads text well, gets table of contents correct, gets table entries correct) | CPU: X, MPS: X, GPU: 863 | OOM |
