This benchmark evaluates how effectively different PDF parsing solutions extract mathematical formulas from documents. We generate synthetic PDFs with diverse formatting scenarios, parse them with different parsers, and assess the quality of the parsed output through a two-stage evaluation pipeline: identifying formulas in the parsed text, then scoring them based on semantic similarity to the ground truth.
Overview of the benchmarking framework: (1) Formula dataset extraction from Wikipedia, (2) Synthetic PDF generation with ground truth, (3) Two-stage evaluation pipeline with LLM-based matching and scoring.
| Rank | Parser | Score | Params/Cost | Inference |
|---|---|---|---|---|
| 1 | Qwen3-VL-235B-A22B-Instruct | 9.76 | 22B | GPU/API |
| 2 | Gemini 3 Pro | 9.75 | $2.00/12.00 M tok | API |
| 3 | PaddleOCR-VL | 9.65 | 0.9B | CPU/GPU |
| 4 | Mathpix | 9.64 | $0.005/page | API |
| 5 | dots.ocr | 9.43 | 1.7B | GPU |
| 6 | PP-StructureV3 | 9.34 | <0.3B | CPU/GPU |
| 7 | Nanonets-OCR-s | 9.31 | 4B | GPU |
| 8 | Gemini 2.5 Pro | 9.28 | $1.25/10.00 M tok | API |
| 9 | MonkeyOCR-pro-3B | 9.25 | 3B | GPU |
| 10 | MinerU2.5 | 9.17 | 1.2B | CPU/GPU/API |
| 11 | olmOCR-2-7B-1025-FP8 | 8.94 | 7B | GPU |
| 12 | Gemini 2.5 Flash | 8.78 | $0.15/0.60 M tok | API |
| 13 | Mistral OCR | 8.66 | $0.001/page | API |
| 14 | DeepSeek-OCR | 8.55 | 0.6B | GPU |
| 15 | LlamaParse | 8.14 | $0.09/page | API |
| 16 | GPT-5 nano | 7.79 | $0.05/0.40 M tok | API |
| 17 | PyPDF | 7.69 | —/Free | CPU |
| 18 | GOT-OCR2.0 | 7.38 | 0.58B | CPU/GPU |
| 19 | PyMuPDF4LLM | 6.67 | —/Free | CPU |
| 20 | GPT-5 mini | 6.61 | $0.25/2.00 M tok | API |
| 21 | GROBID | 5.70 | —/Free | CPU |
Legend:
- Score: Average LLM-as-a-Judge score (0-10 scale) across all formulas (1411 inline + 641 display formulas from 100 PDFs)
- Params/Cost: Active parameters for open-source models or API pricing for commercial services (as of December 2025)
- Inference: Deployment options (CPU, GPU, API)
📊 Detailed scores (Inline/Display/CDM)
| Rank | Parser | Inline | Display | CDM |
|---|---|---|---|---|
| 1 | Qwen3-VL-235B-A22B-Instruct | 9.75 | 9.65 | 0.99 |
| 2 | Gemini 3 Pro | 9.72 | 9.71 | 0.99 |
| 3 | PaddleOCR-VL | 9.64 | 9.60 | 0.98 |
| 4 | Mathpix | 9.60 | 9.62 | 0.97 |
| 5 | dots.ocr | 9.31 | 9.55 | 0.94 |
| 6 | PP-StructureV3 | 9.26 | 9.47 | 0.98 |
| 7 | Nanonets-OCR-s | 9.30 | 9.26 | 0.96 |
| 8 | Gemini 2.5 Pro | 9.16 | 9.39 | 0.97 |
| 9 | MonkeyOCR-pro-3B | 9.26 | 9.20 | 0.98 |
| 10 | MinerU2.5 | 9.16 | 9.17 | 0.96 |
| 11 | olmOCR-2-7B-1025-FP8 | 8.88 | 8.94 | 0.92 |
| 12 | Gemini 2.5 Flash | 8.63 | 8.98 | 0.96 |
| 13 | Mistral OCR | 8.46 | 9.01 | 0.95 |
| 14 | DeepSeek-OCR | 8.65 | 8.30 | 0.95 |
| 15 | LlamaParse | 8.06 | 8.20 | 0.85 |
| 16 | GPT-5 nano | 7.77 | 7.67 | 0.75 |
| 17 | PyPDF | 7.75 | 7.52 | 0.75 |
| 18 | GOT-OCR2.0 | 7.07 | 7.95 | 0.91 |
| 19 | PyMuPDF4LLM | 6.70 | 6.41 | 0.58 |
| 20 | GPT-5 mini | 6.55 | 6.56 | 0.75 |
| 21 | GROBID | 5.97 | 5.07 | 0.71 |
- Inline: LLM-as-a-Judge score for inline formulas (1411 formulas)
- Display: LLM-as-a-Judge score for display-mode formulas (641 formulas)
- CDM: Character Detection Metric (0-1 scale) - character-level accuracy via visual rendering comparison
PDFs are generated synthetically using LaTeX with randomized parameters:
-
PDF Generation: Each document contains randomly selected formulas embedded in text passages, displayed as inline or display-mode equations. Parameters include formats, styling, languages, and content structure. Layout and structure vary to test parser robustness across different scenarios.
-
Formula Dataset: Mathematical formulas are randomly sampled from our dataset of 319,000 formulas extracted from Wikipedia, ensuring diversity in formula complexity and real-world relevance. Dataset: piushorn/wikipedia-latex-formulas-319k
-
Ground Truth: Since PDFs are generated from LaTeX source, we automatically obtain exact ground truth for each formula as a byproduct of the generation process.
-
Reproducibility Artifacts: All parsing outputs and evaluation artifacts (matching results, LLM ratings) for all 20+ evaluated parsers are available on Zenodo.
Parser outputs are assessed using a two-step pipeline:
Given a parser's output (the extracted text from a PDF), an LLM establishes initial formula-to-ground-truth correspondences, then fuzzy search reliably extracts exact formula strings from the parsed output. This achieves robust alignment even when parser outputs differ significantly from ground truth.
The primary metric is the LLM-as-a-Judge score (0-10 scale, default: Gemini 3 Flash via OpenRouter). For each formula pair, the LLM judge evaluates three key criteria: (1) Correctness - whether mathematical symbols, variables, and operations are accurately preserved, (2) Completeness - whether all parts are present without omissions, and (3) Semantic equivalence - whether the extracted formula conveys the same mathematical meaning as the ground truth. Our research demonstrates that using LLMs as judges provides a robust and meaningful metric for comparing ground truth LaTeX formulas against parsed output, focusing on semantic equivalence and mathematical correctness rather than relying solely on text similarity metrics or visual rendering comparison. Scores are computed separately for inline and display formulas.
Benchmark Datasets: New benchmark datasets are released quarterly (e.g., 2025-q4), each containing 100 PDFs with diverse mathematical content.
There are two ways to use this benchmark, depending on your needs:
Use this if: You quickly want to evaluate your PDF Parsing tool against the benchmark.
Advantage: Simple pip install, no need to integrate with the repository structure.
pip install pdf-parse-benchNote: Set OPENROUTER_API_KEY environment variable for evaluation (used for LLM-as-a-Judge scoring via OpenRouter).
Get the benchmark PDFs and parse them with your parser:
from pdf_parse_bench import get_benchmark_pdfs_dir
from pathlib import Path
# Get benchmark PDFs (included in the package)
pdfs_dir = get_benchmark_pdfs_dir()
# Parse each PDF with your parser
output_dir = Path("results/my_parser")
for pdf_path in pdfs_dir.glob("*.pdf"):
# Parse PDF with your parser
parsed_text = your_parser.parse(pdf_path)
# Save to expected format: {output_dir}/{pdf_name}/parsed.md
(output_dir / pdf_path.stem / "parsed.md").parent.mkdir(parents=True, exist_ok=True)
(output_dir / pdf_path.stem / "parsed.md").write_text(parsed_text)Required output structure:
results/my_parser/
├── 000/
│ └── parsed.md
├── 001/
│ └── parsed.md
├── 002/
│ └── parsed.md
...
Run the benchmark evaluation on your parsed results:
from pathlib import Path
from pdf_parse_bench import Benchmark, PDFParser, get_benchmark_ground_truth_dir
class MyParser(PDFParser):
@classmethod
def display_name(cls) -> str:
return "My Parser"
def parse(self, pdf_path: Path, output_path: Path) -> str:
raise NotImplementedError # Already parsed in Step 1
bench = Benchmark(
parser_output_dir=Path("results/my_parser"),
ground_truth_dir=get_benchmark_ground_truth_dir(),
llm_judge_models=["google/gemini-3-flash-preview"],
parser=MyParser(),
)
bench.extract()
bench.evaluate()
bench.save_benchmark_summary()Use this if: You want to contribute your parser to the benchmark, reproduce published results, or ensure full reproducibility of your evaluation setup.
Advantage: Full automation with CLI, parser configuration is versioned and reproducible, easy to share exact setup with others.
git clone https://github.com/phorn1/pdf-parse-bench.git
cd pdf-parse-bench
# Install with uv
uv sync
# Configure environment (copy and edit .env.example)
cp .env.example .envCreate a new parser module in the parsers/ directory:
# parsers/my_parser/__main__.py
from pathlib import Path
from pdf_parse_bench.utilities import PDFParser
from pdf_parse_bench.pipeline import run_cli
class MyParser(PDFParser):
@classmethod
def display_name(cls) -> str:
return "My Parser"
def parse(self, pdf_path: Path, output_path: Path) -> str:
# Your parsing logic here
markdown = "# Parsed content"
self._write_output(markdown, output_path)
return markdown
if __name__ == "__main__":
run_cli(MyParser())uv run -m parsers.my_parserThe benchmark infrastructure handles everything automatically:
- Loading test PDFs from
data/2025-q4/pdfs/ - Parsing each PDF with your parser
- Extracting formulas from parsed output
- Running evaluation against ground truth
- Saving results in standardized format
The benchmark CLI provides several options to customize execution:
# Run only specific steps
uv run -m parsers.my_parser --step parse
uv run -m parsers.my_parser --step extract --step evaluate
# Reprocess existing results
uv run -m parsers.my_parser --reprocess all
uv run -m parsers.my_parser --reprocess parse --reprocess extract
# Use a different LLM judge for evaluation (OpenRouter model format)
uv run -m parsers.my_parser --llm-judge-model openai/gpt-5-mini
# Enable Character Detection Metrics (CDM)
# Note: Requires CDM service installation (https://github.com/opendatalab/UniMERNet/tree/main/cdm)
# and CDM_SERVICE_URL environment variable
uv run -m parsers.my_parser --enable-cdm
# Custom input/output directories
uv run -m parsers.my_parser -i data/2025-q4 -o results/custompdf-parse-bench/
├── src/pdf_parse_bench/ # Core benchmark infrastructure
│ ├── pipeline/ # Benchmark execution pipeline
│ ├── eval/ # Evaluation metrics and judges
│ ├── extraction/ # Formula extraction from parsed text
│ ├── utilities/ # Base classes and helpers
│ └── synth_pdf/ # Synthetic PDF generation (optional)
├── parsers/ # Parser implementations
│ ├── pymupdf4llm/
│ ├── llamaparse/
│ ├── mathpix/
│ └── ... # Add your own!
├── data/ # Benchmark datasets
│ └── 2025-q4/ # Current benchmark version
│ ├── pdfs/ # Test PDFs
│ └── ground_truth/ # LaTeX formulas
Contributions are welcome!
Adding a parser implementation: See Option 2 above for instructions on adding your parser to the repository.
Bug reports and feature requests: Please open an issue on GitHub.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this benchmark in your research or project, please cite our paper:
@misc{horn2025benchmarking,
title = {Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs},
author = {Horn, Pius and Keuper, Janis},
year = {2025},
eprint={2511.10390},
archivePrefix={arXiv},
primaryClass={cs.CV},
url = {https://arxiv.org/abs/2512.09874}
}📄 Paper: arXiv:2512.09874
This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project LLMpraxis under grant 13FH622KX2.

