A Comprehensive Bilingual Benchmark for General Chart Parsing across Families, Scenarios, and Formats
中文版 • Paper • GitHub Repo • HuggingFace Dataset • ModelScope Dataset
- [2026.06.01] 📖 Code and data are released!
ChartArena is a comprehensive bilingual benchmark for evaluating the chart parsing capabilities of vision-language models, spanning the full difficulty spectrum of charts encountered in practice. It covers eight chart families: both numeric charts (bar, line, pie, radar, box plot, combination) and diagrammatic structures (flowchart, mind map), each presented across three visual scenarios (digital renderings, printed photos, and hand-drawn photos) and two languages (Chinese and English).
To enable fair comparison across models that produce mutually incompatible output formats, ChartArena adopts a format-agnostic evaluation protocol: heterogeneous predictions are normalized into two canonical semantic spaces: a triple view for numeric charts and a directed graph view for diagrammatic charts, and scored with structure-aware metrics.
| Item | Details |
|---|---|
| Chart Families | 8 (bar, line, pie, radar, box plot, combination, flowchart, mind map) |
| Chart Categories | Numeric charts, mind maps, flowcharts |
| Visual Scenarios | 3 (digital rendering, printed photo, hand-drawn photo) |
| Languages | Bilingual (Chinese and English) |
We evaluate 26 models across three categories: general-purpose MLLMs, document parsing MLLMs, and expert chart understanding models. Results are reported as mAP$_{high}$ per chart family, with separate EN (English) and ZH (Chinese) scores each averaged over three visual scenarios. Within each category, bold marks the best result per column.
Full leaderboard (click to expand)
| Model | Date | Bar (EN) | Bar (ZH) | Line (EN) | Line (ZH) | Pie (EN) | Pie (ZH) | Radar (EN) | Radar (ZH) | Box (EN) | Box (ZH) | Combo (EN) | Combo (ZH) | Flow (EN) | Flow (ZH) | Mind (EN) | Mind (ZH) | Avg (EN) | Avg (ZH) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 2024.05 | 21.6 | 36.3 | 27.5 | 52.9 | 76.7 | 74.2 | 9.7 | 24.9 | 19.1 | 9.6 | 9.9 | 40.7 | 49.8 | 27.1 | 64.0 | 24.8 | 34.8 | 36.3 |
| GPT-5 | 2025.08 | 35.1 | 52.3 | 48.1 | 65.1 | 81.1 | 78.9 | 32.0 | 41.5 | 19.8 | 12.8 | 14.2 | 46.5 | 58.1 | 35.3 | 76.6 | 33.5 | 45.6 | 45.8 |
| Qwen2.5-VL-7B-Instruct | 2025.02 | 15.2 | 36.9 | 17.9 | 39.9 | 63.4 | 73.1 | 8.3 | 19.1 | 0.9 | 2.8 | 6.0 | 40.6 | 29.7 | 23.2 | 45.4 | 29.9 | 23.3 | 33.2 |
| Qwen2.5-VL-72B-Instruct | 2025.02 | 27.1 | 53.3 | 38.2 | 66.7 | 73.5 | 77.0 | 10.9 | 38.5 | 15.0 | 15.3 | 14.3 | 50.5 | 50.1 | 43.6 | 63.8 | 55.0 | 36.6 | 50.0 |
| InternVL3.5-8B | 2025.08 | 20.9 | 49.4 | 34.1 | 49.9 | 63.9 | 72.6 | 12.6 | 35.7 | 4.3 | 10.7 | 7.6 | 41.2 | 31.5 | 24.3 | 47.0 | 32.2 | 27.7 | 39.5 |
| InternVL3.5-241B-A28B | 2025.08 | 27.5 | 57.2 | 41.3 | 55.7 | 77.7 | 83.3 | 15.2 | 41.4 | 18.7 | 21.6 | 17.7 | 47.8 | 43.8 | 36.6 | 62.6 | 45.5 | 38.0 | 48.6 |
| Qwen3VL-8B-Instruct | 2025.10 | 33.9 | 63.4 | 43.1 | 67.9 | 78.6 | 88.3 | 16.8 | 52.1 | 35.7 | 30.4 | 14.2 | 51.9 | 50.0 | 41.5 | 75.2 | 62.6 | 43.4 | 57.3 |
| Qwen3VL-235B-A22B-Ins. | 2025.10 | 44.5 | 71.9 | 57.1 | 77.1 | 85.8 | 87.9 | 24.6 | 52.4 | 54.8 | 55.1 | 29.1 | 60.8 | 57.9 | 49.8 | 79.4 | 73.7 | 54.2 | 66.1 |
| Qwen3.5-35B-A3B | 2026.02 | 48.0 | 68.1 | 60.4 | 77.6 | 89.7 | 88.7 | 25.2 | 57.9 | 50.1 | 50.6 | 35.2 | 62.1 | 62.5 | 56.5 | 77.1 | 75.6 | 56.0 | 67.1 |
| GLM-4.5V | 2025.07 | 33.5 | 61.4 | 51.7 | 70.5 | 81.2 | 83.1 | 19.7 | 43.1 | 32.4 | 37.4 | 21.2 | 52.5 | 44.7 | 39.6 | 66.2 | 43.7 | 43.8 | 53.9 |
| Seed-1.8 (non-thinking) | 2025.12 | 29.1 | 59.7 | 46.0 | 72.5 | 84.7 | 88.0 | 22.0 | 45.9 | 16.1 | 17.5 | 15.0 | 59.7 | 47.8 | 50.3 | 76.5 | 69.1 | 42.2 | 57.8 |
| Seed-2.0 Pro (non-thinking) | 2026.02 | 40.3 | 73.3 | 56.5 | 80.7 | 91.5 | 90.5 | 21.3 | 54.7 | 44.5 | 55.2 | 32.4 | 62.2 | 62.6 | 61.3 | 83.1 | 85.8 | 54.0 | 70.5 |
| Kimi K2.5 (non-thinking) | 2026.02 | 45.2 | 70.3 | 60.9 | 79.8 | 87.2 | 86.7 | 30.2 | 59.7 | 40.6 | 47.6 | 33.6 | 63.6 | 59.9 | 57.9 | 80.8 | 79.4 | 54.8 | 68.1 |
| MiMo-V2-Omni | 2026.03 | 31.1 | 56.9 | 41.5 | 66.4 | 87.0 | 85.8 | 19.7 | 46.1 | 19.1 | 30.3 | 19.4 | 54.7 | 57.1 | 51.0 | 76.6 | 64.6 | 43.9 | 57.0 |
| Gemini 2.5 Pro | 2025.03 | 46.0 | 76.5 | 56.5 | 77.6 | 88.6 | 87.3 | 17.5 | 53.0 | 10.2 | 22.1 | 28.7 | 57.6 | 62.1 | 57.8 | 71.7 | 67.1 | 47.7 | 62.4 |
| Gemini 3.1 Pro | 2026.02 | 57.9 | 78.7 | 67.0 | 85.3 | 92.5 | 95.1 | 31.8 | 62.7 | 32.5 | 45.2 | 39.7 | 70.3 | 65.6 | 63.1 | 86.8 | 85.2 | 59.2 | 73.2 |
| Model | Date | Bar (EN) | Bar (ZH) | Line (EN) | Line (ZH) | Pie (EN) | Pie (ZH) | Radar (EN) | Radar (ZH) | Box (EN) | Box (ZH) | Combo (EN) | Combo (ZH) | Flow (EN) | Flow (ZH) | Mind (EN) | Mind (ZH) | Avg (EN) | Avg (ZH) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dots.mocr (3B) | 2025.07 | 28.3 | 40.9 | 41.8 | 60.1 | 68.8 | 78.3 | 20.3 | 43.1 | 24.1 | 16.0 | 26.9 | 47.1 | 26.2 | 20.6 | 28.7 | 19.6 | 33.1 | 40.7 |
| PaddleOCR-VL (1B) | 2025.10 | 31.8 | 49.3 | 43.0 | 51.6 | 57.5 | 75.2 | 14.4 | 29.0 | 11.7 | 20.7 | 21.3 | 54.0 | -- | -- | -- | -- | 23.9 | 35.8 |
| HunyuanOCR (1B) | 2025.11 | 33.0 | 60.0 | 49.5 | 68.2 | 71.0 | 74.8 | 19.0 | 41.1 | 43.9 | 45.2 | 20.1 | 50.8 | 39.9 | 35.9 | 55.0 | 46.6 | 41.4 | 52.8 |
| Model | Date | Bar (EN) | Bar (ZH) | Line (EN) | Line (ZH) | Pie (EN) | Pie (ZH) | Radar (EN) | Radar (ZH) | Box (EN) | Box (ZH) | Combo (EN) | Combo (ZH) | Flow (EN) | Flow (ZH) | Mind (EN) | Mind (ZH) | Avg (EN) | Avg (ZH) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChartAst (13B) | 2024.01 | 5.2 | -- | 4.2 | -- | 0.3 | -- | 1.5 | -- | 0.3 | -- | 0.0 | -- | -- | -- | -- | -- | 1.4 | -- |
| ChartVLM (8.3B) | 2024.02 | 11.2 | 5.3 | 11.5 | 4.3 | 12.9 | 8.2 | 2.1 | 5.0 | 0.7 | 0.4 | 4.1 | 4.4 | -- | -- | -- | -- | 5.3 | 3.5 |
| TinyChart (3B) | 2024.04 | 6.1 | 6.3 | 9.7 | 3.2 | 5.7 | 5.4 | 0.5 | 3.4 | 0.2 | 1.3 | 0.7 | 4.2 | -- | -- | -- | -- | 2.9 | 3.0 |
| ChartMoE (8B) | 2024.09 | 18.7 | 24.4 | 14.7 | 22.3 | 15.0 | 48.5 | 3.7 | 16.1 | 2.7 | 1.6 | 5.1 | 19.5 | 4.0 | -- | 4.1 | -- | 8.5 | 16.7 |
| ChartCoder (7B) | 2025.01 | 23.2 | 12.6 | 22.0 | 19.6 | 34.3 | 16.7 | 5.5 | 13.9 | 5.4 | 11.4 | 3.7 | 5.1 | 5.6 | -- | 1.0 | -- | 12.6 | 9.9 |
| RRVF (7B) | 2025.07 | 35.8 | 66.5 | 41.5 | 54.3 | 51.6 | 75.3 | 16.6 | 40.3 | 14.7 | 14.1 | 23.5 | 61.2 | 36.4 | 32.4 | 68.4 | 63.8 | 36.0 | 51.0 |
| MSRL (7B) | 2025.08 | 32.7 | 45.2 | 35.2 | 34.3 | 41.2 | 67.9 | 25.9 | 48.0 | 11.2 | 13.0 | 16.7 | 35.2 | 23.2 | 12.4 | 31.0 | 18.8 | 27.1 | 34.3 |
ChartArena groups charts into three categories, each with a default extraction task:
| Chart Category | Examples | Default Task |
|---|---|---|
| Numerical charts | Bar / Line / Pie / Radar / Box / Combo … | SE_MD |
| Mind maps (logic diagrams) | Tree / hierarchy diagrams | SE_MD |
| Flowcharts | Process / workflow diagrams | SE_MERMAID |
The eleven extraction tasks (click to expand)
| Task | Output Format | Description |
|---|---|---|
| SE_MD | Markdown table / list | Numerical charts → Markdown table; mind maps → Markdown nested list |
| SE_JSON | JSON | Structured JSON with title and values |
| SE_CSV | CSV | Comma-separated values |
| SE_CODE | Python (matplotlib) | Reproduce the chart as executable Python code |
| SE_SVG | SVG | Reproduce the chart as SVG markup |
| SE_MERMAID | Mermaid | Flowchart as Mermaid diagram syntax |
| SE_GRAPHVIZ | Graphviz DOT | Flowchart as DOT language |
| SE_PLANTUML | PlantUML | Flowchart as PlantUML syntax |
| SE_DIAGRAMS | diagrams.net XML | Flowchart as draw.io XML |
| SE_D2 | D2 | Flowchart as D2 diagram language |
| SE_CYTOSCAPE | Cytoscape JSON | Flowchart as Cytoscape.js JSON |
Scoring metrics: mAP (map_strict / map_slight / map_high) and EM (exact match).
git clone <this-repo>
cd ChartArena
pip install -r requirements.txt
# Optional: only if you plan to use --api_type local_vllm
pip install vllm
The dataset (jsonl + images) is released as a single archive. Place the files under data/:
data/
├── ChartArena.jsonl
└── images/
├── bar/...
├── line/...
├── pie/...
└── ...
Each line of the jsonl looks like:
{
"img_path": "images/xxx.png",
"chart_type": "柱状图",
"img_type": "电子印刷",
"lang_type": "中文",
"anno": "..."
}
img_path is a relative path from the data/ directory and is used as the unique key throughout the pipeline.
Two backends are supported via --api_type: openai_compat for any OpenAI-compatible HTTP service (local or cloud), and local_vllm for in-process model loading. Inference supports resume — re-running the same command skips already-completed samples.
Backend details and task selection (click to expand)
Works with locally-served models (vllm serve, sglang, lmdeploy) or public APIs that speak the OpenAI Chat Completions protocol (OpenAI, Gemini, Claude, Together, …).
python infer.py \
--api_type openai_compat \
--model_name Qwen2.5-VL-72B-Instruct \
--base_url http://127.0.0.1:8000/v1 \
--api_key EMPTY \
--max_workers 64
No need to start a server first. The script loads the checkpoint directly with vllm.LLM.
python infer.py \
--api_type local_vllm \
--model_path /path/to/Qwen2.5-VL-72B-Instruct \
--tensor_parallel_size 4 \
--max_model_len 32768
By default, each chart category runs one task. You can override with --task_data, --task_logic, --task_flowchart (each accepts one or more task names):
# Run SE_MD and SE_JSON for numerical charts, SE_MERMAID for flowcharts
python infer.py --api_type openai_compat --model_name ... --base_url ... \
--task_data SE_MD SE_JSON \
--task_flowchart SE_MERMAID
Each run writes one jsonl file:
infer_outputs/<model_tag>/results.jsonl
<model_tag> defaults to --model_name / basename of --model_path. You can override it with --output_tag.
# Score all models under infer_outputs/
python judge.py
# Score specific models only
python judge.py --models Qwen2.5-VL-72B-Instruct gemini-2.5-pro
# Force re-score a specific task (e.g. after a scoring algorithm update)
python judge.py --force_rejudge SE_MERMAID
Outputs to judge_outputs/<model_tag>/results.jsonl. The judge step is purely rule-based and fast.
python analyze.py
# → judge_outputs/results_analysis.xlsx
Scores are shown to 3 decimal places (e.g. 0.873).
Workbook contents (click to expand)
- Task overview (Sheet 1) — per-model average score for each task
- Per-task sheets — model × source file breakdown for each task
- By chart type (
by_chart_type/) — one Excel per task, one sheet per chart type - Detailed breakdown (
detail_by_category/) — per-model per-task breakdown by(chart_type, img_type, lang_type)
# 1. Run inference
python infer.py \
--api_type openai_compat \
--model_name Qwen2.5-VL-72B-Instruct \
--base_url http://127.0.0.1:8000/v1
# 2. Score
python judge.py
# 3. Generate Excel report
python analyze.py
The repository is organized around a three-stage pipeline (inference → judging → analysis), with pluggable API backends, per-format scoring modules, and shared metric utilities.
Full directory tree (click to expand)
ChartArena/
├── README.md / README_zh.md
├── requirements.txt
├── data/ # ← download benchmark data here
├── infer_outputs/ # inference results (auto-created)
├── judge_outputs/ # scoring results (auto-created)
├── apis/
│ ├── base.py # APIBase abstract class
│ ├── openai_compat.py # OpenAI-compatible client
│ └── local_vllm.py # in-process vLLM
├── methods/
│ ├── prompts.py # prompt templates
│ ├── context.py # context building utilities
│ ├── normalize.py # output normalization
│ ├── scoring.py # scoring entry points
│ └── parsers/ # per-format output parsers
├── metrics/
│ ├── SCRM.py # core MAP / EM metric
│ ├── tree_eval.py # Markdown list evaluation
│ ├── mermaid_eval.py # Mermaid diagram evaluation
│ ├── flowchart_common.py # flowchart multi-format evaluation
│ └── dsl_parsers/ # DSL-specific parser utilities
├── utils/
│ ├── io.py # ResultWriter (thread-safe incremental writer)
│ ├── signal_utils.py # graceful Ctrl+C shutdown
│ └── image_utils.py # base64 encoding for OpenAI-compat
├── infer.py # entry: inference
├── judge.py # entry: rule-based scoring
└── analyze.py # entry: Excel analysis report
% (coming soon)
This benchmark is released for research purposes only.
