Skip to content

pspdada/ChartArena

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChartArena

A Comprehensive Bilingual Benchmark for General Chart Parsing across Families, Scenarios, and Formats

中文版PaperGitHub RepoHuggingFace DatasetModelScope Dataset

News

  • [2026.06.01] 📖 Code and data are released!

Overview

ChartArena is a comprehensive bilingual benchmark for evaluating the chart parsing capabilities of vision-language models, spanning the full difficulty spectrum of charts encountered in practice. It covers eight chart families: both numeric charts (bar, line, pie, radar, box plot, combination) and diagrammatic structures (flowchart, mind map), each presented across three visual scenarios (digital renderings, printed photos, and hand-drawn photos) and two languages (Chinese and English).

To enable fair comparison across models that produce mutually incompatible output formats, ChartArena adopts a format-agnostic evaluation protocol: heterogeneous predictions are normalized into two canonical semantic spaces: a triple view for numeric charts and a directed graph view for diagrammatic charts, and scored with structure-aware metrics.

Contents

Benchmark Statistics

Item Details
Chart Families 8 (bar, line, pie, radar, box plot, combination, flowchart, mind map)
Chart Categories Numeric charts, mind maps, flowcharts
Visual Scenarios 3 (digital rendering, printed photo, hand-drawn photo)
Languages Bilingual (Chinese and English)

Leaderboard

We evaluate 26 models across three categories: general-purpose MLLMs, document parsing MLLMs, and expert chart understanding models. Results are reported as mAP$_{high}$ per chart family, with separate EN (English) and ZH (Chinese) scores each averaged over three visual scenarios. Within each category, bold marks the best result per column.

Full leaderboard (click to expand)

General-Purpose MLLMs

Model Date Bar (EN) Bar (ZH) Line (EN) Line (ZH) Pie (EN) Pie (ZH) Radar (EN) Radar (ZH) Box (EN) Box (ZH) Combo (EN) Combo (ZH) Flow (EN) Flow (ZH) Mind (EN) Mind (ZH) Avg (EN) Avg (ZH)
GPT-4o 2024.05 21.6 36.3 27.5 52.9 76.7 74.2 9.7 24.9 19.1 9.6 9.9 40.7 49.8 27.1 64.0 24.8 34.8 36.3
GPT-5 2025.08 35.1 52.3 48.1 65.1 81.1 78.9 32.0 41.5 19.8 12.8 14.2 46.5 58.1 35.3 76.6 33.5 45.6 45.8
Qwen2.5-VL-7B-Instruct 2025.02 15.2 36.9 17.9 39.9 63.4 73.1 8.3 19.1 0.9 2.8 6.0 40.6 29.7 23.2 45.4 29.9 23.3 33.2
Qwen2.5-VL-72B-Instruct 2025.02 27.1 53.3 38.2 66.7 73.5 77.0 10.9 38.5 15.0 15.3 14.3 50.5 50.1 43.6 63.8 55.0 36.6 50.0
InternVL3.5-8B 2025.08 20.9 49.4 34.1 49.9 63.9 72.6 12.6 35.7 4.3 10.7 7.6 41.2 31.5 24.3 47.0 32.2 27.7 39.5
InternVL3.5-241B-A28B 2025.08 27.5 57.2 41.3 55.7 77.7 83.3 15.2 41.4 18.7 21.6 17.7 47.8 43.8 36.6 62.6 45.5 38.0 48.6
Qwen3VL-8B-Instruct 2025.10 33.9 63.4 43.1 67.9 78.6 88.3 16.8 52.1 35.7 30.4 14.2 51.9 50.0 41.5 75.2 62.6 43.4 57.3
Qwen3VL-235B-A22B-Ins. 2025.10 44.5 71.9 57.1 77.1 85.8 87.9 24.6 52.4 54.8 55.1 29.1 60.8 57.9 49.8 79.4 73.7 54.2 66.1
Qwen3.5-35B-A3B 2026.02 48.0 68.1 60.4 77.6 89.7 88.7 25.2 57.9 50.1 50.6 35.2 62.1 62.5 56.5 77.1 75.6 56.0 67.1
GLM-4.5V 2025.07 33.5 61.4 51.7 70.5 81.2 83.1 19.7 43.1 32.4 37.4 21.2 52.5 44.7 39.6 66.2 43.7 43.8 53.9
Seed-1.8 (non-thinking) 2025.12 29.1 59.7 46.0 72.5 84.7 88.0 22.0 45.9 16.1 17.5 15.0 59.7 47.8 50.3 76.5 69.1 42.2 57.8
Seed-2.0 Pro (non-thinking) 2026.02 40.3 73.3 56.5 80.7 91.5 90.5 21.3 54.7 44.5 55.2 32.4 62.2 62.6 61.3 83.1 85.8 54.0 70.5
Kimi K2.5 (non-thinking) 2026.02 45.2 70.3 60.9 79.8 87.2 86.7 30.2 59.7 40.6 47.6 33.6 63.6 59.9 57.9 80.8 79.4 54.8 68.1
MiMo-V2-Omni 2026.03 31.1 56.9 41.5 66.4 87.0 85.8 19.7 46.1 19.1 30.3 19.4 54.7 57.1 51.0 76.6 64.6 43.9 57.0
Gemini 2.5 Pro 2025.03 46.0 76.5 56.5 77.6 88.6 87.3 17.5 53.0 10.2 22.1 28.7 57.6 62.1 57.8 71.7 67.1 47.7 62.4
Gemini 3.1 Pro 2026.02 57.9 78.7 67.0 85.3 92.5 95.1 31.8 62.7 32.5 45.2 39.7 70.3 65.6 63.1 86.8 85.2 59.2 73.2

Document Parsing MLLMs

Model Date Bar (EN) Bar (ZH) Line (EN) Line (ZH) Pie (EN) Pie (ZH) Radar (EN) Radar (ZH) Box (EN) Box (ZH) Combo (EN) Combo (ZH) Flow (EN) Flow (ZH) Mind (EN) Mind (ZH) Avg (EN) Avg (ZH)
dots.mocr (3B) 2025.07 28.3 40.9 41.8 60.1 68.8 78.3 20.3 43.1 24.1 16.0 26.9 47.1 26.2 20.6 28.7 19.6 33.1 40.7
PaddleOCR-VL (1B) 2025.10 31.8 49.3 43.0 51.6 57.5 75.2 14.4 29.0 11.7 20.7 21.3 54.0 -- -- -- -- 23.9 35.8
HunyuanOCR (1B) 2025.11 33.0 60.0 49.5 68.2 71.0 74.8 19.0 41.1 43.9 45.2 20.1 50.8 39.9 35.9 55.0 46.6 41.4 52.8

Expert Chart Understanding Models

Model Date Bar (EN) Bar (ZH) Line (EN) Line (ZH) Pie (EN) Pie (ZH) Radar (EN) Radar (ZH) Box (EN) Box (ZH) Combo (EN) Combo (ZH) Flow (EN) Flow (ZH) Mind (EN) Mind (ZH) Avg (EN) Avg (ZH)
ChartAst (13B) 2024.01 5.2 -- 4.2 -- 0.3 -- 1.5 -- 0.3 -- 0.0 -- -- -- -- -- 1.4 --
ChartVLM (8.3B) 2024.02 11.2 5.3 11.5 4.3 12.9 8.2 2.1 5.0 0.7 0.4 4.1 4.4 -- -- -- -- 5.3 3.5
TinyChart (3B) 2024.04 6.1 6.3 9.7 3.2 5.7 5.4 0.5 3.4 0.2 1.3 0.7 4.2 -- -- -- -- 2.9 3.0
ChartMoE (8B) 2024.09 18.7 24.4 14.7 22.3 15.0 48.5 3.7 16.1 2.7 1.6 5.1 19.5 4.0 -- 4.1 -- 8.5 16.7
ChartCoder (7B) 2025.01 23.2 12.6 22.0 19.6 34.3 16.7 5.5 13.9 5.4 11.4 3.7 5.1 5.6 -- 1.0 -- 12.6 9.9
RRVF (7B) 2025.07 35.8 66.5 41.5 54.3 51.6 75.3 16.6 40.3 14.7 14.1 23.5 61.2 36.4 32.4 68.4 63.8 36.0 51.0
MSRL (7B) 2025.08 32.7 45.2 35.2 34.3 41.2 67.9 25.9 48.0 11.2 13.0 16.7 35.2 23.2 12.4 31.0 18.8 27.1 34.3

Task Definitions

ChartArena groups charts into three categories, each with a default extraction task:

Chart Category Examples Default Task
Numerical charts Bar / Line / Pie / Radar / Box / Combo … SE_MD
Mind maps (logic diagrams) Tree / hierarchy diagrams SE_MD
Flowcharts Process / workflow diagrams SE_MERMAID
The eleven extraction tasks (click to expand)
Task Output Format Description
SE_MD Markdown table / list Numerical charts → Markdown table; mind maps → Markdown nested list
SE_JSON JSON Structured JSON with title and values
SE_CSV CSV Comma-separated values
SE_CODE Python (matplotlib) Reproduce the chart as executable Python code
SE_SVG SVG Reproduce the chart as SVG markup
SE_MERMAID Mermaid Flowchart as Mermaid diagram syntax
SE_GRAPHVIZ Graphviz DOT Flowchart as DOT language
SE_PLANTUML PlantUML Flowchart as PlantUML syntax
SE_DIAGRAMS diagrams.net XML Flowchart as draw.io XML
SE_D2 D2 Flowchart as D2 diagram language
SE_CYTOSCAPE Cytoscape JSON Flowchart as Cytoscape.js JSON

Scoring metrics: mAP (map_strict / map_slight / map_high) and EM (exact match).

Getting Started

1. Setup

git clone <this-repo>
cd ChartArena
pip install -r requirements.txt
# Optional: only if you plan to use --api_type local_vllm
pip install vllm

2. Download benchmark data

The dataset (jsonl + images) is released as a single archive. Place the files under data/:

data/
├── ChartArena.jsonl
└── images/
    ├── bar/...
    ├── line/...
    ├── pie/...
    └── ...

Each line of the jsonl looks like:

{
  "img_path": "images/xxx.png",
  "chart_type": "柱状图",
  "img_type": "电子印刷",
  "lang_type": "中文",
  "anno": "..."
}

img_path is a relative path from the data/ directory and is used as the unique key throughout the pipeline.

3. Inference

Two backends are supported via --api_type: openai_compat for any OpenAI-compatible HTTP service (local or cloud), and local_vllm for in-process model loading. Inference supports resume — re-running the same command skips already-completed samples.

Backend details and task selection (click to expand)

(a) openai_compat — any OpenAI-compatible HTTP service

Works with locally-served models (vllm serve, sglang, lmdeploy) or public APIs that speak the OpenAI Chat Completions protocol (OpenAI, Gemini, Claude, Together, …).

python infer.py \
    --api_type openai_compat \
    --model_name Qwen2.5-VL-72B-Instruct \
    --base_url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --max_workers 64

(b) local_vllm — in-process vLLM, give it a model path

No need to start a server first. The script loads the checkpoint directly with vllm.LLM.

python infer.py \
    --api_type local_vllm \
    --model_path /path/to/Qwen2.5-VL-72B-Instruct \
    --tensor_parallel_size 4 \
    --max_model_len 32768

Task selection

By default, each chart category runs one task. You can override with --task_data, --task_logic, --task_flowchart (each accepts one or more task names):

# Run SE_MD and SE_JSON for numerical charts, SE_MERMAID for flowcharts
python infer.py --api_type openai_compat --model_name ... --base_url ... \
    --task_data SE_MD SE_JSON \
    --task_flowchart SE_MERMAID

Output

Each run writes one jsonl file:

infer_outputs/<model_tag>/results.jsonl

<model_tag> defaults to --model_name / basename of --model_path. You can override it with --output_tag.

4. Judging

# Score all models under infer_outputs/
python judge.py

# Score specific models only
python judge.py --models Qwen2.5-VL-72B-Instruct gemini-2.5-pro

# Force re-score a specific task (e.g. after a scoring algorithm update)
python judge.py --force_rejudge SE_MERMAID

Outputs to judge_outputs/<model_tag>/results.jsonl. The judge step is purely rule-based and fast.

5. Analysis report

python analyze.py
# → judge_outputs/results_analysis.xlsx

Scores are shown to 3 decimal places (e.g. 0.873).

Workbook contents (click to expand)
  • Task overview (Sheet 1) — per-model average score for each task
  • Per-task sheets — model × source file breakdown for each task
  • By chart type (by_chart_type/) — one Excel per task, one sheet per chart type
  • Detailed breakdown (detail_by_category/) — per-model per-task breakdown by (chart_type, img_type, lang_type)

6. End-to-end example

# 1. Run inference
python infer.py \
    --api_type openai_compat \
    --model_name Qwen2.5-VL-72B-Instruct \
    --base_url http://127.0.0.1:8000/v1

# 2. Score
python judge.py

# 3. Generate Excel report
python analyze.py

7. Repo layout

The repository is organized around a three-stage pipeline (inference → judging → analysis), with pluggable API backends, per-format scoring modules, and shared metric utilities.

Full directory tree (click to expand)
ChartArena/
├── README.md / README_zh.md
├── requirements.txt
├── data/                        # ← download benchmark data here
├── infer_outputs/               # inference results (auto-created)
├── judge_outputs/               # scoring results  (auto-created)
├── apis/
│   ├── base.py                  # APIBase abstract class
│   ├── openai_compat.py         # OpenAI-compatible client
│   └── local_vllm.py            # in-process vLLM
├── methods/
│   ├── prompts.py               # prompt templates
│   ├── context.py               # context building utilities
│   ├── normalize.py             # output normalization
│   ├── scoring.py               # scoring entry points
│   └── parsers/                 # per-format output parsers
├── metrics/
│   ├── SCRM.py                  # core MAP / EM metric
│   ├── tree_eval.py             # Markdown list evaluation
│   ├── mermaid_eval.py          # Mermaid diagram evaluation
│   ├── flowchart_common.py      # flowchart multi-format evaluation
│   └── dsl_parsers/             # DSL-specific parser utilities
├── utils/
│   ├── io.py                    # ResultWriter (thread-safe incremental writer)
│   ├── signal_utils.py          # graceful Ctrl+C shutdown
│   └── image_utils.py           # base64 encoding for OpenAI-compat
├── infer.py                     # entry: inference
├── judge.py                     # entry: rule-based scoring
└── analyze.py                   # entry: Excel analysis report

Citation

% (coming soon)

License

This benchmark is released for research purposes only.

About

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages