ai-for-edu/ScratchMath

ScratchMath Banner

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

AIED 2026 — 27th International Conference on Artificial Intelligence in Education

Dingjie Song1    Tianlong Xu2    Yi-Fan Zhang4    Hang Li5    Zhiling Yan1
Xing Fan3    Haoyang Li3    Lichao Sun1    Qingsong Wen2,†

1Lehigh University    2Squirrel Ai Learning (USA)    3Squirrel Ai Learning (China)
4Chinese Academy of Sciences    5Michigan State University

† Corresponding author


ScratchMath Overview

✨ Highlights

  • 🎯 Novel Task — First benchmark targeting error diagnosis in authentic student handwritten scratchwork, shifting from the "examinee" to the "examiner" perspective.
  • 📝 Real-World Data — 1,720 samples from Chinese primary & middle school students, meticulously annotated via human-machine collaboration.
  • 🔀 Two Complementary Tasks — Error Cause Explanation (ECE) + Error Cause Classification (ECC) across 7 error types.
  • 📊 Comprehensive Evaluation — 16 leading MLLMs benchmarked; best model (o4-mini) reaches 57.2% vs. human experts at 83.9%.

📖 Overview

ScratchMath evaluates whether Multimodal Large Language Models (MLLMs) can analyze handwritten mathematical scratchwork produced by real students. Unlike existing math benchmarks that focus on problem-solving, ScratchMath targets error diagnosis — identifying what type of mistake a student made and explaining why.

💬 Error Cause Explanation (ECE)

Given a math problem, correct answer, reference solution, the student's incorrect answer, and an image of handwritten scratchwork, generate a free-form explanation of the student's error.

Metric: LLM-as-a-Judge (o3-mini, 88.6% human agreement)
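
The ECE inputs listed above can be assembled into the text portion of a multimodal query, with the scratchwork image passed to the model alongside it. A minimal sketch, assuming the dataset's field names (`question`, `answer`, `solution`, `student_answer`); the wording is illustrative, not the paper's actual prompt:

```python
def build_ece_prompt(sample: dict) -> str:
    """Assemble the text portion of an ECE query; the scratchwork image
    is sent to the model separately. Field names follow the dataset
    schema; the phrasing here is illustrative only."""
    return (
        f"Problem: {sample['question']}\n"
        f"Correct answer: {sample['answer']}\n"
        f"Reference solution: {sample['solution']}\n"
        f"Student's answer: {sample['student_answer']}\n"
        "Given the student's handwritten scratchwork (attached image), "
        "explain the cause of the student's error."
    )
```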

🏷️ Error Cause Classification (ECC)

Using the same inputs, classify the error into one of 7 categories:

| Category (Chinese) | English |
|---|---|
| 计算错误 | Calculation Error |
| 题目理解错误 | Comprehension Error |
| 知识点错误 | Knowledge Gap Error |
| 答题技巧错误 | Strategy Error |
| 手写誊抄错误 | Transcription Error |
| 逻辑推理错误 | Reasoning Error |
| 注意力与细节错误 | Attention Error |

Metric: Weighted-average accuracy
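
As a sketch, one plausible reading of this metric is per-category accuracy weighted by each category's share of the test set (with support weights this reduces to overall accuracy; the paper's exact weighting may differ):

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Support-weighted average of per-category accuracies.
    y_true / y_pred are parallel lists of category labels."""
    support = Counter(y_true)  # samples per category
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(
        (support[c] / total) * (correct[c] / support[c]) for c in support
    )
```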


📦 Dataset

The dataset is hosted on HuggingFace: songdj/ScratchMath

| Subset | Samples | Grades | Description |
|---|---|---|---|
| primary | 1,479 | 1–6 | Primary school math |
| middle | 241 | 7–9 | Middle school math |

Each sample contains: question_id, question, answer, solution, student_answer, student_scratchwork (image), error_category, error_explanation.

from datasets import load_dataset

# Load primary school subset
ds = load_dataset("songdj/ScratchMath", "primary", split="train")
print(ds[0]["question"], ds[0]["error_explanation"])
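
Once loaded, the label distribution can be inspected directly. A small helper relying only on the `error_category` field named above:

```python
from collections import Counter

def category_distribution(records):
    """Count how often each of the 7 error categories appears.
    `records` is any iterable of samples with an `error_category` field."""
    return Counter(r["error_category"] for r in records)

# e.g. category_distribution(ds) after the load_dataset call above
```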

🏆 Leaderboard

Performance of state-of-the-art MLLMs on ScratchMath:

| Rank | Model | #Params | ECE Primary | ECE Middle | ECC Primary | ECC Middle | Avg |
|---|---|---|---|---|---|---|---|
| 👤 | Human Expert | – | *93.2* | *89.0* | *80.1* | *73.4* | *83.9* |
| 🥇 | o4-mini | – | **71.8** | **69.7** | 40.1 | 47.3 | **57.2** |
| 🥈 | Gemini 2.0 Flash Thinking | – | 65.9 | 61.0 | **43.9** | 47.3 | 54.5 |
| 🥉 | Gemini 2.0 Flash | – | 52.2 | 46.9 | 38.6 | 49.0 | 46.7 |
| 4 | Qwen2.5-VL | 72B | 40.0 | 34.0 | 32.5 | **49.4** | 39.0 |
| 5 | QVQ | 72B | 57.5 | 56.8 | 12.7 | 17.0 | 36.0 |
| 6 | Gemma-3 | 27B | 38.9 | 26.1 | 32.2 | 46.1 | 35.8 |
| 7 | Skywork-R1V | 38B | 37.5 | 33.6 | 27.7 | 43.2 | 35.5 |
| 8 | GPT-4o | – | 47.7 | 44.8 | 26.1 | 22.0 | 35.2 |
| 9 | InternVL2.5 | 78B | 27.1 | 24.5 | 30.7 | 44.8 | 31.8 |

Bold = best per column. Italic = human performance. Full results (16 models) in the paper.


🚀 Quick Start

Installation

git clone https://github.com/ai-for-edu/ScratchMath.git
cd ScratchMath
pip install -r requirements.txt

Run ECE Evaluation

export OPENAI_API_KEY="your-key"

# Generate error explanations
python -m eval.run_ece \
    --model gpt-4o \
    --subset primary \
    --output results/ece_gpt4o_primary.jsonl

# Judge with LLM-as-a-Judge
python -m eval.judge_ece \
    --predictions results/ece_gpt4o_primary.jsonl \
    --judge-model o3-mini \
    --output results/ece_gpt4o_primary_judged.jsonl

Run ECC Evaluation

python -m eval.run_ecc \
    --model gpt-4o \
    --subset primary \
    --output results/ecc_gpt4o_primary.jsonl

🔧 Using Open-Source Models (via vLLM)

# Start vLLM server
vllm serve Qwen/Qwen2.5-VL-7B-Instruct

# Run evaluation
python -m eval.run_ece \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --subset primary \
    --output results/ece_qwen25vl_primary.jsonl \
    --api-base http://localhost:8000/v1 \
    --api-key dummy

📄 Citation

If you find this work useful, please cite:

@inproceedings{song2026scratchmath,
  title     = {Can MLLMs Read Students' Minds? Unpacking Multimodal Error
               Analysis in Handwritten Math},
  author    = {Song, Dingjie and Xu, Tianlong and Zhang, Yi-Fan and Li, Hang
               and Yan, Zhiling and Fan, Xing and Li, Haoyang and Sun, Lichao
               and Wen, Qingsong},
  booktitle = {Proceedings of the 27th International Conference on Artificial
               Intelligence in Education (AIED)},
  year      = {2026}
}

⚖️ License

This project is licensed under CC BY-NC-SA 4.0.
