feat: add CRPE-Relation task by njb-nvidia · Pull Request #1354 · EvolvingLMMs-Lab/lmms-eval

njb-nvidia · 2026-05-26T20:45:24Z

Summary

Adds CRPE-Relation, a 7,576-item single-image MCQ on object / predicate / subject relationships drawn from The All-Seeing Project V2.

Dataset: `nv-njb/CRPE` — a bundled re-host of `OpenGVLab/CRPE`.

Why a re-host

The original `OpenGVLab/CRPE` repo ships the `crpe_relation.jsonl` annotation file alongside 544 `abnormal_images/` JPEGs, but the remaining 5,400 records reference COCO val2017 images by relative path — those JPEGs are not in the HF repo, so out-of-the-box `load_dataset` cannot resolve them.

The re-host inlines all 1,081 unique referenced images (537 from COCO val2017 + 544 from abnormal_images) as JPEG bytes under an `Image()` feature. Result: a self-contained parquet (~1 GB across 4 shards) that loads end-to-end via standard `load_dataset` — no extra COCO download needed.
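The inlining step described above might look roughly like the following. This is a minimal sketch, not the script used to build `nv-njb/CRPE`: the function name `inline_images`, the `image` field name, and the single `image_root` directory are all assumptions made for illustration. The `{"bytes": ..., "path": ...}` dict is the form that the `datasets` `Image()` feature accepts for encoded images.

```python
from pathlib import Path


def inline_images(records, image_root):
    """Replace each record's relative image path with the raw file bytes.

    `records` is a list of dicts whose "image" key (a hypothetical field
    name) holds a path relative to `image_root`. Returns the new records
    and the number of unique image files that were read.
    """
    cache = {}  # relative path -> bytes, so a shared image is read only once
    inlined = []
    for rec in records:
        rel = rec["image"]
        if rel not in cache:
            cache[rel] = (Path(image_root) / rel).read_bytes()
        # Dict form with "bytes" and "path" keys, as accepted by datasets.Image()
        inlined.append({**rec, "image": {"bytes": cache[rel], "path": None}})
    return inlined, len(cache)
```

Caching by relative path matters here because many records share an image: the PR reports 7,576 items but only 1,081 unique images.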

Files

- `lmms_eval/tasks/crpe_relation/crpe_relation.yaml`: task config.
- `lmms_eval/tasks/crpe_relation/utils.py`: doc transforms and `MultiChoiceRegexFilter` (letter-first, then choice-text substring; strips `<think>`/`<answer>` wrappers).

Parity vs. local fork

Qwen3-VL-2B-Instruct, full `test` split (7,576 items), 8x H100, greedy decoding.

| Source | Accuracy | Stderr |
|---|---|---|
| Fork (vllm backend) | 0.7401 | ±0.005 |
| Upstream (HF simple/qwen3_vl) | 0.7418 | ±0.005 |

| Per-doc comparison | Value |
|---|---|
| Identical `filtered_resps` | 7,174 / 7,576 (94.7%) |
| Verdict agreement | 95.7% |
| Δ | +0.17 pp |

Essentially identical — well within stderr.
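The arithmetic behind this comparison can be checked with a short script. This is a sketch under one stated assumption: it uses the binomial standard error sqrt(p(1-p)/n), which may differ slightly from how lmms-eval computes its reported stderr (for example, if it bootstraps). The inputs are the figures reported above.

```python
import math

n = 7576  # items in the full test split
acc_fork, acc_upstream = 0.7401, 0.7418


def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


# Accuracy gap in percentage points (upstream minus fork)
delta_pp = (acc_upstream - acc_fork) * 100

se_fork = binomial_stderr(acc_fork, n)
se_upstream = binomial_stderr(acc_upstream, n)

# Share of documents whose filtered responses are identical in both runs
identical_frac = 7174 / n

print(f"delta = {delta_pp:.2f} pp")
print(f"stderr fork = {se_fork:.4f}, upstream = {se_upstream:.4f}")
print(f"identical = {identical_frac:.1%}")
```

Under that assumption both standard errors round to 0.005, so the 0.17 pp gap is roughly a third of one standard error.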

Test plan

- Smoke test: `uv run lmms-eval --tasks crpe_relation --limit 8`.
- Full 7,576-doc run on 8x H100 with Qwen3-VL-2B-Instruct; matches the fork's vllm score within 0.2 pp.
- Per-doc analysis: 94.7% identical predictions, 95.7% verdict agreement.
- Bundled parquet loads via `load_dataset("nv-njb/CRPE", split="test")` without external dependencies.

CRPE-Relation is a 7,576-item single-image MCQ on object/predicate/subject relationships, drawn from The All-Seeing Project V2.

Dataset: `nv-njb/CRPE`, a bundled re-host of the original `OpenGVLab/CRPE` annotations (which ship only the 544 `abnormal_images/` JPEGs, while the remaining 5,400 records reference COCO val2017 by relative path). The re-host inlines all 1,081 unique images (537 COCO val2017 + 544 abnormal) as JPEG bytes under an `Image()` feature, so the parquet loads end-to-end via standard `load_dataset` with no extra COCO download.

Metric: `exact_match` (flexible-extract) on the MCQ letter. The filter parses the inline A./B./C./D. choices out of the question text, then tries (1) a leading uppercase letter and (2) a substring match against any choice text. It handles common reasoning wrappers (`<think>...</think>`, `<answer>...</answer>`).
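The two-stage extraction described above could be sketched as follows. This is not the `MultiChoiceRegexFilter` from `utils.py`: the helper names `parse_choices` and `extract_letter`, the regular expressions, the case-insensitive substring match, and the choice to drop `<think>` content while unwrapping `<answer>` are all assumptions made to illustrate the idea.

```python
import re

# Assumption: reasoning inside <think>...</think> is discarded entirely,
# while <answer>...</answer> is unwrapped and its content kept.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
_TAG = re.compile(r"</?(?:think|answer)>")

# A choice is "X. text", running until the next "Y. " marker or end of string.
_CHOICE = re.compile(r"\b([A-D])\.\s*(.+?)(?=\s+[A-D]\.\s|\s*$)", re.DOTALL)

# A leading letter counts only if followed by punctuation or end of string,
# so a sentence such as "A cat is ..." is not mistaken for answer A.
_LEADING_LETTER = re.compile(r"\(?([A-D])(?:[.):,]|\s*$)")


def parse_choices(question):
    """Map each choice letter to its text, parsed from the question string."""
    return {m.group(1): m.group(2).strip() for m in _CHOICE.finditer(question)}


def extract_letter(response, question):
    """Return the predicted letter, or None if nothing can be extracted."""
    text = _TAG.sub("", _THINK_BLOCK.sub("", response)).strip()

    # Stage 1: leading uppercase letter.
    m = _LEADING_LETTER.match(text)
    if m:
        return m.group(1)

    # Stage 2: substring match against any choice text (first hit wins).
    lowered = text.lower()
    for letter, choice in parse_choices(question).items():
        if choice.lower() in lowered:
            return letter
    return None
```

A usage example with a made-up question:

```python
q = ("What is the relation between the cat and the sofa? "
     "A. sitting on B. under C. next to D. behind")
print(extract_letter("<think>Maybe A. Hmm.</think><answer>C</answer>", q))
print(extract_letter("The cat is behind the sofa.", q))
```

Note that a substring fallback returns the first choice whose text appears in the response, so a short choice contained in a longer one can produce a wrong match.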

kcz358 approved these changes May 28, 2026

kcz358 merged commit b71d25a into EvolvingLMMs-Lab:main May 28, 2026
1 of 3 checks passed

kcz358 merged 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-crpe_relation-task
