feat: add Physical AI Understanding task by njb-nvidia · Pull Request #1353 · EvolvingLMMs-Lab/lmms-eval

njb-nvidia · 2026-05-26T18:43:23Z

Summary

Adds Physical AI Understanding, a 1,214-item video MCQ benchmark from NVIDIA's Cosmos PhysicalAI family covering embodied / autonomous-vehicle / robotics reasoning.

Each item carries a structured index2ans mapping ({"A": ..., "B": ..., "C": ..., "D": ...}) and a target letter, so we don't need to re-parse choices from the prompt.

Dataset: `shi-labs/physical-ai-bench-understanding` — parquet QA at `data/test-*.parquet` and 1,027 source videos at `videos/<subset>/<id>.mp4` in the same HF repo. Videos are fetched once via `snapshot_download` and cached on disk for subsequent `doc_to_visual` lookups.
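
For reviewers who want to picture the fetch-once behaviour, here is a minimal sketch, not the code in `utils.py`. It assumes the `videos/<subset>/<id>.mp4` layout described above; the function names `video_root` and `video_path` and the `subset` / `video_id` arguments are hypothetical and chosen only for illustration. `snapshot_download` is imported lazily so the path helper can be used without touching the network.

```python
import os
from functools import lru_cache

REPO_ID = "shi-labs/physical-ai-bench-understanding"


@lru_cache(maxsize=1)
def video_root() -> str:
    """Download the videos/ tree once per process and return the local snapshot dir.

    huggingface_hub also keeps its own on-disk cache, so later processes
    reuse the downloaded files instead of fetching them again.
    """
    from huggingface_hub import snapshot_download  # lazy import: only needed on first call

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=["videos/*"],  # skip the parquet files, fetch only videos
    )


def video_path(root: str, subset: str, video_id: str) -> str:
    """Build the local path of one clip (subset and video_id are assumed field names)."""
    return os.path.join(root, "videos", subset, f"{video_id}.mp4")
```

A `doc_to_visual`-style function could then return `[video_path(video_root(), subset, video_id)]` for each document, paying the download cost only on the first call.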

Files

`lmms_eval/tasks/physical_ai_understanding/physical_ai_understanding.yaml` — task config.
`lmms_eval/tasks/physical_ai_understanding/utils.py` — doc transforms, `MultiChoiceRegexFilter` (letter-first, then choice-text substring; strips `<think>` / `<answer>` wrappers).

Parity vs. local fork

Qwen3-VL-2B-Instruct, full `test` split (1,214 items).

| Source | Accuracy | Stderr |
| --- | --- | --- |
| Fork (vllm backend) | 0.4992 | ±0.014 |
| Upstream (HF simple/qwen3_vl, fps=2) | 0.4786 | ±0.014 |

| Comparison | Value |
| --- | --- |
| Identical `filtered_resps` | 1,087 / 1,214 (89.5%) |
| Verdict agreement | 91.7% |
| Δ | -2.1 pp (within stderr) |

Different inference backends (vllm vs HF simple) account for the small per-doc divergence; this is in line with the drift we've seen on prior video-MCQ ports (egotaskqa, metavqa).
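
The parity figures above can be re-derived with a few lines of arithmetic. This is a sketch that assumes the reported stderr is the plain binomial standard error of the accuracy; the harness may compute it differently (for example with a sample-variance or bootstrap estimate), so treat the formula as an assumption rather than the harness's definition.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n). Assumed formula."""
    return math.sqrt(acc * (1.0 - acc) / n)


N = 1214                      # items in the test split
FORK_ACC = 0.4992             # fork, vllm backend
UPSTREAM_ACC = 0.4786         # upstream, HF simple/qwen3_vl, fps=2

se_fork = binomial_stderr(FORK_ACC, N)          # about 0.014
se_upstream = binomial_stderr(UPSTREAM_ACC, N)  # about 0.014

# Accuracy gap in percentage points (upstream minus fork).
delta_pp = (UPSTREAM_ACC - FORK_ACC) * 100

# Share of documents whose filtered responses are identical.
identical_frac = 1087 / N

print(f"stderr fork={se_fork:.3f} upstream={se_upstream:.3f}")
print(f"delta={delta_pp:.2f} pp, identical={identical_frac:.1%}")
```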

Notes on short videos

A handful of clips have only ~6 frames. The upstream `simple/qwen3_vl` model defaults to strict `nframes=32` via `qwen_vl_utils`, which errors on these. Running with `model_args=fps=2.0,max_num_frames=32` flips it to lenient `max_frames` mode and the task runs end-to-end. (No code changes needed — just a model-args note for reproducibility.)

Test plan

`uv run lmms-eval --tasks physical_ai_understanding --limit 4` as a smoke test
Full 1,214-doc run on 8x H100 with Qwen3-VL-2B-Instruct; scores within stderr of the fork's vllm run
Per-doc analysis: 89.5% identical predictions, 91.7% verdict agreement
Video cache via `snapshot_download` works (1,027 files cached on first call, reused on subsequent calls)

A 1,214-item video MCQ benchmark from NVIDIA's Cosmos PhysicalAI family covering embodied / AV / robotics reasoning, with structured 4-option choices ({"A": ..., "B": ..., "C": ..., "D": ...}). Dataset: shi-labs/physical-ai-bench-understanding (parquet QA at data/test-*.parquet + 1,027 source videos at videos/<subset>/<id>.mp4 in the same HF repo). The videos are fetched once with snapshot_download and cached on disk for subsequent doc_to_visual lookups. Metric: exact_match (flexible-extract) on the MCQ letter — the filter first tries a leading uppercase letter, falls back to substring-match against the index2ans choice texts, and handles common reasoning wrappers (<think>...</think>, <answer>...</answer>).
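
The extraction order described above (strip reasoning wrappers, try a leading letter, then fall back to matching the choice text) can be sketched as follows. This is an illustrative stand-in, not the `MultiChoiceRegexFilter` from the PR: the function name `extract_choice`, the regexes, and the rule of accepting a substring match only when exactly one choice matches are all assumptions.

```python
import re

THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# A leading uppercase letter A-D, optionally parenthesised, followed by
# punctuation or end of string (so "A car ..." is not read as the answer "A").
LEADING_LETTER = re.compile(r"^\(?([A-D])\)?(?=[.:,)]|\s*$)")


def extract_choice(response: str, index2ans: dict) -> str:
    """Return the predicted letter, or "" if nothing can be extracted."""
    text = THINK.sub("", response)           # drop <think>...</think> blocks
    wrapped = ANSWER.search(text)
    if wrapped:                               # keep only the <answer> payload
        text = wrapped.group(1)
    text = text.strip()

    letter = LEADING_LETTER.match(text)       # step 1: leading letter
    if letter:
        return letter.group(1)

    lowered = text.lower()                    # step 2: choice-text substring
    hits = [k for k, v in index2ans.items() if v.lower() in lowered]
    return hits[0] if len(hits) == 1 else ""


choices = {
    "A": "The car turns left",
    "B": "The car stops",
    "C": "The robot picks up the cup",
    "D": "Nothing happens",
}
print(extract_choice("<think>maybe A?</think><answer>C</answer>", choices))
print(extract_choice("I believe the car stops.", choices))
```

Returning an empty string on ambiguous or missing matches means such responses simply score as incorrect under `exact_match`, which is one reasonable policy; the PR's filter may resolve ties differently.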

kcz358 approved these changes May 28, 2026


kcz358 merged commit 6377299 into EvolvingLMMs-Lab:main May 28, 2026
1 of 3 checks passed

kcz358 merged 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-physical_ai_understanding-task
