feat: add Physical AI Understanding task #1353

Merged
kcz358 merged 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-physical_ai_understanding-task on May 28, 2026

@njb-nvidia
Contributor

Summary

Adds Physical AI Understanding, a 1,214-item video MCQ benchmark from NVIDIA's Cosmos PhysicalAI family covering embodied / autonomous-vehicle / robotics reasoning.

Each item carries a structured index2ans mapping ({"A": ..., "B": ..., "C": ..., "D": ...}) and a target letter, so we don't need to re-parse choices from the prompt.
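To illustrate why the structured mapping helps, here is a minimal sketch of how a doc transform could build the prompt and read the target straight from the item. The field names `question` and `answer`, the function names, and the prompt wording are assumptions for illustration; they are not necessarily what `utils.py` uses.

```python
from typing import Any, Dict


def doc_to_text(doc: Dict[str, Any]) -> str:
    """Render the question followed by lettered options taken from index2ans."""
    options = "\n".join(
        f"{letter}. {text}" for letter, text in sorted(doc["index2ans"].items())
    )
    return f"{doc['question']}\n{options}\nAnswer with the option's letter."


def doc_to_target(doc: Dict[str, Any]) -> str:
    """The gold label is already a letter, so the prompt never has to be re-parsed."""
    # "answer" is a hypothetical field name for the target letter.
    return doc["answer"].strip().upper()
```

Because the options come from the mapping rather than from the prompt string, the same `index2ans` dict can later be reused by the answer filter.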

Dataset: `shi-labs/physical-ai-bench-understanding`, with parquet QA at `data/test-*.parquet` and 1,027 source videos at `videos/<subset>/<id>.mp4` in the same HF repo. Videos are fetched once via `snapshot_download` and cached on disk for subsequent `doc_to_visual` lookups.
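A rough sketch of the fetch-once pattern, assuming the repo is a Hugging Face dataset repo (`repo_type="dataset"`) and that restricting the download to `videos/*` is acceptable. The helper names `video_root` and `video_path` are hypothetical, and the actual `utils.py` may organize this differently.

```python
from functools import lru_cache
from pathlib import Path

REPO_ID = "shi-labs/physical-ai-bench-understanding"


@lru_cache(maxsize=1)
def video_root() -> Path:
    """Download the videos/ tree once; later calls reuse the cached snapshot."""
    # Imported lazily so the path helper below works without the dependency.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",          # assumption: the repo is a dataset repo
        allow_patterns=["videos/*"],  # assumption: skip the parquet files here
    )
    return Path(local_dir) / "videos"


def video_path(root: Path, subset: str, video_id: str) -> Path:
    """Map a (subset, id) pair onto the videos/<subset>/<id>.mp4 layout."""
    return root / subset / f"{video_id}.mp4"
```

A `doc_to_visual` implementation could then return `[str(video_path(video_root(), subset, video_id))]`; `snapshot_download` itself also skips files already present in the local Hugging Face cache, so the `lru_cache` only avoids repeating the lookup within one process.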

Files

  • `lmms_eval/tasks/physical_ai_understanding/physical_ai_understanding.yaml` — task config.
  • `lmms_eval/tasks/physical_ai_understanding/utils.py` — doc transforms, `MultiChoiceRegexFilter` (letter-first, then choice-text substring; strips `<think>...</think>` / `<answer>...</answer>` wrappers).
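The extraction order described for `MultiChoiceRegexFilter` can be sketched as a standalone function. This is an illustration of the letter-first, then substring strategy under assumed regexes, not the filter's actual code; `extract_choice` is a hypothetical name.

```python
import re
from typing import Dict, Optional

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
_ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# A leading option letter such as "B", "B." or "(B)", not followed by a letter.
_LEADING_LETTER = re.compile(r"^\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(response: str, index2ans: Dict[str, str]) -> Optional[str]:
    # Drop reasoning, then prefer the contents of an <answer> tag if present.
    text = _THINK_BLOCK.sub("", response)
    tagged = _ANSWER_TAG.search(text)
    if tagged:
        text = tagged.group(1)
    text = text.strip()

    # 1) Letter-first: accept a leading uppercase option letter.
    match = _LEADING_LETTER.match(text)
    if match and match.group(1) in index2ans:
        return match.group(1)

    # 2) Fallback: case-insensitive substring match against the choice texts.
    lowered = text.lower()
    hits = [key for key, value in index2ans.items() if value.lower() in lowered]
    # Only accept an unambiguous hit; otherwise report no answer.
    return hits[0] if len(hits) == 1 else None
```

One limitation of this sketch: a free-text response that begins with the word "A" (for example "A car turns left") is read as option A by the letter-first step.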

Parity vs. local fork

Qwen3-VL-2B-Instruct, full `test` split (1,214 items).

| Source | Accuracy | Stderr |
| --- | --- | --- |
| Fork (vllm backend) | 0.4992 | ±0.014 |
| Upstream (HF simple/qwen3_vl, fps=2) | 0.4786 | ±0.014 |
| Identical `filtered_resps` | 1,087 / 1,214 (89.5%) | |
| Verdict agreement | 91.7% | |
| Δ | -2.1 pp (within stderr) | |
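The reported figures can be sanity-checked with a few lines of arithmetic. The sketch assumes the stderr column is the usual binomial standard error of a mean over 0/1 scores; the harness may compute it slightly differently (for example with a sample-variance correction), which would not change the rounded values.

```python
import math

N = 1214  # items in the test split
fork_acc, upstream_acc = 0.4992, 0.4786


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent 0/1 scores."""
    return math.sqrt(acc * (1.0 - acc) / n)


fork_se = binomial_stderr(fork_acc, N)
upstream_se = binomial_stderr(upstream_acc, N)
delta_pp = (upstream_acc - fork_acc) * 100  # difference in percentage points
identical_pct = 1087 / N * 100              # share of identical filtered_resps

print(f"fork stderr     = {fork_se:.3f}")
print(f"upstream stderr = {upstream_se:.3f}")
print(f"delta           = {delta_pp:.1f} pp")
print(f"identical       = {identical_pct:.1f}%")
```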

Different inference backends (vllm vs HF simple) account for the small per-doc divergence; this is in line with the drift we've seen on prior video-MCQ ports (egotaskqa, metavqa).

Notes on short videos

A handful of clips have only ~6 frames. The upstream `simple/qwen3_vl` model defaults to strict `nframes=32` via `qwen_vl_utils`, which errors on these. Running with `model_args=fps=2.0,max_num_frames=32` flips it to lenient `max_frames` mode and the task runs end-to-end. (No code changes needed — just a model-args note for reproducibility.)
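The difference between the two modes can be illustrated with a simplified frame-count rule. This is a sketch written for this note, not the `qwen_vl_utils` implementation: the function name, the rounding to a multiple of 2, and the 30 fps source rate used in the example are all assumptions.

```python
from typing import Optional


def pick_num_frames(
    total_frames: int,
    video_fps: float,
    *,
    nframes: Optional[int] = None,
    fps: float = 2.0,
    max_frames: int = 32,
    factor: int = 2,
) -> int:
    """Return how many frames to sample from a clip."""
    if nframes is not None:
        # Strict mode: an exact count is demanded, so a short clip fails.
        if not factor <= nframes <= total_frames:
            raise ValueError(f"nframes={nframes} not in [{factor}, {total_frames}]")
        return nframes
    # Lenient mode: sample at `fps`, then cap by max_frames and clip length.
    wanted = total_frames / video_fps * fps
    capped = min(max(wanted, factor), max_frames, total_frames)
    # Round down to a multiple of `factor`.
    return int(capped // factor) * factor
```

Under these assumptions, a 6-frame clip raises in strict mode with `nframes=32`, while the lenient call `pick_num_frames(6, 30.0, fps=2.0, max_frames=32)` returns a small valid count instead of failing.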

Test plan

  • Smoke test: `uv run lmms-eval --tasks physical_ai_understanding --limit 4`
  • Full 1,214-doc run on 8x H100 with Qwen3-VL-2B-Instruct; scores within stderr of the fork's vllm run
  • Per-doc analysis: 89.5% identical predictions, 91.7% verdict agreement
  • Video cache via `snapshot_download` works (1,027 files cached on first call, reused on subsequent calls)

A 1,214-item video MCQ benchmark from NVIDIA's Cosmos PhysicalAI
family covering embodied / AV / robotics reasoning, with structured
4-option choices ({"A": ..., "B": ..., "C": ..., "D": ...}).

Dataset: shi-labs/physical-ai-bench-understanding (parquet QA at
data/test-*.parquet + 1,027 source videos at videos/<subset>/<id>.mp4
in the same HF repo). The videos are fetched once with
snapshot_download and cached on disk for subsequent doc_to_visual
lookups.

Metric: exact_match (flexible-extract) on the MCQ letter — the filter
first tries a leading uppercase letter, falls back to substring-match
against the index2ans choice texts, and handles common reasoning
wrappers (<think>...</think>, <answer>...</answer>).
@kcz358 merged commit 6377299 into EvolvingLMMs-Lab:main on May 28, 2026
1 of 3 checks passed