Tracking the remaining blockers for full PAIBench-C reproduction from issue #14 and PR #211.
Known gaps:
- Prompt conversion pipeline: PAIBench-C uses natural-language captions (`metadata.csv` / `captions/*.json`), while Cosmos3 uses a structured `prompt.json` schema. The conversion recipe is not yet public.
- Per-clip prompt files: The 600 structured `prompt.json` files used for Table 16 are not yet available.
- SAM2 determinism: The seed (`2026`) only controls the Cosmos generator (WAN/DIT). The SAM2 random point sampling in the evaluation pipeline is not seeded, so metric computation is non-deterministic even with identical prompts. See: https://github.com/SHI-Labs/physical-ai-bench/blob/main/conditional_generation/models/grounded_sam_v2.py#L44-L83
- Evaluation code: Unclear whether the evaluation used the official PAIBench-C benchmark code or a custom pipeline.
- Source segmentation validation: Haven't compared generated segmentation against the official PAIBench-C precomputed GT yet.
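
For the SAM2 determinism gap, here is a minimal sketch of what seeded point sampling could look like. It is not the code in `grounded_sam_v2.py`: the helper name `sample_prompt_points`, the list-of-lists mask format, and the use of Python's stdlib `random` are all assumptions made for illustration. The linked file may sample with a different RNG entirely.

```python
import random
from typing import List, Tuple

Point = Tuple[int, int]


def sample_prompt_points(mask: List[List[int]], num_points: int, seed: int) -> List[Point]:
    """Hypothetical helper: pick up to `num_points` foreground pixels as (x, y) prompts."""
    # Collect foreground coordinates in fixed row-major order so the
    # candidate list itself is deterministic.
    candidates = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    # A local generator keeps the result independent of global RNG state.
    rng = random.Random(seed)
    k = min(num_points, len(candidates))
    return rng.sample(candidates, k)


# Toy binary mask (assumption: 1 = foreground).
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]

first = sample_prompt_points(mask, num_points=3, seed=2026)
second = sample_prompt_points(mask, num_points=3, seed=2026)
print(first == second)  # same seed, same mask: identical point prompts
```

If the real sampling goes through NumPy or PyTorch, the equivalent change would presumably be a seeded `np.random.default_rng(seed)` or a `torch.manual_seed(seed)` call before sampling. Which one applies depends on the linked code and has not been verified here.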
Action items: