Track remaining PAIBench-C reproducibility gaps #219

@Muneerali199

Description

Tracking the remaining blockers for full PAIBench-C reproduction from issue #14 and PR #211.

Known gaps:

  1. Prompt conversion pipeline: PAIBench-C uses natural-language captions (metadata.csv / captions/*.json), while Cosmos3 uses a structured prompt.json schema. The conversion recipe is not yet public.

  2. Per-clip prompt files: The 600 structured prompt.json files used for Table 16 are not yet available.

  3. SAM2 determinism: The seed (2026) only controls the Cosmos generator (WAN/DIT). The random point sampling for SAM2 in the evaluation pipeline is not seeded, so metric computation is non-deterministic even with identical prompts and an identical generator seed. See: https://github.com/SHI-Labs/physical-ai-bench/blob/main/conditional_generation/models/grounded_sam_v2.py#L44-L83

  4. Evaluation code: It is unclear whether the evaluation used the official PAIBench-C benchmark code or a custom pipeline.

  5. Source segmentation validation: The generated segmentation has not yet been compared against the official PAIBench-C precomputed ground truth (GT).
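
To illustrate gap 3, here is a minimal sketch of what seeded point sampling could look like. It is not the code in grounded_sam_v2.py: the helper name `sample_mask_points`, the list-of-lists mask representation, and the use of Python's standard `random` module are all assumptions made for illustration. The idea is only that a locally seeded generator makes the sampled prompt points repeatable.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def sample_mask_points(
    mask: Sequence[Sequence[int]], num_points: int, seed: int
) -> List[Point]:
    """Hypothetical helper: pick up to num_points foreground (row, col) pixels."""
    # Enumerate foreground pixels in fixed row-major order so the candidate
    # list itself does not vary between runs.
    candidates = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    # A local generator avoids reading or disturbing global RNG state.
    rng = random.Random(seed)
    k = min(num_points, len(candidates))
    return rng.sample(candidates, k)


# Toy binary mask standing in for a segmentation mask (assumption).
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
first = sample_mask_points(mask, num_points=3, seed=2026)
second = sample_mask_points(mask, num_points=3, seed=2026)
assert first == second  # same seed, same points
```

If the real pipeline samples with NumPy or PyTorch, the equivalent would be to seed that library's generator at the same place. Which library is used there is not confirmed in this issue.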

Action items:

  • Publish prompt conversion script
  • Release per-clip prompt files
  • Confirm SAM2 seeding approach or fix evaluation determinism
  • Clarify evaluation pipeline
  • Validate against official GT
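
For the last action item, one possible check is intersection-over-union (IoU) between a generated mask and the precomputed GT mask. This is a sketch under assumptions: the issue does not say which comparison metric will be used, and the function name `mask_iou` and the nested-list mask format are hypothetical.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Hypothetical helper: IoU of two binary masks with the same shape."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += bool(p) and bool(g)
            union += bool(p) or bool(g)
    # Two empty masks agree completely, so report 1.0 instead of dividing by zero.
    return intersection / union if union else 1.0


# Toy masks standing in for a generated and a GT segmentation (assumption).
generated = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
score = mask_iou(generated, ground_truth)  # 1 shared pixel out of 3 covered
```

A per-clip IoU near 1.0 would suggest the local segmentation step matches the official GT. What threshold counts as a match is left open here.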
