Track remaining PAIBench-C reproducibility gaps

This issue tracks the remaining blockers to a full PAIBench-C reproduction, following up on issue #14 and PR #211.

**Known gaps:**

1. **Prompt conversion pipeline** — PAIBench-C uses natural-language captions (`metadata.csv` / `captions/*.json`), while Cosmos3 uses a structured `prompt.json` schema. The conversion recipe is not yet public.

2. **Per-clip prompt files** — The 600 structured `prompt.json` files used for Table 16 are not yet available.

3. **SAM2 determinism** — The seed (`2026`) only controls the Cosmos generator (WAN/DIT). The SAM2 random point sampling in the evaluation pipeline is not seeded, so metric computation is non-deterministic even with identical prompts. See: https://github.com/SHI-Labs/physical-ai-bench/blob/main/conditional_generation/models/grounded_sam_v2.py#L44-L83

4. **Evaluation code** — Unclear whether the evaluation used the official PAIBench-C benchmark code or a custom pipeline.

5. **Source segmentation validation** — Haven't compared generated segmentation against the official PAIBench-C precomputed GT yet.
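For gap 3, one possible way to make the point sampling repeatable is to draw the points from an explicitly seeded generator instead of global random state. The sketch below is illustrative only: `sample_points_from_mask` and `clip_rng` are hypothetical helpers, not functions from `grounded_sam_v2.py`, and uniform sampling over mask pixels is an assumption about how the points are chosen.

```python
import numpy as np


def clip_rng(base_seed, clip_index):
    """Build a generator that depends only on the base seed and the clip index.

    Hypothetical helper: deriving one generator per clip keeps the sampled
    points independent of the order in which clips are processed.
    """
    return np.random.default_rng([base_seed, clip_index])


def sample_points_from_mask(mask, num_points, rng):
    """Pick up to num_points (x, y) pixel coordinates uniformly from a boolean mask.

    Hypothetical stand-in for the unseeded point sampling in the evaluation
    pipeline; all randomness comes from the rng argument.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # Nothing to sample from: return an empty (0, 2) array.
        return np.empty((0, 2), dtype=np.int64)
    k = min(num_points, ys.size)
    idx = rng.choice(ys.size, size=k, replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)


if __name__ == "__main__":
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 3:7] = True
    first = sample_points_from_mask(mask, 5, clip_rng(2026, 0))
    second = sample_points_from_mask(mask, 5, clip_rng(2026, 0))
    print(np.array_equal(first, second))
```

Reusing `2026` as the base seed here is only for illustration. Whether the evaluation should share the generator seed or take a separate one is part of what needs confirming.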

**Action items:**
- [ ] Publish prompt conversion script
- [ ] Release per-clip prompt files
- [ ] Confirm SAM2 seeding approach or fix evaluation determinism
- [ ] Clarify evaluation pipeline
- [ ] Validate against official GT
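
For the GT validation item, a first pass could compare each generated segmentation mask with the corresponding precomputed GT mask using intersection-over-union. This is a minimal sketch under assumptions: `mask_iou` is a hypothetical helper, IoU is assumed as the comparison measure (the benchmark's own metric may differ), and both masks are assumed to be same-shaped binary arrays.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two same-shaped binary masks.

    Hypothetical helper for a sanity check, not part of the benchmark code.
    Two empty masks are treated as a perfect match (IoU = 1.0).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    gt = np.array([[1, 0], [0, 0]])
    print(mask_iou(pred, gt))
```

A low per-clip IoU would point to a difference in the source segmentation step rather than in the generator.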


Track remaining PAIBench-C reproducibility gaps #219
