qa(phase-3): wire the per-run latency sidecar so the dormant RRI latency gate activates #957
…ncy gate activates

PR #954 added additive latency hard-gates (latency_s_per_beat / latency_coldopen) to qa/release_readiness.py, but they were DORMANT on a real sweep: read_latency() reads each PERSONA run dir's latency.json sidecar, while the runners derive the latency rollup into the TRANSCRIPT dir ($T/$RUN.latency.json). So the per-run sidecar never existed and the gate fell through to a (safe-by-design) evidence-gap SKIP instead of actually gating.

Wire it (all additive):
- qa/latency_rollup.py: new reusable stamp_sidecars(rollup, run_dirs) + a --stamp-into CLI flag. It writes {s_per_beat, coldopen_s, turns_per_beat} into each run dir as <run>/latency.json, the exact shape read_latency() consumes. NULL columns are preserved (read_latency treats null as ABSENT -> skip, never a fabricated 0.0); a non-existent run dir is skipped, never created.
- qa/release_gate.sh: after the duo run (which produced the per-beat ledger), re-derive the SAME rollup and stamp it into every persona run dir BEFORE the RRI rollup reads them. Non-fatal: a stamp hiccup / a duo with no derivable beat leaves the gate a documented skip.
- qa/release_readiness.py: corrected read_latency()'s stale docstring (it claimed run_duo.sh writes the per-run sidecar; in fact run_duo writes to the transcript dir and release_gate.sh stamps the per-run sidecar). The inaccuracy is part of why the gate looked wired but wasn't.
- qa/evidence_audit.py: refreshed the stale "canonical 11 RRI gates" comment. The evaluated set is 11 by default and 13 once latency gates carry evidence (gates_total is read dynamically; RRI_GATE_NAMES stays the always-required baseline).

Verify:
- A real RRI rollup over runs WITH over-budget latency evidence (s_per_beat>120 or coldopen_s>240) now FAILS the latency gates (gates_total 13, release_ready=False); under budget PASSES; absent stays a byte-identical skip (gates_total 11).
- New tests: stamp_sidecars unit coverage + a CLI --stamp-into test (test_latency_rollup.py); an end-to-end SEAM test driving the production rollup→stamp→gate path (test_release_readiness.py); a static contract locking the release_gate.sh wiring so the gate can't silently go dormant again (test_release_gate_static.py).
- Single-process: qa/test_release_readiness.py + qa/test_deterministic_rri_gate.py + qa/test_latency_rollup.py + affected static/audit/scope/orchestrate tests, all green.
Problem
PR #954 added additive latency hard-gates (`latency_s_per_beat` / `latency_coldopen`) to `qa/release_readiness.py`, but they were dormant on a real sweep:
- `release_readiness.read_latency()` reads each persona run dir's `latency.json` sidecar (then a `latency` block in `run.json`, then `score.json`).
- The runners (`qa/run_duo.sh`) derive the latency rollup into the transcript dir (`$T/$RUN.latency.json`), not the per-run dir.

So the per-run sidecar never existed, and the gate fell through to a (safe-by-design) evidence-gap SKIP instead of actually gating. Diagnosed in the Phase-3 sprint (the gate looked wired but wasn't).
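The lookup order above can be sketched roughly as follows. This is an illustration only, not the code in `qa/release_readiness.py`: the name `read_latency_sketch` is hypothetical, and treating `score.json` as carrying a `latency` block like `run.json` is an assumption. It shows why a rollup written only next to the transcript is invisible to the gate.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_latency_sketch(run_dir: Path) -> Optional[dict]:
    """Return the first latency block found for a run dir, or None (gate skips)."""
    # 1. The per-run sidecar is consulted first.
    sidecar = run_dir / "latency.json"
    if sidecar.is_file():
        return json.loads(sidecar.read_text())
    # 2./3. Fallbacks: a "latency" block in run.json, then score.json
    # (the score.json shape is an assumption made for this sketch).
    for name in ("run.json", "score.json"):
        candidate = run_dir / name
        if candidate.is_file():
            block = json.loads(candidate.read_text()).get("latency")
            if isinstance(block, dict):
                return block
    return None  # no evidence anywhere in the run dir


with tempfile.TemporaryDirectory() as tmp:
    transcript_dir = Path(tmp) / "transcripts"
    run_dir = Path(tmp) / "runs" / "persona_a"
    transcript_dir.mkdir(parents=True)
    run_dir.mkdir(parents=True)
    # The runner writes the rollup beside the transcript, not into the run dir.
    (transcript_dir / "duo.latency.json").write_text(json.dumps({"s_per_beat": 300}))
    before = read_latency_sketch(run_dir)  # nothing in the run dir yet
    # Once a sidecar exists in the run dir, the same lookup finds it.
    (run_dir / "latency.json").write_text(json.dumps({"s_per_beat": 300}))
    after = read_latency_sketch(run_dir)
```

Here `before` is `None` even though a rollup exists on disk, which is the dormant-gate situation; `after` carries the stamped values.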
Fix (all additive)
- `qa/latency_rollup.py`: new reusable `stamp_sidecars(rollup, run_dirs)` helper + a `--stamp-into` CLI flag. Writes `{s_per_beat, coldopen_s, turns_per_beat}` into each run dir as `<run>/latency.json` (the exact shape `read_latency()` consumes). NULL columns are preserved verbatim (`read_latency` treats null as absent → skip, never a fabricated `0.0`); a non-existent run dir is skipped, never created (a stale/typo path can't fabricate evidence).
- `qa/release_gate.sh`: after the duo run (which already produced the per-beat ledger), re-derive the same rollup and stamp it into every persona run dir before the RRI rollup reads them. Non-fatal: a stamp hiccup, or a duo with no derivable beat (NULL columns), leaves the gate a documented skip, exactly today's behavior when latency evidence is absent.
- `qa/release_readiness.py`: corrected `read_latency()`'s stale docstring (it claimed `run_duo.sh` writes the per-run sidecar; in fact `run_duo` writes to the transcript dir and `release_gate.sh` stamps the per-run sidecar). That inaccuracy is part of why the gate looked wired but wasn't.
- `qa/evidence_audit.py`: refreshed the stale `"canonical 11 RRI gates"` comment: the evaluated set is 11 by default and 13 once latency gates carry evidence (`gates_total` is read dynamically; `RRI_GATE_NAMES` stays the always-required baseline, and the conditional latency gates are intentionally not in it).

Attribution note
The latency budget is a build-level signal ("is generation within budget on this build?"). The duo is the canonical deep play, so its rollup is replicated into every persona run dir; the gate aggregates the max across personas, so identical values yield exactly that build figure. (The `.app` persona plays write a different transcript shape, `dm.combined.jsonl`, so a per-persona rollup is out of scope here; the duo rollup is the available, representative evidence.)

Verification
- A real RRI rollup over runs with over-budget latency evidence (`s_per_beat=300` > 120, `coldopen_s=500` > 240), stamped via the exact `latency_rollup.py --stamp-into` command `release_gate.sh` drives, now FAILS both latency gates → `gates_total=13`, `release_ready=False`. Under budget PASSES. Absent stays a byte-identical skip (`gates_total=11`).
- `qa/test_latency_rollup.py`: `stamp_sidecars` unit coverage (writes the right columns, skips non-existent dirs, preserves NULL) + a `--stamp-into` CLI test.
- `qa/test_release_readiness.py`: an end-to-end seam test driving the production rollup → stamp → gate path (synthetic duo beats, not a hand-written sidecar), asserting over-budget FAIL + under-budget PASS.
- `qa/test_release_gate_static.py`: a static contract locking the `release_gate.sh` wiring (`--stamp-into "$RUN_DIRS"` runs before the RRI rollup) so the gate can't silently go dormant again.
- Single-process (`-p no:xdist`): `qa/test_release_readiness.py` + `qa/test_deterministic_rri_gate.py` + `qa/test_latency_rollup.py` + affected static/audit/scope/orchestrate tests, all green. No committed data artifact touched (`scores.db` reverted; tests use `tmp_path` + `--out` tmp). `license_check` passes.
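For readers who want the stamp → gate behavior in one place, here is a minimal sketch under stated assumptions. It is not the code in `qa/latency_rollup.py` or `qa/release_readiness.py`: the names `stamp_sidecars_sketch` and `latency_gates_sketch` are hypothetical, and the "pass"/"fail"/"skip" return values are invented for illustration. Only the three sidecar columns, the 120 / 240 budgets, the null-means-absent rule, the skip-missing-dirs rule, and the max-across-personas aggregation come from the description above.

```python
import json
import tempfile
from pathlib import Path

COLUMNS = ("s_per_beat", "coldopen_s", "turns_per_beat")
BUDGETS = {"s_per_beat": 120.0, "coldopen_s": 240.0}  # budgets quoted above


def stamp_sidecars_sketch(rollup: dict, run_dirs: list) -> list:
    """Write <run>/latency.json into each existing run dir; return stamped dirs."""
    payload = {col: rollup.get(col) for col in COLUMNS}  # None is kept as JSON null
    stamped = []
    for run_dir in map(Path, run_dirs):
        if not run_dir.is_dir():
            continue  # a missing dir is skipped, never created
        (run_dir / "latency.json").write_text(json.dumps(payload))
        stamped.append(run_dir)
    return stamped


def latency_gates_sketch(run_dirs: list) -> dict:
    """Per budgeted column: 'pass', 'fail', or 'skip' when no evidence exists."""
    results = {}
    for col, budget in BUDGETS.items():
        values = []
        for run_dir in map(Path, run_dirs):
            sidecar = run_dir / "latency.json"
            if sidecar.is_file():
                value = json.loads(sidecar.read_text()).get(col)
                if value is not None:  # null means absent, never 0.0
                    values.append(value)
        if not values:
            results[col] = "skip"  # evidence gap: the gate is not evaluated
        else:
            # Aggregate the max across personas, then compare to the budget.
            results[col] = "fail" if max(values) > budget else "pass"
    return results


with tempfile.TemporaryDirectory() as tmp:
    dirs = [Path(tmp) / "persona_a", Path(tmp) / "persona_b"]
    for d in dirs:
        d.mkdir()
    missing = Path(tmp) / "typo_dir"  # deliberately never created
    skipped = latency_gates_sketch(dirs)  # no sidecars yet
    rollup = {"s_per_beat": 300, "coldopen_s": None, "turns_per_beat": 2}
    stamped = stamp_sidecars_sketch(rollup, dirs + [missing])
    gated = latency_gates_sketch(dirs)
    missing_created = missing.exists()
```

In this run the over-budget `s_per_beat` fails its gate, while the null `coldopen_s` stays a skip rather than being read as zero, and the nonexistent directory is left untouched.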