qa(phase-3): RRI cadence + latency gates, observability (trends/reconcile), split-RRI orchestration, visual regression #954
…ile, split-RRI orchestration, visual regression

Phase 3 (final) of the QA Lab upgrade: agent-facing signals + cross-version regression prevention. All ADDITIVE: absent inputs mean byte-identical existing output. Closes audit gaps RRI-CADENCE-1, RRI-DETERMINISM-GAP-3, RRI-SPLIT-VM-LANE-2, REGRESSION-5/7, OBS-2/3/4/5/6, NO-VISUAL-REGRESSION-DIFFING-6.

- release_readiness.py (load-bearing, additive): latency hard-gates (s_per_beat, coldopen_s) from existing disk artifacts, gating ONLY when latency evidence is present and over the budget in new latency_baseline.json (defaults 120/240, headroom over healthy ~78/~157); absent means skip, never a false fail. New --deterministic-only mode evaluates only non-LLM gates and marks LLM/persona gates SKIPPED (not failed) so CI/the agent get an early deterministic-release signal. 41 pre-existing tests unchanged and still pass; +4 additive. test_deterministic_rri_gate.py (9).
- .github/workflows/release-readiness.yml: advisory workflow_dispatch+schedule running --deterministic-only over a temp fixture, uploads RRI-deterministic.json. continue-on-error, never blocks.
- scores_db.py (additive): trends_json() + --trends-json (per-field time-series the agent queries) and reconcile() + --reconcile (READ-ONLY INDEX.jsonl vs db consistency; tolerant parser, never rewrites INDEX.jsonl, coordinates with open sibling PR #573). 26 existing tests green + 15 new.
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; DRY-RUN/--plan by default (prints exact commands), remote SSH only behind explicit --execute. Reuses support_vm_preflight + release_readiness. test (no live SSH).
- visual_regression_check.py + qa/screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips gracefully) screenshot diffing. test.
- CI: 3 new test files added to qa-release-gate-tests.

Verification (local single-process, integrated): 123 passed, 2 skipped (PIL-absent audit skips). Existing release_readiness (41) + scores_db (26) suites stay green. Zero committed-artifact writes (scores.db lazy-migration excluded; tests use tmp_path). The two workflow FIX verdicts were shared-worktree cross-attribution false-positives (reviewers saw sibling builders' files); disproven by disjoint-ownership plus integrated-green verification.
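The "gate only when evidence is present and over budget" rule can be illustrated with a minimal sketch. This is not the code in release_readiness.py: the function names, the dict shapes, and the "SKIP"/"PASS"/"FAIL" strings are assumptions for illustration, and only the 120/240 defaults come from the description above.

```python
from typing import Optional

# Budgets quoted in the description (seconds); the dict layout is an assumption.
DEFAULT_BUDGET = {"s_per_beat": 120.0, "coldopen_s": 240.0}


def evaluate_latency_gate(observed: Optional[float], budget: float) -> str:
    """Return "SKIP" without evidence, otherwise "PASS" or "FAIL"."""
    if observed is None:
        # Absent evidence is never turned into a failure (or a fabricated 0.0).
        return "SKIP"
    # Only a value strictly over the budget fails the gate.
    return "FAIL" if observed > budget else "PASS"


def latency_gates(latency: Optional[dict], budget: Optional[dict] = None) -> dict:
    """Evaluate every budgeted field against whatever evidence exists."""
    latency = latency or {}
    budget = budget or DEFAULT_BUDGET
    return {
        field: evaluate_latency_gate(latency.get(field), limit)
        for field, limit in budget.items()
    }
```

With no latency evidence every gate is a skip, so a caller that filters out skipped gates before counting would leave its existing output unchanged.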
…ncy gate activates (#957)

PR #954 added additive latency hard-gates (latency_s_per_beat / latency_coldopen) to qa/release_readiness.py, but they were DORMANT on a real sweep: read_latency() reads each PERSONA run dir's latency.json sidecar, while the runners derive the latency rollup into the TRANSCRIPT dir ($T/$RUN.latency.json), so the per-run sidecar never existed and the gate fell through to a (safe-by-design) evidence-gap SKIP instead of actually gating.

Wire it (all additive):

- qa/latency_rollup.py: new reusable stamp_sidecars(rollup, run_dirs) + a --stamp-into CLI flag. It writes {s_per_beat, coldopen_s, turns_per_beat} into each run dir as <run>/latency.json, the exact shape read_latency() consumes. NULL columns are preserved (read_latency treats null as ABSENT -> skip, never a fabricated 0.0); a non-existent run dir is skipped, never created.
- qa/release_gate.sh: after the duo run (which produced the per-beat ledger), re-derive the SAME rollup and stamp it into every persona run dir BEFORE the RRI rollup reads them. Non-fatal: a stamp hiccup / a duo with no derivable beat leaves the gate a documented skip.
- qa/release_readiness.py: corrected read_latency()'s stale docstring (it claimed run_duo.sh writes the per-run sidecar; in fact run_duo writes to the transcript dir and release_gate.sh stamps the per-run sidecar). The inaccuracy is part of why the gate looked wired but wasn't.
- qa/evidence_audit.py: refreshed the stale "canonical 11 RRI gates" comment. The evaluated set is 11 by default and 13 once latency gates carry evidence (gates_total is read dynamically; RRI_GATE_NAMES stays the always-required baseline).

Verify:

- A real RRI rollup over runs WITH over-budget latency evidence (s_per_beat>120 or coldopen_s>240) now FAILS the latency gates (gates_total 13, release_ready=False); under budget PASSES; absent stays a byte-identical skip (gates_total 11).
- New tests: stamp_sidecars unit coverage + a CLI --stamp-into test (test_latency_rollup.py); an end-to-end SEAM test driving the production rollup→stamp→gate path (test_release_readiness.py); a static contract locking the release_gate.sh wiring so the gate can't silently go dormant again (test_release_gate_static.py).
- Single-process: qa/test_release_readiness.py + qa/test_deterministic_rri_gate.py + qa/test_latency_rollup.py + affected static/audit/scope/orchestrate tests, all green.

Co-authored-by: Eva <arncalso@gmail.com>
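The stamping behaviour described above (write the three fields into each run dir, keep nulls as nulls, never create a missing dir) could look roughly like the sketch below. It is an illustration under assumptions, not the body of stamp_sidecars in qa/latency_rollup.py: the function name stamp_sidecars_sketch, the flat rollup dict, and the returned list of paths are all hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, List

# Field names taken from the commit message.
FIELDS = ("s_per_beat", "coldopen_s", "turns_per_beat")


def stamp_sidecars_sketch(rollup: dict, run_dirs: Iterable[str]) -> List[str]:
    """Write <run>/latency.json into each existing run dir; return paths written."""
    # A missing or None value stays None, which json.dumps emits as null,
    # so a reader can still treat it as absent rather than as 0.0.
    payload = {field: rollup.get(field) for field in FIELDS}
    stamped = []
    for run_dir in run_dirs:
        path = Path(run_dir)
        if not path.is_dir():
            # A non-existent run dir is skipped, never created.
            continue
        sidecar = path / "latency.json"
        sidecar.write_text(json.dumps(payload, indent=2))
        stamped.append(str(sidecar))
    return stamped
```

Returning the list of written paths is a convenience for callers that want to log what was stamped; the source does not say what the real function returns.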
Phase 3 (final) — cross-version regression prevention + agent-facing signals
Closes the program. All additive (absent inputs → byte-identical existing output); pure
readers/reporters + opt-in CI signals; nothing auto-acts or SSHes without an explicit flag.
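The "nothing SSHes without an explicit flag" rule amounts to a plan-by-default CLI. The sketch below shows one way to structure that with argparse; the command list, the --host option, and the injectable runner are hypothetical placeholders, not the contents of orchestrate_split_rri.py.

```python
import argparse
import shlex
import subprocess
from typing import Callable, List, Optional, Sequence


def planned_commands(host: str) -> List[List[str]]:
    # Placeholder commands for illustration only; the real steps are not shown in the PR text.
    return [
        ["ssh", host, "echo", "run part-B on the VM"],
        ["echo", "roll up part-A and part-B locally"],
    ]


def main(argv: Optional[Sequence[str]] = None,
         runner: Optional[Callable[[List[str]], object]] = None) -> List[str]:
    parser = argparse.ArgumentParser(description="plan-by-default sketch")
    parser.add_argument("--host", default="qa-vm")
    parser.add_argument("--execute", action="store_true",
                        help="actually run the commands instead of only printing them")
    args = parser.parse_args(argv)
    # The runner is injectable so tests never need a live SSH connection.
    runner = runner or (lambda cmd: subprocess.run(cmd, check=True))
    rendered = []
    for cmd in planned_commands(args.host):
        line = shlex.join(cmd)
        rendered.append(line)
        print(line)  # the plan is always printed
        if args.execute:
            runner(cmd)  # side effects happen only behind --execute
    return rendered
```

Making the dry run the default means a mistyped or forgotten flag can only ever print commands, never run them.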
- release_readiness.py: latency hard-gates (s_per_beat/coldopen_s) from existing artifacts, gating only when present & over budget (latency_baseline.json); --deterministic-only mode marks LLM/persona gates SKIPPED so CI/agent get an early deterministic-release signal. New advisory release-readiness.yml (dispatch+schedule). test_release_readiness.py tests unchanged & green; +4 additive; test_deterministic_rri_gate.py (9)
- scores_db.py: trends_json() / --trends-json (per-field time-series the agent queries) + reconcile() / --reconcile (READ-ONLY INDEX.jsonl ↔ db consistency). INDEX.jsonl reader, never rewrites it; coordinates with open sibling #573. 26 existing tests green + 15 new
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; --plan by default (prints exact commands); remote SSH only behind explicit --execute. Reuses support_vm_preflight + release_readiness
- visual_regression_check.py + screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips if absent)

Verification (local single-process, integrated)
123 passed, 2 skipped (PIL-absent audit skips). The existing
release_readiness.py (41) and scores_db.py (26) suites stay green. Zero committed-artifact writes (the scores.db lazy-migration is excluded; tests use
tmp_path). 3 new test files added to CI qa-release-gate-tests.
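As an illustration of the strict (stdlib sha256) mode and the graceful PIL skip mentioned above, here is a minimal sketch. It assumes baselines and candidates are same-named PNG files in two directories; the function names and the "match"/"changed"/"missing" labels are hypothetical and not taken from visual_regression_check.py.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def strict_diff(baseline_dir: str, candidate_dir: str) -> Dict[str, str]:
    """Byte-exact comparison of every baseline PNG against its candidate."""
    base, cand = Path(baseline_dir), Path(candidate_dir)
    report = {}
    for baseline in sorted(base.glob("*.png")):
        candidate = cand / baseline.name
        if not candidate.is_file():
            report[baseline.name] = "missing"
        elif sha256_of(baseline) == sha256_of(candidate):
            report[baseline.name] = "match"
        else:
            report[baseline.name] = "changed"
    return report


def audit_mode_status() -> str:
    """Audit mode needs Pillow; report a skip instead of raising when it is absent."""
    try:
        import PIL  # noqa: F401  (optional dependency)
    except ImportError:
        return "skipped"
    return "available"
```

A hash comparison flags any byte-level difference, including re-encoding that leaves the pixels identical, which is why a more tolerant pixel-level audit mode is a natural complement.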