qa(phase-3): RRI cadence + latency gates, observability (trends/reconcile), split-RRI orchestration, visual regression #954

Merged: 100yenadmin merged 1 commit into main from qa-lab/p3-cadence-obs on Jun 16, 2026

Conversation

@100yenadmin

Member

Phase 3 (final) — cross-version regression prevention + agent-facing signals

Closes the program. All additive (absent inputs → byte-identical existing output); pure
readers/reporters + opt-in CI signals; nothing auto-acts or SSHes without an explicit flag.

| Area | What | Notes |
|---|---|---|
| RRI cadence + latency | release_readiness.py: latency hard-gates (s_per_beat/coldopen_s) from existing artifacts, gating only when present & over budget (latency_baseline.json); --deterministic-only mode marks LLM/persona gates SKIPPED so CI/agent get an early deterministic-release signal. New advisory release-readiness.yml (dispatch+schedule). | load-bearing: 41 pre-existing test_release_readiness.py tests unchanged & green; +4 additive; test_deterministic_rri_gate.py (9) |
| Observability | scores_db.py: trends_json()/--trends-json (per-field time-series the agent queries) + reconcile()/--reconcile (READ-ONLY INDEX.jsonl↔db consistency) | reconcile is a tolerant INDEX.jsonl reader that never rewrites it; coordinates with open sibling #573. 26 existing tests green + 15 new |
| Split-RRI orchestration | orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup | DRY-RUN/--plan by default (prints exact commands); remote SSH only behind explicit --execute. Reuses support_vm_preflight + release_readiness |
| Visual regression | visual_regression_check.py + screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips if absent) | dependency-light |
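
The "gating only when present & over budget" rule in the first table row can be illustrated with a minimal sketch. It is not the code in release_readiness.py: the function names and the dict shape are hypothetical, the budgets are the 120/240 defaults quoted in the commit message, and treating a value exactly at the budget as a pass is an assumption.

```python
from typing import Optional

# Default budgets quoted in the commit message; the real file layout of
# latency_baseline.json is not shown in this PR, so this dict is an assumption.
DEFAULT_BUDGETS = {"s_per_beat": 120.0, "coldopen_s": 240.0}


def evaluate_latency_gate(observed: Optional[float], budget: float) -> str:
    """Return PASS, FAIL, or SKIP for one latency metric.

    Missing evidence (None) is a SKIP, never a fabricated failure.
    """
    if observed is None:
        return "SKIP"
    # Assumption: "over budget" means strictly greater than the budget.
    return "FAIL" if observed > budget else "PASS"


def evaluate_latency(evidence: dict, budgets: dict = DEFAULT_BUDGETS) -> dict:
    """Evaluate every budgeted metric against whatever evidence exists."""
    return {
        name: evaluate_latency_gate(evidence.get(name), budget)
        for name, budget in budgets.items()
    }


# Only one metric has evidence, so the other is skipped rather than failed.
print(evaluate_latency({"s_per_beat": 78.0}))
# {'s_per_beat': 'PASS', 'coldopen_s': 'SKIP'}
```

The three-state result is what keeps the change additive: a run with no latency evidence produces no new failing gate.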

Verification (local single-process, integrated)

123 passed, 2 skipped (PIL-absent audit skips). The existing release_readiness.py (41) and
scores_db.py (26) suites stay green. Zero committed-artifact writes (the scores.db lazy-migration
is excluded; tests use tmp_path). 3 new test files added to CI qa-release-gate-tests.
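
The two skips above come from the audit mode of the visual regression check, which needs PIL. A rough sketch of how a strict (stdlib sha256) mode and an optional-PIL audit mode could sit side by side is shown here; the function names and the SAME/DIFF/SKIP return values are hypothetical and not taken from visual_regression_check.py.

```python
import hashlib
import importlib.util
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def strict_check(candidate: Path, baseline: Path) -> bool:
    """Strict mode: only byte-identical files match (stdlib only)."""
    return sha256_of(candidate) == sha256_of(baseline)


def audit_check(candidate: Path, baseline: Path) -> str:
    """Audit mode: pixel comparison when Pillow is importable, else SKIP."""
    if importlib.util.find_spec("PIL") is None:
        return "SKIP"  # dependency absent: skip instead of failing
    from PIL import Image, ImageChops

    with Image.open(candidate) as a, Image.open(baseline) as b:
        if a.size != b.size:
            return "DIFF"
        diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
        # getbbox() is None when every pixel of the difference image is zero.
        return "SAME" if diff.getbbox() is None else "DIFF"
```

Strict mode catches any byte-level change, including re-encoding, while audit mode tolerates re-encoded files whose pixels are unchanged.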

The two in-workflow FIX verdicts were shared-worktree cross-attribution false-positives (reviewers
ran a broad git diff and saw sibling builders' files) — disproven by disjoint-ownership + the integrated
green suite. Builds on #949/#950/#952/#953; completes the 3-phase QA Lab upgrade.

…ile, split-RRI orchestration, visual regression

Phase 3 (final) of the QA Lab upgrade — agent-facing signals + cross-version regression prevention.
All ADDITIVE: absent inputs mean byte-identical existing output. Closes audit gaps RRI-CADENCE-1,
RRI-DETERMINISM-GAP-3, RRI-SPLIT-VM-LANE-2, REGRESSION-5/7, OBS-2/3/4/5/6, NO-VISUAL-REGRESSION-DIFFING-6.

- release_readiness.py (load-bearing, additive): latency hard-gates (s_per_beat, coldopen_s) from existing
  disk artifacts, gating ONLY when latency evidence is present and over the budget in new latency_baseline.json
  (defaults 120/240, headroom over healthy ~78/~157); absent means skip, never a false fail. New
  --deterministic-only mode evaluates only non-LLM gates and marks LLM/persona gates SKIPPED (not failed) so
  CI/the agent get an early deterministic-release signal. 41 pre-existing tests unchanged and still pass; +4
  additive. test_deterministic_rri_gate.py (9).
- .github/workflows/release-readiness.yml: advisory workflow_dispatch+schedule running --deterministic-only
  over a temp fixture, uploads RRI-deterministic.json. continue-on-error, never blocks.
- scores_db.py (additive): trends_json() + --trends-json (per-field time-series the agent queries) and
  reconcile() + --reconcile (READ-ONLY INDEX.jsonl vs db consistency; tolerant parser, never rewrites
  INDEX.jsonl, coordinates with open sibling PR #573). 26 existing tests green + 15 new.
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; DRY-RUN/--plan by default
  (prints exact commands), remote SSH only behind explicit --execute. Reuses support_vm_preflight +
  release_readiness. Covered by test_orchestrate_split_rri.py (no live SSH).
- visual_regression_check.py + qa/screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips
  gracefully) screenshot diffing. Covered by test_visual_regression_check.py.
- CI: 3 new test files added to qa-release-gate-tests.
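
The read-only, tolerant reconcile described for scores_db.py in the list above might look roughly like this sketch. The `run_id` key, the report field names, and passing the db ids in as a plain set are assumptions for illustration; the actual INDEX.jsonl schema and db query are not shown in this PR.

```python
import json
from pathlib import Path


def read_index_ids(index_path: Path, key: str = "run_id") -> tuple:
    """Collect ids from a JSONL file, counting (not raising on) bad lines."""
    ids, bad = set(), 0
    if not index_path.exists():
        return ids, bad
    with index_path.open("r", encoding="utf-8") as fh:  # opened read-only
        for line in fh:
            line = line.strip()
            if not line:
                continue  # blank lines are tolerated silently
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad += 1
                continue
            if isinstance(record, dict) and key in record:
                ids.add(record[key])
            else:
                bad += 1
    return ids, bad


def reconcile(index_path: Path, db_ids: set) -> dict:
    """Report differences between the index and the db without writing."""
    index_ids, bad = read_index_ids(index_path)
    return {
        "only_in_index": sorted(index_ids - db_ids),
        "only_in_db": sorted(db_ids - index_ids),
        "malformed_lines": bad,
    }
```

Because the file is only ever opened for reading, a reconcile run cannot conflict with another change that rewrites INDEX.jsonl.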

Verification (local single-process, integrated): 123 passed, 2 skipped (PIL-absent audit skips). Existing
release_readiness (41) + scores_db (26) suites stay green. Zero committed-artifact writes (scores.db
lazy-migration excluded; tests use tmp_path). The two workflow FIX verdicts were shared-worktree
cross-attribution false-positives (reviewers saw sibling builders' files); disproven by disjoint-ownership
plus integrated-green verification.
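
The plan-by-default behaviour described for orchestrate_split_rri.py in the commit message above can be sketched as follows. Only the `--execute` flag comes from the source; the `--vm-host` option, the `build_plan` helper, the injectable `runner`, and the placeholder commands are hypothetical stand-ins, since the real commands are not shown in this PR.

```python
import argparse
import shlex
import subprocess


def build_plan(vm_host: str) -> list:
    """Placeholder commands; the real part-A/part-B commands are not shown."""
    return [
        ["ssh", vm_host, "echo", "run part-B on the VM"],
        ["echo", "roll up part-A and part-B locally"],
    ]


def main(argv=None, runner=subprocess.run) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--vm-host", default="qa-vm.example")  # hypothetical
    parser.add_argument("--execute", action="store_true",
                        help="actually run the commands instead of printing them")
    args = parser.parse_args(argv)

    for cmd in build_plan(args.vm_host):
        if args.execute:
            runner(cmd, check=True)  # remote work happens only on this path
        else:
            print("PLAN:", shlex.join(cmd))  # default: print the exact command
    return 0
```

Injecting `runner` lets a test assert that nothing is executed in the default mode, which matches the "no live SSH" testing approach mentioned above.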
@coderabbitai

coderabbitai Bot commented Jun 16, 2026


Warning

Review limit reached

@100yenadmin, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 3 minutes and 40 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9890e11f-a660-44da-95d4-4bee2a8db4e2

📥 Commits

Reviewing files that changed from the base of the PR and between b014dd8 and d3d5f51.

📒 Files selected for processing (13)
  • .github/workflows/ci.yml
  • .github/workflows/release-readiness.yml
  • qa/latency_baseline.json
  • qa/orchestrate_split_rri.py
  • qa/release_readiness.py
  • qa/scores_db.py
  • qa/screenshot_baselines/README.md
  • qa/test_deterministic_rri_gate.py
  • qa/test_orchestrate_split_rri.py
  • qa/test_release_readiness.py
  • qa/test_scores_db.py
  • qa/test_visual_regression_check.py
  • qa/visual_regression_check.py


@100yenadmin 100yenadmin merged commit 2e51dbe into main Jun 16, 2026
20 checks passed
@100yenadmin 100yenadmin deleted the qa-lab/p3-cadence-obs branch June 16, 2026 13:47
100yenadmin added a commit that referenced this pull request Jun 16, 2026
…ncy gate activates (#957)

PR #954 added additive latency hard-gates (latency_s_per_beat / latency_coldopen) to
qa/release_readiness.py, but they were DORMANT on a real sweep: read_latency() reads each
PERSONA run dir's latency.json sidecar, while the runners derive the latency rollup into the
TRANSCRIPT dir ($T/$RUN.latency.json) — so the per-run sidecar never existed and the gate
fell through to a (safe-by-design) evidence-gap SKIP instead of actually gating.

Wire it (all additive):
- qa/latency_rollup.py: new reusable stamp_sidecars(rollup, run_dirs) + a --stamp-into CLI
  flag. It writes {s_per_beat, coldopen_s, turns_per_beat} into each run dir as
  <run>/latency.json — the exact shape read_latency() consumes. NULL columns are preserved
  (read_latency treats null as ABSENT -> skip, never a fabricated 0.0); a non-existent run dir
  is skipped, never created.
- qa/release_gate.sh: after the duo run (which produced the per-beat ledger), re-derive the
  SAME rollup and stamp it into every persona run dir BEFORE the RRI rollup reads them.
  Non-fatal: a stamp hiccup / a duo with no derivable beat leaves the gate a documented skip.
- qa/release_readiness.py: corrected read_latency()'s stale docstring (it claimed run_duo.sh
  writes the per-run sidecar; in fact run_duo writes to the transcript dir and release_gate.sh
  stamps the per-run sidecar) — the inaccuracy is part of why the gate looked wired but wasn't.
- qa/evidence_audit.py: refreshed the stale "canonical 11 RRI gates" comment — the evaluated
  set is 11 by default and 13 once latency gates carry evidence (gates_total is read
  dynamically; RRI_GATE_NAMES stays the always-required baseline).
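
A minimal sketch of the stamping step described above, assuming the rollup is a single flat dict that is copied into every run dir. The `stamp_sidecars(rollup, run_dirs)` signature and the three field names come from the commit message; the return value and the simplified `read_latency` stand-in are assumptions, not the code in qa/latency_rollup.py or qa/release_readiness.py.

```python
import json
from pathlib import Path

SIDECAR_FIELDS = ("s_per_beat", "coldopen_s", "turns_per_beat")


def stamp_sidecars(rollup: dict, run_dirs) -> list:
    """Write <run>/latency.json into each existing run dir.

    Null values are written as JSON null, and missing dirs are skipped.
    Returning the list of written paths is an assumption of this sketch.
    """
    payload = {field: rollup.get(field) for field in SIDECAR_FIELDS}
    written = []
    for run_dir in map(Path, run_dirs):
        if not run_dir.is_dir():
            continue  # skipped, never created
        target = run_dir / "latency.json"
        target.write_text(json.dumps(payload), encoding="utf-8")
        written.append(target)
    return written


def read_latency(run_dir) -> dict:
    """Simplified stand-in reader: null or missing values count as absent."""
    path = Path(run_dir) / "latency.json"
    if not path.is_file():
        return {}
    data = json.loads(path.read_text(encoding="utf-8"))
    return {key: value for key, value in data.items() if value is not None}
```

Preserving null rather than writing 0.0 matters because a fabricated zero would read as a fast run and pass the gate, whereas an absent value falls through to a skip.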

Verify:
- A real RRI rollup over runs WITH over-budget latency evidence (s_per_beat>120 or
  coldopen_s>240) now FAILS the latency gates (gates_total 13, release_ready=False); under
  budget PASSES; absent stays a byte-identical skip (gates_total 11).
- New tests: stamp_sidecars unit coverage + a CLI --stamp-into test (test_latency_rollup.py);
  an end-to-end SEAM test driving the production rollup→stamp→gate path
  (test_release_readiness.py); a static contract locking the release_gate.sh wiring so the gate
  can't silently go dormant again (test_release_gate_static.py).
- Single-process: qa/test_release_readiness.py + qa/test_deterministic_rri_gate.py +
  qa/test_latency_rollup.py + affected static/audit/scope/orchestrate tests — all green.

Co-authored-by: Eva <arncalso@gmail.com>
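
The static contract mentioned in the Verify list (locking the release_gate.sh wiring so the stamp runs before the RRI rollup) could be approximated by a check like the one below. This is a naive sketch, not the contents of test_release_gate_static.py: the helper name is hypothetical, and matching on the strings `--stamp-into` and `release_readiness.py` with crude comment stripping is an assumption about how such a test might work.

```python
def stamp_precedes_rollup(script_text: str) -> bool:
    """True when a --stamp-into call appears before the first rollup call."""
    stamp_line = rollup_line = None
    for number, line in enumerate(script_text.splitlines()):
        code = line.split("#", 1)[0]  # crude comment stripping; ignores quoting
        if stamp_line is None and "--stamp-into" in code:
            stamp_line = number
        if rollup_line is None and "release_readiness.py" in code:
            rollup_line = number
    return (
        stamp_line is not None
        and rollup_line is not None
        and stamp_line < rollup_line
    )
```

A test would read qa/release_gate.sh and assert this returns True, so removing or reordering the stamp step fails CI instead of leaving the latency gates silently skipped.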