qa: structured run names + INDEX.jsonl + backfill + find_run helper #573
Draft
100yenadmin wants to merge 1 commit into
…572) Adds an indexer + auto-append runner hooks + query helper so agents and humans can find past playtest runs, play-state snapshots, and transcripts without grepping ~800 mixed-naming artifact dirs.

- qa/scripts/indexer.py: extract one INDEX row from a dir/file (stdlib, idempotent, fcntl-locked)
- qa/scripts/backfill_index.py: rebuild qa/INDEX.jsonl from scratch (~800 rows, <1s, idempotent)
- qa/scripts/find_run.py: agent-facing query helper (--since/--sha/--persona/--gave-up/--failed/--scored/...)
- qa/INDEX_SCHEMA.md: schema, canonical naming, query recipes
- qa/README.md: pointer block
- qa/QA_TOOLS.md + AGENTS.md: agent-facing routing to find_run.py
- qa/ui_playtest.sh + qa/ui_playtest_app.sh: canonical name fallback + auto-append on success (|| true wrapper, never blocks)
- .gitignore: /qa/INDEX.jsonl{,.new,.lock}

Canonical name format (going-forward):
<YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>

Complements qa/scores.db (curated headline verdicts, ~69 rows); cross-links via scored_in_ledger. INDEX.jsonl is per-developer gitignored, matching the qa/ui_playtest_runs/ ignore convention.
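The "idempotent, fcntl-locked" behaviour described for the indexer could look roughly like the sketch below. It is not the code in qa/scripts/indexer.py: the function name `upsert_row` and the row fields other than `kind` and `id` are hypothetical, and the use of the `.lock` and `.new` sibling files is only inferred from the .gitignore entry. It assumes a POSIX system, since `fcntl` is not available on Windows.

```python
import fcntl
import json
import os
from pathlib import Path


def upsert_row(index_path: Path, row: dict) -> int:
    """Insert or replace the row keyed by (kind, id); return the row count.

    Hypothetical helper: illustrates an idempotent, lock-protected append.
    """
    lock_path = index_path.with_name(index_path.name + ".lock")
    tmp_path = index_path.with_name(index_path.name + ".new")
    with open(lock_path, "w") as lock:
        # Exclusive advisory lock; released when the lock file is closed.
        fcntl.flock(lock, fcntl.LOCK_EX)
        rows = {}
        if index_path.exists():
            for line in index_path.read_text().splitlines():
                if line.strip():
                    old = json.loads(line)
                    rows[(old["kind"], old["id"])] = old
        # Same key overwrites the existing entry, so re-indexing never duplicates.
        rows[(row["kind"], row["id"])] = row
        with open(tmp_path, "w") as out:
            for item in rows.values():
                out.write(json.dumps(item, sort_keys=True) + "\n")
        # Atomic rename so readers never see a half-written index.
        os.replace(tmp_path, index_path)
        return len(rows)
```

Calling `upsert_row` twice with the same `(kind, id)` leaves the row count unchanged and keeps only the latest values, which is the property the backfill and `--append` paths are said to rely on.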
100yenadmin pushed a commit that referenced this pull request Jun 16, 2026

…ile, split-RRI orchestration, visual regression

Phase 3 (final) of the QA Lab upgrade: agent-facing signals + cross-version regression prevention. All ADDITIVE: absent inputs mean byte-identical existing output. Closes audit gaps RRI-CADENCE-1, RRI-DETERMINISM-GAP-3, RRI-SPLIT-VM-LANE-2, REGRESSION-5/7, OBS-2/3/4/5/6, NO-VISUAL-REGRESSION-DIFFING-6.

- release_readiness.py (load-bearing, additive): latency hard-gates (s_per_beat, coldopen_s) from existing disk artifacts, gating ONLY when latency evidence is present and over the budget in new latency_baseline.json (defaults 120/240, headroom over healthy ~78/~157); absent means skip, never a false fail. New --deterministic-only mode evaluates only non-LLM gates and marks LLM/persona gates SKIPPED (not failed) so CI/the agent get an early deterministic-release signal. 41 pre-existing tests unchanged and still pass; +4 additive. test_deterministic_rri_gate.py (9).
- .github/workflows/release-readiness.yml: advisory workflow_dispatch+schedule running --deterministic-only over a temp fixture, uploads RRI-deterministic.json. continue-on-error, never blocks.
- scores_db.py (additive): trends_json() + --trends-json (per-field time-series the agent queries) and reconcile() + --reconcile (READ-ONLY INDEX.jsonl vs db consistency; tolerant parser, never rewrites INDEX.jsonl, coordinates with open sibling PR #573). 26 existing tests green + 15 new.
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; DRY-RUN/--plan by default (prints exact commands), remote SSH only behind explicit --execute. Reuses support_vm_preflight + release_readiness. test (no live SSH).
- visual_regression_check.py + qa/screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips gracefully) screenshot diffing. test.
- CI: 3 new test files added to qa-release-gate-tests.

Verification (local single-process, integrated): 123 passed, 2 skipped (PIL-absent audit skips). Existing release_readiness (41) + scores_db (26) suites stay green. Zero committed-artifact writes (scores.db lazy-migration excluded; tests use tmp_path). The two workflow FIX verdicts were shared-worktree cross-attribution false-positives (reviewers saw sibling builders' files); disproven by disjoint-ownership plus integrated-green verification.
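The "gate only when evidence is present and over budget; absent means skip" rule for the latency hard-gates can be sketched as below. This is an illustration, not the logic in release_readiness.py: `latency_gate` and `DEFAULT_BUDGET` are hypothetical names, and mapping 120 to s_per_beat and 240 to coldopen_s is only my reading of the "defaults 120/240" ordering in the commit message.

```python
from typing import Optional

# Assumed mapping of the "defaults 120/240" pair; not confirmed by the source.
DEFAULT_BUDGET = {"s_per_beat": 120.0, "coldopen_s": 240.0}


def latency_gate(metric: str, observed: Optional[float],
                 budget: Optional[dict] = None) -> str:
    """Return SKIPPED when no evidence exists, else PASS or FAIL against the budget."""
    limit = (budget or DEFAULT_BUDGET)[metric]
    if observed is None:
        # No latency artifact on disk: skip rather than report a false fail.
        return "SKIPPED"
    # Only a value strictly over the budget fails the gate.
    return "FAIL" if observed > limit else "PASS"
```

With these assumed budgets, the healthy figures quoted above (~78 and ~157) pass with headroom, and a run with no latency artifact is reported as skipped rather than failed.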
100yenadmin added a commit that referenced this pull request Jun 16, 2026

…ile, split-RRI orchestration, visual regression (#954)

Same commit message as the commit pushed above, with the added trailer:

Co-authored-by: Eva <arncalso@gmail.com>
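The "strict (stdlib sha256)" screenshot comparison named in that commit message amounts to byte-exact hashing. A minimal sketch follows, assuming PNG baselines in one directory and current captures in another; `strict_diff`, the directory layout, and the report keys are hypothetical and not taken from visual_regression_check.py.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def strict_diff(baseline_dir: Path, current_dir: Path) -> dict:
    """Classify each baseline PNG as same, changed, or missing by sha256."""
    report = {"same": [], "changed": [], "missing": []}
    for base in sorted(baseline_dir.glob("*.png")):
        cur = current_dir / base.name
        if not cur.exists():
            report["missing"].append(base.name)
        elif sha256_of(cur) == sha256_of(base):
            report["same"].append(base.name)
        else:
            # Any byte difference counts, including metadata-only changes.
            report["changed"].append(base.name)
    return report
```

Byte-exact hashing flags any re-encoding as a change, which is presumably why a separate, more tolerant audit mode exists alongside it.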
Closes #572.
Summary
- qa/scripts/{indexer,backfill_index,find_run}.py (stdlib-only), qa/INDEX_SCHEMA.md, qa/README.md
- qa/ui_playtest.sh, qa/ui_playtest_app.sh (canonical name fallback + best-effort indexer auto-append), AGENTS.md, qa/QA_TOOLS.md (route agents at find_run.py), .gitignore (/qa/INDEX.jsonl{,.new,.lock})
- INDEX.jsonl is per-developer gitignored, matching the qa/ui_playtest_runs/ ignore convention.
- Complements qa/scores.db: that's the curated quality verdict (~69 rows); INDEX is the raw artifact catalog (~800 rows). Cross-link via scored_in_ledger.

Canonical name format (going-forward)
<YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>
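As an illustration of the going-forward format <YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>, a name can be split with one regex. This sketch assumes the world, persona, and provider segments contain no hyphens and lets the scenario take the remainder; the PR does not state either rule. `parse_canonical` and the sample name in the usage note are made up for the example.

```python
import re
from typing import Optional

# Assumption: only the trailing scenario segment may contain hyphens.
CANONICAL = re.compile(
    r"^(?P<ts>\d{8}T\d{6}Z)"
    r"-(?P<sha7>[0-9a-f]{7})"
    r"-(?P<world>[^-]+)"
    r"-(?P<persona>[^-]+)"
    r"-(?P<provider>[^-]+)"
    r"-(?P<scenario>.+)$"
)


def parse_canonical(name: str) -> Optional[dict]:
    """Return the six named fields, or None if the name is not canonical."""
    match = CANONICAL.match(name)
    return match.groupdict() if match else None
```

A hypothetical name such as 20260616T120000Z-abc1234-ashfall-newbie-openai-cold-open yields scenario "cold-open", while any name that does not start with a timestamp returns None and would need a different path.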
Legacy names (nb1, gate-<sha>-<persona>, handoff-<TS>-<sha>-<scenario>) parse best-effort by regex; opaque ones (sweep-newbie) fall through to mtime + null fields.

Local verification (against /Users/lume/ClawDnD-val)
- Spot-check 5 random run entries against source run.json: 5/5 cross-validated (commit_sha matches build_sha prefix).
- Idempotency: re-running backfill or --append on an already-indexed dir UPDATES the row, no dupes (key = (kind, id)).
- Runner-hook safety: indexer crash → exit 2; || true wrapper catches → runner continues with its own $SCORE_RC.
- bash -n clean on both modified runners.

Test plan
- bash -n qa/ui_playtest.sh qa/ui_playtest_app.sh: no syntax errors
- backfill_index.py against full local corpus: 494 entries in <1s, 5/5 spot-checks pass
- find_run.py exercise: --since, --sha, --persona, --gave-up, --failed, --scored, --has-rubric, --paths-only, --count, --jsonl
- … indexed_at)
- --append idempotency: same dir → row UPDATED, count unchanged
- … || true → runner unaffected
- sqlite3 connection uses ?mode=ro: safe to read while scores_db.py --add runs concurrently
- INDEX_SCHEMA.md matches what indexer.py writes matches what find_run.py reads
- … ui_playtest.sh invocation, confirm INDEX appended): owner can run in their next playtest cycle

Out of scope (follow-ups)
- … sweep-newbie, etc.): index in-place, preserves mtimes
- … play-state/ (engine is Python, separate concern; backfill catches existing)

Confidence
~95%. The 5% empirical gap: whether the 2 opaque legacy dirs without metadata or a parseable sha (sweep-newbie, sweep2-optimizer) parse cleanly with the mtime-only fallback; backfill against the full local corpus confirms they do.
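The mtime-only fallback for opaque legacy names can be pictured with the sketch below. It handles just one legacy shape (gate-<sha>-<persona>) and is not the regex set in indexer.py; `index_legacy` and the `ts` and `persona` field names are hypothetical, with `commit_sha` borrowed from the spot-check note above.

```python
import os
import re
from datetime import datetime, timezone
from pathlib import Path

# One illustrative legacy pattern; the real indexer presumably knows more.
GATE = re.compile(r"^gate-(?P<sha>[0-9a-f]{7,40})-(?P<persona>.+)$")


def index_legacy(path: Path) -> dict:
    """Best-effort row: regex fields when parseable, else nulls plus mtime."""
    row = {"id": path.name, "commit_sha": None, "persona": None}
    match = GATE.match(path.name)
    if match:
        row["commit_sha"] = match.group("sha")
        row["persona"] = match.group("persona")
    # Opaque names carry no timestamp, so the directory mtime stands in.
    mtime = os.stat(path).st_mtime
    row["ts"] = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
        "%Y%m%dT%H%M%SZ"
    )
    return row
```

Under this sketch a directory such as sweep-newbie gets null commit_sha and persona but still has a sortable timestamp, which is why preserving mtimes matters for the index-in-place approach.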