qa: structured run names + INDEX.jsonl + backfill + find_run helper by 100yenadmin · Pull Request #573 · electricsheephq/WorldOS

100yenadmin · 2026-06-02T14:41:02Z

Closes #572.

Summary

New: qa/scripts/{indexer,backfill_index,find_run}.py (stdlib-only), qa/INDEX_SCHEMA.md, qa/README.md
Modified: qa/ui_playtest.sh, qa/ui_playtest_app.sh (canonical name fallback + best-effort indexer auto-append), AGENTS.md, qa/QA_TOOLS.md (route agents at find_run.py), .gitignore (/qa/INDEX.jsonl{,.new,.lock})
INDEX.jsonl is per-developer gitignored — matches the qa/ui_playtest_runs/ ignore convention.
Complements (does not replace) qa/scores.db: that's the curated quality verdict (~69 rows); INDEX is the raw artifact catalog (~800 rows). Cross-link via scored_in_ledger.

Canonical name format (going-forward)

<YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>

Legacy names (nb1, gate-<sha>-<persona>, handoff-<TS>-<sha>-<scenario>) parse best-effort by regex; opaque ones (sweep-newbie) fall through to mtime + null fields.
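A minimal sketch of how a canonical name could be split into fields, with anything non-canonical returning None so a caller can fall back to legacy regexes or mtime. This is not the code in indexer.py. The function name, the regex, and the assumption that world, persona, and provider contain no hyphens (while scenario may) are all illustrative.

```python
import re
from datetime import datetime, timezone

# Assumption (not from the PR): world, persona and provider contain no
# hyphens; scenario is the last field and may contain hyphens.
CANONICAL_RE = re.compile(
    r"^(?P<ts>\d{8}T\d{6}Z)"
    r"-(?P<sha7>[0-9a-f]{7})"
    r"-(?P<world>[^-]+)"
    r"-(?P<persona>[^-]+)"
    r"-(?P<provider>[^-]+)"
    r"-(?P<scenario>.+)$"
)


def parse_run_name(name):
    """Return a dict of fields for a canonical name, or None otherwise."""
    match = CANONICAL_RE.match(name)
    if match is None:
        return None  # caller falls back to legacy parsing or mtime
    fields = match.groupdict()
    stamp = datetime.strptime(fields["ts"], "%Y%m%dT%H%M%SZ")
    fields["ts"] = stamp.replace(tzinfo=timezone.utc).isoformat()
    return fields
```

With a made-up name such as 20260602T144102Z-abc1234-faerun-newbie-openai-cold-open, the sketch yields sha7 abc1234 and scenario cold-open; an opaque name like sweep-newbie returns None.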

Local verification (against `/Users/lume/ClawDnD-val`)

$ python3 qa/scripts/backfill_index.py --root /Users/lume/ClawDnD-val
Backfill complete: 494 entries → qa/INDEX.jsonl
  run               87
  play-state       131
  transcript      276
  elapsed:        0.71s
  opaque (no metadata, no sha in name): 2
    run/sweep-newbie
    run/sweep2-optimizer

$ python3 qa/scripts/find_run.py --since 2026-06-01 --kind run --count
41

$ python3 qa/scripts/find_run.py --scored --count
16   # cross-linked to qa/scores.db

$ python3 qa/scripts/find_run.py --gave-up --kind run --count
54

Spot-check 5 random run entries against source run.json: 5/5 cross-validated (commit_sha matches build_sha prefix).
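The spot-check can be expressed as a small predicate. This sketch assumes "matches build_sha prefix" means the index row's commit_sha is a leading substring of run.json's build_sha; the function name and the treatment of missing values are hypothetical.

```python
def sha_cross_validates(index_row, run_json):
    """True when the row's commit_sha is a prefix of run.json's build_sha."""
    commit_sha = index_row.get("commit_sha")
    build_sha = run_json.get("build_sha")
    if not commit_sha or not build_sha:
        return False  # assumption: a missing sha on either side is a failure
    return build_sha.startswith(commit_sha)
```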

Idempotency: re-running backfill or --append on an already-indexed dir UPDATES the row, no dupes (key = (kind, id)).
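A sketch of the update-not-duplicate behaviour described above, keyed on (kind, id). It is an illustration under stated assumptions rather than the indexer's actual code: the function name is hypothetical, and every row is assumed to carry kind and id fields.

```python
import json


def upsert_rows(existing_lines, new_rows):
    """Merge new_rows into existing JSONL lines, keyed on (kind, id)."""
    table = {}
    for line in existing_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        table[(row["kind"], row["id"])] = row
    for row in new_rows:
        # An existing key is overwritten in place; a new key is appended.
        table[(row["kind"], row["id"])] = row
    return [json.dumps(row, sort_keys=True) for row in table.values()]
```

Because a dict keeps one entry per key, re-indexing the same directory any number of times leaves exactly one row for it.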

Runner-hook safety: if the indexer crashes it exits with code 2; the || true wrapper swallows that exit code, so the runner continues and returns its own $SCORE_RC. bash -n is clean on both modified runners.
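On the indexer side, the exit-2-on-crash contract might look like the sketch below. Only the exit code 2 comes from this PR; index_one is a hypothetical stand-in for the real extraction step, and main is an illustrative entry point.

```python
import sys


def index_one(path):
    """Hypothetical stand-in for the real row-extraction step."""
    if not path:
        raise ValueError("no path given")
    return {"kind": "run", "id": path}


def main(argv):
    """Return 0 on success and 2 on any crash, never raising to the caller."""
    try:
        index_one(argv[0] if argv else "")
    except Exception as exc:  # deliberately broad: the hook must never block
        print(f"indexer: {exc}", file=sys.stderr)
        return 2
    return 0
```

A script would end with sys.exit(main(sys.argv[1:])); the runner's || true then discards the 2, so indexing stays best-effort.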

Test plan

Out of scope (follow-ups)

Pruning / retention policy
Renaming existing opaque dirs (sweep-newbie, etc.) — index in-place, preserves mtimes
Engine-side auto-append for play-state/ (engine is Python, separate concern; backfill catches existing)
SQLite migration of INDEX (JSONL fits ~800 rows + append-only failsafe + grep/jq-friendly)
Committed shared catalog (per-dev local indexing matches the artifact-dir gitignore convention)

Confidence

~95%. The remaining 5% was an empirical question: whether the 2 opaque legacy dirs without metadata or a parseable sha (sweep-newbie, sweep2-optimizer) parse cleanly with the mtime-only fallback. Backfill against the full local corpus confirms they do.

…572)

Adds an indexer + auto-append runner hooks + query helper so agents and humans can find past playtest runs, play-state snapshots, and transcripts without grepping ~800 mixed-naming artifact dirs.

- qa/scripts/indexer.py: extract one INDEX row from a dir/file (stdlib, idempotent, fcntl-locked)
- qa/scripts/backfill_index.py: rebuild qa/INDEX.jsonl from scratch (~800 rows, <1s, idempotent)
- qa/scripts/find_run.py: agent-facing query helper (--since/--sha/--persona/--gave-up/--failed/--scored/...)
- qa/INDEX_SCHEMA.md: schema, canonical naming, query recipes
- qa/README.md: pointer block
- qa/QA_TOOLS.md + AGENTS.md: agent-facing routing to find_run.py
- qa/ui_playtest.sh + qa/ui_playtest_app.sh: canonical name fallback + auto-append on success (|| true wrapper, never blocks)
- .gitignore: /qa/INDEX.jsonl{,.new,.lock}

Canonical name format (going-forward): <YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>

Complements qa/scores.db (curated headline verdicts, ~69 rows); cross-links via scored_in_ledger. INDEX.jsonl is per-developer gitignored, matching the qa/ui_playtest_runs/ ignore convention.
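The "fcntl-locked" note and the ignored INDEX.jsonl.lock file suggest a sidecar lock around writes. A minimal sketch of that pattern follows, assuming a POSIX system and a lock file named by appending .lock to the index path. The function name is hypothetical, and the sketch covers only the locking, not the update-in-place logic.

```python
import fcntl
import json


def locked_append(index_path, row):
    """Append one JSON row to index_path while holding an exclusive lock."""
    lock_path = index_path + ".lock"  # assumption: sidecar lock file
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with open(index_path, "a", encoding="utf-8") as index_file:
                index_file.write(json.dumps(row, sort_keys=True) + "\n")
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Locking a separate file rather than the index itself keeps the lock valid even if the index is later replaced by an atomic rename.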

coderabbitai · 2026-06-02T14:41:11Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 54b8fab0-3c6a-4435-9db7-bccdd3b34f83


…ile, split-RRI orchestration, visual regression (#954)

Phase 3 (final) of the QA Lab upgrade: agent-facing signals + cross-version regression prevention. All ADDITIVE: absent inputs mean byte-identical existing output. Closes audit gaps RRI-CADENCE-1, RRI-DETERMINISM-GAP-3, RRI-SPLIT-VM-LANE-2, REGRESSION-5/7, OBS-2/3/4/5/6, NO-VISUAL-REGRESSION-DIFFING-6.

- release_readiness.py (load-bearing, additive): latency hard-gates (s_per_beat, coldopen_s) from existing disk artifacts, gating ONLY when latency evidence is present and over the budget in new latency_baseline.json (defaults 120/240, headroom over healthy ~78/~157); absent means skip, never a false fail. New --deterministic-only mode evaluates only non-LLM gates and marks LLM/persona gates SKIPPED (not failed) so CI/the agent get an early deterministic-release signal. 41 pre-existing tests unchanged and still pass; +4 additive. test_deterministic_rri_gate.py (9).
- .github/workflows/release-readiness.yml: advisory workflow_dispatch+schedule running --deterministic-only over a temp fixture, uploads RRI-deterministic.json. continue-on-error, never blocks.
- scores_db.py (additive): trends_json() + --trends-json (per-field time-series the agent queries) and reconcile() + --reconcile (READ-ONLY INDEX.jsonl vs db consistency; tolerant parser, never rewrites INDEX.jsonl, coordinates with open sibling PR #573). 26 existing tests green + 15 new.
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; DRY-RUN/--plan by default (prints exact commands), remote SSH only behind explicit --execute. Reuses support_vm_preflight + release_readiness. test (no live SSH).
- visual_regression_check.py + qa/screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips gracefully) screenshot diffing. test.
- CI: 3 new test files added to qa-release-gate-tests.

Verification (local single-process, integrated): 123 passed, 2 skipped (PIL-absent audit skips). Existing release_readiness (41) + scores_db (26) suites stay green. Zero committed-artifact writes (scores.db lazy-migration excluded; tests use tmp_path). The two workflow FIX verdicts were shared-worktree cross-attribution false-positives (reviewers saw sibling builders' files); disproven by disjoint-ownership plus integrated-green verification.

Co-authored-by: Eva <arncalso@gmail.com>
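The "absent means skip, never a false fail" rule for the latency gates can be sketched as a three-way outcome. This is not release_readiness.py's code. The function names and the PASS/FAIL strings are hypothetical, and the pairing of the defaults (120 for s_per_beat, 240 for coldopen_s) is inferred from the order in which they are listed above.

```python
# Assumption: 120 pairs with s_per_beat and 240 with coldopen_s.
DEFAULT_BUDGETS = {"s_per_beat": 120, "coldopen_s": 240}


def latency_gate(measured, budget):
    """Return 'SKIPPED' without evidence, else 'PASS' or 'FAIL'."""
    if measured is None:
        return "SKIPPED"  # no evidence on disk: never a false fail
    return "FAIL" if measured > budget else "PASS"


def evaluate(evidence, budgets=DEFAULT_BUDGETS):
    """Apply every budget to whatever latency evidence is present."""
    return {
        name: latency_gate(evidence.get(name), limit)
        for name, limit in budgets.items()
    }
```

For example, evidence containing only s_per_beat at the healthy ~78 passes that gate and leaves coldopen_s skipped.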

100yenadmin mentioned this pull request Jun 16, 2026

qa(phase-3): RRI cadence + latency gates, observability (trends/reconcile), split-RRI orchestration, visual regression #954

Merged

100yenadmin wants to merge 1 commit into main from codex/worldos-qa-index-system
