qa: structured run names + INDEX.jsonl + backfill + find_run helper #573

Draft: 100yenadmin wants to merge 1 commit into main from codex/worldos-qa-index-system

@100yenadmin

Member

Closes #572.

Summary

  • New: qa/scripts/{indexer,backfill_index,find_run}.py (stdlib-only), qa/INDEX_SCHEMA.md, qa/README.md
  • Modified: qa/ui_playtest.sh, qa/ui_playtest_app.sh (canonical name fallback + best-effort indexer auto-append), AGENTS.md, qa/QA_TOOLS.md (route agents to find_run.py), .gitignore (/qa/INDEX.jsonl{,.new,.lock})
  • INDEX.jsonl is gitignored and local to each developer, matching the qa/ui_playtest_runs/ ignore convention.
  • Complements (does not replace) qa/scores.db: that's the curated quality verdict (~69 rows); INDEX is the raw artifact catalog (~800 rows). Cross-link via scored_in_ledger.

Canonical name format (going-forward)

<YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>

Legacy names (nb1, gate-<sha>-<persona>, handoff-<TS>-<sha>-<scenario>) parse best-effort by regex; opaque ones (sweep-newbie) fall through to mtime + null fields.
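
As a rough illustration of the parse-or-fall-through behaviour described above, here is a minimal sketch. The regex is an assumption written for this example (the pattern actually used in indexer.py is not shown in this PR description), and the world, persona, and provider values in the sample name are hypothetical. It assumes only the trailing scenario component may contain hyphens.

```python
import re
from typing import Optional

# Hypothetical pattern for the canonical format; the real regex may differ.
# Assumes world/persona/provider contain no "-"; scenario takes the remainder.
CANONICAL = re.compile(
    r"^(?P<ts>\d{8}T\d{6}Z)-(?P<sha7>[0-9a-f]{7})-(?P<world>[^-]+)"
    r"-(?P<persona>[^-]+)-(?P<provider>[^-]+)-(?P<scenario>.+)$"
)


def parse_run_name(name: str) -> Optional[dict]:
    """Return the named components, or None so the caller can fall back to mtime."""
    match = CANONICAL.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Sample name is made up for illustration.
    print(parse_run_name("20260601T120000Z-abc1234-eldoria-newbie-openai-cold-open"))
    # Opaque legacy names do not match and yield None.
    print(parse_run_name("sweep-newbie"))
```

Returning None rather than raising keeps opaque directories indexable: the caller can still record the path and mtime with null metadata fields.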

Local verification (against /Users/lume/ClawDnD-val)

$ python3 qa/scripts/backfill_index.py --root /Users/lume/ClawDnD-val
Backfill complete: 494 entries → qa/INDEX.jsonl
  run               87
  play-state       131
  transcript       276
  elapsed:        0.71s
  opaque (no metadata, no sha in name): 2
    run/sweep-newbie
    run/sweep2-optimizer

$ python3 qa/scripts/find_run.py --since 2026-06-01 --kind run --count
41

$ python3 qa/scripts/find_run.py --scored --count
16   # cross-linked to qa/scores.db

$ python3 qa/scripts/find_run.py --gave-up --kind run --count
54

Spot-check 5 random run entries against source run.json: 5/5 cross-validated (commit_sha matches build_sha prefix).
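
The spot-check can be pictured as a prefix comparison between the two sha fields. The sketch below is an assumption about how such a check might look, not the code used for the verification above; it accepts either field being the shorter one because the description does not say which side is abbreviated.

```python
def sha_cross_validates(commit_sha: str, build_sha: str) -> bool:
    """True when the shorter sha is a non-empty prefix of the longer one."""
    a = commit_sha.strip().lower()
    b = build_sha.strip().lower()
    if not a or not b:
        # A missing sha can never count as a match.
        return False
    short, long_ = sorted((a, b), key=len)
    return long_.startswith(short)
```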

Idempotency: re-running backfill or --append on an already-indexed dir UPDATES the row, no dupes (key = (kind, id)).
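
A minimal sketch of the update-not-duplicate behaviour, assuming each INDEX row is one JSON object per line carrying `kind` and `id` fields. The function name and the `gave_up` field in the usage below are illustrative, not taken from indexer.py.

```python
import json


def upsert_rows(existing_lines, new_row):
    """Return JSONL lines where new_row replaces any row sharing (kind, id)."""
    key = (new_row["kind"], new_row["id"])
    out = []
    replaced = False
    for line in existing_lines:
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        if (row.get("kind"), row.get("id")) == key:
            if not replaced:
                # Update in place so the row keeps its position in the file.
                out.append(json.dumps(new_row, sort_keys=True))
                replaced = True
            # Any further row with the same key is dropped as a duplicate.
        else:
            out.append(line.rstrip("\n"))
    if not replaced:
        out.append(json.dumps(new_row, sort_keys=True))
    return out
```

Calling it twice with the same `(kind, id)` leaves the row count unchanged, which is the property the idempotency checks in the test plan exercise.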

Runner-hook safety: indexer crash → exit 2; || true wrapper catches → runner continues with its own $SCORE_RC. bash -n clean on both modified runners.

Test plan

  • bash -n qa/ui_playtest.sh qa/ui_playtest_app.sh — no syntax errors
  • backfill_index.py against full local corpus — 494 entries in <1s, 5/5 spot-checks pass
  • find_run.py exercise: --since, --sha, --persona, --gave-up, --failed, --scored, --has-rubric, --paths-only, --count, --jsonl
  • Idempotency: backfill twice → identical content (modulo indexed_at)
  • Indexer --append idempotency: same dir → row UPDATED, count unchanged
  • Runner-hook simulation: indexer crash with bad path → exit 2; wrapped in || true → runner unaffected
  • sqlite3 connection uses ?mode=ro — safe to read while scores_db.py --add runs concurrently
  • fcntl flock for concurrent runners — works on macOS
  • Schema in INDEX_SCHEMA.md matches what indexer.py writes and what find_run.py reads
  • Live runner smoke (one fresh ui_playtest.sh invocation, confirm INDEX appended) — owner can run in their next playtest cycle
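
For the `?mode=ro` item in the list above, a read-only connection can be opened through sqlite3's URI support. This is a sketch under assumptions: the helper name is hypothetical and the actual call in the scripts is not shown here.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only via a file: URI."""
    # as_uri() yields file:///abs/path and percent-encodes special characters.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With `mode=ro`, any write attempted on this connection raises `sqlite3.OperationalError`, so the reader cannot modify the ledger even by accident.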

Out of scope (follow-ups)

  • Pruning / retention policy
  • Renaming existing opaque dirs (sweep-newbie, etc.) — index in-place, preserves mtimes
  • Engine-side auto-append for play-state/ (engine is Python, separate concern; backfill catches existing)
  • SQLite migration of INDEX (JSONL fits ~800 rows + append-only failsafe + grep/jq-friendly)
  • Committed shared catalog (per-dev local indexing matches the artifact-dir gitignore convention)

Confidence

~95%. The remaining 5% was an empirical question: whether the 2 opaque legacy dirs without metadata or a parseable sha (sweep-newbie, sweep2-optimizer) parse cleanly with the mtime-only fallback. Backfill against the full local corpus confirms they do.

…572)

Adds an indexer + auto-append runner hooks + query helper so agents and
humans can find past playtest runs, play-state snapshots, and transcripts
without grepping ~800 mixed-naming artifact dirs.

- qa/scripts/indexer.py — extract one INDEX row from a dir/file (stdlib, idempotent, fcntl-locked)
- qa/scripts/backfill_index.py — rebuild qa/INDEX.jsonl from scratch (~800 rows, <1s, idempotent)
- qa/scripts/find_run.py — agent-facing query helper (--since/--sha/--persona/--gave-up/--failed/--scored/...)
- qa/INDEX_SCHEMA.md — schema, canonical naming, query recipes
- qa/README.md — pointer block
- qa/QA_TOOLS.md + AGENTS.md — agent-facing routing to find_run.py
- qa/ui_playtest.sh + qa/ui_playtest_app.sh — canonical name fallback + auto-append on success (|| true wrapper, never blocks)
- .gitignore — /qa/INDEX.jsonl{,.new,.lock}
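
The `.new` and `.lock` entries in the ignore pattern suggest a lock-then-rename write path. The sketch below shows one way that could work on POSIX systems; it is an assumption about the design rather than a copy of indexer.py, and both function names are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def rewrite_index(index_path, lines):
    """Write lines to <index>.new under the lock, then atomically replace the index."""
    tmp_path = index_path + ".new"
    with index_lock(index_path + ".lock"):
        with open(tmp_path, "w", encoding="utf-8") as fh:
            fh.writelines(line + "\n" for line in lines)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic on the same filesystem, so readers never see a partial file.
        os.replace(tmp_path, index_path)
```

The lock serialises concurrent runners, while the rename means a crash mid-write leaves the previous INDEX.jsonl intact.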

Canonical name format (going-forward):
  <YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>

Complements qa/scores.db (curated headline verdicts, ~69 rows); cross-links via scored_in_ledger.
INDEX.jsonl is gitignored and local to each developer, matching the qa/ui_playtest_runs/ ignore convention.
@coderabbitai

coderabbitai Bot commented Jun 2, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 54b8fab0-3c6a-4435-9db7-bccdd3b34f83

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


100yenadmin pushed a commit that referenced this pull request Jun 16, 2026
…ile, split-RRI orchestration, visual regression

Phase 3 (final) of the QA Lab upgrade — agent-facing signals + cross-version regression prevention.
All ADDITIVE: absent inputs mean byte-identical existing output. Closes audit gaps RRI-CADENCE-1,
RRI-DETERMINISM-GAP-3, RRI-SPLIT-VM-LANE-2, REGRESSION-5/7, OBS-2/3/4/5/6, NO-VISUAL-REGRESSION-DIFFING-6.

- release_readiness.py (load-bearing, additive): latency hard-gates (s_per_beat, coldopen_s) from existing
  disk artifacts, gating ONLY when latency evidence is present and over the budget in new latency_baseline.json
  (defaults 120/240, headroom over healthy ~78/~157); absent means skip, never a false fail. New
  --deterministic-only mode evaluates only non-LLM gates and marks LLM/persona gates SKIPPED (not failed) so
  CI/the agent get an early deterministic-release signal. 41 pre-existing tests unchanged and still pass; +4
  additive. test_deterministic_rri_gate.py (9).
- .github/workflows/release-readiness.yml: advisory workflow_dispatch+schedule running --deterministic-only
  over a temp fixture, uploads RRI-deterministic.json. continue-on-error, never blocks.
- scores_db.py (additive): trends_json() + --trends-json (per-field time-series the agent queries) and
  reconcile() + --reconcile (READ-ONLY INDEX.jsonl vs db consistency; tolerant parser, never rewrites
  INDEX.jsonl, coordinates with open sibling PR #573). 26 existing tests green + 15 new.
- orchestrate_split_rri.py: one-step VM(part-B)+Mac(part-A handoff) RRI rollup; DRY-RUN/--plan by default
  (prints exact commands), remote SSH only behind explicit --execute. Reuses support_vm_preflight +
  release_readiness. test (no live SSH).
- visual_regression_check.py + qa/screenshot_baselines/: strict (stdlib sha256) vs audit (PIL optional, skips
  gracefully) screenshot diffing. test.
- CI: 3 new test files added to qa-release-gate-tests.
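
The "gate only when evidence is present" rule from the release_readiness.py item can be sketched as a three-way verdict. This is an illustration, not the code in release_readiness.py: the function name and verdict strings are hypothetical, and mapping the 120/240 defaults to s_per_beat/coldopen_s in that order is an assumption based on the order in which they are listed.

```python
from typing import Optional

# Assumed mapping of the stated defaults (120/240) to the two metrics.
DEFAULT_BUDGET = {"s_per_beat": 120.0, "coldopen_s": 240.0}


def latency_gate(metric: str, observed: Optional[float], budget=None) -> str:
    """Return SKIPPED when no evidence exists, FAIL only when over budget."""
    limit = (budget or DEFAULT_BUDGET)[metric]
    if observed is None:
        # Absent latency evidence must never produce a false fail.
        return "SKIPPED"
    return "FAIL" if observed > limit else "PASS"
```

Treating a missing measurement as a skip rather than a failure is what keeps the change additive: runs without latency artifacts get the same verdict as before.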

Verification (local single-process, integrated): 123 passed, 2 skipped (PIL-absent audit skips). Existing
release_readiness (41) + scores_db (26) suites stay green. Zero committed-artifact writes (scores.db
lazy-migration excluded; tests use tmp_path). The two workflow FIX verdicts were shared-worktree
cross-attribution false-positives (reviewers saw sibling builders' files); disproven by disjoint-ownership
plus integrated-green verification.
100yenadmin added a commit that referenced this pull request Jun 16, 2026
…ile, split-RRI orchestration, visual regression (#954)

Co-authored-by: Eva <arncalso@gmail.com>