8 changes: 8 additions & 0 deletions .gitignore
@@ -28,6 +28,14 @@
# committed; its output is not.
/qa/findings.jsonl

# Local QA artifact index (qa/scripts/backfill_index.py output) — auto-built catalog of
# every playtest run + play-state + transcript. Per-developer (each local tree has its own
# artifacts). The indexer/backfill/find_run scripts and INDEX_SCHEMA.md ARE committed;
# the INDEX.jsonl itself is not. Query with `qa/scripts/find_run.py`.
/qa/INDEX.jsonl
/qa/INDEX.jsonl.new
/qa/INDEX.jsonl.lock

# Privately-imported, user-owned adventures — NEVER commit (may be copyrighted).
# Only original / CC-licensed content belongs under content/campaigns/.
/content/campaigns/_imported/
12 changes: 12 additions & 0 deletions AGENTS.md
@@ -29,6 +29,18 @@
- The VM cannot prove Mac-only surfaces. `WorldOS.app` build/launch, native #356, and built-app UI play evidence stay on this Mac or macOS CI.
- VM artifacts can feed RRI only when `run.json`, `score.json`, `session_surface.final.json`, network/image evidence, palette-live evidence, and build SHA are explicit. Otherwise the result remains partial/harness-contaminated.

## QA Artifact Index

- Past playtest runs, play-state snapshots, and transcripts are indexed at `qa/INDEX.jsonl` (gitignored, per-dev). Query with `qa/scripts/find_run.py`, not raw `ls`/grep.
- On a fresh clone (or after manual artifact moves): `python3 qa/scripts/backfill_index.py` rebuilds the local index. Idempotent. Takes <1s.
- Common queries:
  - `qa/scripts/find_run.py --since 2026-05-25 --kind run --failed`: recent failed runs
  - `qa/scripts/find_run.py --sha <sha7>`: every artifact for a commit
  - `qa/scripts/find_run.py --persona newbie --gave-up`: runs where the persona gave up
  - `qa/scripts/find_run.py --scored`: runs that also have a curated `qa/scores.db` verdict
- Two layers, two stores: `qa/INDEX.jsonl` is the RAW artifact catalog (~800 rows); `qa/scores.db` (rendered to `qa/scores_ledger.md`) is the CURATED quality verdict layer (~69 hand-validated rows). INDEX entries cross-link to ledger rows via `scored_in_ledger`.
- Schema, naming convention, and rebuild recipes: [`qa/INDEX_SCHEMA.md`](qa/INDEX_SCHEMA.md).

## GitHub And Reviews

- Use branch prefix `codex/` for new branches unless instructed otherwise.
259 changes: 259 additions & 0 deletions qa/INDEX_SCHEMA.md
@@ -0,0 +1,259 @@
# QA Index — Schema & Usage

`qa/INDEX.jsonl` is a line-oriented index of QA artifacts: one JSON object per
line, covering three artifact kinds (`run`, `play-state`, `transcript`). The
runners append to it, and re-indexing an existing artifact updates its row in
place. Built by:

- **`qa/scripts/indexer.py`** — extract metadata from a single dir/file
- **`qa/scripts/backfill_index.py`** — walk all artifact dirs, rebuild from scratch
- **`qa/scripts/find_run.py`** — query helper (the agent-facing surface)

The index is **gitignored** (matches the `qa/ui_playtest_runs/` ignore pattern)
— each developer keeps a local index of their own artifacts. The scripts and
schema are committed; the data is not.

## Canonical run-name format

Going forward, new playtest runs are named:

```
<YYYYMMDDTHHMMSSZ>-<sha7>-<world>-<persona>-<provider>-<scenario>
```

Example:

```
20260602T053015Z-9545383-baldurs-gate-newbie-claude-codex-native
```

Components:

| Field      | Source                          | Notes                                               |
|------------|---------------------------------|-----------------------------------------------------|
| `TS`       | `date -u +%Y%m%dT%H%M%SZ`       | UTC, sortable                                       |
| `sha7`     | `git rev-parse --short HEAD`    | 7-char commit                                       |
| `world`    | runner arg `$2`                 | e.g. `baldurs-gate`                                 |
| `persona`  | runner arg `$3`                 | `newbie` / `veteran` / `adversarial` / ...          |
| `provider` | env `WOS_APP_SELECTED_PROVIDER` | `claude` / `codex`                                  |
| `scenario` | runner-fixed suffix             | `play` (ui_playtest.sh), `app` (ui_playtest_app.sh) |
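
As a rough illustration of how the six components combine, here is a minimal Python sketch. `build_run_name` is a hypothetical helper, not part of the committed scripts; the runners are shell scripts and assemble the name themselves.

```python
from datetime import datetime, timezone


def build_run_name(when, sha7, world, persona, provider, scenario):
    """Join the six components in the documented order (hypothetical helper)."""
    # Same format string as `date -u +%Y%m%dT%H%M%SZ` in the table above.
    ts = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "-".join([ts, sha7, world, persona, provider, scenario])


name = build_run_name(
    datetime(2026, 6, 2, 5, 30, 15, tzinfo=timezone.utc),
    "9545383", "baldurs-gate", "newbie", "claude", "codex-native",
)
print(name)
```

With these inputs the sketch reproduces the example name shown above.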

Legacy names (`nb1`, `gate-96c0401-newbie`, `handoff-<TS>-<sha>-<scenario>`,
etc.) are parsed best-effort by `indexer.parse_canonical_name`: the timestamp
is extracted by regex, the sha by a 7-hex match, and persona/world by suffix
match against known values. Anything that cannot be parsed falls back to the
dir mtime for the timestamp and `null` for the remaining fields; the parser
never crashes.
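
The best-effort idea can be sketched as follows. This is not the code of `indexer.parse_canonical_name`: the function name `parse_run_name`, the regexes, and the known-value sets are assumptions made for illustration, and the real parser may match differently.

```python
import re

# Assumed value sets, copied from the examples in the components table;
# the real indexer may know more personas and providers.
KNOWN_PERSONAS = ("newbie", "veteran", "adversarial")
KNOWN_PROVIDERS = ("claude", "codex")
FIELDS = ("timestamp", "sha7", "world", "persona", "provider", "scenario")

TS_RE = re.compile(r"\d{8}T\d{6}Z")
# Exactly seven hex characters, not embedded in a longer hex run.
SHA_RE = re.compile(r"(?<![0-9a-f])[0-9a-f]{7}(?![0-9a-f])")


def parse_run_name(name):
    """Return a dict of parsed parts; unparseable parts stay None."""
    out = dict.fromkeys(FIELDS)
    tail = name
    ts = TS_RE.search(tail)
    if ts:
        out["timestamp"] = ts.group(0)
        tail = tail[ts.end():]
    sha = SHA_RE.search(tail)
    if sha:
        out["sha7"] = sha.group(0)
        tail = tail[sha.end():]
    tokens = [t for t in tail.split("-") if t]
    for i, tok in enumerate(tokens):
        if tok in KNOWN_PERSONAS:
            out["persona"] = tok
            # Everything before the persona is treated as the world.
            out["world"] = "-".join(tokens[:i]) or None
            after = tokens[i + 1:]
            if after and after[0] in KNOWN_PROVIDERS:
                out["provider"] = after[0]
                after = after[1:]
            out["scenario"] = "-".join(after) or None
            break
    return out


print(parse_run_name("20260602T053015Z-9545383-baldurs-gate-newbie-claude-codex-native"))
print(parse_run_name("nb1"))
```

Anchoring on a known persona is what lets hyphenated worlds such as `baldurs-gate` survive the split; a name like `nb1` simply yields `None` for every field.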

## INDEX.jsonl row schema

Every row has these fields:

```json
{
  "kind": "run | play-state | transcript",
  "id": "<dir name or transcript filename>",
  "path": "<repo-relative path>",
  "timestamp_iso": "2026-06-02T05:30:15Z",
  "commit_sha": "9545383",
  "indexed_at": "2026-06-02T14:00:00Z",
  "source": "runner | backfill"
}
```
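
A small sanity check over these base fields might look like the sketch below. `base_field_problems` is a hypothetical helper and not one of the committed scripts; it checks only for key presence, since legacy rows may legitimately hold `null` values.

```python
BASE_FIELDS = (
    "kind", "id", "path", "timestamp_iso",
    "commit_sha", "indexed_at", "source",
)
KINDS = {"run", "play-state", "transcript"}
SOURCES = {"runner", "backfill"}


def base_field_problems(row):
    """List human-readable problems with a row's base fields (empty if none)."""
    problems = [f"missing field: {name}" for name in BASE_FIELDS if name not in row]
    if "kind" in row and row["kind"] not in KINDS:
        problems.append(f"unknown kind: {row['kind']!r}")
    if "source" in row and row["source"] not in SOURCES:
        problems.append(f"unknown source: {row['source']!r}")
    return problems


row = {
    "kind": "run", "id": "nb1", "path": "qa/ui_playtest_runs/nb1",
    "timestamp_iso": None, "commit_sha": None,
    "indexed_at": "2026-06-02T14:00:00Z", "source": "backfill",
}
print(base_field_problems(row))
```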

### `kind: "run"` adds

```json
{
  "version": "v1.0.3-126-g1057234",
  "world": "baldurs-gate",
  "persona": "newbie",
  "provider": "claude",
  "scenario": "codex-native",
  "surface": "BUILT dist/WorldOS.app (part A) + ...",
  "beats_cap": 6,
  "budget_usd": 4.0,
  "canonical_name": false,

  "part": "A | B | AB",
  "part_a_result": "PASS | FAIL | skipped",
  "part_b_persona_loop": "ran | skipped | backend_not_ready | ...",
  "part_b_score_pass": true,
  "spend_usd": 0.42,
  "minted_run_dir": "play-20260602043338",
  "player_agent": "claude",
  "player_cost_usd": 0.25,
  "player_rc": 0,
  "port": 8765,

  "score": {
    "completed_intro_flow": true,
    "reached_play_screen": true,
    "actions_total": 11,
    "in_story_turns": 2,
    "console_errors": 0,
    "network_failures": 0,
    "image_404s": 0,
    "gave_up": true,
    "persona_satisfaction": 4,
    "satisfaction_source": "derived | self-reported",
    "pass": false
  },
  "bug_counts": {
    "critical": 0, "major": 2, "minor": 0, "trivial": 0,
    "total": 2, "ndjson_lines": 2
  },
  "summary_md": "qa/ui_playtest_runs/<id>/summary.md",

  "linked_transcripts": ["qa/transcripts/<id>.chat.jsonl", ...],
  "linked_play_state": "play-state/<id-or-minted-dir>",
  "linked_rubrics": {
    "tolkien": "qa/transcripts/<...>.tolkien.json",
    "angrydm": "qa/transcripts/<...>.angrydm.json",
    "score": "qa/transcripts/<...>.score.json",
    "state": "qa/transcripts/<...>.state.json"
  },

  "scored_in_ledger": {
    "story_overall": 4.1, "mech_overall": 4.1, "angrydm_overall": 3.2,
    "behavioral": "GREEN", "rri": null, "critical_bugs": 0,
    "image_render_rate": null, "pass": 1,
    "surface": "engine-duo", "dm_model": "sonnet", "scorer_model": "claude"
  }
}
```

`scored_in_ledger` is populated only when `qa/scores.db` has a row with the
matching `run_id`. It carries the curated quality verdict (~69 rows total,
across all surfaces and worktrees); the rest of the `INDEX.jsonl` row describes
the raw artifact.

### `kind: "play-state"` adds

```json
{
  "world": null, "persona": null, "canonical_name": false,
  "campaign_count": 1,
  "chat_lines": 7,
  "player_moves": 2,
  "linked_run": "qa/ui_playtest_runs/<id-if-matching>"
}
```

### `kind: "transcript"` adds

```json
{
  "run_id": "gate-96c0401-duo",
  "role": "chat | dm | player | null",
  "suffix": "chat.jsonl | dm.<nanos>.jsonl | ...",
  "line_count": 5,
  "linked_run": "qa/ui_playtest_runs/<run_id-if-matching>"
}
```

## Common queries (`find_run.py`)

```bash
# All runs since a date
qa/scripts/find_run.py --since 2026-05-25

# Failed runs only (score.pass != true)
qa/scripts/find_run.py --kind run --failed

# Runs where the persona gave up
qa/scripts/find_run.py --gave-up

# Runs by exact commit
qa/scripts/find_run.py --sha 1057234

# Just paths, for scripting
qa/scripts/find_run.py --persona newbie --paths-only --limit 20

# Runs that have a curated scores.db row
qa/scripts/find_run.py --scored

# Runs that DON'T have a curated row (mostly exploratory)
qa/scripts/find_run.py --unscored --kind run

# Runs with story-craft / angry-DM rubrics available
qa/scripts/find_run.py --has-rubric

# Raw JSONL, pipe to jq
qa/scripts/find_run.py --since 2026-06-01 --jsonl | jq '.id, .score.persona_satisfaction'

# Just count matches
qa/scripts/find_run.py --failed --count
```

Full flag reference: `qa/scripts/find_run.py --help`.
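
For readers who want the filter semantics spelled out, here is a minimal Python sketch of two of the predicates. It is not the code of `find_run.py`: the function names are hypothetical, `--failed` follows the `score.pass != true` comment above, and comparing `--since` against `timestamp_iso` is an assumption (check `--help` for the real behavior).

```python
def is_failed_run(row):
    """True for run rows whose score.pass is anything other than true."""
    if row.get("kind") != "run":
        return False
    score = row.get("score") or {}
    return score.get("pass") is not True


def at_or_after(row, since):
    """ISO-8601 UTC strings sort lexicographically, so a string compare works."""
    ts = row.get("timestamp_iso")
    return ts is not None and ts >= since


rows = [
    {"kind": "run", "id": "a", "timestamp_iso": "2026-06-02T05:30:15Z", "score": {"pass": False}},
    {"kind": "run", "id": "b", "timestamp_iso": "2026-05-20T10:00:00Z", "score": {"pass": True}},
    {"kind": "transcript", "id": "c", "timestamp_iso": "2026-06-01T00:00:00Z"},
]
recent_failures = [r["id"] for r in rows if is_failed_run(r) and at_or_after(r, "2026-05-25")]
print(recent_failures)
```

Under this reading, a run row with no `score` object at all also counts as failed.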

## Grep recipes (no Python required)

```bash
# Rows indexed today, UTC (matches the date prefix of indexed_at)
grep -F "\"indexed_at\": \"$(date -u +%Y-%m-%d)" qa/INDEX.jsonl

# Find an id substring
grep -F '"id": "handoff-20260602' qa/INDEX.jsonl | python3 -c 'import sys,json; [print(json.loads(l)["path"]) for l in sys.stdin]'

# All gave-up runs
python3 -c '
import json
for line in open("qa/INDEX.jsonl"):
    e = json.loads(line)
    if (e.get("score") or {}).get("gave_up"):
        print(e["id"], "→", e["path"])
'
```

## Rebuilding the index

If the index gets out of sync (file deleted, runner skipped the append, fresh
clone), rebuild from scratch:

```bash
python3 qa/scripts/backfill_index.py
```

Idempotent; safe to re-run. Backfill writes to `qa/INDEX.jsonl.new` and then
atomically renames it over `qa/INDEX.jsonl`, so a partial run doesn't leave a
corrupt index.
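
The write-then-rename pattern described here can be sketched in a few lines of Python. `write_index_atomically` is a hypothetical name and this is not the code of `backfill_index.py`; it assumes the temporary file and the final index live on the same filesystem, which is what makes `os.replace` atomic on POSIX.

```python
import json
import os


def write_index_atomically(rows, index_path):
    """Write all rows to <index_path>.new, then rename it over the real index."""
    tmp_path = index_path + ".new"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
        # Push the bytes to disk before the rename makes them visible.
        fh.flush()
        os.fsync(fh.fileno())
    # Readers see either the old index or the complete new one, never a mix.
    os.replace(tmp_path, index_path)
```

If the process dies before `os.replace`, the old index is untouched and only a stray `.new` file remains, which is why `.gitignore` lists that name as well.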

## Auto-append on every run

The two playtest runners write to `INDEX.jsonl` automatically, and the release
gate inherits the behavior through them:

- `qa/ui_playtest.sh` — appends after `score.json` write
- `qa/ui_playtest_app.sh` — appends after `run.json` write
- `qa/release_gate.sh` — inherits via its per-persona `ui_playtest_app.sh` calls

The append is wrapped in `|| true` so a broken indexer never fails a real
playtest. The indexer is also idempotent: re-running it on an already-indexed
dir updates the existing row (matched by `(kind, id)`) rather than appending a
duplicate.
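
The `(kind, id)` matching can be illustrated with an in-memory upsert. `upsert_row` is a hypothetical helper written for this example; how `indexer.py` actually rewrites the file is not shown here.

```python
def upsert_row(rows, new_row):
    """Replace the row with the same (kind, id), or append if none exists."""
    key = (new_row["kind"], new_row["id"])
    for i, row in enumerate(rows):
        if (row.get("kind"), row.get("id")) == key:
            rows[i] = new_row
            return rows
    rows.append(new_row)
    return rows


index = [{"kind": "run", "id": "nb1", "spend_usd": 0.10}]
upsert_row(index, {"kind": "run", "id": "nb1", "spend_usd": 0.42})   # updates in place
upsert_row(index, {"kind": "transcript", "id": "nb1", "line_count": 5})  # different kind, appended
print(len(index))
```

Keying on the pair rather than `id` alone matters because a run and its transcript can share an id.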

## Why JSONL, not SQLite

- ~800 rows total, growing slowly. SQLite is overkill.
- Line-oriented writes fail soft: a half-written line at EOF is recoverable
with a text editor, while a damaged SQLite file needs SQLite tooling to
inspect or repair.
- Grep/jq/awk-friendly. No client library needed.
- `qa/scores.db` already exists for the *curated* scores layer — JSONL covers
the *raw* artifact layer. Two different concerns, two different stores.
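
To make the "half-written line at EOF is recoverable" point concrete, here is a minimal tolerant reader. `parse_index_lines` is a hypothetical helper, not part of the committed scripts; it accepts any iterable of strings, so an open file handle works.

```python
import json


def parse_index_lines(lines):
    """Parse JSONL rows, skipping blank lines and lines that are not valid JSON."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            # Typically a write that was interrupted mid-line; drop it.
            continue
    return rows


sample = [
    '{"kind": "run", "id": "a"}\n',
    '{"kind": "transcript", "id": "b"}\n',
    '{"kind": "run", "id": "c", "pa',  # truncated final line
]
print(len(parse_index_lines(sample)))
```

Usage against the real file would be `with open("qa/INDEX.jsonl") as fh: rows = parse_index_lines(fh)`; a subsequent backfill restores any dropped row.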

## Relationship to `qa/scores.db`

| Layer           | Store            | Rows | Source                                | Purpose                                      |
|-----------------|------------------|------|---------------------------------------|----------------------------------------------|
| Raw artifacts   | `qa/INDEX.jsonl` | ~800 | runners + backfill                    | Every playtest dir + play-state + transcript |
| Curated quality | `qa/scores.db`   | ~69  | hand-validated (`scores_db.py --add`) | Headline scored runs across surfaces         |

INDEX rows with a matching `run_id` in `scores.db` get a `scored_in_ledger`
field pointing at the curated verdict. Use `--scored` / `--unscored` on
`find_run.py` to filter either way.
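
A sketch of the same split done by hand, assuming that "scored" simply means the row has a non-empty `scored_in_ledger` object (an assumption about how `--scored` / `--unscored` decide; `split_by_ledger` is a hypothetical name):

```python
def split_by_ledger(rows):
    """Partition run rows into (scored, unscored) by presence of scored_in_ledger."""
    runs = [r for r in rows if r.get("kind") == "run"]
    scored = [r for r in runs if r.get("scored_in_ledger")]
    unscored = [r for r in runs if not r.get("scored_in_ledger")]
    return scored, unscored


rows = [
    {"kind": "run", "id": "gate-96c0401-duo", "scored_in_ledger": {"behavioral": "GREEN", "pass": 1}},
    {"kind": "run", "id": "nb1"},
    {"kind": "transcript", "id": "nb1.chat.jsonl"},
]
scored, unscored = split_by_ledger(rows)
print([r["id"] for r in scored], [r["id"] for r in unscored])
```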

## Sister surfaces (not yet indexed)

- Engine-side play-state writes (engine server is Python, separate concern) —
backfill catches existing dirs; runtime auto-append would need an engine
hook. Filed as a follow-up issue.
- Cross-worktree artifacts (other devs' / CI's runs landing under
`/private/tmp/wos-*/qa/...`) — each worktree has its own index. If a shared
catalog becomes useful later, promote milestone rows into a committed
`qa/MILESTONES.jsonl`.
4 changes: 4 additions & 0 deletions qa/QA_TOOLS.md
@@ -20,6 +20,10 @@ Default local paths:
| `qa/export_app_evidence.py` | Normalize a live app or run dir into a reviewable evidence bundle | `manifest.json`, status/session snapshots, screenshots, traces, logs | You are trying to prove behavior without first running a gate |
| `qa/app_failure_buckets.py` | Classify harness failures into the stable app bucket list | Bucket JSON / shell-readable output | You need product fixes; this only labels failures |
| `qa/app_handoff_hooks.js` | Static/same-port hook probe for core agent-driving controls | Hook-check JSON inside handoff evidence | You need human exploratory testing; this is a bounded locator check |
| `qa/scripts/find_run.py` | Search past QA artifacts (playtest runs, play-states, transcripts) by date / sha / persona / gate state / failed / gave-up. Reads `qa/INDEX.jsonl` — the local artifact catalog built by `backfill_index.py` and auto-appended by the runners. | Stdout: matching entries with paths | You want curated headline quality verdicts — use `qa/scores_ledger.md` (rendered from `qa/scores.db`) instead |
| `qa/scripts/backfill_index.py` | Rebuild `qa/INDEX.jsonl` from scratch (one-time on fresh clone, or after manual artifact moves). Idempotent. | `qa/INDEX.jsonl` (gitignored, per-dev) | The index is already current — the runners auto-append on every playtest |

Don't grep `qa/ui_playtest_runs/` directly to find past runs — use `qa/scripts/find_run.py`. Full schema and recipes in [`qa/INDEX_SCHEMA.md`](INDEX_SCHEMA.md).

Copy-paste fast handoff command:

41 changes: 41 additions & 0 deletions qa/README.md
@@ -0,0 +1,41 @@
# qa/ — QA harness

Routing pointers (see also: [`QA_TOOLS.md`](QA_TOOLS.md), [`SCORECARD.md`](SCORECARD.md), [`SCORING.md`](SCORING.md)).

## Finding past QA runs

Don't `ls` / grep `qa/ui_playtest_runs/`, `play-state/`, or `qa/transcripts/` directly — there are 800+ artifacts with mixed naming. Query the index instead:

```bash
qa/scripts/find_run.py --since 2026-05-25 --gate red --failed
qa/scripts/find_run.py --persona newbie --paths-only --limit 20
qa/scripts/find_run.py --sha 1057234
```

Full schema, query recipes, and the canonical naming format for new runs: [`INDEX_SCHEMA.md`](INDEX_SCHEMA.md).

On a fresh clone (or when the index is stale):

```bash
python3 qa/scripts/backfill_index.py
```

Idempotent. Writes `qa/INDEX.jsonl` (gitignored, per-developer). The two playtest runners (`ui_playtest.sh`, `ui_playtest_app.sh`) auto-append to the index on every successful run.

## Layered stores

- **Raw artifact catalog** — `qa/INDEX.jsonl` (this directory). Every playtest dir, play-state, transcript. Auto-built.
- **Curated quality verdicts** — `qa/scores.db` rendered to [`scores_ledger.md`](scores_ledger.md). Hand-validated headline runs across surfaces. Append via `qa/scores_db.py --add`.

INDEX rows that match a curated ledger row get a `scored_in_ledger` field linking the two.

## Other key docs in this dir

| File | Purpose |
|---|---|
| [`QA_TOOLS.md`](QA_TOOLS.md) | Command map for agents — which tool for which surface |
| [`SCORECARD.md`](SCORECARD.md) | Run-level evidence ledger (rendered from `scores.db`) |
| [`SCORING.md`](SCORING.md) | Lens scoring spec (story-craft, mechanical, angry-DM) |
| [`UI_PLAYTEST.md`](UI_PLAYTEST.md) | UI playtest harness (player + DM) |
| [`GUI_WORKBOOK.md`](GUI_WORKBOOK.md) | GUI-built-app surface notes |
| [`INDEX_SCHEMA.md`](INDEX_SCHEMA.md) | Artifact index schema + naming + queries |
Empty file added qa/scripts/__init__.py