diff --git a/WorldOS-GUI-RUNBOOK.md b/WorldOS-GUI-RUNBOOK.md index 6d0c841b..61a8ff92 100644 --- a/WorldOS-GUI-RUNBOOK.md +++ b/WorldOS-GUI-RUNBOOK.md @@ -338,6 +338,14 @@ release truth still requires `qa/ui_playtest_app.sh` Part A+B and the full RRI s Mac handoff bundle only if all required handoff gates and manifests are same-SHA, clean, private-art-present, and gap-free. If the VM runs a newer SHA, rerun `qa/app_handoff_gate.py` on that newer SHA first. +- **GLM QA lane (cheap batch sweeps, token saver — NOT the release gate).** Any heavy persona/duo sweep on + this VM can run on **GLM 5.2** instead of Claude to save Anthropic tokens: set + `WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`. `qa/glm_profile.sh` (sourced by `run_duo` / + `run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint + raised + timeouts/retries; it is a no-op for Claude and scrubs stray GLM env on switch-back. **The scorer stays + Claude** (`qa/score.sh`, pinned-Sonnet, isolated `~/.claude`). Use GLM for bug-finding/smoke; **Claude + stays the quality bar** for the release RRI. Full strategy + the cap-rate finding: + `docs/MODEL-TIERING-STRATEGY.md`. ## Release (when RRI = 10/10 on a fresh .app build) Bump `.claude-plugin/plugin.json` → 1.0.4, tag `v1.0.4`, GitHub release + CHANGELOG. Then MAINTAIN: diff --git a/WorldOS-RUNBOOK.md b/WorldOS-RUNBOOK.md index e2c85be0..285a002a 100644 --- a/WorldOS-RUNBOOK.md +++ b/WorldOS-RUNBOOK.md @@ -236,6 +236,17 @@ is LEGACY narrative; don't hand-edit it.) or `qa/score_openclaw.sh` (gateway gpt-5.4, grades **~1.5 pts HARSHER** — a strict cross-check, NOT the headline). +**GLM QA lane (cheap batch sweeps — token saver, NOT the release gate).** To run a QA harness on +**GLM 5.2** instead of Claude, set both role models to GLM: +`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2 qa/run_duo.sh …`. 
`qa/glm_profile.sh` (sourced by +every harness — `run_duo` / `run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint ++ credentials (from `~/.openclaw/secrets/glm.env`) and raises the cold-open/per-beat timeouts + retry +ceilings (GLM is ~2–3× slower than Opus). It is a **no-op for Claude** and defensively scrubs any stray GLM +env on switch-back, so a clean Claude run is byte-identical. **The scorer stays Claude** (`qa/score.sh` runs +the pinned-Sonnet scorer under isolated `~/.claude`, whichever model played). Use GLM to save Anthropic +tokens on bug-finding/build-smoke sweeps; **Claude remains the quality bar** for the release scorecard. Full +strategy + the cap-rate finding: `docs/MODEL-TIERING-STRATEGY.md`. + **Targets (the loop's exit bar):** **story ≥ 4.3, mechanical ≥ 4.5, gate GREEN, 0 critical/high** adversarial defects. diff --git a/docs/MODEL-TIERING-STRATEGY.md b/docs/MODEL-TIERING-STRATEGY.md index 5d61e2ae..4915b181 100644 --- a/docs/MODEL-TIERING-STRATEGY.md +++ b/docs/MODEL-TIERING-STRATEGY.md @@ -49,6 +49,80 @@ Opus sweep confirming story ≥4.3 AND no latency give-up. generation-bound, so a prefetch helper touches ≤5% of the wall-clock. Refuted by the latency forensics. - **A headless `--fast` mode** — doesn't exist for `claude -p`; use `--effort`. +## GLM as a cheap batch-QA engine (QA-only; Claude stays the quality bar) +GLM (z.ai's **GLM 5.2**, served over an Anthropic-compatible endpoint) is a **cost lever for QA sweeps**, +NOT a model the player ever touches. The point is to run cheap batch QA — many duos to find bugs / smoke a +build — without spending Anthropic tokens, while **Claude remains the quality bar** for the release gate. + +**The clean model-profile system** (`qa/glm_profile.sh`; PRs #1026 + #1028). A **single model choice flows +coherently** through the harness via `WORLDOS_DM_MODEL` (+ `WORLDOS_ACTOR_MODEL`). 
The profile is keyed off +that one choice: +- **No-op for Claude.** If neither role names a GLM model, `worldos_apply_glm_profile` does NOT apply the + profile and never alters a Claude default — a clean Claude run is byte-for-byte unchanged. +- **Switch-back is always clean (no leak).** On the Claude path the profile *defensively scrubs* any stray + GLM-injected env left in the shell after a QA run (`ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` / + `ANTHROPIC_API_KEY` / `API_TIMEOUT_MS` / a `/tmp/glm-claude-config` `CLAUDE_CONFIG_DIR`). Each unset is + **GLM-conditional** (matched by the z.ai host or a byte-match to `glm.env`), so a legitimate `sk-ant-` key, + `api.anthropic.com`, a corporate proxy, or a user's own config dir is **never touched**. "Switch back to + Opus is always clean" even if a GLM export leaked. +- **Mixed-model guard.** If exactly one role is GLM (a half-GLM/half-Claude config — almost always a + mistake, since `ANTHROPIC_BASE_URL` is process-global and the "Claude" half would silently inherit z.ai), + it warns and **normalizes both roles to GLM** so a run can never silently route the two roles to different + providers. +- **Product is forced clean-Claude.** `scripts/play.sh` + `scripts/play_party.sh` conditionally neutralize + ambient GLM env before any `claude -p`, so the `.app` always runs Claude (Opus) quality and never opts into + GLM. **QA uses GLM via `qa/glm_profile.sh`; the product play path never does.** +- **The scorer is ALWAYS isolated-Claude.** `qa/score.sh` runs the pinned-`sonnet` scorer under + `env -u ANTHROPIC_BASE_URL -u ANTHROPIC_API_KEY …` so it uses clean `~/.claude` (Anthropic OAuth) + **regardless of which model PLAYED the game** — the measurement is on a constant scorer, never on GLM. + +**When to use GLM:** cheap batch QA sweeps to save Anthropic tokens (bug-finding, build smoke, parallel +duos). 
**NOT the final release gate** — Claude stays the quality bar; GLM is a viable cheap batch-sweep +engine, not a replacement for Claude on the release scorecard. Run it via +`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`; the profile auto-wires the z.ai endpoint + raised +timeouts/retry ceilings (GLM is ~2–3× slower than Opus). See the **GLM QA lane** notes in `WorldOS-RUNBOOK.md` +and `WorldOS-GUI-RUNBOOK.md`. + +### The cap-rate finding (honest-measurement repair, NOT a GLM weakness) +The ~30% GLM "cap rate" that early overnight sweeps showed was **NOT a GLM quality weakness** — it was +**self-inflicted, model-agnostic over-aggressive FATAL gates** capping **both** models. The Phase-2 reorient +ran a GLM-vs-Claude 1-v-1 and immediately found that **Claude opus runs RED-capped too** (2/2). A RED +behavioral gate caps all three lenses to ≤2.5, and several FATAL gates were nuking *legitimate* short +emergent sessions on both models. Two root causes, both fixed: +- **`no_rejected_tool_calls`** — a model passing a string/comma-string where a list arg was expected was + rejected by Pydantic → FATAL. Fixed at the validation layer (#1027: a `BeforeValidator` coerces + `str → [s]` / comma-`str → split`, schema unchanged, genuinely-wrong types still rejected). +- **`party_traveled` / `combat_not_left_active`** — a deep single-scene social duo read as "never left the + opening scene," and a 6-beat duo that truncated mid-fight read as "combat abandoned" → FATAL. Fixed by + making severity **beat-scoped / discriminator-aware** (#1030: WARN below the single-scene/late-start + threshold, FATAL only for a genuine stuck-DM or real abandon). **Adversarially verified: no true + integrity gate was weakened** — the corpus fixtures still RED genuine failures (player-seated, + rejected-tools, dice, dm-output, SRD-correctness, xp all untouched). 
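The list-arg coercion named in the `no_rejected_tool_calls` bullet can be sketched without any dependencies. This is a stdlib-only illustration of the rule the docs describe, not the repo's `_coerce_list`; the function name `coerce_list_sketch` is hypothetical, and the real helper is wired in as a Pydantic `BeforeValidator` rather than called directly.

```python
from typing import Any


def coerce_list_sketch(value: Any) -> Any:
    """Hypothetical stand-in for the coercion rule; not the repo's code.

    None and real lists pass through, strings become lists, and anything
    else is returned unchanged so a downstream validator can still reject it.
    """
    if value is None or isinstance(value, list):
        return value
    if isinstance(value, str):
        # "" -> [], "foo" -> ["foo"], "a,b , c" -> ["a", "b", "c"]
        return [part.strip() for part in value.split(",") if part.strip()]
    # Genuinely wrong types (int, dict, ...) are NOT swallowed here.
    return value


print(coerce_list_sketch("a,b , c"))
```

Returning wrong types untouched is the design point: the coercion only widens what is accepted for strings, so a real type bug still fails validation loudly.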
+ +This is the spirit of the north star: **scores are measurement, never the target.** The gate had been +FALSE-CAPPING good story-craft (short single-scene / truncated-combat sessions are legitimate, and pillar 1 +is story-craft first) — fixing it makes the measurement *honest*. This is the OPPOSITE of score-gaming. + +**Honest GLM-vs-Claude quality (measured on the fixed engine, 2026-06-19).** Same-SHA 1-v-1 +(`43a5ecc`: #1027 coercion + #1028 clean-profile + #1030 gate-severity), same world/persona/6-beats, +both scored by the isolated Claude sonnet scorer. 5 runs (3 Claude opus/sonnet + 2 GLM-5.2), **all +behavioral GREEN — 0 RED-caps** (vs the pre-fix ~30%, which was the self-inflicted gate false-cap, NOT +a GLM weakness): + +| model | story | mech | angry | cold-open | +|---|---|---|---|---| +| Claude (opus DM / sonnet actor) | **4.13** | 3.67 | 3.33 | ~205–249s | +| GLM-5.2 (both roles) | 3.9 | **3.8** | **3.4** | **604–872s** | + +**Verdict:** GLM is **comparable quality** — within ~0.2 on every lens; *higher* on mechanical (3.8 vs +3.67) and angry-DM (3.4 vs 3.33), ~0.2 lower on story-craft (3.9 vs 4.13). A real QA runner, not +degraded. **Its true cost is LATENCY** — GLM cold-opens run 604–872s (3–4× Claude's ~205–249s) and +routine beats ~120–166s (vs ~80–96s), so a 3-run GLM batch is ~2.5–3.5h. ⇒ **Use GLM for cheap +overnight / VM batch sweeps where latency is hidden; never interactive, and never the final release +gate (Claude stays the quality bar).** Both models sit BELOW the RRI release bar (story ≥4.3, mech +≥4.5) — story ~4.0–4.1 is close; the mech ~3.7–3.8 gap is largely the emergent-social-duo coverage +artifact (little combat to score), not an engine defect (Engine-Excellent is met). 
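The mixed-model guard and the GLM-conditional switch-back scrub described earlier in this section can be sketched as below. This is an illustration in Python, not the contents of `qa/glm_profile.sh` (a shell file); every function name is hypothetical. Two things are assumptions: that a GLM model is recognizable by a `glm` name prefix, and that the z.ai host is matched by its `z.ai` suffix. `API_TIMEOUT_MS` is left out because the text does not say which condition governs it.

```python
from urllib.parse import urlparse

GLM_CONFIG_DIR = "/tmp/glm-claude-config"  # path named in the text


def is_glm_model(name: str) -> bool:
    # Assumption: GLM models are identified by a "glm" name prefix.
    return name.lower().startswith("glm")


def normalize_roles(dm_model: str, actor_model: str):
    """Return (dm, actor, warned); a half-GLM config becomes all-GLM."""
    dm_glm, actor_glm = is_glm_model(dm_model), is_glm_model(actor_model)
    if dm_glm != actor_glm:
        glm = dm_model if dm_glm else actor_model
        return glm, glm, True
    return dm_model, actor_model, False


def points_at_zai(url: str) -> bool:
    # Assumption: the GLM endpoint lives on z.ai or a subdomain of it.
    host = urlparse(url).hostname or ""
    return host == "z.ai" or host.endswith(".z.ai")


def scrub_glm_env(env: dict, glm_token: str) -> dict:
    """Drop only GLM-injected values; leave legitimate settings alone."""
    cleaned = dict(env)
    if points_at_zai(cleaned.get("ANTHROPIC_BASE_URL", "")):
        del cleaned["ANTHROPIC_BASE_URL"]
    for key in ("ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY"):
        # Exact match against the GLM credential; an sk-ant- key never matches.
        if glm_token and cleaned.get(key) == glm_token:
            del cleaned[key]
    if cleaned.get("CLAUDE_CONFIG_DIR") == GLM_CONFIG_DIR:
        del cleaned["CLAUDE_CONFIG_DIR"]
    return cleaned


print(normalize_roles("glm-5.2", "opus"))
```

Keying each removal on a GLM-specific match, rather than unsetting the variables unconditionally, is what lets a corporate proxy or a user's own config directory survive the scrub.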
+ ## Validation ladder (cheap → expensive; before any model/effort spend) digest-correctness (1 engine call, no LLM) → cache-stability (1 two-beat run) → effort/flag-wiring probe (confirm the runner consumes the flag — see worldos-dev "QA must exercise the flag") → short duo A/B on the diff --git a/qa/SCORING.md b/qa/SCORING.md index 38fa13c6..12d23cc9 100644 --- a/qa/SCORING.md +++ b/qa/SCORING.md @@ -1,10 +1,16 @@ # WorldOS QA Scoring System — standardized reference -> Source of truth for HOW we measure a playtest. Current as of 2026-05-26. +> Source of truth for HOW we measure a playtest. Current as of 2026-06-19 (post-24h reorient). > The running results ledger is `qa/scores_db.py` (SQLite) → `qa/scores_ledger.md` (`add_run()` / `--render`); `qa/SCORECARD.md` is LEGACY narrative. > For the current app/native handoff tools and RRI routing, start with `qa/QA_TOOLS.md` and > `WorldOS-GUI-RUNBOOK.md`; this file describes the story/mechanical scoring model. +> **Everything here is MEASUREMENT, never the target.** The north star (`VISION.md`) is the +> *felt player session* — a no-prior-knowledge player plays a complete 8-beat Baldur's-Gate-caliber +> arc and never once feels "this is broken." Scores, RRI, and rubric numbers exist only to *measure* +> that; **no score-gaming.** The gate-severity work in §1a is the sharp edge of this: making the +> measurement HONEST (it was punishing legitimate good story-craft) is the opposite of gaming a number. + The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM lenses** (each 1–5). The gate is the honest floor; the lenses grade quality above it. @@ -29,6 +35,69 @@ If the gate is RED, all three LLM scorecards are **capped to ≤ 2.5 / INVALID** annotated with the failed checks (`worldos_cap_score_red`). A dead/non-progressing scene can never display as 4.1 again. On a GREEN run, scores pass through untouched. +## 1a. 
Gate severity — a FATAL must mean a true integrity failure (honest-measurement repair) +Because a RED caps **all three lenses ≤ 2.5**, the line between **FATAL** (caps the run) and +**WARN** (advisory, doesn't cap) IS the measurement. The contract (the post-24h reorient, +PR #1030 — paired with the #1027 coercion fix below): + +> A FATAL behavioral gate fires **only on a true integrity / correctness failure** — no PC +> seated, a rejected/validation-walled tool call, dice never used, a save corrupted by a real +> engine bug, a fight genuinely abandoned. It must **NOT** fire on a quality/completeness signal +> that a *legitimate short emergent duo* trips. Those are WARNs. + +A Phase-2 GLM-vs-Claude 1-v-1 found **Claude opus runs RED-capping too** (2/2), so two FATAL gates +were demoting good story-craft (VISION pillar 1) on **both** models — a **model-agnostic false-cap**, +not a model quirk. The two beat-scoped fixes (`qa/assert_behavioral.py`): + +- **`party_traveled`** (`assert_behavioral.py:676–696`). The bare `visited >= 2` rule read a deep + 6–7-beat **single-scene social duo** as "never left the opening scene" → FATAL. **Now beat-scoped:** + `SINGLE_SCENE_MIN_BEATS = 8` — below 8 beats this is a **WARN** (a single-scene vignette is not a + stuck DM); **at/above 8** it stays **FATAL**. The strict anti-gaming in-place-progression exception + (the run must have **advanced the clock AND resolved an actual completed quest** — `clock_advanced + AND arc_resolved`, deliberately *not* clock-only or beats-only, adversarially verified against a + cheap-`set_quest_status("active")` game) is **unchanged**; a substantial run that never moves *and* + never progresses is still a FATAL stuck DM. +- **`combat_not_left_active`** (`assert_behavioral.py:326–397`). 
A 6-beat duo that **enters combat + near its beat budget and truncates mid-fight** legitimately never reaches `end_combat` → the old bare + FATAL capped a run that did nothing wrong (proven: `qa/transcripts/claude-1v1-2`, an opus duo whose + final DM line is cut off mid-sentence). **Now severity rides a `started_late` discriminator** (where + the last `start_combat` lands in the ordered tool stream): a fight that started in the **final ~20%** + of calls — or a **resume-into-combat** session with **no `start_combat` this run** — is a + **truncation → WARN**. Only a **genuine abandon** (a substantial run `≥ COMBAT_ABANDON_MIN_BEATS = 10` + where combat started **early** with room to resolve, `end_combat` never fired, and the fight is **still + active** at the snapshot) stays **FATAL** — that corrupts the next load (and the engine's `start_combat` + next-load guard is the deeper backstop). + +**This is honest-measurement repair, NOT score-gaming.** A gate-severity audit classified *every* +FATAL gate KEEP-FATAL vs over-aggressive and changed **only** the two quality/completeness ones; an +adversarial verifier confirmed **no true integrity gate was weakened** — `player_in_party`, +`no_rejected_tool_calls`, `dice_used`, `dm_produced_output`, SRD-correctness, and the XP gates are all +**untouched** — and the behavioral-gate **corpus still REDs genuine failures** (`party_traveled` padded +to 8 beats, `combat_not_left_active` reshaped to a real-abandon profile — both still trip the preserved +FATAL path). The opus runs that wrongly RED-capped (`claude-1v1-1`/`claude-1v1-2`) are now GREEN; the +corpus + taxonomy suite (39) and `fast_gate` (226) stay green. This makes the floor measure *broken*, +so the lenses can grade good short story-craft instead of being capped to ≤ 2.5. 
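The two beat-scoped severity rules above can be sketched as small pure functions. This is a reading of the prose, not the code in `qa/assert_behavioral.py`: the function names and signatures are hypothetical, the order in which the in-place-progression exception is checked is an assumption, and "final ~20% of calls" is approximated as a position fraction of 0.8.

```python
SINGLE_SCENE_MIN_BEATS = 8      # threshold named in the text
COMBAT_ABANDON_MIN_BEATS = 10   # threshold named in the text
LATE_START_FRACTION = 0.8       # assumption: "final ~20% of calls"


def party_traveled_severity(visited, beats, clock_advanced, arc_resolved):
    """Hypothetical sketch of the beat-scoped party_traveled rule."""
    if visited >= 2:
        return "PASS"
    # In-place-progression exception: clock AND a resolved arc, never one alone.
    if clock_advanced and arc_resolved:
        return "PASS"
    # A short single-scene vignette is advisory, not a stuck DM.
    return "WARN" if beats < SINGLE_SCENE_MIN_BEATS else "FATAL"


def combat_severity(tool_calls, beats, combat_active):
    """Hypothetical sketch of the started_late discriminator."""
    if not combat_active:
        return "PASS"
    starts = [i for i, name in enumerate(tool_calls) if name == "start_combat"]
    if not starts:
        return "WARN"  # resume-into-combat: no start_combat this run
    started_late = starts[-1] / len(tool_calls) >= LATE_START_FRACTION
    if started_late:
        return "WARN"  # truncation near the beat budget
    ended = "end_combat" in tool_calls[starts[-1]:]
    if beats >= COMBAT_ABANDON_MIN_BEATS and not ended:
        return "FATAL"  # genuine abandon: early start, room to resolve
    return "WARN"


print(party_traveled_severity(1, 6, False, False))
```

Both functions return FATAL only on the branch the text reserves for a true integrity failure; every quality or completeness signal falls through to WARN.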
+ +### 1a.1 List-arg coercion — the tool-arg contract (#1027) +The **#1 source** of the model-agnostic RED-cap was *upstream* of the gate: FastMCP validates a tool +call's args against the Pydantic type hints **before** the function body runs, so a model passing a +bare string (`approval_tags="honest_dealing"`) or a comma-string (`actor_ids="id1,id2"`) where a +**list** is expected was rejected ("Input should be a valid list") → the FATAL `no_rejected_tool_calls` +gate → all three lenses capped. This deflated **~30%** of runs and hit the **Claude** baseline +transcripts (`baseline-rc1`, `cue-thaw`) exactly as hard as GLM — *not* a GLM-only problem. + +**The contract** (`servers/engine/models.py:21–63`): list-typed tool args coerce at the validation +layer via a reusable `_coerce_list` **`BeforeValidator`** (the `ListArg` / `StrListArg` / `OptStrListArg` +aliases), applied to the high-traffic DM-called args — `record_decision` (options / actor_ids / +approval_tags), `author_companion_gauges`, `start_combat` (combatant_ids / surpriser_ids), `cast_spell` +(target_ids), and the nested `persist_beat` decision path (`server.py:12582–12594`). Behavior: +`None → None`; a real `list → unchanged`; `"" → []`; `"foo" → ["foo"]`; `"a,b , c" → ["a","b","c"]`; +**anything genuinely wrong (int / dict) is returned as-is so Pydantic STILL rejects it loudly** — the +coercion is purely additive and never swallows a real type bug. Critically, a `BeforeValidator` is +**invisible to `json_schema()`**, so the emitted wire schema stays a plain `array` and the pinned +schema byte-budget (`test_tool_schema_budget`) does not regress. The model gets coerced, not walled — +so a stringified list no longer manufactures a false `no_rejected_tool_calls` RED. + ## 1b. Feature-engagement coverage — the dead-system tracker (WS0) `qa/feature_engagement.py` @@ -172,3 +241,23 @@ toward the new measured max. gate. 
`qa/test_lens_variance.py` is the deterministic, CI-safe guard that keeps this floor honest (it reads only on-disk artifacts; live re-derivation is an explicit, opt-in, non-CI step gated behind `WORLDOS_LIVE_SCORER=1`). + +## 7. Timing columns — where a beat's seconds go (Wave-1) +Additive observability, not a gate. A finished run now reports **where time goes**, flowing +**per-tool-call sidecar → `qa/latency_rollup.py` → `qa/scores.db` columns → the `story_readout` TIMING +stamp**. The engine wraps each `@mcp.tool()` once and, **only when `WORLDOS_TOOLTIMING_PATH` is set** +(default-OFF — production pays nothing), appends `{ts, tool, wall_ms, ok, campaign_id}` per call to a +JSONL sidecar (PR #1006). `latency_rollup.py` (PR #1007) then derives two dimensions: **per-kind +generation** means — `combat_s_per_beat` / `social_s_per_beat` / `camp_s_per_beat`, each the mean beat +`duration_api_ms` over beats classified by their tool calls (cold-open / combat / camp / social; combat +outranks camp) **straight from the transcripts, no sidecar needed**; and a **tool-exec split** from the +optional sidecar — `mean_tool_call_ms`, `slowest_tool` (largest *total* summed `wall_ms`), and +`tool_exec_pct` (= Σ tool wall-s ÷ Σ whole-beat `duration_ms`, with a `tool_exec_pct_basis` stamp). +These land as additive `scores.db` columns (`combat_s_per_beat`, `social_s_per_beat`, `mean_tool_call_ms`, +`slowest_tool`, `tool_exec_pct`, `duration_wall_s`; old rows read NULL via `ALTER TABLE`) and as a +one-line `TIMING |` readout next to `COVERAGE`, e.g. +`TIMING | beat~86s gen~96s cold~240s | combat~140s social~70s camp~95s | tool=3% slowest=scene_context` +(the tool clause is omitted when no sidecar). **The headline finding: engine tool-exec is only ~1–4% of +a beat** — routine beats are ~90–100% **generation/decode-bound** (Opus more so, extended thinking), so +when a combat turn feels slow it's the *model thinking*, not the tools. 
Everything degrades to `None` +without a sidecar, leaving the rest of the rollup byte-identical.
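The tool-exec split can be illustrated with a minimal rollup over sidecar records. This sketch is not `qa/latency_rollup.py`: the function name `rollup_tool_exec` and its inputs are hypothetical, and it assumes each JSONL line carries the `tool` and `wall_ms` fields listed above, with whole-beat durations supplied separately in milliseconds.

```python
import json
from collections import defaultdict


def rollup_tool_exec(sidecar_lines, beat_duration_ms):
    """Hypothetical sketch: derive the three tool-exec metrics."""
    records = [json.loads(line) for line in sidecar_lines if line.strip()]
    if not records:
        # No sidecar data: every metric degrades to None.
        return {"mean_tool_call_ms": None, "slowest_tool": None,
                "tool_exec_pct": None}
    totals = defaultdict(float)
    for rec in records:
        totals[rec["tool"]] += rec["wall_ms"]
    total_tool_ms = sum(totals.values())
    total_beat_ms = sum(beat_duration_ms)
    return {
        "mean_tool_call_ms": total_tool_ms / len(records),
        # "Slowest" means largest TOTAL summed wall_ms, not the worst call.
        "slowest_tool": max(totals, key=totals.get),
        "tool_exec_pct": (100.0 * total_tool_ms / total_beat_ms
                          if total_beat_ms else None),
    }


lines = [
    '{"tool": "scene_context", "wall_ms": 1200, "ok": true}',
    '{"tool": "scene_context", "wall_ms": 1800, "ok": true}',
    '{"tool": "roll_dice", "wall_ms": 600, "ok": true}',
]
print(rollup_tool_exec(lines, [60000, 60000]))
```

With these made-up inputs, 3.6 s of tool time against 120 s of beat time gives a `tool_exec_pct` of 3, the same order of magnitude as the ~1–4% headline figure.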