electricsheephq · 100yenadmin · Jun 19, 2026
diff --git a/WorldOS-GUI-RUNBOOK.md b/WorldOS-GUI-RUNBOOK.md
@@ -338,6 +338,14 @@ release truth still requires `qa/ui_playtest_app.sh` Part A+B and the full RRI s
   Mac handoff bundle only if all required handoff gates and manifests are same-SHA, clean,
   private-art-present, and gap-free. If the VM runs a newer SHA, rerun `qa/app_handoff_gate.py` on that
   newer SHA first.
+- **GLM QA lane (cheap batch sweeps, token saver — NOT the release gate).** Any heavy persona/duo sweep on
+  this VM can run on **GLM 5.2** instead of Claude to save Anthropic tokens: set
+  `WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`. `qa/glm_profile.sh` (sourced by `run_duo` /
+  `run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint + raised
+  timeouts/retries; it is a no-op for Claude and scrubs stray GLM env on switch-back. **The scorer stays
+  Claude** (`qa/score.sh`, pinned-Sonnet, isolated `~/.claude`). Use GLM for bug-finding/smoke; **Claude
+  stays the quality bar** for the release RRI. Full strategy + the cap-rate finding:
+  `docs/MODEL-TIERING-STRATEGY.md`.
 
 ## Release (when RRI = 10/10 on a fresh .app build)
 Bump `.claude-plugin/plugin.json` → 1.0.4, tag `v1.0.4`, GitHub release + CHANGELOG. Then MAINTAIN:
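
Illustrative sketch (not part of the patch): the GLM QA lane bullet above says `qa/glm_profile.sh` is a no-op for Claude and scrubs stray GLM env on switch-back. The real helper is a shell script whose contents are not shown in this diff, so the Python below is only a hedged model of that decision logic, using the behaviour described elsewhere in this PR (both roles normalized to GLM when only one names it; each unset applied only when the value looks GLM-injected). The names `resolve_roles`, `scrub_glm_env`, `GLM_HOST` and the `glm_secret` parameter are hypothetical, and `API_TIMEOUT_MS` is left out because the source does not say how its GLM origin is detected.

```python
from typing import Dict, Mapping, Tuple
from urllib.parse import urlparse

GLM_HOST = "z.ai"                          # assumption: host used to spot the z.ai endpoint
GLM_CONFIG_DIR = "/tmp/glm-claude-config"  # config dir named in the PR notes


def is_glm(model: str) -> bool:
    """Assumption: a role counts as GLM when its model name starts with 'glm'."""
    return model.strip().lower().startswith("glm")


def resolve_roles(dm_model: str, actor_model: str) -> Tuple[str, str, bool]:
    """Return (dm, actor, apply_glm_profile) for the two role models."""
    dm_glm, actor_glm = is_glm(dm_model), is_glm(actor_model)
    if not dm_glm and not actor_glm:
        # Claude path: the profile is a no-op and the roles are left alone.
        return dm_model, actor_model, False
    if dm_glm != actor_glm:
        # Mixed-model guard: normalize both roles to the GLM model.
        glm_model = dm_model if dm_glm else actor_model
        return glm_model, glm_model, True
    return dm_model, actor_model, True


def _is_glm_url(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return host == GLM_HOST or host.endswith("." + GLM_HOST)


def scrub_glm_env(env: Mapping[str, str], glm_secret: str) -> Dict[str, str]:
    """Drop only values that look GLM-injected; leave everything else untouched."""
    cleaned = dict(env)
    if _is_glm_url(cleaned.get("ANTHROPIC_BASE_URL", "")):
        del cleaned["ANTHROPIC_BASE_URL"]
    for key in ("ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY"):
        # Byte-match against the GLM secret, so a real Anthropic key survives.
        if glm_secret and cleaned.get(key) == glm_secret:
            del cleaned[key]
    if cleaned.get("CLAUDE_CONFIG_DIR") == GLM_CONFIG_DIR:
        del cleaned["CLAUDE_CONFIG_DIR"]
    return cleaned
```

In this model a corporate proxy URL or an `sk-ant-` key passes through `scrub_glm_env` unchanged, which is the property the runbook text relies on when it calls the switch-back clean.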

diff --git a/WorldOS-RUNBOOK.md b/WorldOS-RUNBOOK.md
@@ -236,6 +236,17 @@ is LEGACY narrative; don't hand-edit it.)
   or `qa/score_openclaw.sh` (gateway gpt-5.4, grades **~1.5 pts HARSHER** — a strict
   cross-check, NOT the headline).
 
+**GLM QA lane (cheap batch sweeps — token saver, NOT the release gate).** To run a QA harness on
+**GLM 5.2** instead of Claude, set both role models to GLM:
+`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2 qa/run_duo.sh …`. `qa/glm_profile.sh` (sourced by
+every harness — `run_duo` / `run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint
+and credentials (from `~/.openclaw/secrets/glm.env`) and raises the cold-open/per-beat timeouts + retry
+ceilings (GLM is ~2–3× slower than Opus). It is a **no-op for Claude** and defensively scrubs any stray GLM
+env on switch-back, so a clean Claude run is byte-identical. **The scorer stays Claude** (`qa/score.sh` runs
+the pinned-Sonnet scorer under isolated `~/.claude`, whichever model played). Use GLM to save Anthropic
+tokens on bug-finding/build-smoke sweeps; **Claude remains the quality bar** for the release scorecard. Full
+strategy + the cap-rate finding: `docs/MODEL-TIERING-STRATEGY.md`.
+
 **Targets (the loop's exit bar):** **story ≥ 4.3, mechanical ≥ 4.5, gate GREEN, 0
 critical/high** adversarial defects.
 

diff --git a/docs/MODEL-TIERING-STRATEGY.md b/docs/MODEL-TIERING-STRATEGY.md
@@ -49,6 +49,80 @@ Opus sweep confirming story ≥4.3 AND no latency give-up.
   generation-bound, so a prefetch helper touches ≤5% of the wall-clock. Refuted by the latency forensics.
 - **A headless `--fast` mode** — doesn't exist for `claude -p`; use `--effort`.
 
+## GLM as a cheap batch-QA engine (QA-only; Claude stays the quality bar)
+GLM (z.ai's **GLM 5.2**, served over an Anthropic-compatible endpoint) is a **cost lever for QA sweeps**,
+NOT a model the player ever touches. The point is to run cheap batch QA — many duos to find bugs / smoke a
+build — without spending Anthropic tokens, while **Claude remains the quality bar** for the release gate.
+
+**The clean model-profile system** (`qa/glm_profile.sh`; PRs #1026 + #1028). A **single model choice flows
+coherently** through the harness via `WORLDOS_DM_MODEL` (+ `WORLDOS_ACTOR_MODEL`). The profile is keyed off
+that one choice:
+- **No-op for Claude.** If neither role names a GLM model, `worldos_apply_glm_profile` does NOT apply the
+  profile and never alters a Claude default — a clean Claude run is byte-for-byte unchanged.
+- **Switch-back is always clean (no leak).** On the Claude path the profile *defensively scrubs* any stray
+  GLM-injected env left in the shell after a QA run (`ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` /
+  `ANTHROPIC_API_KEY` / `API_TIMEOUT_MS` / a `/tmp/glm-claude-config` `CLAUDE_CONFIG_DIR`). Each unset is
+  **GLM-conditional** (matched by the z.ai host or a byte-match to `glm.env`), so a legitimate `sk-ant-` key,
+  `api.anthropic.com`, a corporate proxy, or a user's own config dir is **never touched**. Switching back
+  to Opus therefore stays clean even if a GLM export leaked.
+- **Mixed-model guard.** If exactly one role is GLM (a half-GLM/half-Claude config — almost always a
+  mistake, since `ANTHROPIC_BASE_URL` is process-global and the "Claude" half would silently inherit z.ai),
+  it warns and **normalizes both roles to GLM** so a run can never silently route the two roles to different
+  providers.
+- **Product is forced clean-Claude.** `scripts/play.sh` + `scripts/play_party.sh` conditionally neutralize
+  ambient GLM env before any `claude -p`, so the `.app` always runs Claude (Opus) quality and never opts into
+  GLM. **QA uses GLM via `qa/glm_profile.sh`; the product play path never does.**
+- **The scorer is ALWAYS isolated-Claude.** `qa/score.sh` runs the pinned-`sonnet` scorer under
+  `env -u ANTHROPIC_BASE_URL -u ANTHROPIC_API_KEY …` so it uses clean `~/.claude` (Anthropic OAuth)
+  **regardless of which model PLAYED the game** — the measurement is on a constant scorer, never on GLM.
+
+**When to use GLM:** cheap batch QA sweeps to save Anthropic tokens (bug-finding, build smoke, parallel
+duos). **NOT the final release gate** — Claude stays the quality bar; GLM is a viable cheap batch-sweep
+engine, not a replacement for Claude on the release scorecard. Run it via
+`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`; the profile auto-wires the z.ai endpoint + raised
+timeouts/retry ceilings (GLM is ~2–3× slower than Opus). See the **GLM QA lane** notes in `WorldOS-RUNBOOK.md`
+and `WorldOS-GUI-RUNBOOK.md`.
+
+### The cap-rate finding (honest-measurement repair, NOT a GLM weakness)
+The ~30% GLM "cap rate" that early overnight sweeps showed was **NOT a GLM quality weakness** — it was
+**self-inflicted, model-agnostic over-aggressive FATAL gates** capping **both** models. The Phase-2 reorient
+ran a GLM-vs-Claude 1-v-1 and immediately found that **Claude opus runs RED-capped too** (2/2). A RED
+behavioral gate caps all three lenses to ≤2.5, and several FATAL gates were nuking *legitimate* short
+emergent sessions on both models. Two root causes, both fixed:
+- **`no_rejected_tool_calls`** — a model passing a string/comma-string where a list arg was expected was
+  rejected by Pydantic → FATAL. Fixed at the validation layer (#1027: a `BeforeValidator` coerces
+  `str → [s]` / comma-`str → split`, schema unchanged, genuinely-wrong types still rejected).
+- **`party_traveled` / `combat_not_left_active`** — a deep single-scene social duo read as "never left the
+  opening scene," and a 6-beat duo that truncated mid-fight read as "combat abandoned" → FATAL. Fixed by
+  making severity **beat-scoped / discriminator-aware** (#1030: WARN below the single-scene/late-start
+  threshold, FATAL only for a genuine stuck-DM or real abandon). **Adversarially verified: no true
+  integrity gate was weakened** — the corpus fixtures still RED genuine failures (player-seated,
+  rejected-tools, dice, dm-output, SRD-correctness, xp all untouched).
+
+This is the spirit of the north star: **scores are measurement, never the target.** The gate had been
+FALSE-CAPPING good story-craft (short single-scene / truncated-combat sessions are legitimate, and pillar 1
+is story-craft first) — fixing it makes the measurement *honest*. This is the OPPOSITE of score-gaming.
+
+**Honest GLM-vs-Claude quality (measured on the fixed engine, 2026-06-19).** Same-SHA 1-v-1
+(`43a5ecc`: #1027 coercion + #1028 clean-profile + #1030 gate-severity), same world/persona/6-beats,
+both scored by the isolated Claude sonnet scorer. 5 runs (3 Claude opus/sonnet + 2 GLM-5.2), **all
+behavioral GREEN — 0 RED-caps** (vs the pre-fix ~30%, which was the self-inflicted gate false-cap, NOT
+a GLM weakness):
+
+| model | story | mech | angry | cold-open |
+|---|---|---|---|---|
+| Claude (opus DM / sonnet actor) | **4.13** | 3.67 | 3.33 | ~205–249s |
+| GLM-5.2 (both roles) | 3.9 | **3.8** | **3.4** | **604–872s** |
+
+**Verdict:** GLM is **comparable quality** — within ~0.2 on every lens; *higher* on mechanical (3.8 vs
+3.67) and angry-DM (3.4 vs 3.33), ~0.2 lower on story-craft (3.9 vs 4.13). A real QA runner, not
+degraded. **Its true cost is LATENCY** — GLM cold-opens run 604–872s (3–4× Claude's ~205–249s) and
+routine beats ~120–166s (vs ~80–96s), so a 3-run GLM batch is ~2.5–3.5h. ⇒ **Use GLM for cheap
+overnight / VM batch sweeps where latency is hidden; never interactive, and never the final release
+gate (Claude stays the quality bar).** Both models sit BELOW the RRI release bar (story ≥4.3, mech
+≥4.5). Story ~3.9–4.1 is close; the mech ~3.7–3.8 gap is largely the emergent-social-duo coverage
+artifact (little combat to score), not an engine defect (Engine-Excellent is met).
+
 ## Validation ladder (cheap → expensive; before any model/effort spend)
 digest-correctness (1 engine call, no LLM) → cache-stability (1 two-beat run) → effort/flag-wiring probe
 (confirm the runner consumes the flag — see worldos-dev "QA must exercise the flag") → short duo A/B on the
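
Illustrative sketch (not part of the patch): the cap-rate finding in the hunk above names two root-cause fixes, list-arg coercion (#1027) and beat-scoped gate severity (#1030). The Python below models both as plain functions under stated assumptions. The thresholds `SINGLE_SCENE_MIN_BEATS = 8` and `COMBAT_ABANDON_MIN_BEATS = 10` and the coercion examples come from this PR's `qa/SCORING.md` notes; the function names, the `"PASS"`/`"WARN"`/`"FATAL"` return strings, the 0.8 cut-off standing in for "final ~20% of calls", and the choice to drop empty comma segments are assumptions. The real coercion is wired as a Pydantic `BeforeValidator`, which is omitted here to keep the example stdlib-only.

```python
from typing import Any, Optional

SINGLE_SCENE_MIN_BEATS = 8     # from the PR notes
COMBAT_ABANDON_MIN_BEATS = 10  # from the PR notes
LATE_START_FRACTION = 0.8      # assumption: models "final ~20% of calls"


def coerce_list_sketch(value: Any) -> Any:
    """Coerce str inputs to lists; hand anything else back for validation."""
    if value is None or isinstance(value, list):
        return value
    if isinstance(value, str):
        # "" -> [], "foo" -> ["foo"], "a,b , c" -> ["a", "b", "c"]
        return [part.strip() for part in value.split(",") if part.strip()]
    return value  # int / dict etc. stay as-is so validation still rejects them


def party_traveled_severity(beats: int, visited: int,
                            clock_advanced: bool, arc_resolved: bool) -> str:
    if visited >= 2:
        return "PASS"
    if clock_advanced and arc_resolved:
        return "PASS"  # in-place progression exception (both conditions required)
    return "WARN" if beats < SINGLE_SCENE_MIN_BEATS else "FATAL"


def combat_left_active_severity(beats: int, combat_active_at_end: bool,
                                last_start_index: Optional[int],
                                total_calls: int) -> str:
    if not combat_active_at_end:
        return "PASS"
    if last_start_index is None:
        return "WARN"  # resume-into-combat: no start_combat in this run
    started_late = (total_calls > 0
                    and last_start_index >= LATE_START_FRACTION * total_calls)
    if started_late:
        return "WARN"  # truncated mid-fight near the beat budget
    # Early start with room to resolve: FATAL only on a substantial run.
    return "FATAL" if beats >= COMBAT_ABANDON_MIN_BEATS else "WARN"
```

Under these assumptions a 6-beat single-scene duo yields `WARN` from `party_traveled_severity`, while the same profile padded to 8 beats with no progression yields `FATAL`, matching the corpus behaviour the PR describes.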

diff --git a/qa/SCORING.md b/qa/SCORING.md
@@ -1,10 +1,16 @@
 # WorldOS QA Scoring System — standardized reference
 
-> Source of truth for HOW we measure a playtest. Current as of 2026-05-26.
+> Source of truth for HOW we measure a playtest. Current as of 2026-06-19 (post-24h reorient).
 > The running results ledger is `qa/scores_db.py` (SQLite) → `qa/scores_ledger.md` (`add_run()` / `--render`); `qa/SCORECARD.md` is LEGACY narrative.
 > For the current app/native handoff tools and RRI routing, start with `qa/QA_TOOLS.md` and
 > `WorldOS-GUI-RUNBOOK.md`; this file describes the story/mechanical scoring model.
 
+> **Everything here is MEASUREMENT, never the target.** The north star (`VISION.md`) is the
+> *felt player session* — a no-prior-knowledge player plays a complete 8-beat Baldur's-Gate-caliber
+> arc and never once feels "this is broken." Scores, RRI, and rubric numbers exist only to *measure*
+> that; **no score-gaming.** The gate-severity work in §1a is the sharp edge of this: making the
+> measurement HONEST (it was punishing legitimate good story-craft) is the opposite of gaming a number.
+
 The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM
 lenses** (each 1–5). The gate is the honest floor; the lenses grade quality above it.
 
@@ -29,6 +35,69 @@ If the gate is RED, all three LLM scorecards are **capped to ≤ 2.5 / INVALID**
 annotated with the failed checks (`worldos_cap_score_red`). A dead/non-progressing scene
 can never display as 4.1 again. On a GREEN run, scores pass through untouched.
 
+## 1a. Gate severity — a FATAL must mean a true integrity failure (honest-measurement repair)
+Because a RED caps **all three lenses ≤ 2.5**, the line between **FATAL** (caps the run) and
+**WARN** (advisory, doesn't cap) IS the measurement. The contract (the post-24h reorient,
+PR #1030 — paired with the #1027 coercion fix below):
+
+> A FATAL behavioral gate fires **only on a true integrity / correctness failure** — no PC
+> seated, a rejected/validation-walled tool call, dice never used, a save corrupted by a real
+> engine bug, a fight genuinely abandoned. It must **NOT** fire on a quality/completeness signal
+> that a *legitimate short emergent duo* trips. Those are WARNs.
+
+A Phase-2 GLM-vs-Claude 1-v-1 found **Claude opus runs RED-capping too** (2/2), so two FATAL gates
+were demoting good story-craft (VISION pillar 1) on **both** models — a **model-agnostic false-cap**,
+not a model quirk. The two beat-scoped fixes (`qa/assert_behavioral.py`):
+
+- **`party_traveled`** (`assert_behavioral.py:676–696`). The bare `visited >= 2` rule read a deep
+  6–7-beat **single-scene social duo** as "never left the opening scene" → FATAL. **Now beat-scoped:**
+  `SINGLE_SCENE_MIN_BEATS = 8` — below 8 beats this is a **WARN** (a single-scene vignette is not a
+  stuck DM); **at/above 8** it stays **FATAL**. The strict anti-gaming in-place-progression exception
+  (the run must have **advanced the clock AND resolved an actual completed quest** — `clock_advanced
+  AND arc_resolved`, deliberately *not* clock-only or beats-only, adversarially verified against a
+  cheap-`set_quest_status("active")` game) is **unchanged**; a substantial run that never moves *and*
+  never progresses is still a FATAL stuck DM.
+- **`combat_not_left_active`** (`assert_behavioral.py:326–397`). A 6-beat duo that **enters combat
+  near its beat budget and truncates mid-fight** legitimately never reaches `end_combat` → the old bare
+  FATAL capped a run that did nothing wrong (proven: `qa/transcripts/claude-1v1-2`, an opus duo whose
+  final DM line is cut off mid-sentence). **Now severity rides a `started_late` discriminator** (where
+  the last `start_combat` lands in the ordered tool stream): a fight that started in the **final ~20%**
+  of calls — or a **resume-into-combat** session with **no `start_combat` this run** — is a
+  **truncation → WARN**. Only a **genuine abandon** (a substantial run `≥ COMBAT_ABANDON_MIN_BEATS = 10`
+  where combat started **early** with room to resolve, `end_combat` never fired, and the fight is **still
+  active** at the snapshot) stays **FATAL** — that corrupts the next load (and the engine's `start_combat`
+  next-load guard is the deeper backstop).
+
+**This is honest-measurement repair, NOT score-gaming.** A gate-severity audit classified *every*
+FATAL gate KEEP-FATAL vs over-aggressive and changed **only** the two quality/completeness ones; an
+adversarial verifier confirmed **no true integrity gate was weakened** — `player_in_party`,
+`no_rejected_tool_calls`, `dice_used`, `dm_produced_output`, SRD-correctness, and the XP gates are all
+**untouched** — and the behavioral-gate **corpus still REDs genuine failures** (`party_traveled` padded
+to 8 beats, `combat_not_left_active` reshaped to a real-abandon profile — both still trip the preserved
+FATAL path). The opus runs that wrongly RED-capped (`claude-1v1-1`/`claude-1v1-2`) are now GREEN; the
+corpus + taxonomy suite (39) and `fast_gate` (226) stay green. This makes the floor measure *broken*,
+so the lenses can grade good short story-craft instead of being capped to ≤ 2.5.
+
+### 1a.1 List-arg coercion — the tool-arg contract (#1027)
+The **#1 source** of the model-agnostic RED-cap was *upstream* of the gate: FastMCP validates a tool
+call's args against the Pydantic type hints **before** the function body runs, so a model passing a
+bare string (`approval_tags="honest_dealing"`) or a comma-string (`actor_ids="id1,id2"`) where a
+**list** is expected was rejected ("Input should be a valid list") → the FATAL `no_rejected_tool_calls`
+gate → all three lenses capped. This deflated **~30%** of runs and hit the **Claude** baseline
+transcripts (`baseline-rc1`, `cue-thaw`) exactly as hard as GLM — *not* a GLM-only problem.
+
+**The contract** (`servers/engine/models.py:21–63`): list-typed tool args coerce at the validation
+layer via a reusable `_coerce_list` **`BeforeValidator`** (the `ListArg` / `StrListArg` / `OptStrListArg`
+aliases), applied to the high-traffic DM-called args — `record_decision` (options / actor_ids /
+approval_tags), `author_companion_gauges`, `start_combat` (combatant_ids / surpriser_ids), `cast_spell`
+(target_ids), and the nested `persist_beat` decision path (`server.py:12582–12594`). Behavior:
+`None → None`; a real `list → unchanged`; `"" → []`; `"foo" → ["foo"]`; `"a,b , c" → ["a","b","c"]`;
+**anything genuinely wrong (int / dict) is returned as-is so Pydantic STILL rejects it loudly** — the
+coercion is purely additive and never swallows a real type bug. Critically, a `BeforeValidator` is
+**invisible to `json_schema()`**, so the emitted wire schema stays a plain `array` and the pinned
+schema byte-budget (`test_tool_schema_budget`) does not regress. The model gets coerced, not walled —
+so a stringified list no longer manufactures a false `no_rejected_tool_calls` RED.
+
 ## 1b. Feature-engagement coverage — the dead-system tracker (WS0)
 `qa/feature_engagement.py`
 
@@ -172,3 +241,23 @@ toward the new measured max.
   gate. `qa/test_lens_variance.py` is the deterministic, CI-safe guard that keeps this
   floor honest (it reads only on-disk artifacts; live re-derivation is an explicit,
   opt-in, non-CI step gated behind `WORLDOS_LIVE_SCORER=1`).
+
+## 7. Timing columns — where a beat's seconds go (Wave-1)
+Additive observability, not a gate. A finished run now reports **where time goes**, flowing
+**per-tool-call sidecar → `qa/latency_rollup.py` → `qa/scores.db` columns → the `story_readout` TIMING
+stamp**. The engine wraps each `@mcp.tool()` once and, **only when `WORLDOS_TOOLTIMING_PATH` is set**
+(default-OFF — production pays nothing), appends `{ts, tool, wall_ms, ok, campaign_id}` per call to a
+JSONL sidecar (PR #1006). `latency_rollup.py` (PR #1007) then derives two dimensions: **per-kind
+generation** means — `combat_s_per_beat` / `social_s_per_beat` / `camp_s_per_beat`, each the mean beat
+`duration_api_ms` over beats classified by their tool calls (cold-open / combat / camp / social; combat
+outranks camp) **straight from the transcripts, no sidecar needed**; and a **tool-exec split** from the
+optional sidecar — `mean_tool_call_ms`, `slowest_tool` (largest *total* summed `wall_ms`), and
+`tool_exec_pct` (= Σ tool wall-s ÷ Σ whole-beat `duration_ms`, with a `tool_exec_pct_basis` stamp).
+These land as additive `scores.db` columns (`combat_s_per_beat`, `social_s_per_beat`, `mean_tool_call_ms`,
+`slowest_tool`, `tool_exec_pct`, `duration_wall_s`; old rows read NULL via `ALTER TABLE`) and as a
+one-line `TIMING |` readout next to `COVERAGE`, e.g.
+`TIMING | beat~86s gen~96s cold~240s | combat~140s social~70s camp~95s | tool=3% slowest=scene_context`
+(the tool clause is omitted when no sidecar). **The headline finding: engine tool-exec is only ~1–4% of
+a beat** — routine beats are ~90–100% **generation/decode-bound** (Opus more so, extended thinking), so
+when a combat turn feels slow it's the *model thinking*, not the tools. The tool-exec fields degrade to
+`None` without a sidecar, leaving the rest of the rollup byte-identical.
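
Illustrative sketch (not part of the patch): the tool-exec split described in §7 can be modelled as a small reduction over the JSONL sidecar. This is not the code in `qa/latency_rollup.py`. The function name `tool_exec_split` and its signature are hypothetical, and it assumes both the per-call `wall_ms` and the per-beat durations are in milliseconds so the ratio is unit-consistent. The output keys mirror the column names quoted above.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Optional


def tool_exec_split(sidecar_lines: Iterable[str],
                    beat_durations_ms: Iterable[float]) -> Dict[str, Optional[object]]:
    """Derive mean_tool_call_ms, slowest_tool and tool_exec_pct from a sidecar."""
    total_ms_by_tool: Dict[str, float] = defaultdict(float)
    calls = 0
    for line in sidecar_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # {ts, tool, wall_ms, ok, campaign_id}
        total_ms_by_tool[record["tool"]] += float(record["wall_ms"])
        calls += 1

    if calls == 0:
        # No sidecar data: the tool-exec fields degrade to None.
        return {"mean_tool_call_ms": None, "slowest_tool": None, "tool_exec_pct": None}

    tool_ms = sum(total_ms_by_tool.values())
    beat_ms = sum(beat_durations_ms)
    return {
        "mean_tool_call_ms": tool_ms / calls,
        # "Slowest" means the largest total summed wall_ms, not the slowest single call.
        "slowest_tool": max(total_ms_by_tool, key=total_ms_by_tool.get),
        "tool_exec_pct": 100.0 * tool_ms / beat_ms if beat_ms > 0 else None,
    }
```

Passing an empty sidecar returns `None` for all three fields, which is the degraded path the section describes for runs without `WORLDOS_TOOLTIMING_PATH` set.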