8 changes: 8 additions & 0 deletions WorldOS-GUI-RUNBOOK.md
@@ -338,6 +338,14 @@ release truth still requires `qa/ui_playtest_app.sh` Part A+B and the full RRI s
Mac handoff bundle only if all required handoff gates and manifests are same-SHA, clean,
private-art-present, and gap-free. If the VM runs a newer SHA, rerun `qa/app_handoff_gate.py` on that
newer SHA first.
- **GLM QA lane (cheap batch sweeps, token saver — NOT the release gate).** Any heavy persona/duo sweep on
this VM can run on **GLM 5.2** instead of Claude to save Anthropic tokens: set
`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`. `qa/glm_profile.sh` (sourced by `run_duo` /
`run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint + raised
timeouts/retries; it is a no-op for Claude and scrubs stray GLM env on switch-back. **The scorer stays
Claude** (`qa/score.sh`, pinned-Sonnet, isolated `~/.claude`). Use GLM for bug-finding/smoke; **Claude
stays the quality bar** for the release RRI. Full strategy + the cap-rate finding:
`docs/MODEL-TIERING-STRATEGY.md`.

## Release (when RRI = 10/10 on a fresh .app build)
Bump `.claude-plugin/plugin.json` → 1.0.4, tag `v1.0.4`, GitHub release + CHANGELOG. Then MAINTAIN:
11 changes: 11 additions & 0 deletions WorldOS-RUNBOOK.md
@@ -236,6 +236,17 @@ is LEGACY narrative; don't hand-edit it.)
or `qa/score_openclaw.sh` (gateway gpt-5.4, grades **~1.5 pts HARSHER** — a strict
cross-check, NOT the headline).

**GLM QA lane (cheap batch sweeps — token saver, NOT the release gate).** To run a QA harness on
**GLM 5.2** instead of Claude, set both role models to GLM:
`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2 qa/run_duo.sh …`. `qa/glm_profile.sh` (sourced by
every harness — `run_duo` / `run_party` / `run_combat_sprint` / `ui_playtest`) auto-wires the z.ai endpoint
+ credentials (from `~/.openclaw/secrets/glm.env`) and raises the cold-open/per-beat timeouts + retry
ceilings (GLM is ~2–3× slower than Opus). It is a **no-op for Claude** and defensively scrubs any stray GLM
env on switch-back, so a clean Claude run is byte-identical. **The scorer stays Claude** (`qa/score.sh` runs
the pinned-Sonnet scorer under isolated `~/.claude`, whichever model played). Use GLM to save Anthropic
tokens on bug-finding/build-smoke sweeps; **Claude remains the quality bar** for the release scorecard. Full
strategy + the cap-rate finding: `docs/MODEL-TIERING-STRATEGY.md`.

**Targets (the loop's exit bar):** **story ≥ 4.3, mechanical ≥ 4.5, gate GREEN, 0
critical/high** adversarial defects.

74 changes: 74 additions & 0 deletions docs/MODEL-TIERING-STRATEGY.md
@@ -49,6 +49,80 @@ Opus sweep confirming story ≥4.3 AND no latency give-up.
generation-bound, so a prefetch helper touches ≤5% of the wall-clock. Refuted by the latency forensics.
- **A headless `--fast` mode** — doesn't exist for `claude -p`; use `--effort`.

## GLM as a cheap batch-QA engine (QA-only; Claude stays the quality bar)
GLM (z.ai's **GLM 5.2**, served over an Anthropic-compatible endpoint) is a **cost lever for QA sweeps**,
NOT a model the player ever touches. The point is to run cheap batch QA — many duos to find bugs / smoke a
build — without spending Anthropic tokens, while **Claude remains the quality bar** for the release gate.

**The clean model-profile system** (`qa/glm_profile.sh`; PRs #1026 + #1028). A **single model choice flows
coherently** through the harness via `WORLDOS_DM_MODEL` (+ `WORLDOS_ACTOR_MODEL`). The profile is keyed off
that one choice:
- **No-op for Claude.** If neither role names a GLM model, `worldos_apply_glm_profile` does NOT apply the
profile and never alters a Claude default — a clean Claude run is byte-for-byte unchanged.
- **Switch-back is always clean (no leak).** On the Claude path the profile *defensively scrubs* any stray
GLM-injected env left in the shell after a QA run (`ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` /
`ANTHROPIC_API_KEY` / `API_TIMEOUT_MS` / a `/tmp/glm-claude-config` `CLAUDE_CONFIG_DIR`). Each unset is
**GLM-conditional** (matched by the z.ai host or a byte-match to `glm.env`), so a legitimate `sk-ant-` key,
`api.anthropic.com`, a corporate proxy, or a user's own config dir is **never touched**. Switching back to
Opus is therefore always clean, even if a GLM export leaked.
- **Mixed-model guard.** If exactly one role is GLM (a half-GLM/half-Claude config — almost always a
mistake, since `ANTHROPIC_BASE_URL` is process-global and the "Claude" half would silently inherit z.ai),
it warns and **normalizes both roles to GLM** so a run can never silently route the two roles to different
providers.
- **Product is forced clean-Claude.** `scripts/play.sh` + `scripts/play_party.sh` conditionally neutralize
ambient GLM env before any `claude -p`, so the `.app` always runs Claude (Opus) quality and never opts into
GLM. **QA uses GLM via `qa/glm_profile.sh`; the product play path never does.**
- **The scorer is ALWAYS isolated-Claude.** `qa/score.sh` runs the pinned-`sonnet` scorer under
`env -u ANTHROPIC_BASE_URL -u ANTHROPIC_API_KEY …` so it uses clean `~/.claude` (Anthropic OAuth)
**regardless of which model PLAYED the game** — the measurement is on a constant scorer, never on GLM.
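
The profile rules above can be sketched in a few lines. This is a minimal Python illustration of the decision logic only, not the contents of `qa/glm_profile.sh` (which is a shell script). The helper names, the `glm` name-prefix test, the `z.ai` host match, and the shape of the `glm_env` mapping (values loaded from `glm.env`) are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical constants for illustration; the real profile is a shell script.
GLM_HOST = "z.ai"
GLM_CONFIG_DIR = "/tmp/glm-claude-config"
SCRUB_VARS = ("ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN",
              "ANTHROPIC_API_KEY", "API_TIMEOUT_MS")


def is_glm(model):
    """Assumed naming rule: any model whose name starts with 'glm' is GLM."""
    return bool(model) and model.lower().startswith("glm")


def resolve_roles(dm_model, actor_model):
    """Return (dm, actor, apply_profile, warning) for one model choice."""
    dm_glm, actor_glm = is_glm(dm_model), is_glm(actor_model)
    if not dm_glm and not actor_glm:
        # No-op for Claude: nothing is rewritten, the profile is not applied.
        return dm_model, actor_model, False, None
    if dm_glm and actor_glm:
        return dm_model, actor_model, True, None
    # Mixed-model guard: exactly one role is GLM, so normalize both to GLM.
    glm_model = dm_model if dm_glm else actor_model
    return glm_model, glm_model, True, "mixed-model config normalized to " + glm_model


def is_glm_value(name, value, glm_env):
    """GLM-conditional match: exact match to glm.env, or a z.ai base URL."""
    if glm_env.get(name) == value:
        return True
    if name == "ANTHROPIC_BASE_URL":
        host = urlparse(value).hostname or ""
        return host == GLM_HOST or host.endswith("." + GLM_HOST)
    return False


def scrub_glm_env(env, glm_env):
    """Return a copy of env with only GLM-injected values removed."""
    cleaned = dict(env)
    for name in SCRUB_VARS:
        if name in cleaned and is_glm_value(name, cleaned[name], glm_env):
            del cleaned[name]
    if cleaned.get("CLAUDE_CONFIG_DIR") == GLM_CONFIG_DIR:
        del cleaned["CLAUDE_CONFIG_DIR"]
    return cleaned
```

Because every removal is conditional on a GLM match, a legitimate `sk-ant-` key or an `api.anthropic.com` base URL passes through `scrub_glm_env` unchanged, which is the property the switch-back bullet describes.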

**When to use GLM:** cheap batch QA sweeps to save Anthropic tokens (bug-finding, build smoke, parallel
duos). **NOT the final release gate** — Claude stays the quality bar; GLM is a viable cheap batch-sweep
engine, not a replacement for Claude on the release scorecard. Run it via
`WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2`; the profile auto-wires the z.ai endpoint + raised
timeouts/retry ceilings (GLM is ~2–3× slower than Opus). See the **GLM QA lane** notes in `WorldOS-RUNBOOK.md`
and `WorldOS-GUI-RUNBOOK.md`.

### The cap-rate finding (honest-measurement repair, NOT a GLM weakness)
The ~30% GLM "cap rate" that early overnight sweeps showed was **NOT a GLM quality weakness** — it was
**self-inflicted, model-agnostic over-aggressive FATAL gates** capping **both** models. The Phase-2 reorient
ran a GLM-vs-Claude 1-v-1 and immediately found that **Claude opus runs RED-capped too** (2/2). A RED
behavioral gate caps all three lenses to ≤2.5, and several FATAL gates were nuking *legitimate* short
emergent sessions on both models. Two root causes, both fixed:
- **`no_rejected_tool_calls`** — a model passing a string/comma-string where a list arg was expected was
rejected by Pydantic → FATAL. Fixed at the validation layer (#1027: a `BeforeValidator` coerces
`str → [s]` / comma-`str → split`, schema unchanged, genuinely-wrong types still rejected).
- **`party_traveled` / `combat_not_left_active`** — a deep single-scene social duo read as "never left the
opening scene," and a 6-beat duo that truncated mid-fight read as "combat abandoned" → FATAL. Fixed by
making severity **beat-scoped / discriminator-aware** (#1030: WARN below the single-scene/late-start
threshold, FATAL only for a genuine stuck-DM or real abandon). **Adversarially verified: no true
integrity gate was weakened** — the corpus fixtures still RED genuine failures (player-seated,
rejected-tools, dice, dm-output, SRD-correctness, xp all untouched).

This is the spirit of the north star: **scores are measurement, never the target.** The gate had been
FALSE-CAPPING good story-craft (short single-scene / truncated-combat sessions are legitimate, and pillar 1
is story-craft first) — fixing it makes the measurement *honest*. This is the OPPOSITE of score-gaming.

**Honest GLM-vs-Claude quality (measured on the fixed engine, 2026-06-19).** Same-SHA 1-v-1
(`43a5ecc`: #1027 coercion + #1028 clean-profile + #1030 gate-severity), same world/persona/6-beats,
both scored by the isolated Claude sonnet scorer. 5 runs (3 Claude opus/sonnet + 2 GLM-5.2), **all
behavioral GREEN — 0 RED-caps** (vs the pre-fix ~30%, which was the self-inflicted gate false-cap, NOT
a GLM weakness):

| model | story | mech | angry | cold-open |
|---|---|---|---|---|
| Claude (opus DM / sonnet actor) | **4.13** | 3.67 | 3.33 | ~205–249s |
| GLM-5.2 (both roles) | 3.9 | **3.8** | **3.4** | **604–872s** |

**Verdict:** GLM is **comparable quality** — within ~0.2 on every lens; *higher* on mechanical (3.8 vs
3.67) and angry-DM (3.4 vs 3.33), ~0.2 lower on story-craft (3.9 vs 4.13). A real QA runner, not
degraded. **Its true cost is LATENCY** — GLM cold-opens run 604–872s (3–4× Claude's ~205–249s) and
routine beats ~120–166s (vs ~80–96s), so a 3-run GLM batch is ~2.5–3.5h. ⇒ **Use GLM for cheap
overnight / VM batch sweeps where latency is hidden; never interactive, and never the final release
gate (Claude stays the quality bar).** Both models sit BELOW the RRI release bar (story ≥4.3, mech
≥4.5): story ~3.9–4.1 is close, while the mech ~3.7–3.8 gap is largely the emergent-social-duo coverage
artifact (little combat to score), not an engine defect (Engine-Excellent is met).

## Validation ladder (cheap → expensive; before any model/effort spend)
digest-correctness (1 engine call, no LLM) → cache-stability (1 two-beat run) → effort/flag-wiring probe
(confirm the runner consumes the flag — see worldos-dev "QA must exercise the flag") → short duo A/B on the
91 changes: 90 additions & 1 deletion qa/SCORING.md
@@ -1,10 +1,16 @@
# WorldOS QA Scoring System — standardized reference

> Source of truth for HOW we measure a playtest. Current as of 2026-05-26.
> Source of truth for HOW we measure a playtest. Current as of 2026-06-19 (post-24h reorient).
> The running results ledger is `qa/scores_db.py` (SQLite) → `qa/scores_ledger.md` (`add_run()` / `--render`); `qa/SCORECARD.md` is LEGACY narrative.
> For the current app/native handoff tools and RRI routing, start with `qa/QA_TOOLS.md` and
> `WorldOS-GUI-RUNBOOK.md`; this file describes the story/mechanical scoring model.

> **Everything here is MEASUREMENT, never the target.** The north star (`VISION.md`) is the
> *felt player session* — a no-prior-knowledge player plays a complete 8-beat Baldur's-Gate-caliber
> arc and never once feels "this is broken." Scores, RRI, and rubric numbers exist only to *measure*
> that; **no score-gaming.** The gate-severity work in §1a is the sharp edge of this: making the
> measurement HONEST (it was punishing legitimate good story-craft) is the opposite of gaming a number.

The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM
lenses** (each 1–5). The gate is the honest floor; the lenses grade quality above it.

@@ -29,6 +35,69 @@ If the gate is RED, all three LLM scorecards are **capped to ≤ 2.5 / INVALID**
annotated with the failed checks (`worldos_cap_score_red`). A dead/non-progressing scene
can never display as 4.1 again. On a GREEN run, scores pass through untouched.
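
The RED-cap rule can be illustrated with a short sketch. This is a hypothetical Python rendering of the behavior described above, not the `worldos_cap_score_red` implementation; the function name, the lens keys, and the annotation format are assumptions.

```python
RED_CAP = 2.5  # ceiling applied to every lens when the behavioral gate is RED


def cap_scores_if_red(gate_green, scores, failed_checks=()):
    """Return (scores, note). GREEN passes through; RED caps each lens."""
    if gate_green:
        # GREEN run: scores are returned untouched, with no annotation.
        return dict(scores), None
    capped = {lens: min(value, RED_CAP) for lens, value in scores.items()}
    note = "INVALID (gate RED): " + ", ".join(failed_checks)
    return capped, note
```

A lens already below the cap keeps its own value; only scores above 2.5 are pulled down.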

## 1a. Gate severity — a FATAL must mean a true integrity failure (honest-measurement repair)
Because a RED caps **all three lenses ≤ 2.5**, the line between **FATAL** (caps the run) and
**WARN** (advisory, doesn't cap) IS the measurement. The contract (the post-24h reorient,
PR #1030 — paired with the #1027 coercion fix below):

> A FATAL behavioral gate fires **only on a true integrity / correctness failure** — no PC
> seated, a rejected/validation-walled tool call, dice never used, a save corrupted by a real
> engine bug, a fight genuinely abandoned. It must **NOT** fire on a quality/completeness signal
> that a *legitimate short emergent duo* trips. Those are WARNs.

A Phase-2 GLM-vs-Claude 1-v-1 found **Claude opus runs RED-capping too** (2/2), so two FATAL gates
were demoting good story-craft (VISION pillar 1) on **both** models — a **model-agnostic false-cap**,
not a model quirk. The two beat-scoped fixes (`qa/assert_behavioral.py`):

- **`party_traveled`** (`assert_behavioral.py:676–696`). The bare `visited >= 2` rule read a deep
6–7-beat **single-scene social duo** as "never left the opening scene" → FATAL. **Now beat-scoped:**
`SINGLE_SCENE_MIN_BEATS = 8` — below 8 beats this is a **WARN** (a single-scene vignette is not a
stuck DM); **at/above 8** it stays **FATAL**. The strict anti-gaming in-place-progression exception
(the run must have **advanced the clock AND resolved an actual completed quest** — `clock_advanced
AND arc_resolved`, deliberately *not* clock-only or beats-only, adversarially verified against a
cheap-`set_quest_status("active")` game) is **unchanged**; a substantial run that never moves *and*
never progresses is still a FATAL stuck DM.
- **`combat_not_left_active`** (`assert_behavioral.py:326–397`). A 6-beat duo that **enters combat
near its beat budget and truncates mid-fight** legitimately never reaches `end_combat` → the old bare
FATAL capped a run that did nothing wrong (proven: `qa/transcripts/claude-1v1-2`, an opus duo whose
final DM line is cut off mid-sentence). **Now severity rides a `started_late` discriminator** (where
the last `start_combat` lands in the ordered tool stream): a fight that started in the **final ~20%**
of calls — or a **resume-into-combat** session with **no `start_combat` this run** — is a
**truncation → WARN**. Only a **genuine abandon** (a substantial run `≥ COMBAT_ABANDON_MIN_BEATS = 10`
where combat started **early** with room to resolve, `end_combat` never fired, and the fight is **still
active** at the snapshot) stays **FATAL** — that corrupts the next load (and the engine's `start_combat`
next-load guard is the deeper backstop).
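
The two beat-scoped rules can be sketched as severity functions. This is a simplified illustration under stated assumptions, not the code in `qa/assert_behavioral.py`: the function signatures, the `"PASS"` / `"WARN"` / `"FATAL"` return values, the order of the checks, and the `LATE_START_FRACTION` encoding of "final ~20% of calls" are all hypothetical. Only the two thresholds (8 and 10) come from the text above.

```python
SINGLE_SCENE_MIN_BEATS = 8
COMBAT_ABANDON_MIN_BEATS = 10
LATE_START_FRACTION = 0.8  # assumed: a start in the last ~20% of calls is "late"


def party_traveled_severity(visited, beats, clock_advanced, arc_resolved):
    """Severity for a run judged on how many scenes the party visited."""
    if visited >= 2:
        return "PASS"
    # In-place-progression exception: needs BOTH conditions, not either one.
    if clock_advanced and arc_resolved:
        return "PASS"
    if beats < SINGLE_SCENE_MIN_BEATS:
        return "WARN"   # short single-scene vignette, not a stuck DM
    return "FATAL"      # substantial run that never moved and never progressed


def combat_left_active_severity(tool_calls, beats, combat_active_at_end):
    """Severity for a run whose snapshot may still have combat active."""
    if not combat_active_at_end:
        return "PASS"
    if "start_combat" not in tool_calls:
        return "WARN"   # resume-into-combat: no start_combat this run
    last_start = len(tool_calls) - 1 - tool_calls[::-1].index("start_combat")
    started_late = last_start >= LATE_START_FRACTION * len(tool_calls)
    if started_late:
        return "WARN"   # truncated mid-fight near the beat budget
    if beats >= COMBAT_ABANDON_MIN_BEATS:
        return "FATAL"  # early start, room to resolve, still active
    return "WARN"
```

In this sketch a 6-beat single-scene duo yields `WARN` from `party_traveled_severity`, while the same shape padded to 8 beats without progression yields `FATAL`, mirroring the corpus fixtures described below.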

**This is honest-measurement repair, NOT score-gaming.** A gate-severity audit classified *every*
FATAL gate KEEP-FATAL vs over-aggressive and changed **only** the two quality/completeness ones; an
adversarial verifier confirmed **no true integrity gate was weakened** — `player_in_party`,
`no_rejected_tool_calls`, `dice_used`, `dm_produced_output`, SRD-correctness, and the XP gates are all
**untouched** — and the behavioral-gate **corpus still REDs genuine failures** (`party_traveled` padded
to 8 beats, `combat_not_left_active` reshaped to a real-abandon profile — both still trip the preserved
FATAL path). The opus runs that wrongly RED-capped (`claude-1v1-1`/`claude-1v1-2`) are now GREEN; the
corpus + taxonomy suite (39) and `fast_gate` (226) stay green. This makes the floor measure *broken*,
so the lenses can grade good short story-craft instead of being capped to ≤ 2.5.

### 1a.1 List-arg coercion — the tool-arg contract (#1027)
The **#1 source** of the model-agnostic RED-cap was *upstream* of the gate: FastMCP validates a tool
call's args against the Pydantic type hints **before** the function body runs, so a model passing a
bare string (`approval_tags="honest_dealing"`) or a comma-string (`actor_ids="id1,id2"`) where a
**list** is expected was rejected ("Input should be a valid list") → the FATAL `no_rejected_tool_calls`
gate → all three lenses capped. This deflated **~30%** of runs and hit the **Claude** baseline
transcripts (`baseline-rc1`, `cue-thaw`) exactly as hard as GLM — *not* a GLM-only problem.

**The contract** (`servers/engine/models.py:21–63`): list-typed tool args coerce at the validation
layer via a reusable `_coerce_list` **`BeforeValidator`** (the `ListArg` / `StrListArg` / `OptStrListArg`
aliases), applied to the high-traffic DM-called args — `record_decision` (options / actor_ids /
approval_tags), `author_companion_gauges`, `start_combat` (combatant_ids / surpriser_ids), `cast_spell`
(target_ids), and the nested `persist_beat` decision path (`server.py:12582–12594`). Behavior:
`None → None`; a real `list → unchanged`; `"" → []`; `"foo" → ["foo"]`; `"a,b , c" → ["a","b","c"]`;
**anything genuinely wrong (int / dict) is returned as-is so Pydantic STILL rejects it loudly** — the
coercion is purely additive and never swallows a real type bug. Critically, a `BeforeValidator` is
**invisible to `json_schema()`**, so the emitted wire schema stays a plain `array` and the pinned
schema byte-budget (`test_tool_schema_budget`) does not regress. The model gets coerced, not walled —
so a stringified list no longer manufactures a false `no_rejected_tool_calls` RED.
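
The coercion table above can be written as a small pure function. This is a dependency-free sketch of the described behavior, not the `_coerce_list` source in `servers/engine/models.py`; the function name `coerce_list` and the handling of whitespace-only strings are assumptions. The alias shown in the trailing comment is an assumed shape for how such a function is typically attached with Pydantic's `BeforeValidator`.

```python
def coerce_list(value):
    """Coerce a stringified list arg into a list; pass everything else through."""
    if value is None or isinstance(value, list):
        return value  # None stays None, a real list is unchanged
    if isinstance(value, str):
        # "" -> [], "foo" -> ["foo"], "a,b , c" -> ["a", "b", "c"]
        return [part.strip() for part in value.split(",") if part.strip()]
    # Genuinely wrong types (int, dict, ...) are returned as-is so that the
    # downstream type validation still rejects them.
    return value


# Assumed wiring (requires pydantic v2; not executed here):
#   from typing import Annotated
#   from pydantic import BeforeValidator
#   StrListArg = Annotated[list[str], BeforeValidator(coerce_list)]
```

Returning wrong types untouched is the design choice that keeps the coercion additive: the function only widens what is accepted for strings and never masks a real type bug.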

## 1b. Feature-engagement coverage — the dead-system tracker (WS0)
`qa/feature_engagement.py`

@@ -172,3 +241,23 @@ toward the new measured max.
gate. `qa/test_lens_variance.py` is the deterministic, CI-safe guard that keeps this
floor honest (it reads only on-disk artifacts; live re-derivation is an explicit,
opt-in, non-CI step gated behind `WORLDOS_LIVE_SCORER=1`).

## 7. Timing columns — where a beat's seconds go (Wave-1)
Additive observability, not a gate. A finished run now reports **where time goes**, flowing
**per-tool-call sidecar → `qa/latency_rollup.py` → `qa/scores.db` columns → the `story_readout` TIMING
stamp**. The engine wraps each `@mcp.tool()` once and, **only when `WORLDOS_TOOLTIMING_PATH` is set**
(default-OFF — production pays nothing), appends `{ts, tool, wall_ms, ok, campaign_id}` per call to a
JSONL sidecar (PR #1006). `latency_rollup.py` (PR #1007) then derives two dimensions: **per-kind
generation** means — `combat_s_per_beat` / `social_s_per_beat` / `camp_s_per_beat`, each the mean beat
`duration_api_ms` over beats classified by their tool calls (cold-open / combat / camp / social; combat
outranks camp) **straight from the transcripts, no sidecar needed**; and a **tool-exec split** from the
optional sidecar: `mean_tool_call_ms`, `slowest_tool` (largest *total* summed `wall_ms`), and
`tool_exec_pct` (= Σ tool `wall_ms` ÷ Σ whole-beat `duration_ms`, with a `tool_exec_pct_basis` stamp).
These land as additive `scores.db` columns (`combat_s_per_beat`, `social_s_per_beat`, `mean_tool_call_ms`,
`slowest_tool`, `tool_exec_pct`, `duration_wall_s`; old rows read NULL via `ALTER TABLE`) and as a
one-line `TIMING |` readout next to `COVERAGE`, e.g.
`TIMING | beat~86s gen~96s cold~240s | combat~140s social~70s camp~95s | tool=3% slowest=scene_context`
(the tool clause is omitted when no sidecar). **The headline finding: engine tool-exec is only ~1–4% of
a beat** — routine beats are ~90–100% **generation/decode-bound** (Opus more so, extended thinking), so
when a combat turn feels slow it's the *model thinking*, not the tools. Everything degrades to `None`
without a sidecar, leaving the rest of the rollup byte-identical.
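
The tool-exec split can be sketched as a small rollup over the JSONL sidecar. This is an illustrative Python sketch, not the code in `qa/latency_rollup.py`; the function name and signature are assumptions, and it expresses `tool_exec_pct` as a percentage with both sums in milliseconds. Only the record fields (`tool`, `wall_ms`) and the three output names come from the description above.

```python
import json
from collections import defaultdict


def rollup_tool_timing(sidecar_lines, total_beat_duration_ms):
    """Derive the tool-exec split from per-call JSONL records."""
    totals = defaultdict(float)  # tool name -> summed wall_ms
    calls = 0
    for line in sidecar_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        totals[record["tool"]] += record["wall_ms"]
        calls += 1
    if calls == 0:
        # No sidecar data: every tool-exec field degrades to None.
        return {"mean_tool_call_ms": None, "slowest_tool": None, "tool_exec_pct": None}
    total_ms = sum(totals.values())
    pct = 100.0 * total_ms / total_beat_duration_ms if total_beat_duration_ms else None
    return {
        "mean_tool_call_ms": total_ms / calls,
        # Largest TOTAL summed wall_ms, not the single slowest call.
        "slowest_tool": max(totals, key=totals.get),
        "tool_exec_pct": pct,
    }
```

Ranking `slowest_tool` by summed time rather than by the worst single call means a cheap tool that is invoked constantly can outrank an expensive tool that is called once.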