docs: post-reorient currency — gate severity (honest measurement), GLM model-profile, timing by 100yenadmin · Pull Request #1031 · electricsheephq/WorldOS

100yenadmin · 2026-06-19T09:29:55Z

Brings the docs current with the post-24h reorient. Anchored on VISION: scores are measurement, never the target — no score-gaming.

- qa/SCORING.md: the behavioral-gate severity principle (FATAL only for true integrity/correctness failures, not for quality/completeness on short emergent runs). Two gates change: party_traveled is now beat-scoped (WARN under 8 beats) and combat_not_left_active is started_late-aware. This is framed as an honest-measurement repair: the gates were FALSE-CAPPING legitimate single-scene and truncated-combat sessions on both models, which is the opposite of score-gaming (adversarially verified; the corpus still REDs real failures). Also adds the list-arg coercion contract from #1027 (fix(engine): coerce string/comma-string → list on hot tool args, which kills the model-agnostic RED-cap) and the Wave-1 timing columns.
- docs/MODEL-TIERING-STRATEGY.md: the clean GLM model-profile system (one model choice flows through; no GLM leak into Claude; switch-back is clean; product always Claude; scorer always Claude); when to use GLM (cheap batch QA, not the release gate); and the cap-rate finding (the ~30% was self-inflicted gates, not a GLM weakness). The pre-fix GLM read is labeled SUPERSEDED; final numbers are a placeholder pending the in-flight 1-v-1.
- WorldOS-RUNBOOK.md / WorldOS-GUI-RUNBOOK.md: the GLM QA lane.
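
The severity rule described for qa/SCORING.md can be sketched roughly as below. This is an illustration only, not the code in qa/assert_behavioral.py: the `Session` fields and function names are hypothetical, and the sketch assumes that a gate stays FATAL once a run reaches 8 beats and that a late-started combat is downgraded to WARN.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FATAL = "FATAL"


# Threshold taken from the PR text ("WARN <8 beats").
MIN_BEATS_FOR_FATAL = 8


@dataclass
class Session:
    # Hypothetical field names; the real session record may differ.
    beats: int
    party_traveled: bool
    combat_active_at_end: bool
    started_late: bool


def party_traveled_gate(s: Session) -> Severity:
    """Beat-scoped: a short run that never traveled is a WARN, not a FATAL."""
    if s.party_traveled:
        return Severity.PASS
    # Assumption: at 8 beats or more the gate keeps its FATAL severity.
    return Severity.WARN if s.beats < MIN_BEATS_FOR_FATAL else Severity.FATAL


def combat_not_left_active_gate(s: Session) -> Severity:
    """started_late-aware: combat cut off by the run ending is not an integrity failure."""
    if not s.combat_active_at_end:
        return Severity.PASS
    # Assumption: a combat that started late is downgraded to WARN.
    return Severity.WARN if s.started_late else Severity.FATAL
```

Under these assumptions a 5-beat single-scene session that never traveled yields WARN instead of capping the score, while a 12-beat session with the same behavior still goes RED.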

Adversarial verifier: SHIP (grounded in code, honest-measurement framing, no invented numbers, no stale claims). Docs-only. Merge held only until the 1-v-1 re-run fills the real GLM-vs-Claude numbers.

Summary by CodeRabbit

Documentation
- Added GLM 5.2 QA lane guidance for cost-effective batch QA operations
- Updated model tiering strategy clarifying Claude as release quality bar
- Enhanced scoring documentation with gate severity rules and optional tool-timing observability
Bug Fixes
- Fixed tool argument type coercion normalization
- Fixed severity discrimination for truncation and late-start scenarios

…-arg coercion, timing columns

… lane + cap-rate finding

Post-24h-reorient doc currency, grounded in merged #1026/#1027/#1028/#1030.

- docs/MODEL-TIERING-STRATEGY.md: new "GLM as a cheap batch-QA engine" section covering the clean model-profile system (single WORLDOS_DM_MODEL flows coherently; no-op for Claude; defensive GLM-env scrub so switch-back to Opus is always clean; mixed-model guard; product play.sh/play_party.sh forced clean-Claude; scorer always isolated Claude). WHEN to use GLM (cheap batch QA, save Anthropic tokens; NOT the release gate). The cap-rate finding: the ~30% GLM cap rate was self-inflicted over-aggressive FATAL gates capping BOTH models (now fixed via #1027 + #1030), NOT a GLM weakness; this is honest-measurement repair, the opposite of score-gaming. PLACEHOLDER (no invented numbers) for the in-flight honest GLM-vs-Claude re-measure; pre-fix ~3.6/~3.6 marked superseded.
- WorldOS-RUNBOOK.md + WorldOS-GUI-RUNBOOK.md: brief "GLM QA lane" notes (WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2; profile auto-wires endpoint + raised timeouts across run_duo/run_party/run_combat_sprint/ui_playtest; scorer stays Claude), cross-linked to MODEL-TIERING.
- qa/SCORING.md: §1a gate-severity contract (FATAL = true integrity only) + §1a.1 list-arg coercion (#1027) + §7 timing observability; north-star measurement-not-target framing.

Additive only; Claude paths byte-identical. Anchored on VISION.md (felt session is the product; scores are measurement, never the target; no score-gaming).
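
The model-profile rules in this commit message (one model choice flows through, defensive GLM-env scrub, mixed-model guard, forced clean-Claude for product and scorer) might look roughly like the sketch below. Only `WORLDOS_DM_MODEL` and `WORLDOS_ACTOR_MODEL` come from the PR; the `GLM_*` variable names, the endpoint URL, the timeout value, and the exact behavior of the guard are hypothetical placeholders, not what qa/glm_profile.sh actually does.

```python
# Hypothetical GLM-only variable names; the PR does not list the real ones.
GLM_ONLY_VARS = ("GLM_BASE_URL", "GLM_API_KEY", "GLM_TIMEOUT_S")


def is_glm(model: str) -> bool:
    return model.startswith("glm-")


def build_child_env(base_env: dict, force_claude: bool = False) -> dict:
    """Return the environment for one QA child process.

    force_claude=True models the product scripts and the scorer, which
    ignore any GLM selection and always get a clean Claude environment.
    """
    env = dict(base_env)
    if force_claude:
        env.pop("WORLDOS_DM_MODEL", None)
        env.pop("WORLDOS_ACTOR_MODEL", None)

    dm = env.get("WORLDOS_DM_MODEL", "")
    actor = env.get("WORLDOS_ACTOR_MODEL", dm)

    # Mixed-model guard (assumed semantics): DM and actor must agree on family.
    if is_glm(dm) != is_glm(actor):
        raise ValueError(f"mixed model families: dm={dm!r} actor={actor!r}")

    if is_glm(dm):
        # Profile wiring: endpoint plus raised timeouts (placeholder values).
        env.setdefault("GLM_BASE_URL", "https://glm.example.invalid/v1")
        env.setdefault("GLM_TIMEOUT_S", "900")
    else:
        # Defensive scrub so switching back to Claude never inherits GLM state.
        for name in GLM_ONLY_VARS:
            env.pop(name, None)
    return env
```

In this sketch a Claude run is a no-op apart from the scrub, which is what keeps a stale GLM variable in the shell from leaking into a later Claude session.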

coderabbitai · 2026-06-19T09:34:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b3d915bb-2bc3-4b37-a33c-e79172f348b8

📥 Commits

Reviewing files that changed from the base of the PR and between 43a5ecc and 9c45b6a.

📒 Files selected for processing (4)

WorldOS-GUI-RUNBOOK.md
WorldOS-RUNBOOK.md
docs/MODEL-TIERING-STRATEGY.md
qa/SCORING.md

📝 Walkthrough

Adds documentation for a GLM 5.2 cost-saving QA lane to two runbooks and the model tiering strategy doc, establishing env var switching, profile wiring, and env-scrubbing rules. Updates qa/SCORING.md with gate-severity definitions, list-arg coercion contract, and optional tool-timing observability columns.

Changes

GLM QA Lane and Scoring Documentation

Layer / File(s)	Summary
Model tiering strategy: GLM role, env rules, quality results `docs/MODEL-TIERING-STRATEGY.md`	Adds 74-line GLM section defining GLM as QA-only cost lever, model-profile env-scrubbing/guarding rules, explanation of prior cap-rate measurement errors and their fixes, and measured GLM-vs-Claude quality/latency results with operational guidance.
GLM QA lane instructions in runbooks `WorldOS-RUNBOOK.md`, `WorldOS-GUI-RUNBOOK.md`	Adds GLM QA lane subsections covering `WORLDOS_DM_MODEL=glm-5.2`/`WORLDOS_ACTOR_MODEL=glm-5.2` switches, `qa/glm_profile.sh` wiring, higher timeout/retry ceilings, defensive env scrubbing, and confirmation that `qa/score.sh` stays on Claude/Sonnet.
SCORING.md: header, gate severity, and list-arg coercion `qa/SCORING.md`	Updates header date and measurement framing; adds gate-severity section defining FATAL/WARN boundary with two repairs (`party_traveled` beat-scoping, `combat_not_left_active` `started_late` discriminator); documents list-arg coercion contract via `BeforeValidator` aliases without altering emitted JSON schema.
SCORING.md: tool-timing observability `qa/SCORING.md`	Adds timing columns section documenting optional `WORLDOS_TOOLTIMING_PATH` sidecar, per-beat/tool-exec rollup metrics, additive `scores.db` columns, and `TIMING
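
The list-arg coercion contract summarized in the table above (string or comma-string becomes a list before validation) could be approximated by a plain function like the one below. It is a sketch under assumptions: the function name is hypothetical, and the handling of empty strings and whitespace is a guess rather than the behavior merged in #1027.

```python
def coerce_to_list(value):
    """Normalize a tool argument that should be a list.

    - a list passes through unchanged
    - "goblin"       -> ["goblin"]
    - "goblin, orc"  -> ["goblin", "orc"]
    - ""             -> []   (assumed; the real contract may differ)
    - anything else is returned as-is so downstream validation can reject it
    """
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        # Split on commas and drop empty fragments and surrounding whitespace.
        return [part.strip() for part in value.split(",") if part.strip()]
    return value
```

With pydantic v2, a function of this shape can be attached to a field type as `Annotated[list[str], BeforeValidator(coerce_to_list)]`. Because the validator runs before type validation and the annotated type is still `list[str]`, the emitted JSON schema stays a plain array of strings, which matches the "without altering emitted JSON schema" note in the summary.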

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

electricsheephq/WorldOS#147: Introduced the QA scoring system documentation in qa/SCORING.md that this PR now extends with gate-severity and timing sections.
electricsheephq/WorldOS#1030: Implements the same party_traveled and combat_not_left_active behavioral gate-severity repairs in qa/assert_behavioral.py that this PR documents in qa/SCORING.md.

Poem

🐇 Hop, hop — GLM runs cheap through the night,
While Claude holds the gate for release's right!
Env vars scrubbed, no leakage in sight,
FATAL or WARN? The severity's tight.
Timing sidecars whisper small fractions of time —
The warren runs batches, and the docs are sublime! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description provides comprehensive context on the changes and rationale, but is missing the required CLA checklist and validation sections from the repository template.	Add the complete CLA checklist with appropriate checkboxes and a 'Validation' section listing the checks that were run before submission.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the three main documentation updates: post-reorientation currency, gate severity (honest measurement), GLM model-profile strategy, and timing columns.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


5-run same-SHA 1-v-1 on the fixed engine (43a5ecc): Claude story 4.13/mech 3.67/angry 3.33, GLM story 3.9/mech 3.8/angry 3.4 — all behavioral GREEN (0 RED-caps, vs ~30% pre-fix). GLM is comparable quality (within ~0.2; higher on mech+angry, ~0.2 lower on story); its real cost is LATENCY (cold-opens 604-872s, 3-4x Claude) → cheap overnight/VM batch sweeps, not interactive, not the release gate. Both below the RRI bar (story 4.3/mech 4.5).
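
The per-axis gaps behind the "within ~0.2" reading can be checked with a few lines of arithmetic. The scores are copied from the comment above; the variable and function names are illustrative only.

```python
# Mean scores from the 5-run same-SHA 1-v-1 reported above.
claude = {"story": 4.13, "mech": 3.67, "angry": 3.33}
glm = {"story": 3.9, "mech": 3.8, "angry": 3.4}
rri_bar = {"story": 4.3, "mech": 4.5}  # the comment gives no bar for "angry"


def delta(a: dict, b: dict) -> dict:
    """Per-axis a - b, rounded to 2 decimals, for axes present in both."""
    return {axis: round(a[axis] - b[axis], 2) for axis in a if axis in b}


glm_minus_claude = delta(glm, claude)
claude_vs_bar = delta(claude, rri_bar)
glm_vs_bar = delta(glm, rri_bar)

print(glm_minus_claude)  # {'story': -0.23, 'mech': 0.13, 'angry': 0.07}
print(claude_vs_bar)     # {'story': -0.17, 'mech': -0.83}
print(glm_vs_bar)        # {'story': -0.4, 'mech': -0.7}
```

GLM trails Claude by 0.23 on story and leads by 0.13 on mech and 0.07 on angry, and both models sit below the RRI bar on story and mech.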

Eva added 2 commits June 19, 2026 16:24

docs(qa): SCORING currency — gate severity (honest measurement), list…

b61d8ff

…-arg coercion, timing columns

100yenadmin merged commit f389325 into main Jun 19, 2026
20 checks passed

100yenadmin merged 3 commits into main from docs/post-reorient-currency
