docs: post-reorient currency - gate severity (honest measurement), GLM model-profile, timing (#1031)
…-arg coercion, timing columns
… lane + cap-rate finding

Post-24h-reorient doc currency, grounded in merged #1026/#1027/#1028/#1030.

- docs/MODEL-TIERING-STRATEGY.md: new "GLM as a cheap batch-QA engine" section.
  - The clean model-profile system: a single WORLDOS_DM_MODEL flows coherently; no-op for Claude; defensive GLM-env scrub so switch-back to Opus is always clean; mixed-model guard; product play.sh/play_party.sh forced clean-Claude; scorer always isolated Claude.
  - WHEN to use GLM: cheap batch QA to save Anthropic tokens, NOT the release gate.
  - The cap-rate finding: the ~30% GLM cap rate was self-inflicted, caused by over-aggressive FATAL gates capping BOTH models (now fixed via #1027 + #1030), NOT a GLM weakness. This is honest-measurement repair, the opposite of score-gaming.
  - PLACEHOLDER (no invented numbers) for the in-flight honest GLM-vs-Claude re-measure; pre-fix ~3.6/~3.6 marked superseded.
- WorldOS-RUNBOOK.md + WorldOS-GUI-RUNBOOK.md: brief "GLM QA lane" notes (WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2; profile auto-wires endpoint + raised timeouts across run_duo/run_party/run_combat_sprint/ui_playtest; scorer stays Claude), cross-linked to MODEL-TIERING.
- qa/SCORING.md: §1a gate-severity contract (FATAL = true integrity only) + §1a.1 list-arg coercion (#1027) + §7 timing observability; north-star measurement-not-target framing.

Additive only; Claude paths byte-identical. Anchored on VISION.md (felt session is the product; scores are measurement, never the target; no score-gaming).
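The model-profile behaviour listed above (one model choice flows through, no-op for Claude, defensive GLM-env scrub, mixed-model guard) can be illustrated with a minimal Python sketch. Only `WORLDOS_DM_MODEL` and `WORLDOS_ACTOR_MODEL` come from this PR; the `GLM_` key prefix, the endpoint and timeout values, the `claude-opus` model string used in the usage note, and the function name are hypothetical stand-ins, and the exact guard semantics are an assumption rather than the repository's implementation.

```python
from typing import Dict, Mapping

# Hypothetical names: the real profile's env keys and endpoint are not given in this PR.
GLM_ENV_PREFIX = "GLM_"                       # assumed prefix for GLM-only settings
GLM_ENDPOINT = "https://glm.example.invalid"  # placeholder endpoint
GLM_TIMEOUT_S = "1800"                        # placeholder raised timeout


def resolve_profile(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a child-process environment for the model named by WORLDOS_DM_MODEL."""
    # Defensive scrub: drop GLM-only keys first, so switching back to Claude
    # never inherits stale GLM settings from a previous run.
    out = {k: v for k, v in env.items() if not k.startswith(GLM_ENV_PREFIX)}

    dm = env.get("WORLDOS_DM_MODEL", "")
    actor = env.get("WORLDOS_ACTOR_MODEL", dm)  # actor follows the DM by default

    dm_is_glm = dm.startswith("glm")
    actor_is_glm = actor.startswith("glm")
    # Mixed-model guard (assumed semantics): refuse GLM on one side only.
    if dm_is_glm != actor_is_glm:
        raise ValueError(f"mixed-model run refused: dm={dm!r} actor={actor!r}")

    if dm_is_glm:
        # GLM profile: wire the endpoint and the raised timeout.
        out[GLM_ENV_PREFIX + "ENDPOINT"] = GLM_ENDPOINT
        out[GLM_ENV_PREFIX + "TIMEOUT_S"] = GLM_TIMEOUT_S
    # Claude profile: nothing is added, so the Claude path is a no-op.
    return out
```

Calling `resolve_profile({"WORLDOS_DM_MODEL": "claude-opus", "GLM_ENDPOINT": "stale"})` returns an environment with no `GLM_` keys, which is the "switch-back is always clean" property in miniature.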
Walkthrough: Adds documentation for a GLM 5.2 cost-saving QA lane to two runbooks and the model tiering strategy doc, establishing env var switching, profile wiring, and env-scrubbing rules.
5-run same-SHA 1-v-1 on the fixed engine (43a5ecc): Claude story 4.13/mech 3.67/angry 3.33, GLM story 3.9/mech 3.8/angry 3.4. All runs are behavioral GREEN (0 RED-caps, vs ~30% pre-fix). GLM is comparable in quality (within ~0.2: higher on mech and angry, ~0.2 lower on story); its real cost is LATENCY (cold-opens 604-872s, 3-4x Claude), so it suits cheap overnight/VM batch sweeps, not interactive use and not the release gate. Both models are below the RRI bar (story 4.3/mech 4.5).
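The arithmetic behind "within ~0.2" and "below the RRI bar" can be checked with a short Python sketch. The scores and bar values are the ones quoted above; the dictionary layout and function names are illustrative only, and no bar is stated for the angry axis, so it is left out of the bar comparison.

```python
from typing import Dict

# Scores quoted in the comment above (5-run same-SHA 1-v-1 on 43a5ecc).
SCORES: Dict[str, Dict[str, float]] = {
    "Claude": {"story": 4.13, "mech": 3.67, "angry": 3.33},
    "GLM": {"story": 3.9, "mech": 3.8, "angry": 3.4},
}
# RRI bar as quoted; no bar for "angry" is given, so none is assumed here.
RRI_BAR: Dict[str, float] = {"story": 4.3, "mech": 4.5}


def gap_to_bar(scores: Dict[str, float], bar: Dict[str, float]) -> Dict[str, float]:
    """Positive values mean the score is still below the bar on that axis."""
    return {axis: round(bar[axis] - scores[axis], 2) for axis in bar}


def delta(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Per-axis difference a - b, rounded to two decimals."""
    return {axis: round(a[axis] - b[axis], 2) for axis in a}


if __name__ == "__main__":
    print("GLM - Claude:", delta(SCORES["GLM"], SCORES["Claude"]))
    for model, scores in SCORES.items():
        print(model, "gap to RRI bar:", gap_to_bar(scores, RRI_BAR))
```

On these numbers GLM trails Claude by 0.23 on story and leads by 0.13 on mech and 0.07 on angry, and both models show a positive gap to the bar on story and mech.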
Brings the docs current with the post-24h reorient. Anchored on VISION: scores are measurement, never the target — no score-gaming.
- `qa/SCORING.md`: the behavioral-gate severity principle (FATAL only for true integrity/correctness, not quality/completeness on short emergent runs): `party_traveled` is beat-scoped (WARN <8 beats), and `combat_not_left_active` is `started_late`-aware. Framed as honest-measurement repair: the gates were FALSE-CAPPING legitimate single-scene / truncated-combat sessions on both models, which is the opposite of score-gaming; adversarially verified, and the corpus still REDs real failures. Plus the list-arg coercion contract (#1027, "fix(engine): coerce string/comma-string → list on hot tool args (kills the model-agnostic RED-cap)") and the Wave-1 timing columns.
- `docs/MODEL-TIERING-STRATEGY.md`: the clean GLM model-profile system (one model choice flows; no GLM leak into Claude; switch-back clean; product always Claude; scorer always Claude); when to use GLM (cheap batch QA, not the release gate); the cap-rate finding (the ~30% was self-inflicted gates, not a GLM weakness). The pre-fix GLM read is labeled SUPERSEDED; final numbers are a placeholder pending the in-flight 1-v-1.
- `WorldOS-RUNBOOK.md` / `WorldOS-GUI-RUNBOOK.md`: the GLM QA lane.

Adversarial verifier: SHIP (grounded in code, honest-measurement framing, no invented numbers, no stale claims). Docs-only. Merge held only until the 1-v-1 re-run fills the real GLM-vs-Claude numbers.
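Two of the contracts named above, list-arg coercion and a beat-scoped gate, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: the function names, the `"OK"` label, and the choice to treat a failed `party_traveled` check as FATAL at 8 or more beats are all hypothetical; only the "string/comma-string → list" behaviour and the "WARN <8 beats" threshold come from the description.

```python
from typing import Any, List

MIN_BEATS_FOR_FATAL = 8  # from "WARN <8 beats"; behaviour at >= 8 is an assumption


def coerce_list_arg(value: Any) -> List[str]:
    """Coerce a tool argument that should be a list into a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        # "a, b" -> ["a", "b"]; "a" -> ["a"]; "" -> []
        return [part.strip() for part in value.split(",") if part.strip()]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    raise TypeError(f"cannot coerce {type(value).__name__} to a list")


def party_traveled_severity(traveled: bool, beats: int) -> str:
    """Beat-scoped severity: short runs that never travel only WARN."""
    if traveled:
        return "OK"  # hypothetical label for a passing gate
    # A short emergent run may legitimately stay in one scene, so the gate
    # only warns; longer runs are assumed here to keep the FATAL severity.
    return "WARN" if beats < MIN_BEATS_FOR_FATAL else "FATAL"
```

For example, `coerce_list_arg("goblin, orc")` yields `["goblin", "orc"]`, and a 5-beat session that never travels gets `"WARN"` rather than a cap.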