docs: post-reorient currency — gate severity (honest measurement), GLM model-profile, timing#1031

Merged
100yenadmin merged 3 commits into main from docs/post-reorient-currency
Jun 19, 2026

100yenadmin (Member) commented Jun 19, 2026

Brings the docs current with the post-24h reorient. Anchored on VISION: scores are measurement, never the target — no score-gaming.

  • qa/SCORING.md: the behavioral-gate severity principle (FATAL only for true integrity/correctness failures, not for quality or completeness on short emergent runs). party_traveled is now beat-scoped (WARN under 8 beats) and combat_not_left_active is started_late-aware. This is framed as an honest-measurement repair: the gates were false-capping legitimate single-scene and truncated-combat sessions on both models, which is the opposite of score-gaming (adversarially verified; the corpus still REDs real failures). Also adds the list-arg coercion contract from #1027 (fix(engine): coerce string/comma-string → list on hot tool args, which kills the model-agnostic RED-cap) and the Wave-1 timing columns.
  • docs/MODEL-TIERING-STRATEGY.md: the clean GLM model-profile system (one model choice flows through; no GLM leak into Claude; switch-back is clean; the product is always Claude; the scorer is always Claude); when to use GLM (cheap batch QA, not the release gate); and the cap-rate finding (the ~30% was caused by self-inflicted gates, not a GLM weakness). The pre-fix GLM read is labeled SUPERSEDED; final numbers are a placeholder pending the in-flight 1-v-1.
  • WorldOS-RUNBOOK.md / WorldOS-GUI-RUNBOOK.md: the GLM QA lane.
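
The gate-severity rule in the first bullet can be read as a small decision function. The following is a minimal sketch, not the code in qa/assert_behavioral.py: `SessionFacts`, its field names, and both function names are hypothetical, and it assumes that a missing party_traveled stays FATAL at 8 or more beats and that `started_late` is available as a boolean.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FATAL = "FATAL"


# Threshold taken from the "WARN <8 beats" note in the PR description.
MIN_BEATS_FOR_TRAVEL_GATE = 8


@dataclass
class SessionFacts:
    """Hypothetical summary of one scored session (field names are illustrative)."""
    beats: int
    party_traveled: bool
    combat_active_at_end: bool
    combat_started_late: bool


def party_traveled_severity(s: SessionFacts) -> Severity:
    if s.party_traveled:
        return Severity.PASS
    # A short single-scene run may legitimately never travel: warn, do not cap.
    if s.beats < MIN_BEATS_FOR_TRAVEL_GATE:
        return Severity.WARN
    # Assumption: longer runs keep the original FATAL behavior.
    return Severity.FATAL


def combat_not_left_active_severity(s: SessionFacts) -> Severity:
    if not s.combat_active_at_end:
        return Severity.PASS
    # Combat that began late was cut off by the run length, not by a bug.
    if s.combat_started_late:
        return Severity.WARN
    return Severity.FATAL
```

Under these assumptions a 5-beat session that never travels yields WARN rather than a RED-cap, while a 12-beat session with the same gap is still FATAL.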

Adversarial verifier: SHIP (grounded in code, honest-measurement framing, no invented numbers, no stale claims). Docs-only. Merge held only until the 1-v-1 re-run fills the real GLM-vs-Claude numbers.

Summary by CodeRabbit

  • Documentation

    • Added GLM 5.2 QA lane guidance for cost-effective batch QA operations
    • Updated model tiering strategy clarifying Claude as release quality bar
    • Enhanced scoring documentation with gate severity rules and optional tool-timing observability
  • Bug Fixes

    • Fixed tool argument type coercion normalization
    • Fixed severity discrimination for truncation and late-start scenarios

Eva added 2 commits June 19, 2026 16:24
… lane + cap-rate finding

Post-24h-reorient doc currency, grounded in merged #1026/#1027/#1028/#1030.

- docs/MODEL-TIERING-STRATEGY.md: new "GLM as a cheap batch-QA engine" section —
  the clean model-profile system (single WORLDOS_DM_MODEL flows coherently; no-op for
  Claude; defensive GLM-env scrub so switch-back to Opus is always clean; mixed-model
  guard; product play.sh/play_party.sh forced clean-Claude; scorer always isolated
  Claude). WHEN to use GLM (cheap batch QA, save Anthropic tokens — NOT the release
  gate). The cap-rate finding: the ~30% GLM cap rate was self-inflicted over-aggressive
  FATAL gates capping BOTH models (now fixed via #1027 + #1030), NOT a GLM weakness —
  honest-measurement repair, the opposite of score-gaming. PLACEHOLDER (no invented
  numbers) for the in-flight honest GLM-vs-Claude re-measure; pre-fix ~3.6/~3.6 marked
  superseded.
- WorldOS-RUNBOOK.md + WorldOS-GUI-RUNBOOK.md: brief "GLM QA lane" notes (WORLDOS_DM_MODEL=
  glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2; profile auto-wires endpoint + raised timeouts across
  run_duo/run_party/run_combat_sprint/ui_playtest; scorer stays Claude), cross-linked to
  MODEL-TIERING.
- qa/SCORING.md: §1a gate-severity contract (FATAL = true integrity only) + §1a.1 list-arg
  coercion (#1027) + §7 timing observability; north-star measurement-not-target framing.

Additive only; Claude paths byte-identical. Anchored on VISION.md (felt session is the
product; scores are measurement, never the target — no score-gaming).
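
The model-profile behavior described in this commit message (one switch, a no-op for Claude, a defensive scrub, a mixed-model guard) can be sketched as a pure function over an environment dict. This is an illustration under stated assumptions, not the contents of qa/glm_profile.sh: the `EXAMPLE_GLM_*` variable names, the endpoint URL, and the timeout value are placeholders, and the guard is assumed to reject any run where only one of the DM and actor models is GLM.

```python
GLM_MODEL = "glm-5.2"

# Placeholder names: the PR does not list the profile's real endpoint variables.
GLM_ONLY_VARS = ("EXAMPLE_GLM_BASE_URL", "EXAMPLE_GLM_TIMEOUT_S")


def build_env(base: dict) -> dict:
    """Return the environment a QA run would use, given the caller's environment."""
    env = dict(base)
    dm = env.get("WORLDOS_DM_MODEL", "")
    # Assumption: the actor follows the DM choice unless set explicitly.
    actor = env.get("WORLDOS_ACTOR_MODEL", dm)

    dm_is_glm = dm == GLM_MODEL
    actor_is_glm = actor == GLM_MODEL
    if dm_is_glm != actor_is_glm:
        # Mixed-model guard: refuse to run half on GLM and half on Claude.
        raise ValueError(f"mixed models: DM={dm!r} actor={actor!r}")

    if not dm_is_glm:
        # Defensive scrub: stale GLM settings must not leak into a Claude run.
        for name in GLM_ONLY_VARS:
            env.pop(name, None)
        return env

    # GLM lane: wire the endpoint and the raised timeout.
    env.setdefault("EXAMPLE_GLM_BASE_URL", "https://glm.invalid/v1")
    env.setdefault("EXAMPLE_GLM_TIMEOUT_S", "900")
    return env
```

With no GLM variables present, the Claude path returns the environment unchanged, which is the property the "no-op for Claude" claim relies on.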
coderabbitai (Bot) commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b3d915bb-2bc3-4b37-a33c-e79172f348b8

📥 Commits

Reviewing files that changed from the base of the PR and between 43a5ecc and 9c45b6a.

📒 Files selected for processing (4)
  • WorldOS-GUI-RUNBOOK.md
  • WorldOS-RUNBOOK.md
  • docs/MODEL-TIERING-STRATEGY.md
  • qa/SCORING.md

📝 Walkthrough


Adds documentation for a GLM 5.2 cost-saving QA lane to two runbooks and the model tiering strategy doc, establishing env var switching, profile wiring, and env-scrubbing rules. Updates qa/SCORING.md with gate-severity definitions, list-arg coercion contract, and optional tool-timing observability columns.
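
The walkthrough calls the timing columns optional and, elsewhere in the review, additive. One common way to add columns without disturbing existing rows is an idempotent `ALTER TABLE ... ADD COLUMN`. The sketch below shows that technique with Python's `sqlite3` module; the table name `scores` and both column names are hypothetical, since the PR does not list the real schema.

```python
import sqlite3

# Hypothetical column names; the PR only says the timing columns are additive.
TIMING_COLUMNS = {
    "beat_seconds_mean": "REAL",
    "tool_exec_seconds_total": "REAL",
}


def add_timing_columns(conn: sqlite3.Connection, table: str = "scores") -> list:
    """Add any missing timing columns and return the names that were added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in TIMING_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Rows written before the migration read back as NULL in the new columns, so older sessions remain queryable and a second call adds nothing.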

Changes

GLM QA Lane and Scoring Documentation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Model tiering strategy: GLM role, env rules, quality results | docs/MODEL-TIERING-STRATEGY.md | Adds 74-line GLM section defining GLM as QA-only cost lever, model-profile env-scrubbing/guarding rules, explanation of prior cap-rate measurement errors and their fixes, and measured GLM-vs-Claude quality/latency results with operational guidance. |
| GLM QA lane instructions in runbooks | WorldOS-RUNBOOK.md, WorldOS-GUI-RUNBOOK.md | Adds GLM QA lane subsections covering WORLDOS_DM_MODEL=glm-5.2/WORLDOS_ACTOR_MODEL=glm-5.2 switches, qa/glm_profile.sh wiring, higher timeout/retry ceilings, defensive env scrubbing, and confirmation that qa/score.sh stays on Claude/Sonnet. |
| SCORING.md: header, gate severity, and list-arg coercion | qa/SCORING.md | Updates header date and measurement framing; adds gate-severity section defining FATAL/WARN boundary with two repairs (party_traveled beat-scoping, combat_not_left_active started_late discriminator); documents list-arg coercion contract via BeforeValidator aliases without altering emitted JSON schema. |
| SCORING.md: tool-timing observability | qa/SCORING.md | Adds timing columns section documenting optional WORLDOS_TOOLTIMING_PATH sidecar, per-beat/tool-exec rollup metrics, additive scores.db columns, and `TIMING |
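
The list-arg coercion contract summarized above (string or comma-string in, list out) can be illustrated with a plain function. This is a sketch of the idea only: `coerce_to_list` is a hypothetical name, and the whitespace-stripping and empty-item handling are assumptions rather than documented engine behavior. In a Pydantic v2 model, a function like this could plausibly be attached as `Annotated[list[str], BeforeValidator(coerce_to_list)]`, which changes input handling without changing the declared field type.

```python
from typing import Any


def coerce_to_list(value: Any) -> Any:
    """Normalize a tool argument to a list of strings where that is unambiguous."""
    if value is None or isinstance(value, list):
        # Already in the expected shape (or absent): leave untouched.
        return value
    if isinstance(value, tuple):
        return list(value)
    if isinstance(value, str):
        # "a, b" -> ["a", "b"]; "goblin" -> ["goblin"]; "" -> [].
        return [part.strip() for part in value.split(",") if part.strip()]
    # Anything else is passed through so normal validation can reject it.
    return value
```

For example, `coerce_to_list("goblin, orc")` yields `["goblin", "orc"]`, and a value that is already a list is returned as-is.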

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • electricsheephq/WorldOS#147: Introduced the QA scoring system documentation in qa/SCORING.md that this PR now extends with gate-severity and timing sections.
  • electricsheephq/WorldOS#1030: Implements the same party_traveled and combat_not_left_active behavioral gate-severity repairs in qa/assert_behavioral.py that this PR documents in qa/SCORING.md.

Poem

🐇 Hop, hop — GLM runs cheap through the night,
While Claude holds the gate for release's right!
Env vars scrubbed, no leakage in sight,
FATAL or WARN? The severity's tight.
Timing sidecars whisper small fractions of time —
The warren runs batches, and the docs are sublime! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description provides comprehensive context on the changes and rationale, but is missing the required CLA checklist and validation sections from the repository template. | Add the complete CLA checklist with appropriate checkboxes and a 'Validation' section listing the checks that were run before submission. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately captures the three main documentation updates: post-reorientation currency, gate severity (honest measurement), GLM model-profile strategy, and timing columns. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

5-run same-SHA 1-v-1 on the fixed engine (43a5ecc): Claude story 4.13/mech 3.67/angry 3.33,
GLM story 3.9/mech 3.8/angry 3.4 — all behavioral GREEN (0 RED-caps, vs ~30% pre-fix). GLM is
comparable quality (within ~0.2; higher on mech+angry, ~0.2 lower on story); its real cost is
LATENCY (cold-opens 604-872s, 3-4x Claude) → cheap overnight/VM batch sweeps, not interactive,
not the release gate. Both below the RRI bar (story 4.3/mech 4.5).
@100yenadmin 100yenadmin merged commit f389325 into main Jun 19, 2026
20 checks passed