fix(qa): gate-severity audit — stop FATAL behavioral gates false-capping short emergent duos #1030
…rt-single-scene false-cap)
party_traveled was FATAL (caps all 3 lenses ≤2.5) whenever the party stayed in one location and the strict in-place exception (clock_advanced AND arc_resolved AND beats≥8) wasn't met, which FATAL-capped legitimate SHORT single-scene social duos (the standard 6-beat persona) on BOTH Claude AND GLM. A model-agnostic false-cap (the memory flags it; verified: claude-1v1-1/2 opus runs both hit it).
Fix: severity is beat-scoped: WARN below SINGLE_SCENE_MIN_BEATS (8), FATAL at/above it. The strict anti-gaming exception is UNCHANGED; a substantial run that never moves AND never progresses is still a FATAL stuck DM.
Corpus fixture padded to 8 beats so it tests the (preserved) FATAL path. fast_gate 226 + corpus + taxonomy green.
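The beat-scoped rule can be read as a small pure function. The sketch below is illustrative only: the name `party_traveled_severity`, its parameters, and the string return values are hypothetical, and the real gate in qa/assert_behavioral.py may be organized quite differently. Only the threshold (8) and the three-part in-place exception come from the commit message.

```python
SINGLE_SCENE_MIN_BEATS = 8  # threshold named in the commit message


def party_traveled_severity(beats, moved, clock_advanced, arc_resolved):
    """Hypothetical helper: classify one run for the party_traveled gate."""
    if moved:
        return "PASS"
    # Strict in-place exception: all three conditions must hold.
    in_place_ok = (
        clock_advanced and arc_resolved and beats >= SINGLE_SCENE_MIN_BEATS
    )
    if in_place_ok:
        return "PASS"
    # Beat-scoped severity: short single-scene runs only warn.
    return "FATAL" if beats >= SINGLE_SCENE_MIN_BEATS else "WARN"
```

Under these assumptions a 6-beat single-scene duo returns "WARN", while an 8-beat run that never moves and never progresses still returns "FATAL".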
…d-combat duos, FATAL only on genuine abandon
A short emergent duo that ENTERS combat near its beat budget and TRUNCATES mid-fight
legitimately never reaches end_combat — the fight was cut off, not abandoned — yet the
old bare FATAL combat_not_left_active RED-capped all three LLM lenses (<=2.5) on a run
that did nothing wrong. This is a HARNESS-LENGTH artifact, model-agnostic (trips on both
Claude opus and GLM), the same false-cap class as party_traveled.
Proven: qa/transcripts/claude-1v1-2 (opus duo) — start_combat fired at tool-call 36/42
(the last ~14% of the stream) and the final DM line is literally cut off mid-sentence;
only 7 beats. The old gate flipped it RED on combat_not_left_active alone.
Make the severity beat-scoped, mirroring the party_traveled fix:
- SHORT facade run (< COMBAT_ABANDON_MIN_BEATS=10) still mid-fight -> WARN (truncation)
- LONG run but start_combat fired in the final ~20% of the stream -> WARN (truncation)
- LONG run, start_combat EARLY, end_combat never called, still active -> FATAL (abandon:
a real state-integrity bug that corrupts the next load)
Signals are all already in scope (len(mv), the ordered tool stream, end_combat count); no
new inputs. The FATAL path for the real defect is preserved; the combat_ended WARN already
flags "combat may be left hanging" non-fatally, so the FATAL was redundant on truncation.
Corpus: combat_not_left_active fixture re-shaped to the ABANDON profile (12 facade beats,
start_combat EARLY in the stream, end_combat never called, combat active) so it still trips
the preserved FATAL path in isolation. Verified: claude-1v1-2 now GREEN (gate WARN);
the corpus case still RED on combat_not_left_active as the sole fatal.
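The three-way classification above can be sketched as follows. This is a minimal illustration, not the gate's actual code: the function name, its parameters, and the `LATE_START_FRACTION` constant (one way to encode "final ~20% of the stream") are assumptions. `COMBAT_ABANDON_MIN_BEATS=10` and the claude-1v1-2 numbers come from the commit message.

```python
COMBAT_ABANDON_MIN_BEATS = 10  # value from the commit message
LATE_START_FRACTION = 0.8      # assumption: encodes "final ~20% of the stream"


def combat_left_active_severity(beats, start_call_no, total_calls, end_combat_calls):
    """Hypothetical helper: severity when combat is still active at session end.

    start_call_no is the 1-based position of start_combat in the tool stream;
    total_calls is assumed to be greater than zero.
    """
    started_late = start_call_no / total_calls >= LATE_START_FRACTION
    abandoned = (
        beats >= COMBAT_ABANDON_MIN_BEATS
        and not started_late
        and end_combat_calls == 0
    )
    return "FATAL" if abandoned else "WARN"


# claude-1v1-2 profile: 7 beats, start_combat at tool-call 36 of 42.
truncated = combat_left_active_severity(7, 36, 42, 0)
# Abandon profile: 12 beats, early start_combat, no end_combat
# (call 3 of 40 is an invented position for illustration).
abandon = combat_left_active_severity(12, 3, 40, 0)
print(truncated, abandon)
```

With these inputs the truncated run is classified WARN and the abandon profile FATAL, which matches the behavior the commit message reports for claude-1v1-2 and the reshaped corpus fixture.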
…arty_traveled/combat fixtures + fix stale test
Two consistency fixes around the beat-scoped severity changes:
1. builder.py was out of sync with the committed corpus. The prior party_traveled commit (af86a81) hand-padded cases/party_traveled/chat.jsonl to 8 player beats (so it lands on the PRESERVED FATAL path: party_traveled is FATAL only at >= SINGLE_SCENE_MIN_BEATS=8) but did NOT update builder.py's case_party_traveled, so re-running the generator reverted the fixture to 6 beats and silently broke the case (it would WARN, not RED). Update case_party_traveled to emit 8 beats. Likewise re-shape case_combat_not_left_active to the ABANDON profile (12 beats, early start_combat, no end_combat, combat active) so a regenerate round-trips to the same committed fixture that trips the preserved FATAL.
2. test_assert_behavioral.py had a stale test (test_party_traveled_still_red_when_arc_resolved_but_too_few_beats) that asserted party_traveled is FATAL/RED at 7 beats (the OLD pre-af86a81 behavior). After the beat-scoping a 7-beat single-scene run is correctly a WARN (not a lens-capping RED). This test was already FAILING on HEAD (a gap in the prior commit). Replace it with two tests that match the post-fix contract: a 7-beat single-scene vignette WARNs and stays GREEN; the >=8-beat substantial frozen run still RED-s (the preserved FATAL).
Verified: corpus + root_cause + assert_behavioral = 92 passed; fast_gate 226 passed.
…combat) as truncation→WARN
Hardening from the gate-severity adversarial review: when there is NO start_combat in the tool stream this session (a fight carried over from a resume), started_late is now True so the run is classified as truncation/resume (WARN), not an abandon (FATAL), matching the rationale comment, which the prior form contradicted. The deliberate beat-floor for the genuinely-early-abandon case is kept (biases toward WARN = don't-false-cap, the load-bearing priority; a truly corrupt save is caught by the engine's start_combat next-load guard).
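A minimal sketch of the hardened discriminator, assuming the tool stream is available as an ordered list of call names. The function signature and the 0.8 cutoff are assumptions made for illustration; only the rule that a missing start_combat counts as "late" is taken from the commit message.

```python
def started_late(ordered_calls, late_fraction=0.8):
    """Hypothetical sketch of the started_late discriminator.

    True when the last start_combat lands in the tail of the tool stream, or
    when the stream holds no start_combat at all (a fight resumed from an
    earlier session).
    """
    last_sc = max(
        (i for i, call in enumerate(ordered_calls) if call == "start_combat"),
        default=-1,
    )
    if last_sc < 0:
        # Resume-into-combat: no fight was opened in this session.
        return True
    # last_sc >= 0 implies the list is non-empty, so the division is safe.
    return (last_sc + 1) / len(ordered_calls) >= late_fraction
```

Treating the no-start_combat case as late keeps a resumed fight on the WARN path, which is the bias toward not false-capping that the commit describes.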
    # earlier form FATAL'd a resumed-into-combat run, contradicting it).
    started_late = (
        _last_sc < 0
        or (_last_sc >= 0 and _total_calls > 0
Actionable comments posted: 1
🧹 Nitpick comments (1)
qa/test_assert_behavioral.py (1)
881-889: ⚡ Quick win: Add a regression for “earlier combat ended, latest combat left active.”
Current additions validate party-traveled severity well, but they won’t catch the combat_not_left_active ordering edge where a prior end_combat can mask a later abandoned combat. A focused test for that sequence would prevent silent gate weakening.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@qa/test_assert_behavioral.py` around lines 881 - 889, The test suite lacks coverage for the combat_not_left_active ordering edge case where an earlier combat is properly ended but a later combat remains active and abandoned. Add a new test function after the existing test_party_traveled_still_red_on_substantial_frozen_run_too_few_visited test that constructs event sequences and game state using the helper functions _dm_text_turns, _single_scene_state, and _run_gate to specifically validate that the party_traveled gate correctly fails when this combat ordering scenario occurs. The test should verify that an earlier end_combat event does not mask a subsequent unclosed combat action, ensuring the gate maintains its validation strength.
📒 Files selected for processing (7)
- qa/assert_behavioral.py
- qa/gate_corpus/builder.py
- qa/gate_corpus/cases/combat_not_left_active/chat.jsonl
- qa/gate_corpus/cases/combat_not_left_active/moves.jsonl
- qa/gate_corpus/cases/combat_not_left_active/run.jsonl
- qa/gate_corpus/cases/party_traveled/chat.jsonl
- qa/test_assert_behavioral.py
    _abandoned = (
        len(mv) >= COMBAT_ABANDON_MIN_BEATS
        and not started_late
        and tools.get("end_combat", 0) == 0
    )
Scope abandon detection to the latest combat segment.
At Line 386, _abandoned checks tools.get("end_combat", 0) == 0 globally. If one combat ended earlier but a later combat was left active, this incorrectly downgrades a genuine abandon to WARN.
💡 Suggested fix
     _last_sc = max((i for i, c in enumerate(_ordered_short) if c == "start_combat"),
                    default=-1)
+    _last_ec = max((i for i, c in enumerate(_ordered_short) if c == "end_combat"),
+                   default=-1)
@@
-    _abandoned = (
+    current_fight_unclosed = (_last_sc >= 0 and _last_ec < _last_sc)
+    _abandoned = (
         len(mv) >= COMBAT_ABANDON_MIN_BEATS
         and not started_late
-        and tools.get("end_combat", 0) == 0
+        and current_fight_unclosed
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@qa/assert_behavioral.py` around lines 383 - 387, The `_abandoned` variable
calculation checks `tools.get("end_combat", 0) == 0` globally, which causes
incorrect behavior when multiple combat segments exist. If an earlier combat
ended properly but a later combat remains active, this check incorrectly
evaluates to false. Modify the condition to scope the `end_combat` check to only
the latest/current combat segment rather than checking the global tools
dictionary value, ensuring that abandons are detected accurately for the most
recent combat segment only.
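To make the ordering edge concrete, here is a small standalone comparison of the two checks. It is a sketch under the assumption that the tool stream is a plain list of call names; both helper names are hypothetical and the list below is an invented example, not a recorded transcript.

```python
def global_end_combat_missing(ordered_calls):
    """Mirrors the global check: true only if end_combat never appears."""
    return ordered_calls.count("end_combat") == 0


def current_fight_unclosed(ordered_calls):
    """Segment-scoped check: the latest start_combat has no later end_combat."""
    last_sc = max(
        (i for i, c in enumerate(ordered_calls) if c == "start_combat"),
        default=-1,
    )
    last_ec = max(
        (i for i, c in enumerate(ordered_calls) if c == "end_combat"),
        default=-1,
    )
    return last_sc >= 0 and last_ec < last_sc


# First fight closed properly, second fight left open.
stream = ["start_combat", "end_combat", "start_combat"]
masked = global_end_combat_missing(stream)   # earlier end_combat hides the open fight
detected = current_fight_unclosed(stream)    # scoped check still sees it
print(masked, detected)
```

On this stream the global check reports that nothing is missing, while the scoped check flags the second fight as unclosed, which is the gap the review comment describes.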
…M model-profile, timing (#1031)
* docs(qa): SCORING currency: gate severity (honest measurement), list-arg coercion, timing columns
* docs: MODEL-TIERING + RUNBOOK currency: clean GLM model-profile + QA lane + cap-rate finding
Post-24h-reorient doc currency, grounded in merged #1026/#1027/#1028/#1030.
- docs/MODEL-TIERING-STRATEGY.md: new "GLM as a cheap batch-QA engine" section covering the clean model-profile system (single WORLDOS_DM_MODEL flows coherently; no-op for Claude; defensive GLM-env scrub so switch-back to Opus is always clean; mixed-model guard; product play.sh/play_party.sh forced clean-Claude; scorer always isolated Claude). WHEN to use GLM (cheap batch QA, save Anthropic tokens; NOT the release gate). The cap-rate finding: the ~30% GLM cap rate was self-inflicted over-aggressive FATAL gates capping BOTH models (now fixed via #1027 + #1030), NOT a GLM weakness. This is honest-measurement repair, the opposite of score-gaming. PLACEHOLDER (no invented numbers) for the in-flight honest GLM-vs-Claude re-measure; pre-fix ~3.6/~3.6 marked superseded.
- WorldOS-RUNBOOK.md + WorldOS-GUI-RUNBOOK.md: brief "GLM QA lane" notes (WORLDOS_DM_MODEL=glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2; profile auto-wires endpoint + raised timeouts across run_duo/run_party/run_combat_sprint/ui_playtest; scorer stays Claude), cross-linked to MODEL-TIERING.
- qa/SCORING.md: §1a gate-severity contract (FATAL = true integrity only) + §1a.1 list-arg coercion (#1027) + §7 timing observability; north-star measurement-not-target framing.
Additive only; Claude paths byte-identical. Anchored on VISION.md (felt session is the product; scores are measurement, never the target; no score-gaming).
* docs: fill the honest GLM-vs-Claude numbers (post-fix 1-v-1, all GREEN)
5-run same-SHA 1-v-1 on the fixed engine (43a5ecc): Claude story 4.13/mech 3.67/angry 3.33, GLM story 3.9/mech 3.8/angry 3.4, all behavioral GREEN (0 RED-caps, vs ~30% pre-fix).
GLM is comparable quality (within ~0.2; higher on mech+angry, ~0.2 lower on story); its real cost is LATENCY (cold-opens 604-872s, 3-4x Claude) → cheap overnight/VM batch sweeps, not interactive, not the release gate. Both below the RRI bar (story 4.3/mech 4.5).
---------
Co-authored-by: Eva <arncalso@gmail.com>
… + GLM + clean switching) (#1032) NOT a GA: the RRI gameplay gates (story>=4.3, mech>=4.5) are not yet met. This RC hardens the measurement (the behavioral gate stops false-capping good play — #1027/#1030, adversarially verified) and the model architecture (clean GLM<->Claude switching, no leaks — #1026/#1028; GLM measured comparable, latency the real cost) + the timing instrumentation + arc-smoke + the felt-world machinery, so the gameplay work that follows runs on honest signal. Co-authored-by: Eva <arncalso@gmail.com>
…mparable across rulers (#1034) The 2026-06 cycle materially tightened the scoring ruler (feature-engagement coverage scorer #1018, acts felt-shape #1001/#1002, betrayal un-inversion #999, romance gate #997, dm_advanced_time unmask #1024, gate-severity accuracy #1030), so a run scores LOWER under sc_d4b93982763a/lc_d7fcfddd5bf7 than under the v1.0.4 rulers — BY DESIGN (the scorer is a tightening feedback loop). Document the ruler-version mechanism + history in SCORING.md §0 and annotate it in the v1.0.5-rc1 CHANGELOG, so current numbers are never mis-compared to historic ones (every scores_db row is fenced by scoring_config_version/lens_config_version). Stable-checkpoint hygiene per owner. Co-authored-by: Eva <arncalso@gmail.com>
…red campaigns (#1036) (#1037)
ROOT CAUSE
The `structural_completeness` behavioral gate (qa/assert_behavioral.py) FATAL-capped AUTHORED golden-spine runs to 2.5. Sub-check (b) `unresolved_arc` fires when an active quest reaches session end open across a >=2-location arc with no quest-resolution call. But the campaign-arc quest is SEEDED from the authored adventure `hook` and is multi-session by design; the authored adventures (e.g. embergloom-pact) author NO closable sub-quests, so the DM legitimately never calls complete_quest / set_quest_status, and (b) FATAL-REDs even a clean 25-beat authored run. A self-inflicted false-cap. Sibling #1030 fixed party_traveled / combat_not_left_active the same way but missed this one.
FIX (Option A scope guard; mirrors #1030's WARN-vs-FATAL discipline EXACTLY)
- Compute `is_authored_campaign = bool(tools.get("start_adventure") or state.get("scenes"))`. `start_adventure` is the authored cold-open call (server.py:697), always in the tool stream `_tally` sees; `state["scenes"]` is non-empty only for seeded authored adventures (server.py serializes it; content.py persists authored scenes).
- Demote ONLY sub-check (b) unresolved_arc from FATAL->WARN when authored AND the only open quest is the hook-seeded arc. The gate still APPENDS the WARN message (visibility kept); the run is no longer RED-capped on (b) alone.
- Clause (a) approval-frozen stays FATAL ALWAYS.
- PRESERVE FATAL for: any NON-authored run (the original narrated-not-engaged failure), AND an authored run that called add_quest (server.py:10165, the DM's own quest-creation tool, distinguishable from the hook-seeded quest at gate time) and left it unresolved, which is a genuine dropped thread.
New severity: `_unresolved_fatal = unresolved_arc and (not is_authored_campaign or bool(tools.get("add_quest")))`; `_structural_fatal = approval_frozen_run or _unresolved_fatal`.
ANTI-SCORE-GAMING DUAL CORPUS PROOF
- NEW GREEN fixture qa/gate_corpus/cases/structural_completeness_authored_warn/ (built by builder.py `case_structural_completeness_authored_warn`, recorded under a new RED-only-safe `green_cases` manifest key): authored profile (start_adventure + scenes + frozen companion + active hook quest + 2 locations + no resolution) -> gate exits GREEN with structural_completeness as [WARN]. Locked by test_behavioral_gate_corpus.py::test_green_case_warns_but_stays_green (the inverse guard: re-promoting (b) to FATAL flips it RED and fails).
- The EXISTING non-authored qa/gate_corpus/cases/structural_completeness/ fixture (no start_adventure, no scenes) regenerated cleanly and STILL exits FATAL RED.
- 4 unit tests in qa/test_assert_behavioral.py: authored->WARN/GREEN, non-authored->FATAL/RED, authored+add_quest->FATAL (carve-out), authored-via-scenes->WARN.
- Coverage audit (test_manifest_covers_every_fatal_check) stays green: the gate still classifies structural_completeness as FATAL (fatal=<var>, not fatal=False), no drift.
- BEHAVIORAL_GATE_TAXONOMY.json hint updated to document the #1036 authored-WARN behavior.
Additive, no existing guard weakened. 78 focused tests pass single-process (no xdist).
Co-authored-by: Eva <arncalso@gmail.com>
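The severity expression quoted in this commit message can be exercised in isolation. The sketch below wraps the two quoted expressions in a hypothetical function; it assumes `tools` is a name-to-count tally and `state` is a dict, and the sample values (including the scene name) are invented. It is not the gate's real entry point.

```python
def structural_fatal(tools, state, unresolved_arc, approval_frozen_run):
    """Hypothetical wrapper around the severity expressions quoted above."""
    is_authored_campaign = bool(
        tools.get("start_adventure") or state.get("scenes")
    )
    unresolved_fatal = unresolved_arc and (
        not is_authored_campaign or bool(tools.get("add_quest"))
    )
    return bool(approval_frozen_run or unresolved_fatal)


# The four unit-test profiles listed in the commit message, with invented values.
authored = structural_fatal({"start_adventure": 1}, {}, True, False)
non_authored = structural_fatal({}, {}, True, False)
authored_add_quest = structural_fatal(
    {"start_adventure": 1, "add_quest": 1}, {}, True, False
)
authored_via_scenes = structural_fatal({}, {"scenes": ["opening"]}, True, False)
print(authored, non_authored, authored_add_quest, authored_via_scenes)
```

Only the non-authored run and the authored run that called add_quest come out fatal here; an approval-frozen run stays fatal regardless of the authored flag because clause (a) is OR-ed in last.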
The self-inflicted, model-agnostic RED-cap
Phase-2 of the reorient ran a GLM-vs-Claude 1-v-1 and immediately found Claude opus runs RED-cap too (2/2). A RED behavioral gate caps all 3 lenses to ≤2.5, and several FATAL gates were nuking legitimate short emergent duos on both models. Root causes (all model-agnostic):
- no_rejected_tool_calls: string-vs-list arg rejected → fixed separately (fix(engine): coerce string/comma-string → list on hot tool args (kills the model-agnostic RED-cap) #1027).
- party_traveled: a deep 6–7-beat single-scene social duo reads as "never left the opening scene" → FATAL. Fix: severity is beat-scoped: WARN below SINGLE_SCENE_MIN_BEATS (8), FATAL at/above it. The strict anti-gaming in-place exception (clock+arc+beats) is unchanged; a substantial run that never moves and never progresses is still a FATAL stuck DM.
- combat_not_left_active: a 6-beat duo that enters combat near its budget and truncates mid-fight → "combat left active" FATAL (a harness-length artifact, not a DM bug). Fix: severity-aware via a started_late discriminator (where the last start_combat lands in the tool stream). A truncated/late-start fight, or a resume-into-combat (no start_combat this session), is WARN; only a genuine abandon (combat started early with room to resolve, never end_combat'd, still active) is FATAL.
Safety (adversarially verified)
A gate-severity audit workflow classified every FATAL gate (KEEP-FATAL vs over-aggressive), fixed only the quality/completeness ones, and an adversarial verifier confirmed no true integrity/correctness gate was weakened (player-seated, rejected-tools, dice, dm-output, SRD-correctness, xp all untouched) and the corpus fixtures still trip the preserved FATAL path (party_traveled padded to 8 beats; combat_not_left_active reshaped to a real-abandon profile). The verifier's two latent masking concerns were addressed: resume-into-combat → WARN (hardened); the early-abandon beat-floor is kept as a deliberate don't-false-cap bias (unreached; a truly corrupt save is caught by the engine's start_combat next-load guard). A builder-drift bug (the generator would have reverted the party_traveled pad) was also fixed.
Verified
claude-1v1-1 + claude-1v1-2 (the opus runs that RED-capped) → now GREEN. fast_gate 226, corpus + taxonomy 39 pass. This is the precondition for an honest GLM-vs-Claude re-measure.