
fix(qa): gate-severity audit — stop FATAL behavioral gates false-capping short emergent duos#1030

Merged: 100yenadmin merged 4 commits into main from fix/party-traveled-false-cap on Jun 19, 2026
@100yenadmin 100yenadmin commented Jun 19, 2026


The self-inflicted, model-agnostic RED-cap

Phase 2 of the reorient ran a GLM-vs-Claude 1-v-1 and immediately found that the Claude Opus runs RED-cap too (2/2). A RED behavioral gate caps all 3 lenses to ≤2.5, and several FATAL gates were capping legitimate short emergent duos on both models. Root causes (all model-agnostic):

  1. no_rejected_tool_calls — string-vs-list arg rejected → fixed separately (fix(engine): coerce string/comma-string → list on hot tool args (kills the model-agnostic RED-cap) #1027).
  2. party_traveled — a deep 6–7-beat single-scene social duo reads as "never left the opening scene" → FATAL. Fix: severity is beat-scoped — WARN below SINGLE_SCENE_MIN_BEATS(8), FATAL at/above it. The strict anti-gaming in-place exception (clock+arc+beats) is unchanged; a substantial run that never moves and never progresses is still a FATAL stuck DM.
  3. combat_not_left_active — a 6-beat duo that enters combat near its budget and truncates mid-fight → "combat left active" FATAL (a harness-length artifact, not a DM bug). Fix: severity-aware via a started_late discriminator (where the last start_combat lands in the tool stream) — a truncated/late-start fight, or a resume-into-combat (no start_combat this session), is WARN; only a genuine abandon (combat started early with room to resolve, never end_combat'd, still active) is FATAL.
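
The effect of a RED gate described above can be pictured with a minimal sketch. The function name `apply_red_cap` and the sample scores are hypothetical; only the 2.5 cap and the three lens names (story, mech, angry) come from this PR thread.

```python
RED_CAP = 2.5  # cap applied to every lens when a FATAL gate fires (from the PR text)


def apply_red_cap(lens_scores: dict[str, float], gate_is_red: bool) -> dict[str, float]:
    """Return lens scores, capped at RED_CAP when the behavioral gate is RED."""
    if not gate_is_red:
        return dict(lens_scores)
    return {lens: min(score, RED_CAP) for lens, score in lens_scores.items()}


# Hypothetical scores for illustration only.
scores = {"story": 4.1, "mech": 3.7, "angry": 2.2}
capped = apply_red_cap(scores, gate_is_red=True)
print(capped)  # {'story': 2.5, 'mech': 2.5, 'angry': 2.2}
```

A single false FATAL therefore hides most of the signal in a run, which is why the severity of each gate matters for honest measurement.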

Safety (adversarially verified)

A gate-severity audit workflow classified every FATAL gate (KEEP-FATAL vs over-aggressive) and fixed only the quality/completeness ones. An adversarial verifier confirmed that no true integrity/correctness gate was weakened (player-seated, rejected-tools, dice, dm-output, SRD-correctness, and xp are all untouched) and that the corpus fixtures still trip the preserved FATAL path (party_traveled padded to 8 beats; combat_not_left_active reshaped to a real-abandon profile). The verifier's two latent masking concerns were addressed: resume-into-combat is now WARN (hardened), and the early-abandon beat-floor is kept as a deliberate don't-false-cap bias (unreached; a truly corrupt save is caught by the engine's start_combat next-load guard). A builder-drift bug (the generator would have reverted the party_traveled pad) was also fixed.

Verified

claude-1v1-1 + claude-1v1-2 (the opus runs that RED-capped) → now GREEN. fast_gate 226, corpus + taxonomy 39 pass. This is the precondition for an honest GLM-vs-Claude re-measure.

Summary by CodeRabbit

  • Tests
    • Refined validation logic for combat and world progression scenarios to better differentiate between critical failures and edge cases.
    • Enhanced test fixtures to improve coverage of various session lengths and timing conditions.
    • Updated severity classifications to align validation outcomes with session characteristics.

Eva added 4 commits June 19, 2026 15:16
…rt-single-scene false-cap)

party_traveled was FATAL (caps all 3 lenses ≤2.5) whenever the party stayed in one
location and the strict in-place exception (clock_advanced AND arc_resolved AND beats≥8)
wasn't met — which FATAL-capped legitimate SHORT single-scene social duos (the standard
6-beat persona) on BOTH Claude AND GLM. A model-agnostic false-cap (the memory flags it;
verified: claude-1v1-1/2 opus runs both hit it). Fix: severity is beat-scoped — WARN below
SINGLE_SCENE_MIN_BEATS(8), FATAL at/above it. The strict anti-gaming exception is UNCHANGED;
a substantial run that never moves AND never progresses is still a FATAL stuck DM. Corpus
fixture padded to 8 beats so it tests the (preserved) FATAL path. fast_gate 226 + corpus + taxonomy green.
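
The beat-scoped rule in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in qa/assert_behavioral.py: the function name, its parameters, and the "PASS" label are hypothetical, while `SINGLE_SCENE_MIN_BEATS = 8` and the clock/arc/beats exception are taken from the message.

```python
SINGLE_SCENE_MIN_BEATS = 8  # threshold named in the commit message


def party_traveled_severity(beats: int, left_opening_scene: bool,
                            clock_advanced: bool, arc_resolved: bool) -> str:
    """Return 'PASS', 'WARN', or 'FATAL' for the party_traveled gate (sketch)."""
    if left_opening_scene:
        return "PASS"
    # Strict in-place exception: unchanged by this PR.
    if clock_advanced and arc_resolved and beats >= SINGLE_SCENE_MIN_BEATS:
        return "PASS"
    # Beat-scoped severity: short single-scene runs only WARN.
    return "FATAL" if beats >= SINGLE_SCENE_MIN_BEATS else "WARN"


print(party_traveled_severity(6, False, False, False))  # WARN
print(party_traveled_severity(8, False, False, False))  # FATAL
```

Under this reading, the standard 6-beat persona can no longer be RED-capped by this gate, while a run of 8 or more beats that neither moves nor progresses still is.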
…d-combat duos, FATAL only on genuine abandon

A short emergent duo that ENTERS combat near its beat budget and TRUNCATES mid-fight
legitimately never reaches end_combat — the fight was cut off, not abandoned — yet the
old bare FATAL combat_not_left_active RED-capped all three LLM lenses (<=2.5) on a run
that did nothing wrong. This is a HARNESS-LENGTH artifact, model-agnostic (trips on both
Claude opus and GLM), the same false-cap class as party_traveled.

Proven: qa/transcripts/claude-1v1-2 (opus duo) — start_combat fired at tool-call 36/42
(the last ~14% of the stream) and the final DM line is literally cut off mid-sentence;
only 7 beats. The old gate flipped it RED on combat_not_left_active alone.

Make the severity beat-scoped, mirroring the party_traveled fix:
  - SHORT facade run (< COMBAT_ABANDON_MIN_BEATS=10) still mid-fight  -> WARN (truncation)
  - LONG run but start_combat fired in the final ~20% of the stream   -> WARN (truncation)
  - LONG run, start_combat EARLY, end_combat never called, still active -> FATAL (abandon:
    a real state-integrity bug that corrupts the next load)
Signals are all already in scope (len(mv), the ordered tool stream, end_combat count); no
new inputs. The FATAL path for the real defect is preserved; the combat_ended WARN already
flags "combat may be left hanging" non-fatally, so the FATAL was redundant on truncation.

Corpus: combat_not_left_active fixture re-shaped to the ABANDON profile (12 facade beats,
start_combat EARLY in the stream, end_combat never called, combat active) so it still trips
the preserved FATAL path in isolation. Verified: claude-1v1-2 now GREEN (gate WARN);
the corpus case still RED on combat_not_left_active as the sole fatal.
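
The three-way decision in this commit message might look roughly like the sketch below. The function and parameter names are hypothetical, and the 0.8 cutoff is an assumption standing in for "the final ~20% of the stream"; only `COMBAT_ABANDON_MIN_BEATS = 10` and the WARN/FATAL outcomes come from the message.

```python
COMBAT_ABANDON_MIN_BEATS = 10  # from the commit message
LATE_START_FRACTION = 0.8      # assumption: "final ~20% of the stream"


def combat_left_active_severity(beats: int, start_position: int,
                                total_calls: int, end_combat_calls: int) -> str:
    """Severity for a run that ends with combat still active (sketch).

    start_position is the 1-based position of start_combat in the tool stream.
    """
    if beats < COMBAT_ABANDON_MIN_BEATS:
        return "WARN"  # short run: treated as truncation
    if total_calls > 0 and start_position / total_calls >= LATE_START_FRACTION:
        return "WARN"  # long run, but the fight started late
    # Long run, early start: FATAL only if end_combat was never called.
    return "FATAL" if end_combat_calls == 0 else "WARN"


# claude-1v1-2 profile from the message: 7 beats, start_combat at call 36 of 42.
print(combat_left_active_severity(7, 36, 42, 0))   # WARN
# Corpus abandon profile: 12 beats, early start_combat, no end_combat.
print(combat_left_active_severity(12, 3, 42, 0))   # FATAL
```

Position 36 of 42 is about 86% of the way through the stream, so that run would count as a late start even if it had been long enough to reach the second check.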
…arty_traveled/combat fixtures + fix stale test

Two consistency fixes around the beat-scoped severity changes:

1. builder.py was out of sync with the committed corpus. The prior party_traveled
   commit (af86a81) hand-padded cases/party_traveled/chat.jsonl to 8 player beats (so it
   lands on the PRESERVED FATAL path: party_traveled is FATAL only at >= SINGLE_SCENE_MIN_BEATS=8)
   but did NOT update builder.py's case_party_traveled — so re-running the generator
   reverted the fixture to 6 beats and silently broke the case (it would WARN, not RED).
   Update case_party_traveled to emit 8 beats. Likewise re-shape case_combat_not_left_active
   to the ABANDON profile (12 beats, early start_combat, no end_combat, combat active) so a
   regenerate round-trips to the same committed fixture that trips the preserved FATAL.

2. test_assert_behavioral.py had a stale test (test_party_traveled_still_red_when_arc_resolved_but_too_few_beats)
   that asserted party_traveled is FATAL/RED at 7 beats — the OLD pre-af86a81 behavior. After
   the beat-scoping a 7-beat single-scene run is correctly a WARN (not a lens-capping RED). This
   test was already FAILING on HEAD (a gap in the prior commit). Replace it with two tests that
   match the post-fix contract: a 7-beat single-scene vignette WARNs and stays GREEN; the >=8-beat
   substantial frozen run still RED-s (the preserved FATAL).
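
The builder drift described in item 1 is the kind of problem a regenerate-and-compare check can catch. The sketch below is hypothetical throughout: `build_party_traveled_chat`, `read_jsonl`, `fixture_in_sync`, and the record shape are stand-ins, not the real builder.py or fixture format.

```python
import json
import tempfile
from pathlib import Path


def build_party_traveled_chat(beats: int = 8) -> list[dict]:
    """Hypothetical generator: one player record per beat."""
    return [{"role": "player", "text": f"beat {i + 1}"} for i in range(beats)]


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file into a list of records, skipping blank lines."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


def fixture_in_sync(committed: Path, generated: list[dict]) -> bool:
    """True when regenerating would reproduce the committed fixture."""
    return read_jsonl(committed) == generated


with tempfile.TemporaryDirectory() as tmp:
    committed = Path(tmp) / "chat.jsonl"
    rows = build_party_traveled_chat(8)
    committed.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
    in_sync = fixture_in_sync(committed, build_party_traveled_chat(8))
    drifted = fixture_in_sync(committed, build_party_traveled_chat(6))

print(in_sync, drifted)  # True False
```

A check of this shape would have flagged the 6-beat generator output against the hand-padded 8-beat fixture before a regenerate silently changed the case from RED to WARN.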

Verified: corpus + root_cause + assert_behavioral = 92 passed; fast_gate 226 passed.
…combat) as truncation→WARN

Hardening from the gate-severity adversarial review: when there is NO start_combat in the
tool stream this session (a fight carried over from a resume), started_late is now True so
the run is classified as truncation/resume (WARN), not an abandon (FATAL) — matching the
rationale comment, which the prior form contradicted. The deliberate beat-floor for the
genuinely-early-abandon case is kept (biases toward WARN = don't-false-cap, the load-bearing
priority; a truly corrupt save is caught by the engine's start_combat next-load guard).
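
A minimal sketch of the hardened discriminator follows, assuming a 0.8 cutoff and 0-based indices; the exact expression in qa/assert_behavioral.py is truncated in this thread, so the function below is illustrative only.

```python
def started_late(tool_stream: list[str], late_fraction: float = 0.8) -> bool:
    """Sketch of the started_late discriminator.

    Returns True when there is no start_combat this session (a fight carried
    over from a resume) or when the last start_combat falls in the tail of the
    stream. late_fraction is an assumed cutoff.
    """
    last_start = max(
        (i for i, name in enumerate(tool_stream) if name == "start_combat"),
        default=-1,
    )
    if last_start < 0:
        return True  # resume-into-combat: classified as truncation/resume (WARN)
    return last_start / len(tool_stream) >= late_fraction


resumed = ["attack", "attack", "move"]            # no start_combat this session
early = ["start_combat"] + ["attack"] * 41        # fight opens the session
print(started_late(resumed), started_late(early))  # True False
```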

coderabbitai Bot commented Jun 19, 2026

📝 Walkthrough

Two behavioral gates in qa/assert_behavioral.py are updated to use beat-scoped severity: combat_not_left_active now classifies active-at-end combat as FATAL only when the session is long, combat started early, and end_combat never fired; party_traveled uses session_beats >= SINGLE_SCENE_MIN_BEATS to decide FATAL vs WARN. Corpus fixtures and unit tests are updated to match.

Changes

Beat-scoped FATAL/WARN gate severity

| Layer / File(s) | Summary |
| --- | --- |
| combat_not_left_active beat-scoped severity + fixtures: qa/assert_behavioral.py, qa/gate_corpus/builder.py, qa/gate_corpus/cases/combat_not_left_active/run.jsonl, qa/gate_corpus/cases/combat_not_left_active/chat.jsonl, qa/gate_corpus/cases/combat_not_left_active/moves.jsonl | Gate logic is replaced with heuristics that inspect start_combat position in the ordered tool-call stream and absence of end_combat to emit FATAL only for genuine abandonment; truncation cases become WARN. Corpus fixture is extended to 12 beats with an early start_combat + attack sequence and no end_combat, exercising the new FATAL path. |
| party_traveled beat-scoped severity + fixtures + tests: qa/assert_behavioral.py, qa/gate_corpus/builder.py, qa/gate_corpus/cases/party_traveled/chat.jsonl, qa/test_assert_behavioral.py | party_traveled now passes fatal=_pt_fatal derived from session_beats >= SINGLE_SCENE_MIN_BEATS. Corpus fixture beat count raised from 6 to 8. Unit tests replace one regression case with two new scenarios: a short vignette that must exit GREEN with [WARN], and a substantial frozen run that must remain RED. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • electricsheephq/WorldOS#966: Directly related — updates party_traveled logic with the same SINGLE_SCENE_MIN_BEATS threshold and beat-scoped FATAL/WARN behavior that this PR further extends.

Poem

🐇 A combat abandoned? Not always a crime,
If the session was short, just a WARN is fine.
The beats must be many, the start must be early,
For FATAL to land on the fighter, surly.
Party not traveled? Check the beat score!
Short vignettes get mercy — RED needs much more. 🗺️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title precisely describes the main change: a gate-severity audit that fixes false-positive FATAL classifications in behavioral gates for short runs, preventing over-aggressive severity capping. |
| Description check | ✅ Passed | The description is detailed and well-structured, explaining the root causes, fixes, safety verification, and verification results, though it lacks explicit CLA/licensing checkbox marks from the template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Comment thread qa/assert_behavioral.py
# earlier form FATAL'd a resumed-into-combat run, contradicting it).
started_late = (
_last_sc < 0
or (_last_sc >= 0 and _total_calls > 0

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
qa/test_assert_behavioral.py (1)

881-889: ⚡ Quick win

Add a regression for “earlier combat ended, latest combat left active.”

Current additions validate party-traveled severity well, but they won’t catch the combat_not_left_active ordering edge where a prior end_combat can mask a later abandoned combat. A focused test for that sequence would prevent silent gate weakening.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@qa/test_assert_behavioral.py` around lines 881 - 889, The test suite lacks
coverage for the combat_not_left_active ordering edge case where an earlier
combat is properly ended but a later combat remains active and abandoned. Add a
new test function after the existing
test_party_traveled_still_red_on_substantial_frozen_run_too_few_visited test
that constructs event sequences and game state using the helper functions
_dm_text_turns, _single_scene_state, and _run_gate to specifically validate that
the combat_not_left_active gate correctly fails when this combat ordering
scenario occurs. The test should verify that an earlier end_combat event does
not mask a subsequent unclosed combat, ensuring the gate maintains its
validation strength.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 634b8a0f-b677-4675-b480-594211d12ab8

📥 Commits

Reviewing files that changed from the base of the PR and between dccc942 and 698636c.

📒 Files selected for processing (7)
  • qa/assert_behavioral.py
  • qa/gate_corpus/builder.py
  • qa/gate_corpus/cases/combat_not_left_active/chat.jsonl
  • qa/gate_corpus/cases/combat_not_left_active/moves.jsonl
  • qa/gate_corpus/cases/combat_not_left_active/run.jsonl
  • qa/gate_corpus/cases/party_traveled/chat.jsonl
  • qa/test_assert_behavioral.py

Comment thread qa/assert_behavioral.py
Comment on lines +383 to +387
_abandoned = (
    len(mv) >= COMBAT_ABANDON_MIN_BEATS
    and not started_late
    and tools.get("end_combat", 0) == 0
)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope abandon detection to the latest combat segment.

At Line 386, _abandoned checks tools.get("end_combat", 0) == 0 globally. If one combat ended earlier but a later combat was left active, this incorrectly downgrades a genuine abandon to WARN.

💡 Suggested fix
                 _last_sc = max((i for i, c in enumerate(_ordered_short) if c == "start_combat"),
                                default=-1)
+                _last_ec = max((i for i, c in enumerate(_ordered_short) if c == "end_combat"),
+                               default=-1)
@@
-                _abandoned = (
+                current_fight_unclosed = (_last_sc >= 0 and _last_ec < _last_sc)
+                _abandoned = (
                     len(mv) >= COMBAT_ABANDON_MIN_BEATS
                     and not started_late
-                    and tools.get("end_combat", 0) == 0
+                    and current_fight_unclosed
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@qa/assert_behavioral.py` around lines 383 - 387, The `_abandoned` variable
calculation checks `tools.get("end_combat", 0) == 0` globally, which causes
incorrect behavior when multiple combat segments exist. If an earlier combat
ended properly but a later combat remains active, this check incorrectly
evaluates to false. Modify the condition to scope the `end_combat` check to only
the latest/current combat segment rather than checking the global tools
dictionary value, ensuring that abandons are detected accurately for the most
recent combat segment only.
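
The masking case raised in this review comment can be reproduced in isolation. The sketch below follows the shape of the suggested fix (compare the last start_combat index with the last end_combat index); the helper names and the sample tool stream are hypothetical.

```python
def last_index(calls: list[str], name: str) -> int:
    """Index of the last occurrence of name in calls, or -1 if absent."""
    return max((i for i, c in enumerate(calls) if c == name), default=-1)


def current_fight_unclosed(calls: list[str]) -> bool:
    """True when the most recent start_combat has no end_combat after it."""
    last_start = last_index(calls, "start_combat")
    last_end = last_index(calls, "end_combat")
    return last_start >= 0 and last_end < last_start


# First fight is closed properly; second fight is left active.
calls = ["start_combat", "attack", "end_combat", "move", "start_combat", "attack"]

global_check = calls.count("end_combat") == 0  # the global form: any end_combat hides the abandon
scoped_check = current_fight_unclosed(calls)   # the scoped form: looks at the latest fight only

print(global_check, scoped_check)  # False True
```

With the global count, the earlier end_combat makes the abandon condition false and the run falls back to WARN; the scoped comparison still reports the second fight as unclosed.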

@100yenadmin 100yenadmin merged commit 43a5ecc into main Jun 19, 2026
20 checks passed
100yenadmin added a commit that referenced this pull request Jun 19, 2026
…M model-profile, timing (#1031)

* docs(qa): SCORING currency — gate severity (honest measurement), list-arg coercion, timing columns

* docs: MODEL-TIERING + RUNBOOK currency — clean GLM model-profile + QA lane + cap-rate finding

Post-24h-reorient doc currency, grounded in merged #1026/#1027/#1028/#1030.

- docs/MODEL-TIERING-STRATEGY.md: new "GLM as a cheap batch-QA engine" section —
  the clean model-profile system (single WORLDOS_DM_MODEL flows coherently; no-op for
  Claude; defensive GLM-env scrub so switch-back to Opus is always clean; mixed-model
  guard; product play.sh/play_party.sh forced clean-Claude; scorer always isolated
  Claude). WHEN to use GLM (cheap batch QA, save Anthropic tokens — NOT the release
  gate). The cap-rate finding: the ~30% GLM cap rate was self-inflicted over-aggressive
  FATAL gates capping BOTH models (now fixed via #1027 + #1030), NOT a GLM weakness —
  honest-measurement repair, the opposite of score-gaming. PLACEHOLDER (no invented
  numbers) for the in-flight honest GLM-vs-Claude re-measure; pre-fix ~3.6/~3.6 marked
  superseded.
- WorldOS-RUNBOOK.md + WorldOS-GUI-RUNBOOK.md: brief "GLM QA lane" notes (WORLDOS_DM_MODEL=
  glm-5.2 WORLDOS_ACTOR_MODEL=glm-5.2; profile auto-wires endpoint + raised timeouts across
  run_duo/run_party/run_combat_sprint/ui_playtest; scorer stays Claude), cross-linked to
  MODEL-TIERING.
- qa/SCORING.md: §1a gate-severity contract (FATAL = true integrity only) + §1a.1 list-arg
  coercion (#1027) + §7 timing observability; north-star measurement-not-target framing.
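
For readers unfamiliar with the list-arg coercion mentioned above, one plausible shape is sketched here. The actual #1027 implementation is not shown in this thread, so `coerce_to_list` and its exact rules are assumptions.

```python
def coerce_to_list(value):
    """Coerce a tool argument to a list (illustrative sketch, not the #1027 code).

    - None            -> []
    - "a, b,c"        -> ["a", "b", "c"]
    - "goblin"        -> ["goblin"]
    - list or tuple   -> list copy
    - any other value -> [value]
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [part.strip() for part in value.split(",") if part.strip()]
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


print(coerce_to_list("goblin"))         # ['goblin']
print(coerce_to_list("orc, goblin"))    # ['orc', 'goblin']
print(coerce_to_list(["orc"]))          # ['orc']
```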

Additive only; Claude paths byte-identical. Anchored on VISION.md (felt session is the
product; scores are measurement, never the target — no score-gaming).

* docs: fill the honest GLM-vs-Claude numbers (post-fix 1-v-1, all GREEN)

5-run same-SHA 1-v-1 on the fixed engine (43a5ecc): Claude story 4.13/mech 3.67/angry 3.33,
GLM story 3.9/mech 3.8/angry 3.4 — all behavioral GREEN (0 RED-caps, vs ~30% pre-fix). GLM is
comparable quality (within ~0.2; higher on mech+angry, ~0.2 lower on story); its real cost is
LATENCY (cold-opens 604-872s, 3-4x Claude) → cheap overnight/VM batch sweeps, not interactive,
not the release gate. Both below the RRI bar (story 4.3/mech 4.5).
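
The per-lens gaps behind these figures can be recomputed directly from the numbers quoted above; the variable names are illustrative.

```python
# Scores and bars as quoted in the commit message.
claude = {"story": 4.13, "mech": 3.67, "angry": 3.33}
glm = {"story": 3.9, "mech": 3.8, "angry": 3.4}
RRI_BAR = {"story": 4.3, "mech": 4.5}

# GLM minus Claude, per lens, rounded to two decimals.
deltas = {lens: round(glm[lens] - claude[lens], 2) for lens in claude}

# Does each model meet every RRI bar?
meets_bar = {
    name: all(scores[lens] >= bar for lens, bar in RRI_BAR.items())
    for name, scores in (("Claude", claude), ("GLM", glm))
}

print(deltas)
print(meets_bar)  # {'Claude': False, 'GLM': False}
```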

---------

Co-authored-by: Eva <arncalso@gmail.com>
100yenadmin added a commit that referenced this pull request Jun 19, 2026
… + GLM + clean switching) (#1032)

NOT a GA: the RRI gameplay gates (story>=4.3, mech>=4.5) are not yet met. This RC hardens the
measurement (the behavioral gate stops false-capping good play — #1027/#1030, adversarially
verified) and the model architecture (clean GLM<->Claude switching, no leaks — #1026/#1028;
GLM measured comparable, latency the real cost) + the timing instrumentation + arc-smoke +
the felt-world machinery, so the gameplay work that follows runs on honest signal.

Co-authored-by: Eva <arncalso@gmail.com>
100yenadmin added a commit that referenced this pull request Jun 19, 2026
…mparable across rulers (#1034)

The 2026-06 cycle materially tightened the scoring ruler (feature-engagement coverage scorer #1018,
acts felt-shape #1001/#1002, betrayal un-inversion #999, romance gate #997, dm_advanced_time unmask
#1024, gate-severity accuracy #1030), so a run scores LOWER under sc_d4b93982763a/lc_d7fcfddd5bf7
than under the v1.0.4 rulers — BY DESIGN (the scorer is a tightening feedback loop). Document the
ruler-version mechanism + history in SCORING.md §0 and annotate it in the v1.0.5-rc1 CHANGELOG, so
current numbers are never mis-compared to historic ones (every scores_db row is fenced by
scoring_config_version/lens_config_version). Stable-checkpoint hygiene per owner.
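
The fencing idea can be sketched as a guard that refuses to compare rows scored under different rulers. The two column names and the current ruler IDs come from the message above; the row layout, the `same_ruler` helper, and the older IDs are hypothetical placeholders.

```python
RULER_KEYS = ("scoring_config_version", "lens_config_version")


def same_ruler(row_a: dict, row_b: dict) -> bool:
    """True when two score rows were produced under the same ruler versions."""
    return all(row_a[key] == row_b[key] for key in RULER_KEYS)


current = {"scoring_config_version": "sc_d4b93982763a",
           "lens_config_version": "lc_d7fcfddd5bf7", "story": 4.13}
rerun = {"scoring_config_version": "sc_d4b93982763a",
         "lens_config_version": "lc_d7fcfddd5bf7", "story": 3.9}
# Placeholder IDs: the real v1.0.4 ruler versions are not given in this thread.
historic = {"scoring_config_version": "sc_older_placeholder",
            "lens_config_version": "lc_older_placeholder", "story": 4.4}

print(same_ruler(current, rerun))     # True
print(same_ruler(current, historic))  # False
```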

Co-authored-by: Eva <arncalso@gmail.com>
100yenadmin added a commit that referenced this pull request Jun 19, 2026
…red campaigns (#1036) (#1037)

ROOT CAUSE
The `structural_completeness` behavioral gate (qa/assert_behavioral.py) FATAL-capped
AUTHORED golden-spine runs to 2.5. Sub-check (b) `unresolved_arc` fires when an active
quest reaches session end open across a >=2-location arc with no quest-resolution call.
But the campaign-arc quest is SEEDED from the authored adventure `hook` and is multi-
session by design; the authored adventures (e.g. embergloom-pact) author NO closable
sub-quests, so the DM legitimately never calls complete_quest / set_quest_status — and
(b) FATAL-REDs even a clean 25-beat authored run. A self-inflicted false-cap. Sibling
#1030 fixed party_traveled / combat_not_left_active the same way but missed this one.

FIX (Option A scope guard — mirrors #1030's WARN-vs-FATAL discipline EXACTLY)
- Compute `is_authored_campaign = bool(tools.get("start_adventure") or state.get("scenes"))`.
  `start_adventure` is the authored cold-open call (server.py:697), always in the tool
  stream `_tally` sees; `state["scenes"]` is non-empty only for seeded authored adventures
  (server.py serializes it; content.py persists authored scenes).
- Demote ONLY sub-check (b) unresolved_arc from FATAL->WARN when authored AND the only open
  quest is the hook-seeded arc. The gate still APPENDS the WARN message (visibility kept);
  the run is no longer RED-capped on (b) alone.
- Clause (a) approval-frozen stays FATAL ALWAYS.
- PRESERVE FATAL for: any NON-authored run (the original narrated-not-engaged failure),
  AND an authored run that called add_quest (server.py:10165 — the DM's own quest-creation
  tool, distinguishable from the hook-seeded quest at gate time) and left it unresolved —
  a genuine dropped thread.

  New severity: `_unresolved_fatal = unresolved_arc and (not is_authored_campaign or
  bool(tools.get("add_quest")))`; `_structural_fatal = approval_frozen_run or _unresolved_fatal`.
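
The severity expression above can be exercised on its own. In this sketch the wrapper function and the sample inputs are hypothetical; the boolean logic follows the two expressions quoted in the commit message.

```python
def structural_fatal(tools: dict, state: dict,
                     unresolved_arc: bool, approval_frozen_run: bool) -> bool:
    """Sketch of the structural_completeness severity described above."""
    is_authored_campaign = bool(tools.get("start_adventure") or state.get("scenes"))
    unresolved_fatal = unresolved_arc and (
        not is_authored_campaign or bool(tools.get("add_quest"))
    )
    return bool(approval_frozen_run or unresolved_fatal)


# Authored run, hook-seeded arc left open: WARN (not fatal).
print(structural_fatal({"start_adventure": 1}, {}, True, False))                  # False
# Non-authored run with an unresolved arc: still FATAL.
print(structural_fatal({}, {}, True, False))                                      # True
# Authored run that called add_quest and dropped the thread: FATAL.
print(structural_fatal({"start_adventure": 1, "add_quest": 1}, {}, True, False))  # True
```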

ANTI-SCORE-GAMING DUAL CORPUS PROOF
- NEW GREEN fixture qa/gate_corpus/cases/structural_completeness_authored_warn/ (built by
  builder.py `case_structural_completeness_authored_warn`, recorded under a new RED-only-
  safe `green_cases` manifest key): authored profile (start_adventure + scenes + frozen
  companion + active hook quest + 2 locations + no resolution) -> gate exits GREEN with
  structural_completeness as [WARN]. Locked by test_behavioral_gate_corpus.py
  ::test_green_case_warns_but_stays_green (the inverse guard: re-promoting (b) to FATAL
  flips it RED and fails).
- The EXISTING non-authored qa/gate_corpus/cases/structural_completeness/ fixture (no
  start_adventure, no scenes) regenerated cleanly and STILL exits FATAL RED.
- 4 unit tests in qa/test_assert_behavioral.py: authored->WARN/GREEN, non-authored->FATAL/
  RED, authored+add_quest->FATAL (carve-out), authored-via-scenes->WARN.
- Coverage audit (test_manifest_covers_every_fatal_check) stays green — the gate still
  classifies structural_completeness as FATAL (fatal=<var>, not fatal=False), no drift.
- BEHAVIORAL_GATE_TAXONOMY.json hint updated to document the #1036 authored-WARN behavior.

Additive, no existing guard weakened. 78 focused tests pass single-process (no xdist).

Co-authored-by: Eva <arncalso@gmail.com>
