[qa][P0-harness] Quota circuit-breaker + stale-evidence hygiene — a 429 must yield QUOTA_BLOCKED, never a junk RRI #842

@100yenadmin

Root cause (rc3 attempt-1 diagnosis, 2026-06-10 — full report in the session scratchpad)

The 22-min junk sweep @a245a2c: all 4 parallel personas died on HTTP 429 "You've hit your session limit" at the DM cold-open (evidence: identical error in all 4 vm2-*/backend.log). The harness has no quota circuit-breaker:

  (a) scripts/play.sh:313 burns a second call retrying a 429, then :332 masks it via the narration fallback;
  (b) qa/ui_playtest_app.sh:821-870 polls for 600s before mis-bucketing the failure as no_actor/no_provider;
  (c) the player-agent's limit banner was scored (veteran: sat=5 "derived" off a rate-limit banner);
  (d) the sweep then published an RRI from stale evidence (next issue) that reads like a build regression.

Fix (one PR, qa/-only)

  1. qa/lib_beat_driver.sh (~:613-617 clawdnd_report_attempt_failure): on a result matching HTTP 429|hit your session limit|usage limit → emit [dm-attempt] QUOTA_EXHAUSTED until=<reset-if-parseable> + write a QUOTA_EXHAUSTED sentinel into the run dir; callers SKIP the retry (don't burn a second call).
  2. qa/ui_playtest_app.sh: ready-wait loop polls for the sentinel/marker → abort immediately with a new failure bucket quota_exhausted (add to APP_FAILURE_BUCKETS_JSON ~:209). qa/ui_playtest_score.py: detect the limit banner in a player verdict → same bucket, never a derived sat score.
  3. qa/vm/sweep_v2.sh: after the canary + each persona, grep for the marker; on hit → kill remaining personas, touch $RES/QUOTA_BLOCKED, skip duo/ui_audit, and write an explicit quota_blocked verdict instead of a junk RRI.
  4. Preflight lane alignment: the sweep's support_vm_preflight.py call must pass --provider claude --player-agent claude (today it checks codex defaults — wrong-lane noise).
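The detection and sentinel steps above can be sketched as follows. This is a minimal Python illustration, not the bash change itself: the match patterns and marker text are quoted from this issue, while `classify_attempt`, `should_retry`, the case-insensitive matching, the reset-time regex, and the `until=unknown` fallback are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Patterns quoted from the issue; case-insensitive matching is an assumption.
QUOTA_PATTERN = re.compile(r"HTTP 429|hit your session limit|usage limit", re.IGNORECASE)

# Hypothetical reset-time format: the real banner wording is not shown in the issue.
RESET_PATTERN = re.compile(
    r"resets? (?:at )?(\d{1,2}(?::\d{2})?\s*(?:am|pm)?)", re.IGNORECASE
)

SENTINEL_NAME = "QUOTA_EXHAUSTED"


def classify_attempt(result_text: str, run_dir: Path) -> Optional[str]:
    """Return the marker line for a quota failure, or None for anything else.

    On a quota failure this also writes the sentinel file into run_dir.
    """
    if not QUOTA_PATTERN.search(result_text):
        return None
    match = RESET_PATTERN.search(result_text)
    # "unknown" is a placeholder chosen for this sketch when no reset time parses.
    until = match.group(1).strip() if match else "unknown"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / SENTINEL_NAME).write_text(f"until={until}\n")
    return f"[dm-attempt] {SENTINEL_NAME} until={until}"


def should_retry(result_text: str, run_dir: Path) -> bool:
    """Callers skip the retry whenever the failure is a quota failure."""
    return classify_attempt(result_text, run_dir) is None
```

Keeping detection and the retry decision in one place means a caller cannot log the marker and still burn the second call.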

Stale-evidence hygiene (same PR)

  1. Before the duo: rm -f qa/transcripts/vm2-duo.*.json (an aborted duo currently republishes the PREVIOUS run's lens scores byte-for-byte — rc3's "story 4.0/mech 3.0" were rc2's files, verified byte-identical).
  2. Rollup --runs: filter run dirs to run.json build_sha == $SHA (or wipe qa/ui_playtest_runs/vm2-* at sweep start, mirroring the per-persona play-state wipe) — rc3's RRI consumed three Jun-6 dirs (3 stale criticals, phantom persona).
  3. Behavioral: when duo.log has no [duo] done. line → report NOT_RUN (an evidence gap), never default-RED.
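The run-dir filter and the NOT_RUN rule above might look roughly like this. It is a sketch under stated assumptions: `build_sha` is taken to be a top-level key in `run.json`, and the function names and the `"RAN"` label are hypothetical; only `NOT_RUN`, the `vm2-` prefix, and the `[duo] done.` line come from this issue.

```python
import json
from pathlib import Path
from typing import List


def runs_for_build(runs_root: Path, sha: str, prefix: str = "vm2-") -> List[Path]:
    """Keep only run dirs whose run.json records build_sha == sha."""
    kept = []
    for run_dir in sorted(runs_root.glob(f"{prefix}*")):
        try:
            meta = json.loads((run_dir / "run.json").read_text())
        except (OSError, json.JSONDecodeError):
            continue  # missing or unreadable metadata: not evidence for this build
        # Assumption: build_sha is a top-level key of a JSON object.
        if isinstance(meta, dict) and meta.get("build_sha") == sha:
            kept.append(run_dir)
    return kept


def duo_status(duo_log: Path) -> str:
    """Report NOT_RUN (an evidence gap) unless the log has a '[duo] done.' line."""
    try:
        lines = duo_log.read_text(errors="replace").splitlines()
    except OSError:
        return "NOT_RUN"
    # "RAN" is a label invented for this sketch.
    return "RAN" if any("[duo] done." in line for line in lines) else "NOT_RUN"
```

Filtering by SHA rather than by date keeps a same-day rerun of an older build out of the rollup as well.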

Tests

Static contract tests (in the test_release_gate_static.py style, run in the qa CI lane):

  1. the sentinel branch exists in lib_beat_driver;
  2. the quota bucket is in ui_playtest_app + the buckets JSON;
  3. the sweep greps the marker + has the QUOTA_BLOCKED path;
  4. the preflight call carries --provider claude;
  5. the duo-transcript rm precedes the duo;
  6. the rollup runs-filter exists.

Plus a unit test: clawdnd_report_attempt_failure on a 429 fixture emits the marker + skips the retry (extend test_dm_session_remint.py's bash-via-pytest pattern).

Acceptance

A simulated-429 fixture run produces quota_blocked (not an RRI); a clean re-run @ the fix SHA produces a full-length sweep. Then rc3 re-runs for real.
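The first acceptance condition can be sketched as a verdict function run over the simulated-429 fixture logs. This is an illustration only: `quota_hit`, `sweep_verdict`, and the `"proceed"` return value are hypothetical, while the `QUOTA_EXHAUSTED` marker, the `QUOTA_BLOCKED` file, and the `quota_blocked` verdict come from this issue.

```python
from pathlib import Path
from typing import Iterable, List

MARKER = "QUOTA_EXHAUSTED"


def quota_hit(log_paths: Iterable[Path]) -> bool:
    """True if any readable log contains the quota marker."""
    for path in log_paths:
        try:
            if MARKER in path.read_text(errors="replace"):
                return True
        except OSError:
            continue  # an unreadable log is not evidence of a quota hit
    return False


def sweep_verdict(res_dir: Path, log_paths: List[Path]) -> str:
    """Return 'quota_blocked' and touch QUOTA_BLOCKED on a hit, else 'proceed'."""
    if quota_hit(log_paths):
        res_dir.mkdir(parents=True, exist_ok=True)
        (res_dir / "QUOTA_BLOCKED").touch()
        return "quota_blocked"
    # "proceed" is a placeholder: the sweep would continue to duo/ui_audit and the RRI.
    return "proceed"
```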
