feat(prompt+bench): structural decision trigger + reproducible benchmark#133

Merged
Fullstop000 merged 7 commits into main from decision-trigger-structural-rule
May 2, 2026

Conversation

@Fullstop000
Owner

Summary

  • Replace input-pattern enumeration in the Decision Inbox prompt with a four-property structural test: mutually-exclusive options + blocking + material consequence + delegated picker (see the sketch after this list). The trigger is the shape of the agent's intended reply, not the asker's words.
  • Keep the PR-review example as the canonical anchor, not the rule.
  • Add bench/decision-trigger/ — reproducible benchmark that runs each case in an isolated claude/sonnet agent in parallel, classifies the response turn, and scores match-rate against predictions.
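
As a rough illustration, the four properties read as a conjunctive predicate over the agent's intended reply. A minimal Rust sketch; `ReplyShape`, its fields, and `is_decision` are hypothetical names for illustration, not the prompt text or a codebase API:

```rust
/// Hypothetical shape of the reply an agent is about to send. These
/// names exist only to mirror the four-property structural test above.
struct ReplyShape {
    /// The reply commits to one of several mutually-exclusive options.
    mutually_exclusive_options: bool,
    /// The asker is blocked until an option is chosen.
    blocking: bool,
    /// Picking wrong carries a material consequence.
    material_consequence: bool,
    /// The asker delegated the pick to the agent.
    delegated_picker: bool,
}

/// Decision Inbox trigger: all four properties must hold;
/// anything weaker stays a plain chat reply.
fn is_decision(reply: &ReplyShape) -> bool {
    reply.mutually_exclusive_options
        && reply.blocking
        && reply.material_consequence
        && reply.delegated_picker
}
```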

Why

The enumerated triggers didn't scale. Verdict-shaped requests in triage, hiring, time-boxing, and compliance use neutral phrasing ("tell me which 3 to fix", "walk me through whether we need X", "status on the auth bug") and were falling through to send_message. The structural rule generalizes to any new workflow without re-listing phrasings.

How it was validated

bench/decision-trigger/run.sh spawns 15 isolated claude/sonnet agents in parallel (one per case), sends each its DM, classifies the response turn from the server log, and scores against predictions:
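
The classification step is, in effect, string matching on the logged tool call. A minimal sketch, assuming the server log yields one response turn's tool calls as a string; `Turn` and `classify_turn` are hypothetical (the real classifier lives in run.sh):

```rust
/// Hypothetical classification of a response turn pulled from the
/// server log. Tool names match the prompt's contract.
#[derive(Debug, PartialEq)]
enum Turn {
    Decision, // agent called dispatch_decision
    Chat,     // agent called send_message
    Silent,   // neither tool observed in the turn
}

fn classify_turn(log_turn: &str) -> Turn {
    if log_turn.contains("dispatch_decision") {
        Turn::Decision
    } else if log_turn.contains("send_message") {
        Turn::Chat
    } else {
        Turn::Silent
    }
}
```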

match: 15/15

The 15 cases span 11 work domains: PR review, code explain, vendor pick, architecture, status check, triage, hiring, doc proofread, compliance, time-box, naming. 9 decision predictions / 6 chat predictions, all match.

The benchmark itself is structured for re-runs: agent names are made unique via a run-id, non-bench agents are intentionally paused during runs (avoiding #all welcome storms), and prompts are strictly side-effect-free.
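
The run-id trick can be as small as folding it into every bench agent's name. A hypothetical helper, not the script's actual naming scheme:

```rust
// Hypothetical: derive a per-run unique agent name so a re-run never
// collides with agents left over from an earlier run.
fn bench_agent_name(run_id: &str, case_id: u32) -> String {
    format!("bench-{run_id}-case-{case_id}")
}
```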

Test plan

  • cargo test --lib drivers::prompt::tests — 11 prompt tests pass (8 existing unchanged, 2 existing extended with new assertions, 1 new structural-property test)
  • bench/decision-trigger/run.sh — 15/15 match
  • Manual sanity-check the new prompt section reads cleanly when an agent loads it

🤖 Generated with Claude Code

Fullstop000 and others added 7 commits May 1, 2026 18:24
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the input-pattern enumeration in the Decision Inbox prompt section
(PR-review phrasing, "should I X or Y", config-knob examples) with a
four-property structural test: mutually-exclusive options + blocking +
material consequence + delegated picker. The trigger is the shape of the
agent's intended reply, not the asker's words. The PR-review case
becomes the canonical example, not the rule.

Why: the enumeration didn't scale. Verdict-shaped requests in triage,
hiring, time-boxing, and compliance use neutral phrasing ("tell me which
3 to fix", "walk me through whether we need X") and were falling
through to send_message. The structural rule generalizes to any new
workflow without re-listing phrasings.

Add bench/decision-trigger/ — a reproducible benchmark that spins up
one isolated claude/sonnet agent per case in parallel, dispatches a
DM, and classifies the response turn as decision (dispatch_decision) or
chat (send_message). 15 cases across work domains including PR review,
vendor pick, architecture, status, triage, hiring, doc, compliance,
time-box, and naming. Current score: 15/15.

The benchmark intentionally pauses non-bench agents during runs so the
bench cohort isn't drowned in #all welcome messages. Side-effect-free
prompts only — README documents the constraint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flag

Two follow-up changes building on the structural-rule rewrite:

1) Whole-prompt injectability for benchmark/A-B convenience.
   Adds CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE env var: when set to a readable
   file, the file's contents become the system prompt verbatim. Also adds
   PromptOptions.system_prompt_override for in-process tests/benches.
   Programmatic override wins over env var. Tool names must be pre-resolved
   in the override file (no template substitution). Lets the bench compare
   prompt variants without rebuilding the binary.

2) Drop include_stdin_notification_section + MessageNotificationStyle.
   The flag toggled between two phrasings of the same message-delivery
   contract — "you'll be restarted" vs "messages may arrive directly". The
   LLM doesn't need to distinguish; it just needs to know not to poll. One
   universal Message Notifications section now always emits, telling the
   agent to call check_messages at natural breakpoints.

Updates all 5 driver call sites to use the simpler
PromptOptions { ..Default::default() } pattern. Adds 4 prompt tests
covering both override paths and asserting the conditional notification
branching is gone.
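
A minimal sketch of that resolution order, assuming a PromptOptions struct
with a Default impl; everything except the system_prompt_override field and
the CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE variable is a hypothetical stand-in
for the real prompt-building path:

```rust
use std::fs;

// Sketch only: the real struct has more fields, which is what makes the
// { ..Default::default() } call-site pattern convenient.
#[derive(Default)]
struct PromptOptions {
    system_prompt_override: Option<String>,
}

// Hypothetical resolution order: programmatic override beats the env var,
// which beats the normal templated prompt (stubbed out here).
fn resolve_system_prompt(opts: &PromptOptions) -> String {
    if let Some(prompt) = &opts.system_prompt_override {
        return prompt.clone();
    }
    if let Ok(path) = std::env::var("CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE") {
        if let Ok(contents) = fs::read_to_string(&path) {
            // Used verbatim: tool names must already be resolved in the file.
            return contents;
        }
    }
    build_default_prompt() // stand-in for the real template path
}

fn build_default_prompt() -> String {
    String::from("<templated system prompt>")
}
```

A call site then sets only what it needs, e.g.
PromptOptions { system_prompt_override: Some(text), ..Default::default() }.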

bench/decision-trigger/README.md gains an A/B section showing how to use
the env var to compare prompt variants without recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds harder benchmark cases and a multi-model matrix runner that exposes
real differences between the structural-rule prompt and the model's own
inference style.

Hard cases (cases-hard.tsv, 15 scenarios):
- Realistic narrative framings (P0 escalation, sprint capacity, vendor
  procurement, hiring under deadline, SOC2 compliance, time-box at sprint
  end, architecture review, VP briefing)
- No verdict-flavored phrasing — no "merge or hold", no "what's your
  call", no "X or Y". Decisions must be inferred from situational context
- Trap cases for chat (rhetorical frustration, retrospective, exploration,
  status update, info request, debug ask, facilitator role)

Multi-model matrix:
- models.tsv lists (runtime, model, tier, label) rows. The default ships
  with the two-per-family pattern: Anthropic best/efficiency, OpenAI
  best/efficiency
- run.sh now takes RUNTIME, MODEL, RUN_LABEL, CASES via env so it can be
  driven by the matrix runner (a rough sketch follows this list)
- run-matrix.sh sweeps all rows in models.tsv, runs the bench once per
  model, and collates a side-by-side matrix.tsv
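
A rough Rust rendering of the sweep (the actual runner is run-matrix.sh,
a shell script); column handling assumes the (runtime, model, tier, label)
order stated above, and the matrix.tsv collation is omitted:

```rust
use std::fs;
use std::process::Command;

// Hypothetical sketch of the matrix sweep: one bench run per models.tsv
// row, driven through run.sh's env interface.
fn main() -> std::io::Result<()> {
    let rows = fs::read_to_string("models.tsv")?;
    for line in rows.lines().filter(|l| !l.trim().is_empty() && !l.starts_with('#')) {
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 4 {
            continue; // skip malformed rows
        }
        let (runtime, model, label) = (cols[0], cols[1], cols[3]);
        let status = Command::new("./run.sh")
            .env("RUNTIME", runtime)
            .env("MODEL", model)
            .env("RUN_LABEL", label)
            .env("CASES", "cases-hard.tsv")
            .status()?;
        assert!(status.success(), "bench run failed for {label}");
    }
    Ok(())
}
```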

Baseline (cases-hard.tsv, structural-rule prompt):
- claude/opus:        9/15  (conservative — implicit delegation reads as chat)
- claude/sonnet:      15/15 (best — infers delegation from context)
- codex/gpt-5.5:      14/15 (one hiring miss)
- codex/gpt-5.4-mini: 13/15 (one mis-fire, one silent)

All 4 models score 7/7 on chat cases. The discriminator is property #4
(Delegated) — whether the model treats "we need X by Y" as an implicit
delegation. Same prompt, same cases, 9-15/15 spread by model.

BASELINE.md captures this and lays out the implications for the next
prompt iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the actual head-to-head between the OLD prompt (input-pattern
enumeration on main) and the NEW prompt (four-property structural test
on this branch). Same 15 hard cases, same 4 models, parallel runner.

Headline scores (cases-hard.tsv):

  Model               Tier         OLD     NEW    Δ
  claude/opus         best         15/15   9/15   -6
  claude/sonnet       efficiency   14/15   15/15  +1
  codex/gpt-5.5       best         14/15   14/15   0
  codex/gpt-5.4-mini  efficiency   12/15   13/15  +1
  -------------------------------------------------
  average                          13.75   12.75  -1.0

Aggregate behavior:
  Decisions caught (32 max):  OLD 30/32 (94%) vs NEW 23/32 (72%)
  Chat held back (28 max):    OLD 25/28 (89%) vs NEW 28/28 (100%)

The structural rewrite is NOT a strict win. NEW closes the retrospective
false-positive (case 10: "in hindsight, was that the right call?" — OLD
over-fires on sonnet/gpt-5.5/gpt-5.4-mini, NEW correctly chats on all).
But NEW costs Opus 6 implicit-delegation decisions because Opus reads
property #4 (Delegated) strictly: "we need X by Y" doesn't count as
delegation without an explicit "you pick" clause.

Sonnet, gpt-5.5, and gpt-5.4-mini are stable across both prompts —
they infer delegation from situational context regardless of which rule
is loaded. The Opus regression is model-specific.

BASELINE.md captures the full per-case matrix, named winners and losers,
known failure modes (gpt-5.4-mini case 1 silent under NEW; gpt-5.5 case 5
flips), and three iteration paths for the next prompt revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Fullstop000 Fullstop000 merged commit 8e484c9 into main May 2, 2026
3 checks passed
@Fullstop000 Fullstop000 deleted the decision-trigger-structural-rule branch May 2, 2026 03:57