1 change: 1 addition & 0 deletions .gitignore
@@ -74,3 +74,4 @@ node_modules
.mcp.json
.opencode.json
.windsurfrules
/bench/decision-trigger/results/
126 changes: 126 additions & 0 deletions bench/decision-trigger/BASELINE.md
@@ -0,0 +1,126 @@
# Decision-trigger benchmark — A/B baseline (OLD vs NEW prompt)

Head-to-head between the **OLD** prompt (input-pattern enumeration, on `main` before PR #133) and the **NEW** prompt (four-property structural test, on PR #133 branch). Same 15 hard cases, same 4 models, same parallel runner. Captured 2026-05-02.

## Headline scores (cases-hard.tsv)

| Model | Tier | OLD | NEW | Δ |
|---|---|---|---|---|
| claude/opus | best | **15/15** | 9/15 | **−6** |
| claude/sonnet | efficiency | 14/15 | **15/15** | +1 |
| codex/gpt-5.5 | best | 14/15 | 14/15 | 0 |
| codex/gpt-5.4-mini | efficiency | 12/15 | **13/15** | +1 |
| **average** | | **13.75/15** | **12.75/15** | **−1.0** |

The NEW prompt **regresses on Opus by 6 points**, gains 1 each on Sonnet and gpt-5.4-mini, and washes on gpt-5.5. Net negative on average.

## Aggregate behavior delta

| | OLD prompt | NEW prompt |
|---|---|---|
| Decision-cases caught (max 32 = 8 cases × 4 models) | 30/32 (94%) | 23/32 (72%) |
| Chat-cases held back (max 28 = 7 cases × 4 models) | 25/28 (89%) | **28/28 (100%)** |

OLD is better at firing decisions. NEW is better at restraint. Different tradeoff, not a strict win.

## Per-case breakdown

```
case predicted OLD-opus NEW-opus OLD-sonnet NEW-sonnet OLD-gpt5.5 NEW-gpt5.5 OLD-mini NEW-mini
1 decision decision chat decision decision decision decision chat unknown
2 decision decision chat decision decision decision decision decision decision
3 decision decision chat decision decision decision decision decision decision
4 decision decision chat decision decision decision decision decision decision
5 decision decision chat decision decision decision chat decision decision
6 decision decision decision decision decision decision decision decision decision
7 decision decision chat decision decision decision decision decision decision
8 decision decision decision decision decision decision decision chat chat
9 chat chat chat chat chat chat chat chat chat
10 chat chat chat decision chat decision chat decision chat
11 chat chat chat chat chat chat chat chat chat
12 chat chat chat chat chat chat chat chat chat
13 chat chat chat chat chat chat chat chat chat
14 chat chat chat chat chat chat chat chat chat
15 chat chat chat chat chat chat chat chat chat
```

## Where each prompt wins

### NEW wins on case 10 (retrospective trap), 3 cells

Case 10 prompt: *"I shipped the auth fix yesterday. In hindsight, given what we now know about the migration timing, was that the right call?"*

| Model | OLD prompt | NEW prompt |
|---|---|---|
| sonnet | decision (over-fires) | chat ✓ |
| gpt-5.5 | decision (over-fires) | chat ✓ |
| gpt-5.4-mini | decision (over-fires) | chat ✓ |
| opus | chat ✓ | chat ✓ |

The structural rule's properties #2 (Blocking) and #3 (Material consequence) explicitly fail for retrospectives — the PR already shipped, nothing is gated on the agent's verdict. The OLD prompt's input-pattern matching can't distinguish *"was that the right call?"* from a current verdict, so 3/4 models fire incorrectly. **NEW is genuinely better at restraint.**

### OLD wins on Opus, 6 cells (cases 1, 2, 3, 4, 5, 7 — all implicit-delegation decisions)

Each of these cases presents mutually exclusive options + a deadline + situational context, but lacks an explicit *"you pick"* clause.

- **OLD prompt** enumerates *"presents two or more concrete alternatives and asks you to pick"*. Opus interprets "presents alternatives + deadline" as the trigger and fires.
- **NEW prompt** requires all four structural properties including #4 (Delegated). Opus reads *"we need a responder lined up before the call"* as the team's own action item, not a delegation to the agent, and refuses to fire.

Why is this Opus-specific? Sonnet, gpt-5.5, and gpt-5.4-mini all infer delegation from situational context regardless of which prompt is loaded. Opus is the only model that strictly waits for an explicit *"you pick"* under the NEW rule. **The NEW rule's strict interpretation of property #4 is exactly what trips Opus.**

### Stable across both prompts

- **Case 6** (*"I need to give my VP an answer by 4pm"* — explicit time-anchored ask): every model fires decision under both prompts.
- **Case 8** (sprint-end time-box with options laid out): every model except gpt-5.4-mini fires decision under both prompts. gpt-5.4-mini misses under both — model capability ceiling, not a prompt issue.
- **All chat cases except 10**: clean restraint across all 4 models, both prompts.

## Failure modes worth noting

1. **gpt-5.4-mini case 1 (NEW = `unknown`)** — the model called no tool at all under the NEW prompt. The run completed with `reason=Natural`, yet there was no `dispatch_decision` and no `send_message`. On the same case under the OLD prompt it made a real choice (`chat`, wrong vs. the prediction, but a choice). The NEW prompt appears to make gpt-5.4-mini freeze on this case; worth investigating.

2. **gpt-5.5 case 5 (NEW = `chat`)** — the only OLD/NEW divergence on gpt-5.5. The hiring case landed in chat under the NEW prompt. Pulling the agent's actual reply from the captured run (sketched below) would tell us why.
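
A hedged way to pull that evidence from the captured runs (the per-model layout inside `matrix-<unix_ts>/` is assumed here; the `tool call agent=` line is the one the runner's log scrape keys on):

```bash
# Illustrative forensics over the NEW-prompt matrix; adjust the path to
# whatever layout the matrix runner actually writes per model.
grep -Rn "tool call agent=" bench/decision-trigger/results/matrix-1777647089/ \
  | grep -E "dispatch_decision|send_message"
```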

## Conclusion

**The structural rewrite is a tradeoff, not a strict win.** Average pass rate drops 1 point (13.75 → 12.75 / 15) across 4 models, but the loss is concentrated on a single model (Opus) and the gain is real signal (case 10 restraint).

What it actually achieves:
- ✅ **Clean restraint on retrospectives.** The OLD prompt's input-pattern matching has a known false-positive on retrospective phrasing; the NEW rule closes it.
- ❌ **Loses on implicit-delegation decisions for Opus.** The strict reading of property #4 (Delegated) excludes the *"we need X by Y"* framings that real teams use all the time. Opus is the only model that takes this strictness literally.
- 〇 **Wash on Sonnet, gpt-5.5, gpt-5.4-mini.** Those models infer delegation from situational context regardless of which prompt is loaded.

## Implications for the next iteration

Three options for the prompt-rule tuning:

1. **Soften property #4.** Add a clause like *"a request that lays out mutually exclusive alternatives plus a deadline counts as implicit delegation, even without an explicit 'you pick'."* Recovers Opus without losing Sonnet/gpt-5.5/gpt-5.4-mini.
2. **Accept the Opus regression.** Ship the NEW rule as-is — the chat-restraint gain is principled, and Opus users can be coached toward explicit phrasing. Trade decision-firing for false-positive avoidance.
3. **Split the prompt by tier.** Opus gets a more permissive trigger, Sonnet gets the strict one. Maintenance cost.

This baseline lets us measure each iteration against real signal instead of guessing. Re-run after any prompt change that affects routing.

## Reproducing this report

```bash
# OLD prompt baseline (main, port 3002 + bridge 4322 to coexist with a running NEW server):
git worktree add /tmp/chorus-main main
cd /tmp/chorus-main && cargo build --bin chorus
/tmp/chorus-main/target/debug/chorus serve --port 3002 --bridge-port 4322 \
> /tmp/chorus-old.log 2>&1 &
CHORUS_LOG=/tmp/chorus-old.log \
CASES=$PWD/bench/decision-trigger/cases-hard.tsv \
./bench/decision-trigger/run-matrix.sh http://localhost:3002

# NEW prompt baseline (PR #133 branch, port 3001):
cargo build --bin chorus
./target/debug/chorus serve --port 3001 > /tmp/chorus-new.log 2>&1 &
CHORUS_LOG=/tmp/chorus-new.log \
CASES=$PWD/bench/decision-trigger/cases-hard.tsv \
./bench/decision-trigger/run-matrix.sh http://localhost:3001
```

Each matrix takes ~45-50 min (4 models, parallel-per-model). Raw results live under `bench/decision-trigger/results/matrix-<unix_ts>/`.

Captured runs in this report:
- OLD: `matrix-1777658557/`
- NEW: `matrix-1777647089/`
145 changes: 145 additions & 0 deletions bench/decision-trigger/README.md
@@ -0,0 +1,145 @@
# Decision-trigger benchmark

Evaluates whether the prompt in `src/agent/drivers/prompt.rs` causes agents to correctly route work between the **decision channel** (`dispatch_decision`) and the **chat channel** (`send_message`).

The current rule is structural — a request is a decision when ALL FOUR hold:

1. **Mutually exclusive** options
2. **Blocking** — the asker can't move until a pick lands
3. **Material consequence** — the pick commits resources or forecloses paths
4. **Delegated** — the asker is asking the agent to pick

Cases that hit all four should produce `dispatch_decision`. Anything else should produce `send_message`.

## What's measured

| | Description |
|---|---|
| **Input** | 15 hand-curated prompts spanning 10 work domains (PR review, vendor pick, architecture, status, triage, hiring, doc edit, compliance, time-box, naming). |
| **Setup** | One isolated Chorus agent per case (claude/sonnet), so there's no session-context bleed between cases. All agents run in parallel. |
| **Signal** | Per-agent log scrape: did the agent call `dispatch_decision` or `send_message` in its response turn? |
| **Score** | Match rate vs. the `predicted` column in `cases.tsv`. |
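
As a rough sketch of that scrape (the real grep lives in `run.sh`; the `tool call agent=...` line format is assumed from the note under known limitations):

```bash
# Hedged sketch, not the canonical run.sh implementation.
classify_agent() {
  local agent="$1"
  grep "tool call agent=$agent" "$CHORUS_LOG" \
    | grep -oE "dispatch_decision|send_message" \
    | tail -n 1   # last tool call in the response turn wins
}
```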

## Why one-agent-per-case in parallel

Running cases sequentially through a single agent corrupts the test in two ways:
1. **Context bleed** — case N inherits memory of cases 1..N-1, so the agent's choice on case N is biased.
2. **Stale-session timeouts** — codex/opencode `--resume` silently fails after a few minutes idle (see TODOS.md). Sequential runs hit this gap; one agent per case dodges it entirely.

Total wall time is `max(per_agent_turn) ≈ 2 min`, not `sum`.
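
A minimal sketch of that fan-out, assuming a hypothetical `run_case` helper that boots a fresh agent, sends the prompt, and records the tool call (the real loop lives in `run.sh`):

```bash
# Illustrative only: one fresh agent per case, all launched in parallel.
while IFS=$'\t' read -r id predicted prompt; do
  run_case "$id" "$predicted" "$prompt" &   # hypothetical helper
done < <(tail -n +2 "$CASES")               # skip the TSV header row
wait   # wall time ≈ the slowest single turn, not the sum
```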

## Prerequisites

- `chorus` binary built: `cargo build --bin chorus`
- Chorus server running with stdout/stderr captured to a log file
- Claude runtime authed (`chorus setup` confirms)
- `CHORUS_LOG` env var pointing to the server log (defaults to `/tmp/chorus-qa-server.log`)
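
A minimal sequence that satisfies all four, mirroring the reproduction steps in `BASELINE.md`:

```bash
cargo build --bin chorus
./target/debug/chorus serve --port 3001 > /tmp/chorus-qa-server.log 2>&1 &
export CHORUS_LOG=/tmp/chorus-qa-server.log
./target/debug/chorus setup   # confirms the claude runtime is authed
```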

## Cases

Two case files at different difficulty:

| File | Style | What it measures |
|---|---|---|
| `cases.tsv` | Easy / smoke. Decision-shaped requests use verdict-flavored phrasing (*"merge or hold?"*, *"what do you recommend?"*, *"your call"*). | Sanity check: prompt teaches the rule at all. Both input-pattern and structural-rule prompts hit 15/15 on this. |
| `cases-hard.tsv` | Realistic narrative scenarios. Decision-shaped requests use **neutral phrasing** (no "recommend", no "verdict", no "X or Y"). Trap cases include rhetorical frustration, retrospectives, exploration, status updates, and facilitation asks. | Differentiates prompts that pattern-match input phrasing from prompts that test the structural shape of the agent's intended reply. |

To use the harder set:
```bash
CASES=$PWD/bench/decision-trigger/cases-hard.tsv ./bench/decision-trigger/run.sh
```

## Running

Single run against the default model (`claude/sonnet`):
```bash
./bench/decision-trigger/run.sh
```

Pick a different runtime/model:
```bash
RUNTIME=codex MODEL=gpt-5.5 ./bench/decision-trigger/run.sh
```

Sweep all models in `models.tsv` and produce a side-by-side matrix:
```bash
./bench/decision-trigger/run-matrix.sh
CASES=$PWD/bench/decision-trigger/cases-hard.tsv ./bench/decision-trigger/run-matrix.sh
```

Common options:
```bash
./bench/decision-trigger/run.sh http://localhost:3001 # explicit server URL
KEEP_AGENTS=1 ./bench/decision-trigger/run.sh # don't auto-delete agents on exit (forensics)
CHORUS_LOG=/var/log/chorus.log ./bench/decision-trigger/run.sh
```

## Models matrix

`models.tsv` lists the (runtime, model, tier) combinations the matrix runner sweeps. Default ships with the two-per-family pattern: best + efficiency for Anthropic and OpenAI.

| runtime | model | tier | resolves to |
|---|---|---|---|
| claude | opus | best | Claude Opus 4.7 |
| claude | sonnet | efficiency | Claude Sonnet 4.6 |
| codex | gpt-5.5 | best | GPT-5.5 |
| codex | gpt-5.4-mini | efficiency | GPT-5.4-mini |

Add other rows (kimi, gemini, opencode) as Chorus drivers stabilize. Each row produces one column in the matrix output.
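
For example, a hypothetical Kimi row (the runtime and model names here are placeholders, not shipped defaults):

```bash
# Column order per the table above: runtime<TAB>model<TAB>tier.
printf 'kimi\tk2\tefficiency\n' >> bench/decision-trigger/models.tsv
```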

## A/B testing prompt variants

The whole system prompt is injectable via `CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE`. To compare a candidate prompt against the built-in:

```bash
# 1. Save the current built-in prompt (e.g. by capturing what build_system_prompt
# produces from a unit test or a one-shot CLI helper) to baseline.md.
# 2. Write your candidate prompt to candidate.md.
# 3. For each variant, restart the chorus server with the env var pointing at it:

CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE=$PWD/baseline.md chorus serve --port 3001 &
./bench/decision-trigger/run.sh # records run as bench/.../results/<ts>/results.tsv
kill %1

CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE=$PWD/candidate.md chorus serve --port 3001 &
./bench/decision-trigger/run.sh
kill %1
```

The override is a verbatim substitution — the file content becomes the system prompt. No template substitution, no merging. Tool names must already be resolved (use `mcp__chat__send_message` for the claude runtime, bare `send_message` for codex/kimi/gemini/opencode).
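
A sketch of staging a claude-runtime candidate. The prompt body is illustrative, and the assumption that the decision tool resolves to `mcp__chat__dispatch_decision` is mine; only `mcp__chat__send_message` is confirmed above:

```bash
cat > candidate.md <<'EOF'
...full candidate system prompt...
# Assumed resolved names for the claude runtime:
# route decisions to mcp__chat__dispatch_decision,
# everything else to mcp__chat__send_message.
EOF
```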

## Output

Each run writes to `bench/decision-trigger/results/<unix_ts>/`:

- `results.tsv` — per-case `id, agent, predicted, actual, match, prompt`
- `log-slice.txt` — the relevant slice of the server log for forensics

Exit code is `0` if all cases match, `1` otherwise.
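
A quick way to recompute a run's score, comparing `predicted` against `actual` directly instead of trusting the `match` column's exact encoding:

```bash
# Columns per the list above: id, agent, predicted, actual, match, prompt.
# Assumes a header row; adjust the NR arithmetic if there is none.
awk -F'\t' 'NR > 1 && $3 == $4 { hits++ } END { printf "%d/%d\n", hits, NR - 1 }' \
  bench/decision-trigger/results/<unix_ts>/results.tsv
```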

## Cases (`cases.tsv`)

Each row is `id <tab> predicted <tab> prompt`. To add a case:

1. Append a new row (see the example after this list).
2. Set `predicted` to `decision` or `chat` based on the structural test above.
3. Make the prompt **current-tense and unambiguous** about who is blocked. Retrospective phrasing ("should we have shipped X?") fails property #2 and is correctly classified as `chat`, so don't predict `decision` for it.
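
For instance, a hypothetical new case that hits all four properties (scenario invented for illustration):

```bash
# Case 16: mutually exclusive, blocking, consequential, explicitly delegated.
printf '16\tdecision\tWe can ship the hotfix as a patch release today or fold it into the Thursday minor release. Support is holding three escalations until we commit either way. Pick one.\n' \
  >> bench/decision-trigger/cases.tsv
```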

## Interpreting results

A `match: 15/15` confirms the prompt rule is well-formed for general work. Anything below that needs investigation:

- **`predicted=decision actual=chat`** — the agent missed a verdict-shaped request. Either the prompt is too restrictive, or the case wording is too soft. Check whether all four properties actually hold; if so, the rule needs a stronger trigger for that workflow class.
- **`predicted=chat actual=decision`** — the agent over-fired. The structural rule has a false positive. Tighten the trigger or improve the canonical example.
- **`actual=unknown`** — the agent didn't call either tool, or the log scrape missed the call. Check `log-slice.txt`.

## Known limitations

- Single-runtime test (claude/sonnet). Codex/opencode have known stale-session bugs and aren't included until those drivers ship the analogous `--resume` guard.
- Log-scrape classification is brittle to log format changes. If the `tool call agent=...` log line moves or renames, update the grep in `run.sh`.
- Per-case agent boot time (~10-30s) dominates wall time for short tests.
- Cases must be **side-effect-free**. An agent given "edit X" or "fix typos in Y" will mutate the repo, leaving uncommitted changes. Frame action cases as "report what you'd change" or use a sandbox path the runner pre-stages and cleans up.

## Provenance

This benchmark was added in the PR that rewrote the prompt's decision trigger from input-pattern enumeration to a structural four-property test. See git history for context.
16 changes: 16 additions & 0 deletions bench/decision-trigger/cases-hard.tsv
@@ -0,0 +1,16 @@
id predicted prompt
1 decision Acme just filed a P0. Their CEO is calling our CEO at 4pm. Eng-lead Bob is on vacation. Sarah (sr eng) rolled off another P0 yesterday and is fried. New eng Mike has time but doesn't know the codebase. We need a responder lined up before the call.
2 decision Sprint capacity is one fix slot left this week. Bug A: five customers hit a login latency spike. Bug B: one customer can't export a CSV larger than 100MB. Bug C: a feature 3% of users hit crashes on a specific edge case. Sprint ends Friday.
3 decision Eng wants Postgres for transactional consistency. Infra wants DynamoDB for ops simplicity. Finance wants whichever is cheaper. New service, architecture review tomorrow morning.
4 decision The SOC2 auditor wants finding F-7 remediated by Friday. We don't have a fix yet. Three paths I can see: deploy a compensating control we'd remove later; request a 2-week extension; accept the finding and document it.
5 decision Two finalists for the senior platform role. Alice: 8 years at AWS, deep K8s, communication is curt. Bob: 4 years but built our exact stack at his last startup, ships fast, but bus-factor risk if he leaves. Offer letter goes out tomorrow.
6 decision My VP is briefing the CEO at 5pm on whether we ship the mobile launch. I need to give my VP an answer by 4pm. The launch has two known crashers in iOS 18 hitting 3% of users. Android is clean.
7 decision We need an observability stack. Datadog: $120K/yr with features we won't use. New Relic: $60K/yr with an OK UI. Honeycomb: $40K/yr with the best ergonomics for our debug-heavy workflow. Procurement closes Q2 budget on Friday.
8 decision Two days into investigating the codex --resume stale-session bug. Sprint ends tomorrow. The fix path is murky — could be a TTL issue in rmcp, could be our session-id encoding, could be the codex CLI itself. We have a workaround that just skips resume on stale session.
9 chat Why are we even using event-sourcing for this service? Feels like overkill for the actual usage pattern.
10 chat I shipped the auth fix yesterday. In hindsight, given what we now know about the migration timing, was that the right call?
11 chat I'm thinking about proposing we deprecate the v1 API at the next architecture sync. What are the tradeoffs I should weigh before bringing it up?
12 chat Quick update: auth refactor is at 60%, the team wants to pause and reassess after seeing how the MCP migration went. No blockers, just a heads up.
13 chat What's the latency budget our SLO commits to? I'm setting the timeout in the new client SDK and want to make sure I match it.
14 chat Something crashed around 3:42am. Can you look at logs/server.log and tell me what went wrong?
15 chat Eng wants Postgres for transactional consistency, Infra wants DynamoDB for ops simplicity. I'm facilitating the architecture review tomorrow. Walk me through the tradeoff matrix so I can run a clean discussion — I'm not the decision-maker, I'm just running the meeting.