1 change: 1 addition & 0 deletions .gitignore
@@ -74,3 +74,4 @@ node_modules
.mcp.json
.opencode.json
.windsurfrules
/bench/decision-trigger/results/
126 changes: 126 additions & 0 deletions bench/decision-trigger/BASELINE.md
@@ -0,0 +1,126 @@
# Decision-trigger benchmark — A/B baseline (OLD vs NEW prompt)

Head-to-head between the **OLD** prompt (input-pattern enumeration, on `main` before PR #133) and the **NEW** prompt (four-property structural test, on PR #133 branch). Same 15 hard cases, same 4 models, same parallel runner. Captured 2026-05-02.

## Headline scores (cases-hard.tsv)

| Model | Tier | OLD | NEW | Δ |
|---|---|---|---|---|
| claude/opus | best | **15/15** | 9/15 | **−6** |
| claude/sonnet | efficiency | 14/15 | **15/15** | +1 |
| codex/gpt-5.5 | best | 14/15 | 14/15 | 0 |
| codex/gpt-5.4-mini | efficiency | 12/15 | **13/15** | +1 |
| **average** | | **13.75/15** | **12.75/15** | **−1.0** |

The NEW prompt **regresses on Opus by 6 points**, gains 1 each on Sonnet and gpt-5.4-mini, and washes on gpt-5.5. Net negative on average.

## Aggregate behavior delta

| | OLD prompt | NEW prompt |
|---|---|---|
| Decision-cases caught (max 32 = 8 cases × 4 models) | 30/32 (94%) | 23/32 (72%) |
| Chat-cases held back (max 28 = 7 cases × 4 models) | 25/28 (89%) | **28/28 (100%)** |

OLD is better at firing decisions. NEW is better at restraint. Different tradeoff, not a strict win.

## Per-case breakdown

```
case predicted OLD-opus NEW-opus OLD-sonnet NEW-sonnet OLD-gpt5.5 NEW-gpt5.5 OLD-mini NEW-mini
1 decision decision chat decision decision decision decision chat unknown
2 decision decision chat decision decision decision decision decision decision
3 decision decision chat decision decision decision decision decision decision
4 decision decision chat decision decision decision decision decision decision
5 decision decision chat decision decision decision chat decision decision
6 decision decision decision decision decision decision decision decision decision
7 decision decision chat decision decision decision decision decision decision
8 decision decision decision decision decision decision decision chat chat
9 chat chat chat chat chat chat chat chat chat
10 chat chat chat decision chat decision chat decision chat
11 chat chat chat chat chat chat chat chat chat
12 chat chat chat chat chat chat chat chat chat
13 chat chat chat chat chat chat chat chat chat
14 chat chat chat chat chat chat chat chat chat
15 chat chat chat chat chat chat chat chat chat
```

## Where each prompt wins

### NEW wins on case 10 (retrospective trap), 3 cells

Case 10 prompt: *"I shipped the auth fix yesterday. In hindsight, given what we now know about the migration timing, was that the right call?"*

| Model | OLD prompt | NEW prompt |
|---|---|---|
| sonnet | decision (over-fires) | chat ✓ |
| gpt-5.5 | decision (over-fires) | chat ✓ |
| gpt-5.4-mini | decision (over-fires) | chat ✓ |
| opus | chat ✓ | chat ✓ |

The structural rule's properties #2 (Blocking) and #3 (Material consequence) explicitly fail for retrospectives — the PR already shipped, nothing is gated on the agent's verdict. The OLD prompt's input-pattern matching can't distinguish *"was that the right call?"* from a current verdict, so 3/4 models fire incorrectly. **NEW is genuinely better at restraint.**

### OLD wins on Opus, 6 cells (cases 1, 2, 3, 4, 5, 7 — all implicit-delegation decisions)

Each of these cases presents mutually exclusive options + a deadline + situational context, but lacks an explicit *"you pick"* clause.

- **OLD prompt** enumerates *"presents two or more concrete alternatives and asks you to pick"*. Opus interprets "presents alternatives + deadline" as the trigger and fires.
- **NEW prompt** requires all four structural properties including #4 (Delegated). Opus reads *"we need a responder lined up before the call"* as the team's own action item, not a delegation to the agent, and refuses to fire.

Why is this Opus-specific? Sonnet, gpt-5.5, and gpt-5.4-mini all infer delegation from situational context regardless of which prompt is loaded. Opus is the only model that strictly waits for an explicit *"you pick"* under the NEW rule. **The NEW rule's strict interpretation of property #4 is exactly what trips Opus.**

### Stable across both prompts

- **Case 6** (*"I need to give my VP an answer by 4pm"* — explicit time-anchored ask): every model fires decision under both prompts.
- **Case 8** (sprint-end time-box with options laid out): every model except gpt-5.4-mini fires decision under both prompts. gpt-5.4-mini misses under both — model capability ceiling, not a prompt issue.
- **All chat cases except 10**: clean restraint across all 4 models, both prompts.

## Failure modes worth noting

1. **gpt-5.4-mini case 1 (NEW = `unknown`)** — the model called no tool at all under the NEW prompt. The run completed with `reason=Natural`, yet there was no `dispatch_decision` and no `send_message`. On the same case under the OLD prompt it made a real choice (`chat`, wrong vs. the prediction, but a choice). The NEW prompt appears to make gpt-5.4-mini freeze on this case; worth investigating.

2. **gpt-5.5 case 5 (NEW = `chat`)** — the only OLD/NEW divergence on gpt-5.5. The hiring case landed in chat under the NEW prompt. Pulling the agent's actual reply from the captured run (sketched below) would tell us why.
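
A hedged way to pull that evidence from the captured runs (the per-model layout inside `matrix-<unix_ts>/` is assumed here; the `tool call agent=` line is the one the runner's log scrape keys on):

```bash
# Illustrative forensics over the NEW-prompt matrix; adjust the path to
# whatever layout the matrix runner actually writes per model.
grep -Rn "tool call agent=" bench/decision-trigger/results/matrix-1777647089/ \
  | grep -E "dispatch_decision|send_message"
```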

## Conclusion

**The structural rewrite is a tradeoff, not a strict win.** Average pass rate drops 1 point (13.75 → 12.75 / 15) across 4 models, but the loss is concentrated on a single model (Opus) and the gain is real signal (case 10 restraint).

What it actually achieves:
- ✅ **Clean restraint on retrospectives.** The OLD prompt's input-pattern matching has a known false-positive on retrospective phrasing; the NEW rule closes it.
- ❌ **Loses on implicit-delegation decisions for Opus.** The strict reading of property #4 (Delegated) excludes the *"we need X by Y"* framings that real teams use all the time. Opus is the only model that takes this strictness literally.
- 〇 **Wash on Sonnet, gpt-5.5, gpt-5.4-mini.** Those models infer delegation from situational context regardless of which prompt is loaded.

## Implications for the next iteration

Three options for the prompt-rule tuning:

1. **Soften property #4.** Add a clause like *"a request that lays out mutually exclusive alternatives plus a deadline counts as implicit delegation, even without an explicit 'you pick'."* Recovers Opus without losing Sonnet/gpt-5.5/gpt-5.4-mini.
2. **Accept the Opus regression.** Ship the NEW rule as-is — the chat-restraint gain is principled, and Opus users can be coached toward explicit phrasing. Trade decision-firing for false-positive avoidance.
3. **Split the prompt by tier.** Opus gets a more permissive trigger, Sonnet gets the strict one. Maintenance cost.

This baseline lets us measure each iteration against real signal instead of guessing. Re-run after any prompt change that affects routing.

## Reproducing this report

```bash
# OLD prompt baseline (main, port 3002 + bridge 4322 to coexist with a running NEW server):
git worktree add /tmp/chorus-main main
cd /tmp/chorus-main && cargo build --bin chorus
/tmp/chorus-main/target/debug/chorus serve --port 3002 --bridge-port 4322 \
> /tmp/chorus-old.log 2>&1 &
CHORUS_LOG=/tmp/chorus-old.log \
CASES=$PWD/bench/decision-trigger/cases-hard.tsv \
./bench/decision-trigger/run-matrix.sh http://localhost:3002

# NEW prompt baseline (PR #133 branch, port 3001):
cargo build --bin chorus
./target/debug/chorus serve --port 3001 > /tmp/chorus-new.log 2>&1 &
CHORUS_LOG=/tmp/chorus-new.log \
CASES=$PWD/bench/decision-trigger/cases-hard.tsv \
./bench/decision-trigger/run-matrix.sh http://localhost:3001
```

Each matrix takes ~45-50 min (4 models, parallel-per-model). Raw results live under `bench/decision-trigger/results/matrix-<unix_ts>/`.

Captured runs in this report:
- OLD: `matrix-1777658557/`
- NEW: `matrix-1777647089/`
145 changes: 145 additions & 0 deletions bench/decision-trigger/README.md
@@ -0,0 +1,145 @@
# Decision-trigger benchmark

Evaluates whether the prompt in `src/agent/drivers/prompt.rs` causes agents to correctly route work between the **decision channel** (`dispatch_decision`) and the **chat channel** (`send_message`).

The current rule is structural — a request is a decision when ALL FOUR hold:

1. **Mutually exclusive** options
2. **Blocking** — the asker can't move until a pick lands
3. **Material consequence** — the pick commits resources or forecloses paths
4. **Delegated** — the asker is asking the agent to pick

Cases that hit all four should produce `dispatch_decision`. Anything else should produce `send_message`.

## What's measured

| | Description |
|---|---|
| **Input** | 15 hand-curated prompts spanning 10 work domains (PR review, vendor pick, architecture, status, triage, hiring, doc edit, compliance, time-box, naming). |
| **Setup** | One isolated Chorus agent per case (claude/sonnet), so there's no session-context bleed between cases. All agents run in parallel. |
| **Signal** | Per-agent log scrape: did the agent call `dispatch_decision` or `send_message` in its response turn? |
| **Score** | Match rate vs. the `predicted` column in `cases.tsv`. |
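
As a rough sketch of that scrape (the real grep lives in `run.sh`; the `tool call agent=...` line format is assumed from the note under known limitations):

```bash
# Hedged sketch, not the canonical run.sh implementation.
classify_agent() {
  local agent="$1"
  grep "tool call agent=$agent" "$CHORUS_LOG" \
    | grep -oE "dispatch_decision|send_message" \
    | tail -n 1   # last tool call in the response turn wins
}
```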

## Why one-agent-per-case in parallel

Running cases sequentially through a single agent corrupts the test in two ways:
1. **Context bleed** — case N inherits memory of cases 1..N-1, so the agent's choice on case N is biased.
2. **Stale-session timeouts** — codex/opencode `--resume` silently fails after a few minutes idle (see TODOS.md). Sequential runs hit this gap; one agent per case dodges it entirely.

Total wall time is `max(per_agent_turn) ≈ 2 min`, not `sum`.
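
A minimal sketch of that fan-out, assuming a hypothetical `run_case` helper that boots a fresh agent, sends the prompt, and records the tool call (the real loop lives in `run.sh`):

```bash
# Illustrative only: one fresh agent per case, all launched in parallel.
while IFS=$'\t' read -r id predicted prompt; do
  run_case "$id" "$predicted" "$prompt" &   # hypothetical helper
done < <(tail -n +2 "$CASES")               # skip the TSV header row
wait   # wall time ≈ the slowest single turn, not the sum
```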

## Prerequisites

- `chorus` binary built: `cargo build --bin chorus`
- Chorus server running with stdout/stderr captured to a log file
- Claude runtime authed (`chorus setup` confirms)
- `CHORUS_LOG` env var pointing to the server log (defaults to `/tmp/chorus-qa-server.log`)
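
A minimal sequence that satisfies all four, mirroring the reproduction steps in `BASELINE.md`:

```bash
cargo build --bin chorus
./target/debug/chorus serve --port 3001 > /tmp/chorus-qa-server.log 2>&1 &
export CHORUS_LOG=/tmp/chorus-qa-server.log
./target/debug/chorus setup   # confirms the claude runtime is authed
```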

## Cases

Two case files at different difficulty:

| File | Style | What it measures |
|---|---|---|
| `cases.tsv` | Easy / smoke. Decision-shaped requests use verdict-flavored phrasing (*"merge or hold?"*, *"what do you recommend?"*, *"your call"*). | Sanity check: prompt teaches the rule at all. Both input-pattern and structural-rule prompts hit 15/15 on this. |
| `cases-hard.tsv` | Realistic narrative scenarios. Decision-shaped requests use **neutral phrasing** (no "recommend", no "verdict", no "X or Y"). Trap cases include rhetorical frustration, retrospectives, exploration, status updates, and facilitation asks. | Differentiates prompts that pattern-match input phrasing from prompts that test the structural shape of the agent's intended reply. |

To use the harder set:
```bash
CASES=$PWD/bench/decision-trigger/cases-hard.tsv ./bench/decision-trigger/run.sh
```

## Running

Single run against the default model (`claude/sonnet`):
```bash
./bench/decision-trigger/run.sh
```

Pick a different runtime/model:
```bash
RUNTIME=codex MODEL=gpt-5.5 ./bench/decision-trigger/run.sh
```

Sweep all models in `models.tsv` and produce a side-by-side matrix:
```bash
./bench/decision-trigger/run-matrix.sh
CASES=$PWD/bench/decision-trigger/cases-hard.tsv ./bench/decision-trigger/run-matrix.sh
```

Common options:
```bash
./bench/decision-trigger/run.sh http://localhost:3001 # explicit server URL
KEEP_AGENTS=1 ./bench/decision-trigger/run.sh # don't auto-delete agents on exit (forensics)
CHORUS_LOG=/var/log/chorus.log ./bench/decision-trigger/run.sh
```

## Models matrix

`models.tsv` lists the (runtime, model, tier) combinations the matrix runner sweeps. Default ships with the two-per-family pattern: best + efficiency for Anthropic and OpenAI.

| runtime | model | tier | resolves to |
|---|---|---|---|
| claude | opus | best | Claude Opus 4.7 |
| claude | sonnet | efficiency | Claude Sonnet 4.6 |
| codex | gpt-5.5 | best | GPT-5.5 |
| codex | gpt-5.4-mini | efficiency | GPT-5.4-mini |

Add other rows (kimi, gemini, opencode) as Chorus drivers stabilize. Each row produces one column in the matrix output.
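
For example, a hypothetical Kimi row (the runtime and model names here are placeholders, not shipped defaults):

```bash
# Column order per the table above: runtime<TAB>model<TAB>tier.
printf 'kimi\tk2\tefficiency\n' >> bench/decision-trigger/models.tsv
```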

## A/B testing prompt variants

The whole system prompt is injectable via `CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE`. To compare a candidate prompt against the built-in:

```bash
# 1. Save the current built-in prompt (e.g. by capturing what build_system_prompt
# produces from a unit test or a one-shot CLI helper) to baseline.md.
# 2. Write your candidate prompt to candidate.md.
# 3. For each variant, restart the chorus server with the env var pointing at it:

CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE=$PWD/baseline.md chorus serve --port 3001 &
./bench/decision-trigger/run.sh # records run as bench/.../results/<ts>/results.tsv
kill %1

CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE=$PWD/candidate.md chorus serve --port 3001 &
./bench/decision-trigger/run.sh
kill %1
```

The override is a verbatim substitution — the file content becomes the system prompt. No template substitution, no merging. Tool names must already be resolved (use `mcp__chat__send_message` for the claude runtime, bare `send_message` for codex/kimi/gemini/opencode).
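
A sketch of staging a claude-runtime candidate. The prompt body is illustrative, and the assumption that the decision tool resolves to `mcp__chat__dispatch_decision` is mine; only `mcp__chat__send_message` is confirmed above:

```bash
cat > candidate.md <<'EOF'
...full candidate system prompt...
# Assumed resolved names for the claude runtime:
# route decisions to mcp__chat__dispatch_decision,
# everything else to mcp__chat__send_message.
EOF
```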

## Output

Each run writes to `bench/decision-trigger/results/<unix_ts>/`:

- `results.tsv` — per-case `id, agent, predicted, actual, match, prompt`
- `log-slice.txt` — the relevant slice of the server log for forensics

Exit code is `0` if all cases match, `1` otherwise.
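
A quick way to recompute a run's score, comparing `predicted` against `actual` directly instead of trusting the `match` column's exact encoding:

```bash
# Columns per the list above: id, agent, predicted, actual, match, prompt.
# Assumes a header row; adjust the NR arithmetic if there is none.
awk -F'\t' 'NR > 1 && $3 == $4 { hits++ } END { printf "%d/%d\n", hits, NR - 1 }' \
  bench/decision-trigger/results/<unix_ts>/results.tsv
```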

## Cases (`cases.tsv`)

Each row is `id <tab> predicted <tab> prompt`. To add a case:

1. Append a new row (see the example after this list).
2. Set `predicted` to `decision` or `chat` based on the structural test above.
3. Make the prompt **current-tense and unambiguous** about who is blocked. Retrospective phrasing ("should we have shipped X?") fails property #2 and is correctly classified as `chat`, so don't predict `decision` for it.
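
For instance, a hypothetical new case that hits all four properties (scenario invented for illustration):

```bash
# Case 16: mutually exclusive, blocking, consequential, explicitly delegated.
printf '16\tdecision\tWe can ship the hotfix as a patch release today or fold it into the Thursday minor release. Support is holding three escalations until we commit either way. Pick one.\n' \
  >> bench/decision-trigger/cases.tsv
```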

## Interpreting results

A `match: 15/15` confirms the prompt rule is well-formed for general work. Anything below that needs investigation:

- **`predicted=decision actual=chat`** — the agent missed a verdict-shaped request. Either the prompt is too restrictive, or the case wording is too soft. Check whether all four properties actually hold; if so, the rule needs a stronger trigger for that workflow class.
- **`predicted=chat actual=decision`** — the agent over-fired. The structural rule has a false positive. Tighten the trigger or improve the canonical example.
- **`actual=unknown`** — the agent didn't call either tool, or the log scrape missed the call. Check `log-slice.txt`.

## Known limitations

- Single-runtime test (claude/sonnet). Codex/opencode have known stale-session bugs and aren't included until those drivers ship the analogous `--resume` guard.
- Log-scrape classification is brittle to log format changes. If the `tool call agent=...` log line moves or renames, update the grep in `run.sh`.
- Per-case agent boot time (~10-30s) dominates wall time for short tests.
- Cases must be **side-effect-free**. An agent given "edit X" or "fix typos in Y" will mutate the repo, leaving uncommitted changes. Frame action cases as "report what you'd change" or use a sandbox path the runner pre-stages and cleans up.

## Provenance

This benchmark was added in the PR that rewrote the prompt's decision trigger from input-pattern enumeration to a structural four-property test. See git history for context.
16 changes: 16 additions & 0 deletions bench/decision-trigger/cases-hard.tsv
@@ -0,0 +1,16 @@
id predicted prompt
1 decision Acme just filed a P0. Their CEO is calling our CEO at 4pm. Eng-lead Bob is on vacation. Sarah (sr eng) rolled off another P0 yesterday and is fried. New eng Mike has time but doesn't know the codebase. We need a responder lined up before the call.
2 decision Sprint capacity is one fix slot left this week. Bug A: five customers hit a login latency spike. Bug B: one customer can't export a CSV larger than 100MB. Bug C: a feature 3% of users hit crashes on a specific edge case. Sprint ends Friday.
3 decision Eng wants Postgres for transactional consistency. Infra wants DynamoDB for ops simplicity. Finance wants whichever is cheaper. New service, architecture review tomorrow morning.
4 decision The SOC2 auditor wants finding F-7 remediated by Friday. We don't have a fix yet. Three paths I can see: deploy a compensating control we'd remove later; request a 2-week extension; accept the finding and document it.
5 decision Two finalists for the senior platform role. Alice: 8 years at AWS, deep K8s, communication is curt. Bob: 4 years but built our exact stack at his last startup, ships fast, but bus-factor risk if he leaves. Offer letter goes out tomorrow.
6 decision My VP is briefing the CEO at 5pm on whether we ship the mobile launch. I need to give my VP an answer by 4pm. The launch has two known crashers in iOS 18 hitting 3% of users. Android is clean.
7 decision We need an observability stack. Datadog: $120K/yr with features we won't use. New Relic: $60K/yr with an OK UI. Honeycomb: $40K/yr with the best ergonomics for our debug-heavy workflow. Procurement closes Q2 budget on Friday.
8 decision Two days into investigating the codex --resume stale-session bug. Sprint ends tomorrow. The fix path is murky — could be a TTL issue in rmcp, could be our session-id encoding, could be the codex CLI itself. We have a workaround that just skips resume on stale session.
9 chat Why are we even using event-sourcing for this service? Feels like overkill for the actual usage pattern.
10 chat I shipped the auth fix yesterday. In hindsight, given what we now know about the migration timing, was that the right call?
11 chat I'm thinking about proposing we deprecate the v1 API at the next architecture sync. What are the tradeoffs I should weigh before bringing it up?
12 chat Quick update: auth refactor is at 60%, the team wants to pause and reassess after seeing how the MCP migration went. No blockers, just a heads up.
13 chat What's the latency budget our SLO commits to? I'm setting the timeout in the new client SDK and want to make sure I match it.
14 chat Something crashed around 3:42am. Can you look at logs/server.log and tell me what went wrong?
15 chat Eng wants Postgres for transactional consistency, Infra wants DynamoDB for ops simplicity. I'm facilitating the architecture review tomorrow. Walk me through the tradeoff matrix so I can run a clean discussion — I'm not the decision-maker, I'm just running the meeting.