Problem
The lion's share of the SDD pipeline's behavior is in the prompt body of each workflow's .md source. A prompt edit that removes a section header, reorders the few-shot examples, or drops a constraint sentence can subtly change every output the agent produces — without changing a single line of code that a code review would catch. Full E2E catches it after the fact; the Layer 0 checks catch nothing about prompts. There is no cheap, deterministic test for "did this prompt edit change the agent's behavior in a way the author didn't intend?"
Desired behavior
A recorded-transcript replay harness: for each agent (sdd-spec, sdd-triage phases A/B/C, sdd-execute, sdd-validate, sdd-review), capture one or more canonical input contexts (the tracking issue body, the merged spec, the architecture record — whatever the agent reads) and the agent's actual output (the safe-outputs it emitted) from a real run. On every PR that touches the agent's .md, re-feed the captured input to the current prompt against the same model and same MCP toolset, and structurally diff the output against the captured baseline.
The diff is structural, not byte-for-byte:
- Did the agent emit the same set of safe-output types (e.g. one create-issue and one add-comment)?
- Did the spec file the agent wrote contain the same set of section headers?
- Did the architecture record reference the same set of requirement IDs?
- Did the labels applied match the captured set?
- Do commit messages and PR titles match a structural template?
Behavior changes are flagged for human review on the PR; the test doesn't fail automatically. The reviewer accepts the diff (overwrites the baseline) or rejects (reverts the prompt edit).
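The structural checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shape of the `run` dictionary, the `REQ-<n>` requirement-ID format, and the function names are all hypothetical and not taken from the pipeline.

```python
import re
from collections import Counter

# Markdown ATX headers, e.g. "## Requirements".
HEADER_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)
# Hypothetical requirement-ID format; the real pipeline may use another.
REQ_ID_RE = re.compile(r"\bREQ-\d+\b")


def structural_summary(run: dict) -> dict:
    """Reduce one captured run to the coarse features the diff compares."""
    return {
        "output_types": Counter(o["type"] for o in run["safe_outputs"]),
        "spec_headers": set(HEADER_RE.findall(run.get("spec_markdown", ""))),
        "requirement_ids": set(REQ_ID_RE.findall(run.get("architecture_record", ""))),
        "labels": set(run.get("labels", [])),
    }


def structural_diff(baseline: dict, current: dict) -> dict:
    """Return only the features that differ, as (missing, added) pairs."""
    base, cur = structural_summary(baseline), structural_summary(current)
    diff = {}
    for key in base:
        if base[key] == cur[key]:
            continue
        if isinstance(base[key], Counter):
            # Counter subtraction keeps only positive counts.
            diff[key] = (dict(base[key] - cur[key]), dict(cur[key] - base[key]))
        else:
            diff[key] = (sorted(base[key] - cur[key]), sorted(cur[key] - base[key]))
    return diff


baseline = {
    "safe_outputs": [{"type": "create-issue"}, {"type": "add-comment"}],
    "spec_markdown": "# Spec\n## Requirements\n## Constraints\n",
    "architecture_record": "Covers REQ-1 and REQ-2.",
    "labels": ["sdd", "spec"],
}
current = dict(baseline, spec_markdown="# Spec\n## Requirements\n")

print(structural_diff(baseline, current))
# {'spec_headers': (['Constraints'], [])}
```

An empty result means no structural change; a non-empty one is what the PR comment would render for the reviewer. Because only sets and counts are compared, rewording inside a section does not register as a difference.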
Implementation
- tests/fixtures/prompt-replay/<agent>/<scenario>/: each scenario contains input/ (the captured context), prompt.lock (the snapshot of the prompt body used to produce the baseline), and output/ (the captured safe-outputs).
- A new workflow .github/workflows/prompt-replay.md runs on PRs that modify .github/workflows/*.md and re-runs each affected agent against its fixtures. This does cost LLM tokens, but only when the prompt actually changes; a typical infrastructure PR does not trigger replay.
- A scripts/prompt-replay-record.sh helper to capture a new baseline from a recent real run.
- Outputs a diff comment on the PR. The reviewer can comment /accept-replay to overwrite the baseline in the same PR, or /reject-replay to mark it for revert.
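To make the fixture layout concrete, here is a minimal sketch of how a replay job might map a PR's changed files to the scenarios it needs to re-run. It assumes, hypothetically, that an agent's name equals the stem of its workflow file; agents split across several files or phases (such as the triage phases) would need an explicit mapping instead. The function names are illustrative.

```python
from pathlib import Path, PurePosixPath

WORKFLOW_DIR = PurePosixPath(".github/workflows")
FIXTURE_ROOT = Path("tests/fixtures/prompt-replay")


def affected_agents(changed_files: list[str]) -> list[str]:
    """Agent names for changed workflow .md sources (assumes name == file stem)."""
    agents = set()
    for name in changed_files:
        path = PurePosixPath(name)
        if path.parent == WORKFLOW_DIR and path.suffix == ".md":
            agents.add(path.stem)
    return sorted(agents)


def scenarios_for(agent: str, fixture_root: Path = FIXTURE_ROOT) -> list[Path]:
    """Scenario directories that carry input/, prompt.lock and output/."""
    agent_dir = fixture_root / agent
    if not agent_dir.is_dir():
        return []  # agent has no fixtures yet, nothing to replay
    return sorted(
        d
        for d in agent_dir.iterdir()
        if (d / "input").is_dir()
        and (d / "output").is_dir()
        and (d / "prompt.lock").is_file()
    )


changed = [".github/workflows/sdd-spec.md", "scripts/build.sh", ".github/workflows/ci.yml"]
print(affected_agents(changed))  # ['sdd-spec']
```

Incomplete scenario directories are skipped rather than treated as errors, so a half-recorded baseline does not produce a misleading diff.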
Cost ceiling
- Trigger is path-filtered to workflow .md edits only (the prompt surface).
- Each agent has 1–2 fixtures; total replay cost per affected agent ≈ one E2E phase, ≈ $1–5.
- A typical prompt PR touches one agent: ≤ $10 per PR.
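A path filter alone fires on any edit to a workflow .md, including edits that never touch the prompt. One possible way to spend tokens only when the prompt body changes is to hash the body separately from the frontmatter and compare digests. The sketch below assumes the workflow source starts with a frontmatter block delimited by `---` lines; the function names and the sample `on:` keys are hypothetical.

```python
import hashlib


def prompt_body(workflow_md: str) -> str:
    """Return the text after a leading '---' ... '---' frontmatter block, if any."""
    lines = workflow_md.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).strip()
    # No (closed) frontmatter block: treat the whole file as prompt.
    return workflow_md.strip()


def prompt_digest(workflow_md: str) -> str:
    """SHA-256 of the prompt body only, so frontmatter edits leave it unchanged."""
    return hashlib.sha256(prompt_body(workflow_md).encode("utf-8")).hexdigest()


def prompt_changed(old_md: str, new_md: str) -> bool:
    """True when the two versions differ in prompt body, ignoring frontmatter."""
    return prompt_digest(old_md) != prompt_digest(new_md)


old = "---\non: pull_request\n---\n# Spec agent\nAlways cite requirement IDs.\n"
frontmatter_only = "---\non: issues\n---\n# Spec agent\nAlways cite requirement IDs.\n"
body_edit = "---\non: pull_request\n---\n# Spec agent\n"

print(prompt_changed(old, frontmatter_only))  # False
print(prompt_changed(old, body_edit))         # True
```

The same digest could be compared against a value stored alongside prompt.lock to detect a baseline recorded against a stale prompt, though that use is a suggestion rather than part of the proposal.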
Acceptance
- An edit to sdd-spec's prompt (removing one constraint sentence) triggers replay; the resulting spec drops a section the baseline includes; the diff comment names the missing section; the reviewer accepts or rejects in-PR.
- An edit to a workflow .md outside the prompt (e.g. frontmatter, secrets) does not trigger replay.
- Each new agent (e.g. sdd-dispatch from #81, sdd-fastpath from #82) gets a fixture set in the same PR that adds the agent.
Out of scope
- Validating the agent's chain-of-thought (only the output safe-outputs are diffed).
- Catching model drift (Anthropic ships a Sonnet update, output changes a little). The structural diff is intentionally coarse to absorb that.
- Replaying the deterministic wrappers under wrappers/ (no prompt, no value).
References
- Layer 1 (/e2e dispatcher) is the orthogonal expensive test; Layer 3 is the cheap-when-relevant prompt test. Both have value.
- .github/workflows/sdd-{spec,triage,execute-*,validate,review}.md
- The transcript-capture pattern overlaps with gh-aw's built-in run logs; investigate whether gh aw run --capture is the right primitive before building from scratch.