test (layer 3): recorded transcript replay for prompt regression detection #91

## Problem

Most of the SDD pipeline's behavior lives in the prompt body of each workflow's `.md` source. A prompt edit that removes a section header, reorders the few-shot examples, or drops a constraint sentence can subtly change every output the agent produces, without touching a single line of code that a code review would flag. Full E2E catches such a change only after the fact; the Layer 0 checks catch nothing about prompts. There is no cheap, deterministic test for "did this prompt edit change the agent's behavior in a way the author didn't intend?"

## Desired behavior

A recorded-transcript replay harness: for each agent (`sdd-spec`, `sdd-triage` phases A/B/C, `sdd-execute`, `sdd-validate`, `sdd-review`), capture one or more canonical input contexts (the tracking issue body, the merged spec, the architecture record, or whatever else the agent reads) and the agent's actual output (the safe-outputs it emitted) from a real run. On every PR that touches the agent's `.md`, re-feed the captured input to the **current** prompt with the **same model and same MCP toolset**, and structurally diff the new output against the captured baseline.

The diff is structural, not byte-for-byte:

- Did the agent emit the same set of safe-output types (e.g. one `create-issue` and one `add-comment`)?
- Did the spec file the agent wrote contain the same set of section headers?
- Did the architecture record reference the same set of requirement IDs?
- Did the labels applied match the captured set?
- Do commit messages and PR titles conform to a structural template?

Behavior changes are flagged for human review on the PR; the test does not fail automatically. The reviewer either accepts the diff (which overwrites the baseline) or rejects it (which means reverting the prompt edit).
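A minimal sketch of what a structural diff along these lines could look like. Everything here is an assumption made for illustration: the safe-outputs are modeled as a list of dicts with a `type` key, requirement IDs are assumed to look like `REQ-12`, and the names `signature` and `structural_diff` are hypothetical. The commit-message and PR-title template check is omitted.

```python
import re
from collections import Counter

# Hypothetical formats: markdown ATX headers and "REQ-<number>" requirement IDs.
HEADER = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)
REQ_ID = re.compile(r"\bREQ-\d+\b")


def signature(safe_outputs, spec_text, arch_text, labels):
    """Reduce one run's output to the coarse structure that gets compared."""
    return {
        "output_types": Counter(o["type"] for o in safe_outputs),
        "spec_headers": set(HEADER.findall(spec_text)),
        "requirement_ids": set(REQ_ID.findall(arch_text)),
        "labels": set(labels),
    }


def structural_diff(baseline, current):
    """Return only the keys whose structure changed, with missing/added items."""
    diff = {}
    for key, base in baseline.items():
        cur = current[key]
        if base == cur:
            continue
        if isinstance(base, Counter):
            # Counter subtraction keeps only positive counts.
            diff[key] = {"missing": dict(base - cur), "added": dict(cur - base)}
        else:
            diff[key] = {"missing": sorted(base - cur), "added": sorted(cur - base)}
    return diff


outputs = [{"type": "create-issue"}, {"type": "add-comment"}]
baseline = signature(outputs, "## Goals\n## Constraints\n## Acceptance\n",
                     "Covers REQ-1 and REQ-2.", ["sdd:spec"])
current = signature(outputs, "## Goals\n## Acceptance\n",
                    "Covers REQ-1 and REQ-2.", ["sdd:spec"])
print(structural_diff(baseline, current))
```

Because only sets and counts are compared, rewording inside a section or a differently phrased comment produces an empty diff, while a dropped section header shows up under `missing`.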

## Implementation

- `tests/fixtures/prompt-replay/<agent>/<scenario>/`: each scenario contains `input/` (the captured context), `prompt.lock` (the snapshot of the prompt body used to produce the baseline), and `output/` (the captured safe-outputs).
- A new workflow `.github/workflows/prompt-replay.md` runs on PRs that modify `.github/workflows/*.md` and re-runs each affected agent against its fixtures. This **does** cost LLM tokens, but it costs them only when the prompt actually changes — a typical infrastructure PR does not trigger replay.
- A `scripts/prompt-replay-record.sh` helper to capture a new baseline from a recent real run.
- The workflow posts a diff comment on the PR. The reviewer can comment `/accept-replay` to overwrite the baseline in the same PR, or `/reject-replay` to mark the prompt edit for revert.
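The fixture layout above suggests a simple way to decide which scenarios to replay for a given PR. The sketch below assumes that an agent's name equals the stem of its workflow file (for example `sdd-spec.md` maps to `sdd-spec`); that mapping and the function names are hypothetical, not something the pipeline defines today.

```python
from pathlib import Path, PurePosixPath

WORKFLOW_DIR = PurePosixPath(".github/workflows")


def affected_agents(changed_files):
    """Map changed workflow .md paths to agent names (assumed: file stem)."""
    agents = set()
    for name in changed_files:
        path = PurePosixPath(name)
        if path.parent == WORKFLOW_DIR and path.suffix == ".md":
            agents.add(path.stem)
    return sorted(agents)


def scenarios_for(agent, fixtures_root="tests/fixtures/prompt-replay"):
    """List scenario directories that contain input/, output/ and prompt.lock."""
    root = Path(fixtures_root) / agent
    if not root.is_dir():
        return []
    return sorted(
        d for d in root.iterdir()
        if (d / "input").is_dir()
        and (d / "output").is_dir()
        and (d / "prompt.lock").is_file()
    )


changed = [".github/workflows/sdd-spec.md", "src/main.py", ".github/workflows/ci.yml"]
for agent in affected_agents(changed):
    print(agent, [s.name for s in scenarios_for(agent)])
```

An agent with no complete scenario directory yields an empty list, which a harness could treat as "nothing to replay" or as a missing-fixture warning.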

## Cost ceiling

- Trigger is path-filtered to workflow `.md` edits only (the prompt surface).
- Each agent has 1–2 fixtures; total replay cost per affected agent ≈ one E2E phase, ≈ $1–5.
- A typical prompt PR touches one agent: ≤ $10 per PR.
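The arithmetic behind the ceiling can be written down as a small check. This is only a sketch using the figures quoted above ($1 to $5 per affected agent, $10 per PR); the function name is hypothetical and real token prices would need to be measured.

```python
CEILING_USD = 10.0


def replay_cost_bounds(agents_touched, per_agent_usd=(1.0, 5.0)):
    """Return (low, high) estimated replay cost in USD for one PR."""
    low, high = per_agent_usd
    return agents_touched * low, agents_touched * high


for n in (1, 2, 3):
    low, high = replay_cost_bounds(n)
    status = "within" if high <= CEILING_USD else "may exceed"
    print(f"{n} agent(s): ${low:.0f}-${high:.0f}, {status} the ${CEILING_USD:.0f} ceiling")
```

Under these numbers a PR touching one or two agents stays within the ceiling, while a PR touching three or more may exceed it.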

## Acceptance

- PR that subtly edits `sdd-spec`'s prompt — removing one constraint sentence — triggers replay; the resulting spec drops a section the baseline includes; the diff comment names the missing section; reviewer accepts or rejects in-PR.
- PR that edits a workflow `.md` outside the prompt (e.g. frontmatter, secrets) does not trigger replay.
- A new agent added to the suite (e.g. `sdd-dispatch` from #81, `sdd-fastpath` from #82) gets a fixture set in the same PR that adds the agent.
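A path filter alone cannot tell a frontmatter edit from a prompt edit, since both change the same `.md` file. One possible way to meet the second criterion is to compare only the prompt body between the base and head versions of the file. The sketch below assumes the workflow source starts with a YAML frontmatter block delimited by `---` lines; `prompt_body` and `needs_replay` are hypothetical names.

```python
def prompt_body(markdown_text):
    """Return the text after a leading '---' ... '---' frontmatter block."""
    lines = markdown_text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).strip()
    # No frontmatter (or an unclosed block): treat the whole file as prompt.
    return markdown_text.strip()


def needs_replay(base_text, head_text):
    """Replay only when the prompt body differs between base and head."""
    return prompt_body(base_text) != prompt_body(head_text)


base = "---\non: pull_request\n---\n# Spec agent\nAlways cite requirement IDs.\n"
frontmatter_edit = "---\non: push\n---\n# Spec agent\nAlways cite requirement IDs.\n"
prompt_edit = "---\non: pull_request\n---\n# Spec agent\n"

print(needs_replay(base, frontmatter_edit))  # frontmatter-only change
print(needs_replay(base, prompt_edit))       # constraint sentence removed
```

The same comparison could be made against the `prompt.lock` snapshot in each fixture instead of the base branch, which would also detect a baseline that has fallen behind the prompt.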

## Out of scope

- Validating the agent's chain-of-thought (only the emitted safe-outputs are diffed).
- Catching model drift (Anthropic ships a Sonnet update, output changes a little). The structural diff is intentionally coarse to absorb that.
- Replaying the deterministic wrappers under `wrappers/` (no prompt, no value).

## References

- Layer 1 (`/e2e` dispatcher) is the orthogonal expensive test; Layer 3 is the cheap-when-relevant prompt test. Both have value.
- `.github/workflows/sdd-{spec,triage,execute-*,validate,review}.md`
- The transcript-capture pattern overlaps with `gh-aw`'s built-in run logs; investigate whether `gh aw run --capture` is the right primitive before building from scratch.
