Problem
Real end-to-end testing of the SDD pipeline today requires running the full workflow against a consumer repository: open an issue, watch sdd-spec run, merge a PR, watch sdd-triage run, etc. Done manually on a real repo, this is expensive (LLM tokens + Actions minutes) and slow (a single feature spans hours). There is no automated cadence that exercises the pipeline end-to-end before changes ship.
Desired behavior
A dedicated spectacles-staging repository, owned by the same org, serves as the E2E target. A new workflow in spectacles, .github/workflows/e2e-dispatch.md, triggers on:
- An /e2e comment on a tracking issue or PR in the spectacles repo, from a write-access author. An optional scenario-name argument selects which fixture to run.
- A nightly schedule: at a low-traffic hour, running the default scenario.
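The comment trigger could be recognized with a small parser along these lines. This is only a sketch: the helper name parse_e2e_command, the choice of happy-path-feature as the default scenario, and the assumption that scenario names are lowercase and hyphen-separated are all illustrative and not specified above. The write-access check on the comment author is a separate step and is not shown.

```python
import re
from typing import Optional

# Assumption: the proposal does not say which scenario is the default.
DEFAULT_SCENARIO = "happy-path-feature"

# Assumption: scenario names are lowercase alphanumerics joined by hyphens.
# The command must sit alone on its own line of the comment.
_E2E_COMMAND = re.compile(
    r"^/e2e(?:[ \t]+([a-z0-9][a-z0-9-]*))?[ \t]*\r?$",
    re.MULTILINE,
)


def parse_e2e_command(comment_body: str) -> Optional[str]:
    """Return the requested scenario, the default if no argument was
    given, or None if the comment contains no /e2e command."""
    match = _E2E_COMMAND.search(comment_body)
    if match is None:
        return None
    return match.group(1) or DEFAULT_SCENARIO
```

Returning None for comments without the command lets the workflow exit early before touching the staging repo.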
The workflow:
- Resets the staging repo to a known commit (the lock files from the spectacles PR being tested or main).
- Compiles and pushes the current spectacles workflows to the staging repo via gh aw deploy (or equivalent).
- Opens a synthetic tracking issue in the staging repo from a fixture file under tests/fixtures/e2e/<scenario>/issue.md. Bodies are deliberately tiny (~30 tokens) to bound cost.
- Polls the staging repo for lifecycle progress with a configurable timeout (default 30 minutes). Lifecycle transitions are observed via the sdd:* labels.
- Asserts on terminal state: tracking issue reaches sdd:done, the expected artifact files exist (docs/specs/.../*.md, decisions/*.md if the scenario expects an ADR), one implementation PR was opened and merged.
- Posts a summary comment back on the originating spectacles PR / a nightly issue: scenario, duration, token cost (from the lock file metadata), final state, any assertion failures.
- Tags the staging repo state for post-mortem if assertions fail; cleans up on success.
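The polling step above might look roughly like the following. It is a minimal sketch, not the proposed implementation: wait_for_terminal_label is a hypothetical helper, the 30-second poll interval is an assumption, and the label fetcher is injected as a callable (in practice it would wrap a GitHub API call that lists the tracking issue's labels) so the loop can be exercised without network access.

```python
import time
from typing import Callable, Iterable, Optional

# Labels named in this proposal that end a scenario.
TERMINAL_LABELS = frozenset({"sdd:done", "needs-human"})


def wait_for_terminal_label(
    fetch_labels: Callable[[], Iterable[str]],
    timeout_s: float = 30 * 60,   # default timeout from the proposal
    interval_s: float = 30.0,     # assumption: poll interval is not specified
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the tracking issue carries a terminal label.

    Returns the terminal label that was reached, or None on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        reached = sorted(set(fetch_labels()) & TERMINAL_LABELS)
        if reached:
            return reached[0]
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Injecting clock and sleep keeps the loop deterministic under test; a None result maps naturally to a timeout entry in the summary comment.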
Scenarios
Initial fixture set, one fixture per scenario:
- happy-path-feature: the canonical flow (open feature → spec → architecture → plan → execute → merge → done), using the smallest possible feature body that still produces meaningful artifacts.
- happy-path-bug: the same flow, started from the kind:bug template.
- revise-loop-spec: open feature → spec PR → /revise <note> → spec PR updated → merge → continues.
- needs-human-handoff: the feature body is deliberately ambiguous; asserts that the agent escalates with needs-human and that the workflow halts without erroring.
Each scenario is one LLM run end-to-end and costs a few dollars. Total nightly cost is bounded by the scenario count.
Implementation
- tests/fixtures/e2e/<scenario>/{issue.md,expectations.yml}: the issue body and the assertion contract.
- .github/workflows/e2e-dispatch.md (+ lock): a non-agentic workflow (pure GitHub Actions, no gh-aw agent step) that orchestrates the scenario.
- A small scripts/e2e-assert.py that walks the staging repo's state via the GitHub API and checks expectations.yml.
- A one-time setup script, scripts/e2e-setup-staging.sh, that provisions the staging repo with the required secrets, branch protections turned off, etc. Documented but not run by CI.
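As one possible shape for the assertion contract, the sketch below compares an already-collected snapshot of the staging repo against an already-parsed expectations.yml. The schema is an assumption: the keys final_label, required_files, and merged_prs, and the ObservedState type, are hypothetical names chosen for illustration; the proposal does not define the file's format. YAML loading and the GitHub API calls are deliberately left out.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Any, Dict, List


@dataclass
class ObservedState:
    """Snapshot of the staging repo after a scenario (hypothetical shape)."""
    labels: List[str]          # labels on the tracking issue
    files: List[str]           # repo-relative paths on the default branch
    merged_pr_count: int       # implementation PRs opened and merged


def check_expectations(expected: Dict[str, Any], observed: ObservedState) -> List[str]:
    """Return a list of human-readable failures; empty means the scenario passed."""
    failures: List[str] = []

    final_label = expected.get("final_label")
    if final_label and final_label not in observed.labels:
        failures.append(f"expected label {final_label!r}, found {sorted(observed.labels)}")

    # fnmatchcase's '*' also matches '/', so 'docs/specs/*.md' covers nested dirs.
    for pattern in expected.get("required_files", []):
        if not any(fnmatchcase(path, pattern) for path in observed.files):
            failures.append(f"no file matches {pattern!r}")

    want_prs = expected.get("merged_prs")
    if want_prs is not None and observed.merged_pr_count != want_prs:
        failures.append(f"expected {want_prs} merged PR(s), found {observed.merged_pr_count}")

    return failures
```

Returning every failure rather than stopping at the first one lets the summary comment list all broken assertions in a single run.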
Cost ceilings
- Default scenario timeout: 30 minutes per scenario.
- Default max-parallel across scenarios: 2 (a nightly run executes scenarios serially with no fan-out unless explicitly raised).
- Token budget: each fixture body is ≤ 50 tokens; the agent's own context is bounded by its existing limits. A typical scenario costs less than $5 in LLM tokens.
- Off switch: a SPECTACLES_E2E_DISABLED repo variable shuts down the nightly schedule without a workflow edit.
Acceptance
- /e2e happy-path-feature from a write-access author on a spectacles PR posts back a summary comment within 30 minutes naming the staging-repo run, the final state (sdd:done), and a token-cost estimate.
- A spectacles PR that breaks sdd-spec's spec-file-write step fails the nightly run; the failure comment names the broken phase and links the staging-repo logs.
- The needs-human scenario succeeds when the agent applies the label and stops; it fails if the agent powers through without escalating.
- Cost telemetry: each run's cost (Actions minutes + estimated tokens) is posted in the summary; nightly budget is enforceable by reading the recent runs.
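A budget check over recent runs could be as simple as the sketch below. The RunCost record and the within_budget helper are hypothetical; how the per-run figures are extracted from the posted summaries, and what the actual budget limits are, is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunCost:
    """Cost figures for one scenario run (hypothetical record shape)."""
    scenario: str
    actions_minutes: float
    estimated_token_usd: float


def within_budget(
    runs: Iterable[RunCost],
    max_token_usd: float,
    max_actions_minutes: float,
) -> bool:
    """True if the summed cost of the given runs stays within both limits."""
    runs = list(runs)  # allow generators; we iterate twice
    total_usd = sum(r.estimated_token_usd for r in runs)
    total_minutes = sum(r.actions_minutes for r in runs)
    return total_usd <= max_token_usd and total_minutes <= max_actions_minutes
```

A nightly job could call this with the previous night's runs and skip the schedule, or flag the nightly issue, when it returns False.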
Out of scope
- Cross-repo scenarios (the repo: seam is unexercised here).
- Performance / load testing.
- Scenarios that require human merge decisions (every scenario is autonomous from open through done; revise loops are simulated by a workflow step that comments /revise on schedule).
References
- mcp-smoke.lock.yml: pattern for a workflow that talks to external services from CI.