test (layer 1): /e2e staging-repo dispatcher for nightly + on-demand end-to-end runs #89

@norrietaylor

Problem

Real end-to-end testing of the SDD pipeline today requires running the full workflow against a consumer repository: open an issue, watch sdd-spec run, merge a PR, watch sdd-triage run, etc. Done manually on a real repo, this is expensive (LLM tokens + Actions minutes) and slow (a single feature spans hours). There is no automated cadence that exercises the pipeline end-to-end before changes ship.

Desired behavior

A spectacles-staging repository, owned by the same org, serves as the dedicated E2E target. A new workflow in spectacles, .github/workflows/e2e-dispatch.md, triggers on:

  • /e2e comment on a tracking issue or PR in the spectacles repo, from a write-access author. Optional scenario name argument selects which fixture to run.
  • Nightly schedule: at a low-traffic hour, default scenario.
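
A minimal sketch of how the comment trigger could be parsed, assuming the workflow receives the comment body and the author's association from the event payload. The function name, the default scenario, and the use of author association as a stand-in for write access are all assumptions for illustration; a real check would likely query the collaborator permission level instead.

```python
import re
from typing import Optional

# Hypothetical default: the issue does not name the default scenario.
DEFAULT_SCENARIO = "happy-path-feature"

# Assumption: these associations approximate "write-access author".
WRITE_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}

# "/e2e" alone on a line, optionally followed by a scenario name.
# The trailing \r? tolerates CRLF line endings in comment bodies.
_E2E_RE = re.compile(
    r"^/e2e(?:[ \t]+(?P<scenario>[a-z0-9][a-z0-9-]*))?[ \t]*\r?$",
    re.MULTILINE,
)


def parse_e2e_command(body: str, author_association: str) -> Optional[str]:
    """Return the scenario to run, or None if the comment should be ignored."""
    if author_association not in WRITE_ASSOCIATIONS:
        return None
    match = _E2E_RE.search(body)
    if match is None:
        return None
    return match.group("scenario") or DEFAULT_SCENARIO
```

Requiring the command to sit alone on its own line keeps prose that merely mentions /e2e from triggering a run.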

The workflow:

  1. Resets the staging repo to a known commit (the lock files from the spectacles PR under test, or from main).
  2. Compiles and pushes the current spectacles workflows to the staging repo via gh aw deploy (or equivalent).
  3. Opens a synthetic tracking issue in the staging repo from a fixture file under tests/fixtures/e2e/<scenario>/issue.md. Bodies are deliberately tiny (~30 tokens) to bound cost.
  4. Polls the staging repo for lifecycle progress with a configurable timeout (default 30 minutes). Lifecycle transitions are observed via the sdd:* labels.
  5. Asserts on terminal state: tracking issue reaches sdd:done, the expected artifact files exist (docs/specs/.../*.md, decisions/*.md if the scenario expects an ADR), one implementation PR was opened and merged.
  6. Posts a summary comment back on the originating spectacles PR or, for nightly runs, on a nightly issue: scenario, duration, token cost (from the lock file metadata), final state, any assertion failures.
  7. Tags the staging repo state for post-mortem if assertions fail; cleans up on success.
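
The polling in step 4 could look roughly like the sketch below. It assumes the caller supplies a `fetch_labels` callable that returns the tracking issue's current labels (for example, from the GitHub API). The function name, the 30-second interval, and the injectable `clock` and `sleep` parameters are illustrative choices, not part of the proposal.

```python
import time
from typing import Callable, FrozenSet, Iterable, List


def wait_for_terminal_label(
    fetch_labels: Callable[[], Iterable[str]],
    terminal: FrozenSet[str] = frozenset({"sdd:done"}),
    timeout_s: float = 30 * 60,   # default timeout from the issue: 30 minutes
    interval_s: float = 30.0,     # assumption: polling interval is not specified
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Poll until a terminal label appears; return the history of sdd:* label sets.

    Raises TimeoutError if no terminal label is seen before the deadline.
    """
    deadline = clock() + timeout_s
    history: List[List[str]] = []
    while True:
        sdd = sorted(name for name in fetch_labels() if name.startswith("sdd:"))
        # Record only transitions, so the history reads as a lifecycle trace.
        if not history or history[-1] != sdd:
            history.append(sdd)
        if terminal.intersection(sdd):
            return history
        if clock() >= deadline:
            raise TimeoutError(f"no terminal label after {timeout_s:.0f}s; last seen: {sdd}")
        sleep(interval_s)
```

Returning the transition history rather than a bare boolean gives the summary comment something to report when a run stalls partway through the lifecycle.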

Scenarios

Initial fixture set, one fixture per scenario:

  • happy-path-feature — the canonical flow: open feature → spec → architecture → plan → execute → merge → done. The smallest possible feature body that still produces meaningful artifacts.
  • happy-path-bug — same flow from the kind:bug template.
  • revise-loop-spec — open feature → spec PR → /revise <note> → spec PR updated → merge → continues.
  • needs-human-handoff — feature body deliberately ambiguous, asserts the agent escalates needs-human and the workflow halts without erroring.

Each scenario is one LLM run end-to-end and costs a few dollars. Total nightly cost is bounded by the scenario count.

Implementation

  • tests/fixtures/e2e/<scenario>/{issue.md,expectations.yml}. Issue body and the assertion contract.
  • .github/workflows/e2e-dispatch.md (+ lock). A non-agentic workflow (pure GitHub Actions, no gh-aw agent step) that orchestrates the scenario.
  • A small scripts/e2e-assert.py that walks the staging repo's state via the GitHub API and checks expectations.yml.
  • One-time setup script in scripts/e2e-setup-staging.sh to provision the staging repo with the required secrets, branch protections turned off, etc. Documented but not run by CI.
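
As a rough illustration of the assertion contract, the sketch below checks a snapshot of staging-repo state against an expectations mapping. The schema of expectations.yml is not defined in this issue, so the keys `final_label`, `required_paths`, and `merged_prs`, as well as the `StagingState` type, are hypothetical. Gathering the snapshot from the GitHub API and loading the YAML are left to the caller.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Any, Dict, List


@dataclass
class StagingState:
    """Snapshot of the staging repo, collected by the caller (hypothetical shape)."""
    issue_labels: List[str]
    file_paths: List[str]
    merged_pr_count: int


def check_expectations(expect: Dict[str, Any], state: StagingState) -> List[str]:
    """Return human-readable failures; an empty list means the scenario passed."""
    failures: List[str] = []

    final_label = expect.get("final_label")
    if final_label and final_label not in state.issue_labels:
        failures.append(f"tracking issue is missing label {final_label!r}")

    for pattern in expect.get("required_paths", []):
        # fnmatchcase's "*" also matches "/", so a pattern can span directories.
        if not any(fnmatchcase(path, pattern) for path in state.file_paths):
            failures.append(f"no file matches {pattern!r}")

    want_prs = expect.get("merged_prs")
    if want_prs is not None and state.merged_pr_count != want_prs:
        failures.append(f"expected {want_prs} merged PR(s), found {state.merged_pr_count}")

    return failures
```

Collecting every failure instead of stopping at the first one lets the summary comment list all broken assertions in a single run.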

Cost ceilings

  • Default scenario timeout: 30 minutes per scenario.
  • Default max-parallel across scenarios: 2 (a nightly run executes scenarios serially with no fan-out unless explicitly raised).
  • Token budget: each fixture body is ≤ 50 tokens; the agent's own context is bounded by its existing limits. A typical scenario costs less than $5 in LLM tokens.
  • Off switch: a SPECTACLES_E2E_DISABLED repo variable shuts down the nightly schedule without a workflow edit.
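
The off switch could be read with a small guard like the one below. It assumes the repo variable is exposed to the script as an environment variable of the same name, and that values such as "true" or "1" mean disabled. The issue specifies neither detail, and the function name is illustrative.

```python
import os
from typing import Mapping, Optional

# Assumption: which values count as "disabled" is not specified in the issue.
_TRUTHY = {"1", "true", "yes", "on"}


def nightly_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when SPECTACLES_E2E_DISABLED is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get("SPECTACLES_E2E_DISABLED", "")
    return value.strip().lower() not in _TRUTHY
```

Treating an unset or unrecognized value as enabled means a typo in the variable does not silently stop nightly coverage. The opposite default is equally defensible if cost control matters more.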

Acceptance

  • /e2e happy-path-feature from a write-access author on a spectacles PR posts back a summary comment within 30 minutes naming the staging-repo run, the final state (sdd:done), and a token-cost estimate.
  • A spectacles PR that breaks sdd-spec's spec-file-write step fails the nightly run; the failure comment names the broken phase and links the staging-repo logs.
  • The needs-human scenario succeeds when the agent applies the label and stops; it fails if the agent powers through without escalating.
  • Cost telemetry: each run's cost (Actions minutes + estimated tokens) is posted in the summary, so a nightly budget can be enforced by reading the costs of recent runs.
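
One way the budget check might work, sketched under the assumption that each run's cost has already been parsed out of its summary. The `RunCost` type, the function name, and the ceiling parameters are hypothetical; the issue does not fix any budget values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RunCost:
    """Cost figures for one scenario run (hypothetical shape)."""
    scenario: str
    actions_minutes: float
    token_usd: float


def budget_violations(
    runs: Iterable[RunCost],
    max_minutes: float,
    max_token_usd: float,
) -> List[str]:
    """Sum recent run costs and describe any ceiling that was exceeded."""
    runs = list(runs)
    total_minutes = sum(r.actions_minutes for r in runs)
    total_usd = sum(r.token_usd for r in runs)

    violations: List[str] = []
    if total_minutes > max_minutes:
        violations.append(
            f"Actions minutes {total_minutes:.1f} exceed ceiling {max_minutes:.1f}"
        )
    if total_usd > max_token_usd:
        violations.append(
            f"token cost ${total_usd:.2f} exceeds ceiling ${max_token_usd:.2f}"
        )
    return violations
```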

Out of scope

  • Cross-repo scenarios (the repo: seam is unexercised here).
  • Performance / load testing.
  • Scenarios that require human merge decisions (every scenario is autonomous from open through done; revise loops are simulated by a workflow step that comments /revise on schedule).

References
