State-externalizing research harness for spec-driven development: budgeted exploration, importance-tagged evidence curation, and adversarial claim verification — all persisted as files, not context.
Based on Harness-1 — "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses" (Jiang et al., arXiv:2606.02373) and its reference implementation pat-jj/harness-1. This community extension adapts the paper's harness design to Spec Kit workflows. It is not affiliated with the paper authors or GitHub.
The research that feeds /speckit.specify and /speckit.plan — exploring a
codebase, evaluating libraries, checking API behavior — is long-horizon work,
and the agent's conversation context is a terrible place to keep its state:
findings silently fall out of the window, searches get repeated, claims are
written into the plan without ever being checked, and a new session starts
from zero.
Harness-1's diagnosis is that this is a separation-of-concerns failure: the model is forced to do bookkeeping (tracking what was found, what it supports, what was verified, what is duplicate) with the same machinery it uses for semantic decisions (what to search, what to retain, when to stop). Its fix is a stateful harness that holds the working memory environment-side — a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed deduplicated observations — and renders the model only a compact, budget-aware slice.
This extension applies the same split to spec-driven development. Your coding agent is the policy; a set of per-feature markdown state files is the harness.
| Harness-1 (paper) | This extension |
|---|---|
| Environment-side working memory | specs/<feature>/harness/ state files |
| Candidate pool | candidates.md — deduplicated, append-only IDs |
| Importance-tagged curated set | curated.md — capped, critical→low tags, eviction policy |
| Compact evidence links | evidence.md — pointers + ≤25-word excerpts, never bulk content |
| Verification records | verification.md — verdict, method, confidence per claim |
| Compressed, deduplicated observations | observations.md — ≤3-line entries, dup-of marking |
| Budget-aware context rendering | /speckit.harness.status + slice-only loading in every command |
| Policy decides: search / retain / verify / stop | The agent's only jobs inside /speckit.harness.explore |
| Recoverable search state | Resume any session from files via /speckit.harness.status |
One thing to be precise about up front: the harness does not make your
agent's conversation context persist — nothing can, and every agent starts a
new session empty. What it does is make the conversation disposable:
everything worth keeping is written to files at the moment it is learned, and
/speckit.harness.status re-renders the working picture into any new session
in one step. State survives; context is rebuilt on demand.
What that buys you, concretely:
| With the harness | Without it — research state lives in the conversation |
|---|---|
| Session immortality — resume after a restart, compaction, window overflow, or crash with one status call | whatever fell out of the context window is gone for good |
| Token efficiency — each step re-renders a bounded slice (default cap: 4,000 tokens), never the full history | continuing means re-reading an ever-growing transcript |
| No repeated searches — the candidate pool dedups by source + topic; repeats are flagged dup-of | the same query gets re-run days later because nothing remembers it ever ran |
| Claims with verdicts — every load-bearing claim carries verified / refuted / unverifiable, plus method and confidence | "I checked that somewhere" hardens into the plan unexamined |
| Remembered dead ends — refuted claims stay on file, marked, never deleted | the same wrong conclusion gets re-derived in the next session |
| Bounded research — explicit budgets and a marginal-gain stop rule decide when exploration ends | research ends when the agent drifts or the context fills up |
| Auditable, shareable evidence — plain markdown: diffable, PR-reviewable, and a teammate (or their agent) can take over mid-research | the evidence behind a plan exists only inside one person's expired chat |
The tutorial's step 6 shows the resume scene in action.
Research Harness is listed in the Spec Kit community extension catalog.
Option 1 — by name, from the community catalog. Spec Kit treats the
community catalog as discovery-only by default, so allow installs from it once
(per project, or per user via ~/.specify/extension-catalogs.yml):
# .specify/extension-catalogs.yml
catalogs:
  - name: default
    url: https://raw.githubusercontent.com/github/spec-kit/main/extensions/catalog.json
    priority: 1
    install_allowed: true
  - name: community
    url: https://raw.githubusercontent.com/github/spec-kit/main/extensions/catalog.community.json
    priority: 2
    install_allowed: true

specify extension add harness

Option 2 — zero config, pinned version. Install straight from a release URL:
specify extension add harness --from https://github.com/formin/spec-kit-harness/archive/refs/tags/v1.0.0.zip

URL installs show an Untrusted Source warning and ask
Continue with installation? [y/N] — answer y (in a non-interactive shell, pipe it: echo y | specify extension add …). Catalog installs skip this prompt.
Option 3 — for development:
git clone https://github.com/formin/spec-kit-harness
specify extension add --dev ./spec-kit-harness

Verify:
specify extension list
# ✓ Research Harness (v1.0.0)
# harness
# Commands: 5 | Hooks: 2 | Priority: 10 | Status: Enabled

Requires Spec Kit >=0.2.0. Works with any agent Spec Kit supports (Claude
Code, GitHub Copilot, Cursor, Gemini CLI, …) — commands are plain prompt
files; no external tools, MCP servers, or network access required.
| Command | What it does | Touches disk |
|---|---|---|
| /speckit.harness.init [mission] [key=value…] | Create the six state files with budgets and stop conditions | creates harness/ (never overwrites) |
| /speckit.harness.explore [question] | Budgeted decide→act→bookkeep research loop | updates all state files |
| /speckit.harness.verify [targets] | Adversarial verification of load-bearing claims | verification.md, evidence.md, budget.md |
| /speckit.harness.status [full \| topic] | Compact snapshot + one recommended next action | read-only |
| /speckit.harness.report [scope] | Synthesize evidence into research.md with a coverage table | research.md only |
State lives in specs/<feature>/harness/ (or .specify/harness/global/ when
no feature directory exists).
The harness does not replace any core stage. It fills the research gap
between writing a spec and trusting a plan — and it hands its results to the
core flow through research.md, the exact artifact /speckit.plan's
Phase 0: Outline & Research generates and consumes.
/speckit.constitution project principles — no harness involvement
│
/speckit.specify ──▶ spec.md
│ └─ hook after_specify → /speckit.harness.init (optional prompt)
│ └─ creates specs/<feature>/harness/{budget, candidates,
│ curated, evidence, verification, observations}.md
│
│ ┌─ RESEARCH PHASE (harness home turf) ──────────────────────┐
│ │ /speckit.harness.explore budgeted evidence gathering │
│ │ ↳ appends: candidates · curated · evidence · │
│ │ observations · budget ledger (every iteration) │
│ │ /speckit.harness.verify check the spec's claims │
│ │ ↳ appends: verification records (+ evidence, ledger) │
│ │ /speckit.harness.report ─▶ research.md + coverage table │
│ │ ↳ the only harness write to a core artifact │
│ └────────────────────────────────────────────────────────────┘
│
/speckit.plan ──▶ plan.md Phase 0 starts from verified research.md
│ instead of one-shot, unverified research
│ └─ hook after_plan → /speckit.harness.verify (optional prompt)
│
/speckit.tasks ──▶ tasks.md gate: /speckit.harness.status shows no
│ unverified critical claims before this
/speckit.implement explore/status on demand for unknowns
discovered mid-implementation
Stage by stage:
| Core stage | Harness command | How it is used |
|---|---|---|
| /speckit.constitution | — | Not used. |
| Right after /speckit.specify | init (the after_specify hook offers it) | Creates the per-feature state files; the spec's open questions become the mission and get explicit budgets. |
| Between specify and plan | explore → verify → report | The main pass: gather evidence within budget, adversarially verify the spec's load-bearing claims, then write research.md with a requirement-coverage table. This is the deep, resumable replacement for plan's ad-hoc Phase 0 research. |
| Right after /speckit.plan | verify (the after_plan hook offers it) | Re-checks the plan's factual claims against primary sources. Refuted claims come back as suggested edits — apply them via /speckit.clarify or by hand before they harden into tasks. |
| Before /speckit.tasks | status | Go/no-go gate: the snapshot warns if any critical claim is still unverified or contradicted. |
| During /speckit.implement | explore / status as needed | Budget-boxed investigation of unknowns discovered mid-implementation; a fresh session resumes from files, not from a lost context window. |
| Any time | status | Read-only snapshot + exactly one recommended next action; the session-resume entry point. |
And when each harness file comes into existence, and who touches it afterward:
| File | Created at | Written during | Consumed by |
|---|---|---|---|
| budget.md | init — mission, budgets, stop conditions | every explore/verify action (ledger + action log row) | every command |
| candidates.md | init — empty pool | explore: one row per discovery, deduplicated | explore, status |
| curated.md | init — empty set | explore (promotions, evictions); verify (demotes refuted entries) | every command, report |
| evidence.md | init — empty | explore and verify: pointer entries | verify, report |
| verification.md | init — empty | verify: one row per checked claim | status (gate), report |
| observations.md | init — empty log | every explore/verify action: ≤3-line compressed entry | explore, status |
| research.md | report — the last harness step before /speckit.plan | re-running report (only between the harness:begin/end markers) | /speckit.plan Phase 0 |
All six state files exist from init onward — created once and never
clobbered (init is idempotent). status writes nothing, ever. research.md
is the one file born later, at report time — deliberately right before
planning consumes it.
Two rules keep the integration safe: the harness never edits
spec.md, plan.md, or tasks.md (corrections always flow back as suggested
edits), and the only core artifact it writes is research.md — between
<!-- harness:begin/end --> markers, preserving anything you wrote by hand.
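
The marker rule can be illustrated with a minimal Python sketch. The extension itself is a set of prompt files, so this is not its implementation; the exact marker spellings `<!-- harness:begin -->` and `<!-- harness:end -->` and the function name are assumptions made for the example.

```python
import re

# Assumed marker spellings; the README only names them as "harness:begin/end".
BEGIN = "<!-- harness:begin -->"
END = "<!-- harness:end -->"


def write_harness_section(existing: str, generated: str) -> str:
    """Replace the marked block in research.md, leaving all other text untouched."""
    block = f"{BEGIN}\n{generated.strip()}\n{END}"
    pattern = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)
    if pattern.search(existing):
        # A callable replacement avoids backslash interpretation in the block.
        return pattern.sub(lambda _match: block, existing, count=1)
    if not existing.strip():
        return block + "\n"
    # No markers yet: append the block after the hand-written content.
    return existing.rstrip("\n") + "\n\n" + block + "\n"
```

Calling it twice on the same document replaces only the marked block, so notes written above or below the markers survive a re-run of report.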
Run once per feature, ideally right after /speckit.specify (the
after_specify hook offers to do this for you).
/speckit.harness.init How is session state currently handled, and what are the revocation options?
- The free text becomes the mission — the question this harness exists to
  answer. You can add budget overrides inline: searches=50 inspections=60.
- Creates budget.md, candidates.md, curated.md, evidence.md, verification.md, observations.md under the feature's harness/ directory, with budgets from your config (defaults: 30 searches, 40 inspections, 20 verifications, curated cap 25).
- Idempotent: if a harness already exists it refuses to overwrite and shows status instead; a new mission passed as argument is appended to the mission list rather than replacing it.
/speckit.harness.explore What auth middleware exists and which routes bypass it?
(With no argument it uses the mission from budget.md.)
Each iteration the agent makes exactly one policy decision — SEARCH a
new query, INSPECT a known candidate, CURATE the curated set, or STOP —
then performs mandatory bookkeeping: log a ≤3-line compressed observation,
add deduplicated candidates, promote findings into the curated set with an
importance tag plus an evidence pointer, and account for the budget. The loop
stops on budget exhaustion, on the marginal-gain rule (3 consecutive actions
yielding no new curated evidence), or when the mission is answered.
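
The stop conditions above can be sketched in Python. This is an illustration only, not the extension's code: the class and function names are hypothetical, and it assumes SEARCH and INSPECT each cost one unit while CURATE is free.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    searches: int = 30
    inspections: int = 40


@dataclass
class LoopState:
    budget: Budget = field(default_factory=Budget)
    stale_actions: int = 0  # consecutive actions that added no curated evidence


# Which budget line each action draws from (assumption: CURATE is free).
COST = {"SEARCH": "searches", "INSPECT": "inspections"}


def record_action(state: LoopState, action: str, new_curated: int, window: int = 3):
    """Charge one action to the budget; return a stop reason or None to continue."""
    resource = COST.get(action)
    if resource is not None:
        remaining = getattr(state.budget, resource)
        if remaining <= 0:
            return "budget exhausted"
        setattr(state.budget, resource, remaining - 1)
    # Marginal-gain rule: reset on new curated evidence, otherwise count up.
    state.stale_actions = 0 if new_curated > 0 else state.stale_actions + 1
    if state.stale_actions >= window:
        return "marginal gain exhausted"
    return None
```

The third stop condition, the mission being answered, is a judgment the agent makes and is deliberately left out of the sketch.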
After a session, budget.md carries an auditable ledger like:
| Resource | Budget | Spent | Remaining |
|----------|-------:|------:|----------:|
| searches | 30 | 1 | 29 |
| inspections | 40 | 1 | 39 |
## Action log
| # | Action | Target | Cost | New evidence? |
|---|--------|--------|------|---------------|
| 1 | SEARCH | .specify tree + templates listing | 1 search | yes (C001, C002 → E001, E002) |
| 2 | INSPECT | .specify/templates/spec-template.md | 1 inspection | yes (E002 confirmed) |

and curated.md holds the working set, most important first:
| ID | Importance | Finding | Source candidate | Evidence |
|----|------------|---------|------------------|----------|
| E003 | critical | Templates ship bundled inside the specify-cli package; init needs no network access. | C003 | evidence.md#E003 |
| E001 | high | init scaffolds .specify/{templates,scripts,memory,…} and copies five artifact templates. | C001 | evidence.md#E001 |

/speckit.harness.verify # default: spec.md + plan.md + unverified critical curated entries
/speckit.harness.verify plan.md # narrow to one artifact
Extracts load-bearing factual claims ("X is handled by Y", "library Z supports W", "there is no existing V") and, for each one, tries to refute it against the primary source — never the curated summary. Every check leaves a durable row:
| ID | Claim | Method | Verdict | Confidence | Evidence | Date |
|----|-------|--------|---------|------------|----------|------|
| V001 | Templates are bundled in specify-cli; init performs no network fetch | Re-read `specify init --help`; cross-checked offline scaffold | verified | high | E001 | 2026-06-11 |

Refuted claims are demoted in curated.md (marked, never deleted — recorded
dead ends prevent re-deriving the same error) and reported back as concrete
suggested edits to spec.md/plan.md. The command does not edit your
artifacts itself.
/speckit.harness.status # compact snapshot
/speckit.harness.status full # 3× larger slices
/speckit.harness.status sessions # filter rows to a topic
Read-only and budget-free. This is the extension's take on Harness-1's "budget-aware context rendering": it shows the mission, remaining budgets, top curated entries, the open candidate frontier, refuted/unverified-critical warnings, and recent observations, then closes with exactly one recommended next action derived from the state, e.g.:
Recommendation: /speckit.harness.verify — 2 critical claims unverified and
12 verification budget remaining → verify before planning.
This is also the session-resume entry point: open a fresh agent session,
run /speckit.harness.status, and continue exactly where research stopped —
nothing depended on the old context window.
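
The unverified-critical warning can be pictured as a small check over two of the state files. The sketch below is hypothetical Python, not the command's actual logic; it assumes curated entries and verification rows have already been parsed into dictionaries, and that a claim is linked to a curated entry through the Evidence column of verification.md.

```python
def blocking_claims(curated, verifications):
    """Return (entry id, reason) for critical entries that should block planning."""
    verdicts_by_entry = {}
    for row in verifications:
        for entry_id in row["evidence"]:
            verdicts_by_entry.setdefault(entry_id, []).append(row["verdict"])

    blocking = []
    for entry in curated:
        if entry["importance"] != "critical":
            continue
        verdicts = verdicts_by_entry.get(entry["id"], [])
        if "refuted" in verdicts:
            blocking.append((entry["id"], "contradicted"))
        elif "verified" not in verdicts:
            blocking.append((entry["id"], "unverified"))
    return blocking
```

An empty result corresponds to the "proceed" case; anything else is what the snapshot would surface as a warning.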
/speckit.harness.report
Reads the full state (the only command that does), maps every requirement in
spec.md to its supporting evidence, and writes the feature's research.md
between <!-- harness:begin/end --> markers (hand-written sections outside
the markers are preserved). The coverage table makes evidence gaps visible
before /speckit.plan consumes the research:
### Requirement Coverage — 7/10
| Requirement | Status | Evidence | Verification |
|-------------|--------|----------|--------------|
| FR-001 Token revocation | covered-verified | E003, E007 | V002 (high) |
| FR-004 Admin audit log | covered-unverified | E011 | — |
| FR-006 SSO logout | uncovered | — | — |

Statuses: covered-verified / covered-unverified / contradicted /
uncovered. Contradictions and uncovered requirements come with suggested
follow-ups (fix the artifact, explore more, or carry the risk explicitly).
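
One way to read the four statuses is as a small decision rule per requirement. The Python below is a sketch under stated assumptions, not the report command's logic: it assumes a refuted verdict outranks a verified one, and that the header count includes both covered-verified and covered-unverified rows. The README specifies neither.

```python
def coverage_status(evidence_ids, verdicts):
    """Classify one requirement from its evidence links and verification verdicts."""
    if not evidence_ids:
        return "uncovered"
    if "refuted" in verdicts:
        return "contradicted"
    if "verified" in verdicts:
        return "covered-verified"
    return "covered-unverified"


def coverage_summary(requirements):
    """Return per-requirement statuses and a 'covered/total' header string."""
    statuses = {
        req_id: coverage_status(req["evidence"], req["verdicts"])
        for req_id, req in requirements.items()
    }
    covered = sum(status.startswith("covered") for status in statuses.values())
    return statuses, f"{covered}/{len(statuses)}"
```

Fed the three example rows above (FR-001 with a verified verdict, FR-004 with evidence but no verdict, FR-006 with nothing), the rule reproduces the statuses shown in the table.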
This tutorial walks through every core Spec Kit stage and shows exactly where the harness is used in each one. The app: a minimal single-user web todo list — add, complete, delete; todos must survive a restart.
specify init todo-app --integration claude # or your agent
cd todo-app
specify extension add harness # see Installation for the one-time catalog opt-in

/speckit.constitution Keep it simple: smallest stack that works, test-first,
and no claim enters a plan without a checked source.
The harness is not involved at this stage — but that last principle is precisely what it operationalizes from stage 3 onward.
/speckit.specify Single-user web todo app: add, complete, delete; todos must
survive an app restart; keyboard-friendly UI.
→ specs/001-todo-app/spec.md with FR-001..FR-005 and two open questions
(persistence mechanism? UI stack?). The after_specify hook fires:
Set up a state-externalized research harness for this feature? → yes
/speckit.harness.init Which persistence option and minimal web stack satisfy
the constitution? searches=15 inspections=20 verifications=8
→ specs/001-todo-app/harness/ is created, and budget.md now reads:
## Mission
1. Which persistence option and minimal web stack satisfy the constitution?
| Resource | Budget | Spent | Remaining |
| searches | 15 | 0 | 15 |

The harness's home turf: answer the spec's open questions before planning.
/speckit.harness.explore
A few iterations, each externalized to the state files as it happens:
SEARCH "sqlite vs json file vs localStorage, single-user persistence"
→ C001..C004 added to candidates.md
INSPECT C002 better-sqlite3 docs
→ curated E001 (high): "zero-config embedded DB; sync API fits a tiny app"
INSPECT C004 MDN localStorage
→ curated E002 (critical): "per-browser storage — todos would not survive
switching browsers; conflicts with FR-004 as written"
SEARCH "minimal node static file serving" → dup-of O-002, nothing new
STOP marginal gain exhausted · budget left: searches 12, inspections 17
Now verify the spec's load-bearing assumptions before they harden:
/speckit.harness.verify
| ID | Claim | Verdict | Confidence |
| V001 | A SQLite file survives app restart (FR-004) | verified | high |
| V002 | "Saving a JSON file is atomic" (spec note) | refuted — partial-write risk; use write-then-rename | high |

The refuted claim comes back as a suggested edit — apply it via
/speckit.clarify or by hand. Then bridge the results into the core flow:
/speckit.harness.report
→ specs/001-todo-app/research.md:
### Requirement Coverage — 5/5
| Requirement | Status | Evidence | Verification |
| FR-004 persist across restart | covered-verified | E001 | V001 (high) |

/speckit.plan SQLite via better-sqlite3, small Express server, vanilla JS frontend
Plan's Phase 0: Outline & Research finds research.md already populated
with verified, cited findings instead of re-deriving everything in one shot.
When the plan lands, the after_plan hook fires:
Verify plan claims against primary sources? → yes
/speckit.harness.verify plan.md
→ V003 "express.static serves the frontend with zero config" — verified (high)
/speckit.harness.status
→ 0 unverified critical claims · 1 corrected assumption (V002) already applied
→ Recommendation: proceed to /speckit.tasks
/speckit.tasks
/speckit.implement
Halfway through, a real unknown appears: should completed todos be deleted or archived, given the keyboard-undo flow? Box it instead of guessing:
/speckit.harness.explore Does undo require archived todos, or is hard delete enough? searches=5 inspections=5
→ E007 (medium): FR-003's undo wording implies soft-delete · 2 actions spent
Next morning you open a brand-new session. The previous conversation is gone — that is true of every agent, with or without this extension. The difference: the harness never kept the research state in the conversation in the first place, so nothing of value was lost. One command re-renders the full working picture from the files:
/speckit.harness.status
→ mission answered · budgets healthy · Recommendation: finish T012, T013
The files are the memory. A conversation context can die at any moment
(session restart, compaction, window overflow); the harness state survives
all of them, and /speckit.harness.status rebuilds your context from it in
one step.
| File | Role | Invariants |
|---|---|---|
| budget.md | Mission, budget ledger, stop conditions, action log | every budgeted action accounted |
| candidates.md | Everything discovered | dedup by source+topic; statuses new/inspected/curated/discarded |
| curated.md | What matters | hard cap; importance tags; refuted entries marked, not deleted |
| evidence.md | Where proof lives | pointers + locators; excerpts ≤ 25 words |
| verification.md | What was checked | verdict + method + confidence; primary sources only |
| observations.md | What happened | append-only; ≤3 lines each; duplicates flagged |
The files are ordinary markdown in your repo: diffable, reviewable in PRs, and shared by every agent and teammate working on the feature.
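
The candidates.md invariant (dedup by source + topic, append-only IDs) can be sketched as follows. This is hypothetical Python for illustration; in the extension the agent maintains the markdown table directly, and the case- and whitespace-insensitive key used here is an assumption about what counts as "the same" source and topic.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


class CandidatePool:
    """Append-only pool: a repeat returns the existing ID instead of a new row."""

    def __init__(self):
        self._id_by_key = {}
        self.rows = []

    def add(self, source: str, topic: str):
        key = (normalize(source), normalize(topic))
        if key in self._id_by_key:
            return self._id_by_key[key], True  # duplicate of an earlier row
        candidate_id = f"C{len(self.rows) + 1:03d}"
        self._id_by_key[key] = candidate_id
        self.rows.append(
            {"id": candidate_id, "source": source, "topic": topic, "status": "new"}
        )
        return candidate_id, False
```

The boolean in the return value is what a caller would use to log a dup-of observation rather than spend budget on a repeat.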
Light vs. deep research — budgets are per-session levers:
/speckit.harness.init quick sanity check on the websocket layer searches=8 inspections=10 verifications=5
/speckit.harness.init full audit of the billing pipeline searches=60 inspections=80 verifications=40
Several questions, one harness — running init again with a new question
appends it as mission #2 instead of clobbering state; report covers all
missions.
Verify-before-plan gate — accept the after_plan hook (or run
/speckit.harness.verify manually) so every load-bearing claim in plan.md
has a verdict before /speckit.tasks turns it into work items.
Team workflow — commit harness/ with the feature branch. Reviewers see
why the plan believes what it believes (evidence pointers + verdicts), and
a teammate's agent can pick up the research mid-flight via
/speckit.harness.status.
Token discipline — every command except report loads slices, never full
files, within the context_tokens render cap. Research capacity is then
bounded by file size, which is effectively unlimited, rather than by
conversation length.
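
Slice-only loading can be sketched as ranking curated entries by importance and cutting off at a row limit and a token cap. The Python below is illustrative only: the four-level importance scale is inferred from the examples in this README, and the four-characters-per-token estimate is a crude assumption, not how any agent counts tokens.

```python
# Assumed ordering of the importance tags, most important first.
IMPORTANCE_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def render_slice(entries, max_rows=15, token_cap=4000):
    """Return the top entries as lines, stopping at the row or token limit."""
    ranked = sorted(entries, key=lambda e: IMPORTANCE_ORDER[e["importance"]])
    lines, used = [], 0
    for entry in ranked[:max_rows]:
        line = f'{entry["id"]} ({entry["importance"]}): {entry["finding"]}'
        cost = max(1, len(line) // 4)  # rough token estimate (assumption)
        if used + cost > token_cap:
            break
        lines.append(line)
        used += cost
    return lines
```

Because the output size depends only on `max_rows` and `token_cap`, the rendered slice stays bounded no matter how large curated.md grows.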
Both optional (you are prompted):
- after_specify → speckit.harness.init — set up the harness when a spec is created.
- after_plan → speckit.harness.verify — verify the plan's claims before they harden into tasks.
Copy config-template.yml to
.specify/extensions/harness/harness-config.yml and adjust budgets, the
curated-set cap, slice sizes, state location, and stop conditions:
budget:
  searches: 30
  inspections: 40
  verifications: 20
  context_tokens: 4000
curation:
  max_curated: 25
  evict_policy: lowest-importance-first
rendering:
  candidates_slice: 10
  curated_slice: 15
  observations_slice: 8

Precedence (lowest → highest): extension defaults → config file →
SPECKIT_HARNESS_* environment variables → per-invocation key=value
arguments to init.
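
The precedence chain can be sketched as successive overrides. This Python is a hypothetical illustration, not the extension's resolver: the README only states the `SPECKIT_HARNESS_*` prefix, so variable names such as `SPECKIT_HARNESS_SEARCHES` are assumed for the example.

```python
# Defaults as listed in this README.
DEFAULTS = {"searches": 30, "inspections": 40, "verifications": 20}


def resolve_budgets(config_file, environ, cli_args):
    """Apply overrides lowest to highest: defaults, config, environment, arguments."""
    resolved = dict(DEFAULTS)
    resolved.update(config_file)
    for key in DEFAULTS:
        env_name = f"SPECKIT_HARNESS_{key.upper()}"  # assumed naming scheme
        if env_name in environ:
            resolved[key] = int(environ[env_name])
    for arg in cli_args:
        key, _, value = arg.partition("=")
        if key in DEFAULTS and value.isdigit():
            resolved[key] = int(value)  # per-invocation key=value wins
    return resolved
```

For example, a config file setting `searches: 50`, an environment value of 12, and an inline `searches=5` resolve to 5, because the per-invocation argument is applied last.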
init says the harness already exists. By design — it never overwrites.
To start over, delete the feature's harness/ directory yourself; to add a
question, pass it to init and it is appended as a new mission.
No specs/ feature directory yet? Commands fall back to
.specify/harness/global/. State written there stays useful for any feature;
report writes research.md next to it.
Exploration keeps stopping with "marginal gain exhausted". That is the
Harness-1 stop rule working: 3 consecutive actions added nothing new to the
curated set. Sharpen the question (/speckit.harness.explore <narrower question>)
or raise the window in stop_conditions.marginal_gain_window.
A claim came back refuted — now what? The verify report includes the
suggested artifact edit. Apply it (or run /speckit.clarify), keep the
verification record as-is; the recorded dead end is what stops the error from
coming back.
Does it call any external services? No. The commands are prompt files;
all "infrastructure" is markdown in your repo. The only network access in the
whole lifecycle is your own specify extension add --from <url> download.
Install prompt blocks my CI. specify extension add --from <url> asks
Continue with installation? [y/N] on untrusted URLs; pipe the answer in
non-interactive shells: echo y | specify extension add harness --from <url>.
See docs/concepts.md for the full design mapping and the deliberate differences from the paper.
MIT © 2026 formin
Credits: Harness-1 by Pengcheng Jiang et al. (arXiv:2606.02373, pat-jj/harness-1); Spec Kit by GitHub.