feat: v4.0 — pipeline simplification + per-task adversarial review#19
Merged
Collapse the four-skill pipeline into two skills (execute + audit) by absorbing decomposition, archetype detection, plan-mode marking, and the user approval gate into a new Phase 0 of execute. Auto-chain to superpowers:writing-plans when no plan file is available.
Adds the previously-uncaptured edits in execute's own supporting files: coordination-patterns.md (9 Phase 2 references), communication-protocol.md (Plan Stage Messages section + PLAN_REVIEW subsection), spawn-templates.md (plan-mode marking attribution). Adds shared-doc edits in workspace-templates (Plan Audit Result table removal, phase checklist), team-archetypes (10+ phase references), teammate-roles, and custom-roles. Establishes a phase numbering convention table mapping the dropped phases to Phase 0 sub-steps. Restructures the implementation order into 6 staged groups (30 steps).
…apse to v4 spec
Bundles two major design additions into v4.0:
1. Per-task ephemeral Executor/Reviewer/Challenger pipeline replacing the current per-task light review. Each task now goes through 3 sequential stages with shutdown between each. Adds CHALLENGE_REVIEW message format, reviews/ workspace subdirectory, and impact_files / review_status / challenge_status fields on task-graph.json.
2. Role taxonomy collapse from 13 specialized templates to 3 universal roles (Executor / Reviewer / Challenger) with archetype-specific prompt extensions. Each archetype (Implementation/Research/Audit/Planning) maps the same role triple onto archetype-appropriate work.

Audit deep code review reduced to integration-only since per-task Challenger covers within-task adversarial review. Implementation order expanded to 9 stages, 37 steps. Migration guide updated for custom-roles porting and v3.x workspace forward-compat. Risk catalog expanded for spawn cost, reviewer/challenger bottlenecks, and grep-based impact-analysis limits.
Audit ran against the actual codebase. Found and resolved 8 gaps and 2 new findings the spec had missed:
- README scope was 5x larger than spec claimed (Teammate Roles table at lines 188-202 + Team Types table at 207-216 + 80-line Quickstart demo all need rewrite)
- CLAUDE.md "Adding a New Teammate Role" common-task guide assumes the legacy 13-role taxonomy and needs full rewrite
- plan-mode-protocol.md needs 4 sections updated (not just Ownership Boundary): Activation, Spawn Directive (ephemeral semantics), Revision Limits (per-task model), Workspace Tracking (Plan Proposals semantics)
- prior-context-loading.md move is not pure: 3 inline content updates needed
- custom-roles.md is a near-total rewrite (single-role template → 3 extension templates; 36-line example rewritten; new Porting subsection)
- workspace-templates.md needs 2 more line updates (Stage enum, line 365)
- demo scripts (NEW finding) hardcode Phase 1/Phase 2/Implementer references
- validate-task-graph.sh schema flexibility (NEW finding) confirmed safe for the new task-graph.json fields; stale comment on line 46 is cosmetic

Audit Findings table added to spec for traceability. Implementation Order expanded from 37 to 40 numbered steps.
Bite-sized 25-task plan executing the spec at docs/specs/2026-05-09-agent-team-v4-refactor-design.md. Organized into 9 stages: pre-flight, file moves, support file edits, role taxonomy collapse, Phase 0 + Phase 4 rewrites, audit scope reduction, workspace schema updates, dead-skill deletion, tests + meta + docs, and validate. Each task has multiple steps with concrete edits (Find/Replace patterns and full content blocks for major rewrites). Every task ends with a verification step and a commit. Self-review at the end of the plan confirms spec coverage, no placeholders, and consistent naming.
Moves prior-context-loading.md, plan-mode-protocol.md, and plan-proposal-example.md into the execute skill. Content updates follow in subsequent commits.
Updates Activation, Ownership Boundary, Spawn Directive, Revision Limits, and Workspace Tracking sections to reflect v4.0's Phase 0 sub-steps and ephemeral per-task spawn pattern.
Renames 9 Phase 2 + 1 Phase 1 references. Adds Per-Task Pipeline Coordination section describing the default Executor → Reviewer → Challenger lifecycle. Updates Adversarial Review Rounds to reflect that adversarial review is now default behavior.
…le diagram + add TOC entry
Code review caught two issues:
- Bounds bullet listed 6 agents (incl. re-Reviewer) but lifecycle diagram skipped the re-review step. Diagram now shows fix-Executor → re-Reviewer/re-Challenger → final verdict explicitly.
- Per-Task Pipeline Coordination was missing from the Contents TOC. Added entry between Plan-Mode Coordination and Advanced Patterns.
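The spawn order described in these commits can be sketched as a small function. This is only an illustration, not the plugin's implementation: it assumes a fix cycle spawns all three of fix-Executor, re-Reviewer, and re-Challenger (which is what gives the 6-agent bound), and the `fix-executor` / `re-reviewer` / `re-challenger` identifiers are hypothetical names modeled on the `executor-task-N` style used in the demo scripts.

```python
def pipeline_agents(task_id: str, findings: bool) -> list[str]:
    """Return the ordered, one-at-a-time spawn sequence for a single task.

    Each agent is ephemeral: it shuts down before the next one spawns.
    """
    first_pass = ["executor", "reviewer", "challenger"]
    # Assumption: at most one fix cycle, which re-runs all three roles.
    fix_pass = ["fix-executor", "re-reviewer", "re-challenger"]
    roles = first_pass + (fix_pass if findings else [])
    return [f"{role}-task-{task_id}" for role in roles]


clean_run = pipeline_agents("1", findings=False)
fixed_run = pipeline_agents("2", findings=True)
print(clean_run)
print(fixed_run)
```

A task with no findings spawns three agents; a task that needs its single fix cycle spawns six, after which the final verdict is recorded.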
…_REVIEW
Renames the Plan Stage Messages section to Research Messages (FINDING and ANALYSIS are now reusable across archetypes during execute, not plan-stage-specific). Adds CHALLENGE_REVIEW message format for the new per-task Challenger role. Drops PLAN_REVIEW (no plan-reviewer agent in v4.0).
Code Review Messages H2 was already in the file but never listed in the Contents block. Adding the entry makes CHALLENGE_REVIEW (newly added in the prior commit) discoverable via TOC navigation.
Replaces specialized templates (Implementer, Migrator, Researcher, Tester, Auditor, Planner, Writer, Strategist, etc.) with Executor / Reviewer / Challenger templates parameterized by archetype directives (Implementation, Research, Audit, Planning). Reviewer template includes the impact-analysis directive (grep call sites, glob importers, populate impact_files in task-graph.json). Challenger template includes the rules-compliance + missed-issues directive.
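To make the impact-analysis directive concrete, here is a minimal sketch of what "grep call sites, populate impact_files" could amount to. It is an assumption-laden illustration: the real Reviewer is an agent following a prompt, not this function, and `find_impact_files`, the `*.py` glob, and the three sample files are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def find_impact_files(root: Path, symbol: str, glob: str = "*.py") -> list[str]:
    """List files under root that mention symbol as a whole word (like grep -w)."""
    word = re.compile(rf"\b{re.escape(symbol)}\b")
    hits = []
    for path in sorted(root.rglob(glob)):
        if path.is_file() and word.search(path.read_text(errors="ignore")):
            hits.append(path.relative_to(root).as_posix())
    return hits


# Hypothetical project: one definition, one call site, one near-miss name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth.py").write_text("def parse_token(raw):\n    return raw.strip()\n")
    (root / "api.py").write_text("from auth import parse_token\nparse_token(' x ')\n")
    (root / "util.py").write_text("def parse_tokens(items):\n    return items\n")
    found = find_impact_files(root, "parse_token")

# The Reviewer would record the result on the task entry in task-graph.json.
task = {"subject": "Harden token parsing", "status": "in_review", "depends_on": []}
task["impact_files"] = found
print(task["impact_files"])
```

A textual search like this only finds literal mentions, so dynamically constructed references would be missed; that is the kind of limit the spec's risk catalog notes for grep-based impact analysis.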
…e only
Code review caught universal Reviewer template baking in impact-analysis
behavior that only makes sense for Implementation tasks. Moved the
directive (and the task-graph.json write permission) into the
Implementation Reviewer archetype directive block.
Also: removed unused {RESULTS_FORMAT} and {REPORT_FORMAT} placeholders
from the header note (never substituted anywhere in the file). Made
Challenger 'Impact files' field conditional. Softened universal
Executor 'owned area' rule to acknowledge archetype overrides.
Replaces 13-role catalog with Lead + 3 universal roles (Executor / Reviewer / Challenger) parameterized by archetype. Legacy role names preserved as colloquial archetype specializations in a lookup table. Notes hook agnosticism — no hook hardcodes role names, so the collapse has zero hook impact.
Code review caught 2 unlinked references in Selection Guide (team-archetypes guide, docs/custom-roles.md) — both now markdown links. Added pointer to error-recovery-protocol.md for the canonical recovery class definitions.
Replaces single-role custom template with three archetype-specific extension templates (Custom Executor / Reviewer / Challenger). Each extension layers directives on top of the standard prompt rather than defining a standalone role. Adds Database Migration Specialist example demonstrating the new pattern. Adds Porting v3.x custom roles section with three concrete migration examples (Implementer, Reviewer, Researcher patterns).
Replaces 10+ Phase 1 / Phase 2 references with Phase 0 sub-step references. Removes /agent-team:plan from the commands list. Adds Reviewer focus and Challenger focus subsections per archetype to guide what the per-task pipeline checks for each task type.
Phase 0 absorbs the work formerly done by skills/plan/ (resolution, prior-context loading, archetype detection, decomposition, plan-mode marking, user approval gate). Auto-chains to superpowers:writing-plans if no plan is found. Updates frontmatter description and trigger phrases to reflect the new entry-point role. Updates Preconditions to remove the require-approved-plan-in-workspace constraint.
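The three entry-point outcomes (plan found, plan path missing, no plan given) can be sketched as follows. This is a rough illustration under stated assumptions: how the skill actually tells a plan path from a free-text task is not described in this PR, so the `.md` suffix check is a guess, and `resolve_plan` and its return values are hypothetical.

```python
import tempfile
from pathlib import Path

WRITING_PLANS_SKILL = "superpowers:writing-plans"


def resolve_plan(argument: str, root: Path) -> tuple[str, str]:
    """Decide what Phase 0 does with the argument passed to execute."""
    # Assumption: anything ending in .md is treated as a plan file path.
    if argument.endswith(".md"):
        if not (root / argument).is_file():
            raise FileNotFoundError(f"plan file not found: {argument}")
        return ("decompose", argument)
    # No plan file supplied: hand off to the plan-writing skill first.
    return ("chain", WRITING_PLANS_SKILL)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs" / "plans").mkdir(parents=True)
    (root / "docs" / "plans" / "demo.md").write_text("# Demo plan\n")

    with_plan = resolve_plan("docs/plans/demo.md", root)
    without_plan = resolve_plan("add rate limiting", root)
    try:
        resolve_plan("docs/plans/nonexistent.md", root)
        missing_error = ""
    except FileNotFoundError as exc:
        missing_error = str(exc)

print(with_plan, without_plan, missing_error)
```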
Every task now spawns Executor → Reviewer → Challenger ephemerally. Reviewer performs grep-based impact analysis and populates impact_files in task-graph.json. Challenger deep-dives looking for missed issues. Each role shuts down before the next spawns. Adds reviews/ workspace subdirectory for per-task review artifacts. Updates task-graph.json schema usage with review_status and challenge_status fields. Replaces 'When chained via /agent-team:start' end-of-stage messaging with auto-chain to ../audit/SKILL.md. Also: cleans up 2 stale Phase 2 references (Worktree Isolation, Anti-Patterns) and updates the Overview paragraph for v4.0 framing.
Per-task Challengers now cover within-task adversarial review during execute Phase 4. Audit's Step 4 reduces to cross-task integration: read challenge docs for context, focus end-to-end review on impact_files overlapping multiple tasks, verify integrated build/test. Removes /agent-team:start references and legacy/manual workspace backward-compat framing.
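The "impact_files overlapping multiple tasks" scope can be illustrated with a short sketch. It assumes the task graph can be read as a mapping from task id to task object; the actual layout of task-graph.json is not shown in this PR, and `shared_impact_files` and the sample paths are hypothetical.

```python
from collections import defaultdict


def shared_impact_files(tasks: dict[str, dict]) -> dict[str, list[str]]:
    """Map each file touched by two or more tasks to the ids of those tasks."""
    owners: dict[str, list[str]] = defaultdict(list)
    for task_id, task in tasks.items():
        for path in task.get("impact_files", []):
            owners[path].append(task_id)
    return {path: ids for path, ids in sorted(owners.items()) if len(ids) > 1}


# Hypothetical graph: tasks 1 and 2 both reach into src/session.py.
tasks = {
    "1": {"subject": "Add login", "impact_files": ["src/auth.py", "src/session.py"]},
    "2": {"subject": "Add logout", "impact_files": ["src/session.py"]},
    "3": {"subject": "Update docs", "impact_files": ["README.md"]},
}
overlap = shared_impact_files(tasks)
print(overlap)
```

Files that only one task touched were already covered by that task's Challenger, so the integration review can concentrate on the entries returned here.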
Updates the audit-stage reviewer agent prompt: read per-task challenge docs for context, focus end-to-end review on impact_files overlapping multiple tasks, surface coverage gaps the per-task pipeline missed.
… challenge_status fields
Documents the new reviews/ subdirectory created during Phase 4 by Reviewer and Challenger. Adds three new task-graph.json fields: impact_files (Reviewer's grep-based impact area), review_status, challenge_status. Confirms validate-task-graph.sh tolerates the extension (only requires subject/status/depends_on). Updates phase checklist and Stage enum for v4.0 vocabulary. Removes Plan Audit Result template (the 7-check audit is dropped).
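As an illustration of why the new fields are safe, a validator that only checks for required keys passes a v4.0 task unchanged. The sketch below is a Python stand-in for that idea, not a port of validate-task-graph.sh; the sample task values and the `missing_required` helper are hypothetical.

```python
import json

# The only fields the validator is said to require.
REQUIRED_FIELDS = ("subject", "status", "depends_on")


def missing_required(task: dict) -> list[str]:
    """Return required fields that are absent or null; extra fields are ignored."""
    return [name for name in REQUIRED_FIELDS if task.get(name) is None]


v4_task = json.loads("""
{
  "subject": "Add retry to fetch client",
  "status": "in_review",
  "depends_on": [],
  "impact_files": ["src/client.py", "src/retry.py"],
  "review_status": "approved",
  "challenge_status": "approved_with_findings"
}
""")
broken_task = {"subject": "No status or dependencies"}

print(missing_required(v4_task))
print(missing_required(broken_task))
```

Because nothing inspects unknown keys, the v4.0 fields pass through, while a task missing `status` or `depends_on` is still flagged.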
Reviewer caught: line 74 still listed (plan, execute, or audit) for the
Stage field, contradicting the template enum {execute|audit} 50 lines
above. Updated to (`execute` or `audit`). Also tightened Pipeline
status values (approved → Phase 0.6, executed → Phase 4).
… comment
Tasks 14, 15, 16, 20 bundled:
- Delete skills/start/ and skills/plan/ entirely (Phase 0 of execute now absorbs all their work).
- Update tests/structure/test-doc-references.sh: drop start/plan existence checks, add execute/examples/ check, replace Elegance Reviewer assertion with 3-role taxonomy check (Executor, Reviewer, Challenger), update message-types list (drop PLAN_REVIEW, add CHALLENGE_REVIEW + CODE_REVIEW), update spawn templates check to verify 3 universal roles, update plan-stage assertions to execute-stage.
- Update See Also list in docs/team-archetypes.md to drop /agent-team:start (was missed in Task 8).
- Update line 46 comment in scripts/validate-task-graph.sh to reflect Phase 0 (cosmetic).

Test suite: 16 files, all passing.
Updates 14+ sections per the v4 refactor scope:
- Tagline: 3 stages (plan/execute/audit) → 2 stages (execute/audit)
- What It Does intro: per-task adversarial pipeline + auto-chain framing
- Quickstart demo (~80 lines) full rewrite for Phase 0 + per-task E/R/C
- Pipeline Commands table: drop /agent-team:start and /agent-team:plan rows
- How It Works diagram: replace plan→execute→audit with execute(Phase 0) → execute(Phase 4 pipeline) → audit
- Stage table: collapse 4 rows (Start/Plan/Execute/Audit) to 2 (Execute/Audit) with detailed phase breakdown
- Drop "Plan team" / "Execute team" / "Audit team" composition list
- Plan-aware paragraph: rewrite to Phase 0 vocabulary
- Teammate Roles table: 13-row catalog → 4-row taxonomy (Lead, Executor, Reviewer, Challenger) with archetype specializations note
- Team Types table: rewrite Default Roles column → per-archetype Reviewer/Challenger focus
- Workspace tree: add reviews/ subdir + impact_files/review_status/challenge_status fields
- Plugin Structure tree: drop skills/start, skills/plan; expand skills/execute/ and skills/audit/ subfolder annotations
Updates 6+ sections per the v4 refactor scope:
- Architecture diagram: drop skills/start, skills/plan; expand skills/execute and skills/audit subfolder annotations to include Phase 0 prep, Phase 4 per-task pipeline, integration-only review
- Key Design Decisions: extend Team-per-stage bullet to mention the per-task ephemeral E/R/C pipeline; add 3-role universal taxonomy bullet noting legacy role names are colloquial archetype specializations
- File Ownership table: delete 5 plan/start rows; rewrite execute rows to highlight Phase 0/4 + 3 universal templates; add skills/execute/examples/ row
- Adding a New Teammate Role guide: full rewrite to point to docs/custom-roles.md archetype-extension model instead of direct teammate-roles.md edits
- Adding a New Pipeline Stage guide: drop reference to deleted skills/start/SKILL.md routing table
Replaces Phase 1/Phase 2 demo blocks with Phase 0 (resolve + decompose + approve) and Phase 4 (per-task Executor -> Reviewer -> Challenger ephemeral pipeline). Updates teammate names from auth-impl-1/2 (Implementer) to executor-task-1/2 + ephemeral reviewer-task-N + challenger-task-N identifiers, demonstrating the new ephemeral spawn-shutdown-respawn pattern.
Both scripts produce visually equivalent output:
- generate-demo-cast.sh: asciicast-style with timing
- record-demo.sh: plain stdout with colors
v4.0.0 collapses the 4-skill pipeline (start/plan/execute/audit) to 2 skills (execute/audit) with per-task adversarial review pipeline (Executor → Reviewer → Challenger ephemeral) and 3-role universal taxonomy. CHANGELOG entry covers BREAKING/Migration/Added/Changed/Removed sections.
Three new integration test files covering the v4.0 schema additions and per-task pipeline lifecycle:
- tests/integration/test-v4-schema.sh (17 tests)
  Verifies validate-task-graph.sh accepts new fields (impact_files, review_status, challenge_status), the in_review status, and the approved_with_findings status values. Verifies compute-critical-path and detect-resume work on v4.0 graphs. Verifies regression: graphs missing required fields are still blocked.
- tests/integration/test-v4-pipeline-lifecycle.sh (15 tests)
  Simulates a task progressing through the per-task pipeline: pending → in_review → review_status approved → challenge_status approved → completed. Tests fix-cycle exhaustion (approved_with_findings) and cross-task impact_files overlap (the audit-stage integration review scope).
- tests/integration/test-v4-skill-coherence.sh (93 tests)
  Static analysis of v4.0 skill content. Verifies SKILL.md files contain the expected v4.0 instructions, cross-references resolve, and demo scripts use v4.0 vocabulary.

These tests are the bash-level equivalent of the manual smoke tests (Task 23 in the plan): they don't invoke the actual Claude skills (which require TeamCreate at runtime), but they verify the skill instructions themselves describe the v4.0 flow.
Total test suite: 19 files, 336 test assertions, all passing. Auto-discovery via find works without runner changes.
A thorough audit caught several issues in the new integration tests:
**v4-schema.sh** (17 → 24 tests):
- Tests 1-3 were tautological: claimed to verify v4 status enum acceptance
but validate-task-graph.sh has no enum guard (it accepts any non-null
string). Reframed to honestly test what's actually verified: the script
doesn't choke on graphs containing the new fields.
- Tests 3, 4 had missing 'cwd' in JSON input causing validate-task-graph
to early-exit before validation. Fixed.
- Test 5 (detect-resume) used empty {} input, hitting the no-CWD silent
fallback. Fixed: pass cwd, assert stdout shows the workspace name and
remaining tasks.
- Tests 7a-7f wrote test docs and then grep'd them for fields the test
itself wrote — pure tautologies. Replaced with checks against the
PRODUCTION docs/workspace-templates.md (impact_files semantics, who
populates each field, validate-task-graph tolerance).
- Added Test 7 (regression: cycle detection still works).
**v4-pipeline-lifecycle.sh** (15 → 21 tests):
- Stages 3, 6 had missing cwd in JSON input. Fixed.
- Reframed jq-on-fixture assertions as 'schema integrity checks'
(verifying the field names jq reads match production spec) and added
spec contract checks at every stage:
  - Stage 3: spawn-templates.md describes the impact-analysis directive
  - Stage 4: communication-protocol.md documents CODE_REVIEW + CHALLENGE_REVIEW
  - Stage 6: coordination-patterns.md documents max-1-fix-cycle bound
    and log-to-issues.md on exhaustion
  - Stage 7: audit/SKILL.md describes integration review using impact_files
- Removed sleep-based timing assumptions; now uses precise jq queries.
**v4-skill-coherence.sh** (93 → 101 tests):
- Test 5 'Phase 1|Phase 2' regex was too broad (would match any
incidental string). Narrowed to detect actual phase headings or
numbered Phase 1a/1b mentions.
- Test 9 archetype-directive count used substring matching (would pass
for partial matches). Now checks each role × archetype combination
with exact-match patterns: 4 archetypes × 3 roles = 12 directive
blocks (Executor, Reviewer, Challenger).
Net: 357 assertions (up from 336), no tautologies, every assertion
verifies meaningful behavior or a documented contract. All tests pass.
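The difference between substring matching and the exact role × archetype check can be sketched as below. It is an illustration only: the real test is a bash script, and both the `#`-style heading format and the "<Archetype> <Role>" block naming are assumptions about how spawn-templates.md labels its directive blocks.

```python
import re
from itertools import product

ARCHETYPES = ("Implementation", "Research", "Audit", "Planning")
ROLES = ("Executor", "Reviewer", "Challenger")


def missing_directives(doc: str) -> list[str]:
    """Return every '<Archetype> <Role>' pair that lacks an exact heading line."""
    missing = []
    for archetype, role in product(ARCHETYPES, ROLES):
        # Anchor to the whole line so partial matches do not count.
        heading = re.compile(rf"^#+ {archetype} {role}[ \t]*$", re.MULTILINE)
        if not heading.search(doc):
            missing.append(f"{archetype} {role}")
    return missing


complete_doc = "\n".join(f"### {a} {r}\n\nDirective text." for a, r in product(ARCHETYPES, ROLES))
# A heading that merely contains the name would satisfy a substring check.
partial_doc = complete_doc.replace("### Implementation Reviewer\n", "### Implementation Reviewers guide\n")

print(len(list(product(ARCHETYPES, ROLES))))
print(missing_directives(complete_doc))
print(missing_directives(partial_doc))
```

The partial document still contains the substring "Implementation Reviewer", so a substring count would report all 12 blocks present, while the anchored pattern reports the one that is actually missing.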
Summary
Major refactor collapsing the 4-skill pipeline into 2 skills with a per-task adversarial review pipeline and 3-role universal taxonomy.
- … `execute` + `audit`). `start` and `plan` deleted.
- `execute` auto-invokes `superpowers:writing-plans` if no plan, then auto-chains to `audit` on completion
- … `impact_files`
- `reviews/` subdirectory, `impact_files` / `review_status` / `challenge_status` fields, `in_review` status

Spec: `docs/specs/2026-05-09-agent-team-v4-refactor-design.md`
Plan: `docs/plans/2026-05-09-agent-team-v4-refactor.md`

BREAKING
- `/agent-team:start` removed → use `/agent-team:execute`
- `/agent-team:plan` removed → use `/superpowers:writing-plans` then `/agent-team:execute`
- … (`docs/custom-roles.md`) need porting; porting guide included

Test plan
- `bash tests/run-tests.sh`: 19 files, 357 assertions, all passing
- `claude plugin validate .`: passes (1 unrelated marketplace-description warning)
- … `skills/start`, `skills/plan`, `/agent-team:start`, or `/agent-team:plan` in production code
- … (`scripts/generate-demo-cast.sh` produces v4.0 demo with Phase 0 + per-task pipeline)
- … `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`):
  - `/agent-team:execute docs/plans/<plan>.md` resolves the plan and proceeds to Phase 0 decomposition
  - `/agent-team:execute "<task>"` (no plan) auto-chains to `superpowers:writing-plans`
  - `/agent-team:execute docs/plans/nonexistent.md` fails with clear error
  - `/agent-team:audit` runs independently against an existing completed workspace
  - … `reviews/task-{id}-review.md` and `task-{id}-challenge.md` are written, `impact_files` populated in `task-graph.json`

Commits
26 commits on `feat/v4-refactor`. Highlights: