feat: v4.0 — pipeline simplification + per-task adversarial review #19

Merged
ducdmdev merged 31 commits into main from feat/v4-refactor on May 11, 2026

@ducdmdev
Owner

Summary

Major refactor collapsing the 4-skill pipeline into 2 skills with a per-task adversarial review pipeline and 3-role universal taxonomy.

  • Skills: 4 → 2 (execute + audit). start and plan deleted.
  • Roles: 13 specialized templates → 3 universal (Executor / Reviewer / Challenger) + Lead, with archetype-specific extensions
  • Per-task pipeline: every task spawns ephemeral Executor → Reviewer → Challenger
  • Auto-chaining: execute auto-invokes superpowers:writing-plans if no plan, then auto-chains to audit on completion
  • Audit scope reduction: deep code review → integration-only review of files in 2+ tasks' impact_files
  • Workspace schema additions: reviews/ subdirectory, impact_files/review_status/challenge_status fields, in_review status
  • Hooks: unchanged (13 hooks, 16 scripts)

Spec: docs/specs/2026-05-09-agent-team-v4-refactor-design.md
Plan: docs/plans/2026-05-09-agent-team-v4-refactor.md

BREAKING

  • /agent-team:start removed → use /agent-team:execute
  • /agent-team:plan removed → use /superpowers:writing-plans then /agent-team:execute
  • 13-role taxonomy collapsed to 3 universal roles
  • Per-task spawn count is ~3× v3.x (Executor + Reviewer + Challenger ephemeral per task)
  • Custom roles (docs/custom-roles.md) need porting — porting guide included

Test plan

  • bash tests/run-tests.sh — 19 files, 357 assertions, all passing
  • claude plugin validate . — passes (1 unrelated marketplace-description warning)
  • No orphaned references to skills/start, skills/plan, /agent-team:start, or /agent-team:plan in production code
  • Demo regen verified (scripts/generate-demo-cast.sh produces v4.0 demo with Phase 0 + per-task pipeline)
  • Test audit: dispatched thorough audit of all 19 test files; fixed tautological assertions and missing-cwd false positives in integration tests
  • Manual smoke tests (require runtime — Claude Code session with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1):
    • /agent-team:execute docs/plans/<plan>.md resolves the plan and proceeds to Phase 0 decomposition
    • /agent-team:execute "<task>" (no plan) auto-chains to superpowers:writing-plans
    • /agent-team:execute docs/plans/nonexistent.md fails with clear error
    • /agent-team:audit runs independently against an existing completed workspace
    • Full E/R/C pipeline against a 2-task plan: each task goes through Executor → Reviewer → Challenger, reviews/task-{id}-review.md and task-{id}-challenge.md are written, impact_files populated in task-graph.json
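The first three smoke tests describe how execute classifies its argument. A minimal Python sketch of that decision follows. It is not the skill's actual logic: the `.md` suffix heuristic, the `Resolution` type, and the function name are all assumptions made for illustration.

```python
import os
from typing import Callable, NamedTuple


class Resolution(NamedTuple):
    action: str  # "use_plan", "chain_to_writing_plans", or "error"
    detail: str  # plan path, free-text task, or error message


def resolve_execute_argument(
    arg: str, is_file: Callable[[str], bool] = os.path.isfile
) -> Resolution:
    """Classify the argument given to /agent-team:execute (hypothetical helper)."""
    arg = arg.strip()
    # Assumption: a ".md" suffix marks the argument as a plan path.
    if arg.endswith(".md"):
        if is_file(arg):
            # Plan found: proceed to Phase 0 decomposition.
            return Resolution("use_plan", arg)
        # Plan path given but missing: fail with a clear error.
        return Resolution("error", f"plan file not found: {arg}")
    # Free-text task: auto-chain to superpowers:writing-plans.
    return Resolution("chain_to_writing_plans", arg)
```

The `is_file` parameter exists only so the sketch can be exercised without touching the filesystem.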

Commits

31 commits on feat/v4-refactor. Highlights:

  • Spec design + audit findings incorporation
  • File moves + content updates (prior-context-loading, plan-mode-protocol, plan-proposal-example)
  • Phase 0 prepended to execute, Phase 4 rewritten to E/R/C pipeline
  • spawn-templates collapse (13 → 3) + impact-analysis directive moved to Implementation archetype only
  • teammate-roles + custom-roles + team-archetypes + workspace-templates + audit/SKILL.md + audit/agents/reviewer.md rewrites
  • README + CLAUDE.md + demo scripts rewritten for v4.0
  • Version bump to 4.0.0 + CHANGELOG entry
  • 3 new integration test files (schema, lifecycle, skill coherence) — 146 new assertions

ducdmdev added 30 commits March 26, 2026 13:26
Collapse the four-skill pipeline into two skills (execute + audit) by
absorbing decomposition, archetype detection, plan-mode marking, and
the user approval gate into a new Phase 0 of execute. Auto-chain to
superpowers:writing-plans when no plan file is available.

Adds the previously-uncaptured edits in execute's own supporting files:
coordination-patterns.md (9 Phase 2 references), communication-protocol.md
(Plan Stage Messages section + PLAN_REVIEW subsection), spawn-templates.md
(plan-mode marking attribution). Adds shared-doc edits in workspace-templates
(Plan Audit Result table removal, phase checklist), team-archetypes (10+ phase
references), teammate-roles, and custom-roles. Establishes a phase numbering
convention table mapping the dropped phases to Phase 0 sub-steps. Restructures
the implementation order into 6 staged groups (30 steps).

…apse to v4 spec

Bundles two major design additions into v4.0:

1. Per-task ephemeral Executor/Reviewer/Challenger pipeline replacing the
   current per-task light review. Each task now goes through 3 sequential
   stages with shutdown between each. Adds CHALLENGE_REVIEW message format,
   reviews/ workspace subdirectory, and impact_files / review_status /
   challenge_status fields on task-graph.json.

2. Role taxonomy collapse from 13 specialized templates to 3 universal roles
   (Executor / Reviewer / Challenger) with archetype-specific prompt
   extensions. Each archetype (Implementation/Research/Audit/Planning) maps
   the same role triple onto archetype-appropriate work.

Audit deep code review reduced to integration-only since per-task Challenger
covers within-task adversarial review.

Implementation order expanded to 9 stages, 37 steps. Migration guide updated
for custom-roles porting and v3.x workspace forward-compat. Risk catalog
expanded for spawn cost, reviewer/challenger bottlenecks, and grep-based
impact-analysis limits.

Audit ran against the actual codebase. Found and resolved 8 gaps and 2 new
findings the spec had missed:

- README scope was 5x larger than spec claimed (Teammate Roles table at lines
  188-202 + Team Types table at 207-216 + 80-line Quickstart demo all need
  rewrite)
- CLAUDE.md "Adding a New Teammate Role" common-task guide assumes the legacy
  13-role taxonomy and needs full rewrite
- plan-mode-protocol.md needs 4 sections updated (not just Ownership Boundary):
  Activation, Spawn Directive (ephemeral semantics), Revision Limits (per-task
  model), Workspace Tracking (Plan Proposals semantics)
- prior-context-loading.md move is not pure — 3 inline content updates needed
- custom-roles.md is a near-total rewrite (single-role template → 3 extension
  templates; 36-line example rewritten; new Porting subsection)
- workspace-templates.md needs 2 more line updates (Stage enum, line 365)
- demo scripts (NEW finding) hardcode Phase 1/Phase 2/Implementer references
- validate-task-graph.sh schema flexibility (NEW finding) confirmed safe for
  the new task-graph.json fields; stale comment on line 46 is cosmetic

Audit Findings table added to spec for traceability. Implementation Order
expanded from 37 to 40 numbered steps.

Bite-sized 25-task plan executing the spec at
docs/specs/2026-05-09-agent-team-v4-refactor-design.md. Organized into
9 stages: pre-flight, file moves, support file edits, role taxonomy
collapse, Phase 0 + Phase 4 rewrites, audit scope reduction,
workspace schema updates, dead-skill deletion, tests + meta + docs,
and validate.

Each task has multiple steps with concrete edits (Find/Replace
patterns and full content blocks for major rewrites). Every task
ends with a verification step and a commit. Self-review at the end
of the plan confirms spec coverage, no placeholders, and consistent
naming.

Moves prior-context-loading.md, plan-mode-protocol.md, and
plan-proposal-example.md into the execute skill. Content updates follow
in subsequent commits.

Updates Activation, Ownership Boundary, Spawn Directive, Revision
Limits, and Workspace Tracking sections to reflect v4.0's Phase 0
sub-steps and ephemeral per-task spawn pattern.

Renames 9 Phase 2 + 1 Phase 1 references. Adds Per-Task Pipeline
Coordination section describing the default Executor → Reviewer →
Challenger lifecycle. Updates Adversarial Review Rounds to reflect
that adversarial review is now default behavior.

…le diagram + add TOC entry

Code review caught two issues:
- Bounds bullet listed 6 agents (incl. re-Reviewer) but lifecycle
  diagram skipped the re-review step. Diagram now shows fix-Executor
  → re-Reviewer/re-Challenger → final verdict explicitly.
- Per-Task Pipeline Coordination was missing from the Contents TOC.
  Added entry between Plan-Mode Coordination and Advanced Patterns.
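The 6-agent bound can be sketched as a spawn sequence with at most one fix cycle. This is an illustrative Python model, not the skill's orchestration: the agent-name pattern, the `has_findings` callback, and the assumption that all three first-pass roles run before any fix cycle starts are hypothetical.

```python
from typing import Callable, List, Tuple

FIRST_PASS = ("executor", "reviewer", "challenger")
FIX_PASS = ("fix-executor", "re-reviewer", "re-challenger")


def run_task_pipeline(
    task_id: str, has_findings: Callable[[str], bool]
) -> Tuple[List[str], str]:
    """Return (agents spawned in order, final verdict) for one task."""
    spawned: List[str] = []

    def run_pass(roles: Tuple[str, ...]) -> bool:
        found = False
        for role in roles:
            name = f"{role}-task-{task_id}"
            # Each agent is ephemeral: spawn, run, shut down, then the next.
            spawned.append(name)
            # Only the reviewing roles can report findings.
            if not role.endswith("executor") and has_findings(name):
                found = True
        return found

    if not run_pass(FIRST_PASS):
        return spawned, "approved"
    # Max one fix cycle: fix-Executor, then re-Reviewer and re-Challenger.
    if not run_pass(FIX_PASS):
        return spawned, "approved"
    # Fix cycle exhausted: accept with findings instead of looping again.
    return spawned, "approved_with_findings"
```

A clean task spawns 3 agents; a task with findings spawns exactly 6, which matches the bound the Bounds bullet lists.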
…_REVIEW

Renames the Plan Stage Messages section to Research Messages (FINDING and
ANALYSIS are now reusable across archetypes during execute, not
plan-stage-specific). Adds CHALLENGE_REVIEW message format for the new
per-task Challenger role. Drops PLAN_REVIEW (no plan-reviewer agent
in v4.0).

Code Review Messages H2 was already in the file but never listed in the
Contents block. Adding the entry makes CHALLENGE_REVIEW (newly added in
the prior commit) discoverable via TOC navigation.

Replaces specialized templates (Implementer, Migrator, Researcher,
Tester, Auditor, Planner, Writer, Strategist, etc.) with Executor /
Reviewer / Challenger templates parameterized by archetype directives
(Implementation, Research, Audit, Planning).

Reviewer template includes the impact-analysis directive (grep call
sites, glob importers, populate impact_files in task-graph.json).
Challenger template includes the rules-compliance + missed-issues
directive.
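The role × archetype parameterization can be pictured as a lookup keyed by both names. The Python sketch below is only an illustration: the directive strings are placeholders, and `build_prompt` is a hypothetical helper, not part of spawn-templates.md.

```python
from itertools import product

ROLES = ("Executor", "Reviewer", "Challenger")
ARCHETYPES = ("Implementation", "Research", "Audit", "Planning")

# Placeholder text only; the real directive wording lives in spawn-templates.md.
DIRECTIVES = {
    (role, archetype): f"<{archetype} {role} directive>"
    for role, archetype in product(ROLES, ARCHETYPES)
}


def build_prompt(role: str, archetype: str, task: str) -> str:
    """Combine a universal role template with its archetype directive."""
    directive = DIRECTIVES[(role, archetype)]  # KeyError on an unknown pair
    return f"Role: {role}\nArchetype: {archetype}\nTask: {task}\n{directive}"
```

Three roles across four archetypes gives 12 directive blocks in place of a separate template per specialized role.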
…e only

Code review caught universal Reviewer template baking in impact-analysis
behavior that only makes sense for Implementation tasks. Moved the
directive (and the task-graph.json write permission) into the
Implementation Reviewer archetype directive block.

Also: removed unused {RESULTS_FORMAT} and {REPORT_FORMAT} placeholders
from the header note (never substituted anywhere in the file). Made
Challenger 'Impact files' field conditional. Softened universal
Executor 'owned area' rule to acknowledge archetype overrides.

Replaces 13-role catalog with Lead + 3 universal roles (Executor /
Reviewer / Challenger) parameterized by archetype. Legacy role names
preserved as colloquial archetype specializations in a lookup table.

Notes hook agnosticism — no hook hardcodes role names, so the
collapse has zero hook impact.

Code review caught 2 unlinked references in Selection Guide
(team-archetypes guide, docs/custom-roles.md) — both now markdown
links. Added pointer to error-recovery-protocol.md for the canonical
recovery class definitions.

Replaces single-role custom template with three archetype-specific
extension templates (Custom Executor / Reviewer / Challenger). Each
extension layers directives on top of the standard prompt rather than
defining a standalone role.

Adds Database Migration Specialist example demonstrating the new
pattern. Adds Porting v3.x custom roles section with three concrete
migration examples (Implementer, Reviewer, Researcher patterns).

Replaces 10+ Phase 1 / Phase 2 references with Phase 0 sub-step
references. Removes /agent-team:plan from the commands list. Adds
Reviewer focus and Challenger focus subsections per archetype to
guide what the per-task pipeline checks for each task type.

Phase 0 absorbs the work formerly done by skills/plan/ (resolution,
prior-context loading, archetype detection, decomposition, plan-mode
marking, user approval gate). Auto-chains to superpowers:writing-plans
if no plan is found.

Updates frontmatter description and trigger phrases to reflect the
new entry-point role. Updates Preconditions to remove the
require-approved-plan-in-workspace constraint.

Every task now spawns Executor → Reviewer → Challenger ephemerally.
Reviewer performs grep-based impact analysis and populates
impact_files in task-graph.json. Challenger deep-dives looking for
missed issues. Each role shuts down before the next spawns.

Adds reviews/ workspace subdirectory for per-task review artifacts.
Updates task-graph.json schema usage with review_status and
challenge_status fields.

Replaces 'When chained via /agent-team:start' end-of-stage messaging
with auto-chain to ../audit/SKILL.md.

Also: cleans up 2 stale Phase 2 references (Worktree Isolation,
Anti-Patterns) and updates the Overview paragraph for v4.0 framing.

Per-task Challengers now cover within-task adversarial review during
execute Phase 4. Audit's Step 4 reduces to cross-task integration:
read challenge docs for context, focus end-to-end review on
impact_files overlapping multiple tasks, verify integrated build/test.

Removes /agent-team:start references and legacy/manual workspace
backward-compat framing.
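The integration-review scope (files that appear in two or more tasks' impact_files) can be computed as below. This is a Python sketch under an assumed task-graph.json layout, with a top-level `tasks` object keyed by task id. The real schema may differ, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def integration_review_scope(task_graph: dict) -> Dict[str, List[str]]:
    """Map each file listed by 2+ tasks to the ids of the tasks that list it."""
    owners = defaultdict(set)
    for task_id, task in task_graph.get("tasks", {}).items():
        # impact_files is optional; tasks without it add nothing to the scope.
        for path in task.get("impact_files", []):
            owners[path].add(task_id)
    # Files touched by a single task were already covered by that task's Challenger.
    return {
        path: sorted(ids)
        for path, ids in sorted(owners.items())
        if len(ids) >= 2
    }
```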
Updates the audit-stage reviewer agent prompt: read per-task
challenge docs for context, focus end-to-end review on
impact_files overlapping multiple tasks, surface coverage gaps
the per-task pipeline missed.

… challenge_status fields

Documents the new reviews/ subdirectory created during Phase 4 by
Reviewer and Challenger. Adds three new task-graph.json fields:
impact_files (Reviewer's grep-based impact area), review_status,
challenge_status. Confirms validate-task-graph.sh tolerates the
extension (only requires subject/status/depends_on).

Updates phase checklist and Stage enum for v4.0 vocabulary.
Removes Plan Audit Result template (the 7-check audit is dropped).
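The tolerance described above (only subject/status/depends_on are required, extra fields are ignored) can be mirrored in a few lines of Python. This is a sketch of the behaviour, not a port of validate-task-graph.sh; the `tasks` layout and the error wording are assumptions.

```python
from typing import List

REQUIRED_FIELDS = ("subject", "status", "depends_on")


def validate_task_graph(graph: dict) -> List[str]:
    """Return one error string per missing required field; empty list means valid."""
    errors: List[str] = []
    for task_id, task in graph.get("tasks", {}).items():
        for field in REQUIRED_FIELDS:
            # Only presence (non-null) is checked; there is no enum guard on values,
            # so new statuses such as in_review pass without any change.
            if task.get(field) is None:
                errors.append(f"task {task_id}: missing required field '{field}'")
        # Extra keys (impact_files, review_status, challenge_status) are ignored.
    return errors
```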
Reviewer caught: line 74 still listed (plan, execute, or audit) for the
Stage field, contradicting the template enum {execute|audit} 50 lines
above. Updated to (`execute` or `audit`). Also tightened Pipeline
status values (approved → Phase 0.6, executed → Phase 4).

… comment

Tasks 14, 15, 16, 20 bundled:

- Delete skills/start/ and skills/plan/ entirely (Phase 0 of execute now
  absorbs all their work).
- Update tests/structure/test-doc-references.sh: drop start/plan
  existence checks, add execute/examples/ check, replace
  Elegance Reviewer assertion with 3-role taxonomy check (Executor,
  Reviewer, Challenger), update message-types list (drop PLAN_REVIEW,
  add CHALLENGE_REVIEW + CODE_REVIEW), update spawn templates check to
  verify 3 universal roles, update plan-stage assertions to execute-stage.
- Update See Also list in docs/team-archetypes.md to drop /agent-team:start
  (was missed in Task 8).
- Update line 46 comment in scripts/validate-task-graph.sh to reflect
  Phase 0 (cosmetic).

Test suite: 16 files, all passing.

Updates 14+ sections per the v4 refactor scope:

- Tagline: 3 stages (plan/execute/audit) → 2 stages (execute/audit)
- What It Does intro: per-task adversarial pipeline + auto-chain framing
- Quickstart demo (~80 lines) full rewrite for Phase 0 + per-task E/R/C
- Pipeline Commands table: drop /agent-team:start and /agent-team:plan rows
- How It Works diagram: replace plan→execute→audit with execute(Phase 0)
  → execute(Phase 4 pipeline) → audit
- Stage table: collapse 4 rows (Start/Plan/Execute/Audit) to 2 (Execute/Audit)
  with detailed phase breakdown
- Drop "Plan team" / "Execute team" / "Audit team" composition list
- Plan-aware paragraph: rewrite to Phase 0 vocabulary
- Teammate Roles table: 13-row catalog → 4-row taxonomy (Lead, Executor,
  Reviewer, Challenger) with archetype specializations note
- Team Types table: rewrite Default Roles column → per-archetype
  Reviewer/Challenger focus
- Workspace tree: add reviews/ subdir + impact_files/review_status/
  challenge_status fields
- Plugin Structure tree: drop skills/start, skills/plan; expand
  skills/execute/ and skills/audit/ subfolder annotations

Updates 6+ sections per the v4 refactor scope:

- Architecture diagram: drop skills/start, skills/plan; expand
  skills/execute and skills/audit subfolder annotations to include
  Phase 0 prep, Phase 4 per-task pipeline, integration-only review
- Key Design Decisions: extend Team-per-stage bullet to mention the
  per-task ephemeral E/R/C pipeline; add 3-role universal taxonomy
  bullet noting legacy role names are colloquial archetype
  specializations
- File Ownership table: delete 5 plan/start rows; rewrite execute
  rows to highlight Phase 0/4 + 3 universal templates; add
  skills/execute/examples/ row
- Adding a New Teammate Role guide: full rewrite to point to
  docs/custom-roles.md archetype-extension model instead of
  direct teammate-roles.md edits
- Adding a New Pipeline Stage guide: drop reference to deleted
  skills/start/SKILL.md routing table

Replaces Phase 1/Phase 2 demo blocks with Phase 0 (resolve + decompose
+ approve) and Phase 4 (per-task Executor -> Reviewer -> Challenger
ephemeral pipeline). Updates teammate names from auth-impl-1/2
(Implementer) to executor-task-1/2 + ephemeral reviewer-task-N +
challenger-task-N identifiers, demonstrating the new ephemeral
spawn-shutdown-respawn pattern.

Both scripts produce visually equivalent output:
- generate-demo-cast.sh: asciicast-style with timing
- record-demo.sh: plain stdout with colors

v4.0.0 collapses the 4-skill pipeline (start/plan/execute/audit) to
2 skills (execute/audit) with per-task adversarial review pipeline
(Executor → Reviewer → Challenger ephemeral) and 3-role universal
taxonomy. CHANGELOG entry covers BREAKING/Migration/Added/Changed/
Removed sections.

Three new integration test files covering the v4.0 schema additions
and per-task pipeline lifecycle:

- tests/integration/test-v4-schema.sh (17 tests)
  Verifies validate-task-graph.sh accepts new fields (impact_files,
  review_status, challenge_status), the in_review status, and the
  approved_with_findings status values. Verifies compute-critical-path
  and detect-resume work on v4.0 graphs. Verifies regression: graphs
  missing required fields are still blocked.

- tests/integration/test-v4-pipeline-lifecycle.sh (15 tests)
  Simulates a task progressing through the per-task pipeline:
  pending → in_review → review_status approved → challenge_status
  approved → completed. Tests fix-cycle exhaustion (approved_with_
  findings) and cross-task impact_files overlap (the audit-stage
  integration review scope).

- tests/integration/test-v4-skill-coherence.sh (93 tests)
  Static analysis of v4.0 skill content. Verifies SKILL.md files
  contain the expected v4.0 instructions, cross-references resolve,
  and demo scripts use v4.0 vocabulary. The bash-level equivalent of
  the manual smoke tests (Task 23 in the plan) — these don't invoke
  the actual Claude skills (which require TeamCreate at runtime),
  but they verify the skill instructions themselves describe the
  v4.0 flow.

Total test suite: 19 files, 336 test assertions, all passing.
Auto-discovery via find works without runner changes.
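The lifecycle that test-v4-pipeline-lifecycle.sh simulates (pending → in_review → review_status → challenge_status → completed) can be sketched as a small transition function. The event names are hypothetical, and treating approved_with_findings as an accepted verdict on review_status/challenge_status is an assumption; the tests themselves drive a JSON fixture with jq.

```python
ACCEPTED = {"approved", "approved_with_findings"}


def apply_event(task: dict, event: str, verdict: str = "approved") -> dict:
    """Return a copy of the task after one pipeline event; reject invalid order."""
    t = dict(task)
    if event == "executor_done" and t["status"] == "pending":
        t["status"] = "in_review"
    elif event == "review" and t["status"] == "in_review":
        t["review_status"] = verdict
    elif (
        event == "challenge"
        and t["status"] == "in_review"
        and t.get("review_status") in ACCEPTED
    ):
        t["challenge_status"] = verdict
        # The task completes only once the Challenger verdict is accepted.
        if verdict in ACCEPTED:
            t["status"] = "completed"
    else:
        raise ValueError(f"event {event!r} not allowed for task state {t}")
    return t
```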
A thorough audit caught several issues in the new integration tests:

**v4-schema.sh** (17 → 24 tests):
- Tests 1-3 were tautological: claimed to verify v4 status enum acceptance
  but validate-task-graph.sh has no enum guard (it accepts any non-null
  string). Reframed to honestly test what's actually verified: the script
  doesn't choke on graphs containing the new fields.
- Tests 3, 4 had missing 'cwd' in JSON input causing validate-task-graph
  to early-exit before validation. Fixed.
- Test 5 (detect-resume) used empty {} input, hitting the no-CWD silent
  fallback. Fixed: pass cwd, assert stdout shows the workspace name and
  remaining tasks.
- Tests 7a-7f wrote test docs and then grep'd them for fields the test
  itself wrote — pure tautologies. Replaced with checks against the
  PRODUCTION docs/workspace-templates.md (impact_files semantics, who
  populates each field, validate-task-graph tolerance).
- Added Test 7 (regression: cycle detection still works).

**v4-pipeline-lifecycle.sh** (15 → 21 tests):
- Stages 3, 6 had missing cwd in JSON input. Fixed.
- Reframed jq-on-fixture assertions as 'schema integrity checks'
  (verifying the field names jq reads match production spec) and added
  spec contract checks at every stage:
  - Stage 3: spawn-templates.md describes the impact-analysis directive
  - Stage 4: communication-protocol.md documents CODE_REVIEW + CHALLENGE_REVIEW
  - Stage 6: coordination-patterns.md documents max-1-fix-cycle bound
    and log-to-issues.md on exhaustion
  - Stage 7: audit/SKILL.md describes integration review using impact_files
- Removed sleep-based timing assumptions; now uses precise jq queries.

**v4-skill-coherence.sh** (93 → 101 tests):
- Test 5 'Phase 1|Phase 2' regex was too broad (would match any
  incidental string). Narrowed to detect actual phase headings or
  numbered Phase 1a/1b mentions.
- Test 9 archetype-directive count used substring matching (would pass
  for partial matches). Now checks each role × archetype combination
  with exact-match patterns: 4 archetypes × 3 roles = 12 directive
  blocks (Executor, Reviewer, Challenger).

Net: 357 assertions (up from 336), no tautologies, every assertion
verifies meaningful behavior or a documented contract. All tests pass.
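The Test 5 fix (narrowing a bare 'Phase 1|Phase 2' match) can be illustrated with Python's `re` module. Both patterns below are hypothetical reconstructions of the idea, not the expressions used in test-v4-skill-coherence.sh, which is a bash script.

```python
import re
from typing import List

# Broad: fires on any incidental mention of the old phase names.
BROAD = re.compile(r"Phase 1|Phase 2")

# Narrow: a markdown heading naming Phase 1/2, or a sub-step label like "Phase 1a".
NARROW = re.compile(r"^#{1,6}[ \t]+.*\bPhase [12]\b|\bPhase 1[a-z]\b")


def stale_phase_lines(text: str) -> List[str]:
    """Return the lines that still carry a v3.x phase heading or sub-step label."""
    return [line for line in text.splitlines() if NARROW.search(line)]
```

For example, the sentence "See the v3.x changelog for the old Phase 1 wording." matches `BROAD` but not `NARROW`, while "## Phase 2: Plan Review" and "Run Phase 1b before spawning." match both.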
@ducdmdev ducdmdev merged commit f17a9e5 into main May 11, 2026
1 check passed
@ducdmdev ducdmdev deleted the feat/v4-refactor branch May 11, 2026 14:29