Autonomous multi-agent coding workflow with competitive development and judicial review
# Two AI coders compete. Three judges pick the winner.
```bash
cube auto task.md
```

A self-improving coding workflow that orchestrates multiple AI agents to build production-ready features autonomously.
The Process:
- 2 AI writers implement the same task independently (Sonnet + Codex)
- 3 AI judges review both implementations
- System picks winner or synthesizes best of both
- Peer review validates final solution
- PR automatically created for human approval
The Result:
- 7x productivity improvement (conservative estimate)
- Two approaches evaluated per task (not just one)
- 3 independent reviews per feature
- Institutional knowledge captured
- Production-ready code
```bash
git clone https://github.com/aetheronhq/agent-cube.git
cd agent-cube
./install.sh
```

Prerequisites: cursor-agent CLI
```bash
# Create a task file
cat > my-task.md << 'EOF'
# Add String Utilities
Create capitalize() and slugify() functions in TypeScript.
Include tests. No external dependencies.
EOF

# Run autonomous workflow
cube auto my-task.md

# Watch it work (optional)
cube status my-task

# PR created automatically!
```

- Quick Start Guide - 5 commands, 5 minutes
- Installation - Detailed setup
- Core Concepts - Framework overview
- Planning Guide - Architecture-first planning (v2 example)
- Task Breakdown - How to split features
- Phase Organization - How phases emerge
- Templates - Planning docs + task file templates
- Automation Guide - Autonomous workflows
- Human-in-Loop - When and how to intervene
```bash
# 1. Start autonomous workflow
cube auto task.md

# 2. Check progress
cube status task

# 3. See decisions
cube decide task

# 4. Resume/continue
cube auto task --resume
```

That's it. The tool guides you for everything else.
Agent Cube isn't experimental; it's built on proven techniques:
| Research | Finding | Application |
|---|---|---|
| Best-of-N Sampling (Anthropic, 2022) | N=2 reduces errors by 35% | 2 writers = different blind spots |
| LLM-as-Judge (Zheng et al., 2023) | AI judges achieve 85% agreement with humans | Scalable, consistent code review |
| Self-Refine (Madaan et al., 2023) | Iterative critique → revision improves quality | Feedback rounds until approved |
| Ensemble Methods (Dietterich, 2000) | Different models = different strengths | Sonnet + Codex + Gemini diversity |
Plus: Modern models (GPT-5 Codex, Sonnet 4.5 Thinking) are good enough to work largely unassisted. This wasn't possible 6 months ago.
Layer 1: Orchestrator
- Plans workflow
- Breaks down features
- Coordinates execution
Layer 2: Prompt Writers
- Generate detailed task prompts
- Create judge panel prompts
- Generate synthesis feedback
Layer 3: Code Writers + Judges
- 2 writers compete (different models)
- 3 judges independently review
- System picks winner or synthesizes
Git Worktrees:
- Each agent gets isolated filesystem
- Own branch, own git state
- Zero conflicts, true parallelization
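For readers new to worktrees, this is roughly what that isolation looks like in plain git (branch and directory names here are illustrative, not Agent Cube's actual naming):

```bash
# Two checkouts of the same repository, each with its own branch and
# working directory, so both writers can build and test in parallel.
git worktree add ../task-writer-a -b task/writer-a
git worktree add ../task-writer-b -b task/writer-b

# Once the panel picks a winner, the losing worktree is disposable.
git worktree remove ../task-writer-b
git branch -D task/writer-b
```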
Ports & Adapters:
- Pluggable CLI adapters (cursor-agent, gemini, etc.)
- Parser plugins for output formats
- Layout adapters for display
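The adapter layer can be pictured as a thin dispatch over model names; a hypothetical sketch (the `run_agent` function and argument pass-through are illustrative assumptions, not Agent Cube's internals):

```bash
# Hypothetical sketch of the adapter idea: model names map to pluggable
# CLI backends, mirroring the cli_tools mapping in cube.yaml below.
run_agent() {
  local model="$1"; shift
  case "$model" in
    sonnet-4.5-thinking|gpt-5-codex-high) cursor-agent "$@" ;;
    gemini-2.5-pro)                       gemini "$@" ;;
    *) echo "no adapter registered for: $model" >&2; return 1 ;;
  esac
}
```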
State Management:
- Explicit phase tracking
- Resume from any point
- Atomic writes, no corruption
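Atomic writes here presumably mean the standard temp-file-then-rename pattern; a minimal sketch, assuming a JSON state file (file and variable names are illustrative):

```bash
# rename() is atomic on the same filesystem, so a concurrent reader or a
# resumed run sees the old state file or the new one, never a torn write.
printf '%s\n' "$NEW_STATE_JSON" > state.json.tmp
mv state.json.tmp state.json
```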
Output:
- 15 production features
- ~34,000 lines of code
- Multi-tenancy, Auth0, CRUD factory, OpenAPI + SDK
- Production-ready quality (full tests, security scans, CI passing)
Timeline:
- 15 active work days
- 1 developer + Agent Cube
- vs. a traditional 7-8 person team
Economics:
- Cost: $15k (salary + LLM)
- Traditional: $63-96k
- Savings: $48-81k (75-85%)
Quality:
- Synthesis improved 40% of tasks
- Multiple feedback rounds caught bugs early
- Comprehensive test coverage
- Sonnet 4.5: UI/frontend wins (3-0, 100%)
- Codex High: backend wins (7/8, 88%)
- Grok: best balanced judge
Insight: Task-model matching > using "best model" for everything
Fully customizable - use any models you want:
```yaml
# python/cube.yaml
writers:
  writer_a:
    model: "sonnet-4.5-thinking"
  writer_b:
    model: "gpt-5-codex-high"

judges:
  judge_1:
    model: "sonnet-4.5-thinking"
  judge_2:
    model: "gpt-5-codex-high"
  judge_3:
    model: "gemini-2.5-pro"  # Or grok, claude-code, etc.

cli_tools:
  sonnet-4.5-thinking: cursor-agent
  gpt-5-codex-high: cursor-agent
  gemini-2.5-pro: gemini
```

No vendor lock-in. Fully extensible.
Two AI models implement the same task independently. Different approaches reveal trade-offs.
Three independent AI judges review both implementations. Majority vote or consensus determines the winner.
Synthesis: when both approaches have merit, the system combines the best elements. 40% of v2 features improved this way.
~5 human interventions per complex feature. The tool provides clear guidance when it needs help.
Like pair programming × 5: multiple perspectives, ideas you wouldn't have thought of, issues you would've missed.
Good fit:
- New features (2-8 hours scope)
- Complex architecture decisions
- Refactoring (multiple valid approaches)
- Production-critical code (needs thorough review)

Not a good fit:
- Tiny changes (<1 hour)
- Emergency hotfixes (too slow)
- Experimental code (unclear requirements)
- Simple scaffolding (overkill)

The sweet spot: features where exploring alternatives adds value.
Real example from v2:
- Task: API client scaffold
- All 3 AI judges: APPROVED ✅
- Human review: REJECTED ❌
What went wrong: Built a custom HTTP client (good code quality). Needed an OpenAPI code generator (wrong approach). Judges focused on code quality, missed strategy.
The lesson: AI judges catch bugs. Humans catch strategy misalignment. Both needed.
- ~5 interventions per complex feature (improving!)
- $200-400 per feature in LLM costs (though with 4-5x ROI)
- Learning curve for planning docs
- Human validation always required
All improving weekly. Rapid iteration.
This Month:
- Web UI for managing multiple workflows
- Integration test framework
- More CLI adapters (Claude Code, Codex CLI direct)
This Quarter:
- Auto-orchestration (dependency-based task execution)
- Cost tracking and analytics
- Learning system (model selection optimization)
- Team collaboration features
Found a bug? Have an idea? Want to help?
Raise an issue: https://github.com/aetheronhq/agent-cube/issues
We'll use Agent Cube to fix Agent Cube! 🎯
See aetheron-connect-v2 for a complete example:

```
planning/                    # 33 architecture docs
implementation/
├── phase-00/                # Scaffold
├── phase-01/                # Foundation
├── phase-02/                # Core (9 parallel tasks!)
│   └── tasks/
│       ├── 02-auth-middleware.md
│       ├── 02-crud-factory.md
│       └── ...
└── panel/
    └── panel-metrics.md     # All decisions, scores, learnings
```
Learn from a real project that shipped!
7x productivity improvement (conservative estimate)
- 1 person = 2 teams' output
- 3-5x ROI on cost
- Higher quality through competition
- Validated on real projects
Not replacing engineers. Multiplying output.
- Documentation: Start with docs/QUICK_START.md
- Issues: GitHub Issues
- Questions: GitHub Discussions
MIT License - see LICENSE file
Built with Agent Cube, for Agent Cube. 🧊✨