Fix evolve to target underperformers and parallelize judge panel#11
Fix evolve to target underperformers and parallelize judge panel#11AaronGoldsmith merged 3 commits intomainfrom
Conversation
- Evolve command now targets agents below win-rate threshold instead of
champions, with --iterations and --threshold flags
- Evaluator-optimizer loop: refine, self-critique, repeat
- Add critique_refinement method to AgentBuilder with bug fixes:
- Try json.loads before _parse_agent_json (which discards non-agent JSON)
- Handle string booleans properly (bool("false") returns True)
- Parallelize judge panel with asyncio.gather + exception handling
- Bump max_output_tokens from 4096 to 16384 across all providers
- Add agent_max_output_tokens config field
- Preserve Bash tools migration in db.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Mobius’s self-improvement loop by evolving underperforming agents (instead of champions), adding an evaluator–optimizer refinement loop with self-critique, parallelizing judge execution to reduce evaluation latency, and increasing provider output token limits.
Changes:
- Update
mobius evolveto select and refine underperforming agents, with new--iterationsand--thresholdflags. - Parallelize judge panel execution using
asyncio.gather()and tolerate per-judge failures. - Increase per-provider output token limits to 16384 and add an
agent_max_output_tokensconfig field.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mobius/cli.py | Refactors evolve to target underperformers and adds iterative refinement/self-critique flags. |
| src/mobius/agent_builder.py | Adds critique_refinement self-critique method and parsing/normalization logic. |
| src/mobius/judge.py | Runs all judges concurrently and handles exceptions per judge. |
| src/mobius/config.py | Adds agent_max_output_tokens configuration field. |
| src/mobius/providers/openai.py | Raises OpenAI max_tokens to 16384. |
| src/mobius/providers/openrouter.py | Raises OpenRouter max_tokens to 16384. |
| src/mobius/providers/anthropic.py | Raises Anthropic max_tokens to 16384. |
| src/mobius/providers/google.py | Raises Gemini tool-path max_output_tokens to 16384. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ate CLI inputs - Add max_tokens parameter to all provider run functions (default 16384) - Only include successful judges in judge_models_used - Validate iterations >= 1 and threshold in [0,1] for evolve command - Strip whitespace before boolean normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78b2a09be6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
--iterationsand--thresholdflags for evaluator-optimizer loop with self-critiqueasyncio.gather(3x latency improvement)Bug fixes
bool("false")→Truein critique_refinement (now handles string booleans)_parse_agent_jsonfallback discarding non-agent JSON schemasagent_max_output_tokensconfig fieldSplits from PR #10
This is the core logic split from #10. Sandbox → feature/docker-sandbox. Tasks/skills → feature/agentic-tasks-skills.
Test plan
pytest tests/ -vmobius evolve coding --threshold 0.9runs without error🤖 Generated with Claude Code