Fix evolve to target underperformers and parallelize judge panel #11

Merged
AaronGoldsmith merged 3 commits into main from fix/evolve-parallel-judging
Mar 21, 2026

Conversation

@AaronGoldsmith
Owner

Summary

  • Evolve command now targets underperformers (low win rate + enough matches), not champions
  • Added --iterations and --threshold flags for evaluator-optimizer loop with self-critique
  • Judge panel runs all judges in parallel via asyncio.gather (3x latency improvement)
  • Bumped max_output_tokens from 4096 to 16384 across all providers
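
The selection rule in the first bullet (low win rate plus enough matches) can be sketched as below. This is a minimal illustration, not the code from this PR: `AgentStats`, `select_underperformers`, and the `min_matches` default are hypothetical names and values, and mapping `threshold` to the `--threshold` flag is an assumption based on the wording "agents below win-rate threshold".

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Hypothetical per-agent record; the real schema may differ."""
    name: str
    wins: int
    matches: int

    @property
    def win_rate(self) -> float:
        # Agents with no matches get 0.0 rather than a division error.
        return self.wins / self.matches if self.matches else 0.0


def select_underperformers(agents, threshold=0.4, min_matches=5):
    """Return agents with enough matches whose win rate is below threshold.

    Agents with too few matches are skipped because their win rate is
    not yet meaningful. The worst performers come first.
    """
    candidates = [
        a for a in agents
        if a.matches >= min_matches and a.win_rate < threshold
    ]
    return sorted(candidates, key=lambda a: a.win_rate)


agents = [
    AgentStats("champ", wins=9, matches=10),
    AgentStats("weak", wins=2, matches=10),
    AgentStats("new", wins=0, matches=2),    # too few matches, skipped
    AgentStats("worst", wins=1, matches=10),
]
print([a.name for a in select_underperformers(agents)])
```

Requiring a minimum match count keeps a freshly created agent from being refined on the basis of one or two losses.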

Bug fixes

  • Fixed bool("false") evaluating to True in critique_refinement (string booleans are now handled)
  • Fixed _parse_agent_json fallback discarding non-agent JSON schemas
  • Added agent_max_output_tokens config field
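
The first two fixes can be illustrated with a short sketch. It assumes the critique arrives as a JSON string with a boolean-like field; `normalize_bool`, `parse_critique`, the `fallback` parameter, and the `needs_refinement` field name are hypothetical and stand in for whatever critique_refinement and _parse_agent_json actually do.

```python
import json


def normalize_bool(value):
    """Convert booleans and boolean-like strings to a real bool.

    bool("false") is True in Python because any non-empty string is
    truthy, so strings have to be compared by content instead.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # Strip whitespace first so " False " is handled too.
        return value.strip().lower() in {"true", "yes", "1"}
    return bool(value)


def parse_critique(raw, fallback=None):
    """Try plain json.loads first; use the fallback parser only on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = fallback(raw) if fallback is not None else None
    if not isinstance(data, dict):
        return None
    # "needs_refinement" is an assumed field name for illustration.
    if "needs_refinement" in data:
        data["needs_refinement"] = normalize_bool(data["needs_refinement"])
    return data


print(bool("false"))
print(parse_critique('{"needs_refinement": " False "}'))
```

Trying `json.loads` before a schema-specific parser means a valid JSON object is kept even when it does not look like an agent definition.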

Splits from PR #10

This is the core logic split from #10. Sandbox → feature/docker-sandbox. Tasks/skills → feature/agentic-tasks-skills.

Test plan

  • pytest tests/ -v
  • mobius evolve coding --threshold 0.9 runs without error
  • Judge panel completes faster with parallel execution

🤖 Generated with Claude Code

- Evolve command now targets agents below win-rate threshold instead of
  champions, with --iterations and --threshold flags
- Evaluator-optimizer loop: refine, self-critique, repeat
- Add critique_refinement method to AgentBuilder with bug fixes:
  - Try json.loads before _parse_agent_json (which discards non-agent JSON)
  - Handle string booleans properly (bool("false") returns True)
- Parallelize judge panel with asyncio.gather + exception handling
- Bump max_output_tokens from 4096 to 16384 across all providers
- Add agent_max_output_tokens config field
- Preserve Bash tools migration in db.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
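
The "refine, self-critique, repeat" loop described in this commit message follows the general evaluator-optimizer pattern, which can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: `refine_until_good`, the `refine` and `critique` callables, and the `accept_score` cutoff are hypothetical, and a numeric critique score is assumed for simplicity.

```python
def refine_until_good(prompt, refine, critique, iterations=3, accept_score=0.8):
    """Evaluator-optimizer loop: refine, self-critique, repeat.

    refine(prompt) returns a new candidate prompt; critique(prompt)
    returns a score in [0, 1]. The loop stops early once the best
    score reaches accept_score, and never keeps a worse candidate.
    """
    best_prompt, best_score = prompt, critique(prompt)
    for _ in range(iterations):
        if best_score >= accept_score:
            break
        candidate = refine(best_prompt)
        score = critique(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins: "refining" appends a character, and the critique
# rewards longer prompts up to a cap.
result = refine_until_good(
    "abcdefg",
    refine=lambda p: p + "!",
    critique=lambda p: min(1.0, len(p) / 10),
    iterations=5,
    accept_score=0.9,
)
print(result)
```

In a real run both callables would be model calls, so the iteration cap bounds the cost per agent.
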
Copilot AI review requested due to automatic review settings March 21, 2026 16:17

Copilot AI left a comment

Pull request overview

This PR updates Mobius’s self-improvement loop by evolving underperforming agents (instead of champions), adding an evaluator–optimizer refinement loop with self-critique, parallelizing judge execution to reduce evaluation latency, and increasing provider output token limits.

Changes:

  • Update mobius evolve to select and refine underperforming agents, with new --iterations and --threshold flags.
  • Parallelize judge panel execution using asyncio.gather() and tolerate per-judge failures.
  • Increase per-provider output token limits to 16384 and add an agent_max_output_tokens config field.
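
Running judges concurrently while tolerating per-judge failures can be sketched with `asyncio.gather(..., return_exceptions=True)`. The sketch below is an assumption-laden illustration: `run_judge` and `run_panel` are hypothetical stand-ins for the code in src/mobius/judge.py, and the sleep simulates a model call.

```python
import asyncio


async def run_judge(name, delay, fail=False):
    """Hypothetical judge call; the sleep stands in for a model request."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return {"judge": name, "score": 0.5}


async def run_panel(judges):
    """Run all judges concurrently and keep only the successful verdicts.

    With return_exceptions=True, a failing judge yields its exception
    as a result instead of aborting the whole gather. Results come
    back in the same order as the inputs.
    """
    results = await asyncio.gather(
        *(run_judge(*spec) for spec in judges),
        return_exceptions=True,
    )
    verdicts, models_used = [], []
    for (name, _delay, _fail), result in zip(judges, results):
        if isinstance(result, BaseException):
            continue  # skip the failed judge, keep the rest of the panel
        verdicts.append(result)
        models_used.append(name)
    return verdicts, models_used


panel = [("judge-a", 0.01, False), ("judge-b", 0.01, True), ("judge-c", 0.01, False)]
verdicts, models_used = asyncio.run(run_panel(panel))
print(models_used)
```

Because the judges wait concurrently, total latency is close to that of the slowest judge rather than the sum of all of them.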

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/mobius/cli.py | Refactors evolve to target underperformers and adds iterative refinement/self-critique flags. |
| src/mobius/agent_builder.py | Adds critique_refinement self-critique method and parsing/normalization logic. |
| src/mobius/judge.py | Runs all judges concurrently and handles exceptions per judge. |
| src/mobius/config.py | Adds agent_max_output_tokens configuration field. |
| src/mobius/providers/openai.py | Raises OpenAI max_tokens to 16384. |
| src/mobius/providers/openrouter.py | Raises OpenRouter max_tokens to 16384. |
| src/mobius/providers/anthropic.py | Raises Anthropic max_tokens to 16384. |
| src/mobius/providers/google.py | Raises Gemini tool-path max_output_tokens to 16384. |

Comment thread src/mobius/providers/anthropic.py Outdated
Comment thread src/mobius/judge.py Outdated
Comment thread src/mobius/providers/openai.py Outdated
Comment thread src/mobius/providers/google.py
Comment thread src/mobius/cli.py
Comment thread src/mobius/cli.py
Comment thread src/mobius/agent_builder.py
Comment thread src/mobius/config.py
Comment thread src/mobius/providers/openrouter.py Outdated
…ate CLI inputs

- Add max_tokens parameter to all provider run functions (default 16384)
- Only include successful judges in judge_models_used
- Validate iterations >= 1 and threshold in [0,1] for evolve command
- Strip whitespace before boolean normalization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
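
The input checks listed in this commit (iterations >= 1 and threshold in [0,1]) can be sketched as a plain validation function. This does not assume any particular CLI framework: `validate_evolve_args` is a hypothetical helper, and how the real evolve command reports the error is not shown in this PR page.

```python
def validate_evolve_args(iterations: int, threshold: float) -> None:
    """Reject out-of-range values before any agents are touched.

    Raises ValueError with a message that names the offending flag.
    """
    if iterations < 1:
        raise ValueError(f"--iterations must be >= 1, got {iterations}")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"--threshold must be in [0, 1], got {threshold}")


validate_evolve_args(iterations=3, threshold=0.9)  # valid, returns None

try:
    validate_evolve_args(iterations=0, threshold=0.9)
except ValueError as exc:
    print(exc)
```

Validating up front avoids starting an expensive refinement run that would loop zero times or compare against an impossible win rate.
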
@AaronGoldsmith AaronGoldsmith marked this pull request as ready for review March 21, 2026 17:10

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78b2a09be6

Comment thread src/mobius/cli.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AaronGoldsmith AaronGoldsmith merged commit c8dc3c2 into main Mar 21, 2026
2 checks passed
@AaronGoldsmith AaronGoldsmith deleted the fix/evolve-parallel-judging branch March 21, 2026 17:31