Fix evolve to target underperformers and parallelize judge panel by AaronGoldsmith · Pull Request #11 · AaronGoldsmith/mobius

AaronGoldsmith · 2026-03-21T16:17:41Z

Summary

Evolve command now targets underperformers (low win rate + enough matches), not champions
Added --iterations and --threshold flags for evaluator-optimizer loop with self-critique
Judge panel runs all judges in parallel via asyncio.gather (3x latency improvement)
Bumped max_output_tokens from 4096 to 16384 across all providers

Bug fixes

Fixed bool("false") → True in critique_refinement (now handles string booleans)
Fixed _parse_agent_json fallback discarding non-agent JSON schemas
Added agent_max_output_tokens config field

Splits from PR #10

This is the core logic split from #10. Sandbox → feature/docker-sandbox. Tasks/skills → feature/agentic-tasks-skills.

Test plan

pytest tests/ -v
mobius evolve coding --threshold 0.9 runs without error
Judge panel completes faster with parallel execution

🤖 Generated with Claude Code

- Evolve command now targets agents below win-rate threshold instead of champions, with --iterations and --threshold flags - Evaluator-optimizer loop: refine, self-critique, repeat - Add critique_refinement method to AgentBuilder with bug fixes: - Try json.loads before _parse_agent_json (which discards non-agent JSON) - Handle string booleans properly (bool("false") returns True) - Parallelize judge panel with asyncio.gather + exception handling - Bump max_output_tokens from 4096 to 16384 across all providers - Add agent_max_output_tokens config field - Preserve Bash tools migration in db.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates Mobius’s self-improvement loop by evolving underperforming agents (instead of champions), adding an evaluator–optimizer refinement loop with self-critique, parallelizing judge execution to reduce evaluation latency, and increasing provider output token limits.

Changes:

Update mobius evolve to select and refine underperforming agents, with new --iterations and --threshold flags.
Parallelize judge panel execution using asyncio.gather() and tolerate per-judge failures.
Increase per-provider output token limits to 16384 and add an agent_max_output_tokens config field.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
src/mobius/cli.py	Refactors `evolve` to target underperformers and adds iterative refinement/self-critique flags.
src/mobius/agent_builder.py	Adds `critique_refinement` self-critique method and parsing/normalization logic.
src/mobius/judge.py	Runs all judges concurrently and handles exceptions per judge.
src/mobius/config.py	Adds `agent_max_output_tokens` configuration field.
src/mobius/providers/openai.py	Raises OpenAI `max_tokens` to 16384.
src/mobius/providers/openrouter.py	Raises OpenRouter `max_tokens` to 16384.
src/mobius/providers/anthropic.py	Raises Anthropic `max_tokens` to 16384.
src/mobius/providers/google.py	Raises Gemini tool-path `max_output_tokens` to 16384.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ate CLI inputs - Add max_tokens parameter to all provider run functions (default 16384) - Only include successful judges in judge_models_used - Validate iterations >= 1 and threshold in [0,1] for evolve command - Strip whitespace before boolean normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78b2a09be6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 21, 2026 16:17

Copilot started reviewing on behalf of AaronGoldsmith March 21, 2026 16:18 View session

Copilot AI reviewed Mar 21, 2026

View reviewed changes

AaronGoldsmith marked this pull request as ready for review March 21, 2026 17:10

chatgpt-codex-connector bot reviewed Mar 21, 2026

View reviewed changes

Comment thread src/mobius/cli.py

Fix lineage parent IDs across refine iterations

279975d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AaronGoldsmith merged commit c8dc3c2 into main Mar 21, 2026
2 checks passed

AaronGoldsmith deleted the fix/evolve-parallel-judging branch March 21, 2026 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix evolve to target underperformers and parallelize judge panel#11

Fix evolve to target underperformers and parallelize judge panel#11
AaronGoldsmith merged 3 commits intomainfrom
fix/evolve-parallel-judging

AaronGoldsmith commented Mar 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AaronGoldsmith commented Mar 21, 2026

Summary

Bug fixes

Splits from PR #10

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants