feat(v0.2.2): Pareto selection, novelty filter, backend ensemble & cascade evaluation by weijt606 · Pull Request #7 · weijt606/polyharness

weijt606 · 2026-05-24T22:06:43Z

Summary

Brings four search-strategy techniques from recent open-source evolutionary-agent
projects into PolyHarness, plus observability to make them visible. Every feature is
opt-in, dependency-free, deterministic/reproducible, and keeps existing behavior
unchanged — in line with the project's principles (open-source, no security risk,
reasonable & reproducible design).

What's new

1. Pareto-frontier parent selection — `parent_selection: pareto`

Samples parents from the set of per-task winners instead of always branching from
the single overall-best candidate, keeping specialists alive as stepping stones to
avoid premature convergence. Reuses the per-task scores already in the search log —
no new data collected. (GEPA, arXiv:2507.19457)
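
A minimal sketch of the idea, not the PolyHarness implementation. It assumes the search log can be viewed as a mapping from candidate id to per-task scores, and that parents are sampled in proportion to how many tasks they win. The data layout, the weighting, and the function `pareto_select` are all hypothetical; only the name `pareto_win_counts` echoes the PR.

```python
import random
from collections import Counter


def pareto_win_counts(scores):
    """Count, per candidate, the tasks on which it ties for the best score.

    `scores` maps candidate id -> {task id -> score} (assumed layout).
    """
    tasks = sorted({task for per_task in scores.values() for task in per_task})
    wins = Counter()
    for task in tasks:
        best = max(per_task[task] for per_task in scores.values() if task in per_task)
        for cand, per_task in scores.items():
            if per_task.get(task) == best:
                wins[cand] += 1
    return wins


def pareto_select(scores, rng):
    """Sample one parent from the per-task winners, weighted by win count."""
    wins = pareto_win_counts(scores)
    if not wins:
        raise ValueError("no per-task scores to select from")
    frontier = sorted(wins)  # fixed order so a seeded rng is repeatable
    weights = [wins[cand] for cand in frontier]
    return rng.choices(frontier, weights=weights, k=1)[0]


# c1 has the best mean score, but c0 and c2 each win one task outright.
scores = {
    "c0": {"t1": 1.0, "t2": 0.2, "t3": 0.2},
    "c1": {"t1": 0.6, "t2": 0.6, "t3": 0.6},
    "c2": {"t1": 0.1, "t2": 0.9, "t3": 0.1},
}
wins = pareto_win_counts(scores)
parent = pareto_select(scores, random.Random(0))
```

Always branching from the overall best would pick `c1` every time. Here the specialists `c0` and `c2` stay eligible because each is the best candidate on at least one task.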

2. Code novelty rejection — `novelty_filter` / `novelty_threshold` / `novelty_max_retries`

Detects near-duplicate candidates via stdlib difflib and skips their evaluation to
save API/compute budget. Off by default. (ShinkaEvolve, arXiv:2509.19349)
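
The check can be sketched with `difflib.SequenceMatcher`. This is an illustration under assumptions, not the project's code: it treats the threshold as a similarity ratio at or above which a candidate is rejected, and treats the retry setting as the number of extra proposals allowed. The helper names and the default values are hypothetical.

```python
import difflib


def is_near_duplicate(candidate, existing, threshold=0.95):
    """Return True if `candidate` is at least `threshold` similar to any prior source."""
    for source in existing:
        ratio = difflib.SequenceMatcher(None, candidate, source).ratio()
        if ratio >= threshold:
            return True
    return False


def propose_novel(propose, existing, threshold=0.95, max_retries=2):
    """Call `propose()` until it yields a novel candidate or retries run out.

    Returns the candidate source, or None to signal "skip evaluation".
    """
    for _ in range(max_retries + 1):
        candidate = propose()
        if not is_near_duplicate(candidate, existing, threshold):
            return candidate
    return None


seen = ["def solve(x):\n    return x + 1\n"]
# Differs from the seen source only by trailing whitespace.
whitespace_tweak = "def solve(x):\n    return x + 1  \n"
rewrite = (
    "def solve(x):\n"
    "    total = 0\n"
    "    for i in range(x):\n"
    "        total += i\n"
    "    return total\n"
)

proposals = iter([whitespace_tweak, rewrite])
accepted = propose_novel(lambda: next(proposals), seen)
```

In this run the whitespace-only variant is rejected without being evaluated, and the second proposal is accepted.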

3. Adaptive backend ensemble — `proposer.ensemble` / `ph run --ensemble a,b,c`

A UCB1 bandit picks a backend per iteration and shifts picks toward backends that
produce improving candidates. Fully deterministic (no RNG); prints a per-backend
picks/improve-rate summary. Leverages the existing 8-backend support.
(ShinkaEvolve adaptive LLM-ensemble selection)
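
For illustration, a deterministic UCB1 selector can be sketched as below. The class name, the 0/1 reward for "improved over parent", and the tie-breaking by configured order are assumptions of this sketch rather than details confirmed by the PR.

```python
import math


class BackendBandit:
    """UCB1 over a fixed list of backends, with no random number generator."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.picks = {b: 0 for b in self.backends}
        self.improved = {b: 0 for b in self.backends}

    def select(self):
        # Try every backend once, in configured order, before scoring.
        for backend in self.backends:
            if self.picks[backend] == 0:
                return backend
        total = sum(self.picks.values())

        def ucb(backend):
            mean = self.improved[backend] / self.picks[backend]
            bonus = math.sqrt(2 * math.log(total) / self.picks[backend])
            return mean + bonus

        # max() returns the first maximal item, so ties resolve by list order.
        return max(self.backends, key=ucb)

    def update(self, backend, improved):
        self.picks[backend] += 1
        self.improved[backend] += int(improved)


def run(iterations):
    """Simulate a search where only backend "b" ever improves on its parent."""
    bandit = BackendBandit(["a", "b", "c"])
    order = []
    for _ in range(iterations):
        backend = bandit.select()
        bandit.update(backend, improved=(backend == "b"))
        order.append(backend)
    return bandit, order


bandit, order = run(30)
for name in bandit.backends:
    rate = bandit.improved[name] / bandit.picks[name]
    print(f"{name}: picks={bandit.picks[name]} improve-rate={rate:.2f}")
```

Because selection depends only on the recorded counts, two runs with the same reward history choose the same backends in the same order. The exploration bonus still gives the weaker backends an occasional pick.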

4. Cascade evaluation — `evaluator.cascade` / `cascade_threshold` / `cascade_stage1`

Scores a cheap first subset of tasks and only runs the rest if it clears the gate,
saving budget on weak candidates. Per-task mode only; the base harness is always
scored in full. Off by default. (AlphaEvolve/OpenEvolve cascade)
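
The gate can be sketched as follows. This is a hedged illustration: it assumes stage 1 is the first `stage1` tasks in the list and that the gate compares the mean stage-1 score with the threshold. The function name, its parameters, and the callable `score_task` are hypothetical.

```python
def evaluate_with_cascade(score_task, tasks, stage1=2, threshold=0.5):
    """Score the first `stage1` tasks; score the rest only if the gate is cleared.

    Returns (scores, passed). Stage-1 tasks are scored exactly once and reused.
    """
    stage1_tasks, rest = tasks[:stage1], tasks[stage1:]
    if not stage1_tasks:
        # Nothing to gate on: fall back to a plain full evaluation.
        return {task: score_task(task) for task in tasks}, True

    scores = {task: score_task(task) for task in stage1_tasks}
    stage1_mean = sum(scores.values()) / len(stage1_tasks)
    if stage1_mean < threshold:
        return scores, False  # weak candidate: remaining tasks are skipped

    for task in rest:
        scores[task] = score_task(task)
    return scores, True


tasks = ["t1", "t2", "t3", "t4"]
calls = []


def weak(task):
    calls.append(task)
    return 0.0


def strong(task):
    calls.append(task)
    return 1.0


weak_scores, weak_passed = evaluate_with_cascade(weak, tasks)
weak_calls = len(calls)
calls.clear()
strong_scores, strong_passed = evaluate_with_cascade(strong, tasks)
strong_calls = len(calls)
```

In this toy run the weak candidate costs two task evaluations instead of four, while the strong candidate is scored on every task with no stage-1 task run twice.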

Plus

- Reproducibility: `search.seed` makes tournament/pareto/novelty runs repeatable.
- Observability: `ph log` marks Pareto-frontier members (◆); `ph leaderboard`
  adds Pareto and Backend columns (Backend only when an ensemble was used).
- `SearchLog.pareto_win_counts()` is the shared source of truth for frontier membership.
- `proposer_backend` is recorded in each candidate's `metadata.json`.
- Removed 3 byte-identical duplicate files that tripped ruff N999.

Design notes

- Backward compatible: all flags default to prior behavior; an injected single
  proposer disables the bandit. No public API breakage.
- No new dependencies: novelty (`difflib`), bandit (`math`), and cascade all use
  only the standard library.
- No new attack surface: the bandit only selects among already-configured
  backends; the sandbox/proposer boundaries are unchanged.

Testing

- `ruff check src/ tests/`: clean
- `pytest tests/`: 206 passed (173 → 206; +33)
- End-to-end smoke test on the bundled math-word-problems template with the offline
  local backend, covering pareto + novelty + ensemble + cascade.

…v0.2.2)

Borrowed three techniques from recent evolutionary-agent projects, all opt-in, dependency-free, and reproducible:

- Pareto-frontier parent selection (`parent_selection: pareto`): samples per-task winners instead of overall-best, keeping specialists as stepping stones (GEPA, arXiv:2507.19457). Reuses per-task scores already logged.
- Code novelty rejection (`novelty_filter`/`threshold`/`max_retries`): skips near-duplicate candidates before evaluation via stdlib `difflib` to save budget (ShinkaEvolve, arXiv:2509.19349). Off by default.
- Adaptive backend ensemble (`proposer.ensemble`, `ph run --ensemble`): UCB1 bandit picks a backend per iteration, rewarding improvement-over-parent. Deterministic; prints a per-backend picks/improve-rate summary.

Also: `search.seed` for reproducible randomized runs; `proposer_backend` recorded in candidate metadata; removed 3 byte-identical duplicate files that tripped ruff N999. Docs (README/README_CN/CHANGELOG) and version synced. 194 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Borrows AlphaEvolve/OpenEvolve-style staged evaluation: score a cheap first subset of tasks and only run the rest if it clears `cascade_threshold`, saving evaluation budget on weak candidates. Complements the novelty filter and backend ensemble in cutting cost.

- `evaluator.cascade` / `cascade_threshold` / `cascade_stage1` config (off by default).
- `Orchestrator._evaluate_with_cascade`: stage-1 tasks are never re-run, so the result is deterministic; the base harness is always scored in full.
- Per-task mode only (non-empty `tasks` list); a no-op otherwise.
- 5 new tests; docs (README/README_CN/CHANGELOG) synced. 199 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rboard

Make the new search features visible:

- `ph log` marks Pareto-frontier members (best on >=1 task) with ◆ in both tree and flat views, with a legend.
- `ph leaderboard` gains a Pareto column and a Backend column (the latter shown only when an ensemble recorded a `proposer_backend`, so single-backend output is unchanged).
- Extracted `SearchLog.pareto_win_counts()` as the single source of truth for frontier membership, reused by `Orchestrator._pareto_select` (de-duplicated).
- `Workspace.candidate_metadata()` reads a candidate's `metadata.json` safely.

7 new tests; docs synced. 206 tests, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a Backstory subsection citing GEPA, ShinkaEvolve, OpenEvolve, and the Darwin Gödel Machine, framing PolyHarness as the member of this wave specialized for agent harnesses + online evolution (`ph wrap` → `ph evolve`), and noting which technique each borrows. README + README_CN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

weijt606 and others added 4 commits May 25, 2026 00:02

weijt606 merged commit 30a1cdf into main May 24, 2026
3 checks passed

weijt606 deleted the v0.2.2 branch May 24, 2026 23:03

weijt606 merged 4 commits into main from v0.2.2
