feat(v0.2.2): Pareto selection, novelty filter, backend ensemble & cascade evaluation#7

Merged
weijt606 merged 4 commits into main from v0.2.2 on May 24, 2026


@weijt606
Owner

Summary

Brings four search-strategy techniques from recent open-source evolutionary-agent
projects into PolyHarness, plus observability to make them visible. Every feature is
opt-in, dependency-free, and deterministic/reproducible, and leaves existing behavior
unchanged, in line with the project's principles (open-source, no security risk,
reasonable & reproducible design).

What's new

1. Pareto-frontier parent selection — parent_selection: pareto

Samples parents from the set of per-task winners instead of always branching from
the single overall-best candidate, keeping specialists alive as stepping stones to
avoid premature convergence. Reuses the per-task scores already in the search log —
no new data collected. (GEPA, arXiv:2507.19457)
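
A minimal sketch of the selection idea, assuming per-task scores are available as a plain dict. The helper names `count_task_wins` and `select_parent`, the candidate and task ids, and the win-count weighting are illustrative assumptions; they are not the actual `SearchLog.pareto_win_counts()` or `Orchestrator._pareto_select` signatures.

```python
import random
from collections import Counter


def count_task_wins(scores):
    """Count, per candidate, the tasks on which it ties for the best score.

    `scores` maps candidate id -> {task id -> score} (hypothetical layout).
    Candidates that win no task are absent from the result.
    """
    tasks = sorted({task for per_task in scores.values() for task in per_task})
    wins = Counter()
    for task in tasks:
        best = max(per_task.get(task, float("-inf")) for per_task in scores.values())
        for cand, per_task in scores.items():
            if per_task.get(task, float("-inf")) == best:
                wins[cand] += 1
    return wins


def select_parent(scores, rng):
    """Sample a parent from the frontier, weighted by number of task wins."""
    wins = count_task_wins(scores)
    frontier = sorted(wins)  # fixed order so a seeded RNG is repeatable
    weights = [wins[cand] for cand in frontier]
    return rng.choices(frontier, weights=weights, k=1)[0]


scores = {
    "c0": {"t1": 0.9, "t2": 0.2, "t3": 0.5},
    "c1": {"t1": 0.4, "t2": 0.8, "t3": 0.5},
    "c2": {"t1": 0.3, "t2": 0.1, "t3": 0.4},
}
wins = count_task_wins(scores)       # c0 and c1 each win two tasks; c2 wins none
parent = select_parent(scores, random.Random(0))
```

Here `c1` has a lower average than `c0` on `t1` but stays eligible as a parent because it is the best candidate on `t2`. Passing a seeded `random.Random` keeps the draw repeatable.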

2. Code novelty rejection — novelty_filter / novelty_threshold / novelty_max_retries

Detects near-duplicate candidates via stdlib difflib and skips their evaluation to
save API/compute budget. Off by default. (ShinkaEvolve, arXiv:2509.19349)
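
One way such a check could look with `difflib.SequenceMatcher`. The function names, the 0.95 threshold, and the retry behavior are assumptions for illustration; the PR does not state the defaults of `novelty_threshold` or `novelty_max_retries`.

```python
import difflib


def is_near_duplicate(candidate, existing, threshold=0.95):
    """Return True if `candidate` is at least `threshold` similar to any prior source."""
    for prior in existing:
        ratio = difflib.SequenceMatcher(None, prior, candidate).ratio()
        if ratio >= threshold:
            return True
    return False


def propose_novel(propose, existing, threshold=0.95, max_retries=2):
    """Call `propose()` until it yields a novel candidate or retries run out.

    Returns the first novel candidate, or None so the caller can skip evaluation.
    """
    for _ in range(max_retries + 1):
        candidate = propose()
        if not is_near_duplicate(candidate, existing, threshold):
            return candidate
    return None


seen = ["def solve(x):\n    return x + 1\n"]
dup = "def solve(x):\n    return x + 1 \n"           # differs by one trailing space
novel = "def solve(x):\n    return sum(range(x))\n"  # different body

drafts = iter([dup, novel])
accepted = propose_novel(lambda: next(drafts), seen)  # rejects dup, accepts novel
```

The duplicate is rejected without being scored, so only the second proposal would reach the evaluator.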

3. Adaptive backend ensemble — proposer.ensemble / ph run --ensemble a,b,c

A UCB1 bandit picks a backend per iteration and shifts picks toward backends that
produce improving candidates. Fully deterministic (no RNG); prints a per-backend
picks/improve-rate summary. Leverages the existing 8-backend support.
(ShinkaEvolve adaptive LLM-ensemble selection)
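
A sketch of a deterministic UCB1 selector using only `math`. The class name, the backend names, the binary "improved" reward, and the tie-break by configured order are assumptions; the PR only states that UCB1 is used, that improvement over the parent is rewarded, and that no RNG is involved.

```python
import math


class UCB1:
    """Deterministic UCB1 over a fixed list of backends (hypothetical sketch)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.picks = {arm: 0 for arm in self.arms}
        self.wins = {arm: 0 for arm in self.arms}

    def select(self):
        # Try every backend once, in configured order, before using the formula.
        for arm in self.arms:
            if self.picks[arm] == 0:
                return arm
        total = sum(self.picks.values())

        def score(arm):
            mean = self.wins[arm] / self.picks[arm]
            bonus = math.sqrt(2 * math.log(total) / self.picks[arm])
            return mean + bonus

        # max() returns the first maximal item, so ties follow configured order.
        return max(self.arms, key=score)

    def update(self, arm, improved):
        self.picks[arm] += 1
        self.wins[arm] += 1 if improved else 0

    def summary(self):
        """Per-backend (picks, improve rate)."""
        return {
            arm: (self.picks[arm], self.wins[arm] / self.picks[arm] if self.picks[arm] else 0.0)
            for arm in self.arms
        }


def simulate(backends, improving, iterations):
    """Run the bandit against a toy world where only `improving` backends help."""
    bandit = UCB1(backends)
    order = []
    for _ in range(iterations):
        arm = bandit.select()
        bandit.update(arm, improved=arm in improving)
        order.append(arm)
    return bandit, order


bandit, order = simulate(["backend_a", "backend_b", "backend_c"], {"backend_b"}, 30)
```

In this toy run picks drift toward `backend_b`, while the exploration bonus still revisits the other two occasionally. Because nothing is sampled, repeating the run produces the same pick order.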

4. Cascade evaluation — evaluator.cascade / cascade_threshold / cascade_stage1

Scores a cheap first subset of tasks and only runs the rest if it clears the gate,
saving budget on weak candidates. Per-task mode only; the base harness is always
scored in full. Off by default. (AlphaEvolve/OpenEvolve cascade)
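
A rough sketch of the gate, assuming the stage-1 result is the mean score over the first `stage1` tasks and that a candidate passes when that mean is at least the threshold. The function name, parameter names, and defaults are illustrative; this is not `Orchestrator._evaluate_with_cascade` itself.

```python
def evaluate_with_cascade(score_task, tasks, stage1=2, threshold=0.5):
    """Score the first `stage1` tasks, then the rest only if the gate is cleared.

    Returns (scores, completed). Stage-1 scores are reused rather than re-run.
    With an empty stage 1 or no remaining tasks, this is a plain full evaluation.
    """
    head, tail = tasks[:stage1], tasks[stage1:]
    scores = {task: score_task(task) for task in head}
    if head and tail:
        gate = sum(scores.values()) / len(head)
        if gate < threshold:
            return scores, False  # weak candidate: skip the remaining tasks
    for task in tail:
        scores[task] = score_task(task)
    return scores, True


tasks = ["t1", "t2", "t3", "t4"]
weak = {"t1": 0.2, "t2": 0.3, "t3": 0.9, "t4": 0.9}
strong = {"t1": 0.7, "t2": 0.6, "t3": 0.4, "t4": 0.8}

calls = []


def make_scorer(table):
    def score_task(task):
        calls.append(task)  # record which tasks were actually evaluated
        return table[task]
    return score_task


weak_scores, weak_done = evaluate_with_cascade(make_scorer(weak), tasks)
weak_calls = list(calls)
calls.clear()
strong_scores, strong_done = evaluate_with_cascade(make_scorer(strong), tasks)
strong_calls = list(calls)
```

The weak candidate costs two task evaluations instead of four; the strong one is scored on every task exactly once.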

Plus

  • Reproducibility: search.seed makes tournament/pareto/novelty repeatable.
  • Observability: ph log marks Pareto-frontier members (◆); ph leaderboard
    adds Pareto + Backend columns (Backend only when an ensemble was used).
    SearchLog.pareto_win_counts() is the shared source of truth.
  • proposer_backend recorded in each candidate's metadata.json.
  • Removed 3 byte-identical duplicate files that tripped ruff N999.

Design notes

  • Backward compatible: all flags default to prior behavior; an injected single
    proposer disables the bandit. No public API breakage.
  • No new dependencies: novelty (difflib), bandit (math), cascade — all stdlib.
  • No new attack surface: the bandit only selects among already-configured
    backends; the sandbox/proposer boundaries are unchanged.

Testing

  • ruff check src/ tests/ — clean
  • pytest tests/: 206 passed (173 → 206; +33)
  • End-to-end smoke on the bundled math-word-problems template with the offline
    local backend for pareto + novelty + ensemble + cascade.

weijt606 and others added 4 commits May 25, 2026 00:02
…v0.2.2)

Borrowed three techniques from recent evolutionary-agent projects, all
opt-in, dependency-free, and reproducible:

- Pareto-frontier parent selection (parent_selection: pareto) — samples
  per-task winners instead of overall-best, keeping specialists as stepping
  stones (GEPA, arXiv:2507.19457). Reuses per-task scores already logged.
- Code novelty rejection (novelty_filter/threshold/max_retries) — skips
  near-duplicate candidates before evaluation via stdlib difflib to save
  budget (ShinkaEvolve, arXiv:2509.19349). Off by default.
- Adaptive backend ensemble (proposer.ensemble, ph run --ensemble) — UCB1
  bandit picks a backend per iteration, rewarding improvement-over-parent.
  Deterministic; prints a per-backend picks/improve-rate summary.

Also: search.seed for reproducible randomized runs; proposer_backend recorded
in candidate metadata; removed 3 byte-identical duplicate files that tripped
ruff N999. Docs (README/README_CN/CHANGELOG) and version synced. 194 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Borrows AlphaEvolve/OpenEvolve-style staged evaluation: score a cheap first
subset of tasks and only run the rest if it clears `cascade_threshold`,
saving evaluation budget on weak candidates. Complements the novelty filter
and backend ensemble in cutting cost.

- evaluator.cascade / cascade_threshold / cascade_stage1 config (off by default)
- Orchestrator._evaluate_with_cascade: stage-1 tasks are never re-run, so the
  result is deterministic; the base harness is always scored in full.
- Per-task mode only (non-empty `tasks` list); a no-op otherwise.
- 5 new tests; docs (README/README_CN/CHANGELOG) synced. 199 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rboard

Make the new search features visible:
- ph log marks Pareto-frontier members (best on >=1 task) with ◆ in both
  tree and flat views, with a legend.
- ph leaderboard gains a Pareto column and a Backend column (the latter shown
  only when an ensemble recorded a proposer_backend, so single-backend output
  is unchanged).
- Extracted SearchLog.pareto_win_counts() as the single source of truth for
  frontier membership, reused by Orchestrator._pareto_select (de-duplicated).
- Workspace.candidate_metadata() reads a candidate's metadata.json safely.

7 new tests; docs synced. 206 tests, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a Backstory subsection citing GEPA, ShinkaEvolve, OpenEvolve, and the
Darwin Gödel Machine, framing PolyHarness as the member of this wave
specialized for agent harnesses + online evolution (ph wrap → ph evolve),
and noting which technique each borrows. README + README_CN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@weijt606 weijt606 merged commit 30a1cdf into main May 24, 2026
3 checks passed
@weijt606 weijt606 deleted the v0.2.2 branch May 24, 2026 23:03