diff --git a/CHANGELOG.md b/CHANGELOG.md index a2f792d..32723fb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,7 +4,49 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). -## [0.2.1] - 2026-04-09 +## [0.2.2] - 2026-05-24 + +### Added +- **Pareto-frontier parent selection** (`parent_selection: pareto`) — samples + parents from the set of per-task winners instead of always branching from the + single overall-best candidate, keeping specialists alive as stepping stones to + avoid premature convergence. Inspired by GEPA (arXiv:2507.19457). Reuses the + per-task scores already stored in the search log — no new data collected. +- **Code novelty rejection** (`novelty_filter`, `novelty_threshold`, + `novelty_max_retries`) — detects near-duplicate candidates via stdlib + `difflib` text similarity (no new dependencies) and skips their evaluation to + save API/compute budget. Inspired by ShinkaEvolve (arXiv:2509.19349). Off by + default. +- **Adaptive backend ensemble** (`proposer.ensemble`, `proposer.bandit_c`, + `ph run --ensemble b1,b2,...`) — when several backends are listed, a UCB1 + bandit picks one per iteration and shifts picks toward backends that produce + *improving* candidates. Fully deterministic (no RNG) and adds no new + dependencies. Run summary shows a per-backend picks/improve-rate table. + Inspired by ShinkaEvolve's adaptive LLM-ensemble selection. +- **Cascade evaluation** (`evaluator.cascade`, `cascade_threshold`, + `cascade_stage1`) — scores a cheap first subset of tasks and only runs the + rest if it clears the gate, saving budget on weak candidates (AlphaEvolve/ + OpenEvolve-style). Per-task mode only; the base harness is always scored in + full. Off by default. +- **Reproducible runs** (`search.seed`) — seeds the RNG so tournament/pareto/ + novelty regeneration are repeatable across runs. 
+- **Observability** — `ph log` marks Pareto-frontier members (◆); `ph leaderboard` + adds a Pareto column and a Backend column (shown only when an ensemble was + used). `SearchLog.pareto_win_counts()` powers both the CLI and the orchestrator. +- `proposer_backend` recorded in each candidate's `metadata.json` (ensemble mode) +- Hermes Agent adapter (`hermes`) — 8th proposer backend (`hermes chat -q`) +- `--strategy pareto` and `--ensemble` options for `ph run` +- `proposer/bandit.py` — UCB1 `BackendBandit` +- 31 new tests (206 total) + +### Changed +- Agent backends: 7 → 8 (added Hermes Agent) + +### Removed +- Stray byte-identical duplicate files (`collector 2.py`, `test_collector 2.py`, + `test_evolution 2.py`) that inflated the test count and tripped ruff N999 + +## [0.2.1] - 2026-04-09 ### Added - `ph shell-hook install/uninstall/status` — zero-config auto-wrap for agent commands via shell preexec hook diff --git a/README.md b/README.md index 6ff6dc6..1699c26 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) -[![Tests](https://img.shields.io/badge/tests-212%20passing-brightgreen.svg)]() +[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]() [![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md) --- @@ -53,6 +53,12 @@ PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness > - Memory tools (like Supermemory) give agents persistent **memory** across conversations. > - **PolyHarness gives agents persistent self-evolution** — you get a repeatable way to refine how they work over time. +### Part of a wave — specialized for harnesses + +PolyHarness doesn't stand alone. 
A wave of open-source projects has shown that pairing LLMs with evolutionary search systematically improves code and prompts: [GEPA](https://github.com/gepa-ai/gepa) (reflective prompt evolution over a Pareto frontier), [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (sample-efficient program evolution), [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) (an open AlphaEvolve), and the [Darwin Gödel Machine](https://sakana.ai/dgm/) (open-ended self-improving agents). + +Most of these evolve *general* programs or algorithms. PolyHarness is the member of this wave **specialized for agent harnesses** — the prompts, tool config, and orchestration *around* an existing agent — with a focus on **online evolution from real usage** (`ph wrap` → `ph evolve`). It borrows the strongest ideas from these projects and applies them to any CLI agent on your own tasks: Pareto-frontier parent selection (GEPA), code-novelty rejection and an adaptive backend ensemble (ShinkaEvolve), and cascade evaluation (AlphaEvolve/OpenEvolve). + ## What PolyHarness Is PolyHarness is the open-source engine for iteratively searching over an agent's harness. @@ -469,6 +475,16 @@ The Proposer reads **all of this** before generating the next candidate. It can When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `AGENTS.md` (Hermes), `OPENCODE.md` — each agent's native instruction format. +#### Backend ensemble (adaptive selection) + +Don't know which backend writes the best harness changes for your task? Let PolyHarness find out. 
Pass several and it picks one per iteration with a **UCB bandit**, shifting picks toward whichever backend actually produces *improving* candidates: + +```bash +ph run --ensemble "claude-code,codex,local" +``` + +At the end of the run you get a per-backend breakdown (picks + improve-rate). Selection is deterministic given the reward sequence, so runs stay reproducible. Inspired by ShinkaEvolve's adaptive LLM-ensemble selection. + ### Local Model Setup If you're running a local model (Ollama, vLLM, LM Studio, or any OpenAI-compatible server), use the `openai` backend: @@ -517,10 +533,16 @@ After `ph init`, the workspace has a `config.yaml` with these sections: search: max_iterations: 20 # Maximum search iterations early_stop_patience: 5 # Stop after N iterations with no improvement - parent_selection: best # Strategy: best | tournament | all + parent_selection: best # Strategy: best | tournament | all | pareto + novelty_filter: false # Reject near-duplicate candidates before eval (saves budget) + novelty_threshold: 0.97 # Similarity ratio above which a candidate is a near-duplicate + novelty_max_retries: 1 # Regenerate a near-duplicate this many times before skipping + seed: null # RNG seed — set an int to make randomized runs reproducible proposer: backend: api # api | openai | claude-code | claw-code | codex | hermes | opencode | local + ensemble: [] # If non-empty, pick among these backends per iteration via a UCB bandit + bandit_c: 1.41421356 # UCB exploration constant (higher = more exploration) model: claude-sonnet-4-20250514 # Model name (for api/openai backends) base_url: null # Custom API endpoint (for openai backend) api_key: null # API key override (null = use env var) @@ -532,6 +554,9 @@ evaluator: type: python # python | docker | custom entry: evaluate.py # Evaluator script entrypoint timeout: 300 # Per-task timeout in seconds + cascade: false # Stage cheap subset first; skip rest if it fails the gate (per-task mode) + cascade_threshold: 0.4 # Min 
stage-1 mean score required to run the full task set + cascade_stage1: 0 # Tasks in stage 1 (0 = auto, ~1/3 of the list) harness: language: python # Harness code language @@ -599,11 +624,11 @@ python -m polyharness --version | `ph init` | Initialize workspace with auto-copy of harness, tasks, eval script | | `ph run` | Start the optimization search loop | | `ph status` | Progress table with elapsed time, improvement rate, and delta | -| `ph log` | Search tree with delta (Δ) column (or `--flat` for table) | +| `ph log` | Search tree with delta (Δ) column and Pareto-frontier (◆) markers (or `--flat` for table) | | `ph best` | Show best candidate: score, per-task breakdown, changes summary | | `ph compare A B` | Compare two iterations: score deltas + unified code diff | | `ph diff ` | Shorthand for `compare 0 ` | -| `ph leaderboard` | Ranked table of all candidates (`--top N`, `--tasks` drilldown) | +| `ph leaderboard` | Ranked table of all candidates with Pareto (◆) and backend columns (`--top N`, `--tasks` drilldown) | | `ph trace ` | View stdout, stderr, metrics, exit code for an iteration | | `ph report` | Generate a full markdown report with score trends and per-task table | | `ph apply` | Copy best harness back to `base_harness/` (or `--target` dir) | @@ -647,7 +672,8 @@ python -m polyharness --version --dry-run Only evaluate the base harness, skip search --resume Continue an interrupted search from where it left off --backend Override proposer backend without editing config ---strategy Override parent selection: best | tournament | all +--strategy Override parent selection: best | tournament | all | pareto +--ensemble b1,b2,... 
Pick among multiple backends per iteration via a UCB bandit ``` ### `ph wrap` options diff --git a/README_CN.md b/README_CN.md index 01b7bee..f9a8581 100644 --- a/README_CN.md +++ b/README_CN.md @@ -15,7 +15,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) -[![Tests](https://img.shields.io/badge/tests-212%20passing-brightgreen.svg)]() +[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]() [![English](https://img.shields.io/badge/Docs-English-blue.svg)](README.md) --- @@ -53,6 +53,12 @@ PolyHarness 填补了这个空白。它把 Meta-Harness 搜索变成了一个任 > - 记忆工具(如 Supermemory)赋予 agent 跨会话的持久**记忆**。 > - **PolyHarness 赋予 agent 持久的自我进化能力**,你可以用可重复运行的方式持续调整它们的工作方式。 +### 这波浪潮中的一员——专精 harness + +PolyHarness 并非孤例。一批开源项目已经证明:把 LLM 与进化搜索结合,能系统性地改进代码与 prompt——[GEPA](https://github.com/gepa-ai/gepa)(在 Pareto 前沿上做反思式 prompt 进化)、[ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve)(样本高效的程序进化)、[OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve)(AlphaEvolve 的开源实现),以及 [Darwin Gödel Machine](https://sakana.ai/dgm/)(开放式自我改进 agent)。 + +它们大多进化的是*通用*程序或算法。PolyHarness 是这波浪潮里**专精 agent harness** 的那一员——优化的是包裹在现有 agent *外层*的 prompt、工具配置与编排,并聚焦于**从真实使用中在线进化**(`ph wrap` → `ph evolve`)。它把这些项目中最有效的思路借鉴过来,应用到你自己任务上的任意 CLI agent:Pareto 前沿父代选择(GEPA)、代码新颖性拒绝与自适应后端集成(ShinkaEvolve)、级联评估(AlphaEvolve/OpenEvolve)。 + ## PolyHarness 是什么 PolyHarness 是一个通过迭代评估与搜索来探索 agent harness 变体的开源引擎。 @@ -469,6 +475,16 @@ Proposer 在生成下一个候选之前会读取**所有这些信息**。它能 当你运行 `ph init --agent claude-code` 时,PolyHarness 会在 workspace 中自动生成 `CLAUDE.md` 指令文件,告诉 agent 如何作为优化 Proposer 工作。`CLAW.md`、`CODEX.md`、`AGENTS.md`(Hermes)、`OPENCODE.md` 也是同样的机制,每个 agent 都使用它自己的原生指令格式。 +#### 后端集成(自适应择优) + +不确定哪个后端最擅长你的任务?让 PolyHarness 替你试。一次传入多个后端,它会用 **UCB bandit** 每轮挑一个,并逐渐把选择倾向"真正产出改进候选"的后端: + +```bash +ph run --ensemble "claude-code,codex,local" +``` + +运行结束会给出每个后端的明细(选中次数 + 
改进率)。在给定奖励序列下选择是确定性的,因此运行可复现。该机制借鉴自 ShinkaEvolve 的自适应 LLM 集成选择。 + ### 本地模型配置 如果你在本地运行模型(Ollama、vLLM、LM Studio 或任何 OpenAI 兼容服务),使用 `openai` 后端: @@ -517,10 +533,16 @@ proposer: search: max_iterations: 20 # 最大搜索迭代次数 early_stop_patience: 5 # 连续 N 轮无改进后停止 - parent_selection: best # 父候选选择策略: best | tournament | all + parent_selection: best # 父候选选择策略: best | tournament | all | pareto + novelty_filter: false # 评估前拒绝近重复候选,节省预算 + novelty_threshold: 0.97 # 超过此相似度判定为近重复 + novelty_max_retries: 1 # 跳过前重新生成近重复候选的次数 + seed: null # 随机种子 — 设为整数可让带随机性的搜索可复现 proposer: backend: api # api | openai | claude-code | claw-code | codex | hermes | opencode | local + ensemble: [] # 非空时,每轮用 UCB bandit 在这些后端中择优 + bandit_c: 1.41421356 # UCB 探索常数(越大越偏探索) model: claude-sonnet-4-20250514 # 模型名称(api/openai 后端使用) base_url: null # 自定义 API 端点(openai 后端使用) api_key: null # API 密钥覆盖(null = 使用环境变量) @@ -532,6 +554,9 @@ evaluator: type: python # python | docker | custom entry: evaluate.py # 评估脚本入口 timeout: 300 # 每个任务的超时时间(秒) + cascade: false # 先评便宜的任务子集,未过门槛则跳过其余(逐任务模式) + cascade_threshold: 0.4 # 进入完整任务集所需的第一阶段最低均分 + cascade_stage1: 0 # 第一阶段任务数(0 = 自动,约占 1/3) harness: language: python # Harness 代码语言 @@ -599,11 +624,11 @@ python -m polyharness --version | `ph init` | 初始化 workspace,自动复制 harness、任务、评估脚本 | | `ph run` | 启动优化搜索循环 | | `ph status` | 进度表格,包含耗时、改进率和增量 | -| `ph log` | 搜索树带增量(Δ)列,或用 `--flat` 查看表格视图 | +| `ph log` | 搜索树带增量(Δ)列和 Pareto 前沿(◆)标记,或用 `--flat` 查看表格视图 | | `ph best` | 展示最佳候选:分数、逐任务明细、变更摘要 | | `ph compare A B` | 对比两个迭代:分数差异 + 统一代码 diff | | `ph diff ` | `compare 0 ` 的快捷方式 | -| `ph leaderboard` | 候选排名表(`--top N`、`--tasks` 展开每题分数) | +| `ph leaderboard` | 候选排名表,含 Pareto(◆)与后端列(`--top N`、`--tasks` 展开每题分数) | | `ph trace ` | 查看某次迭代的 stdout、stderr、metrics、退出码 | | `ph report` | 生成完整 markdown 报告,包含分数趋势和逐任务表格 | | `ph apply` | 将最优 harness 回写到 `base_harness/`,或通过 `--target` 指定目录 | @@ -647,7 +672,8 @@ python -m polyharness --version --dry-run 仅评估基线 harness,跳过搜索 --resume 从上次中断处继续搜索 --backend 覆盖 proposer 后端,无需修改配置 
---strategy 覆盖父候选选择策略: best | tournament | all +--strategy 覆盖父候选选择策略: best | tournament | all | pareto +--ensemble b1,b2,... 每轮用 UCB bandit 在多个后端中择优 ``` ### `ph wrap` 选项 diff --git a/package.json b/package.json index d652445..515e58e 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "polyharness", - "version": "0.2.1", + "version": "0.2.2", "description": "Make your AI agent evolve automatically through iterative harness optimization.", "keywords": ["agent", "harness", "optimization", "meta-harness", "cli"], "license": "MIT", diff --git a/pyproject.toml b/pyproject.toml index a104fa9..a93374c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "polyharness" -version = "0.2.1" +version = "0.2.2" description = "Automated harness optimization for AI agents — make your agent evolve." readme = "README.md" license = "MIT" diff --git a/src/polyharness/__init__.py b/src/polyharness/__init__.py index e93c856..e932175 100644 --- a/src/polyharness/__init__.py +++ b/src/polyharness/__init__.py @@ -1,3 +1,3 @@ """PolyHarness — Automated harness optimization for AI agents.""" -__version__ = "0.2.1" +__version__ = "0.2.2" diff --git a/src/polyharness/cli.py b/src/polyharness/cli.py index c364703..1991984 100644 --- a/src/polyharness/cli.py +++ b/src/polyharness/cli.py @@ -247,10 +247,16 @@ def init( ) @click.option( "--strategy", - type=click.Choice(["best", "tournament", "all"], case_sensitive=False), + type=click.Choice(["best", "tournament", "all", "pareto"], case_sensitive=False), default=None, help="Override parent selection strategy.", ) +@click.option( + "--ensemble", + default=None, + metavar="b1,b2,...", + help="Comma-separated backends to pick among per iteration via a UCB bandit.", +) def run( workspace: str, max_iterations: int | None, @@ -258,6 +264,7 @@ def run( resume: bool, backend: str | None, strategy: str | None, + ensemble: str | None, ): """Start the optimization search 
loop.""" from polyharness.orchestrator import Orchestrator @@ -275,6 +282,16 @@ def run( if backend is not None: config.proposer.backend = backend # type: ignore[assignment] + if ensemble is not None: + names = [b.strip() for b in ensemble.split(",") if b.strip()] + try: + # Validate against the config model (rejects unknown backend names). + config.proposer.ensemble = names # type: ignore[assignment] + config = config.model_validate(config.model_dump()) + except Exception as exc: + console.print(f"[red]Error:[/red] Invalid --ensemble value: {exc}") + raise SystemExit(1) from exc + if strategy is not None: config.search.parent_selection = strategy # type: ignore[assignment] @@ -597,21 +614,35 @@ def log(workspace: str, flat: bool): best_i = search_log.best_iteration parent_scores = {e.iteration: e.score for e in search_log.entries} + pareto_front = set(search_log.pareto_win_counts()) if flat: - _print_log_flat(search_log.entries, best_i, parent_scores) + _print_log_flat(search_log.entries, best_i, parent_scores, pareto_front) else: - _print_log_tree(search_log.entries, best_i, parent_scores) + _print_log_tree(search_log.entries, best_i, parent_scores, pareto_front) + legend = "[yellow]★[/yellow] best" + if pareto_front: + legend += " [magenta]◆[/magenta] Pareto frontier (best on ≥1 task)" console.print( f"\n{len(search_log)} iterations | " - f"best: iter_{best_i} ({search_log.best_score:.4f})" + f"best: iter_{best_i} ({search_log.best_score:.4f}) | {legend}" ) -def _log_entry_label(entry, best_i: int, parent_scores: dict[int, float] | None = None) -> str: +def _log_entry_label( + entry, + best_i: int, + parent_scores: dict[int, float] | None = None, + pareto_front: set[int] | None = None, +) -> str: """Format a single log entry as a rich-styled label.""" star = " [bold yellow]★[/bold yellow]" if entry.iteration == best_i else "" + pf = ( + " [magenta]◆[/magenta]" + if pareto_front and entry.iteration in pareto_front + else "" + ) score_color = "green" if entry.score >= 
entry.best_so_far else "white" delta = "" if parent_scores and entry.parent is not None and entry.parent in parent_scores: @@ -624,11 +655,16 @@ def _log_entry_label(entry, best_i: int, parent_scores: dict[int, float] | None delta = " [dim]+0.0000[/dim]" return ( f"[bold cyan]iter_{entry.iteration}[/bold cyan] " - f"[{score_color}]{entry.score:.4f}[/{score_color}]{delta}{star}" + f"[{score_color}]{entry.score:.4f}[/{score_color}]{delta}{pf}{star}" ) -def _print_log_tree(entries, best_i: int, parent_scores: dict[int, float] | None = None) -> None: +def _print_log_tree( + entries, + best_i: int, + parent_scores: dict[int, float] | None = None, + pareto_front: set[int] | None = None, +) -> None: """Print a rich Tree showing parent→child relationships.""" from rich.tree import Tree @@ -644,26 +680,31 @@ def _print_log_tree(entries, best_i: int, parent_scores: dict[int, float] | None roots = children.get(None, []) if not roots: # Fallback to flat if no root found - _print_log_flat(entries, best_i, parent_scores) + _print_log_flat(entries, best_i, parent_scores, pareto_front) return tree = Tree("[bold]Search Tree[/bold]") def _add_children(parent_tree, iteration: int) -> None: for child in children.get(iteration, []): - label = _log_entry_label(child, best_i, parent_scores) + label = _log_entry_label(child, best_i, parent_scores, pareto_front) branch = parent_tree.add(label) _add_children(branch, child.iteration) for root_entry in roots: - label = _log_entry_label(root_entry, best_i, parent_scores) + label = _log_entry_label(root_entry, best_i, parent_scores, pareto_front) root_branch = tree.add(label) _add_children(root_branch, root_entry.iteration) console.print(tree) -def _print_log_flat(entries, best_i: int, parent_scores: dict[int, float] | None = None) -> None: +def _print_log_flat( + entries, + best_i: int, + parent_scores: dict[int, float] | None = None, + pareto_front: set[int] | None = None, +) -> None: """Print a chronological table of all iterations.""" table 
= Table(title="Search Log") table.add_column("Iteration", style="cyan") @@ -671,6 +712,8 @@ def _print_log_flat(entries, best_i: int, parent_scores: dict[int, float] | None table.add_column("Score", style="green") table.add_column("Δ", style="bold") table.add_column("Best", style="bold green") + if pareto_front: + table.add_column("PF", style="magenta", justify="center") table.add_column("", style="yellow") for e in entries: @@ -685,14 +728,17 @@ def _print_log_flat(entries, best_i: int, parent_scores: dict[int, float] | None delta = f"[red]{d:.4f}[/red]" else: delta = "[dim]+0.0000[/dim]" - table.add_row( + row = [ f"iter_{e.iteration}", parent_str, f"{e.score:.4f}", delta, f"{e.best_so_far:.4f}", - star, - ) + ] + if pareto_front: + row.append("◆" if e.iteration in pareto_front else "") + row.append(star) + table.add_row(*row) console.print(table) @@ -991,6 +1037,11 @@ def leaderboard(workspace: str, top: int | None, tasks: bool): entries = entries[:top] base_score = next((e.score for e in log.entries if e.iteration == 0), 0.0) + pareto_front = set(log.pareto_win_counts()) + + # Backend per candidate (only meaningful when an ensemble was used). 
+ backends = {e.iteration: ws.candidate_metadata(e.iteration).get("proposer_backend") for e in entries} + show_backend = any(backends.values()) # Gather all task names all_task_names: list[str] = [] @@ -1006,6 +1057,10 @@ def leaderboard(workspace: str, top: int | None, tasks: bool): table.add_column("Score", style="green") table.add_column("vs Base", style="bold") table.add_column("Parent", style="dim") + if pareto_front: + table.add_column("PF", style="magenta", justify="center") + if show_backend: + table.add_column("Backend", style="blue") if tasks: for tn in all_task_names: table.add_column(tn, style="white", width=8) @@ -1029,6 +1084,10 @@ def leaderboard(workspace: str, top: int | None, tasks: bool): vs_base, parent_str, ] + if pareto_front: + row.append("◆" if entry.iteration in pareto_front else "") + if show_backend: + row.append(backends.get(entry.iteration) or "—") if tasks: for tn in all_task_names: val = entry.task_scores.get(tn) @@ -1037,7 +1096,10 @@ def leaderboard(workspace: str, top: int | None, tasks: bool): table.add_row(*row) console.print(table) - console.print(f"\n{len(log)} total iterations | Showing top {len(entries)}") + footer = f"\n{len(log)} total iterations | Showing top {len(entries)}" + if pareto_front: + footer += " | [magenta]◆[/magenta] Pareto frontier" + console.print(footer) # --- trace command --- diff --git a/src/polyharness/config.py b/src/polyharness/config.py index d2f47a0..92041ff 100644 --- a/src/polyharness/config.py +++ b/src/polyharness/config.py @@ -8,6 +8,13 @@ import yaml from pydantic import BaseModel, Field +# Single source of truth for proposer backend names. Used by both the fixed +# `backend` field and the optional `ensemble` list (which gets validation for +# free by reusing this Literal alias). 
+BackendName = Literal[ + "api", "openai", "claude-code", "claw-code", "codex", "hermes", "opencode", "local" +] + class SearchConfig(BaseModel): """Search loop parameters.""" @@ -16,17 +23,66 @@ class SearchConfig(BaseModel): early_stop_patience: int = Field( default=5, ge=1, description="Stop after N iterations without improvement." ) - parent_selection: Literal["best", "tournament", "all"] = Field( - default="best", description="Parent candidate selection strategy." + seed: int | None = Field( + default=None, + description=( + "Optional RNG seed. When set, randomized strategies (tournament, " + "pareto, novelty regeneration) become reproducible across runs." + ), + ) + parent_selection: Literal["best", "tournament", "all", "pareto"] = Field( + default="best", + description=( + "Parent candidate selection strategy. " + "'pareto' samples from the per-task winners (GEPA-style frontier) " + "to avoid premature convergence to a single overall-best candidate." + ), + ) + novelty_filter: bool = Field( + default=False, + description=( + "Reject near-duplicate candidates before evaluation to save budget " + "(ShinkaEvolve-style novelty rejection). Off by default." + ), + ) + novelty_threshold: float = Field( + default=0.97, + ge=0.0, + le=1.0, + description=( + "Text-similarity ratio (0–1) above which a candidate is treated as a " + "near-duplicate of an earlier one. Higher = fewer rejections." + ), + ) + novelty_max_retries: int = Field( + default=1, + ge=0, + description=( + "How many times to regenerate a near-duplicate candidate before " + "skipping its evaluation entirely." + ), + ) class ProposerConfig(BaseModel): """Proposer agent configuration.""" - backend: Literal["api", "openai", "claude-code", "claw-code", "codex", "hermes", "opencode", "local"] = Field( + backend: BackendName = Field( default="api", description="Proposer backend type." ) + ensemble: list[BackendName] = Field( + default_factory=list, + description=( + "Optional list of backends. 
When non-empty, the orchestrator picks a " + "backend per iteration via a UCB bandit that favors backends producing " + "improving candidates. Empty (default) = always use `backend`." + ), + ) + bandit_c: float = Field( + default=1.41421356, + ge=0.0, + description="UCB exploration constant for ensemble selection. Higher = more exploration.", + ) model: str = Field( default="claude-sonnet-4-20250514", description="Model for the Proposer agent." ) @@ -52,6 +108,29 @@ class EvaluatorConfig(BaseModel): entry: str = Field(default="evaluate.py", description="Evaluator script entrypoint.") timeout: int = Field(default=300, ge=1, description="Per-task timeout in seconds.") tasks: list[str] = Field(default_factory=list, description="Task file paths.") + cascade: bool = Field( + default=False, + description=( + "Staged evaluation: score a cheap first subset of tasks, and only run " + "the rest if that subset clears `cascade_threshold` (AlphaEvolve/" + "OpenEvolve-style cascade). Saves budget on weak candidates. Requires " + "per-task mode (a non-empty `tasks` list); ignored otherwise." + ), + ) + cascade_threshold: float = Field( + default=0.4, + ge=0.0, + le=1.0, + description="Minimum stage-1 mean score required to proceed to the full task set.", + ) + cascade_stage1: int = Field( + default=0, + ge=0, + description=( + "Number of tasks in the cheap first stage. 0 = auto (about one third " + "of the task list, leaving at least one task for stage 2)." 
+ ), + ) class HarnessConfig(BaseModel): diff --git a/src/polyharness/orchestrator.py b/src/polyharness/orchestrator.py index 83a833a..9aa0e6e 100644 --- a/src/polyharness/orchestrator.py +++ b/src/polyharness/orchestrator.py @@ -3,14 +3,16 @@ from __future__ import annotations import random +import shutil from dataclasses import dataclass from rich.console import Console from rich.table import Table from polyharness.config import PolyHarnessConfig -from polyharness.evaluator import BaseEvaluator, create_evaluator +from polyharness.evaluator import BaseEvaluator, EvalResult, create_evaluator from polyharness.proposer import BaseProposer, create_proposer +from polyharness.proposer.bandit import BackendBandit from polyharness.search_log import SearchLog from polyharness.workspace import Workspace @@ -38,21 +40,51 @@ def __init__( config: PolyHarnessConfig, proposer: BaseProposer | None = None, evaluator: BaseEvaluator | None = None, + proposers: dict[str, BaseProposer] | None = None, ): self.workspace = workspace self.config = config - self.proposer = proposer or create_proposer(config.proposer) self.evaluator = evaluator or create_evaluator(config.evaluator, cwd=workspace.root) self.search_log = SearchLog(workspace.search_log_path) + # Cache of backend-name → proposer. Pre-seeded ones (e.g. from tests) + # are used as-is; others are created lazily on first use. + self._proposer_cache: dict[str, BaseProposer] = dict(proposers or {}) + + # Ensemble (bandit) mode is opt-in and only active when the caller did + # not inject a single fixed proposer. An explicit `proposer=` always + # wins, keeping existing behavior and tests unchanged. 
+ ensemble = config.proposer.ensemble + if proposer is None and ensemble: + self.bandit: BackendBandit | None = BackendBandit( + list(ensemble), c=config.proposer.bandit_c + ) + self.proposer: BaseProposer | None = None # chosen per iteration + else: + self.bandit = None + self.proposer = ( + proposer + or self._proposer_cache.get(config.proposer.backend) + or create_proposer(config.proposer) + ) + def run(self, resume: bool = False) -> SearchResult: """Execute the full search loop.""" max_iter = self.config.search.max_iterations + # Reproducibility: seed RNG so tournament/pareto/novelty are repeatable. + if self.config.search.seed is not None: + random.seed(self.config.search.seed) + console.rule("[bold blue]PolyHarness Optimization Loop") console.print(f"Max iterations: {max_iter}") console.print(f"Early stop patience: {self.config.search.early_stop_patience}") - console.print(f"Proposer backend: {self.config.proposer.backend}") + if self.bandit is not None: + console.print( + f"Proposer ensemble: {', '.join(self.bandit.backends)} (UCB bandit)" + ) + else: + console.print(f"Proposer backend: {self.config.proposer.backend}") console.print() # Determine starting point (resume or fresh) @@ -139,37 +171,53 @@ def run(self, resume: bool = False) -> SearchResult: for i in range(start_iter, max_iter + 1): progress.update(task, description=f"iter_{i}") + backend: str | None = None try: # Step 1: Select parent parent = self._select_parent() - # Step 2: Prepare candidate directory (copy from parent) - cand_dir = self.workspace.prepare_candidate(i, parent) + # Step 1.5: Select which backend proposes this iteration + # (bandit) or fall back to the single fixed proposer. 
+ backend, proposer = self._select_proposer() - # Step 3: Proposer generates new candidate - metadata = self.proposer.propose( - workspace_root=self.workspace.root, - candidate_dir=cand_dir, - iteration=i, - parent=parent, + # Steps 2–3: Propose a candidate, optionally rejecting + # near-duplicates (novelty filter). + cand_dir, metadata, accepted = self._propose_with_novelty( + i, parent, proposer ) - # Step 3.5: Verify proposer produced a harness file - if not (cand_dir / "harness.py").exists(): - raise FileNotFoundError( - f"Proposer did not generate harness.py in iter_{i}" + # If the candidate is a near-duplicate even after retries, + # skip its (potentially expensive) evaluation entirely. + if not accepted: + console.print( + f"\n[yellow]iter_{i}: skipped — near-duplicate of an " + f"earlier candidate (saved evaluation budget)[/yellow]" ) + # Drop the dangling candidate dir so its copied-from-parent + # score.json doesn't pollute the leaderboard. + shutil.rmtree(cand_dir, ignore_errors=True) + self._reward_backend(backend, 0.0) # duplicate = no value + patience_counter += 1 + progress.update(task, advance=1) + if patience_counter >= self.config.search.early_stop_patience: + break + continue # Step 4: Evaluate score = self._evaluate_iteration(i) except Exception as exc: console.print(f"\n[red]iter_{i} failed: {exc}[/red]") + self._reward_backend(backend, 0.0) # failure = no value patience_counter += 1 progress.update(task, advance=1) if patience_counter >= self.config.search.early_stop_patience: break continue + # Record which backend produced this candidate (observability). + if backend is not None: + metadata = {**metadata, "proposer_backend": backend} + # Step 5: Store results log_entry = self.search_log.entries[-1] self.workspace.store_iteration( @@ -180,6 +228,11 @@ def run(self, resume: bool = False) -> SearchResult: metadata=metadata, ) + # Reward the backend when its candidate improved over its parent. 
+ self._reward_backend( + backend, 1.0 if score > self._parent_score(parent) else 0.0 + ) + # Step 6: Update best & check early stop if score > best_score: best_score = score @@ -222,10 +275,8 @@ def _evaluate_iteration(self, iteration: int, is_base: bool = False) -> float: else: cand_dir = self.workspace.candidate_path(iteration) - eval_result = self.evaluator.evaluate( - candidate_dir=cand_dir, - tasks=self.config.evaluator.tasks, - ) + # Base harness is always scored in full; candidates may use cascade. + eval_result = self._run_eval(cand_dir, allow_cascade=not is_base) parent = None if is_base else self.search_log.best_iteration self.search_log.append( @@ -246,6 +297,47 @@ def _evaluate_iteration(self, iteration: int, is_base: bool = False) -> float: return eval_result.overall_score + def _run_eval(self, cand_dir, *, allow_cascade: bool) -> EvalResult: + """Evaluate a candidate, applying cascade when enabled and applicable.""" + tasks = self.config.evaluator.tasks + if allow_cascade and self.config.evaluator.cascade and len(tasks) >= 2: + return self._evaluate_with_cascade(cand_dir, tasks) + return self.evaluator.evaluate(candidate_dir=cand_dir, tasks=tasks) + + def _evaluate_with_cascade(self, cand_dir, tasks: list[str]) -> EvalResult: + """Staged evaluation: cheap subset first, full set only if it clears the gate. + + Splits *tasks* into a stage-1 subset and the rest. A candidate whose + stage-1 mean falls below ``cascade_threshold`` is rejected early without + running stage 2, saving evaluation budget on weak candidates + (AlphaEvolve/OpenEvolve-style cascade). Stage-1 tasks are never + re-evaluated; their scores are merged with the stage-2 results. 
+ """ + k = self.config.evaluator.cascade_stage1 + if k <= 0: + k = max(1, (len(tasks) + 2) // 3) # ~1/3 of tasks, rounded up + k = min(k, len(tasks) - 1) # always leave at least one task for stage 2 + + stage1, stage2 = tasks[:k], tasks[k:] + r1 = self.evaluator.evaluate(candidate_dir=cand_dir, tasks=stage1) + + threshold = self.config.evaluator.cascade_threshold + if r1.overall_score < threshold: + console.print( + f"[dim] cascade: gated at stage 1 " + f"({r1.overall_score:.2f} < {threshold:.2f}) — " + f"skipped {len(stage2)} task(s)[/dim]" + ) + return r1 + + r2 = self.evaluator.evaluate(candidate_dir=cand_dir, tasks=stage2) + task_scores = {**r1.task_scores, **r2.task_scores} + traces = {**r1.traces, **r2.traces} + overall = ( + sum(task_scores.values()) / len(task_scores) if task_scores else 0.0 + ) + return EvalResult(overall_score=overall, task_scores=task_scores, traces=traces) + def _select_parent(self) -> int: """Select parent candidate based on strategy.""" strategy = self.config.search.parent_selection @@ -253,6 +345,8 @@ def _select_parent(self) -> int: return self.search_log.best_iteration elif strategy == "tournament": return self._tournament_select() + elif strategy == "pareto": + return self._pareto_select() else: # "all" — proposer decides, so we pass best as default parent return self.search_log.best_iteration @@ -271,6 +365,150 @@ def _tournament_select(self, k: int = 3) -> int: contestants = random.sample(entries, k) return max(contestants, key=lambda e: e.score).iteration + def _pareto_select(self) -> int: + """GEPA-style Pareto-frontier parent selection. + + Rather than always branching from the single best *overall* candidate, + build the set of candidates that achieve the top score on at least one + individual task ("per-task winners"), then sample one of them weighted + by how many tasks it wins. 
This keeps specialists that are strong on a + subset of tasks alive as stepping stones, avoiding premature + convergence (Pareto-based selection, GEPA — arXiv:2507.19457). + + Falls back to ``best`` when per-task scores are unavailable. + """ + win_counts = self.search_log.pareto_win_counts() + if not win_counts: + return self.search_log.best_iteration + + iterations = list(win_counts.keys()) + weights = [win_counts[i] for i in iterations] + return random.choices(iterations, weights=weights, k=1)[0] + + def _select_proposer(self) -> tuple[str | None, BaseProposer]: + """Pick the proposer for this iteration. + + Returns ``(backend_name, proposer)``. In single-backend mode the name + is ``None`` and the fixed proposer is returned. In ensemble mode the + UCB bandit chooses a backend and its (lazily created) proposer. + """ + if self.bandit is None: + assert self.proposer is not None + return None, self.proposer + backend = self.bandit.select() + return backend, self._get_proposer(backend) + + def _get_proposer(self, backend: str) -> BaseProposer: + """Return (creating + caching on first use) the proposer for *backend*.""" + if backend not in self._proposer_cache: + sub_config = self.config.proposer.model_copy(update={"backend": backend}) + self._proposer_cache[backend] = create_proposer(sub_config) + return self._proposer_cache[backend] + + def _reward_backend(self, backend: str | None, reward: float) -> None: + """Feed a reward to the bandit (no-op in single-backend mode).""" + if self.bandit is not None and backend is not None: + self.bandit.update(backend, reward) + + def _parent_score(self, parent: int | None) -> float: + """Score of the parent iteration (0.0 if unknown).""" + if parent is None: + return 0.0 + for entry in self.search_log.entries: + if entry.iteration == parent: + return entry.score + return 0.0 + + def _propose_with_novelty(self, iteration: int, parent: int, proposer: BaseProposer): + """Propose a candidate, optionally rejecting near-duplicates. 
+ + Returns ``(candidate_dir, metadata, accepted)``. When the novelty + filter is enabled and the proposer keeps producing a candidate that is + too similar to an earlier one, regenerate up to ``novelty_max_retries`` + times; if still a near-duplicate, return ``accepted=False`` so the + caller can skip evaluation and save budget (ShinkaEvolve-style code + novelty rejection — arXiv:2509.19349). + """ + cand_dir, metadata = self._propose_candidate(iteration, parent, proposer) + + if not self.config.search.novelty_filter: + return cand_dir, metadata, True + + threshold = self.config.search.novelty_threshold + max_retries = self.config.search.novelty_max_retries + + for attempt in range(max_retries + 1): + similarity = self._max_similarity(iteration, cand_dir) + if similarity < threshold: + return cand_dir, metadata, True + if attempt < max_retries: + console.print( + f"[dim]iter_{iteration}: candidate {similarity:.2f} similar to an " + f"existing one — regenerating ({attempt + 1}/{max_retries})[/dim]" + ) + cand_dir, metadata = self._propose_candidate(iteration, parent, proposer) + + return cand_dir, metadata, False + + def _propose_candidate(self, iteration: int, parent: int, proposer: BaseProposer): + """Prepare a candidate dir, run the proposer, and verify output. + + Returns ``(candidate_dir, metadata)``. Raises ``FileNotFoundError`` + when the proposer fails to produce ``harness.py``. + """ + cand_dir = self.workspace.prepare_candidate(iteration, parent) + metadata = proposer.propose( + workspace_root=self.workspace.root, + candidate_dir=cand_dir, + iteration=iteration, + parent=parent, + ) + if not (cand_dir / "harness.py").exists(): + raise FileNotFoundError( + f"Proposer did not generate harness.py in iter_{iteration}" + ) + return cand_dir, metadata + + def _max_similarity(self, iteration: int, cand_dir) -> float: + """Max text similarity of *cand_dir* against all earlier candidates. 
+ + Uses :class:`difflib.SequenceMatcher` (stdlib, no extra deps) on the + concatenated editable harness files. Returns a ratio in ``[0, 1]``. + """ + from difflib import SequenceMatcher + + new_text = self._candidate_text(cand_dir) + if not new_text: + return 0.0 + + best = 0.0 + for entry in self.search_log.entries: + if entry.iteration == iteration: + continue + other_dir = self.workspace.candidate_path(entry.iteration) + if not other_dir.exists(): + continue + other_text = self._candidate_text(other_dir) + if not other_text: + continue + ratio = SequenceMatcher(None, new_text, other_text).ratio() + if ratio > best: + best = ratio + return best + + def _candidate_text(self, cand_dir) -> str: + """Concatenate a candidate's editable harness files into one blob.""" + parts: list[str] = [] + for fname in self.config.harness.editable_files: + f = cand_dir / fname + if f.is_file(): + parts.append(f.read_text()) + if not parts: + entry = cand_dir / self.config.harness.entry + if entry.is_file(): + parts.append(entry.read_text()) + return "\n".join(parts) + def _print_iteration(self, iteration: int, score: float, best_so_far: float, parent: int | None) -> None: parent_str = f"iter_{parent}" if parent is not None else "base" delta = score - best_so_far if iteration > 0 else 0 @@ -288,6 +526,19 @@ def _print_summary(self, result: SearchResult) -> None: table.add_row("Best score", f"{result.best_score:.4f}") table.add_row("Total iterations", str(result.total_iterations)) console.print(table) + + # Ensemble bandit breakdown: which backend earned its picks. 
+ if self.bandit is not None and self.bandit.total_pulls > 0: + bandit_table = Table(title="Proposer ensemble (UCB bandit)") + bandit_table.add_column("Backend") + bandit_table.add_column("Picks", justify="right") + bandit_table.add_column("Improve rate", justify="right") + for backend, s in self.bandit.stats().items(): + bandit_table.add_row( + backend, str(s["pulls"]), f"{s['mean_reward']:.2f}" + ) + console.print(bandit_table) + console.print( "\nRun [bold]ph best[/bold] to see details, or [bold]ph apply[/bold] to apply the result." ) diff --git a/src/polyharness/proposer/bandit.py b/src/polyharness/proposer/bandit.py new file mode 100644 index 0000000..e64019c --- /dev/null +++ b/src/polyharness/proposer/bandit.py @@ -0,0 +1,84 @@ +"""UCB1 bandit for adaptive multi-backend proposer selection. + +When several proposer backends are available, we don't know up front which one +writes the best harness changes for a given task. Instead of committing to one, +the orchestrator can treat backend choice as a multi-armed bandit: each +iteration it picks the backend with the highest UCB score, observes whether the +produced candidate improved, and updates its estimate. + +Design notes (aligned with project principles): +- **Deterministic.** UCB1 is fully deterministic given the reward sequence; + ties break by configured backend order. No RNG, so runs are reproducible. +- **No new dependencies.** Pure stdlib (``math``). +- **No new attack surface.** It only chooses among already-configured backends; + it never constructs commands or executes anything itself. + +Inspired by ShinkaEvolve's adaptive LLM-ensemble selection (arXiv:2509.19349). 
+""" + +from __future__ import annotations + +import math +from dataclasses import dataclass + + +@dataclass +class _Arm: + count: int = 0 + total_reward: float = 0.0 + + @property + def mean(self) -> float: + return self.total_reward / self.count if self.count else 0.0 + + +class BackendBandit: + """UCB1 multi-armed bandit over a fixed set of backend names.""" + + def __init__(self, backends: list[str], c: float = 1.41421356): + if not backends: + raise ValueError("BackendBandit requires at least one backend.") + # Preserve order (used for deterministic tie-breaking) and dedupe. + self.backends: list[str] = list(dict.fromkeys(backends)) + self.c = c + self._arms: dict[str, _Arm] = {b: _Arm() for b in self.backends} + + @property + def total_pulls(self) -> int: + return sum(arm.count for arm in self._arms.values()) + + def select(self) -> str: + """Return the backend to use next. + + Every backend is tried once before UCB scoring kicks in. Ties resolve + to the earliest backend in the configured order, keeping selection + deterministic and reproducible. + """ + # Cold start: try each unpulled backend in order first. + for b in self.backends: + if self._arms[b].count == 0: + return b + + total = self.total_pulls + + def ucb(b: str) -> float: + arm = self._arms[b] + return arm.mean + self.c * math.sqrt(2 * math.log(total) / arm.count) + + # max() returns the first item on ties → deterministic by order. 
+ return max(self.backends, key=ucb) + + def update(self, backend: str, reward: float) -> None: + """Record a reward in ``[0, 1]`` for *backend*.""" + if backend not in self._arms: + raise KeyError(f"Unknown backend for bandit update: {backend}") + arm = self._arms[backend] + arm.count += 1 + arm.total_reward += reward + + def stats(self) -> dict[str, dict[str, float | int]]: + """Return per-backend pull counts and mean rewards (for reporting).""" + return { + b: {"pulls": arm.count, "mean_reward": round(arm.mean, 4)} + for b, arm in self._arms.items() + } diff --git a/src/polyharness/search_log.py b/src/polyharness/search_log.py index a2bfbc3..a648276 100644 --- a/src/polyharness/search_log.py +++ b/src/polyharness/search_log.py @@ -82,5 +82,30 @@ def best_iteration(self) -> int: return 0 return max(self._entries, key=lambda e: e.score).iteration + def pareto_win_counts(self) -> dict[int, int]: + """Map each Pareto-frontier iteration to the number of tasks it wins. + + A candidate is on the frontier if it achieves the top score on at + least one individual task (GEPA-style per-task winners). The values + are how many tasks each frontier member wins. Returns an empty dict + when no per-task scores are recorded. 
+ """ + entries = [e for e in self._entries if e.task_scores] + if not entries: + return {} + + task_names: set[str] = set() + for e in entries: + task_names.update(e.task_scores.keys()) + + eps = 1e-9 + counts: dict[int, int] = {} + for task in task_names: + best = max(e.task_scores.get(task, float("-inf")) for e in entries) + for e in entries: + if e.task_scores.get(task, float("-inf")) >= best - eps: + counts[e.iteration] = counts.get(e.iteration, 0) + 1 + return counts + def __len__(self) -> int: return len(self._entries) diff --git a/src/polyharness/workspace.py b/src/polyharness/workspace.py index 9cfcfb6..9b6dd76 100644 --- a/src/polyharness/workspace.py +++ b/src/polyharness/workspace.py @@ -194,6 +194,16 @@ def search_log_path(self) -> Path: def candidate_path(self, iteration: int) -> Path: return self.candidates_dir / f"iter_{iteration}" + def candidate_metadata(self, iteration: int) -> dict: + """Read a candidate's metadata.json (empty dict if absent/unreadable).""" + meta_file = self.candidate_path(iteration) / "metadata.json" + if not meta_file.exists(): + return {} + try: + return json.loads(meta_file.read_text()) + except (json.JSONDecodeError, ValueError): + return {} + def is_initialized(self) -> bool: """Check if workspace has required structure.""" return ( diff --git a/tests/test_bandit.py b/tests/test_bandit.py new file mode 100644 index 0000000..7903896 --- /dev/null +++ b/tests/test_bandit.py @@ -0,0 +1,66 @@ +"""Tests for the UCB backend-selection bandit.""" + +import pytest + +from polyharness.proposer.bandit import BackendBandit + + +def test_cold_start_tries_each_backend_in_order(): + b = BackendBandit(["api", "local", "codex"]) + # With nothing pulled yet, selection walks the backends in order. 
+ assert b.select() == "api" + b.update("api", 1.0) + assert b.select() == "local" + b.update("local", 1.0) + assert b.select() == "codex" + + +def test_converges_to_better_backend(): + b = BackendBandit(["good", "bad"], c=0.5) + # Cold start: one pull each. + b.update(b.select(), 1.0) # good + b.update(b.select(), 0.0) # bad + # Now reward "good" highly and "bad" poorly over many rounds. + picks = {"good": 0, "bad": 0} + for _ in range(50): + choice = b.select() + picks[choice] += 1 + b.update(choice, 1.0 if choice == "good" else 0.0) + assert picks["good"] > picks["bad"] + + +def test_deterministic_tie_breaks_by_order(): + # Two identical arms → ties always resolve to the first backend. + b1 = BackendBandit(["x", "y"]) + b2 = BackendBandit(["x", "y"]) + for _ in range(10): + c1, c2 = b1.select(), b2.select() + assert c1 == c2 + b1.update(c1, 0.5) + b2.update(c2, 0.5) + + +def test_dedupe_preserves_order(): + b = BackendBandit(["api", "api", "local"]) + assert b.backends == ["api", "local"] + + +def test_empty_backends_raises(): + with pytest.raises(ValueError): + BackendBandit([]) + + +def test_update_unknown_backend_raises(): + b = BackendBandit(["api"]) + with pytest.raises(KeyError): + b.update("nope", 1.0) + + +def test_stats_shape(): + b = BackendBandit(["api", "local"]) + b.update("api", 1.0) + b.update("api", 0.0) + stats = b.stats() + assert stats["api"] == {"pulls": 2, "mean_reward": 0.5} + assert stats["local"] == {"pulls": 0, "mean_reward": 0.0} + assert b.total_pulls == 2 diff --git a/tests/test_cli_features.py b/tests/test_cli_features.py index a0497c6..d60e603 100644 --- a/tests/test_cli_features.py +++ b/tests/test_cli_features.py @@ -201,6 +201,63 @@ def test_log_shows_delta(runner, workspace): assert "Δ" in result.output or "delta" in result.output.lower() or "+0.2" in result.output +def test_log_marks_pareto_frontier(runner, workspace): + """ph log marks per-task winners with the Pareto-frontier glyph.""" + from polyharness.search_log import 
SearchLog + + log = SearchLog(workspace.search_log_path) + log.append(0, None, 0.5, {"A": 0.5, "B": 0.5}) + log.append(1, 0, 0.5, {"A": 0.9, "B": 0.1}) # wins A + log.append(2, 0, 0.5, {"A": 0.1, "B": 0.9}) # wins B + + result = runner.invoke(main, ["log", "--workspace", str(workspace.root)]) + assert result.exit_code == 0 + assert "◆" in result.output + assert "Pareto frontier" in result.output + + +def test_log_no_pareto_marker_without_task_scores(runner, workspace): + """No frontier glyph when candidates have no per-task scores.""" + from polyharness.search_log import SearchLog + + log = SearchLog(workspace.search_log_path) + log.append(0, None, 0.3, {}) + log.append(1, 0, 0.5, {}) + + result = runner.invoke(main, ["log", "--workspace", str(workspace.root)]) + assert result.exit_code == 0 + assert "◆" not in result.output + + +def test_leaderboard_shows_backend_when_recorded(runner, workspace): + """ph leaderboard surfaces proposer_backend when an ensemble was used.""" + from polyharness.search_log import SearchLog + + log = SearchLog(workspace.search_log_path) + log.append(0, None, 0.3, {"A": 0.3}) + log.append(1, 0, 0.6, {"A": 0.6}) + workspace.store_iteration(0, 0.3, {"A": 0.3}, parent=None, metadata={"source": "base"}) + workspace.store_iteration(1, 0.6, {"A": 0.6}, parent=0, metadata={"proposer_backend": "codex"}) + + result = runner.invoke(main, ["leaderboard", "--workspace", str(workspace.root)]) + assert result.exit_code == 0 + assert "Backend" in result.output + assert "codex" in result.output + + +def test_leaderboard_hides_backend_without_ensemble(runner, workspace): + """No Backend column when no candidate recorded a proposer_backend.""" + from polyharness.search_log import SearchLog + + log = SearchLog(workspace.search_log_path) + log.append(0, None, 0.3, {"A": 0.3}) + log.append(1, 0, 0.6, {"A": 0.6}) + + result = runner.invoke(main, ["leaderboard", "--workspace", str(workspace.root)]) + assert result.exit_code == 0 + assert "Backend" not in 
result.output + + # --- ph run --resume --- diff --git a/tests/test_config.py b/tests/test_config.py index 750f1f2..b10d48a 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -3,6 +3,9 @@ import tempfile from pathlib import Path +import pytest +from pydantic import ValidationError + from polyharness.config import PolyHarnessConfig @@ -11,10 +14,45 @@ def test_default_config(): assert cfg.search.max_iterations == 20 assert cfg.search.early_stop_patience == 5 assert cfg.proposer.backend == "api" + assert cfg.proposer.ensemble == [] # single-backend by default + assert cfg.search.seed is None assert cfg.evaluator.type == "python" assert cfg.harness.language == "python" +def test_ensemble_accepts_valid_backends(): + cfg = PolyHarnessConfig.model_validate( + {"proposer": {"ensemble": ["local", "api", "codex"]}} + ) + assert cfg.proposer.ensemble == ["local", "api", "codex"] + + +def test_ensemble_rejects_unknown_backend(): + with pytest.raises(ValidationError): + PolyHarnessConfig.model_validate({"proposer": {"ensemble": ["bogus"]}}) + + +def test_parent_selection_accepts_pareto(): + cfg = PolyHarnessConfig.model_validate({"search": {"parent_selection": "pareto"}}) + assert cfg.search.parent_selection == "pareto" + + +def test_cascade_defaults_and_roundtrip(): + cfg = PolyHarnessConfig() + assert cfg.evaluator.cascade is False + assert cfg.evaluator.cascade_threshold == 0.4 + assert cfg.evaluator.cascade_stage1 == 0 + + cfg.evaluator.cascade = True + cfg.evaluator.cascade_stage1 = 3 + with tempfile.TemporaryDirectory() as tmp: + path = Path(tmp) / "config.yaml" + cfg.to_yaml(path) + loaded = PolyHarnessConfig.from_yaml(path) + assert loaded.evaluator.cascade is True + assert loaded.evaluator.cascade_stage1 == 3 + + def test_config_roundtrip_yaml(): cfg = PolyHarnessConfig() cfg.proposer.backend = "claude-code" # type: ignore[assignment] diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index 3c08e52..daf7a64 100644 --- 
a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -198,6 +198,332 @@ def test_orchestrator_resume_already_complete(tmp_path): assert result.best_score > 0 +def test_orchestrator_pareto_selection(tmp_path): + """Pareto selection should run end-to-end and find improvements.""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 5 + config.search.early_stop_patience = 10 + config.search.parent_selection = "pareto" + + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), + evaluator=MockEvaluator(), + ) + result = orch.run() + + assert isinstance(result, SearchResult) + assert result.total_iterations >= 5 + assert result.best_score > 0.3 + + +def test_pareto_select_picks_per_task_winner(tmp_path): + """A per-task specialist should be selectable even when not best overall. + + iter_0 is mediocre on every task; iter_1 and iter_2 each win exactly one + task. 'best' selection would never branch from a specialist, but the + Pareto frontier keeps them alive. + """ + import random + + ws = _setup_workspace(tmp_path) + config = ws.load_config() + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), + evaluator=MockEvaluator(), + ) + + # Same overall score (0.5) but different per-task profiles. + orch.search_log.append(0, None, 0.5, {"A": 0.5, "B": 0.5}) + orch.search_log.append(1, 0, 0.5, {"A": 0.9, "B": 0.1}) # wins task A + orch.search_log.append(2, 0, 0.5, {"A": 0.1, "B": 0.9}) # wins task B + + random.seed(0) + picks = {orch._pareto_select() for _ in range(100)} + + # iter_0 wins no task → must never be chosen; both specialists reachable. 
+ assert 0 not in picks + assert picks == {1, 2} + + +def test_pareto_select_falls_back_without_task_scores(tmp_path): + """Without per-task scores, Pareto selection degrades to best-overall.""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), + evaluator=MockEvaluator(), + ) + orch.search_log.append(0, None, 0.4, {}) + orch.search_log.append(1, 0, 0.7, {}) + + assert orch._pareto_select() == 1 # == best_iteration + + +def test_novelty_filter_skips_duplicate(tmp_path): + """A proposer that always emits identical code should get its candidates + rejected (and their evaluation skipped) when the novelty filter is on.""" + + class ConstantProposer(BaseProposer): + def propose(self, workspace_root, candidate_dir, iteration, parent): + (candidate_dir / "harness.py").write_text("SCORE_HINT = 0.3\n") + return {"changes_summary": "no change"} + + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 5 + config.search.early_stop_patience = 3 + config.search.novelty_filter = True + config.search.novelty_threshold = 0.97 + config.search.novelty_max_retries = 1 + + orch = Orchestrator( + workspace=ws, + config=config, + proposer=ConstantProposer(), + evaluator=MockEvaluator(), + ) + orch.run() + + # Only the base (iter_0) is evaluated; every later duplicate is skipped, + # so it never gets appended to the search log. + logged = [e.iteration for e in orch.search_log.entries] + assert logged == [0] + # And the skipped candidate dir is cleaned up (no dangling copy). 
+ assert not ws.candidate_path(1).exists() + + +def test_novelty_filter_allows_novel(tmp_path): + """Distinct candidates should pass the novelty gate and be evaluated.""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 3 + config.search.early_stop_patience = 10 + config.search.novelty_filter = True + config.search.novelty_threshold = 0.97 + + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), # writes a distinct harness each iteration + evaluator=MockEvaluator(), + ) + result = orch.run() + + assert result.best_score > 0.3 + for i in (1, 2, 3): + assert (ws.candidate_path(i) / "score.json").exists() + + +def test_max_similarity_detects_identical(tmp_path): + """_max_similarity returns ~1.0 for identical code, low for distinct code.""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), + evaluator=MockEvaluator(), + ) + + # iter_0 = copy of base ("SCORE_HINT = 0.3") + ws.prepare_candidate(0, parent=None) + orch.search_log.append(0, None, 0.3, {"mock_task": 0.3}) + + # An identical candidate scores ~1.0 similarity against iter_0. + dup = ws.prepare_candidate(1, parent=0) + (dup / "harness.py").write_text("SCORE_HINT = 0.3\n") + assert orch._max_similarity(1, dup) > 0.97 + + # A clearly different candidate scores low. 
+ novel = ws.prepare_candidate(2, parent=0) + (novel / "harness.py").write_text( + "import math\n\ndef solve(x):\n return math.sqrt(x) * 42 + len(str(x))\n" + ) + assert orch._max_similarity(2, novel) < 0.97 + + +def test_orchestrator_ensemble_bandit(tmp_path): + """The bandit should favor the backend that produces improvements.""" + + class HighProposer(BaseProposer): + """Improves every iteration.""" + + def propose(self, workspace_root, candidate_dir, iteration, parent): + score = min(0.4 + iteration * 0.1, 1.0) + (candidate_dir / "harness.py").write_text(f"SCORE_HINT = {score}\n") + return {"changes_summary": "high"} + + class LowProposer(BaseProposer): + """Never improves over the base.""" + + def propose(self, workspace_root, candidate_dir, iteration, parent): + (candidate_dir / "harness.py").write_text("SCORE_HINT = 0.3\n") + return {"changes_summary": "low"} + + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 6 + config.search.early_stop_patience = 20 + config.proposer.ensemble = ["local", "api"] # valid backend names + + orch = Orchestrator( + workspace=ws, + config=config, + # Inject mocks keyed by backend name so no real CLIs/API are touched. + proposers={"local": HighProposer(), "api": LowProposer()}, + evaluator=MockEvaluator(), + ) + result = orch.run() + + assert orch.bandit is not None + stats = orch.bandit.stats() + # Both arms are tried at least once (cold start), and the improving backend + # ("local"=HighProposer) earns a perfect improve-rate while the other earns 0. + assert stats["local"]["pulls"] >= 1 + assert stats["api"]["pulls"] >= 1 + assert stats["local"]["mean_reward"] == 1.0 + assert stats["api"]["mean_reward"] == 0.0 + assert stats["local"]["pulls"] >= stats["api"]["pulls"] + assert result.best_score >= 0.9 + # The winning candidate records which backend produced it. 
+ best_meta = json.loads( + (ws.candidate_path(result.best_iteration) / "metadata.json").read_text() + ) + assert best_meta["proposer_backend"] == "local" + + +def test_ensemble_disabled_when_proposer_injected(tmp_path): + """An explicit single proposer wins over an ensemble config (back-compat).""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 2 + config.search.early_stop_patience = 10 + config.proposer.ensemble = ["local", "api"] + + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), # explicit → disables the bandit + evaluator=MockEvaluator(), + ) + assert orch.bandit is None + result = orch.run() + assert result.best_score > 0.3 + + +def test_seed_makes_search_reproducible(tmp_path): + """Same seed + randomized strategy → identical parent-selection trajectory.""" + + def run_once(path): + ws = Workspace.init(path) + (ws.base_harness_dir / "harness.py").write_text("SCORE_HINT = 0.3\n") + config = ws.load_config() + config.search.max_iterations = 6 + config.search.early_stop_patience = 20 + config.search.parent_selection = "tournament" + config.search.seed = 42 + orch = Orchestrator( + workspace=ws, + config=config, + proposer=MockProposer(), + evaluator=MockEvaluator(), + ) + orch.run() + return [e.parent for e in orch.search_log.entries] + + assert run_once(tmp_path / "a") == run_once(tmp_path / "b") + + +class PerTaskEvaluator(BaseEvaluator): + """Evaluator that scores by task name and records which task lists it ran.""" + + def __init__(self, task_scores): + self.task_scores = task_scores + self.calls: list[list[str]] = [] + + def evaluate(self, candidate_dir, tasks): + from pathlib import Path + + stems = [Path(t).stem for t in tasks] + self.calls.append(stems) + ts = {s: self.task_scores.get(s, 0.0) for s in stems} + overall = sum(ts.values()) / len(ts) if ts else 0.0 + return EvalResult(overall_score=overall, task_scores=ts) + + +def _cascade_config(ws, **overrides): + config = 
ws.load_config() + config.search.max_iterations = 2 + config.search.early_stop_patience = 10 + config.evaluator.tasks = ["t1.json", "t2.json", "t3.json", "t4.json"] + config.evaluator.cascade = True + config.evaluator.cascade_stage1 = 2 + config.evaluator.cascade_threshold = 0.5 + for k, v in overrides.items(): + setattr(config.evaluator, k, v) + return config + + +def test_cascade_gates_weak_candidate(tmp_path): + """A candidate failing stage 1 should never trigger stage-2 evaluation.""" + ws = _setup_workspace(tmp_path) + config = _cascade_config(ws) + ev = PerTaskEvaluator({"t1": 0.3, "t2": 0.3, "t3": 0.9, "t4": 0.9}) + + Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev).run() + + # Base harness is scored in full (cascade never applies to it). + assert ev.calls[0] == ["t1", "t2", "t3", "t4"] + # Candidates run only stage 1 and are gated → stage 2 never runs alone. + assert ["t1", "t2"] in ev.calls + assert ["t3", "t4"] not in ev.calls + + +def test_cascade_runs_full_for_strong_candidate(tmp_path): + """A candidate clearing stage 1 should proceed to the full task set.""" + ws = _setup_workspace(tmp_path) + config = _cascade_config(ws) + ev = PerTaskEvaluator({"t1": 0.8, "t2": 0.8, "t3": 0.8, "t4": 0.8}) + + orch = Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev) + orch.run() + + assert ["t1", "t2"] in ev.calls # stage 1 + assert ["t3", "t4"] in ev.calls # stage 2 ran too + cand = next(e for e in orch.search_log.entries if e.iteration == 1) + assert set(cand.task_scores) == {"t1", "t2", "t3", "t4"} + + +def test_cascade_disabled_runs_full(tmp_path): + """With cascade off, every evaluation uses the full task list.""" + ws = _setup_workspace(tmp_path) + config = _cascade_config(ws, cascade=False) + ev = PerTaskEvaluator({"t1": 0.3, "t2": 0.3, "t3": 0.9, "t4": 0.9}) + + Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev).run() + + assert ["t1", "t2"] not in ev.calls + assert all(call == ["t1", "t2", "t3", "t4"] for call in 
ev.calls) + + +def test_cascade_base_always_full(tmp_path): + """The base harness is fully evaluated even when candidates are gated.""" + ws = _setup_workspace(tmp_path) + config = _cascade_config(ws, cascade_threshold=0.99) # gate every candidate + ev = PerTaskEvaluator({"t1": 0.5, "t2": 0.5, "t3": 0.5, "t4": 0.5}) + + Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev).run() + + assert ev.calls[0] == ["t1", "t2", "t3", "t4"] + + def test_orchestrator_error_recovery(tmp_path): """Orchestrator should skip failing iterations and continue.""" diff --git a/tests/test_search_log.py b/tests/test_search_log.py index 454984a..872d110 100644 --- a/tests/test_search_log.py +++ b/tests/test_search_log.py @@ -59,3 +59,23 @@ def test_log_entry_roundtrip(): assert restored.parent == 1 assert restored.score == 0.72 assert restored.task_scores == {"a": 0.8} + + +def test_pareto_win_counts(tmp_path): + log = SearchLog(tmp_path / "search_log.jsonl") + log.append(0, None, 0.5, {"A": 0.5, "B": 0.5}) + log.append(1, 0, 0.5, {"A": 0.9, "B": 0.1}) # wins task A + log.append(2, 0, 0.5, {"A": 0.1, "B": 0.9}) # wins task B + + counts = log.pareto_win_counts() + # iter_0 wins nothing; iter_1 and iter_2 each win one task. 
+ assert set(counts) == {1, 2} + assert counts[1] == 1 + assert counts[2] == 1 + + +def test_pareto_win_counts_empty_without_task_scores(tmp_path): + log = SearchLog(tmp_path / "search_log.jsonl") + log.append(0, None, 0.3, {}) + log.append(1, 0, 0.5, {}) + assert log.pareto_win_counts() == {} diff --git a/tests/test_workspace.py b/tests/test_workspace.py index 75f8622..ee8878e 100644 --- a/tests/test_workspace.py +++ b/tests/test_workspace.py @@ -241,3 +241,11 @@ def test_apply_best(tmp_path): assert (target / "harness.py").read_text() == "# optimized\n" assert not (target / "score.json").exists() assert not (target / "traces").exists() + + +def test_candidate_metadata(tmp_path): + ws = Workspace.init(tmp_path / "ws") + ws.store_iteration(0, 0.5, {"A": 0.5}, parent=None, metadata={"proposer_backend": "codex"}) + assert ws.candidate_metadata(0)["proposer_backend"] == "codex" + # Missing candidate → empty dict (no crash). + assert ws.candidate_metadata(99) == {}
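Reviewer note: the per-task-winner logic exercised by `test_pareto_win_counts` and `_pareto_select` can be read in isolation as the sketch below. It is a minimal standalone illustration, not the project's code. The free functions `pareto_win_counts` and `pareto_select` and the plain-dict input are assumptions made for the example; the real implementation lives on `SearchLog` and `Orchestrator` and reads log entries instead.

```python
import random


def pareto_win_counts(
    task_scores_by_iter: dict[int, dict[str, float]], eps: float = 1e-9
) -> dict[int, int]:
    """Map each frontier iteration to the number of tasks it wins.

    Hypothetical standalone helper: takes {iteration: {task: score}}
    instead of SearchLog entries.
    """
    # Ignore iterations that recorded no per-task scores.
    entries = {it: ts for it, ts in task_scores_by_iter.items() if ts}
    if not entries:
        return {}

    task_names: set[str] = set()
    for ts in entries.values():
        task_names.update(ts)

    counts: dict[int, int] = {}
    for task in task_names:
        # Top score on this task; a missing score counts as -inf.
        best = max(ts.get(task, float("-inf")) for ts in entries.values())
        for it, ts in entries.items():
            # Ties within eps all count as winners of the task.
            if ts.get(task, float("-inf")) >= best - eps:
                counts[it] = counts.get(it, 0) + 1
    return counts


def pareto_select(counts: dict[int, int], fallback: int, rng: random.Random) -> int:
    """Sample a parent weighted by tasks won; use *fallback* if no frontier."""
    if not counts:
        return fallback
    iterations = sorted(counts)
    weights = [counts[i] for i in iterations]
    return rng.choices(iterations, weights=weights, k=1)[0]


# Same profile as the tests: iter_0 is mediocre, iter_1 and iter_2 specialize.
scores = {
    0: {"A": 0.5, "B": 0.5},
    1: {"A": 0.9, "B": 0.1},
    2: {"A": 0.1, "B": 0.9},
}
counts = pareto_win_counts(scores)
rng = random.Random(0)  # explicit RNG instance keeps the sketch repeatable
picks = {pareto_select(counts, fallback=0, rng=rng) for _ in range(100)}
```

Passing an explicit `random.Random` instance is a choice made only for this sketch; the diff itself uses the module-level `random.choices`, which `search.seed` is described as seeding.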