Merged
44 changes: 43 additions & 1 deletion CHANGELOG.md
@@ -4,7 +4,49 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

-## [0.2.1] - 2026-04-09
+## [0.2.2] - 2026-05-24

### Added
- **Pareto-frontier parent selection** (`parent_selection: pareto`) — samples
parents from the set of per-task winners instead of always branching from the
single overall-best candidate, keeping specialists alive as stepping stones to
avoid premature convergence. Inspired by GEPA (arXiv:2507.19457). Reuses the
per-task scores already stored in the search log — no new data collected.
- **Code novelty rejection** (`novelty_filter`, `novelty_threshold`,
`novelty_max_retries`) — detects near-duplicate candidates via stdlib
`difflib` text similarity (no new dependencies) and skips their evaluation to
save API/compute budget. Inspired by ShinkaEvolve (arXiv:2509.19349). Off by
default.
- **Adaptive backend ensemble** (`proposer.ensemble`, `proposer.bandit_c`,
`ph run --ensemble b1,b2,...`) — when several backends are listed, a UCB1
bandit picks one per iteration and shifts picks toward backends that produce
*improving* candidates. Fully deterministic (no RNG) and adds no new
dependencies. Run summary shows a per-backend picks/improve-rate table.
Inspired by ShinkaEvolve's adaptive LLM-ensemble selection.
- **Cascade evaluation** (`evaluator.cascade`, `cascade_threshold`,
`cascade_stage1`) — scores a cheap first subset of tasks and only runs the
rest if it clears the gate, saving budget on weak candidates (AlphaEvolve/
OpenEvolve-style). Per-task mode only; the base harness is always scored in
full. Off by default.
- **Reproducible runs** (`search.seed`) — seeds the RNG so tournament/pareto/
novelty regeneration are repeatable across runs.
- **Observability** — `ph log` marks Pareto-frontier members (◆); `ph leaderboard`
adds a Pareto column and a Backend column (shown only when an ensemble was
used). `SearchLog.pareto_win_counts()` powers both the CLI and the orchestrator.
- `proposer_backend` recorded in each candidate's `metadata.json` (ensemble mode)
- Hermes Agent adapter (`hermes`) — 8th proposer backend (`hermes chat -q`)
- `--strategy pareto` and `--ensemble` options for `ph run`
- `proposer/bandit.py` — UCB1 `BackendBandit`
- 31 new tests (206 total)
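
As an illustration of the Pareto-frontier parent selection entry above, here is a minimal sketch, not PolyHarness's actual code. It assumes per-task scores are available as a plain dict keyed by candidate name (the real search-log format is not shown in this changelog), and the names `win_counts` and `sample_parent` are hypothetical. Weighting candidates by the number of tasks they win is one plausible sampling rule chosen for the example.

```python
import random


def win_counts(per_task_scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Map each frontier candidate to the number of tasks it wins (ties share the win)."""
    tasks = sorted({t for scores in per_task_scores.values() for t in scores})
    wins: dict[str, int] = {}
    for task in tasks:
        # A candidate with no score for a task can never win that task.
        best = max(s.get(task, float("-inf")) for s in per_task_scores.values())
        for cand, s in per_task_scores.items():
            if s.get(task, float("-inf")) == best:
                wins[cand] = wins.get(cand, 0) + 1
    return wins


def sample_parent(per_task_scores: dict[str, dict[str, float]], rng: random.Random) -> str:
    """Sample a parent from the per-task winners, weighted by win count."""
    wins = win_counts(per_task_scores)
    if not wins:
        raise ValueError("no scored candidates to select from")
    candidates = sorted(wins)  # stable order, so a seeded rng gives repeatable picks
    return rng.choices(candidates, weights=[wins[c] for c in candidates], k=1)[0]


# Hypothetical scores for three candidates on two tasks.
scores = {
    "iter_0": {"task_a": 0.5, "task_b": 0.9},  # best mean score, wins task_b
    "iter_1": {"task_a": 0.8, "task_b": 0.2},  # specialist, wins task_a
    "iter_2": {"task_a": 0.6, "task_b": 0.6},  # wins no task, so it is off the frontier
}
parent = sample_parent(scores, random.Random(0))
```

With `best` selection only `iter_0` would ever be branched from; here the specialist `iter_1` stays eligible because it wins `task_a`, while `iter_2` is never sampled.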

### Changed
- Agent backends: 7 → 8 (added Hermes Agent)

### Removed
- Stray byte-identical duplicate files (`collector 2.py`, `test_collector 2.py`,
`test_evolution 2.py`) that inflated the test count and tripped ruff N999



### Added
- `ph shell-hook install/uninstall/status` — zero-config auto-wrap for agent commands via shell preexec hook
36 changes: 31 additions & 5 deletions README.md
@@ -15,7 +15,7 @@

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-212%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
[![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md)

---
@@ -53,6 +53,12 @@ PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness
> - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
> - **PolyHarness gives agents persistent self-evolution** — you get a repeatable way to refine how they work over time.

### Part of a wave — specialized for harnesses

PolyHarness doesn't stand alone. A wave of open-source projects has shown that pairing LLMs with evolutionary search systematically improves code and prompts: [GEPA](https://github.com/gepa-ai/gepa) (reflective prompt evolution over a Pareto frontier), [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (sample-efficient program evolution), [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) (an open AlphaEvolve), and the [Darwin Gödel Machine](https://sakana.ai/dgm/) (open-ended self-improving agents).

Most of these evolve *general* programs or algorithms. PolyHarness is the member of this wave **specialized for agent harnesses** — the prompts, tool config, and orchestration *around* an existing agent — with a focus on **online evolution from real usage** (`ph wrap` → `ph evolve`). It borrows the strongest ideas from these projects and applies them to any CLI agent on your own tasks: Pareto-frontier parent selection (GEPA), code-novelty rejection and an adaptive backend ensemble (ShinkaEvolve), and cascade evaluation (AlphaEvolve/OpenEvolve).
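
To make the code-novelty rejection idea concrete, the following is a minimal sketch under stated assumptions rather than the project's implementation. It uses the standard library's `difflib.SequenceMatcher`; the function names `is_near_duplicate` and `propose_novel` are hypothetical, the callable `propose` stands in for whatever generates a candidate, and passing `autojunk=False` is a choice made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def is_near_duplicate(candidate: str, existing: list[str], threshold: float = 0.97) -> bool:
    """True if the candidate's similarity to any earlier candidate exceeds the threshold."""
    return any(
        # autojunk=False: the default heuristic can distort ratios on longer inputs.
        SequenceMatcher(None, candidate, prior, autojunk=False).ratio() > threshold
        for prior in existing
    )


def propose_novel(
    propose: Callable[[], str],
    existing: list[str],
    threshold: float = 0.97,
    max_retries: int = 1,
) -> Optional[str]:
    """Ask for a candidate, regenerating near-duplicates up to max_retries times.

    Returns None when every attempt was a near-duplicate, meaning evaluation is skipped.
    """
    for _ in range(max_retries + 1):
        candidate = propose()
        if not is_near_duplicate(candidate, existing, threshold):
            return candidate
    return None


# Hypothetical harness sources: the first draft repeats an earlier candidate.
seen = ["def run(task):\n    return solve(task, retries=3)\n"]
drafts = iter([seen[0], "def run(task):\n    plan = outline(task)\n    return solve(plan)\n"])
result = propose_novel(lambda: next(drafts), seen)
```

The first draft is rejected as a duplicate without being evaluated, and the single retry yields the second draft, which differs enough to pass the filter.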

## What PolyHarness Is

PolyHarness is the open-source engine for iteratively searching over an agent's harness.
@@ -469,6 +475,16 @@ The Proposer reads **all of this** before generating the next candidate. It can

When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `AGENTS.md` (Hermes), `OPENCODE.md` — each agent's native instruction format.

#### Backend ensemble (adaptive selection)

Don't know which backend writes the best harness changes for your task? Let PolyHarness find out. Pass several and it picks one per iteration with a **UCB bandit**, shifting picks toward whichever backend actually produces *improving* candidates:

```bash
ph run --ensemble "claude-code,codex,local"
```

At the end of the run you get a per-backend breakdown (picks + improve-rate). Selection is deterministic given the reward sequence, so runs stay reproducible. Inspired by ShinkaEvolve's adaptive LLM-ensemble selection.
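
For readers unfamiliar with UCB bandits, here is a small sketch of the textbook UCB1 rule applied to backend selection. It is an illustration only: the class name `SketchBandit` is hypothetical, and the reward of 1 for an improving candidate and 0 otherwise is an assumption about how "improving" is scored.

```python
import math


class SketchBandit:
    """Textbook UCB1 over a fixed list of backend names (illustrative only)."""

    def __init__(self, arms: list[str], c: float = math.sqrt(2)) -> None:
        if not arms:
            raise ValueError("need at least one backend")
        self.arms = list(arms)
        self.c = c
        self.pulls = {a: 0 for a in self.arms}
        self.reward = {a: 0.0 for a in self.arms}

    def pick(self) -> str:
        # Try every backend once, in the order listed, before comparing them.
        for arm in self.arms:
            if self.pulls[arm] == 0:
                return arm
        total = sum(self.pulls.values())

        def ucb(arm: str) -> float:
            mean = self.reward[arm] / self.pulls[arm]
            return mean + self.c * math.sqrt(math.log(total) / self.pulls[arm])

        # max() keeps the first arm on ties, so no random numbers are needed.
        return max(self.arms, key=ucb)

    def update(self, arm: str, improved: bool) -> None:
        self.pulls[arm] += 1
        self.reward[arm] += 1.0 if improved else 0.0


bandit = SketchBandit(["claude-code", "codex"])
for _ in range(50):
    arm = bandit.pick()
    bandit.update(arm, improved=(arm == "codex"))  # pretend only codex ever improves
```

In this toy run most picks go to the backend that keeps improving, while the exploration term still revisits the other one occasionally. Because ties resolve by list order, the pick sequence depends only on the rewards observed.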

### Local Model Setup

If you're running a local model (Ollama, vLLM, LM Studio, or any OpenAI-compatible server), use the `openai` backend:
@@ -517,10 +533,16 @@ After `ph init`, the workspace has a `config.yaml` with these sections:
search:
  max_iterations: 20 # Maximum search iterations
  early_stop_patience: 5 # Stop after N iterations with no improvement
-  parent_selection: best # Strategy: best | tournament | all
+  parent_selection: best # Strategy: best | tournament | all | pareto
  novelty_filter: false # Reject near-duplicate candidates before eval (saves budget)
  novelty_threshold: 0.97 # Similarity ratio above which a candidate is a near-duplicate
  novelty_max_retries: 1 # Regenerate a near-duplicate this many times before skipping
  seed: null # RNG seed; set an int to make randomized runs reproducible

proposer:
  backend: api # api | openai | claude-code | claw-code | codex | hermes | opencode | local
  ensemble: [] # If non-empty, pick among these backends per iteration via a UCB bandit
  bandit_c: 1.41421356 # UCB exploration constant (higher = more exploration)
  model: claude-sonnet-4-20250514 # Model name (for api/openai backends)
  base_url: null # Custom API endpoint (for openai backend)
  api_key: null # API key override (null = use env var)
@@ -532,6 +554,9 @@ evaluator:
  type: python # python | docker | custom
  entry: evaluate.py # Evaluator script entrypoint
  timeout: 300 # Per-task timeout in seconds
  cascade: false # Stage cheap subset first; skip rest if it fails the gate (per-task mode)
  cascade_threshold: 0.4 # Min stage-1 mean score required to run the full task set
  cascade_stage1: 0 # Tasks in stage 1 (0 = auto, ~1/3 of the list)

harness:
  language: python # Harness code language
@@ -599,11 +624,11 @@ python -m polyharness --version
| `ph init` | Initialize workspace with auto-copy of harness, tasks, eval script |
| `ph run` | Start the optimization search loop |
| `ph status` | Progress table with elapsed time, improvement rate, and delta |
-| `ph log` | Search tree with delta (Δ) column (or `--flat` for table) |
+| `ph log` | Search tree with delta (Δ) column and Pareto-frontier (◆) markers (or `--flat` for table) |
| `ph best` | Show best candidate: score, per-task breakdown, changes summary |
| `ph compare A B` | Compare two iterations: score deltas + unified code diff |
| `ph diff <N>` | Shorthand for `compare 0 <N>` |
-| `ph leaderboard` | Ranked table of all candidates (`--top N`, `--tasks` drilldown) |
+| `ph leaderboard` | Ranked table of all candidates with Pareto (◆) and backend columns (`--top N`, `--tasks` drilldown) |
| `ph trace <N>` | View stdout, stderr, metrics, exit code for an iteration |
| `ph report` | Generate a full markdown report with score trends and per-task table |
| `ph apply` | Copy best harness back to `base_harness/` (or `--target` dir) |
@@ -647,7 +672,8 @@ python -m polyharness --version
--dry-run Only evaluate the base harness, skip search
--resume Continue an interrupted search from where it left off
--backend <name> Override proposer backend without editing config
---strategy <name> Override parent selection: best | tournament | all
+--strategy <name> Override parent selection: best | tournament | all | pareto
--ensemble b1,b2,... Pick among multiple backends per iteration via a UCB bandit
```

### `ph wrap` options
36 changes: 31 additions & 5 deletions README_CN.md
@@ -15,7 +15,7 @@

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-212%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
[![English](https://img.shields.io/badge/Docs-English-blue.svg)](README.md)

---
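@@ -53,6 +53,12 @@ PolyHarness 填补了这个空白。它把 Meta-Harness 搜索变成了一个任
> - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
> - **PolyHarness gives agents persistent self-evolution**, so you can keep adjusting how they work in a repeatable way.

### Part of this wave, specialized for harnesses

PolyHarness is not an isolated case. A number of open-source projects have already shown that combining LLMs with evolutionary search systematically improves code and prompts: [GEPA](https://github.com/gepa-ai/gepa) (reflective prompt evolution over a Pareto frontier), [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (sample-efficient program evolution), [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) (an open-source implementation of AlphaEvolve), and the [Darwin Gödel Machine](https://sakana.ai/dgm/) (open-ended self-improving agents).

Most of them evolve *general* programs or algorithms. PolyHarness is the member of this wave **specialized for agent harnesses**: it optimizes the prompts, tool configuration, and orchestration wrapped *around* an existing agent, and focuses on **online evolution from real usage** (`ph wrap` → `ph evolve`). It borrows the most effective ideas from these projects and applies them to any CLI agent on your own tasks: Pareto-frontier parent selection (GEPA), code-novelty rejection and an adaptive backend ensemble (ShinkaEvolve), and cascade evaluation (AlphaEvolve/OpenEvolve).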
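
As a rough sketch of the cascade evaluation idea mentioned above (not the project's code), the function below scores a first stage of tasks and runs the remainder only when the stage-1 mean reaches the gate. The name `cascade_evaluate` and the `score_task` callable are hypothetical, task names are assumed to be unique, and the "about one third" auto size is taken here as integer division with a minimum of one task.

```python
from typing import Callable


def cascade_evaluate(
    tasks: list[str],
    score_task: Callable[[str], float],
    threshold: float = 0.4,
    stage1: int = 0,
) -> tuple[dict[str, float], bool]:
    """Score a cheap first stage; run the remaining tasks only if it clears the gate.

    Returns the scores collected so far and whether the full task set was run.
    """
    if not tasks:
        raise ValueError("need at least one task")
    # stage1 == 0 means "auto": roughly one third of the list, at least one task.
    n1 = stage1 if stage1 > 0 else max(1, len(tasks) // 3)
    n1 = min(n1, len(tasks))
    scores = {t: score_task(t) for t in tasks[:n1]}
    if sum(scores.values()) / n1 < threshold:
        return scores, False  # weak candidate: skip the rest and save budget
    for t in tasks[n1:]:
        scores[t] = score_task(t)
    return scores, True


# Hypothetical task list and two stand-in scorers.
tasks = ["t1", "t2", "t3", "t4", "t5", "t6"]
weak_scores, weak_full = cascade_evaluate(tasks, lambda t: 0.1)
strong_scores, strong_full = cascade_evaluate(tasks, lambda t: 0.9)
```

The weak candidate is stopped after two of six tasks, while the strong one is scored on all six. This sketch does not model the special case where the base harness is always scored in full.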

## What PolyHarness Is

PolyHarness is an open-source engine that explores variants of an agent harness through iterative evaluation and search.
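@@ -469,6 +475,16 @@ Proposer 在生成下一个候选之前会读取**所有这些信息**。它能

When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace that tells the agent how to act as an optimization Proposer. `CLAW.md`, `CODEX.md`, `AGENTS.md` (Hermes), and `OPENCODE.md` work the same way; each agent uses its own native instruction format.

#### Backend ensemble (adaptive selection)

Not sure which backend is best at your task? Let PolyHarness try them for you. Pass in several backends at once and it uses a **UCB bandit** to pick one per iteration, gradually shifting its picks toward the backends that actually produce improving candidates: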

```bash
ph run --ensemble "claude-code,codex,local"
```

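At the end of the run you get a per-backend breakdown (pick count + improvement rate). Selection is deterministic given the reward sequence, so runs are reproducible. The mechanism is borrowed from ShinkaEvolve's adaptive LLM-ensemble selection.

### Local Model Setup

If you run a model locally (Ollama, vLLM, LM Studio, or any OpenAI-compatible server), use the `openai` backend: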
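@@ -517,10 +533,16 @@ proposer:
search:
  max_iterations: 20 # Maximum number of search iterations
  early_stop_patience: 5 # Stop after N consecutive iterations without improvement
-  parent_selection: best # Parent selection strategy: best | tournament | all
+  parent_selection: best # Parent selection strategy: best | tournament | all | pareto
  novelty_filter: false # Reject near-duplicate candidates before evaluation to save budget
  novelty_threshold: 0.97 # Similarity above this value counts as a near-duplicate
  novelty_max_retries: 1 # Number of times to regenerate a near-duplicate before skipping it
  seed: null # Random seed; set an integer to make randomized searches reproducible

proposer:
  backend: api # api | openai | claude-code | claw-code | codex | hermes | opencode | local
  ensemble: [] # If non-empty, a UCB bandit picks among these backends each iteration
  bandit_c: 1.41421356 # UCB exploration constant (larger = more exploration)
  model: claude-sonnet-4-20250514 # Model name (used by the api/openai backends)
  base_url: null # Custom API endpoint (used by the openai backend)
  api_key: null # API key override (null = use the environment variable)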
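@@ -532,6 +554,9 @@ evaluator:
  type: python # python | docker | custom
  entry: evaluate.py # Evaluator script entry point
  timeout: 300 # Timeout per task (seconds)
  cascade: false # Evaluate a cheap subset of tasks first; skip the rest if it misses the gate (per-task mode)
  cascade_threshold: 0.4 # Minimum stage-1 mean score needed to run the full task set
  cascade_stage1: 0 # Number of stage-1 tasks (0 = auto, about 1/3)

harness:
  language: python # Harness code language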
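@@ -599,11 +624,11 @@ python -m polyharness --version
| `ph init` | Initialize the workspace, automatically copying the harness, tasks, and evaluation script |
| `ph run` | Start the optimization search loop |
| `ph status` | Progress table with elapsed time, improvement rate, and delta |
-| `ph log` | Search tree with deltas (Δ), or use `--flat` for a table view |
+| `ph log` | Search tree with a delta (Δ) column and Pareto-frontier (◆) markers, or use `--flat` for a table view |
| `ph best` | Show the best candidate: score, per-task breakdown, summary of changes |
| `ph compare A B` | Compare two iterations: score differences + unified code diff |
| `ph diff <N>` | Shortcut for `compare 0 <N>` |
-| `ph leaderboard` | Ranked table of candidates (`--top N`, `--tasks` to expand per-task scores) |
+| `ph leaderboard` | Ranked table of candidates with Pareto (◆) and backend columns (`--top N`, `--tasks` to expand per-task scores) |
| `ph trace <N>` | View stdout, stderr, metrics, and exit code for an iteration |
| `ph report` | Generate a full markdown report with score trends and a per-task table |
| `ph apply` | Write the best harness back to `base_harness/`, or to a directory given with `--target` |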
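@@ -647,7 +672,8 @@ python -m polyharness --version
--dry-run Only evaluate the baseline harness, skip the search
--resume Continue the search from where it was interrupted
--backend <name> Override the proposer backend without editing the config
---strategy <name> Override the parent selection strategy: best | tournament | all
+--strategy <name> Override the parent selection strategy: best | tournament | all | pareto
--ensemble b1,b2,... Use a UCB bandit to pick among several backends each iteration
```

### `ph wrap` options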
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "polyharness",
-"version": "0.2.1",
+"version": "0.2.2",
"description": "Make your AI agent evolve automatically through iterative harness optimization.",
"keywords": ["agent", "harness", "optimization", "meta-harness", "cli"],
"license": "MIT",
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "polyharness"
-version = "0.2.1"
+version = "0.2.2"
description = "Automated harness optimization for AI agents — make your agent evolve."
readme = "README.md"
license = "MIT"
2 changes: 1 addition & 1 deletion src/polyharness/__init__.py
@@ -1,3 +1,3 @@
"""PolyHarness — Automated harness optimization for AI agents."""

-__version__ = "0.2.1"
+__version__ = "0.2.2"