29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,35 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.2.4] - 2026-05-26

### Added
- **Held-out test split** (`evaluator.eval_split`, `val_tasks`, `test_tasks`) —
evolve the harness against `val_tasks` (selection, Pareto, and early-stop all
use the validation set), then score only the best candidate **once** on the
held-out `test_tasks` at the end. The test score never drives selection — it's
an honest, post-hoc number that exposes harness overfitting to the eval set
(borrowed from the Stanford Meta-Harness reference's val/test methodology).
Per-task mode only; off by default. Result shown in the run summary and
`ph best`, and persisted to `summary/holdout_test.json`.
- **Proposer improvement principles** — every Proposer prompt/instruction
(API, CLI, and the injected `CLAUDE.md`/`AGENTS.md` etc.) now carries shared
directives distilled from the official Stanford Meta-Harness reference Skill
(MIT, re-authored not copied): change a real mechanism rather than tuning
constants, stay general / don't overfit the eval set, ground changes in trace
evidence, and state a falsifiable hypothesis. Pushes proposers toward
higher-value, generalizable candidates (complements the post-hoc novelty filter).
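
The held-out split described in the first entry can be illustrated with a minimal Python sketch. It is not PolyHarness code: the `evaluate` function, the task names, and the candidate "skill sets" are hypothetical stand-ins chosen only to show why the test score must stay out of selection.

```python
from statistics import mean


def evaluate(candidate: set[str], tasks: list[str]) -> float:
    """Hypothetical evaluator: a task scores 1.0 if the candidate handles it."""
    return mean(1.0 if task in candidate else 0.0 for task in tasks)


val_tasks = ["v1", "v2", "v3", "v4"]
test_tasks = ["t1", "t2"]

# Hypothetical candidates, described by the set of tasks each one solves.
candidates = {
    "iter_0": {"v1", "v2", "t1"},
    "iter_1": {"v1", "v2", "v3", "v4"},  # tuned to the validation set only
    "iter_2": {"v1", "v2", "v3", "t1", "t2"},
}

# Selection looks at validation scores only.
val_scores = {name: evaluate(c, val_tasks) for name, c in candidates.items()}
best = max(val_scores, key=val_scores.get)

# The held-out set is touched exactly once, for the selected candidate.
test_score = evaluate(candidates[best], test_tasks)
gap = val_scores[best] - test_score
print(f"{best}: val={val_scores[best]:.2f} test={test_score:.2f} gap={gap:.2f}")
```

In this toy setup the validation winner scores poorly on the held-out tasks, and the gap between the two numbers is the overfitting signal the changelog entry refers to.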

### Changed
- Backstory in README / README_CN corrected: the Stanford Meta-Harness framework
is now open-sourced (MIT); reframed PolyHarness's positioning accordingly and
linked the official repo.
- Added an **Acknowledgments** section (README / README_CN) crediting the open
works PolyHarness borrows ideas from and stating that **no third-party code is
bundled**; CONTRIBUTING documents the attribution policy.
- Updated the `ph shell-hook install` help text to the current agent invocations
(`codex exec`, `opencode run`) — a leftover from the v0.2.3 adapter refresh.

## [0.2.3] - 2026-05-26

### Fixed
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -38,6 +38,17 @@ ruff check src/ tests/
3. Ensure `ruff check` and `pytest` pass before submitting.
4. Keep PRs focused — one feature or fix per PR.

## Third-party code & attribution

PolyHarness ships **no vendored third-party source**. When you borrow an idea
from another project:

- Re-implement it in our own code; do **not** copy source from a licensed
project without preserving its license and copyright notice.
- Attribute the source in an inline comment, and add substantial mechanisms to
the **Acknowledgments** section of the README.
- Borrowing *ideas* from open/MIT works is welcome; vendoring their *code* is not.

## Reporting Issues

Use [GitHub Issues](https://github.com/weijt606/polyharness/issues). Include:
25 changes: 21 additions & 4 deletions README.md
@@ -15,7 +15,7 @@

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]()
[![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md)

---
@@ -45,9 +45,9 @@ Stanford's [Meta-Harness paper](https://arxiv.org/abs/2603.28052) (IRIS Lab, 202

The key insight? When you give an AI agent access to *full diagnostic history* — not just the latest score, but every past attempt's code, traces, and failure modes — it can *systematically evolve* its own harness configuration. The paper called this "non-Markovian search" and showed it outperforms simple best-of-N sampling by a wide margin.

But the paper only released the final optimized artifact (`agent.py`). **The search framework itself was never open-sourced.**
Stanford has since open-sourced the [Meta-Harness framework](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the reference implementation plus the paper's two experiments (text-classification memory search and Terminal-Bench 2 scaffold evolution).

PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness search available to everyone — for any agent, any task, any evaluation pipeline.
PolyHarness takes the same core idea in a different direction: rather than a reference framework you adapt per experiment, it's a productized, multi-backend CLI that optimizes the harness around *any* CLI agent on *your* tasks — with online evolution from real usage, not just batch runs on a benchmark.

> **Think of it this way:**
> - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
@@ -557,6 +557,9 @@ evaluator:
  cascade: false  # Stage cheap subset first; skip rest if it fails the gate (per-task mode)
  cascade_threshold: 0.4  # Min stage-1 mean score required to run the full task set
  cascade_stage1: 0  # Tasks in stage 1 (0 = auto, ~1/3 of the list)
  eval_split: false  # Hold out a test set: evolve on val_tasks, score best on test_tasks once (per-task mode)
  val_tasks: []  # Task files used during search when eval_split is on
  test_tasks: []  # Held-out task files; best candidate scored on these once at the end

harness:
  language: python  # Harness code language
@@ -812,6 +815,20 @@ ruff check src/ tests/ # lint

<p align="center"><strong>Give your agent self-evolution. It's about time.</strong></p>

## Acknowledgments

PolyHarness **bundles no third-party source code**. Its techniques are
independently re-implemented from public papers, docs, and open-source repos,
and attributed inline in the code where relevant:

- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the harness-search formulation, the proposer "improvement principles", and the held-out val/test methodology.
- [GEPA](https://github.com/gepa-ai/gepa) — Pareto-frontier candidate selection.
- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) — code-novelty rejection and adaptive (bandit) backend selection.
- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve — cascade evaluation.
- [Darwin Gödel Machine](https://sakana.ai/dgm/) — open-ended self-improvement framing.

Ideas are borrowed; no code is copied. All projects and trademarks belong to their respective owners.

## License

MIT
MIT — see [LICENSE](LICENSE). © 2026 weijt606.
23 changes: 19 additions & 4 deletions README_CN.md
@@ -15,7 +15,7 @@

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]()
[![English](https://img.shields.io/badge/Docs-English-blue.svg)](README.md)

---
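@@ -45,9 +45,9 @@

The key insight: when you give an AI agent the *full diagnostic history* rather than just the latest score, it can *systematically evolve* its own harness configuration. That history includes the code, traces, and failure modes of every attempt. The paper calls this approach "non-Markovian search" and shows that it clearly outperforms simple best-of-N sampling.

But the paper released only the final optimized artifact (`agent.py`). **The search framework itself was never open-sourced.**
Stanford has since open-sourced the [Meta-Harness framework](https://github.com/stanford-iris-lab/meta-harness) (MIT): the reference implementation plus the paper's two experiments (text-classification memory search and Terminal-Bench 2 scaffold evolution).

PolyHarness fills that gap. It turns Meta-Harness search into an open-source engine that anyone can use, for any agent, any task, and any evaluation pipeline.
PolyHarness takes the same core idea in a different direction: rather than a reference framework that has to be adapted per experiment, it is a productized, multi-backend CLI that optimizes the harness wrapped *around* *any* CLI agent, runs on *your own* tasks, and supports online evolution from real usage rather than only batch runs on a benchmark.

> **Think of it this way:**
> - Memory tools (like Supermemory) give agents persistent **memory** across conversations.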
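@@ -557,6 +557,9 @@ evaluator:
  cascade: false  # Evaluate a cheap task subset first; skip the rest if it fails the gate (per-task mode)
  cascade_threshold: 0.4  # Minimum stage-1 mean score required to run the full task set
  cascade_stage1: 0  # Number of stage-1 tasks (0 = auto, about 1/3)
  eval_split: false  # Hold out a test set: evolve on val_tasks, score the best candidate once on test_tasks at the end (per-task mode)
  val_tasks: []  # Task files used during search when eval_split is on
  test_tasks: []  # Held-out task files; the best candidate is scored on these once, at the end only

harness:
  language: python  # Harness code language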
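@@ -812,6 +815,18 @@ ruff check src/ tests/ # lint

<p align="center"><strong>Give your agent the ability to evolve itself. It's about time.</strong></p>

## Acknowledgments

PolyHarness **bundles no third-party source code**. Its techniques are **independently re-implemented** from public papers, docs, and open-source repos, with sources attributed next to the relevant code:

- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness) (MIT): the harness-search problem formulation, the proposer "improvement principles", and the held-out val/test methodology.
- [GEPA](https://github.com/gepa-ai/gepa): Pareto-frontier parent selection.
- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve): code-novelty rejection and adaptive (bandit) backend selection.
- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve: cascade evaluation.
- [Darwin Gödel Machine](https://sakana.ai/dgm/): the idea of open-ended self-improvement.

Ideas are borrowed; no code is copied. All projects and trademarks belong to their respective owners.

## License

MIT
MIT, see [LICENSE](LICENSE). © 2026 weijt606.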
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
  "name": "polyharness",
  "version": "0.2.3",
  "version": "0.2.4",
  "description": "Make your AI agent evolve automatically through iterative harness optimization.",
  "keywords": [
    "agent",
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "polyharness"
version = "0.2.3"
version = "0.2.4"
description = "Automated harness optimization for AI agents — make your agent evolve."
readme = "README.md"
license = "MIT"
2 changes: 1 addition & 1 deletion src/polyharness/__init__.py
@@ -1,3 +1,3 @@
"""PolyHarness — Automated harness optimization for AI agents."""

__version__ = "0.2.3"
__version__ = "0.2.4"
15 changes: 14 additions & 1 deletion src/polyharness/cli.py
@@ -573,6 +573,19 @@ def best(workspace: str):
    console.print(f"Score: {log.best_score:.4f}")
    console.print(f"Directory: {best_dir}")

    # Held-out test score (only present when eval_split was used)
    holdout_file = ws.summary_dir / "holdout_test.json"
    if holdout_file.exists():
        try:
            ho = json.loads(holdout_file.read_text())
            if ho.get("iteration") == best_i and "test_overall_score" in ho:
                console.print(
                    f"[bold]Held-out test score:[/bold] {ho['test_overall_score']:.4f} "
                    "[dim](not used for selection)[/dim]"
                )
        except (json.JSONDecodeError, ValueError):
            pass

    # Show score details
    score_data = _load_score(best_dir)
    if score_data.get("task_scores"):
@@ -1918,7 +1931,7 @@ def install(rc: str | None):
    """Install shell hook to auto-wrap agent commands.

    Adds a preexec hook to your shell rc file so that commands like
    `claude -p ...`, `claw -p ...`, `codex ...`, `hermes chat -q ...`, `opencode -p ...`
    `claude -p ...`, `claw -p ...`, `codex exec ...`, `hermes chat -q ...`, `opencode run ...`
    are automatically wrapped with `ph wrap --auto-evolve`.
    """
    rc_path = Path(rc) if rc else _detect_shell_rc()
23 changes: 23 additions & 0 deletions src/polyharness/config.py
@@ -132,6 +132,29 @@ class EvaluatorConfig(BaseModel):
            "of the task list, leaving at least one task for stage 2)."
        ),
    )
    eval_split: bool = Field(
        default=False,
        description=(
            "Hold out a test set: evolve on `val_tasks`, then evaluate only the best "
            "candidate once on `test_tasks` at the end (a held-out, honest score that "
            "never drives selection). Per-task mode only. Off by default = use `tasks` "
            "throughout."
        ),
    )
    val_tasks: list[str] = Field(
        default_factory=list,
        description=(
            "Task files used DURING search when eval_split is on "
            "(selection, Pareto, and early-stop all use these)."
        ),
    )
    test_tasks: list[str] = Field(
        default_factory=list,
        description=(
            "Held-out task files; the best candidate is scored on these ONCE at "
            "the end. Never used for selection."
        ),
    )


class HarnessConfig(BaseModel):
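
How the three new fields interact can be sketched with a plain dataclass. The sketch mirrors the fall-back rules visible in this change's `_search_tasks` and `_evaluate_holdout`; the `EvalSplitConfig` class and the `overlap` check are hypothetical additions for illustration, and nothing in the diff shows PolyHarness performing such a check.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSplitConfig:
    """Illustrative stand-in for the evaluator fields; not the project's class."""

    tasks: list[str] = field(default_factory=list)
    eval_split: bool = False
    val_tasks: list[str] = field(default_factory=list)
    test_tasks: list[str] = field(default_factory=list)


def search_tasks(cfg: EvalSplitConfig) -> list[str]:
    """Tasks used during search: val_tasks when the split is on and non-empty."""
    if cfg.eval_split and cfg.val_tasks:
        return cfg.val_tasks
    return cfg.tasks


def holdout_tasks(cfg: EvalSplitConfig) -> list[str]:
    """Tasks used for the one-off final score; empty means no held-out eval."""
    if cfg.eval_split and cfg.test_tasks:
        return cfg.test_tasks
    return []


def overlap(cfg: EvalSplitConfig) -> list[str]:
    """Hypothetical sanity check: tasks that appear in both search and held-out sets."""
    return sorted(set(search_tasks(cfg)) & set(holdout_tasks(cfg)))


off = EvalSplitConfig(tasks=["a.json", "b.json"])
on = EvalSplitConfig(
    tasks=["a.json", "b.json"],
    eval_split=True,
    val_tasks=["a.json"],
    test_tasks=["b.json"],
)
# Split enabled but val_tasks left empty: search falls back to `tasks`.
fallback = EvalSplitConfig(
    tasks=["a.json", "b.json"], eval_split=True, test_tasks=["b.json"]
)
```

The `fallback` case is worth noticing: with `val_tasks` empty the search runs on `tasks`, so any file listed in both `tasks` and `test_tasks` is no longer truly held out.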
66 changes: 63 additions & 3 deletions src/polyharness/orchestrator.py
@@ -2,6 +2,7 @@

from __future__ import annotations

import json
import random
import shutil
from dataclasses import dataclass
Expand All @@ -26,6 +27,7 @@ class SearchResult:
best_iteration: int
best_score: float
total_iterations: int
test_score: float | None = None # held-out test score (eval_split only)


class Orchestrator:
@@ -85,6 +87,12 @@ def run(self, resume: bool = False) -> SearchResult:
            )
        else:
            console.print(f"Proposer backend: {self.config.proposer.backend}")
        _ecfg = self.config.evaluator
        if _ecfg.eval_split and _ecfg.val_tasks and _ecfg.test_tasks:
            console.print(
                f"Eval split: evolve on {len(_ecfg.val_tasks)} val task(s), "
                f"held-out {len(_ecfg.test_tasks)} test task(s)"
            )
        console.print()

        # Determine starting point (resume or fresh)
@@ -255,11 +263,13 @@ def run(self, resume: bool = False) -> SearchResult:
        if patience_counter >= self.config.search.early_stop_patience:
            console.print("\n[yellow]Early stopping triggered.[/yellow]")

        # Final summary
        # Final summary (+ held-out test eval of the best candidate, if enabled)
        test_score = self._evaluate_holdout(best_iteration)
        result = SearchResult(
            best_iteration=best_iteration,
            best_score=best_score,
            total_iterations=len(self.search_log) - 1,
            test_score=test_score,
        )
        self._print_summary(result)
        return result
@@ -297,13 +307,60 @@ def _evaluate_iteration(self, iteration: int, is_base: bool = False) -> float:

        return eval_result.overall_score

    def _search_tasks(self) -> list[str]:
        """Tasks used during the search loop.

        With ``eval_split`` on, the loop evolves against ``val_tasks`` (held-out
        ``test_tasks`` are only touched once at the end). Otherwise the regular
        ``tasks`` list is used.
        """
        cfg = self.config.evaluator
        if cfg.eval_split and cfg.val_tasks:
            return cfg.val_tasks
        return cfg.tasks

    def _run_eval(self, cand_dir, *, allow_cascade: bool) -> EvalResult:
        """Evaluate a candidate, applying cascade when enabled and applicable."""
        tasks = self.config.evaluator.tasks
        tasks = self._search_tasks()
        if allow_cascade and self.config.evaluator.cascade and len(tasks) >= 2:
            return self._evaluate_with_cascade(cand_dir, tasks)
        return self.evaluator.evaluate(candidate_dir=cand_dir, tasks=tasks)

    def _evaluate_holdout(self, best_iteration: int) -> float | None:
        """Score the best candidate once on the held-out ``test_tasks``.

        Returns the overall test score (or ``None`` when split is off / unavailable).
        The result never drives selection — it's an honest, post-hoc number. Stored
        in ``summary/holdout_test.json``.
        """
        cfg = self.config.evaluator
        if not cfg.eval_split or not cfg.test_tasks:
            return None
        cand = self.workspace.candidate_path(best_iteration)
        if not cand.exists():
            return None

        console.print(
            f"\n[bold]Held-out test:[/bold] scoring iter_{best_iteration} on "
            f"{len(cfg.test_tasks)} test task(s) (not used for selection)..."
        )
        res = self.evaluator.evaluate(candidate_dir=cand, tasks=cfg.test_tasks)

        self.workspace.summary_dir.mkdir(exist_ok=True)
        (self.workspace.summary_dir / "holdout_test.json").write_text(
            json.dumps(
                {
                    "iteration": best_iteration,
                    "test_overall_score": res.overall_score,
                    "test_task_scores": res.task_scores,
                },
                indent=2,
                ensure_ascii=False,
            )
            + "\n"
        )
        return res.overall_score

    def _evaluate_with_cascade(self, cand_dir, tasks: list[str]) -> EvalResult:
        """Staged evaluation: cheap subset first, full set only if it clears the gate.

@@ -523,7 +580,10 @@ def _print_summary(self, result: SearchResult) -> None:
        console.rule("[bold green]Search Complete")
        table = Table(show_header=False)
        table.add_row("Best iteration", f"iter_{result.best_iteration}")
        table.add_row("Best score", f"{result.best_score:.4f}")
        score_label = "Best score (val)" if result.test_score is not None else "Best score"
        table.add_row(score_label, f"{result.best_score:.4f}")
        if result.test_score is not None:
            table.add_row("Held-out test score", f"{result.test_score:.4f}")
        table.add_row("Total iterations", str(result.total_iterations))
        console.print(table)

5 changes: 3 additions & 2 deletions src/polyharness/proposer/api_proposer.py
@@ -8,7 +8,7 @@

import anthropic

from polyharness.proposer.base import BaseProposer
from polyharness.proposer.base import PROPOSER_PRINCIPLES, BaseProposer

TOOL_DEFINITIONS: list[dict[str, Any]] = [
{
@@ -115,7 +115,8 @@ def _build_system_prompt(workspace_root: Path, candidate_dir: Path, iteration: i
- Only write files inside your candidate directory ({candidate_dir.relative_to(workspace_root)}/).
- You can read any file in the workspace.
- Make targeted improvements based on evidence from traces.
"""

{PROPOSER_PRINCIPLES}"""


class APIProposer(BaseProposer):