diff --git a/CHANGELOG.md b/CHANGELOG.md index 2028a97..56feab0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,35 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). +## [0.2.4] - 2026-05-26 + +### Added +- **Held-out test split** (`evaluator.eval_split`, `val_tasks`, `test_tasks`) — + evolve the harness against `val_tasks` (selection, Pareto, and early-stop all + use the validation set), then score only the best candidate **once** on the + held-out `test_tasks` at the end. The test score never drives selection — it's + an honest, post-hoc number that exposes harness overfitting to the eval set + (borrowed from the Stanford Meta-Harness reference's val/test methodology). + Per-task mode only; off by default. Result shown in the run summary and + `ph best`, and persisted to `summary/holdout_test.json`. +- **Proposer improvement principles** — every Proposer prompt/instruction + (API, CLI, and the injected `CLAUDE.md`/`AGENTS.md` etc.) now carries shared + directives distilled from the official Stanford Meta-Harness reference Skill + (MIT, re-authored not copied): change a real mechanism rather than tuning + constants, stay general / don't overfit the eval set, ground changes in trace + evidence, and state a falsifiable hypothesis. Pushes proposers toward + higher-value, generalizable candidates (complements the post-hoc novelty filter). + +### Changed +- Backstory in README / README_CN corrected: the Stanford Meta-Harness framework + is now open-sourced (MIT); reframed PolyHarness's positioning accordingly and + linked the official repo. +- Added an **Acknowledgments** section (README / README_CN) crediting the open + works PolyHarness borrows ideas from and stating that **no third-party code is + bundled**; CONTRIBUTING documents the attribution policy. +- Updated the `ph shell-hook install` help text to the current agent invocations + (`codex exec`, `opencode run`) — a leftover from the v0.2.3 adapter refresh. + ## [0.2.3] - 2026-05-26 ### Fixed diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1ed3e2c..2a867bb 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -38,6 +38,17 @@ ruff check src/ tests/ 3. Ensure `ruff check` and `pytest` pass before submitting. 4. Keep PRs focused — one feature or fix per PR. +## Third-party code & attribution + +PolyHarness ships **no vendored third-party source**. When you borrow an idea +from another project: + +- Re-implement it in our own code; do **not** copy source from a licensed + project without preserving its license and copyright notice. +- Attribute the source in an inline comment, and add substantial mechanisms to + the **Acknowledgments** section of the README. +- Borrowing *ideas* from open/MIT works is welcome; vendoring their *code* is not. + ## Reporting Issues Use [GitHub Issues](https://github.com/weijt606/polyharness/issues). Include: diff --git a/README.md b/README.md index b927adf..bee5377 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) -[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]() +[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]() [![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md) --- @@ -45,9 +45,9 @@ Stanford's [Meta-Harness paper](https://arxiv.org/abs/2603.28052) (IRIS Lab, 202 The key insight? When you give an AI agent access to *full diagnostic history* — not just the latest score, but every past attempt's code, traces, and failure modes — it can *systematically evolve* its own harness configuration. The paper called this "non-Markovian search" and showed it outperforms simple best-of-N sampling by a wide margin. -But the paper only released the final optimized artifact (`agent.py`). **The search framework itself was never open-sourced.** +Stanford has since open-sourced the [Meta-Harness framework](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the reference implementation plus the paper's two experiments (text-classification memory search and Terminal-Bench 2 scaffold evolution). -PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness search available to everyone — for any agent, any task, any evaluation pipeline. +PolyHarness takes the same core idea in a different direction: rather than a reference framework you adapt per experiment, it's a productized, multi-backend CLI that optimizes the harness around *any* CLI agent on *your* tasks — with online evolution from real usage, not just batch runs on a benchmark. > **Think of it this way:** > - Memory tools (like Supermemory) give agents persistent **memory** across conversations. @@ -557,6 +557,9 @@ evaluator: cascade: false # Stage cheap subset first; skip rest if it fails the gate (per-task mode) cascade_threshold: 0.4 # Min stage-1 mean score required to run the full task set cascade_stage1: 0 # Tasks in stage 1 (0 = auto, ~1/3 of the list) + eval_split: false # Hold out a test set: evolve on val_tasks, score best on test_tasks once (per-task mode) + val_tasks: [] # Task files used during search when eval_split is on + test_tasks: [] # Held-out task files; best candidate scored on these once at the end harness: language: python # Harness code language @@ -812,6 +815,20 @@ ruff check src/ tests/ # lint

Give your agent self-evolution. It's about time.

+## Acknowledgments + +PolyHarness **bundles no third-party source code**. Its techniques are +independently re-implemented from public papers, docs, and open-source repos, +and attributed inline in the code where relevant: + +- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the harness-search formulation, the proposer "improvement principles", and the held-out val/test methodology. +- [GEPA](https://github.com/gepa-ai/gepa) — Pareto-frontier candidate selection. +- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) — code-novelty rejection and adaptive (bandit) backend selection. +- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve — cascade evaluation. +- [Darwin Gödel Machine](https://sakana.ai/dgm/) — open-ended self-improvement framing. + +Ideas are borrowed; no code is copied. All projects and trademarks belong to their respective owners. + ## License -MIT +MIT — see [LICENSE](LICENSE). © 2026 weijt606. diff --git a/README_CN.md b/README_CN.md index ff2a381..ebd0aed 100644 --- a/README_CN.md +++ b/README_CN.md @@ -15,7 +15,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) -[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]() +[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]() [![English](https://img.shields.io/badge/Docs-English-blue.svg)](README.md) --- @@ -45,9 +45,9 @@ 关键洞察在于:当你给 AI agent 提供的是*完整诊断历史*,而不只是最新分数,它就能*系统性地进化*自己的 harness 配置。这份历史包含每次尝试的代码、轨迹和失败模式。论文把这种方法称为“非马尔可夫搜索”,并证明它明显优于简单的 best-of-N 采样。 -但论文只发布了最终优化产物(`agent.py`)。**搜索框架本身从未开源。** +斯坦福随后已经开源了 [Meta-Harness 框架](https://github.com/stanford-iris-lab/meta-harness)(MIT)——参考实现,加上论文的两个实验(文本分类 memory 搜索、Terminal-Bench 2 scaffold 进化)。 -PolyHarness 填补了这个空白。它把 Meta-Harness 搜索变成了一个任何人都能使用的开源引擎,适用于任意 agent、任意任务和任意评估流程。 +PolyHarness 把同一个核心思路带向了不同方向:它不是一个需要逐实验改造的参考框架,而是一个产品化、多后端的 CLI——优化包裹在*任意* CLI agent *外层*的 harness、跑在*你自己*的任务上,并支持从真实使用中在线进化,而不只是在 benchmark 上批量跑。 > **可以这样理解:** > - 记忆工具(如 Supermemory)赋予 agent 跨会话的持久**记忆**。 @@ -557,6 +557,9 @@ evaluator: cascade: false # 先评便宜的任务子集,未过门槛则跳过其余(逐任务模式) cascade_threshold: 0.4 # 进入完整任务集所需的第一阶段最低均分 cascade_stage1: 0 # 第一阶段任务数(0 = 自动,约占 1/3) + eval_split: false # 留出测试集:在 val_tasks 上进化,末轮在 test_tasks 上评最佳一次(逐任务模式) + val_tasks: [] # eval_split 开启时,搜索期间使用的任务文件 + test_tasks: [] # 留出任务文件;仅末轮对最佳候选评一次 harness: language: python # Harness 代码语言 @@ -812,6 +815,18 @@ ruff check src/ tests/ # lint

给你的 agent 自我进化能力。是时候了。

+## 致谢 + +PolyHarness **不打包任何第三方源代码**。其技术均为依据公开论文、文档与开源仓库**独立重新实现**,并在相关代码处就近注明出处: + +- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness)(MIT)—— harness 搜索的问题表述、proposer "改进原则"、val/test 留出方法学。 +- [GEPA](https://github.com/gepa-ai/gepa) —— Pareto 前沿父代选择。 +- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) —— 代码新颖性拒绝、自适应(bandit)后端选择。 +- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve —— 级联评估。 +- [Darwin Gödel Machine](https://sakana.ai/dgm/) —— 开放式自我改进的思路。 + +只借鉴思路,不复制代码。各项目与商标归各自所有者所有。 + ## 许可 -MIT \ No newline at end of file +MIT —— 见 [LICENSE](LICENSE)。© 2026 weijt606。 \ No newline at end of file diff --git a/package.json b/package.json index 81c718e..2c00f66 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "polyharness", - "version": "0.2.3", + "version": "0.2.4", "description": "Make your AI agent evolve automatically through iterative harness optimization.", "keywords": [ "agent", diff --git a/pyproject.toml b/pyproject.toml index 889f635..ad32ec3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "polyharness" -version = "0.2.3" +version = "0.2.4" description = "Automated harness optimization for AI agents — make your agent evolve." readme = "README.md" license = "MIT" diff --git a/src/polyharness/__init__.py b/src/polyharness/__init__.py index fedd9b3..a81a521 100644 --- a/src/polyharness/__init__.py +++ b/src/polyharness/__init__.py @@ -1,3 +1,3 @@ """PolyHarness — Automated harness optimization for AI agents.""" -__version__ = "0.2.3" +__version__ = "0.2.4" diff --git a/src/polyharness/cli.py b/src/polyharness/cli.py index 1991984..512e889 100644 --- a/src/polyharness/cli.py +++ b/src/polyharness/cli.py @@ -573,6 +573,19 @@ def best(workspace: str): console.print(f"Score: {log.best_score:.4f}") console.print(f"Directory: {best_dir}") + # Held-out test score (only present when eval_split was used) + holdout_file = ws.summary_dir / "holdout_test.json" + if holdout_file.exists(): + try: + ho = json.loads(holdout_file.read_text()) + if ho.get("iteration") == best_i and "test_overall_score" in ho: + console.print( + f"[bold]Held-out test score:[/bold] {ho['test_overall_score']:.4f} " + "[dim](not used for selection)[/dim]" + ) + except (json.JSONDecodeError, ValueError): + pass + # Show score details score_data = _load_score(best_dir) if score_data.get("task_scores"): @@ -1918,7 +1931,7 @@ def install(rc: str | None): """Install shell hook to auto-wrap agent commands. Adds a preexec hook to your shell rc file so that commands like - `claude -p ...`, `claw -p ...`, `codex ...`, `hermes chat -q ...`, `opencode -p ...` + `claude -p ...`, `claw -p ...`, `codex exec ...`, `hermes chat -q ...`, `opencode run ...` are automatically wrapped with `ph wrap --auto-evolve`. """ rc_path = Path(rc) if rc else _detect_shell_rc() diff --git a/src/polyharness/config.py b/src/polyharness/config.py index f845f52..36e6a55 100644 --- a/src/polyharness/config.py +++ b/src/polyharness/config.py @@ -132,6 +132,29 @@ class EvaluatorConfig(BaseModel): "of the task list, leaving at least one task for stage 2)." ), ) + eval_split: bool = Field( + default=False, + description=( + "Hold out a test set: evolve on `val_tasks`, then evaluate only the best " + "candidate once on `test_tasks` at the end (a held-out, honest score that " + "never drives selection). Per-task mode only. Off by default = use `tasks` " + "throughout." + ), + ) + val_tasks: list[str] = Field( + default_factory=list, + description=( + "Task files used DURING search when eval_split is on " + "(selection, Pareto, and early-stop all use these)." + ), + ) + test_tasks: list[str] = Field( + default_factory=list, + description=( + "Held-out task files; the best candidate is scored on these ONCE at " + "the end. Never used for selection." + ), + ) class HarnessConfig(BaseModel): diff --git a/src/polyharness/orchestrator.py b/src/polyharness/orchestrator.py index 9aa0e6e..e4e22f6 100644 --- a/src/polyharness/orchestrator.py +++ b/src/polyharness/orchestrator.py @@ -2,6 +2,7 @@ from __future__ import annotations +import json import random import shutil from dataclasses import dataclass @@ -26,6 +27,7 @@ class SearchResult: best_iteration: int best_score: float total_iterations: int + test_score: float | None = None # held-out test score (eval_split only) class Orchestrator: @@ -85,6 +87,12 @@ def run(self, resume: bool = False) -> SearchResult: ) else: console.print(f"Proposer backend: {self.config.proposer.backend}") + _ecfg = self.config.evaluator + if _ecfg.eval_split and _ecfg.val_tasks and _ecfg.test_tasks: + console.print( + f"Eval split: evolve on {len(_ecfg.val_tasks)} val task(s), " + f"held-out {len(_ecfg.test_tasks)} test task(s)" + ) console.print() # Determine starting point (resume or fresh) @@ -255,11 +263,13 @@ def run(self, resume: bool = False) -> SearchResult: if patience_counter >= self.config.search.early_stop_patience: console.print("\n[yellow]Early stopping triggered.[/yellow]") - # Final summary + # Final summary (+ held-out test eval of the best candidate, if enabled) + test_score = self._evaluate_holdout(best_iteration) result = SearchResult( best_iteration=best_iteration, best_score=best_score, total_iterations=len(self.search_log) - 1, + test_score=test_score, ) self._print_summary(result) return result @@ -297,13 +307,60 @@ def _evaluate_iteration(self, iteration: int, is_base: bool = False) -> float: return eval_result.overall_score + def _search_tasks(self) -> list[str]: + """Tasks used during the search loop. + + With ``eval_split`` on, the loop evolves against ``val_tasks`` (held-out + ``test_tasks`` are only touched once at the end). Otherwise the regular + ``tasks`` list is used. + """ + cfg = self.config.evaluator + if cfg.eval_split and cfg.val_tasks: + return cfg.val_tasks + return cfg.tasks + def _run_eval(self, cand_dir, *, allow_cascade: bool) -> EvalResult: """Evaluate a candidate, applying cascade when enabled and applicable.""" - tasks = self.config.evaluator.tasks + tasks = self._search_tasks() if allow_cascade and self.config.evaluator.cascade and len(tasks) >= 2: return self._evaluate_with_cascade(cand_dir, tasks) return self.evaluator.evaluate(candidate_dir=cand_dir, tasks=tasks) + def _evaluate_holdout(self, best_iteration: int) -> float | None: + """Score the best candidate once on the held-out ``test_tasks``. + + Returns the overall test score (or ``None`` when split is off / unavailable). + The result never drives selection — it's an honest, post-hoc number. Stored + in ``summary/holdout_test.json``. + """ + cfg = self.config.evaluator + if not cfg.eval_split or not cfg.test_tasks: + return None + cand = self.workspace.candidate_path(best_iteration) + if not cand.exists(): + return None + + console.print( + f"\n[bold]Held-out test:[/bold] scoring iter_{best_iteration} on " + f"{len(cfg.test_tasks)} test task(s) (not used for selection)..." + ) + res = self.evaluator.evaluate(candidate_dir=cand, tasks=cfg.test_tasks) + + self.workspace.summary_dir.mkdir(exist_ok=True) + (self.workspace.summary_dir / "holdout_test.json").write_text( + json.dumps( + { + "iteration": best_iteration, + "test_overall_score": res.overall_score, + "test_task_scores": res.task_scores, + }, + indent=2, + ensure_ascii=False, + ) + + "\n" + ) + return res.overall_score + def _evaluate_with_cascade(self, cand_dir, tasks: list[str]) -> EvalResult: """Staged evaluation: cheap subset first, full set only if it clears the gate. @@ -523,7 +580,10 @@ def _print_summary(self, result: SearchResult) -> None: console.rule("[bold green]Search Complete") table = Table(show_header=False) table.add_row("Best iteration", f"iter_{result.best_iteration}") - table.add_row("Best score", f"{result.best_score:.4f}") + score_label = "Best score (val)" if result.test_score is not None else "Best score" + table.add_row(score_label, f"{result.best_score:.4f}") + if result.test_score is not None: + table.add_row("Held-out test score", f"{result.test_score:.4f}") table.add_row("Total iterations", str(result.total_iterations)) console.print(table) diff --git a/src/polyharness/proposer/api_proposer.py b/src/polyharness/proposer/api_proposer.py index f5b10c5..bc540f8 100644 --- a/src/polyharness/proposer/api_proposer.py +++ b/src/polyharness/proposer/api_proposer.py @@ -8,7 +8,7 @@ import anthropic -from polyharness.proposer.base import BaseProposer +from polyharness.proposer.base import PROPOSER_PRINCIPLES, BaseProposer TOOL_DEFINITIONS: list[dict[str, Any]] = [ { @@ -115,7 +115,8 @@ def _build_system_prompt(workspace_root: Path, candidate_dir: Path, iteration: i - Only write files inside your candidate directory ({candidate_dir.relative_to(workspace_root)}/). - You can read any file in the workspace. - Make targeted improvements based on evidence from traces. -""" + +{PROPOSER_PRINCIPLES}""" class APIProposer(BaseProposer): diff --git a/src/polyharness/proposer/base.py b/src/polyharness/proposer/base.py index 954d635..de39eec 100644 --- a/src/polyharness/proposer/base.py +++ b/src/polyharness/proposer/base.py @@ -5,6 +5,22 @@ from abc import ABC, abstractmethod from pathlib import Path +# Shared improvement directives appended to every Proposer's prompt/instructions. +# Distilled from the Stanford Meta-Harness reference Skill (MIT) — re-authored, +# not copied — to push proposers toward high-value, generalizable changes. +PROPOSER_PRINCIPLES = """\ +## Improvement principles +- Change a real mechanism, not just constants. If your edit only tweaks thresholds, + weights, or wording versus the parent, it is a low-value parameter variant — + instead change the actual logic, strategy, or data structure. +- Stay general; do not overfit. Never hardcode answers or task-specific knowledge to + inflate the score. The harness must generalize beyond the eval cases you can see. +- Ground the change in evidence. Before finalizing, point to the specific failures or + regressions in the traces that your change targets, and reason through why it fixes them. +- State a falsifiable hypothesis. In your summary, say what you expect to improve and + why, so the next iteration can tell whether it held. +""" + class BaseProposer(ABC): """Abstract Proposer interface. diff --git a/src/polyharness/proposer/cli_proposer.py b/src/polyharness/proposer/cli_proposer.py index a8458c0..5ba8123 100644 --- a/src/polyharness/proposer/cli_proposer.py +++ b/src/polyharness/proposer/cli_proposer.py @@ -11,7 +11,7 @@ from pathlib import Path from polyharness.proposer.adapters import CLIAdapter, get_adapter -from polyharness.proposer.base import BaseProposer +from polyharness.proposer.base import PROPOSER_PRINCIPLES, BaseProposer def _build_prompt( @@ -73,7 +73,8 @@ def _build_prompt( - Do NOT modify files outside {cand_rel}/. - Do NOT delete or overwrite score.json or metadata.json (the evaluator writes those). - Aim for improvements that are testable and measurable. -""" + +{PROPOSER_PRINCIPLES}""" class CLIProposer(BaseProposer): diff --git a/src/polyharness/workspace.py b/src/polyharness/workspace.py index 9b6dd76..1af7ea7 100644 --- a/src/polyharness/workspace.py +++ b/src/polyharness/workspace.py @@ -414,4 +414,17 @@ def _agent_instruction_content(backend: str, filename: str) -> str: - Compare the best candidate's traces with worse ones to spot patterns. - If a previous approach regressed, explain why and try a different angle. - Small, targeted edits beat large rewrites. + +## Improvement Principles + +- **Change a real mechanism, not just constants.** If your edit only tweaks + thresholds, weights, or wording versus the parent, it is a low-value parameter + variant — change the actual logic, strategy, or data structure instead. +- **Stay general; do not overfit.** Never hardcode answers or task-specific + knowledge to inflate the score. The harness must generalize beyond the eval + cases you can see. +- **Ground the change in evidence.** Point to the specific trace failures your + change targets, and reason through why it fixes them. +- **State a falsifiable hypothesis.** Say what you expect to improve and why, so + the next iteration can tell whether it held. """ diff --git a/tests/test_cli_adapters.py b/tests/test_cli_adapters.py index 974fc93..a197e28 100644 --- a/tests/test_cli_adapters.py +++ b/tests/test_cli_adapters.py @@ -177,6 +177,25 @@ def test_build_prompt_includes_leaderboard(tmp_path): assert "0.8" in prompt +def test_proposer_prompts_include_improvement_principles(tmp_path): + """Both the CLI and API proposer prompts carry the shared improvement principles.""" + from polyharness.proposer.api_proposer import _build_system_prompt + from polyharness.proposer.base import PROPOSER_PRINCIPLES + + ws = tmp_path / "ws" + cand = ws / "candidates" / "iter_1" + cand.mkdir(parents=True) + + cli_prompt = _build_prompt(ws, cand, 1, 0) + api_prompt = _build_system_prompt(ws, cand, 1, 0) + + # The anti-parameter-tuning directive is the distinctive marker. + assert "parameter variant" in PROPOSER_PRINCIPLES + assert "parameter variant" in cli_prompt + assert "parameter variant" in api_prompt + assert "falsifiable hypothesis" in cli_prompt + + # --------------------------------------------------------------------------- # CLIProposer # --------------------------------------------------------------------------- diff --git a/tests/test_config.py b/tests/test_config.py index b10d48a..0d1da47 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -53,6 +53,24 @@ def test_cascade_defaults_and_roundtrip(): assert loaded.evaluator.cascade_stage1 == 3 +def test_eval_split_defaults_and_roundtrip(): + cfg = PolyHarnessConfig() + assert cfg.evaluator.eval_split is False + assert cfg.evaluator.val_tasks == [] + assert cfg.evaluator.test_tasks == [] + + cfg.evaluator.eval_split = True + cfg.evaluator.val_tasks = ["tasks/v1.json", "tasks/v2.json"] + cfg.evaluator.test_tasks = ["tasks/t1.json"] + with tempfile.TemporaryDirectory() as tmp: + path = Path(tmp) / "config.yaml" + cfg.to_yaml(path) + loaded = PolyHarnessConfig.from_yaml(path) + assert loaded.evaluator.eval_split is True + assert loaded.evaluator.val_tasks == ["tasks/v1.json", "tasks/v2.json"] + assert loaded.evaluator.test_tasks == ["tasks/t1.json"] + + def test_config_roundtrip_yaml(): cfg = PolyHarnessConfig() cfg.proposer.backend = "claude-code" # type: ignore[assignment] diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index daf7a64..1a7f366 100644 --- a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -524,6 +524,45 @@ def test_cascade_base_always_full(tmp_path): assert ev.calls[0] == ["t1", "t2", "t3", "t4"] +def test_eval_split_holds_out_test(tmp_path): + """With eval_split, search runs on val tasks; the best candidate is scored + once on the held-out test tasks at the end.""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 3 + config.search.early_stop_patience = 10 + config.evaluator.eval_split = True + config.evaluator.val_tasks = ["v1.json", "v2.json"] + config.evaluator.test_tasks = ["t1.json", "t2.json"] + ev = PerTaskEvaluator({"v1": 0.6, "v2": 0.6, "t1": 0.9, "t2": 0.9}) + + result = Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev).run() + + # Test tasks are scored exactly once (the held-out final eval), never during search. + assert sum(1 for c in ev.calls if c == ["t1", "t2"]) == 1 + # The search loop (base + candidates) ran on val tasks. + assert sum(1 for c in ev.calls if c == ["v1", "v2"]) >= 2 + # The held-out score is reported and persisted, and never drove selection. + assert result.test_score == 0.9 + holdout = json.loads((ws.summary_dir / "holdout_test.json").read_text()) + assert holdout["test_overall_score"] == 0.9 + + +def test_eval_split_off_by_default(tmp_path): + """Without eval_split there is no held-out eval (back-compat).""" + ws = _setup_workspace(tmp_path) + config = ws.load_config() + config.search.max_iterations = 2 + config.search.early_stop_patience = 10 + + result = Orchestrator( + ws, config, proposer=MockProposer(), evaluator=MockEvaluator() + ).run() + + assert result.test_score is None + assert not (ws.summary_dir / "holdout_test.json").exists() + + def test_orchestrator_error_recovery(tmp_path): """Orchestrator should skip failing iterations and continue."""