diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2028a97..56feab0 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,35 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
+## [0.2.4] - 2026-05-26
+
+### Added
+- **Held-out test split** (`evaluator.eval_split`, `val_tasks`, `test_tasks`) —
+  evolve the harness against `val_tasks` (selection, Pareto, and early-stop all
+  use the validation set), then score only the best candidate **once** on the
+  held-out `test_tasks` at the end. The test score never drives selection — it's
+  an honest, post-hoc number that exposes harness overfitting to the eval set
+  (borrowed from the Stanford Meta-Harness reference's val/test methodology).
+  Per-task mode only; off by default. Result shown in the run summary and
+  `ph best`, and persisted to `summary/holdout_test.json`.
+- **Proposer improvement principles** — every Proposer prompt/instruction
+  (API, CLI, and the injected `CLAUDE.md`/`AGENTS.md` etc.) now carries shared
+  directives distilled from the official Stanford Meta-Harness reference Skill
+  (MIT, re-authored not copied): change a real mechanism rather than tuning
+  constants, stay general / don't overfit the eval set, ground changes in trace
+  evidence, and state a falsifiable hypothesis. Pushes proposers toward
+  higher-value, generalizable candidates (complements the post-hoc novelty filter).
+
+### Changed
+- Backstory in README / README_CN corrected: the Stanford Meta-Harness framework
+  is now open-sourced (MIT); reframed PolyHarness's positioning accordingly and
+  linked the official repo.
+- Added an **Acknowledgments** section (README / README_CN) crediting the open
+  works PolyHarness borrows ideas from and stating that **no third-party code is
+  bundled**; CONTRIBUTING documents the attribution policy.
+- Updated the `ph shell-hook install` help text to the current agent invocations
+  (`codex exec`, `opencode run`); the old text was missed in the v0.2.3 adapter refresh.
+
 ## [0.2.3] - 2026-05-26
 
 ### Fixed
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1ed3e2c..2a867bb 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -38,6 +38,17 @@ ruff check src/ tests/
 3. Ensure `ruff check` and `pytest` pass before submitting.
 4. Keep PRs focused — one feature or fix per PR.
 
+## Third-party code & attribution
+
+PolyHarness ships **no vendored third-party source**. When you borrow an idea
+from another project:
+
+- Re-implement it in our own code; do **not** copy source from a licensed
+  project without preserving its license and copyright notice.
+- Attribute the source in an inline comment, and add substantial mechanisms to
+  the **Acknowledgments** section of the README.
+- Borrowing *ideas* from open/MIT works is welcome; vendoring their *code* is not.
+
 ## Reporting Issues
 
 Use [GitHub Issues](https://github.com/weijt606/polyharness/issues). Include:
diff --git a/README.md b/README.md
index b927adf..bee5377 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]()
 [![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md)
 
 ---
@@ -45,9 +45,9 @@ Stanford's [Meta-Harness paper](https://arxiv.org/abs/2603.28052) (IRIS Lab, 202
 
 The key insight? When you give an AI agent access to *full diagnostic history* — not just the latest score, but every past attempt's code, traces, and failure modes — it can *systematically evolve* its own harness configuration. The paper called this "non-Markovian search" and showed it outperforms simple best-of-N sampling by a wide margin.
 
-But the paper only released the final optimized artifact (`agent.py`). **The search framework itself was never open-sourced.**
+Stanford has since open-sourced the [Meta-Harness framework](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the reference implementation plus the paper's two experiments (text-classification memory search and Terminal-Bench 2 scaffold evolution).
 
-PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness search available to everyone — for any agent, any task, any evaluation pipeline.
+PolyHarness takes the same core idea in a different direction: rather than a reference framework you adapt per experiment, it's a productized, multi-backend CLI that optimizes the harness around *any* CLI agent on *your* tasks — with online evolution from real usage, not just batch runs on a benchmark.
 
 > **Think of it this way:**
 > - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
@@ -557,6 +557,9 @@ evaluator:
   cascade: false               # Stage cheap subset first; skip rest if it fails the gate (per-task mode)
   cascade_threshold: 0.4       # Min stage-1 mean score required to run the full task set
   cascade_stage1: 0            # Tasks in stage 1 (0 = auto, ~1/3 of the list)
+  eval_split: false            # Hold out a test set: evolve on val_tasks, score best on test_tasks once (per-task mode)
+  val_tasks: []                # Task files used during search when eval_split is on
+  test_tasks: []               # Held-out task files; best candidate scored on these once at the end
 
 harness:
   language: python             # Harness code language
@@ -812,6 +815,20 @@ ruff check src/ tests/       # lint
 
 <p align="center"><strong>Give your agent self-evolution. It's about time.</strong></p>
 
+## Acknowledgments
+
+PolyHarness **bundles no third-party source code**. Its techniques are
+independently re-implemented from public papers, docs, and open-source repos,
+and attributed inline in the code where relevant:
+
+- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness) (MIT) — the harness-search formulation, the proposer "improvement principles", and the held-out val/test methodology.
+- [GEPA](https://github.com/gepa-ai/gepa) — Pareto-frontier candidate selection.
+- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) — code-novelty rejection and adaptive (bandit) backend selection.
+- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve — cascade evaluation.
+- [Darwin Gödel Machine](https://sakana.ai/dgm/) — open-ended self-improvement framing.
+
+Ideas are borrowed; no code is copied. All projects and trademarks belong to their respective owners.
+
 ## License
 
-MIT
+MIT — see [LICENSE](LICENSE). © 2026 weijt606.
diff --git a/README_CN.md b/README_CN.md
index ff2a381..ebd0aed 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -15,7 +15,7 @@
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-210%20passing-brightgreen.svg)]()
 [![English](https://img.shields.io/badge/Docs-English-blue.svg)](README.md)
 
 ---
@@ -45,9 +45,9 @@
 
 关键洞察在于：当你给 AI agent 提供的是*完整诊断历史*，而不只是最新分数，它就能*系统性地进化*自己的 harness 配置。这份历史包含每次尝试的代码、轨迹和失败模式。论文把这种方法称为“非马尔可夫搜索”，并证明它明显优于简单的 best-of-N 采样。
 
-但论文只发布了最终优化产物（`agent.py`）。**搜索框架本身从未开源。**
+斯坦福随后已经开源了 [Meta-Harness 框架](https://github.com/stanford-iris-lab/meta-harness)（MIT）——参考实现，加上论文的两个实验（文本分类 memory 搜索、Terminal-Bench 2 scaffold 进化）。
 
-PolyHarness 填补了这个空白。它把 Meta-Harness 搜索变成了一个任何人都能使用的开源引擎，适用于任意 agent、任意任务和任意评估流程。
+PolyHarness 把同一个核心思路带向了不同方向：它不是一个需要逐实验改造的参考框架，而是一个产品化、多后端的 CLI——优化包裹在*任意* CLI agent *外层*的 harness、跑在*你自己*的任务上，并支持从真实使用中在线进化，而不只是在 benchmark 上批量跑。
 
 > **可以这样理解：**
 > - 记忆工具（如 Supermemory）赋予 agent 跨会话的持久**记忆**。
@@ -557,6 +557,9 @@ evaluator:
   cascade: false               # 先评便宜的任务子集，未过门槛则跳过其余（逐任务模式）
   cascade_threshold: 0.4       # 进入完整任务集所需的第一阶段最低均分
   cascade_stage1: 0            # 第一阶段任务数（0 = 自动，约占 1/3）
+  eval_split: false            # 留出测试集：在 val_tasks 上进化，末轮在 test_tasks 上评最佳一次（逐任务模式）
+  val_tasks: []                # eval_split 开启时，搜索期间使用的任务文件
+  test_tasks: []               # 留出任务文件；仅末轮对最佳候选评一次
 
 harness:
   language: python             # Harness 代码语言
@@ -812,6 +815,18 @@ ruff check src/ tests/       # lint
 
 <p align="center"><strong>给你的 agent 自我进化能力。是时候了。</strong></p>
 
+## 致谢
+
+PolyHarness **不打包任何第三方源代码**。其技术均为依据公开论文、文档与开源仓库**独立重新实现**，并在相关代码处就近注明出处：
+
+- [Stanford Meta-Harness](https://github.com/stanford-iris-lab/meta-harness)（MIT）—— harness 搜索的问题表述、proposer "改进原则"、val/test 留出方法学。
+- [GEPA](https://github.com/gepa-ai/gepa) —— Pareto 前沿父代选择。
+- [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) —— 代码新颖性拒绝、自适应（bandit）后端选择。
+- [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) / AlphaEvolve —— 级联评估。
+- [Darwin Gödel Machine](https://sakana.ai/dgm/) —— 开放式自我改进的思路。
+
+只借鉴思路，不复制代码。各项目与商标归各自所有者所有。
+
 ## 许可
 
-MIT
\ No newline at end of file
+MIT —— 见 [LICENSE](LICENSE)。© 2026 weijt606。
\ No newline at end of file
diff --git a/package.json b/package.json
index 81c718e..2c00f66 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "polyharness",
-  "version": "0.2.3",
+  "version": "0.2.4",
   "description": "Make your AI agent evolve automatically through iterative harness optimization.",
   "keywords": [
     "agent",
diff --git a/pyproject.toml b/pyproject.toml
index 889f635..ad32ec3 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "polyharness"
-version = "0.2.3"
+version = "0.2.4"
 description = "Automated harness optimization for AI agents — make your agent evolve."
 readme = "README.md"
 license = "MIT"
diff --git a/src/polyharness/__init__.py b/src/polyharness/__init__.py
index fedd9b3..a81a521 100644
--- a/src/polyharness/__init__.py
+++ b/src/polyharness/__init__.py
@@ -1,3 +1,3 @@
 """PolyHarness — Automated harness optimization for AI agents."""
 
-__version__ = "0.2.3"
+__version__ = "0.2.4"
diff --git a/src/polyharness/cli.py b/src/polyharness/cli.py
index 1991984..512e889 100644
--- a/src/polyharness/cli.py
+++ b/src/polyharness/cli.py
@@ -573,6 +573,19 @@ def best(workspace: str):
     console.print(f"Score: {log.best_score:.4f}")
     console.print(f"Directory: {best_dir}")
 
+    # Held-out test score (only present when eval_split was used)
+    holdout_file = ws.summary_dir / "holdout_test.json"
+    if holdout_file.exists():
+        try:
+            ho = json.loads(holdout_file.read_text())
+            if ho.get("iteration") == best_i and "test_overall_score" in ho:
+                console.print(
+                    f"[bold]Held-out test score:[/bold] {ho['test_overall_score']:.4f} "
+                    "[dim](not used for selection)[/dim]"
+                )
+        except (OSError, ValueError, TypeError, AttributeError):
+            pass  # unreadable or malformed file: skip the held-out line
+
     # Show score details
     score_data = _load_score(best_dir)
     if score_data.get("task_scores"):
@@ -1918,7 +1931,7 @@ def install(rc: str | None):
     """Install shell hook to auto-wrap agent commands.
 
     Adds a preexec hook to your shell rc file so that commands like
-    `claude -p ...`, `claw -p ...`, `codex ...`, `hermes chat -q ...`, `opencode -p ...`
+    `claude -p ...`, `claw -p ...`, `codex exec ...`, `hermes chat -q ...`, `opencode run ...`
     are automatically wrapped with `ph wrap --auto-evolve`.
     """
     rc_path = Path(rc) if rc else _detect_shell_rc()
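The `ph best` hunk above prints the held-out score only when `summary/holdout_test.json` parses and its stored iteration matches the best one. A minimal standalone sketch of that guard follows, assuming the file layout written by `_evaluate_holdout`. The helper name `read_holdout_score` is hypothetical and is not part of this patch or of PolyHarness.

```python
from __future__ import annotations

import json
from pathlib import Path


def read_holdout_score(summary_dir: Path, best_iteration: int) -> float | None:
    """Return the held-out test score for ``best_iteration``, or None.

    Hypothetical helper: mirrors the guard in the ``ph best`` hunk, but treats
    any unreadable, malformed, or stale file as "no score available".
    """
    holdout_file = summary_dir / "holdout_test.json"
    try:
        data = json.loads(holdout_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return None
    if not isinstance(data, dict) or data.get("iteration") != best_iteration:
        return None  # wrong shape, or a result left over from another run
    score = data.get("test_overall_score")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score)
```

Returning `None` for every failure mode keeps the caller to a single `if score is not None` check, which matches the "skip the line silently" behaviour of the hunk.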
diff --git a/src/polyharness/config.py b/src/polyharness/config.py
index f845f52..36e6a55 100644
--- a/src/polyharness/config.py
+++ b/src/polyharness/config.py
@@ -132,6 +132,29 @@ class EvaluatorConfig(BaseModel):
             "of the task list, leaving at least one task for stage 2)."
         ),
     )
+    eval_split: bool = Field(
+        default=False,
+        description=(
+            "Hold out a test set: evolve on `val_tasks`, then evaluate only the best "
+            "candidate once on `test_tasks` at the end (a held-out, honest score that "
+            "never drives selection). Per-task mode only. Off by default, in which "
+            "case `tasks` is used throughout."
+        ),
+    )
+    val_tasks: list[str] = Field(
+        default_factory=list,
+        description=(
+            "Task files used DURING search when eval_split is on "
+            "(selection, Pareto, and early-stop all use these)."
+        ),
+    )
+    test_tasks: list[str] = Field(
+        default_factory=list,
+        description=(
+            "Held-out task files; the best candidate is scored on these ONCE at "
+            "the end. Never used for selection."
+        ),
+    )
 
 
 class HarnessConfig(BaseModel):
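Nothing in this hunk checks that the held-out set is disjoint from the validation set, and a task file listed in both would leak into selection. A minimal sketch of a pre-flight check follows, assuming tasks are identified by their path strings. `check_eval_split` is a hypothetical helper, not something this patch adds; only the field names come from `EvaluatorConfig`.

```python
from __future__ import annotations


def check_eval_split(
    eval_split: bool, val_tasks: list[str], test_tasks: list[str]
) -> list[str]:
    """Return human-readable problems with an eval-split config (empty = OK).

    Hypothetical pre-flight check; field names follow ``EvaluatorConfig``.
    """
    if not eval_split:
        return []  # split is off: ``tasks`` is used throughout, nothing to check
    problems: list[str] = []
    if not val_tasks:
        problems.append("eval_split is on but val_tasks is empty")
    if not test_tasks:
        problems.append("eval_split is on but test_tasks is empty")
    # Compare by exact path string; two spellings of one file are not detected.
    overlap = sorted(set(val_tasks) & set(test_tasks))
    if overlap:
        problems.append(f"tasks in both val and test: {', '.join(overlap)}")
    return problems
```

Returning a list rather than raising lets a caller print every problem at once, for example before a long search run starts.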
diff --git a/src/polyharness/orchestrator.py b/src/polyharness/orchestrator.py
index 9aa0e6e..e4e22f6 100644
--- a/src/polyharness/orchestrator.py
+++ b/src/polyharness/orchestrator.py
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import json
 import random
 import shutil
 from dataclasses import dataclass
@@ -26,6 +27,7 @@ class SearchResult:
     best_iteration: int
     best_score: float
     total_iterations: int
+    test_score: float | None = None  # held-out test score (eval_split only)
 
 
 class Orchestrator:
@@ -85,6 +87,12 @@ def run(self, resume: bool = False) -> SearchResult:
             )
         else:
             console.print(f"Proposer backend: {self.config.proposer.backend}")
+        _ecfg = self.config.evaluator
+        if _ecfg.eval_split and _ecfg.val_tasks and _ecfg.test_tasks:
+            console.print(
+                f"Eval split: evolve on {len(_ecfg.val_tasks)} val task(s), "
+                f"held-out {len(_ecfg.test_tasks)} test task(s)"
+            )
         console.print()
 
         # Determine starting point (resume or fresh)
@@ -255,11 +263,13 @@ def run(self, resume: bool = False) -> SearchResult:
         if patience_counter >= self.config.search.early_stop_patience:
             console.print("\n[yellow]Early stopping triggered.[/yellow]")
 
-        # Final summary
+        # Final summary (+ held-out test eval of the best candidate, if enabled)
+        test_score = self._evaluate_holdout(best_iteration)
         result = SearchResult(
             best_iteration=best_iteration,
             best_score=best_score,
             total_iterations=len(self.search_log) - 1,
+            test_score=test_score,
         )
         self._print_summary(result)
         return result
@@ -297,13 +307,60 @@ def _evaluate_iteration(self, iteration: int, is_base: bool = False) -> float:
 
         return eval_result.overall_score
 
+    def _search_tasks(self) -> list[str]:
+        """Tasks used during the search loop.
+
+        With ``eval_split`` on and ``val_tasks`` set, the loop evolves against
+        ``val_tasks`` (held-out ``test_tasks`` are only touched once at the end).
+        Otherwise (or if ``val_tasks`` is empty) the regular ``tasks`` list is used.
+        """
+        cfg = self.config.evaluator
+        if cfg.eval_split and cfg.val_tasks:
+            return cfg.val_tasks
+        return cfg.tasks
+
     def _run_eval(self, cand_dir, *, allow_cascade: bool) -> EvalResult:
         """Evaluate a candidate, applying cascade when enabled and applicable."""
-        tasks = self.config.evaluator.tasks
+        tasks = self._search_tasks()
         if allow_cascade and self.config.evaluator.cascade and len(tasks) >= 2:
             return self._evaluate_with_cascade(cand_dir, tasks)
         return self.evaluator.evaluate(candidate_dir=cand_dir, tasks=tasks)
 
+    def _evaluate_holdout(self, best_iteration: int) -> float | None:
+        """Score the best candidate once on the held-out ``test_tasks``.
+
+        Returns the overall test score, or ``None`` when the split is off,
+        ``test_tasks`` is empty, or the best candidate's directory is missing. The
+        score never drives selection; it is written to ``summary/holdout_test.json``.
+        """
+        cfg = self.config.evaluator
+        if not cfg.eval_split or not cfg.test_tasks:
+            return None
+        cand = self.workspace.candidate_path(best_iteration)
+        if not cand.exists():
+            return None
+
+        console.print(
+            f"\n[bold]Held-out test:[/bold] scoring iter_{best_iteration} on "
+            f"{len(cfg.test_tasks)} test task(s) (not used for selection)..."
+        )
+        res = self.evaluator.evaluate(candidate_dir=cand, tasks=cfg.test_tasks)
+
+        self.workspace.summary_dir.mkdir(parents=True, exist_ok=True)
+        (self.workspace.summary_dir / "holdout_test.json").write_text(
+            json.dumps(
+                {
+                    "iteration": best_iteration,
+                    "test_overall_score": res.overall_score,
+                    "test_task_scores": res.task_scores,
+                },
+                indent=2,
+                ensure_ascii=False,
+            )
+            + "\n"
+        )
+        return res.overall_score
+
     def _evaluate_with_cascade(self, cand_dir, tasks: list[str]) -> EvalResult:
         """Staged evaluation: cheap subset first, full set only if it clears the gate.
 
@@ -523,7 +580,10 @@ def _print_summary(self, result: SearchResult) -> None:
         console.rule("[bold green]Search Complete")
         table = Table(show_header=False)
         table.add_row("Best iteration", f"iter_{result.best_iteration}")
-        table.add_row("Best score", f"{result.best_score:.4f}")
+        score_label = "Best score (val)" if result.test_score is not None else "Best score"
+        table.add_row(score_label, f"{result.best_score:.4f}")
+        if result.test_score is not None:
+            table.add_row("Held-out test score", f"{result.test_score:.4f}")
         table.add_row("Total iterations", str(result.total_iterations))
         console.print(table)
 
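The orchestrator changes above amount to two steps: rank candidates on the validation tasks only, then score the winner once on the held-out tasks. A toy, self-contained sketch of that flow follows, assuming an evaluator that returns the mean of per-task scores. `mean_score`, `select_and_holdout`, and the score table are illustrative and are not PolyHarness APIs.

```python
from __future__ import annotations

from statistics import fmean


def mean_score(scores: dict[str, float], tasks: list[str]) -> float:
    """Toy evaluator: mean of one candidate's per-task scores over ``tasks``."""
    return fmean(scores[t] for t in tasks)


def select_and_holdout(
    candidates: dict[str, dict[str, float]],
    val_tasks: list[str],
    test_tasks: list[str],
) -> tuple[str, float, float]:
    """Pick the best candidate on val, then score only that one on test."""
    # Selection sees validation scores only.
    best = max(candidates, key=lambda name: mean_score(candidates[name], val_tasks))
    val = mean_score(candidates[best], val_tasks)
    # Single post-hoc evaluation; its result is reported, never fed back.
    test = mean_score(candidates[best], test_tasks)
    return best, val, test


# Made-up scores: iter_1 wins on val but does worse on the held-out tasks.
candidates = {
    "iter_0": {"v1": 0.5, "v2": 0.5, "t1": 0.5, "t2": 0.5},
    "iter_1": {"v1": 1.0, "v2": 1.0, "t1": 0.25, "t2": 0.25},
}
best, val, test = select_and_holdout(candidates, ["v1", "v2"], ["t1", "t2"])
print(best, val, test)
```

A large gap between the validation and test numbers, as in this made-up table, is the overfitting signal the held-out score is meant to expose.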
diff --git a/src/polyharness/proposer/api_proposer.py b/src/polyharness/proposer/api_proposer.py
index f5b10c5..bc540f8 100644
--- a/src/polyharness/proposer/api_proposer.py
+++ b/src/polyharness/proposer/api_proposer.py
@@ -8,7 +8,7 @@
 
 import anthropic
 
-from polyharness.proposer.base import BaseProposer
+from polyharness.proposer.base import PROPOSER_PRINCIPLES, BaseProposer
 
 TOOL_DEFINITIONS: list[dict[str, Any]] = [
     {
@@ -115,7 +115,8 @@ def _build_system_prompt(workspace_root: Path, candidate_dir: Path, iteration: i
 - Only write files inside your candidate directory ({candidate_dir.relative_to(workspace_root)}/).
 - You can read any file in the workspace.
 - Make targeted improvements based on evidence from traces.
-"""
+
+{PROPOSER_PRINCIPLES}"""
 
 
 class APIProposer(BaseProposer):
diff --git a/src/polyharness/proposer/base.py b/src/polyharness/proposer/base.py
index 954d635..de39eec 100644
--- a/src/polyharness/proposer/base.py
+++ b/src/polyharness/proposer/base.py
@@ -5,6 +5,22 @@
 from abc import ABC, abstractmethod
 from pathlib import Path
 
+# Shared improvement directives appended to every Proposer's prompt/instructions.
+# Distilled from the Stanford Meta-Harness reference Skill (MIT) — re-authored,
+# not copied — to push proposers toward high-value, generalizable changes.
+PROPOSER_PRINCIPLES = """\
+## Improvement principles
+- Change a real mechanism, not just constants. If your edit only tweaks thresholds,
+  weights, or wording versus the parent, it is a low-value parameter variant —
+  instead change the actual logic, strategy, or data structure.
+- Stay general; do not overfit. Never hardcode answers or task-specific knowledge to
+  inflate the score. The harness must generalize beyond the eval cases you can see.
+- Ground the change in evidence. Before finalizing, point to the specific failures or
+  regressions in the traces that your change targets, and reason through why it fixes them.
+- State a falsifiable hypothesis. In your summary, say what you expect to improve and
+  why, so the next iteration can tell whether it held.
+"""
+
 
 class BaseProposer(ABC):
     """Abstract Proposer interface.
diff --git a/src/polyharness/proposer/cli_proposer.py b/src/polyharness/proposer/cli_proposer.py
index a8458c0..5ba8123 100644
--- a/src/polyharness/proposer/cli_proposer.py
+++ b/src/polyharness/proposer/cli_proposer.py
@@ -11,7 +11,7 @@
 from pathlib import Path
 
 from polyharness.proposer.adapters import CLIAdapter, get_adapter
-from polyharness.proposer.base import BaseProposer
+from polyharness.proposer.base import PROPOSER_PRINCIPLES, BaseProposer
 
 
 def _build_prompt(
@@ -73,7 +73,8 @@ def _build_prompt(
 - Do NOT modify files outside {cand_rel}/.
 - Do NOT delete or overwrite score.json or metadata.json (the evaluator writes those).
 - Aim for improvements that are testable and measurable.
-"""
+
+{PROPOSER_PRINCIPLES}"""
 
 
 class CLIProposer(BaseProposer):
diff --git a/src/polyharness/workspace.py b/src/polyharness/workspace.py
index 9b6dd76..1af7ea7 100644
--- a/src/polyharness/workspace.py
+++ b/src/polyharness/workspace.py
@@ -414,4 +414,17 @@ def _agent_instruction_content(backend: str, filename: str) -> str:
 - Compare the best candidate's traces with worse ones to spot patterns.
 - If a previous approach regressed, explain why and try a different angle.
 - Small, targeted edits beat large rewrites.
+
+## Improvement Principles
+
+- **Change a real mechanism, not just constants.** If your edit only tweaks
+  thresholds, weights, or wording versus the parent, it is a low-value parameter
+  variant — change the actual logic, strategy, or data structure instead.
+- **Stay general; do not overfit.** Never hardcode answers or task-specific
+  knowledge to inflate the score. The harness must generalize beyond the eval
+  cases you can see.
+- **Ground the change in evidence.** Point to the specific trace failures your
+  change targets, and reason through why it fixes them.
+- **State a falsifiable hypothesis.** Say what you expect to improve and why, so
+  the next iteration can tell whether it held.
 """
diff --git a/tests/test_cli_adapters.py b/tests/test_cli_adapters.py
index 974fc93..a197e28 100644
--- a/tests/test_cli_adapters.py
+++ b/tests/test_cli_adapters.py
@@ -177,6 +177,25 @@ def test_build_prompt_includes_leaderboard(tmp_path):
     assert "0.8" in prompt
 
 
+def test_proposer_prompts_include_improvement_principles(tmp_path):
+    """Both the CLI and API proposer prompts carry the shared improvement principles."""
+    from polyharness.proposer.api_proposer import _build_system_prompt
+    from polyharness.proposer.base import PROPOSER_PRINCIPLES
+
+    ws = tmp_path / "ws"
+    cand = ws / "candidates" / "iter_1"
+    cand.mkdir(parents=True)
+
+    cli_prompt = _build_prompt(ws, cand, 1, 0)
+    api_prompt = _build_system_prompt(ws, cand, 1, 0)
+
+    # The anti-parameter-tuning directive is the distinctive marker.
+    assert "parameter variant" in PROPOSER_PRINCIPLES
+    assert "parameter variant" in cli_prompt
+    assert "parameter variant" in api_prompt
+    assert "falsifiable hypothesis" in cli_prompt
+
+
 # ---------------------------------------------------------------------------
 # CLIProposer
 # ---------------------------------------------------------------------------
diff --git a/tests/test_config.py b/tests/test_config.py
index b10d48a..0d1da47 100644
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -53,6 +53,24 @@ def test_cascade_defaults_and_roundtrip():
     assert loaded.evaluator.cascade_stage1 == 3
 
 
+def test_eval_split_defaults_and_roundtrip():
+    cfg = PolyHarnessConfig()
+    assert cfg.evaluator.eval_split is False
+    assert cfg.evaluator.val_tasks == []
+    assert cfg.evaluator.test_tasks == []
+
+    cfg.evaluator.eval_split = True
+    cfg.evaluator.val_tasks = ["tasks/v1.json", "tasks/v2.json"]
+    cfg.evaluator.test_tasks = ["tasks/t1.json"]
+    with tempfile.TemporaryDirectory() as tmp:
+        path = Path(tmp) / "config.yaml"
+        cfg.to_yaml(path)
+        loaded = PolyHarnessConfig.from_yaml(path)
+    assert loaded.evaluator.eval_split is True
+    assert loaded.evaluator.val_tasks == ["tasks/v1.json", "tasks/v2.json"]
+    assert loaded.evaluator.test_tasks == ["tasks/t1.json"]
+
+
 def test_config_roundtrip_yaml():
     cfg = PolyHarnessConfig()
     cfg.proposer.backend = "claude-code"  # type: ignore[assignment]
diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py
index daf7a64..1a7f366 100644
--- a/tests/test_orchestrator.py
+++ b/tests/test_orchestrator.py
@@ -524,6 +524,45 @@ def test_cascade_base_always_full(tmp_path):
     assert ev.calls[0] == ["t1", "t2", "t3", "t4"]
 
 
+def test_eval_split_holds_out_test(tmp_path):
+    """With eval_split, search runs on val tasks; the best candidate is scored
+    once on the held-out test tasks at the end."""
+    ws = _setup_workspace(tmp_path)
+    config = ws.load_config()
+    config.search.max_iterations = 3
+    config.search.early_stop_patience = 10
+    config.evaluator.eval_split = True
+    config.evaluator.val_tasks = ["v1.json", "v2.json"]
+    config.evaluator.test_tasks = ["t1.json", "t2.json"]
+    ev = PerTaskEvaluator({"v1": 0.6, "v2": 0.6, "t1": 0.9, "t2": 0.9})
+
+    result = Orchestrator(ws, config, proposer=MockProposer(), evaluator=ev).run()
+
+    # Test tasks are scored exactly once (the held-out final eval), never during search.
+    assert sum(1 for c in ev.calls if c == ["t1", "t2"]) == 1
+    # The search loop (base + candidates) ran on val tasks.
+    assert sum(1 for c in ev.calls if c == ["v1", "v2"]) >= 2
+    # The held-out score is reported and persisted, and never drove selection.
+    assert result.test_score == 0.9
+    holdout = json.loads((ws.summary_dir / "holdout_test.json").read_text())
+    assert holdout["test_overall_score"] == 0.9
+
+
+def test_eval_split_off_by_default(tmp_path):
+    """Without eval_split there is no held-out eval (back-compat)."""
+    ws = _setup_workspace(tmp_path)
+    config = ws.load_config()
+    config.search.max_iterations = 2
+    config.search.early_stop_patience = 10
+
+    result = Orchestrator(
+        ws, config, proposer=MockProposer(), evaluator=MockEvaluator()
+    ).run()
+
+    assert result.test_score is None
+    assert not (ws.summary_dir / "holdout_test.json").exists()
+
+
 def test_orchestrator_error_recovery(tmp_path):
     """Orchestrator should skip failing iterations and continue."""