feat: borrow from the now-open Stanford Meta-Harness — proposer principles + held-out val/test (v0.2.4)#9
Merged
…0.2.4) Borrow (re-author, MIT-compatible) the high-value directives from the official Stanford Meta-Harness reference Skill and inject them into every Proposer prompt/instruction (API, CLI, and the workspace CLAUDE.md/AGENTS.md): change a real mechanism not just constants, don't overfit the eval set, ground changes in trace evidence, state a falsifiable hypothesis. Complements the post-hoc novelty filter by raising candidate quality at the source. Also: the Stanford Meta-Harness framework is now open-sourced (MIT) — corrected the now-stale 'never open-sourced' Backstory in README/README_CN and linked the official repo. 207 tests, lint clean.
Borrow the Stanford Meta-Harness val/test methodology: evolve the harness on val_tasks (selection, Pareto, early-stop all use val), then score ONLY the best candidate once on held-out test_tasks at the end. The test score never drives selection — an honest post-hoc number. Per-task mode only, off by default (eval_split). Shown in the run summary + ph best, persisted to summary/holdout_test.json. Folded into the 0.2.4 release alongside the proposer improvement principles. 210 tests, lint clean.
…ons) State explicitly that PolyHarness bundles NO third-party code — its techniques are independently re-implemented from public papers/docs/MIT repos (Stanford Meta-Harness, GEPA, ShinkaEvolve, OpenEvolve, Darwin Gödel Machine) and attributed inline. Add an Acknowledgments section (README/README_CN), expand the License line (MIT, (c) 2026 weijt606), and document the attribution policy in CONTRIBUTING. Part of the 0.2.4 release.
Leftover from the v0.2.3 adapter refresh: the 'ph shell-hook install' docstring still showed 'codex ...' / 'opencode -p ...'. Align with current invocations. Part of 0.2.4.
Summary
Stanford's Meta-Harness framework is now open-sourced (MIT). This release borrows
two battle-tested ideas from it — re-implemented in our own code, no third-party
source bundled — and tightens the project's open-source declarations.
What's new
Proposer improvement principles
Every Proposer prompt/instruction (API, CLI, and the injected CLAUDE.md/AGENTS.md)
now carries shared directives re-authored from the official Meta-Harness reference
Skill: change a real mechanism, not just constants; don't overfit to or hardcode the
eval set; ground changes in trace evidence; state a falsifiable hypothesis.
This raises candidate quality at the source and complements the post-hoc novelty filter.
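As a rough illustration of what sharing one set of directives across every Proposer surface could look like, here is a minimal sketch. All names in it (PROPOSER_PRINCIPLES, render_principles, with_principles, the surface keys and base prompts) are hypothetical and chosen for this example; they are not taken from the PolyHarness code, and the directive wording is paraphrased from the list above.

```python
# Hypothetical sketch: these names are illustrative, not PolyHarness's actual API.
PROPOSER_PRINCIPLES = (
    "Change a real mechanism, not just constants.",
    "Do not overfit to or hardcode the eval set.",
    "Ground every change in trace evidence.",
    "State a falsifiable hypothesis for the change.",
)


def render_principles(principles=PROPOSER_PRINCIPLES) -> str:
    """Format the shared directives as one numbered text block."""
    lines = ["Improvement principles:"]
    lines += [f"{i}. {p}" for i, p in enumerate(principles, start=1)]
    return "\n".join(lines)


def with_principles(base_prompt: str) -> str:
    """Append the shared block to a prompt or an instruction-file body."""
    return f"{base_prompt.rstrip()}\n\n{render_principles()}\n"


# Assumed surfaces and base prompts; each one receives the identical block.
surfaces = {
    "api": with_principles("You propose harness edits via the API."),
    "cli": with_principles("You propose harness edits from the CLI."),
    "workspace": with_principles("# Workspace instructions"),
}

if __name__ == "__main__":
    print(surfaces["workspace"])
```

Keeping the directives in a single constant means a wording change reaches all surfaces at once, so they cannot drift apart.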
Held-out val/test split (evaluator.eval_split, val_tasks, test_tasks)
Evolve the harness on val_tasks (selection, Pareto, and early-stop all use the
validation set), then score only the best candidate, once, on held-out test_tasks
at the end. The test score never drives selection; it is an honest, post-hoc number
that exposes harness overfitting to the eval set (the Meta-Harness val/test methodology).
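The selection rule can be sketched as follows. This is a simplified illustration and not the PolyHarness implementation: it assumes a single scalar score per task and picks the best candidate by plain argmax over the mean validation score, whereas the real run also involves Pareto selection and early stopping. The function names, the toy candidates, and the score table are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Hypothetical signature: score one candidate harness on one task.
ScoreFn = Callable[[str, str], float]


def mean_score(candidate: str, tasks: Sequence[str], score: ScoreFn) -> float:
    """Average a candidate's per-task scores over a task list."""
    return mean(score(candidate, task) for task in tasks)


def select_and_holdout(
    candidates: Sequence[str],
    val_tasks: Sequence[str],
    test_tasks: Sequence[str],
    score: ScoreFn,
) -> Dict[str, object]:
    """Pick the best candidate on val only, then score it once on test."""
    if set(val_tasks) & set(test_tasks):
        raise ValueError("val_tasks and test_tasks must not overlap")
    # Selection sees validation scores only.
    val_scores = {c: mean_score(c, val_tasks, score) for c in candidates}
    best = max(val_scores, key=val_scores.get)
    # The held-out score is computed after selection and never feeds back.
    test_score = mean_score(best, test_tasks, score)
    return {"best": best, "val_score": val_scores[best], "test_score": test_score}


# Toy data (made up): "tuned" fits the val tasks but generalises poorly.
TOY_SCORES = {
    ("tuned", "v1"): 1.0, ("tuned", "v2"): 1.0,
    ("tuned", "t1"): 0.0, ("tuned", "t2"): 0.5,
    ("general", "v1"): 0.5, ("general", "v2"): 1.0,
    ("general", "t1"): 1.0, ("general", "t2"): 0.5,
}

result = select_and_holdout(
    candidates=["tuned", "general"],
    val_tasks=["v1", "v2"],
    test_tasks=["t1", "t2"],
    score=lambda cand, task: TOY_SCORES[(cand, task)],
)

if __name__ == "__main__":
    print(result)
```

In this toy run the validation winner scores much lower on the held-out tasks, which is the kind of gap the held-out number is meant to surface.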
Per-task mode only; off by default. Shown in the run summary and ph best, and
persisted to summary/holdout_test.json.

Open-source / attribution
PolyHarness's positioning and linked the official repo.
Added an Acknowledgments section (README/README_CN) naming the projects
PolyHarness borrows ideas from (Stanford Meta-Harness, GEPA, ShinkaEvolve,
OpenEvolve, Darwin Gödel Machine) and stating explicitly that no third-party
code is bundled: ideas are re-implemented and attributed inline.
ph shell-hook install help string (codex exec / opencode run).

Compatibility & safety
new execution surface.
Testing
ruff check src/ tests/: clean
pytest tests/: 210 passed (+6: val/test-split + config + proposer-principles tests)
local backend): init → run → best verified on v0.2.4.