soar-eval: per-module eval harness for comparing architectural changes#581
Draft
kimjune01 wants to merge 6 commits into SoarGroup:development from
Conversation
Two tools for measuring architectural changes:

1. eval.py: transfer eval — trains a chunking agent on generated Blocks World tasks, then measures whether the learned chunks speed up held-out tasks. Three conditions: trained-transfer, fresh baseline, no-learning control.
2. module_eval.py: per-module eval — wraps existing test suites, captures quantitative stats per test (decisions, firings, WM, chunks), and diffs them across builds.

Two-layer design: visibility (raw facts) is separated from decision (maintainer-configured policy). The separation matters: instruments measure, policies decide. Mixing them creates Goodhart targets.
- Print agent name on first metric row for each changed agent
- Replace hardcoded absolute path with repo-relative path
Draft policy for discussion with the group:

- lower_is_better: decisions, elaboration_cycles, production_firings, wm_max, timing
- neutral: chunk count, user productions, wm_current/mean
- timing_noise_floor: 0.005 s (5 ms) — ignore absolute deltas below this
- timing_relative_threshold: 1.0% — ignore relative deltas below this
- Both thresholds must be exceeded to flag a timing change

apply_policy() now loads and applies the timing thresholds from the policy.
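A minimal sketch of the dual-threshold timing rule proposed above. The function name and policy keys here are illustrative, not the harness's actual apply_policy() implementation:

```python
def timing_change_flagged(old_s: float, new_s: float, policy: dict) -> bool:
    """Flag a timing delta only when BOTH thresholds are exceeded.

    Key names mirror the draft policy; defaults are the proposed
    values (5 ms noise floor, 1.0% relative threshold).
    """
    noise_floor = policy.get("timing_noise_floor", 0.005)          # seconds
    rel_threshold = policy.get("timing_relative_threshold", 0.01)  # fraction

    abs_delta = abs(new_s - old_s)
    rel_delta = abs_delta / old_s if old_s > 0 else float("inf")
    # Both conditions must hold: large enough in absolute AND relative terms.
    return abs_delta > noise_floor and rel_delta > rel_threshold
```

With these defaults, a 50 ms swing on a 10 s run is ignored (only 0.5% relative), and a 3 ms swing on a 0.1 s run is ignored (below the 5 ms noise floor).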
- Use REPO_ROOT for all subprocess cwd (relative paths resolve correctly)
- Resolve user-provided paths with expanduser().resolve()
- Fix EpMemFunctionalTests glob (was missing 3 agents)
- Add PerformanceTests suite with separate root dir
- Validated: 38 smem (37 pass), 47 epmem (47 pass), 15 perf (15 pass)
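The path-handling fixes above amount to two rules, sketched here; the REPO_ROOT derivation and helper names are assumptions about the harness internals:

```python
import subprocess
from pathlib import Path

# In the harness this would be derived from __file__; Path.cwd() stands
# in here so the sketch is self-contained.
REPO_ROOT = Path.cwd()

def resolve_user_path(raw: str) -> Path:
    """Expand ~ and resolve symlinks so user-supplied paths are absolute."""
    return Path(raw).expanduser().resolve()

def run_in_repo(cmd: list) -> subprocess.CompletedProcess:
    """Pin the subprocess cwd to REPO_ROOT so relative paths inside the
    command resolve against the repository, not the caller's directory."""
    return subprocess.run(cmd, cwd=REPO_ROOT, capture_output=True, text=True)
```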
What this is
A Python eval harness that runs existing Soar test agents against two builds, captures per-agent quantitative metrics, and reports build-to-build deltas. No kernel changes required.
The main tool is module_eval.py, which runs suites like ChunkingTests and records decision cycles, production firings, WM peak, chunks learned, and CPU time per agent. Outputs JSON.

A second tool, transfer_eval.py, measures chunking transfer on generated Blocks World tasks: does learning on one set of tasks speed up a different set?

Why
In the semantic memory thread, John asked how alternative implementations might be evaluated and compared. With three independent researchers working on semantic learning, we need a shared way to measure what our changes actually do.
This is a first step in that direction. It's intentionally simple — wrap what Soar already has, capture numbers, compare builds. The goal is to get a measurement loop running, then improve it together.
Design: measurements separate from acceptance criteria
The harness separates what changed from whether it's good.
This separation is deliberate. When instruments and judgments live in the same layer, contributors optimize the classifier instead of the system. Keeping them apart means the measurement stays honest as the policy evolves.
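A minimal sketch of that two-layer split, with assumed helper names: the visibility layer emits raw deltas and nothing else; the decision layer applies the maintainer-owned policy afterwards:

```python
def diff_facts(baseline: dict, candidate: dict) -> dict:
    """Visibility layer: raw per-metric deltas, no judgment attached."""
    return {m: candidate[m] - baseline[m] for m in baseline}

def judge(deltas: dict, policy: dict) -> dict:
    """Decision layer: only the policy decides whether a delta is bad."""
    verdicts = {}
    for metric, delta in deltas.items():
        if metric in policy.get("lower_is_better", []):
            verdicts[metric] = "regression" if delta > 0 else "ok"
        else:
            verdicts[metric] = "neutral"  # e.g. chunk count in the draft policy
    return verdicts
```

Because diff_facts() never consults the policy, its output stays a stable measurement even as the policy evolves.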
Example
Comparing upstream vs a kernel change across 19 chunking demo agents:
In this run, no other agent changed on deterministic metrics.
How metrics are collected
Metrics are parsed from Soar CLI stats output and the per-run --> summary line. Seeds are fixed and recorded in JSON so runs are reproducible.

Usage
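The stats parsing described under "How metrics are collected" can be sketched like this; the exact line formats emitted by Soar's stats command vary by version, so the patterns below are illustrative assumptions:

```python
import re

# Assumed line shapes such as "100 decisions" or "Maximum working memory
# size: 300"; adjust the patterns to the actual build's stats output.
METRIC_PATTERNS = {
    "decisions": re.compile(r"(\d+)\s+decisions"),
    "production_firings": re.compile(r"(\d+)\s+production firings"),
    "wm_max": re.compile(r"[Mm]aximum working memory size:\s+(\d+)"),
}

def parse_stats(stats_text: str) -> dict:
    """Return whichever known metrics appear in the captured stats text."""
    metrics = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = pattern.search(stats_text)
        if match:
            metrics[name] = int(match.group(1))
    return metrics
```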
What's included
Where this is headed
This is the starting point, not the finish line. Next steps I'd like to work on with the group:
- policy.json that the maintainer owns

Limitations
Review request
I'd appreciate feedback on:
Test plan
- module_eval.py run on ChunkingTests produces valid JSON
- module_eval.py compare detects real differences between builds
- module_eval.py compare --facts-only reports raw deltas without judgment
- transfer_eval.py runs across fixed seeds and reports transfer metrics