soar-eval: per-module eval harness for comparing architectural changes #581

Draft
kimjune01 wants to merge 6 commits into SoarGroup:development from kimjune01:soar-eval

@kimjune01 kimjune01 commented Apr 12, 2026

What this is

A Python eval harness that runs existing Soar test agents against two builds, captures per-agent quantitative metrics, and reports build-to-build deltas. No kernel changes required.

The main tool is module_eval.py, which runs suites like ChunkingTests and records decision cycles, production firings, working-memory peak, chunks learned, and CPU time per agent, writing the results as JSON.

A second tool, transfer_eval.py, measures chunking transfer on generated Blocks World tasks: does learning on one set of tasks speed up a different set?

Why

In the semantic memory thread, John asked how alternative implementations might be evaluated and compared. With three independent researchers working on semantic learning, we need a shared way to measure what our changes actually do.

This is a first step in that direction. It's intentionally simple — wrap what Soar already has, capture numbers, compare builds. The goal is to get a measurement loop running, then improve it together.

Design: measurements separate from acceptance criteria

The harness separates what changed from whether it's good.

  • Layer 1 (visibility): raw deltas per test per metric. What changed, by how much.
  • Layer 2 (decision): a configurable policy that the maintainer controls. Which metrics matter, what noise to ignore, what counts as a regression.

This separation is deliberate. When instruments and judgments live in the same layer, contributors optimize the classifier instead of the system. Keeping them apart means the measurement stays honest as the policy evolves.
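A minimal sketch of what the two-layer split might look like (hypothetical names, not the harness's actual code): layer 1 only computes facts, layer 2 applies a swappable policy on top of them.

```python
# Layer 1: visibility -- pure measurement, no judgment.
def raw_deltas(base, cand):
    """Raw per-metric deltas for metrics present in both runs."""
    return {m: cand[m] - base[m] for m in base.keys() & cand.keys()}

# Layer 2: decision -- a maintainer-owned policy applied on top of the facts.
LOWER_IS_BETTER = {"decisions", "production_firings", "wm_max"}  # example policy

def judge(deltas, lower_is_better=LOWER_IS_BETTER):
    """Classify each delta; metrics outside the policy stay neutral."""
    verdicts = {}
    for metric, d in deltas.items():
        if metric not in lower_is_better or d == 0:
            verdicts[metric] = "neutral"
        elif d < 0:
            verdicts[metric] = "improvement"
        else:
            verdicts[metric] = "regression"
    return verdicts
```

Swapping the policy set changes every verdict without touching a single measurement, which is the point of keeping the layers apart.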

Example

Comparing upstream vs a kernel change across 19 chunking demo agents:

Agent                                          Metric              Base  Candidate   Delta      %
BW_Hierarchical_Look_Ahead                     decisions             66         46     -20  -30.3%
BW_Hierarchical_Look_Ahead                     production_firings   651        340    -311  -47.8%
BW_Hierarchical_Look_Ahead                     wm_max               679        235    -444  -65.4%
BW_Hierarchical_Look_Ahead                     productions_chunks     3          1      -2  -66.7%

In this run, no other agent changed on deterministic metrics.
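The delta and percent columns above are plain arithmetic; a hypothetical helper (not the harness's actual code) that reproduces those rows:

```python
def delta_rows(base, candidate):
    """Yield (metric, base, cand, delta, pct) rows for metrics present in both runs."""
    for metric in sorted(base.keys() & candidate.keys()):
        b, c = base[metric], candidate[metric]
        d = c - b
        pct = 100.0 * d / b if b else 0.0  # guard against a zero baseline
        yield metric, b, c, d, pct

# Numbers from the BW_Hierarchical_Look_Ahead example above.
base = {"decisions": 66, "production_firings": 651, "wm_max": 679}
cand = {"decisions": 46, "production_firings": 340, "wm_max": 235}
for metric, b, c, d, pct in delta_rows(base, cand):
    print(f"{metric:<20} {b:>5} {c:>5} {d:>+6} {pct:>+7.1f}%")
```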

How metrics are collected

Metrics are parsed from Soar CLI stats output and the per-run --> summary line. Seeds are fixed and recorded in JSON so runs are reproducible.
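Parsing stats output reduces to a regex per metric. The sketch below is illustrative only: the label strings in SAMPLE_STATS and PATTERNS are assumptions, not the real Soar CLI output format that the harness's actual regexes match.

```python
import re

# Assumed output shape for illustration; the real Soar stats format may differ.
SAMPLE_STATS = """\
46 decisions
340 production firings
235 maximum working memory size
"""

PATTERNS = {
    "decisions": re.compile(r"(\d+)\s+decisions"),
    "production_firings": re.compile(r"(\d+)\s+production firings"),
    "wm_max": re.compile(r"(\d+)\s+maximum working memory"),
}

def parse_stats(text):
    """Pull integer metrics out of stats text; missing metrics are simply omitted."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            metrics[name] = int(match.group(1))
    return metrics
```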

Usage

# Capture baseline
python soar-eval/module_eval.py run --soar build/SoarCLI/soar --suite ChunkingTests --out upstream.json

# Capture candidate
python soar-eval/module_eval.py run --soar build-pr/SoarCLI/soar --suite ChunkingTests --out candidate.json

# Compare — just the numbers
python soar-eval/module_eval.py compare --base upstream.json --candidate candidate.json --facts-only

# Compare — with policy
python soar-eval/module_eval.py compare --base upstream.json --candidate candidate.json

What's included

soar-eval/
├── module_eval.py               # Run suites, capture stats, diff builds
├── transfer_eval.py             # Chunking transfer on generated BW tasks
├── tasks/blocks_world.py        # BW task generator
├── agents/bw-op-subgoal-base.soar   # BW chunking agent (relative paths)
└── results/                     # Output directory (not committed)

Where this is headed

This is the starting point, not the finish line. Next steps I'd like to work on with the group:

  • Expand to smem, epmem, and performance suites
  • Define what a semantic-learning test agent should look like
  • Add a policy.json that the maintainer owns
  • Have each researcher provide a test agent for their approach
  • Compare all approaches on the same harness

Limitations

  • Only ChunkingTests validated end-to-end so far
  • Transfer eval is chunking-specific
  • CPU timing is machine-dependent — multiple runs recommended
  • BW task generator is 3-block only (13 states) — a smoke test, not a benchmark
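The 13-state figure checks out by brute force: a Blocks World state is a set of stacks (ordered tuples of distinct blocks), and enumerating every way to cut every ordering of three blocks into stacks yields exactly 13 distinct states. Illustrative code, not the generator itself:

```python
from itertools import permutations

def blocks_world_states(blocks):
    """All distinct states: each state is a frozenset of stacks (bottom-to-top tuples)."""
    n = len(blocks)
    states = set()
    for perm in permutations(blocks):
        # Each bitmask chooses which of the n-1 gaps in this ordering start a new stack.
        for mask in range(1 << max(n - 1, 0)):
            stacks, start = [], 0
            for i in range(n - 1):
                if mask >> i & 1:
                    stacks.append(perm[start:i + 1])
                    start = i + 1
            stacks.append(perm[start:])
            states.add(frozenset(stacks))
    return states

print(len(blocks_world_states(("A", "B", "C"))))  # -> 13
```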

Review request

I'd appreciate feedback on:

  • Is this structure useful?
  • Are these the right initial metrics?
  • Which suites should we prioritize next?

Test plan

  • module_eval.py run on ChunkingTests produces valid JSON
  • module_eval.py compare detects real differences between builds
  • module_eval.py compare --facts-only reports raw deltas without judgment
  • transfer_eval.py runs across fixed seeds and reports transfer metrics
  • Run on smem/epmem suites
  • Test on a third-party machine

Two tools for measuring architectural changes:

1. eval.py: transfer eval — trains chunking agent on generated Blocks
   World tasks, measures whether learned chunks speed up held-out tasks.
   Three conditions: trained-transfer, fresh baseline, no-learning control.

2. module_eval.py: per-module eval — wraps existing test suites, captures
   quantitative stats per test (decisions, firings, WM, chunks), diffs
   across builds. Two-layer design: visibility (raw facts) separated from
   decision (maintainer-configured policy).

The separation matters: instruments measure, policies decide. Mixing
them creates Goodhart targets.
- Print agent name on first metric row for each changed agent
- Replace hardcoded absolute path with repo-relative path
Draft policy for discussion with the group:
- lower_is_better: decisions, elaboration_cycles, production_firings, wm_max, timing
- neutral: chunk count, user productions, wm_current/mean
- timing_noise_floor: 0.005s (5ms) — ignore absolute deltas below this
- timing_relative_threshold: 1.0% — ignore relative deltas below this
- Both thresholds must be exceeded to flag a timing change
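The both-thresholds rule above might look like this (a sketch with hypothetical names; apply_policy()'s actual implementation may differ):

```python
def timing_change_flagged(base_s, cand_s, noise_floor=0.005, rel_threshold=0.01):
    """Flag a timing delta (seconds) only when it clears BOTH thresholds."""
    delta = abs(cand_s - base_s)
    if delta < noise_floor:                            # under the 5 ms noise floor
        return False
    if base_s > 0 and delta / base_s < rel_threshold:  # under 1% relative change
        return False
    return True
```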

apply_policy() now loads and applies timing thresholds from policy.
- Use REPO_ROOT for all subprocess cwd (relative paths resolve correctly)
- Resolve user-provided paths with expanduser().resolve()
- Fix EpMemFunctionalTests glob (was missing 3 agents)
- Add PerformanceTests suite with separate root dir
- Validated: 38 smem (37 pass), 47 epmem (47 pass), 15 perf (15 pass)