soar-eval: per-module eval harness for comparing architectural changes #581

Draft
kimjune01 wants to merge 6 commits into SoarGroup:development from kimjune01:soar-eval

@kimjune01 kimjune01 commented Apr 12, 2026

What this is

A Python eval harness that runs existing Soar test agents against two builds, captures per-agent quantitative metrics, and reports build-to-build deltas. No kernel changes required.

The main tool is module_eval.py, which runs suites like ChunkingTests and records decision cycles, production firings, working-memory peak, chunks learned, and CPU time per agent, writing the results as JSON.

A second tool, transfer_eval.py, measures chunking transfer on generated Blocks World tasks: does learning on one set of tasks speed up a different set?

Why

In the semantic memory thread, John asked how alternative implementations might be evaluated and compared. With three independent researchers working on semantic learning, we need a shared way to measure what our changes actually do.

This is a first step in that direction. It's intentionally simple — wrap what Soar already has, capture numbers, compare builds. The goal is to get a measurement loop running, then improve it together.

Design: measurements separate from acceptance criteria

The harness separates what changed from whether it's good.

  • Layer 1 (visibility): raw deltas per test per metric. What changed, by how much.
  • Layer 2 (decision): a configurable policy that the maintainer controls. Which metrics matter, what noise to ignore, what counts as a regression.

This separation is deliberate. When instruments and judgments live in the same layer, contributors optimize the classifier instead of the system. Keeping them apart means the measurement stays honest as the policy evolves.
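A minimal sketch of what the two-layer split might look like (hypothetical names, not the harness's actual code): layer 1 only computes facts, layer 2 applies a swappable policy on top of them.

```python
# Layer 1: visibility -- pure measurement, no judgment.
def raw_deltas(base, cand):
    """Raw per-metric deltas for metrics present in both runs."""
    return {m: cand[m] - base[m] for m in base.keys() & cand.keys()}

# Layer 2: decision -- a maintainer-owned policy applied on top of the facts.
LOWER_IS_BETTER = {"decisions", "production_firings", "wm_max"}  # example policy

def judge(deltas, lower_is_better=LOWER_IS_BETTER):
    """Classify each delta; metrics outside the policy stay neutral."""
    verdicts = {}
    for metric, d in deltas.items():
        if metric not in lower_is_better or d == 0:
            verdicts[metric] = "neutral"
        elif d < 0:
            verdicts[metric] = "improvement"
        else:
            verdicts[metric] = "regression"
    return verdicts
```

Swapping the policy set changes every verdict without touching a single measurement, which is the point of keeping the layers apart.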

Example

Comparing upstream vs a kernel change across 19 chunking demo agents:

Agent                                          Metric              Base  Candidate   Delta      %
BW_Hierarchical_Look_Ahead                     decisions             66         46     -20  -30.3%
BW_Hierarchical_Look_Ahead                     production_firings   651        340    -311  -47.8%
BW_Hierarchical_Look_Ahead                     wm_max               679        235    -444  -65.4%
BW_Hierarchical_Look_Ahead                     productions_chunks     3          1      -2  -66.7%

In this run, no other agent changed on deterministic metrics.
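The delta and percent columns above are plain arithmetic; a hypothetical helper (not the harness's actual code) that reproduces those rows:

```python
def delta_rows(base, candidate):
    """Yield (metric, base, cand, delta, pct) rows for metrics present in both runs."""
    for metric in sorted(base.keys() & candidate.keys()):
        b, c = base[metric], candidate[metric]
        d = c - b
        pct = 100.0 * d / b if b else 0.0  # guard against a zero baseline
        yield metric, b, c, d, pct

# Numbers from the BW_Hierarchical_Look_Ahead example above.
base = {"decisions": 66, "production_firings": 651, "wm_max": 679}
cand = {"decisions": 46, "production_firings": 340, "wm_max": 235}
for metric, b, c, d, pct in delta_rows(base, cand):
    print(f"{metric:<20} {b:>5} {c:>5} {d:>+6} {pct:>+7.1f}%")
```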

How metrics are collected

Metrics are parsed from Soar CLI stats output and the per-run --> summary line. Seeds are fixed and recorded in JSON so runs are reproducible.
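Parsing stats output reduces to a regex per metric. The sketch below is illustrative only: the label strings in SAMPLE_STATS and PATTERNS are assumptions, not the real Soar CLI output format that the harness's actual regexes match.

```python
import re

# Assumed output shape for illustration; the real Soar stats format may differ.
SAMPLE_STATS = """\
46 decisions
340 production firings
235 maximum working memory size
"""

PATTERNS = {
    "decisions": re.compile(r"(\d+)\s+decisions"),
    "production_firings": re.compile(r"(\d+)\s+production firings"),
    "wm_max": re.compile(r"(\d+)\s+maximum working memory"),
}

def parse_stats(text):
    """Pull integer metrics out of stats text; missing metrics are simply omitted."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            metrics[name] = int(match.group(1))
    return metrics
```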

Usage

# Capture baseline
python soar-eval/module_eval.py run --soar build/SoarCLI/soar --suite ChunkingTests --out upstream.json

# Capture candidate
python soar-eval/module_eval.py run --soar build-pr/SoarCLI/soar --suite ChunkingTests --out candidate.json

# Compare — just the numbers
python soar-eval/module_eval.py compare --base upstream.json --candidate candidate.json --facts-only

# Compare — with policy
python soar-eval/module_eval.py compare --base upstream.json --candidate candidate.json

What's included

soar-eval/
├── module_eval.py               # Run suites, capture stats, diff builds
├── transfer_eval.py             # Chunking transfer on generated BW tasks
├── tasks/blocks_world.py        # BW task generator
├── agents/bw-op-subgoal-base.soar   # BW chunking agent (relative paths)
└── results/                     # Output directory (not committed)

Where this is headed

This is the starting point, not the finish line. Next steps I'd like to work on with the group:

  • Expand to smem, epmem, and performance suites
  • Define what a semantic-learning test agent should look like
  • Add a policy.json that the maintainer owns
  • Have each researcher provide a test agent for their approach
  • Compare all approaches on the same harness

Limitations

  • Only ChunkingTests validated end-to-end so far
  • Transfer eval is chunking-specific
  • CPU timing is machine-dependent — multiple runs recommended
  • BW task generator is 3-block only (13 states) — a smoke test, not a benchmark
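The 13-state figure checks out by brute force: a Blocks World state is a set of stacks (ordered tuples of distinct blocks), and enumerating every way to cut every ordering of three blocks into stacks yields exactly 13 distinct states. Illustrative code, not the generator itself:

```python
from itertools import permutations

def blocks_world_states(blocks):
    """All distinct states: each state is a frozenset of stacks (bottom-to-top tuples)."""
    n = len(blocks)
    states = set()
    for perm in permutations(blocks):
        # Each bitmask chooses which of the n-1 gaps in this ordering start a new stack.
        for mask in range(1 << max(n - 1, 0)):
            stacks, start = [], 0
            for i in range(n - 1):
                if mask >> i & 1:
                    stacks.append(perm[start:i + 1])
                    start = i + 1
            stacks.append(perm[start:])
            states.add(frozenset(stacks))
    return states

print(len(blocks_world_states(("A", "B", "C"))))  # -> 13
```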

Review request

I'd appreciate feedback on:

  • Is this structure useful?
  • Are these the right initial metrics?
  • Which suites should we prioritize next?

Test plan

  • module_eval.py run on ChunkingTests produces valid JSON
  • module_eval.py compare detects real differences between builds
  • module_eval.py compare --facts-only reports raw deltas without judgment
  • transfer_eval.py runs across fixed seeds and reports transfer metrics
  • Run on smem/epmem suites
  • Test on a third-party machine

Two tools for measuring architectural changes:

1. eval.py: transfer eval — trains chunking agent on generated Blocks
   World tasks, measures whether learned chunks speed up held-out tasks.
   Three conditions: trained-transfer, fresh baseline, no-learning control.

2. module_eval.py: per-module eval — wraps existing test suites, captures
   quantitative stats per test (decisions, firings, WM, chunks), diffs
   across builds. Two-layer design: visibility (raw facts) separated from
   decision (maintainer-configured policy).

The separation matters: instruments measure, policies decide. Mixing
them creates Goodhart targets.
- Print agent name on first metric row for each changed agent
- Replace hardcoded absolute path with repo-relative path
Draft policy for discussion with the group:
- lower_is_better: decisions, elaboration_cycles, production_firings, wm_max, timing
- neutral: chunk count, user productions, wm_current/mean
- timing_noise_floor: 0.005s (5ms) — ignore absolute deltas below this
- timing_relative_threshold: 1.0% — ignore relative deltas below this
- Both thresholds must be exceeded to flag a timing change
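The both-thresholds rule above might look like this (a sketch with hypothetical names; apply_policy()'s actual implementation may differ):

```python
def timing_change_flagged(base_s, cand_s, noise_floor=0.005, rel_threshold=0.01):
    """Flag a timing delta (seconds) only when it clears BOTH thresholds."""
    delta = abs(cand_s - base_s)
    if delta < noise_floor:                            # under the 5 ms noise floor
        return False
    if base_s > 0 and delta / base_s < rel_threshold:  # under 1% relative change
        return False
    return True
```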

apply_policy() now loads and applies timing thresholds from policy.
- Use REPO_ROOT for all subprocess cwd (relative paths resolve correctly)
- Resolve user-provided paths with expanduser().resolve()
- Fix EpMemFunctionalTests glob (was missing 3 agents)
- Add PerformanceTests suite with separate root dir
- Validated: 38 smem (37 pass), 47 epmem (47 pass), 15 perf (15 pass)