Description
In the eval harness, run_nlcmd_impl() tracks only orchestrator-level tokens from the top-level claude -p call via _parse_claude_usage(). It misses all tokens consumed by the subagents spawned via the Task tool (understander, bold-proposer, critique, reducer, consensus). As a result, nlcmd reports ~$0.91/task while the actual cost is likely ~$20-30+/task.
The sanity check that reveals the bug:
- full mode: $22/task in 2,494s → $0.0088/second
- nlcmd mode: $0.91/task in 5,309s → $0.00017/second
- Both use Opus for planning and Sonnet for impl → cost-per-second should be similar
- A roughly 50x difference in cost-per-second is implausible with the same models
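The arithmetic behind this check can be reproduced in a few lines of Python. The inputs are the rounded figures quoted above; the function and variable names are only illustrative and do not exist in the harness.

```python
def cost_per_second(cost_usd: float, seconds: float) -> float:
    """Average spend rate in USD per second of wall-clock time."""
    return cost_usd / seconds


# Rounded per-task figures from the two eval modes.
full_rate = cost_per_second(22.0, 2494)
nlcmd_rate = cost_per_second(0.91, 5309)

# How many times cheaper per second nlcmd appears to be.
ratio = full_rate / nlcmd_rate

print(f"full:  ${full_rate:.5f}/s")
print(f"nlcmd: ${nlcmd_rate:.5f}/s")
print(f"ratio: {ratio:.1f}x")
```

Because the inputs are rounded, the printed ratio may differ slightly from a figure computed on the raw run data.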
Affected module: python/agentize/eval/eval_harness.py
Proposed Solution
File Changes
| File | Level | Purpose |
|---|---|---|
| python/agentize/eval/eval_harness.md | minor | Update Execution Modes table and nlcmd cost-tracking description |
| python/tests/test_eval_harness.py | minor | Update cost_note assertion; add test verifying JSONL tracking path |
| python/agentize/eval/eval_harness.py | major | Snapshot JSONL files before Phase 1, sum new files after Phase 2, remove _parse_claude_usage call and planning_tokens/planning_cost_usd fields |
Implementation Steps
Step 1: Update documentation (Estimated: ~15 LOC)
python/agentize/eval/eval_harness.md — Add nlcmd row to the Mode table. Add nlcmd subsection explaining JSONL-based cost tracking covers both Phase 1 (NL planning) and Phase 2 (FSM impl). Remove stale "subagent tokens not included" text.
Step 2: Update tests (Estimated: ~15 LOC)
- python/tests/test_eval_harness.py, TestNlcmdImpl.test_result_has_planner_cmd: Update the expected cost_note value.
- Same file, TestNlcmdImpl: Add test_jsonl_cost_tracking_applied, which monkeypatches _list_jsonl_files and _sum_jsonl_usage to verify the JSONL path is exercised.
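A rough sketch of what the new test could look like follows. It is self-contained, so it patches a stand-in object rather than the real module: `harness` and `_run_nlcmd_impl_standin` are hypothetical, the real run_nlcmd_impl() signature is not shown in this issue, and `unittest.mock` is used here in place of pytest's monkeypatch fixture. The point it illustrates is that _list_jsonl_files is called twice, so the patched version has to return a different set on each call.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the eval_harness module. Only the JSONL
# bookkeeping is modelled; the real function does far more.
harness = SimpleNamespace()


def _run_nlcmd_impl_standin():
    files_before = harness._list_jsonl_files()
    # ... Phase 1 (NL planning) and Phase 2 (FSM impl) would run here ...
    result = {"cost_note": "cost estimated from new JSONL session files"}
    new_files = sorted(harness._list_jsonl_files() - files_before)
    if new_files:
        usage = harness._sum_jsonl_usage(new_files)
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
    return result


harness._list_jsonl_files = lambda: set()
harness._sum_jsonl_usage = lambda files: {"tokens": 0, "cost_usd": 0.0}
harness.run_nlcmd_impl = _run_nlcmd_impl_standin


def test_jsonl_cost_tracking_applied():
    # First call is the pre-Phase-1 snapshot, second is the post-Phase-2 listing.
    list_mock = mock.Mock(side_effect=[{"old.jsonl"}, {"old.jsonl", "new.jsonl"}])
    sum_mock = mock.Mock(return_value={"tokens": 1234, "cost_usd": 5.67})
    with mock.patch.object(harness, "_list_jsonl_files", list_mock), \
         mock.patch.object(harness, "_sum_jsonl_usage", sum_mock):
        result = harness.run_nlcmd_impl()

    # Only the file created during the run should be summed.
    sum_mock.assert_called_once_with(["new.jsonl"])
    assert result["tokens"] == 1234
    assert result["cost_usd"] == 5.67
    assert result["cost_note"] == "cost estimated from new JSONL session files"
```

In the real test file the same idea would presumably be expressed with `monkeypatch.setattr` against the eval_harness module, as the step above describes.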
Step 3: Implement JSONL tracking in run_nlcmd_impl() (Estimated: ~15 LOC)
python/agentize/eval/eval_harness.py, run_nlcmd_impl():
- Add files_before = _list_jsonl_files() before Phase 1 starts (~line 870)
- Update cost_note to "cost estimated from new JSONL session files"
- Remove lines 913-920 (_parse_claude_usage call and planning_tokens/planning_cost_usd)
- After Phase 2 completion (~line 999), add JSONL diff block:
    files_after = _list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = _sum_jsonl_usage(new_files)
        result["input_tokens"] = usage["input_tokens"]
        result["output_tokens"] = usage["output_tokens"]
        result["cache_read_tokens"] = usage["cache_read"]
        result["cache_write_tokens"] = usage["cache_write"]
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
- Also add JSONL diff to timeout early-return path (~line 961)
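Since the diff block is needed on both the normal and the timeout path, one option is to factor it into a small helper. The sketch below is an illustration under stated assumptions, not the harness code: `list_jsonl_files`, `sum_jsonl_tokens`, and `apply_new_file_usage` are hypothetical stand-ins, the session directory is passed in explicitly, the record layout (`message.usage` with the key names shown) is assumed, and cost is left out because pricing is not given here.

```python
import json
from pathlib import Path

# Result key -> assumed key inside each record's message.usage object.
USAGE_KEYS = {
    "input_tokens": "input_tokens",
    "output_tokens": "output_tokens",
    "cache_read": "cache_read_input_tokens",
    "cache_write": "cache_creation_input_tokens",
}


def list_jsonl_files(root: Path) -> set:
    """Hypothetical stand-in for _list_jsonl_files(): every *.jsonl under root."""
    return set(root.rglob("*.jsonl"))


def sum_jsonl_tokens(files) -> dict:
    """Hypothetical stand-in for _sum_jsonl_usage(), token counts only."""
    totals = dict.fromkeys(USAGE_KEYS, 0)
    for path in files:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or partially written lines
                message = record.get("message") if isinstance(record, dict) else None
                usage = message.get("usage") if isinstance(message, dict) else None
                if not isinstance(usage, dict):
                    continue
                for out_key, src_key in USAGE_KEYS.items():
                    totals[out_key] += int(usage.get(src_key) or 0)
    # "tokens" is assumed here to be the sum of all four counters.
    totals["tokens"] = sum(totals[key] for key in USAGE_KEYS)
    return totals


def apply_new_file_usage(result: dict, root: Path, files_before: set) -> dict:
    """Add usage from files created since the snapshot; no-op if there are none."""
    new_files = sorted(list_jsonl_files(root) - files_before)
    if new_files:
        result.update(sum_jsonl_tokens(new_files))
    return result
```

Calling the same helper from the Phase 2 completion path and from the timeout early-return path keeps the two in sync. One limitation of a pure set difference is that it only sees files created after the snapshot: a file that already existed and was appended to during the run is not counted, and new files written by an unrelated concurrent session are.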
Test Strategy
python -m pytest python/tests/test_eval_harness.py -v
- Updated test_result_has_planner_cmd must pass (cost_note field present)
- New test_jsonl_cost_tracking_applied must pass (JSONL diff path active)
Related PR
TBD - will be updated when PR is created