feat(harness): wire marginal-utility tracker into task ledger #1451
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a28f015eb
const scores = candidateWindow.map((r) => r.p);
scores.push(score);
Use raw step scores in the sliding-window average
When the signal pattern changes, this re-averages prior p estimates instead of the per-step scores the formula describes. For example, a pass/fail/pass sequence should average raw scores [1,0,1], but the second entry becomes the previous weighted p, so lastP is inflated and later deltas/plateau detection no longer represent the last window of tool calls. Store the raw step score in the window (or otherwise average raw scores) rather than feeding prior estimates back into the calculation.
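A minimal sketch of the suggested fix, assuming a plain unweighted mean and a window of three steps (the actual formula, weights, and window length are not shown in this thread). `WindowEntry`, `pushScore`, and `WINDOW_SIZE` are hypothetical names; only `p` and the idea of a per-step score come from the review. Each entry keeps its raw score next to the estimate, so later averages never read a prior `p`.

```typescript
// Hypothetical entry shape: the raw score is stored alongside the estimate.
interface WindowEntry {
  score: number; // raw per-step score in [0, 1]
  p: number;     // windowed estimate computed when this step was recorded
}

const WINDOW_SIZE = 3; // assumed window length, for illustration only

function pushScore(window: WindowEntry[], score: number): WindowEntry[] {
  // Keep at most WINDOW_SIZE - 1 previous entries, then add the new step.
  const recent = window.slice(-(WINDOW_SIZE - 1));
  // Average raw scores only; prior `p` values are never fed back in.
  const scores = recent.map((r) => r.score);
  scores.push(score);
  const p = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return [...recent, { score, p }];
}

// pass / fail / pass
let w: WindowEntry[] = [];
for (const s of [1, 0, 1]) w = pushScore(w, s);
const lastP = w[w.length - 1].p; // mean of [1, 0, 1]
```

With prior estimates fed back in, the same sequence would average [1, 0.5, 1] instead, which is the inflation the review describes.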
};
}

if (summary.consecutive_low_delta < r.plateau_steps) {
Honor custom plateau_delta when deciding plateaus
For callers that pass a non-default plateau_delta, the recommendation still uses summary.consecutive_low_delta, which is computed by summary() with the hard-coded LOW_DELTA_THRESHOLD = 0.02; the resolved r.plateau_delta is only returned in the policy snapshot and never affects should_stop. This makes stricter or looser early-stop thresholds silently ineffective, so hosts cannot tune the plateau condition as the policy API promises.
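One way to honor the resolved threshold is to count low-delta steps with `r.plateau_delta` instead of reading a count that was computed against the hard-coded constant. The sketch below is an assumption-laden illustration, not the module's code: `CurvePoint`, `ResolvedPolicy`, `consecutiveLowDelta`, `shouldStop`, the use of `Math.abs`, the strict `<` comparison, and the default of 3 for `plateau_steps` are all hypothetical. Only `plateau_delta`, `plateau_steps`, and the 0.02 default appear in the review.

```typescript
// Hypothetical shapes for illustration.
interface CurvePoint { step: number; p: number; delta: number; }
interface ResolvedPolicy { plateau_delta: number; plateau_steps: number; }

const DEFAULT_PLATEAU_DELTA = 0.02; // matches the default named in the review
const DEFAULT_PLATEAU_STEPS = 3;    // assumed default, not from the source

// Count trailing steps whose |delta| is below the supplied threshold.
function consecutiveLowDelta(curve: CurvePoint[], threshold: number): number {
  let n = 0;
  for (let i = curve.length - 1; i >= 0; i--) {
    if (Math.abs(curve[i].delta) >= threshold) break;
    n++;
  }
  return n;
}

function shouldStop(curve: CurvePoint[], policy: Partial<ResolvedPolicy> = {}): boolean {
  const r: ResolvedPolicy = {
    plateau_delta: policy.plateau_delta ?? DEFAULT_PLATEAU_DELTA,
    plateau_steps: policy.plateau_steps ?? DEFAULT_PLATEAU_STEPS,
  };
  // The resolved threshold drives the count, so custom values take effect.
  return consecutiveLowDelta(curve, r.plateau_delta) >= r.plateau_steps;
}

const curve: CurvePoint[] = [
  { step: 1, p: 0.20, delta: 0.20 },
  { step: 2, p: 0.23, delta: 0.03 },
  { step: 3, p: 0.27, delta: 0.04 },
  { step: 4, p: 0.30, delta: 0.03 },
];
const stopDefault = shouldStop(curve);                        // threshold 0.02
const stopLoose = shouldStop(curve, { plateau_delta: 0.05 }); // threshold 0.05
```

With the default threshold none of the trailing deltas count as low, while the looser 0.05 threshold sees three low steps in a row, so the two calls disagree as a tunable policy should.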
Every tool call routed through applyToolCallToTask now feeds a StepSignal into recordStep(). The resulting MarginalUtilityState is persisted on TaskMeta._mu_state and the new step's (step, p, delta) is appended to TaskMeta.cost_curve. cost_curve is capped at 500 entries so the meta record stays bounded even on long-running tasks.

Signals derived from RecordedToolCall:
- toolOk = call.ok
- assertPasses = (call.tool === 'oc_assert' && call.ok) ? 1 : 0
- assertFails = (call.tool === 'oc_assert' && !call.ok) ? 1 : 0
- checkpointAdvanced = call.tool === 'oc_checkpoint' && call.ok

The budget ledger never sees tool result payloads, so the assert verdict is approximated from the call's ok bit; a finer-grained inconclusive signal can land later if the recorder starts threading verdict bodies through.

This is the wire-up promised in the original #1428-A brief that was deferred from PR #1437. Without it cost_curve was inert and every downstream consumer (early-stop policy, cost-curve report) was dead code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
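The signal mapping and the cap described in the commit message can be sketched as follows. The field names and the four signal expressions are taken from the message; the interface shapes, the helper names `signalFromCall` and `appendCapped`, and the choice to drop the oldest entries once the cap is hit are assumptions, since the message only says the curve is capped at 500 entries.

```typescript
// Assumed minimal shapes; the real types may carry more fields.
interface RecordedToolCall { tool: string; ok: boolean; }
interface StepSignal {
  toolOk: boolean;
  assertPasses: number;
  assertFails: number;
  checkpointAdvanced: boolean;
}
interface CurvePoint { step: number; p: number; delta: number; }

const COST_CURVE_CAP = 500; // cap stated in the commit message

// Derive the step signal from the call's tool name and ok bit only,
// because the budget ledger never sees tool result payloads.
function signalFromCall(call: RecordedToolCall): StepSignal {
  const isAssert = call.tool === 'oc_assert';
  return {
    toolOk: call.ok,
    assertPasses: isAssert && call.ok ? 1 : 0,
    assertFails: isAssert && !call.ok ? 1 : 0,
    checkpointAdvanced: call.tool === 'oc_checkpoint' && call.ok,
  };
}

// Append one point and keep at most COST_CURVE_CAP entries.
// Assumption: the oldest entries are the ones dropped.
function appendCapped(curve: CurvePoint[], point: CurvePoint): CurvePoint[] {
  const next = [...curve, point];
  return next.length > COST_CURVE_CAP
    ? next.slice(next.length - COST_CURVE_CAP)
    : next;
}

const failedAssert = signalFromCall({ tool: 'oc_assert', ok: false });

let costCurve: CurvePoint[] = [];
for (let step = 0; step < 600; step++) {
  costCurve = appendCapped(costCurve, { step, p: 0, delta: 0 });
}
```

After 600 appended steps the curve holds the most recent 500 points, so the meta record stays bounded however long the task runs.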
Analysis & merge summary (automated review)
Intent / direction. This is correctly framed as a fix, not a feature: PR #1437 landed the marginal-utility tracker module and the #1359 (SSOT) alignment; consistent.
Changes made to reach merge-readiness. The branch was stale: it re-included the tracker + early-stop commits that are already on
Verification. The
Merged to
Summary
Stacked on #1446. Completes the wire-up promised in the original #1428-A brief that was deferred from PR #1437.
- applyToolCallToTask now calls recordStep() on every tool call.
- TaskMeta.cost_curve is appended on each step (capped at 500 entries).
- TaskMeta._mu_state persists the sliding window across calls.
Why this is a fix not a feature
PR #1437 added the tracker module and the cost_curve field but no caller invoked it. Every downstream consumer (#1446 early-stop policy, future cost-curve report) was dead code without this. Calling out as a fix to the original Part 1 commitment.
SSOT (#1359) alignment
Test plan
- tests/core/task-ledger/marginal-utility-wireup.test.ts: 6/6 pass.
- tests/core/task-ledger/budget.test.ts: 11/11 still pass (no regression).
Stacked on #1446 → #1437.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>