feat(harness): wire marginal-utility tracker into task ledger #1451

Merged
shaun0927 merged 1 commit into develop from feat/1428-tracker-wireup on May 28, 2026
@shaun0927
Owner

Summary

Stacked on #1446. Completes the wire-up promised in the original #1428-A brief that was deferred from PR #1437.

  • applyToolCallToTask now calls recordStep() on every tool call.
  • TaskMeta.cost_curve is appended on each step (capped at 500 entries).
  • TaskMeta._mu_state persists the sliding window across calls.
  • Signals: tool ok, oc_assert pass/fail (from call.ok), oc_checkpoint advance.

Why this is a fix, not a feature

PR #1437 added the tracker module and the cost_curve field, but no caller invoked recordStep(). Without this wire-up, every downstream consumer (the #1446 early-stop policy, the future cost-curve report) was dead code. I am calling this out as a fix to the original Part 1 commitment.

SSOT (#1359) alignment

  • Pure server-side derivation. No host coupling, no model.
  • Bounded growth: cost_curve cap = 500 entries.
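The bounded-growth rule can be sketched as follows. This is a minimal illustration, not the merged code: the `CostCurvePoint` shape is inferred from the `(step, p, delta)` triple named in the commit message, `appendCapped` is a hypothetical helper name, and dropping the oldest entries first is an assumption (the PR only states the cap of 500).

```typescript
// Hypothetical shape: the PR only says each entry carries (step, p, delta).
interface CostCurvePoint {
  step: number;
  p: number;
  delta: number;
}

// Cap stated in the PR description.
const COST_CURVE_CAP = 500;

// Append one point and keep at most `cap` entries.
// Assumption: when the cap is exceeded, the oldest entries are dropped.
function appendCapped(
  curve: readonly CostCurvePoint[],
  point: CostCurvePoint,
  cap: number = COST_CURVE_CAP,
): CostCurvePoint[] {
  const next = [...curve, point];
  return next.length > cap ? next.slice(next.length - cap) : next;
}
```

Returning a new array instead of mutating the input keeps the helper compatible with the additive `...meta` spread style mentioned in the merge summary.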

Test plan

  • tests/core/task-ledger/marginal-utility-wireup.test.ts — 6/6 pass.
  • tests/core/task-ledger/budget.test.ts — 11/11 still pass (no regression).

Stacked on #1446 and #1437.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

shaun0927 force-pushed the feat/1428-early-stop-policy branch from 4361f92 to 0257f5e on May 28, 2026 13:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a28f015eb

Comment on lines +122 to +123
const scores = candidateWindow.map((r) => r.p);
scores.push(score);
P2: Use raw step scores in the sliding-window average

When the signal pattern changes, this re-averages prior p estimates instead of the per-step scores the formula describes. For example, a pass/fail/pass sequence should average raw scores [1,0,1], but the second entry becomes the previous weighted p, so lastP is inflated and later deltas/plateau detection no longer represent the last window of tool calls. Store the raw step score in the window (or otherwise average raw scores) rather than feeding prior estimates back into the calculation.
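The difference between the two averaging strategies can be illustrated with a small sketch. It assumes a plain unweighted mean and a hypothetical `WindowEntry` shape that stores the raw score next to the derived estimate; the actual tracker formula may weight signals differently. The window size of 20 comes from the merge summary.

```typescript
// Hypothetical window entry: raw per-step score plus the estimate derived at that step.
interface WindowEntry {
  score: number;
  p: number;
}

// Window cap taken from the merge summary; everything else here is illustrative.
const WINDOW_SIZE = 20;

function mean(xs: readonly number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Suggested behaviour: average the raw scores in the window.
function pushRaw(window: readonly WindowEntry[], score: number): WindowEntry[] {
  const kept = window.slice(-(WINDOW_SIZE - 1));
  const p = mean([...kept.map((e) => e.score), score]);
  return [...kept, { score, p }];
}

// Flagged behaviour, for comparison: prior p estimates are fed back into the average.
function pushReaveraged(window: readonly WindowEntry[], score: number): WindowEntry[] {
  const kept = window.slice(-(WINDOW_SIZE - 1));
  const p = mean([...kept.map((e) => e.p), score]);
  return [...kept, { score, p }];
}
```

For the pass/fail/pass sequence `[1, 0, 1]`, `pushRaw` ends at p = 2/3 while `pushReaveraged` ends at p = 5/6, because the middle entry contributes the earlier estimate 0.5 instead of the raw score 0.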

Comment thread src/core/task-ledger/early-stop.ts Outdated
};
}

if (summary.consecutive_low_delta < r.plateau_steps) {
P2: Honor custom plateau_delta when deciding plateaus

For callers that pass a non-default plateau_delta, the recommendation still uses summary.consecutive_low_delta, which is computed by summary() with the hard-coded LOW_DELTA_THRESHOLD = 0.02; the resolved r.plateau_delta is only returned in the policy snapshot and never affects should_stop. This makes stricter or looser early-stop thresholds silently ineffective, so hosts cannot tune the plateau condition as the policy API promises.
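One way to make the threshold effective is to count low-delta steps against the resolved policy value instead of a module constant. The sketch below is an assumption-laden illustration: `consecutiveLowDelta`, `isPlateau`, and the `PlateauPolicy` shape are hypothetical names, and comparing the absolute delta with a strict `<` is a guess at the intended semantics.

```typescript
// Hypothetical policy shape; field names mirror those quoted in the review comment.
interface PlateauPolicy {
  plateau_delta: number;
  plateau_steps: number;
}

// Count trailing steps whose |delta| is below the caller-supplied threshold.
// Assumption: a step is "low" when Math.abs(delta) < plateauDelta.
function consecutiveLowDelta(deltas: readonly number[], plateauDelta: number): number {
  let count = 0;
  for (const d of [...deltas].reverse()) {
    if (Math.abs(d) >= plateauDelta) break;
    count++;
  }
  return count;
}

// A plateau is reached once enough trailing steps are low under this policy.
function isPlateau(deltas: readonly number[], policy: PlateauPolicy): boolean {
  return consecutiveLowDelta(deltas, policy.plateau_delta) >= policy.plateau_steps;
}
```

With deltas `[0.2, 0.01, 0.03, 0.01]` and `plateau_steps: 3`, a threshold of 0.02 yields one trailing low step (no plateau), while a looser threshold of 0.05 yields three (plateau), so the caller's setting changes the outcome.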

Every tool call routed through applyToolCallToTask now feeds a
StepSignal into recordStep(). The resulting MarginalUtilityState is
persisted on TaskMeta._mu_state and the new step's (step, p, delta)
is appended to TaskMeta.cost_curve. cost_curve is capped at 500
entries so the meta record stays bounded even on long-running tasks.

Signals derived from RecordedToolCall:
  - toolOk            = call.ok
  - assertPasses      = (call.tool === 'oc_assert' && call.ok) ? 1 : 0
  - assertFails       = (call.tool === 'oc_assert' && !call.ok) ? 1 : 0
  - checkpointAdvanced= call.tool === 'oc_checkpoint' && call.ok

The budget ledger never sees tool result payloads, so the assert
verdict is approximated from the call's ok bit — a finer-grained
inconclusive signal can land later if the recorder starts threading
verdict bodies through.
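
The signal mapping above can be sketched as a single pure function. The interface shapes are assumptions reduced to the fields this message mentions, and `deriveStepSignal` is a hypothetical name; the real RecordedToolCall and StepSignal types may carry more fields.

```typescript
// Assumed minimal shapes: only the fields referenced in the commit message.
interface RecordedToolCall {
  tool: string;
  ok: boolean;
}

interface StepSignal {
  toolOk: boolean;
  assertPasses: number;
  assertFails: number;
  assertInconclusives: number;
  checkpointAdvanced: boolean;
}

// Derive a step signal from the call's tool name and ok bit alone.
function deriveStepSignal(call: RecordedToolCall): StepSignal {
  const isAssert = call.tool === 'oc_assert';
  return {
    toolOk: call.ok,
    assertPasses: isAssert && call.ok ? 1 : 0,
    assertFails: isAssert && !call.ok ? 1 : 0,
    // Always 0: the ledger never sees verdict bodies, so "inconclusive" cannot be detected.
    assertInconclusives: 0,
    checkpointAdvanced: call.tool === 'oc_checkpoint' && call.ok,
  };
}
```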

This is the wire-up promised in the original #1428-A brief that was
deferred from PR #1437. Without it cost_curve was inert and every
downstream consumer (early-stop policy, cost-curve report) was dead
code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shaun0927 force-pushed the feat/1428-tracker-wireup branch from 9a28f01 to f31e5dd on May 28, 2026 16:28
shaun0927 changed the base branch from feat/1428-early-stop-policy to develop on May 28, 2026 16:28
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

shaun0927 merged commit 1113604 into develop on May 28, 2026
@shaun0927
Owner Author

Analysis & merge summary (automated review)

Intent / direction. This is correctly framed as a fix, not a feature: PR #1437 landed the marginal-utility tracker module and the TaskMeta.cost_curve / _mu_state fields, but no caller invoked recordStep(), so every downstream consumer (early-stop policy, cost-curve report) was dead code. This PR wires the tracker into applyToolCallToTask.

#1359 (SSOT) alignment — consistent.

  • P2 (harness, not agent) and P4 (facts before decisions): pure deterministic server-side derivation of a p_success estimate from tool-call signals. No model call, no host coupling, no Date.now() / I/O.
  • Bounded artifacts: cost_curve hard-capped at 500 entries; the tracker window is capped at 20 inside recordStep. No unbounded growth.
  • Both new TaskMeta fields are optional and additive (...meta spread), so existing consumers are unaffected.

Changes made to reach merge-readiness. The branch was stale — it re-included the tracker + early-stop commits that are already on develop (via #1437) and in #1446. I rebased it onto develop so it contains only the wire-up commit, and retargeted the base from the feature branch to develop (per the repo convention that PRs target develop).

Verification. The tsc build is clean, and the marginal-utility-wireup (6), budget (11), and marginal-utility (7) tests pass (24/24). An independent code-review pass found no P0/P1/P2 issues.

The assertInconclusives: 0 approximation (assert verdict derived from the ok bit, not the verdict body) is a documented, conservative limitation, not a correctness defect.

Merged to develop.
