feat(harness): wire marginal-utility tracker into task ledger by shaun0927 · Pull Request #1451 · shaun0927/openchrome

shaun0927 · 2026-05-28T13:57:28Z

Summary

Stacked on #1446. Completes the wire-up promised in the original #1428-A brief that was deferred from PR #1437.

applyToolCallToTask now calls recordStep() on every tool call.
TaskMeta.cost_curve is appended on each step (capped at 500 entries).
TaskMeta._mu_state persists the sliding window across calls.
Signals: tool ok, oc_assert pass/fail (from call.ok), oc_checkpoint advance.

Why this is a fix not a feature

PR #1437 added the tracker module and the cost_curve field but no caller invoked it. Every downstream consumer (#1446 early-stop policy, future cost-curve report) was dead code without this. Calling out as a fix to the original Part 1 commitment.

SSOT (#1359) alignment

Pure server-side derivation. No host coupling, no model.
Bounded growth: cost_curve cap = 500 entries.
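The bounded-growth rule can be sketched as a capped append. This is a minimal illustration, not the code from the PR: `CostCurvePoint`, `TaskMetaSketch`, and `appendCostCurve` are hypothetical names, and dropping the oldest entries first is an assumption, since the description only states the cap of 500.

```typescript
// Hypothetical shapes; the real TaskMeta and cost-curve entry types live in the PR.
interface CostCurvePoint {
  step: number;
  p: number;
  delta: number;
}

interface TaskMetaSketch {
  cost_curve?: CostCurvePoint[];
}

const COST_CURVE_CAP = 500; // cap stated in the PR description

// Returns a new meta object with the point appended and the curve trimmed to the cap.
// Assumption: when the cap is exceeded, the oldest entries are dropped first.
function appendCostCurve(meta: TaskMetaSketch, point: CostCurvePoint): TaskMetaSketch {
  const next = [...(meta.cost_curve ?? []), point];
  return { ...meta, cost_curve: next.slice(-COST_CURVE_CAP) };
}
```

Because the field is optional and the function spreads `meta`, a record without `cost_curve` is handled the same way as one that already has entries.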

Test plan

tests/core/task-ledger/marginal-utility-wireup.test.ts — 6/6 pass.
tests/core/task-ledger/budget.test.ts — 11/11 still pass (no regression).

Stacked on #1446 → #1437.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a28f015eb


chatgpt-codex-connector · 2026-05-28T13:59:48Z

+  const scores = candidateWindow.map((r) => r.p);
+  scores.push(score);


Use raw step scores in the sliding-window average

When the signal pattern changes, this re-averages prior p estimates instead of the per-step scores the formula describes. For example, a pass/fail/pass sequence should average raw scores [1,0,1], but the second entry becomes the previous weighted p, so lastP is inflated and later deltas/plateau detection no longer represent the last window of tool calls. Store the raw step score in the window (or otherwise average raw scores) rather than feeding prior estimates back into the calculation.
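The difference the review describes can be reproduced with a small sketch. It assumes the estimate is a plain mean over the window (the real formula may weight entries differently) and uses hypothetical names (`WindowEntry`, `recordRaw`, `recordFedBack`, `lastP`); the window cap of 20 is taken from the merge summary.

```typescript
// Hypothetical window entry: keeps the raw per-step score next to the estimate.
interface WindowEntry {
  score: number; // raw step score in [0, 1]
  p: number; // estimate computed when this step was recorded
}

const MAX_WINDOW = 20; // tracker window cap mentioned in the merge summary

type Recorder = (window: WindowEntry[], score: number) => WindowEntry[];

// Suggested behavior: average the raw scores of the most recent steps.
const recordRaw: Recorder = (window, score) => {
  const kept = window.slice(-(MAX_WINDOW - 1));
  const scores = kept.map((r) => r.score);
  scores.push(score);
  const p = scores.reduce((a, b) => a + b, 0) / scores.length;
  return [...kept, { score, p }];
};

// Flagged behavior: prior estimates (r.p) are fed back in as if they were scores.
const recordFedBack: Recorder = (window, score) => {
  const kept = window.slice(-(MAX_WINDOW - 1));
  const scores = kept.map((r) => r.p);
  scores.push(score);
  const p = scores.reduce((a, b) => a + b, 0) / scores.length;
  return [...kept, { score, p }];
};

// Replays a sequence of raw scores and returns the final estimate.
function lastP(record: Recorder, scores: number[]): number {
  let window: WindowEntry[] = [];
  for (const s of scores) window = record(window, s);
  return window.length > 0 ? window[window.length - 1].p : 0;
}
```

For the pass/fail/pass sequence `[1, 0, 1]`, `lastP(recordRaw, [1, 0, 1])` gives 2/3, while `lastP(recordFedBack, [1, 0, 1])` averages `[1, 0.5, 1]` and gives 2.5/3 (about 0.83), which is the inflation the comment points out.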


chatgpt-codex-connector · 2026-05-28T13:59:48Z

+    };
+  }
+
+  if (summary.consecutive_low_delta < r.plateau_steps) {


Honor custom plateau_delta when deciding plateaus

For callers that pass a non-default plateau_delta, the recommendation still uses summary.consecutive_low_delta, which is computed by summary() with the hard-coded LOW_DELTA_THRESHOLD = 0.02; the resolved r.plateau_delta is only returned in the policy snapshot and never affects should_stop. This makes stricter or looser early-stop thresholds silently ineffective, so hosts cannot tune the plateau condition as the policy API promises.
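One way to make the threshold effective is to compute the low-delta run with the resolved policy value instead of the constant. The sketch below is an illustration under assumptions: `PlateauPolicy`, `consecutiveLowDelta`, and `shouldStop` are hypothetical, "low" is taken to mean `|delta| < plateau_delta`, and only the trailing run of steps is counted.

```typescript
const DEFAULT_PLATEAU_DELTA = 0.02; // same value as LOW_DELTA_THRESHOLD quoted in the review

// Hypothetical policy shape; the real early-stop policy may carry more fields.
interface PlateauPolicy {
  plateau_steps: number;
  plateau_delta?: number;
}

// Counts how many of the most recent deltas are below the threshold, newest first.
// Assumption: a delta is "low" when its absolute value is under plateauDelta.
function consecutiveLowDelta(deltas: number[], plateauDelta: number): number {
  let count = 0;
  for (let i = deltas.length - 1; i >= 0; i--) {
    if (Math.abs(deltas[i]) >= plateauDelta) break;
    count++;
  }
  return count;
}

// The resolved plateau_delta feeds the count, so a custom value changes the outcome.
function shouldStop(deltas: number[], policy: PlateauPolicy): boolean {
  const plateauDelta = policy.plateau_delta ?? DEFAULT_PLATEAU_DELTA;
  return consecutiveLowDelta(deltas, plateauDelta) >= policy.plateau_steps;
}
```

With deltas `[0.1, 0.03, 0.01, 0.03]` and `plateau_steps: 3`, the default threshold does not stop (the newest delta, 0.03, is not low), while `plateau_delta: 0.05` does, because the last three deltas all fall under it.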


Every tool call routed through applyToolCallToTask now feeds a StepSignal into recordStep(). The resulting MarginalUtilityState is persisted on TaskMeta._mu_state and the new step's (step, p, delta) is appended to TaskMeta.cost_curve. cost_curve is capped at 500 entries so the meta record stays bounded even on long-running tasks.

Signals derived from RecordedToolCall:
- toolOk = call.ok
- assertPasses = (call.tool === 'oc_assert' && call.ok) ? 1 : 0
- assertFails = (call.tool === 'oc_assert' && !call.ok) ? 1 : 0
- checkpointAdvanced = call.tool === 'oc_checkpoint' && call.ok

The budget ledger never sees tool result payloads, so the assert verdict is approximated from the call's ok bit; a finer-grained inconclusive signal can land later if the recorder starts threading verdict bodies through.

This is the wire-up promised in the original #1428-A brief that was deferred from PR #1437. Without it cost_curve was inert and every downstream consumer (early-stop policy, cost-curve report) was dead code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
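The signal mapping in the commit message can be written out as a small pure function. The interfaces here are simplified stand-ins (the real RecordedToolCall and StepSignal types may carry more fields), and `toStepSignal` is a hypothetical name; the four expressions and the constant `assertInconclusives: 0` follow what the PR text states.

```typescript
// Simplified stand-ins for the real types; only the fields used by the mapping are shown.
interface RecordedToolCall {
  tool: string;
  ok: boolean;
}

interface StepSignal {
  toolOk: boolean;
  assertPasses: number;
  assertFails: number;
  assertInconclusives: number;
  checkpointAdvanced: boolean;
}

// Derives a StepSignal from the ok bit alone, since the ledger never sees result payloads.
function toStepSignal(call: RecordedToolCall): StepSignal {
  const isAssert = call.tool === "oc_assert";
  return {
    toolOk: call.ok,
    assertPasses: isAssert && call.ok ? 1 : 0,
    assertFails: isAssert && !call.ok ? 1 : 0,
    // Always 0: an inconclusive verdict cannot be told apart from the ok bit.
    assertInconclusives: 0,
    checkpointAdvanced: call.tool === "oc_checkpoint" && call.ok,
  };
}
```

A failed `oc_assert` call therefore yields `assertFails: 1` and `toolOk: false`, while a failed call to any other tool only clears `toolOk`.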

chatgpt-codex-connector · 2026-05-28T16:28:40Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

shaun0927 · 2026-05-28T16:40:07Z

Analysis & merge summary (automated review)

Intent / direction. This is correctly framed as a fix, not a feature: PR #1437 landed the marginal-utility tracker module and the TaskMeta.cost_curve / _mu_state fields, but no caller invoked recordStep(), so every downstream consumer (early-stop policy, cost-curve report) was dead code. This PR wires the tracker into applyToolCallToTask.

#1359 (SSOT) alignment — consistent.

P2 (harness, not agent) and P4 (facts before decisions): pure deterministic server-side derivation of a p_success estimate from tool-call signals. No model call, no host coupling, no Date.now() / I/O.
Bounded artifacts: cost_curve hard-capped at 500 entries; the tracker window is capped at 20 inside recordStep. No unbounded growth.
Both new TaskMeta fields are optional and additive (...meta spread), so existing consumers are unaffected.

Changes made to reach merge-readiness. The branch was stale — it re-included the tracker + early-stop commits that are already on develop (via #1437) and in #1446. I rebased it onto develop so it contains only the wire-up commit, and retargeted the base from the feature branch to develop (per the repo convention that PRs target develop).

Verification. tsc build clean; marginal-utility-wireup (6), budget (11), marginal-utility (7) tests pass (24/24). Independent code-review pass found no P0/P1/P2.

The assertInconclusives: 0 approximation (assert verdict derived from the ok bit, not the verdict body) is a documented, conservative limitation, not a correctness defect.

Merged to develop.

shaun0927 force-pushed the feat/1428-early-stop-policy branch from 4361f92 to 0257f5e May 28, 2026 13:58

chatgpt-codex-connector Bot reviewed May 28, 2026


shaun0927 force-pushed the feat/1428-tracker-wireup branch from 9a28f01 to f31e5dd May 28, 2026 16:28

shaun0927 changed the base branch from feat/1428-early-stop-policy to develop May 28, 2026 16:28

shaun0927 merged commit 1113604 into develop May 28, 2026

shaun0927 mentioned this pull request May 29, 2026

chore(release): back-merge main into develop (reconcile divergence) #1455

Merged

shaun0927 merged 1 commit into develop from feat/1428-tracker-wireup
