fix(counterfactual): bound scan window + per-agent cap (600s timeout fix)#228

Merged
cipher813 merged 1 commit into main from fix/counterfactual-bound-scan-window on May 19, 2026

@cipher813
Owner

ROADMAP L293: the alpha-engine-replay-counterfactual Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run: 5/13). The captured-artifact corpus crossed ~32,740 in the 56-day window, and the per-artifact s3:get_object loop inside compute_and_emit blew past the timeout. SF Catch=States.ALL keeps the pipeline green, so the failure is observability-silent: no agent_counterfactual_rule_fit CW datapoint has been emitted in 1.5 weeks.

The entry lists three fix options: (a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda. This PR ships (a): the simplest correct shape, with no infrastructure surface change, and cost-discipline-aligned.
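A back-of-envelope check on why the window size matters. This is a rough sketch, assuming the ~585 artifacts/day rate quoted in the commit message and a uniform daily rate; `estimated_loads` is an illustrative helper, not part of the codebase.

```python
# Rough estimate of how many per-artifact S3 loads one run performs.
# ARTIFACTS_PER_DAY is the approximate rate quoted in the commit message;
# treating it as uniform across days is a simplifying assumption.
ARTIFACTS_PER_DAY = 585


def estimated_loads(window_days: int, artifacts_per_day: int = ARTIFACTS_PER_DAY) -> int:
    """Approximate number of get_object calls for a given scan window."""
    return window_days * artifacts_per_day


if __name__ == "__main__":
    for days in (56, 28):
        print(f"{days}d window: ~{estimated_loads(days):,} loads")
```

Under these assumptions the 56-day window lands at roughly 32.8k loads (in line with the ~32,740 observed) and the 28-day window at roughly 16.4k.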

What ships

| Component | Change |
| --- | --- |
| `replay/counterfactual.py` | DEFAULT_WINDOW_DAYS 56→28; new DEFAULT_MAX_ARTIFACTS_PER_AGENT=500; cap applied PRE _load_artifact in _list_artifact_keys_in_window so the bound translates directly into Lambda wall-clock |
| `lambda_counterfactual/handler.py` | Default window threaded from module constant; new event field max_artifacts_per_agent (default 500, explicit None disables for ad-hoc deeper-corpus runs); docstring + start log updated. Shell-run dry path (PR #225) preserved |
| `tests/test_replay_counterfactual.py` | New TestPerAgentArtifactCap (4 tests): drops-oldest, None unbounded, 0-treated-as-unbounded, per-agent independence |
| `tests/test_lambda_counterfactual_handler.py` | test_default_window_days updated for 28d + 500 cap; +2 tests for event override + explicit None |
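The difference between an omitted `max_artifacts_per_agent` and an explicit `None` can be sketched as below. This is a minimal illustration, not the handler's actual code: `resolve_cap` is a hypothetical helper, and only the constant name, its value, and the event field name come from this PR.

```python
from typing import Any, Mapping, Optional

DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500  # value stated in the PR


def resolve_cap(event: Mapping[str, Any]) -> Optional[int]:
    """Return the per-agent cap for this invocation (hypothetical helper).

    Key absent        -> module default (500).
    Key present, None -> cap disabled (unbounded, ad-hoc deep runs).
    Key present, int  -> caller's override.
    """
    if "max_artifacts_per_agent" not in event:
        return DEFAULT_MAX_ARTIFACTS_PER_AGENT
    value = event["max_artifacts_per_agent"]
    return None if value is None else int(value)
```

A membership test is used here because a pattern like `event.get(key) or DEFAULT` would collapse an explicit `None` back into the default of 500.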

Local 5/15 edits abandoned (intentional)

The local working-tree edits on the in-use alpha-engine-backtester clone removed the merged shell-run-dry path (PR #225) and downgraded the lib pin v0.20.0 → v0.16.0. Both are regressions vs origin/main. This PR was built in a clean throwaway worktree on origin/main, so those edits are not part of it. Recommended: run git restore lambda_counterfactual/ in the working tree post-merge to clean them up.

What does NOT ship

  • MemorySize bump — not needed at 28d + per-agent cap. Held in reserve for option (b) if future corpus growth needs it.
  • Spot/batch refactor — option (c), out of scope.
  • Concurrency / threaded loads — out of scope; revisit at next corpus growth.

Post-merge deploy step

bash infrastructure/deploy_counterfactual.sh

Manual — no CI deploy for this Lambda. Prior ECR :latest push was 2026-05-05; SYSTEM_STATE should record the new image SHA after the deploy.

Tests

pytest tests/ -q  →  1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path — preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293, the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).

🤖 Generated with Claude Code

ROADMAP L293 — alpha-engine-replay-counterfactual Lambda has been silently
timing out at the 600s ceiling since 2026-05-16 (last successful real
Saturday-SF run 5/13). Captured-artifact corpus crossed ~32,740 in the
56-day window, and the per-artifact s3:get_object loop inside
``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means
the pipeline stays green; the failure is observability-silent — no
``agent_counterfactual_rule_fit`` CW datapoint has emitted in 1.5 weeks.

Per the entry's listed fix options [(a) cap/window the scan, (b) bump
memory + bound workload, (c) move off Lambda]: shipping option (a) —
simplest correct shape, no infrastructure surface change, no new
deploy machinery. Cost-discipline-aligned (pre-alpha-validation).

## What ships

1. **`replay/counterfactual.py`**:
   - ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks still > the 30-day
     statistical-significance heuristic the LdP triple-barrier fits
     consume; at ~585 artifacts/day this bounds the get_object loop
     near ~16k instead of the 32k+ that timed out.
   - New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500`` —
     second-order bound. Even if the window is reverted operationally
     or a heavy population-wide agent's per-day rate grows, no single
     agent's backlog can stall the run. Most-recent-first enforcement
     in ``_list_artifact_keys_in_window`` (day-by-day iteration was
     already end_date-backward, so the natural list ordering IS
     most-recent-first — cap drops old keys, not fresh ones).
   - ``_list_artifact_keys_in_window`` gains the cap parameter
     (applied PRE ``_load_artifact`` so the bound translates directly
     into Lambda wall-clock).
   - ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end;
     log line surfaces the value for ops visibility.

2. **`lambda_counterfactual/handler.py`**:
   - Default ``window_days`` now reads from the module constant (28).
   - New event field ``max_artifacts_per_agent`` (default 500; explicit
     ``None`` disables the cap for ad-hoc deeper-corpus runs).
   - Event-shape docstring + start-log line updated.
   - The shell-run dry path (PR #225) is UNTOUCHED — the locally-
     uncommitted 5/15 edits that removed it would have regressed
     merged work + downgraded the lib pin v0.20.0 → v0.16.0; those
     edits are abandoned by working on a clean origin/main worktree.

3. **Tests** (+5 net):
   - ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests):
     cap drops oldest when one agent dominates (most-recent-first
     truncation), cap=None returns all keys, cap=0 treated as
     unbounded (defensive — less-surprising), cap applied
     independently per agent_id_base.
   - ``test_lambda_counterfactual_handler``:
     ``test_default_window_days`` updated to expect 28 + the new
     ``max_artifacts_per_agent=500`` default; two new tests for
     event-override + explicit-None.
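The cap semantics exercised by `TestPerAgentArtifactCap` can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in `replay/counterfactual.py`: `cap_keys_per_agent`, the `(agent_id_base, key)` tuple shape, and the sample keys are hypothetical, and the input is assumed to arrive most-recent-first as described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def cap_keys_per_agent(
    keys: Iterable[Tuple[str, str]],
    max_per_agent: Optional[int],
) -> List[Tuple[str, str]]:
    """Keep at most max_per_agent keys per agent (hypothetical helper).

    keys: (agent_id_base, s3_key) pairs, assumed most-recent-first,
          so truncation drops the oldest keys for each agent.
    max_per_agent: None or 0 means unbounded.
    """
    if not max_per_agent:  # None and 0 both disable the cap
        return list(keys)
    kept: List[Tuple[str, str]] = []
    seen: Dict[str, int] = defaultdict(int)
    for agent, key in keys:
        if seen[agent] < max_per_agent:  # counted independently per agent
            kept.append((agent, key))
            seen[agent] += 1
    return kept


if __name__ == "__main__":
    # Made-up listing, newest first; agent "a" dominates.
    listing = [("a", "k3"), ("a", "k2"), ("b", "k2"), ("a", "k1")]
    print(cap_keys_per_agent(listing, 2))
    # [('a', 'k3'), ('a', 'k2'), ('b', 'k2')]
```

Because filtering happens on the key listing, before any artifact is loaded, each dropped key is one S3 read that never happens, which is how the cap translates into wall-clock savings.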

## What does NOT ship

- **MemorySize bump** — not needed at 28d + per-agent cap. Held in
  reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor** — option (c), out of scope. The entry
  flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads** — could shave more time but
  introduces ordering + S3 throttling concerns. Out of scope; revisit
  if 28d + cap still bumps the ceiling at the next corpus growth.

## Post-merge deploy step

  bash infrastructure/deploy_counterfactual.sh

(Manual — there is no CI deploy for the counterfactual Lambda. Per
the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05.
SYSTEM_STATE should record the new image SHA after the deploy.)

## Tests

  pytest tests/ -q  ->  1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path — preserved + protected by
the abandon-stale-local-edits decision), ROADMAP L293 (this entry),
the closed L266 agent-justification gate (this is its counterfactual
leg, restored to operation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 merged commit 1a1196a into main on May 19, 2026
1 check passed
@cipher813 deleted the fix/counterfactual-bound-scan-window branch on May 19, 2026 22:13