fix(counterfactual): bound scan window + per-agent cap (600s timeout fix) #228
Merged
ROADMAP L293 — alpha-engine-replay-counterfactual Lambda has been silently
timing out at the 600s ceiling since 2026-05-16 (last successful real
Saturday-SF run 5/13). Captured-artifact corpus crossed ~32,740 in the
56-day window, and the per-artifact s3:get_object loop inside
``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means
the pipeline stays green, so the failure is observability-silent: no
``agent_counterfactual_rule_fit`` CW datapoint has been emitted in 1.5 weeks.
Per the entry's listed fix options [(a) cap/window the scan, (b) bump
memory + bound workload, (c) move off Lambda]: shipping option (a) —
simplest correct shape, no infrastructure surface change, no new
deploy machinery. Cost-discipline-aligned (pre-alpha-validation).
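
A back-of-the-envelope check of the numbers above, as a minimal sketch. It assumes artifacts arrive at a roughly uniform daily rate, which the entry does not state (only the 56-day total is given), and `estimated_scan_size` is a hypothetical helper, not part of the repo. It shows why halving the window roughly halves the number of per-artifact fetches.

```python
# Rough scan-size estimate. ASSUMPTION: uniform daily artifact rate.
CORPUS_IN_56_DAYS = 32_740  # corpus size quoted in the ROADMAP entry


def estimated_scan_size(
    window_days: int,
    corpus: int = CORPUS_IN_56_DAYS,
    corpus_days: int = 56,
) -> int:
    """Estimate how many per-artifact fetches a scan window implies."""
    per_day = corpus / corpus_days  # ~585 artifacts/day
    return round(per_day * window_days)


print(estimated_scan_size(56))  # 32740
print(estimated_scan_size(28))  # 16370
```

If the per-day rate keeps growing, the window alone stops being a sufficient bound, which is the motivation for a second, per-agent limit.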
## What ships
1. **`replay/counterfactual.py`**:
- ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks is close to, though two days
short of, the 30-day statistical-significance heuristic the LdP
triple-barrier fits consume; at ~585 artifacts/day this bounds the
get_object loop at ~16k instead of the 32k+ that timed out.
- New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500`` —
second-order bound. Even if the window is reverted operationally
or a heavy population-wide agent's per-day rate grows, no single
agent's backlog can stall the run. Most-recent-first enforcement
in ``_list_artifact_keys_in_window`` (day-by-day iteration was
already end_date-backward, so the natural list ordering IS
most-recent-first — cap drops old keys, not fresh ones).
- ``_list_artifact_keys_in_window`` gains the cap parameter
(applied PRE ``_load_artifact`` so the bound translates directly
into Lambda wall-clock).
- ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end;
log line surfaces the value for ops visibility.
2. **`lambda_counterfactual/handler.py`**:
- Default ``window_days`` now reads from the module constant (28).
- New event field ``max_artifacts_per_agent`` (default 500; explicit
``None`` disables the cap for ad-hoc deeper-corpus runs).
- Event-shape docstring + start-log line updated.
- The shell-run dry path (PR #225) is UNTOUCHED — the locally-
uncommitted 5/15 edits that removed it would have regressed
merged work + downgraded the lib pin v0.20.0 → v0.16.0; those
edits are abandoned by working on a clean origin/main worktree.
3. **Tests** (+5 net):
- ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests):
cap drops oldest when one agent dominates (most-recent-first
truncation), cap=None returns all keys, cap=0 treated as
unbounded (defensive — less-surprising), cap applied
independently per agent_id_base.
- ``test_lambda_counterfactual_handler``:
``test_default_window_days`` updated to expect 28 + the new
``max_artifacts_per_agent=500`` default; two new tests for
event-override + explicit-None.
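
The cap semantics listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in ``replay/counterfactual.py``: the helper names ``cap_keys_per_agent`` and ``resolve_cap`` and the ``(agent_id_base, key)`` pair shape are hypothetical, and the input is assumed to be already ordered most-recent-first, as the day-by-day backward iteration is described as producing.

```python
from collections import defaultdict
from typing import Iterable, Optional

DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500  # value stated in the commit


def cap_keys_per_agent(
    keys_newest_first: Iterable[tuple[str, str]],
    max_per_agent: Optional[int] = DEFAULT_MAX_ARTIFACTS_PER_AGENT,
) -> list[str]:
    """Keep at most `max_per_agent` keys per agent, preferring the newest.

    `keys_newest_first` yields (agent_id_base, s3_key) pairs, newest first
    (ASSUMED shape). None or 0 means "no cap", mirroring the tests above.
    """
    if not max_per_agent:  # None or 0 -> unbounded
        return [key for _, key in keys_newest_first]
    kept: list[str] = []
    seen: dict[str, int] = defaultdict(int)
    for agent, key in keys_newest_first:
        # Each agent has its own counter, so one heavy agent cannot
        # crowd out the others; older keys past the cap are dropped.
        if seen[agent] < max_per_agent:
            seen[agent] += 1
            kept.append(key)
    return kept


def resolve_cap(event: dict) -> Optional[int]:
    """Hypothetical handler-side parsing of the event field.

    Absent field -> default cap; explicit None -> cap disabled.
    """
    return event.get("max_artifacts_per_agent", DEFAULT_MAX_ARTIFACTS_PER_AGENT)


pairs = [("a", "d3/a1"), ("a", "d2/a2"), ("b", "d2/b1"), ("a", "d1/a3")]
print(cap_keys_per_agent(pairs, 2))  # ['d3/a1', 'd2/a2', 'd2/b1']
```

Filtering keys before any object is fetched is what lets the cap translate into wall-clock savings: a dropped key costs no ``get_object`` call.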
## What does NOT ship
- **MemorySize bump** — not needed at 28d + per-agent cap. Held in
reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor** — option (c), out of scope. The entry
flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads** — could shave more time but
introduces ordering + S3 throttling concerns. Out of scope; revisit
if 28d + cap still bumps the ceiling at the next corpus growth.
## Post-merge deploy step
    bash infrastructure/deploy_counterfactual.sh
(Manual — there is no CI deploy for the counterfactual Lambda. Per
the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05.
SYSTEM_STATE should record the new image SHA after the deploy.)
## Tests
    pytest tests/ -q -> 1693 passed, 0 failed
Composes with: PR #225 (shell-run dry path — preserved + protected by
the abandon-stale-local-edits decision), ROADMAP L293 (this entry),
the closed L266 agent-justification gate (this is its counterfactual
leg, restored to operation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ROADMAP L293: the `alpha-engine-replay-counterfactual` Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run 5/13). The captured-artifact corpus crossed ~32,740 in the 56-day window; the per-artifact `s3:get_object` loop inside `compute_and_emit` blew past the timeout. SF `Catch=States.ALL` keeps the pipeline green, so the failure is observability-silent: no `agent_counterfactual_rule_fit` CW datapoint has been emitted in 1.5 weeks.

Per the entry's fix options, (a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda: shipping (a), the simplest correct shape, with no infrastructure surface change, cost-discipline-aligned.

## What ships

| File | Change |
|---|---|
| `replay/counterfactual.py` | `DEFAULT_WINDOW_DAYS` 56→28; new `DEFAULT_MAX_ARTIFACTS_PER_AGENT=500`; cap applied PRE `_load_artifact` in `_list_artifact_keys_in_window` so the bound translates directly into Lambda wall-clock |
| `lambda_counterfactual/handler.py` | `max_artifacts_per_agent` (default 500, explicit `None` disables for ad-hoc deeper-corpus runs); docstring + start log updated. Shell-run dry path (PR #225) preserved |
| `tests/test_replay_counterfactual.py` | `TestPerAgentArtifactCap` (4 tests): drops-oldest, None unbounded, 0-treated-as-unbounded, per-agent independence |
| `tests/test_lambda_counterfactual_handler.py` | `test_default_window_days` updated for 28d + 500 cap; +2 tests for event override + explicit None |

## Local 5/15 edits abandoned (intentional)

The local working-tree edits on the in-use `alpha-engine-backtester` clone removed the merged shell-run-dry path (PR #225) and downgraded the lib pin v0.20.0 → v0.16.0. Both are regressions vs `origin/main`. Working in a clean throwaway worktree on `origin/main`, so those edits are not part of this PR. Recommend `git restore lambda_counterfactual/` in the working tree post-merge to clean them up.

## What does NOT ship

## Post-merge deploy step

Manual: no CI deploy for this Lambda. Prior ECR `:latest` push was 2026-05-05; SYSTEM_STATE should record the new image SHA after the deploy.

## Tests

Composes with: PR #225 (shell-run dry path, preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293, the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).
🤖 Generated with Claude Code