fix(counterfactual): bound scan window + per-agent cap (600s timeout fix) by cipher813 · Pull Request #228 · cipher813/alpha-engine-backtester

cipher813 · 2026-05-19T22:02:01Z

ROADMAP L293 — alpha-engine-replay-counterfactual Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run 5/13). Captured-artifact corpus crossed ~32,740 in the 56-day window; the per-artifact s3:get_object loop inside compute_and_emit blew past the timeout. SF Catch=States.ALL keeps the pipeline green so the failure is observability-silent — no agent_counterfactual_rule_fit CW datapoint has emitted in 1.5 weeks.

Per the entry's fix options (a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda: shipping (a) — simplest correct shape, no infrastructure surface change, cost-discipline-aligned.
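
A back-of-envelope check of what option (a) buys, using only the figures quoted above plus the 28-day window this PR ships. It assumes artifacts arrive at a uniform daily rate and that wall-clock is dominated by sequential per-artifact loads; neither assumption is measured here, so treat the output as an estimate rather than a benchmark.

```python
# Figures quoted in the PR description; everything derived below is an estimate.
TIMEOUT_S = 600
CORPUS_ARTIFACTS = 32_740
OLD_WINDOW_DAYS = 56
NEW_WINDOW_DAYS = 28


def artifacts_per_day(corpus: int, window_days: int) -> float:
    """Average daily artifact rate, assuming uniform arrival (an assumption)."""
    return corpus / window_days


def per_artifact_budget_ms(timeout_s: int, artifact_count: float) -> float:
    """Mean time each load may take if the whole timeout is spent on loads."""
    return timeout_s * 1000 / artifact_count


rate = artifacts_per_day(CORPUS_ARTIFACTS, OLD_WINDOW_DAYS)
new_count = rate * NEW_WINDOW_DAYS
old_budget_ms = per_artifact_budget_ms(TIMEOUT_S, CORPUS_ARTIFACTS)
new_budget_ms = per_artifact_budget_ms(TIMEOUT_S, new_count)

print(f"{rate:.0f} artifacts/day")
print(f"{new_count:.0f} artifacts in {NEW_WINDOW_DAYS} days")
print(f"{old_budget_ms:.1f} ms -> {new_budget_ms:.1f} ms per artifact")
```

Halving the window doubles the per-artifact time budget, from roughly 18 ms to roughly 37 ms; the per-agent cap then bounds the count independently of the window.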

What ships

| Component | Change |
| --- | --- |
| `replay/counterfactual.py` | `DEFAULT_WINDOW_DAYS` 56→28; new `DEFAULT_MAX_ARTIFACTS_PER_AGENT=500`; cap applied PRE `_load_artifact` in `_list_artifact_keys_in_window` so the bound translates directly into Lambda wall-clock |
| `lambda_counterfactual/handler.py` | Default window threaded from module constant; new event field `max_artifacts_per_agent` (default 500, explicit `None` disables for ad-hoc deeper-corpus runs); docstring + start log updated. Shell-run dry path (PR #225) preserved |
| `tests/test_replay_counterfactual.py` | New `TestPerAgentArtifactCap` (4 tests): drops-oldest, None unbounded, 0-treated-as-unbounded, per-agent independence |
| `tests/test_lambda_counterfactual_handler.py` | `test_default_window_days` updated for 28d + 500 cap; +2 tests for event override + explicit None |
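
The capping behaviour described in the table can be sketched as follows. This is a minimal illustration, not the code in `replay/counterfactual.py`: the function name `list_keys_in_window`, the `list_keys_for_day` callback, and the `(agent_id_base, key)` tuple shape are all hypothetical stand-ins. It only shows the idea of walking days backward from the end date and capping per agent before any artifact is loaded.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical listing callback: given a day, yield (agent_id_base, key) pairs.
DayLister = Callable[[date], Iterable[Tuple[str, str]]]


def list_keys_in_window(
    list_keys_for_day: DayLister,
    end_date: date,
    window_days: int,
    max_per_agent: Optional[int],
) -> Dict[str, List[str]]:
    """Collect keys newest-day-first, keeping at most max_per_agent per agent.

    None or 0 means unbounded, mirroring the behaviour the tests describe.
    """
    capped = max_per_agent is not None and max_per_agent > 0
    kept: Dict[str, List[str]] = defaultdict(list)
    for offset in range(window_days):
        day = end_date - timedelta(days=offset)  # iterate backward from end_date
        for agent, key in list_keys_for_day(day):
            if capped and len(kept[agent]) >= max_per_agent:
                continue  # older key for an agent that is already full
            kept[agent].append(key)
    return dict(kept)


def fake_listing(day: date) -> List[Tuple[str, str]]:
    """Toy corpus: a heavy agent with 3 keys per day, a light agent with 1."""
    heavy = [("heavy", f"heavy/{day}/{i}") for i in range(3)]
    return heavy + [("light", f"light/{day}/0")]


capped_keys = list_keys_in_window(fake_listing, date(2026, 5, 19), 28, 5)
all_keys = list_keys_in_window(fake_listing, date(2026, 5, 19), 28, None)
print(len(capped_keys["heavy"]), len(all_keys["heavy"]))
```

Because the cap is applied while listing, the number of later load calls is bounded by the cap times the number of agents, whatever the window length.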

Local 5/15 edits abandoned (intentional)

The local working-tree edits on the in-use alpha-engine-backtester clone removed the merged shell-run-dry path (PR #225) and downgraded the lib pin v0.20.0 → v0.16.0. Both regressions vs origin/main. Working in a clean throwaway worktree on origin/main so those edits are not part of this PR. Recommend git restore lambda_counterfactual/ in the working tree post-merge to clean them up.

What does NOT ship

MemorySize bump — not needed at 28d + per-agent cap. Held in reserve for option (b) if future corpus growth needs it.
Spot/batch refactor — option (c), out of scope.
Concurrency / threaded loads — out of scope; revisit at next corpus growth.

Post-merge deploy step

bash infrastructure/deploy_counterfactual.sh

Manual — no CI deploy for this Lambda. Prior ECR :latest push was 2026-05-05; SYSTEM_STATE should record the new image SHA after the deploy.

Tests

pytest tests/ -q  →  1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path — preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293, the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).

🤖 Generated with Claude Code

…fix)

ROADMAP L293: alpha-engine-replay-counterfactual Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run 5/13). Captured-artifact corpus crossed ~32,740 in the 56-day window, and the per-artifact s3:get_object loop inside ``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means the pipeline stays green; the failure is observability-silent: no ``agent_counterfactual_rule_fit`` CW datapoint has emitted in 1.5 weeks.

Per the entry's listed fix options [(a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda]: shipping option (a), the simplest correct shape, with no infrastructure surface change and no new deploy machinery. Cost-discipline-aligned (pre-alpha-validation).

## What ships

1. **`replay/counterfactual.py`**:
   - ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks still > the 30-day statistical-significance heuristic the LdP triple-barrier fits consume; at ~585 artifacts/day this bounds the get_object loop near ~16k instead of the 32k+ that timed out.
   - New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500``, a second-order bound. Even if the window is reverted operationally or a heavy population-wide agent's per-day rate grows, no single agent's backlog can stall the run. Most-recent-first enforcement in ``_list_artifact_keys_in_window`` (day-by-day iteration was already end_date-backward, so the natural list ordering IS most-recent-first: the cap drops old keys, not fresh ones).
   - ``_list_artifact_keys_in_window`` gains the cap parameter (applied PRE ``_load_artifact`` so the bound translates directly into Lambda wall-clock).
   - ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end; log line surfaces the value for ops visibility.

2. **`lambda_counterfactual/handler.py`**:
   - Default ``window_days`` now reads from the module constant (28).
   - New event field ``max_artifacts_per_agent`` (default 500; explicit ``None`` disables the cap for ad-hoc deeper-corpus runs).
   - Event-shape docstring + start-log line updated.
   - The shell-run dry path (PR #225) is UNTOUCHED. The locally-uncommitted 5/15 edits that removed it would have regressed merged work + downgraded the lib pin v0.20.0 → v0.16.0; those edits are abandoned by working on a clean origin/main worktree.

3. **Tests** (+5 net):
   - ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests): cap drops oldest when one agent dominates (most-recent-first truncation), cap=None returns all keys, cap=0 treated as unbounded (defensive, less surprising), cap applied independently per agent_id_base.
   - ``test_lambda_counterfactual_handler``: ``test_default_window_days`` updated to expect 28 + the new ``max_artifacts_per_agent=500`` default; two new tests for event-override + explicit-None.

## What does NOT ship

- **MemorySize bump**: not needed at 28d + per-agent cap. Held in reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor**: option (c), out of scope. The entry flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads**: could shave more time but introduces ordering + S3 throttling concerns. Out of scope; revisit if 28d + cap still bumps the ceiling at the next corpus growth.

## Post-merge deploy step

    bash infrastructure/deploy_counterfactual.sh

(Manual: there is no CI deploy for the counterfactual Lambda. Per the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05. SYSTEM_STATE should record the new image SHA after the deploy.)

## Tests

    pytest tests/ -q -> 1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path, preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293 (this entry), the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
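
The handler's default-versus-explicit-`None` behaviour for the new event field might be resolved along these lines. This is a hedged sketch, not the contents of `lambda_counterfactual/handler.py`: the helper name `resolve_scan_bounds` and the returned dict shape are hypothetical, and only the two constants and the event field names come from this PR.

```python
from typing import Any, Dict, Optional

# Values taken from the PR description.
DEFAULT_WINDOW_DAYS = 28
DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500


def resolve_scan_bounds(event: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Hypothetical helper: turn a Lambda event into scan bounds.

    A missing key falls back to the module default, while an explicit
    None (JSON null) is passed through and disables the cap.
    """
    window_days = event.get("window_days", DEFAULT_WINDOW_DAYS)
    # dict.get only uses the default when the key is absent, so an
    # explicit None in the event survives as None here.
    cap = event.get("max_artifacts_per_agent", DEFAULT_MAX_ARTIFACTS_PER_AGENT)
    return {
        "window_days": int(window_days),
        "max_artifacts_per_agent": None if cap is None else int(cap),
    }


print(resolve_scan_bounds({}))
print(resolve_scan_bounds({"max_artifacts_per_agent": None}))
print(resolve_scan_bounds({"window_days": 56, "max_artifacts_per_agent": 2000}))
```

Relying on key absence rather than truthiness keeps "not specified" and "explicitly uncapped" distinguishable, which is what lets an ad-hoc deeper-corpus run opt out without changing the default.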

cipher813 merged commit 1a1196a into main May 19, 2026
1 check passed

cipher813 deleted the fix/counterfactual-bound-scan-window branch May 19, 2026 22:13
