feat(backtester): shell-run dry path for replay concordance + counterfactual (closes SF skip-exceptions) by cipher813 · Pull Request #225 · cipher813/alpha-engine-backtester

cipher813 · 2026-05-18T22:03:17Z

Summary

Closes the last two Saturday-SF shell-run keystone skip-exceptions (ReplayConcordance / Counterfactual) by giving both replay Image-Lambda handlers a verified clean no-write dry path, so the keystone can route them dry (boot + imports for real, benign success before the scan) instead of pure-skipping them.

Dry event-key (verbatim)

"dry_run_llm" (boolean) — reuses the canonical keystone key established verbatim by the Research Lambda in alpha-engine-data step_function.json ("dry_run_llm.$": "$.research_dry"). No invented key. Distinct from the handlers' pre-existing dry_run event key, whose compute-but-don't-emit-metrics semantic is intentionally left untouched (backward compatible — proven by retained legacy tests).

The SF rewire (the ReplayConcordance/Counterfactual states must pass "dry_run_llm": true under shell_run, and stop hard-skipping them) is a separate alpha-engine-data follow-on PR — not in this repo.

Shared helper (canonical, no copy-paste)

replay/__init__.py — the single implementation imported by both handlers:

- SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"
- is_shell_run_dry(event): tolerant bool / str / None coercion
- shell_run_dry_response(handler_name, t0): benign {"status": "DRY_RUN", ...} envelope
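
A minimal sketch of what the shared helper might look like, for readers without the diff open. The three names and the "DRY_RUN" status come from this PR; the set of accepted truthy strings and the extra envelope fields ("handler", "elapsed_s") are assumptions, not the merged code.

```python
import time
from typing import Any, Mapping, Optional

SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"

# Assumption: the PR does not say which strings count as true.
_TRUTHY_STRINGS = {"true", "1", "yes"}


def is_shell_run_dry(event: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the event explicitly requests the shell-run dry path."""
    if not event:
        return False
    value = event.get(SHELL_RUN_DRY_EVENT_KEY)
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in _TRUTHY_STRINGS
    # None and any other type fall through to the real path.
    return False


def shell_run_dry_response(handler_name: str, t0: float) -> dict:
    """Build a benign success envelope; fields other than "status" are assumed."""
    return {
        "status": "DRY_RUN",
        "handler": handler_name,
        "elapsed_s": round(time.monotonic() - t0, 3),
    }
```

Defaulting every unrecognised value to False keeps the failure mode conservative: a malformed flag runs the real path rather than silently skipping work.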

Per-handler short-circuit proof

lambda_concordance/handler.py — the dry check sits after _ensure_init() and after the deferred from replay.batch import compute_and_emit_concordance (boot + module imports run for real — the keystone's whole point), but before the compute_and_emit_concordance(...) call. That call is the sole entry to the replay.batch decision_artifacts S3 scan, the langchain_anthropic / target-model replay calls, the CloudWatch agent_cheap_model_concordance emit, and the S3 summary persist. The dry path returns before all of them → zero external/LLM calls, zero S3/CW writes.

lambda_counterfactual/handler.py — symmetric: dry check after _ensure_init() + from replay.counterfactual import compute_and_emit, before that call (the sole entry to the decision_artifacts S3 scan + sklearn fit + CloudWatch agent_counterfactual_rule_fit emit + S3 per-agent persist). No LLM on this path regardless.

Both return a status the SF (Catch-wrapped, non-blocking) treats as success.
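
The ordering argument above can be sketched as follows. This is an illustration under stated assumptions, not either handler's actual code: `make_handler`, `fake_init`, `fake_load`, and `fake_compute` are hypothetical stand-ins for `_ensure_init()`, the deferred import, and the compute call, and the dry check is simplified to a strict `is True` rather than the tolerant coercion the shared helper performs.

```python
from typing import Any, Callable, Dict, List

SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"


def make_handler(
    ensure_init: Callable[[], None],
    load_compute: Callable[[], Callable[..., Dict[str, Any]]],
) -> Callable[..., Dict[str, Any]]:
    """Build a handler whose dry check sits after boot + import, before compute."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        ensure_init()             # boot runs for real on both paths
        compute = load_compute()  # stands in for the deferred module import
        if event.get(SHELL_RUN_DRY_EVENT_KEY) is True:
            # Returns before any scan, LLM call, or write can happen.
            return {"status": "DRY_RUN"}
        # Legacy key: still computes, only suppresses the metric emit.
        emit_metrics = not event.get("dry_run", False)
        return compute(emit_metrics=emit_metrics)

    return handler


calls: List[str] = []


def fake_init() -> None:
    calls.append("init")


def fake_compute(emit_metrics: bool = True) -> Dict[str, Any]:
    calls.append(f"compute(emit_metrics={emit_metrics})")
    return {"status": "OK"}


def fake_load() -> Callable[..., Dict[str, Any]]:
    calls.append("import")
    return fake_compute


handler = make_handler(fake_init, fake_load)

dry_result = handler({"dry_run_llm": True})
dry_calls = list(calls)      # boot and import only

calls.clear()
legacy_result = handler({"dry_run": True})
legacy_calls = list(calls)   # boot, import, then compute without metrics
```

Because the compute call is the only route to the scan and the emits, placing the return immediately before it is enough to guarantee the dry path has no side effects beyond boot and import.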

Known SEPARATE issue (OUT OF SCOPE)

Counterfactual times out at 600s on real Saturday runs from corpus growth (~32,740 artifacts; last success 2026-05-13; the SF Catch swallows it). This PR does not touch the scan logic and does not attempt to fix that timeout — it remains a distinct bug to be tracked separately. Aside: because the dry path skips the scan entirely, it incidentally avoids the timeout under shell_run only; the real-Saturday timeout is unaffected and still owed.

Tests

- +15 handler tests (TestShellRunDryPath ×2): dry short-circuits before scan (compute never called, _ensure_init still runs, success status), string-true coercion, dry=false / absent take the real path, legacy dry_run still takes the real path with emit_metrics=False.
- New tests/test_replay_shell_run_dry.py for the shared helper.
- Full backtester suite: 1684 passed, 5 skipped, 1 deselected (parity deselected as usual), 0 failed. No new deps, no secrets.
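
For illustration, the handler cases listed above might be asserted roughly as below. Only the class name TestShellRunDryPath and the behaviours under test come from this PR; the stand-in `handler`, its injected dependencies, and the `_is_dry` helper are hypothetical, and the real tests presumably patch the actual handler modules instead.

```python
import unittest
from unittest import mock


def _is_dry(event):
    """Hypothetical coercion: accept True or the string "true"."""
    value = event.get("dry_run_llm")
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return value is True


def handler(event, ensure_init, compute):
    """Stand-in for a replay handler; dependencies are injected for the test."""
    ensure_init()
    if _is_dry(event):
        return {"status": "DRY_RUN"}
    return compute(emit_metrics=not event.get("dry_run", False))


class TestShellRunDryPath(unittest.TestCase):
    def setUp(self):
        self.init = mock.Mock()
        self.compute = mock.Mock(return_value={"status": "OK"})

    def run_handler(self, event):
        return handler(event, self.init, self.compute)

    def test_dry_short_circuits_before_scan(self):
        for flag in (True, "true"):
            with self.subTest(flag=flag):
                self.init.reset_mock()
                self.compute.reset_mock()
                result = self.run_handler({"dry_run_llm": flag})
                self.assertEqual(result["status"], "DRY_RUN")
                self.init.assert_called_once()
                self.compute.assert_not_called()

    def test_false_or_absent_takes_real_path(self):
        for event in ({"dry_run_llm": False}, {}):
            with self.subTest(event=event):
                self.compute.reset_mock()
                self.run_handler(event)
                self.compute.assert_called_once_with(emit_metrics=True)

    def test_legacy_dry_run_still_computes(self):
        self.run_handler({"dry_run": True})
        self.compute.assert_called_once_with(emit_metrics=False)
```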

🤖 Generated with Claude Code

…fix) (#228)

ROADMAP L293: the alpha-engine-replay-counterfactual Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run 5/13). The captured-artifact corpus crossed ~32,740 in the 56-day window, and the per-artifact s3:get_object loop inside ``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means the pipeline stays green; the failure is observability-silent: no ``agent_counterfactual_rule_fit`` CW datapoint has emitted in 1.5 weeks.

Per the entry's listed fix options [(a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda]: shipping option (a), the simplest correct shape, with no infrastructure surface change and no new deploy machinery. Cost-discipline-aligned (pre-alpha-validation).

## What ships

1. **`replay/counterfactual.py`**:
   - ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks still > the 30-day statistical-significance heuristic the LdP triple-barrier fits consume; at ~585 artifacts/day this bounds the get_object loop near ~16k instead of the 32k+ that timed out.
   - New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500`` as a second-order bound. Even if the window is reverted operationally or a heavy population-wide agent's per-day rate grows, no single agent's backlog can stall the run. Most-recent-first enforcement in ``_list_artifact_keys_in_window`` (day-by-day iteration was already end_date-backward, so the natural list ordering IS most-recent-first; the cap drops old keys, not fresh ones).
   - ``_list_artifact_keys_in_window`` gains the cap parameter (applied PRE ``_load_artifact`` so the bound translates directly into Lambda wall-clock).
   - ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end; log line surfaces the value for ops visibility.
2. **`lambda_counterfactual/handler.py`**:
   - Default ``window_days`` now reads from the module constant (28).
   - New event field ``max_artifacts_per_agent`` (default 500; explicit ``None`` disables the cap for ad-hoc deeper-corpus runs).
   - Event-shape docstring + start-log line updated.
   - The shell-run dry path (PR #225) is UNTOUCHED. The locally-uncommitted 5/15 edits that removed it would have regressed merged work + downgraded the lib pin v0.20.0 → v0.16.0; those edits are abandoned by working on a clean origin/main worktree.
3. **Tests** (+5 net):
   - ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests): cap drops oldest when one agent dominates (most-recent-first truncation), cap=None returns all keys, cap=0 treated as unbounded (defensive, less surprising), cap applied independently per agent_id_base.
   - ``test_lambda_counterfactual_handler``: ``test_default_window_days`` updated to expect 28 + the new ``max_artifacts_per_agent=500`` default; two new tests for event-override + explicit-None.

## What does NOT ship

- **MemorySize bump**: not needed at 28d + per-agent cap. Held in reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor**: option (c), out of scope. The entry flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads**: could shave more time but introduces ordering + S3 throttling concerns. Out of scope; revisit if 28d + cap still bumps the ceiling at the next corpus growth.

## Post-merge deploy step

    bash infrastructure/deploy_counterfactual.sh

(Manual: there is no CI deploy for the counterfactual Lambda. Per the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05. SYSTEM_STATE should record the new image SHA after the deploy.)

## Tests

pytest tests/ -q -> 1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path, preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293 (this entry), the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
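
The per-agent, most-recent-first cap described in the #228 message can be sketched as below. This is a hedged illustration, not the code in ``_list_artifact_keys_in_window``: the function name `cap_keys_per_agent` and the `(agent_id_base, key)` tuple shape are hypothetical, and it assumes the caller already supplies keys newest-first, as the message says the day-by-day listing does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500


def cap_keys_per_agent(
    keys_newest_first: Iterable[Tuple[str, str]],
    max_per_agent: Optional[int] = DEFAULT_MAX_ARTIFACTS_PER_AGENT,
) -> Dict[str, List[str]]:
    """Group (agent_id_base, s3_key) pairs per agent, keeping only the newest N.

    The input must already be ordered most-recent-first, so truncation drops
    the oldest keys. A cap of None or 0 is treated as unbounded.
    """
    unbounded = not max_per_agent
    kept: Dict[str, List[str]] = defaultdict(list)
    for agent, key in keys_newest_first:
        # Each agent is counted independently of the others.
        if unbounded or len(kept[agent]) < max_per_agent:
            kept[agent].append(key)
    return dict(kept)


keys = [
    ("macro", "day3/a.json"),
    ("macro", "day2/b.json"),
    ("sector", "day2/x.json"),
    ("macro", "day1/c.json"),
]
capped = cap_keys_per_agent(keys, max_per_agent=2)
# "macro" keeps its two newest keys; "sector" is under the cap and unaffected.
```

Applying the cap while listing, before any object is fetched, is what makes the bound show up as saved wall-clock time. At the quoted ~585 artifacts/day, a 28-day window lists about 585 × 28 = 16,380 keys before the cap applies, against roughly 32,760 for 56 days.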

cipher813 merged commit 669abad into main May 18, 2026
1 check passed

cipher813 deleted the feat/replay-shell-run-dry branch May 18, 2026 22:13

cipher813 mentioned this pull request May 19, 2026

fix(counterfactual): bound scan window + per-agent cap (600s timeout fix) #228

Merged
