feat(backtester): shell-run dry path for replay concordance + counterfactual (closes SF skip-exceptions) #225
Merged
…factual (closes SF skip-exceptions)
Closes the last two Saturday-SF shell-run keystone skip-exceptions
(ReplayConcordance / Counterfactual) by giving both replay Image-Lambda
handlers a verified clean no-write dry path so the keystone can route
them dry instead of pure-skipping them.
Dry event-key (verbatim): "dry_run_llm" (boolean)
Reuses the canonical keystone key established verbatim by the
Research Lambda in alpha-engine-data step_function.json
(`"dry_run_llm.$": "$.research_dry"`). No invented key. Distinct from
the handlers' pre-existing `dry_run` event key, whose
compute-but-don't-emit-metrics semantic is intentionally left
untouched (backward compatible — proven by retained legacy tests).
The SF rewire (ReplayConcordance/Counterfactual states must pass
"dry_run_llm": true under shell_run) is a SEPARATE alpha-engine-data
follow-on PR; not in this repo.
Shared helper (canonical, no copy-paste): replay/__init__.py
- SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"
- is_shell_run_dry(event) — tolerant bool/str/None coercion
- shell_run_dry_response(...) — benign {"status": "DRY_RUN", ...}
Both handlers import and call the single implementation.
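A minimal sketch of what such a shared helper could look like, for illustration only. The key name, the two function names, and the `{"status": "DRY_RUN", ...}` shape come from the description above; the `(handler_name, t0)` signature matches the PR summary. The accepted truthy strings and the extra response fields (`handler`, `duration_s`) are assumptions, not the contents of replay/__init__.py.

```python
import time
from typing import Any, Dict, Mapping, Optional

SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"

# Assumption: which strings count as "true" is illustrative only.
_TRUTHY_STRINGS = {"true", "1", "yes"}


def is_shell_run_dry(event: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the event explicitly asks for the dry path."""
    if not event:
        return False
    value = event.get(SHELL_RUN_DRY_EVENT_KEY)
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in _TRUTHY_STRINGS
    # None and any other type fall through to the real path.
    return False


def shell_run_dry_response(handler_name: str, t0: float) -> Dict[str, Any]:
    """Build a benign success envelope for the dry path."""
    return {
        "status": "DRY_RUN",
        "handler": handler_name,                    # hypothetical field
        "duration_s": round(time.time() - t0, 3),   # hypothetical field
    }
```

Defaulting to `False` for anything unrecognised keeps an absent or malformed key on the real path, which is the backward-compatible choice described above. The legacy `dry_run` key is never read here.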
Per-handler short-circuit proof:
lambda_concordance/handler.py — the dry check sits AFTER
_ensure_init() and AFTER the deferred `from replay.batch import
compute_and_emit_concordance` (boot + module imports run for real —
the keystone's whole point), but BEFORE the
compute_and_emit_concordance call. That call is the sole entry to the
replay.batch
decision_artifacts S3 scan, the langchain_anthropic / target-model
replay calls, the CloudWatch agent_cheap_model_concordance emit, and
the S3 summary persist. Dry path returns before all of them: zero
external/LLM calls, zero S3/CW writes.
lambda_counterfactual/handler.py — symmetric: dry check after
_ensure_init() + `from replay.counterfactual import compute_and_emit`,
before that call (the sole entry to the decision_artifacts S3 scan +
sklearn fit + CloudWatch agent_counterfactual_rule_fit emit + S3
per-agent persist). No LLM on this path regardless.
Tests assert compute_and_emit[_concordance] is never called under
dry, _ensure_init still runs, and the SF (Catch-wrapped,
non-blocking) gets a success status.
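The ordering described above can be sketched as follows. This is a simplified stand-in, not either real handler: `_ensure_init` and `compute_and_emit` are stubs that only record that they ran, the inline `is_shell_run_dry` is a reduced version of the shared helper, and the `{"status": "OK"}` real-path result is hypothetical.

```python
from typing import Any, Dict, List, Optional

calls: List[str] = []  # records which stand-ins ran, for illustration only


def _ensure_init() -> None:
    """Stand-in for the handler's real boot step."""
    calls.append("init")


def compute_and_emit(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the sole entry to the S3 scan, fit, and emits."""
    calls.append("compute")
    return {"status": "OK"}  # hypothetical real-path result


def is_shell_run_dry(event: Optional[Dict[str, Any]]) -> bool:
    """Reduced stand-in for the shared helper's coercion."""
    value = (event or {}).get("dry_run_llm")
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return value is True


def handler(event: Optional[Dict[str, Any]], context: Any = None) -> Dict[str, Any]:
    _ensure_init()  # boot runs for real on both paths
    # The real handlers perform their deferred `from replay... import ...`
    # at this point, so import errors still surface under a dry run.
    if is_shell_run_dry(event):
        return {"status": "DRY_RUN"}  # returns before any scan or write
    # The legacy `dry_run` key only suppresses metric emission on the real path.
    emit_metrics = not bool((event or {}).get("dry_run"))
    return compute_and_emit(emit_metrics=emit_metrics)
```

With these stubs, a call with `{"dry_run_llm": True}` leaves `calls` containing only `"init"`, while a call with the legacy `{"dry_run": True}` still reaches `compute_and_emit`. Those are the two properties the tests above assert.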
Known SEPARATE issue (OUT OF SCOPE here — do not conflate):
Counterfactual times out at 600s on real Saturday runs from corpus
growth (~32,740 artifacts; last success 2026-05-13; the SF Catch
swallows it). This PR does not touch the scan logic and does not
attempt to fix that timeout. Aside: because the dry path skips the
scan entirely, it incidentally avoids the timeout under shell_run —
but the real-Saturday timeout remains a distinct bug to be tracked
separately.
Tests: +15 handler tests (TestShellRunDryPath x2: dry short-circuits
before scan, string-true coercion, dry=false / absent take the real
path, legacy `dry_run` still takes the real path) + a new
tests/test_replay_shell_run_dry.py for the shared helper. Full
backtester suite: 1684 passed, 5 skipped, 1 deselected (parity
deselected as usual), 0 failed. No new deps, no secrets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 19, 2026
…fix) (#228)

ROADMAP L293: the alpha-engine-replay-counterfactual Lambda has been silently timing out at the 600s ceiling since 2026-05-16 (last successful real Saturday-SF run 5/13). The captured-artifact corpus crossed ~32,740 in the 56-day window, and the per-artifact s3:get_object loop inside ``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means the pipeline stays green; the failure is observability-silent: no ``agent_counterfactual_rule_fit`` CW datapoint has emitted in 1.5 weeks.

Per the entry's listed fix options [(a) cap/window the scan, (b) bump memory + bound workload, (c) move off Lambda]: shipping option (a), the simplest correct shape, with no infrastructure surface change and no new deploy machinery. Cost-discipline-aligned (pre-alpha-validation).

## What ships

1. **`replay/counterfactual.py`**:
   - ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks still > the 30-day statistical-significance heuristic the LdP triple-barrier fits consume; at ~585 artifacts/day this bounds the get_object loop near ~16k instead of the 32k+ that timed out.
   - New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500``, a second-order bound. Even if the window is reverted operationally or a heavy population-wide agent's per-day rate grows, no single agent's backlog can stall the run. Most-recent-first enforcement in ``_list_artifact_keys_in_window`` (day-by-day iteration was already end_date-backward, so the natural list ordering IS most-recent-first; the cap drops old keys, not fresh ones).
   - ``_list_artifact_keys_in_window`` gains the cap parameter (applied PRE ``_load_artifact`` so the bound translates directly into Lambda wall-clock).
   - ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end; the log line surfaces the value for ops visibility.
2. **`lambda_counterfactual/handler.py`**:
   - Default ``window_days`` now reads from the module constant (28).
   - New event field ``max_artifacts_per_agent`` (default 500; explicit ``None`` disables the cap for ad-hoc deeper-corpus runs).
   - Event-shape docstring + start-log line updated.
   - The shell-run dry path (PR #225) is UNTOUCHED. The locally-uncommitted 5/15 edits that removed it would have regressed merged work and downgraded the lib pin v0.20.0 → v0.16.0; those edits are abandoned by working on a clean origin/main worktree.
3. **Tests** (+5 net):
   - ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests): cap drops oldest when one agent dominates (most-recent-first truncation), cap=None returns all keys, cap=0 treated as unbounded (defensive, less surprising), cap applied independently per agent_id_base.
   - ``test_lambda_counterfactual_handler``: ``test_default_window_days`` updated to expect 28 + the new ``max_artifacts_per_agent=500`` default; two new tests for event-override + explicit-None.

## What does NOT ship

- **MemorySize bump**: not needed at 28d + per-agent cap. Held in reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor**: option (c), out of scope. The entry flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads**: could shave more time but introduces ordering + S3 throttling concerns. Out of scope; revisit if 28d + cap still bumps the ceiling at the next corpus growth.

## Post-merge deploy step

    bash infrastructure/deploy_counterfactual.sh

(Manual: there is no CI deploy for the counterfactual Lambda. Per the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05. SYSTEM_STATE should record the new image SHA after the deploy.)

## Tests

    pytest tests/ -q -> 1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path, preserved + protected by the abandon-stale-local-edits decision), ROADMAP L293 (this entry), the closed L266 agent-justification gate (this is its counterfactual leg, restored to operation).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
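The per-agent, most-recent-first cap described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the code in ``_list_artifact_keys_in_window``: the function name `cap_keys_per_agent`, the `(agent_id_base, key)` tuple shape, and the treatment of negative caps as unbounded are all hypothetical. What it takes from the commit message is that input is already ordered newest-first, that the cap is applied per agent before any artifact is loaded, and that `None` and `0` both mean "no cap".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

AgentKey = Tuple[str, str]  # (agent_id_base, s3_key); hypothetical shape


def cap_keys_per_agent(
    keys_newest_first: Iterable[AgentKey],
    max_per_agent: Optional[int],
) -> List[AgentKey]:
    """Keep at most `max_per_agent` keys per agent, preferring the newest.

    Relies on the caller supplying keys newest-first, so truncation
    drops the oldest keys for an agent rather than the freshest ones.
    """
    # None or 0 disables the cap; treating negatives the same way
    # is an assumption made for this sketch.
    if max_per_agent is None or max_per_agent <= 0:
        return list(keys_newest_first)

    kept: List[AgentKey] = []
    counts: Dict[str, int] = defaultdict(int)
    for agent, key in keys_newest_first:
        if counts[agent] < max_per_agent:
            counts[agent] += 1
            kept.append((agent, key))
    return kept
```

Because the filter runs on listed keys only, each dropped key saves one get_object call. For scale, the quoted rate of ~585 artifacts/day gives 585 × 28 = 16,380 keys for the new window versus 585 × 56 = 32,760 for the old one, in line with the ~16k and ~32,740 figures above.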
Summary
Closes the last two Saturday-SF shell-run keystone skip-exceptions (ReplayConcordance / Counterfactual) by giving both replay Image-Lambda handlers a verified clean no-write dry path, so the keystone can route them dry (boot + imports for real, benign success before the scan) instead of pure-skipping them.
Dry event-key (verbatim)
`"dry_run_llm"` (boolean): reuses the canonical keystone key established verbatim by the Research Lambda in alpha-engine-data step_function.json (`"dry_run_llm.$": "$.research_dry"`). No invented key. Distinct from the handlers' pre-existing `dry_run` event key, whose compute-but-don't-emit-metrics semantic is intentionally left untouched (backward compatible, proven by retained legacy tests).
Shared helper (canonical, no copy-paste)
replay/__init__.py is the single implementation imported by both handlers:
- SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"
- is_shell_run_dry(event): tolerant bool / str / None coercion
- shell_run_dry_response(handler_name, t0): benign {"status": "DRY_RUN", ...} envelope
Per-handler short-circuit proof
- lambda_concordance/handler.py: the dry check sits after _ensure_init() and after the deferred `from replay.batch import compute_and_emit_concordance` (boot + module imports run for real, which is the keystone's whole point), but before the compute_and_emit_concordance(...) call. That call is the sole entry to the replay.batch decision_artifacts S3 scan, the langchain_anthropic / target-model replay calls, the CloudWatch agent_cheap_model_concordance emit, and the S3 summary persist. The dry path returns before all of them, so there are zero external/LLM calls and zero S3/CW writes.
- lambda_counterfactual/handler.py: symmetric. The dry check comes after _ensure_init() + `from replay.counterfactual import compute_and_emit`, before that call (the sole entry to the decision_artifacts S3 scan + sklearn fit + CloudWatch agent_counterfactual_rule_fit emit + S3 per-agent persist). No LLM on this path regardless.
Both return a status the SF (Catch-wrapped, non-blocking) treats as success.
Known SEPARATE issue (OUT OF SCOPE)
Counterfactual times out at 600s on real Saturday runs from corpus growth (~32,740 artifacts; last success 2026-05-13; the SF Catch swallows it). This PR does not touch the scan logic and does not attempt to fix that timeout; it remains a distinct bug to be tracked separately. Aside: because the dry path skips the scan entirely, it incidentally avoids the timeout under shell_run only; the real-Saturday timeout is unaffected and still owed.
Tests
- +15 handler tests (TestShellRunDryPath ×2): dry short-circuits before scan (compute never called, _ensure_init still runs, success status), string-true coercion, dry=false / absent take the real path, legacy `dry_run` still takes the real path with emit_metrics=False.
- tests/test_replay_shell_run_dry.py for the shared helper.
🤖 Generated with Claude Code