feat(backtester): shell-run dry path for replay concordance + counterfactual (closes SF skip-exceptions) #225

Merged: cipher813 merged 1 commit into main from feat/replay-shell-run-dry on May 18, 2026

@cipher813 (Owner)

Summary

Closes the last two Saturday-SF shell-run keystone skip-exceptions (ReplayConcordance / Counterfactual) by giving both replay Image-Lambda handlers a verified clean no-write dry path, so the keystone can route them dry (boot + imports for real, benign success before the scan) instead of pure-skipping them.

Dry event-key (verbatim)

"dry_run_llm" (boolean) — reuses the canonical keystone key established verbatim by the Research Lambda in alpha-engine-data step_function.json ("dry_run_llm.$": "$.research_dry"). No invented key. Distinct from the handlers' pre-existing dry_run event key, whose compute-but-don't-emit-metrics semantic is intentionally left untouched (backward compatible — proven by retained legacy tests).

The SF rewire (the ReplayConcordance/Counterfactual states must pass "dry_run_llm": true under shell_run, and stop hard-skipping them) is a separate alpha-engine-data follow-on PR — not in this repo.

Shared helper (canonical, no copy-paste)

replay/__init__.py — the single implementation imported by both handlers:

  • SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"
  • is_shell_run_dry(event) — tolerant bool / str / None coercion
  • shell_run_dry_response(handler_name, t0) — benign {"status": "DRY_RUN", ...} envelope
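
A minimal sketch of what a helper with this shape could look like. The three names come from the list above; the set of strings accepted as true and every envelope field other than `status` are assumptions made for illustration, not the contents of `replay/__init__.py`.

```python
import time
from typing import Any, Mapping, Optional

SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"

# Assumption: the PR does not list which strings count as "true".
_TRUTHY_STRINGS = frozenset({"true", "1", "yes"})


def is_shell_run_dry(event: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the event explicitly requests the shell-run dry path."""
    if not event:
        return False
    value = event.get(SHELL_RUN_DRY_EVENT_KEY)
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in _TRUTHY_STRINGS
    # None and any other type fall through to the real path.
    return False


def shell_run_dry_response(handler_name: str, t0: float) -> dict:
    """Build the benign success envelope returned before any scan or write."""
    return {
        "status": "DRY_RUN",
        "handler": handler_name,                   # hypothetical field
        "elapsed_s": round(time.time() - t0, 3),   # hypothetical field
    }
```

Note that the legacy `dry_run` key is deliberately not consulted here, so an event carrying only `dry_run` still takes the real path.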

Per-handler short-circuit proof

lambda_concordance/handler.py — the dry check sits after _ensure_init() and after the deferred from replay.batch import compute_and_emit_concordance (boot + module imports run for real — the keystone's whole point), but before the compute_and_emit_concordance(...) call. That call is the sole entry to the replay.batch decision_artifacts S3 scan, the langchain_anthropic / target-model replay calls, the CloudWatch agent_cheap_model_concordance emit, and the S3 summary persist. The dry path returns before all of them → zero external/LLM calls, zero S3/CW writes.

lambda_counterfactual/handler.py — symmetric: dry check after _ensure_init() + from replay.counterfactual import compute_and_emit, before that call (the sole entry to the decision_artifacts S3 scan + sklearn fit + CloudWatch agent_counterfactual_rule_fit emit + S3 per-agent persist). No LLM on this path regardless.

Both return a status the SF (Catch-wrapped, non-blocking) treats as success.
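
The ordering described above can be sketched as follows. This is an illustration with stand-in functions, not the real handler: `steps`, `last_kwargs`, the `"OK"` status on the real path, and the `elapsed_s` field are hypothetical, and the dry check is simplified to a strict boolean test.

```python
import time

SHELL_RUN_DRY_EVENT_KEY = "dry_run_llm"

# Illustration-only bookkeeping so the call order can be inspected.
steps = []
last_kwargs = {}


def _ensure_init():
    # Stand-in for the handler's real boot/initialisation step.
    steps.append("init")


def compute_and_emit(**kwargs):
    # Stand-in for replay.counterfactual.compute_and_emit. In the real code this
    # call is the sole entry to the S3 scan, the fit, the CloudWatch emit and
    # the S3 persist.
    steps.append("compute")
    last_kwargs.clear()
    last_kwargs.update(kwargs)
    return {"status": "OK"}  # hypothetical real-path status


def handler(event, context=None):
    t0 = time.time()
    _ensure_init()  # boot runs for real, even when dry
    # The real handler performs its deferred import here, before the dry check:
    #   from replay.counterfactual import compute_and_emit
    if event.get(SHELL_RUN_DRY_EVENT_KEY) is True:  # simplified: no string coercion
        return {"status": "DRY_RUN", "elapsed_s": round(time.time() - t0, 3)}
    # Legacy `dry_run` keeps its meaning: compute, but do not emit metrics.
    emit_metrics = not event.get("dry_run", False)
    return compute_and_emit(emit_metrics=emit_metrics)
```

Calling `handler({"dry_run_llm": True})` records only `"init"` in `steps`, while `handler({"dry_run": True})` still reaches the compute stand-in with `emit_metrics=False`.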

Known SEPARATE issue (OUT OF SCOPE)

Counterfactual times out at 600s on real Saturday runs from corpus growth (~32,740 artifacts; last success 2026-05-13; the SF Catch swallows it). This PR does not touch the scan logic and does not attempt to fix that timeout — it remains a distinct bug to be tracked separately. Aside: because the dry path skips the scan entirely, it incidentally avoids the timeout under shell_run only; the real-Saturday timeout is unaffected and still owed.

Tests

  • +15 handler tests (TestShellRunDryPath ×2): dry short-circuits before scan (compute never called, _ensure_init still runs, success status), string-true coercion, dry=false / absent take the real path, legacy dry_run still takes the real path with emit_metrics=False.
  • New tests/test_replay_shell_run_dry.py for the shared helper.
  • Full backtester suite: 1684 passed, 5 skipped, 1 deselected (parity deselected as usual), 0 failed. No new deps, no secrets.
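
For readers who want the shape of these assertions, here is a hedged sketch using `unittest.mock`. `FakeHandlerModule` is a hypothetical stand-in for a handler module so the example runs on its own; the real tests patch the actual handlers and are not reproduced here.

```python
import unittest
from unittest import mock


class FakeHandlerModule:
    """Hypothetical stand-in for a replay handler module."""

    @staticmethod
    def _ensure_init():
        pass

    @staticmethod
    def compute_and_emit(**kwargs):
        raise AssertionError("the real scan must never run inside a unit test")

    @classmethod
    def handler(cls, event, context=None):
        cls._ensure_init()
        if event.get("dry_run_llm") in (True, "true"):
            return {"status": "DRY_RUN"}
        return cls.compute_and_emit(emit_metrics=not event.get("dry_run", False))


class TestShellRunDryPath(unittest.TestCase):
    def setUp(self):
        # Patch both collaborators for every test and undo the patches afterwards.
        p_init = mock.patch.object(FakeHandlerModule, "_ensure_init")
        p_compute = mock.patch.object(FakeHandlerModule, "compute_and_emit")
        self.init = p_init.start()
        self.compute = p_compute.start()
        self.addCleanup(p_init.stop)
        self.addCleanup(p_compute.stop)

    def test_dry_short_circuits_before_scan(self):
        result = FakeHandlerModule.handler({"dry_run_llm": True})
        self.init.assert_called_once()
        self.compute.assert_not_called()
        self.assertEqual(result["status"], "DRY_RUN")

    def test_absent_key_takes_real_path(self):
        FakeHandlerModule.handler({})
        self.compute.assert_called_once_with(emit_metrics=True)

    def test_legacy_dry_run_still_takes_real_path(self):
        FakeHandlerModule.handler({"dry_run": True})
        self.compute.assert_called_once_with(emit_metrics=False)
```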

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 669abad into main May 18, 2026
1 check passed
@cipher813 cipher813 deleted the feat/replay-shell-run-dry branch May 18, 2026 22:13
cipher813 added a commit that referenced this pull request May 19, 2026
…fix) (#228)

ROADMAP L293 — alpha-engine-replay-counterfactual Lambda has been silently
timing out at the 600s ceiling since 2026-05-16 (last successful real
Saturday-SF run 5/13). Captured-artifact corpus crossed ~32,740 in the
56-day window, and the per-artifact s3:get_object loop inside
``compute_and_emit`` blew past the timeout. SF Catch=States.ALL means
the pipeline stays green; the failure is observability-silent: no
``agent_counterfactual_rule_fit`` CW datapoint has been emitted in 1.5 weeks.

Per the entry's listed fix options [(a) cap/window the scan, (b) bump
memory + bound workload, (c) move off Lambda]: shipping option (a) —
simplest correct shape, no infrastructure surface change, no new
deploy machinery. Cost-discipline-aligned (pre-alpha-validation).

## What ships

1. **`replay/counterfactual.py`**:
   - ``DEFAULT_WINDOW_DAYS`` 56 → 28. 4 weeks still > the 30-day
     statistical-significance heuristic the LdP triple-barrier fits
     consume; at ~585 artifacts/day this bounds the get_object loop
     near ~16k instead of the 32k+ that timed out.
   - New module constant ``DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500`` —
     second-order bound. Even if the window is reverted operationally
     or a heavy population-wide agent's per-day rate grows, no single
     agent's backlog can stall the run. Most-recent-first enforcement
     in ``_list_artifact_keys_in_window`` (day-by-day iteration was
     already end_date-backward, so the natural list ordering IS
     most-recent-first — cap drops old keys, not fresh ones).
   - ``_list_artifact_keys_in_window`` gains the cap parameter
     (applied PRE ``_load_artifact`` so the bound translates directly
     into Lambda wall-clock).
   - ``compute_and_emit`` threads ``max_artifacts_per_agent`` end-to-end;
     log line surfaces the value for ops visibility.

2. **`lambda_counterfactual/handler.py`**:
   - Default ``window_days`` now reads from the module constant (28).
   - New event field ``max_artifacts_per_agent`` (default 500; explicit
     ``None`` disables the cap for ad-hoc deeper-corpus runs).
   - Event-shape docstring + start-log line updated.
   - The shell-run dry path (PR #225) is UNTOUCHED — the locally-
     uncommitted 5/15 edits that removed it would have regressed
     merged work + downgraded the lib pin v0.20.0 → v0.16.0; those
     edits are abandoned by working on a clean origin/main worktree.

3. **Tests** (+5 net):
   - ``test_replay_counterfactual.TestPerAgentArtifactCap`` (4 tests):
     cap drops oldest when one agent dominates (most-recent-first
     truncation), cap=None returns all keys, cap=0 treated as
     unbounded (defensive — less-surprising), cap applied
     independently per agent_id_base.
   - ``test_lambda_counterfactual_handler``:
     ``test_default_window_days`` updated to expect 28 + the new
     ``max_artifacts_per_agent=500`` default; two new tests for
     event-override + explicit-None.
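
A rough sketch of the per-agent cap and the event-field default described above, under stated assumptions: `cap_keys_per_agent` and `resolve_cap` are hypothetical names, and the `(agent_id_base, key)` tuple input is an illustrative simplification rather than the signature of ``_list_artifact_keys_in_window``. The input is assumed to arrive newest-first already, so truncation drops the oldest keys.

```python
from collections import defaultdict
from typing import Iterable, List, Mapping, Optional, Tuple

DEFAULT_WINDOW_DAYS = 28
DEFAULT_MAX_ARTIFACTS_PER_AGENT = 500


def cap_keys_per_agent(
    keys_newest_first: Iterable[Tuple[str, str]],
    max_per_agent: Optional[int],
) -> List[str]:
    """Keep at most `max_per_agent` keys per agent, preserving newest-first order."""
    if not max_per_agent:  # None or 0 means unbounded
        return [key for _, key in keys_newest_first]
    kept: List[str] = []
    seen = defaultdict(int)  # per-agent count of keys kept so far
    for agent_id_base, key in keys_newest_first:
        if seen[agent_id_base] < max_per_agent:
            seen[agent_id_base] += 1
            kept.append(key)
    return kept


def resolve_cap(event: Mapping[str, object]) -> Optional[int]:
    """A missing field falls back to the default; an explicit None disables the cap."""
    return event.get("max_artifacts_per_agent", DEFAULT_MAX_ARTIFACTS_PER_AGENT)
```

Because the cap is applied to keys before any artifact is loaded, it bounds the number of `get_object` calls directly. As a cross-check of the window figures quoted above: 585 × 56 = 32,760 and 585 × 28 = 16,380.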

## What does NOT ship

- **MemorySize bump** — not needed at 28d + per-agent cap. Held in
  reserve for option (b) if a future corpus expansion needs it.
- **Spot/batch refactor** — option (c), out of scope. The entry
  flags it as the heavier alternative; option (a) is sufficient.
- **Concurrency / threaded loads** — could shave more time but
  introduces ordering + S3 throttling concerns. Out of scope; revisit
  if 28d + cap still bumps the ceiling at the next corpus growth.

## Post-merge deploy step

  bash infrastructure/deploy_counterfactual.sh

(Manual — there is no CI deploy for the counterfactual Lambda. Per
the ROADMAP entry, the prior ECR ``:latest`` push was 2026-05-05.
SYSTEM_STATE should record the new image SHA after the deploy.)

## Tests

  pytest tests/ -q  ->  1693 passed, 0 failed

Composes with: PR #225 (shell-run dry path — preserved + protected by
the abandon-stale-local-edits decision), ROADMAP L293 (this entry),
the closed L266 agent-justification gate (this is its counterfactual
leg, restored to operation).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>