feat(sf): PR 3c LLM-as-judge wiring on Saturday Step Function #139
Merged
Inserts the eval-judge state between Backtester success and the
existing SaturdayHealthCheck, with two-tier sampling driven by the
SF execution date.
Saturday SF flow update:
  CheckBacktesterStatus (Success)
    → CheckSkipEvalJudge (skip-flag bypass to health check)
    → ComputeEvalCadence (Pass: extracts day_of_month + eval_date
                          from $$.Execution.StartTime via intrinsic
                          StringSplit)
    → CheckMonthlyCadence (Choice: day_of_month < '08' lex compare
                           ⇒ first Saturday of the month)
        → EvalJudgeFirstSaturday (Lambda: force_sonnet_pass=true)
        → EvalJudgeWeekly (Lambda: force_sonnet_pass=false)
    → SaturdayHealthCheck → NotifyComplete
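
The routing above can be traced in plain Python. This is only a sketch of the described behavior, not the Step Function definition itself: the `skip_eval_judge` argument name and the exact timestamp handling are assumptions, and the real state machine does the split with ASL intrinsics rather than `str.split`.

```python
def compute_eval_cadence(start_time: str) -> dict[str, str]:
    """Mimic the ComputeEvalCadence Pass state on an ISO-8601 start time."""
    eval_date = start_time.split("T")[0]      # e.g. "2026-06-06"
    day_of_month = eval_date.split("-")[2]    # e.g. "06", still zero-padded
    return {"eval_date": eval_date, "day_of_month": day_of_month}


def route(skip_eval_judge: bool, start_time: str) -> list[str]:
    """Return the state names visited after a successful backtester run.

    `skip_eval_judge` is a hypothetical name for the skip flag.
    """
    path = ["CheckSkipEvalJudge"]
    if skip_eval_judge:
        # Skip-flag bypass goes straight to the health check.
        return path + ["SaturdayHealthCheck", "NotifyComplete"]

    cadence = compute_eval_cadence(start_time)
    path += ["ComputeEvalCadence", "CheckMonthlyCadence"]
    if cadence["day_of_month"] < "08":        # lexicographic string compare
        path.append("EvalJudgeFirstSaturday")
    else:
        path.append("EvalJudgeWeekly")        # Choice default
    return path + ["SaturdayHealthCheck", "NotifyComplete"]
```

For example, `route(False, "2026-06-06T13:00:00.000Z")` passes through `EvalJudgeFirstSaturday`, while the same call with a `2026-06-13` start time passes through `EvalJudgeWeekly`.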
Both eval Tasks Catch States.ALL → SaturdayHealthCheck (eval is
observability per ROADMAP §1635 — pipeline must NOT halt on eval
failure, including infra-level Lambda errors).
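
A rough sketch of what the non-blocking shape could look like, built as a Python dict so it can be dumped to JSON. The `lambda:invoke` integration style, the `ResultPath` names, and the `eval_date` payload key are assumptions for illustration; only the Lambda name, the `force_sonnet_pass` flag, and the `States.ALL` → `SaturdayHealthCheck` catch come from this PR.

```python
import json


def eval_judge_task(force_sonnet_pass: bool) -> dict:
    """Build one eval Task state whose failures fall through to the health check."""
    return {
        "Type": "Task",
        # Assumed integration style; the real definition may use a direct Lambda ARN.
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "alpha-engine-research-eval-judge:live",
            "Payload": {
                "force_sonnet_pass": force_sonnet_pass,
                "eval_date.$": "$.eval_date",  # assumed payload key
            },
        },
        "ResultPath": "$.eval_judge_result",  # hypothetical path
        "Catch": [
            {
                # States.ALL matches every error name that can be caught,
                # so an eval failure never ends the execution here.
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.eval_judge_error",  # hypothetical path
                "Next": "SaturdayHealthCheck",
            }
        ],
        "Next": "SaturdayHealthCheck",
    }


def build_eval_states() -> dict:
    """Return both eval Task states keyed by state name."""
    return {
        "EvalJudgeFirstSaturday": eval_judge_task(force_sonnet_pass=True),
        "EvalJudgeWeekly": eval_judge_task(force_sonnet_pass=False),
    }


def to_json(states: dict) -> str:
    return json.dumps(states, indent=2)
```

Success and failure both land on `SaturdayHealthCheck`, which is the property the wiring tests are meant to pin.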
Cadence: monthly Sonnet sweep on the first Saturday of each month
(per ROADMAP §1626 every-4th-run cadence — first Saturday is
deterministic, requires no state, and approximates ~4-week intervals
to within one week). Other Saturdays run Haiku-only with the per-
artifact <3 escalation gate inside the Lambda. The lex compare on
zero-padded '01'-'07' < '08' is correct because all values are
2-char strings.
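
The cadence claim can be sanity-checked offline. The sketch below (helper names are illustrative, not from the repo) enumerates every Saturday in a year, applies the same zero-padded string compare, and confirms the rule selects exactly one Saturday per month, 4 or 5 weeks apart.

```python
from datetime import date, timedelta


def is_monthly_sweep(day_of_month: str) -> bool:
    """Same lexicographic rule as the Choice state: '01'..'07' < '08'."""
    return day_of_month < "08"


def saturdays(year: int):
    """Yield every Saturday in the given year."""
    d = date(year, 1, 1)
    d += timedelta(days=(5 - d.weekday()) % 7)  # Saturday is weekday() == 5
    while d.year == year:
        yield d
        d += timedelta(days=7)


def monthly_sweeps(year: int) -> list[date]:
    """Saturdays the rule would route to the full Sonnet sweep."""
    return [d for d in saturdays(year) if is_monthly_sweep(d.strftime("%d"))]


def sweep_gaps_in_days(year: int) -> set[int]:
    """Distinct gaps between consecutive sweeps within the year."""
    sweeps = monthly_sweeps(year)
    return {(b - a).days for a, b in zip(sweeps, sweeps[1:])}
```

Zero-padding is what makes the string compare safe: with unpadded values, `"10" < "8"` is true lexicographically, so day 10 would be misread as falling in the first week.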
IAM update: alpha-engine-eventbridge-sfn-role policy gains
alpha-engine-research-eval-judge* under LambdaInvoke. Updated in BOTH
deploy_step_function.sh and deploy_step_function_daily.sh because
the comment at deploy_step_function.sh:91-92 documents the shared-
policy convention — last deploy script to run wins, so they must
stay in sync.
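
Because the last deploy script to run overwrites the shared policy, a small drift check is one way to guard the convention. This is a hypothetical helper, not something in the repo: it only does a substring search over script text that the caller supplies.

```python
REQUIRED_RESOURCE = "alpha-engine-research-eval-judge*"


def scripts_missing_resource(
    script_texts: dict[str, str], resource: str = REQUIRED_RESOURCE
) -> list[str]:
    """Return the names of deploy scripts whose text lacks the resource pattern.

    `script_texts` maps a script name to its full contents; reading the files
    is left to the caller so the check stays easy to test.
    """
    return sorted(name for name, text in script_texts.items() if resource not in text)
```

An empty result for both `deploy_step_function.sh` and `deploy_step_function_daily.sh` means neither script would drop the eval-judge permission when it runs last.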
Tests: tests/test_sf_eval_judge_wiring.py (21 tests) pins the wiring
shape — backtester success transition, skip-gate behavior, cadence
computation, Choice default = weekly (not monthly — would otherwise
silently ship every weekly run on the more expensive Sonnet sweep),
Lambda payload contract, non-blocking failure semantics. Test surface
406 → 427.
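
A minimal sketch of the kind of assertion that pins the Choice default. The state fragment below is a stand-in written for illustration (the `$.day_of_month` variable path is an assumption); the real tests in tests/test_sf_eval_judge_wiring.py would load the actual Step Function JSON instead.

```python
# Illustrative stand-in for the CheckMonthlyCadence state.
CHECK_MONTHLY_CADENCE = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.day_of_month",  # assumed path
            "StringLessThan": "08",
            "Next": "EvalJudgeFirstSaturday",
        }
    ],
    "Default": "EvalJudgeWeekly",
}


def choice_target(state: dict, day_of_month: str) -> str:
    """Evaluate the StringLessThan rules the way the Choice state would."""
    for rule in state["Choices"]:
        if "StringLessThan" in rule and day_of_month < rule["StringLessThan"]:
            return rule["Next"]
    return state["Default"]


def test_choice_default_is_weekly():
    # A monthly default would silently send every run to the Sonnet sweep.
    assert CHECK_MONTHLY_CADENCE["Default"] == "EvalJudgeWeekly"


def test_first_week_routes_to_sonnet_sweep():
    for day in ("01", "04", "07"):
        assert choice_target(CHECK_MONTHLY_CADENCE, day) == "EvalJudgeFirstSaturday"


def test_later_weeks_route_to_weekly():
    for day in ("08", "15", "31"):
        assert choice_target(CHECK_MONTHLY_CADENCE, day) == "EvalJudgeWeekly"
```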
Deploy order owed:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_judge
(creates alpha-engine-research-eval-judge:live)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh
(updates SF JSON + IAM policy)
First live run: Sat 2026-05-09 (per the post-typed-state observation
gate noted in handoff_260502_llm_judge_workstream). Sat 5/9 is the
second Saturday of May (the first is 5/2), so day_of_month '09' fails
the < '08' check and the first eval pipeline run takes the
EvalJudgeWeekly path (Haiku-only with escalation). The first monthly
Sonnet sweep then falls on Sat 2026-06-06. The eval pipeline runs are
the calibration corpus PR 4 (CloudWatch metric + dashboard) will read
from.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wires the alpha-engine-research-eval-judge Lambda (PR #91) into the Saturday Step Function. Adds two-tier sampling driven by the SF execution date — monthly Sonnet sweep on the first Saturday of each month, Haiku-only with per-artifact escalation on the others.
SF flow update
Both eval Tasks Catch: States.ALL → SaturdayHealthCheck. Eval is observability per ROADMAP §1635: the pipeline must NOT halt on eval failure.
Cadence rationale
Monthly Sonnet sweep on the first Saturday of each month (matches ROADMAP §1626's "every 4th run" with a deterministic, stateless rule that approximates ~4-week intervals to within one week). The lex-compare on zero-padded '01'–'07' < '08' is correct because all day_of_month values are 2-char strings.
Other Saturdays run Haiku-only with the per-artifact <3 escalation gate inside the Lambda: borderline artifacts still get a Sonnet pass; only the full sweep is gated by cadence.
IAM
Adds alpha-engine-research-eval-judge* to the LambdaInvoke resource list in both deploy_step_function.sh and deploy_step_function_daily.sh. The shared-policy convention (deploy_step_function.sh:91-92) means the last script to run wins, so both must stay in sync.
Test plan
pytest tests/test_sf_eval_judge_wiring.py (21 tests pin transition + payload shape).
Deploy order
1. ./infrastructure/deploy.sh eval_judge (creates alpha-engine-research-eval-judge:live)
2. ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM policy)
Order matters: if the SF picks up the new state before the Lambda exists, the EvalJudge state would fail-fast on first invocation (caught by the non-blocking Catch, but unnecessary noise in CloudWatch).
Out of scope (PR 4+)
AlphaEngine/Eval/agent_quality_score + SNS rolling-4-week-mean alarm (ROADMAP §1634).
🤖 Generated with Claude Code