feat(sf): PR 3c LLM-as-judge wiring on Saturday Step Function by cipher813 · Pull Request #139 · cipher813/alpha-engine-data

cipher813 · 2026-05-03T13:41:58Z

Summary

Wires the alpha-engine-research-eval-judge Lambda (PR #91) into the Saturday Step Function. Adds two-tier sampling driven by the SF execution date — monthly Sonnet sweep on the first Saturday of each month, Haiku-only with per-artifact escalation on the others.

SF flow update

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge        (skip_eval_judge bypass → SaturdayHealthCheck)
  → ComputeEvalCadence        (Pass: extract day_of_month + eval_date from
                               $$.Execution.StartTime via intrinsic StringSplit)
  → CheckMonthlyCadence       (Choice: day_of_month < '08' lex compare)
      → EvalJudgeFirstSaturday  (match:   Lambda, force_sonnet_pass=true)
      → EvalJudgeWeekly         (default: Lambda, force_sonnet_pass=false)
  → SaturdayHealthCheck → NotifyComplete

Both eval Tasks catch States.ALL and route to SaturdayHealthCheck. Eval is observability per ROADMAP §1635, so the pipeline must NOT halt on eval failure, including infra-level Lambda errors.
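As a rough illustration of the non-blocking Catch, the two eval Tasks could be shaped like the fragment below, written here as Python dicts that serialize to Amazon States Language JSON. The state names, force_sonnet_pass, States.ALL, and the alpha-engine-research-eval-judge:live alias come from this PR. The ResultPath values, the eval_date input path, and the use of the lambda:invoke service integration are assumptions, not the merged definition.

```python
import json


def eval_judge_task(force_sonnet_pass: bool) -> dict:
    """Build one eval-judge Task state (sketch; field paths are hypothetical)."""
    return {
        "Type": "Task",
        # Assumption: the optimized Lambda integration; the real SF may use a direct ARN.
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "alpha-engine-research-eval-judge:live",
            "Payload": {
                "force_sonnet_pass": force_sonnet_pass,
                # Hypothetical path to the ComputeEvalCadence output.
                "eval_date.$": "$.eval_cadence.eval_date",
            },
        },
        "ResultPath": "$.eval_judge_result",  # hypothetical
        # Any error, including infra-level Lambda failures, falls through
        # to the health check instead of failing the execution.
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.eval_judge_error",  # hypothetical
                "Next": "SaturdayHealthCheck",
            }
        ],
        "Next": "SaturdayHealthCheck",
    }


STATES = {
    "EvalJudgeFirstSaturday": eval_judge_task(force_sonnet_pass=True),
    "EvalJudgeWeekly": eval_judge_task(force_sonnet_pass=False),
}

ASL_FRAGMENT = json.dumps(STATES, indent=2)
```

Because the success path and the Catch path share the same Next, the health check runs whether or not the judge succeeds.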

Cadence rationale

The monthly Sonnet sweep runs on the first Saturday of each month. This stands in for ROADMAP §1626's "every 4th run" with a deterministic, stateless rule: consecutive first Saturdays are 4 or 5 weeks apart, so the interval stays within one week of the 4-week target. The lex compare day_of_month < '08' is correct because every day_of_month value is a zero-padded 2-char string, so '01'–'07' sort before '08' exactly as the integers do.
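The cadence rule can be checked offline with a small Python sketch. It mimics the Pass and Choice states with plain string operations; the function names and the 09:00 UTC start time are illustrative assumptions, and the real state machine does the split with the StringSplit intrinsic rather than Python.

```python
from datetime import date, timedelta


def compute_eval_cadence(start_time: str) -> dict:
    """Mimic the Pass state: split an ISO-8601 StartTime into date parts."""
    eval_date = start_time.split("T")[0]      # e.g. "2026-06-06"
    day_of_month = eval_date.split("-")[2]    # e.g. "06", always zero-padded
    return {"eval_date": eval_date, "day_of_month": day_of_month}


def is_monthly_sweep(day_of_month: str) -> bool:
    """Mimic the Choice state: lexicographic comparison against '08'."""
    return day_of_month < "08"


def saturdays(year: int):
    """Yield every Saturday in the given year."""
    d = date(year, 1, 1)
    d += timedelta(days=(5 - d.weekday()) % 7)  # Monday=0, so Saturday=5
    while d.year == year:
        yield d
        d += timedelta(days=7)


# Run the rule over every Saturday of 2026 (start time is a made-up 09:00 UTC).
sweeps = []
for sat in saturdays(2026):
    cadence = compute_eval_cadence(sat.isoformat() + "T09:00:00.000Z")
    if is_monthly_sweep(cadence["day_of_month"]):
        sweeps.append(sat)

# Days between consecutive sweeps.
gaps = {(b - a).days for a, b in zip(sweeps, sweeps[1:])}
```

Over 2026 this selects exactly one Saturday per month, and the gaps between sweeps are 28 or 35 days.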

Other Saturdays run Haiku-only with the per-artifact <3 escalation gate inside the Lambda — borderline artifacts still get a Sonnet pass; only the full sweep is gated by cadence.
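The two-tier sampling could look roughly like the sketch below. Only force_sonnet_pass and the <3 threshold come from this PR; judge_artifact, the judge callables, the returned field names, and the assumption that Haiku always runs first are hypothetical, since the Lambda's code lives in alpha-engine-research and is not shown here.

```python
from typing import Callable, Optional

ESCALATION_THRESHOLD = 3  # the "<3" gate; the score scale itself is assumed


def judge_artifact(
    artifact: str,
    haiku_judge: Callable[[str], float],
    sonnet_judge: Callable[[str], float],
    force_sonnet_pass: bool,
) -> dict:
    """Score one artifact, escalating to Sonnet when forced or borderline."""
    haiku_score = haiku_judge(artifact)
    # Monthly sweep: every artifact gets a Sonnet pass.
    # Weekly run: only artifacts scoring below the threshold escalate.
    escalate = force_sonnet_pass or haiku_score < ESCALATION_THRESHOLD
    sonnet_score: Optional[float] = sonnet_judge(artifact) if escalate else None
    return {
        "haiku_score": haiku_score,
        "sonnet_score": sonnet_score,
        "escalated": escalate,
    }


# Usage with stub judges standing in for the real model calls.
weekly_ok = judge_artifact("a", lambda _: 4.0, lambda _: 4.5, force_sonnet_pass=False)
weekly_low = judge_artifact("b", lambda _: 2.0, lambda _: 2.5, force_sonnet_pass=False)
monthly = judge_artifact("c", lambda _: 4.0, lambda _: 4.5, force_sonnet_pass=True)
```

Passing the judges in as callables keeps the gate testable without any model calls.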

IAM

Adds alpha-engine-research-eval-judge* to the LambdaInvoke resource list in both deploy_step_function.sh and deploy_step_function_daily.sh. The shared-policy convention (deploy_step_function.sh:91-92) means the last script to run wins, so both must stay in sync.
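Since the last script to run overwrites the shared policy, a drift check between the two policy documents is one way to catch a missed update. The sketch below assumes the policy JSON has already been extracted from each script into a dict; the helper name, the region, and the account ID in the ARNs are placeholders, not values from the repo.

```python
def lambda_invoke_resources(policy: dict) -> set:
    """Collect every Resource attached to a lambda:InvokeFunction statement."""
    resources = set()
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if "lambda:InvokeFunction" in actions:
            res = stmt.get("Resource", [])
            resources.update([res] if isinstance(res, str) else res)
    return resources


# Placeholder ARN: region and account ID are made up for illustration.
JUDGE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:alpha-engine-research-eval-judge*"

saturday_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "LambdaInvoke", "Effect": "Allow",
         "Action": "lambda:InvokeFunction", "Resource": [JUDGE_ARN]},
    ],
}
# Simulates the daily script having been left without the new resource.
daily_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "LambdaInvoke", "Effect": "Allow",
         "Action": "lambda:InvokeFunction", "Resource": []},
    ],
}

# Resources present in one policy but not the other; empty means in sync.
drift = lambda_invoke_resources(saturday_policy) ^ lambda_invoke_resources(daily_policy)
```

A non-empty drift set means whichever script deploys last would silently change what the Step Function role can invoke.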

Test plan

pytest tests/test_sf_eval_judge_wiring.py (21 tests pin transition + payload shape).
Full suite: 406 → 427 passing.
Manual JSON validation: state graph reachable; Choice defaults correct (weekly not monthly).
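Live verification: Sat 2026-05-09 SF run. 2026-05-09 is the second Saturday of May (2026-05-02 was the first), so day_of_month is '09' and the < '08' rule routes this run to the Haiku-only weekly path. The first monthly Sonnet sweep, which becomes the calibration corpus for PR 4, falls on Sat 2026-06-06.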
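The manual JSON validation in the test plan (reachability and the Choice default) could be automated along these lines. The definition below is a reduced, hypothetical stand-in: state names follow the SF flow in this PR, while the Variable paths and the elided Task bodies are assumptions.

```python
from collections import deque

# Reduced stand-in for the Saturday SF definition (paths are hypothetical).
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "SaturdayHealthCheck"}]
DEFINITION = {
    "StartAt": "CheckSkipEvalJudge",
    "States": {
        "CheckSkipEvalJudge": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.skip_eval_judge", "BooleanEquals": True,
                         "Next": "SaturdayHealthCheck"}],
            "Default": "ComputeEvalCadence",
        },
        "ComputeEvalCadence": {"Type": "Pass", "Next": "CheckMonthlyCadence"},
        "CheckMonthlyCadence": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.eval_cadence.day_of_month",
                         "StringLessThan": "08",
                         "Next": "EvalJudgeFirstSaturday"}],
            "Default": "EvalJudgeWeekly",
        },
        "EvalJudgeFirstSaturday": {"Type": "Task", "Catch": CATCH_ALL,
                                   "Next": "SaturdayHealthCheck"},
        "EvalJudgeWeekly": {"Type": "Task", "Catch": CATCH_ALL,
                            "Next": "SaturdayHealthCheck"},
        "SaturdayHealthCheck": {"Type": "Task", "Next": "NotifyComplete"},
        "NotifyComplete": {"Type": "Task", "End": True},
    },
}


def next_states(state: dict) -> list:
    """All transition targets of one state: Next, Default, Choices, Catch."""
    targets = [state[key] for key in ("Next", "Default") if key in state]
    targets += [rule["Next"] for rule in state.get("Choices", [])]
    targets += [catcher["Next"] for catcher in state.get("Catch", [])]
    return targets


def unreachable_states(definition: dict) -> set:
    """Breadth-first walk from StartAt; return states never visited."""
    states = definition["States"]
    seen = {definition["StartAt"]}
    queue = deque(seen)
    while queue:
        for target in next_states(states[queue.popleft()]):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(states) - seen
```

An empty result from unreachable_states(DEFINITION) means every state can be reached from StartAt; the Default of CheckMonthlyCadence can be asserted directly against "EvalJudgeWeekly".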

Deploy order

1. From alpha-engine-research: ./infrastructure/deploy.sh eval_judge (creates alpha-engine-research-eval-judge:live)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM policy)

Order matters: if the SF picks up the new states before the Lambda exists, the eval-judge Task would fail on its first invocation. The non-blocking Catch would absorb the failure, but it would add unnecessary noise in CloudWatch.

Out of scope (PR 4+)

CloudWatch metric AlphaEngine/Eval/agent_quality_score + SNS rolling-4-week-mean alarm (ROADMAP §1634).
Streamlit dashboard quality-trend page + prompt-version → quality-score correlation chart (ROADMAP §1632–1633).
Human cross-validation writeup (ROADMAP §1627 calibration record).

🤖 Generated with Claude Code

Inserts the eval-judge state between Backtester success and the existing SaturdayHealthCheck, with two-tier sampling driven by the SF execution date.

Saturday SF flow update:

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge (skip-flag bypass to health check)
  → ComputeEvalCadence (Pass: extracts day_of_month + eval_date from $$.Execution.StartTime via intrinsic StringSplit)
  → CheckMonthlyCadence (Choice: day_of_month < '08' lex compare ⇒ first Saturday of the month)
      → EvalJudgeFirstSaturday (Lambda: force_sonnet_pass=true)
      → EvalJudgeWeekly (Lambda: force_sonnet_pass=false)
  → SaturdayHealthCheck → NotifyComplete

Both eval Tasks Catch States.ALL → SaturdayHealthCheck (eval is observability per ROADMAP §1635; pipeline must NOT halt on eval failure, including infra-level Lambda errors).

Cadence: monthly Sonnet sweep on the first Saturday of each month (per ROADMAP §1626 every-4th-run cadence; first Saturday is deterministic, requires no state, and stays within one week of a 4-week interval). Other Saturdays run Haiku-only with the per-artifact <3 escalation gate inside the Lambda. The lex compare on zero-padded '01'-'07' < '08' is correct because all values are 2-char strings.

IAM update: alpha-engine-eventbridge-sfn-role policy gains alpha-engine-research-eval-judge* under LambdaInvoke. Updated in BOTH deploy_step_function.sh and deploy_step_function_daily.sh because the comment at deploy_step_function.sh:91-92 documents the shared-policy convention: the last deploy script to run wins, so they must stay in sync.

Tests: tests/test_sf_eval_judge_wiring.py (21 tests) pins the wiring shape: backtester success transition, skip-gate behavior, cadence computation, Choice default = weekly (not monthly, which would otherwise silently ship every weekly run on the more expensive Sonnet sweep), Lambda payload contract, non-blocking failure semantics. Test surface 406 → 427.

Deploy order owed:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_judge (creates alpha-engine-research-eval-judge:live)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM policy)

First live run: Sat 2026-05-09 (per the post-typed-state observation gate noted in handoff_260502_llm_judge_workstream). Sat 5/9 is the second Saturday of May 2026 (5/2 was the first), so day_of_month is '09' and this run takes the Haiku-only weekly path. The first monthly Sonnet sweep, the calibration corpus PR 4 (CloudWatch metric + dashboard) will read from, falls on Sat 2026-06-06.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit 64b2439 into main May 3, 2026
1 check passed

cipher813 deleted the feat/llm-judge-sf-wiring-pr3c branch May 3, 2026 13:43

cipher813 merged 1 commit into main from feat/llm-judge-sf-wiring-pr3c
