feat(sf): PR 3c LLM-as-judge wiring on Saturday Step Function #139

Merged
cipher813 merged 1 commit into main from feat/llm-judge-sf-wiring-pr3c
May 3, 2026
@cipher813
Owner

Summary

Wires the alpha-engine-research-eval-judge Lambda (PR #91) into the Saturday Step Function. Adds two-tier sampling driven by the SF execution date — monthly Sonnet sweep on the first Saturday of each month, Haiku-only with per-artifact escalation on the others.

SF flow update

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge        (skip_eval_judge bypass → SaturdayHealthCheck)
  → ComputeEvalCadence        (Pass: extract day_of_month + eval_date from
                               $$.Execution.StartTime via intrinsic StringSplit)
  → CheckMonthlyCadence       (Choice: day_of_month < '08' lex compare)
      → EvalJudgeFirstSaturday  (Lambda: force_sonnet_pass=true)
      → EvalJudgeWeekly         (Lambda: force_sonnet_pass=false)
  → SaturdayHealthCheck → NotifyComplete

Both eval Tasks Catch: States.ALL → SaturdayHealthCheck. Eval is observability per ROADMAP §1635 — pipeline must NOT halt on eval failure.
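
The cadence and Catch wiring described above can be sketched as an ASL fragment built from a Python dict. This is a minimal sketch, not the merged state machine JSON: the state names, `force_sonnet_pass`, and the `:live` alias come from this PR, while the `$.eval_cadence`, `$.eval_judge`, and `$.eval_judge_error` result paths and the exact nesting of the intrinsics are assumptions made for illustration.

```python
import json

# $$.Execution.StartTime is an ISO 8601 timestamp such as "2026-05-09T09:00:00.000Z".
START_TIME = "$$.Execution.StartTime"
# Split on 'T' to keep the date part, then on '-' to keep the day part.
EVAL_DATE = f"States.ArrayGetItem(States.StringSplit({START_TIME}, 'T'), 0)"
DAY_OF_MONTH = f"States.ArrayGetItem(States.StringSplit({EVAL_DATE}, '-'), 2)"


def eval_task(force_sonnet_pass: bool) -> dict:
    """Lambda Task whose failures fall through to the health check."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "alpha-engine-research-eval-judge:live",
            "Payload": {
                "eval_date.$": "$.eval_cadence.eval_date",  # assumed path
                "force_sonnet_pass": force_sonnet_pass,
            },
        },
        "ResultPath": "$.eval_judge",  # assumed path
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.eval_judge_error",  # assumed path
                "Next": "SaturdayHealthCheck",
            }
        ],
        "Next": "SaturdayHealthCheck",
    }


states = {
    "ComputeEvalCadence": {
        "Type": "Pass",
        "Parameters": {"eval_date.$": EVAL_DATE, "day_of_month.$": DAY_OF_MONTH},
        "ResultPath": "$.eval_cadence",
        "Next": "CheckMonthlyCadence",
    },
    "CheckMonthlyCadence": {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.eval_cadence.day_of_month",
                "StringLessThan": "08",
                "Next": "EvalJudgeFirstSaturday",
            }
        ],
        # Default is the cheaper weekly path, so a failed match never
        # triggers the full Sonnet sweep.
        "Default": "EvalJudgeWeekly",
    },
    "EvalJudgeFirstSaturday": eval_task(force_sonnet_pass=True),
    "EvalJudgeWeekly": eval_task(force_sonnet_pass=False),
}

print(json.dumps(states, indent=2))
```

Both the success transition and the Catch point at SaturdayHealthCheck, which is what makes the eval step non-blocking in this sketch.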

Cadence rationale

The monthly Sonnet sweep runs on the first Saturday of each month. This matches ROADMAP §1626's "every 4th run" with a deterministic, stateless rule: consecutive first Saturdays are 28 or 35 days apart, so the interval stays within one week of the 4-week target. The lex-compare is correct because day_of_month is always a zero-padded 2-char string, so '01' through '07' sort below '08' and '08' through '31' do not.
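
The cadence rule can be checked offline with a few lines of Python. This sketch mirrors the Choice comparison as a plain string compare. The helper names `is_monthly_sweep` and `saturdays` are made up for illustration and do not exist in the repo.

```python
from datetime import date, timedelta


def is_monthly_sweep(day_of_month: str) -> bool:
    """Mirror of the Choice rule: lexical compare on a zero-padded day."""
    return day_of_month < "08"


def saturdays(year: int):
    """Yield every Saturday in the given year."""
    d = date(year, 1, 1)
    d += timedelta(days=(5 - d.weekday()) % 7)  # Monday=0, so Saturday=5
    while d.year == year:
        yield d
        d += timedelta(days=7)


sweeps = [d for d in saturdays(2026) if is_monthly_sweep(f"{d.day:02d}")]
gaps = {(b - a).days for a, b in zip(sweeps, sweeps[1:])}

print(len(sweeps))   # number of sweep Saturdays in the year
print(sorted(gaps))  # spacing between consecutive sweeps, in days

# Zero padding is what makes the string compare safe: an unpadded "7"
# sorts above "08" because '7' > '0'.
print(is_monthly_sweep("07"), is_monthly_sweep("7"))  # True False
```

The padding comes for free here because the day is sliced out of an ISO 8601 timestamp, which always writes two-digit days.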

Other Saturdays run Haiku-only with the per-artifact <3 escalation gate inside the Lambda — borderline artifacts still get a Sonnet pass; only the full sweep is gated by cadence.
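
The two-tier gate can be pictured roughly as follows. This is a sketch under stated assumptions, not the Lambda's actual code: only `force_sonnet_pass` and the `<3` threshold come from this PR. The function name, the injected judge callables, the numeric score type, and the choice to run Haiku first even on sweep days are all hypothetical.

```python
from typing import Callable, Dict

ESCALATION_THRESHOLD = 3  # the "<3" gate; the score scale itself is an assumption


def judge_artifact(
    artifact: str,
    haiku_judge: Callable[[str], float],
    sonnet_judge: Callable[[str], float],
    force_sonnet_pass: bool,
) -> Dict[str, object]:
    """Score one artifact, escalating to Sonnet when forced or borderline."""
    haiku_score = haiku_judge(artifact)
    escalate = force_sonnet_pass or haiku_score < ESCALATION_THRESHOLD
    result: Dict[str, object] = {"haiku_score": haiku_score, "escalated": escalate}
    if escalate:
        result["sonnet_score"] = sonnet_judge(artifact)
    return result


# Stub judges stand in for the real model calls.
def stub_haiku(artifact: str) -> float:
    return 2.0 if "weak" in artifact else 4.0


def stub_sonnet(artifact: str) -> float:
    return 3.5


weekly_ok = judge_artifact("solid thesis", stub_haiku, stub_sonnet, force_sonnet_pass=False)
weekly_low = judge_artifact("weak thesis", stub_haiku, stub_sonnet, force_sonnet_pass=False)
monthly = judge_artifact("solid thesis", stub_haiku, stub_sonnet, force_sonnet_pass=True)
print(weekly_ok["escalated"], weekly_low["escalated"], monthly["escalated"])
```

On a weekly run only the low-scoring artifact reaches Sonnet. On a sweep run every artifact does, regardless of its Haiku score.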

IAM

Adds alpha-engine-research-eval-judge* to the LambdaInvoke resource list in both deploy_step_function.sh and deploy_step_function_daily.sh. The shared-policy convention (deploy_step_function.sh:91-92) means the last script to run wins, so both must stay in sync.
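
Because the last script to run overwrites the shared policy, a small guard that both scripts carry the same resource pattern can catch drift early. The sketch below is illustrative only: the helper names, the `Sid`, and the region and account placeholders are hypothetical, and the real scripts may build the policy differently.

```python
from typing import Dict, List

EVAL_JUDGE_PATTERN = "alpha-engine-research-eval-judge*"


def lambda_invoke_statement(region: str, account_id: str, patterns: List[str]) -> dict:
    """Build an IAM statement allowing the SF role to invoke matching Lambdas."""
    return {
        "Sid": "LambdaInvoke",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": [
            f"arn:aws:lambda:{region}:{account_id}:function:{p}" for p in patterns
        ],
    }


def scripts_missing_pattern(
    script_texts: Dict[str, str], pattern: str = EVAL_JUDGE_PATTERN
) -> List[str]:
    """Return the names of deploy scripts that do not mention the pattern."""
    return [name for name, text in script_texts.items() if pattern not in text]


# Placeholder region and account ID; substitute real values when deploying.
statement = lambda_invoke_statement("us-east-1", "123456789012", [EVAL_JUDGE_PATTERN])
print(statement["Resource"][0])

# In a real check the texts would be read from the two deploy scripts.
missing = scripts_missing_pattern(
    {
        "deploy_step_function.sh": f'"...:function:{EVAL_JUDGE_PATTERN}"',
        "deploy_step_function_daily.sh": '"...:function:some-other-lambda*"',
    }
)
print(missing)
```

The trailing `*` in the pattern also covers alias-qualified ARNs such as the `:live` alias the Step Function invokes.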

Test plan

  • pytest tests/test_sf_eval_judge_wiring.py (21 tests pin transition + payload shape).
  • Full suite: 406 → 427 passing.
  • Manual JSON validation: state graph reachable; Choice defaults correct (weekly not monthly).
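  • Live verification: Sat 2026-05-09 SF run. 5/9 is the second Saturday of May (the first is 5/2), so day_of_month '09' fails the < '08' check and this run takes the Haiku-only weekly path. The first monthly Sonnet sweep, which becomes the calibration corpus for PR 4, lands on Sat 2026-06-06.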
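
The manual JSON validation in the test plan (reachability and Choice default) could be automated along these lines. This is a hedged sketch, not the contents of tests/test_sf_eval_judge_wiring.py: the `reachable_states` helper and the toy definition are made up for illustration, and the walk does not descend into Parallel or Map branches.

```python
from collections import deque
from typing import Dict, Set


def reachable_states(definition: Dict) -> Set[str]:
    """Walk Next, Default, Choices and Catch edges from StartAt."""
    states = definition["States"]
    seen: Set[str] = set()
    queue = deque([definition["StartAt"]])
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        state = states[name]
        targets = [state.get("Next"), state.get("Default")]
        targets += [c.get("Next") for c in state.get("Choices", [])]
        targets += [c.get("Next") for c in state.get("Catch", [])]
        queue.extend(t for t in targets if t)
    return seen


catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "SaturdayHealthCheck"}]
toy = {
    "StartAt": "CheckMonthlyCadence",
    "States": {
        "CheckMonthlyCadence": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.eval_cadence.day_of_month",
                    "StringLessThan": "08",
                    "Next": "EvalJudgeFirstSaturday",
                }
            ],
            "Default": "EvalJudgeWeekly",
        },
        "EvalJudgeFirstSaturday": {
            "Type": "Task", "Next": "SaturdayHealthCheck", "Catch": catch_all,
        },
        "EvalJudgeWeekly": {
            "Type": "Task", "Next": "SaturdayHealthCheck", "Catch": catch_all,
        },
        "SaturdayHealthCheck": {"Type": "Pass", "End": True},
    },
}

unreachable = set(toy["States"]) - reachable_states(toy)
print(sorted(unreachable))  # [] when every state is reachable
```

Pointing the same helper at the deployed definition, loaded with `json.load`, would turn the manual check into a regression test.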

Deploy order

  1. From alpha-engine-research: ./infrastructure/deploy.sh eval_judge (creates alpha-engine-research-eval-judge:live)
  2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM policy)

Order matters — if the SF picks up the new state before the Lambda exists, the EvalJudge state would fail-fast on first invocation (caught by the non-blocking Catch, but unnecessary noise in CloudWatch).

Out of scope (PR 4+)

  • CloudWatch metric AlphaEngine/Eval/agent_quality_score + SNS rolling-4-week-mean alarm (ROADMAP §1634).
  • Streamlit dashboard quality-trend page + prompt-version → quality-score correlation chart (ROADMAP §1632–1633).
  • Human cross-validation writeup (ROADMAP §1627 calibration record).

🤖 Generated with Claude Code

Inserts the eval-judge state between Backtester success and the
existing SaturdayHealthCheck, with two-tier sampling driven by the
SF execution date.

Saturday SF flow update:

  CheckBacktesterStatus (Success)
    → CheckSkipEvalJudge (skip-flag bypass to health check)
    → ComputeEvalCadence (Pass — extracts day_of_month + eval_date
                           from $$.Execution.StartTime via intrinsic
                           StringSplit)
    → CheckMonthlyCadence (Choice — day_of_month < '08' lex compare
                            ⇒ first Saturday of the month)
        → EvalJudgeFirstSaturday (Lambda: force_sonnet_pass=true)
        → EvalJudgeWeekly        (Lambda: force_sonnet_pass=false)
    → SaturdayHealthCheck → NotifyComplete

Both eval Tasks Catch States.ALL → SaturdayHealthCheck (eval is
observability per ROADMAP §1635 — pipeline must NOT halt on eval
failure, including infra-level Lambda errors).

Cadence: monthly Sonnet sweep on the first Saturday of each month
(per ROADMAP §1626 every-4th-run cadence — first Saturday is
deterministic, requires no state, and approximates ~4-week intervals
to within one week). Other Saturdays run Haiku-only with the per-
artifact <3 escalation gate inside the Lambda. The lex compare on
zero-padded '01'-'07' < '08' is correct because all values are
2-char strings.

IAM update: alpha-engine-eventbridge-sfn-role policy gains
alpha-engine-research-eval-judge* under LambdaInvoke. Updated in BOTH
deploy_step_function.sh and deploy_step_function_daily.sh because
the comment at deploy_step_function.sh:91-92 documents the shared-
policy convention — last deploy script to run wins, so they must
stay in sync.

Tests: tests/test_sf_eval_judge_wiring.py (21 tests) pins the wiring
shape — backtester success transition, skip-gate behavior, cadence
computation, Choice default = weekly (not monthly — would otherwise
silently ship every weekly run on the more expensive Sonnet sweep),
Lambda payload contract, non-blocking failure semantics. Test surface
406 → 427.

Deploy order owed:
  1. From alpha-engine-research: ./infrastructure/deploy.sh eval_judge
     (creates alpha-engine-research-eval-judge:live)
  2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh
     (updates SF JSON + IAM policy)

First live run: Sat 2026-05-09 (per the post-typed-state observation
gate noted in handoff_260502_llm_judge_workstream). Sat 5/9 is the
second Saturday of May (the first is 5/2), so day_of_month '09' fails
the < '08' check and the first live run takes the Haiku-only weekly
path. The first monthly Sonnet sweep, the calibration corpus PR 4
(CloudWatch metric + dashboard) will read from, lands on Sat
2026-06-06.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 64b2439 into main May 3, 2026
1 check passed
@cipher813 cipher813 deleted the feat/llm-judge-sf-wiring-pr3c branch May 3, 2026 13:43