feat(research): shell-run dry path for eval-judge chain + rationale clustering (closes SF skip-exceptions) by cipher813 · Pull Request #202 · cipher813/alpha-engine-research

cipher813 · 2026-05-18T21:53:38Z

Summary

Closes the two Saturday-SF Friday-PM shell-run keystone skip-exceptions (alpha-engine-data #260): skip_eval_judge + skip_rationale_clustering were hard-skipped under the Friday shell_run because no clean no-write dry path existed — so their import/bootstrap was never exercised on the Friday smoke.

- eval_judge_submit always calls _persist_client_side_skips (an S3 put) and creates an Anthropic Message Batch.
- rationale_clustering's _persist_analysis S3 put_object was not gated by its existing dry_run flag (that flag only suppressed the CW metric).

Event flag (verbatim — the SF rewire passes this exact key)

dry_run_llm

Matches Research's existing convention (step_function.json: "dry_run_llm.$": "$.research_dry"). Single canonical shared substrate in evals/lambda_dry.py — no per-handler bespoke logic, no copy-paste, not coupled to dry_run.py (that installs LangGraph agent stubs, irrelevant here).
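A minimal sketch of what a shared dry-path helper could look like. The actual contents of evals/lambda_dry.py are not shown in this PR, so the function name `is_dry`, the constant names, and the strict `is True` check are assumptions made for illustration; only the `dry_run_llm` key and the `dry-run-no-batch` sentinel come from the description.

```python
from typing import Any, Mapping, Optional

DRY_FLAG = "dry_run_llm"           # event key named in this PR
DRY_BATCH_ID = "dry-run-no-batch"  # sentinel batch_id named in this PR


def is_dry(event: Optional[Mapping[str, Any]]) -> bool:
    """Hypothetical helper: True when the event requests the no-write dry path.

    Assumption: the flag arrives as a JSON boolean, so only a literal True
    counts. A threaded sentinel batch_id is also treated as dry.
    """
    if not event:
        return False
    return event.get(DRY_FLAG) is True or event.get("batch_id") == DRY_BATCH_ID


if __name__ == "__main__":
    print(is_dry({"dry_run_llm": True}))             # flag set
    print(is_dry({"batch_id": "dry-run-no-batch"}))  # sentinel threaded
    print(is_dry({"batch_id": "msgbatch_abc"}))      # ordinary event
```

Keeping the check in one function means each handler asks the same question the same way, which is the "no per-handler bespoke logic" property the description claims.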

Per-handler short-circuit points

Boot + module-level imports run for real first (the keystone's whole point), then handler() returns before any external call:

| Handler | Returns before | Dry result |
|---|---|---|
| `eval_judge_submit_handler` | `build_batch_plan` / `_persist_client_side_skips` (S3 put) / `submit_batch` (Anthropic Batch create) | `{status:"EMPTY", batch_id:"dry-run-no-batch", processing_status:"ended_empty", plan_s3_key:None}` |
| `eval_judge_poll_handler` | `anthropic.Anthropic` / `poll_batch` | terminal `processing_status:"ended"` |
| `eval_judge_process_handler` | `anthropic.Anthropic` / `process_batch_results` (S3 plan `get_object` + results stream + per-artifact persist + CW) | `status:"OK"` |
| `rationale_clustering_handler` | `compute_and_emit` (decision_artifacts/ read + `_analysis/` S3 persist + CW emit), after the `evals.rationale_clustering` import | `status:"OK"` |
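The short-circuit pattern in the table can be sketched for the submit handler as follows. This is an illustration under stated assumptions, not the repository's code: `build_batch_plan` and `submit_batch` are local stand-ins that only record that they ran, and the non-dry return value (`"SUBMITTED"`, `"batch_123"`) is hypothetical. The dry result dict is the one quoted in the table.

```python
from typing import Any, Dict, List

# Dry result quoted in the table above.
DRY_SUBMIT_RESULT: Dict[str, Any] = {
    "status": "EMPTY",
    "batch_id": "dry-run-no-batch",
    "processing_status": "ended_empty",
    "plan_s3_key": None,
}

calls: List[str] = []  # records which external steps ran (illustration only)


def build_batch_plan(event: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real planner; records the call."""
    calls.append("build_batch_plan")
    return {"requests": []}


def submit_batch(plan: Dict[str, Any]) -> str:
    """Stand-in for the Anthropic Batch create; records the call."""
    calls.append("submit_batch")
    return "batch_123"  # hypothetical id


def submit_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    # Module-level imports have already run by the time this executes, so a
    # broken import would still fail the invocation before reaching here.
    if event.get("dry_run_llm"):
        return dict(DRY_SUBMIT_RESULT)  # return before any external call
    plan = build_batch_plan(event)
    batch_id = submit_batch(plan)
    return {"status": "SUBMITTED", "batch_id": batch_id}  # hypothetical shape
```

Placing the check as the first statement of the handler body, rather than at module import time, is what keeps boot and import coverage intact on the dry pass.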

Dry sentinel threading submit → poll → process

submit returns batch_id="dry-run-no-batch" + status="EMPTY". The existing EvalJudgePollChoice SF Choice routes status == "EMPTY" straight to EvalJudgeProcess, skipping the poll loop entirely. Poll + Process also treat the threaded sentinel batch_id as dry defensively (belt-and-braces if the SF wiring reaches Poll).
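The threading can be modeled in a few lines. This sketch simulates the Choice routing in plain Python rather than in the state machine definition; the state name `EvalJudgePoll` and the handler bodies are assumptions, while `EvalJudgeProcess`, the `EMPTY` status, and the sentinel value come from the description.

```python
from typing import Any, Dict

DRY_BATCH_ID = "dry-run-no-batch"  # sentinel from the description


def next_state_after_submit(submit_output: Dict[str, Any]) -> str:
    """Models the Choice: EMPTY skips the poll loop and goes to Process."""
    if submit_output.get("status") == "EMPTY":
        return "EvalJudgeProcess"
    return "EvalJudgePoll"  # assumed name of the polling state


def _is_dry(event: Dict[str, Any]) -> bool:
    # Either the explicit flag or the threaded sentinel marks the event dry.
    return bool(event.get("dry_run_llm")) or event.get("batch_id") == DRY_BATCH_ID


def poll_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    if _is_dry(event):
        # Defensive: a terminal status ends the poll loop if it is reached.
        return {"batch_id": event.get("batch_id"), "processing_status": "ended"}
    raise RuntimeError("real poll path is not modeled in this sketch")


def process_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    if _is_dry(event):
        return {"status": "OK"}
    raise RuntimeError("real process path is not modeled in this sketch")
```

With both checks in place, the dry run terminates cleanly whether the state machine routes around Poll or passes through it.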

No-Anthropic / no-write proof

tests/test_eval_judge_shell_run_dry.py asserts, per handler under dry_run_llm:

- anthropic.Anthropic not called (assert_not_called)
- boto3.client / process_batch_results / compute_and_emit / submit_batch / _persist_client_side_skips / poll_batch / build_batch_plan not called → zero S3 put/get, zero CW
- success status returned (EMPTY/ended/OK)
- boot-still-runs smoke: the dry short-circuit is inside handler() after module imports, so a broken import still fails the dry pass
- legacy rationale dry_run flag behavior preserved (still runs compute_and_emit, only suppresses CW); only the new dry_run_llm is the full short-circuit, and the non-dry path still takes the real track
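The style of assertion listed above can be sketched with `unittest.mock`. The real tests in tests/test_eval_judge_shell_run_dry.py are not reproduced in this PR text, so the handler factory below is a hypothetical stand-in that takes its dependencies as arguments; the real suite presumably patches the module attributes instead.

```python
from unittest.mock import MagicMock


def make_poll_handler(anthropic_factory, poll_batch):
    """Hypothetical stand-in for the poll handler, with injected dependencies."""
    def handler(event, context=None):
        if event.get("dry_run_llm"):
            return {"processing_status": "ended"}  # dry: no client is built
        client = anthropic_factory()
        return poll_batch(client, event["batch_id"])
    return handler


def test_poll_dry_makes_no_external_calls():
    anthropic_factory = MagicMock(name="anthropic.Anthropic")
    poll_batch = MagicMock(name="poll_batch")
    handler = make_poll_handler(anthropic_factory, poll_batch)

    result = handler({"dry_run_llm": True, "batch_id": "dry-run-no-batch"})

    anthropic_factory.assert_not_called()
    poll_batch.assert_not_called()
    assert result["processing_status"] == "ended"


def test_poll_non_dry_takes_real_track():
    anthropic_factory = MagicMock(name="anthropic.Anthropic")
    poll_batch = MagicMock(name="poll_batch", return_value={"processing_status": "in_progress"})
    handler = make_poll_handler(anthropic_factory, poll_batch)

    result = handler({"batch_id": "msgbatch_abc"})

    anthropic_factory.assert_called_once()
    poll_batch.assert_called_once()
    assert result == {"processing_status": "in_progress"}
```

Pairing each dry test with a non-dry counterpart guards against a short-circuit that fires unconditionally.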

Tests / suite

+14 tests. Full research suite: 1357 passed (1343 → 1357). No new deps, no secrets/prompts committed.

Follow-on (separate repo)

The SF rewire to route EvalJudge* / RationaleClustering → dry (pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge / skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a separate alpha-engine-data follow-on — not in this PR.

🤖 Generated with Claude Code

…lustering (closes SF skip-exceptions)

The Saturday SF Friday-PM shell-run keystone (alpha-engine-data #260) hard-SKIPPED EvalJudge* + RationaleClustering (skip_eval_judge / skip_rationale_clustering) because no clean no-write dry path existed: the submit handler always _persist_client_side_skips + creates an Anthropic Message Batch, and rationale_clustering's _persist_analysis S3 put_object was NOT gated by its existing dry_run flag (that flag only suppressed the CW metric). Their import/bootstrap was therefore never exercised on the Friday smoke.

Event flag (verbatim; the SF rewire passes this exact key): dry_run_llm

Matches Research's existing convention (step_function.json: "dry_run_llm.$": "$.research_dry"). Single canonical shared substrate in evals/lambda_dry.py: no per-handler bespoke logic, no copy-paste, not coupled to dry_run.py (that installs LangGraph agent stubs, irrelevant here).

Per-handler short-circuit (boot + imports run for real FIRST, which is the keystone's whole point, then return before any external call):

- eval_judge_submit: returns dry sentinel {status:"EMPTY", batch_id:"dry-run-no-batch", processing_status:"ended_empty", plan_s3_key:None} BEFORE build_batch_plan / _persist_client_side_skips (S3 put) / submit_batch (Anthropic Batch create). status=EMPTY makes the existing EvalJudgePollChoice route straight to Process, skipping the poll loop entirely.
- eval_judge_poll: detects sentinel/flag, returns terminal processing_status:"ended" BEFORE anthropic.Anthropic / poll_batch.
- eval_judge_process: detects sentinel/flag, returns status:"OK" BEFORE anthropic.Anthropic / process_batch_results (which does the S3 plan get_object + results stream + per-artifact persist + CW).
- rationale_clustering: returns status:"OK" BEFORE compute_and_emit (decision_artifacts/ read + _analysis/ S3 persist + CW emit), after the evals.rationale_clustering import.

Sentinel threading: submit's "dry-run-no-batch" batch_id + status=EMPTY flows submit -> (poll skipped by SF) -> process; poll/process also treat the threaded sentinel batch_id as dry defensively.

No-Anthropic/no-write proof: tests assert anthropic.Anthropic not-called, boto3.client/process_batch_results/compute_and_emit not-called, no S3 put/get, success status returned; plus a boot-still-runs smoke proving the dry short-circuit is inside handler() after module imports. Legacy rationale `dry_run` flag behavior preserved (still runs compute, only suppresses CW); only the new dry_run_llm is the full short-circuit.

Tests: +14 (tests/test_eval_judge_shell_run_dry.py). Full research suite 1357 passed (1343 -> 1357). No new deps, no secrets/prompts.

The SF rewire to route EvalJudge*/RationaleClustering states -> dry (pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge / skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a separate alpha-engine-data follow-on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit af0f481 into main May 18, 2026
1 check passed

cipher813 deleted the feat/eval-judge-rationale-shell-run-dry branch May 18, 2026 22:12

cipher813 merged 1 commit into main from feat/eval-judge-rationale-shell-run-dry
