feat(research): shell-run dry path for eval-judge chain + rationale clustering (closes SF skip-exceptions) #202
Merged
The Saturday SF Friday-PM shell-run keystone (alpha-engine-data #260)
hard-SKIPPED EvalJudge* + RationaleClustering (skip_eval_judge /
skip_rationale_clustering) because no clean no-write dry path existed:
the submit handler always calls _persist_client_side_skips and creates
an Anthropic Message Batch, and rationale_clustering's _persist_analysis
S3 put_object was NOT gated by its existing dry_run flag (that flag
only suppressed the CW metric). Their import/bootstrap was therefore
never exercised on the Friday smoke.
Event flag (verbatim — the SF rewire passes this exact key):
dry_run_llm
Matches Research's existing convention (step_function.json:
"dry_run_llm.$": "$.research_dry"). Single canonical shared substrate
in evals/lambda_dry.py — no per-handler bespoke logic, no copy-paste,
not coupled to dry_run.py (that installs LangGraph agent stubs,
irrelevant here).
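
A minimal sketch of what a shared flag check could look like, assuming
the flag arrives as a JSON boolean. The helper name is_dry_run_llm and
the constant name are hypothetical; only the dry_run_llm key itself
comes from this PR, and the real evals/lambda_dry.py may differ.

```python
from typing import Any, Mapping, Optional

DRY_RUN_LLM_KEY = "dry_run_llm"  # event key named in this PR


def is_dry_run_llm(event: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the event carries dry_run_llm set to boolean True.

    Assumption: the flag is a JSON boolean, so strings such as "true"
    are deliberately not treated as dry in this sketch.
    """
    if not event:
        return False
    return event.get(DRY_RUN_LLM_KEY) is True
```

Keeping the check in one function means each handler can ask the same
question the same way, which is the "no per-handler bespoke logic" goal
stated above.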
Per-handler short-circuit (boot + imports run for real FIRST — the
keystone's whole point — then return before any external call):
- eval_judge_submit: returns dry sentinel
{status:"EMPTY", batch_id:"dry-run-no-batch",
processing_status:"ended_empty", plan_s3_key:None} BEFORE
build_batch_plan / _persist_client_side_skips (S3 put) /
submit_batch (Anthropic Batch create). status=EMPTY makes the
existing EvalJudgePollChoice route straight to Process, skipping the
poll loop entirely.
- eval_judge_poll: detects sentinel/flag, returns terminal
processing_status:"ended" BEFORE anthropic.Anthropic / poll_batch.
- eval_judge_process: detects sentinel/flag, returns status:"OK"
BEFORE anthropic.Anthropic / process_batch_results (which does the
S3 plan get_object + results stream + per-artifact persist + CW).
- rationale_clustering: returns status:"OK" BEFORE compute_and_emit
(decision_artifacts/ read + _analysis/ S3 persist + CW emit), after
the evals.rationale_clustering import.
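
The short-circuit ordering for the submit handler can be sketched as
below. This is an illustration, not the repository's code: the three
helpers are local stand-ins that only record that they were reached,
and the non-dry return shape is a placeholder. Only the dry sentinel
dict is quoted from this PR.

```python
from typing import Any, Dict, List

CALLS: List[str] = []  # records which external steps this sketch reached


def build_batch_plan(event: Dict[str, Any]) -> Dict[str, Any]:
    CALLS.append("build_batch_plan")
    return {"requests": []}


def persist_client_side_skips(plan: Dict[str, Any]) -> None:
    CALLS.append("persist_client_side_skips")  # stands in for the S3 put


def submit_batch(plan: Dict[str, Any]) -> str:
    CALLS.append("submit_batch")  # stands in for the Anthropic Batch create
    return "batch-placeholder"


def submit_handler(event: Dict[str, Any]) -> Dict[str, Any]:
    if event.get("dry_run_llm") is True:
        # Dry sentinel quoted from the PR; returned before any helper runs.
        return {
            "status": "EMPTY",
            "batch_id": "dry-run-no-batch",
            "processing_status": "ended_empty",
            "plan_s3_key": None,
        }
    plan = build_batch_plan(event)
    persist_client_side_skips(plan)
    # Placeholder shape for the real path; not taken from the PR.
    return {"status": "SUBMITTED", "batch_id": submit_batch(plan)}
```

Calling submit_handler({"dry_run_llm": True}) leaves CALLS empty, while
submit_handler({}) reaches all three helpers in order.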
Sentinel threading: submit's "dry-run-no-batch" batch_id + status=EMPTY
flows submit -> (poll skipped by SF) -> process; poll/process also
treat the threaded sentinel batch_id as dry defensively.
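
The threading described above can be modelled in a few lines. This is a
sketch under assumptions: route_after_submit imitates the Choice state
in plain Python rather than in the state machine definition, the
"EvalJudgePoll" state name is assumed, and looks_dry is a hypothetical
name for the defensive check in poll/process.

```python
from typing import Any, Dict

DRY_BATCH_ID = "dry-run-no-batch"  # sentinel batch_id named in this PR


def route_after_submit(submit_out: Dict[str, Any]) -> str:
    """Model of the Choice: EMPTY goes straight to Process, else poll."""
    if submit_out.get("status") == "EMPTY":
        return "EvalJudgeProcess"
    return "EvalJudgePoll"  # assumed state name


def looks_dry(event: Dict[str, Any]) -> bool:
    """Defensive check: the explicit flag OR the threaded sentinel batch_id."""
    return event.get("dry_run_llm") is True or event.get("batch_id") == DRY_BATCH_ID
```

Because the sentinel travels in batch_id, a downstream handler can
still recognise a dry run even if the flag is dropped between states.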
No-Anthropic/no-write proof: tests assert anthropic.Anthropic
not-called, boto3.client/process_batch_results/compute_and_emit
not-called, no S3 put/get, success status returned; plus a boot-still-
runs smoke proving the dry short-circuit is inside handler() after
module imports. Legacy rationale `dry_run` flag behavior preserved
(still runs compute, only suppresses CW) — only the new dry_run_llm is
the full short-circuit.
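
The difference between the two flags can be checked with
unittest.mock, roughly as follows. This is a hedged sketch, not a copy
of tests/test_eval_judge_shell_run_dry.py: compute_and_emit is replaced
by a MagicMock, and the dry_run keyword it receives is an assumed
signature.

```python
from typing import Any, Dict
from unittest.mock import MagicMock

# Stand-in for the real compute_and_emit; the keyword argument is assumed.
compute_and_emit = MagicMock(name="compute_and_emit")


def rationale_clustering_handler(event: Dict[str, Any]) -> Dict[str, str]:
    if event.get("dry_run_llm") is True:
        # New flag: full short-circuit, no read, no S3 persist, no CW emit.
        return {"status": "OK"}
    # Legacy flag: compute still runs; dry_run is only passed through.
    compute_and_emit(dry_run=bool(event.get("dry_run", False)))
    return {"status": "OK"}


def test_dry_run_llm_skips_compute() -> None:
    compute_and_emit.reset_mock()
    assert rationale_clustering_handler({"dry_run_llm": True}) == {"status": "OK"}
    compute_and_emit.assert_not_called()


def test_legacy_dry_run_still_computes() -> None:
    compute_and_emit.reset_mock()
    assert rationale_clustering_handler({"dry_run": True}) == {"status": "OK"}
    compute_and_emit.assert_called_once_with(dry_run=True)
```

Both tests return the same success status, so the assertion on the mock
is what distinguishes the full short-circuit from the legacy path.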
Tests: +14 (tests/test_eval_judge_shell_run_dry.py). Full research
suite 1357 passed (1343 -> 1357). No new deps, no secrets/prompts.
The SF rewire to route EvalJudge*/RationaleClustering states -> dry
(pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge /
skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a
separate alpha-engine-data follow-on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes the two Saturday-SF Friday-PM shell-run keystone skip-exceptions (alpha-engine-data #260):
skip_eval_judge + skip_rationale_clustering were hard-skipped under the Friday shell_run because no clean no-write dry path existed, so their import/bootstrap was never exercised on the Friday smoke.
- The eval-judge submit handler always runs _persist_client_side_skips (S3 put) + creates an Anthropic Message Batch.
- rationale_clustering's _persist_analysis S3 put_object was not gated by its existing dry_run flag (that flag only suppressed the CW metric).
Event flag (verbatim; the SF rewire passes this exact key)
dry_run_llm
Matches Research's existing convention (step_function.json: "dry_run_llm.$": "$.research_dry"). Single canonical shared substrate in evals/lambda_dry.py: no per-handler bespoke logic, no copy-paste, not coupled to dry_run.py (that installs LangGraph agent stubs, irrelevant here).
Per-handler short-circuit points
Boot + module-level imports run for real first (the keystone's whole point), then handler() returns before any external call:
- eval_judge_submit_handler: returns {status:"EMPTY", batch_id:"dry-run-no-batch", processing_status:"ended_empty", plan_s3_key:None} before build_batch_plan / _persist_client_side_skips (S3 put) / submit_batch (Anthropic Batch create).
- eval_judge_poll_handler: returns processing_status:"ended" before anthropic.Anthropic / poll_batch.
- eval_judge_process_handler: returns status:"OK" before anthropic.Anthropic / process_batch_results (S3 plan get_object + results stream + per-artifact persist + CW).
- rationale_clustering_handler: returns status:"OK" before compute_and_emit (decision_artifacts/ read + _analysis/ S3 persist + CW emit), after the evals.rationale_clustering import.
Dry sentinel threading submit → poll → process
submit returns batch_id="dry-run-no-batch" + status="EMPTY". The existing EvalJudgePollChoice SF Choice routes status == "EMPTY" straight to EvalJudgeProcess, skipping the poll loop entirely. Poll + Process also treat the threaded sentinel batch_id as dry defensively (belt-and-braces if the SF wiring reaches Poll).
No-Anthropic / no-write proof
tests/test_eval_judge_shell_run_dry.py asserts, per handler under dry_run_llm:
- anthropic.Anthropic not called (assert_not_called)
- boto3.client / process_batch_results / compute_and_emit / submit_batch / _persist_client_side_skips / poll_batch / build_batch_plan not called → zero S3 put/get, zero CW
- success status returned (EMPTY / ended / OK)
- boot-still-runs smoke: the dry short-circuit is inside handler() after module imports, so a broken import still fails the dry pass
- legacy rationale dry_run flag behavior preserved (still runs compute_and_emit, only suppresses CW); only the new dry_run_llm is the full short-circuit; non-dry path still takes the real track
Tests / suite
+14 tests. Full research suite: 1357 passed (1343 → 1357). No new deps, no secrets/prompts committed.
Follow-on (separate repo)
The SF rewire to route EvalJudge* / RationaleClustering → dry (pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge / skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a separate alpha-engine-data follow-on, not in this PR.
🤖 Generated with Claude Code