
feat(research): shell-run dry path for eval-judge chain + rationale clustering (closes SF skip-exceptions) #202

Merged
cipher813 merged 1 commit into main from feat/eval-judge-rationale-shell-run-dry
May 18, 2026

Conversation

@cipher813
Owner

Summary

Closes the two Saturday-SF Friday-PM shell-run keystone skip-exceptions (alpha-engine-data #260). The EvalJudge* and RationaleClustering states were hard-skipped under the Friday shell_run (via skip_eval_judge and skip_rationale_clustering) because no clean no-write dry path existed, so their import/bootstrap was never exercised on the Friday smoke. Two things blocked a dry path:

  • eval_judge_submit always calls _persist_client_side_skips (S3 put) and creates an Anthropic Message Batch.
  • rationale_clustering's _persist_analysis S3 put_object was not gated by its existing dry_run flag (that flag only suppressed the CW metric).

Event flag (verbatim — the SF rewire passes this exact key)

dry_run_llm

Matches Research's existing convention (step_function.json: "dry_run_llm.$": "$.research_dry"). Single canonical shared substrate in evals/lambda_dry.py — no per-handler bespoke logic, no copy-paste, not coupled to dry_run.py (that installs LangGraph agent stubs, irrelevant here).
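For illustration, a shared flag check of the kind described above might look like the following. This is a minimal sketch, not the contents of evals/lambda_dry.py: the helper name `is_dry_run_llm` and the acceptance of string values are assumptions; only the event key `dry_run_llm` comes from this PR.

```python
from typing import Any, Mapping, Optional

DRY_RUN_LLM_KEY = "dry_run_llm"  # event key named in this PR


def is_dry_run_llm(event: Optional[Mapping[str, Any]]) -> bool:
    """Hypothetical helper: True when the event asks for the full dry short-circuit."""
    if not event:
        return False
    value = event.get(DRY_RUN_LLM_KEY, False)
    # Assumption: tolerate string values as well as JSON booleans.
    if isinstance(value, str):
        return value.strip().lower() in {"true", "1", "yes"}
    return bool(value)
```

Keeping the check in one function means each handler makes the same decision from the same key, and the legacy `dry_run` flag is deliberately not consulted.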

Per-handler short-circuit points

Boot + module-level imports run for real first (the keystone's whole point), then handler() returns before any external call:

| Handler | Returns before | Dry result |
| --- | --- | --- |
| eval_judge_submit_handler | build_batch_plan / _persist_client_side_skips (S3 put) / submit_batch (Anthropic Batch create) | {status:"EMPTY", batch_id:"dry-run-no-batch", processing_status:"ended_empty", plan_s3_key:None} |
| eval_judge_poll_handler | anthropic.Anthropic / poll_batch | terminal processing_status:"ended" |
| eval_judge_process_handler | anthropic.Anthropic / process_batch_results (S3 plan get_object + results stream + per-artifact persist + CW) | status:"OK" |
| rationale_clustering_handler | compute_and_emit (decision_artifacts/ read + _analysis/ S3 persist + CW emit), after the evals.rationale_clustering import | status:"OK" |
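The short-circuit shape for the submit handler can be sketched as below. The stand-in functions, the `calls` list, and the non-dry `"SUBMITTED"` status are assumptions made so the example runs on its own; the dry result dict and the ordering (return before build_batch_plan and submit_batch) follow the description above.

```python
from typing import Any, Dict, List

DRY_BATCH_ID = "dry-run-no-batch"  # sentinel batch_id from this PR

calls: List[str] = []  # records which external steps ran (illustration only)


def build_batch_plan(event: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real planner."""
    calls.append("build_batch_plan")
    return {"requests": []}


def submit_batch(plan: Dict[str, Any]) -> str:
    """Stand-in for the Anthropic Message Batch create."""
    calls.append("submit_batch")
    return "batch-real"


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    # Module-level imports have already executed by this point, so a broken
    # import still fails the invocation even when dry_run_llm is set.
    if event.get("dry_run_llm"):
        return {
            "status": "EMPTY",
            "batch_id": DRY_BATCH_ID,
            "processing_status": "ended_empty",
            "plan_s3_key": None,
        }
    plan = build_batch_plan(event)
    batch_id = submit_batch(plan)
    # "SUBMITTED" is a placeholder status for the non-dry path.
    return {"status": "SUBMITTED", "batch_id": batch_id}
```

Calling `handler({"dry_run_llm": True})` returns the sentinel dict and leaves `calls` empty, while `handler({})` runs both stand-ins.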

Dry sentinel threading submit → poll → process

submit returns batch_id="dry-run-no-batch" + status="EMPTY". The existing EvalJudgePollChoice SF Choice routes status == "EMPTY" straight to EvalJudgeProcess, skipping the poll loop entirely. Poll + Process also treat the threaded sentinel batch_id as dry defensively (belt-and-braces if the SF wiring reaches Poll).
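The threading described above can be simulated in plain Python. This is a sketch under stated assumptions: the state name `EvalJudgePoll`, the helper names, and the error raised on the non-dry path are hypothetical; the sentinel value, the `status == "EMPTY"` rule, and the dry return values come from this PR.

```python
from typing import Any, Dict

DRY_BATCH_ID = "dry-run-no-batch"  # sentinel batch_id from this PR


def is_dry(event: Dict[str, Any]) -> bool:
    """Dry if the flag is set or the threaded sentinel batch_id is present."""
    return bool(event.get("dry_run_llm")) or event.get("batch_id") == DRY_BATCH_ID


def next_state_after_submit(submit_result: Dict[str, Any]) -> str:
    """Mimics the EvalJudgePollChoice rule: EMPTY skips the poll loop."""
    if submit_result.get("status") == "EMPTY":
        return "EvalJudgeProcess"
    return "EvalJudgePoll"  # assumed name for the poll state


def poll_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    if is_dry(event):
        return {"batch_id": event.get("batch_id"), "processing_status": "ended"}
    raise RuntimeError("real poll path omitted from this sketch")


def process_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    if is_dry(event):
        return {"status": "OK"}
    raise RuntimeError("real process path omitted from this sketch")
```

Because `is_dry` also checks the sentinel batch_id, Poll and Process stay dry even if the flag itself is not forwarded to them, which is the defensive behavior the paragraph above describes.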

No-Anthropic / no-write proof

tests/test_eval_judge_shell_run_dry.py asserts, per handler under dry_run_llm:

  • anthropic.Anthropic not called (assert_not_called)
  • boto3.client / process_batch_results / compute_and_emit / submit_batch / _persist_client_side_skips / poll_batch / build_batch_plan not called → zero S3 put/get, zero CW
  • success status returned (EMPTY/ended/OK)
  • boot-still-runs smoke: the dry short-circuit is inside handler() after module imports, so a broken import still fails the dry pass
  • legacy rationale dry_run flag behavior preserved (still runs compute_and_emit, only suppresses CW) — only the new dry_run_llm is the full short-circuit; non-dry path still takes the real track
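The style of assertion listed above can be sketched with `unittest.mock`. This is not the contents of tests/test_eval_judge_shell_run_dry.py: the handler body and the `dry_run=` keyword passed to the stand-in are assumptions, and `compute_and_emit` is replaced by a mock so the example runs without AWS or Anthropic access.

```python
from typing import Any, Dict
from unittest.mock import MagicMock

# Stand-in for evals.rationale_clustering.compute_and_emit.
compute_and_emit = MagicMock(name="compute_and_emit")


def rationale_clustering_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    # New flag: full short-circuit before any read, persist, or metric emit.
    if event.get("dry_run_llm"):
        return {"status": "OK"}
    # Legacy flag: compute still runs; only the CW metric would be suppressed.
    compute_and_emit(dry_run=bool(event.get("dry_run")))
    return {"status": "OK"}


def test_dry_run_llm_short_circuits() -> None:
    compute_and_emit.reset_mock()
    result = rationale_clustering_handler({"dry_run_llm": True})
    assert result == {"status": "OK"}
    compute_and_emit.assert_not_called()


def test_legacy_dry_run_still_computes() -> None:
    compute_and_emit.reset_mock()
    rationale_clustering_handler({"dry_run": True})
    compute_and_emit.assert_called_once_with(dry_run=True)
```

The two tests pin down the distinction this PR relies on: `dry_run_llm` skips the work entirely, while the legacy `dry_run` flag still takes the compute path.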

Tests / suite

+14 tests. Full research suite: 1357 passed (1343 → 1357). No new deps, no secrets/prompts committed.

Follow-on (separate repo)

The SF rewire to route EvalJudge* / RationaleClustering → dry (pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge / skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a separate alpha-engine-data follow-on — not in this PR.

🤖 Generated with Claude Code

…lustering (closes SF skip-exceptions)

The Saturday SF Friday-PM shell-run keystone (alpha-engine-data #260)
hard-SKIPPED EvalJudge* + RationaleClustering (skip_eval_judge /
skip_rationale_clustering) because no clean no-write dry path existed:
the submit handler always _persist_client_side_skips + creates an
Anthropic Message Batch, and rationale_clustering's _persist_analysis
S3 put_object was NOT gated by its existing dry_run flag (that flag
only suppressed the CW metric). Their import/bootstrap was therefore
never exercised on the Friday smoke.

Event flag (verbatim — the SF rewire passes this exact key):
  dry_run_llm
Matches Research's existing convention (step_function.json:
"dry_run_llm.$": "$.research_dry"). Single canonical shared substrate
in evals/lambda_dry.py — no per-handler bespoke logic, no copy-paste,
not coupled to dry_run.py (that installs LangGraph agent stubs,
irrelevant here).

Per-handler short-circuit (boot + imports run for real FIRST — the
keystone's whole point — then return before any external call):
- eval_judge_submit: returns dry sentinel
  {status:"EMPTY", batch_id:"dry-run-no-batch",
   processing_status:"ended_empty", plan_s3_key:None} BEFORE
  build_batch_plan / _persist_client_side_skips (S3 put) /
  submit_batch (Anthropic Batch create). status=EMPTY makes the
  existing EvalJudgePollChoice route straight to Process, skipping the
  poll loop entirely.
- eval_judge_poll: detects sentinel/flag, returns terminal
  processing_status:"ended" BEFORE anthropic.Anthropic / poll_batch.
- eval_judge_process: detects sentinel/flag, returns status:"OK"
  BEFORE anthropic.Anthropic / process_batch_results (which does the
  S3 plan get_object + results stream + per-artifact persist + CW).
- rationale_clustering: returns status:"OK" BEFORE compute_and_emit
  (decision_artifacts/ read + _analysis/ S3 persist + CW emit), after
  the evals.rationale_clustering import.

Sentinel threading: submit's "dry-run-no-batch" batch_id + status=EMPTY
flows submit -> (poll skipped by SF) -> process; poll/process also
treat the threaded sentinel batch_id as dry defensively.

No-Anthropic/no-write proof: tests assert anthropic.Anthropic
not-called, boto3.client/process_batch_results/compute_and_emit
not-called, no S3 put/get, success status returned; plus a boot-still-
runs smoke proving the dry short-circuit is inside handler() after
module imports. Legacy rationale `dry_run` flag behavior preserved
(still runs compute, only suppresses CW) — only the new dry_run_llm is
the full short-circuit.

Tests: +14 (tests/test_eval_judge_shell_run_dry.py). Full research
suite 1357 passed (1343 -> 1357). No new deps, no secrets/prompts.

The SF rewire to route EvalJudge*/RationaleClustering states -> dry
(pass "dry_run_llm.$": "$.research_dry", drop the skip_eval_judge /
skip_rationale_clustering hard-skips from ApplyShellRunDefaults) is a
separate alpha-engine-data follow-on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit af0f481 into main May 18, 2026
1 check passed
@cipher813 cipher813 deleted the feat/eval-judge-rationale-shell-run-dry branch May 18, 2026 22:12