fix(plan): add tier-model failover to plan step executor (#1448)#1450
Conversation
Plan steps with a tier annotation now iterate through the tier's model list on failover-eligible failures (HTTP 400/429, rate limit, quota exhaustion). Previously, resolve_step_tool collapsed the tier into a single (tool, model_spec) pair and execute_csa_step tried only that one model — gemini-cli returning 400 caused the entire workflow to abort. Changes: - StepTarget::CsaTool gains tier_name field so the executor can build the full fallback chain via ordered_tier_candidates() - New plan_cmd_tier_failover module with execute_csa_step_with_tier_failover that loops through candidates using classify_next_model_failure_with_elapsed - StepExecutionOutcome gains stderr field for failover pattern detection - Extracted test-only helpers to plan_cmd_steps_test_helpers (monolith gate) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces tier-level failover for CSA steps, allowing the system to automatically attempt alternative models when a primary model fails due to issues like rate limiting. Additionally, it adds stderr capturing to step execution outcomes and refactors test helpers into a separate module. Review feedback identifies that the failover summary log currently omits the failure reason for the final candidate in a tier. Furthermore, it is noted that stderr capturing for bash steps is currently ineffective because the underlying process inherits stderr rather than piping it for capture.
| if let Some(rate_limit) = detected | ||
| && idx + 1 < candidates.len() | ||
| { | ||
| warn!( | ||
| "{label} - {spec_label} failed ({}); advancing to next tier model", | ||
| rate_limit.reason | ||
| ); | ||
| eprintln!("{label} - {spec_label} FAILOVER ({})", rate_limit.reason); | ||
| failures.push(TierAttemptFailure { | ||
| model_spec: spec_label.to_string(), | ||
| reason: rate_limit.reason, | ||
| }); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
The "all models failed" summary log (line 116) will be incomplete because the last model's failure is not added to the failures vector. The current logic only pushes to failures when idx + 1 < candidates.len(). If the last candidate fails, its reason is omitted from the final summary warning.
if let Some(rate_limit) = detected {
if idx + 1 < candidates.len() {
warn!(
"{label} - {spec_label} failed ({}); advancing to next tier model",
rate_limit.reason
);
eprintln!("{label} - {spec_label} FAILOVER ({})", rate_limit.reason);
failures.push(TierAttemptFailure {
model_spec: spec_label.to_string(),
reason: rate_limit.reason,
});
continue;
} else {
failures.push(TierAttemptFailure {
model_spec: spec_label.to_string(),
reason: rate_limit.reason,
});
}
}| }; | ||
|
|
||
| let stdout = String::from_utf8_lossy(&output.stdout).to_string(); | ||
| let stderr_str = String::from_utf8_lossy(&output.stderr).to_string(); |
There was a problem hiding this comment.
The stderr_str captured here will be empty for bash steps because spawn_bash (line 172) uses Stdio::inherit() for stderr. To correctly populate this field for the journal or pattern detection, spawn_bash would need to use Stdio::piped(), and execute_bash_step would then need to manually print the captured stderr to maintain terminal visibility.
Summary
resolve_step_toolcollapsed the tier into a single(tool, model_spec)pair andexecute_csa_steptried only that one model — gemini-cli returning 400 caused the entire workflow to abort with no recoveryStepTarget::CsaToolgainstier_namefield so the executor can build the full fallback chain viaordered_tier_candidates()Changes
plan_cmd_tier_failover.rs—execute_csa_step_with_tier_failover()loops through tier candidates usingclassify_next_model_failure_with_elapsed()for failover gatingplan_cmd_steps.rs—StepTarget::CsaToolcarriestier_name,resolve_step_toolpreserves itplan_cmd_exec.rs—StepExecutionOutcomegainsstderrfield for failover pattern detectionplan_cmd_steps_test_helpers.rs— test-onlyexecute_plan/execute_stephelpers (monolith gate compliance)Closes #1448
Test plan
cargo check --package cli-sub-agentpassescargo clippy --package cli-sub-agent -- -D warningscleancargo fmt --checkcleancsa review --range main...HEADverdict: PASS (session 01KRWHSC6ZZ)csa plan run patterns/mktd/workflow.tomlagainst a repo where gemini-cli returns 400 — verify it falls through to claude-code/codex🤖 Generated with Claude Code