Skip to content

fix(plan): add tier-model failover to plan step executor (#1448)#1450

Merged
RyderFreeman4Logos merged 2 commits into
mainfrom
fix/1448-plan-step-tier-failover
May 18, 2026
Merged

fix(plan): add tier-model failover to plan step executor (#1448)#1450
RyderFreeman4Logos merged 2 commits into
mainfrom
fix/1448-plan-step-tier-failover

Conversation

@RyderFreeman4Logos
Copy link
Copy Markdown
Owner

Summary

  • Plan steps with tier annotations now iterate through the tier's model fallback chain on failover-eligible failures (HTTP 400/429, rate limit, quota exhaustion)
  • Previously, resolve_step_tool collapsed the tier into a single (tool, model_spec) pair and execute_csa_step tried only that one model — gemini-cli returning 400 caused the entire workflow to abort with no recovery
  • StepTarget::CsaTool gains tier_name field so the executor can build the full fallback chain via ordered_tier_candidates()

Changes

  • New: plan_cmd_tier_failover.rsexecute_csa_step_with_tier_failover() loops through tier candidates using classify_next_model_failure_with_elapsed() for failover gating
  • Modified: plan_cmd_steps.rsStepTarget::CsaTool carries tier_name, resolve_step_tool preserves it
  • Modified: plan_cmd_exec.rsStepExecutionOutcome gains stderr field for failover pattern detection
  • Extracted: plan_cmd_steps_test_helpers.rs — test-only execute_plan/execute_step helpers (monolith gate compliance)

Closes #1448

Test plan

  • cargo check --package cli-sub-agent passes
  • cargo clippy --package cli-sub-agent -- -D warnings clean
  • cargo fmt --check clean
  • csa review --range main...HEAD verdict: PASS (session 01KRWHSC6ZZ)
  • Manual: run csa plan run patterns/mktd/workflow.toml against a repo where gemini-cli returns 400 — verify it falls through to claude-code/codex

🤖 Generated with Claude Code

RyderFreeman4Logos and others added 2 commits May 17, 2026 20:20
Plan steps with a tier annotation now iterate through the tier's model
list on failover-eligible failures (HTTP 400/429, rate limit, quota
exhaustion). Previously, resolve_step_tool collapsed the tier into a
single (tool, model_spec) pair and execute_csa_step tried only that one
model — gemini-cli returning 400 caused the entire workflow to abort.

Changes:
- StepTarget::CsaTool gains tier_name field so the executor can build
  the full fallback chain via ordered_tier_candidates()
- New plan_cmd_tier_failover module with execute_csa_step_with_tier_failover
  that loops through candidates using classify_next_model_failure_with_elapsed
- StepExecutionOutcome gains stderr field for failover pattern detection
- Extracted test-only helpers to plan_cmd_steps_test_helpers (monolith gate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@RyderFreeman4Logos RyderFreeman4Logos merged commit f253f0f into main May 18, 2026
3 of 7 checks passed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tier-level failover for CSA steps, allowing the system to automatically attempt alternative models when a primary model fails due to issues like rate limiting. Additionally, it adds stderr capturing to step execution outcomes and refactors test helpers into a separate module. Review feedback identifies that the failover summary log currently omits the failure reason for the final candidate in a tier. Furthermore, it is noted that stderr capturing for bash steps is currently ineffective because the underlying process inherits stderr rather than piping it for capture.

Comment on lines +102 to +115
if let Some(rate_limit) = detected
&& idx + 1 < candidates.len()
{
warn!(
"{label} - {spec_label} failed ({}); advancing to next tier model",
rate_limit.reason
);
eprintln!("{label} - {spec_label} FAILOVER ({})", rate_limit.reason);
failures.push(TierAttemptFailure {
model_spec: spec_label.to_string(),
reason: rate_limit.reason,
});
continue;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The "all models failed" summary log (line 116) will be incomplete because the last model's failure is not added to the failures vector. The current logic only pushes to failures when idx + 1 < candidates.len(). If the last candidate fails, its reason is omitted from the final summary warning.

                if let Some(rate_limit) = detected {
                    if idx + 1 < candidates.len() {
                        warn!(
                            "{label} - {spec_label} failed ({}); advancing to next tier model",
                            rate_limit.reason
                        );
                        eprintln!("{label} - {spec_label} FAILOVER ({})", rate_limit.reason);
                        failures.push(TierAttemptFailure {
                            model_spec: spec_label.to_string(),
                            reason: rate_limit.reason,
                        });
                        continue;
                    } else {
                        failures.push(TierAttemptFailure {
                            model_spec: spec_label.to_string(),
                            reason: rate_limit.reason,
                        });
                    }
                }

};

let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr_str = String::from_utf8_lossy(&output.stderr).to_string();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The stderr_str captured here will be empty for bash steps because spawn_bash (line 172) uses Stdio::inherit() for stderr. To correctly populate this field for the journal or pattern detection, spawn_bash would need to use Stdio::piped(), and execute_bash_step would then need to manually print the captured stderr to maintain terminal visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dev2merge mktd plan step fails consistently (DRAFT TODO + RECON Dim 4)

1 participant