feat: LLM model fallback on failure — fast failover to secondary provider/model#736
feat: LLM model fallback on failure — fast failover to secondary provider/model#736phuongvm wants to merge 2 commits into
Conversation
…ider/model
Implements automatic model fallback when the primary LLM provider fails:
- Fast failover on first retryable error (429, 5xx, timeout, connection error)
- Per-agent fallback configuration via env vars (DERIVER, DIALECTIC, DREAMER, SUMMARY)
- Cross-provider support (e.g., primary=openai, fallback=lmstudio)
- Configurable HTTP timeout (LLM_DEFAULT_TIMEOUT, default 180s)
- Langfuse observability: is_fallback metadata on generation spans
- Clean telemetry: no separate spans for failed attempts
- Summarizer: direct track_name passing for cleaner GENERATION traces
Configuration example:
DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
Files: src/llm/api.py, config.py, registry.py, runtime.py, tool_loop.py,
credentials.py, telemetry/logging.py, utils/summarizer.py,
.env.template, CLAUDE.md, tests/
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (4)
WalkthroughAdds LMStudio as a supported transport and implements a first-failure LLM model fallback driven by a new ContextVar. Honcho now threads a DEFAULT_TIMEOUT into clients, emits generation-level Langfuse observations including is_fallback and usage, and includes unit and integration tests for fallback and retry classification. ChangesLLM Model Fallback & LMStudio Provider
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/llm/registry.py (1)
124-160:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winLMStudio requires an API key to be usable, even for unauthenticated local servers.
CLIENTS["lmstudio"]is only registered whenLMSTUDIO_API_KEYis set (Line 124), andclient_for_model_configraisesMissing API key for lmstudio(Line 150) when none resolves. Local LMStudio servers typically don't require auth, but theAsyncOpenAIclient also rejects an empty key — so operators must supply a placeholder key to use the feature. This is a reasonable design, but worth surfacing in the docs (see the.env.templatesuggestion) so users don't hit a confusing "missing API key" error for a keyless local endpoint.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/registry.py` around lines 124 - 160, CLIENTS["lmstudio"] and client_for_model_config currently require settings.LLM.LMSTUDIO_API_KEY because AsyncOpenAI rejects empty keys; to avoid confusing "Missing API key for lmstudio" errors, add guidance and a placeholder to the environment template and docs: update the .env.template to include LMSTUDIO_API_KEY with a clear comment like "placeholder for local unauthenticated LMStudio (e.g. 'local' or 'unused')" and add a short note in README/docs near LMStudio setup explaining that local LMStudio instances may be unauthenticated and a dummy key must be supplied (refer to settings.LLM.LMSTUDIO_API_KEY, CLIENTS['lmstudio'], AsyncOpenAI, and client_for_model_config when describing the requirement).src/llm/tool_loop.py (1)
609-633:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReset
force_fallbackbefore the synthesis call.Line 609 resets
current_attemptto 1 for the final synthesis call, but does not resetforce_fallback. If the last tool iteration setforce_fallback=True(due to a retryable error in its retry callback), that state will persist. The synthesis call at line 612 invokesget_attempt_plan()which readsforce_fallback.get(), causing the synthesis to incorrectly use the fallback model on the first attempt instead of trying the primary.This inconsistency with the per-iteration reset at line 343 can lead to unexpected fallback usage.
🐛 Proposed fix
current_attempt.set(1) + force_fallback.set(False) async def _final_call() -> HonchoLLMCallResponse[Any]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/tool_loop.py` around lines 609 - 633, The final synthesis call in _final_call resets current_attempt but does not reset force_fallback, so get_attempt_plan() may read a stale force_fallback and pick the fallback model; before calling get_attempt_plan() / honcho_llm_call_inner in _final_call, reset force_fallback to its default (same place and semantics as the per-iteration reset at line 343) so the final synthesis starts with force_fallback cleared; ensure you reference and set the same force_fallback.get()/set(...) mechanism used elsewhere so get_attempt_plan() observes the cleared value.
🧹 Nitpick comments (4)
.env.template (1)
79-85: ⚡ Quick winDocument the new LMStudio provider config.
This PR adds
lmstudioas a transport plusLMSTUDIO_API_KEY/LMSTUDIO_BASE_URL, but the template still advertises onlyopenai, anthropic, gemini(Line 79) and lists no LMStudio key/base-URL placeholders. Operators won't discover the new provider from the template.📝 Suggested additions
-# Supported transports: openai, anthropic, gemini +# Supported transports: openai, anthropic, gemini, lmstudio # Each transport picks up its API key from the corresponding LLM_*_API_KEY. # Base URLs are set per-module via MODEL_CONFIG__OVERRIDES__BASE_URL. # LLM_OPENAI_API_KEY=your-api-key-here # LLM_ANTHROPIC_API_KEY= # LLM_GEMINI_API_KEY= +# LMStudio (OpenAI-compatible local server). API key can be any non-empty +# placeholder (e.g. "lm-studio") if the local server doesn't require auth. +# LLM_LMSTUDIO_API_KEY=lm-studio +# LLM_LMSTUDIO_BASE_URL=http://localhost:1234/v1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.env.template around lines 79 - 85, Update the .env.template to document the new lmstudio transport and its config keys: add "lmstudio" to the Supported transports line and include placeholder entries for LMSTUDIO_API_KEY and LMSTUDIO_BASE_URL so operators can discover and configure the provider; reference the transport name "lmstudio" and the variables "LMSTUDIO_API_KEY" and "LMSTUDIO_BASE_URL" when adding the lines near the existing LLM_* entries.src/config.py (1)
88-101: 💤 Low value
lmstudioshorthand won't normalize.
_normalize_model_transportstill only recognizes{"anthropic", "openai", "gemini"}(Line 98). Now thatlmstudiois a validModelTransport, a value likemodel="lmstudio/qwen3.5-9b"(without an explicittransport) won't be split, leavingmodelas the full string. Sincetransportis required on these models the practical blast radius is small, but addinglmstudiokeeps the shorthand consistent with the supported transports.♻️ Proposed change
- if prefix in {"anthropic", "openai", "gemini"}: + if prefix in {"anthropic", "openai", "gemini", "lmstudio"}:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config.py` around lines 88 - 101, The shorthand normalization in _normalize_model_transport currently only recognizes prefixes {"anthropic", "openai", "gemini"}; add "lmstudio" to that set so model strings like "lmstudio/qwen3.5-9b" are split into transport="lmstudio" and model="qwen3.5-9b" when transport_value is None, preserving the existing split logic in the function.src/llm/runtime.py (2)
120-122: 💤 Low valueRedundant None check.
Line 114 already ensures
model_config.fallback is not Nonewhenuse_fallbackis True, so the None check at line 120 is unreachable. The early-return at lines 120-121 can never execute.♻️ Simplify by removing unreachable guard
if not use_fallback: return model_config fb = model_config.fallback - if fb is None: - return model_config - return ModelConfig(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/runtime.py` around lines 120 - 122, Remove the unreachable guard that checks `fb` for None and early-returns `model_config` in the runtime logic: since earlier code already guarantees `model_config.fallback` is not None when `use_fallback` is True, delete the `if fb is None: return model_config` block (the variables `fb`, `use_fallback`, and `model_config` are the relevant symbols) so the function proceeds with the intended fallback handling without the redundant check; ensure no other code depends on that early-return.
176-181: 💤 Low valueFallback warning fires on every fallback attempt, not just the first switch.
The condition
not is_primarywill trigger the warning on every retry attempt that uses the fallback model (e.g., attempts 2, 3 if fallback was triggered on attempt 2). The message "switching from ... to ..." might be slightly misleading for subsequent attempts, though the attempt number provides context. This provides detailed observability but could be verbose.If you prefer to log only once when fallback is first activated, you could check
attempt == current_attempt.get()to detect the transition, or rely on the "Fast fallback triggered" log inapi.py:339-343as the single notification.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/runtime.py` around lines 176 - 181, The fallback warning currently fires on every non-primary attempt because it only checks not is_primary; change the condition so the logger.warning in the fallback block only runs when the fallback is first activated (e.g., when attempt == current_attempt.get()) or another "first-switch" indicator, e.g., if not is_primary and runtime_model_config.fallback is not None and attempt == current_attempt.get(): then emit the warning (keep the same message referencing runtime_model_config.transport, runtime_model_config.model, provider, selected.model, attempt, retry_attempts).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CLAUDE.md`:
- Line 352: The sentence "Non-retryable errors (400, 200, ValueError, etc.) do
NOT trigger fallback — they follow normal retry behavior" is contradictory;
update the wording to state that non-retryable errors are not retried and
therefore do not trigger fallback, and correct the examples/list (e.g., remove
200 if inappropriate) so it's clear which status codes or exceptions are
considered non-retryable (for example: "Non-retryable errors (e.g., HTTP 400,
422, or synchronous ValueError) are not retried and do not trigger a fallback;
they are surfaced immediately"). Ensure the new sentence replaces the existing
one exactly where it currently appears.
- Around line 349-350: Remove the generic `OSError` entry from the documented
retryable/fallback list and instead list only the specific connection/timeout
families or SDK errors; update the bullet that currently reads "`OSError`" to
either delete it or replace it with specific exceptions (e.g.,
connection/timeout error families) and ensure the SDK-specific entries remain:
`APIConnectionError`, `APITimeoutError`, `InternalServerError`,
`ServiceUnavailableError`, `RateLimitError`.
In `@tests/llm/test_fallback.py`:
- Around line 345-357: test_default_is_false should not mutate the ContextVar
before asserting the default and both tests must restore prior state to avoid
leaking between tests; for TestForceFallbackContextVar.test_default_is_false
call force_fallback.get() directly and assert it is False (do not call
force_fallback.set(False) first), and for
TestForceFallbackContextVar.test_set_and_get when you call
force_fallback.set(True) capture the returned token and use
ContextVar.reset(token) (in a finally or teardown) to restore the previous value
(do the same when you set False) so the ContextVar's prior state is always
restored after each test.
- Around line 232-246: The test mirror _is_retryable_error in
tests/llm/test_fallback.py is missing HTTP status 408, causing behavior drift
versus the production nested _is_retryable_error; update the status check in the
test function (the _is_retryable_error staticmethod) to treat status 408 as
retryable (i.e., include 408 alongside 429/5xx), or alternatively refactor the
production helper into a module-level function that both src/llm/api.py and
tests/llm/test_fallback.py can import and reuse to ensure parity.
---
Outside diff comments:
In `@src/llm/registry.py`:
- Around line 124-160: CLIENTS["lmstudio"] and client_for_model_config currently
require settings.LLM.LMSTUDIO_API_KEY because AsyncOpenAI rejects empty keys; to
avoid confusing "Missing API key for lmstudio" errors, add guidance and a
placeholder to the environment template and docs: update the .env.template to
include LMSTUDIO_API_KEY with a clear comment like "placeholder for local
unauthenticated LMStudio (e.g. 'local' or 'unused')" and add a short note in
README/docs near LMStudio setup explaining that local LMStudio instances may be
unauthenticated and a dummy key must be supplied (refer to
settings.LLM.LMSTUDIO_API_KEY, CLIENTS['lmstudio'], AsyncOpenAI, and
client_for_model_config when describing the requirement).
In `@src/llm/tool_loop.py`:
- Around line 609-633: The final synthesis call in _final_call resets
current_attempt but does not reset force_fallback, so get_attempt_plan() may
read a stale force_fallback and pick the fallback model; before calling
get_attempt_plan() / honcho_llm_call_inner in _final_call, reset force_fallback
to its default (same place and semantics as the per-iteration reset at line 343)
so the final synthesis starts with force_fallback cleared; ensure you reference
and set the same force_fallback.get()/set(...) mechanism used elsewhere so
get_attempt_plan() observes the cleared value.
---
Nitpick comments:
In @.env.template:
- Around line 79-85: Update the .env.template to document the new lmstudio
transport and its config keys: add "lmstudio" to the Supported transports line
and include placeholder entries for LMSTUDIO_API_KEY and LMSTUDIO_BASE_URL so
operators can discover and configure the provider; reference the transport name
"lmstudio" and the variables "LMSTUDIO_API_KEY" and "LMSTUDIO_BASE_URL" when
adding the lines near the existing LLM_* entries.
In `@src/config.py`:
- Around line 88-101: The shorthand normalization in _normalize_model_transport
currently only recognizes prefixes {"anthropic", "openai", "gemini"}; add
"lmstudio" to that set so model strings like "lmstudio/qwen3.5-9b" are split
into transport="lmstudio" and model="qwen3.5-9b" when transport_value is None,
preserving the existing split logic in the function.
In `@src/llm/runtime.py`:
- Around line 120-122: Remove the unreachable guard that checks `fb` for None
and early-returns `model_config` in the runtime logic: since earlier code
already guarantees `model_config.fallback` is not None when `use_fallback` is
True, delete the `if fb is None: return model_config` block (the variables `fb`,
`use_fallback`, and `model_config` are the relevant symbols) so the function
proceeds with the intended fallback handling without the redundant check; ensure
no other code depends on that early-return.
- Around line 176-181: The fallback warning currently fires on every non-primary
attempt because it only checks not is_primary; change the condition so the
logger.warning in the fallback block only runs when the fallback is first
activated (e.g., when attempt == current_attempt.get()) or another
"first-switch" indicator, e.g., if not is_primary and
runtime_model_config.fallback is not None and attempt == current_attempt.get():
then emit the warning (keep the same message referencing
runtime_model_config.transport, runtime_model_config.model, provider,
selected.model, attempt, retry_attempts).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 549c695a-712e-4d46-9099-f5f1db2b1b4d
📒 Files selected for processing (13)
.env.templateCLAUDE.mdsrc/config.pysrc/llm/api.pysrc/llm/credentials.pysrc/llm/registry.pysrc/llm/runtime.pysrc/llm/tool_loop.pysrc/telemetry/logging.pysrc/utils/summarizer.pytests/llm/test_fallback.pytests/llm/test_fallback_integration.pytests/utils/test_clients.py
Critical: - Reset force_fallback ContextVar before synthesis call (tool_loop.py) Prevents stale fallback state from causing synthesis to use fallback model Important: - Fallback warning now fires only once (on attempt 2) to avoid log spam - Add lmstudio to _normalize_model_transport shorthand parser - Document LMStudio provider in .env.template with placeholder key guidance - Fix CLAUDE.md: remove generic OSError, correct non-retryable error examples Tests: - Add HTTP 408 to _is_retryable_error mirror test - Remove OSError from test error classification - Fix ContextVar test state leakage (restore original state in finally) - Fix test_default_is_false to not mutate ContextVar before asserting All 26 tests passing. Lint clean.
Summary
Implements automatic model fallback when the primary LLM provider fails with retryable errors (rate limit, timeout, server error, connection error).
Key Changes
LLM_DEFAULT_TIMEOUT(default 180s) for all LLM HTTP clientsis_fallback: truemetadata; no separate spans for failed attemptstrack_namepassing for cleaner GENERATION tracesConfiguration Example
Retryable Errors (trigger fallback)
Tests
26 tests covering:
Split from #692 per maintainer request to focus on LLM fallback scope.
Summary by CodeRabbit
New Features
Documentation
Tests