Skip to content

feat: LLM model fallback on failure — fast failover to secondary provider/model#736

Open
phuongvm wants to merge 2 commits into
plastic-labs:mainfrom
phuongvm:pr-692-fallback-core
Open

feat: LLM model fallback on failure — fast failover to secondary provider/model#736
phuongvm wants to merge 2 commits into
plastic-labs:mainfrom
phuongvm:pr-692-fallback-core

Conversation

@phuongvm
Copy link
Copy Markdown

@phuongvm phuongvm commented May 29, 2026

Summary

Implements automatic model fallback when the primary LLM provider fails with retryable errors (rate limit, timeout, server error, connection error).

Key Changes

  • Fast failover: Switches to fallback model on the first retryable error — does NOT exhaust retries on the primary provider
  • Per-agent configuration: Independent fallback config via env vars for DERIVER, DIALECTIC (per reasoning level), DREAMER, SUMMARY agents
  • Cross-provider support: Primary and fallback can use different providers (e.g., primary=openai, fallback=lmstudio)
  • Configurable timeout: LLM_DEFAULT_TIMEOUT (default 180s) for all LLM HTTP clients
  • Clean observability: Langfuse generation spans include is_fallback: true metadata; no separate spans for failed attempts
  • Summarizer telemetry: Direct track_name passing for cleaner GENERATION traces

Configuration Example

DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
LLM_DEFAULT_TIMEOUT=300.0

Retryable Errors (trigger fallback)

  • HTTP 429 (Rate Limit)
  • HTTP 5xx (Server Errors)
  • TimeoutError / ConnectionError
  • SDK-specific: APIConnectionError, APITimeoutError, InternalServerError, ServiceUnavailableError, RateLimitError

Tests

26 tests covering:

  • Retry error classification (429, 5xx, 408, timeout, connection errors)
  • Plan attempt selection with/without fallback
  • Cross-provider fallback (openai→anthropic, lmstudio→openai)
  • ContextVar state management
  • End-to-end integration (primary fails → fallback succeeds, both fail, etc.)

Split from #692 per maintainer request to focus on LLM fallback scope.

Summary by CodeRabbit

  • New Features

    • Added LMStudio as a supported LLM provider.
    • Global LLM timeout configurable.
    • Per-feature, cross-provider LLM model fallbacks with automatic switch on retryable failures.
  • Documentation

    • Added LLM Model Fallback guidance covering behavior, retryable vs non-retryable errors, and observability signals (fallback warnings/metadata).
  • Tests

    • Added unit and integration tests validating fallback and retry behaviors.

Review Change Stack

…ider/model

Implements automatic model fallback when the primary LLM provider fails:

- Fast failover on first retryable error (429, 5xx, timeout, connection error)
- Per-agent fallback configuration via env vars (DERIVER, DIALECTIC, DREAMER, SUMMARY)
- Cross-provider support (e.g., primary=openai, fallback=lmstudio)
- Configurable HTTP timeout (LLM_DEFAULT_TIMEOUT, default 180s)
- Langfuse observability: is_fallback metadata on generation spans
- Clean telemetry: no separate spans for failed attempts
- Summarizer: direct track_name passing for cleaner GENERATION traces

Configuration example:
  DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
  DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b

Files: src/llm/api.py, config.py, registry.py, runtime.py, tool_loop.py,
       credentials.py, telemetry/logging.py, utils/summarizer.py,
       .env.template, CLAUDE.md, tests/
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73c86913-79f3-4270-a9ef-93009817ba03

📥 Commits

Reviewing files that changed from the base of the PR and between 1c9c753 and 6e5b7ac.

📒 Files selected for processing (6)
  • .env.template
  • CLAUDE.md
  • src/config.py
  • src/llm/runtime.py
  • src/llm/tool_loop.py
  • tests/llm/test_fallback.py
✅ Files skipped from review due to trivial changes (2)
  • .env.template
  • CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/config.py
  • src/llm/tool_loop.py
  • src/llm/runtime.py
  • tests/llm/test_fallback.py

Walkthrough

Adds LMStudio as a supported transport and implements a first-failure LLM model fallback driven by a new ContextVar. Honcho now threads a DEFAULT_TIMEOUT into clients, emits generation-level Langfuse observations including is_fallback and usage, and includes unit and integration tests for fallback and retry classification.

Changes

LLM Model Fallback & LMStudio Provider

Layer / File(s) Summary
Configuration schema and environment defaults
src/config.py, .env.template
ModelTransport adds lmstudio; LLMSettings adds LMSTUDIO_API_KEY, LMSTUDIO_BASE_URL, and DEFAULT_TIMEOUT. .env.template documents LMStudio and per-agent fallback placeholders.
LMStudio credential & client wiring
src/llm/credentials.py, src/llm/registry.py
Resolve LMStudio API key, register lmstudio client as OpenAI-compatible using configured base URL, and use LLM.DEFAULT_TIMEOUT for default/override clients.
Fallback state & attempt planning
src/llm/runtime.py, src/llm/tool_loop.py
Introduce exported force_fallback: ContextVar[bool]; extend selection/planning to accept it and surface is_fallback; ensure tool-loop clears fallback state between iterations.
honcho_llm_call orchestration & retry trigger
src/llm/api.py
Add _update_langfuse_usage, _is_retryable_error, tenacity before-retry callback that sets force_fallback on retryable errors, enforce max_input_tokens on tool-less calls, and restore contextvars in a finally block.
Langfuse/telemetry wiring and summarizer changes
src/telemetry/logging.py, src/utils/summarizer.py
conditional_observe gains as_type for generation routing. Summarizer functions remove function-level decorators and pass explicit track_name into honcho_llm_call. Tests updated to expect generation-level updates and usage reporting.
Unit & integration tests
tests/llm/test_fallback.py, tests/llm/test_fallback_integration.py, tests/utils/test_clients.py
New unit tests for select/plan/ContextVar and retry-classification; integration tests with mocked backends validate first-failure fallback and both-fail behavior; Langfuse assertions updated to check generation-level metadata and usage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • plastic-labs/honcho#643: Extends default client wiring in src/llm/registry.py to incorporate *_BASE_URL configuration, related to LMStudio base URL routing here.

Suggested reviewers

  • VVoruganti

Poem

🐰 I nudge the fallback switch with care,
When networks fail or quotas stare,
LMStudio waits, timeouts aligned,
Fast failover keeps us unconfined.
Langfuse counts each generation's beat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 52.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: implementing LLM model fallback on retryable failures with fast failover to a secondary provider/model, which directly aligns with the primary objective of the pull request.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/llm/registry.py (1)

124-160: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

LMStudio requires an API key to be usable, even for unauthenticated local servers.

CLIENTS["lmstudio"] is only registered when LMSTUDIO_API_KEY is set (Line 124), and client_for_model_config raises Missing API key for lmstudio (Line 150) when none resolves. Local LMStudio servers typically don't require auth, but the AsyncOpenAI client also rejects an empty key — so operators must supply a placeholder key to use the feature. This is a reasonable design, but worth surfacing in the docs (see the .env.template suggestion) so users don't hit a confusing "missing API key" error for a keyless local endpoint.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/registry.py` around lines 124 - 160, CLIENTS["lmstudio"] and
client_for_model_config currently require settings.LLM.LMSTUDIO_API_KEY because
AsyncOpenAI rejects empty keys; to avoid confusing "Missing API key for
lmstudio" errors, add guidance and a placeholder to the environment template and
docs: update the .env.template to include LMSTUDIO_API_KEY with a clear comment
like "placeholder for local unauthenticated LMStudio (e.g. 'local' or 'unused')"
and add a short note in README/docs near LMStudio setup explaining that local
LMStudio instances may be unauthenticated and a dummy key must be supplied
(refer to settings.LLM.LMSTUDIO_API_KEY, CLIENTS['lmstudio'], AsyncOpenAI, and
client_for_model_config when describing the requirement).
src/llm/tool_loop.py (1)

609-633: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset force_fallback before the synthesis call.

Line 609 resets current_attempt to 1 for the final synthesis call, but does not reset force_fallback. If the last tool iteration set force_fallback=True (due to a retryable error in its retry callback), that state will persist. The synthesis call at line 612 invokes get_attempt_plan() which reads force_fallback.get(), causing the synthesis to incorrectly use the fallback model on the first attempt instead of trying the primary.

This inconsistency with the per-iteration reset at line 343 can lead to unexpected fallback usage.

🐛 Proposed fix
     current_attempt.set(1)
+    force_fallback.set(False)
 
     async def _final_call() -> HonchoLLMCallResponse[Any]:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/tool_loop.py` around lines 609 - 633, The final synthesis call in
_final_call resets current_attempt but does not reset force_fallback, so
get_attempt_plan() may read a stale force_fallback and pick the fallback model;
before calling get_attempt_plan() / honcho_llm_call_inner in _final_call, reset
force_fallback to its default (same place and semantics as the per-iteration
reset at line 343) so the final synthesis starts with force_fallback cleared;
ensure you reference and set the same force_fallback.get()/set(...) mechanism
used elsewhere so get_attempt_plan() observes the cleared value.
🧹 Nitpick comments (4)
.env.template (1)

79-85: ⚡ Quick win

Document the new LMStudio provider config.

This PR adds lmstudio as a transport plus LMSTUDIO_API_KEY / LMSTUDIO_BASE_URL, but the template still advertises only openai, anthropic, gemini (Line 79) and lists no LMStudio key/base-URL placeholders. Operators won't discover the new provider from the template.

📝 Suggested additions
-# Supported transports: openai, anthropic, gemini
+# Supported transports: openai, anthropic, gemini, lmstudio
 # Each transport picks up its API key from the corresponding LLM_*_API_KEY.
 # Base URLs are set per-module via MODEL_CONFIG__OVERRIDES__BASE_URL.
 #
 LLM_OPENAI_API_KEY=your-api-key-here
 # LLM_ANTHROPIC_API_KEY=
 # LLM_GEMINI_API_KEY=
+# LMStudio (OpenAI-compatible local server). API key can be any non-empty
+# placeholder (e.g. "lm-studio") if the local server doesn't require auth.
+# LLM_LMSTUDIO_API_KEY=lm-studio
+# LLM_LMSTUDIO_BASE_URL=http://localhost:1234/v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.env.template around lines 79 - 85, Update the .env.template to document the
new lmstudio transport and its config keys: add "lmstudio" to the Supported
transports line and include placeholder entries for LMSTUDIO_API_KEY and
LMSTUDIO_BASE_URL so operators can discover and configure the provider;
reference the transport name "lmstudio" and the variables "LMSTUDIO_API_KEY" and
"LMSTUDIO_BASE_URL" when adding the lines near the existing LLM_* entries.
src/config.py (1)

88-101: 💤 Low value

lmstudio shorthand won't normalize.

_normalize_model_transport still only recognizes {"anthropic", "openai", "gemini"} (Line 98). Now that lmstudio is a valid ModelTransport, a value like model="lmstudio/qwen3.5-9b" (without an explicit transport) won't be split, leaving model as the full string. Since transport is required on these models the practical blast radius is small, but adding lmstudio keeps the shorthand consistent with the supported transports.

♻️ Proposed change
-        if prefix in {"anthropic", "openai", "gemini"}:
+        if prefix in {"anthropic", "openai", "gemini", "lmstudio"}:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config.py` around lines 88 - 101, The shorthand normalization in
_normalize_model_transport currently only recognizes prefixes {"anthropic",
"openai", "gemini"}; add "lmstudio" to that set so model strings like
"lmstudio/qwen3.5-9b" are split into transport="lmstudio" and model="qwen3.5-9b"
when transport_value is None, preserving the existing split logic in the
function.
src/llm/runtime.py (2)

120-122: 💤 Low value

Redundant None check.

Line 114 already ensures model_config.fallback is not None when use_fallback is True, so the None check at line 120 is unreachable. The early-return at lines 120-121 can never execute.

♻️ Simplify by removing unreachable guard
     if not use_fallback:
         return model_config
 
     fb = model_config.fallback
-    if fb is None:
-        return model_config
-
     return ModelConfig(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/runtime.py` around lines 120 - 122, Remove the unreachable guard that
checks `fb` for None and early-returns `model_config` in the runtime logic:
since earlier code already guarantees `model_config.fallback` is not None when
`use_fallback` is True, delete the `if fb is None: return model_config` block
(the variables `fb`, `use_fallback`, and `model_config` are the relevant
symbols) so the function proceeds with the intended fallback handling without
the redundant check; ensure no other code depends on that early-return.

176-181: 💤 Low value

Fallback warning fires on every fallback attempt, not just the first switch.

The condition not is_primary will trigger the warning on every retry attempt that uses the fallback model (e.g., attempts 2, 3 if fallback was triggered on attempt 2). The message "switching from ... to ..." might be slightly misleading for subsequent attempts, though the attempt number provides context. This provides detailed observability but could be verbose.

If you prefer to log only once when fallback is first activated, you could check attempt == current_attempt.get() to detect the transition, or rely on the "Fast fallback triggered" log in api.py:339-343 as the single notification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/runtime.py` around lines 176 - 181, The fallback warning currently
fires on every non-primary attempt because it only checks not is_primary; change
the condition so the logger.warning in the fallback block only runs when the
fallback is first activated (e.g., when attempt == current_attempt.get()) or
another "first-switch" indicator, e.g., if not is_primary and
runtime_model_config.fallback is not None and attempt == current_attempt.get():
then emit the warning (keep the same message referencing
runtime_model_config.transport, runtime_model_config.model, provider,
selected.model, attempt, retry_attempts).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CLAUDE.md`:
- Line 352: The sentence "Non-retryable errors (400, 200, ValueError, etc.) do
NOT trigger fallback — they follow normal retry behavior" is contradictory;
update the wording to state that non-retryable errors are not retried and
therefore do not trigger fallback, and correct the examples/list (e.g., remove
200 if inappropriate) so it's clear which status codes or exceptions are
considered non-retryable (for example: "Non-retryable errors (e.g., HTTP 400,
422, or synchronous ValueError) are not retried and do not trigger a fallback;
they are surfaced immediately"). Ensure the new sentence replaces the existing
one exactly where it currently appears.
- Around line 349-350: Remove the generic `OSError` entry from the documented
retryable/fallback list and instead list only the specific connection/timeout
families or SDK errors; update the bullet that currently reads "`OSError`" to
either delete it or replace it with specific exceptions (e.g.,
connection/timeout error families) and ensure the SDK-specific entries remain:
`APIConnectionError`, `APITimeoutError`, `InternalServerError`,
`ServiceUnavailableError`, `RateLimitError`.

In `@tests/llm/test_fallback.py`:
- Around line 345-357: test_default_is_false should not mutate the ContextVar
before asserting the default and both tests must restore prior state to avoid
leaking between tests; for TestForceFallbackContextVar.test_default_is_false
call force_fallback.get() directly and assert it is False (do not call
force_fallback.set(False) first), and for
TestForceFallbackContextVar.test_set_and_get when you call
force_fallback.set(True) capture the returned token and use
ContextVar.reset(token) (in a finally or teardown) to restore the previous value
(do the same when you set False) so the ContextVar's prior state is always
restored after each test.
- Around line 232-246: The test mirror _is_retryable_error in
tests/llm/test_fallback.py is missing HTTP status 408, causing behavior drift
versus the production nested _is_retryable_error; update the status check in the
test function (the _is_retryable_error staticmethod) to treat status 408 as
retryable (i.e., include 408 alongside 429/5xx), or alternatively refactor the
production helper into a module-level function that both src/llm/api.py and
tests/llm/test_fallback.py can import and reuse to ensure parity.

---

Outside diff comments:
In `@src/llm/registry.py`:
- Around line 124-160: CLIENTS["lmstudio"] and client_for_model_config currently
require settings.LLM.LMSTUDIO_API_KEY because AsyncOpenAI rejects empty keys; to
avoid confusing "Missing API key for lmstudio" errors, add guidance and a
placeholder to the environment template and docs: update the .env.template to
include LMSTUDIO_API_KEY with a clear comment like "placeholder for local
unauthenticated LMStudio (e.g. 'local' or 'unused')" and add a short note in
README/docs near LMStudio setup explaining that local LMStudio instances may be
unauthenticated and a dummy key must be supplied (refer to
settings.LLM.LMSTUDIO_API_KEY, CLIENTS['lmstudio'], AsyncOpenAI, and
client_for_model_config when describing the requirement).

In `@src/llm/tool_loop.py`:
- Around line 609-633: The final synthesis call in _final_call resets
current_attempt but does not reset force_fallback, so get_attempt_plan() may
read a stale force_fallback and pick the fallback model; before calling
get_attempt_plan() / honcho_llm_call_inner in _final_call, reset force_fallback
to its default (same place and semantics as the per-iteration reset at line 343)
so the final synthesis starts with force_fallback cleared; ensure you reference
and set the same force_fallback.get()/set(...) mechanism used elsewhere so
get_attempt_plan() observes the cleared value.

---

Nitpick comments:
In @.env.template:
- Around line 79-85: Update the .env.template to document the new lmstudio
transport and its config keys: add "lmstudio" to the Supported transports line
and include placeholder entries for LMSTUDIO_API_KEY and LMSTUDIO_BASE_URL so
operators can discover and configure the provider; reference the transport name
"lmstudio" and the variables "LMSTUDIO_API_KEY" and "LMSTUDIO_BASE_URL" when
adding the lines near the existing LLM_* entries.

In `@src/config.py`:
- Around line 88-101: The shorthand normalization in _normalize_model_transport
currently only recognizes prefixes {"anthropic", "openai", "gemini"}; add
"lmstudio" to that set so model strings like "lmstudio/qwen3.5-9b" are split
into transport="lmstudio" and model="qwen3.5-9b" when transport_value is None,
preserving the existing split logic in the function.

In `@src/llm/runtime.py`:
- Around line 120-122: Remove the unreachable guard that checks `fb` for None
and early-returns `model_config` in the runtime logic: since earlier code
already guarantees `model_config.fallback` is not None when `use_fallback` is
True, delete the `if fb is None: return model_config` block (the variables `fb`,
`use_fallback`, and `model_config` are the relevant symbols) so the function
proceeds with the intended fallback handling without the redundant check; ensure
no other code depends on that early-return.
- Around line 176-181: The fallback warning currently fires on every non-primary
attempt because it only checks not is_primary; change the condition so the
logger.warning in the fallback block only runs when the fallback is first
activated (e.g., when attempt == current_attempt.get()) or another
"first-switch" indicator, e.g., if not is_primary and
runtime_model_config.fallback is not None and attempt == current_attempt.get():
then emit the warning (keep the same message referencing
runtime_model_config.transport, runtime_model_config.model, provider,
selected.model, attempt, retry_attempts).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 549c695a-712e-4d46-9099-f5f1db2b1b4d

📥 Commits

Reviewing files that changed from the base of the PR and between 85239a6 and 1c9c753.

📒 Files selected for processing (13)
  • .env.template
  • CLAUDE.md
  • src/config.py
  • src/llm/api.py
  • src/llm/credentials.py
  • src/llm/registry.py
  • src/llm/runtime.py
  • src/llm/tool_loop.py
  • src/telemetry/logging.py
  • src/utils/summarizer.py
  • tests/llm/test_fallback.py
  • tests/llm/test_fallback_integration.py
  • tests/utils/test_clients.py

Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread tests/llm/test_fallback.py
Comment thread tests/llm/test_fallback.py Outdated
Critical:
- Reset force_fallback ContextVar before synthesis call (tool_loop.py)
  Prevents stale fallback state from causing synthesis to use fallback model

Important:
- Fallback warning now fires only once (on attempt 2) to avoid log spam
- Add lmstudio to _normalize_model_transport shorthand parser
- Document LMStudio provider in .env.template with placeholder key guidance
- Fix CLAUDE.md: remove generic OSError, correct non-retryable error examples

Tests:
- Add HTTP 408 to _is_retryable_error mirror test
- Remove OSError from test error classification
- Fix ContextVar test state leakage (restore original state in finally)
- Fix test_default_is_false to not mutate ContextVar before asserting

All 26 tests passing. Lint clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants