Skip to content

feat: LLM model fallback on failure — fast failover to secondary provider/model#692

Open
phuongvm wants to merge 14 commits into
plastic-labs:mainfrom
phuongvm:feat/llm-model-fallback
Open

feat: LLM model fallback on failure — fast failover to secondary provider/model#692
phuongvm wants to merge 14 commits into
plastic-labs:mainfrom
phuongvm:feat/llm-model-fallback

Conversation

@phuongvm
Copy link
Copy Markdown

@phuongvm phuongvm commented May 15, 2026

Summary

Introduces a robust model fallback mechanism for Honcho's LLM layer. When a primary LLM provider/model returns a retryable error (HTTP 429/5xx, timeout, connection error), the system automatically fails over to a configured fallback model on the very next attempt — no need to exhaust all retries on a dead provider first.

Problem

When an LLM provider is rate-limited (429), experiencing server errors (5xx), or unreachable (timeout/connection error), Honcho would retry on the same provider up to retry_attempts (default 3) before giving up entirely. This meant:

  1. Wasted time — retrying a dead provider is useless
  2. Wasted tokens — partial failures still consume quota
  3. No graceful degradation — no fallback to a local/secondary model

Solution

Fast Fallback on First Failure

A force_fallback ContextVar tracks whether the current LLM call series has hit a retryable error. On the first detectable failure:

  1. The before_retry_callback in honcho_llm_call() detects retryable errors via _is_retryable_error()
  2. It sets force_fallback = True immediately
  3. The plan_attempt() function selects the fallback ModelConfig on all subsequent attempts

Configurable via Environment Variables

Each agent can define its fallback model independently using __-delimited Pydantic env vars:

# Example: LMStudio local model as fallback for the Deriver agent
DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
DERIVER_MODEL_CONFIG__FALLBACK__API_KEY=lmstudio-key
DERIVER_MODEL_CONFIG__FALLBACK__BASE_URL=http://localhost:1234/v1

The same pattern works for all agents:

  • DERIVER_MODEL_CONFIG__FALLBACK__* (memory worker)
  • DIALECTIC_MODEL_CONFIG__FALLBACK__* (query/chat)
  • DREAMER_MODEL_CONFIG__FALLBACK__* (memory consolidation)
  • MCP_MODEL_CONFIG__FALLBACK__* (MCP server)

Observability: Langfuse Signal

When fallback is active, the single generation span (not a separate failed-attempt span) is updated with:

  • is_fallback: true in metadata
  • The fallback provider and model names appear on the span

This keeps traces clean — no pollution from failed-attempt spans.

Configurable HTTP Timeout

Added LLMSettings.DEFAULT_TIMEOUT (env var LLM_DEFAULT_TIMEOUT, default 180.0) replacing the hardcoded 600.0 for Anthropic and SDK defaults (~60s) for OpenAI/LMStudio clients. All provider clients now use a consistent, configurable timeout.

Architecture

honcho_llm_call()
  │
  ├─ _get_attempt_plan()
  │   └─ plan_attempt()
  │       ├─ Normal:    returns primary ModelConfig (attempt 1 of N)
  │       ├─ Fallback:  returns fallback ModelConfig (attempt 2..N)
  │       └─ ContextVar: force_fallback controls which path
  │
  ├─ before_retry_callback()
  │   ├─ Checks _is_retryable_error() on the exception
  │   ├─ Sets force_fallback = True on first retryable failure
  │   └─ Updates Langfuse observation with fallback metadata
  │
  └─ honcho_llm_call_inner() — calls the selected provider

Fallback Decision Logic

Attempt 1: Primary model → succeeds? Done.
Attempt 1: Primary model → 429/5xx/timeout?
  → force_fallback = True
  → Attempt 2: Fallback model → succeeds? Done.
  → Attempt 2: Fallback model → also fails?
  → Attempt 3: Fallback model → ...
  → Exhausted → error propagated to caller

Retryable Error Detection

_is_retryable_error() classifies these as triggers:

Error Status/Type Triggers Fallback
Rate Limited HTTP 429
Server Error HTTP 5xx
Timeout TimeoutError
Connection Error ConnectionError
SDK API Timeout APITimeoutError
SDK Connection Error APIConnectionError
SDK Server Error InternalServerError
SDK Unavailable ServiceUnavailableError
SDK Rate Limit RateLimitError
All Others ❌ (pass through)

Files Changed

Core Logic

File Change
src/llm/runtime.py Added _force_fallback ContextVar, force_fallback parameter to select_model_config_for_attempt(), is_fallback param to update_current_langfuse_observation()
src/llm/api.py Added _is_retryable_error(), before_retry_callback with first-failure fallback trigger, _get_attempt_plan() with is_fallback computation
src/llm/tool_loop.py Reset force_fallback.set(False) per tool iteration
src/llm/registry.py Replaced hardcoded timeout=600.0 / SDK defaults with settings.LLM.DEFAULT_TIMEOUT (env var LLM_DEFAULT_TIMEOUT)
src/config.py Added LLMSettings.DEFAULT_TIMEOUT: float = 180.0

Configuration

File Change
.env.template Added LLM_DEFAULT_TIMEOUT env var documentation

Tests

File Tests Scope
tests/llm/test_fallback.py 23 unit tests select_model_config_for_attempt, plan_attempt, _is_retryable_error, cross-provider fallback (openai→anthropic, nous→openai), ContextVar lifecycle
tests/llm/test_fallback_integration.py 3 integration tests Fallback triggers on first failure, both primary+fallback fail → error propagated, cross-provider fallback

Documentation

File Change
CLAUDE.md Added LLM fallback architecture, config patterns, and verification commands

Usage

# Configure a fallback model (LMStudio local)
export DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
export DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
export DERIVER_MODEL_CONFIG__FALLBACK__API_KEY=your-key
export DERIVER_MODEL_CONFIG__FALLBACK__BASE_URL=http://localhost:1234/v1

# Configure timeout (default 180s)
export LLM_DEFAULT_TIMEOUT=300.0

Testing

# Unit tests
uv run pytest tests/llm/test_fallback.py -v

# Integration tests
uv run pytest tests/llm/test_fallback_integration.py -v

# Full LLM suite
uv run pytest tests/llm/ -v

Verification

  • Fallback triggers on first retryable failure, not on the final retry (default was changed from legacy behavior)
  • Only the successful generation span is kept in Langfuse (no failed-attempt span pollution)
  • Works with all agent types: Deriver, Dialectic, Dreamer, MCP
  • Zero runtime overhead when no retries occur (ContextVar starts at False)
  • All existing tests continue to pass (114 passed, 2 pre-existing failures unrelated: test_update_env_key_creates_file_if_missing, test_refresh_nous_credentials_success)

Summary by CodeRabbit

  • New Features

    • Immediate LLM model fallback on first retryable failure; added Nous and LMStudio provider support.
  • Configuration

    • New global LLM timeout and expanded env placeholders for per-agent fallback and LLM settings.
  • Improvements

    • Generation-level Langfuse observability and clearer summarizer trace names; Nous credential auto-refresh for resilient retries.
  • Infrastructure

    • Multi-stage container build and tightened Docker ignore rules.
  • Documentation & Tests

    • Added specs, design/proposal docs, verification plans, and tests for fallback and Nous refresh.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds per-agent immediate LLM fallback and Nous provider support with OAuth auto-refresh; forces fallback on first retryable failure via a ContextVar; converts Langfuse traces to generation-type with token reporting; refactors summarizer telemetry; updates MCP Docker/Bun dev setup; expands tests and OpenSpec documentation.

Changes

LLM Fallback, Nous Provider, and Observability

Layer / File(s) Summary
Provider config and timeouts
src/config.py, src/llm/credentials.py, src/llm/registry.py
Adds lmstudio and nous transports, LMStudio/Nous settings, and DEFAULT_TIMEOUT; client factories and registry use the configured timeout.
Nous OAuth refresh module
src/llm/nous_refresh.py, tests/llm/test_nous_refresh.py
Implements async Nous refresh + agent-key minting with persisted state, .env mutation, in-memory settings update attempt, and tests for load/save/update/refresh flows.
OpenAI backend Nous auto-refresh
src/llm/backends/openai.py, tests/llm/test_backends/test_nous_autorefresh.py
OpenAIBackend gains is_nous and retries on AuthenticationError by invoking Nous refresh; streaming and non-streaming retry paths updated and tested.
Fallback runtime and attempt planning
src/llm/runtime.py, src/llm/tool_loop.py
Introduces exported force_fallback ContextVar, extends select_model_config_for_attempt/plan_attempt to accept it, switches selection logic to allow immediate fallback, and resets force_fallback per tool-loop iteration.
honcho_llm_call retry and fallback integration
src/llm/api.py, tests/llm/test_fallback_integration.py
Classifies retryable errors, sets force_fallback on first retryable failure to trigger immediate fallback, preserves/restores contextvars, and updates Langfuse observation with is_fallback. Integration tests validate primary→fallback flows and failure semantics.
Langfuse generation tracing and token reporting
src/telemetry/logging.py, src/llm/api.py, tests/utils/test_clients.py
conditional_observe accepts as_type; honcho_llm_call emits generation observations and reports token usage_details; tests updated to expect update_current_generation calls and usage details.
Summarizer telemetry refactor
src/utils/summarizer.py
Removes @conditional_observe decorators for summarizers and passes track_name into honcho_llm_call to produce top-level GENERATION traces.
Fallback and unit tests
tests/llm/test_fallback.py, tests/llm/test_fallback_integration.py, tests/llm/test_backends/test_nous_autorefresh.py
Adds unit and integration tests covering fallback selection, retryability classification, cross-provider mapping, Nous refresh behavior, and token/trace assertions.

MCP Deployment and Configuration

Layer / File(s) Summary
MCP Docker build and ignores
.dockerignore, mcp/.dockerignore, mcp/Dockerfile, mcp/package.json
Excludes Wrangler artifacts and node_modules from Docker context; replaces MCP Dockerfile with Bun builder → Node 20 runtime, sets CI=true, and runs npx wrangler dev; adds browser field to package.json.
MCP Bun mock server
mcp/run-mocked.ts
Adds Bun-based dev entrypoint that mocks cloudflare:email and serves the MCP worker via Bun.serve.

OpenSpec Documentation and Planning

Layer / File(s) Summary
LLM fallback OpenSpec
openspec/changes/archive/2026-05-15-llm-model-fallback-*/, openspec/specs/llm-model-fallback/spec.md, .env.template
Design, proposal, spec, tasks, and env template updates documenting per-agent fallback configuration, first-failure switching, cross-provider parameter mapping, and observability logging placeholders.
Langfuse generation tracing OpenSpec
openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/, openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/, openspec/specs/observability-langfuse/spec.md
Design, exploration, proposal, tasks and spec formalizing generation observation semantics, token reporting, and summarizer telemetry fixes.
MCP QA verification OpenSpec
openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/, openspec/specs/mcp-qa-verification/spec.md
Architecture explorations, verification report, proposal, spec and tasks for end-to-end MCP QA and strict test-data isolation.
Workspace coordination
openspec/workspace/*
Added journals, leader memories, lessons, and session notes capturing progress, decisions, and verification notes.

Sequence Diagram (high level)

sequenceDiagram
  participant Client
  participant honcho_llm_call
  participant Runtime
  participant Backend
  participant Langfuse
  Client->>honcho_llm_call: request (track_name?)
  honcho_llm_call->>Runtime: plan_attempt(force_fallback?)
  Runtime->>Backend: select client_for_model_config & call
  Backend->>honcho_llm_call: response / AuthenticationError / HTTP error
  honcho_llm_call->>Runtime: classify retryable -> set force_fallback
  honcho_llm_call->>Langfuse: update_current_generation(name/model, metadata, usage_details, is_fallback)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • plastic-labs/honcho#643: modifies client wiring and base-URL/timeouts in src/llm/registry.py and related provider settings (related to LM transport/base-URL wiring).

Suggested reviewers

  • Rajat-Ahuja1997
  • VVoruganti

🐰 "A Rabbit's Ode to Fallback and Flow"

Fast as a hop when retries fail,
Nous refreshes, credentials sail,
Generations trace model and token,
Fallbacks leap where primaries are broken,
Docker, Bun, and specs align—hooray!

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@Rajat-Ahuja1997
Copy link
Copy Markdown
Collaborator

hey @phuongvm thanks for the contribution. it looks like the commits/files changes are adding a lot of changes that are not in the scope of the PR title/description. could you refactor the PR to be focused on the LLM model fallback on failure?

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mcp/src/config.ts (1)

19-23: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Documentation contradicts implementation: baseUrl is now exposed via request header.

The comment at lines 19-23 states:

"It is intentionally not exposed as a request header: routing public requests to an internal URL would be a latency and security regression."

However, line 50 now reads X-Honcho-Base-Url from the request header as the first fallback source, which directly contradicts this stated security principle. If exposing the base URL via header is acceptable, the comment must be updated. If it's not acceptable, the header check should be removed.

📝 Proposed documentation fix

Update the comment to reflect the new multi-source strategy:

 /**
  * Parse configuration from request headers and Worker env bindings.
  * Throws on missing required fields so callers get clear errors.
  *
- * The Honcho API URL is read from the `HONCHO_API_URL` env var when set,
- * allowing operators to run this Worker alongside a self-hosted Honcho
- * instance (see the "Self-Hosted Honcho" section in README.md). It is
- * intentionally not exposed as a request header: routing public requests
- * to an internal URL would be a latency and security regression.
+ * The Honcho API URL is resolved from multiple sources in priority order:
+ * 1. X-Honcho-Base-Url request header (for per-request override)
+ * 2. HONCHO_API_URL environment variable (Worker env or process.env)
+ * 3. Default https://api.honcho.dev
+ *
+ * Operators should validate that X-Honcho-Base-Url is only accepted from
+ * trusted sources to prevent routing public requests to internal endpoints.
  */

Also applies to: 50-50

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/src/config.ts` around lines 19 - 23, The comment claiming the Honcho
baseUrl is "intentionally not exposed as a request header" is out of sync with
the code that reads the X-Honcho-Base-Url header first when resolving baseUrl;
update the comment in mcp/src/config.ts to describe the current multi-source
resolution order (X-Honcho-Base-Url header → HONCHO_API_URL env var → default),
mention the header name `X-Honcho-Base-Url` and the `HONCHO_API_URL` env var,
and note any security considerations for accepting a header-provided `baseUrl`
(or alternatively remove the header-based fallback if you prefer to enforce the
original policy) so the comment matches the behavior of the `baseUrl` resolution
code.
🧹 Nitpick comments (6)
openspec/specs/observability-langfuse/spec.md (1)

4-4: ⚡ Quick win

Address the TBD placeholder in the Purpose section.

The "TBD:" prefix suggests this section is incomplete. Since this is the main observability-langfuse spec (not archived), the purpose statement should be finalized.

📝 Suggested completion
 ## Purpose
-TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho.
+Defines core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho, including generation-type observations, model attribution, and metadata tracking.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openspec/specs/observability-langfuse/spec.md` at line 4, Replace the
placeholder "TBD: Core capabilities for integrating Langfuse LLM observability
and tracing telemetry within Honcho." in the Purpose section with a finalized,
declarative purpose sentence that summarizes the spec (e.g., "Define core
capabilities and integration patterns for Langfuse LLM observability and tracing
telemetry within Honcho, including required events, tracing spans,
configuration, and storage/retention expectations."). Edit the Purpose header
content so it no longer begins with "TBD" and clearly states the goal, scope,
and intended audience for the observability-langfuse spec.
openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md (1)

7-26: ⚡ Quick win

Fix markdown formatting for consistency and readability.

The document has several markdown formatting issues that affect rendering and readability:

  • Headings at lines 7, 10, 13 need blank lines below them
  • Code fence at lines 17-25 needs blank lines before and after
📝 Proposed formatting fixes
 ## The Issue
+
 When integrating Langfuse for LLM traces, models like `gpt-4o-mini` or `claude-3-haiku` automatically displayed token usage in the Langfuse dashboard, while custom models (e.g., `qwen/qwen3.5-9b` hosted via LMStudio) showed empty token fields despite successful completions.
 
 ## Root Cause
+
 Langfuse features an automatic server-side tokenizer that calculates token usage if it recognizes the `model` identifier. If the model is not recognized (as is the case with custom or local models), Langfuse silently fails to compute tokens unless explicit `usage_details` are pushed by the application SDK.
 
 ## The Fix / Lesson
+
 To ensure consistent token logging across **all** models (including custom and local deployments), the SDK must explicitly capture `input_tokens` and `output_tokens` from the underlying client response and report them via the Langfuse generation active observer. 
 
 Example Implementation:
+
 ```python
 if usage_data:
     langfuse.get_client().update_current_generation(
         usage_details={
             "input": input_tokens,
             "output": output_tokens
         }
     )

Always explicitly pass usage to Langfuse to guarantee observability parity regardless of backend.


</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md
around lines 7 - 26, The markdown has rendering issues: add a blank line after
each top-level heading ("The Issue", "Root Cause", "The Fix / Lesson") and
ensure the example code block is a proper fenced block with a language tag and
blank lines before and after it; specifically fix the example so the fenced
block contains the python snippet that calls
langfuse.get_client().update_current_generation(... usage_details={ "input":
input_tokens, "output": output_tokens }) and make sure the closing triple
backticks are present and followed by a blank line before the final sentence
"Always explicitly pass usage to Langfuse..." so headings, fences, and spacing
render correctly.


</details>

</blockquote></details>
<details>
<summary>mcp/src/config.ts (1)</summary><blockquote>

`50-50`: _💤 Low value_

**Unsafe type cast accessing process.env.**

The cast `(globalThis as any).process?.env?.HONCHO_API_URL` bypasses type safety. While the optional chaining prevents runtime errors, TypeScript won't catch if the access pattern changes.




<details>
<summary>♻️ Proposed type-safe alternative</summary>

```diff
-    baseUrl: request.headers.get("X-Honcho-Base-Url")?.trim() || env?.HONCHO_API_URL || (globalThis as any).process?.env?.HONCHO_API_URL || "https://api.honcho.dev",
+    baseUrl: 
+      request.headers.get("X-Honcho-Base-Url")?.trim() || 
+      env?.HONCHO_API_URL || 
+      (typeof process !== 'undefined' ? process.env?.HONCHO_API_URL : undefined) ||
+      "https://api.honcho.dev",

This uses a typeof guard instead of casting globalThis.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/src/config.ts` at line 50, The baseUrl property in config.ts uses an
unsafe cast "(globalThis as any).process?.env?.HONCHO_API_URL"; replace that
cast with a type-safe runtime guard that checks for existence and type (e.g.
typeof (globalThis as any).process === 'object' and typeof (globalThis as
any).process.env === 'object') before reading HONCHO_API_URL, and fall back to
the existing env?.HONCHO_API_URL or default; update the baseUrl expression to
use that checked access instead of the direct cast.
mcp/run-mocked.ts (1)

22-28: ⚡ Quick win

Consider preserving type safety for the fetch handler.

The as any cast on line 25 bypasses type checking for the fetch handler signature. While expedient, this could mask incompatibilities between the imported module's fetch signature and Bun.serve's expectations.

♻️ Proposed type-safe alternative
 import("./src/index.ts").then((module) => {
   const port = process.env.HONCHO_MCP_PORT ? Number(process.env.HONCHO_MCP_PORT) : 8787;
+  
+  // Explicitly type the fetch handler
+  const fetchHandler: (request: Request) => Response | Promise<Response> = 
+    (req) => module.default.fetch(req, {} as any, {} as any);
+  
   Bun.serve({
-    fetch: module.default.fetch as any,
+    fetch: fetchHandler,
     port: port,
   });

Note: The Cloudflare Worker fetch signature includes env and executionCtx parameters that aren't needed in the Bun mock environment, so we pass empty objects for those.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/run-mocked.ts` around lines 22 - 28, The current cast
"module.default.fetch as any" hides type errors; instead create a type-safe
wrapper that matches Bun.serve's expected fetch signature and calls
module.default.fetch with dummy env/executionCtx parameters: implement a small
function (e.g., const bunFetch = (req: Request) => module.default.fetch(req, {},
{})) and pass bunFetch to Bun.serve (replace module.default.fetch as any). This
preserves type safety for Bun.serve while accommodating Cloudflare-style fetch
handlers that expect (request, env, executionCtx).
src/llm/nous_refresh.py (1)

17-17: 💤 Low value

Unused import.

asyncio is imported but never referenced — only httpx.AsyncClient is used for async I/O.

🛠️ Remove unused import
-import asyncio
 import json
 import logging
 import os
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/nous_refresh.py` at line 17, Remove the unused asyncio import from
the top of src/llm/nous_refresh.py: delete the line importing asyncio since only
httpx.AsyncClient is used for async I/O (there are no asyncio.* references in
this module), and run the linter/pytest to confirm no breakages.
tests/llm/test_fallback_integration.py (1)

169-198: 💤 Low value

Consider making the exception assertion more specific.

The test currently catches the base Exception class. While this works for verifying that an error is raised, being more specific would improve the test's precision.

📋 Optional improvement to exception assertion

You could assert that the raised exception has a status_code attribute or is an instance of a specific error type:

         with pytest.raises(Exception):
             await honcho_llm_call(

Could be:

-        with pytest.raises(Exception):
+        with pytest.raises(Exception) as exc_info:
             await honcho_llm_call(
                 model_config=config,
                 prompt="test prompt",
                 max_tokens=100,
                 enable_retry=True,
                 retry_attempts=2,
             )
+        # Verify it's a retryable error
+        assert hasattr(exc_info.value, 'status_code')

However, the current implementation is acceptable for an integration test focused on error propagation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/llm/test_fallback_integration.py` around lines 169 - 198, The test
test_both_primary_and_fallback_fail is too broad by using
pytest.raises(Exception); update it to assert a more specific failure from
honcho_llm_call: either replace pytest.raises(Exception) with a concrete error
type your code raises (e.g., BackendError/HTTPError) or capture the exception
via "with pytest.raises(Exception) as excinfo:" and then assert on
excinfo.value.status_code (or another specific attribute) to ensure the raised
error comes from the FailingBackend/fallback path; reference honcho_llm_call and
FailingBackend when making the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CLAUDE.md`:
- Line 321: Update the misleading example in the line that currently reads
"Non-retryable errors (400, 200, ValueError, etc.)..." by removing the 200
status and replacing it with appropriate non-success examples (e.g., 4xx codes
such as 400 or 404) or rewording to "Non-retryable errors (e.g., 400, 404,
ValueError)" so the phrase "Non-retryable errors" no longer lists a 200 success
code; locate the exact sentence "Non-retryable errors (400, 200, ValueError,
etc.)..." and change it accordingly.

In `@mcp/Dockerfile`:
- Around line 1-24: The image currently runs as root; create a non-root user and
switch to it before the final CMD to satisfy least-privilege checks. Add steps
in the Dockerfile after copying files and before CMD to create a user/group
(e.g., appuser), chown the /app directory to that user, set HOME and USER
environment variables as needed, and add a USER appuser line so the process (the
CMD invoking npx wrangler dev) runs as the non-root account instead of root;
ensure the created user exists in both the builder and runtime stages or that
ownership is adjusted in the runtime stage where CMD is defined.
- Line 24: The Dockerfile currently uses CMD with "npx wrangler dev ..." which
is a dev-only pattern; replace that CMD to invoke the Bun-native mock server
provided in this PR (run-mocked.ts) using the Bun runtime and pass
HONCHO_API_URL via the container environment instead of wrapping it into a --var
flag; locate the CMD line that mentions "wrangler dev" and change it to run the
Bun runtime for run-mocked.ts (or alternately switch to a production path such
as using "wrangler deploy" or a compiled bundle) so the image runs the
production-ready server instead of wrangler dev.

In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md`:
- Line 126: Fix the typo in the "Next command" sentence: change the word
"approplies" to "applies" in the line that reads "**Next command**:
**`/opsx-propose`** (or workspace equivalent) when Commander approplies change
name and scope boundaries." so it reads "...when Commander applies change name
and scope boundaries." and ensure the corrected sentence preserves the bolding
and command formatting for "/opsx-propose".

In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.md`:
- Around line 18-22: Update the verification command example so Python-invoked
tooling is run via the project's virtual env: replace the line containing
"openspec validate --changes --no-interactive --json" with "uv run openspec
validate --changes --no-interactive --json" in the snippet in tasks.md; leave
non-Python tools like the rg commands unchanged so contributors consistently use
uv for Python commands.

In `@openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/design.md`:
- Around line 49-53: The doc lists incorrect env var prefixes for Dialectic and
Dreamer fallbacks—operators will set ignored keys like
DIALECTIC_MODEL_CONFIG__... and DREAM_MODEL_CONFIG__...; update the examples to
the real keys: for Dialectic use per-level prefixes
DIALECTIC_LEVELS__<level>__MODEL_CONFIG__FALLBACK__FALLBACK__TRANSPORT and
DIALECTIC_LEVELS__<level>__MODEL_CONFIG__FALLBACK__FALLBACK__MODEL (replace
<level> with the level name/number), and for Dreamer split the examples into
DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__FALLBACK__TRANSPORT /
DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__FALLBACK__MODEL and
DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__FALLBACK__TRANSPORT /
DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__FALLBACK__MODEL so the documented
FALLBACK__TRANSPORT and FALLBACK__MODEL keys map to the actual config keys.

In
`@openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/proposal.md`:
- Around line 3-8: The proposal incorrectly prescribes global
FALLBACK_MODEL_CONFIG__TRANSPORT and FALLBACK_MODEL_CONFIG__MODEL variables
while the codebase uses per-agent (and per-level for Dialectic) fallback
settings; update the proposal text to reference and instruct configuring the
per-agent fallback keys (e.g., Deriver fallback, Dialectic per-level fallback,
Dreamer fallback, Summary fallback) instead of global FALLBACK_* vars and
include guidance on configuring each agent's specific config keys and rollout
steps to match the actual settings surface used by ModelConfig.fallback.

In
`@openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/specs/llm-model-fallback/spec.md`:
- Around line 11-12: Update the env var names in the spec to match the actual
configuration shape: replace uses of DIALECTIC_MODEL_CONFIG__FALLBACK__TRANSPORT
and DIALECTIC_MODEL_CONFIG__FALLBACK__MODEL with the per-level Dialectic
fallback keys (e.g., DIALECTIC__<LEVEL>__FALLBACK__TRANSPORT and
DIALECTIC__<LEVEL>__FALLBACK__MODEL) and replace DREAM_MODEL_CONFIG__... with
the Dream deduction/induction keys (e.g., DREAM__DEDUCTION__MODEL /
DREAM__INDUCTION__MODEL and corresponding TRANSPORT keys); apply the same naming
corrections to the other occurrences referenced (lines 15-16) so the spec
matches the executable config shape.

In `@openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/tasks.md`:
- Around line 12-13: The checklist items reference outdated filenames
(.env.example and docker-compose/honcho/.env); update the task lines in tasks.md
to the current file paths used by this change (replace `.env.example` with the
actual env file name used by the repo and replace `docker-compose/honcho/.env`
with the correct docker/ honcho env path), and mirror those same updates for the
corresponding entries mentioned on lines 18-20 so all checklist items point to
the real files for accurate traceability.

In `@openspec/specs/llm-model-fallback/spec.md`:
- Line 4: Replace the placeholder Purpose text "TBD - created by archiving
change llm-model-fallback-2026-05-14" in the spec's Purpose section with a
concise, descriptive statement of intent: describe the goal of the
llm-model-fallback spec, the problem it addresses, intended users/consumers, and
the expected behavior or outcomes (e.g., how fallback selection should work,
criteria for model choice, and any constraints). Ensure the Purpose is a few
clear sentences that make the spec discoverable and owned rather than leaving
the archived placeholder.
- Around line 14-15: The spec currently references non-existent top-level env
keys (DIALECTIC_MODEL_CONFIG__FALLBACK__TRANSPORT and
DIALECTIC_MODEL_CONFIG__FALLBACK__MODEL); update the normative scenarios to use
the implemented per-level Dialectic fallback keys (e.g.,
DIALECTIC_MODEL_CONFIG__<LEVEL>__FALLBACK__TRANSPORT and
DIALECTIC_MODEL_CONFIG__<LEVEL>__FALLBACK__MODEL) and adjust Dream agent cases
to use separate deduction/induction fallback keys (e.g.,
DREAM_MODEL_CONFIG__DEDUCTION__FALLBACK__* and
DREAM_MODEL_CONFIG__INDUCTION__FALLBACK__*) so the spec matches the actual
configuration shape and avoids invalid operator configuration.

In `@openspec/specs/mcp-qa-verification/spec.md`:
- Line 4: Replace the placeholder "TBD: Verification protocols and requirements
for testing Honcho MCP tools." in the Purpose section with a concise purpose
statement that summarizes the spec's intent (e.g., describe that this document
defines verification protocols, E2E tool verification criteria, data isolation
requirements, and acceptance criteria for Honcho MCP tools), ensuring the new
text appears under the Purpose header and clearly frames the requirements that
follow.

In `@src/llm/backends/openai.py`:
- Around line 117-141: The _call_with_autorefresh helper is implemented but
never used while its logic is duplicated three times; replace the duplicated
401-retry blocks in the parse, create, and stream code paths with calls to
_call_with_autorefresh(use_parse=..., params=...) so all retries go through that
single function, and inside _call_with_autorefresh change the silent "except
Exception: pass" around settings.LLM.NOUS_API_KEY = new_key to catch the
exception and log it (e.g., logger.exception or logger.error with context)
before continuing; ensure the helper still re-attempts the request after setting
self._client.api_key and returns the retried result or re-raises the original
AuthenticationError if refresh fails.
- Around line 8-11: The code imports AsyncOpenAI and AuthenticationError but
misses the exceptions used in the structured-output parsing paths; add imports
for BadRequestError and LengthFinishReasonError from the openai package so the
exception handlers that reference LengthFinishReasonError and BadRequestError
(used around the structured-output parsing logic that follows AsyncOpenAI usage)
will work correctly instead of raising NameError.

In `@src/llm/nous_refresh.py`:
- Around line 130-148: The function update_env_key currently returns early when
env_path doesn't exist, which breaks tests and first-run behavior; change
update_env_key to create the .env file instead of skipping: if not
env_path.exists() then create parent directories as needed, initialize lines to
an empty list (or include a leading newline if you want separation), and proceed
to append/write "LLM_NOUS_API_KEY={new_key}\n" (preserving the existing logic
for updating when the key exists); ensure you still call
env_path.write_text("".join(lines)) and logger.info("Updated .env with new Nous
API key") so tests and runtime see the file created.
- Around line 226-234: The code silently fails importing settings because it
uses `from honcho.config import settings`; replace that import with `from
src.config import settings` in src/llm/nous_refresh.py so the in-memory update
to settings.LLM.NOUS_API_KEY actually runs, and change the bare except to catch
ImportError (or log the exception) so failures aren't silently swallowed; keep
the logger.info("In-memory settings.LLM.NOUS_API_KEY updated") on success and
log a debug or warning with the exception on failure.

In `@tests/llm/test_nous_refresh.py`:
- Around line 22-25: The test test_load_state_missing is flaky because
load_state() reads NOUS_REFRESH_TOKEN from the environment; make the test
hermetic by deleting that env var before calling load_state: in
test_load_state_missing use the pytest monkeypatch fixture to call
monkeypatch.delenv("NOUS_REFRESH_TOKEN", raising=False) (keep the existing
patch.object(Path, "exists", return_value=False)) so the function cannot see an
external NOUS_REFRESH_TOKEN and the assertion load_state() == {} becomes
reliable.
- Around line 78-88: The test test_find_project_root_with_dotenv is dead code
with no assertions; replace the patch/pass block with a real exercise of
_find_project_root: change the working directory into the deep directory (use
monkeypatch.chdir(deep) or equivalent), call
src.llm.nous_refresh._find_project_root() and assert the returned Path equals
tmp_path (the directory containing the created .env), or if you prefer remove
the test entirely; update references to the test name and _find_project_root to
ensure the test either asserts correct behavior or is deleted to avoid
misleading coverage.
- Around line 155-158: Move the three delayed imports into the module import
block at the top of the test file: relocate "from types import SimpleNamespace",
"from unittest.mock import Mock", and "import sys" next to the other imports so
they're available before tests reference SimpleNamespace, Mock, and sys (used in
tests around usages of SimpleNamespace and Mock); remove the duplicate or
trailing imports at L156-L158 to keep import order clear and consistent.

---

Outside diff comments:
In `@mcp/src/config.ts`:
- Around line 19-23: The comment claiming the Honcho baseUrl is "intentionally
not exposed as a request header" is out of sync with the code that reads the
X-Honcho-Base-Url header first when resolving baseUrl; update the comment in
mcp/src/config.ts to describe the current multi-source resolution order
(X-Honcho-Base-Url header → HONCHO_API_URL env var → default), mention the
header name `X-Honcho-Base-Url` and the `HONCHO_API_URL` env var, and note any
security considerations for accepting a header-provided `baseUrl` (or
alternatively remove the header-based fallback if you prefer to enforce the
original policy) so the comment matches the behavior of the `baseUrl` resolution
code.

---

Nitpick comments:
In `@mcp/run-mocked.ts`:
- Around line 22-28: The current cast "module.default.fetch as any" hides type
errors; instead create a type-safe wrapper that matches Bun.serve's expected
fetch signature and calls module.default.fetch with dummy env/executionCtx
parameters: implement a small function (e.g., const bunFetch = (req: Request) =>
module.default.fetch(req, {}, {})) and pass bunFetch to Bun.serve (replace
module.default.fetch as any). This preserves type safety for Bun.serve while
accommodating Cloudflare-style fetch handlers that expect (request, env,
executionCtx).

In `@mcp/src/config.ts`:
- Line 50: The baseUrl property in config.ts uses an unsafe cast "(globalThis as
any).process?.env?.HONCHO_API_URL"; replace that cast with a type-safe runtime
guard that checks for existence and type (e.g. typeof (globalThis as
any).process === 'object' and typeof (globalThis as any).process.env ===
'object') before reading HONCHO_API_URL, and fall back to the existing
env?.HONCHO_API_URL or default; update the baseUrl expression to use that
checked access instead of the direct cast.

In `@openspec/specs/observability-langfuse/spec.md`:
- Line 4: Replace the placeholder "TBD: Core capabilities for integrating
Langfuse LLM observability and tracing telemetry within Honcho." in the Purpose
section with a finalized, declarative purpose sentence that summarizes the spec
(e.g., "Define core capabilities and integration patterns for Langfuse LLM
observability and tracing telemetry within Honcho, including required events,
tracing spans, configuration, and storage/retention expectations."). Edit the
Purpose header content so it no longer begins with "TBD" and clearly states the
goal, scope, and intended audience for the observability-langfuse spec.

In
`@openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md`:
- Around line 7-26: The markdown has rendering issues: add a blank line after
each top-level heading ("The Issue", "Root Cause", "The Fix / Lesson") and
ensure the example code block is a proper fenced block with a language tag and
blank lines before and after it; specifically fix the example so the fenced
block contains the python snippet that calls
langfuse.get_client().update_current_generation(... usage_details={ "input":
input_tokens, "output": output_tokens }) and make sure the closing triple
backticks are present and followed by a blank line before the final sentence
"Always explicitly pass usage to Langfuse..." so headings, fences, and spacing
render correctly.

In `@src/llm/nous_refresh.py`:
- Line 17: Remove the unused asyncio import from the top of
src/llm/nous_refresh.py: delete the line importing asyncio since only
httpx.AsyncClient is used for async I/O (there are no asyncio.* references in
this module), and run the linter/pytest to confirm no breakages.

In `@tests/llm/test_fallback_integration.py`:
- Around line 169-198: The test test_both_primary_and_fallback_fail is too broad
by using pytest.raises(Exception); update it to assert a more specific failure
from honcho_llm_call: either replace pytest.raises(Exception) with a concrete
error type your code raises (e.g., BackendError/HTTPError) or capture the
exception via "with pytest.raises(Exception) as excinfo:" and then assert on
excinfo.value.status_code (or another specific attribute) to ensure the raised
error comes from the FailingBackend/fallback path; reference honcho_llm_call and
FailingBackend when making the assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dabf06a1-1d82-4339-92bf-950b96274ba4

📥 Commits

Reviewing files that changed from the base of the PR and between 8fcbb54 and da417b2.

⛔ Files ignored due to path filters (1)
  • mcp/bun.lock is excluded by !**/*.lock
📒 Files selected for processing (57)
  • .dockerignore
  • .env.template
  • CLAUDE.md
  • mcp/.dockerignore
  • mcp/Dockerfile
  • mcp/package.json
  • mcp/run-mocked.ts
  • mcp/src/config.ts
  • mcp/src/index.ts
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yaml
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/design.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/explorations/2026-05-05-nested-langfuse-spans.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/proposal.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/.gitkeep
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/observability-langfuse/spec.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yaml
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/design.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/proposal.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/specs/observability-langfuse/spec.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/.openspec.yaml
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/design.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-mcp-architecture.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-philosophy-and-usecases.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/proposal.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/specs/mcp-qa-verification/spec.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/tasks.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/design.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/proposal.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/specs/llm-model-fallback/spec.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/tasks.md
  • openspec/specs/llm-model-fallback/spec.md
  • openspec/specs/mcp-qa-verification/spec.md
  • openspec/specs/observability-langfuse/spec.md
  • openspec/workspace/journal/2026-05-05/agent_share_afternoon.md
  • openspec/workspace/memories/leader/2026-05-05/archive_01.md
  • openspec/workspace/memories/leader/2026-05-05/current.md
  • openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md
  • openspec/workspace/sessions/agent_share.md
  • src/config.py
  • src/llm/api.py
  • src/llm/backends/openai.py
  • src/llm/credentials.py
  • src/llm/nous_refresh.py
  • src/llm/registry.py
  • src/llm/runtime.py
  • src/llm/tool_loop.py
  • src/telemetry/logging.py
  • src/utils/summarizer.py
  • tests/llm/test_backends/test_nous_autorefresh.py
  • tests/llm/test_fallback.py
  • tests/llm/test_fallback_integration.py
  • tests/llm/test_nous_refresh.py
  • tests/utils/test_clients.py

Comment thread CLAUDE.md
- `OSError`
- SDK-specific: `APIConnectionError`, `APITimeoutError`, `InternalServerError`, `ServiceUnavailableError`, `RateLimitError`

Non-retryable errors (400, 200, ValueError, etc.) do NOT trigger fallback — they follow normal retry behavior.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify success vs error semantics in non-retryable list.

Including 200 in the non-retryable error examples is misleading because a 200 response is successful and should not enter retry/error handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` at line 321, Update the misleading example in the line that
currently reads "Non-retryable errors (400, 200, ValueError, etc.)..." by
removing the 200 status and replacing it with appropriate non-success examples
(e.g., 4xx codes such as 400 or 404) or rewording to "Non-retryable errors
(e.g., 400, 404, ValueError)" so the phrase "Non-retryable errors" no longer
lists a 200 success code; locate the exact sentence "Non-retryable errors (400,
200, ValueError, etc.)..." and change it accordingly.

Comment thread mcp/Dockerfile Outdated
Comment on lines +1 to +24
# --- Stage 1: Lightning Fast Dependencies ---
FROM oven/bun:1 AS builder
WORKDIR /app

# Copy configuration and use Bun's superior async network stack to bypass WSL MTU hangs
COPY package.json ./
RUN bun install

# Copy source code
COPY . .

# --- Stage 2: Node Native Runtime ---
FROM node:22
WORKDIR /app

# Ingest all the raw files (including node_modules) built gracefully by Bun
COPY --from=builder /app /app

# Setup environment
ENV CI=true

# Revert boot command to native NPM/Node context to bypass Wrangler's anti-Bun blocker
# Crucially, we wrap it in a shell to inject the Docker container's environment variable into Wrangler's isolated sandbox
CMD ["sh", "-c", "npx wrangler dev src/index.ts --port 8787 --ip 0.0.0.0 --var HONCHO_API_URL:$HONCHO_API_URL"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a non-root user for security.

The container runs as root, which violates the principle of least privilege and increases attack surface. Trivy static analysis flagged this as DS-0002.

🔒 Proposed fix

Add these lines before the final CMD:

 # Setup environment 
 ENV CI=true
 
+# Create non-root user
+RUN groupadd -r mcpuser && useradd -r -g mcpuser mcpuser && \
+    chown -R mcpuser:mcpuser /app
+USER mcpuser
+
 # Revert boot command to native NPM/Node context to bypass Wrangler's anti-Bun blocker
🧰 Tools
🪛 Trivy (0.69.3)

[error] 1-1: Image user should not be 'root'

Specify at least 1 USER command in Dockerfile with non-root user as argument

Rule: DS-0002

Learn more

(IaC/Dockerfile)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/Dockerfile` around lines 1 - 24, The image currently runs as root; create
a non-root user and switch to it before the final CMD to satisfy least-privilege
checks. Add steps in the Dockerfile after copying files and before CMD to create
a user/group (e.g., appuser), chown the /app directory to that user, set HOME
and USER environment variables as needed, and add a USER appuser line so the
process (the CMD invoking npx wrangler dev) runs as the non-root account instead
of root; ensure the created user exists in both the builder and runtime stages
or that ownership is adjusted in the runtime stage where CMD is defined.

Comment thread mcp/Dockerfile Outdated

# Revert boot command to native NPM/Node context to bypass Wrangler's anti-Bun blocker
# Crucially, we wrap it in a shell to inject the Docker container's environment variable into Wrangler's isolated sandbox
CMD ["sh", "-c", "npx wrangler dev src/index.ts --port 8787 --ip 0.0.0.0 --var HONCHO_API_URL:$HONCHO_API_URL"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Using wrangler dev in production is a critical deployment anti-pattern.

The CMD runs wrangler dev, which is explicitly designed for local development with hot-reload, verbose logging, and no production optimizations. For Docker deployments, you should either:

  1. Use wrangler deploy (if targeting Cloudflare Workers), or
  2. Compile a standalone bundle using bun build --compile or similar, or
  3. Use the native Bun runtime via run-mocked.ts (which this PR already provides)

The shell wrapper workaround to inject $HONCHO_API_URL into --var further suggests this isn't the intended deployment path.

🚀 Proposed fix: Use the Bun native runtime

Replace line 24 with:

-CMD ["sh", "-c", "npx wrangler dev src/index.ts --port 8787 --ip 0.0.0.0 --var HONCHO_API_URL:$HONCHO_API_URL"]
+CMD ["bun", "run", "run-mocked.ts"]

This uses the native Bun mock server already provided in this PR, which is production-ready and doesn't require Wrangler's dev mode.

Alternatively, if Node compatibility is required:

-CMD ["sh", "-c", "npx wrangler dev src/index.ts --port 8787 --ip 0.0.0.0 --var HONCHO_API_URL:$HONCHO_API_URL"]
+CMD ["node", "--import", "tsx", "run-mocked.ts"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/Dockerfile` at line 24, The Dockerfile currently uses CMD with "npx
wrangler dev ..." which is a dev-only pattern; replace that CMD to invoke the
Bun-native mock server provided in this PR (run-mocked.ts) using the Bun runtime
and pass HONCHO_API_URL via the container environment instead of wrapping it
into a --var flag; locate the CMD line that mentions "wrangler dev" and change
it to run the Bun runtime for run-mocked.ts (or alternately switch to a
production path such as using "wrangler deploy" or a compiled bundle) so the
image runs the production-ready server instead of wrangler dev.

## 10. Status

**Exploration**: Closed with **decision to formalize via OpenSpec proposal** (recommended Option A).
**Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander approplies change name and scope boundaries.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in the next-command sentence.

Line 126 contains approplies, which appears to be a typo. Please correct it to improve command guidance readability.

🧰 Tools
🪛 LanguageTool

[grammar] ~126-~126: Ensure spelling is correct
Context: ...or workspace equivalent) when Commander approplies change name and scope boundaries.

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md`
at line 126, Fix the typo in the "Next command" sentence: change the word
"approplies" to "applies" in the line that reads "**Next command**:
**`/opsx-propose`** (or workspace equivalent) when Commander approplies change
name and scope boundaries." so it reads "...when Commander applies change name
and scope boundaries." and ensure the corrected sentence preserves the bolding
and command formatting for "/opsx-propose".

Comment on lines +18 to +22
openspec validate --changes --no-interactive --json
rg "@conditional_observe\(name=\"LLM Call\", as_type=\"generation\"\)" src/llm/api.py
rg "update_current_generation\(" src/llm/runtime.py
rg "update_current_generation\.assert_called_once_with" tests/utils/test_clients.py
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use uv run in verification command examples.

Line 18 currently shows openspec validate ...; use uv run openspec validate ... (and keep Python-invoked tooling under uv) so contributors run the documented checks in the project environment consistently.

Based on learnings: Always use 'uv run' or 'uv' prefix for Python commands to ensure virtual environment usage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.md`
around lines 18 - 22, Update the verification command example so Python-invoked
tooling is run via the project's virtual env: replace the line containing
"openspec validate --changes --no-interactive --json" with "uv run openspec
validate --changes --no-interactive --json" in the snippet in tasks.md; leave
non-Python tools like the rg commands unchanged so contributors consistently use
uv for Python commands.

Comment thread src/llm/nous_refresh.py
Comment thread src/llm/nous_refresh.py
Comment thread tests/llm/test_nous_refresh.py Outdated
Comment thread tests/llm/test_nous_refresh.py Outdated
Comment on lines +155 to +158
# Helper for SimpleNamespace
from types import SimpleNamespace
from unittest.mock import Mock
import sys
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Imports placed at end of file — confusing and works only by accident.

SimpleNamespace, Mock, and sys are used at L112, L114, L119, L121, L139, L140, but imported at L156-158. Python tolerates this because module-level code executes before any test function runs, but it makes the file very hard to scan and will trip readers expecting standard import order at the top. Move these next to the other imports.

🛠️ Move to the imports block at top of file
 import json
 import os
+import sys
 from pathlib import Path
-from unittest.mock import AsyncMock, patch
+from types import SimpleNamespace
+from unittest.mock import AsyncMock, Mock, patch

 import pytest
 ...
-# Helper for SimpleNamespace
-from types import SimpleNamespace
-from unittest.mock import Mock
-import sys
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Helper for SimpleNamespace
from types import SimpleNamespace
from unittest.mock import Mock
import sys
import json
import os
import sys
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import AsyncMock, Mock, patch
import pytest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/llm/test_nous_refresh.py` around lines 155 - 158, Move the three
delayed imports into the module import block at the top of the test file:
relocate "from types import SimpleNamespace", "from unittest.mock import Mock",
and "import sys" next to the other imports so they're available before tests
reference SimpleNamespace, Mock, and sys (used in tests around usages of
SimpleNamespace and Mock); remove the duplicate or trailing imports at L156-L158
to keep import order clear and consistent.

@phuongvm phuongvm force-pushed the feat/llm-model-fallback branch from da417b2 to 3a03615 Compare May 16, 2026 06:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/specs/mcp-qa-verification/spec.md (1)

11-16: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align cleanup requirement with non-deletable peer behavior.

Line 11 says “All test data MUST be deleted,” but the scenario on Line 15 only requires deleting sessions/conclusions. Please explicitly exclude non-deletable peer records or require peer tagging + archival-safe retention.

📝 Proposed wording
-The verification process MUST operate exclusively on test-specific entities (e.g. peers prefixed with `test-qa-`, isolated sessions). It MUST NOT mutate or delete existing production data. All test data MUST be deleted at the end.
+The verification process MUST operate exclusively on test-specific entities (e.g. peers prefixed with `test-qa-`, isolated sessions). It MUST NOT mutate or delete existing production data. All deletable test entities (at minimum sessions and conclusions) MUST be deleted at the end.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/specs/mcp-qa-verification/spec.md`
around lines 11 - 16, Update the "Test cleanup" requirement to exclude
non-deletable peer records: change the blanket "All test data MUST be deleted"
to a constrained rule that deletes all test-specific entities (e.g., sessions,
conclusions, resources prefixed with `test-qa-`) while explicitly excluding
immutable/non-deletable peer records, and add a requirement that peers that
cannot be deleted must be tagged as test artifacts and moved to archival-safe
retention (or flagged for retention) instead of deletion; reference the "Test
cleanup" scenario, the phrase "All test data MUST be deleted", and the peer
prefix `test-qa-` when making the wording change.
openspec/workspace/memories/leader/2026-05-05/archive_01.md (1)

8-29: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add blank lines after section headings to satisfy markdownlint.

Line 8, Line 16, Line 20, Line 23, and Line 26 need one blank line below each heading (MD022).

Suggested patch
 ## What Was Done
+
 1. Conducted an investigation into Langfuse telemetry for Honcho LLM traces, focusing on missing model metadata in the dashboard for background tasks (`Dialectic Agent`, `Minimal Deriver`).
 ...
 ## Decisions Made
+
 - Added explicit `usage_details` to the active generation instead of relying on Langfuse's server-side auto-computation, ensuring consistent observability for custom models.
 ...
 ## Blockers
+
 - None.
 
 ## Next Steps
+
 - Await next user instructions for any subsequent capabilities or changes.
 
 ## Key Learnings
+
 - Container processes caching Python modules must be explicitly restarted during active QA.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openspec/workspace/memories/leader/2026-05-05/archive_01.md` around lines 8 -
29, Add a single blank line after each top-level section heading in the markdown
to satisfy markdownlint MD022: insert one empty line immediately after the
headings "## What Was Done", "## Decisions Made", "## Blockers", "## Next
Steps", and "## Key Learnings" so each heading is followed by a blank line
before the content.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-philosophy-and-usecases.md`:
- Around line 321-326: The document excerpt leaks real participant identifiers
(e.g., "hermes" and "phuong_lambert") in the discussion of the peer_id vs
target_peer_id paradigm; redact these by replacing actual names with sanitized
placeholders (e.g., "<participant_A>", "<participant_B>") throughout the
paragraph and the four listed collection examples, and ensure any other
occurrences of peer_id and target_peer_id examples use the sanitized
placeholders consistently so committed history contains no real identifiers.

In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md`:
- Around line 18-35: The verification artifact contains real environment
identifiers; replace sensitive values in the JSON fields workspace_id,
peer_count, peers (each object id values) and session_count, sessions (each id
values) with synthetic placeholders (e.g. "test-workspace", "test-peer-1", ...,
"session-1", and matching counts) in the mcp verification report so the peers
array entries like {"id": "phuong_lambert"} and {"id": "hermes"} and sessions
like {"id": "hermes-agent"} are sanitized; ensure peer_count and session_count
are updated to reflect the new fixture counts and keep metadata/configuration
objects unchanged.

In `@openspec/specs/mcp-qa-verification/spec.md`:
- Around line 16-20: Update the "verification process" and "Scenario: Test
cleanup" wording to enumerate exactly which test entities must be deleted and
how deletion is verified: replace the vague "all test data" with a concrete list
(e.g., peers prefixed with `test-qa-`, isolated sessions, test conclusions,
test-only artifacts) and add an explicit verification step that these listed
entities no longer exist (e.g., "THEN a discovery query returns zero peers
matching prefix `test-qa-` and no sessions or conclusions with the test flag are
present"); ensure the Scenario includes both the deletion action and an
objective check (query or API result) for each entity type so compliance is
programmatically testable.

In `@openspec/specs/observability-langfuse/spec.md`:
- Around line 3-4: Replace the placeholder purpose under the "## Purpose"
heading (the line starting with "TBD: Core capabilities for integrating
Langfuse...") with a concrete scope statement that summarizes the capability's
intent, key responsibilities, and boundaries (e.g., what telemetry/tracing will
be provided, supported integrations, expected outputs, and non-goals); update
the sentence so it reads as a clear one- or two-sentence scope for the
observability-langfuse spec rather than "TBD."

In `@openspec/workspace/memories/leader/2026-05-05/current.md`:
- Around line 6-17: Add a blank line immediately after each Markdown heading
"Current Status", "Active Focus", and "Critical Context" in the file (these are
the headers in current.md) to satisfy MD022; ensure there is an empty line
between each header line and the following paragraph/list so the sections read:
"Current Status" + blank line + content, "Active Focus" + blank line + content,
and "Critical Context" + blank line + content.

In
`@openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md`:
- Around line 7-26: Add missing blank lines to satisfy markdownlint: insert a
blank line after the headings "The Issue", "Root Cause", and "The Fix / Lesson"
and ensure there is a blank line immediately before and after the fenced code
block containing the example update_current_generation usage (the
triple-backtick block around the python example); this will resolve MD022 and
MD031 warnings without changing the content.

In `@src/llm/api.py`:
- Around line 271-275: In the function that classifies retryable HTTP exceptions
(e.g., is_retryable_exception(exc) in src/llm/api.py), extend the HTTP status
check so that HTTP 408 is treated as retryable: update the condition that
currently returns status == 429 or (500 <= status < 600) to also return true for
status == 408 (e.g., status == 429 or status == 408 or (500 <= status < 600));
this ensures a 408 timeout will trigger the fast-failover to the fallback model
on the first retryable error.
- Around line 194-197: The ContextVars current_attempt and force_fallback are
being set (current_attempt.set(1), force_fallback.set(False)) but never
restored, so nested calls (honcho_llm_call) leak state; capture the tokens
returned by .set(...) when you initialize these vars and wrap the call logic in
a try/finally so you reset the tokens in the finally block to restore prior
task-local state; locate the setup in honcho_llm_call (and any early return
paths) and ensure every return path happens after the finally resets the tokens.
Also update _is_retryable_error to treat HTTP status == 408 as retryable in the
same status-code checks (add status == 408 alongside existing timeout/status
checks).

In `@src/llm/nous_refresh.py`:
- Around line 71-80: save_state() currently writes STATE_FILE and its parent
directory with default process umask, leaving them potentially world-readable;
ensure the directory and file are hardened by setting restrictive permissions:
make STATE_FILE.parent directory mode 0o700 (use os.chmod on STATE_FILE.parent
after mkdir) and ensure the temp file (tmp) is created with mode 0o600 (either
create securely via os.open with mode 0o600 or write then call os.chmod(tmp,
0o600)) before calling tmp.replace(STATE_FILE) so the final persisted STATE_FILE
is non-world-readable; reference save_state, STATE_FILE, STATE_FILE.parent and
tmp when applying the changes.

In `@tests/llm/test_fallback_integration.py`:
- Line 191: Replace the broad pytest.raises(Exception) with a specific test
exception: add a small custom exception class (e.g., ExpectedTestError) near the
top of tests/llm/test_fallback_integration.py, update the test to use
pytest.raises(ExpectedTestError) instead of Exception, and ensure the code path
exercised by the test (the function or method under test referenced in the test)
raises ExpectedTestError so the test only catches the intended failure.

In `@tests/llm/test_nous_refresh.py`:
- Around line 110-126: The test creates mock_client but never injects it, so
refresh_nous_credentials() ends up creating real httpx.AsyncClient instances in
refresh_access_token() and mint_agent_key(); before calling
refresh_nous_credentials(), monkeypatch httpx.AsyncClient in the refresh_mod
module to return the prepared mock_client (i.e., set
refresh_mod.httpx.AsyncClient to a lambda/factory that returns mock_client) so
both AsyncClient usages use the mock and the test becomes hermetic.

---

Outside diff comments:
In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/specs/mcp-qa-verification/spec.md`:
- Around line 11-16: Update the "Test cleanup" requirement to exclude
non-deletable peer records: change the blanket "All test data MUST be deleted"
to a constrained rule that deletes all test-specific entities (e.g., sessions,
conclusions, resources prefixed with `test-qa-`) while explicitly excluding
immutable/non-deletable peer records, and add a requirement that peers that
cannot be deleted must be tagged as test artifacts and moved to archival-safe
retention (or flagged for retention) instead of deletion; reference the "Test
cleanup" scenario, the phrase "All test data MUST be deleted", and the peer
prefix `test-qa-` when making the wording change.

In `@openspec/workspace/memories/leader/2026-05-05/archive_01.md`:
- Around line 8-29: Add a single blank line after each top-level section heading
in the markdown to satisfy markdownlint MD022: insert one empty line immediately
after the headings "## What Was Done", "## Decisions Made", "## Blockers", "##
Next Steps", and "## Key Learnings" so each heading is followed by a blank line
before the content.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b06f5ef3-cb56-43bd-a70c-82681ed629b8

📥 Commits

Reviewing files that changed from the base of the PR and between da417b2 and 3a03615.

⛔ Files ignored due to path filters (1)
  • mcp/bun.lock is excluded by !**/*.lock
📒 Files selected for processing (55)
  • .dockerignore
  • .env.template
  • CLAUDE.md
  • mcp/.dockerignore
  • mcp/Dockerfile
  • mcp/package.json
  • mcp/run-mocked.ts
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yaml
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/design.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/explorations/2026-05-05-nested-langfuse-spans.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/proposal.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/.gitkeep
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/observability-langfuse/spec.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yaml
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/design.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/proposal.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/specs/observability-langfuse/spec.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/.openspec.yaml
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/design.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-mcp-architecture.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-philosophy-and-usecases.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/proposal.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/specs/mcp-qa-verification/spec.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/tasks.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/design.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/proposal.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/specs/llm-model-fallback/spec.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/tasks.md
  • openspec/specs/llm-model-fallback/spec.md
  • openspec/specs/mcp-qa-verification/spec.md
  • openspec/specs/observability-langfuse/spec.md
  • openspec/workspace/journal/2026-05-05/agent_share_afternoon.md
  • openspec/workspace/memories/leader/2026-05-05/archive_01.md
  • openspec/workspace/memories/leader/2026-05-05/current.md
  • openspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.md
  • openspec/workspace/sessions/agent_share.md
  • src/config.py
  • src/llm/api.py
  • src/llm/backends/openai.py
  • src/llm/credentials.py
  • src/llm/nous_refresh.py
  • src/llm/registry.py
  • src/llm/runtime.py
  • src/llm/tool_loop.py
  • src/telemetry/logging.py
  • src/utils/summarizer.py
  • tests/llm/test_backends/test_nous_autorefresh.py
  • tests/llm/test_fallback.py
  • tests/llm/test_fallback_integration.py
  • tests/llm/test_nous_refresh.py
  • tests/utils/test_clients.py
✅ Files skipped from review due to trivial changes (16)
  • openspec/workspace/journal/2026-05-05/agent_share_afternoon.md
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yaml
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/.openspec.yaml
  • mcp/.dockerignore
  • openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/design.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yaml
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/design.md
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/design.md
  • .env.template
  • openspec/changes/archive/2026-05-15-llm-model-fallback-2026-05-14/tasks.md
  • .dockerignore
  • CLAUDE.md
  • openspec/workspace/sessions/agent_share.md
  • openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/proposal.md
🚧 Files skipped from review as they are similar to previous changes (12)
  • mcp/package.json
  • tests/utils/test_clients.py
  • mcp/Dockerfile
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/tasks.md
  • mcp/run-mocked.ts
  • src/telemetry/logging.py
  • tests/llm/test_backends/test_nous_autorefresh.py
  • src/config.py
  • tests/llm/test_fallback.py
  • src/llm/tool_loop.py
  • src/llm/registry.py
  • src/utils/summarizer.py

Comment on lines +16 to +20
The verification process MUST operate exclusively on test-specific entities (e.g. peers prefixed with `test-qa-`, isolated sessions). It MUST NOT mutate or delete existing production data. All test data MUST be deleted at the end.

#### Scenario: Test cleanup
- **WHEN** the QA sequence completes
- **THEN** all test entities (sessions, conclusions) must be explicitly deleted.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify cleanup acceptance criteria to avoid impossible interpretation.

The requirement says all test data must be deleted, but the scenario validates only sessions/conclusions. Please tighten wording so compliance is objectively testable.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 18-18: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openspec/specs/mcp-qa-verification/spec.md` around lines 16 - 20, Update the
"verification process" and "Scenario: Test cleanup" wording to enumerate exactly
which test entities must be deleted and how deletion is verified: replace the
vague "all test data" with a concrete list (e.g., peers prefixed with
`test-qa-`, isolated sessions, test conclusions, test-only artifacts) and add an
explicit verification step that these listed entities no longer exist (e.g.,
"THEN a discovery query returns zero peers matching prefix `test-qa-` and no
sessions or conclusions with the test flag are present"); ensure the Scenario
includes both the deletion action and an objective check (query or API result)
for each entity type so compliance is programmatically testable.

Comment on lines +3 to +4
## Purpose
TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace placeholder purpose with a concrete scope statement.

Line 4 still says TBD, which leaves this capability underspecified at the top-level and makes requirement intent harder to scan.

📝 Proposed fix
 ## Purpose
-TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho.
+Define Langfuse observability requirements for Honcho LLM calls, including generation-type tracing, model attribution from attempt planning, metadata preservation, and summarization trace shape.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Purpose
TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho.
## Purpose
Define Langfuse observability requirements for Honcho LLM calls, including generation-type tracing, model attribution from attempt planning, metadata preservation, and summarization trace shape.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openspec/specs/observability-langfuse/spec.md` around lines 3 - 4, Replace
the placeholder purpose under the "## Purpose" heading (the line starting with
"TBD: Core capabilities for integrating Langfuse...") with a concrete scope
statement that summarizes the capability's intent, key responsibilities, and
boundaries (e.g., what telemetry/tracing will be provided, supported
integrations, expected outputs, and non-goals); update the sentence so it reads
as a clear one- or two-sentence scope for the observability-langfuse spec rather
than "TBD."

Comment on lines +6 to +17
## Current Status
- Successfully archived `honcho-langfuse-generation-traces`.
- Tracing infrastructure is fully operational. Both API and background workers (Deriver) are logging `GENERATION` traces correctly.
- Custom models (e.g. `lmstudio` `qwen3.5`) are actively reporting explicit `usage_details` (tokens) to Langfuse.
- Main specifications for `observability-langfuse` have been established and updated in the project repository.

## Active Focus
- Standing by for new OpenSpec initiatives or orchestration tasks.

## Critical Context
- Langfuse configuration relies on explicitly passing `as_type="generation"` to the `@conditional_observe` decorator in `honcho_llm_call`.
- Explicit `usage_details` must be provided to Langfuse active generation object to track tokens for custom models reliably.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Insert blank lines below headings for MD022 compliance.

Line 6, Line 12, and Line 15 should each be followed by a blank line.

Suggested patch
 ## Current Status
+
 - Successfully archived `honcho-langfuse-generation-traces`.
 ...
 ## Active Focus
+
 - Standing by for new OpenSpec initiatives or orchestration tasks.
 
 ## Critical Context
+
 - Langfuse configuration relies on explicitly passing `as_type="generation"` to the `@conditional_observe` decorator in `honcho_llm_call`.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 6-6: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 12-12: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 15-15: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openspec/workspace/memories/leader/2026-05-05/current.md` around lines 6 -
17, Add a blank line immediately after each Markdown heading "Current Status",
"Active Focus", and "Critical Context" in the file (these are the headers in
current.md) to satisfy MD022; ensure there is an empty line between each header
line and the following paragraph/list so the sections read: "Current Status" +
blank line + content, "Active Focus" + blank line + content, and "Critical
Context" + blank line + content.

Comment thread src/llm/api.py
Comment thread src/llm/api.py Outdated
Comment thread src/llm/nous_refresh.py
Comment thread tests/llm/test_fallback_integration.py Outdated
patch("src.llm.runtime.client_for_model_config", return_value=mock_client),
patch.object(executor_mod, "backend_for_provider", side_effect=mock_backend_for_provider),
):
with pytest.raises(Exception):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the test file
git ls-files tests/llm/test_fallback_integration.py

Repository: plastic-labs/honcho

Length of output: 102


🏁 Script executed:

# Read the test file to understand the context around line 191
wc -l tests/llm/test_fallback_integration.py

Repository: plastic-labs/honcho

Length of output: 106


🏁 Script executed:

# Read the full file to understand the context
cat -n tests/llm/test_fallback_integration.py

Repository: plastic-labs/honcho

Length of output: 8005


Use a specific exception type instead of pytest.raises(Exception).

Catching Exception broadly can hide unrelated failures and weakens the test's signal. Create a dedicated test exception class and use it consistently so the test only catches the expected exception.

Suggested fix
+class MockHTTPError(Exception):
+    def __init__(self, status_code: int):
+        super().__init__(f"HTTP {status_code}")
+        self.status_code = status_code
+
 class FailingBackend:
@@
     async def complete(self, **kwargs):
         self.call_count += 1
-        exc = Exception(f"HTTP {self.status_code}")
-        exc.status_code = self.status_code
-        raise exc
+        raise MockHTTPError(self.status_code)
@@
     async def stream(self, **kwargs):
         self.call_count += 1
-        exc = Exception(f"HTTP {self.status_code}")
-        exc.status_code = self.status_code
-        raise exc
+        raise MockHTTPError(self.status_code)
@@
-        with pytest.raises(Exception):
+        with pytest.raises(MockHTTPError):
             await honcho_llm_call(
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 191-191: Do not assert blind exception: Exception

(B017)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/llm/test_fallback_integration.py` at line 191, Replace the broad
pytest.raises(Exception) with a specific test exception: add a small custom
exception class (e.g., ExpectedTestError) near the top of
tests/llm/test_fallback_integration.py, update the test to use
pytest.raises(ExpectedTestError) instead of Exception, and ensure the code path
exercised by the test (the function or method under test referenced in the test)
raises ExpectedTestError so the test only catches the intended failure.

Comment thread tests/llm/test_nous_refresh.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/llm/api.py (1)

351-363: 💤 Low value

Consider extracting duplicated Langfuse usage update logic.

The usage-reporting blocks at lines 351-363 and 404-416 are nearly identical. Extracting to a small helper would reduce duplication and ensure consistent behavior.

♻️ Suggested helper
def _update_langfuse_usage(result: HonchoLLMCallResponse[Any]) -> None:
    """Report token usage to Langfuse if enabled."""
    if not settings.LANGFUSE_PUBLIC_KEY:
        return
    try:
        from langfuse import get_client
        usage = {}
        if result.input_tokens is not None:
            usage["input"] = result.input_tokens
        if result.output_tokens is not None:
            usage["output"] = result.output_tokens
        if usage:
            get_client().update_current_generation(usage_details=usage)
    except Exception as exc:
        logger.debug("Failed to update Langfuse usage: %s", exc)

Then replace both blocks with:

if isinstance(result, HonchoLLMCallResponse):
    _update_langfuse_usage(result)

Also applies to: 404-416

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/api.py` around lines 351 - 363, Extract the duplicated Langfuse
usage-reporting logic into a helper like _update_langfuse_usage(result:
HonchoLLMCallResponse) that checks settings.LANGFUSE_PUBLIC_KEY, builds the
usage dict from result.input_tokens and result.output_tokens, calls
get_client().update_current_generation(usage_details=usage) inside a try/except
and logs failures via logger.debug; then replace both inlined blocks in
src/llm/api.py (the blocks that test isinstance(result, HonchoLLMCallResponse)
and perform the update) with a single call to _update_langfuse_usage(result).
src/llm/nous_refresh.py (1)

169-170: ⚡ Quick win

Use custom exception types instead of RuntimeError.

Per coding guidelines, exceptions should be defined in src/exceptions.py. Consider a NousAuthError or similar for these external API failures, enabling callers to catch and handle them specifically.

♻️ Suggested approach

In src/exceptions.py:

class NousAuthError(Exception):
    """Raised when Nous OAuth operations fail."""
    pass

Then in nous_refresh.py:

+from src.exceptions import NousAuthError
...
-    raise RuntimeError(f"Token refresh failed {resp.status_code}: {resp.text}")
+    raise NousAuthError(f"Token refresh failed {resp.status_code}: {resp.text}")
...
-    raise RuntimeError(f"Agent key mint failed {resp.status_code}: {resp.text}")
+    raise NousAuthError(f"Agent key mint failed {resp.status_code}: {resp.text}")

Also applies to: 186-187

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/nous_refresh.py` around lines 169 - 170, Define a custom
NousAuthError in src/exceptions.py (class NousAuthError(Exception): ...) and
import it into src/llm/nous_refresh.py, then replace the RuntimeError raises
that check resp.status_code != 200 (the current raise RuntimeError(f"Token
refresh failed {resp.status_code}: {resp.text}")) and the similar raise at the
other occurrence (lines referenced 186-187) with raises of NousAuthError
including the same contextual message and resp details so callers can catch this
specific auth failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/llm/api.py`:
- Around line 351-363: Extract the duplicated Langfuse usage-reporting logic
into a helper like _update_langfuse_usage(result: HonchoLLMCallResponse) that
checks settings.LANGFUSE_PUBLIC_KEY, builds the usage dict from
result.input_tokens and result.output_tokens, calls
get_client().update_current_generation(usage_details=usage) inside a try/except
and logs failures via logger.debug; then replace both inlined blocks in
src/llm/api.py (the blocks that test isinstance(result, HonchoLLMCallResponse)
and perform the update) with a single call to _update_langfuse_usage(result).

In `@src/llm/nous_refresh.py`:
- Around line 169-170: Define a custom NousAuthError in src/exceptions.py (class
NousAuthError(Exception): ...) and import it into src/llm/nous_refresh.py, then
replace the RuntimeError raises that check resp.status_code != 200 (the current
raise RuntimeError(f"Token refresh failed {resp.status_code}: {resp.text}")) and
the similar raise at the other occurrence (lines referenced 186-187) with raises
of NousAuthError including the same contextual message and resp details so
callers can catch this specific auth failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e0a1b1e3-e42b-45fe-95c1-2d6746a7925f

📥 Commits

Reviewing files that changed from the base of the PR and between 3a03615 and 493cd95.

📒 Files selected for processing (3)
  • src/llm/api.py
  • src/llm/nous_refresh.py
  • tests/llm/test_fallback_integration.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llm/test_fallback_integration.py

phuongvm added a commit to phuongvm/honcho that referenced this pull request May 17, 2026
- Extract duplicated Langfuse usage reporting into _update_langfuse_usage() helper
- Add NousAuthError custom exception in src/exceptions.py
- Replace RuntimeError with NousAuthError in nous_refresh.py
- Harden save_state() file permissions (0o600 for state file and temp)
- Fix test_refresh_nous_credentials_success: properly mock httpx.AsyncClient
- Fix test_track_name_updates_langfuse_span_name: expect 2 update_current_generation calls (generation + usage)
- Fix update_env_key() to create .env file when missing instead of no-op
- Sanitize PII in openspec docs (replace real peer IDs with placeholders)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llm/api.py (1)

189-190: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale fallback behavior text in the docstring.

Line 189-Line 190 still says fallback happens on the final retry attempt, but this implementation now fast-falls back on the first retryable failure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/api.py` around lines 189 - 190, Update the stale docstring that
states the backup provider/model is used on the final retry attempt: change it
to say the configured ModelConfig.fallback is used immediately on the first
retryable failure (i.e., the first retry attempt) rather than only on the final
retry; update any related wording in the docstring in src/llm/api.py near the
ModelConfig/fallback description (search for "Backup provider/model" and "final
retry attempt") so it accurately reflects the current fast-fail fallback
behavior.
♻️ Duplicate comments (5)
tests/llm/test_nous_refresh.py (2)

22-25: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make test_load_state_missing hermetic to ambient env vars.

Line 22 can still fail when NOUS_REFRESH_TOKEN is set in CI/shell, because load_state() may return a token-backed dict instead of {}.

✅ Minimal fix
-def test_load_state_missing(tmp_path: Path) -> None:
+def test_load_state_missing(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
     """load_state returns {} when state file does not exist."""
+    monkeypatch.delenv("NOUS_REFRESH_TOKEN", raising=False)
     with patch.object(Path, "exists", return_value=False):
         assert load_state() == {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/llm/test_nous_refresh.py` around lines 22 - 25, The test
test_load_state_missing can still see NOUS_REFRESH_TOKEN from the environment;
make it hermetic by clearing or patching os.environ during the test so
load_state() cannot read a real token. Update the test_load_state_missing to
patch the environment (e.g., using unittest.mock.patch.dict on os.environ to
remove or set NOUS_REFRESH_TOKEN to an empty value) while also patching
Path.exists to False so load_state() is guaranteed to return {}; reference the
test function name test_load_state_missing and the load_state function when
making the change.

78-89: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the placeholder pass test with a real assertion (or remove it).

Line 85-Line 88 currently never validates _find_project_root, so this test always passes regardless of behavior.

✅ Minimal fix
 def test_find_project_root_with_dotenv(tmp_path: Path) -> None:
     """_find_project_root locates directory containing .env."""
     (tmp_path / ".env").touch()
-    # Simulate file deep in subdir
     deep = tmp_path / "sub" / "deep"
     deep.mkdir(parents=True)
-    fake_file = deep / "dummy.py"
-    with patch("src.llm.nous_refresh.Path", side_effect=lambda x: tmp_path / x if x == "__file__" else Path(x)):
-        # simpler: patch Path(__file__).resolve to return deep file
-        pass  # This is complex; test at integration level may be better.
-    # Skip unit-level path walk test — rely on manual verification later.
+    assert _find_project_root(start=deep / "dummy.py") == tmp_path
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/llm/test_nous_refresh.py` around lines 78 - 89, The test
test_find_project_root_with_dotenv currently contains a no-op pass; replace it
with a real assertion that _find_project_root finds the tmp_path containing the
.env. Inside the patch of "src.llm.nous_refresh.Path", make
Path("__file__").resolve() (or the side_effect lambda) return the fake_file
(deep / "dummy.py"), then call
src.llm.nous_refresh._find_project_root(str(fake_file)) (or the module's
_find_project_root function) and assert the result == tmp_path (or
tmp_path.resolve()), ensuring the test validates that _find_project_root locates
the directory containing .env; alternatively remove the whole test if you prefer
not to implement the patch.
src/llm/api.py (1)

213-219: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee ContextVar restoration with try/finally around the call flow.

Line 213-Line 214 set task-local state, but _restore_contextvars() only runs on success paths. If decorated() or execute_tool_loop() raises, state leaks into subsequent same-task calls.

🔧 Suggested fix shape
     _current_attempt_token = current_attempt.set(1)
     _force_fallback_token = force_fallback.set(False)

     def _restore_contextvars() -> None:
         current_attempt.reset(_current_attempt_token)
         force_fallback.reset(_force_fallback_token)

-    if not tools or not tool_executor:
-        result = await decorated()
-        ...
-        _restore_contextvars()
-        return result
-
-    result = await execute_tool_loop(...)
-    ...
-    _restore_contextvars()
-    return result
+    try:
+        if not tools or not tool_executor:
+            result = await decorated()
+            ...
+            return result
+
+        result = await execute_tool_loop(...)
+        ...
+        return result
+    finally:
+        _restore_contextvars()

Also applies to: 364-387, 390-428

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/api.py` around lines 213 - 219, The contextvars set by
current_attempt.set and force_fallback.set must be restored even on exceptions:
wrap the call flow that invokes decorated() (and similarly any
execute_tool_loop() call sites) in a try/finally where the finally always calls
_restore_contextvars(), so _restore_contextvars() runs on both success and error
paths; apply the same try/finally pattern in the other affected blocks that set
these tokens (the blocks around execute_tool_loop and the ranges analogous to
lines 364-387 and 390-428) to guarantee no task-local state leaks.
openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md (1)

4-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact environment-specific identifiers in committed verification artifacts.

Line 4, Line 18, Line 46, and Line 75 still expose concrete workspace/peer/session IDs from a real environment. Please replace these with synthetic fixtures consistently across the report.

🧹 Suggested redaction pattern
-**Target Environment**: `hermes_workspace` (PTDEV)
+**Target Environment**: `test-workspace` (fixture)

-  "workspace_id": "hermes_workspace",
+  "workspace_id": "test-workspace",

-  "peers": [ {"id": "hao-nguyen"}, {"id": "aziz"}, ... ]
+  "peers": [ {"id": "test-peer-1"}, {"id": "test-peer-2"}, ... ]

-  "sessions": [ {"id":"test-session-1"}, ... ]
+  "sessions": [ {"id":"session-1"}, {"id":"session-2"}, ... ]

Also applies to: 18-35, 46-53, 75-82

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md`
around lines 4 - 5, The report exposes real environment identifiers (e.g., the
Target Environment value "hermes_workspace" and concrete workspace/peer/session
IDs found in the report) and needs redaction; locate occurrences of those
environment-specific strings in the file
openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
(including the blocks around the header and the other repeated sections) and
replace them with consistent synthetic fixtures (for example "env_placeholder",
"workspace_id_fixture", "peer_id_fixture", "session_id_fixture") everywhere they
appear so the same placeholder is used across the report. Ensure you update all
instances in the noted sections and preserve formatting/markdown while not
altering other content.
src/llm/nous_refresh.py (1)

235-242: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the in-memory settings import and stop swallowing update failures silently.

Line 236 imports honcho.config, which does not match the runtime module path used elsewhere (src.config), so the in-memory key update can silently fail every time.

🔧 Suggested fix
         # 6. Update in-memory settings globally (if Honcho is running)
         try:
-            from honcho.config import settings
+            from src.config import settings

             settings.LLM.NOUS_API_KEY = agent_key
             logger.info("In-memory settings.LLM.NOUS_API_KEY updated")
-        except Exception:
+        except ImportError as exc:
             # settings may not be importable in all contexts (tests, CLI)
-            pass
+            logger.debug("Could not import settings for in-memory update: %s", exc)
As per coding guidelines, "Use explicit error handling with appropriate exception types from src/exceptions.py".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/nous_refresh.py` around lines 235 - 242, The current import uses
honcho.config and swallows all exceptions; change the import to use the runtime
module path by importing settings from src.config (replace "from honcho.config
import settings" with "from src.config import settings"), then update the
try/except to catch and handle specific exceptions (ImportError, AttributeError
and the project-specific ConfigError from src.exceptions) instead of a bare
except; on failure log a clear error with the exception details via logger.error
including agent_key/context and re-raise or propagate the error so the update to
settings.LLM.NOUS_API_KEY (the settings.LLM.NOUS_API_KEY assignment referencing
agent_key) does not fail silently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/llm/api.py`:
- Around line 189-190: Update the stale docstring that states the backup
provider/model is used on the final retry attempt: change it to say the
configured ModelConfig.fallback is used immediately on the first retryable
failure (i.e., the first retry attempt) rather than only on the final retry;
update any related wording in the docstring in src/llm/api.py near the
ModelConfig/fallback description (search for "Backup provider/model" and "final
retry attempt") so it accurately reflects the current fast-fail fallback
behavior.

---

Duplicate comments:
In
`@openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md`:
- Around line 4-5: The report exposes real environment identifiers (e.g., the
Target Environment value "hermes_workspace" and concrete workspace/peer/session
IDs found in the report) and needs redaction; locate occurrences of those
environment-specific strings in the file
openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
(including the blocks around the header and the other repeated sections) and
replace them with consistent synthetic fixtures (for example "env_placeholder",
"workspace_id_fixture", "peer_id_fixture", "session_id_fixture") everywhere they
appear so the same placeholder is used across the report. Ensure you update all
instances in the noted sections and preserve formatting/markdown while not
altering other content.

In `@src/llm/api.py`:
- Around line 213-219: The contextvars set by current_attempt.set and
force_fallback.set must be restored even on exceptions: wrap the call flow that
invokes decorated() (and similarly any execute_tool_loop() call sites) in a
try/finally where the finally always calls _restore_contextvars(), so
_restore_contextvars() runs on both success and error paths; apply the same
try/finally pattern in the other affected blocks that set these tokens (the
blocks around execute_tool_loop and the ranges analogous to lines 364-387 and
390-428) to guarantee no task-local state leaks.

In `@src/llm/nous_refresh.py`:
- Around line 235-242: The current import uses honcho.config and swallows all
exceptions; change the import to use the runtime module path by importing
settings from src.config (replace "from honcho.config import settings" with
"from src.config import settings"), then update the try/except to catch and
handle specific exceptions (ImportError, AttributeError and the project-specific
ConfigError from src.exceptions) instead of a bare except; on failure log a
clear error with the exception details via logger.error including
agent_key/context and re-raise or propagate the error so the update to
settings.LLM.NOUS_API_KEY (the settings.LLM.NOUS_API_KEY assignment referencing
agent_key) does not fail silently.

In `@tests/llm/test_nous_refresh.py`:
- Around line 22-25: The test test_load_state_missing can still see
NOUS_REFRESH_TOKEN from the environment; make it hermetic by clearing or
patching os.environ during the test so load_state() cannot read a real token.
Update the test_load_state_missing to patch the environment (e.g., using
unittest.mock.patch.dict on os.environ to remove or set NOUS_REFRESH_TOKEN to an
empty value) while also patching Path.exists to False so load_state() is
guaranteed to return {}; reference the test function name
test_load_state_missing and the load_state function when making the change.
- Around line 78-89: The test test_find_project_root_with_dotenv currently
contains a no-op pass; replace it with a real assertion that _find_project_root
finds the tmp_path containing the .env. Inside the patch of
"src.llm.nous_refresh.Path", make Path("__file__").resolve() (or the side_effect
lambda) return the fake_file (deep / "dummy.py"), then call
src.llm.nous_refresh._find_project_root(str(fake_file)) (or the module's
_find_project_root function) and assert the result == tmp_path (or
tmp_path.resolve()), ensuring the test validates that _find_project_root locates
the directory containing .env; alternatively remove the whole test if you prefer
not to implement the patch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dde833f4-dac0-432a-986f-f29c7101fdda

📥 Commits

Reviewing files that changed from the base of the PR and between 493cd95 and 944b811.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/2026-05-07-honcho-philosophy-and-usecases.md
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
  • src/exceptions.py
  • src/llm/api.py
  • src/llm/nous_refresh.py
  • tests/llm/test_nous_refresh.py
  • tests/utils/test_clients.py
✅ Files skipped from review due to trivial changes (1)
  • src/exceptions.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/utils/test_clients.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/llm/nous_refresh.py (1)

78-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

State directory still created with default umask.

Lines 82 and 84 correctly chmod the temp/final file to 0o600, but STATE_FILE.parent.mkdir(parents=True, exist_ok=True) on line 78 still uses the process umask. On shared hosts, ~/.honcho/ and its tmp file (briefly) remain world-readable before the chmod runs. Two small hardening steps close this:

🔒 Suggested hardening
-    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
-    # Atomic write via temp file + rename
-    tmp = STATE_FILE.with_suffix(".tmp")
-    tmp.write_text(json.dumps(state, indent=2))
-    os.chmod(tmp, 0o600)
-    tmp.replace(STATE_FILE)
-    os.chmod(STATE_FILE, 0o600)
+    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
+    os.chmod(STATE_FILE.parent, 0o700)
+    # Atomic write via temp file + rename; create with 0600 to avoid a
+    # window where the tmp file is world-readable before chmod.
+    tmp = STATE_FILE.with_suffix(".tmp")
+    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
+    with os.fdopen(fd, "w") as f:
+        f.write(json.dumps(state, indent=2))
+    tmp.replace(STATE_FILE)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/llm/nous_refresh.py` around lines 78 - 84, STATE_FILE.parent.mkdir uses
the process umask so the directory (and the temp file created inside it) can be
briefly world-readable; wrap the mkdir in a temporary umask change and
explicitly set the directory mode to a restrictive value (e.g. 0o700) so the dir
is created with correct permissions: save old_umask = os.umask(0), call
STATE_FILE.parent.mkdir(parents=True, exist_ok=True, mode=0o700) and then
restore os.umask(old_umask); keep the existing tmp write/chmod/replace sequence
(STATE_FILE.with_suffix and subsequent os.chmod calls) so files remain 0o600.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/llm/nous_refresh.py`:
- Around line 78-84: STATE_FILE.parent.mkdir uses the process umask so the
directory (and the temp file created inside it) can be briefly world-readable;
wrap the mkdir in a temporary umask change and explicitly set the directory mode
to a restrictive value (e.g. 0o700) so the dir is created with correct
permissions: save old_umask = os.umask(0), call
STATE_FILE.parent.mkdir(parents=True, exist_ok=True, mode=0o700) and then
restore os.umask(old_umask); keep the existing tmp write/chmod/replace sequence
(STATE_FILE.with_suffix and subsequent os.chmod calls) so files remain 0o600.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d26aa54-a672-4be2-bc8d-5ad35554ee23

📥 Commits

Reviewing files that changed from the base of the PR and between 944b811 and 99b4b02.

📒 Files selected for processing (4)
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md
  • src/llm/api.py
  • src/llm/nous_refresh.py
  • tests/llm/test_nous_refresh.py
✅ Files skipped from review due to trivial changes (1)
  • openspec/changes/archive/2026-05-07-verify-honcho-mcp-tools/explorations/mcp-verification-report.md

phuonglambert and others added 12 commits May 21, 2026 20:22
- Introduced new capability for tracing LLM calls as Langfuse generation observations.
- Updated `honcho_llm_call` to use `@conditional_observe` with `as_type="generation"`.
- Modified `update_current_langfuse_observation` to set the `model` field and preserve metadata.
- Added comprehensive documentation and tasks for verification and future enhancements.
- Created exploration and proposal documents to formalize the integration process.
…custom models and initialize project documentation
…name passing to honcho_llm_call for improved Langfuse telemetry aggregation.
- Force fallback on first retryable failure (429/5xx/timeout)
- Configurable via FALLBACK_MODEL_CONFIG__ env vars
- Langfuse signal: is_fallback=true on generation span
- Configurable timeout via LLM_DEFAULT_TIMEOUT=180.0
- 26 tests (23 unit + 3 integration) all passing
- OpenSpec spec synced and archived
- Extract duplicated Langfuse usage reporting into _update_langfuse_usage() helper
- Add NousAuthError custom exception in src/exceptions.py
- Replace RuntimeError with NousAuthError in nous_refresh.py
- Harden save_state() file permissions (0o600 for state file and temp)
- Fix test_refresh_nous_credentials_success: properly mock httpx.AsyncClient
- Fix test_track_name_updates_langfuse_span_name: expect 2 update_current_generation calls (generation + usage)
- Fix update_env_key() to create .env file when missing instead of no-op
- Sanitize PII in openspec docs (replace real peer IDs with placeholders)
- Extract Langfuse usage update into _update_langfuse_usage helper
- Guarantee ContextVar restoration with try/finally block
- Update stale docstring about fallback behavior
- Fix import path in nous_refresh.py (honcho.config -> src.config)
- Fix bare except with specific exceptions and logging
- Make test_load_state_missing hermetic with monkeypatch
- Replace no-op pass with real assertion in test_find_project_root
- Redact environment identifiers in mcp-verification-report.md
@phuongvm phuongvm force-pushed the feat/llm-model-fallback branch 2 times, most recently from 849dbb9 to 53d5156 Compare May 21, 2026 14:03
@phuongvm
Copy link
Copy Markdown
Author

PR #692 Code Review: LLM Model Fallback on Failure

Overall Assessment: Approve with comments

This is a well-architected feature with strong test coverage (950+ lines of tests). The fast-failover design is the correct choice — retrying a dead provider wastes time and tokens. The implementation is production-ready with the caveats below.


✅ Strengths

  1. Fast failover on first failure — Correct architectural decision. No point exhausting retries on a 429'd provider.
  2. ContextVar-based state management — Thread-safe for async Python, properly scoped.
  3. Context variable cleanuptry/finally with _restore_contextvars() prevents ContextVar leaks across calls.
  4. Backward compatibility preserved — Legacy behavior (fallback on final attempt) still works when force_fallback is False.
  5. Comprehensive error classification_is_retryable_error() covers HTTP status codes, built-in exceptions, and SDK-specific exceptions.
  6. Per-agent configuration — Independent fallback per agent via __-delimited env vars.
  7. Cross-provider support — Fallback can use a different transport (e.g., OpenAI → LMStudio).
  8. Test coverage — 4 new test files with unit tests for select_model_config_for_attempt, integration tests with mock backends, Nous auto-refresh tests.
  9. Observability — Langfuse metadata with is_fallback: true, WARNING-level logs with primary→fallback details.
  10. Configurable timeoutLLM_DEFAULT_TIMEOUT replaces hardcoded 600s (Anthropic) and ~60s (OpenAI).
  11. Summarizer telemetry fix — Removing outer @conditional_observe and using track_name directly produces cleaner Langfuse traces.

🔴 Critical Issues

1. DRY Violation: Auth-refresh logic duplicated 3 times in OpenAIBackend

File: src/llm/backends/openai.py

The _call_with_autorefresh method was added (lines ~113-139) but the existing complete() method still has the same auth-refresh logic duplicated inline in both the parse() path and the create() path. This means the same 15-line block appears 3 times.

Impact: Maintenance burden — any change to the refresh flow must be made in 3 places. If one copy drifts, silent bugs follow.

Fix: Refactor complete() to delegate to _call_with_autorefresh instead of duplicating the logic:

# Instead of inline try/except AuthenticationError blocks:
if use_parse:
    response = await self._call_with_autorefresh(use_parse=True, params=params)
else:
    response = await self._call_with_autorefresh(use_parse=False, params=params)

2. Race condition in concurrent Nous OAuth refresh

File: src/llm/nous_refresh.py

If 5 concurrent requests all hit 401 simultaneously, all 5 will call refresh_nous_credentials() in parallel. Each will exchange the refresh_token and mint a new agent key. The Nous API may invalidate the refresh_token after first use, causing 4 of the 5 refreshes to fail.

Fix: Add an asyncio.Lock to serialize refresh attempts:

_refresh_lock = asyncio.Lock()

async def refresh_nous_credentials() -> str | None:
    async with _refresh_lock:
        # ... existing refresh logic

🟡 Medium Issues

3. NousAuthError breaks exception hierarchy

File: src/exceptions.py

NousAuthError inherits from Exception directly, while all other Honcho exceptions inherit from HonchoException. This means a except HonchoException catch won't trap Nous auth failures.

Fix:

class NousAuthError(HonchoException):
    """Raised when Nous OAuth operations fail."""
    pass

4. `settings.LLM.NOUS_API_KEY=***

File: src/llm/backends/openai.py

Pydantic models are often frozen or have validators. Silently swallowing exceptions here means if the settings update fails, the in-memory key is stale for the next request.

Fix: Log at WARNING level if the update fails, and consider whether the settings object even supports mutation at runtime.

5. _is_retryable_error uses string-based type matching

File: src/llm/api.py

return type(exc).__name__ in (
    "APIConnectionError", "APITimeoutError", "InternalServerError",
    "ServiceUnavailableError", "RateLimitError",
)

This avoids importing SDK-specific exceptions (good for decoupling), but is fragile to SDK renames. If OpenAI renames RateLimitError to RateLimitExceededError, this silently stops matching.

Recommendation: Add a comment documenting this tradeoff and the SDK versions these names were verified against. Consider adding a periodic CI assertion that these class names still exist in the installed SDK versions.

6. force_fallback resets per tool-loop iteration

File: src/llm/tool_loop.py

current_attempt.set(1)
force_fallback.set(False)

Each tool-loop iteration resets force_fallback to False. If the primary provider is persistently down, every tool iteration will waste one attempt on the primary before falling back. This is correct for transient failures but suboptimal for sustained outages.

Recommendation: Consider an optional "sticky fallback" mode where once fallback activates, it persists for the entire tool loop. Could be a parameter like sticky_fallback: bool = False.


🟢 Minor / Style

7. Double blank line after _update_langfuse_usage

File: src/llm/api.py (tool-less path)

There's a double blank line between _update_langfuse_usage(result) and the if trace_name block. Should be single blank line per PEP 8.

8. lmstudio and nous use get_openai_override_client (LRU cached)

This works correctly since LMStudio/Nous are OpenAI-compatible. The @lru_cache(maxsize=128) keys on (base_url, api_key) so different configs get different clients. No issue, just confirming this was reviewed.

9. PR scope: OpenSpec files inflate diff

The PR includes ~30 OpenSpec archive/workspace files (~1500+ lines). While these document the design process, they significantly increase review burden. Consider separating design documentation from implementation in future PRs.


Test Coverage Assessment

Area Coverage Notes
select_model_config_for_attempt with force_fallback ✅ Thorough All attempt/fallback combinations tested
_is_retryable_error classification ✅ Good HTTP codes, built-in exceptions, SDK exceptions
Integration: primary fails → fallback succeeds ✅ Tested Mock backends verify call counts
Integration: both fail → error propagated ✅ Tested
Nous auto-refresh ✅ Tested test_nous_refresh.py + test_backends/test_nous_autorefresh.py
Streaming fallback ⚠️ Not tested Integration tests only cover non-streaming path
Tool loop fallback reset ⚠️ Not tested No test verifies force_fallback resets per iteration
Langfuse metadata ⚠️ Not tested is_fallback metadata not asserted in tests

Recommendation: Add at least one streaming fallback test and one tool-loop iteration test.


Summary

Category Count
Critical (must fix before merge) 2
Medium (should fix) 4
Minor (nice to have) 3
Test gaps 3

The core fallback architecture is sound. The two critical issues (DRY violation in auth-refresh, race condition in Nous refresh) should be addressed before merge. The medium issues are improvements that can follow in a subsequent PR if timeline is tight.

Copy link
Copy Markdown
Author

@phuongvm phuongvm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hermes Agent Code Review -- PR #692

Verdict: Changes Requested (3 critical issues, 4 warnings, 5 suggestions)


What is well done

  • ContextVar scoping -- force_fallback is properly saved with a reset token (_force_fallback_token) and restored in a finally block. This prevents cross-call leakage.
  • First-failure fast failover -- The design correctly switches to fallback on the first retryable error, not after exhausting all retries. This is the right UX for rate limits and outages.
  • Langfuse span metadata -- Annotating the single generation span with is_fallback instead of creating separate failed-attempt spans is clean and avoids trace pollution.
  • Backward compatibility -- When no fallback is configured, force_fallback stays False and the existing retry path is unchanged.
  • Timeout consolidation -- Replacing hardcoded 600.0 (Anthropic) / SDK defaults (~60s) with a single LLM_DEFAULT_TIMEOUT env var is good. 180s default is reasonable.
  • Test coverage breadth -- 23 unit tests + 3 integration tests cover the core logic: force_fallback True/False, no-fallback config, cross-provider, final-attempt fallback.

CRITICAL

  1. _is_retryable_error() catches OSError too broadly

    • src/llm/api.py:308: isinstance(exc, (TimeoutError, ConnectionError, OSError)) -- OSError includes many non-retryable errors (permission denied, disk full, etc.). ConnectionError is already a subclass of OSError, so the OSError catch-all is redundant and dangerous.
    • Fix: Remove OSError from the tuple. TimeoutError and ConnectionError are sufficient for network errors. SDK-specific exceptions cover the rest.
  2. Missing agent name in fallback log (spec violation)

    • src/llm/api.py:339-343: The fallback log message includes transport/model but does not include the agent name (deriver, dialectic, dreamer, etc.). The spec (openspec/specs/llm-model-fallback/spec.md, Requirement 3) explicitly requires: "agent=deriver, primary=<provider>/<model>, fallback=<provider>/<model>, reason=<error>".
    • Fix: Thread agent name through honcho_llm_call and include it in the log message.
  3. nous_refresh.py writes to .env file at runtime

    • src/llm/nous_refresh.py: The update_env_key() function modifies the .env file on disk after refreshing credentials. In containerized deployments (Docker, k8s), the .env is baked into the image or mounted read-only -- this write will fail silently or cause issues.
    • Fix: Write to a runtime state file (like the OAuth state file) and have the credential loader check that file before falling back to .env. Or make this a no-op when running in a container.

WARNINGS

  1. No test for partial fallback config

    • The spec (Requirement 8, last scenario) requires that when only FALLBACK__TRANSPORT is set without FALLBACK__MODEL, the system ignores the incomplete config. No test verifies this edge case.
    • Fix: Add a test case with fallback=ModelConfig(transport="anthropic", model=None) and verify it returns primary.
  2. settings.LLM.NOUS_API_KEY mutation in _call_with_autorefresh

    • src/llm/backends/openai.py:~143: settings.LLM.NOUS_API_KEY = new_key -- mutating a Pydantic BaseSettings field at runtime is fragile. If the settings class is re-instantiated or validated elsewhere, the mutation is lost.
    • Fix: Consider storing the refreshed key in a dedicated module-level variable or the OAuth state file, and have the settings getter check that source first.
  3. Stream path does not get _update_langfuse_usage

    • src/llm/api.py:~456-462: _update_langfuse_usage(result) is only called for HonchoLLMCallResponse (non-stream). Streaming responses return AsyncIterator[HonchoLLMCallStreamChunk] and skip usage reporting entirely.
    • Fix: Hook usage reporting into the stream consumer or document this as a known gap.
  4. honcho_llm_call ContextVar reset in try/finally

    • The _restore_contextvars() function is defined but I did not see a finally block calling it in the visible diff. If the ContextVars are not reset on exit (exception or normal return), subsequent calls may inherit stale force_fallback=True state.
    • Verify: Ensure the try/finally wrapping the call body actually calls _restore_contextvars().

SUGGESTIONS

  1. select_model_config_for_attempt() docstring says "last attempt" but now also handles force_fallback -- the docstring was updated but could be clearer about precedence: force_fallback=True overrides attempt number logic entirely.

  2. _is_retryable_error() uses type(exc).__name__ string matching -- this is fragile. If SDK exception class names change, the check silently stops working. Consider importing the actual SDK exception types when available.

  3. LMStudio and Nous share OpenAIBackend -- this is correct (they are OpenAI-compatible), but the is_nous=True flag on OpenAIBackend is the only distinguishing feature. Consider whether a NousBackend subclass would be cleaner as the Nous-specific logic grows (OAuth refresh, error handling).

  4. PR scope: 4,342 additions in 58 files -- this PR includes not just fallback logic but also: Langfuse generation traces, MCP QA verification, Nous OAuth refresh, Docker/MCP infrastructure, and OpenSpec archival. Consider splitting into smaller PRs for easier review.


Reviewed by Hermes Agent

@phuongvm
Copy link
Copy Markdown
Author

⚠️ PR has been split into focused PRs per maintainer request:

PR Scope URL
#736 LLM model fallback on failure #736
#737 Nous Research provider support #737
#738 MCP Dockerfile improvements #738

All tests passing (37/37 total). Branches pushed to phuongvm/honcho.

- Remove OSError from _is_retryable_error() — OSError is too broad
  and includes non-retryable errors (permission denied, disk full).
  ConnectionError is already a subclass of OSError, making it redundant.

- Add agent name (track_name) to fallback log message per spec
  Requirement 3. Log now includes: agent label, primary model,
  fallback model, attempt number, and error type/message.

- Skip .env file write in containerized environments (Docker/k8s).
  .env is baked into images or mounted read-only in containers.
  Persisted state file from save_state() is sufficient for runtime
  credential management.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants