plastic-labs · phuongvm · Apr 22, 2026 · Apr 24, 2026 · May 1, 2026 · May 2, 2026
diff --git a/.dockerignore b/.dockerignore
@@ -12,3 +12,7 @@ docker-compose.yml.example
 .vscode/**
 data/**
 .venv
+.wrangler
+**/.wrangler
+mcp/.wrangler
+mcp/node_modules
diff --git a/.env.template b/.env.template
@@ -89,6 +89,7 @@ LLM_OPENAI_API_KEY=your-api-key-here
 # =============================================================================
 # Global LLM settings
 # LLM_DEFAULT_MAX_TOKENS=2500
+# LLM_DEFAULT_TIMEOUT=180.0  # HTTP timeout (seconds) for all LLM provider clients
 # LLM_MAX_TOOL_OUTPUT_CHARS=10000  # Max chars for tool output (~2500 tokens)
 # LLM_MAX_MESSAGE_CONTENT_CHARS=2000  # Max chars per message in tool results
 
@@ -119,8 +120,8 @@ LLM_OPENAI_API_KEY=your-api-key-here
 # DERIVER_FLUSH_ENABLED=false  # Bypass batch token threshold, process work immediately
 # DERIVER_MODEL_CONFIG__FALLBACK__MODEL=
 # DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=
-# DERIVER_MODEL_CONFIG__OVERRIDES__BASE_URL=
-# DERIVER_MODEL_CONFIG__OVERRIDES__API_KEY_ENV=
+# DERIVER_MODEL_CONFIG__FALLBACK__OVERRIDES__BASE_URL=
+# DERIVER_MODEL_CONFIG__FALLBACK__OVERRIDES__API_KEY_ENV=
 
 # =============================================================================
 # Peer Card
@@ -168,6 +169,14 @@ LLM_OPENAI_API_KEY=your-api-key-here
 # Optional backup per level (must set both or neither):
 # DIALECTIC_LEVELS__max__MODEL_CONFIG__FALLBACK__MODEL=gemini-2.5-pro
 # DIALECTIC_LEVELS__max__MODEL_CONFIG__FALLBACK__TRANSPORT=gemini
+# DIALECTIC_LEVELS__high__MODEL_CONFIG__FALLBACK__MODEL=
+# DIALECTIC_LEVELS__high__MODEL_CONFIG__FALLBACK__TRANSPORT=
+# DIALECTIC_LEVELS__medium__MODEL_CONFIG__FALLBACK__MODEL=
+# DIALECTIC_LEVELS__medium__MODEL_CONFIG__FALLBACK__TRANSPORT=
+# DIALECTIC_LEVELS__low__MODEL_CONFIG__FALLBACK__MODEL=
+# DIALECTIC_LEVELS__low__MODEL_CONFIG__FALLBACK__TRANSPORT=
+# DIALECTIC_LEVELS__minimal__MODEL_CONFIG__FALLBACK__MODEL=
+# DIALECTIC_LEVELS__minimal__MODEL_CONFIG__FALLBACK__TRANSPORT=
 
 # =============================================================================
 # Summary
@@ -186,6 +195,9 @@ LLM_OPENAI_API_KEY=your-api-key-here
 # SUMMARY_MAX_TOKENS_SHORT=1000
 # SUMMARY_MAX_TOKENS_LONG=4000
 # SUMMARY_MODEL_CONFIG__FALLBACK__MODEL=
+# SUMMARY_MODEL_CONFIG__FALLBACK__TRANSPORT=
+# SUMMARY_MODEL_CONFIG__FALLBACK__OVERRIDES__BASE_URL=
+# SUMMARY_MODEL_CONFIG__FALLBACK__OVERRIDES__API_KEY_ENV=
 
 # =============================================================================
 # Dream
@@ -201,6 +213,10 @@ LLM_OPENAI_API_KEY=your-api-key-here
 # DREAM_DEDUCTION_MODEL_CONFIG__OVERRIDES__BASE_URL=https://openrouter.ai/api/v1
 # DREAM_INDUCTION_MODEL_CONFIG__MODEL=your-model-here
 # DREAM_INDUCTION_MODEL_CONFIG__OVERRIDES__BASE_URL=https://openrouter.ai/api/v1
+# DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__MODEL=
+# DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__TRANSPORT=
+# DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__MODEL=
+# DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__TRANSPORT=
 # DREAM_DOCUMENT_THRESHOLD=50
 # DREAM_IDLE_TIMEOUT_MINUTES=60
 # DREAM_MIN_HOURS_BETWEEN_DREAMS=8
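
Note on the `__` keys above: the rename from `DERIVER_MODEL_CONFIG__OVERRIDES__*` to `DERIVER_MODEL_CONFIG__FALLBACK__OVERRIDES__*` only matters if the double underscore acts as a nesting delimiter, as it does with pydantic-settings' `env_nested_delimiter`. The sketch below is a minimal illustration of that mapping under that assumption, not Honcho's actual loader; `nested_config` is a hypothetical helper and the base URL is a made-up value.

```python
from typing import Any


def nested_config(prefix: str, environ: dict[str, str], delimiter: str = "__") -> dict[str, Any]:
    """Collect PREFIX* variables into a nested dict, splitting keys on the delimiter."""
    result: dict[str, Any] = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split(delimiter)
        node = result
        # Walk (or create) one dict level per path segment except the last.
        for part in parts[:-1]:
            node = node.setdefault(part.lower(), {})
        node[parts[-1].lower()] = value
    return result


env = {
    "DERIVER_MODEL_CONFIG__FALLBACK__MODEL": "qwen/qwen3.5-9b",
    "DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT": "lmstudio",
    # Hypothetical value, for illustration only.
    "DERIVER_MODEL_CONFIG__FALLBACK__OVERRIDES__BASE_URL": "http://localhost:1234/v1",
}
config = nested_config("DERIVER_", env)
print(config["model_config"]["fallback"])
```

With the corrected key, `base_url` lands under `model_config.fallback.overrides`, next to the fallback's `model` and `transport`, rather than under a top-level `overrides` entry.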

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -297,3 +297,56 @@ src/
 ### Notes
 
 - Always use `uv run` or `uv` to prefix any commands related to python to ensure you use the virtual environment
+
+## LLM Model Fallback
+
+Honcho supports automatic fallback to a secondary LLM model when the primary model fails (rate limit, timeout, API error).
+
+### How It Works
+
+1. **First-failure trigger**: When the primary model returns a retryable error (429, 5xx, timeout, connection error), the system switches to the fallback model on the very next attempt; it does NOT wait for all retries to be exhausted.
+2. **Per-agent configuration**: Each agent (Deriver, Dialectic, Dreamer, Summary) has independent fallback configuration.
+3. **Cross-provider support**: The fallback model can use a different provider (e.g., primary=openai, fallback=lmstudio).
+4. **Backward compatible**: If no fallback is configured, behavior is identical to before.
+
+### Configuration
+
+Set these environment variables for each agent:
+
+```bash
+# Deriver
+DERIVER_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
+DERIVER_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
+
+# Dialectic (per reasoning level)
+DIALECTIC_LEVELS__low__MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
+DIALECTIC_LEVELS__low__MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
+
+# Summary
+SUMMARY_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
+SUMMARY_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
+
+# Dream
+DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
+DREAM_DEDUCTION_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
+DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__TRANSPORT=lmstudio
+DREAM_INDUCTION_MODEL_CONFIG__FALLBACK__MODEL=qwen/qwen3.5-9b
+```
+
+### Observability
+
+- **Logs**: WARNING-level log emitted when fallback is activated, including primary→fallback provider/model info
+- **Langfuse**: Generation span metadata includes `is_fallback: true` when fallback model is used
+- **No separate span**: Only the successful attempt's span is kept (failed attempts do not create separate spans)
+
+### Retryable Errors
+
+The following errors trigger fast fallback:
+- HTTP 429 (Too Many Requests / Rate Limit)
+- HTTP 5xx (Server Errors)
+- `TimeoutError`
+- `ConnectionError`
+- `OSError`
+- SDK-specific: `APIConnectionError`, `APITimeoutError`, `InternalServerError`, `ServiceUnavailableError`, `RateLimitError`
+
+Non-retryable errors (for example HTTP 400 or `ValueError`) do NOT trigger fallback; they follow normal retry behavior.
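
The first-failure behavior described in the CLAUDE.md section above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `HTTPStatusError`, `is_retryable`, and `call_with_fallback` are hypothetical names, the SDK-specific exception classes are left out, and backoff between attempts is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class HTTPStatusError(Exception):
    """Hypothetical stand-in for a provider SDK's HTTP error."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: Exception) -> bool:
    """Return True for errors that should trigger the fast fallback."""
    if isinstance(exc, HTTPStatusError):
        return exc.status_code == 429 or 500 <= exc.status_code < 600
    # The built-in TimeoutError and ConnectionError are OSError subclasses.
    return isinstance(exc, OSError)


async def call_with_fallback(
    primary: Callable[[], Awaitable[str]],
    fallback: Optional[Callable[[], Awaitable[str]]] = None,
    max_attempts: int = 3,
) -> tuple[str, bool]:
    """Return (text, is_fallback); switch models after the first retryable failure."""
    use_fallback = False
    last_exc: Optional[Exception] = None
    for _ in range(max_attempts):
        model = fallback if (use_fallback and fallback is not None) else primary
        try:
            return await model(), use_fallback
        except Exception as exc:
            last_exc = exc
            # Non-retryable errors keep retrying on the current model.
            if is_retryable(exc) and fallback is not None:
                use_fallback = True
    assert last_exc is not None
    raise last_exc


async def demo() -> tuple[str, bool]:
    async def primary() -> str:
        raise HTTPStatusError(429)  # simulated rate limit

    async def fallback() -> str:
        return "ok from fallback"

    return await call_with_fallback(primary, fallback)


print(asyncio.run(demo()))
```

The boolean in the returned tuple plays the role of the `is_fallback` metadata flag mentioned under Observability. When no fallback is passed, the loop simply retries the primary, which matches the backward-compatibility note.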
diff --git a/mcp/.dockerignore b/mcp/.dockerignore
@@ -0,0 +1,2 @@
+.wrangler
+node_modules
diff --git a/mcp/Dockerfile b/mcp/Dockerfile
@@ -0,0 +1,24 @@
+# --- Stage 1: Lightning Fast Dependencies ---
+FROM oven/bun:1 AS builder
+WORKDIR /app
+
+# Copy configuration and use Bun's superior async network stack to bypass WSL MTU hangs
+COPY package.json ./
+RUN bun install
+
+# Copy source code
+COPY . .
+
+# --- Stage 2: Node Native Runtime ---
+FROM node:20
+WORKDIR /app
+
+# Ingest all the raw files (including node_modules) built gracefully by Bun
+COPY --from=builder /app /app
+
+# Setup environment 
+ENV CI=true
+
+# Revert boot command to native NPM/Node context to bypass Wrangler's anti-Bun blocker
+# Crucially, we wrap it in a shell to inject the Docker container's environment variable into Wrangler's isolated sandbox
+CMD sh -c "npx wrangler dev src/index.ts --port 8787 --ip 0.0.0.0 --var HONCHO_API_URL:$HONCHO_API_URL"
diff --git a/mcp/bun.lock b/mcp/bun.lock
diff --git a/mcp/package.json b/mcp/package.json
@@ -21,6 +21,10 @@
     "nanoid": "^5.1.7",
     "zod": "^4.3.6"
   },
+  "browser": {
+    "cloudflare:email": false,
+    "core-js-pure/features/instance/trim": false
+  },
   "devDependencies": {
     "@cloudflare/workers-types": "^4.20241002.0",
     "typescript": "^5.3.3",

diff --git a/mcp/run-mocked.ts b/mcp/run-mocked.ts
@@ -0,0 +1,31 @@
+// The "Bùa Ngải" ("magic charm") mock execution strategy requested by the user
+import { plugin } from "bun";
+
+plugin({
+  name: "cloudflare-mock",
+  setup(build) {
+    // Intercept standard resolution for cloudflare:email
+    build.onResolve({ filter: /^cloudflare:email$/ }, () => ({
+      path: "mock-cloudflare-email",
+      namespace: "mock",
+    }));
+
+    // Supply a dummy payload that successfully resolves but does nothing
+    build.onLoad({ filter: /.*/, namespace: "mock" }, () => ({
+      contents: "export default {}; export const EmailMessage = class {};",
+      loader: "js",
+    }));
+  },
+});
+
+// We dynamically import the main index ONLY AFTER the plugin is registered
+import("./src/index.ts").then((module) => {
+  const port = process.env.HONCHO_MCP_PORT ? Number(process.env.HONCHO_MCP_PORT) : 8787;
+  Bun.serve({
+    fetch: module.default.fetch as any,
+    port: port,
+  });
+  console.log(`[Honcho MCP] Native mock server booted up on port ${port}!`);
+}).catch(err => {
+  console.error("Failed to dynamically load MCP:", err);
+});
diff --git a/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yaml b/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-05
diff --git a/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/design.md b/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/design.md
@@ -0,0 +1,32 @@
+## Context
+
+Langfuse aggregates model identification and token usage at the `GENERATION` observation level. When an observation is created as a `SPAN`, it acts purely as a grouping container and its table view lacks these columns. 
+
+In `src/utils/summarizer.py`, the `create_short_summary` and `create_long_summary` functions are wrapped in `@conditional_observe` decorators, emitting `SPAN` observations. However, they internally call `honcho_llm_call`, which emits a nested `GENERATION` observation. Langfuse captures the model/tokens on the inner Generation, leaving the outer Span telemetry empty in the UI. 
+
+Other modules (like the Dialectic Agent) avoid this by not using an outer decorator, and instead passing `track_name` directly to `honcho_llm_call()`, creating a single top-level `GENERATION` named exactly what it needs to be.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Consolidate the Langfuse telemetry for summarizer operations into single `GENERATION` observations to expose model and token data at the root level of the trace.
+- Align `src/utils/summarizer.py` with the `track_name` pattern established elsewhere in the codebase.
+
+**Non-Goals:**
+- Modifying the behavior, logic, or model choices of the summarizer algorithms themselves.
+- Altering the implementation of `honcho_llm_call` or its core telemetry structure.
+
+## Decisions
+
+**1. Remove outer `@conditional_observe` from summarizers**
+Instead of having a `SPAN` encompassing a `GENERATION`, we will completely remove the outer `SPAN`.
+*Rationale:* Nested observations where the inner contains all the actionable metadata lead to confusing UI experiences. A single pure generation is exactly what these functions represent.
+
+**2. Pass `track_name` to `honcho_llm_call`**
+We will add `track_name="Create Short Summary"` and `track_name="Create Long Summary"` to their respective `honcho_llm_call` arguments.
+*Rationale:* `honcho_llm_call` is already designed to consume a `track_name` and correctly name the `GENERATION` trace. This perfectly satisfies the observability requirement without redundant decorators.
+
+## Risks / Trade-offs
+
+- **Risk:** Any manual metadata that the outer `@conditional_observe` might have been capturing could be lost.
+- **Mitigation:** The summarizer functions do not pass any special manual context kwargs or tags to the decorator; they simply name the span. The inner generation captures all input/output payload data reliably.
diff --git a/...fix-summarizer-telemetry-spans/explorations/2026-05-05-nested-langfuse-spans.md b/...fix-summarizer-telemetry-spans/explorations/2026-05-05-nested-langfuse-spans.md
@@ -0,0 +1,47 @@
+# Exploration: Nested Langfuse Observations & Missing Telemetry Columns
+
+**Date**: 2026-05-05
+**Topic**: Root cause analysis for `Create Short Summary` (and similar traces) missing model and token attribution in the Langfuse UI.
+
+## The Problem
+The user noticed that the trace observation named **"Create Short Summary"** appears in Langfuse without a defined model or token usage, despite the underlying LLM call functioning correctly. They hypothesized that this might be a widespread issue for other traces.
+
+## System Analysis
+
+### How the Summarizer works
+In `src/utils/summarizer.py`, the functions are structured like this:
+```python
+@conditional_observe(name="Create Short Summary")
+async def create_short_summary(...) -> HonchoLLMCallResponse[str]:
+    # ...
+    return await honcho_llm_call(...)
+```
+
+### The "Nested Observation" Conflict
+1. The `@conditional_observe` decorator on the outer function creates a **SPAN** observation (since it does not specify `as_type="generation"`).
+2. Inside that function, `honcho_llm_call` executes. `honcho_llm_call` is wrapped with `@conditional_observe(name="LLM Call", as_type="generation")`.
+3. Consequently, Langfuse records a nested hierarchy:
+   - **SPAN**: "Create Short Summary" *(No model/tokens)*
+     - **GENERATION**: "LLM Call" *(Contains model/tokens)*
+
+Because Langfuse UI (specifically the trace and generations tables) only aggregates model and token statistics at the **GENERATION** level, the top-level "Create Short Summary" span appears empty.
+
+### How other modules (e.g. Dialectic Agent) avoid this
+In `src/dialectic/core.py`, the `Dialectic Agent` does **not** use an outer `@conditional_observe` span. Instead, it utilizes the `track_name` argument directly:
+```python
+return await honcho_llm_call(
+    # ...
+    track_name="Dialectic Agent",
+)
+```
+This causes the inner `honcho_llm_call` GENERATION observation to dynamically rename itself to "Dialectic Agent", cleanly consolidating the trace into a single generation with full token/model attribution.
+
+## Scope of the Issue
+A codebase-wide search reveals that ONLY the `create_short_summary` and `create_long_summary` functions suffer from this nested `@conditional_observe` pattern. 
+
+## Recommended Path Forward (Actionable Fix)
+We can fix this permanently and elegantly by aligning the summarizer module with the `Dialectic Agent` pattern:
+1. **Remove** the `@conditional_observe(name="Create Short Summary")` and `@conditional_observe(name="Create Long Summary")` decorators from `src/utils/summarizer.py`.
+2. **Inject** `track_name="Create Short Summary"` and `track_name="Create Long Summary"` into their respective `honcho_llm_call(...)` invocations.
+
+This will collapse the nested traces into single, pure GENERATION observations that fully populate the Langfuse UI.
diff --git a/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/proposal.md b/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/proposal.md
@@ -0,0 +1,22 @@
+## Why
+
+Langfuse UI only aggregates and displays `model` and `tokens` explicitly at the `GENERATION` observation level. Currently, the `create_short_summary` and `create_long_summary` functions in `src/utils/summarizer.py` are wrapped with `@conditional_observe`, which creates a top-level `SPAN` observation. Inside them, `honcho_llm_call` creates a nested `GENERATION` observation. Because the root observation is a `SPAN`, the Langfuse dashboard does not display model and token usage for the summarizer traces at a glance, obscuring critical telemetry for these operations.
+
+## What Changes
+
+- Remove the `@conditional_observe(name="Create Short Summary")` and `@conditional_observe(name="Create Long Summary")` decorators from the summarizer functions.
+- Inject the names explicitly via the `track_name="Create Short Summary"` and `track_name="Create Long Summary"` parameters in the inner `honcho_llm_call` invocations.
+- This collapses the traces into single, pure `GENERATION` observations that properly bubble up their telemetry in the Langfuse UI, mirroring the successful pattern used in `Dialectic Agent`.
+
+## Capabilities
+
+### New Capabilities
+None.
+
+### Modified Capabilities
+- `observability-langfuse`: Require that all LLM interactions, including summarizers, cleanly propagate top-level generation traces without opaque span wrappers.
+
+## Impact
+
+- `src/utils/summarizer.py`: The `@conditional_observe` decorators will be removed and replaced with explicit `track_name` kwargs in the LLM calls.
+- **Langfuse Telemetry**: Summarizer operations will appear as top-level `GENERATION`s instead of nested inside `SPAN`s, granting full visibility to token and model metadata.
diff --git a/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/.gitkeep b/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/.gitkeep
diff --git a/.../2026-05-05-fix-summarizer-telemetry-spans/specs/observability-langfuse/spec.md b/.../2026-05-05-fix-summarizer-telemetry-spans/specs/observability-langfuse/spec.md
@@ -0,0 +1,7 @@
+## MODIFIED Requirements
+
+### Requirement: Langfuse Observability Tracing
+
+#### Scenario: Summarization Tracing
+- **WHEN** a background task or explicit request triggers `create_short_summary` or `create_long_summary`
+- **THEN** the system MUST trace it as a top-level `GENERATION` observation without nested `SPAN` wrappers to ensure accurate model and token attribution in the Langfuse UI
diff --git a/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.md b/openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.md
@@ -0,0 +1,11 @@
+## 1. Source Modification
+
+- [x] 1.1 Remove `@conditional_observe(name="Create Short Summary")` from `create_short_summary` in `src/utils/summarizer.py`.
+- [x] 1.2 Inject `track_name="Create Short Summary"` into the `honcho_llm_call` within `create_short_summary`.
+- [x] 1.3 Remove `@conditional_observe(name="Create Long Summary")` from `create_long_summary` in `src/utils/summarizer.py`.
+- [x] 1.4 Inject `track_name="Create Long Summary"` into the `honcho_llm_call` within `create_long_summary`.
+
+## 2. Verification
+
+- [x] 2.1 Trigger a summary flow within the application (e.g. via agent context limits or explicit test).
+- [x] 2.2 Verify in the Langfuse UI that `Create Short Summary` and `Create Long Summary` now appear as root-level `GENERATION` observations containing valid `model` and `tokens` columns, with no empty top-level SPAN wrappers.
diff --git a/openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yaml b/openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-05