feat: Langfuse LLM observability — generation tracing with token usage reporting#693
feat: Langfuse LLM observability — generation tracing with token usage reporting#693phuongvm wants to merge 4 commits into
Conversation
WalkthroughThis PR upgrades Langfuse observability for LLM calls by adding generation-typed observation support with explicit token reporting. The ChangesLangfuse Generation Tracing Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
openspec/workspace/sessions/agent_share.md (1)
1-133: ⚡ Quick winAvoid maintaining two near-identical coordination docs as active sources.
openspec/workspace/sessions/agent_share.mdandopenspec/workspace/journal/2026-05-05/agent_share_afternoon.mdcurrently duplicate the same operational content. Please keep one canonical live doc and make the other a pointer/snapshot note to reduce drift risk during active coordination.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openspec/workspace/sessions/agent_share.md` around lines 1 - 133, Two near-identical coordination docs (openspec/workspace/sessions/agent_share.md and openspec/workspace/journal/2026-05-05/agent_share_afternoon.md) must be deduplicated: choose one as the canonical live document (update its header and Ground Rules to declare it canonical), convert the other into a snapshot/pointer that contains a short note linking to the canonical file and the snapshot timestamp, merge any unique entries from Updates Log or Shared Notes into the canonical file (referencing the unique markers like "Updates Log" and "Shared Notes") before archiving the snapshot, update the File Ownership table to reflect the single live file, and ensure all agents and any automation reference only the canonical path going forward.src/llm/api.py (1)
311-323: ⚡ Quick winConsolidate duplicate Langfuse usage reporting into a helper function.
The Langfuse usage update logic is duplicated in both the tool-less path (lines 312-323) and tool-loop path (lines 363-374). Extracting this to a helper function improves maintainability and reduces the risk of inconsistent updates.
♻️ Proposed refactor
Add a helper function at module level:
def _update_langfuse_generation_usage(result: HonchoLLMCallResponse[Any]) -> None: """Update Langfuse generation with token usage from the response.""" if not settings.LANGFUSE_PUBLIC_KEY: return try: from langfuse import get_client usage = {} if result.input_tokens is not None: usage["input"] = result.input_tokens if result.output_tokens is not None: usage["output"] = result.output_tokens if usage: get_client().update_current_generation(usage_details=usage) except Exception as exc: logger.debug("Failed to update Langfuse usage: %s", exc)Then replace both blocks with:
result: ( HonchoLLMCallResponse[Any] | AsyncIterator[HonchoLLMCallStreamChunk] ) = await decorated() - - if isinstance(result, HonchoLLMCallResponse) and settings.LANGFUSE_PUBLIC_KEY: - try: - from langfuse import get_client - usage = {} - if result.input_tokens is not None: - usage["input"] = result.input_tokens - if result.output_tokens is not None: - usage["output"] = result.output_tokens - if usage: - get_client().update_current_generation(usage_details=usage) - except Exception as exc: - logger.debug("Failed to update Langfuse usage: %s", exc) - + if isinstance(result, HonchoLLMCallResponse): + _update_langfuse_generation_usage(result) if trace_name and isinstance(result, HonchoLLMCallResponse):And similarly for the tool-loop path around line 363.
Also applies to: 363-374
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/api.py` around lines 311 - 323, Extract the duplicated Langfuse usage reporting into a module-level helper function (e.g., def _update_langfuse_generation_usage(result: HonchoLLMCallResponse[Any]) -> None) that checks settings.LANGFUSE_PUBLIC_KEY, imports get_client from langfuse, builds the usage dict from result.input_tokens and result.output_tokens, calls get_client().update_current_generation(usage_details=usage) when non-empty, and logs exceptions via logger.debug; then replace the two duplicated blocks in the tool-less path and the tool-loop path (the existing code handling HonchoLLMCallResponse around the current if blocks) with calls to this new _update_langfuse_generation_usage(result).openspec/specs/observability-langfuse/spec.md (1)
4-4: 💤 Low valueConsider completing the TBD placeholder.
The Purpose section contains "TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho." Since the requirements below are well-defined, the purpose statement could be finalized to better introduce the specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openspec/specs/observability-langfuse/spec.md` at line 4, Replace the "TBD: Core capabilities for integrating Langfuse LLM observability and tracing telemetry within Honcho." placeholder in the Purpose section with a concise finalized purpose statement that describes the spec's scope—e.g., that it defines core capabilities for integrating Langfuse LLM observability and tracing into Honcho, including telemetry collection (metrics, traces, logs), integration points (SDK hooks, middleware), configuration options, data retention and sampling rules, and supported user-facing features (dashboards, alerting, trace correlation). Update the Purpose paragraph to mention the intended audience and high-level goals so it clearly introduces the rest of the document.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md`:
- Line 126: Replace the misspelled word "approplies" with "applies" in the
sentence "**Next command**: **`/opsx-propose`** (or workspace equivalent) when
Commander approplies change name and scope boundaries." so it reads "...when
Commander applies change name and scope boundaries." — update the phrase
containing "approplies" accordingly.
---
Nitpick comments:
In `@openspec/specs/observability-langfuse/spec.md`:
- Line 4: Replace the "TBD: Core capabilities for integrating Langfuse LLM
observability and tracing telemetry within Honcho." placeholder in the Purpose
section with a concise finalized purpose statement that describes the spec's
scope—e.g., that it defines core capabilities for integrating Langfuse LLM
observability and tracing into Honcho, including telemetry collection (metrics,
traces, logs), integration points (SDK hooks, middleware), configuration
options, data retention and sampling rules, and supported user-facing features
(dashboards, alerting, trace correlation). Update the Purpose paragraph to
mention the intended audience and high-level goals so it clearly introduces the
rest of the document.
In `@openspec/workspace/sessions/agent_share.md`:
- Around line 1-133: Two near-identical coordination docs
(openspec/workspace/sessions/agent_share.md and
openspec/workspace/journal/2026-05-05/agent_share_afternoon.md) must be
deduplicated: choose one as the canonical live document (update its header and
Ground Rules to declare it canonical), convert the other into a snapshot/pointer
that contains a short note linking to the canonical file and the snapshot
timestamp, merge any unique entries from Updates Log or Shared Notes into the
canonical file (referencing the unique markers like "Updates Log" and "Shared
Notes") before archiving the snapshot, update the File Ownership table to
reflect the single live file, and ensure all agents and any automation reference
only the canonical path going forward.
In `@src/llm/api.py`:
- Around line 311-323: Extract the duplicated Langfuse usage reporting into a
module-level helper function (e.g., def
_update_langfuse_generation_usage(result: HonchoLLMCallResponse[Any]) -> None)
that checks settings.LANGFUSE_PUBLIC_KEY, imports get_client from langfuse,
builds the usage dict from result.input_tokens and result.output_tokens, calls
get_client().update_current_generation(usage_details=usage) when non-empty, and
logs exceptions via logger.debug; then replace the two duplicated blocks in the
tool-less path and the tool-loop path (the existing code handling
HonchoLLMCallResponse around the current if blocks) with calls to this new
_update_langfuse_generation_usage(result).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 83d13b2b-ab24-4032-a295-42a680b7350d
📒 Files selected for processing (24)
openspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/.openspec.yamlopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/design.mdopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/explorations/2026-05-05-nested-langfuse-spans.mdopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/proposal.mdopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/.gitkeepopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/specs/observability-langfuse/spec.mdopenspec/changes/archive/2026-05-05-fix-summarizer-telemetry-spans/tasks.mdopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/.openspec.yamlopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/design.mdopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.mdopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/proposal.mdopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/specs/observability-langfuse/spec.mdopenspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/tasks.mdopenspec/specs/observability-langfuse/spec.mdopenspec/workspace/journal/2026-05-05/agent_share_afternoon.mdopenspec/workspace/memories/leader/2026-05-05/archive_01.mdopenspec/workspace/memories/leader/2026-05-05/current.mdopenspec/workspace/memories/leader/lesson_learnt/L-001-langfuse-custom-model-tokens.mdopenspec/workspace/sessions/agent_share.mdsrc/llm/api.pysrc/llm/runtime.pysrc/telemetry/logging.pysrc/utils/summarizer.pytests/utils/test_clients.py
| ## 10. Status | ||
|
|
||
| **Exploration**: Closed with **decision to formalize via OpenSpec proposal** (recommended Option A). | ||
| **Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander approplies change name and scope boundaries. |
There was a problem hiding this comment.
Fix typo in documentation.
"approplies" should be "applies".
📝 Proposed fix
-**Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander approplies change name and scope boundaries.
+**Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander applies change name and scope boundaries.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander approplies change name and scope boundaries. | |
| **Next command**: **`/opsx-propose`** (or workspace equivalent) when Commander applies change name and scope boundaries. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@openspec/changes/archive/2026-05-05-honcho-langfuse-generation-traces/explorations/2026-05-05-langfuse-generation-observations-openspec-gap.md`
at line 126, Replace the misspelled word "approplies" with "applies" in the
sentence "**Next command**: **`/opsx-propose`** (or workspace equivalent) when
Commander approplies change name and scope boundaries." so it reads "...when
Commander applies change name and scope boundaries." — update the phrase
containing "approplies" accordingly.
|
The actual langfuse updates seem directionally good, but please clean up all of the openspec files and remove from the PR |
|
Openspec files have been removed per request. PR is now clean and ready for re-review. |
- Introduced new capability for tracing LLM calls as Langfuse generation observations. - Updated `honcho_llm_call` to use `@conditional_observe` with `as_type="generation"`. - Modified `update_current_langfuse_observation` to set the `model` field and preserve metadata. - Added comprehensive documentation and tasks for verification and future enhancements. - Created exploration and proposal documents to formalize the integration process.
…custom models and initialize project documentation
…name passing to honcho_llm_call for improved Langfuse telemetry aggregation.
9a4ea1c to
4bd10a7
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/llm/api.py (2)
418-429: ⚡ Quick winConsider extracting token usage reporting to a helper function.
The token usage reporting logic appears twice (lines 418-429 and 470-481) with identical implementation. Since this is best-effort telemetry with specific exception handling, extracting it to a helper would improve maintainability.
♻️ Proposed refactor
Add a helper function:
def _report_langfuse_token_usage( result: HonchoLLMCallResponse[Any], ) -> None: """Report token usage to Langfuse for the current generation span.""" if not settings.LANGFUSE_PUBLIC_KEY: return try: from langfuse import get_client usage = {} if result.input_tokens is not None: usage["input"] = result.input_tokens if result.output_tokens is not None: usage["output"] = result.output_tokens if usage: get_client().update_current_generation(usage_details=usage) except Exception as exc: logger.debug("Failed to update Langfuse usage: %s", exc)Then replace both occurrences with:
if isinstance(result, HonchoLLMCallResponse): _report_langfuse_token_usage(result)Note: The bare
Exceptioncatch is appropriate here for best-effort telemetry and should not be changed.Also applies to: 470-481
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/api.py` around lines 418 - 429, Extract the duplicated Langfuse token-usage reporting into a helper function (e.g., _report_langfuse_token_usage(result: HonchoLLMCallResponse)) that checks settings.LANGFUSE_PUBLIC_KEY, imports get_client from langfuse, builds the usage dict from result.input_tokens/result.output_tokens, calls get_client().update_current_generation(usage_details=usage) when non-empty, and catches/ logs exceptions with logger.debug as currently done; then replace both duplicated blocks that test isinstance(result, HonchoLLMCallResponse) with a single call to _report_langfuse_token_usage(result) (leaving the bare Exception catch behavior unchanged).
318-412: 💤 Low valueConsider extracting the truncation logic for reusability.
The max_input_tokens enforcement logic is well-implemented and the token-based cap detection correctly handles the single-oversized-message edge case. The new toolless path provides clear separation from the tool-enabled flow.
However, this creates a parallel code path with significant duplication of the honcho_llm_call_inner invocation logic (lines 356-397 vs 227-266). Consider whether a shared helper or parameter flag could reduce this duplication in a future refactor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/llm/api.py` around lines 318 - 412, The truncation and cap-detection code duplicated for the toolless path should be extracted into a small reusable helper (e.g., compute_truncated_messages_or_original) that encapsulates use of count_message_tokens and truncate_messages_to_fit and returns (messages, hit_input_token_cap); replace references to toolless_messages/toolless_hit_input_token_cap with that helper in the toolless branch and update calls to honcho_llm_call_inner to use the returned messages so the same invocation logic (currently in _toolless_call / wrapped) can be shared with the tool-enabled path, reducing duplicate honcho_llm_call_inner invocation code and keeping existing behavior (including stream, track_name, enable_retry, and decorated() fallthrough).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/llm/api.py`:
- Around line 418-429: Extract the duplicated Langfuse token-usage reporting
into a helper function (e.g., _report_langfuse_token_usage(result:
HonchoLLMCallResponse)) that checks settings.LANGFUSE_PUBLIC_KEY, imports
get_client from langfuse, builds the usage dict from
result.input_tokens/result.output_tokens, calls
get_client().update_current_generation(usage_details=usage) when non-empty, and
catches/ logs exceptions with logger.debug as currently done; then replace both
duplicated blocks that test isinstance(result, HonchoLLMCallResponse) with a
single call to _report_langfuse_token_usage(result) (leaving the bare Exception
catch behavior unchanged).
- Around line 318-412: The truncation and cap-detection code duplicated for the
toolless path should be extracted into a small reusable helper (e.g.,
compute_truncated_messages_or_original) that encapsulates use of
count_message_tokens and truncate_messages_to_fit and returns (messages,
hit_input_token_cap); replace references to
toolless_messages/toolless_hit_input_token_cap with that helper in the toolless
branch and update calls to honcho_llm_call_inner to use the returned messages so
the same invocation logic (currently in _toolless_call / wrapped) can be shared
with the tool-enabled path, reducing duplicate honcho_llm_call_inner invocation
code and keeping existing behavior (including stream, track_name, enable_retry,
and decorated() fallthrough).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f4025465-1eea-4c44-a06b-63af2d0e0028
📒 Files selected for processing (5)
src/llm/api.pysrc/llm/runtime.pysrc/telemetry/logging.pysrc/utils/summarizer.pytests/utils/test_clients.py
Summary
Integrates Langfuse as an observability layer for Honcho's LLM calls. Every LLM generation is automatically traced as a Langfuse generation span with model name, provider, input/output tokens, and reasoning effort metadata — enabling cost tracking, latency analysis, and debugging across all Honcho agents.
Problem
Honcho's LLM calls (Deriver, Dialectic, Dreamer, MCP agents) had no centralized observability. Teams could not:
Solution
Automatic Generation Tracing
Every call through
honcho_llm_call()is automatically traced as a Langfuse generation span via the@conditional_observedecorator. The span is updated with runtime metadata: model name, provider, input/output tokens, and is_fallback status.Token Usage Reporting
Token counts are reported on the generation span after each call completes, working for both standard providers (Anthropic, OpenAI) and custom model configurations.
Refactored Summarizer Telemetry
Outer summarizer span decorators were replaced with direct
track_namepassing tohoncho_llm_call(), enabling properly nested traces under correct parent traces and eliminating orphaned/duplicate generation spans.Configuration
Langfuse is configured via environment variables:
LANGFUSE_PUBLIC_KEYLANGFUSE_SECRET_KEYLANGFUSE_HOSTThe
@conditional_observedecorator is a no-op with zero overhead when Langfuse is not configured.Architecture
flowchart TD A[honcho_llm_call] --> B[@conditional_observe] B --> C[Create Langfuse generation span] C --> D[resolve_runtime_model_config] D --> E[plan_attempt / select provider+model] E --> F[honcho_llm_call_inner] F --> G[update_current_generation with tokens]Trace Flow
Files Changed
src/llm/api.py@conditional_observedecorator, post-call Langfuse token usage reportingsrc/llm/runtime.pyupdate_current_langfuse_observation()for span metadata updatessrc/llm/tool_loop.pysrc/llm/executor.pysrc/utils/summarizer/main.pysrc/utils/summarizer/paper.pysrc/utils/summarizer/summarization.pysrc/telemetry/validation.pyopenspec/specs/observability-langfuse/spec.mdNotes for Reviewers
Summary by CodeRabbit