fix(harmony): normalize gpt-oss tool transcripts#1547
Conversation
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
📝 WalkthroughWalkthroughThis PR adds case-insensitive chat reasoning-effort parsing, ensures tool-result Harmony messages use channel="commentary", normalizes tool/function-call arguments as compact JSON when appropriate, and enforces strict channel-based routing so untagged content is dropped rather than treated as final output. ChangesHarmony Message Lifecycle: Request Building and Response Parsing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors reasoning effort parsing, updates tool result messages to use the commentary channel, introduces JSON tool argument compaction, and drops channel-less or unknown channel content instead of treating it as final text. The review feedback suggests optimizing performance by avoiding heap allocations from to_ascii_lowercase() in both parse_chat_reasoning_effort and normalize_tool_arguments using case-insensitive checks. Additionally, it recommends downgrading the log level from warn! to debug! when dropping incomplete content to prevent log noise in production and maintain consistency.
| fn parse_chat_reasoning_effort(effort: &str) -> ReasoningEffort { | ||
| match effort.to_ascii_lowercase().as_str() { | ||
| "high" => ReasoningEffort::High, | ||
| "medium" => ReasoningEffort::Medium, | ||
| "low" => ReasoningEffort::Low, | ||
| // Harmony does not support minimal reasoning effort. | ||
| "minimal" => ReasoningEffort::Low, | ||
| _ => ReasoningEffort::Medium, | ||
| } | ||
| } |
There was a problem hiding this comment.
The parse_chat_reasoning_effort function converts the input string to lowercase using to_ascii_lowercase(), which performs a heap allocation. Since this function is called on the request path, we can avoid this allocation entirely by using eq_ignore_ascii_case to perform case-insensitive comparisons.
| fn parse_chat_reasoning_effort(effort: &str) -> ReasoningEffort { | |
| match effort.to_ascii_lowercase().as_str() { | |
| "high" => ReasoningEffort::High, | |
| "medium" => ReasoningEffort::Medium, | |
| "low" => ReasoningEffort::Low, | |
| // Harmony does not support minimal reasoning effort. | |
| "minimal" => ReasoningEffort::Low, | |
| _ => ReasoningEffort::Medium, | |
| } | |
| } | |
| fn parse_chat_reasoning_effort(effort: &str) -> ReasoningEffort { | |
| if effort.eq_ignore_ascii_case("high") { | |
| ReasoningEffort::High | |
| } else if effort.eq_ignore_ascii_case("medium") { | |
| ReasoningEffort::Medium | |
| } else if effort.eq_ignore_ascii_case("low") { | |
| ReasoningEffort::Low | |
| } else if effort.eq_ignore_ascii_case("minimal") { | |
| ReasoningEffort::Low | |
| } else { | |
| ReasoningEffort::Medium | |
| } | |
| } |
There was a problem hiding this comment.
Thanks. I am leaving this one as-is for readability. This request-path helper only handles one small enum-like string per request, and the allocation is not material compared with the surrounding request parsing/building work. The current match form is easier to scan and mirrors the previous logic.
| let should_parse_as_json = match content_type { | ||
| Some(content_type) => content_type.to_ascii_lowercase().contains("json"), | ||
| None => true, | ||
| }; |
There was a problem hiding this comment.
The normalize_tool_arguments function calls to_ascii_lowercase() on the content_type string, which performs a heap allocation. Since this is called for every tool call during response parsing, we can optimize this by performing a case-insensitive substring check without any allocations.
let should_parse_as_json = match content_type {
Some(ct) => ct.as_bytes().windows(4).any(|w| {
w[0].to_ascii_lowercase() == b'j'
&& w[1].to_ascii_lowercase() == b's'
&& w[2].to_ascii_lowercase() == b'o'
&& w[3].to_ascii_lowercase() == b'n'
}),
None => true,
};There was a problem hiding this comment.
Thanks. I am going to skip this micro-optimization for now. Keeping the content-type check as a straightforward case-insensitive substring check is more readable, and this path immediately does JSON parsing/serialization when it matches, so avoiding this small allocation is unlikely to matter without profiling evidence.
| None => { | ||
| tracing::warn!( | ||
| "Dropping incomplete Harmony content without a channel instead of treating it as final text" | ||
| ); | ||
| } | ||
| _ => { | ||
| tracing::warn!( | ||
| channel = ?current_channel, | ||
| "Dropping incomplete Harmony content with unknown channel instead of treating it as final text" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Using tracing::warn! for dropping incomplete content when finalizing a stream can lead to excessive log noise in production. Incomplete content is a common and expected occurrence when a stream is cut off (e.g., due to max_tokens limit or client cancellation). Maintain consistency with existing logging practices within the same crate by using tracing::debug! or tracing::info! instead, matching the behavior in parse_chunk.
| None => { | |
| tracing::warn!( | |
| "Dropping incomplete Harmony content without a channel instead of treating it as final text" | |
| ); | |
| } | |
| _ => { | |
| tracing::warn!( | |
| channel = ?current_channel, | |
| "Dropping incomplete Harmony content with unknown channel instead of treating it as final text" | |
| ); | |
| } | |
| None => { | |
| tracing::debug!( | |
| "Dropping incomplete Harmony content without a channel instead of treating it as final text" | |
| ); | |
| } | |
| _ => { | |
| tracing::debug!( | |
| channel = ?current_channel, | |
| "Dropping incomplete Harmony content with unknown channel instead of treating it as final text" | |
| ); | |
| } |
References
- Maintain consistency with existing logging practices within the same crate for similar functionalities, even if a different logging level might seem more appropriate in isolation.
There was a problem hiding this comment.
Done in ac6d57e. I kept the behavior change but lowered incomplete residual logging from warn to debug to avoid expected stream-finalization noise.
| None => { | ||
| tracing::debug!( | ||
| "Dropping Harmony streaming delta without a channel instead of treating it as final text" | ||
| ); |
There was a problem hiding this comment.
🟡 Nit: This logs at debug! level, but the equivalent paths in parse_messages (line 243) and handle_incomplete_content (line 106) log at warn!. In production, streaming is the common path, so channel-less content arriving here would be the hardest to spot in logs. Consider using warn! for consistency so all three drop sites surface at the same level.
| None => { | |
| tracing::debug!( | |
| "Dropping Harmony streaming delta without a channel instead of treating it as final text" | |
| ); | |
| None => { | |
| tracing::warn!( | |
| "Dropping Harmony streaming delta without a channel instead of treating it as final text" | |
| ); | |
| } |
There was a problem hiding this comment.
Thanks. I am going to keep this at debug for now. The non-streaming incomplete residual drops were also lowered to debug in ac6d57e, so the levels are now consistent in the lower-noise direction. If maintainers prefer warning on malformed Harmony residuals, that can be adjusted together across the parser paths.
There was a problem hiding this comment.
Clean, well-focused PR that fixes Harmony transcript fidelity across four related issues. The changes are logically consistent and well-tested.
Summary: 0 🔴 Important, 1 🟡 Nit, 1 🟣 Pre-existing
- 🟡 parser.rs streaming log level: The channel-less drop in
parse_chunklogs atdebug!while the equivalent non-streaming paths usewarn!. Suggested elevating for consistency. - 🟣 parser.rs streaming unknown-channel silent drop: The
_ => {}catch-all inparse_chunksilently drops unknown channels, unlikeparse_messageswhich now warns. Pre-existing but worth adding a log for parity.
The commentary-channel annotation, JSON normalization, case-insensitive reasoning effort, and channel-less content dropping all look correct and well-motivated.
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
|
Updated the PR body to use the repository template headers and kept the public description free of private reproduction artifacts. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/harmony/parser.rs (1)
98-103:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAppend analysis fragments instead of replacing them.
Both branches overwrite
analysis, so only the last"analysis"segment survives when Harmony emits multiple analysis messages or leaves trailing incomplete analysis content. That drops reasoning in bothparse_complete()andfinalize().💡 Proposed fix
Some("analysis") => { - *analysis = Some(current_content); + match analysis.as_mut() { + Some(existing) => { + existing.push('\n'); + existing.push_str(¤t_content); + } + None => *analysis = Some(current_content), + } } ... Some("analysis") => { // Process each content item // For Chat API, we join them into a single reasoning_content let text = Self::extract_text_from_content(&msg.content); if !text.is_empty() { - analysis = Some(text); + match analysis.as_mut() { + Some(existing) => { + existing.push('\n'); + existing.push_str(&text); + } + None => analysis = Some(text), + } } }Also applies to: 205-213
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/harmony/parser.rs` around lines 98 - 103, The code currently replaces the analysis buffer instead of appending when handling chunks (matching current_channel == Some("analysis")), which causes earlier analysis fragments to be lost; in the branch that sets/updates the analysis variable (the code that assigns to *analysis = Some(current_content)) change it to append to the existing Option<String> (e.g., if analysis.is_some() then push_str(¤t_content) into the existing String else set Some(current_content)), and apply the same append-fix to the other occurrence referenced (the parse_complete()/finalize() related branch around the 205-213 region) so all analysis fragments are concatenated rather than overwritten.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@model_gateway/src/routers/grpc/harmony/parser.rs`:
- Around line 98-103: The code currently replaces the analysis buffer instead of
appending when handling chunks (matching current_channel == Some("analysis")),
which causes earlier analysis fragments to be lost; in the branch that
sets/updates the analysis variable (the code that assigns to *analysis =
Some(current_content)) change it to append to the existing Option<String> (e.g.,
if analysis.is_some() then push_str(¤t_content) into the existing String
else set Some(current_content)), and apply the same append-fix to the other
occurrence referenced (the parse_complete()/finalize() related branch around the
205-213 region) so all analysis fragments are concatenated rather than
overwritten.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4e6eed76-f5ff-4321-8447-60f56de2ae77
📒 Files selected for processing (2)
model_gateway/src/routers/grpc/harmony/builder.rsmodel_gateway/src/routers/grpc/harmony/parser.rs
|
Review round summary: updated the PR description to match the repository template, accepted the incomplete-residual log-level suggestion in ac6d57e, and skipped the two allocation micro-optimizations because they reduce readability without profiling evidence. I also kept the streaming channel-less drop at debug so all residual-drop paths stay low-noise; happy to adjust all of them together if maintainers prefer warnings. |
|
Follow-up on the CodeRabbit analysis-fragment note: I verified the code path. The overwrite behavior for multiple analysis fragments was pre-existing, and this PR's behavior change is focused on tool transcript fidelity, tool-result channel metadata, JSON argument normalization, reasoning_effort parsing, and dropping malformed channel-less final text. I am leaving analysis-fragment aggregation for a separate follow-up so this PR stays focused on the reproduced GPT-OSS tool-call symptom. |
|
Hi @ai-jz, the DCO sign-off check has failed. All commits must include a To fix existing commits: # Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-leaseTo sign off future commits automatically:
|
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
03210a1 to
517f64a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 517f64a1a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| None => { | ||
| tracing::warn!( | ||
| "Dropping Harmony message without a channel instead of treating it as final text" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Preserve fallback for non-canonical final channels
Dropping assistant messages when channel is None or unrecognized causes successful generations to return empty user-visible output. parse_messages used to fall back to appending these texts to final_text, which is important because this parser already handles other known channel-format deviations (see the recipient-first tool-call workaround in the same function). With this change, any model output that uses a variant like an unchanneled/unknown final segment will now produce finish_reason="stop" but no message.content (and similarly empty Responses output).
Useful? React with 👍 / 👎.
|
I agree the channel fix and the incomplete message drop is correct. I don't agree on compact json as that is pure an cosmetic fix for the model capabilities and sglang doesn't do it, but it is ok to add such aesthetic fix. Couple of things need to be addressed:
|
| ); | ||
| } | ||
|
|
||
| fn assert_tool_result_metadata(message: &HarmonyMessage, function_name: &str) { |
There was a problem hiding this comment.
I don't think all the test cases here are necessary.
Adds an integration test in TestChatCompletionGptOss (inherited by the
120B variant) that exercises the actual user-visible symptom of the
SMG harmony bug: zero-width-space runs, NBSP runs, long whitespace
runs, ellipsis loops, or English self-analysis text leaking into
choices[].message.content on a successful HTTP 200 response after a
tool round trip.
The test drives a multi-turn flow that mirrors the structural shape
of the customer-derived empirical trigger documented in the harmony
bug report:
- Style-constrained system prompt (no markdown / parens / digits)
- 6-tool nested-args schema (cancel_subscription_items shape matches
the empirical trigger)
- Multi-turn confirmation pattern
- Minimal {\"ok\": true} tool result
- reasoning_effort=high
Detection regex is tuned for zero false positives against the direct-
vLLM HTTP control corpus (0/7) at the cost of some catch rate against
the broken-path corpus (6/11); the 5 missed broken-path cases use
scattered NBSP between words, which is ambiguous with normal model
output and cannot be distinguished without false positives.
Reliability is best-effort: a synthetic English prompt may not trigger
as deterministically as the customer's 78KB Portuguese prompt. The
test asserts hard properties of the post-tool content that the SMG
harmony fix in #1547 is meant to guarantee; when it fails, it fails
loudly with the exact symptom string for triage.
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
Adds an integration test in TestChatCompletionGptOss (inherited by TestChatCompletionGptOss120B) that drives a multi-turn tool-call flow through the gateway and asserts the post-tool choices[].message.content is non-empty and contains no: - Runs of invisible characters - Long whitespace runs - Ellipsis loops - Leaked English self-analysis markers These are the integrity properties the gpt-oss Harmony round-trip fix in PR #1547 is meant to guarantee. On failure the assertion reports the matched marker for triage. Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
Description
Problem
In GPT-OSS tool-calling flows through the SMG gRPC path, the request can succeed with HTTP 200 and
finish_reason="stop", but the final assistant message after a tool result can be malformed. The visiblemessage.contentmay contain repeated fragments, invisible-character loops, or model self-repair / analysis-style text that should not be exposed as the final answer.The same request shape did not reproduce through the native vLLM OpenAI HTTP path with the same vLLM version. That points to a transcript construction / parsing difference in the SMG gRPC path rather than a backend crash or transport error.
Solution
This PR keeps the fix as a focused Harmony round-trip update. For the reproduced SMG+vLLM GPT-OSS failure mode, the changes address distinct parts of the broken tool round trip:
commentarychannel so SMG does not feed GPT-OSS a channel-less tool result.In failure-chain terms, non-canonical tool-call args plus a channel-less tool result can produce a malformed post-tool continuation, and the old parser could surface malformed residual text as normal final content. A separate
reasoning_effortablation did not affect this failure mode, so that change was removed.I considered splitting these into smaller PRs. The latest leave-one-out testing shows argument normalization is required for the reproduced symptom, and earlier parser-side-only testing showed that parser fixes alone were not enough without the builder-side tool-result channel fix. Parser hardening is included as the response-boundary guard for the same malformed Harmony residual class; a follow-up leave-one-out run with argument normalization plus tool-result channeling passed 3/3 without that hardening, so I am open to splitting the hardening into a separate PR if maintainers prefer.
Changes
channel="commentary"on Chat and Responses tool-result messages.Test Plan
Additional validation was run on an H100 node against GPT-OSS 120B served by vLLM gRPC:
tool_calls[].function.argumentspassed 3/3.reasoning_effort-only ablation failed 3/3 runs, so case-insensitive Chat reasoning-effort parsing is not included in this PR.finish_reason="length"with a large leakedreasoning_content.reasoning_effortnormalization passed 3/3 runs using lowercasereasoning_effortinput.Reproduction artifacts are intentionally not included in this public PR.
Notes For Reviewers
The main behavior question is parser hardening: this PR drops channel-less or unknown-channel residual Harmony content instead of treating it as normal assistant final content. That is intentional because malformed Harmony residuals should not be exposed as a successful final answer.
The H100 ablations support argument normalization as the strongest trigger fix and tool-result
commentaryas part of transcript fidelity. Parser hardening is intentionally defensive: malformed channel-less or unknown-channel Harmony residuals should not be returned as successful final text, even if the prompt-side fixes prevent the sampled repro.