Restrict think-state scan to assistant prefill tail #1277
Open
eilidhmae wants to merge 1 commit into
Conversation
ResponseGenerator._tokenize chose initial_state for the generation
state machine by scanning the entire tokenized prompt for the model's
<think> / </think> tokens. When a user message contained a literal
"<think>" with no matching "</think>", the scan flipped initial_state
to "reasoning" for the whole generation, routing every emitted token
to message.reasoning and leaving message.content absent.
The intent of the scan is to detect whether the chat template injected
a <think> in the assistant prefill — a tail-of-prompt question. The
same function already has a tail-bounded scan (last 11 tokens) below
this one for segment finalisation, so the unbounded form was the
outlier. Bound the initial-state scan with start=max(0, len(prompt)-11)
to match.
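For illustration, here is a minimal sketch of the scan logic described above, assuming a helper of this shape; this is not the actual mlx-lm source, and `think_id` / `end_think_id` are placeholders for the model's <think> / </think> token ids:

```python
def choose_initial_state(prompt, think_id, end_think_id):
    """Sketch of initial-state selection; `prompt` is the tokenized prompt."""
    # Buggy form: scanning all of `prompt` means a literal "<think>" typed
    # anywhere in a user message flips the state for the whole generation.
    # Fixed form: only the last 11 tokens -- the assistant prefill tail,
    # which the chat template controls -- can flip the state.
    state = "content"
    start = max(0, len(prompt) - 11)
    for tok in prompt[start:]:
        if tok == think_id:
            state = "reasoning"
        elif tok == end_think_id:
            state = "content"
    return state
```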
Repro without this fix: send a chat completion to a server running a
model whose tokenizer has <think>/</think> in its vocab (e.g. any
Qwen3-family checkpoint), with a user message containing the literal
string "<think>" anywhere. finish_reason is "stop", completion_tokens
is nonzero, but message.content is absent and message.reasoning carries
the output. With the fix, output lands in message.content as expected.
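As a concrete shape for that repro, a hedged sketch against the OpenAI-compatible chat endpoint that mlx_lm.server exposes; the host, port, and model name here are assumptions:

```python
import requests

# Assumes a local mlx_lm.server instance; adjust host/port/model to taste.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "What does a literal <think> tag do?"}
        ],
    },
)
msg = resp.json()["choices"][0]["message"]
# Without the fix: no "content" key; the text sits in msg["reasoning"].
# With the fix: msg["content"] carries the output.
print(msg.get("content"), msg.get("reasoning"))
```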
Adds two regression tests in tests/test_server.py:
- test_tokenize_user_think_does_not_flip_initial_state (the bug case)
- test_tokenize_prefill_think_flips_initial_state (positive guard so
a future over-bound doesn't silently break the intended behaviour).
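In spirit, the two tests pin down the behaviour sketched below; this is written against the hypothetical `choose_initial_state` helper from the sketch above, not the real ResponseGenerator._tokenize signature:

```python
THINK, END_THINK = 7, 8  # hypothetical token ids for <think> / </think>

def test_user_think_does_not_flip_initial_state():
    # Literal <think> deep in the user turn, well before the 11-token tail.
    prompt = [1, 2, THINK, 3, 4] + [5] * 20
    assert choose_initial_state(prompt, THINK, END_THINK) == "content"

def test_prefill_think_flips_initial_state():
    # <think> injected by the chat template at the very end of the prompt.
    prompt = [5] * 20 + [THINK]
    assert choose_initial_state(prompt, THINK, END_THINK) == "reasoning"
```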
eilidhmae added a commit to eilidhmae/pi-tools that referenced this pull request on May 15, 2026:
Three verdict-extracting grep pipelines in tools/bash/adversary-pass.sh and tools/bash/adversary-scan.sh die when $REVIEW / $OUT are empty: grep returns 1, pipefail propagates the failure through the command substitution, and `set -e` aborts the script before the `[[ -z "$VERDICT" ]] && VERDICT="UNKNOWN"` fallbacks can run. In practice this silently dropped one review per affected file and killed the rest of the scan batch; Phase 3's install.sh case hit this because mlx_lm.server 0.31.3 bifurcated output into message.reasoning for prompts containing literal `<think>` tokens (see ml-explore/mlx-lm#1277).

Append `|| true` to each grep pipeline so empty input cleanly yields an empty assignment, and the existing UNKNOWN-default fallbacks become reachable. PEER_VERDICT (adversary-pass.sh:333) already had the equivalent `|| echo "UNKNOWN"` guard; this commit brings the other three sites in line.
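For context, a standalone sketch of the failure mode and the fix; the variable names follow the commit text, the shell options match what it describes, and the grep pattern is illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

REVIEW=""  # an empty review: the case that used to kill the batch

# Before: grep exits 1 on no match, pipefail carries that status out of
# the command substitution, and set -e aborts the script here, so the
# UNKNOWN fallback below never runs.
#   VERDICT=$(printf '%s' "$REVIEW" | grep -o 'VERDICT: [A-Z]*' | head -1)

# After: `|| true` turns the no-match case into an empty assignment.
VERDICT=$(printf '%s' "$REVIEW" | grep -o 'VERDICT: [A-Z]*' | head -1 || true)
[[ -z "$VERDICT" ]] && VERDICT="UNKNOWN"
echo "$VERDICT"  # prints UNKNOWN
```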