Send empty message right after first token generation (continuous batching)#4020
#4020 · dkalinowski wants to merge 3 commits into main
Conversation
Pull request overview
This PR implements support for sending an empty control message immediately after the first token generation in continuous batching scenarios. This addresses the case where the first token generation iteration produces no visible text output: clients receive an early signal that generation has started, and the chunk includes the assistant role as required by the OpenAI streaming specification.
Changes:
- Added `loopIteration` counter to track streaming iterations in `GenAiServableExecutionContext`
- Implemented logic to send a control chunk when the first iteration produces empty text
- Added `serializeStreamingFirstTokenControlChunk()` method to create properly formatted first-chunk responses for both chat completions and completions endpoints
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/llm/servable.hpp | Adds loopIteration field to track which streaming iteration is currently being processed |
| src/llm/servable.cpp | Implements logic to send control chunk on first iteration when text is empty, increments loop counter |
| src/llm/apis/openai_completions.hpp | Declares new method for serializing the first token control chunk |
| src/llm/apis/openai_completions.cpp | Implements serialization of control chunk with role field and null content for OpenAI spec compliance |
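For context, the OpenAI streaming format delivers the assistant role in the first `chat.completion.chunk` delta. A first control chunk along the lines this PR describes (role set, content null) might look like the following; the `model` and `created` values are illustrative, not taken from the PR diff:

```json
{
  "object": "chat.completion.chunk",
  "created": 1700000000,
  "model": "example-model",
  "choices": [
    {
      "index": 0,
      "delta": { "role": "assistant", "content": null },
      "finish_reason": null
    }
  ]
}
```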
```cpp
    return buffer.GetString();
}

std::string OpenAIChatCompletionsHandler::serializeStreamingFirstTokenControlChunk() {
```
I don't understand the context of `FirstTokenControlChunk`. What is the "control" aspect here?
```cpp
    choice.SetObject();

    choice.AddMember("index", 0, allocator);
    if (endpoint == Endpoint::CHAT_COMPLETIONS) {
```
I think we could document this behavior, maybe in the API reference, so it's clear that we send that empty response (and only for CB pipelines, right?).
```cpp
    std::shared_ptr<ov::genai::TextStreamer> textStreamer;
    bool sendLoopbackSignal = false;
    std::string lastStreamerCallbackOutput;
    size_t loopIteration = 0;
```
This name does not explain the purpose to me. Also, couldn't this be a bool like `decodingPhase`? Or even an enum like `RequestProcessingPhase::PREFILL` / `RequestProcessingPhase::DECODE`, starting with prefill and switching to decode after the first read finishes.
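The reviewer's suggestion could be sketched as below. All names here (`RequestProcessingPhase`, `finishFirstRead`, the context struct) are hypothetical illustrations of the proposal, not code from the PR:

```cpp
#include <cassert>

// Sketch of replacing the loopIteration counter with an explicit phase:
// the request starts in PREFILL and flips to DECODE after the first read.
enum class RequestProcessingPhase { PREFILL, DECODE };

struct GenAiServableExecutionContextSketch {
    RequestProcessingPhase phase = RequestProcessingPhase::PREFILL;

    // Returns true exactly once, on the first read; a caller could use this
    // to decide whether the empty first-token control chunk should be sent.
    bool finishFirstRead() {
        if (phase == RequestProcessingPhase::PREFILL) {
            phase = RequestProcessingPhase::DECODE;
            return true;
        }
        return false;
    }
};
```

Compared with a raw `size_t` counter, the enum makes the two states self-describing and rules out meaningless values such as `loopIteration == 7` when only "first iteration or not" matters.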
```cpp
// Reusable helper: asserts that a streaming chat completion chunk is the
// initial empty message with role:assistant and content:null.
inline void assertInitialStreamChatCompletionChunk(const std::string& response, const std::string& expectedModel) {
```
How about a test for the completions endpoint?
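A companion helper for the completions endpoint could follow the same pattern. The sketch below uses plain substring checks rather than the repo's JSON parsing, and the expected field names (`object`, `text`) follow the general OpenAI completions streaming format; they are assumptions, not copied from this PR:

```cpp
#include <cassert>
#include <string>

// Hypothetical counterpart to assertInitialStreamChatCompletionChunk for the
// completions endpoint. Field names are assumptions based on the OpenAI
// completions streaming format, not taken from the PR diff.
inline void assertInitialStreamCompletionChunk(const std::string& response) {
    // Completions chunks carry a "text" field instead of a chat "delta".
    assert(response.find("\"object\":\"text_completion\"") != std::string::npos);
    assert(response.find("\"text\":\"\"") != std::string::npos);
    // The chat-specific role field should not appear on this endpoint.
    assert(response.find("\"role\"") == std::string::npos);
}
```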
```cpp
    assertInitialStreamChatCompletionChunk(response, params.modelName);
    return;
}
replyCounter++;
```
🛠 Summary
CVS-181341
CVS-177373