feat: expose cached token usage in responses. #1514
zhang-minchao commented on May 21, 2026
Unit tests: 538/539 pass. Only `ChunkedPrefillSchedulerTest.OnPrefillPreemptOffDecode` fails.
Code Review
This pull request implements tracking and reporting for cached tokens served from the prefix cache across the C++, C, and Python APIs, while also enabling the qwen3 reasoning detector. It introduces detailed token accounting in the usage statistics, including prompt_tokens_details and completion_tokens_details. Feedback identifies a logic error in Sequence::current_num_cached_tokens() where cached tokens are under-reported for prompts that are not block-aligned; the reviewer suggests simplifying the calculation by taking the minimum of cached tokens and prompt tokens.
size_t Sequence::current_num_cached_tokens() const {
  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
                                  host_kv_state_.shared_kv_tokens_num());
  if (cached_tokens <= num_prompt_tokens_) {
    return cached_tokens;
  }

  size_t block_size = 0;
  if (kv_state_.shared_kv_blocks_num() > 0 && kv_state_.num_kv_blocks() > 0) {
    block_size = kv_state_.kv_blocks()[0].size();
  } else if (host_kv_state_.shared_kv_blocks_num() > 0 &&
             host_kv_state_.num_kv_blocks() > 0) {
    block_size = host_kv_state_.kv_blocks()[0].size();
  }
  if (block_size == 0) {
    return 0;
  }
  return (num_prompt_tokens_ / block_size) * block_size;
}
The logic for calculating cached tokens under-reports the count when the prompt is not block-aligned and the cache match covers the entire prompt.
For example, if block_size is 16 and num_prompt_tokens_ is 10, and the prefix cache matches the first block (16 tokens), cached_tokens will be 16. The current logic (16 <= 10) is false, and it returns (10 / 16) * 16 = 0, even though all 10 prompt tokens were served from the cache.
Similarly, if num_prompt_tokens_ is 20 and 2 blocks are shared (32 tokens), it returns (20 / 16) * 16 = 16, missing the 4 tokens in the second block.
The correct value should be the minimum of the cached tokens and the prompt tokens. This also simplifies the implementation by removing the need to calculate block_size.
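The two cases above can be checked with a small standalone sketch. It is not code from this PR: `cached_tokens_original` and `cached_tokens_proposed` are hypothetical free functions that take plain integers in place of the `Sequence` members, and the block size is passed in directly rather than read from `kv_blocks()`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the logic in the diff. The arguments replace
// shared_kv_tokens_num(), num_prompt_tokens_, and kv_blocks()[0].size().
constexpr std::size_t cached_tokens_original(std::size_t cached_tokens,
                                             std::size_t num_prompt_tokens,
                                             std::size_t block_size) {
  if (cached_tokens <= num_prompt_tokens) {
    return cached_tokens;
  }
  if (block_size == 0) {
    return 0;
  }
  // Rounds the prompt length down to a whole number of blocks.
  return (num_prompt_tokens / block_size) * block_size;
}

// Hypothetical stand-in for the suggested fix: clamp to the prompt length.
constexpr std::size_t cached_tokens_proposed(std::size_t cached_tokens,
                                             std::size_t num_prompt_tokens) {
  return std::min(cached_tokens, num_prompt_tokens);
}

// Case 1: block_size 16, 10 prompt tokens, one shared block (16 tokens).
static_assert(cached_tokens_original(16, 10, 16) == 0, "under-reports as 0");
static_assert(cached_tokens_proposed(16, 10) == 10, "reports all 10 tokens");

// Case 2: block_size 16, 20 prompt tokens, two shared blocks (32 tokens).
static_assert(cached_tokens_original(32, 20, 16) == 16, "misses 4 tokens");
static_assert(cached_tokens_proposed(32, 20) == 20, "reports all 20 tokens");
```

When the cached count does not exceed the prompt length, both versions return the same value, so the change only affects the over-match branch.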
size_t Sequence::current_num_cached_tokens() const {
  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
                                  host_kv_state_.shared_kv_tokens_num());
  return std::min(cached_tokens, num_prompt_tokens_);
}

const std::unordered_map<std::string, std::string> auto_paser_map = {
    // {"deepseek_v3", "deepseek-v3"},
    // {"qwen3", "qwen3"},
    {"qwen3", "qwen3"},
Why was the `//` removed for the qwen3 entry?