feat: expose cached token usage in responses. #1514

Open
zhang-minchao wants to merge 1 commit into
jd-opensource:main from
zhang-minchao:feat/cached_tokens_output

@zhang-minchao
Collaborator

[image] Unit tests: 538 of 539 pass. Only `ChunkedPrefillSchedulerTest.OnPrefillPreemptOffDecode` fails.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements tracking and reporting for cached tokens served from the prefix cache across the C++, C, and Python APIs, while also enabling the qwen3 reasoning detector. It introduces detailed token accounting in the usage statistics, including prompt_tokens_details and completion_tokens_details. Feedback identifies a logic error in Sequence::current_num_cached_tokens() where cached tokens are under-reported for prompts that are not block-aligned; the reviewer suggests simplifying the calculation by taking the minimum of cached tokens and prompt tokens.
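As a rough illustration of the token accounting described above, the usage statistics could be shaped like the sketch below. Only the names `prompt_tokens_details` and `completion_tokens_details` come from the summary; the struct names, the `cached_tokens` and `reasoning_tokens` fields, and the `make_usage` helper are assumptions for illustration, not the types this PR actually adds.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical types. Only prompt_tokens_details and
// completion_tokens_details are named in the review summary.
struct PromptTokensDetails {
  std::size_t cached_tokens = 0;  // assumed field: prompt tokens served from the prefix cache
};

struct CompletionTokensDetails {
  std::size_t reasoning_tokens = 0;  // assumed field: tokens spent on reasoning output
};

struct Usage {
  std::size_t prompt_tokens = 0;
  std::size_t completion_tokens = 0;
  std::size_t total_tokens = 0;
  PromptTokensDetails prompt_tokens_details;
  CompletionTokensDetails completion_tokens_details;
};

// Hypothetical helper: builds a Usage record and keeps each detail
// counter no larger than the total it is a breakdown of.
Usage make_usage(std::size_t prompt_tokens,
                 std::size_t completion_tokens,
                 std::size_t cached_tokens,
                 std::size_t reasoning_tokens) {
  Usage usage;
  usage.prompt_tokens = prompt_tokens;
  usage.completion_tokens = completion_tokens;
  usage.total_tokens = prompt_tokens + completion_tokens;
  usage.prompt_tokens_details.cached_tokens =
      std::min(cached_tokens, prompt_tokens);
  usage.completion_tokens_details.reasoning_tokens =
      std::min(reasoning_tokens, completion_tokens);
  return usage;
}
```

Clamping the detail counters in one place means a caller can never report more cached tokens than prompt tokens, which is the invariant the review comment below is concerned with.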

Comment on lines +694 to +712
size_t Sequence::current_num_cached_tokens() const {
  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
                                  host_kv_state_.shared_kv_tokens_num());
  if (cached_tokens <= num_prompt_tokens_) {
    return cached_tokens;
  }

  size_t block_size = 0;
  if (kv_state_.shared_kv_blocks_num() > 0 && kv_state_.num_kv_blocks() > 0) {
    block_size = kv_state_.kv_blocks()[0].size();
  } else if (host_kv_state_.shared_kv_blocks_num() > 0 &&
             host_kv_state_.num_kv_blocks() > 0) {
    block_size = host_kv_state_.kv_blocks()[0].size();
  }
  if (block_size == 0) {
    return 0;
  }
  return (num_prompt_tokens_ / block_size) * block_size;
}

high

The logic for calculating cached tokens under-reports the count when the prompt is not block-aligned and the cache match covers the entire prompt.

For example, if block_size is 16, num_prompt_tokens_ is 10, and the prefix cache matches the first block (16 tokens), cached_tokens will be 16. The check (16 <= 10) is false, so the function falls through to the block-aligned path and returns (10 / 16) * 16 = 0, even though all 10 prompt tokens were served from the cache.

Similarly, if num_prompt_tokens_ is 20 and 2 blocks are shared (32 tokens), it returns (20 / 16) * 16 = 16, missing the 4 tokens in the second block.

The correct value should be the minimum of the cached tokens and the prompt tokens. This also simplifies the implementation by removing the need to calculate block_size.

size_t Sequence::current_num_cached_tokens() const {
  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
                                  host_kv_state_.shared_kv_tokens_num());
  return std::min(cached_tokens, num_prompt_tokens_);
}
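
The two examples above can be checked with a small standalone sketch. The free functions below are hypothetical stand-ins for the two versions of `Sequence::current_num_cached_tokens()`; they take the shared-token count, prompt length, and block size as plain parameters instead of reading them from `kv_state_`, so they only reproduce the arithmetic, not the real class.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the version under review: when the shared-token
// count exceeds the prompt length, it rounds the prompt length down to a
// block boundary.
std::size_t cached_tokens_block_floor(std::size_t shared_tokens,
                                      std::size_t num_prompt_tokens,
                                      std::size_t block_size) {
  if (shared_tokens <= num_prompt_tokens) {
    return shared_tokens;
  }
  if (block_size == 0) {
    return 0;
  }
  return (num_prompt_tokens / block_size) * block_size;
}

// Hypothetical stand-in for the suggested version: clamp the shared-token
// count to the prompt length.
std::size_t cached_tokens_clamped(std::size_t shared_tokens,
                                  std::size_t num_prompt_tokens) {
  return std::min(shared_tokens, num_prompt_tokens);
}
```

With a block size of 16, a 10-token prompt and 16 shared tokens give 0 from the first function and 10 from the second; a 20-token prompt and 32 shared tokens give 16 and 20. When the shared count does not exceed the prompt length, both functions return the same value.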

const std::unordered_map<std::string, std::string> auto_paser_map = {
    // {"deepseek_v3", "deepseek-v3"},
    // {"qwen3", "qwen3"},
    {"qwen3", "qwen3"},
Collaborator

Why remove the `//` for qwen3?

2 participants