feat: expose cached token usage in responses. by zhang-minchao · Pull Request #1514 · jd-opensource/xllm

zhang-minchao · 2026-05-21T16:23:29Z

Unit tests: 538/539 pass. Only `ChunkedPrefillSchedulerTest.OnPrefillPreemptOffDecode` fails.

gemini-code-assist

Code Review

This pull request implements tracking and reporting for cached tokens served from the prefix cache across the C++, C, and Python APIs, while also enabling the qwen3 reasoning detector. It introduces detailed token accounting in the usage statistics, including prompt_tokens_details and completion_tokens_details. Feedback identifies a logic error in Sequence::current_num_cached_tokens() where cached tokens are under-reported for prompts that are not block-aligned; the reviewer suggests simplifying the calculation by taking the minimum of cached tokens and prompt tokens.

gemini-code-assist · 2026-05-21T16:26:34Z

+size_t Sequence::current_num_cached_tokens() const {
+  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
+                                  host_kv_state_.shared_kv_tokens_num());
+  if (cached_tokens <= num_prompt_tokens_) {
+    return cached_tokens;
+  }
+
+  size_t block_size = 0;
+  if (kv_state_.shared_kv_blocks_num() > 0 && kv_state_.num_kv_blocks() > 0) {
+    block_size = kv_state_.kv_blocks()[0].size();
+  } else if (host_kv_state_.shared_kv_blocks_num() > 0 &&
+             host_kv_state_.num_kv_blocks() > 0) {
+    block_size = host_kv_state_.kv_blocks()[0].size();
+  }
+  if (block_size == 0) {
+    return 0;
+  }
+  return (num_prompt_tokens_ / block_size) * block_size;
+}


The logic for calculating cached tokens under-reports the count when the prompt is not block-aligned and the cache match covers the entire prompt.

For example, if block_size is 16 and num_prompt_tokens_ is 10, and the prefix cache matches the first block (16 tokens), cached_tokens will be 16. The check cached_tokens <= num_prompt_tokens_ (16 <= 10) is false, so the function falls through and returns (10 / 16) * 16 = 0, even though all 10 prompt tokens were served from the cache.

Similarly, if num_prompt_tokens_ is 20 and 2 blocks are shared (32 tokens), it returns (20 / 16) * 16 = 16, missing the 4 tokens in the second block.

The correct value should be the minimum of the cached tokens and the prompt tokens. This also simplifies the implementation by removing the need to calculate block_size.

size_t Sequence::current_num_cached_tokens() const {
  size_t cached_tokens = std::max(kv_state_.shared_kv_tokens_num(),
                                  host_kv_state_.shared_kv_tokens_num());
  return std::min(cached_tokens, num_prompt_tokens_);
}
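To check the two examples from the review by hand, here is a minimal stand-alone sketch. The free functions `cached_tokens_block_aligned` and `cached_tokens_clamped` are hypothetical names introduced only for illustration; they are not part of xllm and take the cached count, prompt length, and block size as plain arguments instead of reading them from `Sequence` state.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper mirroring the block-aligned logic in the diff.
// cached_tokens: tokens covered by shared (prefix-cache) blocks.
// block_size:    tokens per KV block (0 if unknown).
std::size_t cached_tokens_block_aligned(std::size_t cached_tokens,
                                        std::size_t num_prompt_tokens,
                                        std::size_t block_size) {
  if (cached_tokens <= num_prompt_tokens) {
    return cached_tokens;
  }
  if (block_size == 0) {
    return 0;
  }
  // Rounds the prompt length down to a block boundary, which drops the
  // tokens in a partially filled final block.
  return (num_prompt_tokens / block_size) * block_size;
}

// Hypothetical helper mirroring the reviewer's suggestion: clamp the
// cached count to the prompt length, with no block-size lookup.
std::size_t cached_tokens_clamped(std::size_t cached_tokens,
                                  std::size_t num_prompt_tokens) {
  return std::min(cached_tokens, num_prompt_tokens);
}
```

With block_size 16, the inputs (cached 16, prompt 10) give 0 from the first helper and 10 from the second, and (cached 32, prompt 20) give 16 and 20 respectively, matching the numbers in the review. When the cached count does not exceed the prompt length, both helpers return the same value.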

XuZhang99 · 2026-05-22T08:02:44Z

 const std::unordered_map<std::string, std::string> auto_paser_map = {
    // {"deepseek_v3", "deepseek-v3"},
-    // {"qwen3", "qwen3"},
+    {"qwen3", "qwen3"},


Why was the `//` removed for qwen3, i.e. why is this entry now enabled?

feat: expose cached token usage in responses.

f4f567a

zhang-minchao requested review from Clement-Wang26, DongheJin, DragonFive, JimHsiung, Kang-Meng, RobbieLeung, XuZhang99, liujinguang0125, liutongxuan, walsonyang, xiao-yu-chen, yingxudeng and yq33victor as code owners May 21, 2026 16:23

gemini-code-assist Bot reviewed May 21, 2026

XuZhang99 reviewed May 22, 2026

zhang-minchao wants to merge 1 commit into jd-opensource:main from zhang-minchao:feat/cached_tokens_output
