feat: use FIA for qwen3.5 prefill attention on npu. by Sinle4Cat · Pull Request #1448 · jd-opensource/xllm

Sinle4Cat · 2026-05-14T02:58:59Z

feat(npu): use FIA for qwen3.5 prefill attention

gemini-code-assist

Code Review

This pull request introduces the npu_fused_infer_attention and x_flash_attention_infer kernels to support optimized attention mechanisms on NPU hardware. It also updates the AttentionImpl and Qwen3GatedDeltaNetBaseImpl layers to utilize these new kernels during prefill and chunked-prefill phases. Feedback highlights a critical scalability issue where a hardcoded constant for tiling nodes is too small for production workloads, potentially causing crashes. Additionally, there is a performance concern regarding inefficient tensor allocations on the device within loops and a violation of the repository's style guide regarding relative include paths.

gemini-code-assist · 2026-05-14T03:00:50Z

+
+namespace {
+
+constexpr uint32_t kMaxExtraInfoNodes = 25;


The constant kMaxExtraInfoNodes is set to 25, which is likely too small for production workloads. In make_extra_tiling, core_idx is incremented for each sequence in the batch (line 119) or for each KV head in long-KV sequences (line 136). For a model with 8 KV heads and a batch size of 4, if all sequences are long-KV, core_idx will reach 32, exceeding this limit and causing a LOG(FATAL) crash via the CHECK_LT calls on lines 117 and 134. Please increase this limit to support larger batches and head counts.
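The arithmetic in this comment can be checked with a small standalone sketch. It only models the counting rule described above (one node per regular sequence, one node per KV head for a long-KV sequence). The helper names `count_extra_info_nodes` and `fits_in_limit` and the `is_long_kv` input are hypothetical and are not taken from the PR; the real `make_extra_tiling` may count differently.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the limit quoted in the review comment.
constexpr uint32_t kMaxExtraInfoNodes = 25;

// Hypothetical helper: counts the tiling nodes a batch would need if each
// regular sequence takes one node and each long-KV sequence takes one node
// per KV head. This is an assumption based on the comment, not the PR code.
uint32_t count_extra_info_nodes(const std::vector<bool>& is_long_kv,
                                uint32_t num_kv_heads) {
  uint32_t core_idx = 0;
  for (bool long_kv : is_long_kv) {
    core_idx += long_kv ? num_kv_heads : 1u;
  }
  return core_idx;
}

// Hypothetical helper: true when the batch stays within the fixed limit.
bool fits_in_limit(const std::vector<bool>& is_long_kv,
                   uint32_t num_kv_heads) {
  return count_extra_info_nodes(is_long_kv, num_kv_heads) <=
         kMaxExtraInfoNodes;
}
```

Under these assumptions, a batch of four long-KV sequences with 8 KV heads needs 32 nodes and does not fit, while the same batch with no long-KV sequences needs only 4.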

gemini-code-assist · 2026-05-14T03:00:50Z

+          int32_tensor_on_device({static_cast<int32_t>(sub_q_len)}, device);
+      auto sub_kv_lens = int32_tensor_on_device(
+          {static_cast<int32_t>(past_kv_len + q_start + sub_q_len)}, device);


Creating new tensors on the device inside a nested loop is highly inefficient. int32_tensor_on_device performs a host allocation, a device allocation, and a host-to-device copy on every iteration. This will significantly degrade performance during the prefill phase. Consider pre-allocating these tensors outside the loop or using a more efficient way to pass these scalar values to the kernel.
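One way to read the pre-allocation suggestion is to compute every per-chunk length on the host first, so that each buffer can be transferred to the device once instead of once per iteration. The sketch below shows only that host-side staging step. `ChunkLens`, `collect_chunk_lens`, and the fixed `chunk_size` loop are hypothetical; the KV-length expression follows the diff above, but the PR's actual loop structure and tensor API are not shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical host-side staging buffers: one entry per sub-chunk.
struct ChunkLens {
  std::vector<int32_t> sub_q_lens;
  std::vector<int32_t> sub_kv_lens;
};

// Splits a query of length q_len into chunks of at most chunk_size tokens and
// records, for each chunk, its query length and the KV length visible to it
// (past_kv_len + q_start + sub_q_len, as in the diff above).
ChunkLens collect_chunk_lens(int64_t past_kv_len,
                             int64_t q_len,
                             int64_t chunk_size) {
  ChunkLens out;
  if (chunk_size <= 0) {
    return out;  // Nothing to stage for an invalid chunk size.
  }
  for (int64_t q_start = 0; q_start < q_len; q_start += chunk_size) {
    const int64_t sub_q_len = std::min(chunk_size, q_len - q_start);
    out.sub_q_lens.push_back(static_cast<int32_t>(sub_q_len));
    out.sub_kv_lens.push_back(
        static_cast<int32_t>(past_kv_len + q_start + sub_q_len));
  }
  return out;
}
```

With the values staged this way, each vector could be copied to the device in a single transfer before the loop and indexed per iteration. Whether the kernel accepts such slices is an assumption that would need to be checked against its interface.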

gemini-code-assist · 2026-05-14T03:00:50Z

+#include "aclnn_x_flash_attention_infer.h"
+#include "core/kernels/npu/aclnn/pytorch_npu_helper.hpp"
+#include "core/kernels/npu/utils.h"
+#include "xllm_ops_api.h"


Relative paths in #include are prohibited by the repository style guide. Please use project-root-relative paths for all includes, such as core/kernels/npu/xllm_ops/xllm_ops_api.h.

References

No relative paths in #include. Always use project-root-relative paths.

default added 2 commits May 11, 2026 19:55

feat(npu): use FIA for qwen3.5 prefill attention

ee7a139

Support x flash attention prefill on NPU

b448a69

Sinle4Cat requested review from DongheJin, JimHsiung, RobbieLeung, XuZhang99, liutongxuan, walsonyang and yq33victor as code owners May 14, 2026 02:59

Sinle4Cat changed the title ~~X flash~~ feat(npu): use FIA for qwen3.5 prefill attention May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

Sinle4Cat force-pushed the x_flash branch 2 times, most recently from d671a25 to 4971f79 May 14, 2026 03:38

bugfix: x flash attention long kv x mte excepiton

39f5b84

Sinle4Cat force-pushed the x_flash branch from 4971f79 to 39f5b84 May 14, 2026 03:43

liutongxuan changed the title ~~feat(npu): use FIA for qwen3.5 prefill attention~~ feat: use FIA for qwen3.5 prefill attention on npu. May 14, 2026

sinle4cat added 7 commits May 14, 2026 15:43

Merge preview/qwen3.5-qwen3.6 into x_flash

f182a6f

bugfix: route chunked prefill to x flash attention

b02e42f

bugfix: improve x flash attention long kv stability

fa77f62

Merge remote-tracking branch 'origin/preview/qwen3.5-qwen3.6' into pr…

2d41489

…1448-check

Merge remote-tracking branch 'origin/preview/qwen3.5-qwen3.6' into pr…

53cbf1c

…1448-check # Conflicts: # xllm/core/kernels/npu/npu_fused_infer_attention.cpp # xllm/core/layers/npu_torch/attention.cpp

bugfix: fix long kv xflash prefill grouping

6d7cdfb

chore: remove qwen35 gdn debug logging

7192530

feat: use FIA for qwen3.5 prefill attention on npu. #1448
Sinle4Cat wants to merge 10 commits into jd-opensource:preview/qwen3.5-qwen3.6 from Sinle4Cat:x_flash
