feat: use FIA for qwen3.5 prefill attention on npu.#1448
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the npu_fused_infer_attention and x_flash_attention_infer kernels to support optimized attention mechanisms on NPU hardware. It also updates the AttentionImpl and Qwen3GatedDeltaNetBaseImpl layers to utilize these new kernels during prefill and chunked-prefill phases. Feedback highlights a critical scalability issue where a hardcoded constant for tiling nodes is too small for production workloads, potentially causing crashes. Additionally, there is a performance concern regarding inefficient tensor allocations on the device within loops and a violation of the repository's style guide regarding relative include paths.
|
|
||
| namespace { | ||
|
|
||
| constexpr uint32_t kMaxExtraInfoNodes = 25; |
There was a problem hiding this comment.
The constant kMaxExtraInfoNodes is set to 25, which is likely too small for production workloads. In make_extra_tiling, core_idx is incremented for each sequence in the batch (line 119) or for each KV head in long-KV sequences (line 136). For a model with 8 KV heads and a batch size of 4, if all sequences are long-KV, core_idx will reach 32, exceeding this limit and causing a LOG(FATAL) crash via the CHECK_LT calls on lines 117 and 134. Please increase this limit to support larger batches and head counts.
| int32_tensor_on_device({static_cast<int32_t>(sub_q_len)}, device); | ||
| auto sub_kv_lens = int32_tensor_on_device( | ||
| {static_cast<int32_t>(past_kv_len + q_start + sub_q_len)}, device); |
There was a problem hiding this comment.
Creating new tensors on the device inside a nested loop is highly inefficient. int32_tensor_on_device performs a CPU allocation, a GPU allocation, and a host-to-device copy on every iteration. This will significantly degrade performance during the prefill phase. Consider pre-allocating these tensors outside the loop or using a more efficient way to pass these scalar values to the kernel.
| #include "aclnn_x_flash_attention_infer.h" | ||
| #include "core/kernels/npu/aclnn/pytorch_npu_helper.hpp" | ||
| #include "core/kernels/npu/utils.h" | ||
| #include "xllm_ops_api.h" |
There was a problem hiding this comment.
Relative paths in #include are prohibited by the repository style guide. Please use project-root-relative paths for all includes, such as core/kernels/npu/xllm_ops/xllm_ops_api.h.
References
- No relative paths in #include. Always use project-root-relative paths. (link)
d671a25 to
4971f79
Compare
…1448-check # Conflicts: # xllm/core/kernels/npu/npu_fused_infer_attention.cpp # xllm/core/layers/npu_torch/attention.cpp
feat(npu): use FIA for qwen3.5 prefill attention