Skip to content

feat: use FIA for qwen3.5 prefill attention on npu.#1448

Open
Sinle4Cat wants to merge 10 commits into
jd-opensource:preview/qwen3.5-qwen3.6from
Sinle4Cat:x_flash
Open

feat: use FIA for qwen3.5 prefill attention on npu.#1448
Sinle4Cat wants to merge 10 commits into
jd-opensource:preview/qwen3.5-qwen3.6from
Sinle4Cat:x_flash

Conversation

@Sinle4Cat
Copy link
Copy Markdown

feat(npu): use FIA for qwen3.5 prefill attention

@Sinle4Cat Sinle4Cat changed the title X flash feat(npu): use FIA for qwen3.5 prefill attention May 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the npu_fused_infer_attention and x_flash_attention_infer kernels to support optimized attention mechanisms on NPU hardware. It also updates the AttentionImpl and Qwen3GatedDeltaNetBaseImpl layers to utilize these new kernels during prefill and chunked-prefill phases. Feedback highlights a critical scalability issue where a hardcoded constant for tiling nodes is too small for production workloads, potentially causing crashes. Additionally, there is a performance concern regarding inefficient tensor allocations on the device within loops and a violation of the repository's style guide regarding relative include paths.


namespace {

constexpr uint32_t kMaxExtraInfoNodes = 25;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The constant kMaxExtraInfoNodes is set to 25, which is likely too small for production workloads. In make_extra_tiling, core_idx is incremented for each sequence in the batch (line 119) or for each KV head in long-KV sequences (line 136). For a model with 8 KV heads and a batch size of 4, if all sequences are long-KV, core_idx will reach 32, exceeding this limit and causing a LOG(FATAL) crash via the CHECK_LT calls on lines 117 and 134. Please increase this limit to support larger batches and head counts.

Comment on lines +129 to +131
int32_tensor_on_device({static_cast<int32_t>(sub_q_len)}, device);
auto sub_kv_lens = int32_tensor_on_device(
{static_cast<int32_t>(past_kv_len + q_start + sub_q_len)}, device);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Creating new tensors on the device inside a nested loop is highly inefficient. int32_tensor_on_device performs a CPU allocation, a GPU allocation, and a host-to-device copy on every iteration. This will significantly degrade performance during the prefill phase. Consider pre-allocating these tensors outside the loop or using a more efficient way to pass these scalar values to the kernel.

Comment on lines +26 to +29
#include "aclnn_x_flash_attention_infer.h"
#include "core/kernels/npu/aclnn/pytorch_npu_helper.hpp"
#include "core/kernels/npu/utils.h"
#include "xllm_ops_api.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Relative paths in #include are prohibited by the repository style guide. Please use project-root-relative paths for all includes, such as core/kernels/npu/xllm_ops/xllm_ops_api.h.

References
  1. No relative paths in #include. Always use project-root-relative paths. (link)

@Sinle4Cat Sinle4Cat force-pushed the x_flash branch 2 times, most recently from d671a25 to 4971f79 Compare May 14, 2026 03:38
@liutongxuan liutongxuan changed the title feat(npu): use FIA for qwen3.5 prefill attention feat: use FIA for qwen3.5 prefill attention on npu. May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant