fused_sigmoid_gating_tilelang tilelang adapt in qwen3.x by BikingNow · Pull Request #1465 · jd-opensource/xllm

BikingNow · 2026-05-15T14:25:42Z

Adapt fused_sigmoid_gating_tilelang in TileLang.

gemini-code-assist

Code Review

This pull request introduces the fused_sigmoid_gating_delta_rule kernel for Ascend NPUs, encompassing its TileLang implementation, C++ wrapper, and integration into the ops_api. It also optimizes the fused_gdn_gating and split_qkv_rmsnorm_mrope kernels by removing unnecessary temporary buffers. The review feedback primarily addresses precision loss by suggesting that the SSM state be stored in float32 throughout the computation. Additionally, the reviewer identified several style guide violations concerning Python type annotations, logging, and the improper use of auto for primitive types in C++.

gemini-code-assist · 2026-05-15T14:29:22Z

+        ssm_state_indices: T.Tensor([max_num_seqs], "int32"),
+        cu_seqlens: T.Tensor([max_num_seqs + 1], "int32"),
+        out: T.Tensor([total_tokens_padded, nv, dv], input_dtype),
+        final_state: T.Tensor([max_num_seqs, nv, dk, dv], input_dtype),


The SSM state (final_state) should be saved in high precision (accum_dtype) to avoid precision loss during the recurrent update. Currently, it is being cast to input_dtype (bf16), which can lead to significant errors in SSM scans over long sequences.

Suggested change

-        final_state: T.Tensor([max_num_seqs, nv, dk, dv], input_dtype),
+        final_state: T.Tensor([max_num_seqs, nv, dk, dv], accum_dtype),
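
A minimal, framework-free sketch of why the storage precision of a recurrent state matters. This is not the delta-rule update from this PR: it assumes a toy scalar recurrence `h <- g*h + u`, and `round_to_bf16` is a hypothetical helper that emulates bfloat16 storage with round-to-nearest-even. It only illustrates what can happen when the state is written back in bf16 after every step rather than in float32.

```python
import struct


def round_to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Emulate bfloat16 storage: keep the top 16 bits of the float32 bit
    pattern, rounding to nearest even. Hypothetical helper, finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run(steps: int, store) -> float:
    """Iterate h <- g*h + u, passing the state through `store` after each step."""
    g, u = 0.999, 0.001  # toy decay and input (assumed values); fixed point is h = 1.0
    h = 0.0
    for _ in range(steps):
        h = store(g * h + u)
    return h


steps = 5000
h_ref = run(steps, lambda x: x)     # float64 reference, no storage rounding
h_f32 = run(steps, round_to_f32)    # state stored as float32 after each step
h_bf16 = run(steps, round_to_bf16)  # state stored as bf16 after each step

print(f"reference={h_ref:.4f} float32={h_f32:.4f} bf16={h_bf16:.4f}")
```

In this toy setup the bf16 state stops growing once the per-step increment falls below half a bf16 ulp, while the float32 state keeps tracking the reference. The real kernel's error behavior depends on its inputs and is not measured here.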

gemini-code-assist · 2026-05-15T14:29:22Z

+                    T.tile.cast(h_store_vec, h_vec, "CAST_RINT", vec_block_v * dk)
+                    T.set_flag("v", "mte3", 5)
+                    T.wait_flag("v", "mte3", 5)
+                    T.copy(h_store_vec, final_state[seq_idx, v_head_idx, :, v_offset : v_offset + vec_block_v])


If final_state is updated to accum_dtype, the cast to input_dtype is unnecessary and should be removed to maintain precision.

Suggested change

-                    T.tile.cast(h_store_vec, h_vec, "CAST_RINT", vec_block_v * dk)
-                    T.set_flag("v", "mte3", 5)
-                    T.wait_flag("v", "mte3", 5)
-                    T.copy(h_store_vec, final_state[seq_idx, v_head_idx, :, v_offset : v_offset + vec_block_v])
+                    T.set_flag("v", "mte3", 5)
+                    T.wait_flag("v", "mte3", 5)
+                    T.copy(h_vec, final_state[seq_idx, v_head_idx, :, v_offset : v_offset + vec_block_v])

gemini-code-assist · 2026-05-15T14:29:22Z

+  CHECK(init_state.scalar_type() == torch::kFloat32 ||
+        init_state.scalar_type() == torch::kBFloat16)
+      << "TileLang fused_sigmoid_gating_delta_rule: init_state must be float32 "
+         "or bf16";


The TileLang kernel is compiled with accum_dtype = "float" (float32) for init_state. Allowing kBFloat16 here will result in a type mismatch and incorrect memory access in the kernel. This check should strictly enforce kFloat32.

CHECK_EQ(init_state.scalar_type(), torch::kFloat32) << "TileLang fused_sigmoid_gating_delta_rule: init_state must be float32";
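
The kind of mismatch described above can be illustrated without any NPU code. The sketch below packs four values as bf16 bit patterns and then reads the same bytes back as float32, the way a kernel compiled for float32 would. The helper `bf16_bits` and the sample values are assumptions made for illustration; this is not the wrapper's actual code path.

```python
import struct


def bf16_bits(x: float) -> int:
    """Upper 16 bits of the float32 bit pattern (exact for the values used here)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


values = [1.0, 2.0, 3.0, 4.0]

# A bf16 tensor with 4 elements occupies 8 bytes.
bf16_buffer = struct.pack("<4H", *(bf16_bits(v) for v in values))

# A float32 reader consumes 4 bytes per element, so the same buffer holds
# only 2 "elements", and neither of them matches the original data.
n_f32 = len(bf16_buffer) // 4
misread = struct.unpack(f"<{n_f32}f", bf16_buffer)

print(f"stored as bf16:  {values}")
print(f"read as float32: {list(misread)} ({n_f32} elements instead of {len(values)})")
```

A reader that indexes all four elements as float32 would touch 16 bytes of an 8-byte allocation, which is why a strict dtype check at the wrapper boundary is the safer choice.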

gemini-code-assist · 2026-05-15T14:29:22Z

+def golden(
+    A_log,
+    a,
+    dt_bias,
+    query,
+    key,
+    value,
+    beta,
+    init_state,
+    ssm_state_indices,
+    cu_seqlens,
+    scale=None,
+    use_qk_l2norm=True,
+    softplus_beta=1.0,
+):


Python function signatures must include type annotations for all parameters and return types per the repository style guide.

References

Type annotations are required on all function signatures (parameters and return types).
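
One possible annotated signature is sketched below. The concrete types are assumptions: every array-like argument is taken to be a `torch.Tensor`, `scale` an optional float, and the return value a pair `(out, final_state)`, inferred only from how the test compares `out_golden` and `final_state_golden`. The body is deliberately omitted.

```python
import inspect
from typing import Any, Optional, Tuple

try:
    from torch import Tensor
except ImportError:  # keeps this sketch importable without PyTorch
    Tensor = Any


def golden(
    A_log: Tensor,
    a: Tensor,
    dt_bias: Tensor,
    query: Tensor,
    key: Tensor,
    value: Tensor,
    beta: Tensor,
    init_state: Tensor,
    ssm_state_indices: Tensor,
    cu_seqlens: Tensor,
    scale: Optional[float] = None,
    use_qk_l2norm: bool = True,
    softplus_beta: float = 1.0,
) -> Tuple[Tensor, Tensor]:
    """Reference implementation; assumed to return (out, final_state)."""
    raise NotImplementedError("body omitted; only the signature is illustrated")


# Quick check that no parameter was left unannotated.
sig = inspect.signature(golden)
missing = [
    name
    for name, param in sig.parameters.items()
    if param.annotation is inspect.Parameter.empty
]
print("unannotated parameters:", missing)
```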

gemini-code-assist · 2026-05-15T14:29:22Z

+
+    torch.testing.assert_close(out.cpu(), out_golden, rtol=2e-2, atol=2e-2)
+    torch.testing.assert_close(final_state.cpu(), final_state_golden, rtol=2e-2, atol=2e-2)
+    print("Kernel Output Match!")


Do not use print() for logging. All diagnostic output must go through the shared logger at scripts/logger.py.

Suggested change

-    print("Kernel Output Match!")
+    logger.info("Kernel Output Match!")

References

All Python diagnostic output MUST go through the shared logger at scripts/logger.py. Do not use print() for logging.
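
The interface of scripts/logger.py is not shown in this review, so the sketch below uses the standard library `logging` module as a stand-in for it. The logger name and the `report_match` helper are hypothetical; the point is only the shape of replacing a bare `print()` with a logger call.

```python
import logging

# Stand-in for the repository's shared logger (scripts/logger.py); its real
# interface is not shown in the review, so stdlib logging is assumed here.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("tilelang_kernel_test")


def report_match(kernel_name: str) -> str:
    """Log a success message and return it (hypothetical helper)."""
    message = f"{kernel_name}: kernel output matches golden reference"
    logger.info(message)
    return message


report_match("fused_sigmoid_gating_delta_rule")
```

Routing the message through a logger lets the test harness control verbosity and destination in one place instead of writing to stdout unconditionally.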

gemini-code-assist · 2026-05-15T14:29:22Z

+  const auto nv = value.size(1);
+  const auto nk = query.size(1);
+  const auto dk = query.size(2);
+  const auto dv = value.size(2);


Do not use auto for simple/primitive types like int64_t.

Suggested change

-  const auto nv = value.size(1);
-  const auto nk = query.size(1);
-  const auto dk = query.size(2);
-  const auto dv = value.size(2);
+  const int64_t nv = value.size(1);
+  const int64_t nk = query.size(1);
+  const int64_t dk = query.size(2);
+  const int64_t dv = value.size(2);

References

Do not use auto for simple/primitive types (int32_t, float, bool, std::string, etc.).

gemini-code-assist · 2026-05-15T14:29:23Z

+  const auto options = query.options();
+
+  auto out = torch::empty({query.size(0), nv, dv}, options);
+  auto final_state = torch::empty({compiled_n, nv, dk, dv}, options);


To maintain precision for the SSM state, final_state should be allocated as float32 if the kernel is updated to return accum_dtype.

Suggested change

-  auto final_state = torch::empty({compiled_n, nv, dk, dv}, options);
+  torch::Tensor final_state = torch::empty({compiled_n, nv, dk, dv}, options.dtype(torch::kFloat32));

gemini-code-assist · 2026-05-15T14:29:23Z

+  auto q = params.q;
+  auto k = params.k;
+  auto v = params.v;


Avoid using auto for torch::Tensor types to improve readability and adhere to the style guide's preference for explicit types over auto for non-complex types.

Suggested change

-  auto q = params.q;
-  auto k = params.k;
-  auto v = params.v;
+  torch::Tensor q = params.q;
+  torch::Tensor k = params.k;
+  torch::Tensor v = params.v;

References

Do not use auto for simple/primitive types. auto is acceptable for complex types (iterators, lambdas, template-deduced types) but not for int32_t, float, bool, std::string, etc.

gemini-code-assist · 2026-05-15T14:29:23Z

+  auto init_state_small = torch::index_select(
+      params.initial_state_source, 0, indices);


The TileLang kernel expects init_state to be float32. Ensure the input is cast to kFloat32 before passing it to the wrapper.

Suggested change

-  auto init_state_small = torch::index_select(
-      params.initial_state_source, 0, indices);
+  torch::Tensor init_state_small = torch::index_select(
+      params.initial_state_source, 0, indices).to(torch::kFloat32);

BikingNow requested review from DongheJin, JimHsiung, RobbieLeung, XuZhang99, liutongxuan, walsonyang and yq33victor as code owners May 15, 2026 14:25

gemini-code-assist Bot reviewed May 15, 2026


feat: fused_sigmoid_gating_delta_rule development (squashed commit history)

e0775d5

BikingNow force-pushed the qwen_tl branch from d6e8a39 to e0775d5 Compare May 19, 2026 11:33

BikingNow added 5 commits May 19, 2026 19:44

resolve: resolve conflicts

7ccb09e

fix bug in decode phase, acl graph capture do not accept item action

cb4afac

print shape

2fac503

add print 2

f594e3d

print shape 3

feee5d5

BikingNow wants to merge 6 commits into jd-opensource:preview/qwen3.5-qwen3.6 from BikingNow:qwen_tl
