Simplify layer_norm_wg_size_select to reduce tree-reduce barrier cost by CuiYifeng · Pull Request #3738 · intel/torch-xpu-ops

CuiYifeng · 2026-05-22T07:33:59Z

This PR sets wg_size to SIMD * 4 for the vectorized LayerNorm kernel when N / vec_size <= max_wg_size.

Motivation:
The vectorized LayerNorm kernel uses an inter-subgroup tree-reduce in compute_stats(). For wg_size = 1024 and SIMD32 (32 subgroups), the tree-reduce requires 5 rounds, and average subgroup utilization during the reduce is only ~19.4%. Stall sampling shows barrier synchronization as the biggest stall source (~31.5% of all stalls). Reducing wg_size / SIMD to 4 cuts the tree-reduce to 2 rounds and eliminates ~92% of sync stalls.
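One way to arrive at the round count and utilization figures above is sketched below. It assumes a simple halving model in which each round keeps half of the previously active subgroups busy; the function names are hypothetical and do not come from the kernel source.

```cpp
#include <cassert>
#include <cstdint>

// Number of halving rounds needed to reduce `num_subgroups` partial results
// to one. Assumes num_subgroups is a power of two (e.g. 1024 / 32 = 32).
constexpr int tree_reduce_rounds(int64_t num_subgroups) {
  int rounds = 0;
  while (num_subgroups > 1) {
    num_subgroups /= 2;
    ++rounds;
  }
  return rounds;
}

// Average fraction of subgroups doing useful work per round, under the
// assumed model that round r keeps num_subgroups / 2^r subgroups active.
constexpr double avg_subgroup_utilization(int64_t num_subgroups) {
  int rounds = 0;
  int64_t active_total = 0;
  for (int64_t active = num_subgroups / 2; active >= 1; active /= 2) {
    active_total += active;
    ++rounds;
  }
  if (rounds == 0) {
    return 1.0;  // a single subgroup needs no inter-subgroup reduce
  }
  return static_cast<double>(active_total) /
         (static_cast<double>(rounds) * static_cast<double>(num_subgroups));
}
```

Under this model, 32 subgroups give 5 rounds with (16 + 8 + 4 + 2 + 1) / (5 × 32) ≈ 19.4% average utilization, while 4 subgroups give 2 rounds.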

Why gate on N / vec_size <= max_wg_size:
Smaller wg_size / SIMD means more concurrent workgroups per Xe-core, which increases L1 pressure. Pass2 of the vectorized LayerNorm kernel has a working set that scales as:
W_L1 = concurrent_WGs × N × element_size
When W_L1 exceeds the LSC cache capacity, Pass2 degrades from L1 hits to L3 hits, offsetting the gains from the reduced barrier cost. Since LSC cache sizes vary across platforms and cannot be queried at runtime, we need a platform-independent guard.
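The working-set relation can be written out as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the cache capacity is passed in as a parameter precisely because, as noted above, it is not available at runtime.

```cpp
#include <cassert>
#include <cstdint>

// W_L1 = concurrent_WGs * N * element_size, in bytes.
// concurrent_wgs: workgroups resident on one Xe-core at the same time.
// n_elements:     row length N of the normalized dimension.
// element_size:   bytes per element (4 for fp32, 2 for fp16/bf16).
constexpr int64_t pass2_working_set_bytes(
    int64_t concurrent_wgs, int64_t n_elements, int64_t element_size) {
  return concurrent_wgs * n_elements * element_size;
}

// True if the working set fits in a cache of the given capacity.
// The capacity is a caller-supplied assumption, not a queried value.
constexpr bool fits_in_cache(int64_t working_set_bytes, int64_t capacity_bytes) {
  return working_set_bytes <= capacity_bytes;
}
```

For example, with purely hypothetical inputs of 8 concurrent workgroups, N = 4096, and fp16 elements, `pass2_working_set_bytes(8, 4096, 2)` evaluates to 65536 bytes. Doubling the number of concurrent workgroups doubles the working set, which is the effect the guard is meant to bound.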

CuiYifeng · 2026-05-22T07:38:52Z

LayerNorm Speedup Ratio:

Shape (batch, dim)	Speedup (fp32)	Speedup (fp16)	Speedup (bf16)
(1024, 128)	0.99x	1.03x	1.13x
(1024, 256)	1.00x	1.04x	1.01x
(1024, 512)	1.01x	1.00x	0.97x
(1024, 1024)	1.44x	1.34x	1.34x
(1024, 2048)	1.29x	1.79x	1.65x
(1024, 4096)	1.02x	2.20x	1.83x
(1024, 8192)	1.00x	1.00x	1.00x
(1024, 16384)	1.00x	1.00x	1.01x
(1024, 32768)	1.00x	1.00x	1.00x
(1024, 65536)	1.00x	0.98x	0.99x
(1024, 131072)	1.00x	1.02x	1.00x

Copilot

Pull request overview

Skill file(s) read: .github/skills/xpu-ops-pr-review/SKILL.md.

This PR adjusts work-group size selection for the vectorized LayerNorm SYCL kernel to prefer a smaller fixed work-group (SIMD * 4) in order to reduce inter-subgroup tree-reduce barrier overhead.

Changes:

Replaced the previous “shrink max_wg_size by powers of two” logic with a fixed preferred work-group size (SIMD * 4) when N / vec_size <= max_wg_size.
Kept the existing fallback of using max_wg_size when N / vec_size > max_wg_size.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

+  // with 4 subgroups per workgroup
+  constexpr int64_t preferred_wg_size = SIMD * 4;
+  if (n <= max_wg_size && preferred_wg_size <= max_wg_size) {
+    return preferred_wg_size;
  }
  return max_wg_size;


jianyizh · 2026-05-22T08:40:14Z

+  // with 4 subgroups per workgroup
+  constexpr int64_t preferred_wg_size = SIMD * 4;
+  if (n <= max_wg_size && preferred_wg_size <= max_wg_size) {
+    return preferred_wg_size;


@CuiYifeng

Why do we need preferred_wg_size <= max_wg_size? max_wg_size on every device should be > 128.

Does the Copilot comment make sense when n < 128?

You also need to consider occupancy. Consider batch size 64 on B70 with n=1024, if you have 32 Xe-cores: wg_size = 1024 gives 100% occupancy but may suffer from sync stalls, while wg_size = 512 gives only 50% occupancy but a lower barrier cost. Does wg_size = 128 give the best performance? You need to test this.

@jianyizh

I was not sure whether max_wg_size >= SIMD * 4 always holds. However, I agree with your point about max_wg_size, so the preferred_wg_size <= max_wg_size check has been removed.

Do you mean that when max_wg_size < SIMD * 4, the function now returns max_wg_size without shrinking it relative to n? Makes sense. That is another reason to remove preferred_wg_size <= max_wg_size.

Good point. Based on the speedup ratio for batch_size=40 and dim_size=4096 (n=1024), no obvious difference has been observed. Below is the data collected from B60:

dtype	Speedup
fp32	0.98x
fp16	1.00x
bf16	1.01x
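The selection logic discussed in this thread, with the preferred_wg_size <= max_wg_size check removed, can be sketched roughly as follows. The function name, signature, and the fixed SIMD value are assumptions for illustration; the actual code in LayerNormKernels.cpp may differ.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: SIMD32, as in the motivation of this PR.
constexpr int64_t kSimd = 32;

// n:           row length divided by the vector size (N / vec_size).
// max_wg_size: maximum workgroup size reported for the device.
// Sketch only: relies on max_wg_size >= kSimd * 4, as argued in the review.
constexpr int64_t wg_size_select_sketch(int64_t n, int64_t max_wg_size) {
  // Prefer 4 subgroups per workgroup to keep the tree-reduce at 2 rounds.
  constexpr int64_t preferred_wg_size = kSimd * 4;
  if (n <= max_wg_size) {
    return preferred_wg_size;
  }
  // Rows that do not fit in one workgroup keep the existing fallback.
  return max_wg_size;
}
```

With max_wg_size = 1024, the case from the reply above (dim_size=4096, n=1024) selects a workgroup size of 128, while n=2048 falls back to 1024.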

Simplify layer_norm_wg_size_select to reduce tree-reduce barrier cost

f5f0634

CuiYifeng requested a review from Copilot May 22, 2026 07:41

github-actions Bot added disable_e2e Disable all e2e test jobs for the PR disable_distributed Disable distributed UT test jobs for the PR labels May 22, 2026

Copilot started reviewing on behalf of CuiYifeng May 22, 2026 07:41 View session

chuanqi129 marked this pull request as draft May 22, 2026 07:45

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread src/ATen/native/xpu/sycl/LayerNormKernels.cpp

chuanqi129 marked this pull request as ready for review May 22, 2026 07:45

CuiYifeng requested a review from Copilot May 22, 2026 07:52

Copilot started reviewing on behalf of CuiYifeng May 22, 2026 07:52 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread src/ATen/native/xpu/sycl/LayerNormKernels.cpp

Comment on lines +594 to 599

CuiYifeng requested review from LuFinch and jianyizh May 22, 2026 08:30

jianyizh requested changes May 22, 2026

View reviewed changes

CuiYifeng linked an issue May 22, 2026 that may be closed by this pull request

[Bug Skip]: Functionality error of test_torchvision_roi_ops.py::TestRoIAlign::test_autocast #3518

Closed

CuiYifeng removed a link to an issue May 22, 2026

[Bug Skip]: Functionality error of test_torchvision_roi_ops.py::TestRoIAlign::test_autocast #3518

Closed

CuiYifeng force-pushed the yifeng/norm_sync_opt branch from c9aa77c to f5f0634 Compare May 22, 2026 15:45

CuiYifeng wants to merge 1 commit into main from yifeng/norm_sync_opt
