
Simplify layer_norm_wg_size_select to reduce tree-reduce barrier cost#3738

Open
CuiYifeng wants to merge 1 commit into
mainfrom
yifeng/norm_sync_opt
Conversation

@CuiYifeng
Contributor

This PR sets wg_size to SIMD * 4 for the vectorized LayerNorm kernel when N / vec_size <= max_wg_size.

Motivation:
The vectorized LayerNorm kernel uses an inter-subgroup tree-reduce in compute_stats(). For wg_size = 1024 and SIMD32, the workgroup has 32 subgroups, so the tree-reduce requires 5 rounds, and average subgroup utilization during the reduce is only ~19.4%. Stall sampling shows barrier synchronization as the largest stall source (~31.5% of all stalls). Reducing wg_size / SIMD to 4 cuts the tree-reduce to 2 rounds and eliminates ~92% of sync stalls.
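The round count and utilization figures above can be reproduced with a short calculation. This is a minimal sketch, assuming each tree-reduce round halves the number of active subgroups while the rest wait at the barrier; the helper names are hypothetical and are not part of the kernel.

```cpp
#include <cstdint>

// Number of tree-reduce rounds needed to combine `num_subgroups` partial
// results when each round halves the number of active subgroups.
// Assumes num_subgroups (= wg_size / SIMD) is a power of two.
inline int tree_reduce_rounds(int64_t num_subgroups) {
  int rounds = 0;
  while (num_subgroups > 1) {
    num_subgroups /= 2;
    ++rounds;
  }
  return rounds;
}

// Average fraction of subgroups doing useful work across all rounds.
// Round r keeps num_subgroups / 2^r subgroups active (assumed model).
inline double avg_subgroup_utilization(int64_t num_subgroups) {
  const int rounds = tree_reduce_rounds(num_subgroups);
  if (rounds == 0) {
    return 1.0;  // a single subgroup needs no inter-subgroup reduce
  }
  int64_t active_total = 0;
  for (int64_t active = num_subgroups / 2; active >= 1; active /= 2) {
    active_total += active;
  }
  return static_cast<double>(active_total) /
      static_cast<double>(rounds * num_subgroups);
}
```

Under this model, 32 subgroups (wg_size = 1024, SIMD32) give 5 rounds with 31 / 160 ≈ 19.4% average utilization, while 4 subgroups give 2 rounds with 3 / 8 = 37.5%.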

Why gate on N / vec_size <= max_wg_size:
A smaller wg_size / SIMD ratio means more concurrent workgroups per Xe-core, which increases L1 pressure. Pass2 of the vectorized LayerNorm kernel has a working set that scales as:
W_L1 = concurrent_WGs × N × element_size
When W_L1 exceeds the LSC cache capacity, Pass2 degrades from L1 hits to L3 hits, offsetting the gains from the reduced barrier cost. Since LSC cache sizes vary across platforms and cannot be queried at runtime, we need a platform-independent guard.
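To make the working-set estimate concrete, here is a minimal sketch of the formula as code. The function names and the cache capacity passed to `fits_in_cache` are hypothetical; as noted above, the real LSC capacity is platform-specific and cannot be queried at runtime, which is why the PR does not use a check like this.

```cpp
#include <cstdint>

// W_L1 = concurrent_WGs * N * element_size, in bytes.
// concurrent_wgs: workgroups resident on one Xe-core at the same time.
// n:              normalized dimension N (elements per row).
// element_size:   bytes per element (e.g. 2 for fp16/bf16, 4 for fp32).
inline int64_t pass2_working_set_bytes(
    int64_t concurrent_wgs, int64_t n, int64_t element_size) {
  return concurrent_wgs * n * element_size;
}

// True when the estimated working set fits in a cache of `cache_bytes`.
// `cache_bytes` is a caller-supplied assumption, not a queried value.
inline bool fits_in_cache(int64_t working_set_bytes, int64_t cache_bytes) {
  return working_set_bytes <= cache_bytes;
}
```

For example, 8 concurrent workgroups with N = 4096 in fp16 give 8 × 4096 × 2 = 65536 bytes; doubling the number of concurrent workgroups doubles the working set for the same N.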

@CuiYifeng
Contributor Author

CuiYifeng commented May 22, 2026

LayerNorm Speedup Ratio:

| Shape (batch, dim) | Speedup (fp32) | Speedup (fp16) | Speedup (bf16) |
| --- | --- | --- | --- |
| (1024, 128) | 0.99x | 1.03x | 1.13x |
| (1024, 256) | 1.00x | 1.04x | 1.01x |
| (1024, 512) | 1.01x | 1.00x | 0.97x |
| (1024, 1024) | 1.44x | 1.34x | 1.34x |
| (1024, 2048) | 1.29x | 1.79x | 1.65x |
| (1024, 4096) | 1.02x | 2.20x | 1.83x |
| (1024, 8192) | 1.00x | 1.00x | 1.00x |
| (1024, 16384) | 1.00x | 1.00x | 1.01x |
| (1024, 32768) | 1.00x | 1.00x | 1.00x |
| (1024, 65536) | 1.00x | 0.98x | 0.99x |
| (1024, 131072) | 1.00x | 1.02x | 1.00x |

@CuiYifeng CuiYifeng requested a review from Copilot May 22, 2026 07:41
@github-actions github-actions Bot added disable_e2e Disable all e2e test jobs for the PR disable_distributed Disable distributed UT test jobs for the PR labels May 22, 2026
@chuanqi129 chuanqi129 marked this pull request as draft May 22, 2026 07:45
Contributor

Copilot AI left a comment


Pull request overview

Skill file(s) read: .github/skills/xpu-ops-pr-review/SKILL.md.

This PR adjusts work-group size selection for the vectorized LayerNorm SYCL kernel to prefer a smaller fixed work-group (SIMD * 4) in order to reduce inter-subgroup tree-reduce barrier overhead.

Changes:

  • Replaced the previous “shrink max_wg_size by powers of two” logic with a fixed preferred work-group size (SIMD * 4) when N / vec_size <= max_wg_size.
  • Kept the existing fallback of using max_wg_size when N / vec_size > max_wg_size.

Comment thread src/ATen/native/xpu/sycl/LayerNormKernels.cpp
@chuanqi129 chuanqi129 marked this pull request as ready for review May 22, 2026 07:45
@CuiYifeng CuiYifeng requested a review from Copilot May 22, 2026 07:52
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +594 to 599
// with 4 subgroups per workgroup
constexpr int64_t preferred_wg_size = SIMD * 4;
if (n <= max_wg_size && preferred_wg_size <= max_wg_size) {
  return preferred_wg_size;
}
return max_wg_size;
@CuiYifeng CuiYifeng requested review from LuFinch and jianyizh May 22, 2026 08:30
Contributor


@CuiYifeng

  1. preferred_wg_size <= max_wg_size: why do we need this? max_wg_size on every device should be > 128.
  2. Does the Copilot comment make sense when n < 128?
  3. You also need to consider occupancy. Consider batch size = 64 on B70 with n = 1024. If you have 32 Xe-cores, wg_size = 1024 gives 100% occupancy but may suffer sync stalls. wg_size = 512 gives only 50% occupancy but has a lower barrier cost. Does wg_size = 128 give the best performance? You need to test this.
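The occupancy trade-off in point 3 can be sketched as a simple ratio. This is an illustration only: it assumes one workgroup per batch row and a hypothetical capacity of 2048 resident work-items per Xe-core, a value chosen because it reproduces the 100% and 50% figures quoted above, not a number taken from the PR or from a device query.

```cpp
#include <algorithm>
#include <cstdint>

// Fraction of the device's work-item slots that are occupied when
// `num_wgs` workgroups of `wg_size` work-items are resident at once.
// slots_per_xe_core is an assumed capacity, not a queried value.
inline double occupancy(
    int64_t num_wgs,
    int64_t wg_size,
    int64_t num_xe_cores,
    int64_t slots_per_xe_core) {
  const double used = static_cast<double>(num_wgs * wg_size);
  const double total = static_cast<double>(num_xe_cores * slots_per_xe_core);
  return std::min(1.0, used / total);
}
```

With 64 workgroups on 32 Xe-cores and the assumed 2048 slots per core, wg_size = 1024 yields 1.0, wg_size = 512 yields 0.5, and wg_size = 128 yields 0.125, which is why the reviewer asks for a measurement rather than relying on barrier cost alone.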

Contributor Author


@jianyizh

  1. I was not sure whether max_wg_size >= SIMD * 4 always holds. However, I agree with your point about max_wg_size, so the preferred_wg_size <= max_wg_size check has been removed.
  2. Do you mean that when max_wg_size < SIMD * 4, the function now returns max_wg_size without shrinking it relative to n? Makes sense. That is another reason to remove preferred_wg_size <= max_wg_size.
  3. Good point. Based on the speedup ratio for batch_size=40 and dim_size=4096 (n=1024), no obvious change has been observed. Below is the data collected on B60:

| dtype | Speedup |
| --- | --- |
| fp32 | 0.98x |
| fp16 | 1.00x |
| bf16 | 1.01x |
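After dropping the preferred_wg_size <= max_wg_size check, the selection logic discussed in this thread would reduce to something like the sketch below. This is not the actual code from LayerNormKernels.cpp: the function name and template parameter are hypothetical, and `n` is assumed to already be N / vec_size, as in the reviewed snippet.

```cpp
#include <cstdint>

// Hypothetical stand-in for the work-group size selection discussed above.
// SIMD:        subgroup size (e.g. 16 or 32).
// n:           N / vec_size, the number of vectorized elements per row.
// max_wg_size: device limit on work-group size.
template <int SIMD>
int64_t wg_size_select_sketch(int64_t n, int64_t max_wg_size) {
  // Prefer 4 subgroups per workgroup so the tree-reduce takes 2 rounds.
  constexpr int64_t preferred_wg_size = SIMD * 4;
  if (n <= max_wg_size) {
    return preferred_wg_size;
  }
  // Rows wider than one workgroup keep the previous fallback.
  return max_wg_size;
}
```

For SIMD32 and max_wg_size = 1024, n = 1024 selects 128, while n = 2048 falls back to 1024. The sketch relies on the reviewer's point that max_wg_size exceeds SIMD * 4 on every supported device.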


Labels

disable_distributed Disable distributed UT test jobs for the PR disable_e2e Disable all e2e test jobs for the PR


3 participants