Split subgroup topk kernel per-K for parallel AOT compilation#3683
Split subgroup topk kernel per-K for parallel AOT compilation#3683jianyizh wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Read .github/skills/xpu-ops-pr-review/SKILL.md and its referenced review notes/checklists.
This PR refactors subgroup top-k XPU kernel instantiations into per-K compilation units to enable more parallel AOT compilation while keeping the original dispatch path in TensorTopKSbtopkKernel.cpp.
Changes:
- Moves shared subgroup top-k functor and launch templates into
TensorTopKSbtopkKernelImpl.h. - Adds K-specific launch files for K=1,2,4,8,16.
- Leaves
TensorTopKSbtopkKernel.cppas dispatch logic that routes to the per-K launchers.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernelImpl.h |
Shared subgroup top-k template implementation and per-K launcher declarations. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp |
Dispatch-only routing to K-specific launch units. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel_k1.cpp |
K=1 typed dispatch and launch instantiations. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel_k2.cpp |
K=2 typed dispatch and launch instantiations. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel_k4.cpp |
K=4 typed dispatch and launch instantiations. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel_k8.cpp |
K=8 typed dispatch and launch instantiations. |
src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel_k16.cpp |
K=16 typed dispatch and launch instantiations. |
| sbtopk_k1_launch( | ||
| self, | ||
| nsegments, | ||
| nelements, | ||
| static_cast<int>(k), | ||
| largest, | ||
| values, | ||
| indices); |
|
@copilot New UT failures detected in op_ut tests (24 total). Please check these new failures and analyze whether they are caused by the PR changes. Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status. Job log: https://github.com/intel/torch-xpu-ops/actions/runs/25900306416 New failures:
... and 19 more failures. See full list in artifact. Instructions:
|
Analyzed run 25900306416 and downloaded both artifacts ( Conclusion: these new failures are not caused by this PR’s subgroup topk split. The failures are in Dynamo/Profiler/AOTAutogradCache/custom-ops paths (e.g. Reproducer pytest commands from failure XML:
No code change is applied on this PR branch for this comment because no causal link was found to the subgroup topk per-K refactor. |
Performance outliers, please check!
|
| sbtopk_k2_launch( | ||
| self, | ||
| nsegments, | ||
| nelements, | ||
| static_cast<int>(k), | ||
| largest, | ||
| values, | ||
| indices); |
There was a problem hiding this comment.
@jianyizh Would you like using a macro to simplify the switch?
#define SBTOPK_LAUNCH(KVAL) \
sbtopk_k##KVAL##_launch( \
self, \
nsegments, \
nelements, \
static_cast<int>(k), \
largest, \
values, \
indices)
#undef SBTOPK_LAUNCH
There was a problem hiding this comment.
Done in 7d5d247. Thanks for the suggestion!
|
@copilot The Clang format check failed. Please fix the lint errors in this PR. Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status. Lint job log: https://github.com/intel/torch-xpu-ops/actions/runs/26010568540 Instructions:
|
Fixed the lint issue locally in commit Validation run results:
I cannot update the PR branch from this environment because push is currently rejected by repository protection for UI screenshot: N/A (no UI changes). |
Performance outliers, please check!
|
|
@copilot New UT failures detected in op_ut tests (10 total). Please check these new failures and analyze whether they are caused by the PR changes. Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status. Job log: https://github.com/intel/torch-xpu-ops/actions/runs/26010919927 New failures:
... and 5 more failures. See full list in artifact. Instructions:
|
Checked run 26010919927 artifacts ( Reproducer commands from the artifact include:
UI screenshot: N/A (no UI changes). |
8f486eb to
97be719
Compare
97be719 to
597e1cb
Compare
Split the monolithic TensorTopKSbtopkKernel.cpp into 4 per-K files (k1, k2, k4, k8) to enable parallel AOT compilation and avoid CD build timeout that caused the original PR #3371 to be reverted. K=16 is excluded for now as it alone causes compilation timeout; it can be re-added once incremental build improvements land. - TensorTopKSbtopkKernel.h: public API (SbtopkResult enum + dispatch) - TensorTopKSbtopkKernelImpl.h: shared functor + launch templates - TensorTopKSbtopkKernel_k{1,2,4,8}.cpp: per-K instantiations - TensorTopKSbtopkKernel.cpp: dispatch-only (routes to per-K units) - TensorTopKKernel.cpp: integrate sbtopk_try_launch fallback
Summary
Split
TensorTopKSbtopkKernel.cppinto per-K compilation units so each generates a separate.sothat can be AOT-compiled in parallel, reducing build time. Follows the same pattern as #3652 (FlashAttention per-headdim split).Changes
TensorTopKSbtopkKernelImpl.h(new) — shared header withSubgroupTopKFunctor,sbtopk_launch_impl, andsbtopk_launch_vec_dispatchtemplatesTensorTopKSbtopkKernel_k{1,2,4,8,16}.cpp(new) — per-K instantiations, each producing its own.sowith AOT kernel codeTensorTopKSbtopkKernel.cpp(modified) — dispatch-only, routes to per-K functions; no kernel code, tiny.soMotivation
The subgroup topk kernel has a large number of template instantiations:
K(5) × scalar_t(~6) × IndexT(2) × Largest(2) × VEC_SIZE(up to 4). Keeping all instantiations in a single.cppcreates one monolithic AOT compilation that cannot be parallelized. Splitting by K allows 5 independent AOT compilations to run in parallel.Testing