Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA by AKloniecki · Pull Request #3275 · intel/torch-xpu-ops

AKloniecki · 2026-04-07T08:03:22Z

Extract pointwise_op_impl helper into a shared DeviceAddCmulCdiv.h header
Update ForeachFunctors.h to include the shared header instead of defining pointwise_op_impl directly
Update PointwiseOpsKernels.cpp to use the shared helper in AddcmulFunctor, removing duplicated FMA logic and unused #include <functional> / #include <type_traits>

Copilot

Pull request overview

This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).

Changes:

Updated XPU addcmul kernel to use std::fma (with an alpha == 1 fast-path) for floating-point accumulator types.
Introduced a pointwise_op_impl helper in ForeachFunctors.h to centralize FMA behavior for input + alpha * op(tensor1, tensor2).
Switched foreach pointwise scalar and scalar-list functors to call pointwise_op_impl.
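
For readers following the parity discussion, here is a minimal host-side sketch of why fused and unfused evaluation of input + alpha * (tensor1 * tensor2) can differ bitwise when alpha == 1. It uses std::fma in plain C++ only so that it runs without a SYCL toolchain. The function names, the alpha != 1 branch, and the input values are illustrative assumptions, not code taken from this PR.

```cpp
#include <cassert>
#include <cmath>

// Unfused reference: the product is rounded to float before the add.
// `volatile` keeps the compiler from contracting the expression into an FMA.
float addcmul_unfused(float input, float alpha, float t1, float t2) {
  volatile float prod = t1 * t2;
  return input + alpha * prod;
}

// Fused sketch (assumed shape, not the PR's code): when alpha == 1 the whole
// expression collapses to one FMA, so the product is never rounded on its own.
float addcmul_fused(float input, float alpha, float t1, float t2) {
  if (alpha == 1.0f) {
    return std::fma(t1, t2, input);
  }
  // Assumption for illustration: keep the product separate, fuse the scale-add.
  return std::fma(alpha, t1 * t2, input);
}
```

With t1 = t2 = 1 + 2^-12 and input = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The unfused path rounds that product to 1 + 2^-11 and returns 0, while the fused path keeps the low bit and returns 2^-24, which is the kind of one-bit disagreement a bitwise parity test would flag.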

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp	Adds `std::fma` usage in `AddcmulFunctor` to match CUDA fused behavior.
src/ATen/native/xpu/sycl/ForeachFunctors.h	Adds `pointwise_op_impl` and routes foreach pointwise ops through it for FMA parity.

astachowiczhabana · 2026-04-07T11:48:26Z

What's the issue this PR is fixing?

astachowiczhabana · 2026-04-07T12:29:17Z

+please fix the linter

guangyey

Overall LGTM.

guangyey · 2026-04-13T03:05:48Z

@AKloniecki could you please fix the lint issue?

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

astachowiczhabana · 2026-04-13T14:03:04Z

@AKloniecki please fix the linter issues and address all comments. Then we can auto-merge this PR.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

github-actions · 2026-04-14T21:16:19Z

Performance outliers, please check!

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	mnasnet1_0	1.038304	0.848556

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

github-actions · 2026-04-16T23:32:38Z

Performance outliers, please check!

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	resnext50_32x4d	0.902251	0.838351

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

github-actions · 2026-04-28T19:11:55Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
timm_models_bfloat16_training	vit_base_patch16_siglip_256	0.691655	0.745540
timm_models_bfloat16_training	visformer_small	0.719375	0.760122
timm_models_bfloat16_training	dm_nfnet_f0	0.640112	0.766557
timm_models_bfloat16_training	deit_base_distilled_patch16_224	0.750772	0.783908
timm_models_bfloat16_training	beit_base_patch16_224	0.740306	0.815450
timm_models_bfloat16_training	mobilenetv3_large_100	0.783371	0.819866
timm_models_bfloat16_training	nfnet_l0	0.737466	0.825696
timm_models_bfloat16_training	inception_v3	0.783976	0.837689
timm_models_bfloat16_training	adv_inception_v3	0.788242	0.846674
timm_models_bfloat16_training	mobilevit_s	0.720221	0.962036

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
timm_models_bfloat16_training	convnextv2_nano.fcmae_ft_in22k_in1k	0.809682	0.816940
timm_models_bfloat16_training	swin_base_patch4_window7_224	0.838258	0.827202
timm_models_bfloat16_training	repvgg_a2	0.828009	0.841644
timm_models_bfloat16_training	deit_tiny_patch16_224.fb_in1k	0.865368	0.843404
timm_models_bfloat16_training	ghostnet_100	0.845110	0.850271
timm_models_bfloat16_training	mobilenetv2_100	0.840877	0.878631
timm_models_bfloat16_training	tf_efficientnet_b0	0.803188	0.918825

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

EikanWang

@AKloniecki, please avoid using std::{math} in general. If the CUDA implementation uses __{math}__, please use sycl::native::{math}. Otherwise, sycl::{math} is preferred.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

AKloniecki · 2026-05-07T10:51:59Z

@EikanWang I've applied the change you've suggested. Now sycl math is used instead of std.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

+// Trait to detect multiply-like functors (std::multiplies<T> for any T).
+// This enables the FMA fast-path for any instantiation of std::multiplies,
+// not just std::multiplies<opmath_t>.
+template <typename Op>
+struct is_multiply_op : std::false_type {};
+
+template <typename T>
+struct is_multiply_op<std::multiplies<T>> : std::true_type {};
+
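
As a rough illustration of how a trait like the one above could gate the FMA fast-path at compile time, the following standalone C++17 sketch repeats the trait and pairs it with a hypothetical helper. The name pointwise_op_sketch, its signature, and the use of std::fma (chosen so the example builds without SYCL) are assumptions for illustration, not the helper in DeviceAddCmulCdiv.h.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <type_traits>

// Same trait as in the diff: true only for std::multiplies<T>.
template <typename Op>
struct is_multiply_op : std::false_type {};

template <typename T>
struct is_multiply_op<std::multiplies<T>> : std::true_type {};

// Hypothetical helper: computes input + alpha * op(t1, t2).
// The FMA branch is compiled in only for multiply ops on floating-point T.
template <typename T, typename Op>
T pointwise_op_sketch(T input, T alpha, T t1, T t2, Op op) {
  if constexpr (is_multiply_op<Op>::value && std::is_floating_point_v<T>) {
    if (alpha == T(1)) {
      return std::fma(t1, t2, input);  // single rounding of t1 * t2 + input
    }
  }
  return input + alpha * op(t1, t2);   // generic path, e.g. std::divides
}
```

Calling pointwise_op_sketch(1.0, 1.0, 2.0, 3.0, std::multiplies<double>{}) takes the fused branch, while passing std::divides<double>{} always falls through to the generic expression because the trait is false for it.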


github-actions · 2026-05-09T04:05:47Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	mnasnet1_0	0.993066	0.728724
torchbench_bfloat16_training	resnext50_32x4d	0.943190	0.736630
torchbench_bfloat16_training	densenet121	0.771538	0.778071

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	dcgan	0.821567	0.803764
torchbench_bfloat16_training	mobilenet_v3_large	0.970699	0.813746

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

chuanqi129 · 2026-05-18T10:04:43Z

@copilot The Clang format check failed. Please fix the lint errors in this PR.

Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status.

Lint job log: https://github.com/intel/torch-xpu-ops/actions/runs/26026778377

Instructions:

Review the lint failure log via the link above
Fix all lint errors in the relevant source files
Ensure both Python lint and Clang format checks pass
Use reply comments to report progress - do NOT modify the PR description plan directly

Copilot · 2026-05-18T10:08:27Z

Fixed in 318889e. The clang-format failure was caused by <sycl/sycl.hpp> appearing after the STL headers in DeviceAddCmulCdiv.h. Per the .clang-format IncludeCategories rules, .hpp headers have priority 1 and should sort before non-extension standard headers (priority 2). Moved <sycl/sycl.hpp> to be first in the include block.

AKloniecki requested review from EikanWang, astachowiczhabana, Copilot and guangyey April 7, 2026 08:03

Copilot started reviewing on behalf of AKloniecki April 7, 2026 08:06

Copilot AI reviewed Apr 7, 2026

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated

Copilot started work on behalf of AKloniecki April 7, 2026 08:46

Copilot finished work on behalf of AKloniecki April 7, 2026 08:58

guangyey reviewed Apr 7, 2026

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h

guangyey approved these changes Apr 7, 2026

Copilot AI review requested due to automatic review settings April 13, 2026 08:55

Copilot started reviewing on behalf of AKloniecki April 13, 2026 08:55

Copilot AI reviewed Apr 13, 2026

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h

Copilot AI review requested due to automatic review settings April 14, 2026 09:07

Copilot started reviewing on behalf of AKloniecki April 14, 2026 09:08

Copilot AI reviewed Apr 14, 2026

Copilot AI review requested due to automatic review settings April 16, 2026 09:10

Copilot AI reviewed Apr 16, 2026

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h

tszulist-hbn approved these changes Apr 16, 2026

astachowiczhabana approved these changes Apr 16, 2026

Copilot AI review requested due to automatic review settings April 17, 2026 12:42

Copilot AI reviewed Apr 17, 2026

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated

AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from 31a0c36 to f92eaa5 Compare April 28, 2026 12:21

Copilot AI reviewed Apr 30, 2026

Copilot AI review requested due to automatic review settings May 6, 2026 08:03

Copilot started reviewing on behalf of AKloniecki May 6, 2026 08:04

EikanWang requested changes May 6, 2026

Copilot AI reviewed May 6, 2026

AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from 5d1be24 to 5ff1d63 Compare May 7, 2026 10:48

AKloniecki changed the title ~~Use std::fma in addcmul and foreach pointwise ops for FMA parity with CUDA~~ Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA May 7, 2026

AKloniecki mentioned this pull request May 7, 2026

Reenable test_addcmul_alpha_one_fma_parity dtypes F32 and F64 on XPU. pytorch/pytorch#182811

Draft

AKloniecki requested review from EikanWang and Copilot May 8, 2026 07:50

Copilot started reviewing on behalf of AKloniecki May 8, 2026 07:51

Copilot AI reviewed May 8, 2026

Copilot AI review requested due to automatic review settings May 12, 2026 08:56

Copilot started reviewing on behalf of AKloniecki May 12, 2026 08:57

Copilot AI reviewed May 12, 2026

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h

Copilot started work on behalf of chuanqi129 May 18, 2026 10:05

Copilot AI review requested due to automatic review settings May 18, 2026 10:08

AKloniecki review requested due to automatic review settings May 18, 2026 10:08

Copilot finished work on behalf of chuanqi129 May 18, 2026 10:08

Copilot AI requested a review from chuanqi129 May 18, 2026 10:08

github-actions Bot added the disable_e2e (Disable all e2e test jobs for the PR) and disable_distributed (Disable distributed UT test jobs for the PR) labels May 21, 2026

chuanqi129 marked this pull request as draft May 21, 2026 14:14

chuanqi129 marked this pull request as ready for review May 21, 2026 14:14

Use sycl::fma in addcmul and foreach pointwise ops for FMA parity wit…

5a03171

…h CUDA Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>

AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from b1742ef to 5a03171 Compare May 22, 2026 17:28
