Skip to content

Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA#3275

Open
AKloniecki wants to merge 1 commit into
mainfrom
aklonieckix/use-std-fma-in=addcmul
Open

Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA#3275
AKloniecki wants to merge 1 commit into
mainfrom
aklonieckix/use-std-fma-in=addcmul

Conversation

@AKloniecki
Copy link
Copy Markdown
Contributor

@AKloniecki AKloniecki commented Apr 7, 2026

  • Extract pointwise_op_impl helper into a shared DeviceAddCmulCdiv.h header
  • Update ForeachFunctors.h to include the shared header instead of defining pointwise_op_impl directly
  • Update PointwiseOpsKernels.cpp to use the shared helper in AddcmulFunctor, removing duplicated FMA logic and unused #include <functional> / #include <type_traits>

Fixes: #2759

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).

Changes:

  • Updated XPU addcmul kernel to use std::fma (with an alpha == 1 fast-path) for floating-point accumulator types.
  • Introduced a pointwise_op_impl helper in ForeachFunctors.h to centralize FMA behavior for input + alpha * op(tensor1, tensor2).
  • Switched foreach pointwise scalar and scalar-list functors to call pointwise_op_impl.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Adds std::fma usage in AddcmulFunctor to match CUDA fused behavior.
src/ATen/native/xpu/sycl/ForeachFunctors.h Adds pointwise_op_impl and routes foreach pointwise ops through it for FMA parity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
@astachowiczhabana
Copy link
Copy Markdown
Contributor

What's the issue this PR is fixing?

@astachowiczhabana
Copy link
Copy Markdown
Contributor

+please fix the linter

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
Copy link
Copy Markdown
Contributor

@guangyey guangyey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

@guangyey
Copy link
Copy Markdown
Contributor

@AKloniecki could you please fix the lint issue.

Copilot AI review requested due to automatic review settings April 13, 2026 08:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@astachowiczhabana
Copy link
Copy Markdown
Contributor

@AKloniecki please fix the linter issues + address all comments. Then we can auto-merge this PR

Copilot AI review requested due to automatic review settings April 14, 2026 09:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link
Copy Markdown

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 1.038304 0.848556

Copilot AI review requested due to automatic review settings April 16, 2026 09:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@github-actions
Copy link
Copy Markdown

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training resnext50_32x4d 0.902251 0.838351

Copilot AI review requested due to automatic review settings April 17, 2026 12:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
@AKloniecki AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from 31a0c36 to f92eaa5 Compare April 28, 2026 12:21
@github-actions
Copy link
Copy Markdown

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
timm_models_bfloat16_training vit_base_patch16_siglip_256 0.691655 0.745540
timm_models_bfloat16_training visformer_small 0.719375 0.760122
timm_models_bfloat16_training dm_nfnet_f0 0.640112 0.766557
timm_models_bfloat16_training deit_base_distilled_patch16_224 0.750772 0.783908
timm_models_bfloat16_training beit_base_patch16_224 0.740306 0.815450
timm_models_bfloat16_training mobilenetv3_large_100 0.783371 0.819866
timm_models_bfloat16_training nfnet_l0 0.737466 0.825696
timm_models_bfloat16_training inception_v3 0.783976 0.837689
timm_models_bfloat16_training adv_inception_v3 0.788242 0.846674
timm_models_bfloat16_training mobilevit_s 0.720221 0.962036
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
timm_models_bfloat16_training convnextv2_nano.fcmae_ft_in22k_in1k 0.809682 0.816940
timm_models_bfloat16_training swin_base_patch4_window7_224 0.838258 0.827202
timm_models_bfloat16_training repvgg_a2 0.828009 0.841644
timm_models_bfloat16_training deit_tiny_patch16_224.fb_in1k 0.865368 0.843404
timm_models_bfloat16_training ghostnet_100 0.845110 0.850271
timm_models_bfloat16_training mobilenetv2_100 0.840877 0.878631
timm_models_bfloat16_training tf_efficientnet_b0 0.803188 0.918825

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 6, 2026 08:03
Copy link
Copy Markdown
Contributor

@EikanWang EikanWang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AKloniecki , please avoid using std::{math} in general. If the CUDA has used __{math}__, please use sycl::native::{math}. Otherwise, sycl::{math} is the preference.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@AKloniecki AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from 5d1be24 to 5ff1d63 Compare May 7, 2026 10:48
@AKloniecki AKloniecki changed the title Use std::fma in addcmul and foreach pointwise ops for FMA parity with CUDA Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA May 7, 2026
@AKloniecki
Copy link
Copy Markdown
Contributor Author

@EikanWang I've applied the change you've suggested. Now sycl math is used instead of std.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
Comment on lines +19 to +27
// Trait to detect multiply-like functors (std::multiplies<T> for any T).
// This enables the FMA fast-path for any instantiation of std::multiplies,
// not just std::multiplies<opmath_t>.
template <typename Op>
struct is_multiply_op : std::false_type {};

template <typename T>
struct is_multiply_op<std::multiplies<T>> : std::true_type {};

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 0.993066 0.728724
torchbench_bfloat16_training resnext50_32x4d 0.943190 0.736630
torchbench_bfloat16_training densenet121 0.771538 0.778071
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training dcgan 0.821567 0.803764
torchbench_bfloat16_training mobilenet_v3_large 0.970699 0.813746

Copilot AI review requested due to automatic review settings May 12, 2026 08:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@chuanqi129
Copy link
Copy Markdown
Contributor

@copilot The Clang format check failed. Please fix the lint errors in this PR.

Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status.

Lint job log: https://github.com/intel/torch-xpu-ops/actions/runs/26026778377

Instructions:

  1. Review the lint failure log via the link above
  2. Fix all lint errors in the relevant source files
  3. Ensure both Python lint and Clang format checks pass
  4. Use reply comments to report progress - do NOT modify the PR description plan directly

Copilot AI review requested due to automatic review settings May 18, 2026 10:08
@AKloniecki AKloniecki review requested due to automatic review settings May 18, 2026 10:08
Copy link
Copy Markdown
Contributor

Copilot AI commented May 18, 2026

@copilot The Clang format check failed. Please fix the lint errors in this PR.

Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status....

Fixed in 318889e. The clang-format failure was caused by <sycl/sycl.hpp> appearing after the STL headers in DeviceAddCmulCdiv.h. Per the .clang-format IncludeCategories rules, .hpp headers have priority 1 and should sort before non-extension standard headers (priority 2). Moved <sycl/sycl.hpp> to be first in the include block.

Copilot AI requested a review from chuanqi129 May 18, 2026 10:08
@github-actions github-actions Bot added disable_e2e Disable all e2e test jobs for the PR disable_distributed Disable distributed UT test jobs for the PR labels May 21, 2026
@chuanqi129 chuanqi129 marked this pull request as draft May 21, 2026 14:14
@chuanqi129 chuanqi129 marked this pull request as ready for review May 21, 2026 14:14
…h CUDA

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
@AKloniecki AKloniecki force-pushed the aklonieckix/use-std-fma-in=addcmul branch from b1742ef to 5a03171 Compare May 22, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disable_distributed Disable distributed UT test jobs for the PR disable_e2e Disable all e2e test jobs for the PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Skip]: New failed cases 2026-1-22

8 participants