Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA#3275
Use sycl::fma in addcmul and foreach pointwise ops for FMA parity with CUDA#3275AKloniecki wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).
Changes:
- Updated XPU
addcmulkernel to usestd::fma(with analpha == 1fast-path) for floating-point accumulator types. - Introduced a
pointwise_op_implhelper inForeachFunctors.hto centralize FMA behavior forinput + alpha * op(tensor1, tensor2). - Switched foreach pointwise scalar and scalar-list functors to call
pointwise_op_impl.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp | Adds std::fma usage in AddcmulFunctor to match CUDA fused behavior. |
| src/ATen/native/xpu/sycl/ForeachFunctors.h | Adds pointwise_op_impl and routes foreach pointwise ops through it for FMA parity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
What's the issue this PR is fixing? |
|
+please fix the linter |
|
@AKloniecki could you please fix the lint issue. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@AKloniecki please fix the linter issues + address all comments. Then we can auto-merge this PR |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Performance outliers, please check!
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Performance outliers, please check!
|
31a0c36 to
f92eaa5
Compare
Performance outliers, please check!
|
EikanWang
left a comment
There was a problem hiding this comment.
@AKloniecki , please avoid using std::{math} in general. If the CUDA has used __{math}__, please use sycl::native::{math}. Otherwise, sycl::{math} is the preference.
5d1be24 to
5ff1d63
Compare
|
@EikanWang I've applied the change you've suggested. Now sycl math is used instead of std. |
| // Trait to detect multiply-like functors (std::multiplies<T> for any T). | ||
| // This enables the FMA fast-path for any instantiation of std::multiplies, | ||
| // not just std::multiplies<opmath_t>. | ||
| template <typename Op> | ||
| struct is_multiply_op : std::false_type {}; | ||
|
|
||
| template <typename T> | ||
| struct is_multiply_op<std::multiplies<T>> : std::true_type {}; | ||
|
|
Performance outliers, please check!
|
|
@copilot The Clang format check failed. Please fix the lint errors in this PR. Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status. Lint job log: https://github.com/intel/torch-xpu-ops/actions/runs/26026778377 Instructions:
|
Fixed in 318889e. The clang-format failure was caused by |
…h CUDA Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
b1742ef to
5a03171
Compare
pointwise_op_implhelper into a sharedDeviceAddCmulCdiv.hheaderForeachFunctors.hto include the shared header instead of definingpointwise_op_impldirectlyPointwiseOpsKernels.cppto use the shared helper inAddcmulFunctor, removing duplicated FMA logic and unused#include <functional>/#include <type_traits>Fixes: #2759