
Int64 support for UpsampleNearest3d#3737

Open: kdrozd-dev wants to merge 5 commits into main from fix/upsample-nearest3d-int64-indexing


@kdrozd-dev
Contributor

Fixes: #2510.

Align with the changes from pytorch/pytorch#144865: improve the error messages to resemble the CUDA ones, and change the relevant int32 variables to int64.
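To illustrate why 32-bit indexing is a problem here, the following is a minimal sketch, not the code from this PR. It assumes a contiguous NCDHW layout and an integer scale factor; the helper names (`numel_5d`, `fits_in_int32`, `flat_offset`, `nearest_src_index`) are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: element count of a 5-D (N, C, D, H, W) tensor,
// accumulated in 64 bits so the product itself cannot wrap for these sizes.
constexpr int64_t numel_5d(int64_t n, int64_t c, int64_t d, int64_t h, int64_t w) {
  return n * c * d * h * w;
}

// True when an element count can be represented as an int32_t.
constexpr bool fits_in_int32(int64_t numel) {
  return numel <= std::numeric_limits<int32_t>::max();
}

// Flat offset into a contiguous NCDHW tensor, computed entirely in int64_t.
// (C, D, H, W) are the tensor sizes; (n, c, d, h, w) are the coordinates.
constexpr int64_t flat_offset(int64_t n, int64_t c, int64_t d, int64_t h, int64_t w,
                              int64_t C, int64_t D, int64_t H, int64_t W) {
  return (((n * C + c) * D + d) * H + h) * W + w;
}

// Nearest-neighbor source index for an integer scale factor (assumption:
// the real kernel uses a float scale), clamped to the last valid input index.
constexpr int64_t nearest_src_index(int64_t dst, int64_t scale, int64_t in_size) {
  const int64_t src = dst / scale;
  return src < in_size ? src : in_size - 1;
}
```

With these helpers, an output of shape (1, 64, 128, 512, 512) has 2^31 elements, which is one more than `std::numeric_limits<int32_t>::max()`, so a check of the form `output.numel() <= INT32_MAX` fails even though each individual dimension is small.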

Comment thread src/ATen/native/xpu/sycl/UpSampleNearest3dKernels.cpp Outdated
@github-actions github-actions bot added the disable_e2e (Disable all e2e test jobs for the PR) and disable_distributed (Disable distributed UT test jobs for the PR) labels May 22, 2026
Comment thread src/ATen/native/xpu/sycl/UpSampleNearest3dKernels.cpp Outdated
@chuanqi129 chuanqi129 marked this pull request as draft May 22, 2026 07:41
@chuanqi129 chuanqi129 marked this pull request as ready for review May 22, 2026 07:41
Co-authored-by: Slawomir Siwek <slawomir.siwek@intel.com>
Comment thread src/ATen/native/xpu/sycl/UpSampleNearest3dKernels.cpp Outdated
Co-authored-by: Slawomir Siwek <slawomir.siwek@intel.com>
Contributor

@pbielak pbielak left a comment


Looks good, but similar to my changes with the MaxPool3D kernel (see PRs #3558 and #3632), we should check the impact of the int64_t "hardcoding". Maybe a dispatch over int32/int64 would give better perf results.

@kdrozd-dev
Contributor Author

kdrozd-dev commented May 22, 2026

> Looks good, but similar to my changes with the MaxPool3D kernel (see PRs #3558 and #3632), we should check the impact of the int64_t "hardcoding". Maybe a dispatch over int32/int64 would give better perf results.

Seems like a good idea. I will switch to the dispatch approach after a quick benchmark.
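For readers unfamiliar with the pattern, an int32/int64 dispatch could look roughly like the sketch below. This is an illustration under stated assumptions, not the kernel from this PR: it uses a plain 1-D CPU loop instead of a SYCL kernel, an integer scale, and hypothetical names (`can_use_int32`, `upsample_nearest_1d`, `upsample_nearest_dispatch`).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical predicate: both element counts must fit in int32_t
// for the narrower index type to be safe.
constexpr bool can_use_int32(int64_t in_numel, int64_t out_numel) {
  return in_numel <= std::numeric_limits<int32_t>::max() &&
         out_numel <= std::numeric_limits<int32_t>::max();
}

// Stand-in for the kernel body, templated on the index type.
// A 1-D nearest-neighbor upsample keeps the example short.
template <typename index_t>
void upsample_nearest_1d(const std::vector<float>& in, std::vector<float>& out,
                         index_t scale) {
  const index_t in_size = static_cast<index_t>(in.size());
  const index_t out_size = static_cast<index_t>(out.size());
  for (index_t i = 0; i < out_size; ++i) {
    index_t src = i / scale;
    if (src >= in_size) src = in_size - 1;  // clamp to the last input element
    out[static_cast<std::size_t>(i)] = in[static_cast<std::size_t>(src)];
  }
}

// Picks the index type at runtime; returns true if the 32-bit path ran.
// Assumes `in` is non-empty and `scale` is positive.
bool upsample_nearest_dispatch(const std::vector<float>& in,
                               std::vector<float>& out, int64_t scale) {
  const bool use32 =
      can_use_int32(static_cast<int64_t>(in.size()),
                    static_cast<int64_t>(out.size())) &&
      scale <= std::numeric_limits<int32_t>::max();
  if (use32) {
    upsample_nearest_1d<int32_t>(in, out, static_cast<int32_t>(scale));
  } else {
    upsample_nearest_1d<int64_t>(in, out, scale);
  }
  return use32;
}
```

The idea is that the common case keeps cheaper 32-bit index arithmetic, while tensors above the int32 limit fall back to the 64-bit instantiation of the same template.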

@kdrozd-dev
Contributor Author

kdrozd-dev commented May 22, 2026

3 runs, with 20 executions per test case in each run:

| Case | Shape | dtype | Output numel | Speedup (med) | Speedup (avg) | Stable |
| --- | --- | --- | --- | --- | --- | --- |
| small-f32 | (2, 64, 16, 16, 16) | float32 | 4M | 0.869x | 0.899x | no |
| small-bf16 | (2, 64, 16, 16, 16) | bfloat16 | 4M | 1.044x | 1.062x | no |
| small-f16 | (2, 64, 16, 16, 16) | float16 | 4M | 1.482x | 1.287x | no |
| med-f32 | (4, 128, 32, 32, 32) | float32 | 134M | 1.034x | 0.869x | no |
| med-bf16 | (4, 128, 32, 32, 32) | bfloat16 | 134M | 0.986x | 1.089x | no |
| med-f16 | (4, 128, 32, 32, 32) | float16 | 134M | 0.925x | 0.848x | no |
| large-f32 | (1, 32, 64, 128, 128) | float32 | 268M | 1.053x | 1.053x | YES |
| large-bf16 | (1, 32, 64, 128, 128) | bfloat16 | 268M | 1.058x | 1.058x | YES |
| xl-bf16 | (1, 64, 64, 128, 256) | bfloat16 | 1.07B | 1.054x | 1.054x | YES |
| xl-f32 | (1, 64, 64, 128, 256) | float32 | 1.07B | 1.055x | 1.055x | YES |
| med-exact-f32 | (4, 128, 32, 32, 32) | float32 | 134M | 0.847x | 0.845x | no |
| med-scale3-f32 | (2, 64, 16, 16, 16) | float32 | 14M | 1.302x | 1.063x | no |

The benchmark shows that, for the stable test cases, the dispatch-based approach is faster by around 5%. For the small kernels, the results were mostly noise and varied greatly.
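As a rough illustration of how median and average speedup figures like these can be derived from raw timings, here is a small sketch. It is not the benchmark script used for this PR: the function names are hypothetical, speedup is assumed to mean baseline time divided by candidate time, and the criterion behind the "Stable" column is not stated in this thread, so it is not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean; assumes xs is non-empty.
double mean(const std::vector<double>& xs) {
  return std::accumulate(xs.begin(), xs.end(), 0.0) /
         static_cast<double>(xs.size());
}

// Median of a copy of the input; assumes xs is non-empty.
double median(std::vector<double> xs) {
  std::sort(xs.begin(), xs.end());
  const std::size_t n = xs.size();
  return (n % 2 == 1) ? xs[n / 2] : (xs[n / 2 - 1] + xs[n / 2]) / 2.0;
}

// Per-run speedup, assumed here to be baseline_time / candidate_time,
// so values above 1.0 mean the candidate was faster in that run.
std::vector<double> speedups(const std::vector<double>& baseline,
                             const std::vector<double>& candidate) {
  const std::size_t n = std::min(baseline.size(), candidate.size());
  std::vector<double> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(baseline[i] / candidate[i]);
  }
  return out;
}
```

When the median and the mean of the per-run speedups disagree noticeably, as in several of the small and medium cases in the table, a few outlier runs are likely dominating the average.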



Development

Successfully merging this pull request may close these issues.

[upstream_ut] RuntimeError: Expected output.numel() <= std::numeric_limits<int32_t>::max() to be true, but got fa

3 participants