Add single workgroup topk kernel for XPU (part 2 of #3369) #3372
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an additional optimized XPU SYCL TopK kernel specialized for large last-dimension slices, and updates topk dispatch + radix conversion to improve performance and NaN behavior.
Changes:
- Introduce a single-workgroup (1024-thread) radix-select + gather TopK kernel (unsorted output).
- Extend `sbtopk_try_launch` dispatch to choose between subgroup-sorted vs single-workgroup-unsorted paths.
- Adjust radix conversion to map NaNs to the maximum radix value for consistent selection behavior.
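To illustrate the last bullet, here is a minimal host-side sketch of an order-preserving float32-to-radix conversion that sends NaN to the maximum key. It uses the bit trick commonly applied when radix-sorting floats; the function name `float32_to_radix` is hypothetical and the actual `TopKTypeConfig::convert` in `SortingRadixSelect.h` may differ in detail.

```python
import math
import struct


def float32_to_radix(value: float) -> int:
    """Map a float32 to a uint32 key whose unsigned order matches float order.

    Hypothetical helper for illustration only, not the code from the PR.
    """
    if math.isnan(value):
        # NaN gets the largest possible key, so it ranks above every number.
        return 0xFFFFFFFF
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Negative values: flip all bits. Non-negative values: flip the sign bit.
    mask = 0xFFFFFFFF if bits & 0x80000000 else 0x80000000
    return bits ^ mask


keys = [float32_to_radix(v) for v in (-1.0, 0.0, 1.0, float("inf"), float("nan"))]
print([hex(key) for key in keys])
```

With keys like these, a radix select that looks for the largest values treats every NaN as greater than any finite value or infinity, which gives a single, consistent placement for NaNs.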
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.h | Declares the new single-workgroup TopK launch helper. |
| src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp | Implements the single-workgroup radix-select TopK kernel and its launcher. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h | Updates comments to reflect multiple optimized kernel paths. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp | Adds dispatch logic for the new single-workgroup kernel alongside subgroup TopK. |
| src/ATen/native/xpu/sycl/SortingRadixSelect.h | Updates TopKTypeConfig::convert to treat NaNs as maximum radix values. |
/merge -f "UT unrelated fail"
❌ Not enough approvals (0/2). At least 2 approvals are required to merge.
SYCL translation of PyTorch CUDA's single-block radix select path. A 1024-thread workgroup processes one slice using RADIX_BITS=4 radix select to find the k-th value, then gathers matching elements. Output is unsorted. Best for large dim (>= 4096). Dispatch updated: dim < 1024 -> original; k <= 16 + large batch -> subgroup kernel; dim >= 4096 -> single workgroup kernel; else -> original. Also fixes NaN handling in SortingRadixSelect.h for half/float/double. 432/432 accuracy tests pass, 324/324 sortedness tests pass.
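The select-then-gather scheme in the commit message above can be sketched sequentially as follows. This is a single-threaded illustration under stated assumptions, not the SYCL kernel: keys are assumed to be already converted to unsigned integers, `k` is assumed to satisfy `1 <= k <= len(keys)`, and the function names are hypothetical.

```python
RADIX_BITS = 4
RADIX_SIZE = 1 << RADIX_BITS


def radix_select_kth_largest(keys, k, key_bits=32):
    """Return the k-th largest key (1-based) using 4-bit digit histograms."""
    desired = 0       # digits of the answer fixed so far
    desired_mask = 0  # bit positions that are already fixed
    k_left = k
    # Walk from the most significant digit down to the least significant one.
    for shift in range(key_bits - RADIX_BITS, -1, -RADIX_BITS):
        counts = [0] * RADIX_SIZE
        for key in keys:
            # Only keys that match the digits fixed so far are still candidates.
            if (key & desired_mask) == desired:
                counts[(key >> shift) & (RADIX_SIZE - 1)] += 1
        # Scan buckets from the largest digit; stop at the one holding the k-th key.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if counts[digit] >= k_left:
                desired |= digit << shift
                desired_mask |= (RADIX_SIZE - 1) << shift
                break
            k_left -= counts[digit]
    return desired


def topk_unsorted(keys, k):
    """Gather k (key, index) pairs: all keys above the k-th value, then ties."""
    kth = radix_select_kth_largest(keys, k)
    out = [(key, i) for i, key in enumerate(keys) if key > kth]
    for i, key in enumerate(keys):
        if len(out) == k:
            break
        if key == kth:
            out.append((key, i))
    return out


print(topk_unsorted([5, 1, 9, 3, 9, 7], 3))
```

The gathered pairs come out in input order rather than value order, which matches the "output is unsorted" note: a caller that needs sorted results has to sort the k pairs afterwards.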
Add IndexT template parameter to SbtopkGatherFunctor so that shared memory, histogram counts, and element indices use the correct index type. Dispatch IndexT as int when nsegments*nelements <= INT_MAX, int64_t otherwise. Remove the nsegments <= INT_MAX guard in the caller since the kernel now handles both cases internally.
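The index-type choice in this commit reduces to one comparison. A minimal sketch, where `pick_index_type` is a hypothetical name and the returned strings stand in for the C++ template arguments:

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed int


def pick_index_type(nsegments: int, nelements: int) -> str:
    """Choose the index type from the total element count (illustrative only)."""
    return "int" if nsegments * nelements <= INT_MAX else "int64_t"


# Largest benchmark shape in this PR: 2048 slices of 131073 elements.
print(pick_index_type(2048, 131073))
```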
- Add alignas(alignof(LoadT)) on all scalar_t src[VEC_SIZE] arrays used for vectorized loads (3 occurrences) to ensure proper alignment
- Replace magic numbers smem[64]/smem[65] with named constants SMEM_FOUND_FLAG / SMEM_FOUND_IDX for clarity and maintainability
…el_submit, simplify EPT with PowerOf2Floor, remove unnecessary include
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
… for SLM layout safety
Hi @jianyizh, may I ask how you handle the build-time issue?
#3683: the kernel with the build-time issue is this one. When k=16 the build is very slow, so I only keep k=1~8. My guess is that it comes from the AOT target dg2: dg2 has only half as many registers as bmg, so heavy register spilling may be the cause. Still investigating.
Thanks for the details.
…l#3372)

## Summary

Builds on intel#3371 (subgroup topk kernel). Adds a **single workgroup topk kernel**, a SYCL translation of PyTorch CUDA's single-block radix select path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432 cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A 1024-thread workgroup processes one slice using `RADIX_BITS=4` radix select to find the k-th value, then gathers matching elements. Translated from PyTorch CUDA's single-block path. Output is unsorted (caller sorts if needed). Best for large dim (>= 4096).

**Updated dispatch logic:**

- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel

Also fixes NaN handling in `SortingRadixSelect.h` `TopKTypeConfig::convert` for half/float/double (NaN maps to max radix value).

Multi-block radix select (for very large slices across multiple workgroups) is planned as future work.
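The dispatch order described above can be written as a small decision function. This is a sketch only: the description does not say what counts as a "large batch", so `large_batch_threshold` is a hypothetical parameter, and `choose_topk_kernel` and the returned labels are illustrative names rather than identifiers from the PR.

```python
def choose_topk_kernel(dim, k, nsegments, large_batch_threshold):
    """Pick a kernel path following the order of rules in the PR description.

    large_batch_threshold is an assumption: the PR text only says "large batch".
    """
    if dim < 1024:
        return "original"
    if k <= 16 and nsegments >= large_batch_threshold:
        return "subgroup"            # PR1 path, sorted output
    if dim >= 4096:
        return "single_workgroup"    # this PR, unsorted output
    return "original"


# Illustrative threshold of 1024 slices, chosen only for this example.
for dim, k, bs in [(128, 4, 2048), (8192, 4, 2048), (131072, 16, 1), (1025, 32, 1)]:
    print(dim, k, bs, choose_topk_kernel(dim, k, bs, large_batch_threshold=1024))
```

Because the rules are checked in order, a large-batch case with `k <= 16` takes the subgroup path even when `dim >= 4096`; the single workgroup kernel only picks up what the subgroup rule leaves behind.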
### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel (from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch` declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert` |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output verified monotonic)

### Benchmark: incremental gain from this PR

Showing where single-wg kernel helps (large dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
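As a reading aid for the table columns, the sketch below recomputes the derived values for one row (bfloat16, bs=1024, dim=131072, k=4). The speedup columns are plain time ratios. The bandwidth formula is an assumption, not something the PR states: it counts the input read plus k values and k 8-byte indices written per slice, which happens to reproduce the reported figures for this row. The function names are hypothetical.

```python
PEAK_BW_GBS = 456.0  # B580 peak memory bandwidth quoted in the PR description


def speedup(baseline_us, new_us):
    """Ratio of baseline time to new time; values above 1 mean the new time is lower."""
    return baseline_us / new_us


def effective_bandwidth_gbs(bs, dim, k, elem_bytes, time_us, index_bytes=8):
    """Assumed traffic model: read the input once, write k values and k indices."""
    bytes_moved = bs * dim * elem_bytes + bs * k * (elem_bytes + index_bytes)
    return bytes_moved / (time_us * 1e-6) / 1e9


# Row: bfloat16, bs=1024, dim=131072, k=4
orig_us, combined_us, cuda_us = 11500.2, 1762.0, 1865.9
bw = effective_bandwidth_gbs(1024, 131072, 4, elem_bytes=2, time_us=combined_us)

print(round(speedup(orig_us, combined_us), 2))  # "vs orig" column
print(round(speedup(cuda_us, combined_us), 2))  # "vs CUDA" column
print(round(bw, 1), round(100 * bw / PEAK_BW_GBS))  # "BW (GB/s)" and "%peak"
```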
<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 3.07x | 0.41x | 15.3 | 3% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 | 234.1 | 3.93x | 0.47x | 67.7 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x | 3.1 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x | 3.0 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x | 3.0 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x | 2.8 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x | 2.8 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x | 2.9 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x | 0.41x | 15.2 | 3% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x | 0.45x | 16.7 | 4% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x | 0.42x | 15.8 | 3% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x | 0.27x | 10.0 | 2% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x | 0.28x | 10.5 | 2% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x | 0.28x | 10.5 | 2% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x | 4.6 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x | 4.1 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x | 12.7 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x | 12.7 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x | 11.9 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x | 12.3 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x | 11.9 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x | 11.5 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x | 0.57x | 58.7 | 13% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x | 0.55x | 57.8 | 13% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x | 0.58x | 59.4 | 13% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x | 0.37x | 38.1 | 8% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x | 0.40x | 40.9 | 9% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x | 0.38x | 38.7 | 8% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x | 10.7 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x | 0.34x | 19.3 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x | 0.33x | 18.6 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x | 0.32x | 17.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x | 0.31x | 17.8 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x | 0.30x | 17.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x | 0.29x | 16.4 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x | 0.49x | 76.2 | 17% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x | 0.48x | 75.7 | 17% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x | 0.48x | 74.8 | 16% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x | 0.31x | 47.7 | 10% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x | 0.30x | 46.8 | 10% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x | 0.29x | 45.6 | 10% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x | 4.6 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x | 5.9 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x | 4.5 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x | 6.0 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x | 44.1 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x | 45.0 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x | 1.27x | 45.4 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 97.5 | 64.2 | 1.90x | 0.66x | 22.0 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x | 0.70x | 81.0 | 18% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x | 0.69x | 80.5 | 18% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x | 0.69x | 79.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x | 0.31x | 36.1 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 | 5.87x | 1.20x | 174.4 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 | 6.42x | 1.21x | 174.4 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 | 7.04x | 1.20x | 174.0 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 | 2.83x | 0.60x | 84.2 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 | 3.10x | 0.60x | 84.0 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 | 3.39x | 0.60x | 83.5 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x | 0.27x | 5.7 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x | 0.25x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x | 0.30x | 5.7 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x | 1.02x | 53.7 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x | 0.49x | 23.7 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x | 0.57x | 84.1 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x | 0.25x | 36.3 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x | 0.25x | 36.5 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 | 5.90x | 1.19x | 177.1 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 | 6.50x | 1.19x | 177.4 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 | 2.84x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 | 3.12x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 | 3.42x | 0.59x | 85.1 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 32.0 | 15.2 | 1.07x | 0.47x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x | 0.7 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x | 0.41x | 3.4 | 1% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x | 0.48x | 4.0 | 1% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x | 0.43x | 3.5 | 1% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x | 0.37x | 3.0 | 1% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x | 0.34x | 2.8 | 1% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x | 0.27x | 2.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 
36.8 | 15.3 | 0.96x | 0.42x | 0.2 | 0% | | float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x | 0.9 | 0% | | float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x | 1.0 | 0% | | float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x | 0.9 | 0% | | float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x | 0.9 | 0% | | float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x | 0.9 | 0% | | float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x | 0.9 | 0% | | float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x | 4.7 | 1% | | float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x | 4.3 | 1% | | float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x | 4.7 | 1% | | float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x | 4.6 | 1% | | float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x | 4.5 | 1% | | float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x | 4.2 | 1% | | float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x | 0.47x | 26.2 | 6% | | float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x | 0.47x | 26.1 | 6% | | float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x | 0.44x | 24.1 | 5% | | float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x | 0.31x | 17.1 | 4% | | float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x | 0.30x | 16.7 | 4% | | float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x | 0.30x | 16.7 | 4% | | float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x | 1.0 | 0% | | float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x | 1.1 | 0% | | float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x | 1.2 | 0% | | float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x | 1.0 | 0% | | float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x | 1.1 | 0% | | float32 | 64 | 129 | 16 | 
38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x | 1.2 | 0% | | float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x | 6.4 | 1% | | float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x | 6.5 | 1% | | float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x | 6.4 | 1% | | float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x | 6.4 | 1% | | float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x | 6.3 | 1% | | float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x | 6.4 | 1% | | float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x | 19.6 | 4% | | float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x | 19.1 | 4% | | float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x | 0.56x | 18.0 | 4% | | float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x | 18.8 | 4% | | float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x | 18.6 | 4% | | float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x | 0.59x | 18.9 | 4% | | float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x | 0.47x | 71.4 | 16% | | float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x | 0.47x | 70.9 | 16% | | float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x | 0.46x | 68.9 | 15% | | float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x | 0.37x | 55.0 | 12% | | float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x | 0.35x | 52.8 | 12% | | float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x | 0.34x | 51.6 | 11% | | float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x | 3.1 | 1% | | float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x | 3.4 | 1% | | float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x | 3.9 | 1% | | float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x | 3.2 | 1% | | float32 | 256 | 129 | 8 | 45.6 | 
47.4 | 47.2 | 16.7 | 0.97x | 0.35x | 3.3 | 1% | | float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x | 3.6 | 1% | | float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x | 15.6 | 3% | | float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x | 15.4 | 3% | | float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x | 15.4 | 3% | | float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x | 13.0 | 3% | | float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x | 12.3 | 3% | | float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x | 12.3 | 3% | | float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x | 0.36x | 30.0 | 7% | | float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x | 0.35x | 28.8 | 6% | | float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x | 0.34x | 28.0 | 6% | | float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x | 0.34x | 28.3 | 6% | | float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x | 0.34x | 27.8 | 6% | | float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x | 0.32x | 26.8 | 6% | | float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x | 0.69x | 82.6 | 18% | | float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x | 0.69x | 81.6 | 18% | | float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 | 2.19x | 0.67x | 79.5 | 17% | | float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x | 0.54x | 62.7 | 14% | | float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x | 0.52x | 60.8 | 13% | | float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 | 1.65x | 0.51x | 59.3 | 13% | | float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x | 0.22x | 5.7 | 1% | | float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x | 0.21x | 5.9 | 1% | | float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x 
| 0.20x | 6.6 | 1% | | float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x | 0.24x | 5.6 | 1% | | float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x | 0.23x | 5.9 | 1% | | float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x | 0.22x | 6.6 | 1% | | float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x | 84.9 | 19% | | float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% | | float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x | 1.73x | 86.1 | 19% | | float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x | 46.2 | 10% | | float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x | 46.4 | 10% | | float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% | | float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x | 0.87x | 145.4 | 32% | | float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x | 0.87x | 145.5 | 32% | | float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x | 0.87x | 145.9 | 32% | | float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x | 0.43x | 72.2 | 16% | | float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x | 0.43x | 72.4 | 16% | | float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x | 0.44x | 72.5 | 16% | | float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 | 6.06x | 2.13x | 273.2 | 60% | | float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 | 6.47x | 2.13x | 272.9 | 60% | | float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 | 3.35x | 1.20x | 151.3 | 33% | | float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 | 3.60x | 1.20x | 151.2 | 33% | | float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 | 3.91x | 1.20x | 150.9 | 33% | | float32 | 2048 | 128 | 4 | 
170.5 | 170.2 | 171.1 | 30.2 | 1.00x | 0.18x | 6.7 | 1% | | float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x | 0.17x | 7.0 | 2% | | float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x | 0.17x | 8.0 | 2% | | float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x | 0.21x | 6.7 | 1% | | float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x | 0.20x | 7.1 | 2% | | float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x | 0.20x | 8.0 | 2% | | float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x | 1.44x | 99.3 | 22% | | float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x | 1.43x | 99.8 | 22% | | float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x | 1.42x | 100.9 | 22% | | float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% | | float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% | | float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x | 0.76x | 52.5 | 12% | | float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% | | float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x | 0.96x | 151.9 | 33% | | float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x | 0.96x | 152.1 | 33% | | float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x | 0.48x | 74.1 | 16% | | float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x | 0.48x | 74.0 | 16% | | float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x | 0.48x | 74.2 | 16% | | float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% | | float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 | 6.43x | 2.10x | 273.4 | 60% | | float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 | 6.98x | 2.10x | 273.6 | 60% | | float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 | 3.36x | 1.19x | 152.6 | 33% | | 
float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 | 3.60x | 1.20x | 152.7 | 33% | | float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 | 3.92x | 1.20x | 152.7 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Summary
Builds on #3371 (subgroup topk kernel). Adds a single workgroup topk kernel — SYCL translation of PyTorch CUDA's single-block radix select path.
Approach
**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A 1024-thread workgroup processes one slice using `RADIX_BITS=4` radix select to find the k-th value, then gathers matching elements. Translated from PyTorch CUDA's single-block path. Output is unsorted (caller sorts if needed). Best for large dim (>= 4096).

**Updated dispatch logic:**
- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)

Also fixes NaN handling in `SortingRadixSelect.h` `TopKTypeConfig::convert` for half/float/double (NaN maps to max radix value).

Multi-block radix select (for very large slices across multiple workgroups) is planned as future work.
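For readers unfamiliar with radix select, the following is a minimal sequential sketch in Python of the idea described above: convert each float to an order-preserving unsigned key (NaN mapped to the maximum value), narrow down the k-th largest key 4 bits at a time, then gather. It is an illustration only, not the SYCL kernel; the function names `to_radix`, `kth_largest_radix`, and `topk_unsorted` are hypothetical, and the float32 bit trick shown is the commonly used one, assumed here to match what `TopKTypeConfig::convert` does.

```python
import math
import struct

RADIX_BITS = 4                 # digit width named in the PR description
RADIX_SIZE = 1 << RADIX_BITS   # 16 buckets per pass
RADIX_MASK = RADIX_SIZE - 1


def to_radix(x: float) -> int:
    """Map a float32 value to a uint32 whose unsigned order matches float order.

    NaN maps to the maximum radix value, so it is selected first when largest=True.
    """
    if math.isnan(x):
        return 0xFFFFFFFF
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative values: flip every bit. Non-negative values: set the sign bit.
    return (bits ^ 0xFFFFFFFF) if (bits & 0x80000000) else (bits | 0x80000000)


def kth_largest_radix(keys: list[int], k: int) -> int:
    """Return the k-th largest key (k is 1-based), most significant digit first."""
    desired, desired_mask = 0, 0
    for shift in range(32 - RADIX_BITS, -1, -RADIX_BITS):
        # Histogram the current digit over keys that still match the chosen prefix.
        counts = [0] * RADIX_SIZE
        for key in keys:
            if (key & desired_mask) == desired:
                counts[(key >> shift) & RADIX_MASK] += 1
        # Walk buckets from the largest digit down until the k-th key falls inside.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if counts[digit] >= k:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            k -= counts[digit]
    return desired


def topk_unsorted(values: list[float], k: int) -> list[tuple[float, int]]:
    """Return k (value, index) pairs for the largest values, in no particular order."""
    keys = [to_radix(v) for v in values]
    kth = kth_largest_radix(keys, k)
    # Pass 1: everything strictly above the threshold belongs to the top-k.
    out = [(v, i) for i, (v, key) in enumerate(zip(values, keys)) if key > kth]
    # Pass 2: fill the remaining slots with elements equal to the threshold.
    for i, (v, key) in enumerate(zip(values, keys)):
        if len(out) == k:
            break
        if key == kth:
            out.append((v, i))
    return out


values = [0.5, float("nan"), -2.0, 3.0, 3.0, 1.5, -0.0, 7.25]
result = topk_unsorted(values, 3)
print(sorted(i for _, i in result))  # indices of NaN, one of the 3.0s, and 7.25
```

In the actual kernel the histogram and gather steps would be done cooperatively by the workgroup rather than in a Python loop; the sketch only shows why 8 passes of 4 bits suffice for a 32-bit key and why the output order is not sorted.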
Files changed
- `TensorTopKSingleWgKernel.cpp` (new)
- `TensorTopKSingleWgKernel.h` (new): `single_wg_topk_try_launch` declaration
- `TensorTopKSbtopkKernel.cpp`
- `TensorTopKSbtopkKernel.h`
- `SortingRadixSelect.h`: `TopKTypeConfig::convert`

Correctness
- `torch.topk(sorted=True)` output verified monotonic

Benchmark: incremental gain from this PR
Showing where single-wg kernel helps (large dim cases):
By dim (PR2 vs PR1-only):
Full 432-case results (combined PR1+PR2)
XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
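As a worked check of the bandwidth formula in the test methodology, the sketch below recomputes one table entry in Python. It assumes the 1965.1 us timing in the float32, bs=1024, dim=131072, k=4 row is the one the bandwidth column was derived from, and that the `+ 8` term accounts for 8-byte (int64) indices; the helper name `effective_bandwidth_gbps` is hypothetical.

```python
def effective_bandwidth_gbps(bs, dim, k, dtype_bytes, time_us):
    """Effective bandwidth in GB/s using the formula from the test methodology."""
    bytes_read = bs * dim * dtype_bytes            # every input element is read once
    bytes_written = bs * k * (dtype_bytes + 8)     # k values plus k 8-byte indices per slice
    return (bytes_read + bytes_written) / (time_us * 1e-6) / 1e9


# Peak for the B580 as stated: 192-bit bus x 19 Gbps GDDR6 = 192 / 8 * 19 GB/s.
PEAK_B580_GBPS = 192 / 8 * 19

bw = effective_bandwidth_gbps(bs=1024, dim=131072, k=4, dtype_bytes=4, time_us=1965.1)
print(f"{bw:.1f} GB/s, {100 * bw / PEAK_B580_GBPS:.0f}% of peak")
# prints "273.2 GB/s, 60% of peak"
```

The result agrees with the 273.2 GB/s and 60% reported for that row, which suggests the formula counts input reads and output writes only, not any intermediate traffic.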