Add subgroup topk kernel for XPU (part1 of #3369) #3371
Add an optimized topk kernel path where each 32-lane sub-group processes one slice entirely in registers via insertion sort + bitonic merge. Zero SLM, zero barriers. Output is already sorted. Constraints: k <= 16, large enough batch (nsegments >= HW_threads/4). Compile-time template dispatch on largest (direction) and IndexT (int32/int64). Kernel isolated in a separate translation unit to avoid SYCL compiler interference with the original kernel. 432/432 accuracy tests pass, 324/324 sortedness tests pass.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds an optimized XPU SYCL implementation of topk using a subgroup-based kernel path and integrates it into the existing XPU topk dispatch to skip unnecessary sorting when results are already sorted.
Changes:
- Added a new subgroup top-k kernel translation unit with launch heuristics (`k <= 16`, `dim >= 1024`, sufficiently large batch).
- Introduced a small public interface (`SbtopkResult`, `sbtopk_try_launch`) to attempt the optimized path.
- Updated the existing `topk` caller to try the subgroup kernel first and conditionally skip sorting.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h | Declares SbtopkResult and the optimized-kernel try-launch API. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp | Implements subgroup-in-register top-k kernel and dispatch heuristics. |
| src/ATen/native/xpu/sycl/TensorTopKKernel.cpp | Calls sbtopk_try_launch first and skips sort when already sorted. |
- Add kernel properties (sub_group_size<32>, grf_size<128>) to launch for explicit sub-group size and smaller GRF (better occupancy)
- Fix std::numeric_limits<scalar_t>::infinity() for integer dtypes: use lowest()/max() when has_infinity is false
- Add #include <limits>
- Clarify insert() idx param is within-slice (int, bounded by sliceSize)
- Shorten header comment for sbtopk_try_launch
- Fix TensorTopKKernel.cpp comment (remove single-wg kernel reference)

432/432 accuracy, 324/324 sortedness pass.
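The integer-dtype sentinel fix in this commit can be sketched in plain C++ as follows. This is a minimal illustration, not the kernel's code: the helper name `topk_sentinel` and its template parameters are hypothetical, and it only shows the idea of falling back to `lowest()`/`max()` when `has_infinity` is false.

```cpp
#include <limits>

// Hypothetical helper (not from the PR): the "worst possible" value used to
// pre-fill a top-k buffer, so that any real input compares better than it.
template <typename T, bool Largest>
constexpr T topk_sentinel() {
  if constexpr (std::numeric_limits<T>::has_infinity) {
    // Floating-point types: +/- infinity is a valid sentinel.
    return Largest ? -std::numeric_limits<T>::infinity()
                   : std::numeric_limits<T>::infinity();
  } else {
    // Integer types: infinity() returns 0 here, so use the range limits.
    return Largest ? std::numeric_limits<T>::lowest()
                   : std::numeric_limits<T>::max();
  }
}
```

For `largest=true` the buffer starts at the smallest representable value, and for `largest=false` at the largest one, so the first real elements always displace the sentinel.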
…ent, drop TORCH_XPU_API
Performance outliers, please check!
- insert(): add count-aware stop condition so input values equal to the sentinel (e.g. all -inf for largest=true) fill the buffer correctly instead of repeatedly overwriting position K-1
- Add alignas(alignof(LoadT)) on local vectorized-load array
- Add pointer alignment check in vec dispatch to safely fall back to scalar loads when input has a non-aligned storage offset
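A scalar sketch of a count-aware insertion step is shown below, to make the sentinel-equal case concrete. It is an assumption-laden illustration: the `TopKBuf` struct, its members, and the loop-based shift are hypothetical (the kernel described in this PR is fully unrolled and runs per lane), and only the `largest=true` direction is shown.

```cpp
#include <array>
#include <limits>

// Hypothetical per-lane buffer: vals[0..count) is sorted descending.
template <int K>
struct TopKBuf {
  std::array<float, K> vals;
  std::array<int, K> idxs;
  int count = 0;  // number of real (non-sentinel) entries

  TopKBuf() {
    vals.fill(-std::numeric_limits<float>::infinity());
    idxs.fill(-1);
  }

  void insert(float v, int idx) {
    // Once the buffer is full, reject values that do not beat the worst entry.
    if (count == K && !(v > vals[K - 1])) return;
    // While not full, append at `count` so values equal to the sentinel
    // still occupy distinct slots instead of overwriting position K-1.
    int pos = (count < K) ? count : K - 1;
    while (pos > 0 && v > vals[pos - 1]) {
      vals[pos] = vals[pos - 1];
      idxs[pos] = idxs[pos - 1];
      --pos;
    }
    vals[pos] = v;
    idxs[pos] = idx;
    if (count < K) ++count;
  }
};
```

With an input of all `-inf` values, the first K insertions land in slots 0 through K-1 with their own indices, and later ones are rejected.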
Benchmarks show subgroup kernel is 2-4x faster than original even for small dims (32-512) when batch size is large. The previous dim>=1024 guard was overly conservative. The only hard requirement is dim>=SG_SIZE (32) so each lane gets at least one element.
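The relaxed guard can be written as a small predicate. This is a sketch under stated assumptions: the function name and the `hw_thread_slots` parameter are hypothetical stand-ins, and how the real code obtains the hardware thread count is not shown in this conversation. The three conditions themselves (`k <= 16`, batch at least a quarter of the hardware thread slots, `dim >= 32`) are the ones stated in this PR.

```cpp
#include <cstdint>

constexpr std::int64_t SG_SIZE = 32;  // lanes per sub-group
constexpr std::int64_t MAX_K = 16;    // largest k the subgroup kernel handles

// Hypothetical predicate: true when the subgroup kernel path should be tried.
inline bool should_use_subgroup_topk(std::int64_t k,
                                     std::int64_t dim,
                                     std::int64_t nsegments,
                                     std::int64_t hw_thread_slots) {
  return k <= MAX_K &&                       // buffer fits in registers
         dim >= SG_SIZE &&                   // each lane gets at least one element
         nsegments >= hw_thread_slots / 4;   // enough slices to fill the device
}
```

The value passed for `hw_thread_slots` in any call is device-specific; the number used in a test or example is only a placeholder.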
Select K from {1, 2, 4, 8, 16} based on runtime k (round up to
next power of two). Smaller K means fewer unrolled iterations in
insert/merge/shuffle loops, dramatically reducing register pressure.
K<=8 eliminates all register spills on B580 (GRF 128) across fp32,
fp16, and bf16. For k=4 this gives 3-11x speedup over the previous
fixed K=16 path; k=16 takes the same K=16 template as before (no
regression).
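The runtime-to-template selection described above could look roughly like this. It is a host-only sketch: `pow2_ceil` is a portable stand-in for a power-of-two ceiling, and `run_topk_variant` is a hypothetical placeholder for launching the kernel instantiated with a given K (here it just returns K so the choice is observable).

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint32_t MAX_K = 16;

// Smallest power of two that is >= v (for v >= 1).
constexpr std::uint32_t pow2_ceil(std::uint32_t v) {
  std::uint32_t p = 1;
  while (p < v) p <<= 1;
  return p;
}

// Placeholder for "launch the kernel compiled with buffer size K".
template <int K>
int run_topk_variant() { return K; }

// Map runtime k (assumed 1 <= k <= 16) to one of the K in {1, 2, 4, 8, 16}.
inline int dispatch_by_k(int k) {
  const std::uint32_t K =
      std::min(pow2_ceil(static_cast<std::uint32_t>(k)), MAX_K);
  switch (K) {
    case 1:  return run_topk_variant<1>();
    case 2:  return run_topk_variant<2>();
    case 4:  return run_topk_variant<4>();
    case 8:  return run_topk_variant<8>();
    case 16: return run_topk_variant<16>();
    default: return -1;  // unreachable for valid k
  }
}
```

Because K is a compile-time constant in each branch, loops over the buffer can be fully unrolled per instantiation, which is where the reduced register pressure for small k comes from.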
Benchmark: Subgroup TopK on Intel Arc B580 (sections: Setup, Summary, Full Results as a 156-case table, Notes)
- Replace K_sel if-else chain with c10::llvm::PowerOf2Ceil + std::min
- Replace q.submit with sycl_kernel_submit + kernel properties
- Add sycl_kernel_submit overloads accepting properties to SYCLHelpers.h
- Simplify SBTOPK_DISPATCH_INDEX: only check numSlices <= INT_MAX (IndexT is only used for slice indices, not cross-slice global indices)
- Add SG_MERGE_LEVELS constexpr + static_assert, replace magic number 5
- Refactor vec dispatch with can_use_vec lambda
- Update IndexT comment to reflect simplified dispatch condition
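A vectorized-load eligibility check in the spirit of `can_use_vec` might look like the sketch below. The exact conditions the PR's lambda tests are not shown in this conversation, so both functions are hypothetical: the pointer-alignment test follows the earlier commit note about non-aligned storage offsets, and the divisibility condition on the slice size is an added assumption so that every contiguous slice starts on an aligned address.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical: true when `ptr` is aligned to a vector of `vec_elems` Ts.
template <typename T>
bool is_vec_aligned(const T* ptr, std::size_t vec_elems) {
  const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  return addr % (vec_elems * sizeof(T)) == 0;
}

// Hypothetical eligibility test; callers fall back to scalar loads on false.
template <typename T>
bool can_use_vec(const T* ptr, std::int64_t slice_size, int vec_elems) {
  return is_vec_aligned(ptr, static_cast<std::size_t>(vec_elems)) &&
         slice_size % vec_elems == 0;  // assumption: keeps later slices aligned
}
```

A tensor view with a storage offset of one element fails the first check even when its base allocation is aligned, which is the case the alignment fix targets.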
CuiYifeng
left a comment
The main part of this PR LGTM.
e2e failure is known issue #3455.
/merge -f "Failed unit tests are irrelevant to this PR"
## Summary

- **Speedup vs original XPU:** 1.3648x geomean over 432 cases, 130 wins (>1.05x), 40 regressions (<0.98x)
- **vs CUDA 4080S:** 0.4574x geomean (>1 means XPU faster)

### Approach

Add a **subgroup topk kernel** (`SubgroupTopKFunctor` in `TensorTopKSbtopkKernel.cpp`) where each 32-lane sub-group processes one slice entirely in registers:

- **Phase 1:** Each lane scans `dim/32` elements, maintaining a sorted top-k buffer via insertion sort (fully unrolled).
- **Phase 2:** 5-level bitonic merge across sub-group lanes via `sycl::select_from_group` shuffle.
- **Phase 3:** Lane 0 writes `k` results. Output is already sorted.

Key properties:

- Zero SLM (shared local memory), zero barriers
- `largest` as compile-time template parameter (eliminates per-element direction branches)
- `int32`/`int64` index dispatch mirroring CUDA's `canUse32BitIndexMath`
- Kernel isolated in a separate translation unit to prevent SYCL compiler global optimization interference with the original kernel

**Dispatch:** `k <= 16` and `nsegments >= HW_thread_slots / 4` and `dim >= 32` → subgroup kernel (SORTED); otherwise → original kernel.
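The cross-lane reduction in Phase 2 can be simulated on the host to see why five levels suffice for 32 lanes. Note the swap: this sketch uses a simple XOR-partner (butterfly) exchange with `std::merge`, not the bitonic merge network the kernel uses, and plain arrays stand in for the `sycl::select_from_group` shuffle. All names except `SG_MERGE_LEVELS` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <functional>

constexpr int SG_SIZE = 32;
constexpr int SG_MERGE_LEVELS = 5;
static_assert((1 << SG_MERGE_LEVELS) == SG_SIZE,
              "merge levels must equal log2(sub-group size)");

template <int K>
using LaneBuf = std::array<float, K>;  // per-lane top-k, sorted descending

// Merge two sorted buffers and keep only the K largest values.
template <int K>
LaneBuf<K> merge_keep_top(const LaneBuf<K>& a, const LaneBuf<K>& b) {
  std::array<float, 2 * K> tmp;
  std::merge(a.begin(), a.end(), b.begin(), b.end(), tmp.begin(),
             std::greater<float>());
  LaneBuf<K> out;
  std::copy(tmp.begin(), tmp.begin() + K, out.begin());
  return out;
}

// Host simulation: at each level, lane i combines with lane i ^ (1 << level).
template <int K>
LaneBuf<K> subgroup_merge(std::array<LaneBuf<K>, SG_SIZE> lanes) {
  for (int level = 0; level < SG_MERGE_LEVELS; ++level) {
    const int offset = 1 << level;
    std::array<LaneBuf<K>, SG_SIZE> next;
    for (int lane = 0; lane < SG_SIZE; ++lane) {
      // On the device, the partner's values would arrive via a shuffle.
      next[lane] = merge_keep_top<K>(lanes[lane], lanes[lane ^ offset]);
    }
    lanes = next;
  }
  return lanes[0];  // lane 0 now holds the slice-wide top-k
}

// Demo: values 0..127 interleaved across lanes (lane i holds i, i+32, ...).
inline LaneBuf<4> demo_merge() {
  std::array<LaneBuf<4>, SG_SIZE> lanes;
  for (int lane = 0; lane < SG_SIZE; ++lane)
    for (int j = 0; j < 4; ++j)
      lanes[lane][j] = static_cast<float>(lane + SG_SIZE * (3 - j));
  return subgroup_merge<4>(lanes);
}
```

After level `l`, each lane holds the top-k of a disjoint group of `2^(l+1)` lanes, so after five levels lane 0 covers all 32 lanes and the result is already in sorted order.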
### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSbtopkKernel.cpp` (new) | Subgroup topk kernel + dispatch logic |
| `TensorTopKSbtopkKernel.h` (new) | `SbtopkResult` enum + `sbtopk_try_launch` declaration |
| `TensorTopKKernel.cpp` | Modified caller: tries optimized path first, skips sort if already sorted |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output verified monotonic)

### Benchmark summary

**By batch size:**

| bs | speedup vs orig | vs CUDA 4080S | cases |
|----|:-:|:-:|:-:|
| 1 | 1.00x | 0.41x | 72 |
| 8 | 1.00x | 0.43x | 72 |
| 64 | 1.00x | 0.42x | 72 |
| 256 | 1.00x | 0.36x | 72 |
| 1024 | 2.53x | 0.63x | 72 |
| 2048 | 2.55x | 0.55x | 72 |

**By dim:**

| dim | speedup vs orig | vs CUDA 4080S | cases |
|-----|:-:|:-:|:-:|
| 128 | 1.00x | 0.36x | 54 |
| 129 | 1.00x | 0.39x | 54 |
| 1024 | 1.47x | 0.77x | 54 |
| 1025 | 1.35x | 0.63x | 54 |
| 8192 | 1.62x | 0.60x | 54 |
| 8193 | 1.30x | 0.48x | 54 |
| 131072 | 1.87x | 0.34x | 54 |
| 131073 | 1.53x | 0.28x | 54 |

### Full 432-case results

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
<details> <summary>Click to expand full table</summary> | dtype | bs | dim | k | XPU orig (us) | XPU opt (us) | CUDA 4080S (us) | speedup | vs CUDA | BW (GB/s) | %peak | |-------|---:|----:|--:|--------------:|------------:|----------------:|--------:|--------:|----------:|------:| | bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 14.7 | 0.99x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 14.6 | 1.00x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 19.0 | 1.00x | 0.62x | 0.1 | 0% | | bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 18.3 | 1.00x | 0.60x | 0.1 | 0% | | bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 19.8 | 1.00x | 0.65x | 0.1 | 0% | | bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 37.4 | 1.03x | 0.84x | 0.4 | 0% | | bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.2 | 1.06x | 0.87x | 0.3 | 0% | | bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 39.1 | 1.00x | 0.80x | 0.3 | 0% | | bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 37.0 | 0.94x | 0.76x | 0.3 | 0% | | bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 40.3 | 1.00x | 0.83x | 0.3 | 0% | | bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 39.7 | 1.00x | 0.82x | 0.3 | 0% | | bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 46.3 | 0.98x | 0.12x | 0.7 | 0% | | bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 46.3 | 0.98x | 0.12x | 0.7 | 0% | | bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 46.4 | 1.01x | 0.11x | 0.6 | 0% | | bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 46.8 | 1.02x | 0.13x | 0.7 | 0% | | bfloat16 | 1 | 131073 | 8 
| 392.5 | 396.7 | 46.8 | 0.99x | 0.12x | 0.7 | 0% | | bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 46.7 | 0.98x | 0.11x | 0.6 | 0% | | bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 15.1 | 1.00x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 19.3 | 1.00x | 0.63x | 0.5 | 0% | | bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 20.5 | 1.00x | 0.67x | 0.5 | 0% | | bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 42.2 | 1.06x | 0.82x | 2.5 | 1% | | bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 39.9 | 0.95x | 0.73x | 2.4 | 1% | | bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 42.4 | 1.01x | 0.78x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 43.3 | 1.00x | 0.79x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 43.5 | 1.00x | 0.80x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 42.7 | 1.12x | 0.88x | 2.7 | 1% | | bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 56.8 | 0.98x | 0.14x | 5.3 | 1% | | bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 56.5 | 1.06x | 0.14x | 5.3 | 1% | | bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 56.7 | 0.99x | 0.13x | 4.8 | 1% | | bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 56.8 | 0.97x | 0.14x | 5.2 | 1% | | bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 57.1 | 1.00x | 0.14x | 5.2 | 1% | | bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 56.9 | 1.01x | 0.13x | 4.8 | 1% | | bfloat16 | 64 
| 128 | 4 | 30.5 | 30.5 | 14.9 | 1.00x | 0.49x | 0.6 | 0% | | bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 14.7 | 1.00x | 0.48x | 0.7 | 0% | | bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 14.8 | 1.01x | 0.49x | 0.9 | 0% | | bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 15.4 | 1.01x | 0.51x | 0.6 | 0% | | bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 15.5 | 1.00x | 0.51x | 0.7 | 0% | | bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 15.2 | 1.01x | 0.50x | 0.9 | 0% | | bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 19.5 | 1.00x | 0.64x | 4.4 | 1% | | bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 19.5 | 1.00x | 0.64x | 4.5 | 1% | | bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 19.5 | 1.00x | 0.64x | 4.6 | 1% | | bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% | | bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 20.6 | 1.00x | 0.61x | 4.1 | 1% | | bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% | | bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 49.9 | 1.01x | 0.54x | 11.4 | 3% | | bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 49.5 | 1.01x | 0.51x | 10.9 | 2% | | bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 49.6 | 1.00x | 0.49x | 10.5 | 2% | | bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 49.8 | 1.07x | 0.55x | 11.7 | 3% | | bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 49.6 | 1.02x | 0.52x | 10.9 | 2% | | bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 49.7 | 1.00x | 0.50x | 10.6 | 2% | | bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 162.9 | 1.01x | 0.18x | 18.9 | 4% | | bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 164.6 | 0.99x | 0.17x | 17.7 | 4% | | bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 164.4 | 1.01x | 0.17x | 16.9 | 4% | | bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 166.8 | 1.01x | 0.19x | 18.9 | 4% | | bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 166.8 | 1.00x | 0.18x | 17.8 | 4% | | bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 165.5 | 1.00x | 0.17x | 16.8 | 4% | | bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 15.7 | 1.00x | 0.47x | 2.2 | 0% | | bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 
| 15.6 | 1.01x | 0.46x | 2.6 | 1% | | bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% | | bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% | | bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% | | bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% | | bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 41.7 | 1.00x | 0.74x | 9.5 | 2% | | bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% | | bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 42.6 | 1.00x | 0.72x | 9.5 | 2% | | bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 45.9 | 0.98x | 0.63x | 7.4 | 2% | | bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 46.7 | 1.01x | 0.63x | 7.4 | 2% | | bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 47.1 | 1.00x | 0.62x | 7.5 | 2% | | bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 75.2 | 0.99x | 0.29x | 15.9 | 3% | | bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 75.0 | 1.00x | 0.28x | 15.6 | 3% | | bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 75.2 | 0.99x | 0.26x | 14.6 | 3% | | bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 75.1 | 0.97x | 0.28x | 15.7 | 3% | | bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 75.6 | 1.00x | 0.28x | 15.4 | 3% | | bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 75.7 | 1.00x | 0.26x | 14.7 | 3% | | bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 439.2 | 1.00x | 0.14x | 21.7 | 5% | | bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 436.9 | 1.00x | 0.13x | 20.5 | 5% | | bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 440.9 | 1.00x | 0.13x | 19.4 | 4% | | bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 441.5 | 1.00x | 0.14x | 21.7 | 5% | | bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 435.4 | 1.00x | 0.13x | 20.5 | 5% | | bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 443.1 | 1.00x | 0.13x | 19.3 | 4% | | bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 22.1 | 1.02x | 0.32x | 4.4 | 1% | | bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 22.0 | 1.00x | 0.29x | 4.6 | 1% | | bfloat16 | 1024 | 
128 | 16 | 76.9 | 76.7 | 22.3 | 1.00x | 0.29x | 5.6 | 1% | | bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 24.4 | 1.02x | 0.35x | 4.4 | 1% | | bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 24.4 | 1.00x | 0.32x | 4.6 | 1% | | bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 24.5 | 1.00x | 0.32x | 5.6 | 1% | | bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 63.1 | 2.72x | 1.12x | 38.0 | 8% | | bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 63.3 | 2.78x | 1.13x | 38.8 | 9% | | bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 63.4 | 2.73x | 1.10x | 39.3 | 9% | | bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 64.5 | 2.54x | 0.75x | 24.9 | 5% | | bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 64.7 | 2.58x | 0.75x | 25.1 | 6% | | bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 64.8 | 2.59x | 0.74x | 25.9 | 6% | | bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 147.6 | 3.79x | 0.60x | 67.8 | 15% | | bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 147.4 | 3.95x | 0.59x | 67.6 | 15% | | bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 148.0 | 4.12x | 0.59x | 67.4 | 15% | | bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 149.2 | 2.32x | 0.37x | 41.4 | 9% | | bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 148.4 | 2.43x | 0.36x | 41.4 | 9% | | bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 149.3 | 2.56x | 0.37x | 41.6 | 9% | | bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1865.9 | 6.52x | 1.06x | 152.3 | 33% | | bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1867.4 | 6.92x | 1.06x | 152.3 | 33% | | bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1863.0 | 7.28x | 1.05x | 152.0 | 33% | | bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 1940.1 | 3.84x | 0.65x | 89.5 | 20% | | bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 1944.4 | 4.28x | 0.65x | 89.5 | 20% | | bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 30.5 | 1.00x | 0.27x | 5.3 | 1% | | bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 30.5 | 1.00x | 0.25x | 5.7 | 1% | | 
bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 30.9 | 1.00x | 0.25x | 6.9 | 2% | | bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 35.4 | 1.00x | 0.31x | 5.4 | 1% | | bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% | | bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 35.7 | 1.00x | 0.29x | 7.0 | 2% | | bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 85.7 | 2.87x | 0.89x | 44.4 | 10% | | bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 86.0 | 2.92x | 0.88x | 44.7 | 10% | | bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% | | bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% | | bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 88.7 | 2.67x | 0.56x | 27.5 | 6% | | bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 89.0 | 2.68x | 0.56x | 28.3 | 6% | | bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 234.9 | 3.78x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 234.1 | 3.94x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 235.0 | 4.13x | 0.47x | 67.9 | 15% | | bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 236.2 | 2.27x | 0.29x | 40.8 | 9% | | bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 237.3 | 2.38x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 237.4 | 2.50x | 0.29x | 41.0 | 9% | | bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3646.5 | 6.57x | 1.05x | 154.1 | 34% | | bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3644.1 | 6.94x | 1.05x | 154.1 | 34% | | bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3646.2 | 7.34x | 1.05x | 154.0 | 34% | | bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 3780.1 | 4.08x | 0.64x | 90.5 | 20% | | bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 3773.0 | 4.31x | 0.64x | 90.5 | 20% | | float16 | 1 | 128 | 4 | 30.7 | 30.7 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 128 | 8 | 30.6 | 30.6 | 14.0 | 1.00x | 
0.46x | 0.0 | 0% | | float16 | 1 | 128 | 16 | 30.5 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 129 | 4 | 30.6 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 8 | 30.6 | 30.3 | 14.4 | 1.01x | 0.48x | 0.0 | 0% | | float16 | 1 | 129 | 16 | 30.5 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% | | float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 17.4 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 17.8 | 1.00x | 0.58x | 0.1 | 0% | | float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% | | float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 20.1 | 1.00x | 0.66x | 0.1 | 0% | | float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 33.6 | 1.08x | 0.88x | 0.4 | 0% | | float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 33.8 | 0.85x | 0.70x | 0.3 | 0% | | float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 31.5 | 0.94x | 0.65x | 0.3 | 0% | | float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 37.4 | 1.11x | 0.91x | 0.4 | 0% | | float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 36.9 | 0.97x | 0.84x | 0.4 | 0% | | float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 33.3 | 0.89x | 0.65x | 0.3 | 0% | | float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 46.2 | 0.98x | 0.15x | 0.9 | 0% | | float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 46.5 | 0.97x | 0.14x | 0.8 | 0% | | float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 46.1 | 0.98x | 0.13x | 0.7 | 0% | | float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 46.9 | 1.08x | 0.16x | 0.9 | 0% | | float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 46.8 | 0.99x | 0.14x | 0.8 | 0% | | float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 46.6 | 0.93x | 0.12x | 0.7 | 0% | | float16 | 8 | 128 | 4 | 30.5 | 30.2 | 14.4 | 1.01x | 0.48x | 0.1 | 0% | | float16 | 8 | 128 | 8 | 30.4 | 30.2 | 14.5 | 1.01x | 0.48x | 0.1 | 0% | | float16 | 8 | 128 | 16 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% | | float16 | 8 | 129 | 4 | 30.5 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% | | 
float16 | 8 | 129 | 8 | 30.3 | 30.2 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | float16 | 8 | 129 | 16 | 30.5 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 19.1 | 1.01x | 0.63x | 0.5 | 0% | | float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% | | float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 19.3 | 1.00x | 0.64x | 0.6 | 0% | | float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 20.4 | 1.01x | 0.67x | 0.6 | 0% | | float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 20.5 | 1.01x | 0.68x | 0.6 | 0% | | float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 37.9 | 1.00x | 0.83x | 2.9 | 1% | | float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 39.8 | 1.00x | 0.82x | 2.7 | 1% | | float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 41.7 | 0.94x | 0.81x | 2.6 | 1% | | float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 39.2 | 1.07x | 0.86x | 2.9 | 1% | | float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 40.7 | 0.94x | 0.84x | 2.7 | 1% | | float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 43.0 | 1.05x | 0.83x | 2.6 | 1% | | float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 56.0 | 0.93x | 0.17x | 6.3 | 1% | | float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 56.1 | 0.95x | 0.16x | 5.9 | 1% | | float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 56.3 | 1.01x | 0.15x | 5.4 | 1% | | float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 56.2 | 1.00x | 0.18x | 6.7 | 1% | | float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 56.3 | 0.99x | 0.16x | 6.1 | 1% | | float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 56.3 | 0.97x | 0.15x | 5.6 | 1% | | float16 | 64 | 128 | 4 | 30.5 | 30.1 | 14.9 | 1.01x | 0.50x | 0.6 | 0% | | float16 | 64 | 128 | 8 | 30.5 | 30.2 | 14.7 | 1.01x | 0.49x | 0.7 | 0% | | float16 | 64 | 128 | 16 | 30.4 | 30.2 | 14.7 | 1.01x | 0.49x | 0.9 | 0% | | float16 | 64 | 129 | 4 | 30.6 | 30.2 | 15.3 | 1.01x | 0.51x | 0.6 | 0% | | float16 | 64 | 129 | 8 | 30.6 | 30.2 | 15.2 | 1.01x | 0.50x | 0.7 | 0% | | float16 | 64 | 129 | 16 | 30.5 | 30.2 | 15.1 | 1.01x | 0.50x | 0.9 | 0% | | float16 | 64 | 
1024 | 4 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 4.4 | 1% | | float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% | | float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 19.4 | 1.00x | 0.64x | 4.7 | 1% | | float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 19.7 | 1.01x | 0.62x | 4.2 | 1% | | float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 20.4 | 1.00x | 0.64x | 4.2 | 1% | | float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% | | float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 49.4 | 0.97x | 0.59x | 12.5 | 3% | | float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 49.2 | 0.99x | 0.58x | 12.5 | 3% | | float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 49.2 | 0.98x | 0.54x | 11.7 | 3% | | float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 49.4 | 1.01x | 0.62x | 13.1 | 3% | | float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 49.4 | 1.04x | 0.59x | 12.5 | 3% | | float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 49.4 | 1.02x | 0.56x | 11.9 | 3% | | float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 162.1 | 1.04x | 0.22x | 23.2 | 5% | | float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 160.5 | 1.01x | 0.21x | 21.5 | 5% | | float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 162.4 | 0.98x | 0.19x | 19.4 | 4% | | float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 161.6 | 1.00x | 0.23x | 23.7 | 5% | | float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 163.9 | 1.01x | 0.21x | 21.6 | 5% | | float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 162.9 | 1.01x | 0.19x | 19.6 | 4% | | float16 | 256 | 128 | 4 | 33.7 | 33.6 | 15.5 | 1.00x | 0.46x | 2.3 | 0% | | float16 | 256 | 128 | 8 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% | | float16 | 256 | 128 | 16 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 3.2 | 1% | | float16 | 256 | 129 | 4 | 33.7 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% | | float16 | 256 | 129 | 8 | 33.7 | 33.5 | 15.9 | 1.01x | 0.47x | 2.6 | 1% | | float16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% | | float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 37.9 | 1.00x | 0.75x | 10.5 | 2% | | float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 38.8 | 1.00x | 0.73x 
| 10.3 | 2% | | float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 39.9 | 0.98x | 0.71x | 10.1 | 2% | | float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 42.0 | 1.00x | 0.66x | 8.4 | 2% | | float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 43.1 | 0.97x | 0.65x | 8.2 | 2% | | float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 43.8 | 1.02x | 0.65x | 8.3 | 2% | | float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 74.1 | 0.99x | 0.33x | 19.0 | 4% | | float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 74.4 | 1.00x | 0.32x | 18.0 | 4% | | float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 74.7 | 0.99x | 0.30x | 16.9 | 4% | | float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 74.3 | 0.99x | 0.34x | 19.1 | 4% | | float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 74.8 | 1.01x | 0.32x | 18.1 | 4% | | float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 74.9 | 0.98x | 0.29x | 16.5 | 4% | | float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 428.9 | 0.99x | 0.18x | 27.7 | 6% | | float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 427.9 | 1.00x | 0.16x | 25.2 | 6% | | float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 428.2 | 1.00x | 0.15x | 22.9 | 5% | | float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 431.9 | 0.98x | 0.18x | 27.5 | 6% | | float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 429.8 | 0.99x | 0.16x | 25.1 | 5% | | float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 432.2 | 1.00x | 0.15x | 22.8 | 5% | | float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 20.9 | 1.00x | 0.31x | 4.5 | 1% | | float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 20.9 | 1.01x | 0.30x | 4.9 | 1% | | float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 21.4 | 1.00x | 0.30x | 6.0 | 1% | | float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 23.3 | 1.00x | 0.35x | 4.6 | 1% | | float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 23.1 | 1.01x | 0.33x | 4.9 | 1% | | float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 23.4 | 0.98x | 0.32x | 5.9 | 1% | | float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 62.7 | 2.74x | 1.30x | 44.2 | 10% | | float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 63.0 | 2.80x | 1.29x | 44.7 | 10% | | float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 63.1 | 
2.89x | 1.27x | 45.5 | 10% | | float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 64.2 | 1.89x | 0.66x | 21.9 | 5% | | float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 64.4 | 1.97x | 0.66x | 22.3 | 5% | | float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 64.5 | 2.08x | 0.65x | 22.9 | 5% | | float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 145.0 | 3.99x | 0.73x | 84.6 | 19% | | float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 144.6 | 4.22x | 0.73x | 84.7 | 19% | | float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 145.5 | 4.50x | 0.72x | 83.9 | 18% | | float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 146.1 | 1.75x | 0.32x | 36.9 | 8% | | float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 146.5 | 1.83x | 0.32x | 36.9 | 8% | | float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 146.2 | 1.98x | 0.32x | 36.8 | 8% | | float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1846.9 | 5.88x | 1.20x | 174.8 | 38% | | float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1856.1 | 6.41x | 1.20x | 174.1 | 38% | | float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1858.5 | 7.06x | 1.21x | 174.6 | 38% | | float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 1924.0 | 2.82x | 0.60x | 84.1 | 18% | | float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 1921.5 | 3.12x | 0.60x | 84.3 | 18% | | float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 1926.4 | 3.42x | 0.60x | 84.3 | 18% | | float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 28.3 | 0.99x | 0.26x | 5.6 | 1% | | float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 28.5 | 1.00x | 0.25x | 6.1 | 1% | | float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 29.2 | 1.01x | 0.26x | 7.4 | 2% | | float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 32.6 | 0.99x | 0.30x | 5.6 | 1% | | float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 32.7 | 1.00x | 0.29x | 6.2 | 1% | | float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 33.5 | 1.00x | 0.29x | 7.4 | 2% | | float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% | | float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% | | float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 86.0 | 3.08x 
| 1.02x | 53.6 | 12% | | float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 87.7 | 1.92x | 0.49x | 23.8 | 5% | | float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 88.0 | 1.98x | 0.49x | 24.1 | 5% | | float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% | | float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 228.7 | 3.94x | 0.57x | 84.2 | 18% | | float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 228.5 | 4.16x | 0.57x | 84.3 | 18% | | float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 230.5 | 4.51x | 0.57x | 84.5 | 19% | | float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 231.7 | 1.70x | 0.25x | 36.4 | 8% | | float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 231.6 | 1.81x | 0.25x | 36.4 | 8% | | float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 233.1 | 1.95x | 0.25x | 36.4 | 8% | | float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3622.2 | 5.90x | 1.19x | 176.9 | 39% | | float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3607.3 | 6.49x | 1.19x | 177.3 | 39% | | float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% | | float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 3743.3 | 2.84x | 0.59x | 85.2 | 19% | | float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 3747.1 | 3.12x | 0.59x | 85.1 | 19% | | float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 3749.5 | 3.43x | 0.59x | 85.2 | 19% | | float32 | 1 | 128 | 4 | 31.2 | 31.4 | 14.5 | 0.99x | 0.46x | 0.0 | 0% | | float32 | 1 | 128 | 8 | 34.0 | 34.4 | 14.3 | 0.99x | 0.42x | 0.0 | 0% | | float32 | 1 | 128 | 16 | 32.4 | 34.4 | 14.0 | 0.94x | 0.41x | 0.0 | 0% | | float32 | 1 | 129 | 4 | 34.1 | 34.4 | 14.4 | 0.99x | 0.42x | 0.0 | 0% | | float32 | 1 | 129 | 8 | 34.0 | 32.7 | 14.4 | 1.04x | 0.44x | 0.0 | 0% | | float32 | 1 | 129 | 16 | 34.1 | 34.3 | 15.2 | 0.99x | 0.44x | 0.0 | 0% | | float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 17.8 | 1.08x | 0.54x | 0.1 | 0% | | float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 22.2 | 0.99x | 0.62x | 0.1 | 0% | | float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 19.1 | 0.99x | 0.54x | 0.1 | 0% | | float32 | 1 | 
1025 | 4 | 35.3 | 35.9 | 18.8 | 0.98x | 0.52x | 0.1 | 0% | | float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 19.7 | 1.08x | 0.55x | 0.1 | 0% | | float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 19.6 | 0.99x | 0.55x | 0.1 | 0% | | float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 39.6 | 1.07x | 0.77x | 0.6 | 0% | | float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 38.0 | 1.16x | 0.69x | 0.6 | 0% | | float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 38.7 | 0.94x | 0.67x | 0.6 | 0% | | float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 34.1 | 0.99x | 0.66x | 0.6 | 0% | | float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 41.6 | 1.03x | 0.76x | 0.6 | 0% | | float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 39.8 | 1.04x | 0.69x | 0.6 | 0% | | float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 63.3 | 1.04x | 0.16x | 1.3 | 0% | | float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 63.3 | 1.03x | 0.16x | 1.3 | 0% | | float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 63.3 | 0.91x | 0.14x | 1.1 | 0% | | float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 64.0 | 1.04x | 0.16x | 1.3 | 0% | | float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 64.0 | 1.02x | 0.15x | 1.3 | 0% | | float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 63.9 | 1.01x | 0.15x | 1.2 | 0% | | float32 | 8 | 128 | 4 | 33.8 | 37.2 | 14.7 | 0.91x | 0.40x | 0.1 | 0% | | float32 | 8 | 128 | 8 | 35.0 | 34.1 | 14.3 | 1.03x | 0.42x | 0.1 | 0% | | float32 | 8 | 128 | 16 | 35.6 | 37.2 | 15.2 | 0.96x | 0.41x | 0.2 | 0% | | float32 | 8 | 129 | 4 | 35.2 | 36.0 | 15.0 | 0.98x | 0.42x | 0.1 | 0% | | float32 | 8 | 129 | 8 | 36.8 | 34.1 | 15.0 | 1.08x | 0.44x | 0.1 | 0% | | float32 | 8 | 129 | 16 | 35.3 | 35.5 | 15.3 | 0.99x | 0.43x | 0.2 | 0% | | float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 20.9 | 1.12x | 0.59x | 0.9 | 0% | | float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 19.7 | 1.07x | 0.55x | 0.9 | 0% | | float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 19.7 | 0.95x | 0.49x | 0.9 | 0% | | float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 20.6 | 1.07x | 0.58x | 0.9 | 0% | | float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 21.4 | 0.99x | 0.55x | 0.9 | 0% | | float32 | 8 | 1025 | 16 | 41.2 | 36.9 
| 22.0 | 1.12x | 0.60x | 0.9 | 0% | | float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 41.0 | 0.92x | 0.65x | 4.2 | 1% | | float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 42.6 | 1.10x | 0.77x | 4.8 | 1% | | float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 44.7 | 1.09x | 0.73x | 4.3 | 1% | | float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 43.0 | 0.85x | 0.67x | 4.1 | 1% | | float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 43.5 | 1.09x | 0.71x | 4.3 | 1% | | float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 45.0 | 0.95x | 0.67x | 3.9 | 1% | | float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 76.0 | 1.00x | 0.19x | 10.2 | 2% | | float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 76.0 | 1.02x | 0.18x | 9.9 | 2% | | float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 76.2 | 1.03x | 0.17x | 9.2 | 2% | | float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 76.0 | 0.98x | 0.18x | 10.2 | 2% | | float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 75.8 | 1.00x | 0.18x | 9.8 | 2% | | float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 76.1 | 0.99x | 0.16x | 8.8 | 2% | | float32 | 64 | 128 | 4 | 38.2 | 37.4 | 15.0 | 1.02x | 0.40x | 1.0 | 0% | | float32 | 64 | 128 | 8 | 36.8 | 37.2 | 15.0 | 0.99x | 0.40x | 1.0 | 0% | | float32 | 64 | 128 | 16 | 38.3 | 37.2 | 14.9 | 1.03x | 0.40x | 1.2 | 0% | | float32 | 64 | 129 | 4 | 38.5 | 37.1 | 15.5 | 1.04x | 0.42x | 1.0 | 0% | | float32 | 64 | 129 | 8 | 37.0 | 37.1 | 15.9 | 1.00x | 0.43x | 1.1 | 0% | | float32 | 64 | 129 | 16 | 38.4 | 38.8 | 15.4 | 0.99x | 0.40x | 1.2 | 0% | | float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 20.4 | 1.02x | 0.52x | 6.8 | 1% | | float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 20.3 | 1.02x | 0.52x | 6.8 | 2% | | float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 20.3 | 1.03x | 0.50x | 6.8 | 1% | | float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 22.1 | 0.95x | 0.51x | 6.1 | 1% | | float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 22.1 | 0.99x | 0.51x | 6.2 | 1% | | float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 22.2 | 0.96x | 0.50x | 6.1 | 1% | | float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 65.6 | 0.98x | 0.66x | 21.2 | 5% | | float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 
65.6 | 0.97x | 0.62x | 19.7 | 4% | | float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 65.6 | 1.00x | 0.60x | 19.2 | 4% | | float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 65.6 | 0.98x | 0.66x | 21.1 | 5% | | float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 65.5 | 0.93x | 0.58x | 18.7 | 4% | | float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 65.6 | 0.97x | 0.57x | 18.2 | 4% | | float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 221.1 | 1.02x | 0.24x | 35.7 | 8% | | float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 220.4 | 1.02x | 0.22x | 33.3 | 7% | | float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 222.6 | 1.01x | 0.21x | 31.0 | 7% | | float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 223.0 | 1.00x | 0.24x | 35.7 | 8% | | float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 225.0 | 0.99x | 0.22x | 33.2 | 7% | | float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 223.7 | 0.99x | 0.20x | 30.5 | 7% | | float32 | 256 | 128 | 4 | 46.0 | 46.0 | 15.7 | 1.00x | 0.34x | 3.1 | 1% | | float32 | 256 | 128 | 8 | 47.2 | 47.5 | 15.7 | 0.99x | 0.33x | 3.3 | 1% | | float32 | 256 | 128 | 16 | 47.4 | 47.2 | 15.7 | 1.00x | 0.33x | 3.8 | 1% | | float32 | 256 | 129 | 4 | 47.2 | 47.5 | 16.1 | 0.99x | 0.34x | 3.0 | 1% | | float32 | 256 | 129 | 8 | 45.6 | 47.4 | 16.7 | 0.96x | 0.35x | 3.3 | 1% | | float32 | 256 | 129 | 16 | 47.3 | 49.0 | 16.6 | 0.97x | 0.34x | 3.7 | 1% | | float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 41.7 | 0.98x | 0.61x | 15.5 | 3% | | float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 43.2 | 1.01x | 0.62x | 15.3 | 3% | | float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 43.8 | 0.99x | 0.61x | 15.3 | 3% | | float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 45.9 | 1.02x | 0.57x | 13.1 | 3% | | float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 46.6 | 1.01x | 0.55x | 12.7 | 3% | | float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 48.1 | 0.98x | 0.54x | 12.3 | 3% | | float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 101.0 | 0.99x | 0.36x | 30.3 | 7% | | float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 101.3 | 1.05x | 0.35x | 29.4 | 6% | | float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 100.9 | 
0.99x | 0.32x | 26.7 | 6% | | float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 101.7 | 1.02x | 0.37x | 30.2 | 7% | | float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 101.6 | 1.00x | 0.35x | 28.8 | 6% | | float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 101.8 | 1.00x | 0.32x | 26.5 | 6% | | float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1128.3 | 1.00x | 0.35x | 41.9 | 9% | | float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1132.5 | 0.99x | 0.33x | 39.0 | 9% | | float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1129.5 | 1.00x | 0.30x | 36.2 | 8% | | float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 1148.5 | 1.00x | 0.36x | 42.0 | 9% | | float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 1148.0 | 1.00x | 0.34x | 39.2 | 9% | | float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 1147.9 | 1.01x | 0.31x | 36.1 | 8% | | float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 22.3 | 0.99x | 0.22x | 5.6 | 1% | | float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 22.0 | 1.02x | 0.21x | 5.9 | 1% | | float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 22.2 | 0.98x | 0.20x | 6.6 | 1% | | float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 24.4 | 1.01x | 0.24x | 5.7 | 1% | | float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 24.4 | 1.00x | 0.23x | 5.8 | 1% | | float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 24.6 | 0.99x | 0.22x | 6.5 | 1% | | float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 88.3 | 3.70x | 1.76x | 84.5 | 19% | | float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% | | float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 88.3 | 3.88x | 1.76x | 87.5 | 19% | | float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 90.2 | 2.73x | 0.98x | 46.1 | 10% | | float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 90.1 | 2.84x | 0.97x | 46.5 | 10% | | float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% | | float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 200.8 | 4.34x | 0.87x | 145.7 | 32% | | float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 200.2 | 4.64x | 0.87x | 145.6 | 32% | | float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 201.7 | 4.93x | 0.87x | 
145.8 | 32% | | float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 202.4 | 2.18x | 0.44x | 72.3 | 16% | | float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 201.3 | 2.31x | 0.43x | 72.2 | 16% | | float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 202.6 | 2.49x | 0.43x | 72.4 | 16% | | float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 4191.1 | 6.06x | 2.13x | 273.4 | 60% | | float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 4189.9 | 6.47x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 4260.7 | 3.35x | 1.20x | 151.4 | 33% | | float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 4261.2 | 3.60x | 1.20x | 151.3 | 33% | | float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 4261.2 | 3.91x | 1.20x | 151.0 | 33% | | float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 30.2 | 1.00x | 0.18x | 6.7 | 1% | | float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 30.6 | 1.00x | 0.17x | 7.0 | 2% | | float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 31.2 | 1.00x | 0.17x | 7.9 | 2% | | float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 35.4 | 1.00x | 0.21x | 6.8 | 1% | | float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 35.3 | 1.00x | 0.20x | 7.1 | 2% | | float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 36.4 | 1.00x | 0.20x | 7.9 | 2% | | float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 123.4 | 3.89x | 1.44x | 99.1 | 22% | | float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 123.4 | 4.04x | 1.44x | 99.9 | 22% | | float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 123.7 | 4.08x | 1.42x | 100.8 | 22% | | float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% | | float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% | | float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 127.0 | 3.02x | 0.76x | 52.6 | 12% | | float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% | | float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 424.8 | 4.73x | 0.96x | 151.8 | 33% | | float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 
| 424.0 | 5.07x | 0.95x | 152.0 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 436.2 | 2.18x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 437.6 | 2.34x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 440.8 | 2.49x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 8254.6 | 6.43x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 8250.2 | 6.97x | 2.10x | 273.3 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 8407.4 | 3.36x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 8407.4 | 3.60x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 8413.3 | 3.91x | 1.19x | 152.5 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
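
As a worked instance of the bandwidth formula above, the sketch below recomputes the `BW (GB/s)` and `%peak` columns for one row of the table (float32, bs=2048, dim=131072, k=16, 3930.4 us). The helper name and the dtype-size dictionary are illustrative and not part of the PR; the `+ 8` term is assumed here to be the 8-byte int64 index written per result.

```python
# Element sizes in bytes for the three benchmarked dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}
INDEX_BYTES = 8  # assumption: the "+ 8" in the formula is one int64 index per output

# Peak B580 bandwidth as quoted: 192-bit bus x 19 Gbps per pin, 8 bits per byte.
PEAK_GBPS = 192 * 19 / 8  # 456.0


def effective_bandwidth_gbps(bs, dim, k, dtype, time_us):
    """Bytes read (input) plus bytes written (values + indices), over time, in GB/s."""
    elem = DTYPE_BYTES[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9


# Row: float32, bs=2048, dim=131072, k=16, XPU opt = 3930.4 us.
bw = effective_bandwidth_gbps(2048, 131072, 16, "float32", 3930.4)
print(f"{bw:.1f} GB/s, {100 * bw / PEAK_GBPS:.0f}% of peak")  # 273.3 GB/s, 60% of peak
```

The input read dominates the byte count (about 1.07 GB here versus under 0.4 MB of output), so the metric is essentially input size divided by kernel time.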
## Summary

- **Speedup vs original XPU:** 1.3648x geomean over 432 cases, 130 wins (>1.05x), 40 regressions (<0.98x)
- **vs CUDA 4080S:** 0.4574x geomean (>1 means XPU faster)

### Approach

Add a **subgroup topk kernel** (`SubgroupTopKFunctor` in `TensorTopKSbtopkKernel.cpp`) where each 32-lane sub-group processes one slice entirely in registers:

- **Phase 1:** Each lane scans `dim/32` elements, maintaining a sorted top-k buffer via insertion sort (fully unrolled).
- **Phase 2:** 5-level bitonic merge across sub-group lanes via `sycl::select_from_group` shuffle.
- **Phase 3:** Lane 0 writes `k` results. Output is already sorted.

Key properties:

- Zero SLM (shared local memory), zero barriers
- `largest` as compile-time template parameter (eliminates per-element direction branches)
- `int32`/`int64` index dispatch mirroring CUDA's `canUse32BitIndexMath`
- Kernel isolated in a separate translation unit to prevent SYCL compiler global optimization interference with the original kernel

**Dispatch:** `k <= 16` and `nsegments >= HW_thread_slots / 4` and `dim >= 32` → subgroup kernel (SORTED); otherwise → original kernel.
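
The three phases above can be sketched on the host as a plain-Python simulation of one 32-lane sub-group working on one slice. This is only an illustrative model, not the SYCL kernel: the function names are hypothetical, per-lane registers are modelled as Python lists, and the cross-lane bitonic merge via `sycl::select_from_group` is replaced by a simple sort-and-truncate merge with the partner lane `lane ^ offset` over the same 5 levels.

```python
SUBGROUP_SIZE = 32  # lanes per sub-group, as in the PR description


def insert_sorted(buf, value, index, k, largest):
    """Phase 1 step: insert (value, index) into a sorted buffer holding at most k entries."""
    pos = len(buf)
    # Walk left past every entry that the new value beats.
    while pos > 0 and (value > buf[pos - 1][0] if largest else value < buf[pos - 1][0]):
        pos -= 1
    if pos < k:
        buf.insert(pos, (value, index))
        del buf[k:]  # drop the entry pushed out of the top-k, if any


def merge_topk(a, b, k, largest):
    """Stand-in for the bitonic merge: combine two sorted buffers and keep the best k."""
    return sorted(a + b, key=lambda pair: pair[0], reverse=largest)[:k]


def subgroup_topk(row, k, largest=True):
    # Phase 1: lane L scans elements L, L+32, L+64, ... and keeps its own top-k.
    lanes = [[] for _ in range(SUBGROUP_SIZE)]
    for lane in range(SUBGROUP_SIZE):
        for idx in range(lane, len(row), SUBGROUP_SIZE):
            insert_sorted(lanes[lane], row[idx], idx, k, largest)

    # Phase 2: log2(32) = 5 exchange levels with offsets 16, 8, 4, 2, 1.
    offset = SUBGROUP_SIZE // 2
    while offset >= 1:
        lanes = [
            merge_topk(lanes[lane], lanes[lane ^ offset], k, largest)
            for lane in range(SUBGROUP_SIZE)
        ]
        offset //= 2

    # Phase 3: lane 0 holds the slice-wide top-k, already sorted.
    values = [v for v, _ in lanes[0]]
    indices = [i for _, i in lanes[0]]
    return values, indices


row = [(i * 37) % 101 for i in range(128)]
values, indices = subgroup_topk(row, k=4)
print(values, indices)
```

At each exchange level the two partner lanes cover disjoint sets of input elements, so no index is counted twice, and after the fifth level every lane (including lane 0) holds the top-k of the whole slice.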
### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSbtopkKernel.cpp` (new) | Subgroup topk kernel + dispatch logic |
| `TensorTopKSbtopkKernel.h` (new) | `SbtopkResult` enum + `sbtopk_try_launch` declaration |
| `TensorTopKKernel.cpp` | Modified caller: tries optimized path first, skips sort if already sorted |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output verified monotonic)

### Benchmark summary

**By batch size:**

| bs | speedup vs orig | vs CUDA 4080S | cases |
|----|:-:|:-:|:-:|
| 1 | 1.00x | 0.41x | 72 |
| 8 | 1.00x | 0.43x | 72 |
| 64 | 1.00x | 0.42x | 72 |
| 256 | 1.00x | 0.36x | 72 |
| 1024 | 2.53x | 0.63x | 72 |
| 2048 | 2.55x | 0.55x | 72 |

**By dim:**

| dim | speedup vs orig | vs CUDA 4080S | cases |
|-----|:-:|:-:|:-:|
| 128 | 1.00x | 0.36x | 54 |
| 129 | 1.00x | 0.39x | 54 |
| 1024 | 1.47x | 0.77x | 54 |
| 1025 | 1.35x | 0.63x | 54 |
| 8192 | 1.62x | 0.60x | 54 |
| 8193 | 1.30x | 0.48x | 54 |
| 131072 | 1.87x | 0.34x | 54 |
| 131073 | 1.53x | 0.28x | 54 |

### Full 432-case results

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU opt (us) | CUDA 4080S (us) | speedup | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 37.4 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.2 | 1.06x | 0.87x | 0.3 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 39.1 | 1.00x | 0.80x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 37.0 | 0.94x | 0.76x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 40.3 | 1.00x | 0.83x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 39.7 | 1.00x | 0.82x | 0.3 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 46.4 | 1.01x | 0.11x | 0.6 | 0% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 46.8 | 1.02x | 0.13x | 0.7 | 0% |
| bfloat16 | 1 | 131073 | 8 
| 392.5 | 396.7 | 46.8 | 0.99x | 0.12x | 0.7 | 0% | | bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 46.7 | 0.98x | 0.11x | 0.6 | 0% | | bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 15.1 | 1.00x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 19.3 | 1.00x | 0.63x | 0.5 | 0% | | bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 20.5 | 1.00x | 0.67x | 0.5 | 0% | | bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 42.2 | 1.06x | 0.82x | 2.5 | 1% | | bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 39.9 | 0.95x | 0.73x | 2.4 | 1% | | bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 42.4 | 1.01x | 0.78x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 43.3 | 1.00x | 0.79x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 43.5 | 1.00x | 0.80x | 2.4 | 1% | | bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 42.7 | 1.12x | 0.88x | 2.7 | 1% | | bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 56.8 | 0.98x | 0.14x | 5.3 | 1% | | bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 56.5 | 1.06x | 0.14x | 5.3 | 1% | | bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 56.7 | 0.99x | 0.13x | 4.8 | 1% | | bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 56.8 | 0.97x | 0.14x | 5.2 | 1% | | bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 57.1 | 1.00x | 0.14x | 5.2 | 1% | | bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 56.9 | 1.01x | 0.13x | 4.8 | 1% | | bfloat16 | 64 
| 128 | 4 | 30.5 | 30.5 | 14.9 | 1.00x | 0.49x | 0.6 | 0% | | bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 14.7 | 1.00x | 0.48x | 0.7 | 0% | | bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 14.8 | 1.01x | 0.49x | 0.9 | 0% | | bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 15.4 | 1.01x | 0.51x | 0.6 | 0% | | bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 15.5 | 1.00x | 0.51x | 0.7 | 0% | | bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 15.2 | 1.01x | 0.50x | 0.9 | 0% | | bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 19.5 | 1.00x | 0.64x | 4.4 | 1% | | bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 19.5 | 1.00x | 0.64x | 4.5 | 1% | | bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 19.5 | 1.00x | 0.64x | 4.6 | 1% | | bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% | | bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 20.6 | 1.00x | 0.61x | 4.1 | 1% | | bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% | | bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 49.9 | 1.01x | 0.54x | 11.4 | 3% | | bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 49.5 | 1.01x | 0.51x | 10.9 | 2% | | bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 49.6 | 1.00x | 0.49x | 10.5 | 2% | | bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 49.8 | 1.07x | 0.55x | 11.7 | 3% | | bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 49.6 | 1.02x | 0.52x | 10.9 | 2% | | bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 49.7 | 1.00x | 0.50x | 10.6 | 2% | | bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 162.9 | 1.01x | 0.18x | 18.9 | 4% | | bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 164.6 | 0.99x | 0.17x | 17.7 | 4% | | bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 164.4 | 1.01x | 0.17x | 16.9 | 4% | | bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 166.8 | 1.01x | 0.19x | 18.9 | 4% | | bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 166.8 | 1.00x | 0.18x | 17.8 | 4% | | bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 165.5 | 1.00x | 0.17x | 16.8 | 4% | | bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 15.7 | 1.00x | 0.47x | 2.2 | 0% | | bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 
| 15.6 | 1.01x | 0.46x | 2.6 | 1% | | bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% | | bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% | | bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% | | bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% | | bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 41.7 | 1.00x | 0.74x | 9.5 | 2% | | bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% | | bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 42.6 | 1.00x | 0.72x | 9.5 | 2% | | bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 45.9 | 0.98x | 0.63x | 7.4 | 2% | | bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 46.7 | 1.01x | 0.63x | 7.4 | 2% | | bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 47.1 | 1.00x | 0.62x | 7.5 | 2% | | bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 75.2 | 0.99x | 0.29x | 15.9 | 3% | | bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 75.0 | 1.00x | 0.28x | 15.6 | 3% | | bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 75.2 | 0.99x | 0.26x | 14.6 | 3% | | bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 75.1 | 0.97x | 0.28x | 15.7 | 3% | | bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 75.6 | 1.00x | 0.28x | 15.4 | 3% | | bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 75.7 | 1.00x | 0.26x | 14.7 | 3% | | bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 439.2 | 1.00x | 0.14x | 21.7 | 5% | | bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 436.9 | 1.00x | 0.13x | 20.5 | 5% | | bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 440.9 | 1.00x | 0.13x | 19.4 | 4% | | bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 441.5 | 1.00x | 0.14x | 21.7 | 5% | | bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 435.4 | 1.00x | 0.13x | 20.5 | 5% | | bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 443.1 | 1.00x | 0.13x | 19.3 | 4% | | bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 22.1 | 1.02x | 0.32x | 4.4 | 1% | | bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 22.0 | 1.00x | 0.29x | 4.6 | 1% | | bfloat16 | 1024 | 
128 | 16 | 76.9 | 76.7 | 22.3 | 1.00x | 0.29x | 5.6 | 1% | | bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 24.4 | 1.02x | 0.35x | 4.4 | 1% | | bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 24.4 | 1.00x | 0.32x | 4.6 | 1% | | bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 24.5 | 1.00x | 0.32x | 5.6 | 1% | | bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 63.1 | 2.72x | 1.12x | 38.0 | 8% | | bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 63.3 | 2.78x | 1.13x | 38.8 | 9% | | bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 63.4 | 2.73x | 1.10x | 39.3 | 9% | | bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 64.5 | 2.54x | 0.75x | 24.9 | 5% | | bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 64.7 | 2.58x | 0.75x | 25.1 | 6% | | bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 64.8 | 2.59x | 0.74x | 25.9 | 6% | | bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 147.6 | 3.79x | 0.60x | 67.8 | 15% | | bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 147.4 | 3.95x | 0.59x | 67.6 | 15% | | bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 148.0 | 4.12x | 0.59x | 67.4 | 15% | | bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 149.2 | 2.32x | 0.37x | 41.4 | 9% | | bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 148.4 | 2.43x | 0.36x | 41.4 | 9% | | bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 149.3 | 2.56x | 0.37x | 41.6 | 9% | | bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1865.9 | 6.52x | 1.06x | 152.3 | 33% | | bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1867.4 | 6.92x | 1.06x | 152.3 | 33% | | bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1863.0 | 7.28x | 1.05x | 152.0 | 33% | | bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 1940.1 | 3.84x | 0.65x | 89.5 | 20% | | bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 1944.4 | 4.28x | 0.65x | 89.5 | 20% | | bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 30.5 | 1.00x | 0.27x | 5.3 | 1% | | bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 30.5 | 1.00x | 0.25x | 5.7 | 1% | | 
bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 30.9 | 1.00x | 0.25x | 6.9 | 2% | | bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 35.4 | 1.00x | 0.31x | 5.4 | 1% | | bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% | | bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 35.7 | 1.00x | 0.29x | 7.0 | 2% | | bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 85.7 | 2.87x | 0.89x | 44.4 | 10% | | bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 86.0 | 2.92x | 0.88x | 44.7 | 10% | | bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% | | bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% | | bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 88.7 | 2.67x | 0.56x | 27.5 | 6% | | bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 89.0 | 2.68x | 0.56x | 28.3 | 6% | | bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 234.9 | 3.78x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 234.1 | 3.94x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 235.0 | 4.13x | 0.47x | 67.9 | 15% | | bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 236.2 | 2.27x | 0.29x | 40.8 | 9% | | bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 237.3 | 2.38x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 237.4 | 2.50x | 0.29x | 41.0 | 9% | | bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3646.5 | 6.57x | 1.05x | 154.1 | 34% | | bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3644.1 | 6.94x | 1.05x | 154.1 | 34% | | bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3646.2 | 7.34x | 1.05x | 154.0 | 34% | | bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 3780.1 | 4.08x | 0.64x | 90.5 | 20% | | bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 3773.0 | 4.31x | 0.64x | 90.5 | 20% | | float16 | 1 | 128 | 4 | 30.7 | 30.7 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 128 | 8 | 30.6 | 30.6 | 14.0 | 1.00x | 
0.46x | 0.0 | 0% | | float16 | 1 | 128 | 16 | 30.5 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 129 | 4 | 30.6 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 8 | 30.6 | 30.3 | 14.4 | 1.01x | 0.48x | 0.0 | 0% | | float16 | 1 | 129 | 16 | 30.5 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% | | float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 17.4 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% | | float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 17.8 | 1.00x | 0.58x | 0.1 | 0% | | float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% | | float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 20.1 | 1.00x | 0.66x | 0.1 | 0% | | float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 33.6 | 1.08x | 0.88x | 0.4 | 0% | | float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 33.8 | 0.85x | 0.70x | 0.3 | 0% | | float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 31.5 | 0.94x | 0.65x | 0.3 | 0% | | float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 37.4 | 1.11x | 0.91x | 0.4 | 0% | | float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 36.9 | 0.97x | 0.84x | 0.4 | 0% | | float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 33.3 | 0.89x | 0.65x | 0.3 | 0% | | float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 46.2 | 0.98x | 0.15x | 0.9 | 0% | | float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 46.5 | 0.97x | 0.14x | 0.8 | 0% | | float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 46.1 | 0.98x | 0.13x | 0.7 | 0% | | float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 46.9 | 1.08x | 0.16x | 0.9 | 0% | | float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 46.8 | 0.99x | 0.14x | 0.8 | 0% | | float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 46.6 | 0.93x | 0.12x | 0.7 | 0% | | float16 | 8 | 128 | 4 | 30.5 | 30.2 | 14.4 | 1.01x | 0.48x | 0.1 | 0% | | float16 | 8 | 128 | 8 | 30.4 | 30.2 | 14.5 | 1.01x | 0.48x | 0.1 | 0% | | float16 | 8 | 128 | 16 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% | | float16 | 8 | 129 | 4 | 30.5 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% | | 
float16 | 8 | 129 | 8 | 30.3 | 30.2 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | float16 | 8 | 129 | 16 | 30.5 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 19.1 | 1.01x | 0.63x | 0.5 | 0% | | float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% | | float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 19.3 | 1.00x | 0.64x | 0.6 | 0% | | float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 20.4 | 1.01x | 0.67x | 0.6 | 0% | | float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 20.5 | 1.01x | 0.68x | 0.6 | 0% | | float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 37.9 | 1.00x | 0.83x | 2.9 | 1% | | float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 39.8 | 1.00x | 0.82x | 2.7 | 1% | | float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 41.7 | 0.94x | 0.81x | 2.6 | 1% | | float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 39.2 | 1.07x | 0.86x | 2.9 | 1% | | float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 40.7 | 0.94x | 0.84x | 2.7 | 1% | | float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 43.0 | 1.05x | 0.83x | 2.6 | 1% | | float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 56.0 | 0.93x | 0.17x | 6.3 | 1% | | float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 56.1 | 0.95x | 0.16x | 5.9 | 1% | | float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 56.3 | 1.01x | 0.15x | 5.4 | 1% | | float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 56.2 | 1.00x | 0.18x | 6.7 | 1% | | float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 56.3 | 0.99x | 0.16x | 6.1 | 1% | | float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 56.3 | 0.97x | 0.15x | 5.6 | 1% | | float16 | 64 | 128 | 4 | 30.5 | 30.1 | 14.9 | 1.01x | 0.50x | 0.6 | 0% | | float16 | 64 | 128 | 8 | 30.5 | 30.2 | 14.7 | 1.01x | 0.49x | 0.7 | 0% | | float16 | 64 | 128 | 16 | 30.4 | 30.2 | 14.7 | 1.01x | 0.49x | 0.9 | 0% | | float16 | 64 | 129 | 4 | 30.6 | 30.2 | 15.3 | 1.01x | 0.51x | 0.6 | 0% | | float16 | 64 | 129 | 8 | 30.6 | 30.2 | 15.2 | 1.01x | 0.50x | 0.7 | 0% | | float16 | 64 | 129 | 16 | 30.5 | 30.2 | 15.1 | 1.01x | 0.50x | 0.9 | 0% | | float16 | 64 | 
1024 | 4 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 4.4 | 1% | | float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% | | float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 19.4 | 1.00x | 0.64x | 4.7 | 1% | | float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 19.7 | 1.01x | 0.62x | 4.2 | 1% | | float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 20.4 | 1.00x | 0.64x | 4.2 | 1% | | float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% | | float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 49.4 | 0.97x | 0.59x | 12.5 | 3% | | float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 49.2 | 0.99x | 0.58x | 12.5 | 3% | | float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 49.2 | 0.98x | 0.54x | 11.7 | 3% | | float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 49.4 | 1.01x | 0.62x | 13.1 | 3% | | float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 49.4 | 1.04x | 0.59x | 12.5 | 3% | | float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 49.4 | 1.02x | 0.56x | 11.9 | 3% | | float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 162.1 | 1.04x | 0.22x | 23.2 | 5% | | float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 160.5 | 1.01x | 0.21x | 21.5 | 5% | | float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 162.4 | 0.98x | 0.19x | 19.4 | 4% | | float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 161.6 | 1.00x | 0.23x | 23.7 | 5% | | float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 163.9 | 1.01x | 0.21x | 21.6 | 5% | | float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 162.9 | 1.01x | 0.19x | 19.6 | 4% | | float16 | 256 | 128 | 4 | 33.7 | 33.6 | 15.5 | 1.00x | 0.46x | 2.3 | 0% | | float16 | 256 | 128 | 8 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% | | float16 | 256 | 128 | 16 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 3.2 | 1% | | float16 | 256 | 129 | 4 | 33.7 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% | | float16 | 256 | 129 | 8 | 33.7 | 33.5 | 15.9 | 1.01x | 0.47x | 2.6 | 1% | | float16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% | | float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 37.9 | 1.00x | 0.75x | 10.5 | 2% | | float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 38.8 | 1.00x | 0.73x 
| 10.3 | 2% | | float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 39.9 | 0.98x | 0.71x | 10.1 | 2% | | float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 42.0 | 1.00x | 0.66x | 8.4 | 2% | | float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 43.1 | 0.97x | 0.65x | 8.2 | 2% | | float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 43.8 | 1.02x | 0.65x | 8.3 | 2% | | float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 74.1 | 0.99x | 0.33x | 19.0 | 4% | | float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 74.4 | 1.00x | 0.32x | 18.0 | 4% | | float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 74.7 | 0.99x | 0.30x | 16.9 | 4% | | float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 74.3 | 0.99x | 0.34x | 19.1 | 4% | | float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 74.8 | 1.01x | 0.32x | 18.1 | 4% | | float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 74.9 | 0.98x | 0.29x | 16.5 | 4% | | float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 428.9 | 0.99x | 0.18x | 27.7 | 6% | | float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 427.9 | 1.00x | 0.16x | 25.2 | 6% | | float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 428.2 | 1.00x | 0.15x | 22.9 | 5% | | float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 431.9 | 0.98x | 0.18x | 27.5 | 6% | | float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 429.8 | 0.99x | 0.16x | 25.1 | 5% | | float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 432.2 | 1.00x | 0.15x | 22.8 | 5% | | float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 20.9 | 1.00x | 0.31x | 4.5 | 1% | | float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 20.9 | 1.01x | 0.30x | 4.9 | 1% | | float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 21.4 | 1.00x | 0.30x | 6.0 | 1% | | float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 23.3 | 1.00x | 0.35x | 4.6 | 1% | | float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 23.1 | 1.01x | 0.33x | 4.9 | 1% | | float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 23.4 | 0.98x | 0.32x | 5.9 | 1% | | float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 62.7 | 2.74x | 1.30x | 44.2 | 10% | | float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 63.0 | 2.80x | 1.29x | 44.7 | 10% | | float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 63.1 | 
2.89x | 1.27x | 45.5 | 10% | | float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 64.2 | 1.89x | 0.66x | 21.9 | 5% | | float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 64.4 | 1.97x | 0.66x | 22.3 | 5% | | float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 64.5 | 2.08x | 0.65x | 22.9 | 5% | | float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 145.0 | 3.99x | 0.73x | 84.6 | 19% | | float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 144.6 | 4.22x | 0.73x | 84.7 | 19% | | float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 145.5 | 4.50x | 0.72x | 83.9 | 18% | | float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 146.1 | 1.75x | 0.32x | 36.9 | 8% | | float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 146.5 | 1.83x | 0.32x | 36.9 | 8% | | float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 146.2 | 1.98x | 0.32x | 36.8 | 8% | | float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1846.9 | 5.88x | 1.20x | 174.8 | 38% | | float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1856.1 | 6.41x | 1.20x | 174.1 | 38% | | float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1858.5 | 7.06x | 1.21x | 174.6 | 38% | | float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 1924.0 | 2.82x | 0.60x | 84.1 | 18% | | float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 1921.5 | 3.12x | 0.60x | 84.3 | 18% | | float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 1926.4 | 3.42x | 0.60x | 84.3 | 18% | | float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 28.3 | 0.99x | 0.26x | 5.6 | 1% | | float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 28.5 | 1.00x | 0.25x | 6.1 | 1% | | float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 29.2 | 1.01x | 0.26x | 7.4 | 2% | | float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 32.6 | 0.99x | 0.30x | 5.6 | 1% | | float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 32.7 | 1.00x | 0.29x | 6.2 | 1% | | float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 33.5 | 1.00x | 0.29x | 7.4 | 2% | | float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% | | float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% | | float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 86.0 | 3.08x 
| 1.02x | 53.6 | 12% | | float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 87.7 | 1.92x | 0.49x | 23.8 | 5% | | float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 88.0 | 1.98x | 0.49x | 24.1 | 5% | | float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% | | float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 228.7 | 3.94x | 0.57x | 84.2 | 18% | | float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 228.5 | 4.16x | 0.57x | 84.3 | 18% | | float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 230.5 | 4.51x | 0.57x | 84.5 | 19% | | float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 231.7 | 1.70x | 0.25x | 36.4 | 8% | | float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 231.6 | 1.81x | 0.25x | 36.4 | 8% | | float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 233.1 | 1.95x | 0.25x | 36.4 | 8% | | float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3622.2 | 5.90x | 1.19x | 176.9 | 39% | | float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3607.3 | 6.49x | 1.19x | 177.3 | 39% | | float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% | | float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 3743.3 | 2.84x | 0.59x | 85.2 | 19% | | float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 3747.1 | 3.12x | 0.59x | 85.1 | 19% | | float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 3749.5 | 3.43x | 0.59x | 85.2 | 19% | | float32 | 1 | 128 | 4 | 31.2 | 31.4 | 14.5 | 0.99x | 0.46x | 0.0 | 0% | | float32 | 1 | 128 | 8 | 34.0 | 34.4 | 14.3 | 0.99x | 0.42x | 0.0 | 0% | | float32 | 1 | 128 | 16 | 32.4 | 34.4 | 14.0 | 0.94x | 0.41x | 0.0 | 0% | | float32 | 1 | 129 | 4 | 34.1 | 34.4 | 14.4 | 0.99x | 0.42x | 0.0 | 0% | | float32 | 1 | 129 | 8 | 34.0 | 32.7 | 14.4 | 1.04x | 0.44x | 0.0 | 0% | | float32 | 1 | 129 | 16 | 34.1 | 34.3 | 15.2 | 0.99x | 0.44x | 0.0 | 0% | | float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 17.8 | 1.08x | 0.54x | 0.1 | 0% | | float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 22.2 | 0.99x | 0.62x | 0.1 | 0% | | float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 19.1 | 0.99x | 0.54x | 0.1 | 0% | | float32 | 1 | 
1025 | 4 | 35.3 | 35.9 | 18.8 | 0.98x | 0.52x | 0.1 | 0% | | float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 19.7 | 1.08x | 0.55x | 0.1 | 0% | | float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 19.6 | 0.99x | 0.55x | 0.1 | 0% | | float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 39.6 | 1.07x | 0.77x | 0.6 | 0% | | float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 38.0 | 1.16x | 0.69x | 0.6 | 0% | | float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 38.7 | 0.94x | 0.67x | 0.6 | 0% | | float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 34.1 | 0.99x | 0.66x | 0.6 | 0% | | float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 41.6 | 1.03x | 0.76x | 0.6 | 0% | | float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 39.8 | 1.04x | 0.69x | 0.6 | 0% | | float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 63.3 | 1.04x | 0.16x | 1.3 | 0% | | float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 63.3 | 1.03x | 0.16x | 1.3 | 0% | | float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 63.3 | 0.91x | 0.14x | 1.1 | 0% | | float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 64.0 | 1.04x | 0.16x | 1.3 | 0% | | float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 64.0 | 1.02x | 0.15x | 1.3 | 0% | | float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 63.9 | 1.01x | 0.15x | 1.2 | 0% | | float32 | 8 | 128 | 4 | 33.8 | 37.2 | 14.7 | 0.91x | 0.40x | 0.1 | 0% | | float32 | 8 | 128 | 8 | 35.0 | 34.1 | 14.3 | 1.03x | 0.42x | 0.1 | 0% | | float32 | 8 | 128 | 16 | 35.6 | 37.2 | 15.2 | 0.96x | 0.41x | 0.2 | 0% | | float32 | 8 | 129 | 4 | 35.2 | 36.0 | 15.0 | 0.98x | 0.42x | 0.1 | 0% | | float32 | 8 | 129 | 8 | 36.8 | 34.1 | 15.0 | 1.08x | 0.44x | 0.1 | 0% | | float32 | 8 | 129 | 16 | 35.3 | 35.5 | 15.3 | 0.99x | 0.43x | 0.2 | 0% | | float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 20.9 | 1.12x | 0.59x | 0.9 | 0% | | float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 19.7 | 1.07x | 0.55x | 0.9 | 0% | | float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 19.7 | 0.95x | 0.49x | 0.9 | 0% | | float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 20.6 | 1.07x | 0.58x | 0.9 | 0% | | float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 21.4 | 0.99x | 0.55x | 0.9 | 0% | | float32 | 8 | 1025 | 16 | 41.2 | 36.9 
| 22.0 | 1.12x | 0.60x | 0.9 | 0% | | float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 41.0 | 0.92x | 0.65x | 4.2 | 1% | | float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 42.6 | 1.10x | 0.77x | 4.8 | 1% | | float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 44.7 | 1.09x | 0.73x | 4.3 | 1% | | float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 43.0 | 0.85x | 0.67x | 4.1 | 1% | | float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 43.5 | 1.09x | 0.71x | 4.3 | 1% | | float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 45.0 | 0.95x | 0.67x | 3.9 | 1% | | float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 76.0 | 1.00x | 0.19x | 10.2 | 2% | | float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 76.0 | 1.02x | 0.18x | 9.9 | 2% | | float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 76.2 | 1.03x | 0.17x | 9.2 | 2% | | float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 76.0 | 0.98x | 0.18x | 10.2 | 2% | | float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 75.8 | 1.00x | 0.18x | 9.8 | 2% | | float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 76.1 | 0.99x | 0.16x | 8.8 | 2% | | float32 | 64 | 128 | 4 | 38.2 | 37.4 | 15.0 | 1.02x | 0.40x | 1.0 | 0% | | float32 | 64 | 128 | 8 | 36.8 | 37.2 | 15.0 | 0.99x | 0.40x | 1.0 | 0% | | float32 | 64 | 128 | 16 | 38.3 | 37.2 | 14.9 | 1.03x | 0.40x | 1.2 | 0% | | float32 | 64 | 129 | 4 | 38.5 | 37.1 | 15.5 | 1.04x | 0.42x | 1.0 | 0% | | float32 | 64 | 129 | 8 | 37.0 | 37.1 | 15.9 | 1.00x | 0.43x | 1.1 | 0% | | float32 | 64 | 129 | 16 | 38.4 | 38.8 | 15.4 | 0.99x | 0.40x | 1.2 | 0% | | float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 20.4 | 1.02x | 0.52x | 6.8 | 1% | | float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 20.3 | 1.02x | 0.52x | 6.8 | 2% | | float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 20.3 | 1.03x | 0.50x | 6.8 | 1% | | float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 22.1 | 0.95x | 0.51x | 6.1 | 1% | | float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 22.1 | 0.99x | 0.51x | 6.2 | 1% | | float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 22.2 | 0.96x | 0.50x | 6.1 | 1% | | float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 65.6 | 0.98x | 0.66x | 21.2 | 5% | | float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 
65.6 | 0.97x | 0.62x | 19.7 | 4% | | float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 65.6 | 1.00x | 0.60x | 19.2 | 4% | | float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 65.6 | 0.98x | 0.66x | 21.1 | 5% | | float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 65.5 | 0.93x | 0.58x | 18.7 | 4% | | float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 65.6 | 0.97x | 0.57x | 18.2 | 4% | | float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 221.1 | 1.02x | 0.24x | 35.7 | 8% | | float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 220.4 | 1.02x | 0.22x | 33.3 | 7% | | float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 222.6 | 1.01x | 0.21x | 31.0 | 7% | | float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 223.0 | 1.00x | 0.24x | 35.7 | 8% | | float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 225.0 | 0.99x | 0.22x | 33.2 | 7% | | float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 223.7 | 0.99x | 0.20x | 30.5 | 7% | | float32 | 256 | 128 | 4 | 46.0 | 46.0 | 15.7 | 1.00x | 0.34x | 3.1 | 1% | | float32 | 256 | 128 | 8 | 47.2 | 47.5 | 15.7 | 0.99x | 0.33x | 3.3 | 1% | | float32 | 256 | 128 | 16 | 47.4 | 47.2 | 15.7 | 1.00x | 0.33x | 3.8 | 1% | | float32 | 256 | 129 | 4 | 47.2 | 47.5 | 16.1 | 0.99x | 0.34x | 3.0 | 1% | | float32 | 256 | 129 | 8 | 45.6 | 47.4 | 16.7 | 0.96x | 0.35x | 3.3 | 1% | | float32 | 256 | 129 | 16 | 47.3 | 49.0 | 16.6 | 0.97x | 0.34x | 3.7 | 1% | | float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 41.7 | 0.98x | 0.61x | 15.5 | 3% | | float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 43.2 | 1.01x | 0.62x | 15.3 | 3% | | float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 43.8 | 0.99x | 0.61x | 15.3 | 3% | | float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 45.9 | 1.02x | 0.57x | 13.1 | 3% | | float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 46.6 | 1.01x | 0.55x | 12.7 | 3% | | float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 48.1 | 0.98x | 0.54x | 12.3 | 3% | | float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 101.0 | 0.99x | 0.36x | 30.3 | 7% | | float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 101.3 | 1.05x | 0.35x | 29.4 | 6% | | float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 100.9 | 
0.99x | 0.32x | 26.7 | 6% | | float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 101.7 | 1.02x | 0.37x | 30.2 | 7% | | float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 101.6 | 1.00x | 0.35x | 28.8 | 6% | | float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 101.8 | 1.00x | 0.32x | 26.5 | 6% | | float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1128.3 | 1.00x | 0.35x | 41.9 | 9% | | float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1132.5 | 0.99x | 0.33x | 39.0 | 9% | | float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1129.5 | 1.00x | 0.30x | 36.2 | 8% | | float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 1148.5 | 1.00x | 0.36x | 42.0 | 9% | | float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 1148.0 | 1.00x | 0.34x | 39.2 | 9% | | float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 1147.9 | 1.01x | 0.31x | 36.1 | 8% | | float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 22.3 | 0.99x | 0.22x | 5.6 | 1% | | float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 22.0 | 1.02x | 0.21x | 5.9 | 1% | | float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 22.2 | 0.98x | 0.20x | 6.6 | 1% | | float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 24.4 | 1.01x | 0.24x | 5.7 | 1% | | float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 24.4 | 1.00x | 0.23x | 5.8 | 1% | | float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 24.6 | 0.99x | 0.22x | 6.5 | 1% | | float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 88.3 | 3.70x | 1.76x | 84.5 | 19% | | float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% | | float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 88.3 | 3.88x | 1.76x | 87.5 | 19% | | float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 90.2 | 2.73x | 0.98x | 46.1 | 10% | | float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 90.1 | 2.84x | 0.97x | 46.5 | 10% | | float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% | | float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 200.8 | 4.34x | 0.87x | 145.7 | 32% | | float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 200.2 | 4.64x | 0.87x | 145.6 | 32% | | float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 201.7 | 4.93x | 0.87x | 
145.8 | 32% | | float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 202.4 | 2.18x | 0.44x | 72.3 | 16% | | float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 201.3 | 2.31x | 0.43x | 72.2 | 16% | | float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 202.6 | 2.49x | 0.43x | 72.4 | 16% | | float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 4191.1 | 6.06x | 2.13x | 273.4 | 60% | | float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 4189.9 | 6.47x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 4260.7 | 3.35x | 1.20x | 151.4 | 33% | | float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 4261.2 | 3.60x | 1.20x | 151.3 | 33% | | float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 4261.2 | 3.91x | 1.20x | 151.0 | 33% | | float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 30.2 | 1.00x | 0.18x | 6.7 | 1% | | float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 30.6 | 1.00x | 0.17x | 7.0 | 2% | | float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 31.2 | 1.00x | 0.17x | 7.9 | 2% | | float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 35.4 | 1.00x | 0.21x | 6.8 | 1% | | float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 35.3 | 1.00x | 0.20x | 7.1 | 2% | | float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 36.4 | 1.00x | 0.20x | 7.9 | 2% | | float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 123.4 | 3.89x | 1.44x | 99.1 | 22% | | float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 123.4 | 4.04x | 1.44x | 99.9 | 22% | | float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 123.7 | 4.08x | 1.42x | 100.8 | 22% | | float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% | | float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% | | float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 127.0 | 3.02x | 0.76x | 52.6 | 12% | | float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% | | float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 424.8 | 4.73x | 0.96x | 151.8 | 33% | | float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 
| 424.0 | 5.07x | 0.95x | 152.0 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 436.2 | 2.18x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 437.6 | 2.34x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 440.8 | 2.49x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 8254.6 | 6.43x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 8250.2 | 6.97x | 2.10x | 273.3 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 8407.4 | 3.36x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 8407.4 | 3.60x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 8413.3 | 3.91x | 1.19x | 152.5 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).

Co-authored-by: chuanqi129 <13608516+chuanqi129@users.noreply.github.com>
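
The bandwidth formula in the test methodology above can be checked against a table row with a few lines of Python. This is only a sketch: the function name and the `DTYPE_BYTES` table are illustrative, and reading the `+ 8` term as one int64 index per output element is an assumption (it matches the index dtype `torch.topk` returns).

```python
# Illustrative helper, not part of the PR: recompute "BW (GB/s)" and "%peak".
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}
INDEX_BYTES = 8  # assumed: one int64 index written per output element

# Peak stated in the methodology: 192-bit bus x 19 Gbps per pin, in GB/s.
PEAK_GBPS = 192 / 8 * 19


def effective_bandwidth_gbps(dtype, bs, dim, k, time_us):
    """Bytes read (input) plus bytes written (values + indices), over time."""
    elem = DTYPE_BYTES[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9


# Row from the table: float32, bs=1024, dim=131072, k=4, XPU PR1 = 1964.0 us.
bw = effective_bandwidth_gbps("float32", 1024, 131072, 4, 1964.0)
pct_peak = 100.0 * bw / PEAK_GBPS
print(f"{bw:.1f} GB/s, {pct_peak:.0f}% of peak")
```

For that row the result agrees with the reported 273.4 GB/s and 60% of peak, which suggests the BW column is computed from the PR1 kernel time.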
Split the monolithic TensorTopKSbtopkKernel.cpp into 5 per-K files (k1, k2, k4, k8, k16) to enable parallel AOT compilation and avoid CD build timeout that caused the original PR #3371 to be reverted.

- TensorTopKSbtopkKernel.h: public API (SbtopkResult enum + dispatch)
- TensorTopKSbtopkKernelImpl.h: shared functor + launch templates
- TensorTopKSbtopkKernel_k{1,2,4,8,16}.cpp: per-K instantiations
- TensorTopKSbtopkKernel.cpp: dispatch-only (routes to per-K units)
- TensorTopKKernel.cpp: integrate sbtopk_try_launch fallback
Split the monolithic TensorTopKSbtopkKernel.cpp into 4 per-K files (k1, k2, k4, k8) to enable parallel AOT compilation and avoid CD build timeout that caused the original PR #3371 to be reverted. K=16 is excluded for now as it alone causes compilation timeout; it can be re-added once incremental build improvements land.

- TensorTopKSbtopkKernel.h: public API (SbtopkResult enum + dispatch)
- TensorTopKSbtopkKernelImpl.h: shared functor + launch templates
- TensorTopKSbtopkKernel_k{1,2,4,8}.cpp: per-K instantiations
- TensorTopKSbtopkKernel.cpp: dispatch-only (routes to per-K units)
- TensorTopKKernel.cpp: integrate sbtopk_try_launch fallback
## Summary

Builds on #3371 (subgroup topk kernel). Adds a **single workgroup topk kernel**, a SYCL translation of PyTorch CUDA's single-block radix select path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432 cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A 1024-thread workgroup processes one slice using `RADIX_BITS=4` radix select to find the k-th value, then gathers matching elements. Translated from PyTorch CUDA's single-block path. Output is unsorted (caller sorts if needed). Best for large dim (>= 4096).

**Updated dispatch logic:**

- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel

Also fixes NaN handling in `SortingRadixSelect.h` `TopKTypeConfig::convert` for half/float/double (NaN maps to max radix value).

Multi-block radix select (for very large slices across multiple workgroups) is planned as future work.
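
To make the radix-select approach described above concrete, here is a serial Python model of the idea: convert each float to an order-preserving unsigned key (with NaN mapped to the maximum key, as the NaN fix describes), pick the k-th largest key 4 bits at a time, then gather. This is a sketch under stated assumptions, not the kernel's code: the function names are hypothetical, only float32 is modelled, and the real kernel spreads the counting over a 1024-thread workgroup instead of a Python loop.

```python
import math
import struct

RADIX_BITS = 4                 # digit width named in the PR description
RADIX_SIZE = 1 << RADIX_BITS   # 16 buckets per pass
RADIX_MASK = RADIX_SIZE - 1


def float_to_key(x):
    """Map a float32 value to a uint32 whose unsigned order matches float order."""
    if math.isnan(x):
        return 0xFFFFFFFF  # NaN sorts as the largest key
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative values: flip every bit. Non-negative values: flip the sign bit.
    mask = 0xFFFFFFFF if bits & 0x80000000 else 0x80000000
    return bits ^ mask


def radix_select_kth_largest(keys, k):
    """Return the k-th largest key (1 <= k <= len(keys)), one digit per pass."""
    desired, desired_mask, k_remaining = 0, 0, k
    for shift in range(32 - RADIX_BITS, -1, -RADIX_BITS):
        # Histogram the current digit over keys that match the prefix so far.
        counts = [0] * RADIX_SIZE
        for key in keys:
            if (key & desired_mask) == desired:
                counts[(key >> shift) & RADIX_MASK] += 1
        # Walk buckets from the largest digit down to find the k-th element.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if k_remaining <= counts[digit]:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            k_remaining -= counts[digit]
    return desired


def topk_unsorted(values, k):
    """Gather (value, index) pairs for the k largest values, in no sorted order."""
    keys = [float_to_key(v) for v in values]
    kth = radix_select_kth_largest(keys, k)
    picked = [i for i, key in enumerate(keys) if key > kth]
    for i, key in enumerate(keys):  # fill the remaining slots with ties
        if len(picked) == k:
            break
        if key == kth:
            picked.append(i)
    return [(values[i], i) for i in picked]


result = topk_unsorted([3.0, -1.0, 7.5, 0.0, 7.5, 2.0], 3)
print(result)
```

With 32-bit keys and 4-bit digits the selection takes eight passes over the slice. Because the gather step only compares against the selected key, the output comes back unsorted, which is why the caller sorts afterwards when `sorted=True` is requested.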
### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel (from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch` declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert` |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output verified monotonic)

### Benchmark: incremental gain from this PR

Showing where single-wg kernel helps (large dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
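
For readers reproducing the ratio columns and the summary statistics, the arithmetic can be sketched as follows. The helper names are hypothetical; the 1.05x and 0.98x thresholds are the ones quoted in the summary, and the assumption is that the geomean is taken over the per-case ratios.

```python
import math


def speedup(baseline_us, new_us):
    """Ratio > 1 means the new time is faster than the baseline."""
    return baseline_us / new_us


def geomean(ratios):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def summarize(ratios, win=1.05, regress=0.98):
    """Count wins (> win) and regressions (< regress) alongside the geomean."""
    return {
        "geomean": geomean(ratios),
        "wins": sum(r > win for r in ratios),
        "regressions": sum(r < regress for r in ratios),
    }


# One row of the full table: bfloat16, bs=1, dim=131072, k=4.
xpu_orig, xpu_new, cuda = 368.8, 102.4, 46.3
vs_orig = speedup(xpu_orig, xpu_new)   # "vs orig" column
vs_cuda = speedup(cuda, xpu_new)       # "vs CUDA" column (>1 means XPU faster)
print(f"{vs_orig:.2f}x vs orig, {vs_cuda:.2f}x vs CUDA")
```

A geometric mean is the usual choice for averaging speedup ratios because a 2x gain and a 0.5x loss cancel to 1.0x, which an arithmetic mean would not show.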
<details> <summary>Click to expand full table</summary> | dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak | |-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:| | bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% | | bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% | | bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% | | bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% | | bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% | | bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% | | bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% | | bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% | | bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% | | bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% | | bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% | | bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% | | bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% | | 
bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% | | bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% | | bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% | | bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% | | bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% | | bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% | | bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% | | bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% | | bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% | | bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% | | bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% | | bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% | | bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% | | bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% | | bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% | | bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 
3.07x | 0.41x | 15.3 | 3% | | bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% | | bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% | | bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% | | bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% | | bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% | | bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% | | bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% | | bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% | | bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% | | bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% | | bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% | | bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% | | bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% | | bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% | | bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% | | bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% | | bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% | | bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% | | bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% | | bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% | | bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% | | bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% | | bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% | | 
bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% | | bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% | | bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% | | bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% | | bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% | | bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% | | bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% | | bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% | | bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% | | bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% | | bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% | | bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% | | bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% | | bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% | | bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% | | bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% | | bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% | | bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% | | bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% | | bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% | | bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% | | bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% | | bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x 
| 15.2 | 3% | | bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% | | bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% | | bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% | | bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% | | bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% | | bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% | | bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% | | bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% | | bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% | | bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% | | bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% | | bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% | | bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% | | bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% | | bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% | | bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% | | bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% | | bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% | | bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% | | bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% | | bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% | | bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% | 
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% | | bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% | | bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% | | bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% | | bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% | | bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% | | bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% | | bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% | | bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% | | bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% | | bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% | | bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% | | bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% | | bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% | | bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% | | bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% | | bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% | | bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% | | bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% | | bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 
| 234.1 | 3.93x | 0.47x | 67.7 | 15% | | bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% | | bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% | | bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% | | bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% | | bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% | | float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% | | float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% | | float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% | | float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% | | float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% | | float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% | | 
float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x | 3.1 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x | 3.0 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x | 3.0 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x | 2.8 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x | 2.8 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x | 2.9 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x | 0.41x | 15.2 | 3% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x | 0.45x | 16.7 | 4% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x | 0.42x | 15.8 | 3% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x | 0.27x | 10.0 | 2% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x | 0.28x | 10.5 | 2% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x | 0.28x | 10.5 | 2% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x | 4.6 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x | 4.1 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x | 12.7 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x | 12.7 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x | 11.9 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x | 12.3 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x | 11.9 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x | 11.5 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x | 0.57x | 58.7 | 13% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x | 0.55x | 57.8 | 13% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x | 0.58x | 59.4 | 13% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x | 0.37x | 38.1 | 8% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x | 0.40x | 40.9 | 9% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x | 0.38x | 38.7 | 8% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x | 10.7 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x | 0.34x | 19.3 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x | 0.33x | 18.6 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x | 0.32x | 17.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x | 0.31x | 17.8 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x | 0.30x | 17.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x | 0.29x | 16.4 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x | 0.49x | 76.2 | 17% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x | 0.48x | 75.7 | 17% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x | 0.48x | 74.8 | 16% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x | 0.31x | 47.7 | 10% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x | 0.30x | 46.8 | 10% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x | 0.29x | 45.6 | 10% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x | 4.6 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x | 5.9 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x | 4.5 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x | 6.0 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x | 44.1 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x | 45.0 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x | 1.27x | 45.4 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 97.5 | 64.2 | 1.90x | 0.66x | 22.0 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x | 0.70x | 81.0 | 18% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x | 0.69x | 80.5 | 18% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x | 0.69x | 79.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x | 0.31x | 36.1 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 | 5.87x | 1.20x | 174.4 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 | 6.42x | 1.21x | 174.4 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 | 7.04x | 1.20x | 174.0 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 | 2.83x | 0.60x | 84.2 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 | 3.10x | 0.60x | 84.0 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 | 3.39x | 0.60x | 83.5 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x | 0.27x | 5.7 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x | 0.25x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x | 0.30x | 5.7 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x | 1.02x | 53.7 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x | 0.49x | 23.7 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x | 0.57x | 84.1 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x | 0.25x | 36.3 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x | 0.25x | 36.5 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 | 5.90x | 1.19x | 177.1 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 | 6.50x | 1.19x | 177.4 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 | 2.84x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 | 3.12x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 | 3.42x | 0.59x | 85.1 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 32.0 | 15.2 | 1.07x | 0.47x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x | 0.7 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x | 0.41x | 3.4 | 1% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x | 0.48x | 4.0 | 1% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x | 0.43x | 3.5 | 1% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x | 0.37x | 3.0 | 1% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x | 0.34x | 2.8 | 1% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x | 0.27x | 2.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 36.8 | 15.3 | 0.96x | 0.42x | 0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x | 1.0 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x | 0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x | 0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x | 0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x | 0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x | 4.7 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x | 4.3 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x | 4.7 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x | 4.6 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x | 4.5 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x | 4.2 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x | 0.47x | 26.2 | 6% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x | 0.47x | 26.1 | 6% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x | 0.44x | 24.1 | 5% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x | 0.31x | 17.1 | 4% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x | 0.30x | 16.7 | 4% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x | 0.30x | 16.7 | 4% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x | 1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x | 1.1 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x | 1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x | 1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x | 1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x | 1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x | 6.4 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x | 6.5 | 1% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x | 6.4 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x | 6.4 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x | 6.3 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x | 6.4 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x | 19.6 | 4% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x | 19.1 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x | 0.56x | 18.0 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x | 18.8 | 4% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x | 18.6 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x | 0.59x | 18.9 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x | 0.47x | 71.4 | 16% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x | 0.47x | 70.9 | 16% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x | 0.46x | 68.9 | 15% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x | 0.37x | 55.0 | 12% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x | 0.35x | 52.8 | 12% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x | 0.34x | 51.6 | 11% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x | 3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x | 3.4 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x | 3.9 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x | 3.2 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 47.2 | 16.7 | 0.97x | 0.35x | 3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x | 3.6 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x | 15.6 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x | 13.0 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x | 12.3 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x | 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x | 0.36x | 30.0 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x | 0.35x | 28.8 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x | 0.34x | 28.0 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x | 0.34x | 28.3 | 6% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x | 0.34x | 27.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x | 0.32x | 26.8 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x | 0.69x | 82.6 | 18% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x | 0.69x | 81.6 | 18% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 | 2.19x | 0.67x | 79.5 | 17% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x | 0.54x | 62.7 | 14% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x | 0.52x | 60.8 | 13% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 | 1.65x | 0.51x | 59.3 | 13% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x | 0.22x | 5.7 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x | 0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x | 0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x | 0.24x | 5.6 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x | 0.23x | 5.9 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x | 0.22x | 6.6 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x | 84.9 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x | 1.73x | 86.1 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x | 46.2 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x | 46.4 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x | 0.87x | 145.4 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x | 0.87x | 145.5 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x | 0.87x | 145.9 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x | 0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x | 0.43x | 72.4 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x | 0.44x | 72.5 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 | 6.06x | 2.13x | 273.2 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 | 6.47x | 2.13x | 272.9 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 | 3.35x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 | 3.60x | 1.20x | 151.2 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 | 3.91x | 1.20x | 150.9 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 171.1 | 30.2 | 1.00x | 0.18x | 6.7 | 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x | 0.17x | 7.0 | 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x | 0.17x | 8.0 | 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x | 0.21x | 6.7 | 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x | 0.20x | 7.1 | 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x | 0.20x | 8.0 | 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x | 1.44x | 99.3 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x | 1.43x | 99.8 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x | 1.42x | 100.9 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x | 0.76x | 52.5 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x | 0.96x | 151.9 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x | 0.96x | 152.1 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x | 0.48x | 74.2 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 | 6.43x | 2.10x | 273.4 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 | 6.98x | 2.10x | 273.6 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 | 3.36x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 | 3.60x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 | 3.92x | 1.20x | 152.7 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
…l#3372)

## Summary

Builds on intel#3371 (subgroup topk kernel). Adds a **single workgroup topk kernel**, a SYCL translation of PyTorch CUDA's single-block radix select path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432 cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A 1024-thread workgroup processes one slice using `RADIX_BITS=4` radix select to find the k-th value, then gathers matching elements. Translated from PyTorch CUDA's single-block path. Output is unsorted (caller sorts if needed). Best for large dim (>= 4096).

**Updated dispatch logic:**

- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel

Also fixes NaN handling in `SortingRadixSelect.h` `TopKTypeConfig::convert` for half/float/double (NaN maps to max radix value).

Multi-block radix select (for very large slices across multiple workgroups) is planned as future work.
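The dispatch order and the radix-select idea described above can be sketched on the host in plain Python. This is only an illustration under stated assumptions, not the kernel's code: `choose_kernel`, `float_to_key`, `radix_select_kth_largest` and `topk_unsorted` are hypothetical names, `large_batch` is a stand-in boolean for the batch-size heuristic (whose exact threshold is not spelled out in this section), the float-to-key mapping is the usual order-preserving bit trick for float32 with NaN forced to the maximum key, and the loop is sequential where the real kernel shares the histogram across a 1024-thread workgroup.

```python
import math
import struct

RADIX_BITS = 4                      # digit width named in the description
RADIX_SIZE = 1 << RADIX_BITS
RADIX_MASK = RADIX_SIZE - 1


def choose_kernel(dim, k, large_batch):
    """Dispatch order from the list above; `large_batch` is an assumed stand-in flag."""
    if dim < 1024:
        return "original"
    if k <= 16 and large_batch:
        return "subgroup"       # PR1 path, output already sorted
    if dim >= 4096:
        return "single_wg"      # this PR, output unsorted
    return "original"


def float_to_key(x):
    """Map a float32 to an unsigned 32-bit key whose integer order matches float order."""
    if math.isnan(x):
        return 0xFFFFFFFF       # NaN maps to the max radix value
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative values: flip every bit. Non-negative values: flip only the sign bit.
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits ^ 0x80000000


def radix_select_kth_largest(keys, k):
    """Return the key of the k-th largest element (1 <= k <= len(keys)), MSB digit first."""
    desired = 0                 # digits of the answer fixed so far
    desired_mask = 0            # bit positions of `desired` that are already fixed
    remaining = k
    for shift in range(32 - RADIX_BITS, -1, -RADIX_BITS):
        # Histogram the current digit over keys that still match the fixed prefix.
        counts = [0] * RADIX_SIZE
        for key in keys:
            if (key & desired_mask) == desired:
                counts[(key >> shift) & RADIX_MASK] += 1
        # Walk buckets from the largest digit down to the one holding the k-th element.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if counts[digit] >= remaining:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            remaining -= counts[digit]
    return desired


def topk_unsorted(values, k):
    """Gather (value, index) pairs: everything above the k-th key, then ties until k are found."""
    keys = [float_to_key(v) for v in values]
    kth = radix_select_kth_largest(keys, k)
    picked = [i for i, key in enumerate(keys) if key > kth]
    ties = [i for i, key in enumerate(keys) if key == kth]
    picked += ties[: k - len(picked)]
    return [(values[i], i) for i in picked]


values = [3.0, -1.0, 7.5, 0.0, 7.5, 2.0, float("nan"), -4.0]
top3 = topk_unsorted(values, 3)
```

With this key mapping the NaN entry ranks above both 7.5 values, so `top3` holds indices 6, 2 and 4 in gather order rather than sorted order, which is why the caller still sorts when `sorted=True` is requested.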
### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel (from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch` declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert` |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output verified monotonic)

### Benchmark: incremental gain from this PR

Showing where single-wg kernel helps (large dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
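As a reading aid for the derived columns, here is a minimal sketch that recomputes `vs orig`, `vs CUDA`, `BW (GB/s)` and `%peak` for one row, using the bandwidth formula from the #3371 test methodology: `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. The function names are hypothetical, the `+ 8` term is assumed to be one int64 index per output element, and the ratio directions (baseline time divided by PR1+PR2 time) are inferred from the ">1 means XPU faster" note rather than stated explicitly.

```python
import math

B580_PEAK_GBPS = 456.0   # peak bandwidth quoted in the description
INDEX_BYTES = 8          # the formula's "+ 8": assumed int64 index per output element


def derived_columns(bs, dim, k, dtype_bytes, t_orig_us, t_new_us, t_cuda_us):
    """Recompute the derived columns of one row from its measured times (microseconds)."""
    bytes_moved = bs * dim * dtype_bytes + bs * k * (dtype_bytes + INDEX_BYTES)
    bw_gbps = bytes_moved / (t_new_us * 1e-6) / 1e9
    return {
        "vs_orig": t_orig_us / t_new_us,    # >1 means the new path beats the original XPU kernel
        "vs_cuda": t_cuda_us / t_new_us,    # >1 would mean XPU is faster than CUDA
        "bw_gbps": bw_gbps,
        "pct_peak": 100.0 * bw_gbps / B580_PEAK_GBPS,
    }


def geomean(ratios):
    """Geometric mean, the aggregate used for the speedups in the Summary."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Row taken from the table: bfloat16, bs=256, dim=131072, k=4.
row = derived_columns(bs=256, dim=131072, k=4, dtype_bytes=2,
                      t_orig_us=3096.6, t_new_us=961.2, t_cuda_us=439.2)
print(f'{row["vs_orig"]:.2f}x {row["vs_cuda"]:.2f}x '
      f'{row["bw_gbps"]:.1f} GB/s {row["pct_peak"]:.0f}%')
```

For that row the recomputed values round to 3.22x, 0.46x, 69.8 GB/s and 15%, matching the table entry.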
<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 3.07x | 0.41x | 15.3 | 3% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% | | bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% | | bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% | | bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% | | bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% | | bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% | | bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% | | bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% | | bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% | | bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% | | bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% | | bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% | | bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% | | bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% | | bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% | | bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% | | bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% | | bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% | | bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% | | bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% | | bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 
| 234.1 | 3.93x | 0.47x | 67.7 | 15% | | bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% | | bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% | | bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% | | bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% | | bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% | | bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% | | bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% | | bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% | | float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% | | float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% | | float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% | | float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% | | float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% | | float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% | | float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% | | float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% | | float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% | | 
float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% | | float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% | | float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% | | float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% | | float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% | | float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% | | float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% | | float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% | | float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% | | float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% | | float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% | | float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% | | float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x | 0.1 | 0% | | float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x | 0.1 | 0% | | float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% | | float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% | | float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% | | float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x | 0.1 | 0% | | float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x | 0.6 | 0% | | float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% | | float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x | 0.6 | 0% | | float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% | | float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% | | float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% | | 
float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x | 3.1 | 1% | | float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x | 3.0 | 1% | | float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x | 3.0 | 1% | | float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x | 2.8 | 1% | | float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x | 2.8 | 1% | | float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x | 2.9 | 1% | | float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x | 0.41x | 15.2 | 3% | | float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x | 0.45x | 16.7 | 4% | | float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x | 0.42x | 15.8 | 3% | | float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x | 0.27x | 10.0 | 2% | | float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x | 0.28x | 10.5 | 2% | | float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x | 0.28x | 10.5 | 2% | | float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% | | float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% | | float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x | 0.9 | 0% | | float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x | 0.6 | 0% | | float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x | 0.7 | 0% | | float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x | 0.9 | 0% | | float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x | 4.4 | 1% | | float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% | | float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x | 4.6 | 1% | | float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x | 4.1 | 1% | | float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x | 4.2 | 1% | | float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 
0.61x | 4.2 | 1% | | float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x | 12.7 | 3% | | float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x | 12.7 | 3% | | float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x | 11.9 | 3% | | float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x | 12.3 | 3% | | float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x | 11.9 | 3% | | float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x | 11.5 | 3% | | float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x | 0.57x | 58.7 | 13% | | float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x | 0.55x | 57.8 | 13% | | float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x | 0.58x | 59.4 | 13% | | float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x | 0.37x | 38.1 | 8% | | float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x | 0.40x | 40.9 | 9% | | float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x | 0.38x | 38.7 | 8% | | float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x | 2.3 | 0% | | float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% | | float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x | 3.2 | 1% | | float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% | | float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x | 2.6 | 1% | | float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% | | float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x | 10.7 | 2% | | float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x | 10.3 | 2% | | float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x | 10.1 | 2% | | float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x | 8.4 | 2% | | float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x | 8.2 | 2% | | 
float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x | 8.3 | 2% | | float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x | 0.34x | 19.3 | 4% | | float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x | 0.33x | 18.6 | 4% | | float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x | 0.32x | 17.9 | 4% | | float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x | 0.31x | 17.8 | 4% | | float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x | 0.30x | 17.1 | 4% | | float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x | 0.29x | 16.4 | 4% | | float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x | 0.49x | 76.2 | 17% | | float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x | 0.48x | 75.7 | 17% | | float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x | 0.48x | 74.8 | 16% | | float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x | 0.31x | 47.7 | 10% | | float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x | 0.30x | 46.8 | 10% | | float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x | 0.29x | 45.6 | 10% | | float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x | 4.6 | 1% | | float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x | 4.9 | 1% | | float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x | 5.9 | 1% | | float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x | 4.5 | 1% | | float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x | 4.9 | 1% | | float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x | 6.0 | 1% | | float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x | 44.1 | 10% | | float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x | 45.0 | 10% | | float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x | 1.27x | 45.4 | 10% | | float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 
97.5 | 64.2 | 1.90x | 0.66x | 22.0 | 5% | | float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x | 22.3 | 5% | | float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x | 0.65x | 22.9 | 5% | | float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x | 0.70x | 81.0 | 18% | | float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x | 0.69x | 80.5 | 18% | | float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x | 0.69x | 79.9 | 18% | | float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x | 0.31x | 36.1 | 8% | | float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x | 0.31x | 36.0 | 8% | | float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x | 0.31x | 36.0 | 8% | | float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 | 5.87x | 1.20x | 174.4 | 38% | | float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 | 6.42x | 1.21x | 174.4 | 38% | | float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 | 7.04x | 1.20x | 174.0 | 38% | | float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 | 2.83x | 0.60x | 84.2 | 18% | | float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 | 3.10x | 0.60x | 84.0 | 18% | | float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 | 3.39x | 0.60x | 83.5 | 18% | | float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x | 0.27x | 5.7 | 1% | | float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x | 0.25x | 6.1 | 1% | | float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x | 0.25x | 7.4 | 2% | | float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x | 0.30x | 5.7 | 1% | | float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x | 0.29x | 6.2 | 1% | | float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x | 0.29x | 7.4 | 2% | | float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% | | float16 | 2048 | 1024 | 8 | 246.7 
| 82.8 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% | | float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x | 1.02x | 53.7 | 12% | | float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x | 0.49x | 23.7 | 5% | | float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x | 0.49x | 24.1 | 5% | | float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% | | float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x | 0.57x | 84.1 | 18% | | float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x | 0.57x | 84.2 | 18% | | float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x | 0.57x | 84.3 | 18% | | float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x | 0.25x | 36.3 | 8% | | float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x | 0.25x | 36.4 | 8% | | float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x | 0.25x | 36.5 | 8% | | float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 | 5.90x | 1.19x | 177.1 | 39% | | float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 | 6.50x | 1.19x | 177.4 | 39% | | float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% | | float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 | 2.84x | 0.59x | 85.1 | 19% | | float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 | 3.12x | 0.59x | 85.2 | 19% | | float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 | 3.42x | 0.59x | 85.1 | 19% | | float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x | 0.0 | 0% | | float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x | 0.0 | 0% | | float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x | 0.0 | 0% | | float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x | 0.0 | 0% | | float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x | 0.0 | 0% | | float32 | 1 | 129 | 16 | 34.1 | 34.3 | 
32.0 | 15.2 | 1.07x | 0.47x | 0.0 | 0% | | float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x | 0.1 | 0% | | float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x | 0.1 | 0% | | float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x | 0.1 | 0% | | float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x | 0.1 | 0% | | float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x | 0.1 | 0% | | float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x | 0.1 | 0% | | float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x | 0.6 | 0% | | float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x | 0.7 | 0% | | float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x | 0.6 | 0% | | float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x | 0.6 | 0% | | float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x | 0.6 | 0% | | float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x | 0.6 | 0% | | float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x | 0.41x | 3.4 | 1% | | float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x | 0.48x | 4.0 | 1% | | float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x | 0.43x | 3.5 | 1% | | float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x | 0.37x | 3.0 | 1% | | float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x | 0.34x | 2.8 | 1% | | float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x | 0.27x | 2.2 | 0% | | float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x | 0.1 | 0% | | float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x | 0.1 | 0% | | float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x | 0.2 | 0% | | float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x | 0.1 | 0% | | float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x | 0.1 | 0% | | float32 | 8 | 129 | 16 | 35.3 | 35.5 | 
36.8 | 15.3 | 0.96x | 0.42x | 0.2 | 0% | | float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x | 0.9 | 0% | | float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x | 1.0 | 0% | | float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x | 0.9 | 0% | | float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x | 0.9 | 0% | | float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x | 0.9 | 0% | | float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x | 0.9 | 0% | | float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x | 4.7 | 1% | | float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x | 4.3 | 1% | | float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x | 4.7 | 1% | | float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x | 4.6 | 1% | | float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x | 4.5 | 1% | | float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x | 4.2 | 1% | | float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x | 0.47x | 26.2 | 6% | | float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x | 0.47x | 26.1 | 6% | | float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x | 0.44x | 24.1 | 5% | | float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x | 0.31x | 17.1 | 4% | | float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x | 0.30x | 16.7 | 4% | | float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x | 0.30x | 16.7 | 4% | | float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x | 1.0 | 0% | | float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x | 1.1 | 0% | | float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x | 1.2 | 0% | | float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x | 1.0 | 0% | | float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x | 1.1 | 0% | | float32 | 64 | 129 | 16 | 
38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x | 1.2 | 0% | | float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x | 6.4 | 1% | | float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x | 6.5 | 1% | | float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x | 6.4 | 1% | | float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x | 6.4 | 1% | | float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x | 6.3 | 1% | | float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x | 6.4 | 1% | | float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x | 19.6 | 4% | | float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x | 19.1 | 4% | | float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x | 0.56x | 18.0 | 4% | | float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x | 18.8 | 4% | | float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x | 18.6 | 4% | | float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x | 0.59x | 18.9 | 4% | | float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x | 0.47x | 71.4 | 16% | | float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x | 0.47x | 70.9 | 16% | | float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x | 0.46x | 68.9 | 15% | | float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x | 0.37x | 55.0 | 12% | | float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x | 0.35x | 52.8 | 12% | | float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x | 0.34x | 51.6 | 11% | | float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x | 3.1 | 1% | | float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x | 3.4 | 1% | | float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x | 3.9 | 1% | | float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x | 3.2 | 1% | | float32 | 256 | 129 | 8 | 45.6 | 
47.4 | 47.2 | 16.7 | 0.97x | 0.35x | 3.3 | 1% | | float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x | 3.6 | 1% | | float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x | 15.6 | 3% | | float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x | 15.4 | 3% | | float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x | 15.4 | 3% | | float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x | 13.0 | 3% | | float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x | 12.3 | 3% | | float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x | 12.3 | 3% | | float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x | 0.36x | 30.0 | 7% | | float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x | 0.35x | 28.8 | 6% | | float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x | 0.34x | 28.0 | 6% | | float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x | 0.34x | 28.3 | 6% | | float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x | 0.34x | 27.8 | 6% | | float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x | 0.32x | 26.8 | 6% | | float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x | 0.69x | 82.6 | 18% | | float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x | 0.69x | 81.6 | 18% | | float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 | 2.19x | 0.67x | 79.5 | 17% | | float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x | 0.54x | 62.7 | 14% | | float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x | 0.52x | 60.8 | 13% | | float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 | 1.65x | 0.51x | 59.3 | 13% | | float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x | 0.22x | 5.7 | 1% | | float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x | 0.21x | 5.9 | 1% | | float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x 
| 0.20x | 6.6 | 1% | | float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x | 0.24x | 5.6 | 1% | | float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x | 0.23x | 5.9 | 1% | | float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x | 0.22x | 6.6 | 1% | | float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x | 84.9 | 19% | | float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% | | float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x | 1.73x | 86.1 | 19% | | float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x | 46.2 | 10% | | float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x | 46.4 | 10% | | float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% | | float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x | 0.87x | 145.4 | 32% | | float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x | 0.87x | 145.5 | 32% | | float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x | 0.87x | 145.9 | 32% | | float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x | 0.43x | 72.2 | 16% | | float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x | 0.43x | 72.4 | 16% | | float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x | 0.44x | 72.5 | 16% | | float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 | 6.06x | 2.13x | 273.2 | 60% | | float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 | 6.47x | 2.13x | 272.9 | 60% | | float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% | | float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 | 3.35x | 1.20x | 151.3 | 33% | | float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 | 3.60x | 1.20x | 151.2 | 33% | | float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 | 3.91x | 1.20x | 150.9 | 33% | | float32 | 2048 | 128 | 4 | 
170.5 | 170.2 | 171.1 | 30.2 | 1.00x | 0.18x | 6.7 | 1% | | float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x | 0.17x | 7.0 | 2% | | float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x | 0.17x | 8.0 | 2% | | float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x | 0.21x | 6.7 | 1% | | float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x | 0.20x | 7.1 | 2% | | float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x | 0.20x | 8.0 | 2% | | float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x | 1.44x | 99.3 | 22% | | float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x | 1.43x | 99.8 | 22% | | float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x | 1.42x | 100.9 | 22% | | float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% | | float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% | | float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x | 0.76x | 52.5 | 12% | | float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% | | float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x | 0.96x | 151.9 | 33% | | float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x | 0.96x | 152.1 | 33% | | float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x | 0.48x | 74.1 | 16% | | float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x | 0.48x | 74.0 | 16% | | float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x | 0.48x | 74.2 | 16% | | float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% | | float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 | 6.43x | 2.10x | 273.4 | 60% | | float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 | 6.98x | 2.10x | 273.6 | 60% | | float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 | 3.36x | 1.19x | 152.6 | 33% | | 
float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 | 3.60x | 1.20x | 152.7 | 33% | | float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 | 3.92x | 1.20x | 152.7 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
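As a sanity check on the bandwidth formula above, the following sketch recomputes one table entry. It takes the `float32`, batch 1024, dim 131072, k=4 row and uses the third timing value in that row (1965.1 us). Two points are assumptions rather than statements from the PR: the `+ 8` term is read as one int64 index written per output element, and 1 GB is taken as 1e9 bytes. The helper name `effective_bandwidth_gbps` is hypothetical.

```python
# Worked example of the bandwidth formula from the test methodology.
SIZEOF = {"float32": 4, "float16": 2, "bfloat16": 2}  # bytes per element
INDEX_BYTES = 8    # assumed meaning of the "+ 8": one int64 index per output element
PEAK_GBPS = 456.0  # B580 peak bandwidth, as stated in the PR


def effective_bandwidth_gbps(dtype, bs, dim, k, time_us):
    """(bs*dim*sizeof + bs*k*(sizeof + 8)) / time, reported in GB/s (1 GB = 1e9 bytes)."""
    elem = SIZEOF[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9


bw = effective_bandwidth_gbps("float32", 1024, 131072, 4, 1965.1)
print(f"{bw:.1f} GB/s, {100 * bw / PEAK_GBPS:.0f}% of peak")
```

Under these assumptions the result reproduces the 273.2 GB/s and 60% reported in that row, which suggests the table's bandwidth column is derived from the third timing value.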
Summary
Approach
Add a subgroup topk kernel (`SubgroupTopKFunctor` in `TensorTopKSbtopkKernel.cpp`) where each 32-lane sub-group processes one slice entirely in registers:
- Each lane handles `dim/32` elements, maintaining a sorted top-k buffer via insertion sort (fully unrolled).
- The per-lane buffers are combined with a bitonic merge using `sycl::select_from_group` shuffle.
- The sub-group writes the `k` results. Output is already sorted.

Key properties:
- `largest` as a compile-time template parameter, which eliminates per-element direction branches
- `int32`/`int64` index dispatch mirroring CUDA's `canUse32BitIndexMath`

Dispatch:
`k <= 16` and `nsegments >= HW_thread_slots / 4` and `dim >= 32` → subgroup kernel (SORTED); otherwise → original kernel.

Files changed
- `TensorTopKSbtopkKernel.cpp` (new)
- `TensorTopKSbtopkKernel.h` (new): `SbtopkResult` enum + `sbtopk_try_launch` declaration
- `TensorTopKKernel.cpp`

Correctness
- 432/432 accuracy tests and 324/324 sortedness tests pass (`torch.topk(sorted=True)` output verified monotonic)

Benchmark summary
By batch size:
By dim:
Full 432-case results
XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.
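The timing protocol (median of 3 runs x 50 iterations, after 20 warmup iterations) can be sketched with a small harness. This is only an illustration of the protocol, not the script used for the table: the name `bench_us`, the `sync` hook, and the choice to average within a run before taking the median across runs are all assumptions.

```python
import statistics
import time


def bench_us(fn, sync=lambda: None, warmup=20, runs=3, iters=50):
    """Median across `runs` of the mean per-call time over `iters` calls, in microseconds."""
    for _ in range(warmup):
        fn()
    sync()  # make sure warmup work has finished before timing starts
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        sync()  # wait for queued device work before stopping the clock
        per_run.append((time.perf_counter() - start) / iters * 1e6)
    return statistics.median(per_run)


# CPU-only demonstration; the callable here is a stand-in for a topk call.
data = list(range(1000, 0, -1))
print(f"{bench_us(lambda: sorted(data)[:16]):.1f} us")
```

For a device measurement one would pass something like `fn=lambda: torch.topk(x, k)` together with `sync=torch.xpu.synchronize`, since kernel launches are asynchronous and the clock should not stop before the queue drains.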
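Test methodology
- Accuracy (432 cases): 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- Sortedness (324 cases): Verify `torch.topk(sorted=True)` output is monotonic for both `largest=True/False`.
- Benchmark (432 cases): Median of 3 runs x 50 iterations each, with 20 warmup iterations. `largest=True`.
- Bandwidth: `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).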
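To make the scheme and the checks concrete, here is a small pure-Python model of the approach, run against a sorted reference. It is a sketch, not the SYCL kernel: the strided assignment of elements to lanes is an assumption, the cross-lane step uses plain re-insertion as a stand-in for the bitonic merge over sub-group shuffles, and the names `insert_sorted` and `subgroup_topk` are hypothetical.

```python
import random

LANES = 32  # sub-group width used by the kernel described in this PR


def insert_sorted(buf, val, idx, k, largest):
    """One insertion-sort step: place (val, idx) into a sorted buffer capped at k entries."""
    pos = len(buf)
    while pos > 0 and (val > buf[pos - 1][0] if largest else val < buf[pos - 1][0]):
        pos -= 1
    if pos < k:
        buf.insert(pos, (val, idx))
        del buf[k:]  # drop whatever fell off the end


def subgroup_topk(row, k, largest=True):
    # Phase 1: each lane scans its share of the slice and keeps its own sorted top-k.
    # (Strided assignment i % LANES is an assumption about the access pattern.)
    lanes = [[] for _ in range(LANES)]
    for i, v in enumerate(row):
        insert_sorted(lanes[i % LANES], v, i, k, largest)
    # Phase 2: combine lane buffers pairwise in log2(32) = 5 rounds.
    # Stand-in for the bitonic merge: re-insert the partner lane's entries.
    stride = 1
    while stride < LANES:
        for lane in range(0, LANES, 2 * stride):
            for v, i in lanes[lane + stride]:
                insert_sorted(lanes[lane], v, i, k, largest)
        stride *= 2
    return lanes[0]  # already sorted, so no separate sort pass is needed


random.seed(0)
for dim in (128, 129, 1024, 1025):
    row = [random.random() for _ in range(dim)]
    for k in (4, 8, 16):
        for largest in (True, False):
            out = subgroup_topk(row, k, largest)
            # Accuracy and sortedness in one comparison against a sorted reference.
            assert [v for v, _ in out] == sorted(row, reverse=largest)[:k]
            # Returned indices must point at the returned values.
            assert all(row[i] == v for v, i in out)
print("all checks passed")
```

The model shows why the output needs no extra sort: every buffer stays ordered through both phases, so lane 0 ends up holding the top `k` entries in order.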