Add subgroup topk kernel for XPU (part1 of #3369)#3371

Merged: CuiYifeng merged 9 commits into main from jianyi/subgroup-topk on Apr 29, 2026

@jianyizh jianyizh commented Apr 17, 2026

Summary

  • Speedup vs original XPU: 1.3648x geomean over 432 cases, 130 wins (>1.05x), 40 regressions (<0.98x)
  • vs CUDA 4080S: 0.4574x geomean (>1 means XPU faster)
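The summary figures above can be reproduced from per-case speedups roughly as follows. This is a minimal sketch: the function name `summarize` and the sample inputs are made up for illustration, and only the thresholds (>1.05x for a win, <0.98x for a regression) come from the summary.

```python
import math

def summarize(speedups, win=1.05, regress=0.98):
    """Return (geomean, wins, regressions) for a list of per-case speedups."""
    # Geometric mean: exponential of the mean of the logs.
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    wins = sum(1 for s in speedups if s > win)
    regressions = sum(1 for s in speedups if s < regress)
    return geomean, wins, regressions

# Made-up speedups for illustration only (not taken from the benchmark table).
geo, wins, regs = summarize([2.0, 0.5, 1.0, 4.0])
```

A geomean is used rather than an arithmetic mean so that a 2x win and a 0.5x regression cancel out.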

Approach

Add a subgroup topk kernel (SubgroupTopKFunctor in TensorTopKSbtopkKernel.cpp) where each 32-lane sub-group processes one slice entirely in registers:

  • Phase 1: Each lane scans dim/32 elements, maintaining a sorted top-k buffer via insertion sort (fully unrolled).
  • Phase 2: 5-level bitonic merge across sub-group lanes via sycl::select_from_group shuffle.
  • Phase 3: Lane 0 writes k results. Output is already sorted.
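The three phases can be simulated on the CPU to see why lane 0 ends up with a sorted result. This is a sketch under stated assumptions, not the kernel itself: it is pure Python, handles `largest=True` only, assumes a strided assignment of elements to lanes, and uses a plain two-way merge with XOR partners as a stand-in for the bitonic merge over `sycl::select_from_group`. All function names are hypothetical.

```python
SUBGROUP_SIZE = 32  # lanes per sub-group, as stated in the description

def lane_topk(values, lane, k):
    """Phase 1: one lane scans its strided share and keeps a sorted top-k buffer."""
    buf = []  # (value, index) pairs, largest first
    for idx in range(lane, len(values), SUBGROUP_SIZE):
        item = (values[idx], idx)
        pos = len(buf)
        # Insertion sort step: walk left past smaller entries.
        while pos > 0 and item[0] > buf[pos - 1][0]:
            pos -= 1
        buf.insert(pos, item)
        del buf[k:]  # keep at most k entries
    return buf

def merge_topk(a, b, k):
    """Merge two descending buffers and keep the k largest entries."""
    out, i, j = [], 0, 0
    while len(out) < k and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i][0] >= b[j][0]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out

def subgroup_topk(values, k):
    """Simulate the three phases for one slice; returns lane 0's buffer."""
    lanes = [lane_topk(values, lane, k) for lane in range(SUBGROUP_SIZE)]
    offset = SUBGROUP_SIZE // 2
    while offset >= 1:  # Phase 2: 5 exchange levels (offsets 16, 8, 4, 2, 1)
        # Each lane reads its partner's buffer, like a sub-group shuffle.
        lanes = [merge_topk(lanes[l], lanes[l ^ offset], k)
                 for l in range(SUBGROUP_SIZE)]
        offset //= 2
    return lanes[0]  # Phase 3: lane 0 holds the sorted result

vals = [(i * 37) % 1009 for i in range(1000)]
result = subgroup_topk(vals, 8)
```

At each level a lane and its XOR partner cover disjoint sets of original lanes, so no element is merged twice.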

Key properties:

  • Zero SLM (shared local memory), zero barriers
  • largest is a compile-time template parameter, which eliminates per-element direction branches
  • int32/int64 index dispatch mirroring CUDA's canUse32BitIndexMath
  • Kernel isolated in a separate translation unit to prevent SYCL compiler global optimization interference with the original kernel
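For the int32/int64 index dispatch, the underlying question is whether every element offset fits in a signed 32-bit integer. The sketch below illustrates that idea only. It is not ATen's `canUse32BitIndexMath`, the name `fits_int32_indexing` is hypothetical, and it assumes non-negative strides measured in elements.

```python
INT32_MAX = 2**31 - 1

def fits_int32_indexing(sizes, strides):
    """True if every element offset (in elements) fits in a signed 32-bit int."""
    numel = 1
    for s in sizes:
        numel *= s
    if numel == 0:
        return True
    if numel > INT32_MAX:
        return False
    # Largest reachable offset: last index along every dimension.
    max_offset = sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return max_offset <= INT32_MAX

# Largest contiguous benchmark shape (bs=2048, dim=131072) still fits.
ok = fits_int32_indexing([2048, 131072], [131072, 1])
```

When the check fails, a kernel templated on the index type would fall back to 64-bit indices.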

Dispatch: k <= 16 and nsegments >= HW_thread_slots / 4 and dim >= 32 → subgroup kernel (SORTED); otherwise → original kernel.
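The dispatch rule reads as a simple predicate. A minimal sketch, assuming the thresholds exactly as written in this description; the function name, the parameter names, and the `hw_thread_slots` value used in the example are hypothetical.

```python
def use_subgroup_kernel(k, nsegments, dim, hw_thread_slots, max_k=16, min_dim=32):
    """Return True when the subgroup kernel would be chosen over the original."""
    return (
        k <= max_k
        and nsegments >= hw_thread_slots / 4  # enough slices to fill the device
        and dim >= min_dim                    # at least one element per lane
    )

# hw_thread_slots=2048 is an illustrative value, not a measured device property.
chosen = use_subgroup_kernel(k=8, nsegments=1024, dim=8192, hw_thread_slots=2048)
```

The batch-size condition exists because each slice occupies only one sub-group, so small batches would leave most of the device idle.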

Files changed

| File | Description |
| --- | --- |
| TensorTopKSbtopkKernel.cpp (new) | Subgroup topk kernel + dispatch logic |
| TensorTopKSbtopkKernel.h (new) | SbtopkResult enum + sbtopk_try_launch declaration |
| TensorTopKKernel.cpp | Modified caller: tries optimized path first, skips sort if already sorted |

Correctness

  • Accuracy: 432/432 pass (CPU vs XPU, sort-then-compare)
  • Sortedness: 324/324 pass (torch.topk(sorted=True) output verified monotonic)
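The two checks can be sketched in a few lines. This is an assumption about what "sort-then-compare" and "monotonic" mean here, written against plain Python lists rather than tensors; the helper names are hypothetical, and a real test on floating-point data might use a tolerance instead of exact equality.

```python
def same_topk_values(ref_values, test_values):
    """Sort-then-compare: order-insensitive equality of two top-k value lists."""
    return sorted(ref_values) == sorted(test_values)

def is_monotonic(values, largest=True):
    """Check that sorted=True output is non-increasing (or non-decreasing)."""
    pairs = list(zip(values, values[1:]))
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)

accuracy_ok = same_topk_values([9, 7, 7, 5], [7, 9, 5, 7])
sorted_ok = is_monotonic([9, 7, 7, 5], largest=True)
```

Sorting both sides before comparing makes the accuracy check independent of output order, which matters when ties can be returned in a different order on CPU and XPU.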

Benchmark summary

By batch size:

| bs | speedup vs orig | vs CUDA 4080S | cases |
| --- | --- | --- | --- |
| 1 | 1.00x | 0.41x | 72 |
| 8 | 1.00x | 0.43x | 72 |
| 64 | 1.00x | 0.42x | 72 |
| 256 | 1.00x | 0.36x | 72 |
| 1024 | 2.53x | 0.63x | 72 |
| 2048 | 2.55x | 0.55x | 72 |

By dim:

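| dim | speedup vs orig | vs CUDA 4080S | cases |
| --- | --- | --- | --- |
| 128 | 1.00x | 0.36x | 54 |
| 129 | 1.00x | 0.39x | 54 |
| 1024 | 1.47x | 0.77x | 54 |
| 1025 | 1.35x | 0.63x | 54 |
| 8192 | 1.62x | 0.60x | 54 |
| 8193 | 1.30x | 0.48x | 54 |
| 131072 | 1.87x | 0.34x | 54 |
| 131073 | 1.53x | 0.28x | 54 |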

Full 432-case results

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.

Columns (space-separated): dtype, bs, dim, k, XPU orig (us), XPU opt (us), CUDA 4080S (us), speedup (XPU orig / XPU opt), vs CUDA (CUDA / XPU opt), BW (GB/s), %peak
bfloat16 1 128 4 30.6 30.7 14.4 1.00x 0.47x 0.0 0%
bfloat16 1 128 8 30.5 30.4 14.3 1.00x 0.47x 0.0 0%
bfloat16 1 128 16 30.4 30.4 14.3 1.00x 0.47x 0.0 0%
bfloat16 1 129 4 30.3 30.6 14.7 0.99x 0.48x 0.0 0%
bfloat16 1 129 8 30.4 30.5 14.6 1.00x 0.48x 0.0 0%
bfloat16 1 129 16 30.4 30.4 14.6 1.00x 0.48x 0.0 0%
bfloat16 1 1024 4 30.5 30.5 19.0 1.00x 0.62x 0.1 0%
bfloat16 1 1024 8 30.5 30.6 18.3 1.00x 0.60x 0.1 0%
bfloat16 1 1024 16 30.4 30.4 18.6 1.00x 0.61x 0.1 0%
bfloat16 1 1025 4 30.5 30.5 20.0 1.00x 0.66x 0.1 0%
bfloat16 1 1025 8 30.4 30.5 20.2 1.00x 0.66x 0.1 0%
bfloat16 1 1025 16 30.4 30.5 19.8 1.00x 0.65x 0.1 0%
bfloat16 1 8192 4 45.7 44.4 37.4 1.03x 0.84x 0.4 0%
bfloat16 1 8192 8 51.6 48.6 42.2 1.06x 0.87x 0.3 0%
bfloat16 1 8192 16 48.6 48.6 39.1 1.00x 0.80x 0.3 0%
bfloat16 1 8193 4 45.7 48.4 37.0 0.94x 0.76x 0.3 0%
bfloat16 1 8193 8 48.7 48.6 40.3 1.00x 0.83x 0.3 0%
bfloat16 1 8193 16 48.5 48.5 39.7 1.00x 0.82x 0.3 0%
bfloat16 1 131072 4 368.8 375.7 46.3 0.98x 0.12x 0.7 0%
bfloat16 1 131072 8 396.4 402.5 46.3 0.98x 0.12x 0.7 0%
bfloat16 1 131072 16 430.6 426.2 46.4 1.01x 0.11x 0.6 0%
bfloat16 1 131073 4 370.4 364.3 46.8 1.02x 0.13x 0.7 0%
bfloat16 1 131073 8 392.5 396.7 46.8 0.99x 0.12x 0.7 0%
bfloat16 1 131073 16 413.9 421.3 46.7 0.98x 0.11x 0.6 0%
bfloat16 8 128 4 30.4 30.4 14.9 1.00x 0.49x 0.1 0%
bfloat16 8 128 8 30.5 30.6 14.6 1.00x 0.48x 0.1 0%
bfloat16 8 128 16 30.4 30.3 14.6 1.00x 0.48x 0.1 0%
bfloat16 8 129 4 30.3 30.5 15.1 0.99x 0.50x 0.1 0%
bfloat16 8 129 8 30.3 30.5 15.1 0.99x 0.50x 0.1 0%
bfloat16 8 129 16 30.4 30.5 15.1 1.00x 0.50x 0.1 0%
bfloat16 8 1024 4 30.4 30.5 19.3 1.00x 0.63x 0.5 0%
bfloat16 8 1024 8 30.4 30.5 19.4 1.00x 0.64x 0.6 0%
bfloat16 8 1024 16 30.4 30.4 19.5 1.00x 0.64x 0.6 0%
bfloat16 8 1025 4 30.4 30.5 20.5 1.00x 0.67x 0.5 0%
bfloat16 8 1025 8 30.6 30.4 20.4 1.01x 0.67x 0.6 0%
bfloat16 8 1025 16 30.4 30.4 20.4 1.00x 0.67x 0.6 0%
bfloat16 8 8192 4 54.7 51.6 42.2 1.06x 0.82x 2.5 1%
bfloat16 8 8192 8 51.6 54.6 39.9 0.95x 0.73x 2.4 1%
bfloat16 8 8192 16 54.8 54.5 42.4 1.01x 0.78x 2.4 1%
bfloat16 8 8193 4 54.5 54.5 43.3 1.00x 0.79x 2.4 1%
bfloat16 8 8193 8 54.7 54.7 43.5 1.00x 0.80x 2.4 1%
bfloat16 8 8193 16 54.6 48.6 42.7 1.12x 0.88x 2.7 1%
bfloat16 8 131072 4 388.2 394.6 56.8 0.98x 0.14x 5.3 1%
bfloat16 8 131072 8 422.7 398.6 56.5 1.06x 0.14x 5.3 1%
bfloat16 8 131072 16 427.5 433.5 56.7 0.99x 0.13x 4.8 1%
bfloat16 8 131073 4 392.3 405.1 56.8 0.97x 0.14x 5.2 1%
bfloat16 8 131073 8 404.6 406.4 57.1 1.00x 0.14x 5.2 1%
bfloat16 8 131073 16 442.0 436.3 56.9 1.01x 0.13x 4.8 1%
bfloat16 64 128 4 30.5 30.5 14.9 1.00x 0.49x 0.6 0%
bfloat16 64 128 8 30.5 30.6 14.7 1.00x 0.48x 0.7 0%
bfloat16 64 128 16 30.6 30.4 14.8 1.01x 0.49x 0.9 0%
bfloat16 64 129 4 30.6 30.4 15.4 1.01x 0.51x 0.6 0%
bfloat16 64 129 8 30.5 30.4 15.5 1.00x 0.51x 0.7 0%
bfloat16 64 129 16 30.6 30.4 15.2 1.01x 0.50x 0.9 0%
bfloat16 64 1024 4 30.6 30.5 19.5 1.00x 0.64x 4.4 1%
bfloat16 64 1024 8 30.5 30.5 19.5 1.00x 0.64x 4.5 1%
bfloat16 64 1024 16 30.5 30.6 19.5 1.00x 0.64x 4.6 1%
bfloat16 64 1025 4 33.7 33.6 20.7 1.00x 0.62x 4.0 1%
bfloat16 64 1025 8 33.7 33.6 20.6 1.00x 0.61x 4.1 1%
bfloat16 64 1025 16 33.5 33.7 20.6 0.99x 0.61x 4.2 1%
bfloat16 64 8192 4 93.1 92.2 49.9 1.01x 0.54x 11.4 3%
bfloat16 64 8192 8 97.7 96.6 49.5 1.01x 0.51x 10.9 2%
bfloat16 64 8192 16 100.8 101.2 49.6 1.00x 0.49x 10.5 2%
bfloat16 64 8193 4 96.2 90.1 49.8 1.07x 0.55x 11.7 3%
bfloat16 64 8193 8 97.9 96.3 49.6 1.02x 0.52x 10.9 2%
bfloat16 64 8193 16 100.2 100.3 49.7 1.00x 0.50x 10.6 2%
bfloat16 64 131072 4 901.8 888.7 162.9 1.01x 0.18x 18.9 4%
bfloat16 64 131072 8 939.7 948.2 164.6 0.99x 0.17x 17.7 4%
bfloat16 64 131072 16 999.0 993.3 164.4 1.01x 0.17x 16.9 4%
bfloat16 64 131073 4 902.2 889.0 166.8 1.01x 0.19x 18.9 4%
bfloat16 64 131073 8 944.7 942.0 166.8 1.00x 0.18x 17.8 4%
bfloat16 64 131073 16 1002.6 1000.7 165.5 1.00x 0.17x 16.8 4%
bfloat16 256 128 4 33.7 33.7 15.7 1.00x 0.47x 2.2 0%
bfloat16 256 128 8 33.8 33.6 15.6 1.01x 0.46x 2.6 1%
bfloat16 256 128 16 33.6 33.6 15.7 1.00x 0.47x 3.2 1%
bfloat16 256 129 4 33.7 33.6 16.5 1.00x 0.49x 2.3 0%
bfloat16 256 129 8 33.6 33.6 16.3 1.00x 0.49x 2.6 1%
bfloat16 256 129 16 33.6 33.5 16.3 1.00x 0.49x 3.2 1%
bfloat16 256 1024 4 56.3 56.1 41.7 1.00x 0.74x 9.5 2%
bfloat16 256 1024 8 59.0 58.9 42.4 1.00x 0.72x 9.2 2%
bfloat16 256 1024 16 59.3 59.2 42.6 1.00x 0.72x 9.5 2%
bfloat16 256 1025 4 71.1 72.4 45.9 0.98x 0.63x 7.4 2%
bfloat16 256 1025 8 75.1 74.1 46.7 1.01x 0.63x 7.4 2%
bfloat16 256 1025 16 75.4 75.4 47.1 1.00x 0.62x 7.5 2%
bfloat16 256 8192 4 260.0 263.7 75.2 0.99x 0.29x 15.9 3%
bfloat16 256 8192 8 270.4 269.8 75.0 1.00x 0.28x 15.6 3%
bfloat16 256 8192 16 287.6 290.5 75.2 0.99x 0.26x 14.6 3%
bfloat16 256 8193 4 261.0 268.2 75.1 0.97x 0.28x 15.7 3%
bfloat16 256 8193 8 273.3 273.1 75.6 1.00x 0.28x 15.4 3%
bfloat16 256 8193 16 287.6 288.1 75.7 1.00x 0.26x 14.7 3%
bfloat16 256 131072 4 3096.6 3087.7 439.2 1.00x 0.14x 21.7 5%
bfloat16 256 131072 8 3283.4 3269.1 436.9 1.00x 0.13x 20.5 5%
bfloat16 256 131072 16 3464.5 3469.5 440.9 1.00x 0.13x 19.4 4%
bfloat16 256 131073 4 3085.3 3093.6 441.5 1.00x 0.14x 21.7 5%
bfloat16 256 131073 8 3282.4 3267.2 435.4 1.00x 0.13x 20.5 5%
bfloat16 256 131073 16 3462.5 3470.8 443.1 1.00x 0.13x 19.3 4%
bfloat16 1024 128 4 70.9 69.5 22.1 1.02x 0.32x 4.4 1%
bfloat16 1024 128 8 75.3 75.2 22.0 1.00x 0.29x 4.6 1%
bfloat16 1024 128 16 76.9 76.7 22.3 1.00x 0.29x 5.6 1%
bfloat16 1024 129 4 70.8 69.6 24.4 1.02x 0.35x 4.4 1%
bfloat16 1024 129 8 75.4 75.2 24.4 1.00x 0.32x 4.6 1%
bfloat16 1024 129 16 76.8 76.7 24.5 1.00x 0.32x 5.6 1%
bfloat16 1024 1024 4 152.6 56.2 63.1 2.72x 1.12x 38.0 8%
bfloat16 1024 1024 8 156.0 56.2 63.3 2.78x 1.13x 38.8 9%
bfloat16 1024 1024 16 157.2 57.5 63.4 2.73x 1.10x 39.3 9%
bfloat16 1024 1025 4 218.4 86.0 64.5 2.54x 0.75x 24.9 5%
bfloat16 1024 1025 8 223.7 86.8 64.7 2.58x 0.75x 25.1 6%
bfloat16 1024 1025 16 225.8 87.3 64.8 2.59x 0.74x 25.9 6%
bfloat16 1024 8192 4 939.4 248.0 147.6 3.79x 0.60x 67.8 15%
bfloat16 1024 8192 8 985.8 249.3 147.4 3.95x 0.59x 67.6 15%
bfloat16 1024 8192 16 1036.1 251.2 148.0 4.12x 0.59x 67.4 15%
bfloat16 1024 8193 4 941.7 406.6 149.2 2.32x 0.37x 41.4 9%
bfloat16 1024 8193 8 988.2 407.0 148.4 2.43x 0.36x 41.4 9%
bfloat16 1024 8193 16 1040.8 406.8 149.3 2.56x 0.37x 41.6 9%
bfloat16 1024 131072 4 11500.2 1762.5 1865.9 6.52x 1.06x 152.3 33%
bfloat16 1024 131072 8 12192.8 1762.8 1867.4 6.92x 1.06x 152.3 33%
bfloat16 1024 131072 16 12859.4 1767.0 1863.0 7.28x 1.05x 152.0 33%
bfloat16 1024 131073 4 11514.6 2998.5 1940.1 3.84x 0.65x 89.5 20%
bfloat16 1024 131073 8 12173.3 2998.4 1936.8 4.06x 0.65x 89.6 20%
bfloat16 1024 131073 16 12856.9 3002.4 1944.4 4.28x 0.65x 89.5 20%
bfloat16 2048 128 4 113.9 113.8 30.5 1.00x 0.27x 5.3 1%
bfloat16 2048 128 8 120.3 119.9 30.5 1.00x 0.25x 5.7 1%
bfloat16 2048 128 16 122.9 122.9 30.9 1.00x 0.25x 6.9 2%
bfloat16 2048 129 4 113.8 114.0 35.4 1.00x 0.31x 5.4 1%
bfloat16 2048 129 8 120.1 120.1 35.2 1.00x 0.29x 5.8 1%
bfloat16 2048 129 16 123.2 123.1 35.7 1.00x 0.29x 7.0 2%
bfloat16 2048 1024 4 276.3 96.4 85.7 2.87x 0.89x 44.4 10%
bfloat16 2048 1024 8 284.8 97.5 86.0 2.92x 0.88x 44.7 10%
bfloat16 2048 1024 16 286.1 99.3 86.4 2.88x 0.87x 45.5 10%
bfloat16 2048 1025 4 407.9 158.2 88.4 2.58x 0.56x 27.1 6%
bfloat16 2048 1025 8 423.7 158.8 88.7 2.67x 0.56x 27.5 6%
bfloat16 2048 1025 16 428.3 160.0 89.0 2.68x 0.56x 28.3 6%
bfloat16 2048 8192 4 1875.1 496.1 234.9 3.78x 0.47x 67.8 15%
bfloat16 2048 8192 8 1956.5 497.2 234.1 3.94x 0.47x 67.8 15%
bfloat16 2048 8192 16 2058.5 498.7 235.0 4.13x 0.47x 67.9 15%
bfloat16 2048 8193 4 1873.4 825.1 236.2 2.27x 0.29x 40.8 9%
bfloat16 2048 8193 8 1959.0 824.1 237.3 2.38x 0.29x 40.9 9%
bfloat16 2048 8193 16 2065.1 825.7 237.4 2.50x 0.29x 41.0 9%
bfloat16 2048 131072 4 22903.6 3485.4 3646.5 6.57x 1.05x 154.1 34%
bfloat16 2048 131072 8 24193.6 3484.6 3644.1 6.94x 1.05x 154.1 34%
bfloat16 2048 131072 16 25590.8 3487.7 3646.2 7.34x 1.05x 154.0 34%
bfloat16 2048 131073 4 22872.9 5925.0 3774.7 3.86x 0.64x 90.6 20%
bfloat16 2048 131073 8 24187.7 5933.4 3780.1 4.08x 0.64x 90.5 20%
bfloat16 2048 131073 16 25604.8 5934.5 3773.0 4.31x 0.64x 90.5 20%
float16 1 128 4 30.7 30.7 14.3 1.00x 0.47x 0.0 0%
float16 1 128 8 30.6 30.6 14.0 1.00x 0.46x 0.0 0%
float16 1 128 16 30.5 30.5 14.0 1.00x 0.46x 0.0 0%
float16 1 129 4 30.6 30.6 14.4 1.00x 0.47x 0.0 0%
float16 1 129 8 30.6 30.3 14.4 1.01x 0.48x 0.0 0%
float16 1 129 16 30.5 30.4 14.7 1.00x 0.48x 0.0 0%
float16 1 1024 4 30.6 30.7 17.4 1.00x 0.57x 0.1 0%
float16 1 1024 8 30.5 30.5 17.5 1.00x 0.57x 0.1 0%
float16 1 1024 16 30.4 30.5 17.5 1.00x 0.57x 0.1 0%
float16 1 1025 4 30.5 30.5 17.8 1.00x 0.58x 0.1 0%
float16 1 1025 8 30.4 30.4 18.6 1.00x 0.61x 0.1 0%
float16 1 1025 16 30.4 30.3 20.1 1.00x 0.66x 0.1 0%
float16 1 8192 4 41.4 38.2 33.6 1.08x 0.88x 0.4 0%
float16 1 8192 8 41.2 48.4 33.8 0.85x 0.70x 0.3 0%
float16 1 8192 16 45.6 48.4 31.5 0.94x 0.65x 0.3 0%
float16 1 8193 4 45.6 41.0 37.4 1.11x 0.91x 0.4 0%
float16 1 8193 8 42.6 44.1 36.9 0.97x 0.84x 0.4 0%
float16 1 8193 16 45.6 51.3 33.3 0.89x 0.65x 0.3 0%
float16 1 131072 4 297.2 304.4 46.2 0.98x 0.15x 0.9 0%
float16 1 131072 8 326.6 335.1 46.5 0.97x 0.14x 0.8 0%
float16 1 131072 16 348.1 355.4 46.1 0.98x 0.13x 0.7 0%
float16 1 131073 4 308.7 286.0 46.9 1.08x 0.16x 0.9 0%
float16 1 131073 8 321.3 325.3 46.8 0.99x 0.14x 0.8 0%
float16 1 131073 16 353.2 378.6 46.6 0.93x 0.12x 0.7 0%
float16 8 128 4 30.5 30.2 14.4 1.01x 0.48x 0.1 0%
float16 8 128 8 30.4 30.2 14.5 1.01x 0.48x 0.1 0%
float16 8 128 16 30.4 30.4 14.5 1.00x 0.48x 0.1 0%
float16 8 129 4 30.5 30.2 14.8 1.01x 0.49x 0.1 0%
float16 8 129 8 30.3 30.2 14.9 1.00x 0.49x 0.1 0%
float16 8 129 16 30.5 30.4 14.9 1.00x 0.49x 0.1 0%
float16 8 1024 4 30.6 30.4 19.1 1.01x 0.63x 0.5 0%
float16 8 1024 8 30.5 30.4 19.2 1.00x 0.63x 0.6 0%
float16 8 1024 16 30.4 30.3 19.3 1.00x 0.64x 0.6 0%
float16 8 1025 4 30.5 30.4 19.5 1.00x 0.64x 0.6 0%
float16 8 1025 8 30.5 30.3 20.4 1.01x 0.67x 0.6 0%
float16 8 1025 16 30.5 30.3 20.5 1.01x 0.68x 0.6 0%
float16 8 8192 4 45.6 45.5 37.9 1.00x 0.83x 2.9 1%
float16 8 8192 8 48.4 48.5 39.8 1.00x 0.82x 2.7 1%
float16 8 8192 16 48.5 51.5 41.7 0.94x 0.81x 2.6 1%
float16 8 8193 4 48.5 45.5 39.2 1.07x 0.86x 2.9 1%
float16 8 8193 8 45.6 48.6 40.7 0.94x 0.84x 2.7 1%
float16 8 8193 16 54.5 51.7 43.0 1.05x 0.83x 2.6 1%
float16 8 131072 4 309.9 334.0 56.0 0.93x 0.17x 6.3 1%
float16 8 131072 8 338.1 356.0 56.1 0.95x 0.16x 5.9 1%
float16 8 131072 16 393.3 387.7 56.3 1.01x 0.15x 5.4 1%
float16 8 131073 4 314.9 313.8 56.2 1.00x 0.18x 6.7 1%
float16 8 131073 8 341.7 344.2 56.3 0.99x 0.16x 6.1 1%
float16 8 131073 16 366.4 378.0 56.3 0.97x 0.15x 5.6 1%
float16 64 128 4 30.5 30.1 14.9 1.01x 0.50x 0.6 0%
float16 64 128 8 30.5 30.2 14.7 1.01x 0.49x 0.7 0%
float16 64 128 16 30.4 30.2 14.7 1.01x 0.49x 0.9 0%
float16 64 129 4 30.6 30.2 15.3 1.01x 0.51x 0.6 0%
float16 64 129 8 30.6 30.2 15.2 1.01x 0.50x 0.7 0%
float16 64 129 16 30.5 30.2 15.1 1.01x 0.50x 0.9 0%
float16 64 1024 4 30.4 30.4 19.2 1.00x 0.63x 4.4 1%
float16 64 1024 8 30.4 30.4 19.3 1.00x 0.63x 4.5 1%
float16 64 1024 16 30.4 30.3 19.4 1.00x 0.64x 4.7 1%
float16 64 1025 4 32.2 32.0 19.7 1.01x 0.62x 4.2 1%
float16 64 1025 8 32.1 32.1 20.4 1.00x 0.64x 4.2 1%
float16 64 1025 16 33.6 33.6 20.4 1.00x 0.61x 4.2 1%
float16 64 8192 4 81.3 84.2 49.4 0.97x 0.59x 12.5 3%
float16 64 8192 8 83.0 84.2 49.2 0.99x 0.58x 12.5 3%
float16 64 8192 16 88.7 90.4 49.2 0.98x 0.54x 11.7 3%
float16 64 8193 4 81.3 80.1 49.4 1.01x 0.62x 13.1 3%
float16 64 8193 8 87.2 84.0 49.4 1.04x 0.59x 12.5 3%
float16 64 8193 16 90.2 88.8 49.4 1.02x 0.56x 11.9 3%
float16 64 131072 4 752.0 723.7 162.1 1.04x 0.22x 23.2 5%
float16 64 131072 8 788.0 782.2 160.5 1.01x 0.21x 21.5 5%
float16 64 131072 16 853.1 866.5 162.4 0.98x 0.19x 19.4 4%
float16 64 131073 4 712.3 709.2 161.6 1.00x 0.23x 23.7 5%
float16 64 131073 8 784.4 775.9 163.9 1.01x 0.21x 21.6 5%
float16 64 131073 16 866.1 857.3 162.9 1.01x 0.19x 19.6 4%
float16 256 128 4 33.7 33.6 15.5 1.00x 0.46x 2.3 0%
float16 256 128 8 33.7 33.6 15.6 1.00x 0.46x 2.6 1%
float16 256 128 16 33.7 33.6 15.6 1.00x 0.46x 3.2 1%
float16 256 129 4 33.7 33.5 16.0 1.01x 0.48x 2.3 0%
float16 256 129 8 33.7 33.5 15.9 1.01x 0.47x 2.6 1%
float16 256 129 16 33.6 33.5 16.1 1.00x 0.48x 3.2 1%
float16 256 1024 4 50.6 50.8 37.9 1.00x 0.75x 10.5 2%
float16 256 1024 8 53.1 53.0 38.8 1.00x 0.73x 10.3 2%
float16 256 1024 16 55.0 56.0 39.9 0.98x 0.71x 10.1 2%
float16 256 1025 4 63.5 63.5 42.0 1.00x 0.66x 8.4 2%
float16 256 1025 8 64.6 66.3 43.1 0.97x 0.65x 8.2 2%
float16 256 1025 16 69.5 67.9 43.8 1.02x 0.65x 8.3 2%
float16 256 8192 4 219.8 221.4 74.1 0.99x 0.33x 19.0 4%
float16 256 8192 8 233.9 234.1 74.4 1.00x 0.32x 18.0 4%
float16 256 8192 16 248.0 250.8 74.7 0.99x 0.30x 16.9 4%
float16 256 8193 4 217.9 220.0 74.3 0.99x 0.34x 19.1 4%
float16 256 8193 8 235.5 232.7 74.8 1.01x 0.32x 18.1 4%
float16 256 8193 16 252.1 257.4 74.9 0.98x 0.29x 16.5 4%
float16 256 131072 4 2409.4 2421.9 428.9 0.99x 0.18x 27.7 6%
float16 256 131072 8 2673.7 2662.8 427.9 1.00x 0.16x 25.2 6%
float16 256 131072 16 2935.0 2934.9 428.2 1.00x 0.15x 22.9 5%
float16 256 131073 4 2405.3 2442.5 431.9 0.98x 0.18x 27.5 6%
float16 256 131073 8 2662.4 2677.0 429.8 0.99x 0.16x 25.1 5%
float16 256 131073 16 2941.0 2949.7 432.2 1.00x 0.15x 22.8 5%
float16 1024 128 4 67.6 67.6 20.9 1.00x 0.31x 4.5 1%
float16 1024 128 8 70.7 69.7 20.9 1.01x 0.30x 4.9 1%
float16 1024 128 16 71.4 71.4 21.4 1.00x 0.30x 6.0 1%
float16 1024 129 4 66.5 66.6 23.3 1.00x 0.35x 4.6 1%
float16 1024 129 8 70.8 70.1 23.1 1.01x 0.33x 4.9 1%
float16 1024 129 16 71.2 72.4 23.4 0.98x 0.32x 5.9 1%
float16 1024 1024 4 132.5 48.4 62.7 2.74x 1.30x 44.2 10%
float16 1024 1024 8 136.5 48.7 63.0 2.80x 1.29x 44.7 10%
float16 1024 1024 16 143.6 49.7 63.1 2.89x 1.27x 45.5 10%
float16 1024 1025 4 185.3 97.8 64.2 1.89x 0.66x 21.9 5%
float16 1024 1025 8 192.7 97.7 64.4 1.97x 0.66x 22.3 5%
float16 1024 1025 16 206.3 99.0 64.5 2.08x 0.65x 22.9 5%
float16 1024 8192 4 793.1 198.8 145.0 3.99x 0.73x 84.6 19%
float16 1024 8192 8 840.3 199.1 144.6 4.22x 0.73x 84.7 19%
float16 1024 8192 16 907.4 201.8 145.5 4.50x 0.72x 83.9 18%
float16 1024 8193 4 799.0 456.2 146.1 1.75x 0.32x 36.9 8%
float16 1024 8193 8 838.6 457.3 146.5 1.83x 0.32x 36.9 8%
float16 1024 8193 16 912.3 459.8 146.2 1.98x 0.32x 36.8 8%
float16 1024 131072 4 9033.3 1535.9 1846.9 5.88x 1.20x 174.8 38%
float16 1024 131072 8 9885.6 1542.6 1856.1 6.41x 1.20x 174.1 38%
float16 1024 131072 16 10870.4 1538.7 1858.5 7.06x 1.21x 174.6 38%
float16 1024 131073 4 9011.7 3193.9 1924.0 2.82x 0.60x 84.1 18%
float16 1024 131073 8 9922.9 3185.2 1921.5 3.12x 0.60x 84.3 18%
float16 1024 131073 16 10905.6 3186.0 1926.4 3.42x 0.60x 84.3 18%
float16 2048 128 4 106.8 107.8 28.3 0.99x 0.26x 5.6 1%
float16 2048 128 8 112.6 112.5 28.5 1.00x 0.25x 6.1 1%
float16 2048 128 16 115.6 114.5 29.2 1.01x 0.26x 7.4 2%
float16 2048 129 4 106.9 108.1 32.6 0.99x 0.30x 5.6 1%
float16 2048 129 8 112.5 112.4 32.7 1.00x 0.29x 6.2 1%
float16 2048 129 16 115.9 115.4 33.5 1.00x 0.29x 7.4 2%
float16 2048 1024 4 236.3 81.3 85.1 2.91x 1.05x 52.6 12%
float16 2048 1024 8 246.7 82.8 85.7 2.98x 1.04x 52.6 12%
float16 2048 1024 16 259.7 84.4 86.0 3.08x 1.02x 53.6 12%
float16 2048 1025 4 345.5 179.5 87.7 1.92x 0.49x 23.8 5%
float16 2048 1025 8 358.4 180.9 88.0 1.98x 0.49x 24.1 5%
float16 2048 1025 16 380.3 182.2 88.5 2.09x 0.49x 24.8 5%
float16 2048 8192 4 1572.3 399.3 228.7 3.94x 0.57x 84.2 18%
float16 2048 8192 8 1662.5 400.0 228.5 4.16x 0.57x 84.3 18%
float16 2048 8192 16 1808.5 401.1 230.5 4.51x 0.57x 84.5 19%
float16 2048 8193 4 1573.6 924.3 231.7 1.70x 0.25x 36.4 8%
float16 2048 8193 8 1672.3 926.3 231.6 1.81x 0.25x 36.4 8%
float16 2048 8193 16 1813.4 931.1 233.1 1.95x 0.25x 36.4 8%
float16 2048 131072 4 17900.0 3035.1 3622.2 5.90x 1.19x 176.9 39%
float16 2048 131072 8 19669.5 3028.6 3607.3 6.49x 1.19x 177.3 39%
float16 2048 131072 16 21602.8 3043.9 3607.4 7.10x 1.19x 176.5 39%
float16 2048 131073 4 17893.0 6305.2 3743.3 2.84x 0.59x 85.2 19%
float16 2048 131073 8 19693.7 6309.6 3747.1 3.12x 0.59x 85.1 19%
float16 2048 131073 16 21604.8 6307.9 3749.5 3.43x 0.59x 85.2 19%
float32 1 128 4 31.2 31.4 14.5 0.99x 0.46x 0.0 0%
float32 1 128 8 34.0 34.4 14.3 0.99x 0.42x 0.0 0%
float32 1 128 16 32.4 34.4 14.0 0.94x 0.41x 0.0 0%
float32 1 129 4 34.1 34.4 14.4 0.99x 0.42x 0.0 0%
float32 1 129 8 34.0 32.7 14.4 1.04x 0.44x 0.0 0%
float32 1 129 16 34.1 34.3 15.2 0.99x 0.44x 0.0 0%
float32 1 1024 4 35.3 32.7 17.8 1.08x 0.54x 0.1 0%
float32 1 1024 8 35.3 35.8 22.2 0.99x 0.62x 0.1 0%
float32 1 1024 16 35.3 35.7 19.1 0.99x 0.54x 0.1 0%
float32 1 1025 4 35.3 35.9 18.8 0.98x 0.52x 0.1 0%
float32 1 1025 8 38.5 35.8 19.7 1.08x 0.55x 0.1 0%
float32 1 1025 16 35.2 35.7 19.6 0.99x 0.55x 0.1 0%
float32 1 8192 4 54.6 51.1 39.6 1.07x 0.77x 0.6 0%
float32 1 8192 8 63.6 55.0 38.0 1.16x 0.69x 0.6 0%
float32 1 8192 16 54.6 58.0 38.7 0.94x 0.67x 0.6 0%
float32 1 8193 4 51.5 52.0 34.1 0.99x 0.66x 0.6 0%
float32 1 8193 8 56.5 54.9 41.6 1.03x 0.76x 0.6 0%
float32 1 8193 16 60.6 58.0 39.8 1.04x 0.69x 0.6 0%
float32 1 131072 4 410.5 393.7 63.3 1.04x 0.16x 1.3 0%
float32 1 131072 8 412.3 398.5 63.3 1.03x 0.16x 1.3 0%
float32 1 131072 16 423.5 467.2 63.3 0.91x 0.14x 1.1 0%
float32 1 131073 4 406.7 389.3 64.0 1.04x 0.16x 1.3 0%
float32 1 131073 8 425.0 417.1 64.0 1.02x 0.15x 1.3 0%
float32 1 131073 16 435.0 430.7 63.9 1.01x 0.15x 1.2 0%
float32 8 128 4 33.8 37.2 14.7 0.91x 0.40x 0.1 0%
float32 8 128 8 35.0 34.1 14.3 1.03x 0.42x 0.1 0%
float32 8 128 16 35.6 37.2 15.2 0.96x 0.41x 0.2 0%
float32 8 129 4 35.2 36.0 15.0 0.98x 0.42x 0.1 0%
float32 8 129 8 36.8 34.1 15.0 1.08x 0.44x 0.1 0%
float32 8 129 16 35.3 35.5 15.3 0.99x 0.43x 0.2 0%
float32 8 1024 4 39.8 35.6 20.9 1.12x 0.59x 0.9 0%
float32 8 1024 8 38.2 35.6 19.7 1.07x 0.55x 0.9 0%
float32 8 1024 16 38.3 40.2 19.7 0.95x 0.49x 0.9 0%
float32 8 1025 4 38.3 35.7 20.6 1.07x 0.58x 0.9 0%
float32 8 1025 8 38.4 38.7 21.4 0.99x 0.55x 0.9 0%
float32 8 1025 16 41.2 36.9 22.0 1.12x 0.60x 0.9 0%
float32 8 8192 4 57.5 62.6 41.0 0.92x 0.65x 4.2 1%
float32 8 8192 8 60.6 55.2 42.6 1.10x 0.77x 4.8 1%
float32 8 8192 16 66.7 61.1 44.7 1.09x 0.73x 4.3 1%
float32 8 8193 4 54.6 64.0 43.0 0.85x 0.67x 4.1 1%
float32 8 8193 8 66.5 61.0 43.5 1.09x 0.71x 4.3 1%
float32 8 8193 16 63.9 67.1 45.0 0.95x 0.67x 3.9 1%
float32 8 131072 4 412.1 410.8 76.0 1.00x 0.19x 10.2 2%
float32 8 131072 8 432.3 425.0 76.0 1.02x 0.18x 9.9 2%
float32 8 131072 16 470.7 458.5 76.2 1.03x 0.17x 9.2 2%
float32 8 131073 4 403.7 411.2 76.0 0.98x 0.18x 10.2 2%
float32 8 131073 8 424.4 425.9 75.8 1.00x 0.18x 9.8 2%
float32 8 131073 16 471.5 477.8 76.1 0.99x 0.16x 8.8 2%
float32 64 128 4 38.2 37.4 15.0 1.02x 0.40x 1.0 0%
float32 64 128 8 36.8 37.2 15.0 0.99x 0.40x 1.0 0%
float32 64 128 16 38.3 37.2 14.9 1.03x 0.40x 1.2 0%
float32 64 129 4 38.5 37.1 15.5 1.04x 0.42x 1.0 0%
float32 64 129 8 37.0 37.1 15.9 1.00x 0.43x 1.1 0%
float32 64 129 16 38.4 38.8 15.4 0.99x 0.40x 1.2 0%
float32 64 1024 4 39.6 38.9 20.4 1.02x 0.52x 6.8 1%
float32 64 1024 8 39.8 39.2 20.3 1.02x 0.52x 6.8 2%
float32 64 1024 16 41.4 40.2 20.3 1.03x 0.50x 6.8 1%
float32 64 1025 4 41.3 43.4 22.1 0.95x 0.51x 6.1 1%
float32 64 1025 8 42.9 43.3 22.1 0.99x 0.51x 6.2 1%
float32 64 1025 16 42.9 44.7 22.2 0.96x 0.50x 6.1 1%
float32 64 8192 4 96.8 99.2 65.6 0.98x 0.66x 21.2 5%
float32 64 8192 8 103.8 106.6 65.6 0.97x 0.62x 19.7 4%
float32 64 8192 16 109.6 109.9 65.6 1.00x 0.60x 19.2 4%
float32 64 8193 4 97.8 99.6 65.6 0.98x 0.66x 21.1 5%
float32 64 8193 8 104.9 112.7 65.5 0.93x 0.58x 18.7 4%
float32 64 8193 16 112.9 115.8 65.6 0.97x 0.57x 18.2 4%
float32 64 131072 4 956.6 940.0 221.1 1.02x 0.24x 35.7 8%
float32 64 131072 8 1024.0 1007.2 220.4 1.02x 0.22x 33.3 7%
float32 64 131072 16 1097.5 1082.4 222.6 1.01x 0.21x 31.0 7%
float32 64 131073 4 943.5 941.2 223.0 1.00x 0.24x 35.7 8%
float32 64 131073 8 1004.0 1010.3 225.0 0.99x 0.22x 33.2 7%
float32 64 131073 16 1095.1 1101.5 223.7 0.99x 0.20x 30.5 7%
float32 256 128 4 46.0 46.0 15.7 1.00x 0.34x 3.1 1%
float32 256 128 8 47.2 47.5 15.7 0.99x 0.33x 3.3 1%
float32 256 128 16 47.4 47.2 15.7 1.00x 0.33x 3.8 1%
float32 256 129 4 47.2 47.5 16.1 0.99x 0.34x 3.0 1%
float32 256 129 8 45.6 47.4 16.7 0.96x 0.35x 3.3 1%
float32 256 129 16 47.3 49.0 16.6 0.97x 0.34x 3.7 1%
float32 256 1024 4 66.7 68.3 41.7 0.98x 0.61x 15.5 3%
float32 256 1024 8 70.7 70.0 43.2 1.01x 0.62x 15.3 3%
float32 256 1024 16 71.1 71.6 43.8 0.99x 0.61x 15.3 3%
float32 256 1025 4 82.8 81.2 45.9 1.02x 0.57x 13.1 3%
float32 256 1025 8 85.8 84.6 46.6 1.01x 0.55x 12.7 3%
float32 256 1025 16 87.3 89.4 48.1 0.98x 0.54x 12.3 3%
float32 256 8192 4 274.6 277.6 101.0 0.99x 0.36x 30.3 7%
float32 256 8192 8 299.9 286.3 101.3 1.05x 0.35x 29.4 6%
float32 256 8192 16 313.3 315.7 100.9 0.99x 0.32x 26.7 6%
float32 256 8193 4 283.6 277.9 101.7 1.02x 0.37x 30.2 7%
float32 256 8193 8 292.0 292.6 101.6 1.00x 0.35x 28.8 6%
float32 256 8193 16 317.9 318.0 101.8 1.00x 0.32x 26.5 6%
float32 256 131072 4 3194.0 3202.4 1128.3 1.00x 0.35x 41.9 9%
float32 256 131072 8 3415.0 3445.5 1132.5 0.99x 0.33x 39.0 9%
float32 256 131072 16 3704.6 3711.3 1129.5 1.00x 0.30x 36.2 8%
float32 256 131073 4 3206.8 3195.1 1148.5 1.00x 0.36x 42.0 9%
float32 256 131073 8 3427.4 3420.5 1148.0 1.00x 0.34x 39.2 9%
float32 256 131073 16 3743.5 3721.6 1147.9 1.01x 0.31x 36.1 8%
float32 1024 128 4 100.9 102.1 22.3 0.99x 0.22x 5.6 1%
float32 1024 128 8 107.9 105.8 22.0 1.02x 0.21x 5.9 1%
float32 1024 128 16 108.2 110.0 22.2 0.98x 0.20x 6.6 1%
float32 1024 129 4 102.3 101.3 24.4 1.01x 0.24x 5.7 1%
float32 1024 129 8 108.0 108.2 24.4 1.00x 0.23x 5.8 1%
float32 1024 129 16 109.5 111.1 24.6 0.99x 0.22x 6.5 1%
float32 1024 1024 4 185.6 50.2 88.3 3.70x 1.76x 84.5 19%
float32 1024 1024 8 190.3 50.0 88.3 3.81x 1.77x 85.9 19%
float32 1024 1024 16 194.7 50.2 88.3 3.88x 1.76x 87.5 19%
float32 1024 1025 4 251.8 92.1 90.2 2.73x 0.98x 46.1 10%
float32 1024 1025 8 262.6 92.5 90.1 2.84x 0.97x 46.5 10%
float32 1024 1025 16 267.3 93.0 90.4 2.87x 0.97x 47.3 10%
float32 1024 8192 4 1000.9 230.7 200.8 4.34x 0.87x 145.7 32%
float32 1024 8192 8 1072.8 231.1 200.2 4.64x 0.87x 145.6 32%
float32 1024 8192 16 1140.4 231.5 201.7 4.93x 0.87x 145.8 32%
float32 1024 8193 4 1014.7 465.1 202.4 2.18x 0.44x 72.3 16%
float32 1024 8193 8 1076.7 465.9 201.3 2.31x 0.43x 72.2 16%
float32 1024 8193 16 1159.9 466.5 202.6 2.49x 0.43x 72.4 16%
float32 1024 131072 4 11911.6 1964.0 4191.1 6.06x 2.13x 273.4 60%
float32 1024 131072 8 12727.1 1966.1 4189.9 6.47x 2.13x 273.1 60%
float32 1024 131072 16 13772.9 1966.2 4190.6 7.00x 2.13x 273.1 60%
float32 1024 131073 4 11868.0 3547.2 4260.7 3.35x 1.20x 151.4 33%
float32 1024 131073 8 12770.6 3550.0 4261.2 3.60x 1.20x 151.3 33%
float32 1024 131073 16 13914.8 3557.8 4261.2 3.91x 1.20x 151.0 33%
float32 2048 128 4 170.5 170.2 30.2 1.00x 0.18x 6.7 1%
float32 2048 128 8 177.6 177.9 30.6 1.00x 0.17x 7.0 2%
float32 2048 128 16 180.7 181.4 31.2 1.00x 0.17x 7.9 2%
float32 2048 129 4 170.3 170.5 35.4 1.00x 0.21x 6.8 1%
float32 2048 129 8 176.5 176.7 35.3 1.00x 0.20x 7.1 2%
float32 2048 129 16 181.9 182.7 36.4 1.00x 0.20x 7.9 2%
float32 2048 1024 4 333.2 85.6 123.4 3.89x 1.44x 99.1 22%
float32 2048 1024 8 347.3 85.9 123.4 4.04x 1.44x 99.9 22%
float32 2048 1024 16 355.7 87.1 123.7 4.08x 1.42x 100.8 22%
float32 2048 1025 4 470.0 165.7 126.5 2.84x 0.76x 51.3 11%
float32 2048 1025 8 492.6 166.1 126.7 2.97x 0.76x 51.7 11%
float32 2048 1025 16 503.6 167.0 127.0 3.02x 0.76x 52.6 12%
float32 2048 8192 4 1972.4 442.5 421.7 4.46x 0.95x 151.9 33%
float32 2048 8192 8 2094.9 443.3 424.8 4.73x 0.96x 151.8 33%
float32 2048 8192 16 2251.3 444.0 424.0 5.07x 0.95x 152.0 33%
float32 2048 8193 4 1979.8 908.5 436.2 2.18x 0.48x 74.0 16%
float32 2048 8193 8 2127.7 907.9 437.6 2.34x 0.48x 74.1 16%
float32 2048 8193 16 2269.5 910.9 440.8 2.49x 0.48x 74.1 16%
float32 2048 131072 4 23642.3 3925.9 8254.2 6.02x 2.10x 273.5 60%
float32 2048 131072 8 25253.3 3926.0 8254.6 6.43x 2.10x 273.5 60%
float32 2048 131072 16 27390.4 3930.4 8250.2 6.97x 2.10x 273.3 60%
float32 2048 131073 4 23630.0 7033.7 8407.4 3.36x 1.20x 152.7 33%
float32 2048 131073 8 25309.8 7037.0 8407.4 3.60x 1.19x 152.6 33%
float32 2048 131073 16 27547.6 7041.9 8413.3 3.91x 1.19x 152.5 33%

Test methodology

  • Accuracy (432 cases): 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
  • Sortedness (324 cases): Verify torch.topk(sorted=True) output is monotonic for both largest=True/False.
  • Benchmark (432 cases): Median of 3 runs x 50 iterations each, with 20 warmup iterations. largest=True.
  • Bandwidth: (bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
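The bandwidth formula can be checked against a row of the full table. A minimal sketch: the helper name and the dtype-size table are illustrative, and reading the `+ 8` term as an 8-byte index per output element is an interpretation of the formula.

```python
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}
INDEX_BYTES = 8  # the "+ 8" term, read here as one 8-byte index per output
PEAK_GBPS = 192 / 8 * 19  # 192-bit bus x 19 Gbps GDDR6 = 456 GB/s

def bandwidth_gbps(dtype, bs, dim, k, time_us):
    """Effective bandwidth: bytes read (input) plus bytes written (values + indices)."""
    elem = DTYPE_BYTES[dtype]
    total_bytes = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return total_bytes / (time_us * 1e-6) / 1e9

# Row "float32 1024 131072 4" with XPU opt = 1964.0 us.
bw = bandwidth_gbps("float32", 1024, 131072, 4, 1964.0)
pct_of_peak = 100 * bw / PEAK_GBPS  # about 273.4 GB/s, roughly 60% of peak
```

The result agrees with the BW and %peak columns for that row, which indicates those columns were computed from the XPU opt time.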

Copilot AI review requested due to automatic review settings April 17, 2026 08:52
Add an optimized topk kernel path where each 32-lane sub-group processes
one slice entirely in registers via insertion sort + bitonic merge.
Zero SLM, zero barriers. Output is already sorted.

Constraints: k <= 16, large enough batch (nsegments >= HW_threads/4).
Compile-time template dispatch on largest (direction) and IndexT (int32/int64).
Kernel isolated in a separate translation unit to avoid SYCL compiler
interference with the original kernel.

432/432 accuracy tests pass, 324/324 sortedness tests pass.

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds an optimized XPU SYCL implementation of topk using a subgroup-based kernel path and integrates it into the existing XPU topk dispatch to skip unnecessary sorting when results are already sorted.

Changes:

  • Added a new subgroup top-k kernel translation unit with launch heuristics (k <= 16, dim >= 1024, sufficiently large batch).
  • Introduced a small public interface (SbtopkResult, sbtopk_try_launch) to attempt the optimized path.
  • Updated the existing topk caller to try the subgroup kernel first and conditionally skip sorting.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h | Declares SbtopkResult and the optimized-kernel try-launch API. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp | Implements subgroup-in-register top-k kernel and dispatch heuristics. |
| src/ATen/native/xpu/sycl/TensorTopKKernel.cpp | Calls sbtopk_try_launch first and skips sort when already sorted. |

- Add kernel properties (sub_group_size<32>, grf_size<128>) to launch
  for explicit sub-group size and smaller GRF (better occupancy)
- Fix std::numeric_limits<scalar_t>::infinity() for integer dtypes:
  use lowest()/max() when has_infinity is false
- Add #include <limits>
- Clarify insert() idx param is within-slice (int, bounded by sliceSize)
- Shorten header comment for sbtopk_try_launch
- Fix TensorTopKKernel.cpp comment (remove single-wg kernel reference)

432/432 accuracy, 324/324 sortedness pass.
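The infinity fix amounts to choosing the buffer's initial sentinel by dtype. A minimal sketch of that selection, assuming the caller passes the dtype's limits explicitly; the function and parameter names are hypothetical stand-ins for the `std::numeric_limits` queries named in the commit message.

```python
import math

def initial_sentinel(largest, has_infinity, lowest, highest):
    """Pick the 'worst possible' starting value for a top-k buffer."""
    if largest:
        # Searching for the largest values: start from the smallest representable.
        return -math.inf if has_infinity else lowest
    # Searching for the smallest values: start from the largest representable.
    return math.inf if has_infinity else highest

float_start = initial_sentinel(True, True, lowest=None, highest=None)
int32_start = initial_sentinel(True, False, lowest=-2**31, highest=2**31 - 1)
```

For integer dtypes `infinity()` is not meaningful, so the dtype's extreme finite value has to play the sentinel role instead.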
Copilot AI review requested due to automatic review settings April 17, 2026 13:15

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from aa5150e to d742fa2 Compare April 17, 2026 13:43
Copilot AI review requested due to automatic review settings April 17, 2026 14:08
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from d742fa2 to f926f72 Compare April 17, 2026 14:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from f926f72 to 659972f Compare April 17, 2026 14:21
@jianyizh jianyizh changed the title Add subgroup topk kernel for XPU Add subgroup topk kernel for XPU (part1 of #3369) Apr 17, 2026
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from 659972f to d4daf78 Compare April 17, 2026 14:37
Copilot AI review requested due to automatic review settings April 17, 2026 14:37

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 0.988015 0.696526
torchbench_bfloat16_training resnext50_32x4d 0.955237 0.697808
torchbench_bfloat16_training dcgan 0.811436 0.770243
torchbench_bfloat16_training densenet121 0.778202 0.801392
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mobilenet_v3_large 0.955225 0.807977

@jianyizh jianyizh requested a review from Copilot April 18, 2026 06:53
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
- insert(): add count-aware stop condition so input values equal to the
  sentinel (e.g. all -inf for largest=true) fill the buffer correctly
  instead of repeatedly overwriting position K-1
- Add alignas(alignof(LoadT)) on local vectorized-load array
- Add pointer alignment check in vec dispatch to safely fall back to
  scalar loads when input has a non-aligned storage offset
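
The first fix in the list above can be illustrated with a small host-side model. This is a sketch in Python rather than the kernel's SYCL C++; the names `insert_topk` and `topk_stream`, the list-based buffers, and the explicit `count` argument are assumptions made for illustration and are not taken from the PR's code.

```python
import math

def insert_topk(vals, idxs, count, v, i, largest=True):
    """Insert (v, i) into a fixed-size sorted buffer and return the new count.

    vals/idxs have length K and start out filled with a sentinel
    (-inf for largest=True). Only the first `count` slots hold real data.
    """
    k = len(vals)
    better = (lambda a, b: a > b) if largest else (lambda a, b: a < b)
    if count == k and not better(v, vals[k - 1]):
        return count  # buffer is full and v does not beat the current worst
    # Start at the first free slot, or at the last slot when the buffer is full.
    pos = min(count, k - 1)
    # Shift worse entries down. The start position depends on `count`, not on
    # a comparison with the sentinel, so sentinel-valued inputs get own slots.
    while pos > 0 and better(v, vals[pos - 1]):
        vals[pos], idxs[pos] = vals[pos - 1], idxs[pos - 1]
        pos -= 1
    vals[pos], idxs[pos] = v, i
    return min(count + 1, k)

def topk_stream(values, k, largest=True):
    """Run insert_topk over a sequence and return the filled part of the buffer."""
    sentinel = -math.inf if largest else math.inf
    vals, idxs, count = [sentinel] * k, [-1] * k, 0
    for i, v in enumerate(values):
        count = insert_topk(vals, idxs, count, v, i, largest)
    return vals[:count], idxs[:count]
```

With an all `-inf` input and `largest=True`, each element lands in its own slot, which is the case the commit message describes as previously overwriting position K-1.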
@jianyizh jianyizh requested a review from CuiYifeng April 20, 2026 02:42
@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 0.947171 0.715448
torchbench_bfloat16_training resnet18 0.912483 0.759558
torchbench_bfloat16_training dcgan 0.845901 0.776201
torchbench_bfloat16_training densenet121 0.775895 0.788212
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mobilenet_v3_large 0.932345 0.800673
torchbench_bfloat16_training resnext50_32x4d 0.968829 0.835742

Benchmarks show the subgroup kernel is 2-4x faster than the original
kernel even for small dims (32-512) when the batch size is large. The
previous dim>=1024 guard was overly conservative. The only hard
requirement is dim>=SG_SIZE (32), so that each lane gets at least one
element.
Copilot AI review requested due to automatic review settings April 20, 2026 05:20
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h
Select K from {1, 2, 4, 8, 16} based on runtime k (round up to
next power of two).  Smaller K means fewer unrolled iterations in
insert/merge/shuffle loops, dramatically reducing register pressure.

K<=8 eliminates all register spills on B580 (GRF 128) across fp32,
fp16, and bf16.  For k=4 this gives 3-11x speedup over the previous
fixed K=16 path; k=16 takes the same K=16 template as before (no
regression).
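
The template selection described above amounts to rounding the runtime `k` up to a power of two and capping it at 16. A minimal sketch of that arithmetic follows; the function name `select_k_template` and the `max_k` parameter are hypothetical, and the real kernel performs this choice in C++ at dispatch time.

```python
def select_k_template(k, max_k=16):
    """Round runtime k up to the next power of two, capped at max_k."""
    if not 1 <= k <= max_k:
        raise ValueError("this sketch only handles 1 <= k <= max_k")
    pow2 = 1 << (k - 1).bit_length()  # smallest power of two that is >= k
    return min(pow2, max_k)
```

For example, `k=3` and `k=4` both map to the K=4 instantiation, while `k=9` through `k=16` all map to K=16, which is why the commit message reports no change for `k=16`.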
@github-actions
Copy link
Copy Markdown

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 1.005730 0.807516
torchbench_bfloat16_training resnet18 1.011793 0.844331

@jianyizh

Benchmark: Subgroup TopK on Intel Arc B580

Setup: torch.topk on input shape [numSlices=256, sliceSize=dim], k in {2, 4, 8, 16}, dtypes: float32, float16, bfloat16 (156 cases total).

  • B580 orig: Arc B580, main branch (before this PR)
  • B580 opt: Arc B580, this PR (subgroup topk kernel)
  • L20: NVIDIA L20 (PyTorch CUDA, for reference)

Summary

| Metric | Value |
|--------|-------|
| Avg speedup vs B580 orig | 6.03x (median 4.33x, range 1.26x – 20.06x) |
| Cases where B580 opt beats L20 | 74 / 156 (47.4%) |

| k | Avg speedup vs orig | B580 opt beats L20 |
|--:|:-:|:-:|
| 2 | 8.28x | 26/39 (67%) |
| 4 | 7.70x | 27/39 (69%) |
| 8 | 5.25x | 17/39 (44%) |
| 16 | 2.91x | 4/39 (10%) |

Full Results

Click to expand full 156-case table
dtype dim k B580 orig (us) B580 opt (us) Speedup (orig/opt) L20 (us) L20 / B580 opt
float32 32 2 99.7 26.083 3.82x 10.547 0.404
float32 32 4 103.0 33.894 3.04x 9.933 0.293
float32 32 8 101.8 19.843 5.13x 10.035 0.506
float32 32 16 100.4 34.939 2.87x 9.728 0.278
float32 64 2 97.7 33.166 2.95x 9.523 0.287
float32 64 4 104.1 21.523 4.84x 9.523 0.442
float32 64 8 105.2 33.535 3.14x 9.830 0.293
float32 64 16 106.8 44.382 2.41x 9.728 0.219
float32 128 2 100.2 32.682 3.07x 11.366 0.348
float32 128 4 101.7 43.004 2.36x 11.366 0.264
float32 128 8 107.0 18.866 5.67x 11.469 0.608
float32 128 16 107.9 24.851 4.34x 11.776 0.474
float32 256 2 105.1 35.656 2.95x 18.330 0.514
float32 256 4 104.1 35.355 2.94x 18.842 0.533
float32 256 8 105.2 39.931 2.63x 18.842 0.472
float32 256 16 110.1 25.927 4.25x 18.842 0.727
float32 512 2 128.7 19.360 6.65x 36.966 1.909
float32 512 4 133.2 34.169 3.90x 37.990 1.112
float32 512 8 132.5 21.174 6.26x 40.038 1.891
float32 512 16 133.7 43.009 3.11x 39.014 0.907
float32 1024 2 177.0 21.533 8.22x 73.830 3.429
float32 1024 4 185.1 39.629 4.67x 74.342 1.876
float32 1024 8 192.5 34.034 5.66x 73.011 2.145
float32 1024 16 194.5 65.192 2.98x 74.445 1.142
float32 2048 2 243.0 30.020 8.09x 91.341 3.043
float32 2048 4 255.2 29.323 8.70x 91.443 3.118
float32 2048 8 268.2 59.540 4.50x 90.522 1.520
float32 2048 16 277.9 119.278 2.33x 91.136 0.764
float32 4096 2 436.7 49.072 8.90x 119.910 2.444
float32 4096 4 455.9 53.331 8.55x 120.525 2.260
float32 4096 8 486.3 112.211 4.33x 120.013 1.070
float32 4096 16 506.4 209.149 2.42x 121.139 0.579
float32 8192 2 937.2 97.755 9.59x 179.098 1.832
float32 8192 4 999.2 90.683 11.02x 179.302 1.977
float32 8192 8 1069.8 190.169 5.63x 185.344 0.975
float32 8192 16 1142.6 400.868 2.85x 182.682 0.456
float32 16384 2 1678.9 162.094 10.36x 310.989 1.919
float32 16384 4 1834.3 174.944 10.49x 312.013 1.784
float32 16384 8 1973.5 322.390 6.12x 310.784 0.964
float32 16384 16 2104.2 724.100 2.91x 312.422 0.431
float32 32768 2 3103.7 329.399 9.42x 1017.037 3.088
float32 32768 4 3387.1 337.532 10.03x 1018.675 3.018
float32 32768 8 3676.8 489.637 7.51x 1017.446 2.078
float32 32768 16 3932.6 1157.754 3.40x 1016.013 0.878
float32 65536 2 5890.2 631.119 9.33x 1966.182 3.115
float32 65536 4 6318.9 648.570 9.74x 1968.230 3.035
float32 65536 8 6868.2 798.028 8.61x 1967.411 2.465
float32 65536 16 7371.1 1777.911 4.15x 1968.947 1.107
float32 131072 2 11278.5 1250.382 9.02x 3840.512 3.071
float32 131072 4 11871.5 1265.540 9.38x 3839.693 3.034
float32 131072 8 12716.6 1401.395 9.07x 3837.235 2.738
float32 131072 16 13771.5 2690.631 5.12x 3836.826 1.426
float16 32 2 64.8 36.592 1.77x 10.035 0.274
float16 32 4 66.7 17.893 3.73x 9.933 0.555
float16 32 8 66.8 18.117 3.69x 9.830 0.543
float16 32 16 63.2 16.266 3.89x 10.752 0.661
float16 64 2 63.6 37.513 1.70x 9.626 0.257
float16 64 4 67.1 37.606 1.78x 9.830 0.261
float16 64 8 69.6 17.919 3.88x 10.547 0.589
float16 64 16 69.2 42.775 1.62x 9.728 0.227
float16 128 2 65.7 32.968 1.99x 10.342 0.314
float16 128 4 66.9 39.910 1.68x 10.342 0.259
float16 128 8 70.3 32.625 2.15x 10.445 0.320
float16 128 16 71.2 31.179 2.28x 10.957 0.351
float16 256 2 69.1 31.034 2.23x 16.384 0.528
float16 256 4 69.6 34.164 2.04x 16.896 0.495
float16 256 8 69.6 33.046 2.11x 16.896 0.511
float16 256 16 74.0 42.734 1.73x 17.101 0.400
float16 512 2 86.6 36.878 2.35x 32.154 0.872
float16 512 4 90.3 20.114 4.49x 34.304 1.705
float16 512 8 93.1 32.854 2.83x 34.816 1.060
float16 512 16 93.0 35.261 2.64x 34.714 0.984
float16 1024 2 125.1 19.718 6.34x 55.194 2.799
float16 1024 4 131.2 31.866 4.12x 51.917 1.629
float16 1024 8 135.8 36.759 3.69x 51.814 1.410
float16 1024 16 144.9 54.902 2.64x 52.429 0.955
float16 2048 2 176.0 19.162 9.18x 59.597 3.110
float16 2048 4 186.9 32.666 5.72x 60.416 1.850
float16 2048 8 200.0 52.723 3.79x 60.723 1.152
float16 2048 16 204.1 101.650 2.01x 61.645 0.606
float16 4096 2 332.0 31.626 10.50x 84.070 2.658
float16 4096 4 353.4 42.250 8.36x 85.197 2.016
float16 4096 8 369.2 94.364 3.91x 85.402 0.905
float16 4096 16 398.9 203.174 1.96x 86.118 0.424
float16 8192 2 729.6 44.928 16.24x 121.037 2.694
float16 8192 4 800.3 72.140 11.09x 121.958 1.691
float16 8192 8 844.8 166.452 5.08x 123.802 0.744
float16 8192 16 911.0 388.196 2.35x 124.314 0.320
float16 16384 2 1302.2 92.446 14.09x 202.547 2.191
float16 16384 4 1432.9 117.161 12.23x 204.493 1.745
float16 16384 8 1561.5 271.482 5.75x 206.234 0.760
float16 16384 16 1679.7 686.790 2.45x 204.902 0.298
float16 32768 2 2384.4 170.399 13.99x 363.418 2.133
float16 32768 4 2613.2 195.000 13.40x 362.701 1.860
float16 32768 8 2893.3 414.008 6.99x 363.520 0.878
float16 32768 16 3166.1 1081.153 2.93x 364.134 0.337
float16 65536 2 4393.6 331.287 13.26x 846.541 2.555
float16 65536 4 4863.9 349.612 13.91x 853.299 2.441
float16 65536 8 5321.1 630.120 8.44x 849.306 1.348
float16 65536 16 5923.2 1623.185 3.65x 851.354 0.524
float16 131072 2 8174.5 641.337 12.75x 1625.907 2.535
float16 131072 4 9035.2 660.218 13.69x 1620.480 2.454
float16 131072 8 9921.4 968.958 10.24x 1624.064 1.676
float16 131072 16 10880.3 2330.198 4.67x 1627.648 0.699
bfloat16 32 2 69.0 37.435 1.84x 9.626 0.257
bfloat16 32 4 72.2 30.129 2.40x 10.342 0.343
bfloat16 32 8 71.9 20.743 3.47x 9.830 0.474
bfloat16 32 16 68.3 39.281 1.74x 9.626 0.245
bfloat16 64 2 66.7 34.559 1.93x 9.830 0.284
bfloat16 64 4 72.1 32.937 2.19x 9.830 0.298
bfloat16 64 8 73.3 25.230 2.91x 9.933 0.394
bfloat16 64 16 72.9 58.001 1.26x 9.830 0.169
bfloat16 128 2 68.7 17.685 3.88x 11.162 0.631
bfloat16 128 4 69.9 17.300 4.04x 11.674 0.675
bfloat16 128 8 75.2 34.913 2.15x 11.571 0.331
bfloat16 128 16 76.5 21.336 3.59x 11.469 0.538
bfloat16 256 2 73.7 27.102 2.72x 18.432 0.680
bfloat16 256 4 72.5 20.056 3.61x 18.842 0.939
bfloat16 256 8 75.4 33.862 2.23x 18.637 0.550
bfloat16 256 16 77.4 30.077 2.57x 18.637 0.620
bfloat16 512 2 97.6 36.088 2.70x 36.966 1.024
bfloat16 512 4 100.6 17.961 5.60x 38.605 2.149
bfloat16 512 8 101.9 25.740 3.96x 39.424 1.532
bfloat16 512 16 101.3 37.658 2.69x 38.195 1.014
bfloat16 1024 2 146.1 37.123 3.94x 53.350 1.437
bfloat16 1024 4 151.3 19.942 7.59x 53.248 2.670
bfloat16 1024 8 156.3 33.384 4.68x 52.122 1.561
bfloat16 1024 16 157.2 61.204 2.57x 51.917 0.848
bfloat16 2048 2 210.7 32.734 6.44x 61.030 1.864
bfloat16 2048 4 219.8 36.286 6.06x 61.235 1.688
bfloat16 2048 8 228.0 52.660 4.33x 61.645 1.171
bfloat16 2048 16 231.4 109.283 2.12x 62.157 0.569
bfloat16 4096 2 398.9 37.435 10.66x 85.299 2.279
bfloat16 4096 4 417.3 48.090 8.68x 86.733 1.804
bfloat16 4096 8 433.2 98.696 4.39x 84.275 0.854
bfloat16 4096 16 443.5 222.352 1.99x 87.552 0.394
bfloat16 8192 2 873.1 43.529 20.06x 125.542 2.884
bfloat16 8192 4 943.7 82.030 11.50x 125.030 1.524
bfloat16 8192 8 985.7 176.337 5.59x 128.410 0.728
bfloat16 8192 16 1048.0 428.969 2.44x 125.440 0.292
bfloat16 16384 2 1618.6 88.837 18.22x 210.125 2.365
bfloat16 16384 4 1731.8 130.983 13.22x 211.661 1.616
bfloat16 16384 8 1852.6 307.294 6.03x 212.275 0.691
bfloat16 16384 16 1940.7 754.749 2.57x 210.637 0.279
bfloat16 32768 2 3003.4 173.831 17.28x 370.278 2.130
bfloat16 32768 4 3258.9 209.170 15.58x 371.917 1.778
bfloat16 32768 8 3472.6 453.014 7.67x 371.200 0.819
bfloat16 32768 16 3620.6 1194.440 3.03x 371.712 0.311
bfloat16 65536 2 5705.5 328.396 17.37x 854.938 2.603
bfloat16 65536 4 6099.0 359.362 16.97x 853.709 2.376
bfloat16 65536 8 6503.2 696.093 9.34x 855.757 1.229
bfloat16 65536 16 6812.1 1783.220 3.82x 860.262 0.482
bfloat16 131072 2 10942.8 644.342 16.98x 1630.618 2.531
bfloat16 131072 4 11518.8 677.243 17.01x 1641.882 2.424
bfloat16 131072 8 12175.3 1099.602 11.07x 1632.358 1.484
bfloat16 131072 16 12866.9 2564.588 5.02x 1637.786 0.639

Notes

  • Input shape: [256, dim] for all cases
  • B580 orig/L20 measured with torch.xpu.synchronize()/torch.cuda.synchronize() + time.perf_counter
  • B580 opt measured with op-bench profiler (device time)
  • "L20 / B580 opt" > 1.0 means B580 opt is faster than L20
  • Small dim (32–256) latencies are dominated by launch overhead; variations are measurement noise
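
The two derived columns in the table can be reproduced from the raw latencies. The sketch below uses hypothetical helper names (`speedup`, `l20_ratio`) and simply restates the column definitions given in the table header and the notes; keep in mind that, per the notes, the "orig" and "opt" timings come from different measurement methods.

```python
def speedup(orig_us, opt_us):
    """'Speedup (orig/opt)' column: original latency divided by optimized latency."""
    return orig_us / opt_us

def l20_ratio(l20_us, opt_us):
    """'L20 / B580 opt' column: values above 1.0 mean B580 opt is faster than L20."""
    return l20_us / opt_us

# Row: float32, dim=32, k=2 -> 99.7 us orig, 26.083 us opt, 10.547 us on L20.
row_speedup = round(speedup(99.7, 26.083), 2)
row_ratio = round(l20_ratio(10.547, 26.083), 3)
```

For that row the two values round to 3.82 and 0.404, matching the table.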

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
@guangyey guangyey left a comment

TIL, thanks!

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training pytorch_unet 0.732027 0.701785
huggingface_bfloat16_training XLNetLMHeadModel 0.735926 0.739482
huggingface_float16_training TrOCRForCausalLM 0.715079 0.742309
torchbench_bfloat16_training mobilenet_v2 0.848852 0.744022
huggingface_bfloat16_training BartForCausalLM 0.784425 0.749825
huggingface_float16_training BartForCausalLM 0.767757 0.751228
huggingface_bfloat16_training TrOCRForCausalLM 0.765387 0.752075
huggingface_float16_training DistilBertForMaskedLM 0.726882 0.756322
huggingface_bfloat16_training AllenaiLongformerBase 0.666775 0.756483
torchbench_bfloat16_training alexnet 0.773487 0.758073
huggingface_float16_training PLBartForCausalLM 0.751928 0.759221
huggingface_bfloat16_training DistilBertForMaskedLM 0.787416 0.764119
huggingface_bfloat16_training RobertaForCausalLM 0.792129 0.767104
torchbench_bfloat16_training Background_Matting 0.735471 0.768118
huggingface_float16_training XLNetLMHeadModel 0.692967 0.768142
huggingface_bfloat16_training LayoutLMForMaskedLM 0.747264 0.769014
huggingface_float16_training LayoutLMForMaskedLM 0.711922 0.769555
huggingface_float16_training MBartForCausalLM 0.742191 0.770040
huggingface_bfloat16_training PLBartForCausalLM 0.787240 0.771616
huggingface_bfloat16_training MBartForCausalLM 0.760636 0.774483
torchbench_bfloat16_training resnet50 0.792927 0.775641
huggingface_bfloat16_training BertForMaskedLM 0.784110 0.780163
huggingface_float16_training BertForMaskedLM 0.750469 0.781311
torchbench_bfloat16_training nvidia_deeprecommender 0.751267 0.781650
huggingface_float16_training RobertaForCausalLM 0.762570 0.784019
huggingface_float16_training DistillGPT2 0.723727 0.784441
huggingface_bfloat16_training OPTForCausalLM 0.815175 0.792447
huggingface_float16_training YituTechConvBert 0.720666 0.793982
huggingface_bfloat16_training ElectraForCausalLM 0.799759 0.794691
huggingface_float16_training PegasusForCausalLM 0.751302 0.797202
huggingface_bfloat16_training DistillGPT2 0.777563 0.797276
huggingface_float16_training OPTForCausalLM 0.804246 0.799229
huggingface_float16_training ElectraForCausalLM 0.756055 0.802675
huggingface_bfloat16_training YituTechConvBert 0.748906 0.808835
huggingface_bfloat16_training PegasusForCausalLM 0.767274 0.813000
huggingface_float16_training AlbertForMaskedLM 0.776446 0.827234
huggingface_float16_training AllenaiLongformerBase 0.657353 0.833531
huggingface_float16_training T5Small 0.792103 0.845485
huggingface_float16_training T5ForConditionalGeneration 0.794138 0.846337
torchbench_bfloat16_training vgg16 0.752654 0.864106
huggingface_float16_training GPT2ForSequenceClassification 0.746828 0.877822
huggingface_bfloat16_training GPT2ForSequenceClassification 0.786941 0.889578
torchbench_bfloat16_training LearningToPaint 0.697619 0.944144
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_float16_training MegatronBertForCausalLM 0.812439 0.803145
huggingface_bfloat16_training MegatronBertForCausalLM 0.846729 0.822921
huggingface_bfloat16_training T5ForConditionalGeneration 0.822018 0.838930
timm_models_bfloat16_training mobilevit_s 1.017896 0.840234
timm_models_bfloat16_training tf_efficientnet_b0 1.028554 0.842060
huggingface_float16_training M2M100ForConditionalGeneration 0.849708 0.842682
huggingface_bfloat16_training AlbertForMaskedLM 0.813793 0.849938
huggingface_float16_training BlenderbotForCausalLM 0.840431 0.853512
huggingface_bfloat16_training T5Small 0.832042 0.861847
huggingface_bfloat16_training M2M100ForConditionalGeneration 0.885330 0.865312
huggingface_float16_training GoogleFnet 0.835772 0.872642
huggingface_bfloat16_training GoogleFnet 0.837593 0.873639
huggingface_bfloat16_training DebertaV2ForMaskedLM 0.901616 0.879606
huggingface_bfloat16_training XGLMForCausalLM 0.843494 0.880984
huggingface_bfloat16_training BlenderbotForCausalLM 0.869219 0.881966
timm_models_bfloat16_training mobilenetv2_100 1.028982 0.885514
huggingface_float16_training DebertaV2ForMaskedLM 0.891486 0.887836
timm_models_bfloat16_training mobilenetv3_large_100 1.020387 0.889674
torchbench_bfloat16_training shufflenet_v2_x1_0 0.800072 0.906983
huggingface_float16_training XGLMForCausalLM 0.818948 0.912671
torchbench_bfloat16_training squeezenet1_1 0.844912 1.036711

Copilot AI review requested due to automatic review settings April 27, 2026 13:09
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from e5edf06 to 3a7de35 Compare April 27, 2026 13:09
- Replace K_sel if-else chain with c10::llvm::PowerOf2Ceil + std::min
- Replace q.submit with sycl_kernel_submit + kernel properties
- Add sycl_kernel_submit overloads accepting properties to SYCLHelpers.h
- Simplify SBTOPK_DISPATCH_INDEX: only check numSlices <= INT_MAX
  (IndexT is only used for slice indices, not cross-slice global indices)
- Add SG_MERGE_LEVELS constexpr + static_assert, replace magic number 5
- Refactor vec dispatch with can_use_vec lambda
- Update IndexT comment to reflect simplified dispatch condition
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
@CuiYifeng CuiYifeng left a comment

The main part of this PR LGTM.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h
@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_bfloat16_training TrOCRForCausalLM 0.728785 0.704822
huggingface_float16_training TrOCRForCausalLM 0.686870 0.710508
huggingface_bfloat16_training DistilBertForMaskedLM 0.759157 0.728112
huggingface_float16_training DistilBertForMaskedLM 0.707894 0.744829
huggingface_bfloat16_training RobertaForCausalLM 0.778381 0.747414
huggingface_float16_training BartForCausalLM 0.781544 0.758295
huggingface_bfloat16_training BartForCausalLM 0.810056 0.761432
huggingface_bfloat16_training BertForMaskedLM 0.786637 0.777669
huggingface_float16_training PegasusForCausalLM 0.731876 0.778853
huggingface_float16_training RobertaForCausalLM 0.754529 0.780642
huggingface_bfloat16_training LayoutLMForMaskedLM 0.789473 0.786303
huggingface_bfloat16_training DistillGPT2 0.777698 0.786791
huggingface_float16_training MBartForCausalLM 0.768628 0.788013
huggingface_float16_training DistillGPT2 0.726512 0.789362
huggingface_float16_training BertForMaskedLM 0.756735 0.790104
huggingface_bfloat16_training PLBartForCausalLM 0.795146 0.790313
huggingface_bfloat16_training XLNetLMHeadModel 0.776953 0.793240
huggingface_bfloat16_training MBartForCausalLM 0.783842 0.793673
huggingface_bfloat16_training PegasusForCausalLM 0.757929 0.796527
huggingface_float16_training PLBartForCausalLM 0.768026 0.798550
huggingface_bfloat16_training AllenaiLongformerBase 0.696802 0.798600
huggingface_float16_training LayoutLMForMaskedLM 0.749624 0.804081
huggingface_float16_training XLNetLMHeadModel 0.737034 0.811957
huggingface_float16_training AlbertForMaskedLM 0.778907 0.820047
huggingface_float16_training YituTechConvBert 0.749484 0.829264
huggingface_bfloat16_training YituTechConvBert 0.786531 0.854588
huggingface_float16_training ElectraForCausalLM 0.777779 0.857949
huggingface_float16_training AllenaiLongformerBase 0.676800 0.865608
huggingface_bfloat16_training GPT2ForSequenceClassification 0.795345 0.894546
huggingface_float16_training GPT2ForSequenceClassification 0.750631 0.911812
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_bfloat16_training OPTForCausalLM 0.850893 0.821623
huggingface_float16_training BlenderbotForCausalLM 0.829032 0.826002
huggingface_float16_training MegatronBertForCausalLM 0.827504 0.828424
huggingface_float16_training OPTForCausalLM 0.830220 0.832640
huggingface_bfloat16_training MegatronBertForCausalLM 0.846823 0.832646
huggingface_bfloat16_training AlbertForMaskedLM 0.813265 0.840053
huggingface_bfloat16_training ElectraForCausalLM 0.819623 0.844496
huggingface_bfloat16_training DebertaV2ForMaskedLM 0.873371 0.847714
torchbench_bfloat16_training mnasnet1_0 0.941280 0.850131
huggingface_float16_training DebertaV2ForMaskedLM 0.855165 0.859474
huggingface_bfloat16_training BlenderbotForCausalLM 0.848528 0.862730
huggingface_float16_training T5ForConditionalGeneration 0.803168 0.864535
huggingface_bfloat16_training T5Small 0.831588 0.867062
huggingface_float16_training T5Small 0.802753 0.869575
huggingface_bfloat16_training T5ForConditionalGeneration 0.838913 0.874394
huggingface_bfloat16_training GoogleFnet 0.840761 0.877972
huggingface_float16_training GoogleFnet 0.839524 0.878620
huggingface_bfloat16_training XGLMForCausalLM 0.847143 0.883788
huggingface_float16_training XGLMForCausalLM 0.838517 0.885862
huggingface_float16_training M2M100ForConditionalGeneration 0.914870 0.891360
huggingface_bfloat16_training M2M100ForConditionalGeneration 0.924642 0.893019

@CuiYifeng

The e2e failure is a known issue: #3455.

@CuiYifeng

/merge -f "Failed unit tests are irrelevant to this PR"

@CuiYifeng CuiYifeng merged commit 8eaa591 into main Apr 29, 2026
82 of 88 checks passed
@CuiYifeng CuiYifeng deleted the jianyi/subgroup-topk branch April 29, 2026 07:46
jafraustro pushed a commit to jafraustro/torch-xpu-ops that referenced this pull request Apr 29, 2026
## Summary

- **Speedup vs original XPU:** 1.3648x geomean over 432 cases, 130 wins
(>1.05x), 40 regressions (<0.98x)
- **vs CUDA 4080S:** 0.4574x geomean (>1 means XPU faster)

### Approach

Add a **subgroup topk kernel** (`SubgroupTopKFunctor` in
`TensorTopKSbtopkKernel.cpp`) where each 32-lane sub-group processes one
slice entirely in registers:

- **Phase 1:** Each lane scans `dim/32` elements, maintaining a sorted
top-k buffer via insertion sort (fully unrolled).
- **Phase 2:** 5-level bitonic merge across sub-group lanes via
`sycl::select_from_group` shuffle.
- **Phase 3:** Lane 0 writes `k` results. Output is already sorted.
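
The three phases can be modelled on the host to see why lane 0 ends up with the sorted result. This is only a sketch: the strided element-to-lane assignment, the use of `sorted()` in place of the unrolled insertion sort and the bitonic merge network, and all function names are assumptions for illustration. Only the 32-lane width, the five exchange levels, and the "lane 0 writes" step come from the description above.

```python
SG_SIZE = 32  # lanes per sub-group, as stated in the description

def lane_topk(values, indices, k, largest):
    # Phase 1 stand-in: each lane keeps a sorted top-k of the elements it scans.
    pairs = sorted(zip(values, indices), key=lambda p: p[0], reverse=largest)
    return pairs[:k]

def merge_topk(a, b, k, largest):
    # Phase 2 stand-in: combine two sorted top-k lists and keep the best k.
    return sorted(a + b, key=lambda p: p[0], reverse=largest)[:k]

def subgroup_topk(slice_vals, k, largest=True):
    n = len(slice_vals)
    # Phase 1: lane l scans elements l, l + 32, l + 64, ... (assumed layout).
    lanes = [
        lane_topk(slice_vals[l::SG_SIZE], range(l, n, SG_SIZE), k, largest)
        for l in range(SG_SIZE)
    ]
    # Phase 2: log2(32) = 5 exchange levels; lane l pairs with lane l ^ offset,
    # modelling a shuffle-based exchange between lanes.
    offset = SG_SIZE // 2
    while offset >= 1:
        lanes = [
            merge_topk(lanes[l], lanes[l ^ offset], k, largest)
            for l in range(SG_SIZE)
        ]
        offset //= 2
    # Phase 3: lane 0 holds the k results, already sorted.
    vals = [v for v, _ in lanes[0]]
    idxs = [i for _, i in lanes[0]]
    return vals, idxs
```

Because each exchange level pairs lanes whose accumulated element sets are disjoint, no element is counted twice, and after the fifth level every lane (including lane 0) holds the top-k of the whole slice.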

Key properties:
- Zero SLM (shared local memory), zero barriers
- `largest` as compile-time template parameter — eliminates per-element
direction branches
- `int32`/`int64` index dispatch mirroring CUDA's `canUse32BitIndexMath`
- Kernel isolated in a separate translation unit to prevent SYCL
compiler global optimization interference with the original kernel

**Dispatch:** `k <= 16` and `nsegments >= HW_thread_slots / 4` and
`dim >= 32` → subgroup kernel (SORTED); otherwise → original kernel.
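
The dispatch rule can be written as a single predicate. In the sketch below the function name and keyword defaults are hypothetical, and the `hw_thread_slots` value used in the usage note is a made-up number for illustration, not a measured property of any device.

```python
def use_subgroup_kernel(k, nsegments, dim, hw_thread_slots, max_k=16, sg_size=32):
    """Return True when the subgroup kernel would be chosen under the stated rule."""
    return (
        k <= max_k
        and nsegments >= hw_thread_slots / 4
        and dim >= sg_size
    )
```

With a hypothetical `hw_thread_slots=2048`, a call with `k=4, nsegments=1024, dim=8192` selects the subgroup kernel, while the same call with `nsegments=256` falls back to the original kernel.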

### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSbtopkKernel.cpp` (new) | Subgroup topk kernel + dispatch logic |
| `TensorTopKSbtopkKernel.h` (new) | `SbtopkResult` enum + `sbtopk_try_launch` declaration |
| `TensorTopKKernel.cpp` | Modified caller: tries optimized path first, skips sort if already sorted |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output
verified monotonic)
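
The two checks can be sketched in plain Python. The helper names are hypothetical and the real tests presumably operate on tensors; the point is that comparing sorted value lists tolerates differences in tie order, while the sortedness check looks only at adjacent pairs.

```python
def values_match(ref_vals, test_vals):
    """Sort-then-compare: order-insensitive equality of the selected values."""
    return sorted(ref_vals) == sorted(test_vals)

def is_monotonic(vals, largest=True):
    """Check that output is non-increasing (largest=True) or non-decreasing."""
    pairs = list(zip(vals, vals[1:]))
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)
```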

### Benchmark summary

**By batch size:**

| bs | speedup vs orig | vs CUDA 4080S | cases |
|----|:-:|:-:|:-:|
| 1 | 1.00x | 0.41x | 72 |
| 8 | 1.00x | 0.43x | 72 |
| 64 | 1.00x | 0.42x | 72 |
| 256 | 1.00x | 0.36x | 72 |
| 1024 | 2.53x | 0.63x | 72 |
| 2048 | 2.55x | 0.55x | 72 |

**By dim:**

| dim | speedup vs orig | vs CUDA 4080S | cases |
|-----|:-:|:-:|:-:|
| 128 | 1.00x | 0.36x | 54 |
| 129 | 1.00x | 0.39x | 54 |
| 1024 | 1.47x | 0.77x | 54 |
| 1025 | 1.35x | 0.63x | 54 |
| 8192 | 1.62x | 0.60x | 54 |
| 8193 | 1.30x | 0.48x | 54 |
| 131072 | 1.87x | 0.34x | 54 |
| 131073 | 1.53x | 0.28x | 54 |

### Full 432-case results

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory
bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50
iters.
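
The `BW (GB/s)` and `%peak` columns are not defined in the text. The sketch below is an assumption that appears to reproduce several rows: bytes moved are taken as the input tensor plus the `k` output values and `k` int64 indices per slice, divided by the "XPU opt" time. The function names are hypothetical.

```python
PEAK_GBPS = 456.0  # B580 peak memory bandwidth quoted above
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}
INDEX_BYTES = 8  # torch.topk returns int64 indices

def achieved_bandwidth_gbps(dtype, bs, dim, k, time_us):
    """Assumed formula: (input + output values + output indices) / time."""
    elem = DTYPE_BYTES[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9

def percent_of_peak(gbps):
    """Achieved bandwidth as a percentage of the quoted peak."""
    return 100.0 * gbps / PEAK_GBPS
```

For the row `bfloat16, bs=256, dim=131072, k=4` with 3087.7 us, this gives about 21.7 GB/s, which is roughly 5% of the quoted peak.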

<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU opt (us) | CUDA 4080S (us) | speedup | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 37.4 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.2 | 1.06x | 0.87x | 0.3 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 39.1 | 1.00x | 0.80x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 37.0 | 0.94x | 0.76x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 40.3 | 1.00x | 0.83x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 39.7 | 1.00x | 0.82x | 0.3 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 46.4 | 1.01x | 0.11x | 0.6 | 0% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 46.8 | 1.02x | 0.13x | 0.7 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 46.8 | 0.99x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 46.7 | 0.98x | 0.11x | 0.6 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 20.5 | 1.00x | 0.67x | 0.5 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 42.2 | 1.06x | 0.82x | 2.5 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 39.9 | 0.95x | 0.73x | 2.4 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 42.4 | 1.01x | 0.78x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 43.3 | 1.00x | 0.79x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 43.5 | 1.00x | 0.80x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 42.7 | 1.12x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 56.8 | 0.98x | 0.14x | 5.3 | 1% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 56.5 | 1.06x | 0.14x | 5.3 | 1% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 56.7 | 0.99x | 0.13x | 4.8 | 1% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 56.8 | 0.97x | 0.14x | 5.2 | 1% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 57.1 | 1.00x | 0.14x | 5.2 | 1% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 56.9 | 1.01x | 0.13x | 4.8 | 1% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 14.9 | 1.00x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 14.7 | 1.00x | 0.48x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 15.5 | 1.00x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 19.5 | 1.00x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 19.5 | 1.00x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 19.5 | 1.00x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 20.6 | 1.00x | 0.61x | 4.1 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 49.9 | 1.01x | 0.54x | 11.4 | 3% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 49.5 | 1.01x | 0.51x | 10.9 | 2% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 49.6 | 1.00x | 0.49x | 10.5 | 2% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 49.8 | 1.07x | 0.55x | 11.7 | 3% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 49.6 | 1.02x | 0.52x | 10.9 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 49.7 | 1.00x | 0.50x | 10.6 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 162.9 | 1.01x | 0.18x | 18.9 | 4% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 164.6 | 0.99x | 0.17x | 17.7 | 4% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 164.4 | 1.01x | 0.17x | 16.9 | 4% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 166.8 | 1.01x | 0.19x | 18.9 | 4% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 166.8 | 1.00x | 0.18x | 17.8 | 4% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 165.5 | 1.00x | 0.17x | 16.8 | 4% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 15.7 | 1.00x | 0.47x | 2.2 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 15.6 | 1.01x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 42.6 | 1.00x | 0.72x | 9.5 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 45.9 | 0.98x | 0.63x | 7.4 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 46.7 | 1.01x | 0.63x | 7.4 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 47.1 | 1.00x | 0.62x | 7.5 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 75.2 | 0.99x | 0.29x | 15.9 | 3% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 75.0 | 1.00x | 0.28x | 15.6 | 3% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 75.2 | 0.99x | 0.26x | 14.6 | 3% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 75.1 | 0.97x | 0.28x | 15.7 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 75.6 | 1.00x | 0.28x | 15.4 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 75.7 | 1.00x | 0.26x | 14.7 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 439.2 | 1.00x | 0.14x | 21.7 | 5% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 436.9 | 1.00x | 0.13x | 20.5 | 5% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 440.9 | 1.00x | 0.13x | 19.4 | 4% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 441.5 | 1.00x | 0.14x | 21.7 | 5% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 435.4 | 1.00x | 0.13x | 20.5 | 5% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 443.1 | 1.00x | 0.13x | 19.3 | 4% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 22.1 | 1.02x | 0.32x | 4.4 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 24.4 | 1.02x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 63.1 | 2.72x | 1.12x | 38.0 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 63.3 | 2.78x | 1.13x | 38.8 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 63.4 | 2.73x | 1.10x | 39.3 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 64.5 | 2.54x | 0.75x | 24.9 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 64.7 | 2.58x | 0.75x | 25.1 | 6% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 64.8 | 2.59x | 0.74x | 25.9 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 147.6 | 3.79x | 0.60x | 67.8 | 15% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 147.4 | 3.95x | 0.59x | 67.6 | 15% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 148.0 | 4.12x | 0.59x | 67.4 | 15% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 149.2 | 2.32x | 0.37x | 41.4 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 148.4 | 2.43x | 0.36x | 41.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 149.3 | 2.56x | 0.37x | 41.6 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1865.9 | 6.52x | 1.06x | 152.3 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1867.4 | 6.92x | 1.06x | 152.3 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1863.0 | 7.28x | 1.05x | 152.0 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 1940.1 | 3.84x | 0.65x | 89.5 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 1944.4 | 4.28x | 0.65x | 89.5 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 30.5 | 1.00x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 35.7 | 1.00x | 0.29x | 7.0 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 85.7 | 2.87x | 0.89x | 44.4 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 88.7 | 2.67x | 0.56x | 27.5 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 234.9 | 3.78x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 234.1 | 3.94x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 235.0 | 4.13x | 0.47x | 67.9 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 236.2 | 2.27x | 0.29x | 40.8 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 237.4 | 2.50x | 0.29x | 41.0 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3646.5 | 6.57x | 1.05x | 154.1 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3644.1 | 6.94x | 1.05x | 154.1 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3646.2 | 7.34x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 3780.1 | 4.08x | 0.64x | 90.5 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 3773.0 | 4.31x | 0.64x | 90.5 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 14.4 | 1.01x | 0.48x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 17.4 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 17.8 | 1.00x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 20.1 | 1.00x | 0.66x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 33.6 | 1.08x | 0.88x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 33.8 | 0.85x | 0.70x | 0.3 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 31.5 | 0.94x | 0.65x | 0.3 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 37.4 | 1.11x | 0.91x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 36.9 | 0.97x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 33.3 | 0.89x | 0.65x | 0.3 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 46.2 | 0.98x | 0.15x | 0.9 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 46.5 | 0.97x | 0.14x | 0.8 | 0% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 46.1 | 0.98x | 0.13x | 0.7 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 46.9 | 1.08x | 0.16x | 0.9 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 46.8 | 0.99x | 0.14x | 0.8 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 46.6 | 0.93x | 0.12x | 0.7 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 14.4 | 1.01x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 14.5 | 1.01x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 19.1 | 1.01x | 0.63x | 0.5 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 19.3 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 20.5 | 1.01x | 0.68x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 37.9 | 1.00x | 0.83x | 2.9 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 39.8 | 1.00x | 0.82x | 2.7 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 41.7 | 0.94x | 0.81x | 2.6 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 39.2 | 1.07x | 0.86x | 2.9 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 40.7 | 0.94x | 0.84x | 2.7 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 43.0 | 1.05x | 0.83x | 2.6 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 56.0 | 0.93x | 0.17x | 6.3 | 1% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 56.1 | 0.95x | 0.16x | 5.9 | 1% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 56.3 | 1.01x | 0.15x | 5.4 | 1% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 56.2 | 1.00x | 0.18x | 6.7 | 1% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 56.3 | 0.99x | 0.16x | 6.1 | 1% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 56.3 | 0.97x | 0.15x | 5.6 | 1% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 14.9 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 15.3 | 1.01x | 0.51x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 15.1 | 1.01x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 19.4 | 1.00x | 0.64x | 4.7 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 19.7 | 1.01x | 0.62x | 4.2 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 20.4 | 1.00x | 0.64x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 49.4 | 0.97x | 0.59x | 12.5 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 49.2 | 0.99x | 0.58x | 12.5 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 49.2 | 0.98x | 0.54x | 11.7 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 49.4 | 1.01x | 0.62x | 13.1 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 49.4 | 1.04x | 0.59x | 12.5 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 49.4 | 1.02x | 0.56x | 11.9 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 162.1 | 1.04x | 0.22x | 23.2 | 5% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 160.5 | 1.01x | 0.21x | 21.5 | 5% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 162.4 | 0.98x | 0.19x | 19.4 | 4% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 161.6 | 1.00x | 0.23x | 23.7 | 5% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 163.9 | 1.01x | 0.21x | 21.6 | 5% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 162.9 | 1.01x | 0.19x | 19.6 | 4% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 15.5 | 1.00x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 15.9 | 1.01x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 37.9 | 1.00x | 0.75x | 10.5 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 38.8 | 1.00x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 39.9 | 0.98x | 0.71x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 43.8 | 1.02x | 0.65x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 74.1 | 0.99x | 0.33x | 19.0 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 74.4 | 1.00x | 0.32x | 18.0 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 74.7 | 0.99x | 0.30x | 16.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 74.3 | 0.99x | 0.34x | 19.1 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 74.8 | 1.01x | 0.32x | 18.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 74.9 | 0.98x | 0.29x | 16.5 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 428.9 | 0.99x | 0.18x | 27.7 | 6% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 427.9 | 1.00x | 0.16x | 25.2 | 6% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 428.2 | 1.00x | 0.15x | 22.9 | 5% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 431.9 | 0.98x | 0.18x | 27.5 | 6% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 429.8 | 0.99x | 0.16x | 25.1 | 5% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 432.2 | 1.00x | 0.15x | 22.8 | 5% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 20.9 | 1.00x | 0.31x | 4.5 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 20.9 | 1.01x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 21.4 | 1.00x | 0.30x | 6.0 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 23.3 | 1.00x | 0.35x | 4.6 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 23.1 | 1.01x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 23.4 | 0.98x | 0.32x | 5.9 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 62.7 | 2.74x | 1.30x | 44.2 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 63.0 | 2.80x | 1.29x | 44.7 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 63.1 | 2.89x | 1.27x | 45.5 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 64.2 | 1.89x | 0.66x | 21.9 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 64.5 | 2.08x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 145.0 | 3.99x | 0.73x | 84.6 | 19% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 144.6 | 4.22x | 0.73x | 84.7 | 19% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 145.5 | 4.50x | 0.72x | 83.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 146.1 | 1.75x | 0.32x | 36.9 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 146.5 | 1.83x | 0.32x | 36.9 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 146.2 | 1.98x | 0.32x | 36.8 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1846.9 | 5.88x | 1.20x | 174.8 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1856.1 | 6.41x | 1.20x | 174.1 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1858.5 | 7.06x | 1.21x | 174.6 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 1924.0 | 2.82x | 0.60x | 84.1 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 1921.5 | 3.12x | 0.60x | 84.3 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 1926.4 | 3.42x | 0.60x | 84.3 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 28.3 | 0.99x | 0.26x | 5.6 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 29.2 | 1.01x | 0.26x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 32.6 | 0.99x | 0.30x | 5.6 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 33.5 | 1.00x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 86.0 | 3.08x | 1.02x | 53.6 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 87.7 | 1.92x | 0.49x | 23.8 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 228.7 | 3.94x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 228.5 | 4.16x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 230.5 | 4.51x | 0.57x | 84.5 | 19% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 231.7 | 1.70x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 233.1 | 1.95x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3622.2 | 5.90x | 1.19x | 176.9 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3607.3 | 6.49x | 1.19x | 177.3 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 3743.3 | 2.84x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 3747.1 | 3.12x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 3749.5 | 3.43x | 0.59x | 85.2 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 14.5 | 0.99x | 0.46x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 14.3 | 0.99x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 14.0 | 0.94x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 14.4 | 0.99x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 14.4 | 1.04x | 0.44x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 15.2 | 0.99x | 0.44x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 17.8 | 1.08x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 22.2 | 0.99x | 0.62x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 18.8 | 0.98x | 0.52x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 19.6 | 0.99x | 0.55x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 39.6 | 1.07x | 0.77x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 38.0 | 1.16x | 0.69x | 0.6 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 38.7 | 0.94x | 0.67x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 34.1 | 0.99x | 0.66x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 41.6 | 1.03x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 39.8 | 1.04x | 0.69x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 63.3 | 1.04x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 63.3 | 1.03x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 63.3 | 0.91x | 0.14x | 1.1 | 0% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 64.0 | 1.04x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 64.0 | 1.02x | 0.15x | 1.3 | 0% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 63.9 | 1.01x | 0.15x | 1.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 14.7 | 0.91x | 0.40x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 14.3 | 1.03x | 0.42x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 15.2 | 0.96x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 15.0 | 0.98x | 0.42x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 15.0 | 1.08x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 15.3 | 0.99x | 0.43x | 0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 20.9 | 1.12x | 0.59x | 0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 19.7 | 1.07x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 19.7 | 0.95x | 0.49x | 0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 20.6 | 1.07x | 0.58x | 0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 21.4 | 0.99x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 22.0 | 1.12x | 0.60x | 0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 41.0 | 0.92x | 0.65x | 4.2 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 42.6 | 1.10x | 0.77x | 4.8 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 44.7 | 1.09x | 0.73x | 4.3 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 43.0 | 0.85x | 0.67x | 4.1 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 43.5 | 1.09x | 0.71x | 4.3 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 45.0 | 0.95x | 0.67x | 3.9 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 76.0 | 1.00x | 0.19x | 10.2 | 2% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 76.0 | 1.02x | 0.18x | 9.9 | 2% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 76.2 | 1.03x | 0.17x | 9.2 | 2% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 76.0 | 0.98x | 0.18x | 10.2 | 2% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 75.8 | 1.00x | 0.18x | 9.8 | 2% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 76.1 | 0.99x | 0.16x | 8.8 | 2% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 15.0 | 1.02x | 0.40x | 1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 15.0 | 0.99x | 0.40x | 1.0 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 14.9 | 1.03x | 0.40x | 1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 15.5 | 1.04x | 0.42x | 1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 15.9 | 1.00x | 0.43x | 1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 15.4 | 0.99x | 0.40x | 1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 20.4 | 1.02x | 0.52x | 6.8 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 20.3 | 1.02x | 0.52x | 6.8 | 2% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 20.3 | 1.03x | 0.50x | 6.8 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 22.1 | 0.95x | 0.51x | 6.1 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 22.1 | 0.99x | 0.51x | 6.2 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 22.2 | 0.96x | 0.50x | 6.1 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 65.6 | 0.98x | 0.66x | 21.2 | 5% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 65.6 | 0.97x | 0.62x | 19.7 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 65.6 | 1.00x | 0.60x | 19.2 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 65.6 | 0.98x | 0.66x | 21.1 | 5% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 65.5 | 0.93x | 0.58x | 18.7 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 65.6 | 0.97x | 0.57x | 18.2 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 221.1 | 1.02x | 0.24x | 35.7 | 8% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 220.4 | 1.02x | 0.22x | 33.3 | 7% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 222.6 | 1.01x | 0.21x | 31.0 | 7% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 223.0 | 1.00x | 0.24x | 35.7 | 8% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 225.0 | 0.99x | 0.22x | 33.2 | 7% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 223.7 | 0.99x | 0.20x | 30.5 | 7% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 15.7 | 1.00x | 0.34x | 3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 15.7 | 0.99x | 0.33x | 3.3 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 15.7 | 1.00x | 0.33x | 3.8 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 16.1 | 0.99x | 0.34x | 3.0 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 16.7 | 0.96x | 0.35x | 3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 16.6 | 0.97x | 0.34x | 3.7 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 41.7 | 0.98x | 0.61x | 15.5 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 43.2 | 1.01x | 0.62x | 15.3 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 43.8 | 0.99x | 0.61x | 15.3 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 45.9 | 1.02x | 0.57x | 13.1 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 46.6 | 1.01x | 0.55x | 12.7 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 48.1 | 0.98x | 0.54x | 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 101.0 | 0.99x | 0.36x | 30.3 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 101.3 | 1.05x | 0.35x | 29.4 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 100.9 | 0.99x | 0.32x | 26.7 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 101.7 | 1.02x | 0.37x | 30.2 | 7% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 101.6 | 1.00x | 0.35x | 28.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 101.8 | 1.00x | 0.32x | 26.5 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1128.3 | 1.00x | 0.35x | 41.9 | 9% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1132.5 | 0.99x | 0.33x | 39.0 | 9% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1129.5 | 1.00x | 0.30x | 36.2 | 8% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 1148.5 | 1.00x | 0.36x | 42.0 | 9% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 1148.0 | 1.00x | 0.34x | 39.2 | 9% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 1147.9 | 1.01x | 0.31x | 36.1 | 8% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 22.3 | 0.99x | 0.22x | 5.6 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 22.0 | 1.02x | 0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 22.2 | 0.98x | 0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 24.4 | 1.01x | 0.24x | 5.7 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 24.4 | 1.00x | 0.23x | 5.8 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 24.6 | 0.99x | 0.22x | 6.5 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 88.3 | 3.70x | 1.76x | 84.5 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 88.3 | 3.88x | 1.76x | 87.5 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 90.2 | 2.73x | 0.98x | 46.1 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 90.1 | 2.84x | 0.97x | 46.5 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 200.8 | 4.34x | 0.87x | 145.7 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 200.2 | 4.64x | 0.87x | 145.6 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 201.7 | 4.93x | 0.87x | 145.8 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 202.4 | 2.18x | 0.44x | 72.3 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 201.3 | 2.31x | 0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 202.6 | 2.49x | 0.43x | 72.4 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 4191.1 | 6.06x | 2.13x | 273.4 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 4189.9 | 6.47x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 4260.7 | 3.35x | 1.20x | 151.4 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 4261.2 | 3.60x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 4261.2 | 3.91x | 1.20x | 151.0 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 30.2 | 1.00x | 0.18x | 6.7 | 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 30.6 | 1.00x | 0.17x | 7.0 | 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 31.2 | 1.00x | 0.17x | 7.9 | 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 35.4 | 1.00x | 0.21x | 6.8 | 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 35.3 | 1.00x | 0.20x | 7.1 | 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 36.4 | 1.00x | 0.20x | 7.9 | 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 123.4 | 3.89x | 1.44x | 99.1 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 123.4 | 4.04x | 1.44x | 99.9 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 123.7 | 4.08x | 1.42x | 100.8 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 127.0 | 3.02x | 0.76x | 52.6 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 424.8 | 4.73x | 0.96x | 151.8 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 424.0 | 5.07x | 0.95x | 152.0 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 436.2 | 2.18x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 437.6 | 2.34x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 440.8 | 2.49x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 8254.6 | 6.43x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 8250.2 | 6.97x | 2.10x | 273.3 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 8407.4 | 3.36x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 8407.4 | 3.60x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 8413.3 | 3.91x | 1.19x | 152.5 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2
alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify `torch.topk(sorted=True)` output is
monotonic for both `largest=True/False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with
20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`,
where the 8 bytes are the int64 index written per output element. Peak
B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
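
As a rough check of the bandwidth formula, the sketch below recomputes one row of the full table (float32, bs=2048, dim=131072, k=4, optimized time 3925.9 us). The function and parameter names are illustrative and not part of the PR; `index_bytes=8` assumes the index output is int64.

```python
def effective_bandwidth_gbs(bs, dim, k, dtype_bytes, time_us, index_bytes=8):
    """Bytes read (input) plus bytes written (values + indices), per second, in GB/s."""
    bytes_read = bs * dim * dtype_bytes
    bytes_written = bs * k * (dtype_bytes + index_bytes)
    return (bytes_read + bytes_written) / (time_us * 1e-6) / 1e9


def peak_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Theoretical peak: bus width (bits) x per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


bw = effective_bandwidth_gbs(bs=2048, dim=131072, k=4, dtype_bytes=4, time_us=3925.9)
peak = peak_bandwidth_gbs(192, 19)
print(f"{bw:.1f} GB/s, {100 * bw / peak:.0f}% of {peak:.0f} GB/s peak")
# prints: 273.5 GB/s, 60% of 456 GB/s peak
```

This matches the 273.5 GB/s and 60% reported for that row.
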
Copilot AI pushed a commit that referenced this pull request May 6, 2026
## Summary

- **Speedup vs original XPU:** 1.3648x geomean over 432 cases, 130 wins
(>1.05x), 40 regressions (<0.98x)
- **vs CUDA 4080S:** 0.4574x geomean (>1 means XPU faster)
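
For readers who want to reproduce the summary statistics, here is a minimal sketch of how a geomean and the win/regression counts could be computed from per-case speedups. The helper names are hypothetical, and the sample is only a handful of speedups copied from the full table, not the 432-case set, so it does not reproduce the 1.3648x figure.

```python
import math


def geomean(ratios):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def summarize(speedups, win=1.05, regress=0.98):
    """Thresholds follow the summary: wins are >1.05x, regressions are <0.98x."""
    return {
        "geomean": geomean(speedups),
        "wins": sum(s > win for s in speedups),
        "regressions": sum(s < regress for s in speedups),
    }


# Six speedups taken from individual rows of the full results table.
sample = [1.00, 0.97, 2.72, 6.52, 0.85, 1.06]
summary = summarize(sample)
print(summary["wins"], summary["regressions"])  # 3 2
```

Because the table reports speedups rounded to two decimals, a geomean recomputed from it may differ slightly from one computed on raw timings.
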

### Approach

Add a **subgroup topk kernel** (`SubgroupTopKFunctor` in
`TensorTopKSbtopkKernel.cpp`) where each 32-lane sub-group processes one
slice entirely in registers:

- **Phase 1:** Each lane scans `dim/32` elements, maintaining a sorted
top-k buffer via insertion sort (fully unrolled).
- **Phase 2:** 5-level bitonic merge across sub-group lanes via
`sycl::select_from_group` shuffle.
- **Phase 3:** Lane 0 writes `k` results. Output is already sorted.
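
The three phases can be illustrated with a small host-side simulation. This is only a sketch under assumptions, not the SYCL kernel: it tracks values only (no indices), assumes a lane-strided scan, handles `largest=True` only, and replaces the bitonic merge network with a plain two-way merge between lane `i` and lane `i ^ offset`, which stands in for the `sycl::select_from_group` exchange.

```python
import random

LANES = 32  # sub-group width stated in the description


def insert_sorted(buf, value, k):
    """Insert value into a descending top-k buffer, keeping at most k items."""
    pos = len(buf)
    while pos > 0 and buf[pos - 1] < value:
        pos -= 1
    if pos < k:
        buf.insert(pos, value)
        del buf[k:]


def merge_topk(a, b, k):
    """Merge two descending lists and keep the k largest."""
    out, i, j = [], 0, 0
    while len(out) < k and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def subgroup_topk(row, k):
    # Phase 1: each lane scans its strided share and keeps a sorted top-k.
    lanes = [[] for _ in range(LANES)]
    for idx, value in enumerate(row):
        insert_sorted(lanes[idx % LANES], value, k)
    # Phase 2: log2(32) = 5 exchange levels; lane i pairs with lane i ^ offset.
    offset = 1
    while offset < LANES:
        lanes = [merge_topk(lanes[i], lanes[i ^ offset], k) for i in range(LANES)]
        offset *= 2
    # Phase 3: lane 0 holds the final result, already sorted.
    return lanes[0]


random.seed(0)
row = [random.random() for _ in range(1025)]
result = subgroup_topk(row, 8)
```

After the five levels every simulated lane holds the same global top-k, so reading lane 0 is sufficient and no separate sort pass is needed.
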

Key properties:
- Zero SLM (shared local memory), zero barriers
- `largest` as compile-time template parameter — eliminates per-element
direction branches
- `int32`/`int64` index dispatch mirroring CUDA's `canUse32BitIndexMath`
- Kernel isolated in a separate translation unit to prevent SYCL
compiler global optimization interference with the original kernel

**Dispatch:** `k <= 16` and `nsegments >= HW_thread_slots / 4` and
`dim >= 32` → subgroup kernel (SORTED); otherwise → original kernel.
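
The dispatch rule reads naturally as a single predicate. The sketch below is illustrative only: the function name is hypothetical, and the `hw_thread_slots` value of 2048 is a placeholder chosen for the example, not the B580's actual figure.

```python
def use_subgroup_kernel(k, nsegments, dim, hw_thread_slots):
    """True when the subgroup kernel would be chosen under the stated rule."""
    return k <= 16 and nsegments >= hw_thread_slots / 4 and dim >= 32


HW_THREAD_SLOTS = 2048  # placeholder value, assumed for illustration only

print(use_subgroup_kernel(8, 1024, 8192, HW_THREAD_SLOTS))  # True: enough segments
print(use_subgroup_kernel(8, 256, 8192, HW_THREAD_SLOTS))   # False: too few segments
print(use_subgroup_kernel(32, 1024, 8192, HW_THREAD_SLOTS)) # False: k too large
```

With this placeholder, bs=256 falls back to the original kernel and bs=1024 takes the subgroup path, which is consistent with the benchmark tables showing speedups only at bs=1024 and bs=2048.
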

### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSbtopkKernel.cpp` (new) | Subgroup topk kernel + dispatch logic |
| `TensorTopKSbtopkKernel.h` (new) | `SbtopkResult` enum + `sbtopk_try_launch` declaration |
| `TensorTopKKernel.cpp` | Modified caller: tries optimized path first, skips sort if already sorted |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output
verified monotonic)
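
The two checks can be sketched in plain Python. This is an assumed reading of "sort-then-compare" (compare the top-k values after sorting both sides, so output order and tie-breaking do not matter) and of the monotonicity check; the helper names are hypothetical and the lists stand in for CPU and XPU tensors.

```python
import heapq


def topk_values_match(reference, candidate):
    """Sort-then-compare: order-insensitive equality of two top-k value lists."""
    return sorted(reference) == sorted(candidate)


def is_monotonic(values, largest=True):
    """Non-increasing for largest=True, non-decreasing for largest=False."""
    pairs = list(zip(values, values[1:]))
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


row = [0.5, 2.0, -1.0, 2.0, 3.5, 0.0, 1.25]
reference = sorted(row, reverse=True)[:3]  # stand-in for the CPU result
candidate = heapq.nlargest(3, row)[::-1]   # same values, returned in another order

print(topk_values_match(reference, candidate))  # True
print(is_monotonic(reference, largest=True))    # True
print(is_monotonic(candidate, largest=True))    # False
```

A real test on floating-point tensors might compare with a tolerance rather than exact equality; the exact comparison here is a simplification.
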

### Benchmark summary

**By batch size:**

| bs | speedup vs orig | vs CUDA 4080S | cases |
|----|:-:|:-:|:-:|
| 1 | 1.00x | 0.41x | 72 |
| 8 | 1.00x | 0.43x | 72 |
| 64 | 1.00x | 0.42x | 72 |
| 256 | 1.00x | 0.36x | 72 |
| 1024 | 2.53x | 0.63x | 72 |
| 2048 | 2.55x | 0.55x | 72 |

**By dim:**

| dim | speedup vs orig | vs CUDA 4080S | cases |
|-----|:-:|:-:|:-:|
| 128 | 1.00x | 0.36x | 54 |
| 129 | 1.00x | 0.39x | 54 |
| 1024 | 1.47x | 0.77x | 54 |
| 1025 | 1.35x | 0.63x | 54 |
| 8192 | 1.62x | 0.60x | 54 |
| 8193 | 1.30x | 0.48x | 54 |
| 131072 | 1.87x | 0.34x | 54 |
| 131073 | 1.53x | 0.28x | 54 |
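
The case counts in the two tables above (72 per batch size, 54 per dim) follow from the benchmark grid. The sketch below enumerates that grid using the dtype, batch-size, dim and k values that appear in the full results table; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

dtypes = ["bfloat16", "float16", "float32"]
batch_sizes = [1, 8, 64, 256, 1024, 2048]
base_dims = [128, 1024, 8192, 131072]
alignment_offsets = [0, 1]  # the aligned dim and dim + 1
k_values = [4, 8, 16]

cases = [
    (dtype, bs, base + off, k)
    for dtype, bs, base, off, k in product(
        dtypes, batch_sizes, base_dims, alignment_offsets, k_values
    )
]
per_bs = Counter(bs for _, bs, _, _ in cases)
per_dim = Counter(dim for _, _, dim, _ in cases)

print(len(cases), per_bs[1024], per_dim[8193])  # 432 72 54
```
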

### Full 432-case results

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory
bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50
iters.
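
The derived columns can be reproduced from the raw timings. The sketch below is an inference from the table values, not something stated in the PR: it assumes speedup is `XPU orig / XPU opt`, "vs CUDA" is `CUDA / XPU opt`, and the bandwidth counts the input bytes plus `k` output values and `k` 8-byte indices per slice, divided by the optimized XPU time. `derive_metrics` and `Metrics` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Metrics {
  double speedup;   // XPU orig time / XPU opt time
  double vs_cuda;   // CUDA time / XPU opt time (>1 means XPU faster)
  double bw_gbs;    // effective bandwidth in GB/s
  double pct_peak;  // percentage of the stated 456 GB/s peak
};

inline Metrics derive_metrics(std::int64_t bs, std::int64_t dim,
                              std::int64_t k, int elem_bytes,
                              double orig_us, double opt_us, double cuda_us,
                              double peak_gbs = 456.0) {
  assert(opt_us > 0.0);
  // Assumed traffic: read the whole input, write k values + k int64 indices.
  const double bytes = static_cast<double>(bs) * dim * elem_bytes +
                       static_cast<double>(bs) * k * (elem_bytes + 8);
  const double bw = bytes / (opt_us * 1e-6) / 1e9;
  return {orig_us / opt_us, cuda_us / opt_us, bw, 100.0 * bw / peak_gbs};
}
```

Under these assumptions, the `float32`, `bs = 1024`, `dim = 1024`, `k = 4` row (185.6, 50.2, 88.3 us) gives about 3.70x, 1.76x, 84.5 GB/s and 19% of peak, which agrees with the table.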

<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU opt (us) | CUDA 4080S (us) | speedup | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 37.4 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.2 | 1.06x | 0.87x | 0.3 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 39.1 | 1.00x | 0.80x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 37.0 | 0.94x | 0.76x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 40.3 | 1.00x | 0.83x | 0.3 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 39.7 | 1.00x | 0.82x | 0.3 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 46.3 | 0.98x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 46.4 | 1.01x | 0.11x | 0.6 | 0% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 46.8 | 1.02x | 0.13x | 0.7 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 46.8 | 0.99x | 0.12x | 0.7 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 46.7 | 0.98x | 0.11x | 0.6 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 20.5 | 1.00x | 0.67x | 0.5 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 42.2 | 1.06x | 0.82x | 2.5 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 39.9 | 0.95x | 0.73x | 2.4 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 42.4 | 1.01x | 0.78x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 43.3 | 1.00x | 0.79x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 43.5 | 1.00x | 0.80x | 2.4 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 42.7 | 1.12x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 56.8 | 0.98x | 0.14x | 5.3 | 1% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 56.5 | 1.06x | 0.14x | 5.3 | 1% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 56.7 | 0.99x | 0.13x | 4.8 | 1% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 56.8 | 0.97x | 0.14x | 5.2 | 1% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 57.1 | 1.00x | 0.14x | 5.2 | 1% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 56.9 | 1.01x | 0.13x | 4.8 | 1% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 14.9 | 1.00x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 14.7 | 1.00x | 0.48x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 15.5 | 1.00x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 19.5 | 1.00x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 19.5 | 1.00x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 19.5 | 1.00x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 20.6 | 1.00x | 0.61x | 4.1 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 49.9 | 1.01x | 0.54x | 11.4 | 3% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 49.5 | 1.01x | 0.51x | 10.9 | 2% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 49.6 | 1.00x | 0.49x | 10.5 | 2% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 49.8 | 1.07x | 0.55x | 11.7 | 3% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 49.6 | 1.02x | 0.52x | 10.9 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 49.7 | 1.00x | 0.50x | 10.6 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 162.9 | 1.01x | 0.18x | 18.9 | 4% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 164.6 | 0.99x | 0.17x | 17.7 | 4% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 164.4 | 1.01x | 0.17x | 16.9 | 4% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 166.8 | 1.01x | 0.19x | 18.9 | 4% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 166.8 | 1.00x | 0.18x | 17.8 | 4% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 165.5 | 1.00x | 0.17x | 16.8 | 4% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 15.7 | 1.00x | 0.47x | 2.2 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 15.6 | 1.01x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 42.6 | 1.00x | 0.72x | 9.5 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 45.9 | 0.98x | 0.63x | 7.4 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 46.7 | 1.01x | 0.63x | 7.4 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 47.1 | 1.00x | 0.62x | 7.5 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 75.2 | 0.99x | 0.29x | 15.9 | 3% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 75.0 | 1.00x | 0.28x | 15.6 | 3% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 75.2 | 0.99x | 0.26x | 14.6 | 3% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 75.1 | 0.97x | 0.28x | 15.7 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 75.6 | 1.00x | 0.28x | 15.4 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 75.7 | 1.00x | 0.26x | 14.7 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 439.2 | 1.00x | 0.14x | 21.7 | 5% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 436.9 | 1.00x | 0.13x | 20.5 | 5% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 440.9 | 1.00x | 0.13x | 19.4 | 4% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 441.5 | 1.00x | 0.14x | 21.7 | 5% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 435.4 | 1.00x | 0.13x | 20.5 | 5% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 443.1 | 1.00x | 0.13x | 19.3 | 4% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 22.1 | 1.02x | 0.32x | 4.4 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 24.4 | 1.02x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 63.1 | 2.72x | 1.12x | 38.0 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 63.3 | 2.78x | 1.13x | 38.8 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 63.4 | 2.73x | 1.10x | 39.3 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 64.5 | 2.54x | 0.75x | 24.9 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 64.7 | 2.58x | 0.75x | 25.1 | 6% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 64.8 | 2.59x | 0.74x | 25.9 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 147.6 | 3.79x | 0.60x | 67.8 | 15% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 147.4 | 3.95x | 0.59x | 67.6 | 15% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 148.0 | 4.12x | 0.59x | 67.4 | 15% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 149.2 | 2.32x | 0.37x | 41.4 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 148.4 | 2.43x | 0.36x | 41.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 149.3 | 2.56x | 0.37x | 41.6 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1865.9 | 6.52x | 1.06x | 152.3 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1867.4 | 6.92x | 1.06x | 152.3 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1863.0 | 7.28x | 1.05x | 152.0 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 1940.1 | 3.84x | 0.65x | 89.5 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 1944.4 | 4.28x | 0.65x | 89.5 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 30.5 | 1.00x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 35.7 | 1.00x | 0.29x | 7.0 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 85.7 | 2.87x | 0.89x | 44.4 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 88.7 | 2.67x | 0.56x | 27.5 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 234.9 | 3.78x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 234.1 | 3.94x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 235.0 | 4.13x | 0.47x | 67.9 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 236.2 | 2.27x | 0.29x | 40.8 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 237.4 | 2.50x | 0.29x | 41.0 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3646.5 | 6.57x | 1.05x | 154.1 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3644.1 | 6.94x | 1.05x | 154.1 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3646.2 | 7.34x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 3780.1 | 4.08x | 0.64x | 90.5 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 3773.0 | 4.31x | 0.64x | 90.5 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 14.4 | 1.01x | 0.48x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 17.4 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 17.5 | 1.00x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 17.8 | 1.00x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 20.1 | 1.00x | 0.66x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 33.6 | 1.08x | 0.88x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 33.8 | 0.85x | 0.70x | 0.3 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 31.5 | 0.94x | 0.65x | 0.3 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 37.4 | 1.11x | 0.91x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 36.9 | 0.97x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 33.3 | 0.89x | 0.65x | 0.3 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 46.2 | 0.98x | 0.15x | 0.9 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 46.5 | 0.97x | 0.14x | 0.8 | 0% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 46.1 | 0.98x | 0.13x | 0.7 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 46.9 | 1.08x | 0.16x | 0.9 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 46.8 | 0.99x | 0.14x | 0.8 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 46.6 | 0.93x | 0.12x | 0.7 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 14.4 | 1.01x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 14.5 | 1.01x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 19.1 | 1.01x | 0.63x | 0.5 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 19.3 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 20.5 | 1.01x | 0.68x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 37.9 | 1.00x | 0.83x | 2.9 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 39.8 | 1.00x | 0.82x | 2.7 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 41.7 | 0.94x | 0.81x | 2.6 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 39.2 | 1.07x | 0.86x | 2.9 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 40.7 | 0.94x | 0.84x | 2.7 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 43.0 | 1.05x | 0.83x | 2.6 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 56.0 | 0.93x | 0.17x | 6.3 | 1% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 56.1 | 0.95x | 0.16x | 5.9 | 1% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 56.3 | 1.01x | 0.15x | 5.4 | 1% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 56.2 | 1.00x | 0.18x | 6.7 | 1% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 56.3 | 0.99x | 0.16x | 6.1 | 1% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 56.3 | 0.97x | 0.15x | 5.6 | 1% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 14.9 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 15.3 | 1.01x | 0.51x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 15.1 | 1.01x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 19.4 | 1.00x | 0.64x | 4.7 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 19.7 | 1.01x | 0.62x | 4.2 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 20.4 | 1.00x | 0.64x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 49.4 | 0.97x | 0.59x | 12.5 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 49.2 | 0.99x | 0.58x | 12.5 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 49.2 | 0.98x | 0.54x | 11.7 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 49.4 | 1.01x | 0.62x | 13.1 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 49.4 | 1.04x | 0.59x | 12.5 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 49.4 | 1.02x | 0.56x | 11.9 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 162.1 | 1.04x | 0.22x | 23.2 | 5% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 160.5 | 1.01x | 0.21x | 21.5 | 5% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 162.4 | 0.98x | 0.19x | 19.4 | 4% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 161.6 | 1.00x | 0.23x | 23.7 | 5% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 163.9 | 1.01x | 0.21x | 21.6 | 5% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 162.9 | 1.01x | 0.19x | 19.6 | 4% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 15.5 | 1.00x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 15.6 | 1.00x | 0.46x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 15.9 | 1.01x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 37.9 | 1.00x | 0.75x | 10.5 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 38.8 | 1.00x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 39.9 | 0.98x | 0.71x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 43.8 | 1.02x | 0.65x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 74.1 | 0.99x | 0.33x | 19.0 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 74.4 | 1.00x | 0.32x | 18.0 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 74.7 | 0.99x | 0.30x | 16.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 74.3 | 0.99x | 0.34x | 19.1 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 74.8 | 1.01x | 0.32x | 18.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 74.9 | 0.98x | 0.29x | 16.5 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 428.9 | 0.99x | 0.18x | 27.7 | 6% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 427.9 | 1.00x | 0.16x | 25.2 | 6% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 428.2 | 1.00x | 0.15x | 22.9 | 5% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 431.9 | 0.98x | 0.18x | 27.5 | 6% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 429.8 | 0.99x | 0.16x | 25.1 | 5% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 432.2 | 1.00x | 0.15x | 22.8 | 5% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 20.9 | 1.00x | 0.31x | 4.5 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 20.9 | 1.01x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 21.4 | 1.00x | 0.30x | 6.0 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 23.3 | 1.00x | 0.35x | 4.6 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 23.1 | 1.01x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 23.4 | 0.98x | 0.32x | 5.9 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 62.7 | 2.74x | 1.30x | 44.2 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 63.0 | 2.80x | 1.29x | 44.7 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 63.1 | 2.89x | 1.27x | 45.5 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 64.2 | 1.89x | 0.66x | 21.9 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 64.5 | 2.08x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 145.0 | 3.99x | 0.73x | 84.6 | 19% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 144.6 | 4.22x | 0.73x | 84.7 | 19% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 145.5 | 4.50x | 0.72x | 83.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 146.1 | 1.75x | 0.32x | 36.9 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 146.5 | 1.83x | 0.32x | 36.9 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 146.2 | 1.98x | 0.32x | 36.8 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1846.9 | 5.88x | 1.20x | 174.8 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1856.1 | 6.41x | 1.20x | 174.1 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1858.5 | 7.06x | 1.21x | 174.6 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 1924.0 | 2.82x | 0.60x | 84.1 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 1921.5 | 3.12x | 0.60x | 84.3 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 1926.4 | 3.42x | 0.60x | 84.3 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 28.3 | 0.99x | 0.26x | 5.6 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 29.2 | 1.01x | 0.26x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 32.6 | 0.99x | 0.30x | 5.6 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 33.5 | 1.00x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 86.0 | 3.08x | 1.02x | 53.6 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 87.7 | 1.92x | 0.49x | 23.8 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 228.7 | 3.94x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 228.5 | 4.16x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 230.5 | 4.51x | 0.57x | 84.5 | 19% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 231.7 | 1.70x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 233.1 | 1.95x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3622.2 | 5.90x | 1.19x | 176.9 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3607.3 | 6.49x | 1.19x | 177.3 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 3743.3 | 2.84x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 3747.1 | 3.12x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 3749.5 | 3.43x | 0.59x | 85.2 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 14.5 | 0.99x | 0.46x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 14.3 | 0.99x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 14.0 | 0.94x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 14.4 | 0.99x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 14.4 | 1.04x | 0.44x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 15.2 | 0.99x | 0.44x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 17.8 | 1.08x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 22.2 | 0.99x | 0.62x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 18.8 | 0.98x | 0.52x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 19.6 | 0.99x | 0.55x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 39.6 | 1.07x | 0.77x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 38.0 | 1.16x | 0.69x | 0.6 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 38.7 | 0.94x | 0.67x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 34.1 | 0.99x | 0.66x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 41.6 | 1.03x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 39.8 | 1.04x | 0.69x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 63.3 | 1.04x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 63.3 | 1.03x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 63.3 | 0.91x | 0.14x | 1.1 | 0% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 64.0 | 1.04x | 0.16x | 1.3 | 0% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 64.0 | 1.02x | 0.15x | 1.3 | 0% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 63.9 | 1.01x | 0.15x | 1.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 14.7 | 0.91x | 0.40x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 14.3 | 1.03x | 0.42x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 15.2 | 0.96x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 15.0 | 0.98x | 0.42x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 15.0 | 1.08x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 15.3 | 0.99x | 0.43x | 0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 20.9 | 1.12x | 0.59x | 0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 19.7 | 1.07x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 19.7 | 0.95x | 0.49x | 0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 20.6 | 1.07x | 0.58x | 0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 21.4 | 0.99x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 22.0 | 1.12x | 0.60x | 0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 41.0 | 0.92x | 0.65x | 4.2 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 42.6 | 1.10x | 0.77x | 4.8 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 44.7 | 1.09x | 0.73x | 4.3 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 43.0 | 0.85x | 0.67x | 4.1 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 43.5 | 1.09x | 0.71x | 4.3 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 45.0 | 0.95x | 0.67x | 3.9 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 76.0 | 1.00x | 0.19x | 10.2 | 2% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 76.0 | 1.02x | 0.18x | 9.9 | 2% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 76.2 | 1.03x | 0.17x | 9.2 | 2% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 76.0 | 0.98x | 0.18x | 10.2 | 2% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 75.8 | 1.00x | 0.18x | 9.8 | 2% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 76.1 | 0.99x | 0.16x | 8.8 | 2% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 15.0 | 1.02x | 0.40x | 1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 15.0 | 0.99x | 0.40x | 1.0 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 14.9 | 1.03x | 0.40x | 1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 15.5 | 1.04x | 0.42x | 1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 15.9 | 1.00x | 0.43x | 1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 15.4 | 0.99x | 0.40x | 1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 20.4 | 1.02x | 0.52x | 6.8 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 20.3 | 1.02x | 0.52x | 6.8 | 2% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 20.3 | 1.03x | 0.50x | 6.8 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 22.1 | 0.95x | 0.51x | 6.1 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 22.1 | 0.99x | 0.51x | 6.2 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 22.2 | 0.96x | 0.50x | 6.1 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 65.6 | 0.98x | 0.66x | 21.2 | 5% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 65.6 | 0.97x | 0.62x | 19.7 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 65.6 | 1.00x | 0.60x | 19.2 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 65.6 | 0.98x | 0.66x | 21.1 | 5% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 65.5 | 0.93x | 0.58x | 18.7 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 65.6 | 0.97x | 0.57x | 18.2 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 221.1 | 1.02x | 0.24x | 35.7 | 8% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 220.4 | 1.02x | 0.22x | 33.3 | 7% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 222.6 | 1.01x | 0.21x | 31.0 | 7% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 223.0 | 1.00x | 0.24x | 35.7 | 8% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 225.0 | 0.99x | 0.22x | 33.2 | 7% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 223.7 | 0.99x | 0.20x | 30.5 | 7% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 15.7 | 1.00x | 0.34x | 3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 15.7 | 0.99x | 0.33x | 3.3 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 15.7 | 1.00x | 0.33x | 3.8 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 16.1 | 0.99x | 0.34x | 3.0 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 16.7 | 0.96x | 0.35x | 3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 16.6 | 0.97x | 0.34x | 3.7 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 41.7 | 0.98x | 0.61x | 15.5 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 43.2 | 1.01x | 0.62x | 15.3 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 43.8 | 0.99x | 0.61x | 15.3 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 45.9 | 1.02x | 0.57x | 13.1 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 46.6 | 1.01x | 0.55x | 12.7 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 48.1 | 0.98x | 0.54x | 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 101.0 | 0.99x | 0.36x | 30.3 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 101.3 | 1.05x | 0.35x | 29.4 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 100.9 | 0.99x | 0.32x | 26.7 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 101.7 | 1.02x | 0.37x | 30.2 | 7% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 101.6 | 1.00x | 0.35x | 28.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 101.8 | 1.00x | 0.32x | 26.5 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1128.3 | 1.00x | 0.35x | 41.9 | 9% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1132.5 | 0.99x | 0.33x | 39.0 | 9% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1129.5 | 1.00x | 0.30x | 36.2 | 8% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 1148.5 | 1.00x | 0.36x | 42.0 | 9% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 1148.0 | 1.00x | 0.34x | 39.2 | 9% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 1147.9 | 1.01x | 0.31x | 36.1 | 8% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 22.3 | 0.99x | 0.22x | 5.6 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 22.0 | 1.02x | 0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 22.2 | 0.98x | 0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 24.4 | 1.01x | 0.24x | 5.7 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 24.4 | 1.00x | 0.23x | 5.8 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 24.6 | 0.99x | 0.22x | 6.5 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 88.3 | 3.70x | 1.76x | 84.5 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 88.3 | 3.88x | 1.76x | 87.5 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 90.2 | 2.73x | 0.98x | 46.1 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 90.1 | 2.84x | 0.97x | 46.5 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 200.8 | 4.34x | 0.87x | 145.7 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 200.2 | 4.64x | 0.87x | 145.6 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 201.7 | 4.93x | 0.87x | 145.8 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 202.4 | 2.18x | 0.44x | 72.3 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 201.3 | 2.31x | 0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 202.6 | 2.49x | 0.43x | 72.4 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 4191.1 | 6.06x | 2.13x | 273.4 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 4189.9 | 6.47x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 4260.7 | 3.35x | 1.20x | 151.4 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 4261.2 | 3.60x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 4261.2 | 3.91x | 1.20x | 151.0 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 30.2 | 1.00x | 0.18x | 6.7
| 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 30.6 | 1.00x | 0.17x | 7.0
| 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 31.2 | 1.00x | 0.17x | 7.9
| 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 35.4 | 1.00x | 0.21x | 6.8
| 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 35.3 | 1.00x | 0.20x | 7.1
| 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 36.4 | 1.00x | 0.20x | 7.9
| 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 123.4 | 3.89x | 1.44x |
99.1 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 123.4 | 4.04x | 1.44x |
99.9 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 123.7 | 4.08x | 1.42x |
100.8 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 126.5 | 2.84x | 0.76x |
51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 126.7 | 2.97x | 0.76x |
51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 127.0 | 3.02x | 0.76x |
52.6 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 421.7 | 4.46x | 0.95x |
151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 424.8 | 4.73x | 0.96x |
151.8 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 424.0 | 5.07x | 0.95x |
152.0 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 436.2 | 2.18x | 0.48x |
74.0 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 437.6 | 2.34x | 0.48x |
74.1 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 440.8 | 2.49x | 0.48x |
74.1 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 8254.2 | 6.02x |
2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 8254.6 | 6.43x |
2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 8250.2 | 6.97x |
2.10x | 273.3 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 8407.4 | 3.36x |
1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 8407.4 | 3.60x |
1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 8413.3 | 3.91x |
1.19x | 152.5 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2
alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify that `torch.topk(sorted=True)`
output is monotonic for both `largest=True` and `largest=False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with
20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) +
8)) / time`, where the extra 8 bytes per output element are the int64
index. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
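The bandwidth formula above can be checked against a row of the table. The following is a minimal sketch, not the benchmark script itself: the helper name `effective_bandwidth_gbps` and the constant names are made up for illustration, and the 8-byte index size assumes the int64 indices that `torch.topk` returns.

```python
# Element sizes in bytes for the three benchmarked dtypes.
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}
INDEX_BYTES = 8      # assumption: int64 indices, matching the "+ 8" in the formula
PEAK_GBPS = 456.0    # B580 peak quoted above: 192 bit / 8 * 19 Gbps


def effective_bandwidth_gbps(dtype, bs, dim, k, time_us):
    """Input bytes read plus (value + index) bytes written, divided by time."""
    elem = DTYPE_BYTES[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9


# float32, bs=1024, dim=131072, k=4, measured at 1964.0 us in the table.
bw = effective_bandwidth_gbps("float32", 1024, 131072, 4, 1964.0)
print(f"{bw:.1f} GB/s, {100 * bw / PEAK_GBPS:.0f}% of peak")
```

For that row the sketch reproduces the reported 273.4 GB/s and 60% of peak.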
Co-authored-by: chuanqi129 <13608516+chuanqi129@users.noreply.github.com>
CuiYifeng added a commit that referenced this pull request May 19, 2026
CuiYifeng added a commit that referenced this pull request May 19, 2026
guangyey pushed a commit that referenced this pull request May 19, 2026
…3707)

This reverts commit 8eaa591, except for the new
overloads of `sycl_kernel_submit`.
The motivation is that #3371 caused a build timeout in stock CD.
jianyizh added a commit that referenced this pull request May 21, 2026
Split the monolithic TensorTopKSbtopkKernel.cpp into 5 per-K files
(k1, k2, k4, k8, k16) to enable parallel AOT compilation and avoid
CD build timeout that caused the original PR #3371 to be reverted.

- TensorTopKSbtopkKernel.h: public API (SbtopkResult enum + dispatch)
- TensorTopKSbtopkKernelImpl.h: shared functor + launch templates
- TensorTopKSbtopkKernel_k{1,2,4,8,16}.cpp: per-K instantiations
- TensorTopKSbtopkKernel.cpp: dispatch-only (routes to per-K units)
- TensorTopKKernel.cpp: integrate sbtopk_try_launch fallback
jianyizh added a commit that referenced this pull request May 21, 2026
Split the monolithic TensorTopKSbtopkKernel.cpp into 4 per-K files
(k1, k2, k4, k8) to enable parallel AOT compilation and avoid
CD build timeout that caused the original PR #3371 to be reverted.

K=16 is excluded for now as it alone causes compilation timeout;
it can be re-added once incremental build improvements land.

- TensorTopKSbtopkKernel.h: public API (SbtopkResult enum + dispatch)
- TensorTopKSbtopkKernelImpl.h: shared functor + launch templates
- TensorTopKSbtopkKernel_k{1,2,4,8}.cpp: per-K instantiations
- TensorTopKSbtopkKernel.cpp: dispatch-only (routes to per-K units)
- TensorTopKKernel.cpp: integrate sbtopk_try_launch fallback
chuanqi129 pushed a commit that referenced this pull request May 26, 2026
## Summary

Builds on #3371 (subgroup topk kernel). Adds a **single workgroup topk
kernel**, a SYCL translation of PyTorch CUDA's single-block radix select
path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432
cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins
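The summary figures above (geomean, wins above 1.05x, regressions below 0.98x) could be computed along the following lines. This is a hedged sketch: the function names `geomean` and `summarize` are illustrative, and the three timing pairs are sample rows from the full results table rather than the whole 432-case set.

```python
import math


def geomean(ratios):
    """Geometric mean via the mean of logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def summarize(orig_us, new_us, win=1.05, regress=0.98):
    """Per-case speedup is orig / new, so values above 1 mean the new path is faster."""
    speedups = [o / n for o, n in zip(orig_us, new_us)]
    return {
        "geomean": geomean(speedups),
        "wins": sum(s > win for s in speedups),
        "regressions": sum(s < regress for s in speedups),
    }


# Three (XPU orig, XPU PR1+PR2) pairs from the bfloat16 rows of the table:
# bs=1 dim=131072 k=4, bs=1 dim=128 k=4, bs=256 dim=8193 k=4.
orig_us = [368.8, 30.6, 261.0]
new_us = [102.4, 30.6, 274.2]
result = summarize(orig_us, new_us)
print(result)
```

A geometric mean is the natural choice here because the quantity being averaged is a ratio: a 2x speedup and a 0.5x slowdown cancel to 1.0x.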

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A
1024-thread workgroup processes one slice using `RADIX_BITS=4` radix
select to find the k-th value, then gathers matching elements.
Translated from PyTorch CUDA's single-block path. Output is unsorted
(caller sorts if needed). Best for large dim (>= 4096).
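To illustrate the idea behind a 4-bit radix select followed by a gather, here is a sequential sketch in plain Python. It is not the SYCL kernel: it works on unsigned integer keys only, there is no workgroup parallelism, and the names `radix_select_kth_largest` and `topk_largest` are hypothetical. Floating-point inputs would first need an order-preserving mapping to unsigned keys.

```python
RADIX_BITS = 4
RADIX_SIZE = 1 << RADIX_BITS   # 16 buckets per pass
RADIX_MASK = RADIX_SIZE - 1


def radix_select_kth_largest(keys, k, key_bits=32):
    """Return the k-th largest (1-based) value among unsigned integer keys."""
    assert 1 <= k <= len(keys)
    desired = 0        # digits of the answer fixed so far
    desired_mask = 0   # which bit positions are already fixed
    remaining = k
    # Walk the digits from most significant to least significant.
    for shift in range(key_bits - RADIX_BITS, -1, -RADIX_BITS):
        counts = [0] * RADIX_SIZE
        for x in keys:
            # Only keys that still match the fixed prefix are candidates.
            if (x & desired_mask) == desired:
                counts[(x >> shift) & RADIX_MASK] += 1
        # Scan buckets from the largest digit down to find the one holding the k-th key.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if counts[digit] >= remaining:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            remaining -= counts[digit]
    return desired


def topk_largest(keys, k):
    """Gather step: everything above the k-th value, then ties; output is unsorted."""
    kth = radix_select_kth_largest(keys, k)
    above = [x for x in keys if x > kth]
    ties = [x for x in keys if x == kth][: k - len(above)]
    return above + ties


print(sorted(topk_largest([5, 1, 9, 3, 9, 7, 2], 3)))
```

With 32-bit keys this takes 8 passes of 16-bucket counting, and the gathered result comes back in input order, which is why the caller sorts when sorted output is requested.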

**Updated dispatch logic:**
- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel
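The rule order above can be written as a small decision function. This is a sketch under stated assumptions: the function name `choose_topk_kernel` and its return strings are hypothetical, and "large batch" is taken to mean `nsegments >= hw_thread_slots / 4`, the threshold described for the subgroup kernel in #3371. The real dispatch lives in C++ and may use different conditions.

```python
def choose_topk_kernel(dim, k, nsegments, hw_thread_slots):
    """Return which kernel the listed rules would pick, checked in order."""
    if dim < 1024:
        return "original"
    # Assumption: "large batch" means at least a quarter of the hardware thread slots.
    if k <= 16 and nsegments >= hw_thread_slots / 4:
        return "subgroup"      # output already sorted
    if dim >= 4096:
        return "single_wg"     # output unsorted, caller sorts if needed
    return "original"


# hw_thread_slots=1000 is an arbitrary illustrative value, not a B580 figure.
for dim, k, nsegments in [(128, 4, 2048), (1024, 4, 2048), (131072, 4, 1), (2048, 4, 1)]:
    print(dim, k, nsegments, choose_topk_kernel(dim, k, nsegments, hw_thread_slots=1000))
```

Because the rules are evaluated top to bottom, a large batch with `dim >= 4096` and `k <= 16` goes to the subgroup kernel rather than the single workgroup kernel.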

Also fixes NaN handling in `SortingRadixSelect.h`
`TopKTypeConfig::convert` for half/float/double (NaN maps to max radix
value).
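As an illustration of what "NaN maps to max radix value" means for float32, the sketch below applies the usual order-preserving float-to-unsigned mapping and sends NaN to the all-ones key. It is a plain Python model, not the contents of `SortingRadixSelect.h`; the function name `float32_to_radix` is hypothetical.

```python
import math
import struct


def float32_to_radix(v):
    """Map a float32 value to a uint32 key whose unsigned order matches float order."""
    if math.isnan(v):
        return 0xFFFFFFFF   # NaN gets the maximum key, so it ranks above +inf
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    # Negative values: flip every bit. Non-negative values: flip only the sign bit.
    mask = 0xFFFFFFFF if bits & 0x80000000 else 0x80000000
    return bits ^ mask


values = [float("-inf"), -2.0, -1.0, -0.0, 0.0, 1.0, float("inf"), float("nan")]
keys = [float32_to_radix(v) for v in values]
print([hex(key) for key in keys])
```

Without the explicit NaN branch, NaNs with different payload or sign bits would land on different keys, some of them below every ordinary number, which makes top-k results inconsistent.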

Multi-block radix select (for very large slices across multiple
workgroups) is planned as future work.

### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel
(from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch`
declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside
subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel
paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert`
|

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output
verified monotonic)
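The two checks above can be modeled without a device. The sketch below shows one plausible reading of "sort-then-compare" and of the monotonicity check, on plain Python lists; the helper names are hypothetical and the actual test script, which uses `torch.topk`, is not shown in this PR.

```python
def same_topk_values(reference, candidate):
    """Sort-then-compare: equality of the selected values regardless of output order."""
    return sorted(reference) == sorted(candidate)


def is_monotonic(values, largest=True):
    """Non-increasing for largest=True, non-decreasing for largest=False."""
    pairs = list(zip(values, values[1:]))
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


cpu_topk = [9.0, 9.0, 7.0]   # reference, already sorted
xpu_topk = [9.0, 7.0, 9.0]   # same values, as an unsorted kernel might emit them
print(same_topk_values(cpu_topk, xpu_topk), is_monotonic(xpu_topk))
```

Comparing sorted values rather than indices keeps the accuracy check valid when the input has ties, since two correct implementations may pick different indices for equal values.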

### Benchmark: incremental gain from this PR

The table below shows where the single-wg kernel helps (large-dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory
bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50
iters.

<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 3.07x | 0.41x | 15.3 | 3% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 | 234.1 | 3.93x | 0.47x | 67.7 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x | 3.1 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x | 3.0 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x | 3.0 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x | 2.8 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x | 2.8 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x | 2.9 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x | 0.41x | 15.2 | 3% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x | 0.45x | 16.7 | 4% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x | 0.42x | 15.8 | 3% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x | 0.27x | 10.0 | 2% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x | 0.28x | 10.5 | 2% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x | 0.28x | 10.5 | 2% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x | 4.6 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x | 4.1 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x | 12.7 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x | 12.7 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x | 11.9 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x | 12.3 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x | 11.9 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x | 11.5 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x | 0.57x | 58.7 | 13% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x | 0.55x | 57.8 | 13% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x | 0.58x | 59.4 | 13% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x | 0.37x | 38.1 | 8% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x | 0.40x | 40.9 | 9% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x | 0.38x | 38.7 | 8% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x | 10.7 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x | 0.34x | 19.3 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x | 0.33x | 18.6 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x | 0.32x | 17.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x | 0.31x | 17.8 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x | 0.30x | 17.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x | 0.29x | 16.4 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x | 0.49x | 76.2 | 17% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x | 0.48x | 75.7 | 17% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x | 0.48x | 74.8 | 16% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x | 0.31x | 47.7 | 10% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x | 0.30x | 46.8 | 10% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x | 0.29x | 45.6 | 10% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x | 4.6 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x | 5.9 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x | 4.5 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x | 6.0 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x | 44.1 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x | 45.0 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x | 1.27x | 45.4 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 97.5 | 64.2 | 1.90x | 0.66x | 22.0 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x | 0.70x | 81.0 | 18% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x | 0.69x | 80.5 | 18% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x | 0.69x | 79.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x | 0.31x | 36.1 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 | 5.87x | 1.20x | 174.4 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 | 6.42x | 1.21x | 174.4 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 | 7.04x | 1.20x | 174.0 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 | 2.83x | 0.60x | 84.2 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 | 3.10x | 0.60x | 84.0 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 | 3.39x | 0.60x | 83.5 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x | 0.27x | 5.7 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x | 0.25x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x | 0.30x | 5.7 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x | 1.02x | 53.7 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x | 0.49x | 23.7 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x | 0.57x | 84.1 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x | 0.25x | 36.3 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x | 0.25x | 36.5 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 | 5.90x | 1.19x | 177.1 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 | 6.50x | 1.19x | 177.4 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 | 2.84x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 | 3.12x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 | 3.42x | 0.59x | 85.1 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 32.0 | 15.2 | 1.07x | 0.47x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x | 0.7 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x | 0.41x | 3.4 | 1% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x | 0.48x | 4.0 | 1% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x | 0.43x | 3.5 | 1% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x | 0.37x | 3.0 | 1% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x | 0.34x | 2.8 | 1% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x | 0.27x | 2.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 36.8 | 15.3 | 0.96x | 0.42x | 0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x | 1.0 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x | 0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x | 0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x | 0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x | 0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x | 4.7 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x | 4.3 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x | 4.7 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x | 4.6 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x | 4.5 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x | 4.2 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x | 0.47x | 26.2 | 6% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x | 0.47x | 26.1 | 6% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x | 0.44x | 24.1 | 5% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x | 0.31x | 17.1 | 4% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x | 0.30x | 16.7 | 4% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x | 0.30x | 16.7 | 4% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x | 1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x | 1.1 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x | 1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x | 1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x | 1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x | 1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x | 6.4 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x | 6.5 | 1% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x | 6.4 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x | 6.4 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x | 6.3 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x | 6.4 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x | 19.6 | 4% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x | 19.1 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x | 0.56x | 18.0 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x | 18.8 | 4% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x | 18.6 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x | 0.59x | 18.9 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x | 0.47x | 71.4 | 16% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x | 0.47x | 70.9 | 16% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x | 0.46x | 68.9 | 15% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x | 0.37x | 55.0 | 12% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x | 0.35x | 52.8 | 12% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x | 0.34x | 51.6 | 11% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x | 3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x | 3.4 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x | 3.9 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x | 3.2 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 47.2 | 16.7 | 0.97x | 0.35x | 3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x | 3.6 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x | 15.6 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x | 13.0 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x | 12.3 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x | 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x | 0.36x | 30.0 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x | 0.35x | 28.8 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x | 0.34x | 28.0 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x | 0.34x | 28.3 | 6% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x | 0.34x | 27.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x | 0.32x | 26.8 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x | 0.69x | 82.6 | 18% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x | 0.69x | 81.6 | 18% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 | 2.19x | 0.67x | 79.5 | 17% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x | 0.54x | 62.7 | 14% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x | 0.52x | 60.8 | 13% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 | 1.65x | 0.51x | 59.3 | 13% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x | 0.22x | 5.7 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x | 0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x | 0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x | 0.24x | 5.6 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x | 0.23x | 5.9 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x | 0.22x | 6.6 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x | 84.9 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x | 1.73x | 86.1 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x | 46.2 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x | 46.4 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x | 0.87x | 145.4 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x | 0.87x | 145.5 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x | 0.87x | 145.9 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x | 0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x | 0.43x | 72.4 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x | 0.44x | 72.5 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 | 6.06x | 2.13x | 273.2 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 | 6.47x | 2.13x | 272.9 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 | 3.35x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 | 3.60x | 1.20x | 151.2 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 | 3.91x | 1.20x | 150.9 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 171.1 | 30.2 | 1.00x | 0.18x | 6.7 | 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x | 0.17x | 7.0 | 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x | 0.17x | 8.0 | 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x | 0.21x | 6.7 | 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x | 0.20x | 7.1 | 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x | 0.20x | 8.0 | 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x | 1.44x | 99.3 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x | 1.43x | 99.8 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x | 1.42x | 100.9 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x | 0.76x | 52.5 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x | 0.96x | 151.9 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x | 0.96x | 152.1 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x | 0.48x | 74.2 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 | 6.43x | 2.10x | 273.4 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 | 6.98x | 2.10x | 273.6 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 | 3.36x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 | 3.60x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 | 3.92x | 1.20x | 152.7 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values. CPU reference vs XPU, sort-then-compare.
- **Sortedness (324 cases):** Verify that `torch.topk(sorted=True)` output is monotonic for both `largest=True` and `largest=False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with 20 warmup iterations. Benchmarked with `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
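
As a worked check of the bandwidth formula, the sketch below recomputes the figure for the `float32`, bs=1024, dim=131072, k=4 row of the table above. It assumes that the `8` in the formula is the size of an int64 index, that GB means 10^9 bytes, and that the reported bandwidth is based on the 1965.1 us timing column; the function and constant names are illustrative and do not come from the PR.

```python
# Peak from the PR text: 192-bit bus x 19 Gbps GDDR6, divided by 8 bits per byte.
PEAK_B580_GBPS = 192 * 19 / 8  # 456.0

def topk_bandwidth_gbps(bs, dim, k, dtype_bytes, time_us, index_bytes=8):
    """Effective bandwidth in GB/s: input read plus (value, index) pairs written."""
    bytes_read = bs * dim * dtype_bytes
    bytes_written = bs * k * (dtype_bytes + index_bytes)  # index_bytes=8 is an assumption (int64)
    seconds = time_us * 1e-6
    return (bytes_read + bytes_written) / seconds / 1e9

bw = topk_bandwidth_gbps(bs=1024, dim=131072, k=4, dtype_bytes=4, time_us=1965.1)
pct_peak = 100.0 * bw / PEAK_B580_GBPS
print(f"{bw:.1f} GB/s, {pct_peak:.0f}% of peak")  # prints "273.2 GB/s, 60% of peak"
```

Under these assumptions the result agrees with the 273.2 GB/s and 60% reported for that row. The output term is negligible next to the input read for small k, so the metric is essentially input bytes over time.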

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
jafraustro pushed a commit to jafraustro/torch-xpu-ops that referenced this pull request May 26, 2026
…l#3372)

## Summary

Builds on intel#3371 (subgroup topk kernel). Adds a **single workgroup topk
kernel**, a SYCL translation of PyTorch CUDA's single-block radix select
path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432
cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins
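
For readers reproducing the summary numbers, the following is a minimal sketch of how a geomean, win count, and regression count could be derived from per-case speedups, using the >1.05x and <0.98x thresholds quoted above. The function name and the four sample values are illustrative; the sample is not the real 432-case data set.

```python
import math

def summarize_speedups(speedups, win_threshold=1.05, regression_threshold=0.98):
    """Return (geometric mean, number of wins, number of regressions)."""
    # Geometric mean computed as exp(mean(log(x))) to avoid overflow on long lists.
    geomean = math.exp(math.fsum(math.log(s) for s in speedups) / len(speedups))
    wins = sum(1 for s in speedups if s > win_threshold)
    regressions = sum(1 for s in speedups if s < regression_threshold)
    return geomean, wins, regressions

# Hypothetical sample input: one neutral case, two wins, one regression.
sample = [1.00, 1.21, 3.60, 0.97]
geomean, wins, regressions = summarize_speedups(sample)
print(f"geomean={geomean:.4f}x wins={wins} regressions={regressions}")
```

A geometric mean is the natural average for ratios: a 2x speedup and a 0.5x slowdown cancel to exactly 1.0x, which an arithmetic mean would not report.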

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A
1024-thread workgroup processes one slice using `RADIX_BITS=4` radix
select to find the k-th value, then gathers matching elements.
Translated from PyTorch CUDA's single-block path. Output is unsorted
(caller sorts if needed). Best for large dim (>= 4096).
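
The select-then-gather idea can be sketched sequentially in Python. This is not the SYCL kernel: it assumes the keys are already unsigned 32-bit integers whose integer order matches the desired value order, it handles only the largest-k case, and it replaces the cooperative per-workgroup histogram with a plain loop. All function names are illustrative.

```python
RADIX_BITS = 4
RADIX_SIZE = 1 << RADIX_BITS   # 16 buckets per pass
RADIX_MASK = RADIX_SIZE - 1

def radix_select_kth_largest(keys, k, key_bits=32):
    """Return the k-th largest key (1-based), fixing RADIX_BITS bits per pass."""
    desired = 0        # bits of the answer determined so far
    desired_mask = 0   # which bit positions are already determined
    k_remaining = k
    for shift in range(key_bits - RADIX_BITS, -1, -RADIX_BITS):
        # Histogram of the current digit over keys that match the known prefix.
        counts = [0] * RADIX_SIZE
        for key in keys:
            if (key & desired_mask) == desired:
                counts[(key >> shift) & RADIX_MASK] += 1
        # Walk buckets from the largest digit down to find the one holding the k-th key.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if counts[digit] >= k_remaining:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            k_remaining -= counts[digit]
    return desired

def topk_unsorted(keys, k):
    """Gather everything strictly above the k-th value, then pad with ties."""
    kth = radix_select_kth_largest(keys, k)
    out = [x for x in keys if x > kth]
    out += [kth] * (k - len(out))
    return out

print(sorted(topk_unsorted([5, 1, 9, 3, 9, 7, 2], 3)))  # prints [7, 9, 9]
```

With 4 bits per pass, a 32-bit key needs 8 passes over the slice and a 16-bit key needs 4. The gathered output follows input order rather than value order, which is why a caller that needs sorted results must sort afterwards.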

**Updated dispatch logic:**
- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel
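
The dispatch order above can be read as a simple decision function. The sketch below is an interpretation, not the code in `TensorTopKSbtopkKernel.cpp`: it assumes "large batch" means `nsegments >= HW_thread_slots / 4` as in the PR1 description, and `hw_thread_slots` and the returned labels are hypothetical names.

```python
def choose_topk_kernel(dim, k, nsegments, hw_thread_slots):
    """Return a label for the kernel that the listed rules would select."""
    if dim < 1024:
        return "original"
    # Assumption: "large batch" reuses the PR1 threshold of a quarter of the HW thread slots.
    if k <= 16 and nsegments >= hw_thread_slots / 4:
        return "subgroup"      # PR1 kernel, output already sorted
    if dim >= 4096:
        return "single_wg"     # this PR's kernel, output unsorted
    return "original"

# Illustrative value only; the real number depends on the device.
HYPOTHETICAL_SLOTS = 2048

print(choose_topk_kernel(dim=131072, k=4, nsegments=1, hw_thread_slots=HYPOTHETICAL_SLOTS))
print(choose_topk_kernel(dim=1024, k=4, nsegments=1024, hw_thread_slots=HYPOTHETICAL_SLOTS))
```

The rules are order-sensitive: a large-batch, small-k case with `dim >= 4096` goes to the subgroup kernel because that check comes first.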

Also fixes NaN handling in `SortingRadixSelect.h`
`TopKTypeConfig::convert` for half/float/double (NaN maps to max radix
value).
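
To illustrate what mapping NaN to the max radix value means, here is a sketch of an order-preserving float32-to-uint32 key conversion of the kind radix select relies on. It is a Python stand-in written for this note, not the code in `SortingRadixSelect.h`, and the function name is illustrative.

```python
import math
import struct

def float32_to_radix_key(v):
    """Map a float32 to a uint32 whose unsigned order matches the float order."""
    if math.isnan(v):
        return 0xFFFFFFFF  # NaN compares as the largest possible key
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    # Negative floats: flip every bit. Non-negative floats: flip only the sign bit.
    mask = 0xFFFFFFFF if bits & 0x80000000 else 0x80000000
    return bits ^ mask

values = [float("-inf"), -2.5, -0.0, 0.0, 1.0, float("inf"), float("nan")]
keys = [float32_to_radix_key(v) for v in values]
print(keys == sorted(keys))  # prints True
```

Because the NaN key is above the key for `+inf`, NaN ranks as the largest element, so a largest-k selection picks it first. Without the explicit NaN branch, the resulting key would depend on the NaN's sign and payload bits.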

Multi-block radix select (for very large slices across multiple
workgroups) is planned as future work.

### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel (from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch` declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert` |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output
verified monotonic)
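
The two checks can be expressed as small predicates. This is a pure-Python sketch of the idea only: the actual tests presumably compare tensors from `torch.topk` on CPU and XPU, and the helper names here are illustrative. NaN inputs are not handled.

```python
def same_topk_values(reference, candidate):
    """Sort-then-compare: equal as multisets, regardless of output order."""
    return sorted(reference) == sorted(candidate)

def is_monotonic(values, largest=True):
    """True if values are non-increasing (largest=True) or non-decreasing (largest=False)."""
    pairs = list(zip(values, values[1:]))
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)

cpu_values = [9.0, 9.0, 7.0]
xpu_unsorted = [9.0, 7.0, 9.0]   # same values, different order

print(same_topk_values(cpu_values, xpu_unsorted))  # prints True
print(is_monotonic(xpu_unsorted, largest=True))    # prints False
```

Sorting before comparing matters because ties and unsorted kernels can legitimately return the same top-k values in a different order, so a position-by-position comparison would report false failures.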

### Benchmark: incremental gain from this PR

Where the single-wg kernel helps (large-dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory
bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50
iters.

<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 3.07x | 0.41x | 15.3 | 3% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 | 234.1 | 3.93x | 0.47x | 67.7 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x |
0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x |
0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x |
0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x |
0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x |
0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x |
0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x |
0.6 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x |
0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x |
0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x |
0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x |
0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x |
0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x |
3.1 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x |
3.0 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x |
3.0 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x |
2.8 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x |
2.8 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x |
2.9 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x |
0.41x | 15.2 | 3% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x |
0.45x | 16.7 | 4% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x |
0.42x | 15.8 | 3% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x |
0.27x | 10.0 | 2% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x |
0.28x | 10.5 | 2% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x |
0.28x | 10.5 | 2% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x |
0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x |
0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x |
0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x |
0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x |
0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x |
0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x |
4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x |
4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x |
4.6 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x |
4.1 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x |
4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x |
4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x |
12.7 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x |
12.7 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x |
11.9 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x |
12.3 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x |
11.9 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x |
11.5 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x |
0.57x | 58.7 | 13% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x |
0.55x | 57.8 | 13% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x |
0.58x | 59.4 | 13% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x |
0.37x | 38.1 | 8% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x |
0.40x | 40.9 | 9% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x |
0.38x | 38.7 | 8% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x |
2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x |
2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x |
3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x |
2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x |
2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x |
3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x |
10.7 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x |
10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x
| 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x |
8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x |
8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x
| 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x |
0.34x | 19.3 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x |
0.33x | 18.6 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x |
0.32x | 17.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x |
0.31x | 17.8 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x |
0.30x | 17.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x |
0.29x | 16.4 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x |
0.49x | 76.2 | 17% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x |
0.48x | 75.7 | 17% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x
| 0.48x | 74.8 | 16% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x
| 0.31x | 47.7 | 10% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x
| 0.30x | 46.8 | 10% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x
| 0.29x | 45.6 | 10% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x |
4.6 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x |
4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x
| 5.9 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x |
4.5 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x |
4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x
| 6.0 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x
| 44.1 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x
| 45.0 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x |
1.27x | 45.4 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 97.5 | 64.2 | 1.90x | 0.66x
| 22.0 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x
| 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x |
0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x |
0.70x | 81.0 | 18% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x |
0.69x | 80.5 | 18% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x |
0.69x | 79.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x |
0.31x | 36.1 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x |
0.31x | 36.0 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x |
0.31x | 36.0 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 |
5.87x | 1.20x | 174.4 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 |
6.42x | 1.21x | 174.4 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 |
7.04x | 1.20x | 174.0 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 |
2.83x | 0.60x | 84.2 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 |
3.10x | 0.60x | 84.0 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 |
3.39x | 0.60x | 83.5 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x |
0.27x | 5.7 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x |
0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x |
0.25x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x |
0.30x | 5.7 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x |
0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x |
0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x
| 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 82.8 | 85.7 | 2.98x | 1.04x
| 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x |
1.02x | 53.7 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x |
0.49x | 23.7 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x |
0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x |
0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x |
0.57x | 84.1 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x |
0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x |
0.57x | 84.3 | 18% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x |
0.25x | 36.3 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x |
0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x |
0.25x | 36.5 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 |
5.90x | 1.19x | 177.1 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 |
6.50x | 1.19x | 177.4 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 |
7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 |
2.84x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 |
3.12x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 |
3.42x | 0.59x | 85.1 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x |
0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x |
0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x |
0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x |
0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x |
0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 32.0 | 15.2 | 1.07x | 0.47x |
0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x |
0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x |
0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x |
0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x |
0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x |
0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x |
0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x |
0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x |
0.7 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x |
0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x |
0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x |
0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x |
0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x |
0.41x | 3.4 | 1% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x |
0.48x | 4.0 | 1% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x |
0.43x | 3.5 | 1% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x |
0.37x | 3.0 | 1% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x |
0.34x | 2.8 | 1% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x |
0.27x | 2.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x |
0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x |
0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x |
0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x |
0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x |
0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 36.8 | 15.3 | 0.96x | 0.42x |
0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x |
0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x |
1.0 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x |
0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x |
0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x |
0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x |
0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x |
4.7 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x |
4.3 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x |
4.7 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x |
4.6 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x |
4.5 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x |
4.2 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x |
0.47x | 26.2 | 6% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x |
0.47x | 26.1 | 6% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x |
0.44x | 24.1 | 5% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x |
0.31x | 17.1 | 4% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x |
0.30x | 16.7 | 4% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x |
0.30x | 16.7 | 4% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x |
1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x |
1.1 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x |
1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x |
1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x |
1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x |
1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x |
6.4 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x |
6.5 | 1% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x |
6.4 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x |
6.4 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x |
6.3 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x |
6.4 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x |
19.6 | 4% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x
| 19.1 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x |
0.56x | 18.0 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x |
18.8 | 4% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x
| 18.6 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x |
0.59x | 18.9 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x |
0.47x | 71.4 | 16% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x |
0.47x | 70.9 | 16% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x |
0.46x | 68.9 | 15% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x |
0.37x | 55.0 | 12% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x |
0.35x | 52.8 | 12% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x |
0.34x | 51.6 | 11% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x |
3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x |
3.4 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x |
3.9 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x |
3.2 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 47.2 | 16.7 | 0.97x | 0.35x |
3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x |
3.6 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x |
15.6 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x |
15.4 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x
| 15.4 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x |
13.0 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x |
12.3 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x
| 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x |
0.36x | 30.0 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x |
0.35x | 28.8 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x |
0.34x | 28.0 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x |
0.34x | 28.3 | 6% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x |
0.34x | 27.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x |
0.32x | 26.8 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x
| 0.69x | 82.6 | 18% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x
| 0.69x | 81.6 | 18% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 |
2.19x | 0.67x | 79.5 | 17% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x
| 0.54x | 62.7 | 14% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x
| 0.52x | 60.8 | 13% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 |
1.65x | 0.51x | 59.3 | 13% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x |
0.22x | 5.7 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x |
0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x |
0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x |
0.24x | 5.6 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x |
0.23x | 5.9 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x |
0.22x | 6.6 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x
| 84.9 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x
| 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x |
1.73x | 86.1 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x
| 46.2 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x
| 46.4 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x |
0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x |
0.87x | 145.4 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x |
0.87x | 145.5 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x |
0.87x | 145.9 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x |
0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x |
0.43x | 72.4 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x |
0.44x | 72.5 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 |
6.06x | 2.13x | 273.2 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 |
6.47x | 2.13x | 272.9 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 |
7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 |
3.35x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 |
3.60x | 1.20x | 151.2 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 |
3.91x | 1.20x | 150.9 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 171.1 | 30.2 | 1.00x |
0.18x | 6.7 | 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x |
0.17x | 7.0 | 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x |
0.17x | 8.0 | 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x |
0.21x | 6.7 | 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x |
0.20x | 7.1 | 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x |
0.20x | 8.0 | 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x |
1.44x | 99.3 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x |
1.43x | 99.8 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x |
1.42x | 100.9 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x |
0.76x | 51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x |
0.76x | 51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x |
0.76x | 52.5 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x |
0.95x | 151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x |
0.96x | 151.9 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x |
0.96x | 152.1 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x |
0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x |
0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x |
0.48x | 74.2 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 |
6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 |
6.43x | 2.10x | 273.4 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 |
6.98x | 2.10x | 273.6 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 |
3.36x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 |
3.60x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 |
3.92x | 1.20x | 152.7 | 33% |

</details>
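
A minimal sketch of how the two ratio columns appear to be derived, judging only from the rows themselves: speedup looks like the original XPU time divided by the new XPU time, and the CUDA ratio like the CUDA time divided by the new XPU time. Which timing column is which is an assumption here (first = original, third = new, fourth = CUDA), and the names `ratios`, `classify` and `summarize` are hypothetical. The win and regression thresholds are the ones quoted in the summary (>1.05x and <0.98x).

```python
from statistics import geometric_mean


def ratios(t_orig_us, t_new_us, t_cuda_us):
    """Return (speedup vs original, ratio vs CUDA); >1 means the new XPU kernel is faster."""
    return t_orig_us / t_new_us, t_cuda_us / t_new_us


def classify(speedup):
    """Bucket a speedup using the thresholds quoted in the summary."""
    if speedup > 1.05:
        return "win"
    if speedup < 0.98:
        return "regression"
    return "neutral"


def summarize(speedups):
    """Geometric mean plus win/regression counts over a list of speedups."""
    labels = [classify(s) for s in speedups]
    return {
        "geomean": geometric_mean(speedups),
        "wins": labels.count("win"),
        "regressions": labels.count("regression"),
    }


# Row: float32, bs=1024, dim=131072, k=4 (assumed column mapping, see above).
speedup, vs_cuda = ratios(11911.6, 1965.1, 4191.1)
print(f"{speedup:.2f}x {vs_cuda:.2f}x")
```

The geometric mean is the natural average for ratios: a 2x win and a 0.5x regression cancel to 1.0x, which an arithmetic mean would not show.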

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2
alignments (dim and dim + 1) x 3 k values. XPU output is checked against
a CPU reference, sort-then-compare.
- **Sortedness (324 cases):** Verify that `torch.topk(sorted=True)` output
is monotonic for both `largest=True` and `largest=False`.
- **Benchmark (432 cases):** Median of 3 runs x 50 iterations each, with
20 warmup iterations. `largest=True`.
- **Bandwidth:** `(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) +
8)) / time`, where the 8 bytes per output element are the int64 index.
Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
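
The case count and the bandwidth formula above can be sketched as follows. This is an illustration, not the benchmark script from the PR: the grid values are read off the benchmark table, the helper names are hypothetical, and the time is assumed to be in microseconds (an assumption that reproduces the table's GB/s figures for the row checked below).

```python
from itertools import product

# Grid values read off the benchmark table; names are illustrative only.
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}
BATCH_SIZES = [1, 8, 64, 256, 1024, 2048]
BASE_DIMS = [128, 1024, 8192, 131072]
ALIGN_OFFSETS = [0, 1]  # aligned dim, and dim + 1
K_VALUES = [4, 8, 16]
INDEX_BYTES = 8  # int64 index written per output element


def cases():
    """Yield (dtype, bs, dim, k) for every benchmark configuration."""
    for dtype, bs, dim, off, k in product(
        DTYPE_BYTES, BATCH_SIZES, BASE_DIMS, ALIGN_OFFSETS, K_VALUES
    ):
        yield dtype, bs, dim + off, k


def bandwidth_gbs(dtype, bs, dim, k, time_us):
    """Effective bandwidth: input read plus (value, index) outputs written."""
    elem = DTYPE_BYTES[dtype]
    bytes_moved = bs * dim * elem + bs * k * (elem + INDEX_BYTES)
    return bytes_moved / (time_us * 1e-6) / 1e9


def peak_gbs(bus_bits=192, gbps_per_pin=19):
    """Theoretical peak: bus width in bytes times per-pin data rate."""
    return bus_bits / 8 * gbps_per_pin


print(len(list(cases())))
bw = bandwidth_gbs("float32", 1024, 131072, 4, 1965.1)
print(f"{bw:.1f} GB/s, {100 * bw / peak_gbs():.0f}% of peak")
```

For the float32, bs=1024, dim=131072, k=4 row this gives about 273.2 GB/s, or roughly 60% of the 456 GB/s peak, matching the last two columns of that row.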

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>