
Add single workgroup topk kernel for XPU (part 2 of #3369) #3372

Merged: chuanqi129 merged 10 commits into main from jianyi/single-wg-topk on May 26, 2026

@jianyizh

Summary

Builds on #3371 (subgroup topk kernel). Adds a single workgroup topk kernel, a SYCL translation of PyTorch CUDA's single-block radix select path.

  • Combined (PR1+PR2) vs original XPU: 1.5737x geomean over 432 cases, 211 wins (>1.05x), 32 regressions (<0.98x)
  • Combined vs CUDA 4080S: 0.5274x geomean (values > 1 mean XPU is faster, so XPU is still slower overall)
  • PR2 incremental vs PR1-only: 1.1530x geomean, 107 additional wins
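
For reference, the geomean and the win/regression counts quoted above can be derived from per-case speedups (baseline time divided by new time) roughly as follows. This is a sketch only: `SpeedupSummary` and `summarize_speedups` are hypothetical names, not taken from the PR's benchmark script, and the 1.05x / 0.98x thresholds are the ones stated in the summary.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics over per-case speedups (baseline_time / new_time).
struct SpeedupSummary {
  double geomean;
  std::size_t wins;         // cases with speedup > win_threshold
  std::size_t regressions;  // cases with speedup < regression_threshold
};

// Hypothetical helper: geometric mean plus threshold counts.
inline SpeedupSummary summarize_speedups(
    const std::vector<double>& speedups,
    double win_threshold = 1.05,
    double regression_threshold = 0.98) {
  assert(!speedups.empty());
  SpeedupSummary s{0.0, 0, 0};
  double log_sum = 0.0;
  for (double x : speedups) {
    assert(x > 0.0);
    log_sum += std::log(x);  // averaging logs avoids overflow of the raw product
    if (x > win_threshold) ++s.wins;
    if (x < regression_threshold) ++s.regressions;
  }
  s.geomean = std::exp(log_sum / static_cast<double>(speedups.size()));
  return s;
}
```

Cases between the two thresholds count as neither a win nor a regression, which is why wins and regressions do not add up to 432.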

Approach

Single workgroup topk kernel (TensorTopKSingleWgKernel.cpp): a 1024-thread workgroup processes one slice, using radix select with RADIX_BITS=4 to find the k-th value and then gathering the matching elements. It is translated from PyTorch CUDA's single-block path. The output is unsorted (the caller sorts if needed). It works best for large dim (>= 4096).
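
A minimal host-side sketch of the radix select idea described above, assuming the keys have already been converted to order-preserving unsigned integers. It is sequential C++, not the SYCL kernel: in the kernel the per-digit counting is presumably shared across the work-items of the workgroup. All names here (`radix_select_kth_largest`, `topk_unsorted`, the `k...` constants) are illustrative and not the PR's identifiers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kRadixBits = 4;                // RADIX_BITS in the PR description
constexpr int kRadixSize = 1 << kRadixBits;  // 16 buckets per pass
constexpr std::uint32_t kRadixMask = kRadixSize - 1;

// Returns the k-th largest key (k is 1-based), fixing 4 bits per pass
// from the most significant digit down.
inline std::uint32_t radix_select_kth_largest(
    const std::vector<std::uint32_t>& keys, std::size_t k) {
  assert(k >= 1 && k <= keys.size());
  std::uint32_t desired = 0;       // bits of the answer fixed so far
  std::uint32_t desired_mask = 0;  // which bit positions are already fixed
  std::size_t k_remaining = k;
  for (int shift = 32 - kRadixBits; shift >= 0; shift -= kRadixBits) {
    // Histogram of the current digit over keys that match the fixed prefix.
    std::array<std::size_t, kRadixSize> counts{};
    for (std::uint32_t key : keys) {
      if ((key & desired_mask) == desired) {
        ++counts[(key >> shift) & kRadixMask];
      }
    }
    // Walk buckets from the largest digit until the k-th element is covered.
    for (int digit = kRadixSize - 1; digit >= 0; --digit) {
      const std::size_t c = counts[static_cast<std::size_t>(digit)];
      if (c >= k_remaining) {
        desired |= static_cast<std::uint32_t>(digit) << shift;
        desired_mask |= kRadixMask << shift;
        break;
      }
      k_remaining -= c;
    }
  }
  return desired;
}

// Gathers k keys: everything strictly above the k-th value, then ties.
// The result is in input order, i.e. unsorted.
inline std::vector<std::uint32_t> topk_unsorted(
    const std::vector<std::uint32_t>& keys, std::size_t k) {
  const std::uint32_t kth = radix_select_kth_largest(keys, k);
  std::vector<std::uint32_t> out;
  out.reserve(k);
  for (std::uint32_t key : keys) {
    if (key > kth) out.push_back(key);
  }
  for (std::uint32_t key : keys) {
    if (out.size() == k) break;
    if (key == kth) out.push_back(key);
  }
  return out;
}
```

With 32-bit keys and 4 bits per pass this takes 8 counting passes over the slice, followed by the gather passes.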

Updated dispatch logic (rules are checked in order; the first match wins):

  • dim < 1024 -> original kernel
  • k <= 16 and large batch -> subgroup kernel (PR1, SORTED)
  • dim >= 4096 -> single workgroup kernel (this PR, UNSORTED)
  • otherwise -> original kernel
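
The dispatch rules above can be read as the following decision function. This is a sketch under assumptions: the PR text says "large batch" without giving a number, so `kAssumedLargeBatch` is a made-up placeholder, and `TopKPath` / `choose_topk_path` are hypothetical names rather than the actual code in TensorTopKSbtopkKernel.cpp.

```cpp
#include <cassert>
#include <cstdint>

enum class TopKPath { Original, Subgroup, SingleWorkgroup };

// Assumption: placeholder threshold for "large batch"; the PR does not state it.
constexpr std::int64_t kAssumedLargeBatch = 1024;

// Mirrors the four dispatch bullets, evaluated top to bottom.
inline TopKPath choose_topk_path(
    std::int64_t batch,
    std::int64_t dim,
    std::int64_t k,
    std::int64_t large_batch = kAssumedLargeBatch) {
  assert(batch >= 1 && dim >= 1 && k >= 1 && k <= dim);
  if (dim < 1024) return TopKPath::Original;
  if (k <= 16 && batch >= large_batch) return TopKPath::Subgroup;  // PR1, sorted
  if (dim >= 4096) return TopKPath::SingleWorkgroup;               // this PR, unsorted
  return TopKPath::Original;
}
```

Because the subgroup rule is checked first, a large-batch case with dim >= 4096 still takes the subgroup kernel under this reading.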

Also fixes NaN handling in TopKTypeConfig::convert (SortingRadixSelect.h) for half/float/double: NaN now maps to the maximum radix value.
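
To illustrate what "NaN maps to max radix value" means, here is a host-side sketch of an order-preserving float-to-key conversion in the style commonly used for radix select. It is not the PR's code: `float_to_radix_key` is a hypothetical name, and only the float case is shown.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Maps a float to a uint32 key such that key order matches float order,
// with every NaN mapped to the largest possible key.
inline std::uint32_t float_to_radix_key(float v) {
  if (v != v) {
    return 0xFFFFFFFFu;  // NaN of either sign: maximum radix value
  }
  std::uint32_t bits = 0;
  static_assert(sizeof(bits) == sizeof(v), "float is assumed to be 32 bits");
  std::memcpy(&bits, &v, sizeof(bits));
  // Negative values: flip all bits. Non-negative values: flip only the sign bit.
  const std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
  return bits ^ mask;
}
```

In this sketch, dropping the explicit NaN branch would send a NaN with the sign bit set through the "flip all bits" path, placing it below -inf instead of above +inf.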

Multi-block radix select (for very large slices across multiple workgroups) is planned as future work.

Files changed

| File | Description |
| --- | --- |
| TensorTopKSingleWgKernel.cpp (new) | Single workgroup topk kernel (from CUDA single-block path) |
| TensorTopKSingleWgKernel.h (new) | single_wg_topk_try_launch declaration |
| TensorTopKSbtopkKernel.cpp | Add single-wg dispatch path alongside subgroup kernel |
| TensorTopKSbtopkKernel.h | Update comments to describe both kernel paths |
| SortingRadixSelect.h | Fix NaN handling in TopKTypeConfig::convert |

Correctness

  • Accuracy: 432/432 pass (CPU vs XPU, sort-then-compare)
  • Sortedness: 324/324 pass (torch.topk(sorted=True) output verified monotonic)
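
The two checks above can be sketched on plain host vectors as follows. This is an assumption-laden illustration, not the PR's test code (which presumably runs through torch.topk): the function names are hypothetical, exact equality is used instead of a tolerance, and inputs are assumed to contain no NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Sort-then-compare: the two results hold the same values regardless of order.
// Useful when one side is an unsorted top-k result.
inline bool same_values_ignoring_order(std::vector<float> a, std::vector<float> b) {
  if (a.size() != b.size()) return false;
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  return a == b;
}

// Sortedness: non-increasing when largest is true, non-decreasing otherwise.
inline bool is_monotonic(const std::vector<float>& values, bool largest) {
  if (largest) {
    return std::is_sorted(values.begin(), values.end(), std::greater<float>());
  }
  return std::is_sorted(values.begin(), values.end());
}
```

Sorting both sides before comparing is what makes the accuracy check independent of whether a given kernel path returns sorted or unsorted output.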

Benchmark: incremental gain from this PR

Showing where single-wg kernel helps (large dim cases):

By dim (PR2 vs PR1-only):

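| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
| --- | --- | --- | --- | --- |
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |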

Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50 iters.

dtype bs dim k XPU orig (us) XPU PR1 (us) XPU PR1+PR2 (us) CUDA 4080S (us) vs orig vs CUDA BW (GB/s) %peak
bfloat16 1 128 4 30.6 30.7 30.6 14.4 1.00x 0.47x 0.0 0%
bfloat16 1 128 8 30.5 30.4 30.4 14.3 1.00x 0.47x 0.0 0%
bfloat16 1 128 16 30.4 30.4 30.5 14.3 1.00x 0.47x 0.0 0%
bfloat16 1 129 4 30.3 30.6 30.4 14.7 1.00x 0.48x 0.0 0%
bfloat16 1 129 8 30.4 30.5 30.3 14.6 1.00x 0.48x 0.0 0%
bfloat16 1 129 16 30.4 30.4 30.4 14.6 1.00x 0.48x 0.0 0%
bfloat16 1 1024 4 30.5 30.5 30.4 19.0 1.00x 0.62x 0.1 0%
bfloat16 1 1024 8 30.5 30.6 30.4 18.3 1.00x 0.60x 0.1 0%
bfloat16 1 1024 16 30.4 30.4 30.5 18.6 1.00x 0.61x 0.1 0%
bfloat16 1 1025 4 30.5 30.5 30.5 20.0 1.00x 0.66x 0.1 0%
bfloat16 1 1025 8 30.4 30.5 30.5 20.2 1.00x 0.66x 0.1 0%
bfloat16 1 1025 16 30.4 30.5 30.4 19.8 1.00x 0.65x 0.1 0%
bfloat16 1 8192 4 45.7 44.4 42.8 37.4 1.07x 0.87x 0.4 0%
bfloat16 1 8192 8 51.6 48.6 42.5 42.2 1.21x 0.99x 0.4 0%
bfloat16 1 8192 16 48.6 48.6 42.7 39.1 1.14x 0.92x 0.4 0%
bfloat16 1 8193 4 45.7 48.4 45.8 37.0 1.00x 0.81x 0.4 0%
bfloat16 1 8193 8 48.7 48.6 45.9 40.3 1.06x 0.88x 0.4 0%
bfloat16 1 8193 16 48.5 48.5 47.2 39.7 1.03x 0.84x 0.4 0%
bfloat16 1 131072 4 368.8 375.7 102.4 46.3 3.60x 0.45x 2.6 1%
bfloat16 1 131072 8 396.4 402.5 105.2 46.3 3.77x 0.44x 2.5 1%
bfloat16 1 131072 16 430.6 426.2 111.0 46.4 3.88x 0.42x 2.4 1%
bfloat16 1 131073 4 370.4 364.3 168.6 46.8 2.20x 0.28x 1.6 0%
bfloat16 1 131073 8 392.5 396.7 202.4 46.8 1.94x 0.23x 1.3 0%
bfloat16 1 131073 16 413.9 421.3 184.1 46.7 2.25x 0.25x 1.4 0%
bfloat16 8 128 4 30.4 30.4 30.3 14.9 1.00x 0.49x 0.1 0%
bfloat16 8 128 8 30.5 30.6 30.4 14.6 1.00x 0.48x 0.1 0%
bfloat16 8 128 16 30.4 30.3 30.3 14.6 1.00x 0.48x 0.1 0%
bfloat16 8 129 4 30.3 30.5 30.2 15.1 1.00x 0.50x 0.1 0%
bfloat16 8 129 8 30.3 30.5 30.5 15.1 0.99x 0.50x 0.1 0%
bfloat16 8 129 16 30.4 30.5 30.3 15.1 1.00x 0.50x 0.1 0%
bfloat16 8 1024 4 30.4 30.5 30.4 19.3 1.00x 0.63x 0.5 0%
bfloat16 8 1024 8 30.4 30.5 30.5 19.4 1.00x 0.64x 0.6 0%
bfloat16 8 1024 16 30.4 30.4 30.4 19.5 1.00x 0.64x 0.6 0%
bfloat16 8 1025 4 30.4 30.5 30.4 20.5 1.00x 0.67x 0.6 0%
bfloat16 8 1025 8 30.6 30.4 30.4 20.4 1.01x 0.67x 0.6 0%
bfloat16 8 1025 16 30.4 30.4 30.5 20.4 1.00x 0.67x 0.6 0%
bfloat16 8 8192 4 54.7 51.6 44.2 42.2 1.24x 0.95x 3.0 1%
bfloat16 8 8192 8 51.6 54.6 45.6 39.9 1.13x 0.87x 2.9 1%
bfloat16 8 8192 16 54.8 54.5 44.5 42.4 1.23x 0.95x 3.0 1%
bfloat16 8 8193 4 54.5 54.5 47.3 43.3 1.15x 0.92x 2.8 1%
bfloat16 8 8193 8 54.7 54.7 48.5 43.5 1.13x 0.90x 2.7 1%
bfloat16 8 8193 16 54.6 48.6 48.5 42.7 1.13x 0.88x 2.7 1%
bfloat16 8 131072 4 388.2 394.6 145.4 56.8 2.67x 0.39x 14.4 3%
bfloat16 8 131072 8 422.7 398.6 137.5 56.5 3.07x 0.41x 15.3 3%
bfloat16 8 131072 16 427.5 433.5 146.5 56.7 2.92x 0.39x 14.3 3%
bfloat16 8 131073 4 392.3 405.1 218.3 56.8 1.80x 0.26x 9.6 2%
bfloat16 8 131073 8 404.6 406.4 222.5 57.1 1.82x 0.26x 9.4 2%
bfloat16 8 131073 16 442.0 436.3 196.2 56.9 2.25x 0.29x 10.7 2%
bfloat16 64 128 4 30.5 30.5 30.3 14.9 1.01x 0.49x 0.6 0%
bfloat16 64 128 8 30.5 30.6 30.3 14.7 1.01x 0.49x 0.7 0%
bfloat16 64 128 16 30.6 30.4 30.2 14.8 1.01x 0.49x 0.9 0%
bfloat16 64 129 4 30.6 30.4 30.3 15.4 1.01x 0.51x 0.6 0%
bfloat16 64 129 8 30.5 30.4 30.3 15.5 1.01x 0.51x 0.7 0%
bfloat16 64 129 16 30.6 30.4 30.3 15.2 1.01x 0.50x 0.9 0%
bfloat16 64 1024 4 30.6 30.5 30.4 19.5 1.01x 0.64x 4.4 1%
bfloat16 64 1024 8 30.5 30.5 30.3 19.5 1.01x 0.64x 4.5 1%
bfloat16 64 1024 16 30.5 30.6 30.7 19.5 0.99x 0.64x 4.6 1%
bfloat16 64 1025 4 33.7 33.6 33.6 20.7 1.00x 0.62x 4.0 1%
bfloat16 64 1025 8 33.7 33.6 33.7 20.6 1.00x 0.61x 4.0 1%
bfloat16 64 1025 16 33.5 33.7 33.7 20.6 0.99x 0.61x 4.2 1%
bfloat16 64 8192 4 93.1 92.2 93.4 49.9 1.00x 0.53x 11.3 2%
bfloat16 64 8192 8 97.7 96.6 92.0 49.5 1.06x 0.54x 11.5 3%
bfloat16 64 8192 16 100.8 101.2 91.7 49.6 1.10x 0.54x 11.5 3%
bfloat16 64 8193 4 96.2 90.1 97.9 49.8 0.98x 0.51x 10.7 2%
bfloat16 64 8193 8 97.9 96.3 97.9 49.6 1.00x 0.51x 10.8 2%
bfloat16 64 8193 16 100.2 100.3 97.7 49.7 1.03x 0.51x 10.8 2%
bfloat16 64 131072 4 901.8 888.7 304.9 162.9 2.96x 0.53x 55.0 12%
bfloat16 64 131072 8 939.7 948.2 308.0 164.6 3.05x 0.53x 54.5 12%
bfloat16 64 131072 16 999.0 993.3 301.4 164.4 3.31x 0.55x 55.7 12%
bfloat16 64 131073 4 902.2 889.0 449.7 166.8 2.01x 0.37x 37.3 8%
bfloat16 64 131073 8 944.7 942.0 464.5 166.8 2.03x 0.36x 36.1 8%
bfloat16 64 131073 16 1002.6 1000.7 449.2 165.5 2.23x 0.37x 37.4 8%
bfloat16 256 128 4 33.7 33.7 33.6 15.7 1.00x 0.47x 2.3 0%
bfloat16 256 128 8 33.8 33.6 33.7 15.6 1.00x 0.46x 2.6 1%
bfloat16 256 128 16 33.6 33.6 33.6 15.7 1.00x 0.47x 3.2 1%
bfloat16 256 129 4 33.7 33.6 33.6 16.5 1.00x 0.49x 2.3 0%
bfloat16 256 129 8 33.6 33.6 33.6 16.3 1.00x 0.49x 2.6 1%
bfloat16 256 129 16 33.6 33.5 33.5 16.3 1.00x 0.49x 3.2 1%
bfloat16 256 1024 4 56.3 56.1 56.2 41.7 1.00x 0.74x 9.5 2%
bfloat16 256 1024 8 59.0 58.9 58.9 42.4 1.00x 0.72x 9.2 2%
bfloat16 256 1024 16 59.3 59.2 60.1 42.6 0.99x 0.71x 9.4 2%
bfloat16 256 1025 4 71.1 72.4 73.4 45.9 0.97x 0.63x 7.3 2%
bfloat16 256 1025 8 75.1 74.1 74.8 46.7 1.00x 0.62x 7.3 2%
bfloat16 256 1025 16 75.4 75.4 73.8 47.1 1.02x 0.64x 7.7 2%
bfloat16 256 8192 4 260.0 263.7 254.6 75.2 1.02x 0.30x 16.5 4%
bfloat16 256 8192 8 270.4 269.8 255.6 75.0 1.06x 0.29x 16.5 4%
bfloat16 256 8192 16 287.6 290.5 255.0 75.2 1.13x 0.29x 16.6 4%
bfloat16 256 8193 4 261.0 268.2 274.2 75.1 0.95x 0.27x 15.3 3%
bfloat16 256 8193 8 273.3 273.1 276.5 75.6 0.99x 0.27x 15.2 3%
bfloat16 256 8193 16 287.6 288.1 277.8 75.7 1.04x 0.27x 15.2 3%
bfloat16 256 131072 4 3096.6 3087.7 961.2 439.2 3.22x 0.46x 69.8 15%
bfloat16 256 131072 8 3283.4 3269.1 941.6 436.9 3.49x 0.46x 71.3 16%
bfloat16 256 131072 16 3464.5 3469.5 923.2 440.9 3.75x 0.48x 72.7 16%
bfloat16 256 131073 4 3085.3 3093.6 1548.8 441.5 1.99x 0.29x 43.3 10%
bfloat16 256 131073 8 3282.4 3267.2 1525.2 435.4 2.15x 0.29x 44.0 10%
bfloat16 256 131073 16 3462.5 3470.8 1495.2 443.1 2.32x 0.30x 44.9 10%
bfloat16 1024 128 4 70.9 69.5 70.6 22.1 1.00x 0.31x 4.3 1%
bfloat16 1024 128 8 75.3 75.2 75.3 22.0 1.00x 0.29x 4.6 1%
bfloat16 1024 128 16 76.9 76.7 76.6 22.3 1.00x 0.29x 5.6 1%
bfloat16 1024 129 4 70.8 69.6 69.9 24.4 1.01x 0.35x 4.4 1%
bfloat16 1024 129 8 75.4 75.2 75.1 24.4 1.00x 0.32x 4.6 1%
bfloat16 1024 129 16 76.8 76.7 76.6 24.5 1.00x 0.32x 5.6 1%
bfloat16 1024 1024 4 152.6 56.2 56.0 63.1 2.73x 1.13x 38.2 8%
bfloat16 1024 1024 8 156.0 56.2 55.9 63.3 2.79x 1.13x 39.0 9%
bfloat16 1024 1024 16 157.2 57.5 57.4 63.4 2.74x 1.10x 39.4 9%
bfloat16 1024 1025 4 218.4 86.0 86.9 64.5 2.51x 0.74x 24.6 5%
bfloat16 1024 1025 8 223.7 86.8 87.0 64.7 2.57x 0.74x 25.1 5%
bfloat16 1024 1025 16 225.8 87.3 87.1 64.8 2.59x 0.74x 26.0 6%
bfloat16 1024 8192 4 939.4 248.0 259.0 147.6 3.63x 0.57x 64.9 14%
bfloat16 1024 8192 8 985.8 249.3 258.9 147.4 3.81x 0.57x 65.1 14%
bfloat16 1024 8192 16 1036.1 251.2 260.7 148.0 3.97x 0.57x 65.0 14%
bfloat16 1024 8193 4 941.7 406.6 421.8 149.2 2.23x 0.35x 39.9 9%
bfloat16 1024 8193 8 988.2 407.0 417.2 148.4 2.37x 0.36x 40.4 9%
bfloat16 1024 8193 16 1040.8 406.8 419.0 149.3 2.48x 0.36x 40.4 9%
bfloat16 1024 131072 4 11500.2 1762.5 1762.0 1865.9 6.53x 1.06x 152.4 33%
bfloat16 1024 131072 8 12192.8 1762.8 1764.9 1867.4 6.91x 1.06x 152.1 33%
bfloat16 1024 131072 16 12859.4 1767.0 1762.5 1863.0 7.30x 1.06x 152.4 33%
bfloat16 1024 131073 4 11514.6 2998.5 2996.9 1940.1 3.84x 0.65x 89.6 20%
bfloat16 1024 131073 8 12173.3 2998.4 2997.4 1936.8 4.06x 0.65x 89.6 20%
bfloat16 1024 131073 16 12856.9 3002.4 2997.6 1944.4 4.29x 0.65x 89.6 20%
bfloat16 2048 128 4 113.9 113.8 113.5 30.5 1.00x 0.27x 5.3 1%
bfloat16 2048 128 8 120.3 119.9 119.7 30.5 1.01x 0.25x 5.7 1%
bfloat16 2048 128 16 122.9 122.9 123.3 30.9 1.00x 0.25x 6.9 2%
bfloat16 2048 129 4 113.8 114.0 113.7 35.4 1.00x 0.31x 5.4 1%
bfloat16 2048 129 8 120.1 120.1 120.1 35.2 1.00x 0.29x 5.8 1%
bfloat16 2048 129 16 123.2 123.1 123.7 35.7 1.00x 0.29x 6.9 2%
bfloat16 2048 1024 4 276.3 96.4 97.2 85.7 2.84x 0.88x 44.0 10%
bfloat16 2048 1024 8 284.8 97.5 97.6 86.0 2.92x 0.88x 44.7 10%
bfloat16 2048 1024 16 286.1 99.3 99.3 86.4 2.88x 0.87x 45.5 10%
bfloat16 2048 1025 4 407.9 158.2 158.2 88.4 2.58x 0.56x 27.1 6%
bfloat16 2048 1025 8 423.7 158.8 159.0 88.7 2.66x 0.56x 27.4 6%
bfloat16 2048 1025 16 428.3 160.0 159.9 89.0 2.68x 0.56x 28.3 6%
bfloat16 2048 8192 4 1875.1 496.1 497.7 234.9 3.77x 0.47x 67.6 15%
bfloat16 2048 8192 8 1956.5 497.2 498.0 234.1 3.93x 0.47x 67.7 15%
bfloat16 2048 8192 16 2058.5 498.7 499.5 235.0 4.12x 0.47x 67.8 15%
bfloat16 2048 8193 4 1873.4 825.1 822.9 236.2 2.28x 0.29x 40.9 9%
bfloat16 2048 8193 8 1959.0 824.1 823.8 237.3 2.38x 0.29x 40.9 9%
bfloat16 2048 8193 16 2065.1 825.7 825.2 237.4 2.50x 0.29x 41.1 9%
bfloat16 2048 131072 4 22903.6 3485.4 3486.6 3646.5 6.57x 1.05x 154.0 34%
bfloat16 2048 131072 8 24193.6 3484.6 3488.3 3644.1 6.94x 1.04x 154.0 34%
bfloat16 2048 131072 16 25590.8 3487.7 3489.4 3646.2 7.33x 1.04x 154.0 34%
bfloat16 2048 131073 4 22872.9 5925.0 5928.1 3774.7 3.86x 0.64x 90.6 20%
bfloat16 2048 131073 8 24187.7 5933.4 5929.8 3780.1 4.08x 0.64x 90.6 20%
bfloat16 2048 131073 16 25604.8 5934.5 5926.6 3773.0 4.32x 0.64x 90.6 20%
float16 1 128 4 30.7 30.7 30.6 14.3 1.00x 0.47x 0.0 0%
float16 1 128 8 30.6 30.6 30.5 14.0 1.00x 0.46x 0.0 0%
float16 1 128 16 30.5 30.5 30.6 14.0 1.00x 0.46x 0.0 0%
float16 1 129 4 30.6 30.6 30.7 14.4 1.00x 0.47x 0.0 0%
float16 1 129 8 30.6 30.3 30.5 14.4 1.00x 0.47x 0.0 0%
float16 1 129 16 30.5 30.4 30.7 14.7 0.99x 0.48x 0.0 0%
float16 1 1024 4 30.6 30.7 30.8 17.4 0.99x 0.56x 0.1 0%
float16 1 1024 8 30.5 30.5 30.8 17.5 0.99x 0.57x 0.1 0%
float16 1 1024 16 30.4 30.5 30.7 17.5 0.99x 0.57x 0.1 0%
float16 1 1025 4 30.5 30.5 30.7 17.8 0.99x 0.58x 0.1 0%
float16 1 1025 8 30.4 30.4 30.7 18.6 0.99x 0.61x 0.1 0%
float16 1 1025 16 30.4 30.3 30.7 20.1 0.99x 0.65x 0.1 0%
float16 1 8192 4 41.4 38.2 38.5 33.6 1.08x 0.87x 0.4 0%
float16 1 8192 8 41.2 48.4 42.9 33.8 0.96x 0.79x 0.4 0%
float16 1 8192 16 45.6 48.4 38.3 31.5 1.19x 0.82x 0.4 0%
float16 1 8193 4 45.6 41.0 44.5 37.4 1.02x 0.84x 0.4 0%
float16 1 8193 8 42.6 44.1 40.0 36.9 1.06x 0.92x 0.4 0%
float16 1 8193 16 45.6 51.3 46.0 33.3 0.99x 0.72x 0.4 0%
float16 1 131072 4 297.2 304.4 126.4 46.2 2.35x 0.37x 2.1 0%
float16 1 131072 8 326.6 335.1 99.5 46.5 3.28x 0.47x 2.6 1%
float16 1 131072 16 348.1 355.4 132.9 46.1 2.62x 0.35x 2.0 0%
float16 1 131073 4 308.7 286.0 198.8 46.9 1.55x 0.24x 1.3 0%
float16 1 131073 8 321.3 325.3 188.1 46.8 1.71x 0.25x 1.4 0%
float16 1 131073 16 353.2 378.6 185.2 46.6 1.91x 0.25x 1.4 0%
float16 8 128 4 30.5 30.2 30.4 14.4 1.00x 0.47x 0.1 0%
float16 8 128 8 30.4 30.2 30.3 14.5 1.00x 0.48x 0.1 0%
float16 8 128 16 30.4 30.4 30.4 14.5 1.00x 0.48x 0.1 0%
float16 8 129 4 30.5 30.2 30.2 14.8 1.01x 0.49x 0.1 0%
float16 8 129 8 30.3 30.2 30.3 14.9 1.00x 0.49x 0.1 0%
float16 8 129 16 30.5 30.4 30.3 14.9 1.01x 0.49x 0.1 0%
float16 8 1024 4 30.6 30.4 30.3 19.1 1.01x 0.63x 0.6 0%
float16 8 1024 8 30.5 30.4 30.4 19.2 1.00x 0.63x 0.6 0%
float16 8 1024 16 30.4 30.3 30.4 19.3 1.00x 0.63x 0.6 0%
float16 8 1025 4 30.5 30.4 30.4 19.5 1.00x 0.64x 0.6 0%
float16 8 1025 8 30.5 30.3 30.4 20.4 1.00x 0.67x 0.6 0%
float16 8 1025 16 30.5 30.3 30.4 20.5 1.00x 0.67x 0.6 0%
float16 8 8192 4 45.6 45.5 42.7 37.9 1.07x 0.89x 3.1 1%
float16 8 8192 8 48.4 48.5 44.0 39.8 1.10x 0.90x 3.0 1%
float16 8 8192 16 48.5 51.5 44.1 41.7 1.10x 0.95x 3.0 1%
float16 8 8193 4 48.5 45.5 47.3 39.2 1.03x 0.83x 2.8 1%
float16 8 8193 8 45.6 48.6 47.0 40.7 0.97x 0.87x 2.8 1%
float16 8 8193 16 54.5 51.7 45.7 43.0 1.19x 0.94x 2.9 1%
float16 8 131072 4 309.9 334.0 137.7 56.0 2.25x 0.41x 15.2 3%
float16 8 131072 8 338.1 356.0 125.9 56.1 2.69x 0.45x 16.7 4%
float16 8 131072 16 393.3 387.7 132.6 56.3 2.97x 0.42x 15.8 3%
float16 8 131073 4 314.9 313.8 208.8 56.2 1.51x 0.27x 10.0 2%
float16 8 131073 8 341.7 344.2 200.6 56.3 1.70x 0.28x 10.5 2%
float16 8 131073 16 366.4 378.0 200.1 56.3 1.83x 0.28x 10.5 2%
float16 64 128 4 30.5 30.1 30.3 14.9 1.01x 0.49x 0.6 0%
float16 64 128 8 30.5 30.2 30.3 14.7 1.01x 0.49x 0.7 0%
float16 64 128 16 30.4 30.2 30.1 14.7 1.01x 0.49x 0.9 0%
float16 64 129 4 30.6 30.2 30.3 15.3 1.01x 0.50x 0.6 0%
float16 64 129 8 30.6 30.2 30.4 15.2 1.01x 0.50x 0.7 0%
float16 64 129 16 30.5 30.2 30.4 15.1 1.00x 0.50x 0.9 0%
float16 64 1024 4 30.4 30.4 30.3 19.2 1.00x 0.63x 4.4 1%
float16 64 1024 8 30.4 30.4 30.4 19.3 1.00x 0.63x 4.5 1%
float16 64 1024 16 30.4 30.3 30.5 19.4 1.00x 0.64x 4.6 1%
float16 64 1025 4 32.2 32.0 33.0 19.7 0.98x 0.60x 4.1 1%
float16 64 1025 8 32.1 32.1 32.3 20.4 0.99x 0.63x 4.2 1%
float16 64 1025 16 33.6 33.6 33.6 20.4 1.00x 0.61x 4.2 1%
float16 64 8192 4 81.3 84.2 83.0 49.4 0.98x 0.60x 12.7 3%
float16 64 8192 8 83.0 84.2 83.0 49.2 1.00x 0.59x 12.7 3%
float16 64 8192 16 88.7 90.4 89.1 49.2 1.00x 0.55x 11.9 3%
float16 64 8193 4 81.3 80.1 85.8 49.4 0.95x 0.58x 12.3 3%
float16 64 8193 8 87.2 84.0 88.8 49.4 0.98x 0.56x 11.9 3%
float16 64 8193 16 90.2 88.8 91.7 49.4 0.98x 0.54x 11.5 3%
float16 64 131072 4 752.0 723.7 285.8 162.1 2.63x 0.57x 58.7 13%
float16 64 131072 8 788.0 782.2 290.4 160.5 2.71x 0.55x 57.8 13%
float16 64 131072 16 853.1 866.5 282.4 162.4 3.02x 0.58x 59.4 13%
float16 64 131073 4 712.3 709.2 440.0 161.6 1.62x 0.37x 38.1 8%
float16 64 131073 8 784.4 775.9 409.9 163.9 1.91x 0.40x 40.9 9%
float16 64 131073 16 866.1 857.3 433.5 162.9 2.00x 0.38x 38.7 8%
float16 256 128 4 33.7 33.6 33.5 15.5 1.01x 0.46x 2.3 0%
float16 256 128 8 33.7 33.6 33.6 15.6 1.00x 0.46x 2.6 1%
float16 256 128 16 33.7 33.6 33.5 15.6 1.01x 0.47x 3.2 1%
float16 256 129 4 33.7 33.5 33.5 16.0 1.01x 0.48x 2.3 0%
float16 256 129 8 33.7 33.5 33.6 15.9 1.00x 0.47x 2.6 1%
float16 256 129 16 33.6 33.5 33.5 16.1 1.00x 0.48x 3.2 1%
float16 256 1024 4 50.6 50.8 50.1 37.9 1.01x 0.76x 10.7 2%
float16 256 1024 8 53.1 53.0 52.8 38.8 1.01x 0.73x 10.3 2%
float16 256 1024 16 55.0 56.0 55.7 39.9 0.99x 0.72x 10.1 2%
float16 256 1025 4 63.5 63.5 63.4 42.0 1.00x 0.66x 8.4 2%
float16 256 1025 8 64.6 66.3 66.4 43.1 0.97x 0.65x 8.2 2%
float16 256 1025 16 69.5 67.9 68.2 43.8 1.02x 0.64x 8.3 2%
float16 256 8192 4 219.8 221.4 218.2 74.1 1.01x 0.34x 19.3 4%
float16 256 8192 8 233.9 234.1 226.5 74.4 1.03x 0.33x 18.6 4%
float16 256 8192 16 248.0 250.8 237.1 74.7 1.05x 0.32x 17.9 4%
float16 256 8193 4 217.9 220.0 236.7 74.3 0.92x 0.31x 17.8 4%
float16 256 8193 8 235.5 232.7 246.1 74.8 0.96x 0.30x 17.1 4%
float16 256 8193 16 252.1 257.4 257.6 74.9 0.98x 0.29x 16.4 4%
float16 256 131072 4 2409.4 2421.9 880.3 428.9 2.74x 0.49x 76.2 17%
float16 256 131072 8 2673.7 2662.8 887.3 427.9 3.01x 0.48x 75.7 17%
float16 256 131072 16 2935.0 2934.9 898.3 428.2 3.27x 0.48x 74.8 16%
float16 256 131073 4 2405.3 2442.5 1408.4 431.9 1.71x 0.31x 47.7 10%
float16 256 131073 8 2662.4 2677.0 1434.5 429.8 1.86x 0.30x 46.8 10%
float16 256 131073 16 2941.0 2949.7 1471.8 432.2 2.00x 0.29x 45.6 10%
float16 1024 128 4 67.6 67.6 66.6 20.9 1.02x 0.31x 4.6 1%
float16 1024 128 8 70.7 69.7 70.6 20.9 1.00x 0.30x 4.9 1%
float16 1024 128 16 71.4 71.4 71.7 21.4 1.00x 0.30x 5.9 1%
float16 1024 129 4 66.5 66.6 67.6 23.3 0.98x 0.34x 4.5 1%
float16 1024 129 8 70.8 70.1 70.5 23.1 1.00x 0.33x 4.9 1%
float16 1024 129 16 71.2 72.4 71.2 23.4 1.00x 0.33x 6.0 1%
float16 1024 1024 4 132.5 48.4 48.5 62.7 2.73x 1.29x 44.1 10%
float16 1024 1024 8 136.5 48.7 48.4 63.0 2.82x 1.30x 45.0 10%
float16 1024 1024 16 143.6 49.7 49.8 63.1 2.88x 1.27x 45.4 10%
float16 1024 1025 4 185.3 97.8 97.5 64.2 1.90x 0.66x 22.0 5%
float16 1024 1025 8 192.7 97.7 97.8 64.4 1.97x 0.66x 22.3 5%
float16 1024 1025 16 206.3 99.0 98.9 64.5 2.09x 0.65x 22.9 5%
float16 1024 8192 4 793.1 198.8 207.6 145.0 3.82x 0.70x 81.0 18%
float16 1024 8192 8 840.3 199.1 209.4 144.6 4.01x 0.69x 80.5 18%
float16 1024 8192 16 907.4 201.8 211.9 145.5 4.28x 0.69x 79.9 18%
float16 1024 8193 4 799.0 456.2 466.4 146.1 1.71x 0.31x 36.1 8%
float16 1024 8193 8 838.6 457.3 468.8 146.5 1.79x 0.31x 36.0 8%
float16 1024 8193 16 912.3 459.8 470.6 146.2 1.94x 0.31x 36.0 8%
float16 1024 131072 4 9033.3 1535.9 1539.0 1846.9 5.87x 1.20x 174.4 38%
float16 1024 131072 8 9885.6 1542.6 1539.7 1856.1 6.42x 1.21x 174.4 38%
float16 1024 131072 16 10870.4 1538.7 1544.1 1858.5 7.04x 1.20x 174.0 38%
float16 1024 131073 4 9011.7 3193.9 3188.8 1924.0 2.83x 0.60x 84.2 18%
float16 1024 131073 8 9922.9 3185.2 3196.3 1921.5 3.10x 0.60x 84.0 18%
float16 1024 131073 16 10905.6 3186.0 3216.1 1926.4 3.39x 0.60x 83.5 18%
float16 2048 128 4 106.8 107.8 106.5 28.3 1.00x 0.27x 5.7 1%
float16 2048 128 8 112.6 112.5 112.4 28.5 1.00x 0.25x 6.1 1%
float16 2048 128 16 115.6 114.5 115.4 29.2 1.00x 0.25x 7.4 2%
float16 2048 129 4 106.9 108.1 107.7 32.6 0.99x 0.30x 5.7 1%
float16 2048 129 8 112.5 112.4 112.3 32.7 1.00x 0.29x 6.2 1%
float16 2048 129 16 115.9 115.4 115.3 33.5 1.01x 0.29x 7.4 2%
float16 2048 1024 4 236.3 81.3 81.3 85.1 2.91x 1.05x 52.6 12%
float16 2048 1024 8 246.7 82.8 82.8 85.7 2.98x 1.04x 52.6 12%
float16 2048 1024 16 259.7 84.4 84.2 86.0 3.08x 1.02x 53.7 12%
float16 2048 1025 4 345.5 179.5 180.5 87.7 1.91x 0.49x 23.7 5%
float16 2048 1025 8 358.4 180.9 180.8 88.0 1.98x 0.49x 24.1 5%
float16 2048 1025 16 380.3 182.2 182.2 88.5 2.09x 0.49x 24.8 5%
float16 2048 8192 4 1572.3 399.3 399.8 228.7 3.93x 0.57x 84.1 18%
float16 2048 8192 8 1662.5 400.0 400.3 228.5 4.15x 0.57x 84.2 18%
float16 2048 8192 16 1808.5 401.1 402.1 230.5 4.50x 0.57x 84.3 18%
float16 2048 8193 4 1573.6 924.3 926.2 231.7 1.70x 0.25x 36.3 8%
float16 2048 8193 8 1672.3 926.3 926.2 231.6 1.81x 0.25x 36.4 8%
float16 2048 8193 16 1813.4 931.1 929.0 233.1 1.95x 0.25x 36.5 8%
float16 2048 131072 4 17900.0 3035.1 3031.5 3622.2 5.90x 1.19x 177.1 39%
float16 2048 131072 8 19669.5 3028.6 3027.0 3607.3 6.50x 1.19x 177.4 39%
float16 2048 131072 16 21602.8 3043.9 3043.3 3607.4 7.10x 1.19x 176.5 39%
float16 2048 131073 4 17893.0 6305.2 6308.6 3743.3 2.84x 0.59x 85.1 19%
float16 2048 131073 8 19693.7 6309.6 6303.1 3747.1 3.12x 0.59x 85.2 19%
float16 2048 131073 16 21604.8 6307.9 6309.5 3749.5 3.42x 0.59x 85.1 19%
float32 1 128 4 31.2 31.4 37.1 14.5 0.84x 0.39x 0.0 0%
float32 1 128 8 34.0 34.4 34.1 14.3 1.00x 0.42x 0.0 0%
float32 1 128 16 32.4 34.4 32.2 14.0 1.01x 0.43x 0.0 0%
float32 1 129 4 34.1 34.4 35.5 14.4 0.96x 0.41x 0.0 0%
float32 1 129 8 34.0 32.7 33.9 14.4 1.00x 0.42x 0.0 0%
float32 1 129 16 34.1 34.3 32.0 15.2 1.07x 0.47x 0.0 0%
float32 1 1024 4 35.3 32.7 35.4 17.8 1.00x 0.50x 0.1 0%
float32 1 1024 8 35.3 35.8 35.3 22.2 1.00x 0.63x 0.1 0%
float32 1 1024 16 35.3 35.7 35.5 19.1 0.99x 0.54x 0.1 0%
float32 1 1025 4 35.3 35.9 33.7 18.8 1.05x 0.56x 0.1 0%
float32 1 1025 8 38.5 35.8 35.6 19.7 1.08x 0.55x 0.1 0%
float32 1 1025 16 35.2 35.7 33.7 19.6 1.04x 0.58x 0.1 0%
float32 1 8192 4 54.6 51.1 52.0 39.6 1.05x 0.76x 0.6 0%
float32 1 8192 8 63.6 55.0 50.5 38.0 1.26x 0.75x 0.7 0%
float32 1 8192 16 54.6 58.0 55.1 38.7 0.99x 0.70x 0.6 0%
float32 1 8193 4 51.5 52.0 53.4 34.1 0.96x 0.64x 0.6 0%
float32 1 8193 8 56.5 54.9 53.5 41.6 1.06x 0.78x 0.6 0%
float32 1 8193 16 60.6 58.0 52.1 39.8 1.16x 0.76x 0.6 0%
float32 1 131072 4 410.5 393.7 155.8 63.3 2.63x 0.41x 3.4 1%
float32 1 131072 8 412.3 398.5 130.7 63.3 3.15x 0.48x 4.0 1%
float32 1 131072 16 423.5 467.2 148.8 63.3 2.85x 0.43x 3.5 1%
float32 1 131073 4 406.7 389.3 172.4 64.0 2.36x 0.37x 3.0 1%
float32 1 131073 8 425.0 417.1 189.1 64.0 2.25x 0.34x 2.8 1%
float32 1 131073 16 435.0 430.7 240.7 63.9 1.81x 0.27x 2.2 0%
float32 8 128 4 33.8 37.2 33.8 14.7 1.00x 0.43x 0.1 0%
float32 8 128 8 35.0 34.1 35.1 14.3 1.00x 0.41x 0.1 0%
float32 8 128 16 35.6 37.2 36.7 15.2 0.97x 0.41x 0.2 0%
float32 8 129 4 35.2 36.0 35.2 15.0 1.00x 0.43x 0.1 0%
float32 8 129 8 36.8 34.1 33.8 15.0 1.09x 0.44x 0.1 0%
float32 8 129 16 35.3 35.5 36.8 15.3 0.96x 0.42x 0.2 0%
float32 8 1024 4 39.8 35.6 37.9 20.9 1.05x 0.55x 0.9 0%
float32 8 1024 8 38.2 35.6 35.2 19.7 1.09x 0.56x 1.0 0%
float32 8 1024 16 38.3 40.2 38.2 19.7 1.00x 0.52x 0.9 0%
float32 8 1025 4 38.3 35.7 38.3 20.6 1.00x 0.54x 0.9 0%
float32 8 1025 8 38.4 38.7 38.3 21.4 1.00x 0.56x 0.9 0%
float32 8 1025 16 41.2 36.9 39.5 22.0 1.04x 0.56x 0.9 0%
float32 8 8192 4 57.5 62.6 56.1 41.0 1.02x 0.73x 4.7 1%
float32 8 8192 8 60.6 55.2 60.8 42.6 1.00x 0.70x 4.3 1%
float32 8 8192 16 66.7 61.1 56.3 44.7 1.18x 0.79x 4.7 1%
float32 8 8193 4 54.6 64.0 57.7 43.0 0.95x 0.75x 4.6 1%
float32 8 8193 8 66.5 61.0 57.9 43.5 1.15x 0.75x 4.5 1%
float32 8 8193 16 63.9 67.1 62.3 45.0 1.03x 0.72x 4.2 1%
float32 8 131072 4 412.1 410.8 160.3 76.0 2.57x 0.47x 26.2 6%
float32 8 131072 8 432.3 425.0 161.0 76.0 2.69x 0.47x 26.1 6%
float32 8 131072 16 470.7 458.5 174.0 76.2 2.71x 0.44x 24.1 5%
float32 8 131073 4 403.7 411.2 244.8 76.0 1.65x 0.31x 17.1 4%
float32 8 131073 8 424.4 425.9 251.9 75.8 1.68x 0.30x 16.7 4%
float32 8 131073 16 471.5 477.8 250.7 76.1 1.88x 0.30x 16.7 4%
float32 64 128 4 38.2 37.4 36.8 15.0 1.04x 0.41x 1.0 0%
float32 64 128 8 36.8 37.2 36.7 15.0 1.00x 0.41x 1.1 0%
float32 64 128 16 38.3 37.2 38.1 14.9 1.01x 0.39x 1.2 0%
float32 64 129 4 38.5 37.1 36.8 15.5 1.05x 0.42x 1.0 0%
float32 64 129 8 37.0 37.1 36.8 15.9 1.01x 0.43x 1.1 0%
float32 64 129 16 38.4 38.8 37.1 15.4 1.04x 0.42x 1.2 0%
float32 64 1024 4 39.6 38.9 41.3 20.4 0.96x 0.49x 6.4 1%
float32 64 1024 8 39.8 39.2 41.1 20.3 0.97x 0.49x 6.5 1%
float32 64 1024 16 41.4 40.2 42.6 20.3 0.97x 0.48x 6.4 1%
float32 64 1025 4 41.3 43.4 41.4 22.1 1.00x 0.53x 6.4 1%
float32 64 1025 8 42.9 43.3 42.4 22.1 1.01x 0.52x 6.3 1%
float32 64 1025 16 42.9 44.7 42.6 22.2 1.01x 0.52x 6.4 1%
float32 64 8192 4 96.8 99.2 106.9 65.6 0.91x 0.61x 19.6 4%
float32 64 8192 8 103.8 106.6 110.0 65.6 0.94x 0.60x 19.1 4%
float32 64 8192 16 109.6 109.9 117.2 65.6 0.94x 0.56x 18.0 4%
float32 64 8193 4 97.8 99.6 111.5 65.6 0.88x 0.59x 18.8 4%
float32 64 8193 8 104.9 112.7 112.9 65.5 0.93x 0.58x 18.6 4%
float32 64 8193 16 112.9 115.8 111.7 65.6 1.01x 0.59x 18.9 4%
float32 64 131072 4 956.6 940.0 470.2 221.1 2.03x 0.47x 71.4 16%
float32 64 131072 8 1024.0 1007.2 473.6 220.4 2.16x 0.47x 70.9 16%
float32 64 131072 16 1097.5 1082.4 487.5 222.6 2.25x 0.46x 68.9 15%
float32 64 131073 4 943.5 941.2 610.1 223.0 1.55x 0.37x 55.0 12%
float32 64 131073 8 1004.0 1010.3 635.1 225.0 1.58x 0.35x 52.8 12%
float32 64 131073 16 1095.1 1101.5 650.8 223.7 1.68x 0.34x 51.6 11%
float32 256 128 4 46.0 46.0 45.8 15.7 1.00x 0.34x 3.1 1%
float32 256 128 8 47.2 47.5 45.8 15.7 1.03x 0.34x 3.4 1%
float32 256 128 16 47.4 47.2 45.7 15.7 1.04x 0.34x 3.9 1%
float32 256 129 4 47.2 47.5 45.8 16.1 1.03x 0.35x 3.2 1%
float32 256 129 8 45.6 47.4 47.2 16.7 0.97x 0.35x 3.3 1%
float32 256 129 16 47.3 49.0 50.1 16.6 0.94x 0.33x 3.6 1%
float32 256 1024 4 66.7 68.3 68.2 41.7 0.98x 0.61x 15.6 3%
float32 256 1024 8 70.7 70.0 69.5 43.2 1.02x 0.62x 15.4 3%
float32 256 1024 16 71.1 71.6 71.2 43.8 1.00x 0.62x 15.4 3%
float32 256 1025 4 82.8 81.2 81.8 45.9 1.01x 0.56x 13.0 3%
float32 256 1025 8 85.8 84.6 87.6 46.6 0.98x 0.53x 12.3 3%
float32 256 1025 16 87.3 89.4 89.3 48.1 0.98x 0.54x 12.3 3%
float32 256 8192 4 274.6 277.6 279.6 101.0 0.98x 0.36x 30.0 7%
float32 256 8192 8 299.9 286.3 292.0 101.3 1.03x 0.35x 28.8 6%
float32 256 8192 16 313.3 315.7 301.0 100.9 1.04x 0.34x 28.0 6%
float32 256 8193 4 283.6 277.9 296.7 101.7 0.96x 0.34x 28.3 6%
float32 256 8193 8 292.0 292.6 303.0 101.6 0.96x 0.34x 27.8 6%
float32 256 8193 16 317.9 318.0 314.7 101.8 1.01x 0.32x 26.8 6%
float32 256 131072 4 3194.0 3202.4 1625.5 1128.3 1.96x 0.69x 82.6 18%
float32 256 131072 8 3415.0 3445.5 1644.8 1132.5 2.08x 0.69x 81.6 18%
float32 256 131072 16 3704.6 3711.3 1687.9 1129.5 2.19x 0.67x 79.5 17%
float32 256 131073 4 3206.8 3195.1 2142.2 1148.5 1.50x 0.54x 62.7 14%
float32 256 131073 8 3427.4 3420.5 2207.1 1148.0 1.55x 0.52x 60.8 13%
float32 256 131073 16 3743.5 3721.6 2263.0 1147.9 1.65x 0.51x 59.3 13%
float32 1024 128 4 100.9 102.1 100.7 22.3 1.00x 0.22x 5.7 1%
float32 1024 128 8 107.9 105.8 105.5 22.0 1.02x 0.21x 5.9 1%
float32 1024 128 16 108.2 110.0 109.3 22.2 0.99x 0.20x 6.6 1%
float32 1024 129 4 102.3 101.3 103.5 24.4 0.99x 0.24x 5.6 1%
float32 1024 129 8 108.0 108.2 105.5 24.4 1.02x 0.23x 5.9 1%
float32 1024 129 16 109.5 111.1 109.4 24.6 1.00x 0.22x 6.6 1%
float32 1024 1024 4 185.6 50.2 50.0 88.3 3.71x 1.77x 84.9 19%
float32 1024 1024 8 190.3 50.0 50.0 88.3 3.81x 1.77x 85.9 19%
float32 1024 1024 16 194.7 50.2 51.0 88.3 3.82x 1.73x 86.1 19%
float32 1024 1025 4 251.8 92.1 91.9 90.2 2.74x 0.98x 46.2 10%
float32 1024 1025 8 262.6 92.5 92.7 90.1 2.83x 0.97x 46.4 10%
float32 1024 1025 16 267.3 93.0 93.0 90.4 2.87x 0.97x 47.3 10%
float32 1024 8192 4 1000.9 230.7 231.1 200.8 4.33x 0.87x 145.4 32%
float32 1024 8192 8 1072.8 231.1 231.3 200.2 4.64x 0.87x 145.5 32%
float32 1024 8192 16 1140.4 231.5 231.4 201.7 4.93x 0.87x 145.9 32%
float32 1024 8193 4 1014.7 465.1 465.7 202.4 2.18x 0.43x 72.2 16%
float32 1024 8193 8 1076.7 465.9 465.1 201.3 2.31x 0.43x 72.4 16%
float32 1024 8193 16 1159.9 466.5 465.6 202.6 2.49x 0.44x 72.5 16%
float32 1024 131072 4 11911.6 1964.0 1965.1 4191.1 6.06x 2.13x 273.2 60%
float32 1024 131072 8 12727.1 1966.1 1968.0 4189.9 6.47x 2.13x 272.9 60%
float32 1024 131072 16 13772.9 1966.2 1966.7 4190.6 7.00x 2.13x 273.1 60%
float32 1024 131073 4 11868.0 3547.2 3547.7 4260.7 3.35x 1.20x 151.3 33%
float32 1024 131073 8 12770.6 3550.0 3550.8 4261.2 3.60x 1.20x 151.2 33%
float32 1024 131073 16 13914.8 3557.8 3560.1 4261.2 3.91x 1.20x 150.9 33%
float32 2048 128 4 170.5 170.2 171.1 30.2 1.00x 0.18x 6.7 1%
float32 2048 128 8 177.6 177.9 178.6 30.6 0.99x 0.17x 7.0 2%
float32 2048 128 16 180.7 181.4 180.1 31.2 1.00x 0.17x 8.0 2%
float32 2048 129 4 170.3 170.5 171.3 35.4 0.99x 0.21x 6.7 1%
float32 2048 129 8 176.5 176.7 177.2 35.3 1.00x 0.20x 7.1 2%
float32 2048 129 16 181.9 182.7 181.0 36.4 1.00x 0.20x 8.0 2%
float32 2048 1024 4 333.2 85.6 85.5 123.4 3.90x 1.44x 99.3 22%
float32 2048 1024 8 347.3 85.9 86.0 123.4 4.04x 1.43x 99.8 22%
float32 2048 1024 16 355.7 87.1 87.0 123.7 4.09x 1.42x 100.9 22%
float32 2048 1025 4 470.0 165.7 165.7 126.5 2.84x 0.76x 51.3 11%
float32 2048 1025 8 492.6 166.1 166.1 126.7 2.97x 0.76x 51.7 11%
float32 2048 1025 16 503.6 167.0 167.5 127.0 3.01x 0.76x 52.5 12%
float32 2048 8192 4 1972.4 442.5 442.5 421.7 4.46x 0.95x 151.9 33%
float32 2048 8192 8 2094.9 443.3 443.1 424.8 4.73x 0.96x 151.9 33%
float32 2048 8192 16 2251.3 444.0 443.8 424.0 5.07x 0.96x 152.1 33%
float32 2048 8193 4 1979.8 908.5 906.7 436.2 2.18x 0.48x 74.1 16%
float32 2048 8193 8 2127.7 907.9 909.8 437.6 2.34x 0.48x 74.0 16%
float32 2048 8193 16 2269.5 910.9 909.9 440.8 2.49x 0.48x 74.2 16%
float32 2048 131072 4 23642.3 3925.9 3925.6 8254.2 6.02x 2.10x 273.5 60%
float32 2048 131072 8 25253.3 3926.0 3928.5 8254.6 6.43x 2.10x 273.4 60%
float32 2048 131072 16 27390.4 3930.4 3925.5 8250.2 6.98x 2.10x 273.6 60%
float32 2048 131073 4 23630.0 7033.7 7035.5 8407.4 3.36x 1.19x 152.6 33%
float32 2048 131073 8 25309.8 7037.0 7033.5 8407.4 3.60x 1.20x 152.7 33%
float32 2048 131073 16 27547.6 7041.9 7036.1 8413.3 3.92x 1.20x 152.7 33%

Test methodology

  • Accuracy (432 cases): 3 dtypes x 6 batch sizes x 4 dims x 2 alignments x 3 k values = 432. CPU reference vs XPU, sort-then-compare.
  • Sortedness (324 cases): Verify that torch.topk(sorted=True) output is monotonic for both largest=True and largest=False.
  • Benchmark (432 cases): Median of 3 runs x 50 iterations each, with 20 warmup iterations. largest=True.
  • Bandwidth: (bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time. Peak B580 = 456 GB/s (192-bit x 19 Gbps GDDR6).
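
As a worked example of the bandwidth formula above, the sketch below evaluates it for one row of the full table. The function names are hypothetical; the 8 bytes per output element are taken from the formula and presumably correspond to the int64 index that topk returns alongside each value.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in GB/s: bytes read (input slice) plus bytes written
// (k values and k 8-byte indices per row), divided by elapsed time.
inline double effective_bandwidth_gbps(
    std::int64_t bs,
    std::int64_t dim,
    std::int64_t k,
    std::int64_t dtype_bytes,
    double time_us) {
  assert(time_us > 0.0);
  const double input_bytes =
      static_cast<double>(bs) * static_cast<double>(dim) * static_cast<double>(dtype_bytes);
  const double output_bytes =
      static_cast<double>(bs) * static_cast<double>(k) * static_cast<double>(dtype_bytes + 8);
  const double seconds = time_us * 1e-6;
  return (input_bytes + output_bytes) / seconds / 1e9;
}

// Share of the B580 peak quoted in the PR (456 GB/s), in percent.
inline double percent_of_peak(double gbps, double peak_gbps = 456.0) {
  return 100.0 * gbps / peak_gbps;
}
```

For the float32, bs=2048, dim=131072, k=4 row (3925.6 us), this gives about 273.5 GB/s, or roughly 60% of peak, which agrees with the BW and %peak columns for that row.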

@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch 2 times, most recently from 735b721 to e7cf066 Compare April 17, 2026 13:23
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from aa5150e to d742fa2 Compare April 17, 2026 13:43
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from e7cf066 to 6bd0fbd Compare April 17, 2026 13:44
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from d742fa2 to f926f72 Compare April 17, 2026 14:08
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from 6bd0fbd to 6f002ec Compare April 17, 2026 14:15
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from f926f72 to 659972f Compare April 17, 2026 14:21
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from 6f002ec to 12b8ed3 Compare April 17, 2026 14:21
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from 659972f to d4daf78 Compare April 17, 2026 14:37
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch 2 times, most recently from 98186e7 to 4fe12fd Compare April 17, 2026 14:48
@jianyizh jianyizh changed the title Add single workgroup topk kernel for XPU (from CUDA single-block path) Add single workgroup topk kernel for XPU (part 2 of #3369) Apr 18, 2026
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from 4fe12fd to 1e308ff Compare April 18, 2026 05:38
@jianyizh jianyizh requested a review from Copilot April 18, 2026 06:57

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an additional optimized XPU SYCL TopK kernel specialized for large last-dimension slices, and updates topk dispatch + radix conversion to improve performance and NaN behavior.

Changes:

  • Introduce a single-workgroup (1024-thread) radix-select + gather TopK kernel (unsorted output).
  • Extend sbtopk_try_launch dispatch to choose between subgroup-sorted vs single-workgroup-unsorted paths.
  • Adjust radix conversion to map NaNs to the maximum radix value for consistent selection behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.h | Declares the new single-workgroup TopK launch helper. |
| src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp | Implements the single-workgroup radix-select TopK kernel and its launcher. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.h | Updates comments to reflect multiple optimized kernel paths. |
| src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp | Adds dispatch logic for the new single-workgroup kernel alongside subgroup TopK. |
| src/ATen/native/xpu/sycl/SortingRadixSelect.h | Updates TopKTypeConfig::convert to treat NaNs as maximum radix values. |

Comment thread src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/SortingRadixSelect.h
Comment thread src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch 3 times, most recently from 47be176 to b6170a9 Compare April 21, 2026 00:57
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch 2 times, most recently from cc9071f to 195567a Compare April 26, 2026 11:30
@jianyizh jianyizh force-pushed the jianyi/subgroup-topk branch from dd8e48f to bde1bdb Compare April 26, 2026 11:30
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from 1b7acb8 to 989aad8 Compare April 27, 2026 13:09
Base automatically changed from jianyi/subgroup-topk to main April 29, 2026 07:46
Copilot AI review requested due to automatic review settings April 29, 2026 08:02
@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from 989aad8 to 46a59e2 Compare April 29, 2026 08:02
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread src/ATen/native/xpu/sycl/TensorTopKSingleWgKernel.cpp
Comment thread src/ATen/native/xpu/sycl/SortingRadixSelect.h
Comment thread src/ATen/native/xpu/sycl/TensorTopKSbtopkKernel.cpp
@jianyizh jianyizh removed the request for review from chuanqi129 May 25, 2026 13:51
@jianyizh
Contributor Author

/merge -f "UT unrelated fail"

@github-actions

❌ Not enough approvals (0/2). At least 2 approvals are required to merge.

jianyizh and others added 10 commits May 25, 2026 06:58
SYCL translation of PyTorch CUDA's single-block radix select path.
A 1024-thread workgroup processes one slice using RADIX_BITS=4 radix
select to find the k-th value, then gathers matching elements.
Output is unsorted. Best for large dim (>= 4096).

Dispatch updated: dim < 1024 -> original; k <= 16 + large batch ->
subgroup kernel; dim >= 4096 -> single workgroup kernel; else -> original.

Also fixes NaN handling in SortingRadixSelect.h for half/float/double.

432/432 accuracy tests pass, 324/324 sortedness tests pass.

Add IndexT template parameter to SbtopkGatherFunctor so that shared memory,
histogram counts, and element indices use the correct index type. Dispatch
IndexT as int when nsegments*nelements <= INT_MAX, int64_t otherwise.
Remove the nsegments <= INT_MAX guard in the caller since the kernel now
handles both cases internally.

- Add alignas(alignof(LoadT)) on all scalar_t src[VEC_SIZE] arrays used
  for vectorized loads (3 occurrences) to ensure proper alignment
- Replace magic numbers smem[64]/smem[65] with named constants
  SMEM_FOUND_FLAG / SMEM_FOUND_IDX for clarity and maintainability

…el_submit, simplify EPT with PowerOf2Floor, remove unnecessary include
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
@CuiYifeng
Contributor

/merge -f "UT unrelated fail"

@github-actions

❌ Not enough approvals (0/2). At least 2 approvals are required to merge.

@CuiYifeng
Contributor

/merge -f "UT unrelated fail"

@github-actions

❌ Not enough approvals (0/2). At least 2 approvals are required to merge.

@jianyizh
Contributor Author

/merge -f "UT unrelated fail"

@github-actions

❌ Not enough approvals (0/2). At least 2 approvals are required to merge.

@jianyizh jianyizh force-pushed the jianyi/single-wg-topk branch from ea1879a to ff99c42 Compare May 25, 2026 14:05
@jianyizh
Contributor Author

/merge -f "UT unrelated fail"

@github-actions

❌ Not enough approvals (0/2). At least 2 approvals are required to merge.

@chuanqi129 chuanqi129 merged commit 6336b4f into main May 26, 2026
21 of 23 checks passed
@chuanqi129 chuanqi129 deleted the jianyi/single-wg-topk branch May 26, 2026 02:33
@guangyey
Contributor

Hi @jianyizh, may I know how you handled the build-time issue?

@jianyizh
Contributor Author

Hi @jianyizh, may I know how you handled the build-time issue?

#3683 The kernel that has the build-time issue is this one. With k=16 the build is very slow, so I only keep k=1~8. My guess is that it is caused by the AOT target DG2: DG2 has only half as many registers as BMG, so excessive register spilling may be the cause. Still investigating.

@guangyey
Contributor

Thanks for the details.

jafraustro pushed a commit to jafraustro/torch-xpu-ops that referenced this pull request May 26, 2026
…l#3372)

## Summary

Builds on intel#3371 (subgroup topk kernel). Adds a **single workgroup topk
kernel** — SYCL translation of PyTorch CUDA's single-block radix select
path.

- **Combined (PR1+PR2) vs original XPU:** 1.5737x geomean over 432
cases, 211 wins (>1.05x), 32 regressions (<0.98x)
- **Combined vs CUDA 4080S:** 0.5274x geomean (>1 means XPU faster)
- **PR2 incremental vs PR1-only:** 1.1530x geomean, 107 additional wins
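
For readers who want to reproduce summary figures of this kind from per-case speedups, here is a minimal Python sketch. The function name `summarize_speedups` is hypothetical and the benchmark script itself is not shown in this PR. Only the thresholds (>1.05x counts as a win, <0.98x as a regression) come from the bullets above.

```python
import math


def summarize_speedups(speedups, win=1.05, regress=0.98):
    """Return (geomean, wins, regressions) for a list of speedup ratios."""
    # Geometric mean: exponential of the mean of the logs.
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    wins = sum(1 for s in speedups if s > win)
    regressions = sum(1 for s in speedups if s < regress)
    return geomean, wins, regressions
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel to exactly 1x.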

### Approach

**Single workgroup topk kernel** (`TensorTopKSingleWgKernel.cpp`): A
1024-thread workgroup processes one slice using `RADIX_BITS=4` radix
select to find the k-th value, then gathers matching elements.
Translated from PyTorch CUDA's single-block path. Output is unsorted
(caller sorts if needed). Best for large dim (>= 4096).
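
As an illustration of the radix-select idea only, the sketch below works on the host in plain Python rather than in SYCL. It assumes unsigned integer keys whose bit width is a multiple of 4, and it processes 4 bits per pass to mirror `RADIX_BITS=4`. The names `radix_select_kth_largest` and `topk_unsorted` are hypothetical. The real kernel builds the histogram cooperatively across the workgroup, which this sequential version does not attempt to model.

```python
RADIX_BITS = 4                      # digit width per pass, as in the PR
RADIX_SIZE = 1 << RADIX_BITS        # 16 histogram buckets
RADIX_MASK = RADIX_SIZE - 1


def radix_select_kth_largest(keys, k, key_bits=32):
    """Return the k-th largest key (1-based) among unsigned integer keys."""
    assert 1 <= k <= len(keys)
    desired = 0        # bits of the answer fixed so far
    desired_mask = 0   # which bit positions are already fixed
    remaining = k
    # Walk digits from most significant to least significant.
    for shift in range(key_bits - RADIX_BITS, -1, -RADIX_BITS):
        counts = [0] * RADIX_SIZE
        for key in keys:
            # Only keys matching the prefix chosen so far are candidates.
            if (key & desired_mask) == desired:
                counts[(key >> shift) & RADIX_MASK] += 1
        # Scan buckets from the largest digit down to find where k lands.
        for digit in range(RADIX_SIZE - 1, -1, -1):
            if remaining <= counts[digit]:
                desired |= digit << shift
                desired_mask |= RADIX_MASK << shift
                break
            remaining -= counts[digit]
    return desired


def topk_unsorted(keys, k, key_bits=32):
    """Gather the k largest keys; the output order is not sorted."""
    kth = radix_select_kth_largest(keys, k, key_bits)
    out = [key for key in keys if key > kth]
    # Fill the remaining slots with ties equal to the k-th value.
    out += [kth] * (k - len(out))
    return out
```

Each pass narrows the candidate set by one digit, so a 32-bit key needs 8 passes over the slice regardless of `k`. The gather step is why the output is unsorted.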

**Updated dispatch logic:**
- `dim < 1024` -> original kernel
- `k <= 16` and large batch -> subgroup kernel (PR1, SORTED)
- `dim >= 4096` -> single workgroup kernel (this PR, UNSORTED)
- otherwise -> original kernel
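
The rule order above can be sketched as a small Python function. This is not the code in `TensorTopKSbtopkKernel.cpp`: the function name is hypothetical, and because the PR does not say what "large batch" means, the threshold is left as a required parameter and the `>=` comparison is an assumption.

```python
def choose_topk_kernel(dim, k, batch, large_batch_threshold):
    """Return the kernel path for one topk call, checking rules in listed order.

    large_batch_threshold is a placeholder: the PR text does not give the value.
    """
    if dim < 1024:
        return "original"
    if k <= 16 and batch >= large_batch_threshold:
        return "subgroup"          # PR1, sorted output
    if dim >= 4096:
        return "single_workgroup"  # this PR, unsorted output
    return "original"
```

Because the subgroup rule is checked first, a call with small `k` and a large batch never reaches the single workgroup path even when `dim >= 4096`.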

Also fixes NaN handling in `SortingRadixSelect.h`
`TopKTypeConfig::convert` for half/float/double (NaN maps to max radix
value).
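
To show what "NaN maps to max radix value" means in practice, here is a float32-only Python sketch of an order-preserving float-to-key conversion. It uses the common sign-flip trick and is an assumption about the general shape of such a conversion, not a copy of `SortingRadixSelect.h`. The name `float_to_radix` is hypothetical, and half and double would use other bit widths.

```python
import struct


def float_to_radix(v):
    """Map a float32 value to a uint32 key whose unsigned order matches float order."""
    if v != v:                       # NaN compares unequal to itself
        return 0xFFFFFFFF            # NaN -> maximum radix value
    (x,) = struct.unpack("<I", struct.pack("<f", v))
    # Negative floats: flip all bits. Non-negative floats: flip only the sign bit.
    mask = 0xFFFFFFFF if (x & 0x80000000) else 0x80000000
    return x ^ mask
```

With this mapping NaN sorts above `+inf`, so a largest-k selection treats NaN as the greatest value.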

Multi-block radix select (for very large slices across multiple
workgroups) is planned as future work.

### Files changed

| File | Description |
|------|-------------|
| `TensorTopKSingleWgKernel.cpp` (new) | Single workgroup topk kernel (from CUDA single-block path) |
| `TensorTopKSingleWgKernel.h` (new) | `single_wg_topk_try_launch` declaration |
| `TensorTopKSbtopkKernel.cpp` | Add single-wg dispatch path alongside subgroup kernel |
| `TensorTopKSbtopkKernel.h` | Update comments to describe both kernel paths |
| `SortingRadixSelect.h` | Fix NaN handling in `TopKTypeConfig::convert` |

### Correctness

- **Accuracy:** 432/432 pass (CPU vs XPU, sort-then-compare)
- **Sortedness:** 324/324 pass (`torch.topk(sorted=True)` output
verified monotonic)
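
The two checks can be sketched in plain Python as below. The helper names are hypothetical and the actual test harness is not part of this PR text. The sketch assumes inputs without NaN, and it compares values only, on the assumption that indices are left out because they can legitimately differ between devices when there are ties.

```python
def sort_then_compare(ref_values, test_values):
    """Accuracy check for unsorted output: compare the values as multisets."""
    return sorted(ref_values) == sorted(test_values)


def is_monotonic(values, largest=True):
    """Sortedness check: descending for largest=True, ascending otherwise."""
    pairs = zip(values, values[1:])
    if largest:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)
```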

### Benchmark: incremental gain from this PR

Showing where single-wg kernel helps (large dim cases):

**By dim (PR2 vs PR1-only):**

| dim | PR2 vs PR1 | PR2 vs orig | PR2 vs CUDA | cases |
|-----|:-:|:-:|:-:|:-:|
| 128 | 1.00x | 1.00x | 0.37x | 54 |
| 129 | 1.00x | 1.00x | 0.39x | 54 |
| 1024 | 1.00x | 1.47x | 0.77x | 54 |
| 1025 | 1.00x | 1.35x | 0.63x | 54 |
| 8192 | 1.03x | 1.68x | 0.62x | 54 |
| 8193 | 1.01x | 1.30x | 0.49x | 54 |
| 131072 | 1.99x | 3.73x | 0.68x | 54 |
| 131073 | 1.51x | 2.31x | 0.43x | 54 |

### Full 432-case results (combined PR1+PR2)

XPU: Intel Arc B580. CUDA: NVIDIA RTX 4080 SUPER. B580 peak memory
bandwidth: 456 GB/s. Times in microseconds (us). Median of 3 runs x 50
iters.
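
The PR does not state how the BW and %peak columns are computed. The Python sketch below assumes that bytes moved are the input read plus `k` values and `k` int64 indices written per slice. That assumption reproduces the rows checked by hand, for example bfloat16, bs=1024, dim=131072, k=4 at 1762.0 us. The function names are hypothetical.

```python
PEAK_GBPS = 456.0  # B580 peak memory bandwidth quoted above


def effective_bandwidth_gbps(bs, dim, k, elem_bytes, time_us, index_bytes=8):
    """Effective bandwidth in GB/s under the assumed traffic model."""
    bytes_read = bs * dim * elem_bytes                    # one pass over the input
    bytes_written = bs * k * (elem_bytes + index_bytes)   # values + int64 indices
    return (bytes_read + bytes_written) / (time_us * 1e-6) / 1e9


def percent_of_peak(gbps, peak_gbps=PEAK_GBPS):
    """Express a bandwidth as a percentage of the quoted peak."""
    return 100.0 * gbps / peak_gbps
```

Note that a radix select reads the input more than once, so this figure counts useful bytes per call rather than actual memory traffic.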

<details>
<summary>Click to expand full table</summary>

| dtype | bs | dim | k | XPU orig (us) | XPU PR1 (us) | XPU PR1+PR2 (us) | CUDA 4080S (us) | vs orig | vs CUDA | BW (GB/s) | %peak |
|-------|---:|----:|--:|--------------:|------------:|-----------------:|----------------:|--------:|--------:|----------:|------:|
| bfloat16 | 1 | 128 | 4 | 30.6 | 30.7 | 30.6 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 8 | 30.5 | 30.4 | 30.4 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 128 | 16 | 30.4 | 30.4 | 30.5 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 4 | 30.3 | 30.6 | 30.4 | 14.7 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 8 | 30.4 | 30.5 | 30.3 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 129 | 16 | 30.4 | 30.4 | 30.4 | 14.6 | 1.00x | 0.48x | 0.0 | 0% |
| bfloat16 | 1 | 1024 | 4 | 30.5 | 30.5 | 30.4 | 19.0 | 1.00x | 0.62x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 8 | 30.5 | 30.6 | 30.4 | 18.3 | 1.00x | 0.60x | 0.1 | 0% |
| bfloat16 | 1 | 1024 | 16 | 30.4 | 30.4 | 30.5 | 18.6 | 1.00x | 0.61x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.5 | 20.0 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 8 | 30.4 | 30.5 | 30.5 | 20.2 | 1.00x | 0.66x | 0.1 | 0% |
| bfloat16 | 1 | 1025 | 16 | 30.4 | 30.5 | 30.4 | 19.8 | 1.00x | 0.65x | 0.1 | 0% |
| bfloat16 | 1 | 8192 | 4 | 45.7 | 44.4 | 42.8 | 37.4 | 1.07x | 0.87x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 8 | 51.6 | 48.6 | 42.5 | 42.2 | 1.21x | 0.99x | 0.4 | 0% |
| bfloat16 | 1 | 8192 | 16 | 48.6 | 48.6 | 42.7 | 39.1 | 1.14x | 0.92x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 4 | 45.7 | 48.4 | 45.8 | 37.0 | 1.00x | 0.81x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 8 | 48.7 | 48.6 | 45.9 | 40.3 | 1.06x | 0.88x | 0.4 | 0% |
| bfloat16 | 1 | 8193 | 16 | 48.5 | 48.5 | 47.2 | 39.7 | 1.03x | 0.84x | 0.4 | 0% |
| bfloat16 | 1 | 131072 | 4 | 368.8 | 375.7 | 102.4 | 46.3 | 3.60x | 0.45x | 2.6 | 1% |
| bfloat16 | 1 | 131072 | 8 | 396.4 | 402.5 | 105.2 | 46.3 | 3.77x | 0.44x | 2.5 | 1% |
| bfloat16 | 1 | 131072 | 16 | 430.6 | 426.2 | 111.0 | 46.4 | 3.88x | 0.42x | 2.4 | 1% |
| bfloat16 | 1 | 131073 | 4 | 370.4 | 364.3 | 168.6 | 46.8 | 2.20x | 0.28x | 1.6 | 0% |
| bfloat16 | 1 | 131073 | 8 | 392.5 | 396.7 | 202.4 | 46.8 | 1.94x | 0.23x | 1.3 | 0% |
| bfloat16 | 1 | 131073 | 16 | 413.9 | 421.3 | 184.1 | 46.7 | 2.25x | 0.25x | 1.4 | 0% |
| bfloat16 | 8 | 128 | 4 | 30.4 | 30.4 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 8 | 30.5 | 30.6 | 30.4 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 128 | 16 | 30.4 | 30.3 | 30.3 | 14.6 | 1.00x | 0.48x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 4 | 30.3 | 30.5 | 30.2 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 8 | 30.3 | 30.5 | 30.5 | 15.1 | 0.99x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 129 | 16 | 30.4 | 30.5 | 30.3 | 15.1 | 1.00x | 0.50x | 0.1 | 0% |
| bfloat16 | 8 | 1024 | 4 | 30.4 | 30.5 | 30.4 | 19.3 | 1.00x | 0.63x | 0.5 | 0% |
| bfloat16 | 8 | 1024 | 8 | 30.4 | 30.5 | 30.5 | 19.4 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1024 | 16 | 30.4 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 4 | 30.4 | 30.5 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 8 | 30.6 | 30.4 | 30.4 | 20.4 | 1.01x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 1025 | 16 | 30.4 | 30.4 | 30.5 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| bfloat16 | 8 | 8192 | 4 | 54.7 | 51.6 | 44.2 | 42.2 | 1.24x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8192 | 8 | 51.6 | 54.6 | 45.6 | 39.9 | 1.13x | 0.87x | 2.9 | 1% |
| bfloat16 | 8 | 8192 | 16 | 54.8 | 54.5 | 44.5 | 42.4 | 1.23x | 0.95x | 3.0 | 1% |
| bfloat16 | 8 | 8193 | 4 | 54.5 | 54.5 | 47.3 | 43.3 | 1.15x | 0.92x | 2.8 | 1% |
| bfloat16 | 8 | 8193 | 8 | 54.7 | 54.7 | 48.5 | 43.5 | 1.13x | 0.90x | 2.7 | 1% |
| bfloat16 | 8 | 8193 | 16 | 54.6 | 48.6 | 48.5 | 42.7 | 1.13x | 0.88x | 2.7 | 1% |
| bfloat16 | 8 | 131072 | 4 | 388.2 | 394.6 | 145.4 | 56.8 | 2.67x | 0.39x | 14.4 | 3% |
| bfloat16 | 8 | 131072 | 8 | 422.7 | 398.6 | 137.5 | 56.5 | 3.07x | 0.41x | 15.3 | 3% |
| bfloat16 | 8 | 131072 | 16 | 427.5 | 433.5 | 146.5 | 56.7 | 2.92x | 0.39x | 14.3 | 3% |
| bfloat16 | 8 | 131073 | 4 | 392.3 | 405.1 | 218.3 | 56.8 | 1.80x | 0.26x | 9.6 | 2% |
| bfloat16 | 8 | 131073 | 8 | 404.6 | 406.4 | 222.5 | 57.1 | 1.82x | 0.26x | 9.4 | 2% |
| bfloat16 | 8 | 131073 | 16 | 442.0 | 436.3 | 196.2 | 56.9 | 2.25x | 0.29x | 10.7 | 2% |
| bfloat16 | 64 | 128 | 4 | 30.5 | 30.5 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| bfloat16 | 64 | 128 | 8 | 30.5 | 30.6 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| bfloat16 | 64 | 128 | 16 | 30.6 | 30.4 | 30.2 | 14.8 | 1.01x | 0.49x | 0.9 | 0% |
| bfloat16 | 64 | 129 | 4 | 30.6 | 30.4 | 30.3 | 15.4 | 1.01x | 0.51x | 0.6 | 0% |
| bfloat16 | 64 | 129 | 8 | 30.5 | 30.4 | 30.3 | 15.5 | 1.01x | 0.51x | 0.7 | 0% |
| bfloat16 | 64 | 129 | 16 | 30.6 | 30.4 | 30.3 | 15.2 | 1.01x | 0.50x | 0.9 | 0% |
| bfloat16 | 64 | 1024 | 4 | 30.6 | 30.5 | 30.4 | 19.5 | 1.01x | 0.64x | 4.4 | 1% |
| bfloat16 | 64 | 1024 | 8 | 30.5 | 30.5 | 30.3 | 19.5 | 1.01x | 0.64x | 4.5 | 1% |
| bfloat16 | 64 | 1024 | 16 | 30.5 | 30.6 | 30.7 | 19.5 | 0.99x | 0.64x | 4.6 | 1% |
| bfloat16 | 64 | 1025 | 4 | 33.7 | 33.6 | 33.6 | 20.7 | 1.00x | 0.62x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 8 | 33.7 | 33.6 | 33.7 | 20.6 | 1.00x | 0.61x | 4.0 | 1% |
| bfloat16 | 64 | 1025 | 16 | 33.5 | 33.7 | 33.7 | 20.6 | 0.99x | 0.61x | 4.2 | 1% |
| bfloat16 | 64 | 8192 | 4 | 93.1 | 92.2 | 93.4 | 49.9 | 1.00x | 0.53x | 11.3 | 2% |
| bfloat16 | 64 | 8192 | 8 | 97.7 | 96.6 | 92.0 | 49.5 | 1.06x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8192 | 16 | 100.8 | 101.2 | 91.7 | 49.6 | 1.10x | 0.54x | 11.5 | 3% |
| bfloat16 | 64 | 8193 | 4 | 96.2 | 90.1 | 97.9 | 49.8 | 0.98x | 0.51x | 10.7 | 2% |
| bfloat16 | 64 | 8193 | 8 | 97.9 | 96.3 | 97.9 | 49.6 | 1.00x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 8193 | 16 | 100.2 | 100.3 | 97.7 | 49.7 | 1.03x | 0.51x | 10.8 | 2% |
| bfloat16 | 64 | 131072 | 4 | 901.8 | 888.7 | 304.9 | 162.9 | 2.96x | 0.53x | 55.0 | 12% |
| bfloat16 | 64 | 131072 | 8 | 939.7 | 948.2 | 308.0 | 164.6 | 3.05x | 0.53x | 54.5 | 12% |
| bfloat16 | 64 | 131072 | 16 | 999.0 | 993.3 | 301.4 | 164.4 | 3.31x | 0.55x | 55.7 | 12% |
| bfloat16 | 64 | 131073 | 4 | 902.2 | 889.0 | 449.7 | 166.8 | 2.01x | 0.37x | 37.3 | 8% |
| bfloat16 | 64 | 131073 | 8 | 944.7 | 942.0 | 464.5 | 166.8 | 2.03x | 0.36x | 36.1 | 8% |
| bfloat16 | 64 | 131073 | 16 | 1002.6 | 1000.7 | 449.2 | 165.5 | 2.23x | 0.37x | 37.4 | 8% |
| bfloat16 | 256 | 128 | 4 | 33.7 | 33.7 | 33.6 | 15.7 | 1.00x | 0.47x | 2.3 | 0% |
| bfloat16 | 256 | 128 | 8 | 33.8 | 33.6 | 33.7 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| bfloat16 | 256 | 128 | 16 | 33.6 | 33.6 | 33.6 | 15.7 | 1.00x | 0.47x | 3.2 | 1% |
| bfloat16 | 256 | 129 | 4 | 33.7 | 33.6 | 33.6 | 16.5 | 1.00x | 0.49x | 2.3 | 0% |
| bfloat16 | 256 | 129 | 8 | 33.6 | 33.6 | 33.6 | 16.3 | 1.00x | 0.49x | 2.6 | 1% |
| bfloat16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.3 | 1.00x | 0.49x | 3.2 | 1% |
| bfloat16 | 256 | 1024 | 4 | 56.3 | 56.1 | 56.2 | 41.7 | 1.00x | 0.74x | 9.5 | 2% |
| bfloat16 | 256 | 1024 | 8 | 59.0 | 58.9 | 58.9 | 42.4 | 1.00x | 0.72x | 9.2 | 2% |
| bfloat16 | 256 | 1024 | 16 | 59.3 | 59.2 | 60.1 | 42.6 | 0.99x | 0.71x | 9.4 | 2% |
| bfloat16 | 256 | 1025 | 4 | 71.1 | 72.4 | 73.4 | 45.9 | 0.97x | 0.63x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 8 | 75.1 | 74.1 | 74.8 | 46.7 | 1.00x | 0.62x | 7.3 | 2% |
| bfloat16 | 256 | 1025 | 16 | 75.4 | 75.4 | 73.8 | 47.1 | 1.02x | 0.64x | 7.7 | 2% |
| bfloat16 | 256 | 8192 | 4 | 260.0 | 263.7 | 254.6 | 75.2 | 1.02x | 0.30x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 8 | 270.4 | 269.8 | 255.6 | 75.0 | 1.06x | 0.29x | 16.5 | 4% |
| bfloat16 | 256 | 8192 | 16 | 287.6 | 290.5 | 255.0 | 75.2 | 1.13x | 0.29x | 16.6 | 4% |
| bfloat16 | 256 | 8193 | 4 | 261.0 | 268.2 | 274.2 | 75.1 | 0.95x | 0.27x | 15.3 | 3% |
| bfloat16 | 256 | 8193 | 8 | 273.3 | 273.1 | 276.5 | 75.6 | 0.99x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 8193 | 16 | 287.6 | 288.1 | 277.8 | 75.7 | 1.04x | 0.27x | 15.2 | 3% |
| bfloat16 | 256 | 131072 | 4 | 3096.6 | 3087.7 | 961.2 | 439.2 | 3.22x | 0.46x | 69.8 | 15% |
| bfloat16 | 256 | 131072 | 8 | 3283.4 | 3269.1 | 941.6 | 436.9 | 3.49x | 0.46x | 71.3 | 16% |
| bfloat16 | 256 | 131072 | 16 | 3464.5 | 3469.5 | 923.2 | 440.9 | 3.75x | 0.48x | 72.7 | 16% |
| bfloat16 | 256 | 131073 | 4 | 3085.3 | 3093.6 | 1548.8 | 441.5 | 1.99x | 0.29x | 43.3 | 10% |
| bfloat16 | 256 | 131073 | 8 | 3282.4 | 3267.2 | 1525.2 | 435.4 | 2.15x | 0.29x | 44.0 | 10% |
| bfloat16 | 256 | 131073 | 16 | 3462.5 | 3470.8 | 1495.2 | 443.1 | 2.32x | 0.30x | 44.9 | 10% |
| bfloat16 | 1024 | 128 | 4 | 70.9 | 69.5 | 70.6 | 22.1 | 1.00x | 0.31x | 4.3 | 1% |
| bfloat16 | 1024 | 128 | 8 | 75.3 | 75.2 | 75.3 | 22.0 | 1.00x | 0.29x | 4.6 | 1% |
| bfloat16 | 1024 | 128 | 16 | 76.9 | 76.7 | 76.6 | 22.3 | 1.00x | 0.29x | 5.6 | 1% |
| bfloat16 | 1024 | 129 | 4 | 70.8 | 69.6 | 69.9 | 24.4 | 1.01x | 0.35x | 4.4 | 1% |
| bfloat16 | 1024 | 129 | 8 | 75.4 | 75.2 | 75.1 | 24.4 | 1.00x | 0.32x | 4.6 | 1% |
| bfloat16 | 1024 | 129 | 16 | 76.8 | 76.7 | 76.6 | 24.5 | 1.00x | 0.32x | 5.6 | 1% |
| bfloat16 | 1024 | 1024 | 4 | 152.6 | 56.2 | 56.0 | 63.1 | 2.73x | 1.13x | 38.2 | 8% |
| bfloat16 | 1024 | 1024 | 8 | 156.0 | 56.2 | 55.9 | 63.3 | 2.79x | 1.13x | 39.0 | 9% |
| bfloat16 | 1024 | 1024 | 16 | 157.2 | 57.5 | 57.4 | 63.4 | 2.74x | 1.10x | 39.4 | 9% |
| bfloat16 | 1024 | 1025 | 4 | 218.4 | 86.0 | 86.9 | 64.5 | 2.51x | 0.74x | 24.6 | 5% |
| bfloat16 | 1024 | 1025 | 8 | 223.7 | 86.8 | 87.0 | 64.7 | 2.57x | 0.74x | 25.1 | 5% |
| bfloat16 | 1024 | 1025 | 16 | 225.8 | 87.3 | 87.1 | 64.8 | 2.59x | 0.74x | 26.0 | 6% |
| bfloat16 | 1024 | 8192 | 4 | 939.4 | 248.0 | 259.0 | 147.6 | 3.63x | 0.57x | 64.9 | 14% |
| bfloat16 | 1024 | 8192 | 8 | 985.8 | 249.3 | 258.9 | 147.4 | 3.81x | 0.57x | 65.1 | 14% |
| bfloat16 | 1024 | 8192 | 16 | 1036.1 | 251.2 | 260.7 | 148.0 | 3.97x | 0.57x | 65.0 | 14% |
| bfloat16 | 1024 | 8193 | 4 | 941.7 | 406.6 | 421.8 | 149.2 | 2.23x | 0.35x | 39.9 | 9% |
| bfloat16 | 1024 | 8193 | 8 | 988.2 | 407.0 | 417.2 | 148.4 | 2.37x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 8193 | 16 | 1040.8 | 406.8 | 419.0 | 149.3 | 2.48x | 0.36x | 40.4 | 9% |
| bfloat16 | 1024 | 131072 | 4 | 11500.2 | 1762.5 | 1762.0 | 1865.9 | 6.53x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131072 | 8 | 12192.8 | 1762.8 | 1764.9 | 1867.4 | 6.91x | 1.06x | 152.1 | 33% |
| bfloat16 | 1024 | 131072 | 16 | 12859.4 | 1767.0 | 1762.5 | 1863.0 | 7.30x | 1.06x | 152.4 | 33% |
| bfloat16 | 1024 | 131073 | 4 | 11514.6 | 2998.5 | 2996.9 | 1940.1 | 3.84x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 8 | 12173.3 | 2998.4 | 2997.4 | 1936.8 | 4.06x | 0.65x | 89.6 | 20% |
| bfloat16 | 1024 | 131073 | 16 | 12856.9 | 3002.4 | 2997.6 | 1944.4 | 4.29x | 0.65x | 89.6 | 20% |
| bfloat16 | 2048 | 128 | 4 | 113.9 | 113.8 | 113.5 | 30.5 | 1.00x | 0.27x | 5.3 | 1% |
| bfloat16 | 2048 | 128 | 8 | 120.3 | 119.9 | 119.7 | 30.5 | 1.01x | 0.25x | 5.7 | 1% |
| bfloat16 | 2048 | 128 | 16 | 122.9 | 122.9 | 123.3 | 30.9 | 1.00x | 0.25x | 6.9 | 2% |
| bfloat16 | 2048 | 129 | 4 | 113.8 | 114.0 | 113.7 | 35.4 | 1.00x | 0.31x | 5.4 | 1% |
| bfloat16 | 2048 | 129 | 8 | 120.1 | 120.1 | 120.1 | 35.2 | 1.00x | 0.29x | 5.8 | 1% |
| bfloat16 | 2048 | 129 | 16 | 123.2 | 123.1 | 123.7 | 35.7 | 1.00x | 0.29x | 6.9 | 2% |
| bfloat16 | 2048 | 1024 | 4 | 276.3 | 96.4 | 97.2 | 85.7 | 2.84x | 0.88x | 44.0 | 10% |
| bfloat16 | 2048 | 1024 | 8 | 284.8 | 97.5 | 97.6 | 86.0 | 2.92x | 0.88x | 44.7 | 10% |
| bfloat16 | 2048 | 1024 | 16 | 286.1 | 99.3 | 99.3 | 86.4 | 2.88x | 0.87x | 45.5 | 10% |
| bfloat16 | 2048 | 1025 | 4 | 407.9 | 158.2 | 158.2 | 88.4 | 2.58x | 0.56x | 27.1 | 6% |
| bfloat16 | 2048 | 1025 | 8 | 423.7 | 158.8 | 159.0 | 88.7 | 2.66x | 0.56x | 27.4 | 6% |
| bfloat16 | 2048 | 1025 | 16 | 428.3 | 160.0 | 159.9 | 89.0 | 2.68x | 0.56x | 28.3 | 6% |
| bfloat16 | 2048 | 8192 | 4 | 1875.1 | 496.1 | 497.7 | 234.9 | 3.77x | 0.47x | 67.6 | 15% |
| bfloat16 | 2048 | 8192 | 8 | 1956.5 | 497.2 | 498.0 | 234.1 | 3.93x | 0.47x | 67.7 | 15% |
| bfloat16 | 2048 | 8192 | 16 | 2058.5 | 498.7 | 499.5 | 235.0 | 4.12x | 0.47x | 67.8 | 15% |
| bfloat16 | 2048 | 8193 | 4 | 1873.4 | 825.1 | 822.9 | 236.2 | 2.28x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 8 | 1959.0 | 824.1 | 823.8 | 237.3 | 2.38x | 0.29x | 40.9 | 9% |
| bfloat16 | 2048 | 8193 | 16 | 2065.1 | 825.7 | 825.2 | 237.4 | 2.50x | 0.29x | 41.1 | 9% |
| bfloat16 | 2048 | 131072 | 4 | 22903.6 | 3485.4 | 3486.6 | 3646.5 | 6.57x | 1.05x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 8 | 24193.6 | 3484.6 | 3488.3 | 3644.1 | 6.94x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131072 | 16 | 25590.8 | 3487.7 | 3489.4 | 3646.2 | 7.33x | 1.04x | 154.0 | 34% |
| bfloat16 | 2048 | 131073 | 4 | 22872.9 | 5925.0 | 5928.1 | 3774.7 | 3.86x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 8 | 24187.7 | 5933.4 | 5929.8 | 3780.1 | 4.08x | 0.64x | 90.6 | 20% |
| bfloat16 | 2048 | 131073 | 16 | 25604.8 | 5934.5 | 5926.6 | 3773.0 | 4.32x | 0.64x | 90.6 | 20% |
| float16 | 1 | 128 | 4 | 30.7 | 30.7 | 30.6 | 14.3 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 128 | 8 | 30.6 | 30.6 | 30.5 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 128 | 16 | 30.5 | 30.5 | 30.6 | 14.0 | 1.00x | 0.46x | 0.0 | 0% |
| float16 | 1 | 129 | 4 | 30.6 | 30.6 | 30.7 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 8 | 30.6 | 30.3 | 30.5 | 14.4 | 1.00x | 0.47x | 0.0 | 0% |
| float16 | 1 | 129 | 16 | 30.5 | 30.4 | 30.7 | 14.7 | 0.99x | 0.48x | 0.0 | 0% |
| float16 | 1 | 1024 | 4 | 30.6 | 30.7 | 30.8 | 17.4 | 0.99x | 0.56x | 0.1 | 0% |
| float16 | 1 | 1024 | 8 | 30.5 | 30.5 | 30.8 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1024 | 16 | 30.4 | 30.5 | 30.7 | 17.5 | 0.99x | 0.57x | 0.1 | 0% |
| float16 | 1 | 1025 | 4 | 30.5 | 30.5 | 30.7 | 17.8 | 0.99x | 0.58x | 0.1 | 0% |
| float16 | 1 | 1025 | 8 | 30.4 | 30.4 | 30.7 | 18.6 | 0.99x | 0.61x | 0.1 | 0% |
| float16 | 1 | 1025 | 16 | 30.4 | 30.3 | 30.7 | 20.1 | 0.99x | 0.65x | 0.1 | 0% |
| float16 | 1 | 8192 | 4 | 41.4 | 38.2 | 38.5 | 33.6 | 1.08x | 0.87x | 0.4 | 0% |
| float16 | 1 | 8192 | 8 | 41.2 | 48.4 | 42.9 | 33.8 | 0.96x | 0.79x | 0.4 | 0% |
| float16 | 1 | 8192 | 16 | 45.6 | 48.4 | 38.3 | 31.5 | 1.19x | 0.82x | 0.4 | 0% |
| float16 | 1 | 8193 | 4 | 45.6 | 41.0 | 44.5 | 37.4 | 1.02x | 0.84x | 0.4 | 0% |
| float16 | 1 | 8193 | 8 | 42.6 | 44.1 | 40.0 | 36.9 | 1.06x | 0.92x | 0.4 | 0% |
| float16 | 1 | 8193 | 16 | 45.6 | 51.3 | 46.0 | 33.3 | 0.99x | 0.72x | 0.4 | 0% |
| float16 | 1 | 131072 | 4 | 297.2 | 304.4 | 126.4 | 46.2 | 2.35x | 0.37x | 2.1 | 0% |
| float16 | 1 | 131072 | 8 | 326.6 | 335.1 | 99.5 | 46.5 | 3.28x | 0.47x | 2.6 | 1% |
| float16 | 1 | 131072 | 16 | 348.1 | 355.4 | 132.9 | 46.1 | 2.62x | 0.35x | 2.0 | 0% |
| float16 | 1 | 131073 | 4 | 308.7 | 286.0 | 198.8 | 46.9 | 1.55x | 0.24x | 1.3 | 0% |
| float16 | 1 | 131073 | 8 | 321.3 | 325.3 | 188.1 | 46.8 | 1.71x | 0.25x | 1.4 | 0% |
| float16 | 1 | 131073 | 16 | 353.2 | 378.6 | 185.2 | 46.6 | 1.91x | 0.25x | 1.4 | 0% |
| float16 | 8 | 128 | 4 | 30.5 | 30.2 | 30.4 | 14.4 | 1.00x | 0.47x | 0.1 | 0% |
| float16 | 8 | 128 | 8 | 30.4 | 30.2 | 30.3 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 128 | 16 | 30.4 | 30.4 | 30.4 | 14.5 | 1.00x | 0.48x | 0.1 | 0% |
| float16 | 8 | 129 | 4 | 30.5 | 30.2 | 30.2 | 14.8 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 8 | 30.3 | 30.2 | 30.3 | 14.9 | 1.00x | 0.49x | 0.1 | 0% |
| float16 | 8 | 129 | 16 | 30.5 | 30.4 | 30.3 | 14.9 | 1.01x | 0.49x | 0.1 | 0% |
| float16 | 8 | 1024 | 4 | 30.6 | 30.4 | 30.3 | 19.1 | 1.01x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 8 | 30.5 | 30.4 | 30.4 | 19.2 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1024 | 16 | 30.4 | 30.3 | 30.4 | 19.3 | 1.00x | 0.63x | 0.6 | 0% |
| float16 | 8 | 1025 | 4 | 30.5 | 30.4 | 30.4 | 19.5 | 1.00x | 0.64x | 0.6 | 0% |
| float16 | 8 | 1025 | 8 | 30.5 | 30.3 | 30.4 | 20.4 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 1025 | 16 | 30.5 | 30.3 | 30.4 | 20.5 | 1.00x | 0.67x | 0.6 | 0% |
| float16 | 8 | 8192 | 4 | 45.6 | 45.5 | 42.7 | 37.9 | 1.07x | 0.89x | 3.1 | 1% |
| float16 | 8 | 8192 | 8 | 48.4 | 48.5 | 44.0 | 39.8 | 1.10x | 0.90x | 3.0 | 1% |
| float16 | 8 | 8192 | 16 | 48.5 | 51.5 | 44.1 | 41.7 | 1.10x | 0.95x | 3.0 | 1% |
| float16 | 8 | 8193 | 4 | 48.5 | 45.5 | 47.3 | 39.2 | 1.03x | 0.83x | 2.8 | 1% |
| float16 | 8 | 8193 | 8 | 45.6 | 48.6 | 47.0 | 40.7 | 0.97x | 0.87x | 2.8 | 1% |
| float16 | 8 | 8193 | 16 | 54.5 | 51.7 | 45.7 | 43.0 | 1.19x | 0.94x | 2.9 | 1% |
| float16 | 8 | 131072 | 4 | 309.9 | 334.0 | 137.7 | 56.0 | 2.25x | 0.41x | 15.2 | 3% |
| float16 | 8 | 131072 | 8 | 338.1 | 356.0 | 125.9 | 56.1 | 2.69x | 0.45x | 16.7 | 4% |
| float16 | 8 | 131072 | 16 | 393.3 | 387.7 | 132.6 | 56.3 | 2.97x | 0.42x | 15.8 | 3% |
| float16 | 8 | 131073 | 4 | 314.9 | 313.8 | 208.8 | 56.2 | 1.51x | 0.27x | 10.0 | 2% |
| float16 | 8 | 131073 | 8 | 341.7 | 344.2 | 200.6 | 56.3 | 1.70x | 0.28x | 10.5 | 2% |
| float16 | 8 | 131073 | 16 | 366.4 | 378.0 | 200.1 | 56.3 | 1.83x | 0.28x | 10.5 | 2% |
| float16 | 64 | 128 | 4 | 30.5 | 30.1 | 30.3 | 14.9 | 1.01x | 0.49x | 0.6 | 0% |
| float16 | 64 | 128 | 8 | 30.5 | 30.2 | 30.3 | 14.7 | 1.01x | 0.49x | 0.7 | 0% |
| float16 | 64 | 128 | 16 | 30.4 | 30.2 | 30.1 | 14.7 | 1.01x | 0.49x | 0.9 | 0% |
| float16 | 64 | 129 | 4 | 30.6 | 30.2 | 30.3 | 15.3 | 1.01x | 0.50x | 0.6 | 0% |
| float16 | 64 | 129 | 8 | 30.6 | 30.2 | 30.4 | 15.2 | 1.01x | 0.50x | 0.7 | 0% |
| float16 | 64 | 129 | 16 | 30.5 | 30.2 | 30.4 | 15.1 | 1.00x | 0.50x | 0.9 | 0% |
| float16 | 64 | 1024 | 4 | 30.4 | 30.4 | 30.3 | 19.2 | 1.00x | 0.63x | 4.4 | 1% |
| float16 | 64 | 1024 | 8 | 30.4 | 30.4 | 30.4 | 19.3 | 1.00x | 0.63x | 4.5 | 1% |
| float16 | 64 | 1024 | 16 | 30.4 | 30.3 | 30.5 | 19.4 | 1.00x | 0.64x | 4.6 | 1% |
| float16 | 64 | 1025 | 4 | 32.2 | 32.0 | 33.0 | 19.7 | 0.98x | 0.60x | 4.1 | 1% |
| float16 | 64 | 1025 | 8 | 32.1 | 32.1 | 32.3 | 20.4 | 0.99x | 0.63x | 4.2 | 1% |
| float16 | 64 | 1025 | 16 | 33.6 | 33.6 | 33.6 | 20.4 | 1.00x | 0.61x | 4.2 | 1% |
| float16 | 64 | 8192 | 4 | 81.3 | 84.2 | 83.0 | 49.4 | 0.98x | 0.60x | 12.7 | 3% |
| float16 | 64 | 8192 | 8 | 83.0 | 84.2 | 83.0 | 49.2 | 1.00x | 0.59x | 12.7 | 3% |
| float16 | 64 | 8192 | 16 | 88.7 | 90.4 | 89.1 | 49.2 | 1.00x | 0.55x | 11.9 | 3% |
| float16 | 64 | 8193 | 4 | 81.3 | 80.1 | 85.8 | 49.4 | 0.95x | 0.58x | 12.3 | 3% |
| float16 | 64 | 8193 | 8 | 87.2 | 84.0 | 88.8 | 49.4 | 0.98x | 0.56x | 11.9 | 3% |
| float16 | 64 | 8193 | 16 | 90.2 | 88.8 | 91.7 | 49.4 | 0.98x | 0.54x | 11.5 | 3% |
| float16 | 64 | 131072 | 4 | 752.0 | 723.7 | 285.8 | 162.1 | 2.63x | 0.57x | 58.7 | 13% |
| float16 | 64 | 131072 | 8 | 788.0 | 782.2 | 290.4 | 160.5 | 2.71x | 0.55x | 57.8 | 13% |
| float16 | 64 | 131072 | 16 | 853.1 | 866.5 | 282.4 | 162.4 | 3.02x | 0.58x | 59.4 | 13% |
| float16 | 64 | 131073 | 4 | 712.3 | 709.2 | 440.0 | 161.6 | 1.62x | 0.37x | 38.1 | 8% |
| float16 | 64 | 131073 | 8 | 784.4 | 775.9 | 409.9 | 163.9 | 1.91x | 0.40x | 40.9 | 9% |
| float16 | 64 | 131073 | 16 | 866.1 | 857.3 | 433.5 | 162.9 | 2.00x | 0.38x | 38.7 | 8% |
| float16 | 256 | 128 | 4 | 33.7 | 33.6 | 33.5 | 15.5 | 1.01x | 0.46x | 2.3 | 0% |
| float16 | 256 | 128 | 8 | 33.7 | 33.6 | 33.6 | 15.6 | 1.00x | 0.46x | 2.6 | 1% |
| float16 | 256 | 128 | 16 | 33.7 | 33.6 | 33.5 | 15.6 | 1.01x | 0.47x | 3.2 | 1% |
| float16 | 256 | 129 | 4 | 33.7 | 33.5 | 33.5 | 16.0 | 1.01x | 0.48x | 2.3 | 0% |
| float16 | 256 | 129 | 8 | 33.7 | 33.5 | 33.6 | 15.9 | 1.00x | 0.47x | 2.6 | 1% |
| float16 | 256 | 129 | 16 | 33.6 | 33.5 | 33.5 | 16.1 | 1.00x | 0.48x | 3.2 | 1% |
| float16 | 256 | 1024 | 4 | 50.6 | 50.8 | 50.1 | 37.9 | 1.01x | 0.76x | 10.7 | 2% |
| float16 | 256 | 1024 | 8 | 53.1 | 53.0 | 52.8 | 38.8 | 1.01x | 0.73x | 10.3 | 2% |
| float16 | 256 | 1024 | 16 | 55.0 | 56.0 | 55.7 | 39.9 | 0.99x | 0.72x | 10.1 | 2% |
| float16 | 256 | 1025 | 4 | 63.5 | 63.5 | 63.4 | 42.0 | 1.00x | 0.66x | 8.4 | 2% |
| float16 | 256 | 1025 | 8 | 64.6 | 66.3 | 66.4 | 43.1 | 0.97x | 0.65x | 8.2 | 2% |
| float16 | 256 | 1025 | 16 | 69.5 | 67.9 | 68.2 | 43.8 | 1.02x | 0.64x | 8.3 | 2% |
| float16 | 256 | 8192 | 4 | 219.8 | 221.4 | 218.2 | 74.1 | 1.01x | 0.34x | 19.3 | 4% |
| float16 | 256 | 8192 | 8 | 233.9 | 234.1 | 226.5 | 74.4 | 1.03x | 0.33x | 18.6 | 4% |
| float16 | 256 | 8192 | 16 | 248.0 | 250.8 | 237.1 | 74.7 | 1.05x | 0.32x | 17.9 | 4% |
| float16 | 256 | 8193 | 4 | 217.9 | 220.0 | 236.7 | 74.3 | 0.92x | 0.31x | 17.8 | 4% |
| float16 | 256 | 8193 | 8 | 235.5 | 232.7 | 246.1 | 74.8 | 0.96x | 0.30x | 17.1 | 4% |
| float16 | 256 | 8193 | 16 | 252.1 | 257.4 | 257.6 | 74.9 | 0.98x | 0.29x | 16.4 | 4% |
| float16 | 256 | 131072 | 4 | 2409.4 | 2421.9 | 880.3 | 428.9 | 2.74x | 0.49x | 76.2 | 17% |
| float16 | 256 | 131072 | 8 | 2673.7 | 2662.8 | 887.3 | 427.9 | 3.01x | 0.48x | 75.7 | 17% |
| float16 | 256 | 131072 | 16 | 2935.0 | 2934.9 | 898.3 | 428.2 | 3.27x | 0.48x | 74.8 | 16% |
| float16 | 256 | 131073 | 4 | 2405.3 | 2442.5 | 1408.4 | 431.9 | 1.71x | 0.31x | 47.7 | 10% |
| float16 | 256 | 131073 | 8 | 2662.4 | 2677.0 | 1434.5 | 429.8 | 1.86x | 0.30x | 46.8 | 10% |
| float16 | 256 | 131073 | 16 | 2941.0 | 2949.7 | 1471.8 | 432.2 | 2.00x | 0.29x | 45.6 | 10% |
| float16 | 1024 | 128 | 4 | 67.6 | 67.6 | 66.6 | 20.9 | 1.02x | 0.31x | 4.6 | 1% |
| float16 | 1024 | 128 | 8 | 70.7 | 69.7 | 70.6 | 20.9 | 1.00x | 0.30x | 4.9 | 1% |
| float16 | 1024 | 128 | 16 | 71.4 | 71.4 | 71.7 | 21.4 | 1.00x | 0.30x | 5.9 | 1% |
| float16 | 1024 | 129 | 4 | 66.5 | 66.6 | 67.6 | 23.3 | 0.98x | 0.34x | 4.5 | 1% |
| float16 | 1024 | 129 | 8 | 70.8 | 70.1 | 70.5 | 23.1 | 1.00x | 0.33x | 4.9 | 1% |
| float16 | 1024 | 129 | 16 | 71.2 | 72.4 | 71.2 | 23.4 | 1.00x | 0.33x | 6.0 | 1% |
| float16 | 1024 | 1024 | 4 | 132.5 | 48.4 | 48.5 | 62.7 | 2.73x | 1.29x | 44.1 | 10% |
| float16 | 1024 | 1024 | 8 | 136.5 | 48.7 | 48.4 | 63.0 | 2.82x | 1.30x | 45.0 | 10% |
| float16 | 1024 | 1024 | 16 | 143.6 | 49.7 | 49.8 | 63.1 | 2.88x | 1.27x | 45.4 | 10% |
| float16 | 1024 | 1025 | 4 | 185.3 | 97.8 | 97.5 | 64.2 | 1.90x | 0.66x | 22.0 | 5% |
| float16 | 1024 | 1025 | 8 | 192.7 | 97.7 | 97.8 | 64.4 | 1.97x | 0.66x | 22.3 | 5% |
| float16 | 1024 | 1025 | 16 | 206.3 | 99.0 | 98.9 | 64.5 | 2.09x | 0.65x | 22.9 | 5% |
| float16 | 1024 | 8192 | 4 | 793.1 | 198.8 | 207.6 | 145.0 | 3.82x | 0.70x | 81.0 | 18% |
| float16 | 1024 | 8192 | 8 | 840.3 | 199.1 | 209.4 | 144.6 | 4.01x | 0.69x | 80.5 | 18% |
| float16 | 1024 | 8192 | 16 | 907.4 | 201.8 | 211.9 | 145.5 | 4.28x | 0.69x | 79.9 | 18% |
| float16 | 1024 | 8193 | 4 | 799.0 | 456.2 | 466.4 | 146.1 | 1.71x | 0.31x | 36.1 | 8% |
| float16 | 1024 | 8193 | 8 | 838.6 | 457.3 | 468.8 | 146.5 | 1.79x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 8193 | 16 | 912.3 | 459.8 | 470.6 | 146.2 | 1.94x | 0.31x | 36.0 | 8% |
| float16 | 1024 | 131072 | 4 | 9033.3 | 1535.9 | 1539.0 | 1846.9 | 5.87x | 1.20x | 174.4 | 38% |
| float16 | 1024 | 131072 | 8 | 9885.6 | 1542.6 | 1539.7 | 1856.1 | 6.42x | 1.21x | 174.4 | 38% |
| float16 | 1024 | 131072 | 16 | 10870.4 | 1538.7 | 1544.1 | 1858.5 | 7.04x | 1.20x | 174.0 | 38% |
| float16 | 1024 | 131073 | 4 | 9011.7 | 3193.9 | 3188.8 | 1924.0 | 2.83x | 0.60x | 84.2 | 18% |
| float16 | 1024 | 131073 | 8 | 9922.9 | 3185.2 | 3196.3 | 1921.5 | 3.10x | 0.60x | 84.0 | 18% |
| float16 | 1024 | 131073 | 16 | 10905.6 | 3186.0 | 3216.1 | 1926.4 | 3.39x | 0.60x | 83.5 | 18% |
| float16 | 2048 | 128 | 4 | 106.8 | 107.8 | 106.5 | 28.3 | 1.00x | 0.27x | 5.7 | 1% |
| float16 | 2048 | 128 | 8 | 112.6 | 112.5 | 112.4 | 28.5 | 1.00x | 0.25x | 6.1 | 1% |
| float16 | 2048 | 128 | 16 | 115.6 | 114.5 | 115.4 | 29.2 | 1.00x | 0.25x | 7.4 | 2% |
| float16 | 2048 | 129 | 4 | 106.9 | 108.1 | 107.7 | 32.6 | 0.99x | 0.30x | 5.7 | 1% |
| float16 | 2048 | 129 | 8 | 112.5 | 112.4 | 112.3 | 32.7 | 1.00x | 0.29x | 6.2 | 1% |
| float16 | 2048 | 129 | 16 | 115.9 | 115.4 | 115.3 | 33.5 | 1.01x | 0.29x | 7.4 | 2% |
| float16 | 2048 | 1024 | 4 | 236.3 | 81.3 | 81.3 | 85.1 | 2.91x | 1.05x | 52.6 | 12% |
| float16 | 2048 | 1024 | 8 | 246.7 | 82.8 | 82.8 | 85.7 | 2.98x | 1.04x | 52.6 | 12% |
| float16 | 2048 | 1024 | 16 | 259.7 | 84.4 | 84.2 | 86.0 | 3.08x | 1.02x | 53.7 | 12% |
| float16 | 2048 | 1025 | 4 | 345.5 | 179.5 | 180.5 | 87.7 | 1.91x | 0.49x | 23.7 | 5% |
| float16 | 2048 | 1025 | 8 | 358.4 | 180.9 | 180.8 | 88.0 | 1.98x | 0.49x | 24.1 | 5% |
| float16 | 2048 | 1025 | 16 | 380.3 | 182.2 | 182.2 | 88.5 | 2.09x | 0.49x | 24.8 | 5% |
| float16 | 2048 | 8192 | 4 | 1572.3 | 399.3 | 399.8 | 228.7 | 3.93x | 0.57x | 84.1 | 18% |
| float16 | 2048 | 8192 | 8 | 1662.5 | 400.0 | 400.3 | 228.5 | 4.15x | 0.57x | 84.2 | 18% |
| float16 | 2048 | 8192 | 16 | 1808.5 | 401.1 | 402.1 | 230.5 | 4.50x | 0.57x | 84.3 | 18% |
| float16 | 2048 | 8193 | 4 | 1573.6 | 924.3 | 926.2 | 231.7 | 1.70x | 0.25x | 36.3 | 8% |
| float16 | 2048 | 8193 | 8 | 1672.3 | 926.3 | 926.2 | 231.6 | 1.81x | 0.25x | 36.4 | 8% |
| float16 | 2048 | 8193 | 16 | 1813.4 | 931.1 | 929.0 | 233.1 | 1.95x | 0.25x | 36.5 | 8% |
| float16 | 2048 | 131072 | 4 | 17900.0 | 3035.1 | 3031.5 | 3622.2 | 5.90x | 1.19x | 177.1 | 39% |
| float16 | 2048 | 131072 | 8 | 19669.5 | 3028.6 | 3027.0 | 3607.3 | 6.50x | 1.19x | 177.4 | 39% |
| float16 | 2048 | 131072 | 16 | 21602.8 | 3043.9 | 3043.3 | 3607.4 | 7.10x | 1.19x | 176.5 | 39% |
| float16 | 2048 | 131073 | 4 | 17893.0 | 6305.2 | 6308.6 | 3743.3 | 2.84x | 0.59x | 85.1 | 19% |
| float16 | 2048 | 131073 | 8 | 19693.7 | 6309.6 | 6303.1 | 3747.1 | 3.12x | 0.59x | 85.2 | 19% |
| float16 | 2048 | 131073 | 16 | 21604.8 | 6307.9 | 6309.5 | 3749.5 |
3.42x | 0.59x | 85.1 | 19% |
| float32 | 1 | 128 | 4 | 31.2 | 31.4 | 37.1 | 14.5 | 0.84x | 0.39x | 0.0 | 0% |
| float32 | 1 | 128 | 8 | 34.0 | 34.4 | 34.1 | 14.3 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 128 | 16 | 32.4 | 34.4 | 32.2 | 14.0 | 1.01x | 0.43x | 0.0 | 0% |
| float32 | 1 | 129 | 4 | 34.1 | 34.4 | 35.5 | 14.4 | 0.96x | 0.41x | 0.0 | 0% |
| float32 | 1 | 129 | 8 | 34.0 | 32.7 | 33.9 | 14.4 | 1.00x | 0.42x | 0.0 | 0% |
| float32 | 1 | 129 | 16 | 34.1 | 34.3 | 32.0 | 15.2 | 1.07x | 0.47x | 0.0 | 0% |
| float32 | 1 | 1024 | 4 | 35.3 | 32.7 | 35.4 | 17.8 | 1.00x | 0.50x | 0.1 | 0% |
| float32 | 1 | 1024 | 8 | 35.3 | 35.8 | 35.3 | 22.2 | 1.00x | 0.63x | 0.1 | 0% |
| float32 | 1 | 1024 | 16 | 35.3 | 35.7 | 35.5 | 19.1 | 0.99x | 0.54x | 0.1 | 0% |
| float32 | 1 | 1025 | 4 | 35.3 | 35.9 | 33.7 | 18.8 | 1.05x | 0.56x | 0.1 | 0% |
| float32 | 1 | 1025 | 8 | 38.5 | 35.8 | 35.6 | 19.7 | 1.08x | 0.55x | 0.1 | 0% |
| float32 | 1 | 1025 | 16 | 35.2 | 35.7 | 33.7 | 19.6 | 1.04x | 0.58x | 0.1 | 0% |
| float32 | 1 | 8192 | 4 | 54.6 | 51.1 | 52.0 | 39.6 | 1.05x | 0.76x | 0.6 | 0% |
| float32 | 1 | 8192 | 8 | 63.6 | 55.0 | 50.5 | 38.0 | 1.26x | 0.75x | 0.7 | 0% |
| float32 | 1 | 8192 | 16 | 54.6 | 58.0 | 55.1 | 38.7 | 0.99x | 0.70x | 0.6 | 0% |
| float32 | 1 | 8193 | 4 | 51.5 | 52.0 | 53.4 | 34.1 | 0.96x | 0.64x | 0.6 | 0% |
| float32 | 1 | 8193 | 8 | 56.5 | 54.9 | 53.5 | 41.6 | 1.06x | 0.78x | 0.6 | 0% |
| float32 | 1 | 8193 | 16 | 60.6 | 58.0 | 52.1 | 39.8 | 1.16x | 0.76x | 0.6 | 0% |
| float32 | 1 | 131072 | 4 | 410.5 | 393.7 | 155.8 | 63.3 | 2.63x | 0.41x | 3.4 | 1% |
| float32 | 1 | 131072 | 8 | 412.3 | 398.5 | 130.7 | 63.3 | 3.15x | 0.48x | 4.0 | 1% |
| float32 | 1 | 131072 | 16 | 423.5 | 467.2 | 148.8 | 63.3 | 2.85x | 0.43x | 3.5 | 1% |
| float32 | 1 | 131073 | 4 | 406.7 | 389.3 | 172.4 | 64.0 | 2.36x | 0.37x | 3.0 | 1% |
| float32 | 1 | 131073 | 8 | 425.0 | 417.1 | 189.1 | 64.0 | 2.25x | 0.34x | 2.8 | 1% |
| float32 | 1 | 131073 | 16 | 435.0 | 430.7 | 240.7 | 63.9 | 1.81x | 0.27x | 2.2 | 0% |
| float32 | 8 | 128 | 4 | 33.8 | 37.2 | 33.8 | 14.7 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 128 | 8 | 35.0 | 34.1 | 35.1 | 14.3 | 1.00x | 0.41x | 0.1 | 0% |
| float32 | 8 | 128 | 16 | 35.6 | 37.2 | 36.7 | 15.2 | 0.97x | 0.41x | 0.2 | 0% |
| float32 | 8 | 129 | 4 | 35.2 | 36.0 | 35.2 | 15.0 | 1.00x | 0.43x | 0.1 | 0% |
| float32 | 8 | 129 | 8 | 36.8 | 34.1 | 33.8 | 15.0 | 1.09x | 0.44x | 0.1 | 0% |
| float32 | 8 | 129 | 16 | 35.3 | 35.5 | 36.8 | 15.3 | 0.96x | 0.42x | 0.2 | 0% |
| float32 | 8 | 1024 | 4 | 39.8 | 35.6 | 37.9 | 20.9 | 1.05x | 0.55x | 0.9 | 0% |
| float32 | 8 | 1024 | 8 | 38.2 | 35.6 | 35.2 | 19.7 | 1.09x | 0.56x | 1.0 | 0% |
| float32 | 8 | 1024 | 16 | 38.3 | 40.2 | 38.2 | 19.7 | 1.00x | 0.52x | 0.9 | 0% |
| float32 | 8 | 1025 | 4 | 38.3 | 35.7 | 38.3 | 20.6 | 1.00x | 0.54x | 0.9 | 0% |
| float32 | 8 | 1025 | 8 | 38.4 | 38.7 | 38.3 | 21.4 | 1.00x | 0.56x | 0.9 | 0% |
| float32 | 8 | 1025 | 16 | 41.2 | 36.9 | 39.5 | 22.0 | 1.04x | 0.56x | 0.9 | 0% |
| float32 | 8 | 8192 | 4 | 57.5 | 62.6 | 56.1 | 41.0 | 1.02x | 0.73x | 4.7 | 1% |
| float32 | 8 | 8192 | 8 | 60.6 | 55.2 | 60.8 | 42.6 | 1.00x | 0.70x | 4.3 | 1% |
| float32 | 8 | 8192 | 16 | 66.7 | 61.1 | 56.3 | 44.7 | 1.18x | 0.79x | 4.7 | 1% |
| float32 | 8 | 8193 | 4 | 54.6 | 64.0 | 57.7 | 43.0 | 0.95x | 0.75x | 4.6 | 1% |
| float32 | 8 | 8193 | 8 | 66.5 | 61.0 | 57.9 | 43.5 | 1.15x | 0.75x | 4.5 | 1% |
| float32 | 8 | 8193 | 16 | 63.9 | 67.1 | 62.3 | 45.0 | 1.03x | 0.72x | 4.2 | 1% |
| float32 | 8 | 131072 | 4 | 412.1 | 410.8 | 160.3 | 76.0 | 2.57x | 0.47x | 26.2 | 6% |
| float32 | 8 | 131072 | 8 | 432.3 | 425.0 | 161.0 | 76.0 | 2.69x | 0.47x | 26.1 | 6% |
| float32 | 8 | 131072 | 16 | 470.7 | 458.5 | 174.0 | 76.2 | 2.71x | 0.44x | 24.1 | 5% |
| float32 | 8 | 131073 | 4 | 403.7 | 411.2 | 244.8 | 76.0 | 1.65x | 0.31x | 17.1 | 4% |
| float32 | 8 | 131073 | 8 | 424.4 | 425.9 | 251.9 | 75.8 | 1.68x | 0.30x | 16.7 | 4% |
| float32 | 8 | 131073 | 16 | 471.5 | 477.8 | 250.7 | 76.1 | 1.88x | 0.30x | 16.7 | 4% |
| float32 | 64 | 128 | 4 | 38.2 | 37.4 | 36.8 | 15.0 | 1.04x | 0.41x | 1.0 | 0% |
| float32 | 64 | 128 | 8 | 36.8 | 37.2 | 36.7 | 15.0 | 1.00x | 0.41x | 1.1 | 0% |
| float32 | 64 | 128 | 16 | 38.3 | 37.2 | 38.1 | 14.9 | 1.01x | 0.39x | 1.2 | 0% |
| float32 | 64 | 129 | 4 | 38.5 | 37.1 | 36.8 | 15.5 | 1.05x | 0.42x | 1.0 | 0% |
| float32 | 64 | 129 | 8 | 37.0 | 37.1 | 36.8 | 15.9 | 1.01x | 0.43x | 1.1 | 0% |
| float32 | 64 | 129 | 16 | 38.4 | 38.8 | 37.1 | 15.4 | 1.04x | 0.42x | 1.2 | 0% |
| float32 | 64 | 1024 | 4 | 39.6 | 38.9 | 41.3 | 20.4 | 0.96x | 0.49x | 6.4 | 1% |
| float32 | 64 | 1024 | 8 | 39.8 | 39.2 | 41.1 | 20.3 | 0.97x | 0.49x | 6.5 | 1% |
| float32 | 64 | 1024 | 16 | 41.4 | 40.2 | 42.6 | 20.3 | 0.97x | 0.48x | 6.4 | 1% |
| float32 | 64 | 1025 | 4 | 41.3 | 43.4 | 41.4 | 22.1 | 1.00x | 0.53x | 6.4 | 1% |
| float32 | 64 | 1025 | 8 | 42.9 | 43.3 | 42.4 | 22.1 | 1.01x | 0.52x | 6.3 | 1% |
| float32 | 64 | 1025 | 16 | 42.9 | 44.7 | 42.6 | 22.2 | 1.01x | 0.52x | 6.4 | 1% |
| float32 | 64 | 8192 | 4 | 96.8 | 99.2 | 106.9 | 65.6 | 0.91x | 0.61x | 19.6 | 4% |
| float32 | 64 | 8192 | 8 | 103.8 | 106.6 | 110.0 | 65.6 | 0.94x | 0.60x | 19.1 | 4% |
| float32 | 64 | 8192 | 16 | 109.6 | 109.9 | 117.2 | 65.6 | 0.94x | 0.56x | 18.0 | 4% |
| float32 | 64 | 8193 | 4 | 97.8 | 99.6 | 111.5 | 65.6 | 0.88x | 0.59x | 18.8 | 4% |
| float32 | 64 | 8193 | 8 | 104.9 | 112.7 | 112.9 | 65.5 | 0.93x | 0.58x | 18.6 | 4% |
| float32 | 64 | 8193 | 16 | 112.9 | 115.8 | 111.7 | 65.6 | 1.01x | 0.59x | 18.9 | 4% |
| float32 | 64 | 131072 | 4 | 956.6 | 940.0 | 470.2 | 221.1 | 2.03x | 0.47x | 71.4 | 16% |
| float32 | 64 | 131072 | 8 | 1024.0 | 1007.2 | 473.6 | 220.4 | 2.16x | 0.47x | 70.9 | 16% |
| float32 | 64 | 131072 | 16 | 1097.5 | 1082.4 | 487.5 | 222.6 | 2.25x | 0.46x | 68.9 | 15% |
| float32 | 64 | 131073 | 4 | 943.5 | 941.2 | 610.1 | 223.0 | 1.55x | 0.37x | 55.0 | 12% |
| float32 | 64 | 131073 | 8 | 1004.0 | 1010.3 | 635.1 | 225.0 | 1.58x | 0.35x | 52.8 | 12% |
| float32 | 64 | 131073 | 16 | 1095.1 | 1101.5 | 650.8 | 223.7 | 1.68x | 0.34x | 51.6 | 11% |
| float32 | 256 | 128 | 4 | 46.0 | 46.0 | 45.8 | 15.7 | 1.00x | 0.34x | 3.1 | 1% |
| float32 | 256 | 128 | 8 | 47.2 | 47.5 | 45.8 | 15.7 | 1.03x | 0.34x | 3.4 | 1% |
| float32 | 256 | 128 | 16 | 47.4 | 47.2 | 45.7 | 15.7 | 1.04x | 0.34x | 3.9 | 1% |
| float32 | 256 | 129 | 4 | 47.2 | 47.5 | 45.8 | 16.1 | 1.03x | 0.35x | 3.2 | 1% |
| float32 | 256 | 129 | 8 | 45.6 | 47.4 | 47.2 | 16.7 | 0.97x | 0.35x | 3.3 | 1% |
| float32 | 256 | 129 | 16 | 47.3 | 49.0 | 50.1 | 16.6 | 0.94x | 0.33x | 3.6 | 1% |
| float32 | 256 | 1024 | 4 | 66.7 | 68.3 | 68.2 | 41.7 | 0.98x | 0.61x | 15.6 | 3% |
| float32 | 256 | 1024 | 8 | 70.7 | 70.0 | 69.5 | 43.2 | 1.02x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1024 | 16 | 71.1 | 71.6 | 71.2 | 43.8 | 1.00x | 0.62x | 15.4 | 3% |
| float32 | 256 | 1025 | 4 | 82.8 | 81.2 | 81.8 | 45.9 | 1.01x | 0.56x | 13.0 | 3% |
| float32 | 256 | 1025 | 8 | 85.8 | 84.6 | 87.6 | 46.6 | 0.98x | 0.53x | 12.3 | 3% |
| float32 | 256 | 1025 | 16 | 87.3 | 89.4 | 89.3 | 48.1 | 0.98x | 0.54x | 12.3 | 3% |
| float32 | 256 | 8192 | 4 | 274.6 | 277.6 | 279.6 | 101.0 | 0.98x | 0.36x | 30.0 | 7% |
| float32 | 256 | 8192 | 8 | 299.9 | 286.3 | 292.0 | 101.3 | 1.03x | 0.35x | 28.8 | 6% |
| float32 | 256 | 8192 | 16 | 313.3 | 315.7 | 301.0 | 100.9 | 1.04x | 0.34x | 28.0 | 6% |
| float32 | 256 | 8193 | 4 | 283.6 | 277.9 | 296.7 | 101.7 | 0.96x | 0.34x | 28.3 | 6% |
| float32 | 256 | 8193 | 8 | 292.0 | 292.6 | 303.0 | 101.6 | 0.96x | 0.34x | 27.8 | 6% |
| float32 | 256 | 8193 | 16 | 317.9 | 318.0 | 314.7 | 101.8 | 1.01x | 0.32x | 26.8 | 6% |
| float32 | 256 | 131072 | 4 | 3194.0 | 3202.4 | 1625.5 | 1128.3 | 1.96x | 0.69x | 82.6 | 18% |
| float32 | 256 | 131072 | 8 | 3415.0 | 3445.5 | 1644.8 | 1132.5 | 2.08x | 0.69x | 81.6 | 18% |
| float32 | 256 | 131072 | 16 | 3704.6 | 3711.3 | 1687.9 | 1129.5 | 2.19x | 0.67x | 79.5 | 17% |
| float32 | 256 | 131073 | 4 | 3206.8 | 3195.1 | 2142.2 | 1148.5 | 1.50x | 0.54x | 62.7 | 14% |
| float32 | 256 | 131073 | 8 | 3427.4 | 3420.5 | 2207.1 | 1148.0 | 1.55x | 0.52x | 60.8 | 13% |
| float32 | 256 | 131073 | 16 | 3743.5 | 3721.6 | 2263.0 | 1147.9 | 1.65x | 0.51x | 59.3 | 13% |
| float32 | 1024 | 128 | 4 | 100.9 | 102.1 | 100.7 | 22.3 | 1.00x | 0.22x | 5.7 | 1% |
| float32 | 1024 | 128 | 8 | 107.9 | 105.8 | 105.5 | 22.0 | 1.02x | 0.21x | 5.9 | 1% |
| float32 | 1024 | 128 | 16 | 108.2 | 110.0 | 109.3 | 22.2 | 0.99x | 0.20x | 6.6 | 1% |
| float32 | 1024 | 129 | 4 | 102.3 | 101.3 | 103.5 | 24.4 | 0.99x | 0.24x | 5.6 | 1% |
| float32 | 1024 | 129 | 8 | 108.0 | 108.2 | 105.5 | 24.4 | 1.02x | 0.23x | 5.9 | 1% |
| float32 | 1024 | 129 | 16 | 109.5 | 111.1 | 109.4 | 24.6 | 1.00x | 0.22x | 6.6 | 1% |
| float32 | 1024 | 1024 | 4 | 185.6 | 50.2 | 50.0 | 88.3 | 3.71x | 1.77x | 84.9 | 19% |
| float32 | 1024 | 1024 | 8 | 190.3 | 50.0 | 50.0 | 88.3 | 3.81x | 1.77x | 85.9 | 19% |
| float32 | 1024 | 1024 | 16 | 194.7 | 50.2 | 51.0 | 88.3 | 3.82x | 1.73x | 86.1 | 19% |
| float32 | 1024 | 1025 | 4 | 251.8 | 92.1 | 91.9 | 90.2 | 2.74x | 0.98x | 46.2 | 10% |
| float32 | 1024 | 1025 | 8 | 262.6 | 92.5 | 92.7 | 90.1 | 2.83x | 0.97x | 46.4 | 10% |
| float32 | 1024 | 1025 | 16 | 267.3 | 93.0 | 93.0 | 90.4 | 2.87x | 0.97x | 47.3 | 10% |
| float32 | 1024 | 8192 | 4 | 1000.9 | 230.7 | 231.1 | 200.8 | 4.33x | 0.87x | 145.4 | 32% |
| float32 | 1024 | 8192 | 8 | 1072.8 | 231.1 | 231.3 | 200.2 | 4.64x | 0.87x | 145.5 | 32% |
| float32 | 1024 | 8192 | 16 | 1140.4 | 231.5 | 231.4 | 201.7 | 4.93x | 0.87x | 145.9 | 32% |
| float32 | 1024 | 8193 | 4 | 1014.7 | 465.1 | 465.7 | 202.4 | 2.18x | 0.43x | 72.2 | 16% |
| float32 | 1024 | 8193 | 8 | 1076.7 | 465.9 | 465.1 | 201.3 | 2.31x | 0.43x | 72.4 | 16% |
| float32 | 1024 | 8193 | 16 | 1159.9 | 466.5 | 465.6 | 202.6 | 2.49x | 0.44x | 72.5 | 16% |
| float32 | 1024 | 131072 | 4 | 11911.6 | 1964.0 | 1965.1 | 4191.1 | 6.06x | 2.13x | 273.2 | 60% |
| float32 | 1024 | 131072 | 8 | 12727.1 | 1966.1 | 1968.0 | 4189.9 | 6.47x | 2.13x | 272.9 | 60% |
| float32 | 1024 | 131072 | 16 | 13772.9 | 1966.2 | 1966.7 | 4190.6 | 7.00x | 2.13x | 273.1 | 60% |
| float32 | 1024 | 131073 | 4 | 11868.0 | 3547.2 | 3547.7 | 4260.7 | 3.35x | 1.20x | 151.3 | 33% |
| float32 | 1024 | 131073 | 8 | 12770.6 | 3550.0 | 3550.8 | 4261.2 | 3.60x | 1.20x | 151.2 | 33% |
| float32 | 1024 | 131073 | 16 | 13914.8 | 3557.8 | 3560.1 | 4261.2 | 3.91x | 1.20x | 150.9 | 33% |
| float32 | 2048 | 128 | 4 | 170.5 | 170.2 | 171.1 | 30.2 | 1.00x | 0.18x | 6.7 | 1% |
| float32 | 2048 | 128 | 8 | 177.6 | 177.9 | 178.6 | 30.6 | 0.99x | 0.17x | 7.0 | 2% |
| float32 | 2048 | 128 | 16 | 180.7 | 181.4 | 180.1 | 31.2 | 1.00x | 0.17x | 8.0 | 2% |
| float32 | 2048 | 129 | 4 | 170.3 | 170.5 | 171.3 | 35.4 | 0.99x | 0.21x | 6.7 | 1% |
| float32 | 2048 | 129 | 8 | 176.5 | 176.7 | 177.2 | 35.3 | 1.00x | 0.20x | 7.1 | 2% |
| float32 | 2048 | 129 | 16 | 181.9 | 182.7 | 181.0 | 36.4 | 1.00x | 0.20x | 8.0 | 2% |
| float32 | 2048 | 1024 | 4 | 333.2 | 85.6 | 85.5 | 123.4 | 3.90x | 1.44x | 99.3 | 22% |
| float32 | 2048 | 1024 | 8 | 347.3 | 85.9 | 86.0 | 123.4 | 4.04x | 1.43x | 99.8 | 22% |
| float32 | 2048 | 1024 | 16 | 355.7 | 87.1 | 87.0 | 123.7 | 4.09x | 1.42x | 100.9 | 22% |
| float32 | 2048 | 1025 | 4 | 470.0 | 165.7 | 165.7 | 126.5 | 2.84x | 0.76x | 51.3 | 11% |
| float32 | 2048 | 1025 | 8 | 492.6 | 166.1 | 166.1 | 126.7 | 2.97x | 0.76x | 51.7 | 11% |
| float32 | 2048 | 1025 | 16 | 503.6 | 167.0 | 167.5 | 127.0 | 3.01x | 0.76x | 52.5 | 12% |
| float32 | 2048 | 8192 | 4 | 1972.4 | 442.5 | 442.5 | 421.7 | 4.46x | 0.95x | 151.9 | 33% |
| float32 | 2048 | 8192 | 8 | 2094.9 | 443.3 | 443.1 | 424.8 | 4.73x | 0.96x | 151.9 | 33% |
| float32 | 2048 | 8192 | 16 | 2251.3 | 444.0 | 443.8 | 424.0 | 5.07x | 0.96x | 152.1 | 33% |
| float32 | 2048 | 8193 | 4 | 1979.8 | 908.5 | 906.7 | 436.2 | 2.18x | 0.48x | 74.1 | 16% |
| float32 | 2048 | 8193 | 8 | 2127.7 | 907.9 | 909.8 | 437.6 | 2.34x | 0.48x | 74.0 | 16% |
| float32 | 2048 | 8193 | 16 | 2269.5 | 910.9 | 909.9 | 440.8 | 2.49x | 0.48x | 74.2 | 16% |
| float32 | 2048 | 131072 | 4 | 23642.3 | 3925.9 | 3925.6 | 8254.2 | 6.02x | 2.10x | 273.5 | 60% |
| float32 | 2048 | 131072 | 8 | 25253.3 | 3926.0 | 3928.5 | 8254.6 | 6.43x | 2.10x | 273.4 | 60% |
| float32 | 2048 | 131072 | 16 | 27390.4 | 3930.4 | 3925.5 | 8250.2 | 6.98x | 2.10x | 273.6 | 60% |
| float32 | 2048 | 131073 | 4 | 23630.0 | 7033.7 | 7035.5 | 8407.4 | 3.36x | 1.19x | 152.6 | 33% |
| float32 | 2048 | 131073 | 8 | 25309.8 | 7037.0 | 7033.5 | 8407.4 | 3.60x | 1.20x | 152.7 | 33% |
| float32 | 2048 | 131073 | 16 | 27547.6 | 7041.9 | 7036.1 | 8413.3 | 3.92x | 1.20x | 152.7 | 33% |

</details>

### Test methodology

- **Accuracy (432 cases):** 3 dtypes x 6 batch sizes x 4 dims x 2
alignments x 3 k values. XPU output is checked against a CPU reference;
both results are sorted before comparison (sort-then-compare).
- **Sortedness (324 cases):** Verify that `torch.topk(sorted=True)` output
is monotonic for both `largest=True` and `largest=False`.
- **Benchmark (432 cases):** Median of 3 runs of 50 iterations each, with
20 warmup iterations and `largest=True`.
- **Bandwidth:**
`(bs * dim * sizeof(dtype) + bs * k * (sizeof(dtype) + 8)) / time`.
Peak bandwidth on B580 is 456 GB/s (192-bit bus x 19 Gbps GDDR6).
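
The bandwidth formula can be checked against a table row. The sketch below is illustrative only: the function names are hypothetical, it assumes the reported latencies are in microseconds and that 1 GB = 1e9 bytes, and it reads the `+ 8` term as one 8-byte index per output element. It plugs in the 3925.6 latency from the `float32 | 2048 | 131072 | 4` row.

```python
def topk_bytes_moved(bs, dim, k, dtype_bytes, index_bytes=8):
    """Bytes counted by the formula: the full input read, plus k (value, index)
    pairs written per slice. index_bytes=8 is an assumed 8-byte index."""
    return bs * dim * dtype_bytes + bs * k * (dtype_bytes + index_bytes)


def effective_bandwidth_gbs(bs, dim, k, dtype_bytes, time_us):
    """Effective bandwidth in GB/s.

    Assumptions (not stated in the PR): time_us is in microseconds
    and 1 GB = 1e9 bytes.
    """
    seconds = time_us * 1e-6
    return topk_bytes_moved(bs, dim, k, dtype_bytes) / seconds / 1e9


# Peak from the PR text: 192-bit bus x 19 Gbps per pin, divided by 8 bits per byte.
PEAK_GBS = 192 / 8 * 19

bw = effective_bandwidth_gbs(bs=2048, dim=131072, k=4, dtype_bytes=4, time_us=3925.6)
print(f"{bw:.1f} GB/s, {100 * bw / PEAK_GBS:.0f}% of peak")
# prints "273.5 GB/s, 60% of peak"
```

Under these assumptions the result matches the last two columns of that row (273.5 GB/s, 60%), which suggests the bandwidth column is derived from the combined PR1+PR2 latency rather than the original kernel's.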

---------

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
