[Tile][WIP] tile DeviceTransform port by nanan-nvidia · Pull Request #9210 · NVIDIA/cccl

nanan-nvidia · 2026-06-01T23:06:39Z

Highly experimental, opening as Draft to share a work-in-progress port of cub::DeviceTransform (#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.

Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a __tile__ function, and it must consist of tile operations and must be inlinable.

Current B200 benchmark on pytorch and babel:

pytorch (tile / cub/ delta = tile - cub, BW utilisation %)

op	T	2^16	2^20	2^24	2^28	2^31
relu	half	$\ \ 0.4/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.6/\ \ 6.8/{\color{red}-\ \ 0.2}$	$47.9/47.4/{\color{green}+\ \ 0.5}$	$83.5/83.1/{\color{green}+\ \ 0.4}$	$87.8/80.2/{\color{green}+\ \ 7.6}$
relu	bf16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.7/\ \ 7.0/{\color{red}-\ \ 0.3}$	$47.5/46.2/{\color{green}+\ \ 1.3}$	$83.3/83.0/{\color{green}+\ \ 0.3}$	$87.7/80.9/{\color{green}+\ \ 6.8}$
relu	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$64.7/65.0/{\color{red}-\ \ 0.3}$	$88.4/88.8/{\color{red}-\ \ 0.3}$	$90.8/91.1/{\color{red}-\ \ 0.3}$
sigmoid	half	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.9/\ \ 6.6/{\color{green}+\ \ 0.2}$	$34.5/31.5/{\color{green}+\ \ 3.0}$	$48.1/45.5/{\color{green}+\ \ 2.5}$	$49.3/46.8/{\color{green}+\ \ 2.6}$
sigmoid	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.3}$	$34.8/32.0/{\color{green}+\ \ 2.8}$	$48.1/45.5/{\color{green}+\ \ 2.6}$	$49.4/46.8/{\color{green}+\ \ 2.7}$
sigmoid	f32	$\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$61.3/56.8/{\color{green}+\ \ 4.5}$	$83.8/76.7/{\color{green}+\ \ 7.1}$	$79.3/73.3/{\color{green}+\ \ 6.0}$
tanh	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.3}$	$39.6/35.5/{\color{green}+\ \ 4.1}$	$58.5/51.6/{\color{green}+\ \ 6.9}$	$60.4/53.2/{\color{green}+\ \ 7.2}$
tanh	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.2}$	$38.4/35.5/{\color{green}+\ \ 2.9}$	$55.3/51.4/{\color{green}+\ \ 3.9}$	$56.9/53.1/{\color{green}+\ \ 3.8}$
tanh	f32	$\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$	$13.0/13.3/{\color{red}-\ \ 0.3}$	$64.8/57.5/{\color{green}+\ \ 7.3}$	$88.8/76.7/{\color{green}+12.0}$	$86.7/75.0/{\color{green}+11.6}$
gelu	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$35.3/30.5/{\color{green}+\ \ 4.8}$	$48.8/43.2/{\color{green}+\ \ 5.5}$	$50.1/44.4/{\color{green}+\ \ 5.7}$
gelu	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$34.9/30.6/{\color{green}+\ \ 4.4}$	$48.3/43.3/{\color{green}+\ \ 5.1}$	$49.6/44.5/{\color{green}+\ \ 5.1}$
gelu	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$13.1/13.3/{\color{red}-\ \ 0.2}$	$64.2/53.5/{\color{green}+10.8}$	$85.2/72.8/{\color{green}+12.4}$	$80.5/69.2/{\color{green}+11.4}$
sin	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$	$33.3/30.4/{\color{green}+\ \ 2.8}$	$46.9/41.5/{\color{green}+\ \ 5.3}$	$48.1/42.6/{\color{green}+\ \ 5.5}$
sin	bf16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$	$33.2/30.4/{\color{green}+\ \ 2.8}$	$46.8/41.6/{\color{green}+\ \ 5.2}$	$48.1/42.6/{\color{green}+\ \ 5.5}$
sin	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$12.9/13.3/{\color{red}-\ \ 0.3}$	$60.4/53.4/{\color{green}+\ \ 7.0}$	$79.8/72.7/{\color{green}+\ \ 7.1}$	$76.2/69.5/{\color{green}+\ \ 6.7}$
exp	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.8/\ \ 6.7/{\color{green}+\ \ 0.1}$	$46.9/38.8/{\color{green}+\ \ 8.1}$	$74.8/61.0/{\color{green}+13.8}$	$73.4/60.3/{\color{green}+13.1}$
exp	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$45.5/38.9/{\color{green}+\ \ 6.6}$	$71.2/62.1/{\color{green}+\ \ 9.1}$	$73.6/61.7/{\color{green}+11.9}$
exp	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$64.7/65.4/{\color{red}-\ \ 0.7}$	$88.5/87.2/{\color{green}+\ \ 1.2}$	$90.8/82.8/{\color{green}+\ \ 8.1}$
add	half	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.1/10.0/{\color{green}+\ \ 0.1}$	$57.1/53.5/{\color{green}+\ \ 3.6}$	$88.6/86.8/{\color{green}+\ \ 1.8}$	$92.5/84.1/{\color{green}+\ \ 8.4}$
add	bf16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.2/10.0/{\color{green}+\ \ 0.2}$	$56.7/53.5/{\color{green}+\ \ 3.2}$	$88.6/86.9/{\color{green}+\ \ 1.8}$	$92.5/85.7/{\color{green}+\ \ 6.8}$
add	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.8/17.9/{\color{red}-\ \ 0.2}$	$70.1/69.9/{\color{green}+\ \ 0.2}$	$91.2/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.6/{\color{red}-\ \ 0.9}$
sub	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{red}-\ \ 0.0}$	$57.2/53.4/{\color{green}+\ \ 3.8}$	$88.7/86.8/{\color{green}+\ \ 1.8}$	$92.5/84.0/{\color{green}+\ \ 8.5}$
sub	bf16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$57.4/53.4/{\color{green}+\ \ 4.0}$	$88.6/86.9/{\color{green}+\ \ 1.8}$	$92.4/85.6/{\color{green}+\ \ 6.9}$
sub	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.0}$	$17.7/17.7/{\color{green}+\ \ 0.1}$	$70.7/70.0/{\color{green}+\ \ 0.7}$	$91.2/92.1/{\color{red}-\ \ 0.8}$	$92.7/93.7/{\color{red}-\ \ 1.0}$
mul	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$57.7/53.5/{\color{green}+\ \ 4.3}$	$88.6/86.8/{\color{green}+\ \ 1.8}$	$92.6/83.9/{\color{green}+\ \ 8.6}$
mul	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.1/10.0/{\color{green}+\ \ 0.1}$	$57.7/53.5/{\color{green}+\ \ 4.3}$	$88.7/86.8/{\color{green}+\ \ 1.8}$	$92.5/85.5/{\color{green}+\ \ 6.9}$
mul	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$19.3/18.1/{\color{green}+\ \ 1.1}$	$71.0/70.0/{\color{green}+\ \ 1.0}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.6/{\color{red}-\ \ 0.8}$
div	half	$\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$	$10.0/10.0/{\color{red}-\ \ 0.0}$	$53.5/49.3/{\color{green}+\ \ 4.3}$	$83.2/73.2/{\color{green}+10.0}$	$79.4/69.9/{\color{green}+\ \ 9.5}$
div	bf16	$\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$	$\ \ 9.5/10.0/{\color{red}-\ \ 0.5}$	$54.4/49.2/{\color{green}+\ \ 5.2}$	$82.5/69.7/{\color{green}+12.8}$	$79.8/71.6/{\color{green}+\ \ 8.2}$
div	f32	$\ \ 1.5/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.5/17.2/{\color{green}+\ \ 0.3}$	$69.9/67.4/{\color{green}+\ \ 2.5}$	$92.3/85.1/{\color{green}+\ \ 7.2}$	$93.6/81.8/{\color{green}+11.8}$
le	half	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.9/53.4/{\color{green}+\ \ 4.5}$	$89.2/84.7/{\color{green}+\ \ 4.5}$	$92.7/82.5/{\color{green}+10.2}$
le	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.1}$	$10.0/\ \ 9.6/{\color{green}+\ \ 0.5}$	$57.9/54.0/{\color{green}+\ \ 3.9}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.7/83.1/{\color{green}+\ \ 9.6}$
le	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.2/17.6/{\color{green}+\ \ 0.7}$	$70.6/70.3/{\color{green}+\ \ 0.3}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.9/{\color{red}-\ \ 1.2}$
ge	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.5/53.4/{\color{green}+\ \ 4.1}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.8/82.6/{\color{green}+10.2}$
ge	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.8/{\color{green}+\ \ 0.2}$	$57.5/53.4/{\color{green}+\ \ 4.1}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.7/83.1/{\color{green}+\ \ 9.6}$
ge	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.1}$	$18.4/17.5/{\color{green}+\ \ 0.9}$	$70.2/69.9/{\color{green}+\ \ 0.3}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.7/93.9/{\color{red}-\ \ 1.2}$
fmin	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$	$57.4/53.5/{\color{green}+\ \ 3.9}$	$89.1/84.0/{\color{green}+\ \ 5.0}$	$92.7/80.2/{\color{green}+12.5}$
fmin	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$	$57.9/53.4/{\color{green}+\ \ 4.5}$	$89.1/84.0/{\color{green}+\ \ 5.1}$	$92.6/81.7/{\color{green}+10.9}$
fmin	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$19.3/18.5/{\color{green}+\ \ 0.8}$	$70.7/70.6/{\color{green}+\ \ 0.1}$	$91.2/92.0/{\color{red}-\ \ 0.8}$	$92.8/93.5/{\color{red}-\ \ 0.7}$
fmax	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.7/53.4/{\color{green}+\ \ 4.3}$	$89.1/83.8/{\color{green}+\ \ 5.3}$	$92.7/80.3/{\color{green}+12.4}$
fmax	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$58.1/53.5/{\color{green}+\ \ 4.7}$	$89.1/84.0/{\color{green}+\ \ 5.0}$	$92.6/81.7/{\color{green}+10.9}$
fmax	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.7/18.2/{\color{green}+\ \ 0.5}$	$69.8/68.5/{\color{green}+\ \ 1.3}$	$91.2/92.0/{\color{red}-\ \ 0.8}$	$92.8/93.5/{\color{red}-\ \ 0.7}$

babel

op	T	2^16	2^20	2^24	2^28	2^31
mul	i8	$\ \ 0.2/\ \ 0.3/{\color{red}-\ \ 0.1}$	$\ \ 3.5/\ \ 3.4/{\color{green}+\ \ 0.1}$	$34.5/30.2/{\color{green}+\ \ 4.3}$	$72.9/58.9/{\color{green}+14.0}$	$79.4/59.8/{\color{green}+19.6}$
mul	i16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.7/\ \ 6.7/{\color{red}-\ \ 0.0}$	$48.2/47.5/{\color{green}+\ \ 0.7}$	$85.2/85.5/{\color{red}-\ \ 0.2}$	$89.4/82.2/{\color{green}+\ \ 7.2}$
mul	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$12.9/13.3/{\color{red}-\ \ 0.3}$	$61.6/65.3/{\color{red}-\ \ 3.7}$	$87.5/89.0/{\color{red}-\ \ 1.5}$	$89.6/91.0/{\color{red}-\ \ 1.5}$
mul	f64	$\ \ 1.7/\ \ 2.1/{\color{red}-\ \ 0.4}$	$21.8/21.5/{\color{green}+\ \ 0.3}$	$73.8/76.1/{\color{red}-\ \ 2.2}$	$88.8/90.0/{\color{red}-\ \ 1.3}$	$89.8/91.1/{\color{red}-\ \ 1.3}$
add	i8	$\ \ 0.3/\ \ 0.4/{\color{red}-\ \ 0.1}$	$\ \ 5.2/\ \ 5.0/{\color{green}+\ \ 0.1}$	$38.8/39.9/{\color{red}-\ \ 1.1}$	$66.9/71.3/{\color{red}-\ \ 4.4}$	$71.2/70.4/{\color{green}+\ \ 0.8}$
add	i16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$58.4/54.7/{\color{green}+\ \ 3.7}$	$88.7/86.9/{\color{green}+\ \ 1.9}$	$92.5/85.3/{\color{green}+\ \ 7.2}$
add	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.7/17.5/{\color{green}+\ \ 0.2}$	$70.2/68.8/{\color{green}+\ \ 1.5}$	$91.2/92.1/{\color{red}-\ \ 0.9}$	$92.8/93.7/{\color{red}-\ \ 0.9}$
add	f64	$\ \ 3.0/\ \ 3.1/{\color{red}-\ \ 0.1}$	$30.9/31.5/{\color{red}-\ \ 0.6}$	$80.3/79.9/{\color{green}+\ \ 0.4}$	$92.0/93.3/{\color{red}-\ \ 1.3}$	$93.0/94.1/{\color{red}-\ \ 1.1}$
triad	i8	$\ \ 0.4/\ \ 0.4/{\color{red}-\ \ 0.0}$	$\ \ 5.1/\ \ 5.0/{\color{green}+\ \ 0.0}$	$36.9/36.9/{\color{green}+\ \ 0.0}$	$64.1/67.4/{\color{red}-\ \ 3.3}$	$66.2/64.6/{\color{green}+\ \ 1.6}$
triad	i16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.9/53.5/{\color{green}+\ \ 4.4}$	$90.0/85.9/{\color{green}+\ \ 4.0}$	$93.0/83.3/{\color{green}+\ \ 9.6}$
triad	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.9/17.6/{\color{green}+\ \ 1.3}$	$71.0/70.0/{\color{green}+\ \ 1.1}$	$91.3/92.3/{\color{red}-\ \ 1.0}$	$92.9/91.6/{\color{green}+\ \ 1.3}$
triad	f64	$\ \ 3.2/\ \ 3.1/{\color{green}+\ \ 0.1}$	$30.5/29.7/{\color{green}+\ \ 0.8}$	$79.9/79.8/{\color{green}+\ \ 0.1}$	$92.1/93.5/{\color{red}-\ \ 1.4}$	$93.0/93.8/{\color{red}-\ \ 0.8}$
nstream	i8	$\ \ 0.4/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.6/\ \ 6.7/{\color{red}-\ \ 0.0}$	$37.7/42.6/{\color{red}-\ \ 4.9}$	$56.5/65.8/{\color{red}-\ \ 9.2}$	$57.3/64.0/{\color{red}-\ \ 6.6}$
nstream	i16	$\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$	$13.3/13.3/{\color{green}+\ \ 0.0}$	$61.9/60.9/{\color{green}+\ \ 1.0}$	$91.7/85.2/{\color{green}+\ \ 6.5}$	$95.0/81.5/{\color{green}+13.6}$
nstream	f32	$\ \ 2.1/\ \ 2.1/{\color{green}+\ \ 0.1}$	$22.0/21.4/{\color{green}+\ \ 0.6}$	$74.6/73.4/{\color{green}+\ \ 1.2}$	$93.4/92.7/{\color{green}+\ \ 0.8}$	$95.3/93.4/{\color{green}+\ \ 1.9}$
nstream	f64	$\ \ 3.8/\ \ 3.6/{\color{green}+\ \ 0.2}$	$35.5/35.5/{\color{green}+\ \ 0.0}$	$83.3/83.2/{\color{green}+\ \ 0.0}$	$94.5/94.5/{\color{red}-\ \ 0.1}$	$95.4/95.5/{\color{red}-\ \ 0.0}$

copy, grayscale, fill

op	T	2^16	2^20	2^24	2^28	2^31
copy	i8	$\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$	$\ \ 4.0/\ \ 3.4/{\color{green}+\ \ 0.6}$	$35.8/30.8/{\color{green}+\ \ 5.1}$	$82.1/72.0/{\color{green}+10.0}$	$89.8/68.5/{\color{green}+21.2}$
copy	i16	$\ \ 0.5/\ \ 0.5/{\color{green}+\ \ 0.0}$	$\ \ 6.8/\ \ 6.8/{\color{green}+\ \ 0.1}$	$51.7/47.4/{\color{green}+\ \ 4.4}$	$86.5/84.8/{\color{green}+\ \ 1.7}$	$90.5/80.6/{\color{green}+\ \ 9.9}$
copy	i32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$13.0/13.1/{\color{red}-\ \ 0.1}$	$61.8/65.3/{\color{red}-\ \ 3.4}$	$87.7/88.7/{\color{red}-\ \ 1.1}$	$89.6/91.0/{\color{red}-\ \ 1.5}$
copy	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$13.0/13.2/{\color{red}-\ \ 0.1}$	$62.2/65.3/{\color{red}-\ \ 3.1}$	$87.6/88.8/{\color{red}-\ \ 1.1}$	$88.2/91.2/{\color{red}-\ \ 3.0}$
copy	f64	$\ \ 2.1/\ \ 2.1/{\color{red}-\ \ 0.0}$	$23.0/22.2/{\color{green}+\ \ 0.8}$	$75.5/75.7/{\color{red}-\ \ 0.2}$	$89.8/90.1/{\color{red}-\ \ 0.3}$	$90.9/91.2/{\color{red}-\ \ 0.3}$
grayscale	f32	$\ \ 1.7/\ \ 2.0/{\color{red}-\ \ 0.3}$	$21.7/21.3/{\color{green}+\ \ 0.4}$	$70.7/71.1/{\color{red}-\ \ 0.4}$	$92.3/92.6/{\color{red}-\ \ 0.4}$	$94.0/93.7/{\color{green}+\ \ 0.3}$
grayscale	f64	$\ \ 3.5/\ \ 3.8/{\color{red}-\ \ 0.2}$	$35.1/35.1/{\color{red}-\ \ 0.0}$	$79.7/82.7/{\color{red}-\ \ 3.1}$	$92.9/94.4/{\color{red}-\ \ 1.5}$	$93.7/95.6/{\color{red}-\ \ 1.9}$
fill	I8	$\ \ 0.1/\ \ 0.1/{\color{red}-\ \ 0.0}$	$\ \ 2.2/\ \ 2.2/{\color{red}-\ \ 0.0}$	$26.3/26.3/{\color{red}-\ \ 0.0}$	$85.1/85.1/{\color{red}-\ \ 0.0}$	$97.4/97.4/{\color{red}-\ \ 0.0}$
fill	I16	$\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$	$\ \ 4.4/\ \ 4.4/{\color{green}+\ \ 0.0}$	$42.1/42.6/{\color{red}-\ \ 0.6}$	$91.8/91.9/{\color{red}-\ \ 0.1}$	$98.2/98.2/{\color{red}-\ \ 0.0}$
fill	I32	$\ \ 0.6/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 8.5/\ \ 8.3/{\color{green}+\ \ 0.2}$	$60.1/59.3/{\color{green}+\ \ 0.8}$	$95.1/95.0/{\color{green}+\ \ 0.1}$	$98.8/98.7/{\color{green}+\ \ 0.1}$
fill	I64	$\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$	$13.9/15.1/{\color{red}-\ \ 1.2}$	$72.8/71.4/{\color{green}+\ \ 1.4}$	$97.4/97.0/{\color{green}+\ \ 0.4}$	$99.1/99.0/{\color{green}+\ \ 0.1}$

We did not do benchmarks on complex, fib and heavy:

for complex, cutile does not accept std::complex as a vaild type to form tiles.
for fib, with tile semantics, there is no 1-to-1 fair implementation in tile. We can get one by abusing ct::select but it is much slower.
for heavy, cutile lowers syntax like T reg[N] to heap allocation.

There was a more detailed write up on fib and heavy here: #9038 (comment)

copy-pr-bot · 2026-06-01T23:06:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

miscco

This looks great already

bernhardmgruber · 2026-06-02T07:26:29Z

+    auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
+    auto out       = ct::assume_aligned<16>(out_);


Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.

Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.

I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark

That is a very good idea! Please help them improve tile!

Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.

bernhardmgruber · 2026-06-02T07:29:02Z

I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice!

nanan-nvidia · 2026-06-04T01:30:05Z

NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with --enable-tile. I temporarily unblocked myself with e0a31e4

This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with --enable-tile and -DCCCL_ENABLE_TILE_TRANSFORM_DISPATCH.

The basic idea is, at compile time, eligible (Op, T, NIn) combos will be dispatched to tile kernels with traits. We can ship traits based on our benchmarks (i.e. if we know tile is better on some (Op, T, NIn)). The user can also self-register the combos they find tile to be beneficial. To self register, they need to provide three pieces of data:

First, a SIMT functor that will be called at the API:

struct my_tanh {
    template <class T>
    __host__ __device__ T operator()(T v) const {
        return static_cast<T>(::cuda::std::tanh(static_cast<float>(v)));
    }
};

Second, a tile functor that has the same semantical meaning:

#if defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH) && _CCCL_TILE_COMPILATION()
struct tile_my_tanh {
    template <class T>
    __tile__ auto operator()(T v) const {
        namespace ct = cuda::tiles;
        return ct::element_cast<ct::tile_element_t<T>>(
            ct::tanh(ct::element_cast<float>(v)));
    }
};

And third, the trait specialization that links those two functors:

CUB_NAMESPACE_BEGIN
namespace detail::transform::tile {
    template <class T>
    struct tile_eligible<my_tanh, T, 1> : std::true_type {
        using tile_op_type = tile_my_tanh;
    };
    // Optional: hint the tile policy picker that this is MUFU-heavy.
    template <> struct tile_mufu_heavy<my_tanh> : std::true_type {};
}
CUB_NAMESPACE_END
#endif

The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases.

The dispatch flow is as follows:

                 cub::DeviceTransform::Transform(.., op, ..)
                                │
                                ▼
              if constexpr (tile_dispatch_eligible_v<Op, OutIter, InIters...>):
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
       runtime preconditions OK?            standard CUB dispatch
       (16B-aligned pointers, 
        num_items % 16 == 0, 
        num_items ≤ 2^31)
              │
       ┌──────┴──────┐
       ▼             ▼
   tile kernel    standard CUB

Those runtime preconditions exist because current bad cutile performance on those cases.

fbusato

I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.

fbusato · 2026-06-04T18:27:18Z

+constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
+{
+  constexpr int threads_per_block    = 128;
+  constexpr int vector_bytes         = 16; // LDG.E.128 -> 16 bytes


Blackwell also has LDG.E.256 -> 32 bytes

true. however this would require 32 byte alignment which will be a separate kernel, and I am also not sure about

does cutile generate 256 bit loads

does it help performance wise

need to verify.

instead of another kernel, can we dispatch online? e.g.

constexpr auto align = (condition) 16 : 32;

I see what you mean! Yes, this works. Now I just need to validate perf from blackwell. Potentially separate PR

fbusato · 2026-06-04T18:28:15Z

+  const int items_for_latency =
+    static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter));
+
+  int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency;


fbusato · 2026-06-10T17:50:39Z

+    }
+};
+
+CUB_NAMESPACE_BEGIN


question. Is CUB_NAMESPACE_BEGIN needed for benchmarks?

It is so that we can specialize tile_eligible etc (must be in namespace first)

nanan-nvidia · 2026-06-12T02:14:45Z

/ok to test b893189

nanan-nvidia · 2026-06-12T07:03:47Z

/ok to test 9af44e6

nanan-nvidia · 2026-06-12T09:39:50Z

/ok to test 1960f3f

fbusato · 2026-06-12T18:19:12Z

+// Defined as a literal 1/0 (not (_CCCL_CUB_HAS_TILE_TRANSFORM() && defined(...))) so that
+// `#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()` in non-system code (benches, tests) does not
+// generate `defined` via macro expansion, which is UB and trips -Wexpansion-to-defined under -Werror.
+#if _CCCL_CUB_HAS_TILE_TRANSFORM() && defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH)


@bernhardmgruber my feeling is that we should have a CUB specific macro to enable or disable Tile for CUB. Per-API macro looks too invasive. Any thought?

Why not a CCCL-wide macro? But otherwise, this should definitely go into some more common/non-transform specific header. Like cub/detail/tile_support.cuh or something like that. You may also move the tile traits there.

fbusato · 2026-06-12T18:28:27Z

+constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
+{
+  constexpr int threads_per_block    = 128;
+  constexpr int vector_bytes         = 16; // LDG.E.128 -> 16 bytes


instead of another kernel, can we dispatch online? e.g.

constexpr auto align = (condition) 16 : 32;

nanan-nvidia · 2026-06-14T03:54:33Z

/ok to test 88aeee3

nanan-nvidia · 2026-06-22T21:48:48Z

/ok to test 0c13142

coderabbitai · 2026-06-22T22:20:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds an opt-in tile-DSL fast path for cub::DeviceTransform controlled by CCCL_ENABLE_TILE_TRANSFORM_DISPATCH (requires CUDA 13.4+). Introduces compile-time gate macros, public cub::transform trait customization points, SM80+ tile kernels, tuning, dispatch plumbing, DeviceTransform integration, libcudacxx builtin compatibility fixes, a Catch2 test suite, and four nvbench benchmark files.

Changes

Tile-DSL DeviceTransform dispatch

Layer / File(s)	Summary
Compile-time config macros and trait customization surface `cub/CMakeLists.txt`, `cub/cub/device/dispatch/dispatch_transform_tile_config.cuh`, `cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh`	Defines `_CCCL_CUB_HAS_TILE_TRANSFORM` and `_CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED` macros, public `tile_eligible_v`/`tile_operator`/`tile_operator_t`/`tile_mufu_heavy_v` customization points in `cub::transform`, built-in `__half`/`__nv_bfloat16` specializations, and the CMake option with CUDA 13.4+ enforcement.
Tile kernel, aligned partition view, and tuning `cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh`, `cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh`	Adds `make_aligned_partition_view` wrapping pointers in 16-byte-aligned `cuda::tiles::tensor_span` and partition views, SM80+-gated `transform_kernel` `__tile_global__` kernel performing masked loads/applies/stores, and `pick_tile_size` computing items/thread from bytes-in-flight latency, vector-load granularity, occupancy, and MUFU-heavy constraints.
Dispatch plumbing `cub/cub/device/dispatch/dispatch_transform_tile.cuh`	Implements `launch_impl` (CTA-per-tile launch via ceil_div), `tile_dispatch_eligible_v` compile-time predicate, `device_supports_tile()` sm_80+ runtime gate, `runtime_preconditions_valid()` 16-byte alignment/divisibility-by-16 check, and `dispatch()` bridge unwrapping contiguous iterators and mapping `TransformOp` to `tile_operator_t`.
DeviceTransform integration `cub/cub/device/device_transform.cuh`	Adds conditional tile-dispatch includes and early tile-path routing in `DeviceTransform::__transform_internal` that checks compile-time and runtime eligibility before falling through to standard dispatch.
libcudacxx builtin compatibility `libcudacxx/include/cuda/std/__cccl/builtin.h`, `libcudacxx/include/cuda/std/__new/launder.h`	Disables `_CCCL_BUILTIN_ASSUME_ALIGNED` (replaced with passthrough) and `_CCCL_BUILTIN_LAUNDER` when `_CCCL_TILE_COMPILATION()` is true.
Catch2 tile-dispatch tests and CMake wiring `cub/test/CMakeLists.txt`, `cub/test/catch2_test_device_transform_tile.cu`	Adds unary (`square_op`) and binary (`add_op`) tile-dispatch Catch2 tests with scalar/tile functor pairs, `tile_eligible_v`/`tile_operator` registrations, exact equality assertions against `std::transform` reference, and a fallback test when dispatch is disabled. CMakeLists.txt applies `--enable-tile` and the define.
nvbench identity-copy benchmark `cub/benchmarks/bench/transform/tile/copy.cu`	Adds identity transform functor with optional tile variant, registered via `tile_eligible_v`/`tile_operator`, with nvbench harness measuring bandwidth across configurable types and power-of-two sizes.
nvbench RGB-to-grayscale benchmark `cub/benchmarks/bench/transform/tile/grayscale.cu`	Adds rgb_to_y functor computing weighted luminance from three inputs with optional tile variant (arity 3), registered via `tile_eligible_v`/`tile_operator`, with nvbench harness measuring three-input reduce-combine transform.
nvbench BabelStream ops benchmark `cub/benchmarks/bench/transform/tile/babelstream.cu`	Adds stateless scalar ops (mul/add/triad/nstream) with optional tile variants registered for respective arities, element-type and size configuration, and four separate benchmarks with output/input metrics.
nvbench PyTorch-style ops benchmark `cub/benchmarks/bench/transform/tile/pytorch.cu`	Adds unary ops (ReLU/Sigmoid/Tanh/GELU/Sin/Exp) and binary ops (add/sub/mul/div/le/ge/fmin/fmax) with optional tile variants using `cuda::tiles` math/comparison, MUFU-heavy hints, float conversion helpers, and full macro-based registration across types and sizes.
nvbench CMake wiring `cub/benchmarks/CMakeLists.txt`	Detects tile benchmarks via `/transform/tile/` path and applies `--enable-tile` flag and feature macro when `CCCL_ENABLE_TILE_TRANSFORM_DISPATCH` is enabled.

Suggested reviewers

wmaxey

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (4)

cub/test/catch2_test_device_transform_tile.cu (1)
23-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

suggestion: Remove unused namespace alias.

The alias ct = ::cuda::tiles is declared but never referenced in the file.
Cleanup
-namespace ct = ::cuda::tiles;
-
cub/CMakeLists.txt (1)

29-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: Reject unsupported compiler/toolchain combinations explicitly.

When CCCL_ENABLE_TILE_TRANSFORM_DISPATCH=ON and CMAKE_CUDA_COMPILER_ID is not NVIDIA, this block currently does not fail fast. Adding an explicit FATAL_ERROR (or at least a warning) avoids a silent misconfiguration where users think tile dispatch is enabled.

cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh (1)

44-57: 🚀 Performance & Scalability | 🔵 Trivial | 🏗️ Heavy lift

suggestion: Avoid hard-wiring tuning to default compute_capability{10,0} in the dispatch path.

pick_tile_size is cc-aware, but the shown dispatch call-site uses the default argument, so all architectures get the same tile size. That risks regressions on SM80/SM90 where bytes-in-flight targets differ. Consider selecting tile size from actual device cc (or explicit per-arch specializations) before launch.

As per path instructions, cub/**/* review should prioritize performance-regression risks.

Source: Path instructions

cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)

137-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: make the unwrapped iterator locals const.

out_ptr and in_ptrs are not mutated after initialization.

As per coding guidelines, “All non-modified variables must use const.”

Also applies to: 157-158

Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2bffc417-f5fd-459c-bad1-ca9bce7e1582

📥 Commits

Reviewing files that changed from the base of the PR and between 1fa6ab4 and 0c13142.

📒 Files selected for processing (16)

cub/CMakeLists.txt
cub/benchmarks/CMakeLists.txt
cub/benchmarks/bench/transform/tile/babelstream.cu
cub/benchmarks/bench/transform/tile/copy.cu
cub/benchmarks/bench/transform/tile/grayscale.cu
cub/benchmarks/bench/transform/tile/pytorch.cu
cub/cub/device/device_transform.cuh
cub/cub/device/dispatch/dispatch_transform_tile.cuh
cub/cub/device/dispatch/dispatch_transform_tile_config.cuh
cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh
cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh
cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh
cub/test/CMakeLists.txt
cub/test/catch2_test_device_transform_tile.cu
libcudacxx/include/cuda/std/__cccl/builtin.h
libcudacxx/include/cuda/std/__new/launder.h

bernhardmgruber · 2026-06-23T12:36:33Z

+struct mul_op
+{
+  template <class B>
+  __host__ __device__ auto operator()(B b) const
+  {
+    return -(b + b);
+  }
+};


Important: We should not need new operator definitions. We should just specialize tile_operator for cuda::plus etc. Unless there is a reason this is not possible.

unfortunately, we need to keep them independent for now

bernhardmgruber · 2026-06-23T12:42:02Z

+struct tile_plus
+{
+  template <class A, class B>
+  __tile__ auto operator()(A a, B b) const


Why can't we just use cuda::std::plus?

bernhardmgruber · 2026-06-23T12:42:54Z

+#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()
+    // Opt-in tile path. When the (Op, T, NIn) combo is trait-eligible and the device is sm_80+, we check the


Q: Should the tuning policy contain a flag bool use_tile that decided whether tile is used or not?

It should when we are at a stage where a normal scalar functor could be dispatched to tile; for now I think it is too early.

fbusato · 2026-06-23T18:05:59Z

Why can't we just use cuda::std::plus?

@bernhardmgruber In the current status, it is problematic to support __host__ __device__ __tile__ operators.

…free-function dispatch

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)
104-104: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: out_ptr is never modified; mark it const (line 124 in dispatch already does).
-  auto out_ptr           = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output);
+  const auto out_ptr     = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output);
As per coding guidelines: "All non-modified variables must use const".

Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82afd949-f888-4a5a-90b9-07d8e642ccdb

📥 Commits

Reviewing files that changed from the base of the PR and between 0c13142 and da688b6.

📒 Files selected for processing (1)

cub/cub/device/dispatch/dispatch_transform_tile.cuh

nanan-nvidia · 2026-06-23T20:30:16Z

/ok to test 20fd237

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh (2)

61-61: 🩺 Stability & Availability | 🟠 Major

important: Add non-negative validation to the runtime gate before launching the tile kernel.

The runtime_preconditions_valid() function at line 96 checks alignment and divisibility by 16, but does not validate num_items >= 0. Line 61 of the kernel uses ct::assume_bounded_below<0>(num_items), which requires num_items >= 0. Since OffsetT is a template parameter that can be unsigned, a negative or large unsigned value cast to int64_t will violate this assumption. Add a representable, non-negative check using the signed-safe pattern: when OffsetT could be unsigned, use cuda::std::is_unsigned_v<OffsetT> ? false : (num_items < 0) instead of a direct comparison.

Source: Coding guidelines

56-56: 🎯 Functional Correctness | 🟠 Major

important: do not add an unchecked no-alias contract to the tile fast path.

__restrict__ qualifiers on kernel parameters declare that out and the ins pointers do not alias. However, the API explicitly permits in-place transforms (line 347 of device_transform.cuh: "May point to the same sequence as \p input"), and the dispatch preconditions check only alignment and divisibility—not aliasing. If the user passes the same pointer as both output and input with aligned/divisible data, the tile kernel launches with a violated __restrict__ contract, resulting in undefined behavior. Either drop the qualifiers or add an aliasing check before kernel launch.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 01a4a08a-c771-4bc4-9987-4afad86a4512

📥 Commits

Reviewing files that changed from the base of the PR and between da688b6 and 8a8724b.

📒 Files selected for processing (6)

cub/benchmarks/bench/transform/tile/babelstream.cu
cub/benchmarks/bench/transform/tile/copy.cu
cub/benchmarks/bench/transform/tile/grayscale.cu
cub/benchmarks/bench/transform/tile/pytorch.cu
cub/cub/device/dispatch/dispatch_transform_tile.cuh
cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh

💤 Files with no reviewable changes (1)

cub/cub/device/dispatch/dispatch_transform_tile.cuh

🚧 Files skipped from review as they are similar to previous changes (4)

cub/benchmarks/bench/transform/tile/grayscale.cu
cub/benchmarks/bench/transform/tile/copy.cu
cub/benchmarks/bench/transform/tile/pytorch.cu
cub/benchmarks/bench/transform/tile/babelstream.cu

…sforms)

nanan-nvidia · 2026-06-23T20:47:31Z

/ok to test 3ce1647

nanan-nvidia · 2026-06-23T20:54:49Z

/ok to test f7729e7

github-actions · 2026-06-24T01:24:31Z

🥳 CI Workflow Results

🟩 Finished in 4h 25m: Pass: 100%/343 | Total: 4d 00h | Max: 1h 57m | Hits: 99%/482196

See results here.

nanan-nvidia self-assigned this Jun 1, 2026

github-project-automation Bot added this to CCCL Jun 1, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 1, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 1, 2026

nanan-nvidia mentioned this pull request Jun 1, 2026

Functional/Non-Functional C++ Tile Evaluation #9038

Open

miscco reviewed Jun 2, 2026

View reviewed changes

Comment thread cub/benchmarks/bench/transform/tile/device_transform.cuh Outdated

bernhardmgruber reviewed Jun 2, 2026

View reviewed changes

fbusato requested changes Jun 4, 2026

View reviewed changes

nanan-nvidia changed the title ~~[Tile][WIP] tile DeviceTransform port~~ https://github.com/NVIDIA/cccl/pull/9210#issuecomment-4618162940 Jun 9, 2026

nanan-nvidia removed this from CCCL Jun 9, 2026

nanan-nvidia added this to CCCL Jun 9, 2026

github-project-automation Bot moved this to In Progress in CCCL Jun 9, 2026

nanan-nvidia removed this from CCCL Jun 9, 2026

nanan-nvidia added this to CCCL Jun 9, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 9, 2026

github-project-automation Bot moved this from Todo to In Progress in CCCL Jun 9, 2026

nanan-nvidia changed the title ~~https://github.com/NVIDIA/cccl/pull/9210#issuecomment-4618162940~~ [Tile][WIP] tile DeviceTransform port Jun 9, 2026

fbusato reviewed Jun 10, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

nanan-nvidia force-pushed the tile-device-transform branch from 9af44e6 to 1960f3f Compare June 12, 2026 09:38

This comment has been minimized.

Sign in to view

fbusato reviewed Jun 12, 2026

View reviewed changes

nanan-nvidia marked this pull request as ready for review June 22, 2026 22:09

nanan-nvidia requested review from a team as code owners June 22, 2026 22:09

nanan-nvidia requested review from ericniebler, gonidelis, robertmaynard and srinivasyadav18 June 22, 2026 22:09

cccl-authenticator-app Bot moved this from Blocked to In Review in CCCL Jun 22, 2026

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread cub/benchmarks/bench/transform/tile/copy.cu

Comment thread cub/cub/device/device_transform.cuh

Comment thread cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh

Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh

Comment thread cub/test/CMakeLists.txt

This comment has been minimized.

Sign in to view

bernhardmgruber reviewed Jun 23, 2026

View reviewed changes

nanan-nvidia added 2 commits June 23, 2026 13:17

tile: drop internal DeviceTransform struct; fold tile-size pick into …

da688b6

…free-function dispatch

tile: remove unused fill_kernel (no Fill hook wired)

ebede88

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

nanan-nvidia added 3 commits June 23, 2026 13:26

tile: drop redundant inline comments in dispatch

cd8d429

tile: use Apache-2.0 license header for new bench files

8a8724b

tile: clang-format dispatch and fix stale struct reference in comment

20fd237

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

nanan-nvidia added 2 commits June 23, 2026 13:43

tile: mark out_ptr const in runtime_preconditions_valid

581fcf4

tile: drop __restrict__ from kernel params (API permits in-place tran…

3ce1647

…sforms)

tile: reflow transform_kernel signature (clang-format)

f7729e7

		auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
		auto out = ct::assume_aligned<16>(out_);

		#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()
		// Opt-in tile path. When the (Op, T, NIn) combo is trait-eligible and the device is sm_80+, we check the

Conversation

nanan-nvidia commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

miscco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nanan-nvidia Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber commented Jun 2, 2026

Uh oh!

nanan-nvidia commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fbusato left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nanan-nvidia commented Jun 12, 2026

Uh oh!

This comment has been minimized.

nanan-nvidia commented Jun 12, 2026

Uh oh!

This comment has been minimized.

nanan-nvidia commented Jun 12, 2026

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nanan-nvidia Jun 2, 2026 •

edited

Loading

nanan-nvidia commented Jun 4, 2026 •

edited

Loading

bernhardmgruber Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading