Skip to content

[Tile][WIP] tile DeviceTransform port#9210

Open
nanan-nvidia wants to merge 83 commits into
NVIDIA:mainfrom
nanan-nvidia:tile-device-transform
Open

[Tile][WIP] tile DeviceTransform port#9210
nanan-nvidia wants to merge 83 commits into
NVIDIA:mainfrom
nanan-nvidia:tile-device-transform

Conversation

@nanan-nvidia

Copy link
Copy Markdown

Highly experimental, opening as Draft to share a work-in-progress port of cub::DeviceTransform (#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.

Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a __tile__ function, and it must consist of tile operations and must be inlinable.

Current B200 benchmark on pytorch and babel:

pytorch (tile / cub/ delta = tile - cub, BW utilisation %)

op T 2^16 2^20 2^24 2^28 2^31
relu half $\ \ 0.4/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.6/\ \ 6.8/{\color{red}-\ \ 0.2}$ $47.9/47.4/{\color{green}+\ \ 0.5}$ $83.5/83.1/{\color{green}+\ \ 0.4}$ $87.8/80.2/{\color{green}+\ \ 7.6}$
relu bf16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.7/\ \ 7.0/{\color{red}-\ \ 0.3}$ $47.5/46.2/{\color{green}+\ \ 1.3}$ $83.3/83.0/{\color{green}+\ \ 0.3}$ $87.7/80.9/{\color{green}+\ \ 6.8}$
relu f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $64.7/65.0/{\color{red}-\ \ 0.3}$ $88.4/88.8/{\color{red}-\ \ 0.3}$ $90.8/91.1/{\color{red}-\ \ 0.3}$
sigmoid half $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.9/\ \ 6.6/{\color{green}+\ \ 0.2}$ $34.5/31.5/{\color{green}+\ \ 3.0}$ $48.1/45.5/{\color{green}+\ \ 2.5}$ $49.3/46.8/{\color{green}+\ \ 2.6}$
sigmoid bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.3}$ $34.8/32.0/{\color{green}+\ \ 2.8}$ $48.1/45.5/{\color{green}+\ \ 2.6}$ $49.4/46.8/{\color{green}+\ \ 2.7}$
sigmoid f32 $\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $61.3/56.8/{\color{green}+\ \ 4.5}$ $83.8/76.7/{\color{green}+\ \ 7.1}$ $79.3/73.3/{\color{green}+\ \ 6.0}$
tanh half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.3}$ $39.6/35.5/{\color{green}+\ \ 4.1}$ $58.5/51.6/{\color{green}+\ \ 6.9}$ $60.4/53.2/{\color{green}+\ \ 7.2}$
tanh bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.2}$ $38.4/35.5/{\color{green}+\ \ 2.9}$ $55.3/51.4/{\color{green}+\ \ 3.9}$ $56.9/53.1/{\color{green}+\ \ 3.8}$
tanh f32 $\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$ $13.0/13.3/{\color{red}-\ \ 0.3}$ $64.8/57.5/{\color{green}+\ \ 7.3}$ $88.8/76.7/{\color{green}+12.0}$ $86.7/75.0/{\color{green}+11.6}$
gelu half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $35.3/30.5/{\color{green}+\ \ 4.8}$ $48.8/43.2/{\color{green}+\ \ 5.5}$ $50.1/44.4/{\color{green}+\ \ 5.7}$
gelu bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $34.9/30.6/{\color{green}+\ \ 4.4}$ $48.3/43.3/{\color{green}+\ \ 5.1}$ $49.6/44.5/{\color{green}+\ \ 5.1}$
gelu f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $13.1/13.3/{\color{red}-\ \ 0.2}$ $64.2/53.5/{\color{green}+10.8}$ $85.2/72.8/{\color{green}+12.4}$ $80.5/69.2/{\color{green}+11.4}$
sin half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$ $33.3/30.4/{\color{green}+\ \ 2.8}$ $46.9/41.5/{\color{green}+\ \ 5.3}$ $48.1/42.6/{\color{green}+\ \ 5.5}$
sin bf16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$ $33.2/30.4/{\color{green}+\ \ 2.8}$ $46.8/41.6/{\color{green}+\ \ 5.2}$ $48.1/42.6/{\color{green}+\ \ 5.5}$
sin f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $12.9/13.3/{\color{red}-\ \ 0.3}$ $60.4/53.4/{\color{green}+\ \ 7.0}$ $79.8/72.7/{\color{green}+\ \ 7.1}$ $76.2/69.5/{\color{green}+\ \ 6.7}$
exp half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.8/\ \ 6.7/{\color{green}+\ \ 0.1}$ $46.9/38.8/{\color{green}+\ \ 8.1}$ $74.8/61.0/{\color{green}+13.8}$ $73.4/60.3/{\color{green}+13.1}$
exp bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $45.5/38.9/{\color{green}+\ \ 6.6}$ $71.2/62.1/{\color{green}+\ \ 9.1}$ $73.6/61.7/{\color{green}+11.9}$
exp f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $64.7/65.4/{\color{red}-\ \ 0.7}$ $88.5/87.2/{\color{green}+\ \ 1.2}$ $90.8/82.8/{\color{green}+\ \ 8.1}$
add half $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.1/10.0/{\color{green}+\ \ 0.1}$ $57.1/53.5/{\color{green}+\ \ 3.6}$ $88.6/86.8/{\color{green}+\ \ 1.8}$ $92.5/84.1/{\color{green}+\ \ 8.4}$
add bf16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.2/10.0/{\color{green}+\ \ 0.2}$ $56.7/53.5/{\color{green}+\ \ 3.2}$ $88.6/86.9/{\color{green}+\ \ 1.8}$ $92.5/85.7/{\color{green}+\ \ 6.8}$
add f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.8/17.9/{\color{red}-\ \ 0.2}$ $70.1/69.9/{\color{green}+\ \ 0.2}$ $91.2/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.6/{\color{red}-\ \ 0.9}$
sub half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{red}-\ \ 0.0}$ $57.2/53.4/{\color{green}+\ \ 3.8}$ $88.7/86.8/{\color{green}+\ \ 1.8}$ $92.5/84.0/{\color{green}+\ \ 8.5}$
sub bf16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $57.4/53.4/{\color{green}+\ \ 4.0}$ $88.6/86.9/{\color{green}+\ \ 1.8}$ $92.4/85.6/{\color{green}+\ \ 6.9}$
sub f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.0}$ $17.7/17.7/{\color{green}+\ \ 0.1}$ $70.7/70.0/{\color{green}+\ \ 0.7}$ $91.2/92.1/{\color{red}-\ \ 0.8}$ $92.7/93.7/{\color{red}-\ \ 1.0}$
mul half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $57.7/53.5/{\color{green}+\ \ 4.3}$ $88.6/86.8/{\color{green}+\ \ 1.8}$ $92.6/83.9/{\color{green}+\ \ 8.6}$
mul bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.1/10.0/{\color{green}+\ \ 0.1}$ $57.7/53.5/{\color{green}+\ \ 4.3}$ $88.7/86.8/{\color{green}+\ \ 1.8}$ $92.5/85.5/{\color{green}+\ \ 6.9}$
mul f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $19.3/18.1/{\color{green}+\ \ 1.1}$ $71.0/70.0/{\color{green}+\ \ 1.0}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.6/{\color{red}-\ \ 0.8}$
div half $\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$ $10.0/10.0/{\color{red}-\ \ 0.0}$ $53.5/49.3/{\color{green}+\ \ 4.3}$ $83.2/73.2/{\color{green}+10.0}$ $79.4/69.9/{\color{green}+\ \ 9.5}$
div bf16 $\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$ $\ \ 9.5/10.0/{\color{red}-\ \ 0.5}$ $54.4/49.2/{\color{green}+\ \ 5.2}$ $82.5/69.7/{\color{green}+12.8}$ $79.8/71.6/{\color{green}+\ \ 8.2}$
div f32 $\ \ 1.5/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.5/17.2/{\color{green}+\ \ 0.3}$ $69.9/67.4/{\color{green}+\ \ 2.5}$ $92.3/85.1/{\color{green}+\ \ 7.2}$ $93.6/81.8/{\color{green}+11.8}$
le half $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.9/53.4/{\color{green}+\ \ 4.5}$ $89.2/84.7/{\color{green}+\ \ 4.5}$ $92.7/82.5/{\color{green}+10.2}$
le bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.1}$ $10.0/\ \ 9.6/{\color{green}+\ \ 0.5}$ $57.9/54.0/{\color{green}+\ \ 3.9}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.7/83.1/{\color{green}+\ \ 9.6}$
le f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.2/17.6/{\color{green}+\ \ 0.7}$ $70.6/70.3/{\color{green}+\ \ 0.3}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.9/{\color{red}-\ \ 1.2}$
ge half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.5/53.4/{\color{green}+\ \ 4.1}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.8/82.6/{\color{green}+10.2}$
ge bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.8/{\color{green}+\ \ 0.2}$ $57.5/53.4/{\color{green}+\ \ 4.1}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.7/83.1/{\color{green}+\ \ 9.6}$
ge f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.1}$ $18.4/17.5/{\color{green}+\ \ 0.9}$ $70.2/69.9/{\color{green}+\ \ 0.3}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.7/93.9/{\color{red}-\ \ 1.2}$
fmin half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$ $57.4/53.5/{\color{green}+\ \ 3.9}$ $89.1/84.0/{\color{green}+\ \ 5.0}$ $92.7/80.2/{\color{green}+12.5}$
fmin bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$ $57.9/53.4/{\color{green}+\ \ 4.5}$ $89.1/84.0/{\color{green}+\ \ 5.1}$ $92.6/81.7/{\color{green}+10.9}$
fmin f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $19.3/18.5/{\color{green}+\ \ 0.8}$ $70.7/70.6/{\color{green}+\ \ 0.1}$ $91.2/92.0/{\color{red}-\ \ 0.8}$ $92.8/93.5/{\color{red}-\ \ 0.7}$
fmax half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.7/53.4/{\color{green}+\ \ 4.3}$ $89.1/83.8/{\color{green}+\ \ 5.3}$ $92.7/80.3/{\color{green}+12.4}$
fmax bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $58.1/53.5/{\color{green}+\ \ 4.7}$ $89.1/84.0/{\color{green}+\ \ 5.0}$ $92.6/81.7/{\color{green}+10.9}$
fmax f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.7/18.2/{\color{green}+\ \ 0.5}$ $69.8/68.5/{\color{green}+\ \ 1.3}$ $91.2/92.0/{\color{red}-\ \ 0.8}$ $92.8/93.5/{\color{red}-\ \ 0.7}$

babel

op T 2^16 2^20 2^24 2^28 2^31
mul i8 $\ \ 0.2/\ \ 0.3/{\color{red}-\ \ 0.1}$ $\ \ 3.5/\ \ 3.4/{\color{green}+\ \ 0.1}$ $34.5/30.2/{\color{green}+\ \ 4.3}$ $72.9/58.9/{\color{green}+14.0}$ $79.4/59.8/{\color{green}+19.6}$
mul i16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.7/\ \ 6.7/{\color{red}-\ \ 0.0}$ $48.2/47.5/{\color{green}+\ \ 0.7}$ $85.2/85.5/{\color{red}-\ \ 0.2}$ $89.4/82.2/{\color{green}+\ \ 7.2}$
mul f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $12.9/13.3/{\color{red}-\ \ 0.3}$ $61.6/65.3/{\color{red}-\ \ 3.7}$ $87.5/89.0/{\color{red}-\ \ 1.5}$ $89.6/91.0/{\color{red}-\ \ 1.5}$
mul f64 $\ \ 1.7/\ \ 2.1/{\color{red}-\ \ 0.4}$ $21.8/21.5/{\color{green}+\ \ 0.3}$ $73.8/76.1/{\color{red}-\ \ 2.2}$ $88.8/90.0/{\color{red}-\ \ 1.3}$ $89.8/91.1/{\color{red}-\ \ 1.3}$
add i8 $\ \ 0.3/\ \ 0.4/{\color{red}-\ \ 0.1}$ $\ \ 5.2/\ \ 5.0/{\color{green}+\ \ 0.1}$ $38.8/39.9/{\color{red}-\ \ 1.1}$ $66.9/71.3/{\color{red}-\ \ 4.4}$ $71.2/70.4/{\color{green}+\ \ 0.8}$
add i16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $58.4/54.7/{\color{green}+\ \ 3.7}$ $88.7/86.9/{\color{green}+\ \ 1.9}$ $92.5/85.3/{\color{green}+\ \ 7.2}$
add f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.7/17.5/{\color{green}+\ \ 0.2}$ $70.2/68.8/{\color{green}+\ \ 1.5}$ $91.2/92.1/{\color{red}-\ \ 0.9}$ $92.8/93.7/{\color{red}-\ \ 0.9}$
add f64 $\ \ 3.0/\ \ 3.1/{\color{red}-\ \ 0.1}$ $30.9/31.5/{\color{red}-\ \ 0.6}$ $80.3/79.9/{\color{green}+\ \ 0.4}$ $92.0/93.3/{\color{red}-\ \ 1.3}$ $93.0/94.1/{\color{red}-\ \ 1.1}$
triad i8 $\ \ 0.4/\ \ 0.4/{\color{red}-\ \ 0.0}$ $\ \ 5.1/\ \ 5.0/{\color{green}+\ \ 0.0}$ $36.9/36.9/{\color{green}+\ \ 0.0}$ $64.1/67.4/{\color{red}-\ \ 3.3}$ $66.2/64.6/{\color{green}+\ \ 1.6}$
triad i16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.9/53.5/{\color{green}+\ \ 4.4}$ $90.0/85.9/{\color{green}+\ \ 4.0}$ $93.0/83.3/{\color{green}+\ \ 9.6}$
triad f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.9/17.6/{\color{green}+\ \ 1.3}$ $71.0/70.0/{\color{green}+\ \ 1.1}$ $91.3/92.3/{\color{red}-\ \ 1.0}$ $92.9/91.6/{\color{green}+\ \ 1.3}$
triad f64 $\ \ 3.2/\ \ 3.1/{\color{green}+\ \ 0.1}$ $30.5/29.7/{\color{green}+\ \ 0.8}$ $79.9/79.8/{\color{green}+\ \ 0.1}$ $92.1/93.5/{\color{red}-\ \ 1.4}$ $93.0/93.8/{\color{red}-\ \ 0.8}$
nstream i8 $\ \ 0.4/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.6/\ \ 6.7/{\color{red}-\ \ 0.0}$ $37.7/42.6/{\color{red}-\ \ 4.9}$ $56.5/65.8/{\color{red}-\ \ 9.2}$ $57.3/64.0/{\color{red}-\ \ 6.6}$
nstream i16 $\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$ $13.3/13.3/{\color{green}+\ \ 0.0}$ $61.9/60.9/{\color{green}+\ \ 1.0}$ $91.7/85.2/{\color{green}+\ \ 6.5}$ $95.0/81.5/{\color{green}+13.6}$
nstream f32 $\ \ 2.1/\ \ 2.1/{\color{green}+\ \ 0.1}$ $22.0/21.4/{\color{green}+\ \ 0.6}$ $74.6/73.4/{\color{green}+\ \ 1.2}$ $93.4/92.7/{\color{green}+\ \ 0.8}$ $95.3/93.4/{\color{green}+\ \ 1.9}$
nstream f64 $\ \ 3.8/\ \ 3.6/{\color{green}+\ \ 0.2}$ $35.5/35.5/{\color{green}+\ \ 0.0}$ $83.3/83.2/{\color{green}+\ \ 0.0}$ $94.5/94.5/{\color{red}-\ \ 0.1}$ $95.4/95.5/{\color{red}-\ \ 0.0}$

copy, grayscale, fill

op T 2^16 2^20 2^24 2^28 2^31
copy i8 $\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$ $\ \ 4.0/\ \ 3.4/{\color{green}+\ \ 0.6}$ $35.8/30.8/{\color{green}+\ \ 5.1}$ $82.1/72.0/{\color{green}+10.0}$ $89.8/68.5/{\color{green}+21.2}$
copy i16 $\ \ 0.5/\ \ 0.5/{\color{green}+\ \ 0.0}$ $\ \ 6.8/\ \ 6.8/{\color{green}+\ \ 0.1}$ $51.7/47.4/{\color{green}+\ \ 4.4}$ $86.5/84.8/{\color{green}+\ \ 1.7}$ $90.5/80.6/{\color{green}+\ \ 9.9}$
copy i32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $13.0/13.1/{\color{red}-\ \ 0.1}$ $61.8/65.3/{\color{red}-\ \ 3.4}$ $87.7/88.7/{\color{red}-\ \ 1.1}$ $89.6/91.0/{\color{red}-\ \ 1.5}$
copy f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $13.0/13.2/{\color{red}-\ \ 0.1}$ $62.2/65.3/{\color{red}-\ \ 3.1}$ $87.6/88.8/{\color{red}-\ \ 1.1}$ $88.2/91.2/{\color{red}-\ \ 3.0}$
copy f64 $\ \ 2.1/\ \ 2.1/{\color{red}-\ \ 0.0}$ $23.0/22.2/{\color{green}+\ \ 0.8}$ $75.5/75.7/{\color{red}-\ \ 0.2}$ $89.8/90.1/{\color{red}-\ \ 0.3}$ $90.9/91.2/{\color{red}-\ \ 0.3}$
grayscale f32 $\ \ 1.7/\ \ 2.0/{\color{red}-\ \ 0.3}$ $21.7/21.3/{\color{green}+\ \ 0.4}$ $70.7/71.1/{\color{red}-\ \ 0.4}$ $92.3/92.6/{\color{red}-\ \ 0.4}$ $94.0/93.7/{\color{green}+\ \ 0.3}$
grayscale f64 $\ \ 3.5/\ \ 3.8/{\color{red}-\ \ 0.2}$ $35.1/35.1/{\color{red}-\ \ 0.0}$ $79.7/82.7/{\color{red}-\ \ 3.1}$ $92.9/94.4/{\color{red}-\ \ 1.5}$ $93.7/95.6/{\color{red}-\ \ 1.9}$
fill I8 $\ \ 0.1/\ \ 0.1/{\color{red}-\ \ 0.0}$ $\ \ 2.2/\ \ 2.2/{\color{red}-\ \ 0.0}$ $26.3/26.3/{\color{red}-\ \ 0.0}$ $85.1/85.1/{\color{red}-\ \ 0.0}$ $97.4/97.4/{\color{red}-\ \ 0.0}$
fill I16 $\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$ $\ \ 4.4/\ \ 4.4/{\color{green}+\ \ 0.0}$ $42.1/42.6/{\color{red}-\ \ 0.6}$ $91.8/91.9/{\color{red}-\ \ 0.1}$ $98.2/98.2/{\color{red}-\ \ 0.0}$
fill I32 $\ \ 0.6/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 8.5/\ \ 8.3/{\color{green}+\ \ 0.2}$ $60.1/59.3/{\color{green}+\ \ 0.8}$ $95.1/95.0/{\color{green}+\ \ 0.1}$ $98.8/98.7/{\color{green}+\ \ 0.1}$
fill I64 $\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$ $13.9/15.1/{\color{red}-\ \ 1.2}$ $72.8/71.4/{\color{green}+\ \ 1.4}$ $97.4/97.0/{\color{green}+\ \ 0.4}$ $99.1/99.0/{\color{green}+\ \ 0.1}$

We did not do benchmarks on complex, fib and heavy:

  • for complex, cutile does not accept std::complex as a vaild type to form tiles.
  • for fib, with tile semantics, there is no 1-to-1 fair implementation in tile. We can get one by abusing ct::select but it is much slower.
  • for heavy, cutile lowers syntax like T reg[N] to heap allocation.

There was a more detailed write up on fib and heavy here: #9038 (comment)

@nanan-nvidia nanan-nvidia self-assigned this Jun 1, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 1, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 1, 2026

@miscco miscco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great already

Comment thread cub/benchmarks/bench/transform/tile/device_transform.cuh Outdated
Comment on lines +78 to +79
auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
auto out = ct::assume_aligned<16>(out_);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.

@nanan-nvidia nanan-nvidia Jun 2, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.

I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good idea! Please help them improve tile!

Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.

@bernhardmgruber

Copy link
Copy Markdown
Contributor

I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice!

@nanan-nvidia

nanan-nvidia commented Jun 4, 2026

Copy link
Copy Markdown
Author

NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with --enable-tile. I temporarily unblocked myself with e0a31e4

This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with --enable-tile and -DCCCL_ENABLE_TILE_TRANSFORM_DISPATCH.

The basic idea is, at compile time, eligible (Op, T, NIn) combos will be dispatched to tile kernels with traits. We can ship traits based on our benchmarks (i.e. if we know tile is better on some (Op, T, NIn)). The user can also self-register the combos they find tile to be beneficial. To self register, they need to provide three pieces of data:

First, a SIMT functor that will be called at the API:

struct my_tanh {
    template <class T>
    __host__ __device__ T operator()(T v) const {
        return static_cast<T>(::cuda::std::tanh(static_cast<float>(v)));
    }
};

Second, a tile functor that has the same semantical meaning:

#if defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH) && _CCCL_TILE_COMPILATION()
struct tile_my_tanh {
    template <class T>
    __tile__ auto operator()(T v) const {
        namespace ct = cuda::tiles;
        return ct::element_cast<ct::tile_element_t<T>>(
            ct::tanh(ct::element_cast<float>(v)));
    }
};

And third, the trait specialization that links those two functors:

CUB_NAMESPACE_BEGIN
namespace detail::transform::tile {
    template <class T>
    struct tile_eligible<my_tanh, T, 1> : std::true_type {
        using tile_op_type = tile_my_tanh;
    };
    // Optional: hint the tile policy picker that this is MUFU-heavy.
    template <> struct tile_mufu_heavy<my_tanh> : std::true_type {};
}
CUB_NAMESPACE_END
#endif

The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases.

The dispatch flow is as follows:

                 cub::DeviceTransform::Transform(.., op, ..)
                                │
                                ▼
              if constexpr (tile_dispatch_eligible_v<Op, OutIter, InIters...>):
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
       runtime preconditions OK?            standard CUB dispatch
       (16B-aligned pointers, 
        num_items % 16 == 0, 
        num_items ≤ 2^31)
              │
       ┌──────┴──────┐
       ▼             ▼
   tile kernel    standard CUB

Those runtime preconditions exist because current bad cutile performance on those cases.

image

@fbusato fbusato left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.

Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh Outdated
constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
{
constexpr int threads_per_block = 128;
constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blackwell also has LDG.E.256 -> 32 bytes

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. however this would require 32 byte alignment which will be a separate kernel, and I am also not sure about

  1. does cutile generate 256 bit loads
  2. does it help performance wise

need to verify.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of another kernel, can we dispatch online? e.g.

constexpr auto align = (condition) 16 : 32;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean! Yes, this works. Now I just need to validate perf from blackwell. Potentially separate PR

Comment thread cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh Outdated
const int items_for_latency =
static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter));

int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use max

@nanan-nvidia nanan-nvidia changed the title [Tile][WIP] tile DeviceTransform port https://github.com/NVIDIA/cccl/pull/9210#issuecomment-4618162940 Jun 9, 2026
@nanan-nvidia nanan-nvidia removed this from CCCL Jun 9, 2026
@nanan-nvidia nanan-nvidia added this to CCCL Jun 9, 2026
@github-project-automation github-project-automation Bot moved this to In Progress in CCCL Jun 9, 2026
@nanan-nvidia nanan-nvidia removed this from CCCL Jun 9, 2026
@nanan-nvidia nanan-nvidia added this to CCCL Jun 9, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 9, 2026
@github-project-automation github-project-automation Bot moved this from Todo to In Progress in CCCL Jun 9, 2026
@nanan-nvidia nanan-nvidia changed the title https://github.com/NVIDIA/cccl/pull/9210#issuecomment-4618162940 [Tile][WIP] tile DeviceTransform port Jun 9, 2026
Comment thread cub/cub/device/dispatch/dispatch_transform_tile_config.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
Comment thread cub/benchmarks/bench/transform/tile/bench_init.cuh Outdated
Comment thread cub/benchmarks/bench/transform/tile/copy.cu Outdated
}
};

CUB_NAMESPACE_BEGIN

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question. Is CUB_NAMESPACE_BEGIN needed for benchmarks?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is so that we can specialize tile_eligible etc (must be in namespace first)

Comment thread cub/benchmarks/bench/transform/tile/pytorch.cu
Comment thread cub/benchmarks/bench/transform/tile/test_device_transform.cu Outdated
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test b893189

@github-actions

This comment has been minimized.

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 9af44e6

@github-actions

This comment has been minimized.

@nanan-nvidia nanan-nvidia force-pushed the tile-device-transform branch from 9af44e6 to 1960f3f Compare June 12, 2026 09:38
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 1960f3f

@github-actions

This comment has been minimized.

Comment thread cub/cub/device/device_transform.cuh Outdated
Comment thread cub/cub/device/device_transform.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_transform_tile_config.cuh Outdated
// Defined as a literal 1/0 (not (_CCCL_CUB_HAS_TILE_TRANSFORM() && defined(...))) so that
// `#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()` in non-system code (benches, tests) does not
// generate `defined` via macro expansion, which is UB and trips -Wexpansion-to-defined under -Werror.
#if _CCCL_CUB_HAS_TILE_TRANSFORM() && defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bernhardmgruber my feeling is that we should have a CUB specific macro to enable or disable Tile for CUB. Per-API macro looks too invasive. Any thought?

@bernhardmgruber bernhardmgruber Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a CCCL-wide macro? But otherwise, this should definitely go into some more common/non-transform specific header. Like cub/detail/tile_support.cuh or something like that. You may also move the tile traits there.

Comment thread cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh Outdated
constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
{
constexpr int threads_per_block = 128;
constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of another kernel, can we dispatch online? e.g.

constexpr auto align = (condition) 16 : 32;

Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh Outdated
Comment thread cub/benchmarks/bench/transform/tile/copy.cu Outdated
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 88aeee3

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 0c13142

@nanan-nvidia nanan-nvidia marked this pull request as ready for review June 22, 2026 22:09
@nanan-nvidia nanan-nvidia requested review from a team as code owners June 22, 2026 22:09
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Blocked to In Review in CCCL Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an opt-in tile-DSL fast path for cub::DeviceTransform controlled by CCCL_ENABLE_TILE_TRANSFORM_DISPATCH (requires CUDA 13.4+). Introduces compile-time gate macros, public cub::transform trait customization points, SM80+ tile kernels, tuning, dispatch plumbing, DeviceTransform integration, libcudacxx builtin compatibility fixes, a Catch2 test suite, and four nvbench benchmark files.

Changes

Tile-DSL DeviceTransform dispatch

Layer / File(s) Summary
Compile-time config macros and trait customization surface
cub/CMakeLists.txt, cub/cub/device/dispatch/dispatch_transform_tile_config.cuh, cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh
Defines _CCCL_CUB_HAS_TILE_TRANSFORM and _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED macros, public tile_eligible_v/tile_operator/tile_operator_t/tile_mufu_heavy_v customization points in cub::transform, built-in __half/__nv_bfloat16 specializations, and the CMake option with CUDA 13.4+ enforcement.
Tile kernel, aligned partition view, and tuning
cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh, cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh
Adds make_aligned_partition_view wrapping pointers in 16-byte-aligned cuda::tiles::tensor_span and partition views, SM80+-gated transform_kernel __tile_global__ kernel performing masked loads/applies/stores, and pick_tile_size computing items/thread from bytes-in-flight latency, vector-load granularity, occupancy, and MUFU-heavy constraints.
Dispatch plumbing
cub/cub/device/dispatch/dispatch_transform_tile.cuh
Implements launch_impl (CTA-per-tile launch via ceil_div), tile_dispatch_eligible_v compile-time predicate, device_supports_tile() sm_80+ runtime gate, runtime_preconditions_valid() 16-byte alignment/divisibility-by-16 check, and dispatch() bridge unwrapping contiguous iterators and mapping TransformOp to tile_operator_t.
DeviceTransform integration
cub/cub/device/device_transform.cuh
Adds conditional tile-dispatch includes and early tile-path routing in DeviceTransform::__transform_internal that checks compile-time and runtime eligibility before falling through to standard dispatch.
libcudacxx builtin compatibility
libcudacxx/include/cuda/std/__cccl/builtin.h, libcudacxx/include/cuda/std/__new/launder.h
Disables _CCCL_BUILTIN_ASSUME_ALIGNED (replaced with passthrough) and _CCCL_BUILTIN_LAUNDER when _CCCL_TILE_COMPILATION() is true.
Catch2 tile-dispatch tests and CMake wiring
cub/test/CMakeLists.txt, cub/test/catch2_test_device_transform_tile.cu
Adds unary (square_op) and binary (add_op) tile-dispatch Catch2 tests with scalar/tile functor pairs, tile_eligible_v/tile_operator registrations, exact equality assertions against std::transform reference, and a fallback test when dispatch is disabled. CMakeLists.txt applies --enable-tile and the define.
nvbench identity-copy benchmark
cub/benchmarks/bench/transform/tile/copy.cu
Adds identity transform functor with optional tile variant, registered via tile_eligible_v/tile_operator, with nvbench harness measuring bandwidth across configurable types and power-of-two sizes.
nvbench RGB-to-grayscale benchmark
cub/benchmarks/bench/transform/tile/grayscale.cu
Adds rgb_to_y functor computing weighted luminance from three inputs with optional tile variant (arity 3), registered via tile_eligible_v/tile_operator, with nvbench harness measuring three-input reduce-combine transform.
nvbench BabelStream ops benchmark
cub/benchmarks/bench/transform/tile/babelstream.cu
Adds stateless scalar ops (mul/add/triad/nstream) with optional tile variants registered for respective arities, element-type and size configuration, and four separate benchmarks with output/input metrics.
nvbench PyTorch-style ops benchmark
cub/benchmarks/bench/transform/tile/pytorch.cu
Adds unary ops (ReLU/Sigmoid/Tanh/GELU/Sin/Exp) and binary ops (add/sub/mul/div/le/ge/fmin/fmax) with optional tile variants using cuda::tiles math/comparison, MUFU-heavy hints, float conversion helpers, and full macro-based registration across types and sizes.
nvbench CMake wiring
cub/benchmarks/CMakeLists.txt
Detects tile benchmarks via /transform/tile/ path and applies --enable-tile flag and feature macro when CCCL_ENABLE_TILE_TRANSFORM_DISPATCH is enabled.

Suggested reviewers

  • wmaxey

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
cub/test/catch2_test_device_transform_tile.cu (1)

23-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

suggestion: Remove unused namespace alias.

The alias ct = ::cuda::tiles is declared but never referenced in the file.

Cleanup
-namespace ct = ::cuda::tiles;
-
cub/CMakeLists.txt (1)

29-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: Reject unsupported compiler/toolchain combinations explicitly.

When CCCL_ENABLE_TILE_TRANSFORM_DISPATCH=ON and CMAKE_CUDA_COMPILER_ID is not NVIDIA, this block currently does not fail fast. Adding an explicit FATAL_ERROR (or at least a warning) avoids a silent misconfiguration where users think tile dispatch is enabled.

cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh (1)

44-57: 🚀 Performance & Scalability | 🔵 Trivial | 🏗️ Heavy lift

suggestion: Avoid hard-wiring tuning to default compute_capability{10,0} in the dispatch path.

pick_tile_size is cc-aware, but the shown dispatch call-site uses the default argument, so all architectures get the same tile size. That risks regressions on SM80/SM90 where bytes-in-flight targets differ. Consider selecting tile size from actual device cc (or explicit per-arch specializations) before launch.

As per path instructions, cub/**/* review should prioritize performance-regression risks.

Source: Path instructions

cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)

137-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: make the unwrapped iterator locals const.

out_ptr and in_ptrs are not mutated after initialization.

As per coding guidelines, “All non-modified variables must use const.”

Also applies to: 157-158

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2bffc417-f5fd-459c-bad1-ca9bce7e1582

📥 Commits

Reviewing files that changed from the base of the PR and between 1fa6ab4 and 0c13142.

📒 Files selected for processing (16)
  • cub/CMakeLists.txt
  • cub/benchmarks/CMakeLists.txt
  • cub/benchmarks/bench/transform/tile/babelstream.cu
  • cub/benchmarks/bench/transform/tile/copy.cu
  • cub/benchmarks/bench/transform/tile/grayscale.cu
  • cub/benchmarks/bench/transform/tile/pytorch.cu
  • cub/cub/device/device_transform.cuh
  • cub/cub/device/dispatch/dispatch_transform_tile.cuh
  • cub/cub/device/dispatch/dispatch_transform_tile_config.cuh
  • cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh
  • cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh
  • cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh
  • cub/test/CMakeLists.txt
  • cub/test/catch2_test_device_transform_tile.cu
  • libcudacxx/include/cuda/std/__cccl/builtin.h
  • libcudacxx/include/cuda/std/__new/launder.h

Comment thread cub/benchmarks/bench/transform/tile/copy.cu
Comment thread cub/cub/device/device_transform.cuh
Comment thread cub/cub/device/dispatch/dispatch_transform_tile_traits.cuh
Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh
Comment thread cub/test/CMakeLists.txt
@github-actions

This comment has been minimized.

Comment thread cub/benchmarks/bench/transform/tile/babelstream.cu Outdated
Comment on lines +18 to +25
struct mul_op
{
template <class B>
__host__ __device__ auto operator()(B b) const
{
return -(b + b);
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: We should not need new operator definitions. We should just specialize tile_operator for cuda::plus etc. Unless there is a reason this is not possible.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, we need to keep them independent for now

Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_transform_tile.cuh Outdated
struct tile_plus
{
template <class A, class B>
__tile__ auto operator()(A a, B b) const

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just use cuda::std::plus?

Comment on lines +107 to +108
#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()
// Opt-in tile path. When the (Op, T, NIn) combo is trait-eligible and the device is sm_80+, we check the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Should the tuning policy contain a flag bool use_tile that decided whether tile is used or not?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should when we are at a stage where a normal scalar functor could be dispatched to tile; for now I think it is too early.

@fbusato

fbusato commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Why can't we just use cuda::std::plus?

@bernhardmgruber In the current status, it is problematic to support __host__ __device__ __tile__ operators.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)

104-104: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: out_ptr is never modified; mark it const (line 124 in dispatch already does).

-  auto out_ptr           = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output);
+  const auto out_ptr     = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output);

As per coding guidelines: "All non-modified variables must use const".

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82afd949-f888-4a5a-90b9-07d8e642ccdb

📥 Commits

Reviewing files that changed from the base of the PR and between 0c13142 and da688b6.

📒 Files selected for processing (1)
  • cub/cub/device/dispatch/dispatch_transform_tile.cuh

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 20fd237

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh (2)

61-61: 🩺 Stability & Availability | 🟠 Major

important: Add non-negative validation to the runtime gate before launching the tile kernel.

The runtime_preconditions_valid() function at line 96 checks alignment and divisibility by 16, but does not validate num_items >= 0. Line 61 of the kernel uses ct::assume_bounded_below<0>(num_items), which requires num_items >= 0. Since OffsetT is a template parameter that can be unsigned, a negative or large unsigned value cast to int64_t will violate this assumption. Add a representable, non-negative check using the signed-safe pattern: when OffsetT could be unsigned, use cuda::std::is_unsigned_v<OffsetT> ? false : (num_items < 0) instead of a direct comparison.

Source: Coding guidelines


56-56: 🎯 Functional Correctness | 🟠 Major

important: do not add an unchecked no-alias contract to the tile fast path.

__restrict__ qualifiers on kernel parameters declare that out and the ins pointers do not alias. However, the API explicitly permits in-place transforms (line 347 of device_transform.cuh: "May point to the same sequence as \p input"), and the dispatch preconditions check only alignment and divisibility—not aliasing. If the user passes the same pointer as both output and input with aligned/divisible data, the tile kernel launches with a violated __restrict__ contract, resulting in undefined behavior. Either drop the qualifiers or add an aliasing check before kernel launch.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 01a4a08a-c771-4bc4-9987-4afad86a4512

📥 Commits

Reviewing files that changed from the base of the PR and between da688b6 and 8a8724b.

📒 Files selected for processing (6)
  • cub/benchmarks/bench/transform/tile/babelstream.cu
  • cub/benchmarks/bench/transform/tile/copy.cu
  • cub/benchmarks/bench/transform/tile/grayscale.cu
  • cub/benchmarks/bench/transform/tile/pytorch.cu
  • cub/cub/device/dispatch/dispatch_transform_tile.cuh
  • cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh
💤 Files with no reviewable changes (1)
  • cub/cub/device/dispatch/dispatch_transform_tile.cuh
🚧 Files skipped from review as they are similar to previous changes (4)
  • cub/benchmarks/bench/transform/tile/grayscale.cu
  • cub/benchmarks/bench/transform/tile/copy.cu
  • cub/benchmarks/bench/transform/tile/pytorch.cu
  • cub/benchmarks/bench/transform/tile/babelstream.cu

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 3ce1647

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test f7729e7

@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 4h 25m: Pass: 100%/343 | Total: 4d 00h | Max: 1h 57m | Hits: 99%/482196

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants