[Tile][WIP] tile DeviceTransform port#9210
Conversation
| auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_)); | ||
| auto out = ct::assume_aligned<16>(out_); |
There was a problem hiding this comment.
Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.
There was a problem hiding this comment.
Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.
I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark
There was a problem hiding this comment.
That is a very good idea! Please help them improve tile!
Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.
|
I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice! |
|
NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with The basic idea is, at compile time, eligible First, a SIMT functor that will be called at the API: Second, a tile functor that has the same semantical meaning: And third, the trait specialization that links those two functors: The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases. The dispatch flow is as follows: Those runtime preconditions exist because current bad cutile performance on those cases.
|
fbusato
left a comment
There was a problem hiding this comment.
I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.
| constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000) | ||
| { | ||
| constexpr int threads_per_block = 128; | ||
| constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes |
There was a problem hiding this comment.
Blackwell also has LDG.E.256 -> 32 bytes
There was a problem hiding this comment.
true. however this would require 32 byte alignment which will be a separate kernel, and I am also not sure about
- does cutile generate 256 bit loads
- does it help performance wise
need to verify.
There was a problem hiding this comment.
instead of another kernel, can we dispatch online? e.g.
constexpr auto align = (condition) 16 : 32;There was a problem hiding this comment.
I see what you mean! Yes, this works. Now I just need to validate perf from blackwell. Potentially separate PR
| const int items_for_latency = | ||
| static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter)); | ||
|
|
||
| int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency; |
| } | ||
| }; | ||
|
|
||
| CUB_NAMESPACE_BEGIN |
There was a problem hiding this comment.
question. Is CUB_NAMESPACE_BEGIN needed for benchmarks?
There was a problem hiding this comment.
It is so that we can specialize tile_eligible etc (must be in namespace first)
|
/ok to test b893189 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test 9af44e6 |
This comment has been minimized.
This comment has been minimized.
9af44e6 to
1960f3f
Compare
|
/ok to test 1960f3f |
This comment has been minimized.
This comment has been minimized.
| // Defined as a literal 1/0 (not (_CCCL_CUB_HAS_TILE_TRANSFORM() && defined(...))) so that | ||
| // `#if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED()` in non-system code (benches, tests) does not | ||
| // generate `defined` via macro expansion, which is UB and trips -Wexpansion-to-defined under -Werror. | ||
| #if _CCCL_CUB_HAS_TILE_TRANSFORM() && defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH) |
There was a problem hiding this comment.
@bernhardmgruber my feeling is that we should have a CUB specific macro to enable or disable Tile for CUB. Per-API macro looks too invasive. Any thought?
There was a problem hiding this comment.
Why not a CCCL-wide macro? But otherwise, this should definitely go into some more common/non-transform specific header. Like cub/detail/tile_support.cuh or something like that. You may also move the tile traits there.
| constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000) | ||
| { | ||
| constexpr int threads_per_block = 128; | ||
| constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes |
There was a problem hiding this comment.
instead of another kernel, can we dispatch online? e.g.
constexpr auto align = (condition) 16 : 32;|
/ok to test 88aeee3 |
|
/ok to test 0c13142 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds an opt-in tile-DSL fast path for ChangesTile-DSL DeviceTransform dispatch
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (4)
cub/test/catch2_test_device_transform_tile.cu (1)
23-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valuesuggestion: Remove unused namespace alias.
The alias
ct = ::cuda::tilesis declared but never referenced in the file.Cleanup
-namespace ct = ::cuda::tiles; -cub/CMakeLists.txt (1)
29-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winsuggestion: Reject unsupported compiler/toolchain combinations explicitly.
When
CCCL_ENABLE_TILE_TRANSFORM_DISPATCH=ONandCMAKE_CUDA_COMPILER_IDis notNVIDIA, this block currently does not fail fast. Adding an explicitFATAL_ERROR(or at least a warning) avoids a silent misconfiguration where users think tile dispatch is enabled.cub/cub/device/dispatch/tuning/tuning_transform_tile.cuh (1)
44-57: 🚀 Performance & Scalability | 🔵 Trivial | 🏗️ Heavy liftsuggestion: Avoid hard-wiring tuning to default
compute_capability{10,0}in the dispatch path.
pick_tile_sizeis cc-aware, but the shown dispatch call-site uses the default argument, so all architectures get the same tile size. That risks regressions on SM80/SM90 where bytes-in-flight targets differ. Consider selecting tile size from actual device cc (or explicit per-arch specializations) before launch.As per path instructions,
cub/**/*review should prioritize performance-regression risks.Source: Path instructions
cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)
137-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winsuggestion: make the unwrapped iterator locals const.
out_ptrandin_ptrsare not mutated after initialization.As per coding guidelines, “All non-modified variables must use
const.”Also applies to: 157-158
Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2bffc417-f5fd-459c-bad1-ca9bce7e1582
📒 Files selected for processing (16)
cub/CMakeLists.txtcub/benchmarks/CMakeLists.txtcub/benchmarks/bench/transform/tile/babelstream.cucub/benchmarks/bench/transform/tile/copy.cucub/benchmarks/bench/transform/tile/grayscale.cucub/benchmarks/bench/transform/tile/pytorch.cucub/cub/device/device_transform.cuhcub/cub/device/dispatch/dispatch_transform_tile.cuhcub/cub/device/dispatch/dispatch_transform_tile_config.cuhcub/cub/device/dispatch/dispatch_transform_tile_traits.cuhcub/cub/device/dispatch/kernels/kernel_transform_tile.cuhcub/cub/device/dispatch/tuning/tuning_transform_tile.cuhcub/test/CMakeLists.txtcub/test/catch2_test_device_transform_tile.culibcudacxx/include/cuda/std/__cccl/builtin.hlibcudacxx/include/cuda/std/__new/launder.h
This comment has been minimized.
This comment has been minimized.
| struct mul_op | ||
| { | ||
| template <class B> | ||
| __host__ __device__ auto operator()(B b) const | ||
| { | ||
| return -(b + b); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Important: We should not need new operator definitions. We should just specialize tile_operator for cuda::plus etc. Unless there is a reason this is not possible.
There was a problem hiding this comment.
unfortunately, we need to keep them independent for now
| struct tile_plus | ||
| { | ||
| template <class A, class B> | ||
| __tile__ auto operator()(A a, B b) const |
There was a problem hiding this comment.
Why can't we just use cuda::std::plus?
| #if _CCCL_CUB_TILE_TRANSFORM_DISPATCH_ENABLED() | ||
| // Opt-in tile path. When the (Op, T, NIn) combo is trait-eligible and the device is sm_80+, we check the |
There was a problem hiding this comment.
Q: Should the tuning policy contain a flag bool use_tile that decided whether tile is used or not?
There was a problem hiding this comment.
It should when we are at a stage where a normal scalar functor could be dispatched to tile; for now I think it is too early.
@bernhardmgruber In the current status, it is problematic to support |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/dispatch_transform_tile.cuh (1)
104-104: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion:
out_ptris never modified; mark itconst(line 124 indispatchalready does).- auto out_ptr = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output); + const auto out_ptr = THRUST_NS_QUALIFIER::try_unwrap_contiguous_iterator(output);As per coding guidelines: "All non-modified variables must use
const".Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 82afd949-f888-4a5a-90b9-07d8e642ccdb
📒 Files selected for processing (1)
cub/cub/device/dispatch/dispatch_transform_tile.cuh
|
/ok to test 20fd237 |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cub/cub/device/dispatch/kernels/kernel_transform_tile.cuh (2)
61-61: 🩺 Stability & Availability | 🟠 Majorimportant: Add non-negative validation to the runtime gate before launching the tile kernel.
The
runtime_preconditions_valid()function at line 96 checks alignment and divisibility by 16, but does not validatenum_items >= 0. Line 61 of the kernel usesct::assume_bounded_below<0>(num_items), which requiresnum_items >= 0. SinceOffsetTis a template parameter that can be unsigned, a negative or large unsigned value cast toint64_twill violate this assumption. Add a representable, non-negative check using the signed-safe pattern: whenOffsetTcould be unsigned, usecuda::std::is_unsigned_v<OffsetT> ? false : (num_items < 0)instead of a direct comparison.Source: Coding guidelines
56-56: 🎯 Functional Correctness | 🟠 Majorimportant: do not add an unchecked no-alias contract to the tile fast path.
__restrict__qualifiers on kernel parameters declare thatoutand theinspointers do not alias. However, the API explicitly permits in-place transforms (line 347 of device_transform.cuh: "May point to the same sequence as \p input"), and the dispatch preconditions check only alignment and divisibility—not aliasing. If the user passes the same pointer as both output and input with aligned/divisible data, the tile kernel launches with a violated__restrict__contract, resulting in undefined behavior. Either drop the qualifiers or add an aliasing check before kernel launch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 01a4a08a-c771-4bc4-9987-4afad86a4512
📒 Files selected for processing (6)
cub/benchmarks/bench/transform/tile/babelstream.cucub/benchmarks/bench/transform/tile/copy.cucub/benchmarks/bench/transform/tile/grayscale.cucub/benchmarks/bench/transform/tile/pytorch.cucub/cub/device/dispatch/dispatch_transform_tile.cuhcub/cub/device/dispatch/kernels/kernel_transform_tile.cuh
💤 Files with no reviewable changes (1)
- cub/cub/device/dispatch/dispatch_transform_tile.cuh
🚧 Files skipped from review as they are similar to previous changes (4)
- cub/benchmarks/bench/transform/tile/grayscale.cu
- cub/benchmarks/bench/transform/tile/copy.cu
- cub/benchmarks/bench/transform/tile/pytorch.cu
- cub/benchmarks/bench/transform/tile/babelstream.cu
|
/ok to test 3ce1647 |
|
/ok to test f7729e7 |
🥳 CI Workflow Results🟩 Finished in 4h 25m: Pass: 100%/343 | Total: 4d 00h | Max: 1h 57m | Hits: 99%/482196See results here. |

Highly experimental, opening as Draft to share a work-in-progress port of
cub::DeviceTransform(#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a
__tile__function, and it must consist of tile operations and must be inlinable.Current B200 benchmark on
pytorchandbabel:pytorch (tile / cub/ delta = tile - cub, BW utilisation %)
babel
copy, grayscale, fill
We did not do benchmarks on
complex,fibandheavy:complex, cutile does not acceptstd::complexas a vaild type to form tiles.fib, with tile semantics, there is no 1-to-1 fair implementation intile. We can get one by abusingct::selectbut it is much slower.heavy, cutile lowers syntax likeT reg[N]to heap allocation.There was a more detailed write up on
fibandheavyhere: #9038 (comment)