This CUTLASS notes series starts from a minimal GEMM implementation and gradually incorporates CuTe, various CUTLASS components, and features of newer architectures such as Hopper and Blackwell, ultimately building up to a high-performance fused GEMM operator.
```bash
git clone https://github.com/ArthurinRUC/cutlass-notes.git
make update  # clone cutlass
```

All example code in this GitHub repository can be compiled and run by simply executing the Python script. For example:

```bash
cd 01-minimal-gemm
python minimal_gemm.py
```

Each example also ships a CuTe DSL Python port (cutedsl_*.py) alongside the original .cu / .py pair. The DSL port skips the C++ build step entirely: there is nothing to compile, just an import of the official CuTe DSL Python package.
The ports also enable TVM-FFI, so each pre-compiled callable accepts raw torch.Tensor arguments directly and runs on torch.cuda.current_stream() without per-call dlpack conversion or explicit stream plumbing. The extra runtime requirements are:
```bash
pip install \
  "nvidia-cutlass-dsl>=4.3.5" \
  "cuda-python>=12.9" \
  "cuda-bindings>=12.9" \
  "apache-tvm-ffi>=0.1.8" \
  "torch_c_dlpack_ext>=0.1.5"
```

cuda-bindings provides cuda.bindings.driver.CUstream (used in the @cute.jit signature). apache-tvm-ffi and torch_c_dlpack_ext back the --enable-tvm-ffi codegen path and the from_dlpack(..., enable_tvm_ffi=True) runtime tensor binding.
Then run the DSL example the same way you'd run the original Python launcher:
```bash
cd 01-minimal-gemm
python cutedsl_minimal_gemm.py
```

The DSL ports use @cute.jit / @cute.kernel to express the kernel and cute.compile(..., options="--enable-tvm-ffi") to pre-compile each is_gemm / dtype specialization once per run. Compile templates use make_cute_tensor(...) (a from_dlpack(..., enable_tvm_ffi=True) wrapper) plus make_fake_stream(use_tvm_ffi_env_stream=True); at runtime the compiled callable is invoked with bare torch.Tensor arguments, and the DSL runs them on torch.cuda.current_stream() automatically (the environment-stream pattern). The host-side test harness mirrors the original (same seed, same Success / Failed summary), so a passing CUDA build and a passing DSL build print structurally identical output.
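As a rough sketch of that flow: the gemm_host, make_cute_tensor, and make_fake_stream names and the import path below are illustrative stand-ins for what the example module actually exports, not its exact API.

```python
import torch
import cutlass.cute as cute

# Hypothetical names: the real helpers live in the example's cutedsl_*.py file.
from cutedsl_minimal_gemm import gemm_host, make_cute_tensor, make_fake_stream

M, N, K = 16, 8, 8  # shapes/dtypes chosen to match a single m16n8k8 fp16 MMA; illustrative only
a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")
c = torch.zeros(M, N, dtype=torch.float32, device="cuda")

# Pre-compile one dtype specialization through the TVM-FFI codegen path,
# using fake tensor / stream templates as compile-time placeholders.
compiled_gemm = cute.compile(
    gemm_host,
    make_cute_tensor(a), make_cute_tensor(b), make_cute_tensor(c),
    make_fake_stream(use_tvm_ffi_env_stream=True),
    options="--enable-tvm-ffi",
)

# At call time the compiled callable takes bare torch.Tensors; the work lands
# on torch.cuda.current_stream() via the environment-stream pattern.
compiled_gemm(a, b, c)
torch.cuda.synchronize()
```

Because the stream is resolved from the environment at call time, the same compiled callable works under whatever torch.cuda.stream() context is active, with no stream argument in its signature.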
| Notes | Summary | Links |
|---|---|---|
| 00-Intro | Brief introduction to CUTLASS | intro |
| 01-minimal-gemm | - Introduces CuTe fundamentals<br>- Implements a 16x8x8 GEMM kernel using a single MMA instruction from scratch<br>- Python kernel invocation, precision validation & performance benchmarking<br>- Profiling with Nsight Compute (ncu) | minimal-gemm |
| 02-mixed-precision-gemm | - Implements mixed-precision GEMM supporting varying input/output/accumulation precisions<br>- Explores technical details of numerical precision conversion within kernels<br>- Demonstrates a custom FP8 GEMM kernel implemented via PTX instructions (for MMA ops unsupported by CUTLASS) | mixed-precision-gemm |
| 03-tiled-mma | - Introduces the key conceptual model of the GEMM operator: three-level tiling<br>- Details the implementation of Tiled MMA operations in CUTLASS CuTe<br>- Explains the usage and semantics of the various Tiled MMA API parameters<br>- Extends the GEMM kernel from a single instruction to a single tile operation | tiled-mma |
| 04-tiled-copy | - Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory<br>- Describes the API parameters and semantics of TiledCopy<br>- Demonstrates how to implement data copying at the tile level<br>- Introduces foundational knowledge of GPU global-memory access characteristics | tiled-copy |
| 05-block-mma | - Extends Tiled MMA to the block level for larger-scale GEMM computations<br>- Explains how multiple Tiled MMA operations are combined within a thread block<br>- Describes the tiling and coordination of TiledCopy and TiledMMA at the block level<br>- Illustrates the hierarchical dataflow from global memory to shared memory to registers for block-level MMA | block-mma |
| 06-block-copy | - Stages A / B / (C) through shared memory before the tensor-core MMA<br>- Introduces the gmem→smem→rmem dataflow, with explicit G2S / S2R / R2S / S2G TiledCopies<br>- Uses 128-bit cp.async for the gmem→smem path and AutoVectorizingCopy (CopyUniversalOp in the CuTe DSL) for the rest<br>- Walks through dynamic shared-memory sizing and the lifecycle of the A/B/C/O buffers within a single block | block-copy |
This project is licensed under the MIT License - see the LICENSE file for details.