CUTLASS Notes

The CUTLASS notes series begins with a minimal GEMM implementation and gradually expands to incorporate CuTe, various CUTLASS components, and features of newer architectures such as Hopper and Blackwell, ultimately building toward a high-performance fused GEMM operator.

Usage

git clone https://github.com/ArthurinRUC/cutlass-notes.git

make update  # clone cutlass

Run sample code

All example code in this repository can be compiled and run by executing the corresponding Python launcher script. For example:

cd 01-minimal-gemm
python minimal_gemm.py

CuTe DSL versions

Each example also ships a CuTe DSL Python port (cutedsl_*.py) alongside the original .cu / .py pair. The DSL port skips the C++ build step entirely; there is nothing to build by hand, just an import of the official CuTe DSL Python package.

The ports also enable TVM-FFI, so each pre-compiled callable accepts raw torch.Tensor arguments directly and runs on torch.cuda.current_stream() without per-call dlpack conversion or explicit stream plumbing. The extra runtime requirements are:

pip install \
  "nvidia-cutlass-dsl>=4.3.5" \
  "cuda-python>=12.9" \
  "cuda-bindings>=12.9" \
  "apache-tvm-ffi>=0.1.8" \
  "torch_c_dlpack_ext>=0.1.5"

cuda-bindings provides cuda.bindings.driver.CUstream (used in the @cute.jit signature). apache-tvm-ffi + torch_c_dlpack_ext back the --enable-tvm-ffi codegen path and the from_dlpack(..., enable_tvm_ffi=True) runtime tensor binding.
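As a reference point, the tensor binding these packages provide looks roughly like the sketch below. This is a simplified illustration, not the repo's exact code; the real call sites live in the cutedsl_*.py files, and the import paths follow the official CuTe DSL examples.

import torch
import cuda.bindings.driver as cuda               # provides cuda.CUstream for the @cute.jit signature
from cutlass.cute.runtime import from_dlpack      # CuTe DSL tensor binding

x = torch.randn(16, 8, device="cuda", dtype=torch.float16)

# Wrap a torch.Tensor as a CuTe tensor; enable_tvm_ffi=True is what later lets
# the pre-compiled callable accept the raw torch.Tensor directly (see above).
x_cute = from_dlpack(x, enable_tvm_ffi=True)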

Then run the DSL example the same way you'd run the original Python launcher:

cd 01-minimal-gemm
python cutedsl_minimal_gemm.py

The DSL ports use @cute.jit / @cute.kernel to express the kernel, and cute.compile(..., options="--enable-tvm-ffi") to pre-compile each is_gemm / dtype specialization once per run. The compile templates are built with make_cute_tensor(...) (a from_dlpack(..., enable_tvm_ffi=True) wrapper) plus make_fake_stream(use_tvm_ffi_env_stream=True); at runtime the compiled callable is invoked with bare torch.Tensor arguments, and the DSL synchronizes with torch.cuda.current_stream() automatically (the environment-stream pattern). The host-side test harness mirrors the original: same seed, same Success / Failed summary, so a passing CUDA build and a passing DSL build print structurally identical output.
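Put together, the compile-once / call-with-torch-tensors flow looks roughly like the sketch below. This is a minimal illustration under assumptions, not the repo's actual kernel: the kernel body is elided, the grid/block sizes are placeholders, and make_cute_tensor / make_fake_stream stand for the repo helpers described above (import them from the example file).

import torch
import cutlass.cute as cute
import cuda.bindings.driver as cuda

@cute.kernel
def gemm_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    ...  # per-thread MMA work, elided here (see cutedsl_minimal_gemm.py)

@cute.jit
def gemm_host(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor,
              stream: cuda.CUstream):
    # Placeholder launch configuration for illustration only.
    gemm_kernel(mA, mB, mC).launch(grid=[1, 1, 1], block=[32, 1, 1], stream=stream)

a = torch.randn(16, 8, device="cuda", dtype=torch.float16)
b = torch.randn(8, 8, device="cuda", dtype=torch.float16)
c = torch.zeros(16, 8, device="cuda", dtype=torch.float16)

# Compile one specialization per dtype/shape, with TVM-FFI codegen enabled.
# make_cute_tensor is the repo's from_dlpack(..., enable_tvm_ffi=True) wrapper;
# make_fake_stream(use_tvm_ffi_env_stream=True) is its environment-stream placeholder.
compiled = cute.compile(
    gemm_host,
    make_cute_tensor(a), make_cute_tensor(b), make_cute_tensor(c),
    make_fake_stream(use_tvm_ffi_env_stream=True),
    options="--enable-tvm-ffi",
)

# Runtime: bare torch.Tensor arguments and no explicit stream; the callable
# synchronizes with torch.cuda.current_stream() (the environment-stream pattern).
compiled(a, b, c)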

Note list

00-Intro
- Brief introduction to CUTLASS
Link: intro

01-minimal-gemm
- Introduces CuTe fundamentals
- Implements a 16x8x8 GEMM kernel using a single MMA instruction from scratch
- Python kernel invocation, precision validation & performance benchmarking
- Profiling with Nsight Compute (ncu)
Link: minimal-gemm

02-mixed-precision-gemm
- Implements mixed-precision GEMM supporting varying input/output/accumulation precisions
- Explores technical details of numerical precision conversion within kernels
- Demonstrates a custom FP8 GEMM kernel implementation via PTX instructions (for CUTLASS-unsupported MMA ops)
Link: mixed-precision-gemm

03-tiled-mma
- Introduces the key conceptual model of the GEMM operator: Three-Level Tiling
- Details the implementation of Tiled MMA operations in CUTLASS CuTe
- Explains the usage and semantics of the various parameters in the Tiled MMA API
- Extends the GEMM kernel from a single instruction to a single tile operation
Link: tiled-mma

04-tiled-copy
- Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory
- Describes the API parameters and semantics of TiledCopy
- Demonstrates how to implement data copying at the tile level
- Introduces foundational knowledge of GPU global memory access characteristics
Link: tiled-copy

05-block-mma
- Extends Tiled MMA to the block level for larger-scale GEMM computations
- Explains how multiple Tiled MMA operations are combined within a thread block
- Describes the tiling and coordination of TiledCopy and TiledMMA at the block level
- Illustrates the hierarchical dataflow from global memory to shared memory to registers for block-level MMA
Link: block-mma

06-block-copy
- Stages A / B / (C) through shared memory before the tensor-core MMA
- Introduces the gmem→smem→rmem dataflow, with explicit G2S / S2R / R2S / S2G TiledCopies
- Uses 128-bit cp.async for the gmem→smem path and AutoVectorizingCopy (CopyUniversalOp in CuTe DSL) for the rest
- Walks through dynamic shared-memory sizing and the lifecycle of A/B/C/O buffers within a single block
Link: block-copy

License

This project is licensed under the MIT License - see the LICENSE file for details.
