This CUTLASS notes series starts from a minimal GEMM implementation and gradually incorporates CuTe, various CUTLASS components, and features of newer architectures such as Hopper and Blackwell, ultimately building up to a high-performance fused GEMM operator.
```bash
git clone https://github.com/ArthurinRUC/cutlass-notes.git
make update  # clone cutlass
```

All example code in this GitHub repository can be compiled and run by simply executing the Python script. For example:

```bash
cd 01-minimal-gemm
python minimal_gemm.py
```

Each example also ships a CuTe DSL Python port (cutedsl_*.py) alongside the original .cu / .py pair. The DSL port skips the C++ build step entirely: there is nothing to compile, just an import of the official CuTe DSL Python package.
The ports also enable TVM-FFI, so each pre-compiled callable accepts raw torch.Tensor arguments directly and runs on torch.cuda.current_stream() without per-call dlpack conversion or explicit stream plumbing. The extra runtime requirements are:
```bash
pip install \
  "nvidia-cutlass-dsl>=4.3.5" \
  "cuda-python>=12.9" \
  "cuda-bindings>=12.9" \
  "apache-tvm-ffi>=0.1.8" \
  "torch_c_dlpack_ext>=0.1.5"
```

cuda-bindings provides cuda.bindings.driver.CUstream (used in the @cute.jit signature). apache-tvm-ffi and torch_c_dlpack_ext back the --enable-tvm-ffi codegen path and the from_dlpack(..., enable_tvm_ffi=True) runtime tensor binding.
Then run the DSL example the same way you'd run the original Python launcher:
```bash
cd 01-minimal-gemm
python cutedsl_minimal_gemm.py
```

The DSL ports use @cute.jit / @cute.kernel to express the kernel and cute.compile(..., options="--enable-tvm-ffi") to pre-compile each is_gemm / dtype specialization once per run. Compile templates use make_cute_tensor(...) (a from_dlpack(..., enable_tvm_ffi=True) wrapper) plus make_fake_stream(use_tvm_ffi_env_stream=True); at runtime the compiled callable is invoked with bare torch.Tensor arguments, and the DSL runs them on torch.cuda.current_stream() automatically (the environment-stream pattern). The host-side test harness mirrors the original (same seed, same Success / Failed summary), so a passing CUDA build and a passing DSL build print structurally identical output.
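As a rough sketch of that flow: the gemm_host, make_cute_tensor, and make_fake_stream names and the import path below are illustrative stand-ins for what the example module actually exports, not its exact API.

```python
import torch
import cutlass.cute as cute

# Hypothetical names: the real helpers live in the example's cutedsl_*.py file.
from cutedsl_minimal_gemm import gemm_host, make_cute_tensor, make_fake_stream

M, N, K = 16, 8, 8  # shapes/dtypes chosen to match a single m16n8k8 fp16 MMA; illustrative only
a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")
c = torch.zeros(M, N, dtype=torch.float32, device="cuda")

# Pre-compile one dtype specialization through the TVM-FFI codegen path,
# using fake tensor / stream templates as compile-time placeholders.
compiled_gemm = cute.compile(
    gemm_host,
    make_cute_tensor(a), make_cute_tensor(b), make_cute_tensor(c),
    make_fake_stream(use_tvm_ffi_env_stream=True),
    options="--enable-tvm-ffi",
)

# At call time the compiled callable takes bare torch.Tensors; the work lands
# on torch.cuda.current_stream() via the environment-stream pattern.
compiled_gemm(a, b, c)
torch.cuda.synchronize()
```

Because the stream is resolved from the environment at call time, the same compiled callable works under whatever torch.cuda.stream() context is active, with no stream argument in its signature.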
| Notes | Summary | Links |
|---|---|---|
| 00-Intro | Brief introduction to CUTLASS | intro |
| 01-minimal-gemm | - Introduces CuTe fundamentals<br>- Implements a 16x8x8 GEMM kernel using a single MMA instruction from scratch<br>- Python kernel invocation, precision validation & performance benchmarking<br>- Profiling with Nsight Compute (ncu) | minimal-gemm |
| 02-mixed-precision-gemm | - Implements mixed-precision GEMM supporting varying input/output/accumulation precisions<br>- Explores technical details of numerical precision conversion within kernels<br>- Demonstrates a custom FP8 GEMM kernel implemented via PTX instructions (for MMA ops unsupported by CUTLASS) | mixed-precision-gemm |
| 03-tiled-mma | - Introduces the key conceptual model of the GEMM operator: three-level tiling<br>- Details the implementation of Tiled MMA operations in CUTLASS CuTe<br>- Explains the usage and semantics of the various Tiled MMA API parameters<br>- Extends the GEMM kernel from a single instruction to a single tile operation | tiled-mma |
| 04-tiled-copy | - Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory<br>- Describes the API parameters and semantics of TiledCopy<br>- Demonstrates how to implement data copying at the tile level<br>- Introduces foundational knowledge of GPU global-memory access characteristics | tiled-copy |
| 05-block-mma | - Extends Tiled MMA to the block level for larger-scale GEMM computations<br>- Explains how multiple Tiled MMA operations are combined within a thread block<br>- Describes the tiling and coordination of TiledCopy and TiledMMA at the block level<br>- Illustrates the hierarchical dataflow from global memory to shared memory to registers for block-level MMA | block-mma |
| 06-block-copy | - Stages A / B / (C) through shared memory before the tensor-core MMA<br>- Introduces the gmem→smem→rmem dataflow, with explicit G2S / S2R / R2S / S2G TiledCopies<br>- Uses 128-bit cp.async for the gmem→smem path and AutoVectorizingCopy (CopyUniversalOp in the CuTe DSL) for the rest<br>- Walks through dynamic shared-memory sizing and the lifecycle of the A/B/C/O buffers within a single block | block-copy |
This project is licensed under the MIT License - see the LICENSE file for details.