A collection of microbenchmarks for NVIDIA Blackwell (SM 100) GPUs, covering memory throughput, latency, tensor core (UMMA) performance, and HBM-resident elementwise throughput.

https://newsletter.semianalysis.com/p/dissecting-nvidia-blackwell-tensor

| Path | Purpose |
|---|---|
| ldgsts_throughput/ | LDGSTS HBM throughput |
| tma2d_throughput/ | TMA 2D HBM throughput |
| ldgsts_latency/ | LDGSTS latency |
| tma2d_latency/ | TMA 2D latency |
| umma_throughput/ | UMMA tensor-core throughput |
| umma_latency/ | UMMA tensor-core latency |
| elementwise_throughput/ | fp32 HBM-resident activation/elementwise throughput |

Compute for this project is generously sponsored by Nebius and Verda.

