Benchmarking CUDA and Vulkan Compute shader performance in various applications.
Vulkan compute shaders are abstracted using the Kompute framework.
The current implementation compares basic General Matrix Multiplication (GEMM) operations on the GPU.
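For orientation, the operation being benchmarked can be written as a plain CPU reference. This is a minimal sketch, not code from this repository: the function name `sgemm_reference` and the row-major layout are assumptions made for illustration. A reference like this is mainly useful for checking GPU results on small inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for single-precision GEMM: C = A * B.
// All matrices are stored row-major: A is m x k, B is k x n, C is m x n.
std::vector<float> sgemm_reference(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            // Hoist A[i][p] so the inner loop walks B and C contiguously.
            const float a_ip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    return c;
}
```

Calling `sgemm_reference(a, b, m, k, n)` returns a freshly allocated `m x n` result, which can be compared element by element against a GPU output buffer.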
- CMake 3.28+
- C++20 Compiler
- CUDA SDK installed (12.2+ recommended; tested with 12.6) and a supported NVIDIA GPU
- Vulkan SDK installed
Create .env.cmake from .env.cmake.template, modifying as needed. Dependencies are managed using CMake's FetchContent.
cmake -B build -S .
# For Benchmarks, building in release mode is recommended
cmake --build build --config Release
# or for single config generators
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build
The benchmark executables have "benchmark" in their name.
# for example
./sgemm_benchmark
Results for multiplying two 4096x4096 32-bit float matrices on an RTX 4080 12GB mobile GPU:
- Fastest naive implementation is about 0.08x of cuBLAS performance.
- Fastest tiling implementation is about 0.11x of cuBLAS performance.
- 2D register blocking achieves about 0.32x of cuBLAS performance.
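As a rough guide to reading these ratios, the sketch below shows one common way to turn timings into throughput and a relative-performance figure. How this project actually measures the ratios is not stated here, so the helper names (`gemm_flops`, `gflops_per_second`, `relative_performance`) and the "baseline time divided by kernel time" definition are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one (m x k) by (k x n) GEMM:
// each of the m*n outputs needs k multiplies and k adds.
constexpr double gemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Throughput in GFLOP/s for a run that took `seconds` of wall time.
inline double gflops_per_second(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

// Assumed definition: performance relative to a baseline such as cuBLAS.
// A value of 0.5 means the kernel took twice as long as the baseline.
inline double relative_performance(double baseline_seconds, double kernel_seconds) {
    assert(kernel_seconds > 0.0);
    return baseline_seconds / kernel_seconds;
}
```

For the 4096x4096 case, `gemm_flops(4096, 4096, 4096)` is 2 * 4096^3, or roughly 137.4 GFLOP per multiplication.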
There are still various optimizations that can be done to improve performance.
- Tiling in local memory.
- Wider data-types.
- Transposed input matrix.
- More work per thread.
- Wider loads with register blocking.
- 2D register blocking.
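To illustrate the tiling idea from the list above, here is a CPU analogue written in plain C++. It is a sketch under stated assumptions, not one of this project's shaders: the name `sgemm_tiled`, the square row-major matrices, and the default tile size of 32 are all hypothetical. On a GPU the tile would be staged in local (shared) memory by a workgroup; here the same blocking simply keeps a tile hot in the CPU cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tiled GEMM: C = A * B for row-major n x n matrices.
// The work is split into tile x tile blocks so each block of A and B
// is reused many times before moving on.
std::vector<float> sgemm_tiled(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t n, std::size_t tile = 32) {
    assert(tile > 0 && a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t p0 = 0; p0 < n; p0 += tile) {
            for (std::size_t j0 = 0; j0 < n; j0 += tile) {
                // Clamp the block bounds so n need not be a multiple of tile.
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t p1 = std::min(p0 + tile, n);
                const std::size_t j1 = std::min(j0 + tile, n);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float a_ip = a[i * n + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

The result matches an untiled triple loop up to floating-point summation order; only the traversal order changes, which is the point of tiling.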
- Compare cuBLAS SGEMM against Kompute using various GEMM methods.
- Try half-precision cuBLAS HGEMM vs. Kompute HGEMM using Vulkan half floats.
- Create Kompute Operations for these methods to ease usage.
- How to Optimize a GEMM
- Optimized Row major matrix multiplication using Vulkan
