CUDA kernel experiments, with a benchmark-first layout from the start.
The default path here is hand-written CUDA C++. Python is used for suite configs, environment capture, reporting, and CLI tooling.
csrc/: native code, headers, launchers, and kernel-family directories
src/fast_kernels/: CLI, schemas, registry, environment capture, and reporting
benchmarks/: suite definitions, shapes, baseline adapters, and runners
results/: versioned benchmark outputs
uv sync
uv run fk env
uv run fk verify benchmarks/suites/template_gemm.toml
uv run fk bench benchmarks/suites/template_gemm.toml
That benchmark command writes a run under results/template_gemm/<run-id>/.
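To inspect the output of the most recent run, a small helper along these lines may be handy. This is a minimal sketch, not part of the fk CLI: the function name latest_run is hypothetical, and because the run-id format is not specified here, it picks the newest run by directory modification time rather than by name.

```python
from pathlib import Path
from typing import Optional


def latest_run(suite: str, results_root: Path = Path("results")) -> Optional[Path]:
    """Return the most recently modified run directory for a suite, or None.

    Hypothetical helper: assumes only the results/<suite>/<run-id>/ layout.
    """
    suite_dir = results_root / suite
    if not suite_dir.is_dir():
        return None
    runs = [p for p in suite_dir.iterdir() if p.is_dir()]
    # The run-id format is an unknown here, so order by mtime, not by name.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    run = latest_run("template_gemm")
    print(run if run is not None else "no runs found")
```

Run it from the repository root so that the relative results/ path resolves; pass results_root explicitly otherwise.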
Editable installs default to FK_ENABLE_CUDA=ON and build for the active GPU architecture.
Default CUDA setup:
uv sync --extra benchmark
uv run fk env
CPU-only opt-out:
CMAKE_ARGS=-DFK_ENABLE_CUDA=OFF uv sync
For direct CMake builds:
cmake --preset dev
cmake --build --preset dev