# Demystifying High-Performance AI Kernels with Modern C++ & CUDA
TensorCraft-HPC is a header-only C++/CUDA kernel library for learning, validating, and packaging modern AI operators. The repository keeps implementations readable, exposes progressive optimization paths, and ships a lightweight Python binding surface.
| Feature | Description |
|---|---|
| 🎓 Educational Design | Progressive optimization paths from naive to Tensor Core |
| ⚡ Zero-Build Integration | Header-only — just #include and go |
| 📊 Multi-Architecture | SM70 (Volta) to SM100 (Blackwell) |
| 🔧 OpenSpec Workflow | Specification-driven development |
| 📚 Bilingual Docs | Complete English & Chinese documentation |
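The "progressive optimization" idea — the same operator written at increasing levels of sophistication — can be illustrated outside CUDA with a plain-Python sketch. This is not the library's kernel code, just an assumption-free analogy: a naive triple-loop matmul versus a tiled version whose loop structure mirrors what a shared-memory CUDA GEMM does.

```python
import numpy as np

def matmul_naive(A, B):
    """Level 0: the textbook triple loop -- one output element at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def matmul_blocked(A, B, tile=8):
    """Level 1: tiled loops -- the access pattern a shared-memory GEMM exploits."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # Accumulate one C tile from matching tiles of A and B.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
B = rng.standard_normal((16, 16))
assert np.allclose(matmul_naive(A, B), A @ B)
assert np.allclose(matmul_blocked(A, B), A @ B)
```

Both versions produce identical results; the tiled one simply reorders work so each tile stays hot in fast memory — the same trade the library's optimization tiers make on the GPU.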
CPU smoke build, install, and wheel packaging:

```bash
cmake --preset cpu-smoke
cmake --build --preset cpu-smoke --parallel 2
cmake --install build/cpu-smoke --prefix /tmp/tensorcraft-install
python3 -m build --wheel
```

Full dev build and test run:

```bash
cmake --preset dev
cmake --build --preset dev --parallel $(nproc)
ctest --preset dev --output-on-failure
```

C++ usage (header-only):

```cpp
#include "tensorcraft/kernels/gemm.hpp"
#include "tensorcraft/memory/tensor.hpp"

// Create GPU tensors (RAII-managed)
tensorcraft::FloatTensor A({4096, 4096});
tensorcraft::FloatTensor B({4096, 4096});
tensorcraft::FloatTensor C({4096, 4096});

// Optimized GEMM (92% of cuBLAS throughput)
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

Python usage:

```python
import tensorcraft_ops as tc
import numpy as np

# GPU-accelerated GEMM
A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
C = tc.gemm(A, B)

# FlashAttention
Q, K, V = [np.random.randn(32, 128, 64).astype(np.float16) for _ in range(3)]
output = tc.flash_attention(Q, K, V)
```

| Area | Scope |
|---|---|
| Core utilities | CUDA checks, feature detection, type traits, warp helpers |
| Memory | Tensor, aligned vectors, memory pool |
| Kernels | GEMM, FlashAttention, normalization, convolution, sparse, fusion |
| Python | tensorcraft_ops bindings for smoke/integration workflows |
| Validation | CPU smoke build/install, Python wheel build, optional CUDA tests |
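The validation row above covers build-level checks; numerically, validating an FP16 kernel against a reference needs looser tolerances than FP32 comparisons. A minimal sketch of such a check, using a NumPy matmul as a stand-in for the kernel output (the `check_fp16_gemm` helper and its tolerances are illustrative, not this repo's actual test code):

```python
import numpy as np

def check_fp16_gemm(C_kernel, A, B, rtol=1e-2, atol=1e-2):
    """Compare a kernel's FP16 GEMM output against an FP32 NumPy reference.

    FP16 carries roughly 3 decimal digits of precision, so the tolerances
    are far looser than what an FP32 comparison would use.
    """
    C_ref = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
    return np.allclose(C_kernel.astype(np.float32),
                       C_ref.astype(np.float32), rtol=rtol, atol=atol)

# Simulate a "kernel" result with NumPy itself (stand-in for a GPU call).
rng = np.random.default_rng(1)
A = rng.standard_normal((64, 32)).astype(np.float16)
B = rng.standard_normal((32, 48)).astype(np.float16)
C = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
assert check_fp16_gemm(C, A, B)
```

Accumulating in FP32 before rounding to FP16, as the reference here does, is the standard way to keep a mixed-precision comparison meaningful.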
| Kernel | Reference | Throughput vs reference |
|---|---|---|
| GEMM (FP16) | cuBLAS | 92% |
| FlashAttention | cuDNN | 85% |
| LayerNorm | cuDNN | 95% |
| Conv2D | cuDNN | 78% |
| SpMV (CSR) | cuSPARSE | 88% |
*Benchmarks measured on an A100 80GB with CUDA 12.4, FP16 Tensor Core paths.*
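The percentages above are throughput ratios against the vendor library. As a reminder of how such a figure is derived (the timings below are made-up illustrative values, not measurements from this repo): a GEMM performs 2·M·N·K floating-point operations, so achieved TFLOP/s is that count divided by wall time.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an M x N x K GEMM: 2*M*N*K FLOPs per call."""
    return 2 * m * n * k / seconds / 1e12

# Hypothetical timings for a 4096^3 FP16 GEMM (illustration only):
ours = gemm_tflops(4096, 4096, 4096, 0.000563)    # our kernel's wall time
cublas = gemm_tflops(4096, 4096, 4096, 0.000518)  # cuBLAS wall time
ratio = ours / cublas  # "percent of cuBLAS" = ratio of achieved throughput
```

With these stand-in numbers the ratio comes out near 0.92, i.e. the kind of "92% of cuBLAS" figure reported in the table.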
- Documentation hub: https://lessup.github.io/modern-ai-kernels/
- Getting started: docs/en/getting-started/
- Architecture: docs/en/architecture/
- API Reference: docs/en/api/
- Examples: docs/en/examples/
- 中文文档: docs/zh/
This repository uses OpenSpec as the active development workflow:
- Review accepted specs in `openspec/specs/`
- Create or update changes under `openspec/changes/`
- Implement against that change
- Run validation before merge
- Use `/review` before merging structural changes
```
modern-ai-kernels/
├── include/tensorcraft/   # Header-only C++/CUDA library
├── src/python_ops/        # Python bindings
├── tests/                 # Validation
├── benchmarks/            # Benchmark binaries
├── docs/                  # GitHub Pages + documentation
├── openspec/              # Active spec workflow
└── .github/               # Workflows, templates
```
- Build system: CMake presets
- Formatting: `.clang-format`, `.clang-tidy`, `pre-commit`
- LSP: `clangd` with `compile_commands.json`
- GitHub automation: CI, Pages, release workflow
Contributions are welcome! Please:
- Read the OpenSpec workflow
- Follow the code style (run `pre-commit` hooks)
- Add tests for new functionality
- Update documentation
Released under the MIT License.
Made with ❤️ for learning high-performance AI kernels