# Demystifying High-Performance AI Kernels with Modern C++ & CUDA
TensorCraft-HPC is a header-only C++/CUDA kernel library for learning, validating, and packaging modern AI operators. The repository keeps implementations readable, exposes progressive optimization paths, and ships a lightweight Python binding surface.
| Feature | Description |
|---|---|
| 🎓 Educational Design | Progressive optimization paths from naive to Tensor Core |
| ⚡ Zero-Build Integration | Header-only — just #include and go |
| 📊 Multi-Architecture | SM70 (Volta) to SM100 (Blackwell) |
| 🔧 OpenSpec Workflow | Specification-driven development |
| 📚 Bilingual Docs | Complete English & Chinese documentation |
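The "progressive optimization" idea — the same operator written at increasing levels of sophistication — can be illustrated outside CUDA with a plain-Python sketch. This is not the library's kernel code, just an assumption-free analogy: a naive triple-loop matmul versus a tiled version whose loop structure mirrors what a shared-memory CUDA GEMM does.

```python
import numpy as np

def matmul_naive(A, B):
    """Level 0: the textbook triple loop -- one output element at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def matmul_blocked(A, B, tile=8):
    """Level 1: tiled loops -- the access pattern a shared-memory GEMM exploits."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # Accumulate one C tile from matching tiles of A and B.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
B = rng.standard_normal((16, 16))
assert np.allclose(matmul_naive(A, B), A @ B)
assert np.allclose(matmul_blocked(A, B), A @ B)
```

Both versions produce identical results; the tiled one simply reorders work so each tile stays hot in fast memory — the same trade the library's optimization tiers make on the GPU.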
CPU smoke build, install, and wheel packaging:

```bash
cmake --preset cpu-smoke
cmake --build --preset cpu-smoke --parallel 2
cmake --install build/cpu-smoke --prefix /tmp/tensorcraft-install
python3 -m build --wheel
```

Full dev build and test run:

```bash
cmake --preset dev
cmake --build --preset dev --parallel $(nproc)
ctest --preset dev --output-on-failure
```

C++ usage (header-only):

```cpp
#include "tensorcraft/kernels/gemm.hpp"
#include "tensorcraft/memory/tensor.hpp"

// Create GPU tensors (RAII-managed)
tensorcraft::FloatTensor A({4096, 4096});
tensorcraft::FloatTensor B({4096, 4096});
tensorcraft::FloatTensor C({4096, 4096});

// Optimized GEMM (92% of cuBLAS throughput)
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

Python usage:

```python
import tensorcraft_ops as tc
import numpy as np

# GPU-accelerated GEMM
A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
C = tc.gemm(A, B)

# FlashAttention
Q, K, V = [np.random.randn(32, 128, 64).astype(np.float16) for _ in range(3)]
output = tc.flash_attention(Q, K, V)
```

| Area | Scope |
|---|---|
| Core utilities | CUDA checks, feature detection, type traits, warp helpers |
| Memory | Tensor, aligned vectors, memory pool |
| Kernels | GEMM, FlashAttention, normalization, convolution, sparse, fusion |
| Python | tensorcraft_ops bindings for smoke/integration workflows |
| Validation | CPU smoke build/install, Python wheel build, optional CUDA tests |
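The validation row above covers build-level checks; numerically, validating an FP16 kernel against a reference needs looser tolerances than FP32 comparisons. A minimal sketch of such a check, using a NumPy matmul as a stand-in for the kernel output (the `check_fp16_gemm` helper and its tolerances are illustrative, not this repo's actual test code):

```python
import numpy as np

def check_fp16_gemm(C_kernel, A, B, rtol=1e-2, atol=1e-2):
    """Compare a kernel's FP16 GEMM output against an FP32 NumPy reference.

    FP16 carries roughly 3 decimal digits of precision, so the tolerances
    are far looser than what an FP32 comparison would use.
    """
    C_ref = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
    return np.allclose(C_kernel.astype(np.float32),
                       C_ref.astype(np.float32), rtol=rtol, atol=atol)

# Simulate a "kernel" result with NumPy itself (stand-in for a GPU call).
rng = np.random.default_rng(1)
A = rng.standard_normal((64, 32)).astype(np.float16)
B = rng.standard_normal((32, 48)).astype(np.float16)
C = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
assert check_fp16_gemm(C, A, B)
```

Accumulating in FP32 before rounding to FP16, as the reference here does, is the standard way to keep a mixed-precision comparison meaningful.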
| Kernel | Reference | Throughput vs reference |
|---|---|---|
| GEMM (FP16) | cuBLAS | 92% |
| FlashAttention | cuDNN | 85% |
| LayerNorm | cuDNN | 95% |
| Conv2D | cuDNN | 78% |
| SpMV (CSR) | cuSPARSE | 88% |
*Benchmarks measured on an A100 80GB with CUDA 12.4, FP16 Tensor Core paths.*
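The percentages above are throughput ratios against the vendor library. As a reminder of how such a figure is derived (the timings below are made-up illustrative values, not measurements from this repo): a GEMM performs 2·M·N·K floating-point operations, so achieved TFLOP/s is that count divided by wall time.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an M x N x K GEMM: 2*M*N*K FLOPs per call."""
    return 2 * m * n * k / seconds / 1e12

# Hypothetical timings for a 4096^3 FP16 GEMM (illustration only):
ours = gemm_tflops(4096, 4096, 4096, 0.000563)    # our kernel's wall time
cublas = gemm_tflops(4096, 4096, 4096, 0.000518)  # cuBLAS wall time
ratio = ours / cublas  # "percent of cuBLAS" = ratio of achieved throughput
```

With these stand-in numbers the ratio comes out near 0.92, i.e. the kind of "92% of cuBLAS" figure reported in the table.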
- Documentation hub: https://lessup.github.io/modern-ai-kernels/
- Getting started: docs/en/getting-started/
- Architecture: docs/en/architecture/
- API Reference: docs/en/api/
- Examples: docs/en/examples/
- 中文文档: docs/zh/
This repository uses OpenSpec as the active development workflow:
- Review accepted specs in `openspec/specs/`
- Create or update changes under `openspec/changes/`
- Implement against that change
- Run validation before merge
- Use `/review` before merging structural changes
```
modern-ai-kernels/
├── include/tensorcraft/   # Header-only C++/CUDA library
├── src/python_ops/        # Python bindings
├── tests/                 # Validation
├── benchmarks/            # Benchmark binaries
├── docs/                  # GitHub Pages + documentation
├── openspec/              # Active spec workflow
└── .github/               # Workflows, templates
```
- Build system: CMake presets
- Formatting: `.clang-format`, `.clang-tidy`, `pre-commit`
- LSP: `clangd` with `compile_commands.json`
- GitHub automation: CI, Pages, release workflow
Contributions are welcome! Please:
- Read the OpenSpec workflow
- Follow the code style (run `pre-commit` hooks)
- Add tests for new functionality
- Update documentation
Released under the MIT License.
Made with ❤️ for learning high-performance AI kernels