A lightweight deep learning framework built from scratch on raw CUDA, with a friendly Python API covering the core building blocks of deep learning:
- Fully connected layer
- Convolutional layer
- GPU acceleration (roughly 15x faster than a plain NumPy implementation, and on par with PyTorch at smaller batch sizes)
- Flatten layer
- Max pooling layer
- ReLU activation
- Softmax layer
- Model save/load
- Cross Entropy Loss & MSE Loss
- Model & Sequential classes
- Training & eval loop
- Mini-batching
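
To give a feel for how these pieces fit together, here's a minimal usage sketch. The layer and loss names (Linear, ReLU, CrossEntropyLoss, Model) come from the example runs below; the module name (`tiny_ml_lib`), constructor signatures, and method names (`fit`, `evaluate`, `save`) are assumptions made for illustration, not the library's documented API.

```python
import numpy as np

# Hypothetical import path -- the real module name may differ.
from tiny_ml_lib import Model, Linear, ReLU, CrossEntropyLoss

# Dummy data standing in for flattened MNIST (60000 x 784 inputs, integer labels).
x_train = np.random.rand(60000, 784).astype(np.float32)
y_train = np.random.randint(0, 10, size=60000)

# Stack layers and a loss the way the printed model summaries below suggest.
model = Model(
    Linear(784, 512), ReLU(),
    Linear(512, 512), ReLU(),
    Linear(512, 10),
    loss=CrossEntropyLoss(),
)

# Assumed method names: mini-batched training, an eval pass, and a pickle-based save.
model.fit(x_train, y_train, epochs=10, batch_size=512, lr=0.1)
model.evaluate(x_train, y_train)
model.save("mlp-weights.pkl")
```
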
Convolutional NN on MNIST
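
The run below prints the summary for this CNN. A rough sketch of how that stack might be written: going from a 28x28 input to five 24x24 feature maps implies a 5x5 kernel with stride 1 and no padding, and Flatten then yields 5 * 24 * 24 = 2880 features. The constructor arguments are assumptions, not the library's exact signatures.

```python
# Sketch of the CNN printed in the run below; argument names are assumed.
model = Model(
    Conv2d(in_channels=1, out_channels=5, kernel_size=5),  # (1, 28, 28) -> (5, 24, 24)
    ReLU(),
    Flatten(),                                              # (5, 24, 24) -> 2880
    Linear(2880, 128),
    ReLU(),
    Linear(128, 10),
    loss=CrossEntropyLoss(),
)
```
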
>>> py main.py
Input shape: (60000, 784)
Labels shape: (60000,)
Model(
[0] Conv2d ((1, 28, 28) → (5, 24, 24))
[1] ReLU
[2] Flatten
[3] Linear (2880 → 128)
[4] ReLU
[5] Linear (128 → 10)
Loss: CrossEntropyLoss
Total parameters: 373,063
)
TRAINING...
EPOCH 1/10, Loss: 0.1227
...
EPOCH 10/10, Loss: 0.0347
Time spent training: 437.89s
EVALUATING...
Sample labels: [9 2 9 8 9 7 1 2 4 3]
Sample preds: [9 2 9 8 9 7 1 2 4 3]
Accuracy: 98.32%
Save weights? (y/n) >>> y
File name? (empty for default) >>> cnn-weights
Saved model weights to cnn-weights.pkl

With MaxPool (GPU ver.)
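
Inserting a 2x2 max pool after the convolution halves each 24x24 feature map to 12x12, so Flatten produces 5 * 12 * 12 = 720 features instead of 2880 and the first Linear layer shrinks accordingly, which is where most of the drop in parameter count comes from. A sketch of the pooled stack (pool window inferred from the printed shapes, constructor arguments assumed):

```python
# Pooled variant of the CNN above; a 2x2 window is assumed from the printed shapes.
model = Model(
    Conv2d(in_channels=1, out_channels=5, kernel_size=5),  # (1, 28, 28) -> (5, 24, 24)
    ReLU(),
    MaxPool(2),                                             # (5, 24, 24) -> (5, 12, 12)
    Flatten(),                                              # (5, 12, 12) -> 720
    Linear(720, 128),
    ReLU(),
    Linear(128, 10),
    loss=CrossEntropyLoss(),
)
```
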
Model(
[0] Conv2d ((1, 28, 28) → (5, 24, 24))
[1] ReLU
[2] MaxPool # on gpu
[3] Flatten
[4] Linear (720 → 128)
[5] ReLU
[6] Linear (128 → 10)
Loss: CrossEntropyLoss
Total parameters: 96,583
Device: CPU
)
TRAINING...
EPOCH 1/5, Loss: 1.7549
...
EPOCH 5/5, Loss: 0.4145
Finished in: 492.62s # vs ~1200s fully on CPU
EVALUATING...
Sample labels: [8 5 6 4 2 4 2 4 1 3]
Sample preds: [8 5 6 4 4 4 2 4 1 3]
Accuracy: 89.95%

MLP on MNIST (GPU)
>>> py main.py
Input shape: (60000, 784)
Labels shape: (60000,)
Model(
[0] Linear (784 → 512)
[1] ReLU
[2] Linear (512 → 512)
[3] ReLU
[4] Linear (512 → 512)
[5] ReLU
[6] Linear (512 → 10)
Loss: CrossEntropyLoss
Total parameters: 932,362
Device: GPU
)
TRAINING...
EPOCH 1/10, Loss: 0.5499
...
EPOCH 10/10, Loss: 0.2297
Finished in: 9.51s
EVALUATING...
Sample labels: [7 3 1 1 0 8 0 8 6 4]
Sample preds: [7 3 1 1 0 0 0 8 6 4]
Accuracy: 95.70%
Save weights? (y/n) >>> n

With pretrained weights:
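
The weights saved earlier to a .pkl file can be loaded back to skip training and jump straight to evaluation. A sketch, with the load method name assumed rather than taken from the actual API:

```python
# Hypothetical method name; the real load call may differ.
model.load("mlp-weights.pkl")
model.evaluate(x_test, y_test)
```
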
Loaded model weights from mlp-weights.pkl
EVALUATING...
Sample labels: [2 0 1 9 6 5 5 6 7 8]
Sample preds: [2 0 1 9 6 5 5 6 7 8]
Accuracy: 98.13%

Note
This library doesn't have autograd (yet), graph tracing, mixed precision, tensor cores, cuDNN, cuBLAS, or any of the other fancy stuff PyTorch does.
It only runs "faster" because it's lightweight.
Still, it beats PyTorch at batch sizes <= 512 on MNIST, so that's a win in my book.
All benchmarks were run on an RTX 4060, training a simple MNIST NN from scratch using this library's GPU backend.
Model:
Linear(784 → 512) → ReLU → Linear(512 → 512) → ReLU → Linear(512 → 512) → ReLU → Linear(512 → 10)
Loss: CrossEntropy
Optimizer: SGD, lr=0.1
Epochs: 10
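
For reference, the PyTorch side of the comparison presumably corresponds to the standard equivalent below. The benchmark script itself isn't shown here, so treat this as an assumed reconstruction of the setup described above.

```python
import torch
from torch import nn

# Assumed PyTorch equivalent of the benchmark model described above.
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
).cuda()

criterion = nn.CrossEntropyLoss()                          # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # SGD, lr=0.1
# (mini-batched training loop over 10 epochs omitted for brevity)
```
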
| Batch Size | Framework | Time (10 Epochs) |
|---|---|---|
| 64 | PyTorch | 27.2s |
| 64 | This lib | 20.2s |
| 512 | PyTorch | 9.7s |
| 512 | This lib | 9.5s |
- GPU programming seemed like a really fun problem space
- Wanted to implement a bunch of things on my own
- Wanted to experiment with building a framework and learn cool stuff along the way
- Python 3.10+
- pip
- CUDA Toolkit
- CMake
- gcc or g++
Clone the repo:
git clone https://github.com/sidsurakanti/tiny-ml-lib.git
cd tiny-ml-lib

Create a virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate # windows: venv\Scripts\activate

Install dependencies:
pip install -r requirements.txt

Build the core CUDA lib:
mkdir build && cd build
cmake .. && make && make install
cd ..

Run it:
python3 main.py

or

python main.py

- MLP basic functionality
- Add Conv2d
- Add pooling layer
- Add weight inits
- Cuda remake
- Add more activations, etc.
Need help? Ping me on Discord.