A high-performance neural network implementation using CUDA and cuBLAS for MNIST digit classification. This project demonstrates GPU-accelerated deep learning and achieves 97.5% accuracy on the MNIST test set and 98.5% accuracy on the training set.
- Input Layer: 784 neurons (28×28 flattened MNIST images)
- Hidden Layer: 128 neurons with ReLU activation
- Output Layer: 10 neurons with Softmax activation (digit classes 0-9)
- Optimizer: SGD with momentum
- Loss Function: Cross-entropy loss
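
To make the layer sizes above concrete, here is a minimal CPU reference sketch of the forward pass for a single image, written in plain C++ rather than CUDA. The function names (`dense`, `relu`, `softmax`, `cross_entropy`, `forward`) and the row-major weight indexing are assumptions made for illustration; the project itself runs these steps on the GPU through cuBLAS and custom kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense layer for one sample: out = W * in + b.
// W is indexed row-major as (out_dim x in_dim) here; this is an assumption
// for readability and differs from the project's column-major cuBLAS layout.
std::vector<float> dense(const std::vector<float>& W, const std::vector<float>& b,
                         const std::vector<float>& in) {
    const std::size_t out_dim = b.size(), in_dim = in.size();
    assert(W.size() == out_dim * in_dim);
    std::vector<float> out(b);  // start from the bias
    for (std::size_t o = 0; o < out_dim; ++o)
        for (std::size_t i = 0; i < in_dim; ++i)
            out[o] += W[o * in_dim + i] * in[i];
    return out;
}

// ReLU applied in place: negative values become zero.
void relu(std::vector<float>& x) {
    for (float& v : x) v = std::max(v, 0.0f);
}

// Numerically stable softmax: subtract the maximum before exponentiating.
std::vector<float> softmax(const std::vector<float>& z) {
    assert(!z.empty());
    const float m = *std::max_element(z.begin(), z.end());
    std::vector<float> p(z.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - m);
        sum += p[i];
    }
    for (float& v : p) v /= sum;
    return p;
}

// Cross-entropy between predicted probabilities p and a one-hot label y.
float cross_entropy(const std::vector<float>& p, const std::vector<float>& y) {
    assert(p.size() == y.size());
    float loss = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i)
        loss -= y[i] * std::log(std::max(p[i], 1e-12f));  // clamp to avoid log(0)
    return loss;
}

// Forward pass: 784 inputs -> 128 hidden (ReLU) -> 10 outputs (softmax).
std::vector<float> forward(const std::vector<float>& W1, const std::vector<float>& b1,
                           const std::vector<float>& W2, const std::vector<float>& b2,
                           const std::vector<float>& x) {
    std::vector<float> h = dense(W1, b1, x);
    relu(h);
    return softmax(dense(W2, b2, h));
}
```

With all-zero weights and biases, `forward` returns a uniform distribution over the 10 classes, which is a quick sanity check for the shapes.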
├── utils.cuh # Header file with utility function declarations
├── utils.cu # Utility functions implementation
├── train.cu # Training code
├── test.cu # Testing code
├── create_mnist_data.py # Python script to generate binary data files from MNIST
├── Makefile # Build config
└── README.md # Read this.
- NVIDIA GPU with CUDA support
- CUDA Toolkit (tested with CUDA 11.0+)
- cuBLAS library
- GCC/G++ compiler
- Python 3.x (for data preparation)
Note: This project was created in Coursera's Lab Environment.
pip install torch torchvision numpy
python create_mnist_dataset.py

This creates:
- train_images.bin (60,000 × 784 float32)
- train_labels.bin (60,000 × 10 float32)
- test_images.bin (10,000 × 784 float32)
- test_labels.bin (10,000 × 10 float32)
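
The files listed above can be read back on the host before copying them to the GPU. The sketch below assumes each file is a headerless, row-major dump of float32 values in the machine's native byte order, and that labels are one-hot rows of 10 values; neither assumption is confirmed here. The helper names `read_floats`, `load_matrix`, and `label_of` are hypothetical and are not taken from `utils.cu`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Read rows * cols float32 values from a binary stream.
// Assumes raw data with no header, in host byte order.
std::vector<float> read_floats(std::istream& in, std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    std::vector<float> data(rows * cols);
    const std::streamsize bytes =
        static_cast<std::streamsize>(data.size() * sizeof(float));
    in.read(reinterpret_cast<char*>(data.data()), bytes);
    if (in.gcount() != bytes)
        throw std::runtime_error("file is shorter than rows * cols float32 values");
    return data;
}

// Open a file such as train_images.bin and load it as a rows x cols matrix.
std::vector<float> load_matrix(const std::string& path, std::size_t rows, std::size_t cols) {
    std::ifstream file(path, std::ios::binary);
    if (!file) throw std::runtime_error("cannot open " + path);
    return read_floats(file, rows, cols);
}

// Recover the digit class from a one-hot label row (index of the largest entry).
std::size_t label_of(const std::vector<float>& labels, std::size_t row,
                     std::size_t classes = 10) {
    assert((row + 1) * classes <= labels.size());
    const float* begin = labels.data() + row * classes;
    return static_cast<std::size_t>(std::max_element(begin, begin + classes) - begin);
}
```

Under these assumptions, `load_matrix("train_images.bin", 60000, 784)` would return the training images and `label_of(labels, i)` the digit for sample `i`.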
make clean build

This compiles:
- utils.o: Utility functions object file
- train.exe: Training executable
- test.exe: Testing executable
./train.exe
./test.exe

- ReLU Activation: GPU-accelerated forward and backward pass
- Cross-entropy Gradient: Parallel gradient computation
- Batch Processing: Configurable batch size (default: 64)
- Momentum: SGD with momentum (default: 0.9)
- Gradient Clipping: Prevents gradient explosion (norm: 1.0)
- Weight Persistence: Automatic save/load of trained weights
- He Initialization: Proper weight initialization for ReLU networks
- cuBLAS Integration: Optimized matrix operations
- Memory Management: Efficient GPU memory allocation
- Column-major Storage: cuBLAS-compatible weight layout
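
The momentum and gradient clipping items in the list above can be illustrated with a small CPU reference sketch in plain C++. It assumes clipping rescales the whole gradient by its L2 norm and that the update uses the classic momentum form `v = momentum * v - lr * g; w += v`; the project's kernels may clip per element or use a different momentum formulation. The function names `clip_by_norm` and `sgd_momentum_step` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescale grad in place so its L2 norm is at most max_norm.
// Returns the norm measured before clipping.
float clip_by_norm(std::vector<float>& grad, float max_norm) {
    assert(max_norm > 0.0f);
    double sq = 0.0;
    for (float g : grad) sq += static_cast<double>(g) * g;
    const float norm = static_cast<float>(std::sqrt(sq));
    if (norm > max_norm) {
        const float scale = max_norm / norm;
        for (float& g : grad) g *= scale;
    }
    return norm;
}

// One SGD-with-momentum step (classic form, assumed here):
//   v = momentum * v - lr * g
//   w = w + v
void sgd_momentum_step(std::vector<float>& w, std::vector<float>& v,
                       const std::vector<float>& g, float lr, float momentum) {
    assert(w.size() == v.size() && w.size() == g.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        v[i] = momentum * v[i] - lr * g[i];
        w[i] += v[i];
    }
}
```

In a training loop this would be called once per batch: clip the gradient to norm 1.0, then apply the step with the learning rate and momentum coefficient. The velocity buffer `v` starts at zero and persists across batches.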
Key parameters in train.cu:
const int batch_size = 64; // Training batch size
const int hidden_dim = 128; // Hidden layer neurons
const float learning_rate = 0.005f; // Learning rate
const float momentum = 0.9f; // Momentum coefficient
const int epochs = 20; // Training epochs

This project is licensed under the MIT License; see the LICENSE file for details. Feel free to modify and extend for your own learning and research.
Suggestions for improvements:
- Add more activation functions
- Implement different optimizers (Adam, RMSprop)
- Add regularization techniques (dropout, weight decay)
- Support for different architectures
- Visualization tools for training progress