A complete implementation of a GPT-style transformer language model written in C++ from scratch, featuring multi-head attention, layer normalization, feed-forward networks, and AdamW optimization.
- Open Command Prompt (CMD).
- Navigate to the folder containing `lum_gpt.cpp`.
- Paste this command to compile:
  `g++ -std=c++17 -O3 -march=native lum_gpt.cpp -o lum_gpt.exe`
- You can add more flags, such as `-ffast-math` or debugging flags, if you have modified the code and want to debug it.
- After several seconds it will compile into `lum_gpt.exe`.
- Just double-click `lum_gpt.exe` to run!
The program automatically downloads the TinyShakespeare dataset if it is not present. If the library used to download the dataset is not available on your system, it is recommended to download the dataset manually.
Test System:
- CPU: AMD Phenom™ Triple-Core Processor @ 2.40 GHz
- RAM: 2 GB DDR2 (700 MB available)
- Storage: 149 GB HDD
- GPU: None (GTX 210 only for display)
Resource Usage During Training:
- Memory: 32 MB
- CPU: 45%
- Disk: 0-2%
- Training Time: ~8 minutes per 200 iterations
Complete transformer implementation with:
- Multi-head attention with causal masking
- Layer normalization (Pre-LN as in GPT-2/3)
- Feed-forward networks with GELU activation
- AdamW optimizer with decoupled weight decay
- Advanced text generation (Temperature + Top-K sampling)
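To make the generation step concrete, here is a minimal sketch of Temperature + Top-K sampling. It is illustrative only: the function and variable names are hypothetical and not taken from lum_gpt.cpp.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch of Temperature + Top-K sampling over raw logits.
// Not the exact routine in lum_gpt.cpp; names and structure are illustrative.
int sample_top_k(const std::vector<float>& logits, float temperature, int k, std::mt19937& rng) {
    const int vocab = static_cast<int>(logits.size());
    k = std::min(k, vocab);

    // Scale logits by temperature (lower temperature -> sharper distribution).
    std::vector<std::pair<float, int>> scaled(vocab);
    for (int i = 0; i < vocab; ++i)
        scaled[i] = { logits[i] / temperature, i };

    // Keep only the k highest-scoring tokens.
    std::partial_sort(scaled.begin(), scaled.begin() + k, scaled.end(),
                      [](auto& a, auto& b) { return a.first > b.first; });

    // Softmax over the top-k with the max-subtraction trick for stability.
    float max_logit = scaled[0].first;
    std::vector<float> probs(k);
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp(scaled[i].first - max_logit);
        sum += probs[i];
    }
    for (int i = 0; i < k; ++i) probs[i] /= sum;

    // Draw a token index according to the resulting distribution.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return scaled[dist(rng)].second;
}
```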
- Characters: 1.1M
- Vocabulary: 65 unique characters
- Lines: ~40,000
Original Hyperparameters (output.txt):
batch_size = 4, block_size = 64, d_model = 128, n_heads = 4, n_layers = 4

Loss Progress:
Step 0: 4.5875
Step 200: 3.1597
Step 400: 3.1563
...
Step 2000: 3.2377
- Content: 202 jokes from the Internet Archive
- Vocabulary: 82 unique characters
- Lines: ~3,000
- Quality: More modern, clearer English than TinyShakespeare
- Availability: The custom dataset is also included in the repository, so you can use it; simply change the dataset path in the code.
Enhanced Hyperparameters:
batch_size = 6, block_size = 128, d_model = 256, n_heads = 6, n_layers = 6

- Tensor Operations: Custom tensor class with optimized matrix operations
- Embeddings: Token and positional embeddings with Xavier initialization
- Multi-Head Attention: Scaled dot-product attention implementation
- Layer Normalization: Mathematically precise gradient computation
- Feed-Forward Networks: MLP with GELU activation
- AdamW Optimizer: Adam with decoupled weight decay
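The AdamW update with decoupled weight decay follows the standard form. The sketch below uses hypothetical names and a flat parameter layout; it is not the exact optimizer code in lum_gpt.cpp.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of one AdamW step with decoupled weight decay.
// Illustrative only; parameter names and layout differ from lum_gpt.cpp.
struct AdamW {
    float lr = 3e-4f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, weight_decay = 0.01f;
    std::vector<float> m, v;   // first and second moment estimates
    long t = 0;                // step counter

    void step(std::vector<float>& w, const std::vector<float>& grad) {
        if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
        ++t;
        for (size_t i = 0; i < w.size(); ++i) {
            m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
            float m_hat = m[i] / (1.0f - std::pow(beta1, (float)t));
            float v_hat = v[i] / (1.0f - std::pow(beta2, (float)t));
            // Decoupled weight decay: applied directly to the weights,
            // not folded into the gradient as in plain L2 regularization.
            w[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * w[i]);
        }
    }
};
```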
- All gradients computed using exact mathematical derivations
- Numerical stability through epsilon constants and max trick
- Proper gradient accumulation and backpropagation
- Combined softmax-cross entropy gradients for efficiency
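The combined softmax-cross-entropy gradient relies on the identity that the gradient of the loss with respect to the logits is simply softmax(logits) minus the one-hot target. A minimal sketch of that fusion, with hypothetical names rather than the exact routine in the code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch: softmax + cross-entropy fused gradient for one example.
// Uses the max-subtraction trick for numerical stability; the gradient
// w.r.t. the logits simplifies to (softmax(logits) - one_hot(target)).
float softmax_xent_backward(const std::vector<float>& logits, int target,
                            std::vector<float>& dlogits) {
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);

    float sum = 0.0f;
    dlogits.resize(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        dlogits[i] = std::exp(logits[i] - max_logit);
        sum += dlogits[i];
    }
    float loss = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        dlogits[i] /= sum;                     // now holds softmax probabilities
        if ((int)i == target)
            loss = -std::log(dlogits[i] + 1e-9f);
    }
    dlogits[target] -= 1.0f;                   // fused gradient: p - one_hot
    return loss;
}
```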
- Flattened tensor storage for cache efficiency (see the sketch after this list)
- Thread-local random number generation
- Careful buffer management and reuse
- In-place operations where possible
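Flattened tensor storage means each tensor lives in one contiguous buffer indexed by explicit strides, so row traversals are sequential in memory. A minimal sketch of the idea, assuming a simple row-major 2-D layout (the actual tensor class in lum_gpt.cpp is more general):

```cpp
#include <vector>

// Hypothetical sketch of a flattened 2-D tensor: one contiguous buffer,
// row-major indexing, so iterating along a row is cache-friendly.
struct Tensor2D {
    int rows = 0, cols = 0;
    std::vector<float> data;   // rows * cols elements, row-major

    Tensor2D(int r, int c) : rows(r), cols(c), data((size_t)r * c, 0.0f) {}

    float&       at(int r, int c)       { return data[(size_t)r * cols + c]; }
    const float& at(int r, int c) const { return data[(size_t)r * cols + c]; }
};
```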
- While the output quality can still be improved, the model was tested with both the smaller hyperparameters and the slightly larger ones.
The next version will include cutting-edge optimizations:
- 4-bit Quantization QAT (Quantization Aware Training)
- RoPE (Rotary Position Embedding)
- ALiBi (Attention with Linear Biases)
- Eigen 3.4.0 integration for ultra-optimized linear algebra
- Custom inference engine with specialized optimizations
- Ultra-efficient memory management
- First attempt at transformer implementation from scratch
- Runs on 15+ year-old hardware with excellent performance
- Complete mathematical implementation with proper gradients
- Custom dataset compatibility with automatic vocabulary building
Current implementation supports:
- Variable vocabulary sizes (62-82+ characters tested)
- Adjustable context windows (64-128+ tokens)
- Scalable model dimensions (128-256+ features)
- Flexible batch processing
- Custom dataset integration
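These knobs map onto a small set of hyperparameters. The configuration sketch below is hypothetical (the identifiers in lum_gpt.cpp may differ); the values mirror the enhanced run above.

```cpp
// Hypothetical configuration sketch; the actual identifiers in
// lum_gpt.cpp may differ. Values mirror the "enhanced" run above.
struct GPTConfig {
    int vocab_size = 82;    // built automatically from the dataset (62-82+ tested)
    int block_size = 128;   // context window in tokens (64-128+ tested)
    int d_model    = 256;   // embedding / model width (128-256+ tested)
    int n_heads    = 6;     // attention heads
    int n_layers   = 6;     // transformer blocks
    int batch_size = 6;     // sequences per training step
};
```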
| Component | Implementation |
|---|---|
| Language | C++17 |
| Dependencies | Standard library only |
| Memory Model | Flattened tensors |
| Optimization | AdamW with weight decay |
| Attention | Multi-head with causal mask |
| Generation | Temperature + Top-K sampling |
| Dataset | Auto-download capability |
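To make the attention row concrete, here is a minimal single-head sketch of scaled dot-product attention with a causal mask, assuming flattened row-major buffers of shape T x d. It is illustrative only; the actual implementation is multi-head and splits d_model across heads.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch of single-head scaled dot-product attention with a
// causal mask, on flattened row-major buffers (T x d). Illustrative only.
void causal_attention(const std::vector<float>& Q, const std::vector<float>& K,
                      const std::vector<float>& V, std::vector<float>& out,
                      int T, int d) {
    out.assign((size_t)T * d, 0.0f);
    std::vector<float> scores(T);
    const float scale = 1.0f / std::sqrt((float)d);

    for (int t = 0; t < T; ++t) {
        // Scores against positions <= t only (causal mask).
        float max_s = -1e30f;
        for (int s = 0; s <= t; ++s) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k)
                dot += Q[(size_t)t * d + k] * K[(size_t)s * d + k];
            scores[s] = dot * scale;
            max_s = std::max(max_s, scores[s]);
        }
        // Softmax over the unmasked positions (max trick for stability).
        float sum = 0.0f;
        for (int s = 0; s <= t; ++s) { scores[s] = std::exp(scores[s] - max_s); sum += scores[s]; }
        // Weighted sum of value vectors.
        for (int s = 0; s <= t; ++s) {
            float w = scores[s] / sum;
            for (int k = 0; k < d; ++k)
                out[(size_t)t * d + k] += w * V[(size_t)s * d + k];
        }
    }
}
```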
- 2,000+ lines of well-documented C++
- Mathematical comments explaining derivations
- Error handling and numerical stability
- Modular design with clear separation
- Performance optimizations throughout
This project represents the foundation for further transformer research and development. The upcoming version will push toward a more efficient transformer implementation while maintaining the educational clarity of the current one. This version does not include standalone inference support; it will be added in the upcoming version, allowing you to train the model once and then reuse the saved weights and vocabulary files for inference.
A complete transformer implementation proving that deep learning doesn't require expensive hardware or complex frameworks.