A complete implementation of a GPT-style transformer language model written in C++ from scratch, featuring multi-head attention, layer normalization, feed-forward networks, and AdamW optimization.
- Open Command Prompt (CMD).
- Navigate to the folder containing `lum_gpt.cpp`.
- Paste this command to compile:
  `g++ -std=c++17 -O3 -march=native lum_gpt.cpp -o lum_gpt.exe`
- You can add more flags, such as `-ffast-math` or debugging flags, if you have modified the code and want to debug it.
- After several seconds it will compile into `lum_gpt.exe`.
- Just double-click `lum_gpt.exe` to run!
The program automatically downloads the TinyShakespeare dataset if it is not present. If the library used to download the dataset is not available on your system, it is recommended to download the dataset manually.
Test System:
- CPU: AMD Phenom™ Triple-Core Processor @ 2.40 GHz
- RAM: 2 GB DDR2 (700 MB available)
- Storage: 149 GB HDD
- GPU: None (GTX 210 only for display)
Resource Usage During Training:
- Memory: 32 MB
- CPU: 45%
- Disk: 0-2%
- Training Time: ~8 minutes per 200 iterations
Complete transformer implementation with:
- Multi-head attention with causal masking
- Layer normalization (Pre-LN as in GPT-2/3)
- Feed-forward networks with GELU activation
- AdamW optimizer with decoupled weight decay
- Advanced text generation (Temperature + Top-K sampling)
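To make the generation step concrete, here is a minimal sketch of Temperature + Top-K sampling. It is illustrative only: the function and variable names are hypothetical and not taken from lum_gpt.cpp.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch of Temperature + Top-K sampling over raw logits.
// Not the exact routine in lum_gpt.cpp; names and structure are illustrative.
int sample_top_k(const std::vector<float>& logits, float temperature, int k, std::mt19937& rng) {
    const int vocab = static_cast<int>(logits.size());
    k = std::min(k, vocab);

    // Scale logits by temperature (lower temperature -> sharper distribution).
    std::vector<std::pair<float, int>> scaled(vocab);
    for (int i = 0; i < vocab; ++i)
        scaled[i] = { logits[i] / temperature, i };

    // Keep only the k highest-scoring tokens.
    std::partial_sort(scaled.begin(), scaled.begin() + k, scaled.end(),
                      [](auto& a, auto& b) { return a.first > b.first; });

    // Softmax over the top-k with the max-subtraction trick for stability.
    float max_logit = scaled[0].first;
    std::vector<float> probs(k);
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp(scaled[i].first - max_logit);
        sum += probs[i];
    }
    for (int i = 0; i < k; ++i) probs[i] /= sum;

    // Draw a token index according to the resulting distribution.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return scaled[dist(rng)].second;
}
```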
- Characters: 1.1M
- Vocabulary: 65 unique characters
- Lines: ~40,000
Original Hyperparameters (output.txt):
batch_size = 4, block_size = 64, d_model = 128, n_heads = 4, n_layers = 4

Loss Progress:
Step 0: 4.5875
Step 200: 3.1597
Step 400: 3.1563
...
Step 2000: 3.2377
- Content: 202 jokes from the Internet Archive
- Vocabulary: 82 unique characters
- Lines: ~3,000
- Quality: More modern, clearer English than TinyShakespeare
- Availability: The custom dataset is also included in the repository, so you can use it; simply change the dataset path in the code.
Enhanced Hyperparameters:
batch_size = 6, block_size = 128, d_model = 256, n_heads = 6, n_layers = 6

- Tensor Operations: Custom tensor class with optimized matrix operations
- Embeddings: Token and positional embeddings with Xavier initialization
- Multi-Head Attention: Scaled dot-product attention implementation
- Layer Normalization: Mathematically precise gradient computation
- Feed-Forward Networks: MLP with GELU activation
- AdamW Optimizer: Adam with decoupled weight decay
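The AdamW update with decoupled weight decay follows the standard form. The sketch below uses hypothetical names and a flat parameter layout; it is not the exact optimizer code in lum_gpt.cpp.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of one AdamW step with decoupled weight decay.
// Illustrative only; parameter names and layout differ from lum_gpt.cpp.
struct AdamW {
    float lr = 3e-4f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, weight_decay = 0.01f;
    std::vector<float> m, v;   // first and second moment estimates
    long t = 0;                // step counter

    void step(std::vector<float>& w, const std::vector<float>& grad) {
        if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
        ++t;
        for (size_t i = 0; i < w.size(); ++i) {
            m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
            float m_hat = m[i] / (1.0f - std::pow(beta1, (float)t));
            float v_hat = v[i] / (1.0f - std::pow(beta2, (float)t));
            // Decoupled weight decay: applied directly to the weights,
            // not folded into the gradient as in plain L2 regularization.
            w[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * w[i]);
        }
    }
};
```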
- All gradients computed using exact mathematical derivations
- Numerical stability through epsilon constants and max trick
- Proper gradient accumulation and backpropagation
- Combined softmax-cross entropy gradients for efficiency
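The combined softmax-cross-entropy gradient relies on the identity that the gradient of the loss with respect to the logits is simply softmax(logits) minus the one-hot target. A minimal sketch of that fusion, with hypothetical names rather than the exact routine in the code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch: softmax + cross-entropy fused gradient for one example.
// Uses the max-subtraction trick for numerical stability; the gradient
// w.r.t. the logits simplifies to (softmax(logits) - one_hot(target)).
float softmax_xent_backward(const std::vector<float>& logits, int target,
                            std::vector<float>& dlogits) {
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);

    float sum = 0.0f;
    dlogits.resize(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        dlogits[i] = std::exp(logits[i] - max_logit);
        sum += dlogits[i];
    }
    float loss = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        dlogits[i] /= sum;                     // now holds softmax probabilities
        if ((int)i == target)
            loss = -std::log(dlogits[i] + 1e-9f);
    }
    dlogits[target] -= 1.0f;                   // fused gradient: p - one_hot
    return loss;
}
```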
- Flattened tensor storage for cache efficiency (see the sketch after this list)
- Thread-local random number generation
- Careful buffer management and reuse
- In-place operations where possible
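Flattened tensor storage means each tensor lives in one contiguous buffer indexed by explicit strides, so row traversals are sequential in memory. A minimal sketch of the idea, assuming a simple row-major 2-D layout (the actual tensor class in lum_gpt.cpp is more general):

```cpp
#include <vector>

// Hypothetical sketch of a flattened 2-D tensor: one contiguous buffer,
// row-major indexing, so iterating along a row is cache-friendly.
struct Tensor2D {
    int rows = 0, cols = 0;
    std::vector<float> data;   // rows * cols elements, row-major

    Tensor2D(int r, int c) : rows(r), cols(c), data((size_t)r * c, 0.0f) {}

    float&       at(int r, int c)       { return data[(size_t)r * cols + c]; }
    const float& at(int r, int c) const { return data[(size_t)r * cols + c]; }
};
```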
- While the output quality can still be improved, the model was tested with both the smaller hyperparameters and the slightly larger ones.
The next version will include cutting-edge optimizations:
- 4-bit Quantization QAT (Quantization Aware Training)
- RoPE (Rotary Position Embedding)
- ALiBi (Attention with Linear Biases)
- Eigen 3.4.0 integration for ultra-optimized linear algebra
- Custom inference engine with specialized optimizations
- Ultra-efficient memory management
- First attempt at transformer implementation from scratch
- Runs on 15+ year-old hardware with excellent performance
- Complete mathematical implementation with proper gradients
- Custom dataset compatibility with automatic vocabulary building
Current implementation supports:
- Variable vocabulary sizes (62-82+ characters tested)
- Adjustable context windows (64-128+ tokens)
- Scalable model dimensions (128-256+ features)
- Flexible batch processing
- Custom dataset integration
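These knobs map onto a small set of hyperparameters. The configuration sketch below is hypothetical (the identifiers in lum_gpt.cpp may differ); the values mirror the enhanced run above.

```cpp
// Hypothetical configuration sketch; the actual identifiers in
// lum_gpt.cpp may differ. Values mirror the "enhanced" run above.
struct GPTConfig {
    int vocab_size = 82;    // built automatically from the dataset (62-82+ tested)
    int block_size = 128;   // context window in tokens (64-128+ tested)
    int d_model    = 256;   // embedding / model width (128-256+ tested)
    int n_heads    = 6;     // attention heads
    int n_layers   = 6;     // transformer blocks
    int batch_size = 6;     // sequences per training step
};
```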
| Component | Implementation |
|---|---|
| Language | C++17 |
| Dependencies | Standard library only |
| Memory Model | Flattened tensors |
| Optimization | AdamW with weight decay |
| Attention | Multi-head with causal mask |
| Generation | Temperature + Top-K sampling |
| Dataset | Auto-download capability |
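To make the attention row concrete, here is a minimal single-head sketch of scaled dot-product attention with a causal mask, assuming flattened row-major buffers of shape T x d. It is illustrative only; the actual implementation is multi-head and splits d_model across heads.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch of single-head scaled dot-product attention with a
// causal mask, on flattened row-major buffers (T x d). Illustrative only.
void causal_attention(const std::vector<float>& Q, const std::vector<float>& K,
                      const std::vector<float>& V, std::vector<float>& out,
                      int T, int d) {
    out.assign((size_t)T * d, 0.0f);
    std::vector<float> scores(T);
    const float scale = 1.0f / std::sqrt((float)d);

    for (int t = 0; t < T; ++t) {
        // Scores against positions <= t only (causal mask).
        float max_s = -1e30f;
        for (int s = 0; s <= t; ++s) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k)
                dot += Q[(size_t)t * d + k] * K[(size_t)s * d + k];
            scores[s] = dot * scale;
            max_s = std::max(max_s, scores[s]);
        }
        // Softmax over the unmasked positions (max trick for stability).
        float sum = 0.0f;
        for (int s = 0; s <= t; ++s) { scores[s] = std::exp(scores[s] - max_s); sum += scores[s]; }
        // Weighted sum of value vectors.
        for (int s = 0; s <= t; ++s) {
            float w = scores[s] / sum;
            for (int k = 0; k < d; ++k)
                out[(size_t)t * d + k] += w * V[(size_t)s * d + k];
        }
    }
}
```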
- 2,000+ lines of well-documented C++
- Mathematical comments explaining derivations
- Error handling and numerical stability
- Modular design with clear separation
- Performance optimizations throughout
This project represents the foundation for further transformer research and development. The upcoming version will push toward a more efficient transformer implementation while maintaining the educational clarity of the current one. This version does not include standalone inference support; it will be added in the upcoming version, allowing you to train the model once and then reuse the saved weights and vocabulary files for inference.
A complete transformer implementation proving that deep learning doesn't require expensive hardware or complex frameworks.