art-test-stack/gpt-lab


      ___           ___           ___           ___       ___           ___     
     /\  \         /\  \         /\  \         /\__\     /\  \         /\  \    
    /::\  \       /::\  \        \:\  \       /:/  /    /::\  \       /::\  \   
   /:/\:\  \     /:/\:\  \        \:\  \     /:/  /    /:/\:\  \     /:/\:\  \  
  /:/  \:\  \   /::\~\:\  \       /::\  \   /:/  /    /::\~\:\  \   /::\~\:\__\ 
 /:/__/_\:\__\ /:/\:\ \:\__\     /:/\:\__\ /:/__/    /:/\:\ \:\__\ /:/\:\ \:|__|
 \:\  /\ \/__/ \/__\:\/:/  /    /:/  \/__/ \:\  \    \/__\:\/:/  / \:\~\:\/:/  /
  \:\ \:\__\        \::/  /    /:/  /       \:\  \        \::/  /   \:\ \::/  / 
   \:\/:/  /         \/__/     \/__/         \:\  \       /:/  /     \:\/:/  /  
    \::/  /                                   \:\__\     /:/  /       \::/__/   
     \/__/                                     \/__/     \/__/         ~~       

gpt-lab* is a lightweight library for monitoring small LLM training runs, with inference support, aimed at small-scale ablation studies. It also includes an interface to chat with your models, and with models from the 🤗 API, locally or remotely.
Explore the docs »

Request Feature »

*This name is quite pompous and vague, I admit. Any suggestions for a better one are welcome!


About The Project

Purpose

This project is primarily educational*. It implements transformer-based language models from scratch to expose and understand their core mechanisms.

While modern LLMs can generate strong implementations, true understanding comes from building. This repository follows that philosophy: learning through construction and internalization, which then permits elaboration. That said, building alone does not guarantee understanding.

"What I cannot create, I do not understand." - Richard Feynman 🐐

This is not a production-ready library. It is a lightweight, transparent playground for training small models, running experiments, ablation studies, and exploring architectural ideas.

Components are often adapted from existing work and properly credited. The goal is not to reinvent the wheel, but to understand it well enough to modify and improve it. At least, that is the intention.

gpt-lab supports distributed training on at most a single GPU node. It is not optimized for large-scale training, but it is designed to be modular and extensible.

A simple but important question: why does this repo exist (vs. nanoGPT or the HF Trainer)? Here are some of the motivations:

  • modular training instrumentation for ablations
  • pluggable optimizer + architecture factory
  • distributed streaming dataloaders designed for throughput balance
  • tokenizer experimentation pipeline
  • built-in evaluation / inference interface

*For the uninitiated, there are of course better free online resources available. Find some in the References section.

Built With

PyTorch <3 🐐 (sorry JAX-ers)
Hugging Face (datasets, transformers, tokenizers, hub)
Weights & Biases (training monitoring)
tiktoken (very fast tokenizer encoder)
Gradio (web interface -- not really actively developed; may have some bugs and issues)
uv (dependency management and CLI)

Get Started

Setup requirements

This project has been developed and tested with Python 3.12. gpt-lab uses uv to manage dependencies.

  • Clone the repo

    git clone git@github.com:art-test-stack/gpt-lab.git
  • Install dependencies for CUDA device:

    uv sync --extra gpu

    or install dependencies for CPU/MPS device:

    uv sync --extra cpu
  • Install dependencies for development (optional, but recommended if you want to contribute):

    uv sync --group=dev
  • To use the library in jupyter notebooks:

    uv sync --group=notebook

Note

Make sure to adjust the CUDA version in uv.toml if needed. The gpu extra is only available on Linux systems with compatible NVIDIA GPUs; it enables flash_attention for faster attention computation. The default mode uses the standard kernels implementation, which makes installation easier.

Usage

There are many layers in the library, and many components that can be used and customized. The main ones are described in the sections below.

I recommend checking out the corresponding DeepWiki for more detailed documentation and explanations of the different components of the library. The generated sketches illustrate well how the different modules interact.

Scripts

The library includes scripts for training, evaluation, and inference, located in the scripts/ folder. The main ones are covered in the sections below.

Data

TL;DR: the dataloader is easily built with the following code snippet. It employs the following strategies:

  • Streaming (not mmap, not in-memory)
  • Packed (not padded batching)
  • Distributed (not replicated dataset)
  • Lazy (not pre-tokenized)
from gpt_lab.data.loader import build_dataloader, DistDataLoader

data_loader: DistDataLoader = build_dataloader(
    name="climbix-base",
    tokenizer=tokenizer,
    column="text",
    split="train",
    seq_len=model.config.max_context,
    batch_size=32,
    base_url="karpathy/climbmix-400b-shuffle", # for starting point
    max_shards=6542 # last shard id for the given dataset (if not provided, it will be computed by probing the server, which can take a while)
) 

Note that if you use the val split, you instead get a simple function that returns a dataloader; this makes it possible to ensure the same validation set across different training runs, steps, etc.

from gpt_lab.data.loader import build_dataloader

val_loader_fn = build_dataloader(
    name="climbix-base",
    tokenizer=tokenizer,
    column="text",
    split="val", # same config except here
    seq_len=model.config.max_context,
    batch_size=32,
    base_url="karpathy/climbmix-400b-shuffle", 
    max_shards=6542 
) 

val_loader = val_loader_fn() # called as a function to get the dataloader instance in trainer

We can do whatever we want with the maths, the modeling, the PyTorch implementation, etc., but the core component of any machine learning system is still its data.

The data processing has roughly three parts: one for training a tokenizer (see Tokenization), one for model training, and one for model evaluation. We focus here on model training/evaluation.

The data pipeline basically works with any dataset available online that can be reshaped into contiguous Parquet shards. Good starting points are karpathy/climbmix-400b-shuffle or HuggingFaceFW/fineweb-edu (which needs to be re-sharded first; see scripts/reshard_dataset.py).

Warning

The data loader inside the library assumes a specific structure for the dataset: it needs to be split into shard files named shard_{:05d}.parquet, where the ids are contiguous integers. Shard names that do not follow this format under base_url are simply ignored by the downloader. A small renaming helper is sketched below.
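
For example, a freshly downloaded dataset can be brought into that layout with a small renaming pass. This helper is illustrative and not part of gpt-lab; using sorted filename order as the shard order is an assumption:

from pathlib import Path

def normalize_shard_names(data_dir: str) -> None:
    # Rename every Parquet file in data_dir to the shard_{:05d}.parquet
    # layout the loader expects, assigning contiguous ids in sorted order.
    files = sorted(Path(data_dir).glob("*.parquet"))
    for i, f in enumerate(files):
        f.rename(f.with_name(f"shard_{i:05d}.parquet"))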

[Figure: Sequence packing strategy. From The Smol Training Playbook.]

The build_dataloader function builds the data loader, whose class is accessible as gpt_lab.data.DistDataLoader. It implements a distributed streaming data pipeline over Parquet shards with on-the-fly tokenization and greedy document packing into fixed-length sequences. It creates CPU and GPU buffers to pre-load the data into contiguous memory, streams the data from local shards downloaded from the given dataset, and feeds the model using a packing strategy that avoids padding tokens to maximize the throughput of the training loop. It also supports distributed training setups and can be used with DDP or other distributed training frameworks.

The critical point regarding model training is to keep a good balance between loader time and model forward/backward time, so that data loading does not bottleneck the training loop. Given that constraint, the implemented data loader performs satisfactorily.
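
A quick way to sanity-check that balance is to time the two phases separately. This is a minimal sketch in plain PyTorch, not a gpt-lab utility; it assumes the loader yields (inputs, targets) batches and that the model call returns a scalar loss:

import time
import torch

load_time, step_time = 0.0, 0.0
batches = iter(data_loader)
for _ in range(50):
    t0 = time.perf_counter()
    x, y = next(batches)          # assumption: loader yields (inputs, targets)
    load_time += time.perf_counter() - t0

    t0 = time.perf_counter()
    loss = model(x, y)            # assumption: forward returns the loss
    loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make GPU timings meaningful
    step_time += time.perf_counter() - t0
    model.zero_grad(set_to_none=True)

print(f"loader: {load_time:.2f}s, forward/backward: {step_time:.2f}s")

If the loader time is a sizable fraction of the model time, the training loop is data-bound.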

Automatic configuration

The library provides a minimalistic automatic configuration system that computes optimal model architecture, tokenizer settings, and training hyperparameters using scaling laws.

The system is based on the AutoConfig class (available in gpt_lab.model.auto) and is used in scripts/train_base.py via the --auto flag.

It automatically determines:

  • model depth scaling
  • width / aspect ratio expansion
  • vocabulary size (scaling-law driven or tokenizer-based)
  • batch size and gradient accumulation
  • training horizon (steps / FLOPs / data ratio)
from gpt_lab.model.auto import AutoConfig

# Automatic full training configuration
cfg = AutoConfig(
    basename="ic1-125M",
    depth=12,
    aspect_ratio=16,
)
meta_config = cfg.generate_gpt_config(device="cuda")

Note

Setting vocab_size = -1 enables automatic scaling-law vocabulary selection.
Setting training_time, n_steps, target_flops, or target_param_data_ratio controls training horizon priority.
The system automatically builds a reference model (12-layer baseline) to normalize scaling-law computations.
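
For instance, the knobs from the note above combine as follows. This is a sketch: the keyword names follow the note, but treat their exact spelling as assumptions.

from gpt_lab.model.auto import AutoConfig

cfg = AutoConfig(
    basename="ic1-125M",
    depth=12,
    aspect_ratio=16,
    vocab_size=-1,               # scaling-law driven vocabulary selection
    target_param_data_ratio=20,  # prioritize a tokens-per-parameter horizon
)
meta_config = cfg.generate_gpt_config(device="cuda")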

The next sections detail the different generated components.

Tokenization

The tokenization implementations are located in gpt_lab.tokenizer. The code only includes BPE tokenization for now (adding SentencePiece support is a TODO). Tokenizer training is currently only supported through the HuggingFace implementation. For inference, the tiktoken implementation is the default, as it is much faster than the HuggingFace one. The custom BPE implementation is still under development and is not functional yet.

Training a tokenizer

from gpt_lab.tokenizer import Tokenizer
from gpt_lab.tokenizer.corpus import TokenizerCorpus
from gpt_lab.utils.schemas import TokenizerTrainerConfig

# uses default corpus settings (mixture of HuggingFaceFW/fineweb-edu, HuggingFaceFW/fineweb-2, HuggingFaceTB/finemath and codeparrot/codeparrot-clean)
corpus = TokenizerCorpus.from_sources(random_seed=42)
cfg = TokenizerTrainerConfig(
    name="my_tokenizer",
    vocab_size=32_000,
    pat_str="gpt2", # pattern for pre-tokenization (e.g., "gpt2", "cl100k-base", etc., or regex pattern for custom pre-tokenization)
)
tokenizer = Tokenizer.train_from_iterator(cfg, iterator=corpus.iterator())

Using a pre-trained tokenizer

from gpt_lab.tokenizer import Tokenizer
from gpt_lab.utils.schemas import TokenizerConfig

tokenizer = Tokenizer.from_pretrained("cl100k-base", source="tiktoken")
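
As a usage sketch, assuming the wrapper exposes the usual encode/decode pair:

ids = tokenizer.encode("Hello, world!")   # list of token ids
print(tokenizer.decode(ids))              # should round-trip to the original text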

Which tokenizer implementation to choose?

The tokenizer training script is located in scripts/train_tokenizer.py. It allows you to train a BPE tokenizer on a custom corpus, using different implementations (tiktoken, HuggingFace, or custom BPE implementations). You can also choose to write the corpus from sources (e.g., Wikipedia, OpenWebText) or load an existing corpus.

Training-time benchmarks for different implementations and configurations. All tokenizers were trained on a corpus generated from gpt_lab.tokenizer.corpus.TokenizerCorpus() with default settings, with varying vocab_size.

Model architecture

The library provides a modular implementation of a GPT-style DenseTransformer, where architectural components (attention, feedforward blocks, normalization, and positional encoding) are fully decoupled and configurable. The core model is defined in gpt_lab.model.gpt, while reusable layer primitives are implemented in gpt_lab.model.layers. Model behavior is controlled via a Pydantic configuration class (TransformerConfig) defined in gpt_lab.utils.schemas (but also accessible under gpt_lab.model.gpt), enabling structured extension of architectural variants and hyperparameters.

Here is a simple example producing a Llama-like architecture:

from gpt_lab.model.gpt import TransformerConfig, DenseTransformer

cfg = TransformerConfig(
    vocab_size=32_000,
    max_context=2048,
    d_model=512,
    n_heads=8,
    n_kv_heads=8,
    n_layers=6,
    d_ffn=2048,
    attn_impl="flash_attention",
    act_func="swiglu",
    normalization="rms",
)
model = DenseTransformer(cfg)
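
A quick sanity check after construction, using plain PyTorch, is to count parameters and compare against the size you aimed for:

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")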

Optimization

In the Trainer class (available in gpt_lab.train.trainer), the optimizer is built with the DenseTransformer.build_optimizer method (available in gpt_lab.model.gpt). This design allows for a high degree of flexibility and modularity in the optimization process. Moreover, the optimizer is instantiated based on the configs/optim.yaml configuration file, which can easily be modified to include new optimizers or adjust existing ones.

default:
  opt: "adamw"
  eps: 1e-10
  weight_decay: 1e-3
embeddings: 
  opt: "adamw"
  lr: .3
  betas: [.8, .995]
  eps: 1e-10
  weight_decay: 1e-3
transformer:
  opt: "muon"
  lr: 2e-2
  momentum: .95
  ns_steps: 5
  beta: .9
  weight_decay: .28
...

The optimization process is decoupled from the model architecture, and is implemented as a separate component that can be easily swapped and customized. The optimizer is built based on the model configuration and the training configuration, using a factory pattern. The optimizer implementations are located in gpt_lab.optim.factory and the corresponding subfolders for the different optimizers.

from gpt_lab.optim import OptimizerFactory

model = ...
optim_cfg = ... # dict of optimizer hyperparameters, e.g., {"opt": "adamw", "lr": 1e-3, ...}
param_groups = [
    {"params": model.embeddings.parameters(), **optim_cfg["embeddings"]},
    {"params": model.blocks.parameters(), **optim_cfg["blocks"]},
    ...
]
optimizer = OptimizerFactory.build_optimizer(param_groups)

Warning

This is maybe the most critical part of the library regarding model training, and it is also the part I implemented least myself. I used a lot of external repositories as code baselines, and went back and forth with LLMs to improve it. My goal was to make it work while keeping it modular. However, my understanding of optimization algorithms, combined with torch.compile and distributed training, is quite limited. So, I encourage you to check the code in gpt_lab.optim.factory and the corresponding subfolders for the different optimizers.

Pre-training

The pre-training script is located in scripts/train_base.py. It allows you to pre-train a GPT model from scratch on a defined corpus, using different configurations (model architecture, training hyperparameters, optimizer, etc.). You can also choose to write the corpus from sources (e.g., Wikipedia, OpenWebText) or load an existing corpus.

Warning

There are two sub-arguments for this script: auto and custom. For now, only auto is implemented: it automatically loads a configuration based on the main arguments (depth, aspect_ratio, n_heads, etc.) and computes optimal training parameters, such as an optimal vocab_size if none is provided. The script can then train a new tokenizer with the --train-tokenizer flag. The custom sub-argument is intended to let you pass the configuration directly as command-line arguments, without the need for a YAML file. This feature is under development and will be implemented in the future.
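
As a sketch, an invocation could look like the following; the exact flag spellings are assumptions based on the argument names above:

uv run python scripts/train_base.py auto --depth 12 --aspect-ratio 16 --train-tokenizer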

Checkpointing

The framework provides a simple checkpointing system integrated with the training loop. It is implemented in the CheckpointManager class (available in gpt_lab.model.checkpoint) and allows you to save and load model checkpoints during training. Models are saved in the following directory structure:

<CACHE_DIR>/
└── models/
    └── <model_name>/                          # e.g., "ic1", "gpt2-small", "llama2"
        └── <run_name>/
            ├── meta.pskl                      # immutable (model, tokenizer, git)
            └── <source>/                      # base / sft / rl
                ├── training_config.pkl        # per-phase config
                ├── checkpoint_state.pkl       # best bpb/core steps and values
                ├── checkpoint_step_000000/
                │   ├── model.pt
                │   ├── optim_rank0.pt         # optimizer state dict (optim_rank{rank}.pt if sharded, otherwise optim.pt)
                │   ├── optim_rank1.pt
                │   ├── ...                    # more optimizer shards if needed
                │   ├── trainer_state.pkl      # training state, rng state, data state, best bpb/core steps
                │   └── metrics.pkl
                ├── checkpoint_step_000100/
                │   └── ...
                └── ...
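
Given that layout, the latest checkpoint of a run can be located with a few lines of standard-library code. This helper is illustrative, not part of the CheckpointManager API:

from pathlib import Path

def latest_checkpoint(cache_dir: str, model_name: str, run_name: str, source: str = "base") -> Path:
    # Checkpoint folders are named checkpoint_step_{:06d}, so a
    # lexicographic sort also sorts them by training step.
    run_dir = Path(cache_dir) / "models" / model_name / run_name / source
    return sorted(run_dir.glob("checkpoint_step_*"))[-1]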

Board

Visualize the training progress in the board of your choice (TensorBoard, Weights & Biases, or Trackio). You can also log to a dummy board that does not log anything, for faster training without logging overhead.

Chat with the model

In this section, you will find instructions to run the chat interface with different models.

In a development environment (DEVELOPMENT='1' in .env), you can run the chat interface with auto-reloading using the following command:

uv run gradio scripts/chat_app.py --demo-name=app

Otherwise, if you don't want auto-reloading, use:

uv run python -m scripts.chat_app

Then, open your browser and go to http://127.0.0.1:7860/. It is quite straightforward to use. You can select different models (local or remote), choose some hyperparameters for inference, and chat with the model.

Development Notes

Some components are intentionally incomplete. Contributors (including automated tools) are encouraged to explore TODOs and propose improvements via pull requests.

References

Nice repositories to check out for inspiration and reference

  1. karpathy/nanoGPT by Andrej Karpathy for pretraining code.
  2. karpathy/nanochat by Andrej Karpathy for the full training (base, SFT, GRPO) and inference pipeline.
  3. KellerJordan/modded-nanogpt by Keller Jordan for the speedrun implementation and optimization techniques.

Some nice blogs and web-articles

  1. The Hugging Face and nanotron playbooks. All of them are very good. It takes days to read them all, and more to digest, but they are worth it.
  2. Frontier model training methodologies by Alex Wa (DJ Dumpling). Quite compact compared with the Hugging Face playbooks, but still very informative and insightful.
  3. Making Deep Learning Go Brrrr From First Principles by Horace He (PyTorch). A very nice introduction to the basics of GPU computation for deep learning.
  4. Tokenizers by Karpathy, a very nice overview of tokenization for LLMs.

Some bibliography

Note

All of the literature resources below contributed in some way to the development of the library. I have probably forgotten some, and I apologize for that. If you think important papers are missing, please feel free to add one (or suggest one) via pull request. Some papers are not directly cited in the code; I will try to add citations as much as possible in the future.

Title Authors Journal Year DOI
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Tokenization Is More Than Compression
Practical Efficiency of Muon for Pretraining AI et al. arXiv 2025 2505.02222
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton Abreu et al. arXiv 2025 2510.09378
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training Bergsma et al. arXiv 2025 2505.13738
Knowledge distillation: A good teacher is patient and consistent Beyer et al. arXiv 2021 2106.05237
Language Models are Few-Shot Learners Brown et al. arXiv 2020 10.48550/arXiv.2005.14165
PaLM: Scaling Language Modeling with Pathways Chowdhery et al. arXiv 2022 10.48550/arXiv.2204.02311
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Dao et al. NeurIPS 2022 2205.14135
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Dao arXiv 2023 10.48550/arXiv.2307.08691
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning DeepSeek-AI et al. Nature volume 645, pages 633-638 (2025) 2025 10.1038/s41586-025-09422-z
QLoRA: Efficient Finetuning of Quantized LLMs Dettmers et al. arXiv 2023 10.48550/arXiv.2305.14314
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin et al. arXiv 2018 10.48550/arXiv.1810.04805
Fewer Truncations Improve Language Modeling Ding et al. arXiv 2024 2404.10830
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Fedus et al. ICML 2021 2101.03961
How to Train Long-Context Language Models (Effectively) Gao et al. arXiv 2024 10.48550/arXiv.2410.02660
Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials Grishina et al. arXiv 2025 2506.10935
Shampoo: Preconditioned Stochastic Tensor Optimization Gupta et al. arXiv 2018 10.48550/arXiv.1802.09568
Training Compute-Optimal Large Language Models Hoffmann et al. arXiv 2022 10.48550/arXiv.2203.15556
LoRA: Low-Rank Adaptation of Large Language Models Hu et al. ICLR 2021 2106.09685
Block-Recurrent Transformers Hutchins et al. arXiv 2022 2203.07852
Mistral 7B Jiang et al. arXiv 2023 10.48550/arXiv.2310.06825
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention Katharopoulos et al. arXiv 2020 10.48550/arXiv.2006.16236
KellerJordan/Muon KellerJordan GitHub 2024 github:kellerjordan/muon
KellerJordan/modded-nanogpt KellerJordan GitHub 2024 github:kellerjordan/modded-nanogpt
KIMI K2: OPEN AGENTIC INTELLIGENCE Kimi Team arXiv 2025 10.48550/arXiv.2507.20534
Attention Residuals Kimi Team arXiv 2026 10.48550/arXiv.2603.15031
Decoding-time Realignment of Language Models Liu et al. arXiv 2024 2402.02992
Muon is Scalable for LLM Training Liu et al. arXiv 2025 2502.16982
StarCoder 2 and The Stack v2: The Next Generation Lozhkov et al. arXiv 2024 10.48550/arXiv.2402.19173
YaRN: Efficient Context Window Extension of Large Language Models Peng et al. arXiv 2023 10.48550/arXiv.2309.00071
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free Qiu et al. arXiv 2025 2505.06708
Language models are unsupervised multitask learners Radford et al. OpenAI 2019 unsupervised-multitask
SQUAD: 100,000+ Questions for Machine Comprehension of Text Rajpurkar et al. arXiv 2016 10.48550/arXiv.1606.05250
Observational Scaling Laws and the Predictability of Language Model Performance Ruan et al. arXiv 2024 10.48550/arXiv.2405.10938
SlimPajama-DC: Understanding Data Combinations for LLM Training Shen et al. arXiv 2023 2309.10818
How to Train Your Energy-Based Models Song et al. arXiv 2021 10.48550/arXiv.2101.03288
Building Bridges between Regression, Clustering, and Classification Stewart et al. arXiv 2025 2502.02996
RoFormer: Enhanced Transformer with Rotary Position Embedding Su et al. arXiv 2021 10.48550/arXiv.2104.09864
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies Tao et al. arXiv 2024 2407.13623
Efficient Transformers: A Survey Tay et al. arXiv 2020 10.48550/arXiv.2009.06732
Attention is all you need Vaswani et al. arXiv 2017 10.48550/arXiv.1706.03762
ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers Wang and Li arXiv 2023 2310.02489
Fantastic Pretraining Optimizers and Where to Find Them Wen et al. arXiv 2025 2509.02046
Unified Training of Universal Time Series Forecasting Transformers Woo et al. arXiv 2024 2402.02592
Effective Long-Context Scaling of Foundation Models Xiong et al. arXiv 2023 10.48550/arXiv.2309.16039
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly Yen et al. arXiv 2024 10.48550/arXiv.2410.02694
Recursive Language Models Zhang et al. arXiv 2025 2512.24601
dLLM: Simple Diffusion Language Modeling Zhou et al. arXiv 2026 10.48550/arXiv.2602.22661

Bibliography made with art-test-stack/MyBible.

Some video resources

For the laziest (😛), there are also a lot of YouTube videos that explain well the different components of the library, and how to implement them. Here are some that I found useful:

  1. Andrej Karpathy's YouTube channel for his unmatched expertise in the field, and his ability to explain complex concepts in a simple and intuitive way. His videos on Transformers and LLMs are particularly useful for understanding the architecture and training of these models.
  2. Stanford's CME295 course for the very nice lectures on Transformers and LLMs by Afshine and Shervine Amidi. They are currently releasing lectures for CME296, which covers diffusion & LVMs.

Extra

  1. Banner made with: hacker-tools/ascii-banner

TODOs

Here is a non-exhaustive list of features that I aim to implement. Stars correspond to the priority level. Contributions are welcome!

  • Tokenization ⭐️
    • BPE implementation in Python
    • Rust implementation
  • Architecture ⭐️⭐️⭐️
    • ALiBi
    • MoE
    • Mixture of Depths
  • Optimization ⭐️⭐️
    • Shampoo optimizer
    • LION optimizer
    • MARS optimizer
  • Precision ⭐️⭐️
    • model and optimizer quantization
  • Training ⭐️⭐️⭐️
    • fine-tuning / instruction tuning
    • grpo
  • Cross-lib features ⭐️⭐️
    • HuggingFace integration (model loading, tokenizers, etc.)
    • vLLM, DeepSpeed, Megatron-LM, etc. integration

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Arthur Testard - arthur.testard.pro@gmail.com

Project Link: art-test-stack/gpt-lab

Citation

If you use this work in your research, please consider citing the following:

@misc{gptlab2026,
  author={Testard, Arthur},
  title={gpt-lab: A light-weight library for fast-ablation studies on GPT-like LMs},
  year={2026},
  publisher={GitHub},
  url={https://github.com/art-test-stack/gpt-lab}
}

(back to top)
