art-test-stack/gpt-lab


      ___           ___           ___           ___       ___           ___     
     /\  \         /\  \         /\  \         /\__\     /\  \         /\  \    
    /::\  \       /::\  \        \:\  \       /:/  /    /::\  \       /::\  \   
   /:/\:\  \     /:/\:\  \        \:\  \     /:/  /    /:/\:\  \     /:/\:\  \  
  /:/  \:\  \   /::\~\:\  \       /::\  \   /:/  /    /::\~\:\  \   /::\~\:\__\ 
 /:/__/_\:\__\ /:/\:\ \:\__\     /:/\:\__\ /:/__/    /:/\:\ \:\__\ /:/\:\ \:|__|
 \:\  /\ \/__/ \/__\:\/:/  /    /:/  \/__/ \:\  \    \/__\:\/:/  / \:\~\:\/:/  /
  \:\ \:\__\        \::/  /    /:/  /       \:\  \        \::/  /   \:\ \::/  / 
   \:\/:/  /         \/__/     \/__/         \:\  \       /:/  /     \:\/:/  /  
    \::/  /                                   \:\__\     /:/  /       \::/__/   
     \/__/                                     \/__/     \/__/         ~~       

gpt-lab* is a lightweight library for monitoring small LLM training runs, with inference support, aimed at small-scale ablation studies. It also includes an interface to chat with your models, and with models from the 🤗 API, locally or remotely.
Explore the docs »

Request Feature »

*This name is quite pompous and vague, I admit. Any suggestions for a better one are welcome!


About The Project

Purpose

This project is primarily educational*. It implements transformer-based language models from scratch to expose and understand their core mechanisms.

While modern LLMs can generate strong implementations, true understanding comes from building. This repository follows that philosophy: learning through construction and internalization, which then permits elaboration. That said, building alone does not guarantee understanding.

"What I cannot create, I do not understand." - Richard Feynman 🐐

This is not a production-ready library. It is a lightweight, transparent playground for training small models, running experiments, ablation studies, and exploring architectural ideas.

Components are often adapted from existing work and properly credited. The goal is not to reinvent the wheel, but to understand it well enough to modify and improve it. At least, that is the intention.

gpt-lab supports distributed training on at most a single GPU node. It is not optimized for large-scale training, but it is designed to be modular and extensible.

A simple but important question: why does this repo exist (vs. nanoGPT or the HF Trainer)? Here are some of the motivations:

  • modular training instrumentation for ablations
  • pluggable optimizer + architecture factory
  • distributed streaming dataloaders designed for throughput balance
  • tokenizer experimentation pipeline
  • built-in evaluation / inference interface

*For the uninitiated, there are of course better free online resources available. Find some in the References section.

Built With

PyTorch <3 🐐 (sorry JAX-ers)
Hugging Face (datasets, transformers, tokenizers, hub)
Weights & Biases (training monitoring)
tiktoken (very fast tokenizer encoder)
Gradio (web interface -- not really actively developed; may have some bugs and issues)
uv (dependency management and CLI)

Get Started

Setup requirements

This project has been developed and tested with Python 3.12. gpt-lab uses uv to manage dependencies.

  • Clone the repo

    git clone git@github.com:art-test-stack/gpt-lab.git
  • Install dependencies for CUDA device:

    uv sync --extra gpu

    or install dependencies for CPU/MPS device:

    uv sync --extra cpu
  • Install dependencies for development (optional, but recommended if you want to contribute):

    uv sync --group=dev
  • To use the library in jupyter notebooks:

    uv sync --group=notebook

Note

Make sure to adjust the CUDA version in uv.toml if needed. The gpu extra is only available on Linux systems with compatible NVIDIA GPUs; it enables flash_attention for faster attention computation. The default mode uses the standard kernels implementation, which makes installation easier.

Usage

There are many layers in the library, and many components that can be used and customized. The main ones are described in the sections below.

I recommend checking out the corresponding DeepWiki for more detailed documentation and explanations of the different components of the library. The generated sketches illustrate well how the different modules interact.

Scripts

The library includes scripts for training, evaluation, and inference, located in the scripts/ folder. The main ones are covered in the sections below.

Data

TL;DR: the dataloader is easily built with the following code snippet. It employs the following strategies:

  • Streaming (not mmap, not in-memory)
  • Packed (not padded batching)
  • Distributed (not replicated dataset)
  • Lazy (not pre-tokenized)
from gpt_lab.data.loader import build_dataloader, DistDataLoader

data_loader: DistDataLoader = build_dataloader(
    name="climbix-base",
    tokenizer=tokenizer,
    column="text",
    split="train",
    seq_len=model.config.max_context,
    batch_size=32,
    base_url="karpathy/climbmix-400b-shuffle", # for starting point
    max_shards=6542 # last shard id for the given dataset (if not provided, it will be computed by probing the server, which can take a while)
) 

Note that if you use the val split, you instead get a simple function that returns a dataloader; this makes it possible to ensure the same validation set across different training runs, steps, etc.

from gpt_lab.data.loader import build_dataloader

val_loader_fn = build_dataloader(
    name="climbix-base",
    tokenizer=tokenizer,
    column="text",
    split="val", # same config except here
    seq_len=model.config.max_context,
    batch_size=32,
    base_url="karpathy/climbmix-400b-shuffle", 
    max_shards=6542 
) 

val_loader = val_loader_fn() # called as a function to get the dataloader instance in trainer

We can do whatever we want with the maths, the modeling, the PyTorch implementation, etc., but the core component of any machine learning system is still its data.

The data processing has roughly three parts: one for training a tokenizer (see Tokenization), one for model training, and one for model evaluation. We focus here on model training/evaluation.

The data pipeline basically works with any dataset available online that can be reshaped into contiguous Parquet shards. Good starting points are karpathy/climbmix-400b-shuffle or HuggingFaceFW/fineweb-edu (which needs to be re-sharded first; see scripts/reshard_dataset.py).

Warning

The data loader inside the library assumes a specific structure for the dataset: it needs to be split into shard files named shard_{:05d}.parquet, where the ids are contiguous integers. Shard names that do not follow this format under base_url are simply ignored by the downloader. A small renaming helper is sketched below.
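
For example, a freshly downloaded dataset can be brought into that layout with a small renaming pass. This helper is illustrative and not part of gpt-lab; using sorted filename order as the shard order is an assumption:

from pathlib import Path

def normalize_shard_names(data_dir: str) -> None:
    # Rename every Parquet file in data_dir to the shard_{:05d}.parquet
    # layout the loader expects, assigning contiguous ids in sorted order.
    files = sorted(Path(data_dir).glob("*.parquet"))
    for i, f in enumerate(files):
        f.rename(f.with_name(f"shard_{i:05d}.parquet"))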

[Figure: Sequence packing strategy. From The Smol Training Playbook.]

The build_dataloader function builds the data loader, whose class is accessible as gpt_lab.data.DistDataLoader. It implements a distributed streaming data pipeline over Parquet shards with on-the-fly tokenization and greedy document packing into fixed-length sequences. It creates CPU and GPU buffers to pre-load the data into contiguous memory, streams the data from local shards downloaded from the given dataset, and feeds the model using a packing strategy that avoids padding tokens to maximize the throughput of the training loop. It also supports distributed training setups and can be used with DDP or other distributed training frameworks.

The critical point regarding model training is to keep a good balance between loader time and model forward/backward time, so that data loading does not bottleneck the training loop. Given that constraint, the implemented data loader performs satisfactorily.
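
A quick way to sanity-check that balance is to time the two phases separately. This is a minimal sketch in plain PyTorch, not a gpt-lab utility; it assumes the loader yields (inputs, targets) batches and that the model call returns a scalar loss:

import time
import torch

load_time, step_time = 0.0, 0.0
batches = iter(data_loader)
for _ in range(50):
    t0 = time.perf_counter()
    x, y = next(batches)          # assumption: loader yields (inputs, targets)
    load_time += time.perf_counter() - t0

    t0 = time.perf_counter()
    loss = model(x, y)            # assumption: forward returns the loss
    loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make GPU timings meaningful
    step_time += time.perf_counter() - t0
    model.zero_grad(set_to_none=True)

print(f"loader: {load_time:.2f}s, forward/backward: {step_time:.2f}s")

If the loader time is a sizable fraction of the model time, the training loop is data-bound.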

Automatic configuration

The library provides a minimalistic automatic configuration system that computes optimal model architecture, tokenizer settings, and training hyperparameters using scaling laws.

The system is based on the AutoConfig class (available in gpt_lab.model.auto) and is used in scripts/train_base.py via the --auto flag.

It automatically determines:

  • model depth scaling
  • width / aspect ratio expansion
  • vocabulary size (scaling-law driven or tokenizer-based)
  • batch size and gradient accumulation
  • training horizon (steps / FLOPs / data ratio)
from gpt_lab.model.auto import AutoConfig

# Automatic full training configuration
cfg = AutoConfig(
    basename="ic1-125M",
    depth=12,
    aspect_ratio=16,
)
meta_config = cfg.generate_gpt_config(device="cuda")

Note

Setting vocab_size = -1 enables automatic scaling-law vocabulary selection.
Setting training_time, n_steps, target_flops, or target_param_data_ratio controls training horizon priority.
The system automatically builds a reference model (12-layer baseline) to normalize scaling-law computations.
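
For instance, the knobs from the note above combine as follows. This is a sketch: the keyword names follow the note, but treat their exact spelling as assumptions.

from gpt_lab.model.auto import AutoConfig

cfg = AutoConfig(
    basename="ic1-125M",
    depth=12,
    aspect_ratio=16,
    vocab_size=-1,               # scaling-law driven vocabulary selection
    target_param_data_ratio=20,  # prioritize a tokens-per-parameter horizon
)
meta_config = cfg.generate_gpt_config(device="cuda")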

The next sections detail the different generated components.

Tokenization

The tokenization implementations are located in gpt_lab.tokenizer. The code only includes BPE tokenization for now (adding SentencePiece support is a TODO). Tokenizer training is currently only supported through the HuggingFace implementation. For inference, the tiktoken implementation is the default, as it is much faster than the HuggingFace one. The custom BPE implementation is still under development and is not functional yet.

Training a tokenizer

from gpt_lab.tokenizer import Tokenizer
from gpt_lab.tokenizer.corpus import TokenizerCorpus
from gpt_lab.utils.schemas import TokenizerTrainerConfig

# uses default corpus settings (mixture of HuggingFaceFW/fineweb-edu, HuggingFaceFW/fineweb-2, HuggingFaceTB/finemath and codeparrot/codeparrot-clean)
corpus = TokenizerCorpus.from_sources(random_seed=42)
cfg = TokenizerTrainerConfig(
    name="my_tokenizer",
    vocab_size=32_000,
    pat_str="gpt2", # pattern for pre-tokenization (e.g., "gpt2", "cl100k-base", etc., or regex pattern for custom pre-tokenization)
)
tokenizer = Tokenizer.train_from_iterator(cfg, iterator=corpus.iterator())

Using a pre-trained tokenizer

from gpt_lab.tokenizer import Tokenizer
from gpt_lab.utils.schemas import TokenizerConfig

tokenizer = Tokenizer.from_pretrained("cl100k-base", source="tiktoken")
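
As a usage sketch, assuming the wrapper exposes the usual encode/decode pair:

ids = tokenizer.encode("Hello, world!")   # list of token ids
print(tokenizer.decode(ids))              # should round-trip to the original text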

Which tokenizer implementation to choose?

The tokenizer training script is located in scripts/train_tokenizer.py. It allows you to train a BPE tokenizer on a custom corpus, using different implementations (tiktoken, HuggingFace, or custom BPE implementations). You can also choose to write the corpus from sources (e.g., Wikipedia, OpenWebText) or load an existing corpus.

Training-time benchmarks for different implementations and configurations. All tokenizers were trained on a corpus generated from gpt_lab.tokenizer.corpus.TokenizerCorpus() with default settings, with varying vocab_size.

Model architecture

The library provides a modular implementation of a GPT-style DenseTransformer, where architectural components (attention, feedforward blocks, normalization, and positional encoding) are fully decoupled and configurable. The core model is defined in gpt_lab.model.gpt, while reusable layer primitives are implemented in gpt_lab.model.layers. Model behavior is controlled via a Pydantic configuration class (TransformerConfig) defined in gpt_lab.utils.schemas (but also accessible under gpt_lab.model.gpt), enabling structured extension of architectural variants and hyperparameters.

Here is a simple example producing a Llama-like architecture:

from gpt_lab.model.gpt import TransformerConfig, DenseTransformer

cfg = TransformerConfig(
    vocab_size=32_000,
    max_context=2048,
    d_model=512,
    n_heads=8,
    n_kv_heads=8,
    n_layers=6,
    d_ffn=2048,
    attn_impl="flash_attention",
    act_func="swiglu",
    normalization="rms",
)
model = DenseTransformer(cfg)
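
A quick sanity check after construction, using plain PyTorch, is to count parameters and compare against the size you aimed for:

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")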

Optimization

In the Trainer class (available in gpt_lab.train.trainer), the optimizer is built with the DenseTransformer.build_optimizer method (available in gpt_lab.model.gpt). This design allows for a high degree of flexibility and modularity in the optimization process. Moreover, the optimizer is instantiated based on the configs/optim.yaml configuration file, which can easily be modified to include new optimizers or adjust existing ones.

default:
  opt: "adamw"
  eps: 1e-10
  weight_decay: 1e-3
embeddings: 
  opt: "adamw"
  lr: .3
  betas: [.8, .995]
  eps: 1e-10
  weight_decay: 1e-3
transformer:
  opt: "muon"
  lr: 2e-2
  momentum: .95
  ns_steps: 5
  beta: .9
  weight_decay: .28
...

The optimization process is decoupled from the model architecture, and is implemented as a separate component that can be easily swapped and customized. The optimizer is built based on the model configuration and the training configuration, using a factory pattern. The optimizer implementations are located in gpt_lab.optim.factory and the corresponding subfolders for the different optimizers.

from gpt_lab.optim import OptimizerFactory

model = ...
optim_cfg = ... # dict of optimizer hyperparameters, e.g., {"opt": "adamw", "lr": 1e-3, ...}
param_groups = [
    {"params": model.embeddings.parameters(), **optim_cfg["embeddings"]},
    {"params": model.blocks.parameters(), **optim_cfg["blocks"]},
    ...
]
optimizer = OptimizerFactory.build_optimizer(param_groups)

Warning

This is maybe the most critical part of the library regarding model training, and it is also the part I implemented least myself. I used a lot of external repositories as code baselines, and went back and forth with LLMs to improve it. My goal was to make it work while keeping it modular. However, my understanding of optimization algorithms, combined with torch.compile and distributed training, is quite limited. So, I encourage you to check the code in gpt_lab.optim.factory and the corresponding subfolders for the different optimizers.

Pre-training

The pre-training script is located in scripts/train_base.py. It allows you to pre-train a GPT model from scratch on a defined corpus, using different configurations (model architecture, training hyperparameters, optimizer, etc.). You can also choose to write the corpus from sources (e.g., Wikipedia, OpenWebText) or load an existing corpus.

Warning

There are two sub-arguments for this script: auto and custom. For now, only auto is implemented: it automatically loads a configuration based on the main arguments (depth, aspect_ratio, n_heads, etc.) and computes optimal training parameters, such as an optimal vocab_size if none is provided. The script can then train a new tokenizer with the --train-tokenizer flag. The custom sub-argument is intended to let you pass the configuration directly as command-line arguments, without the need for a YAML file. This feature is under development and will be implemented in the future.
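
As a sketch, an invocation could look like the following; the exact flag spellings are assumptions based on the argument names above:

uv run python scripts/train_base.py auto --depth 12 --aspect-ratio 16 --train-tokenizer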

Checkpointing

The framework provides a simple checkpointing system integrated with the training loop. It is implemented in the CheckpointManager class (available in gpt_lab.model.checkpoint) and allows you to save and load model checkpoints during training. Models are saved in the following directory structure:

<CACHE_DIR>/
└── models/
    └── <model_name>/                          # e.g., "ic1", "gpt2-small", "llama2"
        └── <run_name>/
            ├── meta.pskl                      # immutable (model, tokenizer, git)
            └── <source>/                      # base / sft / rl
                ├── training_config.pkl        # per-phase config
                ├── checkpoint_state.pkl       # best bpb/core steps and values
                ├── checkpoint_step_000000/
                │   ├── model.pt
                │   ├── optim_rank0.pt         # optimizer state dict (optim_rank{rank}.pt if sharded, otherwise optim.pt)
                │   ├── optim_rank1.pt
                │   ├── ...                    # more optimizer shards if needed
                │   ├── trainer_state.pkl      # training state, rng state, data state, best bpb/core steps
                │   └── metrics.pkl
                ├── checkpoint_step_000100/
                │   └── ...
                └── ...
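
Given that layout, the latest checkpoint of a run can be located with a few lines of standard-library code. This helper is illustrative, not part of the CheckpointManager API:

from pathlib import Path

def latest_checkpoint(cache_dir: str, model_name: str, run_name: str, source: str = "base") -> Path:
    # Checkpoint folders are named checkpoint_step_{:06d}, so a
    # lexicographic sort also sorts them by training step.
    run_dir = Path(cache_dir) / "models" / model_name / run_name / source
    return sorted(run_dir.glob("checkpoint_step_*"))[-1]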

Board

Visualize the training progress in the board of your choice (TensorBoard, Weights & Biases, or Trackio). You can also log to a dummy board that does not log anything, for faster training without logging overhead.

Chat with the model

In this section, you will find instructions to run the chat interface with different models.

In a development environment (DEVELOPMENT='1' in .env), you can run the chat interface with auto-reloading using the following command:

uv run gradio scripts/chat_app.py --demo-name=app

Otherwise, if you don't want auto-reloading, use:

uv run python -m scripts.chat_app

Then, open your browser and go to http://127.0.0.1:7860/. It is quite straightforward to use. You can select different models (local or remote), choose some hyperparameters for inference, and chat with the model.

Development Notes

Some components are intentionally incomplete. Contributors (including automated tools) are encouraged to explore TODOs and propose improvements via pull requests.

References

Nice repositories to check out for inspiration and reference

  1. karpathy/nanoGPT by Andrej Karpathy for pretraining code.
  2. karpathy/nanochat by Andrej Karpathy for the full training (base, SFT, GRPO) and inference pipeline.
  3. KellerJordan/modded-nanogpt by Keller Jordan for the speedrun implementation and optimization techniques.

Some nice blogs and web-articles

  1. The Hugging Face and nanotron playbooks. All of them are very good. It takes days to read them all, and more to digest, but they are worth it.
  2. Frontier model training methodologies by Alex Wa (DJ Dumpling). Quite compact compared with the Hugging Face playbooks, but still very informative and insightful.
  3. Making Deep Learning Go Brrrr From First Principles by Horace He (PyTorch). A very nice introduction to the basics of GPU computation for deep learning.
  4. Tokenizers by Karpathy, a very nice overview of tokenization for LLMs.

Some bibliography

Note

All of the literature resources below contributed in some way to the development of the library. I have probably forgotten some, and I apologize for that. If you think important papers are missing, please feel free to add one (or suggest one) via pull request. Some papers are not directly cited in the code; I will try to add citations as much as possible in the future.

Title Authors Journal Year DOI
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Tokenization Is More Than Compression
Practical Efficiency of Muon for Pretraining AI et al. arXiv 2025 2505.02222
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton Abreu et al. arXiv 2025 2510.09378
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training Bergsma et al. arXiv 2025 2505.13738
Knowledge distillation: A good teacher is patient and consistent Beyer et al. arXiv 2021 2106.05237
Language Models are Few-Shot Learners Brown et al. arXiv 2020 10.48550/arXiv.2005.14165
PaLM: Scaling Language Modeling with Pathways Chowdhery et al. arXiv 2022 10.48550/arXiv.2204.02311
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Dao et al. NeurIPS 2022 2205.14135
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Dao arXiv 2023 10.48550/arXiv.2307.08691
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning DeepSeek-AI et al. Nature volume 645, pages 633-638 (2025) 2025 10.1038/s41586-025-09422-z
QLoRA: Efficient Finetuning of Quantized LLMs Dettmers et al. arXiv 2023 10.48550/arXiv.2305.14314
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin et al. arXiv 2018 10.48550/arXiv.1810.04805
Fewer Truncations Improve Language Modeling Ding et al. arXiv 2024 2404.10830
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Fedus et al. ICML 2021 2101.03961
How to Train Long-Context Language Models (Effectively) Gao et al. arXiv 2024 10.48550/arXiv.2410.02660
Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials Grishina et al. arXiv 2025 2506.10935
Shampoo: Preconditioned Stochastic Tensor Optimization Gupta et al. arXiv 2018 10.48550/arXiv.1802.09568
Training Compute-Optimal Large Language Models Hoffmann et al. arXiv 2022 10.48550/arXiv.2203.15556
LoRA: Low-Rank Adaptation of Large Language Models Hu et al. ICLR 2021 2106.09685
Block-Recurrent Transformers Hutchins et al. arXiv 2022 2203.07852
Mistral 7B Jiang et al. arXiv 2023 10.48550/arXiv.2310.06825
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention Katharopoulos et al. arXiv 2020 10.48550/arXiv.2006.16236
KellerJordan/Muon KellerJordan GitHub 2024 github:kellerjordan/muon
KellerJordan/modded-nanogpt KellerJordan GitHub 2024 github:kellerjordan/modded-nanogpt
KIMI K2: OPEN AGENTIC INTELLIGENCE Kimi Team arXiv 2025 10.48550/arXiv.2507.20534
Attention Residuals Kimi Team arXiv 2026 10.48550/arXiv.2603.15031
Decoding-time Realignment of Language Models Liu et al. arXiv 2024 2402.02992
Muon is Scalable for LLM Training Liu et al. arXiv 2025 2502.16982
StarCoder 2 and The Stack v2: The Next Generation Lozhkov et al. arXiv 2024 10.48550/arXiv.2402.19173
YaRN: Efficient Context Window Extension of Large Language Models Peng et al. arXiv 2023 10.48550/arXiv.2309.00071
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free Qiu et al. arXiv 2025 2505.06708
Language models are unsupervised multitask learners Radford et al. OpenAI 2019 unsupervised-multitask
SQUAD: 100,000+ Questions for Machine Comprehension of Text Rajpurkar et al. arXiv 2016 10.48550/arXiv.1606.05250
Observational Scaling Laws and the Predictability of Language Model Performance Ruan et al. arXiv 2024 10.48550/arXiv.2405.10938
SlimPajama-DC: Understanding Data Combinations for LLM Training Shen et al. arXiv 2023 2309.10818
How to Train Your Energy-Based Models Song et al. arXiv 2021 10.48550/arXiv.2101.03288
Building Bridges between Regression, Clustering, and Classification Stewart et al. arXiv 2025 2502.02996
RoFormer: Enhanced Transformer with Rotary Position Embedding Su et al. arXiv 2021 10.48550/arXiv.2104.09864
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies Tao et al. arXiv 2024 2407.13623
Efficient Transformers: A Survey Tay et al. arXiv 2020 10.48550/arXiv.2009.06732
Attention is all you need Vaswani et al. arXiv 2017 10.48550/arXiv.1706.03762
ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers Wang and Li arXiv 2023 2310.02489
Fantastic Pretraining Optimizers and Where to Find Them Wen et al. arXiv 2025 2509.02046
Unified Training of Universal Time Series Forecasting Transformers Woo et al. arXiv 2024 2402.02592
Effective Long-Context Scaling of Foundation Models Xiong et al. arXiv 2023 10.48550/arXiv.2309.16039
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly Yen et al. arXiv 2024 10.48550/arXiv.2410.02694
Recursive Language Models Zhang et al. arXiv 2025 2512.24601
dLLM: Simple Diffusion Language Modeling Zhou et al. arXiv 2026 10.48550/arXiv.2602.22661

Bibliography made with art-test-stack/MyBible.

Some video resources

For the laziest (😛), there are also a lot of YouTube videos that explain well the different components of the library, and how to implement them. Here are some that I found useful:

  1. Andrej Karpathy's YouTube channel for his unmatched expertise in the field, and his ability to explain complex concepts in a simple and intuitive way. His videos on Transformers and LLMs are particularly useful for understanding the architecture and training of these models.
  2. Stanford's CME295 course for the very nice lectures on Transformers and LLMs by Afshine and Shervine Amidi. They are currently releasing lectures for CME296, which covers diffusion & LVMs.

Extra

  1. Banner made with: hacker-tools/ascii-banner

TODOs

Here is a non-exhaustive list of features that I aim to implement. Stars correspond to the priority level. Contributions are welcome!

  • Tokenization ⭐️
    • BPE implementation in Python
    • Rust implementation
  • Architecture ⭐️⭐️⭐️
    • ALiBi
    • MoE
    • Mixture of Depths
  • Optimization ⭐️⭐️
    • Shampoo optimizer
    • LION optimizer
    • MARS optimizer
  • Precision ⭐️⭐️
    • model and optimizer quantization
  • Training ⭐️⭐️⭐️
    • fine-tuning / instruction tuning
    • grpo
  • Cross-lib features ⭐️⭐️
    • HuggingFace integration (model loading, tokenizers, etc.)
    • vLLM, DeepSpeed, Megatron-LM, etc. integration

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Arthur Testard - arthur.testard.pro@gmail.com

Project Link: art-test-stack/gpt-lab

Citation

If you use this work in your research, please consider citing the following:

@misc{gptlab2026,
  author={Testard, Arthur},
  title={gpt-lab: A light-weight library for fast-ablation studies on GPT-like LMs},
  year={2026},
  publisher={GitHub},
  url={https://github.com/art-test-stack/gpt-lab}
}

(back to top)
