
A research-grade, reproducible end-to-end LLM training stack built from scratch in PyTorch. Includes tokenizer training, GPT-style pretraining, scaling experiments, SFT and DPO post-training, evaluation harness, and distributed training support. Designed for rigorous experimentation and small-scale scaling studies.


llm-from-scratch-stack

A compact, reproducible, offline-first LLM training stack covering pretraining, evaluation, SFT, and DPO.

What it is / isn't

  • ✅ An educational, small-scale stack built to production standards, with reproducibility and CI.
  • ❌ Not a frontier-scale training system.

Install

pip install -e .
pip install -e ".[datasets,wandb]"  # optional extras

Quickstart (offline)

python scripts/train_tokenizer.py data.train_path=data/toy/train.jsonl out_dir=artifacts/tokenizer_toy
python scripts/pretrain.py --config-name=pretrain model=gpt_small data=toy train.max_steps=200
python scripts/evaluate.py eval.checkpoint_path=<path_to_last.ckpt> eval.max_batches=50
python scripts/sft.py sft.data_path=data/toy/sft.jsonl sft.base_checkpoint=<path_to_last.ckpt>
python scripts/dpo.py dpo.data_path=data/toy/prefs.jsonl dpo.base_checkpoint=<path_to_last.ckpt>

Hydra overrides are supported, e.g.

python scripts/pretrain.py train.max_steps=1000 model.n_layers=6
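
The exact config schema isn't shown in this README; a hypothetical `conf/pretrain.yaml` consistent with the overrides above might look like the following (key names `train.max_steps` and `model.n_layers` are taken from the commands; everything else is illustrative):

```yaml
# Hypothetical Hydra config sketch — values here are defaults
# that the CLI overrides above would replace at run time.
train:
  max_steps: 200      # overridden with train.max_steps=1000
model:
  n_layers: 4         # overridden with model.n_layers=6
```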

Reproducibility

Each run stores config.yaml, manifest.json (timestamp, host, git sha, pip freeze, command), JSONL logs, and TensorBoard logs. Checkpoints save model/optimizer/scheduler/scaler/step/RNG state.
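
The repository's own manifest writer isn't reproduced here; the sketch below shows one way to capture the fields listed above (timestamp, host, git sha, pip freeze, command) using only the standard library. `write_manifest` and `_capture` are hypothetical names, not part of this codebase.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def _capture(cmd):
    """Run a command and return its stdout, or "" if the tool is unavailable."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except OSError:
        return ""

def write_manifest(out_path="manifest.json", command=None):
    """Record the run environment, in the spirit of the manifest.json described above."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": platform.node(),
        "git_sha": _capture(["git", "rev-parse", "HEAD"]).strip() or None,
        "pip_freeze": _capture([sys.executable, "-m", "pip", "freeze"]).splitlines(),
        "command": command or " ".join(sys.argv),
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Writing the manifest at process start means a crashed run still leaves enough metadata to be re-launched exactly.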

Logging and metrics

  • metrics.jsonl: scalar logs (loss, lr, tokens/sec, grad_norm, val_loss).
  • TensorBoard under tb/.
  • Validation perplexity uses a distributed-safe, token-weighted reduction path in the evaluation utilities.
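
To illustrate why the token-weighted reduction matters: averaging per-batch perplexities is wrong when batches hold different token counts, so the safe path sums negative log-likelihood and token counts separately before dividing. The repository's implementation isn't shown here; this is a minimal single-process sketch (in a real distributed run, the two sums would go through an `all_reduce` before the division):

```python
import math

def token_weighted_perplexity(per_rank_stats):
    """Combine (summed_nll, token_count) pairs — one per rank or batch — into
    one perplexity. Weighting by tokens, not by batches, makes the result
    identical to a single-process pass over the same data."""
    total_nll = sum(nll for nll, _ in per_rank_stats)
    total_tokens = sum(n for _, n in per_rank_stats)
    return math.exp(total_nll / total_tokens)
```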

Safety and compliance

This repository includes only synthetic toy data. Do not train on sensitive/proprietary data without consent, policy review, and legal checks.

Roadmap

  • richer packed-sequence boundary masking
  • stronger probe tasks
  • broader FSDP checkpoint modes
  • benchmark scripts for multi-node
