hollow-chains

Measure Structural Fidelity (SF) and Semantic Correctness (SC) in reasoning traces from tiny language models, along with the gap between them ("theater": traces that are well-formed but wrong).
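To make the SF/SC gap concrete, here is a minimal sketch. It is not the project's metric implementation: it assumes each sample has already been reduced to two hypothetical booleans, `well_formed` (the trace passes the structural checks) and `correct` (the final answer is right). The real GenerationRecord schema and the M1 scoring rules are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSample:
    # Hypothetical fields for illustration only; not the GenerationRecord schema.
    well_formed: bool  # structure looks like a valid reasoning trace
    correct: bool      # final answer matches the reference


def summarize(samples):
    """Return SF, SC, and the 'theater' rate as fractions of all samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    sf = sum(s.well_formed for s in samples) / n
    sc = sum(s.correct for s in samples) / n
    # Theater: structurally fine, semantically wrong.
    theater = sum(s.well_formed and not s.correct for s in samples) / n
    return {"sf": sf, "sc": sc, "theater": theater}


samples = [
    ScoredSample(well_formed=True, correct=True),
    ScoredSample(well_formed=True, correct=False),
    ScoredSample(well_formed=True, correct=False),
    ScoredSample(well_formed=False, correct=False),
]
print(summarize(samples))  # {'sf': 0.75, 'sc': 0.25, 'theater': 0.5}
```

In this toy batch three of four traces look right but only one is right, so half of all samples count as theater.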

Install

pip install -e ".[dev]"        # M1 metrics only (CPU)
pip install -e ".[dev,gpu]"    # M2 training + generation (Colab / GPU)

Target Python: 3.12 (Colab parity).

M1 — Metrics

Compute metrics on a JSONL file of GenerationRecord samples:

compute-metrics --records data/records.jsonl --config configs/metrics.yaml --out report.json

M2 — Train → Generate → Metrics

flowchart LR
  A[Tokenizer] --> B[Pretrain shard]
  B --> C[Pretrain ladder]
  C --> D[SFT reasoning traces]
  D --> E[generate_records]
  E --> F[GenerationRecord JSONL]
  F --> G[M1 compute-metrics]

Tokenizer — one frozen ByteLevel BPE (configs/tokenizer.yaml), reasoning tags imported from M1.
Pretrain — causal LM on FineWeb-Edu shard for each ladder rung.
SFT — reasoning traces with M1 tag schema; one-axis sweeps in configs/sft.yaml.
Generate — generate_records() writes schema-valid JSONL with optional token entropies.
Metrics — same M1 CLI; no torch in the metrics layer.

Model ladder (vocab=16000, ctx=512, tied embeddings)

Rung	Target	Realized	hidden	layers	heads	intermediate
tiny_1m	1,000,000	951,720	56	1	1	256
small_8m	8,000,000	7,661,120	320	2	5	896
mid_50m	50,000,000	47,679,588	836	4	11	2304
large_150m	150,000,000	142,774,164	1404	5	18	3840
xl_350m	350,000,000	333,041,312	1888	7	16	5120

Print live table: python -m hollow_chains.models.ladder

At tiny_1m, ~94% of weights (896,000 of 951,720) are the shared vocab embedding matrix, which is expected for this vocab size.
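The Realized column can be reproduced by hand. The sketch below assumes a standard Llama-style layout: bias-free q/k/v/o projections with full multi-head attention, a gated MLP with three projections, two RMSNorm weights per layer plus a final norm, and tied input/output embeddings. That layout is an assumption, but it matches all five rows of the table. The function name is illustrative and this is not the project's param solver.

```python
VOCAB = 16000


def llama_param_count(vocab, hidden, layers, intermediate):
    """Return (total, embedding) parameter counts under the assumed layout."""
    embed = vocab * hidden            # shared input/output embedding (tied)
    attn = 4 * hidden * hidden        # q, k, v, o projections, no biases
    mlp = 3 * hidden * intermediate   # gate, up, down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    return total, embed


# (hidden, layers, intermediate) copied from the ladder table; the head count
# does not change the parameter count when attention is full multi-head.
LADDER = {
    "tiny_1m": (56, 1, 256),
    "small_8m": (320, 2, 896),
    "mid_50m": (836, 4, 2304),
    "large_150m": (1404, 5, 3840),
    "xl_350m": (1888, 7, 5120),
}

for name, (hidden, layers, intermediate) in LADDER.items():
    total, embed = llama_param_count(VOCAB, hidden, layers, intermediate)
    print(f"{name:11s} {total:>12,d}  embedding share {embed / total:.1%}")
```

The embedding share falls quickly with depth and width, which is why the smallest rung is dominated by the vocab matrix.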

Colab notebooks

Notebook	Purpose
`notebooks/00_setup_colab.ipynb`	Mount Drive, train tokenizer, materialize pretrain shard
`notebooks/01_pretrain_ladder.ipynb`	Pretrain each rung
`notebooks/02_sft_sweeps.ipynb`	Build SFT data + run sweep cells
`notebooks/03_generate_emergence.ipynb`	Generate JSONL + M1 reports

Each notebook starts with a config cell and ends with a list of the artifacts it wrote.

Local CPU smoke test

Verifies the full loop without network/GPU:

pytest tests/test_pipeline_smoke.py -v

Runs: micro tokenizer → tiny_1m pretrain (2 steps) → SFT (1 step) → 1 generation → M1 metrics.

Generation recipes

Validated prompt formats and decoding settings for external Hugging Face checkpoints, applied by generate_with_recipe in eval/generate.py.

Reasoning (`SupraLabs/*-Reasoning`)

Field	Value
System	Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions.
Prompt	`[SYSTEM]: {system}\n\n[USER]: {question}\n\n[ASSISTANT]: <
Decoding	`do_sample=True`, `temperature=0.3`, `top_k=25`, `top_p=0.8`, `repetition_penalty=1.3`, `max_new_tokens=512`
Post	Decode with `skip_special_tokens=False`, strip `<s>`/`</s>`, prepend `<
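The Post step above can be sketched in plain Python. This is only an illustration of the string handling, not the code in eval/generate.py. The function name is hypothetical, trimming surrounding whitespace is an added assumption, and the text to prepend is left as a caller-supplied `prefix` argument.

```python
def clean_decoded(text, prefix=""):
    """Remove literal <s> and </s> markers from decoded text, then prepend prefix."""
    for marker in ("<s>", "</s>"):
        text = text.replace(marker, "")
    # Assumption: surrounding whitespace is not meaningful, so trim it.
    return prefix + text.strip()


print(clean_decoded("<s> step one, step two </s>"))  # step one, step two
```

Decoding with `skip_special_tokens=False` is what makes this step necessary: the tokenizer keeps every special token in the output string, so the sequence markers have to be removed by hand while any reasoning tags stay intact.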

Instruct (`SupraLabs/*-Instruct`)

Field	Value
Prompt	Alpaca: `Below is an instruction... ### Instruction:\n{question}\n\n### Response:\n`
Decoding	`do_sample=True`, `temperature=0.7`, `top_k=50`, `top_p=0.9`, `repetition_penalty=1.15`, `max_new_tokens=300`
Post	No prepend

Qwen3 teacher (SFT data)

Field	Value
Models	`Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-4B`
Prompt	`apply_chat_template(..., enable_thinking=True)`
Decoding	`do_sample=True`, `temperature=0.6`, `top_p=0.95`, `top_k=20`, `max_new_tokens=512`
SFT prompt	Same reasoning format as eval (open `<
External-base SFT defaults	`epochs=6`, `lr=3e-4`, `batch_size=4`, `max_len=768` (`configs/scale_ladder.yaml`)
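One way to keep the three decoding recipes above in one place is a small lookup table. The sketch below copies the values from the tables; the recipe keys and the helper name are hypothetical and are not taken from generate_with_recipe.

```python
# Decoding settings copied from the recipe tables; key names are illustrative.
RECIPES = {
    "reasoning": dict(
        do_sample=True, temperature=0.3, top_k=25, top_p=0.8,
        repetition_penalty=1.3, max_new_tokens=512,
    ),
    "instruct": dict(
        do_sample=True, temperature=0.7, top_k=50, top_p=0.9,
        repetition_penalty=1.15, max_new_tokens=300,
    ),
    "qwen3_teacher": dict(
        do_sample=True, temperature=0.6, top_p=0.95, top_k=20,
        max_new_tokens=512,
    ),
}


def decoding_kwargs(recipe, **overrides):
    """Return a fresh kwargs dict for one recipe, with optional overrides."""
    if recipe not in RECIPES:
        raise KeyError(f"unknown recipe {recipe!r}; expected one of {sorted(RECIPES)}")
    return {**RECIPES[recipe], **overrides}


print(decoding_kwargs("instruct", max_new_tokens=64))
```

The returned dict uses standard Hugging Face generation argument names, so it can be passed along as `model.generate(**inputs, **decoding_kwargs("reasoning"))`. Returning a copy keeps per-call overrides from leaking into the shared table.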

R2 teacher data (local CLI)

Generate teacher SFT JSONL caches for the R2 Colab notebook:

pip install -e ".[gpu]"   # torch, transformers, openai, bitsandbytes (optional 4-bit)
export TOGETHER_API_KEY=... # for deepseek-v4 / qwen3p5-397b
python scripts/gen_teacher_data.py --teacher qwen3-0p6b --n 200
python scripts/gen_teacher_data.py --teacher all --n 200 --out-dir ./teacher_cache

Writes r2_{teacher}.jsonl (e.g. r2_qwen3-0p6b.jsonl) to --out-dir. Upload these files to MyDrive/MicroLM/sft_data/ on Google Drive so the R2 Colab notebook finds them in its cache and skips regeneration.

Teachers: qwen3-0p6b, qwen3-1p7b (local GPU), deepseek-v4, qwen3p5-397b (Together API).
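Before uploading, it can help to check which teacher caches are still missing. The sketch below is a hypothetical local helper, not part of scripts/gen_teacher_data.py. It relies only on the r2_{teacher}.jsonl naming and the teacher list given above; the notebook's own cache logic is not shown in this README.

```python
from pathlib import Path

TEACHERS = ("qwen3-0p6b", "qwen3-1p7b", "deepseek-v4", "qwen3p5-397b")


def missing_caches(out_dir, teachers=TEACHERS):
    """Return the teachers whose r2_{teacher}.jsonl file is absent from out_dir."""
    out = Path(out_dir)
    return [t for t in teachers if not (out / f"r2_{t}.jsonl").is_file()]


if __name__ == "__main__":
    todo = missing_caches("./teacher_cache")
    print("missing:", ", ".join(todo) if todo else "none")
```

Any teacher it reports can be regenerated individually with `--teacher <name>` rather than rerunning `--teacher all`.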

Development

make test    # M1 metrics (>90% coverage) + M2 smoke test
make lint    # ruff + black

Milestones

Milestone	Scope	Status
M1	Metrics layer, schema, parser, CLI	Done
M2	Tokenizer, ladder, pretrain, SFT, generate, Colab	Done
M3	Bit-flip + quantization corruption	Pending
M4	Degradation eval, aggregation, visualization	Pending

Project structure

MicroLM/
├── configs/
│   ├── metrics.yaml
│   ├── tokenizer.yaml
│   ├── model_ladder.yaml
│   ├── pretrain.yaml
│   ├── sft.yaml
│   ├── scale_ladder.yaml
│   ├── generate.yaml
│   └── smoke.yaml
├── notebooks/          # Colab orchestration
├── scripts/
│   ├── compute_metrics.py
│   └── train_tokenizer.py
├── src/hollow_chains/
│   ├── data/           # schema, tokenizer, pretrain_data, build_reasoning_sft
│   ├── metrics/        # M1 — torch-free
│   ├── models/         # ladder (Llama + param solver)
│   ├── train/          # pretrain, sft
│   ├── eval/           # generate, run_emergence
│   └── ...
└── tests/
    ├── test_metrics.py
    └── test_pipeline_smoke.py
