`.gitignore` — 3 changes: 2 additions & 1 deletion

```diff
@@ -8,4 +8,5 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/final_model.pt
+final_model.int6.ptz
```
Added log files (large diffs not rendered by default):

- `logs/12L_int5_s1337.txt` — 1,728 additions
- `logs/12L_int5_s2024.txt` — 1,728 additions
- `logs/12L_int5_s42.txt` — 1,728 additions

`README.md` — 73 additions (new file):

# 12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + SWA

**val_bpb: 1.1541** (sliding window stride=64, 3-seed mean) | **~15.9 MB** artifact | 8xH100 SXM, 600s

## Key Innovation: Mixed Int5/Int6 Quantization + 12 Layers

Instead of uniform int6 quantization, we use precision-tiered quantization:
- **Int5 [-16,15]** for MLP weights (largest tensors, most compressible)
- **Int6 [-32,31]** for attention weights (more precision-sensitive)
- **FP16** for tied embeddings

Int5 values stored in int8 have **3 zero high bits** vs 2 for int6. zstd-22 compresses int5 at ~1.88x vs int6 at ~1.51x, saving ~1.8MB. This funds a **12th transformer layer** while staying under 16MB — the deepest model submitted to date.
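The tiering described above can be sketched as symmetric round-to-nearest quantization into each signed range (a minimal illustration; `quantize_symmetric` and the per-tensor scaling are assumptions, not the actual scheme in `train_gpt.py`):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Round-to-nearest symmetric quantization into a signed `bits`-wide
    range, with codes stored in int8 (illustrative sketch only)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # int5: [-16,15], int6: [-32,31]
    scale = float(np.abs(w).max()) / qmax
    codes = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 1536)).astype(np.float32)  # one MLP matrix's shape

q5, _ = quantize_symmetric(w, 5)  # MLP weights
q6, _ = quantize_symmetric(w, 6)  # attention weights
# In int8 two's complement, int5 codes leave the top 3 bits as pure sign
# extension (vs 2 for int6), which is the redundancy zstd exploits.
```

Per-tensor scaling is a guess here; the README only specifies the integer ranges.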

## Architecture

- **12 transformer layers** (deepest submission), 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu² activation
- SmearGate (learned token blending gate)
- BigramHash (2048 buckets, dim=128)
- U-Net skip connections
- Tied embeddings, vocab 1024, seq_len 2048
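
For reference, the architecture numbers above collected in one place (field names are illustrative, not the identifiers used in `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # Values taken from the README; names are illustrative.
    n_layers: int = 12
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4        # grouped-query attention: 2 query heads per KV head
    mlp_hidden: int = 1536     # 3x expansion, relu^2 activation
    bigram_buckets: int = 2048
    bigram_dim: int = 128
    vocab_size: int = 1024
    seq_len: int = 2048

cfg = ModelConfig()
head_dim = cfg.d_model // cfg.n_heads  # 512 / 8 = 64
```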

## Training Config

| Parameter | Value |
|-----------|-------|
| Layers | **12** |
| Matrix LR | 0.025 |
| Scalar LR | 0.025 |
| Tied Embed LR | 0.035 |
| Muon Momentum | 0.99 |
| Muon WD | 0.04 |
| Adam WD | 0.04 |
| Warmdown | 3000 iters |
| SWA | every 200 steps, ~7 checkpoint avg |
| Eval stride | 64 |
| Batch | 786,432 tokens/step |
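
The "SWA every 200 steps, ~7 checkpoint avg" row amounts to an equal-weight running average of periodic snapshots; a minimal sketch (not the actual training code, which operates on model tensors rather than a toy array):

```python
import numpy as np

class SWAAverager:
    """Equal-weight running average of checkpoint snapshots (sketch)."""
    def __init__(self) -> None:
        self.avg = None
        self.n = 0

    def update(self, params: np.ndarray) -> None:
        # Incremental mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n
        self.n += 1
        if self.avg is None:
            self.avg = params.astype(np.float64).copy()
        else:
            self.avg += (params - self.avg) / self.n

swa = SWAAverager()
for step in range(1, 1401):
    params = np.full(4, float(step))  # stand-in "model weights"
    if step % 200 == 0:               # snapshot cadence from the table
        swa.update(params)
# 7 snapshots at steps 200, 400, ..., 1400 are averaged equally.
```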

## Results (3-seed)

| Seed | Steps | ms/step | Post-Q BPB | Sliding BPB (s64) |
|------|-------|---------|------------|-------------------|
| 1337 | 5,590 | 107.34 | 1.17668 | **1.15402** |
| 42 | 5,588 | 107.37 | 1.17647 | **1.15390** |
| 2024 | 5,589 | 107.35 | 1.17679 | **1.15425** |

**Mean sliding BPB: 1.15406 | Std: 0.00017**
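
The summary mean can be reproduced from the per-seed values (using the higher-precision numbers recorded in `submission.json`):

```python
import statistics

# Per-seed sliding-window BPB (stride 64), from submission.json.
bpb = [1.15401994, 1.15390429, 1.15425222]
mean = statistics.fmean(bpb)
print(round(mean, 5))  # the reported 3-seed mean
```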

## Ablation: Why 12 Layers + Int5

| Config | Sliding BPB | Artifact | Notes |
|--------|-------------|----------|-------|
| 9L int6 (PR #162 base) | ~1.148 | 15.4 MB | Baseline |
| 11L int6 (PR #198) | **1.1318** | 15.7 MB | Current SOTA |
| **12L int5-MLP + int6-attn** | **1.1541** | ~15.9 MB | This submission |

The 12th layer adds depth but each step is slower (107ms vs 81ms for 11L), yielding ~5,590 steps vs ~7,412. The depth-vs-speed tradeoff doesn't fully pay off at 600s, but demonstrates that int5 MLP quantization is a viable compression strategy for fitting more layers.
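
The step-count arithmetic follows directly from the per-step latencies (a quick check; the ~81 ms figure for 11L is taken from the text above):

```python
# Steps that fit in the fixed 600 s wall clock at each per-step latency.
budget_s = 600.0
steps_12l = budget_s / 0.10734  # 107.34 ms/step for the 12-layer model
steps_11l = budget_s / 0.081    # ~81 ms/step reported for the 11-layer PR

print(round(steps_12l), round(steps_11l))
```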

## Reproduction

```bash
cd /workspace
git clone https://github.com/alertcat/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
bash run_8xh100.sh
```

## Files

- `train_gpt.py` — full training script with Int5 MLP quantization
- `README.md` — this file
- `submission.json` — structured results
`submission.json` — 27 additions (new file):

```json
{
  "author": "alertcat",
  "github_id": "alertcat",
  "name": "12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + Muon WD=0.04 + SWA",
  "blurb": "12-layer transformer with mixed int5/int6 quantization: int5 [-16,15] for MLP weights (better zstd compression), int6 for attention weights, FP16 for embeddings. The int5 savings (~1.8MB) fund the 12th transformer layer. SmearGate, BigramHash(2048), Muon WD=0.04, SWA every 200 steps, sliding window eval stride=64.",
  "date": "2026-03-20T14:48:00Z",
  "val_loss": 1.94831390,
  "val_bpb": 1.15390429,
  "pre_quant_val_bpb": 1.1579,
  "step_stop": 5588,
  "wallclock_seconds": 600.0,
  "eval_time_seconds": 89.62,
  "num_layers": 12,
  "model_dim": 512,
  "num_heads": 8,
  "num_kv_heads": 4,
  "mlp_mult": 3,
  "vocab_size": 1024,
  "train_seq_len": 2048,
  "seeds": {
    "1337": {"val_bpb": 1.15401994, "steps": 5590},
    "42": {"val_bpb": 1.15390429, "steps": 5588},
    "2024": {"val_bpb": 1.15425222, "steps": 5589}
  },
  "mean_val_bpb": 1.15405882,
  "std_val_bpb": 0.00017
}
```