`.gitignore` — 3 changes: 2 additions & 1 deletion

```diff
@@ -8,4 +8,5 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/final_model.pt
+final_model.int6.ptz
```
Added log files (large diffs not rendered by default):

- `logs/12L_int5_s1337.txt` — 1,728 additions
- `logs/12L_int5_s2024.txt` — 1,728 additions
- `logs/12L_int5_s42.txt` — 1,728 additions

`README.md` — 73 additions (new file):

# 12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + SWA

**val_bpb: 1.1541** (sliding window stride=64, 3-seed mean) | **~15.9 MB** artifact | 8xH100 SXM, 600s

## Key Innovation: Mixed Int5/Int6 Quantization + 12 Layers

Instead of uniform int6 quantization, we use precision-tiered quantization:
- **Int5 [-16,15]** for MLP weights (largest tensors, most compressible)
- **Int6 [-32,31]** for attention weights (more precision-sensitive)
- **FP16** for tied embeddings

Int5 values stored in int8 have **3 zero high bits** vs 2 for int6. zstd-22 compresses int5 at ~1.88x vs int6 at ~1.51x, saving ~1.8MB. This funds a **12th transformer layer** while staying under 16MB — the deepest model submitted to date.
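The tiering described above can be sketched as symmetric round-to-nearest quantization into each signed range (a minimal illustration; `quantize_symmetric` and the per-tensor scaling are assumptions, not the actual scheme in `train_gpt.py`):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Round-to-nearest symmetric quantization into a signed `bits`-wide
    range, with codes stored in int8 (illustrative sketch only)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # int5: [-16,15], int6: [-32,31]
    scale = float(np.abs(w).max()) / qmax
    codes = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 1536)).astype(np.float32)  # one MLP matrix's shape

q5, _ = quantize_symmetric(w, 5)  # MLP weights
q6, _ = quantize_symmetric(w, 6)  # attention weights
# In int8 two's complement, int5 codes leave the top 3 bits as pure sign
# extension (vs 2 for int6), which is the redundancy zstd exploits.
```

Per-tensor scaling is a guess here; the README only specifies the integer ranges.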

## Architecture

- **12 transformer layers** (deepest submission), 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu² activation
- SmearGate (learned token blending gate)
- BigramHash (2048 buckets, dim=128)
- U-Net skip connections
- Tied embeddings, vocab 1024, seq_len 2048
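
For reference, the architecture numbers above collected in one place (field names are illustrative, not the identifiers used in `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # Values taken from the README; names are illustrative.
    n_layers: int = 12
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4        # grouped-query attention: 2 query heads per KV head
    mlp_hidden: int = 1536     # 3x expansion, relu^2 activation
    bigram_buckets: int = 2048
    bigram_dim: int = 128
    vocab_size: int = 1024
    seq_len: int = 2048

cfg = ModelConfig()
head_dim = cfg.d_model // cfg.n_heads  # 512 / 8 = 64
```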

## Training Config

| Parameter | Value |
|-----------|-------|
| Layers | **12** |
| Matrix LR | 0.025 |
| Scalar LR | 0.025 |
| Tied Embed LR | 0.035 |
| Muon Momentum | 0.99 |
| Muon WD | 0.04 |
| Adam WD | 0.04 |
| Warmdown | 3000 iters |
| SWA | every 200 steps, ~7 checkpoint avg |
| Eval stride | 64 |
| Batch | 786,432 tokens/step |
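
The "SWA every 200 steps, ~7 checkpoint avg" row amounts to an equal-weight running average of periodic snapshots; a minimal sketch (not the actual training code, which operates on model tensors rather than a toy array):

```python
import numpy as np

class SWAAverager:
    """Equal-weight running average of checkpoint snapshots (sketch)."""
    def __init__(self) -> None:
        self.avg = None
        self.n = 0

    def update(self, params: np.ndarray) -> None:
        # Incremental mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n
        self.n += 1
        if self.avg is None:
            self.avg = params.astype(np.float64).copy()
        else:
            self.avg += (params - self.avg) / self.n

swa = SWAAverager()
for step in range(1, 1401):
    params = np.full(4, float(step))  # stand-in "model weights"
    if step % 200 == 0:               # snapshot cadence from the table
        swa.update(params)
# 7 snapshots at steps 200, 400, ..., 1400 are averaged equally.
```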

## Results (3-seed)

| Seed | Steps | ms/step | Post-Q BPB | Sliding BPB (s64) |
|------|-------|---------|------------|-------------------|
| 1337 | 5,590 | 107.34 | 1.17668 | **1.15402** |
| 42 | 5,588 | 107.37 | 1.17647 | **1.15390** |
| 2024 | 5,589 | 107.35 | 1.17679 | **1.15425** |

**Mean sliding BPB: 1.15406 | Std: 0.00017**
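
The summary mean can be reproduced from the per-seed values (using the higher-precision numbers recorded in `submission.json`):

```python
import statistics

# Per-seed sliding-window BPB (stride 64), from submission.json.
bpb = [1.15401994, 1.15390429, 1.15425222]
mean = statistics.fmean(bpb)
print(round(mean, 5))  # the reported 3-seed mean
```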

## Ablation: Why 12 Layers + Int5

| Config | Sliding BPB | Artifact | Notes |
|--------|-------------|----------|-------|
| 9L int6 (PR #162 base) | ~1.148 | 15.4 MB | Baseline |
| 11L int6 (PR #198) | **1.1318** | 15.7 MB | Current SOTA |
| **12L int5-MLP + int6-attn** | **1.1541** | ~15.9 MB | This submission |

The 12th layer adds depth but each step is slower (107ms vs 81ms for 11L), yielding ~5,590 steps vs ~7,412. The depth-vs-speed tradeoff doesn't fully pay off at 600s, but demonstrates that int5 MLP quantization is a viable compression strategy for fitting more layers.
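
The step-count arithmetic follows directly from the per-step latencies (a quick check; the ~81 ms figure for 11L is taken from the text above):

```python
# Steps that fit in the fixed 600 s wall clock at each per-step latency.
budget_s = 600.0
steps_12l = budget_s / 0.10734  # 107.34 ms/step for the 12-layer model
steps_11l = budget_s / 0.081    # ~81 ms/step reported for the 11-layer PR

print(round(steps_12l), round(steps_11l))
```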

## Reproduction

```bash
cd /workspace
git clone https://github.com/alertcat/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
bash run_8xh100.sh
```

## Files

- `train_gpt.py` — full training script with Int5 MLP quantization
- `README.md` — this file
- `submission.json` — structured results
`submission.json` — 27 additions (new file):

```json
{
  "author": "alertcat",
  "github_id": "alertcat",
  "name": "12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + Muon WD=0.04 + SWA",
  "blurb": "12-layer transformer with mixed int5/int6 quantization: int5 [-16,15] for MLP weights (better zstd compression), int6 for attention weights, FP16 for embeddings. The int5 savings (~1.8MB) fund the 12th transformer layer. SmearGate, BigramHash(2048), Muon WD=0.04, SWA every 200 steps, sliding window eval stride=64.",
  "date": "2026-03-20T14:48:00Z",
  "val_loss": 1.94831390,
  "val_bpb": 1.15390429,
  "pre_quant_val_bpb": 1.1579,
  "step_stop": 5588,
  "wallclock_seconds": 600.0,
  "eval_time_seconds": 89.62,
  "num_layers": 12,
  "model_dim": 512,
  "num_heads": 8,
  "num_kv_heads": 4,
  "mlp_mult": 3,
  "vocab_size": 1024,
  "train_seq_len": 2048,
  "seeds": {
    "1337": {"val_bpb": 1.15401994, "steps": 5590},
    "42": {"val_bpb": 1.15390429, "steps": 5588},
    "2024": {"val_bpb": 1.15425222, "steps": 5589}
  },
  "mean_val_bpb": 1.15405882,
  "std_val_bpb": 0.00017
}
```