@@ -0,0 +1,75 @@
## 11-Layer Int6 + LAWA-EMA + Overtone Init (val_bpb: 1.1551)

**val_bpb = 1.1551** (sliding window, stride=64) | **12.7 MB** artifact | 8xH100 SXM, 600s

### Changes from PR #198

| | [PR #198](https://github.com/openai/parameter-golf/pull/198) | This |
|---|---|---|
| val_bpb (sliding s64) | 1.1318 | **1.1551** |
| Weight averaging | SWA (~8 ckpt, warmdown only) | LAWA-EMA (every step, decay=0.995) |
| Embedding init | Normal | Overtone (SVD power-law) |
| Artifact size | 15.7 MB | **12.7 MB** |
| Steps (600s) | 7,412 | 6,715 |
| Step time | 81ms | 89ms |

### What's new

1. **LAWA-EMA** (replaces SWA). A float32 exponential moving average of all parameters, updated every step with decay=0.995, giving an effective averaging window of ~200 steps. The EMA weights are applied to the base model before int6 quantization.

2. **Overtone init**. Decomposes the randomly initialized embedding matrix with SVD and replaces its singular values with a power-law decay (proportional to 1/sqrt(k)). This yields smoother per-row value ranges, which quantize more tightly under int6.

3. **`BigramHashEmbedding.proj` zero-init fix**. The `_init_weights` method was overwriting the intended zero initialization of `BigramHashEmbedding.proj` with orthogonal init. Fixed by setting `_zero_init=True` on the proj layer.

4. **Sliding window eval fix**. Partial windows at the validation boundary were double-counting tokens. Fixed by only generating full windows (`ws + seq_len <= total`).
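The corrected window generation from item 4 can be sketched as follows (the function name `full_window_starts` is hypothetical; the actual script only applies the `ws + seq_len <= total` check):

```python
def full_window_starts(total_tokens: int, seq_len: int, stride: int) -> list[int]:
    # Emit only window starts whose full window fits inside the token
    # stream (ws + seq_len <= total_tokens), so a partial window at the
    # validation boundary can never double-count tokens.
    return list(range(0, total_tokens - seq_len + 1, stride))

# e.g. 10 tokens, window 4, stride 2 -> starts [0, 2, 4, 6]
```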

### Carried from PR #198

- 11 transformer layers (5 encoder + 6 decoder, U-Net skip connections)
- Int6 per-row quantization (MLP+attention), int8 embedding, zstd-22 compression
- MLP 3x (hidden=1536), relu² activation
- FlashAttention 3 (direct `flash_attn_func` calls)
- SmearGate + BigramHash (2048x128)
- Orthogonal + muP-scaled init on all large matrices
- Weight decay 0.04 (Muon + AdamW)
- GQA (8 heads, 4 KV heads), logit softcap 30.0
- Sequence length 2048, NTK-aware RoPE
- Muon optimizer, momentum 0.99, warmdown 1200 iters, grad clip 0.3
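The Overtone init described under "What's new" can be sketched in NumPy as a stand-in for the torch code in `train_gpt.py` (anchoring the power law at the original leading singular value is an assumption):

```python
import numpy as np

def overtone_init(emb: np.ndarray) -> np.ndarray:
    # SVD of the randomly initialized embedding matrix, then replace the
    # singular values with a 1/sqrt(k) power-law decay, k = 1..rank.
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    k = np.arange(1, s.size + 1)
    new_s = s[0] / np.sqrt(k)  # assumption: keep the leading scale, decay the tail
    return (u * new_s) @ vt
```

The returned matrix has the same shape as the input but a prescribed, smoothly decaying spectrum instead of the roughly flat spectrum of a random Gaussian matrix.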

### Configuration

```bash
NUM_LAYERS=11 MUON_WD=0.04 ADAM_WD=0.04 BIGRAM_VOCAB_SIZE=2048 \
LAWA_ENABLED=1 LAWA_EMA_DECAY=0.995 \
ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=600 \
torchrun --nproc_per_node=8 train_gpt.py
```
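The LAWA-EMA update that `LAWA_ENABLED=1` turns on can be sketched like this (NumPy stand-in; the real shadow lives in float32 torch tensors, updated once per optimizer step):

```python
import numpy as np

DECAY = 0.995  # LAWA_EMA_DECAY

def lawa_update(shadow: list[np.ndarray], params: list[np.ndarray]) -> None:
    # In-place float32 EMA: shadow = decay * shadow + (1 - decay) * param.
    for s, p in zip(shadow, params):
        s *= DECAY
        s += (1.0 - DECAY) * p.astype(np.float32)
```

With decay=0.995 the effective averaging window is roughly 1/(1-0.995) = 200 steps, matching the writeup.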

### Key metrics

- 6,715 steps in 600s (89ms/step)
- ~5.3B train tokens (6,715 steps x 786,432 tokens/step)
- Peak memory: 19,828 MiB per GPU

| Metric | Value |
|--------|-------|
| Pre-quant val_bpb | 1.1622 |
| Int6 roundtrip val_bpb | 1.1779 |
| **Int6 sliding val_bpb (s64)** | **1.1551** |
| Compressed artifact (int6+zstd) | 12,639,639 bytes |
| Code size | 65,258 bytes |
| **Total submission size** | **12,704,897 bytes** |
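The int6 per-row roundtrip reported above can be sketched as follows (symmetric quantization to ±31 per row is an assumption; the actual packing and zstd stages live in `train_gpt.py`):

```python
import numpy as np

def int6_roundtrip(w: np.ndarray) -> np.ndarray:
    # Symmetric per-row int6 quantization: each row gets its own scale
    # mapping max|row| to the int6 limit 31, so codes lie in [-31, 31].
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale
```

The roundtrip error per element is bounded by half a quantization step of its row, which is why smoother per-row value ranges (the goal of the Overtone init) quantize more tightly.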

### Reproducibility

Single seed run (seed=1337). Additional seed runs pending.

| Seed | Steps | Sliding s64 | Artifact |
|------|-------|-------------|----------|
| 1337 | 6,715 | 1.1551 | 12,704,897 |

### Included files

- `train_gpt.py` -- full training + quantization + evaluation script
- `train.log` -- training log from seed 1337
- `submission.json` -- leaderboard metadata
@@ -0,0 +1,18 @@
{
"author": "Alex Machado",
"github_id": "machdragon",
"name": "11-Layer Int6 + LAWA-EMA + Overtone Init + FA3",
"blurb": "PR #198 base (11L, int6 per-row MLP+attn, int8 tok_emb, zstd-22, WD=0.04, MLP 3x relu², FA3, SmearGate, BigramHash 2048x128, OrthoInit, U-Net skips, GQA 8/4, seq=2048 NTK RoPE) with SWA replaced by LAWA-EMA (decay=0.995, float32 shadow, every-step update) and Overtone init (SVD power-law embedding spectrum). Sliding window eval stride=64.",
"date": "2026-03-20T18:51:00Z",
"val_loss": 1.95035169,
"val_bpb": 1.15510813,
"pre_quant_val_loss": 1.9624,
"pre_quant_val_bpb": 1.1622,
"int6_roundtrip_val_loss": 1.98875174,
"int6_roundtrip_val_bpb": 1.17785080,
"int6_sliding_val_loss": 1.95035169,
"int6_sliding_val_bpb": 1.15510813,
"bytes_total": 12704897,
"bytes_model_int6_zstd": 12639639,
"bytes_code": 65258
}
@@ -0,0 +1,94 @@
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/vol/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/vol/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26829913
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
lawa:init decay=0.995 shadow_params=116
step:0/20000 val_loss:6.9315 val_bpb:4.1052 train_time:0ms step_avg:0.05ms
step:1/20000 train_loss:6.9319 train_time:371ms step_avg:370.73ms
step:2/20000 train_loss:9.9889 train_time:446ms step_avg:223.08ms
step:3/20000 train_loss:8.4565 train_time:530ms step_avg:176.73ms
step:4/20000 train_loss:8.2752 train_time:616ms step_avg:154.06ms
step:5/20000 train_loss:7.9905 train_time:701ms step_avg:140.27ms
step:6/20000 train_loss:7.8355 train_time:787ms step_avg:131.22ms
step:7/20000 train_loss:7.6626 train_time:875ms step_avg:124.95ms
step:8/20000 train_loss:7.2479 train_time:958ms step_avg:119.77ms
step:9/20000 train_loss:6.9406 train_time:1043ms step_avg:115.92ms
step:10/20000 train_loss:6.5805 train_time:1136ms step_avg:113.61ms
step:200/20000 train_loss:2.4139 train_time:17881ms step_avg:89.41ms
step:400/20000 train_loss:2.4563 train_time:35727ms step_avg:89.32ms
step:600/20000 train_loss:2.3790 train_time:53213ms step_avg:88.69ms
step:800/20000 train_loss:2.2529 train_time:71333ms step_avg:89.17ms
step:1000/20000 train_loss:2.2958 train_time:89085ms step_avg:89.08ms
step:1000/20000 val_loss:2.2451 val_bpb:1.3297 train_time:89107ms step_avg:89.11ms
step:1200/20000 train_loss:2.3685 train_time:107187ms step_avg:89.32ms
step:1400/20000 train_loss:2.2051 train_time:125256ms step_avg:89.47ms
step:1600/20000 train_loss:2.0900 train_time:143034ms step_avg:89.40ms
step:1800/20000 train_loss:2.1900 train_time:161062ms step_avg:89.48ms
step:2000/20000 train_loss:2.1070 train_time:178750ms step_avg:89.38ms
step:2000/20000 val_loss:2.1718 val_bpb:1.2863 train_time:178767ms step_avg:89.38ms
step:2200/20000 train_loss:2.2281 train_time:196786ms step_avg:89.45ms
step:2400/20000 train_loss:2.1142 train_time:214544ms step_avg:89.39ms
step:2600/20000 train_loss:2.1639 train_time:232542ms step_avg:89.44ms
step:2800/20000 train_loss:2.2065 train_time:250889ms step_avg:89.60ms
step:3000/20000 train_loss:2.2152 train_time:268592ms step_avg:89.53ms
step:3000/20000 val_loss:2.1481 val_bpb:1.2722 train_time:268614ms step_avg:89.54ms
step:3200/20000 train_loss:2.2314 train_time:286569ms step_avg:89.55ms
step:3400/20000 train_loss:2.0757 train_time:304185ms step_avg:89.47ms
step:3600/20000 train_loss:2.1564 train_time:322165ms step_avg:89.49ms
step:3800/20000 train_loss:2.1418 train_time:339740ms step_avg:89.41ms
step:4000/20000 train_loss:2.0468 train_time:357704ms step_avg:89.43ms
step:4000/20000 val_loss:2.1371 val_bpb:1.2657 train_time:357721ms step_avg:89.43ms
step:4200/20000 train_loss:2.2318 train_time:375765ms step_avg:89.47ms
step:4400/20000 train_loss:2.1217 train_time:393436ms step_avg:89.42ms
step:4600/20000 train_loss:1.9291 train_time:411377ms step_avg:89.43ms
step:4800/20000 train_loss:2.5217 train_time:429125ms step_avg:89.40ms
step:5000/20000 train_loss:2.2031 train_time:447215ms step_avg:89.44ms
step:5000/20000 val_loss:2.1245 val_bpb:1.2583 train_time:447232ms step_avg:89.45ms
step:5200/20000 train_loss:2.1445 train_time:464930ms step_avg:89.41ms
step:5400/20000 train_loss:2.1561 train_time:482953ms step_avg:89.44ms
step:5600/20000 train_loss:2.0640 train_time:500755ms step_avg:89.42ms
step:5800/20000 train_loss:2.1049 train_time:518407ms step_avg:89.38ms
step:6000/20000 train_loss:2.0124 train_time:536384ms step_avg:89.40ms
step:6000/20000 val_loss:2.0620 val_bpb:1.2212 train_time:536402ms step_avg:89.40ms
step:6200/20000 train_loss:2.0216 train_time:554041ms step_avg:89.36ms
step:6400/20000 train_loss:2.0435 train_time:571992ms step_avg:89.37ms
step:6600/20000 train_loss:1.8704 train_time:589776ms step_avg:89.36ms
step:6715/20000 val_loss:1.9624 val_bpb:1.1622 train_time:600087ms step_avg:89.37ms
stopping_early: wallclock_cap train_time:600087ms step:6715/20000
peak memory allocated: 19828 MiB reserved: 20956 MiB
lawa:applying EMA shadow weights to base_model
Serialized model: 105783402 bytes
Code size: 65258 bytes
Serialized model int6+zstd: 12639639 bytes
Total submission size int6+zstd: 12704897 bytes
final_int6_roundtrip val_loss:1.9888 val_bpb:1.1779 eval_time:259689ms
final_int6_roundtrip_exact val_loss:1.98875174 val_bpb:1.17785080
final_int6_sliding_window val_loss:1.9504 val_bpb:1.1551 stride:64 eval_time:192706ms
final_int6_sliding_window_exact val_loss:1.95035169 val_bpb:1.15510813