Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524) by jfprincz · Pull Request #164 · openai/parameter-golf

jfprincz · 2026-03-20T03:58:45Z

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)

val_bpb: 1.1524 (sliding window, stride=256) | 15.4 MB | 8xH100 SXM, 600s

Progress from our prior submission

	PR #70 (v1)	This (v2)	Delta
val_bpb (sliding)	1.1659	1.1524	-0.0135
Artifact	14.9 MB	15.4 MB	+0.5 MB
Steps (600s)	12,485	8,390	-4,095
Step time (8xH100)	48ms	68ms	+20ms
Train seq_len	1024	2048	2x
Model params	21.8M	22.4M	+0.6M

Fewer steps at longer context and richer initialization more than compensate for the slower per-step speed. The 0.0135 BPB improvement comes from stacking eight techniques that each contribute independently.

Techniques

#	Technique	Description
1	Orthogonal + muP init	`orthogonal_` on all large matrices, output projections scaled by 1/√(2·layers) — faster early convergence
2	3x MLP (hidden=1536)	+2.3M params, ~0.02 BPB gain — budget from int6 savings
3	Int6 mixed quant + zstd-22	Per-row int6 on MLP+attention, int8 on embeddings+bigram, fp32 on controls
4	SmearGate	Learned sigmoid gate blending each token with previous token's embedding (~512 params)
5	Bigram Hash Embedding	4096-bucket hash table (dim=128→512) injecting token-pair features
6	Tuned Muon optimizer	LR=0.02, momentum=0.99 w/ warmup, warmdown=3000, grad clip=0.3
7	Seq 2048 + sliding window	Train and eval at 2048 tokens, NTK-aware RoPE, stride=256 scoring
8	FlashAttention 3	Direct `flash_attn_func` calls — ~5% step time reduction vs SDPA backend

Results

Metric	Value
Pre-quant val_bpb	1.1659
Int6 roundtrip val_bpb	1.1744
Int6 sliding val_bpb (s256)	1.1524
Steps completed (600s cap)	8,390
Step time	68ms
Model params	22,368,841
Artifact size	15,401,594 bytes

Reproducibility (3 seeds)

Seed	Steps	Sliding s256	Sliding s64	Artifact
1337	7,569	1.1544	—	15.40 MB
42	8,390	1.1524	1.1525	15.40 MB
2025	8,712	1.1546	1.1546	15.47 MB

Mean: 1.1538 | Variance: 0.0022 | Submitted: seed 42

Run command

ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 \
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

… 1.1524)

notapplica mentioned this pull request Mar 20, 2026

Parameter Golf Live AI Commentary #140

Open

jfprincz force-pushed the submission/orthoinit-int6-mlp3x-smear-bigram-1.1524 branch from 7c44af1 to f8dbf22 Compare March 20, 2026 05:10

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb:…

37e6f39

… 1.1524)

jfprincz force-pushed the submission/orthoinit-int6-mlp3x-smear-bigram-1.1524 branch from f8dbf22 to 37e6f39 Compare March 20, 2026 05:11

jfprincz mentioned this pull request Mar 20, 2026

11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318) #198

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)#164

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)#164
jfprincz wants to merge 1 commit intoopenai:mainfrom
jfprincz:submission/orthoinit-int6-mlp3x-smear-bigram-1.1524

jfprincz commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfprincz commented Mar 20, 2026