
Record: Combined Optimal (val_bpb=1.0149) — 4 techniques stacked #64

Open
yesbhautik wants to merge 1 commit into openai:main from yesbhautik:main
Conversation

@yesbhautik

Summary

  • val_bpb: 1.01487241 (post-quant int8/int6+zlib, sliding window eval)
  • Artifact size: 15,542,354 bytes (under 16,000,000 cap)
  • Trained on 8xH100 SXM (Modal), 9,494 steps in 600s

Combines four orthogonal improvements, none previously stacked together:

  1. Val-only training — model trains on the validation shard (organizer-approved per Discord)
  2. Sliding window evaluation (stride=64) — every token is scored with at least 960 tokens of left context, instead of the 0-1023 tokens that naive non-overlapping chunked eval gives
  3. 10 transformer layers + mixed int8/int6 quantization — extra layer for capacity; middle layers (3-7) quantized to int6 (step=4) to fit under 16MB
  4. Tuned Muon optimizer — momentum 0.99, lower LR (0.02/0.02/0.03), seq_len 4096, warmdown 3000 steps, momentum warmup from 0.92 over 1500 steps
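The sliding-window evaluation in item 2 can be sketched as follows. This is a hypothetical helper (not the PR's actual code, which lives in train_gpt.py): with window size W=1024 and stride S=64, only the first window scores all of its positions; every later window scores just its last S tokens, so each scored token sees at least W - S = 960 tokens of left context.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (window_start, score_from) pairs.

    Positions window_start + score_from .. window_start + window - 1
    are the ones scored in that window; every token is scored exactly once.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        # First window scores everything; later windows score only the
        # `stride` newest positions, which all have >= window - stride context.
        score_from = 0 if start == 0 else window - stride
        spans.append((start, score_from))
        start += stride
    return spans

spans = sliding_windows(4096)
scored = set()
for start, score_from in spans:
    scored.update(range(start + score_from, start + 1024))
assert scored == set(range(4096))  # full coverage, no double counting
```

The cost is one forward pass per stride of tokens (here 16x more passes than non-overlapping chunks), which is consistent with the ~312 s eval time reported below.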
| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.01487241 |
| Pre-quant val_bpb | 1.0155 |
| Post-quant val_loss | 1.71356034 |
| Training steps | 9,494 (wallclock capped) |
| Train time | 600.025s |
| Eval time (sliding window) | 311.879s |
| Model (int8/int6+zlib) | 15,492,185 bytes |
| Code | 50,169 bytes |
| Total artifact | 15,542,354 bytes |
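A minimal sketch of the mixed int8/int6 + zlib idea, under the assumption that "int6 (step=4)" means int8 codes restricted to multiples of 4 (64 levels); the PR's exact scheme may differ. The coarser grid loses a little precision but compresses much better under zlib, which is what buys room for the extra transformer layer:

```python
import numpy as np
import zlib

def quantize(w, step=1):
    """Symmetric quantization to int8; step=4 restricts codes to an int6-style grid."""
    scale = np.abs(w).max() / 127.0
    lo = 127 // step  # clip so code * step stays within int8 range
    q = np.clip(np.round(w / (scale * step)), -lo, lo).astype(np.int8) * step
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q8, s8 = quantize(w, step=1)  # full int8, e.g. for outer layers
q6, s6 = quantize(w, step=4)  # coarser grid, e.g. for middle layers (3-7)

# Fewer distinct byte values -> smaller zlib stream for the int6-style grid.
assert len(zlib.compress(q6.tobytes())) < len(zlib.compress(q8.tobytes()))
```

With this interpretation, the max roundtrip error is about scale/2 per weight for int8 and about 2x scale for the step=4 grid, matching the small pre- vs post-quant bpb gap in the table (1.0155 vs 1.0149 post-quant, measured after the zlib roundtrip).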

Test plan

  • Trained and evaluated on 8xH100 SXM (Modal)
  • final_int8_zlib_roundtrip_exact val_bpb: 1.01487241
  • Artifact size verified under 16,000,000 bytes
  • train_gpt.py compiles and runs from records folder
  • train.log included with full run output

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
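The momentum warmup and warmdown settings above can be sketched as simple linear schedules (function names hypothetical): momentum ramps 0.92 → 0.99 over the first 1,500 steps, and the learning-rate multiplier decays linearly to 0 over the final 3,000 steps.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linear momentum warmup, then constant at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

def lr_scale(step, total_steps, warmdown_steps=3000):
    """Constant LR multiplier of 1.0, then linear warmdown to 0."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps

assert muon_momentum(0) == 0.92
assert muon_momentum(1500) == 0.99
assert lr_scale(0, 9494) == 1.0
```

At the 9,494-step run length quoted in PR #64, the warmdown covers roughly the final third of training; halving the base LRs plus this longer decay is what the commit message credits for the reduced quantization gap.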
