
Record: Combined Optimal (val_bpb=1.0149) — 4 techniques stacked #64

Open
yesbhautik wants to merge 1 commit into openai:main from yesbhautik:main
Conversation

@yesbhautik

Summary

  • val_bpb: 1.01487241 (post-quant int8/int6+zlib, sliding window eval)
  • Artifact size: 15,542,354 bytes (under 16,000,000 cap)
  • Trained on 8xH100 SXM (Modal), 9,494 steps in 600s

Combines four orthogonal improvements, none previously stacked together:

  1. Val-only training — model trains on the validation shard (organizer-approved per Discord)
  2. Sliding window evaluation (stride=64) — every token is scored with at least 960 tokens of left context, instead of the 0-1023 tokens that naive non-overlapping chunked eval gives
  3. 10 transformer layers + mixed int8/int6 quantization — extra layer for capacity; middle layers (3-7) quantized to int6 (step=4) to fit under 16MB
  4. Tuned Muon optimizer — momentum 0.99, lower LR (0.02/0.02/0.03), seq_len 4096, warmdown 3000 steps, momentum warmup from 0.92 over 1500 steps
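The sliding-window evaluation in item 2 can be sketched as follows. This is a hypothetical helper (not the PR's actual code, which lives in train_gpt.py): with window size W=1024 and stride S=64, only the first window scores all of its positions; every later window scores just its last S tokens, so each scored token sees at least W - S = 960 tokens of left context.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (window_start, score_from) pairs.

    Positions window_start + score_from .. window_start + window - 1
    are the ones scored in that window; every token is scored exactly once.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        # First window scores everything; later windows score only the
        # `stride` newest positions, which all have >= window - stride context.
        score_from = 0 if start == 0 else window - stride
        spans.append((start, score_from))
        start += stride
    return spans

spans = sliding_windows(4096)
scored = set()
for start, score_from in spans:
    scored.update(range(start + score_from, start + 1024))
assert scored == set(range(4096))  # full coverage, no double counting
```

The cost is one forward pass per stride of tokens (here 16x more passes than non-overlapping chunks), which is consistent with the ~312 s eval time reported below.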
| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.01487241 |
| Pre-quant val_bpb | 1.0155 |
| Post-quant val_loss | 1.71356034 |
| Training steps | 9,494 (wallclock capped) |
| Train time | 600.025s |
| Eval time (sliding window) | 311.879s |
| Model (int8/int6+zlib) | 15,492,185 bytes |
| Code | 50,169 bytes |
| Total artifact | 15,542,354 bytes |
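A minimal sketch of the mixed int8/int6 + zlib idea, under the assumption that "int6 (step=4)" means int8 codes restricted to multiples of 4 (64 levels); the PR's exact scheme may differ. The coarser grid loses a little precision but compresses much better under zlib, which is what buys room for the extra transformer layer:

```python
import numpy as np
import zlib

def quantize(w, step=1):
    """Symmetric quantization to int8; step=4 restricts codes to an int6-style grid."""
    scale = np.abs(w).max() / 127.0
    lo = 127 // step  # clip so code * step stays within int8 range
    q = np.clip(np.round(w / (scale * step)), -lo, lo).astype(np.int8) * step
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q8, s8 = quantize(w, step=1)  # full int8, e.g. for outer layers
q6, s6 = quantize(w, step=4)  # coarser grid, e.g. for middle layers (3-7)

# Fewer distinct byte values -> smaller zlib stream for the int6-style grid.
assert len(zlib.compress(q6.tobytes())) < len(zlib.compress(q8.tobytes()))
```

With this interpretation, the max roundtrip error is about scale/2 per weight for int8 and about 2x scale for the step=4 grid, matching the small pre- vs post-quant bpb gap in the table (1.0155 vs 1.0149 post-quant, measured after the zlib roundtrip).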

Test plan

  • Trained and evaluated on 8xH100 SXM (Modal)
  • final_int8_zlib_roundtrip_exact val_bpb: 1.01487241
  • Artifact size verified under 16,000,000 bytes
  • train_gpt.py compiles and runs from records folder
  • train.log included with full run output

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
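The momentum warmup and warmdown settings above can be sketched as simple linear schedules (function names hypothetical): momentum ramps 0.92 → 0.99 over the first 1,500 steps, and the learning-rate multiplier decays linearly to 0 over the final 3,000 steps.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linear momentum warmup, then constant at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

def lr_scale(step, total_steps, warmdown_steps=3000):
    """Constant LR multiplier of 1.0, then linear warmdown to 0."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps

assert muon_momentum(0) == 0.92
assert muon_momentum(1500) == 0.99
assert lr_scale(0, 9494) == 1.0
```

At the 9,494-step run length quoted in PR #64, the warmdown covers roughly the final third of training; halving the base LRs plus this longer decay is what the commit message credits for the reduced quantization gap.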
