
Non-record: Stacked hyperparameter tuning + eval2048 (RTX 5090, val_bpb 1.336)#104

Open
gwelinder wants to merge 1 commit into openai:main from gwelinder:submission/stacked-hyperparams-rtx5090

Conversation

@gwelinder

Non-record submission: Stacked Hyperparameter Tuning + Eval2048

val_bpb: 1.3358 (post-quant int8+zlib) | 15.8MB artifact | RTX 5090, 20 train shards

What this is

40+ automated experiments run via an autoresearch loop on the baseline 9x512 architecture. No architecture changes; five stacked config fixes improve val_bpb by 0.027.

Key finding

WARMDOWN_ITERS=1200 is broken under the 600s wallclock budget. At ~620ms/step the run covers only ~968 steps, so the warmdown window (1200 iters) is longer than the entire run and the cosine warmdown is active from step 1. Fix: WARMDOWN_ITERS=3000. (PRs #48 and #73 flagged the same issue.)
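To make the arithmetic concrete, here is a minimal sketch of a cosine warmdown gated on the final WARMDOWN_ITERS steps. This is a hypothetical reconstruction of the schedule shape, not the exact code in train_gpt.py; it only demonstrates that a window larger than the step budget means the decay is active from the very first step:

```python
import math

def lr_multiplier(step, total_steps, warmdown_iters):
    # Cosine warmdown over the final `warmdown_iters` steps.
    # (Illustrative sketch, not the train_gpt.py implementation.)
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0  # full LR until the warmdown window begins
    # Fraction of the warmdown window already elapsed, clipped to [0, 1].
    frac = min(1.0, (step - warmdown_start) / warmdown_iters)
    return 0.5 * (1.0 + math.cos(math.pi * frac))

total_steps = int(600 / 0.62)  # ~968 steps at ~620 ms/step, 600 s wallclock
# With WARMDOWN_ITERS=1200 > total_steps, warmdown_start is negative,
# so the LR is already below peak at step 0:
assert 1200 > total_steps
assert lr_multiplier(0, total_steps, 1200) < 1.0
```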

Stacked config

WARMDOWN_ITERS=3000, MATRIX_LR=0.06, LOGIT_SOFTCAP=15, MUON_MOMENTUM=0.99
TRAIN_BATCH_TOKENS=131072 (quarter-batch), EVAL_SEQ_LEN=2048
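For reference, a minimal sketch of pulling this stacked config from the environment. The variable names come from the list above; the parsing helper and defaults-as-stacked-values are illustrative, not copied from train_gpt.py:

```python
import os

def env_knob(name, default):
    # Read a numeric knob from the environment, keeping the default's type.
    return type(default)(os.environ.get(name, default))

config = {
    "WARMDOWN_ITERS":     env_knob("WARMDOWN_ITERS", 3000),
    "MATRIX_LR":          env_knob("MATRIX_LR", 0.06),
    "LOGIT_SOFTCAP":      env_knob("LOGIT_SOFTCAP", 15),
    "MUON_MOMENTUM":      env_knob("MUON_MOMENTUM", 0.99),
    "TRAIN_BATCH_TOKENS": env_knob("TRAIN_BATCH_TOKENS", 131072),  # quarter-batch
    "EVAL_SEQ_LEN":       env_knob("EVAL_SEQ_LEN", 2048),
}
```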

Negative results (also in the README)

  • Butterfly/Monarch MLP: 7MB artifact but 1.46 bpb
  • Reservoir random MLPs: 2.14 bpb
  • Depth recurrence (4x3=12 eff layers): 1.50 bpb
  • 6 alternative shapes: none beat 9x512

Also in train_gpt.py

  • EVAL_SEQ_LEN decoupling (train short, eval long)
  • Alias-aware serialization (shared weights stored once)
  • Mixed int6/int8 quantization (INT6_ALL_BLOCK_MATRICES env var)
  • Sliding-window eval (EVAL_STRIDE env var, batched)
  • Depth recurrence support (NUM_UNIQUE_LAYERS, NUM_RECURRENCE)
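The alias-aware serialization idea can be sketched as a dedup pass over the state dict: parameters that share storage (e.g. tied embedding and output head) are written once and recorded as aliases thereafter. The function name and the use of Python `id()` on stand-in tensors are illustrative assumptions, not the actual train_gpt.py code:

```python
def serialize_state(state):
    # Split a name->tensor mapping into unique blobs plus an alias table,
    # so shared weights are stored once. (Illustrative sketch.)
    seen = {}      # id(tensor) -> first name it appeared under
    blobs = {}     # names whose tensors are actually written out
    aliases = {}   # name -> name of the blob it shares storage with
    for name, tensor in state.items():
        key = id(tensor)
        if key in seen:
            aliases[name] = seen[key]
        else:
            seen[key] = name
            blobs[name] = tensor
    return blobs, aliases

shared = [0.0] * 4  # stand-in for a tied weight tensor
state = {"wte.weight": shared, "lm_head.weight": shared, "bias": [1.0]}
blobs, aliases = serialize_state(state)
```

On load, the alias table is replayed in reverse: each aliased name is pointed back at the single stored blob, restoring the weight tying.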

Hardware: RTX 5090, RunPod. Not 8xH100. This is a dev iteration result with interesting negative findings.

…ards)

val_bpb 1.336, 15.8MB artifact. 40+ experiments via autoresearch loop.
Key finding: baseline WARMDOWN_ITERS=1200 is broken at 600s wallclock.
Also includes negative results for butterfly MLP, reservoir MLPs, depth recurrence, and 6 iso-byte shapes.