Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb) #73

Open
NishantDahal wants to merge 1 commit into openai:main from NishantDahal:swiglu-warmdown-1x5090
Conversation

@NishantDahal

Non-record submission documenting a 10-experiment systematic exploration on 1×RTX 5090.

Best val_bpb: 1.3281 (post-quant, under 16MB artifact cap)

Key findings:

  • Discovered a warmdown schedule bug in stock train_gpt.py: the default warmdown_iters=1200, combined with the 600s wallclock budget, causes the LR to start decaying from step 1. Fixed with a time-fraction schedule (warmdown_frac=0.2), worth -0.006 bpb on its own.
  • SwiGLU activation replacing ReLU² (-0.004 bpb)
  • Quarter batch size (131K tokens), giving 4× more optimizer steps (-0.016 bpb cumulative)
  • Gradient accumulation ×2 (-0.002 bpb)
  • Negative results: weight decay (no effect), layer recurrence (harmful)
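The time-fraction warmdown fix can be sketched as follows. This is an illustrative version, not the PR's exact code: the function name `lr_scale` and the way elapsed time is passed in are assumptions; only `warmdown_frac=0.2` and the 600s budget come from the description above.

```python
def lr_scale(elapsed_s, total_s, warmdown_frac=0.2):
    """Time-fraction warmdown (sketch): hold full LR until the final
    warmdown_frac of the wallclock budget, then decay linearly to zero.
    Unlike a fixed warmdown_iters, this cannot start decaying at step 1
    when the step budget is smaller than warmdown_iters."""
    frac_done = min(elapsed_s / total_s, 1.0)
    warmdown_start = 1.0 - warmdown_frac
    if frac_done < warmdown_start:
        return 1.0
    return max(0.0, (1.0 - frac_done) / warmdown_frac)
```

For a 600s run this holds full LR for the first 480s, then decays over the last 120s.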

Total improvement: -0.035 bpb over stock baseline
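For reference, the SwiGLU change above amounts to gating one linear projection with a SiLU of another, instead of squaring a ReLU. A minimal element-wise sketch (the function names are illustrative; the actual model applies this to projected hidden states inside the MLP):

```python
import math

def silu(z):
    """SiLU(z) = z * sigmoid(z), the gate nonlinearity in SwiGLU."""
    return z / (1.0 + math.exp(-z))

def swiglu(gate, up):
    """Element-wise SwiGLU: SiLU(gate) * up, replacing ReLU(z)**2
    which uses a single projection and a hard zero cutoff."""
    return [silu(g) * u for g, u in zip(gate, up)]
```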

The score gap vs the leaderboard baseline (1.2244) is explained by hardware throughput: within the wallclock budget, a 1×5090 completes ~3,773 steps vs ~13,780 on 8×H100. The improvements themselves are hardware-agnostic and should transfer to multi-GPU runs.
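The gradient accumulation ×2 setting from the findings above can be sketched as a toy loop. Everything here is illustrative (the quadratic-loss `compute_grads` helper is hypothetical); only `accum_steps=2` reflects the actual setting:

```python
def compute_grads(params, batch):
    """Hypothetical helper: gradients of a toy squared-error loss."""
    return [2.0 * (p - t) for p, t in zip(params, batch)]

def train_step(params, batches, lr, accum_steps=2):
    """Average gradients over accum_steps micro-batches before one
    optimizer update, simulating a larger batch at fixed memory."""
    grads = [0.0 for _ in params]
    for b in range(accum_steps):
        g = compute_grads(params, batches[b])
        grads = [acc + gi / accum_steps for acc, gi in zip(grads, g)]
    return [p - lr * g for p, g in zip(params, grads)]
```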

Full experiment log and analysis in README.
