Sliding Window + Long-Context Training: val_bpb=1.1764 #96
Open
saml212 wants to merge 6 commits into openai:main from
Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.
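The NTK-RoPE extrapolation mentioned above follows the standard NTK-aware scaling rule: the rotary base is multiplied by s^(d/(d-2)) so that high-frequency dimensions are interpolated less than low-frequency ones. A minimal sketch (base=10000, head_dim=64, and the scale factor are illustrative assumptions; this PR does not state its exact values):

```python
# NTK-aware RoPE base scaling (illustrative values, not from this PR).
def ntk_scaled_base(base: float = 10_000.0, head_dim: int = 64,
                    scale: float = 1.5) -> float:
    """Return the adjusted rotary base for a context-length scale factor."""
    return base * scale ** (head_dim / (head_dim - 2))

# scale=1.0 leaves the base unchanged; scale>1 stretches low frequencies
adjusted = ntk_scaled_base()  # ~15198 for the assumed values above
```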
Score
val_bpb = 1.1764 (baseline: 1.2244, improvement: 0.048 BPB / 0.087 nats)
Across seeds: 1.1764 (SEED=1337)
Approach
Train at longer sequences (2048 tokens) with high Muon momentum (0.99), low learning rate (0.02), and tight gradient clipping (0.3). Evaluate with overlapping sliding windows where every scored token sees 1536+ tokens of preceding context.
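The overlapping-window scheme above can be sketched as index arithmetic: with a 2048-token window and a guaranteed 1536 tokens of context, each window after the first advances by 512 tokens and scores only those fresh tokens. A minimal sketch (function name and the span representation are illustrative):

```python
# Overlapping sliding-window evaluation spans (pure Python sketch).
# Each span is (window_start, score_from, window_end): tokens in
# [score_from, window_end) are scored using [window_start, window_end)
# as context, so every scored token after the first window sees at
# least `min_context` preceding tokens.
def sliding_window_spans(n_tokens, window=2048, min_context=1536):
    stride = window - min_context  # 512 fresh tokens scored per window
    spans, start = [], 0
    while start + window <= n_tokens:
        score_from = start if start == 0 else start + min_context
        spans.append((start, score_from, start + window))
        start += stride
    return spans

spans = sliding_window_spans(4096)
```

The scored ranges tile the sequence contiguously (0-2048, then 2048-2560, 2560-3072, ...), so each token is scored exactly once while paying the cost of re-running the overlap.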
Novel Findings
1. Training length doesn't matter with sliding window eval
Training at 2048 vs 4096 gives identical BPB when evaluated with sliding window (1.1764 vs 1.1765). The sliding window already provides long context at eval — the model just needs to learn local patterns well. Training at 2048 is strictly better because it gets more optimization steps in 10 minutes.
2. Gradient clipping sweet spot for long sequences
Long-sequence training benefits from a narrow clipping window (0.3 vs default 1.0). Full sweep from 0.0 to 1.0 identified 0.3 as optimal — stabilizes long-sequence gradient variance without over-constraining.
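Global-norm clipping at 0.3 rescales the whole gradient whenever its L2 norm exceeds the threshold (in the real run this is what `torch.nn.utils.clip_grad_norm_` does; the pure-Python version below is only a sketch):

```python
import math

# Global-norm gradient clipping at max_norm=0.3 (illustrative sketch;
# gradients are plain lists of floats instead of tensors).
def clip_grad_norm(grads, max_norm=0.3):
    """Rescale all gradients in place-like fashion if their joint
    L2 norm exceeds max_norm; return (clipped grads, norm before)."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total

clipped, norm_before = clip_grad_norm([[0.5, -0.5], [1.0]])
```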
3. Batch=786K is optimal for train@2048
Swept 393K to 1M. The sweet spot (786K) balances gradient noise against step count within the 10-min budget.
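The tradeoff can be made concrete with back-of-envelope arithmetic (assuming "786K" means 786,432 tokens/step, i.e. 384 sequences of 2048 tokens, and using the ~47 ms/step figure from the Reproduction section):

```python
# Back-of-envelope step/token budget for the quoted setup.
# 786,432 tokens/step and 47 ms/step are assumptions from the PR text.
tokens_per_step = 786_432
seq_len = 2048
seqs_per_step = tokens_per_step // seq_len        # 384 sequences per step
steps_in_budget = int(10 * 60 / 0.047)            # ~12,765 steps in 10 min
total_tokens = steps_in_budget * tokens_per_step  # ~10B tokens seen
```

Halving the batch roughly doubles the step count but doubles gradient noise per step; the sweep found 786K to be the balance point under this budget.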
4. Quantization-aware warmdown (from our earlier PR #61)
Aggressive LR warmdown reduces post-quant penalty 3x (0.014→0.005 BPB). However, this interacts with base LR — the benefit only appears at high LR (0.06), not low LR (0.02). Full curve mapped across 10 warmdown values in PR #61.
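A linear warmdown over the final WARMDOWN_ITERS steps is the usual speedrun schedule; the sketch below assumes that shape (the exact decay curve used in PR #61 is not restated here):

```python
# Linear LR warmdown over the final warmdown_iters steps (assumed shape).
def lr_at(step, total_steps, base_lr=0.02, warmdown_iters=20_000):
    """Constant base_lr, then linear decay to 0 over the last
    warmdown_iters steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1 -> 0 across warmdown
    return base_lr * frac
```

With WARMDOWN_ITERS=20000 and a short run, most of training happens inside the decay, which is what makes the schedule "aggressive".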
Configuration
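Collected from the values quoted above (key names are illustrative; the actual run script may spell them differently, and "786K" is quoted as written rather than resolved to an exact token count):

```python
# Hyperparameters quoted in this PR (illustrative key names).
CONFIG = {
    "sequence_length": 2048,       # training sequence length
    "muon_momentum": 0.99,         # high Muon momentum
    "learning_rate": 0.02,         # low base LR
    "grad_clip": 0.3,              # tight global-norm clipping
    "batch_tokens": "786K",        # exact token count not stated
    "warmdown_iters": 20_000,      # aggressive LR warmdown (PR #61)
    "seed": 1337,
    "eval_min_context": 1536,      # sliding-window context guarantee
}
```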
Reproduction
15.88 MB model artifact. Trained on 8xH100 SXM (RunPod), ~47 ms/step.