Sliding Window + Long-Context Training: val_bpb=1.1764 #96
Open
saml212 wants to merge 6 commits into openai:main from
Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.
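The NTK-RoPE extrapolation mentioned above follows the standard NTK-aware scaling rule: the rotary base is multiplied by s^(d/(d-2)) so that high-frequency dimensions are interpolated less than low-frequency ones. A minimal sketch (base=10000, head_dim=64, and the scale factor are illustrative assumptions; this PR does not state its exact values):

```python
# NTK-aware RoPE base scaling (illustrative values, not from this PR).
def ntk_scaled_base(base: float = 10_000.0, head_dim: int = 64,
                    scale: float = 1.5) -> float:
    """Return the adjusted rotary base for a context-length scale factor."""
    return base * scale ** (head_dim / (head_dim - 2))

# scale=1.0 leaves the base unchanged; scale>1 stretches low frequencies
adjusted = ntk_scaled_base()  # ~15198 for the assumed values above
```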
Score
val_bpb = 1.1764 (baseline: 1.2244, improvement: 0.048 BPB / 0.087 nats)
Across seeds: 1.1764 (SEED=1337)
Approach
Train at longer sequences (2048 tokens) with high Muon momentum (0.99), low learning rate (0.02), and tight gradient clipping (0.3). Evaluate with overlapping sliding windows where every scored token sees 1536+ tokens of preceding context.
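The overlapping-window scheme above can be sketched as index arithmetic: with a 2048-token window and a guaranteed 1536 tokens of context, each window after the first advances by 512 tokens and scores only those fresh tokens. A minimal sketch (function name and the span representation are illustrative):

```python
# Overlapping sliding-window evaluation spans (pure Python sketch).
# Each span is (window_start, score_from, window_end): tokens in
# [score_from, window_end) are scored using [window_start, window_end)
# as context, so every scored token after the first window sees at
# least `min_context` preceding tokens.
def sliding_window_spans(n_tokens, window=2048, min_context=1536):
    stride = window - min_context  # 512 fresh tokens scored per window
    spans, start = [], 0
    while start + window <= n_tokens:
        score_from = start if start == 0 else start + min_context
        spans.append((start, score_from, start + window))
        start += stride
    return spans

spans = sliding_window_spans(4096)
```

The scored ranges tile the sequence contiguously (0-2048, then 2048-2560, 2560-3072, ...), so each token is scored exactly once while paying the cost of re-running the overlap.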
Novel Findings
1. Training length doesn't matter with sliding window eval
Training at 2048 vs 4096 gives identical BPB when evaluated with sliding window (1.1764 vs 1.1765). The sliding window already provides long context at eval — the model just needs to learn local patterns well. Training at 2048 is strictly better because it gets more optimization steps in 10 minutes.
2. Gradient clipping sweet spot for long sequences
Long-sequence training benefits from a narrow clipping window (0.3 vs default 1.0). Full sweep from 0.0 to 1.0 identified 0.3 as optimal — stabilizes long-sequence gradient variance without over-constraining.
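Global-norm clipping at 0.3 rescales the whole gradient whenever its L2 norm exceeds the threshold (in the real run this is what `torch.nn.utils.clip_grad_norm_` does; the pure-Python version below is only a sketch):

```python
import math

# Global-norm gradient clipping at max_norm=0.3 (illustrative sketch;
# gradients are plain lists of floats instead of tensors).
def clip_grad_norm(grads, max_norm=0.3):
    """Rescale all gradients in place-like fashion if their joint
    L2 norm exceeds max_norm; return (clipped grads, norm before)."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total

clipped, norm_before = clip_grad_norm([[0.5, -0.5], [1.0]])
```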
3. Batch=786K is optimal for train@2048
Swept 393K to 1M. The sweet spot (786K) balances gradient noise against step count within the 10-min budget.
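The tradeoff can be made concrete with back-of-envelope arithmetic (assuming "786K" means 786,432 tokens/step, i.e. 384 sequences of 2048 tokens, and using the ~47 ms/step figure from the Reproduction section):

```python
# Back-of-envelope step/token budget for the quoted setup.
# 786,432 tokens/step and 47 ms/step are assumptions from the PR text.
tokens_per_step = 786_432
seq_len = 2048
seqs_per_step = tokens_per_step // seq_len        # 384 sequences per step
steps_in_budget = int(10 * 60 / 0.047)            # ~12,765 steps in 10 min
total_tokens = steps_in_budget * tokens_per_step  # ~10B tokens seen
```

Halving the batch roughly doubles the step count but doubles gradient noise per step; the sweep found 786K to be the balance point under this budget.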
4. Quantization-aware warmdown (from our earlier PR #61)
Aggressive LR warmdown reduces post-quant penalty 3x (0.014→0.005 BPB). However, this interacts with base LR — the benefit only appears at high LR (0.06), not low LR (0.02). Full curve mapped across 10 warmdown values in PR #61.
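A linear warmdown over the final WARMDOWN_ITERS steps is the usual speedrun schedule; the sketch below assumes that shape (the exact decay curve used in PR #61 is not restated here):

```python
# Linear LR warmdown over the final warmdown_iters steps (assumed shape).
def lr_at(step, total_steps, base_lr=0.02, warmdown_iters=20_000):
    """Constant base_lr, then linear decay to 0 over the last
    warmdown_iters steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1 -> 0 across warmdown
    return base_lr * frac
```

With WARMDOWN_ITERS=20000 and a short run, most of training happens inside the decay, which is what makes the schedule "aggressive".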
Configuration
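Collected from the values quoted above (key names are illustrative; the actual run script may spell them differently, and "786K" is quoted as written rather than resolved to an exact token count):

```python
# Hyperparameters quoted in this PR (illustrative key names).
CONFIG = {
    "sequence_length": 2048,       # training sequence length
    "muon_momentum": 0.99,         # high Muon momentum
    "learning_rate": 0.02,         # low base LR
    "grad_clip": 0.3,              # tight global-norm clipping
    "batch_tokens": "786K",        # exact token count not stated
    "warmdown_iters": 20_000,      # aggressive LR warmdown (PR #61)
    "seed": 1337,
    "eval_min_context": 1536,      # sliding-window context guarantee
}
```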
Reproduction
15.88 MB model artifact. Trained on 8xH100 SXM (RunPod), ~47 ms/step.