11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318) #198
Open
jfprincz wants to merge 1 commit into openai:main from
Conversation
integrate-your-mind pushed a commit to integrate-your-mind/parameter-golf that referenced this pull request on Mar 20, 2026
- train_gpt.py: ADAM_WEIGHT_DECAY env var (AdamW when >0), FP16_EMBED flag
- RESEARCH_NOTES.md: full analysis of all open PRs, technique taxonomy, strategy to beat the new openai#1 (1.1318 BPB from PR openai#198)
- Key finding: Int6+zstd, SmearGate, BigramHash, SWA, MuonWD are essential
- Our TTT LoRA is a unique advantage not used by any top-5 submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
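The env-var gating described above can be sketched as follows; the function name and returned config are hypothetical, assuming the commit's stated behavior (switch to AdamW whenever `ADAM_WEIGHT_DECAY` is set above zero):

```python
import os

def select_optimizer_config():
    """Hypothetical sketch of the ADAM_WEIGHT_DECAY env-var gate:
    a positive value selects AdamW with that weight decay, otherwise
    plain Adam with no decay."""
    wd = float(os.environ.get("ADAM_WEIGHT_DECAY", "0.0"))
    if wd > 0:
        return {"optimizer": "AdamW", "weight_decay": wd}
    return {"optimizer": "Adam", "weight_decay": 0.0}
```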
alertcat added a commit to alertcat/parameter-golf that referenced this pull request on Mar 20, 2026
Innovation over PR openai#198 (SOTA 1.1318):
- 12 transformer layers (was 11): +2.2M params, better representation
- Int5 quantization for MLP weights [-16, 15]: 3 zero high bits
  - zstd compression 1.88x vs int6's 1.51x, saves ~1.8MB
  - Funds the 12th layer within the 16MB budget
- Int6 kept for attention weights (precision-sensitive)
- FA3 fallback for older PyTorch
- LR=0.025 (validated as optimal in A/B testing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
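A minimal sketch of the int5 idea above. The quantizer clamps to [-16, 15]; the offset encoding (storing q+16 as an unsigned code) is an assumption on my part, but it is one way to realize the "3 zero high bits" claim that helps zstd:

```python
def quantize_int5(w, scale):
    """Map a float weight to the int5 range [-16, 15]."""
    q = round(w / scale)
    return max(-16, min(15, q))

def pack_offset(q):
    """Assumed storage scheme: shift to an unsigned code in [0, 31], so only
    the low 5 bits of each byte are used and the high 3 bits stay zero,
    which is what lets zstd compress harder than the int6 encoding."""
    return q + 16
```

Dequantization is just `code - 16` times `scale`; the attention weights stay on the wider int6 path because, per the commit, they are more precision-sensitive.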
0xjaishy pushed a commit to 0xjaishy/parameter-golf that referenced this pull request on Mar 20, 2026
Rebuild from the proven openai#1 submission (PR openai#198, 1.1326 BPB) and stack four untried improvements:
- RoPE base 50K (smoother position interpolation at seq 2048)
- LAWA-EMA replacing periodic SWA (continuous exponential moving average)
- Context-length curriculum (seq 1024 early for 60% more steps, seq 2048 late)
- Full-model SGD test-time training (1 epoch, lr=3e-4, on val data)

Architecture: 11L 512d MLP3x SmearGate BigramHash OrthoInit WD=0.04
Artifact: ~15.7MB (int6+zstd-22), 26.8M params, FA3 with SDPA fallback
Pending 8xH100 run. Target: sub-1.13 BPB.

Made-with: Cursor
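The LAWA-EMA swap mentioned above replaces periodic SWA snapshots with an exponential moving average updated every step. A minimal per-step sketch (decay=0.995 is the value quoted in later commits; the list-of-floats representation is illustrative):

```python
def ema_update(avg, current, decay=0.995):
    """One LAWA-EMA step: blend the running weight average toward the
    current weights. Run after every optimizer step, in contrast to SWA,
    which averages only periodic snapshots."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]
```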
machdragon added a commit to machdragon/parameter-golf that referenced this pull request on Mar 20, 2026
Replace the pr162-based fork with pr198 (11L, WD=0.04, relu², FA3, NTK RoPE) as the base. The SWA→LAWA-EMA swap and Overtone init are the only changes from pr198, giving a clean single-variable ablation on the strongest confirmed leaderboard submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattqlf added a commit to mattqlf/parameter-golf that referenced this pull request on Mar 20, 2026
Adds an elementwise sigmoid gate after the attention output, before the output projection. The gate projection is initialized to zero (gate ≈ 0.5 at start). Only 3 lines changed from PR openai#198's train_gpt.py.
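The zero-init detail above is what makes this change safe at step 0: sigmoid of a zero logit is exactly 0.5, so the gate starts as a uniform 0.5 scaling rather than blocking the residual stream. A minimal sketch (function names and the flat-list representation are illustrative, not the PR's actual code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attn_out(attn_out, gate_logits):
    """Elementwise sigmoid gate on the attention output, applied before the
    output projection. With the gate projection zero-initialized, every
    logit starts at 0, so the gate opens at exactly 0.5."""
    return [sigmoid(g) * a for a, g in zip(attn_out, gate_logits)]
```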
machdragon added a commit to machdragon/parameter-golf that referenced this pull request on Mar 20, 2026
Runs records/track_10min_16mb/lawa_frontier/train_gpt.py on 8x H100 via torchrun with PR openai#198 defaults (11 layers, LAWA-EMA decay=0.995, bigram vocab 2048). Uses the devel CUDA image for FA3 compilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
michaeljabbour added a commit to michaeljabbour/parameter-golf that referenced this pull request on Mar 20, 2026
- Flash Attention 3 support with SDPA fallback (1.5-2x attention speedup on H100)
- NUM_LAYERS: 10 → 11 (more capacity from int6 savings)
- ROPE_BASE: 10000 → 50000 (extended positional encoding)
- MATRIX_LR: 0.02 → 0.025 (frontier tuning)
- SCALAR_LR: 0.02 → 0.025
- TIED_EMBED_LR: 0.03 → 0.035
- BIGRAM_VOCAB_SIZE: 4096 → 2048 (smaller vocab saves params for more layers)

Frontier config from PR openai#198 (1.1326 BPB). FA3 enables more training steps within the 10-minute competition window.

🤖 Generated with Amplifier
Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
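The ROPE_BASE change above has a simple mechanical effect: a larger base lowers the rotation frequency of every non-leading dimension pair, stretching the effective positional wavelengths. A minimal sketch of the standard RoPE frequency schedule (the helper name is mine):

```python
def rope_inv_freq(dim, base):
    """Per-pair rotation frequencies for RoPE: base**(-2i/dim) for each of
    the dim//2 dimension pairs. Raising base from 10000 to 50000 shrinks
    every frequency after the first, i.e. longer positional wavelengths."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]
```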
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request on Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base, with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
machdragon added a commit to machdragon/parameter-golf that referenced this pull request on Mar 20, 2026
11-Layer Int6 + LAWA-EMA (decay=0.995) + Overtone Init, based on PR openai#198. Replaces SWA with every-step EMA averaging. Fixes the bigram proj zero-init override and sliding-window partial-window overlap. 12.7 MB artifact. 8xH100 SXM, 600s, seed=1337, 6715 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318)
val_bpb: 1.1318 (sliding window, stride=64) | 15.7 MB | 8xH100 SXM, 600s
Progress from prior submissions
Two extra layers compensate for running fewer training steps. Weight decay of 0.04 (applied through both Muon and AdamW) keeps weights quantization-friendly under int6. Sliding-window evaluation now runs at stride=64.
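For the stride=64 figure, one common strided-evaluation scheme scores only the final `stride` tokens of each window, so every token is evaluated once with near-maximal left context. The sketch below illustrates that span layout under that assumption; it is not taken from this PR's eval code:

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Yield (context_start, score_start, score_end) spans for a strided
    sliding-window eval: each window sees up to seq_len tokens of context,
    but only the final `stride` tokens contribute to val_bpb."""
    spans = []
    pos = 0
    while pos < n_tokens:
        context_start = max(0, pos + stride - seq_len)
        spans.append((context_start, pos, min(pos + stride, n_tokens)))
        pos += stride
    return spans
```

A smaller stride costs more forward passes but gives each scored token longer context, which is why stride=64 is a quality/compute trade-off rather than a free win.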
Key additions over PR #164
Everything else from PR #164 carries forward: OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, SmearGate, BigramHash, FA3, seq 2048, tuned Muon.
Results
Reproducibility (3 seeds)
Mean: 1.1326 | Variance: 0.0017 | Submitted: seed 1337
Run command