11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318)#198

Open
jfprincz wants to merge 1 commit into openai:main from jfprincz:submission/11l-int6-wd04-swa-fa3-1.1318

Conversation

@jfprincz


val_bpb: 1.1318 (sliding window, stride=64) | 15.7 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | This PR | Delta vs #164 |
| --- | --- | --- | --- | --- |
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | -0.0206 |
| Layers | 9 | 9 | 11 | +2 |
| Params | 21.8M | 22.4M | 26.8M | +4.4M |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | +0.3 MB |
| Steps (600s) | 12,485 | 8,390 | 7,412 | -978 |
| Step time (8xH100) | 48 ms | 68 ms | 81 ms | +13 ms |

Two extra layers compensate for the reduced step count. Weight decay of 0.04 (applied in both Muon and AdamW) keeps weights quantization-friendly under int6. Sliding-window evaluation now runs at stride=64.
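For intuition on the stride change, the sliding-window span layout can be sketched as follows (a hypothetical helper for illustration; names and structure are not taken from this PR's train_gpt.py):

```python
def sliding_windows(total_len, window=2048, stride=64):
    """Enumerate (ctx_start, score_start, score_end) spans for sliding-window
    evaluation: the first window scores all of its tokens, and each later span
    scores only `stride` new tokens while attending over a full `window` of
    context, so every later scored token sees at least window - stride prior
    tokens."""
    spans = [(0, 0, min(window, total_len))]
    pos = min(window, total_len)
    while pos < total_len:
        end = min(pos + stride, total_len)
        # Slide the context back so the scored chunk sits at the window's end.
        spans.append((end - window, pos, end))
        pos = end
    return spans
```

At window 2048, dropping the stride from 256 to 64 raises the minimum left context per scored token from 1792 to 1984, at the cost of roughly 4x as many evaluation forward passes.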

Key additions over PR #164

| Change | Impact |
| --- | --- |
| 11 layers (was 9) | +4.4M params; major capacity gain, funded by int6 headroom |
| Weight decay 0.04 | Applied in both Muon and AdamW; smaller weights improve int6 compression |
| SWA (~8-checkpoint average) | Smoothed weights during warmdown |
| Eval stride=64 (was 256) | Near-full context for every scored token |
| Bigram 2048 buckets (was 4096) | Saves ~300 KB of artifact at negligible BPB cost |
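The SWA item amounts to a uniform average of late-training checkpoints. A minimal sketch, assuming checkpoints are plain dicts mapping parameter names to lists of floats (the PR averages real model state during warmdown, not toy dicts):

```python
def average_checkpoints(checkpoints):
    """Uniformly average a handful (~8) of late-training checkpoints,
    parameter by parameter and element by element."""
    n = len(checkpoints)
    return {
        name: [sum(c[name][i] for c in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```

Averaging several checkpoints taken during warmdown smooths out late-training noise in the weights, which is the effect credited in the table.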

Everything else from PR #164 carries forward: OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, SmearGate, BigramHash, FA3, seq 2048, tuned Muon.
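A minimal sketch of the int6 round trip, assuming symmetric per-tensor scaling; the actual artifact bit-packs values to 6 bits and compresses with zstd at level 22, for which stdlib zlib stands in here:

```python
import zlib

def quantize_int6(weights):
    """Symmetric per-tensor quantization to the int6 range [-32, 31]."""
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def compress(q):
    # One byte per value here; the real pipeline packs 6 bits per value
    # before compressing, and uses zstd-22 rather than zlib.
    return zlib.compress(bytes(v + 32 for v in q), 9)
```

This also shows why weight decay helps the quantized result: a smaller maximum |w| means a smaller scale, i.e. a finer quantization grid and a smaller pre- vs post-quant bpb gap.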

Results

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1432 |
| Int6 roundtrip val_bpb | 1.1543 |
| Int6 sliding val_bpb (s64) | 1.1318 |
| Steps completed (600s cap) | 7,412 |
| Step time | 81 ms |
| Model params | 26,829,913 |
| Artifact size | 15,689,380 bytes |

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact |
| --- | --- | --- | --- |
| 1337 | 7,412 | 1.1318 | 15.69 MB |
| 42 | 7,407 | 1.1335 | 15.70 MB |
| 2025 | 7,412 | 1.1324 | 15.69 MB |

Mean: 1.1326 | Range (max-min): 0.0017 | Submitted: seed 1337
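As a quick check, the summary numbers can be recomputed from the seed table; note that 0.0017 matches the max-minus-min spread across seeds, not the statistical variance (which would be about 5e-7):

```python
# Per-seed sliding-s64 val_bpb from the reproducibility table above.
seed_bpb = {1337: 1.1318, 42: 1.1335, 2025: 1.1324}
vals = list(seed_bpb.values())
mean = sum(vals) / len(vals)     # rounds to 1.1326
spread = max(vals) - min(vals)   # 0.0017, max minus min across seeds
```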

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

integrate-your-mind pushed a commit to integrate-your-mind/parameter-golf that referenced this pull request Mar 20, 2026
- train_gpt.py: ADAM_WEIGHT_DECAY env var (AdamW when >0), FP16_EMBED flag
- RESEARCH_NOTES.md: Full analysis of all open PRs, technique taxonomy,
  strategy to beat new openai#1 (1.1318 BPB from PR openai#198)
- Key finding: Int6+zstd, SmearGate, BigramHash, SWA, MuonWD are essential
- Our TTT LoRA is unique advantage not used by any top-5 submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alertcat added a commit to alertcat/parameter-golf that referenced this pull request Mar 20, 2026
Innovation over PR openai#198 (SOTA 1.1318):
- 12 transformer layers (was 11): +2.2M params, better representation
- Int5 quantization for MLP weights [-16,15]: 3 zero high bits
  - zstd compression 1.88x vs int6 1.51x, saves ~1.8MB
  - Funds the 12th layer within 16MB budget
- Int6 kept for attention weights (precision-sensitive)
- FA3 fallback for older PyTorch
- LR=0.025 (validated as optimal in A/B testing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

0xjaishy pushed a commit to 0xjaishy/parameter-golf that referenced this pull request Mar 20, 2026
Rebuild from the proven openai#1 submission (PR openai#198, 1.1326 BPB) and stack
four untried improvements:

- RoPE base 50K (smoother position interpolation at seq2048)
- LAWA-EMA replacing periodic SWA (continuous exponential moving average)
- Context-length curriculum (seq1024 early for 60% more steps, seq2048 late)
- Full-model SGD test-time training (1 epoch, lr=3e-4, on val data)

Architecture: 11L 512d MLP3x SmearGate BigramHash OrthoInit WD=0.04
Artifact: ~15.7MB (int6+zstd-22), 26.8M params, FA3 with SDPA fallback
Pending 8xH100 run. Target: sub-1.13 BPB.

Made-with: Cursor

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Replace the pr162-based fork with pr198 (11L, WD=0.04, relu², FA3,
NTK RoPE) as the base. SWA→LAWA-EMA swap and Overtone init are the
only changes from pr198, giving a clean single-variable ablation on
the strongest confirmed leaderboard submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattqlf added a commit to mattqlf/parameter-golf that referenced this pull request Mar 20, 2026
Adds elementwise sigmoid gate after attention output, before output
projection. Gate projection initialized to zero (gate ≈ 0.5 at start).
Only 3 lines changed from PR openai#198's train_gpt.py.

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Runs records/track_10min_16mb/lawa_frontier/train_gpt.py on 8x H100
via torchrun with PR openai#198 defaults (11 layers, LAWA-EMA decay=0.995,
bigram vocab 2048). Uses devel CUDA image for FA3 compilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michaeljabbour added a commit to michaeljabbour/parameter-golf that referenced this pull request Mar 20, 2026
- Flash Attention 3 support with SDPA fallback (1.5-2x attention speedup on H100)
- NUM_LAYERS: 10 → 11 (more capacity from int6 savings)
- ROPE_BASE: 10000 → 50000 (extended positional encoding)
- MATRIX_LR: 0.02 → 0.025 (frontier tuning)
- SCALAR_LR: 0.02 → 0.025
- TIED_EMBED_LR: 0.03 → 0.035
- BIGRAM_VOCAB_SIZE: 4096 → 2048 (smaller vocab saves params for more layers)

Frontier config from PR openai#198 (1.1326 BPB). FA3 enables more training steps within 10-minute competition window.

🤖 Generated with Amplifier

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base,
with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
11-Layer Int6 + LAWA-EMA (decay=0.995) + Overtone Init, based on PR openai#198.
Replaces SWA with every-step EMA averaging. Fixes bigram proj zero-init
override and sliding window partial-window overlap. 12.7 MB artifact.

8xH100 SXM, 600s, seed=1337, 6715 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>