11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318)#198

Open
jfprincz wants to merge 1 commit into openai:main from jfprincz:submission/11l-int6-wd04-swa-fa3-1.1318

Conversation

@jfprincz


val_bpb: 1.1318 (sliding window, stride=64) | 15.7 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | This PR | Delta vs #164 |
| --- | --- | --- | --- | --- |
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | -0.0206 |
| Layers | 9 | 9 | 11 | +2 |
| Params | 21.8M | 22.4M | 26.8M | +4.4M |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | +0.3 MB |
| Steps (600s) | 12,485 | 8,390 | 7,412 | -978 |
| Step time (8xH100) | 48 ms | 68 ms | 81 ms | +13 ms |

Two extra layers compensate for the reduced step count. Weight decay of 0.04 (applied in both Muon and AdamW) keeps weights quantization-friendly under int6. Sliding-window evaluation now runs at stride=64.
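For intuition on the stride change, the sliding-window span layout can be sketched as follows (a hypothetical helper for illustration; names and structure are not taken from this PR's train_gpt.py):

```python
def sliding_windows(total_len, window=2048, stride=64):
    """Enumerate (ctx_start, score_start, score_end) spans for sliding-window
    evaluation: the first window scores all of its tokens, and each later span
    scores only `stride` new tokens while attending over a full `window` of
    context, so every later scored token sees at least window - stride prior
    tokens."""
    spans = [(0, 0, min(window, total_len))]
    pos = min(window, total_len)
    while pos < total_len:
        end = min(pos + stride, total_len)
        # Slide the context back so the scored chunk sits at the window's end.
        spans.append((end - window, pos, end))
        pos = end
    return spans
```

At window 2048, dropping the stride from 256 to 64 raises the minimum left context per scored token from 1792 to 1984, at the cost of roughly 4x as many evaluation forward passes.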

Key additions over PR #164

| Change | Impact |
| --- | --- |
| 11 layers (was 9) | +4.4M params; major capacity gain, funded by int6 headroom |
| Weight decay 0.04 | Applied in both Muon and AdamW; smaller weights improve int6 compression |
| SWA (~8-checkpoint average) | Smoothed weights during warmdown |
| Eval stride=64 (was 256) | Near-full context for every scored token |
| Bigram 2048 buckets (was 4096) | Saves ~300 KB of artifact at negligible BPB cost |
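The SWA item amounts to a uniform average of late-training checkpoints. A minimal sketch, assuming checkpoints are plain dicts mapping parameter names to lists of floats (the PR averages real model state during warmdown, not toy dicts):

```python
def average_checkpoints(checkpoints):
    """Uniformly average a handful (~8) of late-training checkpoints,
    parameter by parameter and element by element."""
    n = len(checkpoints)
    return {
        name: [sum(c[name][i] for c in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```

Averaging several checkpoints taken during warmdown smooths out late-training noise in the weights, which is the effect credited in the table.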

Everything else from PR #164 carries forward: OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, SmearGate, BigramHash, FA3, seq 2048, tuned Muon.
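A minimal sketch of the int6 round trip, assuming symmetric per-tensor scaling; the actual artifact bit-packs values to 6 bits and compresses with zstd at level 22, for which stdlib zlib stands in here:

```python
import zlib

def quantize_int6(weights):
    """Symmetric per-tensor quantization to the int6 range [-32, 31]."""
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def compress(q):
    # One byte per value here; the real pipeline packs 6 bits per value
    # before compressing, and uses zstd-22 rather than zlib.
    return zlib.compress(bytes(v + 32 for v in q), 9)
```

This also shows why weight decay helps the quantized result: a smaller maximum |w| means a smaller scale, i.e. a finer quantization grid and a smaller pre- vs post-quant bpb gap.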

Results

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1432 |
| Int6 roundtrip val_bpb | 1.1543 |
| Int6 sliding val_bpb (s64) | 1.1318 |
| Steps completed (600s cap) | 7,412 |
| Step time | 81 ms |
| Model params | 26,829,913 |
| Artifact size | 15,689,380 bytes |

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact |
| --- | --- | --- | --- |
| 1337 | 7,412 | 1.1318 | 15.69 MB |
| 42 | 7,407 | 1.1335 | 15.70 MB |
| 2025 | 7,412 | 1.1324 | 15.69 MB |

Mean: 1.1326 | Range (max-min): 0.0017 | Submitted: seed 1337
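As a quick check, the summary numbers can be recomputed from the seed table; note that 0.0017 matches the max-minus-min spread across seeds, not the statistical variance (which would be about 5e-7):

```python
# Per-seed sliding-s64 val_bpb from the reproducibility table above.
seed_bpb = {1337: 1.1318, 42: 1.1335, 2025: 1.1324}
vals = list(seed_bpb.values())
mean = sum(vals) / len(vals)     # rounds to 1.1326
spread = max(vals) - min(vals)   # 0.0017, max minus min across seeds
```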

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

integrate-your-mind pushed a commit to integrate-your-mind/parameter-golf that referenced this pull request Mar 20, 2026
- train_gpt.py: ADAM_WEIGHT_DECAY env var (AdamW when >0), FP16_EMBED flag
- RESEARCH_NOTES.md: Full analysis of all open PRs, technique taxonomy,
  strategy to beat new openai#1 (1.1318 BPB from PR openai#198)
- Key finding: Int6+zstd, SmearGate, BigramHash, SWA, MuonWD are essential
- Our TTT LoRA is unique advantage not used by any top-5 submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alertcat added a commit to alertcat/parameter-golf that referenced this pull request Mar 20, 2026
Innovation over PR openai#198 (SOTA 1.1318):
- 12 transformer layers (was 11): +2.2M params, better representation
- Int5 quantization for MLP weights [-16,15]: 3 zero high bits
  - zstd compression 1.88x vs int6 1.51x, saves ~1.8MB
  - Funds the 12th layer within 16MB budget
- Int6 kept for attention weights (precision-sensitive)
- FA3 fallback for older PyTorch
- LR=0.025 (validated as optimal in A/B testing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

0xjaishy pushed a commit to 0xjaishy/parameter-golf that referenced this pull request Mar 20, 2026
Rebuild from the proven openai#1 submission (PR openai#198, 1.1326 BPB) and stack
four untried improvements:

- RoPE base 50K (smoother position interpolation at seq2048)
- LAWA-EMA replacing periodic SWA (continuous exponential moving average)
- Context-length curriculum (seq1024 early for 60% more steps, seq2048 late)
- Full-model SGD test-time training (1 epoch, lr=3e-4, on val data)

Architecture: 11L 512d MLP3x SmearGate BigramHash OrthoInit WD=0.04
Artifact: ~15.7MB (int6+zstd-22), 26.8M params, FA3 with SDPA fallback
Pending 8xH100 run. Target: sub-1.13 BPB.

Made-with: Cursor

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Replace the pr162-based fork with pr198 (11L, WD=0.04, relu², FA3,
NTK RoPE) as the base. SWA→LAWA-EMA swap and Overtone init are the
only changes from pr198, giving a clean single-variable ablation on
the strongest confirmed leaderboard submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattqlf added a commit to mattqlf/parameter-golf that referenced this pull request Mar 20, 2026
Adds elementwise sigmoid gate after attention output, before output
projection. Gate projection initialized to zero (gate ≈ 0.5 at start).
Only 3 lines changed from PR openai#198's train_gpt.py.

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Runs records/track_10min_16mb/lawa_frontier/train_gpt.py on 8x H100
via torchrun with PR openai#198 defaults (11 layers, LAWA-EMA decay=0.995,
bigram vocab 2048). Uses devel CUDA image for FA3 compilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michaeljabbour added a commit to michaeljabbour/parameter-golf that referenced this pull request Mar 20, 2026
- Flash Attention 3 support with SDPA fallback (1.5-2x attention speedup on H100)
- NUM_LAYERS: 10 → 11 (more capacity from int6 savings)
- ROPE_BASE: 10000 → 50000 (extended positional encoding)
- MATRIX_LR: 0.02 → 0.025 (frontier tuning)
- SCALAR_LR: 0.02 → 0.025
- TIED_EMBED_LR: 0.03 → 0.035
- BIGRAM_VOCAB_SIZE: 4096 → 2048 (smaller vocab saves params for more layers)

Frontier config from PR openai#198 (1.1326 BPB). FA3 enables more training steps within 10-minute competition window.

🤖 Generated with Amplifier

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base,
with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
11-Layer Int6 + LAWA-EMA (decay=0.995) + Overtone Init, based on PR openai#198.
Replaces SWA with every-step EMA averaging. Fixes bigram proj zero-init
override and sliding window partial-window overlap. 12.7 MB artifact.

8xH100 SXM, 600s, seed=1337, 6715 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>