
Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630 #65

Open

aquariouseworkman wants to merge 2 commits into openai:main from aquariouseworkman:main

Conversation

@aquariouseworkman aquariouseworkman commented Mar 19, 2026

## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces the quantization penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |

- Trained and evaluated on 8xH100 SXM (RunPod)
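The per-row quantization scheme above is easy to sketch. Below is a minimal NumPy illustration of symmetric per-row quantization at the stated level counts (31 levels for the int6 block weights, 127 for the int8 embedding); the function names are illustrative and not from the submission's train_gpt.py.

```python
import numpy as np

def quantize_per_row(w: np.ndarray, levels: int):
    """Symmetric per-row quantization: each row gets its own scale.

    With an odd `levels`, codes live on the integer grid -qmax..+qmax,
    where qmax = (levels - 1) // 2 (qmax = 15 for 31 levels, 63 for 127).
    """
    qmax = (levels - 1) // 2
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    codes = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)

q6, s6 = quantize_per_row(w, levels=31)   # int6-style grid (block weights)
q8, s8 = quantize_per_row(w, levels=127)  # int8-style grid (embedding)
```

Per-row scales keep an outlier row from degrading the whole tensor, which is what makes the coarse 31-level grid survivable for the STE-trained block weights.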

arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
…_bpb 1.1652

Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits

Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001).
Three seeds: 1.16615, 1.16532, 1.16412.
Artifact: 15.6MB (under 16,000,000 byte cap).
Training: 9370 steps at 64ms/step on 8xH100 SXM.

Made-with: Cursor
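The stride-64 sliding window eval referenced here and in the PR description can be planned with simple index arithmetic: each window covers seq_len tokens but only its not-yet-scored suffix contributes to the loss. A minimal sketch (the helper name is hypothetical, not from any of the referenced scripts):

```python
def sliding_windows(n_tokens: int, seq_len: int = 1024, stride: int = 64):
    """Plan overlapping eval windows as (start, end, score_from) triples.

    Window k covers tokens [start, end); only tokens at positions
    >= score_from are scored, so every token is scored exactly once and
    every scored token after the first window sees seq_len - stride
    tokens of left context.
    """
    windows, covered = [], 0
    start = 0
    while covered < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, covered))  # score tokens [covered, end)
        covered = end
        start += stride
    return windows
```

With seq_len=1024 and stride=64, every scored token past the first window sees 1024 - 64 = 960 tokens of left context, matching the 960 figure in the PR description.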
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
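If these settings are environment variables consumed by train_gpt.py, as the run command in the PR description suggests, applying them would look roughly like this; whether train_gpt.py actually reads every one of these names is an assumption, the values are taken from the list above:

```shell
# Assumed env-var overrides; names and values from the commit message above.
MUON_MOMENTUM=0.99 MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
WARMDOWN_ITERS=3000 MUON_WARMUP_START=0.92 MUON_WARMUP_STEPS=1500 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```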
@aquariouseworkman changed the title from "Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808" to "Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630" on Mar 19, 2026
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- record PR openai#61 as schedule-side evidence about long warmdown reducing quantization damage
- keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
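The ~25% saving over int8 follows directly from the bit widths (6/8 = 0.75), but realizing it on disk requires packing four 6-bit codes into three bytes. None of the referenced exporters are shown in this thread; the following is a hypothetical NumPy sketch, assuming offset-binary codes in 0..63:

```python
import numpy as np

def pack_int6(codes: np.ndarray) -> bytes:
    """Pack 6-bit codes (values 0..63) four-at-a-time into 3 bytes."""
    assert codes.size % 4 == 0, "pad to a multiple of 4 before packing"
    c = codes.reshape(-1, 4).astype(np.uint32)
    word = (c[:, 0] << 18) | (c[:, 1] << 12) | (c[:, 2] << 6) | c[:, 3]
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.tobytes()

def unpack_int6(buf: bytes, n: int) -> np.ndarray:
    b = np.frombuffer(buf, dtype=np.uint8).reshape(-1, 3).astype(np.uint32)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    cols = [(word >> 18) & 63, (word >> 12) & 63, (word >> 6) & 63, word & 63]
    return np.stack(cols, axis=1).reshape(-1)[:n].astype(np.uint8)
```

That is 24 bits per group of four weights instead of 32, exactly the 25% reduction cited, before any zstd pass sees the stream.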
arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
Add Straight-Through Estimator fake int6 quantization to CastedLinear
during training. Forward pass uses quantized weights (int6 per-row),
backward passes gradients through originals. Teaches weight distributions
that survive post-training int6 quantization.

Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.

Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225

Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001.
Artifact: 15.3MB (under 16,000,000 byte cap).

Made-with: Cursor
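The STE fake-quant described here is the standard straight-through trick: quantize in the forward pass, pass gradients through untouched in the backward pass. A minimal PyTorch sketch under the same per-row symmetric scheme; the class name is illustrative, not the submission's CastedLinear code:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator for per-row symmetric fake quantization."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, levels: int = 31) -> torch.Tensor:
        qmax = (levels - 1) // 2
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        # Forward sees the weights the post-training int6 export will produce.
        return (w / scale).round().clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Gradients skip the non-differentiable round/clamp entirely.
        return grad_out, None

w = torch.randn(4, 8, requires_grad=True)
loss = FakeQuantSTE.apply(w, 31).sum()
loss.backward()  # w.grad is all ones: identity backward
```

Training against the quantized forward is what shrinks the export penalty: the weights settle into values that survive the post-training int6 rounding.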
lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 19, 2026
Downloaded PR openai#65 SOTA train_gpt.py (1.1630 BPB). Added zstandard dep,
use_sota flag to toggle between baseline and SOTA scripts.
5-min baseline recorded: val_bpb=1.3738, post-quant=1.3766.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>