[WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window (est. val_bpb ~1.160) #76

Open

unixmadtoonslab wants to merge 1 commit into openai:main from unixmadtoonslab:submission/int6-wider-mlp-fp16embed-sliding

@unixmadtoonslab unixmadtoonslab commented Mar 19, 2026

[WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window

Estimated val_bpb: ~1.160 (awaiting 8xH100 SXM compute for official run)

Stacks four orthogonal improvements over the naive baseline (val_bpb 1.2244):

Techniques

  1. Wider MLP (MLP_MULT=3.0) — 3x expansion (hidden=1536), ~0.019 BPB improvement
  2. int6 per-row quantization on MLP+attention — saves ~4MB artifact space, only +0.010 BPB degradation; zstd-22 compression
  3. FP16 tied embedding passthrough — keeps the most quantization-sensitive tensor (tied embed/logit head) in fp16 instead of int8, ~0.005 BPB improvement at ~0.5MB cost
  4. Sliding window eval (stride=256) — overlapping windows, ~0.033 BPB improvement, zero artifact cost
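The per-row int6 scheme from technique 2 can be sketched as below. This is an illustrative NumPy sketch, not the submission's actual code: the function names and the symmetric [-31, 31] clipping range (6-bit signed, symmetric) are assumptions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric int6 quantization (hypothetical sketch).

    Each row gets its own scale, so a single outlier row does not
    inflate the quantization error of every other row. Quantized
    values are clipped to [-31, 31], the symmetric 6-bit range."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Roundtrip a random matrix: per-element error is bounded by scale / 2.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
err = float(np.abs(w - w_hat).max())
```

Stored as one int8 value per weight (or packed 6-bit), plus one fp scale per row, this is what zstd-22 would then compress into the artifact.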

Novel contribution: FP16 Tied Embedding + int6

This submission extends the int6+wider MLP approach (PR #70) with fp16 tied embedding passthrough. The tied embedding doubles as the output logit head and is the most quantization-sensitive tensor. Keeping it in fp16 fits within the int6 space savings while providing additional BPB improvement.
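The passthrough can be pictured as a name filter applied before quantization. A minimal sketch, assuming `FP16_KEEP_NAME_PATTERNS` is matched as substrings of parameter names (the helper name and dict-based state format are illustrative, not the actual implementation):

```python
import numpy as np

# Mirrors FP16_KEEP_NAME_PATTERNS=tok_emb from the run command:
# parameters whose names contain any pattern stay fp16; the rest go to int6.
FP16_KEEP_NAME_PATTERNS = ["tok_emb"]

def split_for_quantization(state_dict):
    """Split a name->array state dict into fp16-kept and to-be-quantized parts."""
    keep_fp16, to_quantize = {}, {}
    for name, tensor in state_dict.items():
        if any(pat in name for pat in FP16_KEEP_NAME_PATTERNS):
            keep_fp16[name] = tensor.astype(np.float16)
        else:
            to_quantize[name] = tensor
    return keep_fp16, to_quantize

state = {
    "tok_emb.weight": np.zeros((4, 4), dtype=np.float32),
    "mlp.fc1.weight": np.zeros((4, 4), dtype=np.float32),
}
fp16_part, quant_part = split_for_quantization(state)
```

Because the embedding is tied to the logit head, keeping this one tensor in fp16 protects both the input and output projections at a single ~0.5MB cost.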

Run Command

```shell
FP16_KEEP_NAME_PATTERNS=tok_emb \
MATRIX_LR=0.020 SCALAR_LR=0.020 TIED_EMBED_LR=0.030 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_MOMENTUM_WARMUP_START=0.92 \
WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

1xH100 Validation (3 min, 348 steps)

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 2.1346 |
| int6+zstd roundtrip val_bpb | 2.1356 |
| int6+zstd sliding window val_bpb | 2.1333 |
| Artifact size (int6+zstd + code) | 15,630,013 bytes |
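The stride-256 sliding window eval behind the last table row can be scheduled as overlapping spans where each token is scored exactly once but sees left context across block boundaries. A sketch under assumptions: the function name and the `window=1024` default are illustrative, not taken from the submission.

```python
def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 256):
    """Overlapping evaluation spans (illustrative sketch).

    Returns (start, end, score_from) triples: the model reads tokens
    [start, end) but loss is computed only on [score_from, end), so
    every token is scored exactly once with up to window - stride
    tokens of left context instead of none at a hard block boundary."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)       # score the next stride tokens
        start = max(0, end - window)            # back-fill context to window size
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(3000)
```

Compared with non-overlapping blocks, this costs extra forward passes at eval time but zero artifact bytes, which is why it stacks cleanly on the other three techniques.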

Status

  • Code complete and tested on 1xH100
  • int6+zstd roundtrip verified (quant gap: 0.001 BPB)
  • Artifact fits under 16MB (15.63MB)
  • Pending: 8xH100 SXM official 10-minute run (applying for compute credits)
  • Pending: final val_bpb numbers and train.log

What We Tried and Rejected

  • QAT (int6 fake quantization): Eliminates quant gap but 54% step overhead. Net negative.
  • SEQ_LEN=4096: Fewer training tokens, smaller sliding window gain with wider MLP.
  • Depth recurrence: 0.13 BPB quant gap, not viable in 10min.

… ~1.160)

Four orthogonal improvements stacked: int6 mixed-precision quantization on
MLP+attention weights with zstd-22 compression, 3x MLP expansion, fp16 tied
embedding passthrough, and sliding window evaluation. Awaiting 8xH100 SXM
compute credits for official run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
