[WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window (est. val_bpb ~1.160) #76

Open

unixmadtoonslab wants to merge 1 commit into openai:main from unixmadtoonslab:submission/int6-wider-mlp-fp16embed-sliding

@unixmadtoonslab unixmadtoonslab commented Mar 19, 2026

[WIP] Int6 + Wider MLP 3x + FP16 Embed + Sliding Window

Estimated val_bpb: ~1.160 (awaiting 8xH100 SXM compute for official run)

Stacks four orthogonal improvements over the naive baseline (val_bpb 1.2244):

Techniques

  1. Wider MLP (MLP_MULT=3.0) — 3x expansion (hidden=1536), ~0.019 BPB improvement
  2. int6 per-row quantization on MLP+attention — saves ~4MB artifact space, only +0.010 BPB degradation; zstd-22 compression
  3. FP16 tied embedding passthrough — keeps the most quantization-sensitive tensor (tied embed/logit head) in fp16 instead of int8, ~0.005 BPB improvement at ~0.5MB cost
  4. Sliding window eval (stride=256) — overlapping windows, ~0.033 BPB improvement, zero artifact cost
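The per-row int6 scheme from technique 2 can be sketched as below. This is an illustrative NumPy sketch, not the submission's actual code: the function names and the symmetric [-31, 31] clipping range (6-bit signed, symmetric) are assumptions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric int6 quantization (hypothetical sketch).

    Each row gets its own scale, so a single outlier row does not
    inflate the quantization error of every other row. Quantized
    values are clipped to [-31, 31], the symmetric 6-bit range."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Roundtrip a random matrix: per-element error is bounded by scale / 2.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
err = float(np.abs(w - w_hat).max())
```

Stored as one int8 value per weight (or packed 6-bit), plus one fp scale per row, this is what zstd-22 would then compress into the artifact.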

Novel contribution: FP16 Tied Embedding + int6

This submission extends the int6+wider MLP approach (PR #70) with fp16 tied embedding passthrough. The tied embedding doubles as the output logit head and is the most quantization-sensitive tensor. Keeping it in fp16 fits within the int6 space savings while providing additional BPB improvement.
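The passthrough can be pictured as a name filter applied before quantization. A minimal sketch, assuming `FP16_KEEP_NAME_PATTERNS` is matched as substrings of parameter names (the helper name and dict-based state format are illustrative, not the actual implementation):

```python
import numpy as np

# Mirrors FP16_KEEP_NAME_PATTERNS=tok_emb from the run command:
# parameters whose names contain any pattern stay fp16; the rest go to int6.
FP16_KEEP_NAME_PATTERNS = ["tok_emb"]

def split_for_quantization(state_dict):
    """Split a name->array state dict into fp16-kept and to-be-quantized parts."""
    keep_fp16, to_quantize = {}, {}
    for name, tensor in state_dict.items():
        if any(pat in name for pat in FP16_KEEP_NAME_PATTERNS):
            keep_fp16[name] = tensor.astype(np.float16)
        else:
            to_quantize[name] = tensor
    return keep_fp16, to_quantize

state = {
    "tok_emb.weight": np.zeros((4, 4), dtype=np.float32),
    "mlp.fc1.weight": np.zeros((4, 4), dtype=np.float32),
}
fp16_part, quant_part = split_for_quantization(state)
```

Because the embedding is tied to the logit head, keeping this one tensor in fp16 protects both the input and output projections at a single ~0.5MB cost.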

Run Command

```shell
FP16_KEEP_NAME_PATTERNS=tok_emb \
MATRIX_LR=0.020 SCALAR_LR=0.020 TIED_EMBED_LR=0.030 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_MOMENTUM_WARMUP_START=0.92 \
WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

1xH100 Validation (3 min, 348 steps)

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 2.1346 |
| int6+zstd roundtrip val_bpb | 2.1356 |
| int6+zstd sliding window val_bpb | 2.1333 |
| Artifact size (int6+zstd + code) | 15,630,013 bytes |
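The stride-256 sliding window eval behind the last table row can be scheduled as overlapping spans where each token is scored exactly once but sees left context across block boundaries. A sketch under assumptions: the function name and the `window=1024` default are illustrative, not taken from the submission.

```python
def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 256):
    """Overlapping evaluation spans (illustrative sketch).

    Returns (start, end, score_from) triples: the model reads tokens
    [start, end) but loss is computed only on [score_from, end), so
    every token is scored exactly once with up to window - stride
    tokens of left context instead of none at a hard block boundary."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)       # score the next stride tokens
        start = max(0, end - window)            # back-fill context to window size
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(3000)
```

Compared with non-overlapping blocks, this costs extra forward passes at eval time but zero artifact bytes, which is why it stacks cleanly on the other three techniques.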

Status

  • Code complete and tested on 1xH100
  • int6+zstd roundtrip verified (quant gap: 0.001 BPB)
  • Artifact fits under 16MB (15.63MB)
  • Pending: 8xH100 SXM official 10-minute run (applying for compute credits)
  • Pending: final val_bpb numbers and train.log

What We Tried and Rejected

  • QAT (int6 fake quantization): Eliminates quant gap but 54% step overhead. Net negative.
  • SEQ_LEN=4096: Fewer training tokens, smaller sliding window gain with wider MLP.
  • Depth recurrence: 0.13 BPB quant gap, not viable in 10min.

… ~1.160)

Four orthogonal improvements stacked: int6 mixed-precision quantization on
MLP+attention weights with zstd-22 compression, 3x MLP expansion, fp16 tied
embedding passthrough, and sliding window evaluation. Awaiting 8xH100 SXM
compute credits for official run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
