Skip to content

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)#164

Open
jfprincz wants to merge 1 commit intoopenai:mainfrom
jfprincz:submission/orthoinit-int6-mlp3x-smear-bigram-1.1524
Open

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)#164
jfprincz wants to merge 1 commit intoopenai:mainfrom
jfprincz:submission/orthoinit-int6-mlp3x-smear-bigram-1.1524

Conversation

@jfprincz
Copy link

Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)

val_bpb: 1.1524 (sliding window, stride=256) | 15.4 MB | 8xH100 SXM, 600s

Progress from our prior submission

PR #70 (v1) This (v2) Delta
val_bpb (sliding) 1.1659 1.1524 -0.0135
Artifact 14.9 MB 15.4 MB +0.5 MB
Steps (600s) 12,485 8,390 -4,095
Step time (8xH100) 48ms 68ms +20ms
Train seq_len 1024 2048 2x
Model params 21.8M 22.4M +0.6M

Fewer steps at longer context and richer initialization more than compensate for the slower per-step speed. The 0.0135 BPB improvement comes from stacking eight techniques that each contribute independently.

Techniques

# Technique Description
1 Orthogonal + muP init orthogonal_ on all large matrices, output projections scaled by 1/√(2·layers) — faster early convergence
2 3x MLP (hidden=1536) +2.3M params, ~0.02 BPB gain — budget from int6 savings
3 Int6 mixed quant + zstd-22 Per-row int6 on MLP+attention, int8 on embeddings+bigram, fp32 on controls
4 SmearGate Learned sigmoid gate blending each token with previous token's embedding (~512 params)
5 Bigram Hash Embedding 4096-bucket hash table (dim=128→512) injecting token-pair features
6 Tuned Muon optimizer LR=0.02, momentum=0.99 w/ warmup, warmdown=3000, grad clip=0.3
7 Seq 2048 + sliding window Train and eval at 2048 tokens, NTK-aware RoPE, stride=256 scoring
8 FlashAttention 3 Direct flash_attn_func calls — ~5% step time reduction vs SDPA backend

Results

Metric Value
Pre-quant val_bpb 1.1659
Int6 roundtrip val_bpb 1.1744
Int6 sliding val_bpb (s256) 1.1524
Steps completed (600s cap) 8,390
Step time 68ms
Model params 22,368,841
Artifact size 15,401,594 bytes

Reproducibility (3 seeds)

Seed Steps Sliding s256 Sliding s64 Artifact
1337 7,569 1.1544 15.40 MB
42 8,390 1.1524 1.1525 15.40 MB
2025 8,712 1.1546 1.1546 15.47 MB

Mean: 1.1538 | Variance: 0.0022 | Submitted: seed 42

Run command

ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 \
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

@jfprincz jfprincz force-pushed the submission/orthoinit-int6-mlp3x-smear-bigram-1.1524 branch from 7c44af1 to f8dbf22 Compare March 20, 2026 05:10
@jfprincz jfprincz force-pushed the submission/orthoinit-int6-mlp3x-smear-bigram-1.1524 branch from f8dbf22 to 37e6f39 Compare March 20, 2026 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant