
Record: Int6 + 3x MLP + sliding window (val_bpb=1.1708) + 9 ablations#212

Open

mrdavtan wants to merge 2 commits into openai:main from mrdavtan:int6-3xMLP-pr

Conversation

@mrdavtan

Summary

val_bpb = 1.1708 — independent int6 implementation with 3x MLP expansion, currently #2 on the merged leaderboard.

  • Int6 per-row quantization ([-31,31]) + zstd-22 compression
  • 3x MLP expansion (hidden=1536) — 21.8M params in 15.2MB artifact
  • FP16 tied embedding, WD=20000, tuned LRs, sliding window eval stride=64
  • 8×H100 SXM, 12,507 steps at 48ms/step
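A minimal sketch of the per-row int6 scheme from the first bullet (symmetric range [-31, 31], one scale per weight row). Function names and the round-to-nearest choice are illustrative, not the PR's actual code, and the zstd-22 compression step over the packed codes is omitted:

```python
def quantize_row(row, qmax=31):
    """Symmetric per-row quantization of float weights to int6 codes in [-qmax, qmax]."""
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero scale on all-zero rows
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int6 codes and the row scale."""
    return [v * scale for v in q]

row = [0.12, -0.31, 0.05, 0.27]
q, scale = quantize_row(row)         # q = [12, -31, 5, 27]
restored = dequantize_row(q, scale)  # each entry within scale/2 of the original
```

A per-row scale keeps the quantization error proportional to each row's own magnitude, which is what lets 21.8M parameters fit the 15.2MB artifact after compression.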

What's different about this submission

This isn't just a score entry. It's accompanied by 9 controlled ablations testing techniques that top entries use but never isolate — all on the same hardware, same seed, one variable at a time.

Ablation results (6 negative findings)

| Technique | val_bpb | vs Control (1.1929) | Verdict |
|---|---|---|---|
| SWA | 1.1933 | +0.0004 | No effect at WD=1200 |
| Doc-isolated eval | 1.2015 | +0.0086 | Hurts at stride=64 (contradicts LoRA TTT) |
| Curriculum learning | 1.1942 | +0.0013 | No effect |
| Multi-token prediction | 1.1947 | +0.0018 | No effect |
| Int6 + 3x MLP | 1.1708 | -0.0221 | Best result |
| + SmearGate + BigramHash | 1.1739 | -0.019 | Hurts on top of int6 |
| Depth recurrence + Huginn | 4.34-5.58 | | Catastrophic at 7.6M scale |
| Int8 QAT (PR #145) | 1.2052 | +0.012 | Overhead exceeds recovery |

Key findings for the community

  1. Doc-isolated eval hurts at stride=64 (+0.0086) — this contradicts the LoRA TTT entry's +0.011 gain at stride=256, so a crossover must exist between stride 64 and 256.
  2. SmearGate + BigramHash don't help out of the box with int6 — they may require a specific init or an interaction with OrthoInit.
  3. Huginn eval-time scaling fails at small scale — both U-Net skips and flat loops were tested; 3 shared blocks at 7.6M params can't learn iterative refinement.
  4. SWA bf16 accumulation bug — keeping the running weight average in bf16 over thousands of steps causes catastrophic precision loss.
  5. torch.compile graph priming pitfall — pre-compiling conditional code paths causes a 50% slowdown.

See README for full analysis.
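A hedged sketch of the stride-based sliding-window evaluation referenced in the summary and in finding 1 (the record uses stride=64). Each window scores only its newest `stride` tokens, so every scored token keeps at least `ctx - stride` tokens of context; `nll_fn` is a stand-in for the model's per-token negative log-likelihood, not the PR's actual interface:

```python
import math

def sliding_window_bits(tokens, nll_fn, ctx=256, stride=64):
    """Average bits per token under sliding-window evaluation.

    nll_fn(window, i) -> negative log-likelihood (nats) of window[i] given
    window[:i]. Only the last `stride` tokens of each window are scored.
    """
    total_nll, scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        window = tokens[max(0, end - ctx):end]
        new = end - start  # tokens not already scored by an earlier window
        total_nll += sum(nll_fn(window, i) for i in range(len(window) - new, len(window)))
        scored += new
    return total_nll / scored / math.log(2)  # nats -> bits (bpb when tokens are bytes)

# Sanity check: a uniform model over 2 symbols scores exactly 1 bit per token.
uniform_nll = lambda window, i: math.log(2)
bits = sliding_window_bits(list(range(100)), uniform_nll, ctx=16, stride=4)  # -> 1.0
```

A smaller stride gives every token more context at the cost of more forward passes, which is the trade-off behind the stride-64 vs stride-256 crossover in finding 1.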

Test plan

  • Artifact under 16,000,000 bytes (15,175,136)
  • Training completes within 600s (599.98s)
  • Eval completes within 600s (80s)
  • Training log included
  • Additional seeds for statistical validation (pending compute credits)

Built with Claude Code

Independent int6 implementation with 3x MLP expansion, FP16 embed,
WD20k, sliding window eval. 21.8M params in 15.2MB artifact.
Accompanied by 9 controlled ablations with 6 negative findings.
@mrdavtan
Author

Update: 5-seed statistical validation added

| Seed | val_bpb |
|---|---|
| 31337 | 1.1703 |
| 1337 | 1.1708 |
| 2024 | 1.1712 |
| 42 | 1.1732 |
| 7 | 1.1767 |
| **Mean** | 1.1724 |
| **Std** | 0.0026 |

Gap vs baseline: 0.036 nats (threshold: 0.005) | t-stat: 44.2 | p < 0.01

All 5 runs on 8×H100 SXM (RunPod Parameter Golf template), PyTorch 2.9.1+cu128, same config, only seed varied. README and submission.json updated with full results.
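The mean and spread of the five seed runs can be re-derived from the table above with the stdlib `statistics` module (a verification sketch, not the PR's code):

```python
import math
from statistics import mean, stdev

# Seed -> val_bpb, from the 5-seed table above.
seed_bpb = {31337: 1.1703, 1337: 1.1708, 2024: 1.1712, 42: 1.1732, 7: 1.1767}

vals = list(seed_bpb.values())
m = mean(vals)                 # ~1.1724
s = stdev(vals)                # sample standard deviation, ~0.0027
se = s / math.sqrt(len(vals))  # standard error of the mean
# A one-sample t-statistic against any baseline mean b would be (b - m) / se.
```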
