Skip to content

Draft: SOTA+ TTT + RoPE50K + EMA + Curriculum (pending H100 run)#223

Draft
0xjaishy wants to merge 7 commits intoopenai:mainfrom
0xjaishy:submission/allinone-smeargate-int6qat-slidingwindow
Draft

Draft: SOTA+ TTT + RoPE50K + EMA + Curriculum (pending H100 run)#223
0xjaishy wants to merge 7 commits intoopenai:mainfrom
0xjaishy:submission/allinone-smeargate-int6qat-slidingwindow

Conversation

@0xjaishy
Copy link

@0xjaishy 0xjaishy commented Mar 20, 2026

SOTA+ submission: PR #198 base + 4 untried improvements

Target: sub-1.13 BPB (pending 8xH100 run)

Base: PR #198 Stack (current #1 at 1.1326 BPB)

  • 11L, 512d, MLP 3x, SmearGate + BigramHash + OrthoInit
  • Mixed int6/int8 quantization + zstd-22
  • WD=0.04, Muon (momentum 0.99), sliding window eval (s64)
  • FA3 with PyTorch SDPA fallback

New techniques (none tried on the #198 stack before)

  1. RoPE base 50K — smoother position interpolation at seq2048 (free, ~-0.002)
  2. LAWA-EMA — exponential moving average (decay=0.995) replaces periodic SWA (~-0.002)
  3. Context-length curriculum — seq1024 for first 60% of wallclock (60% more steps), then seq2048 (~-0.003)
  4. Full-model SGD TTT — 1 epoch SGD (lr=3e-4) on val data before scoring (~-0.001 to -0.033)

Architecture

  • 26.8M params, ~15.7MB artifact
  • All hyperparameters baked in — just torchrun --standalone --nproc_per_node=8 train_gpt.py

Expected outcome

Scenario BPB Note
Conservative ~1.125 TTT gain ~0.001 (overlaps SmearGate)
Moderate ~1.116 TTT gain ~0.010
Aggressive <1.10 TTT gain ~0.033 (full effect)

Status

  • Local CPU smoke test (syntax, forward pass, quant roundtrip)
  • 8xH100 SXM training run
  • 3-seed verification

shivashish jaishy added 4 commits March 21, 2026 00:33
… validation script

- records/track_10min_16mb/2026-03-20_AllInOne_SmearGate_Int6QAT_SlidingWindow/
- scripts/validate_submission.py (CPU checks, no CUDA)
- docs/WITHOUT_GRANT.md, docs/GRANT_APPLICATION.md

Made-with: Cursor
Rebuild from the proven openai#1 submission (PR openai#198, 1.1326 BPB) and stack
four untried improvements:

- RoPE base 50K (smoother position interpolation at seq2048)
- LAWA-EMA replacing periodic SWA (continuous exponential moving average)
- Context-length curriculum (seq1024 early for 60% more steps, seq2048 late)
- Full-model SGD test-time training (1 epoch, lr=3e-4, on val data)

Architecture: 11L 512d MLP3x SmearGate BigramHash OrthoInit WD=0.04
Artifact: ~15.7MB (int6+zstd-22), 26.8M params, FA3 with SDPA fallback
Pending 8xH100 run. Target: sub-1.13 BPB.

Made-with: Cursor
@0xjaishy 0xjaishy changed the title Draft: AllInOne SmearGate + Int6 QAT + Sliding Window (pending H100 run) Draft: SOTA+ TTT + RoPE50K + EMA + Curriculum (pending H100 run) Mar 20, 2026
shivashish jaishy added 3 commits March 21, 2026 01:53
Single map of GitHub vs Mac workspace; scripts are not part of the CUDA
submission artifact but back up local workflow.

Made-with: Cursor
…ission

- Document one clone only (parameter-golf-fork); data/.venv stay local gitignored
- README: sample_fineweb_tokens, Mac submission notes, prep checklist
- HANDOFF: remove duplicate Desktop workspace; point to this repo only

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant