ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632 #66
Open
arjun-krishna1 wants to merge 20 commits into openai:main
Conversation
Stack four proven techniques identified via systematic PR analysis:
- TRAIN_SEQ_LEN=4096 for richer per-step training signal
- Optimizer tuning: Muon momentum 0.99, LRs halved, warmdown 3000
- fp16 tied embedding export (MLP_HIDDEN=992 to stay under 16MB)
- Sliding window eval at stride=64 with 4096-token context windows

Beats naive baseline (1.2244) by 0.041 BPB and all public PRs.
Training: 9919 steps at 60ms/step on 8xH100 SXM.
Eval: 278s sliding window (within separate 10-min eval budget).

Made-with: Cursor
jordankzf added a commit to jordankzf/parameter-golf that referenced this pull request on Mar 19, 2026
- seq_len=4096 (4x context, biggest single BPB win)
- Muon momentum 0.99, lower LRs (0.02/0.02/0.03)
- Batch 393K (more steps/min), warmdown 3000
- fp16 tied embedding export (halves quant penalty)
- Defaults to SP-1024 (data exists on HuggingFace, no tokenizer training)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three seeds all clear the 1.2194 threshold (SOTA - 0.005):
- SEED=1337: val_bpb=1.18335372
- SEED=1338: val_bpb=1.18437368
- SEED=1339: val_bpb=1.18481782

Mean=1.18418174, std=0.00075068, t=81.26 (df=2), p<<0.001.

Made-with: Cursor
- Run command now references full records folder path so it runs correctly from the repo root as reviewers expect
- Root train_gpt.py reverted to openai/parameter-golf main so PR only adds the records folder, as required by challenge rules

Made-with: Cursor
Skill now lives at parameter-golf-autoresearch/SKILL.md inside the records submission folder, following the agentskills.io standard (folder name matches the name field, proper frontmatter with metadata). Removed the .cursor/skills/ copy so the PR only touches the records folder.

Made-with: Cursor
…nfig Made-with: Cursor
…_bpb 1.1652

Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits

Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001). Three seeds: 1.16615, 1.16532, 1.16412.
Artifact: 15.6MB (under 16,000,000 byte cap).
Training: 9370 steps at 64ms/step on 8xH100 SXM.

Made-with: Cursor
Author: Time to sleep 😭
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request on Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings. We were running defaults — now matching the winners:
- MUON_MOMENTUM: 0.95 → 0.99 (stronger smoothing)
- MATRIX_LR: 0.04 → 0.02 (halved, reduces quant gap)
- SCALAR_LR: 0.04 → 0.02 (halved)
- TIED_EMBED_LR: 0.05 → 0.03 (halved)
- WARMDOWN_ITERS: 1200 → 3000 (longer warmdown)
- MUON_WARMUP_START: 0.85 → 0.92 (higher start)
- MUON_WARMUP_STEPS: 500 → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652), openai#70 (1.1659), openai#65 (1.1808) — all top submissions. Applied to both v5 and v6. Both compile, 1498 lines each.
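The deltas above can be written down as a config diff. This is a sketch only: the names follow the commit message, and the actual flags in the repo's `train_gpt.py` may be spelled differently.

```python
# Hypothetical config diff mirroring the commit message above.
# Variable names are assumptions, not necessarily the repo's real flags.
DEFAULTS = {
    "MUON_MOMENTUM": 0.95, "MATRIX_LR": 0.04, "SCALAR_LR": 0.04,
    "TIED_EMBED_LR": 0.05, "WARMDOWN_ITERS": 1200,
    "MUON_WARMUP_START": 0.85, "MUON_WARMUP_STEPS": 500,
}
TUNED = {
    "MUON_MOMENTUM": 0.99, "MATRIX_LR": 0.02, "SCALAR_LR": 0.02,
    "TIED_EMBED_LR": 0.03, "WARMDOWN_ITERS": 3000,
    "MUON_WARMUP_START": 0.92, "MUON_WARMUP_STEPS": 1500,
}
# Every key changes; collect the (old, new) pairs for logging.
diff = {k: (DEFAULTS[k], TUNED[k]) for k in DEFAULTS if DEFAULTS[k] != TUNED[k]}
```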
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request on Mar 19, 2026
openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closed
Contributor: Awesome!
Author: thanks man! @chonchiog feel free to fork and build off of it
Contributor: for sure! @arjun-krishna1 I'm waiting for the runpod credits :)
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request on Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):
1. Sliding window eval (stride=256): overlapping windows give each token more context. Free ~0.03 bpb improvement, zero artifact cost. Based on PRs openai#70, openai#77, openai#65.
2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8, allowing bigger models. Based on PRs openai#78, openai#70.
3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives ~0.019 bpb improvement. Based on PRs openai#70, openai#66.
4. Default dim=512 with LR=0.03 (best config from experiments).
5. forward_logits() helper for sliding window (avoids model.forward, which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
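The stride-based scoring in item 1 can be sketched with plain index arithmetic. This is a minimal interpretation of the descriptions in this thread ("score only the last `stride` tokens of each overlapping window"), not the repo's actual eval code; the window/stride defaults are assumptions.

```python
# Sketch of sliding-window evaluation indexing (assumed scheme).
def sliding_windows(n_tokens, window=4096, stride=256):
    """Yield (start, score_from) pairs. The window covers tokens
    [start, start + window); only tokens at index >= score_from are
    scored, so every token is scored exactly once. The first window
    scores everything; later windows score only their last `stride`
    tokens, which all see at least (window - stride) tokens of context."""
    start = 0
    while start + window <= n_tokens:
        score_from = 0 if start == 0 else start + window - stride
        yield start, score_from
        start += stride
```

At stride=64 with a 4096-token window, the minimum context for a scored token is 4096 - 64 = 4032, which matches the "up to 4032 tokens of context" figure quoted in the PR description.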
Add Straight-Through Estimator fake int6 quantization to CastedLinear during training. Forward pass uses quantized weights (int6 per-row); backward passes gradients through to the originals. Teaches weight distributions that survive post-training int6 quantization. Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.

Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225

Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001.
Artifact: 15.3MB (under 16,000,000 byte cap).

Made-with: Cursor
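A minimal sketch of the fake-quant forward described in that commit. Symmetric per-row scaling to 6-bit levels in [-31, 31] is an assumption; the submission's actual CastedLinear code may use a different scheme.

```python
import numpy as np

def fake_quant_int6(w):
    """Per-row symmetric fake int6 quantization: snap each weight to
    one of 63 levels, then immediately dequantize. The returned array
    is what the forward matmul would use."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale

def ste_grad(grad_wrt_quantized):
    """Straight-Through Estimator: round() is treated as identity in
    the backward pass, so the gradient w.r.t. the original weights is
    the gradient w.r.t. the quantized weights, unchanged."""
    return grad_wrt_quantized
```

In a real training loop the fake-quantized weights replace the originals only in the forward computation; the optimizer still updates the full-precision master weights, which is what lets the weight distribution adapt to survive post-training quantization.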
rsavitt pushed a commit to rsavitt/parameter-golf that referenced this pull request on Mar 19, 2026
Based on PR openai#66 (ArjunAutoResearch) composition of top techniques:
- Int6 per-row quantization + zstd-22 (~4MB savings vs int8+zlib)
- MLP 3x expansion (hidden=1536) enabled by int6 budget savings
- STE fake int6 QAT in CastedLinear (trains weights to survive quantization)
- Sliding window eval (stride=64, seq_len=4096)
- Tuned optimizer: matrix_lr=0.02, muon_momentum=0.99, warmdown=3000
- fp16 tied embedding passthrough (no embedding quant penalty)
- Seq len 4096, batch tokens 393K

Expected: ~1.163 BPB on 8xH100 (vs baseline 1.2244)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hey there! This was super fun, thanks. I took the approach of building an AutoResearch agent harness that could work toward solving this autonomously. It's added as an agent skill in my submission for others to build off of!
Built an auto-research pipeline:
ArjunAutoResearch ended up with a final val_bpb of 1.16323 +/- 0.00042 (mean across 3 seeds, p << 0.001).
The artifact size is 15,265,243 bytes (under the 16,000,000-byte cap).
With more compute (I will apply for more), I would scale this AutoResearch agent by composing approaches from the Medium and Low buckets, by having it come up with strategies of its own, and by researching approaches from the internet that nobody has made pull requests for yet.
The approach ArjunAutoResearch came up with composed the following techniques from these pull requests:
Wider MLP (MLP_MULT=3.0, hidden=1536) (from Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659 #70)
3x MLP expansion enabled by int6 quantization saving ~4MB. 1536 is 64-aligned for optimal H100 matmul tile utilization.
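The width arithmetic behind that item, as a quick check. The base dim of 512 is taken from the sibling PRs quoted in this thread and is an assumption about this submission's config.

```python
# Hidden-width arithmetic for the 3x MLP expansion (dim=512 assumed).
dim = 512
MLP_MULT = 3.0
hidden = int(dim * MLP_MULT)  # 1536
# 64-alignment keeps H100 matmul tiles fully utilized.
aligned = hidden % 64 == 0
```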
Longer training context (TRAIN_SEQ_LEN=4096) (from Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556 #65)
4x more context per sequence than the baseline's 1024, significantly improving convergence quality per step.
Optimizer tuning (from New SOTA attempt (val_bpb=1.2014) #52)
MUON_MOMENTUM=0.99, learning rates halved, batch 393K, warmdown 3000 steps

STE fake int6 quantization-aware training (from Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556 #65)
During training, all CastedLinear weights get fake int6 quantization via Straight-Through Estimator: forward uses quantized weights, backward passes gradients through the originals. Teaches weight distributions that survive int6 post-training quantization.

int6 per-row quantization on MLP+attention (from Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659 #70)
Mixed precision: int6 on 2D block weights, fp16 passthrough on tied embedding, zstd-22 compression.
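A back-of-envelope byte count for that mixed-precision scheme. Bit-packed 6-bit payloads with one fp16 scale per row is an assumption about the storage format, the shapes are illustrative, and the zstd-22 compression on top is not modeled.

```python
# Rough artifact accounting for int6 per-row quant vs fp16 storage.
def int6_bytes(rows, cols):
    payload = rows * cols * 6 // 8  # 6 bits per weight, bit-packed
    scales = rows * 2               # one fp16 scale per row
    return payload + scales

def int8_bytes(rows, cols):
    return rows * cols + rows * 2   # 8 bits per weight + per-row scales

def fp16_bytes(rows, cols):
    return rows * cols * 2

rows, cols = 1536, 512  # e.g. one 3x-MLP matrix at dim=512 (illustrative)
saving_vs_fp16 = fp16_bytes(rows, cols) - int6_bytes(rows, cols)
```

Note `int6_bytes / int8_bytes` comes out around 0.75 here, consistent with the "~25% artifact space vs int8" claim made in a referencing commit above.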
fp16 tied embedding passthrough (from fp16 tied embedding + warmdown/LR tuning (val_bpb 1.2197) #42)
The tied embedding doubles as the output head. Keeping it in fp16 eliminates embedding quantization penalty entirely.
Sliding window evaluation (EVAL_STRIDE=64, TRAIN_SEQ_LEN=4096) (from Record: Sliding Window Eval (stride=64), val_bpb=1.1925 #50, extended to seq_len=4096)
Each token is scored with up to 4032 tokens of context. Compiled forward_logits for fast eval.

Results
Mean: 1.16323, std: 0.00042. One-sample t-test against threshold 1.2194: t = 230.34 (df=2). p << 0.001.
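The reported statistics can be reproduced directly from the three per-seed values quoted in this PR. This is a quick sanity check, not the repo's eval script.

```python
import math

# One-sample t-test of the three seed results against the threshold.
seeds_bpb = [1.16356083, 1.16275343, 1.16337225]
threshold = 1.2194  # acceptance threshold (SOTA - 0.005)

n = len(seeds_bpb)
mean = sum(seeds_bpb) / n
var = sum((x - mean) ** 2 for x in seeds_bpb) / (n - 1)  # sample variance
std = math.sqrt(var)
t = (threshold - mean) / (std / math.sqrt(n))  # df = n - 1 = 2
print(round(mean, 5), round(std, 5), round(t, 2))  # 1.16323 0.00042 230.34
```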
Key numbers (seed 1337)