
Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659#70

Open
jfprincz wants to merge 1 commit into openai:main from jfprincz:submission-jfprincz-1.1666

Conversation


@jfprincz jfprincz commented Mar 19, 2026

Submission: Wider MLP 3x + int6 Quantization + Sliding Window Eval

val_bpb: 1.1659 | Total size: 14,855,508 bytes (under 16MB)

Three orthogonal improvements over the naive baseline:

  1. Wider MLP (MLP_MULT=3.0) - expansion raised from 2x to 3x (hidden=1536), ~0.019 BPB improvement
  2. int6 per-row quantization on MLP + attention weights - saves ~4MB of artifact space for only +0.010 BPB degradation; artifact compressed with zstd level 22
  3. Sliding window eval (stride=256) - overlapping windows scored with batched forward_logits, ~0.033 BPB improvement at zero artifact cost
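As a rough illustration of the quantization step, a symmetric per-row int6 scheme can be sketched as below. The helper names are hypothetical; the submission's actual exporter may pack and store weights differently.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization: each row gets its own scale.

    Values map to signed integers in [-31, 31] (the symmetric 6-bit
    range), so each row's max-magnitude weight determines its scale.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = max_abs / 31.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int6_per_row(q: np.ndarray, scale: np.ndarray):
    """Reconstruct approximate fp32 weights from int6 codes + scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

Per-row scales keep the rounding error proportional to each row's own magnitude, which is what bounds the quality loss to a small BPB degradation rather than a blow-up on outlier rows.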

Run command

RUN_ID=official_v1_reach MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 TRAIN_LOG_EVERY=200 MATRIX_LR=0.020 SCALAR_LR=0.020 TIED_EMBED_LR=0.030 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_MOMENTUM_WARMUP_START=0.92 WARMDOWN_ITERS=3000 torchrun --standalone --nproc_per_node=8 train_gpt.py

Key metrics

Metric                 Value
Steps (10 min cap)     12,485
int6 sliding val_bpb   1.1659
Artifact size          14,855,508 bytes
Two seeds              1.16658, 1.16591 (submitted: 1338)

See README.md in the submission folder for full details.

@jfprincz jfprincz force-pushed the submission-jfprincz-1.1666 branch from 651ec21 to 2790ac0 on March 19, 2026 at 08:57
@jfprincz jfprincz changed the title from "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1666" to "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659" on Mar 19, 2026
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request Mar 19, 2026
Sliding window eval gives ~0.03 BPB improvement for free (proven by 5+ competitors):
- stride=64 with seq_len=1024 → every token scored with 960+ context
- forward_per_token_loss() method for per-token scoring
- Only counts last `stride` positions per window (full context)
- EVAL_STRIDE env var (0 = disable, default 64)

MLP 3x gives ~0.02 BPB (proven by jfprincz, PR openai#70):
- Hidden dim 1536 instead of 1024
- Needs INT6 middle layers to fit in 16MB (already implemented)
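To see why the wider MLP needs int6 to fit the 16MB cap, the extra parameter cost can be estimated. This assumes a plain two-matrix MLP at model dim 512 with no gating; the actual architecture may differ.

```python
def mlp_params(dim: int, mult: float) -> int:
    """Parameter count of a plain two-matrix MLP block:
    up-projection (dim x hidden) plus down-projection (hidden x dim)."""
    hidden = int(dim * mult)
    return 2 * dim * hidden

base = mlp_params(512, 2.0)  # hidden=1024 -> 1,048,576 params per block
wide = mlp_params(512, 3.0)  # hidden=1536 -> 1,572,864 params per block
extra_bytes_fp16 = (wide - base) * 2  # ~1 MiB more per block at fp16
```

At fp16 each block grows by about 1 MiB, which quickly exhausts the budget across layers; dropping the MLP matrices from 16 to 6 bits roughly recovers that growth.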

Updated INTEL.md with latest competitive landscape (28→70+ PRs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
…_bpb 1.1652

Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits

Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001).
Three seeds: 1.16615, 1.16532, 1.16412.
Artifact: 15.6MB (under 16,000,000 byte cap).
Training: 9370 steps at 64ms/step on 8xH100 SXM.

Made-with: Cursor
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)
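The momentum warmup in that table reads as a ramp from 0.92 to 0.99 over 1500 steps. A minimal sketch, assuming a linear schedule (the actual train_gpt.py schedule may differ):

```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Linearly ramp Muon momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

Starting momentum lower and ramping it avoids over-smoothed updates early in training, when gradients change direction quickly.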

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
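The distinction in item 5 matters because per-token scoring needs raw logits rather than a pre-reduced scalar loss. A minimal numpy sketch of turning logits into per-position losses (illustrative only; the repo's actual forward_logits() output shape is assumed):

```python
import numpy as np

def per_token_loss_from_logits(logits: np.ndarray, targets: np.ndarray):
    """Cross-entropy in nats for each position, given next-token logits
    of shape (seq, vocab) and integer targets of shape (seq,)."""
    # Numerically stable log-softmax: subtract the per-row max first.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]
```

With per-position losses in hand, the sliding-window evaluator can keep only the well-contexted tail of each window instead of averaging over everything.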