
Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630 #65

Open

aquariouseworkman wants to merge 2 commits into openai:main from aquariouseworkman:main

Conversation

@aquariouseworkman aquariouseworkman commented Mar 19, 2026

## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces the quantization penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |

- Trained and evaluated on 8xH100 SXM (RunPod)
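The per-row quantization scheme above is easy to sketch. Below is a minimal NumPy illustration of symmetric per-row quantization at the stated level counts (31 levels for the int6 block weights, 127 for the int8 embedding); the function names are illustrative and not from the submission's train_gpt.py.

```python
import numpy as np

def quantize_per_row(w: np.ndarray, levels: int):
    """Symmetric per-row quantization: each row gets its own scale.

    With an odd `levels`, codes live on the integer grid -qmax..+qmax,
    where qmax = (levels - 1) // 2 (qmax = 15 for 31 levels, 63 for 127).
    """
    qmax = (levels - 1) // 2
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    codes = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)

q6, s6 = quantize_per_row(w, levels=31)   # int6-style grid (block weights)
q8, s8 = quantize_per_row(w, levels=127)  # int8-style grid (embedding)
```

Per-row scales keep an outlier row from degrading the whole tensor, which is what makes the coarse 31-level grid survivable for the STE-trained block weights.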

arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
…_bpb 1.1652

Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits

Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001).
Three seeds: 1.16615, 1.16532, 1.16412.
Artifact: 15.6MB (under 16,000,000 byte cap).
Training: 9370 steps at 64ms/step on 8xH100 SXM.

Made-with: Cursor
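The stride-64 sliding window eval referenced here and in the PR description can be planned with simple index arithmetic: each window covers seq_len tokens but only its not-yet-scored suffix contributes to the loss. A minimal sketch (the helper name is hypothetical, not from any of the referenced scripts):

```python
def sliding_windows(n_tokens: int, seq_len: int = 1024, stride: int = 64):
    """Plan overlapping eval windows as (start, end, score_from) triples.

    Window k covers tokens [start, end); only tokens at positions
    >= score_from are scored, so every token is scored exactly once and
    every scored token after the first window sees seq_len - stride
    tokens of left context.
    """
    windows, covered = [], 0
    start = 0
    while covered < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, covered))  # score tokens [covered, end)
        covered = end
        start += stride
    return windows
```

With seq_len=1024 and stride=64, every scored token past the first window sees 1024 - 64 = 960 tokens of left context, matching the 960 figure in the PR description.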
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
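If these settings are environment variables consumed by train_gpt.py, as the run command in the PR description suggests, applying them would look roughly like this; whether train_gpt.py actually reads every one of these names is an assumption, the values are taken from the list above:

```shell
# Assumed env-var overrides; names and values from the commit message above.
MUON_MOMENTUM=0.99 MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
WARMDOWN_ITERS=3000 MUON_WARMUP_START=0.92 MUON_WARMUP_STEPS=1500 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```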
@aquariouseworkman changed the title from "Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808" to "Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630" on Mar 19, 2026
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- fix the PR-audit notes to attribute the long-context branch to PR openai#65 rather than PR openai#61
- record PR openai#61 as schedule-side evidence about long warmdown reducing quantization damage
- keep the ideas backlog aligned with the actual GitHub PR content before using it for next-step decisions
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
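The ~25% saving over int8 follows directly from the bit widths (6/8 = 0.75), but realizing it on disk requires packing four 6-bit codes into three bytes. None of the referenced exporters are shown in this thread; the following is a hypothetical NumPy sketch, assuming offset-binary codes in 0..63:

```python
import numpy as np

def pack_int6(codes: np.ndarray) -> bytes:
    """Pack 6-bit codes (values 0..63) four-at-a-time into 3 bytes."""
    assert codes.size % 4 == 0, "pad to a multiple of 4 before packing"
    c = codes.reshape(-1, 4).astype(np.uint32)
    word = (c[:, 0] << 18) | (c[:, 1] << 12) | (c[:, 2] << 6) | c[:, 3]
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.tobytes()

def unpack_int6(buf: bytes, n: int) -> np.ndarray:
    b = np.frombuffer(buf, dtype=np.uint8).reshape(-1, 3).astype(np.uint32)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    cols = [(word >> 18) & 63, (word >> 12) & 63, (word >> 6) & 63, word & 63]
    return np.stack(cols, axis=1).reshape(-1)[:n].astype(np.uint8)
```

That is 24 bits per group of four weights instead of 32, exactly the 25% reduction cited, before any zstd pass sees the stream.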
arjun-krishna1 added a commit to arjun-krishna1/parameter-golf that referenced this pull request Mar 19, 2026
Add Straight-Through Estimator fake int6 quantization to CastedLinear
during training. Forward pass uses quantized weights (int6 per-row),
backward passes gradients through originals. Teaches weight distributions
that survive post-training int6 quantization.

Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.

Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225

Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001.
Artifact: 15.3MB (under 16,000,000 byte cap).

Made-with: Cursor
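The STE fake-quant described here is the standard straight-through trick: quantize in the forward pass, pass gradients through untouched in the backward pass. A minimal PyTorch sketch under the same per-row symmetric scheme; the class name is illustrative, not the submission's CastedLinear code:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator for per-row symmetric fake quantization."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, levels: int = 31) -> torch.Tensor:
        qmax = (levels - 1) // 2
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        # Forward sees the weights the post-training int6 export will produce.
        return (w / scale).round().clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Gradients skip the non-differentiable round/clamp entirely.
        return grad_out, None

w = torch.randn(4, 8, requires_grad=True)
loss = FakeQuantSTE.apply(w, 31).sum()
loss.backward()  # w.grad is all ones: identity backward
```

Training against the quantized forward is what shrinks the export penalty: the weights settle into values that survive the post-training int6 rounding.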
lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 19, 2026
Downloaded PR openai#65 SOTA train_gpt.py (1.1630 BPB). Added zstandard dep,
use_sota flag to toggle between baseline and SOTA scripts.
5-min baseline recorded: val_bpb=1.3738, post-quant=1.3766.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>