KURE/R2 + Tanh Reparam + Parallel EMA + LoRA TTT #4
Open
machdragon wants to merge 11 commits into main from
Conversation
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551). Adds four improvements targeting quantization fidelity and eval-time adaptation:
- KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights
- Tanh weight reparameterization bounding effective weights to [-1,1]
- Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection
- Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8xH100 launcher with all new env defaults: KURE_LAMBDA=0.01, R2_LAMBDA=0.01, TANH_REPARAM=1, LAWA_EMA_DECAY=0.999, TTT_LORA_ENABLED=1. CLI flags for key hyperparameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- LAWA/KURE: windreamer cu126_torch2100 + pytorch 2.10.0 cu12.6 devel base
- Drop FA hopper source build; add Dockerfile.modal-fa3 for GHCR reuse
- modal_fa3_image_smoke.py: quick H100 import check; RUNBOOK updates

Made-with: Cursor
- modal_train_volume_check: shared ensure before LAWA/KURE torchrun
- modal_fa3_image_smoke: mount parameter-golf-data, verify FA3 + SP load + shard count
- Dockerfile.modal-fa3: use /opt/conda/bin/python -m pip (PEP 668)
- RUNBOOK: required sync section + smoke/troubleshooting updates

Made-with: Cursor
- Continue after dataset put failure so tokenizer always uploads
- MODAL_SYNC_FORCE=1 for full dataset re-upload; tokenizer uses --force
- RUNBOOK: document re-run + MODAL_SYNC_FORCE

Made-with: Cursor
- Add modal_image_fa3_pytorch.py: /opt/conda/bin/python -m pip per Modal run_commands
- Smoke + LAWA + KURE use shared builder + add_local_python_source(volume_check)
- RUNBOOK: link Modal images guide + troubleshooting for externally-managed-environment

Made-with: Cursor
Summary
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551). Adds four improvements targeting quantization fidelity and eval-time adaptation.
Changes (1857 lines, +329 from base)
1. KURE + R2 regularization (lines 1081-1098, 1635)
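A minimal sketch of a kurtosis-plus-outlier weight regularizer in this spirit (the function name, target value, and exact loss form here are assumptions; the PR's `quant_reg_loss()` may differ in detail):

```python
import torch

def quant_reg_sketch(weights, kurt_target=1.8, eps=1e-8):
    """Sketch: penalize kurtosis away from a target plus mass beyond 2*std."""
    total = weights[0].new_zeros(())
    for w in weights:
        w = w.flatten().float()
        mu = w.mean()
        var = w.var(unbiased=False)
        # Kurtosis E[(w - mu)^4] / var^2; eps guards against var ~ 0.
        kurt = ((w - mu) ** 4).mean() / (var + eps) ** 2
        kure = (kurt - kurt_target) ** 2
        # Outlier penalty: squared excess beyond 2 standard deviations.
        r2 = torch.relu(w.abs() - 2.0 * var.sqrt()).pow(2).mean()
        total = total + kure + r2
    return total
```

Pushing kurtosis toward ~1.8 flattens the weight distribution toward uniform, which keeps more of the int6 quantization grid occupied; the outlier term shrinks the tails that would otherwise set the quantization scale.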
- `quant_reg_loss()` — kurtosis→1.8 penalty + 2×std outlier penalty
- Uses `base_model` (DDP-safe)
- `eps=1e-8` in kurtosis to prevent NaN when var≈0

2. Tanh reparameterization (lines 546-553)
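A hedged sketch of the tanh reparameterization, modeled on the description in this section (the class name and size gate are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TanhLinear(nn.Linear):
    """Sketch: store a raw parameter P but use tanh(P) as the effective
    weight, bounding it to (-1, 1) for quantization friendliness."""
    _tanh_reparam = True  # class flag, as described in the PR

    def forward(self, x):
        w = self.weight
        # Only reparameterize large 2D weights (>= 64x64 here).
        if self._tanh_reparam and w.ndim == 2 and min(w.shape) >= 64:
            w = torch.tanh(w)
        return F.linear(x, w, self.bias)
```

Because `.weight` is still an ordinary `nn.Parameter`, code that introspects the layer (init, optimizer grouping, weight tying) keeps working; only the forward pass sees the bounded weight.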
- `CastedLinear._tanh_reparam` class flag; forward applies `torch.tanh(w)` for 2D weights ≥ 64×64
- `.weight` kept as nn.Parameter (safe for all ~15 callers: `_init_weights`, optimizer setup, TTT adapter sizing, tied embeddings, etc.)
- Trains `tanh(P)` directly; materialized before export

3. Parallel EMA tracks (lines 1569-1582, 1660-1666, 1703-1728)
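The parallel-EMA idea can be sketched as follows (class name, API, and the proxy-selection signature are assumptions, not the PR's actual interface):

```python
import torch

class ParallelEMA:
    """Sketch: keep one detached EMA shadow of the parameters per decay
    rate, then pick the track with the best proxy-eval score."""
    def __init__(self, params, decays=(0.995, 0.999, 0.9995)):
        self.decays = decays
        self.tracks = [[p.detach().clone() for p in params] for _ in decays]

    @torch.no_grad()
    def update(self, params):
        for decay, shadows in zip(self.decays, self.tracks):
            for s, p in zip(shadows, params):
                s.mul_(decay).add_(p.detach(), alpha=1.0 - decay)

    def select(self, proxy_eval):
        # proxy_eval maps a shadow list to a scalar loss (lower is better).
        scores = [proxy_eval(shadows) for shadows in self.tracks]
        best = min(range(len(scores)), key=scores.__getitem__)
        return self.decays[best], self.tracks[best]
```

Running several decay rates in parallel costs only extra memory for the shadow copies; the proxy eval then chooses the smoothing strength per run instead of fixing it a priori.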
4. Causal LoRA TTT (lines 1107-1287, 1838-1849)
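A minimal sketch of a rank-8 LoRA delta of the kind applied to q/v projections (module name and init scale are assumptions; the PR's batched variants add a batch dimension over adapters):

```python
import torch
import torch.nn as nn

class LoRADelta(nn.Module):
    """Sketch: low-rank adapter producing a delta to add onto a frozen
    projection's output (e.g. q or v in attention)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, dim))  # zero-init: delta starts at 0

    def forward(self, x):
        # (..., dim) @ (dim, rank) @ (rank, dim) -> (..., dim)
        return (x @ self.A) @ self.B
```

For test-time training, only `A`/`B` would be optimized on the eval context while the base weights stay frozen; the zero-initialized `B` guarantees the adapted model starts exactly at the base model's behavior.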
- `BatchedLinearLoRA`, `BatchedTTTLoRA`, `eval_val_ttt_lora()`
- `GPT.forward()` (`lora=` kwarg), `Block.forward()` (q/v delta fns), `CausalSelfAttention.forward()` (q/v deltas)

Hyperparameters (env vars)
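Assuming the launcher defaults listed in the commits (KURE_LAMBDA=0.01, R2_LAMBDA=0.01, TANH_REPARAM=1, LAWA_EMA_DECAY=0.999, TTT_LORA_ENABLED=1), env-var parsing might look like the following; the `env_float` helper is hypothetical:

```python
import os

def env_float(name, default):
    """Read a float hyperparameter from the environment, with a default."""
    return float(os.environ.get(name, default))

KURE_LAMBDA = env_float("KURE_LAMBDA", 0.01)
R2_LAMBDA = env_float("R2_LAMBDA", 0.01)
LAWA_EMA_DECAY = env_float("LAWA_EMA_DECAY", 0.999)
TANH_REPARAM = os.environ.get("TANH_REPARAM", "1") == "1"
TTT_LORA_ENABLED = os.environ.get("TTT_LORA_ENABLED", "1") == "1"
```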
Verification
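The `ast.parse` syntax check from this list can be reproduced as a small helper (illustrative; the PR runs it as a one-liner):

```python
import ast
import pathlib

def syntax_ok(path):
    """Return True if the Python file at `path` parses without SyntaxError."""
    try:
        ast.parse(pathlib.Path(path).read_text())
        return True
    except SyntaxError:
        return False
```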
- `python3 -c "import ast; ast.parse(...)"` — syntax OK
- `quant_reg_loss`, `KURE_LAMBDA`, `R2_LAMBDA` — reg wired in
- `tanh_reparam`, `torch.tanh(w)` — tanh in CastedLinear.forward
- `lawa_decays`, `lawa_averaged` — parallel tracks + safe export
- `BatchedTTTLoRA`, `eval_val_ttt_lora` — TTT ported
- `1e-8` in kurtosis — eps present
- `base_model` in reg loss — DDP unwrap correct
- `.weight` still exists on CastedLinear (inherits nn.Linear)

Test plan
🤖 Generated with Claude Code