[WIP] SubSixteen: Ternary QAT + Depth Recurrence + TTT (val_bpb pending)#69
Draft
TevBenji wants to merge 10 commits into openai:main from
Conversation
…ests
- Implement BitLinear class with ternary {-1, 0, +1} quantization-aware training
- Use straight-through estimator for gradient flow during backward pass
- Add AbsMedian quantization strategy with per-output-channel scaling
- Implement L1 regularization helper to encourage ternary sparsity
- Add comprehensive property-based tests using Hypothesis framework
- Validate ternary quantization roundtrip, STE gradient flow, L1 proportionality
- Verify drop-in compatibility with CastedLinear baseline
- Test numerical stability and edge cases across random dimensions
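The quantizer these tests target can be pictured with a minimal pure-Python sketch. It assumes "AbsMedian" means scaling each output channel by its median absolute weight before round-and-clip; `ternary_quantize` and `l1_penalty` are illustrative names, not the PR's actual API:

```python
from statistics import median

def ternary_quantize(row, eps=1e-8):
    # AbsMedian ternary quantization for one output channel: scale by
    # the median absolute weight, then round each entry to the nearest
    # value in {-1, 0, +1}.
    scale = median(abs(w) for w in row) + eps
    q = [max(-1, min(1, round(w / scale))) for w in row]
    return q, scale

def l1_penalty(row):
    # L1 regularization helper: pushes weights toward zero, which
    # increases ternary sparsity after quantization.
    return sum(abs(w) for w in row)

row = [0.9, -0.05, 0.4, -1.2, 0.02, 0.6]
q, scale = ternary_quantize(row)
print(q)                # [1, 0, 1, -1, 0, 1]
print(round(scale, 2))  # 0.5
# L1 proportionality: scaling the weights scales the penalty linearly.
print(l1_penalty([2 * w for w in row]) == 2 * l1_penalty(row))  # True
```

The straight-through estimator itself is the usual `w + (quantize(w) - w).detach()` trick in the PyTorch code, so the backward pass sees an identity gradient while the forward pass sees ternary weights.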
…ockGroup
- Add comprehensive property-based tests using Hypothesis for depth recurrence validation
- Test recurrent depth multiplication (Property 8): verify each shared block executes exactly M times
- Test per-loop differentiation (Property 9): validate M distinct LayerNorm instances and loop signal slices
- Test input injection x0 influence (Property 10): ensure different x0 tensors produce different outputs
- Test progressive loss generation (Property 11): verify auxiliary losses at designated loops
- Update train_gpt.py to expose RecurrentBlockGroup for testing
- Validates Requirements 3.2, 3.3, 3.4, 3.5, and 3.6 for recurrent architecture
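The execution-count property (Property 8) is easy to picture with a toy scalar version of the recurrent loop. This is illustrative only, assuming input injection means adding x0 back at the start of each loop:

```python
def run_recurrent(x0, blocks, num_loops):
    # Depth recurrence sketch: the same shared blocks execute
    # num_loops times, with the original input x0 re-injected at the
    # start of each loop so early information is never lost.
    h = x0
    calls = []
    for _ in range(num_loops):
        h = h + x0                      # input injection
        for i, block in enumerate(blocks):
            h = block(h)
            calls.append(i)
    return h, calls

blocks = [lambda h: 0.5 * h, lambda h: h + 1.0]
_, calls = run_recurrent(1.0, blocks, num_loops=3)
# Property 8: each shared block ran exactly num_loops times.
print([calls.count(i) for i in range(len(blocks))])  # [3, 3]
```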
- Add .hypothesis/ to ignore property-based test artifacts
- Add .pytest_cache/ to ignore pytest cache files
- Prevents test-generated files from being tracked in version control
…mpression
- Add test_compression_properties.py with Property 14 (ternary packing efficiency) and Property 15 (compression pipeline round-trip)
- Add test_hyperparams_properties.py with Property 12 (MLP hidden dimension validation) and Property 13 (model width expansion)
- Validate ternary quantization achieves <=1.7 bits per weight with >=50% sparsity
- Verify compression pipeline round-trip preserves model outputs within floating-point tolerance
- Ensure MLP hidden dimensions scale correctly with model_dim and mlp_mult parameters
- Update train_gpt.py to support property-based test imports and validation
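One way to hit the <=1.7 bits/weight bound that Property 14 checks is base-3 packing, five ternary weights per byte. This is a hedged sketch of the idea, not necessarily the PR's packer:

```python
def pack_ternary(weights):
    # Five ternary digits per byte: 3**5 = 243 <= 256, so each byte
    # stores five weights at 8/5 = 1.6 bits per weight.
    out = bytearray()
    for i in range(0, len(weights), 5):
        code = 0
        for w in reversed(weights[i:i + 5]):
            code = code * 3 + (w + 1)   # map {-1, 0, +1} -> {0, 1, 2}
        out.append(code)
    return bytes(out)

def unpack_ternary(packed, n):
    # Inverse of pack_ternary: peel base-3 digits back off each byte.
    vals = []
    for code in packed:
        for _ in range(5):
            vals.append(code % 3 - 1)
            code //= 3
    return vals[:n]

w = [-1, 0, 1, 1, -1, 0, 0, 1, -1, 1]
packed = pack_ternary(w)
print(unpack_ternary(packed, len(w)) == w)   # True: round-trip preserved
print(8 * len(packed) / len(w))              # 1.6 bits per weight
```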
…s and QAT schedule
- Add test_optimizer_properties.py with Property 19 validating optimizer group assignment correctness for Muon (2D matrix params) and Adam (scalar params)
- Add test_qat_schedule.py with Property 6 validating QAT state consistency before and after switchover
- Update train_gpt.py to support QAT activation and learning rate schedule validation
- Ensure all 2D BitLinear/CastedLinear weights from blocks are assigned to Muon optimizer group
- Verify embedding, loop signals, and control tensor parameters are assigned to Adam optimizer group
- Validate learning rate schedule shape maintains values in [0, 1] with proper warmup and cooldown phases
- Validates Requirements 2.1, 2.2, 7.1, 7.2, and 7.3
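The assignment rule Property 19 checks can be sketched as follows. The `blocks.` prefix test and the parameter names are assumptions for illustration, not the PR's exact predicate:

```python
def assign_optimizer_groups(named_shapes):
    # Assumed split: 2-D weight matrices inside transformer blocks go
    # to Muon; embeddings, loop signals, and scalar/1-D params go to
    # Adam.
    muon, adam = [], []
    for name, shape in named_shapes:
        if len(shape) == 2 and name.startswith("blocks."):
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam

params = [
    ("blocks.0.attn.qkv.weight", (384, 128)),  # BitLinear matrix -> Muon
    ("blocks.0.norm.weight", (128,)),          # LayerNorm gain -> Adam
    ("embed.weight", (50257, 128)),            # embedding -> Adam
    ("loop_signals", (10, 48)),                # low-rank signals -> Adam
]
muon, adam = assign_optimizer_groups(params)
print(muon)   # ['blocks.0.attn.qkv.weight']
```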
…ss slow health checks
- Add comprehensive property-based tests for Test-Time Training (TTTModule) covering prefix alignment and selective parameter adaptation
- Create test_ttt_properties.py with 7 property tests validating TTT requirements including prefix fraction alignment, selective layer adaptation, and gradient flow
- Suppress HealthCheck.too_slow in bitlinear property tests to prevent flaky test failures on slower systems
- Update train_gpt.py with TTTModule implementation supporting configurable adaptation layers and prefix fractions
- Add helper functions for building minimal GPT models and generating test tokens for efficient CPU-based testing
- Add test_integration_properties.py with property-based tests for tied embeddings, seed determinism, and BPB validity
- Property 23: Verify tied embedding weight sharing (lm_head=None when tie_embeddings=True)
- Property 24: Validate seed determinism produces bitwise-identical parameters across runs
- Property 22: Ensure BPB computation validity for positive finite loss and token/byte counts
- Suppress slow health checks for hypothesis tests to improve CI performance
- Inline BitLinear implementation in train_gpt.py to reduce external dependencies
- Update train_gpt.py module docstring to clarify 1500-line constraint
- Add json import for potential metrics serialization
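Property 22's validity checks presume a BPB formula along these lines. The formula is an assumption for illustration (total cross-entropy in nats over all tokens, converted to bits, divided by byte count), not copied from the PR:

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    # BPB: mean per-token loss (nats) times token count gives total
    # nats; divide by ln(2) for bits, then by the raw byte count.
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# Sanity check: a loss of ln(2) nats/token over equal token and byte
# counts is exactly 1 bit per byte.
print(bits_per_byte(math.log(2), 100, 100))
```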
…ignore
- Add comprehensive README documenting SubSixteen approach combining ternary QAT, depth recurrence, and test-time training
- Include detailed parameter budget breakdown showing 20.7M stored params with 55M effective capacity
- Document training schedule, configuration, and command for 8×H100 reproducibility
- Add validation summary covering 28 property-based tests across all techniques
- Update .gitignore to exclude .kiro/ directory for IDE artifacts
- Remove .kiro/ entry from gitignore
- Preserve logs/ directory in ignore list
- Simplify gitignore configuration for project cleanup
- Add RunPod 8xH100 setup and training script with full hyperparameter configuration
- Add submission.json with SubSixteen model metadata and architecture description
- Add placeholder training log for 8xH100 run results
- Add complete train_gpt.py implementation featuring BitLinear ternary quantization, depth recurrence with shared blocks, test-time training, and progressive loss scheduling
- Combines three orthogonal techniques (ternary QAT, recurrent depth, TTT) targeting sub-1.10 BPB performance
phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request on Mar 19, 2026
openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Stacks three orthogonal techniques not previously combined in a single submission:
Ternary QAT — BitLinear layers with {-1, 0, +1} weights via AbsMedian quantization + STE gradients. ~1.5 bits/weight vs 8 for int8, enabling 4-5× more parameters per byte. L1 regularization drives zero-heavy ternary distributions for better zlib ratios.
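The params-per-byte claim follows from simple storage arithmetic. At the 1.6 bits/weight that dense base-3 packing achieves, the ratio versus int8 is 5x before zlib; the "~1.5 bits" and "4-5x" figures in the text presumably reflect the additional zlib gains and overheads. Illustrative arithmetic, not measured numbers:

```python
def params_per_mib(bits_per_weight):
    # How many parameters fit in 1 MiB at a given storage cost.
    return 8 * 1024 * 1024 / bits_per_weight

ternary = params_per_mib(1.6)   # five packed ternary weights per byte
int8 = params_per_mib(8.0)      # one int8 weight per byte
print(ternary / int8)           # ~5x more parameters per byte
```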
Depth recurrence — Prelude(1) + RecurrentBlockGroup(3 shared blocks × 10 loops) + Coda(1) = 32 effective transformer layers from 5 stored blocks. Per-loop LayerNorms + low-rank signals (rank 48) differentiate iterations. Progressive loss at loops 4 and 7 for gradient flow.
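The stored-versus-effective depth arithmetic, using the configuration numbers above:

```python
def depth_counts(prelude, shared, loops, coda):
    # Stored layers are counted once; the shared group contributes
    # once per loop to effective depth.
    stored = prelude + shared + coda
    effective = prelude + shared * loops + coda
    return stored, effective

# Prelude(1) + 3 shared blocks x 10 loops + Coda(1)
print(depth_counts(1, 3, 10, 1))   # (5, 32)
```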
Test-time training — Per-document one-step SGD on the last 2 layers during eval, adapting on the first 30% of each document (the prefix). No training data is accessed during evaluation.
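A toy sketch of the eval-time protocol: adapt only the listed parameters on the document prefix, then restore them before the next document. The scalar pull-toward-prefix-mean objective is invented for illustration and is not the PR's TTTModule:

```python
def adapt_on_prefix(weights, doc, adapt_keys, prefix_frac=0.3, lr=0.1):
    # One SGD step on the document prefix, touching only adapt_keys.
    # Toy objective: pull each adapted weight toward the prefix mean.
    k = max(1, int(len(doc) * prefix_frac))
    target = sum(doc[:k]) / k
    saved = {key: weights[key] for key in adapt_keys}
    for key in adapt_keys:
        weights[key] -= lr * 2.0 * (weights[key] - target)
    return saved

weights = {"layer2": 0.0, "layer0": 5.0}
saved = adapt_on_prefix(weights, [1.0] * 10, adapt_keys=["layer2"])
print(weights["layer2"], weights["layer0"])   # 0.2 5.0 (only layer2 moved)
weights.update(saved)                         # restore before next document
print(weights["layer2"])                      # 0.0
```

Restoring from the snapshot after each document keeps adaptation strictly per-document, so no state leaks between evaluation examples.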
Configuration
Key Metrics
Validation
- train_gpt.py (1,285 lines, under 1,500 limit)
- `__main__` guarded, syntax clean