What:
Add scale-dependent threshold adjustment to the existing SemDeDup stage, so that semantic similarity thresholds tighten as corpus size (or target model size) increases. The change is based on the empirical finding that gradient alignment for semantically equivalent documents increases with model capability, which accelerates "semantic collisions" once corpora reach hundreds of billions of tokens.
Why:
arXiv:2603.06603 (Feb 2026) demonstrates that semantic duplication disrupts standard scaling laws: it causes mild degradation in small models but rapidly increasing loss penalties in larger models. Fixed similarity thresholds calibrated on small runs systematically over-retain semantic duplicates at production scale, corrupting scaling predictions. This is a silent bug in any large-scale pretraining pipeline that uses SemDeDup with fixed thresholds.
Definition of Done:
- Extend SemDeDupConfig with scale_aware: bool and target_model_params: int fields
- Implement a threshold scaling function ε(N_corpus, N_model), derived from the power law and collision rate formula in the paper
- Ship recommended default thresholds at 1B / 10B / 100B / 1T token scales (table in docs)
- Add a SemanticCollisionAudit utility that samples the corpus at target scale and reports the estimated semantic collision rate before dedup
- Integration test: verify tighter thresholds reduce semantic near-duplicate count at simulated large scale
- Warning logged when target_model_params is unset and corpus size exceeds 100B tokens
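
A minimal sketch of how the config fields, the scaling function, and the warning could fit together. Only scale_aware, target_model_params, and the 100B-token warning condition come from this ticket. Everything else is an assumption for illustration: the similarity_cutoff field, the effective_cutoff helper, the reference scales, the exponents, and the floor are hypothetical, and the power-law form is a placeholder rather than the formula from the paper. The sketch also assumes that "tighter" means more near-duplicates are removed, i.e. the cosine-similarity cutoff above which a pair counts as a duplicate goes down as scale goes up.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("semdedup")

# From the Definition of Done: warn above 100B tokens when model size is unset.
WARN_CORPUS_TOKENS = 100_000_000_000


@dataclass
class SemDeDupConfig:
    # Hypothetical field: pairs with cosine similarity >= this are duplicates.
    similarity_cutoff: float = 0.9
    # Fields named in the ticket.
    scale_aware: bool = False
    target_model_params: Optional[int] = None
    # Hypothetical placeholder constants, NOT taken from the paper.
    ref_corpus_tokens: float = 1e9
    ref_model_params: float = 1e8
    corpus_exponent: float = 0.05
    model_exponent: float = 0.05
    min_cutoff: float = 0.5


def effective_cutoff(cfg: SemDeDupConfig, corpus_tokens: int) -> float:
    """Return the similarity cutoff to use for a corpus of the given size."""
    if cfg.target_model_params is None and corpus_tokens > WARN_CORPUS_TOKENS:
        logger.warning(
            "target_model_params is unset for a corpus of %d tokens; "
            "falling back to the fixed similarity cutoff",
            corpus_tokens,
        )
    if not cfg.scale_aware or cfg.target_model_params is None:
        return cfg.similarity_cutoff

    # Placeholder power law: widen the allowed distance (1 - cutoff) as the
    # corpus and the model grow past their reference scales.
    corpus_factor = max(1.0, corpus_tokens / cfg.ref_corpus_tokens) ** cfg.corpus_exponent
    model_factor = max(1.0, cfg.target_model_params / cfg.ref_model_params) ** cfg.model_exponent
    distance = (1.0 - cfg.similarity_cutoff) * corpus_factor * model_factor

    # Never drop below the floor, so unrelated documents are not merged.
    return max(cfg.min_cutoff, 1.0 - distance)


if __name__ == "__main__":
    cfg = SemDeDupConfig(scale_aware=True, target_model_params=70_000_000_000)
    for tokens in (10**9, 10**10, 10**11, 10**12):
        print(f"{tokens:>16,d} tokens -> cutoff {effective_cutoff(cfg, tokens):.3f}")
```

The placeholder constants would be replaced by values fitted from the paper's formula; the loop at the bottom is one way to generate the 1B / 10B / 100B / 1T defaults table for the docs.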