What:
Add a filter that identifies documents independently recovered by two or more separate web crawl pipelines or sources, treating cross-source agreement as a zero-cost quality signal. This extends the existing MinHash deduplication stage to track which source each document came from and to flag cross-source duplicates as high-quality rather than removing them.
Why:
MixMinHash (arXiv:2512.18834, Dec 2025) reports relative improvements of +4.5% on Arabic and +5.5% on Turkish over FineWeb-2 baselines, with 4× more unique tokens, using no models, no annotations, and no computation beyond the standard dedup already in NeMo Curator. The intuition: if multiple independent crawls each found and kept a document, it is likely high-value. This is particularly powerful for low-resource languages, where quality classifiers are weak.
Definition of Done:
- Extend FuzzyDuplicatesFilter / MinHash pipeline to accept a source_id column per document
- New mode: cross_source_agreement — instead of deduplicating, marks documents with cross_source_count (number of independent sources that contain this document)
- CrossSourceAgreementFilter: keeps only documents with cross_source_count ≥ K (configurable K, default 2)
- Compatible with multi-corpus ingestion (e.g., combining CommonCrawl + C4 + RefinedWeb)
- Integration test: inject known cross-source duplicates, verify they are flagged rather than deduplicated
- Tutorial: multilingual corpus construction using cross-source agreement for 5 low-resource languages
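
The counting and filtering steps listed above can be sketched in plain Python. This is a minimal illustration, not the NeMo Curator API: it assumes the fuzzy dedup stage has already assigned each document a duplicate-group id, and the `group_id` field and both function names are hypothetical. Only `source_id`, `cross_source_count`, and the threshold K come from this proposal.

```python
from collections import defaultdict


def add_cross_source_count(docs):
    """Annotate each document with the number of distinct sources in its duplicate group.

    Each doc is a dict with `source_id` and `group_id` (hypothetical field:
    the duplicate-group id assumed to come from the MinHash stage).
    """
    sources_per_group = defaultdict(set)
    for doc in docs:
        sources_per_group[doc["group_id"]].add(doc["source_id"])
    # Duplicates within a single source do not raise the count,
    # because sources are collected in a set.
    return [
        {**doc, "cross_source_count": len(sources_per_group[doc["group_id"]])}
        for doc in docs
    ]


def cross_source_agreement_filter(docs, k=2):
    """Keep only documents whose cross_source_count is at least k."""
    return [doc for doc in docs if doc["cross_source_count"] >= k]


docs = [
    {"doc_id": "cc-1", "source_id": "commoncrawl", "group_id": 0},
    {"doc_id": "c4-7", "source_id": "c4", "group_id": 0},
    # Same source twice: a within-source duplicate, not cross-source agreement.
    {"doc_id": "cc-2", "source_id": "commoncrawl", "group_id": 1},
    {"doc_id": "cc-3", "source_id": "commoncrawl", "group_id": 1},
    {"doc_id": "rw-5", "source_id": "refinedweb", "group_id": 2},
]

annotated = add_cross_source_count(docs)
kept = cross_source_agreement_filter(annotated, k=2)
print([d["doc_id"] for d in kept])
```

In this sketch the filter keeps every flagged copy rather than collapsing each group to one representative, in line with the "flagged rather than deduplicated" requirement. The same toy input could seed the integration test.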