Cross-Source Agreement Quality Filter (MixMinHash)

What:           
Add a filter that identifies documents recovered independently by two or more separate web crawl pipelines (sources) and treats that cross-source agreement as a zero-cost quality signal. The filter extends the existing MinHash deduplication stage to track which source each document came from and to flag cross-source duplicates as high-quality instead of removing them.
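
To make the idea concrete, here is a minimal sketch of the counting step. It assumes the dedup stage has already assigned each document a duplicate-group ID. The column names `group_id` and `source_id` and the helper `annotate_cross_source_count` are hypothetical, chosen for illustration only, and are not existing NeMo Curator APIs.

```python
from collections import defaultdict


def annotate_cross_source_count(docs):
    """Attach cross_source_count to every document.

    Assumption: each doc is a dict with hypothetical keys
    'id', 'source_id', and 'group_id' (the near-duplicate group
    produced by an upstream MinHash stage).
    """
    # Collect the set of distinct sources seen in each duplicate group.
    sources_per_group = defaultdict(set)
    for doc in docs:
        sources_per_group[doc["group_id"]].add(doc["source_id"])

    # Keep every document; only add the agreement count.
    return [
        {**doc, "cross_source_count": len(sources_per_group[doc["group_id"]])}
        for doc in docs
    ]


docs = [
    {"id": "a1", "source_id": "commoncrawl", "group_id": 0},
    {"id": "b7", "source_id": "c4", "group_id": 0},
    {"id": "a2", "source_id": "commoncrawl", "group_id": 1},
    {"id": "a3", "source_id": "commoncrawl", "group_id": 1},
]
annotated = annotate_cross_source_count(docs)
```

Note that duplicates within a single source (group 1 above) do not raise the count, since only distinct sources are treated as independent agreement.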
                                                                                                                                                               
Why:            
MixMinHash (arXiv:2512.18834, Dec 2025) reports a +4.5% relative improvement on Arabic and +5.5% on Turkish over FineWeb-2 baselines, with 4× more unique tokens, using no models, no annotations, and no computation beyond the standard dedup already in NeMo Curator. The intuition: if multiple independent crawls each found and kept a document, it is likely high-value. This is particularly useful for low-resource languages, where quality classifiers are weak.
                  
Definition of Done:
  - Extend FuzzyDuplicatesFilter / MinHash pipeline to accept a source_id column per document
  - New mode: cross_source_agreement. Instead of deduplicating, it marks documents with cross_source_count (the number of independent sources that contain this document)
  - CrossSourceAgreementFilter: keeps only documents with cross_source_count ≥ K (configurable K, default 2)                                                   
  - Compatible with multi-corpus ingestion (e.g., combining CommonCrawl + C4 + RefinedWeb)                  
  - Integration test: inject known cross-source duplicates, verify they are flagged rather than deduplicated                                                   
  - Tutorial: multilingual corpus construction using cross-source agreement for 5 low-resource languages                                                                                                                                       
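
The flag-then-filter behavior and the integration test described above could look roughly like the following toy, pure-Python sketch. Everything here is an assumption made for illustration: the function names, the dict-based document layout, the 0.8 similarity threshold, and the quadratic pairwise comparison (a real pipeline would use LSH banding in the existing MinHash stage). It is not the NeMo Curator implementation.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # toy signature length (assumption)


def signature(text, n=3):
    """Toy MinHash signature over word n-grams."""
    words = text.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}
    return tuple(
        min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8,
                            salt=seed.to_bytes(8, "little")).digest()
            for g in grams
        )
        for seed in range(NUM_HASHES)
    )


def similarity(a, b):
    """Fraction of matching signature slots (Jaccard estimate)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cross_source_agreement(docs, threshold=0.8):
    """Flag instead of dedup: keep all docs, add cross_source_count."""
    sigs = [signature(d["text"]) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pairwise comparison; acceptable only for a toy example.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if similarity(sigs[i], sigs[j]) >= threshold:
                parent[find(j)] = find(i)

    sources = defaultdict(set)
    for i, d in enumerate(docs):
        sources[find(i)].add(d["source_id"])
    return [{**d, "cross_source_count": len(sources[find(i)])}
            for i, d in enumerate(docs)]


def cross_source_agreement_filter(docs, k=2):
    """Keep only documents seen in at least k distinct sources."""
    return [d for d in docs if d["cross_source_count"] >= k]


# Integration-style check: inject known cross-source duplicates.
fox = "the quick brown fox jumps over the lazy dog near the river bank"
rice = "an unrelated page about cooking rice with saffron and butter"
corpus = [
    {"id": "cc-1", "source_id": "commoncrawl", "text": fox},
    {"id": "c4-1", "source_id": "c4", "text": fox},
    {"id": "cc-2", "source_id": "commoncrawl", "text": rice},
    {"id": "cc-3", "source_id": "commoncrawl", "text": rice},
]
flagged = cross_source_agreement(corpus)
kept = cross_source_agreement_filter(flagged, k=2)
```

In this sketch all four documents survive the flagging step, and only the pair found in two different sources passes the K = 2 filter; the same-source pair is flagged with a count of 1.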

Cross-Source Agreement Quality Filter (MixMinHash) #1757
