Cross-Source Agreement Quality Filter (MixMinHash) #1757

@arhamm1

What:
Add a filter that identifies documents recovered independently by two or more separate web crawl pipelines or sources, and treats that cross-source agreement as a quality signal at no extra cost. This extends the existing MinHash deduplication stage to track which source each document came from and to flag cross-source duplicates as high-quality instead of removing them.
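To make the "flag rather than remove" idea concrete, here is a minimal sketch in plain Python. It assumes the fuzzy dedup stage already assigns each document a duplicate-cluster id; the `group` key, the `source_id` values, and the function name are hypothetical and not part of NeMo Curator's API.

```python
from collections import defaultdict


def annotate_cross_source_count(docs):
    """Attach cross_source_count to every doc instead of dropping duplicates.

    Each doc is a dict with hypothetical keys "id", "source_id", and "group",
    where "group" stands in for the cluster id emitted by fuzzy dedup.
    """
    # Collect the distinct sources seen in each duplicate cluster.
    sources_per_group = defaultdict(set)
    for doc in docs:
        sources_per_group[doc["group"]].add(doc["source_id"])

    # Annotate every document; nothing is removed at this stage.
    return [
        {**doc, "cross_source_count": len(sources_per_group[doc["group"]])}
        for doc in docs
    ]


docs = [
    {"id": "a1", "source_id": "crawl_a", "group": 0},
    {"id": "b1", "source_id": "crawl_b", "group": 0},  # cross-source duplicate of a1
    {"id": "a2", "source_id": "crawl_a", "group": 1},
    {"id": "a3", "source_id": "crawl_a", "group": 1},  # same-source duplicate of a2
]
annotated = annotate_cross_source_count(docs)
```

Note that same-source duplicates (`a2`, `a3`) get a count of 1: only distinct sources count as agreement, so repeated copies within one crawl do not inflate the signal.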

Why:
MixMinHash (arXiv:2512.18834, Dec 2025) reports relative improvements of +4.5% on Arabic and +5.5% on Turkish over FineWeb-2 baselines, with 4× more unique tokens. It uses no models, no annotations, and no computation beyond the standard dedup already in NeMo Curator. The intuition: if multiple independent crawls each found and kept a document, it is likely high-value. This is particularly powerful for low-resource languages, where quality classifiers are weak.

Definition of Done:

  • Extend the FuzzyDuplicatesFilter / MinHash pipeline to accept a source_id column per document
  • New mode: cross_source_agreement. Instead of deduplicating, it marks each document with cross_source_count (the number of independent sources that contain the document)
  • CrossSourceAgreementFilter: keeps only documents with cross_source_count ≥ K (configurable K, default 2)
  • Compatible with multi-corpus ingestion (e.g., combining CommonCrawl + C4 + RefinedWeb)
  • Integration test: inject known cross-source duplicates, verify they are flagged rather than deduplicated
  • Tutorial: multilingual corpus construction using cross-source agreement for 5 low-resource languages
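The integration test item could look roughly like the sketch below. It is self-contained and uses a toy MinHash built on `hashlib` with an O(n²) pairwise comparison, not NeMo Curator's actual fuzzy dedup stage; `NUM_PERM`, the 0.7 threshold, the helper names, and the sample texts are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations

NUM_PERM = 64  # assumed signature length for this toy example


def shingles(text, n=3):
    """Return the set of word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text):
    """Toy MinHash signature: one salted 64-bit hash minimum per seed."""
    grams = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    g.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
                ).digest(),
                "big",
            )
            for g in grams
        )
        for seed in range(NUM_PERM)
    ]


def cluster(signatures, threshold):
    """Union-find over all pairs whose estimated Jaccard meets the threshold."""
    parent = list(range(len(signatures)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(signatures)), 2):
        matches = sum(a == b for a, b in zip(signatures[i], signatures[j]))
        if matches / NUM_PERM >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(signatures))]


def annotate(docs, threshold=0.7):
    """Add cross_source_count (distinct sources per cluster) to every doc."""
    roots = cluster([minhash(d["text"]) for d in docs], threshold)
    sources = {}
    for doc, root in zip(docs, roots):
        sources.setdefault(root, set()).add(doc["source_id"])
    return [
        {**doc, "cross_source_count": len(sources[root])}
        for doc, root in zip(docs, roots)
    ]


NEWS = ("the river flooded the lower valley after three days of heavy rain "
        "and the bridge near the old mill was closed")
BUDGET = "city council approved the new budget for public transport on monday evening"
RECIPE = "a simple recipe for lentil soup uses onions garlic cumin and a squeeze of lemon"

docs = [
    {"id": "a1", "source_id": "commoncrawl", "text": NEWS},
    {"id": "b1", "source_id": "c4", "text": NEWS + " overnight"},  # injected near-duplicate
    {"id": "c1", "source_id": "refinedweb", "text": RECIPE},
    {"id": "a2", "source_id": "commoncrawl", "text": BUDGET},
    {"id": "a3", "source_id": "commoncrawl", "text": BUDGET},  # same-source duplicate
]

K = 2
annotated = annotate(docs)
kept = [d["id"] for d in annotated if d["cross_source_count"] >= K]
```

The injected pair (`a1`, `b1`) should survive the K = 2 filter, every input document should still be present in `annotated`, and the same-source pair (`a2`, `a3`) should not count as agreement. A real test would swap the toy MinHash for the pipeline's own dedup stage.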

Labels: enhancement
