NVIDIA-NeMo · Jorjeous · May 21, 2026 · Jun 9, 2026 · Jun 9, 2026
@@ -0,0 +1,105 @@
+---
+description: "Cluster speaker embeddings into speaker groups using SCOTCH (Scalable Centroid-based Two-stage Clustering with Hierarchical refinement)"
+categories: ["audio-processing"]
+tags: ["scotch", "speaker-clustering", "ahc", "birch", "speaker-id"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "intermediate"
+content_type: "how-to"
+modality: "audio-only"
+---
+
+# SCOTCH Speaker Clustering
+
+Cluster speaker embeddings into speaker groups using Agglomerative Hierarchical Clustering (AHC). For datasets over 500 hours, a two-stage BIRCH → AHC pipeline keeps memory bounded while scaling to tens of millions of utterances.
+
+## Understanding SCOTCH
+
+SCOTCH assigns every utterance a `speaker_label` (integer cluster ID) and a `confidence_score` (0–1 silhouette-style metric). Two backends are available:
+
+| Backend | Scale | Memory | When to Use |
+|---------|-------|--------|-------------|
+| Standard AHC | < 500 h / < 150K utterances | O(N²) | Small-to-medium datasets |
+| Large-scale (BIRCH + AHC) | > 500 h / millions of utterances | O(N·D + K²) | Web-scale audio |
+
+Use `recommend_clustering_method()` to pick automatically:
+
+```python
+from nemo_curator.stages.audio.speaker_id.clustering.large_scale_clustering_and_scoring import (
+    recommend_clustering_method,
+)
+
+method = recommend_clustering_method(num_hours=1200)
+# -> "large_scale"
+```
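The table above suggests a simple size-based rule. The sketch below is a hypothetical stand-in written for illustration, not the implementation of `recommend_clustering_method()`: the helper name `pick_backend`, the `"standard"` return value, and the handling of the exact 500 h boundary are all assumptions.

```python
from typing import Optional

# Limits taken from the backend table; treating them as hard cutoffs is an assumption.
HOURS_LIMIT = 500
UTTERANCE_LIMIT = 150_000


def pick_backend(num_hours: float, num_utterances: Optional[int] = None) -> str:
    """Hypothetical size-based rule mirroring the backend table."""
    if num_hours > HOURS_LIMIT:
        return "large_scale"
    if num_utterances is not None and num_utterances > UTTERANCE_LIMIT:
        return "large_scale"
    # "standard" is an illustrative label, not a confirmed library value.
    return "standard"


choice_big = pick_backend(num_hours=1200)
choice_small = pick_backend(num_hours=100, num_utterances=50_000)
```

Either limit alone triggers the large-scale path here, since a short dataset can still contain many utterances.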
+
+## Basic Usage
+
+### Standard AHC (SpeakerClusteringStage)
+
+```python
+from nemo_curator.stages.audio.speaker_id import SpeakerClusteringStage
+
+cluster = SpeakerClusteringStage(
+    input_manifest="manifests/manifest_{0..49}.json",
+    embedding_dir="embeddings/",
+    output_manifest_dir="output_manifests/",
+    threshold=0.292,
+    embedding_normalization="center_global",
+)
+# Assumes `pipeline` is a NeMo Curator pipeline object created earlier.
+pipeline.add_stage(cluster)
+```
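The `input_manifest` value above uses a `{0..49}` shard range. As a rough illustration of what such a pattern stands for, the sketch below expands it into concrete paths. The helper `expand_shard_pattern` is hypothetical, and the inclusive range (as in shell brace expansion) is an assumption; the stage's own expansion logic may differ.

```python
import re


def expand_shard_pattern(pattern: str) -> list:
    """Expand a single `{start..end}` range into concrete paths (hypothetical helper)."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        # No range present: the pattern is already a single path.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    # Assumes both ends of the range are inclusive.
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


paths = expand_shard_pattern("manifests/manifest_{0..49}.json")
```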
+
+### Large-Scale (Function API)
+
+```python
+from nemo_curator.stages.audio.speaker_id.clustering.large_scale_clustering_and_scoring import (
+    cluster_embeddings_large_scale,
+)
+
+labels, confidence, stats = cluster_embeddings_large_scale(
+    embeddings,       # (N, D) float32 array
+    threshold=0.40,
+    min_cluster_size=30,
+)
+```
+
+## Parameters
+
+### SpeakerClusteringStage
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `threshold` | float | 0.292 | Cosine-similarity cutoff for same-speaker decisions |
+| `linkage_method` | str | `"average"` | AHC linkage: `"average"`, `"complete"`, `"single"` |
+| `shard_level_clustering` | bool | `False` | Cluster each shard independently vs globally |
+| `batch_size` | int or None | `None` | Group shards into batches for clustering |
+| `embedding_normalization` | str | `"center_global"` | `"none"`, `"center_global"`, or `"external"` |
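To make the `center_global` option more concrete, here is a minimal pure-Python sketch of mean-centering a set of embeddings. It is an illustration only: the function name is hypothetical, and rescaling each centered vector to unit length afterwards is an assumption, not something the table states.

```python
import math


def center_and_normalize(embeddings):
    """Subtract the mean embedding, then rescale each vector to unit length."""
    count = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[d] for vec in embeddings) / count for d in range(dim)]
    result = []
    for vec in embeddings:
        centered = [x - m for x, m in zip(vec, mean)]
        norm = math.sqrt(sum(x * x for x in centered))
        # Leave an all-zero vector untouched to avoid dividing by zero.
        result.append([x / norm for x in centered] if norm > 0 else centered)
    return result


toy = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
normalized = center_and_normalize(toy)
```

Centering removes a shared offset (for example a channel or domain bias) so that cosine similarity reflects differences between speakers rather than the common component.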
+
+### Large-Scale Function
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `threshold` | float | 0.40 | Cosine cutoff at AHC step |
+| `min_cluster_size` | int | 30 | Drop clusters smaller than this |
+| `birch_threshold` | float | ~0.63 | BIRCH leaf radius (Euclidean on unit sphere) |
+| `branching_factor` | int | 50 | BIRCH tree branching factor |
+
+## Threshold Tuning
+
+| Threshold | Effect | Use Case |
+|-----------|--------|----------|
+| 0.25–0.30 | Aggressive merging, fewer speakers | High-recall speaker groups |
+| 0.30–0.35 | Balanced (the 0.292 TitaNet default sits just below this band) | General ASR data curation |
+| 0.40–0.50 | Conservative, more speakers | Purity-first for TTS training |
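The direction of the table (lower threshold, fewer speakers) follows from treating the threshold as a minimum cosine similarity for merging. The toy average-linkage routine below illustrates this on four 2-D vectors. It is a didactic sketch with hypothetical names, not the SCOTCH implementation, and it is far too slow for real data.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def average_linkage_clusters(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def similarity(c1, c2):
        # Average pairwise cosine similarity between two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best_pair, best_sim = max(
            (
                ((a, b), similarity(clusters[a], clusters[b]))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best_sim < threshold:
            break
        a, b = best_pair  # a < b, so deleting index b keeps index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Two tight pairs of unit vectors at 0/10 degrees and 70/80 degrees.
toy_vectors = [
    [math.cos(math.radians(deg)), math.sin(math.radians(deg))]
    for deg in (0, 10, 70, 80)
]
loose = average_linkage_clusters(toy_vectors, threshold=0.30)
strict = average_linkage_clusters(toy_vectors, threshold=0.40)
```

With these vectors the two pairs have an average cross-pair similarity of about 0.34, so the 0.30 threshold merges everything into one group while 0.40 keeps two groups.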
+
+## Best Practices
+
+- **Run embedding normalization**: `center_global` (the default) subtracts the batch mean, which improves threshold stability across domains.
+- **Start with default thresholds**: 0.292 for standard AHC (TitaNet), 0.40 for large-scale.
+- **Use `min_cluster_size=30`** for training data: it drops long-tail noise speakers that hurt model quality.
+- **CPU-only**: clustering does not need a GPU; allocate enough RAM to hold the (N, D) float32 embedding matrix, plus the O(N²) pairwise matrix when using standard AHC.
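For RAM planning, a back-of-the-envelope estimate is often enough. The sketch below assumes float32 values, 192-dimensional embeddings, and a dense N×N pairwise matrix for standard AHC. These are illustrative assumptions; the actual stage may store distances more compactly, and the helper names are hypothetical.

```python
def embedding_matrix_gib(num_utterances, dim, bytes_per_value=4):
    """Size of an (N, D) float32 embedding matrix in GiB."""
    return num_utterances * dim * bytes_per_value / 2**30


def pairwise_matrix_gib(num_utterances, bytes_per_value=4):
    """Size of a dense N x N float32 similarity matrix in GiB (upper-bound style estimate)."""
    return num_utterances**2 * bytes_per_value / 2**30


# 150K utterances is the standard-AHC limit quoted in the backend table.
emb_gib = embedding_matrix_gib(150_000, dim=192)
pair_gib = pairwise_matrix_gib(150_000)
```

Under these assumptions, 150,000 utterances need roughly 0.11 GiB for the embeddings but roughly 84 GiB for a dense pairwise matrix, which is why the quadratic term, not the embeddings, limits standard AHC.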
+
+## Related Topics
+
+- [Speaker Embedding](/curate-audio/process-data/quality-filtering/speaker-embedding) — extract embeddings before clustering
+- [UTMOSv2 Scoring](/curate-audio/process-data/quality-filtering/utmosv2) — combine with quality filtering
@@ -0,0 +1,86 @@
+---
+description: "Extract speaker embeddings from audio using TitaNet for downstream clustering and speaker identification"
+categories: ["audio-processing"]
+tags: ["speaker-embedding", "titanet", "speaker-id", "embedding"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "intermediate"
+content_type: "how-to"
+modality: "audio-only"
+---
+
+# Speaker Embedding
+
+Extract per-utterance speaker embeddings using a pretrained [TitaNet](https://catalog.ngc.nvidia.com/orgs/nvidia/models/speakerverification_en_titanet_large) model. These embeddings power downstream speaker clustering (SCOTCH) and confidence scoring.
+
+## Understanding Speaker Embeddings
+
+Speaker embeddings are fixed-dimensional vectors that capture the vocal characteristics of a speaker. Two utterances from the same speaker produce embeddings with high cosine similarity; different speakers produce low similarity.
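As a small illustration of the similarity measure mentioned above, the sketch below computes cosine similarity in pure Python. The three vectors are made-up toy values, not real TitaNet embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: the first two point in nearly the same direction.
same_a = [0.9, 0.1, 0.0]
same_b = [0.8, 0.2, 0.1]
other = [0.0, 0.2, 0.9]

sim_same = cosine_similarity(same_a, same_b)
sim_diff = cosine_similarity(same_a, other)
```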
+
+Three stage variants handle different data layouts:
+
+| Stage | Input | Use Case |
+|-------|-------|----------|
+| `SpeakerEmbeddingAudioTaskStage` | In-memory waveform (`AudioTask`) | AIS-streamed pipelines |
+| `SpeakerEmbeddingRequestStage` | File paths in JSONL (`DocumentBatch`) | Local/S3/tar audio |
+| `SpeakerEmbeddingLhotseStage` | Lhotse CutSet (NeMo tarred) | Large-scale NeMo datasets |
+
+## Basic Usage
+
+### In-Memory Waveform (AudioTask)
+
+```python
+from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingAudioTaskStage
+
+embed = SpeakerEmbeddingAudioTaskStage(
+    output_dir="embeddings/",
+    target_sample_rate=16000,
+)
+# Assumes `pipeline` is a NeMo Curator pipeline object created earlier.
+pipeline.add_stage(embed)
+```
+
+### File Paths (DocumentBatch)
+
+```python
+from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingRequestStage
+
+embed = SpeakerEmbeddingRequestStage(
+    output_path="embeddings/all_embeddings.npz",
+    output_format="npz",
+)
+pipeline.add_stage(embed)
+```
+
+### NeMo Tarred (Lhotse)
+
+```python
+from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingLhotseStage
+
+embed = SpeakerEmbeddingLhotseStage(
+    input_manifest="manifests/manifest_{0..49}.json",
+    input_tar="tars/audio_{0..49}.tar",
+    output_path="embeddings/",
+    batch_size=64,
+)
+```
+
+## Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model_name` | str | `nvidia/speakerverification_en_titanet_large` | Pretrained NeMo speaker model |
+| `target_sample_rate` | int | 16000 | Resample audio to this rate |
+| `output_dir` / `output_path` | str | `""` | Where to save embedding files |
+| `output_format` | str | `"npz"` | `"npz"` or `"pt"` |
+| `batch_size` | int | 64 | Inference batch size |
+
+## Best Practices
+
+- **Use TitaNet Large** for best clustering quality; the default model works well across domains.
+- **Target 16 kHz** — the model expects this sample rate; mismatched rates degrade accuracy.
+- **Save per-shard embeddings** for large datasets; use `merge_shard_embeddings()` to combine afterwards.
+- **GPU recommended** — embedding extraction is compute-intensive; allocate at least 4 GB GPU memory.
+
+## Related Topics
+
+- [SCOTCH Clustering](/curate-audio/process-data/quality-filtering/scotch-clustering) — cluster embeddings into speaker groups
+- [UTMOSv2 Scoring](/curate-audio/process-data/quality-filtering/utmosv2) — audio quality scoring
@@ -0,0 +1,67 @@
+---
+description: "Score audio quality using UTMOSv2 Mean Opinion Score prediction"
+categories: ["audio-processing"]
+tags: ["utmosv2", "mos", "mean-opinion-score", "quality", "audio-metrics"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "beginner"
+content_type: "how-to"
+modality: "audio-only"
+---
+
+# UTMOSv2 Scoring
+
+Compute per-utterance Mean Opinion Score (MOS) predictions using the [UTMOSv2](https://github.com/sarulab-speech/UTMOSv2) model. Unlike `UTMOSFilterStage` (which uses the older `utmos22_strong`), this stage supports in-memory waveforms, local files, and remote audio (S3/AIS/HTTP).
+
+## Understanding UTMOSv2
+
+UTMOSv2 predicts MOS on the standard 1.0–5.0 scale. It handles multiple audio formats (WAV, OPUS, FLAC, MP3) and three input modes:
+
+| Mode | Input | Best For |
+|------|-------|----------|
+| In-memory waveform | `waveform` + `sample_rate` keys | NeMo tarred / AIS-streamed pipelines |
+| Local file path | `audio_filepath` on disk | Pre-downloaded datasets |
+| Remote URL | `s3://` / `ais://` / `http://` | Cloud-hosted audio |
+
+## Basic Usage
+
+```python
+from nemo_curator.stages.audio.metrics.utmosv2_score import GetUtmosv2ScoreStage
+
+utmos = GetUtmosv2ScoreStage(
+    score_key="utmosv2_score",
+    sample_rate=16000,
+)
+# Assumes `pipeline` is a NeMo Curator pipeline object created earlier.
+pipeline.add_stage(utmos)
+```
+
+The stage adds a `utmosv2_score` field to each entry. Entries where audio cannot be loaded receive `NaN`.
+
+## Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `score_key` | str | `"utmosv2_score"` | Output key for the MOS score |
+| `audio_filepath_key` | str | `"audio_filepath"` | Input key for file paths |
+| `waveform_key` | str | `"waveform"` | Input key for in-memory waveforms |
+| `audio_root` | str | `""` | Root directory prepended to relative paths |
+| `sample_rate` | int | 16000 | Target sample rate for inference |
+| `inference_batch_size` | int | 16 | Batch size for `model.predict()` |
+| `num_repetitions` | int | 1 | Test-time augmentation repetitions |
+
+## When to Use UTMOSv2 vs UTMOS
+
+- **UTMOS** (`UTMOSFilterStage`) — older `utmos22_strong` model, integrated threshold-based filtering, simpler API.
+- **UTMOSv2** (`GetUtmosv2ScoreStage`) — newer model, supports remote files and in-memory waveforms, scoring only (filter downstream).
+
+For new pipelines, prefer UTMOSv2 for its flexibility. Apply filtering in a subsequent stage based on the score.
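Downstream filtering can be as simple as a predicate over manifest entries. The sketch below is one possible shape, assuming entries are plain dictionaries carrying the `utmosv2_score` key described above; the helper name and the 3.0 cutoff are illustrative, not recommended values.

```python
import math


def keep_entry(entry, min_score, score_key="utmosv2_score"):
    """Keep entries with a valid score at or above `min_score`."""
    score = entry.get(score_key, float("nan"))
    # NaN marks audio that could not be loaded, so it is always rejected.
    return not math.isnan(score) and score >= min_score


entries = [
    {"audio_filepath": "a.wav", "utmosv2_score": 4.1},
    {"audio_filepath": "b.wav", "utmosv2_score": 2.3},
    {"audio_filepath": "c.wav", "utmosv2_score": float("nan")},
]
kept = [entry for entry in entries if keep_entry(entry, min_score=3.0)]
```

The explicit `isnan` check matters because every comparison with `NaN` is false, which would otherwise hide load failures inside the rejected set without distinguishing them.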
+
+## Best Practices
+
+- **GPU recommended** — allocate 1 GPU per stage instance for inference speed.
+- **Score first, filter later** — run with no threshold to inspect the MOS distribution before choosing a cutoff.
+- **Waveforms are stripped** from output — the `waveform` key is removed after scoring to keep JSONL output clean.
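Inspecting the score distribution before choosing a cutoff could look like the sketch below. It uses only the standard library, assumes the scores have already been collected into a list, and uses a hypothetical helper name.

```python
import math
import statistics


def summarize_scores(scores):
    """Summarize MOS values, counting NaN entries separately as failed loads."""
    valid = sorted(s for s in scores if not math.isnan(s))
    q1, median, q3 = statistics.quantiles(valid, n=4)
    return {
        "count": len(valid),
        "failed": len(scores) - len(valid),
        "min": valid[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": valid[-1],
    }


scores = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, float("nan")]
summary = summarize_scores(scores)
```

Quartiles give a quick sense of how much data a candidate cutoff would discard, for example a cutoff near `q1` drops about a quarter of the valid entries.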
+
+## Related Topics
+
+- [UTMOS Filter](/curate-audio/process-data/quality-filtering/utmos) — threshold-based filtering with the older model
+- [SCOTCH Clustering](/curate-audio/process-data/quality-filtering/scotch-clustering) — combine quality scores with speaker clustering
@@ -39,6 +39,9 @@
     "PreserveByValueStage": "nemo_curator.stages.audio.common",
     "SIGMOSFilterStage": "nemo_curator.stages.audio.filtering",
     "SegmentConcatenationStage": "nemo_curator.stages.audio.preprocessing",
+    "SpeakerClusteringStage": "nemo_curator.stages.audio.speaker_id",
+    "SpeakerEmbeddingLhotseStage": "nemo_curator.stages.audio.speaker_id",
+    "SpeakerEmbeddingRequestStage": "nemo_curator.stages.audio.speaker_id",
     "SpeakerSeparationStage": "nemo_curator.stages.audio.segmentation",
     "TimestampMapperStage": "nemo_curator.stages.audio.postprocessing",
     "UTMOSFilterStage": "nemo_curator.stages.audio.filtering",
@@ -57,6 +60,9 @@
     "PreserveByValueStage",
     "SIGMOSFilterStage",
     "SegmentConcatenationStage",
+    "SpeakerClusteringStage",
+    "SpeakerEmbeddingLhotseStage",
+    "SpeakerEmbeddingRequestStage",
     "SpeakerSeparationStage",
     "TimestampMapperStage",
     "UTMOSFilterStage",

@@ -22,6 +22,7 @@
     "ComputeWERStage": "nemo_curator.stages.audio.metrics.wer",
     "GetPairwiseWerStage": "nemo_curator.stages.audio.metrics.wer",
     "TorchSquimQualityMetricsStage": "nemo_curator.stages.audio.metrics.squim",
+    "GetUtmosv2ScoreStage": "nemo_curator.stages.audio.metrics.utmosv2_score",
 }
 
 _cache: dict[str, Any] = {}