@@ -0,0 +1,105 @@
---
description: "Cluster speaker embeddings into speaker groups using SCOTCH (Scalable Centroid-based Two-stage Clustering with Hierarchical refinement)"
categories: ["audio-processing"]
tags: ["scotch", "speaker-clustering", "ahc", "birch", "speaker-id"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "audio-only"
---

# SCOTCH Speaker Clustering

Cluster speaker embeddings into speaker groups using Agglomerative Hierarchical Clustering (AHC). For datasets over 500 hours, a two-stage BIRCH → AHC pipeline keeps memory bounded while scaling to tens of millions of utterances.

## Understanding SCOTCH

SCOTCH assigns every utterance a `speaker_label` (integer cluster ID) and a `confidence_score` (0–1 silhouette-style metric). Two backends are available:

| Backend | Scale | Memory | When to Use |
|---------|-------|--------|-------------|
| Standard AHC | < 500 h / < 150K utterances | O(N²) | Small-to-medium datasets |
| Large-scale (BIRCH + AHC) | > 500 h / millions of utterances | O(N·D + K²) | Web-scale audio |

Use `recommend_clustering_method()` to pick automatically:

```python
from nemo_curator.stages.audio.speaker_id.clustering.large_scale_clustering_and_scoring import (
    recommend_clustering_method,
)

method = recommend_clustering_method(num_hours=1200)
# -> "large_scale"
```
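
The cut-offs in the table can also be turned into a quick sanity check before committing to a backend. The sketch below is not the logic inside `recommend_clustering_method()`: `estimate_pairwise_bytes` and `pick_backend` are hypothetical helpers, the `"standard"` return value is an assumed label, and the memory figure assumes a dense float32 similarity matrix.

```python
from typing import Optional


def estimate_pairwise_bytes(num_utterances: int, bytes_per_value: int = 4) -> int:
    """Size of a dense N x N similarity matrix (float32 assumed by default)."""
    return num_utterances * num_utterances * bytes_per_value


def pick_backend(num_hours: float, num_utterances: Optional[int] = None) -> str:
    """Apply the 500 h / 150K-utterance cut-offs from the table above."""
    if num_hours > 500:
        return "large_scale"
    if num_utterances is not None and num_utterances >= 150_000:
        return "large_scale"
    return "standard"  # assumed label for the standard AHC backend


print(pick_backend(1200))                      # large_scale
print(estimate_pairwise_bytes(150_000) / 1e9)  # 90.0 (GB)
```

At the 150K-utterance boundary a dense matrix already needs about 90 GB, which is why the O(N²) backend stops being practical beyond that point.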

## Basic Usage

### Standard AHC (SpeakerClusteringStage)

```python
from nemo_curator.stages.audio.speaker_id import SpeakerClusteringStage

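cluster = SpeakerClusteringStage(
    input_manifest="manifests/manifest_{0..49}.json",  # sharded manifests 0..49
    embedding_dir="embeddings/",
    output_manifest_dir="output_manifests/",
    threshold=0.292,  # default cosine-similarity cutoff
    embedding_normalization="center_global",  # default normalization mode
)
pipeline.add_stage(cluster)  # `pipeline` is an existing pipeline object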
```

### Large-Scale (Function API)

```python
from nemo_curator.stages.audio.speaker_id.clustering.large_scale_clustering_and_scoring import (
    cluster_embeddings_large_scale,
)

labels, confidence, stats = cluster_embeddings_large_scale(
    embeddings,  # (N, D) float32 array
    threshold=0.40,
    min_cluster_size=30,
)
```
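
To get a feel for what `min_cluster_size` does to the returned labels, here is a minimal sketch on a plain list of integer labels. `drop_small_clusters` is a hypothetical helper, and marking dropped utterances with `-1` is an assumption of this example, not a documented return value of `cluster_embeddings_large_scale`.

```python
from collections import Counter


def drop_small_clusters(labels, min_cluster_size=30, dropped_label=-1):
    """Relabel members of clusters smaller than min_cluster_size.

    dropped_label=-1 is an assumption for this sketch, not a documented value.
    """
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_cluster_size else dropped_label for lab in labels]


labels = [0] * 40 + [1] * 35 + [2] * 3  # toy labels, one per utterance
filtered = drop_small_clusters(labels)
print(Counter(filtered))
```

The three utterances of the tiny cluster `2` are the only ones relabelled; the two large clusters pass through unchanged.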

## Parameters

### SpeakerClusteringStage

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `threshold` | float | 0.292 | Cosine-similarity cutoff for same-speaker decisions |
| `linkage_method` | str | `"average"` | AHC linkage: `"average"`, `"complete"`, `"single"` |
| `shard_level_clustering` | bool | `False` | Cluster each shard independently vs globally |
| `batch_size` | int | `None` | Group shards into batches for clustering |
| `embedding_normalization` | str | `"center_global"` | `"none"`, `"center_global"`, or `"external"` |
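
How `threshold` and `"average"` linkage interact is easiest to see on a toy example. The following is a pure-Python sketch of average-linkage agglomerative clustering on cosine similarity, not the stage's implementation; `toy_ahc` is a hypothetical name, and treating `threshold` as the minimum mean similarity required for a merge is inferred from the parameter description.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_ahc(embeddings, threshold):
    """Average-linkage agglomerative clustering on cosine similarity.

    Repeatedly merges the two clusters with the highest mean pairwise
    similarity until no pair reaches `threshold`. O(N^3) toy code.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(c1, c2):
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = avg_sim(clusters[a], clusters[b])
                if s > best_sim:
                    best_sim, best_pair = s, (a, b)
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Two tight groups pointing in different directions.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(toy_ahc(points, threshold=0.292))  # [0, 0, 1, 1]
print(toy_ahc(points, threshold=0.1))    # [0, 0, 0, 0]
```

Lowering the cutoff from 0.292 to 0.1 lets the two groups merge, since their mean cross-group similarity is about 0.11.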

### Large-Scale Function

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `threshold` | float | 0.40 | Cosine cutoff at AHC step |
| `min_cluster_size` | int | 30 | Drop clusters smaller than this |
| `birch_threshold` | float | ~0.63 | BIRCH leaf radius (Euclidean on unit sphere) |
| `branching_factor` | int | 50 | BIRCH tree branching factor |
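
Because `birch_threshold` is a Euclidean distance on the unit sphere while `threshold` is a cosine value, it helps to convert between the two. For unit vectors, ‖a − b‖² = 2 − 2·cos(a, b). The helper names below are hypothetical, and the conversion is exact only for pairs of unit-norm vectors, so for a BIRCH leaf radius it is a rough guide.

```python
import math


def euclidean_to_cosine(distance: float) -> float:
    """Cosine similarity of two unit vectors that are `distance` apart."""
    return 1.0 - distance * distance / 2.0


def cosine_to_euclidean(similarity: float) -> float:
    """Euclidean distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * similarity)


print(round(euclidean_to_cosine(0.63), 3))  # 0.802
print(round(cosine_to_euclidean(0.40), 3))  # 1.095
```

A radius of about 0.63 therefore corresponds to a cosine similarity of roughly 0.80, much tighter than the 0.40 cutoff applied at the AHC step.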

## Threshold Tuning

| Threshold | Effect | Use Case |
|-----------|--------|----------|
| 0.25–0.30 | Aggressive merging, fewer speakers | High-recall speaker groups |
| 0.30–0.35 | Balanced; the 0.292 standard-AHC default sits just below this range | General ASR data curation |
| 0.40–0.50 | Conservative, more speakers | Purity-first for TTS training |
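
The direction of the effect can be checked with a few numbers. This sketch uses made-up pairwise similarities and a hypothetical `linked_pairs` helper; it only counts how many pairs would pass a same-speaker cutoff at each threshold.

```python
def linked_pairs(similarities, threshold):
    """Count pairs whose cosine similarity meets the same-speaker cutoff."""
    return sum(1 for s in similarities if s >= threshold)


# Made-up pairwise similarities, for illustration only.
sims = [0.12, 0.27, 0.31, 0.33, 0.42, 0.48, 0.71]
for t in (0.25, 0.30, 0.40, 0.50):
    print(t, linked_pairs(sims, t))
```

Fewer pairs qualify as the threshold rises, which is why higher values yield more, purer speaker groups.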

## Best Practices

- **Run embedding normalization** — `center_global` (default) subtracts batch mean, which improves threshold stability across domains.
- **Start with default thresholds** — 0.292 for standard AHC (TitaNet), 0.40 for large-scale.
- **Use `min_cluster_size=30`** for training data — drops long-tail noise speakers that hurt model quality.
- **CPU-only** — clustering does not need GPU; allocate sufficient RAM for the embedding matrix.
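
To see why mean subtraction stabilizes thresholds, consider embeddings that share a large common offset. The sketch below is a pure-Python illustration, not the stage's code: this `center_global` function is a hypothetical stand-in for the option of the same name, and whether the stage re-normalizes vectors afterwards is not covered here.

```python
import math


def center_global(embeddings):
    """Subtract the per-dimension mean of the whole batch from every vector."""
    n = len(embeddings)
    mean = [sum(col) / n for col in zip(*embeddings)]
    return [[x - m for x, m in zip(vec, mean)] for vec in embeddings]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Three toy vectors sharing a large common offset along the first axis.
raw = [[10.0, 1.0], [10.0, -1.0], [10.0, 0.5]]
centered = center_global(raw)
print(round(cosine(raw[0], raw[1]), 3))            # 0.98
print(round(cosine(centered[0], centered[1]), 3))  # -1.0
```

Before centering, the shared offset makes every pair look similar; after centering, the similarity reflects only how the vectors differ from the batch average.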

## Related Topics

- [Speaker Embedding](/curate-audio/process-data/quality-filtering/speaker-embedding) — extract embeddings before clustering
- [UTMOSv2 Scoring](/curate-audio/process-data/quality-filtering/utmosv2) — combine with quality filtering
@@ -0,0 +1,86 @@
---
description: "Extract speaker embeddings from audio using TitaNet for downstream clustering and speaker identification"
categories: ["audio-processing"]
tags: ["speaker-embedding", "titanet", "speaker-id", "embedding"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "audio-only"
---

# Speaker Embedding

Extract per-utterance speaker embeddings using a pretrained [TitaNet](https://catalog.ngc.nvidia.com/orgs/nvidia/models/speakerverification_en_titanet_large) model. These embeddings power downstream speaker clustering (SCOTCH) and confidence scoring.

## Understanding Speaker Embeddings

Speaker embeddings are fixed-dimensional vectors that capture the vocal characteristics of a speaker. Two utterances from the same speaker produce embeddings with high cosine similarity; different speakers produce low similarity.
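
As a minimal sketch of that comparison, the snippet below computes cosine similarity for three toy 3-dimensional vectors. Real TitaNet embeddings have many more dimensions, and the values here are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


utt_a = [0.8, 0.1, 0.3]    # speaker 1, utterance 1 (toy values)
utt_b = [0.7, 0.2, 0.4]    # speaker 1, utterance 2 (toy values)
utt_c = [-0.2, 0.9, -0.1]  # speaker 2 (toy values)

print(round(cosine_similarity(utt_a, utt_b), 2))  # 0.98
print(round(cosine_similarity(utt_a, utt_c), 2))  # -0.13
```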

Three stage variants handle different data layouts:

| Stage | Input | Use Case |
|-------|-------|----------|
| `SpeakerEmbeddingAudioTaskStage` | In-memory waveform (`AudioTask`) | AIS-streamed pipelines |
| `SpeakerEmbeddingRequestStage` | File paths in JSONL (`DocumentBatch`) | Local/S3/tar audio |
| `SpeakerEmbeddingLhotseStage` | Lhotse CutSet (NeMo tarred) | Large-scale NeMo datasets |

## Basic Usage

### In-Memory Waveform (AudioTask)

```python
from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingAudioTaskStage

embed = SpeakerEmbeddingAudioTaskStage(
    output_dir="embeddings/",
    target_sample_rate=16000,
)
pipeline.add_stage(embed)
```

### File Paths (DocumentBatch)

```python
from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingRequestStage

embed = SpeakerEmbeddingRequestStage(
    output_path="embeddings/all_embeddings.npz",
    output_format="npz",
)
pipeline.add_stage(embed)
```

### NeMo Tarred (Lhotse)

```python
from nemo_curator.stages.audio.speaker_id import SpeakerEmbeddingLhotseStage

embed = SpeakerEmbeddingLhotseStage(
    input_manifest="manifests/manifest_{0..49}.json",
    input_tar="tars/audio_{0..49}.tar",
    output_path="embeddings/",
    batch_size=64,
)
```
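
The `{0..49}` shard patterns in the example above read like shell brace ranges. Assuming they denote an inclusive integer range (how the stage itself parses them is not shown here), a hypothetical helper such as `expand_shard_pattern` can list the concrete files a pattern refers to:

```python
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shard_pattern(pattern: str) -> list[str]:
    """Expand the first `{start..end}` range (inclusive) into concrete paths."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


paths = expand_shard_pattern("manifests/manifest_{0..49}.json")
print(len(paths), paths[0], paths[-1])
# 50 manifests/manifest_0.json manifests/manifest_49.json
```

Listing the expanded paths before a run is a cheap way to confirm that the manifest and tar patterns cover the same shard indices.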

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | `nvidia/speakerverification_en_titanet_large` | Pretrained NeMo speaker model |
| `target_sample_rate` | int | 16000 | Resample audio to this rate |
| `output_dir` / `output_path` | str | `""` | Where to save embedding files |
| `output_format` | str | `"npz"` | `"npz"` or `"pt"` |
| `batch_size` | int | 64 | Inference batch size |

## Best Practices

- **Use TitaNet Large** (the default `model_name`) for best clustering quality; it works well across domains.
- **Target 16 kHz** — the model expects this sample rate; mismatched rates degrade accuracy.
- **Save per-shard embeddings** for large datasets; use `merge_shard_embeddings()` to combine afterwards.
- **GPU recommended** — embedding extraction is compute-intensive; allocate at least 4 GB GPU memory.
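
Conceptually, combining per-shard outputs is a keyed merge. The sketch below is not `merge_shard_embeddings()`, whose signature is not shown here; it assumes each shard has already been loaded as a hypothetical `{utterance_id: embedding}` mapping and shows the duplicate check such a merge typically needs.

```python
def merge_shards(shards):
    """Merge per-shard {utterance_id: embedding} mappings into one dict.

    Raises ValueError if an utterance ID appears in more than one shard.
    """
    merged = {}
    for shard_index, shard in enumerate(shards):
        for utt_id, vector in shard.items():
            if utt_id in merged:
                raise ValueError(f"duplicate utterance {utt_id!r} in shard {shard_index}")
            merged[utt_id] = vector
    return merged


shard_0 = {"utt_0001": [0.1, 0.2], "utt_0002": [0.3, 0.4]}  # toy data
shard_1 = {"utt_0003": [0.5, 0.6]}
merged = merge_shards([shard_0, shard_1])
print(sorted(merged))  # ['utt_0001', 'utt_0002', 'utt_0003']
```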

## Related Topics

- [SCOTCH Clustering](/curate-audio/process-data/quality-filtering/scotch-clustering) — cluster embeddings into speaker groups
- [UTMOSv2 Scoring](/curate-audio/process-data/quality-filtering/utmosv2) — audio quality scoring
@@ -0,0 +1,67 @@
---
description: "Score audio quality using UTMOSv2 Mean Opinion Score prediction"
categories: ["audio-processing"]
tags: ["utmosv2", "mos", "mean-opinion-score", "quality", "audio-metrics"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "audio-only"
---

# UTMOSv2 Scoring

Compute per-utterance Mean Opinion Score (MOS) predictions using the [UTMOSv2](https://github.com/sarulab-speech/UTMOSv2) model. Unlike `UTMOSFilterStage` (which uses the older `utmos22_strong`), this stage supports in-memory waveforms, local files, and remote audio (S3/AIS/HTTP).

## Understanding UTMOSv2

UTMOSv2 predicts a MOS on the standard 1.0–5.0 scale. It handles multiple audio formats (WAV, OPUS, FLAC, MP3) and three input modes:

| Mode | Input | Best For |
|------|-------|----------|
| In-memory waveform | `waveform` + `sample_rate` keys | NeMo tarred / AIS-streamed pipelines |
| Local file path | `audio_filepath` on disk | Pre-downloaded datasets |
| Remote URL | `s3://` / `ais://` / `http://` | Cloud-hosted audio |
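
The three modes can be told apart from the keys of a manifest entry. The sketch below is a hypothetical `detect_input_mode` helper, not the stage's dispatch code; checking the in-memory waveform first and the `"missing"` fallback are assumptions of this example.

```python
REMOTE_PREFIXES = ("s3://", "ais://", "http://")


def detect_input_mode(entry, waveform_key="waveform", audio_filepath_key="audio_filepath"):
    """Classify a manifest entry by the input mode it would use."""
    # Assumed precedence: an in-memory waveform wins over a file path.
    if entry.get(waveform_key) is not None and "sample_rate" in entry:
        return "in_memory"
    path = entry.get(audio_filepath_key)
    if not path:
        return "missing"
    if path.startswith(REMOTE_PREFIXES):
        return "remote"
    return "local"


print(detect_input_mode({"waveform": [0.0, 0.1], "sample_rate": 16000}))  # in_memory
print(detect_input_mode({"audio_filepath": "s3://bucket/a.wav"}))         # remote
print(detect_input_mode({"audio_filepath": "clips/a.wav"}))               # local
```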

## Basic Usage

```python
from nemo_curator.stages.audio.metrics.utmosv2_score import GetUtmosv2ScoreStage

utmos = GetUtmosv2ScoreStage(
    score_key="utmosv2_score",
    sample_rate=16000,
)
pipeline.add_stage(utmos)
```

The stage adds a `utmosv2_score` field to each entry. Entries where audio cannot be loaded receive `NaN`.

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `score_key` | str | `"utmosv2_score"` | Output key for the predicted MOS |
| `audio_filepath_key` | str | `"audio_filepath"` | Input key for file paths |
| `waveform_key` | str | `"waveform"` | Input key for in-memory waveforms |
| `audio_root` | str | `""` | Root directory prepended to relative paths |
| `sample_rate` | int | 16000 | Target sample rate for inference |
| `inference_batch_size` | int | 16 | Batch size for `model.predict()` |
| `num_repetitions` | int | 1 | Test-time augmentation repetitions |
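
As an illustration of the `audio_root` behaviour described in the table, the sketch below prepends a root directory to relative paths only. `resolve_audio_path` is a hypothetical helper; leaving absolute paths and URLs untouched is an assumption of this example rather than documented stage behaviour.

```python
import posixpath


def resolve_audio_path(path: str, audio_root: str = "") -> str:
    """Prepend audio_root to relative paths; leave URLs and absolute paths as-is."""
    if not audio_root or "://" in path or posixpath.isabs(path):
        return path
    return posixpath.join(audio_root, path)


print(resolve_audio_path("clips/a.wav", "/data/audio"))  # /data/audio/clips/a.wav
print(resolve_audio_path("/mnt/a.wav", "/data/audio"))   # /mnt/a.wav
print(resolve_audio_path("s3://bucket/a.wav", "/data"))  # s3://bucket/a.wav
```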

## When to Use UTMOSv2 vs UTMOS

- **UTMOS** (`UTMOSFilterStage`) — older `utmos22_strong` model, integrated threshold-based filtering, simpler API.
- **UTMOSv2** (`GetUtmosv2ScoreStage`) — newer model, supports remote files and in-memory waveforms, scoring only (filter downstream).

For new pipelines, prefer UTMOSv2 for its flexibility. Apply filtering in a subsequent stage based on the score.
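
A downstream filter only needs the score field and a cutoff. The sketch below works on plain dictionaries rather than a pipeline stage; `keep_entry` is a hypothetical helper and the 3.5 cutoff is an arbitrary example value, not a recommendation. Entries scored `NaN` (audio that could not be loaded) are dropped.

```python
import math


def keep_entry(entry, min_score, score_key="utmosv2_score"):
    """Return True if the entry has a valid score at or above min_score."""
    score = entry.get(score_key)
    if score is None or math.isnan(score):
        return False  # missing or failed score
    return score >= min_score


entries = [
    {"audio_filepath": "a.wav", "utmosv2_score": 4.1},
    {"audio_filepath": "b.wav", "utmosv2_score": 2.3},
    {"audio_filepath": "c.wav", "utmosv2_score": float("nan")},
]
kept = [e for e in entries if keep_entry(e, min_score=3.5)]
print([e["audio_filepath"] for e in kept])  # ['a.wav']
```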

## Best Practices

- **GPU recommended** — allocate 1 GPU per stage instance for inference speed.
- **Score first, filter later** — run with no threshold to inspect the MOS distribution before choosing a cutoff.
- **Waveforms are stripped** from output — the `waveform` key is removed after scoring to keep JSONL output clean.
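
For the "score first, filter later" step, a quartile summary is often enough to pick a cutoff. This is a minimal sketch with invented scores and a hypothetical `mos_summary` helper; it uses the standard library's `statistics.quantiles` and counts `NaN` scores separately as failures.

```python
import math
import statistics


def mos_summary(scores):
    """Summarize a list of MOS predictions, ignoring NaN entries."""
    valid = sorted(s for s in scores if not math.isnan(s))
    q1, median, q3 = statistics.quantiles(valid, n=4)
    return {
        "count": len(valid),
        "failed": len(scores) - len(valid),  # entries scored NaN
        "min": valid[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": valid[-1],
    }


scores = [1.8, 2.4, 3.1, 3.3, 3.6, 3.9, 4.2, 4.4, float("nan")]  # toy values
summary = mos_summary(scores)
print(summary["count"], summary["failed"], round(summary["median"], 2))  # 8 1 3.45
```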

## Related Topics

- [UTMOS Filter](/curate-audio/process-data/quality-filtering/utmos) — threshold-based filtering with the older model
- [SCOTCH Clustering](/curate-audio/process-data/quality-filtering/scotch-clustering) — combine quality scores with speaker clustering
6 changes: 6 additions & 0 deletions nemo_curator/stages/audio/__init__.py
@@ -39,6 +39,9 @@
"PreserveByValueStage": "nemo_curator.stages.audio.common",
"SIGMOSFilterStage": "nemo_curator.stages.audio.filtering",
"SegmentConcatenationStage": "nemo_curator.stages.audio.preprocessing",
"SpeakerClusteringStage": "nemo_curator.stages.audio.speaker_id",
"SpeakerEmbeddingLhotseStage": "nemo_curator.stages.audio.speaker_id",
"SpeakerEmbeddingRequestStage": "nemo_curator.stages.audio.speaker_id",
"SpeakerSeparationStage": "nemo_curator.stages.audio.segmentation",
"TimestampMapperStage": "nemo_curator.stages.audio.postprocessing",
"UTMOSFilterStage": "nemo_curator.stages.audio.filtering",
@@ -57,6 +60,9 @@
"PreserveByValueStage",
"SIGMOSFilterStage",
"SegmentConcatenationStage",
"SpeakerClusteringStage",
"SpeakerEmbeddingLhotseStage",
"SpeakerEmbeddingRequestStage",
"SpeakerSeparationStage",
"TimestampMapperStage",
"UTMOSFilterStage",
1 change: 1 addition & 0 deletions nemo_curator/stages/audio/metrics/__init__.py
@@ -22,6 +22,7 @@
"ComputeWERStage": "nemo_curator.stages.audio.metrics.wer",
"GetPairwiseWerStage": "nemo_curator.stages.audio.metrics.wer",
"TorchSquimQualityMetricsStage": "nemo_curator.stages.audio.metrics.squim",
"GetUtmosv2ScoreStage": "nemo_curator.stages.audio.metrics.utmosv2_score",
}

_cache: dict[str, Any] = {}