A pure Rust implementation of the pyannote speaker diarization pipeline. All inference and clustering logic is written in Rust, running ONNX models via the ort crate. No Python runtime is required at inference time.
The scripts/pyannote/ directory contains Python utilities for one-time model preparation only — downloading pretrained models from HuggingFace, exporting them to ONNX format, and precomputing PLDA parameters. These scripts are not needed at runtime.
Produces "who spoke when" annotations with VBx clustering (PLDA + VB-HMM), achieving 5-7% DER against the Python pyannote reference at up to 13.6x faster inference. Also provides streaming Voice Activity Detection (VAD) as a standalone mode.
Core inference stack (specified in Cargo.toml):

| Crate | Purpose |
|---|---|
| ort 2.0.0-rc.12 | ONNX Runtime bindings |
| mel_spec 0.3 | 80-dim fbank feature extraction |
| kodama 0.2 | Hierarchical clustering (centroid linkage for AHC init) |
| ndarray 0.16 | N-dimensional arrays for PLDA/VBx math |
| ndarray-npy 0.9 | Load .npz PLDA model files |
| thiserror 2 | Error type derivation |
Optional feature: coreml — enables CoreML execution provider on Apple Silicon (currently falls back to CPU due to model compatibility issues).
Download via scripts/pyannote/download_models.py (requires a HuggingFace token):

```
models/speaker-diarization-community-1/
    segmentation/model.onnx   # 5.6 MB — pyannote segmentation-3.0
    embedding/model.onnx      # 25 MB — WeSpeaker ResNet34
    plda/vbx_model.npz        # PLDA parameters (precomputed)
```
The PLDA file is generated by scripts/pyannote/precompute_plda.py from the original pyannote model weights.
Scripts in scripts/pyannote/ handle model preparation and reference output generation:
```shell
pip install -r scripts/pyannote/requirements.txt
python scripts/pyannote/download_models.py
python scripts/pyannote/export_onnx.py
python scripts/pyannote/precompute_plda.py
```

VAD-only mode loads just the segmentation model (5.6 MB). It buffers 10s windows internally and triggers ONNX inference at each 2.5s step.
```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new().with_vad_only(true);
let mut detector = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

// Feed audio chunks (any size)
for chunk in audio_chunks {
    if let Some(range) = detector.detect(chunk)? {
        println!("Voice: {} to {}", range.start.as_secs_f64(), range.end.as_secs_f64());
    }
}

// Flush trailing audio
if let Some(range) = detector.finish() {
    println!("Voice: {range}");
}
```

Returns VoiceRange { start: Duration, end: Duration } — at most one range per call.
Requires segmentation + embedding + PLDA models. Four output modes:
```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new()
    .with_embedding_model_path(Some("models/.../embedding/model.onnx".into()))
    .with_plda_model("models/.../plda/vbx_model.npz".into());
let mut diarizer = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

let pcm: &[f32] = /* 16kHz mono f32 PCM */;
```

| Method | Return Type | Description |
|---|---|---|
| diarize(pcm) | Vec<SpeakerSegment> | Exclusive mode — one speaker per frame, suitable for STT alignment |
| diarize_timed(pcm) | (Vec<SpeakerSegment>, DiarizeTiming) | Same + per-step timing breakdown |
| diarize_with_embeddings(pcm) | (Vec<SpeakerSegment>, HashMap<u32, Vec<f32>>) | Same + 256-dim mean embedding per speaker (for cross-file speaker matching) |
| diarize_overlap(pcm) | Vec<DiarizeSegment> | Overlap mode — multiple records for overlapping speech, compatible with Python pyannote output |
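The per-speaker mean embeddings returned by diarize_with_embeddings can be compared across files with cosine similarity to decide whether two diarized speakers are the same person. A minimal sketch, assuming nothing beyond the HashMap<u32, Vec<f32>> shape above; the helper names and the threshold value are hypothetical, not part of the crate's API:

```rust
use std::collections::HashMap;

// Cosine similarity between two embeddings. WeSpeaker outputs are already
// L2-normalized, so the dot product alone would suffice; normalizing again
// makes the helper safe for arbitrary vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Match one speaker embedding from a new file against known speakers.
// `threshold` is an illustrative tuning knob.
fn match_speakers(
    known: &HashMap<u32, Vec<f32>>,
    candidate: &[f32],
    threshold: f32,
) -> Option<u32> {
    known
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(emb, candidate)))
        .filter(|(_, sim)| *sim >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(id, _)| id)
}

fn main() {
    let mut known = HashMap::new();
    known.insert(0u32, vec![1.0, 0.0, 0.0]);
    known.insert(1u32, vec![0.0, 1.0, 0.0]);
    // A candidate close to speaker 1 matches; an orthogonal one does not
    println!("{:?}", match_speakers(&known, &[0.1, 0.99, 0.0], 0.8));
    println!("{:?}", match_speakers(&known, &[0.0, 0.0, 1.0], 0.8));
}
```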
```rust
// Exclusive mode output
pub struct SpeakerSegment {
    pub start: Duration,   // inclusive
    pub end: Duration,     // exclusive
    pub speaker_id: u32,   // 0-based, assigned by clustering
    pub is_overlap: bool,  // true if 2+ speakers active in this segment
}

// Overlap mode output
pub struct DiarizeSegment {
    pub start: Duration,
    pub end: Duration,
    pub speaker_id: u32,   // same timespan may have multiple records
}

// Timing breakdown
pub struct DiarizeTiming {
    pub segmentation_ms: u64,  // sliding window ONNX + powerset decode
    pub embedding_ms: u64,     // per-segment WeSpeaker inference
    pub clustering_ms: u64,    // VBx (PLDA + VB-HMM) + label reconstruction
    pub total_ms: u64,
}
```

TieguanyinOolongOptions supports the builder pattern:
| Option | Default | Description |
|---|---|---|
| vad_onset | 0.5 | VAD activation threshold |
| vad_offset | 0.357 | VAD deactivation threshold (hysteresis) |
| min_duration_off | 0s | Minimum silence gap to split segments |
| embedding_exclude_overlap | true | Skip overlapping regions for cleaner embeddings |
| clustering_threshold | 0.6 | VBx clustering threshold |
| intra_threads | None | ONNX Runtime intra-op thread count |
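The vad_onset/vad_offset pair implements dual-threshold hysteresis: speech activates only above the higher onset threshold and deactivates only below the lower offset threshold, so brief dips in the frame-level speech probability do not split an active voice region. A minimal sketch of that binarization logic, using the defaults above; the function is illustrative, not the crate's internal code:

```rust
// Binarize a frame-level speech probability stream with hysteresis.
fn hysteresis_binarize(probs: &[f32], onset: f32, offset: f32) -> Vec<bool> {
    let mut active = false;
    probs
        .iter()
        .map(|&p| {
            if !active && p >= onset {
                active = true; // activate only above the higher threshold
            } else if active && p < offset {
                active = false; // deactivate only below the lower threshold
            }
            active
        })
        .collect()
}

fn main() {
    // 0.45 stays inactive (below onset 0.5); 0.40 stays active (above offset 0.357)
    let probs = [0.1, 0.45, 0.6, 0.40, 0.2];
    println!("{:?}", hysteresis_binarize(&probs, 0.5, 0.357));
}
```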
```
PCM f32 (16kHz mono)
 |
 +-- [VAD path] detect() / finish()
 |     +-- 10s sliding window -> segmentation ONNX -> powerset decode
 |         -> dual-threshold hysteresis (onset/offset) -> VoiceRange
 |
 +-- [Diarization path] diarize() / diarize_timed() / ...
       |
       +-- Step 1: run_segmentation_loop_batched()
       |     Sliding window (10s, step 2.5s) -> batch ONNX (batch=16)
       |     -> 7-class log-softmax powerset -> per-speaker binary activity
       |     -> center-crop stitching to global timeline
       |
       +-- Step 2: collect_embeddings()
       |     Active frames -> fbank features (80-dim, x32768 scaling)
       |     -> fixed 200-frame resampling -> embedding ONNX -> 256-dim L2-normalized
       |
       +-- Step 3: vbx_cluster_embeddings()
       |     PLDA transform (256->128) -> AHC initialization
       |     -> VB-HMM iteration (20 rounds) -> global speaker labels
       |
       +-- Step 4: reconstruct_frame_speaker_probs() -> build_speaker_segments()
             Center-crop overlap-add frame reconstruction
             -> argmax winner speaker -> sort + merge adjacent same-speaker segments
```
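The powerset decode in Step 1 can be sketched as follows. pyannote segmentation-3.0 emits 7 classes per frame: silence, three single speakers, and three speaker pairs. Taking the argmax class and expanding it through a class-to-speakers table yields per-speaker binary activity. The mapping below follows the standard pyannote powerset ordering as an assumption; the crate's internal layout may differ:

```rust
// Class index -> set of active speakers (7-class powerset, 3 speakers max,
// at most 2 simultaneous).
const POWERSET: [&[usize]; 7] = [
    &[],                        // silence
    &[0], &[1], &[2],           // single speakers
    &[0, 1], &[0, 2], &[1, 2],  // overlapping pairs
];

fn decode_powerset_frame(log_probs: &[f32; 7]) -> [bool; 3] {
    // argmax over the 7 powerset classes
    let best = log_probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap();
    let mut activity = [false; 3];
    for &spk in POWERSET[best] {
        activity[spk] = true;
    }
    activity
}

fn main() {
    // Class 4 ({0,1}) wins -> speakers 0 and 1 both active
    let frame = [-5.0, -3.0, -4.0, -6.0, -0.5, -7.0, -8.0];
    println!("{:?}", decode_powerset_frame(&frame));
}
```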
| File | Description |
|---|---|
| src/lib.rs | Public API and re-exports |
| src/types.rs | SpeakerSegment, DiarizeSegment, DiarizeTiming, VoiceRange |
| src/diarizer.rs | Core implementation (~3100 lines): VAD state machine, ONNX inference, VBx clustering, PLDA |
| scripts/pyannote/ | Python model toolchain (download, export, PLDA precompute, reference generation) |
Test platform: Apple M1 Max (8P+2E, 10 cores), 32 GB RAM, macOS, --release build.
| Mode | Models | Load time | RSS delta |
|---|---|---|---|
| VAD-only | segmentation (5.6 MB) | 27 ms | +33 MB |
| Full diarization | seg + emb + PLDA | 42 ms | +68 MB |
| Audio | Duration | Seg | Emb | Clus | Total | RTF | Speakers | Segments |
|---|---|---|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.7s | 24.3s | 4ms | 27.0s | 0.119 | 2 | 42 |
| Sound effects (bell) | 0:01 | 6ms | 3ms | 0ms | 9ms | 0.006 | 0 | 0 |
| Ambient (thunderstorm) | 2:31 | 1.8s | 3.1s | 0ms | 4.9s | 0.032 | 0 | 0 |
| Movie soundtrack | 18:18 | 13.7s | 57.0s | 46ms | 70.7s | 0.064 | 6 | 128 |
| Podcast (concatenated) | 85:24 | 62.9s | 690s | 34.5s | 787s | 0.154 | 2 | 1308 |
Non-speech audio (sound effects, ambient) correctly produces 0 speakers, 0 segments.
| Audio | Duration | VAD Time | VAD RTF | Voice Ranges | Voice % |
|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.9s | 0.013 | 32 | 69% |
| Sound effects (bell) | 0:01 | 5ms | 0.003 | 0 | 0% |
| Ambient (thunderstorm) | 2:31 | 1.9s | 0.013 | 0 | 0% |
| Movie soundtrack | 18:18 | 14.7s | 0.013 | 138 | 22% |
| Podcast (concatenated) | 85:24 | 69.0s | 0.014 | 881 | 85% |
VAD RTF stable at 0.013 (~10x faster than full diarization).
Same model (speaker-diarization-community-1). Python uses PyTorch inference.
| Audio | Python | Rust | Speedup | Speakers | DER |
|---|---|---|---|---|---|
| Dual-speaker (3:47) | 367.2s | 27.0s | 13.6x | 2 = 2 | 6.33% |
| Podcast 1 (42:51) | 4093.2s | ~395s | ~10.4x | 2 = 2 | 5.33% |
| Podcast 2 (8:31) | 840.9s | — | — | 2 = 2 | 6.21% |
| Podcast 3 (34:01) | 2647.3s | — | — | 2 = 2 | 6.61% |
Embedding inference accounts for ~80% of total time (per-segment sequential ONNX calls). For an 85-minute podcast, embedding alone takes 690s out of 787s total. The bottleneck is the ~4000 sequential ONNX calls (one per active speech segment) — each call is fast, but the count scales linearly with speech density.
| Optimization | Effect | Details |
|---|---|---|
| Fbank caching | Avoid rebuilding Fbank instance per embedding call | Fbank initialized once in struct, reused across all chunks |
| Segmentation batching | N single-window ONNX calls → ceil(N/16) batch calls | [N, 1, 160000] input tensor, batch size 16 |
| Fixed-frame resampling | Eliminate ONNX Runtime per-shape JIT compilation | All embedding inputs normalized to 200 frames — without this, ORT recompiles kernels for each unique shape, causing 14+ min inference on 18-min audio |
| Contiguous memory layout | Reduce heap allocations in frame reconstruction | Frame probability matrix changed from Vec<Vec<...>> to flat Vec<f32> row-major |
| Center-crop overlap-add | Reduce boundary artifacts in window stitching | Shared reconstruct_frame_speaker_probs() function ensures consistent behavior across all diarization paths |
| ort thread configuration | Reduce multi-core contention | Configurable intra_threads for ONNX Runtime intra-op parallelism |
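The fixed-frame resampling row above is the single largest win: without it, ONNX Runtime recompiles kernels for every unique input shape. A minimal sketch of normalizing a variable-length fbank matrix (n_frames x 80) to exactly 200 frames, assuming nearest-index sampling (the crate may interpolate differently):

```rust
// Resample a variable-length feature matrix to a fixed frame count so the
// ONNX session always sees one input shape and compiles its kernels once.
fn resample_to_fixed(frames: &[Vec<f32>], target: usize) -> Vec<Vec<f32>> {
    let n = frames.len(); // assumed non-zero for this sketch
    (0..target)
        .map(|i| {
            // map output index i onto the input range [0, n)
            let src = (i * n) / target;
            frames[src].clone()
        })
        .collect()
}

fn main() {
    // 5 input frames (feature dim 2 for brevity) -> 200 output frames
    let frames: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32; 2]).collect();
    let fixed = resample_to_fixed(&frames, 200);
    println!("{} frames, first = {:?}", fixed.len(), fixed[0]);
}
```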
| Optimization | Why it failed |
|---|---|
| CoreML EP for segmentation | MLProgram compilation fails — SincConv1D operator topology not accepted by CoreML's model parser. Error: "Operations are expected to be topologically sorted". Auto-fallback to CPU implemented. |
| CoreML EP for embedding | CoreML model compilation changes numerical behavior. Tested both CPUAndNeuralEngine (ANE float16) and CPUAndGPU (GPU float32) — both degrade embedding precision enough that PLDA clustering produces 1 speaker instead of 2. Explicitly disabled. |
| IOBinding | Requires CoreML EP to be active for device-side tensor allocation. Since neither model uses CoreML, IOBinding provides no benefit. Code removed. |
| Embedding batching | WeSpeaker model supports dynamic batch, but active frame counts range 20–997 per segment. Zero-padding waste is enormous — batch version is slower and uses more memory than sequential calls. Config field retained but unused. |
| TensorRef zero-copy | ort 2.0.0-rc.12's TensorRef::from_array_view produces corrupted inference results (4 speakers vs expected 2). Reverted to Tensor::from_array(to_vec()) with data copy. |
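The embedding-batching failure above comes down to padding economics: every zero-padded batch must match its longest member, so with segment lengths spanning 20-997 frames most of the computed work is padding. A back-of-envelope sketch (the lengths below are illustrative, not measurements from the crate):

```rust
// Fraction of a zero-padded batch's compute spent on padding: each item is
// padded to the batch maximum, so waste = 1 - (sum of lengths) / (max * count).
fn padding_waste(lengths: &[usize]) -> f64 {
    let max = *lengths.iter().max().unwrap();
    let useful: usize = lengths.iter().sum();
    let padded = max * lengths.len();
    1.0 - useful as f64 / padded as f64
}

fn main() {
    // A batch mixing one long segment with many short ones
    let lengths = [997, 40, 25, 60, 20, 35, 50, 30];
    println!("wasted fraction: {:.2}", padding_waste(&lengths));
}
```

With one 997-frame segment in a batch of mostly short segments, over 80% of the batched compute is padding, which is why sequential per-segment calls win.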
- Smaller embedding model (e.g., ECAPA-TDNN) — fewer parameters, faster per-call
- INT8 quantization — reduce embedding ONNX compute per call
- Segment merging before embedding — reduce total call count by pre-clustering adjacent short segments
- Accept current RTF 0.06–0.15 as sufficient for offline video indexing
MIT