A pure Rust implementation of the pyannote speaker diarization pipeline. All inference and clustering logic is written in Rust, running ONNX models via the ort crate. No Python runtime is required at inference time.
The scripts/pyannote/ directory contains Python utilities for one-time model preparation only — downloading pretrained models from HuggingFace, exporting them to ONNX format, and precomputing PLDA parameters. These scripts are not needed at runtime.
Produces "who spoke when" annotations with VBx clustering (PLDA + VB-HMM), achieving 5-7% DER against the Python pyannote reference at up to 13.6x faster inference. Also provides streaming Voice Activity Detection (VAD) as a standalone mode.
Core inference stack (specified in Cargo.toml):

| Crate | Purpose |
|---|---|
| ort 2.0.0-rc.12 | ONNX Runtime bindings |
| mel_spec 0.3 | 80-dim fbank feature extraction |
| kodama 0.2 | Hierarchical clustering (centroid linkage for AHC init) |
| ndarray 0.16 | N-dimensional arrays for PLDA/VBx math |
| ndarray-npy 0.9 | Load .npz PLDA model files |
| thiserror 2 | Error type derivation |
Optional feature: coreml — enables CoreML execution provider on Apple Silicon (currently falls back to CPU due to model compatibility issues).
Download via scripts/pyannote/download_models.py (requires a HuggingFace token):

```
models/speaker-diarization-community-1/
    segmentation/model.onnx   # 5.6 MB — pyannote segmentation-3.0
    embedding/model.onnx      # 25 MB — WeSpeaker ResNet34
    plda/vbx_model.npz        # PLDA parameters (precomputed)
```
The PLDA file is generated by scripts/pyannote/precompute_plda.py from the original pyannote model weights.
Scripts in scripts/pyannote/ handle model preparation and reference output generation:
```shell
pip install -r scripts/pyannote/requirements.txt
python scripts/pyannote/download_models.py
python scripts/pyannote/export_onnx.py
python scripts/pyannote/precompute_plda.py
```

VAD-only mode loads just the segmentation model (5.6 MB). It buffers 10s windows internally and triggers ONNX inference at each 2.5s step.
```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new().with_vad_only(true);
let mut detector = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

// Feed audio chunks (any size)
for chunk in audio_chunks {
    if let Some(range) = detector.detect(chunk)? {
        println!("Voice: {} to {}", range.start.as_secs_f64(), range.end.as_secs_f64());
    }
}

// Flush trailing audio
if let Some(range) = detector.finish() {
    println!("Voice: {range}");
}
```

Returns VoiceRange { start: Duration, end: Duration } — at most one range per call.
Requires segmentation + embedding + PLDA models. Four output modes:
```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new()
    .with_embedding_model_path(Some("models/.../embedding/model.onnx".into()))
    .with_plda_model("models/.../plda/vbx_model.npz".into());
let mut diarizer = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

let pcm: &[f32] = /* 16kHz mono f32 PCM */;
```

| Method | Return Type | Description |
|---|---|---|
| diarize(pcm) | Vec<SpeakerSegment> | Exclusive mode — one speaker per frame, suitable for STT alignment |
| diarize_timed(pcm) | (Vec<SpeakerSegment>, DiarizeTiming) | Same + per-step timing breakdown |
| diarize_with_embeddings(pcm) | (Vec<SpeakerSegment>, HashMap<u32, Vec<f32>>) | Same + 256-dim mean embedding per speaker (for cross-file speaker matching) |
| diarize_overlap(pcm) | Vec<DiarizeSegment> | Overlap mode — multiple records for overlapping speech, compatible with Python pyannote output |
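The per-speaker mean embeddings returned by diarize_with_embeddings can be compared across files with cosine similarity to decide whether two diarized speakers are the same person. A minimal sketch, assuming nothing beyond the HashMap<u32, Vec<f32>> shape above; the helper names and the threshold value are hypothetical, not part of the crate's API:

```rust
use std::collections::HashMap;

// Cosine similarity between two embeddings. WeSpeaker outputs are already
// L2-normalized, so the dot product alone would suffice; normalizing again
// makes the helper safe for arbitrary vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Match one speaker embedding from a new file against known speakers.
// `threshold` is an illustrative tuning knob.
fn match_speakers(
    known: &HashMap<u32, Vec<f32>>,
    candidate: &[f32],
    threshold: f32,
) -> Option<u32> {
    known
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(emb, candidate)))
        .filter(|(_, sim)| *sim >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(id, _)| id)
}

fn main() {
    let mut known = HashMap::new();
    known.insert(0u32, vec![1.0, 0.0, 0.0]);
    known.insert(1u32, vec![0.0, 1.0, 0.0]);
    // A candidate close to speaker 1 matches; an orthogonal one does not
    println!("{:?}", match_speakers(&known, &[0.1, 0.99, 0.0], 0.8));
    println!("{:?}", match_speakers(&known, &[0.0, 0.0, 1.0], 0.8));
}
```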
```rust
// Exclusive mode output
pub struct SpeakerSegment {
    pub start: Duration,   // inclusive
    pub end: Duration,     // exclusive
    pub speaker_id: u32,   // 0-based, assigned by clustering
    pub is_overlap: bool,  // true if 2+ speakers active in this segment
}

// Overlap mode output
pub struct DiarizeSegment {
    pub start: Duration,
    pub end: Duration,
    pub speaker_id: u32,   // same timespan may have multiple records
}

// Timing breakdown
pub struct DiarizeTiming {
    pub segmentation_ms: u64,  // sliding window ONNX + powerset decode
    pub embedding_ms: u64,     // per-segment WeSpeaker inference
    pub clustering_ms: u64,    // VBx (PLDA + VB-HMM) + label reconstruction
    pub total_ms: u64,
}
```

TieguanyinOolongOptions supports the builder pattern:
| Option | Default | Description |
|---|---|---|
| vad_onset | 0.5 | VAD activation threshold |
| vad_offset | 0.357 | VAD deactivation threshold (hysteresis) |
| min_duration_off | 0s | Minimum silence gap to split segments |
| embedding_exclude_overlap | true | Skip overlapping regions for cleaner embeddings |
| clustering_threshold | 0.6 | VBx clustering threshold |
| intra_threads | None | ONNX Runtime intra-op thread count |
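The vad_onset/vad_offset pair implements dual-threshold hysteresis: speech activates only above the higher onset threshold and deactivates only below the lower offset threshold, so brief dips in the frame-level speech probability do not split an active voice region. A minimal sketch of that binarization logic, using the defaults above; the function is illustrative, not the crate's internal code:

```rust
// Binarize a frame-level speech probability stream with hysteresis.
fn hysteresis_binarize(probs: &[f32], onset: f32, offset: f32) -> Vec<bool> {
    let mut active = false;
    probs
        .iter()
        .map(|&p| {
            if !active && p >= onset {
                active = true; // activate only above the higher threshold
            } else if active && p < offset {
                active = false; // deactivate only below the lower threshold
            }
            active
        })
        .collect()
}

fn main() {
    // 0.45 stays inactive (below onset 0.5); 0.40 stays active (above offset 0.357)
    let probs = [0.1, 0.45, 0.6, 0.40, 0.2];
    println!("{:?}", hysteresis_binarize(&probs, 0.5, 0.357));
}
```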
```
PCM f32 (16kHz mono)
 |
 +-- [VAD path] detect() / finish()
 |     +-- 10s sliding window -> segmentation ONNX -> powerset decode
 |         -> dual-threshold hysteresis (onset/offset) -> VoiceRange
 |
 +-- [Diarization path] diarize() / diarize_timed() / ...
       |
       +-- Step 1: run_segmentation_loop_batched()
       |     Sliding window (10s, step 2.5s) -> batch ONNX (batch=16)
       |     -> 7-class log-softmax powerset -> per-speaker binary activity
       |     -> center-crop stitching to global timeline
       |
       +-- Step 2: collect_embeddings()
       |     Active frames -> fbank features (80-dim, x32768 scaling)
       |     -> fixed 200-frame resampling -> embedding ONNX -> 256-dim L2-normalized
       |
       +-- Step 3: vbx_cluster_embeddings()
       |     PLDA transform (256->128) -> AHC initialization
       |     -> VB-HMM iteration (20 rounds) -> global speaker labels
       |
       +-- Step 4: reconstruct_frame_speaker_probs() -> build_speaker_segments()
             Center-crop overlap-add frame reconstruction
             -> argmax winner speaker -> sort + merge adjacent same-speaker segments
```
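The powerset decode in Step 1 can be sketched as follows. pyannote segmentation-3.0 emits 7 classes per frame: silence, three single speakers, and three speaker pairs. Taking the argmax class and expanding it through a class-to-speakers table yields per-speaker binary activity. The mapping below follows the standard pyannote powerset ordering as an assumption; the crate's internal layout may differ:

```rust
// Class index -> set of active speakers (7-class powerset, 3 speakers max,
// at most 2 simultaneous).
const POWERSET: [&[usize]; 7] = [
    &[],                        // silence
    &[0], &[1], &[2],           // single speakers
    &[0, 1], &[0, 2], &[1, 2],  // overlapping pairs
];

fn decode_powerset_frame(log_probs: &[f32; 7]) -> [bool; 3] {
    // argmax over the 7 powerset classes
    let best = log_probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap();
    let mut activity = [false; 3];
    for &spk in POWERSET[best] {
        activity[spk] = true;
    }
    activity
}

fn main() {
    // Class 4 ({0,1}) wins -> speakers 0 and 1 both active
    let frame = [-5.0, -3.0, -4.0, -6.0, -0.5, -7.0, -8.0];
    println!("{:?}", decode_powerset_frame(&frame));
}
```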
| File | Description |
|---|---|
| src/lib.rs | Public API and re-exports |
| src/types.rs | SpeakerSegment, DiarizeSegment, DiarizeTiming, VoiceRange |
| src/diarizer.rs | Core implementation (~3100 lines): VAD state machine, ONNX inference, VBx clustering, PLDA |
| scripts/pyannote/ | Python model toolchain (download, export, PLDA precompute, reference generation) |
Test platform: Apple M1 Max (8P+2E, 10 cores), 32 GB RAM, macOS, --release build.
| Mode | Models | Load time | RSS delta |
|---|---|---|---|
| VAD-only | segmentation (5.6 MB) | 27 ms | +33 MB |
| Full diarization | seg + emb + PLDA | 42 ms | +68 MB |
| Audio | Duration | Seg | Emb | Clus | Total | RTF | Speakers | Segments |
|---|---|---|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.7s | 24.3s | 4ms | 27.0s | 0.119 | 2 | 42 |
| Sound effects (bell) | 0:01 | 6ms | 3ms | 0ms | 9ms | 0.006 | 0 | 0 |
| Ambient (thunderstorm) | 2:31 | 1.8s | 3.1s | 0ms | 4.9s | 0.032 | 0 | 0 |
| Movie soundtrack | 18:18 | 13.7s | 57.0s | 46ms | 70.7s | 0.064 | 6 | 128 |
| Podcast (concatenated) | 85:24 | 62.9s | 690s | 34.5s | 787s | 0.154 | 2 | 1308 |
Non-speech audio (sound effects, ambient) correctly produces 0 speakers, 0 segments.
| Audio | Duration | VAD Time | VAD RTF | Voice Ranges | Voice % |
|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.9s | 0.013 | 32 | 69% |
| Sound effects (bell) | 0:01 | 5ms | 0.003 | 0 | 0% |
| Ambient (thunderstorm) | 2:31 | 1.9s | 0.013 | 0 | 0% |
| Movie soundtrack | 18:18 | 14.7s | 0.013 | 138 | 22% |
| Podcast (concatenated) | 85:24 | 69.0s | 0.014 | 881 | 85% |
VAD RTF stable at 0.013 (~10x faster than full diarization).
Same model (speaker-diarization-community-1). Python uses PyTorch inference.
| Audio | Python | Rust | Speedup | Speakers | DER |
|---|---|---|---|---|---|
| Dual-speaker (3:47) | 367.2s | 27.0s | 13.6x | 2 = 2 | 6.33% |
| Podcast 1 (42:51) | 4093.2s | ~395s | ~10.4x | 2 = 2 | 5.33% |
| Podcast 2 (8:31) | 840.9s | — | — | 2 = 2 | 6.21% |
| Podcast 3 (34:01) | 2647.3s | — | — | 2 = 2 | 6.61% |
Embedding inference accounts for ~80% of total time (per-segment sequential ONNX calls). For an 85-minute podcast, embedding alone takes 690s out of 787s total. The bottleneck is the ~4000 sequential ONNX calls (one per active speech segment) — each call is fast, but the count scales linearly with speech density.
| Optimization | Effect | Details |
|---|---|---|
| Fbank caching | Avoid rebuilding Fbank instance per embedding call | Fbank initialized once in struct, reused across all chunks |
| Segmentation batching | N single-window ONNX calls → ceil(N/16) batch calls | [N, 1, 160000] input tensor, batch size 16 |
| Fixed-frame resampling | Eliminate ONNX Runtime per-shape JIT compilation | All embedding inputs normalized to 200 frames — without this, ORT recompiles kernels for each unique shape, causing 14+ min inference on 18-min audio |
| Contiguous memory layout | Reduce heap allocations in frame reconstruction | Frame probability matrix changed from Vec<Vec<...>> to flat Vec<f32> row-major |
| Center-crop overlap-add | Reduce boundary artifacts in window stitching | Shared reconstruct_frame_speaker_probs() function ensures consistent behavior across all diarization paths |
| ort thread configuration | Reduce multi-core contention | Configurable intra_threads for ONNX Runtime intra-op parallelism |
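The fixed-frame resampling row above is the single largest win: without it, ONNX Runtime recompiles kernels for every unique input shape. A minimal sketch of normalizing a variable-length fbank matrix (n_frames x 80) to exactly 200 frames, assuming nearest-index sampling (the crate may interpolate differently):

```rust
// Resample a variable-length feature matrix to a fixed frame count so the
// ONNX session always sees one input shape and compiles its kernels once.
fn resample_to_fixed(frames: &[Vec<f32>], target: usize) -> Vec<Vec<f32>> {
    let n = frames.len(); // assumed non-zero for this sketch
    (0..target)
        .map(|i| {
            // map output index i onto the input range [0, n)
            let src = (i * n) / target;
            frames[src].clone()
        })
        .collect()
}

fn main() {
    // 5 input frames (feature dim 2 for brevity) -> 200 output frames
    let frames: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32; 2]).collect();
    let fixed = resample_to_fixed(&frames, 200);
    println!("{} frames, first = {:?}", fixed.len(), fixed[0]);
}
```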
| Optimization | Why it failed |
|---|---|
| CoreML EP for segmentation | MLProgram compilation fails — SincConv1D operator topology not accepted by CoreML's model parser. Error: "Operations are expected to be topologically sorted". Auto-fallback to CPU implemented. |
| CoreML EP for embedding | CoreML model compilation changes numerical behavior. Tested both CPUAndNeuralEngine (ANE float16) and CPUAndGPU (GPU float32) — both degrade embedding precision enough that PLDA clustering produces 1 speaker instead of 2. Explicitly disabled. |
| IOBinding | Requires CoreML EP to be active for device-side tensor allocation. Since neither model uses CoreML, IOBinding provides no benefit. Code removed. |
| Embedding batching | WeSpeaker model supports dynamic batch, but active frame counts range 20–997 per segment. Zero-padding waste is enormous — batch version is slower and uses more memory than sequential calls. Config field retained but unused. |
| TensorRef zero-copy | ort 2.0.0-rc.12's TensorRef::from_array_view produces corrupted inference results (4 speakers vs expected 2). Reverted to Tensor::from_array(to_vec()) with data copy. |
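The embedding-batching failure above comes down to padding economics: every zero-padded batch must match its longest member, so with segment lengths spanning 20-997 frames most of the computed work is padding. A back-of-envelope sketch (the lengths below are illustrative, not measurements from the crate):

```rust
// Fraction of a zero-padded batch's compute spent on padding: each item is
// padded to the batch maximum, so waste = 1 - (sum of lengths) / (max * count).
fn padding_waste(lengths: &[usize]) -> f64 {
    let max = *lengths.iter().max().unwrap();
    let useful: usize = lengths.iter().sum();
    let padded = max * lengths.len();
    1.0 - useful as f64 / padded as f64
}

fn main() {
    // A batch mixing one long segment with many short ones
    let lengths = [997, 40, 25, 60, 20, 35, 50, 30];
    println!("wasted fraction: {:.2}", padding_waste(&lengths));
}
```

With one 997-frame segment in a batch of mostly short segments, over 80% of the batched compute is padding, which is why sequential per-segment calls win.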
- Smaller embedding model (e.g., ECAPA-TDNN) — fewer parameters, faster per-call
- INT8 quantization — reduce embedding ONNX compute per call
- Segment merging before embedding — reduce total call count by pre-clustering adjacent short segments
- Accept current RTF 0.06–0.15 as sufficient for offline video indexing
MIT