
# Oolong — Speaker Diarization in Rust

A pure Rust implementation of the pyannote speaker diarization pipeline. All inference and clustering logic is written in Rust, running ONNX models via the `ort` crate. No Python runtime is required at inference time.

The `scripts/pyannote/` directory contains Python utilities for one-time model preparation only — downloading pretrained models from HuggingFace, exporting them to ONNX format, and precomputing PLDA parameters. These scripts are not needed at runtime.

Oolong produces "who spoke when" annotations using VBx clustering (PLDA + VB-HMM), achieving a DER of 5-7% against the Python pyannote reference at up to 13.6x faster inference. It also provides streaming Voice Activity Detection (VAD) as a standalone mode.

## Dependencies & Models

### Rust Dependencies

Core inference stack (specified in `Cargo.toml`):

| Crate | Purpose |
|---|---|
| `ort` 2.0.0-rc.12 | ONNX Runtime bindings |
| `mel_spec` 0.3 | 80-dim fbank feature extraction |
| `kodama` 0.2 | Hierarchical clustering (centroid linkage for AHC init) |
| `ndarray` 0.16 | N-dimensional arrays for PLDA/VBx math |
| `ndarray-npy` 0.9 | Load `.npz` PLDA model files |
| `thiserror` 2 | Error type derivation |

Optional feature: `coreml` — enables the CoreML execution provider on Apple Silicon (currently falls back to CPU due to model-compatibility issues).

### ONNX Models

Download via `scripts/pyannote/download_models.py` (requires a HuggingFace token):

```
models/speaker-diarization-community-1/
  segmentation/model.onnx   # 5.6 MB — pyannote segmentation-3.0
  embedding/model.onnx      # 25 MB  — WeSpeaker ResNet34
  plda/vbx_model.npz        # PLDA parameters (precomputed)
```

The PLDA file is generated by `scripts/pyannote/precompute_plda.py` from the original pyannote model weights.

### Python Toolchain (one-time setup)

Scripts in `scripts/pyannote/` handle model preparation and reference-output generation:

```sh
pip install -r scripts/pyannote/requirements.txt
python scripts/pyannote/download_models.py
python scripts/pyannote/export_onnx.py
python scripts/pyannote/precompute_plda.py
```

## Usage

### VAD-Only (streaming voice detection)

Loads only the segmentation model (5.6 MB). Audio is buffered into 10 s windows internally, and ONNX inference runs at each 2.5 s step.

```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new().with_vad_only(true);
let mut detector = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

// Feed audio chunks (any size)
for chunk in audio_chunks {
    if let Some(range) = detector.detect(chunk)? {
        println!("Voice: {} to {}", range.start.as_secs_f64(), range.end.as_secs_f64());
    }
}
// Flush trailing audio
if let Some(range) = detector.finish() {
    println!("Voice: {} to {}", range.start.as_secs_f64(), range.end.as_secs_f64());
}
```

Returns `VoiceRange { start: Duration, end: Duration }` — at most one range per call.
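The onset/offset behavior behind `VoiceRange` emission can be illustrated with a minimal standalone dual-threshold state machine. This is a sketch of the idea only, not the crate's implementation — the thresholds mirror the documented defaults, and real frame timing and smoothing are more involved:

```rust
/// Dual-threshold hysteresis over per-frame speech probabilities.
/// Activation requires crossing the higher onset threshold; once active,
/// the range only ends below the lower offset threshold.
const ONSET: f32 = 0.5;    // documented default vad_onset
const OFFSET: f32 = 0.357; // documented default vad_offset

/// Returns (start_frame, end_frame) pairs; end is exclusive.
fn hysteresis(probs: &[f32]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &p) in probs.iter().enumerate() {
        match start {
            None if p >= ONSET => start = Some(i), // activate above onset
            Some(s) if p < OFFSET => {             // deactivate below offset
                ranges.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        ranges.push((s, probs.len())); // flush trailing voice, like finish()
    }
    ranges
}

fn main() {
    // Dips to 0.4 stay "voiced" (still above offset); 0.2 ends the range.
    let probs = [0.1, 0.6, 0.7, 0.4, 0.45, 0.2, 0.1, 0.8];
    println!("{:?}", hysteresis(&probs)); // [(1, 5), (7, 8)]
}
```

The gap between onset and offset prevents rapid toggling when probabilities hover near a single threshold.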

### Speaker Diarization

Requires the segmentation, embedding, and PLDA models. Four output modes:

```rust
use oolong_diarization::{TieguanyinOolong, TieguanyinOolongOptions};

let opts = TieguanyinOolongOptions::new()
    .with_embedding_model_path(Some("models/.../embedding/model.onnx".into()))
    .with_plda_model("models/.../plda/vbx_model.npz".into());
let mut diarizer = TieguanyinOolong::new_from_path(
    "models/.../segmentation/model.onnx", opts
)?;

let pcm: &[f32] = /* 16kHz mono f32 PCM */;
```

| Method | Return Type | Description |
|---|---|---|
| `diarize(pcm)` | `Vec<SpeakerSegment>` | Exclusive mode — one speaker per frame, suitable for STT alignment |
| `diarize_timed(pcm)` | `(Vec<SpeakerSegment>, DiarizeTiming)` | Same, plus a per-step timing breakdown |
| `diarize_with_embeddings(pcm)` | `(Vec<SpeakerSegment>, HashMap<u32, Vec<f32>>)` | Same, plus a 256-dim mean embedding per speaker (for cross-file speaker matching) |
| `diarize_overlap(pcm)` | `Vec<DiarizeSegment>` | Overlap mode — multiple records for overlapping speech, compatible with Python pyannote output |

### Output Types

```rust
// Exclusive mode output
pub struct SpeakerSegment {
    pub start: Duration,       // inclusive
    pub end: Duration,         // exclusive
    pub speaker_id: u32,       // 0-based, assigned by clustering
    pub is_overlap: bool,      // true if 2+ speakers active in this segment
}

// Overlap mode output
pub struct DiarizeSegment {
    pub start: Duration,
    pub end: Duration,
    pub speaker_id: u32,       // same timespan may have multiple records
}

// Timing breakdown
pub struct DiarizeTiming {
    pub segmentation_ms: u64,  // sliding window ONNX + powerset decode
    pub embedding_ms: u64,     // per-segment WeSpeaker inference
    pub clustering_ms: u64,    // VBx (PLDA + VB-HMM) + label reconstruction
    pub total_ms: u64,
}
```
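As a consumption example, total speaking time per speaker can be accumulated from exclusive-mode output. This is a minimal sketch with a local copy of the struct (the real type lives in `src/types.rs`):

```rust
use std::collections::HashMap;
use std::time::Duration;

// Local mirror of the crate's exclusive-mode output type, for illustration.
pub struct SpeakerSegment {
    pub start: Duration, // inclusive
    pub end: Duration,   // exclusive
    pub speaker_id: u32,
    pub is_overlap: bool,
}

/// Sum speaking time per speaker from diarize()-style output.
fn speaking_time(segments: &[SpeakerSegment]) -> HashMap<u32, Duration> {
    let mut totals: HashMap<u32, Duration> = HashMap::new();
    for seg in segments {
        // Exclusive mode guarantees non-overlapping spans, so simple
        // accumulation of (end - start) is correct.
        *totals.entry(seg.speaker_id).or_default() += seg.end - seg.start;
    }
    totals
}

fn main() {
    let segs = vec![
        SpeakerSegment { start: Duration::from_secs(0), end: Duration::from_secs(3), speaker_id: 0, is_overlap: false },
        SpeakerSegment { start: Duration::from_secs(3), end: Duration::from_secs(5), speaker_id: 1, is_overlap: false },
        SpeakerSegment { start: Duration::from_secs(5), end: Duration::from_secs(9), speaker_id: 0, is_overlap: true },
    ];
    let totals = speaking_time(&segs);
    println!("speaker 0 spoke for {:?}", totals[&0]); // 7s
}
```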

## Configuration

`TieguanyinOolongOptions` uses the builder pattern:

| Option | Default | Description |
|---|---|---|
| `vad_onset` | 0.5 | VAD activation threshold |
| `vad_offset` | 0.357 | VAD deactivation threshold (hysteresis) |
| `min_duration_off` | 0 s | Minimum silence gap to split segments |
| `embedding_exclude_overlap` | true | Skip overlapping regions for cleaner embeddings |
| `clustering_threshold` | 0.6 | VBx clustering threshold |
| `intra_threads` | None | ONNX Runtime intra-op thread count |

## Architecture

```
PCM f32 (16kHz mono)
  |
  +-- [VAD path] detect() / finish()
  |     +-- 10s sliding window -> segmentation ONNX -> powerset decode
  |         -> dual-threshold hysteresis (onset/offset) -> VoiceRange
  |
  +-- [Diarization path] diarize() / diarize_timed() / ...
        |
        +-- Step 1: run_segmentation_loop_batched()
        |     Sliding window (10s, step 2.5s) -> batch ONNX (batch=16)
        |     -> 7-class log-softmax powerset -> per-speaker binary activity
        |     -> center-crop stitching to global timeline
        |
        +-- Step 2: collect_embeddings()
        |     Active frames -> fbank features (80-dim, x32768 scaling)
        |     -> fixed 200-frame resampling -> embedding ONNX -> 256-dim L2-normalized
        |
        +-- Step 3: vbx_cluster_embeddings()
        |     PLDA transform (256->128) -> AHC initialization
        |     -> VB-HMM iteration (20 rounds) -> global speaker labels
        |
        +-- Step 4: reconstruct_frame_speaker_probs() -> build_speaker_segments()
              Center-crop overlap-add frame reconstruction
              -> argmax winner speaker -> sort + merge adjacent same-speaker segments
```
## Key Files

| File | Description |
|---|---|
| `src/lib.rs` | Public API and re-exports |
| `src/types.rs` | `SpeakerSegment`, `DiarizeSegment`, `DiarizeTiming`, `VoiceRange` |
| `src/diarizer.rs` | Core implementation (~3100 lines): VAD state machine, ONNX inference, VBx clustering, PLDA |
| `scripts/pyannote/` | Python model toolchain (download, export, PLDA precompute, reference generation) |

## Benchmark

Test platform: Apple M1 Max (8P+2E, 10 cores), 32 GB RAM, macOS, `--release` build.

### Model Cold Start

| Mode | Models | Time | RSS |
|---|---|---|---|
| VAD-only | segmentation (5.6 MB) | 27 ms | +33 MB |
| Full diarization | seg + emb + PLDA | 42 ms | +68 MB |

### Diarization Performance

| Audio | Duration | Seg | Emb | Clus | Total | RTF | Speakers | Segments |
|---|---|---|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.7s | 24.3s | 4ms | 27.0s | 0.119 | 2 | 42 |
| Sound effects (bell) | 0:01 | 6ms | 3ms | 0ms | 9ms | 0.006 | 0 | 0 |
| Ambient (thunderstorm) | 2:31 | 1.8s | 3.1s | 0ms | 4.9s | 0.032 | 0 | 0 |
| Movie soundtrack | 18:18 | 13.7s | 57.0s | 46ms | 70.7s | 0.064 | 6 | 128 |
| Podcast (concatenated) | 85:24 | 62.9s | 690s | 34.5s | 787s | 0.154 | 2 | 1308 |

RTF = real-time factor (processing time ÷ audio duration); lower is faster. Non-speech audio (sound effects, ambient) correctly produces 0 speakers and 0 segments.

### VAD-Only Performance

| Audio | Duration | VAD Time | VAD RTF | Voice Ranges | Voice % |
|---|---|---|---|---|---|
| Dual-speaker dialogue | 3:47 | 2.9s | 0.013 | 32 | 69% |
| Sound effects (bell) | 0:01 | 5ms | 0.003 | 0 | 0% |
| Ambient (thunderstorm) | 2:31 | 1.9s | 0.013 | 0 | 0% |
| Movie soundtrack | 18:18 | 14.7s | 0.013 | 138 | 22% |
| Podcast (concatenated) | 85:24 | 69.0s | 0.014 | 881 | 85% |

VAD RTF is stable at ~0.013, roughly 10x faster than full diarization.

### Python pyannote Comparison

Same model (`speaker-diarization-community-1`); the Python baseline uses PyTorch inference.

| Audio | Python | Rust | Speedup | Speakers (Rust = Python) | DER |
|---|---|---|---|---|---|
| Dual-speaker (3:47) | 367.2s | 27.0s | 13.6x | 2 = 2 | 6.33% |
| Podcast 1 (42:51) | 4093.2s | ~395s | ~10.4x | 2 = 2 | 5.33% |
| Podcast 2 (8:31) | 840.9s | — | — | 2 = 2 | 6.21% |
| Podcast 3 (34:01) | 2647.3s | — | — | 2 = 2 | 6.61% |

### Performance Bottleneck

Embedding inference accounts for roughly 80% of total runtime (sequential per-segment ONNX calls). For the 85-minute podcast, embedding alone takes 690s of the 787s total. The limiting factor is the ~4000 sequential ONNX calls (one per active speech segment): each call is fast, but the count scales linearly with speech density.

## Optimizations

### Implemented

| Optimization | Effect | Details |
|---|---|---|
| Fbank caching | Avoids rebuilding the Fbank instance per embedding call | Fbank initialized once in the struct, reused across all chunks |
| Segmentation batching | N single-window ONNX calls → ceil(N/16) batched calls | `[N, 1, 160000]` input tensor, batch size 16 |
| Fixed-frame resampling | Eliminates ONNX Runtime per-shape JIT compilation | All embedding inputs normalized to 200 frames — without this, ORT recompiles kernels for each unique shape, causing 14+ min inference on 18-min audio |
| Contiguous memory layout | Fewer heap allocations in frame reconstruction | Frame probability matrix changed from `Vec<Vec<...>>` to a flat row-major `Vec<f32>` |
| Center-crop overlap-add | Reduces boundary artifacts in window stitching | Shared `reconstruct_frame_speaker_probs()` ensures consistent behavior across all diarization paths |
| `ort` thread configuration | Reduces multi-core contention | Configurable `intra_threads` for ONNX Runtime intra-op parallelism |
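The fixed-frame normalization above can be sketched as linear interpolation along the time axis of a row-major feature matrix. This is illustrative only — the crate's actual resampling strategy may differ:

```rust
/// Resample an [n, dim] feature matrix (flat, row-major) to a fixed number
/// of frames via linear interpolation along the time axis. Fixing the
/// shape (e.g. 200 frames) keeps every ONNX input tensor identical, so the
/// runtime compiles its kernels once instead of once per unique shape.
fn resample_frames(feats: &[f32], n: usize, dim: usize, target: usize) -> Vec<f32> {
    assert!(n > 0 && feats.len() == n * dim);
    let mut out = vec![0.0f32; target * dim];
    for t in 0..target {
        // Map output frame t to a fractional source position in [0, n-1].
        let pos = if target > 1 {
            t as f32 * (n - 1) as f32 / (target - 1) as f32
        } else {
            0.0
        };
        let lo = pos.floor() as usize;
        let hi = (lo + 1).min(n - 1);
        let frac = pos - lo as f32;
        for d in 0..dim {
            // Blend the two neighboring source frames.
            out[t * dim + d] = feats[lo * dim + d] * (1.0 - frac) + feats[hi * dim + d] * frac;
        }
    }
    out
}

fn main() {
    // 3 frames of 2-dim features, upsampled to 5 frames.
    let feats = [0.0, 0.0, 1.0, 10.0, 2.0, 20.0];
    println!("{:?}", resample_frames(&feats, 3, 2, 5));
}
```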

### Attempted but Failed

| Optimization | Why it failed |
|---|---|
| CoreML EP for segmentation | MLProgram compilation fails — the SincConv1D operator topology is not accepted by CoreML's model parser ("Operations are expected to be topologically sorted"). Auto-fallback to CPU implemented. |
| CoreML EP for embedding | CoreML model compilation changes numerical behavior. Both `CPUAndNeuralEngine` (ANE float16) and `CPUAndGPU` (GPU float32) degrade embedding precision enough that PLDA clustering produces 1 speaker instead of 2. Explicitly disabled. |
| IOBinding | Requires an active CoreML EP for device-side tensor allocation; since neither model uses CoreML, IOBinding provides no benefit. Code removed. |
| Embedding batching | The WeSpeaker model supports dynamic batch, but active frame counts range from 20 to 997 per segment, so zero-padding waste is enormous — the batched version is slower and uses more memory than sequential calls. Config field retained but unused. |
| `TensorRef` zero-copy | `ort` 2.0.0-rc.12's `TensorRef::from_array_view` produces corrupted inference results (4 speakers vs the expected 2). Reverted to `Tensor::from_array(to_vec())` with a data copy. |

### What Would Help Next

- Smaller embedding model (e.g., ECAPA-TDNN) — fewer parameters, faster per call
- INT8 quantization — less embedding ONNX compute per call
- Segment merging before embedding — fewer total calls by pre-clustering adjacent short segments
- Or accept the current RTF of 0.06–0.15 as sufficient for offline video indexing
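The segment-merging idea can be sketched as a gap-based pre-pass over active speech spans. This is a sketch of the proposal, not shipped code, and the `max_gap` parameter is hypothetical — how much it cuts the ~4000 embedding calls depends on segment density:

```rust
/// Merge time-sorted speech spans (start, end) in seconds whose silence gap
/// is at most `max_gap`, so each merged span costs one embedding call
/// instead of several. Input must be sorted by start time.
fn merge_spans(spans: &[(f32, f32)], max_gap: f32) -> Vec<(f32, f32)> {
    let mut out: Vec<(f32, f32)> = Vec::new();
    for &(start, end) in spans {
        match out.last_mut() {
            // Gap to the previous span is small enough: extend it.
            Some(prev) if start - prev.1 <= max_gap => prev.1 = prev.1.max(end),
            // Otherwise start a new span.
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    let spans = [(0.0, 1.0), (1.2, 2.0), (5.0, 6.0)];
    // Gaps under 0.5 s are merged; the 3 s gap is preserved.
    println!("{:?}", merge_spans(&spans, 0.5)); // [(0.0, 2.0), (5.0, 6.0)]
}
```

The trade-off: a merged span may mix two speakers across a short pause, so the gap threshold would have to stay well below typical turn-taking pauses.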

## License

MIT
