Sidekick2000 supports dual-stream audio recording with local Whisper transcription. This captures two separate audio sources simultaneously:
- Local stream: Your microphone (your voice)
- Remote stream: System audio via BlackHole virtual cable (other participants)
Each stream is transcribed independently using whisper.cpp with Metal GPU acceleration. Because your voice and the other participants arrive on separate streams, every segment is attributed to the local or remote side without any diarization algorithm.
BlackHole is a macOS virtual audio driver that routes system audio to a virtual input device.
- Install BlackHole 2ch (the 2-channel version is sufficient)
- Open Audio MIDI Setup (Applications > Utilities)
- Click the + button at the bottom left and select Create Multi-Output Device
- Check both your physical speakers/headphones AND BlackHole 2ch
- Set this Multi-Output Device as your system output (System Settings > Sound > Output)
This routes all system audio to both your speakers and BlackHole simultaneously. Sidekick2000 reads from "BlackHole 2ch" as an input device to capture what others say during calls.
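Before starting a recording, it can help to confirm that the loopback device from the setup above is actually visible. The following is a minimal sketch, assuming the device lists have the CategorizedDevices shape returned by list_audio_devices (documented in the command reference further down). The pickDevices helper and its error messages are hypothetical and not part of Sidekick2000.

```typescript
// Shape returned by list_audio_devices.
interface CategorizedDevices {
  microphones: string[]; // Normal mic devices
  loopback: string[]; // BlackHole devices
}

// Hypothetical helper: choose a microphone and a loopback device,
// preferring "BlackHole 2ch". Throws with a setup hint if either is missing.
function pickDevices(
  devices: CategorizedDevices,
  preferredMic?: string,
): { localDevice: string; remoteDevice: string } {
  if (devices.microphones.length === 0) {
    throw new Error('No microphone found');
  }
  if (devices.loopback.length === 0) {
    throw new Error('No loopback device found: is BlackHole 2ch installed?');
  }
  const hasPreferredMic =
    preferredMic !== undefined && devices.microphones.indexOf(preferredMic) !== -1;
  const localDevice = hasPreferredMic ? (preferredMic as string) : devices.microphones[0];
  const hasTwoChannel = devices.loopback.indexOf('BlackHole 2ch') !== -1;
  const remoteDevice = hasTwoChannel ? 'BlackHole 2ch' : devices.loopback[0];
  return { localDevice, remoteDevice };
}
```

The returned object has the same field names as the arguments of start_recording, so it could be passed to that command directly.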
The app uses ggml-large-v3-turbo-q5_0.bin (~550 MB), downloaded automatically on first use to ~/.sidekick2000/models/. You can also trigger the download manually via the download_whisper_model command.
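Since the model is a large download, a UI will usually want to show its progress. The sketch below formats the payload of the model-download-progress event (documented further down) into a status line. The formatDownloadProgress helper and its wording are hypothetical. The only behavior taken from this document is that total is 0 when the size is unknown.

```typescript
// Payload of the model-download-progress event.
interface ModelDownloadProgress {
  downloaded: number; // bytes downloaded so far
  total: number; // total size in bytes (0 if unknown)
  progress: number; // 0.0 to 1.0
}

// Hypothetical helper: turn a progress payload into a short status line.
function formatDownloadProgress(p: ModelDownloadProgress): string {
  const toMb = (bytes: number): string => (bytes / 1e6).toFixed(0);
  if (p.total <= 0) {
    // Total size unknown: show only the byte count.
    return `Downloading model: ${toMb(p.downloaded)} MB`;
  }
  // Clamp defensively so a stray value never renders as 101%.
  const percent = Math.round(Math.min(Math.max(p.progress, 0), 1) * 100);
  return `Downloading model: ${toMb(p.downloaded)} / ${toMb(p.total)} MB (${percent}%)`;
}
```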
       [Mic]                   [BlackHole 2ch]
         |                            |
         | cpal                       | cpal
         v                            v
    ring buffer                  ring buffer
         |                            |
         | worker thread              | worker thread (every 200ms)
         v                            v
mono + resample 16k          mono + resample 16k
         |                            |
         v                            v
   VAD (Silero)                 VAD (Silero)
         |                            |
         v                            v
 WhisperEngine #1             WhisperEngine #2   <-- 2 instances, Metal GPU
         |                            |
         | TranscriptSegment          | TranscriptSegment
         v                            v
emit "live-segment"          emit "live-segment"
         |                            |
         +-------------+--------------+
                       |
                       v
              merge (sort by time)
                       |
                       v
               summarize (Claude)
                       |
                       v
                 export (.md)
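The "mono + resample 16k" step in the pipeline above can be illustrated with a small sketch. This is only an illustration of the arithmetic. The app performs this step in its own worker threads, and its actual resampling method is not described here. Both function names are hypothetical, and the linear interpolation below applies no anti-aliasing filter.

```typescript
// Average interleaved channels into a single mono channel.
function downmixToMono(interleaved: Float32Array, channels: number): Float32Array {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) {
      sum += interleaved[i * channels + c];
    }
    mono[i] = sum / channels;
  }
  return mono;
}

// Resample by linear interpolation between neighboring input samples.
// Assumption: good enough for illustration; a production resampler would
// low-pass filter first to avoid aliasing when downsampling.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (input.length === 0) return new Float32Array(0);
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, one second of 48 kHz stereo input (96,000 interleaved samples) becomes 48,000 mono samples after the downmix and 16,000 samples after resampling.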
Both audio streams share a single Instant as their t=0 origin, captured just before the cpal streams are started. All segment timestamps are relative to this origin, ensuring correct chronological ordering when merging.
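Because both streams share one origin, merging reduces to tagging each segment with its speaker and sorting by start time. The sketch below assumes the LiveSegmentEvent payload shape documented further down. The mergeSegments helper and the TranscriptLine type are hypothetical names used for illustration.

```typescript
interface Segment {
  id: number;
  start: number; // seconds from the shared recording origin
  end: number;
  text: string;
}

interface LiveSegmentEvent {
  speaker: 'local' | 'remote';
  segments: Segment[];
}

// Hypothetical merged form: one segment tagged with the stream it came from.
interface TranscriptLine extends Segment {
  speaker: 'local' | 'remote';
}

// Flatten events from both streams and order them chronologically.
function mergeSegments(events: LiveSegmentEvent[]): TranscriptLine[] {
  const lines: TranscriptLine[] = [];
  for (const event of events) {
    for (const segment of event.segments) {
      lines.push({ ...segment, speaker: event.speaker });
    }
  }
  // Timestamps are directly comparable because both streams share t=0.
  lines.sort((a, b) => a.start - b.start);
  return lines;
}
```

Note that segment ids are per stream in this sketch, so the pair (speaker, id) identifies a line rather than id alone.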
The Silero Voice Activity Detector processes audio in 32ms chunks (512 samples at 16 kHz). Audio is accumulated and flushed to Whisper when:
- Silence detected: >= 300ms of consecutive sub-threshold VAD probability after speech, OR
- Max duration reached: >= 10 seconds of accumulated audio
Chunks shorter than 0.3 seconds are discarded as noise.
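The flush rules above can be sketched as a small state machine over per-frame speech probabilities. This is a simplified illustration, not the app's implementation. The 0.5 speech threshold is an assumption, as is the choice to start accumulating at the first speech frame and to measure the minimum length without trailing silence. The segmentFrames function and Chunk type are hypothetical names.

```typescript
const FRAME_MS = 32; // 512 samples at 16 kHz
const SILENCE_FLUSH_MS = 300;
const MAX_CHUNK_MS = 10000;
const MIN_CHUNK_MS = 300;
const SPEECH_THRESHOLD = 0.5; // assumed; the actual threshold is not stated

interface Chunk {
  startMs: number;
  endMs: number; // end of the last speech frame (trailing silence trimmed)
  reason: 'silence' | 'max-duration';
}

// Walk the VAD probabilities (one per 32 ms frame) and return the chunks
// that would be flushed to Whisper. A chunk still open when the input ends
// is not flushed here; a recorder would flush it on stop.
function segmentFrames(probabilities: number[]): Chunk[] {
  const chunks: Chunk[] = [];
  let chunkStart = -1; // -1 means no chunk is open
  let silenceMs = 0;

  for (let i = 0; i < probabilities.length; i++) {
    const frameStart = i * FRAME_MS;
    const frameEnd = frameStart + FRAME_MS;
    const isSpeech = probabilities[i] >= SPEECH_THRESHOLD;

    if (chunkStart < 0) {
      if (!isSpeech) continue; // still waiting for speech to begin
      chunkStart = frameStart;
      silenceMs = 0;
    } else {
      silenceMs = isSpeech ? 0 : silenceMs + FRAME_MS;
    }

    let reason: Chunk['reason'] | null = null;
    if (silenceMs >= SILENCE_FLUSH_MS) reason = 'silence';
    else if (frameEnd - chunkStart >= MAX_CHUNK_MS) reason = 'max-duration';

    if (reason !== null) {
      const endMs = frameEnd - silenceMs;
      // Drop chunks whose voiced span is too short to be real speech.
      if (endMs - chunkStart >= MIN_CHUNK_MS) {
        chunks.push({ startMs: chunkStart, endMs, reason });
      }
      chunkStart = -1;
      silenceMs = 0;
    }
  }
  return chunks;
}
```

Since frames are 32 ms long, the silence rule fires on the tenth consecutive quiet frame (320 ms), and the maximum-duration rule fires after 313 frames (10,016 ms).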
Each stream gets its own WhisperEngine instance rather than sharing one behind a mutex. On Apple Silicon (M4 Pro), Metal can schedule GPU work from multiple threads. A shared mutex would serialize transcription and add latency when both speakers talk simultaneously.
The list_audio_devices command returns input devices categorized as microphones or loopback devices.
interface CategorizedDevices {
microphones: string[]; // Normal mic devices
loopback: string[]; // BlackHole devices
}
const devices: CategorizedDevices = await invoke('list_audio_devices');
The start_recording command starts recording on both devices. In LocalWhisper mode, it also spawns the live transcription worker threads.
await invoke('start_recording', {
localDevice: 'MacBook Pro Microphone',
remoteDevice: 'BlackHole 2ch',
});
The stop_recording command stops recording, finalizes live transcription, and saves the WAV/OGG files.
const [localOgg, localWav, remoteOgg, remoteWav] = await invoke('stop_recording');
The get_model_download_status command checks whether the Whisper model is downloaded.
const status = await invoke('get_model_download_status');
// { downloaded: true, path: "/Users/.../.sidekick2000/models/ggml-large-v3-turbo-q5_0.bin", size_bytes: 574000000 }
The download_whisper_model command triggers the model download and emits progress events while it runs.
const modelPath = await invoke('download_whisper_model');
The live-segment event is emitted during recording whenever a chunk is transcribed, one event per chunk per stream.
interface LiveSegmentEvent {
speaker: 'local' | 'remote';
segments: Array<{
id: number;
start: number; // seconds from recording start
end: number;
text: string;
}>;
}
listen<LiveSegmentEvent>('live-segment', (event) => {
const { speaker, segments } = event.payload;
// Append to live transcript display
});
The model-download-progress event is emitted during Whisper model download.
interface ModelDownloadProgress {
downloaded: number; // bytes downloaded so far
total: number; // total size in bytes (0 if unknown)
progress: number; // 0.0 to 1.0
}
listen<ModelDownloadProgress>('model-download-progress', (event) => {
const { progress } = event.payload;
// Update download progress bar
});
The pipeline progress event already existed and is unchanged. It is emitted during pipeline execution.
interface PipelineProgress {
step: 'transcribing' | 'merging' | 'summarizing' | 'exporting' | 'committing' | 'creating_issues' | 'done';
progress: number; // 0.0 to 1.0
}
The transcription_mode field in settings controls which engine is used:
{
"transcription_mode": "LocalWhisper",
"default_language": "fr"
}
Values: "LocalWhisper" (default, offline) or "Groq" (cloud, requires API key).
The default_language is passed directly to Whisper as an ISO 639-1 code (e.g., "fr", "en").
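A settings file with these two fields could be checked before use along the following lines. This is a minimal sketch. The parseSettings helper is hypothetical, and it assumes that only lowercase two-letter codes are valid for default_language, which may be stricter than what the app accepts.

```typescript
type TranscriptionMode = 'LocalWhisper' | 'Groq';

interface TranscriptionSettings {
  transcription_mode: TranscriptionMode;
  default_language: string; // ISO 639-1 code, e.g. "fr"
}

// Hypothetical validator: parse a settings JSON string, fall back to the
// documented default mode, and reject unexpected values early.
function parseSettings(json: string): TranscriptionSettings {
  const raw = JSON.parse(json) as Record<string, unknown>;

  const rawMode = raw['transcription_mode'];
  let mode: TranscriptionMode;
  if (rawMode === undefined) {
    mode = 'LocalWhisper'; // documented default
  } else if (rawMode === 'LocalWhisper' || rawMode === 'Groq') {
    mode = rawMode;
  } else {
    throw new Error(`Unknown transcription_mode: ${String(rawMode)}`);
  }

  const language = raw['default_language'];
  // Assumption: a lowercase two-letter code is required.
  if (typeof language !== 'string' || !/^[a-z]{2}$/.test(language)) {
    throw new Error('default_language must be a two-letter ISO 639-1 code');
  }
  return { transcription_mode: mode, default_language: language };
}
```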