Fails to detect two speakers and can't add them after the fact #28
Hey Sam, thanks for the clear report. This is a real bug, not anything you did wrong.

Quick explanation: on Apple Silicon, Doza Assist tries Parakeet first because it's the fastest engine, but Parakeet doesn't actually do speaker diarization. So even though you specified 2 speakers and named them in the popup, every word ended up labeled with whoever you typed as the first speaker. WhisperX (the engine that does diarize) never got reached because Parakeet was already running.

Just opened #29 to fix the routing: any project with more than 1 speaker now skips Parakeet and goes straight to WhisperX. Will land in the next release.

Heads-up though: WhisperX uses pyannote-audio for diarization, which needs a free HuggingFace token to download the model. Once the fix ships, the one-time setup is:

If you skip the token step, you'll still get a working transcript, just back to one speaker, but with a clear warning in the terminal so you know what's missing instead of it being a mystery. I'd love to bundle that setup more smoothly down the road, but for now the env var is the path.

On your second question: yeah, being able to fix speaker attributions in the transcript UI after the fact (without having to re-run the whole thing) is on the list. It's queued behind this routing fix, but I hear you that it's an obvious gap.

I'll comment here when the release ships.
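For readers following along, the routing change described above can be sketched roughly as follows. This is a minimal illustration, not the code from #29: the function name, the engine constants, and the behavior on machines other than Apple Silicon are all assumptions, since the thread does not show Doza Assist's actual identifiers.

```python
# Hypothetical engine identifiers; the real names in Doza Assist are not
# shown in this thread.
PARAKEET = "parakeet"   # fastest engine, but does no speaker diarization
WHISPERX = "whisperx"   # slower, but can diarize


def choose_engine(num_speakers: int, on_apple_silicon: bool) -> str:
    """Pick a transcription engine for a project (illustrative only)."""
    # Any project with more than one speaker needs diarization, so it
    # skips Parakeet and goes straight to WhisperX.
    if num_speakers > 1:
        return WHISPERX
    # Single-speaker projects can keep the fast path on Apple Silicon.
    # Assumption: other hardware falls back to WhisperX.
    return PARAKEET if on_apple_silicon else WHISPERX


if __name__ == "__main__":
    print(choose_engine(num_speakers=2, on_apple_silicon=True))
    print(choose_engine(num_speakers=1, on_apple_silicon=True))
```

The point of checking the speaker count first is that the hardware fast path can never shadow a request that needs diarization, which is the failure described in the original report.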
Hey Sam, v3.5.4 is out with the routing fix: https://github.com/DozaVisuals/doza-assist/releases/tag/v3.5.4

Grab Doza-Assist-3.5.4.dmg from the release page, drag the app to Applications (replacing the existing copy), and the multi-speaker case should now actually distinguish speakers.

One thing to do before you re-run that interview: set up the HuggingFace token, otherwise diarization will still be skipped. The one-time setup is:

Easiest persistent way on macOS: add this line to

Then open a fresh terminal and launch Doza Assist from there, or just relaunch via Finder (the env var will be picked up by GUI apps after a logout/login).

If you forget the token step, you'll still get a transcript, just back to one speaker, but with a clear warning in server.log explaining what's missing. So no silent failure this time.

Let me know how the re-run goes.
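The fallback behavior described above (no token means a one-speaker transcript plus a warning) could look something like the sketch below. It is only an illustration under stated assumptions: the thread never names the environment variable, so `HF_TOKEN` is a placeholder, and `plan_speaker_count` is a hypothetical helper, not a function from Doza Assist.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("diarization_sketch")

# Assumption: placeholder name. The variable Doza Assist actually reads
# is not stated in this thread.
TOKEN_ENV_VAR = "HF_TOKEN"


def plan_speaker_count(requested: int,
                       env: Optional[Mapping[str, str]] = None) -> int:
    """Return how many speakers the run can actually distinguish."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if requested > 1 and not token:
        # Degrade to a single speaker, but say why, so the missing
        # token is not a silent failure.
        logger.warning(
            "%s is not set; diarization skipped, labeling one speaker.",
            TOKEN_ENV_VAR,
        )
        return 1
    return requested


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(plan_speaker_count(2, env={}))
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.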
I followed those steps: set the token in .zshrc, logged out, logged back in. Running the transcription still resulted in a single speaker.
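One thing worth ruling out here, though nothing in this thread confirms it as the cause: ~/.zshrc is read by interactive zsh shells, so a process started some other way may not see variables exported there. A small check like the one below, run the same way the app is launched, reports whether the token is visible to that process without printing its value. The variable name `HF_TOKEN` is a placeholder assumption, as the thread does not name the variable, and `describe_token` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumption: placeholder name; substitute the variable set in .zshrc.
TOKEN_ENV_VAR = "HF_TOKEN"


def describe_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Report whether the token variable is visible, without leaking it."""
    env = os.environ if env is None else env
    value = env.get(TOKEN_ENV_VAR, "")
    if not value.strip():
        return f"{TOKEN_ENV_VAR} is not visible to this process"
    # Only the length is reported, never the secret itself.
    return f"{TOKEN_ENV_VAR} is set ({len(value)} characters)"


if __name__ == "__main__":
    print(describe_token())
```

If the variable shows up in a terminal session but the transcript still has one speaker, the warning in server.log mentioned earlier in the thread is the next place to look.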
Before starting transcription, I specify there are two speakers in the clip, and name the interviewer and subject in the popup. After transcription runs, the transcript only shows a single speaker (interviewer) and combines all the dialogue.