Hi team,
I wanted to ask about VibeVoice ASR inference performance and compare notes with other users running the model locally on consumer GPUs.
Observed performance
I'm currently seeing processing times close to real-time (≈1.0x RTF) using a 4-bit quantized VibeVoice ASR model.
For reference, ~250 seconds of processing time for ~300 seconds of conversational audio (an RTF of roughly 0.83).
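For context, this is roughly how I'm loading and running the model. It's a minimal sketch of my setup rather than anything from the project docs: the model id is a placeholder, and I'm assuming the standard transformers ASR pipeline plus a bitsandbytes 4-bit config work for this model.

```python
import time

import torch
from transformers import BitsAndBytesConfig, pipeline

# Placeholder model id -- substitute the actual VibeVoice ASR checkpoint.
MODEL_ID = "your-org/vibevoice-asr"

# 4-bit quantization via bitsandbytes (NF4 weights, fp16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Assumption: the model is usable through the generic ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    device_map="auto",
    model_kwargs={"quantization_config": bnb_config},
)

# Rough RTF measurement: processing time divided by audio duration.
AUDIO_PATH = "conversation.wav"   # ~300 s of conversational audio
AUDIO_DURATION_S = 300.0

start = time.perf_counter()
result = asr(AUDIO_PATH)
elapsed = time.perf_counter() - start

print(f"Transcript (start): {result['text'][:200]}...")
print(f"Elapsed: {elapsed:.1f} s, RTF: {elapsed / AUDIO_DURATION_S:.2f}")
```

If the generic pipeline isn't the right entry point for VibeVoice ASR, the same `quantization_config` would presumably go to the model's `from_pretrained` call instead.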
Questions
Is this level of performance expected for VibeVoice ASR?
What kind of RTF / throughput are other users seeing, and on what hardware?
How are people running VibeVoice locally on GPUs with 16GB VRAM or less?
Are there any recommended runtime flags, model variants, batching/chunking strategies, or vLLM configurations that significantly improve inference speed? (A rough sketch of the kind of chunking/batching I mean is below.)
Any benchmarks or real-world configurations would be greatly appreciated.
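To make the chunking/batching question concrete, this is the kind of strategy I'm asking about. It's a sketch that continues from the snippet above and assumes the model supports the transformers long-form chunking path; the `chunk_length_s` and `batch_size` values are only examples, not recommendations.

```python
# Continuing with the `asr` pipeline defined in the snippet above.
# chunk_length_s splits long audio into fixed-size windows that are decoded
# in batches; whether VibeVoice ASR supports this path is part of my question.
result = asr(
    "conversation.wav",
    chunk_length_s=30,        # example window size
    batch_size=8,             # decode several chunks per forward pass
    return_timestamps=True,   # optional, useful for conversational audio
)

for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```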
Thanks in advance, and great work on the project!