Hi team,
I wanted to ask about VibeVoice ASR inference performance and compare notes with other users running the model locally on consumer GPUs.
Observed performance
I'm currently seeing processing times close to real-time (≈1.0x RTF) using a 4-bit quantized VibeVoice ASR model.
For reference, ~250 seconds of processing time for ~300 seconds of conversational audio (an RTF of roughly 0.83).
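For context, this is roughly how I'm loading and running the model. It's a minimal sketch of my setup rather than anything from the project docs: the model id is a placeholder, and I'm assuming the standard transformers ASR pipeline plus a bitsandbytes 4-bit config work for this model.

```python
import time

import torch
from transformers import BitsAndBytesConfig, pipeline

# Placeholder model id -- substitute the actual VibeVoice ASR checkpoint.
MODEL_ID = "your-org/vibevoice-asr"

# 4-bit quantization via bitsandbytes (NF4 weights, fp16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Assumption: the model is usable through the generic ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    device_map="auto",
    model_kwargs={"quantization_config": bnb_config},
)

# Rough RTF measurement: processing time divided by audio duration.
AUDIO_PATH = "conversation.wav"   # ~300 s of conversational audio
AUDIO_DURATION_S = 300.0

start = time.perf_counter()
result = asr(AUDIO_PATH)
elapsed = time.perf_counter() - start

print(f"Transcript (start): {result['text'][:200]}...")
print(f"Elapsed: {elapsed:.1f} s, RTF: {elapsed / AUDIO_DURATION_S:.2f}")
```

If the generic pipeline isn't the right entry point for VibeVoice ASR, the same `quantization_config` would presumably go to the model's `from_pretrained` call instead.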
Questions
Is this level of performance expected for VibeVoice ASR?
What kind of RTF / throughput are other users seeing, and on what hardware?
How are people running VibeVoice locally on GPUs with 16GB VRAM or less?
Are there any recommended runtime flags, model variants, batching/chunking strategies, or vLLM configurations that significantly improve inference speed? (A rough sketch of the kind of chunking/batching I mean is below.)
Any benchmarks or real-world configurations would be greatly appreciated.
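To make the chunking/batching question concrete, this is the kind of strategy I'm asking about. It's a sketch that continues from the snippet above and assumes the model supports the transformers long-form chunking path; the `chunk_length_s` and `batch_size` values are only examples, not recommendations.

```python
# Continuing with the `asr` pipeline defined in the snippet above.
# chunk_length_s splits long audio into fixed-size windows that are decoded
# in batches; whether VibeVoice ASR supports this path is part of my question.
result = asr(
    "conversation.wav",
    chunk_length_s=30,        # example window size
    batch_size=8,             # decode several chunks per forward pass
    return_timestamps=True,   # optional, useful for conversational audio
)

for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```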
Thanks in advance, and great work on the project!