VibeVoice ASR performance expectations and real-world benchmarks on consumer GPUs #238

@arekucr

Description

Hi team 👋,

I wanted to ask about VibeVoice ASR inference performance and compare experiences with other users running the model locally on consumer GPUs.

Observed performance
I’m currently seeing processing times close to real time (≈0.83x RTF) using a 4-bit quantized VibeVoice ASR model.
For reference: ~250 seconds of processing time for ~300 seconds of conversational audio.
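For clarity, I’m computing RTF as processing time divided by audio duration, so values below 1.0 are faster than real time. A quick sketch with the figures above:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (< 1.0 is faster than real time)."""
    return processing_seconds / audio_seconds

# The figures quoted above: ~250 s of processing for ~300 s of audio.
print(round(rtf(250.0, 300.0), 2))  # -> 0.83
```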

Questions

Is this level of performance expected for VibeVoice ASR?

What kind of RTF / throughput are other users seeing, and on what hardware?

How are people running VibeVoice locally on GPUs with 16GB VRAM or less?

Are there any recommended runtime flags, model variants, batching/chunking strategies, or vLLM configurations that significantly improve inference speed?
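On the chunking question, here is the kind of strategy I had in mind: a generic sketch (not a VibeVoice API; the 30 s chunk length, 2 s overlap, and 16 kHz sample rate are illustrative assumptions) that splits long audio into overlapping fixed-length windows so each inference call has bounded memory use:

```python
def chunk_audio(samples, sample_rate, chunk_s=30.0, overlap_s=2.0):
    """Split audio into overlapping fixed-length chunks.

    Generic windowing sketch, not VibeVoice-specific: overlap lets a
    downstream merge step reconcile words cut at chunk boundaries.
    """
    chunk_len = int(chunk_s * sample_rate)
    step = int((chunk_s - overlap_s) * sample_rate)
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # last window reached the end of the signal
        start += step
    return chunks

# 300 s of mono audio at 16 kHz -> eleven windows of at most 30 s each
audio = [0.0] * (300 * 16000)
parts = chunk_audio(audio, 16000)
print(len(parts))  # -> 11
```

The overlapping transcripts then need to be merged (e.g. by dropping the overlap region from one side), which is where I’d be curious what others have found works well.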

Any benchmarks or real-world configurations would be greatly appreciated.
Thanks in advance, and great work on the project πŸš€
