A full-duplex speech LLM with ~75ms latency.
MichiAI is a lightweight, multimodal speech large language model designed for full-duplex interaction.
Unlike traditional serial pipelines (ASR → LLM → TTS), MichiAI can listen and speak simultaneously, mimicking natural human conversation with ultra-low latency.
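The latency difference between the two designs can be seen in a toy contrast. This is an illustrative sketch only: the `"tts(...)"` strings stand in for synthesized audio, and neither function reflects MichiAI's real interfaces.

```python
# Illustrative contrast: a serial ASR -> LLM -> TTS cascade emits no audio
# until the whole utterance is in, while a duplex model replies chunk by
# chunk. The string "tts(...)" stands in for synthesized audio output.

def serial_pipeline(chunks):
    # classic cascade: transcribe everything, reason, then synthesize
    utterance = " ".join(chunks)
    return [f"tts(reply-to-{utterance})"]  # first audio only after ALL input

def full_duplex(chunks):
    # listen and speak in the same step: audio out starts at chunk 1
    return [f"tts(reply-to-{chunk})" for chunk in chunks]

print(serial_pipeline(["how", "are", "you"]))  # one late response
print(full_duplex(["how", "are", "you"]))      # streaming responses
```

The duplex loop starts producing audio at the first input chunk, which is what makes a ~75ms time-to-first-audio target reachable.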
| Feature | Specification |
|---|---|
| Model Size | 530M Parameters |
| Latency (TTFA, time to first audio) | ~75ms (tested on RTX 4090) |
| Architecture | Continuous Embeddings + Rectified Flow Matching |
| Base Backbone | SmolLM-360m |
| Key Innovation | No Coherence Loss / Single Step Decoding |
- Full-Duplex Capability: Handles interjections and backchanneling implicitly. It "hears" while it "talks."
- Continuous Audio Latents: Bypasses the slow decoding of traditional RVQ (Residual Vector Quantization) models, enabling high-fidelity audio with far fewer forward passes.
- Zero-Shot Voice Cloning: Captures vocal timbre and style from just a few seconds of audio prompt.
- Multimodal Input: Supports mixed text and audio prompting, making it compatible with existing RAG (Retrieval-Augmented Generation) frameworks.
- No Coherence Loss: Retains the reasoning and linguistic capabilities of the underlying text LLM without the typical degradation seen in speech-to-speech models.
- Paralinguistics: Naturally models breathing, laughing, and emotional prosody learned directly from the dataset.
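The mixed text-and-audio prompting above can be sketched as a single continuous-embedding sequence. Everything here is a toy assumption (dimensions, random weights, the `embed_*` helpers), not MichiAI's actual layout:

```python
# Hedged sketch: one way a mixed text+audio prompt could be laid out as one
# stream of continuous embeddings. All weights are random toys.
import numpy as np

DIM = 8  # toy model width

def embed_text(tokens):
    rng = np.random.default_rng(0)
    table = rng.standard_normal((100, DIM))  # toy embedding table
    return table[tokens]

def embed_audio(samples, frame=4):
    # frame the raw samples, then project each frame into the model width
    frames = samples.reshape(-1, frame)
    proj = np.ones((frame, DIM)) / frame     # toy projection matrix
    return frames @ proj

text_part = embed_text([3, 14, 15])               # 3 text positions
audio_part = embed_audio(np.arange(8.0))          # 2 audio frames
prompt = np.concatenate([text_part, audio_part])  # one interleaved stream
print(prompt.shape)  # (5, 8)
```

Because both modalities land in the same embedding space, a text-only component such as a RAG retriever can prepend its context without any special handling.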
The encoder is a multi-modal module that maps raw audio into continuous embeddings while simultaneously generating text tokens. This ensures the model understands both the semantic meaning and the emotional context.
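The dual-output idea can be illustrated in a few lines of numpy: one pass over framed audio yields both continuous embeddings and a parallel text-token stream. All weights below are random stand-ins, not MichiAI's real layers:

```python
# Minimal sketch of the encoder's dual outputs: one forward pass produces
# continuous audio embeddings plus aligned text tokens. Toy weights only.
import numpy as np

rng = np.random.default_rng(1)
FRAME, DIM, VOCAB = 4, 8, 16

def encode(samples):
    frames = samples.reshape(-1, FRAME)      # chunk raw audio into frames
    w_embed = rng.standard_normal((FRAME, DIM))
    embeddings = frames @ w_embed            # continuous audio embeddings
    w_text = rng.standard_normal((DIM, VOCAB))
    logits = embeddings @ w_text
    text_tokens = logits.argmax(axis=-1)     # text stream from the same pass
    return embeddings, text_tokens

emb, toks = encode(rng.standard_normal(16))
print(emb.shape, toks.shape)  # (4, 8) (4,)
```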
The decoder predicts audio embeddings using Rectified Flow Matching, which allows for fast, high-quality, and diverse speech generation. The embeddings are then processed through a lightweight, causal HiFi-GAN vocoder for real-time streaming.
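The single-step decoding claim follows from the rectified-flow construction: along straight paths x_t = (1 - t)·x0 + t·x1 the target velocity is the constant x1 - x0, so a perfectly rectified field needs only one Euler step from noise to data. A toy numpy version, with an oracle velocity standing in for the learned network:

```python
# Toy rectified-flow sketch: on the straight path x_t = (1-t)*x0 + t*x1,
# the target velocity is constant, v = x1 - x0. One Euler step over the
# full interval t: 0 -> 1 then recovers the data exactly -- the
# "single step decoding" idea. The oracle below stands in for the model.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(8)  # "audio embedding" we want to generate
x0 = rng.standard_normal(8)  # Gaussian noise sample

def velocity(x_t, t):
    # oracle velocity of the straight path; a trained net approximates this
    return x1 - x0

x = x0 + 1.0 * velocity(x0, 0.0)  # single Euler step, step size 1
print(np.allclose(x, x1))  # True
```

In practice the learned field is only approximately straight, but this is why flow matching needs one (or very few) decoder passes where RVQ token stacks need many.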
Despite being significantly smaller and trained on less data, MichiAI maintains high reasoning capabilities by efficiently utilizing pretrained text knowledge.
| Model | Parameters | Audio Training Data | Approach |
|---|---|---|---|
| Hertz-dev | 8.5B | 20,000,000 hours | Quantized |
| Moshi | 7B | 7,000,000 hours | Quantized |
| Qwen-Omni | 7B+ | 8,000,000+ hours | Quantized |
| MichiAI | 530M | ~5,000 hours | Continuous |
- Core Architecture: Continuous Embeddings + Flow Matching implementation.
- Scaling: Implementing a larger LLM backbone.
- Conversational Tuning: Training on specific dialogue datasets for better turn-taking.
- Multilingual Support: Integrating non-English datasets.
- Hugging Face Space: Launching a live interactive demo.
- API Client: Release an API client in this repo.
