Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation

Chunyu Li¹,²,* Jiaye Li²,* Ruiqiao Mei² Haoyuan Xia¹,³

¹Shanghai Innovation Institute ²Fudan University

³University of Science and Technology of China ⁴Nanjing University ⁵Baidu

📖 Introduction

We present Hallo-Live, a real-time text-driven joint audio-video avatar generation framework. The method adopts a causal dual-stream DiT model to generate synchronized avatar video and speech in a streaming manner. Hallo-Live reaches 20.38 FPS with 0.94 s latency on two NVIDIA H200 GPUs, while preserving strong lip-sync accuracy, visual fidelity, and speech quality.

🏗️ Framework

The framework of Hallo-Live. Top left: Stage I training adapts a pretrained dual-stream DiT to the streaming setting using a cross-modal future-expanding block-causal mask. Bottom left: Stage II training performs autoregressive self-rollout with the audio-video KV cache and optimizes the generated trajectory with reward-weighted dual-stream DMD. Right: Each causal fusion block in the dual-stream DiT includes cross-modal attention between the video and audio streams; block-causal masks are used during Stage I ODE initialization, and a KV cache is maintained for Stage II self-rollout and streaming inference.
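As a rough illustration of the block-causal masking mentioned in the caption, the sketch below builds a plain block-causal attention mask in pure Python: a token may attend to every token in its own block and in earlier blocks, but not to later blocks. The function name and the block size are hypothetical, and the cross-modal future-expanding variant used in Stage I is not reproduced here, since this README does not define it.

```python
def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for q in range(num_tokens):
        q_block = q // block_size
        # A key is visible if its block is the query's block or an earlier one.
        row = [(k // block_size) <= q_block for k in range(num_tokens)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask as a grid: "x" means attention is allowed, "." means masked.
    for row in block_causal_mask(num_tokens=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Within a block the mask is fully visible, which is what distinguishes block-causal masking from token-level causal masking.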

🎬 Demo

Main Demo

The main demo showcases Hallo-Live’s real-time text-driven audio-video generation capabilities across anime-style characters, realistic human subjects, and multi-speaker scenarios.

Hallo-Live.demo.mp4

The following demos pair each generated video with its prompt. Each row gives a short prompt preview, followed by the full prompt text.

Input Prompt	Generated Video
Office close-up, man asks about the slides... Close-up on a man in an office. Window light creates soft highlights. He wears a suit, lapel texture visible. Background is blurred desks. He sits in chair, back straight. Face is head-and-shoulders, mouth sharp. He nods slightly while speaking. <S>Meeting starts in five.<E> <S>Have you got the slides?<E> <AUDCAP>Office hum, phone ring distant, professional male voice with clear articulation; no music.<ENDAUDCAP>	Close-up_on_a_man_in_an_office._Window_lig_801e1bcf_512x992_103.mp4
3D anime recording studio, asking for one more take... 3D anime cartoon style, polished toon-shaded character rendering, soft stylized materials, expressive face and eyes, smooth animation-ready posing, clear mouth shapes for readable lip sync. In a dimly lit recording studio with acoustic foam panels, a woman with curly brown hair sits framed head-and-shoulders. A large condenser microphone stands slightly off-axis to avoid plosives. Soft blue LED strip light outlines the background gear. Her skin shows natural texture under the key light. She holds a lyric sheet steady in her left hand, fingers visible against the white paper. No sudden movements occur. She breathes in slowly, then speaks directly into the mic. Her lips part clearly for each word. The paper remains still in her grip throughout the clip. <S>I think we need one more take.<E> <S>The harmony felt rushed.<E> <AUDCAP>Clear female voice with soft reverb; faint hum of ventilation; rustle of paper; no music; close proximity effect on mic; room tone is present.<ENDAUDCAP>	3D_anime_cartoon_style._polished_toon-shad_cba170fc_512x992_103.mp4
Hand-drawn anime cafe scene, waiting until midnight... Hand-drawn anime style with clean outlines, stylized proportions, bright illustrated lighting, detailed background art, expressive facial acting, crisp lip movement. Framed in close-up head-and-shoulders, a man with stubble sits in a dimly lit cafe. Neon sign glow reflects in his eyes, casting blue rim light on his profile. A condensation-covered glass sits on the table edge, visible in lower frame. His lips are sharp under the mixed lighting, moving clearly as he talks. His left hand rests on the table edge, fingers visible; no objects pass in front of it. He blinks slowly, then speaks with a slight nod. <S>They said the train was delayed.<E> <S>Now we wait until midnight.<E> <AUDCAP>Low cafe murmur, ice clinking in glass, HVAC hum; tired male voice with low resonance.<ENDAUDCAP>	Hand-drawn_anime_style_with_clean_outlines_ad1a85eb_512x992_103.mp4
Clay-court tennis player, retrying a serve... A tennis player stands on a clay court, framed chest-up in a tight-medium shot, red dust visible on his shirt. Sunlight is bright but diffused by a slight haze, preventing harsh shadows on his face. He holds a racket over his shoulder, the grip tape texture visible and hand position static. The camera remains steady, focusing on his eyes and the sweat on his brow. His mouth is open slightly as he speaks, clearly readable against the blurred net background. He shifts his weight slowly from one foot to the other, avoiding sudden jerks. <S>That serve was too wide.<E> <S>Let's try again, same spot.<E> <AUDCAP>Wind blowing across court; distant ball thud; clear male voice with athletic breath; clay court surface noise; natural outdoor sports ambience without crowd noise.<ENDAUDCAP>	A_tennis_player_stands_on_a_clay_court._fr_fe7564cd_512x992_103.mp4
3D anime room scene, boy playing guitar... 3D anime cartoon style, polished toon-shaded character rendering, soft stylized materials, expressive face and eyes, smooth animation-ready posing, clear mouth shapes for readable lip sync. Tight-medium shot of a boy in a room, chest-up, lit by computer monitor glow. Screen reflection shines in his eyes. He wears a hoodie, drawstrings hanging still. He holds a guitar neck, frets visible under his fingers. Mouth is clear in the blueish light. Right hand strums slowly; left hand holds chord. Posters on wall blurred behind. No head banging, stable posture. <S>I finally learned the chorus.<E> <S>It sounds pretty good.<E> <AUDCAP>Young male voice, proud; guitar strum sound; fan hum; no music.<ENDAUDCAP>	3D_anime_cartoon_style._polished_toon-shad_84bc5c75_512x992_103.mp4
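
The prompts above share a visible pattern: a scene description, spoken lines wrapped in <S>...<E>, and an audio caption wrapped in <AUDCAP>...<ENDAUDCAP>. The sketch below splits a prompt into those three parts with regular expressions. It is inferred only from the examples in the table; parse_prompt is a hypothetical helper, and the repository may handle prompts differently.

```python
import re

SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


def parse_prompt(prompt: str) -> dict:
    """Split a prompt into scene description, spoken lines, and audio caption."""
    speech = [s.strip() for s in SPEECH_RE.findall(prompt)]
    audcap_match = AUDCAP_RE.search(prompt)
    audio_caption = audcap_match.group(1).strip() if audcap_match else None
    # Whatever remains after removing the tagged spans is treated as scene text.
    scene = AUDCAP_RE.sub("", SPEECH_RE.sub("", prompt))
    scene = " ".join(scene.split())
    return {"scene": scene, "speech": speech, "audio_caption": audio_caption}


if __name__ == "__main__":
    # Abbreviated version of the first prompt in the table above.
    example = (
        "Close-up on a man in an office. He nods slightly while speaking. "
        "<S>Meeting starts in five.<E> <S>Have you got the slides?<E> "
        "<AUDCAP>Office hum, phone ring distant, professional male voice "
        "with clear articulation; no music.<ENDAUDCAP>"
    )
    parsed = parse_prompt(example)
    print(parsed["speech"])
    print(parsed["audio_caption"])
```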

🔧 Installation

1. Environment Setup

Create and activate the conda environment, then install Python dependencies:

source tools/setup_env.sh

2. Download Models

Set the model root MODEL_DIR=/path/to/your/model_dir in tools/download_models.sh before downloading.

For inference, you only need to download the required text encoder, VAEs, and DiT models:

bash tools/download_models.sh inference

For vanilla training or RL-based training, download the additional models as needed:

# Ovi model as real/fake score function for train.py
bash tools/download_models.sh train

# Optional reward models for RL-based training
bash tools/download_models.sh reward

🚀 Inference

Before inference, make sure the MODEL_DIR in scripts/inference.sh points to the root directory that contains the downloaded models, then run the script:

bash scripts/inference.sh

Generated videos will be saved in output_folder.

Parallel VAE Decoding

For faster inference on machines with at least two visible CUDA devices, you can overlap diffusion generation and VAE decoding by enabling parallel VAE decode:

bash scripts/inference_parallel_vae.sh

This mode keeps the diffusion model on the main GPU and moves the video/audio VAEs to a second GPU. It decodes generated video blocks in a background CUDA stream while the diffusion GPU continues generating the next block. By default, the VAE decode device is the next visible CUDA device after the diffusion device.
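
The overlap described above is a producer-consumer pipeline. The sketch below imitates it with a Python thread and a queue rather than real GPUs or CUDA streams: the main thread produces blocks while a background worker decodes the ones already queued. generate_block and decode_block are hypothetical stand-ins, not functions from this repository.

```python
import queue
import threading


def generate_block(index: int) -> str:
    """Hypothetical stand-in for generating one latent block with the diffusion model."""
    return f"latent-{index}"


def decode_block(latent: str) -> str:
    """Hypothetical stand-in for VAE decoding of one latent block."""
    return latent.replace("latent", "frames")


def run_pipeline(num_blocks: int) -> list[str]:
    """Generate blocks on the main thread while a worker decodes them in order."""
    pending: queue.Queue = queue.Queue()
    decoded: list[str] = []

    def decoder_worker() -> None:
        while True:
            latent = pending.get()
            if latent is None:  # sentinel: no more blocks will arrive
                break
            decoded.append(decode_block(latent))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for i in range(num_blocks):
        # The main thread moves on to the next block as soon as this one is queued.
        pending.put(generate_block(i))
    pending.put(None)
    worker.join()
    return decoded


if __name__ == "__main__":
    print(run_pipeline(3))
```

A single FIFO queue keeps decoded blocks in generation order, which matters for streaming output.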

🏋️‍♂️ Training

Training uses torchrun and FSDP. Before launching, check the following fields in the config:

model_dir: directory containing Ovi, Wan, MMAudio, and optional reward checkpoints.
data_path: prompt CSV or LMDB path.
generator_ckpt: initialization checkpoint for the student.
real_score_ckpt and fake_score_ckpt: teacher and critic initialization checkpoints.
save_ckpt_dir: output directory for training checkpoints.
sharding_strategy, generator_fsdp_wrap_strategy, real_score_fsdp_wrap_strategy, fake_score_fsdp_wrap_strategy: distributed training strategy.
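
As a small aid for the checklist above, the sketch below reports which of the listed fields are missing from a config that has already been loaded into a dictionary. It assumes the fields sit at the top level of a flat dictionary, which may not match the repository's config layout; missing_fields is a hypothetical helper, not part of the codebase.

```python
REQUIRED_FIELDS = [
    "model_dir",
    "data_path",
    "generator_ckpt",
    "real_score_ckpt",
    "fake_score_ckpt",
    "save_ckpt_dir",
    "sharding_strategy",
    "generator_fsdp_wrap_strategy",
    "real_score_fsdp_wrap_strategy",
    "fake_score_fsdp_wrap_strategy",
]


def missing_fields(config: dict) -> list[str]:
    """Return the required fields that are absent or empty in the config."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    config = {
        "model_dir": "/path/to/your/model_dir",
        "data_path": "prompts/data/synthetic_prompts_32k.csv",
    }
    for name in missing_fields(config):
        print(f"missing config field: {name}")
```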

Training Dataset

For convenience, we’ve open-sourced the training dataset synthetic_prompts_32k.csv in our HuggingFace repo. You can either download it manually or use the following command to download it directly:

hf download fudan-generative-ai/Hallo-Live synthetic_prompts_32k.csv --local-dir "prompts/data"

Stage 1: Dual-Stream ODE Initialization

For convenience, we’ve provided a Stage 1 ODE initialization checkpoint. You can download hallolive_ode_init.pt manually from our HuggingFace repo, or fetch it from the command line:

hf download fudan-generative-ai/Hallo-Live hallolive_ode_init.pt --local-dir "$MODEL_DIR/Hallo-Live"

To run Stage 1 training on your own prompt dataset, we also provide utilities for generating ODE initialization data and packing it into LMDB:

bash scripts/sample_ode_data.sh

The script performs three steps:

1. Generate ODE trajectories with hallolive.utils.sample_ode_data.
2. Build video-to-prompt mappings with tools/create_video_mappings.py.
3. Convert latent .pt files into LMDB with hallolive.utils.create_lmdb.

After the LMDB dataset is created, run the script for ODE initialization training:

bash scripts/train_ode_init.sh

Stage 2: Self-Rollout + Dual-Stream DMD

First, modify the generator_ckpt path in the config file to point to the checkpoint obtained after completing your ODE initialization training. Then run the script for DMD training:

bash scripts/train_dmd_5B.sh

To perform multi-node training, such as on 16 or 32 GPUs, run the script:

bash scripts/train_dmd_5B_multinode.sh

To reproduce HP-DMD, enable reward guidance in the DMD config:

enable_rl_reward: true
reward_types: [videoalign, audiobox, sync]
reward_beta: 2.0
reward_model_cpu_offload: true

The paper uses a continued Stage 2 strategy: first train video and audio jointly until the video stream stabilizes, then freeze the video stream and continue audio-only optimization. In this repository, audio-only continued training is controlled by:

train_audio_stream_only: true
video_loss_weight: 0
audio_loss_weight: 0.15

See configs/dual_stream_dmd_5B_audio.yaml for an example.
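
One plausible reading of the two weights is a weighted sum of per-stream losses, so that video_loss_weight: 0 removes the video term entirely. The sketch below shows that arithmetic only. The function name and the loss values are hypothetical, and the repository's actual loss may combine the terms differently.

```python
def combined_loss(video_loss: float, audio_loss: float,
                  video_loss_weight: float, audio_loss_weight: float) -> float:
    """Weighted sum of the two stream losses (assumed form, for illustration)."""
    return video_loss_weight * video_loss + audio_loss_weight * audio_loss


if __name__ == "__main__":
    # With the audio-only settings shown above, only the audio term contributes.
    total = combined_loss(video_loss=1.0, audio_loss=2.0,
                          video_loss_weight=0.0, audio_loss_weight=0.15)
    print(total)
```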

📊 Evaluation

Run batch inference for checkpoint evaluation:

bash scripts/inference_eval.sh

The script generates videos for selected checkpoint steps and writes video_prompt_mappings.json for downstream scoring.

🙏 Acknowledgements

This project builds on and benefits from the following open-source projects and research codebases:

Ovi for high-quality joint audio-video generation.
Self-forcing for autoregressive self-rollout training and DMD code.
Wan for video generation components.
MMAudio for audio VAE components.
VideoAlign, AudioBox Aesthetics and SyncNet for reward modeling.

📖 Citation

If you find this repository useful, please cite:

@article{li2026hallo,
  title={Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation},
  author={Li, Chunyu and Li, Jiaye and Mei, Ruiqiao and Xia, Haoyuan and Zhu, Hao and Wang, Jingdong and Zhu, Siyu},
  journal={arXiv preprint arXiv:2604.23632},
  year={2026}
}
