This document describes how the combined audio+video path works so that ltx.cpp can produce both video frames and a WAV from the same GGUF DiT and VAEs used in ComfyUI.
- Input: Text prompt, frame count, resolution (same as today).
- Output: Video frames (PPM/PNG) and a WAV file (optionally muxed with ffmpeg).
- Models: Same stack as LTX_COMFY_REFERENCE.md: unsloth/LTX-2.3-GGUF DiT, video VAE, audio VAE (
ltx-2.3-22b-dev_audio_vae.safetensors).
The LTX 2.3 DiT is an audio-video model: it expects a single sequence of tokens formed by concatenating video tokens and audio tokens in that order. Each token has the same dimension Pd = 128 (from patch_dim = patch_t × patch_h × patch_w × latent_channels; we use 1×2×2×32 or 128 for video, and 8×16=128 for audio).
- Video: Latent shape
[T_lat, H_lat, W_lat, C]with C=32 (DiT latent_channels). Patch size (1,2,2) → n_video_tok = (T_lat/1) × (H_lat/2) × (W_lat/2). - Audio: Latent shape
[T_audio, C_audio, mel_bins]with C_audio=8, mel_bins=16 (LightricksAudioLatentShape). Patchify:(B, C, T, F) → (B, T, C*F)→ n_audio_tok = T_audio. We align T_audio = T_lat so that one audio token per latent frame matches the Comfy “same length as video” rule. - Combined: Input to DiT is
[n_video_tok + n_audio_tok, Pd]; output is the same shape. We split the DiT output: firstn_video_toktokens → video velocity; remainingn_audio_toktokens → audio velocity.
References:
- Comfy: “AV latent: model sees concatenated video + audio latent; after sampling, LTXVSeparateAVLatent splits back to video and audio.”
- Lightricks:
AudioPatchifier.get_token_count=tgt_shape.frames;patchifyisb c t f → b t (c f)so Pd_audio = C_audio × mel_bins = 128.
- Latent init: Video latent
[T_lat, H_lat, W_lat, C]and audio latent[T_lat, C_audio, mel_bins](i.e.[T_lat, 8, 16]) filled with noise. - Per step:
- Patchify video →
[n_video_tok, 128]; patchify audio →[n_audio_tok, 128]. - Concat →
[n_video_tok + n_audio_tok, 128]. - DiT forward on combined sequence.
- Split output: video part → unpatchify → video velocity; audio part → unpatchify_audio → audio velocity.
- Euler step on video latent; Euler step on audio latent.
- (Optional) Frame conditioning on video as today (start/end frame pinning).
- Patchify video →
- Decode:
- Video: Existing video VAE decode → frames → write PPM/PNG.
- Audio: Audio VAE decode (latent → spectrogram) → vocoder or Griffin–Lim (spectrogram → waveform) → write WAV.
- File:
ltx-2.3-22b-dev_audio_vae.safetensors(unsloth/LTX-2.3-GGUF,vae/). - Role: Decoder maps audio latent
[B, C, T, F](C=8, F=16, T=T_lat) to spectrogram (e.g.[B, 1, T×4, 64]mel bins; exact from LightricksLATENT_DOWNSAMPLE_FACTOR=4and_adjust_output_shape). - Implementation: 2D conv decoder (conv_in → mid block → up blocks → norm_out → conv_out), loaded from safetensors; tensor names follow the Lightricks Decoder (e.g.
decoder.conv_in.*,decoder.mid.*,decoder.up.*,decoder.norm_out,decoder.conv_out). Our loader may add avae.prefix.
- Preferred: Full vocoder (spectrogram → waveform) as in Lightricks; if not available in C++, use Griffin–Lim (inverse STFT from magnitude) for a first milestone.
- Params: sample_rate 16000, hop_length 160, mel_bins 64 (from audio VAE / pipeline config).
--av: Enable audio+video path (allocate audio latent, concat/split, decode both).--audio-vae <path>: Optional path toltx-2.3-22b-dev_audio_vae.safetensors; when omitted, audio is synthesized from the denoised latent (fallback path).--out-wav <path>: Output WAV path (default:<out prefix>.wavwhen--av).
| Step | Video path (existing) | Audio path (new) |
|---|---|---|
| Init | Noise [T,H,W,C] |
Noise [T_lat, 8, 16] |
| Patchify | patchify() → [n_v, 128] |
patchify_audio() → [n_a, 128] |
| DiT | — | Single forward on [n_v+n_a, 128] |
| Split | — | First n_v → video, last n_a → audio |
| Unpatchify | unpatchify() → velocity |
unpatchify_audio() → audio velocity |
| Step | Euler on video latent | Euler on audio latent |
| Decode | Video VAE → frames | Audio VAE → spectrogram → WAV |
This yields a single, proven audio+video pipeline using the same GGUF models as ComfyUI.