Feature Request
- Expose last_image in pipeline call
prepare_latents already accepts a last_image parameter (pipeline_mova.py:261), but __call__ hardcodes it to None. This
is a minimal change: wiring the parameter through is enough to enable first+last frame conditioning (FLF2V mode).
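A minimal sketch of the wiring, for illustration only: the class below is a simplified stand-in, not MOVA's actual pipeline code, and only the last_image plumbing reflects the proposed change.

```python
# Simplified stand-in for the MOVA pipeline, illustrating the proposed
# change: forward last_image from __call__ into prepare_latents instead
# of hardcoding None. Everything besides that plumbing is placeholder.
class MovaPipelineSketch:
    def prepare_latents(self, image, last_image=None):
        # Existing code path (per the issue): when last_image is given,
        # the final frame is conditioned as well (FLF2V mode).
        mode = "flf2v" if last_image is not None else "i2v"
        return {"mode": mode, "first": image, "last": last_image}

    def __call__(self, image, last_image=None, num_frames=81):
        # Before: latents = self.prepare_latents(image, last_image=None)
        # After:  forward the caller-supplied keyword argument.
        return self.prepare_latents(image, last_image=last_image)


pipe = MovaPipelineSketch()
print(pipe(image="first.png")["mode"])                         # i2v
print(pipe(image="first.png", last_image="last.png")["mode"])  # flf2v
```

The only behavioral change is the single keyword forwarded in __call__; callers that omit last_image keep the current image-to-video behavior.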
- Multi-frame reference conditioning (longer-term)
WAN 2.2 supports FLF2V (first-last-frame to video) and per-frame reference conditioning (Time to Move), which enables
much longer coherent video generation by providing visual anchors throughout the sequence.
Adding similar support to MOVA would be especially valuable for audio-conditioned generation, where longer clips are
needed to match dialogue or music. Currently the only way to extend duration is to increase num_frames, which quickly
hits VRAM limits at higher resolutions.
Key considerations for MOVA specifically:
- How per-frame visual references would interact with the audio-video bridge cross-attention
- Whether reference frames could help maintain lip sync coherence over longer durations
- Potential for segment-based generation with overlapping context (difficult today because audio conditioning can't be
cleanly split)
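To make the segment-based idea concrete, here is a hedged sketch of how overlapping segment boundaries might be planned. The helper name and parameters are hypothetical, not part of MOVA; it only shows the frame-index bookkeeping, and says nothing about the harder problem of splitting the audio conditioning.

```python
def plan_segments(total_frames, seg_len, overlap):
    # Hypothetical helper: split a long clip into fixed-length segments
    # whose tails overlap, so each segment after the first can reuse
    # frames from the previous one as visual anchors.
    starts = []
    start = 0
    while start + seg_len < total_frames:
        starts.append(start)
        start += seg_len - overlap
    # Final segment is aligned to the clip end so no frames are dropped.
    starts.append(max(total_frames - seg_len, 0))
    return [(s, s + seg_len) for s in starts]


# 200 frames, 81-frame segments, 16-frame overlap between neighbors.
print(plan_segments(200, 81, 16))  # [(0, 81), (65, 146), (119, 200)]
```

Each segment's overlapping head is where per-frame reference conditioning would supply the anchors; the open question from the list above is whether the audio-video bridge cross-attention can be restricted to the matching audio window per segment.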
Context
The base WAN architecture already supports this — MOVA's prepare_latents even has the last_image code path. The main
gap is in the pipeline interface and extending beyond two reference frames.