Feature Request
- Expose last_image in pipeline call
prepare_latents already accepts a last_image parameter (pipeline_mova.py:261), but __call__ hardcodes it to None. This
is a minimal change: wiring the parameter through is enough to enable first+last frame conditioning (FLF2V mode).
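A minimal sketch of the wiring, for illustration only: the class below is a simplified stand-in, not MOVA's actual pipeline code, and only the last_image plumbing reflects the proposed change.

```python
# Simplified stand-in for the MOVA pipeline, illustrating the proposed
# change: forward last_image from __call__ into prepare_latents instead
# of hardcoding None. Everything besides that plumbing is placeholder.
class MovaPipelineSketch:
    def prepare_latents(self, image, last_image=None):
        # Existing code path (per the issue): when last_image is given,
        # the final frame is conditioned as well (FLF2V mode).
        mode = "flf2v" if last_image is not None else "i2v"
        return {"mode": mode, "first": image, "last": last_image}

    def __call__(self, image, last_image=None, num_frames=81):
        # Before: latents = self.prepare_latents(image, last_image=None)
        # After:  forward the caller-supplied keyword argument.
        return self.prepare_latents(image, last_image=last_image)


pipe = MovaPipelineSketch()
print(pipe(image="first.png")["mode"])                         # i2v
print(pipe(image="first.png", last_image="last.png")["mode"])  # flf2v
```

The only behavioral change is the single keyword forwarded in __call__; callers that omit last_image keep the current image-to-video behavior.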
- Multi-frame reference conditioning (longer-term)
WAN 2.2 supports FLF2V (first-last-frame to video) and per-frame reference conditioning (Time to Move), which enables
much longer coherent video generation by providing visual anchors throughout the sequence.
Adding similar support to MOVA would be especially valuable for audio-conditioned generation, where longer clips are
needed to match dialogue or music. Currently the only way to extend duration is to increase num_frames, which quickly
hits VRAM limits at higher resolutions.
Key considerations for MOVA specifically:
- How per-frame visual references would interact with the audio-video bridge cross-attention
- Whether reference frames could help maintain lip sync coherence over longer durations
- Potential for segment-based generation with overlapping context (difficult today because audio conditioning can't be
cleanly split)
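To make the segment-based idea concrete, here is a hedged sketch of how overlapping segment boundaries might be planned. The helper name and parameters are hypothetical, not part of MOVA; it only shows the frame-index bookkeeping, and says nothing about the harder problem of splitting the audio conditioning.

```python
def plan_segments(total_frames, seg_len, overlap):
    # Hypothetical helper: split a long clip into fixed-length segments
    # whose tails overlap, so each segment after the first can reuse
    # frames from the previous one as visual anchors.
    starts = []
    start = 0
    while start + seg_len < total_frames:
        starts.append(start)
        start += seg_len - overlap
    # Final segment is aligned to the clip end so no frames are dropped.
    starts.append(max(total_frames - seg_len, 0))
    return [(s, s + seg_len) for s in starts]


# 200 frames, 81-frame segments, 16-frame overlap between neighbors.
print(plan_segments(200, 81, 16))  # [(0, 81), (65, 146), (119, 200)]
```

Each segment's overlapping head is where per-frame reference conditioning would supply the anchors; the open question from the list above is whether the audio-video bridge cross-attention can be restricted to the matching audio window per segment.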
Context
The base WAN architecture already supports this — MOVA's prepare_latents even has the last_image code path. The main
gap is in the pipeline interface and extending beyond two reference frames.