
Feature Request: Expose last_image parameter and support multi-frame reference conditioning #36

@richservo


Feature Request

  1. Expose last_image in the pipeline __call__

prepare_latents already accepts a last_image parameter (pipeline_mova.py:261), but __call__ hardcodes it to None. This
is a minimal change: wire the parameter through __call__ to enable first+last frame conditioning (FLF2V mode).
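
To make the requested change concrete, here is a rough sketch of the plumbing only. ToyPipeline is a hypothetical stand-in, not the real MOVA class, and the real method signatures take many more arguments; the only point is that the pipeline call accepts last_image and forwards it instead of passing None.

```python
from typing import Any, Optional


class ToyPipeline:
    """Hypothetical stand-in for the real pipeline; only argument plumbing is shown."""

    def prepare_latents(self, image: Any, last_image: Optional[Any] = None) -> dict:
        # The real method would encode the frames into latents. Here we only
        # record which conditioning frames were received.
        return {"first": image, "last": last_image}

    def __call__(self, image: Any, last_image: Optional[Any] = None) -> dict:
        # Forward last_image instead of hardcoding None.
        return self.prepare_latents(image, last_image=last_image)


pipe = ToyPipeline()
flf2v = pipe("first.png", last_image="last.png")  # first+last frame conditioning
i2v = pipe("first.png")                           # existing behaviour is unchanged
```

Keeping last_image optional with a default of None should leave existing image-to-video calls working as before.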

  2. Multi-frame reference conditioning (longer-term)

WAN 2.2 supports FLF2V (first-last-frame to video) and per-frame reference conditioning (Time to Move), which enables
much longer coherent video generation by providing visual anchors throughout the sequence.
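
One way to picture per-frame reference conditioning is as a generalisation of the first/last frame case: a per-frame mask marking which frames are anchored by a reference image. The sketch below is illustrative only. build_reference_mask is a hypothetical helper, it works on raw frame indices rather than temporally compressed latents, and it is not taken from the WAN or MOVA code.

```python
from typing import Dict, List


def build_reference_mask(num_frames: int, reference_frames: Dict[int, str]) -> List[int]:
    """Return one 0/1 entry per frame: 1 where a reference image anchors the frame."""
    mask = [0] * num_frames
    for index in reference_frames:
        if not 0 <= index < num_frames:
            raise ValueError(f"reference index {index} is outside 0..{num_frames - 1}")
        mask[index] = 1
    return mask


# FLF2V is the special case with anchors at the first and last frame only.
flf2v_mask = build_reference_mask(9, {0: "first.png", 8: "last.png"})
# Multi-frame conditioning adds anchors in between.
multi_mask = build_reference_mask(9, {0: "a.png", 4: "b.png", 8: "c.png"})
```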

Adding similar support to MOVA would be especially valuable for audio-conditioned generation, where longer clips are
needed to match dialogue or music. Currently the only way to extend duration is to increase num_frames, which quickly
hits VRAM limits at higher resolutions.

Key considerations for MOVA specifically:

  • How per-frame visual references would interact with the audio-video bridge cross-attention
  • Whether reference frames could help maintain lip sync coherence over longer durations
  • Potential for segment-based generation with overlapping context (difficult today because audio conditioning can't be
    cleanly split)
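
For the segment-based idea in the last bullet, the frame bookkeeping could look roughly like this. plan_segments and audio_window are hypothetical helpers, and the fps and sample rate in the example are made-up values. Slicing the audio by time like this is the naive approach; whether the audio branch tolerates it is exactly the open question.

```python
from typing import List, Tuple


def plan_segments(total_frames: int, segment_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Split [0, total_frames) into windows of at most segment_len frames.

    Every window after the first starts `overlap` frames before the previous
    window ends, so those frames can serve as shared visual context.
    """
    if total_frames <= 0 or segment_len <= 0 or not 0 <= overlap < segment_len:
        raise ValueError("need total_frames > 0, segment_len > 0, 0 <= overlap < segment_len")
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end >= total_frames:
            break
        start = end - overlap
    return segments


def audio_window(segment: Tuple[int, int], fps: int, sample_rate: int) -> Tuple[int, int]:
    """Map a frame window to the matching audio sample range (naive time slicing)."""
    start, end = segment
    return start * sample_rate // fps, end * sample_rate // fps


segments = plan_segments(total_frames=200, segment_len=81, overlap=16)
audio = [audio_window(s, fps=24, sample_rate=48000) for s in segments]
```

With these example numbers the plan is three segments, (0, 81), (65, 146) and (130, 200), each sharing 16 frames with its predecessor.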

Context

The base WAN architecture already supports this — MOVA's prepare_latents even has the last_image code path. The main
gap is in the pipeline interface and extending beyond two reference frames.
