Excluding first frames for encoding clean images, and for aligning latents

Is there a specific reason why the first frame is skipped in https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L185 and https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L297?

The only reason I can think of is divisibility: VideoMAEv2, for example, reduces 48 temporal frames to 24, as noted in the comment at https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L207-L208. Similarly, the intermediate features from the DiT have to be adjusted by skipping the first latent frame (https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L297), and the [interpolation](https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L329) then matches their temporal dimension to that of the VideoMAEv2 features.
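To make the arithmetic behind this guess concrete, here is a minimal sketch of the frame bookkeeping as I understand it. The 49-frame clip length, the 4x temporal compression of the VAE, and the tubelet size of 2 are my assumptions rather than values read from the trainer, and the helper names are hypothetical.

```python
def latent_frames(num_video_frames: int, temporal_compression: int = 4) -> int:
    """Latent frame count for a causal video VAE (assumed behaviour):
    the first frame is encoded on its own, and each following group of
    `temporal_compression` frames yields one latent frame."""
    if (num_video_frames - 1) % temporal_compression != 0:
        raise ValueError("expected 1 + k * temporal_compression video frames")
    return 1 + (num_video_frames - 1) // temporal_compression


def encoder_temporal_tokens(num_video_frames: int, tubelet_size: int = 2) -> int:
    """Temporal token count for a tubelet-based encoder such as VideoMAEv2
    (assumed tubelet size of 2); the frame count must be divisible."""
    if num_video_frames % tubelet_size != 0:
        raise ValueError("frame count must be divisible by the tubelet size")
    return num_video_frames // tubelet_size


def alignment_lengths(num_video_frames: int = 49):
    """Temporal lengths on both sides of the alignment after dropping the first frame."""
    clean_frames = num_video_frames - 1                  # 49 -> 48 clean frames
    target_len = encoder_temporal_tokens(clean_frames)   # 48 -> 24 encoder tokens
    dit_len = latent_frames(num_video_frames) - 1        # 13 -> 12 latent frames
    scale = target_len / dit_len                         # temporal interpolation factor
    return clean_frames, target_len, dit_len, scale


print(alignment_lengths(49))  # (48, 24, 12, 2.0)
```

Under these assumptions, keeping the first frame would give 49 clean frames, which is not divisible by the tubelet size, and 13 latent frames, which would need a non-integer interpolation factor to reach the encoder's temporal length.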

I was also wondering about adapting VideoREPA to the I2V setting (with first-frame conditioning). In that case, would skipping the first frame be motivated not only by the reason above, but also by the fact that the first frame already provides guidance that we want to keep, since the CogVideoX backbone operates like SVD (see the attached diagram of the CogVideoX concatenation strategy, which is similar to that of VDM)? If so, we would not care as much about aligning the first frame. Any suggestions or insights would be useful.

<img width="6300" height="3579" alt="Image" src="https://github.com/user-attachments/assets/fc3a2165-5a7e-4e3e-af57-64bb8d025e73" />

<img width="6300" height="3579" alt="Image" src="https://github.com/user-attachments/assets/1ca08c1a-085d-41a6-9ffa-da79acdf9ff9" />

  
