Is there a specific reason why the first frame is skipped in https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L185 and https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L297?
The only reason I can think of is divisibility: for example, VideoMAEv2 reduces 48 temporal frames to 24, as noted in the comment at https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L207-L208. Similarly, the intermediate features from the DiT have to be adjusted by skipping the first latent frame (https://github.com/aHapBean/VideoREPA/blob/main/finetune/models/cogvideox_t2v_align/lora_trainer.py#L297), and interpolation then matches their temporal dimension to that of the VideoMAEv2 features.
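To make the arithmetic concrete, here is a rough sketch of the frame counts as I understand them. Only the 48 to 24 reduction comes from the code comment; the 49-frame clip length, the 4x temporal compression of the VAE (with the first frame encoded on its own), the tubelet size of 2, and the helper names `latent_frames` and `tubelet_steps` are my assumptions, not something taken from the repository.

```python
def latent_frames(num_frames: int, vae_temporal_factor: int = 4) -> int:
    """Latent frame count for a causal video VAE that encodes the first
    frame on its own and compresses the rest by `vae_temporal_factor`
    (assumed behaviour, not read from the repository)."""
    return (num_frames - 1) // vae_temporal_factor + 1


def tubelet_steps(num_frames: int, tubelet_size: int = 2) -> int:
    """Temporal steps produced by a tubelet-based video encoder.
    The frame count has to be divisible by the tubelet size."""
    if num_frames % tubelet_size != 0:
        raise ValueError(
            f"{num_frames} frames are not divisible by tubelet size {tubelet_size}"
        )
    return num_frames // tubelet_size


NUM_FRAMES = 49  # assumed clip length

# Pixel side: drop the first frame so the clip is divisible by the tubelet size.
pixel_frames_kept = NUM_FRAMES - 1                 # 48
encoder_steps = tubelet_steps(pixel_frames_kept)   # 48 -> 24

# Latent side: drop the first latent frame, which holds only the first pixel frame.
dit_steps_kept = latent_frames(NUM_FRAMES) - 1     # 13 -> 12

# Temporal upsampling factor the interpolation has to provide.
temporal_scale = encoder_steps / dit_steps_kept

print(pixel_frames_kept, encoder_steps, dit_steps_kept, temporal_scale)
```

Under these assumptions, keeping the first frame would give 49 pixel frames, which a tubelet size of 2 cannot split evenly, while dropping it on both sides leaves an integer ratio between the two temporal lengths.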
I was also wondering about adapting VideoREPA to the I2V setting (with first-frame conditioning). There, skipping the first frame would be motivated not only by the reason above but also by the fact that the first frame already provides guidance that we want to keep, since the CogVideoX backbone operates like SVD (see the attached diagram: the CogVideoX concatenation strategy, which is similar to that of VDM). So in that case we would not care much about aligning the first frame. Any suggestions or insights would be useful.
