Fix swapped height/width dimensions in I2VDenoiser #911

Open
Mr-Neutr0n wants to merge 2 commits into hpcaitech:main from Mr-Neutr0n:fix/swap-height-width-in-denoiser

Summary

  • Fix dimension unpacking order in I2VDenoiser.denoise() where masked_ref.size() was incorrectly unpacked as (b, c, t, w, h) instead of (b, c, t, h, w)
  • The standard PyTorch video tensor layout is (B, C, T, H, W), matching how masked_ref is constructed in prepare_inference_condition() (B, C, T, H, W = z.shape)
  • When scale_temporal_osci is enabled, the swapped variables give the image_gs guidance scale tensor (built via .repeat(b, c, 1, h, w)) transposed spatial dimensions, leading to wrong guidance scaling or shape errors for non-square video resolutions

Details

In opensora/utils/sampling.py, line 186:

# Before (incorrect):
b, c, t, w, h = masked_ref.size()

# After (correct):
b, c, t, h, w = masked_ref.size()

The h and w variables are used downstream in the temporal oscillation scaling branch:

image_gs = torch.linspace(1.0, step_upper_image_gs, t)[
    None, None, :, None, None
].repeat(b, c, 1, h, w)

With the old code, for a non-square video (e.g., a 720x480 latent), h holds the width and w holds the height. The resulting image_gs tensor has its spatial dimensions swapped, so it cannot broadcast against cond, uncond, and uncond_2 in the guidance computation.
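The shape arithmetic behind this can be checked without PyTorch. The sketch below models tensor shapes as plain tuples. The helpers `repeat_shape` and `broadcastable` are hypothetical stand-ins for `Tensor.repeat` and the usual broadcasting rule, and the latent size (1, 16, 8, 60, 90) is an assumed example, not a value taken from the repository.

```python
def repeat_shape(shape, repeats):
    """Shape that Tensor.repeat(*repeats) yields for an equal-rank tensor of `shape`."""
    if len(shape) != len(repeats):
        raise ValueError("this sketch only handles equal ranks")
    return tuple(s * r for s, r in zip(shape, repeats))


def broadcastable(a, b):
    """True if two equal-rank shapes are compatible under broadcasting."""
    return len(a) == len(b) and all(x == y or x == 1 or y == 1 for x, y in zip(a, b))


# Assumed non-square latent in (B, C, T, H, W) layout: H=60, W=90.
masked_ref_shape = (1, 16, 8, 60, 90)

# Before the fix: w receives the height and h receives the width.
b, c, t, w, h = masked_ref_shape
# linspace(...)[None, None, :, None, None] has shape (1, 1, t, 1, 1).
buggy_gs_shape = repeat_shape((1, 1, t, 1, 1), (b, c, 1, h, w))

# After the fix: names match the layout.
b, c, t, h, w = masked_ref_shape
fixed_gs_shape = repeat_shape((1, 1, t, 1, 1), (b, c, 1, h, w))

print(buggy_gs_shape, broadcastable(buggy_gs_shape, masked_ref_shape))
print(fixed_gs_shape, broadcastable(fixed_gs_shape, masked_ref_shape))
```

For a square latent the two unpackings give the same shape, which is presumably why the swap can go unnoticed at square resolutions.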

Test plan

  • Verified that prepare_inference_condition() in inference.py creates tensors with (B, C, T, H, W) layout
  • Confirmed the fix aligns the unpacking order with all downstream usage of h and w
  • Generate non-square videos with scale_temporal_osci=True to verify correct guidance scaling

- Fix rescale_image_by_path and rescale_video_by_path passing (width, height)
  to transforms.Resize(), which expects (height, width)
- Fix rand_size_crop_arr using height instead of width for w_start boundary
- Fix download_url passing encoding="utf-8" to binary write mode "wb"
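
The last item can be illustrated with the standard library alone. This is a minimal sketch that assumes download_url writes the file with Python's built-in `open()`; the actual implementation is not shown in this PR, and the path and payload below are made up.

```python
import os
import tempfile

payload = b"\x00\x01binary payload"  # made-up bytes standing in for a download
path = os.path.join(tempfile.mkdtemp(), "download.bin")

# Binary mode rejects an encoding argument: open() raises ValueError.
try:
    open(path, "wb", encoding="utf-8")
    raised = False
except ValueError:
    raised = True

# Correct form: binary mode with no encoding, writing bytes directly.
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    round_trip = f.read()

print(raised, round_trip == payload)
```
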
In `I2VDenoiser.denoise()`, the `masked_ref` tensor dimensions were
unpacked as `(b, c, t, w, h)` instead of the correct `(b, c, t, h, w)`.
The standard PyTorch video tensor layout is (B, C, T, H, W), and this
is confirmed by `prepare_inference_condition()` in inference.py which
constructs masked_ref using `B, C, T, H, W = z.shape`.

The swapped variables cause the `image_gs` guidance scale tensor to be
constructed with incorrect spatial dimensions when `scale_temporal_osci`
is enabled, since `.repeat(b, c, 1, h, w)` would receive the wrong
values for h and w. This leads to a shape mismatch (and runtime error)
or silently incorrect guidance scaling for non-square video resolutions.