Fix swapped height/width dimensions in I2VDenoiser by Mr-Neutr0n · Pull Request #911 · hpcaitech/Open-Sora

Mr-Neutr0n · 2026-02-11T18:31:33Z

Summary

- Fix dimension unpacking order in I2VDenoiser.denoise() where masked_ref.size() was incorrectly unpacked as (b, c, t, w, h) instead of (b, c, t, h, w)
- The standard PyTorch video tensor layout is (B, C, T, H, W), matching how masked_ref is constructed in prepare_inference_condition() (B, C, T, H, W = z.shape)
- The swapped variables cause the image_gs guidance scale tensor (built via .repeat(b, c, 1, h, w)) to have incorrect spatial dimensions when scale_temporal_osci is enabled, leading to wrong guidance scaling or shape errors for non-square video resolutions

Details

In opensora/utils/sampling.py, line 186:

# Before (incorrect):
b, c, t, w, h = masked_ref.size()

# After (correct):
b, c, t, h, w = masked_ref.size()

The h and w variables are used downstream in the temporal oscillation scaling branch:

image_gs = torch.linspace(1.0, step_upper_image_gs, t)[
    None, None, :, None, None
].repeat(b, c, 1, h, w)

With the old code, for a non-square video (e.g., 720x480 latent), h would hold the width value and w the height value, producing an image_gs tensor with swapped spatial dimensions that cannot correctly broadcast against cond, uncond, and uncond_2 in the guidance computation.
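The effect of the swap on the shape of image_gs can be sketched without PyTorch by modelling only the shape rule of Tensor.repeat for equal-rank inputs (each output dimension is the input dimension times its repeat count). The helper names and the latent shape below are hypothetical and chosen for illustration; they are not taken from the repository.

```python
def repeat_shape(shape, repeats):
    """Shape produced by Tensor.repeat when len(repeats) == len(shape)."""
    if len(shape) != len(repeats):
        raise ValueError("this sketch only covers equal-rank repeat")
    return tuple(s * r for s, r in zip(shape, repeats))


def image_gs_shape(masked_ref_shape, swapped):
    """Shape of image_gs for a given masked_ref shape and unpacking order."""
    if swapped:
        b, c, t, w, h = masked_ref_shape  # old, incorrect order
    else:
        b, c, t, h, w = masked_ref_shape  # fixed order
    # torch.linspace(..., t)[None, None, :, None, None] has this shape.
    base = (1, 1, t, 1, 1)
    return repeat_shape(base, (b, c, 1, h, w))


# Hypothetical non-square latent in (B, C, T, H, W) layout.
masked_ref_shape = (1, 16, 8, 48, 80)

fixed = image_gs_shape(masked_ref_shape, swapped=False)
buggy = image_gs_shape(masked_ref_shape, swapped=True)

print(fixed)  # (1, 16, 8, 48, 80)
print(buggy)  # (1, 16, 8, 80, 48)
```

For a square latent both orders give the same shape, which is consistent with the problem only showing up at non-square resolutions.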

Test plan

- Verified that prepare_inference_condition() in inference.py creates tensors with (B, C, T, H, W) layout
- Confirmed the fix aligns the unpacking order with all downstream usage of h and w
- Generate non-square videos with scale_temporal_osci=True to verify correct guidance scaling

- Fix rescale_image_by_path and rescale_video_by_path passing (width, height) to transforms.Resize(), which expects (height, width)
- Fix rand_size_crop_arr using height instead of width for w_start boundary
- Fix download_url passing encoding="utf-8" to binary write mode "wb"
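The download_url item relies on a rule of Python's built-in open(): binary modes do not accept an encoding argument. The body of download_url is not shown here, so the following is only a minimal standard-library sketch of that rule, with a hypothetical file name and payload.

```python
import os
import tempfile

payload = b"\x89PNG\r\n"  # arbitrary bytes standing in for downloaded data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.bin")  # hypothetical target path

    # Binary mode rejects an encoding argument with a ValueError.
    try:
        with open(path, "wb", encoding="utf-8") as f:
            f.write(payload)
        error = None
    except ValueError as exc:
        error = exc

    # Dropping the encoding argument is enough for a binary write.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        round_trip = f.read()

print(type(error).__name__)
print(round_trip == payload)
```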

In `I2VDenoiser.denoise()`, the `masked_ref` tensor dimensions were unpacked as `(b, c, t, w, h)` instead of the correct `(b, c, t, h, w)`. The standard PyTorch video tensor layout is (B, C, T, H, W), and this is confirmed by `prepare_inference_condition()` in inference.py which constructs masked_ref using `B, C, T, H, W = z.shape`. The swapped variables cause the `image_gs` guidance scale tensor to be constructed with incorrect spatial dimensions when `scale_temporal_osci` is enabled, since `.repeat(b, c, 1, h, w)` would receive the wrong values for h and w. This leads to a shape mismatch (and runtime error) or silently incorrect guidance scaling for non-square video resolutions.

Mr-Neutr0n added 2 commits February 11, 2026 18:00

Mr-Neutr0n wants to merge 2 commits into hpcaitech:main from Mr-Neutr0n:fix/swap-height-width-in-denoiser
