fix: cap LTX-2 / LTX-2.3 frames at the temporal RoPE budget (#226) by jamesbrink · Pull Request #230 · utensils/mold

jamesbrink · 2026-04-17T02:08:48Z

Summary

- Cap LTX-2 / LTX-2.3 frames at 153 (derived from the checkpoint's positional_embedding_max_pos[0] = 20 temporal RoPE budget and the VAE's 8x temporal compression).
- Emit an explicit "temporal RoPE budget" error at the validator layer so overruns fail at the CLI instead of after a multi-minute generation returns rainbow/static garbage.
- Leave the existing 257-frame ceiling in place for ltx-video and other families.

Context

Closes #226.

crates/mold-inference/src/ltx2/model/video_transformer.rs builds its Ltx2VideoRotaryPosEmbed with max_pos: [20, 2048, 2048] — that's what all four public LTX-2 / LTX-2.3 FP8 checkpoints ship in their safetensors metadata (verified against 19B dev, 19B distilled, 22B dev, 22B distilled headers). fractional_positions normalizes each position by its max_pos dimension without capping, so a request whose latent frame count exceeds 20 pushes RoPE's sin/cos inputs past the normalized [-1, 1] range the model was trained on.
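To make the overflow concrete, here is a minimal sketch of an uncapped normalization. It is not the code in video_transformer.rs: the function name `rope_input`, the use of the latent frame index as the position, and the mapping of the normalized fraction onto [-1, 1] are all assumptions for illustration.

```rust
/// Hypothetical stand-in for an uncapped fractional-position step:
/// divide the position by `max_pos`, then map [0, 1] onto [-1, 1].
/// Nothing clamps the result, so positions beyond `max_pos` pass through.
fn rope_input(position: f32, max_pos: f32) -> f32 {
    (position / max_pos) * 2.0 - 1.0
}

/// Largest RoPE input for a clip, assuming latent frames are indexed
/// from 0 and the index is used directly as the temporal position.
fn max_rope_input(latent_frames: u32, max_pos: u32) -> f32 {
    let last_index = latent_frames.saturating_sub(1) as f32;
    rope_input(last_index, max_pos as f32)
}
```

Under these assumptions, a 20-latent-frame clip tops out at `max_rope_input(20, 20)`, which stays inside [-1, 1], while the 25 latent frames from #226 give `max_rope_input(25, 20)`, which lands above 1.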

The VAE applies 8x temporal compression, so max pixel frames = (20 - 1) * 8 + 1 = 153. Above that, the denoiser collapses latents into random-color / static noise, and because audio and video share the joint AV transformer, the audio track collapses at the same time — which is why --audio also drops out in overrun runs, exactly as the repro in #226 describes.
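The arithmetic above can be checked with a small sketch. The helper names are hypothetical and are not taken from the mold codebase; only the constants 8 and 20 and the formulas come from this PR.

```rust
/// Temporal compression factor of the VAE, as stated in this PR.
const VAE_TEMPORAL_COMPRESSION: u32 = 8;
/// Temporal entry of `positional_embedding_max_pos` (`max_pos[0]`).
const TEMPORAL_ROPE_BUDGET: u32 = 20;

/// Latent frames for a pixel-frame count of the form 8k + 1.
/// Integer division floors any other count (hypothetical helper).
fn latent_frames(pixel_frames: u32) -> u32 {
    pixel_frames.saturating_sub(1) / VAE_TEMPORAL_COMPRESSION + 1
}

/// Largest pixel-frame count whose latent frames fit in `budget`.
fn max_pixel_frames(budget: u32) -> u32 {
    budget.saturating_sub(1) * VAE_TEMPORAL_COMPRESSION + 1
}
```

With these helpers, `max_pixel_frames(TEMPORAL_ROPE_BUDGET)` is 153, `latent_frames(153)` is 20, and `latent_frames(161)` is 21, the first count over the budget.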

Developer validation artifacts recorded in #187 stayed well under the limit (9–49 frames), and the working LTX-2.3 run in #227 used 97 frames (= 13 latent frames). The first wild overrun was #226's --frames 193 → 25 latent frames.

Changes

crates/mold-core/src/validation.rs: add LTX2_MAX_FRAMES = 153 constant and family-aware cap. Four new unit tests:
- ltx2_frames_at_rope_budget_accepted — 153 is the boundary and accepted
- ltx2_frames_over_rope_budget_rejected — 161 rejected on 19B
- ltx2_3_frames_over_rope_budget_rejected — 193 (the #226 repro) rejected on 22B
- ltx_video_family_is_not_subject_to_the_ltx2_rope_cap — LTX Video 0.9 family unaffected
CHANGELOG.md: Unreleased fix entry with the root-cause explanation.
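A family-aware cap along these lines could look like the sketch below. Only `LTX2_MAX_FRAMES = 153` and the 257-frame ceiling come from this PR; the `Family` enum, the `validate_frames` function, the `DEFAULT_MAX_FRAMES` name, and the error wording are hypothetical and do not reflect the actual shape of validation.rs.

```rust
/// Frame cap for LTX-2 / LTX-2.3, from this PR: (20 - 1) * 8 + 1.
const LTX2_MAX_FRAMES: u32 = 153;
/// Existing ceiling kept for ltx-video and other families
/// (the constant name is an assumption).
const DEFAULT_MAX_FRAMES: u32 = 257;

/// Hypothetical family tag; the real validator's types may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Family {
    Ltx2,
    LtxVideo,
    Other,
}

/// Reject frame counts above the cap that applies to `family`.
fn validate_frames(family: Family, frames: u32) -> Result<(), String> {
    match family {
        // LTX-2 / LTX-2.3 hit the tighter RoPE-derived cap first.
        Family::Ltx2 if frames > LTX2_MAX_FRAMES => Err(format!(
            "frames={} exceeds the temporal RoPE budget (max {})",
            frames, LTX2_MAX_FRAMES
        )),
        // Every family is still bound by the general ceiling.
        _ if frames > DEFAULT_MAX_FRAMES => Err(format!(
            "frames={} exceeds the maximum of {}",
            frames, DEFAULT_MAX_FRAMES
        )),
        _ => Ok(()),
    }
}
```

In this sketch `validate_frames(Family::Ltx2, 153)` succeeds, 161 and 193 fail for `Family::Ltx2`, and `Family::LtxVideo` still accepts 257.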

Test plan

cargo test -p mold-ai-core — 563 passed, 0 failed
cargo clippy -p mold-ai-core --all-targets -- -D warnings — clean
cargo fmt --check — clean

Side finding (not in this PR)

While investigating, I noticed a separate correctness bug: render_real_distilled_av, render_real_one_stage_av, and render_real_retake_av in mold-inference/src/ltx2/runtime.rs call load_ltx2_av_transformer(plan, device), which passes &[] to the LoRA-aware loader and silently drops plan.loras for those pipelines (only render_real_two_stage_av uses load_ltx2_av_transformer_with_loras). That's unrelated to the rainbow/static collapse in #226. Happy to file it as a follow-up issue so camera-control LoRAs on the Distilled / OneStage / Retake paths actually take effect.

Holding off on merge

Per direction from the author, this PR is opened without auto-merge — leaving room for review before landing.

Requests with more than 153 frames silently overran the LTX-2 video transformer's `positional_embedding_max_pos[0] = 20` temporal RoPE budget (checkpoints ship `[20, 2048, 2048]` across 19B dev, 19B distilled, 22B dev, 22B distilled). The VAE applies 8x temporal compression, so the max safe pixel-frame count is `(20 - 1) * 8 + 1 = 153`.

Beyond that, `Ltx2VideoRotaryPosEmbed::fractional_positions` in `mold-inference/src/ltx2/model/video_transformer.rs` normalizes by `max_pos` without capping, so overflowing positions drive the RoPE sin/cos into an untrained region and the denoiser collapses the latents into random-color / static noise. Downstream, the audio branch runs through the same joint AV transformer and inherits the collapse, which is why `--audio` outputs lose their audio track in overrun runs.

Previously `validate_generate_request` accepted `frames` up to 257 (inherited from LTX Video 0.9, which has a larger RoPE budget) for both LTX families, so LTX-2 users would only learn about the overflow after a minutes-long generation returned garbage.

Add a new `LTX2_MAX_FRAMES = 153` constant, cap LTX-2 / LTX-2.3 requests at it with an explicit "temporal RoPE budget" error, and leave the existing 257-frame ceiling in place for `ltx-video` and any other family. Four new unit tests cover the boundary (153 accepted), the overflow (161 rejected for 19B, 193 rejected for 22B matching the original report), and the fact that non-LTX-2 LTX families are unaffected.

codecov · 2026-04-17T02:13:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.29%. Comparing base (eb55755) to head (006804f).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #230      +/-   ##
==========================================
+ Coverage   58.28%   58.29%   +0.01%     
==========================================
  Files         169      169              
  Lines       79538    79599      +61     
==========================================
+ Hits        46358    46403      +45     
- Misses      33180    33196      +16

Codex peer review flagged that `derive_stage1_render_shape` in `crates/mold-inference/src/ltx2/model/upsampler.rs` halves `frames` when `temporal_upscale == Some(X2)`, so the transformer only denoises `(frames - 1) / 2 + 1` pixel frames in that mode. The previous unconditional 153-frame cap rejected valid long-clip temporal-upscale requests in the 161–257 frame range that would have stayed within the 20-latent-frame RoPE budget. Apply the cap to the stage-1 frame count instead, and extend the error message to point users at `--temporal-upscale x2` as the documented escape hatch for longer clips. Two new tests confirm that `frames=257` with X2 is accepted and that stage-1 overflow is still rejected.
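The stage-1 adjustment described in this commit can be sketched as follows. The `TemporalUpscale` enum, both function names, and the error wording are hypothetical stand-ins rather than the code in upsampler.rs or validation.rs, and the sketch leaves out the separate 257-frame ceiling.

```rust
/// LTX-2 / LTX-2.3 cap on the frames the transformer denoises.
const LTX2_MAX_FRAMES: u32 = 153;

/// Hypothetical mirror of the temporal-upscale option.
#[derive(Clone, Copy)]
enum TemporalUpscale {
    X2,
}

/// Pixel frames denoised in stage 1: halved when X2 upscaling is on.
fn stage1_frames(frames: u32, upscale: Option<TemporalUpscale>) -> u32 {
    match upscale {
        Some(TemporalUpscale::X2) => frames.saturating_sub(1) / 2 + 1,
        None => frames,
    }
}

/// Apply the RoPE cap to the stage-1 count, not the requested count.
fn validate_ltx2_frames(
    frames: u32,
    upscale: Option<TemporalUpscale>,
) -> Result<(), String> {
    let stage1 = stage1_frames(frames, upscale);
    if stage1 > LTX2_MAX_FRAMES {
        // Illustrative wording only; the real message may differ.
        return Err(format!(
            "stage-1 frames={} exceed the temporal RoPE budget (max {}); \
             try --temporal-upscale x2 for longer clips",
            stage1, LTX2_MAX_FRAMES
        ));
    }
    Ok(())
}
```

Here `frames=257` with X2 denoises `(257 - 1) / 2 + 1 = 129` frames and passes, while the same request without X2 is rejected.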

greptile-apps Bot reviewed Apr 17, 2026

jamesbrink added 2 commits April 16, 2026 19:15

Merge branch 'main' into fix/ltx2-frame-count-temporal-rope-budget

006804f

jamesbrink marked this pull request as draft April 17, 2026 02:26

jamesbrink wants to merge 3 commits into main from fix/ltx2-frame-count-temporal-rope-budget
