
fix: cap LTX-2 / LTX-2.3 frames at the temporal RoPE budget (#226)#230

Draft
jamesbrink wants to merge 3 commits into main from fix/ltx2-frame-count-temporal-rope-budget

Summary

  • Cap LTX-2 / LTX-2.3 frames at 153 (derived from the checkpoint's positional_embedding_max_pos[0] = 20 temporal RoPE budget and the VAE's 8x temporal compression).
  • Emit an explicit "temporal RoPE budget" error at the validator layer so overruns fail at the CLI instead of after a multi-minute generation returns rainbow/static garbage.
  • Leave the existing 257-frame ceiling in place for ltx-video and other families.

Context

Closes #226.

crates/mold-inference/src/ltx2/model/video_transformer.rs builds its Ltx2VideoRotaryPosEmbed with max_pos: [20, 2048, 2048] — that's what all four public LTX-2 / LTX-2.3 FP8 checkpoints ship in their safetensors metadata (verified against 19B dev, 19B distilled, 22B dev, 22B distilled headers). fractional_positions normalizes each position by its max_pos dimension without capping, so a request whose latent frame count exceeds 20 pushes RoPE's sin/cos inputs past the normalized [-1, 1] range the model was trained on.
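
To make the overflow concrete, here is a minimal sketch of uncapped normalization. It is not the code in `video_transformer.rs`: `normalized_position` is a hypothetical helper, and it assumes the latent frame index is used directly as the temporal position and mapped linearly onto [-1, 1].

```rust
/// Hypothetical stand-in for the temporal part of the normalization step.
/// Assumption: position = latent frame index, mapped linearly so that
/// 0 -> -1.0 and `max_pos` -> 1.0, with no clamping.
fn normalized_position(latent_index: u32, max_pos: u32) -> f32 {
    let fraction = latent_index as f32 / max_pos as f32;
    2.0 * fraction - 1.0
}

/// True when the normalized value stays inside the trained [-1, 1] range.
fn within_trained_range(latent_index: u32, max_pos: u32) -> bool {
    let p = normalized_position(latent_index, max_pos);
    (-1.0..=1.0).contains(&p)
}
```

Under these assumptions, the last index of a 20-latent-frame clip (index 19) stays in range, while the last index of a 25-latent-frame clip (index 24) lands above 1.0.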

The VAE applies 8x temporal compression, so max pixel frames = (20 - 1) * 8 + 1 = 153. Above that, the denoiser collapses latents into random-color / static noise, and because audio and video share the joint AV transformer, the audio track collapses at the same time — which is why --audio also drops out in overrun runs, exactly as the repro in #226 describes.
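
The arithmetic above can be checked with a short sketch. The function names are illustrative and not taken from the codebase; the only assumption is the `(frames - 1) / 8 + 1` pixel-to-latent mapping implied by the formula in this description.

```rust
/// Latent frame count for a given pixel frame count, assuming the VAE keeps
/// the first frame and compresses the rest by `compression` (8 for LTX-2).
pub fn latent_frames(pixel_frames: u32, compression: u32) -> u32 {
    pixel_frames.saturating_sub(1) / compression + 1
}

/// Largest pixel frame count that still fits `max_latent_frames`.
pub fn max_pixel_frames(max_latent_frames: u32, compression: u32) -> u32 {
    max_latent_frames.saturating_sub(1) * compression + 1
}
```

With a budget of 20 and 8x compression this gives 153 pixel frames; 97 frames map to 13 latent frames and 193 frames map to 25, matching the numbers quoted in this description.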

Developer validation artifacts recorded in #187 stayed well under the limit (9–49 frames), and the working LTX-2.3 run in #227 used 97 frames (= 13 latent frames). The first wild overrun was #226's --frames 193 → 25 latent frames.

Changes

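  • crates/mold-core/src/validation.rs: add LTX2_MAX_FRAMES = 153 constant and family-aware cap. Four new unit tests cover the boundary (153 accepted), the overflow (161 rejected for 19B, 193 rejected for 22B, matching the original report), and the fact that non-LTX-2 LTX families are unaffected.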
  • CHANGELOG.md: Unreleased fix entry with the root-cause explanation.
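
A rough sketch of what a family-aware cap could look like follows. Only `LTX2_MAX_FRAMES`, the 257-frame ceiling, and the "temporal RoPE budget" wording come from this PR; the `Family` enum, `DEFAULT_MAX_FRAMES`, and `validate_frames` are hypothetical names used for illustration, and the real `validate_generate_request` may be structured differently.

```rust
pub const LTX2_MAX_FRAMES: u32 = 153;
/// Hypothetical name for the existing 257-frame ceiling.
pub const DEFAULT_MAX_FRAMES: u32 = 257;

/// Hypothetical model-family tag; the real code may key on a string instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Family {
    Ltx2,
    LtxVideo,
    Other,
}

/// Reject frame counts above the per-family limit before any generation starts.
pub fn validate_frames(family: Family, frames: u32) -> Result<(), String> {
    let (limit, reason) = match family {
        Family::Ltx2 => (LTX2_MAX_FRAMES, "temporal RoPE budget"),
        Family::LtxVideo | Family::Other => (DEFAULT_MAX_FRAMES, "maximum frame count"),
    };
    if frames > limit {
        return Err(format!(
            "frames={frames} exceeds the {reason} ({limit} frames)"
        ));
    }
    Ok(())
}
```

In this sketch, `validate_frames(Family::Ltx2, 153)` succeeds, 161 and 193 fail with a message naming the temporal RoPE budget, and `Family::LtxVideo` still accepts 257.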

Test plan

  • cargo test -p mold-ai-core — 563 passed, 0 failed
  • cargo clippy -p mold-ai-core --all-targets -- -D warnings — clean
  • cargo fmt --check — clean

Side finding (not in this PR)

While investigating, I noticed a separate correctness bug: render_real_distilled_av / render_real_one_stage_av / render_real_retake_av in mold-inference/src/ltx2/runtime.rs call load_ltx2_av_transformer(plan, device) which passes &[] to the LoRA-aware loader, silently dropping plan.loras for those pipelines (only render_real_two_stage_av uses load_ltx2_av_transformer_with_loras). That's unrelated to the rainbow/static collapse in #226 — happy to file as a follow-up issue so camera-control LoRAs on the Distilled / OneStage / Retake paths actually take effect.

Holding off on merge

Per direction from the author, this PR is opened without auto-merge — leaving room for review before landing.

Requests with more than 153 frames silently overran the LTX-2 video
transformer's `positional_embedding_max_pos[0] = 20` temporal RoPE
budget (checkpoints ship `[20, 2048, 2048]` across 19B dev, 19B
distilled, 22B dev, 22B distilled). The VAE applies 8x temporal
compression, so the max safe pixel-frame count is `(20 - 1) * 8 + 1 =
153`.

Beyond that, `Ltx2VideoRotaryPosEmbed::fractional_positions` in
`mold-inference/src/ltx2/model/video_transformer.rs` normalizes by
`max_pos` without capping, so overflowing positions drive the RoPE
sin/cos into an untrained region and the denoiser collapses the latents
into random-color / static noise. Downstream, the audio branch runs
through the same joint AV transformer and inherits the collapse, which
is why `--audio` outputs lose their audio track in overrun runs.

Previously `validate_generate_request` accepted `frames` up to 257
(inherited from LTX Video 0.9, which has a larger RoPE budget) for both
LTX families, so LTX-2 users would only learn about the overflow after
a minutes-long generation returned garbage.

Add a new `LTX2_MAX_FRAMES = 153` constant, cap LTX-2 / LTX-2.3
requests at it with an explicit "temporal RoPE budget" error, and leave
the existing 257-frame ceiling in place for `ltx-video` and any other
family. Four new unit tests cover the boundary (153 accepted), the
overflow (161 rejected for 19B, 193 rejected for 22B matching the
original report), and the fact that non-LTX-2 LTX families are
unaffected.

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.29%. Comparing base (eb55755) to head (006804f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #230      +/-   ##
==========================================
+ Coverage   58.28%   58.29%   +0.01%     
==========================================
  Files         169      169              
  Lines       79538    79599      +61     
==========================================
+ Hits        46358    46403      +45     
- Misses      33180    33196      +16     

Codex peer review flagged that `derive_stage1_render_shape` in
`crates/mold-inference/src/ltx2/model/upsampler.rs` halves `frames`
when `temporal_upscale == Some(X2)`, so the transformer only denoises
`(frames - 1) / 2 + 1` pixel frames in that mode. The previous
unconditional 153-frame cap rejected valid long-clip temporal-upscale
requests in the 161–257 frame range that would have stayed within the
20-latent-frame RoPE budget.

Apply the cap to the stage-1 frame count instead, and extend the error
message to point users at `--temporal-upscale x2` as the documented
escape hatch for longer clips. Two new tests confirm that `frames=257`
with X2 is accepted and that stage-1 overflow is still rejected.
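
As a sketch of the revised check, assuming the halving rule quoted above: `TemporalUpscale`, `stage1_frames`, and `validate_ltx2_frames` are hypothetical names, the error text is illustrative, and the separate overall 257-frame ceiling is not modelled here.

```rust
pub const LTX2_MAX_FRAMES: u32 = 153;

/// Hypothetical enum; the commit message only mentions `Some(X2)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalUpscale {
    X2,
}

/// Pixel frames the transformer actually denoises in stage 1.
pub fn stage1_frames(frames: u32, upscale: Option<TemporalUpscale>) -> u32 {
    match upscale {
        // With X2 upscaling, stage 1 renders roughly half the requested frames.
        Some(TemporalUpscale::X2) => frames.saturating_sub(1) / 2 + 1,
        None => frames,
    }
}

/// Apply the RoPE-budget cap to the stage-1 count rather than the request.
pub fn validate_ltx2_frames(
    frames: u32,
    upscale: Option<TemporalUpscale>,
) -> Result<(), String> {
    let stage1 = stage1_frames(frames, upscale);
    if stage1 > LTX2_MAX_FRAMES {
        return Err(format!(
            "stage-1 frames={stage1} exceeds the temporal RoPE budget \
             ({LTX2_MAX_FRAMES} frames); try --temporal-upscale x2 for longer clips"
        ));
    }
    Ok(())
}
```

Here a 257-frame request with X2 denoises 129 stage-1 frames and passes, while the same request without upscaling is rejected.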

jamesbrink marked this pull request as draft on April 17, 2026 at 02:26

Development

Successfully merging this pull request may close these issues.

LTX-2 19B distilled outputs rainbow/static blobs instead of prompt-relevant video
