fix: cap LTX-2 / LTX-2.3 frames at the temporal RoPE budget (#226) #230
Draft
jamesbrink wants to merge 3 commits into main from
Requests with more than 153 frames silently overran the LTX-2 video transformer's `positional_embedding_max_pos[0] = 20` temporal RoPE budget (checkpoints ship `[20, 2048, 2048]` across 19B dev, 19B distilled, 22B dev, 22B distilled). The VAE applies 8x temporal compression, so the max safe pixel-frame count is `(20 - 1) * 8 + 1 = 153`. Beyond that, `Ltx2VideoRotaryPosEmbed::fractional_positions` in `mold-inference/src/ltx2/model/video_transformer.rs` normalizes by `max_pos` without capping, so overflowing positions drive the RoPE sin/cos into an untrained region and the denoiser collapses the latents into random-color / static noise. Downstream, the audio branch runs through the same joint AV transformer and inherits the collapse, which is why `--audio` outputs lose their audio track in overrun runs. Previously `validate_generate_request` accepted `frames` up to 257 (inherited from LTX Video 0.9, which has a larger RoPE budget) for both LTX families, so LTX-2 users would only learn about the overflow after a minutes-long generation returned garbage. Add a new `LTX2_MAX_FRAMES = 153` constant, cap LTX-2 / LTX-2.3 requests at it with an explicit "temporal RoPE budget" error, and leave the existing 257-frame ceiling in place for `ltx-video` and any other family. Four new unit tests cover the boundary (153 accepted), the overflow (161 rejected for 19B, 193 rejected for 22B matching the original report), and the fact that non-LTX-2 LTX families are unaffected.
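The frame arithmetic in this commit message can be checked with a small sketch. The names used here (`TEMPORAL_ROPE_BUDGET`, `VAE_TEMPORAL_COMPRESSION`, `latent_frames`, `max_pixel_frames`, `fits_rope_budget`) are hypothetical and are not taken from the mold codebase. The sketch assumes the pixel-to-latent mapping `(frames - 1) / 8 + 1` implied by the 153-frame figure, and that requests use frame counts of the form `8k + 1`.

```rust
/// Temporal RoPE budget, i.e. `positional_embedding_max_pos[0]` (value from the PR text).
const TEMPORAL_ROPE_BUDGET: u32 = 20;
/// VAE temporal compression factor (value from the PR text).
const VAE_TEMPORAL_COMPRESSION: u32 = 8;

/// Latent frames for a pixel-frame count. Assumes `pixel_frames >= 1`.
/// Hypothetical helper, not the project's actual function.
fn latent_frames(pixel_frames: u32) -> u32 {
    (pixel_frames - 1) / VAE_TEMPORAL_COMPRESSION + 1
}

/// Largest pixel-frame count whose latent frames still fit in `latent_budget`.
/// Assumes `latent_budget >= 1`.
fn max_pixel_frames(latent_budget: u32) -> u32 {
    (latent_budget - 1) * VAE_TEMPORAL_COMPRESSION + 1
}

/// True when the request stays inside the temporal RoPE budget.
fn fits_rope_budget(pixel_frames: u32) -> bool {
    latent_frames(pixel_frames) <= TEMPORAL_ROPE_BUDGET
}

fn main() {
    println!(
        "max safe pixel frames: {}",
        max_pixel_frames(TEMPORAL_ROPE_BUDGET)
    );
    // Frame counts mentioned in the PR: a working run, the boundary, and two overruns.
    for frames in [97u32, 153, 161, 193] {
        println!(
            "{frames} pixel frames -> {} latent frames, fits budget: {}",
            latent_frames(frames),
            fits_rope_budget(frames)
        );
    }
}
```

Under these assumptions, 153 pixel frames land exactly on the 20-latent-frame budget, while the 193-frame request from the original report needs 25 latent frames.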
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #230      +/-   ##
==========================================
+ Coverage   58.28%   58.29%   +0.01%
==========================================
  Files         169      169
  Lines       79538    79599      +61
==========================================
+ Hits        46358    46403      +45
- Misses      33180    33196      +16
Codex peer review flagged that `derive_stage1_render_shape` in `crates/mold-inference/src/ltx2/model/upsampler.rs` halves `frames` when `temporal_upscale == Some(X2)`, so the transformer only denoises `(frames - 1) / 2 + 1` pixel frames in that mode. The previous unconditional 153-frame cap rejected valid long-clip temporal-upscale requests in the 161–257 frame range that would have stayed within the 20-latent-frame RoPE budget. Apply the cap to the stage-1 frame count instead, and extend the error message to point users at `--temporal-upscale x2` as the documented escape hatch for longer clips. Two new tests confirm that `frames=257` with X2 is accepted and that stage-1 overflow is still rejected.
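A minimal sketch of the stage-1-aware check described in this commit, under stated assumptions: the `TemporalUpscale` enum, `stage1_frames`, `validate_ltx2_frames`, and the error wording are all illustrative stand-ins, not the actual code in `validation.rs` or `upsampler.rs`. Only the `(frames - 1) / 2 + 1` halving and the 153-frame constant come from the PR text, and any other limit the real validator applies is omitted.

```rust
/// Hypothetical stand-in for the project's temporal-upscale option.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TemporalUpscale {
    X2,
}

/// Cap from the PR: (20 - 1) * 8 + 1 pixel frames.
const LTX2_MAX_FRAMES: u32 = 153;

/// Pixel frames the transformer actually denoises in stage 1.
/// Assumes `frames >= 1`.
fn stage1_frames(frames: u32, temporal_upscale: Option<TemporalUpscale>) -> u32 {
    match temporal_upscale {
        // With x2 temporal upscaling, stage 1 renders roughly half the frames.
        Some(TemporalUpscale::X2) => (frames - 1) / 2 + 1,
        None => frames,
    }
}

/// Apply the RoPE cap to the stage-1 frame count rather than the requested count.
fn validate_ltx2_frames(
    frames: u32,
    temporal_upscale: Option<TemporalUpscale>,
) -> Result<(), String> {
    if frames == 0 {
        return Err("frames must be at least 1".to_string());
    }
    let stage1 = stage1_frames(frames, temporal_upscale);
    if stage1 > LTX2_MAX_FRAMES {
        // Illustrative wording only; the real message may differ.
        return Err(format!(
            "{stage1} stage-1 frames exceed the temporal RoPE budget of \
             {LTX2_MAX_FRAMES}; try --temporal-upscale x2 for longer clips"
        ));
    }
    Ok(())
}

fn main() {
    let cases = [
        (153, None),
        (161, None),
        (257, Some(TemporalUpscale::X2)),
    ];
    for (frames, upscale) in cases {
        println!(
            "frames={frames} upscale={upscale:?} -> {:?}",
            validate_ltx2_frames(frames, upscale)
        );
    }
}
```

In this sketch a 257-frame request with x2 upscaling denoises 129 stage-1 frames, which stays under the cap, while a 161-frame request without upscaling is still rejected.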
Summary
- Cap LTX-2 / LTX-2.3 `frames` at 153 (derived from the checkpoint's `positional_embedding_max_pos[0] = 20` temporal RoPE budget and the VAE's 8x temporal compression).
- Leave the existing 257-frame ceiling in place for `ltx-video` and other families.

Context
Closes #226.
`crates/mold-inference/src/ltx2/model/video_transformer.rs` builds its `Ltx2VideoRotaryPosEmbed` with `max_pos: [20, 2048, 2048]`, which is what all four public LTX-2 / LTX-2.3 FP8 checkpoints ship in their safetensors metadata (verified against the 19B dev, 19B distilled, 22B dev, and 22B distilled headers). `fractional_positions` normalizes each position by its `max_pos` dimension without capping, so a request whose latent frame count exceeds 20 pushes RoPE's sin/cos inputs past the normalized `[-1, 1]` range the model was trained on.

The VAE applies 8x temporal compression, so max pixel frames = `(20 - 1) * 8 + 1 = 153`. Above that, the denoiser collapses latents into random-color / static noise, and because audio and video share the joint AV transformer, the audio track collapses at the same time. That is why `--audio` also drops out in overrun runs, exactly as the repro in #226 describes.

Developer validation artifacts recorded in #187 stayed well under the limit (9–49 frames), and the working LTX-2.3 run in #227 used 97 frames (= 13 latent frames). The first wild overrun was #226's `--frames 193` → 25 latent frames.

Changes
- `crates/mold-core/src/validation.rs`: add `LTX2_MAX_FRAMES = 153` constant and family-aware cap. Four new unit tests:
  - `ltx2_frames_at_rope_budget_accepted`: 153 is the boundary and is accepted
  - `ltx2_frames_over_rope_budget_rejected`: 161 rejected on 19B
  - `ltx2_3_frames_over_rope_budget_rejected`: 193 (the repro from #226, "LTX-2 19B distilled outputs rainbow/static blobs instead of prompt-relevant video") rejected on 22B
  - `ltx_video_family_is_not_subject_to_the_ltx2_rope_cap`: LTX Video 0.9 family unaffected
- `CHANGELOG.md`: `Unreleased` fix entry with the root-cause explanation.

Test plan
- `cargo test -p mold-ai-core`: 563 passed, 0 failed
- `cargo clippy -p mold-ai-core --all-targets -- -D warnings`: clean
- `cargo fmt --check`: clean

Side finding (not in this PR)
While investigating, I noticed a separate correctness bug: `render_real_distilled_av` / `render_real_one_stage_av` / `render_real_retake_av` in `mold-inference/src/ltx2/runtime.rs` call `load_ltx2_av_transformer(plan, device)`, which passes `&[]` to the LoRA-aware loader, silently dropping `plan.loras` for those pipelines (only `render_real_two_stage_av` uses `load_ltx2_av_transformer_with_loras`). That's unrelated to the rainbow/static collapse in #226. Happy to file it as a follow-up issue so camera-control LoRAs on the Distilled / OneStage / Retake paths actually take effect.

Holding off on merge
Per direction from the author, this PR is opened without auto-merge, leaving room for review before landing.