From d13a554b1c2f9214c301b5c07e18c64fce13c6b6 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 16:14:40 -0700 Subject: [PATCH 01/31] fix(ltx2): use pure source latents as i2v denoise-mask target At `strength < 1.0` (the `--strength 0.75` LTX-2 i2v default), `run_real_distilled_stage` was cloning `video_latents` *after* `apply_stage_video_conditioning` had already soft-blended the first latent frame positions with noise: the "clean reference" tensor that the per-step denoise-mask blend pulls conditioned tokens toward became `noise*(1-s) + source*s` at replacement positions. Used as the clean target, that pre-blended tensor pinned the first latent to a noisy ghost of the image at every step, so i2v runs produced a first frame that was 25 % noise + 75 % image instead of the source image. Introduce a `clean_latents_for_conditioning` helper that re-applies the replacement-based conditioning with `strength = 1.0` on top of the post-apply tensor, overwriting replacement positions with pure source tokens while appended keyframe tokens and pure-noise regions pass through unchanged. `strength = 1.0` and pure-T2V paths remain bit-for-bit identical. Two regression tests cover the soft-blended case and the no-replacements passthrough. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + crates/mold-inference/src/ltx2/runtime.rs | 109 ++++++++++++++++++++-- 2 files changed, 101 insertions(+), 9 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 51e20f6e..af96fa71 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Fixed +- **LTX-2 image-to-video no longer locks the first latent frame to a noisy ghost of the source image at `strength < 1.0`.** In `run_real_distilled_stage` (`crates/mold-inference/src/ltx2/runtime.rs`) the "clean reference" that the per-step denoise-mask blend pulls the conditioned tokens toward was sourced by cloning `video_latents` *after* `apply_stage_video_conditioning` had already soft-blended the first-latent-frame positions with the initial noise (`noise*(1-s) + source*s`). Used as the clean target, that pre-blended tensor pinned the first latent to a noisy copy of the image instead of the pure image at every step — so i2v runs with `--strength 0.75` (the CLI default) produced a first frame that was 25 % noise + 75 % image rather than the source image. A new helper `clean_latents_for_conditioning` re-applies the replacements with strength 1.0 on top of the post-apply tensor so replacement positions hold pure source image tokens while appended keyframe tokens and pure-noise regions pass through unchanged. `strength = 1.0` and pure-T2V paths are bit-for-bit identical to before. Covered by two new regression tests (`clean_latents_replace_soft_blended_positions_with_pure_source`, `clean_latents_passthrough_when_no_replacements`). - **city96-format FLUX fine-tune GGUFs now fail with an honest, actionable error when no dev reference is downloaded, and surface the dependency at pull time instead of inside `ensure_gguf_embeddings`.** Community fine-tune GGUFs (e.g. 
the `silveroxides/ultrareal-fine-tune-GGUF` tree that powers `ultrareal-v4:q{8,5,4}`) ship only the diffusion blocks and expect the base FLUX input embedding layers (`img_in`, `time_in`, `vector_in`, `guidance_in`) to be patched in from a separately-downloaded flux-dev reference. Two bugs made this fail confusingly: (1) `find_flux_reference_gguf` in `crates/mold-inference/src/flux/pipeline.rs` returned the first candidate with `img_in.weight`, which let `flux-schnell:q8` pass the probe even though schnell is distilled without `guidance_in` — the subsequent patch loop bailed with `reference GGUF (.../flux-schnell-q8/flux1-schnell-Q8_0.gguf) is also missing required tensor 'guidance_in.in_layer.weight'`, making it look like schnell itself was broken. (2) The manifest didn't express the dependency at all, so the first indication a user had that `mold pull ultrareal-v4:q8` wasn't self-sufficient was an HTTP 500 on their first generation. Fixed by (a) adding a `needs_guidance: bool` parameter to `find_flux_reference_gguf` that skips schnell candidates for dev-family targets and verifies candidates contain `guidance_in.in_layer.weight` before accepting them, (b) rewriting both error messages so the source model is named and the reference path is shown as a filename rather than a full path, and (c) adding a pull-time probe in `crates/mold-core/src/download.rs` (`warn_if_flux_gguf_needs_reference`) that scans the first 4 MiB of any downloaded `.gguf` transformer for `img_in.weight`, and prints a one-line warning via the download callback when the GGUF is incomplete and no suitable dev reference is already on disk. Works for both the CLI (`pull_model`) and server (`pull_model_with_callback`) paths. New regression test `find_flux_reference_skips_schnell_when_dev_needed` covers the reference-picker behaviour. 
- **Prompt expansion can no longer OOM on a multi-GPU box with a tight main GPU.** `LocalExpander` previously hardcoded `gpu_ordinal: 0` and gated placement with a static 2 GB VRAM threshold — on a dual-GPU system with a busy main card it fell back to CPU unnecessarily, and on a q8/bf16 expand model (4+ GB weights) the 2 GB threshold under-budgeted activations so the GPU placement check could pass and the load then OOM. The expander now sizes its budget dynamically (`model_size + 2 GB activations`, matching the T5/Qwen3 pattern) and cascades through devices: main GPU → remaining GPUs in ordinal order → CPU, with `preflight_memory_check()` as the final hard-fail guard when system RAM can't hold the model either. Unified-memory Metal placements also run the RAM preflight (Metal allocations draw from the same pool). Device selection logic is factored into a pure `select_expand_device(gpus, threshold, is_metal) -> ExpandPlacement` helper with unit tests for every branch. diff --git a/crates/mold-inference/src/ltx2/runtime.rs b/crates/mold-inference/src/ltx2/runtime.rs index 781ab3c4..72135d92 100644 --- a/crates/mold-inference/src/ltx2/runtime.rs +++ b/crates/mold-inference/src/ltx2/runtime.rs @@ -1379,6 +1379,36 @@ fn apply_video_token_replacements( Ok(patched) } +/// Build the "clean reference" tensor used by the denoise mask blend at every +/// step. For replacement-based conditioning (e.g. i2v source image) with +/// `strength < 1.0`, `video_latents` already holds `noise*(1-s) + source*s` at +/// the replacement positions. If we reuse that as the clean target, the +/// denoise-mask blend pulls those tokens toward a noisy ghost of the image at +/// every step — the first latent frame never converges to the pure source. 
+/// +/// Re-applying the replacements with strength 1.0 overwrites those positions +/// with the pure source tokens, leaving appended keyframe tokens (already +/// full-strength in `apply_appended_video_conditioning`) and pure-noise +/// regions untouched. +fn clean_latents_for_conditioning( + video_latents: &Tensor, + conditioning: &StageVideoConditioning, +) -> Result { + if conditioning.replacements.is_empty() { + return Ok(video_latents.clone()); + } + let hard_replacements: Vec = conditioning + .replacements + .iter() + .map(|replacement| VideoTokenReplacement { + start_token: replacement.start_token, + tokens: replacement.tokens.clone(), + strength: 1.0, + }) + .collect(); + apply_video_token_replacements(video_latents, &hard_replacements) +} + fn apply_appended_video_conditioning( video_latents: &Tensor, video_positions: &Tensor, @@ -2699,7 +2729,7 @@ fn run_real_distilled_stage( )?; let clean_video_latents = match video_clean_latents { Some(latents) => video_patchifier.patchify(latents)?, - None => video_latents.clone(), + None => clean_latents_for_conditioning(&video_latents, video_conditioning)?, }; let video_denoise_mask = match video_denoise_mask { Some(mask) => mask.to_device(&device)?.to_dtype(DType::F32)?, @@ -4622,14 +4652,14 @@ mod tests { use super::{ apply_stage_video_conditioning, apply_video_token_replacements, - build_video_conditioning_self_attention_mask, convert_velocity_to_x0, - convert_x0_to_velocity, decoded_video_to_frames, effective_native_guidance_scale, - emit_denoise_progress, guided_velocity_from_cfg, keyframe_only_conditioning, - ltx2_video_transformer_config, reapply_stage_video_conditioning, - should_inspect_step_velocity, source_image_only_conditioning, - strip_appended_video_conditioning, Ltx2RuntimeSession, StageVideoConditioning, - VideoTokenAppendCondition, VideoTokenReplacement, LTX2_AUDIO_LATENT_CHANNELS, - LTX2_VIDEO_LATENT_CHANNELS, + build_video_conditioning_self_attention_mask, clean_latents_for_conditioning, + 
convert_velocity_to_x0, convert_x0_to_velocity, decoded_video_to_frames, + effective_native_guidance_scale, emit_denoise_progress, guided_velocity_from_cfg, + keyframe_only_conditioning, ltx2_video_transformer_config, + reapply_stage_video_conditioning, should_inspect_step_velocity, + source_image_only_conditioning, strip_appended_video_conditioning, Ltx2RuntimeSession, + StageVideoConditioning, VideoTokenAppendCondition, VideoTokenReplacement, + LTX2_AUDIO_LATENT_CHANNELS, LTX2_VIDEO_LATENT_CHANNELS, }; use crate::ltx2::conditioning::{self, StagedConditioning}; use crate::ltx2::model::VideoPixelShape; @@ -5693,6 +5723,67 @@ mod tests { ); } + #[test] + fn clean_latents_replace_soft_blended_positions_with_pure_source() { + // Simulate the state after `apply_stage_video_conditioning` with + // strength 0.75: at the replacement positions, `video_latents` already + // holds `noise*0.25 + source*0.75`. The denoise-mask blend uses + // `clean_latents` as the target it pulls those positions toward at + // every step — so the clean target must be pure source, not the + // pre-blended mix. 
+ let noise = [0.0f32, 0.0, 1.0, 1.0, 2.0, 2.0]; + let source = [10.0f32, 10.0]; + let strength = 0.75f32; + let blended_first = [ + noise[0] * (1.0 - strength) + source[0] * strength, + noise[1] * (1.0 - strength) + source[1] * strength, + ]; + let soft_blended = Tensor::from_vec( + vec![ + blended_first[0], + blended_first[1], + noise[2], + noise[3], + noise[4], + noise[5], + ], + (1, 3, 2), + &Device::Cpu, + ) + .unwrap(); + let conditioning = StageVideoConditioning { + replacements: vec![VideoTokenReplacement { + start_token: 0, + tokens: Tensor::from_vec(source.to_vec(), (1, 1, 2), &Device::Cpu).unwrap(), + strength: strength as f64, + }], + appended: vec![], + }; + + let clean = clean_latents_for_conditioning(&soft_blended, &conditioning).unwrap(); + let values = clean.flatten_all().unwrap().to_vec1::().unwrap(); + + assert_eq!( + values, + vec![source[0], source[1], noise[2], noise[3], noise[4], noise[5]], + "soft-blended replacement positions must be overwritten with the pure \ + source tokens; other positions must be preserved unchanged" + ); + } + + #[test] + fn clean_latents_passthrough_when_no_replacements() { + let latents = + Tensor::from_vec(vec![0.0f32, 1.0, 2.0, 3.0], (1, 2, 2), &Device::Cpu).unwrap(); + let conditioning = StageVideoConditioning::default(); + + let clean = clean_latents_for_conditioning(&latents, &conditioning).unwrap(); + assert_eq!( + clean.flatten_all().unwrap().to_vec1::().unwrap(), + vec![0.0, 1.0, 2.0, 3.0] + ); + } + #[test] fn video_conditioning_self_attention_mask_blocks_cross_keyframe_attention() { let conditioning = StageVideoConditioning { From b4ed487578c49a202eaeead3eea2526186bb6817 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 16:26:21 -0700 Subject: [PATCH 02/31] feat(chain): add core wire types and request normalisation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce `mold_core::chain` with the `ChainStage` / `ChainRequest` / `ChainResponse` 
types that will carry server-side chained LTX-2 video generation. The wire format is stages-based from day one so the v2 movie-maker UI can author multi-prompt / multi-keyframe chains without breaking callers: v1 only exposes a single-prompt auto-expand form (`prompt` + `total_frames` + `clip_frames`), and `normalise()` collapses it into a canonical `Vec` before any engine work runs. Normalisation matches the stitch math that Phase 1.4 of the plan will use: delivered_frames = clip_frames + (N - 1) * (clip_frames - motion_tail) so auto-expand picks `N` large enough to cover `total_frames` with tail-overlap trimming in mind; the over-production is discarded from the final clip's tail per the 2026-04-20 sign-off. Guardrails cap chains at 16 stages (≈1552 frames at 97-frame clips, ~64 s at 24 fps), require `8k+1` frame counts for LTX-2, and forbid `motion_tail_frames >= clip_frames` so every continuation emits at least one new frame. Also lifts the existing `base64_opt` serde helper in `types.rs` from private to `pub(crate)` so chain types can share the single source of truth for base64 wire encoding. Unit tests cover: split-into-stages, first-stage-image preservation, empty-request rejection, non-8k+1 rejection, canonical-form passthrough, single-stage short chains, >16-stage guardrails, motion-tail >= clip rejection, missing auto-expand fields, and a property test confirming the auto-expand stage count delivers the requested total frames under every representative (total, clip, tail) combo from the design. tasks/render-chain-v1-plan.md adds the signed-off decisions block at the top so the rationale travels with the code. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-core/src/chain.rs | 596 ++++++++++++++++++++++++++++++++++ crates/mold-core/src/lib.rs | 2 + crates/mold-core/src/types.rs | 2 +- tasks/render-chain-v1-plan.md | 410 +++++++++++++++++++++++ 4 files changed, 1009 insertions(+), 1 deletion(-) create mode 100644 crates/mold-core/src/chain.rs create mode 100644 tasks/render-chain-v1-plan.md diff --git a/crates/mold-core/src/chain.rs b/crates/mold-core/src/chain.rs new file mode 100644 index 00000000..101e479e --- /dev/null +++ b/crates/mold-core/src/chain.rs @@ -0,0 +1,596 @@ +//! Wire types for server-side chained video generation. +//! +//! A *chain* is a sequence of per-clip render stages stitched into a single +//! output video. The v1 CLI UX is single-prompt + arbitrary length, but the +//! wire format is stages-based from day one so the eventual movie-maker +//! (multi-prompt, keyframes, selective regen) can author stages by hand +//! without a breaking change. +//! +//! The server only ever sees the canonical [`ChainRequest`] shape — a +//! `Vec`. Callers can either build that directly or use the +//! auto-expand form (`prompt` + `total_frames` + `clip_frames`), which +//! [`ChainRequest::normalise`] collapses into stages. +//! +//! See `tasks/render-chain-v1-plan.md` for the full design rationale. + +use serde::{Deserialize, Serialize}; + +use crate::error::{MoldError, Result}; +use crate::types::{DevicePlacement, OutputFormat, VideoData}; + +/// A single rendered clip in a chain. Concatenated in order with motion-tail +/// trimming on continuations (stages with `idx >= 1` drop the leading +/// `motion_tail_frames` pixel frames of their output because those duplicate +/// the tail of the previous stage that the engine carried across as +/// latent-space conditioning). +#[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema)] +pub struct ChainStage { + /// Prompt used for this stage. 
In v1 all stages receive the same prompt + /// (auto-expand form replicates it); the movie-maker UI in v2 will let + /// users author per-stage prompts. + #[schema(example = "a cat walking through autumn leaves")] + pub prompt: String, + + /// Frame count for this stage. Must be `8k+1` (LTX-2 pipeline constraint: + /// 9, 17, 25, …, 97). + #[schema(example = 97)] + pub frames: u32, + + /// Optional starting image (raw PNG/JPEG bytes, base64 in JSON). In v1 + /// this is only meaningful on `stages[0]`; later stages draw their + /// conditioning from the prior stage's motion-tail latents instead. + #[serde( + default, + skip_serializing_if = "Option::is_none", + with = "crate::types::base64_opt" + )] + pub source_image: Option>, + + /// Optional negative prompt for CFG-based stages. v1 LTX-2 ignores this + /// (the distilled family doesn't use CFG); the field is reserved so the + /// movie-maker can round-trip it without re-migrating the wire format. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub negative_prompt: Option, + + /// Optional per-stage seed offset. `None` in v1 — the orchestrator + /// derives each stage's seed from the chain's base seed. Reserved as the + /// v2 movie-maker override hook for "regenerate just this stage with a + /// different seed". + #[serde(default, skip_serializing_if = "Option::is_none")] + pub seed_offset: Option, +} + +/// Chained generation request. Server accepts either the canonical form +/// (`stages` non-empty) or the auto-expand form (`prompt` + `total_frames` + +/// `clip_frames`); [`ChainRequest::normalise`] collapses the latter into the +/// former so downstream code only deals with `stages`. +#[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema)] +pub struct ChainRequest { + #[schema(example = "ltx-2-19b-distilled:fp8")] + pub model: String, + + /// Canonical stages list. Empty triggers auto-expand from + /// `prompt`/`total_frames`/`clip_frames`. 
+ #[serde(default)] + pub stages: Vec, + + /// Pixel frames of motion-tail overlap between consecutive stages. + /// `0` = no overlap (simple concat). `>0` = the final K pixel frames of + /// stage N's latents are threaded into stage N+1's conditioning, and + /// stage N+1's leading K output frames are dropped at stitch time. + /// + /// Defaults to `4` for v1 (matches the CLI default). Must be strictly + /// less than each stage's `frames`. + #[serde(default = "default_motion_tail_frames")] + #[schema(example = 4)] + pub motion_tail_frames: u32, + + #[schema(example = 1216)] + pub width: u32, + #[schema(example = 704)] + pub height: u32, + #[serde(default = "default_fps")] + #[schema(example = 24)] + pub fps: u32, + + /// Chain base seed. Per-stage seeds are derived as + /// `base_seed ^ ((stage_idx as u64) << 32)` by the orchestrator so the + /// whole chain is reproducible from a single seed value. + #[serde(default, skip_serializing_if = "Option::is_none")] + #[schema(example = 42)] + pub seed: Option, + + #[schema(example = 8)] + pub steps: u32, + + #[schema(example = 3.0)] + pub guidance: f64, + + /// Denoising strength for `stages[0].source_image`. Ignored when the + /// first stage has no source image. Continuation stages are always + /// full-strength conditioned via motion-tail latents. + #[serde(default = "default_strength")] + #[schema(example = 1.0)] + pub strength: f64, + + #[serde(default = "default_output_format")] + pub output_format: OutputFormat, + + #[serde(default, skip_serializing_if = "Option::is_none")] + pub placement: Option, + + // ── Auto-expand form ──────────────────────────────────────────────── + // These are only read when `stages` is empty; `normalise` clears them + // after expansion so the canonical form only ever carries `stages`. + /// Auto-expand: single prompt replicated across all stages. 
+ #[serde(default, skip_serializing_if = "Option::is_none")] + pub prompt: Option, + + /// Auto-expand: total pixel frames the stitched output should cover. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub total_frames: Option, + + /// Auto-expand: per-clip frame count. Defaults to `97` (LTX-2 19B/22B + /// distilled cap). Must be `8k+1`. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub clip_frames: Option, + + /// Auto-expand: starting image for `stages[0]`. + #[serde( + default, + skip_serializing_if = "Option::is_none", + with = "crate::types::base64_opt" + )] + pub source_image: Option>, +} + +/// Response from a chained generation request. The `video` is the stitched +/// output; individual per-stage clips are not returned. +#[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema)] +pub struct ChainResponse { + pub video: VideoData, + /// Number of stages that actually ran (matches `request.stages.len()` + /// after normalisation). + #[schema(example = 5)] + pub stage_count: u32, + /// GPU ordinal that handled the chain (multi-GPU servers only). + #[serde(default, skip_serializing_if = "Option::is_none")] + pub gpu: Option, +} + +fn default_motion_tail_frames() -> u32 { + 4 +} + +fn default_fps() -> u32 { + 24 +} + +fn default_strength() -> f64 { + 1.0 +} + +fn default_output_format() -> OutputFormat { + OutputFormat::Mp4 +} + +/// Maximum number of stages the v1 orchestrator will accept in a single +/// chain. 16 × 97-frame clips ≈ 1552 frames ≈ 64 s at 24 fps — comfortably +/// past the 400-frame target without risking runaway jobs. +pub const MAX_CHAIN_STAGES: usize = 16; + +impl ChainRequest { + /// Collapse the auto-expand form into a canonical `Vec` and + /// validate the result. Called once on the server side immediately after + /// JSON parsing, before any engine work kicks off. + /// + /// Post-conditions on a successful return: + /// - `self.stages` is non-empty. 
+ /// - Each stage's `frames` is `8k+1` and `> 0`. + /// - `self.stages.len() <= MAX_CHAIN_STAGES`. + /// - All auto-expand fields are `None` (caller must use `self.stages`). + pub fn normalise(mut self) -> Result { + if self.stages.is_empty() { + let prompt = self.prompt.take().ok_or_else(|| { + MoldError::Validation( + "chain request needs either stages[] or prompt + total_frames".into(), + ) + })?; + let total_frames = self.total_frames.ok_or_else(|| { + MoldError::Validation("chain auto-expand requires total_frames".into()) + })?; + if total_frames == 0 { + return Err(MoldError::Validation( + "chain total_frames must be > 0".into(), + )); + } + let clip_frames = self.clip_frames.unwrap_or(97); + if clip_frames == 0 { + return Err(MoldError::Validation( + "chain clip_frames must be > 0".into(), + )); + } + if !is_ltx2_frame_count(clip_frames) { + return Err(MoldError::Validation(format!( + "chain clip_frames ({clip_frames}) must be 8k+1 (9, 17, 25, …, 97)", + ))); + } + let motion_tail = self.motion_tail_frames; + if motion_tail >= clip_frames { + return Err(MoldError::Validation(format!( + "motion_tail_frames ({motion_tail}) must be strictly less than clip_frames ({clip_frames})", + ))); + } + + let source_image = self.source_image.take(); + self.stages = build_auto_expand_stages( + &prompt, + total_frames, + clip_frames, + motion_tail, + source_image, + )?; + } + + if self.stages.is_empty() { + return Err(MoldError::Validation("chain request has no stages".into())); + } + if self.stages.len() > MAX_CHAIN_STAGES { + return Err(MoldError::Validation(format!( + "chain request has {} stages; maximum is {}", + self.stages.len(), + MAX_CHAIN_STAGES, + ))); + } + for (idx, stage) in self.stages.iter().enumerate() { + if stage.frames == 0 { + return Err(MoldError::Validation(format!("stage {idx} has 0 frames",))); + } + if !is_ltx2_frame_count(stage.frames) { + return Err(MoldError::Validation(format!( + "stage {idx} has {} frames; LTX-2 requires 8k+1 (9, 17, 25, …, 
97)", + stage.frames, + ))); + } + if self.motion_tail_frames >= stage.frames { + return Err(MoldError::Validation(format!( + "motion_tail_frames ({}) must be strictly less than stage {idx}'s frames ({})", + self.motion_tail_frames, stage.frames, + ))); + } + } + + // Canonicalise: clear auto-expand fields so downstream code only + // ever reads from `stages`. + self.prompt = None; + self.total_frames = None; + self.clip_frames = None; + self.source_image = None; + + Ok(self) + } +} + +/// Returns `true` iff `n` has the form `8k + 1` for some non-negative integer +/// `k` (1, 9, 17, 25, …). The LTX-2 pipeline has this constraint on pixel +/// frame counts due to the VAE's 8× temporal compression with a causal first +/// frame. +fn is_ltx2_frame_count(n: u32) -> bool { + n % 8 == 1 +} + +/// Compute the stage count and per-stage frame allocation for the auto- +/// expand form, matching Phase 1.4's stitch math: +/// +/// - Stage 0 contributes `clip_frames` pixel frames. +/// - Each continuation contributes `clip_frames - motion_tail_frames` new +/// frames (the leading `motion_tail_frames` are dropped at stitch time +/// because they duplicate the prior stage's latent tail). +/// +/// Returns enough stages so the stitched total reaches at least +/// `total_frames`; over-production is trimmed from the tail at stitch time +/// per the signed-off decision 2026-04-20. +fn build_auto_expand_stages( + prompt: &str, + total_frames: u32, + clip_frames: u32, + motion_tail_frames: u32, + source_image: Option>, +) -> Result> { + let (stage_count, per_stage_frames) = if total_frames <= clip_frames { + // Single stage: match the user's requested length exactly so we + // don't render 97 frames and throw most of them away. The frame + // count will still be validated as 8k+1 by the caller. + (1u32, total_frames) + } else { + let effective = clip_frames - motion_tail_frames; + // effective > 0 because the caller has already ensured + // motion_tail_frames < clip_frames. 
+ let remainder = total_frames - clip_frames; + let count = 1 + remainder.div_ceil(effective); + (count, clip_frames) + }; + + let count_usize = stage_count as usize; + if count_usize > MAX_CHAIN_STAGES { + return Err(MoldError::Validation(format!( + "auto-expand would produce {stage_count} stages; maximum is {MAX_CHAIN_STAGES} \ + (try reducing total_frames or increasing clip_frames)", + ))); + } + + let mut stages = Vec::with_capacity(count_usize); + for idx in 0..stage_count { + stages.push(ChainStage { + prompt: prompt.to_string(), + frames: per_stage_frames, + source_image: if idx == 0 { source_image.clone() } else { None }, + negative_prompt: None, + seed_offset: None, + }); + } + Ok(stages) +} + +#[cfg(test)] +mod tests { + use super::*; + + /// Build a minimal auto-expand request with the given knobs. All other + /// fields use their v1 defaults so tests can focus on the logic under + /// exercise. + fn auto_expand_request( + prompt: &str, + total_frames: u32, + clip_frames: u32, + motion_tail_frames: u32, + source_image: Option>, + ) -> ChainRequest { + ChainRequest { + model: "ltx-2-19b-distilled:fp8".into(), + stages: Vec::new(), + motion_tail_frames, + width: 1216, + height: 704, + fps: 24, + seed: Some(42), + steps: 8, + guidance: 3.0, + strength: 1.0, + output_format: OutputFormat::Mp4, + placement: None, + prompt: Some(prompt.into()), + total_frames: Some(total_frames), + clip_frames: Some(clip_frames), + source_image, + } + } + + fn canonical_request(stages: Vec, motion_tail_frames: u32) -> ChainRequest { + ChainRequest { + model: "ltx-2-19b-distilled:fp8".into(), + stages, + motion_tail_frames, + width: 1216, + height: 704, + fps: 24, + seed: Some(42), + steps: 8, + guidance: 3.0, + strength: 1.0, + output_format: OutputFormat::Mp4, + placement: None, + prompt: None, + total_frames: None, + clip_frames: None, + source_image: None, + } + } + + fn make_stage(frames: u32) -> ChainStage { + ChainStage { + prompt: "test".into(), + frames, + 
source_image: None, + negative_prompt: None, + seed_offset: None, + } + } + + #[test] + fn normalise_splits_single_prompt_into_stages() { + // total=400, clip=97, tail=4 → effective=93, remainder=303, + // N = 1 + ceil(303/93) = 1 + 4 = 5 stages of 97 frames each. + // Stitched = 97 + 4*93 = 469, which will be trimmed to 400 at + // stitch time (per the signed-off "trim from tail" decision). + let normalised = auto_expand_request("a cat walking", 400, 97, 4, None) + .normalise() + .expect("normalise should succeed"); + + assert_eq!( + normalised.stages.len(), + 5, + "400/97 with a 4-frame motion tail should expand to 5 stages", + ); + for stage in &normalised.stages { + assert_eq!(stage.frames, 97); + assert_eq!(stage.prompt, "a cat walking"); + assert!(stage.seed_offset.is_none()); + } + // Auto-expand fields are cleared post-normalisation. + assert!(normalised.prompt.is_none()); + assert!(normalised.total_frames.is_none()); + assert!(normalised.clip_frames.is_none()); + assert!(normalised.source_image.is_none()); + } + + #[test] + fn normalise_preserves_first_stage_image() { + let png = vec![0x89, 0x50, 0x4e, 0x47, 0xde, 0xad, 0xbe, 0xef]; + let normalised = auto_expand_request("test", 200, 97, 4, Some(png.clone())) + .normalise() + .expect("normalise should succeed"); + + assert!(normalised.stages.len() >= 2); + assert_eq!( + normalised.stages[0].source_image.as_deref(), + Some(png.as_slice()), + "stage 0 must carry the starting image", + ); + for stage in &normalised.stages[1..] { + assert!( + stage.source_image.is_none(), + "continuation stages must not carry a source image; conditioning flows \ + through motion-tail latents instead", + ); + } + } + + #[test] + fn normalise_rejects_empty() { + let mut req = canonical_request(Vec::new(), 4); + // No auto-expand fields either. 
+ req.prompt = None; + req.total_frames = None; + + let err = req.normalise().expect_err("empty chain should fail"); + assert!( + matches!(err, MoldError::Validation(_)), + "empty chain should be a validation error, got {err:?}", + ); + } + + #[test] + fn normalise_rejects_non_8k1_frames() { + // Canonical form with a stage whose frames violates the 8k+1 + // constraint. + let req = canonical_request(vec![make_stage(50)], 4); + let err = req.normalise().expect_err("non-8k+1 frames should fail"); + assert!( + matches!(err, MoldError::Validation(msg) if msg.contains("8k+1")), + "error must mention the 8k+1 constraint", + ); + } + + #[test] + fn normalise_accepts_canonical_form_unchanged() { + // Caller already built stages; normalise should validate and clear + // the (already-empty) auto-expand fields without touching stages. + let stages = vec![make_stage(97), make_stage(97), make_stage(97)]; + let normalised = canonical_request(stages.clone(), 4) + .normalise() + .expect("valid canonical form should pass"); + assert_eq!(normalised.stages.len(), 3); + for (left, right) in normalised.stages.iter().zip(stages.iter()) { + assert_eq!(left.frames, right.frames); + assert_eq!(left.prompt, right.prompt); + } + } + + #[test] + fn normalise_single_stage_when_total_leq_clip() { + // total=9 fits in one clip; don't render a full 97-frame stage and + // throw most of it away. + let normalised = auto_expand_request("short", 9, 97, 4, None) + .normalise() + .expect("short single-clip chain should pass"); + assert_eq!(normalised.stages.len(), 1); + assert_eq!(normalised.stages[0].frames, 9); + } + + #[test] + fn normalise_rejects_too_many_stages() { + // 17 canonical stages exceeds MAX_CHAIN_STAGES (16). 
+ let stages = (0..17).map(|_| make_stage(97)).collect(); + let err = canonical_request(stages, 4) + .normalise() + .expect_err("17-stage chain should fail"); + assert!( + matches!(err, MoldError::Validation(msg) if msg.contains("maximum")), + "error must mention the max-stages cap", + ); + } + + #[test] + fn normalise_rejects_auto_expand_too_long() { + // 16 × 97 = 1552 max stitched frames before trim; asking for + // 4000 frames should blow the guardrail. + let err = auto_expand_request("too long", 4000, 97, 4, None) + .normalise() + .expect_err("runaway auto-expand should fail"); + assert!( + matches!(err, MoldError::Validation(msg) if msg.contains("stages")), + "error must name the stage count guardrail", + ); + } + + #[test] + fn normalise_rejects_motion_tail_ge_clip() { + // motion_tail must leave at least one new frame per continuation. + let err = auto_expand_request("bad tail", 200, 97, 97, None) + .normalise() + .expect_err("motion_tail >= clip should fail"); + assert!( + matches!(err, MoldError::Validation(msg) if msg.contains("motion_tail_frames")), + "error must name motion_tail_frames", + ); + } + + #[test] + fn normalise_rejects_missing_total_frames_in_auto_expand() { + let mut req = canonical_request(Vec::new(), 4); + req.prompt = Some("missing total".into()); + // total_frames omitted. 
+ let err = req + .normalise() + .expect_err("missing total_frames should fail"); + assert!( + matches!(err, MoldError::Validation(msg) if msg.contains("total_frames")), + "error must name total_frames", + ); + } + + #[test] + fn is_ltx2_frame_count_matches_8k_plus_1() { + for valid in [1u32, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97] { + assert!( + is_ltx2_frame_count(valid), + "{valid} should be a valid LTX-2 frame count", + ); + } + for invalid in [0u32, 2, 8, 10, 16, 50, 96, 98, 100] { + assert!( + !is_ltx2_frame_count(invalid), + "{invalid} must not pass the 8k+1 check", + ); + } + } + + #[test] + fn build_stages_math_matches_stitch_budget() { + // Auto-expand must produce enough stages that the stitch delivers + // at least `total_frames` pixel frames. Stitch math: + // delivered = clip_frames + (N - 1) * (clip_frames - motion_tail) + let cases = [ + (400u32, 97u32, 4u32, 5u32), // 97 + 4*93 = 469 ≥ 400 + (200, 97, 4, 3), // 97 + 2*93 = 283 ≥ 200 + (97, 97, 4, 1), // single clip hits 97 exactly + (300, 97, 0, 4), // zero tail, 4*97 = 388 ≥ 300 + ]; + for (total, clip, tail, expected_n) in cases { + let req = auto_expand_request("m", total, clip, tail, None) + .normalise() + .expect("valid auto-expand should normalise"); + assert_eq!( + req.stages.len() as u32, + expected_n, + "expected {expected_n} stages for total={total}, clip={clip}, tail={tail}", + ); + let delivered = clip + (expected_n - 1) * (clip - tail); + assert!( + delivered >= total, + "{expected_n} stages deliver {delivered} frames but {total} were requested", + ); + } + } +} diff --git a/crates/mold-core/src/lib.rs b/crates/mold-core/src/lib.rs index a16bb81f..e7b2f3c1 100644 --- a/crates/mold-core/src/lib.rs +++ b/crates/mold-core/src/lib.rs @@ -1,5 +1,6 @@ pub mod build_info; pub mod catalog; +pub mod chain; pub mod client; pub mod config; pub mod control; @@ -18,6 +19,7 @@ mod config_test; mod test_support; pub use catalog::build_model_catalog; +pub use chain::{ChainRequest, 
ChainResponse, ChainStage, MAX_CHAIN_STAGES}; pub use client::MoldClient; pub use config::{ parse_device_ref_str, Config, DefaultModelResolution, DefaultModelSource, LoggingConfig, diff --git a/crates/mold-core/src/types.rs b/crates/mold-core/src/types.rs index ade380e1..72e95b90 100644 --- a/crates/mold-core/src/types.rs +++ b/crates/mold-core/src/types.rs @@ -1,7 +1,7 @@ use serde::{Deserialize, Serialize}; /// Serde helpers for `Option>` as base64 in JSON. -mod base64_opt { +pub(crate) mod base64_opt { use base64::Engine as _; use serde::{Deserialize, Deserializer, Serializer}; diff --git a/tasks/render-chain-v1-plan.md b/tasks/render-chain-v1-plan.md new file mode 100644 index 00000000..d0324863 --- /dev/null +++ b/tasks/render-chain-v1-plan.md @@ -0,0 +1,410 @@ +# Render Chain v1 — Implementation Plan + +> Server-side chained video generation for LTX-2: generate videos of arbitrary length by stringing together multiple per-clip renders and stitching the results. v1 exposes a single-prompt/arbitrary-length UX; the request shape is **stages-based from day one** so the eventual movie-maker (multi-prompt, multi-keyframe) extends without a breaking change. + +## Confirmed design decisions (signed off 2026-04-20) + +1. **Trim over-production from the tail** of the final clip, not the head. The head carries the user's starting image anchor and is perceptually load-bearing; tail frames are the freshest continuation but cheapest to lose. +2. **Per-stage seed derivation: `stage_seed = base_seed ^ ((stage_idx as u64) << 32)`.** Deterministic, reproducible, avoids identical-noise artefacts when prompts match across stages. `ChainStage::seed_offset` stays reserved as the v2 movie-maker override hook. +3. **Fail closed on mid-chain failure.** If any stage errors, return 502 and discard all prior stages. No partial stitch is ever written to the gallery. Partial-resume is a v2 movie-maker feature. +4. 
**1 GB RAM ceiling for the accumulation buffer.** Hold decoded `RgbImage`s in memory through the stitch — acceptable for the 400-frame 1216×704 target. Revisit with streaming encode when someone pushes 1000+ frames.
+5. **Single-GPU per chain.** The orchestrator runs every stage on the GPU the engine was loaded onto. Multi-GPU stage fan-out is a v2 perf win; docs mention it, code doesn't build it.
+
+**Goal:** `mold run ltx-2-19b-distilled:fp8 "a cat walking" --image cat.png --frames 400` produces a single 400-frame MP4, stitched from 5 coherent sub-clips (97 + 4×93 = 469 raw frames, trimmed to 400), each seeded by a motion tail of latents from the prior clip.
+
+**Scope (v1):**
+
+- LTX-2 only (other video engines intentionally out of scope).
+- Single prompt replicated across all stages. Optional starting image on stage 0.
+- Motion-tail carryover **using cached latents in-process** (no VAE re-encode between clips).
+- Single stitched output to the gallery. No per-clip gallery rows, no `chain_id` grouping.
+- Sequential execution (clip N+1 waits for N). Multi-GPU fan-out is v2.
+- Server-side orchestration under a new `/api/generate/chain[/stream]` route. CLI auto-routes when `--frames > max_per_clip`.
+
+**Explicitly NOT in v1:**
+
+- Movie maker UI (that's v2, built on the same server API).
+- Per-stage prompts/keyframes (the request shape supports them; the CLI doesn't expose them yet).
+- Crossfade / colour-matching at clip boundaries.
+- Pause/resume/retry of a partial chain.
+
+**Base branch:** `main` · **Feature branch:** `feat/render-chain-v1` · **PR target:** `main`
+
+---
+
+## The compatibility contract
+
+The key architectural decision: **the wire format is already multi-stage.** v1 auto-synthesises the stages list from a single prompt + total length, but the server only ever sees the stages form. That means v2 (movie maker) is additive — the SPA just lets the user author the stages list by hand, no server breaking changes.
+ +```json +POST /api/generate/chain +{ + "model": "ltx-2-19b-distilled:fp8", + "stages": [ + { "prompt": "a cat walking", "frames": 97, "source_image": "" }, + { "prompt": "a cat walking", "frames": 97 }, + { "prompt": "a cat walking", "frames": 97 }, + { "prompt": "a cat walking", "frames": 97 } + ], + "motion_tail_frames": 4, + "width": 1216, "height": 704, "fps": 24, + "seed": 42, "steps": 8, "guidance": 3.0, "strength": 1.0, + "output_format": "mp4" +} +``` + +Or the auto-expand form (what v1 CLI sends): + +```json +POST /api/generate/chain +{ + "model": "ltx-2-19b-distilled:fp8", + "prompt": "a cat walking", + "total_frames": 400, + "clip_frames": 97, + "source_image": "", + "motion_tail_frames": 4, + "width": 1216, "height": 704, "fps": 24, + "seed": 42, "steps": 8, "guidance": 3.0, "strength": 1.0, + "output_format": "mp4" +} +``` + +Server-side, a canonicalising function collapses the auto-expand form into stages. From the engine's POV there's only ever a `Vec`. + +--- + +## File map + +### New + +``` +crates/mold-core/src/chain.rs -- ChainStage, ChainRequest, ChainResponse types +crates/mold-inference/src/ltx2/chain.rs -- LTX-2 chain orchestrator + latent-tail carry +crates/mold-server/src/routes_chain.rs -- POST /api/generate/chain[/stream] +``` + +### Modified + +``` +crates/mold-core/src/lib.rs -- re-export chain types +crates/mold-core/src/client.rs -- MoldClient::generate_chain[_stream]() +crates/mold-inference/src/ltx2/mod.rs -- pub use chain::{Ltx2ChainOrchestrator, ChainTail} +crates/mold-inference/src/ltx2/pipeline.rs -- expose internal render path that returns (VideoData, ChainTail) +crates/mold-inference/src/ltx2/runtime.rs -- thread ChainTail through run_real_distilled_stage +crates/mold-server/src/lib.rs -- route registration +crates/mold-server/src/queue.rs -- chain handler uses ModelCache but does NOT enqueue via the existing video job queue (reason in §3) +crates/mold-cli/src/main.rs -- auto-route --frames > clip_max to /api/generate/chain 
+crates/mold-cli/src/commands/generate.rs -- chain client + progress rendering +CHANGELOG.md +website/guide/video.md -- document --frames N and the chain endpoint +``` + +--- + +## Conventions + +- All new Rust code gets unit tests where the logic is pure (stage expansion, tail shape math, concat-drop math). The orchestrator's end-to-end path is covered by an integration test that swaps in a fake engine. +- `mold-inference` crate has `test = false` on the `lib` target — new tests in `ltx2/chain.rs` must either run under `#[cfg(test)] mod tests` with logic that doesn't touch candle weights, or use the fake-engine pattern. Keep tests weight-free. +- CLI manual UAT runs against BEAST (`MOLD_HOST=http://beast:7680`) with `ltx-2-19b-distilled:fp8`. +- Commit scopes: `feat(chain): …`, `fix(chain): …`, `test(chain): …`, `docs(chain): …`. +- Every task ends with a commit. No mid-plan push. + +--- + +## Phases + +### Phase 0 — core types (no-op at runtime) + +**0.1. Add `mold-core::chain` module with wire types.** + +```rust +// crates/mold-core/src/chain.rs +pub struct ChainStage { + pub prompt: String, + pub frames: u32, + pub source_image: Option>, // PNG bytes + pub negative_prompt: Option, // future-proof; v1 ignores if Some + pub seed_offset: Option, // v2 hook; v1 derives from base seed +} + +pub struct ChainRequest { + pub model: String, + pub stages: Vec, // canonical form + #[serde(default)] + pub motion_tail_frames: u32, // 0 = single-frame handoff; >0 = multi-frame tail + pub width: u32, pub height: u32, pub fps: u32, + pub seed: Option, pub steps: u32, pub guidance: f64, + pub strength: f64, // applied to stage[0].source_image only + pub output_format: OutputFormat, + pub placement: Option, + // auto-expand form (server normalises): + pub prompt: Option, + pub total_frames: Option, + pub clip_frames: Option, + pub source_image: Option>, +} + +pub struct ChainResponse { pub video: VideoData, pub stage_count: u32, pub gpu: Option } +``` + +- Add a 
`normalise(self) -> Result` that collapses the auto-expand fields into stages when `stages.is_empty()`.
+- Validation: at least one stage, each stage has `frames` satisfying 8k+1 and > 0, total stage count ≤ 16 (`MAX_CHAIN_STAGES` — early guardrail; users aren't generating feature films with this yet).
+- Tests: `normalise_splits_single_prompt_into_stages`, `normalise_preserves_first_stage_image`, `normalise_rejects_empty`, `normalise_rejects_non_8k1_frames`.
+
+Commit: `feat(chain): add core wire types and request normalisation`.
+
+**0.2. Re-export from `mold_core`, add `MoldClient::generate_chain`/`generate_chain_stream`.**
+
+Mirror the existing `generate` / `generate_stream` shape. No server changes yet — client just has the surface area.
+
+Commit: `feat(core): MoldClient chain methods`.
+
+---
+
+### Phase 1 — LTX-2 chain orchestrator (single GPU, in-process)
+
+**1.1. Define `ChainTail` as the latent-carryover payload.**
+
+```rust
+// crates/mold-inference/src/ltx2/chain.rs
+pub struct ChainTail {
+    pub frames: u32,              // number of pixel frames this tail represents
+    pub latents: Tensor,          // [1, C, tail_latent_frames, H/32, W/32] on the engine device
+    pub last_rgb_frame: RgbImage, // for fallback + debugging
+}
+```
+
+The VAE temporal ratio is 8 with causal first frame, so `tail_latent_frames = ((tail_pixel_frames - 1) / 8 + 1).max(1)`. For `motion_tail_frames=4` this is 1 latent frame. For `motion_tail_frames=9` it's 2 latent frames. Tests cover the arithmetic.
+
+**1.2. Extend `Ltx2Engine` with a chain-aware generate path.**
+
+Add a method that `generate` proper delegates to:
+
+```rust
+impl Ltx2Engine {
+    pub fn generate_with_carryover(
+        &mut self,
+        req: &GenerateRequest,
+        carry: Option<&ChainTail>,
+    ) -> Result<(GenerateResponse, ChainTail)>;
+}
+```
+
+When `carry = None`, behaviour is identical to `self.generate(req)` (use the source_image path as today). When `carry = Some(tail)`, the engine:
+
+1.
Skips VAE encode on `stage_conditioning` for the keyframe at frame 0. +2. Instead, threads `tail.latents` straight into `maybe_load_stage_video_conditioning` via a new optional parameter. The patchified tail tokens go into `StageVideoConditioning::replacements` with `strength = 1.0` and `start_token = 0..tail_token_count`. +3. Extracts the last `K = motion_tail_frames` pixel frames' worth of latents from the completed denoise (before VAE decode) and returns them as the new `ChainTail`. + +The new latent extraction hook needs to run **after the last denoise step, before `vae.decode`** in the distilled and two-stage paths. Surface it as a single helper `extract_tail_latents(&final_latents, motion_tail_frames) -> Tensor` that narrows along the time axis. + +- Tests for the helper: `extract_tail_computes_correct_latent_slice`, `extract_tail_preserves_device_and_dtype`, `extract_tail_handles_single_frame_edge_case`. + +**1.3. Stage conditioning: accept pre-encoded latents instead of a staged image.** + +Currently `maybe_load_stage_video_conditioning` (`runtime.rs:1215`) reads an image path, decodes, VAE-encodes. Add a sibling path that accepts `Option<&Tensor>` as pre-patchified tokens (or raw latents to be patchified in place). Route through it when the orchestrator passes carryover. + +Concretely: a new variant on `StagedImage` or a parallel `StagedLatent` struct carried through `StagedConditioning`. Prefer the latter — keeps the existing image path pristine. + +```rust +pub struct StagedLatent { + pub latents: Tensor, // [1, C, T, H/32, W/32] + pub frame: u32, // start frame (0 for chain carryover) + pub strength: f32, // 1.0 for chain +} + +pub struct StagedConditioning { + pub images: Vec, + pub latents: Vec, // NEW, empty for today's callers + pub audio_path: Option, + pub video_path: Option, +} +``` + +`maybe_load_stage_video_conditioning` iterates `images` then `latents`, patchifying the latter directly without calling `vae.encode`. 
All existing call sites pass an empty `latents` Vec. + +- Test: `staged_latent_produces_same_replacement_token_shape_as_image_for_single_latent_frame`. + +**1.4. Build `Ltx2ChainOrchestrator`.** + +```rust +// crates/mold-inference/src/ltx2/chain.rs +pub struct Ltx2ChainOrchestrator<'a> { + engine: &'a mut Ltx2Engine, +} + +impl<'a> Ltx2ChainOrchestrator<'a> { + pub fn run( + &mut self, + req: &ChainRequest, + progress: Option, + ) -> Result; +} +``` + +Internal loop: + +``` +let mut tail: Option = None; +let mut accumulated_frames: Vec = Vec::new(); +let tail_drop = req.motion_tail_frames as usize; + +for (idx, stage) in req.stages.iter().enumerate() { + let per_clip = build_clip_request(stage, &req, tail.is_some())?; + let (resp, new_tail) = self.engine.generate_with_carryover(&per_clip, tail.as_ref())?; + let frames = decode_video_frames_from_response(&resp)?; + if idx == 0 { + accumulated_frames.extend(frames); + } else { + // drop the leading `tail_drop` pixel frames; they duplicate the prior clip's tail + accumulated_frames.extend(frames.into_iter().skip(tail_drop)); + } + tail = Some(new_tail); + emit_progress(progress.as_ref(), ChainStageDone { idx, total: req.stages.len() }); +} + +let stitched = encode_mp4(&accumulated_frames, req.fps)?; +Ok(ChainResponse { video: stitched, ... }) +``` + +- Stage-1 request has `source_image = stage.source_image`, `keyframes = None`. +- Stage-N request (N ≥ 2) has `source_image = None`, `keyframes = None`; the carryover is passed via the `tail` parameter to `generate_with_carryover`, not through the request DTO. +- Progress events: forward engine events with an added `stage_idx`, plus emit `ChainStageStart` / `ChainStageDone` / `ChainStitching` / `ChainComplete`. + +- Tests (fake engine): `chain_runs_all_stages_and_drops_tail_prefix_from_continuations`, `chain_with_zero_tail_concats_full_clips_without_drop`, `chain_progress_forwards_engine_events_with_stage_idx`, `chain_empty_stages_errors`. 
+ +Commit: `feat(ltx2): chain orchestrator with latent-tail carryover`. + +--- + +### Phase 2 — server route + +**2.1. `POST /api/generate/chain` (non-streaming).** + +Handler flow: + +1. Parse & normalise the `ChainRequest`. +2. Validate model is an LTX-2 family (`anyhow::bail!` with a clear error otherwise). +3. Grab the model's engine from `ModelCache` (load if needed, same as the existing video path). +4. Construct `Ltx2ChainOrchestrator` against it and call `run()`. +5. Save the stitched MP4 via the same save path as single-clip videos (`save_video_to_dir`), populating `OutputMetadata` with a synthetic prompt (`stages[0].prompt` for v1) and a note in a new optional metadata field `chain_stage_count: Option`. +6. Return `ChainResponse` as JSON. + +Do **not** go through the existing single-job queue — a chain is a long-running compound job and would block the queue for 10+ minutes. Instead, the handler holds the `ModelCache` mutex the same way the multi-GPU worker does, for the full chain duration. This is OK because the multi-GPU pool already has per-GPU thread isolation. + +**2.2. `POST /api/generate/chain/stream` (SSE).** + +Same flow but progress events stream as `data:` frames. Event types: + +- `chain_start { stage_count, estimated_total_frames }` +- `stage_start { stage_idx }` +- `denoise_step { stage_idx, step, total }` (forwarded from engine with `stage_idx` wrapped in) +- `stage_done { stage_idx, frames_emitted }` +- `stitching { total_frames }` +- `complete { video_frames, video_fps, video_base64, filename, seed, ... }` (same shape as `/api/generate/stream` complete event) +- `error { message }` + +The existing SSE completion-event helper (`build_sse_complete_event` in `queue.rs`) is not reusable as-is because it takes a single `GenerateResponse`; write a sibling `build_chain_sse_complete_event(&ChainResponse)` that produces the same JSON structure plus `chain_stage_count`. 
+ +- Tests: route-level tests with a fake engine that exercise both non-streaming and SSE shapes; verify SSE emits events in the expected order. + +Commit: `feat(server): chain render endpoint and SSE stream`. + +--- + +### Phase 3 — CLI + +**3.1. Auto-route `mold run` to `/api/generate/chain` when `--frames > max_per_clip`.** + +Add a constant in `mold-cli` for LTX-2 clip caps (97 for 19B distilled, 97 for 22B — same as today's single-clip validation). When `frames > cap`: + +- Build a `ChainRequest` with `prompt=…`, `total_frames=…`, `clip_frames=cap`, `source_image=…`, `motion_tail_frames=4` (default). +- Call `MoldClient::generate_chain_stream`. +- Render a progress bar per stage stacked with a parent "chain" bar. + +When `frames ≤ cap`, path is unchanged (`/api/generate/stream`, single clip, today's behaviour). + +- New flag: `--clip-frames N` to let advanced users override the per-clip length (default = model cap). +- New flag: `--motion-tail N` to override the tail (default 4, 0 to disable). +- Help text for `--frames` updates to mention chained output when > cap. + +- Tests: `run_frames_above_cap_selects_chain_endpoint` (argparse-level; doesn't invoke the network). + +**3.2. `--local` chain mode.** + +For parity with `mold run --local`, the CLI should run the orchestrator in-process when `--local` is passed. Factor the orchestrator invocation into a helper so both the server handler and the CLI local path share it. + +Commit: `feat(cli): chain rendering for --frames above clip cap`. + +--- + +### Phase 4 — docs & changelog + +**4.1. Website.** Add a new section in `website/guide/video.md` explaining chained video output, how motion tail works, and the CLI flags. Link it from the LTX-2 model page. + +**4.2. CHANGELOG.** Unreleased / Added entry describing the `/api/generate/chain` route, the CLI auto-routing behaviour, and the motion-tail carryover. + +**4.3. Skill file.** Update `.claude/skills/mold/SKILL.md` with the new CLI flags and endpoint. 
+ +Commit: `docs(chain): guide, changelog, and skill updates`. + +--- + +## Integration test: a realistic end-to-end + +One integration test lives in `crates/mold-server/tests/chain_integration.rs` (or inline in `tests/` if an integration dir exists). It: + +1. Stands up an in-process server with a **fake LTX-2 engine** (not real weights) whose `generate_with_carryover` returns a deterministic gradient pattern + a synthetic `ChainTail` whose latents are zeros but whose RGB tail frame is the last frame of the emitted clip. +2. POSTs an auto-expand chain request with `total_frames=200`, `clip_frames=97`, `motion_tail_frames=4`. +3. Asserts: + - Three stages fired. + - The stitched MP4 has `ceil((200 - 97) / 93) * 93 + 97 = 97 + 93*2 = 283 ≥ 200` frames before trim; after trim it's 200 frames. + - SSE stream emitted events in the expected order. + - The gallery DB got one row with `chain_stage_count = 3`. + +The fake-engine pattern keeps this test out of the GPU path and makes it safe to run in CI. + +--- + +## Open design decisions I'm flagging for your sign-off + +1. **Trim policy.** If `total_frames = 400` and chain math produces 469 frames, should we trim from the tail (final clip's final frames get cut — but those are the freshest continuation) or from the head (stage-0 frames get cut — but those are the user-anchored ones)? I recommend **trim from tail** because the head is where the user's starting image landed and matters more perceptually. + +2. **Seed handling across stages.** Should each stage get the same seed (reproducible but with artifacts from identical noise when prompts match), or derive per-stage seeds (`base_seed ^ (stage_idx << 32)`)? I recommend **derive per-stage**. `seed_offset` on `ChainStage` lets the movie maker override. + +3. **Failure mode mid-chain.** If stage 3 of 4 fails, do we return a 502 and discard everything, or return the partial stitch of stages 1–3? I recommend **fail closed for v1** — no partial output. 
Partial resume is a v2 movie-maker feature where individual stage regen is first-class.
+
+4. **Memory.** 400 frames × 1216×704×3 ≈ 1 GB of RgbImages held in RAM before MP4 encode. Acceptable for v1. If users push to 1000+ frames we revisit with streaming encode.
+
+5. **Placement.** Chain always runs on a single GPU for v1 (the one the engine was loaded onto). Multi-GPU fan-out (stage N and N+1 on different cards) is a v2 perf win; mention in docs but don't build.
+
+---
+
+## What `mold run` looks like after this ships
+
+```console
+$ mold run ltx-2-19b-distilled:fp8 "a cat walking through autumn leaves" \
+    --image cat.png --frames 400
+
+⏳ Chain render: 5 stages × 97 frames (motion tail: 4) → 469 stitched frames, trimmed to 400
+▸ Stage 1/5 · denoise step 8/8 · 47s
+▸ Stage 2/5 · denoise step 8/8 · 44s (tail carried from stage 1)
+▸ Stage 3/5 · denoise step 8/8 · 44s
+▸ Stages 4–5/5 · denoise step 8/8 · 44s each
+▸ Stitching 469 frames @ 24fps …
+✔ Saved mold-ltx-2-19b-distilled-{ts}.mp4 (400 frames, 16.7s, 16MB)
+```
+
+---
+
+## Out-of-scope for v1 but in-scope for v2 (movie maker)
+
+- SPA route `/movie` with a timeline authoring UI.
+- Per-stage prompts and keyframes exposed in the request body (the server already supports this — only the UI needs to change).
+- Per-clip gallery rows with `chain_id` grouping so users can iterate on individual stages.
+- Selective stage regeneration (replace stage 2 without redoing 1/3/4).
+- Crossfade blending at clip boundaries.
+- Multi-GPU stage fan-out.
+
+The whole point of v1 is to ship a stable foundation these land on top of without breaking changes.
From 0328e765f7866315100ecd8091b82295aa2cb4fd Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 16:31:38 -0700 Subject: [PATCH 03/31] feat(core): MoldClient chain methods MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add `MoldClient::generate_chain` (POST /api/generate/chain, non- streaming JSON request/response) and `MoldClient::generate_chain_stream` (POST /api/generate/chain/stream, SSE) mirroring the existing `generate` / `generate_stream` shape. The server routes land in Phase 2; this commit ships the client surface so Phase 1's fake-engine tests and Phase 2's route wiring have a settled wire contract to implement against. Chain-specific wire types (all new, under `mold_core::chain`): - `ChainProgressEvent` — tagged enum streamed under `event: progress`. Variants: `chain_start { stage_count, estimated_total_frames }`, `stage_start { stage_idx }`, `denoise_step { stage_idx, step, total }`, `stage_done { stage_idx, frames_emitted }`, `stitching { total_frames }`. snake_case tagged JSON matches the existing `SseProgressEvent` style. - `SseChainCompleteEvent` — kept as a sibling to `crate::types::SseCompleteEvent` rather than an extension, so chain completion shape can evolve independently (stage_count, stitched- video payload, optional thumb/GIF, audio metadata, elapsed time). 
Error translation matches the single-clip methods: | Status | generate_chain | generate_chain_stream | |------------------------|-------------------------------------------------|-------------------------------------------------| | 200 | parse ChainResponse JSON | parse SSE until `complete` event | | 404, empty body | hard error "chain endpoint not found" | `Ok(None)` — caller may fall back | | 404, non-empty body | `MoldError::ModelNotFound` | `MoldError::ModelNotFound` | | 422 | `MoldError::Validation` | `MoldError::Validation` | | 4xx/5xx else | generic anyhow | generic anyhow | The non-streaming empty-404 behaviour deliberately differs from SSE: streaming clients can fall back to non-streaming, but non-streaming callers have nowhere to go and should fail loudly. Integration coverage: - `crates/mold-core/tests/chain_client.rs` (wiremock): endpoint/body shape assertion on non-streaming; 422 → Validation; 404-with-body → ModelNotFound; non-streaming empty 404 → hard error; SSE empty 404 → Ok(None); SSE progress + complete roundtrip reconstructs `ChainResponse` with thumb + gpu. - Pure serde roundtrip test for every `ChainProgressEvent` variant asserting snake_case tag format. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-core/src/chain.rs | 124 +++++++++++++ crates/mold-core/src/client.rs | 132 ++++++++++++++ crates/mold-core/src/lib.rs | 5 +- crates/mold-core/tests/chain_client.rs | 234 +++++++++++++++++++++++++ 4 files changed, 494 insertions(+), 1 deletion(-) create mode 100644 crates/mold-core/tests/chain_client.rs diff --git a/crates/mold-core/src/chain.rs b/crates/mold-core/src/chain.rs index 101e479e..e8b68c7e 100644 --- a/crates/mold-core/src/chain.rs +++ b/crates/mold-core/src/chain.rs @@ -158,6 +158,85 @@ pub struct ChainResponse { pub gpu: Option, } +/// SSE completion event for a successful chain run. Streamed as the final +/// `data:` frame under the `event: complete` SSE type. 
The payload is +/// base64-encoded to stay JSON-safe; clients decode it into `VideoData`. +/// +/// This is a sibling to [`crate::types::SseCompleteEvent`] rather than an +/// extension so image/video vs. chain completion shapes stay independent +/// and can evolve separately. +#[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema)] +pub struct SseChainCompleteEvent { + /// Base64-encoded stitched video bytes (format per `format` field). + pub video: String, + pub format: OutputFormat, + #[schema(example = 1216)] + pub width: u32, + #[schema(example = 704)] + pub height: u32, + #[schema(example = 400)] + pub frames: u32, + #[schema(example = 24)] + pub fps: u32, + /// Base64-encoded first-frame PNG thumbnail. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub thumbnail: Option, + /// Base64-encoded animated GIF preview (always emitted for gallery UI). + #[serde(default, skip_serializing_if = "Option::is_none")] + pub gif_preview: Option, + #[serde(default, skip_serializing_if = "std::ops::Not::not")] + pub has_audio: bool, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub duration_ms: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub audio_sample_rate: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub audio_channels: Option, + /// Number of stages that ran end-to-end. + #[schema(example = 5)] + pub stage_count: u32, + /// GPU ordinal that handled the chain (multi-GPU only). + #[serde(default, skip_serializing_if = "Option::is_none")] + pub gpu: Option, + /// Wall-clock elapsed time across all stages + stitching. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub generation_time_ms: Option, +} + +/// Chain-specific SSE progress event. Streamed as `data:` JSON frames from +/// `POST /api/generate/chain/stream` under the `event: progress` SSE type. 
+/// +/// Per-stage denoise steps are wrapped with `stage_idx` so consumers can +/// render stacked progress bars (overall chain + per-stage) without a +/// separate subscription. Non-denoise engine events (weight load, cache +/// hits, etc.) are intentionally not forwarded through this enum in v1 — +/// they're scoped to individual stages and the UX goal for v1 is per-stage +/// progress, not per-component telemetry. +#[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema, PartialEq, Eq)] +#[serde(tag = "type", rename_all = "snake_case")] +pub enum ChainProgressEvent { + /// Emitted once at the start of the chain, after normalisation. Gives + /// consumers the final stage count and the target pre-trim frame total + /// so they can size progress bars up front. + ChainStart { + stage_count: u32, + estimated_total_frames: u32, + }, + /// Stage `stage_idx` (0-indexed) has started its denoise loop. + StageStart { stage_idx: u32 }, + /// Per-step denoise progress for the active stage. + DenoiseStep { + stage_idx: u32, + step: u32, + total: u32, + }, + /// Stage finished generating; `frames_emitted` is the raw clip frame + /// count before motion-tail trim at stitch time. + StageDone { stage_idx: u32, frames_emitted: u32 }, + /// All stages complete; stitching/encoding the final MP4. 
+ Stitching { total_frames: u32 }, +} + fn default_motion_tail_frames() -> u32 { 4 } @@ -566,6 +645,51 @@ mod tests { } } + #[test] + fn chain_progress_event_roundtrips_json_with_snake_case_tags() { + let cases = [ + ( + ChainProgressEvent::ChainStart { + stage_count: 5, + estimated_total_frames: 469, + }, + r#""type":"chain_start""#, + ), + ( + ChainProgressEvent::StageStart { stage_idx: 0 }, + r#""type":"stage_start""#, + ), + ( + ChainProgressEvent::DenoiseStep { + stage_idx: 2, + step: 4, + total: 8, + }, + r#""type":"denoise_step""#, + ), + ( + ChainProgressEvent::StageDone { + stage_idx: 3, + frames_emitted: 97, + }, + r#""type":"stage_done""#, + ), + ( + ChainProgressEvent::Stitching { total_frames: 400 }, + r#""type":"stitching""#, + ), + ]; + for (event, expected_tag) in cases { + let json = serde_json::to_string(&event).expect("serialize"); + assert!( + json.contains(expected_tag), + "missing snake_case tag {expected_tag} in {json}", + ); + let roundtrip: ChainProgressEvent = serde_json::from_str(&json).expect("deserialize"); + assert_eq!(roundtrip, event, "roundtrip must preserve payload"); + } + } + #[test] fn build_stages_math_matches_stitch_budget() { // Auto-expand must produce enough stages that the stitch delivers diff --git a/crates/mold-core/src/client.rs b/crates/mold-core/src/client.rs index e900739f..ffea387e 100644 --- a/crates/mold-core/src/client.rs +++ b/crates/mold-core/src/client.rs @@ -1,3 +1,4 @@ +use crate::chain::{ChainProgressEvent, ChainRequest, ChainResponse, SseChainCompleteEvent}; use crate::error::MoldError; use crate::types::{ ExpandRequest, ExpandResponse, GalleryImage, GenerateRequest, GenerateResponse, ImageData, @@ -313,6 +314,137 @@ impl MoldClient { anyhow::bail!("SSE stream ended without complete event") } + /// Submit a chained video generation request (non-streaming). 
+ /// + /// The server normalises the auto-expand form into stages, runs each + /// stage sequentially with motion-tail latent carryover, stitches the + /// result into a single video, and returns a [`ChainResponse`]. Large + /// chains take minutes — prefer [`Self::generate_chain_stream`] for + /// interactive clients that want progress updates. + pub async fn generate_chain(&self, req: &ChainRequest) -> Result { + let resp = self + .client + .post(format!("{}/api/generate/chain", self.base_url)) + .json(req) + .send() + .await?; + + if resp.status() == reqwest::StatusCode::NOT_FOUND { + let body = resp.text().await.unwrap_or_default(); + if body.is_empty() { + anyhow::bail!("chain endpoint not found — server predates render-chain v1"); + } + return Err(MoldError::ModelNotFound(body).into()); + } + if resp.status() == reqwest::StatusCode::UNPROCESSABLE_ENTITY { + let body = resp.text().await.unwrap_or_default(); + return Err(MoldError::Validation(format!("validation error: {body}")).into()); + } + if resp.status().is_client_error() || resp.status().is_server_error() { + let status = resp.status(); + let body = resp.text().await.unwrap_or_default(); + anyhow::bail!("server error {status}: {body}"); + } + + let chain: ChainResponse = resp.json().await?; + Ok(chain) + } + + /// Submit a chained video generation request with SSE progress streaming. + /// + /// Returns: + /// - `Ok(Some(response))` — streaming succeeded and the `complete` event + /// carried the stitched video. + /// - `Ok(None)` — server doesn't have the chain endpoint (empty 404). + /// Callers can fall back to [`Self::generate_chain`] or error. + /// - `Err(_)` — validation, model-not-found, or mid-stream server error. 
+ pub async fn generate_chain_stream( + &self, + req: &ChainRequest, + progress_tx: tokio::sync::mpsc::UnboundedSender, + ) -> Result> { + let mut resp = self + .client + .post(format!("{}/api/generate/chain/stream", self.base_url)) + .json(req) + .send() + .await?; + + if resp.status() == reqwest::StatusCode::NOT_FOUND { + let body = resp.text().await.unwrap_or_default(); + if body.is_empty() { + return Ok(None); + } + return Err(MoldError::ModelNotFound(body).into()); + } + if resp.status() == reqwest::StatusCode::UNPROCESSABLE_ENTITY { + let body = resp.text().await.unwrap_or_default(); + return Err(MoldError::Validation(format!("validation error: {body}")).into()); + } + if resp.status().is_client_error() || resp.status().is_server_error() { + let status = resp.status(); + let body = resp.text().await.unwrap_or_default(); + anyhow::bail!("server error {status}: {body}"); + } + + let b64 = base64::engine::general_purpose::STANDARD; + let mut buffer = String::new(); + while let Some(chunk) = resp.chunk().await? 
{ + buffer.push_str(&String::from_utf8_lossy(&chunk)); + + while let Some(event_text) = next_sse_event(&mut buffer) { + let (event_type, data) = parse_sse_event(&event_text); + match event_type.as_str() { + "progress" => { + if let Ok(p) = serde_json::from_str::(&data) { + let _ = progress_tx.send(p); + } + } + "complete" => { + let complete: SseChainCompleteEvent = serde_json::from_str(&data)?; + let payload = b64.decode(&complete.video)?; + let thumbnail = complete + .thumbnail + .as_deref() + .and_then(|s| b64.decode(s).ok()) + .unwrap_or_default(); + let gif_preview = complete + .gif_preview + .as_deref() + .and_then(|s| b64.decode(s).ok()) + .unwrap_or_default(); + let video = VideoData { + data: payload, + format: complete.format, + width: complete.width, + height: complete.height, + frames: complete.frames, + fps: complete.fps, + thumbnail, + gif_preview, + has_audio: complete.has_audio, + duration_ms: complete.duration_ms, + audio_sample_rate: complete.audio_sample_rate, + audio_channels: complete.audio_channels, + }; + return Ok(Some(ChainResponse { + video, + stage_count: complete.stage_count, + gpu: complete.gpu, + })); + } + "error" => { + let error: SseErrorEvent = serde_json::from_str(&data)?; + anyhow::bail!("server error: {}", error.message); + } + _ => {} + } + } + } + + anyhow::bail!("chain SSE stream ended without complete event") + } + /// Ask the server to pull (download) a model. Blocks until the download /// completes on the server side. The server updates its in-memory config /// so subsequent generate/load requests can find the model. 
diff --git a/crates/mold-core/src/lib.rs b/crates/mold-core/src/lib.rs index e7b2f3c1..9da6a5e2 100644 --- a/crates/mold-core/src/lib.rs +++ b/crates/mold-core/src/lib.rs @@ -19,7 +19,10 @@ mod config_test; mod test_support; pub use catalog::build_model_catalog; -pub use chain::{ChainRequest, ChainResponse, ChainStage, MAX_CHAIN_STAGES}; +pub use chain::{ + ChainProgressEvent, ChainRequest, ChainResponse, ChainStage, SseChainCompleteEvent, + MAX_CHAIN_STAGES, +}; pub use client::MoldClient; pub use config::{ parse_device_ref_str, Config, DefaultModelResolution, DefaultModelSource, LoggingConfig, diff --git a/crates/mold-core/tests/chain_client.rs b/crates/mold-core/tests/chain_client.rs new file mode 100644 index 00000000..dc06d1bb --- /dev/null +++ b/crates/mold-core/tests/chain_client.rs @@ -0,0 +1,234 @@ +//! Integration tests for `MoldClient::generate_chain{,_stream}` using +//! `wiremock` to simulate the `/api/generate/chain` server endpoints. +//! +//! These tests pin the HTTP surface (method, path, JSON request body) and +//! verify error translation (422 → Validation, 404 empty → None on stream, +//! 404 with body → ModelNotFound). They do NOT exercise real LTX-2 work — +//! the server side lands in Phase 2. 
+ +use base64::Engine as _; +use mold_core::chain::{ChainProgressEvent, ChainRequest, ChainStage, SseChainCompleteEvent}; +use mold_core::error::MoldError; +use mold_core::types::OutputFormat; +use mold_core::MoldClient; +use wiremock::matchers::{body_json_schema, method, path}; +use wiremock::{Mock, MockServer, ResponseTemplate}; + +fn mold_error(err: &anyhow::Error) -> &MoldError { + err.downcast_ref::() + .unwrap_or_else(|| panic!("not a MoldError: {err}")) +} + +fn sample_request() -> ChainRequest { + ChainRequest { + model: "ltx-2-19b-distilled:fp8".into(), + stages: vec![ChainStage { + prompt: "a cat walking".into(), + frames: 97, + source_image: None, + negative_prompt: None, + seed_offset: None, + }], + motion_tail_frames: 4, + width: 1216, + height: 704, + fps: 24, + seed: Some(42), + steps: 8, + guidance: 3.0, + strength: 1.0, + output_format: OutputFormat::Mp4, + placement: None, + prompt: None, + total_frames: None, + clip_frames: None, + source_image: None, + } +} + +fn minimal_chain_response_json() -> serde_json::Value { + serde_json::json!({ + "video": { + "data": [], + "format": "mp4", + "width": 1216, + "height": 704, + "frames": 97, + "fps": 24, + "thumbnail": [] + }, + "stage_count": 1 + }) +} + +// ── /api/generate/chain (non-streaming) ──────────────────────────────── + +#[tokio::test] +async fn generate_chain_posts_to_correct_endpoint_and_parses_response() { + let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain")) + .and(body_json_schema::) + .respond_with(ResponseTemplate::new(200).set_body_json(minimal_chain_response_json())) + .expect(1) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let resp = client + .generate_chain(&sample_request()) + .await + .expect("non-streaming chain should succeed on 200"); + assert_eq!(resp.stage_count, 1); + assert_eq!(resp.video.frames, 97); + assert_eq!(resp.video.format, OutputFormat::Mp4); +} + +#[tokio::test] +async fn 
generate_chain_surfaces_422_as_validation_error() { + let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain")) + .respond_with(ResponseTemplate::new(422).set_body_string("frames must be 8k+1")) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let err = client + .generate_chain(&sample_request()) + .await + .expect_err("422 must error"); + assert!( + matches!(mold_error(&err), MoldError::Validation(msg) if msg.contains("8k+1")), + "422 must translate to MoldError::Validation carrying the body", + ); +} + +#[tokio::test] +async fn generate_chain_translates_404_with_body_to_model_not_found() { + let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain")) + .respond_with(ResponseTemplate::new(404).set_body_string("model 'ltx-2-foo' not found")) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let err = client + .generate_chain(&sample_request()) + .await + .expect_err("404 with body must error"); + assert!( + matches!(mold_error(&err), MoldError::ModelNotFound(msg) if msg.contains("ltx-2-foo")), + "404-with-body must translate to MoldError::ModelNotFound", + ); +} + +#[tokio::test] +async fn generate_chain_empty_404_fails_loudly_instead_of_silently() { + // Non-streaming callers have no fallback path — an empty 404 means the + // server predates render-chain v1, which is a hard error (unlike the + // streaming case where Ok(None) signals "try the non-streaming path"). 
+ let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain")) + .respond_with(ResponseTemplate::new(404).set_body_string("")) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let err = client + .generate_chain(&sample_request()) + .await + .expect_err("empty 404 must error on non-streaming path"); + let msg = format!("{err}"); + assert!( + msg.contains("chain endpoint not found"), + "error must name the missing endpoint, got: {msg}", + ); +} + +// ── /api/generate/chain/stream (SSE) ─────────────────────────────────── + +#[tokio::test] +async fn generate_chain_stream_returns_none_on_empty_404() { + // An empty 404 on the streaming endpoint means the server doesn't + // support chain SSE yet — callers are expected to fall back to the + // non-streaming path. + let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain/stream")) + .respond_with(ResponseTemplate::new(404).set_body_string("")) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let (tx, _rx) = tokio::sync::mpsc::unbounded_channel::(); + let out = client + .generate_chain_stream(&sample_request(), tx) + .await + .expect("empty 404 should resolve to Ok(None)"); + assert!(out.is_none(), "empty 404 must signal unsupported endpoint"); +} + +#[tokio::test] +async fn generate_chain_stream_parses_progress_and_complete_events() { + let b64 = base64::engine::general_purpose::STANDARD; + let video_bytes = b"FAKE_MP4_BYTES"; + let thumb_bytes = b"THUMB"; + let complete = SseChainCompleteEvent { + video: b64.encode(video_bytes), + format: OutputFormat::Mp4, + width: 1216, + height: 704, + frames: 97, + fps: 24, + thumbnail: Some(b64.encode(thumb_bytes)), + gif_preview: None, + has_audio: false, + duration_ms: Some(4040), + audio_sample_rate: None, + audio_channels: None, + stage_count: 1, + gpu: Some(0), + generation_time_ms: Some(45_000), + }; + let progress = 
ChainProgressEvent::DenoiseStep { + stage_idx: 0, + step: 4, + total: 8, + }; + // Build a chunk-encoded SSE body carrying one progress event then + // complete. `\n\n` terminates each SSE event. + let body = format!( + "event: progress\ndata: {}\n\nevent: complete\ndata: {}\n\n", + serde_json::to_string(&progress).unwrap(), + serde_json::to_string(&complete).unwrap(), + ); + + let server = MockServer::start().await; + Mock::given(method("POST")) + .and(path("/api/generate/chain/stream")) + .respond_with( + ResponseTemplate::new(200) + .insert_header("content-type", "text/event-stream") + .set_body_string(body), + ) + .mount(&server) + .await; + + let client = MoldClient::new(&server.uri()); + let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::(); + let resp = client + .generate_chain_stream(&sample_request(), tx) + .await + .expect("SSE stream should succeed") + .expect("complete event should yield a response"); + + assert_eq!(resp.stage_count, 1); + assert_eq!(resp.video.data, video_bytes); + assert_eq!(resp.video.thumbnail, thumb_bytes); + assert_eq!(resp.gpu, Some(0)); + let ev = rx.recv().await.expect("progress event should be forwarded"); + assert_eq!(ev, progress); +} From e89826f0ec5422c799a9861d88caba57117d26ae Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 17:15:00 -0700 Subject: [PATCH 04/31] feat(ltx2): ChainTail type and latent-tail extraction helper MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce the carryover primitive that render-chain stages hand to each other. `ChainTail { frames, latents, last_rgb_frame }` bundles the final VAE latents of a stage's motion tail so the next stage can patch those tokens straight into its conditioning without a VAE decode → RGB → VAE encode round-trip. No engine wiring yet — the orchestrator and the `generate_with_carryover` entry point land in sibling commits. 
Helpers in the new `ltx2::chain` module: - `tail_latent_frame_count(pixel_frames: u32) -> usize` — exposes the LTX-2 VAE's 8× causal-first-frame temporal ratio as the formula `((n - 1) / 8) + 1`. Matches `VideoLatentShape::from_pixel_shape`. Panics on `0`; callers must validate upstream. - `extract_tail_latents(final_latents: &Tensor, pixel_frames: u32) -> Result` — narrows the time axis of a rank-5 `[B, C, T, H, W]` latents tensor down to the last K latent frames corresponding to the requested pixel-frame tail. Errors (not panics) on rank mismatch or oversize tail request so orchestrator bugs surface as operational errors, not process aborts. Unit tests cover: the VAE formula across representative tail sizes (4→1, 9→2, 16→2, 17→3, 97→13), rejection of a zero pixel-frame tail, correct narrowing on a synthetic [1, 2, 3, 1, 1] tensor with sentinel values proving the last latent frame is returned across all channels, narrowing on a larger rank-5 tensor, rank-4 rejection, and oversize-tail rejection. All tests are weight-free and run under `cargo test -p mold-ai-inference --lib`. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-inference/src/ltx2/chain.rs | 197 ++++++++++++++++++++++++ crates/mold-inference/src/ltx2/mod.rs | 2 + 2 files changed, 199 insertions(+) create mode 100644 crates/mold-inference/src/ltx2/chain.rs diff --git a/crates/mold-inference/src/ltx2/chain.rs b/crates/mold-inference/src/ltx2/chain.rs new file mode 100644 index 00000000..b1c4c91e --- /dev/null +++ b/crates/mold-inference/src/ltx2/chain.rs @@ -0,0 +1,197 @@ +//! LTX-2 chain carryover primitives. +//! +//! Server-side chained video generation stitches multiple per-clip renders +//! into a single output. To avoid a VAE decode → RGB → VAE encode round-trip +//! between clips (which loses information and doubles VAE cost), the tail of +//! each clip is carried across as latent-space tokens and threaded into the +//! next clip's conditioning directly. +//! +//! 
This module owns the data types and shape math for that handoff. The +//! orchestrator and the `Ltx2Engine::generate_with_carryover` entry point +//! land in sibling commits. +//! +//! See `tasks/render-chain-v1-plan.md` Phase 1.1 for context. + +use anyhow::{anyhow, Context, Result}; +use candle_core::Tensor; +use image::RgbImage; + +use crate::ltx2::model::shapes::SpatioTemporalScaleFactors; + +/// Opaque carryover payload handed from one chain stage to the next. +/// +/// Holds the final VAE latents of the emitting stage's motion tail, not the +/// decoded pixels — so the receiving stage can patchify the tokens directly +/// into its conditioning without a VAE re-encode. +#[derive(Debug, Clone)] +pub struct ChainTail { + /// Number of *pixel* frames this tail represents (not latent frames). + /// Clients of [`ChainTail`] work in pixel-frame units because that's + /// what users think in; the latent-frame count is derived from this + /// plus the LTX-2 VAE's 8× causal temporal ratio. + pub frames: u32, + + /// Latent tokens for the tail. + /// + /// Shape: `[batch=1, channels=128, tail_latent_frames, H/32, W/32]` + /// where `tail_latent_frames = tail_latent_frame_count(self.frames)`. + /// + /// Dtype is whatever the denoise loop produced — typically `F32`. + /// Device is the engine's active device (GPU or CPU); the orchestrator + /// is responsible for ensuring the next stage runs on the same device. + pub latents: Tensor, + + /// The last decoded pixel frame of the emitting stage. Kept for + /// debugging, progress UIs that want a thumbnail of the handoff point, + /// and as a fallback rendering target if latent carryover ever needs + /// to be disabled at runtime. + pub last_rgb_frame: RgbImage, +} + +/// Number of latent frames corresponding to `pixel_frames` pixel frames +/// under the LTX-2 VAE's 8× causal temporal compression. `1` for +/// `1..=8` pixel frames, `2` for `9..=16`, etc. Matches +/// `VideoLatentShape::from_pixel_shape`. 
+/// +/// Panics if `pixel_frames == 0` — a zero-frame tail is nonsensical and +/// would under-flow the formula. Callers must validate upstream. +pub fn tail_latent_frame_count(pixel_frames: u32) -> usize { + assert!( + pixel_frames > 0, + "tail_latent_frame_count: pixel_frames must be > 0", + ); + let scale = SpatioTemporalScaleFactors::default().time; + ((pixel_frames as usize - 1) / scale) + 1 +} + +/// Slice the last `tail_latent_frame_count(pixel_frames)` frames off the +/// time axis of a rank-5 video-latents tensor shaped +/// `[B, C, T, H, W]`. +/// +/// The returned tensor is a view/narrow on the input (no copy on candle's +/// current backends) so callers who intend to hand it to a separate engine +/// invocation — which may drop this engine's state and rebuild it — should +/// `.contiguous()` or `.copy()` the result before the original owner goes +/// out of scope. +/// +/// Errors if the tensor is not rank-5 or the requested tail exceeds the +/// available time axis — the latter would mean the orchestrator asked for +/// more tail than the stage produced, which indicates a caller bug. 
+pub fn extract_tail_latents(final_latents: &Tensor, pixel_frames: u32) -> Result { + let dims = final_latents.dims(); + if dims.len() != 5 { + return Err(anyhow!( + "extract_tail_latents: expected rank-5 tensor [B, C, T, H, W], got shape {:?}", + dims, + )); + } + let time = dims[2]; + let tail = tail_latent_frame_count(pixel_frames); + if tail > time { + return Err(anyhow!( + "extract_tail_latents: tail requests {} latent frames but the stage emitted only {} \ + (pixel_frames={}, tensor shape={:?})", + tail, + time, + pixel_frames, + dims, + )); + } + let start = time - tail; + final_latents + .narrow(2, start, tail) + .with_context(|| format!("narrow last {tail} latent frames off time axis")) +} + +#[cfg(test)] +mod tests { + use super::*; + use candle_core::{DType, Device}; + + #[test] + fn tail_latent_frame_count_matches_vae_formula() { + // Single-frame tail and up to 8 pixel frames fit in 1 latent frame + // (LTX-2 VAE uses causal first frame + 8× temporal compression). + for px in [1u32, 2, 4, 8] { + assert_eq!(tail_latent_frame_count(px), 1, "{px} pixel frames"); + } + // 9..=16 span 2 latent frames, 17..=24 span 3, etc. + assert_eq!(tail_latent_frame_count(9), 2); + assert_eq!(tail_latent_frame_count(16), 2); + assert_eq!(tail_latent_frame_count(17), 3); + assert_eq!(tail_latent_frame_count(24), 3); + // Full-clip tail (97 frames) → 13 latent frames, matching + // VideoLatentShape::from_pixel_shape under the same VAE ratio. + assert_eq!(tail_latent_frame_count(97), 13); + } + + #[test] + #[should_panic(expected = "pixel_frames must be > 0")] + fn tail_latent_frame_count_rejects_zero() { + tail_latent_frame_count(0); + } + + #[test] + fn extract_tail_narrows_last_latent_frame_for_4_pixel_frame_tail() { + // Build a synthetic [1, 2, 3, 1, 1] where channel 0 is the latent- + // frame index and channel 1 is a sentinel (42, 43, 44) so we can + // see which frames the narrow returns. 
+ let data = vec![ + // frame 0 + 0.0f32, 42.0, // frame 1 + 1.0, 43.0, // frame 2 + 2.0, 44.0, + ]; + // Arrange [B=1, C=2, T=3, H=1, W=1]. `Tensor::from_vec` fills in + // row-major order — the permute below puts channels on axis 1. + let raw = Tensor::from_vec(data, (1, 3, 2, 1, 1), &Device::Cpu).expect("build raw tensor"); + // Reshape [1, T, C, H, W] → [1, C, T, H, W] + let latents = raw + .permute([0, 2, 1, 3, 4]) + .expect("permute to [B, C, T, H, W]"); + assert_eq!(latents.dims(), &[1, 2, 3, 1, 1]); + + // tail_latent_frame_count(4) = 1 → take the last latent frame only. + let tail = extract_tail_latents(&latents, 4).expect("extract"); + assert_eq!(tail.dims(), &[1, 2, 1, 1, 1]); + let values = tail.flatten_all().unwrap().to_vec1::().unwrap(); + assert_eq!( + values, + vec![2.0, 44.0], + "tail must be the last latent frame (index 2) across all channels", + ); + } + + #[test] + fn extract_tail_narrows_two_frames_for_9_pixel_frame_tail() { + // Simple rank-5 zero tensor with T=3; narrowing the last 2 frames + // out of 3 is enough to verify the shape without wrestling with + // permutations again. + let latents = Tensor::zeros((1, 1, 3, 2, 2), DType::F32, &Device::Cpu).unwrap(); + let tail = extract_tail_latents(&latents, 9).expect("extract"); + assert_eq!(tail.dims(), &[1, 1, 2, 2, 2]); + } + + #[test] + fn extract_tail_rejects_rank_4_tensor() { + let bad = Tensor::zeros((1, 128, 3, 4), DType::F32, &Device::Cpu).unwrap(); + let err = extract_tail_latents(&bad, 4).expect_err("rank 4 must fail"); + let msg = format!("{err}"); + assert!( + msg.contains("rank-5") && msg.contains("T, H, W"), + "error must identify the rank mismatch, got: {msg}", + ); + } + + #[test] + fn extract_tail_rejects_oversize_request() { + // Tensor has 1 latent frame; asking for a 9-pixel-frame tail needs 2. 
+ let latents = Tensor::zeros((1, 128, 1, 4, 4), DType::F32, &Device::Cpu).unwrap(); + let err = extract_tail_latents(&latents, 9).expect_err("oversize tail must fail"); + let msg = format!("{err}"); + assert!( + msg.contains("requests 2") && msg.contains("only 1"), + "error must name the latent-frame mismatch, got: {msg}", + ); + } +} diff --git a/crates/mold-inference/src/ltx2/mod.rs b/crates/mold-inference/src/ltx2/mod.rs index d2101d33..9858e1d5 100644 --- a/crates/mold-inference/src/ltx2/mod.rs +++ b/crates/mold-inference/src/ltx2/mod.rs @@ -1,5 +1,6 @@ mod assets; mod backend; +pub mod chain; mod conditioning; mod execution; mod guidance; @@ -13,4 +14,5 @@ mod runtime; mod sampler; mod text; +pub use chain::{extract_tail_latents, tail_latent_frame_count, ChainTail}; pub use pipeline::Ltx2Engine; From e91721090cdf3839897854214bdb7747ada10dd4 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 17:21:04 -0700 Subject: [PATCH 05/31] feat(ltx2): staged latent conditioning bypasses VAE encode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `StagedConditioning` now carries both disk-backed images (existing single-clip path) and in-memory latent blocks (new, empty for every non-chain caller). The render-chain orchestrator will populate the new `latents: Vec` field with a prior stage's motion-tail latents so the receiving stage can patchify those tokens straight into its `StageVideoConditioning::replacements` without the VAE decode → RGB → VAE encode round-trip — that's the point of latent carryover. Changes: - `StagedLatent { latents, frame, strength }` in `ltx2::conditioning` — mirrors `StagedImage`'s semantics but with a pre-encoded `candle_core::Tensor` instead of a disk path. `frame = 0` routes tokens through `replacements` (chain v1 motion tail); non-zero `frame` builds a `VideoTokenAppendCondition` so the movie- maker in v2 can thread latents into arbitrary positions. 
- `StagedConditioning` drops `PartialEq` since `Tensor` doesn't implement structural equality. Grepped for comparison usages — none. Existing callers of `stage_conditioning()` get `latents: Vec::new()`. - `maybe_load_stage_video_conditioning` in `runtime.rs`: - Early-return gate now also considers `plan.conditioning.latents`. - VAE is loaded conditionally: only when images or reference video need encoding. Pure-latent chain handoffs skip VAE load entirely. - New loop iterates staged latents, patchifies each block, routes frame-0 tokens to `replacements` (keyframe pipelines aside) and other frames to `appended` — symmetrical with the image path. Tests (weight-free): - `stage_conditioning_leaves_latents_empty_for_non_chain_callers` — pins the back-compat invariant: every non-chain generate path continues to receive an empty latents vec. - `staged_latent_patchifies_to_same_token_shape_as_image_at_single_latent_frame` — verifies a `[1, 128, 1, 22, 38]` chain-tail latent block patchifies to `[1, 836, 128]` tokens, the same shape the image-conditioning path produces after VAE encode + patchify for the equivalent latent geometry. Chain orchestrator + `Ltx2Engine::generate_with_carryover` land in the sibling Phase 1c commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../mold-inference/src/ltx2/conditioning.rs | 56 ++++++++++++++- crates/mold-inference/src/ltx2/runtime.rs | 72 +++++++++++++++++-- 2 files changed, 123 insertions(+), 5 deletions(-) diff --git a/crates/mold-inference/src/ltx2/conditioning.rs b/crates/mold-inference/src/ltx2/conditioning.rs index b0b8d0e7..ce7c9bb1 100644 --- a/crates/mold-inference/src/ltx2/conditioning.rs +++ b/crates/mold-inference/src/ltx2/conditioning.rs @@ -1,4 +1,5 @@ use anyhow::{bail, Result}; +use candle_core::Tensor; use mold_core::{GenerateRequest, TimeRange}; use std::fs; use std::ops::RangeInclusive; @@ -11,9 +12,38 @@ pub(crate) struct StagedImage { pub(crate) strength: f32, } -#[derive(Debug, Clone, PartialEq)] +/// Pre-encoded latent block used as conditioning input, bypassing the +/// staged-image path's VAE encode. Populated by the render-chain +/// orchestrator when handing a motion-tail off between stages; empty for +/// every non-chain caller today. +/// +/// Tensor shape must be `[batch=1, channels=128, T_latent, H/32, W/32]` +/// to match the LTX-2 video VAE output. The runtime patchifies it directly +/// into conditioning tokens. +#[derive(Debug, Clone)] +pub(crate) struct StagedLatent { + pub(crate) latents: Tensor, + /// Starting pixel frame for this latent block. `0` routes the tokens + /// through `StageVideoConditioning::replacements`; non-zero values + /// build a `VideoTokenAppendCondition` like the keyframe image path. + pub(crate) frame: u32, + /// Replacement/append strength. `1.0` for chain motion-tail carryover + /// (hard-overwrite), matching the keyframe image strength convention. + pub(crate) strength: f32, +} + +/// Conditioning inputs staged for a single run. Carries both disk-backed +/// files (images, audio, reference video — existing single-clip flow) and +/// in-memory latent blocks (chain carryover — new, empty for non-chain +/// callers). 
+/// +/// Not `PartialEq` because `StagedLatent` wraps a `candle_core::Tensor` +/// which doesn't implement meaningful structural equality. Existing tests +/// only compare individual fields so this is safe to drop. +#[derive(Debug, Clone)] pub(crate) struct StagedConditioning { pub(crate) images: Vec, + pub(crate) latents: Vec, pub(crate) audio_path: Option, pub(crate) video_path: Option, } @@ -99,6 +129,7 @@ pub(crate) fn stage_conditioning( Ok(StagedConditioning { images, + latents: Vec::new(), audio_path, video_path, }) @@ -224,6 +255,29 @@ mod tests { assert!(mask[18..].iter().all(|value| *value == 0.0)); } + #[test] + fn stage_conditioning_leaves_latents_empty_for_non_chain_callers() { + // Single-clip callers build `StagedConditioning` via this function; + // the `latents` field (used by the render-chain orchestrator to inject + // pre-encoded motion-tail tokens) must stay empty so existing runs + // keep routing conditioning through the image path with VAE encode. + let work_dir = tempfile::tempdir().unwrap(); + let mut req = req(); + req.source_image = Some(fake_png_bytes()); + req.keyframes = Some(vec![KeyframeCondition { + frame: 8, + image: fake_png_bytes(), + }]); + req.source_video = Some(fake_mp4_bytes()); + req.audio_file = Some(fake_wav_bytes()); + + let staged = stage_conditioning(&req, work_dir.path()).unwrap(); + assert!( + staged.latents.is_empty(), + "non-chain callers must leave latents empty", + ); + } + #[test] fn stage_conditioning_stages_source_image_as_frame_zero_replacement() { let work_dir = tempfile::tempdir().unwrap(); diff --git a/crates/mold-inference/src/ltx2/runtime.rs b/crates/mold-inference/src/ltx2/runtime.rs index 72135d92..8df1f26c 100644 --- a/crates/mold-inference/src/ltx2/runtime.rs +++ b/crates/mold-inference/src/ltx2/runtime.rs @@ -1219,17 +1219,32 @@ fn maybe_load_stage_video_conditioning( dtype: DType, include_reference_video: bool, ) -> Result { - if plan.conditioning.images.is_empty() && !include_reference_video { + 
if plan.conditioning.images.is_empty() + && plan.conditioning.latents.is_empty() + && !include_reference_video + { return Ok(StageVideoConditioning::default()); } - let mut vae = load_ltx2_video_vae(plan, device, dtype)?; - vae.use_tiling = false; - vae.use_framewise_decoding = false; + // The VAE is only needed when we have images to encode or a reference + // video to ingest. Pre-encoded staged latents (chain carryover) skip + // VAE load entirely — that's the whole point of latent carryover. + let need_vae = !plan.conditioning.images.is_empty() || include_reference_video; + let mut vae = if need_vae { + let mut loaded = load_ltx2_video_vae(plan, device, dtype)?; + loaded.use_tiling = false; + loaded.use_framewise_decoding = false; + Some(loaded) + } else { + None + }; let patchifier = VideoLatentPatchifier::new(1); let mut conditioning = StageVideoConditioning::default(); for image in &plan.conditioning.images { + let vae = vae.as_mut().expect( + "need_vae guarantees the VAE is loaded whenever plan.conditioning.images is non-empty", + ); let bytes = std::fs::read(&image.path).with_context(|| { format!( "failed to read staged LTX-2 conditioning image '{}'", @@ -1271,7 +1286,36 @@ fn maybe_load_stage_video_conditioning( )?); } } + // Pre-encoded latents (chain carryover). No VAE needed — tokens come + // straight from the caller. For v1 chain this only ever holds a frame-0 + // replacement (motion-tail latents from the prior stage); appended + // (non-frame-0) is kept as a forward-compat branch for the movie-maker. 
+ for staged in &plan.conditioning.latents { + let latents = staged.latents.to_device(device)?.to_dtype(DType::F32)?; + let use_guiding_latent = matches!(plan.pipeline, PipelineKind::Keyframe); + if staged.frame == 0 && !use_guiding_latent { + let tokens = patchifier.patchify(&latents)?; + conditioning.replacements.push(VideoTokenReplacement { + start_token: 0, + tokens, + strength: staged.strength as f64, + }); + } else { + conditioning + .appended + .push(append_condition_from_video_latents( + &latents, + pixel_shape, + staged.frame, + 1, + staged.strength as f64, + )?); + } + } if include_reference_video { + let vae = vae.as_mut().expect( + "need_vae guarantees the VAE is loaded whenever include_reference_video is true", + ); let video_path = plan.conditioning.video_path.as_ref().with_context(|| { format!( "native {:?} stage requested reference video conditioning without a staged source_video", @@ -5784,6 +5828,26 @@ mod tests { ); } + #[test] + fn staged_latent_patchifies_to_same_token_shape_as_image_at_single_latent_frame() { + // A 4-pixel-frame motion tail at 1216×704 output lands on a latent + // block of shape [1, 128, 1, 22, 38]. The render-chain orchestrator + // produces this block from the prior stage's denoise result; the + // image-conditioning path produces the same shape after VAE encode. + // Both must patchify to [1, T*H*W, C] = [1, 1*22*38, 128] tokens so + // the downstream replacement pass sees them identically regardless + // of which path produced them. 
+ let latents = Tensor::zeros( + (1, LTX2_VIDEO_LATENT_CHANNELS, 1, 22, 38), + DType::F32, + &Device::Cpu, + ) + .unwrap(); + let patchifier = super::VideoLatentPatchifier::new(1); + let tokens = patchifier.patchify(&latents).expect("patchify"); + assert_eq!(tokens.dims(), &[1, 22 * 38, LTX2_VIDEO_LATENT_CHANNELS]); + } + #[test] fn video_conditioning_self_attention_mask_blocks_cross_keyframe_attention() { let conditioning = StageVideoConditioning { From 14801c78574cf78b0c0bc01df862f5d7bd3147fc Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 17:28:37 -0700 Subject: [PATCH 06/31] feat(ltx2): chain orchestrator with motion-tail carryover loop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add `Ltx2ChainOrchestrator` that drives the per-stage render loop for chained video generation: builds each stage's `GenerateRequest`, threads the prior stage's `ChainTail` through the renderer, drops the leading motion-tail frames on every continuation, accumulates frames, and returns a `ChainRunOutput`. The `ChainStageRenderer` trait is the seam between the orchestrator (pure control flow) and the engine (tensor work). The LTX-2 engine implementation lands in Phase 1d — this commit ships the orchestrator fully tested against a fake renderer so the engine plumbing can be reviewed in isolation. Behaviour nailed down (from the 2026-04-20 sign-off): - **Per-stage seeds**: `base_seed ^ ((stage_idx as u64) << 32)`. A stage's `seed_offset` overrides the default when set — reserved for the v2 movie-maker's "regen just this stage" affordance. - **Motion-tail trim**: stage 0 emits all its frames; continuations drop the leading `req.motion_tail_frames` pixel frames because those duplicate the previous clip's tail that was threaded back as latent conditioning. `motion_tail_frames = 0` is a legitimate configuration (simple concat). - **Fail closed**: a mid-chain renderer error bubbles up immediately. 
All frames accumulated so far are discarded — no partial stitch is ever written to the gallery. Partial resume is a v2 feature. - **No audio or target-total-frame trim in v1**: the orchestrator delivers whatever frame count the stages produce (with tail drops applied). Target-total trimming is the caller's responsibility (server / CLI). Audio-video chains are out of scope for v1. Progress events forwarded through `Option<&mut dyn FnMut(ChainProgressEvent)>`: `ChainStart` → `StageStart` → `DenoiseStep` (wrapping the renderer's `StageProgressEvent`s with `stage_idx`) → `StageDone` → (next stage) → `Stitching`. Chain-level subscribers can render a stacked overall+per-stage progress bar without coordinating with the engine. Per-stage `GenerateRequest` is constructed to ensure only stage 0 carries the optional starting image — even if the caller forgot to clear it on later stages, the orchestrator suppresses it because continuations must condition on motion-tail latents only. `strength` becomes `1.0` on continuations regardless of the chain default since the tail carryover is always a hard replacement. Tests (weight-free, injecting a `FakeRenderer`): - `chain_runs_all_stages_and_drops_tail_prefix_from_continuations` — 3×97-frame clips with 4-frame tail produce exactly 97 + 2×93 = 283 accumulated frames. - `chain_with_zero_tail_concats_full_clips_without_drop` — `tail=0` keeps every frame on continuations. - `chain_empty_stages_errors_without_calling_renderer` — zero-stage requests fail before touching the renderer. - `chain_fails_closed_mid_chain_discarding_accumulated_frames` — simulated stage-1 failure bubbles up; stage 2 never runs. - `chain_derives_per_stage_seed_from_base_seed` — three stages from base seed 42 land on 42, 42^(1<<32), 42^(2<<32). - `chain_only_stage0_carries_source_image` — a source image set on stages[1] is suppressed, so continuations can't accidentally condition on a still image instead of the motion tail. 
- `chain_forwards_engine_events_with_stage_idx_wrapping` — checks the full expected event order for a 2-stage chain with per-stage progress emission. - `chain_rejects_motion_tail_ge_stage_frames_before_running` — up-front validation catches `motion_tail >= frames` so the renderer is never invoked with a degenerate configuration. - `chain_respects_seed_offset_override_when_stage_provides_one` — pins `ChainStage::seed_offset` semantics for the v2 movie-maker hook. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-inference/src/ltx2/chain.rs | 593 +++++++++++++++++++++++- crates/mold-inference/src/ltx2/mod.rs | 5 +- 2 files changed, 596 insertions(+), 2 deletions(-) diff --git a/crates/mold-inference/src/ltx2/chain.rs b/crates/mold-inference/src/ltx2/chain.rs index b1c4c91e..dfcb08df 100644 --- a/crates/mold-inference/src/ltx2/chain.rs +++ b/crates/mold-inference/src/ltx2/chain.rs @@ -12,9 +12,11 @@ //! //! See `tasks/render-chain-v1-plan.md` Phase 1.1 for context. -use anyhow::{anyhow, Context, Result}; +use anyhow::{anyhow, bail, Context, Result}; use candle_core::Tensor; use image::RgbImage; +use mold_core::chain::{ChainProgressEvent, ChainRequest, ChainStage}; +use mold_core::{GenerateRequest, OutputFormat}; use crate::ltx2::model::shapes::SpatioTemporalScaleFactors; @@ -103,6 +105,266 @@ pub fn extract_tail_latents(final_latents: &Tensor, pixel_frames: u32) -> Result .with_context(|| format!("narrow last {tail} latent frames off time axis")) } +// ── Orchestrator: loops stages, drops motion-tail prefix, accumulates frames + +/// Per-stage progress events the orchestrator observes from the renderer. +/// The renderer emits these synchronously while a stage is denoising; the +/// orchestrator wraps them with `stage_idx` before forwarding as +/// [`ChainProgressEvent`]s to the chain-level subscriber. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum StageProgressEvent { + /// Denoise step `step` of `total` completed for the active stage. 
+ DenoiseStep { step: u32, total: u32 }, +} + +/// Output of a single stage render: the decoded pixel frames (full clip, +/// before motion-tail trim), the pre-VAE-decode latent tail the next stage +/// needs, and the wall-clock elapsed time for the render. +#[derive(Debug)] +pub struct StageOutcome { + pub frames: Vec, + pub tail: ChainTail, + pub generation_time_ms: u64, +} + +/// Abstraction over "render one chain stage". Production uses the LTX-2 +/// engine impl (lands in Phase 1d); tests inject a fake implementation +/// that fabricates deterministic frames and a synthetic [`ChainTail`] +/// without loading candle weights. +pub trait ChainStageRenderer { + fn render_stage( + &mut self, + stage_req: &GenerateRequest, + carry: Option<&ChainTail>, + stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, + ) -> Result; +} + +/// Output of an end-to-end chain run: accumulated RGB frames with motion- +/// tail prefix already trimmed on continuations, the number of stages +/// that ran, and the total elapsed render time. +/// +/// The orchestrator does *not* trim to a target total frame count or +/// encode the frames into an output video — those are the caller's job +/// (server / CLI). Keeps the orchestrator single-purpose: produce a +/// coherent frame stream from a stages list. +#[derive(Debug)] +pub struct ChainRunOutput { + pub frames: Vec, + pub stage_count: u32, + pub generation_time_ms: u64, +} + +/// Drives the per-stage render loop for a chained generation. Borrows its +/// renderer mutably so the loop can re-enter the engine on the same GPU +/// context across stages. +pub struct Ltx2ChainOrchestrator<'a, R: ChainStageRenderer> { + renderer: &'a mut R, +} + +impl<'a, R: ChainStageRenderer> Ltx2ChainOrchestrator<'a, R> { + pub fn new(renderer: &'a mut R) -> Self { + Self { renderer } + } + + /// Run every stage in `req.stages` and return the accumulated frames. 
+ /// + /// Behaviour invariants (from the 2026-04-20 sign-off): + /// - Per-stage seeds are derived as `base_seed ^ ((stage_idx as u64) << 32)`. + /// - Stage 0's output is kept whole; continuations drop their leading + /// `req.motion_tail_frames` pixel frames because those duplicate the + /// prior stage's tail that was threaded back as latent conditioning. + /// - Mid-chain failure returns the error immediately; partial frames are + /// discarded (no partial stitch is ever produced in v1). + pub fn run( + &mut self, + req: &ChainRequest, + mut chain_progress: Option<&mut dyn FnMut(ChainProgressEvent)>, + ) -> Result { + if req.stages.is_empty() { + bail!("Ltx2ChainOrchestrator::run: chain request has no stages"); + } + validate_motion_tail(req)?; + + let stage_count = req.stages.len() as u32; + let estimated_total_frames = estimate_stitched_frames(req); + if let Some(cb) = chain_progress.as_deref_mut() { + cb(ChainProgressEvent::ChainStart { + stage_count, + estimated_total_frames, + }); + } + + let base_seed = req.seed.unwrap_or(0); + let motion_tail_drop = req.motion_tail_frames as usize; + let mut accumulated_frames: Vec = Vec::new(); + let mut total_generation_ms: u64 = 0; + let mut carry: Option = None; + + for (idx, stage) in req.stages.iter().enumerate() { + let stage_idx = idx as u32; + if let Some(cb) = chain_progress.as_deref_mut() { + cb(ChainProgressEvent::StageStart { stage_idx }); + } + + let stage_seed = derive_stage_seed(base_seed, idx, stage); + let stage_req = build_stage_generate_request(stage, req, stage_seed, idx); + + // Wrap the chain progress subscriber so per-stage denoise + // events land on it with `stage_idx` tagged in. The wrapping + // closure holds a mutable reborrow of the outer callback for + // just the duration of this call — `render_stage` is + // synchronous so the reborrow ends before the next iteration. 
+ let outcome = match chain_progress.as_deref_mut() { + Some(chain_cb) => { + let mut wrapping = |event: StageProgressEvent| match event { + StageProgressEvent::DenoiseStep { step, total } => { + chain_cb(ChainProgressEvent::DenoiseStep { + stage_idx, + step, + total, + }); + } + }; + self.renderer + .render_stage(&stage_req, carry.as_ref(), Some(&mut wrapping))? + } + None => self + .renderer + .render_stage(&stage_req, carry.as_ref(), None)?, + }; + + let mut frames = outcome.frames; + if idx > 0 && motion_tail_drop > 0 { + if motion_tail_drop >= frames.len() { + bail!( + "stage {stage_idx}: emitted {} frames but motion_tail_drop={motion_tail_drop} — tail would consume the whole clip", + frames.len(), + ); + } + frames.drain(..motion_tail_drop); + } + let frames_emitted = frames.len() as u32; + accumulated_frames.extend(frames); + total_generation_ms = total_generation_ms.saturating_add(outcome.generation_time_ms); + carry = Some(outcome.tail); + + if let Some(cb) = chain_progress.as_deref_mut() { + cb(ChainProgressEvent::StageDone { + stage_idx, + frames_emitted, + }); + } + } + + if let Some(cb) = chain_progress.as_deref_mut() { + cb(ChainProgressEvent::Stitching { + total_frames: accumulated_frames.len() as u32, + }); + } + + Ok(ChainRunOutput { + frames: accumulated_frames, + stage_count, + generation_time_ms: total_generation_ms, + }) + } +} + +fn validate_motion_tail(req: &ChainRequest) -> Result<()> { + for (idx, stage) in req.stages.iter().enumerate() { + if req.motion_tail_frames >= stage.frames { + bail!( + "motion_tail_frames ({}) must be strictly less than stage {idx}'s frames ({}) \ + so every continuation emits at least one new frame", + req.motion_tail_frames, + stage.frames, + ); + } + } + Ok(()) +} + +fn estimate_stitched_frames(req: &ChainRequest) -> u32 { + // delivered = stages[0].frames + Σ (stages[i].frames - motion_tail) for i >= 1 + let tail = req.motion_tail_frames; + req.stages + .iter() + .enumerate() + .map(|(idx, stage)| { + if idx == 
0 { + stage.frames + } else { + stage.frames.saturating_sub(tail) + } + }) + .sum() +} + +fn derive_stage_seed(base_seed: u64, idx: usize, stage: &ChainStage) -> u64 { + if let Some(offset) = stage.seed_offset { + base_seed ^ offset + } else { + base_seed ^ ((idx as u64) << 32) + } +} + +fn build_stage_generate_request( + stage: &ChainStage, + chain: &ChainRequest, + stage_seed: u64, + idx: usize, +) -> GenerateRequest { + GenerateRequest { + prompt: stage.prompt.clone(), + negative_prompt: stage.negative_prompt.clone(), + model: chain.model.clone(), + width: chain.width, + height: chain.height, + steps: chain.steps, + guidance: chain.guidance, + seed: Some(stage_seed), + batch_size: 1, + // Continuation stages never use the per-chain output_format + // downstream — the orchestrator decodes to frames regardless — + // but MP4 is the canonical intermediate for LTX-2. + output_format: OutputFormat::Mp4, + embed_metadata: None, + scheduler: None, + // Stage 0 carries the optional starting image; continuations + // get their conditioning from motion-tail latents via the + // `carry` argument to `render_stage`. 
+ source_image: if idx == 0 { + stage.source_image.clone() + } else { + None + }, + edit_images: None, + strength: if idx == 0 { chain.strength } else { 1.0 }, + mask_image: None, + control_image: None, + control_model: None, + control_scale: 1.0, + expand: None, + original_prompt: None, + lora: None, + frames: Some(stage.frames), + fps: Some(chain.fps), + upscale_model: None, + gif_preview: false, + enable_audio: Some(false), // v1 chain: no audio plumbing yet + audio_file: None, + source_video: None, + keyframes: None, + pipeline: None, + loras: None, + retake_range: None, + spatial_upscale: None, + temporal_upscale: None, + placement: chain.placement.clone(), + } +} + #[cfg(test)] mod tests { use super::*; @@ -194,4 +456,333 @@ mod tests { "error must name the latent-frame mismatch, got: {msg}", ); } + + // ── Orchestrator tests (fake renderer, weight-free) ─────────────── + + use image::Rgb; + use mold_core::chain::ChainStage; + + /// Deterministic fake renderer for orchestrator tests. Records every + /// call so assertions can inspect the per-stage request shape, emits + /// a solid-color frame block plus a zero-valued latent tail, and + /// optionally returns errors on pre-configured stage indices. + struct FakeRenderer { + calls: Vec, + /// If set, fail on the listed stage indices with the given message. + fail_on: Vec<(usize, String)>, + /// Per-call override of frame count (default: use stage_req.frames). + frame_count_override: Option, + /// If true, emit one DenoiseStep event per stage so tests can + /// verify progress forwarding. 
+ emit_progress: bool, + } + + #[derive(Debug, Clone)] + struct CallRecord { + seed: Option, + frames: Option, + has_source_image: bool, + has_carry: bool, + } + + impl FakeRenderer { + fn new() -> Self { + Self { + calls: Vec::new(), + fail_on: Vec::new(), + frame_count_override: None, + emit_progress: false, + } + } + } + + impl ChainStageRenderer for FakeRenderer { + fn render_stage( + &mut self, + stage_req: &GenerateRequest, + carry: Option<&ChainTail>, + mut stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, + ) -> Result { + let idx = self.calls.len(); + self.calls.push(CallRecord { + seed: stage_req.seed, + frames: stage_req.frames, + has_source_image: stage_req.source_image.is_some(), + has_carry: carry.is_some(), + }); + if let Some((_, msg)) = self.fail_on.iter().find(|(stage_idx, _)| *stage_idx == idx) { + bail!("{msg}"); + } + if self.emit_progress { + if let Some(cb) = stage_progress.as_deref_mut() { + cb(StageProgressEvent::DenoiseStep { step: 1, total: 1 }); + } + } + + let frame_count = self + .frame_count_override + .unwrap_or_else(|| stage_req.frames.expect("fake renderer: stage_req.frames")); + let width = stage_req.width; + let height = stage_req.height; + // Colour the frames with the stage index so assertions can + // verify which stage a frame came from. + let mut frames = Vec::with_capacity(frame_count as usize); + for frame_num in 0..frame_count { + let channel = (idx as u8).wrapping_mul(37).wrapping_add(frame_num as u8); + frames.push(RgbImage::from_pixel(width, height, Rgb([channel, 0, 0]))); + } + let last_frame = frames.last().cloned().unwrap(); + + // Build a synthetic tail latent at the "right" shape for the + // requested motion tail. Shape isn't validated by the + // orchestrator itself — the engine impl in Phase 1d will check. 
+ let latent = Tensor::zeros( + (1, 128, 1, height as usize / 32, width as usize / 32), + DType::F32, + &Device::Cpu, + ) + .unwrap(); + + Ok(StageOutcome { + frames, + tail: ChainTail { + frames: 4, + latents: latent, + last_rgb_frame: last_frame, + }, + generation_time_ms: 100, + }) + } + } + + fn stage(prompt: &str, frames: u32) -> ChainStage { + ChainStage { + prompt: prompt.into(), + frames, + source_image: None, + negative_prompt: None, + seed_offset: None, + } + } + + fn chain_req(stages: Vec, motion_tail_frames: u32) -> ChainRequest { + ChainRequest { + model: "ltx-2-19b-distilled:fp8".into(), + stages, + motion_tail_frames, + width: 1216, + height: 704, + fps: 24, + seed: Some(42), + steps: 8, + guidance: 3.0, + strength: 1.0, + output_format: OutputFormat::Mp4, + placement: None, + prompt: None, + total_frames: None, + clip_frames: None, + source_image: None, + } + } + + #[test] + fn chain_runs_all_stages_and_drops_tail_prefix_from_continuations() { + let stages = vec![stage("a", 97), stage("a", 97), stage("a", 97)]; + let req = chain_req(stages, 4); + let mut renderer = FakeRenderer::new(); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let out = orch.run(&req, None).expect("chain runs"); + // Stage 0 keeps all 97 frames; each continuation drops the + // leading 4 frames, so delivered = 97 + 2 * (97 - 4) = 97 + 186 = 283. + assert_eq!(out.frames.len(), 97 + 93 * 2); + assert_eq!(out.stage_count, 3); + assert_eq!(renderer.calls.len(), 3); + // Stage 0 has no carry; later stages do. 
+ assert!(!renderer.calls[0].has_carry); + assert!(renderer.calls[1].has_carry); + assert!(renderer.calls[2].has_carry); + } + + #[test] + fn chain_with_zero_tail_concats_full_clips_without_drop() { + let stages = vec![stage("a", 97), stage("a", 97)]; + let req = chain_req(stages, 0); + let mut renderer = FakeRenderer::new(); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let out = orch.run(&req, None).expect("chain runs"); + assert_eq!( + out.frames.len(), + 97 * 2, + "zero motion tail must keep every frame on continuations", + ); + } + + #[test] + fn chain_empty_stages_errors_without_calling_renderer() { + let req = chain_req(vec![], 4); + let mut renderer = FakeRenderer::new(); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let err = orch.run(&req, None).expect_err("empty stages must fail"); + assert!( + format!("{err}").contains("has no stages"), + "error must name the missing stages, got: {err}", + ); + assert!(renderer.calls.is_empty()); + } + + #[test] + fn chain_fails_closed_mid_chain_discarding_accumulated_frames() { + // Signed-off decision 2026-04-20: mid-chain failure returns the + // error immediately and throws away any frames already produced. + // No partial stitch is ever written to the gallery. + let stages = vec![stage("a", 97), stage("a", 97), stage("a", 97)]; + let req = chain_req(stages, 4); + let mut renderer = FakeRenderer::new(); + renderer.fail_on = vec![(1, "simulated GPU OOM on stage 1".into())]; + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let err = orch + .run(&req, None) + .expect_err("mid-chain failure must bubble up"); + assert!( + format!("{err}").contains("simulated GPU OOM"), + "error must carry the renderer's message, got: {err}", + ); + // Stage 0 ran (recorded), stage 1 failed (recorded before bail), + // stage 2 never ran. 
+ assert_eq!(renderer.calls.len(), 2); + } + + #[test] + fn chain_derives_per_stage_seed_from_base_seed() { + let stages = vec![stage("a", 9), stage("a", 9), stage("a", 9)]; + let mut req = chain_req(stages, 0); + req.seed = Some(42); + let mut renderer = FakeRenderer::new(); + renderer.frame_count_override = Some(9); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + orch.run(&req, None).expect("chain runs"); + // Per the sign-off: stage_seed = base ^ ((idx as u64) << 32). + assert_eq!(renderer.calls[0].seed, Some(42)); + assert_eq!(renderer.calls[1].seed, Some(42 ^ (1u64 << 32))); + assert_eq!(renderer.calls[2].seed, Some(42 ^ (2u64 << 32))); + } + + #[test] + fn chain_only_stage0_carries_source_image() { + let mut stages = vec![stage("a", 9), stage("a", 9)]; + stages[0].source_image = Some(vec![0x89, 0x50, 0x4e, 0x47]); // PNG magic + // If a caller forgets to clear later stages' source_image, the + // orchestrator still suppresses it — continuations must always + // condition on motion-tail latents, never on a staged image. 
+ stages[1].source_image = Some(vec![0x89, 0x50, 0x4e, 0x47]); + let req = chain_req(stages, 0); + let mut renderer = FakeRenderer::new(); + renderer.frame_count_override = Some(9); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + orch.run(&req, None).expect("chain runs"); + assert!(renderer.calls[0].has_source_image); + assert!(!renderer.calls[1].has_source_image); + } + + #[test] + fn chain_forwards_engine_events_with_stage_idx_wrapping() { + let stages = vec![stage("a", 9), stage("a", 9)]; + let req = chain_req(stages, 0); + let mut renderer = FakeRenderer::new(); + renderer.frame_count_override = Some(9); + renderer.emit_progress = true; + + let mut events: Vec = Vec::new(); + { + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let mut cb = |e: ChainProgressEvent| events.push(e); + orch.run(&req, Some(&mut cb)).expect("chain runs"); + } + + // Expected order: + // ChainStart, StageStart(0), DenoiseStep(0), StageDone(0), + // StageStart(1), DenoiseStep(1), StageDone(1), Stitching + assert!(matches!( + events[0], + ChainProgressEvent::ChainStart { stage_count: 2, .. 
} + )); + assert!(matches!( + events[1], + ChainProgressEvent::StageStart { stage_idx: 0 } + )); + assert!(matches!( + events[2], + ChainProgressEvent::DenoiseStep { + stage_idx: 0, + step: 1, + total: 1 + } + )); + assert!(matches!( + events[3], + ChainProgressEvent::StageDone { + stage_idx: 0, + frames_emitted: 9 + } + )); + assert!(matches!( + events[4], + ChainProgressEvent::StageStart { stage_idx: 1 } + )); + assert!(matches!( + events[5], + ChainProgressEvent::DenoiseStep { + stage_idx: 1, + step: 1, + total: 1 + } + )); + assert!(matches!( + events[6], + ChainProgressEvent::StageDone { + stage_idx: 1, + frames_emitted: 9 + } + )); + assert!(matches!( + events[7], + ChainProgressEvent::Stitching { total_frames: 18 } + )); + assert_eq!(events.len(), 8); + } + + #[test] + fn chain_rejects_motion_tail_ge_stage_frames_before_running() { + let stages = vec![stage("a", 9), stage("a", 9)]; + // tail=9 equals stage frames — no net-new content on continuation. + let req = chain_req(stages, 9); + let mut renderer = FakeRenderer::new(); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + let err = orch.run(&req, None).expect_err("must fail"); + assert!( + format!("{err}").contains("motion_tail_frames"), + "error must name motion_tail_frames, got: {err}", + ); + // Renderer never gets called because validation runs up-front. 
+ assert!(renderer.calls.is_empty()); + } + + #[test] + fn chain_respects_seed_offset_override_when_stage_provides_one() { + let mut stages = vec![stage("a", 9), stage("a", 9)]; + stages[1].seed_offset = Some(0xDEADBEEF); + let mut req = chain_req(stages, 0); + req.seed = Some(100); + let mut renderer = FakeRenderer::new(); + renderer.frame_count_override = Some(9); + let mut orch = Ltx2ChainOrchestrator::new(&mut renderer); + orch.run(&req, None).expect("runs"); + assert_eq!(renderer.calls[0].seed, Some(100)); + assert_eq!( + renderer.calls[1].seed, + Some(100 ^ 0xDEADBEEFu64), + "seed_offset must take precedence over the default index-derived seed", + ); + } } diff --git a/crates/mold-inference/src/ltx2/mod.rs b/crates/mold-inference/src/ltx2/mod.rs index 9858e1d5..ac0c5b6e 100644 --- a/crates/mold-inference/src/ltx2/mod.rs +++ b/crates/mold-inference/src/ltx2/mod.rs @@ -14,5 +14,8 @@ mod runtime; mod sampler; mod text; -pub use chain::{extract_tail_latents, tail_latent_frame_count, ChainTail}; +pub use chain::{ + extract_tail_latents, tail_latent_frame_count, ChainRunOutput, ChainStageRenderer, ChainTail, + Ltx2ChainOrchestrator, StageOutcome, StageProgressEvent, +}; pub use pipeline::Ltx2Engine; From 350d27ec813a14efc1dc9475595319bfc8b193b9 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 17:48:42 -0700 Subject: [PATCH 07/31] docs(chain): render-chain-v1 context handoff for resuming work MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capture the state of the branch (6 commits on local main, not pushed), the five signed-off design decisions, the Phase 1d → 2 → 3 → 4 remaining work with specific file:line surgery points, and a ready-to-paste prompt for a fresh Claude Code session. 
Gotchas documented: stale `test = false` claim in CLAUDE.md, pre-existing clippy warnings unrelated to this branch, VAE 8× causal temporal ratio already encoded by `extract_tail_latents`, and the existing-parameter-reuse opportunity on `run_real_distilled_stage` (no new params needed). Co-Authored-By: Claude Opus 4.7 (1M context) --- tasks/render-chain-v1-handoff.md | 311 +++++++++++++++++++++++++++++++ 1 file changed, 311 insertions(+) create mode 100644 tasks/render-chain-v1-handoff.md diff --git a/tasks/render-chain-v1-handoff.md b/tasks/render-chain-v1-handoff.md new file mode 100644 index 00000000..540e04cc --- /dev/null +++ b/tasks/render-chain-v1-handoff.md @@ -0,0 +1,311 @@ +# render-chain-v1 — context handoff + +> Paste the prompt at the bottom of this file into a fresh Claude Code session +> to resume work on render-chain v1. Everything above it is reference material +> that the prompt points at. + +## Status + +Branch: `main` (local). **6 commits stacked ahead of `origin/main`, not pushed** +per plan convention (no mid-plan push): + +| # | Commit | Scope | Phase | +|---|-----------|----------|-------| +| 1 | `d13a554` | `fix(ltx2): use pure source latents as i2v denoise-mask target` | Fix A (prereq) | +| 2 | `b4ed487` | `feat(chain): add core wire types and request normalisation` | 0.1 | +| 3 | `0328e76` | `feat(core): MoldClient chain methods` | 0.2 | +| 4 | `e89826f` | `feat(ltx2): ChainTail type and latent-tail extraction helper` | 1a | +| 5 | `e917210` | `feat(ltx2): staged latent conditioning bypasses VAE encode` | 1b | +| 6 | `14801c7` | `feat(ltx2): chain orchestrator with motion-tail carryover loop` | 1c | + +Test status on commit 6: `mold-core` 617 pass, `mold-inference` 586 pass, +`cargo fmt --check` clean, no candle weights loaded by any test. 
+ +Pre-existing clippy warnings on main (NOT introduced by this branch): +- `crates/mold-core/src/download.rs:1451` — `manual_repeat_n` +- `crates/mold-core/src/placement_test.rs:167` — `field_reassign_with_default` + +These only fire on newer clippy versions than CI pins and are unrelated to +the chain work. Don't "fix" them as part of render-chain. + +## Signed-off design decisions (do NOT re-litigate) + +User confirmed these 2026-04-20 and they're recorded at the top of +`tasks/render-chain-v1-plan.md`: + +1. **Trim over-production from the tail** of the final clip, not the head. +2. **Per-stage seed derivation: `stage_seed = base_seed ^ ((stage_idx as u64) << 32)`.** + `ChainStage::seed_offset` overrides this; reserved for the v2 movie-maker. +3. **Fail closed on mid-chain failure.** 502 + discard all prior stages. No + partial stitch. +4. **Accept ~1 GB RAM ceiling** for accumulated `RgbImage` buffer. Streaming + encode revisited at 1000+ frames. +5. **Single-GPU per chain.** Multi-GPU stage fan-out is v2. + +The orchestrator already encodes 1, 2, 3 and Phase 2 server route handles 3. + +## What's done + +- **`mold_core::chain`** — wire types (`ChainRequest`, `ChainResponse`, + `ChainStage`, `ChainProgressEvent`, `SseChainCompleteEvent`) and + `ChainRequest::normalise()`. Re-exports from `mold_core`. +- **`MoldClient::generate_chain{,_stream}`** with 422 → Validation, 404-with- + body → ModelNotFound, empty-404 → hard error (non-streaming) / `Ok(None)` + (streaming). Wiremock integration tests pin all four paths. +- **`ltx2::chain::ChainTail` + `extract_tail_latents`** — pure tensor math, + VAE formula `((pixel - 1) / 8) + 1`. Errors (not panics) on rank + mismatch / oversize tail. +- **`StagedLatent` + `StagedConditioning.latents`** — threaded through + `maybe_load_stage_video_conditioning` in `runtime.rs`. 
When the latents + vec is non-empty, the function builds `VideoTokenReplacement`s straight + from pre-encoded tokens and **skips VAE load entirely** (conditional + `Option` — confirmed only loaded when images or reference video are + present). +- **`Ltx2ChainOrchestrator`** — fully tested against + a fake renderer. Handles seed derivation, motion-tail trim on + continuations (stage 0 keeps all frames, continuations drop leading K), + progress forwarding with `stage_idx` wrapping, fail-closed error handling. + Orchestrator does NOT trim to a target total or encode MP4 — those are + caller responsibilities. + +## What's remaining + +### Phase 1d — `impl ChainStageRenderer for Ltx2Engine` (engine integration) + +The one-sentence contract: given `stage_req`, optional `carry: &ChainTail`, +and an optional stage-progress callback, return +`StageOutcome { frames, tail, generation_time_ms }`. + +Three sub-tasks: + +1. **Tail capture slot.** Add a mechanism for `render_real_distilled_av` + (`crates/mold-inference/src/ltx2/runtime.rs:1722`) to clone the + pre-VAE-decode `latents` tensor into a caller-provided slot. The exact + capture point is immediately before `vae.decode(&latents...)` at + `runtime.rs:2010` — shape is `[1, 128, T_latent, H/32, W/32]` F32. + Preferred mechanism: a field on `Ltx2RuntimeSession` (or a method + argument threaded down) holding `Option>>>`. + Production non-chain callers leave it `None` and pay no overhead. + +2. **`Ltx2Engine::generate_with_carryover(&mut self, req, carry)`**: + - Validate the request is a supported family (v1 scope: distilled LTX-2 + only — see `select_pipeline` at `crates/mold-inference/src/ltx2/pipeline.rs:108`). + - Build a `Ltx2GeneratePlan` via the existing `materialize_request` flow. + When `carry.is_some()`, wipe `source_image` and append a + `StagedLatent { latents: carry.latents.clone(), frame: 0, strength: 1.0 }` + to `plan.conditioning.latents`. 
The runtime already handles the rest + (`maybe_load_stage_video_conditioning` skips VAE, builds a frame-0 + replacement from patchified tokens). + - Enable the tail-capture slot. + - Run the existing render → decode → encode pipeline. + - Pull the captured latents out of the slot. + - Call `ltx2::chain::extract_tail_latents(&captured, motion_tail_frames)` + to get the tail slice. + - Decode the stitched MP4 once to extract `last_rgb_frame` (or capture + it alongside the `frames` Vec from `decoded_video_to_frames`). + - Return `(GenerateResponse, ChainTail)`. + +3. **`impl ChainStageRenderer for Ltx2Engine`** that delegates to + `generate_with_carryover`. The orchestrator's fake-renderer tests + define the exact contract; no new test harness needed for the impl — + real-engine coverage is Phase 2's integration test. + +**Gotchas:** +- `CLAUDE.md` claims `[lib] test = false` on `mold-inference` and + `mold-server` — **this is stale.** Both have normal test configs. Verified + in Phase 1a/b/c by running 586 tests. +- `run_real_distilled_stage` already takes + `video_clean_latents: Option<&Tensor>` and `video_denoise_mask: Option<&Tensor>` — + don't add new parameters unnecessarily. The tail carryover rides on + `conditioning.replacements` via `StagedLatent`, not on `video_clean_latents`. +- VAE temporal ratio is **8× with causal first frame** (`model/shapes.rs:20`). + `extract_tail_latents` already encodes this; just call it. +- `motion_tail_frames` defaults to 4 per plan; orchestrator validates + `motion_tail < stage.frames` up front, but the engine should still + tolerate `motion_tail = 0` (simple concat, no latent carryover — `carry` + will be `None` for every stage in that configuration). + +### Phase 2 — `POST /api/generate/chain[/stream]` server route + +Plan §2. Handler flow: + +1. Parse + `ChainRequest::normalise()`. +2. Reject non-LTX-2 models with a clear error. +3. 
Grab the engine from `ModelCache` (`crates/mold-server/src/lib.rs` — + holds `AppState.model_cache: Arc>`). +4. Construct `Ltx2ChainOrchestrator` against it, call `run()`. +5. Trim accumulated frames to target total (the ChainRequest no longer + carries `total_frames` after normalise — if you want tail-trim support, + add a `target_total_frames: Option` field that normalise + populates). Per the sign-off: trim from the tail. +6. Encode stitched MP4. Reuse `ltx2::media::encode_frames_to_mp4` or the + existing `encode_native_video` path — scout during Phase 2. +7. Save via `save_video_to_dir` with an `OutputMetadata` synthesised from + `stages[0].prompt`; optionally add `chain_stage_count: Option` to + `OutputMetadata`. +8. Return `ChainResponse` JSON. + +**Do NOT go through the existing single-job queue.** A 10+ minute chain +would block the queue. Instead hold the `ModelCache` mutex directly for +the chain duration, same pattern as the multi-GPU pool. Reason in plan §2.1. + +SSE variant: same flow, stream `ChainProgressEvent` as `event: progress` +JSON frames and a final `SseChainCompleteEvent` as `event: complete`. + +Tests: route-level with a fake engine (same trait seam as Phase 1c). No +real weights. + +### Phase 3 — CLI auto-routing + flags + +When `--frames > clip_cap` (97 for LTX-2 19B/22B distilled), build a +`ChainRequest` from the CLI args and route to +`MoldClient::generate_chain_stream`. New flags: `--clip-frames N`, +`--motion-tail N` (default 4). + +Stacked progress bar: one parent bar per chain (estimated total frames), +one per-stage bar wiping between stages. + +`--local` parity: factor the orchestrator invocation so both server +handler and CLI local path use the same code. + +### Phase 4 — docs + +- `website/guide/video.md`: new "Chained video output" section explaining + `--frames N`, motion tail, and the server endpoint. +- `CHANGELOG.md`: Unreleased/Added entry. +- `.claude/skills/mold/SKILL.md`: new CLI flags + endpoint. 
+ +## Verification commands + +Run these in order after any Phase 1d change to verify nothing regressed: + +```bash +cargo fmt -p mold-ai-inference -- --check +cargo check -p mold-ai-inference +cargo test -p mold-ai-inference --lib ltx2::chain:: # orchestrator + tail helpers +cargo test -p mold-ai-inference --lib # full 586-test sweep (~35 s) +cargo test -p mold-ai-core # sanity +``` + +Phase 1d's own tests should live alongside existing `pipeline.rs::tests` +patterns (using `with_runtime_session` injection at +`crates/mold-inference/src/ltx2/pipeline.rs:1062` — the existing test +exercises the runtime without real weights). + +## File map — where everything lives now + +``` +NEW (this branch): + crates/mold-core/src/chain.rs # wire types + normalise + crates/mold-core/tests/chain_client.rs # wiremock integration + crates/mold-inference/src/ltx2/chain.rs # ChainTail + orchestrator + +MODIFIED (this branch): + crates/mold-core/src/lib.rs # re-exports + crates/mold-core/src/types.rs # pub(crate) base64_opt + crates/mold-core/src/client.rs # generate_chain{,_stream} + crates/mold-inference/src/ltx2/mod.rs # pub use chain::* + crates/mold-inference/src/ltx2/conditioning.rs # StagedLatent + crates/mold-inference/src/ltx2/runtime.rs # latents loop + Fix A + +TARGETS (Phase 1d): + crates/mold-inference/src/ltx2/pipeline.rs # Ltx2Engine::generate_with_carryover + crates/mold-inference/src/ltx2/runtime.rs # tail-capture slot on session + +TARGETS (Phase 2+): + crates/mold-server/src/routes_chain.rs # NEW + crates/mold-server/src/lib.rs # route registration + crates/mold-cli/src/main.rs # auto-route + crates/mold-cli/src/commands/generate.rs # chain path + local parity + website/guide/video.md # docs + CHANGELOG.md + .claude/skills/mold/SKILL.md +``` + +## Convention reminders + +- Feature branch: `feat/render-chain-v1` (currently committing directly to + local `main` since pre-push). PR target: `main`. 
+- Commit scopes: `feat(chain)`, `fix(chain)`, `test(chain)`, `docs(chain)` + (core), or `feat(ltx2)`, `feat(server)`, `feat(cli)` depending on crate. +- **No mid-plan push.** All work accumulates locally until Phase 4 ends. +- Every phase step ends with a commit; verification (`fmt`, `test`) + between every step. +- Tests must be weight-free. Use the trait-seam pattern (Phase 1c) or the + `with_runtime_session` injection pattern (`pipeline.rs:1062`). + +--- + +## The prompt + +Paste from here into a fresh Claude Code session: + +--- + +I'm continuing work on **render-chain v1** — server-side chained LTX-2 video +generation for the mold repo. + +## Read first, in this order + +1. `CLAUDE.md` (both global at `~/.claude-personal/CLAUDE.md` and + `/Users/jeffreydilley/github/mold/CLAUDE.md`). +2. `tasks/render-chain-v1-plan.md` — full design, signed-off decisions. +3. `tasks/render-chain-v1-handoff.md` — status, remaining work, gotchas. + **This is your primary briefing.** Read it end-to-end before writing code. + +## Status on entry + +- 6 commits stacked locally on `main`, not pushed (per plan convention). + Last commit: `14801c7 feat(ltx2): chain orchestrator with motion-tail carryover loop`. +- Phase 0 (core wire types + client) and Phase 1a/b/c (ltx2 chain types, + StagedLatent plumbing, orchestrator + fake-renderer tests) are done. +- `mold-inference` has 586 tests passing, `mold-core` 617. Nothing loads + candle weights. Fmt clean. +- `CLAUDE.md`'s claim that `mold-inference` has `[lib] test = false` is + **stale** — the previous session verified tests run normally. + +## What you're doing + +**Phase 1d** — the engine integration that makes the orchestrator actually +render. Spec in `render-chain-v1-handoff.md` under "Phase 1d". 
In one +sentence: implement `impl ChainStageRenderer for Ltx2Engine` by adding a +tail-capture slot to `Ltx2RuntimeSession` and a +`Ltx2Engine::generate_with_carryover` method that populates +`plan.conditioning.latents` from the `ChainTail` input and returns the +captured tail alongside the response. + +Key surgery points already scouted: +- Tail capture immediately before `vae.decode` at + `crates/mold-inference/src/ltx2/runtime.rs:2010` +- Plan's staged-latents plumbing already works — + `maybe_load_stage_video_conditioning` accepts pre-encoded latents when + you populate `plan.conditioning.latents` (Phase 1b). + +After Phase 1d, Phases 2 (server route), 3 (CLI), and 4 (docs) per the plan. + +## How to work + +- Use `superpowers:subagent-driven-development` — the plan is sized for it. +- Use `superpowers:verification-before-completion` before claiming any + phase done. The handoff doc has the exact verification commands. +- Every step ends with a commit. Commit scope `feat(ltx2)` for Phase 1d. +- Do NOT push anything — plan convention is no mid-plan push. +- Do NOT re-litigate the signed-off design decisions in the handoff doc. +- Tests must be weight-free (use the `with_runtime_session` injection + pattern from `pipeline.rs:1062` or the trait seam shipped in Phase 1c). + +## Start here + +1. Run `git status && git log --oneline -7` to confirm the 6 commits are + on the tree. +2. Read `tasks/render-chain-v1-handoff.md` end-to-end. +3. Delegate an Explore subagent to map `Ltx2RuntimeSession` and the full + `Ltx2Engine::generate` → `generate_inner` → `render_native_video` call + chain end-to-end before writing code. Cite file:line throughout. Keep + the report under 2000 words. +4. Then plan the tail-capture mechanism (decide: field on + `Ltx2RuntimeSession` vs. threaded parameter, ergonomics tradeoffs). +5. Implement. Commit. Then Phase 2. 
+ +If you hit a surprise that invalidates an assumption in the plan or +handoff doc, stop and re-plan rather than papering over it. From 1c142e300fd248cfef68562eb380b0d051db7a17 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 18:13:42 -0700 Subject: [PATCH 08/31] feat(ltx2): Ltx2Engine chain stage renderer with pre-VAE latent tail capture Add a pre-VAE-decode tail-capture slot on Ltx2RuntimeSession threaded into render_real_distilled_av, implement Ltx2Engine::render_chain_stage that injects a carryover ChainTail as a StagedLatent and extracts the post-denoise tail, and wire it through impl ChainStageRenderer for Ltx2Engine. Distilled-only in v1; other pipeline families error up-front. Amend ChainStageRenderer::render_stage to carry motion_tail_pixel_frames so the engine knows how many frames to narrow off the emitted latents. Part of render-chain v1 (Phase 1d). Weight-free tests added; full mold-inference and mold-core lib test suites stay green. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-inference/src/ltx2/chain.rs | 21 ++- crates/mold-inference/src/ltx2/pipeline.rs | 191 ++++++++++++++++++++- crates/mold-inference/src/ltx2/runtime.rs | 36 +++- 3 files changed, 239 insertions(+), 9 deletions(-) diff --git a/crates/mold-inference/src/ltx2/chain.rs b/crates/mold-inference/src/ltx2/chain.rs index dfcb08df..d40c95de 100644 --- a/crates/mold-inference/src/ltx2/chain.rs +++ b/crates/mold-inference/src/ltx2/chain.rs @@ -136,6 +136,7 @@ pub trait ChainStageRenderer { &mut self, stage_req: &GenerateRequest, carry: Option<&ChainTail>, + motion_tail_pixel_frames: u32, stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, ) -> Result; } @@ -226,12 +227,19 @@ impl<'a, R: ChainStageRenderer> Ltx2ChainOrchestrator<'a, R> { }); } }; - self.renderer - .render_stage(&stage_req, carry.as_ref(), Some(&mut wrapping))? + self.renderer.render_stage( + &stage_req, + carry.as_ref(), + req.motion_tail_frames, + Some(&mut wrapping), + )? 
} - None => self - .renderer - .render_stage(&stage_req, carry.as_ref(), None)?, + None => self.renderer.render_stage( + &stage_req, + carry.as_ref(), + req.motion_tail_frames, + None, + )?, }; let mut frames = outcome.frames; @@ -257,7 +265,7 @@ impl<'a, R: ChainStageRenderer> Ltx2ChainOrchestrator<'a, R> { } } - if let Some(cb) = chain_progress.as_deref_mut() { + if let Some(cb) = chain_progress.as_mut() { cb(ChainProgressEvent::Stitching { total_frames: accumulated_frames.len() as u32, }); @@ -501,6 +509,7 @@ mod tests { &mut self, stage_req: &GenerateRequest, carry: Option<&ChainTail>, + _motion_tail_pixel_frames: u32, mut stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, ) -> Result { let idx = self.calls.len(); diff --git a/crates/mold-inference/src/ltx2/pipeline.rs b/crates/mold-inference/src/ltx2/pipeline.rs index 1f584d14..7f419d4d 100644 --- a/crates/mold-inference/src/ltx2/pipeline.rs +++ b/crates/mold-inference/src/ltx2/pipeline.rs @@ -1,6 +1,6 @@ #![allow(clippy::type_complexity)] -use anyhow::{bail, Context, Result}; +use anyhow::{anyhow, bail, Context, Result}; use candle_core::Device; use mold_core::{ GenerateRequest, GenerateResponse, Ltx2PipelineMode, ModelPaths, OutputFormat, VideoData, @@ -11,7 +11,10 @@ use std::time::Instant; use super::assets; use super::backend::Ltx2Backend; -use super::conditioning; +use super::chain::{ + extract_tail_latents, ChainStageRenderer, ChainTail, StageOutcome, StageProgressEvent, +}; +use super::conditioning::{self, StagedLatent}; use super::execution; use super::lora; use super::media::{self, ProbeMetadata}; @@ -522,6 +525,147 @@ impl Ltx2Engine { gpu: None, }) } + + /// Render a single chain stage, optionally conditioning on a carryover + /// tail from the prior stage. + /// + /// `motion_tail_pixel_frames` is the number of pixel frames to narrow + /// off the emitted latents for the *next* stage's carryover. 
`0` + /// returns an error (nonsensical — use the regular single-clip path + /// if no tail is wanted). + /// + /// Scope: distilled LTX-2 pipeline only. Other pipeline families + /// return an error up-front so the chain orchestrator fails fast. + pub(crate) fn render_chain_stage( + &mut self, + req: &GenerateRequest, + carry: Option<&ChainTail>, + motion_tail_pixel_frames: u32, + ) -> Result { + if motion_tail_pixel_frames == 0 { + bail!("render_chain_stage: motion_tail_pixel_frames must be > 0"); + } + if !self.loaded { + self.load()?; + } + let start = Instant::now(); + self.emit("Preparing native LTX-2 chain stage"); + + let pipeline = self.select_pipeline(req)?; + if !matches!(pipeline, PipelineKind::Distilled) { + bail!( + "render-chain v1 only supports the distilled LTX-2 pipeline, got {:?}", + pipeline, + ); + } + + let work_dir = tempfile::tempdir().context("failed to create LTX-2 temp directory")?; + let native_output = work_dir.path().join("ltx2-native-output.mp4"); + let mut plan = self.materialize_request(req, work_dir.path(), &native_output)?; + + // Inject carryover tail latents as StagedLatent on frame 0. The + // runtime detects a non-empty `conditioning.latents` and bypasses + // the VAE load entirely, patchifying the pre-encoded tokens into + // conditioning replacements directly (see conditioning.rs + // StagedLatent docstring + runtime.rs + // maybe_load_stage_video_conditioning). + if let Some(tail) = carry { + // The caller (orchestrator) is responsible for blanking + // source_image on continuation stages, but defence-in-depth: + // clear staged images so they can't compete with the latent + // carryover. + plan.conditioning.images.clear(); + plan.conditioning.latents.push(StagedLatent { + latents: tail.latents.clone(), + frame: 0, + strength: 1.0, + }); + } + + // Reuse an existing runtime session if we have one; otherwise + // build one. Arm the tail-capture slot on the session before + // render. 
+ let mut runtime = match self.native_runtime.take() { + Some(runtime) => runtime, + None => self.create_runtime_session(&plan)?, + }; + let slot = runtime.arm_tail_capture(); + + self.emit("Executing native LTX-2 chain stage runtime"); + let prepared = match runtime.prepare(&plan) { + Ok(prepared) => prepared, + Err(err) => { + runtime.clear_tail_capture(); + self.native_runtime = Some(runtime); + return Err(err); + } + }; + let render_result = + runtime.render_native_video(&plan, &prepared, self.on_progress.as_ref()); + runtime.clear_tail_capture(); + self.native_runtime = Some(runtime); + let rendered = render_result?; + + // Drain captured latents. The slot must have been populated by + // the distilled render path — if it's empty, that's a wiring bug, + // not a user error. + let captured = slot + .lock() + .map_err(|_| anyhow!("chain tail-capture mutex was poisoned mid-render"))? + .take() + .ok_or_else(|| { + anyhow!( + "distilled render completed without populating the chain tail-capture slot; \ + this is a pipeline wiring bug" + ) + })?; + + // `extract_tail_latents` returns a narrow view; make it + // contiguous so it survives independently of the runtime's + // working tensors. + let tail_slice = extract_tail_latents(&captured, motion_tail_pixel_frames)?; + let tail_latents = tail_slice + .contiguous() + .context("materializing chain tail latents into an owned tensor")?; + + let frames = rendered.frames; + let last_rgb_frame = frames + .last() + .ok_or_else(|| anyhow!("distilled render returned zero frames"))? 
+ .clone(); + + let generation_time_ms = start.elapsed().as_millis() as u64; + Self::log_timing("pipeline.render_chain_stage", start); + + Ok(StageOutcome { + frames, + tail: ChainTail { + frames: motion_tail_pixel_frames, + latents: tail_latents, + last_rgb_frame, + }, + generation_time_ms, + }) + } +} + +impl ChainStageRenderer for Ltx2Engine { + fn render_stage( + &mut self, + stage_req: &GenerateRequest, + carry: Option<&ChainTail>, + motion_tail_pixel_frames: u32, + _stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, + ) -> Result { + // `_stage_progress` is intentionally unused in v1: per-stage + // denoise events flow through `self.on_progress` already. Phase 2's + // server route will install an on_progress callback that forwards + // those events onto the chain SSE stream with `stage_idx` tagged + // in. If the orchestrator later needs denoise-step events routed + // through its own channel, we can plumb `stage_progress` into a + // temporary ProgressCallback wrapper here. + self.render_chain_stage(stage_req, carry, motion_tail_pixel_frames) + } } impl InferenceEngine for Ltx2Engine { @@ -1087,4 +1231,47 @@ mod tests { assert!(!video.has_audio); assert!(engine.native_runtime.is_none()); } + + #[test] + fn render_chain_stage_rejects_non_distilled_pipeline() { + // A model name without "distilled" in it selects `PipelineKind::TwoStage` + // via `select_pipeline`, which must be rejected up-front by the chain + // entry point before any runtime work happens. 
+ let mut engine = Ltx2Engine::with_runtime_session( + "ltx-2-19b:fp8".to_string(), + dummy_paths(), + runtime_session(), + ); + engine.loaded = true; + let req = request(OutputFormat::Mp4, Some(false)); + let err = engine + .render_chain_stage(&req, None, 4) + .expect_err("must fail on non-distilled pipeline"); + let msg = format!("{err}"); + assert!( + msg.contains("distilled"), + "error must name the pipeline constraint, got: {msg}", + ); + } + + #[test] + fn render_chain_stage_rejects_zero_motion_tail() { + // Zero-frame motion tail is nonsensical — it would narrow nothing off + // for the next stage. Fast-fail before any allocation. + let mut engine = Ltx2Engine::with_runtime_session( + "ltx-2-19b-distilled:fp8".to_string(), + dummy_paths(), + runtime_session(), + ); + engine.loaded = true; + let req = request(OutputFormat::Mp4, Some(false)); + let err = engine + .render_chain_stage(&req, None, 0) + .expect_err("must fail on zero motion tail"); + let msg = format!("{err}"); + assert!( + msg.contains("motion_tail_pixel_frames"), + "error must name the motion_tail constraint, got: {msg}", + ); + } } diff --git a/crates/mold-inference/src/ltx2/runtime.rs b/crates/mold-inference/src/ltx2/runtime.rs index 8df1f26c..5fc0a62b 100644 --- a/crates/mold-inference/src/ltx2/runtime.rs +++ b/crates/mold-inference/src/ltx2/runtime.rs @@ -291,6 +291,11 @@ impl Ltx2VaeLatentStats { pub struct Ltx2RuntimeSession { device: Option, prompt_encoder: Option, + /// Optional slot wired into `render_real_distilled_av` so + /// `Ltx2Engine::render_chain_stage` can snapshot the pre-VAE-decode + /// final latents and forward them to the next chain stage as a + /// [`super::chain::ChainTail`]. `None` outside chain flow. 
+ pub(crate) tail_capture: Option>>>, } impl Ltx2RuntimeSession { @@ -298,6 +303,7 @@ impl Ltx2RuntimeSession { Self { device: Some(device), prompt_encoder: Some(prompt_encoder), + tail_capture: None, } } @@ -305,9 +311,20 @@ impl Ltx2RuntimeSession { Self { device: None, prompt_encoder: Some(prompt_encoder), + tail_capture: None, } } + pub(crate) fn arm_tail_capture(&mut self) -> std::sync::Arc>> { + let slot = std::sync::Arc::new(std::sync::Mutex::new(None)); + self.tail_capture = Some(std::sync::Arc::clone(&slot)); + slot + } + + pub(crate) fn clear_tail_capture(&mut self) { + self.tail_capture = None; + } + pub fn prepare(&mut self, plan: &Ltx2GeneratePlan) -> Result { let prepare_total_start = Instant::now(); let mut stage1_shape = derive_stage1_render_shape( @@ -597,7 +614,13 @@ impl Ltx2RuntimeSession { return Ok(None); } let render = match plan.pipeline { - PipelineKind::Distilled => render_real_distilled_av(plan, prepared, device, progress), + PipelineKind::Distilled => render_real_distilled_av( + plan, + prepared, + device, + progress, + self.tail_capture.as_ref(), + ), PipelineKind::OneStage => render_real_one_stage_av(plan, prepared, device, progress), PipelineKind::TwoStage | PipelineKind::TwoStageHq @@ -1724,6 +1747,7 @@ fn render_real_distilled_av( prepared: &NativePreparedRun, device: &candle_core::Device, progress: Option<&ProgressCallback>, + tail_capture: Option<&std::sync::Arc>>>, ) -> Result { let debug_enabled = ltx_debug_enabled(); let prompt_inputs = prepare_render_prompt_inputs( @@ -2008,6 +2032,16 @@ fn render_real_distilled_av( vae.use_tiling = false; vae.use_framewise_decoding = false; let decode_start = Instant::now(); + // Chain-stage hook: capture the pre-decode F32 latents so + // `Ltx2Engine::render_chain_stage` can narrow the tail off for the next + // stage's conditioning. Cheap shallow clone (candle tensors are + // Arc-backed). 
A poisoned mutex is ignored here — the outer caller + // detects an empty slot and emits a clear error. + if let Some(slot) = tail_capture { + if let Ok(mut guard) = slot.lock() { + *guard = Some(latents.clone()); + } + } let (_dec_output, video) = vae.decode(&latents.to_dtype(dtype)?, None, false, false)?; if debug_enabled { log_tensor_stats("decoded_video", &video)?; From 548f2fc82d80de423575c4aed5b202747ec013ab Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 18:35:20 -0700 Subject: [PATCH 09/31] feat(server): chain render endpoint with SSE streaming Add POST /api/generate/chain and POST /api/generate/chain/stream for server-side chained LTX-2 video generation. Handler take/restores the engine out of the model cache and runs the full chain in a spawn_blocking so the sync orchestrator never blocks the async runtime. Drives Ltx2ChainOrchestrator through the engine's ChainStageRenderer view, trims accumulated frames to target total from the tail per sign-off, encodes the stitched output (MP4 when the mp4 feature is on, APNG fallback otherwise), and saves to the gallery with a synthesised OutputMetadata. Expose as_chain_renderer() on InferenceEngine (default None), overridden by Ltx2Engine. Relax Ltx2ChainOrchestrator's renderer bound to ?Sized so trait objects compose cleanly. Promote ltx_video::video_enc from pub(crate) to pub so mold-server can reuse encode_mp4/encode_apng/ encode_gif/first_frame_png for chain stitching. Weight-free route tests cover the happy path, the mid-chain failure (502 Bad Gateway), the unsupported-model rejection (422), progress event ordering through the SSE helper, and tail-trim behaviour. Part of render-chain v1 (Phase 2). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- Cargo.lock | 1 + crates/mold-cli/Cargo.toml | 2 +- crates/mold-inference/src/engine.rs | 11 + crates/mold-inference/src/ltx2/chain.rs | 4 +- crates/mold-inference/src/ltx2/pipeline.rs | 4 + crates/mold-inference/src/ltx_video/mod.rs | 5 +- crates/mold-server/Cargo.toml | 5 + crates/mold-server/src/lib.rs | 1 + crates/mold-server/src/routes.rs | 27 +- crates/mold-server/src/routes_chain.rs | 788 +++++++++++++++++++++ 10 files changed, 843 insertions(+), 5 deletions(-) create mode 100644 crates/mold-server/src/routes_chain.rs diff --git a/Cargo.lock b/Cargo.lock index c950b94e..4a76c58b 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3104,6 +3104,7 @@ dependencies = [ "async-trait", "axum", "base64 0.22.1", + "candle-core-mold", "clap", "dirs 5.0.1", "futures", diff --git a/crates/mold-cli/Cargo.toml b/crates/mold-cli/Cargo.toml index a7734a87..07a71f88 100644 --- a/crates/mold-cli/Cargo.toml +++ b/crates/mold-cli/Cargo.toml @@ -22,7 +22,7 @@ discord = ["mold-discord"] expand = ["mold-inference/expand", "mold-server/expand", "mold-tui?/expand"] tui = ["dep:mold-tui"] webp = ["mold-inference/webp"] -mp4 = ["mold-inference/mp4"] +mp4 = ["mold-inference/mp4", "mold-server/mp4"] metrics = ["mold-server/metrics"] [dependencies] diff --git a/crates/mold-inference/src/engine.rs b/crates/mold-inference/src/engine.rs index 949f5089..cddafc7a 100644 --- a/crates/mold-inference/src/engine.rs +++ b/crates/mold-inference/src/engine.rs @@ -35,6 +35,17 @@ pub trait InferenceEngine: Send + Sync { fn model_paths(&self) -> Option<&mold_core::ModelPaths> { None } + + /// Returns a [`ChainStageRenderer`] view of this engine if the family + /// supports chained video generation. Default is `None` — only LTX-2 + /// distilled overrides this in v1. 
+ /// + /// Callers (the server chain route) invoke this once per stage to drive + /// [`crate::ltx2::Ltx2ChainOrchestrator::run`]; engines that don't support + /// chaining return `None` and the caller responds with 422. + fn as_chain_renderer(&mut self) -> Option<&mut dyn crate::ltx2::ChainStageRenderer> { + None + } } /// Restores an `Option` slot even if the current scope unwinds. diff --git a/crates/mold-inference/src/ltx2/chain.rs b/crates/mold-inference/src/ltx2/chain.rs index d40c95de..81eb2b47 100644 --- a/crates/mold-inference/src/ltx2/chain.rs +++ b/crates/mold-inference/src/ltx2/chain.rs @@ -159,11 +159,11 @@ pub struct ChainRunOutput { /// Drives the per-stage render loop for a chained generation. Borrows its /// renderer mutably so the loop can re-enter the engine on the same GPU /// context across stages. -pub struct Ltx2ChainOrchestrator<'a, R: ChainStageRenderer> { +pub struct Ltx2ChainOrchestrator<'a, R: ChainStageRenderer + ?Sized> { renderer: &'a mut R, } -impl<'a, R: ChainStageRenderer> Ltx2ChainOrchestrator<'a, R> { +impl<'a, R: ChainStageRenderer + ?Sized> Ltx2ChainOrchestrator<'a, R> { pub fn new(renderer: &'a mut R) -> Self { Self { renderer } } diff --git a/crates/mold-inference/src/ltx2/pipeline.rs b/crates/mold-inference/src/ltx2/pipeline.rs index 7f419d4d..9c744d8b 100644 --- a/crates/mold-inference/src/ltx2/pipeline.rs +++ b/crates/mold-inference/src/ltx2/pipeline.rs @@ -720,6 +720,10 @@ impl InferenceEngine for Ltx2Engine { fn model_paths(&self) -> Option<&ModelPaths> { Some(&self.paths) } + + fn as_chain_renderer(&mut self) -> Option<&mut dyn crate::ltx2::ChainStageRenderer> { + Some(self) + } } #[cfg(test)] diff --git a/crates/mold-inference/src/ltx_video/mod.rs b/crates/mold-inference/src/ltx_video/mod.rs index 4e37627f..aa01b614 100644 --- a/crates/mold-inference/src/ltx_video/mod.rs +++ b/crates/mold-inference/src/ltx_video/mod.rs @@ -1,5 +1,8 @@ pub(crate) mod latent_upsampler; mod pipeline; -pub(crate) mod video_enc; +// Video 
encoding helpers (GIF/APNG/WebP/MP4 + thumbnail) are used by +// chain stitching in `mold-server`, so the module is public rather than +// crate-private. +pub mod video_enc; pub use pipeline::LtxVideoEngine; diff --git a/crates/mold-server/Cargo.toml b/crates/mold-server/Cargo.toml index 28e76c4b..34e6645f 100644 --- a/crates/mold-server/Cargo.toml +++ b/crates/mold-server/Cargo.toml @@ -25,6 +25,7 @@ default = [] cuda = ["mold-inference/cuda"] metal = ["mold-inference/metal"] expand = ["mold-inference/expand"] +mp4 = ["mold-inference/mp4"] metrics = ["dep:metrics", "dep:metrics-exporter-prometheus"] nvml = ["dep:nvml-wrapper"] @@ -72,3 +73,7 @@ async-stream = "0.3" [dev-dependencies] tempfile = "3" tokio = { version = "1", features = ["full", "test-util"] } +# Chain route tests build a synthetic motion-tail Tensor via the same +# candle APIs the inference crate uses — keep this in lockstep with +# mold-inference's pinned candle-core-mold version. +candle-core = { package = "candle-core-mold", version = "0.9.10" } diff --git a/crates/mold-server/src/lib.rs b/crates/mold-server/src/lib.rs index 97ea6f42..62ef01ca 100644 --- a/crates/mold-server/src/lib.rs +++ b/crates/mold-server/src/lib.rs @@ -13,6 +13,7 @@ pub mod rate_limit; pub mod request_id; pub mod resources; pub mod routes; +pub mod routes_chain; pub mod state; pub mod web_ui; diff --git a/crates/mold-server/src/routes.rs b/crates/mold-server/src/routes.rs index 4ea11cd8..2d6ffc3c 100644 --- a/crates/mold-server/src/routes.rs +++ b/crates/mold-server/src/routes.rs @@ -133,7 +133,19 @@ use crate::queue::clean_error_message; #[derive(OpenApi)] #[openapi( - paths(generate, generate_stream, expand_prompt, list_models, load_model, pull_model_endpoint, unload_model, server_status, health), + paths( + generate, + generate_stream, + expand_prompt, + list_models, + load_model, + pull_model_endpoint, + unload_model, + server_status, + health, + crate::routes_chain::generate_chain, + 
crate::routes_chain::generate_chain_stream, + ), components(schemas( mold_core::GenerateRequest, mold_core::GenerateResponse, @@ -148,6 +160,11 @@ use crate::queue::clean_error_message; mold_core::SseProgressEvent, mold_core::SseCompleteEvent, mold_core::SseErrorEvent, + mold_core::ChainRequest, + mold_core::ChainResponse, + mold_core::ChainStage, + mold_core::ChainProgressEvent, + mold_core::SseChainCompleteEvent, ModelInfoExtended, LoadModelBody, UnloadRequest, @@ -171,6 +188,14 @@ pub fn create_router(state: AppState) -> Router { Router::new() .route("/api/generate", post(generate)) .route("/api/generate/stream", post(generate_stream)) + .route( + "/api/generate/chain", + post(crate::routes_chain::generate_chain), + ) + .route( + "/api/generate/chain/stream", + post(crate::routes_chain::generate_chain_stream), + ) .route("/api/expand", post(expand_prompt)) .route("/api/models", get(list_models)) .route("/api/models/load", post(load_model)) diff --git a/crates/mold-server/src/routes_chain.rs b/crates/mold-server/src/routes_chain.rs new file mode 100644 index 00000000..c8e4ef68 --- /dev/null +++ b/crates/mold-server/src/routes_chain.rs @@ -0,0 +1,788 @@ +//! Server-side chained video generation endpoints. +//! +//! Exposes `POST /api/generate/chain` (synchronous) and +//! `POST /api/generate/chain/stream` (SSE). Both drive +//! [`mold_inference::ltx2::Ltx2ChainOrchestrator`] through an engine's +//! [`mold_inference::ltx2::ChainStageRenderer`] view. +//! +//! Unlike the single-shot generate path (which queues through +//! [`crate::state::QueueHandle`] to keep small GPU jobs FIFO-fair), chains +//! are multi-minute compound jobs — the handler take/restores the engine +//! out of the model cache and runs the full sequence in a +//! [`tokio::task::spawn_blocking`] so the sync orchestrator never blocks +//! the async runtime. While the chain is running the engine is removed +//! from the cache, so concurrent generate/chain requests for the same +//! model cannot race. 
+ +use std::convert::Infallible; + +use axum::{ + extract::State, + response::sse::{Event as SseEvent, KeepAlive, Sse}, + Json, +}; +use base64::Engine as _; +use mold_core::chain::{ChainProgressEvent, ChainRequest, ChainResponse, SseChainCompleteEvent}; +use mold_core::{OutputFormat, OutputMetadata, VideoData}; +use tokio_stream::StreamExt as _; + +use crate::model_cache::CachedEngine; +use crate::model_manager; +use crate::queue::save_video_to_dir; +use crate::routes::ApiError; +use crate::state::AppState; + +/// Internal wire event used by the chain SSE stream before per-event +/// serialization. Separate from [`crate::state::SseMessage`] because chain +/// complete events carry a different payload (`SseChainCompleteEvent`) and +/// progress events are chain-shaped (`ChainProgressEvent`) rather than the +/// single-stage `SseProgressEvent`. +pub(crate) enum ChainSseMessage { + Progress(ChainProgressEvent), + Complete(SseChainCompleteEvent), + Error(String), +} + +fn chain_sse_event(msg: ChainSseMessage) -> SseEvent { + match msg { + ChainSseMessage::Progress(ev) => match serde_json::to_string(&ev) { + Ok(data) => SseEvent::default().event("progress").data(data), + Err(e) => SseEvent::default() + .event("error") + .data(format!(r#"{{"message":"serialize progress: {e}"}}"#)), + }, + ChainSseMessage::Complete(ev) => match serde_json::to_string(&ev) { + Ok(data) => SseEvent::default().event("complete").data(data), + Err(e) => SseEvent::default() + .event("error") + .data(format!(r#"{{"message":"serialize complete: {e}"}}"#)), + }, + ChainSseMessage::Error(message) => SseEvent::default() + .event("error") + .data(serde_json::json!({ "message": message }).to_string()), + } +} + +/// Encode chain frames into bytes for the requested output format. Returns +/// the encoded payload plus a best-effort animated-GIF preview for the +/// gallery. 
+/// +/// MP4 is gated behind the `mp4` feature flag; when the flag is disabled, +/// the handler falls back to APNG so the endpoint still produces a usable +/// animation on every build. +fn encode_chain_output( + frames: &[image::RgbImage], + fps: u32, + format: OutputFormat, +) -> anyhow::Result<(Vec, OutputFormat, Vec)> { + use mold_inference::ltx_video::video_enc; + + // Always produce a GIF preview for the gallery UI. Non-fatal. + let gif_preview = match video_enc::encode_gif(frames, fps) { + Ok(b) => b, + Err(e) => { + tracing::warn!("chain gif preview encode failed: {e:#}"); + Vec::new() + } + }; + + let (bytes, actual_format) = match format { + OutputFormat::Mp4 => { + #[cfg(feature = "mp4")] + { + (video_enc::encode_mp4(frames, fps)?, OutputFormat::Mp4) + } + #[cfg(not(feature = "mp4"))] + { + tracing::warn!( + "chain requested MP4 but server was built without the `mp4` feature — \ + falling back to APNG" + ); + ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ) + } + } + OutputFormat::Apng => ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ), + OutputFormat::Gif => (video_enc::encode_gif(frames, fps)?, OutputFormat::Gif), + // WebP is always available here because mold-inference's webp + // feature would need to gate at the transitive-dep level; for the + // chain route v1 we fall back to APNG when WebP is requested so + // we don't bind the server crate to another optional dep. + OutputFormat::Webp => { + tracing::warn!( + "chain WebP output is not supported on the server yet — falling back to APNG" + ); + ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ) + } + other => anyhow::bail!("{other:?} is not a video output format for chain generation"), + }; + + Ok((bytes, actual_format, gif_preview)) +} + +/// Build the `OutputMetadata` for a stitched chain output. 
Pulls chain- +/// level parameters (dimensions, seed, steps) from `req` and the prompt / +/// negative prompt from `stages[0]`. +fn chain_output_metadata(req: &ChainRequest, frame_count: u32) -> OutputMetadata { + let first_stage = req.stages.first(); + OutputMetadata { + prompt: first_stage.map(|s| s.prompt.clone()).unwrap_or_default(), + negative_prompt: first_stage.and_then(|s| s.negative_prompt.clone()), + original_prompt: None, + model: req.model.clone(), + seed: req.seed.unwrap_or(0), + steps: req.steps, + guidance: req.guidance, + width: req.width, + height: req.height, + strength: Some(req.strength), + scheduler: None, + lora: None, + lora_scale: None, + frames: Some(frame_count), + fps: Some(req.fps), + version: mold_core::build_info::version_string().to_string(), + } +} + +/// Trim a frame buffer to the caller's requested total frame count, per +/// the signed-off "trim from tail" decision (2026-04-20). The orchestrator +/// always over-produces to hit or exceed `total_frames`; trimming here +/// keeps the output length deterministic without altering per-stage +/// denoise behaviour. +fn trim_to_total_frames(frames: &mut Vec, total_frames: Option) { + if let Some(target) = total_frames { + let target = target as usize; + if frames.len() > target { + frames.truncate(target); + } + } +} + +/// Produce a PNG thumbnail for the chain output — best-effort, returns +/// an empty `Vec` on failure so the save/response paths still succeed. +fn chain_thumbnail(frames: &[image::RgbImage]) -> Vec { + match mold_inference::ltx_video::video_enc::first_frame_png(frames) { + Ok(b) => b, + Err(e) => { + tracing::warn!("chain thumbnail encode failed: {e:#}"); + Vec::new() + } + } +} + +/// Build a `VideoData` for the `ChainResponse` body. 
+fn build_video_data( + bytes: Vec, + format: OutputFormat, + req: &ChainRequest, + frame_count: u32, + thumbnail: Vec, + gif_preview: Vec, +) -> VideoData { + let duration_ms = if req.fps == 0 { + None + } else { + Some((frame_count as u64 * 1000) / req.fps as u64) + }; + VideoData { + data: bytes, + format, + width: req.width, + height: req.height, + frames: frame_count, + fps: req.fps, + thumbnail, + gif_preview, + has_audio: false, + duration_ms, + audio_sample_rate: None, + audio_channels: None, + } +} + +/// Build the SSE `complete` payload for a finished chain run. Sibling of +/// [`crate::queue::build_sse_complete_event`] — kept in this module so the +/// chain-specific payload can evolve independently from the single-shot +/// one. +fn build_sse_chain_complete_event( + resp: &ChainResponse, + generation_time_ms: u64, +) -> SseChainCompleteEvent { + let b64 = base64::engine::general_purpose::STANDARD; + let video = &resp.video; + SseChainCompleteEvent { + video: b64.encode(&video.data), + format: video.format, + width: video.width, + height: video.height, + frames: video.frames, + fps: video.fps, + thumbnail: if video.thumbnail.is_empty() { + None + } else { + Some(b64.encode(&video.thumbnail)) + }, + gif_preview: if video.gif_preview.is_empty() { + None + } else { + Some(b64.encode(&video.gif_preview)) + }, + has_audio: video.has_audio, + duration_ms: video.duration_ms, + audio_sample_rate: video.audio_sample_rate, + audio_channels: video.audio_channels, + stage_count: resp.stage_count, + gpu: resp.gpu, + generation_time_ms: Some(generation_time_ms), + } +} + +/// Errors surfaced from the chain-run helper. Mapped to appropriate HTTP +/// status codes by the route handlers. +#[derive(Debug)] +enum ChainRunError { + /// Model family doesn't support chain rendering (422). + UnsupportedModel(String), + /// Engine missing from cache after `ensure_model_ready` (500). + CacheMiss(String), + /// Orchestrator returned an error mid-chain (502). 
+ Inference(String), + /// Output encoding / stitch failure (500). + Encode(String), + /// Task panic or join error (500). + Internal(String), +} + +impl From for ApiError { + fn from(err: ChainRunError) -> Self { + match err { + ChainRunError::UnsupportedModel(msg) => ApiError::validation(msg), + ChainRunError::CacheMiss(msg) => ApiError::internal(msg), + ChainRunError::Inference(msg) => { + ApiError::internal_with_status(msg, axum::http::StatusCode::BAD_GATEWAY) + } + ChainRunError::Encode(msg) => ApiError::internal(msg), + ChainRunError::Internal(msg) => ApiError::internal(msg), + } + } +} + +/// Drive the chain to completion. Shared between the non-streaming and SSE +/// paths — the only caller-provided variable is `progress_cb`, which is +/// `None` for the plain JSON endpoint and `Some` for the SSE endpoint. +async fn run_chain( + state: &AppState, + req: ChainRequest, + progress_cb: Option>, +) -> Result<(ChainResponse, u64), ChainRunError> { + // Ensure the model is loaded. Progress forwarding is not plumbed yet — + // load-time events go through the model manager's own tracing. Chain + // stage events (StageStart/DenoiseStep/StageDone/Stitching) come from + // the orchestrator during the blocking task below. + model_manager::ensure_model_ready(state, &req.model, None) + .await + .map_err(|e| ChainRunError::CacheMiss(e.error))?; + + // Take the engine out of the cache so the blocking orchestrator run + // owns it for the full multi-minute chain without holding the async + // mutex guard across an await. Restore when we're done (or on error). 
+ let mut cache = state.model_cache.lock().await; + let cached: CachedEngine = cache.take(&req.model).ok_or_else(|| { + ChainRunError::CacheMiss(format!( + "engine '{}' vanished from cache after ensure_model_ready", + req.model + )) + })?; + drop(cache); + + let req_for_task = req.clone(); + let join_handle = tokio::task::spawn_blocking(move || { + let mut cached = cached; + let mut progress_cb = progress_cb; + let outcome = { + let engine = &mut cached.engine; + match engine.as_chain_renderer() { + Some(renderer) => { + let mut orch = mold_inference::ltx2::Ltx2ChainOrchestrator::new(renderer); + // The orchestrator expects `Option<&mut dyn FnMut(...)>` + // — synthesise that from the optional boxed callback we + // moved into this task. + let result = if let Some(cb) = progress_cb.as_deref_mut() { + orch.run(&req_for_task, Some(cb)) + } else { + orch.run(&req_for_task, None) + }; + result.map_err(|e| ChainRunError::Inference(format!("{e:#}"))) + } + None => Err(ChainRunError::UnsupportedModel(format!( + "model '{}' does not support chained video generation", + req_for_task.model + ))), + } + }; + (cached, outcome) + }); + + let (cached, outcome) = match join_handle.await { + Ok(pair) => pair, + Err(join_err) => { + return Err(ChainRunError::Internal(format!( + "chain orchestrator task failed: {join_err}" + ))); + } + }; + + // Restore the engine to the cache regardless of success/failure so the + // next request can reuse it. 
+ { + let mut cache = state.model_cache.lock().await; + cache.restore(cached); + } + + let chain_output = outcome?; + let stage_count = chain_output.stage_count; + let generation_time_ms = chain_output.generation_time_ms; + let mut frames = chain_output.frames; + trim_to_total_frames(&mut frames, req.total_frames); + + if frames.is_empty() { + return Err(ChainRunError::Encode( + "chain run emitted zero frames after trim".to_string(), + )); + } + + let (bytes, output_format, gif_preview) = + encode_chain_output(&frames, req.fps, req.output_format) + .map_err(|e| ChainRunError::Encode(format!("encode chain output: {e:#}")))?; + let thumbnail = chain_thumbnail(&frames); + let frame_count = frames.len() as u32; + + // Save to the gallery directory (best-effort, non-blocking). + let output_dir = { + let config = state.config.read().await; + if config.is_output_disabled() { + None + } else { + Some(config.effective_output_dir()) + } + }; + if let Some(dir) = output_dir { + let metadata = chain_output_metadata(&req, frame_count); + let bytes_clone = bytes.clone(); + let gif_clone = gif_preview.clone(); + let model = req.model.clone(); + let db = state.metadata_db.clone(); + tokio::task::spawn_blocking(move || { + save_video_to_dir( + &dir, + &bytes_clone, + &gif_clone, + output_format, + &model, + &metadata, + Some(generation_time_ms as i64), + db.as_ref().as_ref(), + ); + }); + } + + let video = build_video_data( + bytes, + output_format, + &req, + frame_count, + thumbnail, + gif_preview, + ); + let response = ChainResponse { + video, + stage_count, + gpu: None, + }; + Ok((response, generation_time_ms)) +} + +/// `POST /api/generate/chain` — synchronous chained video generation. 
+#[utoipa::path( + post, + path = "/api/generate/chain", + tag = "generation", + request_body = mold_core::ChainRequest, + responses( + (status = 200, description = "Stitched chain video", body = mold_core::ChainResponse), + (status = 422, description = "Invalid request or unsupported model"), + (status = 500, description = "Chain render failed"), + (status = 502, description = "Chain render failed mid-stage"), + ) +)] +pub async fn generate_chain( + State(state): State, + Json(req): Json, +) -> Result, ApiError> { + let req = req + .normalise() + .map_err(|e| ApiError::validation(e.to_string()))?; + + tracing::info!( + model = %req.model, + stages = req.stages.len(), + width = req.width, + height = req.height, + fps = req.fps, + "generate/chain request" + ); + + let (response, _elapsed_ms) = run_chain(&state, req, None).await?; + Ok(Json(response)) +} + +/// `POST /api/generate/chain/stream` — SSE-streamed chain generation. Emits +/// [`ChainProgressEvent`]s as `event: progress` frames while the chain +/// runs, and a single `event: complete` frame with a [`SseChainCompleteEvent`] +/// payload when the stitched output is ready. Mid-chain failure closes the +/// stream with an `event: error` frame carrying the orchestrator message. 
+#[utoipa::path( + post, + path = "/api/generate/chain/stream", + tag = "generation", + request_body = mold_core::ChainRequest, + responses( + (status = 200, description = "SSE event stream with chain progress and completion"), + (status = 422, description = "Invalid request or unsupported model"), + (status = 500, description = "Chain render failed"), + ) +)] +pub async fn generate_chain_stream( + State(state): State, + Json(req): Json, +) -> Result>>, ApiError> { + let req = req + .normalise() + .map_err(|e| ApiError::validation(e.to_string()))?; + + tracing::info!( + model = %req.model, + stages = req.stages.len(), + width = req.width, + height = req.height, + fps = req.fps, + "generate/chain/stream request" + ); + + let (tx, rx) = tokio::sync::mpsc::unbounded_channel::(); + let state_clone = state.clone(); + let tx_for_task = tx.clone(); + + tokio::spawn(async move { + let tx_for_cb = tx_for_task.clone(); + let cb: Box = Box::new(move |event| { + let _ = tx_for_cb.send(ChainSseMessage::Progress(event)); + }); + match run_chain(&state_clone, req, Some(cb)).await { + Ok((response, elapsed_ms)) => { + let complete = build_sse_chain_complete_event(&response, elapsed_ms); + let _ = tx_for_task.send(ChainSseMessage::Complete(complete)); + } + Err(err) => { + let api_err: ApiError = err.into(); + let _ = tx_for_task.send(ChainSseMessage::Error(api_err.error)); + } + } + // `tx_for_task` is dropped here, closing the channel and finalizing + // the SSE stream after the last complete/error frame. 
+ }); + drop(tx); // ensure only the task holds the sender + + let stream = tokio_stream::wrappers::UnboundedReceiverStream::new(rx) + .map(|msg| Ok::<_, Infallible>(chain_sse_event(msg))); + + Ok(Sse::new(stream).keep_alive( + KeepAlive::new() + .interval(std::time::Duration::from_secs(15)) + .text("ping"), + )) +} + +#[cfg(test)] +mod tests { + use super::*; + use anyhow::Result; + use candle_core::{DType, Device, Tensor}; + use image::{Rgb, RgbImage}; + use mold_core::chain::{ChainProgressEvent, ChainRequest, ChainStage}; + use mold_core::{GenerateRequest, GenerateResponse}; + use mold_inference::ltx2::{ChainStageRenderer, ChainTail, StageOutcome, StageProgressEvent}; + use mold_inference::InferenceEngine; + use std::sync::{Arc, Mutex}; + + /// Mock engine that delegates to a simple chain renderer producing + /// deterministic solid-color frames + a zero-valued latent tail. The + /// chain renderer is owned by the engine so `as_chain_renderer` can + /// hand out a `&mut dyn ChainStageRenderer` over it. 
+ struct ChainMockEngine { + loaded: bool, + fail_on_stage: Option, + renderer_calls: Arc>, + } + + impl ChainMockEngine { + fn ready() -> Self { + Self { + loaded: true, + fail_on_stage: None, + renderer_calls: Arc::new(Mutex::new(0)), + } + } + fn failing_at(idx: usize) -> Self { + Self { + loaded: true, + fail_on_stage: Some(idx), + renderer_calls: Arc::new(Mutex::new(0)), + } + } + } + + impl ChainStageRenderer for ChainMockEngine { + fn render_stage( + &mut self, + stage_req: &GenerateRequest, + _carry: Option<&ChainTail>, + _motion_tail_pixel_frames: u32, + _stage_progress: Option<&mut dyn FnMut(StageProgressEvent)>, + ) -> Result { + let idx = { + let mut calls = self.renderer_calls.lock().unwrap(); + let idx = *calls; + *calls += 1; + idx + }; + if self.fail_on_stage == Some(idx) { + anyhow::bail!("simulated chain failure at stage {idx}"); + } + let frame_count = stage_req.frames.expect("chain stage missing frame count") as usize; + let width = stage_req.width; + let height = stage_req.height; + let mut frames = Vec::with_capacity(frame_count); + for f in 0..frame_count { + let shade = (idx as u8).wrapping_mul(17).wrapping_add(f as u8); + frames.push(RgbImage::from_pixel(width, height, Rgb([shade, 0, 0]))); + } + let last_frame = frames.last().cloned().unwrap(); + let latent = Tensor::zeros( + (1, 128, 1, height as usize / 32, width as usize / 32), + DType::F32, + &Device::Cpu, + )?; + Ok(StageOutcome { + frames, + tail: ChainTail { + frames: 4, + latents: latent, + last_rgb_frame: last_frame, + }, + generation_time_ms: 10, + }) + } + } + + impl InferenceEngine for ChainMockEngine { + fn generate(&mut self, _req: &GenerateRequest) -> Result { + anyhow::bail!("chain mock engine does not support single-shot generate") + } + fn model_name(&self) -> &str { + "ltx-2-19b-distilled:mock" + } + fn is_loaded(&self) -> bool { + self.loaded + } + fn load(&mut self) -> Result<()> { + self.loaded = true; + Ok(()) + } + fn as_chain_renderer( + &mut self, + ) -> 
Option<&mut dyn mold_inference::ltx2::ChainStageRenderer> { + Some(self) + } + } + + /// Build an AppState whose model cache already contains a chain-capable + /// mock engine under the model name the tests pass in their requests. + fn state_with_chain_engine(engine: ChainMockEngine) -> AppState { + AppState::with_engine(engine) + } + + fn chain_req_for_mock(model: &str, stages: u32) -> ChainRequest { + ChainRequest { + model: model.to_string(), + stages: (0..stages) + .map(|_| ChainStage { + prompt: "a cat walking".into(), + frames: 9, + source_image: None, + negative_prompt: None, + seed_offset: None, + }) + .collect(), + motion_tail_frames: 0, // simplifies frame accounting for the mock + width: 64, + height: 64, + fps: 12, + seed: Some(42), + steps: 4, + guidance: 3.0, + strength: 1.0, + output_format: OutputFormat::Apng, // avoid needing the mp4 feature in tests + placement: None, + prompt: None, + total_frames: None, + clip_frames: None, + source_image: None, + } + } + + #[tokio::test] + async fn chain_happy_path_returns_stage_count_and_video() { + let engine = ChainMockEngine::ready(); + let state = state_with_chain_engine(engine); + let req = chain_req_for_mock("ltx-2-19b-distilled:mock", 3); + + let (resp, elapsed_ms) = run_chain(&state, req, None) + .await + .expect("chain run succeeds"); + + assert_eq!(resp.stage_count, 3, "response must report all 3 stages"); + assert_eq!(resp.video.fps, 12); + assert_eq!(resp.video.frames, 9 * 3, "3 stages × 9 frames with tail=0"); + assert_eq!(resp.video.format, OutputFormat::Apng); + assert!(!resp.video.data.is_empty(), "apng bytes written"); + // elapsed_ms is the sum of the mock's reported per-stage time (10ms each). 
+ assert_eq!(elapsed_ms, 30); + } + + #[tokio::test] + async fn chain_stream_emits_progress_then_complete_in_order() { + let engine = ChainMockEngine::ready(); + let state = state_with_chain_engine(engine); + let req = chain_req_for_mock("ltx-2-19b-distilled:mock", 2); + + let collected: Arc>> = Arc::new(Mutex::new(Vec::new())); + let collected_cb = collected.clone(); + let cb: Box = Box::new(move |ev| { + collected_cb.lock().unwrap().push(ev); + }); + let (resp, _) = run_chain(&state, req, Some(cb)) + .await + .expect("chain run succeeds"); + + assert_eq!(resp.stage_count, 2); + let events = collected.lock().unwrap(); + assert!(!events.is_empty(), "progress events must flow"); + assert!( + matches!( + events[0], + ChainProgressEvent::ChainStart { stage_count: 2, .. } + ), + "first event must be ChainStart, got {:?}", + events[0] + ); + assert!( + matches!(events.last().unwrap(), ChainProgressEvent::Stitching { .. }), + "last event must be Stitching, got {:?}", + events.last() + ); + // There must be exactly one StageStart + StageDone per stage. + let stage_starts = events + .iter() + .filter(|e| matches!(e, ChainProgressEvent::StageStart { .. })) + .count(); + let stage_dones = events + .iter() + .filter(|e| matches!(e, ChainProgressEvent::StageDone { .. 
})) + .count(); + assert_eq!(stage_starts, 2); + assert_eq!(stage_dones, 2); + } + + #[tokio::test] + async fn chain_mid_chain_failure_maps_to_bad_gateway() { + let engine = ChainMockEngine::failing_at(1); + let state = state_with_chain_engine(engine); + let req = chain_req_for_mock("ltx-2-19b-distilled:mock", 3); + + let err = run_chain(&state, req, None) + .await + .expect_err("mid-chain failure must bubble up"); + match err { + ChainRunError::Inference(msg) => { + assert!( + msg.contains("simulated chain failure"), + "inference error must carry renderer message, got: {msg}" + ); + } + other => panic!("expected Inference error, got {other:?}"), + } + } + + #[tokio::test] + async fn chain_unsupported_model_rejects_with_validation() { + /// Engine that is fully capable of single-shot generate but refuses + /// chain rendering (mirrors every non-LTX-2 family). + struct NonChainEngine; + impl InferenceEngine for NonChainEngine { + fn generate(&mut self, _req: &GenerateRequest) -> Result { + anyhow::bail!("no single-shot generate in this test either") + } + fn model_name(&self) -> &str { + "flux-dev:q8" + } + fn is_loaded(&self) -> bool { + true + } + fn load(&mut self) -> Result<()> { + Ok(()) + } + // No override for as_chain_renderer — default returns None. 
+ } + + let state = AppState::with_engine(NonChainEngine); + let mut req = chain_req_for_mock("flux-dev:q8", 2); + req.model = "flux-dev:q8".into(); + let err = run_chain(&state, req, None) + .await + .expect_err("non-chain model must fail"); + match err { + ChainRunError::UnsupportedModel(msg) => { + assert!( + msg.contains("does not support chained video generation"), + "unsupported-model error must name the constraint, got: {msg}" + ); + } + other => panic!("expected UnsupportedModel, got {other:?}"), + } + } + + #[tokio::test] + async fn chain_trims_frames_from_tail_when_total_frames_set() { + let engine = ChainMockEngine::ready(); + let state = state_with_chain_engine(engine); + let mut req = chain_req_for_mock("ltx-2-19b-distilled:mock", 2); + // Each stage produces 9 frames with tail=0 → 18 total. Trim to 10. + req.total_frames = Some(10); + + let (resp, _) = run_chain(&state, req, None).await.expect("chain runs"); + assert_eq!( + resp.video.frames, 10, + "total_frames must trim the stitched output length" + ); + } +} From 6ed6b59a93a3f65a1cc9f2b6440e7d701d05d8c5 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 18:54:43 -0700 Subject: [PATCH 10/31] feat(cli): chain rendering for --frames above clip cap When --frames exceeds the model's per-clip cap (97 for LTX-2 distilled), `mold run` now auto-builds a ChainRequest and routes to POST /api/generate/chain/stream (server mode) or runs the Ltx2ChainOrchestrator in-process (--local mode). New flags --clip-frames and --motion-tail let users tune the per-clip length and the motion-tail overlap (default 4 frames of latent carryover between clips). Stacked progress bars render a parent "Chain" bar (total frames) and a wiping per-stage bar (denoise step / total). Both server and local paths share a single encode+save+preview epilogue so output formatting, stdout piping, and gallery save are identical. 
Models outside LTX-2 distilled families error fast when --frames exceeds the single-clip cap rather than silently dropping frames or hitting the server's chain route with a non-chainable model. A pure `decide_chain_routing` helper captures the branching logic so auto- routing is unit-testable without async or network. Part of render-chain v1 (Phase 3). Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-cli/src/commands/chain.rs | 843 +++++++++++++++++++++++ crates/mold-cli/src/commands/generate.rs | 118 ++++ crates/mold-cli/src/commands/mod.rs | 1 + crates/mold-cli/src/commands/run.rs | 4 + crates/mold-cli/src/main.rs | 64 ++ 5 files changed, 1030 insertions(+) create mode 100644 crates/mold-cli/src/commands/chain.rs diff --git a/crates/mold-cli/src/commands/chain.rs b/crates/mold-cli/src/commands/chain.rs new file mode 100644 index 00000000..a0988b72 --- /dev/null +++ b/crates/mold-cli/src/commands/chain.rs @@ -0,0 +1,843 @@ +//! CLI-side render-chain orchestration for LTX-2 distilled models. +//! +//! When `mold run --frames N` exceeds the per-clip cap of the selected model, +//! this module takes over from [`super::generate::run`]: it assembles a +//! [`ChainRequest`] from the user's CLI args and either submits it to a +//! running server via [`MoldClient::generate_chain_stream`] or, in `--local` +//! mode, drives an in-process [`Ltx2ChainOrchestrator`]. +//! +//! Both paths funnel through [`encode_and_save`] so stdout piping, gallery +//! save, metadata DB writes, and preview behaviour match the single-clip +//! path byte-for-byte. 
+ +use std::io::Write; +use std::time::Duration; + +use anyhow::Result; +use colored::Colorize; +use indicatif::{MultiProgress, ProgressBar, ProgressDrawTarget, ProgressStyle}; +use mold_core::chain::{ChainProgressEvent, ChainRequest}; +use mold_core::{Config, MoldClient, OutputFormat, VideoData}; + +use crate::control::CliContext; +use crate::output::{is_piped, status}; +use crate::theme; + +/// Per-clip frame cap for LTX-2 19B/22B distilled. The distilled VAE +/// pipeline maxes at 97 pixel frames (13 latent frames) per clip. +pub const LTX2_DISTILLED_CLIP_CAP: u32 = 97; + +/// Outcome of [`decide_chain_routing`]: either the caller should continue +/// down the single-clip path, build a chain with the given settings, or +/// reject the request because the model family can't be chained. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum ChainRoutingDecision { + /// Go through the normal single-clip path; no chaining required. + SingleClip, + /// Submit a chain. `clip_frames` is the clamped per-clip cap. + Chain { clip_frames: u32, motion_tail: u32 }, + /// Model family doesn't support chaining and `frames` exceeds its cap. + Rejected { reason: String }, +} + +/// Pure decision function — given a model family, the user's requested +/// `frames`, and the optional `--clip-frames` override, decide whether to +/// chain, stay single-clip, or reject. +/// +/// The clamp-to-cap behaviour surfaces through the returned `clip_frames` +/// field; callers warn the user via stderr when they had to clamp. 
+pub fn decide_chain_routing(
+    frames: Option<u32>,
+    family: Option<&str>,
+    model: &str,
+    clip_frames_flag: Option<u32>,
+    motion_tail: u32,
+) -> ChainRoutingDecision {
+    let Some(total_frames) = frames else {
+        return ChainRoutingDecision::SingleClip;
+    };
+
+    let is_ltx2_distilled = family == Some("ltx2") && model.contains("distilled");
+
+    if !is_ltx2_distilled {
+        // Non-chainable families: if the requested frame count is within a
+        // conservative single-clip budget, stay on the single-clip path and
+        // let the engine decide if it's acceptable. Otherwise, reject with
+        // a clear message rather than silently over-producing.
+        if total_frames <= LTX2_DISTILLED_CLIP_CAP {
+            return ChainRoutingDecision::SingleClip;
+        }
+        return ChainRoutingDecision::Rejected {
+            reason: format!(
+                "model '{model}' does not support chained video generation \
+                 (only LTX-2 distilled families do); specify --frames <= {} \
+                 per clip for this model",
+                LTX2_DISTILLED_CLIP_CAP,
+            ),
+        };
+    }
+
+    let cap = LTX2_DISTILLED_CLIP_CAP;
+    let effective_clip_frames = clip_frames_flag.unwrap_or(cap).min(cap);
+
+    if total_frames <= effective_clip_frames {
+        return ChainRoutingDecision::SingleClip;
+    }
+
+    if motion_tail >= effective_clip_frames {
+        return ChainRoutingDecision::Rejected {
+            reason: format!(
+                "--motion-tail ({motion_tail}) must be strictly less than \
+                 --clip-frames ({effective_clip_frames}) so every continuation \
+                 emits at least one new frame",
+            ),
+        };
+    }
+
+    ChainRoutingDecision::Chain {
+        clip_frames: effective_clip_frames,
+        motion_tail,
+    }
+}
+
+/// Emit a stderr warning if `--clip-frames` was above the model's cap and
+/// got clamped. The caller already holds the effective (clamped) value; this
+/// helper only emits the warning and returns nothing. 
+pub fn warn_if_clamped(flag: Option, cap: u32) { + if let Some(requested) = flag { + if requested > cap { + crate::output::status!( + "{} --clip-frames {} exceeds model cap {}, clamping to {}", + theme::prefix_warning(), + requested, + cap, + cap, + ); + } + } +} + +/// Caller-supplied inputs for a chain run, bundled so the remote + local +/// paths can share a single helper without a 20-arg function signature. +#[allow(clippy::too_many_arguments)] +pub struct ChainInputs { + pub prompt: String, + pub model: String, + pub width: u32, + pub height: u32, + pub steps: u32, + pub guidance: f64, + pub strength: f64, + pub seed: Option, + pub fps: u32, + pub output_format: OutputFormat, + pub total_frames: u32, + pub clip_frames: u32, + pub motion_tail: u32, + pub source_image: Option>, + pub placement: Option, +} + +impl ChainInputs { + fn to_chain_request(&self) -> ChainRequest { + ChainRequest { + model: self.model.clone(), + stages: Vec::new(), + motion_tail_frames: self.motion_tail, + width: self.width, + height: self.height, + fps: self.fps, + seed: self.seed, + steps: self.steps, + guidance: self.guidance, + strength: self.strength, + output_format: self.output_format, + placement: self.placement.clone(), + prompt: Some(self.prompt.clone()), + total_frames: Some(self.total_frames), + clip_frames: Some(self.clip_frames), + source_image: self.source_image.clone(), + } + } +} + +/// Run a chain end-to-end, dispatching to the server (streaming) or the +/// local orchestrator based on the `local` flag. Handles encoding, save, +/// preview, and final status messages. 
+#[allow(clippy::too_many_arguments)] +pub async fn run_chain( + inputs: ChainInputs, + host: Option, + output: Option, + no_metadata: bool, + preview: bool, + local: bool, + gpus: Option, + t5_variant: Option, + qwen3_variant: Option, + qwen2_variant: Option, + qwen2_text_encoder_mode: Option, + eager: bool, + offload: bool, +) -> Result<()> { + // Validate the auto-expand form before touching the network / GPU so + // obvious mistakes (bad clip_frames math, too many stages) fail fast. + let chain_req = inputs.to_chain_request(); + let normalised = chain_req.clone().normalise()?; + let stage_count = normalised.stages.len() as u32; + + status!( + "{} Chain mode: {} frames → {} stages × {} frames (tail {})", + theme::icon_mode(), + inputs.total_frames, + stage_count, + inputs.clip_frames, + inputs.motion_tail, + ); + + let ctx = CliContext::new(host.as_deref()); + let config = ctx.config().clone(); + let embed_metadata = config.effective_embed_metadata(no_metadata.then_some(false)); + let _ = embed_metadata; // reserved for future metadata-embed work on chain output + + let t0 = std::time::Instant::now(); + let video = if local { + #[cfg(any(feature = "cuda", feature = "metal"))] + { + crate::ui::print_using_local_inference(); + run_chain_local( + &chain_req, + &config, + gpus, + t5_variant, + qwen3_variant, + qwen2_variant, + qwen2_text_encoder_mode, + eager, + offload, + ) + .await? + } + #[cfg(not(any(feature = "cuda", feature = "metal")))] + { + let _ = ( + gpus, + t5_variant, + qwen3_variant, + qwen2_variant, + qwen2_text_encoder_mode, + eager, + offload, + ); + anyhow::bail!( + "No mold server running and this binary was built without GPU support.\n\ + Either start a server with `mold serve` or rebuild with --features cuda" + ) + } + } else { + run_chain_remote(ctx.client(), &chain_req).await? 
+ }; + + let elapsed_ms = t0.elapsed().as_millis() as u64; + let base_seed = inputs.seed.unwrap_or(0); + + encode_and_save( + &inputs, + &video, + output.as_deref(), + preview, + elapsed_ms, + base_seed, + )?; + + Config::write_last_model(&inputs.model); + Ok(()) +} + +/// Remote chain: streaming SSE with stacked progress bars. +async fn run_chain_remote(client: &MoldClient, req: &ChainRequest) -> Result { + let (tx, rx) = tokio::sync::mpsc::unbounded_channel::(); + let render = tokio::spawn(render_chain_progress(rx)); + + let stream_result = client.generate_chain_stream(req, tx).await; + let _ = render.await; + + match stream_result { + Ok(Some(resp)) => Ok(resp.video), + Ok(None) => { + // Server predates chain endpoint; fall back to non-streaming. + status!( + "{} Server SSE chain endpoint unavailable, falling back to blocking endpoint", + theme::prefix_warning(), + ); + let resp = client.generate_chain(req).await?; + Ok(resp.video) + } + Err(e) => Err(e), + } +} + +#[cfg(any(feature = "cuda", feature = "metal"))] +#[allow(clippy::too_many_arguments)] +async fn run_chain_local( + chain_req: &ChainRequest, + config: &Config, + gpus: Option, + t5_variant_override: Option, + qwen3_variant_override: Option, + qwen2_variant_override: Option, + qwen2_text_encoder_mode_override: Option, + eager: bool, + offload: bool, +) -> Result { + use mold_core::manifest::find_manifest; + use mold_core::ModelPaths; + use mold_inference::LoadStrategy; + + // Normalise so we have expanded stages locally too. + let req = chain_req.clone().normalise()?; + + // Apply encoder-variant overrides before constructing the engine so the + // factory's auto-select picks them up. + apply_local_engine_env_overrides( + t5_variant_override.as_deref(), + qwen3_variant_override.as_deref(), + qwen2_variant_override.as_deref(), + qwen2_text_encoder_mode_override.as_deref(), + ); + + let model_name = req.model.clone(); + + // Ensure the model is pulled + config rows are in place. 
+ let (paths, effective_config) = if let Some(p) = ModelPaths::resolve(&model_name, config) { + (p, config.clone()) + } else if find_manifest(&model_name).is_some() { + crate::output::status!( + "{} Model '{}' not found locally, pulling...", + theme::icon_info(), + model_name.bold(), + ); + let updated = super::pull::pull_and_configure( + &model_name, + &mold_core::download::PullOptions::default(), + ) + .await?; + let p = ModelPaths::resolve(&model_name, &updated).ok_or_else(|| { + anyhow::anyhow!("model '{model_name}' was pulled but paths could not be resolved") + })?; + (p, updated) + } else { + anyhow::bail!( + "no model paths configured for '{model_name}'. Add [models.{model_name}] \ + to ~/.mold/config.toml or pull via `mold pull {model_name}`." + ); + }; + + let is_eager = eager || std::env::var("MOLD_EAGER").is_ok_and(|v| v == "1"); + let load_strategy = if is_eager { + LoadStrategy::Eager + } else { + LoadStrategy::Sequential + }; + if is_eager { + std::env::set_var("MOLD_EAGER", "1"); + } + let is_offload = offload || std::env::var("MOLD_OFFLOAD").is_ok_and(|v| v == "1"); + + let gpu_selection = match &gpus { + Some(s) => mold_core::types::GpuSelection::parse(s)?, + None => effective_config.gpu_selection(), + }; + let discovered = mold_inference::device::discover_gpus(); + let available = mold_inference::device::filter_gpus(&discovered, &gpu_selection); + let gpu_ordinal = mold_inference::device::select_best_gpu(&available) + .map(|g| g.ordinal) + .unwrap_or(0); + + let mut engine = mold_inference::create_engine( + &model_name, + paths, + &effective_config, + load_strategy, + gpu_ordinal, + is_offload, + )?; + + let (tx, rx) = tokio::sync::mpsc::unbounded_channel::(); + let render = tokio::spawn(render_chain_progress(rx)); + + let fps = req.fps; + let output_format = req.output_format; + let total_frames_opt = Some(req.total_frames.unwrap_or(u32::MAX)); + let req_clone = req.clone(); + + let handle = tokio::task::spawn_blocking(move || -> Result { + 
engine.load()?; + let renderer = engine.as_chain_renderer().ok_or_else(|| { + anyhow::anyhow!( + "model '{}' does not support chained video generation \ + (only LTX-2 distilled engines expose a ChainStageRenderer view)", + req_clone.model, + ) + })?; + let mut orch = mold_inference::ltx2::Ltx2ChainOrchestrator::new(renderer); + + let tx = tx; + let mut chain_cb = move |event: ChainProgressEvent| { + let _ = tx.send(event); + }; + let chain_output = orch.run(&req_clone, Some(&mut chain_cb))?; + + let mut frames = chain_output.frames; + if let Some(target) = total_frames_opt { + let target = target as usize; + if frames.len() > target { + frames.truncate(target); + } + } + if frames.is_empty() { + anyhow::bail!("chain run emitted zero frames after trim"); + } + + encode_local_frames(&frames, fps, output_format) + }); + + let result = handle.await??; + let _ = render.await; + Ok(result) +} + +#[cfg(any(feature = "cuda", feature = "metal"))] +fn apply_local_engine_env_overrides( + t5_variant: Option<&str>, + qwen3_variant: Option<&str>, + qwen2_variant: Option<&str>, + qwen2_text_encoder_mode: Option<&str>, +) { + if let Some(v) = t5_variant { + std::env::set_var("MOLD_T5_VARIANT", v); + } + if let Some(v) = qwen3_variant { + std::env::set_var("MOLD_QWEN3_VARIANT", v); + } + if let Some(v) = qwen2_variant { + std::env::set_var("MOLD_QWEN2_VARIANT", v); + } + if let Some(v) = qwen2_text_encoder_mode { + std::env::set_var("MOLD_QWEN2_TEXT_ENCODER_MODE", v); + } +} + +/// Encode stitched frames to the requested container. MP4 is feature-gated; +/// fall back to APNG when the CLI was built without `mp4`. 
+#[cfg(any(feature = "cuda", feature = "metal"))] +fn encode_local_frames( + frames: &[image::RgbImage], + fps: u32, + output_format: OutputFormat, +) -> Result { + use mold_inference::ltx_video::video_enc; + + let gif_preview = video_enc::encode_gif(frames, fps).unwrap_or_default(); + let thumbnail = video_enc::first_frame_png(frames).unwrap_or_default(); + + let (bytes, actual_format) = match output_format { + OutputFormat::Mp4 => { + #[cfg(feature = "mp4")] + { + (video_enc::encode_mp4(frames, fps)?, OutputFormat::Mp4) + } + #[cfg(not(feature = "mp4"))] + { + crate::output::status!( + "{} MP4 requested but this binary was built without --features mp4; \ + falling back to APNG", + theme::prefix_warning(), + ); + ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ) + } + } + OutputFormat::Apng => ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ), + OutputFormat::Gif => (video_enc::encode_gif(frames, fps)?, OutputFormat::Gif), + OutputFormat::Webp => { + crate::output::status!( + "{} WebP chain output not supported locally yet; falling back to APNG", + theme::prefix_warning(), + ); + ( + video_enc::encode_apng(frames, fps, None)?, + OutputFormat::Apng, + ) + } + other => anyhow::bail!("{other:?} is not a video output format for chain generation"), + }; + + let width = frames[0].width(); + let height = frames[0].height(); + let frame_count = frames.len() as u32; + let duration_ms = if fps == 0 { + None + } else { + Some((frame_count as u64 * 1000) / fps as u64) + }; + + Ok(VideoData { + data: bytes, + format: actual_format, + width, + height, + frames: frame_count, + fps, + thumbnail, + gif_preview, + has_audio: false, + duration_ms, + audio_sample_rate: None, + audio_channels: None, + }) +} + +/// Shared epilogue: write the stitched video to stdout/file/gallery and +/// emit a terminal preview if requested. 
+fn encode_and_save( + inputs: &ChainInputs, + video: &VideoData, + output: Option<&str>, + preview: bool, + elapsed_ms: u64, + base_seed: u64, +) -> Result<()> { + let piped = is_piped(); + + if piped && output.is_none() { + let mut stdout = std::io::stdout().lock(); + stdout.write_all(&video.data)?; + stdout.flush()?; + } else { + let filename = match output { + Some("-") => { + let mut stdout = std::io::stdout().lock(); + stdout.write_all(&video.data)?; + stdout.flush()?; + None + } + Some(path) => Some(path.to_string()), + None => { + let timestamp = std::time::SystemTime::now() + .duration_since(std::time::UNIX_EPOCH) + .unwrap_or_default() + .as_secs(); + Some(mold_core::default_output_filename( + &inputs.model, + timestamp, + video.format.extension(), + 1, + 0, + )) + } + }; + if let Some(ref filename) = filename { + if std::path::Path::new(filename).exists() { + status!("{} Overwriting: {}", theme::icon_alert(), filename); + } + std::fs::write(filename, &video.data)?; + status!( + "{} Saved: {} ({} frames, {}x{}, {} fps)", + theme::icon_done(), + filename.bold(), + video.frames, + video.width, + video.height, + video.fps, + ); + + // Persist to the gallery metadata DB. Build a synthetic + // GenerateRequest so the existing record_local_save helper can + // infer dimensions/seed/steps/etc. without a dedicated chain + // row schema. + let req = synth_generate_request(inputs, video); + crate::metadata_db::record_local_save( + std::path::Path::new(filename), + &req, + inputs.seed.unwrap_or(base_seed), + elapsed_ms, + video.format, + ); + } + } + + if preview && !piped { + // Best-effort: show the gif preview or fall back to the thumbnail + // or the video bytes themselves (GIF/APNG decode as images). 
+ let bytes_for_preview: &[u8] = if !video.gif_preview.is_empty() { + &video.gif_preview + } else if !video.thumbnail.is_empty() { + &video.thumbnail + } else { + &video.data + }; + super::generate::preview_image(bytes_for_preview); + } + + status!( + "{} Done — {} in {:.1}s ({} frames, seed: {})", + theme::icon_done(), + inputs.model.bold(), + elapsed_ms as f64 / 1000.0, + video.frames, + inputs.seed.unwrap_or(base_seed), + ); + + Ok(()) +} + +fn synth_generate_request(inputs: &ChainInputs, video: &VideoData) -> mold_core::GenerateRequest { + mold_core::GenerateRequest { + prompt: inputs.prompt.clone(), + negative_prompt: None, + model: inputs.model.clone(), + width: inputs.width, + height: inputs.height, + steps: inputs.steps, + guidance: inputs.guidance, + seed: inputs.seed, + batch_size: 1, + output_format: video.format, + embed_metadata: Some(false), + scheduler: None, + edit_images: None, + source_image: inputs.source_image.clone(), + strength: inputs.strength, + mask_image: None, + control_image: None, + control_model: None, + control_scale: 1.0, + expand: None, + original_prompt: None, + lora: None, + frames: Some(video.frames), + fps: Some(video.fps), + upscale_model: None, + gif_preview: false, + enable_audio: None, + audio_file: None, + source_video: None, + keyframes: None, + pipeline: None, + loras: None, + retake_range: None, + spatial_upscale: None, + temporal_upscale: None, + placement: inputs.placement.clone(), + } +} + +/// Stacked progress bars for chain render: a parent "Chain" bar covering +/// all pixel frames and a transient per-stage bar covering denoise steps. +async fn render_chain_progress(mut rx: tokio::sync::mpsc::UnboundedReceiver) { + // Always draw to stderr so image bytes piped to stdout stay clean. 
+ let mp = MultiProgress::with_draw_target(ProgressDrawTarget::stderr()); + + let parent = mp.add(ProgressBar::new(0)); + parent.set_style( + ProgressStyle::default_bar() + .template(&format!( + "{{prefix:.{c}}} [{{bar:30.{c}/dim}}] {{pos}}/{{len}} frames {{msg}}", + c = theme::SPINNER_STYLE, + )) + .unwrap() + .progress_chars("━╸─"), + ); + parent.set_prefix("Chain"); + parent.enable_steady_tick(Duration::from_millis(100)); + + let mut stage_bar: Option = None; + let mut stage_count: u32 = 0; + + while let Some(event) = rx.recv().await { + match event { + ChainProgressEvent::ChainStart { + stage_count: sc, + estimated_total_frames, + } => { + stage_count = sc; + parent.set_length(estimated_total_frames as u64); + parent.set_message(format!("(stages {sc})")); + } + ChainProgressEvent::StageStart { stage_idx } => { + if let Some(old) = stage_bar.take() { + old.finish_and_clear(); + } + parent.set_message(format!("stage {}/{}", stage_idx + 1, stage_count)); + let sb = mp.add(ProgressBar::new(0)); + sb.set_style( + ProgressStyle::default_bar() + .template(&format!( + " Stage {{prefix}} [{{bar:30.{c}/dim}}] {{pos}}/{{len}} steps", + c = theme::SPINNER_STYLE, + )) + .unwrap() + .progress_chars("━╸─"), + ); + sb.set_prefix(format!("{}", stage_idx + 1)); + sb.enable_steady_tick(Duration::from_millis(100)); + stage_bar = Some(sb); + } + ChainProgressEvent::DenoiseStep { + stage_idx: _, + step, + total, + } => { + if let Some(ref sb) = stage_bar { + if sb.length().unwrap_or(0) == 0 { + sb.set_length(total as u64); + } + sb.set_position(step as u64); + } + } + ChainProgressEvent::StageDone { + stage_idx: _, + frames_emitted, + } => { + if let Some(sb) = stage_bar.take() { + sb.finish_and_clear(); + } + parent.inc(frames_emitted as u64); + } + ChainProgressEvent::Stitching { total_frames } => { + if let Some(sb) = stage_bar.take() { + sb.finish_and_clear(); + } + parent.set_message(format!("stitching {total_frames} frames…")); + } + } + } + + if let Some(sb) = 
stage_bar.take() { + sb.finish_and_clear(); + } + parent.finish_and_clear(); +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn routing_single_clip_under_cap() { + let d = decide_chain_routing(Some(97), Some("ltx2"), "ltx-2-19b-distilled:fp8", None, 4); + assert_eq!(d, ChainRoutingDecision::SingleClip); + } + + #[test] + fn routing_single_clip_when_frames_absent() { + let d = decide_chain_routing(None, Some("ltx2"), "ltx-2-19b-distilled:fp8", None, 4); + assert_eq!(d, ChainRoutingDecision::SingleClip); + } + + #[test] + fn routing_chain_over_cap_ltx2_distilled() { + let d = decide_chain_routing(Some(200), Some("ltx2"), "ltx-2-19b-distilled:fp8", None, 4); + assert_eq!( + d, + ChainRoutingDecision::Chain { + clip_frames: 97, + motion_tail: 4, + }, + ); + } + + #[test] + fn routing_rejects_non_distilled_over_cap() { + let d = decide_chain_routing(Some(200), Some("flux"), "flux-dev:q4", None, 4); + match d { + ChainRoutingDecision::Rejected { reason } => { + assert!( + reason.contains("does not support chained video"), + "unexpected reason: {reason}" + ); + } + other => panic!("expected Rejected, got {other:?}"), + } + } + + #[test] + fn routing_rejects_non_ltx2_family_over_cap() { + // ltx-video (not ltx2) is not chainable in v1. + let d = decide_chain_routing(Some(200), Some("ltx-video"), "ltx-video:0.9.6", None, 4); + assert!(matches!(d, ChainRoutingDecision::Rejected { .. 
})); + } + + #[test] + fn routing_clip_frames_above_cap_clamps_to_cap() { + let d = decide_chain_routing( + Some(300), + Some("ltx2"), + "ltx-2-19b-distilled:fp8", + Some(200), + 4, + ); + assert_eq!( + d, + ChainRoutingDecision::Chain { + clip_frames: 97, + motion_tail: 4, + }, + ); + } + + #[test] + fn routing_clip_frames_under_cap_respected() { + let d = decide_chain_routing( + Some(300), + Some("ltx2"), + "ltx-2-19b-distilled:fp8", + Some(65), + 4, + ); + assert_eq!( + d, + ChainRoutingDecision::Chain { + clip_frames: 65, + motion_tail: 4, + }, + ); + } + + #[test] + fn routing_motion_tail_ge_clip_frames_rejects() { + let d = decide_chain_routing( + Some(300), + Some("ltx2"), + "ltx-2-19b-distilled:fp8", + Some(49), + 49, + ); + match d { + ChainRoutingDecision::Rejected { reason } => { + assert!( + reason.contains("--motion-tail"), + "unexpected reason: {reason}" + ); + } + other => panic!("expected Rejected, got {other:?}"), + } + } + + #[test] + fn routing_motion_tail_at_clip_frames_rejects() { + let d = decide_chain_routing(Some(200), Some("ltx2"), "ltx-2-19b-distilled:fp8", None, 97); + assert!(matches!(d, ChainRoutingDecision::Rejected { .. })); + } + + #[test] + fn ltx2_distilled_cap_matches_engine_constraint() { + // 97 = 8 * 12 + 1, satisfying the VAE 8k+1 constraint. + assert_eq!(LTX2_DISTILLED_CLIP_CAP % 8, 1); + } +} diff --git a/crates/mold-cli/src/commands/generate.rs b/crates/mold-cli/src/commands/generate.rs index 028fa089..cded5bb1 100644 --- a/crates/mold-cli/src/commands/generate.rs +++ b/crates/mold-cli/src/commands/generate.rs @@ -157,6 +157,11 @@ fn apply_local_engine_env_overrides( pub struct Ltx2Options { pub frames: Option, pub fps: Option, + /// Per-clip cap for chained rendering. `None` = use the model-family default + /// (currently 97 for LTX-2 distilled). Only read when `frames > cap`. + pub clip_frames: Option, + /// Motion-tail overlap between chained clips (pixel frames). 
+ pub motion_tail: u32, pub enable_audio: Option, pub audio_file: Option>, pub source_video: Option>, @@ -210,6 +215,8 @@ pub async fn run( let Ltx2Options { frames, fps, + clip_frames, + motion_tail, enable_audio, audio_file, source_video, @@ -243,6 +250,117 @@ pub async fn run( } else { format }; + + // ── Chain routing ───────────────────────────────────────────────────── + // When --frames exceeds the per-clip cap, auto-build a ChainRequest and + // delegate to the chain helper. Only LTX-2 distilled is chainable in v1; + // other video families error fast rather than silently over-producing. + { + use super::chain::{decide_chain_routing, warn_if_clamped, ChainRoutingDecision}; + let decision = decide_chain_routing( + effective_frames, + family.as_deref(), + model, + clip_frames, + motion_tail, + ); + match decision { + ChainRoutingDecision::SingleClip => { + // Fall through to the existing single-clip path below. + } + ChainRoutingDecision::Rejected { reason } => { + anyhow::bail!(reason); + } + ChainRoutingDecision::Chain { + clip_frames: cf, + motion_tail: mt, + } => { + warn_if_clamped(clip_frames, super::chain::LTX2_DISTILLED_CLIP_CAP); + let (eff_w, eff_h) = effective_dimensions( + &config, + &model_cfg, + family.as_deref(), + width, + height, + source_image.as_deref(), + edit_images.as_deref(), + )?; + let eff_steps = steps.unwrap_or_else(|| model_cfg.effective_steps(&config)); + let eff_guidance = guidance.unwrap_or_else(|| model_cfg.effective_guidance()); + let eff_fps = effective_fps.unwrap_or(24); + let total_frames = effective_frames + .expect("decide_chain_routing only returns Chain when frames is Some"); + + // Chain path doesn't use batch/edit_images/mask/control/loras — + // those are single-clip concepts. If the user set them, warn and + // continue (we don't hard-error to keep the UX lenient). 
+ if batch > 1 { + status!( + "{} --batch has no effect in chain mode; rendering a single stitched video", + theme::icon_warn(), + ); + } + + let inputs = super::chain::ChainInputs { + prompt: prompt.to_string(), + model: model.to_string(), + width: eff_w, + height: eff_h, + steps: eff_steps, + guidance: eff_guidance, + strength, + seed, + fps: eff_fps, + output_format, + total_frames, + clip_frames: cf, + motion_tail: mt, + source_image: source_image.clone(), + placement: placement.clone(), + }; + // Consume otherwise-unused LTX-2 knobs that chain v1 ignores so + // clippy doesn't fire `unused_variables` on the early return. + let _ = ( + &audio_file, + &source_video, + &keyframes, + &pipeline, + &loras, + &retake_range, + &spatial_upscale, + &temporal_upscale, + &enable_audio, + &mask_image, + &control_image, + &control_model, + control_scale, + &negative_prompt, + &original_prompt, + &batch_prompts, + &lora, + &scheduler, + expand, + ); + return super::chain::run_chain( + inputs, + host, + output, + no_metadata, + preview, + local, + gpus, + t5_variant, + qwen3_variant, + qwen2_variant, + qwen2_text_encoder_mode, + eager, + offload, + ) + .await; + } + } + } + let piped = is_piped(); // Reject batch > 1 when output goes to stdout (piped with no --output, or --output -) diff --git a/crates/mold-cli/src/commands/mod.rs b/crates/mold-cli/src/commands/mod.rs index ec4bbb27..82a19d16 100644 --- a/crates/mold-cli/src/commands/mod.rs +++ b/crates/mold-cli/src/commands/mod.rs @@ -1,3 +1,4 @@ +pub mod chain; pub mod clean; pub(crate) mod cleanup; pub mod config; diff --git a/crates/mold-cli/src/commands/run.rs b/crates/mold-cli/src/commands/run.rs index ba18d479..85269678 100644 --- a/crates/mold-cli/src/commands/run.rs +++ b/crates/mold-cli/src/commands/run.rs @@ -436,6 +436,8 @@ pub async fn run( batch: u32, frames: Option, fps: Option, + clip_frames: Option, + motion_tail: u32, audio: bool, no_audio: bool, audio_file: Option, @@ -825,6 +827,8 @@ pub async fn run( 
generate::Ltx2Options { frames, fps, + clip_frames, + motion_tail, enable_audio: if audio { Some(true) } else if no_audio { diff --git a/crates/mold-cli/src/main.rs b/crates/mold-cli/src/main.rs index 03293886..f7e616d2 100644 --- a/crates/mold-cli/src/main.rs +++ b/crates/mold-cli/src/main.rs @@ -377,6 +377,9 @@ Examples: /// Number of video frames to generate (video models only, e.g. ltx-video). /// Implies video output mode; output defaults to .gif format. + /// + /// For LTX-2 distilled, values above 97 automatically chain multiple + /// clips at render time (see `--clip-frames` / `--motion-tail`). #[arg(long, help_heading = "Video")] frames: Option, @@ -385,6 +388,19 @@ Examples: #[arg(long, help_heading = "Video")] fps: Option, + /// Per-clip frame cap for chained video. When --frames exceeds this, + /// the CLI splits into multiple chained clips stitched at render time. + /// Defaults to the model's native cap (97 for LTX-2 distilled). + #[arg(long, value_name = "N", help_heading = "Video")] + clip_frames: Option, + + /// Motion-tail overlap between chained clips in pixel frames. Each clip + /// after the first reuses this many trailing latents from the prior + /// clip, trimming the duplicated pixel frames at stitch time. 0 disables + /// latent carryover (simple concat). Default 4. + #[arg(long, value_name = "N", default_value_t = 4, help_heading = "Video")] + motion_tail: u32, + /// Enable synchronized audio for LTX-2 / LTX-2.3 generation. 
#[arg(long, help_heading = "Video", conflicts_with = "no_audio")] audio: bool, @@ -1147,6 +1163,8 @@ async fn run() -> anyhow::Result<()> { batch, frames, fps, + clip_frames, + motion_tail, audio, no_audio, audio_file, @@ -1205,6 +1223,8 @@ async fn run() -> anyhow::Result<()> { batch, frames, fps, + clip_frames, + motion_tail, audio, no_audio, audio_file, @@ -2131,6 +2151,50 @@ mod tests { } } + #[test] + fn run_chain_flags_parse() { + let cli = parse(&[ + "run", + "ltx-2-19b-distilled:fp8", + "a cat", + "--frames", + "200", + "--clip-frames", + "97", + "--motion-tail", + "4", + ]); + match cli.command { + Commands::Run { + frames, + clip_frames, + motion_tail, + .. + } => { + assert_eq!(frames, Some(200)); + assert_eq!(clip_frames, Some(97)); + assert_eq!(motion_tail, 4); + } + _ => panic!("expected Run"), + } + } + + #[test] + fn run_motion_tail_defaults_to_four() { + let cli = parse(&["run", "ltx-2-19b-distilled:fp8", "a cat", "--frames", "200"]); + match cli.command { + Commands::Run { + motion_tail, + clip_frames, + .. 
+ } => { + assert_eq!(motion_tail, 4, "default motion tail must be 4 frames"); + assert_eq!(clip_frames, None); + } + _ => panic!("expected Run"), + } + } + // --- Regression test for issue #190: --version includes git SHA --- #[test] From 62111820d246a9d0cdd44bf763fc7f2f08c3f1e2 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 19:02:50 -0700 Subject: [PATCH 11/31] docs(chain): ltx2 guide, api endpoint, changelog, and skill updates Document render-chain v1 across the four surfaces: a new "Chained video output" section in website/models/ltx2.md explaining the per-clip cap, motion-tail carryover, and the --frames / --clip-frames / --motion-tail CLI contract; request/response/SSE schemas for the new POST /api/generate/chain[/stream] endpoints in website/api/index.md; an Unreleased/Added bullet in CHANGELOG.md covering the feature end-to-end; and the new flags + endpoint in .claude/skills/mold/SKILL.md so OpenClaw and the other AI agents surface chained video correctly. Part of render-chain v1 (Phase 4). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .claude/skills/mold/SKILL.md | 18 +++- CHANGELOG.md | 1 + website/api/index.md | 157 +++++++++++++++++++++++++++++++++++ website/models/ltx2.md | 81 ++++++++++++++++++ 4 files changed, 256 insertions(+), 1 deletion(-) diff --git a/.claude/skills/mold/SKILL.md b/.claude/skills/mold/SKILL.md index 4c8ddeee..97f28b29 100644 --- a/.claude/skills/mold/SKILL.md +++ b/.claude/skills/mold/SKILL.md @@ -178,7 +178,9 @@ mold run ltx-2-19b-distilled:fp8 "lantern-lit cave entrance" --camera-control do **Models:** `ltx-2-19b-dev:fp8`, `ltx-2-19b-distilled:fp8`, `ltx-2.3-22b-dev:fp8`, `ltx-2.3-22b-distilled:fp8` -**Important flags:** `--audio`, `--no-audio`, `--audio-file`, `--video`, repeatable `--keyframe`, repeatable `--lora`, `--pipeline`, `--retake`, `--camera-control`, `--spatial-upscale`, `--temporal-upscale` +**Important flags:** `--audio`, `--no-audio`, `--audio-file`, `--video`, repeatable `--keyframe`, repeatable `--lora`, `--pipeline`, `--retake`, `--camera-control`, `--spatial-upscale`, `--temporal-upscale`, `--clip-frames`, `--motion-tail` + +**Chained (arbitrary-length) video output:** for LTX-2 19B and 22B distilled models, `--frames` above the 97-frame per-clip cap automatically renders multiple clips with a motion-tail of latents carried across each clip boundary, then stitches them into a single MP4. The CLI picks this path transparently — `mold run ltx-2-19b-distilled:fp8 "a cat walking" --frames 400` produces one 400-frame MP4 from 5 chained stages. Advanced callers can override the per-clip length via `--clip-frames N` (must be `8k+1`, clamped to the model cap) and the overlap via `--motion-tail N` (default 4 pixel frames, 0 disables carryover). Chains fail closed on mid-stage failure (no partial output) and run on a single GPU. Other model families reject `--frames > 97` with an actionable error. 
**Current constraints:** `x2` spatial upscaling is wired across the family, `x1.5` spatial upscaling is wired for `ltx-2.3-*`, and `x2` temporal upscaling is wired in the native runtime. Camera-control preset aliases currently auto-resolve the published LTX-2 19B LoRAs only. The family runs through the native Rust stack in `mold-inference`, with CUDA as the supported backend for real local generation, CPU as a correctness-only fallback, and Metal unsupported. On 24 GB Ada GPUs such as the RTX 4090, the validated path stays on the compatible `fp8-cast` mode rather than Hopper-only `fp8-scaled-mm`. The native CUDA matrix is validated across 19B/22B text+audio-video, image-to-video, audio-to-video, keyframe, retake, public IC-LoRA, spatial upscale (`x1.5` / `x2` where published), and temporal upscale (`x2`). When requests go through `mold serve`, the built-in body limit is `64 MiB`, which is enough for common inline source-video and source-audio workflows. @@ -535,6 +537,20 @@ MOLD_HOST=http://gpu-host:7680 mold run "a cat" MOLD_OUTPUT_DIR=/srv/mold/output mold serve ``` +### HTTP API Endpoints + +Core endpoints exposed by `mold serve` (full list + schemas at `/api/docs`): + +- `POST /api/generate` — image/video generation, raw bytes response +- `POST /api/generate/stream` — SSE progress + base64 complete event +- `POST /api/generate/chain` — chained arbitrary-length video (LTX-2 distilled); body is `mold_core::chain::ChainRequest` (canonical `stages[]` or auto-expand `prompt`+`total_frames`+`clip_frames`) +- `POST /api/generate/chain/stream` — same as above, SSE progress with per-stage `denoise_step` events +- `POST /api/expand` — LLM prompt expansion +- `GET /api/models` · `POST /api/models/load` · `POST /api/models/pull` · `DELETE /api/models/unload` +- `GET /api/gallery` · `GET /api/gallery/image/:name` · `GET /api/gallery/thumbnail/:name` · `DELETE /api/gallery/image/:name` +- `POST /api/upscale` · `POST /api/upscale/stream` +- `GET /api/status` · `GET /health` · 
`GET /api/capabilities` + ### Prometheus Metrics When built with the `metrics` feature flag (included in Docker images and Nix builds), the server exposes a `GET /metrics` endpoint in Prometheus text exposition format. This endpoint is excluded from auth and rate limiting for monitoring scrapers. diff --git a/CHANGELOG.md b/CHANGELOG.md index af96fa71..1867930b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- **Render chain for arbitrary-length LTX-2 distilled video.** `mold run ltx-2-19b-distilled:fp8 "a cat walking" --frames 400` now produces a single stitched MP4 by splitting the request into multiple per-clip renders and carrying a motion-tail of latents across each clip boundary so the continuation stays coherent without a VAE encode/decode round-trip. New server endpoints `POST /api/generate/chain` and `POST /api/generate/chain/stream` (SSE) accept either a canonical `stages[]` body or an auto-expand form (`prompt` + `total_frames` + `clip_frames`) — the wire format is stages-based from day one so the v2 movie-maker UI can author per-stage prompts/keyframes without a breaking change. Request/response/event types live in `crates/mold-core/src/chain.rs` (`ChainRequest`, `ChainResponse`, `ChainProgressEvent`, `SseChainCompleteEvent`); the LTX-2 orchestrator is in `crates/mold-inference/src/ltx2/chain.rs` (`Ltx2ChainOrchestrator`, `ChainTail`); the server routes in `crates/mold-server/src/routes_chain.rs`; and the CLI side in `crates/mold-cli/src/commands/chain.rs`. `mold run` auto-routes to the chain endpoint when `--frames` exceeds the model's per-clip cap (97 for LTX-2 19B/22B distilled); non-distilled families fail fast with an actionable error instead of silently over-producing. New flags `--clip-frames N` (default = model cap) and `--motion-tail N` (default 4, 0 disables carryover) let advanced callers tune the split. 
The orchestrator derives per-stage seeds as `base_seed ^ ((stage_idx as u64) << 32)` so the whole chain reproduces from a single seed without identical-noise artefacts when every stage shares a prompt. Over-production at the final clip is trimmed from the tail (the head carries the user-anchored starting image and is perceptually load-bearing); mid-chain failures fail closed with HTTP 502 and no partial stitch is ever written to the gallery. Chains run on a single GPU — the chain handler bypasses the single-job queue and holds the `ModelCache` lock for the full chain duration (a multi-minute compound operation would otherwise stall the FIFO queue). Both the remote SSE path and the `--local` in-process path funnel through the same orchestrator via `Ltx2Engine::as_chain_renderer`, and `mold run` renders stacked `indicatif` progress bars (parent "Chain" frame counter + per-stage denoise-step bar). v1 is LTX-2 distilled only, single-GPU, and single-prompt; per-stage prompts, keyframes, selective regen, and multi-GPU stage fan-out are v2 movie-maker work. - **In-browser model downloads with queued progress, ETA, cancel, and retry** ([#255](https://github.com/utensils/mold/pull/255)). `ModelPicker.vue` now shows `(X GB)` next to every model — click an undownloaded one to enqueue a pull without leaving the generate flow. A new `DownloadsDrawer` (opened from a TopBar button with an active/queued count badge) shows per-file progress, client-computed ETA (10 s sliding window), and cancel/retry controls. Undownloaded models in the picker switch to inline progress or a "Queued (#N)" chip while their job is alive, and the picker auto-refreshes on `JobDone` so the model becomes selectable without a page reload. 
Server-side: a new single-writer `DownloadQueue` in `AppState` drives the existing `mold_core::download::pull_model_with_callback` one model at a time (files sequential inside a set — HF's CDN is bandwidth-bound, so file-level parallelism would only trip rate limits), with one auto-retry on transient failure. Cancellation aborts the in-flight pull, cleans up partials under `MOLD_MODELS_DIR//` while preserving any `.sha256-verified` markers, and leaves the HF blob cache intact so resume is cheap. The same cleanup runs on terminal failures, not just cancel. New routes: `POST /api/downloads` (idempotent — returns the existing job id on a second enqueue), `DELETE /api/downloads/:id`, `GET /api/downloads` (active + queued + last 20 history), `GET /api/downloads/stream` (SSE multiplex of `DownloadEvent` frames — `Enqueued`, `Started`, `Progress`, `FileDone`, `JobDone`, `JobFailed`, `JobCancelled`). Existing `POST /api/models/pull` becomes a thin compat shim that enqueues via the queue and re-emits the legacy SSE event shape, so the TUI keeps working unchanged. - **Always-visible VRAM + system RAM telemetry on `/generate`** ([#254](https://github.com/utensils/mold/pull/254)). A new `ResourceStrip.vue` docks at the bottom of the Composer sidebar on desktop (and collapses to a `🧠 used · total` chip in the TopBar on narrow viewports), showing one stacked-bar row per discovered GPU plus one for system RAM. Each row breaks usage into `mold` / `other` / `free` on CUDA hosts with per-process attribution (NVML feature-gated as `mold-ai-server` `--features nvml`, `nvidia-smi` subprocess fallback on by default) — on Metal the per-process fields are intentionally `None` and the SPA hides those breakdowns, since macOS doesn't expose per-process GPU attribution without private entitlements. 
Aggregated once per second on the server into a `ResourceSnapshot { hostname, gpus, system_ram }`, exposed as `GET /api/resources` (one-shot; `503` before the first aggregator tick) and `GET /api/resources/stream` (SSE broadcast with 15 s keepalive and the cached snapshot prepended as the first frame so new subscribers don't wait a full second). The aggregator handle is bound to `axum::serve`'s shutdown path so it's aborted on graceful exit. The strip's `useResources` composable is a provide/inject singleton mounted in `App.vue`, and it exposes a `gpuList: ComputedRef` that the new device-placement UI consumes directly. - **Per-component device placement for FLUX, Flux.2, Z-Image, and Qwen-Image** ([#256](https://github.com/utensils/mold/pull/256)). A new `PlacementPanel` disclosure inside the Composer lets users override which device each part of the pipeline runs on. Tier 1 is a single "Text encoders: Auto / CPU / GPU N" dropdown that applies to every model family (SD1.5, SDXL, SD3.5, Wuerstchen, LTX-Video, LTX-2 in addition to the Tier 2 four) — picking CPU reliably moves the text encoder off-GPU so a large transformer can stay on-device without triggering block-level offload. Tier 2 adds per-component selects (transformer, VAE, and family-appropriate encoder slots) for FLUX, Flux.2, Z-Image, and Qwen-Image, where the plumbing is cheapest and the value is clearest. SD3.5 was marked stretch in the design and cut cleanly — the UI correctly hides Advanced for SD3.5 with a tooltip so no user sees an override that silently no-ops. A new `DevicePlacement` serde type (`DeviceRef = Auto | Cpu | Gpu(ordinal)` plus an optional `AdvancedPlacement` sub-struct for per-component overrides) rides as an optional field on `GenerateRequest`; `None` preserves the existing VRAM-aware auto-placement end-to-end. 
A shared `resolve_device()` helper in `mold_inference::device` (and a companion `effective_device_ref()` shared by the four Tier-2 engines) maps each `DeviceRef` variant to a `candle_core::Device`, returning a clean `anyhow::Error` for bad ordinals instead of panicking. Defaults are saved per-model in `[models."name:tag".placement]` (with `MOLD_PLACE_TEXT_ENCODERS`, `MOLD_PLACE_TRANSFORMER`, `MOLD_PLACE_VAE`, `MOLD_PLACE_CLIP_L`, `MOLD_PLACE_CLIP_G`, `MOLD_PLACE_T5`, `MOLD_PLACE_QWEN` env overrides) via a new `PUT /api/config/model/:name/placement` route (with `DELETE` to clear); the route now returns a real `500` when `Config::save()` fails instead of silently lying to the client. The placement UI reads its GPU list from `useResources().gpuList`, so spinning up a mold server on a dual-3090 box auto-populates "GPU 0 · RTX 3090" / "GPU 1 · RTX 3090" in every dropdown without any extra discovery wiring. `mold run` gains matching CLI flags — `--device-text-encoders`, `--device-transformer`, `--device-vae`, `--device-t5`, `--device-clip-l`, `--device-clip-g`, `--device-qwen` — which override env vars and config; flag parse errors surface with the specific flag name so `--device-vae banana` reports `--device-vae: invalid device 'banana' (expected auto|cpu|gpu[:N])` instead of a generic failure. Documented in `website/guide/configuration.md` (new "Per-component device placement" section) and `website/guide/performance.md` (the "CPU text encoders" subsection now points at the CLI flags for deliberate VRAM tuning). diff --git a/website/api/index.md b/website/api/index.md index ff04172f..9c5c7c5a 100644 --- a/website/api/index.md +++ b/website/api/index.md @@ -8,6 +8,8 @@ When running `mold serve`, you get a REST API for remote image generation. 
| -------- | ------------------------------ | ------------------------------------ | | `POST` | `/api/generate` | Generate images from prompt | | `POST` | `/api/generate/stream` | Generate with SSE progress streaming | +| `POST` | `/api/generate/chain` | Chained video generation (LTX-2) | +| `POST` | `/api/generate/chain/stream` | Chained video with SSE progress | | `POST` | `/api/expand` | Expand a prompt using LLM | | `GET` | `/api/models` | List available models | | `POST` | `/api/models/load` | Load/swap the active model | @@ -223,6 +225,161 @@ server internally. RunPod's proxy has a 100-second timeout. Use the SSE streaming endpoint for long generations to keep the connection alive. ::: +## `/api/generate/chain` + +Chained video generation for LTX-2 distilled models. Splits a long video into +N per-clip renders, threads a motion-tail of latents across each clip +boundary, and returns a single stitched MP4. See the +[LTX-2 chained video output guide](/models/ltx2#chained-video-output) for the +user-facing story; this section documents the wire format. + +The request body maps to `mold_core::chain::ChainRequest`; the response body +maps to `mold_core::chain::ChainResponse`. The canonical schema lives in the +interactive docs at `/api/docs` (served by the running mold server) and in the +OpenAPI JSON at `/api/openapi.json`. + +The server accepts either a pre-authored `stages[]` body or the auto-expand +form (single `prompt` + `total_frames` + `clip_frames`). Auto-expand is the +shape `mold run` sends; the canonical `stages[]` shape is reserved for the +forthcoming movie-maker UI that will author per-stage prompts/keyframes. Both +normalise to the same internal `Vec` before any engine work kicks +off. 
+ +**Auto-expand body** (what `mold run --frames N` emits): + +```json +{ + "model": "ltx-2-19b-distilled:fp8", + "prompt": "a cat walking through autumn leaves", + "total_frames": 400, + "clip_frames": 97, + "source_image": "", + "motion_tail_frames": 4, + "width": 1216, + "height": 704, + "fps": 24, + "seed": 42, + "steps": 8, + "guidance": 3.0, + "strength": 1.0, + "output_format": "mp4" +} +``` + +**Canonical body** (what the v2 movie-maker UI will author): + +```json +{ + "model": "ltx-2-19b-distilled:fp8", + "stages": [ + { "prompt": "a cat walking", "frames": 97, "source_image": "" }, + { "prompt": "a cat walking", "frames": 97 }, + { "prompt": "a cat walking", "frames": 97 }, + { "prompt": "a cat walking", "frames": 97 } + ], + "motion_tail_frames": 4, + "width": 1216, + "height": 704, + "fps": 24, + "seed": 42, + "steps": 8, + "guidance": 3.0, + "strength": 1.0, + "output_format": "mp4" +} +``` + +**Response:** + +```json +{ + "video": { + "data": "", + "format": "mp4", + "width": 1216, + "height": 704, + "frames": 400, + "fps": 24, + "thumbnail": "", + "gif_preview": "", + "has_audio": false, + "duration_ms": 16666 + }, + "stage_count": 5, + "gpu": 0 +} +``` + +**Error cases:** + +- `422 Unprocessable Entity` — validation failure (missing `prompt` + + `total_frames` in the auto-expand form, a stage with non-`8k+1` `frames`, + `motion_tail_frames >= clip_frames`, more than 16 stages, etc.). +- `422 Unprocessable Entity` — unsupported model family. Only LTX-2 distilled + engines expose a chain renderer; other families are rejected with an + error that names the constraint. +- `502 Bad Gateway` — a stage errored mid-chain. The whole chain is discarded + and nothing is written to the gallery; v1 is fail-closed and partial + resume is a v2 feature. + +::: tip Queue behaviour +The chain handler deliberately **bypasses the single-job queue**. 
A chain is a +multi-minute compound operation that would stall the FIFO queue for every +other request, so the handler takes the engine out of `ModelCache` for the +full chain duration and restores it on completion (or error). Chains +therefore run one-at-a-time on a given GPU; submit chains to separate GPUs +via `MOLD_GPUS` / `--gpus` if you need parallelism. +::: + +## `/api/generate/chain/stream` + +Same request body as `/api/generate/chain`, with the response delivered as +Server-Sent Events. Progress frames stream as `event: progress` and the +terminal frame is either `event: complete` (success) or `event: error` +(failure; the connection closes after the error frame). + +Progress event payloads map to `mold_core::chain::ChainProgressEvent` variants: + +```text +event: progress +data: {"type":"chain_start","stage_count":5,"estimated_total_frames":485} + +event: progress +data: {"type":"stage_start","stage_idx":0} + +event: progress +data: {"type":"denoise_step","stage_idx":0,"step":1,"total":8} + +event: progress +data: {"type":"stage_done","stage_idx":0,"frames_emitted":97} + +event: progress +data: {"type":"stitching","total_frames":385} + +event: complete +data: {"video":"","format":"mp4","width":1216,"height":704,"frames":400,"fps":24,"thumbnail":"","gif_preview":"","has_audio":false,"duration_ms":16666,"stage_count":5,"gpu":0,"generation_time_ms":226812} +``` + +The `complete` event payload maps to `mold_core::chain::SseChainCompleteEvent`. +Non-denoise engine events (weight loads, cache hits, etc.) are intentionally +not forwarded in v1 — the UX goal is per-stage progress, not per-component +telemetry. 
+ +```bash +curl -N -X POST http://localhost:7680/api/generate/chain/stream \ + -H "Content-Type: application/json" \ + -d '{ + "model": "ltx-2-19b-distilled:fp8", + "prompt": "a cat walking through autumn leaves", + "total_frames": 400, + "clip_frames": 97, + "motion_tail_frames": 4, + "width": 1216, "height": 704, "fps": 24, + "steps": 8, "guidance": 3.0, + "output_format": "mp4" + }' +``` + ## `/api/status` Example response: diff --git a/website/models/ltx2.md b/website/models/ltx2.md index 54d0b1fa..f1561b30 100644 --- a/website/models/ltx2.md +++ b/website/models/ltx2.md @@ -96,6 +96,87 @@ mold run ltx-2.3-22b-distilled:fp8 \ --format mp4 ``` +## Chained video output + +The LTX-2 distilled pipeline maxes out at 97 pixel frames per clip (13 latent +frames after the VAE's 8× temporal compression — `8 × 12 + 1 = 97` satisfies the +`8k+1` frame-grid constraint). For anything longer, mold renders a _chain_: the +request is split into N sub-clips, each generated back-to-back, and stitched +into a single MP4 at the end. mold keeps the last few frames of clip _N_'s +final latents in memory and threads them directly into clip _N+1_'s +conditioning, skipping a VAE encode/decode round-trip so the continuation +stays visually coherent. + +`mold run` routes automatically: when `--frames` is `≤ 97` you stay on the +single-clip path; above 97 the request is rewritten into a chain and dispatched +to the new `/api/generate/chain/stream` endpoint. Chaining is supported for +LTX-2 19B and 22B distilled today. Other model families reject +`--frames > 97` with an actionable error rather than silently over-producing. 
+ +```console +$ mold run ltx-2-19b-distilled:fp8 "a cat walking through autumn leaves" \ + --image cat.png --frames 400 + +→ Chain mode: 400 frames → 5 stages × 97 frames (tail 4) +Chain [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 385/385 frames (stages 5) + Stage 1 [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 8/8 steps + Stage 2 [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 8/8 steps + Stage 3 [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 8/8 steps + Stage 4 [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 8/8 steps + Stage 5 [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 8/8 steps +✓ Saved: mold-ltx-2-19b-distilled-.mp4 (400 frames, 1216x704, 24 fps) +✓ Done — ltx-2-19b-distilled:fp8 in 226.8s (400 frames, seed: 42) +``` + +### Motion-tail carryover + +`--motion-tail N` (default 4) controls how many trailing pixel frames of each +clip are reused as latent-space conditioning for the next. Instead of decoding +the prior clip's last frame back to RGB and re-encoding it through the VAE as +a new `source_image`, mold narrows the final denoise tensor along its time +axis and patchifies those latent tokens directly into the next stage's +`StageVideoConditioning` — so the handoff never leaves latent space. At stitch +time, every stage after the first drops its leading `N` output frames because +those are the overlap region shared with the prior clip. + +- `--motion-tail 0` — hard concatenation, no overlap. Visible seams are common + at clip boundaries; useful when you _want_ discrete shots. +- `--motion-tail 4` — the default. One latent frame of carryover at `fps=24` + gives the transformer enough temporal context to continue motion, object + identity, and lighting across the seam without wasting new frames. +- Higher values buy more seam-smoothing at the cost of fewer fresh pixel + frames per clip. Must stay strictly below `--clip-frames`. 
+ +### Flags + +| Flag | Default | Description | +| ----------------- | ---------------- | ------------------------------------------------------------------------------------ | +| `--frames N` | model default | Total stitched length. Above the per-clip cap (97 for LTX-2 distilled), auto-chains. | +| `--clip-frames N` | model cap (`97`) | Per-clip length. Must be `8k+1`; values above the cap are clamped with a warning. | +| `--motion-tail N` | `4` | Pixel-frame overlap between clips. `0` disables carryover. | + +When the final clip over-produces (stage math rarely lands exactly on +`total_frames`), mold trims from the tail so the user-anchored starting image +at the head stays intact. + +### v1 constraints + +- **LTX-2 19B and 22B distilled only.** Other LTX-2 / LTX-Video variants and + every image-family model reject `--frames` above their single-clip budget. +- **Single GPU per chain.** Every stage runs on the GPU the engine was loaded + onto — multi-GPU stage fan-out is a v2 movie-maker feature. +- **Fail-closed.** If any stage errors, the whole chain returns `502` and + nothing is written to the gallery. There is no partial-resume in v1. +- **Single prompt per chain from the CLI.** The server already accepts + per-stage prompts (see [`POST /api/generate/chain`](/api/#api-generate-chain)), + but `mold run` replicates one prompt across every stage for now. + +The rest of the LTX-2 surface — `--image`, `--audio-file`, `--lora`, +`--camera-control`, `--spatial-upscale`, `--temporal-upscale`, and so on — +applies to chain renders the same way it applies to single-clip renders. An +`--image` supplied on the CLI lands on `stages[0]` and is carried forward by +the motion-tail latents from there. + ## Example Clips Here are a few longer LTX-2 examples rendered with mold. 
The docs page embeds From 766322ebbc84fbcdfce92867e582a27673a167a9 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 21:43:17 -0700 Subject: [PATCH 12/31] fix(cli): pass owned String to create_engine in local chain path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The `cuda`/`metal` feature-gated local orchestrator branch in `run_chain_local` passed `&model_name` to `mold_inference::create_engine`, which takes `model_name: String`. Phase 3's verification only ran `cargo check --features preview,discord,expand,tui,webp,mp4` — the feature-matrix omitted `cuda`/`metal`, so CI and the local-default check both missed the mismatch. Caught at rebuild time on killswitch (sm_86 / RTX 3090 dual-GPU build). `cargo check -p mold-ai --features metal,expand` now clean locally. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-cli/src/commands/chain.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/mold-cli/src/commands/chain.rs b/crates/mold-cli/src/commands/chain.rs index a0988b72..4a8796bf 100644 --- a/crates/mold-cli/src/commands/chain.rs +++ b/crates/mold-cli/src/commands/chain.rs @@ -354,7 +354,7 @@ async fn run_chain_local( .unwrap_or(0); let mut engine = mold_inference::create_engine( - &model_name, + model_name, paths, &effective_config, load_strategy, From 41e85f7a2112bc2e2d86960e5ae37bfc823e70e1 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 22:04:21 -0700 Subject: [PATCH 13/31] fix(sd3): truncate CLIP token sequences to 77 with EOS preserved `ClipWithTokenizer::encode_text_to_embedding` padded up to `max_position_embeddings` (77) but never truncated down. Prompts that tokenised to more than 77 CLIP tokens fed an `[1, N, 768]` tensor into `ClipTextTransformer`, where the 77-slot position-embedding broadcast-add blew up with `shape mismatch in broadcast_add, lhs: [1, N, 768], rhs: [1, 77, 768]`. 
The pooled-output slice at `eos_position = tokens.len() - 1` was also out-of-bounds on the same path. Extract the token preparation into a pure `prepare_clip_tokens` helper that truncates to `max_len` (copying the trailing EOS token into the final slot so the pooled branch still reads an EOS-position hidden state) and then pads up to `max_len`. Wire it into both CLIP-L and CLIP-G via the shared `ClipWithTokenizer` path, so every `sd3*` model benefits. Unit-tested weight-free with four cases: short prompt, exact-77, 132-token overlong (matches the observed failure shape), and an empty tokenisation. All four pass; the 132-token test was red before the fix and is green after. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../mold-inference/src/encoders/sd3_clip.rs | 106 +++++++++++++++++- 1 file changed, 101 insertions(+), 5 deletions(-) diff --git a/crates/mold-inference/src/encoders/sd3_clip.rs b/crates/mold-inference/src/encoders/sd3_clip.rs index c396563f..ddb3a89d 100644 --- a/crates/mold-inference/src/encoders/sd3_clip.rs +++ b/crates/mold-inference/src/encoders/sd3_clip.rs @@ -83,18 +83,16 @@ impl ClipWithTokenizer { .ok_or_else(|| anyhow::anyhow!("Failed to tokenize CLIP end-of-text"))?, }; - let mut tokens = self + let raw_tokens = self .tokenizer .encode(prompt, true) .map_err(|e| anyhow::anyhow!("CLIP tokenization failed: {e}"))? 
.get_ids() .to_vec(); - let eos_position = tokens.len() - 1; + let (tokens, eos_position) = + prepare_clip_tokens(raw_tokens, self.max_position_embeddings, pad_id); - while tokens.len() < self.max_position_embeddings { - tokens.push(pad_id); - } let tokens = Tensor::new(tokens.as_slice(), &self.device)?.unsqueeze(0)?; let (_text_embeddings, text_embeddings_penultimate) = clip.forward_until_encoder_layer(&tokens, usize::MAX, -2)?; @@ -293,3 +291,101 @@ impl SD3TripleEncoder { self.clip_l.model.is_some() && self.clip_g.model.is_some() && self.t5.model.is_some() } } + +/// Prepare a CLIP token sequence for the fixed position-embedding window. +/// +/// CLIP's position-embedding table holds exactly `max_len` entries, so a token +/// tensor longer than that fails inside candle's `broadcast_add` when the +/// position embeddings are applied. This helper: +/// +/// - Truncates overlong sequences to `max_len`, copying the trailing token +/// (the tokenizer's EOS, assuming `add_special_tokens=true`) into the last +/// slot so the pooled-output path still reads an EOS-position hidden state. +/// - Pads short sequences up to `max_len` with `pad_id`. +/// - Returns the final `tokens` vector and the `eos_position` index the caller +/// uses to slice the pooled output. 
+fn prepare_clip_tokens(mut raw_tokens: Vec, max_len: usize, pad_id: u32) -> (Vec, usize) { + let original_len = raw_tokens.len(); + + if original_len > max_len { + let eos_id = *raw_tokens + .last() + .expect("original_len > max_len implies non-empty"); + raw_tokens.truncate(max_len); + if let Some(last) = raw_tokens.last_mut() { + *last = eos_id; + } + tracing::debug!( + "SD3 CLIP prompt exceeded {} tokens ({} raw); truncated with EOS preserved", + max_len, + original_len, + ); + } + + let eos_position = raw_tokens.len().saturating_sub(1); + + while raw_tokens.len() < max_len { + raw_tokens.push(pad_id); + } + + (raw_tokens, eos_position) +} + +#[cfg(test)] +mod tests { + use super::prepare_clip_tokens; + + const MAX_LEN: usize = 77; + const PAD_ID: u32 = 0; + const EOS_ID: u32 = 49407; + + #[test] + fn pads_short_prompt_to_max_len() { + let raw = vec![49406, 10, 20, 30, EOS_ID]; // 5 tokens, last is EOS + let (tokens, eos) = prepare_clip_tokens(raw, MAX_LEN, PAD_ID); + assert_eq!(tokens.len(), MAX_LEN, "must pad up to max_len"); + assert_eq!(eos, 4, "eos_position tracks the raw EOS slot"); + assert_eq!(tokens[4], EOS_ID, "EOS preserved at original position"); + assert_eq!(tokens[5], PAD_ID, "pads follow the real tokens"); + assert_eq!(*tokens.last().unwrap(), PAD_ID); + } + + #[test] + fn leaves_exact_length_untouched() { + let mut raw: Vec = (1..MAX_LEN as u32).collect(); + raw.push(EOS_ID); + assert_eq!(raw.len(), MAX_LEN); + let (tokens, eos) = prepare_clip_tokens(raw.clone(), MAX_LEN, PAD_ID); + assert_eq!(tokens.len(), MAX_LEN); + assert_eq!(eos, MAX_LEN - 1); + assert_eq!(tokens, raw); + } + + #[test] + fn truncates_overlong_prompt_preserving_eos() { + // 132-token sequence — matches the shapes in the original bug report + // ([1, 132, 768] vs [1, 77, 768]). 
+ let mut raw: Vec = (1..=131).collect(); + raw.push(EOS_ID); + assert_eq!(raw.len(), 132); + + let (tokens, eos) = prepare_clip_tokens(raw, MAX_LEN, PAD_ID); + + assert_eq!(tokens.len(), MAX_LEN, "overlong sequence must be truncated"); + assert_eq!(eos, MAX_LEN - 1, "eos_position must land on the last slot"); + assert_eq!( + tokens[MAX_LEN - 1], + EOS_ID, + "EOS must be preserved in the final slot so pooled output reads EOS hidden state", + ); + } + + #[test] + fn handles_empty_input() { + // Degenerate case: tokenizer somehow returns no ids. Shouldn't panic. + let (tokens, eos) = prepare_clip_tokens(Vec::new(), MAX_LEN, PAD_ID); + assert_eq!(tokens.len(), MAX_LEN); + assert_eq!(eos, 0); + assert!(tokens.iter().all(|t| *t == PAD_ID)); + } +} From adf1ff6f1e7a9eb10d89ae928094f7b1e50ad3f0 Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 22:59:40 -0700 Subject: [PATCH 14/31] fix(multi-gpu): stop LTX-2 and upscaler from nuking GPU 0's CUDA context MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit LTX-2 and the upscaler hardcoded `Device::new_cuda(0)` and `reclaim_gpu_memory(0)` in their engine bodies, ignoring the `gpu_ordinal` they were dispatched with. On a multi-GPU host that meant dispatching LTX-2 to GPU 1 still destroyed GPU 0's primary CUDA context mid-denoise, which surfaced as a misleading CUDA_ERROR_OUT_OF_MEMORY on the sibling job and then segfaulted inside `cuEventDestroy_v2` when candle's Drop chain unwound. - Thread `gpu_ordinal` through `Ltx2Engine` → `Ltx2RuntimeSession` and `UpscalerEngine` / `create_upscale_engine`; replace all four hardcoded-0 call sites. - Add a thread-local GPU binding (`init_thread_gpu_ordinal`) set by each GPU worker thread; `create_device` and `reclaim_gpu_memory` `debug_assert` the caller's ordinal matches, so any future hardcoded-0 regression panics in debug builds instead of silently corrupting a sibling GPU's context. 
- Update all 4 `create_upscale_engine` callers (CLI, TUI, two in server routes) to pass ordinal 0 explicitly. Server upscaler cache stays pinned to GPU 0 with a comment noting the per-worker cache migration path if multi-GPU upscale becomes interesting. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/mold-cli/src/commands/upscale.rs | 4 ++ crates/mold-inference/src/device.rs | 54 ++++++++++++++++++++ crates/mold-inference/src/factory.rs | 7 ++- crates/mold-inference/src/ltx2/execution.rs | 2 +- crates/mold-inference/src/ltx2/pipeline.rs | 35 ++++++++++--- crates/mold-inference/src/ltx2/runtime.rs | 28 +++++++--- crates/mold-inference/src/upscaler/engine.rs | 18 +++++-- crates/mold-server/src/gpu_worker.rs | 4 ++ crates/mold-server/src/routes.rs | 6 +++ crates/mold-tui/src/app.rs | 1 + 10 files changed, 140 insertions(+), 19 deletions(-) diff --git a/crates/mold-cli/src/commands/upscale.rs b/crates/mold-cli/src/commands/upscale.rs index 7776a45a..a8224924 100644 --- a/crates/mold-cli/src/commands/upscale.rs +++ b/crates/mold-cli/src/commands/upscale.rs @@ -161,10 +161,14 @@ async fn upscale_local( let req_clone = req.clone(); let resp = tokio::task::spawn_blocking(move || -> Result { + // CLI upscale runs locally on the best available GPU (ordinal 0 + // on single-GPU hosts). The multi-GPU server path routes through + // `gpu_worker`, which passes its own ordinal. 
let mut engine = mold_inference::create_upscale_engine( model_name_owned, weights_path, mold_inference::LoadStrategy::Sequential, + 0, )?; // Set up progress callback for stderr diff --git a/crates/mold-inference/src/device.rs b/crates/mold-inference/src/device.rs index a16ca50f..acc12a04 100644 --- a/crates/mold-inference/src/device.rs +++ b/crates/mold-inference/src/device.rs @@ -1,6 +1,57 @@ use crate::engine::LoadStrategy; use crate::progress::ProgressReporter; use mold_core::types::GpuSelection; +use std::cell::Cell; + +// ── Thread-local GPU ordinal guard ───────────────────────────────────────── +// +// Each GPU worker thread is pinned to a single ordinal. We stash that ordinal +// in a thread-local so cross-engine hotpaths (`create_device`, `reclaim_gpu_memory`) +// can debug-assert the caller isn't drifting onto a sibling GPU's context — +// the exact footgun that took the process down on killswitch when LTX-2 had +// `reclaim_gpu_memory(0)` hardcoded and nuked GPU 0's context while SD3.5 +// was still denoising there. +// +// Threads without a bound ordinal (tokio blocking pool, tests) see `None` +// and the assert is skipped. + +thread_local! { + static THREAD_GPU_ORDINAL: Cell> = const { Cell::new(None) }; +} + +/// Bind the current thread to a GPU ordinal. Call once from each GPU worker +/// thread's entry point. Any subsequent `create_device` / `reclaim_gpu_memory` +/// call on this thread must match `ordinal` (debug builds only). +pub fn init_thread_gpu_ordinal(ordinal: usize) { + THREAD_GPU_ORDINAL.with(|c| c.set(Some(ordinal))); +} + +/// Clear the thread's GPU binding. Not strictly needed in production (workers +/// run for the process lifetime) but useful for tests that reuse threads. +pub fn clear_thread_gpu_ordinal() { + THREAD_GPU_ORDINAL.with(|c| c.set(None)); +} + +/// Returns the currently-bound ordinal, if any. 
+pub fn thread_gpu_ordinal() -> Option { + THREAD_GPU_ORDINAL.with(|c| c.get()) +} + +/// Panic in debug builds if `ordinal` doesn't match the thread's bound GPU. +/// A mismatch means a call site is ignoring its engine's `gpu_ordinal` and +/// reaching for another GPU's context — the SD3.5/LTX-2 crash pattern. +#[inline] +fn debug_assert_ordinal_matches_thread(ordinal: usize, context: &'static str) { + if cfg!(debug_assertions) { + if let Some(expected) = thread_gpu_ordinal() { + assert_eq!( + expected, ordinal, + "{context}: ordinal {ordinal} does not match this thread's \ + bound GPU {expected} — hardcoded ordinal regression?" + ); + } + } +} // ── GPU discovery ────────────────────────────────────────────────────────── @@ -107,6 +158,7 @@ pub fn create_device( tracing::info!("CPU forced via MOLD_DEVICE=cpu"); return Ok(Device::Cpu); } + debug_assert_ordinal_matches_thread(ordinal, "create_device"); if candle_core::utils::cuda_is_available() { progress.info(&format!("Using CUDA device {ordinal}")); tracing::info!("Using CUDA device {ordinal}"); @@ -389,6 +441,8 @@ pub fn available_system_memory_bytes() -> Option { pub fn reclaim_gpu_memory(ordinal: usize) { use candle_core::cuda_backend::cudarc::driver::{result, sys}; + debug_assert_ordinal_matches_thread(ordinal, "reclaim_gpu_memory"); + // Synchronize to ensure all async GPU work completes before reset. 
let _ = result::ctx::synchronize(); diff --git a/crates/mold-inference/src/factory.rs b/crates/mold-inference/src/factory.rs index 4a82bfed..7b78df40 100644 --- a/crates/mold-inference/src/factory.rs +++ b/crates/mold-inference/src/factory.rs @@ -181,7 +181,12 @@ pub fn create_engine_with_pool( shared_pool, ))) } - "ltx2" | "ltx-2" => Ok(Box::new(Ltx2Engine::new(model_name, paths, load_strategy))), + "ltx2" | "ltx-2" => Ok(Box::new(Ltx2Engine::new( + model_name, + paths, + load_strategy, + gpu_ordinal, + ))), "wuerstchen" | "wuerstchen-v2" => Ok(Box::new(WuerstchenEngine::new( model_name, paths, diff --git a/crates/mold-inference/src/ltx2/execution.rs b/crates/mold-inference/src/ltx2/execution.rs index b0624a2a..a76318ac 100644 --- a/crates/mold-inference/src/ltx2/execution.rs +++ b/crates/mold-inference/src/ltx2/execution.rs @@ -268,7 +268,7 @@ mod tests { } fn engine(model_name: &str, paths: ModelPaths) -> Ltx2Engine { - Ltx2Engine::new(model_name.to_string(), paths, LoadStrategy::Sequential) + Ltx2Engine::new(model_name.to_string(), paths, LoadStrategy::Sequential, 0) } #[test] diff --git a/crates/mold-inference/src/ltx2/pipeline.rs b/crates/mold-inference/src/ltx2/pipeline.rs index 9c744d8b..871fa87d 100644 --- a/crates/mold-inference/src/ltx2/pipeline.rs +++ b/crates/mold-inference/src/ltx2/pipeline.rs @@ -34,6 +34,11 @@ pub struct Ltx2Engine { native_runtime: Option, on_progress: Option, pending_placement: Option, + /// GPU ordinal this engine is pinned to. Every `Device::new_cuda` and + /// `reclaim_gpu_memory` call must use this ordinal — hardcoding `0` here + /// is what took down the process on killswitch when LTX-2 ran alongside + /// SD3.5 on a multi-GPU host. 
+ gpu_ordinal: usize, } impl Ltx2Engine { @@ -54,7 +59,12 @@ impl Ltx2Engine { } } - pub fn new(model_name: String, paths: ModelPaths, _load_strategy: LoadStrategy) -> Self { + pub fn new( + model_name: String, + paths: ModelPaths, + _load_strategy: LoadStrategy, + gpu_ordinal: usize, + ) -> Self { Self { model_name, paths, @@ -62,6 +72,7 @@ impl Ltx2Engine { native_runtime: None, on_progress: None, pending_placement: None, + gpu_ordinal, } } @@ -78,6 +89,7 @@ impl Ltx2Engine { native_runtime: Some(runtime), on_progress: None, pending_placement: None, + gpu_ordinal: 0, } } @@ -220,7 +232,7 @@ impl Ltx2Engine { match backend { Ltx2Backend::Cuda => { self.info("CUDA detected, using native LTX-2 GPU path"); - let device = Device::new_cuda(0)?; + let device = Device::new_cuda(self.gpu_ordinal)?; configure_native_ltx2_cuda_device(&device)?; Ok(device) } @@ -261,9 +273,16 @@ impl Ltx2Engine { )?; Self::log_timing("pipeline.create_runtime.load_prompt_encoder", load_start); if prompt_device.is_cuda() { - Ok(Ltx2RuntimeSession::new_deferred_cuda(prompt_encoder)) + Ok(Ltx2RuntimeSession::new_deferred_cuda( + prompt_encoder, + self.gpu_ordinal, + )) } else { - Ok(Ltx2RuntimeSession::new(device, prompt_encoder)) + Ok(Ltx2RuntimeSession::new( + device, + prompt_encoder, + self.gpu_ordinal, + )) } } @@ -294,7 +313,7 @@ impl Ltx2Engine { self.info( "Native LTX-2 prompt path ran out of CUDA memory; retrying with CPU fallback", ); - crate::device::reclaim_gpu_memory(0); + crate::device::reclaim_gpu_memory(self.gpu_ordinal); self.load_runtime_session_on_device(plan, Device::Cpu) } Err(err) => Err(err), @@ -1003,7 +1022,7 @@ mod tests { .unwrap(), PaddingSide::Left, ); - Ltx2RuntimeSession::new(Device::Cpu, prompt_encoder) + Ltx2RuntimeSession::new(Device::Cpu, prompt_encoder, 0) } fn request(output_format: OutputFormat, enable_audio: Option) -> GenerateRequest { @@ -1053,6 +1072,7 @@ mod tests { "ltx-2.3-22b-distilled:fp8".to_string(), dummy_paths(), LoadStrategy::Sequential, + 0, 
); let req = GenerateRequest { prompt: "test".to_string(), @@ -1114,6 +1134,7 @@ mod tests { "ltx-2-19b-distilled:fp8".to_string(), dummy_paths(), LoadStrategy::Sequential, + 0, ); assert_eq!(engine.request_quantization(), Some("fp8-cast".to_string())); } @@ -1133,6 +1154,7 @@ mod tests { "ltx-2-19b-distilled:fp8".to_string(), dummy_paths_with_gemma_root(gemma_dir.path()), LoadStrategy::Sequential, + 0, ); let req = GenerateRequest { prompt: "test".to_string(), @@ -1200,6 +1222,7 @@ mod tests { "ltx-2-19b-distilled:fp8".to_string(), paths, LoadStrategy::Sequential, + 0, ); engine.load().unwrap(); diff --git a/crates/mold-inference/src/ltx2/runtime.rs b/crates/mold-inference/src/ltx2/runtime.rs index 5fc0a62b..221087b9 100644 --- a/crates/mold-inference/src/ltx2/runtime.rs +++ b/crates/mold-inference/src/ltx2/runtime.rs @@ -296,22 +296,34 @@ pub struct Ltx2RuntimeSession { /// final latents and forward them to the next chain stage as a /// [`super::chain::ChainTail`]. `None` outside chain flow. pub(crate) tail_capture: Option>>>, + /// GPU ordinal inherited from `Ltx2Engine`. Used for the deferred CUDA + /// device creation in `prepare()` and for post-OOM context reset. 
+ gpu_ordinal: usize, } impl Ltx2RuntimeSession { - pub fn new(device: candle_core::Device, prompt_encoder: NativePromptEncoder) -> Self { + pub fn new( + device: candle_core::Device, + prompt_encoder: NativePromptEncoder, + gpu_ordinal: usize, + ) -> Self { Self { device: Some(device), prompt_encoder: Some(prompt_encoder), tail_capture: None, + gpu_ordinal, } } - pub fn new_deferred_cuda(prompt_encoder: NativePromptEncoder) -> Self { + pub fn new_deferred_cuda( + prompt_encoder: NativePromptEncoder, + gpu_ordinal: usize, + ) -> Self { Self { device: None, prompt_encoder: Some(prompt_encoder), tail_capture: None, + gpu_ordinal, } } @@ -414,8 +426,8 @@ impl Ltx2RuntimeSession { let device_handoff_start = Instant::now(); if prompt_device_is_cuda { if self.device.is_none() { - crate::device::reclaim_gpu_memory(0); - self.device = Some(new_native_cuda_device()?); + crate::device::reclaim_gpu_memory(self.gpu_ordinal); + self.device = Some(new_native_cuda_device(self.gpu_ordinal)?); } else if let Some(device) = self.device.as_ref() { if device.is_cuda() { device.synchronize()?; @@ -864,8 +876,8 @@ fn overlay_alpha(overlay: &ConditioningOverlay, frame_idx: u32, total_frames: u3 } #[cfg(feature = "cuda")] -fn new_native_cuda_device() -> Result { - let device = candle_core::Device::new_cuda(0)?; +fn new_native_cuda_device(ordinal: usize) -> Result { + let device = candle_core::Device::new_cuda(ordinal)?; let cuda = device.as_cuda_device()?; if cuda.is_event_tracking() { unsafe { @@ -876,7 +888,7 @@ fn new_native_cuda_device() -> Result { } #[cfg(not(feature = "cuda"))] -fn new_native_cuda_device() -> Result { +fn new_native_cuda_device(_ordinal: usize) -> Result { anyhow::bail!("CUDA backend is unavailable in this build") } @@ -4990,7 +5002,7 @@ mod tests { .unwrap(), PaddingSide::Left, ); - Ltx2RuntimeSession::new(candle_core::Device::Cpu, prompt_encoder) + Ltx2RuntimeSession::new(candle_core::Device::Cpu, prompt_encoder, 0) } fn build_plan( diff --git 
a/crates/mold-inference/src/upscaler/engine.rs b/crates/mold-inference/src/upscaler/engine.rs index 89590da3..eb741180 100644 --- a/crates/mold-inference/src/upscaler/engine.rs +++ b/crates/mold-inference/src/upscaler/engine.rs @@ -75,16 +75,26 @@ pub struct UpscalerEngine { loaded: Option, progress: ProgressReporter, load_strategy: LoadStrategy, + /// GPU ordinal this engine is pinned to. Same multi-GPU footgun as + /// `Ltx2Engine::gpu_ordinal` — hardcoding `0` would corrupt a sibling + /// GPU's CUDA context on unload. + gpu_ordinal: usize, } impl UpscalerEngine { - pub fn new(name: String, weights_path: PathBuf, load_strategy: LoadStrategy) -> Self { + pub fn new( + name: String, + weights_path: PathBuf, + load_strategy: LoadStrategy, + gpu_ordinal: usize, + ) -> Self { Self { name, weights_path, loaded: None, progress: ProgressReporter::default(), load_strategy, + gpu_ordinal, } } @@ -240,7 +250,7 @@ impl UpscaleEngine for UpscalerEngine { let load_start = Instant::now(); self.progress.stage_start("Loading upscaler model"); - let device = create_device(0, &self.progress)?; + let device = create_device(self.gpu_ordinal, &self.progress)?; // Determine dtype: prefer F16 on GPU, F32 on CPU let dtype = if matches!(device, Device::Cpu) { @@ -317,7 +327,7 @@ impl UpscaleEngine for UpscalerEngine { fn unload(&mut self) { if self.loaded.is_some() { self.loaded = None; - crate::reclaim_gpu_memory(0); + crate::reclaim_gpu_memory(self.gpu_ordinal); tracing::info!("Upscaler model unloaded: {}", self.name); } } @@ -340,6 +350,7 @@ pub fn create_upscale_engine( model_name: String, weights_path: PathBuf, load_strategy: LoadStrategy, + gpu_ordinal: usize, ) -> Result> { if !weights_path.exists() { bail!("upscaler weights not found: {}", weights_path.display()); @@ -348,5 +359,6 @@ pub fn create_upscale_engine( model_name, weights_path, load_strategy, + gpu_ordinal, ))) } diff --git a/crates/mold-server/src/gpu_worker.rs b/crates/mold-server/src/gpu_worker.rs index 
a2a5fe72..104f8d38 100644 --- a/crates/mold-server/src/gpu_worker.rs +++ b/crates/mold-server/src/gpu_worker.rs @@ -21,6 +21,10 @@ pub fn spawn_gpu_thread( std::thread::Builder::new() .name(format!("gpu-worker-{}", worker.gpu.ordinal)) .spawn(move || { + // Bind this thread to its GPU ordinal so `create_device` / + // `reclaim_gpu_memory` can debug-assert callers don't drift onto + // a sibling GPU's context. See device::init_thread_gpu_ordinal. + mold_inference::device::init_thread_gpu_ordinal(worker.gpu.ordinal); tracing::info!( gpu = worker.gpu.ordinal, name = %worker.gpu.name, diff --git a/crates/mold-server/src/routes.rs b/crates/mold-server/src/routes.rs index 2d6ffc3c..ab971832 100644 --- a/crates/mold-server/src/routes.rs +++ b/crates/mold-server/src/routes.rs @@ -656,10 +656,15 @@ async fn upscale( .as_ref() .is_none_or(|e| e.model_name() != model_name_owned); if needs_new { + // Server-side upscaler cache is process-global today and + // intentionally pinned to GPU 0 (matches prior behavior). + // If multi-GPU upscale becomes interesting, migrate this to + // a per-worker cache on `GpuWorker` and route via the pool. 
let new_engine = mold_inference::create_upscale_engine( model_name_owned, weights_path, mold_inference::LoadStrategy::Eager, + 0, )?; *cache = Some(new_engine); } @@ -824,6 +829,7 @@ async fn upscale_stream( model_name_owned, weights_path, mold_inference::LoadStrategy::Eager, + 0, ) { Ok(new_engine) => { *cache = Some(new_engine); diff --git a/crates/mold-tui/src/app.rs b/crates/mold-tui/src/app.rs index 0ec7661f..16659f9e 100644 --- a/crates/mold-tui/src/app.rs +++ b/crates/mold-tui/src/app.rs @@ -1452,6 +1452,7 @@ impl App { model_name_local.clone(), weights_path, mold_inference::LoadStrategy::Eager, + 0, )?; engine.set_on_progress(Box::new(move |event| { From 24437ee1a2034854b5a8dffc379a8bd61dc5105c Mon Sep 17 00:00:00 2001 From: Jeffrey Dilley Date: Mon, 20 Apr 2026 23:34:57 -0700 Subject: [PATCH 15/31] feat(web): hide-mode toggle + multi-select delete in gallery Adds two bulk-UX affordances to the web gallery SPA. Hide-mode toggle blurs every tile behind a dark shroud with a per-tile "Reveal" for single peeks; the global preference persists in localStorage, peeks don't. Select mode enables click-to-toggle, shift-click range, and drag-marquee selection with a floating action bar for Select all / Clear / Delete selected / Delete all. Bulk deletes parallelize via Promise.allSettled and partial failures surface a rollup. Select button is gated on capabilities.gallery.can_delete so servers without MOLD_GALLERY_ALLOW_DELETE=1 don't expose dead UI. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- web/src/components/GalleryCard.vue | 133 ++++++++++++++- web/src/components/GalleryFeed.vue | 179 +++++++++++++++++++- web/src/components/TopBar.vue | 121 ++++++++++++- web/src/pages/GalleryPage.vue | 263 +++++++++++++++++++++++++++++ 4 files changed, 674 insertions(+), 22 deletions(-) diff --git a/web/src/components/GalleryCard.vue b/web/src/components/GalleryCard.vue index 3ce44c42..355fc4ea 100644 --- a/web/src/components/GalleryCard.vue +++ b/web/src/components/GalleryCard.vue @@ -20,14 +20,36 @@ const props = withDefaults( // the header toggle, subsequent videos entering the viewport pick up // the preference automatically. muted?: boolean; + // Multi-select state. When `selectMode` is true, clicks toggle the + // selection instead of opening the detail drawer. + selectMode?: boolean; + selected?: boolean; + // Hide mode renders a blurred overlay over the media until the user + // clicks the reveal button (per-item) or flips the global toggle. 
+ hideMode?: boolean; + revealed?: boolean; }>(), - { variant: "grid", muted: true }, + { + variant: "grid", + muted: true, + selectMode: false, + selected: false, + hideMode: false, + revealed: false, + }, ); const emit = defineEmits<{ (e: "open", item: GalleryImage): void; + ( + e: "toggle-select", + payload: { item: GalleryImage; shift: boolean; meta: boolean }, + ): void; + (e: "reveal", item: GalleryImage): void; }>(); +const isHidden = computed(() => props.hideMode && !props.revealed); + /* * Lifecycle * --------- @@ -129,26 +151,64 @@ function onVideoError() { stage.value = "broken"; } -function openDetail() { +function onCardClick(evt: MouseEvent) { + if (props.selectMode) { + emit("toggle-select", { + item: props.item, + shift: evt.shiftKey, + meta: evt.metaKey || evt.ctrlKey, + }); + return; + } emit("open", props.item); } + +function onCardKey(evt: KeyboardEvent) { + if (props.selectMode) { + emit("toggle-select", { + item: props.item, + shift: evt.shiftKey, + meta: evt.metaKey || evt.ctrlKey, + }); + return; + } + emit("open", props.item); +} + +function onReveal(evt: Event) { + evt.stopPropagation(); + emit("reveal", props.item); +}