diff --git a/contrib/models/LTX-2.3/README.md b/contrib/models/LTX-2.3/README.md new file mode 100644 index 00000000..b2a017a9 --- /dev/null +++ b/contrib/models/LTX-2.3/README.md @@ -0,0 +1,379 @@ +# Contrib Model: LTX-2.3 + +LTX-2.3 22B parameter DiT audio-video diffusion transformer running on AWS Trainium 2 via NxD Inference. Generates synchronized video + audio from text prompts, with optional image-to-video conditioning. + +## Model Information + +- **HuggingFace ID:** [`Lightricks/LTX-2.3`](https://huggingface.co/Lightricks/LTX-2.3) +- **Model Type:** DiT (Diffusion Transformer) for joint audio-video generation (text-to-video and image-to-video) +- **Parameters:** 22B (BF16) — 48 transformer blocks, 32 heads, 4096 video dim, 2048 audio dim +- **Architecture:** Bidirectional audio-video cross-attention, gated attention, QK-RMSNorm, split RoPE, flow matching +- **License:** See HuggingFace model card +- **Framework:** Native [`ltx-core`](https://github.com/Lightricks/LTX-2) (not Diffusers) + +## Validation Results + +**Validated:** 2026-03-09 +**Instance:** trn2.3xlarge (TP=4, LNC=2, 4 logical NeuronCores) +**SDK:** Neuron SDK 2.28, PyTorch 2.9, Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 + +### Accuracy Validation + +| Component | Metric | Value | Notes | +|-----------|--------|-------|-------| +| Single forward pass (video) | Cosine similarity | 0.999947 | sigma=1.0, noise input | +| Single forward pass (audio) | Cosine similarity | 0.999867 | sigma=1.0, noise input | +| 8-step denoised latent (real text) | Cosine similarity | 0.972 | Same Gemma 3 text, same seed | + +All accuracy numbers measured against CPU reference (unsharded BF16, native ltx-core model). 
+ +### Benchmark Results + +| Stage | Time | Notes | +|-------|------|-------| +| CPU component loading | 24.9s | LTXModel, VideoDecoder, AudioDecoder, Vocoder, EmbeddingsProcessor | +| Gemma3 encoder loading (4 ranks) | 362s | Pre-sharded weights, NEFF rehydration (cold start) | +| Text encoding — Neuron Gemma3 (warm) | 0.6s | Encoder forward pass only (644ms), with tensorizer-optimized compiler flags | +| Text encoding — Neuron Gemma3 (E2E warm) | ~1.3s | Including tokenization + post-processing | +| Text encoding — Neuron Gemma3 (warmup) | 16.3s | First forward pass on NeuronCores | +| Text encoding — CPU fallback | ~162s | Without Neuron compilation | +| Gemma3 unload | 2.6s | Explicit NRT resource cleanup | +| Neuron backbone weight loading | 70s | Pre-sharded weights, 4×9.3 GB rank files | +| Neuron backbone NEFF loading | 84s | Compiled model loaded onto 4 NeuronCores | +| **Denoising step (warm, steps 2-8)** | **0.3s** | **Steady-state per-step latency** | +| Denoising step (cold, step 1) | 174.6s | Includes Neuron device initialization | +| DiT warmup (1st inference) | 138.5s | Forces NRT to load NEFF onto cores | +| Total denoising (8 steps) | 176.9s | 22.1s/step average (dominated by cold start) | +| Video decode (CPU) | 7.2s | 25 frames @ 384×512 | +| Video decode (CPU, upscaled) | 33.3s | 49 frames @ 768×1024 | +| Audio decode (CPU) | 2.5s | Stereo WAV, 48kHz | +| Spatial upscaler (CPU) | 0.7s | 498M params, (1,128,4,12,16) → (1,128,4,24,32) | +| Temporal upscaler (CPU) | 0.4s | 131M params, (1,128,4,24,32) → (1,128,7,24,32) | + +Note: Gemma3 and DiT backbone share the same 4 NeuronCores and are loaded sequentially. The cold start latency (NEFF rehydration) is a one-time cost when the compiled model is first loaded onto a fresh instance. Subsequent generations reuse the loaded model. 
+ +### Per-Step Performance Detail + +Warm denoising steps (2-8) at 384×512, 25 frames, with all CPU optimizations applied: + +| Component | Time | % of Step | +|-----------|------|-----------| +| CPU Preprocess | 33.1 ms | 11.9% | +| Neuron Backbone | 244.1 ms | 87.4% | +| Euler Step | 2.1 ms | 0.7% | +| **Total per step** | **279.3 ms** | 100% | + +Two CPU preprocessing optimizations are applied: + +1. **AdaLN deduplication**: Computes the timestep embedding MLP once per unique sigma value instead of per-token (768 tokens in T2V mode). Reduces AdaLN time from 54ms to 7ms. +2. **Step-invariant caching**: RoPE embeddings, context projection (caption_projection Linear), and attention masks are constant across denoising steps. Computed once on step 1 and reused for steps 2-8. Saves ~57ms on the first optimization pass (context projection dominates at ~55ms), with combined CPU preprocessing reduced from 96.5ms (baseline) to 39.0ms (60% reduction). + +Overall per-step improvement vs unoptimized baseline: 330ms to 279.3ms (15.4% reduction). 
+ +### Two-Stage Pipeline Benchmarks + +| Stage | Time | Notes | +|-------|------|-------| +| Stage 1 backbone loading (4 ranks) | 83.6s | Pre-sharded weights from full-res sharding | +| Stage 1 warmup (1st inference) | 107.9s | Forces NRT to load half-res NEFF | +| **S1 denoising step (warm, steps 2-8)** | **0.3s** | **192×256 latent, VIDEO_SEQ=192** | +| S1 denoising step (cold, step 1) | 142.1s | Includes Neuron device initialization | +| S1 total (8 steps) | 143.9s | 18.0s/step average | +| Spatial upscaler loading (CPU) | 8.1s | 498M params | +| Spatial upsample (CPU) | 0.3s | (1,128,4,6,8) → (1,128,4,12,16) | +| Stage 2 backbone loading (4 ranks) | 17.2s | Same pre-sharded weights, cached | +| Stage 2 warmup (1st inference) | 110.6s | Forces NRT to load full-res NEFF | +| **S2 denoising step (warm, steps 2-3)** | **0.3s** | **384×512 latent, VIDEO_SEQ=768** | +| S2 denoising step (cold, step 1) | 143.0s | Includes Neuron device initialization | +| S2 total (3 steps) | 143.7s | 47.9s/step average | +| **Combined denoising (S1+S2)** | **287.6s** | **2.7s actual compute after warmup** | + +Two-stage mode generates at half resolution (192×256) with 8 denoising steps, spatially upscales x2, then refines at full resolution (384×512) with 3 additional steps. The same backbone weights are used for both stages — only the compiled shapes differ. 
+ +### trn2.48xlarge Full Benchmark (Two-Phase Pipeline) + +**Validated:** 2026-03-16 +**Instance:** trn2.48xlarge (TP=4/16, LNC=2, 32 logical NeuronCores) +**SDK:** Neuron SDK 2.27, PyTorch 2.9, Deep Learning AMI Neuron (Ubuntu 24.04) 20260126 +**Resolution:** 512×768 → 1024×1536, 121 frames, Image-to-Video + +Two-phase execution is required because the NRT communicator cannot change TP degree mid-process: +- **Phase 1** (TP=4): Gemma3 text encoding + S1 denoising (8 steps at 512×768) +- **Phase 2** (TP=16): Spatial upsample → S2 denoising (3 steps at 1024×1536) → Neuron VAE decode + +| Component | Time | Notes | +|-----------|------|-------| +| **S1 denoising (8 steps, 512×768)** | **14.4s** | **1.8s/step, TP=4** | +| **S2 denoising (3 steps, 1024×1536)** | **21.7s** | **7.2s/step, TP=16** | +| **Combined denoising** | **36.0s** | S1 + S2 | +| **VAE decode (Neuron, tiled)** | **23.5s** | 33 tiles @ 610ms/tile (TP=4), 3.4s first-tile warmup | +| VAE decode (CPU, reference) | 78.7s | 3.3x slower than Neuron | +| Audio decode (CPU) | 2.2s | Stereo WAV | +| Spatial upsample (CPU) | 1.8s | 498M params | +| Compilation (total) | ~33 min | Encoder 68s + S1 133s + S2 338s + VAE 570s | + +**VAE Decoder**: The LTX-2.3 video decoder is compiled at TP=4 with 4×16 latent tiles (128×512 pixels). After Phase 2 unloads the TP=16 S2 backbone, the TP=4 VAE loads onto the freed NeuronCores (4.6s load time). Tiled decode uses overlap blending (overlap_h=1 latent) for seamless spatial reconstruction at arbitrary resolutions. 
+ +### Component Distribution + +| Component | Location | Notes | +|-----------|----------|-------| +| DiT transformer (48 blocks) | **Neuron** (TP=4) | ~11 GB/rank HBM | +| Gemma 3 12B text encoder | **Neuron** (TP=4) or CPU | Shares NeuronCores with DiT, sequential execution | +| VideoDecoder | CPU | Per-channel statistics normalization | +| AudioDecoder + Vocoder | CPU | Float32 for vocoder accuracy | +| Spatial/Temporal upscalers | CPU | Sub-second each | +| EmbeddingsProcessor | CPU | Connectors + feature extraction | + +Both the DiT backbone and Gemma3 encoder are compiled for TP=4 and share the same 4 NeuronCores. They execute sequentially: Gemma3 loads, encodes text, and unloads; then the DiT backbone loads and runs the 8-step denoising loop. CPU fallback for Gemma3 is available but ~25x slower. + +## Usage + +### Prerequisites + +```bash +# On trn2.3xlarge with Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + +# Install ltx-core +pip install git+https://github.com/Lightricks/LTX-2.git#subdirectory=packages/ltx-core + +# Download model weights +huggingface-cli download Lightricks/LTX-2.3 ltx-2.3-22b-distilled.safetensors \ + --local-dir /home/ubuntu/models/LTX-2.3/ + +# Download Gemma 3 12B text encoder +huggingface-cli download google/gemma-3-12b-it \ + --local-dir /home/ubuntu/models/gemma-3-12b + +# Download upscaler weights (for --upscale, spatial x2 + temporal x2) +huggingface-cli download Lightricks/LTX-2.3 ltx-2.3-spatial-upscaler-x2-1.0.safetensors \ + --local-dir /home/ubuntu/models/LTX-2.3/upscalers/ +huggingface-cli download Lightricks/LTX-2.3 ltx-2.3-temporal-upscaler-x2-1.0.safetensors \ + --local-dir /home/ubuntu/models/LTX-2.3/upscalers/ +``` + +### Step 1: Compile the DiT Backbone + +```bash +# Full-resolution backbone (384x512, VIDEO_SEQ=768) +NEURON_FUSE_SOFTMAX=1 NEURON_CUSTOM_SILU=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + torchrun --nproc_per_node=4 
src/compile_transformer.py +``` + +Compilation takes approximately 60 seconds. The compiled model is saved to `compiler_workdir_tp4_lnc2_v2/tp_0.pt` (8.7 GB). + +For two-stage mode, also compile the half-resolution backbone: + +```bash +# Half-resolution backbone (192x256, VIDEO_SEQ=192) — for two-stage mode +NEURON_FUSE_SOFTMAX=1 NEURON_CUSTOM_SILU=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + torchrun --nproc_per_node=4 src/compile_transformer_halfres.py +``` + +This compiles the same architecture at the half-res latent shape (4×6×8 instead of 4×12×16). Output: `compiler_workdir_tp4_lnc2_halfres/tp_0.pt` (~8.7 GB). + +### Step 2: Pre-shard Backbone Weights + +Pre-sharding avoids loading the full 41 GB safetensors file during generation: + +```bash +python3 src/shard_backbone_weights.py \ + --model-path /home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \ + --output-dir /home/ubuntu/backbone_sharded +``` + +This produces 4 rank files (~9.3 GB each) that are loaded directly during generation. + +### Step 3: Compile and Shard Gemma3 Encoder (Recommended) + +Compiling Gemma3 for Neuron eliminates the ~162s CPU text encoding bottleneck: + +```bash +# Compile the encoder graph +NEURON_FUSE_SOFTMAX=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + python3 src/compile_gemma3.py \ + --compile-dir /home/ubuntu/gemma3_encoder_compiled + +# Pre-shard weights for fast loading (~5.9 GB per rank) +python3 src/shard_gemma3_weights.py \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --output-dir /home/ubuntu/gemma3_encoder_sharded +``` + +The Gemma3 encoder uses tensorizer-optimized compiler flags (`--enable-ccop-compute-overlap`, `--cc-pipeline-tiling-factor=1`, `--vectorize-strided-dma`, `--enable-scalar-dge-vectorization`) that achieve 3.1x faster inference (644ms vs 2000ms) compared to the original precision flags. 
+ +### Step 4: Generate Video + Audio + +```bash +# With Neuron-compiled Gemma3 (recommended, fastest) +python3 src/generate_ltx23.py \ + --neuron-gemma \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --backbone-sharded-dir /home/ubuntu/backbone_sharded \ + --prompt "A golden retriever puppy runs across a sunny green meadow" + +# With upscaling (384x512 @ 25 frames -> 768x1024 @ 49 frames) +python3 src/generate_ltx23.py \ + --neuron-gemma \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --backbone-sharded-dir /home/ubuntu/backbone_sharded \ + --prompt "A golden retriever puppy runs across a sunny green meadow" \ + --upscale + +# With CPU Gemma3 (slower, no compilation needed) +python3 src/generate_ltx23.py \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --backbone-sharded-dir /home/ubuntu/backbone_sharded \ + --prompt "A golden retriever puppy runs across a sunny green meadow" + +# Quick test with random embeddings (no Gemma needed) +python3 src/generate_ltx23.py --no-text-encoder +``` + +#### Image-to-Video Generation + +Add `--image` to condition the video on an input photograph. Frame 0 is encoded from the image and preserved throughout denoising; subsequent frames are generated to match the prompt while maintaining visual consistency with the input. 
+ +```bash +# Image-to-video with Neuron Gemma3 +python3 src/generate_ltx23.py \ + --neuron-gemma \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --backbone-sharded-dir /home/ubuntu/backbone_sharded \ + --prompt "The woman turns and smiles warmly at the camera" \ + --image /path/to/photo.png +``` + +The image encoder uses the ltx-core `VideoEncoder` loaded from the same safetensors checkpoint (no additional downloads needed). No recompilation of the DiT backbone is required — the same compiled model handles both T2V and I2V since the tensor shapes are identical. + +#### Two-Stage Generation (Refinement Denoising) + +Two-stage mode follows the LTX-2.3 `DistilledPipeline` reference: generate at half resolution (192×256) with 8 steps, spatially upsample x2, then refine at full resolution (384×512) with 3 additional denoising steps. This produces sharper output than single-stage generation. Requires the half-res backbone compilation from Step 1. + +```bash +# Two-stage with Neuron Gemma3 +python3 src/generate_ltx23.py \ + --neuron-gemma \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --backbone-sharded-dir /home/ubuntu/backbone_sharded \ + --prompt "A golden retriever puppy runs across a sunny green meadow" \ + --two-stage \ + --halfres-compiled-dir /home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_halfres +``` + +The pipeline sequence: Gemma3 encode → unload → half-res DiT (8 steps) → unload → spatial upsample x2 → full-res DiT (3 refinement steps) → unload → VAE decode. The same pre-sharded backbone weights are used for both stages (the model weights are identical; only the compiled shapes differ). The spatial upscaler weights are downloaded as part of the prerequisites. 
+ +Output: PNG frames, MP4 video (if ffmpeg available), WAV audio. + +## Compatibility Matrix + +| Instance/Version | SDK 2.27 | SDK 2.28 | +|------------------|----------|----------| +| trn2.3xlarge (TP=4, LNC=2) | — | VALIDATED | +| trn2.48xlarge (TP=4/16, LNC=2) | VALIDATED | — | + +## Example Checkpoints + +* [`Lightricks/LTX-2.3`](https://huggingface.co/Lightricks/LTX-2.3) — `ltx-2.3-22b-distilled.safetensors` (8-step distilled, 43 GB) +* [`google/gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) — Text encoder (23 GB) + +## Testing Instructions + +```bash +# Ensure model is downloaded and backbone is compiled (see Usage above), then: +cd contrib/models/LTX-2.3 + +MODEL_PATH=/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \ +COMPILED_MODEL_PATH=/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_v2 \ + pytest test/integration/test_model.py -v -s +``` + +Tests validate: +- Model loads successfully with weight injection +- Forward pass produces valid (non-NaN) output +- Cosine similarity >= 0.999 vs CPU reference +- Per-step latency measurement + +## Architecture Details + +### Backbone Signature + +The compiled backbone takes 24 flat tensors (for XLA tracing compatibility): + +| Index | Shape | Description | +|-------|-------|-------------| +| 0 | (1, 768, 4096) | Video hidden states | +| 1 | (1, 26, 2048) | Audio hidden states | +| 2-3 | (1, 256, 4096/2048) | Text encoder context (video/audio) | +| 4-5 | (1, seq, 9*dim) | AdaLN timestep embeddings | +| 6-7 | (1, seq, dim) | Embedded timesteps | +| 8-11 | (1, 1, ...) | Cross-attention AdaLN scale/shift/gate | +| 12-19 | (1, heads, seq, dim/2) | RoPE cos/sin (self-attn + cross-attn) | +| 20-21 | (1, 256) | Encoder attention masks (additive) | +| 22-23 | (1, 1, ...) 
| Prompt timestep embeddings | + +### TP Sharding Pattern + +- **ColumnParallel**: Q, K, V projections, FFN gate/up projections, gate_logits +- **RowParallel**: Attention output projection, FFN down projection +- **DistributedRMSNorm**: QK-norm (q_norm, k_norm) with all-reduce for global variance +- **SPMDRank**: Per-rank RoPE slicing via `torch.index_select` + +### Compiler Flags + +**DiT backbone:** +``` +--model-type=transformer -O1 --auto-cast matmult --lnc 2 +--tensorizer-options='--enable-ccop-compute-overlap' +--enable-fast-loading-neuron-binaries +``` + +**Optional DiT flag** — adding `--vectorize-strided-dma` to the DiT tensorizer options gives a 1.9% backbone speedup (244.1ms → 239.4ms per step) but reduces single-pass cosine similarity from 0.9999 to 0.996 due to reordered BF16 accumulations. Not enabled by default. Other tensorizer flags tested: `--cc-pipeline-tiling-factor` hurts DiT performance at all values tested (1/2/4/8), and `--enable-scalar-dge-vectorization` is neutral. + +**Gemma3 encoder** (tensorizer-optimized, 3.1x faster than original): +``` +--model-type=transformer -O1 --auto-cast=none --lnc=2 +--tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=1 + --vectorize-strided-dma --enable-scalar-dge-vectorization' +``` + +Environment: `NEURON_FUSE_SOFTMAX=1`, `NEURON_CUSTOM_SILU=1`, `NEURON_RT_STOCHASTIC_ROUNDING_EN=0` + +## Known Issues + +- **Cold start latency**: The DiT warmup pass takes ~139s and the first denoising step takes ~175s due to Neuron device initialization. Subsequent steps run at ~0.3s each. Gemma3 encoder NEFF rehydration adds ~362s on first load (one-time per instance). +- **CPU text encoding fallback**: Without Neuron-compiled Gemma3, text encoding takes ~162s on CPU. Use `--neuron-gemma` for 0.6s warm text encoding (~270x faster). +- **Two-stage cold start**: Two-stage mode loads two separate Neuron backbones sequentially (half-res and full-res), each with its own NEFF warmup. 
Total cold start overhead is ~2x single-stage. +- **BF16 TP accumulation**: The 0.972 cosine similarity over 8 denoising steps (vs CPU) is due to normal BF16 rounding across TP=4 ranks. Single forward pass accuracy is 0.9999. +- **No EFA**: The trn2.3xlarge single-instance setup does not use EFA for inter-node communication. NCCL/OFI warnings about EFA can be safely ignored. +- **CPU video decode bottleneck**: At 384×512 (25 frames), the CPU video decoder takes ~4.7s — over half of warm E2E time. At higher resolutions (1024×1536, 121 frames), CPU decode takes 78.7s. The TP=4 tiled Neuron VAE decoder reduces this to 23.5s (3.3x speedup). Use `--vae-compiled-dir` in `run_phase2.py` to enable Neuron decode. The tiled approach compiles the decoder at 4×16 latent (128×512 pixels) — H×W ≤ 64 is the SRAM limit — and decodes via overlapping spatial tiles with linear blending. + +## Source Files + +| File | Purpose | +|------|---------| +| `src/modeling_ltx23.py` | Core backbone: TP sharding, DistributedRMSNorm, SDPA replacement, TransformerArgs construction | +| `src/modeling_gemma3_encoder.py` | Custom Gemma3 encoder-only model: returns all 49 hidden states stacked, no KV cache | +| `src/pipeline.py` | NeuronTransformerWrapper: CPU preprocessing, backbone routing, mask handling, AdaLN deduplication, step-invariant caching | +| `src/compile_transformer.py` | DiT backbone compilation script — full-res 384×512 (torchrun --nproc_per_node=4) | +| `src/compile_transformer_halfres.py` | DiT backbone compilation script — half-res 192×256 for two-stage mode | +| `src/compile_gemma3.py` | Gemma3 encoder compilation script (parallel_model_trace) | +| `src/shard_gemma3_weights.py` | Pre-shard Gemma3 weights to per-rank files for fast loading | +| `src/shard_backbone_weights.py` | Pre-shard DiT backbone weights to per-rank files for fast loading | +| `src/load_with_weights.py` | DiT backbone weight sharding and injection utilities | +| `src/generate_ltx23.py` | E2E generation pipeline 
(text encoding, single/two-stage denoising, VAE decode, upscaling, image-to-video) | +| `src/run_phase2.py` | Phase 2 standalone script: spatial upsample + S2 denoising + Neuron/CPU VAE decode | +| `src/modeling_vae_23.py` | TP-sharded LTX-2.3 VAE decoder (~560 lines), ColumnRowParallelConv3d, CausalConv3d | +| `src/compile_vae_23.py` | VAE decoder compilation script (TP=4, 4×16 tile, 121 frames) | +| `src/tiled_vae_decode_23.py` | Tiled decode with overlap blending for arbitrary resolutions | +| `src/compile_benchmark.py` | Full benchmark compilation script (encoder + S1 + S2 backbone) | +| `src/application.py` | NeuronLTX23Application compositor for NxDI Application path | diff --git a/contrib/models/LTX-2.3/src/__init__.py b/contrib/models/LTX-2.3/src/__init__.py new file mode 100644 index 00000000..c83cc749 --- /dev/null +++ b/contrib/models/LTX-2.3/src/__init__.py @@ -0,0 +1,35 @@ +from .modeling_ltx23 import ( + NeuronLTX23TransformerBackbone, + NeuronLTX23BackboneApplication, + LTX23BackboneInferenceConfig, + ModelWrapperLTX23Backbone, + DistributedRMSNorm, +) +from .modeling_gemma3_encoder import ( + Gemma3TextEncoderModel, + Gemma3EncoderInferenceConfig, + ModelWrapperGemma3Encoder, + NeuronGemma3EncoderApplication, + convert_hf_gemma3_to_encoder_state_dict, +) +from .pipeline import NeuronTransformerWrapper +from .application import NeuronLTX23Application + +__all__ = [ + # Backbone + "NeuronLTX23TransformerBackbone", + "NeuronLTX23BackboneApplication", + "LTX23BackboneInferenceConfig", + "ModelWrapperLTX23Backbone", + "DistributedRMSNorm", + # Gemma3 encoder + "Gemma3TextEncoderModel", + "Gemma3EncoderInferenceConfig", + "ModelWrapperGemma3Encoder", + "NeuronGemma3EncoderApplication", + "convert_hf_gemma3_to_encoder_state_dict", + # Pipeline + "NeuronTransformerWrapper", + # Top-level application + "NeuronLTX23Application", +] diff --git a/contrib/models/LTX-2.3/src/application.py b/contrib/models/LTX-2.3/src/application.py new file mode 100644 index 
00000000..56eab0ec --- /dev/null +++ b/contrib/models/LTX-2.3/src/application.py @@ -0,0 +1,268 @@ +""" +NeuronLTX23Application — Top-level compositor for LTX-2.3 on Neuron +=================================================================== + +Manages the Gemma3 text encoder and DiT backbone as NeuronApplicationBase +subclasses, following the Flux pattern from NxDI core +(src/neuronx_distributed_inference/models/diffusers/flux/application.py). + +Key difference from Flux: LTX-2.3 uses **sequential NeuronCore sharing** +on trn2.3xlarge. Both the Gemma3 12B encoder and 22B DiT backbone need +all 4 TP cores, so they cannot be loaded simultaneously. The compositor +manages load/unload cycling between sub-models. + +Usage: + from application import NeuronLTX23Application + + app = NeuronLTX23Application( + backbone_config=backbone_config, + encoder_config=encoder_config, + model_path="/path/to/ltx23-safetensors", + encoder_path="/path/to/gemma3-weights", + ) + + # Compile both sub-models + app.compile("/path/to/compiled/") + + # Load and run + app.load_text_encoder("/path/to/compiled/") + hidden_states = app.encode_text(input_ids, attention_mask) + app.unload_text_encoder() + + app.load_backbone("/path/to/compiled/") + video_out, audio_out = app.backbone_app(*backbone_inputs) + app.unload_backbone() +""" + +import logging +import os +from typing import Optional + +import torch +import torch.nn as nn + +logger = logging.getLogger(__name__) + +try: + from neuronx_distributed_inference.models.config import InferenceConfig + + _NXDI_AVAILABLE = True +except ImportError: + _NXDI_AVAILABLE = False + + +class NeuronLTX23Application(nn.Module): + """Top-level compositor for LTX-2.3 on Neuron. + + NOT a NeuronApplicationBase subclass (same pattern as NeuronFluxApplication). + Holds two sub-applications that share NeuronCores sequentially. 
+ + Sub-applications: + backbone_app: NeuronLTX23BackboneApplication (22B DiT transformer) + encoder_app: NeuronGemma3EncoderApplication (12B Gemma3 text encoder) + + CPU components (VAE decoders, vocoder, embeddings processor) are managed + separately by the caller or pipeline. + """ + + def __init__( + self, + backbone_config: "InferenceConfig", + encoder_config: "InferenceConfig", + model_path: str, + encoder_path: Optional[str] = None, + ): + """ + Args: + backbone_config: InferenceConfig for the DiT backbone + encoder_config: InferenceConfig for the Gemma3 encoder + model_path: Path to the LTX-2.3 model weights (safetensors dir or file) + encoder_path: Path to Gemma3 weights (HuggingFace dir). If None, + defaults to model_path + "/text_encoder" + """ + super().__init__() + try: + from .modeling_ltx23 import NeuronLTX23BackboneApplication + from .modeling_gemma3_encoder import NeuronGemma3EncoderApplication + except ImportError: + from modeling_ltx23 import NeuronLTX23BackboneApplication + from modeling_gemma3_encoder import NeuronGemma3EncoderApplication + + self.model_path = model_path + self.encoder_path = encoder_path or os.path.join(model_path, "text_encoder") + + self.backbone_config = backbone_config + self.encoder_config = encoder_config + + self.backbone_app = NeuronLTX23BackboneApplication( + model_path=model_path, config=backbone_config + ) + self.encoder_app = NeuronGemma3EncoderApplication( + model_path=self.encoder_path, config=encoder_config + ) + + self._backbone_loaded = False + self._encoder_loaded = False + + def compile(self, compiled_model_path, debug=False, pre_shard_weights_hook=None): + """Compile both sub-models to separate subdirectories. 
+ + Creates: + compiled_model_path/backbone/model.pt + weights/ + compiled_model_path/text_encoder/model.pt + weights/ + """ + backbone_path = os.path.join(compiled_model_path, "backbone/") + encoder_path = os.path.join(compiled_model_path, "text_encoder/") + + logger.info("Compiling DiT backbone to %s", backbone_path) + self.backbone_app.compile(backbone_path, debug, pre_shard_weights_hook) + + logger.info("Compiling Gemma3 encoder to %s", encoder_path) + self.encoder_app.compile(encoder_path, debug, pre_shard_weights_hook) + + logger.info("Compilation complete for both sub-models") + + def compile_backbone( + self, compiled_model_path, debug=False, pre_shard_weights_hook=None + ): + """Compile only the DiT backbone.""" + backbone_path = os.path.join(compiled_model_path, "backbone/") + logger.info("Compiling DiT backbone to %s", backbone_path) + self.backbone_app.compile(backbone_path, debug, pre_shard_weights_hook) + + def compile_encoder( + self, compiled_model_path, debug=False, pre_shard_weights_hook=None + ): + """Compile only the Gemma3 encoder.""" + encoder_path = os.path.join(compiled_model_path, "text_encoder/") + logger.info("Compiling Gemma3 encoder to %s", encoder_path) + self.encoder_app.compile(encoder_path, debug, pre_shard_weights_hook) + + def load_text_encoder( + self, + compiled_model_path, + start_rank_id=None, + local_ranks_size=None, + skip_warmup=False, + ): + """Load the compiled Gemma3 encoder to NeuronCores. + + Must be called before encode_text(). Should be unloaded before + loading the backbone (sequential NeuronCore sharing). + """ + if self._backbone_loaded: + raise RuntimeError( + "Cannot load text encoder while backbone is loaded. " + "Call unload_backbone() first (sequential NeuronCore sharing)." 
+ ) + encoder_path = os.path.join(compiled_model_path, "text_encoder/") + logger.info("Loading Gemma3 encoder from %s", encoder_path) + self.encoder_app.load( + encoder_path, start_rank_id, local_ranks_size, skip_warmup + ) + self._encoder_loaded = True + logger.info("Gemma3 encoder loaded and ready") + + def unload_text_encoder(self): + """Unload the Gemma3 encoder from NeuronCores. + + Frees NeuronCore resources so the backbone can be loaded. + """ + if not self._encoder_loaded: + logger.warning("Text encoder not loaded, nothing to unload") + return + # Destroy the traced model to release NRT resources + if ( + hasattr(self.encoder_app, "traced_model") + and self.encoder_app.traced_model is not None + ): + del self.encoder_app.traced_model + self.encoder_app.traced_model = None + for model_wrapper in self.encoder_app.models: + model_wrapper.model = None + self.encoder_app.is_loaded_to_neuron = False + self._encoder_loaded = False + + import gc + + gc.collect() + logger.info("Gemma3 encoder unloaded from NeuronCores") + + def load_backbone( + self, + compiled_model_path, + start_rank_id=None, + local_ranks_size=None, + skip_warmup=False, + ): + """Load the compiled DiT backbone to NeuronCores. + + Must be called before running the denoising loop. Should be unloaded + before loading the encoder (sequential NeuronCore sharing). + """ + if self._encoder_loaded: + raise RuntimeError( + "Cannot load backbone while text encoder is loaded. " + "Call unload_text_encoder() first (sequential NeuronCore sharing)." + ) + backbone_path = os.path.join(compiled_model_path, "backbone/") + logger.info("Loading DiT backbone from %s", backbone_path) + self.backbone_app.load( + backbone_path, start_rank_id, local_ranks_size, skip_warmup + ) + self._backbone_loaded = True + logger.info("DiT backbone loaded and ready") + + def unload_backbone(self): + """Unload the DiT backbone from NeuronCores. + + Frees NeuronCore resources so the encoder can be loaded. 
+ """ + if not self._backbone_loaded: + logger.warning("Backbone not loaded, nothing to unload") + return + if ( + hasattr(self.backbone_app, "traced_model") + and self.backbone_app.traced_model is not None + ): + del self.backbone_app.traced_model + self.backbone_app.traced_model = None + for model_wrapper in self.backbone_app.models: + model_wrapper.model = None + self.backbone_app.is_loaded_to_neuron = False + self._backbone_loaded = False + + import gc + + gc.collect() + logger.info("DiT backbone unloaded from NeuronCores") + + def encode_text(self, input_ids, attention_mask): + """Run the Gemma3 encoder. Must have called load_text_encoder() first. + + Args: + input_ids: (B, seq_len) int64 + attention_mask: (B, seq_len) int64 + + Returns: + stacked_hidden_states: (B, seq_len, hidden_size, 49) + """ + if not self._encoder_loaded: + raise RuntimeError( + "Text encoder not loaded. Call load_text_encoder() first." + ) + return self.encoder_app(input_ids, attention_mask) + + @property + def is_backbone_loaded(self): + return self._backbone_loaded + + @property + def is_encoder_loaded(self): + return self._encoder_loaded + + def __call__(self, *args, **kwargs): + """Forward pass through the backbone. Alias for backbone_app.forward().""" + if not self._backbone_loaded: + raise RuntimeError("Backbone not loaded. Call load_backbone() first.") + return self.backbone_app(*args, **kwargs) diff --git a/contrib/models/LTX-2.3/src/compile_benchmark.py b/contrib/models/LTX-2.3/src/compile_benchmark.py new file mode 100644 index 00000000..fa539318 --- /dev/null +++ b/contrib/models/LTX-2.3/src/compile_benchmark.py @@ -0,0 +1,347 @@ +""" +Compile all 3 Application components for the full LTX-2.3 benchmark on trn2.48xlarge. + +Components: + 1. Gemma3 encoder: TP=4, seq=512 + 2. Stage 1 backbone: 512x768, 121 frames, TP=4 (video_seq=16*24*16=6144) + 3. 
Stage 2 backbone: 1024x1536, 121 frames, TP=16 (video_seq=32*48*16=24576) + +Usage: + python compile_benchmark.py \ + --model-path /mnt/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \ + --encoder-path /mnt/models/gemma-3-12b \ + --output-base /mnt/models/compiled/benchmark + +This creates: + /mnt/models/compiled/benchmark/encoder_tp4/ (Gemma3 encoder TP=4) + /mnt/models/compiled/benchmark/s1_tp4/ (Stage 1 backbone TP=4) + /mnt/models/compiled/benchmark/s2_tp16/ (Stage 2 backbone TP=16) +""" + +import argparse +import gc +import json +import logging +import os +import sys +import time + +import torch + +logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s" +) +logger = logging.getLogger("compile_benchmark") + +# Benchmark dimensions +S1_HEIGHT = 512 +S1_WIDTH = 768 +S2_HEIGHT = 1024 +S2_WIDTH = 1536 +NUM_FRAMES = 121 +TEXT_SEQ = 256 +AUDIO_SEQ = 26 +ENCODER_SEQ = 512 +S1_TP = 4 +S2_TP = 16 + + +def load_ltx_config(model_path): + from safetensors import safe_open + + with safe_open(model_path, framework="pt") as f: + metadata = f.metadata() + return json.loads(metadata["config"]) + + +def compile_encoder(model_path, encoder_path, output_dir, tp_degree=S1_TP): + """Compile Gemma3 encoder via Application path.""" + from neuronx_distributed_inference.models.config import NeuronConfig + from modeling_ltx23 import LTX23BackboneInferenceConfig + from modeling_gemma3_encoder import Gemma3EncoderInferenceConfig, GEMMA3_12B_CONFIG + from application import NeuronLTX23Application + + config = load_ltx_config(model_path) + tc = config["transformer"] + + num_heads = tc["num_attention_heads"] + head_dim = tc["attention_head_dim"] + inner_dim = num_heads * head_dim + audio_num_heads = tc["audio_num_attention_heads"] + audio_head_dim = tc["audio_attention_head_dim"] + audio_inner_dim = audio_num_heads * audio_head_dim + audio_ca_dim = tc.get("audio_cross_attention_dim", 2048) + + # Use S1 dimensions for the backbone config (encoder 
doesn't care about backbone dims) + latent_h = S1_HEIGHT // 32 + latent_w = S1_WIDTH // 32 + latent_f = (NUM_FRAMES - 1) // 8 + 1 + video_seq = latent_f * latent_h * latent_w + + dtype = torch.bfloat16 + + backbone_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=video_seq, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + backbone_config = LTX23BackboneInferenceConfig( + neuron_config=backbone_neuron_config, + num_layers=tc["num_layers"], + num_attention_heads=num_heads, + attention_head_dim=head_dim, + inner_dim=inner_dim, + audio_num_attention_heads=audio_num_heads, + audio_attention_head_dim=audio_head_dim, + audio_inner_dim=audio_inner_dim, + audio_cross_attention_dim=audio_ca_dim, + video_seq=video_seq, + audio_seq=AUDIO_SEQ, + text_seq=TEXT_SEQ, + height=latent_h, + width=latent_w, + num_frames=latent_f, + ltx_config_dict=config, + ) + + encoder_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=ENCODER_SEQ, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + encoder_config = Gemma3EncoderInferenceConfig( + neuron_config=encoder_neuron_config, + vocab_size=GEMMA3_12B_CONFIG["vocab_size"], + hidden_size=GEMMA3_12B_CONFIG["hidden_size"], + num_hidden_layers=GEMMA3_12B_CONFIG["num_hidden_layers"], + num_attention_heads=GEMMA3_12B_CONFIG["num_attention_heads"], + num_key_value_heads=GEMMA3_12B_CONFIG["num_key_value_heads"], + head_dim=GEMMA3_12B_CONFIG["head_dim"], + intermediate_size=GEMMA3_12B_CONFIG["intermediate_size"], + rms_norm_eps=GEMMA3_12B_CONFIG["rms_norm_eps"], + rope_theta=GEMMA3_12B_CONFIG["rope_theta"], + max_position_embeddings=GEMMA3_12B_CONFIG["max_position_embeddings"], + query_pre_attn_scalar=GEMMA3_12B_CONFIG["query_pre_attn_scalar"], + pad_token_id=GEMMA3_12B_CONFIG["pad_token_id"], + ) + + app = NeuronLTX23Application( + backbone_config=backbone_config, + 
encoder_config=encoder_config, + model_path=model_path, + encoder_path=encoder_path, + ) + + logger.info("Compiling Gemma3 encoder (TP=%d, seq=%d)...", tp_degree, ENCODER_SEQ) + t0 = time.time() + app.compile_encoder(output_dir) + logger.info("Encoder compiled in %.1fs -> %s", time.time() - t0, output_dir) + + del app + gc.collect() + + +def compile_backbone( + model_path, encoder_path, output_dir, height, width, tp_degree, label="backbone" +): + """Compile DiT backbone via Application path.""" + from neuronx_distributed_inference.models.config import NeuronConfig + from modeling_ltx23 import LTX23BackboneInferenceConfig + from modeling_gemma3_encoder import Gemma3EncoderInferenceConfig, GEMMA3_12B_CONFIG + from application import NeuronLTX23Application + + config = load_ltx_config(model_path) + tc = config["transformer"] + + num_heads = tc["num_attention_heads"] + head_dim = tc["attention_head_dim"] + inner_dim = num_heads * head_dim + audio_num_heads = tc["audio_num_attention_heads"] + audio_head_dim = tc["audio_attention_head_dim"] + audio_inner_dim = audio_num_heads * audio_head_dim + audio_ca_dim = tc.get("audio_cross_attention_dim", 2048) + + latent_h = height // 32 + latent_w = width // 32 + latent_f = (NUM_FRAMES - 1) // 8 + 1 + video_seq = latent_f * latent_h * latent_w + + logger.info( + " %s: %dx%d -> latent %dx%dx%d -> video_seq=%d", + label, + height, + width, + latent_f, + latent_h, + latent_w, + video_seq, + ) + + dtype = torch.bfloat16 + + backbone_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=video_seq, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + backbone_config = LTX23BackboneInferenceConfig( + neuron_config=backbone_neuron_config, + num_layers=tc["num_layers"], + num_attention_heads=num_heads, + attention_head_dim=head_dim, + inner_dim=inner_dim, + audio_num_attention_heads=audio_num_heads, + audio_attention_head_dim=audio_head_dim, + 
audio_inner_dim=audio_inner_dim, + audio_cross_attention_dim=audio_ca_dim, + video_seq=video_seq, + audio_seq=AUDIO_SEQ, + text_seq=TEXT_SEQ, + height=latent_h, + width=latent_w, + num_frames=latent_f, + ltx_config_dict=config, + ) + + # Encoder config (needed for Application even if we only compile backbone) + encoder_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=ENCODER_SEQ, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + encoder_config = Gemma3EncoderInferenceConfig( + neuron_config=encoder_neuron_config, + vocab_size=GEMMA3_12B_CONFIG["vocab_size"], + hidden_size=GEMMA3_12B_CONFIG["hidden_size"], + num_hidden_layers=GEMMA3_12B_CONFIG["num_hidden_layers"], + num_attention_heads=GEMMA3_12B_CONFIG["num_attention_heads"], + num_key_value_heads=GEMMA3_12B_CONFIG["num_key_value_heads"], + head_dim=GEMMA3_12B_CONFIG["head_dim"], + intermediate_size=GEMMA3_12B_CONFIG["intermediate_size"], + rms_norm_eps=GEMMA3_12B_CONFIG["rms_norm_eps"], + rope_theta=GEMMA3_12B_CONFIG["rope_theta"], + max_position_embeddings=GEMMA3_12B_CONFIG["max_position_embeddings"], + query_pre_attn_scalar=GEMMA3_12B_CONFIG["query_pre_attn_scalar"], + pad_token_id=GEMMA3_12B_CONFIG["pad_token_id"], + ) + + app = NeuronLTX23Application( + backbone_config=backbone_config, + encoder_config=encoder_config, + model_path=model_path, + encoder_path=encoder_path, + ) + + logger.info("Compiling %s (TP=%d, video_seq=%d)...", label, tp_degree, video_seq) + t0 = time.time() + app.compile_backbone(output_dir) + logger.info("%s compiled in %.1fs -> %s", label, time.time() - t0, output_dir) + + del app + gc.collect() + + +def main(): + parser = argparse.ArgumentParser(description="Compile all benchmark components") + parser.add_argument("--model-path", required=True, help="LTX-2.3 safetensors path") + parser.add_argument("--encoder-path", required=True, help="Gemma3 12B model dir") + parser.add_argument("--output-base", 
required=True, help="Base output directory") + parser.add_argument( + "--skip-encoder", action="store_true", help="Skip encoder compilation" + ) + parser.add_argument( + "--skip-s1", action="store_true", help="Skip Stage 1 backbone compilation" + ) + parser.add_argument( + "--skip-s2", action="store_true", help="Skip Stage 2 backbone compilation" + ) + args = parser.parse_args() + + os.makedirs(args.output_base, exist_ok=True) + + total_t0 = time.time() + + # 1. Gemma3 encoder at TP=4 + if not args.skip_encoder: + encoder_dir = os.path.join(args.output_base, "encoder_tp4") + logger.info("\n" + "=" * 60) + logger.info("STEP 1: Compile Gemma3 encoder (TP=%d)", S1_TP) + logger.info("=" * 60) + compile_encoder( + args.model_path, args.encoder_path, encoder_dir, tp_degree=S1_TP + ) + else: + logger.info("Skipping encoder compilation") + + # 2. Stage 1 backbone at 512x768, TP=4 + if not args.skip_s1: + s1_dir = os.path.join(args.output_base, "s1_tp4") + logger.info("\n" + "=" * 60) + logger.info( + "STEP 2: Compile Stage 1 backbone (%dx%d, TP=%d)", + S1_HEIGHT, + S1_WIDTH, + S1_TP, + ) + logger.info("=" * 60) + compile_backbone( + args.model_path, + args.encoder_path, + s1_dir, + height=S1_HEIGHT, + width=S1_WIDTH, + tp_degree=S1_TP, + label="Stage 1 backbone", + ) + else: + logger.info("Skipping Stage 1 backbone compilation") + + # 3. 
Stage 2 backbone at 1024x1536, TP=16 + if not args.skip_s2: + s2_dir = os.path.join(args.output_base, "s2_tp16") + logger.info("\n" + "=" * 60) + logger.info( + "STEP 3: Compile Stage 2 backbone (%dx%d, TP=%d)", + S2_HEIGHT, + S2_WIDTH, + S2_TP, + ) + logger.info("=" * 60) + compile_backbone( + args.model_path, + args.encoder_path, + s2_dir, + height=S2_HEIGHT, + width=S2_WIDTH, + tp_degree=S2_TP, + label="Stage 2 backbone", + ) + else: + logger.info("Skipping Stage 2 backbone compilation") + + logger.info("\n" + "=" * 60) + logger.info("ALL COMPILATIONS COMPLETE in %.1fs", time.time() - total_t0) + logger.info("=" * 60) + logger.info(" Encoder: %s/encoder_tp4/", args.output_base) + logger.info(" Stage 1: %s/s1_tp4/", args.output_base) + logger.info(" Stage 2: %s/s2_tp16/", args.output_base) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/compile_gemma3.py b/contrib/models/LTX-2.3/src/compile_gemma3.py new file mode 100644 index 00000000..46c9b57a --- /dev/null +++ b/contrib/models/LTX-2.3/src/compile_gemma3.py @@ -0,0 +1,180 @@ +#!/usr/bin/env python3 +""" +Compile Gemma 3-12B text encoder for Neuron TP=4. + +Produces a compiled encoder graph that takes (input_ids, attention_mask) +and returns all 49 hidden states stacked as (B, seq_len, 3840, 49). + +Compiler flags optimized for throughput: + --auto-cast=none with tensorizer flags for compute/communication overlap + and DMA vectorization. Achieves ~644ms forward pass (3.1x faster than + the original flags with --enable-saturate-infinity and + --enable-mixed-precision-accumulation which added overhead). 
+ +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + NEURON_FUSE_SOFTMAX=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + python3 compile_gemma3.py [--compile-dir DIR] [--seq-len 1024] +""" + +import argparse +import gc +import os +import sys +import time + +import torch + +os.environ.setdefault("NEURON_FUSE_SOFTMAX", "1") +os.environ.setdefault("NEURON_RT_STOCHASTIC_ROUNDING_EN", "0") + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +TP_DEGREE = 4 +BATCH = 1 +NUM_LAYERS = 48 + + +def get_model_fn(tp_degree=TP_DEGREE): + from modeling_gemma3_encoder import Gemma3TextEncoderModel + + model = Gemma3TextEncoderModel( + vocab_size=262208, + hidden_size=3840, + num_hidden_layers=NUM_LAYERS, + num_attention_heads=16, + num_key_value_heads=8, + head_dim=256, + intermediate_size=15360, + rms_norm_eps=1e-6, + rope_theta=1_000_000.0, + max_position_embeddings=131072, + query_pre_attn_scalar=256, + pad_token_id=0, + dtype=torch.bfloat16, + ) + model = model.to(dtype=torch.bfloat16) + model.eval() + return model, None + + +def main(): + parser = argparse.ArgumentParser(description="Compile Gemma3 encoder for Neuron") + parser.add_argument( + "--compile-dir", + default="/home/ubuntu/gemma3_encoder_compiled", + help="Directory to save compiled model", + ) + parser.add_argument( + "--seq-len", + type=int, + default=1024, + help="Sequence length to compile for (default: 1024)", + ) + args = parser.parse_args() + + seq_len = args.seq_len + compile_dir = args.compile_dir + + print("=" * 60) + print("Compiling Gemma3 encoder (TP=%d, seq=%d)" % (TP_DEGREE, seq_len)) + print("=" * 60) + + import torch_neuronx + from neuronx_distributed.trace import parallel_model_trace, parallel_model_save + + os.makedirs(compile_dir, exist_ok=True) + + input_ids = torch.zeros(BATCH, seq_len, dtype=torch.int64) + attention_mask = torch.ones(BATCH, seq_len, dtype=torch.int64) + + # Tensorizer flags for compute/communication overlap and DMA vectorization. 
+ # Removing --enable-saturate-infinity and --enable-mixed-precision-accumulation + # yields a 3.1x speedup (2000ms -> 644ms) with no accuracy degradation. + compiler_args = ( + "--model-type=transformer -O1 --auto-cast=none --lnc=2 " + "--tensorizer-options='--enable-ccop-compute-overlap " + "--cc-pipeline-tiling-factor=1 " + "--vectorize-strided-dma " + "--enable-scalar-dge-vectorization'" + ) + os.environ["NEURON_CC_FLAGS"] = compiler_args + print(" Compiler flags: %s" % compiler_args) + + t0 = time.time() + traced = parallel_model_trace( + get_model_fn, + (input_ids, attention_mask), + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(compile_dir, "compiler_workdir"), + compiler_args=compiler_args, + inline_weights_to_neff=False, + ) + elapsed = time.time() - t0 + print(" Compile: %.1fs (%.1f min)" % (elapsed, elapsed / 60)) + + parallel_model_save(traced, compile_dir) + tp0_size = os.path.getsize(os.path.join(compile_dir, "tp_0.pt")) / 1e9 + print(" Saved tp_0.pt: %.2f GB" % tp0_size) + + # Quick forward test with random weights + print("\nLoading for forward test...") + from neuronx_distributed.trace.trace import ( + _mock_parallel_state, + init_on_device, + get_sharded_checkpoint, + replace_weights, + TensorParallelNeuronModel, + ) + + _mock_parallel_state(1, 0) + with init_on_device(torch.device("cpu")): + ref_model, _ = get_model_fn() + checkpoint = ref_model.state_dict() + total_params = sum(v.numel() for v in checkpoint.values()) + print( + " Checkpoint: %d keys, %.2f B params" % (len(checkpoint), total_params / 1e9) + ) + del ref_model + gc.collect() + + models = [] + for rank in range(TP_DEGREE): + t0r = time.time() + ckpt = {k: v.clone() for k, v in checkpoint.items()} + _mock_parallel_state(TP_DEGREE, rank) + with init_on_device(torch.device("meta")): + model, _ = get_model_fn() + get_sharded_checkpoint(ckpt, model, rank, TP_DEGREE) + with torch_neuronx.contexts.disable_nrt_load(): + traced_model = torch.jit.load(os.path.join(compile_dir, 
"tp_0.pt")) + replace_weights(traced_model, ckpt) + models.append(traced_model) + print(" [rank %d] %.1fs" % (rank, time.time() - t0r)) + gc.collect() + del checkpoint + gc.collect() + + compiled = TensorParallelNeuronModel(models) + print(" All %d ranks loaded" % TP_DEGREE) + + print("\nForward pass...") + _ = compiled(input_ids, attention_mask) # warmup + t0 = time.time() + output = compiled(input_ids, attention_mask) + elapsed = time.time() - t0 + + expected = (BATCH, seq_len, 3840, NUM_LAYERS + 1) + print(" Time: %.3fs" % elapsed) + print(" Output shape: %s (expected %s)" % (tuple(output.shape), expected)) + print(" Output dtype: %s" % output.dtype) + print(" NaN: %s" % ("FAIL" if torch.isnan(output).any() else "PASS")) + print(" Inf: %s" % ("FAIL" if torch.isinf(output).any() else "PASS")) + if tuple(output.shape) == expected: + print("\n *** GEMMA3 ENCODER COMPILE + FORWARD: PASSED ***") + else: + print("\n *** SHAPE MISMATCH -- FAILED ***") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/compile_transformer.py b/contrib/models/LTX-2.3/src/compile_transformer.py new file mode 100644 index 00000000..61401638 --- /dev/null +++ b/contrib/models/LTX-2.3/src/compile_transformer.py @@ -0,0 +1,378 @@ +#!/usr/bin/env python3 +""" +LTX-2.3 Full 48-Block TP=4 Compilation +======================================= +Compiles the LTX-2.3 22B DiT transformer backbone (48 blocks) with TP=4, +LNC=2 for trn2.3xlarge using native ltx-core model. 
+ +Uses the NeuronLTX23TransformerBackbone from modeling_ltx23.py which: +- Builds the model via LTXModelConfigurator.from_config() +- Applies TP sharding (Column/RowParallelLinear, DistributedRMSNorm) +- Constructs TransformerArgs from flat tensors for native block forward +- Applies SPMDRank for per-rank RoPE slicing + +Output: COMPILE_DIR/tp_0.pt + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + NEURON_FUSE_SOFTMAX=1 NEURON_CUSTOM_SILU=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + torchrun --nproc_per_node=4 compile_transformer.py +""" + +import os +import sys +import time +import gc +import json +import shutil +import torch +import torch.nn as nn + +os.environ["NEURON_FUSE_SOFTMAX"] = "1" +os.environ["NEURON_CUSTOM_SILU"] = "1" +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +TP_DEGREE = 4 +BATCH = 1 +VIDEO_SEQ = 768 # 4 frames * 12 * 16 patches (384x512 resolution, patchsize=1x2x2) +LATENT_H = 12 # 384 / 32 +LATENT_W = 16 # 512 / 32 +AUDIO_SEQ = 26 # audio tokens for ~2s +TEXT_SEQ = 256 # max text sequence length + +MODEL_PATH = "/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors" +COMPILE_DIR = "/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_v2" + +# Architecture constants (from safetensors metadata) +NUM_HEADS = 32 +AUDIO_NUM_HEADS = 32 +INNER_DIM = 4096 # NUM_HEADS * 128 +AUDIO_INNER_DIM = 2048 # AUDIO_NUM_HEADS * 64 +AUDIO_CA_DIM = 2048 # audio_cross_attention_dim + + +def load_config_from_safetensors(): + """Load the transformer config from safetensors metadata.""" + from safetensors import safe_open + + with safe_open(MODEL_PATH, framework="pt") as f: + metadata = f.metadata() + config = json.loads(metadata["config"]) + tc = config["transformer"] + print( + f" Loaded config: {tc['num_layers']} layers, " + f"{tc['num_attention_heads']} heads, " + f"head_dim={tc['attention_head_dim']}", + flush=True, + ) + return config + + +def precompute_inputs(config): + 
"""Build example inputs using the native ltx-core preprocessing. + + Creates a temporary unsharded model to run patchify_proj, adaln_single, + rope, etc. on CPU, producing the 24 flat tensors for the backbone. + + Uses VideoLatentTools/AudioLatentTools for correct position generation. + """ + from modeling_ltx23 import replace_sdpa_with_bmm + + replace_sdpa_with_bmm() + + from ltx_core.model.transformer.model_configurator import LTXModelConfigurator + from ltx_core.model.transformer.modality import Modality + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + + dtype = torch.bfloat16 + + # Build full unsharded model on CPU + ltx_model = LTXModelConfigurator.from_config(config) + ltx_model = ltx_model.to(dtype=dtype) + ltx_model.eval() + + torch.manual_seed(123) + + # Video: patch_size=1 (no spatial patchification in DiT), VAE channels=128 + # 768 tokens = 4 frames * 12h * 16w in the VAE latent grid + video_shape = VideoLatentShape( + batch=BATCH, channels=128, frames=4, height=LATENT_H, width=LATENT_W + ) + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() # time=8, height=32, width=32 + video_tools = VideoLatentTools( + target_shape=video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=24.0, + ) + video_state = video_tools.create_initial_state(device="cpu", dtype=dtype) + + # Audio: VAE z_channels=8, mel_bins_latent=16, patchified dim=128 + audio_shape = AudioLatentShape( + batch=BATCH, channels=8, frames=AUDIO_SEQ, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools(patchifier=a_patchifier, target_shape=audio_shape) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype) + + # CRITICAL: Use random noise as compile-time latent, NOT zeros from + # create_initial_state(). 
The denoising loop starts at sigma=1.0 with + # pure noise, which produces hidden_states ~10x larger after patchify_proj. + # Compiling with zeros causes numerical issues in the Neuron graph when + # actual noise-level inputs are fed at runtime. + # State is a frozen dataclass so we create noise tensors separately. + video_latent = torch.randn_like(video_state.latent) + audio_latent = torch.randn_like(audio_state.latent) + + print( + f" Video: latent={video_latent.shape}, " + f"positions={video_state.positions.shape}", + flush=True, + ) + print( + f" Audio: latent={audio_latent.shape}, " + f"positions={audio_state.positions.shape}", + flush=True, + ) + print( + f" Video latent norm: {video_latent.float().norm():.2f} (noise, not zeros)", + flush=True, + ) + + # Sigma for timestep: use sigma=1.0 (start of denoising) to compile + # with representative input magnitudes. The compiled graph must handle + # all sigma values from 1.0 to 0.0 during denoising. + sigma = torch.tensor([1.0], dtype=dtype) + v_ts = sigma.unsqueeze(1).expand(BATCH, video_latent.shape[1]) + a_ts = sigma.unsqueeze(1).expand(BATCH, audio_latent.shape[1]) + + # Context: already projected by connector (random for compilation) + ctx_v = torch.randn(BATCH, TEXT_SEQ, INNER_DIM, dtype=dtype) + ctx_a = torch.randn(BATCH, TEXT_SEQ, AUDIO_INNER_DIM, dtype=dtype) + ctx_mask = torch.ones(BATCH, TEXT_SEQ, dtype=dtype) + ctx_mask[:, 50:] = 0 # mask out padding + + video_mod = Modality( + latent=video_latent, + sigma=sigma, + timesteps=v_ts, + positions=video_state.positions, + context=ctx_v, + enabled=True, + context_mask=ctx_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_latent, + sigma=sigma, + timesteps=a_ts, + positions=audio_state.positions, + context=ctx_a, + enabled=True, + context_mask=ctx_mask, + attention_mask=None, + ) + + # Run model preprocessors to get TransformerArgs + with torch.no_grad(): + va = ltx_model.video_args_preprocessor.prepare(video_mod, audio_mod) + aa = 
ltx_model.audio_args_preprocessor.prepare(audio_mod, video_mod) + + # Extract flat tensors from TransformerArgs + video_pe_cos, video_pe_sin = va.positional_embeddings + audio_pe_cos, audio_pe_sin = aa.positional_embeddings + ca_video_pe_cos, ca_video_pe_sin = va.cross_positional_embeddings + ca_audio_pe_cos, ca_audio_pe_sin = aa.cross_positional_embeddings + + inputs = ( + va.x.to(dtype), # hidden_states + aa.x.to(dtype), # audio_hidden_states + va.context.to(dtype), # encoder_hidden_states + aa.context.to(dtype), # audio_encoder_hidden_states + va.timesteps.to(dtype), # temb (per-token, 9*inner_dim) + aa.timesteps.to(dtype), # temb_audio + va.embedded_timestep.to(dtype), # embedded_timestep (per-token) + aa.embedded_timestep.to(dtype), # audio_embedded_timestep + va.cross_scale_shift_timestep.to(dtype), # video_ca_ss + aa.cross_scale_shift_timestep.to(dtype), # audio_ca_ss + va.cross_gate_timestep.to(dtype), # video_ca_gate + aa.cross_gate_timestep.to(dtype), # audio_ca_gate + video_pe_cos.to(dtype), # video_rot_cos + video_pe_sin.to(dtype), # video_rot_sin + audio_pe_cos.to(dtype), # audio_rot_cos + audio_pe_sin.to(dtype), # audio_rot_sin + ca_video_pe_cos.to(dtype), # ca_video_rot_cos + ca_video_pe_sin.to(dtype), # ca_video_rot_sin + ca_audio_pe_cos.to(dtype), # ca_audio_rot_cos + ca_audio_pe_sin.to(dtype), # ca_audio_rot_sin + va.context_mask.to(dtype), # encoder_attention_mask + aa.context_mask.to( + dtype + ).clone(), # audio_encoder_attention_mask (must be distinct tensor object) + va.prompt_timestep.to(dtype), # prompt_timestep + aa.prompt_timestep.to(dtype), # audio_prompt_timestep + ) + + print(f"\n Generated {len(inputs)} input tensors", flush=True) + for i, t in enumerate(inputs): + print(f" [{i:2d}] {str(t.shape):40s} {t.dtype}", flush=True) + + del ltx_model + gc.collect() + return inputs + + +# Module-level reference for get_model_fn closure +_CONFIG = None + + +def get_model_fn(): + """Build the TP-sharded backbone model on this rank. 
+ + Called by the NxD tracing machinery. Must return (model, input_output_aliases). + """ + from modeling_ltx23 import ( + NeuronLTX23TransformerBackbone, + replace_sdpa_with_bmm, + ) + from neuronx_distributed.parallel_layers import parallel_state + + replace_sdpa_with_bmm() + + # Build a minimal InferenceConfig-like object with the attributes the backbone needs + class SimpleConfig: + pass + + config = SimpleConfig() + + # neuron_config + class NeuronConfigLike: + tp_degree = TP_DEGREE + world_size = TP_DEGREE + torch_dtype = torch.bfloat16 + + config.neuron_config = NeuronConfigLike() + config.ltx_config_dict = _CONFIG # The full config dict from safetensors + + backbone = NeuronLTX23TransformerBackbone(config) + backbone.eval() + + return backbone, None + + +def main(): + global _CONFIG + + import torch_neuronx + import torch_xla.core.xla_model as xm + from neuronx_distributed.parallel_layers import parallel_state + from neuronx_distributed.parallel_layers.checkpointing import NXD_SKIP_RENDEZVOUS + from neuronx_distributed.parallel_layers.utils import requires_init_pg_override + + rank = int(os.environ.get("RANK", "0")) + + # Initialize process group (torchrun sets env vars) + if requires_init_pg_override(): + torch.distributed.init_process_group("xla", init_method="pjrt://") + else: + torch.distributed.init_process_group("xla") + parallel_state.initialize_model_parallel(tensor_model_parallel_size=TP_DEGREE) + torch.multiprocessing.set_sharing_strategy("file_system") + + if rank == 0: + print("=" * 60, flush=True) + print("LTX-2.3 Full Compile: TP=%d, LNC=2" % TP_DEGREE, flush=True) + print("=" * 60, flush=True) + + print("\n[1/4] Loading model config...", flush=True) + full_config = load_config_from_safetensors() + _CONFIG = full_config + tc = full_config["transformer"] + num_layers = tc.get("num_layers", "?") + + print("\n[2/4] Precomputing inputs (%s blocks)..." 
% num_layers, flush=True) + example_inputs = precompute_inputs(full_config) + print(f" Got {len(example_inputs)} inputs", flush=True) + + print("\n[3/4] Building TP-sharded model (rank 0)...", flush=True) + os.makedirs(COMPILE_DIR, exist_ok=True) + + os.environ[NXD_SKIP_RENDEZVOUS] = "1" + try: + model, input_output_alias = get_model_fn() + finally: + del os.environ[NXD_SKIP_RENDEZVOUS] + + print(f"\n[4/4] Compiling {num_layers} blocks...", flush=True) + rank_workdir = os.path.join(COMPILE_DIR, "_tp0") + if os.path.exists(rank_workdir): + shutil.rmtree(rank_workdir) + + compiler_args = [ + "--model-type=transformer", + "-O1", + "--auto-cast", + "matmult", + "--lnc", + "2", + "--tensorizer-options=--enable-ccop-compute-overlap", + ] + + t0 = time.time() + neff_filename, metaneff, flattener, packer, weights = ( + torch_neuronx.xla_impl.trace._trace( + model, + example_inputs, + None, + input_output_alias, + rank_workdir, + compiler_args, + False, + ) + ) + + # Debug: print flattener layout and test extraction + from torch_neuronx.xla_impl.structure import extract as struct_extract + + print(f" Flattener layout: {flattener.layout}", flush=True) + test_layout, test_uniques, test_constants = struct_extract(example_inputs) + print(f" Input layout: {test_layout}", flush=True) + print(f" Match: {flattener.layout == test_layout}", flush=True) + print(f" Flattener exclude: {flattener.exclude}", flush=True) + + traced_model = torch_neuronx.xla_impl.trace.create_neuron_model( + neff_filename, + metaneff, + flattener, + packer, + example_inputs, + input_output_alias, + weights, + ) + + tp_0_path = os.path.join(COMPILE_DIR, "tp_0.pt") + torch.jit.save(traced_model, tp_0_path) + elapsed = time.time() - t0 + size_gb = os.path.getsize(tp_0_path) / 1e9 + print(f" Compiled and saved in {elapsed:.1f}s", flush=True) + print(f" Output: {tp_0_path} ({size_gb:.1f} GB)", flush=True) + + # All ranks must reach rendezvous + xm.rendezvous("done-compilation") + if rank == 0: + print("\nAll 
ranks rendezvous'd. Compilation complete!", flush=True) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/compile_transformer_halfres.py b/contrib/models/LTX-2.3/src/compile_transformer_halfres.py new file mode 100644 index 00000000..476d8c4a --- /dev/null +++ b/contrib/models/LTX-2.3/src/compile_transformer_halfres.py @@ -0,0 +1,31 @@ +#!/usr/bin/env python3 +""" +LTX-2.3 Half-Resolution Compilation for Stage 1 +================================================= +Compiles the LTX-2.3 22B DiT transformer backbone for HALF resolution +(192x256) as required by the distilled model's two-stage pipeline. + +The distilled model generates at half resolution in Stage 1, then +upscales to full resolution in Stage 2. + +Half-res latent grid: 6x8 (192/32 x 256/32) +Video tokens: 4 frames * 6 * 8 = 192 + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + NEURON_FUSE_SOFTMAX=1 NEURON_CUSTOM_SILU=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 \ + torchrun --nproc_per_node=4 compile_transformer_halfres.py +""" + +import compile_transformer + +# Override constants for half-resolution Stage 1 +compile_transformer.VIDEO_SEQ = 192 # 4 frames * 6h * 8w (192x256 resolution) +compile_transformer.LATENT_H = 6 # 192 / 32 +compile_transformer.LATENT_W = 8 # 256 / 32 +compile_transformer.COMPILE_DIR = ( + "/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_halfres" +) + +if __name__ == "__main__": + compile_transformer.main() diff --git a/contrib/models/LTX-2.3/src/compile_vae_23.py b/contrib/models/LTX-2.3/src/compile_vae_23.py new file mode 100644 index 00000000..c91bd171 --- /dev/null +++ b/contrib/models/LTX-2.3/src/compile_vae_23.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +""" +LTX-2.3 VAE Decoder — Standalone Compilation Script +==================================================== +Compiles the tensor-parallel VAE decoder for Neuron. + +Default tile: 4x16 latent (128x512 pixels) — optimal for 1024x1536 output. 
+ +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + pip install "ltx-core @ git+https://github.com/Lightricks/LTX-2.git@ae855f8538843825f9015a419cf4ba5edaf5eec2#subdirectory=packages/ltx-core" + + # Compile 4x16 tile (recommended for 1024x1536 output) + NEURON_RT_VISIBLE_CORES=0-3 python compile_vae_23.py \\ + --tp-degree 4 --height 128 --width 512 \\ + --model-path /mnt/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors + +Notes: + - Tile area must satisfy H_latent * W_latent <= 64 (SRAM limit) + - 4x16 tiles are ~12.5% faster per-tile than 8x8 tiles + - LTX-2.3 has timestep_conditioning=True (unlike LTX-2) +""" + +import argparse +import os +import sys + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +from modeling_vae_23 import compile_vae_decoder + + +def main(): + parser = argparse.ArgumentParser(description="Compile LTX-2.3 VAE decoder with TP") + parser.add_argument( + "--height", + type=int, + default=128, + help="Tile pixel height (default 128 for 4-latent)", + ) + parser.add_argument( + "--width", + type=int, + default=512, + help="Tile pixel width (default 512 for 16-latent)", + ) + parser.add_argument( + "--num-frames", type=int, default=121, help="Number of video frames" + ) + parser.add_argument( + "--tp-degree", type=int, default=4, help="Tensor parallel degree" + ) + parser.add_argument("--output-dir", type=str, default="/home/ubuntu/ltx23_vae_tp4") + parser.add_argument( + "--compiler-workdir", type=str, default="/home/ubuntu/compiler_workdir_vae23" + ) + parser.add_argument( + "--model-path", + type=str, + default="/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors", + ) + args = parser.parse_args() + + # Verify tile area constraint + latent_h = args.height // 32 + latent_w = args.width // 32 + area = latent_h * latent_w + if area > 64: + print( + f"ERROR: Tile area {latent_h}x{latent_w} = {area} exceeds 64-element SRAM limit" + ) + print("Reduce tile dimensions. 
Maximum compilable tiles:") + print(" 8x8 (256x256 px), 4x16 (128x512 px)") + sys.exit(1) + + compile_vae_decoder( + tp_degree=args.tp_degree, + tile_height=args.height, + tile_width=args.width, + num_frames=args.num_frames, + output_dir=args.output_dir, + compiler_workdir=args.compiler_workdir, + model_path=args.model_path, + ) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/generate_ltx23.py b/contrib/models/LTX-2.3/src/generate_ltx23.py new file mode 100644 index 00000000..30249246 --- /dev/null +++ b/contrib/models/LTX-2.3/src/generate_ltx23.py @@ -0,0 +1,2362 @@ +#!/usr/bin/env python3 +""" +LTX-2.3 E2E Generation on Neuron +================================== +Full end-to-end video+audio generation pipeline: + 1. Text encoding (Gemma 3 12B on Neuron TP=4, CPU fallback, or random embeddings) + 2. Optional image encoding for image-to-video (Diffusers VAE encoder on CPU) + 3. Denoising loop (48-block DiT on Neuron TP=4, 8 Euler steps) + 4. Optional latent upscaling (spatial x2 + temporal x2 on CPU) + 5. Video decode (VideoDecoder on CPU) + 6. Audio decode (AudioDecoder + VocoderWithBWE on CPU) + +Outputs: video frames (PNG), MP4 video, WAV audio. 
+ +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + + # Recommended: Application path (fastest, ~21% E2E speedup): + python3 generate_ltx23.py --use-app \ + --gemma-path /mnt/models/gemma-3-12b \ + --app-compiled-dir /mnt/models/compiled/e2e \ + --prompt "A dog plays in a meadow" + + # Text-to-Video with random embeddings (no Gemma required): + python3 generate_ltx23.py --no-text-encoder + + # Text-to-Video with Neuron-compiled Gemma3 (fastest, recommended): + python3 generate_ltx23.py --neuron-gemma \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --prompt "A dog plays in a meadow" + + # Image-to-Video with Neuron Gemma3: + python3 generate_ltx23.py --neuron-gemma \ + --gemma-compiled-dir /home/ubuntu/gemma3_encoder_compiled \ + --gemma-sharded-dir /home/ubuntu/gemma3_encoder_sharded \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --prompt "The woman turns and smiles at the camera" \ + --image /path/to/photo.png + + # With CPU Gemma3 (slow, no compilation needed): + python3 generate_ltx23.py --gemma-path /path/to/gemma-3-12b --prompt "A dog plays in a meadow" + + # With upscaling (384x512 @ 25 frames -> 768x1024 @ 49 frames): + python3 generate_ltx23.py --gemma-path /path/to/gemma-3-12b --prompt "A dog plays in a meadow" --upscale +""" + +import argparse +import gc +import json +import logging +import os +import sys +import time + +import torch +import numpy as np + +os.environ["NEURON_FUSE_SOFTMAX"] = "1" +os.environ["NEURON_CUSTOM_SILU"] = "1" +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s") +logger = logging.getLogger(__name__) + +# Defaults +MODEL_PATH = "/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors" +COMPILE_DIR = 
"/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_v2" +OUTPUT_DIR = "/home/ubuntu/ltx23_output" +TP_DEGREE = 4 +TEXT_SEQ = 256 + +# Gemma3 Neuron defaults +GEMMA3_COMPILED_DIR = "/home/ubuntu/gemma3_encoder_compiled" +GEMMA3_SHARDED_DIR = "/home/ubuntu/gemma3_encoder_sharded" +GEMMA3_SEQ_LEN = 1024 + +# Default upscaler paths +SPATIAL_UPSCALER_PATH = ( + "/home/ubuntu/models/LTX-2.3/upscalers/ltx-2.3-spatial-upscaler-x2-1.0.safetensors" +) +TEMPORAL_UPSCALER_PATH = ( + "/home/ubuntu/models/LTX-2.3/upscalers/ltx-2.3-temporal-upscaler-x2-1.0.safetensors" +) + +# Distilled sigma values from the reference LTX-2.3 pipeline +# See: ltx_pipelines/utils/constants.py DISTILLED_SIGMA_VALUES +DISTILLED_SIGMA_VALUES = [ + 1.0, + 0.99375, + 0.9875, + 0.98125, + 0.975, + 0.909375, + 0.725, + 0.421875, + 0.0, +] +STAGE_2_DISTILLED_SIGMA_VALUES = [0.909375, 0.725, 0.421875, 0.0] + +# Default half-res compilation paths (for two-stage mode) +HALFRES_COMPILE_DIR = "/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_halfres" +HALFRES_SHARDED_DIR = "/home/ubuntu/backbone_sharded_halfres" + +# Default backbone text_seq for Application-compiled NEFFs +APP_BACKBONE_TEXT_SEQ = 256 # DiT backbone text_seq for Application-compiled NEFF +APP_COMPILED_DIR = "/mnt/models/compiled/e2e_v2" +APP_ENCODER_SEQ_LEN = 512 # Gemma3 encoder seq_len for Application-compiled NEFF +APP_BACKBONE_AUDIO_SEQ = ( + 26 # Audio latent frames (AudioLatentShape frames=26, mel_bins=16, patch=16) +) +APP_HALFRES_COMPILED_DIR = "/mnt/models/compiled/e2e_v2_halfres" +GEMMA3_MODEL_PATH = "/mnt/models/gemma-3-12b" + + +def encode_image(image_path, model_path, height, width, dtype=torch.bfloat16): + """Encode an input image into normalized latent space for image-to-video. + + Uses ltx-core's native VideoEncoder loaded from the same safetensors + checkpoint (which contains both encoder and decoder weights under + vae.encoder.* and vae.decoder.* prefixes). 
The encoder includes + per_channel_statistics and outputs PCS-normalized latents (mean~0, std~1) + matching the scale of Gaussian noise in the denoising loop. + + Args: + image_path: Path to input image file + model_path: Path to LTX-2.3 safetensors checkpoint + height: Target video height (image will be resized to this) + width: Target video width (image will be resized to this) + dtype: Tensor dtype (default bf16) + + Returns: + Latent tensor of shape (1, 128, 1, H//32, W//32) + """ + from PIL import Image + from ltx_core.model.video_vae.model_configurator import VideoEncoderConfigurator + from ltx_core.loader.single_gpu_model_builder import SingleGPUModelBuilder + from ltx_core.loader.sd_ops import SDOps + + logger.info("Encoding image: %s", image_path) + t0 = time.time() + + # Load and preprocess image + img = Image.open(image_path).convert("RGB") + img = img.resize((width, height), Image.LANCZOS) + logger.info(" Image resized to %dx%d", width, height) + + # Convert to tensor: (H, W, 3) uint8 -> (1, 3, 1, H, W) bf16 [-1, 1] + img_tensor = torch.from_numpy(np.array(img)).float() / 255.0 + img_tensor = img_tensor * 2.0 - 1.0 # [0, 1] -> [-1, 1] + img_tensor = img_tensor.permute(2, 0, 1) # (3, H, W) + img_tensor = img_tensor.unsqueeze(0).unsqueeze(2) # (1, 3, 1, H, W) + img_tensor = img_tensor.to(dtype=dtype) + logger.info(" Image tensor: %s", img_tensor.shape) + + # Build ltx-core VideoEncoder from the safetensors checkpoint + # The encoder weights are stored under vae.encoder.* and vae.per_channel_statistics.* + logger.info(" Building ltx-core VideoEncoder...") + ve_ops = ( + SDOps("ve") + .with_matching(prefix="vae.encoder.") + .with_matching(prefix="vae.per_channel_statistics.") + .with_replacement("vae.encoder.", "") + .with_replacement("vae.per_channel_statistics.", "per_channel_statistics.") + ) + ve_builder = SingleGPUModelBuilder( + model_class_configurator=VideoEncoderConfigurator, + model_path=model_path, + model_sd_ops=ve_ops, + ) + video_encoder = 
ve_builder.build(device=torch.device("cpu"), dtype=dtype) + video_encoder.eval() + + # Encode image — the ltx-core VideoEncoder includes per_channel_statistics + # and outputs latents already PCS-normalized (mean~0, std~1), matching + # the scale of Gaussian noise used in the T2V denoising loop. + # No additional normalization is needed. + with torch.no_grad(): + latent = video_encoder(img_tensor) + logger.info( + " Encoded latent: %s (mean=%.3f, std=%.3f) in %.1fs", + latent.shape, + latent.float().mean().item(), + latent.float().std().item(), + time.time() - t0, + ) + + # Free the encoder + del video_encoder + gc.collect() + + return latent + + +def load_config(model_path): + from safetensors import safe_open + + with safe_open(model_path, framework="pt") as f: + metadata = f.metadata() + return json.loads(metadata["config"]) + + +def build_cpu_components(config, model_path, dtype=torch.bfloat16): + """Build all CPU-side components from the safetensors file. + + Uses SingleGPUModelBuilder (with SDOps key remapping) for components that + need it (LTXModel, VideoDecoder, AudioDecoder), and manual loading for + components with complex key mappings (Vocoder, EmbeddingsProcessor). + + Returns: + dict with keys: ltx_model, video_decoder, audio_decoder, vocoder, + embeddings_processor + """ + from safetensors.torch import load_file + from ltx_core.loader.single_gpu_model_builder import SingleGPUModelBuilder + from ltx_core.loader.sd_ops import SDOps + + t0 = time.time() + + # Patch SDPA before building any models + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from modeling_ltx23 import replace_sdpa_with_bmm + + replace_sdpa_with_bmm() + + # 1. 
LTXModel (for preprocessors) via SingleGPUModelBuilder + logger.info("Building LTXModel (preprocessors)...") + from ltx_core.model.transformer.model_configurator import LTXModelConfigurator + + ltx_ops = ( + SDOps("ltx") + .with_matching(prefix="model.diffusion_model.") + .with_replacement("model.diffusion_model.", "") + ) + ltx_builder = SingleGPUModelBuilder( + model_class_configurator=LTXModelConfigurator, + model_path=model_path, + model_sd_ops=ltx_ops, + ) + ltx_model = ltx_builder.build(device=torch.device("cpu"), dtype=dtype) + ltx_model.eval() + logger.info(" LTXModel: %d params loaded", sum(1 for _ in ltx_model.parameters())) + + # 2. VideoDecoder via SingleGPUModelBuilder + logger.info("Building VideoDecoder...") + from ltx_core.model.video_vae.model_configurator import VideoDecoderConfigurator + + vd_ops = ( + SDOps("v") + .with_matching(prefix="vae.") + .with_replacement("vae.decoder.", "") + .with_replacement("vae.", "") + ) + vd_builder = SingleGPUModelBuilder( + model_class_configurator=VideoDecoderConfigurator, + model_path=model_path, + model_sd_ops=vd_ops, + ) + video_decoder = vd_builder.build(device=torch.device("cpu"), dtype=dtype) + video_decoder.eval() + logger.info( + " VideoDecoder: %d params loaded", sum(1 for _ in video_decoder.parameters()) + ) + + # 3. AudioDecoder via SingleGPUModelBuilder + logger.info("Building AudioDecoder...") + from ltx_core.model.audio_vae.model_configurator import AudioDecoderConfigurator + + ad_ops = ( + SDOps("a") + .with_matching(prefix="audio_vae.") + .with_replacement("audio_vae.decoder.", "") + .with_replacement("audio_vae.", "") + ) + ad_builder = SingleGPUModelBuilder( + model_class_configurator=AudioDecoderConfigurator, + model_path=model_path, + model_sd_ops=ad_ops, + ) + audio_decoder = ad_builder.build(device=torch.device("cpu"), dtype=dtype) + audio_decoder.eval() + logger.info( + " AudioDecoder: %d params loaded", sum(1 for _ in audio_decoder.parameters()) + ) + + # 4. 
Vocoder (manual loading — SDOps can't handle nested "vocoder." prefix) + logger.info("Building Vocoder...") + from ltx_core.model.audio_vae.model_configurator import VocoderConfigurator + + vocoder = VocoderConfigurator.from_config(config) + full_sd = load_file(model_path) + voc_sd = {} + for k, v in full_sd.items(): + if k.startswith("vocoder."): + rest = k[len("vocoder.") :] + voc_sd[rest] = v.to(torch.float32) if v.is_floating_point() else v + m, u = vocoder.load_state_dict(voc_sd, strict=False) + vocoder.eval() + logger.info( + " Vocoder: %d loaded, %d missing, %d unexpected", + len(voc_sd) - len(m), + len(m), + len(u), + ) + del full_sd + + # 5. EmbeddingsProcessor (manual loading — multiple prefix remappings) + logger.info("Building EmbeddingsProcessor...") + from ltx_core.text_encoders.gemma.encoders.encoder_configurator import ( + EmbeddingsProcessorConfigurator, + ) + + embeddings_processor = EmbeddingsProcessorConfigurator.from_config(config) + embeddings_processor = embeddings_processor.to(dtype=dtype) + embeddings_processor.eval() + + full_sd = load_file(model_path) + prefix = "model.diffusion_model." + emb_keys = {} + for k, v in full_sd.items(): + sk = k[len(prefix) :] if k.startswith(prefix) else k + if sk.startswith("video_embeddings_connector."): + new_key = "video_connector." + sk[len("video_embeddings_connector.") :] + emb_keys[new_key] = v.to(dtype) if v.is_floating_point() else v + elif sk.startswith("audio_embeddings_connector."): + new_key = "audio_connector." + sk[len("audio_embeddings_connector.") :] + emb_keys[new_key] = v.to(dtype) if v.is_floating_point() else v + elif sk.startswith("text_embedding_projection."): + new_key = "feature_extractor." 
+ sk[len("text_embedding_projection.") :] + emb_keys[new_key] = v.to(dtype) if v.is_floating_point() else v + m, u = embeddings_processor.load_state_dict(emb_keys, strict=False) + logger.info( + " EmbeddingsProcessor: %d loaded, %d missing, %d unexpected", + len(emb_keys) - len(m), + len(m), + len(u), + ) + del full_sd + + logger.info("All CPU components loaded in %.1fs", time.time() - t0) + + # Free transformer block weights from CPU model (they run on Neuron) + if ( + hasattr(ltx_model, "transformer_blocks") + and ltx_model.transformer_blocks is not None + ): + del ltx_model.transformer_blocks + ltx_model.transformer_blocks = None + for attr in ("norm_out", "proj_out", "audio_norm_out", "audio_proj_out"): + if hasattr(ltx_model, attr): + delattr(ltx_model, attr) + gc.collect() + + return { + "ltx_model": ltx_model, + "video_decoder": video_decoder, + "audio_decoder": audio_decoder, + "vocoder": vocoder, + "embeddings_processor": embeddings_processor, + } + + +def load_neuron_backbone(compile_dir, model_path, tp_degree=4, sharded_dir=None): + """Load compiled Neuron backbone with real weights. + + If sharded_dir is provided, loads pre-sharded per-rank .pt files (~5GB each) + instead of reading the full 41GB safetensors. This reduces peak CPU memory + from ~80GB to ~15GB, eliminating swap thrashing on trn2.3xlarge. + + Pre-shard weights with: python3 shard_backbone_weights.py + """ + import torch_neuronx + from neuronx_distributed.trace.trace import TensorParallelNeuronModel + + tp_0_path = os.path.join(compile_dir, "tp_0.pt") + + if sharded_dir and os.path.isdir(sharded_dir): + # Fast path: load pre-sharded weights (~5GB per rank) + logger.info("Loading pre-sharded backbone weights from %s...", sharded_dir) + rank_sds = [] + for rank in range(tp_degree): + rank_path = os.path.join(sharded_dir, "rank_%d.pt" % rank) + if not os.path.exists(rank_path): + raise FileNotFoundError( + f"Sharded weights not found at {rank_path}. " + "Run shard_backbone_weights.py first." 
+ ) + ckpt = torch.load(rank_path, weights_only=True) + rank_sds.append(ckpt) + if rank == 0: + logger.info(" rank_0: %d keys", len(ckpt)) + else: + # Fallback: load from safetensors (slower, more memory) + from safetensors import safe_open + from load_with_weights import shard_weight + + logger.info("Loading backbone weights (memory-mapped)...") + prefix = "model.diffusion_model." + backbone_prefixes = ( + "transformer_blocks.", + "norm_out.", + "proj_out.", + "scale_shift_table", + "audio_norm_out.", + "audio_proj_out.", + "audio_scale_shift_table", + ) + backbone_sd = {} + with safe_open(model_path, framework="pt") as f: + for k in f.keys(): + stripped = k[len(prefix) :] if k.startswith(prefix) else k + if stripped.startswith(backbone_prefixes): + backbone_sd[stripped] = ( + f.get_tensor(k).to(torch.bfloat16).contiguous() + ) + backbone_sd["spmd_rank.rank"] = torch.arange(0, tp_degree, dtype=torch.int32) + logger.info(" Loaded %d backbone tensors", len(backbone_sd)) + + def sf_key_to_jit_key(sf_key): + return "weights." 
+ sf_key.replace(".", "->") + + rank_sds = [{} for _ in range(tp_degree)] + for sf_key, full_weight in backbone_sd.items(): + jit_key = sf_key_to_jit_key(sf_key) + for rank in range(tp_degree): + rank_sds[rank][jit_key] = shard_weight( + full_weight, jit_key, rank, tp_degree + ) + del backbone_sd + gc.collect() + + # Load compiled models and inject weights + models = [] + t0 = time.time() + for rank in range(tp_degree): + logger.info(" Loading Neuron rank %d...", rank) + with torch_neuronx.contexts.disable_nrt_load(): + model = torch.jit.load(tp_0_path) + model_sd = dict(model.named_parameters()) + injected = 0 + for jit_key, sharded_weight in rank_sds[rank].items(): + if jit_key in model_sd and model_sd[jit_key].shape == sharded_weight.shape: + model_sd[jit_key].data.copy_(sharded_weight) + injected += 1 + if rank == 0: + logger.info(" Injected %d/%d weights", injected, len(rank_sds[rank])) + models.append(model) + # Free this rank's weights immediately after injection + rank_sds[rank] = None + + logger.info(" All Neuron models loaded in %.1fs", time.time() - t0) + del rank_sds + gc.collect() + + return TensorParallelNeuronModel(models) + + +def unload_neuron_model(tp_model, name="model"): + """Fully unload a TensorParallelNeuronModel from NeuronCores. + + The simple `del model; gc.collect()` pattern is insufficient because: + 1. torch.jit.ScriptModule holds NRT (Neuron Runtime) resources + 2. Python's GC may not immediately release the underlying NEFF allocations + 3. The NRT resources occupy HBM even after Python references are dropped + + This function explicitly destroys each rank's model and forces cleanup, + ensuring the NeuronCores are fully free for the next model to load. 
+ """ + logger.info("Unloading %s from NeuronCores...", name) + t0 = time.time() + + # Access the underlying per-rank models + if hasattr(tp_model, "models"): + models = tp_model.models + elif hasattr(tp_model, "model_list"): + models = tp_model.model_list + else: + # Fallback: just delete the top-level object + logger.warning(" Cannot access per-rank models, using simple delete") + del tp_model + gc.collect() + return + + # Delete each rank's model individually + for i in range(len(models)): + models[i] = None + del models + del tp_model + gc.collect() + + # Force Python GC to run multiple generations + gc.collect(0) + gc.collect(1) + gc.collect(2) + + # Give NRT time to release resources + import time as _time + + _time.sleep(2) + + logger.info(" %s unloaded in %.1fs", name, time.time() - t0) + + +def load_neuron_gemma3(compiled_dir, sharded_dir, tp_degree=4): + """Load Neuron-compiled Gemma3 encoder with pre-sharded weights. + + Both models (DiT backbone and Gemma3 encoder) share the same 4 NeuronCores + and execute sequentially. + """ + import torch_neuronx + from neuronx_distributed.trace.trace import ( + TensorParallelNeuronModel, + replace_weights, + ) + + tp_0_path = os.path.join(compiled_dir, "tp_0.pt") + if not os.path.exists(tp_0_path): + raise FileNotFoundError( + f"Compiled Gemma3 not found at {tp_0_path}. Run compile_gemma3.py first." + ) + + models = [] + t0 = time.time() + for rank in range(tp_degree): + logger.info(" Loading Gemma3 Neuron rank %d...", rank) + rank_path = os.path.join(sharded_dir, "rank_%d.pt" % rank) + if not os.path.exists(rank_path): + raise FileNotFoundError( + f"Sharded weights not found at {rank_path}. " + "Run shard_gemma3_weights.py first." 
+ ) + ckpt = torch.load(rank_path, weights_only=True) + tp_path = os.path.join(compiled_dir, "tp_%d.pt" % rank) + with torch_neuronx.contexts.disable_nrt_load(): + traced_model = torch.jit.load(tp_path) + replace_weights(traced_model, ckpt) + models.append(traced_model) + del ckpt + gc.collect() + + logger.info(" Gemma3 encoder loaded in %.1fs", time.time() - t0) + return TensorParallelNeuronModel(models) + + +def encode_text_neuron( + neuron_gemma3, tokenizer_path, prompt, text_seq, embeddings_processor +): + """Encode text using Neuron-compiled Gemma3 encoder. + + The Neuron encoder returns stacked hidden states (B, seq_len, 3840, 49). + We convert to a tuple of per-layer tensors for process_hidden_states(). + """ + from ltx_core.text_encoders.gemma.tokenizer import LTXVGemmaTokenizer + + tokenizer = LTXVGemmaTokenizer( + tokenizer_path=tokenizer_path, + max_length=text_seq, + ) + + token_pairs = tokenizer.tokenize_with_weights(prompt)["gemma"] + input_ids = torch.tensor([[t[0] for t in token_pairs]], dtype=torch.int64) + attention_mask = torch.tensor([[w[1] for w in token_pairs]], dtype=torch.int64) + actual_len = input_ids.shape[1] + + compiled_seq_len = GEMMA3_SEQ_LEN + if actual_len < compiled_seq_len: + pad_len = compiled_seq_len - actual_len + input_ids = torch.cat( + [torch.zeros(1, pad_len, dtype=torch.int64), input_ids], dim=1 + ) + attention_mask = torch.cat( + [torch.zeros(1, pad_len, dtype=torch.int64), attention_mask], dim=1 + ) + elif actual_len > compiled_seq_len: + input_ids = input_ids[:, :compiled_seq_len] + attention_mask = attention_mask[:, :compiled_seq_len] + + logger.info( + " Tokenized: %d tokens -> padded to %d", actual_len, input_ids.shape[1] + ) + + t0 = time.time() + with torch.no_grad(): + stacked = neuron_gemma3(input_ids, attention_mask) + logger.info(" Neuron Gemma3 forward: %.1fs", time.time() - t0) + + # Trim back to actual token length if we padded + if actual_len < compiled_seq_len: + pad_len = compiled_seq_len - actual_len + 
stacked = stacked[:, pad_len:, :, :] + attention_mask = attention_mask[:, pad_len:] + + # Convert stacked tensor to tuple of per-layer tensors + hidden_states = tuple(stacked[:, :, :, i] for i in range(stacked.shape[-1])) + + result = embeddings_processor.process_hidden_states( + hidden_states=hidden_states, + attention_mask=attention_mask, + ) + return result.video_encoding, result.audio_encoding, result.attention_mask + + +def load_upscalers(spatial_path, temporal_path, dtype=torch.bfloat16): + """Load spatial and temporal latent upscalers from separate safetensors files. + + Each upscaler safetensors file has its config embedded in metadata. + Uses LatentUpsamplerConfigurator.from_config() + manual load_state_dict. + + Returns: + dict with keys: spatial_upsampler, temporal_upsampler + """ + import json as _json + + from safetensors import safe_open + from safetensors.torch import load_file + from ltx_core.model.upsampler.model_configurator import LatentUpsamplerConfigurator + + result = {} + + for name, path in [ + ("spatial_upsampler", spatial_path), + ("temporal_upsampler", temporal_path), + ]: + logger.info("Loading %s from %s...", name, path) + t0 = time.time() + + # Read config from safetensors metadata + with safe_open(path, framework="pt") as f: + metadata = f.metadata() + upsampler_config = _json.loads(metadata["config"]) + + # Build model from config + upsampler = LatentUpsamplerConfigurator.from_config(upsampler_config) + upsampler = upsampler.to(dtype=dtype) + upsampler.eval() + + # Load weights + sd = load_file(path) + sd = {k: v.to(dtype) if v.is_floating_point() else v for k, v in sd.items()} + m, u = upsampler.load_state_dict(sd, strict=False) + total_params = sum(p.numel() for p in upsampler.parameters()) + logger.info( + " %s: %.1fM params, %d missing, %d unexpected, loaded in %.1fs", + name, + total_params / 1e6, + len(m), + len(u), + time.time() - t0, + ) + if m: + logger.warning(" Missing keys[:5]: %s", m[:5]) + + result[name] = upsampler + 
del sd + + gc.collect() + return result + + +def upscale_video_latent( + video_latent_5d, video_decoder, spatial_upsampler, temporal_upsampler +): + """Upscale video latent using spatial and temporal upsamplers. + + Flow: un_normalize -> spatial upsample -> temporal upsample -> normalize + Uses video_decoder.per_channel_statistics for normalization. + + Args: + video_latent_5d: (B, C, F, H, W) normalized video latent + video_decoder: VideoDecoder with per_channel_statistics + spatial_upsampler: LatentUpsampler for spatial x2 + temporal_upsampler: LatentUpsampler for temporal x2 + + Returns: + Upscaled (B, C, F', H*2, W*2) normalized video latent + """ + pcs = video_decoder.per_channel_statistics + logger.info(" Input latent: %s", video_latent_5d.shape) + + # Un-normalize to raw latent space + latent = pcs.un_normalize(video_latent_5d) + logger.info( + " Un-normalized: %s (mean=%.3f, std=%.3f)", + latent.shape, + latent.float().mean().item(), + latent.float().std().item(), + ) + + # Spatial upsample (H, W doubled) + t0 = time.time() + with torch.no_grad(): + latent = spatial_upsampler(latent) + logger.info(" After spatial upsample: %s in %.1fs", latent.shape, time.time() - t0) + + # Temporal upsample (F roughly doubled, first frame removed) + t0 = time.time() + with torch.no_grad(): + latent = temporal_upsampler(latent) + logger.info( + " After temporal upsample: %s in %.1fs", latent.shape, time.time() - t0 + ) + + # Re-normalize + latent = pcs.normalize(latent) + logger.info( + " Re-normalized: %s (mean=%.3f, std=%.3f)", + latent.shape, + latent.float().mean().item(), + latent.float().std().item(), + ) + + return latent + + +def load_spatial_upscaler(spatial_path, dtype=torch.bfloat16): + """Load only the spatial upscaler (for two-stage mode). + + Two-stage mode uses spatial-only upscaling between stages (no temporal). + This follows the DistilledPipeline reference which calls upsample_video() + with only the spatial upsampler. 
+ """ + import json as _json + + from safetensors import safe_open + from safetensors.torch import load_file + from ltx_core.model.upsampler.model_configurator import LatentUpsamplerConfigurator + + logger.info("Loading spatial upsampler from %s...", spatial_path) + t0 = time.time() + + with safe_open(spatial_path, framework="pt") as f: + metadata = f.metadata() + upsampler_config = _json.loads(metadata["config"]) + + upsampler = LatentUpsamplerConfigurator.from_config(upsampler_config) + upsampler = upsampler.to(dtype=dtype) + upsampler.eval() + + sd = load_file(spatial_path) + sd = {k: v.to(dtype) if v.is_floating_point() else v for k, v in sd.items()} + upsampler.load_state_dict(sd, strict=False) + total_params = sum(p.numel() for p in upsampler.parameters()) + logger.info( + " Spatial upsampler: %.1fM params, loaded in %.1fs", + total_params / 1e6, + time.time() - t0, + ) + del sd + return upsampler + + +def spatial_upscale_latent(video_latent_5d, video_decoder, spatial_upsampler): + """Spatial-only upscale for two-stage pipeline. + + Doubles H and W while preserving frame count F. + Flow: un_normalize -> spatial upsample x2 -> re_normalize. 
+ + This follows the DistilledPipeline reference (ltx-core upsample_video): + latent = video_encoder.per_channel_statistics.un_normalize(latent) + latent = upsampler(latent) + latent = video_encoder.per_channel_statistics.normalize(latent) + + Args: + video_latent_5d: (B, C, F, H, W) normalized video latent + video_decoder: VideoDecoder with per_channel_statistics + spatial_upsampler: LatentUpsampler for spatial x2 + + Returns: + (B, C, F, H*2, W*2) normalized video latent (F unchanged) + """ + pcs = video_decoder.per_channel_statistics + logger.info(" Input latent: %s", video_latent_5d.shape) + + # Un-normalize to raw latent space + latent = pcs.un_normalize(video_latent_5d) + logger.info( + " Un-normalized: %s (mean=%.3f, std=%.3f)", + latent.shape, + latent.float().mean().item(), + latent.float().std().item(), + ) + + # Spatial upsample only (H, W doubled, F unchanged) + t0 = time.time() + with torch.no_grad(): + latent = spatial_upsampler(latent) + logger.info(" After spatial x2: %s in %.1fs", latent.shape, time.time() - t0) + + # Re-normalize + latent = pcs.normalize(latent) + logger.info( + " Re-normalized: %s (mean=%.3f, std=%.3f)", + latent.shape, + latent.float().mean().item(), + latent.float().std().item(), + ) + + return latent + + +def create_app_compositor( + model_path, + encoder_path, + tp_degree=4, + text_seq=256, + height=384, + width=512, + num_frames=25, + audio_seq=APP_BACKBONE_AUDIO_SEQ, +): + """Create NeuronLTX23Application compositor with proper configs. + + Returns: + NeuronLTX23Application instance ready for compile/load. 
+ """ + from neuronx_distributed_inference.models.config import NeuronConfig + from modeling_ltx23 import LTX23BackboneInferenceConfig + from modeling_gemma3_encoder import ( + Gemma3EncoderInferenceConfig, + GEMMA3_12B_CONFIG, + ) + from application import NeuronLTX23Application + + config = load_config(model_path) + tc = config["transformer"] + + num_heads = tc["num_attention_heads"] + head_dim = tc["attention_head_dim"] + inner_dim = num_heads * head_dim + audio_num_heads = tc["audio_num_attention_heads"] + audio_head_dim = tc["audio_attention_head_dim"] + audio_inner_dim = audio_num_heads * audio_head_dim + audio_ca_dim = tc.get("audio_cross_attention_dim", 2048) + + latent_h = height // 32 + latent_w = width // 32 + latent_f = (num_frames - 1) // 8 + 1 + video_seq = latent_f * latent_h * latent_w + + dtype = torch.bfloat16 + + # Backbone InferenceConfig + backbone_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=video_seq, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + backbone_config = LTX23BackboneInferenceConfig( + neuron_config=backbone_neuron_config, + num_layers=tc["num_layers"], + num_attention_heads=num_heads, + attention_head_dim=head_dim, + inner_dim=inner_dim, + audio_num_attention_heads=audio_num_heads, + audio_attention_head_dim=audio_head_dim, + audio_inner_dim=audio_inner_dim, + audio_cross_attention_dim=audio_ca_dim, + video_seq=video_seq, + audio_seq=audio_seq, + text_seq=APP_BACKBONE_TEXT_SEQ, # Must match Application-compiled backbone + height=latent_h, + width=latent_w, + num_frames=latent_f, + ltx_config_dict=config, + ) + + # Gemma3 Encoder InferenceConfig + encoder_neuron_config = NeuronConfig( + tp_degree=tp_degree, + world_size=tp_degree, + batch_size=1, + seq_len=APP_ENCODER_SEQ_LEN, # Must match Application-compiled encoder + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + encoder_config = 
Gemma3EncoderInferenceConfig( + neuron_config=encoder_neuron_config, + vocab_size=GEMMA3_12B_CONFIG["vocab_size"], + hidden_size=GEMMA3_12B_CONFIG["hidden_size"], + num_hidden_layers=GEMMA3_12B_CONFIG["num_hidden_layers"], + num_attention_heads=GEMMA3_12B_CONFIG["num_attention_heads"], + num_key_value_heads=GEMMA3_12B_CONFIG["num_key_value_heads"], + head_dim=GEMMA3_12B_CONFIG["head_dim"], + intermediate_size=GEMMA3_12B_CONFIG["intermediate_size"], + rms_norm_eps=GEMMA3_12B_CONFIG["rms_norm_eps"], + rope_theta=GEMMA3_12B_CONFIG["rope_theta"], + max_position_embeddings=GEMMA3_12B_CONFIG["max_position_embeddings"], + query_pre_attn_scalar=GEMMA3_12B_CONFIG["query_pre_attn_scalar"], + pad_token_id=GEMMA3_12B_CONFIG["pad_token_id"], + ) + + app = NeuronLTX23Application( + backbone_config=backbone_config, + encoder_config=encoder_config, + model_path=model_path, + encoder_path=encoder_path, + ) + logger.info( + "NeuronLTX23Application created (video_seq=%d, text_seq=%d)", + video_seq, + text_seq, + ) + return app + + +def encode_text_with_app( + app, compiled_dir, tokenizer_path, prompt, text_seq, embeddings_processor +): + """Encode text using the Application-loaded Gemma3 encoder. + + Handles: load encoder → tokenize → forward → process → unload. 
+ + Args: + app: NeuronLTX23Application compositor + compiled_dir: Base compiled directory with text_encoder/ subdir + tokenizer_path: Path to Gemma3 tokenizer (HuggingFace dir) + prompt: Text prompt to encode + text_seq: Text sequence length for the backbone (256) + embeddings_processor: CPU EmbeddingsProcessor for post-processing + + Returns: + (video_context, audio_context, context_mask) tensors + """ + from ltx_core.text_encoders.gemma.tokenizer import LTXVGemmaTokenizer + + # Load encoder to NeuronCores + logger.info("Loading Gemma3 encoder via Application...") + t0 = time.time() + app.load_text_encoder(compiled_dir) + logger.info("Gemma3 encoder loaded in %.1fs", time.time() - t0) + + # Use the Application-compiled encoder seq_len (512), not the standalone one (1024) + compiled_seq_len = APP_ENCODER_SEQ_LEN + + # Warmup + logger.info(" Warmup forward pass...") + t0 = time.time() + warmup_ids = torch.zeros(1, compiled_seq_len, dtype=torch.int64) + warmup_mask = torch.ones(1, compiled_seq_len, dtype=torch.int64) + with torch.no_grad(): + _ = app.encode_text(warmup_ids, warmup_mask) + logger.info(" Warmup done in %.1fs", time.time() - t0) + + # Tokenize + tokenizer = LTXVGemmaTokenizer( + tokenizer_path=tokenizer_path, + max_length=text_seq, + ) + token_pairs = tokenizer.tokenize_with_weights(prompt)["gemma"] + input_ids = torch.tensor([[t[0] for t in token_pairs]], dtype=torch.int64) + attention_mask = torch.tensor([[w[1] for w in token_pairs]], dtype=torch.int64) + actual_len = input_ids.shape[1] + + # compiled_seq_len already set to APP_ENCODER_SEQ_LEN above + if actual_len < compiled_seq_len: + pad_len = compiled_seq_len - actual_len + input_ids = torch.cat( + [torch.zeros(1, pad_len, dtype=torch.int64), input_ids], dim=1 + ) + attention_mask = torch.cat( + [torch.zeros(1, pad_len, dtype=torch.int64), attention_mask], dim=1 + ) + elif actual_len > compiled_seq_len: + input_ids = input_ids[:, :compiled_seq_len] + attention_mask = attention_mask[:, 
:compiled_seq_len] + + logger.info( + " Tokenized: %d tokens -> padded to %d", actual_len, input_ids.shape[1] + ) + + # Forward + t0 = time.time() + with torch.no_grad(): + stacked = app.encode_text(input_ids, attention_mask) + logger.info(" Application Gemma3 forward: %.3fs", time.time() - t0) + + # Trim padding + if actual_len < compiled_seq_len: + pad_len = compiled_seq_len - actual_len + stacked = stacked[:, pad_len:, :, :] + attention_mask = attention_mask[:, pad_len:] + + # Convert stacked tensor to tuple of per-layer tensors + hidden_states = tuple(stacked[:, :, :, i] for i in range(stacked.shape[-1])) + + result = embeddings_processor.process_hidden_states( + hidden_states=hidden_states, + attention_mask=attention_mask, + ) + + # Unload encoder to free NeuronCores for backbone + app.unload_text_encoder() + logger.info(" Gemma3 encoder unloaded") + + return result.video_encoding, result.audio_encoding, result.attention_mask + + +def generate(args): + """Main generation pipeline.""" + config = load_config(args.model_path) + tc = config["transformer"] + logger.info( + "Model: %d layers, %d heads", tc["num_layers"], tc["num_attention_heads"] + ) + + # Build CPU components + logger.info("\n=== Building CPU components ===") + cpu = build_cpu_components(config, args.model_path) + + # When using Neuron Gemma3, we must load/run/unload Gemma3 BEFORE loading + # the DiT backbone, since both share the same 4 NeuronCores. Loading both + # simultaneously causes memory contention and extreme swap thrashing + # (144s+ for the first denoising step instead of 0.3s). 
+ # + # Sequence: CPU components -> Gemma3 on Neuron -> encode -> unload Gemma3 + # -> DiT on Neuron -> denoise -> decode + + # Get context embeddings + dtype = torch.bfloat16 + app = None # NeuronLTX23Application compositor (only used with --use-app) + if args.use_app: + logger.info("\n=== Using Application path (recommended) ===") + app = create_app_compositor( + model_path=args.model_path, + encoder_path=args.gemma_path, + tp_degree=args.tp_degree, + text_seq=args.text_seq, + height=args.height, + width=args.width, + num_frames=args.num_frames, + ) + t0 = time.time() + video_context, audio_context, context_mask = encode_text_with_app( + app=app, + compiled_dir=args.app_compiled_dir, + tokenizer_path=args.gemma_path, + prompt=args.prompt, + text_seq=args.text_seq, + embeddings_processor=cpu["embeddings_processor"], + ) + logger.info("Text encoded via Application in %.1fs", time.time() - t0) + logger.info( + " video_context: %s, audio_context: %s", + video_context.shape, + audio_context.shape, + ) + elif args.no_text_encoder: + logger.info("\n=== Using random embeddings (no text encoder) ===") + torch.manual_seed(args.seed) + video_context = torch.randn(1, args.text_seq, 4096, dtype=dtype) + audio_context = torch.randn(1, args.text_seq, 2048, dtype=dtype) + context_mask = torch.ones(1, args.text_seq, dtype=torch.int64) + context_mask[:, 50:] = 0 # mask out most tokens + elif args.neuron_gemma: + logger.info("\n=== Running text encoder on Neuron ===") + logger.info("Loading Neuron-compiled Gemma3 encoder...") + t0 = time.time() + neuron_gemma3 = load_neuron_gemma3( + args.gemma_compiled_dir, args.gemma_sharded_dir, args.tp_degree + ) + logger.info("Gemma3 encoder ready in %.1fs", time.time() - t0) + + # Warmup pass + logger.info(" Warmup forward pass...") + t0 = time.time() + warmup_ids = torch.zeros(1, GEMMA3_SEQ_LEN, dtype=torch.int64) + warmup_mask = torch.ones(1, GEMMA3_SEQ_LEN, dtype=torch.int64) + with torch.no_grad(): + _ = neuron_gemma3(warmup_ids, 
warmup_mask) + logger.info(" Warmup done in %.1fs", time.time() - t0) + + t0 = time.time() + video_context, audio_context, context_mask = encode_text_neuron( + neuron_gemma3, + args.gemma_path, + args.prompt, + args.text_seq, + cpu["embeddings_processor"], + ) + logger.info("Text encoded on Neuron in %.1fs", time.time() - t0) + logger.info( + " video_context: %s, audio_context: %s", + video_context.shape, + audio_context.shape, + ) + + # Free Gemma3 Neuron model — must fully release NeuronCores + # before loading the DiT backbone which shares the same 4 cores + unload_neuron_model(neuron_gemma3, "Gemma3 encoder") + del neuron_gemma3 + else: + logger.info("\n=== Running text encoder ===") + # Load Gemma 3 12B + from ltx_core.text_encoders.gemma.encoders.base_encoder import GemmaTextEncoder + from ltx_core.text_encoders.gemma.tokenizer import LTXVGemmaTokenizer + from transformers.models.gemma3 import Gemma3ForConditionalGeneration + from ltx_core.text_encoders.gemma.config import GEMMA3_CONFIG_FOR_LTX + from transformers import Gemma3Config + + logger.info("Loading Gemma 3 12B from %s...", args.gemma_path) + t0 = time.time() + + # Build config and load model + gemma_config = Gemma3Config.from_dict(GEMMA3_CONFIG_FOR_LTX.to_dict()) + gemma_model = Gemma3ForConditionalGeneration.from_pretrained( + args.gemma_path, + config=gemma_config, + dtype=dtype, + ) + gemma_model = gemma_model.to(dtype=dtype) + gemma_model.eval() + logger.info("Gemma loaded in %.1fs", time.time() - t0) + + tokenizer = LTXVGemmaTokenizer( + tokenizer_path=args.gemma_path, + max_length=args.text_seq, + ) + text_encoder = GemmaTextEncoder( + model=gemma_model, + tokenizer=tokenizer, + dtype=dtype, + ) + + # Encode prompt + logger.info("Encoding prompt: '%s'", args.prompt) + t0 = time.time() + with torch.no_grad(): + hidden_states, attention_mask = text_encoder.encode(args.prompt) + + # Run embeddings processor (handles additive mask conversion internally) + result = 
cpu["embeddings_processor"].process_hidden_states( + hidden_states=hidden_states, + attention_mask=attention_mask, + ) + video_context = result.video_encoding + audio_context = result.audio_encoding + context_mask = ( + result.attention_mask + ) # keep as int64 for proper additive mask conversion + logger.info("Text encoded in %.1fs", time.time() - t0) + logger.info( + " video_context: %s, audio_context: %s", + video_context.shape, + audio_context.shape, + ) + + # Free Gemma to save RAM + del gemma_model, text_encoder + gc.collect() + + # ========================================================================= + # TWO-STAGE PIPELINE + # ========================================================================= + if args.two_stage: + logger.info("\n" + "=" * 60) + logger.info("TWO-STAGE GENERATION") + logger.info( + " Stage 1: %dx%d (%d steps)", + args.height // 2, + args.width // 2, + args.num_steps, + ) + logger.info(" Stage 2: %dx%d (3 steps, refinement)", args.height, args.width) + logger.info("=" * 60) + + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + from ltx_core.model.transformer.modality import Modality + from pipeline import NeuronTransformerWrapper + + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() + s1_total_time = 0.0 + + # --- Phase 2 shortcut: Load S1 latent from a previous Phase 1 run --- + if args.load_s1_latent: + logger.info( + "\n=== Phase 2: Loading S1 latent from %s ===", args.load_s1_latent + ) + saved = torch.load( + args.load_s1_latent, map_location="cpu", weights_only=True + ) + s1_video_latent = saved["s1_video_latent"] + audio_sample = saved["audio_sample"] + video_context = saved["video_context"] + audio_context = saved["audio_context"] + context_mask = saved["context_mask"] + s1_total_time = saved.get("s1_total_time", 0.0) + logger.info( + " 
Loaded S1 video latent: %s, audio: %s", + s1_video_latent.shape, + audio_sample.shape, + ) + logger.info(" S1 denoising time from Phase 1: %.1fs", s1_total_time) + + # Setup audio tools (needed for S2) + audio_shape = AudioLatentShape( + batch=1, channels=8, frames=args.audio_num_frames, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools( + patchifier=a_patchifier, target_shape=audio_shape + ) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype) + else: + # --- Normal Phase 1: Run encoder + S1 denoising --- + pass # Fall through to existing S1 code below + + # Skip S1 if loading from Phase 1 save + if not args.load_s1_latent: + # --- Stage 1: Half-res generation --- + logger.info("\n=== Stage 1: Half-resolution generation ===") + + # Half-res latent dimensions + s1_height = args.height // 2 + s1_width = args.width // 2 + s1_latent_h = s1_height // 32 + s1_latent_w = s1_width // 32 + s1_latent_f = (args.num_frames - 1) // 8 + 1 + logger.info( + " Half-res: %dx%d, latent %dx%dx%d", + s1_height, + s1_width, + s1_latent_f, + s1_latent_h, + s1_latent_w, + ) + + s1_video_shape = VideoLatentShape( + batch=1, + channels=128, + frames=s1_latent_f, + height=s1_latent_h, + width=s1_latent_w, + ) + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() + s1_video_tools = VideoLatentTools( + target_shape=s1_video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=args.fps, + ) + audio_shape = AudioLatentShape( + batch=1, channels=8, frames=args.audio_num_frames, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools( + patchifier=a_patchifier, target_shape=audio_shape + ) + + # Create initial noise at half-res + gen = torch.Generator().manual_seed(args.seed) + s1_video_state = s1_video_tools.create_initial_state(device="cpu", dtype=dtype) + audio_state = 
audio_tools.create_initial_state(device="cpu", dtype=dtype) + video_sample = torch.randn( + s1_video_state.latent.shape, dtype=dtype, generator=gen + ) + audio_sample = torch.randn(audio_state.latent.shape, dtype=dtype, generator=gen) + + # Image-to-Video conditioning for Stage 1 (half-res) + s1_denoise_mask = None + s1_clean_latent = None + if args.image: + logger.info("\n=== Image-to-Video conditioning (Stage 1, half-res) ===") + image_latent_5d = encode_image( + args.image, args.model_path, s1_height, s1_width, dtype + ) + image_tokens = v_patchifier.patchify(image_latent_5d) + frame_0_tokens = s1_latent_h * s1_latent_w + logger.info( + " Image patchified: %s (frame 0 = %d tokens)", + image_tokens.shape, + frame_0_tokens, + ) + video_sample[:, :frame_0_tokens] = image_tokens[:, :frame_0_tokens] + video_seq_len = video_sample.shape[1] + s1_denoise_mask = torch.ones(1, video_seq_len, 1, dtype=dtype) + s1_denoise_mask[:, :frame_0_tokens, :] = 0.0 + s1_clean_latent = video_sample.clone() + logger.info( + " I2V: %d conditioned, %d unconditioned tokens", + frame_0_tokens, + video_seq_len - frame_0_tokens, + ) + del image_latent_5d, image_tokens + + # Load half-res Neuron backbone + if args.use_app: + # Application path: create half-res compositor and load backbone + s1_app = create_app_compositor( + model_path=args.model_path, + encoder_path=args.gemma_path, + tp_degree=args.tp_degree, + text_seq=args.text_seq, + height=s1_height, + width=s1_width, + num_frames=args.num_frames, + ) + logger.info( + "Loading half-res backbone from %s...", + args.app_halfres_compiled_dir, + ) + t0 = time.time() + s1_app.load_backbone(args.app_halfres_compiled_dir) + logger.info( + "Half-res backbone loaded via Application in %.1fs", + time.time() - t0, + ) + neuron_backbone = s1_app + wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + mask_4d=True, + compiled_text_seq=APP_BACKBONE_TEXT_SEQ, + ) + else: + 
logger.info( + "Loading half-res backbone from %s...", args.halfres_compiled_dir + ) + neuron_backbone = load_neuron_backbone( + args.halfres_compiled_dir, + args.model_path, + args.tp_degree, + sharded_dir=args.halfres_sharded_dir + if os.path.isdir(args.halfres_sharded_dir) + else args.backbone_sharded_dir, + ) + wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + ) + + # Warmup half-res backbone + logger.info("Warming up half-res DiT backbone...") + warmup_sigma = torch.tensor([1.0]) + warmup_v_ts = warmup_sigma.unsqueeze(0).expand( + 1, s1_video_state.latent.shape[1] + ) + warmup_a_ts = warmup_sigma.unsqueeze(0).expand(1, audio_state.latent.shape[1]) + warmup_video_mod = Modality( + latent=torch.randn_like(video_sample), + sigma=warmup_sigma, + timesteps=warmup_v_ts, + positions=s1_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + warmup_audio_mod = Modality( + latent=torch.randn_like(audio_sample), + sigma=warmup_sigma, + timesteps=warmup_a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + _ = wrapper(warmup_video_mod, warmup_audio_mod) + logger.info(" Half-res warmup done in %.1fs", time.time() - t0) + del warmup_video_mod, warmup_audio_mod + + # Stage 1 denoising (8 steps at half-res) + logger.info( + "\n=== Stage 1 Denoising (%d steps at %dx%d) ===", + args.num_steps, + s1_height, + s1_width, + ) + sigmas = torch.tensor(DISTILLED_SIGMA_VALUES, dtype=torch.float32) + s1_total_time = 0.0 + for step_idx in range(args.num_steps): + sigma = sigmas[step_idx] + sigma_next = sigmas[step_idx + 1] + video_seq_len = s1_video_state.latent.shape[1] + audio_seq_len = audio_state.latent.shape[1] + # I2V: per-token timesteps (frame 0 gets 0, rest get sigma) + if s1_denoise_mask is not 
None: + v_ts = s1_denoise_mask.squeeze(-1) * sigma + else: + v_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, video_seq_len) + a_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, audio_seq_len) + video_mod = Modality( + latent=video_sample, + sigma=sigma.unsqueeze(0), + timesteps=v_ts, + positions=s1_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_sample, + sigma=sigma.unsqueeze(0), + timesteps=a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + video_velocity, audio_velocity = wrapper(video_mod, audio_mod) + step_time = time.time() - t0 + s1_total_time += step_time + dt = sigma_next - sigma + video_sample = (video_sample.float() + video_velocity.float() * dt).to( + dtype + ) + audio_sample = (audio_sample.float() + audio_velocity.float() * dt).to( + dtype + ) + # I2V: preserve frame 0 tokens after each Euler step + if s1_denoise_mask is not None and s1_clean_latent is not None: + video_sample = ( + video_sample * s1_denoise_mask + + s1_clean_latent * (1.0 - s1_denoise_mask) + ).to(dtype) + logger.info( + " S1 Step %d/%d: sigma %.4f -> %.4f (%.1fs)", + step_idx + 1, + args.num_steps, + sigma.item(), + sigma_next.item(), + step_time, + ) + + logger.info( + " Stage 1 total: %.1fs (%.1fs/step)", + s1_total_time, + s1_total_time / args.num_steps, + ) + + # Unload half-res backbone + if args.use_app: + s1_app.unload_backbone() + del s1_app + else: + unload_neuron_model(neuron_backbone, "half-res DiT backbone") + del neuron_backbone, wrapper + + # Unpatchify Stage 1 output to spatial format + s1_video_latent = v_patchifier.unpatchify(video_sample, s1_video_shape) + logger.info(" Stage 1 video latent: %s", s1_video_latent.shape) + + # Save S1 latent for two-phase TP switching (Phase 1 output) + if args.save_s1_latent: + save_data 
= { + "s1_video_latent": s1_video_latent, + "audio_sample": audio_sample, + "video_context": video_context, + "audio_context": audio_context, + "context_mask": context_mask, + "s1_total_time": s1_total_time, + } + torch.save(save_data, args.save_s1_latent) + logger.info( + "\n=== Phase 1 complete: S1 latent saved to %s ===", + args.save_s1_latent, + ) + logger.info( + " S1 denoising: %.1fs (%.1fs/step)", + s1_total_time, + s1_total_time / args.num_steps, + ) + logger.info( + " Run Phase 2 with --load-s1-latent %s --s2-tp-degree ", + args.save_s1_latent, + ) + return + + # --- Spatial upsample x2 --- + logger.info("\n=== Spatial Upsample x2 ===") + spatial_up = load_spatial_upscaler(args.spatial_upscaler_path, dtype=dtype) + s2_video_latent = spatial_upscale_latent( + s1_video_latent, cpu["video_decoder"], spatial_up + ) + logger.info( + " Upscaled: %s -> %s", s1_video_latent.shape, s2_video_latent.shape + ) + del spatial_up, s1_video_latent + gc.collect() + + # --- Stage 2: Full-res refinement --- + logger.info("\n=== Stage 2: Full-resolution refinement ===") + + # Full-res latent dimensions + s2_latent_h = args.height // 32 + s2_latent_w = args.width // 32 + s2_latent_f = s2_video_latent.shape[2] # Same frame count as Stage 1 + s2_video_shape = VideoLatentShape( + batch=1, + channels=128, + frames=s2_latent_f, + height=s2_latent_h, + width=s2_latent_w, + ) + s2_video_tools = VideoLatentTools( + target_shape=s2_video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=args.fps, + ) + s2_video_state = s2_video_tools.create_initial_state(device="cpu", dtype=dtype) + + # Patchify the upscaled latent for Stage 2 denoising + s2_upscaled_tokens = v_patchifier.patchify(s2_video_latent) + logger.info(" S2 upscaled tokens: %s", s2_upscaled_tokens.shape) + + # Noise injection: mix upscaled latent with noise at sigma=0.909375 + s2_sigmas = torch.tensor(STAGE_2_DISTILLED_SIGMA_VALUES, dtype=torch.float32) + noise_scale = s2_sigmas[0].item() + 
gen_s2 = torch.Generator().manual_seed(args.seed + 42) + s2_noise = torch.randn(s2_upscaled_tokens.shape, dtype=dtype, generator=gen_s2) + video_sample = ( + noise_scale * s2_noise + (1.0 - noise_scale) * s2_upscaled_tokens + ).to(dtype) + logger.info( + " Noise injected at sigma=%.4f: %.1f%% noise + %.1f%% signal", + noise_scale, + noise_scale * 100, + (1 - noise_scale) * 100, + ) + del s2_noise, s2_upscaled_tokens, s2_video_latent + + # Load full-res Neuron backbone (same compiled model, same weights) + if args.use_app: + # Application path: create or reuse compositor for Stage 2 + s2_compiled_dir = args.s2_app_compiled_dir + if args.s2_tp_degree != args.tp_degree: + # Different TP for Stage 2 — need a fresh compositor + logger.info( + "Creating Stage 2 compositor (TP=%d, different from S1 TP=%d)...", + args.s2_tp_degree, + args.tp_degree, + ) + s2_app = create_app_compositor( + model_path=args.model_path, + encoder_path=args.gemma_path, + tp_degree=args.s2_tp_degree, + text_seq=args.text_seq, + height=args.height, + width=args.width, + num_frames=args.num_frames, + ) + else: + # Same TP — reuse the full-res compositor created for encoder + s2_app = app + logger.info("Loading full-res backbone from %s...", s2_compiled_dir) + t0 = time.time() + s2_app.load_backbone(s2_compiled_dir) + logger.info( + "Full-res backbone loaded via Application in %.1fs", + time.time() - t0, + ) + neuron_backbone = s2_app + wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + mask_4d=True, + compiled_text_seq=APP_BACKBONE_TEXT_SEQ, + ) + else: + logger.info("Loading full-res backbone from %s...", args.compile_dir) + neuron_backbone = load_neuron_backbone( + args.compile_dir, + args.model_path, + args.s2_tp_degree, + sharded_dir=args.backbone_sharded_dir, + ) + wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + ) + + # 
Warmup full-res backbone + logger.info("Warming up full-res DiT backbone...") + warmup_sigma = torch.tensor([1.0]) + warmup_v_ts = warmup_sigma.unsqueeze(0).expand( + 1, s2_video_state.latent.shape[1] + ) + warmup_a_ts = warmup_sigma.unsqueeze(0).expand(1, audio_state.latent.shape[1]) + warmup_video_mod = Modality( + latent=torch.randn(1, s2_video_state.latent.shape[1], 128, dtype=dtype), + sigma=warmup_sigma, + timesteps=warmup_v_ts, + positions=s2_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + warmup_audio_mod = Modality( + latent=torch.randn_like(audio_sample), + sigma=warmup_sigma, + timesteps=warmup_a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + _ = wrapper(warmup_video_mod, warmup_audio_mod) + logger.info(" Full-res warmup done in %.1fs", time.time() - t0) + del warmup_video_mod, warmup_audio_mod + + # Stage 2 denoising (3 steps at full-res, no CFG) + s2_num_steps = len(s2_sigmas) - 1 + logger.info( + "\n=== Stage 2 Denoising (%d steps at %dx%d) ===", + s2_num_steps, + args.height, + args.width, + ) + s2_total_time = 0.0 + for step_idx in range(s2_num_steps): + sigma = s2_sigmas[step_idx] + sigma_next = s2_sigmas[step_idx + 1] + video_seq_len = s2_video_state.latent.shape[1] + audio_seq_len = audio_state.latent.shape[1] + v_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, video_seq_len) + a_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, audio_seq_len) + video_mod = Modality( + latent=video_sample, + sigma=sigma.unsqueeze(0), + timesteps=v_ts, + positions=s2_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_sample, + sigma=sigma.unsqueeze(0), + timesteps=a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + 
context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + video_velocity, audio_velocity = wrapper(video_mod, audio_mod) + step_time = time.time() - t0 + s2_total_time += step_time + dt = sigma_next - sigma + video_sample = (video_sample.float() + video_velocity.float() * dt).to( + dtype + ) + audio_sample = (audio_sample.float() + audio_velocity.float() * dt).to( + dtype + ) + logger.info( + " S2 Step %d/%d: sigma %.4f -> %.4f (%.1fs)", + step_idx + 1, + s2_num_steps, + sigma.item(), + sigma_next.item(), + step_time, + ) + + logger.info( + " Stage 2 total: %.1fs (%.1fs/step)", + s2_total_time, + s2_total_time / s2_num_steps, + ) + logger.info( + "\n Combined denoising: S1=%.1fs + S2=%.1fs = %.1fs", + s1_total_time, + s2_total_time, + s1_total_time + s2_total_time, + ) + + # Unpatchify and decode (reuse the existing decode path) + video_latent_spatial = v_patchifier.unpatchify(video_sample, s2_video_shape) + audio_latent_spatial = a_patchifier.unpatchify(audio_sample, audio_shape) + video_shape = s2_video_shape # for the decode path below + + # Unload full-res backbone before decode + if args.use_app: + s2_app.unload_backbone() + if s2_app is not app: + del s2_app + else: + unload_neuron_model(neuron_backbone, "full-res DiT backbone") + del neuron_backbone, wrapper + + # Skip to decode section (jump past single-stage code) + # Set these for the shared decode path + logger.info("\n=== Decoding ===") + video_latent_4d = video_latent_spatial[0] + logger.info(" Video latent for VAE: %s", video_latent_4d.shape) + + # Video decode + os.makedirs(args.output_dir, exist_ok=True) + logger.info(" Decoding video...") + t0 = time.time() + from ltx_core.model.video_vae.video_vae import decode_video + + video_chunks = [] + with torch.no_grad(): + for chunk in decode_video(video_latent_4d, cpu["video_decoder"]): + video_chunks.append(chunk) + video_frames = torch.cat(video_chunks, dim=0) + logger.info( + " Video decoded: %s in 
%.1fs", video_frames.shape, time.time() - t0 + ) + + from PIL import Image + + for i in range(video_frames.shape[0]): + frame = video_frames[i].numpy() + img = Image.fromarray(frame) + img.save(os.path.join(args.output_dir, f"frame_{i:04d}.png")) + logger.info(" Saved %d frames to %s", video_frames.shape[0], args.output_dir) + + # Save as MP4 + try: + import subprocess + + frame_pattern = os.path.join(args.output_dir, "frame_%04d.png") + mp4_path = os.path.join(args.output_dir, "output.mp4") + subprocess.run( + [ + "ffmpeg", + "-y", + "-framerate", + str(int(args.fps)), + "-i", + frame_pattern, + "-c:v", + "libx264", + "-pix_fmt", + "yuv420p", + mp4_path, + ], + capture_output=True, + check=True, + ) + logger.info(" Saved MP4: %s", mp4_path) + except (subprocess.CalledProcessError, FileNotFoundError) as e: + logger.warning(" ffmpeg not available, skipping MP4: %s", e) + + # Audio decode + logger.info(" Decoding audio...") + t0 = time.time() + from ltx_core.model.audio_vae.audio_vae import decode_audio + + with torch.no_grad(): + audio_result = decode_audio( + audio_latent_spatial.float(), + cpu["audio_decoder"].float(), + cpu["vocoder"], + ) + logger.info( + " Audio decoded: waveform %s, sr=%d in %.1fs", + audio_result.waveform.shape, + audio_result.sampling_rate, + time.time() - t0, + ) + + try: + import torchaudio + + wav_path = os.path.join(args.output_dir, "output.wav") + torchaudio.save( + wav_path, audio_result.waveform.cpu(), audio_result.sampling_rate + ) + logger.info(" Saved WAV: %s", wav_path) + except ImportError: + wav_path = os.path.join(args.output_dir, "audio_waveform.pt") + torch.save( + { + "waveform": audio_result.waveform.cpu(), + "sr": audio_result.sampling_rate, + }, + wav_path, + ) + logger.info(" Saved audio tensor: %s", wav_path) + + logger.info("\n=== Done! 
Two-stage output saved to %s ===", args.output_dir) + return + + # ========================================================================= + # SINGLE-STAGE PIPELINE (original path) + # ========================================================================= + + # Load Neuron backbone — AFTER text encoding to avoid NeuronCore contention + # When using --neuron-gemma or --use-app, Gemma3 was already unloaded above + logger.info("\n=== Loading Neuron backbone ===") + if args.use_app: + # Application path: load backbone via compositor + t0 = time.time() + app.load_backbone(args.app_compiled_dir) + logger.info("Backbone loaded via Application in %.1fs", time.time() - t0) + neuron_backbone = app # app is callable with the same 24-tensor interface + else: + neuron_backbone = load_neuron_backbone( + args.compile_dir, + args.model_path, + args.tp_degree, + sharded_dir=args.backbone_sharded_dir, + ) + + # Build pipeline wrapper + from pipeline import NeuronTransformerWrapper + + # Application-compiled NEFFs use text_seq=256 (matching standalone); no padding needed + compiled_text_seq = APP_BACKBONE_TEXT_SEQ if args.use_app else None + wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + mask_4d=args.use_app, + compiled_text_seq=compiled_text_seq, + ) + + # Setup latent tools (needed for warmup and denoising) + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + from ltx_core.model.transformer.modality import Modality + + # Compute latent dimensions + # LTX-2.3 VAE downsamples spatially by 32x (not 16x as in some other models) + # For 384x512 -> height=12, width=16 latent grid + latent_h = args.height // 32 + latent_w = args.width // 32 + latent_f = (args.num_frames - 1) // 8 + 1 + + video_shape = VideoLatentShape( + batch=1, channels=128, frames=latent_f, 
height=latent_h, width=latent_w + ) + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() # time=8, height=32, width=32 + video_tools = VideoLatentTools( + target_shape=video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=args.fps, + ) + + audio_shape = AudioLatentShape( + batch=1, channels=8, frames=args.audio_num_frames, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools(patchifier=a_patchifier, target_shape=audio_shape) + + # Create initial noise + gen = torch.Generator().manual_seed(args.seed) + video_state = video_tools.create_initial_state(device="cpu", dtype=dtype) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype) + + video_sample = torch.randn(video_state.latent.shape, dtype=dtype, generator=gen) + audio_sample = torch.randn(audio_state.latent.shape, dtype=dtype, generator=gen) + + # Image-to-Video conditioning: encode input image, replace frame 0 tokens, + # build denoise_mask for per-token sigma control + denoise_mask = None # None = text-to-video mode (all tokens denoised equally) + clean_latent = None + if args.image: + logger.info("\n=== Image-to-Video conditioning ===") + # Encode the input image into normalized latent space + # Returns (1, 128, 1, latent_h, latent_w) normalized latent + image_latent_5d = encode_image( + args.image, args.model_path, args.height, args.width, dtype + ) + + # Patchify the image latent to get token representation + # Same patchifier as video — patch_size=1 so it's rearrange(b c f h w -> b (f*h*w) c) + image_tokens = v_patchifier.patchify(image_latent_5d) + # image_tokens shape: (1, latent_h * latent_w, 128) + frame_0_tokens = latent_h * latent_w + logger.info( + " Image patchified: %s (frame 0 = %d tokens)", + image_tokens.shape, + frame_0_tokens, + ) + + # Replace frame 0 noise tokens with encoded image tokens + video_sample[:, :frame_0_tokens] = image_tokens[:, 
:frame_0_tokens] + + # Build denoise_mask: 0.0 for conditioned frame 0, 1.0 for unconditioned rest + # Shape: (1, video_seq_len, 1) — broadcastable with latent (1, seq, C) + video_seq_len = video_sample.shape[1] + denoise_mask = torch.ones(1, video_seq_len, 1, dtype=dtype) + denoise_mask[:, :frame_0_tokens, :] = 0.0 + logger.info( + " Denoise mask: %d conditioned tokens, %d unconditioned tokens", + frame_0_tokens, + video_seq_len - frame_0_tokens, + ) + + # Store clean latent for post-step preservation of frame 0 + clean_latent = video_sample.clone() + + logger.info(" I2V conditioning applied") + + # Warmup the DiT backbone — first call loads NEFF onto NeuronCores + # Without this, the first 1-2 denoising steps take 100-200s instead of 0.3s + logger.info("\n=== Warming up DiT backbone ===") + warmup_sigma = torch.tensor([1.0]) + warmup_v_ts = warmup_sigma.unsqueeze(0).expand(1, video_state.latent.shape[1]) + warmup_a_ts = warmup_sigma.unsqueeze(0).expand(1, audio_state.latent.shape[1]) + warmup_ctx = ( + video_context + if video_context is not None + else torch.randn(1, args.text_seq, 4096, dtype=dtype) + ) + warmup_actx = ( + audio_context + if audio_context is not None + else torch.randn(1, args.text_seq, 2048, dtype=dtype) + ) + warmup_mask = ( + context_mask + if context_mask is not None + else torch.ones(1, args.text_seq, dtype=torch.int64) + ) + + warmup_video_mod = Modality( + latent=torch.randn_like(video_sample), + sigma=warmup_sigma, + timesteps=warmup_v_ts, + positions=video_state.positions, + context=warmup_ctx, + enabled=True, + context_mask=warmup_mask, + attention_mask=None, + ) + warmup_audio_mod = Modality( + latent=torch.randn_like(audio_sample), + sigma=warmup_sigma, + timesteps=warmup_a_ts, + positions=audio_state.positions, + context=warmup_actx, + enabled=True, + context_mask=warmup_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + _ = wrapper(warmup_video_mod, warmup_audio_mod) + logger.info(" DiT backbone 
warmup done in %.1fs", time.time() - t0) + del warmup_video_mod, warmup_audio_mod + + logger.info("\n=== Denoising (%d steps) ===", args.num_steps) + logger.info( + " Video latent: %s, Audio latent: %s", video_sample.shape, audio_sample.shape + ) + + # Sigma schedule — use distilled values for the distilled model + # The distilled model was trained with these exact sigma values + sigmas = torch.tensor(DISTILLED_SIGMA_VALUES, dtype=torch.float32) + assert len(sigmas) == args.num_steps + 1, ( + f"Distilled sigma values have {len(sigmas)} entries " + f"but {args.num_steps} steps require {args.num_steps + 1}" + ) + logger.info(" Sigmas: %s", [f"{s:.4f}" for s in sigmas.tolist()]) + + # Denoising loop + total_time = 0.0 + for step_idx in range(args.num_steps): + sigma = sigmas[step_idx] + sigma_next = sigmas[step_idx + 1] + + video_seq_len = video_state.latent.shape[1] + audio_seq_len = audio_state.latent.shape[1] + + # Per-token timesteps: in I2V mode, frame 0 tokens get timestep=0 + # (already clean), while all other tokens get timestep=sigma + if denoise_mask is not None: + # denoise_mask: (1, video_seq, 1), squeeze last dim for timesteps + v_ts = denoise_mask.squeeze(-1) * sigma # (1, video_seq) + else: + v_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, video_seq_len) + a_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, audio_seq_len) + + video_mod = Modality( + latent=video_sample, + sigma=sigma.unsqueeze(0), + timesteps=v_ts, + positions=video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_sample, + sigma=sigma.unsqueeze(0), + timesteps=a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + + t0 = time.time() + with torch.no_grad(): + video_velocity, audio_velocity = wrapper(video_mod, audio_mod) + step_time = time.time() - t0 + total_time += step_time + + # Euler step 
using velocity directly (backbone outputs velocity, NOT denoised) + # next = sample + velocity * (sigma_next - sigma) + dt = sigma_next - sigma + video_sample = (video_sample.float() + video_velocity.float() * dt).to(dtype) + audio_sample = (audio_sample.float() + audio_velocity.float() * dt).to(dtype) + + # I2V: preserve frame 0 tokens (conditioned) after each Euler step + # This ensures the clean image latent is never corrupted by denoising + if denoise_mask is not None and clean_latent is not None: + # denoise_mask: (1, seq, 1), video_sample: (1, seq, C) + # Where mask=0 (frame 0): use clean_latent; where mask=1: keep denoised + video_sample = ( + video_sample * denoise_mask + clean_latent * (1.0 - denoise_mask) + ).to(dtype) + + logger.info( + " Step %d/%d: sigma %.4f -> %.4f (%.1fs)", + step_idx + 1, + args.num_steps, + sigma.item(), + sigma_next.item(), + step_time, + ) + + logger.info( + " Total denoising: %.1fs (%.1fs/step)", total_time, total_time / args.num_steps + ) + + # Unpatchify latents back to spatial format for VAE + logger.info("\n=== Decoding ===") + + # Video: unpatchify (B, seq, C) -> (B, C, F, H, W) -> (C, F, H, W) for VAE + video_latent_spatial = v_patchifier.unpatchify(video_sample, video_shape) + logger.info(" Video latent after unpatchify: %s", video_latent_spatial.shape) + + # Audio: unpatchify (B, seq, C) -> spatial format for VAE + audio_latent_spatial = a_patchifier.unpatchify(audio_sample, audio_shape) + logger.info(" Audio latent for VAE: %s", audio_latent_spatial.shape) + + # Optional upscaling (spatial x2 + temporal x2) + if args.upscale: + logger.info("\n=== Upscaling latents ===") + upscalers = load_upscalers( + args.spatial_upscaler_path, args.temporal_upscaler_path, dtype=dtype + ) + video_latent_spatial = upscale_video_latent( + video_latent_spatial, + cpu["video_decoder"], + upscalers["spatial_upsampler"], + upscalers["temporal_upsampler"], + ) + # Free upscalers after use + del upscalers + gc.collect() + + video_latent_4d = 
video_latent_spatial[0] # remove batch dim -> (C, F, H, W) + logger.info(" Video latent for VAE: %s", video_latent_4d.shape) + + # Video decode + os.makedirs(args.output_dir, exist_ok=True) + + logger.info(" Decoding video...") + t0 = time.time() + from ltx_core.model.video_vae.video_vae import decode_video + + video_chunks = [] + with torch.no_grad(): + for chunk in decode_video(video_latent_4d, cpu["video_decoder"]): + video_chunks.append(chunk) + video_frames = torch.cat(video_chunks, dim=0) # (F, H, W, 3) uint8 + logger.info(" Video decoded: %s in %.1fs", video_frames.shape, time.time() - t0) + + # Save video frames + from PIL import Image + + for i in range(video_frames.shape[0]): + frame = video_frames[i].numpy() + img = Image.fromarray(frame) + img.save(os.path.join(args.output_dir, f"frame_{i:04d}.png")) + logger.info(" Saved %d frames to %s", video_frames.shape[0], args.output_dir) + + # Save as MP4 + try: + import subprocess + + frame_pattern = os.path.join(args.output_dir, "frame_%04d.png") + mp4_path = os.path.join(args.output_dir, "output.mp4") + subprocess.run( + [ + "ffmpeg", + "-y", + "-framerate", + str(int(args.fps)), + "-i", + frame_pattern, + "-c:v", + "libx264", + "-pix_fmt", + "yuv420p", + mp4_path, + ], + capture_output=True, + check=True, + ) + logger.info(" Saved MP4: %s", mp4_path) + except (subprocess.CalledProcessError, FileNotFoundError) as e: + logger.warning(" ffmpeg not available, skipping MP4: %s", e) + + # Audio decode + logger.info(" Decoding audio...") + t0 = time.time() + from ltx_core.model.audio_vae.audio_vae import decode_audio + + with torch.no_grad(): + audio_result = decode_audio( + audio_latent_spatial.float(), cpu["audio_decoder"].float(), cpu["vocoder"] + ) + logger.info( + " Audio decoded: waveform %s, sr=%d in %.1fs", + audio_result.waveform.shape, + audio_result.sampling_rate, + time.time() - t0, + ) + + # Save audio + try: + import torchaudio + + wav_path = os.path.join(args.output_dir, "output.wav") + 
torchaudio.save( + wav_path, audio_result.waveform.cpu(), audio_result.sampling_rate + ) + logger.info(" Saved WAV: %s", wav_path) + except ImportError: + # Fallback: save as raw tensor + wav_path = os.path.join(args.output_dir, "audio_waveform.pt") + torch.save( + {"waveform": audio_result.waveform.cpu(), "sr": audio_result.sampling_rate}, + wav_path, + ) + logger.info(" Saved audio tensor: %s", wav_path) + + # Save latents for analysis + torch.save( + { + "video_latent": video_sample.cpu(), + "audio_latent": audio_sample.cpu(), + "video_latent_spatial": video_latent_spatial.cpu(), + "audio_latent_spatial": audio_latent_spatial.cpu(), + }, + os.path.join(args.output_dir, "latents.pt"), + ) + + logger.info("\n=== Done! Output saved to %s ===", args.output_dir) + + +def main(): + parser = argparse.ArgumentParser(description="LTX-2.3 E2E Generation on Neuron") + parser.add_argument( + "--model-path", default=MODEL_PATH, help="Safetensors model path" + ) + parser.add_argument( + "--compile-dir", default=COMPILE_DIR, help="Compiled model directory" + ) + parser.add_argument( + "--backbone-sharded-dir", + default="/home/ubuntu/backbone_sharded", + help="Directory with pre-sharded backbone weights (from shard_backbone_weights.py). 
" + "Falls back to safetensors loading if not found.", + ) + parser.add_argument("--output-dir", default=OUTPUT_DIR, help="Output directory") + parser.add_argument( + "--prompt", + default="A golden retriever puppy runs across a sunny green meadow", + help="Text prompt", + ) + parser.add_argument( + "--no-text-encoder", + action="store_true", + help="Use random embeddings instead of Gemma 3", + ) + parser.add_argument("--gemma-path", default=None, help="Path to Gemma 3 12B model") + parser.add_argument( + "--neuron-gemma", + action="store_true", + help="Use Neuron-compiled Gemma3 encoder (requires compile_gemma3.py + shard_gemma3_weights.py)", + ) + parser.add_argument( + "--gemma-compiled-dir", + default=GEMMA3_COMPILED_DIR, + help="Directory with compiled Gemma3 encoder (from compile_gemma3.py)", + ) + parser.add_argument( + "--gemma-sharded-dir", + default=GEMMA3_SHARDED_DIR, + help="Directory with pre-sharded Gemma3 weights (from shard_gemma3_weights.py)", + ) + parser.add_argument("--height", type=int, default=384, help="Video height") + parser.add_argument("--width", type=int, default=512, help="Video width") + parser.add_argument( + "--num-frames", type=int, default=25, help="Number of video frames" + ) + parser.add_argument("--num-steps", type=int, default=8, help="Denoising steps") + parser.add_argument("--seed", type=int, default=42, help="Random seed") + parser.add_argument("--fps", type=float, default=24.0, help="Video frame rate") + parser.add_argument( + "--audio-num-frames", type=int, default=26, help="Audio latent frames" + ) + parser.add_argument( + "--text-seq", type=int, default=TEXT_SEQ, help="Text sequence length" + ) + parser.add_argument("--tp-degree", type=int, default=TP_DEGREE, help="TP degree") + parser.add_argument( + "--s2-tp-degree", + type=int, + default=None, + help="TP degree for Stage 2 backbone in --two-stage mode (default: same as --tp-degree). " + "Use this when Stage 2 (full-res) needs more TP than Stage 1 (half-res), e.g. 
" + "--tp-degree 4 --s2-tp-degree 16 on trn2.48xlarge.", + ) + parser.add_argument( + "--upscale", + action="store_true", + help="Apply spatial x2 + temporal x2 upscaling before VAE decode", + ) + parser.add_argument( + "--spatial-upscaler-path", + default=SPATIAL_UPSCALER_PATH, + help="Path to spatial upscaler x2 safetensors", + ) + parser.add_argument( + "--temporal-upscaler-path", + default=TEMPORAL_UPSCALER_PATH, + help="Path to temporal upscaler x2 safetensors", + ) + parser.add_argument( + "--image", + default=None, + help="Path to input image for image-to-video generation. " + "When specified, frame 0 is conditioned on the encoded image " + "and only subsequent frames are denoised.", + ) + parser.add_argument( + "--two-stage", + action="store_true", + help="Two-stage generation: Stage 1 at half-res (192x256), spatial upsample x2, " + "then Stage 2 refinement at full-res (384x512). Produces sharper output than " + "single-stage at the cost of additional compilation + denoising steps.", + ) + parser.add_argument( + "--halfres-compiled-dir", + default=HALFRES_COMPILE_DIR, + help="Directory with half-res compiled backbone (from compile_transformer_halfres.py)", + ) + parser.add_argument( + "--halfres-sharded-dir", + default=HALFRES_SHARDED_DIR, + help="Directory with pre-sharded backbone weights for half-res model " + "(same weights, different compiled shape)", + ) + parser.add_argument( + "--use-app", + action="store_true", + help="Use NeuronApplicationBase path for both Gemma3 encoder and DiT backbone. " + "Faster than --neuron-gemma due to ModelBuilder weight layout optimization. " + "Requires --app-compiled-dir with backbone/ and text_encoder/ subdirs.", + ) + parser.add_argument( + "--app-compiled-dir", + default=APP_COMPILED_DIR, + help="Base directory with Application-compiled artifacts. 
" + "Must contain backbone/ and text_encoder/ subdirs (from NeuronLTX23Application.compile()).", + ) + parser.add_argument( + "--app-halfres-compiled-dir", + default=APP_HALFRES_COMPILED_DIR, + help="Base directory with half-res Application-compiled backbone. " + "Used for Stage 1 of --two-stage --use-app. Must contain backbone/ subdir.", + ) + parser.add_argument( + "--s2-app-compiled-dir", + default=None, + help="Base directory with Stage 2 Application-compiled backbone (when using different TP). " + "Defaults to --app-compiled-dir. Used only with --two-stage --use-app --s2-tp-degree.", + ) + parser.add_argument( + "--save-s1-latent", + default=None, + help="Path to save Stage 1 output latent + context (for two-phase TP switching). " + "When set with --two-stage, runs encoder + S1 denoising then saves and exits.", + ) + parser.add_argument( + "--load-s1-latent", + default=None, + help="Path to load Stage 1 output latent + context (for two-phase TP switching). " + "When set with --two-stage, skips encoder + S1 and jumps directly to spatial upsample + S2.", + ) + + args = parser.parse_args() + + # Resolve Stage 2 TP and compiled dir defaults + if args.s2_tp_degree is None: + args.s2_tp_degree = args.tp_degree + if args.s2_app_compiled_dir is None: + args.s2_app_compiled_dir = args.app_compiled_dir + + if args.use_app: + if args.gemma_path is None: + args.gemma_path = GEMMA3_MODEL_PATH + elif not args.no_text_encoder and not args.neuron_gemma and args.gemma_path is None: + parser.error( + "Either --no-text-encoder, --neuron-gemma, --use-app, or --gemma-path must be specified. " + "Use --use-app for Application-compiled models (fastest, recommended)." 
+ ) + + generate(args) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/load_with_weights.py b/contrib/models/LTX-2.3/src/load_with_weights.py new file mode 100644 index 00000000..fdcea7b7 --- /dev/null +++ b/contrib/models/LTX-2.3/src/load_with_weights.py @@ -0,0 +1,453 @@ +#!/usr/bin/env python3 +""" +LTX-2.3 Load with Real Weights & Forward Test +================================================ +Loads the compiled tp_0.pt, injects properly TP-sharded weights from +safetensors, and runs a single forward pass to validate correctness. + +This script handles the full weight loading pipeline: +1. Load safetensors -> strip ComfyUI prefix +2. Map safetensors keys to JIT model parameter names +3. Shard each weight per TP rank according to the sharding pattern +4. Inject into each rank's JIT model +5. Load onto Neuron devices +6. Run forward pass + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + python3 load_with_weights.py +""" + +import os +import sys +import time +import json +import gc +import torch +import numpy as np + +os.environ["NEURON_FUSE_SOFTMAX"] = "1" +os.environ["NEURON_CUSTOM_SILU"] = "1" +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +MODEL_PATH = "/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors" +COMPILE_DIR = "/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_v2" +TP_DEGREE = 4 +BATCH = 1 +VIDEO_SEQ = 768 +AUDIO_SEQ = 26 +TEXT_SEQ = 256 + + +def load_config(): + from safetensors import safe_open + + with safe_open(MODEL_PATH, framework="pt") as f: + metadata = f.metadata() + return json.loads(metadata["config"]) + + +def precompute_inputs(config): + """Build example inputs using native ltx-core preprocessors.""" + sys.path.insert(0, "/home/ubuntu/ltx23_nxdi") + from modeling_ltx23 import replace_sdpa_with_bmm + + replace_sdpa_with_bmm() + + from ltx_core.model.transformer.model_configurator import 
LTXModelConfigurator + from ltx_core.model.transformer.modality import Modality + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + + dtype = torch.bfloat16 + ltx_model = LTXModelConfigurator.from_config(config) + ltx_model = ltx_model.to(dtype=dtype) + ltx_model.eval() + + torch.manual_seed(42) + + video_shape = VideoLatentShape( + batch=BATCH, channels=128, frames=4, height=12, width=16 + ) + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors(time=1, width=8, height=8) + video_tools = VideoLatentTools( + target_shape=video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=24.0, + ) + video_state = video_tools.create_initial_state(device="cpu", dtype=dtype) + + audio_shape = AudioLatentShape( + batch=BATCH, channels=8, frames=AUDIO_SEQ, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools(patchifier=a_patchifier, target_shape=audio_shape) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype) + + # CRITICAL: Use random noise as latent, NOT zeros from create_initial_state(). + # The denoising loop starts with pure noise at sigma=1.0. 
+ video_latent = torch.randn_like(video_state.latent) + audio_latent = torch.randn_like(audio_state.latent) + + sigma = torch.tensor([1.0], dtype=dtype) + v_ts = sigma.unsqueeze(1).expand(BATCH, video_latent.shape[1]) + a_ts = sigma.unsqueeze(1).expand(BATCH, audio_latent.shape[1]) + + ctx_v = torch.randn(BATCH, TEXT_SEQ, 4096, dtype=dtype) + ctx_a = torch.randn(BATCH, TEXT_SEQ, 2048, dtype=dtype) + ctx_mask = torch.ones(BATCH, TEXT_SEQ, dtype=dtype) + ctx_mask[:, 50:] = 0 + + video_mod = Modality( + latent=video_latent, + sigma=sigma, + timesteps=v_ts, + positions=video_state.positions, + context=ctx_v, + enabled=True, + context_mask=ctx_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_latent, + sigma=sigma, + timesteps=a_ts, + positions=audio_state.positions, + context=ctx_a, + enabled=True, + context_mask=ctx_mask, + attention_mask=None, + ) + + with torch.no_grad(): + va = ltx_model.video_args_preprocessor.prepare(video_mod, audio_mod) + aa = ltx_model.audio_args_preprocessor.prepare(audio_mod, video_mod) + + video_pe_cos, video_pe_sin = va.positional_embeddings + audio_pe_cos, audio_pe_sin = aa.positional_embeddings + ca_video_pe_cos, ca_video_pe_sin = va.cross_positional_embeddings + ca_audio_pe_cos, ca_audio_pe_sin = aa.cross_positional_embeddings + + inputs = ( + va.x.to(dtype), + aa.x.to(dtype), + va.context.to(dtype), + aa.context.to(dtype), + va.timesteps.to(dtype), + aa.timesteps.to(dtype), + va.embedded_timestep.to(dtype), + aa.embedded_timestep.to(dtype), + va.cross_scale_shift_timestep.to(dtype), + aa.cross_scale_shift_timestep.to(dtype), + va.cross_gate_timestep.to(dtype), + aa.cross_gate_timestep.to(dtype), + video_pe_cos.to(dtype), + video_pe_sin.to(dtype), + audio_pe_cos.to(dtype), + audio_pe_sin.to(dtype), + ca_video_pe_cos.to(dtype), + ca_video_pe_sin.to(dtype), + ca_audio_pe_cos.to(dtype), + ca_audio_pe_sin.to(dtype), + va.context_mask.to(dtype), + aa.context_mask.to(dtype).clone(), + va.prompt_timestep.to(dtype), + 
aa.prompt_timestep.to(dtype), + ) + + del ltx_model + gc.collect() + return inputs + + +def shard_weight(full_weight, jit_param_name, tp_rank, tp_size): + """Shard a full weight tensor for the given TP rank. + + Determines the sharding pattern from the JIT parameter name: + - ColumnParallelLinear (to_q, to_k, to_v, to_gate_logits, GEGLU proj): + weight sharded on dim 0, bias sharded on dim 0 + - RowParallelLinear (to_out->0, ff->net->2): + weight sharded on dim 1, bias NOT sharded + - DistributedRMSNorm (q_norm, k_norm): + weight sharded on dim 0 + - SPMDRank: rank tensor, select element for this rank + - Unsharded (scale_shift_table, norm_out, proj_out, etc.): + return full weight unchanged + """ + name = jit_param_name + + # SPMDRank: select this rank's value + if "spmd_rank" in name: + return torch.tensor([tp_rank], dtype=torch.int32) + + # Determine sharding from the parameter name + shard_size = None + shard_dim = None + + # Check if this is a sharded parameter + is_column_weight = False + is_column_bias = False + is_row_weight = False + is_row_bias = False + is_norm_weight = False + + # Column-parallel: to_q, to_k, to_v, to_gate_logits + # Use delimited patterns (->X->) to avoid false matches like + # "to_v" matching "audio_to_video_attn" + for col_name in ["->to_q->", "->to_k->", "->to_v->", "->to_gate_logits->"]: + if col_name in name: + if name.endswith("weight"): + is_column_weight = True + elif name.endswith("bias"): + is_column_bias = True + break + + # Column-parallel: GEGLU gate proj (ff->net->0->proj) + if "ff->net->0->proj" in name: + if name.endswith("weight"): + is_column_weight = True + elif name.endswith("bias"): + is_column_bias = True + + # Row-parallel: output projection (to_out->0) + if "to_out->0" in name: + if name.endswith("weight"): + is_row_weight = True + elif name.endswith("bias"): + is_row_bias = True # Bias not sharded for RowParallel + + # Row-parallel: FFN down projection (ff->net->2) + if "ff->net->2" in name: + if 
name.endswith("weight"): + is_row_weight = True + elif name.endswith("bias"): + is_row_bias = True # Bias not sharded + + # DistributedRMSNorm: q_norm, k_norm + if ("q_norm" in name or "k_norm" in name) and name.endswith("weight"): + is_norm_weight = True + + # Apply sharding + if is_column_weight or is_column_bias or is_norm_weight: + # Shard on dim 0 + shard_size = full_weight.shape[0] // tp_size + return full_weight[shard_size * tp_rank : shard_size * (tp_rank + 1)].clone() + + elif is_row_weight: + # Shard on dim 1 + shard_size = full_weight.shape[1] // tp_size + return full_weight[:, shard_size * tp_rank : shard_size * (tp_rank + 1)].clone() + + elif is_row_bias: + # Not sharded — full copy + return full_weight.clone() + + else: + # Unsharded (scale_shift_table, norm_out, proj_out, audio variants) + return full_weight.clone() + + +def load_and_shard_weights(): + """Load safetensors and create per-rank state dicts.""" + from safetensors.torch import load_file + + print(" Loading safetensors...", flush=True) + t0 = time.time() + full_sd = load_file(MODEL_PATH) + print(f" Loaded {len(full_sd)} tensors in {time.time() - t0:.1f}s", flush=True) + + # Strip ComfyUI prefix + prefix = "model.diffusion_model." 
+ stripped_sd = {} + for k, v in full_sd.items(): + if k.startswith(prefix): + stripped_sd[k[len(prefix) :]] = v + else: + stripped_sd[k] = v + + # Filter to backbone keys (same as compile_transformer.py) + backbone_prefixes = ( + "transformer_blocks.", + "norm_out.", + "proj_out.", + "scale_shift_table", + "audio_norm_out.", + "audio_proj_out.", + "audio_scale_shift_table", + ) + backbone_sd = {} + for k, v in stripped_sd.items(): + if k.startswith(backbone_prefixes): + backbone_sd[k] = v.to(torch.bfloat16).contiguous() + + # Add SPMDRank (will be per-rank in shard_weight) + backbone_sd["spmd_rank.rank"] = torch.arange(0, TP_DEGREE, dtype=torch.int32) + + del full_sd, stripped_sd + gc.collect() + + print(f" {len(backbone_sd)} backbone keys", flush=True) + + # Convert safetensors key format to JIT param format + # safetensors: "transformer_blocks.0.attn1.to_q.weight" + # JIT: "weights.transformer_blocks->0->attn1->to_q->weight" + def sf_key_to_jit_key(sf_key): + return "weights." + sf_key.replace(".", "->") + + # Create per-rank state dicts + rank_sds = [{} for _ in range(TP_DEGREE)] + for sf_key, full_weight in backbone_sd.items(): + jit_key = sf_key_to_jit_key(sf_key) + for rank in range(TP_DEGREE): + sharded = shard_weight(full_weight, jit_key, rank, TP_DEGREE) + rank_sds[rank][jit_key] = sharded + + del backbone_sd + gc.collect() + + return rank_sds + + +def main(): + print("=" * 60, flush=True) + print("LTX-2.3 Load with Real Weights & Forward Test", flush=True) + print("=" * 60, flush=True) + + # 1. Load config + print("\n[1/5] Loading config...", flush=True) + config = load_config() + tc = config["transformer"] + print(f" {tc['num_layers']} layers, {tc['num_attention_heads']} heads", flush=True) + + # 2. Precompute inputs + print("\n[2/5] Precomputing inputs...", flush=True) + inputs = precompute_inputs(config) + print(f" Got {len(inputs)} input tensors", flush=True) + + # 3. 
Load and shard weights + print("\n[3/5] Loading and sharding weights...", flush=True) + rank_sds = load_and_shard_weights() + for rank, sd in enumerate(rank_sds): + print(f" Rank {rank}: {len(sd)} parameters", flush=True) + + # 4. Load compiled models and inject weights + print("\n[4/5] Loading compiled models onto Neuron...", flush=True) + import torch_neuronx + from neuronx_distributed.trace.trace import TensorParallelNeuronModel + + tp_0_path = os.path.join(COMPILE_DIR, "tp_0.pt") + + models = [] + t0 = time.time() + for rank in range(TP_DEGREE): + print(f" Loading rank {rank}...", flush=True) + with torch_neuronx.contexts.disable_nrt_load(): + model = torch.jit.load(tp_0_path) + + # Inject per-rank weights + model_sd = dict(model.named_parameters()) + injected = 0 + missing = 0 + mismatched = 0 + for jit_key, sharded_weight in rank_sds[rank].items(): + if jit_key in model_sd: + if model_sd[jit_key].shape == sharded_weight.shape: + model_sd[jit_key].data.copy_(sharded_weight) + injected += 1 + else: + mismatched += 1 + if rank == 0 and mismatched <= 5: + print( + f" MISMATCH: {jit_key}: model={model_sd[jit_key].shape} vs shard={sharded_weight.shape}", + flush=True, + ) + else: + missing += 1 + + if rank == 0: + print( + f" Injected {injected}, mismatched {mismatched}, missing {missing}", + flush=True, + ) + + models.append(model) + + print(f" All models loaded in {time.time() - t0:.1f}s", flush=True) + + del rank_sds + gc.collect() + + # Create TensorParallelNeuronModel + neuron_model = TensorParallelNeuronModel(models) + print(f" TP degree: {neuron_model.tp_degree}", flush=True) + + # 5. 
Run forward pass + print("\n[5/5] Running forward pass...", flush=True) + with torch.no_grad(): + t0 = time.time() + video_out, audio_out = neuron_model(*inputs) + elapsed = time.time() - t0 + + print(f"\n Forward pass completed in {elapsed:.2f}s", flush=True) + print(f" Video output: {video_out.shape}, dtype={video_out.dtype}", flush=True) + print(f" Audio output: {audio_out.shape}, dtype={audio_out.dtype}", flush=True) + + v_np = video_out.float().numpy() + a_np = audio_out.float().numpy() + + print( + f"\n Video stats: min={v_np.min():.4f}, max={v_np.max():.4f}, " + f"mean={v_np.mean():.4f}, std={v_np.std():.4f}", + flush=True, + ) + print( + f" Audio stats: min={a_np.min():.4f}, max={a_np.max():.4f}, " + f"mean={a_np.mean():.4f}, std={a_np.std():.4f}", + flush=True, + ) + + has_nan = np.isnan(v_np).any() or np.isnan(a_np).any() + has_inf = np.isinf(v_np).any() or np.isinf(a_np).any() + print(f"\n NaN: {has_nan}, Inf: {has_inf}", flush=True) + + # Save outputs + output_path = "/home/ubuntu/ltx23_neuron/test_outputs_real_weights.pt" + torch.save( + { + "video_output": video_out.cpu(), + "audio_output": audio_out.cpu(), + "inputs": tuple(t.cpu() for t in inputs), + }, + output_path, + ) + print(f" Saved to {output_path}", flush=True) + + all_ok = not has_nan and not has_inf + print(f"\n{'=' * 60}", flush=True) + print(f" RESULT: {'PASS' if all_ok else 'FAIL'}", flush=True) + print(f"{'=' * 60}", flush=True) + + # Second pass + print("\n Running second forward pass (warmed up)...", flush=True) + with torch.no_grad(): + t0 = time.time() + video_out2, audio_out2 = neuron_model(*inputs) + elapsed2 = time.time() - t0 + print(f" Second pass: {elapsed2:.2f}s", flush=True) + + return 0 if all_ok else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/contrib/models/LTX-2.3/src/modeling_gemma3_encoder.py b/contrib/models/LTX-2.3/src/modeling_gemma3_encoder.py new file mode 100644 index 00000000..c8a3b82a --- /dev/null +++ 
b/contrib/models/LTX-2.3/src/modeling_gemma3_encoder.py @@ -0,0 +1,832 @@ +""" +NeuronGemma3TextEncoder — Gemma 3-12B as an encoder-only model for LTX-2.3 text conditioning +============================================================================================== + +LTX-2.3 uses Gemma 3-12B as a text encoder, NOT as a causal language model. The pipeline +calls the text encoder with output_hidden_states=True and collects ALL 49 hidden states +(embedding + 48 decoder layers), stacks them, normalizes, and flattens to produce +conditioning embeddings for the video and audio diffusion streams. + +NxDI's built-in NeuronGemma3ForCausalLM cannot provide this because: + - It only returns logits/tokens, not intermediate hidden states + - It has KV cache machinery baked into the compiled graph + - Its forward slices to the last token position + +This module builds a CUSTOM encoder-only model that: + 1. Uses NxD parallel layers for TP-sharded attention, MLP, norms + 2. Runs all 48 decoder layers in a single forward pass + 3. Accumulates all hidden states and returns torch.stack(all_hidden_states, dim=-1) + 4. Has NO KV cache, NO lm_head, NO sampling + 5. Takes (input_ids, attention_mask) -> (B, seq_len, hidden_size, num_layers+1) + +Architecture: + Gemma3ScaledEmbedding -> 48 x Gemma3EncoderLayer -> Gemma3RMSNorm + Output: torch.stack([embed_out, layer_0_out, ..., layer_47_out], dim=-1) + +TP strategy: + - Q, K, V projections: ColumnParallelLinear (shard output dim) + - O projection: RowParallelLinear (shard input dim, all-reduce) + - gate_proj, up_proj: ColumnParallelLinear + - down_proj: RowParallelLinear + - Norms: replicated (not sharded) -- they operate on the full hidden_size + - Embedding: ParallelEmbedding (sharded across vocab) + +Adapted from the LTX-2 contrib (contrib/ltx2-video-audio) for LTX-2.3. 
+""" + +import logging +import math +from typing import List, Optional, Tuple + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + ParallelEmbedding, + RowParallelLinear, +) +from neuronx_distributed.parallel_layers.parallel_state import ( + get_tensor_model_parallel_rank, + get_tensor_model_parallel_size, +) + +logger = logging.getLogger(__name__) + + +# ── RMSNorm (Gemma3 variant: 1 + weight) ──────────────────────────────────── + + +class Gemma3RMSNorm(nn.Module): + """Gemma3-specific RMSNorm: uses (1.0 + weight) instead of just weight.""" + + def __init__(self, dim: int, eps: float = 1e-6): + super().__init__() + self.eps = eps + self.weight = nn.Parameter(torch.zeros(dim)) + + def _norm(self, x): + return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) + + def forward(self, x): + output = self._norm(x.float()) + output = output * (1.0 + self.weight.float()) + return output.type_as(x) + + +# ── Scaled Embedding ──────────────────────────────────────────────────────── + + +class Gemma3ScaledEmbedding(nn.Module): + """Gemma3 embeddings scaled by sqrt(hidden_size).""" + + def __init__(self, num_embeddings, embedding_dim, padding_idx, dtype): + super().__init__() + self.embed_scale = embedding_dim**0.5 + self.embedding = ParallelEmbedding( + num_embeddings, + embedding_dim, + padding_idx, + dtype=dtype, + shard_across_embedding=True, + pad=True, + ) + + def forward(self, input_ids): + return self.embedding(input_ids) * self.embed_scale + + +# ── Rotary Position Embedding ─────────────────────────────────────────────── + + +class RotaryEmbedding(nn.Module): + """Standard RoPE for Gemma3 (no sliding window variant needed for encoder).""" + + def __init__(self, dim, max_position_embeddings=131072, base=1_000_000.0): + super().__init__() + self.dim = dim + self.max_position_embeddings = max_position_embeddings + self.base = base + inv_freq = 1.0 / (base 
** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + + def forward(self, position_ids): + """ + Args: + position_ids: (batch_size, seq_len) + Returns: + cos, sin: (batch_size, seq_len, dim) + """ + inv_freq = self.inv_freq.to(position_ids.device) + pos = position_ids.unsqueeze(-1).float() + freqs = pos * inv_freq.unsqueeze(0).unsqueeze(0) + emb = torch.cat([freqs, freqs], dim=-1) + return emb.cos(), emb.sin() + + +def apply_rotary_pos_emb(q, k, cos, sin): + """Apply rotary embedding to query and key tensors.""" + cos = cos.unsqueeze(1) # (B, 1, seq_len, head_dim) + sin = sin.unsqueeze(1) + + def rotate_half(x): + x1, x2 = x.chunk(2, dim=-1) + return torch.cat([-x2, x1], dim=-1) + + q_embed = (q * cos) + (rotate_half(q) * sin) + k_embed = (k * cos) + (rotate_half(k) * sin) + return q_embed, k_embed + + +# ── Attention ─────────────────────────────────────────────────────────────── + + +class Gemma3EncoderAttention(nn.Module): + """Gemma3 attention for encoder-only use (no KV cache). + + Uses GQA with Q-K normalization. Keeps causal attention to match + training behavior of the original Gemma3 weights. 
+ """ + + def __init__( + self, + hidden_size: int, + num_attention_heads: int, + num_key_value_heads: int, + head_dim: int, + rms_norm_eps: float, + rope_theta: float, + max_position_embeddings: int, + query_pre_attn_scalar: int, + dtype: torch.dtype, + ): + super().__init__() + self.hidden_size = hidden_size + self.num_heads = num_attention_heads + self.num_kv_heads = num_key_value_heads + self.head_dim = head_dim + self.num_kv_groups = num_attention_heads // num_key_value_heads + + # Scaling uses query_pre_attn_scalar, not head_dim + self.scale = query_pre_attn_scalar**-0.5 + + tp_size = get_tensor_model_parallel_size() + + self.q_proj = ColumnParallelLinear( + hidden_size, + num_attention_heads * head_dim, + bias=False, + gather_output=False, + dtype=dtype, + pad=True, + ) + self.k_proj = ColumnParallelLinear( + hidden_size, + num_key_value_heads * head_dim, + bias=False, + gather_output=False, + dtype=dtype, + pad=True, + ) + self.v_proj = ColumnParallelLinear( + hidden_size, + num_key_value_heads * head_dim, + bias=False, + gather_output=False, + dtype=dtype, + pad=True, + ) + self.o_proj = RowParallelLinear( + num_attention_heads * head_dim, + hidden_size, + bias=False, + input_is_parallel=True, + dtype=dtype, + ) + + # Q-K normalization (Gemma3-specific) + self.q_layernorm = Gemma3RMSNorm(head_dim, eps=rms_norm_eps) + self.k_layernorm = Gemma3RMSNorm(head_dim, eps=rms_norm_eps) + + # RoPE + self.rotary_emb = RotaryEmbedding( + dim=head_dim, + max_position_embeddings=max_position_embeddings, + base=rope_theta, + ) + + self.num_heads_per_rank = num_attention_heads // tp_size + self.num_kv_heads_per_rank = num_key_value_heads // tp_size + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor], + position_ids: torch.Tensor, + ) -> torch.Tensor: + batch_size, seq_len, _ = hidden_states.shape + + q = self.q_proj(hidden_states) + k = self.k_proj(hidden_states) + v = self.v_proj(hidden_states) + + q = q.view( + batch_size, 
seq_len, self.num_heads_per_rank, self.head_dim
+        ).transpose(1, 2)
+        k = k.view(
+            batch_size, seq_len, self.num_kv_heads_per_rank, self.head_dim
+        ).transpose(1, 2)
+        v = v.view(
+            batch_size, seq_len, self.num_kv_heads_per_rank, self.head_dim
+        ).transpose(1, 2)
+
+        # Q-K normalization (before RoPE)
+        q = self.q_layernorm(q)
+        k = self.k_layernorm(k)
+
+        # Apply RoPE
+        cos, sin = self.rotary_emb(position_ids)
+        q, k = apply_rotary_pos_emb(q, k, cos, sin)
+
+        # GQA: repeat K, V for each query head group
+        if self.num_kv_groups > 1:
+            k = k.repeat_interleave(self.num_kv_groups, dim=1)
+            v = v.repeat_interleave(self.num_kv_groups, dim=1)
+
+        # BMM attention (Neuron-friendly, no SDPA)
+        attn_weights = (
+            torch.bmm(
+                q.reshape(batch_size * self.num_heads_per_rank, seq_len, self.head_dim),
+                k.reshape(
+                    batch_size * self.num_heads_per_rank, seq_len, self.head_dim
+                ).transpose(-1, -2),
+            )
+            * self.scale
+        )
+
+        attn_weights = attn_weights.view(
+            batch_size, self.num_heads_per_rank, seq_len, seq_len
+        )
+
+        if attention_mask is not None:
+            attn_weights = attn_weights + attention_mask
+
+        # Softmax in float32 for numerical precision
+        attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(
+            hidden_states.dtype
+        )
+
+        attn_output = torch.bmm(
+            attn_weights.reshape(
+                batch_size * self.num_heads_per_rank, seq_len, seq_len
+            ),
+            v.reshape(batch_size * self.num_heads_per_rank, seq_len, self.head_dim),
+        )
+
+        attn_output = attn_output.view(
+            batch_size, self.num_heads_per_rank, seq_len, self.head_dim
+        )
+        attn_output = attn_output.transpose(1, 2).contiguous()
+        attn_output = attn_output.reshape(batch_size, seq_len, -1)
+
+        return self.o_proj(attn_output)
+
+
+# ── MLP ─────────────────────────────────────────────────────────────────────
+
+
+class Gemma3EncoderMLP(nn.Module):
+    """Gemma3 MLP: down_proj(act(gate_proj(x)) * up_proj(x)), with GELU(tanh)."""
+
+    def __init__(self, hidden_size: int, intermediate_size: int, dtype: torch.dtype):
+        
super().__init__() + self.gate_proj = ColumnParallelLinear( + hidden_size, + intermediate_size, + bias=False, + gather_output=False, + dtype=dtype, + pad=True, + ) + self.up_proj = ColumnParallelLinear( + hidden_size, + intermediate_size, + bias=False, + gather_output=False, + dtype=dtype, + pad=True, + ) + self.down_proj = RowParallelLinear( + intermediate_size, + hidden_size, + bias=False, + input_is_parallel=True, + dtype=dtype, + ) + self.act_fn = nn.GELU(approximate="tanh") + + def forward(self, x): + return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) + + +# ── Decoder Layer (simplified for encoder use) ───────────────────────────── + + +class Gemma3EncoderLayer(nn.Module): + """Single Gemma3 decoder layer adapted for encoder-only use (no KV cache). + Four norms per layer (Gemma3-specific). + """ + + def __init__( + self, + hidden_size: int, + num_attention_heads: int, + num_key_value_heads: int, + head_dim: int, + intermediate_size: int, + rms_norm_eps: float, + rope_theta: float, + max_position_embeddings: int, + query_pre_attn_scalar: int, + dtype: torch.dtype, + ): + super().__init__() + self.self_attn = Gemma3EncoderAttention( + hidden_size=hidden_size, + num_attention_heads=num_attention_heads, + num_key_value_heads=num_key_value_heads, + head_dim=head_dim, + rms_norm_eps=rms_norm_eps, + rope_theta=rope_theta, + max_position_embeddings=max_position_embeddings, + query_pre_attn_scalar=query_pre_attn_scalar, + dtype=dtype, + ) + self.mlp = Gemma3EncoderMLP(hidden_size, intermediate_size, dtype) + + # Four norms (Gemma3-specific) + self.input_layernorm = Gemma3RMSNorm(hidden_size, eps=rms_norm_eps) + self.post_attention_layernorm = Gemma3RMSNorm(hidden_size, eps=rms_norm_eps) + self.pre_feedforward_layernorm = Gemma3RMSNorm(hidden_size, eps=rms_norm_eps) + self.post_feedforward_layernorm = Gemma3RMSNorm(hidden_size, eps=rms_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor], + 
position_ids: torch.Tensor, + ) -> torch.Tensor: + # Attention block + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + hidden_states = self.self_attn(hidden_states, attention_mask, position_ids) + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = residual + hidden_states + + # MLP block + residual = hidden_states + hidden_states = self.pre_feedforward_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = self.post_feedforward_layernorm(hidden_states) + hidden_states = residual + hidden_states + + return hidden_states + + +# ── Full Encoder Model ────────────────────────────────────────────────────── + + +class Gemma3TextEncoderModel(nn.Module): + """Gemma3 used as a text encoder: returns all hidden states stacked. + + This is the model that gets compiled to a Neuron graph. It takes + (input_ids, attention_mask) and returns a single tensor of shape + (B, seq_len, hidden_size, num_layers+1). + + For Gemma 3-12B: + hidden_size = 3840, num_hidden_layers = 48 + Output: (B, seq_len, 3840, 49) + + Note on causal attention: Gemma3 was trained with causal (left-to-right) + attention. We keep causal masking to produce hidden states consistent + with the original model. 
+ """ + + def __init__( + self, + vocab_size: int = 262208, + hidden_size: int = 3840, + num_hidden_layers: int = 48, + num_attention_heads: int = 16, + num_key_value_heads: int = 8, + head_dim: int = 256, + intermediate_size: int = 15360, + rms_norm_eps: float = 1e-6, + rope_theta: float = 1_000_000.0, + max_position_embeddings: int = 131072, + query_pre_attn_scalar: int = 256, + pad_token_id: int = 0, + dtype: torch.dtype = torch.bfloat16, + ): + super().__init__() + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.dtype = dtype + + self.embed_tokens = Gemma3ScaledEmbedding( + vocab_size, + hidden_size, + pad_token_id, + dtype=dtype, + ) + + self.layers = nn.ModuleList( + [ + Gemma3EncoderLayer( + hidden_size=hidden_size, + num_attention_heads=num_attention_heads, + num_key_value_heads=num_key_value_heads, + head_dim=head_dim, + intermediate_size=intermediate_size, + rms_norm_eps=rms_norm_eps, + rope_theta=rope_theta, + max_position_embeddings=max_position_embeddings, + query_pre_attn_scalar=query_pre_attn_scalar, + dtype=dtype, + ) + for _ in range(num_hidden_layers) + ] + ) + + self.norm = Gemma3RMSNorm(hidden_size, eps=rms_norm_eps) + + def forward( + self, + input_ids: torch.Tensor, # (B, seq_len), int64 + attention_mask: torch.Tensor, # (B, seq_len), int64 -- 1=real, 0=pad + ) -> torch.Tensor: + """ + Returns: + stacked_hidden_states: (B, seq_len, hidden_size, num_layers+1) + """ + batch_size, seq_len = input_ids.shape + + position_ids = ( + torch.arange(seq_len, device=input_ids.device) + .unsqueeze(0) + .expand(batch_size, -1) + ) + + # Causal mask: (1, 1, seq_len, seq_len) with -inf for future positions + causal_mask = torch.triu( + torch.full( + (seq_len, seq_len), + float("-inf"), + device=input_ids.device, + dtype=self.dtype, + ), + diagonal=1, + ) + causal_mask = causal_mask.unsqueeze(0).unsqueeze(0) + + # Padding mask: (B, 1, 1, seq_len) with -inf for padded positions + pad_mask = (1.0 - 
attention_mask.to(self.dtype)).unsqueeze(1).unsqueeze( + 2 + ) * float("-inf") + pad_mask = torch.nan_to_num(pad_mask, nan=0.0) + combined_mask = causal_mask + pad_mask + + # Embedding + hidden_states = self.embed_tokens(input_ids) + + all_hidden_states = [hidden_states] + + for layer in self.layers: + hidden_states = layer(hidden_states, combined_mask, position_ids) + all_hidden_states.append(hidden_states) + + # Apply final norm (replaces last element, matching HF behavior) + hidden_states = self.norm(hidden_states) + all_hidden_states[-1] = hidden_states + + # Stack: (B, seq_len, hidden_size, num_layers+1) + return torch.stack(all_hidden_states, dim=-1) + + +# ── State Dict Conversion ─────────────────────────────────────────────────── + + +def convert_hf_gemma3_to_encoder_state_dict( + hf_state_dict: dict, dtype: torch.dtype = torch.bfloat16 +) -> dict: + """Convert HuggingFace Gemma3 state dict to encoder format. + + Handles multiple HF key prefix formats: + 1. "base_text_encoder.language_model.model." -- diffusers safetensors + 2. "model.language_model." -- pipeline state_dict + 3. "language_model.model." -- HF safetensors + 4. "model." 
-- bare Gemma3ForCausalLM + + Key renames: + embed_tokens.weight -> embed_tokens.embedding.weight (ParallelEmbedding) + q_norm -> q_layernorm + k_norm -> k_layernorm + """ + encoder_state_dict = {} + + prefixes = [ + "base_text_encoder.language_model.model.", + "model.language_model.", + "language_model.model.", + "model.", + ] + + for key, value in hf_state_dict.items(): + new_key = None + for prefix in prefixes: + if key.startswith(prefix): + new_key = key[len(prefix) :] + break + if new_key is None: + continue + + # Skip lm_head, vision tower, projector + if "lm_head" in new_key: + continue + if not ( + new_key.startswith("embed_tokens") + or new_key.startswith("layers.") + or new_key.startswith("norm.") + ): + continue + + # Rename embed_tokens for ParallelEmbedding wrapper + if new_key == "embed_tokens.weight": + encoder_state_dict["embed_tokens.embedding.weight"] = ( + value.detach().clone().to(dtype) + ) + continue + + # Rename Q-K norm: q_norm -> q_layernorm, k_norm -> k_layernorm + new_key = new_key.replace(".self_attn.q_norm.", ".self_attn.q_layernorm.") + new_key = new_key.replace(".self_attn.k_norm.", ".self_attn.k_layernorm.") + + encoder_state_dict[new_key] = value.detach().clone().to(dtype) + + return encoder_state_dict + + +# ── NxDI Application Classes ──────────────────────────────────────────────── +# These follow the NeuronApplicationBase pattern from NxDI (see Flux model +# in src/neuronx_distributed_inference/models/diffusers/flux/ for reference). 
+ +try: + from neuronx_distributed_inference.models.application_base import ( + NeuronApplicationBase, + ) + from neuronx_distributed_inference.models.config import ( + InferenceConfig, + NeuronConfig, + ) + from neuronx_distributed_inference.models.model_wrapper import ModelWrapper + from neuronx_distributed.trace.model_builder import BaseModelInstance + + _NXDI_AVAILABLE = True +except ImportError: + _NXDI_AVAILABLE = False + + +# Default Gemma3-12B architecture constants +GEMMA3_12B_CONFIG = dict( + vocab_size=262208, + hidden_size=3840, + num_hidden_layers=48, + num_attention_heads=16, + num_key_value_heads=8, + head_dim=256, + intermediate_size=15360, + rms_norm_eps=1e-6, + rope_theta=1_000_000.0, + max_position_embeddings=131072, + query_pre_attn_scalar=256, + pad_token_id=0, +) + + +class Gemma3EncoderInferenceConfig(InferenceConfig if _NXDI_AVAILABLE else object): + """InferenceConfig for the Gemma3 text encoder.""" + + def __init__(self, *args, **kwargs): + if _NXDI_AVAILABLE: + super().__init__(*args, **kwargs) + + def get_required_attributes(self): + return [ + "vocab_size", + "hidden_size", + "num_hidden_layers", + "num_attention_heads", + "num_key_value_heads", + "head_dim", + "intermediate_size", + ] + + +class ModelWrapperGemma3Encoder(ModelWrapper if _NXDI_AVAILABLE else object): + """ModelWrapper for the Gemma3 text encoder.""" + + def __init__( + self, + config, + model_cls, + tag="", + compiler_args=None, + priority_model_idx=None, + model_init_kwargs={}, + ): + if _NXDI_AVAILABLE: + super().__init__( + config, + model_cls, + tag, + compiler_args, + priority_model_idx, + model_init_kwargs=model_init_kwargs, + ) + self.bucket_config = None + + def input_generator(self): + """Generate example inputs for compilation: (input_ids, attention_mask).""" + seq_len = self.config.neuron_config.seq_len + batch_size = self.config.neuron_config.batch_size + + input_ids = torch.zeros(batch_size, seq_len, dtype=torch.int64) + attention_mask = 
torch.ones(batch_size, seq_len, dtype=torch.int64) + + return [(input_ids, attention_mask)] + + def get_model_instance(self): + """Create a factory for the Gemma3TextEncoderModel.""" + config = self.config + + def _create_model(): + model = self.model_cls( + vocab_size=getattr( + config, "vocab_size", GEMMA3_12B_CONFIG["vocab_size"] + ), + hidden_size=config.hidden_size, + num_hidden_layers=config.num_hidden_layers, + num_attention_heads=config.num_attention_heads, + num_key_value_heads=config.num_key_value_heads, + head_dim=config.head_dim, + intermediate_size=config.intermediate_size, + rms_norm_eps=getattr( + config, "rms_norm_eps", GEMMA3_12B_CONFIG["rms_norm_eps"] + ), + rope_theta=getattr( + config, "rope_theta", GEMMA3_12B_CONFIG["rope_theta"] + ), + max_position_embeddings=getattr( + config, + "max_position_embeddings", + GEMMA3_12B_CONFIG["max_position_embeddings"], + ), + query_pre_attn_scalar=getattr( + config, + "query_pre_attn_scalar", + GEMMA3_12B_CONFIG["query_pre_attn_scalar"], + ), + pad_token_id=getattr( + config, "pad_token_id", GEMMA3_12B_CONFIG["pad_token_id"] + ), + dtype=config.neuron_config.torch_dtype, + ) + model = model.to(dtype=config.neuron_config.torch_dtype) + model.eval() + return model + + return BaseModelInstance(module_cls=_create_model, input_output_aliases={}) + + def forward(self, *args, **kwargs): + if self.model is None: + raise RuntimeError("Forward called before load. Run load() first.") + return self._forward(*args) + + +class NeuronGemma3EncoderApplication( + NeuronApplicationBase if _NXDI_AVAILABLE else object +): + """NxDI Application for the Gemma3-12B text encoder. + + Handles compilation, weight sharding, loading, and inference following + the same pattern as NeuronClipApplication / NeuronT5Application in Flux. + + Unlike the Flux text encoders which run simultaneously with other sub-models, + the Gemma3 encoder shares NeuronCores with the DiT backbone sequentially + on trn2.3xlarge. 
The compositor (NeuronLTX23Application) manages the + load/unload cycling. + """ + + _model_cls = Gemma3TextEncoderModel + + def __init__(self, *args, **kwargs): + if _NXDI_AVAILABLE: + super().__init__(*args, **kwargs) + self.model_wrapper = self.get_model_wrapper_cls() + + self.model = self.model_wrapper( + config=self.config, + model_cls=self._model_cls, + tag=self._model_cls.__name__, + compiler_args=self.get_compiler_args(), + priority_model_idx=0, + ) + self.models.append(self.model) + self.dtype = self.config.neuron_config.torch_dtype + + @classmethod + def get_config_cls(cls): + return Gemma3EncoderInferenceConfig + + def get_model_wrapper_cls(self): + return ModelWrapperGemma3Encoder + + def forward(self, input_ids, attention_mask): + """Forward pass: (input_ids, attention_mask) -> (B, seq_len, hidden_size, 49).""" + return self.models[0](input_ids, attention_mask) + + def get_compiler_args(self): + """Compiler args for the Gemma3 encoder. + + Uses tensorizer flags that achieve 3.1x speedup (2000ms -> 644ms). + """ + import os as _os + + compiler_args = ( + "--model-type=transformer -O1 --auto-cast=none --lnc=2 " + "--tensorizer-options='--enable-ccop-compute-overlap " + "--cc-pipeline-tiling-factor=1 " + "--vectorize-strided-dma " + "--enable-scalar-dge-vectorization'" + ) + + _os.environ["NEURON_FUSE_SOFTMAX"] = "1" + _os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" + _os.environ["NEURON_RT_VIRTUAL_CORE_SIZE"] = "2" + + return compiler_args + + @staticmethod + def update_state_dict_for_tied_weights(state_dict): + pass + + def checkpoint_loader_fn(self, mmap: bool = False): + """Load Gemma3 weights from HuggingFace checkpoint. 
+ + Supports loading from: + - A directory with safetensors shards (google/gemma-3-12b-it-qat-q4_0-unquantized) + - A single safetensors file + """ + import os as _os + from safetensors.torch import load_file + + # NeuronApplicationBase.normalize_path() adds trailing '/'; strip it + # so os.path.isdir/isfile checks work correctly. + model_path = self.model_path.rstrip("/") + logger.info("Loading Gemma3 encoder weights from %s", model_path) + + if _os.path.isdir(model_path): + import glob as _glob + + safetensors_files = sorted( + _glob.glob(_os.path.join(model_path, "*.safetensors")) + ) + if safetensors_files: + hf_sd = {} + for sf in safetensors_files: + hf_sd.update(load_file(sf)) + logger.info( + "Loaded %d tensors from %d safetensors files", + len(hf_sd), + len(safetensors_files), + ) + else: + raise FileNotFoundError(f"No safetensors files in {model_path}") + elif _os.path.isfile(model_path) and model_path.endswith(".safetensors"): + hf_sd = load_file(model_path) + logger.info("Loaded %d tensors from %s", len(hf_sd), model_path) + else: + raise FileNotFoundError(f"Cannot load weights from {model_path}") + + # Convert HF state dict to encoder format + encoder_sd = convert_hf_gemma3_to_encoder_state_dict( + hf_sd, dtype=self.config.neuron_config.torch_dtype + ) + logger.info("Converted to encoder format: %d keys", len(encoder_sd)) + return encoder_sd + + @staticmethod + def convert_hf_to_neuron_state_dict(state_dict, config): + """State dict is already converted by checkpoint_loader_fn.""" + return state_dict diff --git a/contrib/models/LTX-2.3/src/modeling_ltx23.py b/contrib/models/LTX-2.3/src/modeling_ltx23.py new file mode 100644 index 00000000..979e2cd8 --- /dev/null +++ b/contrib/models/LTX-2.3/src/modeling_ltx23.py @@ -0,0 +1,835 @@ +""" +NxDI LTX-2.3 Transformer Model +=============================== +Neuron-optimized implementation of the LTX-2.3 22B DiT audio-video +diffusion transformer. 
Uses the native ltx-core model architecture
with TransformerArgs dataclasses for block communication.

Architecture:
    - Native ltx-core LTXModel with BasicAVTransformerBlock
    - 48 transformer blocks, 32 heads, 4096 video dim, 2048 audio dim
    - Gated attention (to_gate_logits per attention head)
    - Cross-attention AdaLN (prompt_scale_shift_table)
    - QK-norm uses q_norm/k_norm (not norm_q/norm_k as in Diffusers)
    - RoPE type: split (not interleaved)
    - Caption projection is in text encoder connectors (not in transformer)

The backbone takes 24 flat tensor inputs (for XLA tracing), constructs
TransformerArgs dataclasses internally, and calls native block forwards.
All preprocessing (patchify, adaln, rope, connector) done on CPU.

Usage:
    See application.py for the high-level NeuronLTX23Application class.
"""

import logging
import math
import os
from typing import List, Optional, Tuple

import torch
import torch.nn as nn

logger = logging.getLogger(__name__)

# These imports are only available on Neuron instances
try:
    from neuronx_distributed.parallel_layers.layers import (
        ColumnParallelLinear,
        RowParallelLinear,
        SPMDRank,
    )
    from neuronx_distributed.parallel_layers.parallel_state import (
        get_tensor_model_parallel_size,
    )
    from neuronx_distributed.parallel_layers.utils import (
        set_tensor_model_parallel_attributes,
    )
    import neuronx_distributed.trace.trace as _nxd_trace

    from neuronx_distributed_inference.models.application_base import (
        NeuronApplicationBase,
    )
    from neuronx_distributed_inference.models.config import (
        InferenceConfig,
        NeuronConfig,
    )
    from neuronx_distributed_inference.models.model_wrapper import (
        BaseModelInstance,
        ModelWrapper,
    )

    NEURON_AVAILABLE = True
except ImportError:
    NEURON_AVAILABLE = False


# -- BMM-based SDPA replacement -----------------------------------------------
_sdpa_replaced = False
_sdpa_original = None


def replace_sdpa_with_bmm():
"""Replace F.scaled_dot_product_attention with BMM-based implementation. + + SDPA is not supported on Neuron XLA. This replacement uses explicit + BMM + softmax which compiles cleanly. Handles 3D and 4D inputs, + optional attention masks, and falls back to original SDPA on CPU. + """ + global _sdpa_replaced, _sdpa_original + if _sdpa_replaced: + return _sdpa_original + _sdpa_original = torch.nn.functional.scaled_dot_product_attention + + def neuron_sdpa( + query, + key, + value, + attn_mask=None, + dropout_p=0.0, + is_causal=False, + scale=None, + enable_gqa=False, + ): + # CPU fallback (for text encoder, preprocessing) + if query.device.type == "cpu": + return _sdpa_original( + query, + key, + value, + attn_mask=attn_mask, + dropout_p=dropout_p, + is_causal=is_causal, + scale=scale, + ) + d = query.shape[-1] + if scale is None: + scale = 1.0 / math.sqrt(d) + orig_shape = None + if len(query.shape) == 4: + orig_shape = query.shape + b, h, sq, d_head = query.shape + query = query.reshape(b * h, sq, d_head) + key = key.reshape(b * h, -1, d_head) + value = value.reshape(b * h, -1, d_head) + if attn_mask is not None and attn_mask.ndim == 4: + # Expand broadcastable dims to (b, h, ...) before flattening to 3D. + # PytorchAttention may pass masks like (1, 1, 1, 256) where b=1 and h=1 + # are broadcastable over the actual (b, h) from query. A direct reshape + # would fail because total elements differ. Expand first, then reshape. 
+ attn_mask = attn_mask.expand(b, h, -1, -1).reshape( + b * h, attn_mask.shape[-2], attn_mask.shape[-1] + ) + elif attn_mask is not None and attn_mask.ndim == 2: + attn_mask = attn_mask.unsqueeze(0) + elif attn_mask is not None and attn_mask.ndim == 3: + if attn_mask.shape[0] == orig_shape[0]: + attn_mask = ( + attn_mask.unsqueeze(1) + .expand(orig_shape[0], orig_shape[1], -1, -1) + .reshape( + orig_shape[0] * orig_shape[1], + attn_mask.shape[-2], + attn_mask.shape[-1], + ) + ) + scores = torch.bmm(query, key.transpose(-1, -2)) * scale + if attn_mask is not None: + scores = scores + attn_mask + probs = scores.softmax(dim=-1) + out = torch.bmm(probs, value) + if orig_shape is not None: + out = out.reshape(orig_shape[0], orig_shape[1], -1, orig_shape[3]) + return out + + torch.nn.functional.scaled_dot_product_attention = neuron_sdpa + _sdpa_replaced = True + return _sdpa_original + + +# -- DistributedRMSNorm ------------------------------------------------------- +class DistributedRMSNorm(nn.Module): + """RMSNorm with all-reduce for global variance computation across TP ranks. + + Standard RMSNorm on a TP-sharded hidden dimension only sees the local shard. + This version computes sum-of-squares locally, all-reduces across ranks, then + normalizes with the global RMS. Essential for QK-norm accuracy in TP>1. + + The all-reduce in this norm is NOT redundant -- removing it (LocalRMSNorm + experiment in LTX-2) made quality significantly worse. 
+ """ + + def __init__(self, normalized_shape, eps=1e-5, tp_size=4, dtype=torch.bfloat16): + super().__init__() + self.weight = nn.Parameter(torch.ones(normalized_shape, dtype=dtype)) + self.eps = eps + self.tp_size = tp_size + self.local_dim = normalized_shape + + if NEURON_AVAILABLE: + set_tensor_model_parallel_attributes( + self.weight, is_parallel=True, dim=0, stride=1, num_partitions=tp_size + ) + + def forward(self, hidden_states): + hidden_states_f32 = hidden_states.to(torch.float32) + local_sum_sq = hidden_states_f32.pow(2).sum(dim=-1, keepdim=True) + import torch_xla.core.xla_model as xm + + global_sum_sq = xm.all_reduce(xm.REDUCE_SUM, local_sum_sq) + global_dim = self.local_dim * self.tp_size + rms = torch.rsqrt(global_sum_sq / global_dim + self.eps) + hidden_states = hidden_states_f32 * rms + if self.weight.dtype in [torch.float16, torch.bfloat16]: + hidden_states = hidden_states.to(self.weight.dtype) + return hidden_states * self.weight + + +# Register DistributedRMSNorm as a supported sharded module for NxD tracing +if NEURON_AVAILABLE: + _nxd_trace.__SUPPORTED_SHARDED_MODULES = ( + *_nxd_trace.__SUPPORTED_SHARDED_MODULES, + DistributedRMSNorm, + ) + + +# -- NeuronLTX23TransformerBackbone ------------------------------------------- +class NeuronLTX23TransformerBackbone(nn.Module): + """The core LTX-2.3 DiT transformer backbone for Neuron. + + Contains the 48 transformer blocks, output normalization, and projection layers. + Takes 22 preprocessed tensor inputs (all preprocessing done on CPU) and returns + (video_output, audio_output). + + Key differences from LTX-2: + - Uses native ltx-core LTXModel (not Diffusers LTX2VideoTransformer3DModel) + - Caption projection is NOT in the transformer (moved to text encoder in 2.3) + - Attention QK-norm uses q_norm/k_norm (vs norm_q/norm_k in Diffusers) + - Gated attention may be present (to_gate_logits per attention module) + """ + + def __init__(self, config): + """Initialize from InferenceConfig. 
+ + If config has an `ltx_config_dict` attribute (set during config creation), + the transformer is automatically built from the ltx-core model with TP sharding. + """ + super().__init__() + self.config = config + self.tp_degree = config.neuron_config.tp_degree + + ltx_config_dict = getattr(config, "ltx_config_dict", None) + if ltx_config_dict is not None: + self._build_from_ltx_core(ltx_config_dict) + else: + self.transformer_blocks = None + self.norm_out = None + self.proj_out = None + self.scale_shift_table = None + self.audio_norm_out = None + self.audio_proj_out = None + self.audio_scale_shift_table = None + self.spmd_rank = None + + def _build_from_ltx_core(self, ltx_config_dict): + """Build the TP-sharded transformer from ltx-core config dict. + + The ltx_config_dict should be the full model config (as loaded from + the safetensors metadata or provided manually) with 'transformer' key + containing the transformer architecture parameters. + """ + replace_sdpa_with_bmm() + + # Import and build the native ltx-core model + from ltx_core.model.transformer.model_configurator import LTXModelConfigurator + + ltx_model = LTXModelConfigurator.from_config(ltx_config_dict) + ltx_model = ltx_model.to(dtype=self.config.neuron_config.torch_dtype) + ltx_model.eval() + + if self.tp_degree > 1: + _shard_ltx23_transformer(ltx_model, self.tp_degree) + + # Copy references to the backbone layers + self.transformer_blocks = ltx_model.transformer_blocks + self.norm_out = ltx_model.norm_out + self.proj_out = ltx_model.proj_out + self.scale_shift_table = ltx_model.scale_shift_table + self.audio_norm_out = ltx_model.audio_norm_out + self.audio_proj_out = ltx_model.audio_proj_out + self.audio_scale_shift_table = ltx_model.audio_scale_shift_table + + if self.tp_degree > 1 and NEURON_AVAILABLE: + self.spmd_rank = SPMDRank(self.tp_degree) + else: + self.spmd_rank = None + + def _slice_rope(self, cos, sin): + """Slice RoPE embeddings to the heads owned by this TP rank. 
+ + Uses SPMDRank (a learnable parameter sharded per-rank) instead of a + Python int to avoid baking rank=0 as a constant during XLA tracing. + """ + if self.tp_degree <= 1 or self.spmd_rank is None: + return (cos, sin) + h_per_rank = cos.shape[1] // self.tp_degree + rank = self.spmd_rank.get_rank() # shape (1,), int32 + start = (rank[0] * h_per_rank).to(torch.long) + indices = start + torch.arange(h_per_rank, device=cos.device, dtype=torch.long) + cos_sliced = torch.index_select(cos, 1, indices) + sin_sliced = torch.index_select(sin, 1, indices) + return (cos_sliced, sin_sliced) + + def forward( + self, + hidden_states, # (B, video_seq, inner_dim) -- after patchify_proj + audio_hidden_states, # (B, audio_seq, audio_inner_dim) + encoder_hidden_states, # (B, text_seq, inner_dim) -- already projected + audio_encoder_hidden_states, # (B, text_seq, audio_inner_dim) + temb, # (B, video_seq, 9*inner_dim) -- per-token time embedding + temb_audio, # (B, audio_seq, 9*audio_inner_dim) + embedded_timestep, # (B, video_seq, inner_dim) -- per-token output scaling + audio_embedded_timestep, # (B, audio_seq, audio_inner_dim) + video_ca_ss, # (B, 1, 4*inner_dim) -- cross-modal scale/shift + audio_ca_ss, # (B, 1, 4*audio_inner_dim) + video_ca_gate, # (B, 1, inner_dim) -- cross-modal a2v gate + audio_ca_gate, # (B, 1, audio_inner_dim) -- cross-modal v2a gate + video_rot_cos, # (B, H, video_seq, rope_dim) -- self-attn RoPE + video_rot_sin, + audio_rot_cos, # (B, H_audio, audio_seq, rope_dim) + audio_rot_sin, + ca_video_rot_cos, # (B, H, video_seq, ca_rope_dim) -- cross-modal RoPE + ca_video_rot_sin, + ca_audio_rot_cos, # (B, H_audio, audio_seq, ca_rope_dim) + ca_audio_rot_sin, + encoder_attention_mask, # (B, 1, 1, text_seq) -- additive bias mask + audio_encoder_attention_mask, # (B, 1, 1, text_seq) + prompt_timestep, # (B, 1, 2*inner_dim) -- cross-attn AdaLN prompt + audio_prompt_timestep, # (B, 1, 2*audio_inner_dim) + ): + """Forward pass: construct TransformerArgs from flat 
tensors, run native blocks. + + Takes 24 flat tensor inputs for XLA tracing compatibility. + Constructs TransformerArgs dataclasses internally and calls the native + BasicAVTransformerBlock.forward() which uses the ltx-core interface. + """ + from ltx_core.model.transformer.transformer_args import TransformerArgs + from ltx_core.guidance.perturbations import BatchedPerturbationConfig + + batch_size = hidden_states.shape[0] + + # Slice RoPE to local TP shard + video_pe = self._slice_rope(video_rot_cos, video_rot_sin) + audio_pe = self._slice_rope(audio_rot_cos, audio_rot_sin) + ca_video_pe = self._slice_rope(ca_video_rot_cos, ca_video_rot_sin) + ca_audio_pe = self._slice_rope(ca_audio_rot_cos, ca_audio_rot_sin) + + # Construct TransformerArgs for video and audio + video_args = TransformerArgs( + x=hidden_states, + context=encoder_hidden_states, + context_mask=encoder_attention_mask, + timesteps=temb, + embedded_timestep=embedded_timestep, + positional_embeddings=video_pe, # (cos, sin) tuple + cross_positional_embeddings=ca_video_pe, + cross_scale_shift_timestep=video_ca_ss, + cross_gate_timestep=video_ca_gate, + enabled=True, + prompt_timestep=prompt_timestep, + self_attention_mask=None, + ) + audio_args = TransformerArgs( + x=audio_hidden_states, + context=audio_encoder_hidden_states, + context_mask=audio_encoder_attention_mask, + timesteps=temb_audio, + embedded_timestep=audio_embedded_timestep, + positional_embeddings=audio_pe, + cross_positional_embeddings=ca_audio_pe, + cross_scale_shift_timestep=audio_ca_ss, + cross_gate_timestep=audio_ca_gate, + enabled=True, + prompt_timestep=audio_prompt_timestep, + self_attention_mask=None, + ) + + # Empty perturbations for deterministic tracing (no skip branches) + perturbations = BatchedPerturbationConfig.empty(batch_size) + + # Run 48 transformer blocks using native ltx-core interface + for block in self.transformer_blocks: + video_args, audio_args = block( + video=video_args, + audio=audio_args, + 
perturbations=perturbations, + ) + + # Video output projection (matches LTXModel._process_output) + vx = video_args.x + scale_shift_values = ( + self.scale_shift_table[None, None] + + video_args.embedded_timestep[:, :, None] + ) + shift, scale = scale_shift_values[:, :, 0], scale_shift_values[:, :, 1] + vx = self.norm_out(vx) + vx = vx * (1 + scale) + shift + output = self.proj_out(vx) + + # Audio output projection + ax = audio_args.x + audio_ssv = ( + self.audio_scale_shift_table[None, None] + + audio_args.embedded_timestep[:, :, None] + ) + audio_shift, audio_scale = audio_ssv[:, :, 0], audio_ssv[:, :, 1] + ax = self.audio_norm_out(ax) + ax = ax * (1 + audio_scale) + audio_shift + audio_output = self.audio_proj_out(ax) + + return output, audio_output + + +# -- TP Sharding -------------------------------------------------------------- +def _shard_ltx23_transformer(ltx_model, tp_degree): + """Apply tensor parallelism sharding to the LTX-2.3 transformer. + + Adapted from LTX-2's _shard_ltx2_transformer for the native ltx-core + model structure. 
Key differences: + - QK-norm attribute names: q_norm/k_norm (not norm_q/norm_k) + - Gated attention: to_gate_logits may be present (Column-sharded) + - FFN structure: FeedForward with .net containing GEGLU gate + Linear down + + Each of the 48 blocks has 7 attention modules and 2 FFNs: + - attn1, attn2 (video self-attn, video text cross-attn) + - audio_attn1, audio_attn2 (audio self-attn, audio text cross-attn) + - audio_to_video_attn, video_to_audio_attn (cross-modal) + - ff, audio_ff (feed-forward) + """ + from neuronx_distributed.parallel_layers import parallel_state + + tp_rank = parallel_state.get_tensor_model_parallel_rank() + tp_size = parallel_state.get_tensor_model_parallel_size() + + def get_shard(data, dim): + s = data.shape[dim] // tp_size + if dim == 0: + return data[s * tp_rank : s * (tp_rank + 1)].clone() + return data[:, s * tp_rank : s * (tp_rank + 1)].clone() + + def shard_attention(attn): + """Shard a single Attention module for TP.""" + for proj_name in ["to_q", "to_k", "to_v"]: + proj = getattr(attn, proj_name) + col = ColumnParallelLinear( + proj.in_features, + proj.out_features, + bias=proj.bias is not None, + gather_output=False, + dtype=proj.weight.dtype, + ) + col.weight.data = get_shard(proj.weight.data, 0) + if proj.bias is not None: + col.bias.data = get_shard(proj.bias.data, 0) + setattr(attn, proj_name, col) + + # Output projection: RowParallelLinear + out_linear = attn.to_out[0] + row = RowParallelLinear( + out_linear.in_features, + out_linear.out_features, + bias=out_linear.bias is not None, + input_is_parallel=True, + dtype=out_linear.weight.dtype, + ) + row.weight.data = get_shard(out_linear.weight.data, 1) + if out_linear.bias is not None: + row.bias.data = out_linear.bias.data.clone() # bias not sharded + attn.to_out[0] = row + + # QK-norm -> DistributedRMSNorm + # LTX-2.3 native model uses q_norm/k_norm (torch.nn.RMSNorm) + # LTX-2 Diffusers used norm_q/norm_k + for norm_name in ["q_norm", "k_norm"]: + norm = getattr(attn, 
norm_name, None) + if norm is not None and hasattr(norm, "weight") and norm.weight is not None: + full_dim = norm.weight.shape[0] + local_dim = full_dim // tp_size + dist_norm = DistributedRMSNorm( + local_dim, + eps=getattr(norm, "eps", 1e-6), + tp_size=tp_size, + dtype=norm.weight.dtype, + ) + dist_norm.weight.data = get_shard(norm.weight.data, 0) + setattr(attn, norm_name, dist_norm) + + # Gated attention: to_gate_logits (Linear: query_dim -> heads) + gate_logits = getattr(attn, "to_gate_logits", None) + if gate_logits is not None and isinstance(gate_logits, nn.Linear): + col = ColumnParallelLinear( + gate_logits.in_features, + gate_logits.out_features, + bias=gate_logits.bias is not None, + gather_output=False, + dtype=gate_logits.weight.dtype, + ) + col.weight.data = get_shard(gate_logits.weight.data, 0) + if gate_logits.bias is not None: + col.bias.data = get_shard(gate_logits.bias.data, 0) + attn.to_gate_logits = col + + # Update head count for TP + attn.heads = attn.heads // tp_size + # The native model doesn't have inner_dim/inner_kv_dim attributes + # but the dim_head stays the same. heads is divided. + + def shard_ffn(ff): + """Shard a FeedForward module for TP. 
+ + LTX-2.3 FeedForward structure: + ff.net = [GEGLU(proj=Linear), Identity(), Linear] + or ff.net = [Linear, activation, Linear] + """ + net = ff.net + gate = net[0] + if hasattr(gate, "proj"): + # GEGLU: gate.proj is the Linear + proj = gate.proj + col = ColumnParallelLinear( + proj.in_features, + proj.out_features, + bias=proj.bias is not None, + gather_output=False, + dtype=proj.weight.dtype, + ) + col.weight.data = get_shard(proj.weight.data, 0) + if proj.bias is not None: + col.bias.data = get_shard(proj.bias.data, 0) + gate.proj = col + elif isinstance(gate, nn.Linear): + col = ColumnParallelLinear( + gate.in_features, + gate.out_features, + bias=gate.bias is not None, + gather_output=False, + dtype=gate.weight.dtype, + ) + col.weight.data = get_shard(gate.weight.data, 0) + if gate.bias is not None: + col.bias.data = get_shard(gate.bias.data, 0) + net[0] = col + + # Down projection: last Linear in net + down = net[-1] + if isinstance(down, nn.Linear): + row = RowParallelLinear( + down.in_features, + down.out_features, + bias=down.bias is not None, + input_is_parallel=True, + dtype=down.weight.dtype, + ) + row.weight.data = get_shard(down.weight.data, 1) + if down.bias is not None: + row.bias.data = down.bias.data.clone() + net[len(net) - 1] = row + + for block in ltx_model.transformer_blocks: + # Video attention modules + shard_attention(block.attn1) + shard_attention(block.attn2) + shard_ffn(block.ff) + # Audio attention modules + shard_attention(block.audio_attn1) + shard_attention(block.audio_attn2) + shard_ffn(block.audio_ff) + # Cross-modal attention + shard_attention(block.audio_to_video_attn) + shard_attention(block.video_to_audio_attn) + + +# -- NxDI Config -------------------------------------------------------------- +class LTX23BackboneInferenceConfig(InferenceConfig if NEURON_AVAILABLE else object): + """InferenceConfig for the LTX-2.3 transformer backbone.""" + + def __init__(self, *args, **kwargs): + if NEURON_AVAILABLE: + 
super().__init__(*args, **kwargs) + + def get_required_attributes(self): + return [ + "num_layers", + "num_attention_heads", + "attention_head_dim", + "inner_dim", + "audio_num_attention_heads", + "audio_attention_head_dim", + "audio_inner_dim", + "audio_cross_attention_dim", + "video_seq", + "audio_seq", + "text_seq", + "height", + "width", + "num_frames", + ] + + +# -- NxDI ModelWrapper --------------------------------------------------------- +class ModelWrapperLTX23Backbone(ModelWrapper if NEURON_AVAILABLE else object): + """ModelWrapper for the LTX-2.3 DiT transformer backbone.""" + + def __init__( + self, + config, + model_cls, + tag="", + compiler_args=None, + priority_model_idx=None, + model_init_kwargs={}, + ): + if NEURON_AVAILABLE: + super().__init__( + config, + model_cls, + tag, + compiler_args, + priority_model_idx, + model_init_kwargs, + ) + self.bucket_config = None + + def input_generator(self): + """Generate example inputs for Neuron compilation. + + Returns list of (tuple_of_tensors,) matching the 24-input forward() signature. + The native ltx-core model uses 9 AdaLN mod params (not 6 as LTX-2 Diffusers): + 3 self-attn (shift, scale, gate) + 3 FFN (shift, scale, gate) + + 3 cross-attn AdaLN (shift_q, scale_q, gate_q). 
+ """ + dtype = self.config.neuron_config.torch_dtype + inner_dim = self.config.inner_dim + audio_inner_dim = self.config.audio_inner_dim + audio_ca_dim = self.config.audio_cross_attention_dim + video_seq = self.config.video_seq + audio_seq = self.config.audio_seq + text_seq = self.config.text_seq + num_heads = self.config.num_attention_heads + audio_num_heads = self.config.audio_num_attention_heads + + # RoPE rotation dim per head (split RoPE: dim // 2 per head) + video_rope_dim = inner_dim // num_heads // 2 # 4096/32/2 = 64 + audio_rope_dim = audio_inner_dim // audio_num_heads // 2 # 2048/32/2 = 32 + ca_video_rope_dim = audio_ca_dim // num_heads // 2 # 2048/32/2 = 32 + ca_audio_rope_dim = audio_ca_dim // audio_num_heads // 2 # 2048/32/2 = 32 + + model_inputs = ( + # Projected hidden states + torch.randn(1, video_seq, inner_dim, dtype=dtype), + torch.randn(1, audio_seq, audio_inner_dim, dtype=dtype), + # Encoder hidden states (already projected by text encoder connectors) + torch.randn(1, text_seq, inner_dim, dtype=dtype), + torch.randn(1, text_seq, audio_inner_dim, dtype=dtype), + # Time embeddings (9 mod params from adaln_single, per-token) + torch.randn(1, video_seq, 9 * inner_dim, dtype=dtype), + torch.randn(1, audio_seq, 9 * audio_inner_dim, dtype=dtype), + # Embedded timestep (per-token, for output scaling) + torch.randn(1, video_seq, inner_dim, dtype=dtype), + torch.randn(1, audio_seq, audio_inner_dim, dtype=dtype), + # Cross-modal scale/shift (4 mod params from av_ca_*_scale_shift, per-batch) + torch.randn(1, 1, 4 * inner_dim, dtype=dtype), + torch.randn(1, 1, 4 * audio_inner_dim, dtype=dtype), + # Cross-modal gate (1 mod param from av_ca_*_gate, per-batch) + torch.randn(1, 1, 1 * inner_dim, dtype=dtype), + torch.randn(1, 1, 1 * audio_inner_dim, dtype=dtype), + # Video self-attn RoPE cos/sin (split format) + torch.randn(1, num_heads, video_seq, video_rope_dim, dtype=dtype), + torch.randn(1, num_heads, video_seq, video_rope_dim, dtype=dtype), + # Audio 
self-attn RoPE cos/sin + torch.randn(1, audio_num_heads, audio_seq, audio_rope_dim, dtype=dtype), + torch.randn(1, audio_num_heads, audio_seq, audio_rope_dim, dtype=dtype), + # Cross-modal RoPE + torch.randn(1, num_heads, video_seq, ca_video_rope_dim, dtype=dtype), + torch.randn(1, num_heads, video_seq, ca_video_rope_dim, dtype=dtype), + torch.randn(1, audio_num_heads, audio_seq, ca_audio_rope_dim, dtype=dtype), + torch.randn(1, audio_num_heads, audio_seq, ca_audio_rope_dim, dtype=dtype), + # Attention masks (additive bias, shape B x 1 x 1 x text_seq) + # The native preprocessor converts binary masks to 4D additive bias: + # (B, text_seq) -> (B, 1, 1, text_seq) with 0 = attend, -max = ignore + torch.zeros(1, 1, 1, text_seq, dtype=dtype), + torch.zeros(1, 1, 1, text_seq, dtype=dtype), + # Prompt timestep for cross-attn AdaLN (2 mod params, per-batch) + torch.randn(1, 1, 2 * inner_dim, dtype=dtype), + torch.randn(1, 1, 2 * audio_inner_dim, dtype=dtype), + ) + + return [model_inputs] + + def get_model_instance(self): + def _create_model(): + model = self.model_cls(self.config) + model = model.to(dtype=self.config.neuron_config.torch_dtype) + model.eval() + return model + + return BaseModelInstance(module_cls=_create_model, input_output_aliases={}) + + def forward(self, *args, **kwargs): + if self.model is None: + raise RuntimeError( + "Forward called before load. Run load() or load_state_dict() first." + ) + output = self._forward(*args) + return output + + +# -- NxDI Application --------------------------------------------------------- +class NeuronLTX23BackboneApplication( + NeuronApplicationBase if NEURON_AVAILABLE else object +): + """NxDI Application wrapping the LTX-2.3 DiT transformer backbone. + + Handles compilation, weight sharding, loading, and inference. + Follows the same pattern as NeuronFluxBackboneApplication / NeuronLTX2BackboneApplication. 
+ """ + + _model_cls = NeuronLTX23TransformerBackbone + + def __init__(self, *args, **kwargs): + if NEURON_AVAILABLE: + super().__init__(*args, **kwargs) + self.model_wrapper = self.get_model_wrapper_cls() + + self.model = self.model_wrapper( + config=self.config, + model_cls=self._model_cls, + tag=self._model_cls.__name__, + compiler_args=self.get_compiler_args(), + priority_model_idx=0, + ) + self.models.append(self.model) + self.dtype = self.config.neuron_config.torch_dtype + + def get_model_wrapper_cls(self): + return ModelWrapperLTX23Backbone + + def forward(self, *model_inputs, **kwargs): + return self.models[0](*model_inputs, **kwargs) + + def get_compiler_args(self): + """Compiler args for the LTX-2.3 transformer. + + Same as LTX-2: --auto-cast matmult (two t's) and --lnc 2 for trn2. + """ + compiler_args = "--model-type=transformer -O1" + compiler_args += " --auto-cast matmult --lnc 2" + compiler_args += " --tensorizer-options='--enable-ccop-compute-overlap'" + + os.environ["LOCAL_WORLD_SIZE"] = str(self.config.neuron_config.world_size) + os.environ["NEURON_RT_VIRTUAL_CORE_SIZE"] = "2" + os.environ["NEURON_FUSE_SOFTMAX"] = "1" + os.environ["NEURON_CUSTOM_SILU"] = "1" + os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" + + return compiler_args + + @staticmethod + def update_state_dict_for_tied_weights(state_dict): + pass + + def checkpoint_loader_fn(self, mmap: bool = False): + """Load the LTX-2.3 transformer weights from safetensors. + + LTX-2.3 weights are stored in a single safetensors file (not a + HuggingFace model repo with config.json). Weight keys use the + native ltx-core naming convention. + """ + # NeuronApplicationBase.normalize_path() adds trailing '/'; strip it + # so os.path.isfile() works for single safetensors files. 
+ model_path = self.model_path.rstrip("/") + logger.info("Loading LTX-2.3 transformer weights from %s", model_path) + + if os.path.isdir(model_path): + from safetensors.torch import load_file + import glob as _glob + + safetensors_files = sorted( + _glob.glob(os.path.join(model_path, "*.safetensors")) + ) + if safetensors_files: + model_sd = {} + for sf in safetensors_files: + model_sd.update(load_file(sf)) + logger.info( + "Loaded %d tensors from %d safetensors files", + len(model_sd), + len(safetensors_files), + ) + else: + raise FileNotFoundError(f"No safetensors files in {model_path}") + elif os.path.isfile(model_path) and model_path.endswith(".safetensors"): + from safetensors.torch import load_file + + model_sd = load_file(model_path) + logger.info("Loaded %d tensors from %s", len(model_sd), model_path) + else: + raise FileNotFoundError(f"Cannot load weights from {model_path}") + + model_sd = self.convert_to_neuron_state_dict(model_sd, self.config) + return model_sd + + @staticmethod + def convert_to_neuron_state_dict(state_dict, config): + """Convert safetensors state dict to Neuron backbone format. + + Key transformations: + 1. Strips 'model.diffusion_model.' prefix (ComfyUI convention in safetensors) + 2. Adds the SPMDRank arange tensor for per-rank RoPE slicing + 3. Filters to only keep keys matching the backbone model structure + (transformer_blocks.*, norm_out.*, proj_out.*, scale_shift_table, audio_*) + 4. Removes preprocessing layers (patchify_proj, adaln, rope, connectors, etc.) + """ + # Strip ComfyUI prefix if present + prefix = "model.diffusion_model." 
+ stripped_sd = {} + for k, v in state_dict.items(): + if k.startswith(prefix): + stripped_sd[k[len(prefix) :]] = v + else: + stripped_sd[k] = v + + # Add SPMDRank tensor for per-rank sharding + stripped_sd["spmd_rank.rank"] = torch.arange( + 0, config.neuron_config.world_size, dtype=torch.int32 + ) + + # Filter to keys the backbone model expects + # These match the native ltx-core LTXModel weight keys + backbone_prefixes = ( + "transformer_blocks.", + "norm_out.", + "proj_out.", + "scale_shift_table", + "audio_norm_out.", + "audio_proj_out.", + "audio_scale_shift_table", + "spmd_rank.", + ) + filtered_sd = {} + skipped_keys = [] + for k, v in stripped_sd.items(): + if k.startswith(backbone_prefixes): + filtered_sd[k] = v.clone().detach().contiguous() + else: + skipped_keys.append(k) + + if skipped_keys: + logger.info( + "Filtered out %d preprocessing keys (patchify_proj, adaln, connectors, etc.): %s", + len(skipped_keys), + ", ".join(skipped_keys[:10]) + + ("..." if len(skipped_keys) > 10 else ""), + ) + + return filtered_sd diff --git a/contrib/models/LTX-2.3/src/modeling_vae_23.py b/contrib/models/LTX-2.3/src/modeling_vae_23.py new file mode 100644 index 00000000..db39a17a --- /dev/null +++ b/contrib/models/LTX-2.3/src/modeling_vae_23.py @@ -0,0 +1,806 @@ +""" +NxDI LTX-2.3 VAE Decoder — Tensor Parallel Model +================================================== +Tensor-parallel VAE decoder for the LTX-2.3 video diffusion model (ltx-core). + +Adapted from the LTX-2 (Diffusers) TP VAE decoder. 
# Key differences from LTX-2:
#   - ltx-core uses a flat `up_blocks` list (interleaved UNetMidBlock3D +
#     DepthToSpaceUpsample) instead of Diffusers' hierarchical
#     (mid_block + up_blocks with sub-upsamplers)
#   - LTX-2.3 has timestep conditioning: AdaLN in each ResnetBlock3D, time
#     embedders in each UNetMidBlock3D, and a final AdaLN after norm_out
#   - Normalization is PixelNorm (same math as PerChannelRMSNorm)
#   - CausalConv3d (ltx-core) ≡ LTX2VideoCausalConv3d (Diffusers)
#
# Channel progression: 128 -> 1024 -> 512 -> 256 -> 128 -> 48 -> 3
#
# Compilation boundary (same as LTX-2):
#   - H_latent × W_latent ≤ 64 elements (SRAM limit)
#   - 4×16 = 64 ✓ (optimal for wide outputs like 1024×1536)
#   - 8×8 = 64 ✓
#
# Pre-processing (done on CPU, OUTSIDE the compiled graph):
#   - Noise injection: sample = noise * 0.025 + sample * 0.975
#   - Denormalization: per_channel_statistics.un_normalize(sample)
#   - Timestep embedding computation
#
# The compiled graph takes:
#   - latent_tile: [1, 128, T, H_tile, W_tile] (denormalized)
#   - timestep_embed_tile: [1, C_embed, 1, 1, 1] (precomputed on CPU)
#   - last_ada_values: [1, 2, C_final, 1, 1, 1] (precomputed on CPU)

import os
from functools import partial

import torch
import torch.nn as nn

os.environ.setdefault("NEURON_FUSE_SOFTMAX", "1")
os.environ.setdefault("NEURON_CUSTOM_SILU", "1")

COMPILER_FLAGS = (
    "--model-type=unet-inference -O1 --auto-cast none "
    "--enable-fast-loading-neuron-binaries"
)


def get_sharded_data(data, dim):
    """Get shard for current TP rank along given dimension (0 or 1)."""
    from neuronx_distributed.parallel_layers import parallel_state

    tp_rank = parallel_state.get_tensor_model_parallel_rank()
    tp_size = parallel_state.get_tensor_model_parallel_size()
    shard_size = data.shape[dim] // tp_size
    if dim == 0:
        return data[shard_size * tp_rank : shard_size * (tp_rank + 1)].clone()
    elif dim == 1:
        return data[:, shard_size * tp_rank : shard_size * (tp_rank + 1)].clone()
    else:
        raise ValueError(f"Unsupported dim: {dim}")


# ── Temporal padding helper ─────────────────────────────────────────


def make_noncausal_pad_fn_from_kernel(kernel_t):
    """Create non-causal temporal padding from an explicit kernel size.

    Pads (kernel_t - 1) // 2 replicated copies of the first/last frame on
    each temporal side (dim 2), so a temporal conv behaves non-causally.
    """
    pad_t = (kernel_t - 1) // 2  # 1 for kernel=3

    def pad_fn(x):
        if pad_t > 0:
            pad_left = x[:, :, :1].repeat(1, 1, pad_t, 1, 1)
            pad_right = x[:, :, -1:].repeat(1, 1, pad_t, 1, 1)
            x = torch.cat([pad_left, x, pad_right], dim=2)
        return x

    return pad_fn


def make_noncausal_pad_fn(conv_module):
    """Create non-causal temporal padding function for CausalConv3d.

    ltx-core CausalConv3d stores kernel_size as tuple; non-causal mode pads
    (k-1)//2 frames of first/last frame on each side.

    FIX: delegates to make_noncausal_pad_fn_from_kernel instead of
    duplicating the identical padding closure (behavior unchanged).
    """
    # CausalConv3d stores a time_kernel_size attribute; an explicit
    # kernel_size tuple takes precedence when present.
    kernel_t = getattr(conv_module, "time_kernel_size", 3)
    if hasattr(conv_module, "kernel_size") and isinstance(
        conv_module.kernel_size, tuple
    ):
        kernel_t = conv_module.kernel_size[0]
    return make_noncausal_pad_fn_from_kernel(kernel_t)


# ── Parallel Conv3d layers ───────────────────────────────────────────


class ColumnParallelConv3d(nn.Module):
    """Conv3d with output channels sharded across TP ranks.
    Input: full channels. Output: sharded channels."""

    def __init__(self, inner_conv, tp_degree):
        super().__init__()
        self.sharded_out = inner_conv.out_channels // tp_degree
        self.conv = nn.Conv3d(
            inner_conv.in_channels,
            self.sharded_out,
            kernel_size=inner_conv.kernel_size,
            stride=inner_conv.stride,
            padding=inner_conv.padding,
            padding_mode=inner_conv.padding_mode,
            bias=inner_conv.bias is not None,
        )
        # Weight sharded on the output-channel axis; bias follows.
        self.conv.weight.data = get_sharded_data(inner_conv.weight.data, 0)
        if inner_conv.bias is not None:
            self.conv.bias.data = get_sharded_data(inner_conv.bias.data, 0)

    def forward(self, x):
        return self.conv(x)


class RowParallelConv3d(nn.Module):
    """Conv3d with input channels sharded. Output is all-reduced (full channels).
    Input: sharded channels. Output: full channels."""

    def __init__(self, inner_conv, tp_degree):
        super().__init__()
        from neuronx_distributed.parallel_layers.mappings import (
            reduce_from_tensor_model_parallel_region,
        )

        self.reduce = reduce_from_tensor_model_parallel_region
        self.sharded_in = inner_conv.in_channels // tp_degree
        self.conv = nn.Conv3d(
            self.sharded_in,
            inner_conv.out_channels,
            kernel_size=inner_conv.kernel_size,
            stride=inner_conv.stride,
            padding=inner_conv.padding,
            padding_mode=inner_conv.padding_mode,
            bias=inner_conv.bias is not None,
        )
        self.conv.weight.data = get_sharded_data(inner_conv.weight.data, 1)
        if inner_conv.bias is not None:
            # Each rank adds the bias before the all-reduce sums the partial
            # outputs, so divide by tp_degree to apply it exactly once.
            self.conv.bias.data = inner_conv.bias.data.clone() / tp_degree

    def forward(self, x):
        return self.reduce(self.conv(x))


class ColumnRowParallelConv3d(nn.Module):
    """Conv3d with sharded input -> sharded output.

    All-gathers input channels, then applies column-parallel conv
    (full in -> sharded out). 1 all-gather per forward.
    """
+ """ + + def __init__(self, inner_conv, tp_degree): + super().__init__() + from neuronx_distributed.parallel_layers.mappings import ( + gather_from_tensor_model_parallel_region_with_dim, + ) + + self.gather_channels = gather_from_tensor_model_parallel_region_with_dim + self.sharded_out = inner_conv.out_channels // tp_degree + self.conv = nn.Conv3d( + inner_conv.in_channels, # full input + self.sharded_out, # sharded output + kernel_size=inner_conv.kernel_size, + stride=inner_conv.stride, + padding=inner_conv.padding, + padding_mode=inner_conv.padding_mode, + bias=inner_conv.bias is not None, + ) + self.conv.weight.data = get_sharded_data(inner_conv.weight.data, 0) + if inner_conv.bias is not None: + self.conv.bias.data = get_sharded_data(inner_conv.bias.data, 0) + + def forward(self, x): + x_full = self.gather_channels(x, gather_dim=1) + return self.conv(x_full) + + +# ── Sharded RMSNorm (PixelNorm) ──────────────────────────────────── + + +class ShardedPixelNorm(nn.Module): + """PixelNorm that works with sharded channel dimension. + + PixelNorm computes x / sqrt(mean(x^2, dim=1) + eps). + With sharded channels, we need global mean across all ranks. + + Strategy: + 1. Compute local sum of x^2 over local channels + 2. All-reduce (sum) across ranks for global sum + 3. Divide by total channels for global mean + 4. 
x / sqrt(global_mean + eps) + """ + + def __init__(self, original_norm, tp_degree): + super().__init__() + from neuronx_distributed.parallel_layers.mappings import ( + reduce_from_tensor_model_parallel_region, + ) + + self.reduce = reduce_from_tensor_model_parallel_region + self.tp_degree = tp_degree + self.eps = getattr(original_norm, "eps", 1e-8) + + def forward(self, x): + local_sq_sum = (x**2).sum(dim=1, keepdim=True) + global_sq_sum = self.reduce(local_sq_sum) + n_channels = x.shape[1] * self.tp_degree + rms = torch.sqrt(global_sq_sum / n_channels + self.eps) + return x / rms + + +# ── Sharded ResNet Block ──────────────────────────────────────────── + + +class ShardedResnetBlock3D(nn.Module): + """LTX-2.3 ResnetBlock3D with sharded channels and timestep conditioning. + + Forward order (from ltx-core source): + x = norm1(x) + if timestep_conditioning: x = x * (1 + scale1) + shift1 + x = SiLU(x) -> conv1 + x = norm2(x) + if timestep_conditioning: x = x * (1 + scale2) + shift2 + x = SiLU(x) -> dropout -> conv2 + residual = norm3(input) -> conv_shortcut + return x + residual + + When in_channels == out_channels: conv_shortcut = Identity, norm3 = Identity. 
+ """ + + def __init__(self, original_block, tp_degree): + super().__init__() + self.nonlinearity = nn.SiLU() + self.timestep_conditioning = original_block.timestep_conditioning + + # Norms — sharded PixelNorm + self.norm1 = ShardedPixelNorm(original_block.norm1, tp_degree) + self.norm2 = ShardedPixelNorm(original_block.norm2, tp_degree) + + # Extract inner Conv3d from CausalConv3d wrappers + conv1_inner = original_block.conv1.conv + conv2_inner = original_block.conv2.conv + + # Both convs: sharded in -> sharded out + self.conv1 = ColumnRowParallelConv3d(conv1_inner, tp_degree) + self.conv2 = ColumnRowParallelConv3d(conv2_inner, tp_degree) + + # Temporal padding (non-causal mode) + self._pad1 = make_noncausal_pad_fn_from_kernel( + original_block.conv1.time_kernel_size + ) + self._pad2 = make_noncausal_pad_fn_from_kernel( + original_block.conv2.time_kernel_size + ) + + # Timestep conditioning: scale_shift_table is [4, C] — NOT sharded + # The shift/scale values are broadcast over spatial dims; they must be + # sliced to match local channel count + if self.timestep_conditioning: + # scale_shift_table: [4, C_full] -> we need [4, C_shard] + shard_size = original_block.scale_shift_table.shape[1] // tp_degree + from neuronx_distributed.parallel_layers import parallel_state + + tp_rank = parallel_state.get_tensor_model_parallel_rank() + start = tp_rank * shard_size + end = start + shard_size + self.scale_shift_table = nn.Parameter( + original_block.scale_shift_table.data[:, start:end].clone() + ) + + # Shortcut (identity when in_channels == out_channels) + self.is_identity_shortcut = isinstance( + original_block.conv_shortcut, nn.Identity + ) + if not self.is_identity_shortcut: + # norm3 is GroupNorm(1, in_channels) — sharded + # conv_shortcut is a 1x1 linear — ColumnRowParallel + self.norm3 = nn.GroupNorm( + num_groups=1, + num_channels=original_block.norm3.num_channels // tp_degree, + eps=original_block.norm3.eps, + affine=True, + ) + # Shard GroupNorm weight/bias + 
self.norm3.weight.data = get_sharded_data( + original_block.norm3.weight.data, 0 + ) + self.norm3.bias.data = get_sharded_data(original_block.norm3.bias.data, 0) + # conv_shortcut is make_linear_nd -> Conv3d(1,1,1) + shortcut_conv = original_block.conv_shortcut + if hasattr(shortcut_conv, "conv"): + shortcut_conv = shortcut_conv.conv + self.conv_shortcut = ColumnRowParallelConv3d(shortcut_conv, tp_degree) + else: + self.norm3 = nn.Identity() + self.conv_shortcut = nn.Identity() + + def forward(self, x, timestep_embed=None): + residual = x + + x = self.norm1(x) + + if self.timestep_conditioning and timestep_embed is not None: + # timestep_embed: [B, 4*C_shard, 1, 1, 1] from UNetMidBlock3D + batch_size = x.shape[0] + ada_values = self.scale_shift_table[None, ..., None, None, None].to( + device=x.device, dtype=x.dtype + ) + timestep_embed.reshape( + batch_size, + 4, + -1, + timestep_embed.shape[-3], + timestep_embed.shape[-2], + timestep_embed.shape[-1], + ) + shift1, scale1, shift2, scale2 = ada_values.unbind(dim=1) + x = x * (1 + scale1) + shift1 + + x = self.nonlinearity(x) + x = self._pad1(x) + x = self.conv1(x) + + x = self.norm2(x) + + if self.timestep_conditioning and timestep_embed is not None: + x = x * (1 + scale2) + shift2 + + x = self.nonlinearity(x) + x = self._pad2(x) + x = self.conv2(x) + + if not self.is_identity_shortcut: + residual = self.norm3(residual) + residual = self.conv_shortcut(residual) + + return x + residual + + +# ── Sharded UNetMidBlock3D ────────────────────────────────────────── + + +class ShardedUNetMidBlock3D(nn.Module): + """Sharded UNetMidBlock3D with timestep conditioning. + + Contains a time_embedder (small MLP, not sharded) and N ResnetBlock3D instances. + The time_embedder produces [B, C*4, 1, 1, 1] which is sliced per-rank for + the 4 shift/scale values in each ResnetBlock3D. 
+ """ + + def __init__(self, original_block, tp_degree): + super().__init__() + self.timestep_conditioning = original_block.timestep_conditioning + + if self.timestep_conditioning: + # time_embedder: small MLP producing [B, C*4] — keep on all ranks (broadcast) + # But we need to shard the output since ResnetBlock3D expects sharded shift/scale + self.time_embedder = original_block.time_embedder + self.tp_degree = tp_degree + + self.res_blocks = nn.ModuleList() + for resnet in original_block.res_blocks: + self.res_blocks.append(ShardedResnetBlock3D(resnet, tp_degree)) + + def forward(self, hidden_states, scaled_timestep=None): + timestep_embed = None + if self.timestep_conditioning and scaled_timestep is not None: + batch_size = hidden_states.shape[0] + # time_embedder output: [B, C*4] where C is full channel count + embed_full = self.time_embedder( + timestep=scaled_timestep.flatten(), + hidden_dtype=hidden_states.dtype, + ) + # Reshape to [B, C*4, 1, 1, 1] + timestep_embed = embed_full.view(batch_size, embed_full.shape[-1], 1, 1, 1) + # Shard: [B, 4*C_full, 1, 1, 1] -> [B, 4*C_shard, 1, 1, 1] + # The 4*C layout is interleaved: [shift1_0..shift1_C, scale1_0..scale1_C, ...] 
+ # Actually the ResnetBlock3D reshapes it as (B, 4, C, 1, 1, 1) then unbinds + # So the layout is [4, C] contiguous in the embed dim + # We need to shard C within each of the 4 groups + full_c = embed_full.shape[-1] // 4 + shard_c = full_c // self.tp_degree + from neuronx_distributed.parallel_layers import parallel_state + + tp_rank = parallel_state.get_tensor_model_parallel_rank() + start = tp_rank * shard_c + end = start + shard_c + # Reshape to [B, 4, C, 1, 1, 1], shard C, reshape back + embed_4d = timestep_embed.view(batch_size, 4, full_c, 1, 1, 1) + embed_sharded = embed_4d[:, :, start:end, :, :, :].contiguous() + timestep_embed = embed_sharded.view(batch_size, 4 * shard_c, 1, 1, 1) + + for resnet in self.res_blocks: + hidden_states = resnet(hidden_states, timestep_embed=timestep_embed) + + return hidden_states + + +# ── Sharded DepthToSpaceUpsample ──────────────────────────────────── + + +class ShardedDepthToSpaceUpsample(nn.Module): + """LTX-2.3 DepthToSpaceUpsample (sub-pixel shuffle) with TP support. + + The sub-pixel shuffle is a local reshape+permute operation. At any TP + degree, per-rank channel counts are divisible by stride_prod, so + the shuffle works independently per rank. + + ltx-core's DepthToSpaceUpsample uses einops rearrange for the shuffle. + We replicate the exact same logic with explicit reshape+permute. 
+ """ + + def __init__(self, original_upsampler, tp_degree): + super().__init__() + self.tp_degree = tp_degree + # stride is a tuple like (2, 2, 2) + self.stride = ( + original_upsampler.stride + if hasattr(original_upsampler, "stride") + else (2, 2, 2) + ) + self.residual = getattr(original_upsampler, "residual", False) + + # The conv inside DepthToSpaceUpsample + conv_module = original_upsampler.conv + # CausalConv3d wraps nn.Conv3d + inner_conv = conv_module.conv if hasattr(conv_module, "conv") else conv_module + self.conv = ColumnRowParallelConv3d(inner_conv, tp_degree) + + # Temporal padding + kernel_t = getattr(conv_module, "time_kernel_size", 3) + self._pad = make_noncausal_pad_fn_from_kernel(kernel_t) + + # Compute per-rank out_channels for the shuffle + # Original: out_channels = prod(stride) * in_channels // reduction_factor + # After TP sharding, each rank has out_channels // tp_degree + stride_prod = self.stride[0] * self.stride[1] * self.stride[2] + self.stride_prod = stride_prod + + # Store out_channels_reduction_factor if present + self.out_channels_reduction_factor = getattr( + original_upsampler, "out_channels_reduction_factor", 1 + ) + + def forward(self, x): + batch_size, num_channels, num_frames, height, width = x.shape + s_t, s_h, s_w = self.stride + + residual = None + if self.residual: + # Reshape for skip connection: depth-to-space on input + residual = x.reshape( + batch_size, -1, s_t, s_h, s_w, num_frames, height, width + ) + residual = residual.permute(0, 1, 5, 2, 6, 3, 7, 4) + residual = residual.flatten(6, 7).flatten(4, 5).flatten(2, 3) + # Repeat channels for reduction factor + repeats = self.stride_prod // self.out_channels_reduction_factor + if repeats > 1: + residual = residual.repeat(1, repeats, 1, 1, 1) + # Remove first frame (temporal stride) + residual = residual[:, :, s_t - 1 :] + + x = self._pad(x) + x = self.conv(x) + + # Sub-pixel shuffle + b2, c2, f2, h2, w2 = x.shape + x = x.reshape(b2, -1, s_t, s_h, s_w, f2, h2, w2) + x = 
x.permute(0, 1, 5, 2, 6, 3, 7, 4) + x = x.flatten(6, 7).flatten(4, 5).flatten(2, 3) + # Remove first frame + x = x[:, :, s_t - 1 :] + + if residual is not None: + x = x + residual + + return x + + +# ── Full Sharded LTX-2.3 Decoder ─────────────────────────────────── + + +class ShardedLTX23Decoder(nn.Module): + """Tensor Parallel LTX-2.3 VAE Decoder. + + Wraps the ltx-core VideoDecoder's flat up_blocks list. + + The compiled graph expects pre-processed input: + - Noise already injected (if timestep_conditioning) + - Latent already denormalized via per_channel_statistics + - Scaled timestep computed on CPU + + Channel flow: + conv_in: 128 full -> 1024 sharded (ColumnParallel) + up_blocks: flat list of UNetMidBlock3D + DepthToSpaceUpsample + [0] res_x@1024: 5 sharded resnets + [1] compress_all: 1024 sharded -> upsampler -> 512 sharded + [2] res_x@512: 5 sharded resnets + [3] compress_all: 512 sharded -> upsampler -> 256 sharded + [4] res_x@256: 5 sharded resnets + [5] compress_all: 256 sharded -> upsampler -> 128 sharded + [6] res_x@128: 5 sharded resnets + norm_out: ShardedPixelNorm (128 sharded) + [AdaLN]: last_scale_shift_table applied (if timestep_conditioning) + conv_act: SiLU + conv_out: 128 sharded -> 48 full (RowParallel) + unpatchify: 48 -> 3 (spatial rearrangement, patch_size=4) + """ + + def __init__(self, video_decoder, tp_degree): + super().__init__() + self.tp_degree = tp_degree + self.timestep_conditioning = video_decoder.timestep_conditioning + self.patch_size = video_decoder.patch_size + + # conv_in: 128 full -> 1024 sharded + conv_in_module = video_decoder.conv_in + inner_conv_in = ( + conv_in_module.conv if hasattr(conv_in_module, "conv") else conv_in_module + ) + self._conv_in_pad = make_noncausal_pad_fn_from_kernel( + getattr(conv_in_module, "time_kernel_size", 3) + ) + self.conv_in = ColumnParallelConv3d(inner_conv_in, tp_degree) + + # Timestep conditioning: compute scaled_timestep on CPU, pass in + if self.timestep_conditioning: + 
self.timestep_scale_multiplier = video_decoder.timestep_scale_multiplier + + # up_blocks: walk the flat list + self.sharded_up_blocks = nn.ModuleList() + self.up_block_types = [] # "mid" or "upsample" + + for block in video_decoder.up_blocks: + block_type = type(block).__name__ + if "UNetMidBlock3D" in block_type: + self.sharded_up_blocks.append(ShardedUNetMidBlock3D(block, tp_degree)) + self.up_block_types.append("mid") + elif "DepthToSpace" in block_type: + self.sharded_up_blocks.append( + ShardedDepthToSpaceUpsample(block, tp_degree) + ) + self.up_block_types.append("upsample") + elif "ResnetBlock3D" in block_type: + self.sharded_up_blocks.append(ShardedResnetBlock3D(block, tp_degree)) + self.up_block_types.append("resnet") + else: + raise ValueError(f"Unknown up_block type: {block_type}") + + # norm_out: sharded PixelNorm + self.norm_out = ShardedPixelNorm(video_decoder.conv_norm_out, tp_degree) + + # Final AdaLN (timestep conditioning) + if self.timestep_conditioning: + # last_time_embedder: small MLP, keep on all ranks + self.last_time_embedder = video_decoder.last_time_embedder + # last_scale_shift_table: [2, C_full] — shard C + self.last_scale_shift_table_full = video_decoder.last_scale_shift_table + + # conv_act: SiLU + self.conv_act = nn.SiLU() + + # conv_out: 128 sharded -> 48 full (RowParallel) + conv_out_module = video_decoder.conv_out + inner_conv_out = ( + conv_out_module.conv + if hasattr(conv_out_module, "conv") + else conv_out_module + ) + self._conv_out_pad = make_noncausal_pad_fn_from_kernel( + getattr(conv_out_module, "time_kernel_size", 3) + ) + self.conv_out = RowParallelConv3d(inner_conv_out, tp_degree) + + def forward(self, latent, scaled_timestep=None): + """Forward pass on denormalized, noise-injected latent. 
+ + Args: + latent: [B, 128, T, H_tile, W_tile] — denormalized latent tile + scaled_timestep: [B] — timestep * scale_multiplier (or None) + + Returns: + [B, 3, T_out, H_out, W_out] — decoded video tile + """ + # conv_in + x = self._conv_in_pad(latent) + x = self.conv_in(x) + + # up_blocks + for block, btype in zip(self.sharded_up_blocks, self.up_block_types): + if btype == "mid": + x = block(x, scaled_timestep=scaled_timestep) + elif btype == "upsample": + x = block(x) + elif btype == "resnet": + x = block(x) + + # norm_out + x = self.norm_out(x) + + # Final AdaLN + if self.timestep_conditioning and scaled_timestep is not None: + batch_size = x.shape[0] + embedded_timestep = self.last_time_embedder( + timestep=scaled_timestep.flatten(), + hidden_dtype=x.dtype, + ) + embedded_timestep = embedded_timestep.view( + batch_size, embedded_timestep.shape[-1], 1, 1, 1 + ) + ada_values = self.last_scale_shift_table_full[ + None, ..., None, None, None + ].to(device=x.device, dtype=x.dtype) + embedded_timestep.reshape( + batch_size, + 2, + -1, + embedded_timestep.shape[-3], + embedded_timestep.shape[-2], + embedded_timestep.shape[-1], + ) + shift, scale = ada_values.unbind(dim=1) + # Shard shift/scale to match local channel count + from neuronx_distributed.parallel_layers import parallel_state + + tp_rank = parallel_state.get_tensor_model_parallel_rank() + shard_c = x.shape[1] + start = tp_rank * shard_c + end = start + shard_c + shift = shift[:, start:end] + scale = scale[:, start:end] + x = x * (1 + scale) + shift + + # conv_act + conv_out (gathers channels) + x = self.conv_act(x) + x = self._conv_out_pad(x) + x = self.conv_out(x) # [B, 48, F, H, W] — gathered (full channels) + + # Unpatchify: 48 channels -> 3 RGB channels + p = self.patch_size # 4 + batch_size, num_channels, num_frames, height, width = x.shape + # From ltx-core: unpatchify(sample, patch_size_hw=4, patch_size_t=1) + # Equivalent to rearrange "b (c p1 p2) f h w -> b c f (h p1) (w p2)" + # where p1=p2=4, c=3 + 
x = x.reshape(batch_size, -1, p, p, num_frames, height, width) + x = x.permute(0, 1, 4, 5, 2, 6, 3) # [B, 3, F, H, p, W, p] + x = x.reshape(batch_size, -1, num_frames, height * p, width * p) + + return x + + +class DecoderWrapperTP(nn.Module): + """Wrapper for parallel_model_trace.""" + + def __init__(self, video_decoder, tp_degree): + super().__init__() + self.decoder = ShardedLTX23Decoder(video_decoder, tp_degree) + + def forward(self, latent, scaled_timestep=None): + return self.decoder(latent, scaled_timestep) + + +def get_decoder_model(tp_degree, model_path, config): + """Factory function for parallel_model_trace. + + Loads the ltx-core VideoDecoder from safetensors and wraps it. + + Returns: + (model, empty_dict): The DecoderWrapperTP model and empty state dict + """ + from ltx_core.model.video_vae.model_configurator import VideoDecoderConfigurator + from ltx_core.loader.single_gpu_model_builder import SingleGPUModelBuilder + from ltx_core.loader.sd_ops import SDOps + + vd_ops = ( + SDOps("v") + .with_matching(prefix="vae.") + .with_replacement("vae.decoder.", "") + .with_replacement("vae.", "") + ) + vd_builder = SingleGPUModelBuilder( + model_class_configurator=VideoDecoderConfigurator, + model_path=model_path, + model_sd_ops=vd_ops, + ) + video_decoder = vd_builder.build(device=torch.device("cpu"), dtype=torch.float32) + video_decoder.eval() + + wrapper = DecoderWrapperTP(video_decoder, tp_degree) + wrapper.eval() + return wrapper, {} + + +def compile_vae_decoder( + tp_degree=4, + tile_height=128, + tile_width=512, + num_frames=121, + output_dir="/home/ubuntu/ltx23_vae_tp4", + compiler_workdir="/home/ubuntu/compiler_workdir_vae23", + model_path="/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors", + config=None, + timestep_conditioning=True, +): + """Compile the TP VAE decoder for Neuron. 
+ + Args: + tp_degree: Tensor parallel degree (default 4) + tile_height: Tile height in pixels (default 128 for 4-latent) + tile_width: Tile width in pixels (default 512 for 16-latent) + num_frames: Number of video frames (default 121) + output_dir: Directory to save compiled model + compiler_workdir: Directory for compiler intermediate files + model_path: Path to LTX-2.3 safetensors + config: Parsed config dict (loaded from safetensors metadata if None) + timestep_conditioning: Whether the decoder uses timestep conditioning + + Returns: + compiled: The compiled parallel model + """ + import json + import time + + import neuronx_distributed + + if config is None: + from safetensors import safe_open + + with safe_open(model_path, framework="pt") as f: + metadata = f.metadata() + config = json.loads(metadata["config"]) + + latent_f = (num_frames - 1) // 8 + 1 + latent_h = tile_height // 32 + latent_w = tile_width // 32 + + os.environ["LOCAL_WORLD_SIZE"] = str(tp_degree) + os.environ["NEURON_CC_FLAGS"] = ( + os.environ.get("NEURON_CC_FLAGS", "") + f" {COMPILER_FLAGS}" + ) + + print("=" * 70) + print("LTX-2.3 VAE Decoder — Tensor Parallel Compilation") + print("=" * 70) + print(f" Tile size: {tile_height}x{tile_width} pixels") + print(f" Latent tile: [1, 128, {latent_f}, {latent_h}, {latent_w}]") + print(f" TP degree: {tp_degree}") + print(f" Timestep conditioning: {timestep_conditioning}") + print(f" Output dir: {output_dir}") + + latent_input = torch.randn( + 1, 128, latent_f, latent_h, latent_w, dtype=torch.float32 + ) + + # Build trace inputs + if timestep_conditioning: + # scaled_timestep is a scalar tensor [B=1] + scaled_timestep = torch.tensor([0.05 * 1000.0], dtype=torch.float32) # 50.0 + trace_inputs = (latent_input, scaled_timestep) + else: + trace_inputs = (latent_input,) + + print(f"\n Compiling (this may take 10-30 minutes)...") + t0 = time.time() + + get_model_fn = partial(get_decoder_model, tp_degree, model_path, config) + + compiled = 
neuronx_distributed.trace.parallel_model_trace( + get_model_fn, + trace_inputs, + compiler_workdir=compiler_workdir, + compiler_args=COMPILER_FLAGS, + tp_degree=tp_degree, + inline_weights_to_neff=False, + ) + + compile_time = time.time() - t0 + print(f" Compiled in {compile_time:.1f}s") + + os.makedirs(output_dir, exist_ok=True) + neuronx_distributed.trace.parallel_model_save(compiled, output_dir) + print(f" Saved to {output_dir}") + + # Quick validation + print("\n Running validation...") + with torch.no_grad(): + neuron_out = compiled(*trace_inputs) + print(f" Output shape: {list(neuron_out.shape)}") + print(f" Output range: [{neuron_out.min():.3f}, {neuron_out.max():.3f}]") + T_out = (num_frames - 1) + 1 if latent_f == 1 else num_frames + print(f" Expected: [1, 3, ~{T_out}, {tile_height}, {tile_width}]") + + return compiled diff --git a/contrib/models/LTX-2.3/src/pipeline.py b/contrib/models/LTX-2.3/src/pipeline.py new file mode 100644 index 00000000..f634c119 --- /dev/null +++ b/contrib/models/LTX-2.3/src/pipeline.py @@ -0,0 +1,778 @@ +""" +NxDI LTX-2.3 Pipeline +===================== +Neuron-aware pipeline for the LTX-2.3 22B audio-video diffusion model. + +Unlike LTX-2 which wrapped the Diffusers LTX2Pipeline, LTX-2.3 uses native +ltx-core components directly (no Diffusers pipeline exists for 2.3). + +The pipeline handles: +1. CPU preprocessing via native ltx-core TransformerArgsPreprocessor +2. Routing the 24 flat tensors through the compiled Neuron backbone +3. CPU postprocessing (unpatchify, VAE decode) +4. Euler denoising loop with flow matching scheduler + +Text encoding (Gemma 3 12B) and VAE decoding stay on CPU. +Only the DiT transformer backbone runs on Neuron. +""" + +import logging +import math +from typing import Optional, Tuple + +import torch +import torch.nn as nn + +logger = logging.getLogger(__name__) + + +class NeuronTransformerWrapper(nn.Module): + """Wraps the compiled Neuron backbone with CPU preprocessing. 
+ + The native ltx-core LTXModel.forward() takes Modality objects and runs: + 1. Preprocessing (patchify, adaln, rope, connector) -> TransformerArgs + 2. 48 transformer blocks + 3. Output projection (norm, scale/shift, linear) + + This wrapper keeps step 1 on CPU (using the original LTXModel's preprocessors) + and routes steps 2-3 through the compiled Neuron backbone. + + The backbone expects 24 flat tensors (see modeling_ltx23.py forward signature). + + Performance optimizations: + - AdaLN deduplication: when all tokens share the same sigma (T2V mode), + the AdaLN MLP is computed once instead of N times. Saves ~47ms/step + (from 54ms to 7ms for the AdaLN component). + - Step-invariant caching: RoPE embeddings and context projection are + computed once on the first step and reused for subsequent steps. + Saves ~57ms/step (RoPE ~1.6ms + context projection ~55ms). + """ + + def __init__( + self, + compiled_backbone, + cpu_ltx_model, + text_seq=256, + mask_4d=False, + compiled_text_seq=None, + ): + """ + Args: + compiled_backbone: Compiled Neuron model (TensorParallelNeuronModel + or callable that takes 24 positional tensor args) + cpu_ltx_model: The full unsharded LTXModel on CPU (for preprocessors) + text_seq: Maximum text sequence length for preprocessing + mask_4d: If True, keep attention masks as 4D (B,1,1,seq) for + Application-compiled NEFFs. Standalone-compiled NEFFs expect + 2D (B,seq) masks. + compiled_text_seq: Text sequence length the NEFF was compiled with. + If different from text_seq, context tensors are padded to match. + Defaults to text_seq if None. 
+ """ + super().__init__() + self.compiled_backbone = compiled_backbone + self.text_seq = text_seq + self.compiled_text_seq = compiled_text_seq or text_seq + self.dtype = torch.bfloat16 + self.mask_4d = mask_4d + + # Keep CPU preprocessors from the native model + self.video_args_preprocessor = cpu_ltx_model.video_args_preprocessor + self.audio_args_preprocessor = cpu_ltx_model.audio_args_preprocessor + + # Apply AdaLN optimization by default + self._patch_adaln_dedup() + + # Apply step-invariant caching (RoPE + context projection) + self._step_cache = {} + self._patch_step_invariant_cache() + + def _patch_adaln_dedup(self): + """Monkey-patch the preprocessor's _prepare_timestep to deduplicate AdaLN. + + In T2V mode, all video tokens share the same sigma value, so the AdaLN + MLP (sinusoidal → Linear 256→4096 → SiLU → Linear 4096→36864) processes + N identical values. This patch detects uniform timesteps and computes + the MLP once, then expands. Measured speedup: 54ms → 7ms (8.3x) for + 768 video tokens. + + In I2V mode, there are typically 2 unique sigma values (sigma=0 for + the conditioning frame, sigma for the rest). The patch handles this + by computing only the unique values. + + The patch targets the inner TransformerArgsPreprocessor._prepare_timestep + which is used by both the video and audio preprocessors. + """ + + def _dedup_prepare_timestep(original_fn, preprocessor): + """Create a deduplicated version of _prepare_timestep.""" + + def patched(timestep, adaln, batch_size, hidden_dtype): + # timestep shape: (B, seq_len) or (B, 1) + # In T2V: all values identical. In I2V: typically 2 unique values. 
+ flat = timestep.flatten() + unique_vals = flat.unique() + + if unique_vals.numel() == flat.numel(): + # All different — no dedup possible, use original + return original_fn( + preprocessor, timestep, adaln, batch_size, hidden_dtype + ) + + # Compute AdaLN MLP only for unique values + scaled_unique = unique_vals * preprocessor.timestep_scale_multiplier + ts_unique, et_unique = adaln(scaled_unique, hidden_dtype=hidden_dtype) + # ts_unique: (n_unique, adaln_dim), et_unique: (n_unique, embed_dim) + + if unique_vals.numel() == 1: + # All tokens share same sigma — most common T2V case + ts = ts_unique.expand(flat.shape[0], -1) + et = et_unique.expand(flat.shape[0], -1) + else: + # Multiple unique values (I2V case) — scatter back + # Build index map: for each flat element, which unique index? + sorted_unique, sort_idx = unique_vals.sort() + bucket = torch.bucketize(flat, sorted_unique) + bucket = bucket.clamp(max=len(sorted_unique) - 1) + # Map through sort index to get original unique ordering + inv_sort = torch.empty_like(sort_idx) + inv_sort[sort_idx] = torch.arange( + len(sort_idx), device=sort_idx.device + ) + idx = inv_sort[bucket] + ts = ts_unique[idx] + et = et_unique[idx] + + ts = ts.view(batch_size, -1, ts.shape[-1]) + et = et.view(batch_size, -1, et.shape[-1]) + return ts, et + + return patched + + # Patch the video preprocessor's inner simple_preprocessor + video_inner = self.video_args_preprocessor.simple_preprocessor + orig_video = type(video_inner)._prepare_timestep + video_inner._prepare_timestep = _dedup_prepare_timestep(orig_video, video_inner) + + # Patch the audio preprocessor's inner simple_preprocessor + audio_inner = self.audio_args_preprocessor.simple_preprocessor + orig_audio = type(audio_inner)._prepare_timestep + audio_inner._prepare_timestep = _dedup_prepare_timestep(orig_audio, audio_inner) + + logger.info( + "AdaLN deduplication patch applied to video and audio preprocessors" + ) + + def _patch_step_invariant_cache(self): + """Cache RoPE 
embeddings and context projections across denoising steps. + + Several preprocessing components depend only on positions and context + (not on the diffusion timestep or latent sample), so they produce + identical results on every denoising step: + + 1. RoPE (positional embeddings): depends on spatial/temporal positions + which are constant across steps. ~1.6ms × 4 calls = ~1.6ms total. + 2. Context projection (caption_projection Linear): depends on text + encoder output which is constant across steps. ~55ms for video. + 3. Cross-modal RoPE: same positions, constant. ~0.4ms. + 4. Attention mask preparation: depends on context_mask, constant. ~0.02ms. + + Total savings: ~57ms/step (from ~67ms to ~10ms preprocessing before + AdaLN dedup, or from ~48ms to ~14ms after AdaLN dedup is already applied). + + The cache is keyed by a string identifier for each call site. Call + clear_step_cache() between generations with different prompts/resolutions. + """ + cache = self._step_cache + + def _make_cached_fn(original_fn, preprocessor, cache_prefix): + """Wrap _prepare_positional_embeddings with caching.""" + + def cached_pe( + positions, + inner_dim, + max_pos, + use_middle_indices_grid, + num_attention_heads, + x_dtype, + ): + key = f"{cache_prefix}_pe_{inner_dim}_{positions.shape}" + if key in cache: + return cache[key] + result = original_fn( + preprocessor, + positions, + inner_dim, + max_pos, + use_middle_indices_grid, + num_attention_heads, + x_dtype, + ) + cache[key] = result + return result + + return cached_pe + + def _make_cached_context(original_fn, preprocessor, cache_prefix): + """Wrap _prepare_context with caching.""" + + def cached_ctx(context, x): + key = f"{cache_prefix}_ctx" + if key in cache: + return cache[key] + result = original_fn(preprocessor, context, x) + cache[key] = result + return result + + return cached_ctx + + def _make_cached_mask(original_fn, preprocessor, cache_prefix): + """Wrap _prepare_attention_mask with caching.""" + + def 
cached_mask(context_mask, dtype): + key = f"{cache_prefix}_mask" + if key in cache: + return cache[key] + result = original_fn(preprocessor, context_mask, dtype) + cache[key] = result + return result + + return cached_mask + + # Patch video preprocessor + video_inner = self.video_args_preprocessor.simple_preprocessor + video_cls = type(video_inner) + video_inner._prepare_positional_embeddings = _make_cached_fn( + video_cls._prepare_positional_embeddings, video_inner, "video" + ) + video_inner._prepare_context = _make_cached_context( + video_cls._prepare_context, video_inner, "video" + ) + video_inner._prepare_attention_mask = _make_cached_mask( + video_cls._prepare_attention_mask, video_inner, "video" + ) + + # Patch audio preprocessor + audio_inner = self.audio_args_preprocessor.simple_preprocessor + audio_cls = type(audio_inner) + audio_inner._prepare_positional_embeddings = _make_cached_fn( + audio_cls._prepare_positional_embeddings, audio_inner, "audio" + ) + audio_inner._prepare_context = _make_cached_context( + audio_cls._prepare_context, audio_inner, "audio" + ) + audio_inner._prepare_attention_mask = _make_cached_mask( + audio_cls._prepare_attention_mask, audio_inner, "audio" + ) + + logger.info( + "Step-invariant cache applied (RoPE, context projection, attention mask)" + ) + + def clear_step_cache(self): + """Clear the step-invariant cache. + + Call this between generations with different prompts, resolutions, + or frame counts. Not needed between denoising steps (that's the point). + """ + self._step_cache.clear() + + def preprocess(self, video_modality, audio_modality): + """Run CPU preprocessing to produce the 24 flat tensors. + + Args: + video_modality: ltx_core Modality for video + audio_modality: ltx_core Modality for audio + + Returns: + Tuple of 24 tensors matching the backbone forward signature. 
+ """ + with torch.no_grad(): + va = self.video_args_preprocessor.prepare(video_modality, audio_modality) + aa = self.audio_args_preprocessor.prepare(audio_modality, video_modality) + + dtype = self.dtype + + # Extract RoPE tuples + video_pe_cos, video_pe_sin = va.positional_embeddings + audio_pe_cos, audio_pe_sin = aa.positional_embeddings + ca_video_pe_cos, ca_video_pe_sin = va.cross_positional_embeddings + ca_audio_pe_cos, ca_audio_pe_sin = aa.cross_positional_embeddings + + # Build the 24 flat tensors + # Note: context_mask tensors must be distinct objects (different data_ptr) + # to avoid flattener layout assertion errors in the Neuron JIT wrapper. + v_mask = va.context_mask + a_mask = aa.context_mask + + # Attention mask pipeline: + # - If Modality.context_mask was int64 (from real text encoder or random + # with int64 mask), the preprocessor's _prepare_attention_mask converts + # it to 4D additive format: (B,1,1,seq) with 0=attend, -max=ignore. + # We squeeze to 2D and it's already correct — no further conversion. + # - If Modality.context_mask was bf16 (e.g., from older code paths), + # the preprocessor passes it through unchanged as 2D bf16 binary {0,1}. + # We must convert binary→additive: 1→0 (attend), 0→-max (ignore). + already_additive = False + + if v_mask is not None and v_mask.ndim == 4: + if self.mask_4d: + # Application-compiled NEFFs expect 4D masks (B,1,1,seq). + # Keep as-is; already in additive format from preprocessor. + already_additive = True + else: + # Standalone-compiled NEFFs expect 2D masks (B,seq). 
+ v_mask = v_mask.squeeze(1).squeeze(1) # (B,1,1,seq) -> (B,seq) + already_additive = True + if a_mask is not None and a_mask.ndim == 4: + if not self.mask_4d: + a_mask = a_mask.squeeze(1).squeeze(1) + + # Convert to bf16 + if v_mask is not None: + v_mask = v_mask.to(dtype) + if a_mask is not None: + a_mask = a_mask.to(dtype) + + # Only convert if the mask is still binary (bf16 input case) + if v_mask is not None and not already_additive: + if v_mask.ndim == 2: + # Mask is bf16 binary {0, 1}: convert to additive format + finfo = torch.finfo(dtype) + v_mask = torch.where( + v_mask > 0.5, + torch.zeros_like(v_mask), + torch.full_like(v_mask, finfo.min), + ) + a_mask = torch.where( + a_mask > 0.5, + torch.zeros_like(a_mask), + torch.full_like(a_mask, finfo.min), + ) + if self.mask_4d: + # Expand to 4D for Application path + v_mask = v_mask.unsqueeze(1).unsqueeze(1) # (B,seq) -> (B,1,1,seq) + a_mask = a_mask.unsqueeze(1).unsqueeze(1) + + inputs = ( + va.x.to(dtype), + aa.x.to(dtype), + va.context.to(dtype), + aa.context.to(dtype), + va.timesteps.to(dtype), + aa.timesteps.to(dtype), + va.embedded_timestep.to(dtype), + aa.embedded_timestep.to(dtype), + va.cross_scale_shift_timestep.to(dtype), + aa.cross_scale_shift_timestep.to(dtype), + va.cross_gate_timestep.to(dtype), + aa.cross_gate_timestep.to(dtype), + video_pe_cos.to(dtype), + video_pe_sin.to(dtype), + audio_pe_cos.to(dtype), + audio_pe_sin.to(dtype), + ca_video_pe_cos.to(dtype), + ca_video_pe_sin.to(dtype), + ca_audio_pe_cos.to(dtype), + ca_audio_pe_sin.to(dtype), + v_mask, + a_mask.clone(), # must be distinct tensor object + va.prompt_timestep.to(dtype), + aa.prompt_timestep.to(dtype), + ) + + # Pad context/mask tensors if compiled_text_seq > actual text_seq + if self.compiled_text_seq != self.text_seq: + inputs = self._pad_context_tensors(inputs) + + return inputs, va, aa + + def _pad_context_tensors(self, inputs): + """Pad context and mask tensors to match compiled_text_seq. 
+ + The NEFF was compiled with compiled_text_seq context tokens, but the + actual text encoder output may have fewer tokens. Pad with zeros + (context) and -inf (masks) so the extra positions are ignored. + + Tensor indices in the 24-input tuple: + [2] encoder_hidden_states (B, text_seq, inner_dim) + [3] audio_encoder_hidden_states (B, text_seq, audio_inner_dim) + [20] encoder_attention_mask (B, text_seq) or (B,1,1,text_seq) + [21] audio_encoder_attention_mask (same shape) + """ + actual_seq = inputs[2].shape[1] + target_seq = self.compiled_text_seq + if actual_seq >= target_seq: + return inputs + + pad_len = target_seq - actual_seq + dtype = self.dtype + inputs = list(inputs) + + # Pad context with zeros + for idx in (2, 3): + ctx = inputs[idx] + pad = torch.zeros(ctx.shape[0], pad_len, ctx.shape[2], dtype=dtype) + inputs[idx] = torch.cat([ctx, pad], dim=1) + + # Pad masks with -inf (masked out) + finfo = torch.finfo(dtype) + for idx in (20, 21): + mask = inputs[idx] + if mask.ndim == 4: + # (B, 1, 1, seq) -> pad last dim + pad = torch.full((mask.shape[0], 1, 1, pad_len), finfo.min, dtype=dtype) + inputs[idx] = torch.cat([mask, pad], dim=-1) + elif mask.ndim == 2: + # (B, seq) -> pad last dim + pad = torch.full((mask.shape[0], pad_len), finfo.min, dtype=dtype) + inputs[idx] = torch.cat([mask, pad], dim=-1) + + return tuple(inputs) + + def forward(self, video_modality, audio_modality): + """Preprocess on CPU, run backbone on Neuron, return (video_out, audio_out). + + Args: + video_modality: ltx_core Modality for video + audio_modality: ltx_core Modality for audio + + Returns: + (video_output, audio_output) tensors from the backbone + """ + inputs, va, aa = self.preprocess(video_modality, audio_modality) + video_output, audio_output = self.compiled_backbone(*inputs) + return video_output, audio_output + + +class NeuronLTX23Pipeline: + """Self-contained pipeline for LTX-2.3 on Neuron. + + Orchestrates: + 1. Text encoding (Gemma 3 12B on CPU) + 2. 
Noise scheduling (flow matching with Euler steps) + 3. Denoising loop (Neuron backbone via NeuronTransformerWrapper) + 4. VAE decoding (video VAE + audio VAE + vocoder on CPU) + + Usage: + pipe = NeuronLTX23Pipeline( + ltx_model=cpu_ltx_model, # full native ltx-core model + neuron_backbone=compiled_backbone, + text_encoder=gemma_model, + embeddings_processor=embeddings_proc, + tokenizer=tokenizer, + video_vae=video_vae, + audio_vae=audio_vae, + vocoder=vocoder, + ) + video, audio = pipe( + prompt="A dog playing in a meadow", + height=384, width=512, num_frames=25, + num_inference_steps=8, + ) + """ + + def __init__( + self, + ltx_model, + neuron_backbone, + text_encoder=None, + embeddings_processor=None, + tokenizer=None, + video_vae=None, + audio_vae=None, + vocoder=None, + text_seq=256, + mask_4d=False, + compiled_text_seq=None, + ): + """ + Args: + ltx_model: Native ltx-core LTXModel (for preprocessors, patchify, etc.) + neuron_backbone: Compiled Neuron backbone (callable with 24 inputs) + text_encoder: Gemma 3 12B model (on CPU) + embeddings_processor: EmbeddingsProcessor (feature extractor + connectors) + tokenizer: LTXVGemmaTokenizer + video_vae: Video VAE decoder + audio_vae: Audio VAE decoder + vocoder: Audio vocoder + text_seq: Maximum text sequence length + mask_4d: If True, keep attention masks as 4D for Application-compiled NEFFs + compiled_text_seq: Text seq len the NEFF was compiled with (defaults to text_seq) + """ + self.ltx_model = ltx_model + self.wrapper = NeuronTransformerWrapper( + compiled_backbone=neuron_backbone, + cpu_ltx_model=ltx_model, + text_seq=text_seq, + mask_4d=mask_4d, + compiled_text_seq=compiled_text_seq, + ) + self.text_encoder = text_encoder + self.embeddings_processor = embeddings_processor + self.tokenizer = tokenizer + self.video_vae = video_vae + self.audio_vae = audio_vae + self.vocoder = vocoder + self.text_seq = text_seq + self.dtype = torch.bfloat16 + + def encode_text(self, prompt, device="cpu"): + """Encode text 
prompt using Gemma 3 12B + embeddings processor. + + Returns: + (video_context, audio_context, context_mask) tensors + """ + if self.text_encoder is None or self.tokenizer is None: + raise RuntimeError("Text encoder and tokenizer must be set") + + # Tokenize + tokens = self.tokenizer( + prompt, + max_length=self.text_seq, + padding="max_length", + truncation=True, + return_tensors="pt", + ) + input_ids = tokens.input_ids.to(device) + attention_mask = tokens.attention_mask.to(device) + + # Run Gemma 3 12B + with torch.no_grad(): + outputs = self.text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + output_hidden_states=True, + ) + hidden_states = outputs.hidden_states + + # Run embeddings processor (feature extractor + connectors) + with torch.no_grad(): + result = self.embeddings_processor.process_hidden_states( + hidden_states=hidden_states, + attention_mask=attention_mask, + ) + + return result.video_encoding, result.audio_encoding, result.attention_mask + + def velocity_to_denoised(self, sample, velocity, sigma): + """Convert velocity prediction to denoised sample (flow matching). + + The LTX-2.3 backbone (LTXModel) outputs velocity v, where: + denoised = sample - v * sigma + """ + return (sample.to(torch.float32) - velocity.to(torch.float32) * sigma).to( + sample.dtype + ) + + def denoise_step(self, sample, velocity, sigma, sigma_next): + """Single Euler step for flow matching diffusion. + + Takes the velocity output from the backbone (NOT denoised). 
+ Computes: next_sample = sample + velocity * (sigma_next - sigma) + """ + dt = sigma_next - sigma + return (sample.to(torch.float32) + velocity.to(torch.float32) * dt).to( + sample.dtype + ) + + def __call__( + self, + prompt: str = "", + video_context: Optional[torch.Tensor] = None, + audio_context: Optional[torch.Tensor] = None, + context_mask: Optional[torch.Tensor] = None, + height: int = 384, + width: int = 512, + num_frames: int = 25, + num_inference_steps: int = 8, + guidance_scale: float = 1.0, + generator: Optional[torch.Generator] = None, + fps: float = 24.0, + audio_num_frames: int = 26, + decode_video: bool = True, + decode_audio: bool = True, + ): + """Run the full LTX-2.3 pipeline. + + Either provide pre-computed context tensors or a text prompt. + If both are provided, the pre-computed tensors are used. + + Returns: + dict with keys 'video_latent', 'audio_latent', and optionally + 'video' (decoded frames) and 'audio' (decoded waveform). + """ + # Text encoding (if context not pre-computed) + if video_context is None: + if not prompt: + raise ValueError( + "Either prompt or pre-computed context must be provided" + ) + video_context, audio_context, context_mask = self.encode_text(prompt) + + if context_mask is None: + context_mask = torch.ones(1, self.text_seq, dtype=self.dtype) + + # Import ltx-core tools for latent creation and scheduling + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + from ltx_core.model.transformer.modality import Modality + + # Compute latent dimensions + # LTX-2.3 VAE downsamples spatially by 32x (not 16x) + # For 384x512 -> height=12, width=16 latent grid + latent_h = height // 32 + latent_w = width // 32 + latent_f = (num_frames - 1) // 8 + 1 # temporal downsampling + + video_shape = VideoLatentShape( + batch=1, channels=128, frames=latent_f, height=latent_h, width=latent_w + ) + 
v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() # time=8, height=32, width=32 + video_tools = VideoLatentTools( + target_shape=video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=fps, + ) + + audio_shape = AudioLatentShape( + batch=1, channels=8, frames=audio_num_frames, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools( + patchifier=a_patchifier, target_shape=audio_shape + ) + + # Create initial noise + video_state = video_tools.create_initial_state(device="cpu", dtype=self.dtype) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=self.dtype) + + # Initialize with noise + if generator is not None: + video_noise = torch.randn( + video_state.latent.shape, + dtype=self.dtype, + generator=generator, + ) + audio_noise = torch.randn( + audio_state.latent.shape, + dtype=self.dtype, + generator=generator, + ) + else: + video_noise = torch.randn_like(video_state.latent) + audio_noise = torch.randn_like(audio_state.latent) + + # Sigma schedule — use distilled values for the distilled model + # The distilled model was trained with these exact sigma values + # See: ltx_pipelines/utils/constants.py DISTILLED_SIGMA_VALUES + DISTILLED_SIGMA_VALUES = [ + 1.0, + 0.99375, + 0.9875, + 0.98125, + 0.975, + 0.909375, + 0.725, + 0.421875, + 0.0, + ] + sigmas = torch.tensor(DISTILLED_SIGMA_VALUES, dtype=torch.float32) + assert len(sigmas) == num_inference_steps + 1, ( + f"Distilled sigma values have {len(sigmas)} entries " + f"but {num_inference_steps} steps require {num_inference_steps + 1}" + ) + + # Start from pure noise (sigma=1) + video_sample = video_noise.clone() + audio_sample = audio_noise.clone() + + # Clear step-invariant cache from any previous generation + self.wrapper.clear_step_cache() + + logger.info( + "Starting denoising: %d steps, sigmas=%s", + num_inference_steps, + [f"{s:.4f}" for s in sigmas.tolist()], + ) + + # 
Denoising loop + for step_idx in range(num_inference_steps): + sigma = sigmas[step_idx] + sigma_next = sigmas[step_idx + 1] + + # Per-token sigma for the backbone + video_seq_len = video_state.latent.shape[1] + audio_seq_len = audio_state.latent.shape[1] + v_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, video_seq_len) + a_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, audio_seq_len) + + # Build Modality objects + video_mod = Modality( + latent=video_sample, + sigma=sigma.unsqueeze(0), + timesteps=v_ts, + positions=video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_sample, + sigma=sigma.unsqueeze(0), + timesteps=a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), # distinct object + attention_mask=None, + ) + + # Forward through Neuron backbone (returns VELOCITY, not denoised) + video_velocity, audio_velocity = self.wrapper(video_mod, audio_mod) + + # Euler step using velocity directly + video_sample = self.denoise_step( + video_sample, video_velocity, sigma, sigma_next + ) + audio_sample = self.denoise_step( + audio_sample, audio_velocity, sigma, sigma_next + ) + + logger.info( + " Step %d/%d: sigma=%.4f -> %.4f", + step_idx + 1, + num_inference_steps, + sigma.item(), + sigma_next.item(), + ) + + result = { + "video_latent": video_sample, + "audio_latent": audio_sample, + } + + # VAE decode (optional, on CPU) + if decode_video and self.video_vae is not None: + with torch.no_grad(): + video_frames = self.video_vae.decode(video_sample) + result["video"] = video_frames + + if decode_audio and self.audio_vae is not None and self.vocoder is not None: + with torch.no_grad(): + audio_mel = self.audio_vae.decode(audio_sample) + audio_waveform = self.vocoder(audio_mel) + result["audio"] = audio_waveform + + return result diff --git a/contrib/models/LTX-2.3/src/run_phase2.py 
b/contrib/models/LTX-2.3/src/run_phase2.py new file mode 100644 index 00000000..f992c9c3 --- /dev/null +++ b/contrib/models/LTX-2.3/src/run_phase2.py @@ -0,0 +1,576 @@ +""" +Phase 2 of the two-stage LTX-2.3 benchmark on trn2.48xlarge. + +Loads Stage 1 latent saved by Phase 1, then runs: + - Spatial upsample x2 (CPU) + - Stage 2 denoising at full resolution (Neuron, TP=16) + - VAE decode (Neuron tiled or CPU fallback) + save frames + MP4 + +Usage: + # Phase 1 (TP=4, separate process): + python generate_ltx23.py --two-stage --use-app --save-s1-latent /mnt/models/s1_latent.pt ... + + # Phase 2 (TP=16, this script) with Neuron VAE: + python run_phase2.py \ + --model-path /mnt/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \ + --s1-latent /mnt/models/s1_latent.pt \ + --s2-compiled-dir /mnt/models/compiled/benchmark/s2_tp16 \ + --spatial-upscaler-path /mnt/models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.0.safetensors \ + --vae-compiled-dir /mnt/models/compiled/vae_tp4_4x16 \ + --height 1024 --width 1536 --num-frames 121 \ + --tp-degree 16 \ + --output-dir /mnt/models/output/benchmark + + # Without --vae-compiled-dir, falls back to CPU decode. 
+""" + +import argparse +import gc +import json +import logging +import os +import sys +import time + +import numpy as np +import torch + +logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s" +) +logger = logging.getLogger("run_phase2") + +# Must match Phase 1 compile constants +APP_BACKBONE_TEXT_SEQ = 256 +APP_BACKBONE_AUDIO_SEQ = 26 +STAGE_2_DISTILLED_SIGMA_VALUES = [0.909375, 0.725, 0.421875, 0.0] + + +def load_config(model_path): + from safetensors import safe_open + + with safe_open(model_path, framework="pt") as f: + metadata = f.metadata() + return json.loads(metadata["config"]) + + +def load_cpu_components(model_path, dtype=torch.bfloat16): + """Load CPU components needed for decode.""" + from generate_ltx23 import build_cpu_components + + config = load_config(model_path) + return build_cpu_components(config, model_path, dtype) + + +def create_app_compositor( + model_path, + encoder_path, + tp_degree, + text_seq, + height, + width, + num_frames, + audio_seq=APP_BACKBONE_AUDIO_SEQ, +): + """Create NeuronLTX23Application compositor for S2 backbone.""" + from neuronx_distributed_inference.models.config import NeuronConfig + from modeling_ltx23 import LTX23BackboneInferenceConfig + from modeling_gemma3_encoder import Gemma3EncoderInferenceConfig, GEMMA3_12B_CONFIG + from application import NeuronLTX23Application + + config = load_config(model_path) + tc = config["transformer"] + + num_heads = tc["num_attention_heads"] + head_dim = tc["attention_head_dim"] + inner_dim = num_heads * head_dim + audio_num_heads = tc["audio_num_attention_heads"] + audio_head_dim = tc["audio_attention_head_dim"] + audio_inner_dim = audio_num_heads * audio_head_dim + audio_ca_dim = tc.get("audio_cross_attention_dim", 2048) + + latent_h = height // 32 + latent_w = width // 32 + latent_f = (num_frames - 1) // 8 + 1 + video_seq = latent_f * latent_h * latent_w + + dtype = torch.bfloat16 + + backbone_nc = NeuronConfig( + tp_degree=tp_degree, + 
world_size=tp_degree, + batch_size=1, + seq_len=video_seq, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + backbone_config = LTX23BackboneInferenceConfig( + neuron_config=backbone_nc, + num_layers=tc["num_layers"], + num_attention_heads=num_heads, + attention_head_dim=head_dim, + inner_dim=inner_dim, + audio_num_attention_heads=audio_num_heads, + audio_attention_head_dim=audio_head_dim, + audio_inner_dim=audio_inner_dim, + audio_cross_attention_dim=audio_ca_dim, + video_seq=video_seq, + audio_seq=audio_seq, + text_seq=APP_BACKBONE_TEXT_SEQ, + height=latent_h, + width=latent_w, + num_frames=latent_f, + ltx_config_dict=config, + ) + + # Encoder config needed by Application even though we don't use it in Phase 2 + # Use TP=4 for encoder config since we won't actually load it + enc_tp = min(tp_degree, 4) # Gemma3 KV heads = 4, can't go beyond TP=4 + encoder_nc = NeuronConfig( + tp_degree=enc_tp, + world_size=enc_tp, + batch_size=1, + seq_len=512, + torch_dtype=dtype, + logical_nc_config=2, + save_sharded_checkpoint=True, + ) + encoder_config = Gemma3EncoderInferenceConfig( + neuron_config=encoder_nc, + vocab_size=GEMMA3_12B_CONFIG["vocab_size"], + hidden_size=GEMMA3_12B_CONFIG["hidden_size"], + num_hidden_layers=GEMMA3_12B_CONFIG["num_hidden_layers"], + num_attention_heads=GEMMA3_12B_CONFIG["num_attention_heads"], + num_key_value_heads=GEMMA3_12B_CONFIG["num_key_value_heads"], + head_dim=GEMMA3_12B_CONFIG["head_dim"], + intermediate_size=GEMMA3_12B_CONFIG["intermediate_size"], + rms_norm_eps=GEMMA3_12B_CONFIG["rms_norm_eps"], + rope_theta=GEMMA3_12B_CONFIG["rope_theta"], + max_position_embeddings=GEMMA3_12B_CONFIG["max_position_embeddings"], + query_pre_attn_scalar=GEMMA3_12B_CONFIG["query_pre_attn_scalar"], + pad_token_id=GEMMA3_12B_CONFIG["pad_token_id"], + ) + + app = NeuronLTX23Application( + backbone_config=backbone_config, + encoder_config=encoder_config, + model_path=model_path, + encoder_path=encoder_path or model_path, + ) + 
logger.info("S2 compositor created (video_seq=%d, TP=%d)", video_seq, tp_degree) + return app + + +def main(): + parser = argparse.ArgumentParser(description="LTX-2.3 Phase 2: S2 refinement") + parser.add_argument("--model-path", required=True, help="LTX-2.3 safetensors") + parser.add_argument( + "--s1-latent", required=True, help="Path to Phase 1 saved latent" + ) + parser.add_argument( + "--s2-compiled-dir", required=True, help="S2 compiled backbone dir" + ) + parser.add_argument( + "--spatial-upscaler-path", required=True, help="Spatial upscaler safetensors" + ) + parser.add_argument( + "--gemma-path", default=None, help="Gemma3 path (for Application init)" + ) + parser.add_argument( + "--vae-compiled-dir", + default=None, + help="Compiled Neuron VAE directory (if omitted, falls back to CPU decode)", + ) + parser.add_argument("--height", type=int, required=True, help="Full-res height") + parser.add_argument("--width", type=int, required=True, help="Full-res width") + parser.add_argument( + "--num-frames", type=int, required=True, help="Number of frames" + ) + parser.add_argument("--tp-degree", type=int, default=16, help="TP degree for S2") + parser.add_argument("--seed", type=int, default=10, help="Random seed for S2 noise") + parser.add_argument("--fps", type=float, default=25.0, help="Video frame rate") + parser.add_argument( + "--audio-num-frames", type=int, default=26, help="Audio latent frames" + ) + parser.add_argument("--text-seq", type=int, default=256, help="Text seq length") + parser.add_argument("--output-dir", required=True, help="Output directory") + args = parser.parse_args() + + dtype = torch.bfloat16 + total_t0 = time.time() + + # 1. 
Load Phase 1 latent + logger.info("\n=== Loading Phase 1 latent from %s ===", args.s1_latent) + saved = torch.load(args.s1_latent, map_location="cpu", weights_only=True) + s1_video_latent = saved["s1_video_latent"] + audio_sample = saved["audio_sample"] + video_context = saved["video_context"] + audio_context = saved["audio_context"] + context_mask = saved["context_mask"] + s1_total_time = saved.get("s1_total_time", 0.0) + logger.info(" S1 video latent: %s", s1_video_latent.shape) + logger.info(" S1 denoising time: %.1fs", s1_total_time) + + # 2. Load CPU components (for VAE decode and spatial upscaler) + logger.info("\n=== Loading CPU components ===") + t0 = time.time() + cpu = load_cpu_components(args.model_path, dtype) + logger.info(" CPU components loaded in %.1fs", time.time() - t0) + + # 3. Spatial upsample + logger.info("\n=== Spatial Upsample x2 ===") + from generate_ltx23 import load_spatial_upscaler, spatial_upscale_latent + + spatial_up = load_spatial_upscaler(args.spatial_upscaler_path, dtype=dtype) + s2_video_latent = spatial_upscale_latent( + s1_video_latent, cpu["video_decoder"], spatial_up + ) + logger.info(" Upscaled: %s -> %s", s1_video_latent.shape, s2_video_latent.shape) + del spatial_up, s1_video_latent + gc.collect() + + # 4. 
Setup Stage 2 latent tools + from ltx_core.tools import ( + VideoLatentTools, + VideoLatentPatchifier, + VideoLatentShape, + AudioLatentTools, + AudioPatchifier, + AudioLatentShape, + SpatioTemporalScaleFactors, + ) + from ltx_core.model.transformer.modality import Modality + from pipeline import NeuronTransformerWrapper + + s2_latent_h = args.height // 32 + s2_latent_w = args.width // 32 + s2_latent_f = s2_video_latent.shape[2] + s2_video_shape = VideoLatentShape( + batch=1, + channels=128, + frames=s2_latent_f, + height=s2_latent_h, + width=s2_latent_w, + ) + v_patchifier = VideoLatentPatchifier(patch_size=1) + v_scale = SpatioTemporalScaleFactors.default() + s2_video_tools = VideoLatentTools( + target_shape=s2_video_shape, + patchifier=v_patchifier, + scale_factors=v_scale, + causal_fix=False, + fps=args.fps, + ) + s2_video_state = s2_video_tools.create_initial_state(device="cpu", dtype=dtype) + + audio_shape = AudioLatentShape( + batch=1, channels=8, frames=args.audio_num_frames, mel_bins=16 + ) + a_patchifier = AudioPatchifier(patch_size=16) + audio_tools = AudioLatentTools(patchifier=a_patchifier, target_shape=audio_shape) + audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype) + + # Patchify upscaled latent + s2_upscaled_tokens = v_patchifier.patchify(s2_video_latent) + logger.info(" S2 upscaled tokens: %s", s2_upscaled_tokens.shape) + + # Noise injection + s2_sigmas = torch.tensor(STAGE_2_DISTILLED_SIGMA_VALUES, dtype=torch.float32) + noise_scale = s2_sigmas[0].item() + gen_s2 = torch.Generator().manual_seed(args.seed + 42) + s2_noise = torch.randn(s2_upscaled_tokens.shape, dtype=dtype, generator=gen_s2) + video_sample = ( + noise_scale * s2_noise + (1.0 - noise_scale) * s2_upscaled_tokens + ).to(dtype) + logger.info(" Noise injected at sigma=%.4f", noise_scale) + del s2_noise, s2_upscaled_tokens, s2_video_latent + + # 5. 
Load S2 backbone via Application + logger.info("\n=== Loading S2 backbone (TP=%d) ===", args.tp_degree) + s2_app = create_app_compositor( + model_path=args.model_path, + encoder_path=args.gemma_path, + tp_degree=args.tp_degree, + text_seq=args.text_seq, + height=args.height, + width=args.width, + num_frames=args.num_frames, + ) + t0 = time.time() + s2_app.load_backbone(args.s2_compiled_dir) + logger.info(" S2 backbone loaded in %.1fs", time.time() - t0) + + wrapper = NeuronTransformerWrapper( + compiled_backbone=s2_app, + cpu_ltx_model=cpu["ltx_model"], + text_seq=args.text_seq, + mask_4d=True, + compiled_text_seq=APP_BACKBONE_TEXT_SEQ, + ) + + # Warmup + logger.info(" Warming up S2 backbone...") + warmup_sigma = torch.tensor([1.0]) + warmup_v_ts = warmup_sigma.unsqueeze(0).expand(1, s2_video_state.latent.shape[1]) + warmup_a_ts = warmup_sigma.unsqueeze(0).expand(1, audio_state.latent.shape[1]) + warmup_video_mod = Modality( + latent=torch.randn(1, s2_video_state.latent.shape[1], 128, dtype=dtype), + sigma=warmup_sigma, + timesteps=warmup_v_ts, + positions=s2_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + warmup_audio_mod = Modality( + latent=torch.randn_like(audio_sample), + sigma=warmup_sigma, + timesteps=warmup_a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + _ = wrapper(warmup_video_mod, warmup_audio_mod) + logger.info(" S2 warmup done in %.1fs", time.time() - t0) + del warmup_video_mod, warmup_audio_mod + + # 6. 
Stage 2 denoising + s2_num_steps = len(s2_sigmas) - 1 + logger.info( + "\n=== Stage 2 Denoising (%d steps at %dx%d) ===", + s2_num_steps, + args.height, + args.width, + ) + s2_total_time = 0.0 + for step_idx in range(s2_num_steps): + sigma = s2_sigmas[step_idx] + sigma_next = s2_sigmas[step_idx + 1] + video_seq_len = s2_video_state.latent.shape[1] + audio_seq_len = audio_state.latent.shape[1] + v_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, video_seq_len) + a_ts = sigma.unsqueeze(0).unsqueeze(0).expand(1, audio_seq_len) + video_mod = Modality( + latent=video_sample, + sigma=sigma.unsqueeze(0), + timesteps=v_ts, + positions=s2_video_state.positions, + context=video_context, + enabled=True, + context_mask=context_mask, + attention_mask=None, + ) + audio_mod = Modality( + latent=audio_sample, + sigma=sigma.unsqueeze(0), + timesteps=a_ts, + positions=audio_state.positions, + context=audio_context, + enabled=True, + context_mask=context_mask.clone(), + attention_mask=None, + ) + t0 = time.time() + with torch.no_grad(): + video_velocity, audio_velocity = wrapper(video_mod, audio_mod) + step_time = time.time() - t0 + s2_total_time += step_time + dt = sigma_next - sigma + video_sample = (video_sample.float() + video_velocity.float() * dt).to(dtype) + audio_sample = (audio_sample.float() + audio_velocity.float() * dt).to(dtype) + logger.info( + " S2 Step %d/%d: sigma %.4f -> %.4f (%.1fs)", + step_idx + 1, + s2_num_steps, + sigma.item(), + sigma_next.item(), + step_time, + ) + + logger.info( + " Stage 2 total: %.1fs (%.1fs/step)", + s2_total_time, + s2_total_time / s2_num_steps, + ) + logger.info( + "\n Combined: S1=%.1fs + S2=%.1fs = %.1fs", + s1_total_time, + s2_total_time, + s1_total_time + s2_total_time, + ) + + # Unload S2 backbone + s2_app.unload_backbone() + del s2_app, wrapper + gc.collect() + + # 7. 
Decode + logger.info("\n=== Decoding ===") + video_latent_spatial = v_patchifier.unpatchify(video_sample, s2_video_shape) + audio_latent_spatial = a_patchifier.unpatchify(audio_sample, audio_shape) + video_latent_4d = video_latent_spatial[0] + logger.info(" Video latent for VAE: %s", video_latent_4d.shape) + + os.makedirs(args.output_dir, exist_ok=True) + + if args.vae_compiled_dir: + # --- Neuron tiled VAE decode --- + logger.info(" Using Neuron VAE from %s", args.vae_compiled_dir) + from tiled_vae_decode_23 import ( + preprocess_latent, + get_scaled_timestep, + load_compiled_vae, + tiled_decode, + ) + + # Preprocess on CPU (noise injection + denormalization) + t0 = time.time() + preprocessed = preprocess_latent( + video_latent_spatial, cpu["video_decoder"], seed=42 + ) + scaled_ts = get_scaled_timestep(cpu["video_decoder"], batch_size=1) + # The compiled VAE NEF always expects 2 inputs (latent + scaled_timestep), + # even when timestep_conditioning=False in config (the value was constant- + # folded during tracing). Provide a default if get_scaled_timestep returns None. 
+ if scaled_ts is None: + scaled_ts = torch.tensor([0.05 * 1000.0], dtype=torch.float32) + logger.info( + " Using default scaled_timestep=50.0 (constant-folded in compiled model)" + ) + logger.info( + " Preprocessing: %.1fs (scaled_ts=%s)", + time.time() - t0, + scaled_ts, + ) + + # Unload S2 backbone models from Neuron before loading VAE + # (they were already unloaded above, but ensure NRT is clear) + + # Load compiled VAE + t0 = time.time() + compiled_vae = load_compiled_vae(args.vae_compiled_dir) + logger.info(" Compiled VAE loaded in %.1fs", time.time() - t0) + + # Tiled decode + t0 = time.time() + video_output = tiled_decode( + preprocessed, + compiled_vae, + scaled_timestep=scaled_ts, + tile_latent_h=4, + tile_latent_w=16, + overlap_latent_h=1, + overlap_latent_w=0, + verbose=True, + ) + decode_time = time.time() - t0 + logger.info(" Neuron VAE decode: %.1fs", decode_time) + + # Convert to uint8 frames: [1, 3, T_out, H, W] -> [T_out, H, W, 3] + video_output = video_output.clamp(0, 1) + video_frames = (video_output[0].permute(1, 2, 3, 0) * 255).to(torch.uint8) + logger.info(" Video frames: %s", video_frames.shape) + + del compiled_vae, preprocessed + gc.collect() + else: + # --- CPU fallback decode --- + logger.info(" Using CPU VAE decode (no --vae-compiled-dir provided)") + logger.info(" Decoding video...") + t0 = time.time() + from ltx_core.model.video_vae.video_vae import decode_video + + video_chunks = [] + with torch.no_grad(): + for chunk in decode_video(video_latent_4d, cpu["video_decoder"]): + video_chunks.append(chunk) + video_frames = torch.cat(video_chunks, dim=0) + decode_time = time.time() - t0 + logger.info(" Video decoded: %s in %.1fs", video_frames.shape, decode_time) + + from PIL import Image + + for i in range(video_frames.shape[0]): + frame = video_frames[i].numpy() + img = Image.fromarray(frame) + img.save(os.path.join(args.output_dir, f"frame_{i:04d}.png")) + logger.info(" Saved %d frames to %s", video_frames.shape[0], args.output_dir) + 
+ # MP4 + try: + import subprocess + + frame_pattern = os.path.join(args.output_dir, "frame_%04d.png") + mp4_path = os.path.join(args.output_dir, "output.mp4") + subprocess.run( + [ + "ffmpeg", + "-y", + "-framerate", + str(int(args.fps)), + "-i", + frame_pattern, + "-c:v", + "libx264", + "-pix_fmt", + "yuv420p", + mp4_path, + ], + capture_output=True, + check=True, + ) + logger.info(" Saved MP4: %s", mp4_path) + except (subprocess.CalledProcessError, FileNotFoundError) as e: + logger.warning(" ffmpeg not available: %s", e) + + # Audio decode + logger.info(" Decoding audio...") + t0 = time.time() + from ltx_core.model.audio_vae.audio_vae import decode_audio + + with torch.no_grad(): + audio_result = decode_audio( + audio_latent_spatial.float(), cpu["audio_decoder"].float(), cpu["vocoder"] + ) + logger.info(" Audio decoded in %.1fs", time.time() - t0) + + try: + import torchaudio + + wav_path = os.path.join(args.output_dir, "output.wav") + torchaudio.save( + wav_path, audio_result.waveform.cpu(), audio_result.sampling_rate + ) + logger.info(" Saved WAV: %s", wav_path) + except ImportError: + wav_path = os.path.join(args.output_dir, "audio_waveform.pt") + torch.save( + {"waveform": audio_result.waveform.cpu(), "sr": audio_result.sampling_rate}, + wav_path, + ) + + total_time = time.time() - total_t0 + vae_mode = "Neuron" if args.vae_compiled_dir else "CPU" + logger.info("\n" + "=" * 60) + logger.info("BENCHMARK RESULTS") + logger.info("=" * 60) + logger.info(" S1 denoising (from Phase 1): %.1fs", s1_total_time) + logger.info( + " S2 denoising: %.1fs (%.1fs/step)", + s2_total_time, + s2_total_time / s2_num_steps, + ) + logger.info(" VAE decode (%s): %.1fs", vae_mode, decode_time) + logger.info(" Total Phase 2 wall time: %.1fs", total_time) + logger.info(" Output: %s", args.output_dir) + logger.info("=" * 60) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/shard_backbone_weights.py b/contrib/models/LTX-2.3/src/shard_backbone_weights.py 
new file mode 100644 index 00000000..43c1bd7c --- /dev/null +++ b/contrib/models/LTX-2.3/src/shard_backbone_weights.py @@ -0,0 +1,147 @@ +#!/usr/bin/env python3 +""" +Pre-shard LTX-2.3 DiT backbone weights for Neuron TP=4. + +Reads the full 41GB safetensors file once, extracts backbone keys, +shards them per TP rank, and saves compact per-rank .pt files (~5GB each). + +This avoids loading the full 41GB file during generation, which would cause +memory pressure and swap thrashing on trn2.3xlarge (124GB RAM). + +Output structure: + backbone_sharded/ + rank_0.pt (~5 GB) + rank_1.pt + rank_2.pt + rank_3.pt + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + python3 shard_backbone_weights.py \ + --model-path /home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \ + --output-dir /home/ubuntu/backbone_sharded +""" + +import argparse +import gc +import os +import sys +import time + +import torch + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +TP_DEGREE = 4 +MODEL_PATH = "/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors" + + +def main(): + parser = argparse.ArgumentParser(description="Pre-shard LTX-2.3 backbone weights") + parser.add_argument( + "--model-path", + default=MODEL_PATH, + help="Path to LTX-2.3 safetensors file", + ) + parser.add_argument( + "--output-dir", + default="/home/ubuntu/backbone_sharded", + help="Output directory for sharded weights", + ) + parser.add_argument( + "--tp-degree", + type=int, + default=TP_DEGREE, + help="Tensor parallelism degree", + ) + args = parser.parse_args() + + print("=" * 60) + print("Pre-shard LTX-2.3 DiT backbone weights (TP=%d)" % args.tp_degree) + print("=" * 60) + + # 1. Load backbone weights from safetensors (memory-mapped for efficiency) + print("\n[1/3] Loading backbone weights from %s..." % args.model_path) + t0 = time.time() + + from safetensors import safe_open + from load_with_weights import shard_weight + + prefix = "model.diffusion_model." 
+ backbone_prefixes = ( + "transformer_blocks.", + "norm_out.", + "proj_out.", + "scale_shift_table", + "audio_norm_out.", + "audio_proj_out.", + "audio_scale_shift_table", + ) + + backbone_sd = {} + with safe_open(args.model_path, framework="pt") as f: + all_keys = list(f.keys()) + for k in all_keys: + stripped = k[len(prefix) :] if k.startswith(prefix) else k + if stripped.startswith(backbone_prefixes): + backbone_sd[stripped] = f.get_tensor(k).to(torch.bfloat16).contiguous() + + # Add SPMDRank tensor + backbone_sd["spmd_rank.rank"] = torch.arange(0, args.tp_degree, dtype=torch.int32) + + total_bytes = sum(v.numel() * v.element_size() for v in backbone_sd.values()) + print(" Backbone keys: %d (%.2f GB)" % (len(backbone_sd), total_bytes / 1e9)) + print(" Done in %.1fs" % (time.time() - t0)) + + # 2. Convert safetensors key format to JIT param format and shard + print("\n[2/3] Sharding and saving per-rank checkpoints...") + os.makedirs(args.output_dir, exist_ok=True) + + def sf_key_to_jit_key(sf_key): + return "weights." + sf_key.replace(".", "->") + + for rank in range(args.tp_degree): + t0 = time.time() + rank_sd = {} + for sf_key, full_weight in backbone_sd.items(): + jit_key = sf_key_to_jit_key(sf_key) + sharded = shard_weight(full_weight, jit_key, rank, args.tp_degree) + # CRITICAL: .contiguous().clone() so torch.save doesn't serialize + # the full unsharded storage backing sliced/narrowed tensors. + rank_sd[jit_key] = sharded.contiguous().clone() + + rank_path = os.path.join(args.output_dir, "rank_%d.pt" % rank) + torch.save(rank_sd, rank_path) + size_gb = os.path.getsize(rank_path) / 1e9 + elapsed = time.time() - t0 + print( + " rank_%d.pt: %d keys, %.2f GB, %.1fs" + % (rank, len(rank_sd), size_gb, elapsed) + ) + del rank_sd + gc.collect() + + del backbone_sd + gc.collect() + + # 3. 
Verify + print("\n[3/3] Verification...") + total_size = 0 + for rank in range(args.tp_degree): + rank_path = os.path.join(args.output_dir, "rank_%d.pt" % rank) + size = os.path.getsize(rank_path) + total_size += size + ckpt = torch.load(rank_path, weights_only=True) + print(" rank_%d.pt: %d keys, %.2f GB" % (rank, len(ckpt), size / 1e9)) + if rank == 0: + for k in sorted(ckpt.keys())[:3]: + print(" %s: %s %s" % (k, tuple(ckpt[k].shape), ckpt[k].dtype)) + del ckpt + + print("\n Total sharded size: %.2f GB" % (total_size / 1e9)) + print(" Output dir: %s" % args.output_dir) + print("\nDone!") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/shard_gemma3_weights.py b/contrib/models/LTX-2.3/src/shard_gemma3_weights.py new file mode 100644 index 00000000..04e88a24 --- /dev/null +++ b/contrib/models/LTX-2.3/src/shard_gemma3_weights.py @@ -0,0 +1,166 @@ +#!/usr/bin/env python3 +""" +Pre-shard Gemma 3-12B encoder weights for Neuron TP=4. + +Produces per-rank checkpoint files that can be loaded directly with +replace_weights() -- no cloning or sharding at load time. 
+ +Output structure: + gemma3_encoder_sharded/ + rank_0.pt (~5.9 GB) + rank_1.pt + rank_2.pt + rank_3.pt + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate + python3 shard_gemma3_weights.py \ + --gemma-path /home/ubuntu/models/gemma-3-12b \ + --output-dir /home/ubuntu/gemma3_encoder_sharded +""" + +import argparse +import gc +import glob +import os +import sys +import time + +import torch + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +TP_DEGREE = 4 + + +def get_model_fn(): + """Create Gemma3 encoder model with TP layers (for sharding metadata).""" + from modeling_gemma3_encoder import Gemma3TextEncoderModel + + model = Gemma3TextEncoderModel( + vocab_size=262208, + hidden_size=3840, + num_hidden_layers=48, + num_attention_heads=16, + num_key_value_heads=8, + head_dim=256, + intermediate_size=15360, + rms_norm_eps=1e-6, + rope_theta=1_000_000.0, + max_position_embeddings=131072, + query_pre_attn_scalar=256, + pad_token_id=0, + dtype=torch.bfloat16, + ) + model = model.to(dtype=torch.bfloat16) + model.eval() + return model, None + + +def main(): + parser = argparse.ArgumentParser(description="Pre-shard Gemma3 encoder weights") + parser.add_argument( + "--gemma-path", + default="/home/ubuntu/models/gemma-3-12b", + help="Path to HuggingFace Gemma 3 model directory", + ) + parser.add_argument( + "--output-dir", + default="/home/ubuntu/gemma3_encoder_sharded", + help="Output directory for sharded weights", + ) + args = parser.parse_args() + + print("=" * 60) + print("Pre-shard Gemma 3-12B encoder weights (TP=%d)" % TP_DEGREE) + print("=" * 60) + + # 1. Load HF weights from safetensors + print("\n[1/3] Loading HF weights from %s..." 
% args.gemma_path) + t0 = time.time() + + from modeling_gemma3_encoder import convert_hf_gemma3_to_encoder_state_dict + from safetensors.torch import load_file + + # Find safetensors files in the model directory + st_files = sorted(glob.glob(os.path.join(args.gemma_path, "model-*.safetensors"))) + if not st_files: + st_files = sorted(glob.glob(os.path.join(args.gemma_path, "*.safetensors"))) + if not st_files: + print("ERROR: No safetensors files found in %s" % args.gemma_path) + sys.exit(1) + + print(" Loading %d safetensors files..." % len(st_files)) + hf_state_dict = {} + for f in st_files: + shard = load_file(f) + hf_state_dict.update(shard) + print(" Total HF keys: %d" % len(hf_state_dict)) + + encoder_state_dict = convert_hf_gemma3_to_encoder_state_dict(hf_state_dict) + del hf_state_dict + gc.collect() + + total_bytes = sum(v.numel() * v.element_size() for v in encoder_state_dict.values()) + print(" Encoder keys: %d (%.2f GB)" % (len(encoder_state_dict), total_bytes / 1e9)) + print(" Done in %.1fs" % (time.time() - t0)) + + # 2. Shard per rank and save + print("\n[2/3] Sharding and saving per-rank checkpoints...") + os.makedirs(args.output_dir, exist_ok=True) + + from neuronx_distributed.trace.trace import ( + _mock_parallel_state, + init_on_device, + get_sharded_checkpoint, + ) + + for rank in range(TP_DEGREE): + t0 = time.time() + ckpt = {k: v.clone() for k, v in encoder_state_dict.items()} + + _mock_parallel_state(TP_DEGREE, rank) + with init_on_device(torch.device("meta")): + model, _ = get_model_fn() + get_sharded_checkpoint(ckpt, model, rank, TP_DEGREE) + + # CRITICAL: Force contiguous clones so torch.save doesn't serialize + # the full unsharded storage backing sliced/narrowed tensors. + # Without this, each rank file would be ~24 GB instead of ~6 GB. 
+ ckpt = {k: v.contiguous().clone() for k, v in ckpt.items()} + + rank_path = os.path.join(args.output_dir, "rank_%d.pt" % rank) + torch.save(ckpt, rank_path) + size_gb = os.path.getsize(rank_path) / 1e9 + num_keys = len(ckpt) + elapsed = time.time() - t0 + print( + " rank_%d.pt: %d keys, %.2f GB, %.1fs" % (rank, num_keys, size_gb, elapsed) + ) + del ckpt, model + gc.collect() + + del encoder_state_dict + gc.collect() + + # 3. Verify + print("\n[3/3] Verification...") + total_size = 0 + for rank in range(TP_DEGREE): + rank_path = os.path.join(args.output_dir, "rank_%d.pt" % rank) + size = os.path.getsize(rank_path) + total_size += size + ckpt = torch.load(rank_path, weights_only=True) + print(" rank_%d.pt: %d keys, %.2f GB" % (rank, len(ckpt), size / 1e9)) + if rank == 0: + for k in sorted(ckpt.keys())[:3]: + print(" %s: %s %s" % (k, tuple(ckpt[k].shape), ckpt[k].dtype)) + del ckpt + + print("\n Total sharded size: %.2f GB" % (total_size / 1e9)) + print(" Output dir: %s" % args.output_dir) + print("\nDone!") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/LTX-2.3/src/tiled_vae_decode_23.py b/contrib/models/LTX-2.3/src/tiled_vae_decode_23.py new file mode 100644 index 00000000..2f7ddf74 --- /dev/null +++ b/contrib/models/LTX-2.3/src/tiled_vae_decode_23.py @@ -0,0 +1,259 @@ +""" +NxDI LTX-2.3 Tiled VAE Decode — Spatial Tiling with Overlap + Blending +======================================================================= +Tiles the latent space into overlapping patches, decodes each on Neuron, +then blends in output pixel space. + +Key difference from LTX-2: LTX-2.3 has timestep_conditioning, so we must +also pass the precomputed scaled_timestep to the compiled model. + +Pre-processing (done on CPU before calling compiled model): + 1. Noise injection: sample = noise * 0.025 + sample * 0.975 + 2. Denormalization: per_channel_statistics.un_normalize(sample) + Both are applied to the FULL latent before tiling (unlike per-tile). 
+ +Optimal tile: 4x16 latent (128x512 pixels) with overlap_h=1, overlap_w=0. + +Usage (as library): + from tiled_vae_decode_23 import tiled_decode, load_compiled_vae, preprocess_latent + latent = preprocess_latent(raw_latent, video_decoder, seed=42) + scaled_timestep = torch.tensor([0.05 * video_decoder.timestep_scale_multiplier.item()]) + output = tiled_decode(latent, compiled_model, scaled_timestep=scaled_timestep, + tile_latent_h=4, tile_latent_w=16, overlap_latent_h=1, overlap_latent_w=0) +""" + +import os +import time + +import torch + +os.environ.setdefault("NEURON_FUSE_SOFTMAX", "1") +os.environ.setdefault("NEURON_CUSTOM_SILU", "1") + +COMPILER_FLAGS = ( + "--model-type=unet-inference -O1 --auto-cast none " + "--enable-fast-loading-neuron-binaries" +) + + +def preprocess_latent(raw_latent, video_decoder, seed=None): + """Apply noise injection + denormalization on CPU. + + This must be done BEFORE tiling, on the full latent. + + Args: + raw_latent: [1, 128, T, H, W] raw latent from denoising + video_decoder: CPU VideoDecoder (has per_channel_statistics, decode_noise_scale) + seed: Random seed for noise injection (None = random) + + Returns: + [1, 128, T, H, W] preprocessed latent ready for tiled Neuron decode + """ + sample = raw_latent.float() + + if video_decoder.timestep_conditioning: + gen = torch.Generator() + if seed is not None: + gen.manual_seed(seed) + noise = ( + torch.randn( + sample.size(), generator=gen, dtype=sample.dtype, device=sample.device + ) + * video_decoder.decode_noise_scale + ) + sample = noise + (1.0 - video_decoder.decode_noise_scale) * sample + + # Denormalize + sample = video_decoder.per_channel_statistics.un_normalize(sample) + + return sample + + +def get_scaled_timestep(video_decoder, batch_size=1): + """Compute scaled_timestep on CPU. 
+ + Returns: + [batch_size] tensor of scaled timestep values, or None if no conditioning + """ + if not video_decoder.timestep_conditioning: + return None + + timestep = torch.full( + (batch_size,), + video_decoder.decode_timestep, + dtype=torch.float32, + ) + scaled = timestep * video_decoder.timestep_scale_multiplier.item() + return scaled + + +def create_blend_mask_1d(length, blend_size, device="cpu"): + """Create a 1D linear blending mask.""" + mask = torch.ones(length, device=device) + if blend_size > 0: + ramp = torch.linspace(0, 1, blend_size + 2, device=device)[1:-1] + mask[:blend_size] = ramp + mask[-blend_size:] = ramp.flip(0) + return mask + + +def tiled_decode( + latent, + compiled_model, + scaled_timestep=None, + tile_latent_h=4, + tile_latent_w=16, + overlap_latent_h=1, + overlap_latent_w=0, + spatial_scale=32, + verbose=True, +): + """Decode preprocessed latent tensor using spatial tiling with overlap blending. + + IMPORTANT: latent must be preprocessed (noise injected + denormalized) + using preprocess_latent() before calling this function. 
+
+    Args:
+        latent: [1, 128, T, H_lat, W_lat] preprocessed latent tensor
+        compiled_model: Neuron-compiled VAE decoder
+        scaled_timestep: [B] scaled timestep (or None if no conditioning)
+        tile_latent_h: tile height in latent space (default 4)
+        tile_latent_w: tile width in latent space (default 16)
+        overlap_latent_h: tile overlap along H, in latent rows (default 1)
+        overlap_latent_w: tile overlap along W, in latent columns (default 0)
+        spatial_scale: latent-to-pixel spatial scale (32 for LTX-2.3)
+        verbose: print progress information
+
+    Returns:
+        [1, 3, T_out, H_out, W_out] decoded video tensor
+    """
+    B, C, T, H_lat, W_lat = latent.shape
+    assert B == 1, "Batch size must be 1"
+
+    H_out = H_lat * spatial_scale
+    W_out = W_lat * spatial_scale
+
+    stride_h = tile_latent_h - overlap_latent_h
+    stride_w = tile_latent_w - overlap_latent_w
+    assert stride_h > 0, f"stride_h={stride_h} must be > 0"
+    assert stride_w > 0, f"stride_w={stride_w} must be > 0"
+
+    overlap_h_pixels = overlap_latent_h * spatial_scale
+    overlap_w_pixels = overlap_latent_w * spatial_scale
+    tile_h_pixels = tile_latent_h * spatial_scale
+    tile_w_pixels = tile_latent_w * spatial_scale
+
+    # Determine tile start positions
+    n_tiles_h = max(1, (H_lat - tile_latent_h + stride_h - 1) // stride_h + 1)
+    n_tiles_w = max(1, (W_lat - tile_latent_w + stride_w - 1) // stride_w + 1)
+
+    tile_starts_h = []
+    for i in range(n_tiles_h):
+        start = min(i * stride_h, H_lat - tile_latent_h)
+        tile_starts_h.append(start)
+    tile_starts_h = sorted(set(tile_starts_h))
+
+    tile_starts_w = []
+    for i in range(n_tiles_w):
+        start = min(i * stride_w, W_lat - tile_latent_w)
+        tile_starts_w.append(start)
+    tile_starts_w = sorted(set(tile_starts_w))
+
+    n_tiles_h = len(tile_starts_h)
+    n_tiles_w = len(tile_starts_w)
+    total_tiles = n_tiles_h * n_tiles_w
+
+    if verbose:
+        print(f"  Tiling: {n_tiles_h}x{n_tiles_w} = {total_tiles} tiles")
+        print(
+            f"    Tile latent: {tile_latent_h}x{tile_latent_w}, "
+            f"overlap: h={overlap_latent_h}, 
w={overlap_latent_w}" + ) + + # Output temporal dimension: T_out = (T-1)*8 + 1 + T_out = (T - 1) * 8 + 1 + + output_accum = torch.zeros(1, 3, T_out, H_out, W_out, dtype=torch.float32) + weight_accum = torch.zeros(1, 1, 1, H_out, W_out, dtype=torch.float32) + + decode_times = [] + + for ti, h_start_lat in enumerate(tile_starts_h): + for tj, w_start_lat in enumerate(tile_starts_w): + tile_idx = ti * n_tiles_w + tj + 1 + + h_end_lat = h_start_lat + tile_latent_h + w_end_lat = w_start_lat + tile_latent_w + tile_latent = latent[:, :, :, h_start_lat:h_end_lat, w_start_lat:w_end_lat] + + t0 = time.time() + with torch.no_grad(): + if scaled_timestep is not None: + tile_output = compiled_model(tile_latent, scaled_timestep) + else: + tile_output = compiled_model(tile_latent) + dt = time.time() - t0 + decode_times.append(dt) + + h_start_px = h_start_lat * spatial_scale + w_start_px = w_start_lat * spatial_scale + h_end_px = h_start_px + tile_h_pixels + w_end_px = w_start_px + tile_w_pixels + + # Create spatial blend mask + blend_h = create_blend_mask_1d( + tile_h_pixels, overlap_h_pixels if h_start_lat > 0 else 0 + ) + blend_w = create_blend_mask_1d( + tile_w_pixels, overlap_w_pixels if w_start_lat > 0 else 0 + ) + + if h_end_lat < H_lat and overlap_h_pixels > 0: + blend_h[-overlap_h_pixels:] = torch.linspace( + 1, 0, overlap_h_pixels + 2 + )[1:-1] + if w_end_lat < W_lat and overlap_w_pixels > 0: + blend_w[-overlap_w_pixels:] = torch.linspace( + 1, 0, overlap_w_pixels + 2 + )[1:-1] + + blend_mask = blend_h.unsqueeze(1) * blend_w.unsqueeze(0) + blend_mask = blend_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0) + + output_accum[:, :, :, h_start_px:h_end_px, w_start_px:w_end_px] += ( + tile_output.float() * blend_mask + ) + weight_accum[:, :, :, h_start_px:h_end_px, w_start_px:w_end_px] += ( + blend_mask + ) + + if verbose: + print( + f" Tile {tile_idx}/{total_tiles}: " + f"lat[{h_start_lat}:{h_end_lat}, {w_start_lat}:{w_end_lat}] " + f"-> px[{h_start_px}:{h_end_px}, 
{w_start_px}:{w_end_px}], "
+                    f"{dt * 1000:.0f}ms"
+                )
+
+    # Normalize by the accumulated blend weights; the clamp guards against
+    # division by zero in pixels no tile contributed to (should not happen
+    # given the clamped tile starts, but is cheap insurance).
+    output = output_accum / weight_accum.clamp(min=1e-6)
+
+    total_decode = sum(decode_times)
+    if verbose:
+        print(f"\n Total decode time: {total_decode:.2f}s ({total_tiles} tiles)")
+        print(f" Avg per tile: {total_decode / total_tiles * 1000:.0f}ms")
+
+    return output
+
+
+def load_compiled_vae(compiled_dir):
+    """Load a compiled TP VAE decoder from disk.
+
+    Args:
+        compiled_dir: Directory containing the compiled model files
+
+    Returns:
+        compiled: The loaded TensorParallelNeuronModel
+    """
+    # Imported locally so importing this module does not require
+    # neuronx_distributed (e.g. on CPU-only development hosts).
+    import neuronx_distributed
+
+    return neuronx_distributed.trace.parallel_model_load(compiled_dir)
diff --git a/contrib/models/LTX-2.3/test/__init__.py b/contrib/models/LTX-2.3/test/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/LTX-2.3/test/integration/__init__.py b/contrib/models/LTX-2.3/test/integration/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/LTX-2.3/test/integration/test_model.py b/contrib/models/LTX-2.3/test/integration/test_model.py
new file mode 100644
index 00000000..83cb95ca
--- /dev/null
+++ b/contrib/models/LTX-2.3/test/integration/test_model.py
@@ -0,0 +1,462 @@
+"""
+Integration test for LTX-2.3 22B DiT on Neuron.
+
+Tests the compiled TP=4 backbone against a CPU reference forward pass.
+
+Requires:
+    - trn2.3xlarge instance with LNC=2 (4 logical cores)
+    - Neuron SDK 2.28 (Deep Learning AMI Neuron Ubuntu 24.04 20260227)
+    - ltx-core package installed
+    - LTX-2.3 distilled model safetensors
+
+Environment variables:
+    MODEL_PATH: Path to ltx-2.3-22b-distilled.safetensors
+    COMPILED_MODEL_PATH: Path to compiled model directory (will compile if missing)
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
+    MODEL_PATH=/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors \
+    COMPILED_MODEL_PATH=/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2 \
+    pytest test_model.py -v -s
+"""
+
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+import pytest
+import torch
+
+# Add src to path so sibling modules (pipeline, load_with_weights, ...) import.
+SRC_DIR = str(Path(__file__).parent.parent.parent / "src")
+sys.path.insert(0, SRC_DIR)
+
+# Required environment variables
+MODEL_PATH = os.environ.get(
+    "MODEL_PATH",
+    "/home/ubuntu/models/LTX-2.3/ltx-2.3-22b-distilled.safetensors",
+)
+COMPILED_MODEL_PATH = os.environ.get(
+    "COMPILED_MODEL_PATH",
+    "/home/ubuntu/ltx23_neuron/compiler_workdir_tp4_lnc2_v2",
+)
+
+# Model geometry constants; these must match the shapes the backbone was
+# compiled with (Neuron NEFFs are shape-specialized).
+TP_DEGREE = 4
+BATCH = 1
+VIDEO_SEQ = 768  # 4 frames * 12 * 16 patches (384x512, patch_size=1)
+AUDIO_SEQ = 26
+TEXT_SEQ = 256
+NUM_HEADS = 32
+AUDIO_NUM_HEADS = 32
+INNER_DIM = 4096
+AUDIO_INNER_DIM = 2048
+HEAD_DIM = INNER_DIM // NUM_HEADS  # 128
+AUDIO_HEAD_DIM = AUDIO_INNER_DIM // AUDIO_NUM_HEADS  # 64
+
+
+def load_config():
+    """Read the model config JSON embedded in the safetensors file metadata."""
+    from safetensors import safe_open
+
+    with safe_open(MODEL_PATH, framework="pt") as f:
+        metadata = f.metadata()
+        return json.loads(metadata["config"])
+
+
+def create_test_inputs(dtype=torch.bfloat16):
+    """Create deterministic test inputs matching the 24-input backbone signature."""
+    # Dedicated generator with a fixed seed keeps inputs reproducible without
+    # touching the global RNG state.
+    gen = torch.Generator().manual_seed(42)
+
+    inputs = [
+        torch.randn(
+            BATCH, VIDEO_SEQ, INNER_DIM, dtype=dtype, generator=gen
+        ),  # hidden_states
+        torch.randn(
+            BATCH, AUDIO_SEQ, AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # 
audio_hidden_states
+        torch.randn(
+            BATCH, TEXT_SEQ, INNER_DIM, dtype=dtype, generator=gen
+        ),  # encoder_hidden_states
+        torch.randn(
+            BATCH, TEXT_SEQ, AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # audio_encoder_hidden_states
+        torch.randn(
+            BATCH, VIDEO_SEQ, 9 * INNER_DIM, dtype=dtype, generator=gen
+        ),  # temb
+        torch.randn(
+            BATCH, AUDIO_SEQ, 9 * AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # temb_audio
+        torch.randn(
+            BATCH, VIDEO_SEQ, INNER_DIM, dtype=dtype, generator=gen
+        ),  # embedded_timestep
+        torch.randn(
+            BATCH, AUDIO_SEQ, AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # audio_embedded_timestep
+        torch.randn(BATCH, 1, 4 * INNER_DIM, dtype=dtype, generator=gen),  # video_ca_ss
+        torch.randn(
+            BATCH, 1, 4 * AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # audio_ca_ss
+        torch.randn(BATCH, 1, INNER_DIM, dtype=dtype, generator=gen),  # video_ca_gate
+        torch.randn(
+            BATCH, 1, AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # audio_ca_gate
+        # Split-RoPE tables: half the head dim because cos/sin are supplied
+        # as separate tensors.
+        torch.randn(
+            BATCH, NUM_HEADS, VIDEO_SEQ, HEAD_DIM // 2, dtype=dtype, generator=gen
+        ),  # video_rot_cos
+        torch.randn(
+            BATCH, NUM_HEADS, VIDEO_SEQ, HEAD_DIM // 2, dtype=dtype, generator=gen
+        ),  # video_rot_sin
+        torch.randn(
+            BATCH,
+            AUDIO_NUM_HEADS,
+            AUDIO_SEQ,
+            AUDIO_HEAD_DIM // 2,
+            dtype=dtype,
+            generator=gen,
+        ),  # audio_rot_cos
+        torch.randn(
+            BATCH,
+            AUDIO_NUM_HEADS,
+            AUDIO_SEQ,
+            AUDIO_HEAD_DIM // 2,
+            dtype=dtype,
+            generator=gen,
+        ),  # audio_rot_sin
+        torch.randn(
+            BATCH, NUM_HEADS, VIDEO_SEQ, AUDIO_HEAD_DIM // 2, dtype=dtype, generator=gen
+        ),  # ca_video_rot_cos
+        torch.randn(
+            BATCH, NUM_HEADS, VIDEO_SEQ, AUDIO_HEAD_DIM // 2, dtype=dtype, generator=gen
+        ),  # ca_video_rot_sin
+        torch.randn(
+            BATCH,
+            AUDIO_NUM_HEADS,
+            AUDIO_SEQ,
+            AUDIO_HEAD_DIM // 2,
+            dtype=dtype,
+            generator=gen,
+        ),  # ca_audio_rot_cos
+        torch.randn(
+            BATCH,
+            AUDIO_NUM_HEADS,
+            AUDIO_SEQ,
+            AUDIO_HEAD_DIM // 2,
+            dtype=dtype,
+            generator=gen,
+        ),  # ca_audio_rot_sin
+        torch.zeros(
+            BATCH, 
TEXT_SEQ, dtype=dtype
+        ),  # encoder_attention_mask (all attend)
+        torch.zeros(BATCH, TEXT_SEQ, dtype=dtype),  # audio_encoder_attention_mask
+        torch.randn(
+            BATCH, 1, 2 * INNER_DIM, dtype=dtype, generator=gen
+        ),  # prompt_timestep
+        torch.randn(
+            BATCH, 1, 2 * AUDIO_INNER_DIM, dtype=dtype, generator=gen
+        ),  # audio_prompt_timestep
+    ]
+    return inputs
+
+
+@pytest.fixture(scope="module")
+def compiled_model():
+    """Load compiled Neuron backbone with real weights.
+
+    If COMPILED_MODEL_PATH exists, loads from there.
+    Compilation must be done separately via compile_transformer.py.
+    """
+    tp_0_path = os.path.join(COMPILED_MODEL_PATH, "tp_0.pt")
+    if not os.path.exists(tp_0_path):
+        pytest.skip(
+            f"Compiled model not found at {tp_0_path}. "
+            "Run compile_transformer.py first:\n"
+            " NEURON_FUSE_SOFTMAX=1 NEURON_CUSTOM_SILU=1 NEURON_RT_STOCHASTIC_ROUNDING_EN=0 "
+            "torchrun --nproc_per_node=4 src/compile_transformer.py"
+        )
+
+    import torch_neuronx
+    from neuronx_distributed.trace.trace import TensorParallelNeuronModel
+    from load_with_weights import shard_weight
+    from safetensors.torch import load_file
+
+    # Load and shard weights: keep only DiT-backbone tensors, cast to bf16.
+    full_sd = load_file(MODEL_PATH)
+    prefix = "model.diffusion_model."
+    backbone_prefixes = (
+        "transformer_blocks.",
+        "norm_out.",
+        "proj_out.",
+        "scale_shift_table",
+        "audio_norm_out.",
+        "audio_proj_out.",
+        "audio_scale_shift_table",
+    )
+    backbone_sd = {}
+    for k, v in full_sd.items():
+        stripped = k[len(prefix) :] if k.startswith(prefix) else k
+        if stripped.startswith(backbone_prefixes):
+            backbone_sd[stripped] = v.to(torch.bfloat16).contiguous()
+    backbone_sd["spmd_rank.rank"] = torch.arange(0, TP_DEGREE, dtype=torch.int32)
+    del full_sd
+
+    def sf_key_to_jit_key(sf_key):
+        # Safetensors keys use "."; the traced jit module flattens parameter
+        # paths with "->" under a "weights." namespace.
+        return "weights." 
+ sf_key.replace(".", "->")
+
+    # Create per-rank state dicts
+    rank_sds = [{} for _ in range(TP_DEGREE)]
+    for sf_key, full_weight in backbone_sd.items():
+        jit_key = sf_key_to_jit_key(sf_key)
+        for rank in range(TP_DEGREE):
+            rank_sds[rank][jit_key] = shard_weight(
+                full_weight, jit_key, rank, TP_DEGREE
+            )
+    del backbone_sd
+
+    # Load compiled models and inject weights. The same traced graph
+    # (tp_0.pt) is loaded for every rank and only the sharded weights differ
+    # per rank (SPMD) — NOTE(review): confirm all ranks intentionally share
+    # one NEFF rather than per-rank tp_{rank}.pt files.
+    models = []
+    for rank in range(TP_DEGREE):
+        with torch_neuronx.contexts.disable_nrt_load():
+            model = torch.jit.load(tp_0_path)
+        model_sd = dict(model.named_parameters())
+        for jit_key, sharded_weight in rank_sds[rank].items():
+            if jit_key in model_sd and model_sd[jit_key].shape == sharded_weight.shape:
+                model_sd[jit_key].data.copy_(sharded_weight)
+        models.append(model)
+    del rank_sds
+
+    return TensorParallelNeuronModel(models)
+
+
+@pytest.fixture(scope="module")
+def cpu_model():
+    """Build CPU reference model for accuracy comparison."""
+    from modeling_ltx23 import replace_sdpa_with_bmm
+
+    replace_sdpa_with_bmm()
+
+    from ltx_core.loader.single_gpu_model_builder import SingleGPUModelBuilder
+    from ltx_core.loader.sd_ops import SDOps
+    from ltx_core.model.transformer.model_configurator import LTXModelConfigurator
+
+    # NOTE(review): config is loaded but not passed anywhere below — confirm
+    # whether it is needed or can be removed.
+    config = load_config()
+    ltx_ops = (
+        SDOps("ltx")
+        .with_matching(prefix="model.diffusion_model.")
+        .with_replacement("model.diffusion_model.", "")
+    )
+    builder = SingleGPUModelBuilder(
+        model_class_configurator=LTXModelConfigurator,
+        model_path=MODEL_PATH,
+        model_sd_ops=ltx_ops,
+    )
+    model = builder.build(device=torch.device("cpu"), dtype=torch.bfloat16)
+    model.eval()
+    return model
+
+
+def test_model_loads(compiled_model):
+    """Smoke test: compiled model loads successfully."""
+    assert compiled_model is not None
+
+
+def test_forward_pass_no_nan(compiled_model):
+    """Forward pass produces valid (non-NaN, non-Inf) output."""
+    inputs = create_test_inputs()
+
+    with torch.no_grad():
+        outputs = compiled_model(*inputs)
+
+    video_out = outputs[0]
+    audio_out = 
outputs[1]
+
+    assert not torch.isnan(video_out).any(), "Video output contains NaN"
+    assert not torch.isinf(video_out).any(), "Video output contains Inf"
+    assert not torch.isnan(audio_out).any(), "Audio output contains NaN"
+    assert not torch.isinf(audio_out).any(), "Audio output contains Inf"
+
+    # Check output shapes
+    assert video_out.shape == (BATCH, VIDEO_SEQ, 128), f"Video shape: {video_out.shape}"
+    assert audio_out.shape == (BATCH, AUDIO_SEQ, 128), f"Audio shape: {audio_out.shape}"
+
+
+def test_accuracy_vs_cpu(compiled_model, cpu_model):
+    """Compare Neuron forward pass to CPU reference.
+
+    Acceptance thresholds: cosine similarity >= 0.98 (video) and >= 0.90
+    (audio) for a single forward pass — matching the asserts at the end of
+    this test. This validates that TP=4 sharding, DistributedRMSNorm, and
+    fused SDPA produce outputs matching the unsharded CPU model.
+    """
+    from ltx_core.model.transformer.modality import Modality
+    from ltx_core.guidance.perturbations import (
+        BatchedPerturbationConfig,
+        PerturbationConfig,
+    )
+
+    dtype = torch.bfloat16
+    torch.manual_seed(42)
+
+    # Create matching inputs for both CPU and Neuron paths
+    sigma = torch.tensor([1.0], dtype=dtype)
+
+    # Build latent tools for proper state creation
+    from ltx_core.tools import (
+        VideoLatentTools,
+        VideoLatentPatchifier,
+        VideoLatentShape,
+        AudioLatentTools,
+        AudioPatchifier,
+        AudioLatentShape,
+        SpatioTemporalScaleFactors,
+    )
+
+    video_shape = VideoLatentShape(batch=1, channels=128, frames=4, height=12, width=16)
+    v_patchifier = VideoLatentPatchifier(patch_size=1)
+    v_scale = SpatioTemporalScaleFactors(time=1, width=8, height=8)
+    video_tools = VideoLatentTools(
+        target_shape=video_shape,
+        patchifier=v_patchifier,
+        scale_factors=v_scale,
+        causal_fix=False,
+        fps=24.0,
+    )
+    audio_shape = AudioLatentShape(batch=1, channels=8, frames=26, mel_bins=16)
+    a_patchifier = AudioPatchifier(patch_size=16)
+    audio_tools = AudioLatentTools(patchifier=a_patchifier, target_shape=audio_shape)
+
+    video_state = 
video_tools.create_initial_state(device="cpu", dtype=dtype)
+    audio_state = audio_tools.create_initial_state(device="cpu", dtype=dtype)
+
+    # Use noise latents in patchified shape (B, seq, C)
+    video_latent = torch.randn_like(video_state.latent)
+    audio_latent = torch.randn_like(audio_state.latent)
+
+    video_seq_len = video_latent.shape[1]
+    audio_seq_len = audio_latent.shape[1]
+    v_ts = sigma.unsqueeze(0).expand(1, video_seq_len)
+    a_ts = sigma.unsqueeze(0).expand(1, audio_seq_len)
+
+    # Random text context; zero out the mask past token 50 to exercise the
+    # masked-attention path.
+    video_context = torch.randn(1, TEXT_SEQ, INNER_DIM, dtype=dtype)
+    audio_context = torch.randn(1, TEXT_SEQ, AUDIO_INNER_DIM, dtype=dtype)
+    context_mask = torch.ones(1, TEXT_SEQ, dtype=dtype)
+    context_mask[:, 50:] = 0
+
+    video_mod = Modality(
+        latent=video_latent,
+        sigma=sigma,
+        timesteps=v_ts,
+        positions=video_state.positions,
+        context=video_context,
+        enabled=True,
+        context_mask=context_mask,
+        attention_mask=None,
+    )
+    audio_mod = Modality(
+        latent=audio_latent,
+        sigma=sigma,
+        timesteps=a_ts,
+        positions=audio_state.positions,
+        context=audio_context,
+        enabled=True,
+        context_mask=context_mask.clone(),
+        attention_mask=None,
+    )
+
+    # CPU forward through full model
+    perturbation = BatchedPerturbationConfig(perturbations=[PerturbationConfig.empty()])
+    with torch.no_grad():
+        cpu_out = cpu_model(video_mod, audio_mod, perturbations=perturbation)
+
+    # LTXModel.forward returns (video_velocity, audio_velocity) tuple
+    cpu_video = cpu_out[0].float()
+    cpu_audio = cpu_out[1].float()
+
+    # Neuron forward via wrapper
+    from pipeline import NeuronTransformerWrapper
+
+    wrapper = NeuronTransformerWrapper(
+        compiled_backbone=compiled_model,
+        cpu_ltx_model=cpu_model,
+        text_seq=TEXT_SEQ,
+    )
+
+    # Re-create modalities (cpu_model forward may have modified them).
+    # NOTE(review): re-seeding reproduces the same latents only if nothing
+    # between manual_seed(42) and the first randn_like above consumed the
+    # global RNG (e.g. create_initial_state) — confirm.
+    torch.manual_seed(42)
+    video_latent2 = torch.randn_like(video_state.latent)
+    audio_latent2 = torch.randn_like(audio_state.latent)
+
+    video_mod2 = Modality(
+        latent=video_latent2,
+        sigma=sigma,
+        timesteps=v_ts,
+        
positions=video_state.positions,
+        context=video_context,
+        enabled=True,
+        context_mask=torch.ones(1, TEXT_SEQ, dtype=dtype),
+        attention_mask=None,
+    )
+    video_mod2.context_mask[:, 50:] = 0
+
+    audio_mod2 = Modality(
+        latent=audio_latent2,
+        sigma=sigma,
+        timesteps=a_ts,
+        positions=audio_state.positions,
+        context=audio_context,
+        enabled=True,
+        context_mask=torch.ones(1, TEXT_SEQ, dtype=dtype),
+        attention_mask=None,
+    )
+    audio_mod2.context_mask[:, 50:] = 0
+
+    with torch.no_grad():
+        neuron_video, neuron_audio = wrapper(video_mod2, audio_mod2)
+
+    neuron_video = neuron_video.float()
+    neuron_audio = neuron_audio.float()
+
+    # Cosine similarity over the flattened outputs (single global score per
+    # modality).
+    video_cos = torch.nn.functional.cosine_similarity(
+        cpu_video.flatten(), neuron_video.flatten(), dim=0
+    ).item()
+    audio_cos = torch.nn.functional.cosine_similarity(
+        cpu_audio.flatten(), neuron_audio.flatten(), dim=0
+    ).item()
+
+    print(f"\n=== Accuracy Results ===")
+    print(f"Video cosine similarity: {video_cos:.6f}")
+    print(f"Audio cosine similarity: {audio_cos:.6f}")
+    print(f"Video max abs error: {(cpu_video - neuron_video).abs().max().item():.4f}")
+    print(f"Audio max abs error: {(cpu_audio - neuron_audio).abs().max().item():.4f}")
+
+    assert video_cos >= 0.98, f"Video cos_sim {video_cos:.6f} < 0.98"
+    assert audio_cos >= 0.90, f"Audio cos_sim {audio_cos:.6f} < 0.90"
+
+
+def test_performance_latency(compiled_model):
+    """Measure per-step forward pass latency (warm)."""
+    inputs = create_test_inputs()
+
+    # Warmup: first passes pay one-time NEFF/device initialization cost.
+    with torch.no_grad():
+        for _ in range(3):
+            compiled_model(*inputs)
+
+    # Timed runs
+    latencies = []
+    with torch.no_grad():
+        for _ in range(10):
+            t0 = time.time()
+            compiled_model(*inputs)
+            latencies.append(time.time() - t0)
+
+    avg_ms = sum(latencies) / len(latencies) * 1000
+    min_ms = min(latencies) * 1000
+    max_ms = max(latencies) * 1000
+
+    print(f"\n=== Performance Results ===")
+    print(f"Average latency: {avg_ms:.1f} ms")
+    print(f"Min latency: {min_ms:.1f} ms")
+    print(f"Max 
latency: {max_ms:.1f} ms")
+
+    # No hard performance threshold -- just report
+    # (the 60s ceiling only guards against pathological regressions).
+    assert avg_ms < 60000, f"Forward pass too slow: {avg_ms:.1f} ms"
diff --git a/contrib/models/LTX-2.3/test/unit/__init__.py b/contrib/models/LTX-2.3/test/unit/__init__.py
new file mode 100644
index 00000000..e69de29b