12 changes: 12 additions & 0 deletions .github/codecov.yml
@@ -11,3 +11,15 @@ coverage:
target: auto
threshold: 1% # Allow at most 1% coverage drop from main branch.
patch: false

# Exclude GPU-only Triton kernel files from ALL codecov calculations (project
# and patch checks, all flags). Rationale: these files are dominated by
# @triton.jit kernel bodies that CPU unit tests cannot exercise. GPU tests
# cover them end-to-end (see tests/gpu/torch/sparsity/attention_sparsity/) but
# the `gpu`-flag upload may race with the PR status check, so relying on flag
# combination alone leaves the project check flaky. Dropping these files here
# makes the check deterministic — local `pytest --cov` and GPU runs still
# measure them; only the codecov PR status ignores them.
ignore:
- "modelopt/torch/kernels/triton_fa.py"
- "modelopt/torch/kernels/hf_triton_attention.py"
2 changes: 1 addition & 1 deletion .github/workflows/example_tests.yml
@@ -63,7 +63,7 @@ jobs:
strategy: &torch_strategy
fail-fast: false
matrix:
-        example: [llm_distill, llm_qat, llm_sparsity]
+        example: [llm_distill, llm_qat, llm_sparsity, diffusers_sparsity]
include:
- example: speculative_decoding
docker_image: "26.01"
54 changes: 54 additions & 0 deletions examples/diffusers/README.md
@@ -13,6 +13,7 @@ Cache Diffusion is a technique that reuses cached outputs from previous diffusio
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to optimize your models using quantization/cache diffusion to reduce precision and improve inference efficiency | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Support Matrix | View the support matrix to see quantization/cache diffusion compatibility and feature availability across different models | \[[Link](#support-matrix)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Sparse Attention (Skip-Softmax) | Training-free sparse attention that skips negligible KV tiles to accelerate diffusion inference | \[[Link](#sparse-attention-skip-softmax)\] | |
| Cache Diffusion | Caching technique to accelerate inference without compromising quality | \[[Link](#cache-diffusion)\] | |
| Post Training Quantization (PTQ) | Example scripts on how to run PTQ on diffusion models | \[[Link](#post-training-quantization-ptq)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training (QAT) | Example scripts on how to run QAT on diffusion models | \[[Link](#quantization-aware-training-qat)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
@@ -290,6 +291,59 @@ mto.restore(pipe.unet, your_quantized_ckpt)

By following these steps, you can efficiently quantize your PEFT LoRA model with ModelOpt, leaving it ready for high-performance deployment.

## Sparse Attention (Skip-Softmax)

Skip-softmax sparse attention skips KV tiles whose attention scores are negligible during the softmax computation, reducing FLOPs without retraining. An exponential model (`scale_factor = a * exp(b * target_sparsity)`) is calibrated once; the target sparsity can then be adjusted at runtime without recalibration.
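
As a rough sketch of that model (hypothetical coefficients `a` and `b`; the helper name is ours, not a ModelOpt API), the conversion chain documented in [sparsity/README.md](./sparsity/README.md) maps a target sparsity to the log2-space threshold the Triton kernel consumes:

```python
import math

# Hypothetical coefficients; real values come from the calibration step below.
a, b = 1.5e-4, 6.0

def skip_threshold_log2(target_sparsity: float, seq_k: int, sm_scale: float) -> float:
    """Map a target sparsity to the log2-space threshold consumed by the kernel."""
    scale_factor = a * math.exp(b * target_sparsity)  # calibrated exponential model
    threshold = scale_factor / seq_k                  # normalize by KV sequence length
    return math.log2(threshold) * sm_scale            # kernel compares scores in log2 space

# The target can be re-chosen at runtime without recalibrating:
for s in (0.3, 0.5, 0.7):
    print(f"sparsity {s}: threshold {skip_threshold_log2(s, seq_k=4096, sm_scale=128**-0.5):.2f}")
```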

### Getting Started

```python
import modelopt.torch.sparsity.attention_sparsity as mtsa

# 1. Define config with calibration
config = {
    "sparse_cfg": {
        "calibration": {
            "target_sparse_ratio": {"prefill": 0.5},
        },
        "*.attn1": {
            "method": "triton_skip_softmax",
            "backend": "triton",
            "is_causal": False,
            "collect_stats": True,
            "enable": True,
        },
        "*.attn2": {"enable": False},
        "default": {"enable": False},
    },
}

# 2. Provide a calibration forward loop (`pipeline` is your loaded diffusers
# pipeline; driving it end-to-end gives the transformer realistic activations)
def forward_loop(model):
    pipeline(prompt="a cat", num_frames=81, num_inference_steps=40)  # plus your usual generation kwargs

# 3. Sparsify + calibrate (`transformer` is the pipeline's denoiser module)
mtsa.sparsify(transformer, config, forward_loop=forward_loop)

# 4. Generate as usual; sparsity is applied automatically
output = pipeline(prompt="a dog on the beach")
```

### Example Scripts

#### Wan 2.2 [Script](./sparsity/wan22_skip_softmax.py)

For the 14B model, the script automatically sparsifies both `transformer` and `transformer_2`.

```bash
# 5B or 14B model: pass either Wan-AI/Wan2.2-TI2V-5B-Diffusers or Wan-AI/Wan2.2-T2V-A14B-Diffusers
python sparsity/wan22_skip_softmax.py \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 --calib-size 4 \
    --prompt "A sunset over mountains" --output out.mp4
```

## Cache Diffusion

Cache Diffusion methods, such as [DeepCache](https://arxiv.org/abs/2312.00858), [Block Caching](https://arxiv.org/abs/2312.03209) and [T-Gate](https://arxiv.org/abs/2404.02747), optimize performance by reusing cached outputs from previous steps instead of recalculating them. This **training-free** caching approach is compatible with a variety of models, like **DiT** and **UNet**, enabling considerable acceleration without compromising quality.
76 changes: 76 additions & 0 deletions examples/diffusers/sparsity/README.md
@@ -0,0 +1,76 @@
# Skip-Softmax Sparse Attention for Diffusion Models

> [!WARNING]
> **Third-Party License Notice — LTX-2**
>
> LTX-2 packages (`ltx-core`, `ltx-pipelines`, `ltx-trainer`) are third-party dependencies
> developed and provided by [Lightricks](https://github.com/Lightricks/LTX-2). They are
> **NOT** covered by the Apache 2.0 license governing NVIDIA Model Optimizer.
>
> You **MUST** comply with the
> [LTX Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
> when installing and using LTX-2 with NVIDIA Model Optimizer. Any derivative models or
> fine-tuned weights produced from LTX-2 (including quantized, distilled, or sparsified
> checkpoints) remain subject to the LTX Community License Agreement, not Apache 2.0.

Skip-softmax sparse attention (BLASST, <https://arxiv.org/pdf/2512.12087>) skips KV
tiles whose attention scores are negligible during the FlashAttention computation,
reducing FLOPs without retraining.

Two modes are supported:
- **Fixed raw threshold** — pass a log2-space threshold directly to the Triton
kernel. No calibration needed. Good for quick testing and sweeps.
- **Calibrated threshold** — an exponential model
(`scale_factor = a * exp(b * target_sparsity)`) is calibrated once via the
Triton calibration kernel, then the target sparsity can be adjusted at runtime
without recalibration. Log-space fitting (`fit_logspace=True`) is recommended
for diffusion models where scale_factors span many orders of magnitude.
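
To see why log-space fitting is recommended, here is a minimal sketch (our own illustration with made-up calibration points, not the ModelOpt internals): taking logs turns the exponential model into a linear fit, so calibration points spanning several orders of magnitude all contribute instead of the largest values dominating the fit.

```python
import numpy as np

# Made-up calibration measurements: (target sparsity, measured scale_factor).
s = np.array([0.2, 0.4, 0.6, 0.8])
scale = np.array([2e-5, 8e-5, 1e-3, 2e-2])  # spans three orders of magnitude

# fit_logspace=True style: log(scale) = log(a) + b * s, ordinary least squares.
b, log_a = np.polyfit(s, np.log(scale), deg=1)
a = np.exp(log_a)

# Evaluate the fitted model at a runtime target of 0.5.
print(f"a={a:.3e}, b={b:.3f}, scale_factor(0.5)={a * np.exp(b * 0.5):.3e}")
```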

## Supported Models

| Model | Script | Notes |
|-------|--------|-------|
| WAN 2.2 5B | `wan22_skip_softmax.py` | Single transformer, self-attention only |
| WAN 2.2 14B | `wan22_skip_softmax.py` | Dual transformer (auto-detected) |
| LTX-2 | (coming soon) | Via `ltx_triton_attention.py` backend |

## Quick Start

```bash
# Fixed raw threshold (no calibration, fast)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --raw-threshold -0.7 \
    --prompt "A cat playing piano" --output out.mp4

# With calibration
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 \
    --prompt "A cat playing piano" --output out.mp4

# Dense baseline (no sparsity, for comparison)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --baseline \
    --prompt "A cat playing piano" --output baseline.mp4

# Report runtime sparsity (per-layer tile skip ratios)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --raw-threshold -0.7 --report-avg-sparsity \
    --prompt "A cat playing piano" --output out.mp4
```

## Threshold Modes

| Mode | How threshold reaches the kernel | Use case |
|------|----------------------------------|----------|
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly as `skip_threshold_log2` — no conversion | Quick testing, sweeps |
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`; the backend computes `threshold = scale_factor / seq_k`; the kernel then applies `log2(threshold) * sm_scale` | Production use with automatic seqlen adaptation |
| **Static lambda** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback when neither raw nor calibrated mode is set |
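
As a reading aid for the table, the sketch below spells out the three conversion paths in plain Python (our paraphrase under assumed names, not the actual backend code):

```python
import math

def to_skip_threshold_log2(mode, seq_k, sm_scale, raw_threshold=None,
                           a=None, b=None, target=None, static_lambda=0.1):
    """Reproduce the three threshold paths from the table above."""
    if mode == "raw":                                # --raw-threshold
        return raw_threshold                         # passed through unchanged
    if mode == "calibrated":                         # --calibrate --target-sparsity
        scale_factor = a * math.exp(b * target)      # calibrated exponential model
        return math.log2(scale_factor / seq_k) * sm_scale
    return math.log2(static_lambda) * sm_scale       # static-lambda fallback
```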

## Known Issues

- **14B dual transformer calibration**: Transformers are calibrated sequentially; `transformer_2` is calibrated while `transformer` is already sparsified, introducing asymmetric calibration conditions.
- **Minimum achievable sparsity**: Even the strictest threshold may yield 30-40% sparsity on diffusion models (many tiles are inherently negligible). Targets below this floor cause extrapolation; an inference-time warning is emitted.