12 changes: 12 additions & 0 deletions .github/codecov.yml
@@ -11,3 +11,15 @@ coverage:
target: auto
threshold: 1% # Allow at most 1% coverage drop from main branch.
patch: false

# Exclude GPU-only Triton kernel files from ALL codecov calculations (project
# and patch checks, all flags). Rationale: these files are dominated by
# @triton.jit kernel bodies that CPU unit tests cannot exercise. GPU tests
# cover them end-to-end (see tests/gpu/torch/sparsity/attention_sparsity/) but
# the `gpu`-flag upload may race with the PR status check, so relying on flag
# combination alone leaves the project check flaky. Dropping these files here
# makes the check deterministic — local `pytest --cov` and GPU runs still
# measure them; only the codecov PR status ignores them.
ignore:
- "modelopt/torch/kernels/triton_fa.py"
- "modelopt/torch/kernels/hf_triton_attention.py"
2 changes: 1 addition & 1 deletion .github/workflows/example_tests.yml
@@ -63,7 +63,7 @@ jobs:
strategy: &torch_strategy
fail-fast: false
matrix:
-        example: [llm_distill, llm_qat, llm_sparsity]
+        example: [llm_distill, llm_qat, llm_sparsity, diffusers_sparsity]
include:
- example: speculative_decoding
docker_image: "26.01"
54 changes: 54 additions & 0 deletions examples/diffusers/README.md
@@ -13,6 +13,7 @@ Cache Diffusion is a technique that reuses cached outputs from previous diffusio
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to optimize your models using quantization/cache diffusion to reduce precision and improve inference efficiency | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Support Matrix | View the support matrix to see quantization/cache diffusion compatibility and feature availability across different models | \[[Link](#support-matrix)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Sparse Attention (Skip-Softmax) | Training-free sparse attention that skips negligible KV tiles to accelerate diffusion inference | \[[Link](#sparse-attention-skip-softmax)\] | |
| Cache Diffusion | Caching technique to accelerate inference without compromising quality | \[[Link](#cache-diffusion)\] | |
| Post Training Quantization (PTQ) | Example scripts on how to run PTQ on diffusion models | \[[Link](#post-training-quantization-ptq)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training (QAT) | Example scripts on how to run QAT on diffusion models | \[[Link](#quantization-aware-training-qat)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
@@ -290,6 +291,59 @@ mto.restore(pipe.unet, your_quantized_ckpt)

By following these steps, you can efficiently quantize your PEFT LoRA model with ModelOpt, leaving it ready for high-performance deployment.

## Sparse Attention (Skip-Softmax)

Skip-softmax sparse attention skips KV tiles whose attention scores are negligible during the softmax computation, reducing FLOPs without retraining. An exponential model (`scale_factor = a * exp(b * target_sparsity)`) is calibrated once; the target sparsity can then be adjusted at runtime without recalibration.
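
As a rough sketch of that model (hypothetical coefficients `a` and `b`; the helper name is ours, not a ModelOpt API), the conversion chain documented in [sparsity/README.md](./sparsity/README.md) maps a target sparsity to the log2-space threshold the Triton kernel consumes:

```python
import math

# Hypothetical coefficients; real values come from the calibration step below.
a, b = 1.5e-4, 6.0

def skip_threshold_log2(target_sparsity: float, seq_k: int, sm_scale: float) -> float:
    """Map a target sparsity to the log2-space threshold consumed by the kernel."""
    scale_factor = a * math.exp(b * target_sparsity)  # calibrated exponential model
    threshold = scale_factor / seq_k                  # normalize by KV sequence length
    return math.log2(threshold) * sm_scale            # kernel compares scores in log2 space

# The target can be re-chosen at runtime without recalibrating:
for s in (0.3, 0.5, 0.7):
    print(f"sparsity {s}: threshold {skip_threshold_log2(s, seq_k=4096, sm_scale=128**-0.5):.2f}")
```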

### Getting Started

```python
import modelopt.torch.sparsity.attention_sparsity as mtsa

# 1. Define config with calibration
config = {
    "sparse_cfg": {
        "calibration": {
            "target_sparse_ratio": {"prefill": 0.5},
        },
        "*.attn1": {
            "method": "triton_skip_softmax",
            "backend": "triton",
            "is_causal": False,
            "collect_stats": True,
            "enable": True,
        },
        "*.attn2": {"enable": False},
        "default": {"enable": False},
    },
}

# 2. Provide a calibration forward loop (`pipeline` is your loaded diffusers
# pipeline; driving it end-to-end gives the transformer realistic activations)
def forward_loop(model):
    pipeline(prompt="a cat", num_frames=81, num_inference_steps=40)  # plus your usual generation kwargs

# 3. Sparsify + calibrate (`transformer` is the pipeline's denoiser module)
mtsa.sparsify(transformer, config, forward_loop=forward_loop)

# 4. Generate as usual; sparsity is applied automatically
output = pipeline(prompt="a dog on the beach")
```

### Example Scripts

#### Wan 2.2 [Script](./sparsity/wan22_skip_softmax.py)

For the 14B model, the script automatically sparsifies both `transformer` and `transformer_2`.

```bash
# 5B or 14B model: pass either Wan-AI/Wan2.2-TI2V-5B-Diffusers or Wan-AI/Wan2.2-T2V-A14B-Diffusers
python sparsity/wan22_skip_softmax.py \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 --calib-size 4 \
    --prompt "A sunset over mountains" --output out.mp4
```

## Cache Diffusion

Cache Diffusion methods, such as [DeepCache](https://arxiv.org/abs/2312.00858), [Block Caching](https://arxiv.org/abs/2312.03209) and [T-Gate](https://arxiv.org/abs/2404.02747), optimize performance by reusing cached outputs from previous steps instead of recalculating them. This **training-free** caching approach is compatible with a variety of models, like **DiT** and **UNet**, enabling considerable acceleration without compromising quality.
76 changes: 76 additions & 0 deletions examples/diffusers/sparsity/README.md
@@ -0,0 +1,76 @@
# Skip-Softmax Sparse Attention for Diffusion Models

> [!WARNING]
> **Third-Party License Notice — LTX-2**
>
> LTX-2 packages (`ltx-core`, `ltx-pipelines`, `ltx-trainer`) are third-party dependencies
> developed and provided by [Lightricks](https://github.com/Lightricks/LTX-2). They are
> **NOT** covered by the Apache 2.0 license governing NVIDIA Model Optimizer.
>
> You **MUST** comply with the
> [LTX Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
> when installing and using LTX-2 with NVIDIA Model Optimizer. Any derivative models or
> fine-tuned weights produced from LTX-2 (including quantized, distilled, or sparsified
> checkpoints) remain subject to the LTX Community License Agreement, not Apache 2.0.

Skip-softmax sparse attention (BLASST, <https://arxiv.org/pdf/2512.12087>) skips KV
tiles whose attention scores are negligible during the FlashAttention computation,
reducing FLOPs without retraining.

Two modes are supported:
- **Fixed raw threshold** — pass a log2-space threshold directly to the Triton
kernel. No calibration needed. Good for quick testing and sweeps.
- **Calibrated threshold** — an exponential model
(`scale_factor = a * exp(b * target_sparsity)`) is calibrated once via the
Triton calibration kernel, then the target sparsity can be adjusted at runtime
without recalibration. Log-space fitting (`fit_logspace=True`) is recommended
for diffusion models where scale_factors span many orders of magnitude.
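
To see why log-space fitting is recommended, here is a minimal sketch (our own illustration with made-up calibration points, not the ModelOpt internals): taking logs turns the exponential model into a linear fit, so calibration points spanning several orders of magnitude all contribute instead of the largest values dominating the fit.

```python
import numpy as np

# Made-up calibration measurements: (target sparsity, measured scale_factor).
s = np.array([0.2, 0.4, 0.6, 0.8])
scale = np.array([2e-5, 8e-5, 1e-3, 2e-2])  # spans three orders of magnitude

# fit_logspace=True style: log(scale) = log(a) + b * s, ordinary least squares.
b, log_a = np.polyfit(s, np.log(scale), deg=1)
a = np.exp(log_a)

# Evaluate the fitted model at a runtime target of 0.5.
print(f"a={a:.3e}, b={b:.3f}, scale_factor(0.5)={a * np.exp(b * 0.5):.3e}")
```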

## Supported Models

| Model | Script | Notes |
|-------|--------|-------|
| WAN 2.2 5B | `wan22_skip_softmax.py` | Single transformer, self-attention only |
| WAN 2.2 14B | `wan22_skip_softmax.py` | Dual transformer (auto-detected) |
| LTX-2 | (coming soon) | Via `ltx_triton_attention.py` backend |

## Quick Start

```bash
# Fixed raw threshold (no calibration, fast)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --raw-threshold -0.7 \
    --prompt "A cat playing piano" --output out.mp4

# With calibration
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 \
    --prompt "A cat playing piano" --output out.mp4

# Dense baseline (no sparsity, for comparison)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --baseline \
    --prompt "A cat playing piano" --output baseline.mp4

# Report runtime sparsity (per-layer tile skip ratios)
python wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --raw-threshold -0.7 --report-avg-sparsity \
    --prompt "A cat playing piano" --output out.mp4
```

## Threshold Modes

| Mode | How threshold reaches the kernel | Use case |
|------|----------------------------------|----------|
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly as `skip_threshold_log2` — no conversion | Quick testing, sweeps |
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`; the backend computes `threshold = scale_factor / seq_k`; the kernel then applies `log2(threshold) * sm_scale` | Production use with automatic seqlen adaptation |
| **Static lambda** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback when neither raw nor calibrated mode is set |
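
As a reading aid for the table, the sketch below spells out the three conversion paths in plain Python (our paraphrase under assumed names, not the actual backend code):

```python
import math

def to_skip_threshold_log2(mode, seq_k, sm_scale, raw_threshold=None,
                           a=None, b=None, target=None, static_lambda=0.1):
    """Reproduce the three threshold paths from the table above."""
    if mode == "raw":                                # --raw-threshold
        return raw_threshold                         # passed through unchanged
    if mode == "calibrated":                         # --calibrate --target-sparsity
        scale_factor = a * math.exp(b * target)      # calibrated exponential model
        return math.log2(scale_factor / seq_k) * sm_scale
    return math.log2(static_lambda) * sm_scale       # static-lambda fallback
```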

## Known Issues

- **14B dual transformer calibration**: Transformers are calibrated sequentially; `transformer_2` is calibrated while `transformer` is already sparsified, introducing asymmetric calibration conditions.
- **Minimum achievable sparsity**: Even the strictest threshold may yield 30-40% sparsity on diffusion models (many tiles are inherently negligible). Targets below this floor cause extrapolation; an inference-time warning is emitted.