
[Draft] Add Ascend NPU optim Support #478

Draft
zhangtao0408 wants to merge 216 commits into vipshop:ascend from zhangtao0408:dev

Conversation

@zhangtao0408
Contributor

Optim: npu_fast_gelu, npu_rms_norm, npu_layer_norm_eval, npu_rotary_mul, npu_weight_nz, npu_adalayernorm
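A minimal sketch of the dispatch pattern such kernel substitutions imply: route an op to its fused `torch_npu` kernel when the package is importable, and fall back to an eager reference implementation otherwise. This is illustrative only, not this PR's actual code; `resolve_op` and `_OP_TABLE` are hypothetical names, and the fallback is the standard tanh-approximation GELU.

```python
# Illustrative sketch (not this PR's code): dispatch table that swaps an
# eager op for its fused torch_npu kernel when torch_npu is importable.
import importlib.util
import math

def _ref_fast_gelu(x):
    # Reference tanh-approximation GELU, used when no fused kernel exists.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# True only on machines where the torch_npu package can be imported.
_NPU_AVAILABLE = importlib.util.find_spec("torch_npu") is not None

# op name -> (torch_npu attribute to use, eager fallback)
_OP_TABLE = {
    "fast_gelu": ("npu_fast_gelu", _ref_fast_gelu),
}

def resolve_op(name):
    """Return the fused NPU kernel if available, else the reference fallback."""
    kernel_name, fallback = _OP_TABLE[name]
    if _NPU_AVAILABLE:
        import torch_npu  # imported lazily so the table loads anywhere
        return getattr(torch_npu, kernel_name)
    return fallback

gelu = resolve_op("fast_gelu")
print(gelu(0.0), round(gelu(1.0), 3))
```

The lazy import keeps the module importable on CUDA or CPU machines, so only the resolved callable differs per backend.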

@DefTruth DefTruth self-requested a review November 25, 2025 07:07
@DefTruth
Member

Thanks! Please switch the target branch to ascend.

@zhangtao0408 zhangtao0408 changed the base branch from dev to ascend November 25, 2025 07:10
@DefTruth DefTruth changed the title from "[Draft] Add Ascend NPU optim Support for Cache-Dit" to "[Draft] Add Ascend NPU optim Support" Nov 25, 2025
DefTruth and others added 26 commits November 26, 2025 18:36
* feat: support async ulysses qkv proj for Flux
* shard transformer

* tp text encoder

* refactor

* clean up

* save more gpu memory

---------

Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* misc: refactor flux.2 tensor parallel

* feat: support hybrid cache + tp for flux.2

* feat: enable seq offload for FLUX.2 w/ GPU=1

* feat: support FLUX.2 context parallel

* opt run_wan_tp example

* upd

* support torch profiler in cache-dit

* Update README.md

---------

Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* add zimage

* inspect

* apply tp plan

* fix

* fix bug

---------

Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* add lumina tp

* fix name

* update

---------

Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* feat: support cache for z-image

* feat: support context parallel for z-image

* chore: Update README.md

* feat: support FnB0 for z-image w/ cp

* feat: fast rope for z-image

* chore: update notes

* chore: update z-image cp example

* feat: allow cudnn attn w/ attn mask for cp

* feat: support _sdpa_cudnn backend for cp
* feat: support async ulysses cp for z-image
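The ulysses exchange behind these commits can be sketched with plain lists: before attention, an all-to-all turns "each rank holds a sequence shard of all heads" into "each rank holds the full sequence for a head subset". This is a toy illustration under assumed shapes, not the PR's code; the real implementation uses `torch.distributed` collectives and overlaps the exchange with the qkv projection (hence "async").

```python
# Toy ulysses head/sequence exchange: nested lists stand in for tensors.
def ulysses_scatter(seq_shards_all_heads, world):
    # input:  per-rank [heads][local_seq]
    # output: per-rank [local_heads][full_seq]
    heads = len(seq_shards_all_heads[0])
    heads_per_rank = heads // world
    out = []
    for dst in range(world):
        head_block = []
        for h in range(dst * heads_per_rank, (dst + 1) * heads_per_rank):
            # Concatenate every rank's sequence shard for head h.
            full_seq = []
            for src in range(world):
                full_seq.extend(seq_shards_all_heads[src][h])
            head_block.append(full_seq)
        out.append(head_block)
    return out

# world=2, 2 heads, sequence of 4 split into two shards of 2 positions.
per_rank = [
    [["t0", "t1"], ["u0", "u1"]],  # rank 0: positions 0-1 of heads 0 and 1
    [["t2", "t3"], ["u2", "u3"]],  # rank 1: positions 2-3 of heads 0 and 1
]
print(ulysses_scatter(per_rank, 2))
```

After the exchange each rank can run full-sequence attention on its head subset; a mirror all-to-all restores the sequence sharding afterwards.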

* feat: add all_to_all_single v2
* feat: support async ulysses cp for qwen-image
* support all2all_qkv fp8

* support qkv fp8 all2all

* support o fp8 all2all

* add wait_tensor

* add additional ulysses_anything_float8 logic

* Update utils.py

* fix log info error

* Update __init__.py

* Update _templated_ulysses_anything.py

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
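The fp8 all2all commits above cut communication volume by quantizing q/k/v before the all-to-all and dequantizing after. A toy stdlib sketch of that quantize → exchange → dequantize roundtrip, hedged: the real code uses float8 tensors and `torch.distributed.all_to_all_single`, while this illustration uses int8-style absmax scaling over plain lists.

```python
# Toy quantize -> all_to_all -> dequantize roundtrip (illustration only).
def quantize(values):
    # Per-tensor absmax scaling into the int8 range [-127, 127].
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(qvalues, scale):
    return [q * scale for q in qvalues]

def all_to_all(shards_per_rank):
    # shards_per_rank[src][dst] -> out[dst][src], like an all-to-all exchange.
    world = len(shards_per_rank)
    return [[shards_per_rank[src][dst] for src in range(world)] for dst in range(world)]

# Each of 2 ranks holds 2 outgoing shards; quantize before the exchange,
# send (qvalues, scale) pairs, then dequantize on the receiving rank.
payload = [[[1.0, -2.0], [3.0, 4.0]], [[0.5, 0.25], [-1.0, 2.0]]]
quantized = [[quantize(shard) for shard in rank] for rank in payload]
exchanged = all_to_all(quantized)
restored = [[dequantize(q, s) for (q, s) in rank] for rank in exchanged]
print(restored[0][1])  # rank 0 now holds rank 1's first shard, approximately
```

Shipping a per-shard scale alongside the low-precision payload is what bounds the roundtrip error; the bandwidth win comes from the payload shrinking to 8 bits per element.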
* add profiler for flux tp and cp example

* Enhance Flux2 & Qwen examples with customizable CLI arguments
* uaa-fp8 support torch.compile

* Update _distributed_primitives.py

---------

Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* feat: relaxed assert transformer
* feat: all2all qkv fp8 for ulysses
DefTruth and others added 15 commits January 15, 2026 16:54
* CI: use torch-cpu for basic cpu tests
* npu docs update

* NPU support update
* feat: support flux2-klein

* feat: support flux2-klein series

* fix dim computation for FLUX.2 klein

---------

Co-authored-by: G.O.D <32255912+gameofdimension@users.noreply.github.com>
* chore: use new logo
* fix logo link

* enable scm for examples
BBuf and others added 14 commits January 19, 2026 09:44
* random

* Update bench.py

* Update bench_distill.py

---------

Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* chore: allow custom generator device in examples

* fix docs
* docs: add latest news
* docs: fix docs format
* fix ltx-2 i2v example
* Update README.md
* allow use default steps for scm
* refine docs

* add pert

* format