[Draft] Add Ascend NPU optim Support #478

Draft

zhangtao0408 wants to merge 216 commits into vipshop:ascend from
Conversation

Member
Thanks! Please switch the target branch to ascend
* feat: support async ulysses qkv proj for Flux
* shard transformer * tp text encoder * refactor * clean up * save more gpu memory --------- Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* misc: refactor flux.2 tensor parallel
* feat: support hybrid cache + tp for flux.2
* feat: enable seq offload for FLUX.2 w/ GPU=1
* feat: support FLUX.2 context parallel
* opt run_wan_tp example * support torch profiler in cache-dit * Update README.md --------- Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* add zimage * inspect * apply tp plan * fix bug --------- Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* add lumina tp * fix name * update --------- Co-authored-by: felix01.yu <felix01.yu@vipshop.com>
* feat: support cache for z-image
* feat: support context parallel for z-image
* chore: Update README.md
* feat: support FnB0 for z-image w/ cp
* feat: fast rope for z-image * chore: update notes * chore: update z-image cp example * feat: allow cudnn attn w/ attn mask for cp * feat: support _sdpa_cudnn backend for cp
* feat: support async ulysses cp for z-image * feat: add all_to_all_single v2
* feat: support async ulysses cp for qwen-image
* support all2all_qkv fp8 * support qkv fp8 all2all * support o fp8 all2all * add wait_tensor * add addition ulysses_anything_float8 logic * fix log info error --------- Co-authored-by: Your Name <you@example.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* add profiler for flux tp and cp example * Enhance Flux2 & Qwen examples with customizable CLI arguments
* uaa-fp8 support torch.compile * Update _distributed_primitives.py --------- Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* feat: relaxed assert transformer
* feat: all2all qkv fp8 for ulysses
* CI: use torch-cpu for basic cpu tests
* npu docs update * NPU support update
* feat: support flux2-klein series * fix dim computation for FLUX.2 klein --------- Co-authored-by: G.O.D <32255912+gameofdimension@users.noreply.github.com>
* chore: use new logo
* fix logo link * enable scm for examples
* random * Update bench.py * Update bench_distill.py --------- Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
* chore: allow custom generator device in examples * fix docs
* docs: add latest news
* docs: fix docs format
* fix ltx-2 i2v example
* Update README.md
* allow use default steps for scm
* refine docs * add pert * format
Optimizations: npu_fast_gelu, npu_rms_norm, npu_layer_norm_eval, npu_rotary_mul, npu_weight_nz, npu_adalayernorm
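To illustrate what two of the fused kernels named above compute, here is a minimal CPU reference sketch. This is not the PR's implementation: on Ascend hardware these would dispatch to torch_npu fused operators, whose exact call signatures are not shown in this PR, so only the underlying math (a tanh-approximation "fast" GELU and RMSNorm) is given here.

```python
import math

def fast_gelu(x: float) -> float:
    """Tanh-approximation GELU, a common 'fast' variant of exact GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def rms_norm(x: list[float], gamma: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: scale each element by the reciprocal root-mean-square,
    then apply the learned per-channel gain gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]

print(round(fast_gelu(0.0), 4))   # GELU is zero at the origin
print([round(v, 4) for v in rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])])
```

A fused NPU kernel performs the reduction and the elementwise scale in one pass, avoiding the intermediate tensors a naive eager-mode composition would materialize.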