
[release/v0.1.12] Cherry-pick #2645 (multi-arch CK GEMM dispatch) + SynchronizedCache backport for v0.1.12.post2 #2874

Merged
sunway513 merged 7 commits into release/v0.1.12 from pensun/post2-with-2645
Apr 23, 2026

Conversation

@sunway513
Collaborator

Summary

This PR brings the validated content of pensun/post2-with-2645 onto release/v0.1.12 so the branch reflects exactly what we shipped as the v0.1.12.post2.dev2645+torch210 wheel that just passed cross-arch DSR1 E2E validation. Ready for tagging as v0.1.12.post2 after review.

Builds on top of #2846 (the C10_HIP_KERNEL_LAUNCH_CHECK macro removal that fixed Greg's build-from-source issue). The macro removal alone was insufficient for vllm/vllm-openai-rocm:v0.19.1's ABI; the full fix combines the dispatch fix from #2645, the manylinux base image, and the torch 2.10 pin.

What's in this PR

Two commits on top of release/v0.1.12:

  1. 7f3e5249 — Cherry-pick of #2645, "fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules" (Vinay's multi-arch CK GEMM dispatch fix, merged on main as 727253ae)

    • Fixes the multi-arch dispatch key collision in CK GEMM modules — the C++ dispatch map is now keyed by (gfx, cu_num, M, N, K) instead of (cu_num, M, N, K), eliminating collisions when the same wheel runs across gfx942 / gfx950 (see the sketch after this list).
    • Adds gfx column to all tuned-GEMM CSVs and tuner output keys.
    • Adds chip_info.get_build_targets() / get_gfx_runtime() helpers and PRETUNE_MODULES build flag.
    • Addresses TJ Mok's #2864 ("0.1.12 fails on DeepSeek R1 on MI300"), the DSR1 + MI300 GEMM crash report.
  2. b8eab93b — SynchronizedCache template hand-port (subset of #2221, "Replace unsafe uses of std::unordered_map with SynchronizedCache")
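
For intuition, a minimal sketch of the keying change in commit 1; all names here (KernelFn, DispatchKey, lookup_kernel) are hypothetical stand-ins rather than the actual aiter symbols:

    #include <map>
    #include <string>
    #include <tuple>

    // Sketch only: the dispatch map is keyed by (gfx, cu_num, M, N, K) instead
    // of (cu_num, M, N, K), so entries tuned for gfx942 and gfx950 can coexist
    // in one multi-arch wheel without colliding.
    using KernelFn = void (*)();  // stand-in for a tuned CK GEMM kernel entry point
    using DispatchKey = std::tuple<std::string, int, int, int, int>;  // gfx, cu_num, M, N, K

    static std::map<DispatchKey, KernelFn> kernel_map;

    KernelFn lookup_kernel(const std::string& gfx, int cu_num, int M, int N, int K) {
        // Before the fix the key omitted gfx, so a kernel tuned for gfx942 could
        // be returned on gfx950 (and vice versa) whenever cu_num/M/N/K matched.
        auto it = kernel_map.find({gfx, cu_num, M, N, K});
        // Per the cherry-pick notes, the Python-side lookups additionally fall
        // back to the legacy (cu_num, M, N, K) key for CSVs without a gfx column.
        return it != kernel_map.end() ? it->second : nullptr;  // nullptr: caller falls back to a default kernel
    }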

Validation

The wheel built from this exact branch state — amd_aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28.torch210... — has been validated end-to-end:

  • Test build run #24800645132 — auditwheel PASS (GLIBCXX ≤ 3.4.29, GLIBC ≤ 2.34), manylinux_2_28 compliant.
  • Cross-arch DSR1 E2E: same wheel content serves DeepSeek R1 cleanly on both:
    • MI300X (gfx942) with vllm/vllm-openai-rocm:v0.19.1 + VLLM_ROCM_USE_AITER=1 — coherent inference, no GEMM crash.
    • MI355X (gfx950) with vllm/vllm-openai-rocm:v0.19.1 + VLLM_ROCM_USE_AITER=1 — coherent inference.
  • The multi-arch dispatch fix is what unblocks running the same wheel on both targets without the dispatch key collision reported in #2864 ("0.1.12 fails on DeepSeek R1 on MI300").

Issue references

CI

Applying the full label-driven downstream test set:

  • ci:vllm
  • ci:atom
  • ci:sglang
  • ci:triton-355

Do NOT merge yet

Leave open for review. The v0.1.12.post2 tag will be created post-merge.

eppaneamd and others added 2 commits April 22, 2026 16:01
fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645)

* chip_info: add GFX_CU_NUM_MAP and get_build_targets()

* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column

* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets

* aiter/ops: add gfx to runtime GEMM dispatch lookup keys

* aiter/utility: add gfx to GemmCommonTuner key and tune result output

* csrc, gradlib: add gfx to all GEMM tuner output keys

* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)

* fix(configs): resolve model_configs merge conflicts and add gfx column

* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts

* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync

The merge commit 6a18cd6 accidentally preserved conflict markers in
gemm_op_a4w4.py. Apply the gfx-aware dispatch fix (same pattern as
gemm_op_a8w8.py) — use (gfx, cu_num, M, N, K) key when the CSV has a
gfx column, fall back to (cu_num, M, N, K) for old CSVs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(configs): add missing gfx column to dsv3 model_configs overrides

* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8

* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules

* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules

* op_tests/configs/gemm_codegen_gfx_filter.csv

* chip_info: split arch constants and env-only build targets into torch-free build_targets.py

* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32

* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError

* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs

* lint: apply black formatting and fix ruff violations in modified files

* lint: fix black/ruff violations in csrc gen_instances and gradlib

* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults

* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs

* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list

* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location

* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num() (a sketch of this helper follows this commit message)

* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()

* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column

* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection

* fix(core): self-heal CSV dedup without requiring a re-run

* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings

* fix(chip_info): use logger.warning instead of print for kernel skip warnings

* style(chip_info): fix E402 import order after logger initialization

* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent

* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU

* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import

* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback

* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)

* style(chip_info, test_gemm_codegen): apply black/ruff formatting

* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU

* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter

* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI

* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import

* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication

* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments

* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test

* setup.py: import pretune directly to avoid premature aiter package init

* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup

* pretune: tune only missing model shapes in warmup(), not full CSV

* fix(pretune): remove vLLM-specific env var hint from warmup() warning

* revert: remove warmup API from pretune.py

* fix(tuners): clear module-level CSV caches in _clear_op_caches

* fix(build): add _parse_gpu_archs_env()

* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup

* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering

* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache

* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env

* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge

* raise error when having duplicate shape entries

* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv

* resolve duplicated shapes

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
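
To make the gemm_dispatch_utils commits above concrete (HIP return-code checks, device initialized to -1, per-device caching), here is a hedged sketch of what such a helper could look like; the actual code in csrc/include/gemm_dispatch_utils.h may differ in signature and structure:

    #include <hip/hip_runtime.h>
    #include <stdexcept>

    // Illustrative sketch only, not the verbatim aiter helper.
    int get_device_cu_num() {
        int device = -1;  // initialized to -1 to make the output-parameter intent explicit
        hipError_t err = hipGetDevice(&device);
        if (err != hipSuccess)
            throw std::runtime_error(hipGetErrorString(err));
        hipDeviceProp_t prop;
        err = hipGetDeviceProperties(&prop, device);
        if (err != hipSuccess)
            throw std::runtime_error(hipGetErrorString(err));
        // The commits above also cache (cu_num, gfx) per device ID via
        // SynchronizedCache so these HIP queries run once per device.
        return prop.multiProcessorCount;  // CU count on AMD GPUs
    }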
…#2645 cherry-pick

PR #2645 introduced csrc/include/gemm_dispatch_utils.h which references
the SynchronizedCache<Key, T> template. That template was added by a
SEPARATE earlier PR (#2221, 2026-04-15) to csrc/include/aiter_hip_common.h
on main, but #2221 was never on release/v0.1.12.

Cherry-picking #2645 alone fails to compile with:
  csrc/include/gemm_dispatch_utils.h:46:12: error: no template named 'SynchronizedCache'

Hand-port just the template definition (23 lines + 2 includes), the minimum
needed for the dispatch fix to compile. This skips #2221's other changes
(replacing std::unordered_map usages in 12 .cu files), since release/v0.1.12
doesn't have those dispatch sites.
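
For reference, a minimal sketch of the shape such a template takes (a mutex-guarded std::unordered_map); this illustrates the pattern, not the verbatim #2221 definition:

    #include <mutex>
    #include <unordered_map>

    // Illustrative sketch in the spirit of #2221's SynchronizedCache<Key, T>;
    // the hand-ported definition in csrc/include/aiter_hip_common.h may differ.
    template <typename Key, typename T>
    class SynchronizedCache {
    public:
        // Return the cached value for key, computing it under the lock on first use.
        template <typename Factory>
        T& get_or_emplace(const Key& key, Factory&& make) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = cache_.find(key);
            if (it == cache_.end())
                it = cache_.emplace(key, make()).first;
            return it->second;
        }

    private:
        std::mutex mutex_;
        std::unordered_map<Key, T> cache_;
    };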
PR #2645 introduced 'kid' as a placeholder/stub on this line; PR #2734
later refactored the tune.py files and properly named the variable.
Our cherry-pick of #2645 onto release/v0.1.12 doesn't include #2734's
refactor, so the dangling 'kid' reference fails ruff F821.

Use the inner loop variable 'i' (kernel index) which is what 'kid'
was meant to refer to in this context (KernelID = i).
PR #2645 introduced 'kid' as a placeholder/stub on this line; PR #2734
later refactored the tune.py files and properly named the variable.
Our cherry-pick of #2645 onto release/v0.1.12 doesn't include #2734's
refactor, so the dangling 'kid' reference fails ruff F821.

Use the inner loop variable 'i' (kernel index) which is what 'kid'
was meant to refer to in this context (KernelID = i).
Diff context for the review comment below:

    total_kernel_nums = 0
    # kernels_num = len(kernels_list_ck)
    info_keys = (cu_num, M, N, K, q_dtype_w)
    prev_task_count = len(task)

⚠️ [ruff] <F841> reported by reviewdog 🐶
Local variable prev_task_count is assigned to but never used

Suggested change: remove the unused assignment prev_task_count = len(task).

@ChuanLi1101

FYI — ran auditwheel show + import smoke test on the wheel from run 24800645132 on a ROCm dev box:

Wheel: amd_aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
sha256: 2a70ff7b806049a6a0f8cf2e1be48a9c782ea8a2afd45c1eef5d29849ad5ce7b

auditwheel show verdict:

  • Platform: manylinux_2_27_x86_64 (matches the manylinux_2_27 / manylinux_2_28 dual tag in the filename)
  • Max required symbols: GLIBCXX_3.4.21 (libstdc++), GLIBC_2.27 (glibc, from libm; libc itself tops out at GLIBC_2.17)

Import test inside vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22.04, ceilings GLIBCXX_3.4.30 / GLIBC_2.35, GPUs mounted):

  • pip install + from aiter import flash_attn_varlen_func both succeed.
  • Installs as amd-aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28.

For perspective, the earlier post2 candidate (run 24753462414, +rocm7.2.2.ubuntu22 filename tag) was manylinux_2_34 / GLIBCXX_3.4.29 / GLIBC_2.32; this build (+rocm7.2.manylinux.2.28 tag) tightens to manylinux_2_27 / GLIBCXX_3.4.21 / GLIBC_2.27. Base-image switch to manylinux took effect.

Not gating review — just closing the binary-ABI loop.

@valarLip
Collaborator

#2881: this one is also needed.

@sunway513 merged commit 28a7b6a into release/v0.1.12 on Apr 23, 2026
23 of 25 checks passed
@sunway513 deleted the pensun/post2-with-2645 branch on April 23, 2026 18:03
