
[release/v0.1.12] Cherry-pick #2645 (multi-arch CK GEMM dispatch) + SynchronizedCache backport for v0.1.12.post2 #2874

Merged
sunway513 merged 7 commits into release/v0.1.12 from pensun/post2-with-2645
Apr 23, 2026

Conversation

@sunway513
Collaborator

Summary

This PR brings the validated content of pensun/post2-with-2645 onto release/v0.1.12 so the branch reflects exactly what we shipped as the v0.1.12.post2.dev2645+torch210 wheel that just passed cross-arch DSR1 E2E validation. Ready for tagging as v0.1.12.post2 after review.

Builds on top of #2846 (the C10_HIP_KERNEL_LAUNCH_CHECK macro removal that fixed Greg's build-from-source issue). The macro removal alone was insufficient for vllm/vllm-openai-rocm:v0.19.1's ABI; the full fix combines the dispatch fix from #2645, the manylinux base image, and the torch 2.10 pin.

What's in this PR

Two commits on top of release/v0.1.12:

  1. 7f3e5249 — Cherry-pick of #2645, "fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules" (Vinay's multi-arch CK GEMM dispatch fix, merged on main as 727253ae)

    • Fixes the multi-arch dispatch key collision in CK GEMM modules — the C++ dispatch map is now keyed by (gfx, cu_num, M, N, K) instead of (cu_num, M, N, K), eliminating collisions when the same wheel runs across gfx942 / gfx950 (see the sketch after this list).
    • Adds gfx column to all tuned-GEMM CSVs and tuner output keys.
    • Adds chip_info.get_build_targets() / get_gfx_runtime() helpers and PRETUNE_MODULES build flag.
    • Addresses TJ Mok's #2864 ("0.1.12 fails on DeepSeek R1 on MI300"), the DSR1 + MI300 GEMM crash report.
  2. b8eab93b — SynchronizedCache template hand-port (subset of #2221, "Replace unsafe uses of std::unordered_map with SynchronizedCache")
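
For intuition, a minimal sketch of the keying change in commit 1; all names here (KernelFn, DispatchKey, lookup_kernel) are hypothetical stand-ins rather than the actual aiter symbols:

    #include <map>
    #include <string>
    #include <tuple>

    // Sketch only: the dispatch map is keyed by (gfx, cu_num, M, N, K) instead
    // of (cu_num, M, N, K), so entries tuned for gfx942 and gfx950 can coexist
    // in one multi-arch wheel without colliding.
    using KernelFn = void (*)();  // stand-in for a tuned CK GEMM kernel entry point
    using DispatchKey = std::tuple<std::string, int, int, int, int>;  // gfx, cu_num, M, N, K

    static std::map<DispatchKey, KernelFn> kernel_map;

    KernelFn lookup_kernel(const std::string& gfx, int cu_num, int M, int N, int K) {
        // Before the fix the key omitted gfx, so a kernel tuned for gfx942 could
        // be returned on gfx950 (and vice versa) whenever cu_num/M/N/K matched.
        auto it = kernel_map.find({gfx, cu_num, M, N, K});
        // Per the cherry-pick notes, the Python-side lookups additionally fall
        // back to the legacy (cu_num, M, N, K) key for CSVs without a gfx column.
        return it != kernel_map.end() ? it->second : nullptr;  // nullptr: caller falls back to a default kernel
    }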

Validation

The wheel built from this exact branch state — amd_aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28.torch210... — has been validated end-to-end:

  • Test build run #24800645132 — auditwheel PASS (GLIBCXX ≤ 3.4.29, GLIBC ≤ 2.34), manylinux_2_28 compliant.
  • Cross-arch DSR1 E2E: same wheel content serves DeepSeek R1 cleanly on both:
    • MI300X (gfx942) with vllm/vllm-openai-rocm:v0.19.1 + VLLM_ROCM_USE_AITER=1 — coherent inference, no GEMM crash.
    • MI355X (gfx950) with vllm/vllm-openai-rocm:v0.19.1 + VLLM_ROCM_USE_AITER=1 — coherent inference.
  • The multi-arch dispatch fix is what unblocks running the same wheel on both targets without the dispatch key collision reported in #2864 ("0.1.12 fails on DeepSeek R1 on MI300").

Issue references

CI

Applying the full label-driven downstream test set:

  • ci:vllm
  • ci:atom
  • ci:sglang
  • ci:triton-355

Do NOT merge yet

Leave open for review. The v0.1.12.post2 tag will be created post-merge.

eppaneamd and others added 2 commits April 22, 2026 16:01
fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645)

* chip_info: add GFX_CU_NUM_MAP and get_build_targets()

* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column

* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets

* aiter/ops: add gfx to runtime GEMM dispatch lookup keys

* aiter/utility: add gfx to GemmCommonTuner key and tune result output

* csrc, gradlib: add gfx to all GEMM tuner output keys

* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)

* fix(configs): resolve model_configs merge conflicts and add gfx column

* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts

* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync

The merge commit 6a18cd6 accidentally preserved conflict markers in
gemm_op_a4w4.py. Apply the gfx-aware dispatch fix (same pattern as
gemm_op_a8w8.py) — use (gfx, cu_num, M, N, K) key when the CSV has a
gfx column, fall back to (cu_num, M, N, K) for old CSVs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(configs): add missing gfx column to dsv3 model_configs overrides

* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8

* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules

* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules

* op_tests/configs/gemm_codegen_gfx_filter.csv

* chip_info: split arch constants and env-only build targets into torch-free build_targets.py

* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32

* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError

* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs

* lint: apply black formatting and fix ruff violations in modified files

* lint: fix black/ruff violations in csrc gen_instances and gradlib

* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults

* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs

* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list

* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location

* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num() (a sketch of this helper follows this commit message)

* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()

* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column

* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection

* fix(core): self-heal CSV dedup without requiring a re-run

* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings

* fix(chip_info): use logger.warning instead of print for kernel skip warnings

* style(chip_info): fix E402 import order after logger initialization

* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent

* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU

* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import

* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback

* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)

* style(chip_info, test_gemm_codegen): apply black/ruff formatting

* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU

* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter

* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI

* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import

* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication

* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments

* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test

* setup.py: import pretune directly to avoid premature aiter package init

* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup

* pretune: tune only missing model shapes in warmup(), not full CSV

* fix(pretune): remove vLLM-specific env var hint from warmup() warning

* revert: remove warmup API from pretune.py

* fix(tuners): clear module-level CSV caches in _clear_op_caches

* fix(build): add _parse_gpu_archs_env()

* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup

* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering

* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache

* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env

* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge

* raise error when having duplicate shape entries

* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv

* resolve duplicated shapes

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
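
To make the gemm_dispatch_utils commits above concrete (HIP return-code checks, device initialized to -1, per-device caching), here is a hedged sketch of what such a helper could look like; the actual code in csrc/include/gemm_dispatch_utils.h may differ in signature and structure:

    #include <hip/hip_runtime.h>
    #include <stdexcept>

    // Illustrative sketch only, not the verbatim aiter helper.
    int get_device_cu_num() {
        int device = -1;  // initialized to -1 to make the output-parameter intent explicit
        hipError_t err = hipGetDevice(&device);
        if (err != hipSuccess)
            throw std::runtime_error(hipGetErrorString(err));
        hipDeviceProp_t prop;
        err = hipGetDeviceProperties(&prop, device);
        if (err != hipSuccess)
            throw std::runtime_error(hipGetErrorString(err));
        // The commits above also cache (cu_num, gfx) per device ID via
        // SynchronizedCache so these HIP queries run once per device.
        return prop.multiProcessorCount;  // CU count on AMD GPUs
    }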
…#2645 cherry-pick

PR #2645 introduced csrc/include/gemm_dispatch_utils.h which references
the SynchronizedCache<Key, T> template. That template was added by a
SEPARATE earlier PR (#2221, 2026-04-15) to csrc/include/aiter_hip_common.h
on main, but #2221 was never on release/v0.1.12.

Cherry-picking #2645 alone fails to compile with:
  csrc/include/gemm_dispatch_utils.h:46:12: error: no template named 'SynchronizedCache'

Hand-port just the template definition (23 lines + 2 includes), the minimum
needed for the dispatch fix to compile. This skips #2221's other changes
(replacing std::unordered_map usages in 12 .cu files), since release/v0.1.12
doesn't have those dispatch sites.
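
For reference, a minimal sketch of the shape such a template takes (a mutex-guarded std::unordered_map); this illustrates the pattern, not the verbatim #2221 definition:

    #include <mutex>
    #include <unordered_map>

    // Illustrative sketch in the spirit of #2221's SynchronizedCache<Key, T>;
    // the hand-ported definition in csrc/include/aiter_hip_common.h may differ.
    template <typename Key, typename T>
    class SynchronizedCache {
    public:
        // Return the cached value for key, computing it under the lock on first use.
        template <typename Factory>
        T& get_or_emplace(const Key& key, Factory&& make) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = cache_.find(key);
            if (it == cache_.end())
                it = cache_.emplace(key, make()).first;
            return it->second;
        }

    private:
        std::mutex mutex_;
        std::unordered_map<Key, T> cache_;
    };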
PR #2645 introduced 'kid' as a placeholder/stub on this line; PR #2734
later refactored the tune.py files and properly named the variable.
Our cherry-pick of #2645 onto release/v0.1.12 doesn't include #2734's
refactor, so the dangling 'kid' reference fails ruff F821.

Use the inner loop variable 'i' (kernel index) which is what 'kid'
was meant to refer to in this context (KernelID = i).
PR #2645 introduced 'kid' as a placeholder/stub on this line; PR #2734
later refactored the tune.py files and properly named the variable.
Our cherry-pick of #2645 onto release/v0.1.12 doesn't include #2734's
refactor, so the dangling 'kid' reference fails ruff F821.

Use the inner loop variable 'i' (kernel index) which is what 'kid'
was meant to refer to in this context (KernelID = i).
Diff context for the review comment below:

    total_kernel_nums = 0
    # kernels_num = len(kernels_list_ck)
    info_keys = (cu_num, M, N, K, q_dtype_w)
    prev_task_count = len(task)

⚠️ [ruff] <F841> reported by reviewdog 🐶
Local variable prev_task_count is assigned to but never used

Suggested change: remove the unused assignment prev_task_count = len(task).

@ChuanLi1101

FYI — ran auditwheel show + import smoke test on the wheel from run 24800645132 on a ROCm dev box:

Wheel: amd_aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
sha256: 2a70ff7b806049a6a0f8cf2e1be48a9c782ea8a2afd45c1eef5d29849ad5ce7b

auditwheel show verdict:

  • Platform: manylinux_2_27_x86_64 (matches the manylinux_2_27 / manylinux_2_28 dual tag in the filename)
  • Max required symbols: GLIBCXX_3.4.21 (libstdc++), GLIBC_2.27 (glibc, from libm; libc itself tops out at GLIBC_2.17)

Import test inside vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22.04, ceilings GLIBCXX_3.4.30 / GLIBC_2.35, GPUs mounted):

  • pip install + from aiter import flash_attn_varlen_func both succeed.
  • Installs as amd-aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28.

For perspective, the earlier post2 candidate (run 24753462414, +rocm7.2.2.ubuntu22 filename tag) was manylinux_2_34 / GLIBCXX_3.4.29 / GLIBC_2.32; this build (+rocm7.2.manylinux.2.28 tag) tightens to manylinux_2_27 / GLIBCXX_3.4.21 / GLIBC_2.27. Base-image switch to manylinux took effect.

Not gating review — just closing the binary-ABI loop.

@valarLip
Collaborator

#2881: this one is also needed.

@sunway513 merged commit 28a7b6a into release/v0.1.12 on Apr 23, 2026
23 of 25 checks passed
@sunway513 deleted the pensun/post2-with-2645 branch on April 23, 2026 18:03
