[release/v0.1.12] Cherry-pick #2645 (multi-arch CK GEMM dispatch) + SynchronizedCache backport for v0.1.12.post2 (#2874)
Conversation
fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645). Squashed commit list:

* chip_info: add GFX_CU_NUM_MAP and get_build_targets()
* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column
* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets
* aiter/ops: add gfx to runtime GEMM dispatch lookup keys
* aiter/utility: add gfx to GemmCommonTuner key and tune result output
* csrc, gradlib: add gfx to all GEMM tuner output keys
* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)
* fix(configs): resolve model_configs merge conflicts and add gfx column
* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts
* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync. The merge commit 6a18cd6 accidentally preserved conflict markers in gemm_op_a4w4.py; apply the gfx-aware dispatch fix (same pattern as gemm_op_a8w8.py): use a (gfx, cu_num, M, N, K) key when the CSV has a gfx column, fall back to (cu_num, M, N, K) for old CSVs. (Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>)
* fix(configs): add missing gfx column to dsv3 model_configs overrides
* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8
* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules
* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules
* op_tests/configs/gemm_codegen_gfx_filter.csv
* chip_info: split arch constants and env-only build targets into torch-free build_targets.py
* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32
* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError
* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs
* lint: apply black formatting and fix ruff violations in modified files
* lint: fix black/ruff violations in csrc gen_instances and gradlib
* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults
* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs
* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list
* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location
* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num()
* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()
* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column
* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection
* fix(core): self-heal CSV dedup without requiring a re-run
* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings
* fix(chip_info): use logger.warning instead of print for kernel skip warnings
* style(chip_info): fix E402 import order after logger initialization
* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent
* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU
* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import
* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback
* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)
* style(chip_info, test_gemm_codegen): apply black/ruff formatting
* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU
* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter
* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI
* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import
* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication
* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments
* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test
* setup.py: import pretune directly to avoid premature aiter package init
* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup
* pretune: tune only missing model shapes in warmup(), not full CSV
* fix(pretune): remove vLLM-specific env var hint from warmup() warning
* revert: remove warmup API from pretune.py
* fix(tuners): clear module-level CSV caches in _clear_op_caches
* fix(build): add _parse_gpu_archs_env()
* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup
* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering
* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache
* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env
* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge
* raise error when having duplicate shape entries
* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv
* resolve duplicated shapes

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
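The gfx-aware lookup pattern described in the a4w4 merge-fix commit above can be sketched in plain Python. This is an illustrative sketch, not aiter's actual code: the helper names (`load_tuned_dict`, `pick_kernel`) and column names are assumptions; the grounded part is the key shape, using `(gfx, cu_num, M, N, K)` when the tuned CSV carries a `gfx` column and falling back to the legacy `(cu_num, M, N, K)` key for old CSVs.

```python
import csv
from io import StringIO

def load_tuned_dict(csv_text):
    """Build a dispatch dict from a tuned-GEMM CSV.

    New-format CSVs carry a leading gfx column, so rows are keyed by
    (gfx, cu_num, M, N, K); legacy CSVs fall back to (cu_num, M, N, K).
    """
    tuned = {}
    for row in csv.DictReader(StringIO(csv_text)):
        if "gfx" in row and row["gfx"]:
            key = (row["gfx"], int(row["cu_num"]),
                   int(row["M"]), int(row["N"]), int(row["K"]))
        else:  # legacy CSV without a gfx column
            key = (int(row["cu_num"]),
                   int(row["M"]), int(row["N"]), int(row["K"]))
        tuned[key] = row["kernelName"]
    return tuned

def pick_kernel(tuned, gfx, cu_num, M, N, K):
    """Prefer the arch-qualified key; fall back to the legacy key."""
    return tuned.get((gfx, cu_num, M, N, K)) or tuned.get((cu_num, M, N, K))

new_csv = "gfx,cu_num,M,N,K,kernelName\ngfx942,304,1,7168,2048,k_a8w8_v1\n"
old_csv = "cu_num,M,N,K,kernelName\n304,1,7168,2048,k_legacy\n"
print(pick_kernel(load_tuned_dict(new_csv), "gfx942", 304, 1, 7168, 2048))  # k_a8w8_v1
print(pick_kernel(load_tuned_dict(old_csv), "gfx942", 304, 1, 7168, 2048))  # k_legacy
```

The point of the two-level lookup is backward compatibility: wheels shipping old CSVs keep dispatching, while arch-qualified rows remove the cross-gfx collisions.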
PR #2645 introduced csrc/include/gemm_dispatch_utils.h, which references the SynchronizedCache<Key, T> template. That template was added by a SEPARATE earlier PR (#2221, 2026-04-15) to csrc/include/aiter_hip_common.h on main, but #2221 was never on release/v0.1.12. Cherry-picking #2645 alone fails to compile with: csrc/include/gemm_dispatch_utils.h:46:12: error: no template named 'SynchronizedCache'. Hand-port just the template definition (23 lines + 2 includes) — the minimum needed for the dispatch fix to compile. This skips #2221's other changes (replacing std::unordered_map usages in 12 .cu files), since release/v0.1.12 doesn't have those dispatch sites.
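The real SynchronizedCache<Key, T> is a C++ template in csrc/include/aiter_hip_common.h; the sketch below is only a Python analogue of the concurrency contract it enforces (a lock-guarded map whose factory runs at most once per key), matching its use in the commit "cache cu_num and gfx per device ID via SynchronizedCache". The class body and `get_or_create` name here are assumptions for illustration, not the upstream API.

```python
import threading

class SynchronizedCache:
    """Lock-guarded memoizing map: each key's value is computed once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get_or_create(self, key, factory):
        # All lookups serialize on one lock, so concurrent queries for the
        # same device ID cannot race on the underlying dict (the unsafe
        # bare std::unordered_map usage that #2221 replaced upstream).
        with self._lock:
            if key not in self._data:
                self._data[key] = factory(key)
            return self._data[key]

cache = SynchronizedCache()
calls = []

def query_cu_num(device_id):
    calls.append(device_id)  # stand-in for an expensive HIP device query
    return {0: 304, 1: 80}[device_id]

threads = [threading.Thread(target=cache.get_or_create, args=(0, query_cu_num))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1 — the factory ran once despite 8 concurrent lookups
```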
PR #2645 introduced 'kid' as a placeholder/stub on this line; PR #2734 later refactored the tune.py files and properly named the variable. Our cherry-pick of #2645 onto release/v0.1.12 doesn't include #2734's refactor, so the dangling 'kid' reference fails ruff F821. Use the inner loop variable 'i' (kernel index) which is what 'kid' was meant to refer to in this context (KernelID = i).
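A minimal before/after sketch of the ruff F821 fix described above. The surrounding names are illustrative of the tune.py pattern, not its exact code; the grounded part is that the dangling `kid` reference is replaced by the inner loop index `i`, since KernelID = i in this context.

```python
kernels_list = ["kernel_a", "kernel_b", "kernel_c"]

rows = []
for i, kernel in enumerate(kernels_list):
    # Before the fix this row referenced an undefined `kid` (ruff F821:
    # undefined name); after the fix the loop index `i` is used directly,
    # because the kernel's ID is simply its position in the list.
    rows.append({"kernelId": i, "kernelName": kernel})

print([r["kernelId"] for r in rows])  # [0, 1, 2]
```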
```python
total_kernel_nums = 0
# kernels_num = len(kernels_list_ck)
info_keys = (cu_num, M, N, K, q_dtype_w)
prev_task_count = len(task)
```
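The duplicate-shape handling mentioned in the commit list ("raise error when having duplicate shape entries", "self-heal CSV dedup without requiring a re-run") can be sketched as follows. This is a hedged sketch: the function name, column names, and keep-the-fastest policy are assumptions for illustration; the grounded part is that a tuned CSV must hold at most one row per shape key, and the loader either raises or heals in place without a re-tune.

```python
def dedup_rows(rows, strict=False):
    """Deduplicate tuned rows on (gfx, cu_num, M, N, K).

    strict=True raises on a duplicate; otherwise "self-heal" by keeping
    the lower-latency entry, so no re-tuning run is required.
    """
    best = {}
    for row in rows:
        key = (row["gfx"], row["cu_num"], row["M"], row["N"], row["K"])
        if key in best:
            if strict:
                raise ValueError(f"duplicate tuned shape entry: {key}")
            if row["us"] < best[key]["us"]:  # keep the faster kernel
                best[key] = row
        else:
            best[key] = row
    return list(best.values())

rows = [
    {"gfx": "gfx942", "cu_num": 304, "M": 1, "N": 7168, "K": 2048, "us": 12.0},
    {"gfx": "gfx942", "cu_num": 304, "M": 1, "N": 7168, "K": 2048, "us": 9.5},
]
print(len(dedup_rows(rows)))      # 1
print(dedup_rows(rows)[0]["us"])  # 9.5
```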
FYI — ran a wheel import test. For perspective, the earlier post2 candidate (run 24753462414, …). Not gating review — just closing the binary-ABI loop.
#2881 this one also needed
Summary
This PR brings the validated content of `pensun/post2-with-2645` onto `release/v0.1.12`, so the branch reflects exactly what we shipped as the `v0.1.12.post2.dev2645+torch210` wheel that just passed cross-arch DSR1 E2E validation. Ready for tagging as `v0.1.12.post2` after review.

Builds on top of #2846 (the `C10_HIP_KERNEL_LAUNCH_CHECK` macro removal that fixed Greg's build-from-source issue). The macro removal alone was insufficient for `vllm/vllm-openai-rocm:v0.19.1`'s ABI — the dispatch fix from #2645, the manylinux build, and the torch 2.10 pin combine for the full fix.

What's in this PR
Two commits on top of `release/v0.1.12`:

1. `7f3e5249` — Cherry-pick of "fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules" (#2645; Vinay's multi-arch CK GEMM dispatch fix, merged on `main` as `727253ae`)
   - C++ dispatch map keyed by `(gfx, cu_num, M, N, K)` instead of `(cu_num, M, N, K)`, eliminating collisions when the same wheel runs across gfx942 / gfx950.
   - Adds a `gfx` column to all tuned-GEMM CSVs and tuner output keys.
   - Adds `chip_info.get_build_targets()` / `get_gfx_runtime()` helpers and the `PRETUNE_MODULES` build flag.
2. `b8eab93b` — SynchronizedCache template hand-port (subset of "Replace unsafe uses of std::unordered_map with SynchronizedCache", #2221)
   - `csrc/include/gemm_dispatch_utils.h` references a `SynchronizedCache<Key, T>` template that lives in `csrc/include/aiter_hip_common.h`. That template was added by the separate PR #2221 (commit `016ead37`) on `main`, which was never on `release/v0.1.12`, so cherry-picking #2645 alone fails to compile.
   - Skips #2221's other changes (replacing `std::unordered_map` usages in 12 `.cu` files), since `release/v0.1.12` doesn't have those dispatch sites and pulling them in would expand the diff unnecessarily.

Validation
The wheel built from this exact branch state — `amd_aiter-0.1.12.post2.dev2645+rocm7.2.manylinux.2.28.torch210...` — has been validated end-to-end:

- `vllm/vllm-openai-rocm:v0.19.1` + `VLLM_ROCM_USE_AITER=1` — coherent inference, no GEMM crash.
- `vllm/vllm-openai-rocm:v0.19.1` + `VLLM_ROCM_USE_AITER=1` — coherent inference.

Issue references
- #2846 (`C10_HIP_KERNEL_LAUNCH_CHECK` macro removal)

CI
Applying full label-driven downstream test set: `ci:vllm`, `ci:atom`, `ci:sglang`, `ci:triton-355`.

Do NOT merge yet
Leave open for review. Tagging `v0.1.12.post2` will happen post-merge.