Aiter MHC fix and keep DSv4 ATOM conc1#1202
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
e21a607 to
889c695
Compare
|
Claude finished @Oseltamivir's task in 1m 1s —— View job Review: PR #1202 — Aiter MHC device fix for DSv4 ATOM
LGTM — no blocking issues found. The runtime patcher for |
There was a problem hiding this comment.
LGTM — runtime aiter patch is idempotent, verifies pre/post conditions, and only affects the dsv4-fp4-mi355x-atom benchmark.
Extended reasoning...
Overview
This PR modifies a single benchmark script (benchmarks/single_node/dsv4_fp4_mi355x_atom.sh) to vendor the pure-Python portion of ROCm/aiter#2916 as a runtime patch over the installed aiter/ops/mhc.py. It also removes the previous sed workaround that disabled mhc_pre entirely, and adds a corresponding perf-changelog.yaml entry. The patch adds device=residual.device to several torch.empty() calls so MHC intermediate tensors no longer land on the wrong device.
Security risks
None meaningful. The script clones from public GitHub mirrors over HTTPS at pinned SHAs (already the existing pattern for ATOM and triton_kernels), then mutates a file in an ephemeral container. There is no user input, no network-exposed surface, and no auth/crypto/permissions code involved.
Level of scrutiny
Low. This is benchmark infra code, not production. The blast radius is one CI config (dsv4-fp4-mi355x-atom); a regression here would manifest as a benchmark failure rather than affect users. The patch logic is mechanical (literal source.replace(old, new, 1)), guarded by an idempotency check, refuses to patch if the expected pre-patch text isn't found (with named missing patterns), and re-reads the file post-write to verify success. Failure modes all exit non-zero with clear FATAL messages.
Other factors
The author lists local validation including a fake-package idempotency check. The replacement substrings are carefully whitespace-precise and unique enough that accidental matches are unlikely. The removed sed workaround is replaced with a grep -q guard so a future ATOM rev that drops the hook fails loudly rather than silently. The changelog entry follows the established format. No bugs flagged by the hunting system.
889c695 to
952c923
Compare
951d350 to
952c923
Compare
Matches the exact tree from 55fd191 (run 25027405568). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
04a7baf to
0d94067
Compare
Summary
mhc_predevice-allocation fix at benchmark runtime fordsv4-fp4-mi355x-atom.deepseek_v4.pysed workaround that disabledmhc_preand forced the torch fallback.CONC=1only, with a fatal script guard for accidental high-concurrency runs.Why
PR #1165 introduced DeepSeek-V4-Pro support on ATOM but had to disable the aiter
mhc_prepath because aiter allocated internal tensors on the wrong device. ROCm/aiter#2916 fixes that by allocatingmhc_preintermediates onresidual.device. This PR vendors that pure-Python fix into the benchmark startup path without rebuilding aiter or changing the image.Run 24953107645 showed that higher-concurrency DSv4 ATOM runs are not ready yet:
1k1katCONC>=16can fail during initialization with negative KV budget after high warmup peak memory.1k1katCONC=4and8k1katCONC>=4OOM inside the PR Claude Opus 4.6 #650 torchsparse_attnfallback.chat_templatefor/v1/chat/completions.Until ATOM lands the AITER sparse-attention / multi-request path for DeepSeek-V4, this should stay a single-request marker.
Quantization notes
The ATOM PR #650 path appears to allocate routed MoE expert weights as MXFP4, not BF16:
make_v4_quant_config()returnsdtypes.fp4x2for.ffn.experts,FusedMoEselectsMxfp4MoEMethod, and the triton path swizzles packeduint8FP4 weights plus e8m0 scales. The observed OOMs are in sparse-attention temporary tensors / warmup budget, not from globally dequantizing MoE weights.Validation
bash -n benchmarks/single_node/dsv4_fp4_mi355x_atom.shpython utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/amd-master.yaml --config-keys dsv4-fp4-mi355x-atom --no-evalspython utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/amd-master.yaml --config-keys dsv4-fp4-mi355x-atom --evals-onlypython utils/process_changelog.py --base-ref origin/main --head-ref HEAD --changelog-file perf-changelog.yaml --trim-concaiter.ops.mhcpatcher applies and is idempotent.References