
Aiter MHC fix and keep DSv4 ATOM conc1 #1202

Merged
Oseltamivir merged 3 commits into main from codex/dsv4-atom-aiter-mhc-fix
Apr 28, 2026


Collaborator

@Oseltamivir Oseltamivir commented Apr 27, 2026

Summary

  • Applies the ROCm/aiter#2916 mhc_pre device-allocation fix at benchmark runtime for dsv4-fp4-mi355x-atom.
  • Removes the ATOM deepseek_v4.py sed workaround that disabled mhc_pre and forced the torch fallback.
  • Keeps the DSv4 ATOM config at CONC=1 only, with a fatal script guard for accidental high-concurrency runs.
  • Appends a perf changelog entry so CI runs the affected MI355X ATOM config.

Why

PR #1165 introduced DeepSeek-V4-Pro support on ATOM but had to disable the aiter mhc_pre path because aiter allocated internal tensors on the wrong device. ROCm/aiter#2916 fixes that by allocating mhc_pre intermediates on residual.device. This PR vendors that pure-Python fix into the benchmark startup path without rebuilding aiter or changing the image.

Run 24953107645 showed that higher-concurrency DSv4 ATOM runs are not ready yet:

  • 1k1k at CONC>=16 can fail during initialization with negative KV budget after high warmup peak memory.
  • 1k1k at CONC=4 and 8k1k at CONC>=4 OOM inside the ATOM PR #650 torch sparse_attn fallback.
  • Eval-only currently fails independently because DSv4 has no HF tokenizer chat_template for /v1/chat/completions.

Until ATOM lands the AITER sparse-attention / multi-request path for DeepSeek-V4, this should stay a single-request marker.
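
The "negative KV budget" failure in the first bullet can be read as simple arithmetic. The sketch below assumes a vLLM-style accounting in which the KV cache gets whatever remains of the usable memory after the warmup pass has recorded its peak; this thread does not confirm that ATOM computes it this way. The function names and every number are hypothetical, not measurements from run 24953107645.

```python
# Illustrative only: the accounting model is an assumption and all figures
# used with these helpers are hypothetical.


def kv_budget_gib(total_gib: float, utilization: float, peak_warmup_gib: float) -> float:
    """Memory left for the KV cache after warmup, in GiB.

    total_gib        device memory visible to the process
    utilization      fraction of device memory the engine may use (0..1)
    peak_warmup_gib  peak memory seen during warmup (weights + activations)
    """
    return total_gib * utilization - peak_warmup_gib


def check_kv_budget(total_gib: float, utilization: float, peak_warmup_gib: float) -> float:
    """Return the budget, or raise if warmup already exceeded the usable memory."""
    budget = kv_budget_gib(total_gib, utilization, peak_warmup_gib)
    if budget <= 0:
        raise RuntimeError(
            f"negative KV budget: {budget:.1f} GiB "
            f"(usable {total_gib * utilization:.1f} GiB, warmup peak {peak_warmup_gib:.1f} GiB)"
        )
    return budget
```

Under this model a higher CONC raises the warmup peak (more tokens in flight through the temporary tensors), so the same device can pass the check at CONC=1 and fail it at CONC>=16 without any change in weight size.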

Quantization notes

The ATOM PR #650 path appears to allocate routed MoE expert weights as MXFP4, not BF16: make_v4_quant_config() returns dtypes.fp4x2 for .ffn.experts, FusedMoE selects Mxfp4MoEMethod, and the triton path swizzles packed uint8 FP4 weights plus e8m0 scales. The observed OOMs are in sparse-attention temporary tensors / warmup budget, not from globally dequantizing MoE weights.
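
To make the MXFP4 point concrete, here is a minimal sketch of what "packed uint8 FP4 weights plus e8m0 scales" means and why the footprint is far below BF16. It follows the OCP Microscaling format (E2M1 elements, one E8M0 scale per 32 elements). The nibble order (low nibble first) is an assumption, the triton swizzle is not modelled, and none of these helper names come from ATOM or aiter.

```python
# E2M1 magnitudes indexed by the low three bits; bit 3 is the sign bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements that share one e8m0 scale in the MX format


def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 value."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e8m0(byte: int) -> float:
    """Decode an 8-bit power-of-two scale (bias 127, 0xFF is NaN)."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def dequantize_block(packed: bytes, scale_byte: int) -> list[float]:
    """Unpack two FP4 values per byte and apply the shared scale."""
    scale = decode_e8m0(scale_byte)
    out = []
    for b in packed:
        out.append(decode_fp4(b & 0xF) * scale)  # low nibble first (assumed order)
        out.append(decode_fp4(b >> 4) * scale)
    return out


def mxfp4_bytes(num_elements: int) -> int:
    """Half a byte per element plus one scale byte per block of 32."""
    return num_elements // 2 + num_elements // BLOCK


def bf16_bytes(num_elements: int) -> int:
    """Two bytes per element."""
    return 2 * num_elements
```

For 1024 elements this gives 544 bytes in MXFP4 against 2048 bytes in BF16, which is consistent with the note that the OOMs come from sparse-attention temporaries and warmup budget rather than from the expert weights.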

Validation

  • bash -n benchmarks/single_node/dsv4_fp4_mi355x_atom.sh
  • python utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/amd-master.yaml --config-keys dsv4-fp4-mi355x-atom --no-evals
  • python utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/amd-master.yaml --config-keys dsv4-fp4-mi355x-atom --evals-only
  • python utils/process_changelog.py --base-ref origin/main --head-ref HEAD --changelog-file perf-changelog.yaml --trim-conc
  • Local fake-package check that the embedded aiter.ops.mhc patcher applies and is idempotent.
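
The last validation item could look roughly like the sketch below: build a stub aiter/ops/mhc.py in a temporary directory, locate it through the import system without importing it, and apply a patcher twice. The OLD and NEW strings and the patch_once helper are hypothetical stand-ins; the real replacement text embedded in the benchmark script is not shown in this thread.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical pre-patch and post-patch text, for illustration only.
OLD = "out = torch.empty(n)"
NEW = "out = torch.empty(n, device=residual.device)"


def make_fake_package(root: Path) -> Path:
    """Create a stub aiter/ops/mhc.py under root and return its path."""
    ops = root / "aiter" / "ops"
    ops.mkdir(parents=True)
    (root / "aiter" / "__init__.py").write_text("")
    (ops / "__init__.py").write_text("")
    target = ops / "mhc.py"
    target.write_text(f"def mhc_pre(residual, n):\n    {OLD}\n    return out\n")
    return target


def patch_once(path: Path) -> bool:
    """Apply the replacement; return True only if the file was changed."""
    source = path.read_text()
    if NEW in source:
        return False  # already patched, nothing to do
    if OLD not in source:
        raise SystemExit(f"FATAL: expected pre-patch text not found in {path}")
    path.write_text(source.replace(OLD, NEW, 1))
    return True


def run_check() -> tuple[bool, bool, str]:
    """Patch the fake package twice and report (first, second, final text)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_fake_package(root)
        sys.path.insert(0, str(root))
        try:
            # find_spec imports the parent packages but not mhc.py itself,
            # so the stub never needs a real torch.
            spec = importlib.util.find_spec("aiter.ops.mhc")
            path = Path(spec.origin)
            first = patch_once(path)
            second = patch_once(path)
            return first, second, path.read_text()
        finally:
            sys.path.remove(str(root))
            for name in [m for m in sys.modules if m == "aiter" or m.startswith("aiter.")]:
                del sys.modules[name]
```

A passing check returns (True, False, ...) with exactly one device=residual.device in the final text: the first call rewrites the file and the second is a no-op.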


@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir force-pushed the codex/dsv4-atom-aiter-mhc-fix branch from e21a607 to 889c695 April 27, 2026 23:04
@Oseltamivir Oseltamivir changed the title from "[codex] Use aiter MHC device fix for DSv4 ATOM" to "Aiter MHC device fix for DSv4 ATOM" Apr 27, 2026
@Oseltamivir Oseltamivir marked this pull request as ready for review April 27, 2026 23:06
@Oseltamivir Oseltamivir requested a review from a team April 27, 2026 23:06
@claude
Contributor

claude Bot commented Apr 27, 2026

Claude finished @Oseltamivir's task in 1m 1s


Review: PR #1202 — Aiter MHC device fix for DSv4 ATOM

  • Read repo guidelines
  • Review changed files and gather context
  • Validate perf-changelog entry
  • Check benchmark script code style
  • Post review findings

LGTM — no blocking issues found.

The runtime patcher for aiter/ops/mhc.py is well-structured: idempotent (skips if already patched), validates pre-conditions before modifying anything, verifies post-conditions after writing, and fails fast with clear messages on any mismatch. Replacing the sed workaround that disabled mhc_pre with the upstream device-allocation fix (plus a grep -q assertion that ATOM still exposes the hook) is a clean improvement. Perf-changelog entry is correctly appended at the end.

Contributor

@claude claude Bot left a comment


LGTM — runtime aiter patch is idempotent, verifies pre/post conditions, and only affects the dsv4-fp4-mi355x-atom benchmark.

Extended reasoning...

Overview

This PR modifies a single benchmark script (benchmarks/single_node/dsv4_fp4_mi355x_atom.sh) to vendor the pure-Python portion of ROCm/aiter#2916 as a runtime patch over the installed aiter/ops/mhc.py. It also removes the previous sed workaround that disabled mhc_pre entirely, and adds a corresponding perf-changelog.yaml entry. The patch adds device=residual.device to several torch.empty() calls so MHC intermediate tensors no longer land on the wrong device.

Security risks

None meaningful. The script clones from public GitHub mirrors over HTTPS at pinned SHAs (already the existing pattern for ATOM and triton_kernels), then mutates a file in an ephemeral container. There is no user input, no network-exposed surface, and no auth/crypto/permissions code involved.

Level of scrutiny

Low. This is benchmark infra code, not production. The blast radius is one CI config (dsv4-fp4-mi355x-atom); a regression here would manifest as a benchmark failure rather than affect users. The patch logic is mechanical (literal source.replace(old, new, 1)), guarded by an idempotency check, refuses to patch if the expected pre-patch text isn't found (with named missing patterns), and re-reads the file post-write to verify success. Failure modes all exit non-zero with clear FATAL messages.
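
The patch logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual patcher: the REPLACEMENTS triples, the pattern names, PatchError, and patch_file are all hypothetical, since the real strings are not quoted in this review.

```python
from pathlib import Path

# Hypothetical (name, old, new) triples. Each "new" adds device=residual.device
# to a torch.empty() call; the real strings in the PR may differ.
REPLACEMENTS = [
    (
        "out_alloc",
        "out = torch.empty(shape, dtype=dtype)",
        "out = torch.empty(shape, dtype=dtype, device=residual.device)",
    ),
    (
        "tmp_alloc",
        "tmp = torch.empty(n, dtype=torch.float32)",
        "tmp = torch.empty(n, dtype=torch.float32, device=residual.device)",
    ),
]


class PatchError(RuntimeError):
    """Raised when the target file is not in the expected state."""


def patch_file(path: Path) -> str:
    """Patch path in place; return 'patched' or 'already patched'."""
    source = path.read_text()

    # Idempotency check: every post-patch string is already present.
    if all(new in source for _, _, new in REPLACEMENTS):
        return "already patched"

    # Pre-condition: each pre-patch string must occur exactly once.
    missing = [name for name, old, _ in REPLACEMENTS if source.count(old) != 1]
    if missing:
        raise PatchError(f"FATAL: pre-patch text not found exactly once: {missing}")

    # Literal, single-occurrence replacement.
    for _, old, new in REPLACEMENTS:
        source = source.replace(old, new, 1)
    path.write_text(source)

    # Post-condition: re-read the file and confirm every replacement landed.
    written = path.read_text()
    bad = [name for name, old, new in REPLACEMENTS if new not in written or old in written]
    if bad:
        raise PatchError(f"FATAL: post-patch verification failed: {bad}")
    return "patched"
```

A caller in a benchmark script would typically catch PatchError, print the message, and exit non-zero. Naming the missing patterns in the error makes it easier to tell a changed upstream file from a half-applied patch.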

Other factors

The author lists local validation including a fake-package idempotency check. The replacement substrings are carefully whitespace-precise and unique enough that accidental matches are unlikely. The removed sed workaround is replaced with a grep -q guard so a future ATOM rev that drops the hook fails loudly rather than silently. The changelog entry follows the established format. No bugs flagged by the hunting system.

@Oseltamivir Oseltamivir force-pushed the codex/dsv4-atom-aiter-mhc-fix branch from 889c695 to 952c923 April 27, 2026 23:23
@Oseltamivir Oseltamivir changed the title from "Aiter MHC device fix for DSv4 ATOM" to "[codex] Use aiter MHC fix and keep DSv4 ATOM conc1" Apr 27, 2026
@Oseltamivir Oseltamivir changed the title from "[codex] Use aiter MHC fix and keep DSv4 ATOM conc1" to "Aiter MHC fix and keep DSv4 ATOM conc1" Apr 27, 2026
@Oseltamivir Oseltamivir force-pushed the codex/dsv4-atom-aiter-mhc-fix branch from 951d350 to 952c923 April 28, 2026 06:07
Matches the exact tree from 55fd191 (run 25027405568).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir force-pushed the codex/dsv4-atom-aiter-mhc-fix branch from 04a7baf to 0d94067 April 28, 2026 07:44
@Oseltamivir Oseltamivir merged commit 38d2da7 into main Apr 28, 2026
17 checks passed
@Oseltamivir Oseltamivir deleted the codex/dsv4-atom-aiter-mhc-fix branch April 28, 2026 16:30