fix mhc device #2916

Merged
valarLip merged 1 commit into main from fix_mhc_device
Apr 28, 2026
@valarLip
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@valarLip valarLip requested review from a team and Copilot April 25, 2026 15:33
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2916 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

Fixes mhc_pre intermediate tensor allocations to be created on the same device as the input residual, preventing device-mismatch failures when the global default device is not CUDA.

Changes:

  • Reorders imports in aiter/ops/mhc.py.
  • Allocates out_pad, sqrsum, post_mix, comb_mix, and layer_input with device=residual.device.
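
The allocation pattern described above can be illustrated with a minimal sketch. It is not the actual `mhc_pre` code: `alloc_scratch` is a hypothetical helper, `torch.float32` is assumed to correspond to `dtypes.fp32`, and the `meta` device is used only so the snippet runs without a GPU.

```python
import torch


def alloc_scratch(residual: torch.Tensor, m: int, hc_mult3: int, splitk: int = 1) -> torch.Tensor:
    """Hypothetical helper: allocate a padded fp32 buffer on the input's device."""
    # Round the last dimension up to a multiple of 32, as in the diff under review.
    padded = (hc_mult3 + 31) // 32 * 32
    # Passing device=residual.device keeps the buffer next to the input,
    # regardless of the process-wide default device.
    return torch.empty(splitk, m, padded, dtype=torch.float32, device=residual.device)


# "meta" stands in for an accelerator device here (assumption for portability).
residual = torch.empty(4, 8, device="meta")
buf = alloc_scratch(residual, m=4, hc_mult3=24)
```

Omitting `device=` would place the buffer on whatever the default device happens to be, which is the mismatch this PR is described as fixing.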

Comment thread aiter/ops/mhc.py
Comment on lines +92 to +94
  device = residual.device
  out_pad = torch.empty(
-     selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32
+     selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32, device=device
Copilot AI Apr 25, 2026

Consider adding a regression test that exercises mhc_pre when the global default device is CPU but inputs are explicitly on CUDA (e.g., torch.set_default_device('cpu') then create residual/fn/hc_scale/hc_base on cuda). This change fixes internal tensor allocations to follow residual.device, but current tests may still pass even if allocations accidentally fall back to the default device.
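
A rough sketch of what such a regression check might look like follows. The real `mhc_pre` signature is not shown in this thread, so `fake_mhc_alloc` is a hypothetical stand-in for its internal allocations. When CUDA is unavailable, the sketch falls back to a `meta` default device with CPU inputs (an assumption made so it can run anywhere).

```python
import torch


def fake_mhc_alloc(residual: torch.Tensor, m: int, hc_mult3: int, splitk: int = 1) -> torch.Tensor:
    # Hypothetical stand-in for the intermediate allocations inside mhc_pre.
    padded = (hc_mult3 + 31) // 32 * 32
    return torch.empty(splitk, m, padded, dtype=torch.float32, device=residual.device)


def check_allocation_follows_input():
    # Reviewer's scenario: default device is CPU, inputs live on CUDA.
    # Fallback without a GPU: default device is "meta", inputs live on CPU.
    if torch.cuda.is_available():
        default_dev, input_dev = "cpu", "cuda"
    else:
        default_dev, input_dev = "meta", "cpu"

    torch.set_default_device(default_dev)
    try:
        residual = torch.empty(4, 8, device=input_dev)  # explicit device wins
        out = fake_mhc_alloc(residual, m=4, hc_mult3=24)
        implicit = torch.empty(1)  # no device given: lands on the default device
        return residual.device, out.device, implicit.device
    finally:
        # Restore the usual default so later code is unaffected.
        torch.set_default_device("cpu")


res_dev, out_dev, implicit_dev = check_allocation_follows_input()
```

The check only has teeth when the default device differs from the input device, so that an allocation missing `device=` would end up somewhere else and fail the comparison.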

@valarLip valarLip merged commit 8c27e66 into main Apr 28, 2026
32 checks passed
@valarLip valarLip deleted the fix_mhc_device branch April 28, 2026 03:01
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 28, 2026