Pull request overview
Fixes `mhc_pre` so that intermediate tensors are allocated on the same device as the input `residual`, preventing device-mismatch failures when the global default device is not CUDA.
Changes:
- Reorders imports in `aiter/ops/mhc.py`.
- Allocates `out_pad`, `sqrsum`, `post_mix`, `comb_mix`, and `layer_input` with `device=residual.device`.
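The allocation pattern these changes apply can be sketched as follows. This is a minimal standalone illustration, not the actual `aiter` code: the helper name and shapes here are invented for the example.

```python
import torch

def alloc_like_residual(residual, shape):
    # Follow the input tensor's device explicitly; without device=,
    # torch.empty falls back to the process-wide default device.
    return torch.empty(shape, dtype=torch.float32, device=residual.device)

# With the default device pinned to CPU, an explicit device= keeps the
# intermediate co-located with the input, avoiding device mismatches.
torch.set_default_device("cpu")
residual = torch.zeros(4)  # stands in for a CUDA tensor on a GPU machine
out_pad = alloc_like_residual(residual, (4, 8))
assert out_pad.device == residual.device
```

On a GPU machine, `residual` would live on `cuda`, and the same helper keeps `out_pad` there regardless of the default device.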
```diff
+device = residual.device
 out_pad = torch.empty(
-    selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32
+    selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32, device=device
```
Consider adding a regression test that exercises `mhc_pre` when the global default device is CPU but the inputs are explicitly on CUDA (e.g., `torch.set_default_device('cpu')`, then create `residual`/`fn`/`hc_scale`/`hc_base` on CUDA). This change fixes the internal tensor allocations to follow `residual.device`, but the current tests may still pass even if allocations accidentally fall back to the default device.
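A sketch of such a regression test is below. The input shapes and the `mhc_pre` call signature are placeholders inferred from the reviewer's comment, not taken from `aiter/ops/mhc.py`, and would need to be adjusted to the real API; the test is skipped unless both CUDA and `aiter` are available.

```python
import importlib.util
import unittest

import torch

_HAS_AITER = importlib.util.find_spec("aiter") is not None

class TestMhcPreDeviceFollowsInput(unittest.TestCase):
    """Regression test sketch: global default device is CPU, inputs on CUDA."""

    @unittest.skipUnless(torch.cuda.is_available() and _HAS_AITER,
                         "requires CUDA and aiter")
    def test_default_cpu_inputs_cuda(self):
        from aiter.ops.mhc import mhc_pre  # module path from the PR

        prev = torch.get_default_device()
        torch.set_default_device("cpu")  # force the failure mode the PR fixes
        try:
            # Placeholder inputs on CUDA; real shapes/dtypes may differ.
            residual = torch.randn(8, 64, device="cuda")
            fn = torch.randn(8, 64, device="cuda")
            hc_scale = torch.randn(64, device="cuda")
            hc_base = torch.randn(64, device="cuda")
            # Before the fix, internal torch.empty calls landed on the CPU
            # default device and raised a device-mismatch error.
            mhc_pre(residual, fn, hc_scale, hc_base)
        finally:
            torch.set_default_device(prev)
```

Keeping the default-device override inside `try`/`finally` (or a fixture) matters here, since leaking a CPU default into other tests would mask exactly the class of bug this test targets.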
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist