Enable MI300X ROCm support #484
Open
ehartford wants to merge 3 commits into
added 2 commits
June 21, 2026 14:11
OPS-NeoRetro
reviewed
Jul 2, 2026
* **Metal** is our primary target. Starting from MacBooks with 96GB of RAM (or less, using SSD streaming).
* **NVIDIA CUDA / DGX Spark**, CUDA with special care for the DGX Spark.
* **Strix Halo (ROCm)**, systems like the Framework Desktop and other systems based on the same GPU and unified RAM design.
* **AMD ROCm**, validated on AMD Instinct CDNA3 / MI300X. CDNA4 (`gfx950`) build targets are included but still need runtime validation on CDNA4 hardware. Strix Halo uses the `gfx1151` target.
Why did you erase information about AMD Strix Halo?
Author
Strix Halo (gfx1151) is now just one of three supported AMD ROCm targets (gfx1151, gfx942, and gfx950).
The NVIDIA CUDA line doesn't list every GPU either, but it does say "with special care for the DGX Spark", so I could say "with special care for Strix Halo" or something similar.
This PR enables DeepSeek V4 Pro to run on AMD Instinct MI300X by adding CDNA-oriented ROCm kernels and sharding model layers across local ROCm GPUs.
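As a rough illustration of what sharding layers across local GPUs can look like, here is a minimal sketch that splits a layer range into contiguous per-GPU slices. The function names, the even-split policy, and the per-layer byte figure are assumptions made for illustration; the PR does not describe its actual partitioning rule.

```python
from typing import List, Tuple


def partition_layers(n_layers: int, n_gpus: int) -> List[Tuple[int, int]]:
    """Split layers [0, n_layers) into n_gpus contiguous half-open slices.

    Hypothetical policy: sizes differ by at most one layer, with the
    earlier ranks taking the remainder.
    """
    if n_gpus <= 0 or n_layers < n_gpus:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(n_layers, n_gpus)
    slices = []
    start = 0
    for rank in range(n_gpus):
        size = base + (1 if rank < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


def state_bytes_for_slice(layer_slice: Tuple[int, int], bytes_per_layer: int) -> int:
    """Per-worker state scales with the owned slice, not the whole model."""
    start, end = layer_slice
    return (end - start) * bytes_per_layer


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the PR.
    for rank, sl in enumerate(partition_layers(10, 4)):
        print(rank, sl, state_bytes_for_slice(sl, 1 << 20))
```

Each worker would then size its graph, KV, and cache buffers from its own slice only, which is the idea behind allocating state per owned layer range.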
Summary
- Add CDNA3/CDNA4 direct MFMA wrapper kernels for f16 MFMA:
  - gfx942 uses mfma_f32_16x16x16_f16
  - gfx950 uses mfma_f32_16x16x32_f16
- Add a CDNA Q8 batch matmul/MFMA prefill path.
- Add ROCm MoE kernel fixes for CDNA correctness, including disabling the broken IQ2/Q2 float-down WMMA overlay.
- Add ROCm attention/activation fixes to avoid fp16 overflow and repeated BOS failures.
- Add MI300X/CDNA build targets, with CDNA4 gfx950 compile plumbing.
- Add local --gpus launcher using the existing distributed runtime to shard layers across local GPUs.
- Support repeated -m model shards independent of argument order.
- Allocate graph/KV/cache state only for the layer slice owned by each worker.
- Add model-cache preflight checks for early actionable OOM errors.
- Use BF16 for 16-bit distributed activation transport.
- Add MI300X/ROCm smoke scripts and a synthetic Q8 MFMA correctness test.
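The summary mentions both fp16 overflow fixes and BF16 for 16-bit activation transport. The PR does not state its reasoning, but one plausible motivation is dynamic range: fp16 tops out at 65504, while BF16 keeps the float32 exponent. The sketch below shows that difference using only the Python standard library; the helper names are hypothetical and the conversion handles finite inputs only.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def fits_fp16(x: float) -> bool:
    """True if x can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def f32_to_bf16_bits(x: float) -> int:
    """Convert a finite float to BF16 bits (round to nearest even).

    Assumption: NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the 16 dropped bits
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Expand BF16 bits back to float32 by zero-filling the low mantissa."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    x = 70000.0  # an activation magnitude above FP16_MAX
    print(fits_fp16(x))
    print(bf16_bits_to_f32(f32_to_bf16_bits(x)))
```

The trade-off is precision: BF16 keeps 8 significant bits against fp16's 11, so the round-tripped value is close to the input but not exact.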
Validation
Validated on MI300X / CDNA3:
make mi300x
git diff --check
Also validated a local sharded Pro Q4 run across MI300X GPUs, including reversed -m shard order.
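To make the reversed-order check concrete, here is a minimal sketch of one way repeated -m arguments could be normalized so that argument order does not matter. The `-NNNNN-of-NNNNN` file naming, the example paths, and the function name are all assumptions for illustration; the PR does not say how shards are identified.

```python
import re
from typing import List

# Hypothetical naming scheme: "<name>-00002-of-00003.<ext>"
_SHARD_RE = re.compile(r"-(\d+)-of-(\d+)\.[A-Za-z0-9]+$")


def canonical_shard_order(paths: List[str]) -> List[str]:
    """Return shard paths sorted by shard index, validating completeness."""
    indexed = []
    totals = set()
    for path in paths:
        match = _SHARD_RE.search(path)
        if match is None:
            raise ValueError(f"not a recognized shard name: {path}")
        indexed.append((int(match.group(1)), path))
        totals.add(int(match.group(2)))
    if len(totals) != 1:
        raise ValueError("shards disagree on the total shard count")
    total = totals.pop()
    if sorted(i for i, _ in indexed) != list(range(1, total + 1)):
        raise ValueError("missing or duplicate shard")
    return [path for _, path in sorted(indexed)]


if __name__ == "__main__":
    shards = ["m-00001-of-00003.bin", "m-00002-of-00003.bin", "m-00003-of-00003.bin"]
    print(canonical_shard_order(list(reversed(shards))) == shards)
```

A launcher built this way would produce the same layer assignment whether the shards are passed forward or reversed.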
Notes
CDNA4 / gfx950 kernel selection and build plumbing are included, but runtime validation has not been performed because I do not have CDNA4 hardware.
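The kernel selection described in the summary can be pictured as a lookup from GPU architecture to MFMA instruction. The sketch below models only that mapping, using the two instruction names given in the summary; the table layout, function names, and the K-step arithmetic are illustrative assumptions, not the PR's code.

```python
from typing import Tuple

# Architecture -> f16 MFMA instruction, as listed in the PR summary.
MFMA_F16_BY_ARCH = {
    "gfx942": "mfma_f32_16x16x16_f16",  # CDNA3 (MI300X)
    "gfx950": "mfma_f32_16x16x32_f16",  # CDNA4, not runtime-validated
}


def select_mfma(arch: str) -> Tuple[str, int]:
    """Return (instruction name, K elements consumed per instruction)."""
    try:
        name = MFMA_F16_BY_ARCH[arch]
    except KeyError:
        raise ValueError(f"no f16 MFMA path listed for {arch}") from None
    # The name encodes the tile shape as MxNxK, e.g. "16x16x32".
    shape = name.split("_")[2]
    k_per_instr = int(shape.split("x")[2])
    return name, k_per_instr


def mfma_steps_along_k(arch: str, k: int) -> int:
    """Number of MFMA issues needed to cover a reduction of length k."""
    _, k_per_instr = select_mfma(arch)
    return -(-k // k_per_instr)  # ceiling division


if __name__ == "__main__":
    for arch in MFMA_F16_BY_ARCH:
        print(arch, select_mfma(arch), mfma_steps_along_k(arch, 4096))
```

Because the gfx950 instruction covers twice the K depth per issue, the same reduction needs half as many MFMA steps, which is why that path has its own kernel selection and needs its own runtime validation.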