Enable MI300X ROCm support by ehartford · Pull Request #484 · antirez/ds4

ehartford · 2026-07-01T18:18:24Z

This PR enables DeepSeek V4 Pro to run on AMD Instinct MI300X by adding CDNA-oriented ROCm kernels and sharding model layers across local ROCm GPUs.

Summary

- Add direct f16 MFMA wrapper kernels for CDNA3/CDNA4:
  - gfx942 uses mfma_f32_16x16x16_f16
  - gfx950 uses mfma_f32_16x16x32_f16
- Add a CDNA Q8 batch matmul/MFMA prefill path.
- Add ROCm MoE kernel fixes for CDNA correctness, including disabling the broken IQ2/Q2 float-down WMMA overlay.
- Add ROCm attention/activation fixes to avoid fp16 overflow and repeated BOS failures.
- Add MI300X/CDNA build targets, with CDNA4 gfx950 compile plumbing.
- Add a local --gpus launcher that uses the existing distributed runtime to shard layers across local GPUs.
- Support repeated -m model shards independent of argument order.
- Allocate graph/KV/cache state only for the layer slice owned by each worker.
- Add model-cache preflight checks for early, actionable OOM errors.
- Use BF16 for 16-bit distributed activation transport.
- Add MI300X/ROCm smoke scripts and a synthetic Q8 MFMA correctness test.
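
To illustrate the fp16 overflow and BF16 transport items above, here is a minimal Python sketch. It is not code from this PR. The function names are hypothetical, and the conversion uses generic round-to-nearest-even, which may differ from what the ROCm kernels actually do. It shows that a value such as 70000 cannot be packed as fp16 (maximum finite value 65504) but survives a BF16 round trip, because BF16 keeps float32's 8-bit exponent at the cost of mantissa precision.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Round a float32 value to bfloat16 (round-to-nearest-even); return 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: truncate and force a mantissa bit so it stays NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit, so ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Expand a bfloat16 bit pattern to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def fits_in_fp16(x: float) -> bool:
    """True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

if __name__ == "__main__":
    activation = 70000.0  # illustrative value above the fp16 maximum
    print(fits_in_fp16(activation))                          # False
    print(bf16_bits_to_f32(f32_to_bf16_bits(activation)))    # 70144.0
```

The round trip returns 70144.0 rather than 70000.0: BF16 has only 8 bits of mantissa precision, so the trade-off is range for precision.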

Validation

Validated on MI300X / CDNA3:

make mi300x
git diff --check

Also validated a local sharded Pro Q4 run across MI300X GPUs, including reversed -m shard order.
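
As a rough picture of what sharding layers across local GPUs means, the sketch below splits a layer range into contiguous slices, one per worker. This is an assumption for illustration only: the PR text does not state the split policy, `layer_slices` and `kv_bytes_for_worker` are hypothetical names, and the layer count and per-layer size are made-up numbers.

```python
def layer_slices(n_layers: int, n_gpus: int) -> list[range]:
    """Split layers 0..n_layers-1 into n_gpus contiguous, near-equal slices."""
    if n_gpus <= 0 or n_layers < n_gpus:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(n_layers, n_gpus)
    slices = []
    start = 0
    for rank in range(n_gpus):
        # The first `extra` workers take one additional layer.
        count = base + (1 if rank < extra else 0)
        slices.append(range(start, start + count))
        start += count
    return slices

def kv_bytes_for_worker(owned: range, bytes_per_layer: int) -> int:
    """State sized only for the layers this worker owns (hypothetical sizing)."""
    return len(owned) * bytes_per_layer

if __name__ == "__main__":
    # Illustrative numbers, not the real model's layer count or KV size.
    for rank, owned in enumerate(layer_slices(10, 4)):
        print(rank, list(owned), kv_bytes_for_worker(owned, 1 << 20))
```

With a contiguous split, each worker only needs to receive activations from the previous slice and send them to the next, which fits the per-worker allocation described in the Summary.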

Notes

CDNA4 / gfx950 kernel selection and build plumbing are included, but runtime validation has not been performed because I do not have CDNA4 hardware.
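
The kernel selection mentioned here can be pictured as a lookup from the GPU architecture name to the MFMA instruction listed in the Summary. The sketch below is illustrative only: the table contents come from the Summary, but `select_mfma_f16` and `mfma_steps` are hypothetical names, and the real selection is presumably done at compile time in the ROCm kernels rather than in Python. The K values (16 and 32) are read from the MxNxK suffix of the instruction names.

```python
import math

# Architecture -> (f16 MFMA instruction, K elements consumed per instruction).
MFMA_F16_BY_ARCH = {
    "gfx942": ("mfma_f32_16x16x16_f16", 16),  # CDNA3 / MI300X
    "gfx950": ("mfma_f32_16x16x32_f16", 32),  # CDNA4, not runtime-validated
}

def select_mfma_f16(arch: str):
    """Return (instruction, k_per_instr) for a CDNA arch, or None otherwise.

    ROCm often reports names with feature suffixes, for example
    "gfx942:sramecc+:xnack-", so only the part before the first ':' is used.
    """
    base = arch.split(":", 1)[0]
    return MFMA_F16_BY_ARCH.get(base)

def mfma_steps(k_dim: int, arch: str) -> int:
    """Number of MFMA issues needed to cover a K dimension of k_dim."""
    choice = select_mfma_f16(arch)
    if choice is None:
        raise ValueError(f"no f16 MFMA path for {arch}")
    return math.ceil(k_dim / choice[1])

if __name__ == "__main__":
    print(select_mfma_f16("gfx942:sramecc+:xnack-"))
    print(mfma_steps(4096, "gfx942"), mfma_steps(4096, "gfx950"))
```

Because the gfx950 instruction covers twice the K depth per issue, the same reduction needs half as many MFMA steps, which is why the two targets need separate wrapper kernels.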

OPS-NeoRetro · 2026-07-02T10:13:31Z

 * **Metal** is our primary target. Starting from MacBooks with 96GB of RAM (or less, using SSD streaming).
 * **NVIDIA CUDA / DGX Spark**, CUDA with special care for the DGX Spark.
-* **Strix Halo (ROCm)**, systems like the Framework Desktop and other systems based on the same GPU and unified RAM design.
+* **AMD ROCm**, validated on AMD Instinct CDNA3 / MI300X. CDNA4 (`gfx950`) build targets are included but still need runtime validation on CDNA4 hardware. Strix Halo uses the `gfx1151` target.


Why did you erase information about AMD Strix Halo?

ehartford · 2026-07-02

Accident. I'll fix it.

The reason is that Strix Halo (gfx1151) is now just one of three supported AMD ROCm targets: gfx1151, gfx942, and gfx950.

The NVIDIA CUDA line doesn't list every GPU either, but it does say "with special care for the DGX Spark", so I could say "with special care for Strix Halo" or something similar.

Do you prefer this?

ed63605

Eric Hartford added 2 commits June 21, 2026 14:11

Add CDNA ROCm MFMA build path

50778f0

enable mi300x

725661b

OPS-NeoRetro reviewed Jul 2, 2026

fix README.md mention of Strix Halo

ed63605

ehartford wants to merge 3 commits into antirez:main from QuixiAI:main
