Enable MI300X ROCm support #484

Open
ehartford wants to merge 3 commits into antirez:main from QuixiAI:main

@ehartford

This PR enables DeepSeek V4 Pro to run on AMD Instinct MI300X by adding CDNA-oriented ROCm kernels and sharding model layers across local ROCm GPUs.

Summary

  • Add direct f16 MFMA wrapper kernels for CDNA3/CDNA4:
    - gfx942 uses mfma_f32_16x16x16_f16
    - gfx950 uses mfma_f32_16x16x32_f16

  • Add a CDNA Q8 batch matmul/MFMA prefill path.

  • Add ROCm MoE kernel fixes for CDNA correctness, including disabling the broken IQ2/Q2 float-down WMMA overlay.

  • Add ROCm attention/activation fixes to avoid fp16 overflow and repeated BOS failures.

  • Add MI300X/CDNA build targets, with CDNA4 gfx950 compile plumbing.

  • Add local --gpus launcher using the existing distributed runtime to shard layers across local GPUs.

  • Support repeated -m model shards independent of argument order.

  • Allocate graph/KV/cache state only for the layer slice owned by each worker.

  • Add model-cache preflight checks for early actionable OOM errors.

  • Use BF16 for 16-bit distributed activation transport.

  • Add MI300X/ROCm smoke scripts and a synthetic Q8 MFMA correctness test.
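
To illustrate the BF16 transport item above, here is a minimal sketch of an f32 to bf16 conversion with round-to-nearest-even. It is not the code from this PR; the function names `f32_to_bf16` and `bf16_to_f32` are made up for illustration. One plausible reason to prefer BF16 over fp16 on the wire is range: fp16 tops out at 65504, while bf16 keeps the full 8-bit f32 exponent and trades away mantissa bits instead.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: convert f32 to bf16 with round-to-nearest-even.
 * bf16 keeps the f32 sign and 8-bit exponent and only 7 mantissa bits. */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if (isnan(f)) {
        /* Set a mantissa bit that survives truncation so NaN cannot turn into Inf. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    /* Bias for the 16 dropped bits: exact ties go to the even bf16 value. */
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding_bias) >> 16);
}

/* Hypothetical helper: widen bf16 back to f32 (exact, no rounding needed). */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

A value such as 100000.0f, which would overflow fp16, stays finite after a round trip through these helpers, at the cost of precision (only about 8 significant bits survive).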

Validation

Validated on MI300X / CDNA3:

make mi300x
git diff --check

Also validated a local sharded Pro Q4 run across MI300X GPUs, including reversed -m shard order.
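
For readers unfamiliar with layer sharding, the idea behind a run like this can be sketched as follows. This is an assumption-laden illustration, not the PR's code: it assumes an even, contiguous split of layers across workers, and the names `layer_slice`, `layer_slice_for`, and `kv_bytes_for_slice` are hypothetical. The actual split in the PR may weigh layers differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the contiguous range of layers one worker owns. */
typedef struct {
    int first; /* index of the first owned layer (inclusive) */
    int count; /* number of owned layers */
} layer_slice;

/* Split n_layers into n_workers contiguous slices whose sizes differ by at most one. */
static layer_slice layer_slice_for(int n_layers, int n_workers, int worker) {
    assert(n_layers >= 0 && n_workers > 0);
    assert(worker >= 0 && worker < n_workers);
    int base = n_layers / n_workers;
    int extra = n_layers % n_workers; /* the first `extra` workers get one more layer */
    layer_slice s;
    s.count = base + (worker < extra ? 1 : 0);
    s.first = worker * base + (worker < extra ? worker : extra);
    return s;
}

/* Per-worker KV cache size when only the owned slice is allocated.
 * kv_bytes_per_layer is an assumed, caller-supplied figure. */
static size_t kv_bytes_for_slice(layer_slice s, size_t kv_bytes_per_layer) {
    return (size_t)s.count * kv_bytes_per_layer;
}
```

With 10 layers over 4 workers, this yields slices of 3, 3, 2, and 2 layers starting at layers 0, 3, 6, and 8, so each worker sizes its state for its own slice rather than for the whole model.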

Notes

CDNA4 / gfx950 kernel selection and build plumbing are included, but runtime validation has not been performed because I do not have CDNA4 hardware.
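
As a rough picture of what kernel selection by target means here, the sketch below maps an architecture name to the K depth of the f16 MFMA tile listed in the summary (16 for gfx942, 32 for gfx950). It is a hypothetical runtime helper named `mfma_f16_k_for_arch`; the PR may instead select kernels at compile time. It assumes the name has the form ROCm typically reports, such as "gfx942:sramecc+:xnack-".

```c
#include <string.h>

/* Hypothetical helper: K depth of the f16 MFMA tile for a given target,
 * following the summary: gfx942 -> 16x16x16, gfx950 -> 16x16x32.
 * Returns 0 for any other target, meaning no direct MFMA path in this sketch. */
static int mfma_f16_k_for_arch(const char *arch_name) {
    /* Compare only the part before the first ':' (feature flags follow it). */
    size_t len = strcspn(arch_name, ":");
    if (len == 6 && strncmp(arch_name, "gfx942", 6) == 0) return 16;
    if (len == 6 && strncmp(arch_name, "gfx950", 6) == 0) return 32;
    return 0;
}
```

A caller could fall back to a generic path when the helper returns 0, which is what would happen for a gfx1151 (Strix Halo) device in this sketch.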

Comment thread on README.md (outdated)
* **Metal** is our primary target. Starting from MacBooks with 96GB of RAM (or less, using SSD streaming).
* **NVIDIA CUDA / DGX Spark**, CUDA with special care for the DGX Spark.
* **Strix Halo (ROCm)**, systems like the Framework Desktop and other systems based on the same GPU and unified RAM design.
* **AMD ROCm**, validated on AMD Instinct CDNA3 / MI300X. CDNA4 (`gfx950`) build targets are included but still need runtime validation on CDNA4 hardware. Strix Halo uses the `gfx1151` target.

Why did you erase information about AMD Strix Halo?

Author

Accident. I'll fix it.

Author

That's because Strix Halo (gfx1151) is just one AMD ROCm target; there are now three supported targets (gfx1151, gfx942, and gfx950).

The NVIDIA CUDA line doesn't list every GPU either, but it does say "with special care for the DGX Spark", so I could say "with special care for Strix Halo" or something similar.

Author

Do you prefer this?

ed63605
