[draft] minimax & kimi Amd/vllm disagg mvp dev #1141
Open
ichbinblau wants to merge 31 commits into SemiAnalysisAI:amd/vllm_disagg_mvp_dev
Conversation
Force-pushed from aabc1f7 to a7364fa
Add multi-node vLLM PD disaggregation recipe using Nixl/RIXL KV transfer and vllm-router, mirroring the existing SGLang disagg recipe structure.
- New benchmark config: dsr1-fp8-mi355x-vllm-disagg (1P2D, TP8)
- New utils: vllm_disagg_utils/ (job.slurm, server.sh, submit.sh, etc.)
- Runner: extend launch_mi355x-amds.sh for vllm-disagg framework
Extract hardcoded model configurations from server.sh bash maps and job.slurm VALID_MODELS into a declarative models.yaml, mirroring the SGLang disagg recipe pattern. Adding a new model now requires no script changes.

Also:
- Consolidate UCX transport vars in job.slurm Docker env; remove duplicated setup_ucx_env() from server.sh
- Extract RDMA workarounds (ionic /31 route fix, Nixl UCX patch) into setup_rdma_env() helper
- Lower UCX_LOG_LEVEL from info to warn
- Add nicctl mount and QoS/DSCP auto-detection to env.sh
- Remove stale host libionic bind-mounts (driver now built into image)
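For context, a minimal sketch of what a declarative per-model entry and its lookup from server.sh could look like. The schema and key names here (model_path, prefill_flags, decode_flags) are illustrative and assume PyYAML is available in the container; they are not necessarily the recipe's actual models.yaml layout.

```bash
# Hypothetical models.yaml entry plus a shell lookup; names are illustrative.
cat > models.yaml <<'EOF'
deepseek-r1-fp8:
  model_path: /models/DeepSeek-R1-0528-FP8
  prefill_flags: "--tensor-parallel-size 8"
  decode_flags: "--tensor-parallel-size 8 --enable-expert-parallel"
EOF

MODEL_KEY=deepseek-r1-fp8
# Read one field with PyYAML (assumed present in the image).
PREFILL_FLAGS=$(python3 -c "
import yaml
cfg = yaml.safe_load(open('models.yaml'))
print(cfg['${MODEL_KEY}']['prefill_flags'])
")
echo "prefill flags: ${PREFILL_FLAGS}"
```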
Adapt server.sh to vLLM v0.17.1 breaking changes:
- Use simplified kv-transfer-config (side channel via env vars instead of kv_ip/kv_port, add kv_load_failure_policy)
- Remove deprecated --disable-log-requests (disabled by default in v0.17)
- Route NIXL side channel through RDMA IP for correct fabric path
- Fix RIXL ucx_error_handling_mode patch for updated _api.py layout
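A minimal sketch of the simplified prefill launch after this change, assuming the side-channel address is carried by vLLM's VLLM_NIXL_SIDE_CHANNEL_HOST / VLLM_NIXL_SIDE_CHANNEL_PORT env vars and that kv_role "kv_both" applies; flags and ports are illustrative, and the new kv_load_failure_policy field is left out because its accepted values are not shown in this commit.

```bash
# Side channel via env vars instead of kv_ip/kv_port in the JSON config;
# bind it to the RDMA IP so KV traffic takes the fabric path.
export VLLM_NIXL_SIDE_CHANNEL_HOST="${RDMA_IP}"   # assumed vLLM NixlConnector env var
export VLLM_NIXL_SIDE_CHANNEL_PORT=5557

vllm serve "${MODEL_PATH}" \
  --tensor-parallel-size 8 \
  --port 2584 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```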
bench.sh: replace `vllm bench serve` (log-only output) with the shared run_benchmark_serving helper from benchmark_lib.sh, matching the SGLang disagg pattern. This produces the .json result files that the multinode CI workflow expects (benchmark-multinode-tmpl.yml → process_result.py).

server.sh: make the Nixl ucx_error_handling_mode=none runtime patch conditional on Pensando ionic RDMA devices (IBDEVICES=*ionic*). On the mia1 cluster (ConnectX/mlx5, IBDEVICES=rdma*), UCX handles error mode natively and the patch is skipped.

Model-path resolution and IBDEVICES/UCX/QoS auto-detection were verified to already work on mia1 — no changes needed.

Tested locally (Job 2802, 1P+2D, ISL/OSL=1024):
- conc 8  → 507 tok/s
- conc 16 → 1004 tok/s
- conc 32 → 1778 tok/s
- conc 64 → 2480 tok/s

All four .json result files produced; 100% external prefix cache hit rate.
Move the vllm-router from a dedicated proxy node onto the first prefill node, mirroring SGLang's co-location pattern. This reduces the node count from xP + yD + 1 to xP + yD (e.g., 3 nodes instead of 4 for 1P+2D).

- server.sh: NODE_RANK=0 now runs both vllm serve (prefill, port 2584) and vllm-router (port 30000); barrier waits on all nodes
- submit.sh / job.slurm: NUM_NODES = PREFILL_NODES + DECODE_NODES
- bench.sh: ROUTER_PORT default updated to 30000

Local 1P+2D benchmark (ISL/OSL=1024, DeepSeek-R1 FP8, MI355X):
- Throughput: +1.6% to +8.4% across concurrency 8-64
- Mean TTFT: -22% to -63% (prefill is local to router)
- TPOT/ITL: unchanged (within noise)
- 25% fewer nodes, no performance regression
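A tiny sketch of the node-count change in submit.sh; the variable names follow the commit, the sbatch line is illustrative and left commented out.

```bash
# The router rides on prefill node 0, so the allocation is just prefill + decode.
PREFILL_NODES=1
DECODE_NODES=2
NUM_NODES=$((PREFILL_NODES + DECODE_NODES))   # 3 nodes instead of 4 (xP + yD + 1)
echo "Requesting ${NUM_NODES} nodes"
# sbatch --nodes="${NUM_NODES}" job.slurm     # actual submission
```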
Replace the custom Docker image (vllm_disagg_pd:latest) with the public vllm/vllm-openai-rocm:v0.17.1 base image. Missing components (UCX, RIXL, etcd, libionic1, vllm-router) are now installed at container start via setup_deps.sh, which is sourced by server.sh. This eliminates the need to build, host, and maintain a custom image — CI nodes can pull directly from Docker Hub.

Changes:
- Add setup_deps.sh: idempotent installer for UCX (ROCm fork), RIXL, etcd, libionic1 (Pensando ionic), and vllm-router (NODE_RANK=0 only). Build steps run in subshells to avoid CWD pollution.
- server.sh: source setup_deps.sh before any other logic
- job.slurm: add --entrypoint "" to override the base image's vllm CLI entrypoint, allowing bash -lc to work correctly
- env.sh: update comment (paths now set by setup_deps.sh, not image ENV)
- amd-master.yaml: image changed to vllm/vllm-openai-rocm:v0.17.1

Tested locally (Job 2807, 3 nodes, ISL/OSL=1024):
- Setup overhead: ~2.5 min per node (all components built from source)
- Benchmark completed successfully across concurrency 8/16/32/64
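A sketch of the idempotent, subshell-based install pattern described above, using etcd as the example; the function name, version, and paths are illustrative rather than the actual setup_deps.sh contents.

```bash
# Idempotent installer: skip if already present, build/extract in a subshell
# so the caller's working directory is never polluted.
install_etcd() {
  if command -v etcd >/dev/null 2>&1; then
    echo "etcd already installed, skipping"
    return 0
  fi
  (
    set -euo pipefail
    cd /tmp
    curl -fsSL -o etcd.tar.gz \
      "https://github.com/etcd-io/etcd/releases/download/v3.5.15/etcd-v3.5.15-linux-amd64.tar.gz"
    tar -xzf etcd.tar.gz
    install -m 0755 etcd-v3.5.15-linux-amd64/etcd /usr/local/bin/etcd
  )
}
```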
…ecode

Enable MoRI-based Expert Parallelism (--enable-expert-parallel --all2all-backend mori) on decode workers for DeepSeek-R1-0528, while keeping TP=8 to preserve KV cache transfer compatibility with the prefill node via NixlConnector. This matches SGLang's approach of TP=8 + EP within the TP group.

KV Transfer: RIXL/NixlConnector (unchanged)
MoE All-to-All: NCCL (default) -> MoRI-EP (--all2all-backend mori)

Changes:
- models.yaml: Add --enable-expert-parallel --all2all-backend mori to decode_flags; increase engine ready timeout to 1200s
- setup_deps.sh: Add MoRI install and vLLM v0.17.1 patches for MoRI-EP + FP8 compatibility (AITER assertion, defer_input_quant)
- server.sh: Support decode_env from models.yaml for decode-specific environment overrides
- dsr1_fp8_mi355x_vllm-disagg.sh: Pass NODELIST to submit.sh for Slurm node constraints
…roxy

Replace NixlConnector with MoRIIOConnector for KV cache transfer and replace the Rust-based vllm-router with a MoRI-IO-aware Python proxy that handles both HTTP routing and ZMQ-based RDMA endpoint discovery.

The key architectural change is that the proxy enriches each request's kv_transfer_params with remote RDMA endpoint info (handshake_port, notify_port, host, port) before dispatching, enabling concurrent prefill+decode in WRITE mode — something vllm-router could not do because it only understands HTTP, not the MoRI-IO registration protocol.

Changes:
- Add moriio_proxy.py: MoRI-IO-aware proxy with ZMQ service discovery, request enrichment, and /health endpoint (adapted from vLLM upstream moriio_toy_proxy_server.py)
- server.sh: switch --kv-transfer-config from NixlConnector to MoRIIOConnector with kv_connector_extra_config (proxy_ip, proxy_ping_port, http_port); launch proxy before prefill on NODE_RANK=0; set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1 as workaround for v0.17.1 completion-ID mismatch (upstream fix: vllm-project/vllm#34907)
- setup_deps.sh: replace vllm-router/Rust install with lightweight Python deps (quart, aiohttp, msgpack, pyzmq) for the proxy

Benchmark (Job 2853 vs 2818 NixlConnector baseline, ISL/OSL=1024):
- TTFT median: -37% to -55% across C8–C64 (e.g. 384→241ms @C64)
- TTFT p99: -63% at C64 (6622→2469ms)
- Throughput: +8% at C64 (2634→2844 tok/s)
- TPOT: unchanged (~22ms @C64)
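A sketch of the NODE_RANK=0 launch sequence after this switch. The connector name, extra-config keys, and env var come from the commit text; the kv_role value, port numbers, proxy arguments, and variable defaults are assumptions.

```bash
# Workaround for the v0.17.1 completion-ID mismatch (see upstream fix referenced above).
export VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1

PROXY_IP="${PROXY_IP:-$(hostname -I | awk '{print $1}')}"   # illustrative default
PROXY_PING_PORT="${PROXY_PING_PORT:-36367}"                  # illustrative default

# The proxy starts first so engines can register their RDMA endpoints over ZMQ.
python3 moriio_proxy.py > proxy.log 2>&1 &

vllm serve "${MODEL_PATH}" \
  --tensor-parallel-size 8 \
  --port 2584 \
  --kv-transfer-config "{\"kv_connector\": \"MoRIIOConnector\",
                         \"kv_role\": \"kv_both\",
                         \"kv_connector_extra_config\": {
                           \"proxy_ip\": \"${PROXY_IP}\",
                           \"proxy_ping_port\": ${PROXY_PING_PORT},
                           \"http_port\": 30000}}" \
  > prefill.log 2>&1 &
```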
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
… still ran the first barrier; 2. kill and kill run only when DRY_RUN=0

Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Enable READ-mode KV transfer (decode-initiated RDMA reads) with a critical scheduler assertion fix, and add safety timeouts to prevent indefinite hangs during RDMA transfers.

Changes:
- setup_deps.sh: Add patches — save_kv_layer/start_load_kv handshake timeouts (30s), RDMA transfer timeout (120s), deferred write task expiry (60s), write worker error handling, and scheduler assertion fix for READ-mode intermediate request states
- moriio_proxy.py: Add stream idle timeout (PROXY_STREAM_IDLE_TIMEOUT) to abort stalled decode streams, and proper response.release()
- submit.sh, job.slurm: Plumb PROXY_STREAM_IDLE_TIMEOUT and VLLM_MORIIO_CONNECTOR_READ_MODE env vars into Docker containers

Validated: 1k/1k full sweep (C8–C512), 100% success rate at all concurrency levels, peak 8500 output tok/s at C512.
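A sketch of how job.slurm might forward the two new knobs into the container. The env var names are from the commit; their default values, the mount paths, and the image tag shown here are illustrative.

```bash
PROXY_STREAM_IDLE_TIMEOUT="${PROXY_STREAM_IDLE_TIMEOUT:-300}"     # seconds, illustrative
VLLM_MORIIO_CONNECTOR_READ_MODE="${VLLM_MORIIO_CONNECTOR_READ_MODE:-1}"  # assumed flag value

docker run --rm --network host --entrypoint "" \
  -e PROXY_STREAM_IDLE_TIMEOUT="${PROXY_STREAM_IDLE_TIMEOUT}" \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE="${VLLM_MORIIO_CONNECTOR_READ_MODE}" \
  -v "${PWD}:/workspace" -w /workspace \
  vllm/vllm-openai-rocm:v0.17.1 \
  bash -lc "./server.sh"
```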
Port the vLLM disaggregated serving pipeline from the 4N cluster (Pensando ionic NICs) to the 9N mia1 cluster (mlx5/rdma NICs).

Key changes:
- Fix C512 deadlock: apply ucx_error_handling_mode=none universally instead of only for ionic NICs. Under high concurrency, UCX's default UCP_ERR_HANDLING_MODE_PEER prevents RIXL RDMA READ retries from recovering after ibv_post_send queue exhaustion, causing prefill KV cache saturation and pipeline deadlock.
- Force-reinstall MoRI from b645fc8 to fix PCI topology assertion failure on nodes with Broadcom PEX890xx PCIe switches.
- Auto-detect Docker privilege (sudo vs non-sudo) for cross-cluster portability.
- Add SLURM_EXCLUDE_NODES support to skip nodes with broken Docker sockets.
- Increase VLLM_ENGINE_READY_TIMEOUT_S to 3600 to accommodate longer setup times (RIXL/MoRI source builds over NFS).
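A sketch of the SLURM_EXCLUDE_NODES support. The variable name is from the commit; wiring it through sbatch --exclude is an assumption about how submit.sh consumes it, and the node names are examples.

```bash
# Skip known-bad hosts (e.g. broken Docker sockets) without editing job.slurm.
EXCLUDE_ARGS=()
if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
  # e.g. SLURM_EXCLUDE_NODES="mia1-node03,mia1-node07"
  EXCLUDE_ARGS+=(--exclude="${SLURM_EXCLUDE_NODES}")
fi

sbatch --nodes="${NUM_NODES}" "${EXCLUDE_ARGS[@]}" job.slurm
```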
…rdening

Server-side: RIXL can lose `finished_sending` notifications under high concurrency with ibv_post_send failures, permanently leaking prefill KV blocks. Over multiple benchmark rounds (sweep), leaked blocks accumulate and saturate the prefill KV cache, deadlocking C512.
- Fix finished_sending handler to unconditionally free KV blocks (the conditional status check had no recovery path, causing leaks)
- Add idle KV block reaper: detects engine idle >5s with finished requests still holding blocks, then force-frees them
- Add 10s cooldown between benchmark rounds for reaper activation

Client-side: SSE streaming loop did not break on the [DONE] sentinel, causing the benchmark client to hang when the proxy held connections open after request completion.
- Break SSE loop on [DONE] in completions and chat completions
- Share a single aiohttp.ClientSession across all requests (connection pooling via TCPConnector instead of per-request session creation)
- Add asyncio.wait_for timeout around asyncio.gather with proper task cancellation and partial result collection
- Reduce AIOHTTP_TIMEOUT from 6h to 30min

Verified: sweep 1K/1K C128→C256→C512 all pass (Job 6222, 9N cluster).
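To make the client-side fix concrete, here is the [DONE] sentinel an OpenAI-compatible streaming endpoint emits and which the benchmark client must break on. The actual fix lives in the Python benchmark client; this is just a shell illustration against an assumed proxy address, with model name and prompt illustrative.

```bash
# Stream a completion and stop reading as soon as the "data: [DONE]" sentinel arrives.
curl -sN http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1", "prompt": "Hello", "max_tokens": 16, "stream": true}' \
| while IFS= read -r line; do
    [[ "$line" == "data: [DONE]" ]] && break      # end-of-stream sentinel
    [[ "$line" == data:* ]] && echo "chunk: ${line#data: }"
  done
```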
…pletion

Background processes (proxy, prefill, decode, etcd) were started via `cmd 2>&1 | tee logfile &`, causing bash $! to capture the PID of tee rather than the actual process. `kill $pid` only killed tee, leaving the real process running. The proxy kept port 30000 open, so decode nodes' `sync.py wait` never detected shutdown and the Slurm job hung forever. Additionally, etcd's stderr was not redirected, holding the Docker container's main pipe open and preventing container exit even after server.sh completed.

Changes:
- Redirect all background processes to log files instead of piping through tee, so $! captures the correct PID (matches SGLang pattern)
- Redirect etcd launcher's stderr to prevent pipe leak
- Add pkill fallback cleanup for proxy, vllm, and etcd processes
- Increase barrier grace period to handle node setup time variance
- Increase container creation barrier timeout from 300s to 600s
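A minimal reproduction of the PID-capture bug and the fix, using a placeholder command in place of the real servers.

```bash
# Before: for a background pipeline, $! is the PID of the last stage (tee),
# so killing it leaves the real process running.
sleep 300 2>&1 | tee server.log &
PID=$!          # PID of tee, not sleep
kill "$PID"     # tee exits, sleep keeps running

# After: plain redirection makes $! the process's own PID.
sleep 300 > server.log 2>&1 &
PID=$!          # PID of sleep itself
kill "$PID"     # actually stops the process

# Fallback cleanup, as added in the commit (process pattern illustrative):
pkill -f "moriio_proxy.py" 2>/dev/null || true
```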
Fix per-node Docker privilege detection in vLLM disagg job.slurm
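A sketch of per-node Docker privilege detection; the helper name and DOCKER_CMD variable are illustrative, not necessarily what job.slurm uses.

```bash
# Pick "docker" or "sudo docker" per node, failing loudly if neither works.
detect_docker_cmd() {
  if docker info >/dev/null 2>&1; then
    DOCKER_CMD="docker"          # user is in the docker group
  elif sudo -n docker info >/dev/null 2>&1; then
    DOCKER_CMD="sudo docker"     # passwordless sudo available
  else
    echo "No usable Docker socket on $(hostname)" >&2
    return 1
  fi
}

detect_docker_cmd && ${DOCKER_CMD} ps
```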
Docker containers run as root, so __pycache__/*.pyc files created during benchmark_serving.py import end up root-owned on the NFS workspace. The CI runner cannot delete them, breaking checkout. Set PYTHONPYCACHEPREFIX=/tmp/pycache in the Docker env so the bytecode cache stays inside the container. Remove the previous server.sh find-and-delete workaround since the root cause is now addressed.
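A quick illustration of the setting; the mount paths and image tag are illustrative.

```bash
# With PYTHONPYCACHEPREFIX set, compiled bytecode lands in /tmp/pycache inside the
# container instead of __pycache__/ next to the sources on the NFS mount.
docker run --rm \
  -e PYTHONPYCACHEPREFIX=/tmp/pycache \
  -v /nfs/workspace:/workspace -w /workspace \
  vllm/vllm-openai-rocm:v0.17.1 \
  python3 -c "import sys; print(sys.pycache_prefix)"
```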
The idle KV block reaper only fired when both running=0 AND waiting=0. Under 8K ISL at C64+, leaked blocks filled the prefill KV cache while new requests queued in WAITING state — the non-empty wait queue prevented the reaper from ever triggering, causing a permanent hang. Remove the waiting-queue check so the reaper fires whenever no requests are actively running, which is precisely when leaked blocks can be safely reclaimed. Verified with 8K/1K sweep (C32–C512) completing without hangs.
…DECODE_EP,DECODE_DP_ATTN from amd-master.yaml config.

Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Bump vllm/vllm-openai-rocm to v0.18.0 for the dsr1-fp8-mi355x-vllm-disagg config.

Changes required by the new image:
- setup_deps.sh: drop aiohttp/pyzmq installs (now pre-installed in v0.18.0); move install_mori_proxy_deps before patches and run on all nodes so msgpack is available when patch scripts import MoRI-IO connector modules
- moriio_proxy.py: populate transfer_id in kv_transfer_params dicts (new required field in v0.18.0's moriio_connector.update_state_after_alloc)
- MoRI PCI topology bug persists in v0.18.0; rebuild from b645fc8 retained

Tested: 1K1K C8,16,32,64,128,256 on mia1 3-node (1P+2D); the C512 run is still in progress but looks good so far.
Enable vLLM disagg serving for amd/Kimi-K2.5-MXFP4 on MI355X with a 1P2D node topology (TP=8, decode EP=8).

Changes:
- amd-master.yaml: add kimik2.5-fp4-mi355x-vllm-disagg config with three seq-len scenarios (1K1K, 8K1K), READ mode enabled
- models.yaml: add Kimi-K2.5-MXFP4 server flags (PIECEWISE cudagraph, --gpu-memory-utilization 0.90, --mm-encoder-tp-mode data)
- bench.sh: add --trust-remote-code for models with custom code
- setup_deps.sh: install amd-quark for MXFP4 quantization support
- Add kimik2.5_fp4_mi355x_vllm-disagg.sh entry script

Verified with full 1K/1K sweep (CONC 8–512) on SA4N and mia1 9N cluster; all concurrency levels completed without hang.
…-IO)

Cherry-picked from ChuanLi1101/InferenceMAX:chuali/minimax-m25-vllm-disagg (commit 72a0002). Resolved conflict in models.yaml to keep both Kimi-K2.5-MXFP4 and MiniMax-M2.5 entries.

Add multi-node vLLM PD disaggregation support for MiniMax-M2.5 (FP8), following the DeepSeek R1 disagg recipe pattern.

Includes:
- models.yaml: MiniMax-M2.5 config with TP8 prefill / TP8+EP8+MoRI decode
- Entry script: minimaxm25_fp8_mi355x_vllm-disagg.sh
- amd-master.yaml: e2e test entry for 1P2D on MI355X (1k1k, 8k1k, 1k8k)

MiniMax M2.5 (230B, 256 experts, top-8 sigmoid routing, GQA) uses the same disagg infrastructure as DSR1. Unlike DeepSeek MLA models, M2.5 uses standard GQA attention, so AITER paged attention is fully supported and no block-size/cudagraph workarounds are needed.

Co-authored-by: ChuanLi1101 <Chuan.Li2@amd.com>
Co-authored-by: Claude
Made-with: Cursor
Cherry-picked from ChuanLi1101/InferenceMAX:chuali/minimax-m25-vllm-disagg (commit bb6bd0e). Adapted for the v0.18.0 base: kept the vllm/vllm-openai-rocm:v0.18.0 image (the runtime patch via setup_deps.sh is sufficient; a custom Docker image is available in docker/minimax-m25-disagg/ if needed).

Two deployment options for getting the vLLM minimax_m2.py changes into the container:
- Option A -- Custom Docker image (docker/minimax-m25-disagg/): builds from the public vLLM ROCm image and pre-installs UCX, etcd, RIXL, and patched minimax_m2.py with WideEP + MoRI + EPLB support baked in.
- Option B -- Runtime patch (setup_deps.sh): patch_minimax_m2_wideep_mori() copies the patched minimax_m2.py from the mounted InferenceX repo into the container's vLLM installation at startup.

Co-authored-by: ChuanLi1101 <Chuan.Li2@amd.com>
Co-authored-by: Claude
Made-with: Cursor
Align MiniMax M2.5 disagg naming with the existing single-node configs (minimaxm2.5_fp8_mi355x.sh, minimaxm2.5_fp8_mi300x.sh, etc.).
- amd-master.yaml: minimaxm25 -> minimaxm2.5 in config key + model prefix
- Rename entry script: minimaxm25_fp8_mi355x_vllm-disagg.sh -> minimaxm2.5_fp8_mi355x_vllm-disagg.sh
- Dockerfile: update COPY path to match the renamed script
…niMax M2.5 disagg

Align MiniMax M2.5 disagg serve parameters with the proven single-node config (minimaxm2.5_fp8_mi355x.sh). MiniMax M2.5 uses GQA (not MLA), so block-size 32 is optimal (vs block-size 1 for DeepSeek/Kimi MLA). The extra 5% GPU memory (0.95 vs the default 0.9) increases KV cache capacity for high-concurrency sweeps (C256/C512).
MiniMax M2.5 has expert intermediate_size=1536; with TP=8 and no EP, the sharded dimension (192) is not divisible by FP8 block_n=128, crashing the prefill node. Set prefill EP=8 (matching decode and the single-node config) and add --enable-expert-parallel --all2all-backend mori to prefill_flags.

Fix GateLinear to use out_dtype=torch.float32 instead of params_dtype=torch.float32 so the GEMM runs in bf16 (ROCm compatible) and only the output is cast to fp32 for routing precision.

Remove the 1K/8K benchmark scenario (not needed).
The Dockerfile, build.sh, and duplicate minimax_m2.py patch were never used by the CI pipeline or local tests.
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Force-pushed from bdb53d6 to 0734709
Update recipes for vLLM disagg: