ThePie88/vLLM-ROCm-Windows

vLLM on native Windows + AMD ROCm (RDNA3)

Glue and build tooling to run vLLM on native Windows (no WSL2) with AMD ROCm on RDNA3 consumer GPUs. Developed and tested on a Radeon RX 7900 XT (gfx1100).

This is not a fork of vLLM. It is an out-of-tree platform plugin plus a set of compatibility shims, a one-line patch, and a build harness that compiles vLLM's own HIP kernels natively on Windows. Upstream vLLM is cloned and pinned separately (see Setup).

Status (honest)

Experimental, but past "it just runs". What currently works on the test machine:

  • vLLM imports and generates correct tokens on gfx1100, native Windows, single GPU.
  • W4A16 quantized models run across formats: compressed-tensors, GPTQ-Int4, and AWQ-Int4.
  • vLLM's native exllama W4A16 GEMM (_C.gptq_gemm) is compiled natively for Windows (GPTQ models otherwise have no kernel on Windows at all).
  • A custom M=1 W4 dequant-GEMV (Triton) for AWQ-uint4 decode, which has no fast kernel on ROCm otherwise (exllama rejects uint4, Marlin is CUDA-only, leaving only the slow conch tile).
  • torch.compile / inductor works (CompilationMode.STOCK_TORCH_COMPILE), and hipGraph decode capture works (cudagraph_mode=FULL_DECODE_ONLY).
  • fp8 KV cache works (Triton path), ~2x KV-cache capacity / context length.
  • KVarN KV-cache quantization (calibration-free Hadamard + Sinkhorn + asymmetric RTN, K 4-bit / V 2-bit; Huawei's method, Triton kernels ported to gfx1100) runs end-to-end and gives ~4.7x KV capacity at ~fp16 accuracy (demonstrated on Qwen2.5-7B: 999k vs 210k KV tokens, coherent). WIP: its per-forward workspace over-allocates (~5 GiB), so today it only fits models that leave enough headroom (7-9B). It is also ~35% slower, so treat it as a capacity feature, not a speed one. Not finished; the builder memory refactor is pending.
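
The "~2x KV-cache capacity" figure for fp8 follows from the per-token KV footprint. Below is a minimal sketch of that arithmetic, assuming the usual layout of one K and one V vector per layer. The Qwen2.5-7B shape used here (28 layers, 4 KV heads, head size 128) is an assumption taken from the public model config, not something this README states.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_size, bytes_per_element):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one K vector and one V vector per KV head.
    return 2 * num_layers * num_kv_heads * head_size * bytes_per_element


# Assumed Qwen2.5-7B shape (see lead-in); swap in your model's config values.
fp16_bytes = kv_bytes_per_token(28, 4, 128, 2)
fp8_bytes = kv_bytes_per_token(28, 4, 128, 1)

print(f"fp16: {fp16_bytes} B/token, fp8: {fp8_bytes} B/token, "
      f"capacity ratio: {fp16_bytes / fp8_bytes:.1f}x")
```

For a fixed KV memory budget, halving the bytes per element doubles the number of tokens that fit, which is where the ~2x comes from.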

Performance (measured)

Single-stream decode (batch 1, greedy) on the test machine. Output was verified coherent for each model. All weights are 4-bit; KV cache fp16 unless noted.

| Model | Quantization | decode (tok/s) | KVarN KV-quant (WIP) | notes |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 (dense, 7B) | GPTQ Int4 | 115 | 4.74x KV (999k vs 210k tok), 74.7 tok/s, coherent | native exllama GEMM + hipGraph decode, gpu_memory_utilization=0.9 (spill-verified clean) |
| cyankiwi/ERNIE-4.5-21B-A3B-Thinking-AWQ-4bit (MoE, 21B / A3B active, head 128) | compressed-tensors W4A16 gs32 | 62.7 → 79.2 | - (14GB weights leave no room for the ~5 GiB KVarN workspace) | stock → +M=1 MoE-decode gather-GEMV + native wvSplitK dense. Fits 20GB with ~3 GiB free, spill-free |
| sahilchachra/Qwythos-9B-Claude-Mythos-5-1M-AWQ (Qwen3.5 hybrid, 9B) | compressed-tensors W4A16 | 61.7 | | native exllama + hipGraph, gpu_memory_utilization=0.7 (see note) |
| casperhansen/deepseek-r1-distill-qwen-14b-awq (dense, 14B) | AWQ Int4 | 50.3 | | custom M=1 W4 GEMV, autotuned, util 0.9 (spill-verified clean) |

Numbers re-measured 2026-07-02, cudagraph (FULL_DECODE_ONLY) decode, and verified spill-free by polling the Windows GPU shared-memory counter during the run (peak shared == the ~0.76 GiB desktop baseline). torch.cuda.mem_get_info() only reports dedicated VRAM free, so it does NOT catch a WDDM spill on its own.
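
The spill check described above can be sketched as a comparison between the peak shared-memory reading and the idle desktop baseline. This is a hedged illustration, not the author's script: the function names and the 0.1 GiB tolerance are hypothetical, and the README does not say which tool collected the samples (on Windows, the `\GPU Adapter Memory(*)\Shared Usage` performance counter is one plausible source).

```python
GIB = 2**30


def peak_shared_gib(samples_bytes):
    """Largest shared-GPU-memory reading seen during the run, in GiB."""
    return max(samples_bytes) / GIB


def spilled(samples_bytes, baseline_gib, tolerance_gib=0.1):
    """True if shared memory rose above the idle baseline by more than the tolerance.

    tolerance_gib is a hypothetical noise margin, not a value from the README.
    """
    return peak_shared_gib(samples_bytes) > baseline_gib + tolerance_gib


# Example: readings that stay at the ~0.76 GiB desktop baseline vs. a ~2.7 GiB overflow.
clean_run = [int(0.76 * GIB), int(0.75 * GIB), int(0.76 * GIB)]
spilled_run = [int(0.76 * GIB), int(3.46 * GIB)]

print(spilled(clean_run, baseline_gib=0.76))
print(spilled(spilled_run, baseline_gib=0.76))
```

The point of sampling a separate counter is that a dedicated-VRAM query alone can look healthy while WDDM has already paged allocations into shared system memory.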

VRAM caveat (per-model gpu_memory_utilization). Standard dense models (Qwen, deepseek) are safe at 0.9. The Qwythos-9B hybrid (linear-attention Mamba state cache + an unquantized vision tower + a 248k vocab) is NOT: vLLM's memory profiling does not account for the Mamba/vision allocations, so at 0.9 it fills dedicated VRAM with KV cache and those extras overflow ~2.7 GiB into shared DRAM. The result is a misleading, spill-slowed number (the old 39.9 tok/s). At util=0.7 it runs entirely in dedicated VRAM (6 GiB free) and is both clean and faster (61.7 tok/s). The 26B-class MoEs (gemma-4, 17GB weights) overfill 20GB at any workable util, so ERNIE-4.5-21B is used as the clean MoE bench.

Aggregate throughput scales with concurrency (Qwen2.5-7B, greedy): ~73 tok/s at batch 4, ~232 at batch 16, ~358 at batch 32.

KVarN (experimental / WIP). --kv-cache-dtype kvarn_k4v2_g128 --block-size 128 runs end-to-end on gfx1100 (K 4-bit / V 2-bit, calibration-free). On Qwen2.5-7B, vLLM sizes 999,296 KV tokens vs 210,784 in fp16 — 4.74x capacity — and generation stays coherent, at 74.7 tok/s (~35% slower than the fp16 115: KVarN is a KV-capacity feature, not a speed one). Two rough edges remain (hence WIP): (1) vLLM sizes the KV pool to fill the budget, so cap it with num_gpu_blocks_override (else it tries to allocate all ~1M tokens at once and spills); (2) KVarN's per-forward workspace over-allocates ~5 GiB (Sinkhorn / rotation / D2H staging, not counted by gpu_memory_utilization), so today it only fits models that leave that headroom (7-9B) — a 14GB-weight model like ERNIE has no room. The pending builder memory refactor would remove both.
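
The KVarN figures quoted above can be re-derived from the raw numbers in this section. The short check below uses only values from the README; the "ideal" ratio is a back-of-the-envelope bound that ignores any per-group metadata, which the README does not break down.

```python
# Values quoted in this section.
kvarn_tokens = 999_296
fp16_tokens = 210_784
kvarn_tok_s = 74.7
fp16_tok_s = 115.0

capacity_ratio = kvarn_tokens / fp16_tokens
slowdown_pct = (1.0 - kvarn_tok_s / fp16_tok_s) * 100.0

# Ideal ratio if only payload bits counted: fp16 K+V (16+16 bits) vs K 4-bit + V 2-bit.
ideal_ratio = (16 + 16) / (4 + 2)

print(f"measured capacity: {capacity_ratio:.2f}x")
print(f"payload-only bound: {ideal_ratio:.2f}x")
print(f"decode slowdown:   {slowdown_pct:.1f}%")
```

The measured ratio sits below the payload-only bound, which is what one would expect if the quantized cache also stores scales or other metadata.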

Decode is still below the card's ~800 GB/s memory-bandwidth roofline; per-shape GEMV tuning, fp8-KV scale calibration, and porting the rest of the csrc kernels are ongoing.
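
For context on the roofline remark, here is a rough sketch of how a bandwidth-bound decode ceiling is usually estimated: each generated token has to stream the active weights (plus KV) from VRAM once. The ~800 GB/s figure is from this README; the 4.0 GB-per-token value is a purely hypothetical placeholder, not a measurement of any model listed here.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on single-stream decode speed if memory bandwidth is the only limit."""
    return bandwidth_gb_s / gb_read_per_token


def bandwidth_efficiency(measured_tok_s, ceiling_tok_s):
    """Fraction of the bandwidth ceiling actually achieved."""
    return measured_tok_s / ceiling_tok_s


# Hypothetical: a model whose decode step reads 4.0 GB per token.
ceiling = decode_ceiling_tok_s(800.0, 4.0)
print(f"ceiling: {ceiling:.0f} tok/s, "
      f"efficiency at 115 tok/s: {bandwidth_efficiency(115.0, ceiling):.1%}")
```

Plugging in the real bytes read per token for a given model gives a concrete target for the GEMV tuning work.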

Not done

  • Single GPU only. RCCL does not exist on Windows, so tensor/pipeline parallel are out of scope; torch.distributed is shimmed for the single-process case only.
  • KV-cache quantization: fp8 works, and KVarN runs end-to-end but is still WIP (see Performance). Other low-bit KV formats (INT8 / 2-bit) are not wired up yet, and fp8 currently uses default scales (calibrated k_scale/v_scale are needed for near-lossless results).
  • Only part of vLLM's kernel suite is built natively so far (see "Native kernels" below).
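
To illustrate what "shimmed for the single-process case only" can mean, here is a minimal sketch of a stand-in module that answers the common rank and world-size queries for one process. It is not the plugin's actual shim: the module name and the set of functions covered are hypothetical choices for illustration.

```python
import types


def make_single_process_dist_stub():
    """Build a tiny stand-in for a distributed API that only knows one process."""
    stub = types.ModuleType("hypothetical_dist_stub")  # hypothetical module name
    stub.is_available = lambda: True
    stub.is_initialized = lambda: True
    stub.get_rank = lambda group=None: 0
    stub.get_world_size = lambda group=None: 1
    # With a single rank, collectives have nothing to exchange, so they are no-ops.
    stub.barrier = lambda *args, **kwargs: None
    stub.all_reduce = lambda tensor, *args, **kwargs: None
    return stub


dist = make_single_process_dist_stub()
print(dist.get_rank(), dist.get_world_size())
```

Anything that genuinely needs more than one rank (tensor or pipeline parallelism) cannot be satisfied this way, which is why those modes stay out of scope.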

Tested stack (pinned, fragile)

This depends on a specific, somewhat experimental combination. Other versions may not work.

  • Windows 11, AMD Radeon RX 7900 XT (gfx1100)
  • A ROCm-enabled PyTorch Windows build: torch 2.10.0+rocm7.13 (TheRock-class), torchvision 0.25.0, Python 3.12
  • AMD HIP SDK 7.2 (C:\HIP-SDK), MSVC (Visual Studio Build Tools), Windows SDK 10
  • triton-windows 3.6, conch-triton-kernels, llguidance, xgrammar
  • vLLM v0.19.1 (the newest tag pinned to torch 2.10; v0.20+ requires torch 2.11)

Note: helper scripts contain absolute paths from the author's machine (C:\HIP-SDK, E:\BuildTools, C:\Users\...). Adjust them for your environment.

How it works

  • windows_rocm_plugin/ is a pip-installable package providing:
    • WindowsRocmPlatform (registered via the vllm.platform_plugins entry point) that detects the GPU through torch.cuda instead of the Linux-only amdsmi.
    • A single-process torch.distributed shim (the Windows ROCm torch wheel is built without distributed), plus stubs for amdsmi, uvloop, fcntl, torch._C._distributed_c10d, and a tokenizer-class compatibility alias.
    • A torch.distributed.tensor stub that makes the (natively absent) DTensor module raise ModuleNotFoundError instead of a half-initialized ImportError. inductor's graph logging guards that import with except ModuleNotFoundError; without the stub, torch.compile dies during compilation. This is what unblocks inductor here.
    • cops.py: loads the compiled native kernel library (see below) so torch.ops._C.* resolve to the real HIP kernels, and registers torch-native fallbacks for any op the native build does not provide (so vLLM's unconditional torch.ops._C.* bindings work either way).
  • vLLM is installed with VLLM_TARGET_DEVICE=empty (no kernels compiled by vLLM's own build) plus a one-line patch to vllm/__init__.py that imports the shim early.
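
The DTensor stub idea above (make an absent module fail with a clean ModuleNotFoundError) can be sketched with a meta-path finder. This is one possible way to get that behavior, not necessarily how the plugin implements it; the class name and the blocked module name are hypothetical.

```python
import importlib.abc
import sys


class AbsentModuleFinder(importlib.abc.MetaPathFinder):
    """Make selected module names fail to import with ModuleNotFoundError."""

    def __init__(self, blocked_names):
        self.blocked_names = set(blocked_names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.blocked_names:
            # Raising here aborts the import before any half-initialized module exists.
            raise ModuleNotFoundError(
                f"No module named {fullname!r} (absent-module stub)", name=fullname
            )
        return None  # let the normal finders handle everything else


# Hypothetical module name used purely for demonstration.
sys.meta_path.insert(0, AbsentModuleFinder({"hypothetical_absent_module"}))

try:
    import hypothetical_absent_module  # noqa: F401
except ModuleNotFoundError as exc:
    print("guarded import fell through cleanly:", exc.name)
```

Callers that wrap the import in `except ModuleNotFoundError` then take their fallback path instead of crashing on a different exception type.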

Native kernels

experiments/vllm_c_ext/ builds vLLM's own csrc HIP kernels for Windows. vLLM's Linux build relies on a CUDA->HIP header redirect that the Windows torch wheel does not ship, and cpp_extension's hipify orchestrator mishandles Windows paths, so the harness applies torch's hipify substitution engine (RE_PYTORCH_PREPROCESSOR + PYTORCH_MAP) to the sources directly, with a small set of redirect shim headers. Currently built and validated:

  • silu_and_mul, rms_norm, fused_add_rms_norm, rotary_embedding (fused activation / layernorm / RoPE)
  • the W4A16 GPTQ/exllama GEMM (gptq_gemm, gptq_shuffle) from csrc/quantization/gptq/q_gemm.cu, which has a dedicated small-batch path for single-stream decode
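
As a toy illustration of what a hipify-style substitution pass does to a source file, the sketch below rewrites a handful of CUDA identifiers with a single compiled regex. The four-entry map is a hypothetical stand-in; torch's real tables (PYTORCH_MAP) are far larger and the harness uses those, not this.

```python
import re

# Tiny hypothetical mapping for demonstration only.
TOY_MAP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaStream_t": "hipStream_t",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
}

# Longest keys first so a longer identifier is never pre-empted by a shorter prefix.
TOY_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(TOY_MAP, key=len, reverse=True)) + r")\b"
)


def toy_hipify(source):
    """Replace every mapped CUDA token in the source text with its HIP counterpart."""
    return TOY_RE.sub(lambda match: TOY_MAP[match.group(1)], source)


cuda_src = "#include <cuda_runtime.h>\ncudaStream_t stream;\ncudaMalloc(&ptr, n);"
print(toy_hipify(cuda_src))
```

Applying the substitution directly to source text sidesteps any path handling in a build orchestrator, since the function only ever sees strings.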

To select the native exllama GEMM for a compressed-tensors W4A16 model, set VLLM_DISABLED_KERNELS=ConchLinearKernel (vLLM's ROCm kernel selection then falls through from conch to the exllama kernel).

The plugin also ships a custom M=1 W4 dequant-GEMV (awq_gemv.py, pure Triton) registered ahead of conch for AWQ-uint4 decode. AWQ-uint4 has no fast kernel on ROCm (exllama only accepts uint4b8; Marlin is CUDA-only), so vLLM falls back to conch, whose throughput-shaped tile is ~20x off memory bandwidth for a single decode row. The GEMV is a true reduction (no tl.dot/split-K/atomicAdd) that reuses conch's weight normalization and delegates prefill (M>1) back to conch; it is autotuned per shape with @triton.autotune (BLOCK_N/num_warps). On casperhansen/deepseek-r1-distill-qwen-14b-awq it takes decode from 12.2 to 50.9 tok/s.
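
What an M=1 W4 dequant-GEMV computes can be shown with a slow pure-Python reference: unpack 4-bit weights, apply the per-group zero point and scale, and reduce against the single activation row. This is a readability sketch only. The packing order (8 nibbles per 32-bit word, low nibble first) is a simplifying assumption and is not claimed to match AWQ's real layout or the Triton kernel in awq_gemv.py.

```python
def pack_uint4(values):
    """Pack ints in [0, 15] into 32-bit words, 8 per word, low nibble first (assumed layout)."""
    assert len(values) % 8 == 0
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, value in enumerate(values[start:start + 8]):
            word |= (value & 0xF) << (4 * slot)
        words.append(word)
    return words


def unpack_uint4(words):
    """Inverse of pack_uint4."""
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(8)]


def dequant_gemv(x, packed_rows, scales, zeros, group_size):
    """y[n] = sum_k x[k] * (q[n][k] - zeros[n][g]) * scales[n][g], with g = k // group_size."""
    y = []
    for n, words in enumerate(packed_rows):
        q = unpack_uint4(words)
        acc = 0.0
        for k, xk in enumerate(x):  # plain reduction over K, one output at a time
            g = k // group_size
            acc += xk * (q[k] - zeros[n][g]) * scales[n][g]
        y.append(acc)
    return y


x = [1.0] * 8 + [2.0] * 8                      # one activation row, K = 16
rows = [pack_uint4(list(range(16))), pack_uint4([15] * 16)]
scales = [[0.5, 0.25], [1.0, 1.0]]             # per output row, per group of 8
zeros = [[8, 8], [0, 0]]
print(dequant_gemv(x, rows, scales, zeros, group_size=8))
```

Because M=1 has no row dimension to tile over, the work is a memory-bound reduction over K, which is why a tile shaped for large-batch throughput is a poor fit for it.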

CK ck_tile FMHA (WMMA) for prefill

experiments/ck_fmha/ builds a native Composable Kernel ck_tile flash-attention (forward, d128, fp16 + bf16, causal + GQA, varlen/group-mode) for gfx1100 -- the RDNA3 WMMA attention path that AITER's Windows gate (ENABLE_CK=False) hides but that CK itself supports. It compiles with hipcc + MSVC after a one-line device-code patch (std::memcpy -> __builtin_memcpy). Isolated, it runs prefill attention at ~37 TFLOP/s vs ~11 for Triton unified_attention (~3.3x).

Wired into vLLM prefill via VLLM_WIN_CK_PREFILL=1 (cops.maybe_patch_ck_prefill, opt-in): pure-prefill batches with no prior KV context (head 128, no sliding-window / softcap / alibi) route their attention to the CK varlen kernel; decode, mixed prefill+decode, and sliding-window steps fall through to Triton. The KV-cache write is a separate step, so decode is untouched. This is a prefill / TTFT lever (compute-bound WMMA), not a single-stream decode one, so the end-to-end win grows with context as the O(S^2) attention fraction rises. On ERNIE-4.5-21B-A3B (bf16, clean paired runs, best-of-3 TTFT):

| prompt tokens | Triton | CK | TTFT speedup |
|---|---|---|---|
| 2059 | 415.7 ms | 381.9 ms | 1.09x |
| 4099 | 850.2 ms | 733.9 ms | 1.16x |
| 6156 | 1379.7 ms | 1113.5 ms | 1.24x |
| 8196 | 1990.6 ms | 1519.0 ms | 1.31x |
| 10253 | 2681.7 ms | 1935.9 ms | 1.39x |
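
The speedup column is simply the Triton TTFT divided by the CK TTFT. The short check below recomputes it from the timings in the table, so new measurements can be dropped in the same way.

```python
# (prompt tokens, Triton TTFT ms, CK TTFT ms) as listed in the table.
TTFT_ROWS = [
    (2059, 415.7, 381.9),
    (4099, 850.2, 733.9),
    (6156, 1379.7, 1113.5),
    (8196, 1990.6, 1519.0),
    (10253, 2681.7, 1935.9),
]


def ttft_speedups(rows):
    """Return (prompt_tokens, speedup) pairs, speedup rounded to two decimals."""
    return [(tokens, round(triton_ms / ck_ms, 2)) for tokens, triton_ms, ck_ms in rows]


for tokens, speedup in ttft_speedups(TTFT_ROWS):
    print(f"{tokens:>6} tokens: {speedup:.2f}x")
```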

At short prompts (~1k) the win is only ~1.03x -- attention is a small slice of the prefill step (QKV/O projection + MoE) -- and the curve is still climbing at 10k. bf16 output differs slightly from Triton (kernel numerics), which can flip greedy tokens. Correctness gate: rel ~1e-4 (fp16) / ~3e-3 (bf16) vs scaled_dot_product_attention across causal, GQA, and multi-sequence varlen.
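
The routing rule for the CK prefill path can be summarized as a single predicate. The sketch below encodes only the conditions listed in this section; the `StepInfo` fields and the function name are hypothetical and do not mirror vLLM's attention metadata or the actual code in cops.maybe_patch_ck_prefill.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepInfo:
    """Hypothetical summary of one scheduler step, for illustration only."""
    num_decode_tokens: int
    has_prior_kv_context: bool
    head_size: int
    sliding_window: Optional[int] = None
    logits_soft_cap: Optional[float] = None
    uses_alibi: bool = False


def route_to_ck_prefill(step, env=None):
    """True if this step may use the CK varlen kernel; otherwise fall through to Triton."""
    env = os.environ if env is None else env
    if env.get("VLLM_WIN_CK_PREFILL") != "1":
        return False  # opt-in only
    if step.num_decode_tokens > 0 or step.has_prior_kv_context:
        return False  # decode, mixed batches, or prefill continuing from cached KV
    if step.head_size != 128:
        return False  # the kernel is built for d128 only
    if step.sliding_window is not None or step.logits_soft_cap is not None or step.uses_alibi:
        return False
    return True


pure_prefill = StepInfo(num_decode_tokens=0, has_prior_kv_context=False, head_size=128)
print(route_to_ck_prefill(pure_prefill, env={"VLLM_WIN_CK_PREFILL": "1"}))
```

Keeping every unsupported case on the Triton path means the opt-in flag can only change behavior for batches the CK kernel was validated on.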

Setup

:: 1. Clone the matching vLLM tag next to this repo's content
git clone --depth 1 --branch v0.19.1 https://github.com/vllm-project/vllm.git vllm

:: 2. Don't let pip replace your ROCm torch, then install vLLM with no kernels
cd vllm
python use_existing_torch.py
set VLLM_TARGET_DEVICE=empty
python -m pip install -e . --no-build-isolation
cd ..

:: 3. Apply the one-line shim import to vLLM, install the plugin and extra deps
python tools\patch_vllm.py vllm
python -m pip install -e windows_rocm_plugin
python -m pip install conch-triton-kernels llguidance xgrammar

:: 4. (optional) Build vLLM's native HIP kernels for Windows
cd experiments\vllm_c_ext
build_run.bat

Running

Run from run/ (not the repo root, so the cloned vllm/ directory does not shadow the installed vllm package).

cd run
:: smallest end-to-end smoke test (OPT-125m)
python first_token.py
:: decode tok/s + VRAM (configure via VLLM_BENCH_* env vars)
python bench.py
:: aggregate throughput vs concurrency
python batch_sweep.py
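
If imports behave oddly, one quick diagnostic is to check whether a name resolves to a bare directory (a namespace package) rather than a real package, which is what a shadowing source checkout tends to look like. This is a generic sketch, not a tool shipped in this repo; the function name is hypothetical.

```python
import importlib.util


def is_namespace_shadow(module_name):
    """True if the name resolves to a namespace package (a directory with no __init__.py)."""
    spec = importlib.util.find_spec(module_name)
    return (
        spec is not None
        and spec.origin is None
        and spec.submodule_search_locations is not None
    )


# A regular stdlib package has a real origin file, so it is not flagged.
print(is_namespace_shadow("json"))
```

Calling `is_namespace_shadow("vllm")` from the directory you launch in would then indicate whether the cloned folder is being picked up instead of the installed package.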

bench.py knobs (env): VLLM_BENCH_COMPILE=1 enables inductor, VLLM_BENCH_CGMODE=FULL_DECODE_ONLY enables hipGraph decode capture, VLLM_DISABLED_KERNELS=ConchLinearKernel selects the native exllama GEMM.
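
For scripted runs, the knobs above can be assembled into an environment mapping and passed to a subprocess. This is a small convenience sketch: the helper name and its keyword arguments are hypothetical, while the environment variable names and values are the ones listed in this section.

```python
import os


def bench_env(base=None, compile_inductor=False, cudagraph_mode=None, native_exllama=False):
    """Return a copy of the environment with the requested bench.py knobs set."""
    env = dict(os.environ if base is None else base)
    if compile_inductor:
        env["VLLM_BENCH_COMPILE"] = "1"
    if cudagraph_mode is not None:
        env["VLLM_BENCH_CGMODE"] = cudagraph_mode
    if native_exllama:
        env["VLLM_DISABLED_KERNELS"] = "ConchLinearKernel"
    return env


env = bench_env(base={}, compile_inductor=True,
                cudagraph_mode="FULL_DECODE_ONLY", native_exllama=True)
print(env)

# Usage idea (not executed here): run the benchmark from the run/ directory.
# import subprocess, sys
# subprocess.run([sys.executable, "bench.py"], cwd="run", env=bench_env(...), check=True)
```

Setting the variables on a copied mapping keeps one configuration from leaking into the next run of a sweep.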

For a quantized model with a broken tokenizer_class (e.g. some llm-compressor exports):

python ..\tools\fix_tokenizer_config.py <model-substring>
set HF_HUB_OFFLINE=1

Layout

  • windows_rocm_plugin/ - the out-of-tree platform plugin and compatibility shims
  • tools/ - patch and fixup scripts
  • run/ - bench / first-token / profiling / batch-sweep drivers
  • experiments/ - native csrc kernel build harness and standalone HIP/Triton kernel proofs

License

This repository's glue code is Apache-2.0, matching vLLM. vLLM itself is not included here and remains under its own license.

About

An honest port of vLLM-ROCm for Windows.
