Glue and build tooling to run vLLM on native Windows (no WSL2) with AMD ROCm on RDNA3 consumer GPUs. Developed and tested on a Radeon RX 7900 XT (gfx1100).
This is not a fork of vLLM. It is an out-of-tree platform plugin plus a set of compatibility shims, a one-line patch, and a build harness that compiles vLLM's own HIP kernels natively on Windows. Upstream vLLM is cloned and pinned separately (see Setup).
Experimental, but past "it just runs". What currently works on the test machine:
- vLLM imports and generates correct tokens on gfx1100, native Windows, single GPU.
- W4A16 quantized models run across formats: compressed-tensors, GPTQ-Int4, and AWQ-Int4.
- vLLM's native exllama W4A16 GEMM (_C.gptq_gemm) is compiled natively for Windows (GPTQ models otherwise have no kernel on Windows at all).
- A custom M=1 W4 dequant-GEMV (Triton) for AWQ-uint4 decode, which has no fast kernel on ROCm otherwise (exllama rejects uint4, Marlin is CUDA-only, leaving only the slow conch tile).
- torch.compile / inductor works (CompilationMode.STOCK_TORCH_COMPILE), and hipGraph decode capture works (cudagraph_mode=FULL_DECODE_ONLY).
- fp8 KV cache works (Triton path), ~2x KV-cache capacity / context length.
- KVarN KV-cache quantization (calibration-free Hadamard + Sinkhorn + asymmetric RTN, K 4-bit / V 2-bit; Huawei's method, Triton kernels ported to gfx1100) runs end-to-end and gives ~4.7x KV capacity at ~fp16 accuracy (demonstrated on Qwen2.5-7B: 999k vs 210k KV tokens, coherent). WIP: its per-forward workspace over-allocates (~5 GiB), so today it only fits models that leave enough headroom (7-9B), and it is ~35% slower — a capacity feature, not a speed one. Not finished; the builder memory refactor is pending.
Single-stream decode (batch 1, greedy) on the test machine. Output was verified coherent for each model. All weights are 4-bit; KV cache fp16 unless noted.
| Model | Quantization | decode (tok/s) | KVarN KV-quant (WIP) | notes |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 (dense, 7B) | GPTQ Int4 | 115 | 4.74x KV (999k vs 210k tok), 74.7 tok/s, coherent | native exllama GEMM + hipGraph decode, gpu_memory_utilization=0.9 (spill-verified clean) |
| cyankiwi/ERNIE-4.5-21B-A3B-Thinking-AWQ-4bit (MoE, 21B / A3B active, head 128) | compressed-tensors W4A16 gs32 | 62.7 → 79.2 | n/a (14GB weights leave no room for the ~5 GiB KVarN workspace) | stock → +M=1 MoE-decode gather-GEMV + native wvSplitK dense. Fits 20GB with ~3 GiB free, spill-free |
| sahilchachra/Qwythos-9B-Claude-Mythos-5-1M-AWQ (Qwen3.5 hybrid, 9B) | compressed-tensors W4A16 | 61.7 | n/a | native exllama + hipGraph, gpu_memory_utilization=0.7 (see note) |
| casperhansen/deepseek-r1-distill-qwen-14b-awq (dense, 14B) | AWQ Int4 | 50.3 | n/a | custom M=1 W4 GEMV, autotuned, util 0.9 (spill-verified clean) |
Numbers re-measured 2026-07-02, cudagraph (FULL_DECODE_ONLY) decode, and verified spill-free by
polling the Windows GPU shared-memory counter during the run (peak shared == the ~0.76 GiB desktop
baseline). torch.cuda.mem_get_info() only reports dedicated VRAM free, so it does NOT catch a WDDM
spill on its own.
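The spill check described above comes down to comparing the peak shared-memory reading against the idle desktop baseline. A minimal sketch of that comparison, assuming the readings have already been collected by some poller (the function name, the tolerance, and the sample values are illustrative, not part of this repo):

```python
def detect_spill(shared_samples_gib, baseline_gib, tolerance_gib=0.05):
    """Return (spilled, peak_excess_gib) for a series of shared-memory readings.

    shared_samples_gib: GPU shared-memory counter readings taken during a run
    baseline_gib: the idle desktop reading taken before the run
    tolerance_gib: hypothetical noise margin, not a value from this README
    """
    if not shared_samples_gib:
        raise ValueError("need at least one sample")
    peak = max(shared_samples_gib)
    excess = max(0.0, peak - baseline_gib)
    return excess > tolerance_gib, excess


# Illustrative readings only: a clean run stays at the ~0.76 GiB baseline,
# a spilling run climbs well above it.
clean = detect_spill([0.76, 0.76, 0.75], baseline_gib=0.76)
spilled = detect_spill([0.76, 2.10, 3.46], baseline_gib=0.76)
print(clean, spilled)
```

A run counts as clean only when the peak never leaves the baseline, which is the "peak shared == baseline" criterion used for the table.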
VRAM caveat (per-model gpu_memory_utilization). Standard dense models (Qwen, deepseek) are safe at
0.9. The Qwythos-9B hybrid (linear-attention Mamba state cache + an unquantized vision tower + a
248k vocab) is NOT: vLLM's memory profiling does not account for the Mamba/vision allocations, so at 0.9 it
fills dedicated VRAM with KV and those extras overflow ~2.7 GiB into shared DRAM — a fake, spill-slowed
number (that was the old 39.9). At util=0.7 it runs entirely in dedicated VRAM (6 GiB free) and is both
clean and faster (61.7). The 26B-class MoEs (gemma-4, 17GB weights) overfill 20GB at any workable util, so
ERNIE-4.5-21B is used as the clean MoE bench. Aggregate throughput scales with concurrency (Qwen2.5-7B,
greedy): ~73 tok/s at batch 4, ~232 at batch 16, ~358 at batch 32.
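To read the concurrency numbers per stream rather than in aggregate, the arithmetic can be sketched as follows. The throughput values are the ones quoted above; the helper names are illustrative.

```python
# batch size -> aggregate tok/s, as quoted for Qwen2.5-7B (greedy)
aggregate_tok_s = {4: 73, 16: 232, 32: 358}


def per_stream(aggregate):
    """Average tok/s seen by each individual request at a given batch size."""
    return {batch: total / batch for batch, total in aggregate.items()}


def scaling_vs_smallest(aggregate):
    """Aggregate throughput relative to the smallest measured batch."""
    base = aggregate[min(aggregate)]
    return {batch: total / base for batch, total in aggregate.items()}


print(per_stream(aggregate_tok_s))
print(scaling_vs_smallest(aggregate_tok_s))
```

Going from batch 4 to batch 32 multiplies aggregate throughput by roughly 4.9 while each stream slows from about 18 to about 11 tok/s.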
KVarN (experimental / WIP). --kv-cache-dtype kvarn_k4v2_g128 --block-size 128 runs end-to-end on
gfx1100 (K 4-bit / V 2-bit, calibration-free). On Qwen2.5-7B, vLLM sizes 999,296 KV tokens vs 210,784
in fp16 — 4.74x capacity — and generation stays coherent, at 74.7 tok/s (~35% slower than the fp16
115: KVarN is a KV-capacity feature, not a speed one). Two rough edges remain (hence WIP): (1) vLLM
sizes the KV pool to fill the budget, so cap it with num_gpu_blocks_override (else it tries to allocate
all ~1M tokens at once and spills); (2) KVarN's per-forward workspace over-allocates ~5 GiB (Sinkhorn /
rotation / D2H staging, not counted by gpu_memory_utilization), so today it only fits models that leave
that headroom (7-9B) — a 14GB-weight model like ERNIE has no room. The pending builder memory refactor
would remove both.
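The KVarN figures above can be reproduced with a few lines of arithmetic, and the same arithmetic gives a value for num_gpu_blocks_override. A small sketch, where the 262,144-token target is a made-up example rather than a recommendation:

```python
import math


def capacity_ratio(quant_tokens, fp16_tokens):
    """How many times more KV tokens fit with the quantized cache."""
    return quant_tokens / fp16_tokens


def slowdown_pct(quant_tok_s, base_tok_s):
    """Decode slowdown of the quantized cache relative to the baseline, in percent."""
    return (1.0 - quant_tok_s / base_tok_s) * 100.0


def blocks_for(target_tokens, block_size=128):
    """Number of KV blocks needed to hold target_tokens at the given block size."""
    return math.ceil(target_tokens / block_size)


print(round(capacity_ratio(999_296, 210_784), 2))  # figures quoted above
print(round(slowdown_pct(74.7, 115.0)))
print(blocks_for(999_296))   # the full pool vLLM would size on its own
print(blocks_for(262_144))   # hypothetical smaller cap to pass as num_gpu_blocks_override
```

Capping the pool well below the full size keeps the allocation inside dedicated VRAM while the workspace over-allocation is still present.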
Decode is still below the card's ~800 GB/s memory-bandwidth roofline; per-shape GEMV tuning,
fp8-KV scale calibration, and porting the rest of the csrc kernels are ongoing.
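For a rough sense of the gap to the roofline: single-stream decode is bounded by how fast the weights can be streamed from VRAM once per token. The sketch below assumes a hypothetical 5.0 GB read per token; that figure is a placeholder for illustration and was not measured for any model in the table. KV-cache and activation traffic are ignored.

```python
def decode_roofline_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tok/s if decode were purely memory-bandwidth bound."""
    return bandwidth_gb_s / gb_read_per_token


def bandwidth_utilization(measured_tok_s, bandwidth_gb_s, gb_read_per_token):
    """Fraction of the bandwidth roofline that a measured decode rate reaches."""
    return measured_tok_s / decode_roofline_tok_s(bandwidth_gb_s, gb_read_per_token)


ASSUMED_GB_PER_TOKEN = 5.0  # hypothetical weight footprint, not a measured value

roof = decode_roofline_tok_s(800.0, ASSUMED_GB_PER_TOKEN)
util = bandwidth_utilization(115.0, 800.0, ASSUMED_GB_PER_TOKEN)
print(roof, util)
```

Substituting the real on-disk weight size of a model gives a first-order estimate of how much headroom per-shape tuning could still recover.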
- Single GPU only. RCCL does not exist on Windows, so tensor/pipeline parallel are out of scope; torch.distributed is shimmed for the single-process case only.
- KV-cache quantization: fp8 works; sub-8-bit (INT8 / 2-bit / KVarN) is not wired up yet, and fp8 currently uses default scales (calibrated k_scale/v_scale needed for near-lossless).
- Only part of vLLM's kernel suite is built natively so far (see "Native kernels" below).
This depends on a specific, somewhat experimental combination. Other versions may not work.
- Windows 11, AMD Radeon RX 7900 XT (gfx1100)
- A ROCm-enabled PyTorch Windows build: torch 2.10.0+rocm7.13 (TheRock-class), torchvision 0.25.0, Python 3.12
- AMD HIP SDK 7.2 (C:\HIP-SDK), MSVC (Visual Studio Build Tools), Windows SDK 10
- triton-windows 3.6, conch-triton-kernels, llguidance, xgrammar
- vLLM v0.19.1 (the newest tag pinned to torch 2.10; v0.20+ requires torch 2.11)
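Because the combination above is narrow, it can help to compare installed versions against the pinned ones before debugging anything else. A minimal sketch, assuming the version strings are supplied by the caller (for example from importlib.metadata); the helper names are illustrative and only major.minor is compared:

```python
# Pinned major.minor versions taken from the list above.
PINNED = {"torch": "2.10", "torchvision": "0.25", "vllm": "0.19"}


def major_minor(version):
    """Reduce '2.10.0+rocm7.13' or 'v0.19.1' to '2.10' / '0.19'."""
    core = version.lstrip("v").split("+", 1)[0]
    return ".".join(core.split(".")[:2])


def mismatches(installed):
    """Return {package: (wanted, found)} for every package that deviates."""
    found = {}
    for name, want in PINNED.items():
        got = major_minor(installed.get(name, ""))
        if got != want:
            found[name] = (want, got)
    return found


print(mismatches({"torch": "2.10.0+rocm7.13", "torchvision": "0.25.0", "vllm": "v0.19.1"}))
print(mismatches({"torch": "2.11.0", "torchvision": "0.25.0", "vllm": "0.20.0"}))
```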
Note: helper scripts contain absolute paths from the author's machine
(C:\HIP-SDK, E:\BuildTools, C:\Users\...). Adjust them for your environment.
- windows_rocm_plugin/ is a pip-installable package providing:
  - WindowsRocmPlatform (registered via the vllm.platform_plugins entry point) that detects the GPU through torch.cuda instead of the Linux-only amdsmi.
  - A single-process torch.distributed shim (the Windows ROCm torch wheel is built without distributed), plus stubs for amdsmi, uvloop, fcntl, torch._C._distributed_c10d, and a tokenizer-class compatibility alias.
  - A torch.distributed.tensor stub that makes the (natively absent) DTensor module raise ModuleNotFoundError instead of a half-initialized ImportError. inductor's graph logging guards that import with except ModuleNotFoundError; without the stub, torch.compile dies during compilation. This is what unblocks inductor here.
  - cops.py: loads the compiled native kernel library (see below) so torch.ops._C.* resolve to the real HIP kernels, and registers torch-native fallbacks for any op the native build does not provide (so vLLM's unconditional torch.ops._C.* bindings work either way).
- vLLM is installed with VLLM_TARGET_DEVICE=empty (no kernels compiled by vLLM's own build) plus a one-line patch to vllm/__init__.py that imports the shim early.
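The torch.distributed.tensor stub relies on a general Python behaviour: a caller that guards an optional import with except ModuleNotFoundError only survives if the import fails with exactly that exception. The sketch below shows one way to force that outcome with a meta path finder. It is not the plugin's actual code; the class, the helpers, and the module name demo_absent_dtensor are all hypothetical.

```python
import importlib
import importlib.abc
import sys


class AbsentModuleFinder(importlib.abc.MetaPathFinder):
    """Make selected module names fail cleanly with ModuleNotFoundError."""

    def __init__(self, names):
        self._names = frozenset(names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self._names:
            raise ModuleNotFoundError(f"No module named {fullname!r}", name=fullname)
        return None  # defer to the regular finders for everything else


def mark_absent(*names):
    """Install the finder ahead of the default ones and return it."""
    finder = AbsentModuleFinder(names)
    sys.meta_path.insert(0, finder)
    return finder


def optional_import(name):
    """Mimic a caller that guards an optional import narrowly."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None


finder = mark_absent("demo_absent_dtensor")  # hypothetical module name
result = optional_import("demo_absent_dtensor")
sys.meta_path.remove(finder)
print(result)
```

The guarded caller sees None and carries on, where a different exception raised from a partially initialized module would have escaped the except clause.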
experiments/vllm_c_ext/ builds vLLM's own csrc HIP kernels for Windows. vLLM's
Linux build relies on a CUDA->HIP header redirect that the Windows torch wheel does not ship,
and cpp_extension's hipify orchestrator mishandles Windows paths, so the harness applies
torch's hipify substitution engine (RE_PYTORCH_PREPROCESSOR + PYTORCH_MAP) to the sources
directly, with a small set of redirect shim headers. Currently built and validated:
- silu_and_mul, rms_norm, fused_add_rms_norm, rotary_embedding (fused activation / layernorm / RoPE)
- the W4A16 GPTQ/exllama GEMM (gptq_gemm, gptq_shuffle) from csrc/quantization/gptq/q_gemm.cu, which has a dedicated small-batch path for single-stream decode
To select the native exllama GEMM for a compressed-tensors W4A16 model, set
VLLM_DISABLED_KERNELS=ConchLinearKernel (vLLM's ROCm kernel selection then falls through
from conch to the exllama kernel).
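The fall-through can be pictured as walking a priority list and skipping anything named in VLLM_DISABLED_KERNELS. The sketch below is a simplified model of that idea, not vLLM's selection code; the candidate order and the helper functions are assumptions for illustration.

```python
import os


def disabled_kernels(env=None):
    """Parse the comma-separated VLLM_DISABLED_KERNELS value into a set."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_DISABLED_KERNELS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def choose_kernel(candidates, disabled):
    """Pick the first candidate that has not been disabled."""
    for name in candidates:
        if name not in disabled:
            return name
    raise ValueError("every candidate kernel is disabled")


# Assumed priority order for this sketch: conch first, exllama second.
candidates = ["ConchLinearKernel", "ExllamaLinearKernel"]

print(choose_kernel(candidates, disabled_kernels({})))
print(choose_kernel(candidates, disabled_kernels({"VLLM_DISABLED_KERNELS": "ConchLinearKernel"})))
```

In practice the variable has to be set in the environment before vLLM is imported, since kernel selection happens while the model is being loaded.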
The plugin also ships a custom M=1 W4 dequant-GEMV (awq_gemv.py, pure Triton) registered
ahead of conch for AWQ-uint4 decode. AWQ-uint4 has no fast kernel on ROCm (exllama only
accepts uint4b8; Marlin is CUDA-only), so vLLM falls back to conch, whose throughput-shaped
tile is ~20x off memory bandwidth for a single decode row. The GEMV is a true reduction (no
tl.dot/split-K/atomicAdd) that reuses conch's weight normalization and delegates prefill
(M>1) back to conch; it is @triton.autotuned per shape (BLOCK_N/num_warps). On
casperhansen/deepseek-r1-distill-qwen-14b-awq it takes decode from 12.2 to 50.9 tok/s.
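For readers unfamiliar with what an M=1 W4 dequant-GEMV computes, here is a plain-Python reference of the math: unpack 4-bit weights, apply per-group scale and zero point, and reduce along K for each output column. This is a slow reference for checking results, not the Triton kernel. The packing layout (eight nibbles per 32-bit word, lowest nibble first, packed along K) is an assumption made for the sketch and is not AWQ's actual on-disk ordering.

```python
def pack_uint4(values):
    """Pack 8*m uint4 values into m 32-bit words, lowest nibble first."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * j)
        words.append(word)
    return words


def w4_gemv(x, packed_cols, scales, zeros, group_size):
    """y[n] = sum_k x[k] * (q[k][n] - zeros[g][n]) * scales[g][n], with g = k // group_size."""
    y = []
    for n, col in enumerate(packed_cols):
        acc = 0.0
        for k, xk in enumerate(x):
            q = (col[k // 8] >> (4 * (k % 8))) & 0xF  # unpack one nibble
            g = k // group_size
            acc += xk * (q - zeros[g][n]) * scales[g][n]
        y.append(acc)  # one sequential reduction per output column
    return y


# Tiny example: K=8, N=2, a single quantization group.
cols = [pack_uint4([0, 1, 2, 3, 4, 5, 6, 7]), pack_uint4([8] * 8)]
y = w4_gemv([1.0] * 8, cols, scales=[[0.5, 2.0]], zeros=[[0, 8]], group_size=8)
print(y)
```

Each output element is a single dot product over K, which is why a decode row is limited by how fast the packed weights can be read rather than by matrix-multiply throughput.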
experiments/ck_fmha/ builds a native Composable Kernel ck_tile flash-attention (forward, d128,
fp16 + bf16, causal + GQA, varlen/group-mode) for gfx1100 -- the RDNA3 WMMA attention path that AITER's
Windows gate (ENABLE_CK=False) hides but that CK itself supports. It compiles with hipcc + MSVC after a
one-line device-code patch (std::memcpy -> __builtin_memcpy). Isolated, it runs prefill attention at
~37 TFLOP/s vs ~11 for Triton unified_attention (~3.3x).
Wired into vLLM prefill via VLLM_WIN_CK_PREFILL=1 (cops.maybe_patch_ck_prefill, opt-in): pure-prefill
batches with no prior KV context (head 128, no sliding-window / softcap / alibi) route their attention to
the CK varlen kernel; decode, mixed prefill+decode, and sliding-window steps fall through to Triton. The
KV-cache write is a separate step, so decode is untouched. This is a prefill / TTFT lever
(compute-bound WMMA), not a single-stream decode one, so the end-to-end win grows with context as the
O(S^2) attention fraction rises. On ERNIE-4.5-21B-A3B (bf16, clean paired runs, best-of-3 TTFT):
| prompt tokens | Triton | CK | TTFT speedup |
|---|---|---|---|
| 2059 | 415.7 ms | 381.9 ms | 1.09x |
| 4099 | 850.2 ms | 733.9 ms | 1.16x |
| 6156 | 1379.7 ms | 1113.5 ms | 1.24x |
| 8196 | 1990.6 ms | 1519.0 ms | 1.31x |
| 10253 | 2681.7 ms | 1935.9 ms | 1.39x |
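The speedup column is simply the Triton TTFT divided by the CK TTFT for each prompt length. A short sketch that recomputes it from the table values:

```python
# prompt tokens -> (Triton TTFT ms, CK TTFT ms), copied from the table
ttft_ms = {
    2059: (415.7, 381.9),
    4099: (850.2, 733.9),
    6156: (1379.7, 1113.5),
    8196: (1990.6, 1519.0),
    10253: (2681.7, 1935.9),
}


def speedup(triton_ms, ck_ms):
    """TTFT speedup of the CK path over the Triton path."""
    return triton_ms / ck_ms


speedups = {n: round(speedup(t, c), 2) for n, (t, c) in ttft_ms.items()}
print(speedups)
```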
At short prompts (~1k) the win is only ~1.03x -- attention is a small slice of the prefill step (QKV/O
projection + MoE) -- and the curve is still climbing at 10k. bf16 output differs slightly from Triton
(kernel numerics), which can flip greedy tokens. Correctness gate: rel ~1e-4 (fp16) / ~3e-3 (bf16) vs
scaled_dot_product_attention across causal, GQA, and multi-sequence varlen.
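The routing conditions for the CK prefill path can be summarised as a single predicate. The sketch below restates the conditions listed above and is not the code in cops.maybe_patch_ck_prefill; the AttnStep fields and the function name are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttnStep:
    """Hypothetical description of one attention step, for illustration only."""
    num_prefill_tokens: int
    num_decode_tokens: int
    has_prior_kv_context: bool
    head_size: int
    sliding_window: Optional[int] = None
    softcap: Optional[float] = None
    use_alibi: bool = False


def route_to_ck(step, env=None):
    """True when the step matches the conditions for the CK varlen kernel."""
    env = os.environ if env is None else env
    if env.get("VLLM_WIN_CK_PREFILL") != "1":
        return False  # opt-in only
    pure_prefill = step.num_prefill_tokens > 0 and step.num_decode_tokens == 0
    return (
        pure_prefill
        and not step.has_prior_kv_context
        and step.head_size == 128
        and step.sliding_window is None
        and not step.softcap
        and not step.use_alibi
    )


on = {"VLLM_WIN_CK_PREFILL": "1"}
print(route_to_ck(AttnStep(2059, 0, False, 128), on))   # pure prefill
print(route_to_ck(AttnStep(2059, 4, False, 128), on))   # mixed prefill + decode
```

Anything that fails the predicate stays on the Triton path, so enabling the flag cannot change decode behaviour.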
:: 1. Clone the matching vLLM tag next to this repo's content
git clone --depth 1 --branch v0.19.1 https://github.com/vllm-project/vllm.git vllm
:: 2. Don't let pip replace your ROCm torch, then install vLLM with no kernels
cd vllm
python use_existing_torch.py
set VLLM_TARGET_DEVICE=empty
python -m pip install -e . --no-build-isolation
cd ..
:: 3. Apply the one-line shim import to vLLM, install the plugin and extra deps
python tools\patch_vllm.py vllm
python -m pip install -e windows_rocm_plugin
python -m pip install conch-triton-kernels llguidance xgrammar
:: 4. (optional) Build vLLM's native HIP kernels for Windows
cd experiments\vllm_c_ext
build_run.bat

Run from run/ (not the repo root, so the cloned vllm/ directory does not shadow the installed vllm package).
cd run
python first_token.py :: smallest end-to-end smoke test (OPT-125m)
python bench.py :: decode tok/s + VRAM (configure via VLLM_BENCH_* env vars)
python batch_sweep.py :: aggregate throughput vs concurrency

bench.py knobs (env): VLLM_BENCH_COMPILE=1 enables inductor, VLLM_BENCH_CGMODE=FULL_DECODE_ONLY
enables hipGraph decode capture, VLLM_DISABLED_KERNELS=ConchLinearKernel selects the native
exllama GEMM.
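When driving bench.py from another script, the three knobs above can be assembled into an environment dictionary. A minimal sketch; the helper and its parameters are illustrative, and only the three variable names come from this README:

```python
import os


def bench_env(compile_inductor=True, cudagraph_mode="FULL_DECODE_ONLY",
              native_exllama=True, base=None):
    """Build an environment for a bench.py run from the knobs described above."""
    env = dict(os.environ if base is None else base)
    if compile_inductor:
        env["VLLM_BENCH_COMPILE"] = "1"
    if cudagraph_mode:
        env["VLLM_BENCH_CGMODE"] = cudagraph_mode
    if native_exllama:
        env["VLLM_DISABLED_KERNELS"] = "ConchLinearKernel"
    return env


env = bench_env(base={})
print(env)

# To launch (not executed here), something along these lines would work:
#   subprocess.run([sys.executable, "bench.py"], cwd="run", env=bench_env())
```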
For a quantized model with a broken tokenizer_class (e.g. some llm-compressor exports):
python ..\tools\fix_tokenizer_config.py <model-substring>
set HF_HUB_OFFLINE=1

- windows_rocm_plugin/ - the out-of-tree platform plugin and compatibility shims
- tools/ - patch and fixup scripts
- run/ - bench / first-token / profiling / batch-sweep drivers
- experiments/ - native csrc kernel build harness and standalone HIP/Triton kernel proofs
This repository's glue code is Apache-2.0, matching vLLM. vLLM itself is not included here and remains under its own license.