fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native)#456
Conversation
… GB NCCL overhead On world_size==1 the vllm driver still initialized an NCCL process group and vLLM device communicators, which eagerly reserve ~2.8 GB of GPU memory (NCCL comm buffers + kernel modules) that a single GPU never uses. On small cards (e.g. a 4 GB RTX 3050 under WSL2) that overhead alone can leave no room for the KV cache, tripping 'Insufficient KV cache budget: 0 bytes' in _ensure_vllm_distributed's downstream KV sizing (issue #450). Use the gloo (CPU) backend for the single-rank group and disable custom all-reduce for single-GPU: no GPU-side collectives run on one GPU (vLLM's tensor_model_parallel_all_reduce short-circuits at world_size==1), so the group only needs to stay valid for CPU metadata broadcasts. Measured on an RTX 4090 (Qwen3-0.6B, bitsandbytes, enforce_eager, FLASHINFER): load footprint 3841 -> 1055 MiB (-2786). Constrained to ~3.7 GB free, stock OOMs during load while the fixed path loads, sizes the KV cache, runs a full 28-layer forward, and emits a token; gloo vs nccl output is identical. The multi-GPU (world_size>1) path is unchanged.
WalkthroughUpdates vLLM distributed setup to disable custom all-reduce for single-rank runs and always use gloo for single-rank process-group initialization. Also changes CUDA memory planning to compute the minimum KV-token threshold from both horizon demand and available budget. ChangesDistributed Initialization Logic
CUDA Memory Planning
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@driver/vllm/src/pie_driver_vllm/loader.py`:
- Line 242: The single-rank store path setup still uses tempfile.mktemp, which
is unsafe because FileStore can open the path after a TOCTOU window; update the
path creation in the loader logic that builds store_path for FileStore to use a
private temporary directory and create the store file inside it instead. Add the
missing tempfile and datetime imports needed by this block, and make sure the
temporary directory is removed on exit so the single-rank store cleanup is
handled safely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8c014fd2-d974-45b6-8801-15ebf305d0d6
📒 Files selected for processing (1)
driver/vllm/src/pie_driver_vllm/loader.py
| # WSL2) that overhead alone can leave no room for the KV cache and | ||
| # trips "Insufficient KV cache budget" (see issue #450). gloo keeps the | ||
| # group valid for CPU metadata broadcasts while touching no GPU memory. | ||
| store_path = tempfile.mktemp(prefix="pie_vllm_singlerank_") |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify no insecure temporary filename generation remains in the loader.
rg -nP '\btempfile\.mktemp\s*\(' driver/vllm/src/pie_driver_vllm/loader.pyRepository: pie-project/pie
Length of output: 225
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== loader.py outline =="
ast-grep outline driver/vllm/src/pie_driver_vllm/loader.py --view expanded || true
echo
echo "== imports and surrounding lines =="
sed -n '1,320p' driver/vllm/src/pie_driver_vllm/loader.py | nl -ba | sed -n '1,320p'Repository: pie-project/pie
Length of output: 808
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== imports =="
sed -n '1,80p' driver/vllm/src/pie_driver_vllm/loader.py
echo
echo "== store-path region =="
sed -n '220,265p' driver/vllm/src/pie_driver_vllm/loader.py
echo
echo "== FileStore references =="
rg -n "FileStore|store_path|mktemp|mkdtemp|atexit|shutil|os\.path" driver/vllm/src/pie_driver_vllm/loader.pyRepository: pie-project/pie
Length of output: 5250
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "^(import|from) .*tempfile|^(import|from) .*datetime" driver/vllm/src/pie_driver_vllm/loader.py || trueRepository: pie-project/pie
Length of output: 153
Replace the single-rank temp path creation. tempfile.mktemp leaves a TOCTOU window before FileStore opens the path, and this block also needs the missing tempfile/datetime imports. Use a private temp directory, place the store file inside it, and clean it up on exit.
🧰 Tools
🪛 Ruff (0.15.20)
[error] 242-242: Use of insecure and deprecated function (mktemp)
(S306)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@driver/vllm/src/pie_driver_vllm/loader.py` at line 242, The single-rank store
path setup still uses tempfile.mktemp, which is unsafe because FileStore can
open the path after a TOCTOU window; update the path creation in the loader
logic that builds store_path for FileStore to use a private temporary directory
and create the store file inside it instead. Add the missing tempfile and
datetime imports needed by this block, and make sure the temporary directory is
removed on exit so the single-rank store cleanup is handled safely.
Source: Linters/SAST tools
…#450) The memory planner rejected every candidate whose resident KV was below a flat 32768-token floor (cuda_memory_planner.cpp). That floor assumes a large GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the planner threw 'no viable forward/KV layout fits budget ...' even though a modest, usable KV pool fit. Cap the floor by what the candidate's budget can actually afford for KV ((budget - arena - state_bytes) / per_kv_token_bytes), with a 7/8 margin to absorb page-flooring and a small 2048-token absolute floor. Large cards keep the 32768 floor (affordable exceeds it), so their behavior is unchanged. Verified on an RTX 4090 with gpu_memory_utilization tuned to emulate a 4 GB budget (~1474 MiB), loading Qwen3-0.6B on the cuda-native driver: stock throws 'no viable forward/KV layout fits budget 1474 MiB'; fixed selects a layout (N=2048, R=512, 472 KV pages = 7552 KV tokens).
There was a problem hiding this comment.
🧹 Nitpick comments (1)
driver/cuda/src/cuda_memory_planner.cpp (1)
632-635: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRedundant ternary — condition is always true here.
The guard at Line 589 (
if (arena + state_bytes >= budget) continue;) already guaranteesbudget > arena + state_bytesby the time execution reaches Line 633, making the ternary's false branch (: 0) dead code. Also,budget - arena - state_bytesduplicates theremainingvalue already computed at Line 590.♻️ Simplify by reusing `remaining`
- const std::size_t affordable_kv_tokens = - (budget > arena + state_bytes) - ? (budget - arena - state_bytes) / per_kv_token_bytes - : 0; + const std::size_t affordable_kv_tokens = remaining / per_kv_token_bytes;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@driver/cuda/src/cuda_memory_planner.cpp` around lines 632 - 635, The affordable token calculation in cuda_memory_planner.cpp is using a redundant ternary because the earlier guard in the same loop already ensures the budget exceeds arena + state_bytes. Update the logic around the affordable_kv_tokens computation in the CUDA memory planner to reuse the existing remaining value instead of recomputing budget - arena - state_bytes, and remove the dead false branch so the expression is simplified and consistent with the prior check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@driver/cuda/src/cuda_memory_planner.cpp`:
- Around line 632-635: The affordable token calculation in
cuda_memory_planner.cpp is using a redundant ternary because the earlier guard
in the same loop already ensures the budget exceeds arena + state_bytes. Update
the logic around the affordable_kv_tokens computation in the CUDA memory planner
to reuse the existing remaining value instead of recomputing budget - arena -
state_bytes, and remove the dead false branch so the expression is simplified
and consistent with the prior check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c375a738-cc18-4be5-8696-5a5778b85ee2
📒 Files selected for processing (1)
driver/cuda/src/cuda_memory_planner.cpp
Enables inference on single-GPU / small-VRAM (4 GB) setups on two drivers.
Addresses #450.vllm driver — single-GPU NCCL overhead
On
world_size == 1the vllm driver's_ensure_vllm_distributed(driver/vllm/src/pie_driver_vllm/loader.py) still initialized an NCCL process group + vLLM device communicators, eagerly reserving ~2.8 GB of GPU memory (NCCL comm buffers + kernel modules) that a single GPU never uses. On small cards this alone leaves no room for the KV cache and tripsInsufficient KV cache budget: 0 bytes.Fix: use the gloo (CPU) backend for the single-rank group and disable custom all-reduce for single-GPU. No GPU-side collectives run on one GPU — vLLM's
tensor_model_parallel_all_reduceshort-circuits atworld_size == 1(GroupCoordinator.all_reducereturns the input before touching any communicator) — so the group only needs to stay valid for CPU metadata broadcasts. The multi-GPU (world_size > 1) path is unchanged.Verified: Qwen3-0.6B (bnb, eager, FLASHINFER) load footprint 3841 → 1055 MiB (−2786). Under a ~3.7 GB constraint, stock OOMs; fixed loads → sizes KV → runs a full 28-layer forward → emits a token, with gloo vs nccl output identical (numeric invariance, consistent with the ws==1 all-reduce short-circuit). The
world_size > 1branch is behaviorally identical (custom-all-reduce arg unchanged; gloo block never entered).cuda-native driver — hardcoded KV-residency floor
The memory planner rejected every candidate whose resident KV was below a flat 32768-token floor (
driver/cuda/src/cuda_memory_planner.cpp:616). That floor assumes a large GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the planner threwno viable forward/KV layout fits budget ...even though a modest, usable KV pool fit (independent of the forward arena — even arena=0 yields~14.3k < 32768).Fix: cap the floor by what the candidate's budget can afford for KV —
(budget - arena - state_bytes) / per_kv_token_bytes, with a× 7/8margin to absorb page-flooring and a2048-token absolute floor. On large cardsaffordableexceeds 32768, so the floor stays 32768 and behavior is unchanged.Verified: full
pie run text-completionon the cuda-native driver under a physical ~4 GB VRAM holder (planner budget = 1236 MiB). Stock throwsno viable forward/KV layout fits budget 1236 MiBand cannot run; fixed selects a layout and generates 16 tokens end-to-end (EXIT 0). Also confirmed on the standalone driver binary (stock throws at budget 1474 MiB; fixed selects a layout and allocates a 7552-token KV cache).Summary by CodeRabbit