Skip to content

fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native)#456

Merged
ingim merged 2 commits into
mainfrom
fix/vllm-single-gpu-gloo
Jul 1, 2026
Merged

fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native)#456
ingim merged 2 commits into
mainfrom
fix/vllm-single-gpu-gloo

Conversation

@ingim

@ingim ingim commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Enables inference on single-GPU / small-VRAM (4 GB) setups on two drivers.

Addresses #450.

vllm driver — single-GPU NCCL overhead

On world_size == 1 the vllm driver's _ensure_vllm_distributed (driver/vllm/src/pie_driver_vllm/loader.py) still initialized an NCCL process group + vLLM device communicators, eagerly reserving ~2.8 GB of GPU memory (NCCL comm buffers + kernel modules) that a single GPU never uses. On small cards this alone leaves no room for the KV cache and trips Insufficient KV cache budget: 0 bytes.

Fix: use the gloo (CPU) backend for the single-rank group and disable custom all-reduce for single-GPU. No GPU-side collectives run on one GPU — vLLM's tensor_model_parallel_all_reduce short-circuits at world_size == 1 (GroupCoordinator.all_reduce returns the input before touching any communicator) — so the group only needs to stay valid for CPU metadata broadcasts. The multi-GPU (world_size > 1) path is unchanged.

Verified: Qwen3-0.6B (bnb, eager, FLASHINFER) load footprint 3841 → 1055 MiB (−2786). Under a ~3.7 GB constraint, stock OOMs; fixed loads → sizes KV → runs a full 28-layer forward → emits a token, with gloo vs nccl output identical (numeric invariance, consistent with the ws==1 all-reduce short-circuit). The world_size > 1 branch is behaviorally identical (custom-all-reduce arg unchanged; gloo block never entered).

cuda-native driver — hardcoded KV-residency floor

The memory planner rejected every candidate whose resident KV was below a flat 32768-token floor (driver/cuda/src/cuda_memory_planner.cpp:616). That floor assumes a large GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the planner threw no viable forward/KV layout fits budget ... even though a modest, usable KV pool fit (independent of the forward arena — even arena=0 yields ~14.3k < 32768).

Fix: cap the floor by what the candidate's budget can afford for KV — (budget - arena - state_bytes) / per_kv_token_bytes, with a × 7/8 margin to absorb page-flooring and a 2048-token absolute floor. On large cards affordable exceeds 32768, so the floor stays 32768 and behavior is unchanged.

Verified: full pie run text-completion on the cuda-native driver under a physical ~4 GB VRAM holder (planner budget = 1236 MiB). Stock throws no viable forward/KV layout fits budget 1236 MiB and cannot run; fixed selects a layout and generates 16 tokens end-to-end (EXIT 0). Also confirmed on the standalone driver binary (stock throws at budget 1474 MiB; fixed selects a layout and allocates a 7552-token KV cache).

Summary by CodeRabbit

  • Bug Fixes
    • Improved distributed startup behavior for vLLM, especially in single-device runs.
    • Ensured custom all-reduce is disabled unless running with multiple ranks.
    • Simplified single-rank process-group initialization for more reliable startup.
    • Updated CUDA KV-memory planning to better account for budget limits, reducing overly strict candidate rejection.

… GB NCCL overhead

On world_size==1 the vllm driver still initialized an NCCL process group and
vLLM device communicators, which eagerly reserve ~2.8 GB of GPU memory (NCCL
comm buffers + kernel modules) that a single GPU never uses. On small cards
(e.g. a 4 GB RTX 3050 under WSL2) that overhead alone can leave no room for the
KV cache, tripping 'Insufficient KV cache budget: 0 bytes' in
_ensure_vllm_distributed's downstream KV sizing (issue #450).

Use the gloo (CPU) backend for the single-rank group and disable custom
all-reduce for single-GPU: no GPU-side collectives run on one GPU (vLLM's
tensor_model_parallel_all_reduce short-circuits at world_size==1), so the group
only needs to stay valid for CPU metadata broadcasts.

Measured on an RTX 4090 (Qwen3-0.6B, bitsandbytes, enforce_eager, FLASHINFER):
load footprint 3841 -> 1055 MiB (-2786). Constrained to ~3.7 GB free, stock
OOMs during load while the fixed path loads, sizes the KV cache, runs a full
28-layer forward, and emits a token; gloo vs nccl output is identical. The
multi-GPU (world_size>1) path is unchanged.
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Updates vLLM distributed setup to disable custom all-reduce for single-rank runs and always use gloo for single-rank process-group initialization. Also changes CUDA memory planning to compute the minimum KV-token threshold from both horizon demand and available budget.

Changes

Distributed Initialization Logic

Layer / File(s) Summary
Custom all-reduce gating and single-rank process group backend
driver/vllm/src/pie_driver_vllm/loader.py
Custom all-reduce now requires world_size > 1 in addition to not being disabled; single-rank init_process_group calls always use gloo backend, removing the CUDA-based nccl/gloo conditional and device_id argument.

CUDA Memory Planning

Layer / File(s) Summary
Budget-aware KV token floor
driver/cuda/src/cuda_memory_planner.cpp
The minimum KV-token threshold now combines the prior horizon-based target with an affordable floor from remaining budget, then uses the smaller of the two values.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: enabling single-GPU, small-VRAM inference for the vLLM and CUDA-native drivers.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/vllm-single-gpu-gloo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@driver/vllm/src/pie_driver_vllm/loader.py`:
- Line 242: The single-rank store path setup still uses tempfile.mktemp, which
is unsafe because FileStore can open the path after a TOCTOU window; update the
path creation in the loader logic that builds store_path for FileStore to use a
private temporary directory and create the store file inside it instead. Add the
missing tempfile and datetime imports needed by this block, and make sure the
temporary directory is removed on exit so the single-rank store cleanup is
handled safely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c014fd2-d974-45b6-8801-15ebf305d0d6

📥 Commits

Reviewing files that changed from the base of the PR and between 17e5125 and 832c27b.

📒 Files selected for processing (1)
  • driver/vllm/src/pie_driver_vllm/loader.py

# WSL2) that overhead alone can leave no room for the KV cache and
# trips "Insufficient KV cache budget" (see issue #450). gloo keeps the
# group valid for CPU metadata broadcasts while touching no GPU memory.
store_path = tempfile.mktemp(prefix="pie_vllm_singlerank_")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify no insecure temporary filename generation remains in the loader.
rg -nP '\btempfile\.mktemp\s*\(' driver/vllm/src/pie_driver_vllm/loader.py

Repository: pie-project/pie

Length of output: 225


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== loader.py outline =="
ast-grep outline driver/vllm/src/pie_driver_vllm/loader.py --view expanded || true

echo
echo "== imports and surrounding lines =="
sed -n '1,320p' driver/vllm/src/pie_driver_vllm/loader.py | nl -ba | sed -n '1,320p'

Repository: pie-project/pie

Length of output: 808


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== imports =="
sed -n '1,80p' driver/vllm/src/pie_driver_vllm/loader.py

echo
echo "== store-path region =="
sed -n '220,265p' driver/vllm/src/pie_driver_vllm/loader.py

echo
echo "== FileStore references =="
rg -n "FileStore|store_path|mktemp|mkdtemp|atexit|shutil|os\.path" driver/vllm/src/pie_driver_vllm/loader.py

Repository: pie-project/pie

Length of output: 5250


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "^(import|from) .*tempfile|^(import|from) .*datetime" driver/vllm/src/pie_driver_vllm/loader.py || true

Repository: pie-project/pie

Length of output: 153


Replace the single-rank temp path creation. tempfile.mktemp leaves a TOCTOU window before FileStore opens the path, and this block also needs the missing tempfile/datetime imports. Use a private temp directory, place the store file inside it, and clean it up on exit.

🧰 Tools
🪛 Ruff (0.15.20)

[error] 242-242: Use of insecure and deprecated function (mktemp)

(S306)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@driver/vllm/src/pie_driver_vllm/loader.py` at line 242, The single-rank store
path setup still uses tempfile.mktemp, which is unsafe because FileStore can
open the path after a TOCTOU window; update the path creation in the loader
logic that builds store_path for FileStore to use a private temporary directory
and create the store file inside it instead. Add the missing tempfile and
datetime imports needed by this block, and make sure the temporary directory is
removed on exit so the single-rank store cleanup is handled safely.

Source: Linters/SAST tools

…#450)

The memory planner rejected every candidate whose resident KV was below a
flat 32768-token floor (cuda_memory_planner.cpp). That floor assumes a large
GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the
planner threw 'no viable forward/KV layout fits budget ...' even though a
modest, usable KV pool fit.

Cap the floor by what the candidate's budget can actually afford for KV
((budget - arena - state_bytes) / per_kv_token_bytes), with a 7/8 margin to
absorb page-flooring and a small 2048-token absolute floor. Large cards keep
the 32768 floor (affordable exceeds it), so their behavior is unchanged.

Verified on an RTX 4090 with gpu_memory_utilization tuned to emulate a 4 GB
budget (~1474 MiB), loading Qwen3-0.6B on the cuda-native driver: stock throws
'no viable forward/KV layout fits budget 1474 MiB'; fixed selects a layout
(N=2048, R=512, 472 KV pages = 7552 KV tokens).
@ingim ingim changed the title fix(driver/vllm): use gloo for single-GPU process group to avoid ~2.8 GB NCCL overhead fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native) Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
driver/cuda/src/cuda_memory_planner.cpp (1)

632-635: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant ternary — condition is always true here.

The guard at Line 589 (if (arena + state_bytes >= budget) continue;) already guarantees budget > arena + state_bytes by the time execution reaches Line 633, making the ternary's false branch (: 0) dead code. Also, budget - arena - state_bytes duplicates the remaining value already computed at Line 590.

♻️ Simplify by reusing `remaining`
-            const std::size_t affordable_kv_tokens =
-                (budget > arena + state_bytes)
-                    ? (budget - arena - state_bytes) / per_kv_token_bytes
-                    : 0;
+            const std::size_t affordable_kv_tokens = remaining / per_kv_token_bytes;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@driver/cuda/src/cuda_memory_planner.cpp` around lines 632 - 635, The
affordable token calculation in cuda_memory_planner.cpp is using a redundant
ternary because the earlier guard in the same loop already ensures the budget
exceeds arena + state_bytes. Update the logic around the affordable_kv_tokens
computation in the CUDA memory planner to reuse the existing remaining value
instead of recomputing budget - arena - state_bytes, and remove the dead false
branch so the expression is simplified and consistent with the prior check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@driver/cuda/src/cuda_memory_planner.cpp`:
- Around line 632-635: The affordable token calculation in
cuda_memory_planner.cpp is using a redundant ternary because the earlier guard
in the same loop already ensures the budget exceeds arena + state_bytes. Update
the logic around the affordable_kv_tokens computation in the CUDA memory planner
to reuse the existing remaining value instead of recomputing budget - arena -
state_bytes, and remove the dead false branch so the expression is simplified
and consistent with the prior check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c375a738-cc18-4be5-8696-5a5778b85ee2

📥 Commits

Reviewing files that changed from the base of the PR and between 832c27b and e5e4827.

📒 Files selected for processing (1)
  • driver/cuda/src/cuda_memory_planner.cpp

@ingim ingim merged commit 496a124 into main Jul 1, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant