fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native) by ingim · Pull Request #456 · pie-project/pie

ingim · 2026-07-01T06:16:58Z

Enables inference on single-GPU / small-VRAM (4 GB) setups on two drivers.

Addresses #450.

vllm driver — single-GPU NCCL overhead

On world_size == 1 the vllm driver's _ensure_vllm_distributed (driver/vllm/src/pie_driver_vllm/loader.py) still initialized an NCCL process group + vLLM device communicators, eagerly reserving ~2.8 GB of GPU memory (NCCL comm buffers + kernel modules) that a single GPU never uses. On small cards this alone leaves no room for the KV cache and trips Insufficient KV cache budget: 0 bytes.

Fix: use the gloo (CPU) backend for the single-rank group and disable custom all-reduce for single-GPU. No GPU-side collectives run on one GPU — vLLM's tensor_model_parallel_all_reduce short-circuits at world_size == 1 (GroupCoordinator.all_reduce returns the input before touching any communicator) — so the group only needs to stay valid for CPU metadata broadcasts. The multi-GPU (world_size > 1) path is unchanged.

Verified: Qwen3-0.6B (bnb, eager, FLASHINFER) load footprint 3841 → 1055 MiB (−2786). Under a ~3.7 GB constraint, stock OOMs; fixed loads → sizes KV → runs a full 28-layer forward → emits a token, with gloo vs nccl output identical (numeric invariance, consistent with the ws==1 all-reduce short-circuit). The world_size > 1 branch is behaviorally identical (custom-all-reduce arg unchanged; gloo block never entered).

cuda-native driver — hardcoded KV-residency floor

The memory planner rejected every candidate whose resident KV was below a flat 32768-token floor (driver/cuda/src/cuda_memory_planner.cpp:616). That floor assumes a large GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the planner threw no viable forward/KV layout fits budget ... even though a modest, usable KV pool fit (independent of the forward arena — even arena=0 yields ~14.3k < 32768).

Fix: cap the floor by what the candidate's budget can afford for KV — (budget - arena - state_bytes) / per_kv_token_bytes, with a × 7/8 margin to absorb page-flooring and a 2048-token absolute floor. On large cards affordable exceeds 32768, so the floor stays 32768 and behavior is unchanged.

Verified: full pie run text-completion on the cuda-native driver under a physical ~4 GB VRAM holder (planner budget = 1236 MiB). Stock throws no viable forward/KV layout fits budget 1236 MiB and cannot run; fixed selects a layout and generates 16 tokens end-to-end (EXIT 0). Also confirmed on the standalone driver binary (stock throws at budget 1474 MiB; fixed selects a layout and allocates a 7552-token KV cache).

Summary by CodeRabbit

Bug Fixes
- Improved distributed startup behavior for vLLM, especially in single-device runs.
- Ensured custom all-reduce is disabled unless running with multiple ranks.
- Simplified single-rank process-group initialization for more reliable startup.
- Updated CUDA KV-memory planning to better account for budget limits, reducing overly strict candidate rejection.

… GB NCCL overhead On world_size==1 the vllm driver still initialized an NCCL process group and vLLM device communicators, which eagerly reserve ~2.8 GB of GPU memory (NCCL comm buffers + kernel modules) that a single GPU never uses. On small cards (e.g. a 4 GB RTX 3050 under WSL2) that overhead alone can leave no room for the KV cache, tripping 'Insufficient KV cache budget: 0 bytes' in _ensure_vllm_distributed's downstream KV sizing (issue #450). Use the gloo (CPU) backend for the single-rank group and disable custom all-reduce for single-GPU: no GPU-side collectives run on one GPU (vLLM's tensor_model_parallel_all_reduce short-circuits at world_size==1), so the group only needs to stay valid for CPU metadata broadcasts. Measured on an RTX 4090 (Qwen3-0.6B, bitsandbytes, enforce_eager, FLASHINFER): load footprint 3841 -> 1055 MiB (-2786). Constrained to ~3.7 GB free, stock OOMs during load while the fixed path loads, sizes the KV cache, runs a full 28-layer forward, and emits a token; gloo vs nccl output is identical. The multi-GPU (world_size>1) path is unchanged.

coderabbitai · 2026-07-01T06:17:19Z

Walkthrough

Updates vLLM distributed setup to disable custom all-reduce for single-rank runs and always use gloo for single-rank process-group initialization. Also changes CUDA memory planning to compute the minimum KV-token threshold from both horizon demand and available budget.

Changes

Distributed Initialization Logic

Layer / File(s)	Summary
Custom all-reduce gating and single-rank process group backend `driver/vllm/src/pie_driver_vllm/loader.py`	Custom all-reduce now requires `world_size > 1` in addition to not being disabled; single-rank `init_process_group` calls always use `gloo` backend, removing the CUDA-based nccl/gloo conditional and `device_id` argument.

CUDA Memory Planning

Layer / File(s)	Summary
Budget-aware KV token floor `driver/cuda/src/cuda_memory_planner.cpp`	The minimum KV-token threshold now combines the prior horizon-based target with an affordable floor from remaining budget, then uses the smaller of the two values.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the main change: enabling single-GPU, small-VRAM inference for the vLLM and CUDA-native drivers.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/vllm-single-gpu-gloo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@driver/vllm/src/pie_driver_vllm/loader.py`:
- Line 242: The single-rank store path setup still uses tempfile.mktemp, which
is unsafe because FileStore can open the path after a TOCTOU window; update the
path creation in the loader logic that builds store_path for FileStore to use a
private temporary directory and create the store file inside it instead. Add the
missing tempfile and datetime imports needed by this block, and make sure the
temporary directory is removed on exit so the single-rank store cleanup is
handled safely.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c014fd2-d974-45b6-8801-15ebf305d0d6

📥 Commits

Reviewing files that changed from the base of the PR and between 17e5125 and 832c27b.

📒 Files selected for processing (1)

driver/vllm/src/pie_driver_vllm/loader.py

coderabbitai · 2026-07-01T06:20:23Z

+        # WSL2) that overhead alone can leave no room for the KV cache and
+        # trips "Insufficient KV cache budget" (see issue #450). gloo keeps the
+        # group valid for CPU metadata broadcasts while touching no GPU memory.
        store_path = tempfile.mktemp(prefix="pie_vllm_singlerank_")


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify no insecure temporary filename generation remains in the loader. rg -nP '\btempfile\.mktemp\s*\(' driver/vllm/src/pie_driver_vllm/loader.py

Repository: pie-project/pie

Length of output: 225

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== loader.py outline ==" ast-grep outline driver/vllm/src/pie_driver_vllm/loader.py --view expanded || true echo echo "== imports and surrounding lines ==" sed -n '1,320p' driver/vllm/src/pie_driver_vllm/loader.py | nl -ba | sed -n '1,320p'

Repository: pie-project/pie

Length of output: 808

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== imports ==" sed -n '1,80p' driver/vllm/src/pie_driver_vllm/loader.py echo echo "== store-path region ==" sed -n '220,265p' driver/vllm/src/pie_driver_vllm/loader.py echo echo "== FileStore references ==" rg -n "FileStore|store_path|mktemp|mkdtemp|atexit|shutil|os\.path" driver/vllm/src/pie_driver_vllm/loader.py

Repository: pie-project/pie

Length of output: 5250

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "^(import|from) .*tempfile|^(import|from) .*datetime" driver/vllm/src/pie_driver_vllm/loader.py || true

Repository: pie-project/pie

Length of output: 153

Replace the single-rank temp path creation. tempfile.mktemp leaves a TOCTOU window before FileStore opens the path, and this block also needs the missing tempfile/datetime imports. Use a private temp directory, place the store file inside it, and clean it up on exit.

🧰 Tools

🪛 Ruff (0.15.20)

[error] 242-242: Use of insecure and deprecated function (mktemp)

(S306)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@driver/vllm/src/pie_driver_vllm/loader.py` at line 242, The single-rank store path setup still uses tempfile.mktemp, which is unsafe because FileStore can open the path after a TOCTOU window; update the path creation in the loader logic that builds store_path for FileStore to use a private temporary directory and create the store file inside it instead. Add the missing tempfile and datetime imports needed by this block, and make sure the temporary directory is removed on exit so the single-rank store cleanup is handled safely.

Source: Linters/SAST tools

…#450) The memory planner rejected every candidate whose resident KV was below a flat 32768-token floor (cuda_memory_planner.cpp). That floor assumes a large GPU: for Qwen3-0.6B, 32768 KV tokens cost ~3.5 GB, so on a 4 GB card the planner threw 'no viable forward/KV layout fits budget ...' even though a modest, usable KV pool fit. Cap the floor by what the candidate's budget can actually afford for KV ((budget - arena - state_bytes) / per_kv_token_bytes), with a 7/8 margin to absorb page-flooring and a small 2048-token absolute floor. Large cards keep the 32768 floor (affordable exceeds it), so their behavior is unchanged. Verified on an RTX 4090 with gpu_memory_utilization tuned to emulate a 4 GB budget (~1474 MiB), loading Qwen3-0.6B on the cuda-native driver: stock throws 'no viable forward/KV layout fits budget 1474 MiB'; fixed selects a layout (N=2048, R=512, 472 KV pages = 7552 KV tokens).

coderabbitai

🧹 Nitpick comments (1)

driver/cuda/src/cuda_memory_planner.cpp (1)
632-635: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant ternary — condition is always true here.

The guard at Line 589 (if (arena + state_bytes >= budget) continue;) already guarantees budget > arena + state_bytes by the time execution reaches Line 633, making the ternary's false branch (: 0) dead code. Also, budget - arena - state_bytes duplicates the remaining value already computed at Line 590.
♻️ Simplify by reusing `remaining`
-            const std::size_t affordable_kv_tokens =
-                (budget > arena + state_bytes)
-                    ? (budget - arena - state_bytes) / per_kv_token_bytes
-                    : 0;
+            const std::size_t affordable_kv_tokens = remaining / per_kv_token_bytes;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@driver/cuda/src/cuda_memory_planner.cpp` around lines 632 - 635, The
affordable token calculation in cuda_memory_planner.cpp is using a redundant
ternary because the earlier guard in the same loop already ensures the budget
exceeds arena + state_bytes. Update the logic around the affordable_kv_tokens
computation in the CUDA memory planner to reuse the existing remaining value
instead of recomputing budget - arena - state_bytes, and remove the dead false
branch so the expression is simplified and consistent with the prior check.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@driver/cuda/src/cuda_memory_planner.cpp`:
- Around line 632-635: The affordable token calculation in
cuda_memory_planner.cpp is using a redundant ternary because the earlier guard
in the same loop already ensures the budget exceeds arena + state_bytes. Update
the logic around the affordable_kv_tokens computation in the CUDA memory planner
to reuse the existing remaining value instead of recomputing budget - arena -
state_bytes, and remove the dead false branch so the expression is simplified
and consistent with the prior check.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c375a738-cc18-4be5-8696-5a5778b85ee2

📥 Commits

Reviewing files that changed from the base of the PR and between 832c27b and e5e4827.

📒 Files selected for processing (1)

driver/cuda/src/cuda_memory_planner.cpp

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

ingim changed the title ~~fix(driver/vllm): use gloo for single-GPU process group to avoid ~2.8 GB NCCL overhead~~ fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native) Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

ingim merged commit 496a124 into main Jul 1, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native)#456

fix(driver): enable single-GPU / small-VRAM (4 GB) inference (vllm + cuda-native)#456
ingim merged 2 commits into
mainfrom
fix/vllm-single-gpu-gloo

ingim commented Jul 1, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ingim commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

vllm driver — single-GPU NCCL overhead

cuda-native driver — hardcoded KV-residency floor

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ingim commented Jul 1, 2026 •

edited

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading