fix(spur-cloud-api): adapt GPU allocation to CDI-based device API by shiv-tyagi · Pull Request #63 · ROCm/spur-cloud

shiv-tyagi · 2026-06-11T17:17:39Z

Summary

Adapts get_gpu_capacity to the new CDI-based device API introduced in spur PR #262. The flat alloc.gpus field was replaced upstream with a generic alloc.devices map keyed by device class — this reads allocated GPUs from alloc.devices["gpu"] and cross-references device IDs against total resources for type resolution.
Bumps tonic 0.12→0.14 and prost 0.13→0.14 to match the updated spur-proto dependency.

Test plan

cargo build — compiles cleanly against latest spur-proto from main
Deployed on K8s test cluster with fake CDI GPU resources (3 nodes × 4 MI300X GPUs)
/api/gpus endpoint returns correct capacity, allocation counts, and per-node breakdown
Frontend loads and displays GPU dashboard

Copilot

Pull request overview

This PR updates spur-cloud-api to consume Spur’s new CDI-based device allocation API for GPU accounting, and aligns gRPC/protobuf dependencies with the updated spur-proto.

Changes:

Update get_gpu_capacity to read allocated GPUs from alloc.devices["gpu"] and resolve GPU types via device_id lookup in total.gpus.
Bump tonic/prost/prost-types to 0.14 and adjust tower dependency configuration.
Pin spur-proto to a specific upstream commit revision.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`crates/spur-cloud-api/src/spur_client.rs`	Switch GPU allocation counting to CDI device map (`alloc.devices["gpu"]`) with `device_id`-based type resolution.
`Cargo.toml`	Update gRPC/prost deps and change `spur-proto` from tag pin to git revision pin (plus `tower` dependency tweak).
`Cargo.lock`	Lockfile updates resulting from dependency bumps and spur-proto revision change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

spur PR #262 replaced the flat `alloc.gpus` field with a generic `alloc.devices` map keyed by device class. Update `get_gpu_capacity` to read allocated GPUs from `alloc.devices["gpu"]` and cross-reference device IDs against total resources for type resolution. Bump tonic 0.12→0.14 and prost 0.13→0.14 to match spur-proto.

yansun1996

Code Review

Refactor and test coverage are a clear improvement — extracting compute_gpu_pools(&[NodeInfo]) makes the logic unit-testable without a live gRPC client, and the new cases cover idle/allocated/down/multi-type/unresolved-id paths well. A few items below, one of which may be a correctness bug.

Top issue — verify `count` / `device_id` semantics (possible correctness bug)

The matching is:

total.gpus.iter().find(|g| g.device_id == dev.device_id)
...
pool.allocated += dev.count as u32;

This only makes sense if device_id is a physical GPU id in both total.gpus and AllocatedDevice. But if that's true, count for a single physical device should always be 1 — yet the test gpu_capacity_count_greater_than_one models device_id: 0, count: 2 on a 2-GPU node and asserts allocated == 2, which reports device_id 1 as allocated when it's actually idle.

So one of two things is wrong:

If count can legitimately be >1, then device_id is not a physical id (it's a class/model index), and matching it against total.gpus[].device_id mixes two id spaces — available = total − allocated becomes unreliable.
If device_id is physical, then count is always 1 and the count-based math (and that test) models an impossible state.

Please confirm against the spur #262 proto what AllocatedDevice.device_id and .count actually mean, and make the test reflect a real CDI payload. This is the one item that could silently produce wrong dashboard numbers.

Other notes

Double warning: an unresolved device_id triggers warn! in the global block and again in the per-node block for the same device. Consider resolving once.
Silent drop when total is missing: the global block is now if let (Some(alloc), Some(total)). A node with allocations but no total_resources now drops those allocations with no warning.
dev.count as u32: if the proto field is i32, a negative value wraps to a huge u32. u32::try_from (or .max(0)) would be safer at the boundary.
Dependency tree: tonic 0.14 pulls in axum 0.8 while the app still uses axum 0.7, so two axum trees now compile. Functionally fine (tonic's axum is internal), just a larger build — worth noting.
spur-proto pinned to a raw commit instead of a release tag. The PR body acknowledges this; please re-pin to a tag once spur cuts a release including #262, for reproducibility.

Test gaps (once `count` is settled)

A multi-node + multi-allocation case confirming global vs per-node tallies agree.
The "alloc present, total absent" edge case noted above.

Verdict

Looks good pending clarification of the count/device_id semantics — the only item that could affect correctness. Everything else is minor cleanup; the dep bumps and refactor are sound.

Eliminate double warning on unresolved device_id by resolving allocations through a per-node id_to_type lookup in a single pass. Warn when a node has alloc_resources but no total_resources instead of silently dropping. Use saturating arithmetic for u64→u32 count conversion and pool accumulation. Add tests for alloc-without-total and multi-node tally consistency.

shiv-tyagi · 2026-06-12T05:58:39Z

Thanks for the thorough review.

count/device_id semantics: device_id is a physical GPU index and count is always 1 for GPUs today (spur-core uses AllocatedDevice::injectable(id) which hardcodes count: 1). The count > 1 path only fires for countable generics like bandwidth/licenses. That said, we intentionally keep the count-based math in spur-cloud rather than hardcoding count == 1, so we don't break if spur later adds time-slicing or MIG support. Updated the test comment to make this explicit.

For the other items:

Double warning: Refactored to resolve device_id through a per-node HashMap lookup in a single pass. Both global pool and per-node tallies now use the same resolved data.
Silent drop (alloc without total): Now emits a warning and skips the node.
dev.count as u32: Changed to u32::try_from(...).unwrap_or(u32::MAX). Also switched pool accumulation to saturating_add.
Dual axum trees: Noted, nothing actionable for now.
spur-proto pinned to commit: Will re-pin to a tag once spur cuts a release with #262.

Added two new tests: alloc-without-total edge case, and a multi-node scenario that asserts global totals equal the sum of per-node tallies.

shiv-tyagi force-pushed the fix/cdi-device-api-compat branch 2 times, most recently from 16da85e to 42e6d63 Compare June 11, 2026 17:46

shiv-tyagi marked this pull request as ready for review June 11, 2026 17:46

shiv-tyagi requested review from powderluv, sgopinath1 and yansun1996 as code owners June 11, 2026 17:46

Copilot AI review requested due to automatic review settings June 11, 2026 17:46

Copilot started reviewing on behalf of shiv-tyagi June 11, 2026 17:47 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated

Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated

Comment thread Cargo.toml Outdated

Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated

shiv-tyagi force-pushed the fix/cdi-device-api-compat branch from 42e6d63 to 7501a0b Compare June 11, 2026 18:11

yansun1996 reviewed Jun 12, 2026

View reviewed changes

sgopinath1 approved these changes Jun 12, 2026

View reviewed changes

sgopinath1 merged commit acea05e into ROCm:main Jun 12, 2026
7 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(spur-cloud-api): adapt GPU allocation to CDI-based device API#63

fix(spur-cloud-api): adapt GPU allocation to CDI-based device API#63
sgopinath1 merged 2 commits into
ROCm:mainfrom
shiv-tyagi:fix/cdi-device-api-compat

shiv-tyagi commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yansun1996 left a comment

Uh oh!

shiv-tyagi commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

shiv-tyagi commented Jun 11, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yansun1996 left a comment

Choose a reason for hiding this comment

Code Review

Top issue — verify count / device_id semantics (possible correctness bug)

Other notes

Test gaps (once count is settled)

Verdict

Uh oh!

shiv-tyagi commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Top issue — verify `count` / `device_id` semantics (possible correctness bug)

Test gaps (once `count` is settled)