Skip to content

fix(spur-cloud-api): adapt GPU allocation to CDI-based device API#63

Merged
sgopinath1 merged 2 commits into
ROCm:mainfrom
shiv-tyagi:fix/cdi-device-api-compat
Jun 12, 2026
Merged

fix(spur-cloud-api): adapt GPU allocation to CDI-based device API#63
sgopinath1 merged 2 commits into
ROCm:mainfrom
shiv-tyagi:fix/cdi-device-api-compat

Conversation

@shiv-tyagi

Copy link
Copy Markdown
Member

Summary

  • Adapts get_gpu_capacity to the new CDI-based device API introduced in spur PR #262. The flat alloc.gpus field was replaced upstream with a generic alloc.devices map keyed by device class — this reads allocated GPUs from alloc.devices["gpu"] and cross-references device IDs against total resources for type resolution.
  • Bumps tonic 0.12→0.14 and prost 0.13→0.14 to match the updated spur-proto dependency.

Test plan

  • cargo build — compiles cleanly against latest spur-proto from main
  • Deployed on K8s test cluster with fake CDI GPU resources (3 nodes × 4 MI300X GPUs)
  • /api/gpus endpoint returns correct capacity, allocation counts, and per-node breakdown
  • Frontend loads and displays GPU dashboard

@shiv-tyagi shiv-tyagi force-pushed the fix/cdi-device-api-compat branch 2 times, most recently from 16da85e to 42e6d63 Compare June 11, 2026 17:46
@shiv-tyagi shiv-tyagi marked this pull request as ready for review June 11, 2026 17:46
Copilot AI review requested due to automatic review settings June 11, 2026 17:46

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates spur-cloud-api to consume Spur’s new CDI-based device allocation API for GPU accounting, and aligns gRPC/protobuf dependencies with the updated spur-proto.

Changes:

  • Update get_gpu_capacity to read allocated GPUs from alloc.devices["gpu"] and resolve GPU types via device_id lookup in total.gpus.
  • Bump tonic/prost/prost-types to 0.14 and adjust tower dependency configuration.
  • Pin spur-proto to a specific upstream commit revision.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

File Description
crates/spur-cloud-api/src/spur_client.rs Switch GPU allocation counting to CDI device map (alloc.devices["gpu"]) with device_id-based type resolution.
Cargo.toml Update gRPC/prost deps and change spur-proto from tag pin to git revision pin (plus tower dependency tweak).
Cargo.lock Lockfile updates resulting from dependency bumps and spur-proto revision change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated
Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated
Comment thread Cargo.toml Outdated
Comment thread crates/spur-cloud-api/src/spur_client.rs Outdated
spur PR #262 replaced the flat `alloc.gpus` field with a generic
`alloc.devices` map keyed by device class. Update `get_gpu_capacity`
to read allocated GPUs from `alloc.devices["gpu"]` and cross-reference
device IDs against total resources for type resolution.

Bump tonic 0.12→0.14 and prost 0.13→0.14 to match spur-proto.
@shiv-tyagi shiv-tyagi force-pushed the fix/cdi-device-api-compat branch from 42e6d63 to 7501a0b Compare June 11, 2026 18:11

@yansun1996 yansun1996 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Refactor and test coverage are a clear improvement — extracting compute_gpu_pools(&[NodeInfo]) makes the logic unit-testable without a live gRPC client, and the new cases cover idle/allocated/down/multi-type/unresolved-id paths well. A few items below, one of which may be a correctness bug.

Top issue — verify count / device_id semantics (possible correctness bug)

The matching is:

total.gpus.iter().find(|g| g.device_id == dev.device_id)
...
pool.allocated += dev.count as u32;

This only makes sense if device_id is a physical GPU id in both total.gpus and AllocatedDevice. But if that's true, count for a single physical device should always be 1 — yet the test gpu_capacity_count_greater_than_one models device_id: 0, count: 2 on a 2-GPU node and asserts allocated == 2, which reports device_id 1 as allocated when it's actually idle.

So one of two things is wrong:

  • If count can legitimately be >1, then device_id is not a physical id (it's a class/model index), and matching it against total.gpus[].device_id mixes two id spaces — available = total − allocated becomes unreliable.
  • If device_id is physical, then count is always 1 and the count-based math (and that test) models an impossible state.

Please confirm against the spur #262 proto what AllocatedDevice.device_id and .count actually mean, and make the test reflect a real CDI payload. This is the one item that could silently produce wrong dashboard numbers.

Other notes

  • Double warning: an unresolved device_id triggers warn! in the global block and again in the per-node block for the same device. Consider resolving once.
  • Silent drop when total is missing: the global block is now if let (Some(alloc), Some(total)). A node with allocations but no total_resources now drops those allocations with no warning.
  • dev.count as u32: if the proto field is i32, a negative value wraps to a huge u32. u32::try_from (or .max(0)) would be safer at the boundary.
  • Dependency tree: tonic 0.14 pulls in axum 0.8 while the app still uses axum 0.7, so two axum trees now compile. Functionally fine (tonic's axum is internal), just a larger build — worth noting.
  • spur-proto pinned to a raw commit instead of a release tag. The PR body acknowledges this; please re-pin to a tag once spur cuts a release including #262, for reproducibility.

Test gaps (once count is settled)

  1. A multi-node + multi-allocation case confirming global vs per-node tallies agree.
  2. The "alloc present, total absent" edge case noted above.

Verdict

Looks good pending clarification of the count/device_id semantics — the only item that could affect correctness. Everything else is minor cleanup; the dep bumps and refactor are sound.

Eliminate double warning on unresolved device_id by resolving
allocations through a per-node id_to_type lookup in a single pass.
Warn when a node has alloc_resources but no total_resources instead
of silently dropping. Use saturating arithmetic for u64→u32 count
conversion and pool accumulation. Add tests for alloc-without-total
and multi-node tally consistency.
@shiv-tyagi

Copy link
Copy Markdown
Member Author

Thanks for the thorough review.

count/device_id semantics: device_id is a physical GPU index and count is always 1 for GPUs today (spur-core uses AllocatedDevice::injectable(id) which hardcodes count: 1). The count > 1 path only fires for countable generics like bandwidth/licenses. That said, we intentionally keep the count-based math in spur-cloud rather than hardcoding count == 1, so we don't break if spur later adds time-slicing or MIG support. Updated the test comment to make this explicit.

For the other items:

  • Double warning: Refactored to resolve device_id through a per-node HashMap lookup in a single pass. Both global pool and per-node tallies now use the same resolved data.
  • Silent drop (alloc without total): Now emits a warning and skips the node.
  • dev.count as u32: Changed to u32::try_from(...).unwrap_or(u32::MAX). Also switched pool accumulation to saturating_add.
  • Dual axum trees: Noted, nothing actionable for now.
  • spur-proto pinned to commit: Will re-pin to a tag once spur cuts a release with #262.

Added two new tests: alloc-without-total edge case, and a multi-node scenario that asserts global totals equal the sum of per-node tallies.

@sgopinath1 sgopinath1 merged commit acea05e into ROCm:main Jun 12, 2026
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants