fix(spur-cloud-api): adapt GPU allocation to CDI-based device API#63
Conversation
16da85e to
42e6d63
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates spur-cloud-api to consume Spur’s new CDI-based device allocation API for GPU accounting, and aligns gRPC/protobuf dependencies with the updated spur-proto.
Changes:
- Update
get_gpu_capacityto read allocated GPUs fromalloc.devices["gpu"]and resolve GPU types viadevice_idlookup intotal.gpus. - Bump
tonic/prost/prost-typesto0.14and adjusttowerdependency configuration. - Pin
spur-prototo a specific upstream commit revision.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
crates/spur-cloud-api/src/spur_client.rs |
Switch GPU allocation counting to CDI device map (alloc.devices["gpu"]) with device_id-based type resolution. |
Cargo.toml |
Update gRPC/prost deps and change spur-proto from tag pin to git revision pin (plus tower dependency tweak). |
Cargo.lock |
Lockfile updates resulting from dependency bumps and spur-proto revision change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
spur PR #262 replaced the flat `alloc.gpus` field with a generic `alloc.devices` map keyed by device class. Update `get_gpu_capacity` to read allocated GPUs from `alloc.devices["gpu"]` and cross-reference device IDs against total resources for type resolution. Bump tonic 0.12→0.14 and prost 0.13→0.14 to match spur-proto.
42e6d63 to
7501a0b
Compare
yansun1996
left a comment
There was a problem hiding this comment.
Code Review
Refactor and test coverage are a clear improvement — extracting compute_gpu_pools(&[NodeInfo]) makes the logic unit-testable without a live gRPC client, and the new cases cover idle/allocated/down/multi-type/unresolved-id paths well. A few items below, one of which may be a correctness bug.
Top issue — verify count / device_id semantics (possible correctness bug)
The matching is:
total.gpus.iter().find(|g| g.device_id == dev.device_id)
...
pool.allocated += dev.count as u32;This only makes sense if device_id is a physical GPU id in both total.gpus and AllocatedDevice. But if that's true, count for a single physical device should always be 1 — yet the test gpu_capacity_count_greater_than_one models device_id: 0, count: 2 on a 2-GPU node and asserts allocated == 2, which reports device_id 1 as allocated when it's actually idle.
So one of two things is wrong:
- If
countcan legitimately be >1, thendevice_idis not a physical id (it's a class/model index), and matching it againsttotal.gpus[].device_idmixes two id spaces —available = total − allocatedbecomes unreliable. - If
device_idis physical, thencountis always 1 and thecount-based math (and that test) models an impossible state.
Please confirm against the spur #262 proto what AllocatedDevice.device_id and .count actually mean, and make the test reflect a real CDI payload. This is the one item that could silently produce wrong dashboard numbers.
Other notes
- Double warning: an unresolved
device_idtriggerswarn!in the global block and again in the per-node block for the same device. Consider resolving once. - Silent drop when total is missing: the global block is now
if let (Some(alloc), Some(total)). A node with allocations but nototal_resourcesnow drops those allocations with no warning. dev.count as u32: if the proto field isi32, a negative value wraps to a hugeu32.u32::try_from(or.max(0)) would be safer at the boundary.- Dependency tree: tonic 0.14 pulls in
axum 0.8while the app still usesaxum 0.7, so two axum trees now compile. Functionally fine (tonic's axum is internal), just a larger build — worth noting. spur-protopinned to a raw commit instead of a release tag. The PR body acknowledges this; please re-pin to a tag once spur cuts a release including #262, for reproducibility.
Test gaps (once count is settled)
- A multi-node + multi-allocation case confirming global vs per-node tallies agree.
- The "alloc present, total absent" edge case noted above.
Verdict
Looks good pending clarification of the count/device_id semantics — the only item that could affect correctness. Everything else is minor cleanup; the dep bumps and refactor are sound.
Eliminate double warning on unresolved device_id by resolving allocations through a per-node id_to_type lookup in a single pass. Warn when a node has alloc_resources but no total_resources instead of silently dropping. Use saturating arithmetic for u64→u32 count conversion and pool accumulation. Add tests for alloc-without-total and multi-node tally consistency.
|
Thanks for the thorough review. count/device_id semantics: For the other items:
Added two new tests: alloc-without-total edge case, and a multi-node scenario that asserts global totals equal the sum of per-node tallies. |
Summary
get_gpu_capacityto the new CDI-based device API introduced in spur PR #262. The flatalloc.gpusfield was replaced upstream with a genericalloc.devicesmap keyed by device class — this reads allocated GPUs fromalloc.devices["gpu"]and cross-references device IDs against total resources for type resolution.tonic0.12→0.14 andprost0.13→0.14 to match the updatedspur-protodependency.Test plan
cargo build— compiles cleanly against latest spur-proto from main/api/gpusendpoint returns correct capacity, allocation counts, and per-node breakdown