Refactor hardware access with dependency injection#137
ncclementi merged 5 commits into rapidsai:main
Conversation
Introduce `rapids_cli/providers.py` as the process-wide home for the GPU, system, and CUDA-toolkit provider instances. The doctor orchestrator installs real providers once per run via `set_providers()`; checks and debug functions read them via `get_gpu_info()` / `get_system_info()` / `get_toolkit_info()`. Lazy fallbacks construct real implementations on first access, so nothing imports `pynvml` / `psutil` / `cuda.pathfinder` unless a provider is actually touched.

Drop the intermediate `Check` base class and the six check subclasses; every check is now a plain module-level function with a `(verbose=False, **kwargs)` signature. Same treatment for `debug.run_debug` and its `gather_*` helpers.

Tests use pytest fixtures in `rapids_cli/tests/conftest.py` (`set_gpu_info` / `set_system_info` / `set_toolkit_info`) that wrap `monkeypatch.setattr` on the registry globals, plus an autouse reset fixture for isolation. External plugin authors who already write `def my_check(verbose=False, **kwargs)` are unaffected.

Also fix a latent inconsistency: `doctor.py` now installs `system_info` too, so the memory check no longer relies on its own fallback at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
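The description above can be condensed into a small runnable sketch of the registry pattern. Function names follow the PR (`set_providers`, `get_gpu_info`); the bodies are simplified illustrations, not the actual rapids-cli code, and the real module holds three providers rather than one:

```python
# Sketch of the provider registry described above (simplified).
_gpu_info = None  # installed once per run by the doctor orchestrator


def set_providers(gpu_info=None):
    """Install providers for the current run; only non-None args are applied."""
    global _gpu_info
    if gpu_info is not None:
        _gpu_info = gpu_info


def get_gpu_info():
    """Return the installed provider, lazily building a real one on first touch."""
    global _gpu_info
    if _gpu_info is None:
        # The real code imports pynvml and builds NvmlGpuInfo here, so
        # nothing heavy is imported unless a provider is actually used.
        _gpu_info = object()
    return _gpu_info


def gpu_check(verbose=False, **kwargs):
    """Checks are plain module-level functions that read from the registry."""
    return get_gpu_info() is not None
```

Tests then install fakes through `set_providers()` (or by monkeypatching the module attribute) and never touch `pynvml`.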
Design note: pattern choice

Worth flagging for reviewers: what this PR introduces in `providers.py` is a Service Locator combined with a Singleton-style process-wide registry. Both patterns have well-known drawbacks: the Singleton is shared mutable state, and the Service Locator hides a function's dependencies from its signature.

The tradeoff was made knowingly. For a single-shot CLI with no concurrency and no library embedding, the Singleton concerns are largely theoretical, and the Service Locator's hidden-dependency cost is small because the set of possible dependencies is tiny (three providers) and each check's body is short. The alternative — threading providers through every check signature — was tried first and leaked fallback boilerplate into every function.
ncclementi
left a comment
I left a few comments, mostly about design and questions of implementation. I haven't had the time to try this locally, and I'm heading on PTO for a week.
I know @jayavenkatesh19 was trying this, so he might have more concrete feedback.
General comment: once this refactor is done, we need some developer documentation on how to use it, how to add checks, and how the general design works.
```python
# SPDX-License-Identifier: Apache-2.0
"""This module contains the debug subcommand for the Rapids CLI."""

from __future__ import annotations
```
Why do we need this here? What change requires it? I noticed it's in several files, and it's not clear to me why.
Removed in 56ba029. This was added preemptively but none of these files use X | Y type annotations — it is only needed in providers.py and hardware.py (new files that use GpuInfoProvider | None etc.).
```diff
 def gather_cuda_version():
     """Return CUDA driver version as a string, similar to nvidia-smi output."""
-    version = pynvml.nvmlSystemGetCudaDriverVersion()
+    ver = get_gpu_info().cuda_driver_version
```
I'd avoid changing style — renaming `version` to `ver` doesn't make the code any clearer. In general I would hold off on these kinds of changes.
Reverted to version in 56ba029. The rename was a side-effect of swapping the RHS to use the provider, but the original name is fine (the local-scope shadow of importlib.metadata.version already existed before this PR).
```python
self._cuda_driver_version = pynvml.nvmlSystemGetCudaDriverVersion()
self._driver_version = pynvml.nvmlSystemGetDriverVersion()
```
I noticed we are checking the driver version using pynvml. We should try to rely on cuda.core.system, like what @jayavenkatesh19 did in his PR #141 — for example here: https://github.com/rapidsai/rapids-cli/pull/141/changes#diff-c91bfdf239eab90d8996e1d73e12ed815f9a9c38cacfb4b9ba41b00eee766e58R158-R186
Agree this would be ideal. cuda.core.system currently provides get_driver_version() (which _gather_toolkit_info already uses), but it does not yet expose device enumeration, compute capability, memory info, or NVLink states. A full migration from pynvml would require either cuda.core.system adding those APIs or a hybrid approach. I think this is best as a follow-up PR once broader coverage is available. Related: issue #145 (cuda-bindings dependency).
As of a few days ago, cuda.core should have all of the APIs necessary for rapids-cli. I filed #146 to migrate — should be testable now, but will need to wait until after the cuda.core 1.0 release to merge.
Great news! Filed #147 to track migrating NvmlGpuInfo once cuda-core 1.0 ships. The GpuInfoProvider Protocol from this PR should make the swap straightforward — only hardware.py and test_hardware.py need to change, everything downstream reads through the abstraction.
```diff
-def cuda_toolkit_check(
-    verbose=False, *, toolkit_info: CudaToolkitInfo | None = None, **kwargs
-):
+def cuda_toolkit_check(verbose=False, **kwargs):
```
What happened here to the `toolkit_info: CudaToolkitInfo` parameter?
The toolkit_info parameter was replaced by the provider registry — cuda_toolkit_check now reads from get_toolkit_info() (via rapids_cli.providers) instead of receiving it as a kwarg. This is the core change of the DI refactor: checks no longer need provider parameters threaded through their signatures. Tests install fakes into the registry via monkeypatch.setattr fixtures in conftest.py.
Also reverted the local variable name back to toolkit_info in 56ba029 to minimize the diff.
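To make the mechanics concrete, here is a hedged sketch of a check reading from the registry and a test swapping in a fake. Plain `setattr` stands in for the `monkeypatch.setattr` fixtures, `SimpleNamespace` stands in for the real provider and registry objects, and the error message is illustrative:

```python
import types

# Stand-in registry: in rapids-cli this role is played by rapids_cli.providers.
providers = types.SimpleNamespace(_toolkit_info=None)


def get_toolkit_info():
    return providers._toolkit_info


def cuda_toolkit_check(verbose=False, **kwargs):
    # The check pulls its dependency from the registry instead of a kwarg.
    info = get_toolkit_info()
    if info.missing_libs:
        raise ValueError(f"missing CUDA libraries: {info.missing_libs}")
    return True


# A test installs a fake; pytest's monkeypatch.setattr does this with
# automatic teardown — plain attribute assignment shows the mechanics.
providers._toolkit_info = types.SimpleNamespace(missing_libs=[])
assert cuda_toolkit_check()
```

The check's public signature stays `(verbose=False, **kwargs)`, so no caller has to thread providers through.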
```diff
+    info = get_toolkit_info()
+
     # Check library findability
-    if toolkit_info.missing_libs:
-        any_found_via = next(iter(toolkit_info.found_libs.values()), None)
-        raise ValueError(
-            _format_missing_error(toolkit_info.missing_libs, any_found_via)
-        )
+    if info.missing_libs:
+        any_found_via = next(iter(info.found_libs.values()), None)
+        raise ValueError(_format_missing_error(info.missing_libs, any_found_via))

     # Check driver availability
-    if toolkit_info.driver_major is None:
+    if info.driver_major is None:
         raise ValueError(
             "Unable to query the GPU driver's CUDA version. "
             "RAPIDS requires a working NVIDIA GPU driver."
         )

-    driver_major = toolkit_info.driver_major
-    toolkit_major = toolkit_info.toolkit_major
+    driver_major = info.driver_major
+    toolkit_major = info.toolkit_major

     # Compare toolkit to driver (only error when toolkit > driver, drivers are backward compatible)
     if toolkit_major is not None and toolkit_major > driver_major:
         raise ValueError(
             _format_mismatch_error(
                 toolkit_major,
                 driver_major,
-                toolkit_info.found_libs.get("cudart"),
-                toolkit_info.cudart_path,
+                info.found_libs.get("cudart"),
+                info.cudart_path,
             )
         )

     # Only check system paths if CUDA was found via system/CUDA_HOME.
     # When found via conda or pip, RAPIDS uses those libs and ignores system paths.
-    cudart_source = toolkit_info.found_libs.get("cudart", "")
+    cudart_source = info.found_libs.get("cudart", "")
```
Similar comment regarding style as above — I don't think we should be renaming variables here. It makes the review process longer, and the truth is nothing changed in this part.
```python
def test_failing_gpu_info_device_count():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().device_count


def test_failing_gpu_info_devices():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().devices


def test_failing_gpu_info_cuda_driver_version():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().cuda_driver_version


def test_failing_gpu_info_driver_version():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().driver_version
```
I think ideally these ValueErrors would be more specific. They seem too generic — I know it's what FailingGpuInfo produces, but we could probably make them more representative of what you'd actually get if you triggered those scenarios.
Addressed in 56ba029. Added a custom HardwareInfoError exception in hardware.py. NvmlGpuInfo now raises HardwareInfoError instead of generic ValueError when NVML initialization fails, and the test fakes (FailingGpuInfo, FailingSystemInfo) raise it too. Check functions catch HardwareInfoError and re-raise with user-facing ValueError messages.
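A minimal sketch of the error-translation pattern described in this reply. Class names follow the PR (`HardwareInfoError`, `FailingGpuInfo`); the message text is illustrative, and the real check reads its provider from the registry rather than taking it as a parameter:

```python
class HardwareInfoError(Exception):
    """Raised when hardware information cannot be queried (e.g. NVML init fails)."""


class FailingGpuInfo:
    """Test fake: every property raises the domain-specific error."""

    @property
    def device_count(self):
        raise HardwareInfoError("NVML initialization failed: driver not loaded")


def gpu_check(gpu_info, verbose=False, **kwargs):
    """Checks catch HardwareInfoError and re-raise a user-facing ValueError."""
    try:
        return gpu_info.device_count > 0
    except HardwareInfoError as err:
        raise ValueError(f"No GPU available ({err})") from err
```

The split keeps the hardware layer's errors domain-specific while the check layer still speaks in the user-facing `ValueError` messages the doctor output expects.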
```python
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
```
We should try to replace some of these with cuda.core.system equivalents if we have them.
Same reply as above — cuda.core.system does not yet have equivalents for device count, compute capability, memory, or NVLink. Happy to migrate as a follow-up when the API surface grows.
```python
@dataclass
class FakeGpuInfo:
```
I'm not sure how I feel about these testing-utility classes (from here down) living in hardware.py — I wonder if they should live in the tests folder, in a utils file or similar.
Moved all test fakes (FakeGpuInfo, FakeSystemInfo, FailingGpuInfo, FailingSystemInfo) to rapids_cli/tests/fakes.py in 56ba029. hardware.py now only contains protocols, data classes, and real implementations.
```python
@property
def device_count(self) -> int:
    """Raise ValueError."""
    raise ValueError("No GPU available")
```
The ValueError here, and in all the cases below, seems a bit generic. I wonder if we could be more specific and use an error that is more representative of what we would get in the real scenario.
Addressed in 56ba029 — see reply on the test_hardware.py comment. HardwareInfoError replaces ValueError throughout.
```python
    toolkit_info: CudaToolkitInfo | None = None,
) -> None:
    """Install providers for the current run. Only non-None args are applied."""
    global _gpu_info, _system_info, _toolkit_info
```
Why do we need to use global variables here? This is not very pythonic, and I'm worried it's not great for maintainability. Does it have to be this way?
The registry uses module-level globals the same way Python's own `logging`, `warnings`, and `locale` modules do — it's the standard pattern for process-wide ambient context. The alternatives considered:

- Threading providers through function kwargs — the original approach, but it leaked 4-line fallback boilerplate into every check function and required the orchestrator to know which providers each check needs.
- A class-based approach — an intermediate iteration; it added scaffolding without reducing complexity.
- `contextvars` — the "proper" escape hatch if we ever need async/per-thread isolation, but overkill for a synchronous CLI tool.

The globals are private (`_gpu_info`, etc.) and only mutated in two places: `set_providers()` (called once by the orchestrator) and the lazy fallbacks in `get_*()` (guarded by `# pragma: no cover`). Tests swap them via `monkeypatch.setattr`, which auto-reverts after each test. See the design patterns discussion for more context.
I'm not convinced by this. I consulted with @gforsyth to see if he had a better idea on how to approach this, and he suggested dataclasses — something like the following, but with better typing:

```python
from dataclasses import dataclass

@dataclass
class Providers:
    gpu_info: None = None
    system_info: None = None
    toolkit_info: None = None

a = Providers(gpu_info='hithere')
a
# Providers(gpu_info='hithere', system_info=None, toolkit_info=None)
```

I think this is a good approach, and from my conversation with him it shouldn't add extra calls.
"If the dataclass is instantiated in the providers file, the helper get_* functions in there can access the class members and return them"
Great suggestion — done in 326c789. Replaced the bare module-level globals with a `_Providers` dataclass instance:

```python
@dataclass
class _Providers:
    gpu_info: GpuInfoProvider | None = field(default=None)
    system_info: SystemInfoProvider | None = field(default=None)
    toolkit_info: CudaToolkitInfo | None = field(default=None)


_providers = _Providers()
```

The `set_providers()` / `get_*()` public API is unchanged — the only difference is `_providers.gpu_info` instead of `global _gpu_info`. Tests monkeypatch the dataclass attributes (`providers._providers.gpu_info`), which auto-reverts the same way.
I tested these changes out, and everything seems to be working well functionally. I did come across an issue with using cuda-core that is documented in #145.

While I do understand the reasoning behind abstracting away the individual pynvml errors and showing a generic error to the user, I think ValueError might not be the best choice here. We could add a new exception type in hardware.py — something like HardwareInfoError — and use that to handle errors. That way the user knows it is a GPU-related problem, not one of the generic Python contexts where ValueError usually shows up.

I also think @ncclementi's comments about stylistic choices need some clarification, and we should document this new structure of using GPU info from providers.py instead of direct pynvml calls for future users.
```python
self._device_count = pynvml.nvmlDeviceGetCount()
self._cuda_driver_version = pynvml.nvmlSystemGetCudaDriverVersion()
self._driver_version = pynvml.nvmlSystemGetDriverVersion()

self._devices = []
for i in range(self._device_count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
    memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
```
Adding to Naty's comment above: I think it would be good to add try/except blocks around these pynvml calls too, because they could hypothetically throw errors that pynvml.nvmlInit() did not.
After nvmlInit() succeeds, the individual calls (nvmlDeviceGetCount, nvmlSystemGetCudaDriverVersion, etc.) are extremely unlikely to fail — if the driver is loaded, these are thin wrappers around kernel ioctls. Adding try/except for each would be verbose and handle scenarios that essentially require a driver bug. I'd prefer to handle this as a follow-up if we see it cause issues in practice. That said, if you feel strongly about it we can add a broad except pynvml.NVMLError around the post-init block that wraps into HardwareInfoError.
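If the broad wrap were added, it might look like the following sketch. `NVMLError` here is a local stand-in class, since this example deliberately avoids importing pynvml, and `query_or_raise` is a hypothetical helper name:

```python
class NVMLError(Exception):
    """Stand-in for pynvml.NVMLError in this sketch."""


class HardwareInfoError(Exception):
    """Domain-specific error introduced in this PR."""


def query_or_raise(query_fn):
    """Run one post-init NVML query, translating any NVMLError into
    HardwareInfoError so callers see a single hardware-failure type."""
    try:
        return query_fn()
    except NVMLError as err:
        raise HardwareInfoError(f"NVML query failed after init: {err}") from err
```

One wrapper around the whole post-init block keeps the noise down compared to a try/except per call.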
```python
try:
    state = pynvml.nvmlDeviceGetNvLinkState(handle, link_id)
    nvlink_states.append(bool(state))
except pynvml.NVMLError:
```
Like in #143, we can scope this try/except more tightly using the NVMLError_InvalidArgument and NVMLError_NotSupported errors, so it only covers the case where we have reached the end of all the available devices.
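A runnable sketch of the narrowed catch the comment suggests. The exception classes are local stand-ins mirroring pynvml's names, `get_state` stands in for `nvmlDeviceGetNvLinkState`, and the link cap of 18 is an illustrative upper bound:

```python
class NVMLError(Exception):
    pass


class NVMLError_InvalidArgument(NVMLError):
    pass


class NVMLError_NotSupported(NVMLError):
    pass


def gather_nvlink_states(get_state, max_links=18):
    """Collect link states, stopping only on the 'past the last link' errors.

    Any other NVMLError still propagates, instead of being silently swallowed
    by a blanket `except NVMLError` clause.
    """
    nvlink_states = []
    for link_id in range(max_links):
        try:
            nvlink_states.append(bool(get_state(link_id)))
        except (NVMLError_InvalidArgument, NVMLError_NotSupported):
            break  # reached the end of the available links
    return nvlink_states
```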
…ff noise

- Add HardwareInfoError custom exception replacing generic ValueError for hardware failures (suggested by jayavenkatesh19 and ncclementi)
- Move test fakes (FakeGpuInfo, FakeSystemInfo, FailingGpuInfo, FailingSystemInfo) from hardware.py to rapids_cli/tests/fakes.py
- Narrow the NVMLError catch in nvlink enumeration to NVMLError_InvalidArgument and NVMLError_NotSupported (matching the PR rapidsai#143 pattern)
- Remove unnecessary `from __future__ import annotations` from check modules and debug.py
- Remove `**kwargs` from check functions that didn't have it on main
- Revert cosmetic variable renames (ver→version, info→toolkit_info) to minimize diff noise

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@ncclementi @jayavenkatesh19 — pushed 56ba029 addressing the review feedback.

Code changes:

Follow-up items (separate PRs):

I've replied to each inline comment individually.
jacobtomlinson
left a comment
I would recommend reinstating the **kwargs, or at least adding a **_ to the signature. This allows us to add new kwargs in the future without breaking existing checks. It's less important because these checks live in the main repo, but imagine if we wanted to add a new kwarg and then needed to go through cudf/cuml/etc and update their check signatures. See the docs for more info.
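The forward-compatibility point can be seen in a tiny sketch (check and kwarg names here are made up): when the orchestrator later grows a new kwarg, only signatures without `**kwargs` break.

```python
def old_style_check(verbose=False):
    """A check that dropped **kwargs — brittle against new orchestrator kwargs."""
    return "old"


def kwargs_check(verbose=False, **kwargs):
    """A check that absorbs unknown kwargs — forward compatible."""
    return "new"


def run_checks(checks):
    results = {}
    for check in checks:
        try:
            # Orchestrator later starts passing a hypothetical new kwarg.
            results[check.__name__] = check(verbose=True, timeout=30)
        except TypeError:
            results[check.__name__] = "BROKEN: unexpected keyword"
    return results
```

This is exactly the scenario where downstream repos (cudf, cuml, etc.) would all need signature updates if `**kwargs` were dropped.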
External checks (cudf, cuml, etc.) need to absorb future kwargs without breaking when the orchestrator adds new arguments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
This looks good to me! I will wait for @ncclementi and @jacobtomlinson to review as well.
@mmccarty thanks for all the updates, this is looking good. The one thing I'm still not convinced about is the use of globals — I left a comment in that thread (https://github.com/rapidsai/rapids-cli/pull/137/changes#r3156781409) based on a conversation I had with Gil. For maintainability and debugging purposes, I'm thinking a dataclass would be better.
Wraps the three module-level provider variables in a _Providers dataclass, eliminating the need for `global` statements while keeping the same set_providers/get_* public API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@ncclementi I like the dataclass approach — implemented in 326c789.
Thank you @mmccarty for all your work.
This looks good to me, it looks like I can't merge until @jacobtomlinson approves since he requested changes.
jacobtomlinson
left a comment
I had a quick look through and overall this looks awesome. Deferring to @ncclementi on merge decision.
Summary
Decouples check logic from hardware access so tests can swap in fakes without `mock.patch` on `pynvml` / `psutil` / `cuda.pathfinder`.

- `rapids_cli/hardware.py` — `DeviceInfo` dataclass, `GpuInfoProvider` / `SystemInfoProvider` Protocols, lazy-loading real implementations (`NvmlGpuInfo`, `DefaultSystemInfo`), and test fakes (`FakeGpuInfo`, `FakeSystemInfo`, `FailingGpuInfo`, `FailingSystemInfo`). `test_hardware.py` covers the provider layer directly.
- `rapids_cli/providers.py` — process-wide registry with `set_providers(gpu_info=, system_info=, toolkit_info=)` and `get_gpu_info()` / `get_system_info()` / `get_toolkit_info()` lazy accessors. The doctor orchestrator installs providers once per run; checks read from the registry.
- Checks and `debug.run_debug` — plain module-level functions, no provider kwargs and no classes. Signatures stay `def my_check(verbose=False, **kwargs)` so third-party plugin authors are unaffected.
- `rapids_cli/tests/conftest.py` adds `set_gpu_info` / `set_system_info` / `set_toolkit_info` fixtures wrapping `monkeypatch.setattr`, plus an autouse reset for isolation. Eliminates ~51 hardware `mock.patch` calls and ~11 `MagicMock` objects from the check/debug tests.
- Latent fixes: `0` was being passed to `nvmlDeviceGetNvLinkState` instead of the actual link id, and the doctor orchestrator only installed `gpu_info`, so the memory check fell back to constructing its own `DefaultSystemInfo()` at runtime.

Note: this branch incorporates upstream changes merged after #135/#136 (CUDA toolkit check #141, richer nvlink check #143).
Test plan
- `pytest` — 88 tests pass
- `pre-commit run --all-files` passes
- `rapids doctor --verbose` on a GPU host — all 6 entry points discovered; `gpu`, `gpu_compute_capability`, `cuda`, `memory_to_gpu_ratio` (with an expected warning), and `nvlink_status` passed end-to-end on real hardware, exercising the `# pragma: no cover` lazy-load branches in `providers.py`.
- `cuda_toolkit_check` raised `ModuleNotFoundError: No module named 'cuda.core.system'` during testing, which turned out to be a stale conda env with `cuda-core 0.3.2` — the declared constraint (`cuda-core >=0.6.0`) is already correct; filed and closed #144 ("cuda_toolkit check fails with ModuleNotFoundError: No module named 'cuda.core.system'") with that explanation. Will re-verify on a fresh env.

🤖 Generated with Claude Code