
Refactor hardware access with dependency injection#137

Merged
ncclementi merged 5 commits into rapidsai:main from mmccarty:refactor-di
Apr 29, 2026

Conversation

Contributor

@mmccarty mmccarty commented Feb 11, 2026

Summary

Decouples check logic from hardware access so tests can swap in fakes without mock.patch on pynvml / psutil / cuda.pathfinder.

  • rapids_cli/hardware.py — DeviceInfo dataclass, GpuInfoProvider / SystemInfoProvider Protocols, lazy-loading real implementations (NvmlGpuInfo, DefaultSystemInfo), and test fakes (FakeGpuInfo, FakeSystemInfo, FailingGpuInfo, FailingSystemInfo). test_hardware.py covers the provider layer directly.
  • rapids_cli/providers.py — process-wide registry with set_providers(gpu_info=, system_info=, toolkit_info=) and get_gpu_info() / get_system_info() / get_toolkit_info() lazy accessors. The doctor orchestrator installs providers once per run; checks read from the registry.
  • Checks and debug.run_debug — plain module-level functions, no provider kwargs and no classes. Signatures stay def my_check(verbose=False, **kwargs) so third-party plugin authors are unaffected.
  • Tests — rapids_cli/tests/conftest.py adds set_gpu_info / set_system_info / set_toolkit_info fixtures wrapping monkeypatch.setattr, plus an autouse reset for isolation. Eliminates ~51 hardware mock.patch calls and ~11 MagicMock objects from the check/debug tests.
  • Bug fixes: nvlink check was always passing 0 to nvmlDeviceGetNvLinkState instead of the actual link id; doctor orchestrator only installed gpu_info, so the memory check fell back to constructing its own DefaultSystemInfo() at runtime.
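The registry layer described above can be pictured with a minimal sketch (the Protocol body, the fake, and the missing-provider error are illustrative assumptions, not the PR's exact code):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class GpuInfoProvider(Protocol):
    """Structural interface satisfied by both real and fake providers."""

    @property
    def device_count(self) -> int: ...


@dataclass
class FakeGpuInfo:
    """Test fake; any object exposing device_count satisfies the Protocol."""

    device_count: int = 1


_gpu_info: GpuInfoProvider | None = None


def set_providers(*, gpu_info: GpuInfoProvider | None = None) -> None:
    """Install providers for the current run; only non-None args are applied."""
    global _gpu_info
    if gpu_info is not None:
        _gpu_info = gpu_info


def get_gpu_info() -> GpuInfoProvider:
    """Lazy accessor; the real module constructs NvmlGpuInfo on first use."""
    if _gpu_info is None:
        raise RuntimeError("no GPU provider installed")
    return _gpu_info


set_providers(gpu_info=FakeGpuInfo(device_count=4))
print(get_gpu_info().device_count)  # prints 4
```

The orchestrator calls set_providers() once per run; checks call get_gpu_info() and never import pynvml themselves.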

Note: this branch incorporates upstream changes merged after #135/#136 (CUDA toolkit check #141, richer nvlink check #143).

Test plan

  • pytest — 88 tests pass
  • Coverage at 96.78%, above the 95% threshold
  • pre-commit run --all-files passes
  • Manual rapids doctor --verbose on a GPU host — all 6 entry points discovered; gpu, gpu_compute_capability, cuda, memory_to_gpu_ratio (with expected warning), and nvlink_status passed end-to-end on real hardware, exercising the # pragma: no cover lazy-load branches in providers.py. cuda_toolkit_check raised ModuleNotFoundError: No module named 'cuda.core.system' during testing; this turned out to be a stale conda env with cuda-core 0.3.2 — the declared constraint (cuda-core >=0.6.0) is already correct. Filed and closed #144 ("cuda_toolkit check fails with ModuleNotFoundError: No module named 'cuda.core.system'") with that explanation. Will re-verify on a fresh env.

🤖 Generated with Claude Code

mmccarty and others added 2 commits April 17, 2026 15:27
Introduce rapids_cli/providers.py as the process-wide home for the GPU,
system, and CUDA-toolkit provider instances. The doctor orchestrator
installs real providers once per run via set_providers(); checks and
debug functions read them via get_gpu_info() / get_system_info() /
get_toolkit_info(). Lazy fallbacks construct real implementations on
first access so nothing imports pynvml / psutil / cuda.pathfinder unless
a provider is actually touched.

Drop the intermediate Check base class and the six check subclasses;
every check is now a plain module-level function with a
(verbose=False, **kwargs) signature. Same treatment for debug.run_debug
and its gather_* helpers. Tests use pytest fixtures in
rapids_cli/tests/conftest.py (set_gpu_info / set_system_info /
set_toolkit_info) that wrap monkeypatch.setattr on the registry globals,
plus an autouse reset fixture for isolation. External plugin authors
who already write def my_check(verbose=False, **kwargs) are unaffected.

Also fix a latent inconsistency: doctor.py now installs system_info too,
so the memory check no longer relies on its own fallback at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@mmccarty
Contributor Author

Design note: pattern choice

Worth flagging for reviewers: what this PR introduces in providers.py is essentially Singleton + Service Locator. The module-level globals are a Singleton by another name (module state is just a Singleton without the class ceremony), and get_gpu_info() / get_system_info() / get_toolkit_info() are textbook Service Locator — consumers pull from a registry rather than having dependencies passed in.

Both patterns have well-known drawbacks:

  • Singleton: global mutable state, test isolation requires discipline (addressed here with the autouse reset fixture in conftest.py), lifecycle is implicit.
  • Service Locator: Mark Seemann has argued this is an anti-pattern because it hides dependencies — a function signature def gpu_check(verbose=False, **kwargs) no longer advertises that it needs a GpuInfoProvider; you only discover that by reading the body. With explicit DI (the intermediate iteration of this PR, with providers as kwargs), the dependency shows up in the signature.

The tradeoff was made knowingly. For a single-shot CLI with no concurrency and no library embedding, the Singleton concerns are largely theoretical, and the Service Locator's hidden-dependency cost is small because the set of possible dependencies is tiny (three providers) and each check's body is short. The alternative — threading gpu_info= / system_info= through every signature — was judged noisier than the Service Locator cost.

If rapids-cli were ever embedded as a library in a long-running service, I'd argue for revisiting this. The "proper" form of this pattern in Python is contextvars, which gives the same "ambient context" ergonomics but with correct scoping (per-task, per-thread). Overkill for a CLI, but worth noting as the escape hatch.

Contributor

@ncclementi ncclementi left a comment


I left a few comments, mostly about design and questions of implementation. I haven't had the time to try this locally, and I'm heading on PTO for a week.

I know @jayavenkatesh19 was trying this, so he might have more concrete feedback.

General comment: once this refactor is done, we need some developer documentation on how to use it, how to add checks, and how the general design works.

Comment thread rapids_cli/debug/debug.py Outdated
# SPDX-License-Identifier: Apache-2.0
"""This module contains the debug subcommand for the Rapids CLI."""

from __future__ import annotations
Contributor


Why do we need this here? What are we changing that requires it? I noticed it's in several files, and it's not clear to me why.

Contributor Author


Removed in 56ba029. This was added preemptively but none of these files use X | Y type annotations — it is only needed in providers.py and hardware.py (new files that use GpuInfoProvider | None etc.).

Comment thread rapids_cli/debug/debug.py Outdated
 def gather_cuda_version():
     """Return CUDA driver version as a string, similar to nvidia-smi output."""
-    version = pynvml.nvmlSystemGetCudaDriverVersion()
+    ver = get_gpu_info().cuda_driver_version
Contributor


I'd avoid changing style. Renaming "version" to "ver" doesn't make the code any clearer, and in general I would avoid making this type of change.

Contributor Author


Reverted to version in 56ba029. The rename was a side-effect of swapping the RHS to use the provider, but the original name is fine (the local-scope shadow of importlib.metadata.version already existed before this PR).

Comment thread rapids_cli/hardware.py
Comment on lines +87 to +88
        self._cuda_driver_version = pynvml.nvmlSystemGetCudaDriverVersion()
        self._driver_version = pynvml.nvmlSystemGetDriverVersion()
Contributor


I noticed we are checking the driver version using pynvml. We should try to rely on cuda.core.system instead, as @jayavenkatesh19 did in PR #141, for example here https://github.com/rapidsai/rapids-cli/pull/141/changes#diff-c91bfdf239eab90d8996e1d73e12ed815f9a9c38cacfb4b9ba41b00eee766e58R158-R186

Contributor Author


Agree this would be ideal. cuda.core.system currently provides get_driver_version() (which _gather_toolkit_info already uses), but it does not yet expose device enumeration, compute capability, memory info, or NVLink states. A full migration from pynvml would require either cuda.core.system adding those APIs or a hybrid approach. I think this is best as a follow-up PR once broader coverage is available. Related: issue #145 (cuda-bindings dependency).



As of a few days ago, cuda.core should have all of the APIs necessary for rapids-cli. I filed at #146 to migrate -- should be testable now, but will need to wait until after the cuda.core 1.0 release to merge.

Contributor Author


Great news! Filed #147 to track migrating NvmlGpuInfo once cuda-core 1.0 ships. The GpuInfoProvider Protocol from this PR should make the swap straightforward — only hardware.py and test_hardware.py need to change, everything downstream reads through the abstraction.

-def cuda_toolkit_check(
-    verbose=False, *, toolkit_info: CudaToolkitInfo | None = None, **kwargs
-):
+def cuda_toolkit_check(verbose=False, **kwargs):
Contributor


what happened here with passing toolkit_info: CudaToolkitInfo ?

Contributor Author


The toolkit_info parameter was replaced by the provider registry — cuda_toolkit_check now reads from get_toolkit_info() (via rapids_cli.providers) instead of receiving it as a kwarg. This is the core change of the DI refactor: checks no longer need provider parameters threaded through their signatures. Tests install fakes into the registry via monkeypatch.setattr fixtures in conftest.py.

Also reverted the local variable name back to toolkit_info in 56ba029 to minimize the diff.

Comment on lines +193 to +223
+    info = get_toolkit_info()

     # Check library findability
-    if toolkit_info.missing_libs:
-        any_found_via = next(iter(toolkit_info.found_libs.values()), None)
-        raise ValueError(
-            _format_missing_error(toolkit_info.missing_libs, any_found_via)
-        )
+    if info.missing_libs:
+        any_found_via = next(iter(info.found_libs.values()), None)
+        raise ValueError(_format_missing_error(info.missing_libs, any_found_via))

     # Check driver availability
-    if toolkit_info.driver_major is None:
+    if info.driver_major is None:
         raise ValueError(
             "Unable to query the GPU driver's CUDA version. "
             "RAPIDS requires a working NVIDIA GPU driver."
         )

-    driver_major = toolkit_info.driver_major
-    toolkit_major = toolkit_info.toolkit_major
+    driver_major = info.driver_major
+    toolkit_major = info.toolkit_major

     # Compare toolkit to driver (only error when toolkit > driver, drivers are backward compatible)
     if toolkit_major is not None and toolkit_major > driver_major:
         raise ValueError(
             _format_mismatch_error(
                 toolkit_major,
                 driver_major,
-                toolkit_info.found_libs.get("cudart"),
-                toolkit_info.cudart_path,
+                info.found_libs.get("cudart"),
+                info.cudart_path,
             )
         )

     # Only check system paths if CUDA was found via system/CUDA_HOME.
     # When found via conda or pip, RAPIDS uses those libs and ignores system paths.
-    cudart_source = toolkit_info.found_libs.get("cudart", "")
+    cudart_source = info.found_libs.get("cudart", "")
Contributor


Similar to the style comment I made above: I don't think we should be changing variable names in this case. It makes the review process longer, and the truth is nothing changed in this part.

Contributor Author


Reverted the local variable name back to toolkit_info in 56ba029 so the body of the function is unchanged from PR #141.

Comment on lines +205 to +222
def test_failing_gpu_info_device_count():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().device_count


def test_failing_gpu_info_devices():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().devices


def test_failing_gpu_info_cuda_driver_version():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().cuda_driver_version


def test_failing_gpu_info_driver_version():
    with pytest.raises(ValueError, match="No GPU available"):
        _ = FailingGpuInfo().driver_version
Contributor


I think ideally these ValueErrors would be more specific. They seem too generic. I know it's what FailingGpuInfo is defined to raise, but we could probably make it more representative of what you'd get if you actually triggered those scenarios.

Contributor Author


Addressed in 56ba029. Added a custom HardwareInfoError exception in hardware.py. NvmlGpuInfo now raises HardwareInfoError instead of generic ValueError when NVML initialization fails, and the test fakes (FailingGpuInfo, FailingSystemInfo) raise it too. Check functions catch HardwareInfoError and re-raise with user-facing ValueError messages.
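The shape of that change, sketched (class names come from the comment above; the messages and the gpu_check body are illustrative, not the PR's exact code):

```python
class HardwareInfoError(Exception):
    """Raised when hardware information cannot be gathered (e.g. NVML init failure)."""


class FailingGpuInfo:
    """Test fake simulating a host with no usable GPU."""

    @property
    def device_count(self) -> int:
        raise HardwareInfoError("NVML initialization failed")


def gpu_check(verbose=False, **kwargs):
    try:
        return FailingGpuInfo().device_count
    except HardwareInfoError as exc:
        # Checks translate the internal error into a user-facing message.
        raise ValueError(f"GPU check failed: {exc}") from exc
```

A dedicated exception type lets checks distinguish "hardware info unavailable" from ordinary Python ValueErrors raised elsewhere.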

Comment thread rapids_cli/hardware.py
Comment on lines +92 to +94
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
Contributor


We should try to replace some of these for cuda.core.system equivalents if we have them.

Contributor Author


Same reply as above — cuda.core.system does not yet have equivalents for device count, compute capability, memory, or NVLink. Happy to migrate as a follow-up when the API surface grows.

Comment thread rapids_cli/hardware.py Outdated


@dataclass
class FakeGpuInfo:
Contributor


I'm not sure how I feel about these (from here below) testing-utils likes classes, here in hardware.py I wonder if they should live in the tests folder in a utils file or similar.

Contributor Author


Moved all test fakes (FakeGpuInfo, FakeSystemInfo, FailingGpuInfo, FailingSystemInfo) to rapids_cli/tests/fakes.py in 56ba029. hardware.py now only contains protocols, data classes, and real implementations.

Comment thread rapids_cli/hardware.py Outdated
    @property
    def device_count(self) -> int:
        """Raise ValueError."""
        raise ValueError("No GPU available")
Contributor


The ValueError here, and in all the cases below, seems a bit generic. I wonder if we could be more specific and use an error that is more representative of what we would get in the real scenario.

Contributor Author


Addressed in 56ba029 — see reply on the test_hardware.py comment. HardwareInfoError replaces ValueError throughout.

Comment thread rapids_cli/providers.py Outdated
    toolkit_info: CudaToolkitInfo | None = None,
) -> None:
    """Install providers for the current run. Only non-None args are applied."""
    global _gpu_info, _system_info, _toolkit_info
Contributor


Why do we need to use global variables here? This is not very Pythonic, and I'm worried it's inconvenient for maintainability reasons. Does it have to be this way?

Contributor Author

@mmccarty mmccarty Apr 22, 2026


The registry uses module-level globals the same way Python's own logging, warnings, and locale modules do — it's the standard pattern for process-wide ambient context. The alternatives considered:

  1. Threading providers through function kwargs — this was the original approach, but it leaked 4-line fallback boilerplate into every check function and required the orchestrator to know which providers each check needs.
  2. Class-based approach — intermediate iteration; added scaffolding without reducing complexity.
  3. contextvars — the "proper" escape hatch if we ever need async/per-thread isolation, but overkill for a synchronous CLI tool.

The globals are private (_gpu_info, etc.) and only mutated in two places: set_providers() (called once by the orchestrator) and the lazy fallbacks in get_*() (guarded by # pragma: no cover). Tests swap them via monkeypatch.setattr which auto-reverts after each test. See the design patterns discussion for more context.
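The test-side swap can be pictured like this, using pytest's MonkeyPatch API directly (the rapids_cli.providers module is stubbed with a SimpleNamespace here so the sketch is self-contained):

```python
import types

import pytest  # MonkeyPatch is usable outside fixtures too


# Stand-in for the rapids_cli.providers module (assumption for this sketch).
providers = types.SimpleNamespace(_gpu_info=None)


class FakeGpuInfo:
    device_count = 2


mp = pytest.MonkeyPatch()
mp.setattr(providers, "_gpu_info", FakeGpuInfo())
assert providers._gpu_info.device_count == 2

mp.undo()  # the conftest fixtures get this revert for free after each test
assert providers._gpu_info is None
```

In the real suite the monkeypatch fixture performs the undo automatically, which is what makes the registry safe to mutate per test.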

Contributor


I'm not convinced by this. I consulted with @gforsyth to see if he had a better idea on how to approach this, and he suggested dataclasses. Something like this, but with better typing:

In [1]: from dataclasses import dataclass

In [2]: @dataclass
   ...: class Providers:
   ...:     gpu_info: None = None
   ...:     system_info: None = None
   ...:     toolkit_info: None = None
   ...:

In [3]: a = Providers(gpu_info='hithere')

In [4]: a
Out[4]: Providers(gpu_info='hithere', system_info=None, toolkit_info=None)

I think this is a good approach, and from my conversation with him it shouldn't add extra calls.

"If the dataclass is instantiated in the providers file, the helper get_* functions in there can access the class members and return them"

Contributor Author


Great suggestion — done in 326c789. Replaced the bare module-level globals with a _Providers dataclass instance:

@dataclass
class _Providers:
    gpu_info: GpuInfoProvider | None = field(default=None)
    system_info: SystemInfoProvider | None = field(default=None)
    toolkit_info: CudaToolkitInfo | None = field(default=None)

_providers = _Providers()

The set_providers() / get_*() public API is unchanged — the only difference is _providers.gpu_info instead of global _gpu_info. Tests monkeypatch on the dataclass attributes (providers._providers.gpu_info) which auto-reverts the same way.

Contributor

@jayavenkatesh19 jayavenkatesh19 left a comment


I tested these changes out, and everything seems to be working well functionally. I did come across an issue with using cuda-core that is documented in #145

While I understand the reasoning behind abstracting away the individual pynvml errors and showing the user a generic error, I don't think ValueError is the best choice here. We could add a new exception type in hardware.py, call it something like HardwareInfoError, and use that to handle errors. That way the user knows the problem is with the GPU rather than a generic Python error, which is where ValueError usually shows up.

I also think @ncclementi's comments about stylistic choices need some clarification, and for future users we should document this new structure of reading GPU info from providers.py instead of making direct pynvml calls.

Comment thread rapids_cli/hardware.py
Comment on lines +86 to +94
        self._device_count = pynvml.nvmlDeviceGetCount()
        self._cuda_driver_version = pynvml.nvmlSystemGetCudaDriverVersion()
        self._driver_version = pynvml.nvmlSystemGetDriverVersion()

        self._devices = []
        for i in range(self._device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
            memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
Contributor

@jayavenkatesh19 jayavenkatesh19 Apr 21, 2026


Adding to Naty's comment above, I think it would be good to add try/except blocks around these pynvml calls too, because they could hypothetically throw errors that pynvml.nvmlInit() did not.

Contributor Author


After nvmlInit() succeeds, the individual calls (nvmlDeviceGetCount, nvmlSystemGetCudaDriverVersion, etc.) are extremely unlikely to fail — if the driver is loaded, these are thin wrappers around kernel ioctls. Adding try/except for each would be verbose and handle scenarios that essentially require a driver bug. I'd prefer to handle this as a follow-up if we see it cause issues in practice. That said, if you feel strongly about it we can add a broad except pynvml.NVMLError around the post-init block that wraps into HardwareInfoError.

Comment thread rapids_cli/hardware.py Outdated
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link_id)
                nvlink_states.append(bool(state))
            except pynvml.NVMLError:
Contributor

@jayavenkatesh19 jayavenkatesh19 Apr 21, 2026


Like in #143, we can scope this try/except block more tightly by catching the NVMLError_InvalidArgument and NVMLError_NotSupported errors, limiting it to the case where we have reached the end of the available links.

Contributor Author


Done in 56ba029 — narrowed to except (pynvml.NVMLError_InvalidArgument, pynvml.NVMLError_NotSupported), matching the pattern from PR #143.
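Sketching the narrowed loop with a stub in place of pynvml (the stub exception classes mirror pynvml's names; enumerate_link_states, fake_get_state, and the max_links bound are illustrative):

```python
# Stub exception hierarchy mirroring pynvml's (assumption for this sketch).
class NVMLError(Exception): ...
class NVMLError_InvalidArgument(NVMLError): ...
class NVMLError_NotSupported(NVMLError): ...


def enumerate_link_states(get_state, max_links=18):
    """Collect NVLink states until the driver reports an invalid/unsupported link."""
    states = []
    for link_id in range(max_links):
        try:
            states.append(bool(get_state(link_id)))
        except (NVMLError_InvalidArgument, NVMLError_NotSupported):
            break  # past the last real link; other NVMLErrors still propagate
    return states


def fake_get_state(link_id):
    if link_id >= 4:
        raise NVMLError_InvalidArgument()
    return 1


print(enumerate_link_states(fake_get_state))  # prints [True, True, True, True]
```

Narrowing the catch means a genuine driver fault (a different NVMLError subclass) surfaces instead of being silently treated as end-of-links.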

…ff noise

- Add HardwareInfoError custom exception replacing generic ValueError for
  hardware failures (suggested by jayavenkatesh19 and ncclementi)
- Move test fakes (FakeGpuInfo, FakeSystemInfo, FailingGpuInfo,
  FailingSystemInfo) from hardware.py to rapids_cli/tests/fakes.py
- Narrow NVMLError catch in nvlink enumeration to NVMLError_InvalidArgument
  and NVMLError_NotSupported (matching PR rapidsai#143 pattern)
- Remove unnecessary `from __future__ import annotations` from check modules
  and debug.py
- Remove `**kwargs` from check functions that didn't have it on main
- Revert cosmetic variable renames (ver→version, info→toolkit_info) to
  minimize diff noise

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@mmccarty
Contributor Author

@ncclementi @jayavenkatesh19 — pushed 56ba029 addressing the review feedback:

Code changes:

  • Added HardwareInfoError custom exception (replaces generic ValueError for hardware failures)
  • Moved test fakes to rapids_cli/tests/fakes.py
  • Narrowed NVMLError catch to NVMLError_InvalidArgument | NVMLError_NotSupported
  • Removed unnecessary from __future__ import annotations, **kwargs, and cosmetic variable renames to minimize diff noise

Follow-up items (separate PRs):

  • Migrate from pynvml to cuda.core.system once it has broader API coverage (device enumeration, compute capability, memory, NVLink)
  • Developer documentation on the provider registry pattern and how to add checks

I've replied to each inline comment individually.

Member

@jacobtomlinson jacobtomlinson left a comment


I would recommend reinstating the **kwargs, or at least adding a **_ to the signature. This allows us to add new kwargs in the future without breaking existing checks. It's less important because these checks live in the main repo, but imagine if we wanted to add a new kwarg and then needed to go through cudf/cuml/etc and update their check signatures. See the docs for more info.
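A minimal illustration of the forward-compatibility point (my_check and the timeout kwarg are hypothetical):

```python
# A third-party check written with **kwargs keeps working when the
# orchestrator later grows new keyword arguments.
def my_check(verbose=False, **kwargs):
    if verbose:
        print("running my_check")
    return "OK"


assert my_check(verbose=True) == "OK"               # today's orchestrator call
assert my_check(verbose=False, timeout=30) == "OK"  # future kwarg: absorbed, not a TypeError
```

Without **kwargs, the second call would raise TypeError, which is exactly the cross-repo breakage the comment above warns about.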

External checks (cudf, cuml, etc.) need to absorb future kwargs without
breaking when the orchestrator adds new arguments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@mmccarty
Contributor Author

**kwargs reinstated on all check functions in 25ff1d4. Good call — the plugin contract needs to absorb future kwargs so external checks in cudf/cuml/etc. don't break.

@jayavenkatesh19
Contributor

This looks good to me! I will wait for @ncclementi and @jacobtomlinson to review as well

@ncclementi
Contributor

@mmccarty thanks for all the updates, this is looking good. The one thing I'm still not convinced about is the use of globals; I left a comment in that thread https://github.com/rapidsai/rapids-cli/pull/137/changes#r3156781409 based on a conversation I had with Gil.

From a maintainability and debugging standpoint, I think the use of globals is never a good idea. I'll let @jacobtomlinson chime in here.

Wraps the three module-level provider variables in a _Providers dataclass,
eliminating the need for `global` statements while keeping the same
set_providers/get_* public API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
@mmccarty
Contributor Author

@ncclementi I like the dataclass better too. Done! Thank you!

Contributor

@ncclementi ncclementi left a comment


Thank you @mmccarty for all your work.

This looks good to me, it looks like I can't merge until @jacobtomlinson approves since he requested changes.

Member

@jacobtomlinson jacobtomlinson left a comment


I had a quick look through and overall this looks awesome. Deferring to @ncclementi on merge decision.

@ncclementi ncclementi merged commit a64f0b0 into rapidsai:main Apr 29, 2026
24 checks passed


Development

Successfully merging this pull request may close these issues.

cuda_toolkit check fails with ModuleNotFoundError: No module named 'cuda.core.system'
