Add CUDA toolkit check to rapids doctor #141
Conversation
Overall this looks great.
I left a comment about trying to use cuda.core.system instead of pynvml. I'm not sure if it supports enough features for us, but if we can we should.
I also notice the tests have a lot of mocking in them. Perhaps the dependency injection approach @mmccarty was exploring in #137 would help clean these up?
Also it looks like CI is failing because coverage has dropped below 95%.
```python
def _extract_major_from_cuda_path(path: Path) -> int | None:
    """Extract CUDA major version from a path like /usr/local/cuda-12.4 or its version.txt."""
    match = re.search(r"cuda-(\d+)", str(path))
    if match:
        return int(match.group(1))
    version_file = path / "version.txt"
    if version_file.exists():
        match = re.search(r"(\d+)\.", version_file.read_text())
        if match:
            return int(match.group(1))
    return None
```
There may be situations where multiple CTKs are installed. In this case we need to check which one /usr/local/cuda is symlinked to, as that will be the active one.
```python
cudart_source = toolkit_info.found_libs.get("cudart", "")
if cudart_source not in ("conda", "site-packages"):
    if _CUDA_SYMLINK.exists():
        _check_path_version(
            "/usr/local/cuda", _CUDA_SYMLINK.resolve(), driver_major
        )
```

In lines 222-226 of cuda_toolkit.py, I am using `Path.resolve()` to resolve the symlink and get the exact path, so the version being returned will be the one that `/usr/local/cuda` is symlinked to.
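For illustration, a minimal sketch of that resolution step; `_active_ctk_major` is a hypothetical helper name, not the function in the PR:

```python
import re
from pathlib import Path

_CUDA_SYMLINK = Path("/usr/local/cuda")

def _active_ctk_major() -> int | None:
    """Follow /usr/local/cuda to the CTK it currently points at and read its major version."""
    if not _CUDA_SYMLINK.exists():
        return None
    # resolve() follows the symlink chain, e.g. /usr/local/cuda -> /usr/local/cuda-12.4,
    # so with multiple CTKs installed we report the active one.
    real_path = _CUDA_SYMLINK.resolve()
    match = re.search(r"cuda-(\d+)", str(real_path))
    return int(match.group(1)) if match else None
```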
```python
pynvml.nvmlInit()
# nvmlSystemGetCudaDriverVersion() returns 1000 * major + 10 * minor
# (e.g. 12040 for CUDA 12.4), so // 1000 extracts the major version.
driver_major = pynvml.nvmlSystemGetCudaDriverVersion() // 1000
```
Could we use `cuda.core.system.get_driver_version()` instead here? If we can, it would be more future-proof.
I made this change. And the API is indeed super weird.
```python
# Maps cuda-pathfinder's found_via values to human-readable source labels.
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip",
}
```
I wonder how this works for uv; do we need to add something else? I'm thinking of a scenario where things are installed in a uv venv.
From what I can find, uv installs into site-packages the same way pip does. I updated the label to say `pip/uv` to reflect this.
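That is, roughly (a sketch of the updated mapping):

```python
# Maps cuda-pathfinder's found_via values to human-readable source labels.
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip/uv",  # uv venvs install into site-packages just like pip
}
```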
```python
return (
    f"Some CUDA libraries ({missing_str}) could not be found. "
    "Install the CUDA Toolkit, or use conda/pip which manage CUDA automatically."
)
```
pip only manages the CUDA toolkit install for cudf and cuml, and only for some recent versions; other libraries do not get it. We should update this message to give clearer instructions.
Updated the message for this.
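For reference, one possible wording along the lines requested (a sketch; the merged text may differ):

```python
return (
    f"Some CUDA libraries ({missing_str}) could not be found. "
    "Install the CUDA Toolkit, or install RAPIDS via conda, which manages "
    "CUDA automatically. Note that pip wheels bundle the CUDA Toolkit only "
    "for some packages (e.g. cudf and cuml) and only in recent versions."
)
```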
```python
    version = ctypes.c_int()
    if libcudart.cudaRuntimeGetVersion(ctypes.byref(version)) == 0:
        return version.value // 1000
except OSError:
```
Will this OSError be raised when there are no GPUs, or in which situations does it fire?
This OSError only fires if the `.so` file itself cannot be loaded (missing, broken symlink, etc.).
On a non-GPU machine, `cudart_path` would not be found by cuda-pathfinder, so we would never reach this code path. This error is only raised when the path to the file exists but the file cannot be opened.
> This error is only raised when the path to the file exists but the file cannot be opened.
If this is the case, shouldn't we raise with a human readable message instead of passing?
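For illustration, a sketch of that suggestion; `_runtime_major_via_cudart` is a hypothetical helper name and the surrounding structure is assumed:

```python
import ctypes

def _runtime_major_via_cudart(cudart_path: str) -> int:
    """Query the CUDA runtime major version by loading libcudart directly."""
    try:
        libcudart = ctypes.CDLL(cudart_path)
    except OSError as exc:
        # cuda-pathfinder already located the file, so a load failure here
        # means the file exists but is unusable; surface that instead of passing.
        raise RuntimeError(
            f"Found CUDA runtime at {cudart_path} but could not load it: {exc}"
        ) from exc
    version = ctypes.c_int()
    # cudaRuntimeGetVersion fills 1000 * major + 10 * minor; 0 means cudaSuccess.
    if libcudart.cudaRuntimeGetVersion(ctypes.byref(version)) != 0:
        raise RuntimeError(f"cudaRuntimeGetVersion failed for {cudart_path}")
    return version.value // 1000
```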
jameslamb left a comment:
Just approving for packaging-codeowners (approving the cuda-core dependency), I haven't reviewed any of the code.
## Summary

Decouples check logic from hardware access so tests can swap in fakes without `mock.patch` on `pynvml` / `psutil` / `cuda.pathfinder`.

- **`rapids_cli/hardware.py`**: `DeviceInfo` dataclass, `GpuInfoProvider` / `SystemInfoProvider` Protocols, lazy-loading real implementations (`NvmlGpuInfo`, `DefaultSystemInfo`), and test fakes (`FakeGpuInfo`, `FakeSystemInfo`, `FailingGpuInfo`, `FailingSystemInfo`). `test_hardware.py` covers the provider layer directly.
- **`rapids_cli/providers.py`**: process-wide registry with `set_providers(gpu_info=, system_info=, toolkit_info=)` and `get_gpu_info()` / `get_system_info()` / `get_toolkit_info()` lazy accessors. The doctor orchestrator installs providers once per run; checks read from the registry.
- **Checks and `debug.run_debug`**: plain module-level functions, no provider kwargs and no classes. Signatures stay `def my_check(verbose=False, **kwargs)` so third-party plugin authors are unaffected.
- **Tests**: `rapids_cli/tests/conftest.py` adds `set_gpu_info` / `set_system_info` / `set_toolkit_info` fixtures wrapping `monkeypatch.setattr`, plus an autouse reset for isolation. Eliminates ~51 hardware `mock.patch` calls and ~11 `MagicMock` objects from the check/debug tests.
- **Bug fixes**: the nvlink check was always passing `0` to `nvmlDeviceGetNvLinkState` instead of the actual link id; the doctor orchestrator only installed `gpu_info`, so the memory check fell back to constructing its own `DefaultSystemInfo()` at runtime.

Note: this branch incorporates upstream changes merged after #135/#136 (CUDA toolkit check #141, richer nvlink check #143).

## Test plan

- [x] `pytest`: 88 tests pass
- [x] Coverage at 96.78%, above the 95% threshold
- [x] `pre-commit run --all-files` passes
- [x] Manual `rapids doctor --verbose` on a GPU host: all 6 entry points discovered; `gpu`, `gpu_compute_capability`, `cuda`, `memory_to_gpu_ratio` (with expected warning), and `nvlink_status` passed end-to-end on real hardware, exercising the `# pragma: no cover` lazy-load branches in `providers.py`. `cuda_toolkit_check` raised `ModuleNotFoundError: No module named 'cuda.core.system'` during testing, which turned out to be a stale conda env with `cuda-core 0.3.2`; the declared constraint (`cuda-core >=0.6.0`) is already correct. Filed and closed #144 with that explanation. Will re-verify on a fresh env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
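A rough sketch of the registry pattern this summary describes, using names taken from the summary; the `DeviceInfo` fields, the `get_devices` method name, and the import path are assumptions, not the merged code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DeviceInfo:
    # Illustrative fields; the real dataclass lives in rapids_cli/hardware.py.
    name: str
    total_memory: int

class GpuInfoProvider(Protocol):
    def get_devices(self) -> list[DeviceInfo]: ...

class FakeGpuInfo:
    """Test fake: returns canned devices with no pynvml dependency."""
    def __init__(self, devices: list[DeviceInfo]) -> None:
        self._devices = devices

    def get_devices(self) -> list[DeviceInfo]:
        return self._devices

_gpu_info: GpuInfoProvider | None = None

def set_providers(*, gpu_info: GpuInfoProvider | None = None) -> None:
    """Install providers once per run (called by the doctor orchestrator or tests)."""
    global _gpu_info
    if gpu_info is not None:
        _gpu_info = gpu_info

def get_gpu_info() -> GpuInfoProvider:
    """Lazy accessor: falls back to the real NVML-backed provider on first use."""
    global _gpu_info
    if _gpu_info is None:
        from rapids_cli.hardware import NvmlGpuInfo  # assumed import path
        _gpu_info = NvmlGpuInfo()
    return _gpu_info
```

Checks then call `get_gpu_info()` instead of importing `pynvml` directly, which is what lets the tests swap in `FakeGpuInfo` without any `mock.patch`.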
## Summary

Two related fixes surfaced while smoke-testing #137 on a fresh Brev box:

- **`cuda_toolkit_check` was reading the kernel driver, not the CUDA driver.** `get_driver_version(kernel_mode=True)` returns the NVIDIA kernel module version (e.g. `580` from `580.126.09`), not the CUDA Driver API version (e.g. `13` from CUDA 13.0). The verbose message also printed `Driver supports CUDA 580`, which is what tipped this off. Dropping `kernel_mode=True` makes `get_driver_version()` default to the CUDA Driver API mode, and the comparison logic actually fires.
- **`cuda-bindings` is now declared as a runtime dep, and the conda recipe gets the missing `cuda-core` it should have had since #141.** `cuda-core` calls into `cuda.bindings.driver` via lazy import, and without `cuda-bindings` installed, `cuda_toolkit_check` raises `ImportError: cuda.bindings 12.x or 13.x must be installed` on a fresh `pip install rapids-cli`. The pin `>=12.9.6,!=13.0.*,!=13.1.*` excludes the cuda-bindings 13.0/13.1 wheels and is compatible with both CUDA 12 and CUDA 13 driver hosts (verified with cuda-bindings 12.9.6 against a CUDA 13 environment and cuda-bindings 13.2 against a CUDA 12 environment).

A regression test (`test_gather_toolkit_info_driver_major_is_cuda_major`) exercises `_gather_toolkit_info()` end-to-end and asserts `driver_major < 100` to ensure that we are getting the CUDA major version and not the kernel driver version.

Closes #145.

---

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
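The regression test might look roughly like this, based on the description above; the import path and the attribute access on the returned object are assumptions:

```python
from rapids_cli.doctor.checks.cuda_toolkit import _gather_toolkit_info  # assumed path

def test_gather_toolkit_info_driver_major_is_cuda_major():
    # Runs the real driver query end-to-end on a GPU host.
    info = _gather_toolkit_info()
    # A CUDA major version is small (12, 13, ...), while a kernel module
    # version is in the hundreds (e.g. 580), so this catches a regression
    # back to get_driver_version(kernel_mode=True).
    assert info.driver_major < 100
```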
Adds a new `rapids doctor` check that verifies that the CUDA toolkit (will refer to this as CTK from here on) is findable and version-compatible with the GPU driver. These are the things the check does:

**Library discoverability:** Uses `cuda-pathfinder` to verify that CUDA libraries can be loaded at runtime. The CTK itself has many libraries, some of which are not necessary for every RAPIDS operation. For now, this check verifies that `libcudart.so`, `libnvrtc.so`, and `libnvvm.so` can be loaded. These three were chosen because they are the most commonly used: cudart is required for all CUDA operations, while nvrtc and nvvm are used in JIT compilation. This can be extended to other libraries of interest in the CTK, but to keep it universal and based on frequency of usage, I am checking for these three currently.

**Toolkit vs driver version:** Detects when the CTK major version is newer than the driver; backward compatibility is supported. Version detection tries header parsing first (got this from Add CUDA toolkit major version check #140, thanks @jacobtomlinson), and falls back to `cudaRuntimeGetVersion` (got the snippet from @ncclementi's comment on the PR above) for conda/pip environments, as they do not ship dev headers. A sketch of the header-parsing step is included at the end of this description.

**System installation checks:** When the CTK is not installed via conda/pip, the check inspects the `/usr/local/cuda` symlink and the `CUDA_HOME`/`CUDA_PATH` variables for version mismatches.

I based the order and the checks themselves on the `load_nvidia_dynamic_lib` documentation page for `cuda-pathfinder`, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario which isn't covered by these tests is described in this comment. This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For `pip` packages, reading the package suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for `conda` packages.
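As referenced above, a minimal sketch of the header-parsing half of the version detection; `_ctk_major_from_header` is a hypothetical helper name:

```python
import re
from pathlib import Path

def _ctk_major_from_header(ctk_root: Path) -> int | None:
    """Parse the CTK major version from cuda.h, which defines e.g. '#define CUDA_VERSION 12040'."""
    header = ctk_root / "include" / "cuda.h"
    if not header.exists():  # conda/pip installs do not ship dev headers
        return None
    match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", header.read_text())
    # CUDA_VERSION encodes 1000 * major + 10 * minor, so 12040 -> 12.
    return int(match.group(1)) // 1000 if match else None
```

When the header is absent, the check falls back to the `cudaRuntimeGetVersion` route quoted earlier in this thread.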