
Add CUDA toolkit check to rapids doctor#141

Merged
jayavenkatesh19 merged 11 commits into rapidsai:main from jayavenkatesh19:feat/cuda-toolkit-check
Apr 2, 2026

Conversation

@jayavenkatesh19
Contributor

@jayavenkatesh19 jayavenkatesh19 commented Mar 11, 2026

Adds a new rapids doctor check that verifies that the CUDA toolkit (will refer to this as CTK from here on) is findable and version-compatible with the GPU driver.

These are the things the check does:

  • Library discoverability: uses cuda-pathfinder to verify that CUDA libraries can be loaded at runtime. The CTK ships many libraries, not all of which are necessary for every RAPIDS operation. For now, this check verifies that libcudart.so, libnvrtc.so, and libnvvm.so can be loaded. These three were chosen because they are the most commonly used (cudart is required for all CUDA operations, while nvrtc and nvvm are used for JIT compilation). The check can be extended to other libraries of interest in the CTK, but to keep it universal and based on frequency of use, it currently covers only these three.

  • Toolkit vs driver version: detects when the CTK major version is newer than the driver's. Backward compatibility is supported. Version detection tries header parsing first (got this from Add CUDA toolkit major version check #140, thanks @jacobtomlinson), and falls back to cudaRuntimeGetVersion (snippet from @ncclementi's comment on that PR) for conda/pip environments, as they do not ship dev headers.

  • System installation checks: when the CTK is not installed via conda/pip, it checks the /usr/local/cuda symlink and the CUDA_HOME/CUDA_PATH variables for version mismatches.
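The toolkit-vs-driver comparison described above can be sketched as follows. This is a minimal illustration, not the actual cuda_toolkit.py implementation; the function name and messages are hypothetical:

```python
def check_toolkit_driver_compat(toolkit_major: int, driver_major: int) -> str:
    """Compare CTK and driver CUDA major versions.

    Drivers are backward compatible: an older toolkit on a newer driver
    is fine, but a toolkit newer than the driver is a problem.
    """
    if toolkit_major > driver_major:
        return (
            f"CUDA toolkit {toolkit_major}.x is newer than the driver's "
            f"supported CUDA version {driver_major}.x; upgrade the driver."
        )
    return f"CUDA toolkit {toolkit_major}.x is compatible with the driver (CUDA {driver_major}.x)."


print(check_toolkit_driver_compat(12, 13))  # older toolkit, newer driver: OK
print(check_toolkit_driver_compat(13, 12))  # toolkit newer than driver: warn
```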

I based the order and the checks themselves on the load_nvidia_dynamic_lib documentation page for cuda-pathfinder, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario that isn't covered by these tests is described in this comment. This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For pip packages, reading the suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for conda packages.

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
@jayavenkatesh19 jayavenkatesh19 self-assigned this Mar 11, 2026
@jayavenkatesh19 jayavenkatesh19 requested review from a team as code owners March 11, 2026 23:52
@jayavenkatesh19 jayavenkatesh19 removed the request for review from msarahan March 11, 2026 23:57
Member

@jacobtomlinson jacobtomlinson left a comment


Overall this looks great.

I left a comment about trying to use cuda.core.system instead of pynvml. I'm not sure if it supports enough features for us, but if we can we should.

I also notice the tests have a lot of mocking in them. Perhaps the dependency injection approach @mmccarty was exploring in #137 would help clean these up?

Also it looks like CI is failing because coverage has dropped below 95%.

Comment on lines +120 to +130
```python
def _extract_major_from_cuda_path(path: Path) -> int | None:
    """Extract CUDA major version from a path like /usr/local/cuda-12.4 or its version.txt."""
    match = re.search(r"cuda-(\d+)", str(path))
    if match:
        return int(match.group(1))
    version_file = path / "version.txt"
    if version_file.exists():
        match = re.search(r"(\d+)\.", version_file.read_text())
        if match:
            return int(match.group(1))
    return None
```
Member


There may be situations where multiple CTKs are installed. In this case we need to check which one /usr/local/cuda is symlinked to, as that will be the active one.

Contributor Author

@jayavenkatesh19 jayavenkatesh19 Mar 13, 2026


```python
cudart_source = toolkit_info.found_libs.get("cudart", "")
if cudart_source not in ("conda", "site-packages"):
    if _CUDA_SYMLINK.exists():
        _check_path_version(
            "/usr/local/cuda", _CUDA_SYMLINK.resolve(), driver_major
        )
```

In lines 222-226 of cuda_toolkit.py, I am using Path.resolve() to resolve the symlink and get the exact path, so the version being returned will be the one that is symlinked to.
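Path.resolve() follows the symlink to its target, so the active installation wins even when multiple CTKs are present. A small self-contained demonstration (the directory names are made up for illustration):

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
# Two installed toolkits; a /usr/local/cuda-style symlink points at one.
(tmp / "cuda-11.8").mkdir()
(tmp / "cuda-12.4").mkdir()
link = tmp / "cuda"
link.symlink_to(tmp / "cuda-12.4")

# resolve() yields the symlink target, i.e. the active toolkit.
active = link.resolve().name
print(active)  # cuda-12.4
```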

Comment on lines +168 to +169
```python
pynvml.nvmlInit()
driver_major = pynvml.nvmlSystemGetCudaDriverVersion() // 1000
```
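For context, nvmlSystemGetCudaDriverVersion returns the CUDA version encoded as a single integer (major * 1000 + minor * 10), so floor division by 1000 recovers the major version:

```python
def cuda_major_from_nvml(version: int) -> int:
    # NVML encodes the CUDA version as major*1000 + minor*10,
    # e.g. 12040 for CUDA 12.4.
    return version // 1000


print(cuda_major_from_nvml(12040))  # 12
print(cuda_major_from_nvml(13000))  # 13
```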
Member


Could we use cuda.core.system.get_driver_version() instead here? If we can, it would be more future proof.

Contributor Author

@jayavenkatesh19 jayavenkatesh19 Mar 13, 2026


I made this change. And the API is indeed super weird.

Comment thread rapids_cli/doctor/checks/cuda_toolkit.py Outdated
```python
# Maps cuda-pathfinder's found_via values to human-readable source labels.
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip",
```
Contributor


I wonder how this works for uv; do we need to add something else? I'm thinking of a scenario where things are installed in a uv venv.

Contributor Author


From what I can find, uv installs into site-packages the same way pip does. I updated the label to say pip/uv to reflect this.
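The resulting label lookup might look roughly like this. This is a sketch based on the discussion above, not the actual rapids_cli code; the helper name is hypothetical:

```python
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip/uv",  # uv installs into site-packages like pip
}


def source_label(found_via: str) -> str:
    # Fall back to the raw found_via value when there is no friendly name.
    return _SOURCE_LABELS.get(found_via, found_via)


print(source_label("site-packages"))  # pip/uv
print(source_label("system"))         # system
```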

Comment thread rapids_cli/doctor/checks/cuda_toolkit.py

```python
return (
    f"Some CUDA libraries ({missing_str}) could not be found. "
    "Install the CUDA Toolkit, or use conda/pip which manage CUDA automatically."
```
Contributor


pip only manages the CUDA toolkit install for cudf and cuml, and only for some recent versions; other libraries do not. We should update this message to give clearer instructions.

Contributor Author


Updated the message for this.

```python
    version = ctypes.c_int()
    if libcudart.cudaRuntimeGetVersion(ctypes.byref(version)) == 0:
        return version.value // 1000
except OSError:
```
Contributor


Will this OSError be raised when there are no GPUs, or in which situations does it fire?

Contributor Author


This OSError only fires if the .so file itself cannot be loaded (missing, broken symlink, etc.).

On a non-GPU machine, cudart_path would not be found by cuda-pathfinder, and we would never reach this code path. The error is only raised if the path to the file exists but the file cannot be opened.
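For reference, ctypes raises OSError when a shared library cannot be loaded; a minimal demonstration with a deliberately nonexistent library name:

```python
import ctypes

try:
    # Attempting to load a library that does not exist.
    ctypes.CDLL("libnonexistent-demo-xyz.so")
    outcome = "loaded"
except OSError:
    # Raised when the .so is missing, a broken symlink, or otherwise unloadable.
    outcome = "OSError"

print(outcome)  # OSError
```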

Contributor

@ncclementi ncclementi Apr 1, 2026


> This error only raises if the path to the file exists but cannot be opened.

If this is the case, shouldn't we raise with a human-readable message instead of passing?

Member

@jameslamb jameslamb left a comment


Just approving for packaging-codeowners (approving the cuda-core dependency), I haven't reviewed any of the code.

@jayavenkatesh19 jayavenkatesh19 merged commit 1d64e61 into rapidsai:main Apr 2, 2026
9 checks passed
ncclementi pushed a commit that referenced this pull request Apr 29, 2026
## Summary

Decouples check logic from hardware access so tests can swap in fakes
without `mock.patch` on `pynvml` / `psutil` / `cuda.pathfinder`.

- **`rapids_cli/hardware.py`** — `DeviceInfo` dataclass,
`GpuInfoProvider` / `SystemInfoProvider` Protocols, lazy-loading real
implementations (`NvmlGpuInfo`, `DefaultSystemInfo`), and test fakes
(`FakeGpuInfo`, `FakeSystemInfo`, `FailingGpuInfo`,
`FailingSystemInfo`). `test_hardware.py` covers the provider layer
directly.
- **`rapids_cli/providers.py`** — process-wide registry with
`set_providers(gpu_info=, system_info=, toolkit_info=)` and
`get_gpu_info()` / `get_system_info()` / `get_toolkit_info()` lazy
accessors. The doctor orchestrator installs providers once per run;
checks read from the registry.
- **Checks and `debug.run_debug`** — plain module-level functions, no
provider kwargs and no classes. Signatures stay `def
my_check(verbose=False, **kwargs)` so third-party plugin authors are
unaffected.
- **Tests** — `rapids_cli/tests/conftest.py` adds `set_gpu_info` /
`set_system_info` / `set_toolkit_info` fixtures wrapping
`monkeypatch.setattr`, plus an autouse reset for isolation. Eliminates
~51 hardware `mock.patch` calls and ~11 `MagicMock` objects from the
check/debug tests.
- **Bug fixes:** nvlink check was always passing `0` to
`nvmlDeviceGetNvLinkState` instead of the actual link id; doctor
orchestrator only installed `gpu_info`, so the memory check fell back to
constructing its own `DefaultSystemInfo()` at runtime.

Note: this branch incorporates upstream changes merged after #135/#136
(CUDA toolkit check #141, richer nvlink check #143).
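The provider/registry split described in the summary can be sketched roughly like this. The names follow the summary; the bodies are illustrative only, not the actual rapids_cli code:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class DeviceInfo:
    name: str
    memory_bytes: int


class GpuInfoProvider(Protocol):
    def get_devices(self) -> list[DeviceInfo]: ...


class FakeGpuInfo:
    """Test fake: returns canned devices without touching hardware."""

    def __init__(self, devices: list[DeviceInfo]):
        self._devices = devices

    def get_devices(self) -> list[DeviceInfo]:
        return self._devices


# Process-wide registry: the doctor orchestrator installs providers once
# per run; checks read from the registry instead of importing pynvml.
_registry: dict = {}


def set_providers(gpu_info=None):
    if gpu_info is not None:
        _registry["gpu_info"] = gpu_info


def get_gpu_info():
    return _registry["gpu_info"]


set_providers(gpu_info=FakeGpuInfo([DeviceInfo("Fake GPU", 16 << 30)]))
print(get_gpu_info().get_devices()[0].name)  # Fake GPU
```

With this shape, tests swap in fakes via set_providers rather than mock.patch on the hardware libraries.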

## Test plan

- [x] `pytest` — 88 tests pass
- [x] Coverage at 96.78%, above the 95% threshold
- [x] `pre-commit run --all-files` passes
- [x] Manual `rapids doctor --verbose` on a GPU host — all 6 entry
points discovered; `gpu`, `gpu_compute_capability`, `cuda`,
`memory_to_gpu_ratio` (with expected warning), and `nvlink_status`
passed end-to-end on real hardware, exercising the `# pragma: no cover`
lazy-load branches in `providers.py`. `cuda_toolkit_check` raised
`ModuleNotFoundError: No module named 'cuda.core.system'` during
testing, which turned out to be a stale conda env with `cuda-core 0.3.2`
— the declared constraint (`cuda-core >=0.6.0`) is already correct;
filed and closed #144 with that explanation. Will re-verify on a fresh
env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ncclementi pushed a commit that referenced this pull request Apr 30, 2026
## Summary

Two related fixes surfaced while smoke-testing #137 on a fresh Brev box:

- **`cuda_toolkit_check` was reading the kernel driver, not the CUDA
driver.** `get_driver_version(kernel_mode=True)` returns the NVIDIA
kernel module version (e.g. `580` from `580.126.09`), not the CUDA
Driver API version (e.g. `13` from CUDA 13.0). The verbose message also
printed `Driver supports CUDA 580`, which is what tipped this off.
Dropping `kernel_mode=True` makes `get_driver_version()` default to the
CUDA Driver API mode and the comparison logic actually fires.
- **`cuda-bindings` is now declared as a runtime dep, and the conda
recipe gets the missing `cuda-core` it should have had since #141.**
`cuda-core` calls into `cuda.bindings.driver` via lazy import and
without `cuda-bindings` installed, `cuda_toolkit_check` raises
`ImportError: cuda.bindings 12.x or 13.x must be installed` on a fresh
`pip install rapids-cli`. The pin `>=12.9.6,!=13.0.*,!=13.1.*` excludes
the cuda-bindings 13.0/13.1 wheels and is compatible with both CUDA 12
and CUDA 13 driver hosts (verified with cuda-bindings 12.9.6 against a
CUDA 13 environment and cuda-bindings 13.2 against a CUDA 12
environment).
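The distinction between the two version numbers can be illustrated like this (a sketch; the parsing helpers are hypothetical):

```python
def kernel_module_major(version_str: str) -> int:
    # NVIDIA kernel module versions look like "580.126.09".
    return int(version_str.split(".")[0])


def cuda_driver_major(version_int: int) -> int:
    # The CUDA Driver API reports major*1000 + minor*10, e.g. 13000 for 13.0.
    return version_int // 1000


print(kernel_module_major("580.126.09"))  # 580 (not a CUDA version)
print(cuda_driver_major(13000))           # 13
```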

A regression test
(`test_gather_toolkit_info_driver_major_is_cuda_major`) exercises
`_gather_toolkit_info()` end-to-end and asserts `driver_major < 100` to
ensure that we are getting the CUDA major version and not the kernel driver
version.

Closes #145.

---------

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
