Add CUDA toolkit check to rapids doctor #141
Conversation
Overall this looks great.
I left a comment about trying to use cuda.core.system instead of pynvml. I'm not sure if it supports enough features for us, but if we can we should.
I also notice the tests have a lot of mocking in them. Perhaps the dependency injection approach @mmccarty was exploring in #137 would help clean these up?
Also it looks like CI is failing because coverage has dropped below 95%.
```python
def _extract_major_from_cuda_path(path: Path) -> int | None:
    """Extract CUDA major version from a path like /usr/local/cuda-12.4 or its version.txt."""
    match = re.search(r"cuda-(\d+)", str(path))
    if match:
        return int(match.group(1))
    version_file = path / "version.txt"
    if version_file.exists():
        match = re.search(r"(\d+)\.", version_file.read_text())
        if match:
            return int(match.group(1))
    return None
```
There may be situations where multiple CTKs are installed. In this case we need to check which one /usr/local/cuda is symlinked to, as that will be the active one.
```python
cudart_source = toolkit_info.found_libs.get("cudart", "")
if cudart_source not in ("conda", "site-packages"):
    if _CUDA_SYMLINK.exists():
        _check_path_version(
            "/usr/local/cuda", _CUDA_SYMLINK.resolve(), driver_major
        )
```

In lines 222-226 of cuda_toolkit.py, I am using `Path.resolve()` to resolve the symlink and get the exact path, so the version being returned will be the one that `/usr/local/cuda` is symlinked to.
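For illustration, a minimal sketch of that resolution step; `_active_ctk_major` is a hypothetical helper name, not the function in the PR:

```python
import re
from pathlib import Path

_CUDA_SYMLINK = Path("/usr/local/cuda")

def _active_ctk_major() -> int | None:
    """Follow /usr/local/cuda to the CTK it currently points at and read its major version."""
    if not _CUDA_SYMLINK.exists():
        return None
    # resolve() follows the symlink chain, e.g. /usr/local/cuda -> /usr/local/cuda-12.4,
    # so with multiple CTKs installed we report the active one.
    real_path = _CUDA_SYMLINK.resolve()
    match = re.search(r"cuda-(\d+)", str(real_path))
    return int(match.group(1)) if match else None
```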
```python
pynvml.nvmlInit()
# nvmlSystemGetCudaDriverVersion() returns 1000 * major + 10 * minor
# (e.g. 12040 for CUDA 12.4), so // 1000 extracts the major version.
driver_major = pynvml.nvmlSystemGetCudaDriverVersion() // 1000
```
Could we use `cuda.core.system.get_driver_version()` instead here? If we can, it would be more future-proof.
I made this change. And the API is indeed super weird.
```python
# Maps cuda-pathfinder's found_via values to human-readable source labels.
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip",
}
```
I wonder how this works for uv; do we need to add something else? I'm thinking of a scenario where things are installed in a uv venv.
From what I can find, uv installs into site-packages the same way pip does. I updated the label to say `pip/uv` to reflect this.
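That is, roughly (a sketch of the updated mapping):

```python
# Maps cuda-pathfinder's found_via values to human-readable source labels.
_SOURCE_LABELS = {
    "conda": "conda",
    "site-packages": "pip/uv",  # uv venvs install into site-packages just like pip
}
```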
```python
return (
    f"Some CUDA libraries ({missing_str}) could not be found. "
    "Install the CUDA Toolkit, or use conda/pip which manage CUDA automatically."
)
```
pip only manages the CUDA toolkit install for cudf and cuml, and only for some recent versions; other libraries do not get it. We should update this message to give clearer instructions.
Updated the message for this.
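For reference, one possible wording along the lines requested (a sketch; the merged text may differ):

```python
return (
    f"Some CUDA libraries ({missing_str}) could not be found. "
    "Install the CUDA Toolkit, or install RAPIDS via conda, which manages "
    "CUDA automatically. Note that pip wheels bundle the CUDA Toolkit only "
    "for some packages (e.g. cudf and cuml) and only in recent versions."
)
```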
```python
    version = ctypes.c_int()
    if libcudart.cudaRuntimeGetVersion(ctypes.byref(version)) == 0:
        return version.value // 1000
except OSError:
```
Will this OSError be raised when there are no GPUs, or in which situations does it fire?
This OSError only fires if the `.so` file itself cannot be loaded (missing, broken symlink, etc.).
On a non-GPU machine, `cudart_path` would not be found by cuda-pathfinder, so we would never reach this code path. This error is only raised when the path to the file exists but the file cannot be opened.
> This error is only raised when the path to the file exists but the file cannot be opened.
If this is the case, shouldn't we raise with a human readable message instead of passing?
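For illustration, a sketch of that suggestion; `_runtime_major_via_cudart` is a hypothetical helper name and the surrounding structure is assumed:

```python
import ctypes

def _runtime_major_via_cudart(cudart_path: str) -> int:
    """Query the CUDA runtime major version by loading libcudart directly."""
    try:
        libcudart = ctypes.CDLL(cudart_path)
    except OSError as exc:
        # cuda-pathfinder already located the file, so a load failure here
        # means the file exists but is unusable; surface that instead of passing.
        raise RuntimeError(
            f"Found CUDA runtime at {cudart_path} but could not load it: {exc}"
        ) from exc
    version = ctypes.c_int()
    # cudaRuntimeGetVersion fills 1000 * major + 10 * minor; 0 means cudaSuccess.
    if libcudart.cudaRuntimeGetVersion(ctypes.byref(version)) != 0:
        raise RuntimeError(f"cudaRuntimeGetVersion failed for {cudart_path}")
    return version.value // 1000
```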
jameslamb left a comment:
Just approving for packaging-codeowners (approving the cuda-core dependency), I haven't reviewed any of the code.
## Summary

Decouples check logic from hardware access so tests can swap in fakes without `mock.patch` on `pynvml` / `psutil` / `cuda.pathfinder`.

- **`rapids_cli/hardware.py`**: `DeviceInfo` dataclass, `GpuInfoProvider` / `SystemInfoProvider` Protocols, lazy-loading real implementations (`NvmlGpuInfo`, `DefaultSystemInfo`), and test fakes (`FakeGpuInfo`, `FakeSystemInfo`, `FailingGpuInfo`, `FailingSystemInfo`). `test_hardware.py` covers the provider layer directly.
- **`rapids_cli/providers.py`**: process-wide registry with `set_providers(gpu_info=, system_info=, toolkit_info=)` and `get_gpu_info()` / `get_system_info()` / `get_toolkit_info()` lazy accessors. The doctor orchestrator installs providers once per run; checks read from the registry.
- **Checks and `debug.run_debug`**: plain module-level functions, no provider kwargs and no classes. Signatures stay `def my_check(verbose=False, **kwargs)` so third-party plugin authors are unaffected.
- **Tests**: `rapids_cli/tests/conftest.py` adds `set_gpu_info` / `set_system_info` / `set_toolkit_info` fixtures wrapping `monkeypatch.setattr`, plus an autouse reset for isolation. Eliminates ~51 hardware `mock.patch` calls and ~11 `MagicMock` objects from the check/debug tests.
- **Bug fixes**: the nvlink check was always passing `0` to `nvmlDeviceGetNvLinkState` instead of the actual link id; the doctor orchestrator only installed `gpu_info`, so the memory check fell back to constructing its own `DefaultSystemInfo()` at runtime.

Note: this branch incorporates upstream changes merged after #135/#136 (CUDA toolkit check #141, richer nvlink check #143).

## Test plan

- [x] `pytest`: 88 tests pass
- [x] Coverage at 96.78%, above the 95% threshold
- [x] `pre-commit run --all-files` passes
- [x] Manual `rapids doctor --verbose` on a GPU host: all 6 entry points discovered; `gpu`, `gpu_compute_capability`, `cuda`, `memory_to_gpu_ratio` (with expected warning), and `nvlink_status` passed end-to-end on real hardware, exercising the `# pragma: no cover` lazy-load branches in `providers.py`. `cuda_toolkit_check` raised `ModuleNotFoundError: No module named 'cuda.core.system'` during testing, which turned out to be a stale conda env with `cuda-core 0.3.2`; the declared constraint (`cuda-core >=0.6.0`) is already correct. Filed and closed #144 with that explanation. Will re-verify on a fresh env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

Signed-off-by: Mike McCarty <mmccarty@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
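A rough sketch of the registry pattern this summary describes, using names taken from the summary; the `DeviceInfo` fields, the `get_devices` method name, and the import path are assumptions, not the merged code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DeviceInfo:
    # Illustrative fields; the real dataclass lives in rapids_cli/hardware.py.
    name: str
    total_memory: int

class GpuInfoProvider(Protocol):
    def get_devices(self) -> list[DeviceInfo]: ...

class FakeGpuInfo:
    """Test fake: returns canned devices with no pynvml dependency."""
    def __init__(self, devices: list[DeviceInfo]) -> None:
        self._devices = devices

    def get_devices(self) -> list[DeviceInfo]:
        return self._devices

_gpu_info: GpuInfoProvider | None = None

def set_providers(*, gpu_info: GpuInfoProvider | None = None) -> None:
    """Install providers once per run (called by the doctor orchestrator or tests)."""
    global _gpu_info
    if gpu_info is not None:
        _gpu_info = gpu_info

def get_gpu_info() -> GpuInfoProvider:
    """Lazy accessor: falls back to the real NVML-backed provider on first use."""
    global _gpu_info
    if _gpu_info is None:
        from rapids_cli.hardware import NvmlGpuInfo  # assumed import path
        _gpu_info = NvmlGpuInfo()
    return _gpu_info
```

Checks then call `get_gpu_info()` instead of importing `pynvml` directly, which is what lets the tests swap in `FakeGpuInfo` without any `mock.patch`.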
## Summary

Two related fixes surfaced while smoke-testing #137 on a fresh Brev box:

- **`cuda_toolkit_check` was reading the kernel driver, not the CUDA driver.** `get_driver_version(kernel_mode=True)` returns the NVIDIA kernel module version (e.g. `580` from `580.126.09`), not the CUDA Driver API version (e.g. `13` from CUDA 13.0). The verbose message also printed `Driver supports CUDA 580`, which is what tipped this off. Dropping `kernel_mode=True` makes `get_driver_version()` default to the CUDA Driver API mode, and the comparison logic actually fires.
- **`cuda-bindings` is now declared as a runtime dep, and the conda recipe gets the missing `cuda-core` it should have had since #141.** `cuda-core` calls into `cuda.bindings.driver` via lazy import, and without `cuda-bindings` installed, `cuda_toolkit_check` raises `ImportError: cuda.bindings 12.x or 13.x must be installed` on a fresh `pip install rapids-cli`. The pin `>=12.9.6,!=13.0.*,!=13.1.*` excludes the cuda-bindings 13.0/13.1 wheels and is compatible with both CUDA 12 and CUDA 13 driver hosts (verified with cuda-bindings 12.9.6 against a CUDA 13 environment and cuda-bindings 13.2 against a CUDA 12 environment).

A regression test (`test_gather_toolkit_info_driver_major_is_cuda_major`) exercises `_gather_toolkit_info()` end-to-end and asserts `driver_major < 100` to ensure that we are getting the CUDA major version and not the kernel driver version.

Closes #145.

---

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
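The regression test might look roughly like this, based on the description above; the import path and the attribute access on the returned object are assumptions:

```python
from rapids_cli.doctor.checks.cuda_toolkit import _gather_toolkit_info  # assumed path

def test_gather_toolkit_info_driver_major_is_cuda_major():
    # Runs the real driver query end-to-end on a GPU host.
    info = _gather_toolkit_info()
    # A CUDA major version is small (12, 13, ...), while a kernel module
    # version is in the hundreds (e.g. 580), so this catches a regression
    # back to get_driver_version(kernel_mode=True).
    assert info.driver_major < 100
```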
Adds a new `rapids doctor` check that verifies that the CUDA toolkit (will refer to this as CTK from here on) is findable and version-compatible with the GPU driver. These are the things the check does:

**Library discoverability:** Uses `cuda-pathfinder` to verify that CUDA libraries can be loaded at runtime. The CTK itself has many libraries, some of which are not necessary for every RAPIDS operation. For now, this check verifies that `libcudart.so`, `libnvrtc.so`, and `libnvvm.so` can be loaded. These three were chosen because they are the most commonly used: cudart is required for all CUDA operations, while nvrtc and nvvm are used in JIT compilation. This can be extended to other libraries of interest in the CTK, but to keep it universal and based on frequency of usage, I am checking for these three currently.

**Toolkit vs driver version:** Detects when the CTK major version is newer than the driver; backward compatibility is supported. Version detection tries header parsing first (got this from Add CUDA toolkit major version check #140, thanks @jacobtomlinson), and falls back to `cudaRuntimeGetVersion` (got the snippet from @ncclementi's comment on the PR above) for conda/pip environments, as they do not ship dev headers. A sketch of the header-parsing step is included at the end of this description.

**System installation checks:** When the CTK is not installed via conda/pip, the check inspects the `/usr/local/cuda` symlink and the `CUDA_HOME`/`CUDA_PATH` variables for version mismatches.

I based the order and the checks themselves on the `load_nvidia_dynamic_lib` documentation page for `cuda-pathfinder`, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario which isn't covered by these tests is described in this comment. This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For `pip` packages, reading the package suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for `conda` packages.
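As referenced above, a minimal sketch of the header-parsing half of the version detection; `_ctk_major_from_header` is a hypothetical helper name:

```python
import re
from pathlib import Path

def _ctk_major_from_header(ctk_root: Path) -> int | None:
    """Parse the CTK major version from cuda.h, which defines e.g. '#define CUDA_VERSION 12040'."""
    header = ctk_root / "include" / "cuda.h"
    if not header.exists():  # conda/pip installs do not ship dev headers
        return None
    match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", header.read_text())
    # CUDA_VERSION encodes 1000 * major + 10 * minor, so 12040 -> 12.
    return int(match.group(1)) // 1000 if match else None
```

When the header is absent, the check falls back to the `cudaRuntimeGetVersion` route quoted earlier in this thread.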