Added fallback to load extra cudnn dlls in the site packages #1369
hthadicherla wants to merge 4 commits into main
Conversation
…er that ort.preload_dlls() missed
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
No actionable comments were generated in the recent review. 🎉 (Review profile: CHILL, plan: Enterprise; files selected for processing: 1.)
Walkthrough: Windows-specific cuDNN detection and fallback DLL loading added in modelopt/onnx/quantization/ort_utils.py.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python as PythonProcess (ort_utils)
    participant ORT as ONNXRuntime
    participant FS as FileSystem (site-packages)
    participant WinAPI as WindowsAPI (GetModuleHandleW)
    participant CT as ctypes.CDLL

    Python->>ORT: call preload_dlls()
    ORT-->>Python: preload_result (dict)
    alt preload_result contains Windows failure key (cudnn64_9)
        Python->>Python: log warning (preload reported failure)
    else preload appears successful
        Python->>FS: scan site-packages for nvidia-cudnn-cu* -> list cudnn*.dll
        loop for each cudnn DLL
            Python->>WinAPI: GetModuleHandleW(dll_name)
            alt module already loaded
                WinAPI-->>Python: handle present -> skip DLL
            else not loaded
                Python->>CT: ctypes.CDLL(full_dll_path)
                alt load success
                    CT-->>Python: success -> log loaded DLL
                else load failure
                    CT-->>Python: error -> log failure
                end
            end
        end
    end
```
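The load-or-skip loop in the diagram can be sketched platform-independently. In this hedged sketch, the `is_loaded` and `load` callables are stand-ins for the Windows-specific `GetModuleHandleW` and `ctypes.CDLL` calls; the function name and signature are illustrative, not the actual patch:

```python
import os
from collections.abc import Callable


def load_missing_dlls(
    dll_paths: list[str],
    is_loaded: Callable[[str], bool],
    load: Callable[[str], None],
) -> tuple[list[str], list[str], list[str]]:
    """Partition DLLs into loaded/skipped/failed, mirroring the diagram's loop."""
    loaded, skipped, failed = [], [], []
    for dll_path in dll_paths:
        dll_name = os.path.basename(dll_path)
        if is_loaded(dll_name):  # stand-in for GetModuleHandleW
            skipped.append(dll_name)
            continue
        try:
            load(dll_path)  # stand-in for ctypes.CDLL
            loaded.append(dll_name)
        except OSError:
            failed.append(dll_name)
    return loaded, skipped, failed
```

Injecting the two callables keeps the decision logic testable on any OS, even though the real loader only runs on Windows.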
Estimated code review effort: 🎯 3 (Moderate), ⏱️ ~25 minutes. Pre-merge checks: ✅ 5 passed, ❌ 1 failed (warning).
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `modelopt/onnx/quantization/ort_utils.py`, around lines 154-155: the docstring incorrectly states that `_find_cudnn_bin_dir` is the "caller" of this function. Update the wording to describe the relationship accurately (this function calls `_find_cudnn_bin_dir`), mentioning both function names for clarity.
Review info: configuration from .coderabbit.yaml, profile CHILL, plan Enterprise, run ID fc851a14-a70d-4dff-a6fb-0235d3765f77. Files selected for processing (1): modelopt/onnx/quantization/ort_utils.py
…r in linux since this fix is only for windows
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `modelopt/onnx/quantization/ort_utils.py`, around lines 147-201: the current logic in `_load_extra_cudnn_dlls` (and its caller `_check_for_libcudnn`) swallows exceptions from `ort.preload_dlls()` and treats partial extra-DLL load failures as success. Change the flow so that if `ort.preload_dlls()` raises, the exception is logged and the function immediately returns False (or re-raises), and so that after attempting to load extra cuDNN DLLs, a non-empty failed list also returns False (or raises) instead of falling through to a success return. Update callers of `_load_extra_cudnn_dlls`/`_check_for_libcudnn` to respect the boolean failure so that partial or failed cuDNN loads are not reported as available.
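The return-the-failed-list contract described in this comment can be simulated without any DLL machinery. All names in this sketch are illustrative stand-ins rather than the actual functions in ort_utils.py:

```python
from collections.abc import Callable


def load_extra_dlls(dll_names: list[str], loader: Callable[[str], None]) -> list[str]:
    """Try each DLL via the injected loader; return the names that failed."""
    failed = []
    for name in dll_names:
        try:
            loader(name)
        except OSError:
            failed.append(name)
    return failed


def check_available(dll_names: list[str], loader: Callable[[str], None]) -> bool:
    """Report success only if every extra DLL loaded, per the review suggestion."""
    failed = load_extra_dlls(dll_names, loader)
    return not failed
```

The key point is that the caller inspects the failed list instead of assuming success after the helper returns.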
Review info: configuration from .coderabbit.yaml, profile CHILL, plan Enterprise, run ID 9ff98f02-7045-4bea-8a6b-df1776d8d3d3. Files selected for processing (1): modelopt/onnx/quantization/ort_utils.py
```python
def _load_extra_cudnn_dlls():
    """Load any cuDNN DLLs from site-packages that ORT's preload_dlls() missed.

    TEMPORARY WORKAROUND: This function exists because ort.preload_dlls() has a
    hardcoded list of cuDNN sub-libraries which may be incomplete for newer cuDNN
    versions (e.g. cuDNN 9.21 added cudnn_engines_tensor_ir64_9.dll, cuDNN 9.20
    added cudnn_cnn64_9.dll). Once ort.preload_dlls() is fixed upstream to
    dynamically discover all cuDNN DLLs, this function and its caller
    (_find_cudnn_bin_dir) should be removed.

    This scans the nvidia-cudnn bin directory and loads any cudnn*.dll not already
    loaded in the process.
    """
    import ctypes
    import ctypes.wintypes

    cudnn_bin_dir = _find_cudnn_bin_dir()
    if not cudnn_bin_dir:
        logger.debug(
            "nvidia-cudnn bin directory not found in site-packages, skipping extra DLL load"
        )
        return

    dll_files = sorted(glob.glob(os.path.join(cudnn_bin_dir, "cudnn*.dll")))
    if not dll_files:
        return

    get_module_handle_w = ctypes.windll.kernel32.GetModuleHandleW  # type: ignore[attr-defined]
    get_module_handle_w.argtypes = [ctypes.wintypes.LPCWSTR]
    get_module_handle_w.restype = ctypes.wintypes.HMODULE

    loaded = []
    skipped = []
    failed = []
    for dll_path in dll_files:
        dll_name = os.path.basename(dll_path)
        if get_module_handle_w(dll_name):
            skipped.append(dll_name)
            continue
        try:
            ctypes.CDLL(dll_path)
            loaded.append(dll_name)
        except OSError as e:
            failed.append(dll_name)
            logger.warning(f"Failed to load {dll_name} from site-packages: {e}")

    if skipped:
        logger.debug(f"Already loaded (skipped): {skipped}")
    if loaded:
        logger.info(
            f"Loaded {len(loaded)} extra cuDNN DLLs that ort.preload_dlls() missed: {loaded}"
        )
    if failed:
        logger.warning(f"Failed to load {len(failed)} cuDNN DLLs: {failed}")
```
Preload failure can still return success (True).
If ort.preload_dlls() raises, the exception is only logged and the flow can still hit the success branch and return True. Also, extra-DLL load failures are not propagated to the caller, so _check_for_libcudnn() may report success while cuDNN is still partially unavailable.
Proposed fix

```diff
-def _load_extra_cudnn_dlls():
+def _load_extra_cudnn_dlls() -> list[str]:
@@
-        return
+        return []
@@
-        return
+        return []
@@
     if failed:
         logger.warning(f"Failed to load {len(failed)} cuDNN DLLs: {failed}")
+    return failed
@@
-    try:
+    preload_exception: Exception | None = None
+    try:
         with redirect_stdout(captured), redirect_stderr(captured):
             ort.preload_dlls()
     except Exception as e:
+        preload_exception = e
         logger.warning(f"onnxruntime.preload_dlls() raised an exception: {e}")
@@
-    if f"Failed to load {core_cudnn_dll}" in preload_output:
+    preload_failed = preload_exception is not None or (
+        f"Failed to load {core_cudnn_dll}" in preload_output
+    )
+    extra_failed: list[str] = []
+    if not preload_failed and platform.system() == "Windows":
+        extra_failed = _load_extra_cudnn_dlls()
+        preload_failed = len(extra_failed) > 0
+
+    if preload_failed:
         logger.error(
             f"onnxruntime.preload_dlls() was called but {core_cudnn_dll} failed to load. "
             "cuDNN DLLs were NOT successfully loaded from site-packages."
         )
     else:
-        if platform.system() == "Windows":
-            _load_extra_cudnn_dlls()
         logger.info(
             "onnxruntime.preload_dlls() succeeded — CUDA/cuDNN DLLs loaded"
             " from site-packages. Verify version compatibility at"
             " https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
         )
     return True
```

Also applies to: 229-252
@coderabbitai This seems like a non-issue, since there is no real way to log failure through ort.preload_dlls(). We just check whether the function loads the main cudnn64 DLL and only log failure if it is not loaded. The extra-DLL step tries to load all DLLs in the venv nvidia folder; there should be no error logging there.
@hthadicherla, that's a fair clarification — the best-effort philosophy for extra DLL loading makes sense, and I agree that failures in _load_extra_cudnn_dlls() should remain warnings rather than propagating as hard errors.
One narrow concern still stands though: if ort.preload_dlls() raises an exception, captured is left empty, so the "Failed to load cudnn64_9" check silently passes (empty string never contains the pattern), _load_extra_cudnn_dlls() is called, and the function returns True — even though the main preload attempt actually failed with an exception. The exception is already being caught and logged as a warning, so this isn't a silent failure, but the function still surfaces as "success" to its caller.
A minimal guard for just that case (without changing the best-effort DLL logic at all):
```python
preload_exception = None
try:
    with redirect_stdout(captured), redirect_stderr(captured):
        ort.preload_dlls()
except Exception as e:
    preload_exception = e
    logger.warning(f"onnxruntime.preload_dlls() raised an exception: {e}")

preload_output = captured.getvalue()
if preload_output:
    logger.warning(f"preload_dlls() output:\n{preload_output}")

core_cudnn_dll = "cudnn64_9" if platform.system() == "Windows" else "libcudnn_adv"
if preload_exception is not None or f"Failed to load {core_cudnn_dll}" in preload_output:
    logger.error(...)
else:
    if platform.system() == "Windows":
        _load_extra_cudnn_dlls()  # best-effort, failures stay as warnings
    logger.info("onnxruntime.preload_dlls() succeeded ...")
    return True
```

That keeps `_load_extra_cudnn_dlls()` fully best-effort while correctly treating a raised exception as a load failure. Happy to drop this entirely if you consider the exception path acceptable in practice.
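This guard can be exercised without onnxruntime at all. The following hedged simulation injects a fake preload callable in place of `ort.preload_dlls()` and shows why the exception flag matters when the captured output is empty (function and parameter names here are illustrative):

```python
import io
from contextlib import redirect_stderr, redirect_stdout


def preload_succeeded(preload, core_dll="cudnn64_9") -> bool:
    """Return False if preload raised or printed a failure line for core_dll."""
    captured = io.StringIO()
    preload_exception = None
    try:
        with redirect_stdout(captured), redirect_stderr(captured):
            preload()
    except Exception as e:  # broad catch, mirroring the code under review
        preload_exception = e
    output = captured.getvalue()
    # Without the exception flag, a raised error leaves `output` empty and
    # the substring check alone would incorrectly report success.
    return preload_exception is None and f"Failed to load {core_dll}" not in output
```

Note the third case below: a preload that raises produces no captured output, so only the exception flag catches it.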
Codecov Report: ❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main    #1369      +/-   ##
===========================================
- Coverage   76.48%   66.34%   -10.15%
===========================================
  Files         471      471
  Lines       50487    51002     +515
===========================================
- Hits        38617    33837    -4780
- Misses      11870    17165    +5295
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
```python
    """Locate the nvidia cudnn bin directory inside site-packages."""
    from importlib.metadata import PackageNotFoundError, distribution

    for pkg_name in ("nvidia-cudnn-cu12", "nvidia-cudnn-cu11", "nvidia-cudnn-cu13"):
```

We don't support CUDA 11. Let's keep 12 and 13 only.

```diff
-    for pkg_name in ("nvidia-cudnn-cu12", "nvidia-cudnn-cu11", "nvidia-cudnn-cu13"):
+    for pkg_name in ("nvidia-cudnn-cu12", "nvidia-cudnn-cu13"):
```
```python
def _find_cudnn_bin_dir():
    """Locate the nvidia cudnn bin directory inside site-packages."""
    from importlib.metadata import PackageNotFoundError, distribution
```

Please move this import to the top of the file.

Done. Made a commit for this.
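With the import at module scope, a complete version of the lookup might look like the sketch below. The wheel layout (`nvidia/cudnn/bin`) and the use of `Distribution.locate_file` are assumptions about how nvidia-cudnn wheels are installed, not the exact code in the PR:

```python
import os
from importlib.metadata import PackageNotFoundError, distribution


def find_cudnn_bin_dir(pkg_names=("nvidia-cudnn-cu12", "nvidia-cudnn-cu13")):
    """Return the first existing cudnn bin dir among installed wheels, else None."""
    for pkg_name in pkg_names:
        try:
            dist = distribution(pkg_name)
        except PackageNotFoundError:
            continue  # wheel not installed; try the next CUDA flavor
        # Assumed wheel layout: site-packages/nvidia/cudnn/bin
        candidate = dist.locate_file("nvidia/cudnn/bin")
        if os.path.isdir(candidate):
            return str(candidate)
    return None
```

Looking up the installed distribution rather than hardcoding a site-packages path keeps the lookup working in virtualenvs and user installs alike.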
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
ajrasane left a comment: LGTM. Left two minor comments. PTAL.
```python
    dll_files = sorted(glob.glob(os.path.join(cudnn_bin_dir, "cudnn*.dll")))
    if not dll_files:
        return
```

nit: could you add a logger.debug statement that no cuDNN DLL files were found?
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Hey @kevalmorabia97, as you can see, this test https://github.com/NVIDIA/Model-Optimizer/actions/runs/25175182312/job/73807528625?pr=1369 [gpu-tests (gpu, 60, pytorch:26.03-py3)] is being cancelled at the 1-hour mark regardless of pass or fail. Can you increase the timeout for this test in CI/CD?
What does this PR do?
Type of change: Bug fix
cuDNN 9.21 added a new DLL dependency, cudnn_engines_tensor_ir64_9.dll, that ort.preload_dlls() has not been updated to load on Windows, so it fails when trying to load cuDNN with just the nvidia-cudnn-cu12>9.20 package installed.
This PR adds code to load any extra DLLs from the site-packages folder that the preload function misses.
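The discovery half of the fix reduces to a directory scan for cudnn DLLs. A minimal sketch of that scan (the function name is illustrative; the directory argument is whatever the cudnn bin lookup returns):

```python
import glob
import os


def list_cudnn_dlls(cudnn_bin_dir: str) -> list[str]:
    """Return sorted cudnn*.dll basenames found in the given directory."""
    pattern = os.path.join(cudnn_bin_dir, "cudnn*.dll")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))
```

Because the glob pattern matches any `cudnn*.dll`, newly added sub-libraries like cudnn_engines_tensor_ir64_9.dll are picked up without updating a hardcoded list.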
Summary by CodeRabbit