nvidia-driver-reload

Reload NVIDIA drivers without rebooting on headless Linux GPU servers.

Why?

Driver updates, version mismatches, and stuck GPU states typically require a full server reboot. On production GPU servers, that means:

Killing running workloads
Minutes of downtime
Lost compute time / rental revenue

This tool gracefully stops GPU workloads, unloads/reloads kernel modules, and restarts everything — no reboot required.

Quick Start

# Check current status
sudo python3 nvidia_driver_reload.py --status

# Update driver via package manager first
sudo apt-get update && sudo apt-get install nvidia-driver-550

# Then reload without rebooting
sudo python3 nvidia_driver_reload.py --reload --yes

# Or just do a GPU reset (lighter, faster)
sudo python3 nvidia_driver_reload.py --reset

Workflow

Update driver via your package manager (apt, yum, dnf)
Run this tool to reload the driver without rebooting
Done — new driver is active, workloads restarted

# Example: upgrade from 535 to 550
sudo apt-get install nvidia-driver-550
sudo python3 nvidia_driver_reload.py --reload --yes
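
After a package upgrade, the kernel module still loaded in memory is typically the old one until a reload happens. The following is a minimal sketch of how that mismatch could be checked, not the tool's actual code. It assumes the loaded version can be read from /proc/driver/nvidia/version and the on-disk version from `modinfo -F version nvidia`; all function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Matches driver versions such as "550.54.14" or "535.183.01".
_VERSION_RE = re.compile(r"\b(\d{3}\.\d+(?:\.\d+)?)\b")


def parse_driver_version(text: str) -> Optional[str]:
    """Return the first driver-style version number found in text, or None."""
    match = _VERSION_RE.search(text)
    return match.group(1) if match else None


def loaded_version(path: str = "/proc/driver/nvidia/version") -> Optional[str]:
    """Version of the kernel module currently loaded (None if unreadable)."""
    try:
        return parse_driver_version(Path(path).read_text())
    except OSError:
        return None


def installed_version() -> Optional[str]:
    """Version of the nvidia module on disk, as reported by modinfo."""
    try:
        out = subprocess.run(
            ["modinfo", "-F", "version", "nvidia"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_driver_version(out)


def reload_needed(loaded: Optional[str], installed: Optional[str]) -> bool:
    """True only when both versions are known and they differ."""
    return loaded is not None and installed is not None and loaded != installed
```

Calling `reload_needed(loaded_version(), installed_version())` before and after the reload gives a quick sanity check that the new module is the one actually running.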

Features

Hot-reload drivers — Unload and reload kernel modules in the correct order
Intelligent process detection — Uses NVML API and fuser to find actual GPU users, not static process lists
Container-aware — Gracefully stops Docker/Podman GPU containers, restarts daemons, restarts containers
Enterprise GPU support — Automatically handles DCGM, nvidia-peermem, and Fabric Manager
Smart error detection — Detects when reboot is actually required (XID 79, hardware failures)
Handles nvidia_drm.modeset=1 — Unbinds VT consoles and framebuffer automatically
Rollback support — Saves state before operations, can rollback on failure
Safe by design — Never kills system processes, uses authoritative detection only

⚠️ H100/H200 Warning

CRITICAL: On H100/H200 GPUs, NVIDIA drivers older than 535 have a silent data corruption bug that is triggered by reloading. After nvidia.ko is reloaded, computations can return incorrect results with no errors reported.

Recommendation: Upgrade to driver 535+ before using this tool on H100/H200 systems.

Installation

# Just the script (no dependencies required)
curl -O https://raw.githubusercontent.com/YOUR_USERNAME/nvidia-driver-reload/main/nvidia_driver_reload.py
chmod +x nvidia_driver_reload.py

# Optional: better GPU detection and container control
# (install into the Python environment that root uses, since the tool runs under sudo)
pip install nvidia-ml-py docker psutil

Usage

# Show GPU status and reload feasibility
sudo python3 nvidia_driver_reload.py --status

# Comprehensive health check
sudo python3 nvidia_driver_reload.py --verify

# Dry run (see what would happen)
sudo python3 nvidia_driver_reload.py --reload --dry-run

# Full driver reload (stops workloads, unloads modules, reloads, restarts)
sudo python3 nvidia_driver_reload.py --reload

# Skip confirmation prompt (for automation)
sudo python3 nvidia_driver_reload.py --reload --yes

# GPU reset only (faster, doesn't unload modules)
sudo python3 nvidia_driver_reload.py --reset

# Stop all GPU workloads (manual control)
sudo python3 nvidia_driver_reload.py --stop

# Restart previously stopped workloads
sudo python3 nvidia_driver_reload.py --start

# Roll back from a failed reload
sudo python3 nvidia_driver_reload.py --rollback

Options

Flag	Description
`--status`, `-s`	Show current NVIDIA status
`--verify`	Comprehensive nvidia-smi health check
`--reload`, `-r`	Full driver reload
`--reset`	GPU reset only (nvidia-smi --gpu-reset)
`--stop`	Stop all GPU workloads
`--start`	Restart stopped workloads
`--rollback`	Roll back from a failed state
`--yes`, `-y`	Skip confirmation prompts
`--dry-run`	Show what would happen
`--verbose`, `-v`	Verbose output
`--json`	JSON output for status/verify
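
For readers wrapping the tool in automation, the flag table above maps naturally onto an argparse definition. This is only an illustrative sketch of such a parser, not the script's real argument handling. In particular, treating the action flags as mutually exclusive is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags listed in the Options table."""
    parser = argparse.ArgumentParser(
        prog="nvidia_driver_reload.py",
        description="Reload NVIDIA drivers without rebooting.",
    )
    # Assumption: only one action may be requested per invocation.
    action = parser.add_mutually_exclusive_group()
    action.add_argument("--status", "-s", action="store_true", help="Show current NVIDIA status")
    action.add_argument("--verify", action="store_true", help="Comprehensive health check")
    action.add_argument("--reload", "-r", action="store_true", help="Full driver reload")
    action.add_argument("--reset", action="store_true", help="GPU reset only")
    action.add_argument("--stop", action="store_true", help="Stop all GPU workloads")
    action.add_argument("--start", action="store_true", help="Restart stopped workloads")
    action.add_argument("--rollback", action="store_true", help="Roll back from a failed state")
    # Modifiers that combine with an action.
    parser.add_argument("--yes", "-y", action="store_true", help="Skip confirmation prompts")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
    parser.add_argument("--json", action="store_true", help="JSON output for status/verify")
    return parser
```

With this sketch, `build_parser().parse_args(["--reload", "--yes"])` yields a namespace with `reload` and `yes` set, while combining two actions such as `--reload --reset` exits with a usage error.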

Requirements

Linux (tested on Ubuntu 22.04/24.04, Debian 12)
Root privileges
Python 3.8+
Headless server — No X11/Wayland display server using the GPU

Optional Dependencies

pip install nvidia-ml-py   # Better GPU detection via NVML
pip install docker         # Docker container management
pip install psutil         # Process detection

How It Works

Detect GPU state — Check for fatal errors, running processes, module usage
Stop GPU containers — Gracefully stop Docker/Podman containers using GPU
Stop services — nvidia-dcgm, nvidia-persistenced, nvidia-fabricmanager
Kill GPU processes — Only processes detected by NVML/fuser (not by name)
Restart systemd-logind — Releases DRM handles (the #1 hidden culprit)
Handle modeset — Unbind VT consoles and framebuffer if nvidia_drm.modeset=1
Unload modules — nvidia_drm → nvidia_modeset → nvidia_uvm → nvidia_peermem → nvidia
Load modules — nvidia → nvidia_peermem → nvidia_uvm → nvidia_modeset → nvidia_drm
Rebind console — Restore framebuffer and VT consoles
Verify driver — Comprehensive nvidia-smi health check
Restart Docker — Required to refresh nvidia-container-toolkit paths
Restart containers — Bring back GPU workloads
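
The module ordering in the steps above can be sketched as a small planning function. This is an illustration under stated assumptions rather than the tool's implementation: it takes text in /proc/modules format, keeps only the NVIDIA modules that are actually loaded, and returns them in the documented unload and load order. The function names are hypothetical.

```python
from typing import Iterable, List, Set

# Orders copied from the "Unload modules" and "Load modules" steps above.
UNLOAD_ORDER = ["nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia_peermem", "nvidia"]
LOAD_ORDER = ["nvidia", "nvidia_peermem", "nvidia_uvm", "nvidia_modeset", "nvidia_drm"]


def parse_loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return module names from text in /proc/modules format.

    Each non-empty line starts with the module name followed by its size,
    reference count, and dependents.
    """
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def plan_unload(loaded: Iterable[str]) -> List[str]:
    """NVIDIA modules to remove, dependents first, skipping absent ones."""
    present = set(loaded)
    return [name for name in UNLOAD_ORDER if name in present]


def plan_load(previously_loaded: Iterable[str]) -> List[str]:
    """NVIDIA modules to load again, core module first, skipping absent ones."""
    present = set(previously_loaded)
    return [name for name in LOAD_ORDER if name in present]
```

Skipping absent modules is what lets the same plan cover both a plain workstation-class server and a system with nvidia_peermem installed.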

Enterprise GPU Support (A100, H100, H200)

The script automatically handles enterprise GPU components when present:

Component	Purpose	Handling
NVIDIA DCGM	Data Center GPU Manager — monitoring/metrics	Stopped before unload (holds GPU handles via NVML), restarted after
nvidia-peermem	GPUDirect RDMA for HPC/InfiniBand clusters	Unloaded in correct order (after nvidia_uvm, before nvidia)
Fabric Manager	NVSwitch/NVLink for DGX/HGX systems	Stopped before unload, restarted after (version must match driver)

All components are auto-detected — if not present, they're silently skipped. No configuration needed.

Intelligent Process Detection

This tool does NOT use static process name lists, which can kill unrelated python, docker, or containerd processes. Instead, it uses authoritative detection:

Method	What it catches
NVML API	All processes actively using GPU compute/graphics resources
fuser /dev/nvidia*	All processes holding open handles to NVIDIA device files

If neither method reports a process, the tool treats it as not using the GPU and leaves it alone. This approach works for any GPU workload without manual updates:

CUDA applications
PyTorch/TensorFlow training jobs
Docker containers with --gpus
Podman GPU containers
QEMU/KVM GPU passthrough
FFmpeg hardware encoding
Any other GPU workload
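
To make the device-file approach concrete, here is a rough sketch that approximates what `fuser /dev/nvidia*` reports by walking the file descriptor links under /proc. It is not the tool's code: the function name and the `proc_root` parameter are hypothetical, and unlike fuser it only looks at open file descriptors, not memory mappings.

```python
import os
from typing import Dict, Set


def find_device_holders(proc_root: str = "/proc",
                        prefix: str = "/dev/nvidia") -> Dict[int, Set[str]]:
    """Map each PID to the NVIDIA device files it holds open."""
    holders: Dict[int, Set[str]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return holders
```

Run as root, `find_device_holders()` returns an empty dict when nothing holds a device file open, which is the condition that needs to hold before modules can be unloaded.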

Safety Guarantees

Never kills PIDs < 100 (kernel threads, init)
Never kills critical system processes (systemd, dbus, kworker, etc.)
Only terminates processes confirmed to be holding GPU resources
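
The guarantees above amount to a filter applied before any signal is sent. A minimal sketch of such a filter follows, assuming the protected names are matched by prefix; the prefix list is taken from the examples above, and the function name is hypothetical rather than taken from the script.

```python
from typing import Iterable

# From the guarantee above: PIDs below 100 are never killed.
MIN_KILLABLE_PID = 100
# Hypothetical protected prefixes, based on the examples listed above.
PROTECTED_PREFIXES = ("systemd", "dbus", "kworker")


def is_safe_to_kill(pid: int, name: str, gpu_pids: Iterable[int]) -> bool:
    """Return True only if every safety guarantee allows terminating pid.

    gpu_pids is the set of PIDs confirmed (by NVML or fuser) to hold
    GPU resources.
    """
    if pid < MIN_KILLABLE_PID:
        return False  # kernel threads, init
    if name.startswith(PROTECTED_PREFIXES):
        return False  # critical system processes
    return pid in set(gpu_pids)  # must be a confirmed GPU user
```

The checks are ordered so that a protected process is refused even if it somehow appears in the detected GPU set.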

When Reboot Is Required

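The script automatically detects conditions under which a reload won't work. The first three require a reboot; a running display server only needs to be stopped first: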

Condition	Why
XID 79	GPU fell off PCIe bus — hardware issue
GSP firmware failure	Firmware needs full reset
NULL pointer in nvidia module	Kernel corruption
Display server running	X11/Wayland holds GPU — stop it first

XID Error Handling

XID	Severity	Action
79	Fatal	Reboot required
48, 74, 95, 119	Recoverable	GPU reset works
31, 43, 45, 68, 69, 94	App fault	Just restart application
61, 62, 63, 64, 92	Info	Monitor, usually fine

Key insight: Many XIDs previously thought "fatal" are actually recoverable. The script uses timestamp filtering to avoid false positives from old dmesg entries.
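
The XID table and the timestamp filtering can be sketched together. The example below is an assumption-laden illustration, not the script's parser: it assumes kernel log lines of the form `[ seconds] NVRM: Xid (PCI:...): <number>, ...`, uses a hypothetical `since` cutoff in seconds since boot, and labels any XID missing from the table as "unknown".

```python
import re
from typing import Iterable, List, NamedTuple

# Severity and action per XID, copied from the table above.
_TABLE = [
    ((79,), "fatal", "reboot required"),
    ((48, 74, 95, 119), "recoverable", "GPU reset"),
    ((31, 43, 45, 68, 69, 94), "app fault", "restart application"),
    ((61, 62, 63, 64, 92), "info", "monitor"),
]
XID_INFO = {xid: (sev, action) for xids, sev, action in _TABLE for xid in xids}

# Assumed dmesg format: "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, ..."
_XID_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+NVRM: Xid \(([^)]*)\): (\d+)")


class XidEvent(NamedTuple):
    timestamp: float
    device: str
    xid: int
    severity: str
    action: str


def recent_xids(dmesg_lines: Iterable[str], since: float) -> List[XidEvent]:
    """Return XID events logged at or after `since` (seconds since boot)."""
    events = []
    for line in dmesg_lines:
        match = _XID_RE.match(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        if timestamp < since:
            continue  # stale entry from before the cutoff
        xid = int(match.group(3))
        # Assumption: XIDs missing from the table are reported as unknown.
        severity, action = XID_INFO.get(xid, ("unknown", "investigate"))
        events.append(XidEvent(timestamp, match.group(2), xid, severity, action))
    return events
```

Choosing `since` as the time of the last successful driver load is one way to keep an old XID 79 from a previous incident from blocking a reload today.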

Limitations

Headless only — Display servers prevent module unload
CUDA state lost — No checkpoint/restore, running CUDA jobs are killed
NVLink systems — Fabric Manager version must match driver
Screen blanks — Expected during modeset=1 unbind (console comes back)
H100/H200 driver < 535 — Silent data corruption bug on reload (upgrade first!)

Files

Path	Purpose
`/var/run/nvidia-reload.lock`	Prevents concurrent runs
`/var/lib/nvidia-reload/state.json`	Saved state for rollback
`/var/log/nvidia-reload.log`	Operation log
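
One common way to implement a lock file that prevents concurrent runs is an advisory `flock` taken without blocking. The sketch below shows that pattern; whether the script uses `flock` specifically is an assumption, and `acquire_lock` and `release_lock` are hypothetical names.

```python
import fcntl
import os
from typing import Optional


def acquire_lock(path: str = "/var/run/nvidia-reload.lock") -> Optional[int]:
    """Try to take an exclusive lock; return the fd, or None if already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    # Record the owner PID to help with debugging a stuck lock.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd


def release_lock(fd: int) -> None:
    """Release the lock and close the file descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the kernel drops the lock when the holding process exits, a crashed run does not leave a stale lock behind, even though the file itself remains.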

License

MIT
