Reload NVIDIA drivers without rebooting on headless Linux GPU servers.
Driver updates, version mismatches, and stuck GPU states typically require a full server reboot. On production GPU servers, that means:
- Killing running workloads
- Minutes of downtime
- Lost compute time / rental revenue
This tool gracefully stops GPU workloads, unloads/reloads kernel modules, and restarts everything — no reboot required.
# Check current status
sudo python3 nvidia_driver_reload.py --status
# Update driver via package manager first
sudo apt-get update && sudo apt-get install nvidia-driver-550
# Then reload without rebooting
sudo python3 nvidia_driver_reload.py --reload --yes
# Or just do a GPU reset (lighter, faster)
sudo python3 nvidia_driver_reload.py --reset

- Update the driver via your package manager (`apt`, `yum`, `dnf`)
- Run this tool to reload the driver without rebooting
- Done: the new driver is active and workloads are restarted
# Example: upgrade from 535 to 550
sudo apt-get install nvidia-driver-550
sudo python3 nvidia_driver_reload.py --reload --yes

- Hot-reload drivers: unloads and reloads kernel modules in the correct order
- Intelligent process detection: uses the NVML API and `fuser` to find actual GPU users, not static process lists
- Container-aware: gracefully stops Docker/Podman GPU containers, restarts daemons, then restarts the containers
- Enterprise GPU support: automatically handles DCGM, nvidia-peermem, and Fabric Manager
- Smart error detection: detects when a reboot is actually required (XID 79, hardware failures)
- Handles nvidia_drm.modeset=1: unbinds VT consoles and the framebuffer automatically
- Rollback support: saves state before operations and can roll back on failure
- Safe by design: never kills system processes and uses authoritative detection only
CRITICAL: NVIDIA drivers < 535 have a silent data corruption bug when reloading on H100/H200 GPUs. Reloading nvidia.ko causes incorrect computation results with no errors reported.
Recommendation: Upgrade to driver 535+ before using this tool on H100/H200 systems.
# Just the script (no dependencies required)
curl -O https://raw.githubusercontent.com/YOUR_USERNAME/nvidia-driver-reload/main/nvidia_driver_reload.py
chmod +x nvidia_driver_reload.py
# Optional: better GPU detection and container control
pip install nvidia-ml-py docker psutil

# Show GPU status and reload feasibility
sudo python3 nvidia_driver_reload.py --status
# Comprehensive health check
sudo python3 nvidia_driver_reload.py --verify
# Dry run (see what would happen)
sudo python3 nvidia_driver_reload.py --reload --dry-run
# Full driver reload (stops workloads, unloads modules, reloads, restarts)
sudo python3 nvidia_driver_reload.py --reload
# Skip confirmation prompt (for automation)
sudo python3 nvidia_driver_reload.py --reload --yes
# GPU reset only (faster, doesn't unload modules)
sudo python3 nvidia_driver_reload.py --reset
# Stop all GPU workloads (manual control)
sudo python3 nvidia_driver_reload.py --stop
# Restart previously stopped workloads
sudo python3 nvidia_driver_reload.py --start
# Rollback from failed reload
sudo python3 nvidia_driver_reload.py --rollback

| Flag | Description |
|---|---|
| `--status`, `-s` | Show current NVIDIA status |
| `--verify` | Comprehensive nvidia-smi health check |
| `--reload`, `-r` | Full driver reload |
| `--reset` | GPU reset only (`nvidia-smi --gpu-reset`) |
| `--stop` | Stop all GPU workloads |
| `--start` | Restart stopped workloads |
| `--rollback` | Roll back from a failed state |
| `--yes`, `-y` | Skip confirmation prompts |
| `--dry-run` | Show what would happen |
| `--verbose`, `-v` | Verbose output |
| `--json` | JSON output for status/verify |
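
The flag table maps naturally onto an `argparse` parser. The sketch below is only an illustration of that mapping, not the script's actual source: the `build_parser` name and the decision to make the action flags mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flag table (not the script's real code)."""
    parser = argparse.ArgumentParser(prog="nvidia_driver_reload.py")

    # Assumption: exactly one action flag is given per invocation.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("--status", "-s", action="store_true", help="Show current NVIDIA status")
    action.add_argument("--verify", action="store_true", help="Comprehensive nvidia-smi health check")
    action.add_argument("--reload", "-r", action="store_true", help="Full driver reload")
    action.add_argument("--reset", action="store_true", help="GPU reset only")
    action.add_argument("--stop", action="store_true", help="Stop all GPU workloads")
    action.add_argument("--start", action="store_true", help="Restart stopped workloads")
    action.add_argument("--rollback", action="store_true", help="Roll back from a failed state")

    # Modifiers that can accompany any action.
    parser.add_argument("--yes", "-y", action="store_true", help="Skip confirmation prompts")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
    parser.add_argument("--json", action="store_true", help="JSON output for status/verify")
    return parser
```

With this layout, `build_parser().parse_args(["--reload", "--yes"])` yields a namespace where `reload` and `yes` are true and every other flag is false.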
- Linux (tested on Ubuntu 22.04/24.04, Debian 12)
- Root privileges
- Python 3.8+
- Headless server — No X11/Wayland display server using the GPU
pip install nvidia-ml-py # Better GPU detection via NVML
pip install docker # Docker container management
pip install psutil # Process detection

- Detect GPU state: check for fatal errors, running processes, and module usage
- Stop GPU containers: gracefully stop Docker/Podman containers using the GPU
- Stop services: nvidia-dcgm, nvidia-persistenced, nvidia-fabricmanager
- Kill GPU processes: only processes detected by NVML/fuser (not by name)
- Restart systemd-logind: releases DRM handles (the #1 hidden culprit)
- Handle modeset: unbind VT consoles and the framebuffer if nvidia_drm.modeset=1
- Unload modules: nvidia_drm → nvidia_modeset → nvidia_uvm → nvidia_peermem → nvidia
- Load modules: nvidia → nvidia_peermem → nvidia_uvm → nvidia_modeset → nvidia_drm
- Rebind console: restore the framebuffer and VT consoles
- Verify driver: comprehensive nvidia-smi health check
- Restart Docker: required to refresh nvidia-container-toolkit paths
- Restart containers: bring back GPU workloads
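
The unload and load steps above are mirror images: the load order is the unload order reversed. The following sketch builds the command plan for only those modules that are actually loaded. It is a dry-run illustration under assumptions: `plan_commands` and `loaded_modules` are hypothetical names, and using `rmmod`/`modprobe` is an assumption about how the script invokes the kernel tools.

```python
from typing import List, Set

# Unload order as listed in the steps above; the load order is its exact reverse.
UNLOAD_ORDER = ["nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia_peermem", "nvidia"]


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return module names from the contents of /proc/modules (first column)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def plan_commands(proc_modules_text: str) -> List[List[str]]:
    """Build unload-then-load commands for the NVIDIA modules currently loaded.

    Nothing is executed here; the caller decides whether to run or just print them.
    """
    present = loaded_modules(proc_modules_text)
    unload = [name for name in UNLOAD_ORDER if name in present]
    load = list(reversed(unload))
    return [["rmmod", name] for name in unload] + [["modprobe", name] for name in load]
```

On a machine without `nvidia_peermem`, passing the text of `/proc/modules` produces four `rmmod` commands followed by four `modprobe` commands, with `nvidia` removed last and loaded first.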
The script automatically handles enterprise GPU components when present:
| Component | Purpose | Handling |
|---|---|---|
| NVIDIA DCGM | Data Center GPU Manager — monitoring/metrics | Stopped before unload (holds GPU handles via NVML), restarted after |
| nvidia-peermem | GPUDirect RDMA for HPC/InfiniBand clusters | Unloaded in correct order (after nvidia_uvm, before nvidia) |
| Fabric Manager | NVSwitch/NVLink for DGX/HGX systems | Stopped before unload, restarted after (version must match driver) |
All components are auto-detected — if not present, they're silently skipped. No configuration needed.
This tool does NOT use static process name lists (which kill innocent python, docker, containerd processes). Instead, it uses authoritative detection:
| Method | What it catches |
|---|---|
| NVML API | All processes actively using GPU compute/graphics resources |
| fuser /dev/nvidia* | All processes holding open handles to NVIDIA device files |
If a process isn't detected by these tools, it's not using the GPU — leave it alone. This approach works for any GPU workload without manual updates:
- CUDA applications
- PyTorch/TensorFlow training jobs
- Docker containers with `--gpus`
- Podman GPU containers
- QEMU/KVM GPU passthrough
- FFmpeg hardware encoding
- Any other GPU workload
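
To illustrate the device-file side of this detection, here is a minimal sketch that approximates what `fuser /dev/nvidia*` reports by scanning `/proc/<pid>/fd` symlinks. It is not the script's implementation: the function name and the `proc_root` parameter are hypothetical, and unlike `fuser` it only sees open file descriptors, not memory mappings.

```python
import os
from typing import Set


def pids_holding_nvidia_devices(proc_root: str = "/proc") -> Set[int]:
    """Return PIDs that have an open file descriptor pointing at /dev/nvidia*."""
    pids: Set[int] = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed between listdir and readlink
            if target.startswith("/dev/nvidia"):
                pids.add(int(entry))
                break  # one matching descriptor is enough for this PID
    return pids
```

Run as root, `pids_holding_nvidia_devices()` returns the set of PIDs to consider; a process absent from both this set and the NVML process list would be left alone.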
- Never kills PIDs < 100 (kernel threads, init)
- Never kills critical system processes (systemd, dbus, kworker, etc.)
- Only terminates processes confirmed to be holding GPU resources
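
These safety rules amount to a filter applied to the detected PIDs before anything is terminated. The sketch below shows one way to express it. The names are hypothetical, the protected list contains only the three names given above (the real list is longer), and matching by name prefix is an assumption.

```python
from typing import Iterable, List, Tuple

MIN_KILLABLE_PID = 100  # rule above: never kill PIDs < 100
# Only the names listed above; the script's actual list is longer ("etc.").
PROTECTED_PREFIXES = ("systemd", "dbus", "kworker")


def is_protected(pid: int, name: str) -> bool:
    """Return True if the process must never be terminated."""
    if pid < MIN_KILLABLE_PID:
        return True
    # Assumption: prefix match, so "dbus-daemon" and "kworker/0:1" are covered.
    return name.startswith(PROTECTED_PREFIXES)


def killable(candidates: Iterable[Tuple[int, str]]) -> List[int]:
    """Filter (pid, name) pairs already confirmed as GPU users down to safe targets."""
    return [pid for pid, name in candidates if not is_protected(pid, name)]
```

The input is meant to be the output of GPU detection, so a process is terminated only if it both holds GPU resources and passes this filter.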
The script automatically detects scenarios where reload won't work:
| Condition | Why |
|---|---|
| XID 79 | GPU fell off PCIe bus — hardware issue |
| GSP firmware failure | Firmware needs full reset |
| NULL pointer in nvidia module | Kernel corruption |
| Display server running | X11/Wayland holds GPU — stop it first |
| XID | Severity | Action |
|---|---|---|
| 79 | Fatal | Reboot required |
| 48, 74, 95, 119 | Recoverable | GPU reset works |
| 31, 43, 45, 68, 69, 94 | App fault | Just restart application |
| 61, 62, 63, 64, 92 | Info | Monitor, usually fine |
Key insight: Many XIDs previously thought "fatal" are actually recoverable. The script uses timestamp filtering to avoid false positives from old dmesg entries.
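
The severity table and the timestamp filter can be combined in a few lines. This is a sketch under assumptions, not the script's code: the function names are hypothetical, and the regular expression assumes the default `dmesg` format with a leading `[seconds.micros]` uptime stamp followed by `NVRM: Xid (...): <number>`.

```python
import re
from typing import Dict, List, Tuple

# Severity mapping copied from the table above.
XID_SEVERITY: Dict[int, str] = {}
for _xids, _severity in (
    ((79,), "fatal"),
    ((48, 74, 95, 119), "recoverable"),
    ((31, 43, 45, 68, 69, 94), "app_fault"),
    ((61, 62, 63, 64, 92), "info"),
):
    for _xid in _xids:
        XID_SEVERITY[_xid] = _severity

# Assumed line shape: "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ..."
XID_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (\d+)")


def recent_xids(dmesg_text: str, since: float) -> List[Tuple[float, int, str]]:
    """Return (timestamp, xid, severity) for Xid events at or after `since` seconds of uptime."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.match(line)
        if not match:
            continue
        timestamp, xid = float(match.group(1)), int(match.group(2))
        if timestamp >= since:  # ignore stale entries from before the window
            events.append((timestamp, xid, XID_SEVERITY.get(xid, "unknown")))
    return events


def reboot_required(events: List[Tuple[float, int, str]]) -> bool:
    """True if any event in the window is classified as fatal."""
    return any(severity == "fatal" for _, _, severity in events)
```

Choosing `since` as, for example, the uptime at which the driver was last loaded keeps an old XID 79 from a previous incident from blocking a reload today.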
- Headless only — Display servers prevent module unload
- CUDA state lost — No checkpoint/restore, running CUDA jobs are killed
- NVLink systems — Fabric Manager version must match driver
- Screen blanks — Expected during modeset=1 unbind (console comes back)
- H100/H200 driver < 535 — Silent data corruption bug on reload (upgrade first!)
| Path | Purpose |
|---|---|
| `/var/run/nvidia-reload.lock` | Prevents concurrent runs |
| `/var/lib/nvidia-reload/state.json` | Saved state for rollback |
| `/var/log/nvidia-reload.log` | Operation log |
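
A lock file like the one in the table is commonly paired with an advisory `flock`. The sketch below shows that pattern; whether the script uses `flock` or another mechanism is an assumption, and `exclusive_lock` is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/var/run/nvidia-reload.lock"  # path from the table above


@contextmanager
def exclusive_lock(path: str = LOCK_PATH):
    """Hold an exclusive, non-blocking advisory lock on `path`.

    Raises RuntimeError immediately if another run already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError(f"another run holds {path}") from None
        yield fd
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

Wrapping the whole reload in `with exclusive_lock():` makes a second invocation fail fast instead of racing the first one during module unload. Because the kernel drops the lock when the process exits, a crashed run does not leave a stale lock behind.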
MIT