Refactor public API and add NVML installation fallback #2

Open
riccardoc95 wants to merge 37 commits into sales-lab:devel from riccardoc95:devel

riccardoc95 commented May 21, 2026

Changelog

Public API

The low-level nvml_* wrappers are no longer exported.
The public API has been split across separate source files for easier maintenance and now consists of:

  • cm_start()
  • cm_timestamp()
  • cm_stop()
  • cm_parser()
  • cm_vizdf()
  • cm_plot_usage()
  • CudaMonSession()

R CMD check notes

The "no visible binding for global variable" notes raised for cm_plot_usage() were addressed by explicitly binding the variables used in the ggplot aesthetics to NULL inside the function:
tm <- value <- type <- label_y <- step <- NULL
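As a minimal sketch of this pattern (not the PR's actual cm_plot_usage() code), the function below binds the column names to NULL before they appear in aes(). The function name plot_usage_sketch and the sample data frame are hypothetical; only the variable names tm, value, and type are taken from the changelog.

```r
library(ggplot2)

# Hypothetical stand-in for a plotting function; not the package's code.
plot_usage_sketch <- function(df) {
  # Bind the aesthetic names locally so the static code check sees them
  # as defined. ggplot2 still resolves them as columns of `df` first.
  tm <- value <- type <- NULL
  ggplot(df, aes(x = tm, y = value, colour = type)) +
    geom_line()
}

# Hypothetical sample data in long format.
df <- data.frame(
  tm    = c(0, 1, 2, 0, 1, 2),
  value = c(10, 20, 15, 1, 2, 3),
  type  = rep(c("gpu", "mem"), each = 3)
)

p <- plot_usage_sketch(df)
```

The NULL bindings do not change the plot, because the data columns mask the local variables when the aesthetics are evaluated.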

NVML deprecation warning

The deprecated nvmlDeviceGetTemperature() call was replaced with the newer nvmlDeviceGetTemperatureV() call.

Torch/reticulate dependency

The previous Python/torch-based GPU test path was removed from the inst directory.

Process discovery

The monitoring code no longer parses the output of the ps command directly. Process discovery now goes through the ps R package, in a new script that defines the process_pids function.
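To illustrate what process discovery through the ps package can look like, here is a rough sketch that collects a process and all of its descendants. The helper descendant_pids() is hypothetical and is not the PR's process_pids(), whose signature and behaviour are not shown here; the sketch assumes the ps package is installed.

```r
library(ps)

# Hypothetical helper, NOT the PR's process_pids(): return the PID of a
# root process together with the PIDs of all of its descendants.
descendant_pids <- function(root_pid = Sys.getpid()) {
  root <- ps_handle(as.integer(root_pid))
  kids <- ps_children(root, recursive = TRUE)
  kid_pids <- vapply(kids, function(k) as.integer(ps_pid(k)), integer(1))
  c(as.integer(ps_pid(root)), kid_pids)
}

pids <- descendant_pids()
```

Working with process handles rather than parsed text avoids depending on the column layout of the ps command, which differs between platforms.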

Bioconductor GPU CI

Added .BBSoptions with GPU reliance configuration so Bioconductor can opt the package into GPU-capable builders.

Installation fallback

The configure logic now includes a fallback for systems where NVML is not available. This allows the package to install in non-NVIDIA or non-CUDA environments while still enabling NVML-backed monitoring when the library is available.

Documentation

Documentation was expanded and updated, including:

  • clearer descriptions of GPU metrics
  • updated examples
  • labelled vignette chunks
  • improved vignette formatting
  • optional Rcollectl integration documented as an enhancement

Current MIG-related limitations

gpu_utilization_pct and memory_utilization_pct are not always available when running on MIG-enabled devices.

At the moment, these missing metrics are handled conservatively: when a metric is not reported, it is excluded from the plot.
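The exclusion rule can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the helper drop_unreported(), the sample data, and the memory_used_bytes column are hypothetical, while gpu_utilization_pct and memory_utilization_pct are the metric names mentioned above.

```r
# Hypothetical samples: on a MIG-enabled device the utilization columns
# may contain only NA values.
samples <- data.frame(
  tm                     = c(0, 1, 2),
  gpu_utilization_pct    = c(NA_real_, NA_real_, NA_real_),
  memory_utilization_pct = c(NA_real_, NA_real_, NA_real_),
  memory_used_bytes      = c(1e9, 2e9, 1.5e9)  # hypothetical column name
)

# Hypothetical helper: keep the time column plus every metric that has at
# least one reported (non-NA) value.
drop_unreported <- function(df, metrics) {
  has_data <- vapply(metrics, function(m) any(!is.na(df[[m]])), logical(1))
  df[, c("tm", metrics[has_data]), drop = FALSE]
}

metric_cols <- setdiff(names(samples), "tm")
plot_df <- drop_unreported(samples, metric_cols)
```

Only the columns that survive this filter would then be passed on to the plotting step.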

The memory total field was also renamed from memory_total_bytes to memory_total_device_bytes to make its meaning explicit. This value refers to the total memory of the physical GPU device, not necessarily the memory limit of a specific MIG instance.
