GPUOP-907: fix gpuagent SIGSEGV on MI350P from clock freq OOB read#75

Open
bhatturu wants to merge 1 commit into
ROCm:mainfrom
bhatturu:fix/gpuop-907-clock-oob
Conversation

@bhatturu

Contributor

smi_fill_clock_status_ indexed freq.frequency[freq.current] without bounds checking. On gfx950 (MI350P), amdsmi returns SUCCESS but with a garbage freq.current / num_supported for the DF/DCEF/PCIe clock types, so the read runs past the end of the fixed-size frequency[] array and the grpcpp_sync_ser thread takes a SIGSEGV (ip 0x116384f) on every gpuctl show gpu. MI300/gfx942 returns a valid in-bounds index, so the bug never fires there.

  • add current_frequency_mhz() helper that clamps freq.current to a valid index before indexing frequency[]
  • clamp num_supported to AMDSMI_MAX_NUM_FREQUENCIES in find_low_high_frequency() before constructing the vector

Validated on MI350P: baseline SIGSEGVs at ip 0x116384f; fixed binary no longer faults in the clock path (gdb confirms smi_fill_clock_status_ is clear).
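
The clamped lookup described above could look roughly like the sketch below. It is not the code in this PR. `demo_frequencies_t`, `kDemoMaxFreqs`, and `demo_current_frequency` are hypothetical stand-ins for the amdsmi struct, its array-size constant, and the new helper, and falling back to entry 0 on a bad index is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the amdsmi types; the names and the array
// size are placeholders for illustration only.
constexpr uint32_t kDemoMaxFreqs = 8;

struct demo_frequencies_t {
    uint32_t num_supported;             // entries the library claims are valid
    uint32_t current;                   // index of the current frequency
    uint64_t frequency[kDemoMaxFreqs];  // fixed-size table
};

// Return the current frequency, falling back to entry 0 (an assumption)
// when the reported index is outside either the reported count or the
// capacity of the array.
static inline uint64_t
demo_current_frequency(const demo_frequencies_t *freq)
{
    assert(freq != nullptr);
    uint32_t idx = ((freq->current < freq->num_supported) &&
                    (freq->current < kDemoMaxFreqs)) ? freq->current : 0;
    return freq->frequency[idx];
}
```

Checking against both `num_supported` and the array capacity matters because either field may be garbage: a bogus `num_supported` alone would let an equally bogus `current` through.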


@bhatturu bhatturu closed this Jun 23, 2026
@bhatturu bhatturu reopened this Jun 23, 2026
static inline uint64_t
current_frequency_mhz (amdsmi_frequencies_t *freq)
{
uint32_t idx = (freq->current < freq->num_supported &&

Collaborator

missing braces around each condition

@bhatturu bhatturu force-pushed the fix/gpuop-907-clock-oob branch from fcaff1d to 218b21e Compare June 23, 2026 17:30
smi_fill_clock_status_ indexed freq.frequency[freq.current] without
bounds-checking. On gfx950 (MI350P) amdsmi returns SUCCESS with a garbage
freq.current / num_supported for the DF/DCEF/PCIe clock types, so the read
goes past the fixed-size frequency[] array and SIGSEGVs the grpcpp_sync_ser
thread (ip 0x116384f) on every `gpuctl show gpu`. MI300/gfx942 returns a
valid in-bounds index so the bug never fires there.

- add current_frequency_hz() helper that clamps freq.current to a valid
  index before indexing frequency[], returning the raw value in Hz
- skip DF/DCEF/PCIe clock types whose current frequency is reported as NA
  (AMDSMI_INVALID_UINT32) instead of populating a bogus clock entry
- clamp num_supported to AMDSMI_MAX_NUM_FREQUENCIES in
  find_low_high_frequency() before constructing the vector

Validated on MI350P: baseline SIGSEGVs at ip 0x116384f; fixed binary no
longer faults in the clock path (gdb confirms smi_fill_clock_status_ is
clear).
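
As a rough illustration of the last two bullets, the sketch below clamps the reported count before copying the table into a vector and treats a sentinel index as "not available". It is a sketch under stated assumptions, not the PR's implementation. `sketch_frequencies_t`, `kSketchMaxFreqs`, `kSketchInvalidU32`, `sketch_low_high`, and `sketch_is_na` are hypothetical names, and the sentinel value and the choice to test `current` against it are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the array size and sentinel value are assumed.
constexpr uint32_t kSketchMaxFreqs = 8;
constexpr uint32_t kSketchInvalidU32 = 0xFFFFFFFFu;

struct sketch_frequencies_t {
    uint32_t num_supported;
    uint32_t current;
    uint64_t frequency[kSketchMaxFreqs];
};

// Find the lowest and highest table entries, reading at most
// kSketchMaxFreqs of them. Returns false when there is nothing to read.
static bool
sketch_low_high(const sketch_frequencies_t &freq, uint64_t *low, uint64_t *high)
{
    assert(low != nullptr && high != nullptr);
    // Clamp the reported count so the vector never copies past the array.
    const uint32_t n = std::min(freq.num_supported, kSketchMaxFreqs);
    if (n == 0) {
        return false;
    }
    std::vector<uint64_t> table(freq.frequency, freq.frequency + n);
    const auto range = std::minmax_element(table.begin(), table.end());
    *low = *range.first;
    *high = *range.second;
    return true;
}

// True when the clock type should be skipped instead of reported
// (assumes the sentinel shows up in the current index).
static bool
sketch_is_na(const sketch_frequencies_t &freq)
{
    return freq.current == kSketchInvalidU32;
}
```

A caller would check `sketch_is_na()` first and only populate a clock entry when it returns false.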
@bhatturu bhatturu force-pushed the fix/gpuop-907-clock-oob branch from 218b21e to d771829 Compare June 23, 2026 22:43
2 participants