GPUOP-907: fix gpuagent SIGSEGV on MI350P from clock freq OOB read #75
Open
bhatturu wants to merge 1 commit into
sarat-k
reviewed
Jun 23, 2026
static inline uint64_t
current_frequency_mhz (amdsmi_frequencies_t *freq)
{
    uint32_t idx = (freq->current < freq->num_supported &&
Collaborator
missing braces around each condition
Force-pushed from fcaff1d to 218b21e
smi_fill_clock_status_ indexed freq.frequency[freq.current] without bounds-checking. On gfx950 (MI350P) amdsmi returns SUCCESS with a garbage freq.current / num_supported for the DF/DCEF/PCIe clock types, so the read goes past the fixed-size frequency[] array and SIGSEGVs the grpcpp_sync_ser thread (ip 0x116384f) on every `gpuctl show gpu`. MI300/gfx942 returns a valid in-bounds index, so the bug never fires there.

- add current_frequency_hz() helper that clamps freq.current to a valid index before indexing frequency[], returning the raw value in Hz
- skip DF/DCEF/PCIe clock types whose current frequency is reported as NA (AMDSMI_INVALID_UINT32) instead of populating a bogus clock entry
- clamp num_supported to AMDSMI_MAX_NUM_FREQUENCIES in find_low_high_frequency() before constructing the vector

Validated on MI350P: baseline SIGSEGVs at ip 0x116384f; fixed binary no longer faults in the clock path (gdb confirms smi_fill_clock_status_ is clear).
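For readers without the diff open, the three changes in the commit message can be sketched roughly as below. This is not the code from the PR: the struct, the constants, and every function name here are hypothetical stand-ins (the real definitions live in the amdsmi headers and in gpuagent), and falling back to the last valid entry when the index is out of range is an assumption of this sketch, not something the commit message specifies.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the amdsmi definitions. The names, values,
// and layout are assumptions made for this sketch only.
constexpr uint32_t kMaxNumFrequencies = 33;       // stand-in for AMDSMI_MAX_NUM_FREQUENCIES
constexpr uint32_t kInvalidUint32 = 0xFFFFFFFFu;  // stand-in for AMDSMI_INVALID_UINT32

struct frequencies_sketch_t {
    uint32_t num_supported;                  // may be garbage on the affected hardware
    uint32_t current;                        // index into frequency[], may be garbage
    uint64_t frequency[kMaxNumFrequencies];  // fixed-size array
};

// Number of entries that are safe to read, regardless of what the
// library reported in num_supported.
static inline uint32_t usable_count(const frequencies_sketch_t &freq) {
    return std::min(freq.num_supported, kMaxNumFrequencies);
}

// True when the current frequency is reported as NA, in which case the
// caller should skip the clock type instead of emitting an entry.
static inline bool current_is_na(const frequencies_sketch_t &freq) {
    return freq.current == kInvalidUint32;
}

// Returns the current frequency in Hz without ever indexing past the
// array. Out-of-range indices fall back to the last usable entry
// (an assumption of this sketch); an empty table yields 0.
static inline uint64_t current_frequency_hz_sketch(const frequencies_sketch_t &freq) {
    const uint32_t count = usable_count(freq);
    if (count == 0) {
        return 0;
    }
    const uint32_t idx = std::min(freq.current, count - 1);
    return freq.frequency[idx];
}

// Builds the vector of supported frequencies from the clamped count, so
// a garbage num_supported cannot drive the copy past the array.
static inline std::vector<uint64_t> supported_frequencies(const frequencies_sketch_t &freq) {
    return std::vector<uint64_t>(freq.frequency, freq.frequency + usable_count(freq));
}
```

A caller in this sketch would check current_is_na() first and skip the clock type when it returns true, then read the value through current_frequency_hz_sketch() rather than indexing frequency[] directly.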
Force-pushed from 218b21e to d771829