Update rocm_agent_enumerator to better handle numerous parallel usages #47
jlgreathouse wants to merge 4 commits into ROCm:master
The PCI ID backup method in rocm_agent_enumerator, where the tool uses lspci to find all AMD GPU devices in the system and manually match them to a gfx version, is extremely outdated. The PCI ID list did not include anything after Vega 10, and the actual call to lspci no longer returned anything due to some missing conversions. The patch adds all GPUs that might be needed by ROCr up through Navy Flounder. The PCI ID to gfx matching pulls from the amdgpu driver and libhsakmt.
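The lookup described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the patch. The two table entries are an illustrative subset, the helper name `gfx_from_lspci` is hypothetical, and the sample text assumes the machine-readable format printed by `lspci -n -mm` rather than calling lspci for real.

```python
import shlex

AMD_VENDOR_ID = "1002"

# Illustrative subset only; the table in the patch covers many more devices.
PCI_ID_TO_GFX = {
    "687f": "gfx900",  # a Vega 10 device ID
    "66af": "gfx906",  # a Vega 20 device ID
}


def gfx_from_lspci(lspci_output):
    """Map `lspci -n -mm` style lines to gfx names, skipping unknown devices."""
    found = []
    for line in lspci_output.splitlines():
        # Fields are: slot, class, vendor, device, then optional extras.
        fields = shlex.split(line)
        if len(fields) < 4:
            continue
        vendor, device = fields[2].lower(), fields[3].lower()
        if vendor != AMD_VENDOR_ID:
            continue
        gfx = PCI_ID_TO_GFX.get(device)
        if gfx is not None and gfx not in found:
            found.append(gfx)
    return found


# Made-up sample input: one AMD GPU and one non-AMD device.
SAMPLE = '''03:00.0 "0300" "1002" "687f" -rc1 "1002" "0b36"
00:1f.3 "0403" "8086" "a170" -r31 "1043" "8698"
'''
print(gfx_from_lspci(SAMPLE))
```

A device that is missing from the table simply yields no result, which is the situation where a caller would move on to another detection method.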
When building packages, add in pciutils as a dependency because rocm_agent_enumerator uses this as a mechanism for looking up what GPUs exist on the system.
rocminfo is a very heavyweight mechanism for learning a lot of information about the GPUs that are attached to the system. It opens up the limited /dev/kfd resource to gather lots of information about each device, while rocm_agent_enumerator really only wants the gfx number of AMD devices attached to the system. To avoid this heavyweight lookup in most cases, this patch switches the order of tests. Rather than starting with rocminfo and then falling back to a poorly-maintained PCI ID list, this patch changes the agent enumerator to start by checking the PCI ID list (fast case) and then fall back to rocminfo (slow case) if the PCI ID list is out of date.
On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.
New versions of amdkfd include the gfx architecture version number for all GPUs surfaced in the HSA topology. This patch adds this as the preferred way for rocm_agent_enumerator to check for supported gfx architecture numbers. Kernels that are missing this feature will not have the value in the topology. rocm_agent_enumerator will fall back to checking against the PCI IDs in this case. If PCI IDs fail, we fall back to the heavyweight rocminfo method.
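A rough sketch of reading that value is shown below. It is not the patch's code. It assumes each properties file holds one `name value` pair per line, that `gfx_target_version` encodes major * 10000 + minor * 100 + stepping (so 90006 would decode to gfx906), and that nodes without a GPU report 0. The function names are hypothetical.

```python
import glob

TOPOLOGY_GLOB = "/sys/class/kfd/kfd/topology/nodes/*/properties"


def target_version_from_properties(text):
    """Return the integer after 'gfx_target_version', or None if absent."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "gfx_target_version":
            return int(parts[1])
    return None


def gfx_name(target_version):
    """Decode major*10000 + minor*100 + stepping (assumed encoding)."""
    major = target_version // 10000
    minor = (target_version // 100) % 100
    stepping = target_version % 100
    # The stepping is written in hex, e.g. stepping 10 becomes 'a'.
    return "gfx{}{}{:x}".format(major, minor, stepping)


def gfx_from_topology(pattern=TOPOLOGY_GLOB):
    """Collect unique gfx names from every readable topology node."""
    names = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                version = target_version_from_properties(f.read())
        except OSError:
            continue
        # None: kernel lacks the field; 0: assumed to mean a non-GPU node.
        if version:
            name = gfx_name(version)
            if name not in names:
                names.append(name)
    return names


print(gfx_from_topology())
```

On a kernel without the field, or a machine without KFD, the returned list is empty, which is the signal to try the PCI ID path.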
Glad you caught this -- I didn't know this had made it into KFD. None of my test systems had it when I wrote the other patches. I just pushed a further patch to this PR that uses the KFD topology as the primary desired method for finding gfx arch. Fallback to lspci, and then further fallback to rocminfo.
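The resulting order (KFD topology, then lspci, then rocminfo) amounts to a first-non-empty-result chain. A minimal sketch of that pattern follows. The three probe functions are hypothetical stand-ins that return canned values, not the real detection code.

```python
def first_nonempty(methods):
    """Try each (name, callable) in order; return the first non-empty result."""
    for name, method in methods:
        try:
            result = method()
        except Exception:
            # A failing probe is treated the same as finding nothing.
            result = []
        if result:
            return name, result
    return None, []


calls = []


def from_topology():
    calls.append("topology")
    return []  # pretend the kernel lacks gfx_target_version


def from_lspci():
    calls.append("lspci")
    return ["gfx906"]  # pretend the PCI ID table had a match


def from_rocminfo():
    calls.append("rocminfo")
    return ["gfx906"]  # slow path, should not be reached here


source, archs = first_nonempty([
    ("topology", from_topology),
    ("lspci", from_lspci),
    ("rocminfo", from_rocminfo),
])
print(source, archs, calls)
```

Because the chain stops at the first method that returns anything, the expensive rocminfo probe only runs when both cheaper methods come back empty.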
Unable to import to rocm-systems due to merge conflict
rocm_agent_enumerator currently calls rocminfo to find what gfx architectures are available on the current system. This is used by, for instance, compilers that want to query what to natively build for if they are not provided with a gfxarch target.
However, rocminfo is a very heavyweight method of getting the gfxarch. It queries a large amount of HSA topology information, and opens up /dev/kfd for various querying purposes. This can make builds slow, as each large, slow query to simply get the gfxarch takes a long time.
In addition, it's possible to do a large number of parallel builds (e.g. make -j, even when targeting the number of processors on large server systems). /dev/kfd has a limited number of concurrent users, meaning that it can quickly exhaust its resources. This can lead to incorrect compilations, because no gfxarch would be returned from rocminfo.
rocm_agent_enumerator is supposed to have a fallback path when rocminfo finds no GPUs. It uses lspci to find AMD GPU device numbers, then looks them up in a hard-coded table. However, this table is woefully out of date, and the call to lspci is broken anyway. So rocm_agent_enumerator would simply fail to return a gfxarch if rocminfo failed to return that gfxarch.
This patchset:
- Fixes the call to lspci so that it actually works.
- Adds lspci to the dependency list so that we don't end up shipping Docker containers that don't include proper tools.
- Tries lspci first, and only falls back to the heavyweight rocminfo if our PCI ID list falls out of date.