Add Iluvatar vendor support for verl hardware plugin #3
Code Review
This pull request introduces support for the Iluvatar GPU platform, including comprehensive documentation, installation guides, quick start scripts, and integration tests. It implements the PlatformIluvatar platform class and registers FSDP and Megatron engines optimized for Iluvatar. The review feedback highlights a critical issue in is_platform_available where the lack of a default SMI check could cause NVIDIA systems to be incorrectly detected as Iluvatar. A code suggestion is provided to cache and perform the ixsmi check by default to prevent false positives.
| def is_platform_available(self, use_smi_check: bool = False) -> bool:
|     """Determine if the current machine has Iluvatar hardware.
|
|     Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
|     on both Iluvatar and NVIDIA machines. The only reliable way to distinguish
|     them is the ixsmi command (Iluvatar's equivalent of nvidia-smi).
|
|     Detection logic:
|     1. If torch.cuda is not available at all → False
|     2. If use_smi_check=True → check if ixsmi exists and exits 0
|     3. If use_smi_check=False → True (assume CUDA = Iluvatar)
|
|     The use_smi_check=True path is used during first-time auto-detection.
|     In subsequent calls (runtime checks), use_smi_check=False is typical.
|     """
|     if not torch.cuda.is_available():
|         return False
|     if use_smi_check:
|         return self.check_smi_command("ixsmi")
|     return True
The current implementation of is_platform_available defaults use_smi_check to False. Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True on standard NVIDIA GPU systems as well. When verl auto-detects the platform, it calls is_platform_available() without arguments on all registered platforms. This will cause NVIDIA systems to be incorrectly detected as iluvatar if this platform is checked before NVIDIA (or if NVIDIA is the fallback).
To prevent false positives on NVIDIA systems, the SMI check (or another reliable hardware check) should be performed by default. To avoid the overhead of spawning a subprocess repeatedly, we can cache the result of the check.
| _is_available_cache: Optional[bool] = None
|
| def is_platform_available(self) -> bool:
|     """Determine if the current machine has Iluvatar hardware.
|
|     Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
|     on both Iluvatar and NVIDIA machines. To avoid false positives on NVIDIA
|     systems, we check for the presence of the `ixsmi` command and cache the result.
|     """
|     if PlatformIluvatar._is_available_cache is not None:
|         return PlatformIluvatar._is_available_cache
|     if not torch.cuda.is_available():
|         PlatformIluvatar._is_available_cache = False
|         return False
|     # Distinguish from NVIDIA by checking for the ixsmi command
|     has_smi = self.check_smi_command("ixsmi")
|     PlatformIluvatar._is_available_cache = has_smi
|     return has_smi
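The review relies on a `check_smi_command` helper whose body does not appear in this thread. As a rough illustration only, the sketch below shows one way an "exists and exits 0" check could be written, plus a process-wide cache so the subprocess is spawned at most once. The function bodies, the `timeout` parameter, and the `has_iluvatar_smi` name are assumptions for illustration, not verl's actual implementation.

```python
import shutil
import subprocess
from functools import lru_cache


def check_smi_command(cmd: str, timeout: float = 10.0) -> bool:
    """Hypothetical helper: True if `cmd` is found on PATH and exits with status 0."""
    path = shutil.which(cmd)
    if path is None:
        # Command is not installed (or not on PATH).
        return False
    try:
        result = subprocess.run(
            [path],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat "cannot execute" and "hung" the same as "not available".
        return False
    return result.returncode == 0


@lru_cache(maxsize=None)
def has_iluvatar_smi() -> bool:
    """Hypothetical cached wrapper: runs the ixsmi check once per process."""
    return check_smi_command("ixsmi")
```

`lru_cache` plays the same role here as the `_is_available_cache` class attribute in the suggestion: the first call pays for the subprocess, later calls return the stored result.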
| # ------------------------------------------------------------------
|
| def communication_backend_name(self) -> str:
|     return "nccl"
def communication_backend_name(self) -> str:
    return "flagcx" if os.getenv("USE_FLAGCX", "0").lower() in ["1", "true"] else "nccl"
to support FlagCX
FlagOS-related integration updates will be handled in a separate PR
| def is_available(self) -> bool:
|     return torch.cuda.is_available()
|
| def is_platform_available(self, use_smi_check: bool = False) -> bool:
Refer to
def is_platform_available(self, use_smi_check=False) -> bool:
    if not hasattr(torch, "cuda"):
        return False
    if use_smi_check:
        # In CPU-only Ray actors, torch.cuda.is_available() may return False
        # even though the cluster has GPUs. Fall back to nvidia-smi check,
        # and if that's also unavailable (e.g. not on PATH), treat
        # torch.cuda being built as sufficient evidence.
        cmd = "nvidia-smi"
        cmd_path = shutil.which(cmd)
        if cmd_path is None:
            # Fallback to common absolute paths if not found in PATH
            common_paths = [
                f"/usr/bin/{cmd}",
                f"/usr/local/bin/{cmd}",
                f"/usr/local/cuda/bin/{cmd}",
            ]
            for path in common_paths:
                if os.path.isfile(path) and os.access(path, os.X_OK):
                    cmd_path = path
                    break
        if cmd_path is None:
            return False
        if self.check_smi_command(cmd_path):
            return True
    return torch.cuda.is_available()
to ensure that ixsmi can be executed correctly.
We should run the ixsmi check before torch.cuda.is_available() when detecting the platform, because torch.cuda.is_available() returns False on Ray workers with num_gpus=0.
Done. is_platform_available() now checks ixsmi first (with use_smi_check=True) before relying on torch.cuda.is_available().
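The detection order discussed in this thread (SMI check first, `torch.cuda.is_available()` only as the fall-through) can be sketched as below. This mirrors the reference snippet quoted by the reviewer, adapted to `ixsmi`; it is not the merged Iluvatar code, which is not shown here. `find_smi_binary`, `is_iluvatar_available`, and the injected `run_smi` / `cuda_available` callables are hypothetical names, used so the sketch runs without torch or real hardware.

```python
import os
import shutil
from typing import Callable, Optional, Sequence


def find_smi_binary(cmd: str, fallback_dirs: Sequence[str]) -> Optional[str]:
    """Locate `cmd` on PATH first, then in a few fallback directories."""
    found = shutil.which(cmd)
    if found is not None:
        return found
    for directory in fallback_dirs:
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def is_iluvatar_available(
    use_smi_check: bool,
    run_smi: Callable[[str], bool],
    cuda_available: Callable[[], bool],
    smi_cmd: str = "ixsmi",
    fallback_dirs: Sequence[str] = ("/usr/bin", "/usr/local/bin"),
) -> bool:
    """Hypothetical detection routine: SMI check first, CUDA check as fall-through."""
    if use_smi_check:
        smi_path = find_smi_binary(smi_cmd, fallback_dirs)
        if smi_path is None:
            # No ixsmi binary anywhere: not an Iluvatar machine.
            return False
        if run_smi(smi_path):
            # ixsmi ran successfully, even if this process sees no GPU
            # (e.g. a Ray worker started with num_gpus=0).
            return True
    return cuda_available()
```

With `use_smi_check=True`, a worker that cannot see any device still reports the platform as available as long as `ixsmi` runs, which is the case the reviewer raised for Ray workers with `num_gpus=0`.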
| @@ -0,0 +1,42 @@
| # Iluvatar GPU User Guide
|
| Last updated: 06/16/2026.
Add docs for using the FlagOS software stack under docs/user_guide_flagos/Iluvatar/.
FlagOS-related documentation updates will be handled in a separate PR
| ## 1. Pull the Base Image
|
| ```bash
| docker pull harbor.baai.ac.cn/flagos21-base/iluvatarcorex-4.4.0-ubuntu24-py312-base:20260601v1
In addition to the Docker image, it would help to provide download options for key software such as Ray.
We documented community Ray setup in quick_start.md (RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 + ray_init.num_gpus)
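The two settings named in this reply could be applied roughly as follows when starting Ray by hand. This is a sketch, not the content of quick_start.md (which is not shown in this thread): `build_ray_init_kwargs` is a hypothetical helper, and it assumes that `ray_init.num_gpus` ends up as the `num_gpus` argument of `ray.init`.

```python
import os


def build_ray_init_kwargs(num_gpus: int) -> dict:
    """Hypothetical helper: prepare the environment and kwargs for ray.init()."""
    # Per Ray's own notice, setting this variable to 0 stops Ray from overriding
    # the accelerator visibility variables (e.g. CUDA_VISIBLE_DEVICES) for tasks
    # that request num_gpus=0. It has to be set before Ray starts its workers.
    os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0"
    # Declare the GPU count explicitly instead of relying on auto-detection.
    return {"num_gpus": num_gpus}


ray_kwargs = build_ray_init_kwargs(num_gpus=8)  # 8 is an arbitrary example value
# With Ray installed, the kwargs would then be passed on:
#   import ray
#   ray.init(**ray_kwargs)
```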
| cd /root
|
| # Install verl (ver > v0.8.0, #6086)
| pip install -v -e "git+https://github.com/verl-project/verl.git@main#egg=verl" --no-build-isolation
Please pin a specific commit ID so that later users can reproduce the installation.
Addressed. install_guidance.md now pins verl to commit ed89419c23653730e95c43954c00e6c24277e1c8 instead of @main.
Scope of this PR
Out of scope (planned follow-up): FlagOS-related documentation and integration updates will be handled in a separate PR to keep this change focused on the Iluvatar vendor contribution.
(doc) update training script and add run logs in quick_start
Please check the format.
Done. Pre-commit checks pass locally after fixing the line-length issue in the test file.
| Or use the verl official image, see [verl installation docs](https://verl.readthedocs.io/en/latest/start/install.html).
|
| Start a container (NVIDIA example):
Fixed the container example label in install_guidance.md.
| ## Introduction
|
| This document describes how to use verl for reinforcement learning training on Iluvatar GPUs.
Update the supported lists in the top-level README.md.
Docs updated: Iluvatar added to the README.
…mmunity Ray setup
Summary
Add Iluvatar vendor support for verl hardware plugin
Motivation
Iluvatar GPUs are CUDA-compatible but were not registered in verl’s hardware plugin system, so training could not resolve vendor-specific platform and engine bindings. This PR adds Iluvatar platform, FSDP/Megatron engines, and user documentation as a reference integration for Iluvatar-based deployments.
Changes
Testing
pytest tests/ -v passes
Checklist
pre-commit checks