Add Iluvatar vendor support for verl hardware plugin by DannyP0 · Pull Request #3 · verl-project/verl-hardware-plugin

DannyP0 · 2026-06-16T07:47:42Z

Summary

Add Iluvatar vendor support for verl hardware plugin

Motivation

Iluvatar GPUs are CUDA-compatible but were not registered in verl’s hardware plugin system, so training could not resolve vendor-specific platform and engine bindings. This PR adds Iluvatar platform, FSDP/Megatron engines, and user documentation as a reference integration for Iluvatar-based deployments.

Changes

Add PlatformIluvatar (platform_cuda_iluvatar.py) for Iluvatar GPUs
Register FSDP and Megatron engines for (cuda, iluvatar)
Add docs/user_guide_iluvatar with install and usage guidance
Wire platform/engine registration in __init__.py modules
Extend test_plugin_registration.py for Iluvatar coverage

Testing

pytest tests/ -v passes
Manually verified on target hardware (if applicable)

Checklist

Code follows the project's style and passes pre-commit checks
Documentation updated (if applicable)
No secrets or credentials included

gemini-code-assist

Code Review

This pull request introduces support for the Iluvatar GPU platform, including comprehensive documentation, installation guides, quick start scripts, and integration tests. It implements the PlatformIluvatar platform class and registers FSDP and Megatron engines optimized for Iluvatar. The review feedback highlights a critical issue in is_platform_available where the lack of a default SMI check could cause NVIDIA systems to be incorrectly detected as Iluvatar. A code suggestion is provided to cache and perform the ixsmi check by default to prevent false positives.

gemini-code-assist · 2026-06-16T07:48:51Z

+    def is_platform_available(self, use_smi_check: bool = False) -> bool:
+        """Determine if the current machine has Iluvatar hardware.
+
+        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
+        on both Iluvatar and NVIDIA machines. The only reliable way to distinguish
+        them is the ixsmi command (Iluvatar's equivalent of nvidia-smi).
+
+        Detection logic:
+        1. If torch.cuda is not available at all → False
+        2. If use_smi_check=True → check if ixsmi exists and exits 0
+        3. If use_smi_check=False → True (assume CUDA = Iluvatar)
+
+        The use_smi_check=True path is used during first-time auto-detection.
+        In subsequent calls (runtime checks), use_smi_check=False is typical.
+        """
+        if not torch.cuda.is_available():
+            return False
+        if use_smi_check:
+            return self.check_smi_command("ixsmi")
+        return True


The current implementation of is_platform_available defaults use_smi_check to False. Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True on standard NVIDIA GPU systems as well. When verl auto-detects the platform, it calls is_platform_available() without arguments on all registered platforms. This will cause NVIDIA systems to be incorrectly detected as iluvatar if this platform is checked before NVIDIA (or if NVIDIA is the fallback).

To prevent false positives on NVIDIA systems, the SMI check (or another reliable hardware check) should be performed by default. To avoid the overhead of spawning a subprocess repeatedly, we can cache the result of the check.

Suggested change

-    def is_platform_available(self, use_smi_check: bool = False) -> bool:
-        """Determine if the current machine has Iluvatar hardware.
-
-        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
-        on both Iluvatar and NVIDIA machines. The only reliable way to distinguish
-        them is the ixsmi command (Iluvatar's equivalent of nvidia-smi).
-
-        Detection logic:
-        1. If torch.cuda is not available at all → False
-        2. If use_smi_check=True → check if ixsmi exists and exits 0
-        3. If use_smi_check=False → True (assume CUDA = Iluvatar)
-
-        The use_smi_check=True path is used during first-time auto-detection.
-        In subsequent calls (runtime checks), use_smi_check=False is typical.
-        """
-        if not torch.cuda.is_available():
-            return False
-        if use_smi_check:
-            return self.check_smi_command("ixsmi")
-        return True
+    _is_available_cache: Optional[bool] = None
+
+    def is_platform_available(self) -> bool:
+        """Determine if the current machine has Iluvatar hardware.
+
+        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
+        on both Iluvatar and NVIDIA machines. To avoid false positives on NVIDIA
+        systems, we check for the presence of the `ixsmi` command and cache the result.
+        """
+        if PlatformIluvatar._is_available_cache is not None:
+            return PlatformIluvatar._is_available_cache
+
+        if not torch.cuda.is_available():
+            PlatformIluvatar._is_available_cache = False
+            return False
+
+        # Distinguish from NVIDIA by checking for the ixsmi command
+        has_smi = self.check_smi_command("ixsmi")
+        PlatformIluvatar._is_available_cache = has_smi
+        return has_smi
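
For reference, the cached probe described in this comment can be sketched without any verl dependency. The helper name `smi_command_ok` and its `timeout` parameter are hypothetical and are not the plugin's `check_smi_command`; the sketch only assumes that the tool is on PATH and exits 0 when the hardware is present.

```python
import shutil
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def smi_command_ok(cmd: str, timeout: float = 10.0) -> bool:
    """Return True if `cmd` is found on PATH and exits with status 0.

    Hypothetical stand-in for check_smi_command(); the result is cached
    per argument tuple, so the subprocess is spawned at most once.
    """
    path = shutil.which(cmd)
    if path is None:
        # Tool not installed: treat the platform as absent.
        return False
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

A negative result is cached as well, so a caller that installs the driver tools mid-process would need `smi_command_ok.cache_clear()` to probe again.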

heavyrain-lzy · 2026-06-16T08:07:25Z

+    # ------------------------------------------------------------------
+
+    def communication_backend_name(self) -> str:
+        return "nccl"


    def communication_backend_name(self) -> str:
        return "flagcx" if os.getenv("USE_FLAGCX", "0").lower() in ["1", "true"] else "nccl"

to support FlagCX
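
The behaviour of that one-liner can be checked in isolation. The sketch below lifts the same expression into a free function with an injectable environment mapping; the `env` parameter is an assumption added for testability and is not part of the plugin interface.

```python
import os
from typing import Mapping, Optional


def communication_backend_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the collective backend name from the USE_FLAGCX variable.

    `env` is a hypothetical parameter; when omitted, os.environ is used,
    which matches the os.getenv() call in the suggestion above.
    """
    env = os.environ if env is None else env
    use_flagcx = env.get("USE_FLAGCX", "0").lower() in ("1", "true")
    return "flagcx" if use_flagcx else "nccl"
```

Only `1` and `true` (case-insensitive) select FlagCX; any other value, including `yes`, falls back to `nccl`.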

FlagOS-related integration updates will be handled in a separate PR

heavyrain-lzy · 2026-06-16T08:10:48Z

+    def is_available(self) -> bool:
+        return torch.cuda.is_available()
+
+    def is_platform_available(self, use_smi_check: bool = False) -> bool:


Refer to

    def is_platform_available(self, use_smi_check=False) -> bool:
        if not hasattr(torch, "cuda"):
            return False
        if use_smi_check:
            # In CPU-only Ray actors, torch.cuda.is_available() may return False
            # even though the cluster has GPUs. Fall back to nvidia-smi check,
            # and if that's also unavailable (e.g. not on PATH), treat
            # torch.cuda being built as sufficient evidence.
            cmd = "nvidia-smi"
            cmd_path = shutil.which(cmd)
            if cmd_path is None:
                # Fallback to common absolute paths if not found in PATH
                common_paths = [
                    f"/usr/bin/{cmd}",
                    f"/usr/local/bin/{cmd}",
                    f"/usr/local/cuda/bin/{cmd}",
                ]
                for path in common_paths:
                    if os.path.isfile(path) and os.access(path, os.X_OK):
                        cmd_path = path
                        break
            if cmd_path is None:
                return False
            if self.check_smi_command(cmd_path):
                return True
        return torch.cuda.is_available()

to ensure that we can execute ixsmi correctly.
We should run ixsmi before torch.cuda.is_available() to detect the platform, because torch.cuda.is_available() returns False on a Ray worker with num_gpus=0.

Done — is_platform_available() now checks ixsmi first (with use_smi_check=True) before relying on torch.cuda.is_available()
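
The ordering discussed in this thread can be illustrated with injectable probes. This is a minimal sketch that mirrors the control flow of the referenced snippet, not the code that was merged; `detect_platform` and its three probe parameters are hypothetical names.

```python
from typing import Callable, Optional


def detect_platform(
    smi_path: Optional[str],
    smi_ok: Callable[[str], bool],
    cuda_available: Callable[[], bool],
    use_smi_check: bool = False,
) -> bool:
    """Probe the vendor SMI tool first, then fall back to the CUDA check.

    smi_path:       resolved path of the SMI binary, or None if not found
    smi_ok:         runs the binary and reports whether it exited 0
    cuda_available: stand-in for torch.cuda.is_available()
    """
    if use_smi_check:
        if smi_path is None:
            # No vendor tool on this machine: not this platform.
            return False
        if smi_ok(smi_path):
            # Hardware confirmed even if CUDA is hidden from this process.
            return True
    return cuda_available()
```

With this ordering, a Ray worker started with `num_gpus=0` is still detected when ixsmi succeeds, and a machine without ixsmi is rejected before the CUDA check can produce a false positive.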

heavyrain-lzy · 2026-06-16T08:25:13Z

@@ -0,0 +1,42 @@
+# Iluvatar GPU User Guide
+
+Last updated: 06/16/2026.


Add the docs using FlagOS software stack in docs/user_guide_flagos/Iluvatar/

FlagOS-related documentation updates will be handled in a separate PR

heavyrain-lzy · 2026-06-16T08:27:36Z

+## 1. Pull the Base Image
+
+```bash
+docker pull harbor.baai.ac.cn/flagos21-base/iluvatarcorex-4.4.0-ubuntu24-py312-base:20260601v1 


In addition to the image, we should provide download options for key software such as Ray.

We documented community Ray setup in quick_start.md (RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 + ray_init.num_gpus)
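
As a rough illustration of that setup, the sketch below sets the environment variable before Ray starts and passes a GPU count to `ray.init`. The helper names are hypothetical, and how quick_start.md feeds `ray_init.num_gpus` into Ray is an assumption; only the variable name and `ray.init(num_gpus=...)` are taken as given.

```python
import os
from typing import MutableMapping, Optional


def prepare_ray_env(
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 unless the user already set it.

    Hypothetical helper; it must run before Ray is initialised so that
    worker processes inherit the setting.
    """
    env = os.environ if env is None else env
    env.setdefault("RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO", "0")
    return env


def init_ray(num_gpus: int):
    """Start Ray with an explicit GPU count (assumed to come from ray_init.num_gpus)."""
    prepare_ray_env()
    import ray  # imported lazily so prepare_ray_env() is usable without Ray installed

    return ray.init(num_gpus=num_gpus)
```

`setdefault` leaves an existing value untouched, so an explicit setting in the launch script still wins.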

heavyrain-lzy · 2026-06-16T08:36:38Z

+cd /root
+
+# Install verl (ver > v0.8.0, #6086)
+pip install -v -e "git+https://github.com/verl-project/verl.git@main#egg=verl" --no-build-isolation


Please pin a specific commit ID so that later users can reproduce the setup.

Addressed. install_guidance.md now pins verl to commit ed89419c23653730e95c43954c00e6c24277e1c8 instead of @main.

DannyP0 · 2026-06-24T08:46:48Z

Scope of this PR

Iluvatar vendor support only: platform plugin, FSDP/Megatron engines, docs, and registration tests.
Review fixes: ixsmi-based platform detection and community Ray setup in quick_start.

Out of scope (planned follow-up)

FlagOS-related documentation and integration updates will be handled in a separate PR to keep this change focused on the Iluvatar vendor contribution.
Please take another look. Thanks!

DannyP0 · 2026-06-25T08:07:06Z

(doc) update training script and add run logs in quick_start
Training log: https://swanlab.cn/@dannyp/verl_grpo_gsm8k_math/runs/qy00qayu/chart

heavyrain-lzy · 2026-06-25T11:56:36Z

Please check the format.

pre-commit install
and check the files

DannyP0 · 2026-06-29T06:27:41Z

Done — pre-commit checks pass locally after fixing the line-length issue in the test file.

heavyrain-lzy · 2026-06-30T02:35:48Z

+
+Or use the verl official image, see [verl installation docs](https://verl.readthedocs.io/en/latest/start/install.html).
+
+Start a container (NVIDIA example):


Why NVIDIA?

Fixed the container example label in install_guidance.md.

heavyrain-lzy · 2026-06-30T02:39:14Z

+## Introduction
+
+This document describes how to use verl for reinforcement learning training on Iluvatar GPUs.
+


Please update the supported list in the top-level README.md.

Docs updated — Iluvatar added to README

DannyP0 requested review from heavyrain-lzy and physics31415926 as code owners June 16, 2026 07:47

gemini-code-assist Bot reviewed Jun 16, 2026

heavyrain-lzy reviewed Jun 16, 2026

This was referenced Jun 17, 2026

[MLU] feat: add mlu support #1 (Merged)
feat(enflame): add GCU platform, engines, and runtime shims for verl 0.9 #6 (Open)

DannyP0 force-pushed the main branch from e985ff0 to 62addf0 on June 24, 2026 08:41

heavyrain-lzy reviewed Jun 30, 2026

DannyP0 added 5 commits June 30, 2026 17:29

3be2bb7  Add Iluvatar vendor support for verl hardware plugin
85243d3  [iluvatar] Address review: improve Iluvatar detection and document community Ray setup
7017aa0  [iluvatar] update training script and add run logs in quick_start
37df55e  [iluvatar] fix line length in megatron iluvatar registration test
832ba62  [iluvatar] document Iluvatar in README and fix install guide example title

DannyP0 force-pushed the main branch from bd37a48 to 832ba62 on June 30, 2026 09:41
