Add Iluvatar vendor support for verl hardware plugin #3

Open
DannyP0 wants to merge 5 commits into verl-project:main from iLuvPeNg:main

DannyP0 commented Jun 16, 2026

Summary

Add Iluvatar vendor support for verl hardware plugin

Motivation

Iluvatar GPUs are CUDA-compatible but were not registered in verl’s hardware plugin system, so training could not resolve vendor-specific platform and engine bindings. This PR adds Iluvatar platform, FSDP/Megatron engines, and user documentation as a reference integration for Iluvatar-based deployments.

Changes

  • Add PlatformIluvatar (platform_cuda_iluvatar.py) for Iluvatar GPUs
  • Register FSDP and Megatron engines for (cuda, iluvatar)
  • Add docs/user_guide_iluvatar with install and usage guidance
  • Wire platform/engine registration in __init__.py modules
  • Extend test_plugin_registration.py for Iluvatar coverage

Testing

  • pytest tests/ -v passes
  • Manually verified on target hardware (if applicable)

Checklist

  • Code follows the project's style and passes pre-commit checks
  • Documentation updated (if applicable)
  • No secrets or credentials included

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for the Iluvatar GPU platform, including comprehensive documentation, installation guides, quick start scripts, and integration tests. It implements the PlatformIluvatar platform class and registers FSDP and Megatron engines optimized for Iluvatar. The review feedback highlights a critical issue in is_platform_available where the lack of a default SMI check could cause NVIDIA systems to be incorrectly detected as Iluvatar. A code suggestion is provided to cache and perform the ixsmi check by default to prevent false positives.

Comment on lines +72 to +91
    def is_platform_available(self, use_smi_check: bool = False) -> bool:
        """Determine if the current machine has Iluvatar hardware.

        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
        on both Iluvatar and NVIDIA machines. The only reliable way to distinguish
        them is the ixsmi command (Iluvatar's equivalent of nvidia-smi).

        Detection logic:
        1. If torch.cuda is not available at all → False
        2. If use_smi_check=True → check if ixsmi exists and exits 0
        3. If use_smi_check=False → True (assume CUDA = Iluvatar)

        The use_smi_check=True path is used during first-time auto-detection.
        In subsequent calls (runtime checks), use_smi_check=False is typical.
        """
        if not torch.cuda.is_available():
            return False
        if use_smi_check:
            return self.check_smi_command("ixsmi")
        return True

Contributor

high

The current implementation of is_platform_available defaults use_smi_check to False. Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True on standard NVIDIA GPU systems as well. When verl auto-detects the platform, it calls is_platform_available() without arguments on all registered platforms. This will cause NVIDIA systems to be incorrectly detected as iluvatar if this platform is checked before NVIDIA (or if NVIDIA is the fallback).

To prevent false positives on NVIDIA systems, the SMI check (or another reliable hardware check) should be performed by default. To avoid the overhead of spawning a subprocess repeatedly, we can cache the result of the check.

Suggested change
-    def is_platform_available(self, use_smi_check: bool = False) -> bool:
-        """Determine if the current machine has Iluvatar hardware.
-
-        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
-        on both Iluvatar and NVIDIA machines. The only reliable way to distinguish
-        them is the ixsmi command (Iluvatar's equivalent of nvidia-smi).
-
-        Detection logic:
-        1. If torch.cuda is not available at all → False
-        2. If use_smi_check=True → check if ixsmi exists and exits 0
-        3. If use_smi_check=False → True (assume CUDA = Iluvatar)
-
-        The use_smi_check=True path is used during first-time auto-detection.
-        In subsequent calls (runtime checks), use_smi_check=False is typical.
-        """
-        if not torch.cuda.is_available():
-            return False
-        if use_smi_check:
-            return self.check_smi_command("ixsmi")
-        return True
+    _is_available_cache: Optional[bool] = None
+
+    def is_platform_available(self) -> bool:
+        """Determine if the current machine has Iluvatar hardware.
+
+        Since Iluvatar is CUDA-compatible, torch.cuda.is_available() returns True
+        on both Iluvatar and NVIDIA machines. To avoid false positives on NVIDIA
+        systems, we check for the presence of the `ixsmi` command and cache the result.
+        """
+        if PlatformIluvatar._is_available_cache is not None:
+            return PlatformIluvatar._is_available_cache
+        if not torch.cuda.is_available():
+            PlatformIluvatar._is_available_cache = False
+            return False
+        # Distinguish from NVIDIA by checking for the ixsmi command
+        has_smi = self.check_smi_command("ixsmi")
+        PlatformIluvatar._is_available_cache = has_smi
+        return has_smi
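
To make the false positive described in the review comment concrete, here is a toy sketch of first-match auto-detection. It is not verl's actual detection loop, which is not shown in this thread: `detect_platform`, the platform names used as dictionary keys, and the two simulated flags are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def detect_platform(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the name of the first platform whose availability check passes.

    Hypothetical stand-in for a registry that probes platforms in order.
    """
    for name, is_available in checks.items():
        if is_available():
            return name
    return None


# Simulated NVIDIA machine: CUDA is usable, but the ixsmi command is absent.
cuda_available = True
has_ixsmi = False

# Without an SMI check, the Iluvatar probe answers True whenever CUDA works.
before = detect_platform({
    "iluvatar": lambda: cuda_available,
    "nvidia": lambda: cuda_available,
})

# With the SMI check, the Iluvatar probe also requires ixsmi to be present.
after = detect_platform({
    "iluvatar": lambda: cuda_available and has_ixsmi,
    "nvidia": lambda: cuda_available,
})

print(before, after)
```

In this simulation the machine is claimed by the Iluvatar entry only when the probe ignores `ixsmi`; once the probe requires it, detection falls through to the next registered platform.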

    # ------------------------------------------------------------------

    def communication_backend_name(self) -> str:
        return "nccl"

Collaborator

    def communication_backend_name(self) -> str:
        return "flagcx" if os.getenv("USE_FLAGCX", "0").lower() in ["1", "true"] else "nccl"

This change is needed to support FlagCX.
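
The environment-variable toggle suggested here can be tried in isolation. The following is a minimal standalone sketch, assuming the `USE_FLAGCX` variable and the `"flagcx"` and `"nccl"` backend names from the reviewer's snippet; the `env` parameter and the extra `.strip()` are additions for testability and are not part of the suggestion.

```python
import os
from typing import Mapping, Optional

# Values treated as "enabled", matching the reviewer's ["1", "true"] list.
_TRUTHY = {"1", "true"}


def communication_backend_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the collective backend name from USE_FLAGCX (default: nccl)."""
    # Allow a mapping to be injected so the function can be tested
    # without mutating the real process environment.
    env = os.environ if env is None else env
    use_flagcx = env.get("USE_FLAGCX", "0").strip().lower() in _TRUTHY
    return "flagcx" if use_flagcx else "nccl"


print(communication_backend_name({"USE_FLAGCX": "true"}))
print(communication_backend_name({}))
```

Reading the variable on every call keeps the behavior easy to override in tests, at the cost of a dictionary lookup per call.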

Author

FlagOS-related integration updates will be handled in a separate PR

    def is_available(self) -> bool:
        return torch.cuda.is_available()

    def is_platform_available(self, use_smi_check: bool = False) -> bool:

Collaborator

Refer to this existing nvidia-smi-based check:

    def is_platform_available(self, use_smi_check=False) -> bool:
        if not hasattr(torch, "cuda"):
            return False
        if use_smi_check:
            # In CPU-only Ray actors, torch.cuda.is_available() may return False
            # even though the cluster has GPUs. Fall back to nvidia-smi check,
            # and if that's also unavailable (e.g. not on PATH), treat
            # torch.cuda being built as sufficient evidence.
            cmd = "nvidia-smi"
            cmd_path = shutil.which(cmd)
            if cmd_path is None:
                # Fallback to common absolute paths if not found in PATH
                common_paths = [
                    f"/usr/bin/{cmd}",
                    f"/usr/local/bin/{cmd}",
                    f"/usr/local/cuda/bin/{cmd}",
                ]
                for path in common_paths:
                    if os.path.isfile(path) and os.access(path, os.X_OK):
                        cmd_path = path
                        break
                if cmd_path is None:
                    return False
            if self.check_smi_command(cmd_path):
                return True
        return torch.cuda.is_available()

so that ixsmi can be located and executed correctly.
ixsmi should be checked before torch.cuda.is_available() when detecting the platform, because torch.cuda.is_available() returns False on a Ray worker with num_gpus=0.
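
An ixsmi-first check along these lines could look like the sketch below. It mirrors the control flow of the nvidia-smi snippet quoted above but is not the code that was merged: `find_command`, `command_succeeds`, `is_iluvatar_available`, and the fallback directories are hypothetical, and the CUDA probe is passed in as a callable so the example runs without torch.

```python
import os
import shutil
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical fallback locations; the real install prefix of ixsmi may differ.
FALLBACK_DIRS: Sequence[str] = ("/usr/bin", "/usr/local/bin")


def find_command(
    cmd: str,
    which: Callable[[str], Optional[str]] = shutil.which,
    fallback_dirs: Sequence[str] = FALLBACK_DIRS,
) -> Optional[str]:
    """Locate cmd on PATH first, then in a few fallback directories."""
    path = which(cmd)
    if path is not None:
        return path
    for directory in fallback_dirs:
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def command_succeeds(path: str, timeout: float = 10.0) -> bool:
    """Run the command with no arguments and report whether it exited 0."""
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def is_iluvatar_available(
    cuda_is_available: Callable[[], bool],
    find: Callable[[str], Optional[str]] = find_command,
    run: Callable[[str], bool] = command_succeeds,
) -> bool:
    """Consult ixsmi before CUDA, so CPU-only Ray workers still detect the vendor."""
    path = find("ixsmi")
    if path is None:
        # No ixsmi anywhere: treat the machine as non-Iluvatar (e.g. NVIDIA).
        return False
    if run(path):
        # ixsmi works, even if CUDA is hidden from this process.
        return True
    # ixsmi exists but failed: fall back to the CUDA probe, as the reference does.
    return cuda_is_available()
```

In real use the first argument would be `torch.cuda.is_available`; injecting it, together with `find` and `run`, keeps the decision logic testable on machines without the hardware.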

DannyP0 commented Jun 29, 2026

Author

Done — is_platform_available() now checks ixsmi first (with use_smi_check=True) before relying on torch.cuda.is_available()

@@ -0,0 +1,42 @@
# Iluvatar GPU User Guide

Last updated: 06/16/2026.

Collaborator

Add docs for the FlagOS software stack under docs/user_guide_flagos/Iluvatar/.

Author

FlagOS-related documentation updates will be handled in a separate PR

## 1. Pull the Base Image

```bash
docker pull harbor.baai.ac.cn/flagos21-base/iluvatarcorex-4.4.0-ubuntu24-py312-base:20260601v1
```

Collaborator

In addition to the Docker image, please also provide installation options for key software such as Ray.

Author

We documented community Ray setup in quick_start.md (RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 + ray_init.num_gpus)

cd /root

# Install verl (ver > v0.8.0, #6086)
pip install -v -e "git+https://github.com/verl-project/verl.git@main#egg=verl" --no-build-isolation

Collaborator

It would be best to pin a specific commit ID so that later users can reproduce the setup.

Author

Addressed. install_guidance.md now pins verl to commit ed89419c23653730e95c43954c00e6c24277e1c8 instead of @main.

DannyP0 commented Jun 24, 2026

Author

Scope of this PR

  • Iluvatar vendor support only: platform plugin, FSDP/Megatron engines, docs, and registration tests.
  • Review fixes: ixsmi-based platform detection and community Ray setup in quick_start.

Out of scope (planned follow-up)

FlagOS-related documentation and integration updates will be handled in a separate PR to keep this change focused on the Iluvatar vendor contribution.
Please take another look. Thanks!

DannyP0 commented Jun 25, 2026

Author

(doc) update training script and add run logs in quick_start
Training log: https://swanlab.cn/@dannyp/verl_grpo_gsm8k_math/runs/qy00qayu/chart

heavyrain-lzy commented

Collaborator

Please check the formatting. Run

    pre-commit install

and then check the files.

DannyP0 commented Jun 29, 2026

Author

> Please check the format.
>
> pre-commit install
> and check the files

Done — pre-commit checks pass locally after fixing the line-length issue in the test file.


Or use the verl official image, see [verl installation docs](https://verl.readthedocs.io/en/latest/start/install.html).

Start a container (NVIDIA example):

Collaborator

Why NVIDIA?

Author

Fixed the container example label in install_guidance.md.

## Introduction

This document describes how to use verl for reinforcement learning training on Iluvatar GPUs.

Collaborator

Update the supported list in the top-level README.md.

Author

Docs updated — Iluvatar added to README
