[trainer, fully_async] Adapt vLLM 0.19+ for Ascend NPU by fh188 · Pull Request #6881 · verl-project/verl

fh188 · 2026-06-29T07:00:54Z

[trainer, fully_async] Adapt vLLM 0.19+ for Ascend NPU

What does this PR do?

This PR mainly addresses compatibility issues with vLLM 0.19.0 and later in NPU scenarios.

The main changes include:

- For vLLM >= 0.19.0, continue disabling flash attention in RotaryEmbedding to avoid compatibility issues on NPU.
- Wrap FusedMoE.weight_loader to keep the MoE weight loading logic compatible with newer vLLM versions.
- Add a new create_tcp_store helper to create TCPStore in a unified way, with support for passing an existing listen_socket.
  - When listen_socket is provided, use detach() to transfer the socket file descriptor to TCPStore, and correctly close the fd on exceptions to avoid resource leaks.
- Update the StatelessProcessGroup creation logic based on the vLLM version:
  - For vLLM >= 0.19.0: create the store through the new create_tcp_store helper.
  - For vLLM < 0.19.0: keep the original logic and continue passing the socket.
- Add the VLLM_ASCEND_AUTO_DETECT_QUANTIZATION=0 environment variable to disable vLLM Ascend automatic quantization detection, avoiding potential misdetection or compatibility issues.
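
The version-dependent branch described above could be gated roughly as follows. This is a minimal sketch, not the PR's code: version_tuple and store_strategy are hypothetical names, and the returned strings only label the two paths. It compares numeric components instead of raw strings, because a plain string comparison would sort "0.9.2" after "0.19.0".

```python
import re


def version_tuple(version: str) -> tuple:
    """Extract (major, minor, patch) from the start of a version string.

    Assumption: suffixes such as "rc1" or ".dev3" are ignored, so a
    pre-release of 0.19.0 is treated like 0.19.0 itself. That differs
    from strict PEP 440 ordering.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def store_strategy(vllm_version: str) -> str:
    """Label which store-creation path applies to a given vLLM version."""
    if version_tuple(vllm_version) >= (0, 19, 0):
        return "create_tcp_store"  # newer path: build the store via the helper
    return "pass_socket"  # older path: keep passing the socket as before
```

In real code the version string would come from the installed vLLM package; it is passed in explicitly here so the sketch runs without vLLM.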

Overall, this PR improves the startup, communication, and weight-loading compatibility of verl with vLLM 0.19.0+ in Ascend NPU scenarios.

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
- If this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
- {type} is one of feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

fix fully async bug

CLAassistant · 2026-06-29T07:01:02Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request introduces compatibility updates for vLLM version 0.19.0 and above, including conditional handling of StatelessProcessGroup and TCPStore creation in verl/utils/distributed.py, and applying the NPU rotary embedding patch for newer vLLM versions in verl/utils/vllm/npu_vllm_patch.py. Additionally, it sets the ASCEND_RT_VISIBLE_DEVICES environment variable when launching servers in vllm_async_server.py. Feedback on the changes points out a bug in create_tcp_store where calling socket.close(listen_fd) will raise an AttributeError, suggesting the use of os.close(listen_fd) instead to prevent file descriptor leaks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-29T07:02:21Z

+    except Exception:
+        socket.close(listen_fd)
+        raise


Using 'socket.close(listen_fd)' will raise an 'AttributeError' because the 'socket' module does not have a 'close' function. Since 'listen_fd' is a raw integer file descriptor obtained via 'listen_socket.detach()', you should use 'os.close(listen_fd)' to properly close it and avoid leaking the file descriptor.

Suggested change

-    except Exception:
-        socket.close(listen_fd)
-        raise
+    except Exception:
+        os.close(listen_fd)
+        raise
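
The cleanup pattern under review can be sketched with the standard library alone. This is a hedged illustration, not the code from this PR: hand_off_listen_socket and make_store are hypothetical names, and make_store stands in for whatever constructs the TCPStore, which is not shown on this page. The sketch uses os.close as suggested above. Note that Python 3.7+ also documents a module-level socket.close(fd), so whether the original line actually raises depends on how socket is imported in that file.

```python
import os
import socket


def hand_off_listen_socket(listen_socket, make_store):
    """Detach the fd from listen_socket and hand it to make_store.

    make_store is a hypothetical callable that receives the raw integer
    fd and returns the store object. If it raises, the fd is closed here,
    because after detach() no socket object owns it any more.
    """
    listen_fd = listen_socket.detach()  # listen_socket.fileno() is now -1
    try:
        return make_store(listen_fd)
    except Exception:
        os.close(listen_fd)  # raw integer fd, so close it directly
        raise


def fd_is_open(fd: int) -> bool:
    """Report whether fd currently refers to an open descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False
```

On the success path the fd belongs to whatever make_store returned, so the helper must not close it. os.close is not reliable for socket descriptors on Windows; that is outside the scope of this sketch.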

wuxibin89 · 2026-06-29T07:38:11Z

Submit PR to main first then cherry-pick to release branch.

fix fully async bug

3c6f4d6

fh188 requested review from ArronHZG, PeterSH6, chenhaiq and wuxibin89 as code owners June 29, 2026 07:00

gemini-code-assist Bot reviewed Jun 29, 2026

fh188 changed the title from "fix fully async bug" to "[trainer, fully_async] Adapt vLLM 0.19+ for Ascend NPU" Jun 29, 2026

wucong25 added the Ascend label Jun 29, 2026

fh188 wants to merge 1 commit into verl-project:release/v0.8.0 from fh188:release/v0.8.0

4 participants
