
chore: bump vllm to 0.15.1#7867

Open
hjh0119 wants to merge 5 commits into modelscope:main from hjh0119:vllm014

Conversation

@hjh0119 (Collaborator) commented Jan 22, 2026

No description provided.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the swift project to ensure compatibility with vLLM version 0.14. The primary focus is on enhancing the WeightSyncWorkerExtension to manage distributed weight synchronization more effectively. This involves a significant refactoring of the class, introducing explicit support for various hardware accelerators like torch.xpu and vllm_ascend, and improving overall vLLM version compatibility for critical utilities. These changes aim to bolster the stability and flexibility of distributed inference pipelines within the system.

Highlights

  • WeightSyncWorkerExtension Refactor: The WeightSyncWorkerExtension class has been completely refactored to remove its dependency on trl.scripts.vllm_serve.WeightSyncWorkerExtension and introduce a more robust, self-contained implementation for weight synchronization.
  • Expanded Device and vLLM Version Compatibility: Explicit support for torch.xpu (Intel XPU) and vllm_ascend (Huawei Ascend NPU) devices has been added within the communicator initialization. The get_open_port utility now includes a version check to ensure compatibility with vLLM versions 0.11.0 and above.
  • Improved Communicator Management: The new WeightSyncWorkerExtension includes dedicated init_communicator and close_communicator methods, providing clearer control over the lifecycle of the stateless process group used for weight updates and ensuring proper resource cleanup (see the sketch after this list).
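
To make that lifecycle concrete, below is a minimal, illustrative sketch of what an init_communicator/close_communicator pair can look like. It is not the code from this pull request: it assumes the StatelessProcessGroup and PyNcclCommunicator classes that vLLM exposes for stateless weight broadcasts (the same pair used by trl's vllm_serve), and the class name WeightSyncSketch as well as the host, port, rank, world_size, and device arguments are placeholders.

```python
# Hedged sketch of the communicator lifecycle described above; not the PR's implementation.
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
from vllm.distributed.utils import StatelessProcessGroup


class WeightSyncSketch:
    communicator = None  # set by init_communicator, cleared by close_communicator

    def init_communicator(self, host: str, port: int, rank: int, world_size: int, device: str) -> None:
        # A stateless group keeps weight broadcasts independent of vLLM's own
        # tensor-parallel process groups.
        pg = StatelessProcessGroup.create(host=host, port=port, rank=rank, world_size=world_size)
        self.communicator = PyNcclCommunicator(pg, device=device)

    def close_communicator(self) -> None:
        # Release the communicator so repeated weight syncs do not leak resources.
        if self.communicator is not None:
            del self.communicator
            self.communicator = None
```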


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
---------------------|---------------------|-------------------------------------------------------------
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.
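
For example, commenting /gemini review on the pull request, as the author does later in this conversation, asks the bot to re-run its code review against the current state of the branch.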

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to add compatibility with vLLM 0.14 by updating swift/pipelines/infer/rollout.py. The main change is a local re-implementation of WeightSyncWorkerExtension to handle API changes in vLLM's distributed communication, including support for CUDA, XPU, and Ascend devices.

My review has identified several critical issues related to the new XPU implementation. The current code will fail on XPU systems due to incorrect device handling for tensor allocation and improper use of broadcast and barrier calls. I have provided specific code suggestions to fix these by adding the necessary conditional logic for XPU devices. Additionally, I've included a medium-severity suggestion to refactor a redundant check for improved code clarity.

Comment on lines +172 to +176

        weight = torch.empty(shape, dtype=dtype, device=self.communicator.device)

        # Use NCCL to broadcast the updated weights from the client (src) to all workers.
        self._comm.broadcast(weight, src=self.client_rank)
        self._comm.group.barrier()
        self.communicator.broadcast(weight, src=self.client_rank)
        self.communicator.group.barrier()

critical

The current implementation for updating named parameters will fail on XPU devices. The self.communicator for XPU is a torch.distributed.ProcessGroup, which does not have a .device attribute, causing an AttributeError. Additionally, the broadcast and barrier calls are incorrect for this process group type.

You should add conditional logic to handle XPU devices correctly, similar to how it's done in recent versions of trl.

Suggested change

-        weight = torch.empty(shape, dtype=dtype, device=self.communicator.device)
-        # Use NCCL to broadcast the updated weights from the client (src) to all workers.
-        self._comm.broadcast(weight, src=self.client_rank)
-        self._comm.group.barrier()
-        self.communicator.broadcast(weight, src=self.client_rank)
-        self.communicator.group.barrier()
+        device = self.device if is_torch_xpu_available() else self.communicator.device
+        weight = torch.empty(shape, dtype=dtype, device=device)
+        # Use NCCL to broadcast the updated weights from the client (src) to all workers.
+        if is_torch_xpu_available():
+            torch.distributed.broadcast(tensor=weight, src=self.client_rank, group=self.communicator)
+            self.communicator.barrier().wait()
+        else:
+            self.communicator.broadcast(weight, src=self.client_rank)
+            self.communicator.group.barrier()
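
One note on the suggested changes in this review: they rely on an is_torch_xpu_available() helper whose import is not shown in the excerpts. A plausible import, stated here purely as an assumption since the PR's import block is not visible in this conversation, would be:

```python
# Assumed import for the XPU availability check used in the suggested changes;
# transformers ships this helper in its utils module.
from transformers.utils import is_torch_xpu_available
```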

Comment on lines +193 to +195

        flatten_tensor = torch.empty(flatten_tensor_length, dtype=dtype, device=self.communicator.device)
        self.communicator.broadcast(flatten_tensor, src=self.client_rank)
        self.communicator.group.barrier()

critical

Similar to update_named_param, this method has incorrect logic for XPU devices. self.communicator.device will raise an AttributeError, and the broadcast/barrier calls are incorrect for ProcessGroup. This will cause a runtime failure on XPU systems.

        device = self.device if is_torch_xpu_available() else self.communicator.device
        flatten_tensor = torch.empty(flatten_tensor_length, dtype=dtype, device=device)
        if is_torch_xpu_available():
            torch.distributed.broadcast(tensor=flatten_tensor, src=self.client_rank, group=self.communicator)
            self.communicator.barrier().wait()
        else:
            self.communicator.broadcast(flatten_tensor, src=self.client_rank)
            self.communicator.group.barrier()

Comment on lines +225 to +226

            tensor = torch.empty(shape, dtype=dtype, device=self.communicator.device)
            self.communicator.broadcast(tensor, src=self.client_rank)

critical

This part of the code has incorrect logic for XPU devices. The device for tensor allocation is wrong, and the broadcast call needs to be conditional to support the ProcessGroup used for XPU.

            tensor = torch.empty(shape, dtype=dtype, device=self.device if is_torch_xpu_available() else self.communicator.device)
            if is_torch_xpu_available():
                torch.distributed.broadcast(tensor=tensor, src=self.client_rank, group=self.communicator)
            else:
                self.communicator.broadcast(tensor, src=self.client_rank)

            named_params[name] = tensor

        self._comm.group.barrier()
        self.communicator.group.barrier()

critical

The barrier call is incorrect for XPU devices. For a ProcessGroup, you should call self.communicator.barrier().wait() instead of self.communicator.group.barrier().

        if is_torch_xpu_available():
            self.communicator.barrier().wait()
        else:
            self.communicator.group.barrier()

Comment on lines +252 to +255

        flatten_tensor = torch.empty(flatten_tensor_length, dtype=dtype, device=self.communicator.device)

        self._comm.broadcast(flatten_tensor, src=self.client_rank)
        self._comm.group.barrier()
        self.communicator.broadcast(flatten_tensor, src=self.client_rank)
        self.communicator.group.barrier()

critical

This method has the same critical issue with XPU device handling as other update methods. self.communicator.device will fail, and the broadcast and barrier calls are incorrect for the ProcessGroup used in the XPU case.

        device = self.device if is_torch_xpu_available() else self.communicator.device
        flatten_tensor = torch.empty(flatten_tensor_length, dtype=dtype, device=device)

        if is_torch_xpu_available():
            torch.distributed.broadcast(tensor=flatten_tensor, src=self.client_rank, group=self.communicator)
            self.communicator.barrier().wait()
        else:
            self.communicator.broadcast(flatten_tensor, src=self.client_rank)
            self.communicator.group.barrier()

Comment on lines 115 to 117

        if torch.cuda.is_available() or (is_torch_xpu_available()
                                         and hasattr(torch.xpu.get_device_properties(self.device), 'uuid')):
            accelerator_module = torch.xpu if is_torch_xpu_available() else torch.cuda

medium

The check is_torch_xpu_available() is performed twice in this conditional block. You can simplify this logic by using a variable to store the result of the check, which improves readability and avoids redundancy.

Suggested change

-        if torch.cuda.is_available() or (is_torch_xpu_available()
-                                         and hasattr(torch.xpu.get_device_properties(self.device), 'uuid')):
-            accelerator_module = torch.xpu if is_torch_xpu_available() else torch.cuda
+        use_xpu_with_uuid = is_torch_xpu_available() and hasattr(torch.xpu.get_device_properties(self.device), 'uuid')
+        if torch.cuda.is_available() or use_xpu_with_uuid:
+            accelerator_module = torch.xpu if use_xpu_with_uuid else torch.cuda

@hjh0119 (Collaborator, Author) commented Jan 27, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request updates the codebase to be compatible with vLLM 0.14, primarily by re-implementing the WeightSyncWorkerExtension class and adjusting related communication protocols. This involves removing the _check_trl_version method and the client_device_uuid field from communication requests, and updating the weight synchronization logic to use the new communicator attribute. The changes are well-aligned with the stated goal of vLLM 0.14 compatibility. However, there is a critical issue in the re-implemented WeightSyncWorkerExtension where self.device is accessed before being defined, which needs to be addressed.

            import torch_npu
            torch_npu.npu.set_device(device)
        else:
            device = self.device

critical

The device attribute is accessed via self.device but is not defined within the WeightSyncWorkerExtension class. This will lead to an AttributeError. It should likely be accessed from self.model_runner.device, as model_runner is an attribute typically available in worker extensions and holds the device information.

Suggested change

-            device = self.device
+            device = self.model_runner.device

@hjh0119 changed the title from "compatible with vllm 0.14" to "chore: bump vllm to 0.15.1" on Feb 17, 2026