[rollout, trainer, cfg] feat: per-request abort hooks and AbortableLLMServerClient#6865
[rollout, trainer, cfg] feat: per-request abort hooks and AbortableLLMServerClient#6865cr-gao wants to merge 4 commits into
Conversation
…MServerClient Add selective per-request abort to the rollout client path: extension hooks on LLMServerClient plus a concrete AbortableLLMServerClient, selectable via the new actor_rollout_ref.rollout.agent.llm_client_class config. Also fix a typo (chunkes -> chunks) in agent_loop.py. Co-authored-by: Claude Signed-off-by: cr-gao <gaochenrui@sjtu.edu.cn>
There was a problem hiding this comment.
Code Review
This pull request introduces AbortableLLMServerClient to support per-request aborts and in-flight request tracking, adds corresponding CPU unit tests, and allows configuring a custom LLMServerClient class. It also fixes a minor typo in the agent loop. The review feedback highlights a critical issue where client-side task cancellation (such as timeouts) would silently leak GPU resources on the vLLM server because the server-side request is not aborted; the reviewer suggests introducing and implementing an _on_cancel hook to safely abort the request upon cancellation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| finally: | ||
| self._on_complete(request_id) | ||
| self._release_server(server_id) | ||
|
|
||
| def _on_dispatch(self, request_id, inner_request_id, server): | ||
| """Hook fired after the inner vLLM request_id is assigned, before awaiting generation. | ||
| Default no-op; subclasses may record (request_id/inner_request_id, server) to enable abort.""" | ||
|
|
||
| def _on_complete(self, request_id): | ||
| """Hook fired when generation finishes or raises. Default no-op.""" |
There was a problem hiding this comment.
If the client-side generate task is cancelled (e.g., due to a timeout via asyncio.timeout or asyncio.wait_for), the finally block runs immediately and removes the request from _inflight via _on_complete. However, the server-side request is never aborted, leading to silent GPU resource leaks on the vLLM server.
To fix this, we should catch asyncio.CancelledError in generate, trigger a new _on_cancel hook to abort the request on the server, and then re-raise the cancellation.
| finally: | |
| self._on_complete(request_id) | |
| self._release_server(server_id) | |
| def _on_dispatch(self, request_id, inner_request_id, server): | |
| """Hook fired after the inner vLLM request_id is assigned, before awaiting generation. | |
| Default no-op; subclasses may record (request_id/inner_request_id, server) to enable abort.""" | |
| def _on_complete(self, request_id): | |
| """Hook fired when generation finishes or raises. Default no-op.""" | |
| except asyncio.CancelledError: | |
| self._on_cancel(request_id) | |
| raise | |
| finally: | |
| self._on_complete(request_id) | |
| self._release_server(server_id) | |
| def _on_dispatch(self, request_id, inner_request_id, server): | |
| """Hook fired after the inner vLLM request_id is assigned, before awaiting generation. | |
| Default no-op; subclasses may record (request_id/inner_request_id, server) to enable abort.""" | |
| def _on_complete(self, request_id): | |
| """Hook fired when generation finishes or raises. Default no-op.""" | |
| def _on_cancel(self, request_id): | |
| """Hook fired when generation is cancelled. Default no-op.""" |
There was a problem hiding this comment.
Good catch. asyncio.wait_for/timeout cancellation indeed leaked the server-side request, and that's exactly one of the motivating use cases. Added an _on_cancel(request_id) hook fired on CancelledError (before the finally so the in-flight entry still exists), implemented as a fire-and-forget abort in AbortableLLMServerClient (we can't safely await during cancellation), with abort() and _on_cancel sharing a _send_abort helper. Also added a regression test for the cancellation path. Done in aeadf2f.
| def _on_complete(self, request_id): | ||
| # Sole owner of removal: fires from generate()'s finally on success, abort, or error. | ||
| self._inflight.pop(request_id, None) |
There was a problem hiding this comment.
Implement the _on_cancel hook in AbortableLLMServerClient to automatically abort the request on the server when the client-side task is cancelled. This prevents GPU resource leaks on the vLLM server during timeouts or cancellations.
def _on_complete(self, request_id):
# Sole owner of removal: fires from generate()'s finally on success, abort, or error.
self._inflight.pop(request_id, None)
def _on_cancel(self, request_id):
entry = self._inflight.get(request_id)
if entry is not None:
inner_request_id, server = entry
server.abort_request.remote(inner_request_id)There was a problem hiding this comment.
Implemented _on_cancel in AbortableLLMServerClient as fire-and-forget (_send_abort, no await during cancellation); removal stays owned by _on_complete. Done in aeadf2f.
Add an _on_cancel hook fired on asyncio.CancelledError so AbortableLLMServerClient fire-and-forget aborts the server-side request when generate() is cancelled (e.g. asyncio.wait_for/timeout), preventing silent GPU resource leaks. Add a regression test for the cancellation path. Addresses gemini-code-assist review feedback on verl-project#6865. Co-authored-by: Claude Signed-off-by: cr-gao <gaochenrui@sjtu.edu.cn>
| reward_loop_worker_handles = self.reward_loop_manager.reward_loop_workers if enable_agent_reward_loop else None | ||
|
|
||
| # Support custom LLMServerClient via config (mirrors agent_loop_manager_class above) | ||
| llm_client_class_fqn = self.config.actor_rollout_ref.rollout.get("agent", {}).get("llm_client_class") |
There was a problem hiding this comment.
ray_trainer.py is deprecated, please see verl/trainer/ppo/v1/trainer_base.py
There was a problem hiding this comment.
Thanks! moved the config-driven custom LLMServerClient selection into v1/trainer_base.py's get_llm_client() and reverted the change in the deprecated ray_trainer.py. Done in 5cee1d8.
| ) | ||
| output: TokenOutput = await server.generate.remote( | ||
| request_id=uuid4().hex, # use new request_id for each turn | ||
| request_id=inner_request_id, |
There was a problem hiding this comment.
Do not use same request_id in multi-turn, it may cause some buggy behavior in sglang/vllm server.
There was a problem hiding this comment.
The server already gets a fresh id per turn. generate() passes inner_request_id = uuid4().hex to server.generate.remote(...), which preserves the pre-existing "use new request_id for each turn" behavior. I only hoisted the uuid4() out of the .remote() call so subclasses can record it. So the outer sticky request_id is never sent to the sglang/vllm server.
The outer request_id is only used as the key of AbortableLLMServerClient._inflight for per-request abort. Since it's reused across turns, this is safe for sequential turns but would collide if a single session ever has concurrent in-flight generate() calls. Is that the case you're flagging? If so I'll re-key _inflight by inner_request_id or allow multiple in-flight entries per session, happy to make that change!
…r_base get_llm_client (addresses verl-project#6865 review)
What does this PR do?
Adds selective (per-request) abort support to the rollout client path. Today
LLMServerClientexposes no way to interrupt a single in-flight request — only thebroadcast
abort_all_requestson the rollout replica exists. This PR introduceslightweight extension hooks plus a concrete
AbortableLLMServerClientthat tracksin-flight requests so a custom
AgentLoopWorker/AgentLoopManagercan abort one specificrequest (e.g. a timed-out rollout, or a custom trajectory-collection policy that decides to
drop a particular sequence).
AI assistance (Claude) was used to write this change; the author has reviewed every line.
Checklist Before Starting
Resolves #6866. Feature discussion / motivation: #6866.
[{modules}] {type}: {description}Test
Added
tests/workers/rollout/test_abortable_llm_client_on_cpu.py(CPU-only, mocks the Rayload balancer + server handles — no GPU/vLLM needed):
test_record_abort_and_cleanup— dispatch records the request;abort()targets theinner vLLM request id (not the outer sticky-session id); completion clears the entry.
test_inflight_cleared_on_normal_completion— normal finish also clears the table.test_abort_unknown_request_is_noop— aborting an unknown/finished id neither raises norhits the server.
test_cancellation_aborts_server_and_clears_inflight— cancellinggenerate()(e.g.asyncio.wait_for/timeout) fires_on_cancel, which aborts the server-side request so itdoes not leak, and
_on_completestill clears the in-flight entry.API and Usage Example
New config field
actor_rollout_ref.rollout.agent.llm_client_class(FQN, dynamicallyimported — mirrors the existing
agent_loop_manager_class), wired inray_trainer.py:Design & Code Changes
llm_server.py:request_idinto a namedinner_request_idso subclassescan record it.
_on_dispatch(request_id, inner_request_id, server)(before awaitinggeneration),
_on_complete(request_id)(infinally), and_on_cancel(request_id)(in
except asyncio.CancelledError, before thefinally, so the in-flight entry stillexists).
AbortableLLMServerClient(LLMServerClient): tracksrequest_id -> (inner_request_id, server)in_inflight;_on_completeis the sole cleanup point;abort(request_id)and
_on_cancelshare a_send_aborthelper that sends a preciseabort_requestto theright server (
abortawaits it;_on_cancelis fire-and-forget, since awaiting duringcancellation is unsafe). Inherits the base (non-resuming)
generate, i.e. hard-abortsemantics.
workers/config/rollout.py: addllm_client_classfield toAgentLoopConfig.trainer/ppo/ray_trainer.py: loadllm_client_class(if set) and pass it asclient_clsto
get_client().agent_loop.py: fix typochunkes->chunks.Checklist Before Submitting
check-naming-conventionsfalsepositive only triggers on vendored files under
.venv, which is gitignored and absent in CI.)ci-requestchannel once ready for CI.🤖 Generated with Claude Code