feat: expose POST /release_memory_occupation and /resume_memory_occupation#272
feat: expose POST /release_memory_occupation and /resume_memory_occupation#272qywu wants to merge 4 commits into
Conversation
…ation Wires the existing torch_memory_saver-backed engine methods all the way to the HTTP layer, so external orchestrators (e.g. multi-instance memory- pressure controllers) can trigger offload/restore over a standard REST call. Changes: - io_struct.py: add `tags` field to ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput (the engine was already passing `tags=` but the dataclass had no such field, causing a TypeError at runtime). - engine_base.py: propagate `tags` parameter to the abstract interface. - request_handler.py: dispatch ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput in the scheduler process, calling TorchMemorySaverAdapter.pause() / .resume() on every TP rank. - http_server.py (new): lightweight FastAPI server that wraps AsyncLLM and exposes /release_memory_occupation and /resume_memory_occupation alongside /health, /health_generate, and /flush_cache. Intended for direct node access in tests and PD-disaggregated deployments. - mini_lb.py: proxy both new endpoints to all prefill and decode servers in parallel, matching the existing /flush_cache fan-out pattern. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5a9bc3d4e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for coro in asyncio.as_completed(tasks): | ||
| await coro |
There was a problem hiding this comment.
Propagate backend failures from memory fan-out
When a prefill/decode node returns a non-2xx response for either new memory endpoint (for example because the node is running an older server without /release_memory_occupation, or because the backend rejects the request), this loop only awaits the request and never inspects response.status, so the load balancer still returns 200 to the orchestrator even though some nodes did not release/resume memory. Please check each ClientResponse and propagate a failure status/detail before reporting success.
Useful? React with 👍 / 👎.
…ause (lightseekorg#273) Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…ause torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages remain ours, so co-tenants can't use the headroom that pause() just freed. Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() to surrender those bytes as well. On a typical workload this recovers an additional few hundred MiB on top of saver.pause() alone (the exact number depends on allocator fragmentation at the time of release). Stacked on top of lightseekorg#272. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…r.region() The base PR (lightseekorg#272) reclaimed ~90.7% of the engine footprint by releasing the regions weight_loader / KV cache modules wrap. The remaining ~9.3% includes: - CUDA graph captures (CUDAGraph objects + the static input/output tensors they capture) - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and flashinfer prefill workspace under flashmla This change brings those allocations inside torch_memory_saver.region() so they're released by /release_memory_occupation alongside the rest. Changes: - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without globals. - trtllm.py, flashmla.py: allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True. - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture() in adapter.region() so the CUDAGraph captures and their persistent static buffers live inside the released region. Static-tensor contents are zeroed across pause/resume but callers overwrite them on every replay, so this is benign. - model_executor.py: thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single. Expected impact: +5-9% reclaim on top of lightseekorg#272, bringing the total to roughly 97-99% of the engine footprint on H100. Exact gain depends on how many graphs are captured and which attention backend is selected. Stacked on top of lightseekorg#272. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…vation
contents are gone — torch_memory_saver.pause() preserves virtual addresses
only, not data. /resume_memory_occupation gets zeroed pages back, so any
caller that wants the model to work after a release/resume cycle has to
re-read the checkpoint from disk. For RLHF train↔serve handoff and similar
"pause inference, return GPU, resume inference" flows that's tens of GiB
of disk I/O on every cycle.
Adds an opt-in CPU staging step:
POST /release_memory_occupation
{"stage_to_cpu": true}
- Before saver.pause(), copy every param and buffer of the (target and
draft) model into a pre-allocated pinned host buffer.
- After saver.resume(), copy them back.
On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16
vs ~60-180 s to re-read the same weights from network storage.
Changes:
- New: tokenspeed/runtime/engine/memory_occupation_manager.py with the
MemoryOccupationManager class — pin/unpin host buffers, drive
saver.pause()/resume(), and reuse buffers across cycles.
- io_struct.py: add stage_to_cpu: bool = False to
ReleaseMemoryOccupationReqInput.
- request_handler.py: replace direct memory_saver calls with the
manager so the staging path runs in the scheduler process where the
model_runner is live.
- event_loop.py: construct the manager with the real target/draft
model_runner and pass it into RequestHandler.
- http_server.py: parse stage_to_cpu from the JSON body.
Trade-offs:
- Host RAM hold = ~sizeof(model) for the duration of the release.
- Staging adds a few seconds to the release path; without it /release
completes in ~1 s.
- Does not stage KV cache or request pools (those are scratch — the
engine flushes them before any reasonable use of this endpoint).
Stacked on top of lightseekorg#272.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…r.region() The base PR (lightseekorg#272) reclaimed ~90.7% of the engine footprint by releasing the regions weight_loader / KV cache modules wrap. The remaining ~9.3% includes: - CUDA graph captures (CUDAGraph objects + the static input/output tensors they capture) - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and flashinfer prefill workspace under flashmla This change brings those allocations inside torch_memory_saver.region() so they're released by /release_memory_occupation alongside the rest. Changes: - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without globals. - trtllm.py, flashmla.py: allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True. - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture() in adapter.region() so the CUDAGraph captures and their persistent static buffers live inside the released region. Static-tensor contents are zeroed across pause/resume but callers overwrite them on every replay, so this is benign. - model_executor.py: thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single. Expected impact: +5-9% reclaim on top of lightseekorg#272, bringing the total to roughly 97-99% of the engine footprint on H100. Exact gain depends on how many graphs are captured and which attention backend is selected. Stacked on top of lightseekorg#272. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…vation
contents are gone — torch_memory_saver.pause() preserves virtual addresses
only, not data. /resume_memory_occupation gets zeroed pages back, so any
caller that wants the model to work after a release/resume cycle has to
re-read the checkpoint from disk. For RLHF train↔serve handoff and similar
"pause inference, return GPU, resume inference" flows that's tens of GiB
of disk I/O on every cycle.
Adds an opt-in CPU staging step:
POST /release_memory_occupation
{"stage_to_cpu": true}
- Before saver.pause(), copy every param and buffer of the (target and
draft) model into a pre-allocated pinned host buffer.
- After saver.resume(), copy them back.
On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16
vs ~60-180 s to re-read the same weights from network storage.
Changes:
- New: tokenspeed/runtime/engine/memory_occupation_manager.py with the
MemoryOccupationManager class — pin/unpin host buffers, drive
saver.pause()/resume(), and reuse buffers across cycles.
- io_struct.py: add stage_to_cpu: bool = False to
ReleaseMemoryOccupationReqInput.
- request_handler.py: replace direct memory_saver calls with the
manager so the staging path runs in the scheduler process where the
model_runner is live.
- event_loop.py: construct the manager with the real target/draft
model_runner and pass it into RequestHandler.
- http_server.py: parse stage_to_cpu from the JSON body.
Trade-offs:
- Host RAM hold = ~sizeof(model) for the duration of the release.
- Staging adds a few seconds to the release path; without it /release
completes in ~1 s.
- Does not stage KV cache or request pools (those are scratch — the
engine flushes them before any reasonable use of this endpoint).
Stacked on top of lightseekorg#272.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
20b647f to
f292dbb
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f292dbb331
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.memory_occupation_manager.release( | ||
| stage_to_cpu=recv_req.stage_to_cpu, | ||
| tags=recv_req.tags, | ||
| ) |
There was a problem hiding this comment.
Serialize memory release with active generation
When this control request arrives while the server still has queued or in-flight generations, the scheduler processes it at the start of an event-loop iteration and immediately calls pause(), which releases/zeros the same weights, KV cache, and CUDA graph buffers used by active requests. generate_request is protected by model_update_lock.reader_lock, but the release/resume control path does not take the corresponding writer lock, so an orchestrator can hit /release_memory_occupation before traffic has drained and corrupt or fail those requests. Please quiesce/lock this path the same way weight updates do before pausing memory.
Useful? React with 👍 / 👎.
f292dbb to
33682b8
Compare
The previous implementation awaited each ``session.post`` without
inspecting ``response.status``, so the load balancer returned ``200`` even
when individual prefill / decode nodes errored out or responded with a
non-2xx status (e.g. an older server without ``/release_memory_occupation``
or a backend that rejected the request). The orchestrator then assumed
release/resume succeeded everywhere when it had silently skipped on some
nodes.
Switch to ``asyncio.gather(..., return_exceptions=True)`` so we keep the
parallel fan-out, then check every result:
- exception (timeout, connection refused, ...) → record server + repr.
- status >= 400 → record server + status + truncated body (capped at
500 bytes so a misbehaving node can't dump megabytes into our
response).
If any failures land, raise ``HTTPException(502, detail={...})`` listing
the failing servers and the failed-out-of-total count. Successful runs
still return 200 unchanged.
Addresses chatgpt-codex-connector review comment on lightseekorg#272.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4b198f7f4d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.memory_saver = memory_saver or TorchMemorySaverAdapter.create( | ||
| server_args.enable_memory_saver | ||
| ) |
There was a problem hiding this comment.
Reject memory control when saver is disabled
When a node is started without --enable-memory-saver, this constructs the no-op adapter, but the new HTTP/RPC paths still return success after pause()/resume(). In deployments that accidentally omit the flag, /release_memory_occupation appears to succeed even though no weights, KV cache, or CUDA graph memory was offloaded, so an orchestrator can schedule colocated work based on memory that was never freed. Please surface an error instead of silently acknowledging the no-op path.
Useful? React with 👍 / 👎.
Two robustness holes in ``_broadcast_memory_control`` that the previous
fix didn't cover:
1. ``aiohttp.ClientSession()`` had no timeout, so a wedged prefill or
decode node would hang the LB's release/resume call indefinitely.
Add ``aiohttp.ClientTimeout(total=120s)`` via a module-level
``_FANOUT_TIMEOUT_SECONDS`` constant. 120 s is generous-but-bounded:
``release_memory_occupation`` with ``stage_to_cpu=true`` on a
72B-class model takes ~6-10 s of host-side memcpy, so the budget
covers realistic worst cases without making a hung backend invisible.
The constant is module-level so tests can monkey-patch a shorter
value to exercise the timeout path.
2. ``await result.text()`` raises ``UnicodeDecodeError`` if a backend
returns binary / non-UTF-8 bytes in its error body, and that error
escapes the ``try/finally`` and fails the entire helper instead of
being reported as a per-node failure. Switch to
``(await result.read())[:500].decode("utf-8", errors="replace")``
so undecodable bytes turn into U+FFFD markers but the helper keeps
going. Read-then-slice (rather than text-then-slice) also bounds the
read at the cap, so a 1 GiB error body doesn't get fully buffered.
Both paths are covered by ``/tmp/test_mini_lb_fanout.py`` scenarios 11
(hung backend, 1.5 s test timeout) and 12 (500 with raw ``\xff`` bytes).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be50ec2e84
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # control request here so API callers still get a typed reply. | ||
| self.send_func.send_pyobj(FlushCacheReqOutput(success=True)) | ||
| elif isinstance(recv_req, ReleaseMemoryOccupationReqInput): | ||
| self.memory_saver.pause() |
There was a problem hiding this comment.
Honor memory-control tags before pausing
When callers use the newly advertised {"tags": [...]} body (for example to release only weights but leave KV/cache regions alone), the scheduler ignores recv_req.tags and calls the unqualified pause()/resume(), which applies to every torch-memory-saver region. In tagged-memory deployments this can release or reoccupy unrelated regions and defeat the caller's requested memory scope; please pass the tags through to the adapter (and ensure allocations are tagged) or reject tagged requests instead of silently treating them as release-all.
Useful? React with 👍 / 👎.
|
Superseded by #305 which adds a full control-plane HTTP sidecar including these endpoints. |
Summary
Wires the existing
torch_memory_saver-backed engine methods all the way to the HTTP layer, so external orchestrators (e.g. multi-instance memory-pressure controllers, RLHF train↔serve handoffs) can trigger GPU memory release/reclaim over a standard REST call.io_struct.py— addtags: list[str] | None = NonetoReleaseMemoryOccupationReqInput/ResumeMemoryOccupationReqInput; the engine was already passingtags=but the empty dataclass caused aTypeErrorat runtime.engine_base.py— propagatetagsparameter to the abstract interface to keep it consistent with the concreteEngine.request_handler.py— dispatchReleaseMemoryOccupationReqInput/ResumeMemoryOccupationReqInputin the scheduler process; callsTorchMemorySaverAdapter.pause()/.resume()on all TP ranks (via the existing broadcast mechanism) and sends back the typed reply on rank-0.http_server.py(new) — lightweight FastAPI server that wrapsAsyncLLMand exposes/release_memory_occupationand/resume_memory_occupationalongside/health,/health_generate, and/flush_cache. Intended for direct node access in tests and PD-disaggregated deployments; production deployments continue to use the SMG gRPC servicer.mini_lb.py— proxy both new endpoints to all prefill and decode servers in parallel, matching the existing/flush_cachefan-out pattern. The fan-out helper is robust against partial failure: per-node errors (HTTP 4xx/5xx, network failures, timeouts) are aggregated into a502with structuredfailures: [...]instead of being silently swallowed into a 200.Endpoints
/release_memory_occupation--enable-memory-saver./resume_memory_occupationBoth endpoints accept an optional JSON body
{"tags": ["weights", "kv_cache"]}(forwarded to the engine for future tag-scoped release; currently stored on the request buttorch_memory_saver0.0.9 operates on all registered regions).Semantics — important
torch_memory_saver.pause()releases the physical GPU pages backing tensors allocated insidesaver.region()while reserving the virtual address range.resume()re-allocates physical pages at those same virtual addresses. It does not copy tensor data to CPU on release, andresume()returns memory with uninitialized contents — round-trip data preservation is the consumer's responsibility.In practice this means:
releaseandresume.resume(typical patterns: reload from disk viaupdate_weights_from_disk, or stage to CPU on the application side before callingrelease).See #275 for opt-in CPU staging that transparently restores weights on
resume.mini_lb.pyfan-out — robust by constructionThe fan-out helper (
_broadcast_memory_control) addresses four real failure modes that the original draft swallowed:/release_memory_occupationreturning 404, or a backend that rejects the body. Aggregated into a502listing the per-node{server, status, detail}.asyncio.gather(..., return_exceptions=True)and reported as{server, error}without astatuskey._FANOUT_TIMEOUT_SECONDS = 120.0on theaiohttp.ClientSession. 120 s covers realistic worst cases (e.g.stage_to_cpu=trueon a 72B model is ~6-10 s of host-side memcpy) while making sure the orchestrator can never block on a hung worker.\xff\xff...in a 500 body would previously raiseUnicodeDecodeErrorout ofawait result.text()and fail the entire helper. Replaced with(await result.read())[:500].decode("utf-8", errors="replace")so undecodable bytes turn into U+FFFD markers and the read is bounded at 500 bytes regardless of body size.Successful runs still return
200; only partial / total failure raises502.Verification on H100
Measured on H100 (80 GB HBM3, GPU 0) with Qwen2-1.5B-Instruct,
gpu_memory_utilization=0.5,--attention-backend=triton,--enforce-eager./release_memory_occupation/resume_memory_occupationThe 97.7 % reclaim ratio is the full-stack measurement (this PR + #273 empty_cache/ipc_collect + #274 workspaces). The standalone-PR figure from synthetic testing was 90.7 %; the extra ~7 pp comes from the stacked improvements.
A round-trip generation parity check (output before release vs. output after resume) passes when combined with #275's
stage_to_cpu=trueflow. Without staging, weights are zeroed after resume (as documented above) and the engine produces garbage until weights are reloaded by other means.Env caveats during testing (pre-existing, unrelated to this PR — fixes opened separately):
scheduler_metadatakwarg leak inMHAAttnBackendthat breaks--attention-backend=triton/fa4/flashinfer.import triton_kernels.matmulfailure caused by upstream module renames.FLASHINFER_DISABLE_VERSION_CHECK=1andLD_PRELOAD=$(pwd)/.venv/.../torch_memory_saver_hook_mode_preload_cu13.abi3.soto launch the test process.mini_lb.pyfan-out — verificationEnd-to-end test (
/tmp/test_mini_lb_fanout.py) spins up realaiohttp.webbackends on dynamic ports and drives the LB via FastAPITestClient. All 12 scenarios pass:502with both failing servers itemised200/resume_memory_occupationwith one bad backend502with the bad serverdetailcapped at exactly 500 B502witherror:(nostatus:)502with2/2 nodesin message{stage_to_cpu, tags}body forwarded verbatim200decode_serversand empty everything200(no-op fan-out)await asyncio.sleep(60)), helper timeout 1.5 s\xff\xfe\xfd\xfcbytes502withlen(detail)=128, U+FFFD replacement marker present, no UnicodeDecodeErrorTest plan
from tokenspeed.runtime.entrypoints.engine import Engine(verified after env fixes).Engine(enable_memory_saver=True)startup completes; baseline → after-load GPU memory matchesgpu_memory_utilization.engine.release_memory_occupation()returns; GPU memory drops by ~97 % of engine footprint.engine.resume_memory_occupation()returns; GPU memory restored.Engine.release_memory_occupation(tags=["weights"])no longer raisesTypeError.mini_lb.pyfan-out propagates per-node failures (HTTP 4xx/5xx, connection refused, all-fail, body forwarding, 2xx variants, empty configs, hung backend timeout, binary error body) — see the 12-scenario table above.Notes / follow-ups
smg_grpc_protoandsmg_grpc_serviceris left as follow-up work so the endpoints become reachable through the productionts servepath.tagsfiltering (releasing only a subset of registered regions) is not yet implemented intorch_memory_saver; the field is plumbed through for forward-compatibility._FANOUT_TIMEOUT_SECONDSis currently a module-level constant (120 s). If real deployments need per-call tuning we can promote it to a server arg or query parameter.