Skip to content

feat: expose POST /release_memory_occupation and /resume_memory_occupation#272

Closed
qywu wants to merge 4 commits into
lightseekorg:mainfrom
qywu:feat/http-memory-occupation-endpoints
Closed

feat: expose POST /release_memory_occupation and /resume_memory_occupation#272
qywu wants to merge 4 commits into
lightseekorg:mainfrom
qywu:feat/http-memory-occupation-endpoints

Conversation

@qywu
Copy link
Copy Markdown
Collaborator

@qywu qywu commented May 27, 2026

Summary

Wires the existing torch_memory_saver-backed engine methods all the way to the HTTP layer, so external orchestrators (e.g. multi-instance memory-pressure controllers, RLHF train↔serve handoffs) can trigger GPU memory release/reclaim over a standard REST call.

  • io_struct.py — add tags: list[str] | None = None to ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput; the engine was already passing tags= but the empty dataclass caused a TypeError at runtime.
  • engine_base.py — propagate tags parameter to the abstract interface to keep it consistent with the concrete Engine.
  • request_handler.py — dispatch ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput in the scheduler process; calls TorchMemorySaverAdapter.pause() / .resume() on all TP ranks (via the existing broadcast mechanism) and sends back the typed reply on rank-0.
  • http_server.py (new) — lightweight FastAPI server that wraps AsyncLLM and exposes /release_memory_occupation and /resume_memory_occupation alongside /health, /health_generate, and /flush_cache. Intended for direct node access in tests and PD-disaggregated deployments; production deployments continue to use the SMG gRPC servicer.
  • mini_lb.py — proxy both new endpoints to all prefill and decode servers in parallel, matching the existing /flush_cache fan-out pattern. The fan-out helper is robust against partial failure: per-node errors (HTTP 4xx/5xx, network failures, timeouts) are aggregated into a 502 with structured failures: [...] instead of being silently swallowed into a 200.

Endpoints

Method Path Description
POST /release_memory_occupation Release GPU memory occupied by tensors inside torch_memory_saver regions. Requires --enable-memory-saver.
POST /resume_memory_occupation Re-acquire GPU memory at the same virtual addresses previously released.

Both endpoints accept an optional JSON body {"tags": ["weights", "kv_cache"]} (forwarded to the engine for future tag-scoped release; currently stored on the request but torch_memory_saver 0.0.9 operates on all registered regions).

Semantics — important

torch_memory_saver.pause() releases the physical GPU pages backing tensors allocated inside saver.region() while reserving the virtual address range. resume() re-allocates physical pages at those same virtual addresses. It does not copy tensor data to CPU on release, and resume() returns memory with uninitialized contents — round-trip data preservation is the consumer's responsibility.

In practice this means:

  • ✅ The GPU memory is genuinely returned to the driver / co-tenants between release and resume.
  • ✅ Tensor Python objects and CUDA Graph captures keep their pointers valid across the cycle.
  • ❌ Weights, KV cache state, and any other tensor contents that lived in the released region are lost. Callers must reload them after resume (typical patterns: reload from disk via update_weights_from_disk, or stage to CPU on the application side before calling release).

See #275 for opt-in CPU staging that transparently restores weights on resume.

mini_lb.py fan-out — robust by construction

The fan-out helper (_broadcast_memory_control) addresses four real failure modes that the original draft swallowed:

  1. HTTP 4xx/5xx from any node — e.g. an older server without /release_memory_occupation returning 404, or a backend that rejects the body. Aggregated into a 502 listing the per-node {server, status, detail}.
  2. Network-level failures — connection refused, DNS failure, etc. Caught via asyncio.gather(..., return_exceptions=True) and reported as {server, error} without a status key.
  3. Wedged / hanging backend — bounded by a module-level _FANOUT_TIMEOUT_SECONDS = 120.0 on the aiohttp.ClientSession. 120 s covers realistic worst cases (e.g. stage_to_cpu=true on a 72B model is ~6-10 s of host-side memcpy) while making sure the orchestrator can never block on a hung worker.
  4. Non-UTF-8 / binary error bodies — backend returning \xff\xff... in a 500 body would previously raise UnicodeDecodeError out of await result.text() and fail the entire helper. Replaced with (await result.read())[:500].decode("utf-8", errors="replace") so undecodable bytes turn into U+FFFD markers and the read is bounded at 500 bytes regardless of body size.

Successful runs still return 200; only partial / total failure raises 502.

Verification on H100

Measured on H100 (80 GB HBM3, GPU 0) with Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager.

Phase GPU used Δ
Baseline (no engine) 627 MiB
After model load 41,431 MiB engine footprint: +40,804 MiB
After /release_memory_occupation 1,585 MiB −39,846 MiB = 97.7 % of footprint
After /resume_memory_occupation 41,533 MiB +102 MiB vs. load (allocator overhead)
Latency Value
Release 23 ms
Resume 18 ms

The 97.7 % reclaim ratio is the full-stack measurement (this PR + #273 empty_cache/ipc_collect + #274 workspaces). The standalone-PR figure from synthetic testing was 90.7 %; the extra ~7 pp comes from the stacked improvements.

A round-trip generation parity check (output before release vs. output after resume) passes when combined with #275's stage_to_cpu=true flow. Without staging, weights are zeroed after resume (as documented above) and the engine produces garbage until weights are reloaded by other means.

Env caveats during testing (pre-existing, unrelated to this PR — fixes opened separately):

mini_lb.py fan-out — verification

End-to-end test (/tmp/test_mini_lb_fanout.py) spins up real aiohttp.web backends on dynamic ports and drives the LB via FastAPI TestClient. All 12 scenarios pass:

# Scenario Outcome
1 1 good + 1 returning 500 + 1 returning 404 502 with both failing servers itemised
2 All backends return 200 200
3 /resume_memory_occupation with one bad backend 502 with the bad server
4 Backend returns 8 KiB error body detail capped at exactly 500 B
5 Dead port (connection refused) 502 with error: (no status:)
6 Every backend returns 500 (no successes) 502 with 2/2 nodes in message
7 {stage_to_cpu, tags} body forwarded verbatim body received intact by every backend
8 No request body sent backends see no body, fan-out still completes
9 Backend returns 202 (other 2xx) treated as success → 200
10 Empty decode_servers and empty everything 200 (no-op fan-out)
11 Backend hangs (await asyncio.sleep(60)), helper timeout 1.5 s returned in 1.50 s with timeout reported; the parallel good backend still succeeded
12 Backend returns 500 with raw \xff\xfe\xfd\xfc bytes 502 with len(detail)=128, U+FFFD replacement marker present, no UnicodeDecodeError

Test plan

  • Engine import path: from tokenspeed.runtime.entrypoints.engine import Engine (verified after env fixes).
  • Engine(enable_memory_saver=True) startup completes; baseline → after-load GPU memory matches gpu_memory_utilization.
  • engine.release_memory_occupation() returns; GPU memory drops by ~97 % of engine footprint.
  • engine.resume_memory_occupation() returns; GPU memory restored.
  • Regression: Engine.release_memory_occupation(tags=["weights"]) no longer raises TypeError.
  • mini_lb.py fan-out propagates per-node failures (HTTP 4xx/5xx, connection refused, all-fail, body forwarding, 2xx variants, empty configs, hung backend timeout, binary error body) — see the 12-scenario table above.

Notes / follow-ups

  • The SMG gRPC servicer does not yet expose these methods; adding gRPC methods to smg_grpc_proto and smg_grpc_servicer is left as follow-up work so the endpoints become reachable through the production ts serve path.
  • tags filtering (releasing only a subset of registered regions) is not yet implemented in torch_memory_saver; the field is plumbed through for forward-compatibility.
  • CPU staging for transparent round-trip weight preservation is in feat(memory-saver): optional CPU staging for round-trip weight preservation #275.
  • _FANOUT_TIMEOUT_SECONDS is currently a module-level constant (120 s). If real deployments need per-call tuning we can promote it to a server arg or query parameter.

…ation

Wires the existing torch_memory_saver-backed engine methods all the way
to the HTTP layer, so external orchestrators (e.g. multi-instance memory-
pressure controllers) can trigger offload/restore over a standard REST call.

Changes:
- io_struct.py: add `tags` field to ReleaseMemoryOccupationReqInput /
  ResumeMemoryOccupationReqInput (the engine was already passing `tags=`
  but the dataclass had no such field, causing a TypeError at runtime).
- engine_base.py: propagate `tags` parameter to the abstract interface.
- request_handler.py: dispatch ReleaseMemoryOccupationReqInput and
  ResumeMemoryOccupationReqInput in the scheduler process, calling
  TorchMemorySaverAdapter.pause() / .resume() on every TP rank.
- http_server.py (new): lightweight FastAPI server that wraps AsyncLLM
  and exposes /release_memory_occupation and /resume_memory_occupation
  alongside /health, /health_generate, and /flush_cache.  Intended for
  direct node access in tests and PD-disaggregated deployments.
- mini_lb.py: proxy both new endpoints to all prefill and decode servers
  in parallel, matching the existing /flush_cache fan-out pattern.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5a9bc3d4e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/pd/mini_lb.py Outdated
Comment on lines +460 to +461
for coro in asyncio.as_completed(tasks):
await coro
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate backend failures from memory fan-out

When a prefill/decode node returns a non-2xx response for either new memory endpoint (for example because the node is running an older server without /release_memory_occupation, or because the backend rejects the request), this loop only awaits the request and never inspects response.status, so the load balancer still returns 200 to the orchestrator even though some nodes did not release/resume memory. Please check each ClientResponse and propagate a failure status/detail before reporting success.

Useful? React with 👍 / 👎.

…ause (lightseekorg#273)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added a commit to qywu/tokenspeed that referenced this pull request May 27, 2026
…ause

torch_memory_saver.pause() releases the physical pages backing tensors in
saver.region(), but PyTorch's caching allocator still holds onto its own
free pool and any NCCL/IPC handles outside that region. From the driver's
perspective those pages remain ours, so co-tenants can't use the headroom
that pause() just freed.

Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after
pause() to surrender those bytes as well. On a typical workload this
recovers an additional few hundred MiB on top of saver.pause() alone (the
exact number depends on allocator fragmentation at the time of release).

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added a commit to qywu/tokenspeed that referenced this pull request May 27, 2026
…r.region()

The base PR (lightseekorg#272) reclaimed ~90.7% of the engine footprint by releasing
the regions weight_loader / KV cache modules wrap. The remaining ~9.3%
includes:

  - CUDA graph captures (CUDAGraph objects + the static input/output
    tensors they capture)
  - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and
    flashinfer prefill workspace under flashmla

This change brings those allocations inside torch_memory_saver.region()
so they're released by /release_memory_occupation alongside the rest.

Changes:
  - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver
    from server_args into the attention config so backends can opt their
    workspaces in without globals.
  - trtllm.py, flashmla.py: allocate the shared workspace buffer inside
    saver.region() when enable_memory_saver is True.
  - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture()
    in adapter.region() so the CUDAGraph captures and their persistent
    static buffers live inside the released region. Static-tensor contents
    are zeroed across pause/resume but callers overwrite them on every
    replay, so this is benign.
  - model_executor.py: thread enable_memory_saver into ModelExecutorConfig;
    pass model_runner.memory_saver_adapter into CudaGraphWrapper so the
    process-wide singleton stays single.

Expected impact: +5-9% reclaim on top of lightseekorg#272, bringing the total to
roughly 97-99% of the engine footprint on H100. Exact gain depends on
how many graphs are captured and which attention backend is selected.

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added a commit to qywu/tokenspeed that referenced this pull request May 27, 2026
…vation

contents are gone — torch_memory_saver.pause() preserves virtual addresses
only, not data. /resume_memory_occupation gets zeroed pages back, so any
caller that wants the model to work after a release/resume cycle has to
re-read the checkpoint from disk. For RLHF train↔serve handoff and similar
"pause inference, return GPU, resume inference" flows that's tens of GiB
of disk I/O on every cycle.

Adds an opt-in CPU staging step:

  POST /release_memory_occupation
  {"stage_to_cpu": true}

  - Before saver.pause(), copy every param and buffer of the (target and
    draft) model into a pre-allocated pinned host buffer.
  - After saver.resume(), copy them back.

On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16
vs ~60-180 s to re-read the same weights from network storage.

Changes:
  - New: tokenspeed/runtime/engine/memory_occupation_manager.py with the
    MemoryOccupationManager class — pin/unpin host buffers, drive
    saver.pause()/resume(), and reuse buffers across cycles.
  - io_struct.py: add stage_to_cpu: bool = False to
    ReleaseMemoryOccupationReqInput.
  - request_handler.py: replace direct memory_saver calls with the
    manager so the staging path runs in the scheduler process where the
    model_runner is live.
  - event_loop.py: construct the manager with the real target/draft
    model_runner and pass it into RequestHandler.
  - http_server.py: parse stage_to_cpu from the JSON body.

Trade-offs:
  - Host RAM hold = ~sizeof(model) for the duration of the release.
  - Staging adds a few seconds to the release path; without it /release
    completes in ~1 s.
  - Does not stage KV cache or request pools (those are scratch — the
    engine flushes them before any reasonable use of this endpoint).

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added a commit to qywu/tokenspeed that referenced this pull request May 27, 2026
…r.region()

The base PR (lightseekorg#272) reclaimed ~90.7% of the engine footprint by releasing
the regions weight_loader / KV cache modules wrap. The remaining ~9.3%
includes:

  - CUDA graph captures (CUDAGraph objects + the static input/output
    tensors they capture)
  - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and
    flashinfer prefill workspace under flashmla

This change brings those allocations inside torch_memory_saver.region()
so they're released by /release_memory_occupation alongside the rest.

Changes:
  - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver
    from server_args into the attention config so backends can opt their
    workspaces in without globals.
  - trtllm.py, flashmla.py: allocate the shared workspace buffer inside
    saver.region() when enable_memory_saver is True.
  - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture()
    in adapter.region() so the CUDAGraph captures and their persistent
    static buffers live inside the released region. Static-tensor contents
    are zeroed across pause/resume but callers overwrite them on every
    replay, so this is benign.
  - model_executor.py: thread enable_memory_saver into ModelExecutorConfig;
    pass model_runner.memory_saver_adapter into CudaGraphWrapper so the
    process-wide singleton stays single.

Expected impact: +5-9% reclaim on top of lightseekorg#272, bringing the total to
roughly 97-99% of the engine footprint on H100. Exact gain depends on
how many graphs are captured and which attention backend is selected.

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added a commit to qywu/tokenspeed that referenced this pull request May 27, 2026
…vation

contents are gone — torch_memory_saver.pause() preserves virtual addresses
only, not data. /resume_memory_occupation gets zeroed pages back, so any
caller that wants the model to work after a release/resume cycle has to
re-read the checkpoint from disk. For RLHF train↔serve handoff and similar
"pause inference, return GPU, resume inference" flows that's tens of GiB
of disk I/O on every cycle.

Adds an opt-in CPU staging step:

  POST /release_memory_occupation
  {"stage_to_cpu": true}

  - Before saver.pause(), copy every param and buffer of the (target and
    draft) model into a pre-allocated pinned host buffer.
  - After saver.resume(), copy them back.

On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16
vs ~60-180 s to re-read the same weights from network storage.

Changes:
  - New: tokenspeed/runtime/engine/memory_occupation_manager.py with the
    MemoryOccupationManager class — pin/unpin host buffers, drive
    saver.pause()/resume(), and reuse buffers across cycles.
  - io_struct.py: add stage_to_cpu: bool = False to
    ReleaseMemoryOccupationReqInput.
  - request_handler.py: replace direct memory_saver calls with the
    manager so the staging path runs in the scheduler process where the
    model_runner is live.
  - event_loop.py: construct the manager with the real target/draft
    model_runner and pass it into RequestHandler.
  - http_server.py: parse stage_to_cpu from the JSON body.

Trade-offs:
  - Host RAM hold = ~sizeof(model) for the duration of the release.
  - Staging adds a few seconds to the release path; without it /release
    completes in ~1 s.
  - Does not stage KV cache or request pools (those are scratch — the
    engine flushes them before any reasonable use of this endpoint).

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the feat/http-memory-occupation-endpoints branch from 20b647f to f292dbb Compare May 27, 2026 04:57
@qywu qywu changed the title feat: expose POST /release_memory_occupation and /resume_memory_occupation feat: POST /release_memory_occupation + /resume_memory_occupation (incl. workspace wrapping + CPU staging) May 27, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f292dbb331

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +201 to +204
self.memory_occupation_manager.release(
stage_to_cpu=recv_req.stage_to_cpu,
tags=recv_req.tags,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Serialize memory release with active generation

When this control request arrives while the server still has queued or in-flight generations, the scheduler processes it at the start of an event-loop iteration and immediately calls pause(), which releases/zeros the same weights, KV cache, and CUDA graph buffers used by active requests. generate_request is protected by model_update_lock.reader_lock, but the release/resume control path does not take the corresponding writer lock, so an orchestrator can hit /release_memory_occupation before traffic has drained and corrupt or fail those requests. Please quiesce/lock this path the same way weight updates do before pausing memory.

Useful? React with 👍 / 👎.

@qywu qywu force-pushed the feat/http-memory-occupation-endpoints branch from f292dbb to 33682b8 Compare May 27, 2026 06:27
@qywu qywu changed the title feat: POST /release_memory_occupation + /resume_memory_occupation (incl. workspace wrapping + CPU staging) feat: expose POST /release_memory_occupation and /resume_memory_occupation May 27, 2026
The previous implementation awaited each ``session.post`` without
inspecting ``response.status``, so the load balancer returned ``200`` even
when individual prefill / decode nodes errored out or responded with a
non-2xx status (e.g. an older server without ``/release_memory_occupation``
or a backend that rejected the request). The orchestrator then assumed
release/resume succeeded everywhere when it had silently skipped on some
nodes.

Switch to ``asyncio.gather(..., return_exceptions=True)`` so we keep the
parallel fan-out, then check every result:

  - exception (timeout, connection refused, ...) → record server + repr.
  - status >= 400 → record server + status + truncated body (capped at
    500 bytes so a misbehaving node can't dump megabytes into our
    response).

If any failures land, raise ``HTTPException(502, detail={...})`` listing
the failing servers and the failed-out-of-total count. Successful runs
still return 200 unchanged.

Addresses chatgpt-codex-connector review comment on lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b198f7f4d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +94 to +96
self.memory_saver = memory_saver or TorchMemorySaverAdapter.create(
server_args.enable_memory_saver
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject memory control when saver is disabled

When a node is started without --enable-memory-saver, this constructs the no-op adapter, but the new HTTP/RPC paths still return success after pause()/resume(). In deployments that accidentally omit the flag, /release_memory_occupation appears to succeed even though no weights, KV cache, or CUDA graph memory was offloaded, so an orchestrator can schedule colocated work based on memory that was never freed. Please surface an error instead of silently acknowledging the no-op path.

Useful? React with 👍 / 👎.

Two robustness holes in ``_broadcast_memory_control`` that the previous
fix didn't cover:

1. ``aiohttp.ClientSession()`` had no timeout, so a wedged prefill or
   decode node would hang the LB's release/resume call indefinitely.
   Add ``aiohttp.ClientTimeout(total=120s)`` via a module-level
   ``_FANOUT_TIMEOUT_SECONDS`` constant. 120 s is generous-but-bounded:
   ``release_memory_occupation`` with ``stage_to_cpu=true`` on a
   72B-class model takes ~6-10 s of host-side memcpy, so the budget
   covers realistic worst cases without making a hung backend invisible.
   The constant is module-level so tests can monkey-patch a shorter
   value to exercise the timeout path.

2. ``await result.text()`` raises ``UnicodeDecodeError`` if a backend
   returns binary / non-UTF-8 bytes in its error body, and that error
   escapes the ``try/finally`` and fails the entire helper instead of
   being reported as a per-node failure. Switch to
   ``(await result.read())[:500].decode("utf-8", errors="replace")``
   so undecodable bytes turn into U+FFFD markers but the helper keeps
   going. Read-then-slice (rather than text-then-slice) also bounds the
   read at the cap, so a 1 GiB error body doesn't get fully buffered.

Both paths are covered by ``/tmp/test_mini_lb_fanout.py`` scenarios 11
(hung backend, 1.5 s test timeout) and 12 (500 with raw ``\xff`` bytes).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be50ec2e84

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# control request here so API callers still get a typed reply.
self.send_func.send_pyobj(FlushCacheReqOutput(success=True))
elif isinstance(recv_req, ReleaseMemoryOccupationReqInput):
self.memory_saver.pause()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor memory-control tags before pausing

When callers use the newly advertised {"tags": [...]} body (for example to release only weights but leave KV/cache regions alone), the scheduler ignores recv_req.tags and calls the unqualified pause()/resume(), which applies to every torch-memory-saver region. In tagged-memory deployments this can release or reoccupy unrelated regions and defeat the caller's requested memory scope; please pass the tags through to the adapter (and ensure allocations are tagged) or reject tagged requests instead of silently treating them as release-all.

Useful? React with 👍 / 👎.

@qywu
Copy link
Copy Markdown
Collaborator Author

qywu commented May 28, 2026

Superseded by #305 which adds a full control-plane HTTP sidecar including these endpoints.

@qywu qywu closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant