feat: expose POST /release_memory_occupation and /resume_memory_occupation by qywu · Pull Request #272 · lightseekorg/tokenspeed

qywu · 2026-05-27T00:08:56Z

Summary

Wires the existing torch_memory_saver-backed engine methods all the way to the HTTP layer, so external orchestrators (e.g. multi-instance memory-pressure controllers, RLHF train↔serve handoffs) can trigger GPU memory release/reclaim over a standard REST call.

io_struct.py — add tags: list[str] | None = None to ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput; the engine was already passing tags= but the empty dataclass caused a TypeError at runtime.
engine_base.py — propagate tags parameter to the abstract interface to keep it consistent with the concrete Engine.
request_handler.py — dispatch ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput in the scheduler process; calls TorchMemorySaverAdapter.pause() / .resume() on all TP ranks (via the existing broadcast mechanism) and sends back the typed reply on rank-0.
http_server.py (new) — lightweight FastAPI server that wraps AsyncLLM and exposes /release_memory_occupation and /resume_memory_occupation alongside /health, /health_generate, and /flush_cache. Intended for direct node access in tests and PD-disaggregated deployments; production deployments continue to use the SMG gRPC servicer.
mini_lb.py — proxy both new endpoints to all prefill and decode servers in parallel, matching the existing /flush_cache fan-out pattern. The fan-out helper is robust against partial failure: per-node errors (HTTP 4xx/5xx, network failures, timeouts) are aggregated into a 502 with structured failures: [...] instead of being silently swallowed into a 200.
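The io_struct.py fix is easiest to see in isolation. The sketch below uses hypothetical stand-in classes (`ReleaseReqBefore`, `ReleaseReqAfter`) and a hypothetical helper `try_build`, not the real tokenspeed definitions. It illustrates why passing `tags=` to a field-less dataclass raises `TypeError` and how an optional field avoids it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReleaseReqBefore:
    """Stand-in for the request type before the fix: no fields at all."""


@dataclass
class ReleaseReqAfter:
    """Stand-in for the request type after the fix: optional tags field."""

    tags: list[str] | None = None


def try_build(cls, **kwargs):
    """Return (instance, None) on success or (None, message) on TypeError."""
    try:
        return cls(**kwargs), None
    except TypeError as exc:
        return None, str(exc)


# The field-less dataclass rejects the keyword the engine was already passing.
before, before_err = try_build(ReleaseReqBefore, tags=["weights"])
# With the optional field, the same call succeeds and the default stays None.
after, after_err = try_build(ReleaseReqAfter, tags=["weights"])
default = ReleaseReqAfter()
```

Because the field defaults to `None`, existing callers that construct the request without arguments are unaffected.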

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | `/release_memory_occupation` | Release GPU memory occupied by tensors inside torch_memory_saver regions. Requires `--enable-memory-saver`. |
| POST | `/resume_memory_occupation` | Re-acquire GPU memory at the same virtual addresses previously released. |

Both endpoints accept an optional JSON body such as {"tags": ["weights", "kv_cache"]}. The tags are forwarded to the engine for future tag-scoped release; for now they are only stored on the request, because torch_memory_saver 0.0.9 operates on all registered regions.
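A minimal client-side sketch of how an orchestrator might build these calls, using only the standard library. The helper name `build_memory_control_request` and the address `http://localhost:8000` are assumptions for illustration; only the two paths and the `tags` body shape come from this PR. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_memory_control_request(base_url, action, tags=None):
    """Build (but do not send) a POST request for a memory-control endpoint.

    action is "release" or "resume". tags is optional and, per the PR,
    currently has no filtering effect on the server side.
    """
    if action not in ("release", "resume"):
        raise ValueError(f"unknown action: {action!r}")
    url = f"{base_url.rstrip('/')}/{action}_memory_occupation"
    body = None if tags is None else json.dumps({"tags": list(tags)}).encode("utf-8")
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Assumed address; replace with the node's real host and port.
release_req = build_memory_control_request(
    "http://localhost:8000", "release", ["weights", "kv_cache"]
)
resume_req = build_memory_control_request("http://localhost:8000/", "resume")
# Sending would be: urllib.request.urlopen(release_req, timeout=30)
```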

Semantics — important

torch_memory_saver.pause() releases the physical GPU pages backing tensors allocated inside saver.region() while reserving the virtual address range. resume() re-allocates physical pages at those same virtual addresses. It does not copy tensor data to CPU on release, and resume() returns memory with uninitialized contents — round-trip data preservation is the consumer's responsibility.

In practice this means:

✅ The GPU memory is genuinely returned to the driver / co-tenants between release and resume.
✅ Tensor Python objects and CUDA Graph captures keep their pointers valid across the cycle.
❌ Weights, KV cache state, and any other tensor contents that lived in the released region are lost. Callers must reload them after resume (typical patterns: reload from disk via update_weights_from_disk, or stage to CPU on the application side before calling release).

See #275 for opt-in CPU staging that transparently restores weights on resume.

`mini_lb.py` fan-out — robust by construction

The fan-out helper (_broadcast_memory_control) addresses four real failure modes that the original draft swallowed:

HTTP 4xx/5xx from any node: e.g. an older server without /release_memory_occupation returning 404, or a backend that rejects the body. Aggregated into a 502 listing the per-node {server, status, detail}.
Network-level failures: connection refused, DNS failure, etc. Caught via asyncio.gather(..., return_exceptions=True) and reported as {server, error} without a status key.
Wedged / hanging backend: bounded by a module-level _FANOUT_TIMEOUT_SECONDS = 120.0 on the aiohttp.ClientSession. 120 s covers realistic worst cases (e.g. stage_to_cpu=true on a 72B model is ~6-10 s of host-side memcpy) while making sure the orchestrator can never block indefinitely on a hung worker.
Non-UTF-8 / binary error bodies: a backend returning \xff\xff... in a 500 body would previously raise UnicodeDecodeError out of await result.text() and fail the entire helper. Replaced with (await result.read())[:500].decode("utf-8", errors="replace") so undecodable bytes turn into U+FFFD markers and the reported detail is capped at 500 bytes regardless of body size. Note that read() still buffers the full body; bounding the read itself would need result.content.read(500).

Successful runs still return 200; only partial / total failure raises 502.
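The aggregation logic described above can be sketched without aiohttp. In this sketch `fan_out`, `fake_post`, and the `"N/M nodes failed"` message format are hypothetical; `post` is an injected coroutine standing in for the real HTTP call, and `asyncio.wait_for` is swapped in for the session-level aiohttp timeout used by the PR. It is not the body of `_broadcast_memory_control`.

```python
import asyncio

DETAIL_CAP_BYTES = 500  # matches the cap described in the PR


async def fan_out(servers, post, timeout_s):
    """Call post(server) on every server in parallel and collect failures.

    post is an injected coroutine function returning (status, body_bytes).
    """

    async def call(server):
        return await asyncio.wait_for(post(server), timeout=timeout_s)

    results = await asyncio.gather(
        *(call(s) for s in servers), return_exceptions=True
    )
    failures = []
    for server, result in zip(servers, results):
        if isinstance(result, BaseException):
            # Network error or timeout: no HTTP status is available.
            failures.append({"server": server, "error": repr(result)})
            continue
        status, body = result
        if status >= 400:
            # Slice the bytes first, then decode leniently.
            detail = body[:DETAIL_CAP_BYTES].decode("utf-8", errors="replace")
            failures.append({"server": server, "status": status, "detail": detail})
    if failures:
        message = f"{len(failures)}/{len(servers)} nodes failed"
        return 502, {"message": message, "failures": failures}
    return 200, {"failures": []}


async def fake_post(server):
    """Fake backend: behaviour is selected by the server name."""
    if server == "hang":
        await asyncio.sleep(5)
    if server == "dead":
        raise ConnectionRefusedError("connection refused")
    if server == "bad":
        return 500, b"\xff\xfe" + b"x" * 1000
    return 200, b"ok"


status, payload = asyncio.run(
    fan_out(["good", "bad", "dead", "hang"], fake_post, timeout_s=0.1)
)
ok_status, ok_payload = asyncio.run(fan_out(["good"], fake_post, timeout_s=0.1))
```

Injecting `post` keeps the aggregation testable without real sockets; the healthy backend still completes even though its peers fail or time out.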

Verification on H100

Measured on H100 (80 GB HBM3, GPU 0) with Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager.

| Phase | GPU used | Δ |
| --- | --- | --- |
| Baseline (no engine) | 627 MiB | n/a |
| After model load | 41,431 MiB | engine footprint: +40,804 MiB |
| After `/release_memory_occupation` | 1,585 MiB | −39,846 MiB = 97.7 % of footprint |
| After `/resume_memory_occupation` | 41,533 MiB | +102 MiB vs. load (allocator overhead) |

| Latency | Value |
| --- | --- |
| Release | 23 ms |
| Resume | 18 ms |

The 97.7 % reclaim ratio is the full-stack measurement (this PR + #273 empty_cache/ipc_collect + #274 workspaces). The standalone-PR figure from synthetic testing was 90.7 %; the extra ~7 pp comes from the stacked improvements.
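For readers who want to check the figures, the arithmetic behind the table can be reproduced directly from the four measured values. The variable names are illustrative; the numbers are the ones reported above.

```python
# Measured GPU memory in MiB, copied from the verification table.
baseline_mib = 627
after_load_mib = 41_431
after_release_mib = 1_585
after_resume_mib = 41_533

# Engine footprint is what model load added on top of the idle baseline.
footprint_mib = after_load_mib - baseline_mib
# Reclaimed memory is the drop from the loaded state to the released state.
reclaimed_mib = after_load_mib - after_release_mib
reclaim_pct = 100 * reclaimed_mib / footprint_mib
# Difference between the post-resume and post-load readings.
resume_delta_mib = after_resume_mib - after_load_mib
# Gain in percentage points over the 90.7 % standalone figure.
stacked_gain_pp = round(reclaim_pct, 1) - 90.7

print(f"footprint={footprint_mib} MiB, reclaimed={reclaimed_mib} MiB "
      f"({reclaim_pct:.1f} %), resume delta={resume_delta_mib} MiB")
```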

A round-trip generation parity check (output before release vs. output after resume) passes when combined with #275's stage_to_cpu=true flow. Without staging, weight contents are not preserved across resume (as documented above) and the engine produces garbage until weights are reloaded by other means.

Env caveats during testing (pre-existing, unrelated to this PR — fixes opened separately):

fix(mha): don't compute FA3 scheduler metadata for non-FA3 backends #276 fixes a scheduler_metadata kwarg leak in MHAAttnBackend that breaks --attention-backend=triton / fa4 / flashinfer.
fix(thirdparty/triton_kernels): tolerate upstream module rename #277 fixes an import triton_kernels.matmul failure caused by upstream module renames.
Set FLASHINFER_DISABLE_VERSION_CHECK=1 and LD_PRELOAD=$(pwd)/.venv/.../torch_memory_saver_hook_mode_preload_cu13.abi3.so to launch the test process.

`mini_lb.py` fan-out — verification

End-to-end test (/tmp/test_mini_lb_fanout.py) spins up real aiohttp.web backends on dynamic ports and drives the LB via FastAPI TestClient. All 12 scenarios pass:

| # | Scenario | Outcome |
| --- | --- | --- |
| 1 | 1 good + 1 returning 500 + 1 returning 404 | `502` with both failing servers itemised |
| 2 | All backends return 200 | `200` |
| 3 | `/resume_memory_occupation` with one bad backend | `502` with the bad server |
| 4 | Backend returns 8 KiB error body | `detail` capped at exactly 500 B |
| 5 | Dead port (connection refused) | `502` with `error:` (no `status:`) |
| 6 | Every backend returns 500 (no successes) | `502` with `2/2 nodes` in message |
| 7 | `{stage_to_cpu, tags}` body forwarded verbatim | body received intact by every backend |
| 8 | No request body sent | backends see no body, fan-out still completes |
| 9 | Backend returns 202 (other 2xx) | treated as success → `200` |
| 10 | Empty `decode_servers` and empty everything | `200` (no-op fan-out) |
| 11 | Backend hangs (`await asyncio.sleep(60)`), helper timeout 1.5 s | returned in 1.50 s with timeout reported; the parallel good backend still succeeded |
| 12 | Backend returns 500 with raw `\xff\xfe\xfd\xfc` bytes | `502` with `len(detail)=128`, U+FFFD replacement marker present, no UnicodeDecodeError |

Test plan

Engine import path: from tokenspeed.runtime.entrypoints.engine import Engine (verified after env fixes).
Engine(enable_memory_saver=True) startup completes; baseline → after-load GPU memory matches gpu_memory_utilization.
engine.release_memory_occupation() returns; GPU memory drops by ~97 % of engine footprint.
engine.resume_memory_occupation() returns; GPU memory restored.
Regression: Engine.release_memory_occupation(tags=["weights"]) no longer raises TypeError.
mini_lb.py fan-out propagates per-node failures (HTTP 4xx/5xx, connection refused, all-fail, body forwarding, 2xx variants, empty configs, hung backend timeout, binary error body) — see the 12-scenario table above.

Notes / follow-ups

The SMG gRPC servicer does not yet expose these methods; adding gRPC methods to smg_grpc_proto and smg_grpc_servicer is left as follow-up work so the endpoints become reachable through the production ts serve path.
tags filtering (releasing only a subset of registered regions) is not yet implemented in torch_memory_saver; the field is plumbed through for forward-compatibility.
CPU staging for transparent round-trip weight preservation is in feat(memory-saver): optional CPU staging for round-trip weight preservation #275.
_FANOUT_TIMEOUT_SECONDS is currently a module-level constant (120 s). If real deployments need per-call tuning we can promote it to a server arg or query parameter.

…ation

Wires the existing torch_memory_saver-backed engine methods all the way to the HTTP layer, so external orchestrators (e.g. multi-instance memory-pressure controllers) can trigger offload/restore over a standard REST call.

Changes:
- io_struct.py: add `tags` field to ReleaseMemoryOccupationReqInput / ResumeMemoryOccupationReqInput (the engine was already passing `tags=` but the dataclass had no such field, causing a TypeError at runtime).
- engine_base.py: propagate `tags` parameter to the abstract interface.
- request_handler.py: dispatch ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput in the scheduler process, calling TorchMemorySaverAdapter.pause() / .resume() on every TP rank.
- http_server.py (new): lightweight FastAPI server that wraps AsyncLLM and exposes /release_memory_occupation and /resume_memory_occupation alongside /health, /health_generate, and /flush_cache. Intended for direct node access in tests and PD-disaggregated deployments.
- mini_lb.py: proxy both new endpoints to all prefill and decode servers in parallel, matching the existing /flush_cache fan-out pattern.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5a9bc3d4e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T04:28:03Z

+        for coro in asyncio.as_completed(tasks):
+            await coro


Propagate backend failures from memory fan-out

When a prefill/decode node returns a non-2xx response for either new memory endpoint (for example because the node is running an older server without /release_memory_occupation, or because the backend rejects the request), this loop only awaits the request and never inspects response.status, so the load balancer still returns 200 to the orchestrator even though some nodes did not release/resume memory. Please check each ClientResponse and propagate a failure status/detail before reporting success.

Useful? React with 👍 / 👎.

…ause (lightseekorg#273) Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…ause

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages remain ours, so co-tenants can't use the headroom that pause() just freed.

Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() to surrender those bytes as well. On a typical workload this recovers an additional few hundred MiB on top of saver.pause() alone (the exact number depends on allocator fragmentation at the time of release).

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…r.region()

The base PR (lightseekorg#272) reclaimed ~90.7% of the engine footprint by releasing the regions weight_loader / KV cache modules wrap. The remaining ~9.3% includes:
- CUDA graph captures (CUDAGraph objects + the static input/output tensors they capture)
- Attention-backend workspace buffers: TRTLLM_MHA (512 MiB) and flashinfer prefill workspace under flashmla

This change brings those allocations inside torch_memory_saver.region() so they're released by /release_memory_occupation alongside the rest.

Changes:
- BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without globals.
- trtllm.py, flashmla.py: allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True.
- cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture() in adapter.region() so the CUDAGraph captures and their persistent static buffers live inside the released region. Static-tensor contents are zeroed across pause/resume but callers overwrite them on every replay, so this is benign.
- model_executor.py: thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single.

Expected impact: +5-9% reclaim on top of lightseekorg#272, bringing the total to roughly 97-99% of the engine footprint on H100. Exact gain depends on how many graphs are captured and which attention backend is selected.

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…vation

contents are gone: torch_memory_saver.pause() preserves virtual addresses only, not data. /resume_memory_occupation gets zeroed pages back, so any caller that wants the model to work after a release/resume cycle has to re-read the checkpoint from disk. For RLHF train↔serve handoff and similar "pause inference, return GPU, resume inference" flows that's tens of GiB of disk I/O on every cycle.

Adds an opt-in CPU staging step: POST /release_memory_occupation {"stage_to_cpu": true}
- Before saver.pause(), copy every param and buffer of the (target and draft) model into a pre-allocated pinned host buffer.
- After saver.resume(), copy them back.

On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16 vs ~60-180 s to re-read the same weights from network storage.

Changes:
- New: tokenspeed/runtime/engine/memory_occupation_manager.py with the MemoryOccupationManager class: pin/unpin host buffers, drive saver.pause()/resume(), and reuse buffers across cycles.
- io_struct.py: add stage_to_cpu: bool = False to ReleaseMemoryOccupationReqInput.
- request_handler.py: replace direct memory_saver calls with the manager so the staging path runs in the scheduler process where the model_runner is live.
- event_loop.py: construct the manager with the real target/draft model_runner and pass it into RequestHandler.
- http_server.py: parse stage_to_cpu from the JSON body.

Trade-offs:
- Host RAM hold = ~sizeof(model) for the duration of the release.
- Staging adds a few seconds to the release path; without it /release completes in ~1 s.
- Does not stage KV cache or request pools (those are scratch; the engine flushes them before any reasonable use of this endpoint).

Stacked on top of lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f292dbb331

chatgpt-codex-connector · 2026-05-27T05:00:54Z

+                self.memory_occupation_manager.release(
+                    stage_to_cpu=recv_req.stage_to_cpu,
+                    tags=recv_req.tags,
+                )


Serialize memory release with active generation

When this control request arrives while the server still has queued or in-flight generations, the scheduler processes it at the start of an event-loop iteration and immediately calls pause(), which releases/zeros the same weights, KV cache, and CUDA graph buffers used by active requests. generate_request is protected by model_update_lock.reader_lock, but the release/resume control path does not take the corresponding writer lock, so an orchestrator can hit /release_memory_occupation before traffic has drained and corrupt or fail those requests. Please quiesce/lock this path the same way weight updates do before pausing memory.
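One way to picture the requested quiescing is a small gate that lets many generations run concurrently but makes a release wait until in-flight work has drained. The `DrainGate` class below is a toy asyncio sketch written for illustration; it is not the project's `model_update_lock`, and the `pause` / `resume` callables are stand-ins for whatever the scheduler would actually invoke.

```python
import asyncio


class DrainGate:
    """Toy gate: many concurrent generations, one exclusive memory control."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._active = 0       # number of in-flight generations
        self._paused = False   # True while memory is released

    async def generate(self, work):
        async with self._cond:
            # New generations wait while memory is released.
            await self._cond.wait_for(lambda: not self._paused)
            self._active += 1
        try:
            return await work()
        finally:
            async with self._cond:
                self._active -= 1
                self._cond.notify_all()

    async def release(self, pause):
        async with self._cond:
            self._paused = True  # block new generations first
            # Then wait for in-flight generations to finish.
            await self._cond.wait_for(lambda: self._active == 0)
            pause()

    async def resume(self, resume):
        async with self._cond:
            resume()
            self._paused = False
            self._cond.notify_all()


async def demo():
    gate = DrainGate()
    events = []

    async def work():
        events.append("gen-start")
        await asyncio.sleep(0.05)
        events.append("gen-end")

    gen = asyncio.create_task(gate.generate(work))
    await asyncio.sleep(0.01)  # let the generation start
    await gate.release(lambda: events.append("pause"))
    await gate.resume(lambda: events.append("resume"))
    await gen
    return events


events = asyncio.run(demo())
```

In the demo the release is requested while a generation is in flight, and the pause callback only runs after that generation has ended.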

The previous implementation awaited each ``session.post`` without inspecting ``response.status``, so the load balancer returned ``200`` even when individual prefill / decode nodes errored out or responded with a non-2xx status (e.g. an older server without ``/release_memory_occupation`` or a backend that rejected the request). The orchestrator then assumed release/resume succeeded everywhere when it had silently skipped on some nodes.

Switch to ``asyncio.gather(..., return_exceptions=True)`` so we keep the parallel fan-out, then check every result:
- exception (timeout, connection refused, ...) → record server + repr.
- status >= 400 → record server + status + truncated body (capped at 500 bytes so a misbehaving node can't dump megabytes into our response).

If any failures land, raise ``HTTPException(502, detail={...})`` listing the failing servers and the failed-out-of-total count. Successful runs still return 200 unchanged.

Addresses chatgpt-codex-connector review comment on lightseekorg#272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b198f7f4d

chatgpt-codex-connector · 2026-05-27T06:51:29Z

+        self.memory_saver = memory_saver or TorchMemorySaverAdapter.create(
+            server_args.enable_memory_saver
+        )


Reject memory control when saver is disabled

When a node is started without --enable-memory-saver, this constructs the no-op adapter, but the new HTTP/RPC paths still return success after pause()/resume(). In deployments that accidentally omit the flag, /release_memory_occupation appears to succeed even though no weights, KV cache, or CUDA graph memory was offloaded, so an orchestrator can schedule colocated work based on memory that was never freed. Please surface an error instead of silently acknowledging the no-op path.
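A minimal sketch of the guard the review asks for, assuming the adapter exposes some way to tell whether it is the no-op variant. The `enabled` attribute, the `NoOpSaver` / `RecordingSaver` classes, `MemorySaverDisabledError`, and `release_memory` are all hypothetical names for illustration; the real `TorchMemorySaverAdapter` may expose this differently.

```python
class MemorySaverDisabledError(RuntimeError):
    """Raised when memory control is requested but the saver is a no-op."""


class NoOpSaver:
    """Hypothetical stand-in for the adapter built without the flag."""

    enabled = False

    def pause(self):
        pass


class RecordingSaver:
    """Hypothetical stand-in for an enabled adapter; records the pause call."""

    enabled = True

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True


def release_memory(saver):
    """Pause the saver, refusing to report success on the no-op path."""
    if not saver.enabled:
        raise MemorySaverDisabledError(
            "memory saver is disabled; start the node with --enable-memory-saver"
        )
    saver.pause()
    return {"success": True}


real = RecordingSaver()
ok_reply = release_memory(real)

try:
    release_memory(NoOpSaver())
    disabled_error = None
except MemorySaverDisabledError as exc:
    disabled_error = str(exc)
```

An HTTP layer could map this exception to a 4xx response so that an orchestrator never counts memory that was not actually freed.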

Two robustness holes in ``_broadcast_memory_control`` that the previous fix didn't cover:

1. ``aiohttp.ClientSession()`` had no timeout, so a wedged prefill or decode node would hang the LB's release/resume call indefinitely. Add ``aiohttp.ClientTimeout(total=120)`` via a module-level ``_FANOUT_TIMEOUT_SECONDS`` constant. 120 s is generous-but-bounded: ``release_memory_occupation`` with ``stage_to_cpu=true`` on a 72B-class model takes ~6-10 s of host-side memcpy, so the budget covers realistic worst cases without making a hung backend invisible. The constant is module-level so tests can monkey-patch a shorter value to exercise the timeout path.
2. ``await result.text()`` raises ``UnicodeDecodeError`` if a backend returns binary / non-UTF-8 bytes in its error body, and that error escapes the ``try/finally`` and fails the entire helper instead of being reported as a per-node failure. Switch to ``(await result.read())[:500].decode("utf-8", errors="replace")`` so undecodable bytes turn into U+FFFD markers but the helper keeps going. Slicing the bytes before decoding caps the reported detail at 500 bytes. Note that ``read()`` still buffers the full body, so bounding the read itself for a very large error body would need ``result.content.read(500)``.

Both paths are covered by ``/tmp/test_mini_lb_fanout.py`` scenarios 11 (hung backend, 1.5 s test timeout) and 12 (500 with raw ``\xff`` bytes).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be50ec2e84

chatgpt-codex-connector · 2026-05-27T07:06:48Z

                # control request here so API callers still get a typed reply.
                self.send_func.send_pyobj(FlushCacheReqOutput(success=True))
+            elif isinstance(recv_req, ReleaseMemoryOccupationReqInput):
+                self.memory_saver.pause()


Honor memory-control tags before pausing

When callers use the newly advertised {"tags": [...]} body (for example to release only weights but leave KV/cache regions alone), the scheduler ignores recv_req.tags and calls the unqualified pause()/resume(), which applies to every torch-memory-saver region. In tagged-memory deployments this can release or reoccupy unrelated regions and defeat the caller's requested memory scope; please pass the tags through to the adapter (and ensure allocations are tagged) or reject tagged requests instead of silently treating them as release-all.

qywu · 2026-05-28T23:45:31Z

Superseded by #305 which adds a full control-plane HTTP sidecar including these endpoints.

qywu marked this pull request as ready for review May 27, 2026 00:30

qywu requested a review from a team as a code owner May 27, 2026 00:30

chatgpt-codex-connector Bot reviewed May 27, 2026

feat(memory-saver): flush PyTorch alloc cache and IPC handles after p…

33682b8

…ause (lightseekorg#273) Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu force-pushed the feat/http-memory-occupation-endpoints branch from 20b647f to f292dbb Compare May 27, 2026 04:57

qywu changed the title ~~feat: expose POST /release_memory_occupation and /resume_memory_occupation~~ feat: POST /release_memory_occupation + /resume_memory_occupation (incl. workspace wrapping + CPU staging) May 27, 2026

chatgpt-codex-connector Bot reviewed May 27, 2026

qywu force-pushed the feat/http-memory-occupation-endpoints branch from f292dbb to 33682b8 Compare May 27, 2026 06:27

qywu changed the title ~~feat: POST /release_memory_occupation + /resume_memory_occupation (incl. workspace wrapping + CPU staging)~~ feat: expose POST /release_memory_occupation and /resume_memory_occupation May 27, 2026

chatgpt-codex-connector Bot reviewed May 27, 2026

qywu closed this May 28, 2026

qywu wants to merge 4 commits into lightseekorg:main from qywu:feat/http-memory-occupation-endpoints
