feat(memory-saver): optional CPU staging for round-trip weight preservation by qywu · Pull Request #275 · lightseekorg/tokenspeed

qywu · 2026-05-27T00:45:25Z

Stacks on #272.

Summary

#272's /release_memory_occupation does release GPU memory, but the contents are lost: torch_memory_saver.pause() preserves only the virtual addresses, not the data. /resume_memory_occupation hands back zeroed pages, so any caller that wants the model to work after a release/resume cycle has to re-read the checkpoint from disk. For RLHF train↔serve handoff and similar "pause inference, return the GPU, resume inference" flows, that means tens of GiB of disk I/O on every cycle.

This PR adds an opt-in CPU staging step:

POST /release_memory_occupation
{"stage_to_cpu": true}

Before saver.pause(), copy every param and buffer of the (target and draft) model into a pre-allocated pinned host buffer.
After saver.resume(), copy them back into the original GPU virtual addresses (so any CUDAGraph captures keep their argument pointers valid).
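The two steps above can be sketched in plain PyTorch. This is a minimal illustration, not the PR's MemoryOccupationManager: the class name HostStaging and its methods are hypothetical, torch_memory_saver is not called at all, and zeroing the tensors in place stands in for the pause()/resume() pair. It shows the two properties the description relies on: restored tensors keep their original addresses, and host buffers are allocated once and reused.

```python
import itertools

import torch


class HostStaging:
    """Illustrative stand-in for the staging logic (hypothetical name)."""

    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.host = {}  # tensor name -> host buffer, reused across cycles

    def _tensors(self):
        # Parameters and buffers together, as the description specifies.
        return itertools.chain(self.model.named_parameters(),
                               self.model.named_buffers())

    @torch.no_grad()
    def stage(self):
        pin = torch.cuda.is_available()  # pinning needs a CUDA runtime
        for name, t in self._tensors():
            buf = self.host.get(name)
            if buf is None or buf.shape != t.shape or buf.dtype != t.dtype:
                buf = torch.empty(t.shape, dtype=t.dtype, device="cpu",
                                  pin_memory=pin)
                self.host[name] = buf
            buf.copy_(t, non_blocking=t.is_cuda)
        if pin:
            torch.cuda.synchronize()  # wait for async device-to-host copies

    @torch.no_grad()
    def restore(self):
        for name, t in self._tensors():
            # In-place copy: the tensor keeps its storage and its address.
            t.copy_(self.host[name], non_blocking=t.is_cuda)
        if torch.cuda.is_available():
            torch.cuda.synchronize()


model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
staging = HostStaging(model)
expected = {k: v.clone() for k, v in model.state_dict().items()}
ptrs = {n: t.data_ptr() for n, t in staging._tensors()}

staging.stage()
with torch.no_grad():
    for _, t in staging._tensors():
        t.zero_()  # stand-in for pause()/resume() handing back zeroed pages
staging.restore()
```

Copying in place, rather than rebinding each parameter to a new tensor, is what keeps previously captured pointers valid in this sketch.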

Changes

New tokenspeed/runtime/engine/memory_occupation_manager.py — MemoryOccupationManager class. Owns the pinned host buffers, drives saver.pause()/resume(), and reuses staging buffers across cycles so a steady-state RLHF loop doesn't reallocate ~145 GiB of pinned host RAM every iteration.
io_struct.py — add stage_to_cpu: bool = False to ReleaseMemoryOccupationReqInput.
request_handler.py — replace direct memory_saver calls with the manager so the staging path runs in the scheduler process where model_runner.model is live.
event_loop.py — construct the manager with the real target + draft model_runner and pass it into RequestHandler.
http_server.py — parse stage_to_cpu from the JSON body.
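As a rough illustration of the io_struct.py and http_server.py changes, the request type and body parsing might look like the sketch below. Only the class name, the stage_to_cpu field and its default, and the existence of a tags field come from the description; the type of tags, the helper parse_release_body, and the validation rules are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReleaseMemoryOccupationReqInput:
    # The PR says `tags` is plumbed but unused; its type here is a guess.
    tags: Optional[List[str]] = None
    # Opt-in flag added by this PR; False keeps the #272 behavior.
    stage_to_cpu: bool = False


def parse_release_body(raw: bytes) -> ReleaseMemoryOccupationReqInput:
    """Hypothetical helper: turn a JSON request body into the request type."""
    body = json.loads(raw) if raw.strip() else {}  # empty body -> defaults
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    stage = body.get("stage_to_cpu", False)
    if not isinstance(stage, bool):
        raise ValueError("stage_to_cpu must be a boolean")
    return ReleaseMemoryOccupationReqInput(tags=body.get("tags"),
                                           stage_to_cpu=stage)


req = parse_release_body(b'{"stage_to_cpu": true}')
default_req = parse_release_body(b"")
```

Defaulting to False means existing callers that send no body keep the unstaged behavior.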

Verification on H100 — round-trip correctness PASS

Measured on Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager:

Phase	GPU used	Δ
After model load	41,431 MiB	engine footprint: +40,804 MiB
After `release(stage_to_cpu=true)`	1,593 MiB	−39,838 MiB = 97.6 % of footprint
After `resume()` (restores from CPU)	41,541 MiB	+110 MiB vs. load
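The figures in the table can be cross-checked with a few lines of arithmetic. The baseline value is an inference (GPU memory in use before the engine loaded, taken as the after-load reading minus the reported footprint), not a number stated in the PR.

```python
after_load_mib = 41_431     # "After model load"
footprint_mib = 40_804      # reported engine footprint
after_release_mib = 1_593   # "After release(stage_to_cpu=true)"
after_resume_mib = 41_541   # "After resume()"

freed_mib = after_load_mib - after_release_mib        # 39,838 MiB released
freed_pct = round(100 * freed_mib / footprint_mib, 1) # share of the footprint
baseline_mib = after_load_mib - footprint_mib         # inferred pre-load usage
residual_mib = after_release_mib - baseline_mib       # footprint still resident
resume_delta_mib = after_resume_mib - after_load_mib  # drift after one cycle

print(freed_mib, freed_pct, baseline_mib, residual_mib, resume_delta_mib)
# prints: 39838 97.6 627 966 110
```

Under that reading, about 966 MiB of the engine footprint stays resident after release, which accounts for the remaining 2.4 %.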

Latency	Value
Release (incl. staging)	1.70 s
Resume (incl. restore)	0.08 s

Generation parity:

Phase	Prompt → Output
Pre-release	`"The capital of France is"` → `" ______.\nA. Paris\nB."`
Post-resume	`"The capital of France is"` → `" ______.\nA. Paris\nB."`

Outputs match: True — round-trip preservation works end-to-end.

For Qwen2-1.5B (~3 GiB bf16 weights), the 1.7 s release latency is dominated by the DtoH memcpy of weights into pinned host RAM. Extrapolating to Qwen2-72B (~145 GiB) at PCIe Gen4 x16 (~25 GiB/s pinned): release ≈ 6-8 s; resume ≈ 6-8 s. Compared to ~60-180 s to re-read a 145 GiB checkpoint from network storage, that's a ~10-20× speedup for the train↔serve handoff path.
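The extrapolation can be reproduced as a back-of-the-envelope calculation. The sketch below assumes the copy runs at exactly the quoted pinned bandwidth with no other overhead, so it gives a lower bound; the 6-8 s quoted above leaves some headroom over it. The function name is illustrative.

```python
def copy_seconds(size_gib: float, bandwidth_gib_s: float) -> float:
    """Ideal transfer time: size divided by sustained bandwidth."""
    return size_gib / bandwidth_gib_s


weights_gib = 145.0           # Qwen2-72B bf16, from the PR
pinned_bw_gib_s = 25.0        # PCIe Gen4 x16 pinned, from the PR
reload_range_s = (60.0, 180.0)  # checkpoint re-read from network storage

one_way_s = copy_seconds(weights_gib, pinned_bw_gib_s)  # about 5.8 s
round_trip_s = 2 * one_way_s                            # about 11.6 s

# Speedup depends on what the reload is compared against.
speedup_vs_one_way = tuple(t / one_way_s for t in reload_range_s)
speedup_vs_round_trip = tuple(t / round_trip_s for t in reload_range_s)
```

Against a single copy the reload is roughly 10-31× slower; against the full release-plus-resume round trip it is roughly 5-16× slower, so the quoted speedup range sits between the two readings.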

Trade-offs

Host RAM: holds ~sizeof(model) for the duration of the release. For a 72B bf16 model that's ~145 GiB of pinned host RAM.
Release latency: with staging, ~1.7 s for a 1.5B model; without, 23 ms. Resume latency goes from 18 ms to 80 ms.
Scope: stages model parameters + buffers only. KV cache and req_to_token_pool are scratch — the engine flushes outstanding requests before any reasonable use of this endpoint, so there's nothing in them worth preserving.
Tag-based staging: not yet implemented. The tags field is plumbed but unused; a future PR can use it to stage only a subset (e.g. weights but not KV).

Test plan

Round-trip correctness: release with stage_to_cpu=true, resume, run inference; outputs match the pre-release run.
Measure release time / resume time with and without staging on a representative model.
Verify stage_to_cpu=false continues to behave as in #272 (feat: expose POST /release_memory_occupation and /resume_memory_occupation): a release/resume cycle returns zeroed pages, and the caller must reload weights.
Cycle test: 5x release→resume in a loop, host RAM usage stable (no leak from staging buffer reallocation).
Verify with CUDA graphs enabled (blocked locally by the unrelated FA3 issue; see #276 and #277).
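The timing and cycle items in the plan above could be driven by a small client like the following sketch. The two endpoint paths and the stage_to_cpu field come from this PR; the base URL, the empty JSON body sent to the resume endpoint, and all function names are assumptions, and the response format is not inspected because the description does not specify it.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to the running server


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_post(path: str, payload: dict) -> float:
    """Send the request and return wall-clock seconds; raises on HTTP errors."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_post(path, payload)) as resp:
        resp.read()
    return time.perf_counter() - start


def run_cycles(n: int = 5, stage_to_cpu: bool = True):
    """Run n release/resume cycles and collect (release_s, resume_s) pairs."""
    timings = []
    for _ in range(n):
        release_s = timed_post("/release_memory_occupation",
                               {"stage_to_cpu": stage_to_cpu})
        resume_s = timed_post("/resume_memory_occupation", {})
        timings.append((release_s, resume_s))
    return timings


example = build_post("/release_memory_occupation", {"stage_to_cpu": True})
```

Calling run_cycles(5) with and without staging gives the latency comparison; host RAM stability still has to be observed separately on the server side, for example by watching the scheduler process's resident memory between cycles.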

Env caveats during testing (unrelated to this PR)

#276 (fix(mha): don't compute FA3 scheduler metadata for non-FA3 backends) fixes a scheduler_metadata kwarg leak in MHAAttnBackend that breaks --attention-backend=triton / fa4 / flashinfer.
#277 (fix(thirdparty/triton_kernels): tolerate upstream module rename) fixes an import triton_kernels.matmul failure caused by upstream module renames.

Open questions

Do we want a config knob for pin_memory (currently default-on)? On hosts without enough pinned-RAM budget, unpinned staging works at ~6-8 GiB/s vs. ~25 GiB/s pinned.
Should the staging buffer live on a separate NUMA node when available? Out of scope here.

…vation

#272's /release_memory_occupation truthfully releases GPU memory but the contents are gone: torch_memory_saver.pause() preserves virtual addresses only, not data. /resume_memory_occupation gets zeroed pages back, so any caller that wants the model to work after a release/resume cycle has to re-read the checkpoint from disk. For RLHF train↔serve handoff and similar "pause inference, return GPU, resume inference" flows that's tens of GiB of disk I/O on every cycle.

Adds an opt-in CPU staging step:

POST /release_memory_occupation
{"stage_to_cpu": true}

- Before saver.pause(), copy every param and buffer of the (target and draft) model into a pre-allocated pinned host buffer.
- After saver.resume(), copy them back.

On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16 vs ~60-180 s to re-read the same weights from network storage.

Changes:
- New: tokenspeed/runtime/engine/memory_occupation_manager.py with the MemoryOccupationManager class: pin/unpin host buffers, drive saver.pause()/resume(), and reuse buffers across cycles.
- io_struct.py: add stage_to_cpu: bool = False to ReleaseMemoryOccupationReqInput.
- request_handler.py: replace direct memory_saver calls with the manager so the staging path runs in the scheduler process where the model_runner is live.
- event_loop.py: construct the manager with the real target/draft model_runner and pass it into RequestHandler.
- http_server.py: parse stage_to_cpu from the JSON body.

Trade-offs:
- Host RAM hold = ~sizeof(model) for the duration of the release.
- Staging adds a few seconds to the release path; without it /release completes in ~1 s.
- Does not stage KV cache or request pools (those are scratch; the engine flushes them before any reasonable use of this endpoint).

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu wants to merge 1 commit into feat/http-memory-occupation-endpoints from feat/memory-saver-cpu-staging
