feat(memory-saver): optional CPU staging for round-trip weight preservation #275

Draft
qywu wants to merge 1 commit into feat/http-memory-occupation-endpoints from feat/memory-saver-cpu-staging

@qywu qywu commented May 27, 2026

Stacks on #272.

Summary

#272's /release_memory_occupation truthfully releases GPU memory but the contents are gone — torch_memory_saver.pause() preserves virtual addresses only, not data. /resume_memory_occupation gets zeroed pages back, so any caller that wants the model to work after a release/resume cycle has to re-read the checkpoint from disk. For RLHF train↔serve handoff and similar "pause inference, return GPU, resume inference" flows that's tens of GiB of disk I/O on every cycle.

This PR adds an opt-in CPU staging step:

POST /release_memory_occupation
{"stage_to_cpu": true}
  • Before saver.pause(), copy every param and buffer of the (target and draft) model into a pre-allocated pinned host buffer.
  • After saver.resume(), copy them back into the original GPU virtual addresses (so any CUDAGraph captures keep their argument pointers valid).
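From the caller's side, a release/resume cycle might look like the sketch below. It only builds the two requests and does not send them. The base URL, the use of POST with an empty JSON body for /resume_memory_occupation, and the helper name `build_post` are assumptions for illustration; only the two endpoint paths and the `stage_to_cpu` field come from this PR.

```python
import json
import urllib.request

# Hypothetical address of a running server; adjust to the actual deployment.
BASE_URL = "http://localhost:8000"


def build_post(path, payload=None):
    """Build (but do not send) a JSON POST request for the given endpoint path."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Opt in to CPU staging so the weights survive the release.
release_req = build_post("/release_memory_occupation", {"stage_to_cpu": True})

# Assumption: resume takes no staging flag and restores whatever was staged.
resume_req = build_post("/resume_memory_occupation")

# Against a live server, each request would be sent with:
#   urllib.request.urlopen(release_req)
```

Omitting `stage_to_cpu` keeps the behavior from #272, since the field defaults to false.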

Changes

  • New tokenspeed/runtime/engine/memory_occupation_manager.py with the MemoryOccupationManager class. Owns the pinned host buffers, drives saver.pause()/resume(), and reuses staging buffers across cycles so a steady-state RLHF loop doesn't reallocate ~145 GiB of pinned host RAM every iteration.
  • io_struct.py — add stage_to_cpu: bool = False to ReleaseMemoryOccupationReqInput.
  • request_handler.py — replace direct memory_saver calls with the manager so the staging path runs in the scheduler process where model_runner.model is live.
  • event_loop.py — construct the manager with the real target + draft model_runner and pass it into RequestHandler.
  • http_server.py — parse stage_to_cpu from the JSON body.
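The staging and buffer-reuse idea can be sketched in a few lines of PyTorch. This is not the code in memory_occupation_manager.py: the class name `StagingSketch` and its methods are hypothetical, and the calls to saver.pause() and saver.resume() are left as comments because their exact invocation is not shown in this description.

```python
import torch
from torch import nn


class StagingSketch:
    """Hypothetical stand-in for MemoryOccupationManager (not the PR's code)."""

    def __init__(self, model: nn.Module, pin_memory: bool = True):
        self.model = model
        # Pinned allocations need a CUDA runtime; otherwise use pageable memory.
        self.pin_memory = pin_memory and torch.cuda.is_available()
        # Host buffers are kept across cycles so they are allocated only once.
        self.host_buffers: dict[str, torch.Tensor] = {}

    def _tensors(self):
        # Parameters and buffers, matching the scope described in this PR.
        yield from self.model.named_parameters()
        yield from self.model.named_buffers()

    def stage(self) -> None:
        """Copy every tensor to host memory. saver.pause() would follow."""
        for name, t in self._tensors():
            buf = self.host_buffers.get(name)
            if buf is None or buf.shape != t.shape or buf.dtype != t.dtype:
                buf = torch.empty(
                    t.shape, dtype=t.dtype, device="cpu", pin_memory=self.pin_memory
                )
                self.host_buffers[name] = buf
            buf.copy_(t.detach(), non_blocking=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make sure async copies have landed

    def restore(self) -> None:
        """Copy staged data back in place. saver.resume() would come first."""
        with torch.no_grad():
            for name, t in self._tensors():
                # In-place copy keeps the tensor's existing device address.
                t.copy_(self.host_buffers[name], non_blocking=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
```

The in-place `copy_` in `restore` is what keeps the original addresses intact; assigning new tensors instead would invalidate pointers held by CUDAGraph captures.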

Verification on H100 — round-trip correctness PASS

Measured on Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager:

| Phase | GPU used | Δ |
|---|---|---|
| After model load | 41,431 MiB | engine footprint: +40,804 MiB |
| After release(stage_to_cpu=true) | 1,593 MiB | −39,838 MiB = 97.6 % of footprint |
| After resume() (restores from CPU) | 41,541 MiB | +110 MiB vs. load |

| Latency | Value |
|---|---|
| Release (incl. staging) | 1.70 s |
| Resume (incl. restore) | 0.08 s |

Generation parity:

| Phase | Prompt → Output |
|---|---|
| Pre-release | "The capital of France is" → " ______.\nA. Paris\nB." |
| Post-resume | "The capital of France is" → " ______.\nA. Paris\nB." |

Outputs match: True — round-trip preservation works end-to-end.

For Qwen2-1.5B (~3 GiB bf16 weights), the 1.7 s release latency is dominated by the DtoH memcpy of weights into pinned host RAM. Extrapolating to Qwen2-72B (~145 GiB) at PCIe Gen4 x16 (~25 GiB/s pinned): release ≈ 6-8 s; resume ≈ 6-8 s. Compared to ~60-180 s to re-read a 145 GiB checkpoint from network storage, that's a ~10-20× speedup for the train↔serve handoff path.
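The extrapolation above is plain size-over-bandwidth arithmetic, sketched here as a sanity check. The function name `transfer_seconds` is hypothetical, and the result is an idealized lower bound that ignores per-tensor launch overhead and allocation cost, which is one reason the estimate quoted above is somewhat higher. The unpinned rates are the ones quoted under Open questions.

```python
def transfer_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Idealized one-way copy time: size divided by sustained bandwidth."""
    if bandwidth_gib_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gib / bandwidth_gib_per_s


MODEL_GIB = 145  # Qwen2-72B bf16 weights, as quoted in this PR

pinned = transfer_seconds(MODEL_GIB, 25)         # 5.8 s per direction
unpinned_best = transfer_seconds(MODEL_GIB, 8)   # 18.125 s per direction
unpinned_worst = transfer_seconds(MODEL_GIB, 6)  # about 24.2 s per direction

print(f"pinned:   {pinned:.1f} s each way, {2 * pinned:.1f} s round trip")
print(f"unpinned: {unpinned_best:.1f} to {unpinned_worst:.1f} s each way")
```

At the pinned rate the round trip comes to roughly 12 s, in line with the figure in the commit message.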

Trade-offs

  • Host RAM: holds ~sizeof(model) for the duration of the release. For a 72B bf16 model that's ~145 GiB of pinned host RAM.
  • Release latency: with staging, ~1.7 s for a 1.5B model; without, 23 ms. Resume latency goes from 18 ms to 80 ms.
  • Scope: stages model parameters + buffers only. KV cache and req_to_token_pool are scratch — the engine flushes outstanding requests before any reasonable use of this endpoint, so there's nothing in them worth preserving.
  • Tag-based staging: not yet implemented. The tags field is plumbed but unused; a future PR can use it to stage only a subset (e.g. weights but not KV).

Test plan

Env caveats during testing (unrelated to this PR)

Open questions

  • Do we want a config knob for pin_memory (currently default-on)? On hosts without enough pinned-RAM budget, unpinned staging works at ~6-8 GiB/s vs. ~25 GiB/s pinned.
  • Should the staging buffer live on a separate NUMA node when available? Out of scope here.

…vation

#272's /release_memory_occupation truthfully releases GPU memory but the
contents are gone — torch_memory_saver.pause() preserves virtual addresses
only, not data. /resume_memory_occupation gets zeroed pages back, so any
caller that wants the model to work after a release/resume cycle has to
re-read the checkpoint from disk. For RLHF train↔serve handoff and similar
"pause inference, return GPU, resume inference" flows that's tens of GiB
of disk I/O on every cycle.

Adds an opt-in CPU staging step:

  POST /release_memory_occupation
  {"stage_to_cpu": true}

  - Before saver.pause(), copy every param and buffer of the (target and
    draft) model into a pre-allocated pinned host buffer.
  - After saver.resume(), copy them back.

On Qwen2-72B bf16 (~145 GiB) staging round-trip = ~12 s over PCIe Gen4 x16
vs ~60-180 s to re-read the same weights from network storage.

Changes:
  - New: tokenspeed/runtime/engine/memory_occupation_manager.py with the
    MemoryOccupationManager class — pin/unpin host buffers, drive
    saver.pause()/resume(), and reuse buffers across cycles.
  - io_struct.py: add stage_to_cpu: bool = False to
    ReleaseMemoryOccupationReqInput.
  - request_handler.py: replace direct memory_saver calls with the
    manager so the staging path runs in the scheduler process where the
    model_runner is live.
  - event_loop.py: construct the manager with the real target/draft
    model_runner and pass it into RequestHandler.
  - http_server.py: parse stage_to_cpu from the JSON body.

Trade-offs:
  - Host RAM hold = ~sizeof(model) for the duration of the release.
  - Staging adds a few seconds to the release path; without it /release
    completes in ~1 s.
  - Does not stage KV cache or request pools (those are scratch — the
    engine flushes them before any reasonable use of this endpoint).

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>