feat(connector): add save-only pegaflow mode#300

Merged
xiaguan merged 2 commits into master from feat/pegaflow-save-only-mode
May 29, 2026

@xiaguan (Collaborator) commented May 28, 2026

Summary

  • Add `pegaflow.mode` extra config with `read_write` default and `save_only` mode.
  • Make save-only mode skip Pega query/load while still advancing save metadata from absolute computed-token watermarks.
  • Add unit coverage for save-only scheduling/resume behavior and an E2E that combines full-hit `DecodeBenchConnector` + Pega save-only, then verifies a read-write Pega instance can load the saved KV.
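
The mode switch and watermark-driven saving summarized above can be sketched roughly as follows. This is an illustrative model only, not the connector's code: `ConnectorMode`, `parse_mode`, `should_query_cache`, and `SaveTracker` are hypothetical names, and the block size is arbitrary. Only the `pegaflow.mode` key and its `read_write` / `save_only` values come from this PR.

```python
from dataclasses import dataclass
from enum import Enum


class ConnectorMode(Enum):
    READ_WRITE = "read_write"
    SAVE_ONLY = "save_only"


def parse_mode(extra_config: dict) -> ConnectorMode:
    """Read `pegaflow.mode` from an extra-config dict, defaulting to read_write."""
    raw = extra_config.get("pegaflow.mode", ConnectorMode.READ_WRITE.value)
    try:
        return ConnectorMode(raw)
    except ValueError:
        raise ValueError(f"unsupported pegaflow.mode: {raw!r}") from None


def should_query_cache(mode: ConnectorMode) -> bool:
    """Save-only mode skips query/load; read-write mode performs lookups."""
    return mode is ConnectorMode.READ_WRITE


@dataclass
class SaveTracker:
    """Hypothetical per-request save state driven by an absolute token watermark."""

    block_size: int
    saved_blocks: int = 0  # full blocks already scheduled for save

    def advance(self, computed_tokens: int) -> range:
        """Return indices of blocks that became complete at this watermark."""
        full_blocks = max(computed_tokens // self.block_size, self.saved_blocks)
        new_blocks = range(self.saved_blocks, full_blocks)
        self.saved_blocks = full_blocks  # never moves backwards
        return new_blocks
```

Because `advance` takes the absolute number of computed tokens rather than a per-step delta, calling it again with the same or a smaller watermark (for example after a resume) yields no duplicate blocks. That is one plausible reading of why the PR keys save metadata on absolute watermarks.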

E2E evidence

  • Save-only phase metrics: `save_bytes=9175040 insertions=5 hits=0 load_bytes=0`.
  • Read-write phase metrics: `hits=4 load_bytes=7340032`.
  • Save-only server log contains no `cache_lookup`, `query_prefetch`, or `load`; read-write phase logs `cache_lookup: hit_blocks=4`.
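
As a quick consistency check on the figures above, the sketch below parses the two metric lines and divides bytes by counts. It assumes, which the PR does not state, that each insertion and each hit corresponds to one KV block of equal size. `parse_metrics` is a hypothetical helper, not part of the test suite.

```python
def parse_metrics(line: str) -> dict[str, int]:
    """Parse space-separated key=value integer pairs into a dict."""
    return {key: int(value) for key, value in (pair.split("=") for pair in line.split())}


save_only = parse_metrics("save_bytes=9175040 insertions=5 hits=0 load_bytes=0")
read_write = parse_metrics("hits=4 load_bytes=7340032")

# Bytes per block in each phase, under the one-block-per-insertion/hit assumption.
bytes_per_saved = save_only["save_bytes"] // save_only["insertions"]
bytes_per_loaded = read_write["load_bytes"] // read_write["hits"]
```

Under that assumption both phases work out to 1,835,008 bytes (1.75 MiB) per block, so the saved and loaded sizes agree with each other.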

Tests

  • `uv run --extra dev ruff check python/pegaflow/connector/__init__.py python/pegaflow/connector/common.py python/pegaflow/connector/scheduler.py python/tests/test_combine_hashes.py python/tests/test_vllm_save_only_e2e.py python/tests/vllm_helpers.py`
  • `uv run --extra test pytest tests/test_vllm_save_only_e2e.py --collect-only -q -m e2e`
  • `uv run --extra test pytest tests/test_combine_hashes.py -q`
  • `PYTHONPATH=/data/pegadev/pegaflow-save-only-mode/python /data/pegadev/pegaflow/.venv/bin/python -m pytest tests/test_vllm_save_only_e2e.py -m e2e -q -s --model Qwen/Qwen3-0.6B --e2e-port 18100`
  • pre-commit hooks, including `cargo test --release`

@feifei-111 (Contributor) left a comment

LGTM

@xiaguan merged commit cf6f9cd into master on May 29, 2026
12 checks passed
@xiaguan deleted the feat/pegaflow-save-only-mode branch on May 29, 2026, 03:17
@xiaguan added a commit that referenced this pull request on May 29, 2026:

## chore(release): bump version to 0.22.4

Bumps the Rust workspace, the `pegaflow-llm` Python package, the
commitizen version, and the `Cargo.lock` workspace package versions from
`0.22.3` to `0.22.4`.

---

## Release notes — 0.22.3 → 0.22.4

18 PRs landed on `master` since `v0.22.3` (2026-05-15). They are grouped
by category below.

### Highlights

- **Disaggregated prefill/decode over RDMA push** (#297) — a brand-new
vLLM v1 KV connector (`PdConnector`) plus a v2 RDMA transfer engine
(`pegaflow-transfer/src/v2`). KV is pushed prefill→decode layer-by-layer
via one-sided RDMA WRITE as each attention layer completes, overlapping
transfer with the forward pass instead of pulling after prefill finishes
(the pull model used by vLLM's NIXL connector). On H20 / Qwen3-8B the
added TTFT is **2–4× lower than NIXL** for input lengths from 512 to 16k.
- **Query leases replace query pinning** (#284, #288) — the
query/load/release control path moved from pin refcounts to lease-backed
ownership. Query results collapse to `Loading`/`Ready` only; `Ready`
carries `num_hit_blocks` plus an opaque lease that transfers
scheduler→worker and is released on cleanup/failure, with a TTL sweeper
reclaiming abandoned leases.
- **Save-only connector mode** (#300) — new `pegaflow.mode` config;
`save_only` skips Pega query/load while still advancing save metadata,
so an instance can populate the cache without serving reads.
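
To make the lease flow in the second highlight concrete, here is a toy in-memory model. It is a sketch under assumptions, not PegaFlow's implementation: `LeaseTable`, `Lease`, `LeaseError`, the method names, and the explicit `now` argument are all hypothetical. The once-per-registered-worker consumption mirrors the #288 fix as described in these notes.

```python
import uuid
from dataclasses import dataclass, field


class LeaseError(Exception):
    """Raised for unknown, expired, or already-consumed leases."""


@dataclass
class Lease:
    num_hit_blocks: int
    expires_at: float
    consumed_by: set = field(default_factory=set)


class LeaseTable:
    def __init__(self, ttl_secs: float, workers: set):
        self.ttl_secs = ttl_secs
        self.workers = set(workers)
        self._leases: dict = {}

    def grant(self, num_hit_blocks: int, now: float) -> str:
        """Create a lease for a Ready query result and return its opaque id."""
        lease_id = uuid.uuid4().hex
        self._leases[lease_id] = Lease(num_hit_blocks, now + self.ttl_secs)
        return lease_id

    def consume(self, lease_id: str, worker: str, now: float) -> int:
        """Let each registered worker use the lease exactly once."""
        lease = self._leases.get(lease_id)
        if lease is None or lease.expires_at <= now:
            raise LeaseError("query lease is unknown or expired")
        if worker not in self.workers or worker in lease.consumed_by:
            raise LeaseError("worker may not consume this lease")
        lease.consumed_by.add(worker)
        return lease.num_hit_blocks

    def release(self, lease_id: str) -> None:
        """Release on cleanup or failure; unknown leases are an error."""
        if self._leases.pop(lease_id, None) is None:
            raise LeaseError("query lease is unknown or expired")

    def sweep(self, now: float) -> int:
        """Reclaim leases whose TTL has passed; return how many were dropped."""
        expired = [k for k, v in self._leases.items() if v.expires_at <= now]
        for key in expired:
            del self._leases[key]
        return len(expired)
```

Passing `now` explicitly instead of reading a clock keeps the sketch deterministic; a real sweeper would presumably run on a timer.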

### Features

- feat(pd): RDMA push connector for disaggregated prefill/decode (#297)
- feat(connector): add save-only pegaflow mode (#300)
- feat(storage): sharded SSD cache — cache spread across multiple files,
uring engine dispatches across shards, prefetch ready-blocks now ordered
by requested keys (#299)
- feat(rdma): per-peer N QPs with WQE-level round-robin — new
`--qps-per-peer` (default 2), round-robin at WQE level so one in-flight
task saturates all QPs; handshake validates both sides agree on N (#291)
- feat(metaserver): node lifecycle fencing — heartbeat-based node
tracking with per-node UUID fencing, stale nodes hidden via
`--node-stale-secs` (#285, closes #222)
- feat(connector): replace query pinning with leases (#284)

### Fixes

- fix(connector): preserve non-MLA KV layout registration so cross-layer
layouts (e.g. GLM-4.7-FP8) register by block stride; limit
logical/physical block splitting to MLA (#295, closes #294)
- fix(numa): allocate pinned pools on GPU-local NUMA nodes instead of
the full CPU NUMA set, avoiding wasted capacity on CPU-only nodes (#293)
- fix(connector): handle split physical KV blocks — group split physical
rows into one logical block when FlashMLA uses smaller physical blocks
(#292)
- fix(connector): allow a query lease to be consumed once per registered
worker, fixing multi-worker `query lease is unknown or expired` (#288)
- fix(server): fail on invalid RDMA NICs — accept comma/space-separated
`--nics`, reject empty names, propagate RDMA init failures instead of
silently disabling P2P (#283, fixes #276)
- fix(connector): remove the unused scheduler pending-save request limit
and save-drop accounting (#282)
- fix(connector): demote `cache_lookup_reuse` log from INFO to DEBUG to
stop log spam under cache pressure (#280)

### Performance

- perf: CPU-path Criterion benchmarks + long-block save optimizations —
e.g. `query_prefetch_lease/32768` ~12.3 ms → ~6.1 ms,
`save_flush_unique/8192` ~21.3 ms → ~13.1 ms via reduced prefix-key
cloning, ordered multi-layer save grouping, and RawBlock inline-segment
allocation (#290)

### Internal / refactor / tests

- refactor(metrics): centralize histogram buckets behind a
`build_buckets` helper (#298)
- refactor(core): make prefetch tasks terminal (Ready results carry RAM
prefix blocks); default storage admission to no TinyLFU unless
explicitly enabled (#287)
- test(server): mock vLLM gRPC E2E harness covering
save/query/load/release/session contracts (#289)
- chore: tune transfer duration histogram buckets toward long-tail
visibility (#281)

### Notable behavior & config changes (upgrade notes)

- **Query API**: query results are now `Loading`/`Ready` only; `Ready`
exposes `num_hit_blocks` + an opaque lease. Pin/unpin refcount semantics
are gone (#284, #288).
- **Release RPC**: unknown/expired leases now return
`FailedPrecondition` instead of being silently accepted (#289).
- **`--nics`**: now rejects empty entries (e.g. `mlx5_0,,mlx5_1`) and
fails startup on RDMA init errors rather than silently falling back to
no-P2P (#283).
- **New CLI flags**: `--qps-per-peer` (default 2) (#291),
`--node-stale-secs` for metaserver (#285).
- **New connector config**: `pegaflow.mode` with `read_write` (default)
/ `save_only` (#300).
- **Storage admission**: TinyLFU is now off unless explicitly enabled
(#287).
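
The `--nics` rule in the upgrade notes above can be illustrated with a small parser. This is a hypothetical Python sketch, not the server's actual argument handling: `parse_nics` is an invented name, and treating any empty field between commas as an error is an interpretation of the note.

```python
def parse_nics(value: str) -> list[str]:
    """Split a comma- or whitespace-separated NIC list, rejecting empty names."""
    names: list[str] = []
    for field in value.split(","):
        parts = field.split()  # whitespace-separated names inside one comma field
        if not parts:
            raise ValueError(f"empty NIC name in --nics value: {value!r}")
        names.extend(parts)
    return names
```

With this sketch, `mlx5_0,mlx5_1`, `mlx5_0 mlx5_1`, and `mlx5_0, mlx5_1` all yield two names, while `mlx5_0,,mlx5_1` and an empty string raise instead of silently producing a shorter list.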

### Full PR list

```
#298 refactor(metrics): use build_buckets helper for histogram buckets
#300 feat(connector): add save-only pegaflow mode
#299 feat(storage): add sharded SSD cache support
#297 feat(pd): RDMA push connector for disaggregated prefill/decode
#290 perf: add cpu path benchmarks and optimize long-block saves
#295 fix(connector): preserve non-MLA kv layout registration
#293 fix(numa): allocate pinned pools on GPU-local NUMA nodes
#292 fix(connector): handle split physical kv blocks
#291 feat(rdma): per-peer N QPs with WQE-level round-robin
#285 feat(metaserver): add node lifecycle fencing
#287 refactor(core): make prefetch task terminal
#289 test(server): add mock vLLM RPC E2E coverage
#288 fix(connector): allow query leases across workers
#284 feat(connector): replace query pinning with leases
#283 fix(server): fail on invalid rdma nics
#282 fix(connector): remove scheduler save limit
#281 chore: tune transfer duration buckets
#280 fix(connector): demote cache_lookup_reuse log to debug
```