feat(connector): add save-only pegaflow mode #300
Merged
xiaguan added a commit that referenced this pull request on May 29, 2026
## chore(release): bump version to 0.22.4

Bumps the Rust workspace, the `pegaflow-llm` Python package, the commitizen version, and the `Cargo.lock` workspace package versions from `0.22.3` to `0.22.4`.

---

## Release notes: 0.22.3 → 0.22.4

18 PRs landed on `master` since `v0.22.3` (2026-05-15). Grouped below for release notes.

### Highlights

- **Disaggregated prefill/decode over RDMA push** (#297): a brand-new vLLM v1 KV connector (`PdConnector`) plus a v2 RDMA transfer engine (`pegaflow-transfer/src/v2`). KV is pushed prefill→decode layer-by-layer via one-sided RDMA WRITE as each attention layer completes, overlapping transfer with the forward pass instead of pulling after prefill finishes (the vLLM NIXL model). On H20 / Qwen3-8B the added TTFT is **2–4× lower than NIXL** across 512–16k input lengths.
- **Query leases replace query pinning** (#284, #288): the query/load/release control path moved from pin refcounts to lease-backed ownership. Query results collapse to `Loading`/`Ready` only; `Ready` carries `num_hit_blocks` plus an opaque lease that transfers scheduler→worker and is released on cleanup/failure, with a TTL sweeper reclaiming abandoned leases.
- **Save-only connector mode** (#300): new `pegaflow.mode` config; `save_only` skips Pega query/load while still advancing save metadata, so an instance can populate the cache without serving reads.

### Features

- feat(pd): RDMA push connector for disaggregated prefill/decode (#297)
- feat(connector): add save-only pegaflow mode (#300)
- feat(storage): sharded SSD cache: cache spread across multiple files, uring engine dispatches across shards, prefetch ready-blocks now ordered by requested keys (#299)
- feat(rdma): per-peer N QPs with WQE-level round-robin: new `--qps-per-peer` (default 2), round-robin at WQE level so one in-flight task saturates all QPs; handshake validates both sides agree on N (#291)
- feat(metaserver): node lifecycle fencing: heartbeat-based node tracking with per-node UUID fencing, stale nodes hidden via `--node-stale-secs` (#285, closes #222)
- feat(connector): replace query pinning with leases (#284)

### Fixes

- fix(connector): preserve non-MLA KV layout registration so cross-layer layouts (e.g. GLM-4.7-FP8) register by block stride; limit logical/physical block splitting to MLA (#295, closes #294)
- fix(numa): allocate pinned pools on GPU-local NUMA nodes instead of the full CPU NUMA set, avoiding wasted capacity on CPU-only nodes (#293)
- fix(connector): handle split physical KV blocks: group split physical rows into one logical block when FlashMLA uses smaller physical blocks (#292)
- fix(connector): allow a query lease to be consumed once per registered worker, fixing multi-worker `query lease is unknown or expired` (#288)
- fix(server): fail on invalid RDMA NICs: accept comma/space-separated `--nics`, reject empty names, propagate RDMA init failures instead of silently disabling P2P (#283, fixes #276)
- fix(connector): remove the unused scheduler pending-save request limit and save-drop accounting (#282)
- fix(connector): demote `cache_lookup_reuse` log from INFO to DEBUG to stop log spam under cache pressure (#280)

### Performance

- perf: CPU-path Criterion benchmarks + long-block save optimizations: e.g. `query_prefetch_lease/32768` ~12.3 ms → ~6.1 ms, `save_flush_unique/8192` ~21.3 ms → ~13.1 ms via reduced prefix-key cloning, ordered multi-layer save grouping, and RawBlock inline-segment allocation (#290)

### Internal / refactor / tests

- refactor(metrics): centralize histogram buckets behind a `build_buckets` helper (#298)
- refactor(core): make prefetch tasks terminal (Ready results carry RAM prefix blocks); default storage admission to no TinyLFU unless explicitly enabled (#287)
- test(server): mock vLLM gRPC E2E harness covering save/query/load/release/session contracts (#289)
- chore: tune transfer duration histogram buckets toward long-tail visibility (#281)

### Notable behavior & config changes (upgrade notes)

- **Query API**: query results are now `Loading`/`Ready` only; `Ready` exposes `num_hit_blocks` + an opaque lease. Pin/unpin refcount semantics are gone (#284, #288).
- **Release RPC**: unknown/expired leases now return `FailedPrecondition` instead of being silently accepted (#289).
- **`--nics`**: now rejects empty entries (e.g. `mlx5_0,,mlx5_1`) and fails startup on RDMA init errors rather than silently falling back to no-P2P (#283).
- **New CLI flags**: `--qps-per-peer` (default 2) (#291), `--node-stale-secs` for metaserver (#285).
- **New connector config**: `pegaflow.mode` with `read_write` (default) / `save_only` (#300).
- **Storage admission**: TinyLFU is now off unless explicitly enabled (#287).

### Full PR list

```
#298 refactor(metrics): use build_buckets helper for histogram buckets
#300 feat(connector): add save-only pegaflow mode
#299 feat(storage): add sharded SSD cache support
#297 feat(pd): RDMA push connector for disaggregated prefill/decode
#290 perf: add cpu path benchmarks and optimize long-block saves
#295 fix(connector): preserve non-MLA kv layout registration
#293 fix(numa): allocate pinned pools on GPU-local NUMA nodes
#292 fix(connector): handle split physical kv blocks
#291 feat(rdma): per-peer N QPs with WQE-level round-robin
#285 feat(metaserver): add node lifecycle fencing
#287 refactor(core): make prefetch task terminal
#289 test(server): add mock vLLM RPC E2E coverage
#288 fix(connector): allow query leases across workers
#284 feat(connector): replace query pinning with leases
#283 fix(server): fail on invalid rdma nics
#282 fix(connector): remove scheduler save limit
#281 chore: tune transfer duration buckets
#280 fix(connector): demote cache_lookup_reuse log to debug
```
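
The lease semantics in the release notes (an opaque lease attached to a `Ready` result, consumable once per registered worker, reclaimed by a TTL sweeper, with unknown or expired leases rejected on release) can be pictured with a small toy model. This is only a sketch under assumptions, not PegaFlow's implementation: `LeaseTable`, `LeaseError`, and every method name are hypothetical, and `LeaseError` merely stands in for the `FailedPrecondition` status mentioned above.

```python
import time
import uuid
from dataclasses import dataclass


class LeaseError(LookupError):
    """Hypothetical stand-in for a FailedPrecondition-style rejection."""


@dataclass
class _Lease:
    num_hit_blocks: int
    expires_at: float
    pending_workers: set  # workers that have not consumed the lease yet


class LeaseTable:
    """Toy lease registry; all names here are illustrative assumptions."""

    def __init__(self, workers, ttl_s, clock=time.monotonic):
        self._workers = frozenset(workers)
        self._ttl_s = ttl_s
        self._clock = clock
        self._leases = {}

    def grant(self, num_hit_blocks):
        # The lease id is opaque to callers: a random token, nothing more.
        lease_id = uuid.uuid4().hex
        self._leases[lease_id] = _Lease(
            num_hit_blocks, self._clock() + self._ttl_s, set(self._workers)
        )
        return lease_id

    def _live(self, lease_id):
        lease = self._leases.get(lease_id)
        if lease is None or lease.expires_at <= self._clock():
            raise LeaseError("query lease is unknown or expired")
        return lease

    def consume(self, lease_id, worker):
        # Each registered worker may consume a given lease exactly once.
        lease = self._live(lease_id)
        if worker not in lease.pending_workers:
            raise LeaseError(f"worker {worker!r} cannot consume this lease")
        lease.pending_workers.remove(worker)
        return lease.num_hit_blocks

    def release(self, lease_id):
        # Unknown or expired leases are rejected rather than silently accepted.
        self._live(lease_id)
        del self._leases[lease_id]

    def sweep(self):
        # TTL sweeper: drop abandoned leases and report how many were reclaimed.
        now = self._clock()
        expired = [k for k, v in self._leases.items() if v.expires_at <= now]
        for k in expired:
            del self._leases[k]
        return len(expired)
```

In this model a scheduler would call `grant` when a query resolves to `Ready`, hand the returned id to each worker for `consume`, and call `release` on cleanup or failure; anything never released is removed by `sweep` once its TTL passes.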
Summary
- `pegaflow.mode` extra config with `read_write` default and `save_only` mode.

E2E evidence

- Save-only phase: `save_bytes=9175040 insertions=5 hits=0 load_bytes=0`.
- Read-write phase: `hits=4 load_bytes=7340032`.
- Save-only phase logs no `cache_lookup`, `query_prefetch`, or `load`; read-write phase logs `cache_lookup: hit_blocks=4`.

Tests

- `uv run --extra dev ruff check python/pegaflow/connector/__init__.py python/pegaflow/connector/common.py python/pegaflow/connector/scheduler.py python/tests/test_combine_hashes.py python/tests/test_vllm_save_only_e2e.py python/tests/vllm_helpers.py`
- `uv run --extra test pytest tests/test_vllm_save_only_e2e.py --collect-only -q -m e2e`
- `uv run --extra test pytest tests/test_combine_hashes.py -q`
- `PYTHONPATH=/data/pegadev/pegaflow-save-only-mode/python /data/pegadev/pegaflow/.venv/bin/python -m pytest tests/test_vllm_save_only_e2e.py -m e2e -q -s --model Qwen/Qwen3-0.6B --e2e-port 18100`
- `cargo test --release`
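
For readers who want to try the new mode, the following is a minimal sketch of how the `pegaflow.mode` extra config might be parsed and passed along. It rests on assumptions: the key is taken to sit inside vLLM's `kv_connector_extra_config` dictionary under the literal name `pegaflow.mode`, the connector name `PegaKVConnector` is a placeholder, and `parse_mode`, `skips_query_and_load`, and `build_kv_transfer_config` are hypothetical helpers rather than functions from this PR.

```python
import json

# The two modes named in the PR; "read_write" is the documented default.
VALID_MODES = ("read_write", "save_only")


def parse_mode(extra_config):
    """Return the configured mode, defaulting to read_write (hypothetical helper)."""
    mode = extra_config.get("pegaflow.mode", "read_write")
    if mode not in VALID_MODES:
        raise ValueError(
            f"pegaflow.mode must be one of {VALID_MODES}, got {mode!r}"
        )
    return mode


def skips_query_and_load(mode):
    """save_only skips query/load; saving continues in both modes."""
    return mode == "save_only"


def build_kv_transfer_config(mode):
    """Build a dict shaped like vLLM's KV transfer config.

    The connector name below is a placeholder assumption; use the name
    your PegaFlow installation actually registers.
    """
    parse_mode({"pegaflow.mode": mode})  # validate before emitting
    return {
        "kv_connector": "PegaKVConnector",  # assumed name
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"pegaflow.mode": mode},
    }


if __name__ == "__main__":
    # Prints a JSON string suitable for a --kv-transfer-config style argument.
    print(json.dumps(build_kv_transfer_config("save_only")))
```

Rejecting unknown mode strings up front is a design choice in this sketch, made so that a typo fails fast instead of quietly falling back to `read_write`; the PR itself does not state how invalid values are handled.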