feat(pd): MLA cache layout support + PD RDMA perf/stability #308

Open. xiaguan wants to merge 39 commits into master from docs/pd-mla-design

xiaguan (Collaborator) commented Jun 1, 2026

Summary

Adds PD (Prefill/Decode disaggregation) connector support for the MLA cache layout, plus the surrounding PD RDMA performance/stability work and a save-only PegaFlow connector mode.

Diff against master (master...HEAD): 39 files changed, +3384 / −1216. The branch is 0 commits behind master, so it merges cleanly.

Note: dev logs relocated. The PD-MLA experiment/benchmark/debug logs are not part of this PR; they live in the standalone docs repo (docs/projects/) as project knowledge. This PR therefore also removes the pre-existing docs/pd-bench-results.md from the pegaflow repo. That file was moved to the docs repo, not discarded.

Main changes

PD connector — MLA cache layout (python/pegaflow/pd_connector/)

  • layout.py / metadata.py / kv_params.py: support MLA (single latent KV) cache layout in addition to the existing FlashAttention HND layout
  • prefill_worker.py / decode_worker.py / rdma.py: PD RDMA push path reworked: decode-side RDMA waits now run in parallel; dispatch, handshake, and wait-registration overhead is reduced; producer release is stabilized; push throughput is improved
  • scheduler.py / proxy.py / worker.py: plumbing for the above
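
To make the layout difference concrete, here is a minimal sketch of how a per-block size could be described for the two layouts. It is illustrative only: the class names, field names, and example dimensions below are assumptions, not the contents of layout.py. It assumes an HND block stores separate K and V planes ordered heads-then-tokens, while an MLA block stores one latent vector per token (latent rank plus RoPE dimensions).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HndLayout:
    """Hypothetical descriptor: separate K and V planes, heads before tokens."""
    num_kv_heads: int
    block_size: int      # tokens per cache block
    head_dim: int
    dtype_bytes: int = 2  # e.g. bf16/fp16

    def block_shape(self) -> Tuple[int, ...]:
        # (K/V plane, head, token, head_dim)
        return (2, self.num_kv_heads, self.block_size, self.head_dim)

    def bytes_per_block(self) -> int:
        return 2 * self.num_kv_heads * self.block_size * self.head_dim * self.dtype_bytes


@dataclass(frozen=True)
class MlaLayout:
    """Hypothetical descriptor: one latent KV vector per token, no K/V split."""
    kv_lora_rank: int
    rope_dim: int
    block_size: int
    dtype_bytes: int = 2

    def block_shape(self) -> Tuple[int, ...]:
        # (token, latent + rope)
        return (self.block_size, self.kv_lora_rank + self.rope_dim)

    def bytes_per_block(self) -> int:
        return self.block_size * (self.kv_lora_rank + self.rope_dim) * self.dtype_bytes


# Example dimensions are assumptions chosen for illustration.
hnd = HndLayout(num_kv_heads=8, block_size=16, head_dim=128)
mla = MlaLayout(kv_lora_rank=512, rope_dim=64, block_size=16)
```

With these example numbers, an MLA block is a single contiguous region per layer and is smaller than the HND block, which is why transfer metadata for the two layouts would need different region counts and sizes.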

Connector (python/pegaflow/connector/)

Server / transfer (Rust)

PyO3 bindings (python/src/lib.rs, pegaflow.pyi) updated for the new connector surface.

CI / dev infra

  • fix(ci): install libibverbs-dev in release wheel build (#303)
  • ci: run the pre-commit cargo test --release hook with --features cuda-13, matching the CUDA 13 runtime on the dev/test hosts (the default cuda-12080 bindings reference cudaEventElapsedTime_v2, absent in CUDA 13 libcudart)
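
The symbol mismatch described above can be checked directly. The following is a rough sketch, not part of this PR: `has_symbol` is a hypothetical helper that uses the standard-library `ctypes` module to test whether a shared library can be loaded and whether it exports a given symbol.

```python
import ctypes
from typing import Optional


def has_symbol(library: Optional[str], symbol: str) -> bool:
    """Return True if `library` loads and exports `symbol`.

    Passing None inspects the symbols already visible to the current
    process (POSIX only).
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # The library could not be found or loaded at all.
        return False
    # ctypes raises AttributeError on lookup of a missing symbol,
    # which hasattr turns into False.
    return hasattr(lib, symbol)
```

On a host with the CUDA runtime installed, a call such as `has_symbol("libcudart.so.13", "cudaEventElapsedTime_v2")` would indicate whether bindings that reference that symbol can resolve it. The library file name here is an assumption and may differ between installations.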

Tests

  • New test_pd_connector_layout.py; expanded test_combine_hashes.py; test_pd_connector_flow.py (renamed from test_pd_connector.py); test_vllm_save_only_e2e.py; shared pd_connector_test_utils.py

Test status

Rust workspace tests pass under --features cuda-13 (via the pre-commit hook). The vLLM correctness E2E and PD integration gates should be run on the GPU machine before merge per CLAUDE.md.

xiaguan added 30 commits May 28, 2026 16:03
xiaguan and others added 9 commits May 29, 2026 17:13
This dev/test host's default CUDA runtime is 13.3, but the default
build feature is cudarc/cuda-12080, whose runtime bindings reference
cudaEventElapsedTime_v2 — a symbol absent from the CUDA 13 libcudart.
The release test hook therefore dlopen-failed on every CUDA test.

Build and run the pre-commit release tests with --features cuda-13 so
the bindings match the installed runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

These experiment/benchmark/debug logs belong in the standalone docs
repo (project knowledge), not in the pegaflow PR surface. Relocated to
/data/pegadev/docs/projects/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>