feat(pd): MLA cache layout support + PD RDMA perf/stability by xiaguan · Pull Request #308 · novitalabs/pegaflow

xiaguan · 2026-06-01T02:41:19Z

Summary

Adds PD (Prefill/Decode disaggregation) connector support for the MLA cache layout, plus the surrounding PD RDMA performance/stability work and a save-only PegaFlow connector mode.

master...HEAD: 39 files, +3384 / −1216. Branch is 0 commits behind master, so it merges cleanly.

Note: dev logs relocated. The PD-MLA experiment, benchmark, and debug logs are not part of this PR; they live in the standalone docs repo (docs/projects/) as project knowledge. This PR therefore also removes the pre-existing docs/pd-bench-results.md from pegaflow; the file was moved to the docs repo, not discarded.

Main changes

PD connector — MLA cache layout (python/pegaflow/pd_connector/)

layout.py / metadata.py / kv_params.py: support MLA (single latent KV) cache layout in addition to the existing FlashAttention HND layout
prefill_worker.py / decode_worker.py / rdma.py: PD RDMA push path reworked — parallelized decode RDMA waits, reduced dispatch/handshake/wait-registration overhead, stabilized producer release, improved push throughput
scheduler.py / proxy.py / worker.py: plumbing for the above
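For reviewers less familiar with the two layouts, here is a minimal sketch of how a per-block cache region differs between them. The class names, field names, and the example dimensions are hypothetical and are not taken from layout.py. The HND ordering (K/V, heads, tokens, head_dim) and the "one latent vector per token" MLA shape are assumptions based on how these layouts are commonly described, not a statement of what this PR implements.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class HndBlock:
    """Hypothetical FlashAttention-style block with separate K and V planes."""
    num_kv_heads: int
    block_size: int       # tokens per block
    head_dim: int
    dtype_bytes: int = 2  # fp16/bf16

    def shape(self) -> tuple:
        # Assumed HND ordering: (K/V, heads, tokens, head_dim).
        return (2, self.num_kv_heads, self.block_size, self.head_dim)

    def nbytes(self) -> int:
        return prod(self.shape()) * self.dtype_bytes


@dataclass(frozen=True)
class MlaBlock:
    """Hypothetical MLA block: one shared latent vector per token, no K/V split."""
    block_size: int       # tokens per block
    latent_dim: int       # width of the compressed latent (illustrative)
    dtype_bytes: int = 2

    def shape(self) -> tuple:
        # No leading K/V axis and no head axis: (tokens, latent_dim).
        return (self.block_size, self.latent_dim)

    def nbytes(self) -> int:
        return prod(self.shape()) * self.dtype_bytes


# Illustrative dimensions only; real values depend on the model config.
hnd = HndBlock(num_kv_heads=8, block_size=16, head_dim=128)
mla = MlaBlock(block_size=16, latent_dim=576)
print(hnd.shape(), hnd.nbytes())
print(mla.shape(), mla.nbytes())
```

Under these assumptions, the point for a transfer path is that an MLA block is a single region per layer rather than a K/V pair, so code that assumes a leading dimension of 2 needs a layout switch.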

Connector (python/pegaflow/connector/)

feat(connector): save-only PegaFlow mode (#300)
metrics/scheduler/common adjustments

Server / transfer (Rust)

fix(server): isolate CUDA tensor registry on a dedicated thread (#301)
pegaflow-server registry/http_server/service changes + http_cleanup_hang_repro.rs regression test
refactor(metrics): shared build_buckets helper for histogram buckets (#298)
pegaflow-transfer verbs_domain tweaks
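The "registry on a dedicated thread" idea can be illustrated with a generic pattern: one thread owns the state, and every other thread sends it work through a queue. The sketch below is in Python only to keep this PR's examples in one language; the server code is Rust, and the class name, method names, and the dict standing in for the registry are all hypothetical.

```python
import queue
import threading
from concurrent.futures import Future


class RegistryThread:
    """Owns a registry dict on one worker thread; callers reach it via a queue."""

    def __init__(self) -> None:
        self._tasks: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        registry: dict = {}  # touched only from this thread
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            fn, fut = item
            try:
                fut.set_result(fn(registry))
            except Exception as exc:
                fut.set_exception(exc)

    def submit(self, fn) -> Future:
        """Queue fn(registry) for the owner thread and return a Future."""
        fut: Future = Future()
        self._tasks.put((fn, fut))
        return fut

    def close(self) -> None:
        self._tasks.put(None)
        self._thread.join()


reg = RegistryThread()
reg.submit(lambda r: r.__setitem__("layer0", 4096)).result()
size = reg.submit(lambda r: r["layer0"]).result()
owner = reg.submit(lambda r: threading.get_ident()).result()
reg.close()
```

Because every access is serialized through the queue, a slow or blocking call on the registry stalls only the owner thread, not the request handlers that submitted it.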

PyO3 bindings (python/src/lib.rs, pegaflow.pyi) updated for the new connector surface.

CI / dev infra

fix(ci): install libibverbs-dev in release wheel build (#303)
ci: run the pre-commit cargo test --release hook with --features cuda-13, matching the CUDA 13 runtime on the dev/test hosts (the default cuda-12080 bindings reference cudaEventElapsedTime_v2, absent in CUDA 13 libcudart)

Tests

New test_pd_connector_layout.py, test_combine_hashes.py expansion, test_pd_connector_flow.py (renamed from test_pd_connector.py), test_vllm_save_only_e2e.py, shared pd_connector_test_utils.py

Test status

Rust workspace tests pass under --features cuda-13 (via the pre-commit hook). The vLLM correctness E2E and PD integration gates should be run on the GPU machine before merge per CLAUDE.md.

This dev/test host's default CUDA runtime is 13.3, but the default build feature is cudarc/cuda-12080, whose runtime bindings reference cudaEventElapsedTime_v2, a symbol absent from the CUDA 13 libcudart. The release test hook therefore dlopen-failed on every CUDA test. Build and run the pre-commit release tests with --features cuda-13 so the bindings match the installed runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
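As a rough illustration of the rule described in that commit message, the helper below builds the release test command from the CUDA runtime major version. The function name is hypothetical and this is not the actual hook; it only assumes the two facts stated above, namely that the default build targets CUDA 12 bindings and that --features cuda-13 is needed on a CUDA 13 runtime.

```python
def release_test_command(runtime_major: int) -> list:
    """Return the cargo invocation whose bindings match the CUDA runtime."""
    cmd = ["cargo", "test", "--release"]
    if runtime_major == 13:
        # Default bindings reference a symbol missing from CUDA 13 libcudart,
        # so opt in to the CUDA 13 bindings explicitly.
        cmd += ["--features", "cuda-13"]
    return cmd


print(" ".join(release_test_command(13)))
print(" ".join(release_test_command(12)))
```

Other runtime versions fall through to the default build here, which is an assumption of the sketch rather than documented behavior.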

These experiment/benchmark/debug logs belong in the standalone docs repo (project knowledge), not in the pegaflow PR surface. Relocated to /data/pegadev/docs/projects/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xiaguan added 30 commits May 28, 2026 16:03

feat(pd): support MLA cache layout

4749e59

Merge remote-tracking branch 'origin/master' into docs/pd-mla-design

1ae6142

chore(pd): add H20 Kimi launch script

aa3acb8

fix: stabilize pd rdma producer release

7a45fd7

fix: improve pd rdma push throughput

874e0c6

docs: record h20 pd mla benchmark results

0851fd0

docs: add h20 kimi pd mla experiment note

532d094

docs: define h20 kimi aligned ttft sweep

a46afc1

chore: align h20 kimi sweep tooling

9e196db

fix: add pd rdma latency distribution logs

382a50d

chore: expand h20 kimi sweep summary

3a526a3

docs: record h20 kimi baseline sweep

7a84a0e

fix: avoid hanging h20 nic monitor

346d7e2

fix: reduce pd handshake overhead

d10e358

test: strengthen pd rdma integration benchmark

33186f5

perf: reduce pd connector dispatch overhead

6530727

docs: record h20 tcp microbench

4d75058

chore: trace pd scheduler ingress latency

33d943c

perf: reduce pd decode wait registration overhead

3368acb

docs: record rejected early prefill experiment

07e0db2

docs: complete h20 waitmini ttft sweep

d83029e

docs: record vllm recompute boundary

780b85b

docs: record active nic window analysis

8e03f69

docs: explain layer cadence bandwidth ceiling

f4b8635

perf: log pd ready-window nic utilization

2574016

perf: log pd decode completion tail

8a359f6

perf: parallelize pd decode rdma waits

2f15691

chore: summarize pd connector latency logs

c706900

perf: log pd event ready bandwidth

407e879

fix: refuse h20 kimi start on busy gpus

74cf993

xiaguan and others added 9 commits May 29, 2026 17:13

chore: add h20 idle gpu scanner

8317c87

chore: include rdma nics in h20 idle scan

bbde5f7

chore: support h20 pd probe readiness

75b3217

docs: record h20 event-ready probe

be4db0c

chore: move h20 helpers out of repo

1cb6495

Merge remote-tracking branch 'origin/master' into docs/pd-mla-design

c366b07

chore: clean pd mla pr surface

d4e3161

docs: move pd mla dev logs to standalone docs repo

1abd685


xiaguan wants to merge 39 commits into master from docs/pd-mla-design
