Skip to content

fix: support non-RDMA local builds#307

Merged
xiaguan merged 1 commit into
novitalabs:masterfrom
ClSlaid:chore/optional-rdma-build
Jun 1, 2026
Merged

fix: support non-RDMA local builds#307
xiaguan merged 1 commit into
novitalabs:masterfrom
ClSlaid:chore/optional-rdma-build

Conversation

@ClSlaid
Copy link
Copy Markdown
Contributor

@ClSlaid ClSlaid commented May 31, 2026

Summary

  • make RDMA/pegaflow-transfer optional for core/server/python crates while preserving default RDMA builds
  • keep RDMA fetch wiring behind a small storage prefetch adapter for no-RDMA builds
  • split Python RDMA PyO3 bindings into pd_rdma.rs and pass Python loader env to bundled binaries

Tests

  • cargo fmt --check
  • cargo check -p pegaflow-py --no-default-features --features cuda-13 --lib --bins
  • cargo test -p pegaflow-core --no-default-features --features cuda-13 --lib

Not tested

  • RDMA-enabled builds and RDMA runtime paths were not tested on this machine; this environment only validated the no-RDMA CUDA build path.

Copilot AI review requested due to automatic review settings May 31, 2026 16:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes RDMA support optional/feature-gated across the Rust crates and Python extension, and improves how packaged Python-launched binaries find the correct runtime environment.

Changes:

  • Split Python RDMA bindings into a dedicated module and gate them behind the rdma Cargo feature.
  • Add binary_env() and pass it to server/metaserver subprocess launches to ensure correct Python/lib loading.
  • Feature-gate RDMA-related codepaths in pegaflow-core and update crate feature defaults (cuda-12, rdma).

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
python/src/pd_rdma.rs New feature-gated Python RDMA engine bindings extracted from lib.rs.
python/src/lib.rs Removes inline RDMA bindings and conditionally registers RDMA classes.
python/pegaflow/_server.py Runs server binary with an augmented environment via binary_env().
python/pegaflow/_metaserver.py Runs metaserver binary with an augmented environment via binary_env().
python/pegaflow/_bin_utils.py Adds binary_env() to set LD_LIBRARY_PATH, PYTHONPATH, and PYTHONHOME.
python/Cargo.toml Introduces rdma feature and makes pegaflow-transfer optional; updates defaults.
pegaflow-transfer/build.rs Changes build script behavior to always generate/link libibverbs bindings.
pegaflow-transfer/Cargo.toml Normalizes CUDA features (cuda-12 default).
pegaflow-server/Cargo.toml Updates defaults to include cuda-12 and rdma.
pegaflow-core/src/storage/prefetch.rs Wraps RDMA fetch behind a RdmaFetch adapter for cfg(feature="rdma").
pegaflow-core/src/storage/mod.rs Feature-gates RDMA transport creation/export and adds a non-RDMA config error.
pegaflow-core/src/metrics.rs Feature-gates RDMA gauges and related types.
pegaflow-core/src/lib.rs Feature-gates RDMA handshake/transport checks and updates tests accordingly.
pegaflow-core/src/backing/mod.rs Feature-gates RDMA modules and re-exports.
pegaflow-core/Cargo.toml Makes pegaflow-transfer optional and adds an explicit rdma feature.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pegaflow-transfer/build.rs
Comment thread pegaflow-transfer/build.rs
Comment thread python/src/pd_rdma.rs
Comment thread python/src/pd_rdma.rs Outdated
Comment thread python/src/pd_rdma.rs
Comment thread python/pegaflow/_bin_utils.py
Comment thread pegaflow-core/src/lib.rs Outdated
@ClSlaid ClSlaid force-pushed the chore/optional-rdma-build branch 2 times, most recently from 05e32a9 to 20888d2 Compare May 31, 2026 16:27
@ClSlaid
Copy link
Copy Markdown
Contributor Author

ClSlaid commented May 31, 2026

Updated after Copilot review:

  • Fixed PdRdmaEngine::wait_done to check finished_recving with a single lock acquisition instead of remove/reinsert.
  • Added backoff to RDMA polling waits so they do not pure-spin for the full timeout.
  • Removed PYTHONHOME from the launched binary environment and flattened env path construction.
  • Split RDMA/no-RDMA PegaEngine methods into cfg-specific implementations and added no-RDMA tests for ignored RDMA NIC config and handshake error behavior.

I intentionally left pegaflow-transfer/build.rs unconditional: pegaflow-transfer is the RDMA transfer crate, and no-RDMA users avoid it through the optional dependency gates in pegaflow-core/pegaflow-py; making the transfer crate itself build without ibverbs would imply a separate non-RDMA transfer crate mode that this PR is not trying to support.

@ClSlaid ClSlaid force-pushed the chore/optional-rdma-build branch from 20888d2 to 6c0c0cb Compare May 31, 2026 16:30
Copy link
Copy Markdown
Collaborator

@xiaguan xiaguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiaguan xiaguan enabled auto-merge (squash) June 1, 2026 02:34
@xiaguan xiaguan merged commit 62ba25c into novitalabs:master Jun 1, 2026
12 checks passed
@ClSlaid ClSlaid deleted the chore/optional-rdma-build branch June 1, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants