iroh-fabric is a Rust test harness for negotiating high-performance local or near-local data planes with iroh.
It uses iroh's custom transport support as the control plane, exchanges opaque transport metadata over an encrypted iroh stream, then runs a small payload test over the selected fabric backend.
The default run tries:
- `rdma`: common libfabric RDMA-capable providers such as `verbs;ofi_rxm`, `efa`, `cxi`, `psm3`, `psm2`, `opx`, and `mlx`. These are optional and skip cleanly when the local libfabric install does not expose them.
- `sockets`: libfabric `sockets` provider using `FI_EP_DGRAM`.
- `tcp`: libfabric `tcp` provider using `FI_EP_RDM`.
- `nvlink`: CUDA/NVML runtime probe plus CUDA peer-copy smoke test. This is optional and skips cleanly when CUDA, multiple peer-accessible NVIDIA GPUs, or NVLink are unavailable.

Run the default profiles:

    cargo run

Only run TCP and sockets:

    cargo run -- --profiles sockets,tcp

Only run the NVLink probe:

    cargo run -- --profiles nvlink

Pin a libfabric provider/domain:

    cargo run -- --provider tcp --endpoint-type rdm --domain en1
    cargo run -- --provider sockets --endpoint-type dgram --domain en1

For source-bound libfabric tests:

    cargo run -- --provider tcp --endpoint-type rdm --src-node 0.0.0.0

flowchart LR
A["Peer A<br/>iroh endpoint"] <-->|"iroh custom transport<br/>encrypted control stream"| B["Peer B<br/>iroh endpoint"]
A --> C["Open backend endpoint<br/>libfabric or CUDA"]
B --> D["Open backend endpoint<br/>libfabric or CUDA"]
C --> E["Opaque local address<br/>or GPU capability"]
D --> F["Opaque local address<br/>or GPU capability"]
E -->|"serialize JSON over iroh"| B
F -->|"serialize JSON over iroh"| A
C <-->|"payload test over selected data plane"| D
The control plane is deliberately small:
sequenceDiagram
participant A as Peer A
participant I as Iroh stream
participant B as Peer B
participant FA as Fabric/CUDA A
participant FB as Fabric/CUDA B
A->>FA: bind/probe local backend
B->>FB: bind/probe local backend
A->>I: connect over iroh custom transport
A->>B: send local fabric metadata
B->>A: send local fabric metadata
A->>FB: send payload over backend
FB->>FA: reply over backend
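
The metadata exchange in the sequence above can be sketched as length-prefixed frames written to any byte stream. This is a minimal illustration, not the project's wire format: the 4-byte big-endian length prefix, the `MAX_FRAME_LEN` limit, and the names `write_frame` and `read_frame` are all assumptions, and a `Vec<u8>` stands in for the encrypted iroh stream.

```rust
use std::io::{self, Read, Write};

/// Upper bound on one metadata frame (an assumed safety limit, not taken from the project).
const MAX_FRAME_LEN: usize = 64 * 1024;

/// Writes one opaque metadata blob as a 4-byte big-endian length followed by the bytes.
fn write_frame<W: Write>(stream: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "metadata frame too large",
        ));
    }
    let len = payload.len() as u32;
    stream.write_all(&len.to_be_bytes())?;
    stream.write_all(payload)?;
    stream.flush()
}

/// Reads one frame written by `write_frame`, rejecting oversized lengths before allocating.
fn read_frame<R: Read>(stream: &mut R) -> io::Result<Vec<u8>> {
    let mut len_bytes = [0u8; 4];
    stream.read_exact(&mut len_bytes)?;
    let len = u32::from_be_bytes(len_bytes) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "metadata frame too large",
        ));
    }
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload)?;
    Ok(payload)
}
```

Each peer would call `write_frame` with its serialized local metadata and `read_frame` to receive the remote side's. Checking the length before allocating keeps a misbehaving peer from forcing a large allocation on the control stream.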
The data plane depends on the profile:
flowchart TB
P["Profile selection"] --> R["rdma<br/>libfabric FI_EP_RDM"]
P --> S["sockets<br/>libfabric FI_EP_DGRAM"]
P --> T["tcp<br/>libfabric FI_EP_RDM"]
P --> N["nvlink<br/>CUDA peer copy"]
R --> RF["verbs / EFA / CXI / PSM / OPX / MLX"]
S --> SF["sockets provider"]
T --> TF["tcp provider"]
N --> NF["CUDA runtime + NVML<br/>peer-accessible GPU pair"]
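
The profile-to-backend mapping in the diagram above could be modeled roughly as follows. This is a sketch under stated assumptions: the `Profile` and `Backend` types and the `parse_profiles` helper are hypothetical names for illustration, and only the profile names and endpoint types come from this README.

```rust
use std::str::FromStr;

/// Profiles named in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Profile {
    Rdma,
    Sockets,
    Tcp,
    Nvlink,
}

/// How a profile moves its payload (illustrative type, not the project's own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    /// libfabric endpoint, identified here by the name of its endpoint-type constant.
    Libfabric { endpoint_type: &'static str },
    /// CUDA peer copy between two GPUs.
    CudaPeerCopy,
}

impl Profile {
    /// Maps a profile to the backend shown in the diagram.
    fn backend(self) -> Backend {
        match self {
            Profile::Rdma | Profile::Tcp => Backend::Libfabric { endpoint_type: "FI_EP_RDM" },
            Profile::Sockets => Backend::Libfabric { endpoint_type: "FI_EP_DGRAM" },
            Profile::Nvlink => Backend::CudaPeerCopy,
        }
    }
}

impl FromStr for Profile {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            "rdma" => Ok(Profile::Rdma),
            "sockets" => Ok(Profile::Sockets),
            "tcp" => Ok(Profile::Tcp),
            "nvlink" => Ok(Profile::Nvlink),
            other => Err(format!("unknown profile: {}", other)),
        }
    }
}

/// Parses a comma-separated list such as the value passed to `--profiles`.
fn parse_profiles(list: &str) -> Result<Vec<Profile>, String> {
    list.split(',').map(|s| s.parse::<Profile>()).collect()
}
```

Under these assumptions, `parse_profiles("sockets,tcp")` yields the two libfabric profiles in order, and an unknown name fails the whole list rather than being silently skipped.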
NVLink is not a normal libfabric provider. It is a GPU memory interconnect, so the program treats it as a separate backend:
- Runtime-loads CUDA and NVML with `dlopen`.
- Finds a pair of CUDA devices with peer access.
- Uses NVML to report whether an NVLink is active for the chosen source device when NVML is available.
- Exchanges the CUDA/NVLink capability over iroh.
- Runs a CUDA `cudaMemcpyPeer` payload verification.
This is a local peer-copy smoke test. A production multi-process version would exchange CUDA IPC handles, buffer descriptors, stream/event metadata, and safety tokens over iroh before allowing GPU memory access.
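
The pair-selection step can be illustrated without a GPU. The sketch below is not the project's code: `find_peer_pair` is a hypothetical helper, and the `can_access` closure stands in for a real query such as CUDA's `cudaDeviceCanAccessPeer`. It assumes access must hold in both directions before a pair is usable.

```rust
/// Returns the first pair `(src, dst)` of distinct devices where each can access the other.
/// `can_access(a, b)` is a stand-in for a driver query; no CUDA call is made here.
fn find_peer_pair<F>(device_count: usize, can_access: F) -> Option<(usize, usize)>
where
    F: Fn(usize, usize) -> bool,
{
    for src in 0..device_count {
        for dst in (src + 1)..device_count {
            // Require access in both directions so either device can be the copy source.
            if can_access(src, dst) && can_access(dst, src) {
                return Some((src, dst));
            }
        }
    }
    // No usable pair: the caller can skip the profile cleanly.
    None
}
```

Returning `None` maps onto the clean-skip behavior described for the `nvlink` profile: fewer than two devices, or no mutually accessible pair, means there is nothing to test.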
~/code/mesh-llm already has the right conceptual split for this:
- OpenAI request routing can stay on its existing HTTP/QUIC tunnel because streaming text is latency tolerant.
- Stage traffic is the hot path. Mesh LLM's README already calls out that activation frames move over a dedicated stage transport when a dense model is split.
- `mesh-llm gpus --json` already exposes stable GPU IDs and inventory. That is the natural place to add fabric capability metadata: RDMA provider/domain, Thunderbolt RDMA reachability, CUDA peer pairs, NVLink active state, and measured bandwidth.
- Skippy has KV-cache export/import FFI surfaces. Fast local fabrics could eventually move KV pages, activation frames, or expert shards without going through ordinary host TCP.
A possible integration shape:
flowchart LR
G["mesh-llm gpus --json"] --> C["Capability record<br/>GPU ID, VRAM, backend, fabric"]
C --> P["Planner"]
P -->|"ordinary network"| Q["QUIC stage transport"]
P -->|"same host, NVLink"| N["CUDA IPC / NVLink transport"]
P -->|"near host, Thunderbolt RDMA"| R["libfabric RDMA transport"]
Q --> S["Stage frames"]
N --> S
R --> S
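
One way to read the integration diagram above is as a small decision function over a capability record. Everything in this sketch is hypothetical: the `PeerCapability` fields, the `StageTransport` variants, and `plan_transport` are illustrative names, not part of Mesh LLM or this project.

```rust
/// Illustrative capability record for one peer (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct PeerCapability {
    /// True when the peer's GPU lives in the same host as ours.
    same_host: bool,
    /// True when an active NVLink was reported for the GPU pair.
    nvlink_active: bool,
    /// libfabric domain reachable over Thunderbolt RDMA, if one was detected.
    rdma_domain: Option<String>,
}

/// Transports from the diagram.
#[derive(Debug, PartialEq, Eq)]
enum StageTransport {
    CudaIpcNvlink,
    LibfabricRdma,
    Quic,
}

/// Picks the fastest transport the capability record supports, defaulting to QUIC.
fn plan_transport(peer: &PeerCapability) -> StageTransport {
    if peer.same_host && peer.nvlink_active {
        StageTransport::CudaIpcNvlink
    } else if peer.rdma_domain.is_some() {
        StageTransport::LibfabricRdma
    } else {
        StageTransport::Quic
    }
}
```

A real planner would likely score routes from measured bandwidth rather than use a fixed priority order; the fixed order here only mirrors the three branches of the diagram.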
For Thunderbolt RDMA, the useful path would be:
- Detect a libfabric-capable RDMA provider/domain for the Thunderbolt link.
- Publish that fabric metadata alongside normal iroh addresses.
- Let the planner prefer Thunderbolt RDMA for peers under the stage-traffic latency cap.
- Fall back to the existing iroh/QUIC stage transport if provider negotiation, address insertion, or the first probe fails.
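
The prefer-then-fall-back behavior in the list above might look like the following. This is a sketch under assumptions: `Route`, `SetupError`, and `choose_route` are hypothetical names, and the `rdma_setup` closure stands in for the real provider negotiation, address insertion, and first probe.

```rust
use std::time::Duration;

/// Routes considered for stage traffic in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    ThunderboltRdma,
    IrohQuic,
}

/// The three failure points named in the list above.
#[derive(Debug, PartialEq, Eq)]
enum SetupError {
    ProviderNegotiation,
    AddressInsertion,
    FirstProbe,
}

/// Prefers RDMA for peers under the latency cap and falls back to QUIC on any setup failure.
/// Returns the chosen route plus the error that triggered a fallback, if there was one.
fn choose_route<F>(
    measured_latency: Duration,
    latency_cap: Duration,
    rdma_setup: F,
) -> (Route, Option<SetupError>)
where
    F: FnOnce() -> Result<(), SetupError>,
{
    // Peers at or above the cap never attempt RDMA setup.
    if measured_latency >= latency_cap {
        return (Route::IrohQuic, None);
    }
    match rdma_setup() {
        Ok(()) => (Route::ThunderboltRdma, None),
        Err(err) => (Route::IrohQuic, Some(err)),
    }
}
```

Returning the error alongside the fallback route lets the caller log why RDMA was rejected while still continuing on the existing transport.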
For NVLink, the useful path is more intra-node:
- Detect CUDA peer-accessible GPU pairs and active NVLink state during GPU inventory.
- When Mesh LLM splits layers or KV-heavy work across GPUs in the same host, use iroh only to authenticate/control the relationship.
- Exchange CUDA IPC memory handles and buffer lifetimes over the control stream.
- Move activation/KV buffers with CUDA peer copies or NCCL/NVSHMEM rather than copying through host memory.
This project is intentionally a testbed, not a drop-in Mesh LLM transport yet. The next hardening steps would be stable capability schemas, benchmark-based route scoring, CUDA IPC handle exchange, and backpressure/error semantics that match Mesh LLM's stage runtime.