iroh-fabric

iroh-fabric is a Rust test harness for negotiating high-performance local or near-local data planes with iroh.

It uses iroh's custom transport support as the control plane, exchanges opaque transport metadata over an encrypted iroh stream, then runs a small payload test over the selected fabric backend.

The default run tries:

rdma: common libfabric RDMA-capable providers such as verbs;ofi_rxm, efa, cxi, psm3, psm2, opx, and mlx. These are optional and skip cleanly when the local libfabric install does not expose them.
sockets: libfabric sockets provider using FI_EP_DGRAM.
tcp: libfabric tcp provider using FI_EP_RDM.
nvlink: CUDA/NVML runtime probe plus CUDA peer-copy smoke test. This is optional and skips cleanly when CUDA, multiple peer-accessible NVIDIA GPUs, or NVLink are unavailable.
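
The "skip cleanly" behaviour described above can be pictured as a three-way outcome per profile. The following is a minimal sketch, not the crate's actual code: `Outcome`, `run_profiles`, and `run_succeeded` are hypothetical names, and the rule that a skipped profile does not fail the run is an assumption drawn from the wording above.

```rust
// Hypothetical sketch; these names are not taken from iroh-fabric.

/// Result of trying one profile on the local machine.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    /// The backend was available and the payload test passed.
    Passed,
    /// The backend is not available here (missing provider, no CUDA, ...).
    Skipped(String),
    /// The backend was available but the payload test failed.
    Failed(String),
}

/// Try every profile in order and collect `(name, outcome)` pairs.
/// `probe` stands in for the real backend probe plus payload test.
pub fn run_profiles<F>(profiles: &[&str], mut probe: F) -> Vec<(String, Outcome)>
where
    F: FnMut(&str) -> Outcome,
{
    profiles
        .iter()
        .map(|&name| (name.to_string(), probe(name)))
        .collect()
}

/// Assumption: a run succeeds when nothing failed; skips are not errors.
pub fn run_succeeded(results: &[(String, Outcome)]) -> bool {
    results
        .iter()
        .all(|(_, outcome)| !matches!(outcome, Outcome::Failed(_)))
}
```

Calling `run_profiles(&["rdma", "sockets", "tcp", "nvlink"], probe)` with a probe that skips `rdma` and `nvlink` still yields a successful run under this rule.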

Run

Run every default profile:

cargo run

Run only the sockets and tcp profiles:

cargo run -- --profiles sockets,tcp

Run only the NVLink probe:

cargo run -- --profiles nvlink
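
The `--profiles` value in the commands above is a comma-separated list of profile names. A minimal sketch of how such a list could be parsed follows; the `Profile` enum and `parse_profiles` function are hypothetical, and details such as case-insensitive matching and dropping duplicates are assumptions rather than documented behaviour.

```rust
use std::str::FromStr;

// Hypothetical sketch; the real CLI parsing may differ.

/// The four profile names the README lists.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Profile {
    Rdma,
    Sockets,
    Tcp,
    Nvlink,
}

impl FromStr for Profile {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: names are matched case-insensitively after trimming.
        match s.trim().to_ascii_lowercase().as_str() {
            "rdma" => Ok(Profile::Rdma),
            "sockets" => Ok(Profile::Sockets),
            "tcp" => Ok(Profile::Tcp),
            "nvlink" => Ok(Profile::Nvlink),
            other => Err(format!("unknown profile: {other}")),
        }
    }
}

/// Parse a value such as `sockets,tcp`, keeping the order given on the
/// command line and dropping empty items and duplicates.
pub fn parse_profiles(list: &str) -> Result<Vec<Profile>, String> {
    let mut profiles = Vec::new();
    for item in list.split(',').filter(|s| !s.trim().is_empty()) {
        let profile: Profile = item.parse()?;
        if !profiles.contains(&profile) {
            profiles.push(profile);
        }
    }
    Ok(profiles)
}
```

Under these assumptions, `parse_profiles("sockets,tcp")` returns the two profiles in that order, and an unrecognised name produces an error instead of being silently ignored.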

Pin a libfabric provider/domain:

cargo run -- --provider tcp --endpoint-type rdm --domain en1
cargo run -- --provider sockets --endpoint-type dgram --domain en1

For source-bound libfabric tests:

cargo run -- --provider tcp --endpoint-type rdm --src-node 0.0.0.0

How It Works

flowchart LR
    A["Peer A<br/>iroh endpoint"] <-->|"iroh custom transport<br/>encrypted control stream"| B["Peer B<br/>iroh endpoint"]
    A --> C["Open backend endpoint<br/>libfabric or CUDA"]
    B --> D["Open backend endpoint<br/>libfabric or CUDA"]
    C --> E["Opaque local address<br/>or GPU capability"]
    D --> F["Opaque local address<br/>or GPU capability"]
    E -->|"serialize JSON over iroh"| B
    F -->|"serialize JSON over iroh"| A
    C <-->|"payload test over selected data plane"| D

The control plane is deliberately small:

sequenceDiagram
    participant A as Peer A
    participant I as Iroh stream
    participant B as Peer B
    participant FA as Fabric/CUDA A
    participant FB as Fabric/CUDA B

    A->>FA: bind/probe local backend
    B->>FB: bind/probe local backend
    A->>I: connect over iroh custom transport
    A->>B: send local fabric metadata
    B->>A: send local fabric metadata
    FA->>FB: send payload over backend
    FB->>FA: reply over backend
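
The metadata exchange in the diagrams above can be sketched with in-process channels standing in for the encrypted iroh stream. Everything here is illustrative: `FabricMetadata`, `to_json`, `exchange`, and `demo_handshake` are hypothetical names, the JSON is hand-rolled to keep the example dependency-free, and hex encoding of the opaque address is an assumption about the wire format.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Hypothetical sketch; std channels stand in for the iroh control stream.

/// What one peer publishes about its local backend endpoint.
/// The address bytes are opaque to the control plane.
#[derive(Debug, Clone, PartialEq)]
pub struct FabricMetadata {
    pub profile: String,
    pub address: Vec<u8>,
}

/// Hex-encode opaque bytes so they fit inside a JSON string.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Flat JSON record. Assumes `profile` is a plain ASCII name that needs
/// no escaping; a real implementation would likely use a JSON library.
pub fn to_json(meta: &FabricMetadata) -> String {
    format!(
        r#"{{"profile":"{}","address":"{}"}}"#,
        meta.profile,
        to_hex(&meta.address)
    )
}

/// One side of the exchange: send the local record, then wait for the
/// remote one. Returns `None` if the other side hung up.
pub fn exchange(
    local: &FabricMetadata,
    tx: &Sender<String>,
    rx: &Receiver<String>,
) -> Option<String> {
    tx.send(to_json(local)).ok()?;
    rx.recv().ok()
}

/// Run both peers and return what each one received from the other.
pub fn demo_handshake() -> (String, String) {
    let (a_tx, b_rx) = channel();
    let (b_tx, a_rx) = channel();
    let a = FabricMetadata { profile: "tcp".into(), address: vec![0x0a, 0, 0, 1] };
    let b = FabricMetadata { profile: "tcp".into(), address: vec![0x0a, 0, 0, 2] };

    let peer_b = thread::spawn(move || exchange(&b, &b_tx, &b_rx).expect("peer A hung up"));
    let seen_by_a = exchange(&a, &a_tx, &a_rx).expect("peer B hung up");
    let seen_by_b = peer_b.join().expect("peer B panicked");
    (seen_by_a, seen_by_b)
}
```

Both sides send before they receive, which cannot deadlock here because the standard channel is unbounded. Only after each peer holds the other's record would the payload test start on the selected backend.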

The data plane depends on the profile:

flowchart TB
    P["Profile selection"] --> R["rdma<br/>libfabric FI_EP_RDM"]
    P --> S["sockets<br/>libfabric FI_EP_DGRAM"]
    P --> T["tcp<br/>libfabric FI_EP_RDM"]
    P --> N["nvlink<br/>CUDA peer copy"]

    R --> RF["verbs / EFA / CXI / PSM / OPX / MLX"]
    S --> SF["sockets provider"]
    T --> TF["tcp provider"]
    N --> NF["CUDA runtime + NVML<br/>peer-accessible GPU pair"]

NVLink Notes

NVLink is not a normal libfabric provider. It is a GPU memory interconnect, so the program treats it as a separate backend:

Runtime-loads CUDA and NVML with dlopen.
Finds a pair of CUDA devices with peer access.
Uses NVML, when it is available, to report whether an NVLink link is active on the chosen source device.
Exchanges the CUDA/NVLink capability over iroh.
Verifies a payload copied between the two devices with cudaMemcpyPeer.
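
The pair-selection and reporting steps above can be sketched without touching CUDA by treating peer access as plain data. This is an illustration under stated assumptions: `find_peer_pair`, `probe_capability`, and `NvlinkCapability` are hypothetical names, the matrix would in practice be filled from a CUDA runtime query such as `cudaDeviceCanAccessPeer`, and requiring access in both directions is an assumption, not something the README states.

```rust
// Hypothetical sketch; no CUDA or NVML calls are made here.

/// Capability record that could be exchanged over the control stream.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NvlinkCapability {
    pub src_device: usize,
    pub dst_device: usize,
    /// `None` when NVML could not be loaded, so link state is unknown.
    pub nvlink_active: Option<bool>,
}

/// `can_access[i][j]` is true when device `i` can directly access memory
/// on device `j`. Returns the first pair with access in both directions.
pub fn find_peer_pair(can_access: &[Vec<bool>]) -> Option<(usize, usize)> {
    let n = can_access.len();
    for src in 0..n {
        for dst in (src + 1)..n {
            // Missing entries in a ragged matrix count as "no access".
            let forward = can_access[src].get(dst).copied().unwrap_or(false);
            let backward = can_access[dst].get(src).copied().unwrap_or(false);
            if forward && backward {
                return Some((src, dst));
            }
        }
    }
    None
}

/// Build the capability record. `nvlink_state` stands in for the NVML
/// query and returns `None` when NVML is unavailable.
pub fn probe_capability<F>(can_access: &[Vec<bool>], nvlink_state: F) -> Option<NvlinkCapability>
where
    F: Fn(usize) -> Option<bool>,
{
    let (src, dst) = find_peer_pair(can_access)?;
    Some(NvlinkCapability {
        src_device: src,
        dst_device: dst,
        nvlink_active: nvlink_state(src),
    })
}
```

Returning `None` from `probe_capability` corresponds to the profile being skipped because no peer-accessible pair exists. Keeping `nvlink_active` optional lets the probe still report a usable pair when only NVML is missing.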

This is a local peer-copy smoke test. A production multi-process version would exchange CUDA IPC handles, buffer descriptors, stream/event metadata, and safety tokens over iroh before allowing GPU memory access.

Mesh LLM Integration Ideas

~/code/mesh-llm already has the right conceptual split for this:

OpenAI request routing can stay on its existing HTTP/QUIC tunnel because streaming text is latency-tolerant.
Stage traffic is the hot path. Mesh LLM's README already calls out that activation frames move over a dedicated stage transport when a dense model is split.
mesh-llm gpus --json already exposes stable GPU IDs and inventory. That is the natural place to add fabric capability metadata: RDMA provider/domain, Thunderbolt RDMA reachability, CUDA peer pairs, NVLink active state, and measured bandwidth.
Skippy has KV-cache export/import FFI surfaces. Fast local fabrics could eventually move KV pages, activation frames, or expert shards without going through ordinary host TCP.

A possible integration shape:

flowchart LR
    G["mesh-llm gpus --json"] --> C["Capability record<br/>GPU ID, VRAM, backend, fabric"]
    C --> P["Planner"]
    P -->|"ordinary network"| Q["QUIC stage transport"]
    P -->|"same host, NVLink"| N["CUDA IPC / NVLink transport"]
    P -->|"near host, Thunderbolt RDMA"| R["libfabric RDMA transport"]
    Q --> S["Stage frames"]
    N --> S
    R --> S
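
The planner branch in the diagram above amounts to a small decision over a capability record. A minimal sketch follows, assuming a hypothetical `PeerLink` record and `plan` function; neither exists in Mesh LLM or iroh-fabric, and the priority order (NVLink first, then RDMA, then QUIC) is an assumption.

```rust
// Hypothetical sketch of the planner choice; not an existing API.

/// The three stage transports shown in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    Quic,
    CudaIpcNvlink,
    LibfabricRdma,
}

/// Fabric facts about the link between two stage peers.
#[derive(Debug, Clone, Copy)]
pub struct PeerLink {
    /// Both peers run on the same host.
    pub same_host: bool,
    /// A peer-accessible GPU pair with an active NVLink link exists.
    pub nvlink_active: bool,
    /// A libfabric RDMA provider/domain reaches the peer.
    pub rdma_reachable: bool,
}

/// Pick the fastest transport the link supports, defaulting to QUIC.
pub fn plan(link: &PeerLink) -> Transport {
    if link.same_host && link.nvlink_active {
        Transport::CudaIpcNvlink
    } else if link.rdma_reachable {
        Transport::LibfabricRdma
    } else {
        Transport::Quic
    }
}
```

A benchmark-based planner would replace these booleans with measured bandwidth and latency, but the shape of the decision stays the same.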

For Thunderbolt RDMA, the useful path would be:

Detect a libfabric-capable RDMA provider/domain for the Thunderbolt link.
Publish that fabric metadata alongside normal iroh addresses.
Let the planner prefer Thunderbolt RDMA for peers under the stage-traffic latency cap.
Fall back to the existing iroh/QUIC stage transport if provider negotiation, address insertion, or the first probe fails.
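
The prefer-then-fall-back rule in the steps above can be sketched as a single function. All names here (`StageRoute`, `SetupError`, `Fallback`, `choose_route`) are hypothetical, latencies in microseconds are an assumed unit, and `setup_rdma` stands in for the real provider negotiation, address insertion, and first probe.

```rust
// Hypothetical sketch; none of these names come from iroh-fabric or Mesh LLM.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StageRoute {
    ThunderboltRdma,
    Quic,
}

/// The three setup steps the text lists as fallback triggers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SetupError {
    ProviderNegotiation,
    AddressInsertion,
    FirstProbe,
}

/// Why the planner ended up on QUIC, if it did.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fallback {
    OverLatencyCap,
    SetupFailed(SetupError),
}

/// Prefer Thunderbolt RDMA for peers under the latency cap; fall back to
/// QUIC when the peer is too far away or any setup step fails.
pub fn choose_route<F>(
    latency_us: u64,
    cap_us: u64,
    setup_rdma: F,
) -> (StageRoute, Option<Fallback>)
where
    F: FnOnce() -> Result<(), SetupError>,
{
    if latency_us >= cap_us {
        // RDMA setup is not even attempted for peers at or over the cap.
        return (StageRoute::Quic, Some(Fallback::OverLatencyCap));
    }
    match setup_rdma() {
        Ok(()) => (StageRoute::ThunderboltRdma, None),
        Err(err) => (StageRoute::Quic, Some(Fallback::SetupFailed(err))),
    }
}
```

Returning the fallback reason next to the route keeps the decision observable, which matters once routes are scored from benchmarks rather than fixed rules.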

For NVLink, the useful path stays within a single node:

Detect CUDA peer-accessible GPU pairs and active NVLink state during GPU inventory.
When Mesh LLM splits layers or KV-heavy work across GPUs in the same host, use iroh only to authenticate/control the relationship.
Exchange CUDA IPC memory handles and buffer lifetimes over the control stream.
Move activation/KV buffers with CUDA peer copies or NCCL/NVSHMEM rather than copying through host memory.

This project is intentionally a testbed, not a drop-in Mesh LLM transport yet. The next hardening steps would be stable capability schemas, benchmark-based route scoring, CUDA IPC handle exchange, and backpressure/error semantics that match Mesh LLM's stage runtime.
