
Mesh-LLM/iroh-fabric


iroh-fabric

iroh-fabric is a Rust test harness for negotiating high-performance local or near-local data planes with iroh.

It uses iroh's custom transport support as the control plane, exchanges opaque transport metadata over an encrypted iroh stream, then runs a small payload test over the selected fabric backend.

The default run tries:

  • rdma: common libfabric RDMA-capable providers such as verbs;ofi_rxm, efa, cxi, psm3, psm2, opx, and mlx. These are optional and skip cleanly when the local libfabric install does not expose them.
  • sockets: libfabric sockets provider using FI_EP_DGRAM.
  • tcp: libfabric tcp provider using FI_EP_RDM.
  • nvlink: CUDA/NVML runtime probe plus CUDA peer-copy smoke test. This is optional and skips cleanly when CUDA, multiple peer-accessible NVIDIA GPUs, or NVLink are unavailable.
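The profile names above map naturally onto a small enum. The following is a minimal sketch of how a comma-separated `--profiles` value could be parsed; the `Profile` type, `parse_profiles`, and `is_optional` are hypothetical names for illustration and are not taken from the project's source.

```rust
use std::str::FromStr;

/// The four profile names listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Profile {
    Rdma,
    Sockets,
    Tcp,
    Nvlink,
}

impl Profile {
    /// The README describes rdma and nvlink as optional: they skip
    /// cleanly when the local machine does not support them.
    pub fn is_optional(self) -> bool {
        matches!(self, Profile::Rdma | Profile::Nvlink)
    }
}

impl FromStr for Profile {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "rdma" => Ok(Profile::Rdma),
            "sockets" => Ok(Profile::Sockets),
            "tcp" => Ok(Profile::Tcp),
            "nvlink" => Ok(Profile::Nvlink),
            other => Err(format!("unknown profile: {other}")),
        }
    }
}

/// Hypothetical helper: parse a value such as "sockets,tcp",
/// keeping the given order and dropping duplicates.
pub fn parse_profiles(arg: &str) -> Result<Vec<Profile>, String> {
    let mut out = Vec::new();
    for part in arg.split(',').filter(|p| !p.trim().is_empty()) {
        let profile: Profile = part.parse()?;
        if !out.contains(&profile) {
            out.push(profile);
        }
    }
    Ok(out)
}
```

With this sketch, `parse_profiles("sockets,tcp")` yields the sockets and tcp profiles in that order, and an unknown name is reported as an error rather than silently ignored.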

Run

cargo run

Only run TCP and sockets:

cargo run -- --profiles sockets,tcp

Only run the NVLink probe:

cargo run -- --profiles nvlink

Pin a libfabric provider/domain:

cargo run -- --provider tcp --endpoint-type rdm --domain en1
cargo run -- --provider sockets --endpoint-type dgram --domain en1

For source-bound libfabric tests:

cargo run -- --provider tcp --endpoint-type rdm --src-node 0.0.0.0

How It Works

flowchart LR
    A["Peer A<br/>iroh endpoint"] <-->|"iroh custom transport<br/>encrypted control stream"| B["Peer B<br/>iroh endpoint"]
    A --> C["Open backend endpoint<br/>libfabric or CUDA"]
    B --> D["Open backend endpoint<br/>libfabric or CUDA"]
    C --> E["Opaque local address<br/>or GPU capability"]
    D --> F["Opaque local address<br/>or GPU capability"]
    E -->|"serialize JSON over iroh"| B
    F -->|"serialize JSON over iroh"| A
    C <-->|"payload test over selected data plane"| D
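To make the "serialize JSON over iroh" step concrete, here is a small sketch of an opaque metadata record encoded as JSON using only the standard library. The `FabricMetadata` struct, its field names, and the hex encoding of the address are assumptions for illustration; the project's actual wire format is not documented here, and a real implementation would more likely use a JSON library.

```rust
/// Hypothetical metadata record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FabricMetadata {
    pub profile: String,
    pub provider: String,
    /// Opaque backend address bytes; never interpreted by the control plane.
    pub address: Vec<u8>,
}

/// Escape the characters that JSON strings require to be escaped.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Encode bytes as lowercase hex so they survive a JSON string.
pub fn hex_encode(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Decode hex back into bytes; returns None on malformed input.
pub fn hex_decode(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

impl FabricMetadata {
    /// Serialize to a compact JSON object with the address hex-encoded.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"profile\":\"{}\",\"provider\":\"{}\",\"address\":\"{}\"}}",
            escape(&self.profile),
            escape(&self.provider),
            hex_encode(&self.address)
        )
    }
}
```

Keeping the address as an uninterpreted byte string is what lets the same control message carry a libfabric address or a GPU capability without the control plane knowing which.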

The control plane is deliberately small:

sequenceDiagram
    participant A as Peer A
    participant I as Iroh stream
    participant B as Peer B
    participant FA as Fabric/CUDA A
    participant FB as Fabric/CUDA B

    A->>FA: bind/probe local backend
    B->>FB: bind/probe local backend
    A->>I: connect over iroh custom transport
    A->>B: send local fabric metadata
    B->>A: send local fabric metadata
    A->>FB: send payload over backend
    FB->>FA: reply over backend
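The sequence above can be simulated without any networking. In this sketch, standard-library channels stand in for both the iroh control stream and the backend data plane, and the metadata strings `addr-a` and `addr-b` are placeholders; none of this reflects the real iroh or libfabric APIs.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Simulate the handshake: exchange metadata on a control channel,
/// then send a payload and verify the echoed reply on a data channel.
/// Returns peer B's metadata as seen by peer A, plus the reply bytes.
pub fn run_handshake(payload: Vec<u8>) -> Result<(String, Vec<u8>), String> {
    // Control stream stand-ins, one channel per direction.
    let (ctrl_a_to_b, ctrl_b_rx) = channel::<String>();
    let (ctrl_b_to_a, ctrl_a_rx) = channel::<String>();
    // Data plane stand-ins, one channel per direction.
    let (data_a_to_b, data_b_rx) = channel::<Vec<u8>>();
    let (data_b_to_a, data_a_rx) = channel::<Vec<u8>>();

    let peer_b = thread::spawn(move || -> Result<String, String> {
        // "bind/probe local backend": placeholder address for peer B.
        let local_meta = "addr-b".to_string();
        let remote_meta = ctrl_b_rx.recv().map_err(|e| e.to_string())?;
        ctrl_b_to_a.send(local_meta).map_err(|e| e.to_string())?;
        // Receive the payload on the data plane and echo it back.
        let received = data_b_rx.recv().map_err(|e| e.to_string())?;
        data_b_to_a.send(received).map_err(|e| e.to_string())?;
        Ok(remote_meta)
    });

    // Peer A: bind, exchange metadata, then run the payload test.
    let local_meta = "addr-a".to_string();
    ctrl_a_to_b.send(local_meta).map_err(|e| e.to_string())?;
    let remote_meta = ctrl_a_rx.recv().map_err(|e| e.to_string())?;
    data_a_to_b.send(payload.clone()).map_err(|e| e.to_string())?;
    let reply = data_a_rx.recv().map_err(|e| e.to_string())?;

    let seen_by_b = peer_b
        .join()
        .map_err(|_| "peer B panicked".to_string())??;
    if seen_by_b != "addr-a" {
        return Err("peer B saw unexpected metadata".to_string());
    }
    if reply != payload {
        return Err("payload mismatch".to_string());
    }
    Ok((remote_meta, reply))
}
```

The point of the split is visible even in the simulation: the control channels only ever carry metadata, and the payload only ever travels on the data channels.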

The data plane depends on the profile:

flowchart TB
    P["Profile selection"] --> R["rdma<br/>libfabric FI_EP_RDM"]
    P --> S["sockets<br/>libfabric FI_EP_DGRAM"]
    P --> T["tcp<br/>libfabric FI_EP_RDM"]
    P --> N["nvlink<br/>CUDA peer copy"]

    R --> RF["verbs / EFA / CXI / PSM / OPX / MLX"]
    S --> SF["sockets provider"]
    T --> TF["tcp provider"]
    N --> NF["CUDA runtime + NVML<br/>peer-accessible GPU pair"]

NVLink Notes

NVLink is not a normal libfabric provider. It is a GPU memory interconnect, so the program treats it as a separate backend:

  1. Runtime-loads CUDA and NVML with dlopen.
  2. Finds a pair of CUDA devices with peer access.
  3. When NVML is available, uses it to report whether an NVLink connection is active for the chosen source device.
  4. Exchanges the CUDA/NVLink capability over iroh.
  5. Runs a CUDA cudaMemcpyPeer payload verification.
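The pair-selection and reporting steps can be sketched independently of CUDA. In the example below, the `can_access_peer` closure stands in for a query such as `cudaDeviceCanAccessPeer`, and `nvlink_active` stands in for an NVML lookup that returns `None` when NVML is unavailable. `NvlinkCapability` and `probe` are hypothetical names, and checking access in only one direction is an assumption of this sketch.

```rust
/// Hypothetical capability record exchanged over the control stream.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NvlinkCapability {
    pub src: usize,
    pub dst: usize,
    /// None means NVML was unavailable, so NVLink state is unknown.
    pub nvlink_active: Option<bool>,
}

/// Find the first ordered device pair (src, dst) where src can access
/// dst's memory, then record the NVLink state for the source device.
pub fn probe<F, G>(
    device_count: usize,
    can_access_peer: F,
    nvlink_active: G,
) -> Option<NvlinkCapability>
where
    F: Fn(usize, usize) -> bool,
    G: Fn(usize) -> Option<bool>,
{
    for src in 0..device_count {
        for dst in 0..device_count {
            if src != dst && can_access_peer(src, dst) {
                return Some(NvlinkCapability {
                    src,
                    dst,
                    nvlink_active: nvlink_active(src),
                });
            }
        }
    }
    // No peer-accessible pair: the profile should skip cleanly.
    None
}

/// Payload verification after the peer copy: the destination buffer
/// must match the source buffer byte for byte.
pub fn verify_copy(src: &[u8], dst: &[u8]) -> bool {
    src == dst
}
```

Returning `None` rather than an error mirrors the "skips cleanly" behaviour described for the nvlink profile: a machine with one GPU, or with no peer access, is not a failure.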

This is a local peer-copy smoke test. A production multi-process version would exchange CUDA IPC handles, buffer descriptors, stream/event metadata, and safety tokens over iroh before allowing GPU memory access.

Mesh LLM Integration Ideas

Mesh LLM (~/code/mesh-llm) already has the right conceptual split for this:

  • OpenAI request routing can stay on its existing HTTP/QUIC tunnel because streaming text is latency-tolerant.
  • Stage traffic is the hot path. Mesh LLM's README already calls out that activation frames move over a dedicated stage transport when a dense model is split.
  • mesh-llm gpus --json already exposes stable GPU IDs and inventory. That is the natural place to add fabric capability metadata: RDMA provider/domain, Thunderbolt RDMA reachability, CUDA peer pairs, NVLink active state, and measured bandwidth.
  • Skippy has KV-cache export/import FFI surfaces. Fast local fabrics could eventually move KV pages, activation frames, or expert shards without going through ordinary host TCP.

A possible integration shape:

flowchart LR
    G["mesh-llm gpus --json"] --> C["Capability record<br/>GPU ID, VRAM, backend, fabric"]
    C --> P["Planner"]
    P -->|"ordinary network"| Q["QUIC stage transport"]
    P -->|"same host, NVLink"| N["CUDA IPC / NVLink transport"]
    P -->|"near host, Thunderbolt RDMA"| R["libfabric RDMA transport"]
    Q --> S["Stage frames"]
    N --> S
    R --> S
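As a rough illustration of the planner box in that flowchart, the sketch below picks a stage route from a capability record. The `PeerLink` fields, the `Route` enum, and the priority order (NVLink first, then Thunderbolt RDMA, then QUIC) are assumptions made for this example, not an existing Mesh LLM API.

```rust
/// The three routes shown in the flowchart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    QuicStage,
    CudaIpcNvlink,
    LibfabricRdma,
}

/// Hypothetical per-peer capability summary derived from GPU inventory.
#[derive(Debug, Clone, Copy)]
pub struct PeerLink {
    pub same_host: bool,
    pub nvlink_active: bool,
    pub thunderbolt_rdma: bool,
}

/// Prefer the fastest fabric that the capability record supports and
/// fall back to the ordinary QUIC stage transport otherwise.
pub fn plan_route(link: &PeerLink) -> Route {
    if link.same_host && link.nvlink_active {
        Route::CudaIpcNvlink
    } else if link.thunderbolt_rdma {
        Route::LibfabricRdma
    } else {
        Route::QuicStage
    }
}
```

Note that NVLink only counts when both GPUs are on the same host in this sketch, so a remote peer advertising active NVLink still routes over RDMA or QUIC.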

For Thunderbolt RDMA, the useful path would be:

  1. Detect a libfabric-capable RDMA provider/domain for the Thunderbolt link.
  2. Publish that fabric metadata alongside normal iroh addresses.
  3. Let the planner prefer Thunderbolt RDMA for peers under the stage-traffic latency cap.
  4. Fall back to the existing iroh/QUIC stage transport if provider negotiation, address insertion, or the first probe fails.
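The fallback rule in the last step could look roughly like this. The step names, the `Step` function-pointer type, and the microsecond latency cap are hypothetical; the sketch only shows the control flow of "try RDMA, fall back to iroh/QUIC on the first failure".

```rust
/// Outcome of the selection: which stage transport to use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    ThunderboltRdma,
    IrohQuic,
}

/// A setup step such as provider negotiation, address insertion,
/// or the first probe. Each returns Err with a reason on failure.
pub type Step = fn() -> Result<(), String>;

/// Choose Thunderbolt RDMA only if the peer is under the latency cap
/// and every setup step succeeds; otherwise fall back to iroh/QUIC
/// and report why.
pub fn select_stage_transport(
    latency_us: u64,
    latency_cap_us: u64,
    steps: &[(&str, Step)],
) -> (Transport, Option<String>) {
    if latency_us >= latency_cap_us {
        let reason = format!("latency {latency_us}us is not under cap {latency_cap_us}us");
        return (Transport::IrohQuic, Some(reason));
    }
    for (name, step) in steps {
        // Stop at the first failing step; later steps are never run.
        if let Err(e) = step() {
            return (Transport::IrohQuic, Some(format!("{name} failed: {e}")));
        }
    }
    (Transport::ThunderboltRdma, None)
}
```

Returning the fallback reason alongside the transport keeps the decision observable, which matters if route choices are later scored from benchmarks.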

For NVLink, the useful path stays within a single node:

  1. Detect CUDA peer-accessible GPU pairs and active NVLink state during GPU inventory.
  2. When Mesh LLM splits layers or KV-heavy work across GPUs in the same host, use iroh only to authenticate/control the relationship.
  3. Exchange CUDA IPC memory handles and buffer lifetimes over the control stream.
  4. Move activation/KV buffers with CUDA peer copies or NCCL/NVSHMEM rather than copying through host memory.

This project is intentionally a testbed, not a drop-in Mesh LLM transport yet. The next hardening steps would be stable capability schemas, benchmark-based route scoring, CUDA IPC handle exchange, and backpressure/error semantics that match Mesh LLM's stage runtime.
