Mesh-LLM v1: relay-gated direct-iroh inference between users (WAN)#822

Merged
tlongwell-block merged 17 commits into main from eva/mesh-v1-direct-iroh
Jun 3, 2026

@tlongwell-block commented Jun 2, 2026

Mesh-LLM v1: relay-gated direct-iroh inference between users

Real LLM inference between two users on the same Sprout relay, gated purely by relay access. One user shares compute (serves a model); another runs an agent against it. Discovery, connection, and hole-punch signaling all flow through relay events — no public iroh relays, no out-of-band coordination.

How it works

  1. Discovery — a serving node publishes 24620 status reports; the relay validates membership and projects each into a per-reporter relay-signed 30621. Non-members can't be discovered.
  2. Connect — a consuming node publishes a 24621 connect-request naming a target. The relay validates both parties are members.
  3. Call-me-now — on an accepted request the relay emits a paired, relay-signed 24622 to both endpoints; each dials the other. Fails closed on non-membership, self-target, or malformed requests.
  4. Inference — the consumer's local OpenAI-compatible endpoint proxies to the served model over the direct iroh link.
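The fail-closed behavior in step 3 can be sketched as a pure check. This is a minimal illustration, not the relay's handler: `validate_connect_request`, `Reject`, and the in-memory member set are hypothetical stand-ins for the real membership lookup.

```rust
use std::collections::HashSet;

/// Reasons a connect-request (kind 24621) is dropped. Every variant means
/// zero call-me-now events are emitted.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    /// No usable target was named in the request.
    Malformed,
    /// The requester named itself as the target.
    SelfTarget,
    /// The requester or the target is not a relay member.
    NotMember,
}

/// Hypothetical stand-in for the relay-side check. Returns the
/// (requester, target) pair to signal, or the reason for rejecting.
pub fn validate_connect_request<'a>(
    requester: &'a str,
    target: Option<&'a str>,
    members: &HashSet<&str>,
) -> Result<(&'a str, &'a str), Reject> {
    let target = match target {
        Some(t) if !t.is_empty() => t,
        _ => return Err(Reject::Malformed),
    };
    if target == requester {
        return Err(Reject::SelfTarget);
    }
    // Both ends must be members; either one failing rejects the request.
    if !members.contains(requester) || !members.contains(target) {
        return Err(Reject::NotMember);
    }
    Ok((requester, target))
}
```

Only the `Ok` branch would go on to mint the paired 24622 events; every other outcome ends the request silently.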

Lanes

  • Relay (crates/sprout-relay): mesh kinds, status-report ingress → 30621 projection, membership-gated signaling → paired 24622. Dead iroh_relay admission module removed.
  • Desktop (desktop/): in-process mesh node wiring, the 24620/24621/24622 UI flow, create-agent fail-closed on missing reporter_pubkey/inviteToken, 5s since-cushion on call-me-now subscribe. Mesh ports env-overridable (SPROUT_MESH_API_PORT/SPROUT_MESH_CONSOLE_PORT, default 9337/3131, fail-closed) so two nodes can run on one host for dev/test.
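One plausible reading of the 5s since-cushion, sketched under assumptions: the call-me-now subscription backdates its `since` bound by five seconds so that a 24622 stamped slightly before the subscribe (clock skew, or a fast relay) is not filtered out. The helper names here are hypothetical; the desktop code itself is TypeScript.

```rust
/// Width of the cushion named in the PR text.
pub const SINCE_CUSHION_SECS: u64 = 5;

/// Hypothetical helper: the `since` bound for a call-me-now (24622)
/// subscription, backdated by the cushion and clamped at zero.
pub fn call_me_now_since(now_unix_secs: u64) -> u64 {
    now_unix_secs.saturating_sub(SINCE_CUSHION_SECS)
}

/// Whether an event's `created_at` passes a `since` filter.
pub fn within_window(created_at: u64, since: u64) -> bool {
    created_at >= since
}
```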

Verification

  • Relay full suite: 271 passed / 0 failed, including the two mesh_signaling invariants (accepted pair → two relay-signed 24622; self-target → zero emit).
  • Desktop: cargo check/test (mesh_llm), pnpm typecheck, fmt, file-size gate all green. The fail-closed + relay-client wiring was kept under size limits via a clean split (relayMeshSignaling.ts + startRelayMeshClientForTarget.ts) — no override bump.
  • Relay lane independently fresh-eyes reviewed by Dawn + Perci → 9/10, all must-fixes addressed.

⚠️ PENDING: live two-machine inference proof

The relay-backed two-machine desktop flow is new and has not been run end-to-end. It requires two desktop GUI nodes + a served model, which can't be driven headless. Run before merge using the runbook below (single host OK thanks to the env-port seam):

Setup (single host OK with the env-port seam):

  • Start a relay (dev) and add both identities as members: sprout-admin add-member --pubkey <A_HEX> --role member (and B).
  • Launch two desktops against the relay. A on default ports; B with SPROUT_MESH_API_PORT=9338 SPROUT_MESH_CONSOLE_PORT=3132 bin/pnpm --dir desktop tauri dev.

Drive:

  1. A: Settings → Compute → Share compute → pick a small model (e.g. jc-builds/SmolLM2-135M-Instruct-Q4_K_M-GGUF:Q4_K_M) → toggle Share → wait for Active.
  2. B: Create Agent → toggle "Run on relay mesh" → A's target appears as model — device (≤~20s after A is Active) → select it → create. This starts B's client, publishes 24621, the relay emits paired 24622, both nodes dial.
  3. Inference proof on B: curl -s http://127.0.0.1:9338/v1/chat/completions -H 'content-type: application/json' -d '{"model":"<id>","messages":[{"role":"user","content":"Reply with exactly one word: PONG"}],"max_tokens":128,"temperature":0}' → non-empty completion while the model is served only on A.

Passing condition: a curl chat completion against B's local mesh endpoint (:9337, or the override port, 9338 in this runbook) returns non-empty assistant content while the model is served only on A, with discovery and connect gated purely by relay membership.

Connectivity: WAN via STUN + relay-signaled hole-punch

Disabling iroh relays (RelayPolicy::ExplicitlyDisabled) turns off the iroh relay transport (no *.iroh.link traffic, no public-relay leak) but keeps raw STUN on for public-address discovery. The STUN-discovered public address is injected into the invite token / EndpointAddr, exchanged via the relay-signaled 24621/24622 flow, and used to hole-punch directly over UDP — so this works over WAN, not just LAN. Our Sprout relay performs the address-exchange coordination that iroh's relay would otherwise do; STUN is only a "what's my public IP" lookup, not a data path.
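The policy split described above can be sketched as a small enum. The variant names `DefaultPublic` and `ExplicitlyDisabled` and the `uses_raw_stun` predicate come from the commit messages in this PR; the `Disabled` variant name and the `uses_iroh_relay_transport` predicate are assumptions added for illustration, not the SDK's actual definitions.

```rust
/// Illustrative stand-in for the SDK's relay policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelayPolicy {
    /// Public iroh relays on (upstream default).
    DefaultPublic,
    /// iroh relays off, raw STUN kept on: the mode this PR uses.
    ExplicitlyDisabled,
    /// Assumed name for the LAN/mDNS mode: no relays and no STUN.
    Disabled,
}

impl RelayPolicy {
    /// STUN is only a public-address lookup, so it stays on whenever the
    /// node needs to be reachable over WAN.
    pub fn uses_raw_stun(self) -> bool {
        matches!(self, RelayPolicy::DefaultPublic | RelayPolicy::ExplicitlyDisabled)
    }

    /// Assumed predicate: only the default policy carries data via iroh relays.
    pub fn uses_iroh_relay_transport(self) -> bool {
        matches!(self, RelayPolicy::DefaultPublic)
    }
}
```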

Residual limit (honest): with iroh relays off there is no relay transport fallback, so two peers both behind symmetric NATs may fail to hole-punch (the case iroh relays normally cover). Works for the common cases (≥1 side cone-NAT / port-forwarded / server). If symmetric-both-ends becomes a real constraint, allow a self-hosted relay as a fallback — follow-up.
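As a rough mental model of that residual limit (deliberately simplified, since real NAT behavior has more categories and hole-punching is never guaranteed), the only pairing called out as likely to fail is symmetric on both ends. The `Nat` enum and function name are hypothetical.

```rust
/// Coarse NAT classification, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Nat {
    /// Public address, port-forwarded, or a server.
    Open,
    /// Cone NAT: the mapping is reused across destinations.
    Cone,
    /// Symmetric NAT: a fresh mapping per destination.
    Symmetric,
}

/// True when a direct hole-punch is expected to work without any relay
/// transport fallback, under the simplified model above.
pub fn direct_punch_expected(a: Nat, b: Nat) -> bool {
    !(a == Nat::Symmetric && b == Nat::Symmetric)
}
```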

⚠️ Dependency caveat (must resolve before merge)

desktop/src-tauri/Cargo.toml pins mesh-llm-sdk to a personal fork (tlongwell-block/mesh-llm@bc2f1106) as temporary e2e wiring. The two SDK knobs it carries (disable_iroh_relays, runtime EmbeddedNodeHandle::join_token) sit on the unmerged upstream SDK stack — companion upstream PR Mesh-LLM/mesh-llm#782 (base micn/sprout-embedded-serve-sdk = PR #736; mergeable). Before this merges, repoint the dep to a Block-owned source or an upstream tag once #782/#736 land upstream.

This SHA incorporates a line-by-line review (Perci) that caught two real blockers, now fixed: (1) disable_iroh_relays skips the 5s endpoint.online() wait via RelayPolicy::ExplicitlyDisabled — no startup tax with relays off; (2) the passive/client runtime loop handles RuntimeControlRequest::Join, so join_token() works in the mode Sprout's relay-mesh client actually uses. The desktop lib compiles and the mesh tests pass against this SHA.

Relationship to main's private-relay work (why this removes hydrate_private_relay_config)

main (#798) began a private iroh relay path (hydrate_private_relay_config fetches an iroh_relay_url from the relay's NIP-11 + a NIP-98 bearer). That's a good future default — a hosted relay traverses symmetric NAT, which this PR's direct-STUN path can't. But it is not yet wired end-to-end: the relay's NIP-11 doesn't advertise iroh_relay_url and the restricted iroh-relay runtime lane is unfinished (iroh_relay.rs is scaffold). So the desktop hydration would fail closed before mesh start.

This PR therefore takes the direct-STUN + relay-signaled call-me-now path as the working WAN-today solution, and removes the incomplete private-relay hydration to keep the connectivity model coherent. The clean follow-up (when the relay side exists): make transport explicit config — private_iroh_relay default when NIP-11 advertises one, direct_stun_signaling (this PR) as the opt-in/fallback. Decided with Max + Perci; see thread.
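A possible shape for that follow-up, sketched under assumptions: nothing below exists in this PR, and the type and function names are hypothetical. The selector prefers a private iroh relay when NIP-11 advertises a non-empty `iroh_relay_url` and otherwise falls back to the direct-STUN path shipped here.

```rust
/// Hypothetical explicit transport config for the follow-up.
#[derive(Debug, PartialEq, Eq)]
pub enum MeshTransport {
    /// Hosted relay advertised by the Sprout relay's NIP-11 document.
    PrivateIrohRelay { url: String },
    /// This PR's path: STUN address discovery plus relay-signaled hole-punch.
    DirectStunSignaling,
}

/// Choose a transport from the (optional) NIP-11 `iroh_relay_url` field.
pub fn select_transport(nip11_iroh_relay_url: Option<&str>) -> MeshTransport {
    match nip11_iroh_relay_url.map(str::trim) {
        Some(url) if !url.is_empty() => MeshTransport::PrivateIrohRelay {
            url: url.to_string(),
        },
        // Not advertised (or blank): use the direct path.
        _ => MeshTransport::DirectStunSignaling,
    }
}
```

An explicit opt-in override for `DirectStunSignaling` would sit in front of this selector; it is left out to keep the sketch short.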

@tlongwell-block requested a review from a team as a code owner June 2, 2026 19:41
@tlongwell-block force-pushed the eva/mesh-v1-direct-iroh branch 2 times, most recently from f18f750 to 0f47ff1, June 2, 2026 19:55
npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 16 commits June 2, 2026 17:39
v1 mesh is Sprout-coordinated direct iroh (hole-punch only) — no server-side
iroh relay/proxy and no NIP-98 bearer admission. iroh_relay.rs (verify_bearer /
decide_admission / admission_from_membership) was built for the abandoned
separate-process / in-process-relay designs and has zero production callers.
Relay membership gating stays in api::check_relay_membership (NIP-42 driven).

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
KIND_MESH_CONNECT_REQUEST (24621): desktop→relay, 'coordinate a hole-punch to
peer X'. KIND_MESH_CALL_ME_NOW (24622): relay→desktop (relay-signed), the live
paired dial trigger carrying the peer's EndpointAddr. Both ephemeral
(20000-29999, never stored), clustered in the mesh 24xxx family. Registered in
ALL_KINDS; no-duplicate test green.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
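For orientation, the kind numbers and the ephemeral range can be written down as follows. The constant names for 24621 and 24622 are from the commit message and `KIND_MESH_STATUS_REPORT` appears in a later commit; `MESH_KINDS` and `has_duplicates` are hypothetical stand-ins for the real `ALL_KINDS` registry and its no-duplicate test.

```rust
use std::collections::HashSet;

pub const KIND_MESH_STATUS_REPORT: u32 = 24620;
pub const KIND_MESH_CONNECT_REQUEST: u32 = 24621;
pub const KIND_MESH_CALL_ME_NOW: u32 = 24622;

/// Stand-in slice; the real registry holds every relay kind, not just these.
pub const MESH_KINDS: [u32; 3] = [
    KIND_MESH_STATUS_REPORT,
    KIND_MESH_CONNECT_REQUEST,
    KIND_MESH_CALL_ME_NOW,
];

/// Ephemeral kinds (20000-29999) are fanned out but never stored.
pub fn is_ephemeral(kind: u32) -> bool {
    (20000..=29999).contains(&kind)
}

/// True if any kind number appears more than once.
pub fn has_duplicates(kinds: &[u32]) -> bool {
    let mut seen = HashSet::new();
    kinds.iter().any(|k| !seen.insert(*k))
}
```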
…aired call-me-now

handlers/mesh_signaling.rs: on KIND_MESH_CONNECT_REQUEST (24621) from an
authenticated relay member, validate the #p target is also a member
(check_relay_membership; ViaOwner/Denied/error all fail closed), then mint two
relay-signed KIND_MESH_CALL_ME_NOW (24622) — one to each peer carrying the
OTHER's EndpointAddr — and fan them over the existing channel-less ephemeral
path (Redis nil-UUID global + local WS). Relay is endpoint-stateless: dial hints
come from the requester (which read them from 30621); the relay only validates
membership and pairs. Both ends gated by relay access, nothing else.

Wired as a branch in handle_ephemeral_event. 5 unit tests (parse/validate/shape).
cargo check + clippy clean.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
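The pairing step is essentially a content swap. A minimal sketch, assuming a simplified event shape: the `CallMeNow` struct and `pair_call_me_now` are hypothetical, endpoint addresses are opaque strings, and the relay signature and fanout are omitted.

```rust
/// Simplified call-me-now (24622) payload; the real event is relay-signed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CallMeNow {
    pub kind: u32,
    /// Recipient pubkey, used for #p routing.
    pub to: String,
    /// The OTHER side's dial hint.
    pub peer_endpoint_addr: String,
}

/// Build the pair for an accepted connect-request. Each argument is
/// (pubkey, endpoint_addr); both hints are supplied by the requester,
/// so the relay keeps no endpoint state.
pub fn pair_call_me_now(requester: (&str, &str), target: (&str, &str)) -> [CallMeNow; 2] {
    [
        CallMeNow {
            kind: 24622,
            to: requester.0.to_string(),
            peer_endpoint_addr: target.1.to_string(),
        },
        CallMeNow {
            kind: 24622,
            to: target.0.to_string(),
            peer_endpoint_addr: requester.1.to_string(),
        },
    ]
}
```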
…30621

KIND_MESH_STATUS_REPORT (24620): a member reports its mesh /api/status; the relay
sanitizes + republishes a relay-signed kind:30621 discovery note. Note is keyed
per reporter via d-tag 'sprout-relay-mesh:<pubkey>' so members' notes are
isolated (NIP-33 keys on (kind,pubkey,d); relay is always author) — no clobber,
no read-modify-write race. Report is ephemeral input; 30621 is the durable record.
Wires the previously-orphaned publish_mesh_status_from_payload. Membership
enforced upstream (reporter authed on member-gated WS). Full relay suite 261 green.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Proves distinct reporters get distinct d-tags (no cross-member clobber) and a
reporter's repeat report reuses its own d-tag (self-replace). This is the
subtlest correctness property of the status ingress (the multi-machine /
two-user no-clobber guarantee), so it gets an explicit guard.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
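The no-clobber property follows from how replaceable events are keyed. A minimal sketch, assuming an in-memory map in place of the relay's store: the d-tag format is the one quoted in the commit above, while `ReplaceableStore` and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// d-tag for a reporter's discovery note, per the commit message.
pub fn mesh_d_tag(reporter_pubkey: &str) -> String {
    format!("sprout-relay-mesh:{}", reporter_pubkey)
}

/// Toy replaceable-event store keyed on (kind, author pubkey, d-tag).
pub struct ReplaceableStore {
    notes: HashMap<(u32, String, String), String>,
}

impl ReplaceableStore {
    pub fn new() -> Self {
        ReplaceableStore { notes: HashMap::new() }
    }

    /// Insert or replace the 30621 note for one reporter. The relay is
    /// always the author, so isolation comes entirely from the d-tag.
    pub fn upsert(&mut self, relay_pubkey: &str, reporter_pubkey: &str, content: &str) {
        let key = (30621, relay_pubkey.to_string(), mesh_d_tag(reporter_pubkey));
        self.notes.insert(key, content.to_string());
    }

    pub fn get(&self, relay_pubkey: &str, reporter_pubkey: &str) -> Option<&str> {
        let key = (30621, relay_pubkey.to_string(), mesh_d_tag(reporter_pubkey));
        self.notes.get(&key).map(|s| s.as_str())
    }

    pub fn len(&self) -> usize {
        self.notes.len()
    }
}
```

Two reporters land in two slots, and a repeat report from the same reporter overwrites only its own slot, with no read-modify-write of a shared note.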
…t_id, rate limit, tests

Review consensus must-fixes for 9/10:
- ViaOwner symmetry (Option A): requester AND reporter now membership-checked
  with None auth_tag (ViaOwner unreachable), matching the target check. v1 mesh
  is direct-relay-members-only on all three desktop-facing kinds — no delegated
  identities. Closes the requester/target asymmetry (Perci) and the 24620
  reporter asymmetry (Dawn: delegated reporter would advertise an unconnectable
  serve_target).
- Pure trust gate: extracted membership_admits_mesh(MembershipDecision) so the
  boundary is unit-testable with no AppState. require_mesh_member() fails closed
  on every non-Member/OpenRelay decision + check errors.
- Contract drift fixed: self/peer_endpoint_id flow through 24621 -> 24622
  (optional, correlation/instrumentation only, never auth).
- Proxy-scope: mesh kinds (24620/24621) rejected under ProxySubmit — they're
  direct desktop-user actions; the call-me-now must route to the member's own
  session, not a proxy.
- Rate limit: 20/sec per-requester sliding window on 24621 (bounds the 1->2
  amplification), mirroring the observer-frame limiter.
- Tests: 5 -> 12 (membership gate all 4 decisions incl ViaOwner-denied,
  endpoint_id round-trip, p-tag extraction). Perci owns the stateful
  end-to-end pair-emit test separately.
- Docs: first-#p-wins + 24622 social-graph observability acknowledged in comments.

Full relay suite 269 green, clippy clean.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
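Two of the must-fixes above lend themselves to small sketches: the pure trust gate and the per-requester sliding window. Both are illustrations under assumptions. The four `MembershipDecision` variants are inferred from the commit text, and `ConnectRateLimiter` is a hypothetical stand-in that takes explicit millisecond timestamps rather than reading a clock.

```rust
use std::collections::{HashMap, VecDeque};

/// Variants inferred from the commit message; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MembershipDecision {
    Member,
    OpenRelay,
    ViaOwner,
    Denied,
}

/// Pure gate: only direct members (or an open relay) may use mesh kinds.
pub fn membership_admits_mesh(decision: MembershipDecision) -> bool {
    matches!(decision, MembershipDecision::Member | MembershipDecision::OpenRelay)
}

/// Sliding-window limiter keyed by requester pubkey.
pub struct ConnectRateLimiter {
    max_per_window: usize,
    window_ms: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl ConnectRateLimiter {
    pub fn new(max_per_window: usize, window_ms: u64) -> Self {
        ConnectRateLimiter { max_per_window, window_ms, hits: HashMap::new() }
    }

    /// Record a 24621 at `now_ms` and report whether it is within budget.
    pub fn allow(&mut self, requester: &str, now_ms: u64) -> bool {
        let queue = self.hits.entry(requester.to_string()).or_default();
        // Evict timestamps that have aged out of the window.
        while let Some(&oldest) = queue.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() >= self.max_per_window {
            return false;
        }
        queue.push_back(now_ms);
        true
    }
}
```

With `ConnectRateLimiter::new(20, 1000)` this matches the 20/sec figure: since each accepted request fans out to two 24622 events, the window caps the amplification at 40 emitted events per requester per second.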
DOCS (Eva): extract_target_pubkey notes first-#p-wins / no multi-target in v1;
publish_channelless_ephemeral notes call-me-now is #p-observable by any member
(intentional, matches presence/typing). Closes Dawn's source/summary deltas.

TESTS (Perci): AppState scaffolding (test_state, register_call_me_now_sub) +
accepted_connect_request_emits_two_relay_signed_call_me_now_events and
self_target_connect_request_emits_no_call_me_now_events — the stateful pair-emit
invariant (real fanout, relay signature, #p routing, content swap, self-target
zero-emit). These were swept into this commit by an over-broad `git add -A` on
the shared clone; authorship corrected via the trailer below. (db84186f is a
small follow-up format/verification tweak.)

Co-authored-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…ct + call-me-now)

Desktop lane for relay-gated mesh inference: in-process mesh node wiring,
24620/24621/24622 event flow. Create-agent fails closed on missing
reporter_pubkey/inviteToken; 5s since-cushion on call-me-now subscribe.

Mesh relay event helpers extracted to shared/api/relayMeshSignaling.ts and the
create-agent mesh start/connect sequence to mesh-compute/startRelayMeshClientForTarget.ts
to keep relayClientSession.ts and CreateAgentDialog.tsx under their size limits.

Mesh ports are env-overridable (SPROUT_MESH_API_PORT=9337, SPROUT_MESH_CONSOLE_PORT=3131
defaults; fail-closed on invalid/zero) so two nodes can run on one host for dev/test;
mesh_agent_preset builds OPENAI_COMPAT_BASE_URL from the selected port. Defaults are
behaviorally identical for normal users.

Pins mesh-llm-sdk to tlongwell-block/mesh-llm@94c46d7d (disable_iroh_relays +
join_token) — TEMPORARY e2e wiring; repoint to a Block-owned/upstream source
before the Sprout PR merges.

Co-authored-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
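The env-port seam can be sketched as a parse that falls back to the default only when the variable is unset and errors on anything invalid. `resolve_port` and `openai_compat_base_url` are hypothetical names, and the base-URL shape is an assumption inferred from the curl command in the runbook.

```rust
pub const DEFAULT_MESH_API_PORT: u16 = 9337;
pub const DEFAULT_MESH_CONSOLE_PORT: u16 = 3131;

/// Resolve a port from an optional env value. Unset means the default;
/// a set-but-invalid or zero value is an error (fail closed) rather than
/// a silent fallback.
pub fn resolve_port(raw: Option<&str>, default: u16) -> Result<u16, String> {
    match raw {
        None => Ok(default),
        Some(s) => match s.trim().parse::<u16>() {
            Ok(0) => Err(String::from("mesh port must be non-zero")),
            Ok(port) => Ok(port),
            Err(_) => Err(format!("invalid mesh port: {:?}", s)),
        },
    }
}

/// Assumed shape of the OpenAI-compatible base URL for the chosen API port.
pub fn openai_compat_base_url(api_port: u16) -> String {
    format!("http://127.0.0.1:{}/v1", api_port)
}
```

A caller would pass `std::env::var("SPROUT_MESH_API_PORT").ok().as_deref()` as `raw`, so normal users with nothing set get 9337 and 3131 unchanged.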
Moves the temporary fork pin from 94c46d7d to 999e394e, which carries the two
blockers Perci caught in PR #782 review:
- disable_iroh_relays now skips the 5s endpoint.online() wait (RelayPolicy::
  ExplicitlyDisabled) — no startup tax with relays off.
- passive/client runtime loop handles RuntimeControlRequest::Join, so
  EmbeddedNodeHandle::join_token() works in the mode Sprout's relay-mesh
  client actually uses (passive until call-me-now).
Re-derived onto the current #736 SDK branch tip. Cargo.lock regenerated.
Still temporary fork wiring — repoint to a Block-owned/upstream source before merge.

Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Perci <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Moves the temporary fork pin 999e394e → bc2f1106. The mesh knob now keeps raw
STUN enabled under RelayPolicy::ExplicitlyDisabled (uses_raw_stun matches
DefaultPublic | ExplicitlyDisabled), so "no iroh relays" no longer means "no
public-address discovery." STUN-discovered public addr flows into the invite
token / EndpointAddr, rides the relay-signaled 24621/24622 exchange, and lets
peers hole-punch over WAN — without any iroh-relay dependency or *.iroh.link
traffic. (Disabled / LAN-mDNS mode stays STUN-off, intentional.)

Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
The relay crate's dev-dependency on mesh-llm-sdk was still pinned to the original
upstream Mesh-LLM/mesh-llm@bd16da4 (pre-fork, pre-fixes). Point it at the same
fork SHA the desktop lane uses (tlongwell-block/mesh-llm@bc2f1106) so the whole
workspace resolves to one consistent, WAN-fixed mesh source. Root Cargo.lock
regenerated; zero stale bd16da4 refs remain.

Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Removing the dead iroh_relay module left dangling references; correct them to
the direct-STUN/relay-signaled connectivity model the PR actually ships:
- justfile mesh-e2e: run `mesh_signaling` tests (was the deleted
  iroh_relay::tests::admission).
- docs/mesh-llm-local-build.md: rewrite the connectivity section (public iroh
  relays off via disable_iroh_relays, raw STUN on for WAN, no relay transport
  fallback, symmetric-NAT residual + private-relay follow-up); repoint the
  Layer 2 acceptance row at handlers/mesh_signaling.rs; fix the Layer 1 join
  bootstrap note.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
…ay module

The e2e_mesh_llm module docs still cited the removed `iroh_relay` module for the
membership/admission invariants. Repoint them to `mesh_signaling` (where that
policy now lives). Comment-only; no test behavior change.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
…+ select by label

The relay-mesh model dropdown now keys options on target.endpointAddr (disambiguates
the same model across serve nodes), so the spec's selectOption(model_id) no longer
matched — select by the user-visible label instead.

Deeper: publishMeshConnectRequest(24621) stalled because the e2e bridge only ACKed
admin kinds and routed everything else through channel-tag enforcement; a 24621
carries a `p` tag, not an `h` tag, so it got OK(false,"Missing channel tag") and the
create-agent flow never reached create_managed_agent. Teach the mock to ACK mesh
control kinds (24620/24621) before channel enforcement — no paired 24622 modeled
(that's a separate call-me-now assertion). mesh-compute.spec.ts now 3/3 green.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
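The ordering fix in the mock can be sketched as below. The real e2e bridge is TypeScript; this is the same decision order in Rust, with hypothetical names and with "is this an admin kind" passed in as a flag because the PR text does not enumerate those kinds.

```rust
/// Simplified relay OK reply: (accepted, message).
#[derive(Debug, PartialEq, Eq)]
pub struct OkReply {
    pub accepted: bool,
    pub message: &'static str,
}

const ACK: OkReply = OkReply { accepted: true, message: "" };

/// Decide the mock's reply for a published event.
pub fn mock_handle_event(kind: u32, is_admin_kind: bool, tag_names: &[&str]) -> OkReply {
    if is_admin_kind {
        return ACK;
    }
    // Mesh control kinds carry a `p` tag, not an `h` tag, so they must be
    // ACKed before channel enforcement runs.
    if matches!(kind, 24620 | 24621) {
        return ACK;
    }
    if !tag_names.contains(&"h") {
        return OkReply { accepted: false, message: "Missing channel tag" };
    }
    ACK
}
```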
Rebased onto main, which feature-gated mesh-llm-sdk (optional dep + "mesh-llm"
feature, #823) and changed the ensure_client_node_for_model re-export (#824).
Resolved by carrying both: fork pin tlongwell-block/mesh-llm@bc2f1106 stays, now
also `optional = true`; restore.rs mesh preflight keeps the #[cfg(feature="mesh-llm")]
wrapper with our 3-arg ensure_client_node_for_model(..., None) signature.

Add no-op stubs for mesh_dial_endpoint_addr + mesh_status_report_payload to the
cfg(not(feature="mesh-llm")) stub module — the generate_handler! list references
them in all builds, so the default (feature-off) build needs the stubs (they were
present for the other mesh commands but not these two new ones). Lockfiles
regenerated. Verified: default build + with-feature build + desktop clippy
(-D warnings) + mesh-compute e2e (3/3) + relay suite (271/0) all green.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
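The stub pattern looks roughly like this. It is a sketch, not the desktop's code: the real commands are Tauri handlers whose signatures are not shown in this PR, so the return type and the error text here are assumptions.

```rust
// Real implementation, compiled only with the "mesh-llm" feature.
#[cfg(feature = "mesh-llm")]
mod mesh {
    pub fn mesh_status_report_payload() -> Result<String, String> {
        // Placeholder body; the real command would query the embedded node.
        Ok(String::from("{}"))
    }
}

// Stub with the same name and signature for the default (feature-off)
// build, so call sites and handler lists resolve either way.
#[cfg(not(feature = "mesh-llm"))]
mod mesh {
    pub fn mesh_status_report_payload() -> Result<String, String> {
        Err(String::from("mesh-llm feature disabled"))
    }
}

/// A single call site that compiles in both builds.
pub fn status_or_reason() -> String {
    match mesh::mesh_status_report_payload() {
        Ok(payload) => payload,
        Err(reason) => reason,
    }
}
```

Any list that names a command unconditionally, as `generate_handler!` does, needs exactly one definition of it under each cfg branch.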
@tlongwell-block force-pushed the eva/mesh-v1-direct-iroh branch from fa47143 to 19260c0, June 2, 2026 21:47
The two no-op stubs added for the #823 feature-gate (mesh_dial_endpoint_addr,
mesh_status_report_payload — required so the default no-feature build's
generate_handler! list resolves) pushed lib.rs from 828 to 846 lines, over its
835 override. Bump to 850 with the justification extended; the additions are
the minimal feature-gate completeness the default build needs, not bloat.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block merged commit 33cfc85 into main Jun 3, 2026
16 checks passed
@tlongwell-block deleted the eva/mesh-v1-direct-iroh branch June 3, 2026 14:49