Mesh-LLM v1: relay-gated direct-iroh inference between users (WAN) #822
Merged
v1 mesh is Sprout-coordinated direct iroh (hole-punch only) — no server-side iroh relay/proxy and no NIP-98 bearer admission. iroh_relay.rs (verify_bearer / decide_admission / admission_from_membership) was built for the abandoned separate-process / in-process-relay designs and has zero production callers. Relay membership gating stays in api::check_relay_membership (NIP-42 driven). Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
KIND_MESH_CONNECT_REQUEST (24621): desktop→relay, 'coordinate a hole-punch to peer X'. KIND_MESH_CALL_ME_NOW (24622): relay→desktop (relay-signed), the live paired dial trigger carrying the peer's EndpointAddr. Both ephemeral (20000-29999, never stored), clustered in the mesh 24xxx family. Registered in ALL_KINDS; no-duplicate test green. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
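A minimal sketch of the two kind constants and the properties this commit claims for them (ephemeral range, no duplicate registration). The `ALL_KINDS` slice here is a stand-in with only the two mesh kinds, and `is_ephemeral` / `has_no_duplicates` are hypothetical helper names, not the relay's actual functions.

```rust
use std::collections::HashSet;

/// desktop -> relay: "coordinate a hole-punch to peer X".
pub const KIND_MESH_CONNECT_REQUEST: u32 = 24621;
/// relay -> desktop (relay-signed): the paired dial trigger.
pub const KIND_MESH_CALL_ME_NOW: u32 = 24622;

/// Stand-in for the relay's kind registry (assumption: the real list is much longer).
pub const ALL_KINDS: &[u32] = &[KIND_MESH_CONNECT_REQUEST, KIND_MESH_CALL_ME_NOW];

/// Ephemeral kinds occupy 20000..=29999 and are never stored.
pub fn is_ephemeral(kind: u32) -> bool {
    (20_000..30_000).contains(&kind)
}

/// True when no kind number appears twice in the registry.
pub fn has_no_duplicates(kinds: &[u32]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on a repeat, which makes `all` stop early.
    kinds.iter().all(|k| seen.insert(*k))
}
```

Both mesh kinds fall inside the ephemeral range, so a relay following this check would fan them out without persisting them.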
…aired call-me-now handlers/mesh_signaling.rs: on KIND_MESH_CONNECT_REQUEST (24621) from an authenticated relay member, validate the #p target is also a member (check_relay_membership; ViaOwner/Denied/error all fail closed), then mint two relay-signed KIND_MESH_CALL_ME_NOW (24622) — one to each peer carrying the OTHER's EndpointAddr — and fan them over the existing channel-less ephemeral path (Redis nil-UUID global + local WS). Relay is endpoint-stateless: dial hints come from the requester (which read them from 30621); the relay only validates membership and pairs. Both ends gated by relay access, nothing else. Wired as a branch in handle_ephemeral_event. 5 unit tests (parse/validate/shape). cargo check + clippy clean. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
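The pairing step described above can be sketched as a pure function. This is an illustration under assumptions: the struct and field names are hypothetical, addresses are plain strings rather than real `EndpointAddr` values, signing and fanout are omitted, and membership is passed in as a closure instead of calling `check_relay_membership`.

```rust
pub const KIND_MESH_CALL_ME_NOW: u32 = 24622;

/// Hypothetical parsed form of a 24621 connect request.
#[derive(Debug, Clone, PartialEq)]
pub struct ConnectRequest {
    pub requester: String,      // authenticated author pubkey
    pub target: String,         // first `p` tag
    pub requester_addr: String, // requester's own dial hint
    pub target_addr: String,    // target's dial hint, read by the requester from 30621
}

/// Hypothetical shape of one emitted 24622 event (unsigned here).
#[derive(Debug, Clone, PartialEq)]
pub struct CallMeNow {
    pub kind: u32,
    pub to: String,        // recipient, carried as the `p` tag
    pub peer_addr: String, // the OTHER side's address
}

/// Returns two events with swapped addresses, or none when the request
/// is a self-target or either party is not a relay member (fail closed).
pub fn pair_call_me_now(
    req: &ConnectRequest,
    is_member: impl Fn(&str) -> bool,
) -> Vec<CallMeNow> {
    let both_members = is_member(&req.requester) && is_member(&req.target);
    if !both_members || req.requester == req.target {
        return Vec::new();
    }
    vec![
        CallMeNow {
            kind: KIND_MESH_CALL_ME_NOW,
            to: req.requester.clone(),
            peer_addr: req.target_addr.clone(),
        },
        CallMeNow {
            kind: KIND_MESH_CALL_ME_NOW,
            to: req.target.clone(),
            peer_addr: req.requester_addr.clone(),
        },
    ]
}
```

Because the function holds no endpoint state of its own, it mirrors the "endpoint-stateless" property: everything it emits is derived from the request plus the membership check.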
…30621 KIND_MESH_STATUS_REPORT (24620): a member reports its mesh /api/status; the relay sanitizes + republishes a relay-signed kind:30621 discovery note. Note is keyed per reporter via d-tag 'sprout-relay-mesh:<pubkey>' so members' notes are isolated (NIP-33 keys on (kind,pubkey,d); relay is always author) — no clobber, no read-modify-write race. Report is ephemeral input; 30621 is the durable record. Wires the previously-orphaned publish_mesh_status_from_payload. Membership enforced upstream (reporter authed on member-gated WS). Full relay suite 261 green. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
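To illustrate why the per-reporter d-tag avoids clobbering, here is a toy store keyed on (kind, author, d). It is a sketch under assumptions: `NoteStore` and `KIND_MESH_STATUS_NOTE` are hypothetical names, the note content is an opaque string, and sanitizing and signing are left out.

```rust
use std::collections::HashMap;

/// Durable discovery note kind (constant name is hypothetical).
pub const KIND_MESH_STATUS_NOTE: u32 = 30621;

/// d-tag format taken from the commit message.
pub fn mesh_status_d_tag(reporter_pubkey: &str) -> String {
    format!("sprout-relay-mesh:{reporter_pubkey}")
}

/// Toy addressable-event store: one slot per (kind, author, d).
#[derive(Default)]
pub struct NoteStore {
    notes: HashMap<(u32, String, String), String>,
}

impl NoteStore {
    /// The relay is always the author; the reporter only selects the d-tag.
    pub fn publish(&mut self, relay_pubkey: &str, reporter_pubkey: &str, status: &str) {
        let key = (
            KIND_MESH_STATUS_NOTE,
            relay_pubkey.to_string(),
            mesh_status_d_tag(reporter_pubkey),
        );
        // Same key replaces in place; a different reporter lands in a different slot.
        self.notes.insert(key, status.to_string());
    }

    pub fn get(&self, relay_pubkey: &str, reporter_pubkey: &str) -> Option<&str> {
        let key = (
            KIND_MESH_STATUS_NOTE,
            relay_pubkey.to_string(),
            mesh_status_d_tag(reporter_pubkey),
        );
        self.notes.get(&key).map(|s| s.as_str())
    }

    pub fn len(&self) -> usize {
        self.notes.len()
    }
}
```

A repeat report from one member overwrites only that member's slot, so there is no shared document to read, modify, and write back.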
Proves distinct reporters get distinct d-tags (no cross-member clobber) and a reporter's repeat report reuses its own d-tag (self-replace). This is the subtlest correctness property of the status ingress (the multi-machine / two-user no-clobber guarantee), so it gets an explicit guard. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…t_id, rate limit, tests

Review consensus must-fixes for 9/10:
- ViaOwner symmetry (Option A): requester AND reporter now membership-checked with None auth_tag (ViaOwner unreachable), matching the target check. v1 mesh is direct-relay-members-only on all three desktop-facing kinds, with no delegated identities. Closes the requester/target asymmetry (Perci) and the 24620 reporter asymmetry (Dawn: delegated reporter would advertise an unconnectable serve_target).
- Pure trust gate: extracted membership_admits_mesh(MembershipDecision) so the boundary is unit-testable with no AppState. require_mesh_member() fails closed on every non-Member/OpenRelay decision + check errors.
- Contract drift fixed: self/peer_endpoint_id flow through 24621 -> 24622 (optional, correlation/instrumentation only, never auth).
- Proxy-scope: mesh kinds (24620/24621) rejected under ProxySubmit; they're direct desktop-user actions, and the call-me-now must route to the member's own session, not a proxy.
- Rate limit: 20/sec per-requester sliding window on 24621 (bounds the 1->2 amplification), mirroring the observer-frame limiter.
- Tests: 5 -> 12 (membership gate all 4 decisions incl ViaOwner-denied, endpoint_id round-trip, p-tag extraction). Perci owns the stateful end-to-end pair-emit test separately.
- Docs: first-#p-wins + 24622 social-graph observability acknowledged in comments.

Full relay suite 269 green, clippy clean.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
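A rough sketch of the two mechanisms named in this commit: the pure membership gate and the per-requester sliding-window limit. The names `membership_admits_mesh`, `require_mesh_member`, and `MembershipDecision` come from the commit message, but the variant shapes, the function signatures, and the whole `SlidingWindowLimiter` type are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Assumed shape: four data-free decisions (the real enum may carry payloads).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MembershipDecision {
    Member,
    OpenRelay,
    ViaOwner,
    Denied,
}

/// Pure gate: only direct members (or an open relay) may use mesh kinds.
pub fn membership_admits_mesh(decision: MembershipDecision) -> bool {
    matches!(decision, MembershipDecision::Member | MembershipDecision::OpenRelay)
}

/// Fails closed on a check error as well as on a non-admitting decision.
pub fn require_mesh_member(check: Result<MembershipDecision, String>) -> bool {
    match check {
        Ok(decision) => membership_admits_mesh(decision),
        Err(_) => false,
    }
}

/// Hypothetical limiter: at most `max_events` per `window`, tracked per requester.
pub struct SlidingWindowLimiter {
    max_events: usize,
    window: Duration,
    seen: HashMap<String, VecDeque<Instant>>,
}

impl SlidingWindowLimiter {
    pub fn new(max_events: usize, window: Duration) -> Self {
        Self { max_events, window, seen: HashMap::new() }
    }

    /// Records the event and returns true if the requester is under the limit.
    pub fn allow(&mut self, requester: &str, now: Instant) -> bool {
        let queue = self.seen.entry(requester.to_string()).or_default();
        // Drop timestamps that have slid out of the window.
        while let Some(&oldest) = queue.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() >= self.max_events {
            return false;
        }
        queue.push_back(now);
        true
    }
}
```

With `SlidingWindowLimiter::new(20, Duration::from_secs(1))`, one requester can trigger at most 20 connect requests per second, which caps the emitted call-me-now events at 40 per second for that requester.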
DOCS (Eva): extract_target_pubkey notes first-#p-wins / no multi-target in v1; publish_channelless_ephemeral notes call-me-now is #p-observable by any member (intentional, matches presence/typing). Closes Dawn's source/summary deltas. TESTS (Perci): AppState scaffolding (test_state, register_call_me_now_sub) + accepted_connect_request_emits_two_relay_signed_call_me_now_events and self_target_connect_request_emits_no_call_me_now_events — the stateful pair-emit invariant (real fanout, relay signature, #p routing, content swap, self-target zero-emit). These were swept into this commit by an over-broad `git add -A` on the shared clone; authorship corrected via the trailer below. (db84186f is a small follow-up format/verification tweak.) Co-authored-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co> Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…ct + call-me-now) Desktop lane for relay-gated mesh inference: in-process mesh node wiring, 24620/24621/24622 event flow. Create-agent fails closed on missing reporter_pubkey/inviteToken; 5s since-cushion on call-me-now subscribe. Mesh relay event helpers extracted to shared/api/relayMeshSignaling.ts and the create-agent mesh start/connect sequence to mesh-compute/startRelayMeshClientForTarget.ts to keep relayClientSession.ts and CreateAgentDialog.tsx under their size limits. Mesh ports are env-overridable (SPROUT_MESH_API_PORT=9337, SPROUT_MESH_CONSOLE_PORT=3131 defaults; fail-closed on invalid/zero) so two nodes can run on one host for dev/test; mesh_agent_preset builds OPENAI_COMPAT_BASE_URL from the selected port. Defaults are behaviorally identical for normal users. Pins mesh-llm-sdk to tlongwell-block/mesh-llm@94c46d7d (disable_iroh_relays + join_token) — TEMPORARY e2e wiring; repoint to a Block-owned/upstream source before the Sprout PR merges. Co-authored-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co> Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
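The env-overridable, fail-closed port handling could look roughly like the following. The environment variable names and defaults are from the commit message; the function names are hypothetical, and the `/v1` base URL on `127.0.0.1` is an assumption inferred from the runbook's curl command rather than from `mesh_agent_preset` itself.

```rust
pub const DEFAULT_MESH_API_PORT: u16 = 9337;
pub const DEFAULT_MESH_CONSOLE_PORT: u16 = 3131;

/// Unset means "use the default"; anything set but invalid or zero is an error.
pub fn resolve_port(raw: Option<&str>, default: u16) -> Result<u16, String> {
    match raw {
        None => Ok(default),
        Some(s) => match s.trim().parse::<u16>() {
            Ok(0) => Err("port must be non-zero".to_string()),
            Ok(port) => Ok(port),
            Err(e) => Err(format!("invalid port {s:?}: {e}")),
        },
    }
}

/// Reads SPROUT_MESH_API_PORT, failing closed on unreadable values too.
pub fn mesh_api_port_from_env() -> Result<u16, String> {
    match std::env::var("SPROUT_MESH_API_PORT") {
        Ok(raw) => resolve_port(Some(&raw), DEFAULT_MESH_API_PORT),
        Err(std::env::VarError::NotPresent) => Ok(DEFAULT_MESH_API_PORT),
        Err(e) => Err(format!("SPROUT_MESH_API_PORT unreadable: {e}")),
    }
}

/// Assumed URL shape for OPENAI_COMPAT_BASE_URL, derived from the selected port.
pub fn openai_compat_base_url(port: u16) -> String {
    format!("http://127.0.0.1:{port}/v1")
}
```

Falling back to the default only when the variable is absent keeps normal users on 9337/3131, while a typo in an override stops startup instead of silently colliding with another node's port.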
Moves the temporary fork pin from 94c46d7d to 999e394e, which carries fixes for the two blockers Perci caught in PR #782 review:
- disable_iroh_relays now skips the 5s endpoint.online() wait (RelayPolicy::ExplicitlyDisabled), so there is no startup tax with relays off.
- passive/client runtime loop handles RuntimeControlRequest::Join, so EmbeddedNodeHandle::join_token() works in the mode Sprout's relay-mesh client actually uses (passive until call-me-now).

Re-derived onto the current #736 SDK branch tip. Cargo.lock regenerated. Still temporary fork wiring: repoint to a Block-owned/upstream source before merge.

Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Perci <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co> Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Moves the temporary fork pin 999e394e → bc2f1106. The mesh knob now keeps raw STUN enabled under RelayPolicy::ExplicitlyDisabled (uses_raw_stun matches DefaultPublic | ExplicitlyDisabled), so "no iroh relays" no longer means "no public-address discovery." STUN-discovered public addr flows into the invite token / EndpointAddr, rides the relay-signaled 24621/24622 exchange, and lets peers hole-punch over WAN — without any iroh-relay dependency or *.iroh.link traffic. (Disabled / LAN-mDNS mode stays STUN-off, intentional.) Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co> Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
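The policy split described here can be summarized in a few lines. This is a sketch, not the fork's code: `DefaultPublic` and `ExplicitlyDisabled` are named in the commit message, while the third variant name `Disabled` and the `uses_public_iroh_relays` helper are assumptions.

```rust
/// Assumed three-way policy; only the first two variant names are confirmed by the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelayPolicy {
    /// Default behavior with public iroh relays available.
    DefaultPublic,
    /// The mesh knob: relays off, but public-address discovery stays on.
    ExplicitlyDisabled,
    /// LAN / mDNS mode: relays off and STUN off.
    Disabled,
}

/// Raw STUN stays enabled for both the default and the relays-off WAN mode.
pub fn uses_raw_stun(policy: RelayPolicy) -> bool {
    matches!(policy, RelayPolicy::DefaultPublic | RelayPolicy::ExplicitlyDisabled)
}

/// Assumption: only the default policy talks to public iroh relays.
pub fn uses_public_iroh_relays(policy: RelayPolicy) -> bool {
    matches!(policy, RelayPolicy::DefaultPublic)
}
```

`ExplicitlyDisabled` is the only policy where the two predicates disagree, which is exactly the "no relays, but still discover my public address" combination this commit introduces.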
The relay crate's dev-dependency on mesh-llm-sdk was still pinned to the original upstream Mesh-LLM/mesh-llm@bd16da4 (pre-fork, pre-fixes). Point it at the same fork SHA the desktop lane uses (tlongwell-block/mesh-llm@bc2f1106) so the whole workspace resolves to one consistent, WAN-fixed mesh source. Root Cargo.lock regenerated; zero stale bd16da4 refs remain. Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co> Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Removing the dead iroh_relay module left dangling references; correct them to the direct-STUN/relay-signaled connectivity model the PR actually ships:
- justfile mesh-e2e: run `mesh_signaling` tests (was the deleted iroh_relay::tests::admission).
- docs/mesh-llm-local-build.md: rewrite the connectivity section (public iroh relays off via disable_iroh_relays, raw STUN on for WAN, no relay transport fallback, symmetric-NAT residual + private-relay follow-up); repoint the Layer 2 acceptance row at handlers/mesh_signaling.rs; fix the Layer 1 join bootstrap note.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
…ay module The e2e_mesh_llm module docs still cited the removed `iroh_relay` module for the membership/admission invariants. Repoint them to `mesh_signaling` (where that policy now lives). Comment-only; no test behavior change. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
…+ select by label The relay-mesh model dropdown now keys options on target.endpointAddr (disambiguates the same model across serve nodes), so the spec's selectOption(model_id) no longer matched — select by the user-visible label instead. Deeper: publishMeshConnectRequest(24621) stalled because the e2e bridge only ACKed admin kinds and routed everything else through channel-tag enforcement; a 24621 carries a `p` tag, not an `h` tag, so it got OK(false,"Missing channel tag") and the create-agent flow never reached create_managed_agent. Teach the mock to ACK mesh control kinds (24620/24621) before channel enforcement — no paired 24622 modeled (that's a separate call-me-now assertion). mesh-compute.spec.ts now 3/3 green. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Rebased onto main, which feature-gated mesh-llm-sdk (optional dep + "mesh-llm" feature, #823) and changed the ensure_client_node_for_model re-export (#824). Resolved by carrying both: fork pin tlongwell-block/mesh-llm@bc2f1106 stays, now also `optional = true`; restore.rs mesh preflight keeps the #[cfg(feature="mesh-llm")] wrapper with our 3-arg ensure_client_node_for_model(..., None) signature. Add no-op stubs for mesh_dial_endpoint_addr + mesh_status_report_payload to the cfg(not(feature="mesh-llm")) stub module — the generate_handler! list references them in all builds, so the default (feature-off) build needs the stubs (they were present for the other mesh commands but not these two new ones). Lockfiles regenerated. Verified: default build + with-feature build + desktop clippy (-D warnings) + mesh-compute e2e (3/3) + relay suite (271/0) all green. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Max <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
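The reason the feature-off build needs stubs can be shown with a small stand-alone sketch. It assumes a plain function-pointer table in place of Tauri's `generate_handler!`, and the stub return values are invented for illustration (the commit only says "no-op"); only the two command names and the `mesh-llm` feature name come from the text.

```rust
/// Simplified handler signature (assumption; real Tauri commands differ).
pub type Handler = fn() -> Result<String, String>;

#[cfg(feature = "mesh-llm")]
mod mesh_commands {
    // Placeholder bodies: the real versions would call into the mesh SDK.
    pub fn mesh_dial_endpoint_addr() -> Result<String, String> {
        Ok("endpoint addr from the embedded node".to_string())
    }
    pub fn mesh_status_report_payload() -> Result<String, String> {
        Ok("status payload from the embedded node".to_string())
    }
}

#[cfg(not(feature = "mesh-llm"))]
mod mesh_commands {
    // Stubs with the same names so the handler list below resolves in every build.
    pub fn mesh_dial_endpoint_addr() -> Result<String, String> {
        Err("mesh-llm feature is disabled".to_string())
    }
    pub fn mesh_status_report_payload() -> Result<String, String> {
        Err("mesh-llm feature is disabled".to_string())
    }
}

/// Referenced unconditionally, like the generate_handler! list in the commit.
pub fn registered_handlers() -> Vec<(&'static str, Handler)> {
    vec![
        ("mesh_dial_endpoint_addr", mesh_commands::mesh_dial_endpoint_addr as Handler),
        ("mesh_status_report_payload", mesh_commands::mesh_status_report_payload as Handler),
    ]
}
```

Deleting either stub from the `cfg(not(...))` module makes the default build fail with an unresolved path, which is the breakage this commit fixes.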
The two no-op stubs added for the #823 feature-gate (mesh_dial_endpoint_addr, mesh_status_report_payload — required so the default no-feature build's generate_handler! list resolves) pushed lib.rs from 828 to 846 lines, over its 835 override. Bump to 850 with the justification extended; the additions are the minimal feature-gate completeness the default build needs, not bloat. Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Mesh-LLM v1: relay-gated direct-iroh inference between users
Real LLM inference between two users on the same Sprout relay, gated purely by relay access. One user shares compute (serves a model); another runs an agent against it. Discovery, connection, and hole-punch signaling all flow through relay events — no public iroh relays, no out-of-band coordination.
How it works
- Members publish `24620` status reports; the relay validates membership and projects each into a per-reporter relay-signed `30621`. Non-members can't be discovered.
- A member publishes a `24621` connect-request naming a target. The relay validates both parties are members.
- The relay sends a paired, relay-signed `24622` to both endpoints; each dials the other. Fails closed on non-membership, self-target, or malformed requests.

Lanes
- Relay (`crates/sprout-relay`): mesh kinds, status-report ingress → 30621 projection, membership-gated signaling → paired 24622. Dead `iroh_relay` admission module removed.
- Desktop (`desktop/`): in-process mesh node wiring, the 24620/24621/24622 UI flow, create-agent fail-closed on missing reporter_pubkey/inviteToken, 5s since-cushion on call-me-now subscribe. Mesh ports env-overridable (`SPROUT_MESH_API_PORT`/`SPROUT_MESH_CONSOLE_PORT`, default 9337/3131, fail-closed) so two nodes can run on one host for dev/test.

Verification
The relay-backed two-machine desktop flow is new and has not been run end-to-end. It requires two desktop GUI nodes + a served model, which can't be driven headless. Run before merge using the runbook below (single host OK thanks to the env-port seam):
Setup (single host OK with the env-port seam):
- Add both users as relay members: `sprout-admin add-member --pubkey <A_HEX> --role member` (and B).
- Start the second node (B) on override ports: `SPROUT_MESH_API_PORT=9338 SPROUT_MESH_CONSOLE_PORT=3132 bin/pnpm --dir desktop tauri dev`.

Drive:
- On A: pick a model (`jc-builds/SmolLM2-135M-Instruct-Q4_K_M-GGUF:Q4_K_M`) → toggle Share → wait for `Active`.
- On B: create an agent and wait for the `model — device` option to appear (≤~20s after A is Active) → select it → create. This starts B's client, publishes `24621`, the relay emits paired `24622`, both nodes dial.
- On B: `curl -s http://127.0.0.1:9338/v1/chat/completions -H 'content-type: application/json' -d '{"model":"<id>","messages":[{"role":"user","content":"Reply with exactly one word: PONG"}],"max_tokens":128,"temperature":0}'` → non-empty completion while the model is served only on A.

Passing condition: non-empty assistant content from B's local mesh endpoint, served by A, with discovery+connect gated purely by relay membership.
Passing condition: a `curl` chat completion against the consumer's local `:9337` (or override port) returns non-empty content while the model is served only on the other node.

Connectivity: WAN via STUN + relay-signaled hole-punch
Disabling iroh relays (`RelayPolicy::ExplicitlyDisabled`) turns off the iroh relay transport (no `*.iroh.link` traffic, no public-relay leak) but keeps raw STUN on for public-address discovery. The STUN-discovered public address is injected into the invite token / `EndpointAddr`, exchanged via the relay-signaled `24621`/`24622` flow, and used to hole-punch directly over UDP, so this works over WAN, not just LAN. Our Sprout relay performs the address-exchange coordination that iroh's relay would otherwise do; STUN is only a "what's my public IP" lookup, not a data path.

Residual limit (honest): with iroh relays off there is no relay transport fallback, so two peers both behind symmetric NATs may fail to hole-punch (the case iroh relays normally cover). Works for the common cases (≥1 side cone-NAT / port-forwarded / server). If symmetric-both-ends becomes a real constraint, allow a self-hosted relay as a fallback (follow-up).
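The residual limit can be restated as a coarse decision table. This is a deliberately simplified model that only encodes the claim in the paragraph above (direct connection is expected unless both sides are behind symmetric NAT); real NAT behavior has more cases, and `NatKind` and `direct_hole_punch_expected` are hypothetical names.

```rust
/// Coarse NAT classification used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NatKind {
    /// Public server or port-forwarded host.
    Public,
    /// Cone NAT: the STUN-discovered mapping is reusable by the peer.
    Cone,
    /// Symmetric NAT: the mapping differs per destination.
    Symmetric,
}

/// With no relay transport fallback, the only case flagged as likely to fail
/// is symmetric NAT on both ends.
pub fn direct_hole_punch_expected(a: NatKind, b: NatKind) -> bool {
    !(a == NatKind::Symmetric && b == NatKind::Symmetric)
}
```

A self-hosted relay fallback, if added later, would only need to engage for the pairs where this predicate returns false.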
`desktop/src-tauri/Cargo.toml` pins `mesh-llm-sdk` to a personal fork (`tlongwell-block/mesh-llm@bc2f1106`) as temporary e2e wiring. The two SDK knobs it carries (`disable_iroh_relays`, runtime `EmbeddedNodeHandle::join_token`) sit on the unmerged upstream SDK stack: companion upstream PR Mesh-LLM/mesh-llm#782 (base `micn/sprout-embedded-serve-sdk` = PR #736; mergeable). Before this merges, repoint the dep to a Block-owned source or an upstream tag once #782/#736 land upstream.

This SHA incorporates a line-by-line review (Perci) that caught two real blockers, now fixed: (1) `disable_iroh_relays` skips the 5s `endpoint.online()` wait via `RelayPolicy::ExplicitlyDisabled`, so there is no startup tax with relays off; (2) the passive/client runtime loop handles `RuntimeControlRequest::Join`, so `join_token()` works in the mode Sprout's relay-mesh client actually uses. The desktop lib compiles and the mesh tests pass against this SHA.

Relationship to main's private-relay work (why this removes `hydrate_private_relay_config`)

main (#798) began a private iroh relay path (`hydrate_private_relay_config` fetches an `iroh_relay_url` from the relay's NIP-11 + a NIP-98 bearer). That's a good future default: a hosted relay traverses symmetric NAT, which this PR's direct-STUN path can't. But it is not yet wired end-to-end: the relay's NIP-11 doesn't advertise `iroh_relay_url` and the restricted iroh-relay runtime lane is unfinished (`iroh_relay.rs` is scaffold). So the desktop hydration would fail closed before mesh start.

This PR therefore takes the direct-STUN + relay-signaled call-me-now path as the working WAN-today solution, and removes the incomplete private-relay hydration to keep the connectivity model coherent. The clean follow-up (when the relay side exists): make transport explicit config, with `private_iroh_relay` as the default when NIP-11 advertises one and `direct_stun_signaling` (this PR) as the opt-in/fallback. Decided with Max + Perci; see thread.
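One possible shape for that follow-up, sketched under assumptions: the two transport names mirror the ones in the paragraph above, but the enum, the `select_transport` function, and the `force_direct` opt-in flag are hypothetical and not part of this PR.

```rust
/// Explicit transport choice proposed as a follow-up (names are illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MeshTransport {
    /// Use the relay advertised in the Sprout relay's NIP-11 document.
    PrivateIrohRelay { url: String },
    /// This PR's path: raw STUN plus relay-signaled call-me-now.
    DirectStunSignaling,
}

/// Prefers the private relay when one is advertised, unless the user opted
/// into direct signaling; falls back to direct signaling otherwise.
pub fn select_transport(nip11_iroh_relay_url: Option<&str>, force_direct: bool) -> MeshTransport {
    if force_direct {
        return MeshTransport::DirectStunSignaling;
    }
    // Treat a blank advertisement the same as a missing one.
    match nip11_iroh_relay_url.map(str::trim).filter(|u| !u.is_empty()) {
        Some(url) => MeshTransport::PrivateIrohRelay { url: url.to_string() },
        None => MeshTransport::DirectStunSignaling,
    }
}
```

Making the choice a value rather than an implicit hydration step means a relay that advertises nothing degrades to today's behavior instead of failing closed before mesh start.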