
fix(merkle): prefer local routing table over iterative DHT lookup #51

Closed
grumbach wants to merge 1 commit into main from
fix/merkle-dht-fallback

Conversation

@grumbach
Contributor

Problem

Merkle uploads on mainnet have been getting stuck indefinitely on the line:

Collecting candidate pools from 16 midpoints (concurrent)

prepare_merkle_batch_external collects 16 candidate pools in parallel, and each pool's get_merkle_candidate_pool calls find_closest_peers(address, 32), which under the hood triggers a fresh Kademlia iterative lookup (find_closest_nodes_network in saorsa-core).

Root cause

Three things compound:

  1. Kademlia iterative lookup does up to 20 rounds of α=3 concurrent FindNode RPCs. Each round blocks on the first response plus a 5 s grace window for stragglers.
  2. On mainnet many close candidates are behind strict NAT or offline, so most rounds actually wait the full grace on dead peers.
  3. Merkle fires 16 of these lookups in parallel. They starve each other for the single transport socket, so even reachable peers can't answer fast enough.

Observed via a new merkle-probe diagnostic binary that reproduces prepare_merkle_batch_external against mainnet without needing a wallet (it lives in a separate repo). Before the fix: after 3 min of running and 258 DHT-iteration events logged, zero of the 16 pool lookups had returned to the client. The 10-minute deadline was blown before the prepare phase could finish.

Fix

Merkle doesn't actually need the strict Kademlia-closest guarantee. We over-query (CANDIDATES_PER_POOL * 2 = 32 peers per pool), validate signatures, and discard failures. Any 16 peers with valid merkle candidate quotes will do.

Two changes:

  • Network::find_closest_peers_local — a new method that reads the routing table directly, no RPC. Bounded, microsecond-fast.
  • get_merkle_candidate_pool — try the local table first; fall back to the iterative network lookup only when the local table has fewer than CANDIDATES_PER_POOL * 2 entries (fresh bootstrap, tiny devnet).

On a client that just finished bootstrap, the routing table typically holds 50+ peers — plenty to satisfy the merkle pool collection without any network round trip.

Measurement

Same merkle-probe harness, same mainnet bootstrap, same 178 synthetic chunk addresses (matches a 730 MB file).

|  | Before | After |
| --- | --- | --- |
| prepare_merkle_batch_external | >180 s, never completed, killed | 1362 ms |
| DHT phase per pool | 16 lookups iterating, none returned | 0 ms (local read) × 16 |
| Pool collection total | never completed | 1355 ms |

~130× faster in the happy path, and actually completes versus hanging.

Instrumentation added alongside

All at info! level so they show under ant -v:

merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Per-peer phase=candidate lines at debug! level for deeper drill-down.

Test plan

  • cargo fmt --all --check: clean
  • cargo clippy --all-targets --all-features -- -D warnings: clean
  • cargo build --release: clean
  • Mainnet probe (before fix): 178-chunk merkle prepare hangs indefinitely (>3 min, zero pools completed)
  • Mainnet probe (after fix): 178-chunk merkle prepare completes in 1.36 s, all 16 pools validated
  • Local devnet (10 nodes) still correctly reports InsufficientPeers — the fallback chain (local → network → connected) still terminates cleanly when the network really doesn't have enough peers

No behavioural change in the happy path on small networks: if the routing table is too thin, we fall back to the same find_closest_peers call used previously. Breakage scenarios are therefore bounded to the "fresh client on a network where the routing table hasn't learned the right XOR neighbourhood yet" case, which is already handled by the find_closest_peers fallback.

Merkle uploads collect 16 candidate pools in parallel. Each pool
called `find_closest_peers(address, 32)` which triggered a fresh
Kademlia iterative network lookup. Running 16 of those concurrently
on mainnet was the cause of the "stuck forever" merkle upload:

- Each iterative lookup does up to 20 rounds of α=3 FindNode RPCs.
- Each round blocks on the first response plus a 5 s grace window
  for stragglers; many rounds waste the full grace waiting on dead
  or NAT'd peers.
- 16 lookups in parallel starve each other for the single transport
  socket used by saorsa-transport, so even well-behaved peers can't
  respond in time.
- On mainnet the 10-minute upload deadline was blown before even
  ONE of the 16 DHT lookups returned to the client.

Observed via `merkle-probe` (a new diagnostic binary that reproduces
`prepare_merkle_batch_external` without a wallet): before the fix,
208 DHT iteration events fired in 3 minutes and zero completed; after
the fix, all 16 pool lookups returned in 0 ms and the entire prepare
phase finished in 1.4 s.

Fix: add `Network::find_closest_peers_local` that reads the routing
table directly (no RPC). `get_merkle_candidate_pool` tries that
first; it only falls back to the iterative network lookup when the
local table holds fewer than `CANDIDATES_PER_POOL * 2` entries
(fresh bootstrap, tiny network). Freshness matters less here than
in the K-closest DHT contract — we over-query, validate signatures,
and discard failures anyway.

Instrumentation added alongside (all `info!`):

  merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
  merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
  merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Visible under `ant -v` / `ant -vv`. Makes future regressions in the
pool-collection path trivially diagnosable.
@grumbach
Contributor Author

Closing for discussion — branch retained.

@grumbach grumbach closed this Apr 21, 2026
