fix(merkle): prefer local routing table over iterative DHT lookup #51
Closed
Conversation
Merkle uploads collect 16 candidate pools in parallel. Each pool called `find_closest_peers(address, 32)`, which triggered a fresh Kademlia iterative network lookup. Running 16 of those concurrently on mainnet was the cause of the "stuck forever" merkle upload:

- Each iterative lookup does up to 20 rounds of α=3 FindNode RPCs.
- Each round blocks on the first response plus a 5 s grace window for stragglers; many rounds waste the full grace waiting on dead or NAT'd peers.
- 16 lookups in parallel starve each other for the single transport socket used by saorsa-transport, so even well-behaved peers can't respond in time.
- On mainnet the 10-minute upload deadline was blown before even ONE of the 16 DHT lookups returned to the client.

Observed via `merkle-probe` (a new diagnostic binary that reproduces `prepare_merkle_batch_external` without a wallet): before the fix, 208 DHT iteration events fired in 3 minutes and zero completed; after the fix, all 16 pool lookups returned in 0 ms and the entire prepare phase finished in 1.4 s.

Fix: add `Network::find_closest_peers_local`, which reads the routing table directly (no RPC). `get_merkle_candidate_pool` tries that first; it only falls back to the iterative network lookup when the local table holds fewer than `CANDIDATES_PER_POOL * 2` entries (fresh bootstrap, tiny network). Freshness matters less here than in the K-closest DHT contract — we over-query, validate signatures, and discard failures anyway.

Instrumentation added alongside (all `info!`):

    merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
    merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
    merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Visible under `ant -v` / `ant -vv`. Makes future regressions in the pool-collection path trivially diagnosable.
Closing for discussion — branch retained.
Problem
Merkle uploads on mainnet have been getting stuck indefinitely during pool collection: `prepare_merkle_batch_external` collects 16 candidate pools in parallel, and each pool's `get_merkle_candidate_pool` called `find_closest_peers(address, 32)` — which under the hood triggers a fresh Kademlia iterative lookup (`find_closest_nodes_network` in saorsa-core).

Root cause
Three things compound:

- Each iterative lookup does up to 20 rounds of α=3 `FindNode` RPCs. Each round blocks on the first response plus a 5 s grace window for stragglers.
- Many rounds waste the full grace window waiting on dead or NAT'd peers.
- 16 lookups in parallel starve each other for the single transport socket used by saorsa-transport, so even well-behaved peers can't respond in time.

Observed via a new `merkle-probe` diagnostic binary that reproduces `prepare_merkle_batch_external` against mainnet without needing a wallet (stays in a separate repo). Before the fix: after 3 min of running and 258 DHT-iteration events logged, zero of the 16 pool lookups had returned to the client. The 10-minute deadline was blown before the prepare phase could finish.

Fix
Merkle doesn't actually need the strict Kademlia-closest guarantee. We over-query (`CANDIDATES_PER_POOL * 2 = 32` peers per pool), validate signatures, and discard failures. Any 16 peers with valid merkle candidate quotes will do.

Two changes:
- `Network::find_closest_peers_local` — a new method that reads the routing table directly, no RPC. Bounded, microsecond-fast.
- `get_merkle_candidate_pool` — try the local table first; fall back to the iterative network lookup only when the local table has fewer than `CANDIDATES_PER_POOL * 2` entries (fresh bootstrap, tiny devnet).

On a client that just finished bootstrap, the routing table typically holds 50+ peers — plenty to satisfy the merkle pool collection without any network round trip.
Measurement
Same `merkle-probe` harness, same mainnet bootstrap, same 178 synthetic chunk addresses (matches a 730 MB file). `prepare_merkle_batch_external` is ~130× faster in the happy path, and actually completes versus hanging.
Instrumentation added alongside
All at `info!` level so they show under `ant -v`:

    merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
    merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
    merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Per-peer `phase=candidate` lines at `debug!` level for deeper drill-down.

Test plan
- `cargo fmt --all --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
- `cargo build --release`: clean
- `InsufficientPeers` — the fallback chain (local → network → connected) still terminates cleanly when the network really doesn't have enough peers

No behavioural change in the happy path of small networks: if the routing table is too thin, we fall back to the same `find_closest_peers` call used previously. Breakage scenarios are therefore bounded to the "fresh client on a network where the routing table hasn't learned the right XOR neighbourhood yet" case, which is already handled by the `find_closest_peers` fallback.