
fix(merkle): prefer local routing table over iterative DHT lookup #51

Closed
grumbach wants to merge 1 commit into main from
fix/merkle-dht-fallback

Conversation

@grumbach
Contributor

Problem

Merkle uploads on mainnet have been getting stuck indefinitely on the line:

Collecting candidate pools from 16 midpoints (concurrent)

prepare_merkle_batch_external collects 16 candidate pools in parallel, and each pool's get_merkle_candidate_pool calls find_closest_peers(address, 32), which under the hood triggers a fresh Kademlia iterative lookup (find_closest_nodes_network in saorsa-core).

Root cause

Three things compound:

  1. Kademlia iterative lookup does up to 20 rounds of α=3 concurrent FindNode RPCs. Each round blocks on the first response plus a 5 s grace window for stragglers.
  2. On mainnet many close candidates are behind strict NAT or offline, so most rounds actually wait the full grace on dead peers.
  3. Merkle fires 16 of these lookups in parallel. They starve each other for the single transport socket, so even reachable peers can't answer fast enough.

Observed via a new merkle-probe diagnostic binary that reproduces prepare_merkle_batch_external against mainnet without needing a wallet (it lives in a separate repo). Before the fix: after 3 min of running and 258 DHT-iteration events logged, zero of the 16 pool lookups had returned to the client. The 10-minute deadline was blown before the prepare phase could finish.

Fix

Merkle doesn't actually need the strict Kademlia-closest guarantee. We over-query (CANDIDATES_PER_POOL * 2 = 32 peers per pool), validate signatures, and discard failures. Any 16 peers with valid merkle candidate quotes will do.

Two changes:

  • Network::find_closest_peers_local — a new method that reads the routing table directly, no RPC. Bounded, microsecond-fast.
  • get_merkle_candidate_pool — try the local table first; fall back to the iterative network lookup only when the local table has fewer than CANDIDATES_PER_POOL * 2 entries (fresh bootstrap, tiny devnet).

On a client that just finished bootstrap, the routing table typically holds 50+ peers — plenty to satisfy the merkle pool collection without any network round trip.

Measurement

Same merkle-probe harness, same mainnet bootstrap, same 178 synthetic chunk addresses (matches a 730 MB file).

|  | Before | After |
| --- | --- | --- |
| prepare_merkle_batch_external | >180 s, never completed, killed | 1362 ms |
| DHT phase per pool | 16 lookups iterating, none returned | 0 ms (local read) × 16 |
| Pool collection total | never completed | 1355 ms |

~130× faster in the happy path, and actually completes versus hanging.

Instrumentation added alongside

All at info! level so they show under ant -v:

merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Per-peer phase=candidate lines at debug! level for deeper drill-down.

Test plan

  • cargo fmt --all --check: clean
  • cargo clippy --all-targets --all-features -- -D warnings: clean
  • cargo build --release: clean
  • Mainnet probe (before fix): 178-chunk merkle prepare hangs indefinitely (>3 min, zero pools completed)
  • Mainnet probe (after fix): 178-chunk merkle prepare completes in 1.36 s, all 16 pools validated
  • Local devnet (10 nodes) still correctly reports InsufficientPeers — the fallback chain (local → network → connected) still terminates cleanly when the network really doesn't have enough peers

No behavioural change in the happy path on small networks: if the routing table is too thin, we fall back to the same find_closest_peers call used previously. Breakage scenarios are therefore bounded to the "fresh client on a network where the routing table hasn't learned the right XOR neighbourhood yet" case, which is already handled by the find_closest_peers fallback.

Merkle uploads collect 16 candidate pools in parallel. Each pool
called `find_closest_peers(address, 32)` which triggered a fresh
Kademlia iterative network lookup. Running 16 of those concurrently
on mainnet was the cause of the "stuck forever" merkle upload:

- Each iterative lookup does up to 20 rounds of α=3 FindNode RPCs.
- Each round blocks on the first response plus a 5 s grace window
  for stragglers; many rounds waste the full grace waiting on dead
  or NAT'd peers.
- 16 lookups in parallel starve each other for the single transport
  socket used by saorsa-transport, so even well-behaved peers can't
  respond in time.
- On mainnet the 10-minute upload deadline was blown before even
  ONE of the 16 DHT lookups returned to the client.

Observed via `merkle-probe` (a new diagnostic binary that reproduces
`prepare_merkle_batch_external` without a wallet): before the fix,
208 DHT iteration events fired in 3 minutes and zero completed; after
the fix, all 16 pool lookups returned in 0 ms and the entire prepare
phase finished in 1.4 s.

Fix: add `Network::find_closest_peers_local` that reads the routing
table directly (no RPC). `get_merkle_candidate_pool` tries that
first; it only falls back to the iterative network lookup when the
local table holds fewer than `CANDIDATES_PER_POOL * 2` entries
(fresh bootstrap, tiny network). Freshness matters less here than
in the K-closest DHT contract — we over-query, validate signatures,
and discard failures anyway.

Instrumentation added alongside (all `info!`):

  merkle phase=dht addr=<hex> peers=<n> source=local|network elapsed_ms=<ms>
  merkle phase=pool pool=<i>/<total> completed=<k>/<total> elapsed_ms=<ms>
  merkle phase=pools_total pools=<total> elapsed_ms=<ms>

Visible under `ant -v` / `ant -vv`. Makes future regressions in the
pool-collection path trivially diagnosable.
@grumbach
Contributor Author

Closing for discussion — branch retained.

@grumbach grumbach closed this Apr 21, 2026
