Exclude failed peers from block fetch retry selection #186
Conversation
Track which peers returned empty BlocksByRoot responses per block root in a HashSet instead of the write-only last_peer field. On retry, filter them out of the candidate pool so we pick a different peer. Falls back to the full peer set (with cleared tracking) if all peers have been exhausted. Fixes retry storms observed in devnet where the same lagging peer was randomly selected on every attempt, wasting all 10 retries.
🤖 Kimi Code Review

Review Summary

The PR implements peer exclusion logic for block fetching to avoid repeatedly querying peers that returned empty responses. Overall, the implementation is sound, but there are a few issues to address.

Issues Found
Suggested Fixes
```rust
let (pool, excluded) = {
    // Note: `Option<&HashSet<_>>::unwrap_or_default()` does not compile
    // (references have no Default impl); borrow an empty set instead.
    let empty = HashSet::new();
    let failed = server
        .pending_requests
        .get(&root)
        .map(|p| &p.failed_peers)
        .unwrap_or(&empty);
    let pool: Vec<_> = server
        .connected_peers
        .iter()
        .copied()
        .filter(|p| !failed.contains(p))
        .collect();
    let excluded = server.connected_peers.len() - pool.len();
    (pool, excluded)
};
info!(%peer, %root, excluded_peers = excluded, total_peers = server.connected_peers.len(),
    "Sending BlocksByRoot request for missing block");
```

The core logic is correct and the peer exclusion mechanism will improve network efficiency. The race condition should be addressed before merge.

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

PR-specific change ( I couldn’t run

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

Review: Exclude failed peers from block fetch retry selection

Overall: The fix is correct and well-scoped. The root cause analysis (write-only `last_peer`) is accurate.

Correctness
```rust
let excluded = server.connected_peers.len() - pool.len();
```

This is computed after the fallback branch. When the fallback fires, `pool` is the full peer set again, so `excluded` logs as 0 even though every peer had already failed for this root.
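For illustration, a self-contained sketch of keeping the logged count accurate: compute `excluded` from the pre-fallback pool, not after the fallback widens it. `select_pool` and the `u32` peer ids are hypothetical stand-ins, not the PR's actual code.

```rust
use std::collections::HashSet;

fn select_pool(connected: &[u32], failed: &HashSet<u32>) -> (Vec<u32>, usize) {
    let filtered: Vec<u32> = connected
        .iter()
        .copied()
        .filter(|p| !failed.contains(p))
        .collect();
    // Count exclusions from the pre-fallback pool so the log stays honest.
    let excluded = connected.len() - filtered.len();
    if filtered.is_empty() {
        // All peers have failed: fall back to the full set, keep the real count.
        (connected.to_vec(), excluded)
    } else {
        (filtered, excluded)
    }
}

fn main() {
    let failed: HashSet<u32> = [1, 2, 3].into_iter().collect();
    let (pool, excluded) = select_pool(&[1, 2, 3], &failed);
    assert_eq!(pool, vec![1, 2, 3]); // fallback restored the full set
    assert_eq!(excluded, 3); // but the log still shows all 3 had failed
}
```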
```rust
server
    .pending_requests
    .entry(root)
    .or_insert(PendingRequest {
        attempts: 1,
        failed_peers: HashSet::new(),
    });
```

On a retry, the entry already exists, so `or_insert` is a no-op and the existing `attempts` and `failed_peers` carry over.

Stale peers in `failed_peers`

If a peer is added to `failed_peers`, it stays excluded until the all-peers fallback clears the set.
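A runnable sketch of an alternative using the map `entry` API, which distinguishes the first attempt from retries explicitly. `record_attempt` and the simplified types are hypothetical, not from the PR:

```rust
use std::collections::{HashMap, HashSet};

struct PendingRequest {
    attempts: u32,
    failed_peers: HashSet<u32>,
}

// Hypothetical helper: bump `attempts` when the entry already exists (a retry),
// initialise it once on the first attempt.
fn record_attempt(pending: &mut HashMap<u64, PendingRequest>, root: u64) {
    pending
        .entry(root)
        .and_modify(|p| p.attempts += 1)
        .or_insert_with(|| PendingRequest {
            attempts: 1,
            failed_peers: HashSet::new(),
        });
}

fn main() {
    let mut pending: HashMap<u64, PendingRequest> = HashMap::new();
    record_attempt(&mut pending, 7);
    record_attempt(&mut pending, 7); // second call is a retry
    assert_eq!(pending[&7].attempts, 2);
    assert!(pending[&7].failed_peers.is_empty());
}
```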
Both `fetch_block_from_peer` and `handle_fetch_failure` mutate `pending_requests`, so their ordering on the retry path matters.

```rust
server.request_id_map.insert(request_id, requested_root);
```

This is a pre-existing pattern not touched by this PR. After a retry is scheduled, the map can briefly hold two entries for the same root.

Observability

Adding the `excluded` count to the BlocksByRoot request log is a useful addition.

API / Style

Summary

The fix directly eliminates the retry storm by giving each pending root a memory of which peers have already failed for it.

Automated review by Claude (Anthropic) · custom prompt
Greptile Summary

This PR fixes a block-fetch retry storm observed in the March 6 devnet run by replacing the write-only `last_peer: Option<PeerId>` with a `failed_peers: HashSet<PeerId>` that excludes previously failed peers from retry selection.
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/net/p2p/src/lib.rs | Replaces last_peer: Option<PeerId> with failed_peers: HashSet<PeerId> in PendingRequest. The HashSet import was already present in lib.rs, so no additional imports are needed. Change is minimal and correct. |
| crates/net/p2p/src/req_resp/handlers.rs | Core logic change: fetch_block_from_peer now filters out known-failed peers before selection, with a fallback that clears failed_peers and reuses the full set when all peers are exhausted. handle_fetch_failure correctly inserts the failing peer before the retry/give-up decision. One minor concern: reconnecting peers that previously failed are not removed from failed_peers, relying entirely on the all-failed fallback path to re-include them. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant BC as Blockchain
    participant EL as event_loop
    participant FBP as fetch_block_from_peer
    participant HFF as handle_fetch_failure
    participant Peer
    BC->>EL: FetchBlock(root)
    EL->>FBP: fetch_block_from_peer(root)
    Note over FBP: Build pool = connected_peers - failed_peers<br/>(empty on first call → no exclusions)
    FBP->>Peer: BlocksByRoot request (request_id_A)
    FBP-->>EL: pending_requests[root] = {attempts:1, failed_peers:{}}
    Peer-->>EL: Empty BlocksByRoot response
    EL->>HFF: handle_fetch_failure(root, peer)
    Note over HFF: failed_peers.insert(peer)<br/>attempts++ → schedule retry
    EL->>FBP: retry: fetch_block_from_peer(root)
    Note over FBP: Pool = connected_peers - {peer}<br/>if pool empty → clear failed_peers, use full set
    FBP->>Peer: BlocksByRoot request (request_id_B)
    Peer-->>EL: BlocksByRoot response (block found)
    EL-->>EL: pending_requests.remove(root)<br/>blockchain.notify_new_block(block)
```
Comments Outside Diff (1)
- crates/net/p2p/src/req_resp/handlers.rs, line 150-155

Stale `request_id_map` entry interacts with `failed_peers`

When an empty `BlocksByRoot` response is received, the code removes the `request_id` from the map and then immediately re-inserts it:

```rust
let Some(requested_root) = server.request_id_map.remove(&request_id) else { ... };
if blocks.is_empty() {
    server.request_id_map.insert(request_id, requested_root); // ← re-insert stale entry
    handle_fetch_failure(...).await;
    return;
}
```

After `handle_fetch_failure` schedules the retry and `fetch_block_from_peer` sends a new request, `request_id_map` contains two entries for the same root: the stale `request_id` and the new one. If libp2p later emits an `OutboundFailure` event for the stale `request_id` (network-layer timeout on an already-answered request), `handle_req_resp_message` will call `handle_fetch_failure` a second time for the same peer, spuriously adding that peer to `failed_peers` before its real response arrives for the fresh request. In a small devnet this can prematurely advance toward the all-peers fallback.

The re-insert appears unnecessary — `handle_fetch_failure` does not read `request_id_map`. Removing it would eliminate the stale entry entirely:

```rust
if blocks.is_empty() {
    // Do NOT re-insert request_id_map; handle_fetch_failure doesn't need it.
    warn!(%peer, "Received empty BlocksByRoot response");
    handle_fetch_failure(server, requested_root, peer).await;
    return;
}
```
Last reviewed commit: ad127e4
```rust
let pool = if pool.is_empty() {
    warn!(%root, "All peers failed for this block, retrying with full peer set");
    if let Some(pending) = server.pending_requests.get_mut(&root) {
        pending.failed_peers.clear();
    }
    server.connected_peers.iter().copied().collect()
} else {
    pool
};
```
Reconnected peers remain excluded until all-peers fallback
When a peer that was added to failed_peers disconnects and then reconnects (same PeerId), it is re-added to connected_peers by the ConnectionEstablished handler in lib.rs, but it is not removed from failed_peers in any PendingRequest. As a result, the reconnected peer continues to be excluded from pool selection until the all-peers-failed fallback fires and clears failed_peers.
In a small devnet (2–4 peers) this means a fresh, synced reconnecting peer is invisible to the retry logic until all other peers have also failed, delaying recovery. The fallback does eventually handle this, but clearing failed_peers for a peer on reconnection would be the more precise fix.
Consider removing the peer from all pending failed_peers sets in ConnectionEstablished / after redialing, or at least noting this limitation with a comment here so future contributors are aware.
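A minimal sketch of the suggested fix, assuming simplified types (`u64` in place of `PeerId` and block roots); `on_connection_established` here is a hypothetical stand-in for the `ConnectionEstablished` handler, not the PR's code:

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for the PR's types: u64 for PeerId and block root.
struct PendingRequest {
    attempts: u32,
    failed_peers: HashSet<u64>,
}

// On ConnectionEstablished, drop the peer from every pending request's
// failed_peers so it becomes selectable again without waiting for the fallback.
fn on_connection_established(pending: &mut HashMap<u64, PendingRequest>, peer: u64) {
    for req in pending.values_mut() {
        req.failed_peers.remove(&peer);
    }
}

fn main() {
    let mut pending: HashMap<u64, PendingRequest> = HashMap::new();
    pending.insert(
        1,
        PendingRequest { attempts: 2, failed_peers: [42].into_iter().collect() },
    );
    on_connection_established(&mut pending, 42);
    assert!(pending[&1].failed_peers.is_empty()); // peer 42 is eligible again
    assert_eq!(pending[&1].attempts, 2); // retry budget untouched
}
```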
Motivation
In the March 6 devnet run (logs: `logs_03_06/`), both ethlambda nodes hit a block-fetch retry storm caused by `fetch_block_from_peer` randomly selecting the same lagging peers on every retry attempt.

- Node 0 (`ethlambda_0`): 131 WARN retries + 1 ERROR across 56 unique block roots in ~35 seconds (11:47:45→11:48:20)
- Node 1 (`ethlambda_1`): 107 WARN retries + 1 ERROR across 50 unique block roots

The retry distribution per peer on node 0 shows the problem clearly — peers that had pruned the requested blocks were selected repeatedly:

- `...vi2sxT75Bpq1c7yV`
- `...PQhkD6Zg5Co2ee8S`
- `...ErASHzouSQumaetyU`
- `...7TYVs6qvDKnrovd9`
PendingRequeststruct had alast_peer: Option<PeerId>field that was write-only — set on each attempt (line 240) but never read for peer selection decisions. Every retry picked a random peer from the full connected set, with no memory of which peers had already returned empty responses for that specific block root.Description
Replace `last_peer: Option<PeerId>` with `failed_peers: HashSet<PeerId>` to track all peers that returned empty `BlocksByRoot` responses for a given block root. On retry, filter them out of the candidate pool.

Changes

- `crates/net/p2p/src/lib.rs`
  - `PendingRequest.last_peer: Option<PeerId>` → `failed_peers: HashSet<PeerId>`
- `crates/net/p2p/src/req_resp/handlers.rs`
  - `fetch_block_from_peer`: Filter `connected_peers` against `failed_peers` before random selection. Fall back to full peer set (with `failed_peers` cleared) if all peers have been exhausted — peers may have caught up or new ones connected.
  - `handle_fetch_failure`: Insert the failing peer into `failed_peers` before scheduling retry.
  - Add an `excluded` count to the BlocksByRoot request log for observability.

Edge cases
- All peers exhausted: `failed_peers` cleared, so subsequent retries start a fresh round of elimination
- Success: `PendingRequest` removed on success, `failed_peers` dropped with it
- Per-root tracking: each `PendingRequest` has its own `failed_peers` — failure for root A doesn't exclude from root B

How to test
For production validation, deploy to a multi-client devnet and observe that block fetch retries cycle through different peers (visible via the `excluded=N` field in the `Sending BlocksByRoot request` log line).
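The retry behaviour described above can also be exercised in a unit-style sketch. `candidate_pool` is a hypothetical helper with simplified types, not the PR's implementation:

```rust
use std::collections::HashSet;

// Each failure shrinks the pool; once every peer has failed, the fallback
// clears the tracking and starts a fresh round with the full set.
fn candidate_pool(connected: &[u32], failed: &mut HashSet<u32>) -> Vec<u32> {
    let pool: Vec<u32> = connected
        .iter()
        .copied()
        .filter(|p| !failed.contains(p))
        .collect();
    if pool.is_empty() {
        failed.clear();
        connected.to_vec()
    } else {
        pool
    }
}

fn main() {
    let connected = vec![1, 2, 3];
    let mut failed: HashSet<u32> = HashSet::new();
    failed.insert(1); // peer 1 returned an empty response
    assert_eq!(candidate_pool(&connected, &mut failed), vec![2, 3]);
    failed.insert(2);
    failed.insert(3); // every peer has now failed
    assert_eq!(candidate_pool(&connected, &mut failed), vec![1, 2, 3]);
    assert!(failed.is_empty()); // fallback cleared the tracking
}
```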