
Exclude failed peers from block fetch retry selection #186

Merged
MegaRedHand merged 3 commits into main from smarter-block-fetch-peer-selection
Mar 6, 2026

Conversation

@pablodeymo
Collaborator

Motivation

In the March 6 devnet run (logs: logs_03_06/), both ethlambda nodes hit a block-fetch retry storm caused by fetch_block_from_peer randomly selecting the same lagging peers on every retry attempt.

Node 0 (ethlambda_0): 131 WARN retries + 1 ERROR across 56 unique block roots in ~35 seconds (11:47:45–11:48:20)
Node 1 (ethlambda_1): 107 WARN retries + 1 ERROR across 50 unique block roots

The retry distribution per peer on node 0 shows the problem clearly — peers that had pruned the requested blocks were selected repeatedly:

Peer (truncated) Retries
...vi2sxT75Bpq1c7yV 46
...PQhkD6Zg5Co2ee8S 45
...ErASHzouSQumaetyU 32
...7TYVs6qvDKnrovd9 8

The root cause: the old PendingRequest struct had a last_peer: Option<PeerId> field that was write-only — set on each attempt (line 240) but never read for peer selection decisions. Every retry picked a random peer from the full connected set, with no memory of which peers had already returned empty responses for that specific block root.

Description

Replace last_peer: Option<PeerId> with failed_peers: HashSet<PeerId> to track all peers that returned empty BlocksByRoot responses for a given block root. On retry, filter them out of the candidate pool.
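In sketch form, the selection change looks like this (a minimal sketch, not the PR's actual code: `PeerId` is a numeric stand-in for libp2p's `PeerId`, and `candidate_pool` is an illustrative helper name):

```rust
use std::collections::HashSet;

type PeerId = u64; // stand-in for libp2p's PeerId

/// Build the retry candidate pool: connected peers minus those that
/// already returned an empty response for this block root. If every
/// peer has failed, clear the tracking and fall back to the full set
/// so retries can continue.
fn candidate_pool(connected: &[PeerId], failed: &mut HashSet<PeerId>) -> Vec<PeerId> {
    let pool: Vec<PeerId> = connected
        .iter()
        .copied()
        .filter(|p| !failed.contains(p))
        .collect();
    if pool.is_empty() {
        // All peers exhausted: start a fresh round of elimination.
        failed.clear();
        connected.to_vec()
    } else {
        pool
    }
}

fn main() {
    let connected = vec![1, 2, 3];
    let mut failed: HashSet<PeerId> = [2].into_iter().collect();
    assert_eq!(candidate_pool(&connected, &mut failed), vec![1, 3]);
    failed.extend([1, 3]);
    // Every peer failed: fallback returns the full set and clears tracking.
    assert_eq!(candidate_pool(&connected, &mut failed), vec![1, 2, 3]);
    assert!(failed.is_empty());
}
```

The random pick among the filtered pool then happens exactly as before; only the pool construction changes.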

Changes

crates/net/p2p/src/lib.rs

  • PendingRequest.last_peer: Option<PeerId> → failed_peers: HashSet<PeerId>

crates/net/p2p/src/req_resp/handlers.rs

  • fetch_block_from_peer: Filter connected_peers against failed_peers before random selection. Fall back to full peer set (with failed_peers cleared) if all peers have been exhausted — peers may have caught up or new ones connected.
  • handle_fetch_failure: Insert the failing peer into failed_peers before scheduling retry.
  • Add excluded count to the BlocksByRoot request log for observability.

Edge cases

  1. All peers failed: Falls back to full peer set with failed_peers cleared, so subsequent retries start a fresh round of elimination
  2. Single peer: Works naturally — fails, fallback kicks in, retries same peer (only option)
  3. Request completed: PendingRequest removed on success, failed_peers dropped with it
  4. Different roots: Each PendingRequest has its own failed_peers — failure for root A doesn't exclude from root B

How to test

```
make fmt    # Formatting passes
make lint   # Clippy with -D warnings passes
cargo test --workspace --release   # All tests pass
```

For production validation, deploy to a multi-client devnet and observe that block fetch retries cycle through different peers (visible via the excluded=N field in the Sending BlocksByRoot request log line).

  Track which peers returned empty BlocksByRoot responses per block root
  in a HashSet instead of the write-only last_peer field. On retry,
  filter them out of the candidate pool so we pick a different peer.
  Falls back to the full peer set (with cleared tracking) if all peers
  have been exhausted.

  Fixes retry storms observed in devnet where the same lagging peer was
  randomly selected on every attempt, wasting all 10 retries.
@github-actions

github-actions bot commented Mar 6, 2026

🤖 Kimi Code Review

Review Summary

The PR implements peer exclusion logic for block fetching to avoid repeatedly querying peers that returned empty responses. Overall, the implementation is sound, but there are a few issues to address.

Issues Found

  1. Race condition in fetch_block_from_peer (lines 204-213)
    The failed_peers HashSet is accessed without holding the pending_requests lock across the entire operation. This creates a race condition where:

    • Thread A reads failed_peers to build the pool
    • Thread B updates failed_peers via handle_fetch_failure
    • Thread A proceeds with stale data
  2. Inefficient fallback strategy (lines 217-228)
    When all peers have failed, the code clears failed_peers and retries with the full set. This could create infinite retry loops if the same peers keep failing. Consider adding exponential backoff or a cooldown period.

  3. Missing bounds check on failed_peers (no line number)
    The failed_peers HashSet can grow unbounded over time. While it's cleared on fallback, consider adding a maximum size or periodic cleanup for long-running nodes.

  4. Logging inconsistency (line 244)
    The excluded count calculation (server.connected_peers.len() - pool.len()) is correct but could be misleading when pool.is_empty() triggers fallback. Consider logging the actual number of excluded peers vs. total peers separately.

Suggested Fixes

  1. Fix race condition:
```rust
let (pool, excluded) = {
    // Clone the failed set so no borrow of `pending_requests` is held
    // while building the pool (`&HashSet` has no Default, so cloning
    // also lets `unwrap_or_default` handle the missing-entry case).
    let failed = server
        .pending_requests
        .get(&root)
        .map(|p| p.failed_peers.clone())
        .unwrap_or_default();

    let pool: Vec<_> = server
        .connected_peers
        .iter()
        .copied()
        .filter(|p| !failed.contains(p))
        .collect();

    let excluded = server.connected_peers.len() - pool.len();
    (pool, excluded)
};
```
  2. Add retry backoff:
    Consider adding a last_attempt: Instant field to PendingRequest to implement exponential backoff instead of immediate retry.

  3. Improve logging:

```rust
info!(%peer, %root, excluded_peers = excluded, total_peers = server.connected_peers.len(),
      "Sending BlocksByRoot request for missing block");
```

The core logic is correct and the peer exclusion mechanism will improve network efficiency. The race condition should be addressed before merge.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

github-actions bot commented Mar 6, 2026

🤖 Codex Code Review

  1. High: malicious/misbehaving peer can permanently stall block fetches
  • In BlocksByRoot response handling, if response is non-empty but contains no block matching requested_root, the code just continues and exits without retrying or clearing pending state.
  • That leaves pending_requests[root] stuck, and dedup logic prevents future fetch attempts for that root.
  • References: handlers.rs:157, handlers.rs:161, lib.rs:414
  • Suggested fix: track whether a matching block was found; if none, call handle_fetch_failure(server, requested_root, peer).await.
  2. Medium: request_id_map stale-entry leak on empty responses
  • On empty response, request_id is removed, then reinserted before scheduling retry. That old request ID will never be used again, so map entries accumulate over time.
  • This can become a memory growth vector under repeated empty responses.
  • Reference: handlers.rs:151
  • Suggested fix: remove the reinsertion line; only map new outgoing request IDs.

PR-specific change (failed_peers exclusion/reset) itself looks reasonable for retry diversity and does not introduce obvious consensus/fork-choice/SSZ/XMSS risks.

I couldn’t run cargo check in this environment due to rustup temp-file permission restrictions, so this review is static-analysis only.


Automated review by OpenAI Codex · custom prompt

@github-actions

github-actions bot commented Mar 6, 2026

🤖 Claude Code Review

Here is the review:


Review: Exclude failed peers from block fetch retry selection

Overall: The fix is correct and well-scoped. The root cause analysis (write-only last_peer field) is accurate, the solution is minimal, and the edge cases are handled properly. A few observations below.


Correctness

excluded count on fallback (handlers.rs:243)

```rust
let excluded = server.connected_peers.len() - pool.len();
```

This is computed after the fallback branch. When the fallback fires, pool is rebuilt to the full connected_peers set, so excluded evaluates to 0. The warn log above covers the situation, so this isn't misleading in practice — just worth being aware of during log analysis. Not a bug.

or_insert on retry (handlers.rs:256–262)

```rust
server
    .pending_requests
    .entry(root)
    .or_insert(PendingRequest {
        attempts: 1,
        failed_peers: HashSet::new(),
    });
```

On a retry, the entry already exists, so or_insert is a no-op and the PendingRequest literal is never used (but HashSet::new() doesn't allocate, so it's zero-cost). Correct behavior, though or_insert_with(|| ...) would make the intent slightly more explicit. Minor.

Stale peers in failed_peers after reconnection

If a peer is added to failed_peers and then disconnects and reconnects during the backoff window, it remains excluded until either all peers are exhausted (triggering the fallback/clear) or the request completes. This is conservative but not incorrect — the fallback guarantees eventual retry against reconnected peers.

handle_fetch_failure called from two paths

Both OutboundFailure (transport-level failure) and handle_blocks_by_root_response (empty response) correctly pass peer to handle_fetch_failure, which inserts it into failed_peers. Symmetrical and correct.

request_id_map re-insertion on empty response (handlers.rs:151, pre-existing)

```rust
server.request_id_map.insert(request_id, requested_root);
```

This is a pre-existing pattern not touched by this PR. After handle_fetch_failure schedules a retry (which will generate a new request_id), the old request_id entry stays in the map indefinitely. If libp2p later emits an OutboundFailure for that old request_id, it would call handle_fetch_failure a second time for the same root with the same peer. Low probability in practice, but worth a follow-up cleanup.


Observability

Adding excluded to the log line is a good call — it lets you see at a glance how many peers are being filtered without needing to correlate against separate events. The fallback warn! is also clear about what's happening.


API / Style

  • is_none_or (handlers.rs:210) is stable since Rust 1.82; the project targets 1.92.0. Fine.
  • HashSet is correctly imported locally in handlers.rs rather than re-exported from lib.rs. Clean.
  • The borrow-scoping trick with the inner block (lines 204–212) to drop the failed reference before the mutable fallback borrow is idiomatic and correct.

Summary

The fix directly eliminates the retry storm by giving each PendingRequest memory of which peers have already failed for that specific block root. The fallback-with-clear strategy correctly handles the all-peers-exhausted case. No correctness or safety issues introduced. The pre-existing request_id_map stale-entry issue is the only thing worth tracking as a follow-up.


Automated review by Claude (Anthropic) · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR fixes a block-fetch retry storm observed in the March 6 devnet run by replacing the write-only last_peer: Option<PeerId> field in PendingRequest with failed_peers: HashSet<PeerId>, and using that set to filter candidate peers before random selection in fetch_block_from_peer. When all connected peers have been exhausted, the implementation falls back to the full peer set (clearing failed_peers) so retries can continue, and a new excluded log field provides observability into how many peers are being skipped.

  • PendingRequest struct (lib.rs): last_peer: Option<PeerId> → failed_peers: HashSet<PeerId>. Simple, isolated change.
  • fetch_block_from_peer (handlers.rs): Builds a filtered candidate pool; falls back to full peer set when pool is empty; logs excluded count per request.
  • handle_fetch_failure (handlers.rs): Inserts the failing peer into failed_peers before the retry/give-up decision — correctly scoped per block root.
  • Stale request_id_map re-insert (handle_blocks_by_root_response, line 151): Pre-existing code re-inserts a just-removed request_id before calling handle_fetch_failure. This creates a stale map entry and can trigger a spurious second call to handle_fetch_failure (via a later OutboundFailure event), which could prematurely add a peer to failed_peers and advance the retry toward the fallback cycle sooner than expected.
  • Reconnected peers not cleared from failed_peers: A peer that was added to failed_peers, then disconnected and reconnected, will continue to be excluded until the all-peers-failed fallback fires and clears the set. This is functionally handled by the fallback, but on small devnets a freshly reconnected and synced peer could be unexpectedly skipped for several retry rounds.

Confidence Score: 4/5

  • Safe to merge with awareness of the pre-existing stale request_id_map re-insert and the reconnected-peer exclusion edge case.
  • The core fix is correct and well-reasoned. The failed_peers filtering logic, fallback mechanism, and excluded metric are all sound. Two concerns keep this from a 5: (1) the pre-existing stale request_id_map entry in handle_blocks_by_root_response interacts with the new failed_peers logic by potentially adding peers spuriously; (2) peers that reconnect remain excluded from failed_peers sets until the all-peers fallback fires, which in a small devnet could delay recovery.
  • crates/net/p2p/src/req_resp/handlers.rs — specifically handle_blocks_by_root_response (stale re-insert at line 151) and the reconnected-peer scenario in fetch_block_from_peer.

Important Files Changed

Filename Overview
crates/net/p2p/src/lib.rs Replaces last_peer: Option<PeerId> with failed_peers: HashSet<PeerId> in PendingRequest. The HashSet import was already present in lib.rs, so no additional imports are needed. Change is minimal and correct.
crates/net/p2p/src/req_resp/handlers.rs Core logic change: fetch_block_from_peer now filters out known-failed peers before selection, with a fallback that clears failed_peers and reuses the full set when all peers are exhausted. handle_fetch_failure correctly inserts the failing peer before the retry/give-up decision. One minor concern: reconnecting peers that previously failed are not removed from failed_peers, relying entirely on the all-failed fallback path to re-include them.

Sequence Diagram

```mermaid
sequenceDiagram
    participant BC as Blockchain
    participant EL as event_loop
    participant FBP as fetch_block_from_peer
    participant HFF as handle_fetch_failure
    participant Peer

    BC->>EL: FetchBlock(root)
    EL->>FBP: fetch_block_from_peer(root)
    Note over FBP: Build pool = connected_peers - failed_peers<br/>(empty on first call → no exclusions)
    FBP->>Peer: BlocksByRoot request (request_id_A)
    FBP-->>EL: pending_requests[root] = {attempts:1, failed_peers:{}}

    Peer-->>EL: Empty BlocksByRoot response
    EL->>HFF: handle_fetch_failure(root, peer)
    Note over HFF: failed_peers.insert(peer)<br/>attempts++ → schedule retry

    EL->>FBP: retry: fetch_block_from_peer(root)
    Note over FBP: Pool = connected_peers - {peer}<br/>if pool empty → clear failed_peers, use full set
    FBP->>Peer: BlocksByRoot request (request_id_B)

    Peer-->>EL: BlocksByRoot response (block found)
    EL-->>EL: pending_requests.remove(root)<br/>blockchain.notify_new_block(block)
```

Comments Outside Diff (1)

  1. crates/net/p2p/src/req_resp/handlers.rs, line 150-155 (link)

    Stale request_id_map entry interacts with failed_peers

    When an empty BlocksByRoot response is received, the code removes the request_id from the map and then immediately re-inserts it:

    ```rust
    let Some(requested_root) = server.request_id_map.remove(&request_id) else { ... };

    if blocks.is_empty() {
        server.request_id_map.insert(request_id, requested_root); // ← re-insert stale entry
        handle_fetch_failure(...).await;
        return;
    }
    ```

    After handle_fetch_failure schedules the retry and fetch_block_from_peer sends a new request, request_id_map contains two entries for the same root: the stale request_id and the new one. If libp2p later emits an OutboundFailure event for the stale request_id (network-layer timeout on an already-answered request), handle_req_resp_message will call handle_fetch_failure a second time for the same peer, spuriously adding that peer to failed_peers before its real response arrives for the fresh request. In a small devnet this can prematurely advance toward the all-peers fallback.

    The re-insert appears unnecessary — handle_fetch_failure does not read request_id_map. Removing it would eliminate the stale entry entirely:

    ```rust
    if blocks.is_empty() {
        // Do NOT re-insert into request_id_map; handle_fetch_failure doesn't need it.
        warn!(%peer, "Received empty BlocksByRoot response");
        handle_fetch_failure(server, requested_root, peer).await;
        return;
    }
    ```

Last reviewed commit: ad127e4

Comment on lines +217 to +225
```rust
let pool = if pool.is_empty() {
    warn!(%root, "All peers failed for this block, retrying with full peer set");
    if let Some(pending) = server.pending_requests.get_mut(&root) {
        pending.failed_peers.clear();
    }
    server.connected_peers.iter().copied().collect()
} else {
    pool
};
```

Reconnected peers remain excluded until all-peers fallback

When a peer that was added to failed_peers disconnects and then reconnects (same PeerId), it is re-added to connected_peers by the ConnectionEstablished handler in lib.rs, but it is not removed from failed_peers in any PendingRequest. As a result, the reconnected peer continues to be excluded from pool selection until the all-peers-failed fallback fires and clears failed_peers.

In a small devnet (2–4 peers) this means a fresh, synced reconnecting peer is invisible to the retry logic until all other peers have also failed, delaying recovery. The fallback does eventually handle this, but clearing failed_peers for a peer on reconnection would be the more precise fix.

Consider removing the peer from all pending failed_peers sets in ConnectionEstablished / after redialing, or at least noting this limitation with a comment here so future contributors are aware.
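A concise sketch of that suggestion (assuming pending_requests is a HashMap keyed by block root, as described in the PR; `PeerId` is a numeric stand-in and `on_peer_reconnected` is a hypothetical helper):

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64; // stand-in for libp2p's PeerId
type Root = [u8; 32];

struct PendingRequest {
    attempts: u32,
    failed_peers: HashSet<PeerId>,
}

/// On ConnectionEstablished, drop the reconnected peer from every
/// pending request's failure set so it is immediately eligible for
/// retries again instead of waiting for the all-peers fallback.
fn on_peer_reconnected(pending: &mut HashMap<Root, PendingRequest>, peer: PeerId) {
    for req in pending.values_mut() {
        req.failed_peers.remove(&peer);
    }
}

fn main() {
    let mut pending = HashMap::new();
    pending.insert(
        [0u8; 32],
        PendingRequest { attempts: 2, failed_peers: [7, 9].into_iter().collect() },
    );
    on_peer_reconnected(&mut pending, 7);
    let req = &pending[&[0u8; 32]];
    assert!(!req.failed_peers.contains(&7)); // reconnected peer eligible again
    assert!(req.failed_peers.contains(&9));  // other failures untouched
    assert_eq!(req.attempts, 2);             // retry counter unaffected
}
```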


@MegaRedHand merged commit 764c1de into main on Mar 6, 2026
2 checks passed
@MegaRedHand deleted the smarter-block-fetch-peer-selection branch on March 6, 2026 at 18:54