2 changes: 1 addition & 1 deletion crates/net/p2p/src/lib.rs
@@ -53,7 +53,7 @@ enum RetryMessage {

pub(crate) struct PendingRequest {
pub(crate) attempts: u32,
-    pub(crate) last_peer: Option<PeerId>,
+    pub(crate) failed_peers: HashSet<PeerId>,
}

#[allow(clippy::too_many_arguments)]
45 changes: 36 additions & 9 deletions crates/net/p2p/src/req_resp/handlers.rs
@@ -1,3 +1,5 @@
+use std::collections::HashSet;
+
use ethlambda_storage::Store;
use libp2p::{PeerId, request_response};
use rand::seq::SliceRandom;
@@ -198,9 +200,34 @@ pub async fn fetch_block_from_peer(server: &mut P2PServer, root: H256) -> bool {
return false;
}

-    // Select random peer
-    let peers: Vec<_> = server.connected_peers.iter().copied().collect();
-    let peer = match peers.choose(&mut rand::thread_rng()) {
+    // Exclude peers that already returned empty responses for this root
+    let failed = server.pending_requests.get(&root).map(|p| &p.failed_peers);
+    let pool: Vec<_> = if failed.is_none_or(|f| f.is_empty()) {
+        server.connected_peers.iter().copied().collect()
+    } else {
+        let failed = failed.unwrap();
+        server
+            .connected_peers
+            .iter()
+            .copied()
+            .filter(|p| !failed.contains(p))
+            .collect()
+    };
+
+    // Fall back to full set if all peers have failed (new peers may have connected,
+    // or previously-failing peers may have caught up). Clear failed_peers so subsequent
+    // retries start a fresh round of elimination.
+    let pool = if pool.is_empty() {
+        warn!(%root, "All peers failed for this block, retrying with full peer set");
+        if let Some(pending) = server.pending_requests.get_mut(&root) {
+            pending.failed_peers.clear();
+        }
+        server.connected_peers.iter().copied().collect()
+    } else {
+        pool
+    };
Review comment (Contributor) on lines +220 to +228:

**Reconnected peers remain excluded until all-peers fallback**

When a peer that was added to `failed_peers` disconnects and then reconnects (same `PeerId`), it is re-added to `connected_peers` by the `ConnectionEstablished` handler in `lib.rs`, but it is **not** removed from `failed_peers` in any `PendingRequest`. As a result, the reconnected peer continues to be excluded from pool selection until the all-peers-failed fallback fires and clears `failed_peers`.

In a small devnet (2–4 peers) this means a fresh, synced reconnecting peer is invisible to the retry logic until all other peers have also failed, delaying recovery. The fallback does eventually handle this, but clearing `failed_peers` for a peer on reconnection would be the more precise fix.

Consider removing the peer from all pending `failed_peers` sets in `ConnectionEstablished` / after redialing, or at least noting this limitation with a comment here so future contributors are aware.

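The reviewer's suggestion — clearing a reconnecting peer from every pending request's `failed_peers` set inside the `ConnectionEstablished` handler — could be sketched as below. This is a hypothetical illustration, not code from the PR: the struct fields mirror the diff (`connected_peers`, `pending_requests`, `failed_peers`), but `PeerId` and `H256` are simple stand-ins for the real `libp2p::PeerId` and hash types, and `on_connection_established` is an assumed hook name for the handler in `lib.rs`.

```rust
use std::collections::{HashMap, HashSet};

// Stand-ins for the real libp2p::PeerId and block-root hash types.
type PeerId = u64;
type H256 = [u8; 32];

struct PendingRequest {
    attempts: u32,
    failed_peers: HashSet<PeerId>,
}

struct P2PServer {
    connected_peers: HashSet<PeerId>,
    pending_requests: HashMap<H256, PendingRequest>,
}

impl P2PServer {
    // Hypothetical hook for the ConnectionEstablished handler: besides
    // registering the peer as connected, forgive it in every in-flight
    // request so it is immediately eligible for retries again.
    fn on_connection_established(&mut self, peer: PeerId) {
        self.connected_peers.insert(peer);
        for pending in self.pending_requests.values_mut() {
            pending.failed_peers.remove(&peer);
        }
    }
}
```

With this hook, a freshly reconnected (and possibly now-synced) peer rejoins the selection pool on its next retry rather than waiting for the all-peers-failed fallback to clear `failed_peers`.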
+    let peer = match pool.choose(&mut rand::thread_rng()) {
Some(&p) => p,
None => {
warn!(%root, "Failed to select random peer");
@@ -216,7 +243,8 @@ pub async fn fetch_block_from_peer(server: &mut P2PServer, root: H256) -> bool {
}
let request = BlocksByRootRequest { roots };

-    info!(%peer, %root, "Sending BlocksByRoot request for missing block");
+    let excluded = server.connected_peers.len() - pool.len();
+    info!(%peer, %root, excluded, "Sending BlocksByRoot request for missing block");
let request_id = server
.swarm
.behaviour_mut()
@@ -228,17 +256,14 @@ pub async fn fetch_block_from_peer(server: &mut P2PServer, root: H256) -> bool {
);

// Track the request if not already tracked (new request)
-    let pending = server
+    server
.pending_requests
.entry(root)
.or_insert(PendingRequest {
attempts: 1,
-            last_peer: None,
+            failed_peers: HashSet::new(),
});

-    // Update last_peer
-    pending.last_peer = Some(peer);

// Map request_id to root for failure handling
server.request_id_map.insert(request_id, root);

@@ -250,6 +275,8 @@ async fn handle_fetch_failure(server: &mut P2PServer, root: H256, peer: PeerId)
return;
};

+    pending.failed_peers.insert(peer);
+
if pending.attempts >= MAX_FETCH_RETRIES {
error!(%root, %peer, attempts=%pending.attempts,
"Block fetch failed after max retries, giving up");