fix(payment): block merkle pay-yourself via DHT closeness check #77
Conversation
Merkle verification never tied the 16 candidate pub_keys to the pool midpoint's actual close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point every reward_address at a self-owned wallet, fund the on-chain merkle payment, and receive their own rewards back.

Add a closeness defence: for each MerklePaymentProof, the verifier queries the DHT for the closest peers to the winner_pool's midpoint address and requires a majority of the 16 candidate pub_keys to hash to PeerIds that appear in that returned set. Hooked in between the cheap signature check and the on-chain RPC so a forged pool is rejected without paying for an on-chain round-trip. Every storing node that receives the proof runs the same check against the single winner_pool.

PaymentVerifier grows an optional Arc&lt;P2PNode&gt; attached at startup via attach_p2p_node; node.rs, devnet.rs, and the e2e testnet harness all wire it after P2PNode construction. Unit tests that don't exercise merkle verification keep a None handle and log a warning.

New e2e test test_attack_merkle_pay_yourself_fabricated_pool fails on main (verifier accepts 16 attacker-generated keys all rewarding the payer) and passes with the fix.
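The closeness defence described above can be sketched as a small, self-contained model. All names here are illustrative stand-ins, not the real API: `derive_peer_id` uses a stdlib `DefaultHasher` in place of BLAKE3, and plain `u64` values stand in for `PeerId`s.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Stand-in for `PeerId = BLAKE3(pub_key)`; a stdlib hasher is used here
/// purely so the sketch is self-contained.
fn derive_peer_id(pub_key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pub_key.hash(&mut h);
    h.finish()
}

/// Require that enough distinct candidate keys hash to peers the DHT
/// actually reports as closest to the pool midpoint.
fn closeness_check(
    candidate_pub_keys: &[Vec<u8>],
    dht_closest: &HashSet<u64>,
    required: usize,
) -> Result<(), String> {
    // Collect into a set so a repeated matching key cannot inflate the count.
    let candidates: HashSet<u64> =
        candidate_pub_keys.iter().map(|k| derive_peer_id(k)).collect();
    let matched = candidates.intersection(dht_closest).count();
    if matched >= required {
        Ok(())
    } else {
        Err(format!("only {matched}/{required} candidates are network-close"))
    }
}

fn main() {
    // The DHT's view of the 16 peers closest to the pool midpoint.
    let dht_closest: HashSet<u64> =
        (100u8..116).map(|i| derive_peer_id(&[i])).collect();

    // 16 locally generated "attacker" keys: none appear in the DHT's view.
    let attacker_keys: Vec<Vec<u8>> = (0u8..16).map(|i| vec![i]).collect();
    assert!(closeness_check(&attacker_keys, &dht_closest, 13).is_err());

    // Honest pool: the same keys the DHT reports as closest.
    let honest_keys: Vec<Vec<u8>> = (100u8..116).map(|i| vec![i]).collect();
    assert!(closeness_check(&honest_keys, &dht_closest, 13).is_ok());
    println!("closeness sketch ok");
}
```

The check is pure set membership, which is why the on-chain RPC can be deferred until after it passes.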
Pull request overview
This PR hardens merkle batch payment verification against a “pay-yourself” attack by ensuring the candidate public keys correspond to peers actually close to the pool midpoint in the live DHT.
Changes:
- Add a DHT-based candidate closeness check to `verify_merkle_payment` (before the on-chain RPC).
- Wire an `Arc<P2PNode>` into `PaymentVerifier` at startup so the verifier can query the routing table.
- Add an e2e test that exercises rejection of a fabricated candidate pool (with the on-chain lookup bypassed via a test-only cache insert).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/payment/verifier.rs | Adds attached P2PNode handle, a candidate closeness check, and a test-only pool cache insert helper; runs the closeness check during merkle verification. |
| src/node.rs | Attaches the constructed P2PNode to the protocol's PaymentVerifier during production startup. |
| src/devnet.rs | Attaches the devnet node's P2PNode to the PaymentVerifier during devnet startup. |
| tests/e2e/testnet.rs | Attaches each test node's P2PNode to its PaymentVerifier so e2e merkle verification can query the live DHT. |
| tests/e2e/merkle_payment.rs | Adds an e2e attack test for fabricated candidate pools and updates module-level test coverage notes/lints. |
```rust
    return Ok(());
};
```
verify_merkle_candidate_closeness currently fails open when no P2PNode is attached (returns Ok(())). That leaves merkle verification vulnerable if startup wiring is missed or regresses. Consider failing closed in non-test builds (return an Error::Payment) and only allowing the skip path under cfg(test)/feature = "test-utils" (or gating it behind an explicit config flag).
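The fail-closed gating this comment suggests might look like the following sketch. All names are stand-ins for the crate's `PaymentVerifier` / `Error::Payment`, and the real gate also includes `cfg(test)`; only the feature flag is modeled here so the sketch compiles standalone.

```rust
/// Illustrative stand-in for the crate's `Error::Payment`.
#[derive(Debug)]
struct PaymentError(String);

struct Verifier {
    /// `Some` once `attach_p2p_node` has run at startup
    /// (stand-in for `Option<Arc<P2PNode>>`).
    p2p_node: Option<()>,
}

impl Verifier {
    fn verify_merkle_candidate_closeness(&self) -> Result<(), PaymentError> {
        if self.p2p_node.is_none() {
            // Test-only builds keep the old fail-open skip so unit tests
            // that never exercise merkle verification need no live DHT.
            if cfg!(feature = "test-utils") {
                eprintln!("warning: no P2PNode attached; skipping closeness check");
                return Ok(());
            }
            // Production fails CLOSED: a missed `attach_p2p_node` surfaces
            // as a rejected proof instead of a silently disabled defence.
            return Err(PaymentError(
                "merkle closeness check unavailable: no P2PNode attached".into(),
            ));
        }
        // ... the actual DHT closeness check would run here ...
        Ok(())
    }
}

fn main() {
    // Without the test-utils feature, a missing P2PNode is a hard error.
    let unwired = Verifier { p2p_node: None };
    assert!(unwired.verify_merkle_candidate_closeness().is_err());

    let wired = Verifier { p2p_node: Some(()) };
    assert!(wired.verify_merkle_candidate_closeness().is_ok());
    println!("fail-closed ok");
}
```

The design choice: a wiring regression becomes a loud, immediately visible rejection in production rather than a silent security downgrade.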
```rust
let matched = candidate_peer_ids
    .iter()
    .filter(|pid| network_set.contains(pid))
    .count();
```
The closeness match count can be inflated by duplicate candidate pub_keys because matched counts membership for every entry in candidate_peer_ids. If duplicates are allowed, repeating a single matching PeerId multiple times could satisfy the threshold. Consider deduplicating candidates before counting (e.g., HashSet of candidate PeerIds) and/or explicitly rejecting pools with duplicate PeerIds.
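The inflation described here, and the `HashSet` dedup fix, can be demonstrated with a small sketch in which plain `u64` values stand in for `PeerId`s (function names are illustrative):

```rust
use std::collections::HashSet;

/// Buggy count (mirrors the flagged code): every occurrence of a matching
/// PeerId is counted, so repeating one genuinely close peer inflates it.
fn matched_with_duplicates(candidates: &[u64], network_set: &HashSet<u64>) -> usize {
    candidates.iter().filter(|pid| network_set.contains(*pid)).count()
}

/// Fixed count: deduplicate candidate PeerIds first so each distinct
/// peer contributes at most once toward the threshold.
fn matched_deduped(candidates: &[u64], network_set: &HashSet<u64>) -> usize {
    let unique: HashSet<u64> = candidates.iter().copied().collect();
    unique.iter().filter(|pid| network_set.contains(*pid)).count()
}

fn main() {
    // The node's view of the 16 closest peers (small integers here).
    let network_set: HashSet<u64> = (0..16).collect();

    // Forged pool: 13 copies of ONE genuinely close peer + 3 fabricated IDs.
    let mut candidates = vec![7u64; 13];
    candidates.extend([900, 901, 902]);

    // The buggy count clears a 13/16 bar; the deduped count does not.
    assert_eq!(matched_with_duplicates(&candidates, &network_set), 13);
    assert_eq!(matched_deduped(&candidates, &network_set), 1);
    println!("dedup ok");
}
```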
```rust
// Set-membership check against the returned closest-peers list. The
// lookup may return fewer than `lookup_count` on a sparse network,
// which only tightens the bar — any candidate not in the returned
// list counts as unmatched.
```
When find_closest_nodes_network returns fewer peers than CANDIDATE_CLOSENESS_REQUIRED, the pool will be rejected regardless of the candidate set. In that scenario, the current rejection message (“candidate pub_keys do not match…”) is misleading; it’s really “insufficient authoritative network peers/visibility to verify closeness”. Consider adding an explicit if network_peers.len() < CANDIDATE_CLOSENESS_REQUIRED { ... } branch with a clearer error (and possibly a retry hint).
Suggested change (replaces the comment above with an explicit guard):

```rust
if network_peers.len() < Self::CANDIDATE_CLOSENESS_REQUIRED {
    debug!(
        "Merkle closeness rejected: authoritative network lookup returned only {} peers \
         for pool midpoint {} (required to verify closeness: {})",
        network_peers.len(),
        hex::encode(pool_address.0),
        Self::CANDIDATE_CLOSENESS_REQUIRED,
    );
    return Err(Error::Payment(
        "Merkle candidate pool rejected: insufficient authoritative network peers \
         were returned to verify candidate closeness to the pool midpoint address. \
         Retry once the node has better network visibility."
            .into(),
    ));
}
// Set-membership check against the returned closest-peers list once
// enough authoritative peers are available to verify closeness.
```
```rust
self.verify_merkle_candidate_closeness(&merkle_proof.winner_pool)
    .await?;
```
verify_merkle_payment performs an authoritative DHT lookup for every chunk proof, even when the same winner_pool/pool_hash is already cached. For large batches (e.g., 256 chunks), this adds 256 iterative lookups. Consider caching the closeness result per pool_hash (or reusing the existing pool cache entry) so the network lookup happens once per pool, while still running before the on-chain RPC on cache misses.
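The per-pool caching suggested here can be sketched as follows. A `HashSet` stands in for the existing LRU pool cache, and `ClosenessVerifier` with its lookup counter is purely illustrative instrumentation:

```rust
use std::collections::HashSet;

/// Stand-in closeness verifier with a pass cache keyed by pool_hash.
struct ClosenessVerifier {
    pass_cache: HashSet<[u8; 32]>,
    lookups_performed: usize, // instrumentation for the sketch
}

impl ClosenessVerifier {
    fn new() -> Self {
        Self { pass_cache: HashSet::new(), lookups_performed: 0 }
    }

    /// Expensive authoritative DHT lookup (modeled as a counter bump).
    fn dht_lookup_and_check(&mut self, _pool_hash: &[u8; 32]) -> bool {
        self.lookups_performed += 1;
        true // assume the pool passes in this sketch
    }

    fn verify(&mut self, pool_hash: [u8; 32]) -> bool {
        if self.pass_cache.contains(&pool_hash) {
            return true; // batch repeat or replay: no network traffic
        }
        let ok = self.dht_lookup_and_check(&pool_hash);
        if ok {
            self.pass_cache.insert(pool_hash);
        }
        ok
    }
}

fn main() {
    let mut v = ClosenessVerifier::new();
    let pool_hash = [7u8; 32];
    // 256 chunk proofs sharing one winner_pool: only the first pays a lookup.
    for _ in 0..256 {
        assert!(v.verify(pool_hash));
    }
    assert_eq!(v.lookups_performed, 1);
    println!("lookups: {}", v.lookups_performed);
}
```

The cache hit still happens before the on-chain RPC, so the ordering property the PR establishes is preserved.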
```rust
    .expect("pool has candidates")
    .merkle_payment_timestamp;
// 4-leaf tree → depth 2 (log2(4)). Prices are all 1024 in
// build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.
```
The per-node payment comment says the expected value is 2048 (1024 * 2^2 / 2), but the test sets per_node to 4096. Since verifier logic checks paid_amount >= expected_per_node, either set per_node to the computed minimum (2048) for clarity or update the comment to explain the intentional overpayment.
Suggested change:

```rust
// build_candidate_nodes, so the minimum per-node payment required by the
// verifier formula is 1024 * 2^2 / 2 = 2048. We intentionally overpay
// with 4096 here so the proof passes the payment-formula check and this
// test remains focused on the pay-yourself closeness rejection.
```
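The payment formula quoted in this comment can be checked numerically. This is a sketch assuming exactly the formula stated above (price × 2^depth / 2); `expected_per_node` is an illustrative name, not the verifier's actual function:

```rust
/// Minimum per-node payment implied by the comment above:
/// price * 2^depth / 2. (Illustrative; the real formula lives in the
/// verifier.)
fn expected_per_node(price: u64, tree_depth: u32) -> u64 {
    price * 2u64.pow(tree_depth) / 2
}

fn main() {
    // 4-leaf tree → depth 2, price 1024 → minimum 2048.
    let min = expected_per_node(1024, 2);
    assert_eq!(min, 2048);

    // The test overpays with 4096, which still satisfies paid >= minimum.
    let paid = 4096u64;
    assert!(paid >= min);
    println!("minimum per-node payment: {min}");
}
```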
Security review flagged three issues with the initial closeness check:
1. **Fail-open on missing P2PNode** was unsafe in production: any startup
path that forgot to call attach_p2p_node would silently disable the
defence and only log a warning. Fail CLOSED in release builds — the
warn->error and Ok->Err flip surfaces the wiring bug as a rejected
PUT rather than a silent accept. Test-utils builds keep fail-open
so unit tests don't need a real DHT.
2. **9/16 majority was too permissive**: an attacker controlling 9
   neighbourhood peers could fabricate the remaining 7 and still pass.
   Raise to 13/16 — tolerates normal routing-table skew (1-3 peers)
   without letting an attacker plant 7 fabricated candidates.
3. **Per-chunk Kademlia lookups were a DoS amplifier**: a 256-chunk
batch sharing one winner_pool would trigger 256 iterative lookups,
and N concurrent forged-pool PUTs for the same hash would trigger N
parallel lookups. Add:
- closeness_pass_cache: pool_hash -> () so within-batch repeats
and cross-connection replays are free after the first check.
- inflight_closeness: pool_hash -> Notify so concurrent PUTs for
the same pool coalesce to a single lookup behind a drop-guarded
waker.
Also bump the per-lookup timeout from 10s to 60s. Iterative Kademlia
lookups with dial cascades of 20-30s per unresponsive peer were timing
out under normal churn at 10s — false-rejecting honest proofs. 60s
keeps DoS bounded while being wide enough for real-world convergence.
Document the known Sybil-grinding limitation: midpoint_proof.address()
is BLAKE3 of attacker-controllable inputs, so an attacker who also
runs Sybil DHT nodes can grind it to land in their own neighbourhood.
Closing that gap needs a Sybil-resistance layer or an
attacker-uncontrolled midpoint binding (on-chain VRF / block hash).
Second concurrency review flagged three real problems with the single-flight closeness check:

1. **Lost-wakeup race**. `notify.notified().await` creates the Notified future AFTER the inflight lock is released. If the leader calls `notify_waiters()` between the waiter's lock drop and the future's first poll, the notification is lost and the waiter parks forever — there was no outer timeout around the wait. Fixed by switching to `Arc<Notify>::notified_owned()`, which snapshots the `notify_waiters` counter at call time; the counter check on the first poll resolves the future immediately even if the leader already fired.

2. **LRU eviction orphans waiters**. When `DEFAULT_POOL_CACHE_CAPACITY` (1000) unique pool_hashes are in flight, LruCache evicts the oldest `Notify` entry. The leader's drop guard did `slot.pop(&pool_hash)`, which for an evicted entry popped nothing, so `notify_waiters` fired on nothing. Worse, a subsequent leader for the same pool_hash inserts a fresh `Notify`, which the original leader then erroneously pops. Both failure modes orphan waiters. Fixed by having the leader own an `Arc<ClosenessSlot>` independent of the cache and calling `notify_waiters` on THAT; the pop only runs if `Arc::ptr_eq` confirms the cache still holds our slot.

3. **Failure-path waiters each re-run the lookup**. On leader failure the pass cache wasn't written, waiters woke, saw a cache miss, and all became new leaders running their own Kademlia lookups. That defeats the documented DoS bound of "N forged PUTs cost at most one lookup". Fixed by sharing the leader's `Result<(), String>` through an `Arc<ClosenessSlot>` with a `OnceLock`; waiters read the stored result directly and skip re-verification on both success AND failure.

Other changes:
- Loud startup error log if a binary is accidentally built with the `test-utils` feature, since that flips the closeness defence to fail-open when P2PNode isn't attached.
- Two new unit tests exercise the cache short-circuit and the concurrent-waiter single-flight path.
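The single-flight shape described in points 1-3 above can be modeled synchronously with std primitives. This is a sketch, not the real implementation: `Mutex`/`Condvar`/`OnceLock` stand in for tokio's async `Notify` machinery, `ClosenessSlot` follows the commit's naming, and everything else is illustrative.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex, OnceLock};
use std::thread;

/// Shared slot: the leader publishes its result here; waiters read it
/// instead of re-running the lookup, bounding DoS on failure as well as
/// success.
struct ClosenessSlot {
    result: OnceLock<Result<(), String>>,
    done: Mutex<bool>,
    cv: Condvar,
}

struct SingleFlight {
    inflight: Mutex<HashMap<u64, Arc<ClosenessSlot>>>,
    lookups: Mutex<usize>, // instrumentation: how many "DHT lookups" ran
}

impl SingleFlight {
    fn check(&self, pool_hash: u64) -> Result<(), String> {
        // First arrival for a pool_hash becomes the leader; later
        // arrivals share its slot.
        let (slot, leader) = {
            let mut map = self.inflight.lock().unwrap();
            match map.entry(pool_hash) {
                Entry::Occupied(e) => (Arc::clone(e.get()), false),
                Entry::Vacant(e) => {
                    let s = Arc::new(ClosenessSlot {
                        result: OnceLock::new(),
                        done: Mutex::new(false),
                        cv: Condvar::new(),
                    });
                    e.insert(Arc::clone(&s));
                    (s, true)
                }
            }
        };
        if leader {
            *self.lookups.lock().unwrap() += 1; // the single real lookup
            // Model a FAILED check: waiters must observe the failure
            // without becoming new leaders.
            let res: Result<(), String> = Err("pool not network-close".into());
            let _ = slot.result.set(res.clone());
            *slot.done.lock().unwrap() = true;
            slot.cv.notify_all();
            // (The real code evicts the slot via a drop guard whose cache
            // pop is gated on Arc::ptr_eq; this sketch leaves it cached.)
            res
        } else {
            // Check-then-wait under the lock: if the leader already
            // published, the loop body never runs, so no wakeup is lost.
            let mut done = slot.done.lock().unwrap();
            while !*done {
                done = slot.cv.wait(done).unwrap();
            }
            slot.result.get().cloned().expect("leader publishes before notifying")
        }
    }
}

fn main() {
    let sf = Arc::new(SingleFlight {
        inflight: Mutex::new(HashMap::new()),
        lookups: Mutex::new(0),
    });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let sf = Arc::clone(&sf);
            thread::spawn(move || sf.check(42))
        })
        .collect();
    for h in handles {
        // Every caller sees the leader's published failure.
        assert!(h.join().unwrap().is_err());
    }
    // Exactly one lookup ran across all 8 concurrent callers.
    assert_eq!(*sf.lookups.lock().unwrap(), 1);
    println!("single-flight ok");
}
```

The check-then-wait loop under the mutex is the synchronous analogue of "create the notified future, then check the published result before awaiting": a completion that lands before the waiter blocks is observed, never lost.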
Third review pass found two remaining issues:

1. **Unbounded retry loop on leader cancellation cascade**. The loop retries forever if every leader gets cancelled before publishing. Add `MAX_LEADER_RETRIES = 4` so waiters return a visible error rather than spinning through indefinite cancellation cycles.

2. **Single-flight test didn't actually exercise the waiter path**. The existing concurrent-readers test relies on timing and might run both calls serially through the leader path. Add `closeness_waiter_reads_leaders_published_failure`, which seeds the inflight slot, spawns a waiter task, yields until the waiter is parked on `notified_owned().await`, then externally publishes a failure + notifies — proving the waiter reads the leader's published result directly without running its own inner check.
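The bounded retry loop from point 1 can be sketched as follows. The `try_once` closure is an illustrative stand-in for one "wait for a leader, read its published result" round, where `None` models a leader cancelled before publishing:

```rust
/// Bound matching the commit: give up after four attempts rather than
/// spinning through indefinite leader-cancellation cycles.
const MAX_LEADER_RETRIES: usize = 4;

/// Sketch of the waiter's retry loop.
fn check_with_retries(
    mut try_once: impl FnMut() -> Option<Result<(), String>>,
) -> Result<(), String> {
    for _attempt in 0..MAX_LEADER_RETRIES {
        if let Some(result) = try_once() {
            return result; // a leader published success or failure
        }
        // Leader was cancelled before publishing: retry with a new leader.
    }
    Err("closeness check failed: leaders kept getting cancelled".into())
}

fn main() {
    // Every leader cancelled: visible error instead of an infinite loop.
    let mut calls = 0usize;
    let res = check_with_retries(|| {
        calls += 1;
        None
    });
    assert!(res.is_err());
    assert_eq!(calls, MAX_LEADER_RETRIES);

    // Third leader publishes a result: the loop returns it.
    let mut n = 0;
    let res = check_with_retries(|| {
        n += 1;
        if n == 3 { Some(Ok(())) } else { None }
    });
    assert!(res.is_ok());
    println!("bounded retries ok");
}
```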
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
```rust
    let chosen = if let Some(existing) = inflight.get(&pool_hash) {
        (Some(Arc::clone(existing)), None)
    } else {
        let slot = Arc::new(ClosenessSlot::new());
        inflight.put(pool_hash, Arc::clone(&slot));
        (None, Some(slot))
    };
    drop(inflight);
    chosen
};
```

```rust
if let Some(slot) = waiter_slot {
    // Build the owned-notified future BEFORE awaiting, so it
```
The single-flight waiter path can miss the leader’s notify_waiters and hang indefinitely: if the leader finishes and calls notify_waiters() after the waiter extracts slot from the cache but before slot.notified_owned() is constructed, the waiter will await a future notification that never comes (even though slot.result is already set).
Fix by using the standard Notify pattern: create let notified = slot.notified_owned(); and then check slot.result.get() before awaiting (and/or check once before creating notified, then create notified, re-check, then await). That guarantees no missed wakeups when the leader completes between the waiter’s cache read and the await.
```rust
/// Minimum number of candidate `pub_keys` (out of 16) whose derived `PeerId`
/// must match the DHT's actual closest peers to the pool midpoint address.
///
/// Set below 16/16 to absorb normal routing-table skew between the
/// payer's view and this node's view — on a well-connected network the
/// divergence between two nodes' closest-set views is typically 1-2
/// peers, occasionally 3 during churn. 13/16 tolerates 3 divergent
/// peers while still limiting how many candidates an attacker can
/// fabricate before the check bites. A lower threshold (e.g. 9/16)
/// would let an attacker who controls 9 real neighbourhood peers plant
/// 7 fabricated candidates and still pass.
///
/// This is the pure "fabricated key" defence; it does not stop an
/// attacker who can grind the pool midpoint address to land near 13
/// pre-chosen keys AND run those keys as Sybil DHT participants. That
/// requires an orthogonal Sybil-resistance layer and is out of scope
/// for this check.
const CANDIDATE_CLOSENESS_REQUIRED: usize = 13;

/// Timeout for the authoritative network lookup used by the closeness
/// check.
///
/// Iterative Kademlia lookups can cascade through up to 20 iterations,
/// and a single unresponsive peer's dial can take 20-30s before timing
/// out. 60s leaves room for the lookup to converge even under churn
/// while still capping `DoS` amplification at roughly one bounded lookup
/// per forged `pool_hash`.
const CLOSENESS_LOOKUP_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
```
PR description says the closeness check uses a 10s timeout and requires 9/16 candidates to be present in the DHT-closest set, but the implementation uses CLOSENESS_LOOKUP_TIMEOUT = 60s and CANDIDATE_CLOSENESS_REQUIRED = 13.
Please either update the PR description to match the actual behavior, or adjust these constants to match the documented design so reviewers/operators don’t get surprised by stricter thresholds/longer blocking time in production.
@greptile-apps @claude please review
Summary
- Add a DHT-based candidate closeness check to `verify_merkle_payment`.
- Wire an `Arc<P2PNode>` into `PaymentVerifier` at startup so the check can query the live routing table.
- New e2e test `test_attack_merkle_pay_yourself_fabricated_pool` fails on `main` (verifier accepts 16 attacker-generated keys all rewarding the payer) and passes with the fix.

The attack
`verify_merkle_payment` checked candidate ML-DSA signatures, timestamps, the cryptographic merkle proof, and on-chain paid-node bindings — but nothing tied the 16 candidate `pub_keys` to the pool midpoint's close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point all 16 `reward_address` fields at a self-owned wallet, fund the on-chain merkle payment themselves, and receive their own rewards back.

The fix
In `verify_merkle_payment`, between the cheap signature check and the on-chain RPC:

1. Compute the pool midpoint address from `winner_pool.midpoint_proof.address()`.
2. Query the live DHT via `dht_manager().find_closest_nodes_network(&pool_address.0, 16)` with a 60s timeout.
3. Derive `PeerId = BLAKE3(pub_key)` for each of the 16 candidates.
4. Require at least 13 of the 16 candidate `PeerId`s to appear in the DHT-returned set.
A `MerklePaymentProof` carries exactly one `winner_pool`, so every storing node that receives the proof runs the same check against that single pool — a forged pool is rejected everywhere it lands.

The check runs before the on-chain lookup so a forged proof never pays for an RPC round-trip.
Concurrency & DoS hardening
After two rounds of adversarial review:
- Pass cache (`closeness_pass_cache`): within a 256-chunk batch sharing one pool, only the first chunk pays the Kademlia lookup; subsequent chunks hit the cache.
- Single-flight (`inflight_closeness`): N concurrent PUTs for the same `pool_hash` coalesce onto one lookup. The leader publishes its `Result<(), String>` through an `Arc<ClosenessSlot>` that holds a `Notify` + `OnceLock`; waiters read the published result directly instead of re-running `_inner`. Bounds DoS amplification to one lookup per unique `pool_hash` for BOTH success and failure cases.
- Lost-wakeup safety: `Arc<Notify>::notified_owned()` is used so the `notify_waiters` counter is snapshotted while the inflight lock is still held; a race where the leader fires notify between the lock drop and the poll is handled by the counter check on first poll.
- Eviction safety: `InflightGuard` owns its own `Arc<ClosenessSlot>` independent of the cache. `notify_waiters` fires on the owned slot regardless of eviction; the cache pop is gated on an `Arc::ptr_eq` identity check so an eviction + reinsertion doesn't make the original leader pop someone else's entry.
- Bounded retries: waiters retry at most `MAX_LEADER_RETRIES = 4` times if the leader gets cancelled before publishing, then return a visible error rather than spinning forever.

Fail-closed in production
`PaymentVerifier::attach_p2p_node` is called at node startup in `node.rs`, `devnet.rs`, and the e2e testnet. If the call is ever missed:

- Production builds: the closeness check returns `Err` with a visible error log — any regression surfaces as a rejected PUT, not a silent accept.
- `test-utils` builds: falls back to fail-open so unit tests that don't exercise merkle verification don't need a real DHT.
- `PaymentVerifier::new` emits a loud startup-time error log when built with `test-utils`, so a production binary accidentally built with the feature is visible in logs.

Known limitation — Sybil-grinding
`MidpointProof::address()` is BLAKE3 of attacker-controllable inputs (leaf bytes, tree root, timestamp). An attacker who also runs 13+ Sybil DHT nodes can grind the midpoint until those Sybils are the true network-closest peers, at which point closeness passes for the attacker. Closing that gap requires either a Sybil-resistant identity layer or binding the midpoint to an attacker-uncontrolled value (on-chain VRF or block hash at payment time). This PR raises the attack cost from "free" to "run 13+ Sybil nodes AND grind" — a meaningful but not complete improvement. Documented in the code.

Test plan
- `test_attack_merkle_pay_yourself_fabricated_pool` fails on `main` with "Pay-yourself attack SUCCEEDED"
- `cargo test --lib payment::verifier`
- `cargo test --features test-utils --test e2e merkle -- --test-threads=1`
- `cargo clippy --all-targets --all-features -- -D warnings` clean
- `cargo fmt --all` clean
- `RUSTDOCFLAGS="--deny=warnings" cargo doc --no-deps` clean

Follow-up (not in this PR)
- Bind the `MidpointProof` derivation to an on-chain VRF or block hash to close the Sybil-grinding gap.