fix(payment): block merkle pay-yourself via DHT closeness check #77
Conversation
Merkle verification never tied the 16 candidate pub_keys to the pool midpoint's actual close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point every reward_address at a self-owned wallet, fund the on-chain merkle payment, and receive their own rewards back.

Add a closeness defence: for each MerklePaymentProof, the verifier queries the DHT for the closest peers to the winner_pool's midpoint address and requires a majority of the 16 candidate pub_keys to hash to PeerIds that appear in that returned set. Hooked in between the cheap signature check and the on-chain RPC so a forged pool is rejected without paying for an on-chain round-trip. Every storing node that receives the proof runs the same check against the single winner_pool.

PaymentVerifier grows an optional Arc&lt;P2PNode&gt; attached at startup via attach_p2p_node; node.rs, devnet.rs, and the e2e testnet harness all wire it after P2PNode construction. Unit tests that don't exercise merkle verification keep a None handle and log a warning.

New e2e test test_attack_merkle_pay_yourself_fabricated_pool fails on main (verifier accepts 16 attacker-generated keys all rewarding the payer) and passes with the fix.
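The closeness defence described above can be sketched as a small, self-contained model. All names here are illustrative stand-ins, not the real API: `derive_peer_id` uses a stdlib `DefaultHasher` in place of BLAKE3, and plain `u64` values stand in for `PeerId`s.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Stand-in for `PeerId = BLAKE3(pub_key)`; a stdlib hasher is used here
/// purely so the sketch is self-contained.
fn derive_peer_id(pub_key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pub_key.hash(&mut h);
    h.finish()
}

/// Require that enough distinct candidate keys hash to peers the DHT
/// actually reports as closest to the pool midpoint.
fn closeness_check(
    candidate_pub_keys: &[Vec<u8>],
    dht_closest: &HashSet<u64>,
    required: usize,
) -> Result<(), String> {
    // Collect into a set so a repeated matching key cannot inflate the count.
    let candidates: HashSet<u64> =
        candidate_pub_keys.iter().map(|k| derive_peer_id(k)).collect();
    let matched = candidates.intersection(dht_closest).count();
    if matched >= required {
        Ok(())
    } else {
        Err(format!("only {matched}/{required} candidates are network-close"))
    }
}

fn main() {
    // The DHT's view of the 16 peers closest to the pool midpoint.
    let dht_closest: HashSet<u64> =
        (100u8..116).map(|i| derive_peer_id(&[i])).collect();

    // 16 locally generated "attacker" keys: none appear in the DHT's view.
    let attacker_keys: Vec<Vec<u8>> = (0u8..16).map(|i| vec![i]).collect();
    assert!(closeness_check(&attacker_keys, &dht_closest, 13).is_err());

    // Honest pool: the same keys the DHT reports as closest.
    let honest_keys: Vec<Vec<u8>> = (100u8..116).map(|i| vec![i]).collect();
    assert!(closeness_check(&honest_keys, &dht_closest, 13).is_ok());
    println!("closeness sketch ok");
}
```

The check is pure set membership, which is why the on-chain RPC can be deferred until after it passes.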
Pull request overview
This PR hardens merkle batch payment verification against a “pay-yourself” attack by ensuring the candidate public keys correspond to peers actually close to the pool midpoint in the live DHT.
Changes:
- Add a DHT-based candidate closeness check to `verify_merkle_payment` (before the on-chain RPC).
- Wire an `Arc<P2PNode>` into `PaymentVerifier` at startup so the verifier can query the routing table.
- Add an e2e test that exercises rejection of a fabricated candidate pool (with the on-chain lookup bypassed via a test-only cache insert).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/payment/verifier.rs | Adds attached P2PNode handle, a candidate closeness check, and a test-only pool cache insert helper; runs the closeness check during merkle verification. |
| src/node.rs | Attaches the constructed P2PNode to the protocol's PaymentVerifier during production startup. |
| src/devnet.rs | Attaches the devnet node's P2PNode to the PaymentVerifier during devnet startup. |
| tests/e2e/testnet.rs | Attaches each test node's P2PNode to its PaymentVerifier so e2e merkle verification can query the live DHT. |
| tests/e2e/merkle_payment.rs | Adds an e2e attack test for fabricated candidate pools and updates module-level test coverage notes/lints. |
```rust
    return Ok(());
};
```
verify_merkle_candidate_closeness currently fails open when no P2PNode is attached (returns Ok(())). That leaves merkle verification vulnerable if startup wiring is missed or regresses. Consider failing closed in non-test builds (return an Error::Payment) and only allowing the skip path under cfg(test)/feature = "test-utils" (or gating it behind an explicit config flag).
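The fail-closed gating this comment suggests might look like the following sketch. All names are stand-ins for the crate's `PaymentVerifier` / `Error::Payment`, and the real gate also includes `cfg(test)`; only the feature flag is modeled here so the sketch compiles standalone.

```rust
/// Illustrative stand-in for the crate's `Error::Payment`.
#[derive(Debug)]
struct PaymentError(String);

struct Verifier {
    /// `Some` once `attach_p2p_node` has run at startup
    /// (stand-in for `Option<Arc<P2PNode>>`).
    p2p_node: Option<()>,
}

impl Verifier {
    fn verify_merkle_candidate_closeness(&self) -> Result<(), PaymentError> {
        if self.p2p_node.is_none() {
            // Test-only builds keep the old fail-open skip so unit tests
            // that never exercise merkle verification need no live DHT.
            if cfg!(feature = "test-utils") {
                eprintln!("warning: no P2PNode attached; skipping closeness check");
                return Ok(());
            }
            // Production fails CLOSED: a missed `attach_p2p_node` surfaces
            // as a rejected proof instead of a silently disabled defence.
            return Err(PaymentError(
                "merkle closeness check unavailable: no P2PNode attached".into(),
            ));
        }
        // ... the actual DHT closeness check would run here ...
        Ok(())
    }
}

fn main() {
    // Without the test-utils feature, a missing P2PNode is a hard error.
    let unwired = Verifier { p2p_node: None };
    assert!(unwired.verify_merkle_candidate_closeness().is_err());

    let wired = Verifier { p2p_node: Some(()) };
    assert!(wired.verify_merkle_candidate_closeness().is_ok());
    println!("fail-closed ok");
}
```

The design choice: a wiring regression becomes a loud, immediately visible rejection in production rather than a silent security downgrade.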
```rust
let matched = candidate_peer_ids
    .iter()
    .filter(|pid| network_set.contains(pid))
    .count();
```
The closeness match count can be inflated by duplicate candidate pub_keys because matched counts membership for every entry in candidate_peer_ids. If duplicates are allowed, repeating a single matching PeerId multiple times could satisfy the threshold. Consider deduplicating candidates before counting (e.g., HashSet of candidate PeerIds) and/or explicitly rejecting pools with duplicate PeerIds.
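The inflation described here, and the `HashSet` dedup fix, can be demonstrated with a small sketch in which plain `u64` values stand in for `PeerId`s (function names are illustrative):

```rust
use std::collections::HashSet;

/// Buggy count (mirrors the flagged code): every occurrence of a matching
/// PeerId is counted, so repeating one genuinely close peer inflates it.
fn matched_with_duplicates(candidates: &[u64], network_set: &HashSet<u64>) -> usize {
    candidates.iter().filter(|pid| network_set.contains(*pid)).count()
}

/// Fixed count: deduplicate candidate PeerIds first so each distinct
/// peer contributes at most once toward the threshold.
fn matched_deduped(candidates: &[u64], network_set: &HashSet<u64>) -> usize {
    let unique: HashSet<u64> = candidates.iter().copied().collect();
    unique.iter().filter(|pid| network_set.contains(*pid)).count()
}

fn main() {
    // The node's view of the 16 closest peers (small integers here).
    let network_set: HashSet<u64> = (0..16).collect();

    // Forged pool: 13 copies of ONE genuinely close peer + 3 fabricated IDs.
    let mut candidates = vec![7u64; 13];
    candidates.extend([900, 901, 902]);

    // The buggy count clears a 13/16 bar; the deduped count does not.
    assert_eq!(matched_with_duplicates(&candidates, &network_set), 13);
    assert_eq!(matched_deduped(&candidates, &network_set), 1);
    println!("dedup ok");
}
```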
```rust
// Set-membership check against the returned closest-peers list. The
// lookup may return fewer than `lookup_count` on a sparse network,
// which only tightens the bar — any candidate not in the returned
// list counts as unmatched.
```
When find_closest_nodes_network returns fewer peers than CANDIDATE_CLOSENESS_REQUIRED, the pool will be rejected regardless of the candidate set. In that scenario, the current rejection message (“candidate pub_keys do not match…”) is misleading; it’s really “insufficient authoritative network peers/visibility to verify closeness”. Consider adding an explicit if network_peers.len() < CANDIDATE_CLOSENESS_REQUIRED { ... } branch with a clearer error (and possibly a retry hint).
Suggested change (replaces the comment above with an explicit guard):

```rust
if network_peers.len() < Self::CANDIDATE_CLOSENESS_REQUIRED {
    debug!(
        "Merkle closeness rejected: authoritative network lookup returned only {} peers \
         for pool midpoint {} (required to verify closeness: {})",
        network_peers.len(),
        hex::encode(pool_address.0),
        Self::CANDIDATE_CLOSENESS_REQUIRED,
    );
    return Err(Error::Payment(
        "Merkle candidate pool rejected: insufficient authoritative network peers \
         were returned to verify candidate closeness to the pool midpoint address. \
         Retry once the node has better network visibility."
            .into(),
    ));
}
// Set-membership check against the returned closest-peers list once
// enough authoritative peers are available to verify closeness.
```
```rust
self.verify_merkle_candidate_closeness(&merkle_proof.winner_pool)
    .await?;
```
verify_merkle_payment performs an authoritative DHT lookup for every chunk proof, even when the same winner_pool/pool_hash is already cached. For large batches (e.g., 256 chunks), this adds 256 iterative lookups. Consider caching the closeness result per pool_hash (or reusing the existing pool cache entry) so the network lookup happens once per pool, while still running before the on-chain RPC on cache misses.
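The per-pool caching suggested here can be sketched as follows. A `HashSet` stands in for the existing LRU pool cache, and `ClosenessVerifier` with its lookup counter is purely illustrative instrumentation:

```rust
use std::collections::HashSet;

/// Stand-in closeness verifier with a pass cache keyed by pool_hash.
struct ClosenessVerifier {
    pass_cache: HashSet<[u8; 32]>,
    lookups_performed: usize, // instrumentation for the sketch
}

impl ClosenessVerifier {
    fn new() -> Self {
        Self { pass_cache: HashSet::new(), lookups_performed: 0 }
    }

    /// Expensive authoritative DHT lookup (modeled as a counter bump).
    fn dht_lookup_and_check(&mut self, _pool_hash: &[u8; 32]) -> bool {
        self.lookups_performed += 1;
        true // assume the pool passes in this sketch
    }

    fn verify(&mut self, pool_hash: [u8; 32]) -> bool {
        if self.pass_cache.contains(&pool_hash) {
            return true; // batch repeat or replay: no network traffic
        }
        let ok = self.dht_lookup_and_check(&pool_hash);
        if ok {
            self.pass_cache.insert(pool_hash);
        }
        ok
    }
}

fn main() {
    let mut v = ClosenessVerifier::new();
    let pool_hash = [7u8; 32];
    // 256 chunk proofs sharing one winner_pool: only the first pays a lookup.
    for _ in 0..256 {
        assert!(v.verify(pool_hash));
    }
    assert_eq!(v.lookups_performed, 1);
    println!("lookups: {}", v.lookups_performed);
}
```

The cache hit still happens before the on-chain RPC, so the ordering property the PR establishes is preserved.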
```rust
    .expect("pool has candidates")
    .merkle_payment_timestamp;
// 4-leaf tree → depth 2 (log2(4)). Prices are all 1024 in
// build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.
```
The per-node payment comment says the expected value is 2048 (1024 * 2^2 / 2), but the test sets per_node to 4096. Since verifier logic checks paid_amount >= expected_per_node, either set per_node to the computed minimum (2048) for clarity or update the comment to explain the intentional overpayment.
Suggested change:

```rust
// build_candidate_nodes, so the minimum per-node payment required by the
// verifier formula is 1024 * 2^2 / 2 = 2048. We intentionally overpay
// with 4096 here so the proof passes the payment-formula check and this
// test remains focused on the pay-yourself closeness rejection.
```
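The payment formula quoted in this comment can be checked numerically. This is a sketch assuming exactly the formula stated above (price × 2^depth / 2); `expected_per_node` is an illustrative name, not the verifier's actual function:

```rust
/// Minimum per-node payment implied by the comment above:
/// price * 2^depth / 2. (Illustrative; the real formula lives in the
/// verifier.)
fn expected_per_node(price: u64, tree_depth: u32) -> u64 {
    price * 2u64.pow(tree_depth) / 2
}

fn main() {
    // 4-leaf tree → depth 2, price 1024 → minimum 2048.
    let min = expected_per_node(1024, 2);
    assert_eq!(min, 2048);

    // The test overpays with 4096, which still satisfies paid >= minimum.
    let paid = 4096u64;
    assert!(paid >= min);
    println!("minimum per-node payment: {min}");
}
```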
Security review flagged three issues with the initial closeness check:
1. **Fail-open on missing P2PNode** was unsafe in production: any startup
path that forgot to call attach_p2p_node would silently disable the
defence and only log a warning. Fail CLOSED in release builds — the
warn->error and Ok->Err flip surfaces the wiring bug as a rejected
PUT rather than a silent accept. Test-utils builds keep fail-open
so unit tests don't need a real DHT.
2. **9/16 majority was too permissive**: an attacker controlling 9
   neighbourhood peers could fabricate the remaining 7 and still pass.
   Raise to 13/16 — tolerates normal routing-table skew (1-3 peers)
   without letting an attacker plant 7 fabricated candidates.
3. **Per-chunk Kademlia lookups were a DoS amplifier**: a 256-chunk
batch sharing one winner_pool would trigger 256 iterative lookups,
and N concurrent forged-pool PUTs for the same hash would trigger N
parallel lookups. Add:
- closeness_pass_cache: pool_hash -> () so within-batch repeats
and cross-connection replays are free after the first check.
- inflight_closeness: pool_hash -> Notify so concurrent PUTs for
the same pool coalesce to a single lookup behind a drop-guarded
waker.
Also bump the per-lookup timeout from 10s to 60s. Iterative Kademlia
lookups with dial cascades of 20-30s per unresponsive peer were timing
out under normal churn at 10s — false-rejecting honest proofs. 60s
keeps DoS bounded while being wide enough for real-world convergence.
Document the known Sybil-grinding limitation: midpoint_proof.address()
is BLAKE3 of attacker-controllable inputs, so an attacker who also
runs Sybil DHT nodes can grind it to land in their own neighbourhood.
Closing that gap needs a Sybil-resistance layer or an
attacker-uncontrolled midpoint binding (on-chain VRF / block hash).
Second concurrency review flagged three real problems with the single-flight closeness check:

1. **Lost-wakeup race**. `notify.notified().await` creates the Notified future AFTER the inflight lock is released. If the leader calls `notify_waiters()` between the waiter's lock drop and the future's first poll, the notification is lost and the waiter parks forever — there was no outer timeout around the wait. Fixed by switching to `Arc<Notify>::notified_owned()`, which snapshots the `notify_waiters` counter at call time; the counter check on the first poll resolves the future immediately even if the leader already fired.

2. **LRU eviction orphans waiters**. When `DEFAULT_POOL_CACHE_CAPACITY` (1000) unique pool_hashes are in flight, LruCache evicts the oldest `Notify` entry. The leader's drop guard did `slot.pop(&pool_hash)`, which for an evicted entry popped nothing, so `notify_waiters` fired on nothing. Worse, a subsequent leader for the same pool_hash inserts a fresh `Notify`, which the original leader then erroneously pops. Both failure modes orphan waiters. Fixed by having the leader own an `Arc<ClosenessSlot>` independent of the cache and calling `notify_waiters` on THAT; the pop only runs if `Arc::ptr_eq` confirms the cache still holds our slot.

3. **Failure-path waiters each re-run the lookup**. On leader failure the pass cache wasn't written, waiters woke, saw a cache miss, and all became new leaders running their own Kademlia lookups. That defeats the documented DoS bound of "N forged PUTs cost at most one lookup". Fixed by sharing the leader's `Result<(), String>` through an `Arc<ClosenessSlot>` with a `OnceLock`; waiters read the stored result directly and skip re-verification on both success AND failure.

Other changes:
- Loud startup error log if a binary is accidentally built with the `test-utils` feature, since that flips the closeness defence to fail-open when P2PNode isn't attached.
- Two new unit tests exercise the cache short-circuit and the concurrent-waiter single-flight path.
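The single-flight shape described in points 1-3 above can be modeled synchronously with std primitives. This is a sketch, not the real implementation: `Mutex`/`Condvar`/`OnceLock` stand in for tokio's async `Notify` machinery, `ClosenessSlot` follows the commit's naming, and everything else is illustrative.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex, OnceLock};
use std::thread;

/// Shared slot: the leader publishes its result here; waiters read it
/// instead of re-running the lookup, bounding DoS on failure as well as
/// success.
struct ClosenessSlot {
    result: OnceLock<Result<(), String>>,
    done: Mutex<bool>,
    cv: Condvar,
}

struct SingleFlight {
    inflight: Mutex<HashMap<u64, Arc<ClosenessSlot>>>,
    lookups: Mutex<usize>, // instrumentation: how many "DHT lookups" ran
}

impl SingleFlight {
    fn check(&self, pool_hash: u64) -> Result<(), String> {
        // First arrival for a pool_hash becomes the leader; later
        // arrivals share its slot.
        let (slot, leader) = {
            let mut map = self.inflight.lock().unwrap();
            match map.entry(pool_hash) {
                Entry::Occupied(e) => (Arc::clone(e.get()), false),
                Entry::Vacant(e) => {
                    let s = Arc::new(ClosenessSlot {
                        result: OnceLock::new(),
                        done: Mutex::new(false),
                        cv: Condvar::new(),
                    });
                    e.insert(Arc::clone(&s));
                    (s, true)
                }
            }
        };
        if leader {
            *self.lookups.lock().unwrap() += 1; // the single real lookup
            // Model a FAILED check: waiters must observe the failure
            // without becoming new leaders.
            let res: Result<(), String> = Err("pool not network-close".into());
            let _ = slot.result.set(res.clone());
            *slot.done.lock().unwrap() = true;
            slot.cv.notify_all();
            // (The real code evicts the slot via a drop guard whose cache
            // pop is gated on Arc::ptr_eq; this sketch leaves it cached.)
            res
        } else {
            // Check-then-wait under the lock: if the leader already
            // published, the loop body never runs, so no wakeup is lost.
            let mut done = slot.done.lock().unwrap();
            while !*done {
                done = slot.cv.wait(done).unwrap();
            }
            slot.result.get().cloned().expect("leader publishes before notifying")
        }
    }
}

fn main() {
    let sf = Arc::new(SingleFlight {
        inflight: Mutex::new(HashMap::new()),
        lookups: Mutex::new(0),
    });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let sf = Arc::clone(&sf);
            thread::spawn(move || sf.check(42))
        })
        .collect();
    for h in handles {
        // Every caller sees the leader's published failure.
        assert!(h.join().unwrap().is_err());
    }
    // Exactly one lookup ran across all 8 concurrent callers.
    assert_eq!(*sf.lookups.lock().unwrap(), 1);
    println!("single-flight ok");
}
```

The check-then-wait loop under the mutex is the synchronous analogue of "create the notified future, then check the published result before awaiting": a completion that lands before the waiter blocks is observed, never lost.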
Third review pass found two remaining issues:

1. **Unbounded retry loop on leader cancellation cascade**. The loop retries forever if every leader gets cancelled before publishing. Add `MAX_LEADER_RETRIES = 4` so waiters return a visible error rather than spinning through indefinite cancellation cycles.

2. **Single-flight test didn't actually exercise the waiter path**. The existing concurrent-readers test relies on timing and might run both calls serially through the leader path. Add `closeness_waiter_reads_leaders_published_failure`, which seeds the inflight slot, spawns a waiter task, yields until the waiter is parked on `notified_owned().await`, then externally publishes a failure + notifies — proving the waiter reads the leader's published result directly without running its own inner check.
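The bounded retry loop from point 1 can be sketched as follows. The `try_once` closure is an illustrative stand-in for one "wait for a leader, read its published result" round, where `None` models a leader cancelled before publishing:

```rust
/// Bound matching the commit: give up after four attempts rather than
/// spinning through indefinite leader-cancellation cycles.
const MAX_LEADER_RETRIES: usize = 4;

/// Sketch of the waiter's retry loop.
fn check_with_retries(
    mut try_once: impl FnMut() -> Option<Result<(), String>>,
) -> Result<(), String> {
    for _attempt in 0..MAX_LEADER_RETRIES {
        if let Some(result) = try_once() {
            return result; // a leader published success or failure
        }
        // Leader was cancelled before publishing: retry with a new leader.
    }
    Err("closeness check failed: leaders kept getting cancelled".into())
}

fn main() {
    // Every leader cancelled: visible error instead of an infinite loop.
    let mut calls = 0usize;
    let res = check_with_retries(|| {
        calls += 1;
        None
    });
    assert!(res.is_err());
    assert_eq!(calls, MAX_LEADER_RETRIES);

    // Third leader publishes a result: the loop returns it.
    let mut n = 0;
    let res = check_with_retries(|| {
        n += 1;
        if n == 3 { Some(Ok(())) } else { None }
    });
    assert!(res.is_ok());
    println!("bounded retries ok");
}
```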
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
```rust
    let chosen = if let Some(existing) = inflight.get(&pool_hash) {
        (Some(Arc::clone(existing)), None)
    } else {
        let slot = Arc::new(ClosenessSlot::new());
        inflight.put(pool_hash, Arc::clone(&slot));
        (None, Some(slot))
    };
    drop(inflight);
    chosen
};
```

```rust
if let Some(slot) = waiter_slot {
    // Build the owned-notified future BEFORE awaiting, so it
```
The single-flight waiter path can miss the leader’s notify_waiters and hang indefinitely: if the leader finishes and calls notify_waiters() after the waiter extracts slot from the cache but before slot.notified_owned() is constructed, the waiter will await a future notification that never comes (even though slot.result is already set).
Fix by using the standard Notify pattern: create let notified = slot.notified_owned(); and then check slot.result.get() before awaiting (and/or check once before creating notified, then create notified, re-check, then await). That guarantees no missed wakeups when the leader completes between the waiter’s cache read and the await.
```rust
/// Minimum number of candidate `pub_keys` (out of 16) whose derived `PeerId`
/// must match the DHT's actual closest peers to the pool midpoint address.
///
/// Set below 16/16 to absorb normal routing-table skew between the
/// payer's view and this node's view — on a well-connected network the
/// divergence between two nodes' closest-set views is typically 1-2
/// peers, occasionally 3 during churn. 13/16 tolerates 3 divergent
/// peers while still limiting how many candidates an attacker can
/// fabricate before the check bites. A lower threshold (e.g. 9/16)
/// would let an attacker who controls 9 real neighbourhood peers plant
/// 7 fabricated candidates and still pass.
///
/// This is the pure "fabricated key" defence; it does not stop an
/// attacker who can grind the pool midpoint address to land near 13
/// pre-chosen keys AND run those keys as Sybil DHT participants. That
/// requires an orthogonal Sybil-resistance layer and is out of scope
/// for this check.
const CANDIDATE_CLOSENESS_REQUIRED: usize = 13;

/// Timeout for the authoritative network lookup used by the closeness
/// check.
///
/// Iterative Kademlia lookups can cascade through up to 20 iterations,
/// and a single unresponsive peer's dial can take 20-30s before timing
/// out. 60s leaves room for the lookup to converge even under churn
/// while still capping `DoS` amplification at roughly one bounded lookup
/// per forged `pool_hash`.
const CLOSENESS_LOOKUP_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
```
PR description says the closeness check uses a 10s timeout and requires 9/16 candidates to be present in the DHT-closest set, but the implementation uses CLOSENESS_LOOKUP_TIMEOUT = 60s and CANDIDATE_CLOSENESS_REQUIRED = 13.
Please either update the PR description to match the actual behavior, or adjust these constants to match the documented design so reviewers/operators don’t get surprised by stricter thresholds/longer blocking time in production.
@greptile-apps @claude please review
Summary
- Add a DHT-based candidate closeness check to `verify_merkle_payment`.
- Wire an `Arc<P2PNode>` into `PaymentVerifier` at startup so the check can query the live routing table.
- New e2e test `test_attack_merkle_pay_yourself_fabricated_pool` fails on `main` (verifier accepts 16 attacker-generated keys all rewarding the payer) and passes with the fix.

The attack
`verify_merkle_payment` checked candidate ML-DSA signatures, timestamps, the cryptographic merkle proof, and on-chain paid-node bindings — but nothing tied the 16 candidate `pub_keys` to the pool midpoint's close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point all 16 `reward_address` fields at a self-owned wallet, fund the on-chain merkle payment themselves, and receive their own rewards back.

The fix
In `verify_merkle_payment`, between the cheap signature check and the on-chain RPC:

1. Compute the pool midpoint address from `winner_pool.midpoint_proof.address()`.
2. Query the live DHT via `dht_manager().find_closest_nodes_network(&pool_address.0, 16)` with a 60s timeout.
3. Derive `PeerId = BLAKE3(pub_key)` for each of the 16 candidates.
4. Require at least 13 of the 16 candidate `PeerId`s to appear in the DHT-returned set.
A `MerklePaymentProof` carries exactly one `winner_pool`, so every storing node that receives the proof runs the same check against that single pool — a forged pool is rejected everywhere it lands.

The check runs before the on-chain lookup so a forged proof never pays for an RPC round-trip.
Concurrency & DoS hardening
After two rounds of adversarial review:
- Pass cache (`closeness_pass_cache`): within a 256-chunk batch sharing one pool, only the first chunk pays the Kademlia lookup; subsequent chunks hit the cache.
- Single-flight (`inflight_closeness`): N concurrent PUTs for the same `pool_hash` coalesce onto one lookup. The leader publishes its `Result<(), String>` through an `Arc<ClosenessSlot>` that holds a `Notify` + `OnceLock`; waiters read the published result directly instead of re-running `_inner`. Bounds DoS amplification to one lookup per unique `pool_hash` for BOTH success and failure cases.
- Lost-wakeup safety: `Arc<Notify>::notified_owned()` is used so the `notify_waiters` counter is snapshotted while the inflight lock is still held; a race where the leader fires notify between the lock drop and the poll is handled by the counter check on first poll.
- Eviction safety: `InflightGuard` owns its own `Arc<ClosenessSlot>` independent of the cache. `notify_waiters` fires on the owned slot regardless of eviction; the cache pop is gated on an `Arc::ptr_eq` identity check so an eviction + reinsertion doesn't make the original leader pop someone else's entry.
- Bounded retries: waiters retry at most `MAX_LEADER_RETRIES = 4` times if the leader gets cancelled before publishing, then return a visible error rather than spinning forever.

Fail-closed in production
`PaymentVerifier::attach_p2p_node` is called at node startup in `node.rs`, `devnet.rs`, and the e2e testnet. If the call is ever missed:

- Production builds: the closeness check returns `Err` with a visible error log — any regression surfaces as a rejected PUT, not a silent accept.
- `test-utils` builds: falls back to fail-open so unit tests that don't exercise merkle verification don't need a real DHT.
- `PaymentVerifier::new` emits a loud startup-time error log when built with `test-utils`, so a production binary accidentally built with the feature is visible in logs.

Known limitation — Sybil-grinding
`MidpointProof::address()` is BLAKE3 of attacker-controllable inputs (leaf bytes, tree root, timestamp). An attacker who also runs 13+ Sybil DHT nodes can grind the midpoint until those Sybils are the true network-closest peers, at which point closeness passes for the attacker. Closing that gap requires either a Sybil-resistant identity layer or binding the midpoint to an attacker-uncontrolled value (on-chain VRF or block hash at payment time). This PR raises the attack cost from "free" to "run 13+ Sybil nodes AND grind" — a meaningful but not complete improvement. Documented in the code.

Test plan
- `test_attack_merkle_pay_yourself_fabricated_pool` fails on `main` with "Pay-yourself attack SUCCEEDED"
- `cargo test --lib payment::verifier`
- `cargo test --features test-utils --test e2e merkle -- --test-threads=1`
- `cargo clippy --all-targets --all-features -- -D warnings` clean
- `cargo fmt --all` clean
- `RUSTDOCFLAGS="--deny=warnings" cargo doc --no-deps` clean

Follow-up (not in this PR)
- Bind the `MidpointProof` derivation to an on-chain VRF or block hash to close the Sybil-grinding gap.