
fix(payment): block merkle pay-yourself via DHT closeness check #77

Merged
jacderida merged 4 commits into WithAutonomi:rc-2026.4.2 from grumbach:fix/merkle-pay-yourself-closeness
Apr 22, 2026

Conversation

grumbach (Collaborator) commented Apr 22, 2026

Summary

  • Block the merkle "pay-yourself" attack by adding a DHT-closeness check to verify_merkle_payment.
  • Attach the P2PNode to PaymentVerifier at startup so the check can query the live routing table.
  • Add an e2e test that fails on main (verifier accepts 16 attacker-generated keys all rewarding the payer) and passes with the fix.

The attack

verify_merkle_payment checked candidate ML-DSA signatures, timestamps, the cryptographic merkle proof, and on-chain paid-node bindings — but nothing tied the 16 candidate pub_keys to the pool midpoint's close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point all 16 reward_address fields at a self-owned wallet, fund the on-chain merkle payment themselves, and receive their own rewards back.

The fix

In verify_merkle_payment, between the cheap signature check and the on-chain RPC:

  1. Pull the pool midpoint address from winner_pool.midpoint_proof.address().
  2. Call dht_manager().find_closest_nodes_network(&pool_address.0, 16) with a 60s timeout.
  3. Derive PeerId = BLAKE3(pub_key) for each of the 16 candidates.
  4. Require at least 13/16 candidate PeerIds to appear in the DHT-returned set.

A MerklePaymentProof carries exactly one winner_pool, so every storing node that receives the proof runs the same check against that single pool — a forged pool is rejected everywhere it lands.

The check runs before the on-chain lookup so a forged proof never pays for an RPC round-trip.
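The membership test at the heart of steps 3 and 4 can be sketched as below. This is a minimal stand-in, not the PR's code: `PeerId` is shortened to a raw 32-byte array (the real code derives it as BLAKE3 of the pub_key), and the dedup step reflects a reviewer suggestion rather than confirmed implementation detail.

```rust
use std::collections::HashSet;

/// Stand-in for the 32-byte PeerId the real code derives as BLAKE3(pub_key).
type PeerId = [u8; 32];

/// Threshold from the PR: at least 13 of the 16 candidates must appear
/// in the DHT-returned closest set.
const CANDIDATE_CLOSENESS_REQUIRED: usize = 13;

/// Count how many DISTINCT candidate PeerIds appear in the network's
/// closest set. Deduplicating first ensures a repeated matching key
/// cannot inflate the count.
fn closeness_passes(candidates: &[PeerId], network_closest: &[PeerId]) -> bool {
    let network_set: HashSet<&PeerId> = network_closest.iter().collect();
    let matched = candidates
        .iter()
        .collect::<HashSet<_>>() // dedup candidate PeerIds
        .into_iter()
        .filter(|pid| network_set.contains(*pid))
        .count();
    matched >= CANDIDATE_CLOSENESS_REQUIRED
}
```

Because the lookup may return fewer than 16 peers on a sparse network, any candidate absent from the returned list simply counts as unmatched, which only tightens the bar.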

Concurrency & DoS hardening

After two rounds of adversarial review:

  • Pass cache (closeness_pass_cache): within a 256-chunk batch sharing one pool, only the first chunk pays the Kademlia lookup; subsequent chunks hit the cache.
  • Single-flight (inflight_closeness): N concurrent PUTs for the same pool_hash coalesce onto one lookup. The leader publishes its Result<(), String> through an Arc<ClosenessSlot> that holds a Notify + OnceLock; waiters read the published result directly instead of re-running _inner. Bounds DoS amplification to one lookup per unique pool_hash for BOTH success and failure cases.
  • Lost-wakeup safety: waiters use Arc<Notify>::notified_owned() so the notify_waiters counter is snapshotted while the inflight lock is still held; a race where the leader fires notify between the lock drop and the poll is handled by the counter check on first poll.
  • LRU-eviction safety: the leader's InflightGuard owns its own Arc<ClosenessSlot> independent of the cache. notify_waiters fires on the owned slot regardless of eviction; the cache pop is gated on an Arc::ptr_eq identity check so an eviction + reinsertion doesn't make the original leader pop someone else's entry.
  • Cancellation-cascade bound: waiters retry up to MAX_LEADER_RETRIES = 4 times if the leader gets cancelled before publishing, then return a visible error rather than spinning forever.
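A blocking sketch of the single-flight pattern described above, using std's `Mutex`/`Condvar`/`OnceLock` in place of the async `Notify` + `OnceLock`. All names here are assumed; the real code is async and additionally handles LRU eviction identity checks and leader cancellation, which this sketch omits.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex, OnceLock};

type PoolHash = u64; // stand-in for the real 32-byte pool hash

/// The leader publishes its result exactly once; waiters block until it lands.
#[derive(Default)]
struct ClosenessSlot {
    result: OnceLock<Result<(), String>>,
    done: Mutex<bool>,
    cv: Condvar,
}

impl ClosenessSlot {
    fn publish(&self, r: Result<(), String>) {
        let _ = self.result.set(r);
        *self.done.lock().unwrap() = true;
        // Waiters re-check `done` under the same lock, so no lost wakeups.
        self.cv.notify_all();
    }

    fn wait(&self) -> Result<(), String> {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cv.wait(done).unwrap();
        }
        self.result.get().cloned().expect("result published before done")
    }
}

#[derive(Default)]
struct SingleFlight {
    inflight: Mutex<HashMap<PoolHash, Arc<ClosenessSlot>>>,
}

impl SingleFlight {
    /// N concurrent calls for the same pool_hash run `lookup` once;
    /// leader and waiters all observe the same published result.
    fn check(
        &self,
        pool_hash: PoolHash,
        lookup: impl FnOnce() -> Result<(), String>,
    ) -> Result<(), String> {
        let (slot, is_leader) = {
            let mut map = self.inflight.lock().unwrap();
            match map.get(&pool_hash) {
                Some(s) => (Arc::clone(s), false),
                None => {
                    let s = Arc::new(ClosenessSlot::default());
                    map.insert(pool_hash, Arc::clone(&s));
                    (s, true)
                }
            }
        };
        if is_leader {
            let r = lookup();
            // The leader owns `slot` independently of the map, mirroring
            // the eviction-safety point above.
            slot.publish(r.clone());
            self.inflight.lock().unwrap().remove(&pool_hash);
            r
        } else {
            slot.wait()
        }
    }
}
```

The key property is that the result (success or failure) is published through the slot itself, so waiters never re-run the lookup on either path.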

Fail-closed in production

PaymentVerifier::attach_p2p_node is called at node startup in node.rs, devnet.rs, and the e2e testnet. If the call is ever missed:

  • Release builds: closeness check returns Err with a visible error log — any regression surfaces as a rejected PUT, not a silent accept.
  • Test / test-utils builds: falls back to fail-open so unit tests that don't exercise merkle verification don't need a real DHT. PaymentVerifier::new emits a loud startup-time error log when built with test-utils, so a production binary accidentally built with the feature is visible in logs.
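The shape of that gate, sketched below. The function name and error text are assumptions; the real code presumably gates with `#[cfg(...)]` attributes inside `verify_merkle_candidate_closeness` rather than a `cfg!` branch.

```rust
/// Behaviour of the closeness check when no P2PNode handle was attached.
/// Sketch only: names and messages are illustrative, not the PR's code.
fn on_missing_p2p_node() -> Result<(), String> {
    if cfg!(feature = "test-utils") {
        // Test builds: fail open so unit tests don't need a live DHT.
        // (A loud startup error log under this feature makes an
        // accidental production build visible.)
        Ok(())
    } else {
        // Release builds: fail closed, so the wiring bug surfaces as a
        // rejected PUT with a visible error rather than a silent accept.
        Err("merkle closeness check unavailable: P2PNode not attached".into())
    }
}
```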

Known limitation — Sybil-grinding

MidpointProof::address() is BLAKE3 of attacker-controllable inputs (leaf bytes, tree root, timestamp). An attacker who also runs 13+ Sybil DHT nodes can grind the midpoint until those Sybils are the true network-closest peers, at which point closeness passes for the attacker. Closing that gap requires either a Sybil-resistant identity layer or binding the midpoint to an attacker-uncontrolled value (on-chain VRF or block hash at payment time). This PR raises the attack cost from "free" to "run 13+ Sybil nodes AND grind" — a meaningful but not complete improvement. Documented in the code.

Test plan

  • test_attack_merkle_pay_yourself_fabricated_pool fails on main with "Pay-yourself attack SUCCEEDED"
  • Same test passes with the fix (closeness rejection)
  • 3 new unit tests: pass-cache short-circuit, concurrent-readers agreement, waiter-reads-leader's-published-failure
  • All 43 verifier unit tests pass: cargo test --lib payment::verifier
  • All 8 merkle e2e tests pass: cargo test --features test-utils --test e2e merkle -- --test-threads=1
  • cargo clippy --all-targets --all-features -- -D warnings clean
  • cargo fmt --all clean
  • RUSTDOCFLAGS="--deny=warnings" cargo doc --no-deps clean
  • Full CI matrix (see checks)

Follow-up (not in this PR)

  • Bind MidpointProof derivation to an on-chain VRF or block hash to close the Sybil-grinding gap.
  • Add a global rate limit on iterative Kademlia lookups to further bound cross-pool DoS.

Merkle verification never tied the 16 candidate pub_keys to the pool
midpoint's actual close group on the network. An attacker could generate
16 ML-DSA keypairs locally, point every reward_address at a self-owned
wallet, fund the on-chain merkle payment, and receive their own rewards
back.

Add a closeness defence: for each MerklePaymentProof, the verifier
queries the DHT for the closest peers to the winner_pool's midpoint
address and requires a majority of the 16 candidate pub_keys to hash to
PeerIds that appear in that returned set. Hooked in between the cheap
signature check and the on-chain RPC so a forged pool is rejected
without paying for an on-chain round-trip. Every storing node that
receives the proof runs the same check against the single winner_pool.

PaymentVerifier grows an optional Arc<P2PNode> attached at startup via
attach_p2p_node; node.rs, devnet.rs, and the e2e testnet harness all
wire it after P2PNode construction. Unit tests that don't exercise
merkle verification keep a None handle and log a warning.

New e2e test test_attack_merkle_pay_yourself_fabricated_pool fails on
main (verifier accepts 16 attacker-generated keys all rewarding the
payer) and passes with the fix.
Copilot AI review requested due to automatic review settings April 22, 2026 07:59

Copilot AI left a comment


Pull request overview

This PR hardens merkle batch payment verification against a “pay-yourself” attack by ensuring the candidate public keys correspond to peers actually close to the pool midpoint in the live DHT.

Changes:

  • Add a DHT-based candidate closeness check to verify_merkle_payment (before on-chain RPC).
  • Wire an Arc<P2PNode> into PaymentVerifier at startup so the verifier can query the routing table.
  • Add an e2e test that exercises rejection of a fabricated candidate pool (with on-chain lookup bypassed via a test-only cache insert).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/payment/verifier.rs Adds attached P2PNode handle, a candidate closeness check, and a test-only pool cache insert helper; runs closeness check during merkle verification.
src/node.rs Attaches the constructed P2PNode to the protocol’s PaymentVerifier during production startup.
src/devnet.rs Attaches the devnet node’s P2PNode to the PaymentVerifier during devnet startup.
tests/e2e/testnet.rs Attaches each test node’s P2PNode to its PaymentVerifier so e2e merkle verification can query the live DHT.
tests/e2e/merkle_payment.rs Adds an e2e attack test for fabricated candidate pools and updates module-level test coverage notes/lints.


Comment thread src/payment/verifier.rs Outdated
Comment on lines +545 to +546
return Ok(());
};

Copilot AI Apr 22, 2026


verify_merkle_candidate_closeness currently fails open when no P2PNode is attached (returns Ok(())). That leaves merkle verification vulnerable if startup wiring is missed or regresses. Consider failing closed in non-test builds (return an Error::Payment) and only allowing the skip path under cfg(test)/feature = "test-utils" (or gating it behind an explicit config flag).

Comment thread src/payment/verifier.rs
Comment on lines +601 to +604
let matched = candidate_peer_ids
    .iter()
    .filter(|pid| network_set.contains(pid))
    .count();

Copilot AI Apr 22, 2026


The closeness match count can be inflated by duplicate candidate pub_keys because matched counts membership for every entry in candidate_peer_ids. If duplicates are allowed, repeating a single matching PeerId multiple times could satisfy the threshold. Consider deduplicating candidates before counting (e.g., HashSet of candidate PeerIds) and/or explicitly rejecting pools with duplicate PeerIds.

Comment thread src/payment/verifier.rs
Comment on lines +595 to +598
// Set-membership check against the returned closest-peers list. The
// lookup may return fewer than `lookup_count` on a sparse network,
// which only tightens the bar — any candidate not in the returned
// list counts as unmatched.

Copilot AI Apr 22, 2026


When find_closest_nodes_network returns fewer peers than CANDIDATE_CLOSENESS_REQUIRED, the pool will be rejected regardless of the candidate set. In that scenario, the current rejection message (“candidate pub_keys do not match…”) is misleading; it’s really “insufficient authoritative network peers/visibility to verify closeness”. Consider adding an explicit if network_peers.len() < CANDIDATE_CLOSENESS_REQUIRED { ... } branch with a clearer error (and possibly a retry hint).

Suggested change
// Set-membership check against the returned closest-peers list. The
// lookup may return fewer than `lookup_count` on a sparse network,
// which only tightens the bar — any candidate not in the returned
// list counts as unmatched.
if network_peers.len() < Self::CANDIDATE_CLOSENESS_REQUIRED {
    debug!(
        "Merkle closeness rejected: authoritative network lookup returned only {} peers \
         for pool midpoint {} (required to verify closeness: {})",
        network_peers.len(),
        hex::encode(pool_address.0),
        Self::CANDIDATE_CLOSENESS_REQUIRED,
    );
    return Err(Error::Payment(
        "Merkle candidate pool rejected: insufficient authoritative network peers \
         were returned to verify candidate closeness to the pool midpoint address. \
         Retry once the node has better network visibility."
            .into(),
    ));
}
// Set-membership check against the returned closest-peers list once
// enough authoritative peers are available to verify closeness.

Comment thread src/payment/verifier.rs Outdated
Comment on lines +678 to +680
self.verify_merkle_candidate_closeness(&merkle_proof.winner_pool)
.await?;


Copilot AI Apr 22, 2026


verify_merkle_payment performs an authoritative DHT lookup for every chunk proof, even when the same winner_pool/pool_hash is already cached. For large batches (e.g., 256 chunks), this adds 256 iterative lookups. Consider caching the closeness result per pool_hash (or reusing the existing pool cache entry) so the network lookup happens once per pool, while still running before the on-chain RPC on cache misses.

.expect("pool has candidates")
.merkle_payment_timestamp;
// 4-leaf tree → depth 2 (log2(4)). Prices are all 1024 in
// build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.

Copilot AI Apr 22, 2026


The per-node payment comment says the expected value is 2048 (1024 * 2^2 / 2), but the test sets per_node to 4096. Since verifier logic checks paid_amount >= expected_per_node, either set per_node to the computed minimum (2048) for clarity or update the comment to explain the intentional overpayment.

Suggested change
// build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.
// build_candidate_nodes, so the minimum per-node payment required by the
// verifier formula is 1024 * 2^2 / 2 = 2048. We intentionally overpay
// with 4096 here so the proof passes the payment-formula check and this
// test remains focused on the pay-yourself closeness rejection.

Security review flagged three issues with the initial closeness check:

1. **Fail-open on missing P2PNode** was unsafe in production: any startup
   path that forgot to call attach_p2p_node would silently disable the
   defence and only log a warning. Fail CLOSED in release builds — the
   warn->error and Ok->Err flip surfaces the wiring bug as a rejected
   PUT rather than a silent accept. Test-utils builds keep fail-open
   so unit tests don't need a real DHT.

2. **9/16 majority was too permissive**: an attacker controlling 7
   neighbourhood peers could fabricate the remaining 7 and still pass.
   Raise to 13/16 — tolerates normal routing-table skew (1-3 peers)
   without letting an attacker plant 7 fabricated candidates.

3. **Per-chunk Kademlia lookups were a DoS amplifier**: a 256-chunk
   batch sharing one winner_pool would trigger 256 iterative lookups,
   and N concurrent forged-pool PUTs for the same hash would trigger N
   parallel lookups. Add:
     - closeness_pass_cache: pool_hash -> () so within-batch repeats
       and cross-connection replays are free after the first check.
     - inflight_closeness: pool_hash -> Notify so concurrent PUTs for
       the same pool coalesce to a single lookup behind a drop-guarded
       waker.

Also bump the per-lookup timeout from 10s to 60s. Iterative Kademlia
lookups with dial cascades of 20-30s per unresponsive peer were timing
out under normal churn at 10s — false-rejecting honest proofs. 60s
keeps DoS bounded while being wide enough for real-world convergence.

Document the known Sybil-grinding limitation: midpoint_proof.address()
is BLAKE3 of attacker-controllable inputs, so an attacker who also
runs Sybil DHT nodes can grind it to land in their own neighbourhood.
Closing that gap needs a Sybil-resistance layer or an
attacker-uncontrolled midpoint binding (on-chain VRF / block hash).
… single-flight

Second concurrency review flagged three real problems with the
single-flight closeness check:

1. **Lost-wakeup race**. `notify.notified().await` creates the Notified
   future AFTER the inflight lock is released. If the leader calls
   `notify_waiters()` between the waiter's lock drop and the future's
   first poll, the notification is lost and the waiter parks forever —
   there was no outer timeout around the wait. Fixed by switching to
   `Arc<Notify>::notified_owned()`, which snapshots the
   `notify_waiters` counter at call time; the counter check on the
   first poll resolves the future immediately even if the leader
   already fired.

2. **LRU eviction orphans waiters**. When `DEFAULT_POOL_CACHE_CAPACITY`
   (1000) unique pool_hashes are in flight, LruCache evicts the
   oldest `Notify` entry. The leader's drop guard did
   `slot.pop(&pool_hash)` which, for an evicted entry, popped nothing
   and called `notify_waiters` on — nothing. Worse, a subsequent
   leader for the same pool_hash inserts a fresh `Notify`, which the
   original leader then erroneously pops. Both failure modes orphan
   waiters. Fixed by having the leader own an `Arc<ClosenessSlot>`
   independent of the cache and calling `notify_waiters` on THAT;
   the pop only runs if `Arc::ptr_eq` confirms the cache still holds
   our slot.

3. **Failure-path waiters each re-run the lookup**. On leader failure
   the pass cache wasn't written, waiters woke, saw cache miss, and
   all became new leaders running their own Kademlia lookups. That
   defeats the documented DoS bound of "N forged PUTs cost at most
   one lookup". Fixed by sharing the leader's `Result<(), String>`
   through an `Arc<ClosenessSlot>` with a `OnceLock`; waiters read
   the stored result directly and skip re-verification on both
   success AND failure.

Other changes:

- Loud startup error log if a binary is accidentally built with the
  `test-utils` feature, since that flips the closeness defence to
  fail-open when P2PNode isn't attached.
- Two new unit tests exercise the cache short-circuit and the
  concurrent-waiter single-flight path.
Copilot AI review requested due to automatic review settings April 22, 2026 08:27
Third review pass found two remaining issues:

1. **Unbounded retry loop on leader cancellation cascade**. The loop
   retries forever if every leader gets cancelled before publishing.
   Add `MAX_LEADER_RETRIES = 4` so waiters return a visible error
   rather than spinning through indefinite cancellation cycles.

2. **Single-flight test didn't actually exercise the waiter path**.
   The existing concurrent-readers test relies on timing and might
   run both calls serially through the leader path. Add
   `closeness_waiter_reads_leaders_published_failure`, which seeds
   the inflight slot, spawns a waiter task, yields until the waiter
   is parked on `notified_owned().await`, then externally publishes
   a failure + notifies — proving the waiter reads the leader's
   published result directly without running its own inner check.
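The retry bound in point 1 reduces to a small loop. Sketched here with a synchronous closure standing in for one wait-on-leader round; only `MAX_LEADER_RETRIES = 4` comes from the PR, the rest of the names are illustrative.

```rust
/// Bound from the PR: waiters retry at most this many times after the
/// initial attempt if the leader is cancelled before publishing.
const MAX_LEADER_RETRIES: usize = 4;

/// Each call to `round` is one attempt to observe a published result:
/// `Some(r)` means a leader published `r`; `None` means the leader was
/// cancelled before publishing and the waiter must re-enter the
/// single-flight path.
fn wait_with_retry(
    mut round: impl FnMut() -> Option<Result<(), String>>,
) -> Result<(), String> {
    for _ in 0..=MAX_LEADER_RETRIES {
        if let Some(result) = round() {
            return result;
        }
    }
    // Visible error instead of spinning through cancellation cycles forever.
    Err("merkle closeness check: leader cancelled repeatedly, giving up".into())
}
```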

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Comment thread src/payment/verifier.rs
Comment on lines +701 to +713
    let chosen = if let Some(existing) = inflight.get(&pool_hash) {
        (Some(Arc::clone(existing)), None)
    } else {
        let slot = Arc::new(ClosenessSlot::new());
        inflight.put(pool_hash, Arc::clone(&slot));
        (None, Some(slot))
    };
    drop(inflight);
    chosen
};

if let Some(slot) = waiter_slot {
    // Build the owned-notified future BEFORE awaiting, so it

Copilot AI Apr 22, 2026


The single-flight waiter path can miss the leader’s notify_waiters and hang indefinitely: if the leader finishes and calls notify_waiters() after the waiter extracts slot from the cache but before slot.notified_owned() is constructed, the waiter will await a future notification that never comes (even though slot.result is already set).

Fix by using the standard Notify pattern: create let notified = slot.notified_owned(); and then check slot.result.get() before awaiting (and/or check once before creating notified, then create notified, re-check, then await). That guarantees no missed wakeups when the leader completes between the waiter’s cache read and the await.

Comment thread src/payment/verifier.rs
Comment on lines +592 to +619
/// Minimum number of candidate `pub_keys` (out of 16) whose derived `PeerId`
/// must match the DHT's actual closest peers to the pool midpoint address.
///
/// Set below 16/16 to absorb normal routing-table skew between the
/// payer's view and this node's view — on a well-connected network the
/// divergence between two nodes' closest-set views is typically 1-2
/// peers, occasionally 3 during churn. 13/16 tolerates 3 divergent
/// peers while still limiting how many candidates an attacker can
/// fabricate before the check bites. A lower threshold (e.g. 9/16)
/// would let an attacker who controls 7 real neighbourhood peers plant
/// 7 fabricated candidates and still pass.
///
/// This is the pure "fabricated key" defence; it does not stop an
/// attacker who can grind the pool midpoint address to land near 13
/// pre-chosen keys AND run those keys as Sybil DHT participants. That
/// requires an orthogonal Sybil-resistance layer and is out of scope
/// for this check.
const CANDIDATE_CLOSENESS_REQUIRED: usize = 13;

/// Timeout for the authoritative network lookup used by the closeness
/// check.
///
/// Iterative Kademlia lookups can cascade through up to 20 iterations,
/// and a single unresponsive peer's dial can take 20-30s before timing
/// out. 60s leaves room for the lookup to converge even under churn
/// while still capping `DoS` amplification at roughly one bounded lookup
/// per forged `pool_hash`.
const CLOSENESS_LOOKUP_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);

Copilot AI Apr 22, 2026


PR description says the closeness check uses a 10s timeout and requires 9/16 candidates to be present in the DHT-closest set, but the implementation uses CLOSENESS_LOOKUP_TIMEOUT = 60s and CANDIDATE_CLOSENESS_REQUIRED = 13.

Please either update the PR description to match the actual behavior, or adjust these constants to match the documented design so reviewers/operators don’t get surprised by stricter thresholds/longer blocking time in production.

grumbach (Collaborator, Author) commented Apr 22, 2026

@greptile-apps @claude please review

@jacderida jacderida changed the base branch from main to rc-2026.4.2 April 22, 2026 22:05
@jacderida jacderida merged commit d4213d5 into WithAutonomi:rc-2026.4.2 Apr 22, 2026
12 checks passed