Skip to content

fix(consensus): bound ancestry walk, fix partial tx IDs, enforce lock ordering, fix poll mutex#19

Open
wpank wants to merge 1 commit into
Nunchi-trade:mainfrom
wpank:fix/consensus-ancestry-txids-lock-ordering-poll
Open

fix(consensus): bound ancestry walk, fix partial tx IDs, enforce lock ordering, fix poll mutex#19
wpank wants to merge 1 commit into
Nunchi-trade:mainfrom
wpank:fix/consensus-ancestry-txids-lock-ordering-poll

Conversation

@wpank
Copy link
Copy Markdown
Collaborator

@wpank wpank commented May 30, 2026

Summary

Fixes four consensus-layer issues that can cause unbounded memory growth, duplicate transactions, potential deadlocks, and unnecessary mutex contention:

  • #113 -- Bound ancestry walk in verify: The verify method collected all unverified blocks from the ancestry stream into an unbounded Vec. After a prolonged disconnect, this could consume arbitrary memory. Now bounded to MAX_ANCESTRY_WALK (1024 blocks); verification fails if the limit is exceeded, deferring to catch-up / state-sync for recovery.

  • #149 -- collect_ancestor_tx_ids returns error on snapshot gap: LedgerView::collect_ancestor_tx_ids previously returned a partial BTreeSet when a snapshot was missing from the chain, silently ignoring the gap. The caller would then build a block with an incomplete excluded set, risking duplicate transactions. Now returns Option<BTreeSet<TxId>> and propagates None to callers so they can abort the proposal.

  • #151 -- Enforce uniform RwLock ordering in InMemorySnapshotStore: Multiple methods acquired snapshots, persisted, and persisted_order RwLock guards in different orders, creating a potential deadlock. All multi-lock methods now follow the canonical order: persisted -> persisted_order -> snapshots.

  • #153 -- Release ledger mutex during wait_for_snapshot polling: wait_for_snapshot re-acquired the async Mutex<LedgerState> on every poll iteration, causing thundering-herd contention when multiple waiters were notified simultaneously. Now clones the InMemorySnapshotStore handle (which is backed by Arc<RwLock<..>>) once under the mutex and uses it directly for subsequent checks, eliminating the per-iteration mutex acquisition.

Test plan

  • Existing unit tests in kora-consensus and kora-ledger pass (cargo test -p kora-consensus -p kora-ledger)
  • cargo check --workspace passes
  • cargo clippy --workspace --no-deps -- -D warnings passes
  • Review ancestry walk bound (1024) is sufficient for normal operation while preventing OOM
  • Review that build_proposal_txs callers handle the new Option return type correctly

🤖 Generated with Claude Code

… ordering, fix poll mutex

- Bound the verify ancestry walk to MAX_ANCESTRY_WALK (1024) to prevent
  unbounded Vec growth when a long chain of unverified blocks is
  encountered after a prolonged disconnect (#113).

- Change LedgerView::collect_ancestor_tx_ids to return Option and
  propagate None (snapshot gap) to callers instead of silently returning
  a partial excluded set that risks duplicate transactions (#149).

- Enforce uniform RwLock acquisition order (persisted -> persisted_order
  -> snapshots) across all InMemorySnapshotStore methods to prevent
  potential deadlocks from non-uniform ordering (#151).

- Release the async Mutex in wait_for_snapshot by cloning the
  InMemorySnapshotStore handle once under the lock and using it
  directly for subsequent poll iterations, avoiding thundering-herd
  contention when many waiters are notified simultaneously (#153).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant