fix(consensus): bound ancestry walk, fix partial tx IDs, enforce lock ordering, fix poll mutex#19
Open
wpank wants to merge 1 commit into
Conversation
… ordering, fix poll mutex - Bound the verify ancestry walk to MAX_ANCESTRY_WALK (1024) to prevent unbounded Vec growth when a long chain of unverified blocks is encountered after a prolonged disconnect (#113). - Change LedgerView::collect_ancestor_tx_ids to return Option and propagate None (snapshot gap) to callers instead of silently returning a partial excluded set that risks duplicate transactions (#149). - Enforce uniform RwLock acquisition order (persisted -> persisted_order -> snapshots) across all InMemorySnapshotStore methods to prevent potential deadlocks from non-uniform ordering (#151). - Release the async Mutex in wait_for_snapshot by cloning the InMemorySnapshotStore handle once under the lock and using it directly for subsequent poll iterations, avoiding thundering-herd contention when many waiters are notified simultaneously (#153). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes four consensus-layer issues that can cause unbounded memory growth, duplicate transactions, potential deadlocks, and unnecessary mutex contention:
#113 -- Bound ancestry walk in
verify: Theverifymethod collected all unverified blocks from the ancestry stream into an unboundedVec. After a prolonged disconnect, this could consume arbitrary memory. Now bounded toMAX_ANCESTRY_WALK(1024 blocks); verification fails if the limit is exceeded, deferring to catch-up / state-sync for recovery.#149 --
collect_ancestor_tx_idsreturns error on snapshot gap:LedgerView::collect_ancestor_tx_idspreviously returned a partialBTreeSetwhen a snapshot was missing from the chain, silently ignoring the gap. The caller would then build a block with an incomplete excluded set, risking duplicate transactions. Now returnsOption<BTreeSet<TxId>>and propagatesNoneto callers so they can abort the proposal.#151 -- Enforce uniform RwLock ordering in
InMemorySnapshotStore: Multiple methods acquiredsnapshots,persisted, andpersisted_orderRwLock guards in different orders, creating a potential deadlock. All multi-lock methods now follow the canonical order:persisted->persisted_order->snapshots.#153 -- Release ledger mutex during
wait_for_snapshotpolling:wait_for_snapshotre-acquired the asyncMutex<LedgerState>on every poll iteration, causing thundering-herd contention when multiple waiters were notified simultaneously. Now clones theInMemorySnapshotStorehandle (which is backed byArc<RwLock<..>>) once under the mutex and uses it directly for subsequent checks, eliminating the per-iteration mutex acquisition.Test plan
kora-consensusandkora-ledgerpass (cargo test -p kora-consensus -p kora-ledger)cargo check --workspacepassescargo clippy --workspace --no-deps -- -D warningspassesbuild_proposal_txscallers handle the newOptionreturn type correctly🤖 Generated with Claude Code