consensus: rejoin a lagging archive by syncing proposals from peers #570
Merged
CassOnMars merged 1 commit on Jun 18, 2026
A global-consensus participant that falls behind (e.g. an archive isolated by a network partition) could not rejoin consensus. Orphan resolution and finalized-rank advancement happen only on the consensus message path (on_receive_proposal -> forks.add_validated_state -> drain orphans); a plain frame-store write never triggers them, and freshness gates drop a lagging node's traffic until its finalized rank advances. With no path to backfill the missed proposals, the node would orphan every later proposal forever: its frame store could catch up (via the read-only poller) while its consensus engine stayed permanently stuck.

This ports the Go node's catch-up path (SyncProvider / GetGlobalProposal -> AddProposal): when the engine orphans a proposal for a missing parent, pull the missing proposals from a peer and submit them into the consensus loop so the node finalizes them and resumes voting, rather than merely mirroring frames.

Changes:
- Serve full proposals: implement GlobalService.GetGlobalProposal, assembling state + parent QC + prior TC + proposer vote from the clock store (FrameLookup::get_global_proposal). Previously stubbed to return nothing for non-genesis frames.
- Persist the proposer vote at proposal ingest (keyed filter, rank, selector) so it can be served back; the store trait had the accessor but no producer wrote it. Add ProposalVote/TimeoutCertificate proto<->wire conversions and proto_proposal_to_signed (only QC had one).
- Trigger: add Consumer::on_missing_parent, fired at the orphan-cache site, surfaced through GlobalConsumer and a required SyncTriggerHook on ConsensusActivationParams. It is required (not optional) so the recovery path cannot be silently left unwired.
- Catch-up task (node): a Notify-driven task pulls proposals via ArchiveClient::get_global_proposal ascending from the engine's finalized frame (tracked separately from the poller-shared head) and submits them (submit_quorum_certificate / submit_timeout_certificate / submit_proposal).
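For readers unfamiliar with the orphan path, here is a minimal sketch of the trigger idea: a proposal whose parent is unknown is cached as an orphan and a hook is fired so something else can fetch the parent. It is not the code in this PR. `ToyForks`, `Proposal`, `MissingParentHook`, and `RecordingHook` are hypothetical names, the chain is assumed linear, and "accepted tip" stands in for real finalization rules.

```rust
use std::collections::BTreeMap;

/// Hypothetical, heavily simplified proposal: a rank plus its parent's rank.
#[derive(Clone, Debug, PartialEq)]
pub struct Proposal {
    pub rank: u64,
    pub parent_rank: u64,
}

/// Hypothetical hook, loosely modeled on the on_missing_parent trigger above.
pub trait MissingParentHook {
    fn on_missing_parent(&mut self, missing_rank: u64);
}

/// Test double that only records which parents were requested.
#[derive(Default)]
pub struct RecordingHook {
    pub requested: Vec<u64>,
}

impl MissingParentHook for RecordingHook {
    fn on_missing_parent(&mut self, missing_rank: u64) {
        self.requested.push(missing_rank);
    }
}

/// Toy fork tracker for a linear chain (assumption: no competing forks).
pub struct ToyForks<H: MissingParentHook> {
    /// Highest rank accepted so far; rank 0 plays the role of genesis.
    pub tip: u64,
    /// Orphans keyed by the parent rank they are waiting for.
    pub orphans: BTreeMap<u64, Proposal>,
    pub hook: H,
}

impl<H: MissingParentHook> ToyForks<H> {
    pub fn new(hook: H) -> Self {
        ToyForks { tip: 0, orphans: BTreeMap::new(), hook }
    }

    pub fn on_receive_proposal(&mut self, p: Proposal) {
        // Ignore malformed, stale, or (in this toy) conflicting proposals.
        if p.rank <= p.parent_rank || p.rank <= self.tip || p.parent_rank < self.tip {
            return;
        }
        if p.parent_rank > self.tip {
            // Parent unknown: cache as an orphan and fire the trigger.
            self.hook.on_missing_parent(p.parent_rank);
            self.orphans.insert(p.parent_rank, p);
            return;
        }
        // Parent is the current tip: accept, then drain waiting orphans.
        self.tip = p.rank;
        while let Some(child) = self.orphans.remove(&self.tip) {
            self.tip = child.rank;
        }
    }
}
```

In this sketch a plain store write would never touch `tip` or `orphans`; only `on_receive_proposal` does, which mirrors why the missed proposals have to be submitted through the consensus path.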
Verification: the devnet rank-1 partition scenario (archive-4 isolated) now passes with the victim genuinely rejoining consensus. It orphans during the partition, then on heal the catch-up supplies the one missing parent frame and the engine finalizes frames 1..4 (state finalized, persisted candidate frames), orphaning stops, and 4/4 nodes reach the stop frame with the chain-safety check clean. Before this change the victim stayed at finalized=0, orphaning every rank.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
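The shape of the partition scenario can be illustrated with a toy model. This is not the devnet test. It assumes frames are plain numbers whose parent is `n - 1`, that frame 1 is the missing parent, and that a frame is "finalized" as soon as its parent is. `ToyEngine`, `ProposalSource`, `PeerWithFrames`, and `catch_up` are hypothetical names for illustration only.

```rust
use std::collections::BTreeSet;

/// Toy engine: frame n can only be finalized right after frame n - 1.
pub struct ToyEngine {
    pub finalized: u64,
    pub orphans: BTreeSet<u64>,
}

impl ToyEngine {
    pub fn new() -> Self {
        ToyEngine { finalized: 0, orphans: BTreeSet::new() }
    }

    /// Submit a frame; returns true if it had to be cached as an orphan.
    pub fn submit(&mut self, frame: u64) -> bool {
        if frame <= self.finalized {
            return false; // already finalized, nothing to do
        }
        if frame != self.finalized + 1 {
            self.orphans.insert(frame); // parent missing
            return true;
        }
        self.finalized = frame;
        // Drain orphans that are now contiguous with the finalized frame.
        while self.orphans.remove(&(self.finalized + 1)) {
            self.finalized += 1;
        }
        false
    }
}

/// Hypothetical peer interface: returns the frame if the peer holds it.
pub trait ProposalSource {
    fn get_global_proposal(&self, frame: u64) -> Option<u64>;
}

/// Peer that holds frames 1..=N.
pub struct PeerWithFrames(pub u64);

impl ProposalSource for PeerWithFrames {
    fn get_global_proposal(&self, frame: u64) -> Option<u64> {
        if frame >= 1 && frame <= self.0 { Some(frame) } else { None }
    }
}

/// Pull frames ascending from the engine's finalized frame until no orphans
/// remain or the peer has nothing more. Returns how many frames were fetched.
pub fn catch_up(engine: &mut ToyEngine, peer: &dyn ProposalSource) -> usize {
    let mut fetched = 0;
    while !engine.orphans.is_empty() {
        match peer.get_global_proposal(engine.finalized + 1) {
            Some(frame) => {
                engine.submit(frame);
                fetched += 1;
            }
            None => break,
        }
    }
    fetched
}
```

Under these assumptions, a victim that saw only frames 2, 3, and 4 during the partition stays at `finalized == 0` with three orphans, and a single fetched frame is enough for it to reach frame 4.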