consensus: rejoin a lagging archive by syncing proposals from peers #570

Merged
CassOnMars merged 1 commit into QuilibriumNetwork:v2.1.0.24 from dazthecorgi:fix-archive-consensus-catch-up on Jun 18, 2026
@dazthecorgi (Contributor)

A global-consensus participant that falls behind — e.g. an archive isolated by a network partition — could not rejoin consensus. Orphan resolution and finalized-rank advancement happen only on the consensus message path (on_receive_proposal -> forks.add_validated_state -> drain orphans); a plain frame-store write never triggers them, and freshness gates drop a lagging node's traffic until its finalized rank advances. With no path to backfill the missed proposals, the node would orphan every later proposal forever — its frame store could catch up (via the read-only poller) while its consensus engine stayed permanently stuck.
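
For reviewers, the stall can be reproduced in a toy model. This is only a sketch under simplifying assumptions: `ToyEngine`, `ToyProposal`, and `ToyFrameStore` are hypothetical names, a proposal is reduced to a rank and a parent rank, and "known" stands in for the engine's validated and finalized state. It is not the engine's actual data structure.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy stand-in for a proposal: its rank and the rank of its parent.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyProposal {
    pub rank: u64,
    pub parent_rank: u64,
}

/// Hypothetical miniature of the engine: ranks it has accepted, plus an
/// orphan cache keyed by the rank of the missing parent.
pub struct ToyEngine {
    known: BTreeSet<u64>,
    orphans: BTreeMap<u64, Vec<ToyProposal>>,
}

impl ToyEngine {
    pub fn new() -> Self {
        let mut known = BTreeSet::new();
        known.insert(0); // genesis is always known
        Self { known, orphans: BTreeMap::new() }
    }

    /// The consensus message path. Returns the ranks accepted by this call,
    /// including any orphans that were drained as a result.
    pub fn receive_proposal(&mut self, p: ToyProposal) -> Vec<u64> {
        if !self.known.contains(&p.parent_rank) {
            // Missing parent: park the proposal until the parent arrives.
            self.orphans.entry(p.parent_rank).or_default().push(p);
            return Vec::new();
        }
        let mut accepted = Vec::new();
        let mut queue = vec![p];
        while let Some(next) = queue.pop() {
            if self.known.insert(next.rank) {
                accepted.push(next.rank);
                // Accepting a rank releases the orphans that waited on it.
                if let Some(children) = self.orphans.remove(&next.rank) {
                    queue.extend(children);
                }
            }
        }
        accepted
    }

    pub fn highest_known(&self) -> u64 {
        self.known.iter().next_back().copied().unwrap_or(0)
    }

    pub fn orphan_count(&self) -> usize {
        self.orphans.values().map(Vec::len).sum()
    }
}

/// Toy frame store. Writes here never reach the engine, which is the point.
#[derive(Default)]
pub struct ToyFrameStore {
    frames: BTreeSet<u64>,
}

impl ToyFrameStore {
    pub fn write(&mut self, frame: u64) {
        self.frames.insert(frame);
    }

    pub fn len(&self) -> usize {
        self.frames.len()
    }
}
```

In this model, writing frames 1..3 to the store while delivering only proposals 2 and 3 to the engine leaves the engine at rank 0 with two orphans. Delivering proposal 1 through `receive_proposal` is the only thing that drains them.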

This ports the Go node's catch-up path (SyncProvider / GetGlobalProposal -> AddProposal): when the engine orphans a proposal for a missing parent, pull the missing proposals from a peer and submit them into the consensus loop so the node finalizes them and resumes voting — rather than merely mirroring frames.
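
The shape of the catch-up pass can be sketched roughly as follows. Everything here is illustrative: `catch_up`, `CatchUpReport`, and `peer_lookup` are hypothetical names, the peer is an in-memory map rather than a network client, and stopping at the first frame the peer cannot serve (or the first rejected submission) is an assumption of this sketch, not a statement about the real task.

```rust
use std::collections::BTreeMap;

/// Outcome of one catch-up pass.
#[derive(Debug, PartialEq, Eq)]
pub struct CatchUpReport {
    /// Frames that were fetched and handed to the consensus loop.
    pub submitted: Vec<u64>,
    /// First frame that was not submitted; the next pass would start here.
    pub next_frame: u64,
}

/// Pull proposals in ascending order starting at `finalized + 1`.
/// The pass ends when the peer has nothing for the next frame or when
/// `submit` reports a rejection.
pub fn catch_up<P, F, S>(finalized: u64, mut fetch: F, mut submit: S) -> CatchUpReport
where
    F: FnMut(u64) -> Option<P>,
    S: FnMut(u64, P) -> bool,
{
    let mut next = finalized + 1;
    let mut submitted = Vec::new();
    while let Some(proposal) = fetch(next) {
        if !submit(next, proposal) {
            break;
        }
        submitted.push(next);
        next += 1;
    }
    CatchUpReport { submitted, next_frame: next }
}

/// Hypothetical in-memory peer: frame number to an opaque proposal payload.
pub fn peer_lookup(peer: &BTreeMap<u64, String>) -> impl FnMut(u64) -> Option<String> + '_ {
    move |frame| peer.get(&frame).cloned()
}
```

Starting from the engine's own finalized frame, rather than from the frame store's head, is what lets the pass cover frames the poller has already mirrored but the engine has never seen on the message path.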

Changes:

  • Serve full proposals: implement GlobalService.GetGlobalProposal, assembling state + parent QC + prior TC + proposer vote from the clock store (FrameLookup::get_global_proposal). Previously stubbed to return nothing for non-genesis frames.
  • Persist the proposer vote at proposal ingest, keyed by (filter, rank, selector), so it can be served back; the store trait already had the accessor, but no producer wrote the vote. Add ProposalVote/TimeoutCertificate proto<->wire conversions and proto_proposal_to_signed (previously only QC had a conversion).
  • Trigger: add Consumer::on_missing_parent, fired at the orphan-cache site, surfaced through GlobalConsumer and a required SyncTriggerHook on ConsensusActivationParams — required (not optional) so the recovery path cannot be silently left unwired.
  • Catch-up task (node): a Notify-driven task pulls proposals via ArchiveClient::get_global_proposal ascending from the engine's finalized frame (tracked separately from the poller-shared head) and submits them (submit_quorum_certificate / submit_timeout_certificate / submit_proposal).
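
Two of the design points above, the required trigger and the way a served proposal fans out into submissions, can be illustrated with a small sketch. All types here are hypothetical stand-ins (`ToyActivationParams`, `ToyBundle`, `Submission`, `channel_trigger`); a std channel takes the place of the Notify-driven wake-up, certificates are reduced to the rank they certify, and the QC, TC, proposal ordering is an assumption of the sketch.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hook type: called with the rank of the missing parent.
pub type SyncTrigger = Box<dyn Fn(u64) + Send>;

/// Activation parameters with a required trigger. Because the field is not
/// an `Option`, this value cannot be built without supplying a hook.
pub struct ToyActivationParams {
    pub sync_trigger: SyncTrigger,
}

/// Build a trigger that wakes a catch-up task through a channel.
/// Unlike a notify primitive, a channel queues every signal instead of
/// coalescing them; that difference does not matter for this sketch.
pub fn channel_trigger() -> (SyncTrigger, Receiver<u64>) {
    let (tx, rx) = channel::<u64>();
    let hook: SyncTrigger = Box::new(move |missing_parent| {
        // A send error only means no catch-up task is listening any more.
        let _ = tx.send(missing_parent);
    });
    (hook, rx)
}

/// Toy version of a served proposal.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ToyBundle {
    pub frame: u64,
    pub parent_qc_rank: u64,
    /// Present only when the prior rank ended in a timeout.
    pub prior_tc_rank: Option<u64>,
}

/// What the catch-up task hands to the consensus loop.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Submission {
    QuorumCertificate(u64),
    TimeoutCertificate(u64),
    Proposal(u64),
}

/// Flatten one bundle into submissions: certificates first, so the proposal
/// arrives with its justification already known (assumed ordering).
pub fn submissions_for(bundle: &ToyBundle) -> Vec<Submission> {
    let mut out = vec![Submission::QuorumCertificate(bundle.parent_qc_rank)];
    if let Some(tc_rank) = bundle.prior_tc_rank {
        out.push(Submission::TimeoutCertificate(tc_rank));
    }
    out.push(Submission::Proposal(bundle.frame));
    out
}
```

Making the field non-optional moves the "forgot to wire recovery" failure from runtime to compile time, which is the motivation stated for the required SyncTriggerHook.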

Verification: the devnet rank-1 partition scenario (archive-4 isolated) now passes with the victim genuinely rejoining consensus. The victim orphans proposals during the partition; on heal, the catch-up supplies the one missing parent frame and the engine finalizes frames 1..4 (state finalized, persisted candidate frames). Orphaning then stops, and 4/4 nodes reach the stop frame with the chain-safety check clean. Before this change the victim stayed at finalized=0, orphaning every rank.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@CassOnMars merged commit f7753c3 into QuilibriumNetwork:v2.1.0.24 on Jun 18, 2026
5 checks passed