
Reorganization handling #561

@randomlogin

Description


On regtest I had problems with syncing when a reorganization occurs, and I started to investigate.
What happens is that kyoto bans its only peer when a reorg happens. I initially thought that the peer sends filters from reorged blocks => kyoto thinks these are garbage filters and bans the peer.

Well, this could happen, but even after trying to fix this problem (by introducing a cache of reorganized blocks to compare received filters against), I couldn't sync properly.
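A minimal sketch of the cache idea, with invented names (`ReorgCache` and the plain `String` block-hash type are placeholders, not kyoto's actual types): remember the hashes of blocks disconnected during recent reorgs, so a filter arriving for one of them can be recognized as stale rather than garbage.

```rust
use std::collections::HashSet;

// Hypothetical sketch -- none of these names come from the kyoto codebase.
/// Remembers block hashes disconnected during recent reorgs, so filters
/// arriving for those blocks can be ignored instead of being treated as
/// garbage (which currently gets the peer banned).
struct ReorgCache {
    stale: HashSet<String>, // hashes of blocks removed from the best chain
    cap: usize,             // bound the cache so it cannot grow forever
}

impl ReorgCache {
    fn new(cap: usize) -> Self {
        Self { stale: HashSet::new(), cap }
    }

    /// Record a block hash disconnected during a reorg.
    fn insert(&mut self, hash: &str) {
        if self.stale.len() < self.cap {
            self.stale.insert(hash.to_string());
        }
    }

    /// A filter whose block hash is in the cache is stale, not malicious.
    fn is_stale(&self, hash: &str) -> bool {
        self.stale.contains(hash)
    }
}
```

Before banning a peer over an unexpected filter, the filter's block hash would first be checked against this cache.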

I spent quite a lot of time debugging and realized there are several issues involved.

  1. The first one is the ban mentioned above, but it does not happen on regtest with a single peer. Still, it is relevant for live networks (where we get filters and headers from different peers, which can occur, right?).

  2. Another problem arises during sync on a reorganized chain:

  • we don't receive the stale filters (filters are sent one by one, and since the peer performed the reorg, it sends the new filters);
  • during the reorganization we discard the filter headers corresponding to the reorganized blocks, so when the filters arrive we cannot process them (UnknownFilterHash).

If we do receive the filter headers before the filters, we're fine and sync successfully.
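For context, the UnknownFilterHash error comes from the BIP-158 check: each received filter is hashed and compared against the filter hash committed to by the filter-header chain, so if the entries for a block were discarded during the reorg, there is nothing left to compare against. A toy illustration (all names invented; `std`'s `DefaultHasher` stands in for the double-SHA256 that BIP-158 actually uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for the double-SHA256 filter hash of BIP-158.
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Map from block hash to the filter hash committed to by the
/// filter-header chain. If a reorg discarded the entry, lookup fails.
struct FilterHeaderChain {
    expected: HashMap<String, u64>,
}

#[derive(Debug, PartialEq)]
enum FilterCheck {
    Ok,
    Mismatch,          // filter does not match its commitment: punishable
    UnknownFilterHash, // no committed filter hash for this block
}

impl FilterHeaderChain {
    fn check(&self, block_hash: &str, filter_bytes: &[u8]) -> FilterCheck {
        match self.expected.get(block_hash) {
            None => FilterCheck::UnknownFilterHash,
            Some(&want) if want == toy_hash(filter_bytes) => FilterCheck::Ok,
            Some(_) => FilterCheck::Mismatch,
        }
    }
}
```

In the failing sequence above, the reorg removes the map entry between the cfheaders and cfilter messages, so an honest filter lands in the `UnknownFilterHash` arm.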

Setup to reproduce:

  1. Mine 4200 blocks (so it takes several filter requests to sync)
  2. Make kyoto start syncing.
  3. Invalidate a block.
  4. Mine two new blocks.

Please note that this is a race condition: it depends on your machine, and you might need to run the test several times to reproduce it, or initially mine more/fewer blocks.

Logs with additional debugging prints:

In sync filter got message CFilter { filter_type: 0, block_hash: 2720ed35e8255268f42e5dad8705aa52e034fc8fd2bc2212c405700595ed6774, filter: [1, 106, 190, 0] }
banning the peer ManagedPeer { record: Record { addr: Ipv4(127.0.0.1), port: 44277, source: SourceId([182, 151, 35, 72]), services: ServiceFlags(3145), failed_attempts: 0, last_connection: None, last_attempt: None }, broadcast_min: FeeRate(250), ptx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x73696401a080, tail_position: 20 }, semaphore: Semaphore { semaphore: Semaphore { permits: 32 }, bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } }, handle: JoinHandle { id: Id(3) } }
In sync filter got message CFilter { filter_type: 0, block_hash: 6e62367779e67fc6bbe604fb8720ad27c31ca25c49020f9d10eb0e93ef9b07bd, filter: [1, 129, 63, 168] }
[Peer 1]: headers
sync_chain: already have last header , returning empty
Headers: (4201/4202) CFHeaders: (4199/4201) CFilters: (4199/4201)
next_stateful_message: GetHeaders (chain_height=4201)
Error handling a P2P message: Compact filter syncing encountered an error: we could not find the filter hash corresponding to that stop hash.
Percent complete: 99.95239
>>> DISCONNECT: DisconnectCommand from main_thread_request
Bootstrapping peers with DNS
Adding 0 sourced from DNS
Looking for connections to peers. Connected: 0, Required: 1

This in turn can be mitigated by discarding filters that correspond to orphaned blocks (via the same cache), but it breaks the sync mechanism. The reason is that on reorgs we effectively disconnect peers (with or without banning them): the tip height is incremented on "inv" messages, and eventually, when we reach the tip, we receive no new headers.

[Peer 2]: headers
sync_chain: already have last header , returning empty
Headers: (4201/4202) CFHeaders: (4199/4201) CFilters: (4199/4201)
next_stateful_message: GetHeaders (chain_height=4201)
Percent complete: 99.95239
[Peer 2]: headers
sync_chain: already have last header , returning empty
Headers: (4201/4202) CFHeaders: (4199/4201) CFilters: (4199/4201)
next_stateful_message: GetHeaders (chain_height=4201)
Percent complete: 99.95239
[Peer 2]: headers
sync_chain: already have last header , returning empty
Headers: (4201/4202) CFHeaders: (4199/4201) CFilters: (4199/4201)
next_stateful_message: GetHeaders (chain_height=4201)
Percent complete: 99.95239

Even if we do sync the cfheaders and filters, we still indefinitely try to get that phantom block 4202, which does not exist in reality.

While just disconnecting a peer might be okay on a live network, banning peers would definitely exhaust the peer list and shouldn't be done.

The solution I see is to:

  1. handle filters for reorganized blocks gracefully;
  2. update our own tip on "inv" messages more carefully, without incrementing it at the moment the "inv" is received.
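A rough sketch of point 2, with invented names (not kyoto's actual API): record what peers advertise via "inv" separately from the validated tip, and only move the tip once a header actually connects. The advertised height is then just a hint for requesting headers, never the target that filter sync is measured against.

```rust
// Hypothetical sketch: an inv records a peer's claim, but the tip that
// drives filter requests and sync completion only moves once a header
// is validated. A stale inv for a reorged-away block can then never
// leave us chasing a phantom height.
struct ChainState {
    validated_height: u32,  // headers we have actually verified
    advertised_height: u32, // best height any peer claimed via inv
}

impl ChainState {
    /// An inv only updates the hint and tells us whether asking for
    /// headers is worthwhile; it never moves the tip itself.
    fn on_inv(&mut self, claimed_height: u32) -> bool {
        self.advertised_height = self.advertised_height.max(claimed_height);
        claimed_height > self.validated_height
    }

    /// The tip moves here, after a header actually connected.
    fn on_validated_header(&mut self, height: u32) {
        self.validated_height = self.validated_height.max(height);
    }

    /// Filter headers and filters are requested up to the validated
    /// tip, so an unbacked claim of 4202 cannot stall filter sync.
    fn sync_target(&self) -> u32 {
        self.validated_height
    }
}
```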

I am happy to provide any additional information if needed, and to prepare a PR with the reorganized-blocks cache.

Just in case, the tests with the dirty debugging can be found here (the live_reorg_with_filters_in_flight test): https://github.com/randomlogin/kyoto/tree/reorganization-debug

Metadata

Labels: bug