feat(autoheal): compute heal plan for overlapping partitions #211

Merged: ieQu1 merged 6 commits into emqx:main from keynslug:fix/autoheal/overlapping-partitions on May 15, 2026

Conversation

@keynslug (Contributor) commented May 12, 2026

This PR reworks the heal plan computation algorithm to produce correct outcomes in environments where overlapping partitions are permitted, while keeping it functionally equivalent to the existing one otherwise. When overlapping partitions are involved, the goal is now to compute the largest fully connected sub-cluster across them.

Addresses EMQX-14176.

keynslug added 2 commits May 12, 2026 20:57
This commit reworks the heal plan computation algorithm to work in
environments where overlapping partitions are permitted, while
keeping it functionally equivalent otherwise.

The algorithm is essentially: the smallest set of nodes directly
connected to any partitioned node must be healed, i.e. must reboot
and rejoin the cluster.
@keynslug force-pushed the fix/autoheal/overlapping-partitions branch 3 times, most recently from efa38b3 to d7de9d5, on May 13, 2026 09:08
@keynslug marked this pull request as ready for review on May 13, 2026 10:56
keynslug added 2 commits May 13, 2026 13:00
This commit changes the heal plan algorithm to instead consider fully
connected sub-clusters in (potentially overlapping) cluster cliques
computed from cluster partitions. For example, this improves plans
for situations where only a single link between 2 nodes is broken:
only those 2 nodes will be asked to rejoin.
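
To make the clique idea concrete, here is a minimal sketch of maximal clique enumeration (plain Bron-Kerbosch, no pivoting) over an adjacency map. It is not the code from this PR: the module name `clique_sketch` and the function `maximal_cliques/1` are hypothetical, and the only detail borrowed from the PR is the rule that an edge exists when both vertices list each other.

```erlang
-module(clique_sketch).
-export([maximal_cliques/1]).

%% Hypothetical helper, not part of mria.
%% Adj maps each vertex to its adjacency list. An undirected edge exists
%% only if both vertices list each other.
-spec maximal_cliques(#{V => [V]}) -> [ordsets:ordset(V)].
maximal_cliques(Adj) when map_size(Adj) =:= 0 ->
    [];
maximal_cliques(Adj) ->
    Vs = ordsets:from_list(maps:keys(Adj)),
    lists:sort(bron_kerbosch([], Vs, [], Adj)).

%% R: clique built so far, P: candidates that can extend R,
%% X: vertices already explored (used to reject non-maximal cliques).
bron_kerbosch(R, [], [], _Adj) ->
    [ordsets:from_list(R)];
bron_kerbosch(_R, [], _X, _Adj) ->
    [];
bron_kerbosch(R, [V | Rest], X, Adj) ->
    Ns = neighbours(V, Adj),
    WithV = bron_kerbosch([V | R],
                          ordsets:intersection(Rest, Ns),
                          ordsets:intersection(X, Ns),
                          Adj),
    WithV ++ bron_kerbosch(R, Rest, ordsets:add_element(V, X), Adj).

%% Mutual neighbours of V, as an ordset.
neighbours(V, Adj) ->
    ordsets:from_list(
      [U || U <- maps:get(V, Adj, []),
            U =/= V,
            lists:member(V, maps:get(U, Adj, []))]).
```

For a 4-node cluster where only the link between `a` and `b` is broken, calling `clique_sketch:maximal_cliques(#{a => [c, d], b => [c, d], c => [a, b, d], d => [a, b, c]})` yields two overlapping cliques that differ only in `a` and `b`, which is consistent with the commit message's claim that only those 2 nodes need to rejoin.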
@keynslug force-pushed the fix/autoheal/overlapping-partitions branch from d7de9d5 to 3ef4d38 on May 13, 2026 11:01
Comment thread src/mria_lib.erl Outdated
%% Enumerate cliques in the graph.
%% Graph is undirected, edge is considered to exist if 2 vertices have each other in
%% adjacency lists.
-spec find_cliques(#{V => [V]}) -> [[V]].
Member

Some code in mria_autoheal applies ordsets functions to the cliques without first converting them from plain lists. I guess the spec should be

Suggested change
- -spec find_cliques(#{V => [V]}) -> [[V]].
+ -spec find_cliques(#{V => [V]}) -> [ordsets:ordset(V)].

then?
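
The concern behind this suggestion can be shown with a small sketch: `ordsets` functions assume sorted, duplicate-free lists, so passing a plain list may silently give a wrong answer. The module name `ordsets_sketch` is hypothetical and the example is independent of the mria code.

```erlang
-module(ordsets_sketch).
-export([demo/0]).

%% Compare a subset check on a raw (unsorted) list with the same check
%% after normalising the list via ordsets:from_list/1.
demo() ->
    Raw = [2, 1],
    Set = ordsets:from_list(Raw),
    #{raw_subset => ordsets:is_subset(Raw, [1, 2]),
      normalised_subset => ordsets:is_subset(Set, [1, 2])}.
```

Both inputs describe the same set, yet only the normalised one is guaranteed to be handled correctly, which is why returning `ordsets:ordset(V)` from the producer keeps callers safe.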

Comment thread src/mria_autoheal.erl Outdated
-spec coordinator([node()]) -> node().
coordinator(Candidates) ->
    case lists:member(node(), Candidates) of
        true -> node();
Member

This is likely to create conflicts, since multiple nodes are likely to appoint themselves as coordinators.
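
One way to avoid the conflict described here, sketched under the assumption that every node evaluates the choice over the same non-empty candidate list, is to pick the coordinator deterministically instead of preferring the local node. The name `pick_coordinator/1` is hypothetical, and this is not necessarily how the PR resolved the comment.

```erlang
-module(coordinator_sketch).
-export([pick_coordinator/1]).

%% Hypothetical alternative: every node that sees the same candidates
%% picks the same coordinator, the smallest node name in term order.
-spec pick_coordinator([node(), ...]) -> node().
pick_coordinator(Candidates = [_ | _]) ->
    lists:min(Candidates).
```

Because the result depends only on the candidate list and not on `node()`, two nodes with identical candidates cannot both appoint themselves. Nodes with different views of the candidates can still disagree, so this only narrows the problem.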

Comment thread src/mria_lib.erl Outdated
Comment on lines +291 to +295

-include_lib("eunit/include/eunit.hrl").
-undef(LET).

-include_lib("proper/include/proper_common.hrl").
Member

Nit: IIRC it's possible to include eunit after proper, so there's no need to undefine the LET macro.
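
The reordering hinted at here would look roughly like the fragment below. It relies on EUnit's header guarding its own `?LET` definition with `-ifndef(LET)`, so whichever definition comes first wins. Treat it as a sketch to verify against the EUnit and PropEr versions in use.

```erlang
%% PropEr first: its ?LET definition is the one the property tests need.
-include_lib("proper/include/proper_common.hrl").
%% EUnit second: it only defines ?LET if it is not already defined,
%% so no -undef(LET) is required.
-include_lib("eunit/include/eunit.hrl").
```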

keynslug added 2 commits May 15, 2026 07:12
This commit changes the core of the split view computation algorithm
from the "overlapping cliques" analysis to the "reachability matrix"
approach.

The primary observation is that the largest set of nodes that agree
on their reachability, and therefore on consistency, should have
equal vectors in the cluster reachability matrix. This is likely
to produce the same results as the "overlapping cliques" approach (at
least as far as tests show) but is much cheaper.
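
The observation above can be sketched as grouping nodes by identical rows of the reachability matrix and keeping the largest group. This is an illustration only, not the merged implementation: the module `reachability_sketch`, the function `largest_agreeing_group/1`, the input shape (a map from each node to the nodes it reports as reachable) and the tie-breaking rule are all assumptions.

```erlang
-module(reachability_sketch).
-export([largest_agreeing_group/1]).

%% Views maps each node to the nodes it currently reports as reachable.
%% Returns {Keep, Rejoin}: the largest group of nodes with identical
%% reachability rows, and all remaining nodes.
-spec largest_agreeing_group(#{node() => [node()]}) -> {[node()], [node()]}.
largest_agreeing_group(Views) ->
    %% One matrix row per node, normalised to an ordset; a node is
    %% assumed to always reach itself.
    Rows = maps:map(fun(N, Seen) -> ordsets:from_list([N | Seen]) end, Views),
    %% Group nodes whose rows are equal.
    Groups = maps:fold(
               fun(N, Row, Acc) ->
                       maps:update_with(Row, fun(Ns) -> [N | Ns] end, [N], Acc)
               end, #{}, Rows),
    %% Largest group first; ties are broken by term order so that the
    %% result is deterministic.
    Sorted = lists:sort(
               fun(A, B) -> {-length(A), A} =< {-length(B), B} end,
               [lists:sort(G) || G <- maps:values(Groups)]),
    case Sorted of
        [] -> {[], []};
        [Keep | Others] -> {Keep, lists:sort(lists:append(Others))}
    end.
```

With a single broken link between `a` and `b` in a 4-node cluster, `c` and `d` share the row `[a, b, c, d]` while `a` and `b` each have a unique row, so the sketch keeps `[c, d]` and marks `[a, b]` for rejoin. Grouping is a single pass over the rows, which illustrates why this approach is cheaper than enumerating cliques.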
@keynslug force-pushed the fix/autoheal/overlapping-partitions branch from 1672dd2 to efd299d on May 15, 2026 05:14
@ieQu1 merged commit ecb6bd6 into emqx:main on May 15, 2026
1 check passed
@keynslug deleted the fix/autoheal/overlapping-partitions branch on May 19, 2026 20:39
2 participants