CFP-44921: Workload-Controlled Sync and On-Demand Resolution for Cluster Mesh at Scale #89
yuecong wants to merge 1 commit into cilium:main
Conversation
…ter Mesh at Scale
Signed-off-by: Cong Yue <cong@databricks.com>
Implements workload-controlled sync and ad-hoc sync for Cilium Cluster Mesh, as proposed in CFP: Cluster Mesh Scaling to 100+ Clusters (cilium/design-cfps#89).
Co-authored-by: Isaac
Hi 👋, thanks for the proposal — I personally find it a very exciting proposal aiming to improve Cluster Mesh scalability! 🎉
I have a few very high-level comments though:
- We don't accept new features on existing stable branches. New features should target the main branch, and we are currently in the 1.20 development cycle. It would also be nice to check whether `bootstrap_seconds` can still exceed 5 min in 1.19 in your setup, as 1.17 is quite old now (and to check what exactly drives up that bootstrap time).
- This CFP describes 3 features that are each non-trivial, so I would suggest separating it into 3 CFPs. I would also suggest focusing on 1, or at most 2, so as not to have too many ongoing efforts started at the same time, and to help focus the efforts of both you and the reviewers on a smaller number of things.
Also, at first glance, this:

> During rolling updates of remote clustermesh-apiservers, LB failovers cause all connected kvstoremesh instances to reconnect simultaneously, creating thundering-herd re-sync bursts.

seems more like a bug, or at least something that we should try to fix too. It might even already be solved in 1.19!
Thanks for the review! A few clarifications:

Regarding the version: in the POC I used 1.17 because, in our environment, all our images require Ubuntu FIPS enabled, and I found that from 1.18 onwards Ubuntu FIPS does not work for some eBPF checks. I can submit a separate PR to fix that first so that the changes can target the 1.20 development branch as well.

Also, the proposal here is not limited to 1.17 — the scalability challenges exist whenever you have 100+ clusters across regions with many pods/services in each cluster. Especially when we rely on Cluster Mesh connectivity for production workloads, connectivity across clusters becomes as important as connectivity within the same cluster.
Agree with this; let me split it into 3 CFPs. By the way, in this case, do I need separate GitHub issues for each of them?
Sure, an issue or a PR is welcome! FYI, even for fix PRs you generally need to target the main branch and then backport to stable branches (see https://docs.cilium.io/en/stable/contributing/release/backports/).
Yep! Note that at first glance I am a bit skeptical about the on-demand recovery of Cluster Mesh data, though. If there are any reliability issues with the clustermesh-apiserver, I would rather fix those than introduce something like that.
Also, about more filtering on CiliumEndpoints: I think it would be interesting to estimate/benchmark what impact we expect from having a lower number of CiliumEndpoints. As there are "many" other resources involved, it's not obvious (at least to me) what the exact impact on the agents and the clustermesh-apiserver would be. There are also various questions around the interaction with global Services/MCS-API, for instance. Those considerations are somewhat similar for both sync-filtering features that you are proposing, so it sounds probably OK to have those in the same CFP, or still do 2 CFPs but not work on both at the same time!
To clarify — the on-demand recovery isn't about working around clustermesh-apiserver bugs. The sync pipeline works correctly, but the current design relies on purely active sync (etcd watches → kvstoremesh → local etcd → agent), which is inherently eventually consistent and cannot guarantee 100% completeness at every point in time. Even within the same region, we occasionally see transient windows where sync is incomplete during frequent pod churn (rolling deployments, autoscaling). Cross-region makes this worse: higher latency, more frequent connection resets, and longer watch reconnection times all widen the gap where /32 ipcache entries may be stale or missing.

These aren't bugs — they're inherent properties of an active-push, eventually-consistent system over unreliable networks. You can optimize the pipeline, but you can never fully eliminate the window between "endpoint created" and "all remote nodes have the /32 entry." The on-demand recovery is a datapath-level safety net that turns silent policy drops (missing /32 → WORLD identity → dropped) into a brief degraded-identity window followed by correct behavior. It's complementary to active sync, not a workaround.

I'll create a dedicated CFP for this so we can discuss the design and trade-offs in depth separately.
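To make the "slowness over failure" behavior concrete, here is a minimal Go sketch of the miss-then-resolve flow. Everything in it is illustrative: the `ipcache` type, `onMiss` handler, and identity values are hypothetical stand-ins rather than Cilium's actual types; in the proposal the datapath would raise `SIGNAL_IPCACHE_MISS` from BPF, and the agent would resolve the single /32 and upsert it.

```go
package main

import (
	"fmt"
	"sync"
)

// identityWorld is a hypothetical stand-in for the fallback identity used
// when no /32 entry exists yet (Cilium's real reserved identities differ).
const identityWorld uint32 = 2

// ipcache models the per-node IP→identity map kept by the agent.
type ipcache struct {
	mu      sync.RWMutex
	entries map[string]uint32
}

func newIPCache() *ipcache { return &ipcache{entries: map[string]uint32{}} }

// lookup returns the identity for ip and whether the lookup was a miss.
// On a miss the caller gets the degraded WORLD identity immediately
// ("slowness over failure") instead of a silent policy drop.
func (c *ipcache) lookup(ip string) (uint32, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if id, ok := c.entries[ip]; ok {
		return id, false
	}
	return identityWorld, true
}

// onMiss plays the role of the agent-side handler for the proposed
// SIGNAL_IPCACHE_MISS signal: resolve the single /32 on demand (e.g. from
// the remote cluster's clustermesh-apiserver) and upsert it, so the next
// packet sees the correct identity.
func (c *ipcache) onMiss(ip string, resolve func(string) (uint32, bool)) {
	if id, ok := resolve(ip); ok {
		c.mu.Lock()
		c.entries[ip] = id
		c.mu.Unlock()
	}
}

func main() {
	c := newIPCache()
	id, miss := c.lookup("10.0.1.7") // sync gap: no /32 entry yet
	fmt.Println(id, miss)            // degraded identity, miss reported

	// On-demand resolution closes the gap; a hypothetical resolver
	// returns the endpoint's real identity.
	c.onMiss("10.0.1.7", func(string) (uint32, bool) { return 4242, true })
	id, miss = c.lookup("10.0.1.7")
	fmt.Println(id, miss)
}
```

The property mirrored here is the one argued above: a miss degrades the verdict rather than dropping traffic, and the degraded window closes as soon as on-demand resolution completes.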
Thanks for the suggestion! I agree that benchmarking the impact would strengthen the proposal. I'll add concrete measurements showing the reduction in memory, CPU, bootstrap time, and etcd watch load on both the agents and the clustermesh-apiserver when filtering is enabled. This will help quantify how much endpoint filtering alone moves the needle, given that other resources (CiliumNodes, identities, etc.) are still synced.

Good call on the global Services / MCS-API interaction — I should clarify this explicitly. The label filter only applies to CiliumEndpoint sync for pod flat-L3 connectivity. Service-backed endpoints follow their own sync path (`service.cilium.io/global`), so a pod opted out of endpoint sync can still be reachable as a service backend. I'll make this distinction clear in the CFP.

I'll combine the source-side label filtering and consumer-side regional filtering into a single CFP, since they share the same pipeline and impact analysis. That makes more sense than splitting them.
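As a rough illustration of how the two filters compose, here is a Go sketch of the source-side opt-in check and the consumer-side scope check. The label keys come from this CFP; the `"true"`/scope-value semantics and all type and function names are assumptions for illustration, not the implemented API.

```go
package main

import "fmt"

const (
	syncLabel      = "clustermesh.cilium.io/sync"       // opt-in label from the CFP
	syncScopeLabel = "clustermesh.cilium.io/sync-scope" // scope label from the CFP
)

// endpoint is a hypothetical, simplified view of a CiliumEndpoint:
// just a name and the pod labels that drive filtering.
type endpoint struct {
	name   string
	labels map[string]string
}

// shouldSync sketches the source-side filter in the clustermesh-apiserver:
// only endpoints that explicitly opt in are written out for syncing.
func shouldSync(ep endpoint) bool {
	return ep.labels[syncLabel] == "true"
}

// inScope sketches the consumer-side regional filter in kvstoremesh:
// an endpoint is kept if it declares no scope, a global scope, or a
// scope matching the consuming cluster's region.
func inScope(ep endpoint, localRegion string) bool {
	scope, ok := ep.labels[syncScopeLabel]
	if !ok || scope == "global" {
		return true
	}
	return scope == localRegion
}

func main() {
	web := endpoint{name: "web-0", labels: map[string]string{
		syncLabel:      "true",
		syncScopeLabel: "us-west",
	}}
	batch := endpoint{name: "batch-0", labels: map[string]string{}}

	fmt.Println(shouldSync(web), shouldSync(batch))            // opted in vs. not
	fmt.Println(inScope(web, "us-west"), inScope(web, "eu-1")) // regional scope check
}
```

The reduction claimed in the summary comes from exactly these two predicates: most endpoints never pass `shouldSync`, and of those that do, each consuming region only keeps the subset passing `inScope`.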
OK, thanks! If you are going with 2 CFPs (which seems fine to me!), please choose only one now, so that we can focus the efforts and have an easier time iterating on that first one; once that feature is complete we can continue with, and discuss, the second one.
Summary
This CFP proposes two complementary enhancements to Cilium Cluster Mesh to support scaling to 100+
clusters:
- **Workload-controlled sync with scope** — Workloads opt in to cross-cluster endpoint sync via pod labels/annotations (`clustermesh.cilium.io/sync` and `clustermesh.cilium.io/sync-scope`). Source-side filtering in the clustermesh-apiserver and consumer-side regional filtering in kvstoremesh reduce synced endpoint volume by 95–99%.
- **Ad-hoc sync on datapath miss** — A new `SIGNAL_IPCACHE_MISS` BPF signal triggers on-demand identity resolution when a packet hits an ipcache miss, delivering "slowness over failure" semantics instead of policy drops during sync gaps.
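For illustration, the opt-in could look like the following pod manifest. The two label keys are taken from this CFP; the pod name, `"true"` value, and `"regional"` scope value are hypothetical examples, since the accepted value set is part of the detailed design.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api-0
  labels:
    app: payments-api
    # Opt this workload's endpoints in to cross-cluster sync (CFP label key).
    clustermesh.cilium.io/sync: "true"
    # Hypothetical scope value: only sync to clusters in the same region.
    clustermesh.cilium.io/sync-scope: "regional"
spec:
  containers:
    - name: payments-api
      image: payments-api:latest
```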
A proof-of-concept is available at cilium/cilium#44916.
Tracking issue: cilium/cilium#44921
/cc @cilium/sig-clustermesh @cilium/sig-datapath