CFP-44921: Workload-Controlled Sync and On-Demand Resolution for Cluster Mesh at Scale #89
yuecong wants to merge 1 commit into cilium:main
Conversation
…ter Mesh at Scale
Signed-off-by: Cong Yue <cong@databricks.com>
Implements workload-controlled sync and ad-hoc sync for Cilium Cluster Mesh, as proposed in CFP: Cluster Mesh Scaling to 100+ Clusters (cilium/design-cfps#89).
Co-authored-by: Isaac
Hi 👋, thanks for the proposal — I personally find it a very exciting proposal aiming to improve Cluster Mesh scalability! 🎉
I have a few very high-level comments though:
- We don't accept new features on existing stable branches. New features should target the main branch, and we are currently in the 1.20 development cycle. It would also be nice to check whether `bootstrap_seconds` can still exceed 5 min in 1.19 in your setup, as 1.17 is quite old now (and to check what exactly drives up that bootstrap time).
- This CFP describes 3 features that are each non-trivial, so I would suggest separating it into 3 CFPs. I would also suggest focusing on 1, or at most 2, so as not to have too many ongoing efforts started at the same time, and to help focus the efforts of both you and the reviewers on a smaller number of things.
Also, at first glance, this:

> During rolling updates of remote clustermesh-apiservers, LB failovers cause all connected kvstoremesh instances to reconnect simultaneously, creating thundering-herd re-sync bursts.

seems more like a bug, or at least something that we should try to fix too. It might even already be solved in 1.19!
Thanks for the review! A few clarifications:

Regarding the version: in the POC I used 1.17 because, in our environment, all our images require Ubuntu FIPS enabled, and I found that from 1.18 onwards Ubuntu FIPS does not work for some eBPF checks. I can submit a separate PR to fix that first so that the changes can target the 1.20 development branch as well.

Also, the proposal here is not limited to 1.17 — the scalability challenges exist whenever you have 100+ clusters across regions with many pods/services in each cluster. Especially when we rely on Cluster Mesh connectivity for production workloads, connectivity across clusters becomes as important as connectivity within the same cluster.
Agree with this; let me split it into 3 CFPs. By the way, in this case, do I need separate GitHub issues for each of them?
Sure, an issue or a PR is welcome! FYI, even for fix PRs you generally need to target the main branch and then backport to stable branches (see https://docs.cilium.io/en/stable/contributing/release/backports/).
Yep! Note that at first glance I am a bit skeptical about the on-demand recovery of Cluster Mesh data, though. If there are any reliability issues with the clustermesh-apiserver, I would rather fix those than introduce something like that.
Also, about more filtering on CiliumEndpoints: I think it would be interesting to estimate/benchmark what impact we expect from having a lower number of CiliumEndpoints. As there are "many" other resources involved, it's not obvious (at least to me) what the exact impact on the agents and the clustermesh-apiserver would be. There are also various questions around the interaction with global Services/MCS-API, for instance. Those considerations are somewhat similar for both sync-filtering features that you are proposing, so it sounds probably OK to have those in the same CFP, or still do 2 CFPs but not work on both at the same time!
To clarify — the on-demand recovery isn't about working around clustermesh-apiserver bugs. The sync pipeline works correctly, but the current design relies on purely active sync (etcd watches → kvstoremesh → local etcd → agent), which is inherently eventually consistent and cannot guarantee 100% completeness at every point in time. Even within the same region, we occasionally see transient windows where sync is incomplete during frequent pod churn (rolling deployments, autoscaling). Cross-region makes this worse: higher latency, more frequent connection resets, and longer watch reconnection times all widen the gap where /32 ipcache entries may be stale or missing.

These aren't bugs — they're inherent properties of an active-push, eventually-consistent system over unreliable networks. You can optimize the pipeline, but you can never fully eliminate the window between "endpoint created" and "all remote nodes have the /32 entry." The on-demand recovery is a datapath-level safety net that turns silent policy drops (missing /32 → WORLD identity → dropped) into a brief degraded-identity window followed by correct behavior. It's complementary to active sync, not a workaround.

I'll create a dedicated CFP for this so we can discuss the design and trade-offs in depth separately.
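To make the "slowness over failure" behavior concrete, here is a minimal Go sketch of the miss-then-resolve flow. Everything in it is illustrative: the `ipcache` type, `onMiss` handler, and identity values are hypothetical stand-ins rather than Cilium's actual types; in the proposal the datapath would raise `SIGNAL_IPCACHE_MISS` from BPF, and the agent would resolve the single /32 and upsert it.

```go
package main

import (
	"fmt"
	"sync"
)

// identityWorld is a hypothetical stand-in for the fallback identity used
// when no /32 entry exists yet (Cilium's real reserved identities differ).
const identityWorld uint32 = 2

// ipcache models the per-node IP→identity map kept by the agent.
type ipcache struct {
	mu      sync.RWMutex
	entries map[string]uint32
}

func newIPCache() *ipcache { return &ipcache{entries: map[string]uint32{}} }

// lookup returns the identity for ip and whether the lookup was a miss.
// On a miss the caller gets the degraded WORLD identity immediately
// ("slowness over failure") instead of a silent policy drop.
func (c *ipcache) lookup(ip string) (uint32, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if id, ok := c.entries[ip]; ok {
		return id, false
	}
	return identityWorld, true
}

// onMiss plays the role of the agent-side handler for the proposed
// SIGNAL_IPCACHE_MISS signal: resolve the single /32 on demand (e.g. from
// the remote cluster's clustermesh-apiserver) and upsert it, so the next
// packet sees the correct identity.
func (c *ipcache) onMiss(ip string, resolve func(string) (uint32, bool)) {
	if id, ok := resolve(ip); ok {
		c.mu.Lock()
		c.entries[ip] = id
		c.mu.Unlock()
	}
}

func main() {
	c := newIPCache()
	id, miss := c.lookup("10.0.1.7") // sync gap: no /32 entry yet
	fmt.Println(id, miss)            // degraded identity, miss reported

	// On-demand resolution closes the gap; a hypothetical resolver
	// returns the endpoint's real identity.
	c.onMiss("10.0.1.7", func(string) (uint32, bool) { return 4242, true })
	id, miss = c.lookup("10.0.1.7")
	fmt.Println(id, miss)
}
```

The property mirrored here is the one argued above: a miss degrades the verdict rather than dropping traffic, and the degraded window closes as soon as on-demand resolution completes.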
Thanks for the suggestion! I agree that benchmarking the impact would strengthen the proposal. I'll add concrete measurements showing the reduction in memory, CPU, bootstrap time, and etcd watch load on both the agents and the clustermesh-apiserver when filtering is enabled. This will help quantify how much endpoint filtering alone moves the needle, given that other resources (CiliumNodes, identities, etc.) are still synced.

Good call on the global Services / MCS-API interaction — I should clarify this explicitly. The label filter only applies to CiliumEndpoint sync for pod flat-L3 connectivity. Service-backed endpoints follow their own sync path (`service.cilium.io/global`), so a pod opted out of endpoint sync can still be reachable as a service backend. I'll make this distinction clear in the CFP.

I'll combine the source-side label filtering and consumer-side regional filtering into a single CFP, since they share the same pipeline and impact analysis. That makes more sense than splitting them.
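As a rough illustration of how the two filters compose, here is a Go sketch of the source-side opt-in check and the consumer-side scope check. The label keys come from this CFP; the `"true"`/scope-value semantics and all type and function names are assumptions for illustration, not the implemented API.

```go
package main

import "fmt"

const (
	syncLabel      = "clustermesh.cilium.io/sync"       // opt-in label from the CFP
	syncScopeLabel = "clustermesh.cilium.io/sync-scope" // scope label from the CFP
)

// endpoint is a hypothetical, simplified view of a CiliumEndpoint:
// just a name and the pod labels that drive filtering.
type endpoint struct {
	name   string
	labels map[string]string
}

// shouldSync sketches the source-side filter in the clustermesh-apiserver:
// only endpoints that explicitly opt in are written out for syncing.
func shouldSync(ep endpoint) bool {
	return ep.labels[syncLabel] == "true"
}

// inScope sketches the consumer-side regional filter in kvstoremesh:
// an endpoint is kept if it declares no scope, a global scope, or a
// scope matching the consuming cluster's region.
func inScope(ep endpoint, localRegion string) bool {
	scope, ok := ep.labels[syncScopeLabel]
	if !ok || scope == "global" {
		return true
	}
	return scope == localRegion
}

func main() {
	web := endpoint{name: "web-0", labels: map[string]string{
		syncLabel:      "true",
		syncScopeLabel: "us-west",
	}}
	batch := endpoint{name: "batch-0", labels: map[string]string{}}

	fmt.Println(shouldSync(web), shouldSync(batch))            // opted in vs. not
	fmt.Println(inScope(web, "us-west"), inScope(web, "eu-1")) // regional scope check
}
```

The reduction claimed in the summary comes from exactly these two predicates: most endpoints never pass `shouldSync`, and of those that do, each consuming region only keeps the subset passing `inScope`.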
OK, thanks! If you are going with 2 CFPs (which seems fine to me!), please choose only one now, so that we can focus the efforts and have an easier time iterating on that first one; once that feature is complete we can continue with, and discuss, the second one.
Summary
This CFP proposes two complementary enhancements to Cilium Cluster Mesh to support scaling to 100+
clusters:
- **Workload-controlled sync with scope** — Workloads opt in to cross-cluster endpoint sync via pod labels/annotations (`clustermesh.cilium.io/sync` and `clustermesh.cilium.io/sync-scope`). Source-side filtering in the clustermesh-apiserver and consumer-side regional filtering in kvstoremesh reduce synced endpoint volume by 95–99%.
- **Ad-hoc sync on datapath miss** — A new `SIGNAL_IPCACHE_MISS` BPF signal triggers on-demand identity resolution when a packet hits an ipcache miss, delivering "slowness over failure" semantics instead of policy drops during sync gaps.
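For illustration, the opt-in could look like the following pod manifest. The two label keys are taken from this CFP; the pod name, `"true"` value, and `"regional"` scope value are hypothetical examples, since the accepted value set is part of the detailed design.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api-0
  labels:
    app: payments-api
    # Opt this workload's endpoints in to cross-cluster sync (CFP label key).
    clustermesh.cilium.io/sync: "true"
    # Hypothetical scope value: only sync to clusters in the same region.
    clustermesh.cilium.io/sync-scope: "regional"
spec:
  containers:
    - name: payments-api
      image: payments-api:latest
```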
A proof-of-concept is available at cilium/cilium#44916.
Tracking issue: cilium/cilium#44921
/cc @cilium/sig-clustermesh @cilium/sig-datapath