From 9857a3e48ddfe4da74f62fea94eb3ec5f3441f4e Mon Sep 17 00:00:00 2001 From: Tamal Saha Date: Wed, 1 Jul 2026 08:38:57 +0600 Subject: [PATCH 1/2] Add ClickHouse DC-DR guide Signed-off-by: Tamal Saha --- docs/guides/clickhouse/README.md | 4 + docs/guides/clickhouse/dr/_index.md | 10 + docs/guides/clickhouse/dr/guide/index.md | 351 ++++++++++++++++++++ docs/guides/clickhouse/dr/overview/index.md | 350 +++++++++++++++++++ docs/guides/clickhouse/dr/runbook/index.md | 331 ++++++++++++++++++ 5 files changed, 1046 insertions(+) create mode 100644 docs/guides/clickhouse/dr/_index.md create mode 100644 docs/guides/clickhouse/dr/guide/index.md create mode 100644 docs/guides/clickhouse/dr/overview/index.md create mode 100644 docs/guides/clickhouse/dr/runbook/index.md diff --git a/docs/guides/clickhouse/README.md b/docs/guides/clickhouse/README.md index 453a1d9788..f14b72ffd1 100644 --- a/docs/guides/clickhouse/README.md +++ b/docs/guides/clickhouse/README.md @@ -44,3 +44,7 @@ KubeDB supports the following ClickHouse Versions. - [Quickstart ClickHouse](/docs/guides/clickhouse/quickstart/guide/quickstart.md) with KubeDB Operator. - Want to hack on KubeDB? Check our [contribution guidelines](/docs/CONTRIBUTING.md). + +## Cross-DC Disaster Recovery (DC-DR) + +Do you want to run your ClickHouse database across multiple data centers and recover from a full data center failure with a single, automatically re-routing write endpoint? KubeDB runs one logical `ReplicatedMergeTree` cluster across the data centers, spreads ClickHouse Keeper 3-site so no single data center holds a Keeper majority (the split-brain guarantee), and lets the `dr-controlplane` Lease route the write endpoint to a data center that still holds Keeper quorum. Follow [here](/docs/guides/clickhouse/dr/overview/index.md). diff --git a/docs/guides/clickhouse/dr/_index.md b/docs/guides/clickhouse/dr/_index.md new file mode 100644 index 0000000000..801ae26c55 --- /dev/null +++ b/docs/guides/clickhouse/dr/_index.md @@ -0,0 +1,10 @@ +--- +title: Disaster Recovery +menu: + docs_{{ .version }}: + identifier: ch-dr-clickhouse + name: DR + parent: ch-clickhouse-guides + weight: 55 +menu_name: docs_{{ .version }} +--- diff --git a/docs/guides/clickhouse/dr/guide/index.md b/docs/guides/clickhouse/dr/guide/index.md new file mode 100644 index 0000000000..5daaf30172 --- /dev/null +++ b/docs/guides/clickhouse/dr/guide/index.md @@ -0,0 +1,351 @@ +--- +title: DC-DR User Guide +menu: + docs_{{ .version }}: + identifier: ch-dr-guide-clickhouse + name: User Guide + parent: ch-dr-clickhouse + weight: 20 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Running ClickHouse in DC-DR Mode: User Guide + +This guide covers every aspect of operating a distributed ClickHouse in cross data +center disaster recovery (DC-DR) mode: the components, the naming contract, deployment, +connecting through the single write endpoint, reading locally, monitoring, lag and RPO, +Keeper placement, switchover and failback, scaling, and day-2 operations. + +Read the [DC-DR Overview](/docs/guides/clickhouse/dr/overview/index.md) first for the +architecture, and the [DC-DR Runbook](/docs/guides/clickhouse/dr/runbook/index.md) for +scenario-by-scenario procedures. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Components and where they run + +| Component | Runs in | Responsibility | +| --- | --- | --- | +| **`dr-controlplane`** + 3-site etcd quorum | across the data centers (an OCM control plane) | Publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease holder is the DC the single write endpoint resolves to. The Lease is routing, policy, and observability, **not** the failover mechanism. | +| **`dr-controlplane` agent** | each spoke (DC) | Contends for the primary-DC Lease for its DC and projects the Lease decision into the local spoke as the `primary-dc` marker. | +| **KubeDB ClickHouse operator (hub)** | the OCM hub | Expands the `ClickHouse` CR into per-DC `ReplicatedMergeTree` replica groups, spreads the 3-site Keeper ensemble, routes the single write endpoint by following the Lease, drives planned switchover, and writes `status.disasterRecovery`. | +| **ClickHouse Keeper ensemble** | a voter in each Member DC plus a data-less voter in the Arbiter DC | The Raft service that registers parts and orders replication. **Its Raft majority is the failover authority and the split-brain guarantee.** No ZooKeeper. | +| **KubeSlice** | each spoke | Provides the cross-DC pod network so the one logical cluster spans clusters and `ReplicatedMergeTree` replicates over port 9009, coordinated by Keeper on 9181/9234. | + +## The DC-name contract + +One string identifies a data center everywhere. **Keep these identical:** + +- the OCM spoke cluster name +- the agent `--dc-name` +- the primary-DC Lease `holderIdentity` +- the marker `data.activeDC` +- the pod label `open-cluster-management.io/cluster-name` +- the `PlacementPolicy` `distributionRule.clusterName` + +## Deploying + +### PlacementPolicy + +Map the global replica indices to data centers and tag each DC with its role: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: ch-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global # one cluster-wide failover scope (or Group + a group name) + mode: TwoDC # TwoDC: 2 Member DCs + an Arbiter DC; ThreeDC: 3 Member DCs + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1] + - clusterName: dc-b + role: Member + replicaIndices: [2, 3] + - clusterName: dc-c + role: Arbiter + replicaIndices: [] +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries an + empty list. Its single data-less Keeper voter is scheduled onto the arbiter spoke and + co-located with the third etcd member (it is not ordinal-pinned; the operator reads the + Arbiter DC's `clusterName` and schedules the voter onto that spoke via OCM + ManifestWork). +- `mode: TwoDC` expects two Member DCs plus the Arbiter DC (the even layout); `ThreeDC` + expects an odd number of Member DCs and no separate Arbiter DC (each data DC then holds + its own Keeper voter, keeping Keeper quorum among the data DCs). +- Roles are `Member` and `Arbiter` only. + +### ClickHouse + +```yaml +apiVersion: kubedb.com/v1alpha2 +kind: ClickHouse +metadata: + name: ch-dcdr + namespace: demo +spec: + version: "25.7.1" + distributed: true + clusterTopology: + cluster: + - name: appscode-cluster + shards: 2 + replicas: 2 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: ch-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +### What the operator creates + +- **One logical cluster** whose shards each have a `ReplicatedMergeTree` replica in each + Member DC, all sharing one Keeper ensemble and the same `default_replica_path` macros. + Replication is the native ClickHouse link over port 9009; there is no second + replication link. +- A **3-site Keeper ensemble**: a voter in each Member DC and, in the even (`TwoDC`) + layout, one data-less Keeper voter scheduled onto the Arbiter DC. This is what lets the + engine's own quorum survive a DC loss and fence a minority. +- A single **write endpoint** (Service plus `AppBinding`) that the orchestrator points at + the active DC by following the Lease. Reads can go to any replica. + +All data-bearing pods carry the offshoot selectors plus the +`open-cluster-management.io/cluster-name` label, so the single write endpoint and the +single `AppBinding` keep working as the active DC moves. + +> The macros and `default_replica_path` are consistent across DCs so a shard's replicas +> in different DCs share the same Keeper path and replicate to each other. Do not change +> them per DC. + +## Connecting + +A DC-DR ClickHouse exposes a single write endpoint, the same shape as any KubeDB +ClickHouse: + +- the **write endpoint** `` resolves to the active DC's replicas (native port 9000); + the Lease-driven fence keeps it off a non-active DC, fail-closed; +- one **`AppBinding`** `` for applications and KubeDB integrations. + +Because ClickHouse is multi-master, every replica can technically accept writes, but the +single endpoint gives a stable single-writer posture: applications keep using `` and, +after a failover, reconnect and land on the new active DC. + +### Writes and the Keeper-quorum contract + +There is no `w:majority` knob to set: the split-brain guarantee is built into the engine. +Every insert registers a part in Keeper, which needs Keeper quorum. A partitioned +minority DC that has lost Keeper quorum simply cannot register parts, so its inserts +fail. You do not have to opt into this; it is how `ReplicatedMergeTree` plus a spread +3-site Keeper behaves. + +```sql +-- Against the write endpoint :9000: +INSERT INTO orders (item, qty) VALUES ('widget', 1); +``` + +- On a spread 3-site Keeper (topology A), an insert commits only when the writing DC + holds Keeper quorum, so a cut-off minority DC cannot commit and there is no split + brain. +- The bounded loss on an unplanned active-DC loss is only committed-but-unfetched parts + (registered in Keeper, data still on the lost DC's disk), which are recoverable when + that DC returns. + +### Read locally + +Any replica serves consistent-enough reads. Point read traffic at an in-DC replica for +low latency; reads are eventually consistent, bounded by that replica's +`absoluteDelaySeconds`. + +## Monitoring and observability + +### status.disasterRecovery + +The single CR carries the whole cross-DC view: + +```bash +$ kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +| Field | Meaning | +| --- | --- | +| `activeDC` | The DC the single write endpoint currently resolves to (a routing choice, not a promoted primary). | +| `phase` | `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. | +| `lastTransitionTime` | When `activeDC` last changed. | +| `dataCenters[].clusterName` | The data center, by its OCM managed cluster name. | +| `dataCenters[].role` | `Member` or `Arbiter`. | +| `dataCenters[].keeperVoter` | Whether the DC holds a Keeper voter. | +| `dataCenters[].keeperQuorum` | Whether the DC currently sees Keeper quorum (the safety signal). | +| `dataCenters[].writable` | True only for the active (write-routed) DC. | +| `dataCenters[].shards[]` | Per shard: `shard`, `totalReplicas`, `activeReplicas`. | +| `dataCenters[].absoluteDelaySeconds` | The DC's cross-DC replication delay behind the active DC, in seconds (from `system.replicas.absolute_delay`). | +| `dataCenters[].queueSize` | The DC's pending replication queue length (from `system.replicas.queue_size`). | +| `dataCenters[].healthy` | Whether the DC has ready replicas. | + +### Useful checks + +```bash +# Which DC the Lease intends as the write-routed active DC: +$ kubectl --kubeconfig -n dc-failover get lease primary-dc \ + -o jsonpath='{.spec.holderIdentity}' + +# Per-DC replicas and DCs: +$ kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr \ + -L open-cluster-management.io/cluster-name + +# Cluster hosts and per-replica health (from any replica): +$ kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT cluster, host_name, replica_num FROM system.clusters" + +# Replication delay and queue per replica (the lag signal): +$ kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT database, table, absolute_delay, queue_size, log_pointer, log_max_index, total_replicas, active_replicas FROM system.replicas" +``` + +## Replication, lag, and RPO + +- Cross-DC replication is **native ClickHouse `ReplicatedMergeTree`** over port 9009, + asynchronous, coordinated by the shared Keeper ensemble. There is exactly one logical + cluster, so there is no extra replication link to manage. +- The lag signals come from `system.replicas`: `absolute_delay` (seconds a replica is + behind), `queue_size`, and `log_pointer` versus `log_max_index` (how many log entries + the replica still has to fetch). These are surfaced into `status.disasterRecovery` as + `absoluteDelaySeconds` and `queueSize`. +- A **planned switchover loses near-zero committed writes**, because the orchestrator + waits until the target DC's replicas show near-zero `absolute_delay` and an empty queue + before it flips the endpoint. An **unplanned failover** may lose only + committed-but-unfetched parts (bounded by the standby lag when the active DC died); a + clean partition that put the lost DC in the Keeper minority loses zero committed data, + because that DC could not commit without quorum. + +## Keeper placement and the arbiter + +- **Keeper is spread 3-site so no single data DC holds a Keeper majority.** With two data + DCs, each holds a voter and the Arbiter DC holds one data-less voter; the majority is 2 + of 3, so either data DC plus the Arbiter DC keeps quorum, but neither data DC alone + does. This removes split brain at its root: a partitioned data DC cannot register parts + by itself. +- The requirement is an **odd Keeper voter total**, not an odd DC count: + - **Even layout** (two data DCs plus the Arbiter DC): one Keeper voter per data DC and + one data-less voter in the Arbiter DC (total 3). Do not add extra voters that would + let one data DC hold a majority. + - **Odd layout** (three or more Member DCs, no Arbiter DC): one Keeper voter per data + DC, for an odd total, keeping Keeper quorum among the data DCs. This is the layout the + DC-count rule prefers. +- The **Arbiter DC** holds the `dr-controlplane` etcd vote **and** one data-less Keeper + voter, co-located so the two quorums agree on which DCs are alive. This is the same + arbiter trick as MongoDB. + +> For write-heavy ingest, weigh the two alternatives from the overview: two clusters with +> a per-region Keeper cross-replicated (topology B) avoids the cross-DC Keeper round trip, +> and single-DC Keeper (topology C) is the lowest-latency manual-failover floor. + +## Planned switchover (near-zero RPO) + +Move the active (write-routed) DC on purpose by annotating the ClickHouse: + +```bash +$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The hub then: + +1. checks the target is a known, healthy DC within the lag budget; +2. sets `phase: FailingOver` and quiesces writes on the current active DC (routes clients + away from the endpoint); +3. waits until the target DC's replicas report `absolute_delay` near zero and an empty + replication queue (`queue_size: 0`), the catch-up gate that makes this near-zero RPO; +4. moves the Lease and the single write endpoint to `dc-b`. + +Watch `status.disasterRecovery` for `phase` returning to `Steady` with the new +`activeDC`. There is no promotion step, because every replica is already writable. + +## Failback + +Failback is native and clean. A returned DC's replicas rejoin the Keeper ensemble and +catch up through `ReplicatedMergeTree` (they fetch the missing parts). There is **no +rewind**: a partitioned DC that lacked Keeper quorum committed nothing to diverge, so +there is nothing to roll back. + +Once the returned DC is caught up (near-zero `absoluteDelaySeconds`, empty queue), steer +the active DC back with a planned switchover: + +```bash +$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-a +``` + +## Scaling and day-2 operations + +The standard `ClickHouseOpsRequest` operations (`VerticalScaling`, `HorizontalScaling`, +`VolumeExpansion`, `UpdateVersion`, `Reconfigure`, `ReconfigureTLS`, `Restart`, +`RotateAuth`, `StorageMigration`) apply to a DC-DR cluster. They act on the distributed +per-DC replica groups across the DCs and are issued exactly as for a single-cluster +ClickHouse. There is no failover ops type: failover is the engine's Keeper quorum, and +the planned switchover is the `dr.kubedb.com/switchover-to` annotation, not an ops +request. + +`HorizontalScaling` gains a per-DC form so you can scale each DC's per-shard replica +count independently: + +```yaml +apiVersion: ops.kubedb.com/v1alpha1 +kind: ClickHouseOpsRequest +metadata: + name: ch-dcdr-hscale + namespace: demo +spec: + type: HorizontalScaling + databaseRef: + name: ch-dcdr + horizontalScaling: + dataCenters: + - clusterName: dc-a + replicas: 3 + - clusterName: dc-b + replicas: 2 +``` + +> **Note:** the distributed ClickHouse substrate and the DC-DR layer are net-new for +> ClickHouse. Treat the field names and flows in this guide as the intended user +> experience; confirm availability in your release before relying on them in production. + +## Deletion and cleanup + +```bash +$ kubectl delete clickhouse -n demo ch-dcdr +``` + +Per `deletionPolicy`, the operator removes the per-DC replica groups, the arbiter-DC +Keeper voter, and the cluster-scoped per-DC `PlacementPolicies` it generated (these carry +no owner reference, so the operator deletes them explicitly). The user-provided base +`PlacementPolicy` is left for you to delete. + +## Limitations + +- **Adding or removing a whole data center** is a topology change (a replica-group and + Keeper-ensemble change), performed by editing the `PlacementPolicy` topology, not by a + scaling request. +- Cross-DC `ReplicatedMergeTree` replication is asynchronous; an unplanned failover has a + non-zero RPO bounded by the standby lag (only committed-but-unfetched parts). A clean + partition loses zero committed data because the minority DC cannot commit without Keeper + quorum; use a planned switchover for a near-zero-RPO move. +- On a spread 3-site Keeper (topology A), every insert pays a cross-DC Keeper round trip. + For write-heavy ingest, consider the two-cluster per-region-Keeper topology (B) or the + single-DC Keeper topology (C) described in the overview. diff --git a/docs/guides/clickhouse/dr/overview/index.md b/docs/guides/clickhouse/dr/overview/index.md new file mode 100644 index 0000000000..19fe392e9b --- /dev/null +++ b/docs/guides/clickhouse/dr/overview/index.md @@ -0,0 +1,350 @@ +--- +title: DC-DR Overview +menu: + docs_{{ .version }}: + identifier: ch-dr-overview-clickhouse + name: Overview + parent: ch-dr-clickhouse + weight: 10 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Cross Data Center Disaster Recovery (DC-DR) for ClickHouse + +KubeDB can run a single distributed `ClickHouse` across multiple data centers (DCs) so +the database survives the loss of an entire data center. Every replica is writable +(ClickHouse `ReplicatedMergeTree` is multi-master), so DR is not about promoting a new +primary. It is about two things: the ClickHouse Keeper Raft quorum, which decides which +DCs can commit at all, and a single Lease-routed write endpoint, which records and +steers where clients send writes. When a data center is lost, the surviving DCs that +still hold Keeper quorum keep accepting writes, and the write endpoint follows to a DC +that holds quorum. + +This page is the conceptual overview and a quick start. See also: + +- [DC-DR User Guide](/docs/guides/clickhouse/dr/guide/index.md) for every aspect of + running in DC-DR mode (components, status, connecting, monitoring, switchover, + failback, day-2 ops). +- [DC-DR Runbook](/docs/guides/clickhouse/dr/runbook/index.md) for what to do in each + operational scenario. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Why ClickHouse DC-DR is different + +Most KubeDB engines (Postgres, MariaDB, MSSQL) keep their consensus quorum **inside** +a single DC, because a raft or cluster manager flaps or stalls when its quorum spans +data centers. Those engines run one independent group per DC and build a separate +cross-DC replication link, and DR means promoting a standby. + +**ClickHouse is the exception, the same way MongoDB is.** `ReplicatedMergeTree` is +multi-master and geo-aware by design: replicas of a table in different DCs replicate +asynchronously through a shared ClickHouse Keeper ensemble. So for ClickHouse: + +- **One logical cluster spans the DCs.** The same shards, with a `ReplicatedMergeTree` + replica of each shard in each DC, all share one Keeper ensemble and the same + `default_replica_path` macros. Replication is the native ClickHouse replication link + over port 9009. There is **no second replication link to build** and no remote + replica to manage. +- **Failover is the engine's own quorum, not a promotion.** ClickHouse Keeper is a Raft + service. The DCs that hold the Keeper majority keep registering parts and serving + writes. A partitioned minority DC loses Keeper quorum, cannot register parts, and so + its inserts fail. That quorum, not any promotion step, is the failover and the + split-brain guarantee. KubeDB never promotes a replica, because every replica is + already writable. +- **Failback is native and clean.** A returned DC's replicas rejoin the Keeper ensemble + and catch up through `ReplicatedMergeTree` (they fetch the missing parts). There is + **no rewind**: a partitioned minority DC that lacked Keeper quorum committed nothing + to diverge, so there is nothing to roll back. This is cleaner than the Postgres + `pg_rewind` path and even than MongoDB rollback. + +## How it works + +DC-DR for ClickHouse rests on four rules. + +- **ClickHouse Keeper is spread 3-site and is the failover authority.** With two data + DCs the layout is a Keeper voter in `dc-a`, a Keeper voter in `dc-b`, and one + data-less **Keeper voter** in a third arbiter DC. That arbiter-DC voter is co-located + with the `dr-controlplane` etcd member, so the Keeper Raft quorum and the Lease + quorum share the same 3-site topology (exactly MongoDB's arbiter trick). A single DC + loss then leaves a surviving Keeper majority, so the survivors keep committing with no + manual step. +- **Keeper quorum is the writable contract and the split-brain guarantee.** Because the + safety comes from Keeper's Raft majority, a partitioned minority DC cannot register + parts and so cannot commit any insert. A cut-off DC goes non-writable on its own, at + the engine level, with no operator action. This is the same shape as MongoDB majority + plus `w:majority`, and it is the hard guarantee. The Lease-driven endpoint fence is + only a routing layer on top of it. +- **The Lease routes the single write endpoint; it does not promote anything.** A small + control plane (`dr-controlplane`), backed by a three-site etcd quorum, publishes one + `coordination.k8s.io` **Lease** per failover scope. The Lease records which DC the + single write endpoint resolves to and steers clients there, giving a stable + single-writer posture and one consistent cross-engine status. Because ClickHouse is + multi-master, this is a write-routing choice, not an engine-enforced primary. On an + unplanned active-DC loss the orchestrator moves the Lease and the endpoint to a + surviving DC that still holds Keeper quorum. The Lease is routing, policy, and + observability, **not** the failover mechanism (Keeper quorum is). +- **Reads can stay local.** Any replica serves consistent-enough reads, so read traffic + can stay in-DC for low latency while writes route to the active DC through the single + endpoint. + +> **Why not confine Keeper to the active DC?** You can (see the topologies below), and +> it gives the lowest write latency. But then losing the active DC also loses its Keeper +> quorum, and the surviving replicas have no quorum to commit against until you bring up +> a new Keeper and re-point them, an explicit manual recovery with RTO impact. Spreading +> Keeper 3-site removes that manual step at the cost of a cross-DC Keeper round trip on +> every insert. + +### Data center roles + +Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`: + +| Role | Holds ClickHouse data | Holds a Keeper voter | Purpose | +| --- | --- | --- | --- | +| **Member** | yes | yes | A data-bearing DC holding a `ReplicatedMergeTree` replica of every shard and a Keeper voter; a candidate for the active (write-routed) DC. | +| **Arbiter** | no | yes (data-less) | The arbiter DC. Holds the `dr-controlplane` etcd vote **and** one data-less ClickHouse Keeper voter. Supplies the tie-break vote for both quorums. No ClickHouse data. | + +> Unlike MariaDB and MSSQL, whose arbiter DC holds no engine member, the ClickHouse +> arbiter DC holds **both** the `dr-controlplane` etcd member and a data-less Keeper +> voter, co-located so the coordination quorum and the Keeper quorum agree on which DCs +> are alive. This is the same pattern as MongoDB's voting arbiter. + +## The Keeper placement decision (the one real tradeoff) + +ClickHouse Keeper Raft is latency-sensitive: every insert registers a part, which is a +Keeper write that needs a quorum round trip. ClickHouse is a write-heavy ingest engine, +so where you place Keeper is the one real tradeoff. There are three placements, and the +right one depends on your write-latency tolerance. **3-site spread Keeper is the +documented automatic-DR path here, but it is not automatically the best choice for every +workload.** + +### A. Spread Keeper 3-site, single cluster (automatic DR, higher write latency) + +One logical ClickHouse cluster with Keeper voters in `dc-a`, `dc-b`, and a data-less +voter in the arbiter DC (co-located with the `dr-controlplane` etcd member). A DC loss +leaves a surviving Keeper majority, so writes continue in the survivor with **no manual +step**. The cost is a cross-DC Keeper round trip on every insert. Batched inserts +amortize it well; high-frequency small inserts may find it prohibitive. This is the path +documented in detail here, because it is the closest analog to the MongoDB design and +gives hands-off failover. + +### B. Two clusters, per-region Keeper, cross-replicated (often the better fit) + +Each DC runs its own cluster with its own in-DC Keeper (low local write latency); the +two cross-replicate the same `ReplicatedMergeTree` tables and a `Distributed` table +fronts both. Each region writes locally, and on a region loss the other already holds a +full replica. This matches ClickHouse's write-local nature and avoids the cross-DC +Keeper tax, at the cost of a more complex topology and looser cross-region ordering. +This is Altinity's "multi-region writes" pattern and is **frequently the better +production choice for write-heavy ingest**. Failover is still a routing change and the +Lease still picks the write-routed DC. + +### C. Single-DC Keeper (lowest latency, manual failover) + +Keeper lives only in the active DC; standby replicas use it cross-DC. Inserts are lowest +latency, but losing the active DC (and its Keeper) leaves the standby with no quorum, so +a new Keeper must be brought up and the replicas re-pointed (an explicit recovery with +RTO impact). This is the simplest Altinity default and is acceptable when low write +latency outweighs automatic DR. + +### At a glance + +| Topology | Keeper | Write latency | Failover on active-DC loss | +| --- | --- | --- | --- | +| A. 3-site spread (documented here) | voters in dc-a, dc-b, arbiter DC | higher (cross-DC round trip per insert) | automatic, survivor keeps quorum | +| B. Two clusters, per-region Keeper | one Keeper per region | low (local) | routing change, other region already replicated | +| C. Single-DC Keeper | only in the active DC | lowest | manual: rebuild Keeper, re-point replicas | + +The rest of these docs describe topology **A**, the 3-site single-cluster spread. Its +safety claim (a minority DC cannot register parts) is specific to it. Pick per workload: +for write-heavy ClickHouse, topology B is frequently better, and topology C is the +low-latency manual-failover floor. + +## The single-CR, single-endpoint model + +The user creates **one** distributed `ClickHouse` object (with `spec.distributed: true` +and a `PlacementPolicy` carrying `distributionRules` and a `failoverPolicy`) and gets +**one** `AppBinding` and **one** write endpoint. The operator expands the CR into per-DC +`ReplicatedMergeTree` replicas of every shard, a 3-site Keeper ensemble (arbiter-DC +voter included), and the Lease-routed write endpoint. + +The single CR's `status.disasterRecovery` carries the whole cross-DC view: the active +(write-routed) DC, each DC's per-shard replica health, whether the DC holds Keeper +quorum, the cross-DC replication delay, and the DR phase. + +## Prerequisites + +- A distributed ClickHouse substrate: Open Cluster Management (OCM) hub and spoke + clusters, KubeSlice connecting the spokes so replicas reach each other, and a storage + class on each data-bearing spoke. The ClickHouse and Keeper ports (native 9000, + replication 9009, Keeper client 9181, Keeper Raft 9234) must be reachable across the + DCs. +- The `dr-controlplane` service and its three-site etcd quorum installed across the data + centers, with a `dr-controlplane` agent running in each spoke (DC). The third etcd + member sits in the arbiter DC alongside the data-less Keeper voter. +- The KubeDB ClickHouse operator started with the DC-DR flags (coordination kubeconfig + and the operator's local DC name). +- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster + name, the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the + pod label `open-cluster-management.io/cluster-name`, and the `PlacementPolicy` + `distributionRule.clusterName`. Keep them identical. + +## Deploy a DC-DR ClickHouse + +### 1. PlacementPolicy + +Assign global replica indices to data centers and tag each DC with its role. Here two +Member DCs (`dc-a`, `dc-b`) each hold a replica of every shard plus a Keeper voter, and +`dc-c` is the arbiter DC holding only a data-less Keeper voter and the `dr-controlplane` +etcd member: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: ch-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global + mode: TwoDC + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1] # dc-a: a replica of each shard + a Keeper voter + - clusterName: dc-b + role: Member + replicaIndices: [2, 3] # dc-b: a replica of each shard + a Keeper voter + - clusterName: dc-c + role: Arbiter + replicaIndices: [] # arbiter DC: dr-controlplane etcd + a data-less Keeper voter +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries + an empty list (its single data-less Keeper voter is not ordinal-pinned, it is + scheduled onto the arbiter spoke by the operator). +- `failoverPolicy.trigger.scope: Global` makes this one cluster-wide failover scope. + +### 2. ClickHouse + +Reference the `PlacementPolicy` and opt the ClickHouse into DC-DR expansion: + +```yaml +apiVersion: kubedb.com/v1alpha2 +kind: ClickHouse +metadata: + name: ch-dcdr + namespace: demo +spec: + version: "25.7.1" + distributed: true + clusterTopology: + cluster: + - name: appscode-cluster + shards: 2 + replicas: 2 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: ch-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +The operator expands this into per-DC `ReplicatedMergeTree` replicas of every shard, +pinned to `dc-a` and `dc-b`, plus a data-less Keeper voter in `dc-c`, and routes the +single write endpoint to the active DC by following the Lease. + +## Observe the DC-DR state + +The single `ClickHouse` object's `status.disasterRecovery` carries the whole cross-DC +view: + +```bash +$ kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +```json +{ + "activeDC": "dc-a", + "phase": "Steady", + "lastTransitionTime": "2026-06-30T10:00:00Z", + "dataCenters": [ + { + "clusterName": "dc-a", "role": "Member", "keeperVoter": true, "keeperQuorum": true, + "writable": true, "healthy": true, "absoluteDelaySeconds": 0, "queueSize": 0, + "shards": [ + { "shard": 0, "totalReplicas": 2, "activeReplicas": 2 }, + { "shard": 1, "totalReplicas": 2, "activeReplicas": 2 } + ] + }, + { + "clusterName": "dc-b", "role": "Member", "keeperVoter": true, "keeperQuorum": true, + "writable": false, "healthy": true, "absoluteDelaySeconds": 2, "queueSize": 5, + "shards": [ + { "shard": 0, "totalReplicas": 2, "activeReplicas": 2 }, + { "shard": 1, "totalReplicas": 2, "activeReplicas": 2 } + ] + }, + { + "clusterName": "dc-c", "role": "Arbiter", "keeperVoter": true, "keeperQuorum": true, + "writable": false, "healthy": true + } + ] +} +``` + +- `activeDC` is the DC the write endpoint currently resolves to (a routing choice, not + a promoted primary). +- `phase` is `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. +- Each `dataCenters` entry reports the DC role, whether it holds a Keeper voter and + whether it currently has Keeper quorum, whether it is the write-routed DC, its + per-shard replica health, its cross-DC `absoluteDelaySeconds` and `queueSize`, and its + health. + +## Unplanned failover + +When the active DC is lost, the surviving DCs that still hold Keeper quorum (a standby +data DC plus the arbiter DC in the even layout) **keep accepting writes on their own**, +because Keeper quorum survives and every replica is already writable. There is no +promotion. The orchestrator observes the Lease move to a surviving DC and points the +single write endpoint there. `status.disasterRecovery.phase` moves to `FailingOver` and +back to `Steady`. Bounded loss is only committed-but-unfetched parts on the lost DC's +disk (a clean partition that put the lost DC in the Keeper minority loses zero committed +data, because it could not commit without quorum). + +## Planned switchover (near-zero RPO) + +To move the active (write-routed) DC on purpose without losing committed writes, +annotate the ClickHouse with the target DC: + +```bash +$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The orchestrator quiesces writes on the current active DC (routes clients away), waits +until the target DC's replicas show `absoluteDelaySeconds` near zero and an empty +replication queue (`queueSize: 0`), then moves the Lease and the write endpoint to +`dc-b`. Because it waits for the target to catch up before flipping, near-zero committed +writes are lost. + +## Cleanup + +```bash +$ kubectl delete clickhouse -n demo ch-dcdr +$ kubectl delete placementpolicy ch-dcdr +``` + +Deleting the `ClickHouse` removes the per-DC replica groups, the arbiter-DC Keeper +voter, and the generated per-DC `PlacementPolicies`. The user-provided base +`PlacementPolicy` is left for you to delete. diff --git a/docs/guides/clickhouse/dr/runbook/index.md b/docs/guides/clickhouse/dr/runbook/index.md new file mode 100644 index 0000000000..8a9ab70ed8 --- /dev/null +++ b/docs/guides/clickhouse/dr/runbook/index.md @@ -0,0 +1,331 @@ +--- +title: DC-DR Runbook +menu: + docs_{{ .version }}: + identifier: ch-dr-runbook-clickhouse + name: Runbook + parent: ch-dr-clickhouse + weight: 30 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# ClickHouse DC-DR Runbook + +Scenario-by-scenario procedures for operating a ClickHouse cluster in cross data center +disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB and +ClickHouse do **automatically**, how to **verify**, and the **action** to take. + +Read the [User Guide](/docs/guides/clickhouse/dr/guide/index.md) for the concepts and +commands referenced here. Throughout, `` is the coordination control plane +kubeconfig, and `ch-dcdr`/`demo` are the example database and namespace. The example pod +`ch-dcdr-appscode-cluster-shard-0-0` is the first replica of shard 0. + +## Quick reference + +```bash +# Active DC, phase, and per-DC view: +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq + +# Lease holder (the DC the coordination plane routes writes to): +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' + +# Per-DC replicas and DCs: +kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name + +# Replication delay, queue, and log pointers (from any replica): +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT database, table, absolute_delay, queue_size, log_pointer, log_max_index, total_replicas, active_replicas FROM system.replicas" +``` + +Golden rules: + +- **ClickHouse Keeper quorum decides who can commit.** Never try to force a write into a + DC that has lost Keeper quorum; it cannot register parts and its inserts fail by design. + That is the split-brain guarantee, not a bug. +- **There is no promotion.** Every replica is writable. DR is a routing change (the write + endpoint follows the Lease to a DC that holds Keeper quorum), not an election. +- **Exactly one DC is `writable: true`** in `status.disasterRecovery` at any instant (the + write-routed active DC), even though the engine would let more than one accept writes. +- **The Lease is routing, not safety.** If the Lease is stale, ClickHouse keeps running on + its own Keeper quorum; what you lose is switchover and endpoint steering. + +--- + +## 1. Active DC lost (zone/cluster failure) + +**Symptoms:** the active DC's replicas are gone/unreachable; writes to the endpoint fail +briefly until it re-points. + +**Automatic:** the surviving DCs that still hold Keeper quorum (a standby data DC plus the +Arbiter DC in the even layout, or the surviving data majority in the odd layout) **keep +accepting writes on their own**, because Keeper quorum survives and every replica is +already writable. There is no promotion. The orchestrator observes the Lease move to a +surviving DC and points the single write endpoint there. `phase` moves `FailingOver` to +`Steady`. + +**Verify:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # the survivor +kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name +``` + +**Action:** none required for availability. Note the RPO: only committed-but-unfetched +parts on the lost DC's disk are at risk (a clean partition that put the lost DC in the +Keeper minority loses zero committed data). When the failed DC returns, see scenario 8 +(re-add a DC). + +--- + +## 2. Network partition between data centers + +**Symptoms:** DCs are up but cannot reach each other. + +**Automatic:** the side that keeps Keeper quorum keeps registering parts and stays +writable. The minority side **loses Keeper quorum and cannot register parts, so its +inserts fail on their own**, at the engine level. There is no split brain and the fence +needs no action: a minority DC simply cannot commit. The write endpoint stays on (or moves +to) the majority side. + +**Verify there is exactly one writable DC and check who holds Keeper quorum:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}=writable:{.writable},quorum:{.keeperQuorum} {end}' +``` + +**Action:** heal the network. The minority side rejoins the Keeper ensemble and catches up +through `ReplicatedMergeTree` automatically. There is no rewind, because the minority +committed nothing. + +--- + +## 3. Planned switchover (maintenance on the active DC) + +**Action:** + +```bash +kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +**Automatic:** the hub gates on the target's health and lag, quiesces writes on the current +active DC (routes clients away), waits until the target's replicas show `absolute_delay` +near zero and an empty queue (`queue_size: 0`), then moves the Lease and the write endpoint +to `dc-b`. Because it waits for the target to catch up before flipping, near-zero committed +writes are lost. There is no promotion step. + +**Verify:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # dc-b +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # Steady +``` + +**If it does not complete:** see scenario 6 (switchover stuck). + +--- + +## 4. Planned failback to the original DC + +After the original DC is healthy and caught up (failback is native: its replicas rejoin the +Keeper ensemble and fetch missing parts, with no rewind), steer the active DC back: + +```bash +kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-a +``` + +Same near-zero-RPO flow as scenario 3. There is no `pg_rewind` step and no rollback; a DC +that lacked Keeper quorum committed nothing to diverge. + +--- + +## 5. Arbiter DC lost (even layout) + +**Symptoms:** the Arbiter DC is gone; its etcd member and the data-less Keeper voter are +unreachable. + +**Impact:** none on writes. The two data DCs together still hold 2 of the 3 Keeper voters, +a majority, so Keeper quorum holds and writes continue. You lose the tie-break voter, so a +subsequent **second** failure (a data DC) can no longer keep Keeper quorum automatically. + +**Verify the cluster is still writable and holds quorum:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} quorum:{.keeperQuorum} writable:{.writable}{"\n"}{end}' +``` + +**Action:** restore the Arbiter DC (the etcd member and the data-less Keeper voter) to +regain single-fault tolerance. + +--- + +## 6. Planned switchover stuck (target not catching up) + +**Symptoms:** after annotating `switchover-to`, `phase` stays `FailingOver` and the endpoint +does not move. + +**Diagnose:** + +```bash +# Target lag and health from status: +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} delay={.absoluteDelaySeconds} queue={.queueSize} healthy={.healthy}{"\n"}{end}' +# Replication state from a target replica: +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT table, absolute_delay, queue_size, log_pointer, log_max_index FROM system.replicas" +``` + +**Causes & action:** + +- **Target lag not converging** the switchover waits for the target's `absolute_delay` near + zero and an empty queue before flipping. Relieve the cross-DC bottleneck (network, insert + load, Keeper round-trip latency) so the target drains its replication queue. +- **Target unhealthy** ensure the target DC has ready replicas of every shard. +- **Abort** remove the annotation to cancel: + `kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to-`. + +--- + +## 7. A standby DC is lost + +**Symptoms:** a non-active DC's replicas are gone; that DC shows `healthy: false`. + +**Impact:** none on writes as long as the remaining DCs keep Keeper quorum (the active DC +plus the Arbiter DC in the even layout). You lose that DC's redundancy and its local read +capacity until it returns. + +**Verify the active DC still holds quorum and is writable:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName} quorum:{.keeperQuorum}{end}' +``` + +**Action:** recover the DC's replicas; they reschedule and catch up through +`ReplicatedMergeTree` over Keeper automatically. + +--- + +## 8. Re-add / recover a previously lost data center + +After a DC returns from a failure: + +**Automatic:** its replicas rejoin the Keeper ensemble and catch up over native +`ReplicatedMergeTree` (they fetch the missing parts). There is no rewind and no rollback; a +DC that lacked Keeper quorum committed nothing to diverge. + +**Verify:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} healthy={.healthy} delay={.absoluteDelaySeconds} queue={.queueSize}{"\n"}{end}' +``` + +**Action:** to make it the active DC again, perform a planned failback (scenario 4) once its +`absoluteDelaySeconds` is near zero and its queue is empty. + +--- + +## 9. Keeper quorum lost across the ensemble (double failure) + +**Symptoms:** more than one Keeper voter is unreachable at once, so the ensemble has no Raft +majority; inserts fail cluster-wide. + +**Impact:** with the ensemble below majority, **no DC can register parts**, so writes stop +everywhere. This is the engine protecting against split brain, not a KubeDB fault. + +**Verify:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} keeperVoter:{.keeperVoter} keeperQuorum:{.keeperQuorum}{"\n"}{end}' +``` + +**Action:** restore enough Keeper voters to regain a majority (bring back a failed Member +DC's voter or the Arbiter DC's voter). Writes resume automatically once the ensemble has +quorum again. Do not try to force a single surviving voter to act alone. + +--- + +## 10. A DC is unexpectedly read-only / rejecting writes + +**Symptoms:** a DC you expect to serve the endpoint is rejecting inserts. + +**Diagnose:** + +```bash +# Does this DC hold Keeper quorum, and is it the write-routed DC? +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} writable:{.writable} quorum:{.keeperQuorum}{"\n"}{end}' +# What DC does the Lease route to? +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' +# Replication and Keeper reachability from a replica: +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT table, is_readonly, is_session_expired, absolute_delay FROM system.replicas" +``` + +**Causes & action:** + +- **Not the write-routed DC** the endpoint intentionally routes writes elsewhere; this DC + is a standby (correct). Point writes at the endpoint ``, not at this DC directly. +- **Lost Keeper quorum** `keeperQuorum:false` or `is_readonly:1` means this DC cannot reach + a Keeper majority, so it is read-only by design. See scenario 2 or 9. +- **Lease routes here but writes still fail** check Keeper reachability and the fence + marker; the endpoint fails closed if the marker is stale. + +Never try to bypass the fence or write directly to a DC that lacks Keeper quorum; it cannot +commit and you risk confusing clients. + +--- + +## 11. Coordination plane (dr-controlplane / etcd) unavailable + +**Symptoms:** the Lease cannot be read/renewed across the spokes. + +**Automatic:** ClickHouse keeps running on its own Keeper quorum, so the cluster stays +writable in whichever DC the endpoint last resolved to; the Lease is routing, not the +failover mechanism, so its loss does not by itself stop writes. What you lose is **endpoint +steering and planned switchover**: the orchestrator cannot move the active DC until the +Lease quorum returns. + +**Verify:** + +```bash +kubectl --kubeconfig -n dc-failover get lease primary-dc # error / stale +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT 1" # DB still serving +``` + +**Action:** restore the `dr-controlplane` etcd quorum (it shares the Arbiter DC with the +data-less Keeper voter). Once the Lease is renewable, endpoint steering and switchover +resume. + +--- + +## 12. Suspected split-brain (two DCs taking committed writes) + +This should be impossible with a spread 3-site Keeper: no single data DC holds a Keeper +majority, and a minority DC cannot register parts, so it cannot commit. If +`status.disasterRecovery` ever shows two `writable: true` DCs: + +**Diagnose immediately:** + +```bash +kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} writable:{.writable} quorum:{.keeperQuorum}{"\n"}{end}' +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \ + --query "SELECT table, is_readonly, total_replicas, active_replicas FROM system.replicas" +``` + +**Action:** confirm the Keeper ensemble still spreads voters 3-site (no single data DC was +given a Keeper majority by a bad topology change). Because a minority DC cannot register +parts, it cannot diverge committed data even if the routing briefly shows two writable DCs. +Restore connectivity, let the endpoint settle on one active DC, and correct the Keeper +placement if it drifted. + +--- + +## Escalation checklist + +When unsure, collect: + +```bash +kubectl get clickhouse -n demo ch-dcdr -o yaml +kubectl --kubeconfig -n dc-failover get lease -o yaml +kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name -o wide +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT * FROM system.replicas FORMAT Vertical" +kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT * FROM system.clusters FORMAT Vertical" +``` From 1bcd619ee74692b39f0663b8521f0d76ff0affa4 Mon Sep 17 00:00:00 2001 From: Tamal Saha Date: Wed, 1 Jul 2026 10:31:43 +0600 Subject: [PATCH 2/2] Document WAN-efficient cross-DC part fetch for ClickHouse DC-DR Add the fifth How-it-works rule: ReplicatedMergeTree fetches are not DC-aware, so with two or more in-DC replicas of a shard the operator designates one in-DC replica per shard as the cross-DC fetch source and the others fetch intra-DC, holding cross-DC part traffic to one copy per shard per DC. Signed-off-by: Tamal Saha --- docs/guides/clickhouse/dr/overview/index.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/docs/guides/clickhouse/dr/overview/index.md b/docs/guides/clickhouse/dr/overview/index.md index 19fe392e9b..6a2c93d95d 100644 --- a/docs/guides/clickhouse/dr/overview/index.md +++ b/docs/guides/clickhouse/dr/overview/index.md @@ -61,7 +61,7 @@ asynchronously through a shared ClickHouse Keeper ensemble. So for ClickHouse: ## How it works -DC-DR for ClickHouse rests on four rules. +DC-DR for ClickHouse rests on five rules. - **ClickHouse Keeper is spread 3-site and is the failover authority.** With two data DCs the layout is a Keeper voter in `dc-a`, a Keeper voter in `dc-b`, and one @@ -88,6 +88,16 @@ DC-DR for ClickHouse rests on four rules. - **Reads can stay local.** Any replica serves consistent-enough reads, so read traffic can stay in-DC for low latency while writes route to the active DC through the single endpoint. +- **One cross-DC part copy per shard, then fetch intra-DC.** `ReplicatedMergeTree` + fetches are not DC-aware: a replica pulls a new part from whichever replica advertises + it in Keeper, which can be the cross-DC one. With a single replica of a shard per DC + (the minimal DR shape) that is already one part copy per DC. But when a DC runs two or + more replicas of the same shard for intra-DC HA, each can independently pull the same + part across the WAN. ClickHouse has no native same-DC fetch affinity, so the operator + designates one in-DC replica per shard as the cross-DC fetch source and points the + others at it, so they fetch that part intra-DC. This holds cross-DC part traffic to one + copy per shard per DC, the `ReplicatedMergeTree` analog of the Postgres standby-DC + intra-DC cascade. > **Why not confine Keeper to the active DC?** You can (see the topologies below), and > it gives the lowest write latency. But then losing the active DC also loses its Keeper