From 9857a3e48ddfe4da74f62fea94eb3ec5f3441f4e Mon Sep 17 00:00:00 2001
From: Tamal Saha <tamal@appscode.com>
Date: Wed, 1 Jul 2026 08:38:57 +0600
Subject: [PATCH 1/2] Add ClickHouse DC-DR guide

Signed-off-by: Tamal Saha <tamal@appscode.com>
---
 docs/guides/clickhouse/README.md            |   4 +
 docs/guides/clickhouse/dr/_index.md         |  10 +
 docs/guides/clickhouse/dr/guide/index.md    | 351 ++++++++++++++++++++
 docs/guides/clickhouse/dr/overview/index.md | 350 +++++++++++++++++++
 docs/guides/clickhouse/dr/runbook/index.md  | 331 ++++++++++++++++++
 5 files changed, 1046 insertions(+)
 create mode 100644 docs/guides/clickhouse/dr/_index.md
 create mode 100644 docs/guides/clickhouse/dr/guide/index.md
 create mode 100644 docs/guides/clickhouse/dr/overview/index.md
 create mode 100644 docs/guides/clickhouse/dr/runbook/index.md

diff --git a/docs/guides/clickhouse/README.md b/docs/guides/clickhouse/README.md
index 453a1d9788..f14b72ffd1 100644
--- a/docs/guides/clickhouse/README.md
+++ b/docs/guides/clickhouse/README.md
@@ -44,3 +44,7 @@ KubeDB supports the following ClickHouse Versions.
 
 - [Quickstart ClickHouse](/docs/guides/clickhouse/quickstart/guide/quickstart.md) with KubeDB Operator.
 - Want to hack on KubeDB? Check our [contribution guidelines](/docs/CONTRIBUTING.md).
+
+## Cross-DC Disaster Recovery (DC-DR)
+
+Do you want to run your ClickHouse database across multiple data centers and recover from a full data center failure with a single, automatically re-routing write endpoint? KubeDB runs one logical `ReplicatedMergeTree` cluster across the data centers, spreads ClickHouse Keeper 3-site so no single data center holds a Keeper majority (the split-brain guarantee), and lets the `dr-controlplane` Lease route the write endpoint to a data center that still holds Keeper quorum. Follow [here](/docs/guides/clickhouse/dr/overview/index.md).
diff --git a/docs/guides/clickhouse/dr/_index.md b/docs/guides/clickhouse/dr/_index.md
new file mode 100644
index 0000000000..801ae26c55
--- /dev/null
+++ b/docs/guides/clickhouse/dr/_index.md
@@ -0,0 +1,10 @@
+---
+title: Disaster Recovery
+menu:
+  docs_{{ .version }}:
+    identifier: ch-dr-clickhouse
+    name: DR
+    parent: ch-clickhouse-guides
+    weight: 55
+menu_name: docs_{{ .version }}
+---
diff --git a/docs/guides/clickhouse/dr/guide/index.md b/docs/guides/clickhouse/dr/guide/index.md
new file mode 100644
index 0000000000..5daaf30172
--- /dev/null
+++ b/docs/guides/clickhouse/dr/guide/index.md
@@ -0,0 +1,351 @@
+---
+title: DC-DR User Guide
+menu:
+  docs_{{ .version }}:
+    identifier: ch-dr-guide-clickhouse
+    name: User Guide
+    parent: ch-dr-clickhouse
+    weight: 20
+menu_name: docs_{{ .version }}
+section_menu_id: guides
+---
+
+# Running ClickHouse in DC-DR Mode: User Guide
+
+This guide covers every aspect of operating a distributed ClickHouse in cross data
+center disaster recovery (DC-DR) mode: the components, the naming contract, deployment,
+connecting through the single write endpoint, reading locally, monitoring, lag and RPO,
+Keeper placement, switchover and failback, scaling, and day-2 operations.
+
+Read the [DC-DR Overview](/docs/guides/clickhouse/dr/overview/index.md) first for the
+architecture, and the [DC-DR Runbook](/docs/guides/clickhouse/dr/runbook/index.md) for
+scenario-by-scenario procedures.
+
+> **New to KubeDB?** Please start [here](/docs/README.md).
+
+## Components and where they run
+
+| Component | Runs in | Responsibility |
+| --- | --- | --- |
+| **`dr-controlplane`** + 3-site etcd quorum | across the data centers (an OCM control plane) | Publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease holder is the DC the single write endpoint resolves to. The Lease is routing, policy, and observability, **not** the failover mechanism. |
+| **`dr-controlplane` agent** | each spoke (DC) | Contends for the primary-DC Lease for its DC and projects the Lease decision into the local spoke as the `primary-dc` marker. |
+| **KubeDB ClickHouse operator (hub)** | the OCM hub | Expands the `ClickHouse` CR into per-DC `ReplicatedMergeTree` replica groups, spreads the 3-site Keeper ensemble, routes the single write endpoint by following the Lease, drives planned switchover, and writes `status.disasterRecovery`. |
+| **ClickHouse Keeper ensemble** | a voter in each Member DC plus a data-less voter in the Arbiter DC | The Raft service that registers parts and orders replication. **Its Raft majority is the failover authority and the split-brain guarantee.** No ZooKeeper. |
+| **KubeSlice** | each spoke | Provides the cross-DC pod network so the one logical cluster spans clusters and `ReplicatedMergeTree` replicates over port 9009, coordinated by Keeper on 9181/9234. |
+
+## The DC-name contract
+
+One string identifies a data center everywhere. **Keep these identical:**
+
+- the OCM spoke cluster name
+- the agent `--dc-name`
+- the primary-DC Lease `holderIdentity`
+- the marker `data.activeDC`
+- the pod label `open-cluster-management.io/cluster-name`
+- the `PlacementPolicy` `distributionRule.clusterName`
+
+## Deploying
+
+### PlacementPolicy
+
+Map the global replica indices to data centers and tag each DC with its role:
+
+```yaml
+apiVersion: apps.k8s.appscode.com/v1
+kind: PlacementPolicy
+metadata:
+  name: ch-dcdr
+spec:
+  clusterSpreadConstraint:
+    slice:
+      projectNamespace: kubeslice-demo
+      sliceName: demo-slice
+    failoverPolicy:
+      trigger:
+        scope: Global       # one cluster-wide failover scope (or Group + a group name)
+      mode: TwoDC           # TwoDC: 2 Member DCs + an Arbiter DC; ThreeDC: 3 Member DCs
+    distributionRules:
+    - clusterName: dc-a
+      role: Member
+      replicaIndices: [0, 1]
+    - clusterName: dc-b
+      role: Member
+      replicaIndices: [2, 3]
+    - clusterName: dc-c
+      role: Arbiter
+      replicaIndices: []
+```
+
+- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries an
+  empty list. Its single data-less Keeper voter is scheduled onto the arbiter spoke and
+  co-located with the third etcd member (it is not ordinal-pinned; the operator reads the
+  Arbiter DC's `clusterName` and schedules the voter onto that spoke via OCM
+  ManifestWork).
+- `mode: TwoDC` expects two Member DCs plus the Arbiter DC (the even layout); `ThreeDC`
+  expects an odd number of Member DCs and no separate Arbiter DC (each data DC then holds
+  its own Keeper voter, keeping Keeper quorum among the data DCs).
+- Roles are `Member` and `Arbiter` only.
+
+### ClickHouse
+
+```yaml
+apiVersion: kubedb.com/v1alpha2
+kind: ClickHouse
+metadata:
+  name: ch-dcdr
+  namespace: demo
+spec:
+  version: "25.7.1"
+  distributed: true
+  clusterTopology:
+    cluster:
+    - name: appscode-cluster
+      shards: 2
+      replicas: 2
+  storageType: Durable
+  podTemplate:
+    spec:
+      podPlacementPolicy:
+        name: ch-dcdr
+  storage:
+    accessModes: [ReadWriteOnce]
+    resources:
+      requests:
+        storage: 1Gi
+  deletionPolicy: WipeOut
+```
+
+### What the operator creates
+
+- **One logical cluster** whose shards each have a `ReplicatedMergeTree` replica in each
+  Member DC, all sharing one Keeper ensemble and the same `default_replica_path` macros.
+  Replication is the native ClickHouse link over port 9009; there is no second
+  replication link.
+- A **3-site Keeper ensemble**: a voter in each Member DC and, in the even (`TwoDC`)
+  layout, one data-less Keeper voter scheduled onto the Arbiter DC. This is what lets the
+  engine's own quorum survive a DC loss and fence a minority.
+- A single **write endpoint** (Service plus `AppBinding`) that the orchestrator points at
+  the active DC by following the Lease. Reads can go to any replica.
+
+All data-bearing pods carry the offshoot selectors plus the
+`open-cluster-management.io/cluster-name` label, so the single write endpoint and the
+single `AppBinding` keep working as the active DC moves.
+
+> The macros and `default_replica_path` are consistent across DCs so a shard's replicas
+> in different DCs share the same Keeper path and replicate to each other. Do not change
+> them per DC.
+
+## Connecting
+
+A DC-DR ClickHouse exposes a single write endpoint, the same shape as any KubeDB
+ClickHouse:
+
+- the **write endpoint** `<db>` resolves to the active DC's replicas (native port 9000);
+  the Lease-driven fence keeps it off a non-active DC, fail-closed;
+- one **`AppBinding`** `<db>` for applications and KubeDB integrations.
+
+Because ClickHouse is multi-master, every replica can technically accept writes, but the
+single endpoint gives a stable single-writer posture: applications keep using `<db>` and,
+after a failover, reconnect and land on the new active DC.
+
+### Writes and the Keeper-quorum contract
+
+There is no `w:majority` knob to set: the split-brain guarantee is built into the engine.
+Every insert registers a part in Keeper, which needs Keeper quorum. A partitioned
+minority DC that has lost Keeper quorum simply cannot register parts, so its inserts
+fail. You do not have to opt into this; it is how `ReplicatedMergeTree` plus a spread
+3-site Keeper behaves.
+
+```sql
+-- Against the write endpoint <db>:9000:
+INSERT INTO orders (item, qty) VALUES ('widget', 1);
+```
+
+- On a spread 3-site Keeper (topology A), an insert commits only when the writing DC
+  holds Keeper quorum, so a cut-off minority DC cannot commit and there is no split
+  brain.
+- The bounded loss on an unplanned active-DC loss is only committed-but-unfetched parts
+  (registered in Keeper, data still on the lost DC's disk), which are recoverable when
+  that DC returns.
+
+### Read locally
+
+Any replica serves consistent-enough reads. Point read traffic at an in-DC replica for
+low latency; reads are eventually consistent, bounded by that replica's
+`absoluteDelaySeconds`.
+
+## Monitoring and observability
+
+### status.disasterRecovery
+
+The single CR carries the whole cross-DC view:
+
+```bash
+$ kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq
+```
+
+| Field | Meaning |
+| --- | --- |
+| `activeDC` | The DC the single write endpoint currently resolves to (a routing choice, not a promoted primary). |
+| `phase` | `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. |
+| `lastTransitionTime` | When `activeDC` last changed. |
+| `dataCenters[].clusterName` | The data center, by its OCM managed cluster name. |
+| `dataCenters[].role` | `Member` or `Arbiter`. |
+| `dataCenters[].keeperVoter` | Whether the DC holds a Keeper voter. |
+| `dataCenters[].keeperQuorum` | Whether the DC currently sees Keeper quorum (the safety signal). |
+| `dataCenters[].writable` | True only for the active (write-routed) DC. |
+| `dataCenters[].shards[]` | Per shard: `shard`, `totalReplicas`, `activeReplicas`. |
+| `dataCenters[].absoluteDelaySeconds` | The DC's cross-DC replication delay behind the active DC, in seconds (from `system.replicas.absolute_delay`). |
+| `dataCenters[].queueSize` | The DC's pending replication queue length (from `system.replicas.queue_size`). |
+| `dataCenters[].healthy` | Whether the DC has ready replicas. |
+
+### Useful checks
+
+```bash
+# Which DC the Lease intends as the write-routed active DC:
+$ kubectl --kubeconfig <coord> -n dc-failover get lease primary-dc \
+    -o jsonpath='{.spec.holderIdentity}'
+
+# Per-DC replicas and DCs:
+$ kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr \
+    -L open-cluster-management.io/cluster-name
+
+# Cluster hosts and per-replica health (from any replica):
+$ kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+    --query "SELECT cluster, host_name, replica_num FROM system.clusters"
+
+# Replication delay and queue per replica (the lag signal):
+$ kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+    --query "SELECT database, table, absolute_delay, queue_size, log_pointer, log_max_index, total_replicas, active_replicas FROM system.replicas"
+```
+
+## Replication, lag, and RPO
+
+- Cross-DC replication is **native ClickHouse `ReplicatedMergeTree`** over port 9009,
+  asynchronous, coordinated by the shared Keeper ensemble. There is exactly one logical
+  cluster, so there is no extra replication link to manage.
+- The lag signals come from `system.replicas`: `absolute_delay` (seconds a replica is
+  behind), `queue_size`, and `log_pointer` versus `log_max_index` (how many log entries
+  the replica still has to fetch). These are surfaced into `status.disasterRecovery` as
+  `absoluteDelaySeconds` and `queueSize`.
+- A **planned switchover loses near-zero committed writes**, because the orchestrator
+  waits until the target DC's replicas show near-zero `absolute_delay` and an empty queue
+  before it flips the endpoint. An **unplanned failover** may lose only
+  committed-but-unfetched parts (bounded by the standby lag when the active DC died); a
+  clean partition that put the lost DC in the Keeper minority loses zero committed data,
+  because that DC could not commit without quorum.
+
+## Keeper placement and the arbiter
+
+- **Keeper is spread 3-site so no single data DC holds a Keeper majority.** With two data
+  DCs, each holds a voter and the Arbiter DC holds one data-less voter; the majority is 2
+  of 3, so either data DC plus the Arbiter DC keeps quorum, but neither data DC alone
+  does. This removes split brain at its root: a partitioned data DC cannot register parts
+  by itself.
+- The requirement is an **odd Keeper voter total**, not an odd DC count:
+  - **Even layout** (two data DCs plus the Arbiter DC): one Keeper voter per data DC and
+    one data-less voter in the Arbiter DC (total 3). Do not add extra voters that would
+    let one data DC hold a majority.
+  - **Odd layout** (three or more Member DCs, no Arbiter DC): one Keeper voter per data
+    DC, for an odd total, keeping Keeper quorum among the data DCs. This is the layout the
+    DC-count rule prefers.
+- The **Arbiter DC** holds the `dr-controlplane` etcd vote **and** one data-less Keeper
+  voter, co-located so the two quorums agree on which DCs are alive. This is the same
+  arbiter trick as MongoDB.
+
+> For write-heavy ingest, weigh the two alternatives from the overview: two clusters with
+> a per-region Keeper cross-replicated (topology B) avoids the cross-DC Keeper round trip,
+> and single-DC Keeper (topology C) is the lowest-latency manual-failover floor.
+
+## Planned switchover (near-zero RPO)
+
+Move the active (write-routed) DC on purpose by annotating the ClickHouse:
+
+```bash
+$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b
+```
+
+The hub then:
+
+1. checks the target is a known, healthy DC within the lag budget;
+2. sets `phase: FailingOver` and quiesces writes on the current active DC (routes clients
+   away from the endpoint);
+3. waits until the target DC's replicas report `absolute_delay` near zero and an empty
+   replication queue (`queue_size: 0`), the catch-up gate that makes this near-zero RPO;
+4. moves the Lease and the single write endpoint to `dc-b`.
+
+Watch `status.disasterRecovery` for `phase` returning to `Steady` with the new
+`activeDC`. There is no promotion step, because every replica is already writable.
+
+## Failback
+
+Failback is native and clean. A returned DC's replicas rejoin the Keeper ensemble and
+catch up through `ReplicatedMergeTree` (they fetch the missing parts). There is **no
+rewind**: a partitioned DC that lacked Keeper quorum committed nothing to diverge, so
+there is nothing to roll back.
+
+Once the returned DC is caught up (near-zero `absoluteDelaySeconds`, empty queue), steer
+the active DC back with a planned switchover:
+
+```bash
+$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-a
+```
+
+## Scaling and day-2 operations
+
+The standard `ClickHouseOpsRequest` operations (`VerticalScaling`, `HorizontalScaling`,
+`VolumeExpansion`, `UpdateVersion`, `Reconfigure`, `ReconfigureTLS`, `Restart`,
+`RotateAuth`, `StorageMigration`) apply to a DC-DR cluster. They act on the distributed
+per-DC replica groups across the DCs and are issued exactly as for a single-cluster
+ClickHouse. There is no failover ops type: failover is the engine's Keeper quorum, and
+the planned switchover is the `dr.kubedb.com/switchover-to` annotation, not an ops
+request.
+
+`HorizontalScaling` gains a per-DC form so you can scale each DC's per-shard replica
+count independently:
+
+```yaml
+apiVersion: ops.kubedb.com/v1alpha1
+kind: ClickHouseOpsRequest
+metadata:
+  name: ch-dcdr-hscale
+  namespace: demo
+spec:
+  type: HorizontalScaling
+  databaseRef:
+    name: ch-dcdr
+  horizontalScaling:
+    dataCenters:
+    - clusterName: dc-a
+      replicas: 3
+    - clusterName: dc-b
+      replicas: 2
+```
+
+> **Note:** the distributed ClickHouse substrate and the DC-DR layer are net-new for
+> ClickHouse. Treat the field names and flows in this guide as the intended user
+> experience; confirm availability in your release before relying on them in production.
+
+## Deletion and cleanup
+
+```bash
+$ kubectl delete clickhouse -n demo ch-dcdr
+```
+
+Per `deletionPolicy`, the operator removes the per-DC replica groups, the arbiter-DC
+Keeper voter, and the cluster-scoped per-DC `PlacementPolicies` it generated (these carry
+no owner reference, so the operator deletes them explicitly). The user-provided base
+`PlacementPolicy` is left for you to delete.
+
+## Limitations
+
+- **Adding or removing a whole data center** is a topology change (a replica-group and
+  Keeper-ensemble change), performed by editing the `PlacementPolicy` topology, not by a
+  scaling request.
+- Cross-DC `ReplicatedMergeTree` replication is asynchronous; an unplanned failover has a
+  non-zero RPO bounded by the standby lag (only committed-but-unfetched parts). A clean
+  partition loses zero committed data because the minority DC cannot commit without Keeper
+  quorum; use a planned switchover for a near-zero-RPO move.
+- On a spread 3-site Keeper (topology A), every insert pays a cross-DC Keeper round trip.
+  For write-heavy ingest, consider the two-cluster per-region-Keeper topology (B) or the
+  single-DC Keeper topology (C) described in the overview.
diff --git a/docs/guides/clickhouse/dr/overview/index.md b/docs/guides/clickhouse/dr/overview/index.md
new file mode 100644
index 0000000000..19fe392e9b
--- /dev/null
+++ b/docs/guides/clickhouse/dr/overview/index.md
@@ -0,0 +1,350 @@
+---
+title: DC-DR Overview
+menu:
+  docs_{{ .version }}:
+    identifier: ch-dr-overview-clickhouse
+    name: Overview
+    parent: ch-dr-clickhouse
+    weight: 10
+menu_name: docs_{{ .version }}
+section_menu_id: guides
+---
+
+# Cross Data Center Disaster Recovery (DC-DR) for ClickHouse
+
+KubeDB can run a single distributed `ClickHouse` across multiple data centers (DCs) so
+the database survives the loss of an entire data center. Every replica is writable
+(ClickHouse `ReplicatedMergeTree` is multi-master), so DR is not about promoting a new
+primary. It is about two things: the ClickHouse Keeper Raft quorum, which decides which
+DCs can commit at all, and a single Lease-routed write endpoint, which records and
+steers where clients send writes. When a data center is lost, the surviving DCs that
+still hold Keeper quorum keep accepting writes, and the write endpoint follows to a DC
+that holds quorum.
+
+This page is the conceptual overview and a quick start. See also:
+
+- [DC-DR User Guide](/docs/guides/clickhouse/dr/guide/index.md) for every aspect of
+  running in DC-DR mode (components, status, connecting, monitoring, switchover,
+  failback, day-2 ops).
+- [DC-DR Runbook](/docs/guides/clickhouse/dr/runbook/index.md) for what to do in each
+  operational scenario.
+
+> **New to KubeDB?** Please start [here](/docs/README.md).
+
+## Why ClickHouse DC-DR is different
+
+Most KubeDB engines (Postgres, MariaDB, MSSQL) keep their consensus quorum **inside**
+a single DC, because a raft or cluster manager flaps or stalls when its quorum spans
+data centers. Those engines run one independent group per DC and build a separate
+cross-DC replication link, and DR means promoting a standby.
+
+**ClickHouse is the exception, the same way MongoDB is.** `ReplicatedMergeTree` is
+multi-master and geo-aware by design: replicas of a table in different DCs replicate
+asynchronously through a shared ClickHouse Keeper ensemble. So for ClickHouse:
+
+- **One logical cluster spans the DCs.** The same shards, with a `ReplicatedMergeTree`
+  replica of each shard in each DC, all share one Keeper ensemble and the same
+  `default_replica_path` macros. Replication is the native ClickHouse replication link
+  over port 9009. There is **no second replication link to build** and no remote
+  replica to manage.
+- **Failover is the engine's own quorum, not a promotion.** ClickHouse Keeper is a Raft
+  service. The DCs that hold the Keeper majority keep registering parts and serving
+  writes. A partitioned minority DC loses Keeper quorum, cannot register parts, and so
+  its inserts fail. That quorum, not any promotion step, is the failover and the
+  split-brain guarantee. KubeDB never promotes a replica, because every replica is
+  already writable.
+- **Failback is native and clean.** A returned DC's replicas rejoin the Keeper ensemble
+  and catch up through `ReplicatedMergeTree` (they fetch the missing parts). There is
+  **no rewind**: a partitioned minority DC that lacked Keeper quorum committed nothing
+  to diverge, so there is nothing to roll back. This is cleaner than the Postgres
+  `pg_rewind` path and even than MongoDB rollback.
+
+## How it works
+
+DC-DR for ClickHouse rests on four rules.
+
+- **ClickHouse Keeper is spread 3-site and is the failover authority.** With two data
+  DCs the layout is a Keeper voter in `dc-a`, a Keeper voter in `dc-b`, and one
+  data-less **Keeper voter** in a third arbiter DC. That arbiter-DC voter is co-located
+  with the `dr-controlplane` etcd member, so the Keeper Raft quorum and the Lease
+  quorum share the same 3-site topology (exactly MongoDB's arbiter trick). A single DC
+  loss then leaves a surviving Keeper majority, so the survivors keep committing with no
+  manual step.
+- **Keeper quorum is the writable contract and the split-brain guarantee.** Because the
+  safety comes from Keeper's Raft majority, a partitioned minority DC cannot register
+  parts and so cannot commit any insert. A cut-off DC goes non-writable on its own, at
+  the engine level, with no operator action. This is the same shape as MongoDB majority
+  plus `w:majority`, and it is the hard guarantee. The Lease-driven endpoint fence is
+  only a routing layer on top of it.
+- **The Lease routes the single write endpoint; it does not promote anything.** A small
+  control plane (`dr-controlplane`), backed by a three-site etcd quorum, publishes one
+  `coordination.k8s.io` **Lease** per failover scope. The Lease records which DC the
+  single write endpoint resolves to and steers clients there, giving a stable
+  single-writer posture and one consistent cross-engine status. Because ClickHouse is
+  multi-master, this is a write-routing choice, not an engine-enforced primary. On an
+  unplanned active-DC loss the orchestrator moves the Lease and the endpoint to a
+  surviving DC that still holds Keeper quorum. The Lease is routing, policy, and
+  observability, **not** the failover mechanism (Keeper quorum is).
+- **Reads can stay local.** Any replica serves consistent-enough reads, so read traffic
+  can stay in-DC for low latency while writes route to the active DC through the single
+  endpoint.
+
+> **Why not confine Keeper to the active DC?** You can (see the topologies below), and
+> it gives the lowest write latency. But then losing the active DC also loses its Keeper
+> quorum, and the surviving replicas have no quorum to commit against until you bring up
+> a new Keeper and re-point them, an explicit manual recovery with RTO impact. Spreading
+> Keeper 3-site removes that manual step at the cost of a cross-DC Keeper round trip on
+> every insert.
+
+### Data center roles
+
+Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`:
+
+| Role | Holds ClickHouse data | Holds a Keeper voter | Purpose |
+| --- | --- | --- | --- |
+| **Member** | yes | yes | A data-bearing DC holding a `ReplicatedMergeTree` replica of every shard and a Keeper voter; a candidate for the active (write-routed) DC. |
+| **Arbiter** | no | yes (data-less) | The arbiter DC. Holds the `dr-controlplane` etcd vote **and** one data-less ClickHouse Keeper voter. Supplies the tie-break vote for both quorums. No ClickHouse data. |
+
+> Unlike MariaDB and MSSQL, whose arbiter DC holds no engine member, the ClickHouse
+> arbiter DC holds **both** the `dr-controlplane` etcd member and a data-less Keeper
+> voter, co-located so the coordination quorum and the Keeper quorum agree on which DCs
+> are alive. This is the same pattern as MongoDB's voting arbiter.
+
+## The Keeper placement decision (the one real tradeoff)
+
+ClickHouse Keeper Raft is latency-sensitive: every insert registers a part, which is a
+Keeper write that needs a quorum round trip. ClickHouse is a write-heavy ingest engine,
+so where you place Keeper is the one real tradeoff. There are three placements, and the
+right one depends on your write-latency tolerance. **3-site spread Keeper is the
+documented automatic-DR path here, but it is not automatically the best choice for every
+workload.**
+
+### A. Spread Keeper 3-site, single cluster (automatic DR, higher write latency)
+
+One logical ClickHouse cluster with Keeper voters in `dc-a`, `dc-b`, and a data-less
+voter in the arbiter DC (co-located with the `dr-controlplane` etcd member). A DC loss
+leaves a surviving Keeper majority, so writes continue in the survivor with **no manual
+step**. The cost is a cross-DC Keeper round trip on every insert. Batched inserts
+amortize it well; high-frequency small inserts may find it prohibitive. This is the path
+documented in detail here, because it is the closest analog to the MongoDB design and
+gives hands-off failover.
+
+### B. Two clusters, per-region Keeper, cross-replicated (often the better fit)
+
+Each DC runs its own cluster with its own in-DC Keeper (low local write latency); the
+two cross-replicate the same `ReplicatedMergeTree` tables and a `Distributed` table
+fronts both. Each region writes locally, and on a region loss the other already holds a
+full replica. This matches ClickHouse's write-local nature and avoids the cross-DC
+Keeper tax, at the cost of a more complex topology and looser cross-region ordering.
+This is Altinity's "multi-region writes" pattern and is **frequently the better
+production choice for write-heavy ingest**. Failover is still a routing change and the
+Lease still picks the write-routed DC.
+
+### C. Single-DC Keeper (lowest latency, manual failover)
+
+Keeper lives only in the active DC; standby replicas use it cross-DC. Inserts are lowest
+latency, but losing the active DC (and its Keeper) leaves the standby with no quorum, so
+a new Keeper must be brought up and the replicas re-pointed (an explicit recovery with
+RTO impact). This is the simplest Altinity default and is acceptable when low write
+latency outweighs automatic DR.
+
+### At a glance
+
+| Topology | Keeper | Write latency | Failover on active-DC loss |
+| --- | --- | --- | --- |
+| A. 3-site spread (documented here) | voters in dc-a, dc-b, arbiter DC | higher (cross-DC round trip per insert) | automatic, survivor keeps quorum |
+| B. Two clusters, per-region Keeper | one Keeper per region | low (local) | routing change, other region already replicated |
+| C. Single-DC Keeper | only in the active DC | lowest | manual: rebuild Keeper, re-point replicas |
+
+The rest of these docs describe topology **A**, the 3-site single-cluster spread. Its
+safety claim (a minority DC cannot register parts) is specific to it. Pick per workload:
+for write-heavy ClickHouse, topology B is frequently better, and topology C is the
+low-latency manual-failover floor.
+
+## The single-CR, single-endpoint model
+
+The user creates **one** distributed `ClickHouse` object (with `spec.distributed: true`
+and a `PlacementPolicy` carrying `distributionRules` and a `failoverPolicy`) and gets
+**one** `AppBinding` and **one** write endpoint. The operator expands the CR into per-DC
+`ReplicatedMergeTree` replicas of every shard, a 3-site Keeper ensemble (arbiter-DC
+voter included), and the Lease-routed write endpoint.
+
+The single CR's `status.disasterRecovery` carries the whole cross-DC view: the active
+(write-routed) DC, each DC's per-shard replica health, whether the DC holds Keeper
+quorum, the cross-DC replication delay, and the DR phase.
+
+## Prerequisites
+
+- A distributed ClickHouse substrate: Open Cluster Management (OCM) hub and spoke
+  clusters, KubeSlice connecting the spokes so replicas reach each other, and a storage
+  class on each data-bearing spoke. The ClickHouse and Keeper ports (native 9000,
+  replication 9009, Keeper client 9181, Keeper Raft 9234) must be reachable across the
+  DCs.
+- The `dr-controlplane` service and its three-site etcd quorum installed across the data
+  centers, with a `dr-controlplane` agent running in each spoke (DC). The third etcd
+  member sits in the arbiter DC alongside the data-less Keeper voter.
+- The KubeDB ClickHouse operator started with the DC-DR flags (coordination kubeconfig
+  and the operator's local DC name).
+- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster
+  name, the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the
+  pod label `open-cluster-management.io/cluster-name`, and the `PlacementPolicy`
+  `distributionRule.clusterName`. Keep them identical.
+
+## Deploy a DC-DR ClickHouse
+
+### 1. PlacementPolicy
+
+Assign global replica indices to data centers and tag each DC with its role. Here two
+Member DCs (`dc-a`, `dc-b`) each hold a replica of every shard plus a Keeper voter, and
+`dc-c` is the arbiter DC holding only a data-less Keeper voter and the `dr-controlplane`
+etcd member:
+
+```yaml
+apiVersion: apps.k8s.appscode.com/v1
+kind: PlacementPolicy
+metadata:
+  name: ch-dcdr
+spec:
+  clusterSpreadConstraint:
+    slice:
+      projectNamespace: kubeslice-demo
+      sliceName: demo-slice
+    failoverPolicy:
+      trigger:
+        scope: Global
+      mode: TwoDC
+    distributionRules:
+    - clusterName: dc-a
+      role: Member
+      replicaIndices: [0, 1]      # dc-a: a replica of each shard + a Keeper voter
+    - clusterName: dc-b
+      role: Member
+      replicaIndices: [2, 3]      # dc-b: a replica of each shard + a Keeper voter
+    - clusterName: dc-c
+      role: Arbiter
+      replicaIndices: []          # arbiter DC: dr-controlplane etcd + a data-less Keeper voter
+```
+
+- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries
+  an empty list (its single data-less Keeper voter is not ordinal-pinned, it is
+  scheduled onto the arbiter spoke by the operator).
+- `failoverPolicy.trigger.scope: Global` makes this one cluster-wide failover scope.
+
+### 2. ClickHouse
+
+Reference the `PlacementPolicy` and opt the ClickHouse into DC-DR expansion:
+
+```yaml
+apiVersion: kubedb.com/v1alpha2
+kind: ClickHouse
+metadata:
+  name: ch-dcdr
+  namespace: demo
+spec:
+  version: "25.7.1"
+  distributed: true
+  clusterTopology:
+    cluster:
+    - name: appscode-cluster
+      shards: 2
+      replicas: 2
+  storageType: Durable
+  podTemplate:
+    spec:
+      podPlacementPolicy:
+        name: ch-dcdr
+  storage:
+    accessModes: [ReadWriteOnce]
+    resources:
+      requests:
+        storage: 1Gi
+  deletionPolicy: WipeOut
+```
+
+The operator expands this into per-DC `ReplicatedMergeTree` replicas of every shard,
+pinned to `dc-a` and `dc-b`, plus a data-less Keeper voter in `dc-c`, and routes the
+single write endpoint to the active DC by following the Lease.
+
+## Observe the DC-DR state
+
+The single `ClickHouse` object's `status.disasterRecovery` carries the whole cross-DC
+view:
+
+```bash
+$ kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq
+```
+
+```json
+{
+  "activeDC": "dc-a",
+  "phase": "Steady",
+  "lastTransitionTime": "2026-06-30T10:00:00Z",
+  "dataCenters": [
+    {
+      "clusterName": "dc-a", "role": "Member", "keeperVoter": true, "keeperQuorum": true,
+      "writable": true, "healthy": true, "absoluteDelaySeconds": 0, "queueSize": 0,
+      "shards": [
+        { "shard": 0, "totalReplicas": 2, "activeReplicas": 2 },
+        { "shard": 1, "totalReplicas": 2, "activeReplicas": 2 }
+      ]
+    },
+    {
+      "clusterName": "dc-b", "role": "Member", "keeperVoter": true, "keeperQuorum": true,
+      "writable": false, "healthy": true, "absoluteDelaySeconds": 2, "queueSize": 5,
+      "shards": [
+        { "shard": 0, "totalReplicas": 2, "activeReplicas": 2 },
+        { "shard": 1, "totalReplicas": 2, "activeReplicas": 2 }
+      ]
+    },
+    {
+      "clusterName": "dc-c", "role": "Arbiter", "keeperVoter": true, "keeperQuorum": true,
+      "writable": false, "healthy": true
+    }
+  ]
+}
+```
+
+- `activeDC` is the DC the write endpoint currently resolves to (a routing choice, not
+  a promoted primary).
+- `phase` is `Steady`, `FailingOver`, `FailingBack`, or `Degraded`.
+- Each `dataCenters` entry reports the DC role, whether it holds a Keeper voter and
+  whether it currently has Keeper quorum, whether it is the write-routed DC, its
+  per-shard replica health, its cross-DC `absoluteDelaySeconds` and `queueSize`, and its
+  health.
+
+## Unplanned failover
+
+When the active DC is lost, the surviving DCs that still hold Keeper quorum (a standby
+data DC plus the arbiter DC in the even layout) **keep accepting writes on their own**,
+because Keeper quorum survives and every replica is already writable. There is no
+promotion. The orchestrator observes the Lease move to a surviving DC and points the
+single write endpoint there. `status.disasterRecovery.phase` moves to `FailingOver` and
+back to `Steady`. Bounded loss is only committed-but-unfetched parts on the lost DC's
+disk (a clean partition that put the lost DC in the Keeper minority loses zero committed
+data, because it could not commit without quorum).
+
+## Planned switchover (near-zero RPO)
+
+To move the active (write-routed) DC on purpose without losing committed writes,
+annotate the ClickHouse with the target DC:
+
+```bash
+$ kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b
+```
+
+The orchestrator quiesces writes on the current active DC (routes clients away), waits
+until the target DC's replicas show `absoluteDelaySeconds` near zero and an empty
+replication queue (`queueSize: 0`), then moves the Lease and the write endpoint to
+`dc-b`. Because it waits for the target to catch up before flipping, near-zero committed
+writes are lost.
+
+## Cleanup
+
+```bash
+$ kubectl delete clickhouse -n demo ch-dcdr
+$ kubectl delete placementpolicy ch-dcdr
+```
+
+Deleting the `ClickHouse` removes the per-DC replica groups, the arbiter-DC Keeper
+voter, and the generated per-DC `PlacementPolicies`. The user-provided base
+`PlacementPolicy` is left for you to delete.
diff --git a/docs/guides/clickhouse/dr/runbook/index.md b/docs/guides/clickhouse/dr/runbook/index.md
new file mode 100644
index 0000000000..8a9ab70ed8
--- /dev/null
+++ b/docs/guides/clickhouse/dr/runbook/index.md
@@ -0,0 +1,331 @@
+---
+title: DC-DR Runbook
+menu:
+  docs_{{ .version }}:
+    identifier: ch-dr-runbook-clickhouse
+    name: Runbook
+    parent: ch-dr-clickhouse
+    weight: 30
+menu_name: docs_{{ .version }}
+section_menu_id: guides
+---
+
+# ClickHouse DC-DR Runbook
+
+Scenario-by-scenario procedures for operating a ClickHouse cluster in cross data center
+disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB and
+ClickHouse do **automatically**, how to **verify**, and the **action** to take.
+
+Read the [User Guide](/docs/guides/clickhouse/dr/guide/index.md) for the concepts and
+commands referenced here. Throughout, `<coord>` is the coordination control plane
+kubeconfig, and `ch-dcdr`/`demo` are the example database and namespace. The example pod
+`ch-dcdr-appscode-cluster-shard-0-0` is the first replica of shard 0.
+
+## Quick reference
+
+```bash
+# Active DC, phase, and per-DC view:
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery}' | jq
+
+# Lease holder (the DC the coordination plane routes writes to):
+kubectl --kubeconfig <coord> -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}'
+
+# Per-DC replicas and DCs:
+kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name
+
+# Replication delay, queue, and log pointers (from any replica):
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+  --query "SELECT database, table, absolute_delay, queue_size, log_pointer, log_max_index, total_replicas, active_replicas FROM system.replicas"
+```
+
+Golden rules:
+
+- **ClickHouse Keeper quorum decides who can commit.** Never try to force a write into a
+  DC that has lost Keeper quorum; it cannot register parts and its inserts fail by design.
+  That is the split-brain guarantee, not a bug.
+- **There is no promotion.** Every replica is writable. DR is a routing change (the write
+  endpoint follows the Lease to a DC that holds Keeper quorum), not an election.
+- **Exactly one DC is `writable: true`** in `status.disasterRecovery` at any instant (the
+  write-routed active DC), even though the engine would let more than one accept writes.
+- **The Lease is routing, not safety.** If the Lease is stale, ClickHouse keeps running on
+  its own Keeper quorum; what you lose is switchover and endpoint steering.
+
+---
+
+## 1. Active DC lost (zone/cluster failure)
+
+**Symptoms:** the active DC's replicas are gone/unreachable; writes to the endpoint fail
+briefly until it re-points.
+
+**Automatic:** the surviving DCs that still hold Keeper quorum (a standby data DC plus the
+Arbiter DC in the even layout, or the surviving data majority in the odd layout) **keep
+accepting writes on their own**, because Keeper quorum survives and every replica is
+already writable. There is no promotion. The orchestrator observes the Lease move to a
+surviving DC and points the single write endpoint there. `phase` moves `FailingOver` to
+`Steady`.
+
+**Verify:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}'   # the survivor
+kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name
+```
+
+**Action:** none required for availability. Note the RPO: only committed-but-unfetched
+parts on the lost DC's disk are at risk (a clean partition that put the lost DC in the
+Keeper minority loses zero committed data). When the failed DC returns, see scenario 8
+(re-add a DC).
+
+---
+
+## 2. Network partition between data centers
+
+**Symptoms:** DCs are up but cannot reach each other.
+
+**Automatic:** the side that keeps Keeper quorum keeps registering parts and stays
+writable. The minority side **loses Keeper quorum and cannot register parts, so its
+inserts fail on their own**, at the engine level. There is no split brain and the fence
+needs no action: a minority DC simply cannot commit. The write endpoint stays on (or moves
+to) the majority side.
+
+**Verify there is exactly one writable DC and check who holds Keeper quorum:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}=writable:{.writable},quorum:{.keeperQuorum} {end}'
+```
+
+**Action:** heal the network. The minority side rejoins the Keeper ensemble and catches up
+through `ReplicatedMergeTree` automatically. There is no rewind, because the minority
+committed nothing.
+
+---
+
+## 3. Planned switchover (maintenance on the active DC)
+
+**Action:**
+
+```bash
+kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-b
+```
+
+**Automatic:** the hub gates on the target's health and lag, quiesces writes on the current
+active DC (routes clients away), waits until the target's replicas show `absolute_delay`
+near zero and an empty queue (`queue_size: 0`), then moves the Lease and the write endpoint
+to `dc-b`. Because it waits for the target to catch up before flipping, near-zero committed
+writes are lost. There is no promotion step.
+
+**Verify:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}'  # dc-b
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{.status.disasterRecovery.phase}'     # Steady
+```
+
+**If it does not complete:** see scenario 6 (switchover stuck).
+
+---
+
+## 4. Planned failback to the original DC
+
+After the original DC is healthy and caught up (failback is native: its replicas rejoin the
+Keeper ensemble and fetch missing parts, with no rewind), steer the active DC back:
+
+```bash
+kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to=dc-a
+```
+
+Same near-zero-RPO flow as scenario 3. There is no `pg_rewind` step and no rollback; a DC
+that lacked Keeper quorum committed nothing to diverge.
+
+---
+
+## 5. Arbiter DC lost (even layout)
+
+**Symptoms:** the Arbiter DC is gone; its etcd member and the data-less Keeper voter are
+unreachable.
+
+**Impact:** none on writes. The two data DCs together still hold 2 of the 3 Keeper voters,
+a majority, so Keeper quorum holds and writes continue. You lose the tie-break voter, so a
+subsequent **second** failure (a data DC) can no longer keep Keeper quorum automatically.
+
+**Verify the cluster is still writable and holds quorum:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} quorum:{.keeperQuorum} writable:{.writable}{"\n"}{end}'
+```
+
+**Action:** restore the Arbiter DC (the etcd member and the data-less Keeper voter) to
+regain single-fault tolerance.
+
+---
+
+## 6. Planned switchover stuck (target not catching up)
+
+**Symptoms:** after annotating `switchover-to`, `phase` stays `FailingOver` and the endpoint
+does not move.
+
+**Diagnose:**
+
+```bash
+# Target lag and health from status:
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} delay={.absoluteDelaySeconds} queue={.queueSize} healthy={.healthy}{"\n"}{end}'
+# Replication state from a target replica:
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+  --query "SELECT table, absolute_delay, queue_size, log_pointer, log_max_index FROM system.replicas"
+```
+
+**Causes & action:**
+
+- **Target lag not converging** the switchover waits for the target's `absolute_delay` near
+  zero and an empty queue before flipping. Relieve the cross-DC bottleneck (network, insert
+  load, Keeper round-trip latency) so the target drains its replication queue.
+- **Target unhealthy** ensure the target DC has ready replicas of every shard.
+- **Abort** remove the annotation to cancel:
+  `kubectl annotate clickhouse -n demo ch-dcdr dr.kubedb.com/switchover-to-`.
+
+---
+
+## 7. A standby DC is lost
+
+**Symptoms:** a non-active DC's replicas are gone; that DC shows `healthy: false`.
+
+**Impact:** none on writes as long as the remaining DCs keep Keeper quorum (the active DC
+plus the Arbiter DC in the even layout). You lose that DC's redundancy and its local read
+capacity until it returns.
+
+**Verify the active DC still holds quorum and is writable:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName} quorum:{.keeperQuorum}{end}'
+```
+
+**Action:** recover the DC's replicas; they reschedule and catch up through
+`ReplicatedMergeTree` over Keeper automatically.
+
+---
+
+## 8. Re-add / recover a previously lost data center
+
+After a DC returns from a failure:
+
+**Automatic:** its replicas rejoin the Keeper ensemble and catch up over native
+`ReplicatedMergeTree` (they fetch the missing parts). There is no rewind and no rollback; a
+DC that lacked Keeper quorum committed nothing to diverge.
+
+**Verify:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} healthy={.healthy} delay={.absoluteDelaySeconds} queue={.queueSize}{"\n"}{end}'
+```
+
+**Action:** to make it the active DC again, perform a planned failback (scenario 4) once its
+`absoluteDelaySeconds` is near zero and its queue is empty.
+
+---
+
+## 9. Keeper quorum lost across the ensemble (double failure)
+
+**Symptoms:** more than one Keeper voter is unreachable at once, so the ensemble has no Raft
+majority; inserts fail cluster-wide.
+
+**Impact:** with the ensemble below majority, **no DC can register parts**, so writes stop
+everywhere. This is the engine protecting against split brain, not a KubeDB fault.
+
+**Verify:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} keeperVoter:{.keeperVoter} keeperQuorum:{.keeperQuorum}{"\n"}{end}'
+```
+
+**Action:** restore enough Keeper voters to regain a majority (bring back a failed Member
+DC's voter or the Arbiter DC's voter). Writes resume automatically once the ensemble has
+quorum again. Do not try to force a single surviving voter to act alone.
+
+---
+
+## 10. A DC is unexpectedly read-only / rejecting writes
+
+**Symptoms:** a DC you expect to serve the endpoint is rejecting inserts.
+
+**Diagnose:**
+
+```bash
+# Does this DC hold Keeper quorum, and is it the write-routed DC?
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} writable:{.writable} quorum:{.keeperQuorum}{"\n"}{end}'
+# What DC does the Lease route to?
+kubectl --kubeconfig <coord> -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}'
+# Replication and Keeper reachability from a replica:
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+  --query "SELECT table, is_readonly, is_session_expired, absolute_delay FROM system.replicas"
+```
+
+**Causes & action:**
+
+- **Not the write-routed DC** the endpoint intentionally routes writes elsewhere; this DC
+  is a standby (correct). Point writes at the endpoint `<db>`, not at this DC directly.
+- **Lost Keeper quorum** `keeperQuorum:false` or `is_readonly:1` means this DC cannot reach
+  a Keeper majority, so it is read-only by design. See scenario 2 or 9.
+- **Lease routes here but writes still fail** check Keeper reachability and the fence
+  marker; the endpoint fails closed if the marker is stale.
+
+Never try to bypass the fence or write directly to a DC that lacks Keeper quorum; it cannot
+commit and you risk confusing clients.
+
+---
+
+## 11. Coordination plane (dr-controlplane / etcd) unavailable
+
+**Symptoms:** the Lease cannot be read/renewed across the spokes.
+
+**Automatic:** ClickHouse keeps running on its own Keeper quorum, so the cluster stays
+writable in whichever DC the endpoint last resolved to; the Lease is routing, not the
+failover mechanism, so its loss does not by itself stop writes. What you lose is **endpoint
+steering and planned switchover**: the orchestrator cannot move the active DC until the
+Lease quorum returns.
+
+**Verify:**
+
+```bash
+kubectl --kubeconfig <coord> -n dc-failover get lease primary-dc   # error / stale
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT 1"  # DB still serving
+```
+
+**Action:** restore the `dr-controlplane` etcd quorum (it shares the Arbiter DC with the
+data-less Keeper voter). Once the Lease is renewable, endpoint steering and switchover
+resume.
+
+---
+
+## 12. Suspected split-brain (two DCs taking committed writes)
+
+This should be impossible with a spread 3-site Keeper: no single data DC holds a Keeper
+majority, and a minority DC cannot register parts, so it cannot commit. If
+`status.disasterRecovery` ever shows two `writable: true` DCs:
+
+**Diagnose immediately:**
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} writable:{.writable} quorum:{.keeperQuorum}{"\n"}{end}'
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client \
+  --query "SELECT table, is_readonly, total_replicas, active_replicas FROM system.replicas"
+```
+
+**Action:** confirm the Keeper ensemble still spreads voters 3-site (no single data DC was
+given a Keeper majority by a bad topology change). Because a minority DC cannot register
+parts, it cannot diverge committed data even if the routing briefly shows two writable DCs.
+Restore connectivity, let the endpoint settle on one active DC, and correct the Keeper
+placement if it drifted.
+
+---
+
+## Escalation checklist
+
+When unsure, collect:
+
+```bash
+kubectl get clickhouse -n demo ch-dcdr -o yaml
+kubectl --kubeconfig <coord> -n dc-failover get lease -o yaml
+kubectl get pods -n demo -l app.kubernetes.io/instance=ch-dcdr -L open-cluster-management.io/cluster-name -o wide
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT * FROM system.replicas FORMAT Vertical"
+kubectl exec -n demo ch-dcdr-appscode-cluster-shard-0-0 -- clickhouse-client --query "SELECT * FROM system.clusters FORMAT Vertical"
+```

From 1bcd619ee74692b39f0663b8521f0d76ff0affa4 Mon Sep 17 00:00:00 2001
From: Tamal Saha <tamal@appscode.com>
Date: Wed, 1 Jul 2026 10:31:43 +0600
Subject: [PATCH 2/2] Document WAN-efficient cross-DC part fetch for ClickHouse
 DC-DR

Add the fifth How-it-works rule: ReplicatedMergeTree fetches are not DC-aware, so
with two or more in-DC replicas of a shard the operator designates one in-DC
replica per shard as the cross-DC fetch source and the others fetch intra-DC,
holding cross-DC part traffic to one copy per shard per DC.

Signed-off-by: Tamal Saha <tamal@appscode.com>
---
 docs/guides/clickhouse/dr/overview/index.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/guides/clickhouse/dr/overview/index.md b/docs/guides/clickhouse/dr/overview/index.md
index 19fe392e9b..6a2c93d95d 100644
--- a/docs/guides/clickhouse/dr/overview/index.md
+++ b/docs/guides/clickhouse/dr/overview/index.md
@@ -61,7 +61,7 @@ asynchronously through a shared ClickHouse Keeper ensemble. So for ClickHouse:
 
 ## How it works
 
-DC-DR for ClickHouse rests on four rules.
+DC-DR for ClickHouse rests on five rules.
 
 - **ClickHouse Keeper is spread 3-site and is the failover authority.** With two data
   DCs the layout is a Keeper voter in `dc-a`, a Keeper voter in `dc-b`, and one
@@ -88,6 +88,16 @@ DC-DR for ClickHouse rests on four rules.
 - **Reads can stay local.** Any replica serves consistent-enough reads, so read traffic
   can stay in-DC for low latency while writes route to the active DC through the single
   endpoint.
+- **One cross-DC part copy per shard, then fetch intra-DC.** `ReplicatedMergeTree`
+  fetches are not DC-aware: a replica pulls a new part from whichever replica advertises
+  it in Keeper, which can be the cross-DC one. With a single replica of a shard per DC
+  (the minimal DR shape) that is already one part copy per DC. But when a DC runs two or
+  more replicas of the same shard for intra-DC HA, each can independently pull the same
+  part across the WAN. ClickHouse has no native same-DC fetch affinity, so the operator
+  designates one in-DC replica per shard as the cross-DC fetch source and points the
+  others at it, so they fetch that part intra-DC. This holds cross-DC part traffic to one
+  copy per shard per DC, the `ReplicatedMergeTree` analog of the Postgres standby-DC
+  intra-DC cascade.
 
 > **Why not confine Keeper to the active DC?** You can (see the topologies below), and
 > it gives the lowest write latency. But then losing the active DC also loses its Keeper