diff --git a/docs/guides/elasticsearch/dr/_index.md b/docs/guides/elasticsearch/dr/_index.md new file mode 100644 index 000000000..b2b41ec99 --- /dev/null +++ b/docs/guides/elasticsearch/dr/_index.md @@ -0,0 +1,10 @@ +--- +title: Disaster Recovery +menu: + docs_{{ .version }}: + identifier: es-dr-elasticsearch + name: DR + parent: es-elasticsearch-guides + weight: 36 +menu_name: docs_{{ .version }} +--- diff --git a/docs/guides/elasticsearch/dr/guide/index.md b/docs/guides/elasticsearch/dr/guide/index.md new file mode 100644 index 000000000..e4b3007e6 --- /dev/null +++ b/docs/guides/elasticsearch/dr/guide/index.md @@ -0,0 +1,340 @@ +--- +title: DC-DR User Guide +menu: + docs_{{ .version }}: + identifier: es-dr-guide-elasticsearch + name: User Guide + parent: es-dr-elasticsearch + weight: 20 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Running Elasticsearch in DC-DR Mode: User Guide + +This guide covers every aspect of operating a distributed Elasticsearch in cross data +center disaster recovery (DC-DR) mode: the components, the naming contract, deployment, +what the operator creates, indexing into the active endpoint, monitoring CCR follow lag, +the follower-read-only fence, switchover, failback, scaling, and day-2 operations. + +Read the [DC-DR Overview](/docs/guides/elasticsearch/dr/overview/index.md) first for the +architecture, and the [DC-DR Runbook](/docs/guides/elasticsearch/dr/runbook/index.md) for +scenario-by-scenario procedures. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Components and where they run + +| Component | Runs in | Responsibility | +| --- | --- | --- | +| **`dr-controlplane`** + 3-site etcd quorum | across the data centers (an OCM control plane) | Publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease holder is the active write DC. This is the single cross-DC failover authority. 
| +| **`dr-controlplane` agent** | each spoke (DC) | Contends for the primary-DC Lease for its DC and projects the Lease decision into the local spoke as the primary-dc marker the fence reads. | +| **KubeDB Elasticsearch operator (hub)** | the OCM hub | Expands the `Elasticsearch` CR into per-DC Elasticsearch clusters, registers each as a remote of the other, and sets auto-follow patterns on the standby. On a Lease change it promotes the followers, flips the search/index endpoint, and reverses the CCR direction, then writes `status.disasterRecovery`. | +| **Per-DC Elasticsearch clusters** | each Member DC | Each is a self-contained Elasticsearch with its own intra-DC master quorum, data nodes, and per-shard primaries and replicas. The master quorum never crosses the DC boundary. | +| **CCR remote-cluster registration + auto-follow patterns** | the standby Member DC | The standby registers the active as a remote cluster and follows its leader indices via follower indices. CCR runs only active to standby, and the operator owns the direction. | +| **The follower-read-only fence** | each Member DC | Follower indices are inherently read-only; a non-active cluster refuses client writes to would-be leader indices, fail closed, so only the Lease holder takes writes. | +| **KubeSlice (or external listeners)** | each spoke | Provides the flat cross-DC pod network so CCR can reach the remote cluster's transport port (9300) and the search/index endpoint (9200) resolves across DCs. | + +## The DC-name contract + +One string identifies a data center everywhere. 
**Keep these identical:** + +- the OCM spoke cluster name +- the agent `--dc-name` +- the primary-DC Lease `holderIdentity` +- the marker `data.activeDC` +- the pod label `open-cluster-management.io/cluster-name` +- the `PlacementPolicy` `distributionRule.clusterName` + +## Deploying + +### PlacementPolicy + +Map the global pod ordinals to data centers and tag each DC with its role: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: es-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + mode: TwoDC # two Member DCs plus an Arbiter DC + trigger: + scope: Global # one cluster-wide failover scope + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1, 2] + - clusterName: dc-b + role: Member + replicaIndices: [3, 4, 5] + - clusterName: dc-c + role: Arbiter + replicaIndices: [] +``` + +- A data-bearing **Member** rule carries `replicaIndices` that map to a self-contained + Elasticsearch cluster with its own master quorum. The **Arbiter** DC carries an empty + `replicaIndices` and holds only the `dr-controlplane` etcd member, never Elasticsearch. +- `mode: TwoDC` expects exactly two Member DCs plus the Arbiter DC. Three or more data DCs + is a separate design and out of scope. +- Roles are `Member` and `Arbiter` only. 
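The `TwoDC` rules above can be checked before the policy is applied. The following is a minimal sketch, not the operator's validation: the function name `validate_two_dc_rules` and the requirement that Member `replicaIndices` cover every ordinal exactly once are assumptions made for illustration.

```python
from collections import Counter


def validate_two_dc_rules(rules, total_replicas):
    """Return a list of problems found in TwoDC distributionRules (empty if none)."""
    problems = []
    roles = Counter(rule.get("role") for rule in rules)

    # Roles are Member and Arbiter only.
    for role in roles:
        if role not in ("Member", "Arbiter"):
            problems.append(f"unknown role: {role!r}")

    # mode: TwoDC expects exactly two Member DCs plus the Arbiter DC.
    if roles["Member"] != 2:
        problems.append(f"expected 2 Member DCs, found {roles['Member']}")
    if roles["Arbiter"] != 1:
        problems.append(f"expected 1 Arbiter DC, found {roles['Arbiter']}")

    # The Arbiter holds no Elasticsearch, so its replicaIndices must be empty.
    for rule in rules:
        if rule.get("role") == "Arbiter" and rule.get("replicaIndices"):
            problems.append(f"Arbiter {rule.get('clusterName')} must have empty replicaIndices")

    # Assumption: every ordinal 0..total_replicas-1 belongs to exactly one Member.
    assigned = sorted(
        i
        for rule in rules
        if rule.get("role") == "Member"
        for i in rule.get("replicaIndices", [])
    )
    if assigned != list(range(total_replicas)):
        problems.append(
            f"Member replicaIndices {assigned} do not cover 0..{total_replicas - 1} exactly once"
        )
    return problems


rules = [
    {"clusterName": "dc-a", "role": "Member", "replicaIndices": [0, 1, 2]},
    {"clusterName": "dc-b", "role": "Member", "replicaIndices": [3, 4, 5]},
    {"clusterName": "dc-c", "role": "Arbiter", "replicaIndices": []},
]
print(validate_two_dc_rules(rules, total_replicas=6))  # prints []
```

Passing `total_replicas` separately mirrors how `spec.replicas` lives on the `Elasticsearch` object while the ordinals live on the `PlacementPolicy`.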
+ +### Elasticsearch + +```yaml +apiVersion: kubedb.com/v1 +kind: Elasticsearch +metadata: + name: es-dcdr + namespace: demo +spec: + version: xpack-8.19.9 + distributed: true + enableSSL: true + replicas: 6 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: es-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +### What the operator creates + +- **One self-contained Elasticsearch cluster per Member DC** (`es-dcdr` materialized into a + cluster in `dc-a` and a cluster in `dc-b`), each with its own master-eligible voting + quorum and its own per-shard primaries and replicas. The two clusters never share a master + quorum. `spec.replicas: 6` is the total node count across both Member DCs; the + `replicaIndices` split it into a three-node cluster per DC (ordinals 0, 1, 2 in `dc-a` and + 3, 4, 5 in `dc-b`). +- **A remote-cluster registration on each DC**, so the standby can reach the active over the + transport port (9300) via `cluster.remote..seeds` pointed at the active's transport + endpoint. The reverse-direction registration stays provisioned so a failover can enable + auto-follow the other way. +- **Auto-follow patterns on the standby DC's cluster** that create follower indices for the + active cluster's leader indices, pulling operations asynchronously. CCR runs only + active-to-standby; the operator owns the direction so the two directions never overlap. +- **The Lease-gated search/index endpoint** that resolves to the active cluster's nodes. + +CCR itself is configured through the Elasticsearch CCR REST API, which the operator drives. 
+Conceptually, on the standby cluster (`dc-b` while `dc-a` is active) the operator registers +the remote and sets an auto-follow pattern: + +```bash +# Register the active cluster as a remote (transport seeds, port 9300): +PUT /_cluster/settings +{ + "persistent": { + "cluster.remote.dc-a.seeds": [ "es-dcdr-dc-a-master.demo.svc:9300" ] + } +} + +# Auto-follow every new leader index on the active into a follower index here: +PUT /_ccr/auto_follow/dc-a-autofollow +{ + "remote_cluster": "dc-a", + "leader_index_patterns": [ "*" ], + "follow_index_pattern": "{{leader_index}}" +} +``` + +The follower's index name matches the leader's, so a failover is transparent to clients. To +bound catch-up like a replication slot, the operator sets the leader indices' +`index.soft_deletes.retention_lease.period` so a follower that falls behind can still resume; +a follower past retention forces a full re-follow. + +## Connecting and indexing + +A DC-DR Elasticsearch exposes one user-facing **search/index endpoint** (client port 9200) +that resolves to the active cluster's nodes. Clients always connect to that single endpoint +and reach the write cluster without reconfiguration. Because CCR keeps index names identical +on both clusters, after the endpoint flips clients keep using the same indices. + +```bash +# Index into the active cluster through the single search/index endpoint: +$ curl -k -u "admin:$PASSWORD" -X POST \ + "https://es-dcdr.demo.svc:9200/orders/_doc" \ + -H 'Content-Type: application/json' -d '{"id": 1, "item": "book"}' +``` + +Only the active cluster accepts client writes. If clients somehow reach a standby cluster, +its indices are follower (read-only) indices and reject the write (see the fence below), +which is the split-brain guard. + +### Reads on the standby + +The standby cluster's follower indices are queryable read-only, so you can serve local +reads from the standby DC. 
After a failover or switchover, the promoted indices on the new +active accept writes again under identical names, so clients reconnecting through the flipped +endpoint keep working. + +## Monitoring and observability + +### status.disasterRecovery + +The single CR carries the whole cross-DC view: + +```bash +$ kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +| Field | Meaning | +| --- | --- | +| `activeDC` | The DC that holds the Lease and takes client writes. | +| `phase` | `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. | +| `lastTransitionTime` | When `activeDC` last changed. | +| `dataCenters[].clusterName` | The data center, by its OCM managed cluster name. | +| `dataCenters[].role` | `Member` or `Arbiter`. | +| `dataCenters[].writable` | True only for the active cluster. | +| `dataCenters[].nodesReady` | Ready node count in that DC's cluster. | +| `dataCenters[].followLagOps` | The standby's CCR follow lag in operations (`leader_global_checkpoint` minus `follower_global_checkpoint`, summed across follower indices). | +| `dataCenters[].healthy` | DC health: a Member DC is healthy when its nodes are ready; the Arbiter DC is healthy when its `dr-controlplane` etcd member is reachable (so an Arbiter reports `healthy: true` with `nodesReady: 0`). | + +### CCR follow lag + +Cross-DC lag comes from CCR's own follow-stats: `leader_global_checkpoint` minus +`follower_global_checkpoint` per follower index, a count of operations the follower has not +yet applied. The hub surfaces this into `followLagOps`; there is no lag field in the base +`ElasticsearchStatus`. 
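The lag arithmetic described above can be sketched in a few lines. This is an illustration only, not the hub's code: the function name `follow_lag_ops` is hypothetical, the input is assumed to be the parsed JSON of a `GET /_ccr/stats` response, and summing per shard (rather than per index) is an assumption about how the hub aggregates.

```python
def follow_lag_ops(ccr_stats):
    """Sum leader_global_checkpoint - follower_global_checkpoint over follower shards."""
    total = 0
    for index in ccr_stats.get("follow_stats", {}).get("indices", []):
        for shard in index.get("shards", []):
            lag = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
            # Defensive clamp: treat any negative reading as zero lag.
            total += max(lag, 0)
    return total


# Made-up sample shaped like the follow_stats section of a /_ccr/stats response.
sample = {
    "follow_stats": {
        "indices": [
            {"index": "orders", "shards": [
                {"leader_global_checkpoint": 1000, "follower_global_checkpoint": 900},
            ]},
            {"index": "logs", "shards": [
                {"leader_global_checkpoint": 500, "follower_global_checkpoint": 472},
            ]},
        ]
    }
}
print(follow_lag_ops(sample))  # prints 128
```

A result of `0` means every follower shard has applied everything its leader has acknowledged, which is the condition a drained switchover waits for.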
+ +```bash +# Raw CCR follow-stats on the standby cluster: +$ curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-b.demo.svc:9200/_ccr/stats" | jq +``` + +### Useful checks + +```bash +# Which DC the Lease intends as active (from the coordination plane): +$ kubectl --kubeconfig -n dc-failover get lease primary-dc \ + -o jsonpath='{.spec.holderIdentity}' + +# Per-DC nodes and roles: +$ kubectl get pods -n demo -l app.kubernetes.io/instance=es-dcdr \ + -L kubedb.com/role,open-cluster-management.io/cluster-name + +# Follow lag per DC from status: +$ kubectl get elasticsearch -n demo es-dcdr \ + -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} lag={.followLagOps} healthy={.healthy}{"\n"}{end}' +``` + +## The follower-read-only fence + +A non-active cluster must refuse client writes, fail closed: a cluster that cannot confirm it +holds the Lease keeps its indices as read-only followers. The fence is the CCR follower model +itself. + +- **Follower indices are read-only.** While a cluster is standby, its indices are CCR + followers, and Elasticsearch rejects writes to a follower index. There are no writable + leader indices on the standby until the operator promotes them, and it only promotes when + the Lease moves. +- **Fail closed on the Lease.** A cluster that cannot confirm it holds the primary-DC Lease + never promotes its followers, so it stays read-only. A partitioned old-active cluster that + loses its Lease renewal stops being writable on its own, before the hub reacts. + +Two rules keep the fence from breaking replication: + +- **Never promote followers without the Lease.** Promotion (`pause_follow`, `unfollow`, + convert to writable) is the only thing that lifts the fence, and only the hub does it, only + on a Lease change. A cluster never self-promotes. +- **Never run CCR both directions for the same index.** CCR is one-directional per index. 
The + operator disables the old direction's auto-follow and promotes the old followers before (or + atomically with) enabling the new direction, so an index is never a leader and a follower at + once. + +## Planned switchover (drained, zero document loss) + +Move the active DC on purpose by annotating the Elasticsearch: + +```bash +$ kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The hub then: + +1. checks the target is a known, healthy DC within the CCR follow-lag budget; +2. sets `phase: FailingOver` and quiesces indexing on the active cluster; +3. waits for CCR to drain to zero follow lag, so the target has every operation; +4. pauses and promotes the target's follower indices (`pause_follow`, `unfollow`, convert to + writable), flips the search/index endpoint to the target, and starts CCR in the reverse + direction (auto-follow on the old active once it returns), never both directions at once; +5. moves the Lease to the target. + +Because CCR fully drained before the flip, no acknowledged document is lost. This is a +hub-driven annotation, not an `ElasticsearchOpsRequest` type: the engine-aware quiesce and CCR +drain run in the hub, not in the engine-agnostic `dr-controlplane`. + +## Failback + +Failback is not a rewind. When a failed DC returns, it becomes the CCR follower of the new +active via auto-follow, and catches up. The operations it accepted but never followed before +the failover are a forked tail Elasticsearch cannot rewind. For correctness: + +- **reconcile the forked tail out of band or re-seed the affected indices** from the new + active (delete the returned cluster's copy of an affected index and let auto-follow re-seed + it from scratch), or +- **accept and document the forked tail** as bounded loss. 
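To decide which of the options above applies to which index, it can help to size the forked tail per index. The sketch below is purely illustrative: the guide does not say how these checkpoints are captured, so both input dictionaries (`old_active_checkpoints` and `followed_checkpoints`) and the function name `forked_tails` are hypothetical. It assumes Elasticsearch's convention that a global checkpoint of `-1` means no operations.

```python
def forked_tails(old_active_checkpoints, followed_checkpoints):
    """Map each index to the number of operations the old active accepted
    that were never followed before the failover."""
    tails = {}
    for index, old_cp in old_active_checkpoints.items():
        # An index that was never followed at all is treated as checkpoint -1.
        followed_cp = followed_checkpoints.get(index, -1)
        unfollowed = old_cp - followed_cp
        if unfollowed > 0:
            tails[index] = unfollowed
    return tails


# Hypothetical values: checkpoints on the returned DC versus what the
# standby had followed at the moment it was promoted.
old_active = {"orders": 1040, "audit": 77, "tmp": 4}
followed = {"orders": 1000, "audit": 77}
print(forked_tails(old_active, followed))  # prints {'orders': 40, 'tmp': 5}
```

Indices absent from the result have no forked tail and can simply be re-followed; the others are candidates for out-of-band reconciliation, re-seeding, or documented loss.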
+ +Once the returned DC is caught up (low follow lag), a drained planned switchover returns the +active DC: + +```bash +$ kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to=dc-a +``` + +## Scaling and day-2 operations + +The standard `ElasticsearchOpsRequest` operations (`UpdateVersion`, `HorizontalScaling`, +`VerticalScaling`, `VolumeExpansion`, `Restart`, `Reconfigure`, `ReconfigureTLS`, +`RotateAuth`) apply to a DC-DR cluster. They act on the per-DC Elasticsearch clusters. +Horizontal scaling operates per DC (each Member DC's cluster scales its own nodes and handles +master-quorum membership intra-DC), so a scaling request targets the data centers rather than +a single flat node set. + +There is no failover ops type: unplanned failover is driven by the Lease, and the planned +switchover is the `dr.kubedb.com/switchover-to` annotation, not an ops request. + +> **Note:** the distributed Elasticsearch substrate and the DC-DR layer are net-new for +> Elasticsearch. Treat the field names and flows in this guide as the intended user +> experience; confirm availability in your release before relying on them in production. + +## Deletion and cleanup + +```bash +$ kubectl delete elasticsearch -n demo es-dcdr +``` + +Per `deletionPolicy`, the operator removes the per-DC Elasticsearch clusters, the +remote-cluster registrations and auto-follow patterns, and the cluster-scoped per-DC +`PlacementPolicies` it generated (these carry no owner reference, so the operator deletes them +explicitly). The user-provided base `PlacementPolicy` is left for you to delete. + +## Limitations + +- **CCR licensing.** CCR is an Elastic Platinum/Enterprise feature, so the licensed + Elasticsearch image must support it. OpenSearch uses its own cross-cluster replication + plugin (OSS) for the equivalent leader/follower model. Confirm the feature is available in + your image before relying on DC-DR. 
+- **No zero-RPO on an unplanned failover.** CCR is asynchronous, so an unplanned active-DC + loss loses the un-followed tail (bounded by CCR follow lag). Use a drained planned + switchover for a zero-document-loss move. +- **No rewind on failback.** A returned old-active cluster's un-followed forked tail cannot be + rewound. Reconcile it out of band, re-seed the affected indices, or accept the forked tail + as bounded loss. +- **Two data DCs only.** Active/passive CCR is inherently two-cluster. Three or more data DCs + (fan-out following, three-way failover) is a separate, larger design. +- **Cross-DC reachability is required.** Elasticsearch nodes need routable cross-DC access to + the transport port (9300, for the remote-cluster seeds) and the client endpoint (9200), so + flat pod networking (KubeSlice) or external listeners are required. diff --git a/docs/guides/elasticsearch/dr/overview/index.md b/docs/guides/elasticsearch/dr/overview/index.md new file mode 100644 index 000000000..0d07a8ca7 --- /dev/null +++ b/docs/guides/elasticsearch/dr/overview/index.md @@ -0,0 +1,306 @@ +--- +title: DC-DR Overview +menu: + docs_{{ .version }}: + identifier: es-dr-overview-elasticsearch + name: Overview + parent: es-dr-elasticsearch + weight: 10 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Cross Data Center Disaster Recovery (DC-DR) for Elasticsearch + +KubeDB can run a single distributed `Elasticsearch` across two data centers (DCs) so an +Elasticsearch workload survives the loss of an entire data center. Exactly one DC is the +active write cluster at any instant. The other DC runs a self-contained standby cluster +whose indices are asynchronous Cross-Cluster Replication (CCR) followers of the active +cluster's leader indices. When the active DC is lost, the follower indices on the standby +are promoted to writable, the single search/index endpoint is flipped to the standby, and +clients continue against identical index names. 
+ +This page is the conceptual overview and a quick start. See also: + +- [DC-DR User Guide](/docs/guides/elasticsearch/dr/guide/index.md) for every aspect of + running in DC-DR mode (components, the naming contract, connecting, monitoring, the + follower-read-only fence, switchover, failback, day-2 ops). +- [DC-DR Runbook](/docs/guides/elasticsearch/dr/runbook/index.md) for what to do in each + operational scenario. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Why Elasticsearch DC-DR is its own camp + +Most KubeDB engines have a single writable primary and a leader-to-leader replication +stream. Postgres promotes a survivor, MongoDB elects a new primary, and the endpoint +follows the writable node. **Elasticsearch has none of that.** + +Elasticsearch is not a single-writer database. It is a cluster of per-shard primaries with +a master-eligible voting quorum: each index is split into shards, every shard has its own +primary and replicas, and a master-eligible quorum handles cluster state and shard +allocation intra-cluster. There is no cluster-wide write primary, and the master quorum +already handles leadership inside one cluster. So the single-primary DR pattern (one +writable leader, a leader-to-leader stream, promote the survivor) does not map. For +Elasticsearch: + +- **Cross-DC replication is asynchronous Cross-Cluster Replication (CCR), not a leader + stream.** The standby cluster registers the active cluster as a remote cluster and + creates **follower indices** (via **auto-follow patterns** for new indices) that pull + operations from the active cluster's **leader indices** asynchronously. CCR is + configured through the Elasticsearch CCR REST API. There is no new replication engine to + build. +- **The active DC is a write-endpoint routing decision, not an engine state.** Clients + index into whichever cluster the endpoint is pointed at. 
DR is active/passive: one + cluster takes writes, CCR follows them onto the standby, and on failover the endpoint is + redirected. The `dr-controlplane` Lease decides which cluster is the write target; the + follower-read-only fence stops writes to a non-active cluster. +- **There is no rewind and no zero-RPO.** CCR is asynchronous, so an unplanned failover + loses the un-followed tail (bounded by CCR follow lag), and a returned old-active cluster + may hold documents that were never followed. Elasticsearch cannot rewind an index. + Failback re-follows the returned cluster from the new active and reconciles or accepts + the un-followed tail as bounded loss. + +So Elasticsearch DC-DR is two independent Elasticsearch clusters (one per Member DC), each +with its own intra-DC master quorum, joined by asynchronous CCR, with the Lease choosing +the write cluster and the follower-read-only fence preventing split writes. + +## How it works + +DC-DR for Elasticsearch rests on five rules. + +- **The master quorum stays intra-DC.** Each Member DC runs its own self-contained + Elasticsearch cluster: its own master-eligible voting quorum, its own data nodes, its own + per-shard primaries and replicas. The master quorum never crosses the DC boundary, so + inter-DC latency or a partition can never flap master election or stall shard allocation. + There is no cross-DC Elasticsearch voter. +- **The active cluster is chosen only by the `dr-controlplane` primary-DC Lease.** A small + control plane, backed by a three-site etcd quorum, publishes one Lease per failover scope. + The Lease holder is the active write DC. Exactly one cluster is the write target at any + instant. +- **Cross-DC replication is CCR, active to standby.** The standby cluster registers the + active cluster as a remote cluster (`cluster.remote..seeds` at the active's + transport endpoint, port 9300) and runs auto-follow patterns that create follower indices + for the active's leader indices. 
Each follower index pulls its operations once across the + WAN from the active; within the standby cluster the follower's own replica shards fan out + intra-DC (the WAN one-copy rule). CCR is one-directional, active to standby, and the + operator owns the follow direction so the two directions never overlap. A failover pauses + and promotes the followers on the new active and starts CCR in the reverse direction, and + never runs CCR both directions for the same index at once. +- **Writability is gated by the Lease and fenced locally, fail closed.** A follower index + is inherently read-only, which is the fence. A non-active cluster refuses client writes to + would-be leader indices unless it holds the Lease; a cluster that cannot confirm it holds + the Lease stays read-only-follower. This local fence, plus the etcd majority, is the + split-brain guarantee. Without it a partitioned old-active cluster that still sees clients + would keep accepting writes that never follow, diverging the two clusters. +- **One search/index endpoint follows the active cluster.** The single user-facing endpoint + (client port 9200) resolves to the active cluster's nodes (selected by the Lease), so + clients always reach the write cluster. Because CCR keeps index names identical on both + clusters, after the endpoint flips clients keep working against the same indices. + +> **Why never both directions at once?** CCR is one-directional per index: a leader index +> is followed by a follower index. If both directions were enabled for the same index, the +> two clusters would each try to follow the other and the data would ping-pong. A failover +> therefore pauses and promotes the old followers before (or atomically with) starting CCR +> in the new direction. 
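The one-direction-per-index rule can be expressed as a simple invariant check. This is a sketch under stated assumptions: the tuple layout `(follower_cluster, leader_cluster, index)` and the function name `bidirectional_indices` are hypothetical, chosen only to make the rule concrete.

```python
def bidirectional_indices(follows):
    """Return the indices that are followed in both directions at once.

    follows: iterable of (follower_cluster, leader_cluster, index) tuples.
    """
    seen = set(follows)
    return sorted({
        index
        for follower, leader, index in seen
        if (leader, follower, index) in seen
    })


follows = [
    ("dc-b", "dc-a", "orders"),  # dc-b follows dc-a for "orders"
    ("dc-a", "dc-b", "orders"),  # and dc-a follows dc-b: a violation
    ("dc-b", "dc-a", "logs"),    # one direction only: fine
]
print(bidirectional_indices(follows))  # prints ['orders']
```

An empty result is the state the operator is described as maintaining, including during the window in which a failover reverses the direction.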
+ +### Data center roles + +Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`: + +| Role | Holds Elasticsearch | Purpose | +| --- | --- | --- | +| **Member** | yes | A self-contained Elasticsearch cluster with its own master quorum. One Member is the active write cluster; the other is the CCR follower while standby. | +| **Arbiter** | no | The arbiter DC. Holds only the `dr-controlplane` etcd member and never Elasticsearch, because Elasticsearch has no cross-DC voter. Supplies the tie-break etcd vote. | + +## The single-CR, single-endpoint model + +The user creates **one** distributed `Elasticsearch` object (with `spec.distributed` and a +`PlacementPolicy` carrying `distributionRules` and a `failoverPolicy`) and gets **one** +search/index endpoint. The operator expands the CR across the Member DCs: + +- one self-contained **Elasticsearch cluster per Member DC**, each with its own intra-DC + master quorum; +- a **remote-cluster registration plus auto-follow patterns** on the standby cluster, so + the standby's follower indices track the active's leader indices; +- the **Lease-gated search/index endpoint** that resolves to the active cluster. + +The single CR's `status.disasterRecovery` carries the whole cross-DC view: the active DC, +each cluster's node health, the CCR follow lag, and the DR phase. + +> **Scope.** This spec targets the even two-data-DC layout (two Member DCs plus an Arbiter +> DC). Active/passive CCR is inherently two-cluster, so spanning three or more data DCs +> (fan-out following and a three-way failover) is a separate, larger design and is out of +> scope here. + +## Prerequisites + +- A distributed Elasticsearch substrate: an Open Cluster Management (OCM) hub and spoke + clusters, and **flat cross-DC pod networking (KubeSlice) or external listeners**. 
+ Elasticsearch nodes need cross-DC reachability for the **transport port (9300)** (for the + remote-cluster seeds CCR uses) and the **client endpoint (9200)**, so routable + connectivity between the clusters is part of the DC-DR setup. +- The `dr-controlplane` service and its three-site etcd quorum installed across the data + centers, with a `dr-controlplane` agent in each spoke (DC). The third etcd member sits in + the Arbiter DC. +- The KubeDB Elasticsearch operator started with the DC-DR flags (coordination kubeconfig + and the operator's local DC name). +- A **CCR-capable image**. CCR is an Elastic **Platinum/Enterprise** feature, so confirm the + licensed Elasticsearch image supports it. For OpenSearch, use OpenSearch's own + cross-cluster replication plugin (OSS), which provides the equivalent leader/follower model. +- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster name, + the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the pod label + `open-cluster-management.io/cluster-name`, and the `PlacementPolicy` + `distributionRule.clusterName`. Keep them identical. + +## Deploy a DC-DR Elasticsearch + +### 1. PlacementPolicy + +Assign global pod ordinals to data centers and tag each DC with its role. 
Here two Member +DCs (`dc-a`, `dc-b`) each hold a three-node Elasticsearch cluster, and `dc-c` is the +Arbiter DC: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: es-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + mode: TwoDC + trigger: + scope: Global + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1, 2] + - clusterName: dc-b + role: Member + replicaIndices: [3, 4, 5] + - clusterName: dc-c + role: Arbiter + replicaIndices: [] +``` + +- A data-bearing **Member** rule carries `replicaIndices` mapping its ordinals to a + self-contained Elasticsearch cluster with its own master quorum. The **Arbiter** DC + carries an empty `replicaIndices` and holds no Elasticsearch, only the `dr-controlplane` + etcd member. +- `failoverPolicy.mode: TwoDC` expects two Member DCs plus the Arbiter DC. +- `failoverPolicy.trigger.scope: Global` makes this one cluster-wide failover scope. + +### 2. Elasticsearch + +Reference the `PlacementPolicy` and opt the Elasticsearch into DC-DR expansion: + +```yaml +apiVersion: kubedb.com/v1 +kind: Elasticsearch +metadata: + name: es-dcdr + namespace: demo +spec: + version: xpack-8.19.9 + distributed: true + enableSSL: true + replicas: 6 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: es-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +`spec.replicas: 6` is the total node count across both Member DCs. The `PlacementPolicy` +`replicaIndices` split it into a three-node cluster in `dc-a` (ordinals 0, 1, 2) and a +three-node cluster in `dc-b` (ordinals 3, 4, 5). 
The operator expands this into one +self-contained Elasticsearch cluster in `dc-a` and one in `dc-b`, registers each as a +remote of the other, and enables auto-follow patterns on the standby DC's cluster so its +follower indices track the active cluster's leader indices. + +## Observe the DC-DR state + +The single `Elasticsearch` object's `status.disasterRecovery` carries the whole cross-DC +view: + +```bash +$ kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +```json +{ + "activeDC": "dc-a", + "phase": "Steady", + "lastTransitionTime": "2026-06-30T10:00:00Z", + "dataCenters": [ + { "clusterName": "dc-a", "role": "Member", "writable": true, "nodesReady": 3, "followLagOps": 0, "healthy": true }, + { "clusterName": "dc-b", "role": "Member", "writable": false, "nodesReady": 3, "followLagOps": 128, "healthy": true }, + { "clusterName": "dc-c", "role": "Arbiter", "writable": false, "nodesReady": 0, "followLagOps": 0, "healthy": true } + ] +} +``` + +- `activeDC` is the DC that holds the Lease and takes client writes. +- `phase` is `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. +- Each `dataCenters` entry reports the DC role, whether it is the writable cluster, how many + nodes are ready, its CCR follow lag in operations (the standby's replication backlog + behind the active), and its health. + +## Unplanned failover + +When the active DC is lost, the standby is already a near-current CCR follower. The +orchestrator observes the Lease move to the standby, pauses and promotes the standby's +follower indices (`pause_follow`, `unfollow`, convert to regular writable indices), flips +the search/index endpoint to the standby, and starts CCR in the reverse direction (auto-follow +on the old active's cluster once it returns). `status.disasterRecovery.phase` moves to +`FailingOver` and back to `Steady`. 
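The promotion step maps onto a short sequence of Elasticsearch REST calls per follower index. The sketch below only builds that sequence; it does not send anything. The function name `promotion_requests` is hypothetical, and the `_close` and `_open` calls are not named in this guide: they are included because Elasticsearch's unfollow API requires the follower index to be paused and closed first. Whether the operator issues exactly these calls is an assumption.

```python
def promotion_requests(index):
    """Return the (method, path) calls that turn a follower index into a regular index."""
    return [
        ("POST", f"/{index}/_ccr/pause_follow"),  # stop pulling from the leader
        ("POST", f"/{index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{index}/_ccr/unfollow"),      # drop the follower metadata
        ("POST", f"/{index}/_open"),              # reopen as a regular, writable index
    ]


for method, path in promotion_requests("orders"):
    print(method, path)
```

Keeping the sequence as data makes the ordering easy to review separately from whatever client actually sends the requests.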
+ +The RPO is the un-followed CCR tail: operations the active cluster accepted but had not yet +followed onto the standby when it died are lost. There is no rewind. + +## Planned switchover (drained, zero document loss) + +To move the active DC on purpose without losing documents, annotate the Elasticsearch with +the target DC: + +```bash +$ kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The orchestrator quiesces indexing on the active cluster, waits for CCR to drain to zero +follow lag (so the target has every operation), then pauses and promotes the target's +follower indices, flips the search/index endpoint, and starts CCR in the reverse direction. +Because CCR has fully drained before the flip, no acknowledged document is lost. The Lease +then follows to `dc-b`. + +## Failback + +Failback is not a rewind. A returned old-active cluster becomes the CCR follower of the new +active (auto-follow). Operations it accepted but never followed before the failover are a +forked tail Elasticsearch cannot rewind. For correctness, re-follow the returned cluster +from the new active and reconcile the forked tail out of band, or re-seed the affected +indices, or accept and document the forked tail as bounded loss. Once the returned DC is +caught up, a drained planned switchover returns the active DC. + +## Cleanup + +```bash +$ kubectl delete elasticsearch -n demo es-dcdr +$ kubectl delete placementpolicy es-dcdr +``` + +Deleting the `Elasticsearch` removes the per-DC Elasticsearch clusters, the remote-cluster +registrations and auto-follow patterns, and the generated cluster-scoped per-DC +`PlacementPolicies` (which carry no owner reference, so the operator deletes them +explicitly). The user-provided base `PlacementPolicy` is left for you to delete. 
diff --git a/docs/guides/elasticsearch/dr/runbook/index.md b/docs/guides/elasticsearch/dr/runbook/index.md new file mode 100644 index 000000000..0558656e9 --- /dev/null +++ b/docs/guides/elasticsearch/dr/runbook/index.md @@ -0,0 +1,350 @@ +--- +title: DC-DR Runbook +menu: + docs_{{ .version }}: + identifier: es-dr-runbook-elasticsearch + name: Runbook + parent: es-dr-elasticsearch + weight: 30 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Elasticsearch DC-DR Runbook + +Scenario-by-scenario procedures for operating an Elasticsearch workload in cross data center +disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB does +**automatically**, how to **verify**, and the **action** to take. + +Read the [User Guide](/docs/guides/elasticsearch/dr/guide/index.md) for the concepts and +commands referenced here. Throughout, `` is the coordination control plane kubeconfig, +and `es-dcdr`/`demo` are the example database and namespace. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Quick reference + +```bash +# Active DC, phase, and per-DC view: +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery}' | jq + +# Lease holder (the DC the coordination plane makes the active write cluster): +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' + +# Per-DC nodes, roles, and DCs: +kubectl get pods -n demo -l app.kubernetes.io/instance=es-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name + +# CCR follow lag per DC from status: +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} lag={.followLagOps}{"\n"}{end}' +``` + +Golden rules: + +- **The Lease decides the active write cluster.** Exactly one DC is `writable: true` in + `status.disasterRecovery` at any instant. 
+- **The follower-read-only fence fails closed.** A cluster that cannot confirm it holds the + Lease keeps its indices as read-only followers and never self-promotes, so a partitioned + old-active cluster stops taking writes on its own. +- **CCR is asynchronous.** An unplanned failover loses the un-followed tail (bounded by CCR + follow lag). Only a drained planned switchover is zero document loss. +- **Never run CCR both directions for the same index at once**, and never promote followers + without the Lease. + +--- + +## 1. Intra-DC node loss (single node in a DC fails) + +**Symptoms:** one Elasticsearch node pod in a DC is down; some shards briefly re-allocate or +re-elect a primary within that cluster. + +**Automatic:** the loss is handled entirely inside that DC's Elasticsearch cluster. The +master quorum re-allocates shards and promotes surviving replica shards to primary, and the +pod reschedules. There is **no cross-DC effect**: the active DC stays active, the standby +stays a follower, and the Lease does not move. + +**Verify:** + +```bash +kubectl get pods -n demo -l app.kubernetes.io/instance=es-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # unchanged +``` + +**Action:** none required. Ensure the failed node rejoins its cluster. Provision user indices +with a replica count that tolerates one node loss inside a DC (so shards stay green). + +--- + +## 2. Full active-DC loss (zone/cluster failure) + +**Symptoms:** the active DC's nodes are gone/unreachable; writes to the search/index endpoint +fail briefly. + +**Automatic:** the standby is already a near-current CCR follower. 
The `dr-controlplane` Lease +moves to the standby, and the orchestrator pauses and promotes the standby's follower indices +(`pause_follow`, `unfollow`, convert to writable), flips the search/index endpoint to the +standby's nodes, and starts CCR in the reverse direction (auto-follow on the old active for +when it returns). `phase` moves from `FailingOver` to `Steady` and the survivor becomes +`writable: true`. + +**Verify:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # the survivor +kubectl get pods -n demo -l app.kubernetes.io/instance=es-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name +``` + +**Action:** none required for availability. Note the RPO: operations the old active accepted +but the standby had not yet followed are lost (bounded by the CCR follow lag at the moment of loss). There +is no rewind. When the failed DC returns, see scenario 6 (failback). + +--- + +## 3. Clean DC-vs-DR partition (both DCs up, cannot reach each other) + +**Symptoms:** the two data DCs are up but the network between them is cut. CCR follow lag on +the standby climbs; the old active may still see local clients. + +**Automatic:** the follower-read-only fence is the split-brain guard. The old active loses its +Lease renewal across the partition; because promotion requires the Lease, the standby cluster +**stays a read-only follower** and never self-promotes, and the old active keeps its role only +while it can confirm the Lease. The etcd majority (two sites plus the Arbiter DC) keeps or +grants the Lease to one side only, so exactly one cluster is ever writable and the two +clusters cannot diverge. + +**Verify there is exactly one writable DC:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' +``` + +**Action:** heal the network. Once the partition clears, the non-active cluster resumes as the +CCR follower and catches up.
If the follower fell past the leader's soft-delete retention it +forces a full re-follow (scenario 8). If both sides somehow took writes (should not happen with +a fail-closed fence), treat the non-active side's forked tail as scenario 6 (re-seed the +affected indices). + +--- + +## 4. Planned switchover (maintenance on the active DC) + +**Action:** + +```bash +kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +**Automatic:** the hub gates on the target's health and CCR follow-lag budget, then quiesces +indexing on the active cluster, waits for CCR to drain to zero follow lag (so the target holds +every operation), pauses and promotes the target's follower indices, flips the search/index +endpoint to the target, starts CCR in the reverse direction, and moves the Lease. Because CCR +fully drained before the flip, **zero acknowledged documents are lost**. + +**Verify:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # dc-b +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # Steady +``` + +**If it does not complete:** see scenario 8 (CCR follow lag high / switchover stuck). + +--- + +## 5. Indices not writable after a flip + +**Symptoms:** after a failover or switchover, writes to the new active cluster are rejected +because target indices are still follower (read-only) indices. + +**Cause:** the follower indices were not promoted (`pause_follow`, `unfollow`, convert to +writable), or promotion partially completed. + +**Diagnose:** + +```bash +# Are the target's indices still followers? +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-b.demo.svc:9200/_ccr/stats" | jq +# Which DC does status report as writable? 
+kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' +``` + +**Action:** confirm the Lease actually moved to the target (`activeDC` matches the Lease +`holderIdentity`). If the Lease moved but promotion did not finish, the hub retries on the next +reconcile; if it is stuck, ensure the target's follower indices are paused, unfollowed, and +converted to regular writable indices. Never promote a cluster that does not hold the Lease. + +--- + +## 6. Failback (return a recovered DC to active) + +**Symptoms:** the previously lost DC is back but holds a forked tail (operations it accepted +before the failover that were never followed). + +**Automatic:** the returned cluster becomes the CCR follower of the new active via auto-follow +and catches up. But the forked tail (operations that were never followed) sits on top of the +returned cluster's copy and Elasticsearch cannot rewind it. + +**Action:** + +1. **Reconcile or re-seed** the affected indices from the new active: delete the returned + cluster's copy of an affected index so auto-follow re-seeds it from scratch, removing the + forked tail. Or **accept and document** the forked tail as bounded loss. +2. Once the returned DC is caught up (low follow lag), perform a **drained planned switchover** + back to it: + + ```bash + kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to=dc-a + ``` + +There is no rewind; Elasticsearch cannot roll back an index, so the re-seed is what restores +consistency. + +--- + +## 7. A standby DC is lost + +**Symptoms:** the non-active DC's nodes are gone; that DC shows `healthy: false` and CCR follow +lag is unavailable. + +**Impact:** none on writes. The active DC keeps taking client writes; you lose the DR copy +until the standby returns. 
+ +**Verify the active DC is still writable:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName}{end}' +``` + +**Action:** recover the standby cluster's nodes. When it returns, auto-follow resumes following +from the active and the standby catches up. If it fell past the leader's soft-delete retention, +the affected follower indices force a full re-follow. You are running without DR protection +until then. + +--- + +## 8. CCR follow lag high (follower falling behind) + +**Symptoms:** `followLagOps` on the standby climbs; a planned switchover stays in `FailingOver` +because CCR has not drained. + +**Diagnose:** + +```bash +# Per-DC follow lag and health: +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} lag={.followLagOps} healthy={.healthy}{"\n"}{end}' +# Raw CCR follow-stats on the standby cluster: +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-b.demo.svc:9200/_ccr/stats" | jq +``` + +**Causes & action:** + +- **Cross-DC network bottleneck or indexing burst:** relieve the bottleneck (network, + active-cluster write load) so CCR drains. A planned switchover intentionally waits for the + lag to reach zero after quiescing indexing. +- **A follower past the leader's soft-delete retention:** the follower can no longer resume + from the retained history and forces a full re-follow. Raise + `index.soft_deletes.retention_lease.period` on the leader indices to bound catch-up, or + re-seed the affected follower indices. +- **Abort a stuck switchover:** remove the annotation to cancel: + `kubectl annotate elasticsearch -n demo es-dcdr dr.kubedb.com/switchover-to-`. Indexing on + the active DC resumes and it stays active. + +--- + +## 9. Arbiter DC lost + +**Symptoms:** the Arbiter DC is gone; its `dr-controlplane` etcd member is unreachable. + +**Impact:** none on writes. 
With the Arbiter DC lost, the two data DCs still hold +two of the three etcd members, a majority, so the Lease can still be renewed and the active DC +keeps taking writes. You lose the tie-break, so after a **second** failure etcd can no longer +keep a majority. + +**Verify the cluster is still writable:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName}{end}' +``` + +**Action:** restore the Arbiter DC's etcd member to regain single-fault tolerance. The Arbiter +DC never runs Elasticsearch, so no node recovery is involved. + +--- + +## 10. Coordination plane (dr-controlplane / etcd) unavailable + +**Symptoms:** the Lease cannot be read or renewed across the spokes. + +**Automatic:** the current active cluster keeps taking writes as long as it can still confirm +the Lease it last held; but a cluster that cannot confirm the Lease **fails closed** and keeps +its indices read-only, so a total coordination-plane outage can make the active cluster go +read-only. What you always lose is **failover and planned switchover**: the hub cannot move the +active DC until the etcd quorum returns. + +**Verify:** + +```bash +kubectl --kubeconfig -n dc-failover get lease primary-dc # error / stale +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # may be Degraded +``` + +**Action:** restore the `dr-controlplane` etcd quorum (its third member lives in the Arbiter +DC). Once the Lease is renewable, the active cluster keeps its writable indices and +failover/switchover resume. + +--- + +## 11. Which DC is active? + +**Question:** confirm which cluster is taking writes right now.
+ +```bash +# The DR status view (authoritative for what the hub applied): +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' +# The Lease holder (what the coordination plane intends): +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' +# Per-DC writable flag (exactly one should be true): +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' +``` + +In steady state the `activeDC`, the Lease `holderIdentity`, and the single `writable: true` DC +all name the same data center. A brief mismatch during `FailingOver` is expected; it should +converge back to `Steady`. + +--- + +## 12. Suspected split writes (two clusters taking writes) + +This should be impossible: the etcd majority grants the Lease to one DC only, and the +follower-read-only fence keeps any non-Lease-holder's indices read-only. If +`status.disasterRecovery` ever shows two `writable: true` DCs, or writes succeed against both +endpoints: + +**Diagnose immediately:** + +```bash +kubectl get elasticsearch -n demo es-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' +kubectl --kubeconfig -n dc-failover get lease primary-dc -o yaml +# Only one cluster (the non-Lease-holder) should report follower indices: +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-a.demo.svc:9200/_ccr/stats" | jq +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-b.demo.svc:9200/_ccr/stats" | jq +``` + +**Action:** confirm the non-Lease-holder's indices are follower (read-only) indices and that +only one CCR direction is enabled. Never run CCR both directions for the same index at once: an +index that is both a leader and a follower would ping-pong operations between clusters.
Demote +the wrong side back to followers (re-follow from the Lease holder), restore the fence, then +treat any forked tail as scenario 6 (re-seed). + +--- + +## Escalation checklist + +When unsure, collect: + +```bash +kubectl get elasticsearch -n demo es-dcdr -o yaml +kubectl --kubeconfig -n dc-failover get lease -o yaml +kubectl get pods -n demo -l app.kubernetes.io/instance=es-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name -o wide +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-a.demo.svc:9200/_ccr/stats" | jq +curl -k -u "admin:$PASSWORD" "https://es-dcdr-dc-b.demo.svc:9200/_ccr/stats" | jq +```
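
When triaging the collected output, the per-DC `writable` flags are usually the quickest signal of a split-write condition (scenario 12). The following is a minimal sketch, assuming the flags have been captured in the `dc-a=true dc-b=false` form that the runbook's jsonpath query prints. The helper names `count_writable` and `classify_writable` and the verdict strings are hypothetical illustrations, not part of KubeDB or `kubectl`.

```shell
#!/usr/bin/env bash
# Sketch only: summarize the "<clusterName>=<writable>" pairs printed by the
# runbook's jsonpath query. Helper names and verdict strings are illustrative.

# count_writable prints how many DCs report writable=true.
count_writable() {
  local flags="$1" count=0 entry
  # Intentionally unquoted: split the space-separated pairs into words.
  for entry in $flags; do
    case "$entry" in
      *=true) count=$((count + 1)) ;;
    esac
  done
  printf '%s\n' "$count"
}

# classify_writable maps the count to a verdict for the escalation notes.
classify_writable() {
  case "$(count_writable "$1")" in
    0) echo "no-writable-dc" ;;
    1) echo "ok" ;;
    *) echo "split-write-suspected" ;;
  esac
}

# Usage against a live cluster (needs kubectl access, so it is left commented out):
# flags=$(kubectl get elasticsearch -n demo es-dcdr \
#   -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}')
# classify_writable "$flags"
```

A verdict other than `ok` is only a prompt to look closer: zero writable DCs can be a legitimate transient during `FailingOver` or a fail-closed coordination-plane outage (scenario 10), so compare it against the Lease holder before acting.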