diff --git a/docs/guides/documentdb/_index.md b/docs/guides/documentdb/_index.md new file mode 100644 index 000000000..1be3aeb0d --- /dev/null +++ b/docs/guides/documentdb/_index.md @@ -0,0 +1,10 @@ +--- +title: DocumentDB +menu: + docs_{{ .version }}: + identifier: dm-documentdb-guides + name: DocumentDB + parent: guides + weight: 10 +menu_name: docs_{{ .version }} +--- diff --git a/docs/guides/documentdb/dr/_index.md b/docs/guides/documentdb/dr/_index.md new file mode 100644 index 000000000..db521550a --- /dev/null +++ b/docs/guides/documentdb/dr/_index.md @@ -0,0 +1,10 @@ +--- +title: Disaster Recovery +menu: + docs_{{ .version }}: + identifier: guides-documentdb-dr + name: DR + parent: dm-documentdb-guides + weight: 36 +menu_name: docs_{{ .version }} +--- diff --git a/docs/guides/documentdb/dr/guide/index.md b/docs/guides/documentdb/dr/guide/index.md new file mode 100644 index 000000000..39555d83b --- /dev/null +++ b/docs/guides/documentdb/dr/guide/index.md @@ -0,0 +1,379 @@ +--- +title: DC-DR User Guide +menu: + docs_{{ .version }}: + identifier: guides-documentdb-dr-guide + name: User Guide + parent: guides-documentdb-dr + weight: 20 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Running DocumentDB in DC-DR Mode: User Guide + +This guide covers every aspect of operating a distributed DocumentDB in cross data +center disaster recovery (DC-DR) mode: the components, the naming contract, +deployment, connecting, monitoring, replication and lag, timing and tuning, quorum +and roles, switchover and failback, scaling, day-2 operations, backup, and deletion. + +KubeDB `DocumentDB` is Microsoft DocumentDB (the `pg_documentdb` extension) on +PostgreSQL under the hood, so DC-DR reuses the PostgreSQL WAL streaming, the per-DC +`documentdb-coordinator` raft, and `pg_rewind` failback. 
+ +Read the [DC-DR Overview](/docs/guides/documentdb/dr/overview/index.md) +first for the architecture, and the +[DC-DR Runbook](/docs/guides/documentdb/dr/runbook/index.md) for +scenario-by-scenario procedures. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Components and where they run + +| Component | Runs in | Responsibility | +| --- | --- | --- | +| **`dr-controlplane`** + 3-site etcd quorum | across the data centers (an OCM control plane) | Publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease holder is the active (writable) DC. This is the single cross-DC failover authority. | +| **`dr-controlplane` agent** | each spoke (DC) | Contends for the primary-DC Lease on behalf of its DC and projects the Lease decision into the local spoke as a marker `ConfigMap`. | +| **KubeDB DocumentDB operator (hub)** | the OCM hub | Expands the `DocumentDB` CR into per-DC groups, watches the Lease, drives failover/switchover, and writes `status.disasterRecovery`. | +| **`documentdb-coordinator`** | every DocumentDB pod | Runs the per-DC raft, reads the local marker, and fences its leader read-only when its DC is not active. | +| **KubeSlice** | each spoke | Provides the cross-DC pod network so a standby DC's leader can stream from the active DC's leader. | + +The marker `ConfigMap` is the contract between the agent (producer) and the +coordinator (consumer): + +``` +ConfigMap primary-dc (namespace: dc-failover, on each spoke) + data.activeDC = the DC the quorum currently trusts as primary + data.renewTime = RFC3339, the observed primary-DC Lease renewTime + data.quiesce = the DC asked to hold read-only for a planned switchover (else empty) +``` + +The coordinator trusts the marker for 30s (the fence TTL); absent, stale, +unparseable, or naming another DC all mean *not active* and the leader stays +read-only. This is the fail-closed fence. + +## The DC-name contract + +One string identifies a data center everywhere. 
**Keep these identical:** + +- the OCM spoke cluster name +- the agent `--dc-name` +- the primary-DC Lease `holderIdentity` +- the marker `data.activeDC` +- the pod label `open-cluster-management.io/cluster-name` +- the `PlacementPolicy` `distributionRule.clusterName` + +## Operator configuration + +Start the DocumentDB operator with: + +``` +--dc-dr-enabled +--dc-dr-coord-kubeconfig= +--dc-dr-local-dc= +``` + +The per-DC pod coordinators automatically receive `DC_DR_ENABLED`, `DC_NAME`, +`DC_DR_NAMESPACE` (default `dc-failover`), and `DC_DR_MARKER` (default `primary-dc`) +through their PetSet template, so the fence works without extra wiring. + +## Deploying + +### PlacementPolicy + +Map the global pod ordinals to data centers and tag each DC with its role: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: docdb-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global # one cluster-wide failover scope (or Group + a group name) + mode: TwoDC # TwoDC: 2 Members + a tie-breaker; ThreeDC: 3 Members + distributionRules: + - clusterName: dc-east + role: Member + replicaIndices: [0, 1, 2] + - clusterName: dc-west + role: Member + replicaIndices: [3, 4, 5] + - clusterName: dc-arbiter + role: Arbiter +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** witness DC + (vote only, no DocumentDB) carries none. (The petset `Witness` role, a data-bearing + witness, is for engines like MongoDB and is not used by DocumentDB.) +- `mode: TwoDC` expects exactly two Member DCs plus the Arbiter witness DC; + `ThreeDC` expects at least three Member DCs. 
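The role and mode rules above can be expressed as a small pre-flight check. The following Python sketch is illustrative only: `Rule`, `validate_topology`, and the message strings are hypothetical names and not part of KubeDB, and reading "plus the Arbiter witness DC" as exactly one `Arbiter` rule is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One distributionRules entry (hypothetical in-memory form)."""
    cluster_name: str
    role: str                                   # "Member" or "Arbiter"
    replica_indices: list = field(default_factory=list)


def validate_topology(mode, rules):
    """Return a list of problems; an empty list means the layout matches the rules."""
    problems = []
    members = [r for r in rules if r.role == "Member"]
    arbiters = [r for r in rules if r.role == "Arbiter"]

    # Member rules are data-bearing and carry replicaIndices; Arbiter rules carry none.
    for r in members:
        if not r.replica_indices:
            problems.append(f"{r.cluster_name}: Member rule needs replicaIndices")
    for r in arbiters:
        if r.replica_indices:
            problems.append(f"{r.cluster_name}: Arbiter rule must not carry replicaIndices")

    if mode == "TwoDC":
        # Assumption: exactly two Members plus exactly one Arbiter witness DC.
        if len(members) != 2 or len(arbiters) != 1:
            problems.append("TwoDC expects exactly two Member DCs plus one Arbiter witness DC")
    elif mode == "ThreeDC":
        if len(members) < 3:
            problems.append("ThreeDC expects at least three Member DCs")
    else:
        problems.append(f"unknown mode {mode!r}")
    return problems


example_rules = [
    Rule("dc-east", "Member", [0, 1, 2]),
    Rule("dc-west", "Member", [3, 4, 5]),
    Rule("dc-arbiter", "Arbiter"),
]
print(validate_topology("TwoDC", example_rules))
```

A check like this only mirrors the documented expectations; the operator's own validation remains the authority on what is accepted.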
+ +### DocumentDB + +```yaml +apiVersion: kubedb.com/v1alpha2 +kind: DocumentDB +metadata: + name: docdb-dcdr + namespace: demo + annotations: + dr.kubedb.com/enabled: "true" # opt into per-DC DC-DR expansion + # dr.kubedb.com/failover-group: payments # optional: a Group failover scope + # dr.kubedb.com/switchover-max-lag-bytes: "16777216" # optional lag budget override +spec: + version: "pg17-0.109.0" + replicas: 6 + distributed: true + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: docdb-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +The cross-DC standby behavior follows the CR's `spec.standbyMode` (`Hot` or `Warm`) +and `spec.streamingMode` (`Synchronous` or `Asynchronous`); the cross-DC links are +asynchronous by design regardless. WAL retention and force-failover budgets come from +`spec.replication` (`DocumentDBReplication`: `walLimitPolicy`, `walKeepSize`, +`forceFailoverAcceptingDataLossAfter`), and per-DC raft elections from +`spec.leaderElection` (`DocumentDBLeaderElectionConfig`). + +### What the operator creates + +Per data-bearing DC ``: + +- a per-DC `PetSet` `-` (e.g. `docdb-dcdr-dc-east`) with its own intra-DC raft; +- a DC-local headless governing `Service` so the DC's pods discover only each other; +- a cluster-scoped per-DC `PlacementPolicy` `-` pinning that group to the DC; +- a per-DC arbiter `PetSet` `--arbiter` when that DC's local node count is even. + +The witness DC (`role: Arbiter`) runs no DocumentDB pods. All per-DC pods carry the offshoot selectors +plus the `open-cluster-management.io/cluster-name` label, so the global primary/standby +Services and the single `AppBinding` keep working. 
+ +## Connecting + +A DC-DR DocumentDB exposes the same single endpoint as any KubeDB DocumentDB: + +- the **primary Service** `` resolves to the active DC's writable leader (only + that leader is labeled `kubedb.com/role: primary`); +- the **standby Service** `-standby` resolves to the read-only leaders; +- one **`AppBinding`** `` for applications and KubeDB integrations. + +Because only the active DC's leader carries the `primary` label, the endpoint follows +failover automatically: applications keep using `` and reconnect after a +failover, landing on the new active DC. + +## Monitoring and observability + +### status.disasterRecovery + +The single CR carries the whole cross-DC view: + +```bash +$ kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +| Field | Meaning | +| --- | --- | +| `activeDC` | The DC that holds the Lease and runs the writable primary. | +| `phase` | `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. | +| `lastTransitionTime` | When `activeDC` last changed. | +| `dataCenters[].clusterName` | The data center, by its OCM managed cluster name. | +| `dataCenters[].role` | `primary` for the active DC's leader, else `standby`. | +| `dataCenters[].leader` | That DC's local raft leader pod. | +| `dataCenters[].writable` | True only for the active DC. | +| `dataCenters[].lagBytes` | The DC's cross-DC replication lag behind the active primary. | +| `dataCenters[].healthy` | Whether the DC has a ready pod. 
| + +### Useful checks + +```bash +# Which DC is active (from the coordination plane): +$ kubectl --kubeconfig -n dc-failover get lease primary-dc \ + -o jsonpath='{.spec.holderIdentity}' + +# The marker each spoke reads (run against a spoke): +$ kubectl -n dc-failover get configmap primary-dc -o yaml + +# Per-DC leaders and roles: +$ kubectl get pods -n demo -l app.kubernetes.io/instance=docdb-dcdr \ + -L kubedb.com/role,open-cluster-management.io/cluster-name + +# A standby DC leader stamps its lag here: +$ kubectl get pod -n demo -o jsonpath='{.metadata.annotations.kubedb\.com/dc-lag-bytes}' +``` + +## Replication, lag, and RPO + +- Cross-DC replication is **asynchronous** leader-to-leader WAL streaming. Within a + standby DC, the local followers **cascade** from their DC's leader, so each standby + DC opens exactly one cross-DC link. +- `lagBytes` is how far a DC's leader is behind the active primary, computed by that + DC's coordinator (the hub never opens cross-cluster SQL). It is the basis for the + RPO of an unplanned failover. +- A **planned switchover loses no committed rows** (zero RPO) because writes are + frozen and the target fully catches up before the handoff. An **unplanned failover** + may lose the last unreplicated bytes (bounded by the standby's lag at the moment the + active DC died). + +## Timing and tuning (RTO vs safety) + +DC-DR has one timing invariant that must hold for correctness: + +> **fence TTL + cross-DC clock skew < primary-DC Lease duration** + +The marker `renewTime` tracks the Lease's `renewTime`. A partitioned active DC +self-fences at `lastRenew + fence TTL`; a survivor can only acquire the expired Lease +at `lastRenew + LeaseDuration`. Keeping the fence TTL inside the Lease duration +guarantees the old active DC goes read-only **before** any new DC becomes writable, +so there is no split-brain window. 
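The invariant above is plain arithmetic, so it can be checked before changing either timer. The following is a minimal Python sketch, assuming all values are in seconds and that the clock-skew figure is one you measure yourself (the 5s used here is purely illustrative). The function names are hypothetical and not part of any KubeDB tooling.

```python
def fence_is_safe(fence_ttl_s, clock_skew_s, lease_duration_s):
    """True when fence TTL + cross-DC clock skew < primary-DC Lease duration."""
    return fence_ttl_s + clock_skew_s < lease_duration_s


def safety_margin_s(fence_ttl_s, clock_skew_s, lease_duration_s):
    """Seconds between the worst-case self-fence of the old active DC
    (lastRenew + fence TTL + skew) and the earliest moment a survivor
    can acquire the expired Lease (lastRenew + LeaseDuration)."""
    return lease_duration_s - (fence_ttl_s + clock_skew_s)


# A 30s fence TTL and a 45s Lease duration are the documented defaults;
# the 5s clock skew is an assumed example value.
print(fence_is_safe(30, 5, 45))
print(safety_margin_s(30, 5, 45))

# A fence TTL that meets the Lease duration violates the invariant.
print(fence_is_safe(45, 0, 45))
```

A positive margin means the old active DC is already read-only by the time any other DC can take the Lease; a zero or negative margin means the two windows can overlap.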
+ +Default values: + +| Parameter | Where | Default | +| --- | --- | --- | +| Fence TTL | documentdb-coordinator | 30s | +| Marker refresh interval | dr-controlplane agent | 5s | +| Primary-DC Lease duration | dr-controlplane agent (`--election-lease-duration`) | 45s | +| Lease renew deadline | dr-controlplane agent (`--election-renew-deadline`) | 30s | +| Lease retry period | dr-controlplane agent (`--election-retry-period`) | 2s | + +The failover **RTO floor** is roughly the Lease duration (the time a survivor waits to +acquire). To lower RTO, lower the Lease duration **and** the fence TTL together, +always preserving `fence TTL + skew < LeaseDuration`. The retry period must stay well +under the fence TTL so the holder restamps `renewTime` and the marker reads fresh in +normal operation. + +## Quorum, roles, and arbiters + +- Each DC's raft needs its own quorum. A DC with an **even** local node count gets its + own in-DC arbiter (`--arbiter`) so intra-DC failover keeps quorum; an odd + count needs none. +- The witness DC (`role: Arbiter`) holds only the `dr-controlplane` vote, never DocumentDB data. +- Scaling a DC re-evaluates its parity automatically: the arbiter is created or removed + (and de-registered from the DC raft) as the local count crosses even/odd. + +Separately from a DC's *intra-DC* quorum, the **cross-DC** failover quorum needs a +majority of three voting sites. For how to lay this out across two or three data +centers (and why a third witness site is preferred), see +[Deployment topologies](/docs/guides/documentdb/dr/overview/index.md#deployment-topologies-2-dcs-vs-3-dcs). + +## Planned switchover (zero-RPO) + +Move the active DC on purpose by annotating the DocumentDB: + +```bash +$ kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to=dc-west +``` + +The hub then: + +1. checks the target is a known, healthy DC within the lag budget + (`dr.kubedb.com/switchover-max-lag-bytes`, default 16 MiB); +2. 
sets `phase: FailingOver` and asks the active DC to **quiesce** (hold its primary + read-only) via the primary-DC Lease, freezing the active write position; +3. waits until the target has replayed to within one WAL page of that frozen position; +4. hands the Lease to the target, which is promoted; the old DC resumes as a standby. + +The annotation is cleared automatically once the target is active. Watch +`status.disasterRecovery` for `phase` returning to `Steady` with the new `activeDC`. + +## Failback + +Failback is just a switchover back to the original DC once it is healthy again: + +```bash +$ kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to=dc-east +``` + +When a DC that lost the Lease rejoins, it automatically rewinds any divergent WAL tail +(`pg_rewind`, with a base-backup reseed fallback) and resumes streaming from the +current active primary before it is eligible to become active again. + +## Per-DC horizontal scaling + +Each DC has its own raft, so scale a specific DC with a `DocumentDBOpsRequest`: + +```yaml +apiVersion: ops.kubedb.com/v1alpha1 +kind: DocumentDBOpsRequest +metadata: + name: docdb-dcdr-scale-west + namespace: demo +spec: + type: HorizontalScaling + databaseRef: + name: docdb-dcdr + horizontalScaling: + dataCenters: + - clusterName: dc-west + replicas: 5 +``` + +- Each entry sets that DC's local node count; DCs not listed are unchanged. +- Nodes are added or removed one at a time over the DC-local network; the DC's arbiter + is created/removed as parity changes; on scale-down the removed node's replication + slot is dropped. +- The base `PlacementPolicy` is renumbered so the declarative topology matches. +- Scaling a Member DC to `1` makes it a single-node DC (no in-DC HA, still part of + cross-DC DR). Scaling to `0` is rejected: removing a whole DC is a topology change, + not horizontal scaling. 
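The scaling rules in the list above (one node at a time, an arbiter for even local counts, `0` rejected) can be sketched as follows. This Python snippet is a hedged illustration: `needs_arbiter` and `plan_dc_scale` are hypothetical names, and whether the operator actually toggles the arbiter at each intermediate size is not stated in this guide, so the sketch simply reports the parity at every size it passes through.

```python
def needs_arbiter(local_replicas):
    """An even local node count gets an in-DC arbiter; an odd count needs none."""
    return local_replicas % 2 == 0


def plan_dc_scale(current, target):
    """List the (node_count, needs_arbiter) states visited when a DC is
    resized one node at a time from `current` to `target`."""
    if current < 1:
        raise ValueError("current node count must be at least 1")
    if target < 1:
        # Removing a whole DC is a topology change, not horizontal scaling.
        raise ValueError("scaling a DC to 0 is rejected")
    step = 1 if target > current else -1
    # The range is empty when target == current, so no steps are planned.
    return [(n, needs_arbiter(n)) for n in range(current + step, target + step, step)]


# Scaling dc-west from 3 to 5 nodes, as in the request above.
for nodes, arbiter in plan_dc_scale(3, 5):
    print(nodes, "arbiter" if arbiter else "no arbiter")
```

The same helper shows why scaling a three-node DC down to two introduces an arbiter: the local count becomes even.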
+ +## Day-2 operations + +The standard `DocumentDBOpsRequest` operations apply to every per-DC group on a DC-DR +cluster; issue them exactly as for a non-distributed DocumentDB: + +| Operation | DC-DR behavior | +| --- | --- | +| **Vertical scaling** | Patches every per-DC PetSet and per-DC arbiter, restarts per-DC pods. | +| **Volume expansion** (online/offline) | Expands every per-DC data PVC and per-DC arbiter PVC, waits on all per-DC PetSets. | +| **Version update** | Updates every per-DC PetSet. | +| **Storage migration** | Orphan-deletes and waits on every per-DC PetSet. | +| **Reconfigure / Restart / Rotate-Auth** | Apply across the per-DC pods. | + +The DC-DR control verbs `ForceFailOver`, `ReconnectStandby`, and `SetRaftKeyPair` are +driven by the hub (or issued as a `DocumentDBOpsRequest`) to promote a survivor, +re-point a standby's cross-DC stream, and rotate the raft key material respectively. + +## Backup + +Back up a DC-DR DocumentDB the same way as any KubeDB DocumentDB (KubeStash / the +DocumentDB archiver). Logical and base backups run against the writable endpoint, so +they read from the active DC; the AppBinding follows failover, so a scheduled backup +continues against the new active DC after a failover. Point-in-time recovery works as +usual. + +## Deletion and cleanup + +```bash +$ kubectl delete documentdb -n demo docdb-dcdr +``` + +Per `deletionPolicy`, the operator removes the per-DC PetSets, governing Services, and +the cluster-scoped per-DC `PlacementPolicies` it generated (these carry no owner +reference, so the operator deletes them explicitly). The user-provided base +`PlacementPolicy` is left for you to delete. + +## Limitations + +- **Adding or removing a whole data center** is a topology change (a new group, raft, + and cross-DC seed), distinct from horizontal scaling, and is performed by editing the + `PlacementPolicy` topology, not a `HorizontalScaling` request. 
+- Cross-DC replication is asynchronous; an unplanned failover has a non-zero RPO + bounded by the standby lag. Use a **planned switchover** for zero-RPO moves. +- All correctness depends on the timing invariant above; do not set a fence TTL that + meets or exceeds the Lease duration. diff --git a/docs/guides/documentdb/dr/overview/index.md b/docs/guides/documentdb/dr/overview/index.md new file mode 100644 index 000000000..d39bdea68 --- /dev/null +++ b/docs/guides/documentdb/dr/overview/index.md @@ -0,0 +1,384 @@ +--- +title: DC-DR Overview +menu: + docs_{{ .version }}: + identifier: guides-documentdb-dr-overview + name: Overview + parent: guides-documentdb-dr + weight: 10 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Cross Data Center Disaster Recovery (DC-DR) for DocumentDB + +KubeDB can run a single distributed `DocumentDB` across multiple data centers so the +database survives the loss of an entire data center (DC). Exactly one DC is writable +at any instant; the others are warm, read-only standbys that stream from it across +the DCs. When the active DC is lost, KubeDB promotes a surviving DC, and the single +connection endpoint follows the new writable DC. + +KubeDB `DocumentDB` is Microsoft DocumentDB (the `pg_documentdb` extension) running on +PostgreSQL under the hood, so DC-DR reuses the proven PostgreSQL machinery: WAL +streaming replication between data centers, the per-DC `documentdb-coordinator` raft, +and `pg_rewind` for failback. This guide builds on the same distributed substrate +(one CR, Open Cluster Management, KubeSlice, and a `PlacementPolicy`) and adds the +cross-DC failover machinery on top. + +This page is the conceptual overview and a quick start. See also: + +- [DC-DR User Guide](/docs/guides/documentdb/dr/guide/index.md), every + aspect of running in DC-DR mode (components, monitoring, timing, scaling, day-2 ops). 
+- [DC-DR Runbook](/docs/guides/documentdb/dr/runbook/index.md), what to + do in each operational scenario. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## How it works + +DC-DR is built on one rule: **the DocumentDB raft never stretches across data centers.** + +- **Each data center is a self-contained DocumentDB group.** The operator expands the + single `DocumentDB` CR into one group per data-bearing DC, each with its own + `documentdb-coordinator` raft electing a **local** leader, its own local replicas, + and (when its local replica count is even) its own local arbiter. The raft quorum + never crosses the DC boundary, so cross-DC latency or a partition can never flap an + election. +- **One cross-DC authority decides who is writable.** A small control plane + (`dr-controlplane`), backed by a three-site etcd quorum, publishes one + `coordination.k8s.io` **Lease** per failover scope. The DC that holds the Lease is + the **active** (writable) DC. This is the single cross-DC failover decision. +- **Cross-DC replication is leader-to-leader WAL streaming.** The standby DC's local + leader runs as an asynchronous streaming standby of the active DC's leader; that + standby DC's own replicas cascade from its local leader. So a standby DC opens + exactly one cross-DC replication link. Whether standbys stay Hot or Warm and whether + streaming is Synchronous or Asynchronous follow the CR's `spec.standbyMode` + (`Hot`/`Warm`) and `spec.streamingMode` (`Synchronous`/`Asynchronous`); cross-DC + links are asynchronous by design. +- **Writability is fenced locally and fails closed.** A per-DC `dr-controlplane` + agent projects the Lease holder onto its own spoke cluster as a small marker + `ConfigMap`. The `documentdb-coordinator` reads only that local marker: if it cannot + confirm its DC holds the Lease (the DC lost it, or is partitioned from the + coordination plane), it demotes its leader to read-only. 
Because the fence lives in + the DC and fails closed, a cut-off old-active DC stops accepting writes on its own, + before the hub even reacts. This local fence plus the etcd majority (only one DC can + hold the Lease) is the split-brain guarantee. +- **Only the active DC's leader is labeled `primary`.** Each DC's coordinator leads + its own raft, but a non-active DC's leader is labeled `kubedb.com/role: standby`, so + the single primary `Service` and the `AppBinding` always resolve to the active DC's + writable leader. + +### Data center roles + +Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`: + +| Role | Holds DocumentDB data | Primary eligible | Purpose | +| --- | --- | --- | --- | +| **Member** | yes | yes | A full DocumentDB group; a candidate for the active DC. | +| **Arbiter** | no | no | Vote only, the `dr-controlplane` etcd tie-breaker; runs no DocumentDB. **This is the role a DocumentDB witness DC uses.** | +| **Witness** | yes | no | Data-bearing but never primary, for engines whose witness must carry data (e.g. MongoDB). **Not used by DocumentDB.** | + +> For DocumentDB the third "witness" data center is **vote-only** (it holds only the +> `dr-controlplane` etcd member, no DocumentDB), so it is declared with `role: Arbiter` +> and empty `replicaIndices`. The petset `Witness` role is reserved for engines whose +> witness must carry data; DocumentDB does not use it. + +A typical layout is two Member DCs plus one vote-only witness DC (`role: Arbiter`): +the three-site etcd quorum lives across all three, but DocumentDB data lives only in +the two Member DCs. + +## Deployment topologies (2 DCs vs 3 DCs) + +The DR feature needs two things, in different quantities: + +- **DocumentDB data** lives in the **Member** data centers. You need at least **two** + Member DCs for cross-DC redundancy (one active, one warm standby). +- **The failover decision** is made by the `dr-controlplane` etcd **quorum**. 
A quorum + makes progress only while a **majority of its three voting sites** is reachable. For + single-fault tolerance *and* split-brain safety, those three votes should sit in + **three independent failure domains**. The third domain can be a tiny vote-only + **witness** (`role: Arbiter`) that holds no DocumentDB data. + +So "how many data centers" has two answers: how many hold **data** (two or three), and +how many hold a **quorum vote** (always three for automatic, split-brain-free +failover). The `failoverPolicy.mode` selects the data layout: + +### A. Two Member DCs + a witness, `mode: TwoDC` (recommended) + +Three sites: two hold DocumentDB data, and the third is a vote-only witness DC +(`role: Arbiter`, no DocumentDB): + +```yaml +failoverPolicy: + mode: TwoDC +distributionRules: +- { clusterName: dc-east, role: Member, replicaIndices: [0, 1, 2] } +- { clusterName: dc-west, role: Member, replicaIndices: [3, 4, 5] } +- { clusterName: dc-witness, role: Arbiter } # etcd vote only, no DocumentDB +``` + +Any single site can be lost: + +- **Lose a Member DC** → the surviving Member plus the witness form a 2/3 majority, so + the survivor acquires the Lease and is promoted automatically; the lost DC, if alive + but partitioned, self-fences read-only. +- **Lose the witness** → the two Members are still a 2/3 majority, so writes continue + uninterrupted. + +Because the witness runs no DocumentDB, it is small and cheap. **Run it in a third +public cloud or region.** This is the lowest-cost way to get correct, automatic +failover, and it is the recommended topology whenever a third location is available. + +### B. 
Three Member DCs, `mode: ThreeDC` + +Three sites, all data-bearing and primary-eligible: + +```yaml +failoverPolicy: + mode: ThreeDC +distributionRules: +- { clusterName: dc-east, role: Member, replicaIndices: [0, 1, 2] } +- { clusterName: dc-west, role: Member, replicaIndices: [3, 4, 5] } +- { clusterName: dc-south, role: Member, replicaIndices: [6, 7, 8] } +``` + +This layout provides more data copies and read capacity, and any DC can become primary. It tolerates the loss +of any single Member DC. The cost is three full DocumentDB groups instead of two; use +it when you want a data copy and primary capability in all three locations. + +### C. Two sites only, reduced resiliency + +If you genuinely have only two locations, you still need a third quorum vote, so you +**place it inside one of the two DCs** (run the third `dr-controlplane` etcd member +there). There is no separate witness site, so that DC now holds **two of the three +votes**: + +- **Lose the other DC** (the one with one vote) → the two-vote DC keeps the majority → + failover/continuity works automatically. +- **Lose the two-vote DC** → the survivor holds only one of three votes, cannot form a + quorum, and therefore cannot safely become writable on its own. **Automatic failover + does not happen**; recovery is a manual, operator-confirmed step, and you must be + certain the failed DC is truly down to avoid split-brain. + +This protects against losing one specific DC (the one-vote DC), not against losing either DC symmetrically. Prefer adding a +cheap third witness site (topology A) whenever possible. 
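The vote arithmetic behind topologies A and C can be worked through in a few lines. This Python sketch only counts votes; `has_quorum` and `survives_loss` are hypothetical helper names, and placing the extra vote in `dc-east` for topology C is an arbitrary choice made for the example.

```python
def has_quorum(reachable_votes, total_votes=3):
    """A quorum needs a strict majority of the voting sites."""
    return 2 * reachable_votes > total_votes


def survives_loss(votes_by_site, lost_site):
    """True if the sites that remain after losing `lost_site` still hold a majority."""
    total = sum(votes_by_site.values())
    remaining = total - votes_by_site[lost_site]
    return has_quorum(remaining, total)


# Topology A: one vote per site, including the vote-only witness.
topology_a = {"dc-east": 1, "dc-west": 1, "dc-witness": 1}

# Topology C: two sites only; dc-east is assumed to host the third vote.
topology_c = {"dc-east": 2, "dc-west": 1}

for name, topology in (("A", topology_a), ("C", topology_c)):
    for site in topology:
        outcome = "quorum kept" if survives_loss(topology, site) else "quorum lost"
        print(f"topology {name}: lose {site}: {outcome}")
```

With one vote per site, every single-site loss leaves two of three votes. With two votes in one DC, losing that DC leaves a single vote, which is why automatic failover is not possible in that case.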
+ +### At a glance + +| Topology | Sites | Data DCs | Tolerates | Automatic failover | +| --- | --- | --- | --- | --- | +| Two Member + Arbiter witness (`TwoDC`) | 3 | 2 | any 1 site | yes | +| Three Member (`ThreeDC`) | 3 | 3 | any 1 site | yes | +| Two sites, co-located quorum | 2 | 2 | only the one-vote DC | only when the one-vote DC is lost | + +## Prerequisites + +- A working distributed DocumentDB substrate: Open Cluster Management (OCM) hub and + spoke clusters, KubeSlice connecting the spokes, and a storage class on each spoke. + DocumentDB reuses the same substrate as + [Distributed Postgres](/docs/guides/postgres/distributed/overview/index.md), since it + runs on PostgreSQL under the hood. +- The `dr-controlplane` service and its three-site etcd quorum installed across the + data centers, with a `dr-controlplane` agent running in each spoke (DC). +- The KubeDB DocumentDB operator started with the DC-DR flags: + + ``` + --dc-dr-enabled + --dc-dr-coord-kubeconfig= + --dc-dr-local-dc= + ``` + +- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster + name, the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the + pod label `open-cluster-management.io/cluster-name`, and the `PlacementPolicy` + `distributionRule.clusterName`. Keep them identical. + +## Deploy a DC-DR DocumentDB + +A DC-DR DocumentDB is a distributed `DocumentDB` whose `PlacementPolicy` carries a +`failoverPolicy` and per-DC roles. The user creates and edits a **single** `DocumentDB` +object and gets one `AppBinding` and one connection endpoint; the operator expands it +into the per-DC groups. + +### 1. PlacementPolicy + +Assign the global pod ordinals to data centers and tag each DC with its role. 
Here two +Member DCs (`dc-east`, `dc-west`) each get three DocumentDB pods, and `dc-arbiter` is +the tie-breaking witness: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: docdb-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global + mode: TwoDC + distributionRules: + - clusterName: dc-east + role: Member + replicaIndices: [0, 1, 2] + - clusterName: dc-west + role: Member + replicaIndices: [3, 4, 5] + - clusterName: dc-arbiter + role: Arbiter +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** witness DC + (vote only, no DocumentDB) carries none. +- `failoverPolicy.trigger.scope: Global` makes this one cluster-wide failover scope. + Use `Group` with a group name to put a database in its own scope. + +### 2. DocumentDB + +Reference the `PlacementPolicy` and opt the DocumentDB into DC-DR expansion: + +```yaml +apiVersion: kubedb.com/v1alpha2 +kind: DocumentDB +metadata: + name: docdb-dcdr + namespace: demo + annotations: + # Opt this distributed DocumentDB into per-DC DC-DR expansion. + dr.kubedb.com/enabled: "true" +spec: + version: "pg17-0.109.0" + replicas: 6 + distributed: true + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: docdb-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +The operator then creates, per data-bearing DC: + +- a per-DC `PetSet` named `-` (for example `docdb-dcdr-dc-east`) with its own + intra-DC raft and DC-local governing `Service`; +- a per-DC arbiter `PetSet` `--arbiter` when that DC's local node count is + even. + +The witness DC (`role: Arbiter`) runs no DocumentDB pods. 
+ +## Observe the DC-DR state + +The single `DocumentDB` object's `status.disasterRecovery` carries the whole cross-DC +view: + +```bash +$ kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +```json +{ + "activeDC": "dc-east", + "phase": "Steady", + "lastTransitionTime": "2026-06-30T10:00:00Z", + "dataCenters": [ + { "clusterName": "dc-east", "role": "primary", "leader": "docdb-dcdr-dc-east-0", "writable": true, "healthy": true }, + { "clusterName": "dc-west", "role": "standby", "leader": "docdb-dcdr-dc-west-0", "writable": false, "healthy": true, "lagBytes": 4096 } + ] +} +``` + +- `activeDC` is the DC that currently holds the Lease and runs the writable primary. +- `phase` is `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. +- Each `dataCenters` entry reports that DC's local leader, whether it is the writable + primary, its health, and its cross-DC replication `lagBytes` (the in-DC coordinator + computes this and surfaces it; the hub never opens cross-cluster SQL). + +## Unplanned failover + +When the active DC is lost, its agents stop renewing the primary-DC Lease. After the +Lease duration a surviving Member DC's agent acquires it; that DC becomes `activeDC`. +The hub observes the change and drives a bounded-loss promotion of the survivor +through a `ForceFailOver` `DocumentDBOpsRequest`, while the old DC self-fences +read-only locally (before the hub reacts, even under a partition). The primary +`Service` and `AppBinding` then resolve to the new writable DC. + +You do not trigger this; it is automatic. `status.disasterRecovery.phase` moves to +`FailingOver` during the transition and back to `Steady` once the survivor is primary. 
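During or after a failover, a script can sanity-check the `status.disasterRecovery` document against the rule that exactly one DC is writable. The Python sketch below is an illustration under assumptions: `check_dr_status` is a hypothetical helper, it works on JSON already fetched with the `kubectl ... | jq` command shown earlier, and the embedded sample is a trimmed copy of the example status on this page.

```python
import json

# Trimmed copy of the example status shown earlier on this page.
SAMPLE = json.loads("""
{
  "activeDC": "dc-east",
  "phase": "Steady",
  "dataCenters": [
    {"clusterName": "dc-east", "role": "primary", "writable": true, "healthy": true},
    {"clusterName": "dc-west", "role": "standby", "writable": false, "healthy": true, "lagBytes": 4096}
  ]
}
""")


def check_dr_status(status):
    """Return human-readable findings; an empty list means nothing stood out."""
    findings = []
    writable = [dc["clusterName"] for dc in status.get("dataCenters", []) if dc.get("writable")]
    if len(writable) != 1:
        findings.append(f"expected exactly one writable DC, found {writable}")
    elif writable[0] != status.get("activeDC"):
        findings.append(f"writable DC {writable[0]} does not match activeDC {status.get('activeDC')}")
    if status.get("phase") != "Steady":
        findings.append(f"phase is {status.get('phase')}, a transition may be in progress")
    return findings


print(check_dr_status(SAMPLE))
```

A non-`Steady` phase is reported as a finding rather than an error, since `FailingOver` is expected while a promotion is under way.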
+ +## Planned switchover (zero-RPO) + +To move the active DC on purpose (maintenance, rebalancing) without losing committed +rows, annotate the DocumentDB with the target DC: + +```bash +$ kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to=dc-west +``` + +The switchover is coordinated for zero RPO: + +1. The target must be a known, healthy DC within the lag budget. +2. The hub asks the active DC to **quiesce** (hold its primary read-only) via the + primary-DC Lease, so the active primary's write position freezes. +3. The hub waits until the target has replayed to within one WAL page of the frozen + position. +4. The Lease hands off to the target; it is promoted and the active DC resumes (now as + a standby). The annotation is cleared automatically. + +Because writes are frozen and the target fully catches up before the handoff, a +planned switchover loses no committed rows. + +## Scale a data center + +Each DC has its own intra-DC raft, so a single `spec.replicas` cannot describe a +scale. Scale a specific DC with a `DocumentDBOpsRequest` that lists per-DC targets: + +```yaml +apiVersion: ops.kubedb.com/v1alpha1 +kind: DocumentDBOpsRequest +metadata: + name: docdb-dcdr-scale + namespace: demo +spec: + type: HorizontalScaling + databaseRef: + name: docdb-dcdr + horizontalScaling: + dataCenters: + - clusterName: dc-west + replicas: 5 +``` + +Each entry sets that data center's local node count; DCs not listed are unchanged. +The request resizes only `dc-west`'s raft (adding or removing nodes one at a time over +the DC-local network, managing that DC's arbiter parity), then updates the +`PlacementPolicy` so the declarative topology matches. No other DC's raft and no +cross-DC writability is touched. Scaling a DC to `1` makes it a single-node DC (no +in-DC HA, but still part of cross-DC DR); removing a whole DC is a topology change, not +a scale. 
+ +## Day-2 operations + +The standard DocumentDB `DocumentDBOpsRequest` operations work on a DC-DR cluster and +act on every per-DC group: vertical scaling, volume expansion (online and offline), +version update, and storage migration each apply to all per-DC `PetSet`s and per-DC +arbiters. You issue them exactly as for a non-distributed DocumentDB. + +## Cleanup + +```bash +$ kubectl delete documentdb -n demo docdb-dcdr +$ kubectl delete placementpolicy docdb-dcdr +``` + +Deleting the `DocumentDB` removes the per-DC `PetSet`s, governing `Service`s, and the +cluster-scoped per-DC `PlacementPolicies` the operator generated. The user-provided +base `PlacementPolicy` is left for you to delete. diff --git a/docs/guides/documentdb/dr/runbook/index.md b/docs/guides/documentdb/dr/runbook/index.md new file mode 100644 index 000000000..24ec17620 --- /dev/null +++ b/docs/guides/documentdb/dr/runbook/index.md @@ -0,0 +1,356 @@ +--- +title: DC-DR Runbook +menu: + docs_{{ .version }}: + identifier: guides-documentdb-dr-runbook + name: Runbook + parent: guides-documentdb-dr + weight: 30 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# DocumentDB DC-DR Runbook + +Scenario-by-scenario procedures for operating a DocumentDB cluster in cross data +center disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what +KubeDB does **automatically**, how to **verify**, and the **action** to take. + +Read the [User Guide](/docs/guides/documentdb/dr/guide/index.md) for the +concepts and commands referenced here. Throughout, `` is the coordination +control plane kubeconfig, `docdb-dcdr`/`demo` are the example database and namespace. 
+ +## Quick reference + +```bash +# Active DC, phase, and per-DC view: +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery}' | jq + +# Lease holder (the source of truth for "who is active"): +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' + +# Per-DC leaders, roles, and DCs: +kubectl get pods -n demo -l app.kubernetes.io/instance=docdb-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name + +# A spoke's marker (what its coordinators read): +kubectl -n dc-failover get configmap primary-dc -o jsonpath='{.data}' +``` + +Golden rules: + +- **The Lease decides who is writable.** Never make a pod writable by hand. +- **The fence fails closed.** A DC that cannot confirm it holds the Lease is read-only + by design; that is correct behavior, not a bug. +- **At most one DC is `writable: true`** in `status.disasterRecovery` at any instant: + exactly one in steady state, and none while the coordination plane is down (scenario 6). + +--- + +## 1. Active DC lost (zone/cluster failure) + +**Symptoms:** the active DC's pods are gone/unreachable; writes fail briefly. + +**Automatic:** the lost DC's agents stop renewing the Lease. After the Lease duration +(~45s) a surviving Member DC's agent acquires it, and that DC becomes `activeDC`. The hub drives +a bounded-loss promotion (`ForceFailOver` `DocumentDBOpsRequest`) of the survivor; the +old DC, if partially alive, self-fences read-only. The primary `Service` and +`AppBinding` follow to the new DC. `phase` moves `FailingOver` → `Steady`. + +**Verify:** + +```bash +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # the survivor +kubectl get documentdbopsrequest -n demo -l app.kubernetes.io/managed-by=kubedb-dcdr # the failover ops +``` + +**Action:** none required for availability. Note the RPO: writes not yet replicated +when the DC died are lost. When the failed DC returns, see scenario 11 (re-add a DC). + +--- + +## 2.
Network partition between data centers + +**Symptoms:** DCs are up but cannot reach each other or the coordination plane. + +**Automatic:** the side that loses the coordination plane stops getting Lease updates; +its marker `renewTime` freezes and, after the 30s fence TTL, its coordinator demotes +its leader to read-only, **before** the Lease duration lets the other side acquire +(this is the timing invariant). The side that keeps the etcd majority holds/acquires +the Lease and stays (or becomes) writable. There is no split-brain. + +**Verify there is exactly one writable DC:** + +```bash +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' +``` + +**Action:** heal the network. The fenced side rejoins, rewinds any divergent tail, and +resumes streaming automatically. If both sides show `writable: false`, see scenario 6 +(coordination plane down). + +--- + +## 3. Planned switchover (maintenance on the active DC) + +**Action:** + +```bash +kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to=dc-west +``` + +**Automatic:** the hub gates on the target's health and lag, quiesces the active DC +(holds its primary read-only via the Lease), waits until the target catches up to +within one WAL page, then hands off. Zero committed rows are lost. The annotation is +cleared on completion. + +**Verify:** + +```bash +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # dc-west +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # Steady +``` + +**If it does not complete:** see scenario 8 (switchover stuck). + +--- + +## 4. Planned failback to the original DC + +After the original DC is healthy and caught up: + +```bash +kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to=dc-east +``` + +Same zero-RPO flow as scenario 3. 
A DC that previously lost the Lease rewinds its +divergent tail (`pg_rewind`, with a base-backup reseed as the fallback) before it is eligible, so +failback is safe even after an unplanned failover. + +--- + +## 5. A standby DC is lost + +**Symptoms:** a non-active DC's pods are gone; that DC shows `healthy: false`. + +**Impact:** none on writes; the active DC is unaffected. You lose that DC's +redundancy and its standby read capacity until it returns. + +**Verify the active DC is still writable:** + +```bash +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].clusterName}' +``` + +**Action:** recover the DC's nodes; the per-DC group reschedules and re-seeds from the +active primary automatically. + +--- + +## 6. Coordination plane (dr-controlplane / etcd) unavailable + +**Symptoms:** the Lease cannot be read/renewed; markers go stale across all spokes; +every DC eventually fences read-only. + +**Automatic:** this is fail-closed: with no trustworthy Lease, **no** DC is allowed to +be writable. The database is read-only globally rather than risk split-brain. + +**Verify:** + +```bash +kubectl --kubeconfig -n dc-failover get lease primary-dc # error / stale renewTime +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}' # all false +``` + +**Action:** restore the `dr-controlplane` etcd quorum. Once the Lease is renewable, the +holder's marker refreshes, its coordinator un-fences, and writes resume. Do **not** +force a pod writable to work around this. + +--- + +## 7. Failover not promoting a survivor + +**Symptoms:** the active DC is gone but `activeDC` does not move, or no writable DC +appears. + +**Diagnose:** + +```bash +# Did the Lease move? +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' +# Are there candidate pods in the survivor DC?
+kubectl get pods -n demo -l app.kubernetes.io/instance=docdb-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name +# Did the hub create a failover ops request, and what is its phase? +kubectl get documentdbopsrequest -n demo -l app.kubernetes.io/managed-by=kubedb-dcdr -o wide +``` + +**Common causes & action:** + +- **Lease did not move**: only Member DCs are eligible; confirm the survivor is a + `Member` in the `PlacementPolicy` and its agent can reach the coordination plane. +- **No candidates**: the survivor DC has no ready data pods; recover its pods. +- **Ops request failed**: inspect its conditions; the hub does not create a duplicate + while one is open, so resolve or delete the stuck request and let reconcile retry. + +--- + +## 8. Planned switchover stuck (target not catching up) + +**Symptoms:** after annotating `switchover-to`, `phase` stays `FailingOver` and the +Lease does not hand off. + +**Diagnose:** + +```bash +# Target lag and health: +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} lag={.lagBytes} healthy={.healthy}{"\n"}{end}' +``` + +**Causes & action:** + +- **Target lag not converging**: the active DC must be quiesced for the target to + reach the frozen LSN. Confirm the active DC's coordinator honored the quiesce (its + primary should be read-only); check the marker's `data.quiesce` names the active DC. +- **Target unhealthy / no lag report**: the switchover refuses a target with no + `lagBytes` yet; ensure the target DC's leader is up and publishing lag. +- **Target legitimately too far behind**: raise the budget only if you accept the + catch-up time: `kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-max-lag-bytes=`. +- **Abort**: remove the annotation to cancel: + `kubectl annotate documentdb -n demo docdb-dcdr dr.kubedb.com/switchover-to-`. + +--- + +## 9. Lag growing on a standby DC + +**Symptoms:** a DC's `lagBytes` climbs steadily.
+ +**Diagnose:** check cross-DC network throughput/latency, write volume on the active primary, +and replication health on the standby DC's leader. + +```bash +kubectl get pod -n demo -o jsonpath='{.metadata.annotations.kubedb\.com/dc-lag-bytes}' +# On the active primary, check the cross-DC replica: +# SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication; +``` + +**Action:** relieve the bottleneck (network, primary load). High lag widens the RPO of +an unplanned failover and can block a planned switchover until it drains. + +--- + +## 10. A DC is unexpectedly read-only (fence tripped) + +**Symptoms:** a DC you expect to be active is read-only; its leader is labeled +`standby`. + +**Diagnose the fence chain:** + +```bash +# Does this spoke's marker name this DC and is renewTime fresh? +kubectl -n dc-failover get configmap primary-dc -o jsonpath='{.data}' +# Is the dr-controlplane agent running in this DC and renewing? +kubectl get pods -n -l app=dr-controlplane-agent +``` + +**Causes & action:** + +- **Marker stale** (`renewTime` old): the agent cannot reach the coordination plane, + or the projector is failing; restore agent connectivity. +- **Marker names another DC**: this DC simply is not the active one (correct). +- **Clock skew**: large cross-DC clock skew can trip the TTL early; verify NTP. The + timing budget assumes skew is well under (LeaseDuration − fence TTL), roughly 15s + with the ~45s Lease duration and 30s fence TTL quoted in this runbook. + +Never patch `kubedb.com/role` by hand to force writability; the next reconcile and the +fence will revert it, and you risk split-brain. + +--- + +## 11. Re-add / recover a previously lost data center + +After a DC returns from a failure: + +**Automatic:** its per-DC group reschedules, and each pod seeds from the active DC +primary (a node that was previously active rewinds its divergent tail first). Once +caught up, the DC's leader becomes a healthy read-only standby and its `lagBytes` +appears in status.
+ +**Verify:** + +```bash +kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} healthy={.healthy} lag={.lagBytes}{"\n"}{end}' +``` + +**Action:** to make it active again, perform a planned failback (scenario 4) once its +lag is within the switchover lag budget. + +--- + +## 12. Scale a data center up or down + +```bash +# Scale dc-west to 5 nodes: +kubectl apply -f - <<'YAML' +apiVersion: ops.kubedb.com/v1alpha1 +kind: DocumentDBOpsRequest +metadata: { name: docdb-dcdr-scale-west, namespace: demo } +spec: + type: HorizontalScaling + databaseRef: { name: docdb-dcdr } + horizontalScaling: + dataCenters: + - { clusterName: dc-west, replicas: 5 } +YAML +``` + +**Verify:** the per-DC PetSet reaches the new size, the arbiter appears/disappears with +parity, and only that DC changed. + +```bash +kubectl get petset -n demo -l app.kubernetes.io/instance=docdb-dcdr +kubectl get documentdbopsrequest -n demo docdb-dcdr-scale-west -o jsonpath='{.status.phase}' +``` + +**Notes:** scaling to `1` is allowed (single-node DC, no in-DC HA); scaling to `0` is +rejected because removing a DC is a topology change, not a scale. + +--- + +## 13. Version upgrade in DC-DR + +Issue a normal `UpdateVersion` `DocumentDBOpsRequest`; the operator updates every +per-DC PetSet. Plan it during a low-traffic window and confirm each DC returns healthy +in `status.disasterRecovery` before relying on failover again. + +--- + +## 14. Suspected split-brain (two writable DCs) + +This should be impossible by design (etcd majority + fail-closed fence + the timing +invariant).
If `status.disasterRecovery` ever shows two `writable: true` DCs, or two +pods labeled `kubedb.com/role: primary`: + +**Diagnose immediately:** + +```bash +kubectl get pods -n demo -l app.kubernetes.io/instance=docdb-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name +kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}' +``` + +**Action:** the Lease holder is the true active DC. The other DC's fence should trip +within the TTL; if it does not, its marker is wrong (check its agent/projector) or the +timing invariant is misconfigured (verify fence TTL < Lease duration). Stop writes to +the non-Lease-holder DC at the application layer until the fence reasserts, then +reconcile (the non-holder rewinds and rejoins as a standby). + +--- + +## Escalation checklist + +When unsure, collect: + +```bash +kubectl get documentdb -n demo docdb-dcdr -o yaml +kubectl get documentdbopsrequest -n demo -l app.kubernetes.io/managed-by=kubedb-dcdr -o yaml +kubectl --kubeconfig -n dc-failover get lease -o yaml +kubectl -n dc-failover get configmap primary-dc -o yaml # on each spoke +kubectl get pods -n demo -l app.kubernetes.io/instance=docdb-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name -o wide +kubectl logs -n demo -c documentdb-coordinator +```
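
The writable-DC check used in scenarios 2, 6, and 14 can also be scripted for alerting. The sketch below is illustrative only: `count_writable` and `classify_writable` are hypothetical helper names, and the input is assumed to be the `name=true|false` tokens printed by the `{.clusterName}={.writable}` jsonpath range shown in those scenarios.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with KubeDB) for the writable-DC check.

# count_writable reads whitespace-separated "clusterName=true|false" tokens on
# stdin and prints how many of them end in "=true".
count_writable() {
  local token count=0
  for token in $(cat); do
    case "$token" in
      *=true) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

# classify_writable COUNT maps the count to the runbook scenario to consult.
classify_writable() {
  case "$1" in
    0) echo "none writable: check the coordination plane (scenario 6)" ;;
    1) echo "ok: exactly one writable DC" ;;
    *) echo "ALERT: $1 writable DCs (scenario 14)" ;;
  esac
}
```

To use it, pipe the output of the scenario 2 verify command into `count_writable` and pass the result to `classify_writable`. The helpers only read status; the Lease remains the sole authority on which DC is writable.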