diff --git a/docs/guides/mongodb/README.md b/docs/guides/mongodb/README.md index 757745297..c3f5e4fec 100644 --- a/docs/guides/mongodb/README.md +++ b/docs/guides/mongodb/README.md @@ -64,3 +64,7 @@ aliases: - Detail concepts of [MongoDB object](/docs/guides/mongodb/concepts/mongodb.md). - Detail concepts of [MongoDBVersion object](/docs/guides/mongodb/concepts/catalog.md). - Want to hack on KubeDB? Check our [contribution guidelines](/docs/CONTRIBUTING.md). + +## Cross-DC Disaster Recovery (DC-DR) + +Do you want to run your MongoDB database across multiple data centers and recover from a full data center failure with a single, automatically failing-over endpoint? KubeDB runs one replica set across the data centers, spreads the voting members 3-site so no single data center holds a majority, writes with `w:majority`, and lets MongoDB's own election promote a surviving data center. Follow [here](/docs/guides/mongodb/dr/overview/index.md). diff --git a/docs/guides/mongodb/dr/_index.md b/docs/guides/mongodb/dr/_index.md new file mode 100644 index 000000000..ff51abc34 --- /dev/null +++ b/docs/guides/mongodb/dr/_index.md @@ -0,0 +1,10 @@ +--- +title: Disaster Recovery +menu: + docs_{{ .version }}: + identifier: mg-dr-mongodb + name: DR + parent: mg-mongodb-guides + weight: 36 +menu_name: docs_{{ .version }} +--- diff --git a/docs/guides/mongodb/dr/guide/index.md b/docs/guides/mongodb/dr/guide/index.md new file mode 100644 index 000000000..2b00b01c3 --- /dev/null +++ b/docs/guides/mongodb/dr/guide/index.md @@ -0,0 +1,349 @@ +--- +title: DC-DR User Guide +menu: + docs_{{ .version }}: + identifier: mg-dr-guide-mongodb + name: User Guide + parent: mg-dr-mongodb + weight: 20 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Running MongoDB in DC-DR Mode: User Guide + +This guide covers every aspect of operating a distributed MongoDB in cross data +center disaster recovery (DC-DR) mode: the components, the naming contract, +deployment, connecting with 
`w:majority`, reading from a secondary DC, monitoring, +lag and RPO, votes and roles, switchover and failback, scaling, and day-2 operations. + +Read the [DC-DR Overview](/docs/guides/mongodb/dr/overview/index.md) first for the +architecture, and the [DC-DR Runbook](/docs/guides/mongodb/dr/runbook/index.md) for +scenario-by-scenario procedures. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +## Components and where they run + +| Component | Runs in | Responsibility | +| --- | --- | --- | +| **`dr-controlplane`** + 3-site etcd quorum | across the data centers (an OCM control plane) | Publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease holder is the DC whose members get the higher MongoDB `priority`. The Lease is policy and observability, not the failover mechanism. | +| **`dr-controlplane` agent** | each spoke (DC) | Contends for the primary-DC Lease for its DC and projects the Lease decision into the local spoke. | +| **KubeDB MongoDB operator (hub)** | the OCM hub | Expands the `MongoDB` CR into per-DC member groups in one replica set, steers `priority` by `replSetReconfig`, drives planned switchover, follows MongoDB's election with the Lease, and writes `status.disasterRecovery`. | +| **`replication-mode-detector`** | every data-bearing MongoDB pod | Polls `isMaster` and labels the elected primary `kubedb.com/role: primary` and secondaries `secondary`. Election is native; the operator never forces it. | +| **MongoDB voting arbiter** | the arbiter DC (even layout only) | A data-less voting member that supplies the tie-break vote, co-located with the third etcd member. | +| **KubeSlice** | each spoke | Provides the cross-DC pod network (`*.slice.local`) so the one replica set spans clusters and the oplog replicates across DCs. | + +## The DC-name contract + +One string identifies a data center everywhere. 
**Keep these identical:** + +- the OCM spoke cluster name +- the agent `--dc-name` +- the primary-DC Lease `holderIdentity` +- the marker `data.activeDC` +- the pod label `open-cluster-management.io/cluster-name` +- the `PlacementPolicy` `distributionRule.clusterName` + +## Deploying + +### PlacementPolicy + +Map the global pod ordinals to data centers and tag each DC with its role: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: mg-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global # one cluster-wide failover scope (or Group + a group name) + mode: TwoDC # TwoDC: 2 Member DCs + an Arbiter DC; ThreeDC: 3 Member DCs + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1] + - clusterName: dc-b + role: Member + replicaIndices: [2, 3] + - clusterName: dc-c + role: Arbiter +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries + none. Its single MongoDB voting arbiter is scheduled onto the arbiter spoke and + co-located with the third etcd member. +- `mode: TwoDC` expects two Member DCs plus the Arbiter DC (the 2 + 2 + 1 even + layout); `ThreeDC` expects an odd number of Member DCs and no Arbiter DC. +- Roles are `Member` and `Arbiter` only. The `Witness` role used by other engines is + removed for MongoDB. + +### MongoDB + +```yaml +apiVersion: kubedb.com/v1 +kind: MongoDB +metadata: + name: mg-dcdr + namespace: demo +spec: + version: "8.0.5" + distributed: true + replicaSet: + name: rs0 + replicas: 4 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: mg-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +### What the operator creates + +- **One replica set** (`rs0`) whose data members are pinned to the Member DCs by the + `PlacementPolicy` `distributionRules`. 
The oplog replicates across DCs natively; + there is no second replication link. +- In the even (`TwoDC`) layout, **one data-less MongoDB voting arbiter** scheduled + onto the Arbiter DC, so the vote total is odd (for example 2 + 2 + 1). +- A `-secondary` read `Service` selecting `kubedb.com/role: secondary`, so clients + can target standby-DC secondaries for cross-DC reads. +- Split-horizon **Horizons** DNS (`members[*].horizons.external`) so external clients + reach each member by a routable name. + +All data-bearing pods carry the offshoot selectors plus the +`open-cluster-management.io/cluster-name` label, so the single primary and secondary +Services and the single `AppBinding` keep working as the primary moves. + +> The standby DC members are **non-hidden**, `votes:1`, low-`priority` electable +> secondaries. Do **not** make them `hidden:true`: hidden members serve no reads, are +> never election candidates, and get no role label, so a secondary Service would +> select nothing. + +## Connecting + +A DC-DR MongoDB exposes the same single endpoint as any KubeDB MongoDB: + +- the **primary Service** `` resolves to the active DC's writable primary (only + that pod is labeled `kubedb.com/role: primary`); +- the **secondary Service** `-secondary` resolves to the read-only secondaries + across the standby DCs; +- one **`AppBinding`** `` for applications and KubeDB integrations. + +Because priority keeps the primary in the active DC and only that pod is labeled +`primary`, the endpoint follows failover automatically. Applications keep using `` +and reconnect after a failover, landing on the new active DC. + +### Write with `w:majority` + +`w:majority` is the writable contract **and** the split-brain guarantee. 
Always write +with majority concern so a partitioned minority DC cannot commit: + +```javascript +db.orders.insertOne( + { item: "widget", qty: 1 }, + { writeConcern: { w: "majority", wtimeout: 10000 } } +) +``` + +- With `w:majority`, a primary that loses its majority auto-steps-down and a cut-off + DC self-fences read-only, so no committed write is lost to split brain. +- With `w:1`, a write acknowledges before it replicates cross-DC. On an unplanned + active-DC loss the bounded loss is the un-replicated oplog tail, which MongoDB rolls + back natively when the old primary rejoins. + +### Read from a secondary DC + +Target the secondary Service and a non-primary read preference to serve reads from a +nearer standby DC: + +```javascript +// Connect to the -secondary Service, then: +db.getMongo().setReadPref("secondaryPreferred") +db.orders.find({ item: "widget" }) +``` + +Secondary reads are eventually consistent, bounded by `oplogLagSeconds`. + +## Monitoring and observability + +### status.disasterRecovery + +The single CR carries the whole cross-DC view: + +```bash +$ kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +| Field | Meaning | +| --- | --- | +| `activeDC` | The DC whose members hold the higher priority and run the elected primary. | +| `phase` | `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. | +| `lastTransitionTime` | When `activeDC` last changed. | +| `dataCenters[].clusterName` | The data center, by its OCM managed cluster name. | +| `dataCenters[].role` | `Member` or `Arbiter`. | +| `dataCenters[].primary` | That DC's elected primary pod, empty if the DC holds no primary. | +| `dataCenters[].writable` | True only for the active DC. | +| `dataCenters[].oplogLagSeconds` | The DC's cross-DC oplog lag behind the active primary, in seconds. | +| `dataCenters[].healthy` | Whether the DC's health signal (its health Lease) is fresh. 
| + +### Useful checks + +```bash +# Which DC the Lease intends as active (from the coordination plane): +$ kubectl --kubeconfig -n dc-failover get lease primary-dc \ + -o jsonpath='{.spec.holderIdentity}' + +# Per-DC members and roles: +$ kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr \ + -L kubedb.com/role,open-cluster-management.io/cluster-name + +# The replica-set config and members (against the primary): +$ kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.conf().members.map(m => ({host:m.host, priority:m.priority, votes:m.votes, arbiterOnly:m.arbiterOnly}))' + +# Replication lag from the primary's view: +$ kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.printSecondaryReplicationInfo()' +``` + +## Replication, lag, and RPO + +- Cross-DC replication is the **native MongoDB oplog**, asynchronous. There is exactly + one logical replica set, so there is no extra replication link to manage. +- `oplogLagSeconds` is how far a DC's members are behind the active primary's optime, + computed in-DC and surfaced into status. It is the basis for the RPO of an unplanned + failover. +- A **planned switchover loses near-zero committed writes**, because the non-force + `replSetStepDown` only proceeds when an electable target secondary is caught up. An + **unplanned failover** may lose the last un-replicated `w:1` oplog tail (bounded by + the standby lag when the active DC died); `w:majority` writes are never lost. + +## Votes, roles, and the arbiter + +- **Votes are spread 3-site so no single data DC holds a majority.** The hub keeps the + votes balanced and steers only `priority`, never `votes`, in the steady and failover + paths. +- The requirement is an **odd total of voting members**, not an odd DC count: + - **Even layout** (two data DCs plus the Arbiter DC): keep equal voting members per + data DC and let the single data-less arbiter supply the odd vote (the 2 + 2 + 1 + shape). 
Do not add per-DC arbiters; an extra vote in one data DC would break the + symmetry that lets either data DC plus the arbiter elect. + - **Odd layout** (three or more Member DCs, no Arbiter DC): cap the data DCs to an + odd voting total, typically one voting member per DC with extra replicas at + `votes:0`. + - In both layouts, use `votes:0` members for extra read redundancy without changing + the vote balance. +- The **Arbiter DC** holds the `dr-controlplane` etcd vote **and** one MongoDB voting + arbiter, co-located so the two quorums agree. + +## Planned switchover (near-zero RPO) + +Move the active DC on purpose by annotating the MongoDB: + +```bash +$ kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The hub then: + +1. checks the target is a known, healthy DC within the oplog lag budget; +2. sets `phase: FailingOver` and raises the target DC's member `priority` by a normal + majority-committed `replSetReconfig`; +3. issues a **non-force** `replSetStepDown` on the current primary, which only + succeeds once an electable target secondary is caught up (the catch-up gate); +4. once MongoDB elects the new primary in the target DC, moves the Lease to match. + +Watch `status.disasterRecovery` for `phase` returning to `Steady` with the new +`activeDC`. + +## Failback + +Failback is native. A DC that lost the primary and rejoins rolls back any +un-replicated tail automatically (rollback files), or does a full initial resync if it +fell outside the rollback/oplog window. There is no `pg_rewind` step to run. 
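One way to confirm that a returned DC has caught up before steering the primary back is to inspect `status.disasterRecovery`. The following is a minimal sketch, not KubeDB code: the `isCaughtUp` helper and the `maxLagSeconds` budget are hypothetical names chosen for illustration, and only the status fields (`dataCenters[].clusterName`, `role`, `healthy`, `oplogLagSeconds`) come from this guide.

```javascript
// Hypothetical helper (not part of KubeDB): decide whether a returned DC looks
// ready for a planned switchover, based on a status.disasterRecovery object.
// maxLagSeconds is an assumed, operator-chosen budget, not a KubeDB setting.
function isCaughtUp(dr, clusterName, maxLagSeconds) {
  const dc = (dr.dataCenters || []).find((d) => d.clusterName === clusterName);
  if (!dc || dc.role !== "Member" || dc.healthy !== true) {
    return false; // unknown DC, arbiter DC, or stale health signal
  }
  // A missing lag value is treated as "not caught up" to stay conservative.
  return typeof dc.oplogLagSeconds === "number" && dc.oplogLagSeconds <= maxLagSeconds;
}

// Example input shaped like the status.disasterRecovery fields in this guide.
const returnedStatus = {
  activeDC: "dc-b",
  phase: "Steady",
  dataCenters: [
    { clusterName: "dc-a", role: "Member", healthy: true, oplogLagSeconds: 1 },
    { clusterName: "dc-b", role: "Member", healthy: true, oplogLagSeconds: 0 },
    { clusterName: "dc-c", role: "Arbiter", healthy: true },
  ],
};

console.log(isCaughtUp(returnedStatus, "dc-a", 5));
```

In practice the object would come from the `kubectl get mongodb ... -o jsonpath='{.status.disasterRecovery}'` command shown earlier. The hub applies its own health and lag gate during a switchover, so a check like this is only an operator-side sanity test.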
+ +Once the returned DC is caught up, steer the primary back with a planned switchover: + +```bash +$ kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to=dc-a +``` + +## Scaling and day-2 operations + +The standard `MongoDBOpsRequest` operations (`HorizontalScaling`, `VerticalScaling`, +`VolumeExpansion`, `UpdateVersion`, `Reconfigure`, `ReconfigureTLS`, `Restart`, +`Reprovision`, `RotateAuth`, `Horizons`, `StorageMigration`) apply to a DC-DR cluster. +They act on the distributed member groups across the DCs and are issued exactly as for +a single-cluster MongoDB. There is no failover ops type: failover is MongoDB's native +election, and the planned switchover is the `dr.kubedb.com/switchover-to` annotation, +not an ops request. + +### Scale a single data center + +The one replica set spans the DCs, so for a DC-DR cluster scale each data center +independently with `spec.horizontalScaling.dataCenters` instead of the cluster-wide +`spec.horizontalScaling.replicas`. Each entry names a `Member` data center (by its +`clusterName`, matching a Member `distributionRule`) and its desired local node count; +data centers not listed are left unchanged: + +```yaml +apiVersion: ops.kubedb.com/v1alpha1 +kind: MongoDBOpsRequest +metadata: + name: mg-dcdr-hscale + namespace: demo +spec: + type: HorizontalScaling + databaseRef: + name: mg-dcdr + horizontalScaling: + dataCenters: + - clusterName: dc-a + replicas: 3 + - clusterName: dc-b + replicas: 3 +``` + +Each data center's members are scaled within that single replica set. Keep the vote +total odd when adding voting members (see [Votes, roles, and the +arbiter](#votes-roles-and-the-arbiter)); add extra data redundancy as `votes:0` +members so the vote balance does not change. + +> **Note:** the distributed MongoDB substrate and the DC-DR layer are net-new for +> MongoDB. 
Treat the field names and flows in this guide as the intended user +> experience; confirm availability in your release before relying on them in +> production. + +## Deletion and cleanup + +```bash +$ kubectl delete mongodb -n demo mg-dcdr +``` + +Per `deletionPolicy`, the operator removes the per-DC member groups, the MongoDB +arbiter, and the cluster-scoped per-DC `PlacementPolicies` it generated (these carry +no owner reference, so the operator deletes them explicitly). The user-provided base +`PlacementPolicy` is left for you to delete. + +## Limitations + +- **Adding or removing a whole data center** is a topology change (a member-group and + cross-DC seed change), performed by editing the `PlacementPolicy` topology, not by a + scaling request. +- Cross-DC oplog replication is asynchronous; an unplanned failover has a non-zero RPO + bounded by the standby lag with `w:1`. Use `w:majority` for the split-brain + guarantee and a planned switchover for a near-zero-RPO move. +- In the 2 + 2 + 1 even layout, a full data-DC loss stalls `w:majority` writes until + the operator reconfigs the lost members out. Prefer an odd number of Member DCs to + avoid the stall. diff --git a/docs/guides/mongodb/dr/overview/index.md b/docs/guides/mongodb/dr/overview/index.md new file mode 100644 index 000000000..8a9005038 --- /dev/null +++ b/docs/guides/mongodb/dr/overview/index.md @@ -0,0 +1,334 @@ +--- +title: DC-DR Overview +menu: + docs_{{ .version }}: + identifier: mg-dr-overview-mongodb + name: Overview + parent: mg-dr-mongodb + weight: 10 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# Cross Data Center Disaster Recovery (DC-DR) for MongoDB + +KubeDB can run a single distributed `MongoDB` across multiple data centers (DCs) so +the database survives the loss of an entire data center. Exactly one DC runs the +writable primary at any instant; the others are warm secondaries that serve cross-DC +reads. 
When the active DC is lost, MongoDB's own majority election promotes a new +primary in a surviving DC automatically, and the single connection endpoint follows +the new writable DC. + +This page is the conceptual overview and a quick start. See also: + +- [DC-DR User Guide](/docs/guides/mongodb/dr/guide/index.md) for every aspect of + running in DC-DR mode (components, status, connecting, monitoring, switchover, + failback, day-2 ops). +- [DC-DR Runbook](/docs/guides/mongodb/dr/runbook/index.md) for what to do in each + operational scenario. + +> **New to KubeDB?** Please start [here](/docs/README.md). + +> **Availability:** the distributed MongoDB substrate (`spec.distributed`, the +> `PlacementPolicy`, cross-cluster networking) and the DC-DR layer are net-new for +> MongoDB. Treat the field names and flows here as the intended user experience and +> confirm availability in your release before relying on them in production. + +## Why MongoDB DC-DR is different + +Most KubeDB engines (Postgres, MariaDB, MSSQL) keep their consensus quorum **inside** +a single DC, because a raft or cluster manager flaps or stalls when its quorum spans +data centers. Those engines therefore run one independent group per DC and build a +separate cross-DC replication link. + +**MongoDB is the exception. A replica set is geo-aware by design.** Spreading voting +members across data centers with a tie-break voter is the documented, supported +MongoDB geo deployment. So for MongoDB: + +- **One replica set spans the DCs.** There is one logical replica set whose members + are pinned to DCs. The oplog is already the cross-DC link, replicated + asynchronously to members in the other DCs. There is **no second replication link + to build** and no remote-replica. +- **Failover is MongoDB's own election.** When the active DC is lost, the surviving + voting members form a majority and elect a new primary automatically. KubeDB does + **not** force or drive promotion. 
+- **Failback is native.** A returned old primary rolls back its un-replicated tail + automatically when it rejoins as a secondary, or does a full initial resync if it + fell outside the rollback/oplog window. There is **no `pg_rewind` equivalent**. + +## How it works + +DC-DR for MongoDB rests on four rules. + +- **Votes are spread 3-site so no single data DC holds a majority.** With two data + DCs the layout is `dc-a` data members `votes:1`, `dc-b` data members `votes:1`, and + one data-less **MongoDB voting arbiter** in a third arbiter DC. With totals like + 2 + 2 + 1 the majority is 3, so `dc-a` plus the arbiter DC, or `dc-b` plus the + arbiter DC, can elect, but neither data DC alone can. This removes split brain at + its root: a partitioned data DC can never gather a majority by itself. +- **`w:majority` is the writable contract and the split-brain guarantee.** Because + the safety comes from MongoDB's majority, the writable path defaults to + `w:majority`. A partitioned minority DC then cannot commit, and a primary that + loses its majority auto-steps-down to a secondary, so a cut-off DC goes read-only + on its own. With `w:1` the bounded loss is the un-replicated oplog tail, which is + rolled back natively on rejoin. +- **The Lease steers priority and follows the primary; it is not the failover + mechanism.** A small control plane (`dr-controlplane`), backed by a three-site etcd + quorum, publishes one `coordination.k8s.io` **Lease** per failover scope. The Lease + holder's data members get a higher MongoDB `priority`, so MongoDB keeps the primary + in the DC the operator intends. Priority is a preference, not a pin: during a member + bounce a standby member can briefly become primary and priority takeover returns + it, so the observed primary can transiently differ from the Lease-intended DC. That + is expected, not a bug; there is still exactly one primary at all times. 
On an + unplanned active-DC loss MongoDB elects in a standby DC first and the Lease then + **follows** the new primary (inverted from Postgres, where the Lease leads). +- **Only the active DC's primary carries `kubedb.com/role: primary`.** The existing + `replication-mode-detector` sidecar labels the elected primary `primary` and the + secondaries `secondary`. Because priority keeps the primary in the active DC, the + single primary `Service` and the `AppBinding` resolve there. Standby DC members are + non-hidden, electable `secondary` members that appear on a separate + `-secondary` read Service. + +> **Why not confine the votes to one DC?** A tempting design is to give all votes to +> the active DC and force-reconfig them away on Lease loss. That is unsafe: in a +> partition both sides would issue `replSetReconfig {force:true}` at once (the config +> diverges), and because the confined DC holds a majority, a partitioned old primary +> could still commit `w:majority` writes, a true split brain. Spreading votes 3-site +> with `w:majority` removes both problems with no force reconfig. + +### Data center roles + +Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`: + +| Role | Holds MongoDB data | Primary eligible | Purpose | +| --- | --- | --- | --- | +| **Member** | yes | yes | A data-bearing replica-set member group; a candidate for the active DC. | +| **Arbiter** | no | no | The arbiter DC. Holds the `dr-controlplane` etcd vote **and** one MongoDB voting arbiter (data-less). Supplies the tie-break vote. | + +> Unlike MariaDB and MSSQL, whose arbiter DC holds no engine member, the MongoDB +> arbiter DC holds **both** the `dr-controlplane` etcd member and a MongoDB voting +> arbiter, co-located so the coordination quorum and the replica-set quorum agree. +> The petset `Witness` role is **removed** for MongoDB; only `Member` and `Arbiter` +> are used. + +## Deployment topologies + +MongoDB DC-DR supports two shapes. 
The difference is the vote math after a full DC +loss. + +### A. Two Member DCs plus an arbiter DC (the 2 + 2 + 1 even layout) + +Three sites; two hold MongoDB data, the third holds the etcd vote plus one MongoDB +voting arbiter (no data): + +```yaml +failoverPolicy: + mode: TwoDC +distributionRules: +- { clusterName: dc-a, role: Member, replicaIndices: [0, 1] } # votes:1 each +- { clusterName: dc-b, role: Member, replicaIndices: [2, 3] } # votes:1 each +- { clusterName: dc-c, role: Arbiter } # dr-controlplane etcd + 1 MongoDB arbiter +``` + +Five voting members, majority 3. Either data DC plus the arbiter DC can elect; no +single data DC can. + +- **Lose a data DC** the survivor plus the arbiter DC still form a majority, so + MongoDB elects a primary in the survivor automatically. But `w:majority` writes + **stall**, because only two of the five data-bearing members are reachable and a + majority of data acks is no longer possible (MongoDB's documented two-data-center + limitation). The operator then issues a normal, majority-committed `replSetReconfig` + that drops the lost members, so the majority recomputes to the survivors and + `w:majority` writes resume. +- **Lose the arbiter DC alone** the two data DCs together hold 4 of 5 votes, still a + majority, so a primary holds and writes continue. + +### B. Odd number of Member DCs, no arbiter DC (recommended) + +Three (or any odd number of) data-bearing `Member` DCs, every DC carrying data and +electable, no separate arbiter: + +```yaml +failoverPolicy: + mode: ThreeDC +distributionRules: +- { clusterName: dc-a, role: Member, replicaIndices: [0, 1] } +- { clusterName: dc-b, role: Member, replicaIndices: [2, 3] } +- { clusterName: dc-c, role: Member, replicaIndices: [4, 5] } +``` + +Cap the voting members to an odd total (typically one voting member per DC, extra +replicas at `votes:0`). 
A single DC loss then keeps a **data** majority, so MongoDB +elects in a surviving DC **and** `w:majority` never stalls, no reconfig-out step +needed. This is MongoDB's recommended geo shape; prefer it when a third data location +is available. + +### At a glance + +| Topology | Sites | Data DCs | Tolerates | `w:majority` after a data-DC loss | +| --- | --- | --- | --- | --- | +| 2 Member + Arbiter DC (`TwoDC`, 2 + 2 + 1) | 3 | 2 | any 1 site | stalls until the operator reconfigs the lost members out | +| Odd Member DCs (`ThreeDC`) | 3+ | 3+ | any 1 site | never stalls | + +## The single-CR, single-endpoint model + +The user creates **one** distributed `MongoDB` object (with `spec.distributed` and a +`PlacementPolicy` carrying `distributionRules` and a `failoverPolicy`) and gets +**one** `AppBinding` and **one** endpoint. The operator expands the CR into per-DC +member groups, all in one replica set, plus the MongoDB arbiter in the even layout, +with priority steered by the Lease. + +The single CR's `status.disasterRecovery` carries the whole cross-DC view: the active +DC, each DC's members and primary, the cross-DC oplog lag in seconds, and the DR +phase. + +## Prerequisites + +- A distributed MongoDB substrate: Open Cluster Management (OCM) hub and spoke + clusters, KubeSlice connecting the spokes (members reach each other over + `*.slice.local` with split-horizon **Horizons** DNS), and a storage class on each + data-bearing spoke. +- The `dr-controlplane` service and its three-site etcd quorum installed across the + data centers, with a `dr-controlplane` agent running in each spoke (DC). In the even + layout, the third etcd member sits in the arbiter DC alongside the MongoDB arbiter. +- The KubeDB MongoDB operator started with the DC-DR flags (coordination kubeconfig + and the operator's local DC name). 
+- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster + name, the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the + pod label `open-cluster-management.io/cluster-name`, and the `PlacementPolicy` + `distributionRule.clusterName`. Keep them identical. + +## Deploy a DC-DR MongoDB + +### 1. PlacementPolicy + +Assign global pod ordinals to data centers and tag each DC with its role. Here two +Member DCs (`dc-a`, `dc-b`) each get two MongoDB members, and `dc-c` is the arbiter +DC: + +```yaml +apiVersion: apps.k8s.appscode.com/v1 +kind: PlacementPolicy +metadata: + name: mg-dcdr +spec: + clusterSpreadConstraint: + slice: + projectNamespace: kubeslice-demo + sliceName: demo-slice + failoverPolicy: + trigger: + scope: Global + mode: TwoDC + distributionRules: + - clusterName: dc-a + role: Member + replicaIndices: [0, 1] + - clusterName: dc-b + role: Member + replicaIndices: [2, 3] + - clusterName: dc-c + role: Arbiter +``` + +- A data-bearing **Member** rule carries `replicaIndices`; the **Arbiter** DC carries + none (its single MongoDB arbiter is not ordinal-pinned, it is scheduled onto the + arbiter spoke by the operator). +- `failoverPolicy.trigger.scope: Global` makes this one cluster-wide failover scope. + +### 2. MongoDB + +Reference the `PlacementPolicy` and opt the MongoDB into DC-DR expansion: + +```yaml +apiVersion: kubedb.com/v1 +kind: MongoDB +metadata: + name: mg-dcdr + namespace: demo +spec: + version: "8.0.5" + distributed: true + replicaSet: + name: rs0 + replicas: 4 + storageType: Durable + podTemplate: + spec: + podPlacementPolicy: + name: mg-dcdr + storage: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 1Gi + deletionPolicy: WipeOut +``` + +The operator expands this into one replica set whose members are pinned to `dc-a` and +`dc-b`, plus a single MongoDB voting arbiter in `dc-c`, and steers `priority` from the +Lease so the primary stays in the active DC. 
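The vote arithmetic behind this layout can be checked with a few lines of JavaScript. This is an illustrative sketch only: the `voteMath` function and the shape of its input are hypothetical and not part of KubeDB. It models just the counting rules: an election needs a strict majority of all votes, and `w:majority` needs acknowledgements from the smaller of that majority and the number of data-bearing voting members, because an arbiter holds no data.

```javascript
// Hypothetical model (not KubeDB code) of replica-set vote counting per DC.
// layout: { <dcName>: { data: <data-bearing voting members>, arbiters: <voting arbiters> } }
// Members with votes:0 are left out, since they do not count toward a majority.
function voteMath(layout, lostDC) {
  let totalVotes = 0;
  let totalData = 0;
  let liveVotes = 0;
  let liveData = 0;
  for (const [name, dc] of Object.entries(layout)) {
    const votes = dc.data + dc.arbiters;
    totalVotes += votes;
    totalData += dc.data;
    if (name !== lostDC) {
      liveVotes += votes;
      liveData += dc.data;
    }
  }
  const electionMajority = Math.floor(totalVotes / 2) + 1;
  // Acknowledgements needed for w:majority: arbiters cannot acknowledge writes.
  const writeMajority = Math.min(electionMajority, totalData);
  const canElect = liveVotes >= electionMajority;
  return {
    electionMajority,
    canElect,
    canAckMajorityWrites: canElect && liveData >= writeMajority,
  };
}

// The 2 + 2 + 1 even layout deployed above.
const twoDC = {
  "dc-a": { data: 2, arbiters: 0 },
  "dc-b": { data: 2, arbiters: 0 },
  "dc-c": { data: 0, arbiters: 1 },
};
// An odd layout with one voting member per DC.
const threeDC = {
  "dc-a": { data: 1, arbiters: 0 },
  "dc-b": { data: 1, arbiters: 0 },
  "dc-c": { data: 1, arbiters: 0 },
};

console.log(voteMath(twoDC, "dc-a"));
console.log(voteMath(twoDC, "dc-c"));
console.log(voteMath(threeDC, "dc-a"));
```

For the 2 + 2 + 1 layout with `dc-a` lost, the model reports that an election is still possible but majority writes are not, which is the stall the operator clears by reconfiguring the lost members out. Losing only the arbiter DC, or any one DC in the odd layout, leaves both elections and majority writes available.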
+ +## Observe the DC-DR state + +The single `MongoDB` object's `status.disasterRecovery` carries the whole cross-DC +view: + +```bash +$ kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery}' | jq +``` + +```json +{ + "activeDC": "dc-a", + "phase": "Steady", + "lastTransitionTime": "2026-06-30T10:00:00Z", + "dataCenters": [ + { "clusterName": "dc-a", "role": "Member", "primary": "mg-dcdr-0", "writable": true, "healthy": true, "oplogLagSeconds": 0 }, + { "clusterName": "dc-b", "role": "Member", "primary": "", "writable": false, "healthy": true, "oplogLagSeconds": 2 }, + { "clusterName": "dc-c", "role": "Arbiter", "primary": "", "writable": false, "healthy": true } + ] +} +``` + +- `activeDC` is the DC whose members hold the higher priority and run the elected + primary. +- `phase` is `Steady`, `FailingOver`, `FailingBack`, or `Degraded`. +- Each `dataCenters` entry reports the DC role, its elected primary pod (if any), + whether it is the writable DC, its health, and its cross-DC `oplogLagSeconds` (the + optime delta behind the active primary, computed in-DC; the hub never opens + cross-cluster connections). + +## Unplanned failover + +When the active DC is lost, MongoDB's own election promotes a new primary in a +surviving DC using the survivor plus the arbiter DC (or the surviving data majority in +the odd layout). You do not trigger this. The orchestrator observes the new primary +and moves the Lease to match, then (in the 2 + 2 + 1 layout) issues a +majority-committed `replSetReconfig` dropping the lost members so `w:majority` writes +resume. `status.disasterRecovery.phase` moves to `FailingOver` and back to `Steady`. 
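After a failover settles, the status can be checked for the single-writer property described above. The snippet below is a hedged sketch rather than a KubeDB tool: `checkSingleWriter` is a hypothetical helper, and it assumes the status has the shape of the example JSON in the previous section. While `phase` is still `FailingOver`, a mismatch may simply be transient.

```javascript
// Hypothetical check (not part of KubeDB): report anything that contradicts
// "exactly one writable DC, and it is the active DC" in status.disasterRecovery.
function checkSingleWriter(dr) {
  const writable = (dr.dataCenters || [])
    .filter((dc) => dc.writable === true)
    .map((dc) => dc.clusterName);
  const problems = [];
  if (writable.length !== 1) {
    problems.push(`expected exactly one writable DC, found ${writable.length}`);
  } else if (writable[0] !== dr.activeDC) {
    problems.push(`writable DC ${writable[0]} differs from activeDC ${dr.activeDC}`);
  }
  return problems;
}

// The example status from the section above.
const sampleStatus = {
  activeDC: "dc-a",
  phase: "Steady",
  lastTransitionTime: "2026-06-30T10:00:00Z",
  dataCenters: [
    { clusterName: "dc-a", role: "Member", primary: "mg-dcdr-0", writable: true, healthy: true, oplogLagSeconds: 0 },
    { clusterName: "dc-b", role: "Member", primary: "", writable: false, healthy: true, oplogLagSeconds: 2 },
    { clusterName: "dc-c", role: "Arbiter", primary: "", writable: false, healthy: true },
  ],
};

console.log(checkSingleWriter(sampleStatus));
```

An empty list means the status is consistent with one writable DC. Any entry is a prompt to look at `phase` and the pod role labels before taking action.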
+ +## Planned switchover (near-zero RPO) + +To move the active DC on purpose without losing committed writes, annotate the +MongoDB with the target DC: + +```bash +$ kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to=dc-b +``` + +The orchestrator raises the target DC's `priority`, then issues a non-force +`replSetStepDown` on the current primary. A non-force stepDown only succeeds when an +electable secondary in the target is caught up within the catch-up window, which is +the near-zero-RPO gate. The Lease then follows to `dc-b`. + +## Cleanup + +```bash +$ kubectl delete mongodb -n demo mg-dcdr +$ kubectl delete placementpolicy mg-dcdr +``` + +Deleting the `MongoDB` removes the per-DC member groups, the arbiter, and the +generated per-DC `PlacementPolicies`. The user-provided base `PlacementPolicy` is left +for you to delete. diff --git a/docs/guides/mongodb/dr/runbook/index.md b/docs/guides/mongodb/dr/runbook/index.md new file mode 100644 index 000000000..a7f3c2cb7 --- /dev/null +++ b/docs/guides/mongodb/dr/runbook/index.md @@ -0,0 +1,335 @@ +--- +title: DC-DR Runbook +menu: + docs_{{ .version }}: + identifier: mg-dr-runbook-mongodb + name: Runbook + parent: mg-dr-mongodb + weight: 30 +menu_name: docs_{{ .version }} +section_menu_id: guides +--- + +# MongoDB DC-DR Runbook + +Scenario-by-scenario procedures for operating a MongoDB cluster in cross data center +disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB and +MongoDB do **automatically**, how to **verify**, and the **action** to take. + +Read the [User Guide](/docs/guides/mongodb/dr/guide/index.md) for the concepts and +commands referenced here. Throughout, `` is the coordination control plane +kubeconfig, `mg-dcdr`/`demo` are the example database and namespace. 
+
+## Quick reference
+
+```bash
+# Active DC, phase, and per-DC view:
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery}' | jq
+
+# Lease holder (the DC the coordination plane intends as active):
+kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}'
+
+# Per-DC members, roles, and DCs:
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name
+
+# Replica-set members, priorities, and votes (against the primary):
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.conf().members.map(m => ({host:m.host, priority:m.priority, votes:m.votes, arbiterOnly:m.arbiterOnly}))'
+```
+
+Golden rules:
+
+- **MongoDB's majority election decides the primary.** Never force a member primary by
+  hand and never run `replSetReconfig {force:true}` outside a documented double-failure
+  DR action.
+- **`w:majority` is the split-brain guarantee.** A minority DC cannot commit and a
+  primary that loses its majority auto-steps-down.
+- **Exactly one DC is `writable: true`** in `status.disasterRecovery`, and exactly one
+  pod is labeled `kubedb.com/role: primary`, at any instant.
+- **The Lease follows the primary.** A transient gap between the Lease-intended DC and
+  the observed primary (during a member bounce and priority takeover) is expected.
+
+---
+
+## 1. Active DC lost (zone/cluster failure)
+
+**Symptoms:** the active DC's members are gone/unreachable; writes fail briefly.
+
+**Automatic:** the surviving voting members (a standby data DC plus the arbiter DC in
+the even layout, or the surviving data majority in the odd layout) form a majority and
+MongoDB **elects a new primary on its own**. The mode-detector relabels the new
+primary `primary`. The orchestrator observes the new primary and moves the Lease to
+match. In the 2 + 2 + 1 even layout it then issues a normal majority-committed
+`replSetReconfig` dropping the lost DC's members, so `w:majority` writes (otherwise
+stalled with only two data-bearing members left of five voting members) resume. `phase` moves from `FailingOver` to `Steady`.
+
+**Verify:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # the survivor
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name
+```
+
+**Action:** none required for availability. Note the RPO: `w:1` writes not yet
+replicated when the DC died are lost (`w:majority` writes are not). When the failed DC
+returns, see scenario 9 (re-add a DC).
+
+---
+
+## 2. `w:majority` writes stall after a data-DC loss (even layout only)
+
+**Symptoms:** in the 2 + 2 + 1 layout, after losing a whole data DC, a primary is
+elected but `w:majority` writes time out.
+
+**Cause:** only two data-bearing members remain reachable among five voting members, so the
+three data acks a majority needs are impossible (MongoDB's documented two-data-center limitation).
+
+**Automatic:** the orchestrator issues a normal majority-committed `replSetReconfig`
+that drops the lost members, so the majority recomputes to the survivors and
+`w:majority` resumes. This is **not** a force reconfig (the surviving data DC plus the
+arbiter hold a majority of the original config).
+
+**Verify:**
+
+```bash
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.conf().members.length' # dropped to the survivors
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # Steady
+```
+
+**Action:** none if the reconfig completed. To avoid this stall entirely, run an
+**odd** number of Member DCs with no Arbiter DC; a single DC loss then keeps a data
+majority and `w:majority` never stalls.
+
+---
+
+## 3. Network partition between data centers
+
+**Symptoms:** DCs are up but cannot reach each other.
+
+**Automatic:** a primary on the minority side loses its majority and
+**auto-steps-down** to a secondary, so the cut-off DC goes read-only on its own. With
+`w:majority` a minority side cannot commit, so there is no split brain and the fence
+needs no action (it could not act anyway: lowering priority is a normal reconfig that
+needs a majority the isolated DC does not have, and force reconfig is forbidden). The
+majority side keeps or elects the primary and stays writable.
+
+**Verify there is exactly one writable DC:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName}={.writable} {end}'
+```
+
+**Action:** heal the network. The minority side rejoins, rolls back any un-replicated
+tail natively, and resumes as secondaries automatically.
+
+---
+
+## 4. Planned switchover (maintenance on the active DC)
+
+**Action:**
+
+```bash
+kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to=dc-b
+```
+
+**Automatic:** the hub gates on the target's health and oplog lag, raises the target
+DC's `priority` by a normal `replSetReconfig`, then issues a **non-force**
+`replSetStepDown` on the current primary. The non-force stepDown only proceeds once an
+electable target secondary is caught up, so the loss of committed writes is near zero. The
+Lease follows to the new primary.
+
+**Verify:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery.activeDC}' # dc-b
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{.status.disasterRecovery.phase}' # Steady
+```
+
+**If it does not complete:** see scenario 7 (switchover stuck).
+
+---
+
+## 5. Planned failback to the original DC
+
+After the original DC is healthy and caught up (failback is native: rollback of the
+un-replicated tail, or a full initial resync if it fell outside the rollback/oplog
+window), steer the primary back:
+
+```bash
+kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to=dc-a
+```
+
+Same near-zero-RPO flow as scenario 4. There is no `pg_rewind` step; MongoDB rejoins
+the returned members on its own before they become electable.
+
+---
+
+## 6. Arbiter DC lost (even layout)
+
+**Symptoms:** the Arbiter DC is gone; its etcd member and the MongoDB voting arbiter
+are unreachable.
+
+**Impact:** none on writes. The two data DCs together hold 4 of the 5 votes, still a
+majority, so a primary holds and `w:majority` writes continue. You lose the tie-break
+vote, so a subsequent **second** failure (a data DC) can no longer auto-elect.
+
+**Verify the cluster is still writable:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName}{end}'
+```
+
+**Action:** restore the Arbiter DC (the etcd member and the MongoDB arbiter) to regain
+single-fault tolerance.
+
+---
+
+## 7. Planned switchover stuck (target not catching up)
+
+**Symptoms:** after annotating `switchover-to`, `phase` stays `FailingOver` and the
+primary does not move.
+
+**Diagnose:**
+
+```bash
+# Target oplog lag and health:
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} lag={.oplogLagSeconds} healthy={.healthy}{"\n"}{end}'
+# Replication state from the primary:
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.printSecondaryReplicationInfo()'
+```
+
+**Causes & action:**
+
+- **Target lag not converging:** the non-force `replSetStepDown` refuses until an
+  electable target secondary is caught up. Relieve the cross-DC bottleneck (network,
+  primary write load) so the target drains its lag.
+- **Target unhealthy:** ensure the target DC has a ready, electable secondary.
+- **Abort:** remove the annotation to cancel:
+  `kubectl annotate mongodb -n demo mg-dcdr dr.kubedb.com/switchover-to-`.
+
+---
+
+## 8. A standby DC is lost
+
+**Symptoms:** a non-active DC's members are gone; that DC shows `healthy: false`.
+
+**Impact:** none on writes in the odd layout (the active DC is unaffected). In the even
+layout, losing a data DC is scenario 1/2. You lose that DC's redundancy and its
+secondary read capacity until it returns.
+
+**Verify the active DC is still writable:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[?(@.writable==true)]}{.clusterName}{end}'
+```
+
+**Action:** recover the DC's members; they reschedule and resync from the active
+primary over the oplog automatically.
+
+---
+
+## 9. Re-add / recover a previously lost data center
+
+After a DC returns from a failure:
+
+**Automatic:** its members rejoin the replica set and catch up over the native oplog. A
+member that was previously primary rolls back its un-replicated tail automatically, or
+does a full initial resync if it fell outside the rollback/oplog window. In the even
+layout, if the lost members were reconfigured out (scenario 2), the operator
+reconfigs the returned members back in as low-priority secondaries.
+
+**Verify:**
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o jsonpath='{range .status.disasterRecovery.dataCenters[*]}{.clusterName} healthy={.healthy} lag={.oplogLagSeconds}{"\n"}{end}'
+```
+
+**Action:** to make it active again, perform a planned failback (scenario 5) once its
+oplog lag is small.
+
+---
+
+## 10. A DC is unexpectedly read-only
+
+**Symptoms:** a DC you expect to run the primary has only secondaries.
+
+**Diagnose:**
+
+```bash
+# Where is the primary, and what are the member priorities?
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.conf().members.map(m => ({host:m.host, priority:m.priority}))'
+# Which DC does the Lease intend?
+kubectl --kubeconfig -n dc-failover get lease primary-dc -o jsonpath='{.spec.holderIdentity}'
+```
+
+**Causes & action:**
+
+- **Transient priority takeover:** during a member bounce a standby can briefly hold the
+  primary until MongoDB's priority takeover returns it. Wait a few seconds and recheck;
+  this is expected.
+- **Lease intends another DC:** the priority is intentionally lower here; this DC is not
+  the active one (correct).
+- **Lost majority:** the DC is partitioned or short of votes, so MongoDB cannot elect a
+  primary here. See scenario 3.
+
+Never run `replSetReconfig {force:true}` or patch `kubedb.com/role` by hand to force a
+primary; the next reconcile reverts it and you risk a diverged config.
+
+---
+
+## 11. Coordination plane (dr-controlplane / etcd) unavailable
+
+**Symptoms:** the Lease cannot be read/renewed across the spokes.
+
+**Automatic:** MongoDB keeps running on its own majority, so the cluster stays writable
+in whichever DC holds the primary; the Lease is policy, not the failover mechanism, so
+its loss does not by itself make MongoDB read-only. What you lose is **priority
+steering and planned switchover**: the operator cannot reconfig priorities or move the
+active DC until the Lease quorum returns.
+
+**Verify:**
+
+```bash
+kubectl --kubeconfig -n dc-failover get lease primary-dc # error / stale
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role # a primary still exists
+```
+
+**Action:** restore the `dr-controlplane` etcd quorum (in the even layout it shares the
+Arbiter DC with the MongoDB arbiter). Once the Lease is renewable, priority steering and
+switchover resume.
+
+---
+
+## 12. Suspected split-brain (two primaries)
+
+This should be impossible with `w:majority` and 3-site votes (no single DC holds a
+majority, and a primary that loses its majority auto-steps-down). If
+`status.disasterRecovery` ever shows two `writable: true` DCs, or two pods labeled
+`kubedb.com/role: primary`:
+
+**Diagnose immediately:**
+
+```bash
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.status().members.map(m => ({name:m.name, state:m.stateStr}))'
+```
+
+**Action:** confirm clients write with `w:majority` (a minority primary cannot commit
+majority writes, so it cannot diverge committed data). The minority primary should
+auto-step-down within its election timeout. Verify the vote layout still spreads votes
+3-site (no single data DC was given a majority of votes by a bad reconfig). Do not
+force-reconfig; restore connectivity and let MongoDB settle to a single primary.
+
+---
+
+## Escalation checklist
+
+When unsure, collect:
+
+```bash
+kubectl get mongodb -n demo mg-dcdr -o yaml
+kubectl --kubeconfig -n dc-failover get lease -o yaml
+kubectl get pods -n demo -l app.kubernetes.io/instance=mg-dcdr -L kubedb.com/role,open-cluster-management.io/cluster-name -o wide
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.status()'
+kubectl exec -n demo mg-dcdr-0 -- mongosh --quiet --eval 'rs.conf()'
+```