
Add DocumentDB disaster recovery (DC-DR) guide #950

Open
tamalsaha wants to merge 1 commit into master from documentdb-dc-dr-docs

@tamalsaha tamalsaha commented Jul 1, 2026

Member

This adds cross data center disaster recovery (DC-DR) documentation for KubeDB DocumentDB, mirroring the Postgres DR docs since DocumentDB is Microsoft DocumentDB (the pg_documentdb extension) running on PostgreSQL under the hood. The same machinery applies: WAL streaming replication between data centers, the per-DC documentdb-coordinator raft, the dr-controlplane primary-DC Lease as the single cross-DC failover authority, a fail-closed marker fence, leader-to-leader cross-DC streaming with intra-DC cascade, and pg_rewind failback.

Pages added (all under docs/guides/documentdb/)

  • dr/overview/index.md (DC-DR Overview): architecture, data center roles (Member / Arbiter / Witness), deployment topologies (TwoDC, ThreeDC, two-site), quick start, status.disasterRecovery, unplanned failover, planned zero-RPO switchover, and per-DC scaling.
  • dr/guide/index.md (User Guide): components, the DC-name contract, operator flags, deployment, connecting, monitoring, replication and lag, RTO/safety timing invariant, quorum and arbiters, switchover and failback, per-DC horizontal scaling, day-2 ops, backup, and cleanup.
  • dr/runbook/index.md (Runbook): 14 scenario-by-scenario procedures (active DC loss, partition, planned switchover/failback, coordination-plane loss, stuck switchover, lag growth, fence tripped, split-brain suspicion) plus an escalation checklist.

Notes

  • A minimal DocumentDB guides parent (docs/guides/documentdb/_index.md) was added because none existed on master; the DR section hangs off it.
  • Examples use kind DocumentDB (apiVersion kubedb.com/v1alpha2, version "17.9") and DocumentDBOpsRequest, with documentdb-coordinator as the per-pod coordinator.
  • Where the overview references the shared distributed substrate, it links to the Distributed Postgres guide since DocumentDB reuses it.

Summary by CodeRabbit

  • Documentation
    • Added new DocumentDB guides for cross data center disaster recovery, including an index page plus a DC-DR overview covering roles, topology options, deployment steps, monitoring, and operational flows for failover and planned switchover/failback, scaling, and cleanup/backup behavior.
    • Added a complete DC-DR runbook with golden rules, quick-reference commands, scenario-based troubleshooting, and an escalation checklist for investigating issues such as promotion failures, lag, fencing/read-only states, and suspected split-brain.

kodiak-appscode Bot previously approved these changes Jul 1, 2026

coderabbitai Bot commented Jul 1, 2026

Walkthrough

Adds new documentation for DocumentDB DC-DR, including navigation index pages, a conceptual overview, an operational user guide, and a scenario-based runbook.

Changes

DocumentDB DC-DR Documentation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Navigation index pages | `docs/guides/documentdb/_index.md`, `docs/guides/documentdb/dr/_index.md` | Adds front matter defining page titles and versioned menu metadata for the DocumentDB and DR guide sections. |
| DC-DR conceptual overview | `docs/guides/documentdb/dr/overview/index.md` | Documents the DC-DR model, fencing, WAL streaming, DC roles, topology options, prerequisites, deployment examples, status fields, failover/switchover, scaling, day-2 ops, and cleanup. |
| DC-DR operational user guide | `docs/guides/documentdb/dr/guide/index.md` | Documents component responsibilities, marker/fence contract, DC naming, operator flags, deployment, per-DC resource creation, connectivity, monitoring, replication/RPO, timing invariants, quorum, switchover/failback, scaling, backup, cleanup, and limitations. |
| DC-DR runbook and escalation procedures | `docs/guides/documentdb/dr/runbook/index.md` | Adds quick-reference commands, golden rules, 14 numbered operational scenarios, and an escalation checklist. |

Estimated code review effort: 2 (Simple) | ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding DocumentDB disaster recovery documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/guides/documentdb/dr/overview/index.md`:
- Around line 73-84: The document text is pointing to the wrong PlacementPolicy
field name: it mentions distributionRule.role even though the YAML uses
distributionRules. Update the prose in this section to use the plural field path
consistently, and keep the explanation aligned with the Arbiter/Witness role
examples so readers are directed to the actual schema used in the DocumentDB
guide.

In `@docs/guides/documentdb/dr/runbook/index.md`:
- Around line 337-340: Update the split-brain safety guidance in the runbook
section around the Lease holder/fence invariant to include cross-DC clock skew,
not just fence TTL versus Lease duration. In the relevant passage describing the
active DC, fence trip timing, and the timing invariant, replace the current
check with the full condition using fence TTL + cross-DC clock skew < primary-DC
Lease duration, and keep the rest of the operational guidance about stopping
writes and verifying the fence behavior aligned with this stronger rule.
- Around line 100-103: Update the “Automatic” switchover description so the
handoff threshold matches the documented
`dr.kubedb.com/switchover-max-lag-bytes` lag budget instead of saying “within
one WAL page.” In the runbook text under the automatic switchover section,
replace the stricter phrasing with wording that reflects the controller’s max
lag bytes behavior and default 16 MiB budget, keeping the rest of the sequence
(gates on health and lag, quiesces the active DC, waits for catch-up, then hands
off) unchanged.
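
The timing rule in the first runbook finding above (fence TTL + cross-DC clock skew < primary-DC Lease duration) is plain arithmetic, so it can be sanity-checked before relying on it. A minimal shell sketch, assuming all three values are whole seconds; the helper name `fence_invariant_holds` and the sample numbers are hypothetical and are not taken from the runbook or from any KubeDB default:

```bash
# Hypothetical helper (not part of KubeDB). Succeeds only when
#   fence TTL + cross-DC clock skew < primary-DC Lease duration
# Arguments, all in whole seconds:
#   $1 = fence TTL, $2 = cross-DC clock skew, $3 = Lease duration
fence_invariant_holds() {
  [ $(( $1 + $2 )) -lt "$3" ]
}

# Illustrative values only: 10 + 2 = 12, which is below 15.
if fence_invariant_holds 10 2 15; then
  echo "invariant holds"
else
  echo "invariant violated"
fi
```

The comparison is strict, matching the `<` in the finding: a fence TTL plus skew that exactly equals the Lease duration is treated as a violation.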

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e16d37c0-712e-4e54-ad11-15255e51035b

📥 Commits

Reviewing files that changed from the base of the PR and between 405b88b and 779aafa.

📒 Files selected for processing (5)
  • docs/guides/documentdb/_index.md
  • docs/guides/documentdb/dr/_index.md
  • docs/guides/documentdb/dr/guide/index.md
  • docs/guides/documentdb/dr/overview/index.md
  • docs/guides/documentdb/dr/runbook/index.md

Comment on lines +73 to +84
Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`:

| Role | Holds DocumentDB data | Primary eligible | Purpose |
| --- | --- | --- | --- |
| **Member** | yes | yes | A full DocumentDB group; a candidate for the active DC. |
| **Arbiter** | no | no | Vote only, the `dr-controlplane` etcd tie-breaker; runs no DocumentDB. **This is the role a DocumentDB witness DC uses.** |
| **Witness** | yes | no | Data-bearing but never primary, for engines whose witness must carry data (e.g. MongoDB). **Not used by DocumentDB.** |

> For DocumentDB the third "witness" data center is **vote-only** (it holds only the
> `dr-controlplane` etcd member, no DocumentDB), so it is declared with `role: Arbiter`
> and empty `replicaIndices`. The petset `Witness` role is reserved for engines whose
> witness must carry data; DocumentDB does not use it.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the plural field path here.

The prose points at distributionRule.role, but the YAML below uses distributionRules. That mismatch can send readers to a non-existent field.

♻️ Suggested fix
-Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`:
+Each DC plays one role, set on the `role` field of each `PlacementPolicy` `distributionRules[]` entry:

Comment on lines +100 to +103
**Automatic:** the hub gates on the target's health and lag, quiesces the active DC
(holds its primary read-only via the Lease), waits until the target catches up to
within one WAL page, then hands off. Zero committed rows are lost. The annotation is
cleared on completion.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Align the switchover handoff threshold with the documented lag budget.

“Within one WAL page” conflicts with the user guide’s dr.kubedb.com/switchover-max-lag-bytes budget (default 16 MiB). Please make the runbook use the same criterion so operators do not expect a stricter handoff than the controller actually enforces.

Proposed wording
- waits until the target catches up to within one WAL page, then hands off. Zero committed rows are lost.
+ waits until the target catches up within the configured `dr.kubedb.com/switchover-max-lag-bytes` budget (default 16 MiB), then hands off. Zero committed rows are lost.
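
The criterion in the proposed wording comes down to comparing a lag byte count against the budget. A minimal shell sketch, assuming the lag is already available as a byte count and that 16 MiB means 16 × 1024 × 1024 bytes; `lag_within_budget` is a hypothetical helper, and treating the boundary as inclusive is an assumption rather than documented controller behavior:

```bash
# Default budget quoted in the review: 16 MiB, expressed in bytes.
DEFAULT_MAX_LAG_BYTES=$((16 * 1024 * 1024))

# Hypothetical helper (not part of KubeDB).
#   $1 = observed replication lag in bytes
#   $2 = optional budget in bytes; falls back to the 16 MiB default
# Succeeds when the lag does not exceed the budget (inclusive by assumption).
lag_within_budget() {
  budget=${2:-$DEFAULT_MAX_LAG_BYTES}
  [ "$1" -le "$budget" ]
}

# One 8 KiB WAL page of lag is well inside a 16 MiB budget.
if lag_within_budget 8192; then
  echo "within budget"
else
  echo "over budget"
fi
```

This also makes the reviewer's point concrete: a target that is megabytes behind still passes a 16 MiB gate, so "within one WAL page" describes a much stricter condition than the documented budget.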

Comment thread docs/guides/documentdb/dr/runbook/index.md
Signed-off-by: Tamal Saha <tamal@appscode.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/guides/documentdb/dr/runbook/index.md`:
- Around line 137-141: The active DC check in the DocumentDB DR runbook is using
the wrong field from status.disasterRecovery.dataCenters, so it returns nothing.
Update the kubectl jsonpath in the “Verify the active DC is still writable”
example to use clusterName instead of name, matching the fields exposed by the
DocumentDB status. Keep the rest of the runbook text unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 251e4c99-e2f1-469a-9a17-a7648012f946

📥 Commits

Reviewing files that changed from the base of the PR and between 779aafa and e3e1f61.

📒 Files selected for processing (5)
  • docs/guides/documentdb/_index.md
  • docs/guides/documentdb/dr/_index.md
  • docs/guides/documentdb/dr/guide/index.md
  • docs/guides/documentdb/dr/overview/index.md
  • docs/guides/documentdb/dr/runbook/index.md
✅ Files skipped from review due to trivial changes (2)
  • docs/guides/documentdb/_index.md
  • docs/guides/documentdb/dr/_index.md

Comment on lines +137 to +141
**Verify the active DC is still writable:**

```bash
kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].name}'
```

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use clusterName in the active-DC check.

status.disasterRecovery.dataCenters exposes clusterName, so this jsonpath returns nothing with .name.

Suggested fix
-kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].name}'
+kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].clusterName}'
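
Because a wrong field name makes this jsonpath print nothing instead of failing, a script that consumes the output may want to treat an empty result as an error. A minimal shell sketch, assuming the query prints matching names separated by whitespace; `classify_writable_dcs` is a hypothetical helper and the DC names in the example are made up:

```bash
# Hypothetical helper (not part of KubeDB).
#   $1 = raw output of the jsonpath query for writable data centers
# Prints "none", "one", or "multiple" depending on how many names it holds.
classify_writable_dcs() {
  # Intentionally unquoted: split the output on whitespace to count names.
  set -- $1
  case $# in
    0) echo "none" ;;
    1) echo "one" ;;
    *) echo "multiple" ;;
  esac
}

# Intended use (shown as a comment so the sketch runs without a cluster):
#   out=$(kubectl get documentdb -n demo docdb-dcdr -o jsonpath='...clusterName}')
#   classify_writable_dcs "$out"
classify_writable_dcs "dc-east"
```

An empty result would flag either a field-name mistake like the one above or a DC that is not writable, and more than one name would presumably call for the runbook's split-brain investigation.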

github-actions Bot commented Jul 1, 2026

Visit the preview URL for this PR (updated for commit e3e1f61):

https://kubedb-v2-hugo--pr950-documentdb-dc-dr-doc-s0bo7a88.web.app

(expires Wed, 08 Jul 2026 16:04:46 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 0f29ae8ae0bd54a99bf2b223b6833be47acd5943
