Add DocumentDB disaster recovery (DC-DR) guide#950
📝 Walkthrough
Adds new documentation for DocumentDB DC-DR, including navigation index pages, a conceptual overview, an operational user guide, and a scenario-based runbook.
Changes: DocumentDB DC-DR Documentation
Estimated code review effort: 2 (Simple) | ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/guides/documentdb/dr/overview/index.md`:
- Around line 73-84: The document text is pointing to the wrong PlacementPolicy
field name: it mentions distributionRule.role even though the YAML uses
distributionRules. Update the prose in this section to use the plural field path
consistently, and keep the explanation aligned with the Arbiter/Witness role
examples so readers are directed to the actual schema used in the DocumentDB
guide.
In `@docs/guides/documentdb/dr/runbook/index.md`:
- Around line 337-340: Update the split-brain safety guidance in the runbook
section around the Lease holder/fence invariant to include cross-DC clock skew,
not just fence TTL versus Lease duration. In the relevant passage describing the
active DC, fence trip timing, and the timing invariant, replace the current
check with the full condition using fence TTL + cross-DC clock skew < primary-DC
Lease duration, and keep the rest of the operational guidance about stopping
writes and verifying the fence behavior aligned with this stronger rule.
- Around line 100-103: Update the “Automatic” switchover description so the
handoff threshold matches the documented
`dr.kubedb.com/switchover-max-lag-bytes` lag budget instead of saying “within
one WAL page.” In the runbook text under the automatic switchover section,
replace the stricter phrasing with wording that reflects the controller’s max
lag bytes behavior and default 16 MiB budget, keeping the rest of the sequence
(gates on health and lag, quiesces the active DC, waits for catch-up, then hands
off) unchanged.
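The timing invariant from the split-brain comment above (fence TTL + cross-DC clock skew < primary-DC Lease duration) can be checked mechanically before a drill. The following is a minimal sketch, not the controller's implementation: the function name `fence_invariant_holds` is hypothetical, all values are assumed to be whole seconds, and the sample numbers are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch: verify  fence TTL + cross-DC clock skew < primary-DC Lease duration.
# fence_invariant_holds is a hypothetical helper; arguments are whole seconds.
fence_invariant_holds() {
  local fence_ttl=$1 clock_skew=$2 lease_duration=$3
  # (( ... )) exits 0 when the comparison is true, non-zero otherwise.
  (( fence_ttl + clock_skew < lease_duration ))
}

# Illustrative values only: 10s fence TTL, 2s measured skew, 15s Lease.
if fence_invariant_holds 10 2 15; then
  echo "safe"
else
  echo "unsafe"
fi
```

With the same TTL and Lease but 5s of skew, the sum equals the Lease duration and the strict inequality fails, which is the case the TTL-only check would have missed.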
📒 Files selected for processing (5)
- docs/guides/documentdb/_index.md
- docs/guides/documentdb/dr/_index.md
- docs/guides/documentdb/dr/guide/index.md
- docs/guides/documentdb/dr/overview/index.md
- docs/guides/documentdb/dr/runbook/index.md
Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`:

| Role | Holds DocumentDB data | Primary eligible | Purpose |
| --- | --- | --- | --- |
| **Member** | yes | yes | A full DocumentDB group; a candidate for the active DC. |
| **Arbiter** | no | no | Vote only, the `dr-controlplane` etcd tie-breaker; runs no DocumentDB. **This is the role a DocumentDB witness DC uses.** |
| **Witness** | yes | no | Data-bearing but never primary, for engines whose witness must carry data (e.g. MongoDB). **Not used by DocumentDB.** |

> For DocumentDB the third "witness" data center is **vote-only** (it holds only the
> `dr-controlplane` etcd member, no DocumentDB), so it is declared with `role: Arbiter`
> and empty `replicaIndices`. The petset `Witness` role is reserved for engines whose
> witness must carry data; DocumentDB does not use it.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Use the plural field path here.
The prose points at distributionRule.role, but the YAML below uses distributionRules. That mismatch can send readers to a non-existent field.
♻️ Suggested fix
-Each DC plays one role, set on the `PlacementPolicy` `distributionRule.role`:
+Each DC plays one role, set on the `role` field of each `PlacementPolicy` `distributionRules[]` entry:
**Automatic:** the hub gates on the target's health and lag, quiesces the active DC
(holds its primary read-only via the Lease), waits until the target catches up to
within one WAL page, then hands off. Zero committed rows are lost. The annotation is
cleared on completion.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Align the switchover handoff threshold with the documented lag budget.
“Within one WAL page” conflicts with the user guide’s dr.kubedb.com/switchover-max-lag-bytes budget (default 16 MiB). Please make the runbook use the same criterion so operators do not expect a stricter handoff than the controller actually enforces.
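To make the gap between the two criteria concrete, the sketch below expresses the default 16 MiB budget in bytes and compares it with a single WAL page. It is an illustration under stated assumptions: the 8 KiB page size is PostgreSQL's default WAL block size and may differ in a custom build, the helper `within_lag_budget` is hypothetical, and the source does not say whether the controller compares with `<=` or `<` (the sketch assumes `<=`).

```bash
#!/usr/bin/env bash
# Default budget named in the review comment: 16 MiB, expressed in bytes.
DEFAULT_MAX_LAG_BYTES=$((16 * 1024 * 1024))
# Assumption: PostgreSQL's default WAL block size of 8 KiB.
WAL_PAGE_BYTES=8192

# Hypothetical helper: succeeds when the observed lag fits the budget.
# Usage: within_lag_budget <lag_bytes> [budget_bytes]
within_lag_budget() {
  local lag_bytes=$1 budget=${2:-$DEFAULT_MAX_LAG_BYTES}
  # Assumed inclusive comparison; the controller's exact rule is not given.
  (( lag_bytes <= budget ))
}

echo "budget in WAL pages: $((DEFAULT_MAX_LAG_BYTES / WAL_PAGE_BYTES))"
# prints: budget in WAL pages: 2048
```

Under these assumptions the documented budget is 2048 times looser than "within one WAL page", which is why the runbook wording would set an expectation the controller does not enforce.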
Proposed wording
- waits until the target catches up to within one WAL page, then hands off. Zero committed rows are lost.
+ waits until the target catches up within the configured `dr.kubedb.com/switchover-max-lag-bytes` budget (default 16 MiB), then hands off. Zero committed rows are lost.
Signed-off-by: Tamal Saha <tamal@appscode.com>
779aafa to e3e1f61
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/guides/documentdb/dr/runbook/index.md`:
- Around line 137-141: The active DC check in the DocumentDB DR runbook is using
the wrong field from status.disasterRecovery.dataCenters, so it returns nothing.
Update the kubectl jsonpath in the “Verify the active DC is still writable”
example to use clusterName instead of name, matching the fields exposed by the
DocumentDB status. Keep the rest of the runbook text unchanged.
📒 Files selected for processing (5)
- docs/guides/documentdb/_index.md
- docs/guides/documentdb/dr/_index.md
- docs/guides/documentdb/dr/guide/index.md
- docs/guides/documentdb/dr/overview/index.md
- docs/guides/documentdb/dr/runbook/index.md
✅ Files skipped from review due to trivial changes (2)
- docs/guides/documentdb/_index.md
- docs/guides/documentdb/dr/_index.md
**Verify the active DC is still writable:**

```bash
kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].name}'
```
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Use clusterName in the active-DC check.
status.disasterRecovery.dataCenters exposes clusterName, so this jsonpath returns nothing with .name.
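Because a jsonpath query on a missing field prints nothing and still exits successfully, a runbook step like this can fail silently. The wrapper below is a hedged sketch of one way to surface that: `active_dc` is a hypothetical helper, the `clusterName` field is taken from this review comment rather than verified against the CRD, and the namespace and resource name are the ones used in the runbook example.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: print the writable DC, or fail if none is reported.
# Usage: active_dc <namespace> <documentdb-name>
active_dc() {
  local ns=$1 name=$2 dc
  # Field path per the review comment; an unknown field yields empty output.
  dc=$(kubectl get documentdb -n "$ns" "$name" \
    -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].clusterName}') || return 1
  if [ -z "$dc" ]; then
    echo "no writable data center reported for $ns/$name" >&2
    return 1
  fi
  printf '%s\n' "$dc"
}

# Example invocation (requires cluster access):
#   active_dc demo docdb-dcdr
```

Treating empty output as an error means a wrong field name and a genuinely unwritable cluster both stop the operator, instead of reading as "nothing to report".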
Suggested fix
-kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].name}'
+kubectl get documentdb -n demo docdb-dcdr -o jsonpath='{.status.disasterRecovery.dataCenters[?(@.writable==true)].clusterName}'
Visit the preview URL for this PR (updated for commit e3e1f61): https://kubedb-v2-hugo--pr950-documentdb-dc-dr-doc-s0bo7a88.web.app (expires Wed, 08 Jul 2026 16:04:46 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 0f29ae8ae0bd54a99bf2b223b6833be47acd5943 |
This adds cross data center disaster recovery (DC-DR) documentation for KubeDB DocumentDB, mirroring the Postgres DR docs since DocumentDB is Microsoft DocumentDB (the pg_documentdb extension) running on PostgreSQL under the hood. The same machinery applies: WAL streaming replication between data centers, the per-DC documentdb-coordinator raft, the dr-controlplane primary-DC Lease as the single cross-DC failover authority, a fail-closed marker fence, leader-to-leader cross-DC streaming with intra-DC cascade, and pg_rewind failback.
Pages added (all under docs/guides/documentdb/)
Notes
Summary by CodeRabbit