Add MongoDB cross-DC disaster recovery (DC-DR) docs #941
Add a Disaster Recovery (DR) section under the MongoDB guides with three pages: an overview, a hands-on user guide, and an operational runbook, mirroring the Postgres DC-DR docs structure.
The MongoDB design differs from Postgres. MongoDB is geo-aware, so one replica set spans the data centers and the oplog is the native cross-DC link (no second replication link, no remote-replica). Failover is MongoDB's own majority election (the operator does not force promotion) and failback is native rollback or resync (no pg_rewind). Voting members are spread across three sites so that no single data DC holds a majority: the 2+2+1 even layout uses two data DCs plus one data-less MongoDB voting arbiter in the arbiter DC, which also holds the dr-controlplane etcd member. Writes use w:majority as the split-brain guarantee. The dr-controlplane Lease steers member priority and follows the elected primary; it is not the failover mechanism. Planned switchover, triggered by the dr.kubedb.com/switchover-to annotation, raises the target's priority and then issues a non-force replSetStepDown.
Also add a Cross-DC Disaster Recovery section to the MongoDB guides README linking to the overview.
Signed-off-by: Tamal Saha <tamal@appscode.com>
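The vote arithmetic behind the 2+2+1 layout can be checked with a small model. This is a sketch, not operator code: the site names are hypothetical, and the write-majority rule used here (the smaller of the voting majority and the number of data-bearing voting members) is taken from my reading of the MongoDB write concern documentation rather than from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Site:
    name: str              # hypothetical DC name, for illustration only
    data_voters: int       # data-bearing voting members in this site
    arbiters: int = 0      # data-less voting arbiters in this site


def election_majority(sites: Sequence[Site]) -> int:
    """Votes needed to elect a primary: a strict majority of all voting members."""
    total = sum(s.data_voters + s.arbiters for s in sites)
    return total // 2 + 1


def write_majority(sites: Sequence[Site]) -> int:
    """Acknowledgements needed for w:majority (assumed rule: the smaller of the
    voting majority and the count of data-bearing voting members)."""
    data = sum(s.data_voters for s in sites)
    return min(election_majority(sites), data)


def survives(sites: Sequence[Site], lost: Optional[str]) -> Tuple[bool, bool]:
    """Return (can_elect_primary, can_commit_w_majority) after losing one site,
    with the replica set configuration left unchanged."""
    alive = [s for s in sites if s.name != lost]
    votes = sum(s.data_voters + s.arbiters for s in alive)
    data = sum(s.data_voters for s in alive)
    return votes >= election_majority(sites), data >= write_majority(sites)


# The even layout from the description: two data DCs plus one arbiter DC.
layout = [Site("dc-a", 2), Site("dc-b", 2), Site("dc-arbiter", 0, arbiters=1)]

# The same set after the lost data DC's members are reconfigured out.
reconfigured = [Site("dc-b", 2), Site("dc-arbiter", 0, arbiters=1)]

if __name__ == "__main__":
    print("election majority:", election_majority(layout))
    print("lose dc-a:", survives(layout, "dc-a"))
    print("after reconfig:", survives(reconfigured, None))
```

Under these assumptions, losing one data DC still leaves enough votes to elect a primary but not enough data-bearing members to acknowledge `w:majority`, which is consistent with the stall and reconfig-out step the runbook describes for the even layout.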
📝 Walkthrough
This PR adds MongoDB DC-DR documentation: a new index page and README link, an overview page, a deployment guide, and a runbook with scenario-based operational procedures.
Changes: MongoDB DC-DR Documentation
Estimated code review effort: 2 (Simple) | ~15 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 4
🧹 Nitpick comments (4)
docs/guides/mongodb/dr/overview/index.md (2)
308-312: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Minor style: "on purpose" → "deliberately"
- To move the active DC on purpose without losing committed writes, annotate the
+ To move the active DC deliberately without losing committed writes, annotate the
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/mongodb/dr/overview/index.md` around lines 308 - 312, The wording in the MongoDB DR overview should be updated from “on purpose” to “deliberately” for a more polished style. Find the sentence in the documentation near the kubectl annotate example and revise only that phrasing while keeping the rest of the instructions and example unchanged.
131-137: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Clarify the automatic reconfig scope for even-layout stall recovery.
The text states the operator issues `replSetReconfig` that drops lost members so `w:majority` resumes. Cross-reference with runbook scenario 2, which notes this is "not a force reconfig." Consider adding that same clarification here for consistency, since readers of the overview may not read the runbook.
- majority-committed `replSetReconfig` that drops the lost members, so `w:majority` writes
- resume. `status.disasterRecovery.phase` moves to `FailingOver` and back to `Steady`.
+ majority-committed `replSetReconfig` that drops the lost members, so `w:majority` writes
+ resume. This is not a force reconfig. `status.disasterRecovery.phase` moves to `FailingOver` and back to `Steady`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/mongodb/dr/overview/index.md` around lines 131 - 137, Clarify the recovery description in the overview’s “Lose a data DC” section by stating that the operator performs a normal, majority-committed replSetReconfig and that it is not a force reconfig. Update the wording around the recovery step so it matches the runbook scenario 2 terminology, making the scope of the automatic reconfig explicit for the stalled w:majority case.
docs/guides/mongodb/dr/guide/index.md (1)
260-268: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Clarify the switchover catch-up gate description.
The current phrasing "so near-zero committed writes are lost" is ambiguous: it can be read as stating that committed writes are lost rather than that almost none are lost. The overview's phrasing "which is the near-zero-RPO gate" is clearer. Consider aligning:
- 3. issues a **non-force** `replSetStepDown` on the current primary, which only
- succeeds once an electable target secondary is caught up (the catch-up gate);
+ 3. issues a **non-force** `replSetStepDown` on the current primary, which only
+ succeeds once an electable target secondary is caught up (the near-zero-RPO catch-up gate);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/mongodb/dr/guide/index.md` around lines 260 - 268, Clarify the switchover catch-up gate wording in the MongoDB DR guide so it does not imply committed writes are lost; update the description in the failover/switchover steps and the related sentence near Watch status.disasterRecovery to match the clearer “near-zero-RPO gate” phrasing. Keep the explanation tied to the non-force replSetStepDown and electable target secondary catch-up behavior described in the surrounding section.
docs/guides/mongodb/dr/runbook/index.md (1)
134-139: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Clarify the switchover RPO description.
"so near-zero committed writes are lost" is ambiguous. It should convey that almost no committed writes are lost, not that they are lost.
- The non-force stepDown only proceeds once an electable target secondary is caught up, so near-zero committed writes are lost.
+ The non-force stepDown only proceeds once an electable target secondary is caught up, so near-zero committed writes are lost (almost none).
Or better, align with the overview's phrasing:
- The non-force stepDown only proceeds once an electable target secondary is caught up, so near-zero committed writes are lost.
+ The non-force stepDown only proceeds once an electable target secondary is caught up (the near-zero-RPO gate).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/mongodb/dr/runbook/index.md` around lines 134 - 139, The switchover RPO wording in the runbook is ambiguous in the automatic failover description near the replSetStepDown flow. Update the sentence in the switchover section so it clearly states that almost no committed writes are lost, matching the intended RPO meaning and the phrasing used in the overview, rather than implying committed writes are lost.
📒 Files selected for processing (5)
docs/guides/mongodb/README.md
docs/guides/mongodb/dr/_index.md
docs/guides/mongodb/dr/guide/index.md
docs/guides/mongodb/dr/overview/index.md
docs/guides/mongodb/dr/runbook/index.md
Read the [DC-DR Overview](/docs/guides/mongodb/dr/overview/index.md) first for the
architecture, and the [DC-DR Runbook](/docs/guides/mongodb/dr/runbook/index.md) for
scenario-by-scenario procedures.
> **New to KubeDB?** Please start [here](/docs/README.md).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use descriptive link text for accessibility.
"Please start here" is non-descriptive link text.
- > **New to KubeDB?** Please start [here](/docs/README.md).
+ > **New to KubeDB?** Please start with the [KubeDB README](/docs/README.md).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 24-24: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/mongodb/dr/guide/index.md` around lines 20 - 25, Update the link
text in the MongoDB DR guide so it is descriptive instead of “here”; the issue
is in the introductory paragraph that references /docs/README.md. Replace the
generic anchor text with a meaningful label that identifies the destination, and
keep the existing link target unchanged. Use the surrounding README callout in
the guide content as the place to locate and adjust the text.
Source: Linters/SAST tools
- [DC-DR User Guide](/docs/guides/mongodb/dr/guide/index.md) for every aspect of
  running in DC-DR mode (components, status, connecting, monitoring, switchover,
  failback, day-2 ops).
- [DC-DR Runbook](/docs/guides/mongodb/dr/runbook/index.md) for what to do in each
  operational scenario.
> **New to KubeDB?** Please start [here](/docs/README.md).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use descriptive link text for accessibility.
"Please start here" is non-descriptive link text.
- > **New to KubeDB?** Please start [here](/docs/README.md).
+ > **New to KubeDB?** Please start with the [KubeDB README](/docs/README.md).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 30-30: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/mongodb/dr/overview/index.md` around lines 24 - 30, The overview
page contains a non-descriptive accessibility link in the “New to KubeDB?” note,
so update the markdown in the overview document to use clearer link text instead
of “here.” Keep the destination pointing to the same README, but replace the
vague anchor text with something that names the target, such as the KubeDB
README, to improve accessibility and clarity.
Source: Linters/SAST tools
# MongoDB DC-DR Runbook
Scenario-by-scenario procedures for operating a MongoDB cluster in cross data center
disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB and
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix hyphenation: "cross data center" → "cross-data-center"
- Scenario-by-scenario procedures for operating a MongoDB cluster in cross data center
+ Scenario-by-scenario procedures for operating a MongoDB cluster in cross-data-center
🧰 Tools
🪛 LanguageTool
[grammar] ~15-~15: Use a hyphen to join words.
Context: ...for operating a MongoDB cluster in cross data center disaster recovery (DC-DR) mo...
(QB_NEW_EN_HYPHEN)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/mongodb/dr/runbook/index.md` around lines 13 - 16, Update the
wording in the MongoDB DC-DR runbook intro to use the hyphenated form
“cross-data-center” instead of “cross data center.” Keep the change in the
opening description near the “MongoDB DC-DR Runbook” heading so the terminology
is consistent throughout the document.
Source: Linters/SAST tools
## Cross-DC Disaster Recovery (DC-DR)
Do you want to run your MongoDB database across multiple data centers and recover from a full data center failure with a single, automatically failing-over endpoint? KubeDB runs one replica set across the data centers, spreads the voting members 3-site so no single data center holds a majority, writes with `w:majority`, and lets MongoDB's own election promote a surviving data center. Follow [here](/docs/guides/mongodb/dr/overview/index.md).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use descriptive link text for accessibility.
"Follow here" is non-descriptive. Replace with meaningful text describing the destination.
- Follow [here](/docs/guides/mongodb/dr/overview/index.md).
+ Follow the [MongoDB Cross-DC Disaster Recovery overview](/docs/guides/mongodb/dr/overview/index.md).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 70-70: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/mongodb/README.md` around lines 68 - 70, The MongoDB DC-DR
section uses non-descriptive link text in the final sentence, so update the link
in the “Cross-DC Disaster Recovery (DC-DR)” section to use meaningful
destination text instead of “here”. Keep the same target and make the anchor
text describe the linked overview page clearly for accessibility and
readability.
Source: Linters/SAST tools
- guide: add HorizontalScaling to the day-2 ops list (it was omitted) and
document per-DC scaling via spec.horizontalScaling.dataCenters[].{clusterName,
replicas} with a MongoDBOpsRequest example, matching the ops API.
- guide: correct the status.disasterRecovery.dataCenters[].healthy description to
reflect health Lease freshness (the type semantics), not just a ready member.
- overview: add a net-new availability caveat so the page does not read as GA; the
distributed substrate and DC-DR layer are net-new for MongoDB.
Verified all YAML/status field names against apimachinery (MongoDBDCStatus:
clusterName, role, primary, writable, oplogLagSeconds, healthy; phase; activeDC;
lastTransitionTime) and petset (distributionRules role Member/Arbiter, Witness
removed on the remove-witness-role branch; failoverPolicy.mode, trigger.scope).
spec.podTemplate.spec.podPlacementPolicy matches the shipped distributed Postgres
convention. No em-dashes; weight 36 is free; front matter and parent chain valid.
Signed-off-by: Tamal Saha <tamal@appscode.com>
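For orientation, a per-DC scaling request of the kind this commit documents might be assembled as below. This is a hedged sketch: only `spec.horizontalScaling.dataCenters[].{clusterName, replicas}` comes from the commit message; the `ops.kubedb.com/v1alpha1` group/version, the `databaseRef` field, and every object, namespace, and DC name are assumptions for illustration and should be checked against the guide's own example.

```python
import json
from typing import Dict


def per_dc_scaling_request(name: str, namespace: str, database: str,
                           replicas_by_dc: Dict[str, int]) -> dict:
    """Build a MongoDBOpsRequest-shaped manifest that scales each data center
    independently. All names passed in are hypothetical."""
    for dc, replicas in replicas_by_dc.items():
        if not isinstance(replicas, int) or replicas < 1:
            raise ValueError(f"replicas for {dc!r} must be a positive integer")
    return {
        "apiVersion": "ops.kubedb.com/v1alpha1",  # assumed group/version
        "kind": "MongoDBOpsRequest",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "HorizontalScaling",
            "databaseRef": {"name": database},    # assumed reference field
            "horizontalScaling": {
                # Field path taken from the commit message above.
                "dataCenters": [
                    {"clusterName": dc, "replicas": replicas}
                    for dc, replicas in replicas_by_dc.items()
                ]
            },
        },
    }


manifest = per_dc_scaling_request(
    name="mg-dr-scale", namespace="demo", database="mg-dr",
    replicas_by_dc={"dc-a": 3, "dc-b": 3},
)

if __name__ == "__main__":
    # JSON is accepted by `kubectl apply -f -` just like YAML.
    print(json.dumps(manifest, indent=2))
```

Emitting JSON keeps the sketch dependency-free; the same structure can be written as YAML in the docs.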
Add a Disaster Recovery (DR) section under the MongoDB guides, the MongoDB counterpart to the Postgres DC-DR docs (#912). This is the user-facing companion to the apimachinery MongoDB DC-DR types and the mongodb operator substrate PRs.
What is added
Three pages under `docs/guides/mongodb/dr/`, mirroring the Postgres DC-DR shape:
- Overview: … the `w:majority` guarantee, the role of the Lease, and the single-CR single-endpoint model.
- User guide: … `MongoDB` CR with `spec.distributed`, the `PlacementPolicy` with Member/Arbiter roles and a `failoverPolicy`, inspecting `status.disasterRecovery`, connecting with `w:majority`, and reading from the secondary Service.
- Runbook: … `w:majority` stall and reconfig-out step in the even layout, native failback, the arbiter-DC-loss case, partition behavior, and verifying exactly one writable DC.
A Cross-DC Disaster Recovery section is also added to the MongoDB guides README, linking to the overview.
Why MongoDB differs from Postgres
MongoDB is geo-aware, so one replica set spans the data centers and the oplog is the native cross-DC link (no second replication link, no remote-replica). Failover is MongoDB's own majority election, the operator does not force promotion, and failback is native rollback or resync (no pg_rewind). Voting members are spread 3-site so no single data DC holds a majority: the even layout pairs two data DCs with one data-less MongoDB voting arbiter in the arbiter DC, which also holds the dr-controlplane etcd member.
`w:majority` is the split-brain guarantee. The Lease steers member priority and follows the elected primary rather than driving it. Planned switchover raises target priority then issues a non-force `replSetStepDown`, triggered by the `dr.kubedb.com/switchover-to` annotation.
Summary by CodeRabbit
New Features
Documentation
w:majority), client connection behavior, status visibility, and cleanup notes.
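The runbook step "verifying exactly one writable DC" could be scripted against the status fields named in the commit message (`status.disasterRecovery.dataCenters[]` with `clusterName` and `writable`). The following is a minimal sketch under that assumption: the input is the `MongoDB` object as a parsed dictionary (for example from `kubectl get ... -o json`), and the sample values and DC names are invented for illustration.

```python
from typing import List


def writable_data_centers(mongodb: dict) -> List[str]:
    """Return the clusterName of every data center reported as writable."""
    dr = mongodb.get("status", {}).get("disasterRecovery", {})
    return [
        dc.get("clusterName", "<unnamed>")
        for dc in dr.get("dataCenters", [])
        if dc.get("writable") is True
    ]


def has_single_writer(mongodb: dict) -> bool:
    """True only when exactly one data center is writable."""
    return len(writable_data_centers(mongodb)) == 1


# Hypothetical status snapshot; field names follow the commit message,
# values are made up.
sample = {
    "status": {
        "disasterRecovery": {
            "phase": "Steady",
            "dataCenters": [
                {"clusterName": "dc-a", "writable": True, "healthy": True},
                {"clusterName": "dc-b", "writable": False, "healthy": True},
                {"clusterName": "dc-arbiter", "writable": False, "healthy": True},
            ],
        }
    }
}

if __name__ == "__main__":
    print(writable_data_centers(sample), has_single_writer(sample))
```

A check like this treats both zero and multiple writable DCs as failures, so it flags a stalled cluster as well as a suspected split brain.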