Add RabbitMQ cross data center disaster recovery (DC-DR) docs #947

Open: tamalsaha wants to merge 1 commit into master from rabbitmq-dc-dr-docs

@tamalsaha (Member) commented Jul 1, 2026

Adds documentation for KubeDB RabbitMQ cross data center disaster recovery (DC-DR), mirroring the structure of the Kafka DR docs and adapting it to RabbitMQ semantics.

New pages under docs/guides/rabbitmq/dr/:

  • _index.md menu entry (Disaster Recovery).
  • overview/index.md: concept overview and quick start. Explains why RabbitMQ DR falls in the async-replication camp (no cluster-wide primary; quorum queues run their own intra-DC Raft), the five architecture rules, DC roles (Member/Arbiter), the single-CR single-endpoint model, deploy walkthrough, status.disasterRecovery, failover, planned switchover, failback, and cleanup.
  • guide/index.md: full user guide. Components, DC-name contract, deployment, operator-managed Federation upstreams and policies, connecting/publishing over AMQP, consumers resuming after a flip, monitoring federation lag, the publish fence (permission or listener gate), planned switchover, failback, scaling and day-2 ops, limitations.
  • runbook/index.md: 12 scenario-by-scenario procedures plus an escalation checklist.

RabbitMQ-specific mechanics vs the Kafka template:

  • Cross-DC replication is the RabbitMQ Federation plugin (or Shovel), active-to-standby, not MirrorMaker 2. No Connector CRD; the operator manages federation runtime parameters and policies.
  • Quorum queues run per-queue Raft intra-DC; the Arbiter DC is engine-free.
  • AMQP (5672) publish endpoint follows the active cluster; inter-node 25672; management/federation 15672.
  • Writability is Lease-gated and fenced fail-closed (revoke write permission or gate the AMQP listener).
  • status.disasterRecovery reports activeDC, phase, and per-DC nodesReady/federationLagMessages/writable/healthy. Planned switchover is annotation-triggered (dr.kubedb.com/switchover-to), no Switchover ops type.

The pages hedge that the distributed substrate and DC-DR layer are net-new and forward-looking, matching the Kafka docs' tone.

Summary by CodeRabbit

  • Documentation
    • Added a new RabbitMQ Disaster Recovery docs section with an overview, step-by-step guide, and runbook.
    • Documented cross–data center active/passive behavior, failover and failback workflows, deployment prerequisites, monitoring tips, and cleanup steps.
    • Included scenario-based operational guidance and troubleshooting commands for common disaster recovery situations.

Signed-off-by: Tamal Saha <tamal@appscode.com>

coderabbitai Bot commented Jul 1, 2026

Walkthrough

This pull request adds new RabbitMQ Disaster Recovery documentation to the docs site: a navigation index page, a comprehensive DC-DR user guide, a conceptual overview page, and an operational runbook with scenario-based procedures for handling failover, switchover, and failback.

Changes

RabbitMQ DC-DR Documentation

| Layer / File(s) | Summary |
|---|---|
| **Navigation index and guide intro**<br>`docs/guides/rabbitmq/dr/_index.md`, `docs/guides/rabbitmq/dr/guide/index.md` | Adds front matter/menu entry for the DR section and the guide's title, purpose, and components overview. |
| **Deployment examples and operator behavior**<br>`docs/guides/rabbitmq/dr/guide/index.md` | Documents PlacementPolicy/RabbitMQ CR examples, operator-created resources per DC, Federation wiring, and client/consumer connection behavior. |
| **Observability, fencing, switchover, failback, and limitations**<br>`docs/guides/rabbitmq/dr/guide/index.md` | Covers status.disasterRecovery fields, publish fence mechanisms, planned switchover steps, failback semantics, day-2 ops, cleanup, and documented limitations. |
| **Overview concept, architecture, and prerequisites**<br>`docs/guides/rabbitmq/dr/overview/index.md` | Introduces the DC-DR concept, Lease-driven active-cluster selection, Federation topology, Member/Arbiter roles, and prerequisites. |
| **Overview deployment example, failover flows, and cleanup**<br>`docs/guides/rabbitmq/dr/overview/index.md` | Provides deployment manifests, status observability, unplanned/planned failover mechanics, failback semantics, and cleanup commands. |
| **Runbook quick reference and scenario procedures**<br>`docs/guides/rabbitmq/dr/runbook/index.md` | Adds quick reference, golden rules, and twelve scenario-based procedures covering node/DC loss, partitions, switchover, consumer resumption, failback, standby/arbiter loss, coordination-plane issues, active-DC determination, split writes, and an escalation checklist. |

Estimated code review effort: 2 (Simple) | ~12 minutes

Sequence Diagram(s)

Not applicable — this PR consists solely of documentation additions with no code or control-flow changes.

Compact metadata

  • Type: Documentation
  • Files changed: 4 new files
  • Lines added: +1007 / -0

Related issues: None referenced.

Related PRs: None referenced.

Suggested labels: documentation

Suggested reviewers: None specified.

🐰

New docs take root in the guide,
DC to DC, side by side,
Lease and fence and Federation flow,
Runbook scenarios, ready to go,
Disaster recovery, documented with pride.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the addition of RabbitMQ cross data center disaster recovery documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (6)
docs/guides/rabbitmq/dr/guide/index.md (1)

25-25: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use descriptive link text instead of "here".

Replace "here" with descriptive text like "the KubeDB getting started guide" for accessibility and to satisfy markdownlint MD059.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/guide/index.md` at line 25, The intro link text is
too generic and should be made descriptive to satisfy markdownlint MD059 and
improve accessibility. Update the link in the guide index to use meaningful text
instead of “here”, such as referring to the KubeDB getting started guide, while
keeping the same destination path.
docs/guides/rabbitmq/dr/runbook/index.md (4)

15-16: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add hyphen to "cross-data-center".

"cross data center disaster recovery" should be hyphenated as "cross-data-center disaster recovery" for correct compound modifier usage.

-Scenario-by-scenario procedures for operating a RabbitMQ workload in cross data center
-disaster recovery (DC-DR) mode.
+Scenario-by-scenario procedures for operating a RabbitMQ workload in cross-data-center
+disaster recovery (DC-DR) mode.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/runbook/index.md` around lines 15 - 16, The runbook
text uses “cross data center” as a compound modifier, so update the wording in
the RabbitMQ DR guide to hyphenate it consistently as “cross-data-center
disaster recovery.” Make the edit in the affected introductory sentence in the
docs content so the phrase reads naturally and matches the preferred terminology
throughout the guide.

23-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use descriptive link text.

"here" is non-descriptive link text. Use meaningful text that describes the destination.

-> **New to KubeDB?** Please start [here](/docs/README.md).
+> **New to KubeDB?** Please [start with the KubeDB introduction](/docs/README.md).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/runbook/index.md` at line 23, The link text in the
runbook introduction is too generic; update the Markdown link in the
introductory line to use descriptive text that names the destination instead of
“here.” Adjust the sentence containing the reference to the docs README so the
linked text clearly describes what readers will find, keeping the same target
path.

246-247: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Clarify annotation removal syntax.

The command kubectl annotate rabbitmq -n demo rm-dcdr dr.kubedb.com/switchover-to- uses the trailing - syntax to remove an annotation, which is correct kubectl syntax but may confuse readers. Consider adding a brief note that the trailing hyphen removes the annotation.

-  `kubectl annotate rabbitmq -n demo rm-dcdr dr.kubedb.com/switchover-to-`. The active
+  `kubectl annotate rabbitmq -n demo rm-dcdr dr.kubedb.com/switchover-to-` (the trailing `-` removes the annotation). The active
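To make the set/remove distinction concrete, here is a minimal sketch that builds both command lines. The helper name `annotate_args` and the example value `dc-b` are assumptions for illustration; the runbook itself only shows the removal form, and the accepted annotation value should be taken from the guide.

```python
import shlex
from typing import Optional


def annotate_args(key: str, value: Optional[str] = None) -> list[str]:
    """Build a `kubectl annotate` argv for the rm-dcdr RabbitMQ object.

    With a value, kubectl sets KEY=VALUE; with no value, the trailing
    hyphen (KEY-) tells kubectl to remove the annotation.
    """
    target = f"{key}={value}" if value is not None else f"{key}-"
    return ["kubectl", "annotate", "rabbitmq", "-n", "demo", "rm-dcdr", target]


KEY = "dr.kubedb.com/switchover-to"

# Assumed value: the DC name to switch to (hypothetical, not confirmed by the docs).
print(shlex.join(annotate_args(KEY, "dc-b")))
# Removal form, matching the command quoted in the runbook.
print(shlex.join(annotate_args(KEY)))
```

The second printed line ends in `switchover-to-`, which is the removal syntax and not a typo.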
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/runbook/index.md` around lines 246 - 247, Clarify the
kubectl annotation removal syntax in the RabbitMQ DR runbook example: the
command using kubectl annotate on rm-dcdr with dr.kubedb.com/switchover-to-
should explicitly note that the trailing hyphen removes the annotation. Update
the surrounding prose in the runbook section to mention this syntax so readers
understand the command without mistaking it for a typo.

324-325: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Clarify which cluster to check federation links on in split-write diagnosis.

In scenario 12, the command checks rm-dcdr-dc-b-0, but during a split-write scenario, the wrong-direction upstream could be on either DC. Consider noting that you should check both DCs' federation links, or clarify why dc-b specifically is the right target.

-kubectl exec -n demo rm-dcdr-dc-b-0 -- rabbitmqctl list_federation_links   # both directions must not be enabled
+kubectl exec -n demo rm-dcdr-dc-b-0 -- rabbitmqctl list_federation_links   # check both DCs; both directions must not be enabled
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/runbook/index.md` around lines 324 - 325, Clarify the
split-write federation check in the runbook: the current use of kubectl exec on
rm-dcdr-dc-b-0 implies only dc-b should be inspected, but the wrong-direction
upstream may exist on either cluster. Update the surrounding guidance in the
scenario 12 section to explicitly say to check federation links on both DCs (or
explain why rm-dcdr-dc-b-0 is the correct target), and keep the RabbitMQ
troubleshooting command aligned with that guidance.
docs/guides/rabbitmq/dr/overview/index.md (1)

31-31: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Descriptive link text improves accessibility.

"here" is non-descriptive for screen-reader users scanning links. Consider "Please start with the KubeDB introduction" or similar.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/overview/index.md` at line 31, Update the link text
in the documentation overview to use descriptive wording instead of “here” so it
is accessible for screen-reader users. In the markdown content near the “New to
KubeDB?” note, change the anchor text to something like “KubeDB introduction”
while keeping the same target, and make sure the sentence still reads naturally.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8efbf509-06f7-44df-9a79-d8713a94c0b1

📥 Commits

Reviewing files that changed from the base of the PR and between 405b88b and cfc928a.

📒 Files selected for processing (4)
  • docs/guides/rabbitmq/dr/_index.md
  • docs/guides/rabbitmq/dr/guide/index.md
  • docs/guides/rabbitmq/dr/overview/index.md
  • docs/guides/rabbitmq/dr/runbook/index.md


# Running RabbitMQ in DC-DR Mode: User Guide

This guide covers every aspect of operating a distributed RabbitMQ in cross data center

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix hyphenation: "cross-data-center".

"cross data center" should be hyphenated as "cross-data-center" when used as a compound modifier before "disaster recovery."

🧰 Tools
🪛 LanguageTool

[grammar] ~15-~15: Use a hyphen to join words.
Context: ...perating a distributed RabbitMQ in cross data center disaster recovery (DC-DR) mo...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/guide/index.md` at line 15, The guide text uses the
compound modifier “cross data center” and should be hyphenated as
“cross-data-center” before “disaster recovery.” Update the wording in the
RabbitMQ DR guide content to use the hyphenated form consistently, keeping the
existing meaning intact.

Comment on lines +39 to +49
## The DC-name contract

One string identifies a data center everywhere. **Keep these identical:**

- the OCM spoke cluster name
- the agent `--dc-name`
- the primary-DC Lease `holderIdentity`
- the marker `data.activeDC`
- the pod label `open-cluster-management.io/cluster-name`
- the `PlacementPolicy` `distributionRule.clusterName`

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Cross-file inconsistency: marker field name contradicts overview.

The guide lists data.activeDC (Line 46), but the DC-DR Overview documents the contract as activeDC without the data. prefix. Reconcile the marker path so both pages use the same identifier.

As per path instructions, cross-file contract consistency is required for DR configuration accuracy.
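The value side of the same contract can be checked mechanically. The sketch below is a hypothetical helper, not part of KubeDB: it assumes the six values have already been collected for one data center (here the active one, since the Lease and marker only name the active DC) and reports which sources disagree with the expected name.

```python
def dc_name_mismatches(expected: str, sources: dict[str, str]) -> list[str]:
    """Return the contract sources whose value is not exactly `expected`."""
    return sorted(name for name, value in sources.items() if value != expected)


# Illustrative values only; the source labels mirror the bullets in the guide.
sources = {
    "OCM spoke cluster name": "dc-a",
    "agent --dc-name": "dc-a",
    "Lease holderIdentity": "dc-a",
    "marker activeDC": "dc-a",
    "pod label open-cluster-management.io/cluster-name": "dc-a",
    "PlacementPolicy distributionRule.clusterName": "DC-A",  # deliberate mismatch
}

print(dc_name_mismatches("dc-a", sources))
# ['PlacementPolicy distributionRule.clusterName']
```

The comparison is an exact string match, so a difference in case counts as a mismatch, which is the strictness the phrase "keep these identical" implies.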

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/guide/index.md` around lines 39 - 49, The DC-name
contract lists the marker with an inconsistent path, so update the marker
identifier in the guide to match the DC-DR Overview contract. Use the existing
“The DC-name contract” section and the `data.activeDC` bullet as the target for
correction, ensuring it consistently uses the same `activeDC` identifier
everywhere across the docs.

Comment on lines +149 to +159
```jsonc
// federation policy, set by the operator on the standby (dc-b) cluster
{
  "name": "dcdr-federation",
  "pattern": "^(?!amq\\.).*", // federate user queues and exchanges
  "apply-to": "queues",
  "definition": {
    "federation-upstream-set": "dcdr-upstream-from-dc-a"
  }
}
```

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Fix federation policy: federation-upstream-set should be federation-upstream.

The policy example references "federation-upstream-set": "dcdr-upstream-from-dc-a", but federation-upstream-set expects the name of a federation-upstream-set runtime parameter (a collection), not a single federation-upstream name. Since the operator defines a single upstream (dcdr-upstream-from-dc-a), the policy should use "federation-upstream" to reference it directly.

     "definition": {
-      "federation-upstream-set": "dcdr-upstream-from-dc-a"
+      "federation-upstream": "dcdr-upstream-from-dc-a"
     }

Alternatively, if the operator actually defines a federation-upstream-set containing this upstream, document that component instead and keep the policy as-is.
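The difference between the two keys can be illustrated with a small check. This is a sketch under stated assumptions: the sets `UPSTREAMS` and `UPSTREAM_SETS` and the helper `policy_reference_errors` are hypothetical, and they model only what the guide names (one upstream, no upstream set). It is not how the operator or RabbitMQ validates policies.

```python
import re

# Assumption: the guide defines a single upstream and no federation-upstream-set.
UPSTREAMS = {"dcdr-upstream-from-dc-a"}
UPSTREAM_SETS: set[str] = set()


def policy_reference_errors(definition: dict[str, str]) -> list[str]:
    """Return problems with the upstream reference in a federation policy definition."""
    errors = []
    upstream = definition.get("federation-upstream")
    upstream_set = definition.get("federation-upstream-set")
    if upstream is not None and upstream not in UPSTREAMS:
        errors.append(f"federation-upstream {upstream!r} is not a defined upstream")
    # "all" is RabbitMQ's implicit set containing every defined upstream.
    if upstream_set is not None and upstream_set != "all" and upstream_set not in UPSTREAM_SETS:
        errors.append(f"federation-upstream-set {upstream_set!r} is not a defined upstream set")
    return errors


as_written = {"federation-upstream-set": "dcdr-upstream-from-dc-a"}
as_suggested = {"federation-upstream": "dcdr-upstream-from-dc-a"}
print(policy_reference_errors(as_written))    # one error: the name is an upstream, not a set
print(policy_reference_errors(as_suggested))  # []

# The policy pattern skips names that start with "amq."; Python's re accepts
# the same negative lookahead, so it is used here only to illustrate the match.
pattern = re.compile(r"^(?!amq\.).*")
print(bool(pattern.match("orders")), bool(pattern.match("amq.direct")))  # True False
```

Under these assumptions the policy as written references a set that does not exist, while the suggested form resolves to the one upstream the guide defines.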

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```jsonc
// federation policy, set by the operator on the standby (dc-b) cluster
{
  "name": "dcdr-federation",
  "pattern": "^(?!amq\\.).*", // federate user queues and exchanges
  "apply-to": "queues",
  "definition": {
    "federation-upstream": "dcdr-upstream-from-dc-a"
  }
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/guide/index.md` around lines 149 - 159, The
federation policy example uses the wrong definition key for the upstream
reference, so update the example in the RabbitMQ DR guide to match the
operator’s single upstream configuration. In the policy snippet near the
federation example, replace the use of federation-upstream-set in the definition
block with federation-upstream unless you are explicitly documenting a
federation-upstream-set runtime parameter elsewhere; keep the example aligned
with the actual upstream name dcdr-upstream-from-dc-a and the surrounding
federation policy text.

group. The Raft group never crosses the DC boundary, so inter-DC latency or a
partition can never flap queue leadership or stall a queue. There is no cross-DC
RabbitMQ voter. Use quorum queues (not classic queues) so intra-DC HA survives node
loss; classic queues are non-replicated.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Classic queue replication claim is inaccurate.

Classic queues can be mirrored via classic queue mirroring (deprecated, and removed as of RabbitMQ 4.0). The statement that they are "non-replicated" overstates the case for releases that still support mirroring and may mislead readers evaluating HA options. Prefer "classic queues do not automatically replicate" or "classic queues lack quorum-style replication."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/rabbitmq/dr/overview/index.md` at line 76, The description of
classic queues is too absolute in the RabbitMQ DR overview: it says they are
non-replicated, but the doc should reflect that classic queues do not
automatically replicate and can be mirrored via deprecated classic queue
mirroring. Update the wording in the overview text to use the existing
classic-queue explanation consistently, avoiding any claim that classic queues
are inherently non-replicated or equivalent to quorum-style replication.
