
Add Redis cross-DC disaster recovery (DC-DR) docs #943

Open
tamalsaha wants to merge 1 commit into master from redis-dc-dr-docs
@tamalsaha tamalsaha commented Jul 1, 2026

Member

What

Adds documentation for cross data center disaster recovery (DC-DR) of KubeDB Redis / Valkey, under docs/guides/redis/dr/, mirroring the structure of the other engines' DC-DR docs.

  • dr/overview/index.md — concepts (gossip/Sentinel quorum stays intra-DC, the dr-controlplane Lease is the single cross-DC authority, plain async REPLICAOF is the cross-DC link, the fail-closed rd-coordinator fence), 2-DC vs 3-DC topologies, prerequisites, a deploy walkthrough (PlacementPolicy + Redis), observing status.disasterRecovery, unplanned failover, planned near-zero-RPO switchover, failback via resync, per-DC scaling, and cleanup.
  • dr/runbook/index.md — scenario-by-scenario procedures (active DC lost, partition, planned switchover/failback, standby lost, coordination plane down, switchover stuck, stale marker, re-adding a DC, verifying the split-brain guarantee).
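The fail-closed fence mentioned in the overview bullet can be pictured as a small decision function. This is a hypothetical sketch, not the rd-coordinator code: the function name, the `primary` label, and the use of `None` for "Lease could not be read" are assumptions made for illustration. Only the `standby` label and the fail-closed behavior come from this PR.

```python
from typing import Optional


def fence_role(local_dc: str, lease_holder: Optional[str]) -> str:
    """Decide the label for the local DC's master (hypothetical sketch).

    lease_holder is the DC currently named by the primary-DC Lease, or None
    when the Lease could not be read (for example, the coordination plane is
    unreachable). Anything other than a confirmed match fails closed.
    """
    if lease_holder is not None and lease_holder == local_dc:
        return "primary"  # assumed label for the writable DC
    return "standby"      # read-only; also the answer when in doubt


if __name__ == "__main__":
    print(fence_role("dc-a", "dc-a"))
    print(fence_role("dc-b", "dc-a"))
    print(fence_role("dc-a", None))
```

The point of the sketch is that an unreadable Lease and a Lease held elsewhere lead to the same outcome, so a partitioned DC can never promote itself.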

Scope

Native DC-DR scope is Sentinel and Standalone (they have a native cross-cluster REPLICAOF primitive). Cluster mode has no native cross-cluster replication and is called out as deferred to a later external-sync phase.

Related

Documents the feature implemented in kubedb/redis#658 (operator), kubedb/redis-coordinator#164 (fence), and kubedb/apimachinery#1793 (status.disasterRecovery types). All commands, annotations (dr.kubedb.com/enabled, dr.kubedb.com/switchover-to, dr.kubedb.com/switchover-max-lag-bytes), env, status fields, and flags match that implementation.
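One way to picture what a bound such as `dr.kubedb.com/switchover-max-lag-bytes` could mean is a byte-offset comparison between the active master and the standby. The sketch below is an assumption, since this PR does not say how the operator measures lag: it parses the text of Redis `INFO replication` and compares `master_repl_offset` on the active side with `slave_repl_offset` on the standby side. The helper names are hypothetical.

```python
from typing import Dict


def parse_info(info_text: str) -> Dict[str, str]:
    """Parse `INFO replication` output into a key/value dict."""
    fields: Dict[str, str] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def lag_within_limit(active_info: str, standby_info: str, max_lag_bytes: int) -> bool:
    """Return True when the standby trails the active master by at most max_lag_bytes."""
    active_offset = int(parse_info(active_info)["master_repl_offset"])
    standby_offset = int(parse_info(standby_info)["slave_repl_offset"])
    lag = max(0, active_offset - standby_offset)
    return lag <= max_lag_bytes
```

Under this assumption, a limit of 0 would only allow a switchover once the standby has received every byte the active master has written.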

Summary by CodeRabbit

  • Documentation
    • Added a new Redis disaster recovery guide page with setup details, topology examples, prerequisites, configuration steps, and status inspection guidance.
    • Added a Redis/Valkey cross–data center disaster recovery runbook with command references and step-by-step procedures for failover, switchover, failback, recovery, and split-brain checks.

Document DC-DR for Redis/Valkey under docs/guides/redis/dr/: an overview
(concepts, 2-DC vs 3-DC topologies, deploy, observe status.disasterRecovery,
unplanned failover, planned near-zero-RPO switchover, failback, scaling,
cleanup) and a scenario runbook. Native scope is Sentinel and Standalone;
the cross-DC link is a plain async REPLICAOF, the writable DC is chosen by
the dr-controlplane primary-DC Lease, and the rd-coordinator fence holds a
non-active DC's master standby, fail closed.

Signed-off-by: Tamal Saha <tamal@appscode.com>

coderabbitai Bot commented Jul 1, 2026


📝 Walkthrough


This PR adds three new documentation pages under docs/guides/redis/dr: a guide index with menu configuration, an overview page describing the Redis Cross-DC Disaster Recovery (DC-DR) architecture and setup, and a runbook detailing operational procedures and failure-scenario responses.

Changes

Redis DC-DR Documentation

| Layer / File(s) | Summary |
| --- | --- |
| **Guide section front-matter and menu entry**<br>`docs/guides/redis/dr/_index.md`, `docs/guides/redis/dr/overview/index.md` (front-matter) | Adds versioned menu configuration and page titles for the new DC-DR guide section. |
| **DC-DR concepts and architecture**<br>`docs/guides/redis/dr/overview/index.md` | Documents the intra-DC isolation model, Lease-based cross-DC writability, fail-closed fencing, DC role semantics (Member/Arbiter), and 3-site/2-site deployment topologies. |
| **DC-DR deployment and status observation**<br>`docs/guides/redis/dr/overview/index.md` | Documents prerequisites, PlacementPolicy and Redis manifest configuration for DC-DR expansion, and `status.disasterRecovery` observability fields. |
| **Failover, switchover, failback, scaling and cleanup**<br>`docs/guides/redis/dr/overview/index.md` | Documents unplanned failover, planned switchover/failback flows, per-DC scaling via RedisOpsRequest, and cleanup instructions. |
| **Runbook quick reference and golden rules**<br>`docs/guides/redis/dr/runbook/index.md` | Adds runbook intro, quick commands, and golden rules for Lease-driven writability and fencing. |
| **Runbook failure and recovery scenarios**<br>`docs/guides/redis/dr/runbook/index.md` | Adds scenario-based procedures for DC loss, network partition, switchover/failback, standby loss, coordination plane outage, stuck switchover, stale markers, DC re-add, and split-brain verification. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested reviewers

  • kodiak-appscode

A rabbit hops across two data centers wide,
With Lease in paw and fences at its side. 🐇
One DC writes, the others stand and wait,
Switchover smooth, no split-brain fate.
Docs now written, runbook clear and bright—
Disaster comes? We'll hop through the night! 🌙

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: new Redis cross-DC disaster recovery documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
docs/guides/redis/dr/overview/index.md (2)

163-167: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a language specifier to the fenced code block.

The shell flags block is missing a language tag, so it won't render with syntax highlighting.

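<details>
<summary>📝 Proposed fix</summary>

Add `bash` as the language on the opening fence; the flag lines stay unchanged:

    --dc-dr-enabled
    --dc-dr-coord-kubeconfig=
    --dc-dr-local-dc=<this operator's data center name>

</details>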
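Findings like this one can be caught before review with a small lint pass. The following is a minimal sketch under simplifying assumptions: it only recognizes triple-backtick fences (not tildes or longer fences), and the function name is hypothetical.

```python
from typing import List

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def untagged_fences(markdown: str) -> List[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    hits: List[int] = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            if stripped[len(FENCE):].strip() == "":
                hits.append(number)  # opening fence without an info string
            inside = True
    return hits
```

Running it over `docs/guides/redis/dr/overview/index.md` would be expected to report the opening fence of the flags block described in this comment.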

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/guides/redis/dr/overview/index.md around lines 163 - 167, Add a
language tag to the fenced shell-flags block in the docs snippet so it renders
with syntax highlighting; update the markdown code fence around the dc-dr flag
example to use a bash specifier, keeping the existing flag lines unchanged.


</details>


---

`40-40`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_ | _⚡ Quick win_

**Use descriptive link text for accessibility.**

Replace the non-descriptive "here" link text with something that describes the destination, e.g., "the KubeDB quick start guide".




<details>
<summary>📝 Proposed fix</summary>

```diff
- > **New to KubeDB?** Please start [here](/docs/README.md).
+ > **New to KubeDB?** Please start with [the KubeDB quick start guide](/docs/README.md).
```

</details>

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/redis/dr/overview/index.md` at line 40, The link text in the “New
to KubeDB?” sentence is non-descriptive and should be updated for accessibility.
In the Markdown content, replace the generic “here” anchor in the introductory
sentence with descriptive text that identifies the destination, using the nearby
“New to KubeDB?” phrasing and the README link target as guidance. Keep the
sentence clear and self-explanatory without relying on the link destination
alone.
docs/guides/redis/dr/runbook/index.md (1)

166-166: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Simplify wordy phrasing.

"a majority of its three sites" → "2 of 3 sites" or "a majority of the three etcd members".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/redis/dr/runbook/index.md` at line 166, The phrasing in the DR
runbook is too wordy and should be simplified. Update the sentence containing
“restore the dr-controlplane etcd quorum” to replace “a majority of its three
sites” with a clearer equivalent such as “2 of 3 sites” or “a majority of the
three etcd members.” Keep the wording concise and consistent with the
surrounding runbook language.
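
The "2 of 3 sites" wording follows from ordinary majority arithmetic. A minimal sketch, assuming the usual majority rule for etcd membership; the function names are illustrative only:

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def has_quorum(healthy: int, members: int) -> bool:
    """True when the healthy members still form a majority."""
    return healthy >= quorum_size(members)
```

For three members the majority is two, so losing one site keeps quorum and losing two does not.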
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 00edbb60-18e9-42bb-966d-6be36276ade1

📥 Commits

Reviewing files that changed from the base of the PR and between 405b88b and 3bd12b6.

📒 Files selected for processing (3)
  • docs/guides/redis/dr/_index.md
  • docs/guides/redis/dr/overview/index.md
  • docs/guides/redis/dr/runbook/index.md

`SENTINEL FAILOVER` across DCs by hand.
- **The fence fails closed.** A DC that cannot confirm it holds the Lease keeps its master
labeled `standby` and read-only by design; that is correct, not a bug.
- **Exactly one DC is `writable: true`** in `status.disasterRecovery` at any instant.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Contradiction: "Exactly one DC is writable: true" fails when coordination plane is down.

Scenario 6 (lines 152-168) correctly states that when the etcd quorum is lost, all DCs fail closed and no DC is writable. The golden rule should qualify this to avoid operator panic during a coordination-plane outage.

Suggested rephrase: "At most one DC is writable: true in status.disasterRecovery at any instant; zero is expected when the coordination plane quorum is lost."
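
The rephrased rule is easy to check mechanically. The sketch below assumes, hypothetically, that `status.disasterRecovery` can be reduced to a list of per-DC entries with `name` and `writable` keys; the real field layout is defined in kubedb/apimachinery and may differ.

```python
from typing import Dict, List


def writable_dcs(dcs: List[Dict]) -> List[str]:
    """Names of the DCs that report writable: true."""
    return [dc["name"] for dc in dcs if dc.get("writable") is True]


def check_golden_rule(dcs: List[Dict]) -> str:
    """Classify a status snapshot against the 'at most one writable DC' rule."""
    count = len(writable_dcs(dcs))
    if count == 1:
        return "ok"
    if count == 0:
        return "fail-closed"  # expected while coordination plane quorum is lost
    return "violation"        # more than one writable DC would mean split-brain
```

Treating zero writable DCs as its own state, rather than as an error, matches the intent of the suggested wording.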

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/redis/dr/runbook/index.md` at line 45, The runbook’s golden rule
in the redis DR guide is too strict and conflicts with the quorum-loss scenario.
Update the statement referenced by the disasterRecovery section to say “at most
one DC is writable: true” and explicitly allow zero writable DCs when the
coordination plane quorum is lost. Keep the wording aligned with Scenario 6 so
the guidance is consistent across the document.
