Add MySQL cross-DC disaster recovery (DC-DR) guide #940

Open
tamalsaha wants to merge 1 commit into master from mysql-dc-dr-docs

Conversation

@tamalsaha

@tamalsaha tamalsaha commented Jun 30, 2026

Member

What

Documents running a single distributed MySQL across multiple data centers for cross-data-center disaster recovery (DC-DR). Adds a new docs/guides/mysql/dr/ section that mirrors the Postgres DR section but is adapted to MySQL Group Replication.

The companion feature work: apimachinery status.disasterRecovery + spec.distributed (kubedb/apimachinery#1796) and the coordinator fence + cross-DC channel (kubedb/mysql-coordinator#186).

Pages

  • overview/index.md — architecture (the GR-never-crosses-DC rule, each Member DC as a self-contained GR cluster, the dr-controlplane Lease as the single cross-DC authority, the fail-closed super_read_only fence, the named dcdr async channel via CHANGE REPLICATION SOURCE ... SOURCE_AUTO_POSITION = 1), data center roles, 2-DC vs 3-DC topologies, deploy, status.disasterRecovery, unplanned failover, planned switchover.
  • guide/index.md — components and where they run, the DC-name contract, operator flags, deploy, connecting, monitoring, replication/lag (GTID gap + Seconds_Behind_Source), the timing invariant, quorum/roles (odd GR group sizes, fence re-assertion after every GR election), switchover/failback (GTID catch-up, clone re-seed), per-DC horizontal scaling, DC-aware health, day-2 ops.
  • runbook/index.md — 14 scenario-by-scenario procedures (active DC loss, partition, planned switchover/failback, coordination-plane down, stuck switchover, growing lag, fence tripped, re-add a DC, scaling, version upgrade, suspected split-brain) plus an escalation checklist.
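
As an illustration of the fail-closed fence listed under overview/index.md, the following Python sketch shows one way a coordinator could decide whether its data center may clear super_read_only. It is not taken from kubedb/mysql-coordinator: the function name `may_clear_fence`, its parameters, and the 15-second lease duration are hypothetical. The only assumptions carried over from the docs are that the dr-controlplane Lease names one holder DC and records a renewTime.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def may_clear_fence(
    local_dc: str,
    lease_holder: Optional[str],
    renew_time: Optional[datetime],
    lease_duration: timedelta,
    now: datetime,
) -> bool:
    """Return True only when this DC provably holds a fresh lease.

    Fail-closed: any missing or stale information keeps the fence
    (super_read_only = ON) in place.
    """
    if lease_holder is None or renew_time is None:
        return False  # lease unreadable, e.g. coordination plane down
    if lease_holder != local_dc:
        return False  # another DC is the writable one
    # The lease counts as fresh only while its age is below the duration.
    return now - renew_time < lease_duration


# Hypothetical values for illustration only.
now = datetime(2026, 6, 30, 12, 0, 0, tzinfo=timezone.utc)
duration = timedelta(seconds=15)
print(may_clear_fence("dc-a", "dc-a", now - timedelta(seconds=5), duration, now))
print(may_clear_fence("dc-a", "dc-a", now - timedelta(seconds=20), duration, now))
```

A real coordinator would also have to account for clock skew between data centers, which this sketch ignores.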

MySQL specifics vs the Postgres DR docs

  • HA is Group Replication (a single-primary GR cluster per DC, using GR's own election), not Raft.
  • Cross-DC link is an async replication channel on the standby DC's GR primary (adapted from RemoteReplica), with GR distributing intra-DC; lag is the GTID gap plus Seconds_Behind_Source.
  • The fence holds super_read_only = ON and is re-asserted after every GR election (GR clears it on its elected primary).
  • Failback uses GTID auto-positioning with a clone re-seed when diverged beyond purged GTIDs.
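
The GTID gap mentioned in the list above can be estimated by comparing the executed GTID sets of the active DC's primary and the standby DC's GR primary. The sketch below is a hedged illustration in Python, not the coordinator's implementation: `parse_gtid_set` and `gtid_gap` are hypothetical helpers, and the inputs are assumed to be plain `gtid_executed` strings in the untagged `uuid:start-end` form. Intervals are expanded into sets, which is simple but memory-hungry for very large ranges.

```python
from typing import Dict, Set


def parse_gtid_set(text: str) -> Dict[str, Set[int]]:
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: {transaction numbers}}."""
    result: Dict[str, Set[int]] = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        uuid, *intervals = entry.split(":")
        numbers = result.setdefault(uuid.strip().lower(), set())
        for interval in intervals:
            start, _, end = interval.strip().partition("-")
            # A single number such as '8' is the interval 8-8.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def gtid_gap(source_executed: str, replica_executed: str) -> int:
    """Count transactions executed on the source but not on the replica."""
    source = parse_gtid_set(source_executed)
    replica = parse_gtid_set(replica_executed)
    return sum(
        len(txns - replica.get(uuid, set())) for uuid, txns in source.items()
    )


# Hypothetical server UUID for illustration only.
uuid = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
print(gtid_gap(f"{uuid}:1-10", f"{uuid}:1-7"))  # 3
```

Inside MySQL itself, the built-in `GTID_SUBTRACT()` function computes the same set difference, so a production check would more likely run there than re-parse the sets client-side.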

Notes

These are forward-looking docs that land with the MySQL DC-DR feature (sibling to the in-flight Postgres DR docs). The one /docs/guides/postgres/dr/ cross-link (SemiSync mode pointer) resolves once the Postgres DR docs section merges.

Summary by CodeRabbit

  • Documentation
    • Added new MySQL disaster recovery guide pages, including an overview, detailed user guide, and operational runbook.
    • Documented cross-data-center deployment patterns, failover and switchover behavior, scaling, backup, cleanup, and recovery procedures.
    • Added guidance on observing disaster recovery status, replication lag, safety timing considerations, and common operational scenarios.

kodiak-appscode[bot]
kodiak-appscode Bot previously approved these changes Jun 30, 2026
@coderabbitai

coderabbitai Bot commented Jun 30, 2026



Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 757bc64f-bc39-4c9c-bd13-8cdcdfd5db5b

📥 Commits

Reviewing files that changed from the base of the PR and between fbdaa02 and 6d04f13.

📒 Files selected for processing (4)
  • docs/guides/mysql/dr/_index.md
  • docs/guides/mysql/dr/guide/index.md
  • docs/guides/mysql/dr/overview/index.md
  • docs/guides/mysql/dr/runbook/index.md
📝 Walkthrough

Walkthrough

This PR adds four new documentation pages for MySQL cross-data-center disaster recovery (DC-DR): a landing index page, a conceptual overview, a detailed user guide, and an operational runbook covering 14 failure/maintenance scenarios. No code or exported entities are changed.

Changes

MySQL DC-DR Documentation

| Layer | File(s) | Summary |
|---|---|---|
| DR landing page | docs/guides/mysql/dr/_index.md | Adds front matter with title and versioned menu metadata for the new DR section. |
| Conceptual overview | docs/guides/mysql/dr/overview/index.md | Documents Group Replication-based DC-DR concepts, DC roles, topologies, prerequisites, deployment steps, status fields, failover/switchover, scaling, day-2 ops, and cleanup. |
| Detailed user guide | docs/guides/mysql/dr/guide/index.md | Documents component responsibilities, DC-name contract, operator flags, PlacementPolicy/MySQL manifests, generated resources, connection model, observability, replication/RPO, timing invariants, quorum, switchover/failback, scaling, backups, health checks, cleanup, and limitations. |
| Operational runbook | docs/guides/mysql/dr/runbook/index.md | Adds quick reference commands, golden rules, and 14 scenario-based procedures for DC loss, partitions, switchover/failback, coordination outages, scaling, upgrades, split-brain, plus an escalation checklist. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

Four burrows dug, deep and wide,
Maps for DCs on either side. 🐰
Lease in paw, fence held tight,
Failover's swift in the dead of night.
Runbook scrolls for every scare—
This bunny's docs are doc-umented with care!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: adding a MySQL cross-DC disaster recovery guide. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (7)
docs/guides/mysql/dr/overview/index.md (5)

172-179: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Rephrase to avoid "only" repetition in the table.

The last row repeats "only" twice, which reads awkwardly. Rephrase for clarity:

- | Two sites, co-located quorum | 2 | 2 | only the one-vote DC | only when the one-vote DC is lost |
+ | Two sites, co-located quorum | 2 | 2 | the one-vote DC | when the one-vote DC is lost |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/overview/index.md` around lines 172 - 179, The “At a
glance” table in the MySQL DR overview repeats “only” in the last row, so reword
that row for smoother readability without changing the meaning. Update the “Two
sites, co-located quorum” entry in the markdown table to avoid the double “only”
phrasing while keeping the same tolerance and failover conditions, and keep the
rest of the table unchanged.

Source: Linters/SAST tools


68-71: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Fix hyphenation in "intra standby-DC".

"An intra standby-DC GR election" should be hyphenated as "An intra-standby-DC GR election" for correct compound modifier grammar.

-  label loop. An intra standby-DC GR election (which moves the local primary and the
+  label loop. An intra-standby-DC GR election (which moves the local primary and the
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/overview/index.md` around lines 68 - 71, Fix the
compound modifier in the DR overview prose by changing the phrase in the
affected paragraph from “intra standby-DC GR election” to “intra-standby-DC GR
election”; update the wording in the markdown content near the fence/GR election
explanation so the grammar is consistent and no other surrounding text changes
are needed.

Source: Linters/SAST tools


188-192: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to the fenced code block.

  --dc-dr-enabled
  --dc-dr-coord-kubeconfig=<path to the coordination control plane kubeconfig>
  --dc-dr-local-dc=<this operator's data center name>

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/overview/index.md` around lines 188 - 192, Add a
language specifier to the fenced CLI snippet in the MySQL DR overview docs so
the markdown code block is explicitly marked as bash; update the existing fenced
block containing the --dc-dr-enabled, --dc-dr-coord-kubeconfig, and
--dc-dr-local-dc flags to use the correct syntax highlighting identifier.

Source: Linters/SAST tools


86-89: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Capitalize "PetSet" correctly.

"petset" should be "PetSet" to match the KubeDB/Kubernetes custom resource naming convention used elsewhere in the document.

-  with `role: Arbiter` and empty `replicaIndices`. The petset `Witness` role (a
+  with `role: Arbiter` and empty `replicaIndices`. The PetSet `Witness` role (a
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/overview/index.md` around lines 86 - 89, Update the
MySQL DR overview text to capitalize the Kubernetes resource name consistently:
change the lowercase “petset” mention in the explanatory paragraph to “PetSet”
so it matches the convention used elsewhere in the document. Keep the rest of
the wording unchanged and make the edit in the paragraph discussing the
`Witness` role and `replicaIndices`.

315-319: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Clarify the relabeling description.

"relabels the survivor's GR primary primary" is telegraphic. Specify the label being set:

-  primary `primary`, sets `super_read_only = OFF`, stops the survivor's inbound channel,
+  primary as `primary`, sets `super_read_only = OFF`, stops the survivor's inbound channel,

Or more precisely:

-  primary `primary`, sets `super_read_only = OFF`, stops the survivor's inbound channel,
+  primary to `kubedb.com/role: primary`, sets `super_read_only = OFF`, stops the survivor's inbound channel,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/overview/index.md` around lines 315 - 319, The wording
in the DR overview is too terse around the survivor relabeling step; update the
sentence describing the survivor’s GR primary so it explicitly states that the
label being set is `primary` (for example, “sets the survivor’s GR primary label
to `primary`”). Keep the rest of the recovery flow description unchanged, and
adjust the surrounding text near the `primary Service` and `AppBinding`
references only if needed for readability.
docs/guides/mysql/dr/runbook/index.md (1)

191-191: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Rephrase for clarity: "only once it does" is ambiguous.

The pronoun "it does" is distant from its antecedent. Replace with explicit reference:

-  the coordinator clears `super_read_only` only once it does.
+  the coordinator clears `super_read_only` only once that condition is met.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/runbook/index.md` at line 191, The sentence in the MySQL
DR runbook is ambiguous because “it does” does not clearly refer back to the
fresh `renewTime`. Rephrase the text in the surrounding runbook section to use
an explicit subject, likely by naming `renewTime` directly in the clause about
when the coordinator clears `super_read_only`, so the timing condition is
immediately clear.
docs/guides/mysql/dr/guide/index.md (1)

64-68: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add shell language tag to the operator flags block.

The fenced code block at lines 64-68 lacks a language specifier. Add bash after the opening backticks to satisfy markdownlint and enable syntax highlighting.

+```bash
--dc-dr-enabled
--dc-dr-coord-kubeconfig=
--dc-dr-local-dc=

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/mysql/dr/guide/index.md` around lines 64 - 68, Add the missing
shell language specifier to the fenced operator flags block in the MySQL DR
guide so the markdown lint passes and syntax highlighting works. Update the code
fence that contains the dc-dr flag examples to use the bash language tag,
keeping the flag contents unchanged and preserving the existing section
formatting in the guide.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/guides/mysql/dr/runbook/index.md`:
- Line 15: The runbook text uses an unhyphenated compound modifier, so update
the phrasing in the MySQL DR guide to use “cross-data-center disaster recovery
(DC-DR) mode.” Fix the wording in the affected documentation sentence so the
compound adjective is hyphenated consistently and the terminology matches the
intended disaster recovery label.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 005491d7-f3e7-471c-ae75-4fb24d594e5b

📥 Commits

Reviewing files that changed from the base of the PR and between 405b88b and fbdaa02.

📒 Files selected for processing (4)
  • docs/guides/mysql/dr/_index.md
  • docs/guides/mysql/dr/guide/index.md
  • docs/guides/mysql/dr/overview/index.md
  • docs/guides/mysql/dr/runbook/index.md

Comment thread docs/guides/mysql/dr/runbook/index.md
Document running a single distributed MySQL across data centers for
cross-data-center disaster recovery. Mirrors the Postgres DR section, adapted to
MySQL Group Replication: each Member data center is its own self-contained GR
cluster, the dr-controlplane primary DC Lease picks the one writable data center,
and every standby data center streams a named async replication channel (CHANGE
REPLICATION SOURCE ... SOURCE_AUTO_POSITION = 1) from the active data center's
primary while GR distributes the changes within each DC.

Adds docs/guides/mysql/dr:
- overview: architecture, the GR-never-crosses-DC rule, data center roles,
  2-DC vs 3-DC topologies, deploy, status.disasterRecovery, failover, switchover.
- guide: components, the DC-name contract, fence (super_read_only, re-asserted
  after every GR election), GTID and seconds lag, timing invariant, per-DC
  scaling, DC-aware health, day-2 ops.
- runbook: scenario-by-scenario procedures (active DC loss, partition, planned
  switchover and failback, stuck switchover, lag, fence tripped, split-brain).

Signed-off-by: Tamal Saha <tamal@appscode.com>