Add ClickHouse DC-DR guide #945
Signed-off-by: Tamal Saha <tamal@appscode.com>
📝 Walkthrough
This PR adds ClickHouse DC-DR documentation across the README, overview, user guide, and runbook, covering deployment structure, routing and quorum behavior, operational procedures, and troubleshooting scenarios.
Changes: DC-DR Documentation
Estimated code review effort: 2 (Simple) | ~15 minutes
Pre-merge checks: 5 passed
Actionable comments posted: 8
🧹 Nitpick comments (3)
docs/guides/clickhouse/dr/runbook/index.md (1)
64-65: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Add "from" for clarity: "moves from `FailingOver` to `Steady`".
Current phrasing "`phase` moves `FailingOver` to `Steady`" is slightly telegraphic. Suggest: "moves from `FailingOver` to `Steady`".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/clickhouse/dr/runbook/index.md` around lines 64 - 65, The wording in the runbook sentence is missing the transition preposition, making the state change hard to read. Update the text around the “phase” description so it says it moves from `FailingOver` to `Steady`, preserving the same meaning but improving clarity in the DR runbook section.
docs/guides/clickhouse/dr/guide/index.md (2)
15-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Fix hyphenation: "cross data center" should be "cross-data-center".
The phrase "cross data center disaster recovery" is missing hyphens. It should read "cross-data-center disaster recovery" to be grammatically correct as a compound modifier.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/clickhouse/dr/guide/index.md` around lines 15 - 18, Update the opening description in the ClickHouse DR guide so the compound modifier is hyphenated correctly; in the document text for the guide intro, change the phrase used in the overview to “cross-data-center disaster recovery” so the wording is grammatically consistent.
Source: Linters/SAST tools
130-132: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Consider formal phrasing: "keep working" → "continue to work".
For more formal documentation tone, consider: "so the single write endpoint and the single `AppBinding` continue to work as the active DC moves."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/clickhouse/dr/guide/index.md` around lines 130 - 132, The phrasing in the ClickHouse DR guide is too informal in the sentence about the single write endpoint and single AppBinding. Update the wording in this doc section to use a more formal tone by replacing “keep working” with “continue to work,” so the sentence reads naturally as the active DC moves.
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/guides/clickhouse/dr/overview/index.md`:
- Around line 188-191: The overview’s marker reference is inconsistent with the
User Guide’s DC-name contract: it currently says “activeDC” while the contract
uses “data.activeDC.” Update the wording in the overview to explicitly refer to
the marker as the ConfigMap/Secret data path `data.activeDC`, and keep the
terminology aligned with the existing DC-name contract in the guide. Use the
marker mention near the DC-name list as the anchor for the edit so both
documents describe the same field consistently.
- Around line 317-318: Update the wording in the DR overview text to replace the
vague “even layout” phrase with the same topology terminology used elsewhere,
specifically “two-Member-plus-Arbiter layout” or “2+1 layout.” Make the change
in the section describing the surviving DCs and Keeper quorum so the terminology
is consistent with the earlier topology description.
- Line 32: The link text in the ClickHouse DR overview intro is too vague, so
update the existing “here” anchor to use descriptive text that matches its
destination, such as the KubeDB quickstart guide. Make this change in the
introductory sentence so the markdown link remains the same target but the
visible text is meaningful and self-descriptive.
In `@docs/guides/clickhouse/dr/runbook/index.md`:
- Around line 114-115: The wording in the DR runbook is ambiguous about
“near-zero committed writes are lost”; update the sentence in the ClickHouse DR
runbook text to make the RPO meaning explicit, e.g. by rephrasing it around
“committed writes lost are near-zero” or “RPO is near-zero,” while keeping the
rest of the explanation in the same section unchanged.
- Around line 15-16: Update the ClickHouse DR runbook wording so the compound
modifier is hyphenated correctly: change the “cross data center disaster
recovery” phrasing in the introduction to “cross-data-center disaster recovery.”
Keep the rest of the sentence intact and ensure the document consistently uses
the hyphenated form in this section.
- Around line 265-266: The endpoint guidance in the runbook uses the placeholder
`<db>`, which reads like a database name instead of a write endpoint. Update the
sentence in this section to use `<endpoint>` or rephrase it more clearly as
“Point writes at the endpoint, not at this DC directly,” keeping the rest of the
standby guidance unchanged.
- Around line 137-139: The DR runbook text still references PostgreSQL-specific
pg_rewind, which does not apply to ClickHouse. Update the affected prose in the
runbook section describing the near-zero-RPO flow to remove pg_rewind and
replace it with ClickHouse-appropriate wording that explains there is no
divergence/rollback step for a DC that lacked Keeper quorum.
In `@docs/guides/clickhouse/README.md`:
- Line 50: The ClickHouse README uses non-descriptive link text in the sentence
that ends with the DR overview link. Update the anchor text in the relevant
documentation section to a meaningful phrase that describes the destination,
such as the DC-DR Overview guide, while keeping the same link target. Locate the
prose near the ClickHouse disaster-recovery explanation in the README and
replace the generic “Follow here” wording with descriptive text.
---
Nitpick comments:
In `@docs/guides/clickhouse/dr/guide/index.md`:
- Around line 15-18: Update the opening description in the ClickHouse DR guide
so the compound modifier is hyphenated correctly; in the document text for the
guide intro, change the phrase used in the overview to “cross-data-center
disaster recovery” so the wording is grammatically consistent.
- Around line 130-132: The phrasing in the ClickHouse DR guide is too informal
in the sentence about the single write endpoint and single AppBinding. Update
the wording in this doc section to use a more formal tone by replacing “keep
working” with “continue to work,” so the sentence reads naturally as the active
DC moves.
In `@docs/guides/clickhouse/dr/runbook/index.md`:
- Around line 64-65: The wording in the runbook sentence is missing the
transition preposition, making the state change hard to read. Update the text
around the “phase” description so it says it moves from `FailingOver` to
`Steady`, preserving the same meaning but improving clarity in the DR runbook
section.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: cd5aecbf-4ae8-4905-82ad-067c8a234889
📒 Files selected for processing (5)
docs/guides/clickhouse/README.md
docs/guides/clickhouse/dr/_index.md
docs/guides/clickhouse/dr/guide/index.md
docs/guides/clickhouse/dr/overview/index.md
docs/guides/clickhouse/dr/runbook/index.md
- [DC-DR Runbook](/docs/guides/clickhouse/dr/runbook/index.md) for what to do in each
  operational scenario.

> **New to KubeDB?** Please start [here](/docs/README.md).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use descriptive link text.
"here" does not describe the destination. Replace with something like "KubeDB quickstart guide".
-> **New to KubeDB?** Please start [here](/docs/README.md).
+> **New to KubeDB?** Please start with the [KubeDB quickstart guide](/docs/README.md).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 32-32: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/overview/index.md` at line 32, The link text in the
ClickHouse DR overview intro is too vague, so update the existing “here” anchor
to use descriptive text that matches its destination, such as the KubeDB
quickstart guide. Make this change in the introductory sentence so the markdown
link remains the same target but the visible text is meaningful and
self-descriptive.
Source: Linters/SAST tools
- One consistent **DC name** per data center, used everywhere: the OCM spoke cluster
  name, the agent `--dc-name`, the Lease `holderIdentity`, the marker `activeDC`, the
  pod label `open-cluster-management.io/cluster-name`, and the `PlacementPolicy`
  `distributionRule.clusterName`. Keep them identical.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Align "marker" reference with the User Guide's DC-name contract.
The User Guide specifies `data.activeDC` for the marker, while this overview uses `activeDC`. Clarify whether the marker is a ConfigMap/Secret key path (`data.activeDC`) or a field name, and keep both documents consistent.
Cross-reference: docs/guides/clickhouse/dr/guide/index.md:36-46 defines the DC-name contract as the marker `data.activeDC`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/overview/index.md` around lines 188 - 191, The
overview’s marker reference is inconsistent with the User Guide’s DC-name
contract: it currently says “activeDC” while the contract uses “data.activeDC.”
Update the wording in the overview to explicitly refer to the marker as the
ConfigMap/Secret data path `data.activeDC`, and keep the terminology aligned
with the existing DC-name contract in the guide. Use the marker mention near the
DC-name list as the anchor for the edit so both documents describe the same
field consistently.
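To illustrate the DC-name contract discussed in this comment, here is a minimal sketch of a consistency check across the six places the name appears. The function name `dc_name_mismatches` and the dictionary keys are hypothetical labels chosen for this example; they are not KubeDB or OCM APIs, and how the values are actually collected from the cluster is left out.

```python
from collections import Counter


def dc_name_mismatches(names: dict[str, str]) -> dict[str, str]:
    """Return the entries whose value differs from the most common DC name.

    `names` maps a (hypothetical) label for each place the DC name is
    configured to the value found there.
    """
    if not names:
        return {}
    # Treat the most frequent value as the intended DC name.
    expected, _ = Counter(names.values()).most_common(1)[0]
    return {where: value for where, value in names.items() if value != expected}


# Illustrative values only; the keys mirror the six places listed in the overview.
observed = {
    "ocm_spoke_cluster_name": "dc-a",
    "agent_dc_name_flag": "dc-a",
    "lease_holder_identity": "dc-a",
    "marker_data_activeDC": "dc-a",
    "pod_label_cluster_name": "dc-A",  # deliberately wrong: case differs
    "placement_policy_cluster_name": "dc-a",
}

print(dc_name_mismatches(observed))
```

The comparison is an exact string match on purpose, since the overview asks for the names to be identical.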
When the active DC is lost, the surviving DCs that still hold Keeper quorum (a standby
data DC plus the arbiter DC in the even layout) **keep accepting writes on their own**,
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Replace unclear "even layout" terminology.
Use "two-Member-plus-Arbiter layout" or "2+1 layout" to match the topology described earlier.
-surviving DCs that still hold Keeper quorum (a standby data DC plus the arbiter DC in the even layout)
+surviving DCs that still hold Keeper quorum (a standby Member DC plus the Arbiter DC in the 2+1 layout)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/overview/index.md` around lines 317 - 318, Update
the wording in the DR overview text to replace the vague “even layout” phrase
with the same topology terminology used elsewhere, specifically
“two-Member-plus-Arbiter layout” or “2+1 layout.” Make the change in the section
describing the surviving DCs and Keeper quorum so the terminology is consistent
with the earlier topology description.
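The quorum reasoning behind the 2+1 layout can be sketched as a simple majority count. This is an illustration only: it assumes exactly one Keeper voter per DC (two Member DCs plus one Arbiter DC), and the DC names and helper functions are hypothetical rather than part of KubeDB or ClickHouse Keeper.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """A Raft ensemble has quorum when a strict majority of voters is reachable."""
    return reachable_voters > total_voters // 2


def survivors_have_quorum(voters: dict[str, int], lost: set[str]) -> bool:
    """Check whether the DCs that remain after losing `lost` still hold a majority."""
    total = sum(voters.values())
    reachable = sum(count for dc, count in voters.items() if dc not in lost)
    return has_quorum(reachable, total)


# Assumed layout: one Keeper voter in each Member DC and one in the Arbiter DC.
voters = {"dc-a": 1, "dc-b": 1, "arbiter": 1}

print(survivors_have_quorum(voters, {"dc-a"}))             # standby Member + Arbiter survive
print(survivors_have_quorum(voters, {"dc-a", "arbiter"}))  # a single DC on its own
```

Under this assumption no single DC holds a majority by itself, which is the property the overview describes as the split-brain guarantee.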
Scenario-by-scenario procedures for operating a ClickHouse cluster in cross data center
disaster recovery (DC-DR) mode. Each scenario lists the **symptoms**, what KubeDB and
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Hyphenate "cross-data-center".
Per the grammar hint, "cross data center disaster recovery" should be "cross-data-center disaster recovery" (compound modifier before the noun).
🧰 Tools
🪛 LanguageTool
[grammar] ~15-~15: Use a hyphen to join words.
Context: ... operating a ClickHouse cluster in cross data center disaster recovery (DC-DR) mo...
(QB_NEW_EN_HYPHEN)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/runbook/index.md` around lines 15 - 16, Update the
ClickHouse DR runbook wording so the compound modifier is hyphenated correctly:
change the “cross data center disaster recovery” phrasing in the introduction to
“cross-data-center disaster recovery.” Keep the rest of the sentence intact and
ensure the document consistently uses the hyphenated form in this section.
Source: Linters/SAST tools
to `dc-b`. Because it waits for the target to catch up before flipping, near-zero committed
writes are lost. There is no promotion step.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Clarify "near-zero committed writes are lost".
The phrasing is ambiguous. Suggest: "Because it waits for the target to catch up before flipping, committed writes lost are near-zero" or "RPO is near-zero".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/runbook/index.md` around lines 114 - 115, The
wording in the DR runbook is ambiguous about “near-zero committed writes are
lost”; update the sentence in the ClickHouse DR runbook text to make the RPO
meaning explicit, e.g. by rephrasing it around “committed writes lost are
near-zero” or “RPO is near-zero,” while keeping the rest of the explanation in
the same section unchanged.
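As a rough illustration of "waits for the target to catch up before flipping", the sketch below gates a switchover on the `system.replicas` columns that the PR description names for monitoring (`absolute_delay`, `queue_size`, `log_pointer`, `log_max_index`). The thresholds, the `ReplicaStatus` type, and both functions are assumptions made for this example; the PR does not state what condition the operator actually checks.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """One row of system.replicas for a replica in the switchover target DC."""
    absolute_delay: int  # seconds the replica lags behind
    queue_size: int      # pending replication queue entries
    log_pointer: int
    log_max_index: int


def caught_up(r: ReplicaStatus, max_delay_s: int = 0) -> bool:
    """Assumed catch-up rule: no lag, empty queue, and the shared log fully consumed."""
    return (
        r.absolute_delay <= max_delay_s
        and r.queue_size == 0
        and r.log_pointer >= r.log_max_index
    )


def safe_to_flip(target_replicas: list[ReplicaStatus]) -> bool:
    """Flip the write endpoint only when every target replica has caught up."""
    return bool(target_replicas) and all(caught_up(r) for r in target_replicas)


target = [ReplicaStatus(0, 0, 101, 100), ReplicaStatus(3, 2, 90, 100)]
print(safe_to_flip(target))  # the second replica is still behind
```

Requiring every replica in the target DC to pass is the conservative choice; a real gate might reasonably tolerate a small delay instead.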
Same near-zero-RPO flow as scenario 3. There is no `pg_rewind` step and no rollback; a DC
that lacked Keeper quorum committed nothing to diverge.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Remove PostgreSQL-specific pg_rewind reference.
ClickHouse does not use pg_rewind (a PostgreSQL tool). This appears to be copy-paste residue from another database's DC-DR docs. Replace with a ClickHouse-appropriate description.
-Same near-zero-RPO flow as scenario 3. There is no `pg_rewind` step and no rollback; a DC
-that lacked Keeper quorum committed nothing to diverge.
+Same near-zero-RPO flow as scenario 3. There is no rewind step and no rollback; a DC
+that lacked Keeper quorum committed nothing, so there is nothing to diverge.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/runbook/index.md` around lines 137 - 139, The DR
runbook text still references PostgreSQL-specific pg_rewind, which does not
apply to ClickHouse. Update the affected prose in the runbook section describing
the near-zero-RPO flow to remove pg_rewind and replace it with
ClickHouse-appropriate wording that explains there is no divergence/rollback
step for a DC that lacked Keeper quorum.
is a standby (correct). Point writes at the endpoint `<db>`, not at this DC directly.
- **Lost Keeper quorum** `keeperQuorum:false` or `is_readonly:1` means this DC cannot reach
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix placeholder in endpoint guidance.
"Point writes at the endpoint <db>" uses <db> which suggests a database name, not the write endpoint. Replace with <endpoint> or rephrase to "Point writes at the endpoint, not at this DC directly."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/dr/runbook/index.md` around lines 265 - 266, The
endpoint guidance in the runbook uses the placeholder `<db>`, which reads like a
database name instead of a write endpoint. Update the sentence in this section
to use `<endpoint>` or rephrase it more clearly as “Point writes at the
endpoint, not at this DC directly,” keeping the rest of the standby guidance
unchanged.
## Cross-DC Disaster Recovery (DC-DR)

Do you want to run your ClickHouse database across multiple data centers and recover from a full data center failure with a single, automatically re-routing write endpoint? KubeDB runs one logical `ReplicatedMergeTree` cluster across the data centers, spreads ClickHouse Keeper 3-site so no single data center holds a Keeper majority (the split-brain guarantee), and lets the `dr-controlplane` Lease route the write endpoint to a data center that still holds Keeper quorum. Follow [here](/docs/guides/clickhouse/dr/overview/index.md).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use descriptive link text.
"Follow here" is non-descriptive. Replace with meaningful text such as "the DC-DR Overview guide".
-Follow [here](/docs/guides/clickhouse/dr/overview/index.md).
+Follow the [DC-DR Overview guide](/docs/guides/clickhouse/dr/overview/index.md).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 50-50: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/guides/clickhouse/README.md` at line 50, The ClickHouse README uses
non-descriptive link text in the sentence that ends with the DR overview link.
Update the anchor text in the relevant documentation section to a meaningful
phrase that describes the destination, such as the DC-DR Overview guide, while
keeping the same link target. Locate the prose near the ClickHouse
disaster-recovery explanation in the README and replace the generic “Follow
here” wording with descriptive text.
Source: Linters/SAST tools
Visit the preview URL for this PR (updated for commit 1bcd619): https://kubedb-v2-hugo--pr945-clickhouse-dc-dr-doc-9ayz6p9e.web.app (expires Wed, 08 Jul 2026 05:15:14 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 0f29ae8ae0bd54a99bf2b223b6833be47acd5943
Add the fifth How-it-works rule: ReplicatedMergeTree fetches are not DC-aware, so with two or more in-DC replicas of a shard the operator designates one in-DC replica per shard as the cross-DC fetch source and the others fetch intra-DC, holding cross-DC part traffic to one copy per shard per DC.

Signed-off-by: Tamal Saha <tamal@appscode.com>
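The fifth rule can be pictured as picking one replica per (DC, shard) pair. The sketch below is an assumption-laden illustration: the commit message does not say how the operator chooses the designated replica, so the lexicographically smallest replica name is used here purely as a stand-in, and the function and replica names are hypothetical.

```python
from collections import defaultdict


def designate_fetch_sources(
    replicas: list[tuple[str, str, str]],
) -> dict[tuple[str, str], str]:
    """Pick one cross-DC fetch source per (dc, shard).

    `replicas` holds (dc, shard, replica_name) tuples. The selection rule
    (smallest name) is a placeholder, not the operator's actual criterion.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for dc, shard, name in replicas:
        groups[(dc, shard)].append(name)
    return {key: min(names) for key, names in groups.items()}


replicas = [
    ("dc-a", "shard-0", "ch-a-0-1"),
    ("dc-a", "shard-0", "ch-a-0-0"),
    ("dc-b", "shard-0", "ch-b-0-0"),
    ("dc-b", "shard-0", "ch-b-0-1"),
]

for (dc, shard), source in sorted(designate_fetch_sources(replicas).items()):
    print(f"{dc}/{shard}: cross-DC fetches go through {source}")
```

With one designated replica per shard per DC, the remaining in-DC replicas would fetch from it locally, which matches the stated goal of one cross-DC copy per shard per DC.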
Adds the cross-data-center disaster recovery (DC-DR) documentation for KubeDB ClickHouse, mirroring the structure of the MongoDB DC-DR docs. ClickHouse is the multi-master, Keeper-quorum analog of MongoDB: `ReplicatedMergeTree` replicas in each DC replicate natively and asynchronously over port 9009, coordinated by a shared ClickHouse Keeper (Raft) ensemble. There is no second replication link to build and no promotion step.

New pages under `docs/guides/clickhouse/dr/`:

- Overview (`overview/index.md`): why ClickHouse DC-DR is the MongoDB analog, the four core rules (Keeper spread 3-site as the failover authority, Keeper quorum as the split-brain guarantee, the Lease as write-endpoint routing only, local reads), the three Keeper placement topologies (3-site spread as the documented automatic path, two-cluster per-region Keeper often better for write-heavy ingest, single-DC Keeper for lowest latency with manual failover), data center roles, the single-CR single-endpoint model, prerequisites, a deploy walkthrough with a realistic `PlacementPolicy` (two Member DCs + one Arbiter DC holding a data-less Keeper voter and the dr-controlplane etcd member), and `status.disasterRecovery`.
- User Guide (`guide/index.md`): components, the DC-name contract, deployment, connecting through the single write endpoint, the Keeper-quorum write contract, local reads, monitoring via `system.replicas` (`absolute_delay`, `queue_size`, `log_pointer` vs `log_max_index`), lag and RPO, Keeper placement and the arbiter, planned switchover via the `dr.kubedb.com/switchover-to` annotation, native failback (no rewind), and day-2 ops including per-DC `HorizontalScaling`.
- Runbook (`runbook/index.md`): twelve scenarios (active-DC loss, partition, planned switchover, failback, arbiter-DC loss, stuck switchover, standby loss, re-add a DC, Keeper quorum lost, unexpected read-only, coordination-plane loss, suspected split-brain), each with symptoms, automatic behavior, verification, and action.

Key correctness points: the ClickHouse Keeper Raft quorum is the data-plane safety (a partitioned minority DC loses quorum, cannot register parts, and its inserts fail); the `dr-controlplane` Lease is only routing, policy, and observability (it steers the single write endpoint, it does not promote anything); failback is native and clean with no rewind. Also adds a DC-DR link to the ClickHouse guides README.

Summary by CodeRabbit