Draft
Conversation
Add support for ALTER CHANGEFEED SET/UNSET INCLUDE/EXCLUDE TABLES. The following syntax is supported: ALTER CHANGEFEED 123 SET INCLUDE TABLES foo, bar ALTER CHANGEFEED 123 SET EXCLUDE TABLES foo, bar ALTER CHANGEFEED 123 UNSET INCLUDE TABLES ALTER CHANGEFEED 123 UNSET EXCLUDE TABLES Release note: none Resolves: cockroachdb#147428
Doesn't address initial scan, do that in follow up
|
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
528bc77 to
b4acc4c
Compare
aerfrei
pushed a commit
that referenced
this pull request
Nov 14, 2025
156830: storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick r=miraradeva,iskettaneh a=dodeca12 This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent thundering herd of goroutine spikes. ### Changes Core changes are within these files: ```sh pkg/kv/kvserver/storeliveness ├── support_manager.go # Rename SendAsync→EnqueueMessage, add smearing settings └── transport.go # Add smearing sender goroutine to transport.go which takes care of smearing when enabled ``` ### Background Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process. ### Commits **Commit: Introduce heartbeat smearing** - Adds a smearing sender goroutine to `transport.go` that batches enqueued messages - Smears send signals across queues using `taskpacer` to spread traffic over time - Configurable via cluster settings (default: enabled) **How it works:** 1. Messages are enqueued via `EnqueueMessage()` into per-node queues 2. When `SendAllEnqueuedMessages()` is called, transport's smearing sender goroutine waits briefly to batch messages 3. Transport's smearing sender goroutine uses `taskpacer` to pace signaling to each queue over a smear duration 4. Each `processQueue` goroutine drains its queue and sends when signalled ### New Cluster Settings - `kv.store_liveness.heartbeat_smearing.enabled` (default: true) - Enable/disable smearing - `kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) - Batching window duration - `kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) - Time to spread sends across queues ### Backward Compatibility - Feature is disabled by setting `kv.store_liveness.heartbeat_smearing.enabled=false` - When disabled, messages are sent immediately via the existing path (non-smearing mode) ### Testing - Added comprehensive unit tests verifying: - Messages batch correctly across multiple destinations - Smearing spreads signals over the configured time window - Smearing mode vs immediate mode behaviour - Message ordering and reliability All existing tests updated to call `SendAllEnqueuedMessages()` after enqueuing when smearing is enabled. #### Roachprod testing ##### Prototype #1 - Ran a prototype with a [similar design](cockroachdb#154942) (TL;DR of prototype, the heartbeats were smeared via `SupportManager` goroutines being put to sleep; this current design ensures that `SupportManager` goroutines do not get blocked) on a roachprod with 150 node cluster to verify smearing works. | Before changes (current behaviour on master) | After changes (prototype) | |--------|--------| | <img width="2680" height="570" alt="image" src="https://github.com/user-attachments/assets/32fe6ee0-437f-48eb-b3f1-087a3eafe6ac" /> | <img width="2692" height="634" alt="image" src="https://github.com/user-attachments/assets/66b5b82b-bbc4-4f47-a13e-5f6d42a1c6d4" /> | ##### Current changes - Ran a roachprod test with current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour. - Ran a mini-roachprod test on [this prototype commit](https://github.com/cockroachdb/cockroach/pull/155317/files#diff-9282b4b1d9a2fe32fae81e5776eb081e58069b4bc7db76718873b75f026e16c1) (where the only real difference between my changes is the inclusion of the length check on the queues that have messages on that commit) showed expected smearing behaviour. <img width="1797" height="469" alt="image" src="https://github.com/user-attachments/assets/bd7778ef-9f8d-4dbf-8ed2-dac40e7fb03c" /> Fixes: cockroachdb#148210 Release note: None Co-authored-by: Swapneeth Gorantla <swapneeth.gorantla@cockroachlabs.com>
aerfrei
pushed a commit
that referenced
this pull request
Dec 23, 2025
159877: kvserver: deflake TestReadLoadMetricAccounting r=tbg a=tbg `TestReadLoadMetricAccounting` has a history of flaking due to lease-related writes interfering with load metric measurements. Issue cockroachdb#141716 (and cockroachdb#141586) identified the same failure signature: ``` Error: Max difference between 0 and 85 allowed is 4, but difference was -85 ``` The root cause was identified by `@pav-kv:` an "unexpected" leader lease upgrade write was interfering with the test's write bytes measurements. PR cockroachdb#141843 added `tc.MaybeWaitForLeaseUpgrade()` to wait for lease upgrades before starting measurements. **The fix from cockroachdb#141843 IS present** in the failing SHA. However, the test still flaked with the same error signature (85 write bytes when expecting 0). The logs show: 1. AddSSTableRequest evaluated (test setup) 2. Many LeaseInfoRequest polls (from MaybeWaitForLeaseUpgrade) 3. RequestLeaseRequest (the lease upgrade write) 4. More LeaseInfoRequest polls 5. "lease is now of type: LeaseLeader" - **upgrade complete** 6. "test #1" - test loop begins 7. GetRequest evaluated (the actual test request) 8. **Assertion fails** - 85 write bytes observed The race condition is subtle: `MaybeWaitForLeaseUpgrade()` waits until `FindRangeLeaseEx()` reports the lease is upgraded, but it does **not** guarantee that the write bytes have been recorded to load stats. This is because stats are recorded "awkwardly late" on the client goroutine (`SendWithWriteBytes`). The fix: 1. Wraps each test case iteration in `SucceedsSoon` 2. Resets load stats, sends the request, checks results 3. If any stat doesn't match (due to background activity like lease upgrades), returns an error to trigger retry 4. Adds a comment noting that test cases must be idempotent (they are—all reads) ## Related Issues/PRs | Issue/PR | Status | Relevance | |----------|--------|-----------| | cockroachdb#159719 | OPEN | Current failure | | cockroachdb#141716 | CLOSED | Duplicate, Feb 2025 | | cockroachdb#141586 | CLOSED | Original issue, Feb 2025 | | cockroachdb#141843 | MERGED | Deflake attempt (wait for lease upgrade) | | cockroachdb#141599 | MERGED | Added logging to help debug | | cockroachdb#141905 | CLOSED | Duplicate | | cockroachdb#134799 | CLOSED | Older occurrence | This is more robust than trying to synchronize with specific background operations because it handles **any** source of interference, not just lease upgrades. Epic: none Closes cockroachdb#159719. Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.