metrics,grafana: show changefeed downstream type #4251

wlwilliamx wants to merge 2 commits into pingcap:master
Conversation
Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances TiCDC's monitoring capabilities by providing clear visibility into the downstream replication targets of each changefeed. By introducing a new metric and integrating it into the Grafana dashboard, operators can now quickly identify the type of sink (e.g., MySQL, Kafka, S3) a changefeed is writing to, which improves operational insight and simplifies the process of understanding data flow within the system.
📝 Walkthrough

Adds downstream-type inference and metrics for changefeeds, nil-safe GetInfo/SetInfo on in-memory ChangeFeedInfo, updates owner/controller and changefeed lifecycle to set and clean per-changefeed downstream metrics, updates MySQL/TiDB sink runtime metrics, and adds Grafana table panels showing changefeed → downstream_type.
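On the owner's metrics endpoint, the series described above would be scraped roughly as follows (the HELP text and label values here are illustrative, not taken from the PR):

```
# HELP ticdc_owner_changefeed_downstream_info downstream type of the changefeed
# TYPE ticdc_owner_changefeed_downstream_info gauge
ticdc_owner_changefeed_downstream_info{namespace="default",changefeed="cf-to-kafka",downstream_type="kafka"} 1
ticdc_owner_changefeed_downstream_info{namespace="default",changefeed="cf-to-mysql",downstream_type="mysql"} 1
```

The Grafana table panel then maps each changefeed row to its downstream_type label.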
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Owner as Owner Controller
    participant CF as Changefeed (in-memory)
    participant Metrics as Prometheus Gauges
    participant Prom as Prometheus Scraper
    participant Graf as Grafana
    Owner->>CF: GetInfo() -> read displayName, SinkURI
    CF-->>Owner: ChangeFeedInfo (may be nil)
    Owner->>Metrics: DownstreamTypeFromSinkURI(SinkURI) -> set Gauge{keyspace,changefeed,downstream_type}=1
    Owner->>Metrics: delete old Gauge label if downstream_type changed or feed removed
    Prom->>Metrics: scrape metrics endpoint
    Graf->>Prom: query ticdc_owner_changefeed_downstream_info{...}
    Graf-->>User: render table of keyspace/changefeed/downstream_type
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Code Review
This pull request introduces a new metric to expose the downstream type of a changefeed, which is a valuable addition for monitoring. The implementation includes defining the metric, logic for populating it in the controller, and cleanup procedures. A new utility to derive the downstream type from the sink URI is also added, along with comprehensive unit tests. The Grafana dashboards are updated to display this new information. The overall approach is solid, but I've found a critical issue that could lead to a panic.
Actionable comments posted: 1
🧹 Nitpick comments (3)
coordinator/changefeed/changefeed_db.go (1)
133-153: Merge the duplicate `if remove` blocks.

The downstream info gauge cleanup at lines 143–153 adds a second `if remove` check immediately after the first one (lines 133–141). Both guards cover the same condition; folding them together removes the redundancy and makes the removal path read as a single unit.

♻️ Proposed refactor
```diff
 if remove {
 	log.Info("remove changefeed", zap.String("changefeed", cf.ID.String()))
 	delete(db.changefeeds, cfID)
 	delete(db.changefeedDisplayNames, cf.ID.DisplayName)
 	delete(db.stopped, cfID)
+	info := cf.GetInfo()
+	if info != nil {
+		downstreamType := metrics.DownstreamTypeFromSinkURI(info.SinkURI)
+		metrics.ChangefeedDownstreamInfoGauge.DeleteLabelValues(
+			cfID.Keyspace(),
+			cfID.Name(),
+			downstreamType,
+		)
+	}
 } else {
 	log.Info("stop changefeed", zap.String("changefeed", cfID.String()))
 	db.stopped[cfID] = cf
 }
-
-if remove {
-	info := cf.GetInfo()
-	if info != nil {
-		downstreamType := metrics.DownstreamTypeFromSinkURI(info.SinkURI)
-		metrics.ChangefeedDownstreamInfoGauge.DeleteLabelValues(
-			cfID.Keyspace(),
-			cfID.Name(),
-			downstreamType,
-		)
-	}
-}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/changefeed/changefeed_db.go` around lines 133-153: there are two adjacent "if remove" blocks; merge them so the removal path is a single block: move the downstream gauge cleanup (using cf.GetInfo(), metrics.DownstreamTypeFromSinkURI(info.SinkURI), and metrics.ChangefeedDownstreamInfoGauge.DeleteLabelValues(cfID.Keyspace(), cfID.Name(), downstreamType)) into the first "if remove" that already deletes db.changefeeds, db.changefeedDisplayNames and db.stopped, and remove the second redundant "if remove" branch so all removal-related actions for cfID/cf occur together.

pkg/metrics/downstream_type_test.go (1)
36-46: Consider adding test cases for the remaining Pulsar variants and canonical GCS/AzBlob schemes.

The implementation handles four Pulsar schemes (`pulsar`, `pulsar+ssl`, `pulsar+http`, `pulsar+https`) and the canonical GCS (`gcs://`) and AzBlob (`azblob://`) schemes, but the test matrix only exercises `pulsar://`, `pulsar+https://`, `gs://`, and `azure://`. Additionally, an empty-string URI (`""`) is an untested edge case that the implementation silently maps to `"unknown"`.

➕ Suggested additional test cases
```diff
 {sinkURI: "pulsar://127.0.0.1:6650/topic", expectedLabel: "pulsar"},
 {sinkURI: "pulsar+https://127.0.0.1:8443/topic", expectedLabel: "pulsar"},
+{sinkURI: "pulsar+ssl://127.0.0.1:6651/topic", expectedLabel: "pulsar"},
+{sinkURI: "pulsar+http://127.0.0.1:8080/topic", expectedLabel: "pulsar"},
 {sinkURI: "file:///tmp/ticdc", expectedLabel: "file"},
 {sinkURI: "s3://bucket/prefix", expectedLabel: "s3"},
 {sinkURI: "gs://bucket/prefix", expectedLabel: "gcs"},
+{sinkURI: "gcs://bucket/prefix", expectedLabel: "gcs"},
 {sinkURI: "azure://bucket/prefix", expectedLabel: "azblob"},
+{sinkURI: "azblob://bucket/prefix", expectedLabel: "azblob"},
 {sinkURI: "noop://bucket/prefix", expectedLabel: "noop"},
 {sinkURI: "blackhole:///", expectedLabel: "blackhole"},
 {sinkURI: "127.0.0.1:3306", expectedLabel: "unknown"},
 {sinkURI: "mysql://[::1", expectedLabel: "unknown"},
+{sinkURI: "", expectedLabel: "unknown"},
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/downstream_type_test.go` around lines 36-46: the test matrix omits several URI variants that the production code handles; update the table of cases (entries using sinkURI and expectedLabel) to add rows for pulsar+ssl://..., pulsar+http://..., the canonical gcs:// and azblob:// schemes, and an empty-string "" case expecting "unknown"; ensure you use the same sinkURI/expectedLabel fields so the existing loop/assert logic exercises these new cases.

metrics/grafana/ticdc_new_arch.json (1)
4635-4645: Consider using `format: "table"` to simplify the transformation chain.

With `instant: true`, the Prometheus data source can return a table response (`format: "table"`) where label columns are already promoted — making the `labelsToFields` transformation redundant. The `organize` step is still needed to reorder/hide columns.

♻️ Proposed simplification
```diff
   "expr": "max by (namespace, changefeed, downstream_type) (ticdc_owner_changefeed_downstream_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
-  "format": "time_series",
+  "format": "table",
   "instant": true,
   "refId": "A"
```

Then remove the now-redundant `labelsToFields` transformation:

```diff
 "transformations": [
-  {
-    "id": "labelsToFields",
-    "options": {}
-  },
   {
     "id": "organize",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics/grafana/ticdc_new_arch.json` around lines 4635 - 4645, The panel "Changefeed Downstream Type" is using instant: true but still requests time_series format and applies a redundant labelsToFields transformation; change the query block to use format: "table" (keeping instant: true) so Prometheus returns label columns directly, then remove the labelsToFields transformation entry and keep the organize/other transformations to reorder or hide columns; look for the panel title "Changefeed Downstream Type", the query with expr "ticdc_owner_changefeed_downstream_info...", and the transformation object with "id": "labelsToFields" to make these edits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@coordinator/changefeed/changefeed.go`:
- Around line 115-121: The SetInfo method currently dereferences a nil receiver
because the condition mixes c == nil with c.info == nil; first guard against a
nil receiver by returning immediately if c == nil, then handle lazy
initialization of c.info: if c.info == nil create a new atomic.Pointer via
atomic.NewPointer(info) and assign it to c.info, else call c.info.Store(info);
reference the Changefeed.SetInfo method, the receiver c, the field c.info,
atomic.NewPointer and c.info.Store to locate and apply the fix.
---
Nitpick comments:
In `@coordinator/changefeed/changefeed_db.go`:
- Around line 133-153: There are two adjacent "if remove" blocks; merge them so
the removal path is a single block: move the downstream gauge cleanup (using
cf.GetInfo(), metrics.DownstreamTypeFromSinkURI(info.SinkURI), and
metrics.ChangefeedDownstreamInfoGauge.DeleteLabelValues(cfID.Keyspace(),
cfID.Name(), downstreamType)) into the first "if remove" that already deletes
db.changefeeds, db.changefeedDisplayNames and db.stopped, and remove the second
redundant "if remove" branch so all removal-related actions for cfID/cf occur
together.
In `@metrics/grafana/ticdc_new_arch.json`:
- Around line 4635-4645: The panel "Changefeed Downstream Type" is using
instant: true but still requests time_series format and applies a redundant
labelsToFields transformation; change the query block to use format: "table"
(keeping instant: true) so Prometheus returns label columns directly, then
remove the labelsToFields transformation entry and keep the organize/other
transformations to reorder or hide columns; look for the panel title "Changefeed
Downstream Type", the query with expr
"ticdc_owner_changefeed_downstream_info...", and the transformation object with
"id": "labelsToFields" to make these edits.
In `@pkg/metrics/downstream_type_test.go`:
- Around line 36-46: The test matrix in pkg/metrics/downstream_type_test.go
omits several URI variants that the production code handles; update the table of
cases (entries using sinkURI and expectedLabel) in the test (the downstream type
test) to add rows for pulsar+ssl://..., pulsar+http://..., the canonical gcs://
and azblob:// schemes, and an empty-string "" case expecting "unknown"; ensure
you use the same sinkURI/expectedLabel fields so the existing loop/assert logic
exercises these new cases.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- coordinator/changefeed/changefeed.go
- coordinator/changefeed/changefeed_db.go
- coordinator/controller.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- pkg/metrics/changefeed.go
- pkg/metrics/downstream_type.go
- pkg/metrics/downstream_type_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/downstream_type.go`:
- Around line 41-45: The downstream label for MySQL schemes currently returns
"mysql/tidb" in the switch handling config.MySQLScheme and
config.MySQLSSLScheme; change it to return "mysql" instead so it matches the
normalization spec and the PR screenshot, and rely on the existing
ChangefeedDownstreamIsTiDBGauge metric to indicate confirmed TiDB instances;
update the accompanying comment in downstream_type.go (around the MySQL/SSL
case) to reflect that SSL maps to "mysql" and TiDB confirmation is conveyed
separately.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- downstreamadapter/sink/mysql/sink.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- pkg/metrics/changefeed.go
- pkg/metrics/downstream_type.go
- pkg/metrics/downstream_type_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/metrics/changefeed.go
- pkg/metrics/downstream_type_test.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
Force-pushed: 8f84d76 → 0f84dea

Force-pushed: 0f84dea → 06343b6
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asddongmen, lidezhu

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
@wlwilliamx: The following tests failed.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #4250
What is changed and how it works?
- Add a metric `ticdc_owner_changefeed_downstream_info{namespace|keyspace_name, changefeed, downstream_type} 1` to expose the downstream type per changefeed.
- Derive `downstream_type` from the sink-uri scheme and normalize it to a small stable set (e.g. `mysql+ssl` -> `mysql`, `pulsar+https` -> `pulsar`, `gs` -> `gcs`, `azure` -> `azblob`).

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. This PR only adds an additive metric and dashboard panel, and the per-changefeed metric update is on the same 5s loop as existing changefeed metrics.
Do you need to update user documentation, design documentation or monitoring documentation?
Monitoring dashboard is updated.
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Behavior
Tests