metrics,grafana: add ticdc_arch to build info#4266
metrics,grafana: add ticdc_arch to build info#4266wlwilliamx wants to merge 1 commit intopingcap:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThe changes introduce a new Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @wlwilliamx, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the observability of TiCDC instances by introducing a new metric label, Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a ticdc_arch label to the ticdc_server_build_info metric to distinguish between new and old architecture TiCDC instances. The changes correctly update the metric definition and set the label value. The Grafana dashboards are also updated to display this new information, with a fallback mechanism for older instances that don't report this metric. While the implementation is functionally correct, the PromQL queries in the Grafana dashboards have become quite complex and hard to maintain. I've added specific comments with suggestions to improve the maintainability of these queries.
| "targets": [ | ||
| { | ||
| "expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})", | ||
| "expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})", |
There was a problem hiding this comment.
The new PromQL query is functionally correct for providing a fallback for older TiCDC versions. However, its length and the deeply nested label_replace calls make it difficult to read and maintain.
For improved readability, you could consider using a recording rule in Prometheus to pre-calculate the fallback metric with the necessary labels. This would simplify the dashboard query significantly.
If a recording rule is not feasible, adding comments within the Grafana panel's description explaining the query logic would be helpful for future maintenance.
| "targets": [ | ||
| { | ||
| "expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})", | ||
| "expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})", |
There was a problem hiding this comment.
Similar to the other dashboard file, this PromQL query is very long and complex due to the nested label_replace functions for fallback support. This impacts readability and maintainability.
Consider using a Prometheus recording rule to abstract away this complexity. A recording rule could generate a clean metric for old-architecture nodes, which would make this dashboard query much simpler.
If that's not an option, at least a comment in the panel description explaining the query would be beneficial.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cmd/cdc/server/server.go (1)
126-126: Consider extracting"newarch"to a named constant.The raw string literal
"newarch"is embedded directly in theWithLabelValuescall. If this value is ever referenced or compared elsewhere (e.g., in tests, dashboard query validation, or a future"oldarch"metric path), having it as an unshared literal creates a maintenance risk.♻️ Proposed refactor
Define a constant — for example in
pkg/metricsor a shared location — and reference it here:+// In pkg/metrics/server.go (or a suitable shared constants file): +const ( + TiCDCArchNewArch = "newarch" + TiCDCArchOldArch = "oldarch" +)-metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), "newarch").Set(1) +metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), metrics.TiCDCArchNewArch).Set(1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cdc/server/server.go` at line 126, Extract the literal "newarch" into a well-named constant (e.g., const ArchNew = "newarch") and use that constant in the metrics.BuildInfo.WithLabelValues call to avoid magic strings; define the constant in a shared place such as the pkg/metrics package (or another common package used by cmd/cdc/server), then replace the direct literal in server.go (the metrics.BuildInfo.WithLabelValues(...) invocation) with the constant reference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cmd/cdc/server/server.go`:
- Line 126: Extract the literal "newarch" into a well-named constant (e.g.,
const ArchNew = "newarch") and use that constant in the
metrics.BuildInfo.WithLabelValues call to avoid magic strings; define the
constant in a shared place such as the pkg/metrics package (or another common
package used by cmd/cdc/server), then replace the direct literal in server.go
(the metrics.BuildInfo.WithLabelValues(...) invocation) with the constant
reference.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cmd/cdc/server/server.gometrics/grafana/ticdc_new_arch.jsonmetrics/nextgengrafana/ticdc_new_arch_next_gen.jsonpkg/metrics/server.go
What problem does this PR solve?
Issue Number: close #4265
What is changed and how it works?
ticdc_archlabel toticdc_server_build_info.ticdc_arch="newarch"when TiCDC runs in new architecture mode.ticdc_arch. If an instance does not exposeticdc_server_build_info(old arch), the panel falls back toticdc_server_etcd_health_check_duration_countand showsticdc_arch="oldarch".Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. The change adds one constant label to an existing build info gauge metric and only affects dashboard queries.
Do you need to update user documentation, design documentation or monitoring documentation?
Monitoring dashboard is updated in this PR.
Release note
Summary by CodeRabbit
New Features
Monitoring