
metrics,grafana: add ticdc_arch to build info#4266

Open
wlwilliamx wants to merge 1 commit into pingcap:master from wlwilliamx:metrics/add-ticdc-arch-build-info

wlwilliamx (Collaborator) commented Feb 25, 2026

What problem does this PR solve?

Issue Number: close #4265

What is changed and how it works?

  • Add ticdc_arch label to ticdc_server_build_info.
  • Set ticdc_arch="newarch" when TiCDC runs in new architecture mode.
  • Update the Grafana "Build Info" panel to show ticdc_arch. If an instance does not expose ticdc_server_build_info (old arch), the panel falls back to ticdc_server_etcd_health_check_duration_count and shows ticdc_arch="oldarch".

Check List

Tests

  • Manual test
[Screenshot: CleanShot 2026-02-24 at 19 23 10@2x]

Questions

Will it cause performance regression or break compatibility?

No. The change adds one constant label to an existing build info gauge metric and only affects dashboard queries.

Do you need to update user documentation, design documentation or monitoring documentation?

Monitoring dashboard is updated in this PR.

Release note

Expose TiCDC architecture mode (newarch/oldarch) in build info metric and dashboard.

Summary by CodeRabbit

  • New Features

    • Added architecture tracking dimension to build information metrics, enabling identification and monitoring of different deployment architectures.
  • Monitoring

    • Updated Grafana dashboards to display the new architecture dimension in Build Info panels, with enhanced data aggregation and fallback logic for comprehensive architecture visibility across instances.

ti-chi-bot bot added the release-note label Feb 25, 2026
ti-chi-bot bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot (Contributor) commented Feb 25, 2026

📝 Walkthrough


The changes introduce a new ticdc_arch label to the ticdc_server_build_info metric to expose the TiCDC runtime architecture mode. The label is added to the Prometheus metric definition, instrumented with "newarch" value at server startup, and integrated into Grafana dashboard queries and transformations for visualization.

Changes

Cohort / File(s) | Summary
Metrics Definition: pkg/metrics/server.go | Added ticdc_arch as a new label to the BuildInfo GaugeVec metric definition, extending the label set from 4 to 5 labels to track architecture mode.
Server Instrumentation: cmd/cdc/server/server.go | Updated metrics initialization to pass "newarch" as the value for the new ticdc_arch label when setting BuildInfo.
Grafana Dashboards: metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json | Updated Build Info panel queries to include ticdc_arch in aggregation dimensions and reindexed field mappings in organize transformations to accommodate the new label (git_hash: 2→3, kernel_type: 1→2, release_version: 3→4, utc_build_time: 4→5, ticdc_arch: 1).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A label is born, so clear and bright,
Newarch or oldarch—truth takes flight!
Through Prometheus streams and Grafana's view,
The architecture mode shines right through!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'metrics,grafana: add ticdc_arch to build info' clearly and concisely describes the main change: adding a ticdc_arch label to build info metrics and dashboards.
Linked Issues check | ✅ Passed | The PR fully implements the requirements from issue #4265: adds ticdc_arch label to ticdc_server_build_info with newarch/oldarch values, updates Grafana dashboards to show this label, and maintains backward compatibility with fallback logic for old-arch instances.
Out of Scope Changes check | ✅ Passed | All changes are directly related to adding ticdc_arch to the build info metric and updating Grafana dashboards; no unrelated or out-of-scope modifications are present.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check | ✅ Passed | The PR description follows the template structure with Issue Number, What is changed and how it works, Tests, Questions answered, and Release note all properly filled out.


ti-chi-bot bot added the size/M label Feb 25, 2026
@gemini-code-assist

Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the observability of TiCDC instances by introducing a new metric label, ticdc_arch, which identifies the architecture mode (new or old) of a running TiCDC server. This information is then integrated into the Grafana monitoring dashboards, providing users with clearer insights into the operational state and architecture of their TiCDC deployments. The change ensures that even older instances are represented, enhancing overall monitoring consistency.

Highlights

  • Metric Enhancement: The ticdc_server_build_info metric now includes a ticdc_arch label to indicate whether TiCDC is running in 'newarch' or 'oldarch' mode.
  • Grafana Dashboard Update: The 'Build Info' panel in Grafana dashboards has been updated to display the ticdc_arch label. It also includes fallback logic to infer 'oldarch' for instances that do not expose the new ticdc_server_build_info metric.
  • Code Instrumentation: The TiCDC server initialization now explicitly sets the ticdc_arch label to 'newarch' when reporting build information.


Changelog
  • cmd/cdc/server/server.go
    • Added the ticdc_arch label with the value "newarch" to the metrics.BuildInfo metric when reporting build information.
  • metrics/grafana/ticdc_new_arch.json
    • Modified the Prometheus query for the "Build Info" panel to include ticdc_arch in the max by clause.
    • Added an or clause to the query to provide fallback build info for instances that do not expose ticdc_server_build_info, labeling them as ticdc_arch="oldarch".
    • Adjusted the indexByName mapping to account for the new ticdc_arch label.
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
    • Modified the Prometheus query for the "Build Info" panel to include ticdc_arch in the max by clause.
    • Added an or clause to the query to provide fallback build info for instances that do not expose ticdc_server_build_info, labeling them as ticdc_arch="oldarch".
    • Adjusted the indexByName mapping to account for the new ticdc_arch label.
  • pkg/metrics/server.go
    • Updated the BuildInfo Prometheus GaugeVec definition to include "ticdc_arch" as a new label.
    • Updated the comment for BuildInfo to reflect the inclusion of architecture mode.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a ticdc_arch label to the ticdc_server_build_info metric to distinguish between new and old architecture TiCDC instances. The changes correctly update the metric definition and set the label value. The Grafana dashboards are also updated to display this new information, with a fallback mechanism for older instances that don't report this metric. While the implementation is functionally correct, the PromQL queries in the Grafana dashboards have become quite complex and hard to maintain. I've added specific comments with suggestions to improve the maintainability of these queries.

"targets": [
{
-"expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",
+"expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",


medium

The new PromQL query is functionally correct for providing a fallback for older TiCDC versions. However, its length and the deeply nested label_replace calls make it difficult to read and maintain.

For improved readability, you could consider using a recording rule in Prometheus to pre-calculate the fallback metric with the necessary labels. This would simplify the dashboard query significantly.

If a recording rule is not feasible, adding comments within the Grafana panel's description explaining the query logic would be helpful for future maintenance.

"targets": [
{
-"expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",
+"expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",


medium

Similar to the other dashboard file, this PromQL query is very long and complex due to the nested label_replace functions for fallback support. This impacts readability and maintainability.

Consider using a Prometheus recording rule to abstract away this complexity. A recording rule could generate a clean metric for old-architecture nodes, which would make this dashboard query much simpler.

If that's not an option, at least a comment in the panel description explaining the query would be beneficial.

coderabbitai bot (Contributor) left a comment


🧹 Nitpick comments (1)
cmd/cdc/server/server.go (1)

126-126: Consider extracting "newarch" to a named constant.

The raw string literal "newarch" is embedded directly in the WithLabelValues call. If this value is ever referenced or compared elsewhere (e.g., in tests, dashboard query validation, or a future "oldarch" metric path), having it as an unshared literal creates a maintenance risk.

♻️ Proposed refactor

Define a constant — for example in pkg/metrics or a shared location — and reference it here:

+// In pkg/metrics/server.go (or a suitable shared constants file):
+const (
+    TiCDCArchNewArch = "newarch"
+    TiCDCArchOldArch = "oldarch"
+)
-metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), "newarch").Set(1)
+metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), metrics.TiCDCArchNewArch).Set(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cdc/server/server.go` at line 126, Extract the literal "newarch" into a
well-named constant (e.g., const ArchNew = "newarch") and use that constant in
the metrics.BuildInfo.WithLabelValues call to avoid magic strings; define the
constant in a shared place such as the pkg/metrics package (or another common
package used by cmd/cdc/server), then replace the direct literal in server.go
(the metrics.BuildInfo.WithLabelValues(...) invocation) with the constant
reference.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75e291b and 509a59c.

📒 Files selected for processing (4)
  • cmd/cdc/server/server.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • pkg/metrics/server.go


Labels

release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

metrics: show TiCDC architecture mode in build info

1 participant