Skip to content

HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264

Open
kuudori wants to merge 1 commit into
openshift-hyperfleet:mainfrom
kuudori:HYPERFLEET-1205-reconciliation-metrics
Open

HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264
kuudori wants to merge 1 commit into
openshift-hyperfleet:mainfrom
kuudori:HYPERFLEET-1205-reconciliation-metrics

Conversation

@kuudori

@kuudori kuudori commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add unified reconciliation metrics replacing deletion-specific metrics
    • hyperfleet_api_reconciliation_started_total — counter recording Reconciled→False transitions
    • hyperfleet_api_resource_pending_reconciliation — gauge of resources with Reconciled=False
    • hyperfleet_api_resource_pending_reconciliation_stuck — gauge of resources stuck beyond threshold
    • hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds — max stuck duration
  • All metrics carry resource_type and is_delete labels to distinguish normal from deletion lifecycle
  • Remove old deletion-specific metrics (resource_pending_deletion_total, resource_pending_deletion_duration_seconds, resource_pending_deletion_stuck)
  • Update Helm chart values, schema, and PrometheusRule alerts to reference new metric names
  • Configurable reconciliation_stuck_threshold (default 10m) via flag/env/config

AC Coverage

# Acceptance Criteria Test
1 reconciliation_started_total counter TestReconciliationStartedTotalIsRegistered, TestRecordReconciliationStarted_*
2 resource_pending_reconciliation gauge TestReconciliationCollector_Integration/pending
3 resource_pending_reconciliation_stuck gauge TestReconciliationCollector_Integration/stuck
4 stuck_duration_seconds gauge TestReconciliationCollector_Integration/max_stuck_duration
5 Architecture docs Deferred — HYPERFLEET-1301 (different repo)
6 Old deletion metrics removed All references removed, files deleted

Design Decisions

  • Counter trigger: fires at Reconciled→False transition (not at soft-delete time), capturing both deletion and create/update reconciliations
  • Collector SQL: single CTE query parses JSONB once per row via jsonb_path_query_first, leveraging existing BTREE expression indexes — 1 round-trip instead of 3
  • Labels: component/version as const labels (not variable) — they never vary, matches Prometheus best practice

Breaking Changes

  • Dashboard/alert queries referencing hyperfleet_api_resource_pending_deletion_stuck must update to hyperfleet_api_resource_pending_reconciliation_stuck
  • Config key deletion_stuck_threshold renamed to reconciliation_stuck_threshold (default changed from 30m to 10m)
  • Helm value config.metrics.deletion_stuck_threshold renamed to config.metrics.reconciliation_stuck_threshold

Test plan

  • Unit tests pass (make verify-all — 1268 tests)
  • Helm chart tests pass (make test-helm)
  • Integration tests pass (make test-integration — requires Docker/testcontainers)
  • E2E: gated in CI, not run locally

Closes: HYPERFLEET-1205

Replace deletion-specific metrics with four reconciliation metrics:
- reconciliation_started_total counter (Reconciled→False transitions)
- resource_pending_reconciliation gauge (Reconciled=False count)
- resource_pending_reconciliation_stuck gauge (beyond threshold)
- resource_pending_reconciliation_stuck_duration_seconds gauge (max)

All metrics carry resource_type and is_delete labels. Collector uses
a single CTE query parsing JSONB once per row via
jsonb_path_query_first, leveraging existing BTREE expression indexes.

Counter instrumented in recomputeAndSaveClusterStatus and
computeNodePoolConditionsJSON on Reconciled→False transition.
Configurable reconciliation_stuck_threshold (default 10m).

Remove deletion-specific metrics (resource_pending_deletion_total,
resource_pending_deletion_duration_seconds,
resource_pending_deletion_stuck) and update Helm chart values,
schema, and PrometheusRule alerts to reference new metric names.
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mbrudnoy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added reconciliation-focused metrics and alerts for resources that remain in progress too long.
    • Introduced a new configuration option and CLI flag for the reconciliation stuck threshold.
    • Updated monitoring output to distinguish normal, stuck, and duration-based reconciliation states.
  • Bug Fixes
    • Improved status tracking so repeated failure signals are not emitted unnecessarily.
    • Removed outdated deletion-based metric behavior in favor of reconciliation-based reporting.

Walkthrough

pkg/metrics/deletion.go and its test file are deleted. A replacement pkg/metrics/reconciliation.go adds a reconciliation_started_total counter and a ReconciliationCollector that runs a single CTE SQL query to emit pending, stuck, and max-stuck-duration gauges labeled by resource_type and is_delete. MetricsConfig replaces DeletionStuckThreshold with ReconciliationStuckThreshold; CLI flags, env bindings, and loader keys are updated to match. Service call sites in cluster.go and status_helpers.go emit RecordReconciliationStarted only on a reconciled=false transition, gated by prior condition state. Helm chart alert rules, values, and schema are updated from deletion to reconciliation terminology.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Security notes (no praise, direct)

CWE-89 — SQL injection surface in reconciliationQuery
The CTE in reconciliation.go passes $1 (a computed time.Time) as a parameterized argument, which is correct. Verify that no string interpolation is used anywhere in that constant and that the *sql.DB context timeout is enforced before the query runs, not after. A missing context cancellation on slow scrapes leaves a goroutine leak attack surface under high cardinality load.

CWE-400 — Unbounded metric cardinality via resource_type label
RecordReconciliationStarted(resourceType string, isDelete bool) accepts a caller-supplied string directly as a Prometheus label value. If resourceType is ever derived from an external or user-controlled input path, this creates unbounded metric cardinality (memory exhaustion). Confirm the call sites ("cluster", "nodepool") are always static literals.

CWE-362 — TOCTOU in prevReconciledStatus comparison
recomputeAndSaveClusterStatus reads prevReconciledStatus from the in-memory cluster object, then writes new status and conditionally emits the metric. If two goroutines race on the same cluster, the "not already false" guard can be passed by both, emitting duplicate metrics. Verify this path is serialized per-cluster.

Supply chain note — deleted integration test
test/integration/deletion_metrics_test.go is removed entirely without a phased deprecation. Any CI job that relied on this test for coverage is now silently uncovered until reconciliation_metrics_test.go is confirmed to run in the same integration suite environment.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly names the main change: adding metrics for active and stuck reconciliations.
Description check ✅ Passed The description matches the diff and explains the metric rename, removals, config updates, and tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed PASS: No non-test log statement logs token/password/secret/credential values; only redacted config output. No CWE-532/CWE-200 leak found.
No Hardcoded Secrets ✅ Passed No hardcoded secrets, embedded credentials, or long base64 blobs found in changed files; only placeholders and metric/config literals. No CWE-798/CWE-259 issue.
No Weak Cryptography ✅ Passed No CWE-327/CWE-208 issues: touched files and repo-wide Go search show no md5/des/rc4/SHA1-for-security, ECB, custom crypto, or secret compares.
No Injection Vectors ✅ Passed No CWE-89/78/79/502 sink in changed prod code; SQL uses $1 parameters, and no exec/template/yaml unmarshal additions found.
No Privileged Containers ✅ Passed No changed Helm/Dockerfile workload sets privileged flags; the only Helm template is a PrometheusRule, not a pod spec (CWE-250).
No Pii Or Sensitive Data In Logs ✅ Passed No CWE-532 issue found: changed logs are generic errors/info and do not print PII, session data, raw bodies, or credentialed hostnames.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands.

@kuudori kuudori marked this pull request as ready for review June 29, 2026 22:48
@openshift-ci openshift-ci Bot requested review from ciaranRoche and jsell-rh June 29, 2026 22:49
@hyperfleet-ci-bot

Copy link
Copy Markdown

Risk Score: 5 — risk/high

Signal Detail Points
PR size 1249 lines (>500) +2
Sensitive paths cmd/ +2
Test coverage Missing tests for: cmd/hyperfleet-api/servecmd cmd/hyperfleet-api/server pkg/config pkg/services +1

Computed by hyperfleet-risk-scorer

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/services/status_helpers.go (1)

30-45: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Keep the counter emission out of computeNodePoolConditionsJSON.

This helper is also used by recomputeNodePoolConditions, which only mutates in-memory objects. That means reconciliation_started_total can advance without any persisted Reconciled=False transition, and it can also overcount when the later nodePoolDao.Save fails.

Return a started flag from this helper and emit the metric only in the call sites that successfully persist the new status.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/services/status_helpers.go` around lines 30 - 45, Move the
reconciliation_started_total emission out of the status helper logic in
computeNodePoolConditionsJSON/related aggregation flow: the helper should only
determine the new status and return a started flag when reconciled becomes False
from a non-False previous state. Then update the call sites, especially the path
that persists via nodePoolDao.Save, to record metrics only after a successful
save so recomputeNodePoolConditions and failed persistence do not increment the
counter prematurely.
pkg/services/cluster.go (1)

241-260: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Record reconciliation_started_total only after SaveStatusConditions succeeds.

This increments the counter before the new Reconciled=False state is persisted. If SaveStatusConditions fails, the next retry still sees the old stored status and increments again, so the transition counter drifts upward without a committed state change.

Suggested fix
-	if reconciled.Status == api.ConditionFalse &&
-		(prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) {
-		metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil)
-	}
-
 	allConditions := make([]api.ResourceCondition, 0, fixedConditionCount+len(adapterConditions))
 	allConditions = append(allConditions, reconciled, lastKnownReconciled)
 	allConditions = append(allConditions, adapterConditions...)
@@
 	if err := s.clusterDao.SaveStatusConditions(ctx, cluster.ID, conditionsJSON); err != nil {
 		return nil, handleUpdateError(api.ResourceTypeCluster, err)
 	}
 
 	cluster.StatusConditions = conditionsJSON
+	if reconciled.Status == api.ConditionFalse &&
+		(prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) {
+		metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil)
+	}
 	return cluster, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/services/cluster.go` around lines 241 - 260, Move the
reconciliation-start metric in cluster reconciliation so
`metrics.RecordReconciliationStarted` is only called after
`s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new
status. Update the `cluster` status update flow around `reconciled`,
`prevReconciledStatus`, and `SaveStatusConditions` so the counter increments
only on a committed `Reconciled=False` transition and does not fire on failed
retries.
🧹 Nitpick comments (1)
test/integration/reconciliation_metrics_test.go (1)

23-165: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a nodepool integration case for the second half of the UNION.

reconciliationQuery aggregates both clusters and node_pools, but this suite only inserts clusters. A nodepool-only regression in the second SELECT can ship unnoticed while every test here stays green. Add at least one nodepool pending/stuck assertion using factories.NewNodePoolWithStatusAtTime(...) (or the equivalent factory). As per path instructions, **/*_test.go: New exported functions and critical logic paths SHOULD have tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/integration/reconciliation_metrics_test.go` around lines 23 - 165, The
reconciliation integration suite only exercises the clusters half of
reconciliationQuery, so add a nodepool-focused case in
TestReconciliationCollector_Integration to cover the node_pools UNION branch.
Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory)
with metrics.NewReconciliationCollector and collectReconciliationMetrics to
assert pending/stuck behavior and labels for resourceTypeNodePool, including an
is_delete check. Keep the new assertions alongside the existing cluster subtests
so regressions in the second SELECT are caught.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/config/metrics.go`:
- Line 17: The stuck-threshold config rename is dropping support for the
existing deletion_stuck_threshold key/env/flag, which will break upgrades for
users with persisted overrides. Update the relevant config wiring in metrics.go,
loader.go, and flags.go so ReconciliationStuckThreshold still accepts the legacy
name as a deprecated alias for one release, or explicitly fail fast if the old
name is still present. Make sure the old env/flag binding and mapstructure/json
compatibility are preserved alongside the new metrics key.

In `@pkg/metrics/reconciliation.go`:
- Around line 163-187: The reconciliation collector in the metrics path is only
logging query/scan/iteration failures, which allows Prometheus to scrape partial
or empty results from the collector. Update the `reconciliation` collector logic
around `QueryContext`, `rows.Scan`, and `rows.Err()` to return
`prometheus.NewInvalidMetric(...)` on any DB or iteration failure so the scrape
fails instead of silently omitting series. Keep the existing `c.pendingDesc`,
`c.stuckDesc`, and `c.durationDesc` emission path for successful rows, but treat
any failure as a collector error rather than continuing with incomplete data.

---

Outside diff comments:
In `@pkg/services/cluster.go`:
- Around line 241-260: Move the reconciliation-start metric in cluster
reconciliation so `metrics.RecordReconciliationStarted` is only called after
`s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new
status. Update the `cluster` status update flow around `reconciled`,
`prevReconciledStatus`, and `SaveStatusConditions` so the counter increments
only on a committed `Reconciled=False` transition and does not fire on failed
retries.

In `@pkg/services/status_helpers.go`:
- Around line 30-45: Move the reconciliation_started_total emission out of the
status helper logic in computeNodePoolConditionsJSON/related aggregation flow:
the helper should only determine the new status and return a started flag when
reconciled becomes False from a non-False previous state. Then update the call
sites, especially the path that persists via nodePoolDao.Save, to record metrics
only after a successful save so recomputeNodePoolConditions and failed
persistence do not increment the counter prematurely.

---

Nitpick comments:
In `@test/integration/reconciliation_metrics_test.go`:
- Around line 23-165: The reconciliation integration suite only exercises the
clusters half of reconciliationQuery, so add a nodepool-focused case in
TestReconciliationCollector_Integration to cover the node_pools UNION branch.
Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory)
with metrics.NewReconciliationCollector and collectReconciliationMetrics to
assert pending/stuck behavior and labels for resourceTypeNodePool, including an
is_delete check. Keep the new assertions alongside the existing cluster subtests
so regressions in the second SELECT are caught.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 047cad3f-7742-4d6c-9220-0664aa89b23d

📥 Commits

Reviewing files that changed from the base of the PR and between 4fe21ec and 5f4905e.

📒 Files selected for processing (18)
  • charts/README.md
  • charts/templates/prometheusrule.yaml
  • charts/values.schema.json
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/metrics/reconciliation.go
  • pkg/metrics/reconciliation_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • pkg/services/status_helpers.go
  • test/integration/deletion_metrics_test.go
  • test/integration/reconciliation_metrics_test.go
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
💤 Files with no reviewable changes (4)
  • pkg/services/node_pool.go
  • pkg/metrics/deletion.go
  • test/integration/deletion_metrics_test.go
  • pkg/metrics/deletion_test.go

Comment thread pkg/config/metrics.go
Port int `mapstructure:"port" json:"port" validate:"required,min=1,max=65535"`
LabelMetricsInclusionDuration time.Duration `mapstructure:"label_metrics_inclusion_duration" json:"label_metrics_inclusion_duration" validate:"required"` //nolint:lll
DeletionStuckThreshold time.Duration `mapstructure:"deletion_stuck_threshold" json:"deletion_stuck_threshold" validate:"required"` //nolint:lll
ReconciliationStuckThreshold time.Duration `mapstructure:"reconciliation_stuck_threshold" json:"reconciliation_stuck_threshold" validate:"required"` //nolint:lll

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Keep the old stuck-threshold key as a deprecated alias for one release.

Line 17 changes the persisted config key, and pkg/config/loader.go/pkg/config/flags.go drop the old env/flag bindings at the same time. Existing metrics.deletion_stuck_threshold overrides will be ignored on upgrade and silently fall back to 10m, which changes stuck classification and alert timing for running deployments. Keep the old key/env/flag as a deprecated alias, or fail fast when the legacy name is still set. As per path instructions, **/config/**: Configuration changes affect all deployments. Review for: - Backward compatibility of config changes.

Also applies to: 30-37

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/config/metrics.go` at line 17, The stuck-threshold config rename is
dropping support for the existing deletion_stuck_threshold key/env/flag, which
will break upgrades for users with persisted overrides. Update the relevant
config wiring in metrics.go, loader.go, and flags.go so
ReconciliationStuckThreshold still accepts the legacy name as a deprecated alias
for one release, or explicitly fail fast if the old name is still present. Make
sure the old env/flag binding and mapstructure/json compatibility are preserved
alongside the new metrics key.

Source: Path instructions

Comment on lines +163 to +187
rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
if err != nil {
logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
return
}
defer rows.Close()

for rows.Next() {
var resourceType, isDelete string
var pending, stuck int64
var maxDuration float64

if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
continue
}

labels := []string{resourceType, isDelete}
ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)
ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)
ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)
}

if err := rows.Err(); err != nil {
logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't hide collector failures behind empty scrapes.

At Line 163 and Line 186, DB/query failures are only logged. Prometheus will still get a scrape with missing or partial series, so the new reconciliation alerts go blind during DB outages or SQL regressions (CWE-391). Emit prometheus.NewInvalidMetric(...) and fail the scrape instead of returning partial/empty data.

Suggested fix
 	rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
 	if err != nil {
 		logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
+		ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+		ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+		ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
 		return
 	}
 	defer rows.Close()
@@
 		if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
 			logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
-			continue
+			ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+			ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+			ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
+			return
 		}
@@
 	if err := rows.Err(); err != nil {
 		logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")
+		ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+		ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+		ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
if err != nil {
logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
return
}
defer rows.Close()
for rows.Next() {
var resourceType, isDelete string
var pending, stuck int64
var maxDuration float64
if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
continue
}
labels := []string{resourceType, isDelete}
ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)
ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)
ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)
}
if err := rows.Err(); err != nil {
logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")
rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
if err != nil {
logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
return
}
defer rows.Close()
for rows.Next() {
var resourceType, isDelete string
var pending, stuck int64
var maxDuration float64
if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
return
}
labels := []string{resourceType, isDelete}
ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)
ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)
ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)
}
if err := rows.Err(); err != nil {
logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")
ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/metrics/reconciliation.go` around lines 163 - 187, The reconciliation
collector in the metrics path is only logging query/scan/iteration failures,
which allows Prometheus to scrape partial or empty results from the collector.
Update the `reconciliation` collector logic around `QueryContext`, `rows.Scan`,
and `rows.Err()` to return `prometheus.NewInvalidMetric(...)` on any DB or
iteration failure so the scrape fails instead of silently omitting series. Keep
the existing `c.pendingDesc`, `c.stuckDesc`, and `c.durationDesc` emission path
for successful rows, but treat any failure as a collector error rather than
continuing with incomplete data.

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

@kuudori: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/validate-commits 5f4905e link true /test validate-commits

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant