HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations by kuudori · Pull Request #264 · openshift-hyperfleet/hyperfleet-api

kuudori · 2026-06-29T22:39:35Z

Summary

Add unified reconciliation metrics replacing deletion-specific metrics
- hyperfleet_api_reconciliation_started_total — counter recording Reconciled→False transitions
- hyperfleet_api_resource_pending_reconciliation — gauge of resources with Reconciled=False
- hyperfleet_api_resource_pending_reconciliation_stuck — gauge of resources stuck beyond threshold
- hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds — max stuck duration
All metrics carry resource_type and is_delete labels to distinguish normal from deletion lifecycle
Remove old deletion-specific metrics (resource_pending_deletion_total, resource_pending_deletion_duration_seconds, resource_pending_deletion_stuck)
Update Helm chart values, schema, and PrometheusRule alerts to reference new metric names
Configurable reconciliation_stuck_threshold (default 10m) via flag/env/config

AC Coverage

#	Acceptance Criteria	Test
1	`reconciliation_started_total` counter	`TestReconciliationStartedTotalIsRegistered`, `TestRecordReconciliationStarted_*`
2	`resource_pending_reconciliation` gauge	`TestReconciliationCollector_Integration/pending`
3	`resource_pending_reconciliation_stuck` gauge	`TestReconciliationCollector_Integration/stuck`
4	`stuck_duration_seconds` gauge	`TestReconciliationCollector_Integration/max_stuck_duration`
5	Architecture docs	Deferred — HYPERFLEET-1301 (different repo)
6	Old deletion metrics removed	All references removed, files deleted

Design Decisions

Counter trigger: fires at Reconciled→False transition (not at soft-delete time), capturing both deletion and create/update reconciliations
Collector SQL: single CTE query parses JSONB once per row via jsonb_path_query_first, leveraging existing BTREE expression indexes — 1 round-trip instead of 3
Labels: component/version as const labels (not variable) — they never vary, matches Prometheus best practice

Breaking Changes

Dashboard/alert queries referencing hyperfleet_api_resource_pending_deletion_stuck must update to hyperfleet_api_resource_pending_reconciliation_stuck
Config key deletion_stuck_threshold renamed to reconciliation_stuck_threshold (default changed from 30m to 10m)
Helm value config.metrics.deletion_stuck_threshold renamed to config.metrics.reconciliation_stuck_threshold

Test plan

Unit tests pass (make verify-all — 1268 tests)
Helm chart tests pass (make test-helm)
Integration tests pass (make test-integration — requires Docker/testcontainers)
E2E: gated in CI, not run locally

Closes: HYPERFLEET-1205

Replace deletion-specific metrics with four reconciliation metrics: - reconciliation_started_total counter (Reconciled→False transitions) - resource_pending_reconciliation gauge (Reconciled=False count) - resource_pending_reconciliation_stuck gauge (beyond threshold) - resource_pending_reconciliation_stuck_duration_seconds gauge (max) All metrics carry resource_type and is_delete labels. Collector uses a single CTE query parsing JSONB once per row via jsonb_path_query_first, leveraging existing BTREE expression indexes. Counter instrumented in recomputeAndSaveClusterStatus and computeNodePoolConditionsJSON on Reconciled→False transition. Configurable reconciliation_stuck_threshold (default 10m). Remove deletion-specific metrics (resource_pending_deletion_total, resource_pending_deletion_duration_seconds, resource_pending_deletion_stuck) and update Helm chart values, schema, and PrometheusRule alerts to reference new metric names.

openshift-ci · 2026-06-29T22:39:39Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-06-29T22:39:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mbrudnoy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-29T22:39:43Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added reconciliation-focused metrics and alerts for resources that remain in progress too long.
- Introduced a new configuration option and CLI flag for the reconciliation stuck threshold.
- Updated monitoring output to distinguish normal, stuck, and duration-based reconciliation states.
Bug Fixes
- Improved status tracking so repeated failure signals are not emitted unnecessarily.
- Removed outdated deletion-based metric behavior in favor of reconciliation-based reporting.

Walkthrough

pkg/metrics/deletion.go and its test file are deleted. A replacement pkg/metrics/reconciliation.go adds a reconciliation_started_total counter and a ReconciliationCollector that runs a single CTE SQL query to emit pending, stuck, and max-stuck-duration gauges labeled by resource_type and is_delete. MetricsConfig replaces DeletionStuckThreshold with ReconciliationStuckThreshold; CLI flags, env bindings, and loader keys are updated to match. Service call sites in cluster.go and status_helpers.go emit RecordReconciliationStarted only on a reconciled=false transition, gated by prior condition state. Helm chart alert rules, values, and schema are updated from deletion to reconciliation terminology.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Security notes (no praise, direct)

CWE-89 — SQL injection surface in reconciliationQuery
The CTE in reconciliation.go passes $1 (a computed time.Time) as a parameterized argument, which is correct. Verify that no string interpolation is used anywhere in that constant and that the *sql.DB context timeout is enforced before the query runs, not after. A missing context cancellation on slow scrapes leaves a goroutine leak attack surface under high cardinality load.

CWE-400 — Unbounded metric cardinality via resource_type label
RecordReconciliationStarted(resourceType string, isDelete bool) accepts a caller-supplied string directly as a Prometheus label value. If resourceType is ever derived from an external or user-controlled input path, this creates unbounded metric cardinality (memory exhaustion). Confirm the call sites ("cluster", "nodepool") are always static literals.

CWE-362 — TOCTOU in prevReconciledStatus comparison
recomputeAndSaveClusterStatus reads prevReconciledStatus from the in-memory cluster object, then writes new status and conditionally emits the metric. If two goroutines race on the same cluster, the "not already false" guard can be passed by both, emitting duplicate metrics. Verify this path is serialized per-cluster.

Supply chain note — deleted integration test
test/integration/deletion_metrics_test.go is removed entirely without a phased deprecation. Any CI job that relied on this test for coverage is now silently uncovered until reconciliation_metrics_test.go is confirmed to run in the same integration suite environment.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly names the main change: adding metrics for active and stuck reconciliations.
Description check	✅ Passed	The description matches the diff and explains the metric rename, removals, config updates, and tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output	✅ Passed	PASS: No non-test log statement logs token/password/secret/credential values; only redacted config output. No CWE-532/CWE-200 leak found.
No Hardcoded Secrets	✅ Passed	No hardcoded secrets, embedded credentials, or long base64 blobs found in changed files; only placeholders and metric/config literals. No CWE-798/CWE-259 issue.
No Weak Cryptography	✅ Passed	No CWE-327/CWE-208 issues: touched files and repo-wide Go search show no md5/des/rc4/SHA1-for-security, ECB, custom crypto, or secret compares.
No Injection Vectors	✅ Passed	No CWE-89/78/79/502 sink in changed prod code; SQL uses $1 parameters, and no exec/template/yaml unmarshal additions found.
No Privileged Containers	✅ Passed	No changed Helm/Dockerfile workload sets privileged flags; the only Helm template is a PrometheusRule, not a pod spec (CWE-250).
No Pii Or Sensitive Data In Logs	✅ Passed	No CWE-532 issue found: changed logs are generic errors/info and do not print PII, session data, raw bodies, or credentialed hostnames.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

_{Comment @coderabbitai help to get the list of available commands.}

hyperfleet-ci-bot · 2026-06-29T22:51:57Z

Risk Score: 5 — `risk/high`

Signal	Detail	Points
PR size	1249 lines (>500)	+2
Sensitive paths	cmd/	+2
Test coverage	Missing tests for: cmd/hyperfleet-api/servecmd cmd/hyperfleet-api/server pkg/config pkg/services	+1

_{Computed by hyperfleet-risk-scorer}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

pkg/services/status_helpers.go (1)
30-45: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Keep the counter emission out of computeNodePoolConditionsJSON.

This helper is also used by recomputeNodePoolConditions, which only mutates in-memory objects. That means reconciliation_started_total can advance without any persisted Reconciled=False transition, and it can also overcount when the later nodePoolDao.Save fails.

Return a started flag from this helper and emit the metric only in the call sites that successfully persist the new status.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/services/status_helpers.go` around lines 30 - 45, Move the
reconciliation_started_total emission out of the status helper logic in
computeNodePoolConditionsJSON/related aggregation flow: the helper should only
determine the new status and return a started flag when reconciled becomes False
from a non-False previous state. Then update the call sites, especially the path
that persists via nodePoolDao.Save, to record metrics only after a successful
save so recomputeNodePoolConditions and failed persistence do not increment the
counter prematurely.
pkg/services/cluster.go (1)
241-260: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Record reconciliation_started_total only after SaveStatusConditions succeeds.

This increments the counter before the new Reconciled=False state is persisted. If SaveStatusConditions fails, the next retry still sees the old stored status and increments again, so the transition counter drifts upward without a committed state change.
Suggested fix
-	if reconciled.Status == api.ConditionFalse &&
-		(prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) {
-		metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil)
-	}
-
 	allConditions := make([]api.ResourceCondition, 0, fixedConditionCount+len(adapterConditions))
 	allConditions = append(allConditions, reconciled, lastKnownReconciled)
 	allConditions = append(allConditions, adapterConditions...)
@@
 	if err := s.clusterDao.SaveStatusConditions(ctx, cluster.ID, conditionsJSON); err != nil {
 		return nil, handleUpdateError(api.ResourceTypeCluster, err)
 	}
 
 	cluster.StatusConditions = conditionsJSON
+	if reconciled.Status == api.ConditionFalse &&
+		(prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) {
+		metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil)
+	}
 	return cluster, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/services/cluster.go` around lines 241 - 260, Move the
reconciliation-start metric in cluster reconciliation so
`metrics.RecordReconciliationStarted` is only called after
`s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new
status. Update the `cluster` status update flow around `reconciled`,
`prevReconciledStatus`, and `SaveStatusConditions` so the counter increments
only on a committed `Reconciled=False` transition and does not fire on failed
retries.

🧹 Nitpick comments (1)

test/integration/reconciliation_metrics_test.go (1)
23-165: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a nodepool integration case for the second half of the UNION.

reconciliationQuery aggregates both clusters and node_pools, but this suite only inserts clusters. A nodepool-only regression in the second SELECT can ship unnoticed while every test here stays green. Add at least one nodepool pending/stuck assertion using factories.NewNodePoolWithStatusAtTime(...) (or the equivalent factory). As per path instructions, **/*_test.go: New exported functions and critical logic paths SHOULD have tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/integration/reconciliation_metrics_test.go` around lines 23 - 165, The
reconciliation integration suite only exercises the clusters half of
reconciliationQuery, so add a nodepool-focused case in
TestReconciliationCollector_Integration to cover the node_pools UNION branch.
Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory)
with metrics.NewReconciliationCollector and collectReconciliationMetrics to
assert pending/stuck behavior and labels for resourceTypeNodePool, including an
is_delete check. Keep the new assertions alongside the existing cluster subtests
so regressions in the second SELECT are caught.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/config/metrics.go`:
- Line 17: The stuck-threshold config rename is dropping support for the
existing deletion_stuck_threshold key/env/flag, which will break upgrades for
users with persisted overrides. Update the relevant config wiring in metrics.go,
loader.go, and flags.go so ReconciliationStuckThreshold still accepts the legacy
name as a deprecated alias for one release, or explicitly fail fast if the old
name is still present. Make sure the old env/flag binding and mapstructure/json
compatibility are preserved alongside the new metrics key.

In `@pkg/metrics/reconciliation.go`:
- Around line 163-187: The reconciliation collector in the metrics path is only
logging query/scan/iteration failures, which allows Prometheus to scrape partial
or empty results from the collector. Update the `reconciliation` collector logic
around `QueryContext`, `rows.Scan`, and `rows.Err()` to return
`prometheus.NewInvalidMetric(...)` on any DB or iteration failure so the scrape
fails instead of silently omitting series. Keep the existing `c.pendingDesc`,
`c.stuckDesc`, and `c.durationDesc` emission path for successful rows, but treat
any failure as a collector error rather than continuing with incomplete data.

---

Outside diff comments:
In `@pkg/services/cluster.go`:
- Around line 241-260: Move the reconciliation-start metric in cluster
reconciliation so `metrics.RecordReconciliationStarted` is only called after
`s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new
status. Update the `cluster` status update flow around `reconciled`,
`prevReconciledStatus`, and `SaveStatusConditions` so the counter increments
only on a committed `Reconciled=False` transition and does not fire on failed
retries.

In `@pkg/services/status_helpers.go`:
- Around line 30-45: Move the reconciliation_started_total emission out of the
status helper logic in computeNodePoolConditionsJSON/related aggregation flow:
the helper should only determine the new status and return a started flag when
reconciled becomes False from a non-False previous state. Then update the call
sites, especially the path that persists via nodePoolDao.Save, to record metrics
only after a successful save so recomputeNodePoolConditions and failed
persistence do not increment the counter prematurely.

---

Nitpick comments:
In `@test/integration/reconciliation_metrics_test.go`:
- Around line 23-165: The reconciliation integration suite only exercises the
clusters half of reconciliationQuery, so add a nodepool-focused case in
TestReconciliationCollector_Integration to cover the node_pools UNION branch.
Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory)
with metrics.NewReconciliationCollector and collectReconciliationMetrics to
assert pending/stuck behavior and labels for resourceTypeNodePool, including an
is_delete check. Keep the new assertions alongside the existing cluster subtests
so regressions in the second SELECT are caught.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 047cad3f-7742-4d6c-9220-0664aa89b23d

📥 Commits

Reviewing files that changed from the base of the PR and between 4fe21ec and 5f4905e.

📒 Files selected for processing (18)

charts/README.md
charts/templates/prometheusrule.yaml
charts/values.schema.json
charts/values.yaml
cmd/hyperfleet-api/servecmd/cmd.go
cmd/hyperfleet-api/server/metrics_middleware.go
pkg/config/flags.go
pkg/config/loader.go
pkg/config/metrics.go
pkg/metrics/deletion.go
pkg/metrics/deletion_test.go
pkg/metrics/reconciliation.go
pkg/metrics/reconciliation_test.go
pkg/services/cluster.go
pkg/services/node_pool.go
pkg/services/status_helpers.go
test/integration/deletion_metrics_test.go
test/integration/reconciliation_metrics_test.go

🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

openshift-hyperfleet/architecture (manual)
openshift-hyperfleet/hyperfleet-api (manual)
openshift-hyperfleet/hyperfleet-sentinel (manual)
openshift-hyperfleet/hyperfleet-adapter (manual)
openshift-hyperfleet/hyperfleet-broker (manual)

💤 Files with no reviewable changes (4)

pkg/services/node_pool.go
pkg/metrics/deletion.go
test/integration/deletion_metrics_test.go
pkg/metrics/deletion_test.go

coderabbitai · 2026-06-29T22:57:25Z

 	Port                          int           `mapstructure:"port" json:"port" validate:"required,min=1,max=65535"`
 	LabelMetricsInclusionDuration time.Duration `mapstructure:"label_metrics_inclusion_duration" json:"label_metrics_inclusion_duration" validate:"required"` //nolint:lll
-	DeletionStuckThreshold        time.Duration `mapstructure:"deletion_stuck_threshold" json:"deletion_stuck_threshold" validate:"required"`                 //nolint:lll
+	ReconciliationStuckThreshold  time.Duration `mapstructure:"reconciliation_stuck_threshold" json:"reconciliation_stuck_threshold" validate:"required"`     //nolint:lll


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Keep the old stuck-threshold key as a deprecated alias for one release.

Line 17 changes the persisted config key, and pkg/config/loader.go/pkg/config/flags.go drop the old env/flag bindings at the same time. Existing metrics.deletion_stuck_threshold overrides will be ignored on upgrade and silently fall back to 10m, which changes stuck classification and alert timing for running deployments. Keep the old key/env/flag as a deprecated alias, or fail fast when the legacy name is still set. As per path instructions, **/config/**: Configuration changes affect all deployments. Review for: - Backward compatibility of config changes.

Also applies to: 30-37

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/config/metrics.go` at line 17, The stuck-threshold config rename is dropping support for the existing deletion_stuck_threshold key/env/flag, which will break upgrades for users with persisted overrides. Update the relevant config wiring in metrics.go, loader.go, and flags.go so ReconciliationStuckThreshold still accepts the legacy name as a deprecated alias for one release, or explicitly fail fast if the old name is still present. Make sure the old env/flag binding and mapstructure/json compatibility are preserved alongside the new metrics key.

Source: Path instructions

coderabbitai · 2026-06-29T22:57:25Z

+	rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
+	if err != nil {
+		logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
+		return
+	}
+	defer rows.Close()
+
+	for rows.Next() {
+		var resourceType, isDelete string
+		var pending, stuck int64
+		var maxDuration float64
+
+		if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
+			logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
+			continue
+		}
+
+		labels := []string{resourceType, isDelete}
+		ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)
+		ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)
+		ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)
+	}
+
+	if err := rows.Err(); err != nil {
+		logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't hide collector failures behind empty scrapes.

At Line 163 and Line 186, DB/query failures are only logged. Prometheus will still get a scrape with missing or partial series, so the new reconciliation alerts go blind during DB outages or SQL regressions (CWE-391). Emit prometheus.NewInvalidMetric(...) and fail the scrape instead of returning partial/empty data.

Suggested fix

rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL if err != nil { logger.WithError(ctx, err).Error("Failed to query reconciliation metrics") + ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) + ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) + ch <- prometheus.NewInvalidMetric(c.durationDesc, err) return } defer rows.Close() @@ if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil { logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row") - continue + ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) + ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) + ch <- prometheus.NewInvalidMetric(c.durationDesc, err) + return } @@ if err := rows.Err(); err != nil { logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows") + ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) + ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) + ch <- prometheus.NewInvalidMetric(c.durationDesc, err) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL

if err != nil {

logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")

return

}

defer rows.Close()

for rows.Next() {

var resourceType, isDelete string

var pending, stuck int64

var maxDuration float64

if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {

logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")

continue

}

labels := []string{resourceType, isDelete}

ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)

ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)

ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)

}

if err := rows.Err(); err != nil {

logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")

rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL

if err != nil {

logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")

ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)

ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)

ch <- prometheus.NewInvalidMetric(c.durationDesc, err)

return

}

defer rows.Close()

for rows.Next() {

var resourceType, isDelete string

var pending, stuck int64

var maxDuration float64

if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {

logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")

ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)

ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)

ch <- prometheus.NewInvalidMetric(c.durationDesc, err)

return

}

labels := []string{resourceType, isDelete}

ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...)

ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...)

ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...)

}

if err := rows.Err(); err != nil {

logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")

ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)

ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)

ch <- prometheus.NewInvalidMetric(c.durationDesc, err)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/metrics/reconciliation.go` around lines 163 - 187, The reconciliation collector in the metrics path is only logging query/scan/iteration failures, which allows Prometheus to scrape partial or empty results from the collector. Update the `reconciliation` collector logic around `QueryContext`, `rows.Scan`, and `rows.Err()` to return `prometheus.NewInvalidMetric(...)` on any DB or iteration failure so the scrape fails instead of silently omitting series. Keep the existing `c.pendingDesc`, `c.stuckDesc`, and `c.durationDesc` emission path for successful rows, but treat any failure as a collector error rather than continuing with incomplete data.

openshift-ci · 2026-06-29T23:06:15Z

@kuudori: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/validate-commits	`5f4905e`	link	true	`/test validate-commits`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 29, 2026

kuudori marked this pull request as ready for review June 29, 2026 22:48

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 29, 2026

openshift-ci Bot requested review from ciaranRoche and jsell-rh June 29, 2026 22:49

hyperfleet-ci-bot Bot added the risk/high label Jun 29, 2026

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264

HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264
kuudori wants to merge 1 commit into
openshift-hyperfleet:mainfrom
kuudori:HYPERFLEET-1205-reconciliation-metrics

kuudori commented Jun 29, 2026 •

edited by atlassian Bot

Loading

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

hyperfleet-ci-bot Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Uh oh!

coderabbitai Bot Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

kuudori commented Jun 29, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

AC Coverage

Design Decisions

Breaking Changes

Test plan

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

hyperfleet-ci-bot Bot commented Jun 29, 2026

Risk Score: 5 — risk/high

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kuudori commented Jun 29, 2026 •

edited by atlassian Bot

Loading

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Risk Score: 5 — `risk/high`