HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264
HYPERFLEET-1205: Add metrics for tracking active and stuck reconciliations#264kuudori wants to merge 1 commit into
Conversation
Replace deletion-specific metrics with four reconciliation metrics: - reconciliation_started_total counter (Reconciled→False transitions) - resource_pending_reconciliation gauge (Reconciled=False count) - resource_pending_reconciliation_stuck gauge (beyond threshold) - resource_pending_reconciliation_stuck_duration_seconds gauge (max) All metrics carry resource_type and is_delete labels. Collector uses a single CTE query parsing JSONB once per row via jsonb_path_query_first, leveraging existing BTREE expression indexes. Counter instrumented in recomputeAndSaveClusterStatus and computeNodePoolConditionsJSON on Reconciled→False transition. Configurable reconciliation_stuck_threshold (default 10m). Remove deletion-specific metrics (resource_pending_deletion_total, resource_pending_deletion_duration_seconds, resource_pending_deletion_stuck) and update Helm chart values, schema, and PrometheusRule alerts to reference new metric names.
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbit
Walkthrough
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Security notes (no praise, direct) CWE-89 — SQL injection surface in CWE-400 — Unbounded metric cardinality via CWE-362 — TOCTOU in Supply chain note — deleted integration test 🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
Risk Score: 5 —
|
| Signal | Detail | Points |
|---|---|---|
| PR size | 1249 lines (>500) | +2 |
| Sensitive paths | cmd/ | +2 |
| Test coverage | Missing tests for: cmd/hyperfleet-api/servecmd cmd/hyperfleet-api/server pkg/config pkg/services | +1 |
Computed by hyperfleet-risk-scorer
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/services/status_helpers.go (1)
30-45: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftKeep the counter emission out of
computeNodePoolConditionsJSON.This helper is also used by
recomputeNodePoolConditions, which only mutates in-memory objects. That meansreconciliation_started_totalcan advance without any persistedReconciled=Falsetransition, and it can also overcount when the laternodePoolDao.Savefails.Return a
startedflag from this helper and emit the metric only in the call sites that successfully persist the new status.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/services/status_helpers.go` around lines 30 - 45, Move the reconciliation_started_total emission out of the status helper logic in computeNodePoolConditionsJSON/related aggregation flow: the helper should only determine the new status and return a started flag when reconciled becomes False from a non-False previous state. Then update the call sites, especially the path that persists via nodePoolDao.Save, to record metrics only after a successful save so recomputeNodePoolConditions and failed persistence do not increment the counter prematurely.pkg/services/cluster.go (1)
241-260: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winRecord
reconciliation_started_totalonly afterSaveStatusConditionssucceeds.This increments the counter before the new
Reconciled=Falsestate is persisted. IfSaveStatusConditionsfails, the next retry still sees the old stored status and increments again, so the transition counter drifts upward without a committed state change.Suggested fix
- if reconciled.Status == api.ConditionFalse && - (prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) { - metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil) - } - allConditions := make([]api.ResourceCondition, 0, fixedConditionCount+len(adapterConditions)) allConditions = append(allConditions, reconciled, lastKnownReconciled) allConditions = append(allConditions, adapterConditions...) @@ if err := s.clusterDao.SaveStatusConditions(ctx, cluster.ID, conditionsJSON); err != nil { return nil, handleUpdateError(api.ResourceTypeCluster, err) } cluster.StatusConditions = conditionsJSON + if reconciled.Status == api.ConditionFalse && + (prevReconciledStatus == nil || *prevReconciledStatus != api.ConditionFalse) { + metrics.RecordReconciliationStarted("cluster", cluster.DeletedTime != nil) + } return cluster, nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/services/cluster.go` around lines 241 - 260, Move the reconciliation-start metric in cluster reconciliation so `metrics.RecordReconciliationStarted` is only called after `s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new status. Update the `cluster` status update flow around `reconciled`, `prevReconciledStatus`, and `SaveStatusConditions` so the counter increments only on a committed `Reconciled=False` transition and does not fire on failed retries.
🧹 Nitpick comments (1)
test/integration/reconciliation_metrics_test.go (1)
23-165: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a nodepool integration case for the second half of the UNION.
reconciliationQueryaggregates bothclustersandnode_pools, but this suite only inserts clusters. A nodepool-only regression in the secondSELECTcan ship unnoticed while every test here stays green. Add at least one nodepool pending/stuck assertion usingfactories.NewNodePoolWithStatusAtTime(...)(or the equivalent factory). As per path instructions,**/*_test.go:New exported functions and critical logic paths SHOULD have tests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/integration/reconciliation_metrics_test.go` around lines 23 - 165, The reconciliation integration suite only exercises the clusters half of reconciliationQuery, so add a nodepool-focused case in TestReconciliationCollector_Integration to cover the node_pools UNION branch. Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory) with metrics.NewReconciliationCollector and collectReconciliationMetrics to assert pending/stuck behavior and labels for resourceTypeNodePool, including an is_delete check. Keep the new assertions alongside the existing cluster subtests so regressions in the second SELECT are caught.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/config/metrics.go`:
- Line 17: The stuck-threshold config rename is dropping support for the
existing deletion_stuck_threshold key/env/flag, which will break upgrades for
users with persisted overrides. Update the relevant config wiring in metrics.go,
loader.go, and flags.go so ReconciliationStuckThreshold still accepts the legacy
name as a deprecated alias for one release, or explicitly fail fast if the old
name is still present. Make sure the old env/flag binding and mapstructure/json
compatibility are preserved alongside the new metrics key.
In `@pkg/metrics/reconciliation.go`:
- Around line 163-187: The reconciliation collector in the metrics path is only
logging query/scan/iteration failures, which allows Prometheus to scrape partial
or empty results from the collector. Update the `reconciliation` collector logic
around `QueryContext`, `rows.Scan`, and `rows.Err()` to return
`prometheus.NewInvalidMetric(...)` on any DB or iteration failure so the scrape
fails instead of silently omitting series. Keep the existing `c.pendingDesc`,
`c.stuckDesc`, and `c.durationDesc` emission path for successful rows, but treat
any failure as a collector error rather than continuing with incomplete data.
---
Outside diff comments:
In `@pkg/services/cluster.go`:
- Around line 241-260: Move the reconciliation-start metric in cluster
reconciliation so `metrics.RecordReconciliationStarted` is only called after
`s.clusterDao.SaveStatusConditions` succeeds, not before persisting the new
status. Update the `cluster` status update flow around `reconciled`,
`prevReconciledStatus`, and `SaveStatusConditions` so the counter increments
only on a committed `Reconciled=False` transition and does not fire on failed
retries.
In `@pkg/services/status_helpers.go`:
- Around line 30-45: Move the reconciliation_started_total emission out of the
status helper logic in computeNodePoolConditionsJSON/related aggregation flow:
the helper should only determine the new status and return a started flag when
reconciled becomes False from a non-False previous state. Then update the call
sites, especially the path that persists via nodePoolDao.Save, to record metrics
only after a successful save so recomputeNodePoolConditions and failed
persistence do not increment the counter prematurely.
---
Nitpick comments:
In `@test/integration/reconciliation_metrics_test.go`:
- Around line 23-165: The reconciliation integration suite only exercises the
clusters half of reconciliationQuery, so add a nodepool-focused case in
TestReconciliationCollector_Integration to cover the node_pools UNION branch.
Use factories.NewNodePoolWithStatusAtTime (or the equivalent nodepool factory)
with metrics.NewReconciliationCollector and collectReconciliationMetrics to
assert pending/stuck behavior and labels for resourceTypeNodePool, including an
is_delete check. Keep the new assertions alongside the existing cluster subtests
so regressions in the second SELECT are caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 047cad3f-7742-4d6c-9220-0664aa89b23d
📒 Files selected for processing (18)
charts/README.mdcharts/templates/prometheusrule.yamlcharts/values.schema.jsoncharts/values.yamlcmd/hyperfleet-api/servecmd/cmd.gocmd/hyperfleet-api/server/metrics_middleware.gopkg/config/flags.gopkg/config/loader.gopkg/config/metrics.gopkg/metrics/deletion.gopkg/metrics/deletion_test.gopkg/metrics/reconciliation.gopkg/metrics/reconciliation_test.gopkg/services/cluster.gopkg/services/node_pool.gopkg/services/status_helpers.gotest/integration/deletion_metrics_test.gotest/integration/reconciliation_metrics_test.go
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
💤 Files with no reviewable changes (4)
- pkg/services/node_pool.go
- pkg/metrics/deletion.go
- test/integration/deletion_metrics_test.go
- pkg/metrics/deletion_test.go
| Port int `mapstructure:"port" json:"port" validate:"required,min=1,max=65535"` | ||
| LabelMetricsInclusionDuration time.Duration `mapstructure:"label_metrics_inclusion_duration" json:"label_metrics_inclusion_duration" validate:"required"` //nolint:lll | ||
| DeletionStuckThreshold time.Duration `mapstructure:"deletion_stuck_threshold" json:"deletion_stuck_threshold" validate:"required"` //nolint:lll | ||
| ReconciliationStuckThreshold time.Duration `mapstructure:"reconciliation_stuck_threshold" json:"reconciliation_stuck_threshold" validate:"required"` //nolint:lll |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Keep the old stuck-threshold key as a deprecated alias for one release.
Line 17 changes the persisted config key, and pkg/config/loader.go/pkg/config/flags.go drop the old env/flag bindings at the same time. Existing metrics.deletion_stuck_threshold overrides will be ignored on upgrade and silently fall back to 10m, which changes stuck classification and alert timing for running deployments. Keep the old key/env/flag as a deprecated alias, or fail fast when the legacy name is still set. As per path instructions, **/config/**: Configuration changes affect all deployments. Review for: - Backward compatibility of config changes.
Also applies to: 30-37
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/config/metrics.go` at line 17, The stuck-threshold config rename is
dropping support for the existing deletion_stuck_threshold key/env/flag, which
will break upgrades for users with persisted overrides. Update the relevant
config wiring in metrics.go, loader.go, and flags.go so
ReconciliationStuckThreshold still accepts the legacy name as a deprecated alias
for one release, or explicitly fail fast if the old name is still present. Make
sure the old env/flag binding and mapstructure/json compatibility are preserved
alongside the new metrics key.
Source: Path instructions
| rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL | ||
| if err != nil { | ||
| logger.WithError(ctx, err).Error("Failed to query reconciliation metrics") | ||
| return | ||
| } | ||
| defer rows.Close() | ||
|
|
||
| for rows.Next() { | ||
| var resourceType, isDelete string | ||
| var pending, stuck int64 | ||
| var maxDuration float64 | ||
|
|
||
| if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil { | ||
| logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row") | ||
| continue | ||
| } | ||
|
|
||
| labels := []string{resourceType, isDelete} | ||
| ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...) | ||
| ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...) | ||
| ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...) | ||
| } | ||
|
|
||
| if err := rows.Err(); err != nil { | ||
| logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don't hide collector failures behind empty scrapes.
At Line 163 and Line 186, DB/query failures are only logged. Prometheus will still get a scrape with missing or partial series, so the new reconciliation alerts go blind during DB outages or SQL regressions (CWE-391). Emit prometheus.NewInvalidMetric(...) and fail the scrape instead of returning partial/empty data.
Suggested fix
rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL
if err != nil {
logger.WithError(ctx, err).Error("Failed to query reconciliation metrics")
+ ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
return
}
defer rows.Close()
@@
if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil {
logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row")
- continue
+ ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
+ return
}
@@
if err := rows.Err(); err != nil {
logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows")
+ ch <- prometheus.NewInvalidMetric(c.pendingDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.stuckDesc, err)
+ ch <- prometheus.NewInvalidMetric(c.durationDesc, err)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL | |
| if err != nil { | |
| logger.WithError(ctx, err).Error("Failed to query reconciliation metrics") | |
| return | |
| } | |
| defer rows.Close() | |
| for rows.Next() { | |
| var resourceType, isDelete string | |
| var pending, stuck int64 | |
| var maxDuration float64 | |
| if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil { | |
| logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row") | |
| continue | |
| } | |
| labels := []string{resourceType, isDelete} | |
| ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...) | |
| ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...) | |
| ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...) | |
| } | |
| if err := rows.Err(); err != nil { | |
| logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows") | |
| rows, err := c.db.QueryContext(ctx, reconciliationQuery, threshold) //nolint:gosec // compile-time SQL | |
| if err != nil { | |
| logger.WithError(ctx, err).Error("Failed to query reconciliation metrics") | |
| ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.durationDesc, err) | |
| return | |
| } | |
| defer rows.Close() | |
| for rows.Next() { | |
| var resourceType, isDelete string | |
| var pending, stuck int64 | |
| var maxDuration float64 | |
| if err := rows.Scan(&resourceType, &isDelete, &pending, &stuck, &maxDuration); err != nil { | |
| logger.WithError(ctx, err).Error("Failed to scan reconciliation metric row") | |
| ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.durationDesc, err) | |
| return | |
| } | |
| labels := []string{resourceType, isDelete} | |
| ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, float64(pending), labels...) | |
| ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(stuck), labels...) | |
| ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, maxDuration, labels...) | |
| } | |
| if err := rows.Err(); err != nil { | |
| logger.WithError(ctx, err).Error("Error iterating reconciliation metric rows") | |
| ch <- prometheus.NewInvalidMetric(c.pendingDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.stuckDesc, err) | |
| ch <- prometheus.NewInvalidMetric(c.durationDesc, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/metrics/reconciliation.go` around lines 163 - 187, The reconciliation
collector in the metrics path is only logging query/scan/iteration failures,
which allows Prometheus to scrape partial or empty results from the collector.
Update the `reconciliation` collector logic around `QueryContext`, `rows.Scan`,
and `rows.Err()` to return `prometheus.NewInvalidMetric(...)` on any DB or
iteration failure so the scrape fails instead of silently omitting series. Keep
the existing `c.pendingDesc`, `c.stuckDesc`, and `c.durationDesc` emission path
for successful rows, but treat any failure as a collector error rather than
continuing with incomplete data.
|
@kuudori: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
hyperfleet_api_reconciliation_started_total— counter recording Reconciled→False transitionshyperfleet_api_resource_pending_reconciliation— gauge of resources with Reconciled=Falsehyperfleet_api_resource_pending_reconciliation_stuck— gauge of resources stuck beyond thresholdhyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds— max stuck durationresource_typeandis_deletelabels to distinguish normal from deletion lifecycleresource_pending_deletion_total,resource_pending_deletion_duration_seconds,resource_pending_deletion_stuck)reconciliation_stuck_threshold(default 10m) via flag/env/configAC Coverage
reconciliation_started_totalcounterTestReconciliationStartedTotalIsRegistered,TestRecordReconciliationStarted_*resource_pending_reconciliationgaugeTestReconciliationCollector_Integration/pendingresource_pending_reconciliation_stuckgaugeTestReconciliationCollector_Integration/stuckstuck_duration_secondsgaugeTestReconciliationCollector_Integration/max_stuck_durationDesign Decisions
jsonb_path_query_first, leveraging existing BTREE expression indexes — 1 round-trip instead of 3component/versionas const labels (not variable) — they never vary, matches Prometheus best practiceBreaking Changes
hyperfleet_api_resource_pending_deletion_stuckmust update tohyperfleet_api_resource_pending_reconciliation_stuckdeletion_stuck_thresholdrenamed toreconciliation_stuck_threshold(default changed from 30m to 10m)config.metrics.deletion_stuck_thresholdrenamed toconfig.metrics.reconciliation_stuck_thresholdTest plan
make verify-all— 1268 tests)make test-helm)make test-integration— requires Docker/testcontainers)Closes: HYPERFLEET-1205