Metrics Documentation

This document describes all Prometheus metrics exposed by HyperFleet API, including their meanings, expected ranges, and example queries for common investigations.

Metrics Endpoint

Metrics are exposed at:

Endpoint: /metrics
Port: 9090 (default, configurable via --metrics-server-bindaddress)
Format: OpenMetrics/Prometheus text format
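To inspect the endpoint by hand, a small script can fetch the page and split each sample into name, labels, and value. The following is a minimal sketch, not part of HyperFleet: the host `localhost` and the helper names `fetch_metrics` and `parse_samples` are assumptions, and the parser handles only the common `name{labels} value` form.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\S+)?$'
)
# One label pair inside the braces; escaped quotes are kept as written.
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5):
    """Download the metrics page as text (host is an assumption, port is the documented default)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Calling `parse_samples(fetch_metrics())` against a running instance would return one tuple per exposed series.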

Application Metrics

Build Info

`hyperfleet_api_build_info`

Type: Gauge (always 1)

Description: Build information for the HyperFleet API component.

Labels:

Label	Description	Example Values
`component`	Component name	`api`
`version`	Application version (git sha)	`abc123`, `abc123-modified`
`commit`	Git commit SHA	`abc123`
`go_version`	Go runtime version	`go1.24.0`

Example output:

hyperfleet_api_build_info{component="api",version="abc123",commit="abc123",go_version="go1.24.0"} 1

API Request Metrics

These metrics track all inbound HTTP requests to the API server.

`hyperfleet_api_requests_total`

Type: Counter

Description: Total number of HTTP requests served by the API.

Labels:

Label	Description	Example Values
`component`	Component name	`api`
`version`	Application version	`abc123`
`method`	HTTP method	`GET`, `POST`, `PUT`, `PATCH`, `DELETE`
`path`	Request path (with IDs replaced by `-`)	`/api/hyperfleet/v1/clusters/-`
`code`	HTTP response status code	`200`, `201`, `400`, `404`, `500`

Path normalization: Object identifiers in paths are replaced with - to reduce cardinality. For example, /api/hyperfleet/v1/clusters/abc123 becomes /api/hyperfleet/v1/clusters/-.
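The effect of normalization can be illustrated with a short sketch. The rule used here is an assumption made for illustration (every second path segment after the `/api/hyperfleet/v1` prefix is treated as an identifier); the actual implementation may decide differently, and the nested `nodepools` path in the usage note is hypothetical.

```python
API_PREFIX = "/api/hyperfleet/v1"  # taken from the example paths above


def normalize_path(path):
    """Replace identifier segments with "-".

    Assumed rule: after the API prefix, segments alternate between a
    collection name and an object identifier.
    """
    if not path.startswith(API_PREFIX + "/"):
        return path  # leave paths outside the API prefix untouched
    segments = path[len(API_PREFIX) + 1:].strip("/").split("/")
    normalized = [
        "-" if index % 2 == 1 else segment
        for index, segment in enumerate(segments)
    ]
    return API_PREFIX + "/" + "/".join(normalized)
```

Under this rule, `/api/hyperfleet/v1/clusters/abc123` maps to `/api/hyperfleet/v1/clusters/-`, and a hypothetical nested path such as `/api/hyperfleet/v1/clusters/abc123/nodepools/xyz` maps to `/api/hyperfleet/v1/clusters/-/nodepools/-`, so the number of distinct `path` values stays bounded by the number of routes rather than the number of objects.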

Example output:

hyperfleet_api_requests_total{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523
hyperfleet_api_requests_total{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters/-"} 8742
hyperfleet_api_requests_total{component="api",version="abc123",code="201",method="POST",path="/api/hyperfleet/v1/clusters"} 156
hyperfleet_api_requests_total{component="api",version="abc123",code="404",method="GET",path="/api/hyperfleet/v1/clusters/-"} 23

`hyperfleet_api_request_duration_seconds`

Type: Histogram

Description: Distribution of request processing times in seconds.

Labels: Same as hyperfleet_api_requests_total

Buckets: 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s, 1s, 2.5s, 5s, 10s

Derived metrics:

hyperfleet_api_request_duration_seconds_sum - Total time spent processing requests
hyperfleet_api_request_duration_seconds_count - Number of requests measured
hyperfleet_api_request_duration_seconds_bucket - Cumulative number of requests that completed within each bucket's upper bound (the le label)

Example output:

hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.005"} 800
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.01"} 1000
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.025"} 1200
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.05"} 1350
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.1"} 1450
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.25"} 1490
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.5"} 1510
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="1"} 1520
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="2.5"} 1522
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="5"} 1523
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="10"} 1523
hyperfleet_api_request_duration_seconds_bucket{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="+Inf"} 1523
hyperfleet_api_request_duration_seconds_sum{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 45.23
hyperfleet_api_request_duration_seconds_count{component="api",version="abc123",code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523
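To make the relationship between the three series concrete, the sketch below computes the average duration and estimates quantiles from the example output above. It mirrors, in simplified form, the linear interpolation that `histogram_quantile` applies to classic histograms; the function names are illustrative, and real queries operate on `rate()` of the buckets rather than on raw cumulative counts.

```python
import math

# Cumulative bucket counts copied from the example output: (upper bound, count).
BUCKETS = [
    (0.005, 800), (0.01, 1000), (0.025, 1200), (0.05, 1350),
    (0.1, 1450), (0.25, 1490), (0.5, 1510), (1.0, 1520),
    (2.5, 1522), (5.0, 1523), (10.0, 1523), (math.inf, 1523),
]
DURATION_SUM = 45.23   # _sum series
DURATION_COUNT = 1523  # _count series


def average_seconds(total, count):
    """Mean request duration: _sum divided by _count."""
    return total / count if count else float("nan")


def estimate_quantile(q, buckets):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # falls in +Inf: report the last finite bound
            if count == lower_count:
                return upper_bound  # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

For the example data the average is roughly 30 ms, while the estimated P50, P90, and P99 are about 4.8 ms, 60 ms, and 0.47 s. The gap between the median and the average shows why quantiles are usually more informative than the mean for latency.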

Reconciliation Observability Metrics

These metrics track resources pending reconciliation in both the normal lifecycle (create/update) and the deletion lifecycle. The is_delete label distinguishes the two.

`hyperfleet_api_reconciliation_started_total`

Type: Counter

Description: Total number of resources that entered the unreconciled state (Reconciled condition transitioned to False). Incremented only after the transition is persisted to the database.

Labels:

Label	Description	Example Values
`resource_type`	Type of resource	`cluster`, `nodepool`
`is_delete`	Whether this is a deletion reconciliation	`true`, `false`
`component`	Component name (const)	`api`
`version`	Application version (const)	`abc123`

Example output:

hyperfleet_api_reconciliation_started_total{component="api",is_delete="false",resource_type="cluster",version="abc123"} 42
hyperfleet_api_reconciliation_started_total{component="api",is_delete="true",resource_type="nodepool",version="abc123"} 12

`hyperfleet_api_resource_pending_reconciliation`

Type: Gauge (Collector)

Description: Number of resources currently pending reconciliation (Reconciled=False). Computed on each Prometheus scrape via a SQL query against the database.

Labels: Same as hyperfleet_api_reconciliation_started_total

Behavior at zero: When no resources are pending for a given (resource_type, is_delete) combination, the series is absent rather than emitting 0. The > 0 alert expressions handle this correctly.

`hyperfleet_api_resource_pending_reconciliation_stuck`

Type: Gauge (Collector)

Description: Number of resources pending reconciliation beyond the stuck threshold (default 10 minutes). Identifies resources whose Reconciled condition has been False for longer than the configured threshold.

Labels: Same as hyperfleet_api_reconciliation_started_total

Configuration: The stuck threshold is configurable via --metrics-reconciliation-stuck-threshold (default 10m).

`hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds`

Type: Gauge (Collector)

Description: Maximum duration in seconds that any resource has been stuck pending reconciliation, per (resource_type, is_delete) combination.

Labels: Same as hyperfleet_api_reconciliation_started_total

Example output:

hyperfleet_api_resource_pending_reconciliation{component="api",is_delete="false",resource_type="cluster",version="abc123"} 5
hyperfleet_api_resource_pending_reconciliation_stuck{component="api",is_delete="false",resource_type="cluster",version="abc123"} 2
hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds{component="api",is_delete="false",resource_type="cluster",version="abc123"} 1847.3
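The three collector gauges are related: each is an aggregate over the same set of pending resources. The sketch below shows that relationship in memory. It is an illustration only, since the real collector uses a SQL query, and it assumes that the stuck duration is measured from the moment Reconciled became False. The input tuple layout and the function name are hypothetical.

```python
from collections import defaultdict

STUCK_THRESHOLD_SECONDS = 10 * 60  # documented default of 10m


def summarize_pending(pending, now, threshold=STUCK_THRESHOLD_SECONDS):
    """Aggregate pending resources into the three gauge values per label pair.

    `pending` is a hypothetical list of (resource_type, is_delete, since)
    tuples, where `since` is the Unix time at which Reconciled became False.
    """
    summary = defaultdict(
        lambda: {"pending": 0, "stuck": 0, "stuck_duration_seconds": 0.0}
    )
    for resource_type, is_delete, since in pending:
        entry = summary[(resource_type, is_delete)]
        entry["pending"] += 1
        age = now - since
        if age > threshold:  # pending for longer than the stuck threshold
            entry["stuck"] += 1
            entry["stuck_duration_seconds"] = max(
                entry["stuck_duration_seconds"], age
            )
    # Label pairs with no pending resources never get an entry, which
    # matches the documented "series is absent rather than 0" behavior.
    return dict(summary)
```

With five pending clusters of which two have waited longer than 10 minutes, the result for `("cluster", False)` matches the example output above: 5 pending, 2 stuck, and the age of the oldest stuck resource as the duration.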

Reconciliation Alerts

Two alerts are available via the PrometheusRule (requires monitoring.prometheusRule.enabled=true in Helm values):

Alert	Severity	Condition	Description
`HyperFleetResourceReconciliationStuckWarning`	Warning	`stuck > 0` for 5m	Resources stuck pending reconciliation beyond threshold + 5m
`HyperFleetResourceReconciliationStuckCritical`	Critical	`stuck_duration_seconds > 1800` for 5m	Elapsed stuck duration exceeds 30m timeout (survives Prometheus restarts)

Note: The critical alert uses the collector-reported stuck_duration_seconds gauge rather than the Prometheus `for` duration to measure elapsed time, so it fires immediately after a Prometheus restart if a resource has been stuck beyond the timeout. Both alerts cover deletion and normal (create/update) reconciliation. The is_delete label allows separate alerting rules if needed.

Go Runtime Metrics

The following metrics are automatically exposed by the Prometheus Go client library.

Process Metrics

Metric	Type	Description
`process_cpu_seconds_total`	Counter	Total user and system CPU time spent in seconds
`process_max_fds`	Gauge	Maximum number of open file descriptors
`process_open_fds`	Gauge	Number of open file descriptors
`process_resident_memory_bytes`	Gauge	Resident memory size in bytes
`process_start_time_seconds`	Gauge	Start time of the process in seconds since the Unix epoch
`process_virtual_memory_bytes`	Gauge	Virtual memory size in bytes

Go Runtime Metrics

Metric	Type	Description
`go_gc_duration_seconds`	Summary	A summary of pause durations during GC cycles
`go_goroutines`	Gauge	Number of goroutines currently existing
`go_memstats_alloc_bytes`	Gauge	Bytes allocated and still in use
`go_memstats_alloc_bytes_total`	Counter	Total bytes allocated (even if freed)
`go_memstats_heap_alloc_bytes`	Gauge	Heap bytes allocated and still in use
`go_memstats_heap_idle_bytes`	Gauge	Heap bytes waiting to be used
`go_memstats_heap_inuse_bytes`	Gauge	Heap bytes in use
`go_memstats_heap_objects`	Gauge	Number of allocated objects
`go_memstats_heap_sys_bytes`	Gauge	Heap bytes obtained from system
`go_memstats_sys_bytes`	Gauge	Total bytes obtained from system
`go_threads`	Gauge	Number of OS threads created

Expected Ranges and Alerting Thresholds

Request Rate

Condition	Threshold	Severity	Description
Normal	< 1000 req/s	-	Normal operating range
Warning	1000-5000 req/s	Warning	High load, monitor closely
Critical	> 5000 req/s	Critical	Capacity limit approaching

Error Rate

Condition	Threshold	Severity	Description
Normal	< 1%	-	Normal error rate
Warning	1-5%	Warning	Elevated errors, investigate
Critical	> 5%	Critical	High error rate, immediate action

Latency (P99)

Condition	Threshold	Severity	Description
Normal	< 500ms	-	Good response times
Warning	500ms - 2s	Warning	Degraded performance
Critical	> 2s	Critical	Unacceptable latency

Memory Usage

Condition	Threshold	Severity	Description
Normal	< 70% of limit	-	Healthy memory usage
Warning	70-85% of limit	Warning	Memory pressure
Critical	> 85% of limit	Critical	OOM risk

Goroutines

Condition	Threshold	Severity	Description
Normal	< 1000	-	Normal goroutine count
Warning	1000-5000	Warning	High goroutine count
Critical	> 5000	Critical	Possible goroutine leak
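The tables above can be read as a pair of thresholds per signal. A minimal sketch of that mapping follows; the signal names and the `classify` helper are illustrative, and the handling of values that sit exactly on a boundary is an assumption, since the tables do not define it for every signal.

```python
# (warning, critical) thresholds copied from the tables above.
THRESHOLDS = {
    "request_rate_per_second": (1000, 5000),
    "error_rate_percent": (1, 5),
    "p99_latency_seconds": (0.5, 2),
    "memory_percent_of_limit": (70, 85),
    "goroutines": (1000, 5000),
}


def classify(signal, value):
    """Return "normal", "warning", or "critical" for a measured value."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "critical"
    if value >= warning:  # assumption: the boundary value counts as warning
        return "warning"
    return "normal"
```

For example, an error rate of 3% classifies as a warning, and a P99 latency of 0.47 s is still within the normal range.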

Example PromQL Queries

Request Rate

# Total request rate (requests per second)
sum(rate(hyperfleet_api_requests_total[5m]))

# Request rate by pod/instance
sum(rate(hyperfleet_api_requests_total[5m])) by (instance)

# Request rate by endpoint
sum(rate(hyperfleet_api_requests_total[5m])) by (path)

# Request rate by status code
sum(rate(hyperfleet_api_requests_total[5m])) by (code)

# Request rate by method
sum(rate(hyperfleet_api_requests_total[5m])) by (method)

Error Rate

# Overall error rate (5xx responses)
sum(rate(hyperfleet_api_requests_total{code=~"5.."}[5m])) /
sum(rate(hyperfleet_api_requests_total[5m])) * 100

# Error rate by endpoint
sum(rate(hyperfleet_api_requests_total{code=~"5.."}[5m])) by (path) /
sum(rate(hyperfleet_api_requests_total[5m])) by (path) * 100

# Client error rate (4xx responses)
sum(rate(hyperfleet_api_requests_total{code=~"4.."}[5m])) /
sum(rate(hyperfleet_api_requests_total[5m])) * 100
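The error-rate queries divide the increase of the 5xx counters by the increase of all counters over the same window. The sketch below reproduces that arithmetic for two scrapes. It is a simplification with hypothetical helper names: unlike `rate()`, it treats a series missing from the first scrape as starting at zero, and it does not extrapolate to the window boundaries.

```python
def counter_increase(previous, current):
    """Increase of a counter between two scrapes; a drop is treated as a reset to zero."""
    return current if current < previous else current - previous


def error_rate_percent(previous, current):
    """Share of 5xx responses among all requests between two scrapes.

    `previous` and `current` map (code, method, path) to counter values.
    """
    total = errors = 0.0
    for key, value in current.items():
        increase = counter_increase(previous.get(key, 0.0), value)
        total += increase
        if key[0].startswith("5"):  # same selection as code=~"5.."
            errors += increase
    return errors / total * 100 if total else float("nan")
```

If the 200 counter grows by 90 and the 500 counter by 10 between scrapes, the result is 10%, which falls in the critical band of the error-rate table.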

Latency

# Average request duration per series (last 10 minutes)
rate(hyperfleet_api_request_duration_seconds_sum[10m]) /
rate(hyperfleet_api_request_duration_seconds_count[10m])

# Average request duration by endpoint
sum(rate(hyperfleet_api_request_duration_seconds_sum[5m])) by (path) /
sum(rate(hyperfleet_api_request_duration_seconds_count[5m])) by (path)

# P50 latency (approximate using histogram)
histogram_quantile(0.5, sum(rate(hyperfleet_api_request_duration_seconds_bucket[5m])) by (le))

# P90 latency
histogram_quantile(0.9, sum(rate(hyperfleet_api_request_duration_seconds_bucket[5m])) by (le))

# P99 latency
histogram_quantile(0.99, sum(rate(hyperfleet_api_request_duration_seconds_bucket[5m])) by (le))

# P99 latency by endpoint
histogram_quantile(0.99, sum(rate(hyperfleet_api_request_duration_seconds_bucket[5m])) by (le, path))

Resource Usage

# Memory usage in MB
process_resident_memory_bytes / 1024 / 1024

# Memory usage trend (increase over 1 hour)
delta(process_resident_memory_bytes[1h]) / 1024 / 1024

# Goroutine count
go_goroutines

# Goroutine trend
delta(go_goroutines[1h])

# CPU usage rate
rate(process_cpu_seconds_total[5m])

# File descriptor usage percentage
process_open_fds / process_max_fds * 100

Reconciliation Observability

# Resources entering unreconciled state per minute
sum by (resource_type, is_delete) (rate(hyperfleet_api_reconciliation_started_total[5m])) * 60

# Resources currently stuck pending reconciliation
hyperfleet_api_resource_pending_reconciliation_stuck

# Stuck resources — deletion only
hyperfleet_api_resource_pending_reconciliation_stuck{is_delete="true"}

# Maximum stuck duration by type
hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds

# All pending resources by type
hyperfleet_api_resource_pending_reconciliation

Common Investigation Queries

# Slowest endpoints (average latency)
topk(10,
  sum(rate(hyperfleet_api_request_duration_seconds_sum[5m])) by (path) /
  sum(rate(hyperfleet_api_request_duration_seconds_count[5m])) by (path)
)

# Most requested endpoints
topk(10, sum(rate(hyperfleet_api_requests_total[5m])) by (path))

# Endpoints with highest error rate
topk(10,
  sum(rate(hyperfleet_api_requests_total{code=~"5.."}[5m])) by (path) /
  sum(rate(hyperfleet_api_requests_total[5m])) by (path)
)

# Percentage of requests taking longer than 1 second
(1 - (sum(rate(hyperfleet_api_request_duration_seconds_bucket{le="1"}[5m])) /
sum(rate(hyperfleet_api_request_duration_seconds_count[5m])))) * 100
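The share of slow requests is the complement of the `le="1"` bucket over the total count; scaled by 100 it becomes a percentage. As a worked illustration using the cumulative counts from the histogram example output earlier in this document (the helper name is hypothetical):

```python
def percent_slower_than(bucket_count, total_count):
    """Percentage of requests slower than the bucket's upper bound."""
    if total_count == 0:
        return float("nan")
    return (1 - bucket_count / total_count) * 100


# 1520 of 1523 example requests finished within 1 second.
slow_percent = percent_slower_than(1520, 1523)
```

Here `slow_percent` is about 0.2, meaning roughly 0.2% of the example requests took longer than one second.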

Prometheus Operator Integration

If using Prometheus Operator, enable the ServiceMonitor in Helm values:

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
  labels:
    release: prometheus  # Match your Prometheus selector

See Deployment Guide for details.

Google Managed Prometheus (GMP) Integration

If running on GKE with Google Managed Prometheus, enable the PodMonitoring resource in Helm values:

monitoring:
  podMonitoring:
    enabled: true
    interval: 30s

This creates a monitoring.googleapis.com/v1/PodMonitoring resource that configures the GMP collector agent to scrape the /metrics endpoint directly from pods. The serviceMonitor and podMonitoring options are independent and can be enabled simultaneously.

Grafana Dashboard

Example dashboard JSON for HyperFleet API monitoring is available in the architecture repository. Key panels to include:

Request Rate - Total requests per second over time
Error Rate - Percentage of 5xx responses
Latency Distribution - P50, P90, P99 latencies
Request Duration Heatmap - Visual distribution of request times
Top Endpoints - Most frequently accessed paths
Memory Usage - Resident memory over time
Goroutines - Goroutine count over time
