Workload-aware Kubernetes HPA scale-down — annotates pods with deletion-cost based on real-time activity metrics from a pluggable backend, so idle pods are removed first.
When Kubernetes HPA scales down a deployment, it has no awareness of which pods are actively processing work. This can lead to mid-operation pod termination, causing failures and retries.
Pod Cost Manager solves this by:
- Querying a pluggable metrics backend for real-time pod activity metrics (configurable per workload)
- Computing a deletion cost (0-1000) for each pod based on current workload
- Annotating each pod with `controller.kubernetes.io/pod-deletion-cost` so Kubernetes preferentially removes idle pods first
- Optionally managing HPA `minReplicas` to prevent scale-down below the active pod count
The pod-deletion-cost annotation is a Kubernetes-native mechanism respected by any HPA controller, including KEDA.
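As a sketch, the resulting annotation on a managed pod looks like this (pod name and cost value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trino-worker-abc12   # example pod name
  annotations:
    # Higher cost = Kubernetes removes this pod later during scale-down
    controller.kubernetes.io/pod-deletion-cost: "750"
```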
The result: HPA scale-down decisions become workload-aware, eliminating unnecessary failures.
```
Metrics Backend  ──>  Pod Cost Manager  ──>  Kubernetes API
 (Prometheus,             (CronJob)          (pod annotations + HPA patches)
  Datadog, etc.)
```
The manager runs as a Kubernetes CronJob (default: every minute). Each run:
- Discovers target pods via label selectors
- Queries the metrics backend for pod-level metrics in batch (one query per metric type, not per pod)
- Enriches metrics with Kubernetes pod status (age, readiness, termination state)
- Calculates a weighted cost score per pod
- Annotates each pod with `controller.kubernetes.io/pod-deletion-cost`
- (Optional) Patches HPA `minReplicas = max(active_pods + buffer, baseline)`
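The `minReplicas` computation can be sketched as follows. This is a minimal illustration, not the tool's actual code; in particular, the threshold for counting a pod as "active" is an assumption here:

```python
def desired_min_replicas(pod_costs, buffer=1, baseline=2, active_threshold=100):
    """Compute the minReplicas floor from per-pod deletion costs.

    A pod counts as 'active' when its cost exceeds the threshold
    (an illustrative cutoff); the floor is the active count plus a
    buffer, never below the baseline minimum.
    """
    active_pods = sum(1 for cost in pod_costs.values() if cost > active_threshold)
    return max(active_pods + buffer, baseline)

# Two busy workers, one idle: floor becomes 2 active + 1 buffer = 3
costs = {"worker-0": 750, "worker-1": 30, "worker-2": 420}
print(desired_min_replicas(costs))  # 3
```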
Each pod receives a score from 0 (idle, safe to remove) to 1000 (very active, protect):
| Metric | Weight | Max Points | Description |
|---|---|---|---|
| Running Splits | 10 per split | 400 | Actively executing query fragments |
| Waiting Splits | 5 per split | 300 | Queued work ready to execute |
| High-Priority Tasks (L0-L2) | 15 per task | 150 | Critical execution pipeline tasks |
| Low-Priority Tasks (L3-L4) | 10 per task | 100 | Background execution tasks |
| CPU Utilization | 0.5 per % | 50 | Current CPU usage rate |
| Output Buffers | 50 per buffer | 500 | Active result streaming buffers (critical) |
All metric names, weights, and caps are fully configurable via Helm values. The defaults are tuned for Trino distributed query workloads but can be adapted to any application.
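Using the default weights and caps above, the scoring can be sketched as a simplified reimplementation (metric key names here are illustrative, not the tool's actual identifiers):

```python
# Illustrative per-unit weights and per-metric caps, mirroring the defaults above.
WEIGHTS = {
    "running_splits": 10,
    "waiting_splits": 5,
    "tasks_high_priority": 15,
    "tasks_low_priority": 10,
    "cpu_percent": 0.5,
    "output_buffers": 50,
}
CAPS = {
    "running_splits": 400,
    "waiting_splits": 300,
    "tasks_high_priority": 150,
    "tasks_low_priority": 100,
    "cpu_percent": 50,
    "output_buffers": 500,
}

def pod_cost(metrics, max_total=1000):
    """Each metric contributes value * weight, capped per metric;
    the total is clamped to max_total."""
    score = sum(min(value * WEIGHTS[name], CAPS[name]) for name, value in metrics.items())
    return min(round(score), max_total)

# A moderately busy worker: 12 running splits, 4 waiting splits,
# 2 high- and 1 low-priority task, 60% CPU, 3 active output buffers.
busy = {
    "running_splits": 12, "waiting_splits": 4,
    "tasks_high_priority": 2, "tasks_low_priority": 1,
    "cpu_percent": 60, "output_buffers": 3,
}
print(pod_cost(busy))  # 120 + 20 + 30 + 10 + 30 + 150 = 360
```

Note how the per-metric caps keep any single metric from saturating the score: even a worker with thousands of running splits contributes at most 400 points from that metric.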
Edge cases are handled before the weighted calculation:
| Condition | Cost | Rationale |
|---|---|---|
| Terminating pod | 0 | Already being removed |
| Not-ready pod | 50 | Unhealthy, prefer removal |
| New pod (< 3 min) | 500 | Protect during startup |
| No metrics available | 500 | Assume active (safe default) |
A small random jitter (default 0-10) is added to break ties between equally-scored pods.
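The edge-case precedence and jitter can be sketched like this (a simplified model; the pod fields and `base_cost_fn` hook are assumptions for illustration):

```python
import random
from datetime import datetime, timedelta, timezone

def effective_cost(pod, metrics, base_cost_fn, new_pod_threshold=timedelta(minutes=3)):
    """Apply edge-case overrides before the weighted score.

    'pod' is an illustrative dict with 'terminating', 'ready', and
    'created' fields derived from Kubernetes pod status.
    """
    if pod.get("terminating"):
        return 0        # already being removed
    if not pod.get("ready", True):
        return 50       # unhealthy, prefer removal
    if datetime.now(timezone.utc) - pod["created"] < new_pod_threshold:
        return 500      # protect during startup
    if metrics is None:
        return 500      # no metrics: assume active (safe default)
    # Otherwise use the weighted score, plus jitter to break ties
    return base_cost_fn(metrics) + random.randint(0, 10)
```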
Output buffers represent active result streaming from workers to the coordinator. Killing a worker with active output buffers causes immediate query failure with no recovery path. The high weight (50 per buffer) and cap (500 points) ensure these workers are strongly protected.
For Trino output buffer metrics, deploy the companion trino-buffer-exporter.
```shell
# Build the image
docker build -t pod-cost-manager:latest .

# Install the chart
helm upgrade --install pod-cost-manager ./chart \
  -n trino \
  -f chart/values.yaml
```

```shell
# Check CronJob
kubectl get cronjobs -n trino

# Check recent job runs
kubectl get jobs -n trino -l app=pod-cost-manager

# View pod annotations
kubectl get pods -n trino -l component=worker -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.controller\.kubernetes\.io/pod-deletion-cost}{"\n"}{end}'
```

| Parameter | Description | Default |
|---|---|---|
| `image.repository` | Container image repository | `pod-cost-manager` |
| `image.tag` | Container image tag | `latest` |
| `image.pullPolicy` | Image pull policy | `IfNotPresent` |
| Parameter | Description | Default |
|---|---|---|
| `metricsBackend.type` | Backend type: `prometheus` | `prometheus` |
| `metricsBackend.url` | Server URL | `http://prometheus-server.prometheus.svc` |
| `metricsBackend.port` | Server port | `80` |
| `metricsBackend.timeout_seconds` | Query timeout | `10` |
| `metricsBackend.auth.type` | Auth type: `none`, `basic`, `bearer` | `none` |
| `metricsBackend.auth.existingSecret` | Secret name for auth credentials | `""` |
The architecture supports custom metrics backends. To add one (e.g., Datadog, CloudWatch), implement the `MetricsBackend` ABC in `pod-cost-manager.py` and register it in `create_metrics_backend()`.
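The exact `MetricsBackend` interface lives in `pod-cost-manager.py` and is not reproduced here; as a hedged sketch, assuming a single batch-query method, a custom backend might look like:

```python
from abc import ABC, abstractmethod

# Assumed shape of the interface -- the real ABC in pod-cost-manager.py
# may define different method names and signatures.
class MetricsBackend(ABC):
    @abstractmethod
    def query_pod_metrics(self, metric_name: str) -> dict[str, float]:
        """Return {pod_name: value} for one metric, queried in batch."""

class StaticBackend(MetricsBackend):
    """A trivial backend returning fixed values, useful for local testing."""

    def __init__(self, data: dict[str, dict[str, float]]):
        self.data = data

    def query_pod_metrics(self, metric_name: str) -> dict[str, float]:
        return self.data.get(metric_name, {})
```

A Datadog or CloudWatch backend would follow the same pattern, translating each metric name into one batched API query rather than one query per pod.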
| Parameter | Description | Default |
|---|---|---|
| `podSelector.<name>.enabled` | Enable this selector | `true` |
| `podSelector.<name>.namespace` | Kubernetes namespace | `trino` |
| `podSelector.<name>.labels` | Label selector map | `{app: trino, component: worker, release: trino}` |
You can define multiple selectors for blue/green or multi-cluster deployments. See examples/.
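A hypothetical blue/green fragment might look like this (release names are illustrative; see `examples/values-blue-green.yaml` for a real one):

```yaml
podSelector:
  blue:
    enabled: true
    namespace: trino
    labels: {app: trino, component: worker, release: trino-blue}
  green:
    enabled: false   # flip during cutover
    namespace: trino
    labels: {app: trino, component: worker, release: trino-green}
```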
| Parameter | Description | Default |
|---|---|---|
| `metricNames.runningSplits` | Running splits metric | `trino_execution_executor_TaskExecutor_RunningSplits` |
| `metricNames.waitingSplits` | Waiting splits metric | `trino_execution_executor_TaskExecutor_WaitingSplits` |
| `metricNames.runningTasksLevelPattern` | Task level pattern (use `{level}`) | `trino_execution_executor_TaskExecutor_RunningTasksLevel{level}` |
| `metricNames.cpuUsage` | CPU usage metric | `container_cpu_usage_seconds_total` |
| `metricNames.cpuContainerName` | Container name for CPU query | `trino-worker` |
| `metricNames.cpuRateWindow` | Rate window for CPU query | `15m` |
| `metricNames.activeOutputBuffers` | Active output buffers metric | `trino_worker_active_output_buffers` |
| `metricNames.outputBufferedBytes` | Output buffered bytes metric | `trino_worker_output_buffered_bytes` |
| `metricNames.taskLevels` | Number of task priority levels | `5` |
| Parameter | Description | Default |
|---|---|---|
| `edgeCaseCosts.terminatingPod` | Cost for terminating pods | `0` |
| `edgeCaseCosts.notReadyPod` | Cost for not-ready pods | `50` |
| `edgeCaseCosts.newPod` | Cost for newly created pods | `500` |
| `edgeCaseCosts.noMetricsPod` | Cost when no metrics available | `500` |
| Parameter | Description | Default |
|---|---|---|
| `podAge.newPodThresholdHours` | Hours below which a pod is "new" | `0.05` (~3 minutes) |
| Parameter | Description | Default |
|---|---|---|
| `jitter.min` | Minimum jitter value | `0` |
| `jitter.max` | Maximum jitter value | `10` |
| Parameter | Description | Default |
|---|---|---|
| `costCalculation.weights.running_splits` | Weight per running split | `10` |
| `costCalculation.weights.waiting_splits` | Weight per waiting split | `5` |
| `costCalculation.weights.running_tasks_high_priority` | Weight per L0-L2 task | `15` |
| `costCalculation.weights.running_tasks_low_priority` | Weight per L3-L4 task | `10` |
| `costCalculation.weights.cpu_utilization` | Weight per CPU % | `0.5` |
| `costCalculation.weights.output_buffers` | Weight per active buffer | `50` |
| `costCalculation.caps.running_splits` | Max points from running splits | `400` |
| `costCalculation.caps.waiting_splits` | Max points from waiting splits | `300` |
| `costCalculation.caps.running_tasks_high` | Max points from L0-L2 tasks | `150` |
| `costCalculation.caps.running_tasks_low` | Max points from L3-L4 tasks | `100` |
| `costCalculation.caps.cpu_utilization` | Max points from CPU | `50` |
| `costCalculation.caps.output_buffers` | Max points from output buffers | `500` |
| `costCalculation.caps.max_total` | Overall maximum cost | `1000` |
| Parameter | Description | Default |
|---|---|---|
| `hpaManagement.enabled` | Enable dynamic `minReplicas` management | `false` |
| `hpaManagement.buffer` | Extra pods above active count | `1` |
| `hpaManagement.baseline_minimum` | Minimum floor for `minReplicas` | `2` |
| `hpaManagement.hpa_names` | Map of selector keys to HPA names | `{}` |
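A hypothetical values fragment enabling HPA management (the selector key and HPA object name are illustrative):

```yaml
hpaManagement:
  enabled: true
  buffer: 1              # keep one spare pod above the active count
  baseline_minimum: 2    # never push minReplicas below 2
  hpa_names:
    blue: trino-worker-blue   # selector key -> HPA object name
```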
| Parameter | Description | Default |
|---|---|---|
| `schedule` | CronJob schedule | `*/1 * * * *` |
| `namespace` | Deployment namespace | `trino` |
| `dryRun` | Log without patching | `false` |
| `logging.level` | Log level | `INFO` |
| `logging.format` | Log format | `json` |
| `datadog.enabled` | Enable Datadog log annotations | `false` |
Create a Kubernetes secret with your token:
```shell
kubectl create secret generic prometheus-credentials \
  --from-literal=token=YOUR_TOKEN \
  -n trino
```

Configure in values:

```yaml
metricsBackend:
  type: "prometheus"
  url: "https://prometheus.example.com"
  port: 443
  auth:
    type: "bearer"
    existingSecret: "prometheus-credentials"
```

For basic auth:

```shell
kubectl create secret generic prometheus-credentials \
  --from-literal=password=YOUR_PASSWORD \
  -n trino
```

```yaml
metricsBackend:
  auth:
    type: "basic"
    existingSecret: "prometheus-credentials"
```

The username is configured in the config YAML; only the password is injected from the secret.
See the examples/ directory:
- values-basic.yaml - Single release
- values-blue-green.yaml - Blue/green deployment with HPA management
- values-with-auth.yaml - Prometheus with bearer token auth
- values-custom-metrics.yaml - Custom metric names and tuning
- custom-backends.py - Implementing alternative metrics backends (Datadog, CloudWatch, static/testing)
- Kubernetes 1.24+ (pod deletion cost annotation support)
- A metrics backend (Prometheus with application metrics, or implement a custom backend)
- Helm 3.x
- (Optional) trino-buffer-exporter for Trino output buffer metrics
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Pod Cost Manager is maintained by Simon, the agentic marketing platform that combines customer data with real-world signals to orchestrate personalized, 1:1 campaigns at scale. We built this tool to manage intelligent autoscaling for our Trino query infrastructure and open-sourced it so others can benefit.
Apache License 2.0. See LICENSE for details.