Pod Cost Manager

Workload-aware Kubernetes HPA scale-down — annotates pods with deletion-cost based on real-time activity metrics from a pluggable backend, so idle pods are removed first.

Overview

When Kubernetes HPA scales down a deployment, it has no awareness of which pods are actively processing work. This can lead to mid-operation pod termination, causing failures and retries.

Pod Cost Manager solves this by:

  1. Querying a pluggable metrics backend for real-time pod activity metrics (configurable per workload)
  2. Computing a deletion cost (0-1000) for each pod based on current workload
  3. Annotating each pod with controller.kubernetes.io/pod-deletion-cost so Kubernetes preferentially removes idle pods first
  4. Optionally managing HPA minReplicas to prevent scale-down below the active pod count

The pod-deletion-cost annotation is a Kubernetes-native mechanism respected by any HPA controller, including KEDA.
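On a pod, the annotation looks like the following (the pod name and cost value here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trino-worker-0
  annotations:
    # higher cost = removed later during scale-down
    controller.kubernetes.io/pod-deletion-cost: "730"
```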

The result: HPA scale-down decisions become workload-aware, avoiding unnecessary failures and retries.

How It Works

Architecture

```
Metrics Backend ──> Pod Cost Manager ──> Kubernetes API
  (Prometheus,       (CronJob)          (pod annotations + HPA patches)
   Datadog, etc.)
```

The manager runs as a Kubernetes CronJob (default: every minute). Each run:

  1. Discovers target pods via label selectors
  2. Queries the metrics backend for pod-level metrics in batch (one query per metric type, not per pod)
  3. Enriches metrics with Kubernetes pod status (age, readiness, termination state)
  4. Calculates a weighted cost score per pod
  5. Annotates each pod with controller.kubernetes.io/pod-deletion-cost
  6. (Optional) Patches HPA minReplicas = max(active_pods + buffer, baseline)
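Step 5 boils down to a small JSON merge patch per pod. A minimal sketch (the helper name is ours; applying the patch with the official kubernetes Python client would go through `CoreV1Api().patch_namespaced_pod(name, namespace, body=patch)`):

```python
# Build the merge-patch body for step 5. Annotation values must be
# strings, so the integer cost is stringified.
ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"

def deletion_cost_patch(cost: int) -> dict:
    return {"metadata": {"annotations": {ANNOTATION: str(cost)}}}

# Example: a moderately busy worker
patch = deletion_cost_patch(730)
```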

Cost Calculation

Each pod receives a score from 0 (idle, safe to remove) to 1000 (very active, protect):

| Metric | Weight | Max Points | Description |
| --- | --- | --- | --- |
| Running Splits | 10 per split | 400 | Actively executing query fragments |
| Waiting Splits | 5 per split | 300 | Queued work ready to execute |
| High-Priority Tasks (L0-L2) | 15 per task | 150 | Critical execution pipeline tasks |
| Low-Priority Tasks (L3-L4) | 10 per task | 100 | Background execution tasks |
| CPU Utilization | 0.5 per % | 50 | Current CPU usage rate |
| Output Buffers | 50 per buffer | 500 | Active result streaming buffers (critical) |

All metric names, weights, and caps are fully configurable via Helm values. The defaults are tuned for Trino distributed query workloads but can be adapted to any application.

Edge cases are handled before the weighted calculation:

| Condition | Cost | Rationale |
| --- | --- | --- |
| Terminating pod | 0 | Already being removed |
| Not-ready pod | 50 | Unhealthy, prefer removal |
| New pod (< 3 min) | 500 | Protect during startup |
| No metrics available | 500 | Assume active (safe default) |

A small random jitter (default 0-10) is added to break ties between equally-scored pods.
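Putting the pieces together, the per-pod scoring described above can be sketched as follows. Weights, caps, and thresholds are the defaults from the tables; the function name and the exact precedence among the edge cases are our assumptions:

```python
import random

# Defaults from the tables above; all configurable via Helm values.
WEIGHTS = {"running_splits": 10, "waiting_splits": 5,
           "tasks_high": 15, "tasks_low": 10,
           "cpu_pct": 0.5, "output_buffers": 50}
CAPS = {"running_splits": 400, "waiting_splits": 300,
        "tasks_high": 150, "tasks_low": 100,
        "cpu_pct": 50, "output_buffers": 500}
MAX_TOTAL = 1000

def pod_cost(metrics, *, terminating=False, ready=True,
             age_minutes=60.0, jitter=(0, 10)):
    """metrics: dict keyed like WEIGHTS, or None when no metrics exist."""
    # Edge cases short-circuit the weighted calculation.
    if terminating:
        return 0
    if not ready:
        return 50
    if age_minutes < 3 or metrics is None:
        return 500
    # Weighted sum, each component capped at its maximum.
    score = sum(min(metrics.get(k, 0) * w, CAPS[k]) for k, w in WEIGHTS.items())
    score += random.uniform(*jitter)  # break ties between equal pods
    return min(round(score), MAX_TOTAL)
```

With defaults, a worker running 3 splits and streaming through 2 output buffers scores 3×10 + 2×50 = 130 (plus jitter), while an idle worker scores near 0 and is removed first.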

Output Buffer Protection

Output buffers represent active result streaming from workers to the coordinator. Killing a worker with active output buffers causes immediate query failure with no recovery path. The high weight (50 per buffer) and cap (500 points) ensure these workers are strongly protected.

For Trino output buffer metrics, deploy the companion trino-buffer-exporter.

Quick Start

1. Build the Docker Image

```bash
docker build -t pod-cost-manager:latest .
```

2. Deploy with Helm

```bash
helm upgrade --install pod-cost-manager ./chart \
  -n trino \
  -f chart/values.yaml
```

3. Verify

```bash
# Check CronJob
kubectl get cronjobs -n trino

# Check recent job runs
kubectl get jobs -n trino -l app=pod-cost-manager

# View pod annotations
kubectl get pods -n trino -l component=worker -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.controller\.kubernetes\.io/pod-deletion-cost}{"\n"}{end}'
```

Configuration Reference

Image

| Parameter | Description | Default |
| --- | --- | --- |
| `image.repository` | Container image repository | `pod-cost-manager` |
| `image.tag` | Container image tag | `latest` |
| `image.pullPolicy` | Image pull policy | `IfNotPresent` |

Metrics Backend

| Parameter | Description | Default |
| --- | --- | --- |
| `metricsBackend.type` | Backend type: `prometheus` | `prometheus` |
| `metricsBackend.url` | Server URL | `http://prometheus-server.prometheus.svc` |
| `metricsBackend.port` | Server port | `80` |
| `metricsBackend.timeout_seconds` | Query timeout | `10` |
| `metricsBackend.auth.type` | Auth type: `none`, `basic`, `bearer` | `none` |
| `metricsBackend.auth.existingSecret` | Secret name for auth credentials | `""` |

The architecture supports custom metrics backends. To add one (e.g., Datadog, CloudWatch), implement the MetricsBackend ABC in pod-cost-manager.py and register it in create_metrics_backend().
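Illustrative only: the real `MetricsBackend` ABC is defined in pod-cost-manager.py and its method names may differ. The stand-in interface below mirrors the batched behavior described earlier (one query per metric name, results keyed by pod name):

```python
from abc import ABC, abstractmethod

class MetricsBackend(ABC):  # stand-in for the real ABC
    @abstractmethod
    def query_metric(self, metric_name: str) -> dict:
        """Return {pod_name: value} for every pod reporting the metric."""

class StaticBackend(MetricsBackend):
    """Toy backend with canned values; a real one (Datadog, CloudWatch)
    would issue one batched API call per metric here."""
    def __init__(self, data: dict):
        self._data = data

    def query_metric(self, metric_name: str) -> dict:
        return self._data.get(metric_name, {})
```

A real backend would also be registered in create_metrics_backend() so it can be selected via metricsBackend.type.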

Pod Selector

| Parameter | Description | Default |
| --- | --- | --- |
| `podSelector.<name>.enabled` | Enable this selector | `true` |
| `podSelector.<name>.namespace` | Kubernetes namespace | `trino` |
| `podSelector.<name>.labels` | Label selector map | `{app: trino, component: worker, release: trino}` |

You can define multiple selectors for blue/green or multi-cluster deployments. See examples/.
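For example, a hypothetical blue/green values snippet might look like:

```yaml
podSelector:
  blue:
    enabled: true
    namespace: trino
    labels: {app: trino, component: worker, release: trino-blue}
  green:
    enabled: false          # flip during cutover
    namespace: trino
    labels: {app: trino, component: worker, release: trino-green}
```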

Metric Names

| Parameter | Description | Default |
| --- | --- | --- |
| `metricNames.runningSplits` | Running splits metric | `trino_execution_executor_TaskExecutor_RunningSplits` |
| `metricNames.waitingSplits` | Waiting splits metric | `trino_execution_executor_TaskExecutor_WaitingSplits` |
| `metricNames.runningTasksLevelPattern` | Task level pattern (use `{level}`) | `trino_execution_executor_TaskExecutor_RunningTasksLevel{level}` |
| `metricNames.cpuUsage` | CPU usage metric | `container_cpu_usage_seconds_total` |
| `metricNames.cpuContainerName` | Container name for CPU query | `trino-worker` |
| `metricNames.cpuRateWindow` | Rate window for CPU query | `15m` |
| `metricNames.activeOutputBuffers` | Active output buffers metric | `trino_worker_active_output_buffers` |
| `metricNames.outputBufferedBytes` | Output buffered bytes metric | `trino_worker_output_buffered_bytes` |
| `metricNames.taskLevels` | Number of task priority levels | `5` |

Edge Case Costs

| Parameter | Description | Default |
| --- | --- | --- |
| `edgeCaseCosts.terminatingPod` | Cost for terminating pods | `0` |
| `edgeCaseCosts.notReadyPod` | Cost for not-ready pods | `50` |
| `edgeCaseCosts.newPod` | Cost for newly created pods | `500` |
| `edgeCaseCosts.noMetricsPod` | Cost when no metrics available | `500` |

Pod Age

| Parameter | Description | Default |
| --- | --- | --- |
| `podAge.newPodThresholdHours` | Hours below which a pod is "new" | `0.05` (~3 minutes) |

Jitter

| Parameter | Description | Default |
| --- | --- | --- |
| `jitter.min` | Minimum jitter value | `0` |
| `jitter.max` | Maximum jitter value | `10` |

Cost Calculation

| Parameter | Description | Default |
| --- | --- | --- |
| `costCalculation.weights.running_splits` | Weight per running split | `10` |
| `costCalculation.weights.waiting_splits` | Weight per waiting split | `5` |
| `costCalculation.weights.running_tasks_high_priority` | Weight per L0-L2 task | `15` |
| `costCalculation.weights.running_tasks_low_priority` | Weight per L3-L4 task | `10` |
| `costCalculation.weights.cpu_utilization` | Weight per CPU % | `0.5` |
| `costCalculation.weights.output_buffers` | Weight per active buffer | `50` |
| `costCalculation.caps.running_splits` | Max points from running splits | `400` |
| `costCalculation.caps.waiting_splits` | Max points from waiting splits | `300` |
| `costCalculation.caps.running_tasks_high` | Max points from L0-L2 tasks | `150` |
| `costCalculation.caps.running_tasks_low` | Max points from L3-L4 tasks | `100` |
| `costCalculation.caps.cpu_utilization` | Max points from CPU | `50` |
| `costCalculation.caps.output_buffers` | Max points from output buffers | `500` |
| `costCalculation.caps.max_total` | Overall maximum cost | `1000` |

HPA Management

| Parameter | Description | Default |
| --- | --- | --- |
| `hpaManagement.enabled` | Enable dynamic minReplicas management | `false` |
| `hpaManagement.buffer` | Extra pods above active count | `1` |
| `hpaManagement.baseline_minimum` | Minimum floor for minReplicas | `2` |
| `hpaManagement.hpa_names` | Map of selector keys to HPA names | `{}` |
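The patch applied in this mode is small. A sketch with the default buffer and baseline (the helper name is ours; the real call would go through the kubernetes client's `AutoscalingV2Api.patch_namespaced_horizontal_pod_autoscaler`):

```python
def min_replicas_patch(active_pods: int, buffer: int = 1, baseline: int = 2) -> dict:
    """minReplicas = max(active_pods + buffer, baseline).
    Defaults mirror hpaManagement.buffer and hpaManagement.baseline_minimum."""
    return {"spec": {"minReplicas": max(active_pods + buffer, baseline)}}

# 5 active workers -> floor of 6; an idle cluster still keeps the baseline of 2
```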

Other

| Parameter | Description | Default |
| --- | --- | --- |
| `schedule` | CronJob schedule | `*/1 * * * *` |
| `namespace` | Deployment namespace | `trino` |
| `dryRun` | Log without patching | `false` |
| `logging.level` | Log level | `INFO` |
| `logging.format` | Log format | `json` |
| `datadog.enabled` | Enable Datadog log annotations | `false` |

Authentication

Bearer Token

Create a Kubernetes secret with your token:

```bash
kubectl create secret generic prometheus-credentials \
  --from-literal=token=YOUR_TOKEN \
  -n trino
```

Configure in values:

```yaml
metricsBackend:
  type: "prometheus"
  url: "https://prometheus.example.com"
  port: 443
  auth:
    type: "bearer"
    existingSecret: "prometheus-credentials"
```

Basic Auth

```bash
kubectl create secret generic prometheus-credentials \
  --from-literal=password=YOUR_PASSWORD \
  -n trino
```

Configure in values:

```yaml
metricsBackend:
  auth:
    type: "basic"
    existingSecret: "prometheus-credentials"
```

The username is configured in the config YAML; only the password is injected from the secret.

Examples

See the examples/ directory.

Prerequisites

  • Kubernetes 1.24+ (pod deletion cost annotation support)
  • A metrics backend (Prometheus with application metrics, or implement a custom backend)
  • Helm 3.x
  • (Optional) trino-buffer-exporter for Trino output buffer metrics

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

About Simon

Pod Cost Manager is maintained by Simon, the agentic marketing platform that combines customer data with real-world signals to orchestrate personalized, 1:1 campaigns at scale. We built this tool to manage intelligent autoscaling for our Trino query infrastructure and open-sourced it so others can benefit.

License

Apache License 2.0. See LICENSE for details.
