BudAIScaler is a Kubernetes autoscaler designed for GenAI workloads. It provides intelligent scaling with GPU awareness, cost optimization, adaptive learning, and predictive capabilities.
| Strategy | Description | Best For |
|---|---|---|
| HPA | Kubernetes HorizontalPodAutoscaler wrapper | Standard workloads, native K8s integration |
| KPA | Knative-style with panic/stable windows | Bursty traffic, rapid scale-up needs |
| BudScaler | Custom GenAI-optimized algorithm | LLM inference, GPU workloads |
- Pod Metrics - HTTP endpoints on pods (e.g., `/metrics`)
- Resource Metrics - CPU/Memory via the Kubernetes Metrics API
- Custom Metrics - Kubernetes Custom Metrics API
- External Metrics - External HTTP endpoints (see the sketch below)
- Prometheus - Direct PromQL queries
- Inference Engines - Native support for vLLM, TGI, SGLang, and Triton
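For example, an external metrics source might look like the following. This is a minimal sketch: the `endpoint` field mirrors the Prometheus example later in this document, but the `external` source type value is an assumption and should be verified against the CRD schema.

```yaml
metricsSources:
  - metricSourceType: external   # value assumed; check the CRD schema
    targetMetric: queue_depth
    targetValue: "50"
    endpoint: "http://metrics-service.default:8080/metrics"
```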
The BudAIScaler includes an adaptive learning system that improves scaling decisions over time:
- Pattern Detection - Automatically detects workload patterns (KV cache pressure, batch spikes, traffic drains)
- Seasonal Learning - Learns time-of-day and day-of-week traffic patterns
- Prediction Accuracy Tracking - Monitors MAE, MAPE, and direction accuracy
- Self-Calibration - Automatically adjusts model parameters based on prediction errors
- Persistent Learning - Stores learned data in ConfigMaps for recovery across restarts
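Because learned state is persisted to ConfigMaps, you can inspect it directly. A sketch, assuming the controller keeps its state in the `scaler-system` namespace; the ConfigMap name shown is hypothetical:

```bash
# List ConfigMaps in the controller's namespace
kubectl get configmaps -n scaler-system

# Dump the learned state (ConfigMap name is illustrative)
kubectl get configmap my-app-scaler-learning -n scaler-system -o yaml
```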
Define known traffic patterns for deterministic schedule-based scaling:
```yaml
spec:
  scheduleHints:
    - name: morning-rush
      cronExpression: "0 9 * * 1-5"
      targetReplicas: 5
      duration: 2h
    - name: batch-processing
      cronExpression: "0 2 * * *"
      targetReplicas: 10
      duration: 1h
```

Scale based on GPU memory and compute utilization:
```yaml
spec:
  gpuConfig:
    enabled: true
    gpuMemoryThreshold: 80
    gpuComputeThreshold: 70
    metricsSource: dcgm # dcgm, nvml, or prometheus
    dcgmEndpoint: "http://dcgm-exporter:9400/metrics"
```

Optimize scaling decisions based on cost constraints:
```yaml
spec:
  costConfig:
    enabled: true
    cloudProvider: aws
    budgetPerHour: "10.00"
    budgetPerDay: "200.00"
    preferSpotInstances: true
    spotFallbackToOnDemand: true
```

ML-based prediction for proactive scaling:
```yaml
spec:
  metricsSources:
    - targetMetric: "app:cache_usage"
      targetValue: "70"
    - targetMetric: "app:queue_depth"
      targetValue: "50"
  predictionConfig:
    enabled: true
    lookAheadMinutes: 15
    historicalDataDays: 7
    enableLearning: true
    minConfidence: 0.7
    # Specify which metrics to use for prediction (optional)
    predictionMetrics:
      - "app:cache_usage"
    # Specify which metrics to track for seasonal patterns (optional)
    seasonalMetrics:
      - "app:cache_usage"
      - "app:queue_depth"
```

The prediction system is fully generic: you can use any metrics from your `metricsSources`. See Prediction Algorithm for details.
- Kubernetes 1.26+
- Helm 3.x (for Helm installation)
- kubectl configured to access your cluster
Install with Helm:

```bash
# Install from OCI registry
helm install scaler oci://ghcr.io/budecosystem/charts/scaler \
  --namespace scaler-system \
  --create-namespace

# Verify installation
kubectl get pods -n scaler-system
kubectl get crd budaiscalers.scaler.bud.studio
```

Alternatively, install with kustomize:

```bash
# Install CRDs
kubectl apply -k config/crd

# Install controller
kubectl apply -k config/default
```

Or install from source:

```bash
# Clone the repository
git clone https://github.com/BudEcosystem/scaler.git
cd scaler

# Install using local chart
helm install scaler charts/scaler \
  --namespace scaler-system \
  --create-namespace
```

Create a BudAIScaler that targets your workload:
```yaml
apiVersion: scaler.bud.studio/v1alpha1
kind: BudAIScaler
metadata:
  name: my-app-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  scalingStrategy: BudScaler
  metricsSources:
    - metricSourceType: inferenceEngine
      targetMetric: gpu_cache_usage_perc
      targetValue: "70"
      inferenceEngineConfig:
        engineType: vllm
        metricsPort: 8000
```
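Apply it like any other Kubernetes resource and watch the scaler's status (standard kubectl workflow; the filename is whatever you saved the manifest as):

```bash
kubectl apply -f my-app-scaler.yaml

# Watch the scaler as it reconciles
kubectl get budaiscaler my-app-scaler -w
```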
A fuller example combining schedule hints, prediction, GPU awareness, cost constraints, and scaling behavior:

```yaml
apiVersion: scaler.bud.studio/v1alpha1
kind: BudAIScaler
metadata:
  name: llm-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  scalingStrategy: BudScaler
  metricsSources:
    - metricSourceType: inferenceEngine
      targetMetric: gpu_cache_usage_perc
      targetValue: "70"
      inferenceEngineConfig:
        engineType: vllm
        metricsPort: 8000
  # Schedule-based scaling hints
  scheduleHints:
    - name: business-hours
      cronExpression: "0 9 * * 1-5"
      targetReplicas: 8
      duration: 10h
  # Adaptive learning and prediction
  predictionConfig:
    enabled: true
    lookAheadMinutes: 15
    enableLearning: true
    minConfidence: 0.7
  # GPU awareness
  gpuConfig:
    enabled: true
    gpuMemoryThreshold: 80
  # Cost constraints
  costConfig:
    enabled: true
    budgetPerHour: "50.00"
    preferSpotInstances: true
  # Scaling behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

With this behavior block, replicas can at most double per minute on scale-up and shrink by at most half per minute on scale-down, subject to the 30s and 300s stabilization windows.

The HPA strategy delegates scaling to a native Kubernetes HPA:
```yaml
spec:
  scalingStrategy: HPA
  metricsSources:
    - metricSourceType: resource
      targetMetric: cpu
      targetValue: "80"
```

The KPA strategy provides Knative-style scaling with panic and stable windows for rapid response:
```yaml
spec:
  scalingStrategy: KPA
  metricsSources:
    - metricSourceType: pod
      targetMetric: requests_per_second
      targetValue: "100"
      port: "8080"
      path: "/metrics"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
    scaleDown:
      stabilizationWindowSeconds: 300
```

The BudScaler strategy is optimized for GenAI workloads with GPU and inference engine awareness:
```yaml
spec:
  scalingStrategy: BudScaler
  metricsSources:
    - metricSourceType: inferenceEngine
      targetMetric: gpu_cache_usage_perc
      targetValue: "70"
      inferenceEngineConfig:
        engineType: vllm
        metricsPort: 8000
```

For vLLM, scale on engine metrics exposed at the metrics endpoint:
```yaml
metricsSources:
  - metricSourceType: inferenceEngine
    targetMetric: gpu_cache_usage_perc # or: num_requests_waiting, num_requests_running
    targetValue: "70"
    inferenceEngineConfig:
      engineType: vllm
      metricsPort: 8000
      metricsPath: "/metrics"
```

For TGI, scale on queue depth:
```yaml
metricsSources:
  - metricSourceType: inferenceEngine
    targetMetric: queue_size
    targetValue: "10"
    inferenceEngineConfig:
      engineType: tgi
      metricsPort: 8080
```

Enable prediction and seasonal learning over any of your configured metrics:
```yaml
spec:
  metricsSources:
    - targetMetric: "vllm:gpu_cache_usage_perc"
      targetValue: "70"
    - targetMetric: "vllm:num_requests_waiting"
      targetValue: "10"
  predictionConfig:
    enabled: true
    enableLearning: true
    lookAheadMinutes: 15
    historicalDataDays: 7
    minConfidence: 0.7
    maxPredictedReplicas: 20
    # Use cache usage for time-series prediction
    predictionMetrics:
      - "vllm:gpu_cache_usage_perc"
    # Track seasonal patterns for both metrics
    seasonalMetrics:
      - "vllm:gpu_cache_usage_perc"
      - "vllm:num_requests_waiting"
```

The learning system tracks:
| Metric | Description |
|---|---|
| `dataPointsCollected` | Total data points collected for learning |
| `detectedPattern` | Current workload pattern (`normal`, `kv_pressure`, `batch_spike`) |
| `patternConfidence` | Confidence in the detected pattern (0-1) |
| `directionAccuracy` | Accuracy of scale up/down direction predictions |
| `overallAccuracy` | Overall prediction accuracy |
| `recentMAE` | Mean Absolute Error of recent predictions |
| `recentMAPE` | Mean Absolute Percentage Error of recent predictions |
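These values are reported on the resource itself. A sketch of how to read them, assuming they are published under a `.status.learningStatus` field (the exact path may differ; `kubectl describe` shows the full status regardless):

```bash
# Full status, including learning metrics
kubectl describe budaiscaler llm-scaler

# Direct read (status path is an assumption)
kubectl get budaiscaler llm-scaler -o jsonpath='{.status.learningStatus}'
```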
Query Prometheus directly with PromQL:

```yaml
metricsSources:
  - metricSourceType: prometheus
    targetMetric: custom_metric
    targetValue: "100"
    endpoint: "http://prometheus.monitoring:9090"
    promQL: "sum(rate(http_requests_total{app='myapp'}[5m]))"
```

The controller's high-level architecture:

```
┌─────────────────────────────────────────────────────────────────────┐
│ BudAIScaler Controller │
├─────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌───────────────────────────────┐│
│ │ Metrics │ │ Algorithms │ │ Scaling Context ││
│ │ Collector │ │ │ │ (Single Source of Truth) ││
│ │ │ │ - HPA │ │ ││
│ │ - Pod │ │ - KPA │ │ - Replica bounds ││
│ │ - Resource │ │ - BudScaler│ │ - Scaling rates ││
│ │ - Prometheus│ │ │ │ - Time windows ││
│ │ - External │ │ │ │ - GPU/Cost/Prediction config ││
│ │ - vLLM/TGI │ │ │ │ ││
│ └─────────────┘ └─────────────┘ └───────────────────────────────┘│
├─────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Adaptive Learning System ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ ││
│ │ │ Pattern │ │ Prediction │ │ Calibration │ ││
│ │ │ Detection │ │ Accuracy │ │ System │ ││
│ │ │ │ │ Tracker │ │ │ ││
│ │ │ - KV Cache │ │ - MAE/MAPE │ │ - Auto-adjusts params │ ││
│ │ │ - Batch Spike│ │ - Direction │ │ - Improves over time │ ││
│ │ │ - Seasonal │ │ - Overall │ │ - ConfigMap persistence │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                   ┌───────────────────────────────┐
                   │        Target Workload        │
                   │   (Deployment/StatefulSet)    │
                   └───────────────────────────────┘
```
For detailed documentation, see the docs directory:
- Prediction Algorithm - How predictive scaling works
- Scaling Decision Hierarchy - Priority order of scaling inputs
See the config/samples directory for complete examples:
- `scaler_v1alpha1_budaiscaler_hpa.yaml` - HPA strategy
- `scaler_v1alpha1_budaiscaler_kpa.yaml` - KPA strategy
- `scaler_v1alpha1_budaiscaler_vllm.yaml` - vLLM scaling
- `scaler_v1alpha1_budaiscaler_tgi.yaml` - TGI scaling
- `scaler_v1alpha1_budaiscaler_full.yaml` - Full configuration with all features
End-to-end tests are available in the e2e/ directory:
```bash
cd e2e/scripts

# Run scaling test (~3 min)
./scaling-test.sh run

# Run learning system test (~5 min)
./learning-test.sh run

# Run prediction accuracy test (~8 min)
./accuracy-test.sh run --quick
```

See e2e/README.md for detailed test documentation.
Chart Location: oci://ghcr.io/budecosystem/charts/scaler
```bash
# Install specific version
helm install scaler oci://ghcr.io/budecosystem/charts/scaler --version 0.1.0

# Upgrade
helm upgrade scaler oci://ghcr.io/budecosystem/charts/scaler

# Uninstall
helm uninstall scaler -n scaler-system
```

| Parameter | Description | Default |
|---|---|---|
| `replicaCount` | Controller replicas | `1` |
| `image.repository` | Controller image | `ghcr.io/budecosystem/scaler` |
| `image.tag` | Image tag | Chart `appVersion` |
| `resources.limits.cpu` | CPU limit | `500m` |
| `resources.limits.memory` | Memory limit | `256Mi` |
| `metrics.enabled` | Enable Prometheus metrics | `true` |
| `gpu.enabled` | Enable GPU metrics | `false` |
| `cost.enabled` | Enable cost tracking | `false` |
| `crds.install` | Install CRDs | `true` |
See charts/scaler/values.yaml for all options.
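Any of these can be overridden at install time; for example, to enable the optional GPU and cost features from the table above:

```bash
helm upgrade --install scaler oci://ghcr.io/budecosystem/charts/scaler \
  --namespace scaler-system \
  --create-namespace \
  --set gpu.enabled=true \
  --set cost.enabled=true
```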
- Go 1.22+
- Docker
- kubectl
- Kind or Minikube (for local testing)
```bash
# Build the controller
make build

# Build Docker image
make docker-build

# Run tests
make test
```

To run the controller locally against your current cluster:

```bash
# Install CRDs
make install

# Run controller locally
make run
```

- Phase 1: Core Autoscaling
  - HPA strategy (K8s HPA wrapper)
  - KPA strategy (panic/stable windows)
  - BudScaler strategy (GenAI-optimized)
  - Metric aggregation with time windows
  - ScalingContext as single source of truth
- Phase 2: Extended Metrics
  - Pod HTTP endpoint metrics
  - Kubernetes resource metrics (CPU/memory)
  - Prometheus/PromQL integration
  - vLLM metrics support
  - TGI metrics support
- Phase 3: Adaptive Learning
  - Pattern detection (KV cache, batch spikes)
  - Seasonal pattern learning
  - Prediction accuracy tracking
  - Self-calibration system
  - ConfigMap persistence
- Phase 4: Schedule Hints
  - Cron-based schedule definitions
  - Independent of prediction config
  - Duration-based activation
- Phase 5: GPU-Aware Scaling
  - GPU metrics provider (DCGM, NVML, Prometheus)
  - GPU memory/compute thresholds
  - GPU topology awareness
  - Multi-GPU support
- Phase 6: Cost-Aware Scaling
  - Cost calculator implementation
  - Budget constraints (hourly, daily)
  - Cloud pricing integration (AWS, GCP, Azure)
  - Spot instance optimization
- Phase 7: Multi-Cluster Support
  - Cross-cluster metrics aggregation
  - Federated scaling decisions
Contributions are welcome! Please read our Contributing Guide for details.
Apache License 2.0 - see LICENSE for details.
This project is inspired by: