feat(autoscaling): CPURequestsRemoveLimitsMemoryRequestsAndLimits + cluster burstable default#49314
feat(autoscaling): CPURequestsRemoveLimitsMemoryRequestsAndLimits + cluster burstable default#49314clamoriniere wants to merge 9 commits into
Conversation
Files inventory check summaryFile checks results against ancestor efc625c8: Results for datadog-agent_7.81.0~devel.git.587.bebf5b2.pipeline.117750425-1_amd64.deb:No change detected |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: efc625c Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_metrics_logs | memory utilization | +1.67 | [+1.41, +1.93] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.53 | [+0.47, +0.59] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.44 | [-0.48, -0.40] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -1.86 | [-2.88, -0.83] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 146.51MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 729.93KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 476.84MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.13MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 178.74MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.02MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 342.14 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 395.55MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.94GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (ddprof)
Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | comparison | 10 | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
|
This pull request has been automatically marked as stale because it has not had activity in the past 15 days. It will be closed in 30 days if no further activity occurs. If this pull request is still relevant, adding a comment or pushing new commits will keep it open. Also, you can always reopen the pull request if you missed the window. Thank you for your contributions! |
cd2fa28 to
6a083de
Compare
|
🎯 Code Coverage (details) 🔗 Commit SHA: bebf5b2 | Docs | Datadog PR Page | Give us feedback! |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
28 successful checks with minimal change (< 2 KiB)
|
|
This pull request has been automatically marked as stale because it has not had activity in the past 15 days. It will be closed in 30 days if no further activity occurs. If this pull request is still relevant, adding a comment or pushing new commits will keep it open. Also, you can always reopen the pull request if you missed the window. Thank you for your contributions! |
bbaa1d6 to
6521372
Compare
LaurieScs
left a comment
There was a problem hiding this comment.
LGTM, even though I'm not an agent expert 😅
e2d86a0 to
b9dae94
Compare
…Limits controlled value Bump datadog-operator/api to v0.0.0-20260414104914-c59fc90bbc2c which introduces the new CPURequestsRemoveLimitsMemoryRequestsAndLimits enum value for container controlledValues. When a container constraint sets this value the autoscaler applies different strategies per resource: - CPU: request recommendation applied, existing CPU limits actively removed from the live pod so the container can burst freely. - Memory: both requests and limits are controlled (RequestsAndLimits semantics unchanged). Changes: - applyVerticalConstraints: strip CPU from recommendation limits when CPURequestsRemoveLimitsMemoryRequestsAndLimits is set so that the backend never pushes a new CPU limit. - patchContainerResources: actively delete any pre-existing CPU limit from the live pod for the same controlled value. - getContainerControlledValues: new helper resolving ControlledValues from spec constraints (specific name > wildcard). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…estsAndLimits Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…val to pod patcher
Replace the `controlledValues` parameter threaded through `patchPod` /
`patchContainerResources` with a sentinel approach: `applyVerticalConstraints`
inserts `resource.MustParse("-1")` into `ContainerResources.Limits[cpu]` to
signal that any pre-existing CPU limit must be actively deleted from the live
pod. `patchContainerResources` detects the sentinel via `Cmp()` and deletes
the limit entry, keeping the function signatures clean.
The insertion is split into two phases: phase 1 deletes the CPU limit before
the clamping and `limits >= requests` invariant check; phase 2 inserts the
sentinel after the invariant check to prevent it from being overwritten.
`BuildStatus` is updated to call `ContainerResourcesForStatus()` (new helper
on `VerticalScalingValues`) which strips any negative-quantity limit entries
before writing to the DPA status, so the sentinel never leaks into the CRD.
Assisted-by: Claude:claude-sonnet-4-6
…ernalBuilder Introduce `autoscaling.workload.options.burstable` (default: true) as the lowest-priority fallback in the IsBurstable() priority chain: spec.options.burstable > preview annotation > cluster config default The config value is read once at startup in provider.go and injected into a *PodAutoscalerInternalBuilder, which is threaded through to all three PodAutoscalerInternal construction paths (NewFromKubernetes, NewFromSettings, NewFromProfile), replacing the former standalone constructor functions. This avoids per-reconcile config lookups and ensures a single shared builder instance across the controller, config retriever, and autoscaler syncer. Also bump datadog-operator/api to v0.0.0-20260503193602-adf766128732. Assisted-by: Claude:claude-sonnet-4-6
When `spec.options.burstable` and the preview annotation are both unset, the effective burstable mode now falls back to `false` for pods that are in Guaranteed QOS (requests == limits for all containers), rather than propagating the cluster-level default. This prevents the autoscaler from silently changing a pod's QOS class by removing its CPU limit when the user has not explicitly requested burstable mode. Priority chain in IsBurstable(): 1. spec.options.burstable (explicit, highest) 2. preview annotation (explicit via profile) 3. podsGuaranteedQOS=true → false (new: protects Guaranteed QOS) 4. cluster-level default (lowest) Implementation: - Add podsGuaranteedQOS bool field to PodAutoscalerInternal and a SetPodsGuaranteedQOS setter called by the vertical controller each sync cycle after fetching the current pods. - Add isPodsGuaranteedQOS helper that returns true only when every pod for the workload has QOSClass == Guaranteed. - Move the pod fetch before applyVerticalConstraints in sync() so the QOS state is available when IsBurstable() is first consulted. Assisted-by: Claude:claude-sonnet-4-6
…kend recommendation When mainScalingValues.Vertical is nil (e.g. right after a cluster-agent restart, before the first RC push), SetActiveScalingValues was self-assigning scalingValues.Vertical. This carried the burstable CPU-limit sentinel (-1) forward from the previous SetConstrainedVerticalScaling call. applyVerticalConstraints(burstable=false) then early-returned because there were no constraints and burstable was off, leaving the burstable hash as recommendationID. Pods already carried that hash, so shouldTriggerRollout declared the rollout complete and CPU limits were never restored. Fix: set scalingValues.Vertical = nil when verticalActiveSource is nil. sync() already exits early on a nil vertical recommendation, which is the correct behaviour when no authoritative recommendation is available yet. The rollout fires as soon as the next RC push populates mainScalingValues. Assisted-by: Claude:claude-sonnet-4-6
Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com>
b9dae94 to
be790ac
Compare
Update go.sum checksums to match the go.mod reference to github.com/DataDog/datadog-operator/api v0.0.0-20260515125012-8e158b708444. Assisted-by: Claude:claude-sonnet-4-6
f2fd0d5 to
ea54032
Compare
|
Hi @vboulineau, thanks for the review. Following your feedback I scoped the PR down significantly. Since we decided not to enable the burstable option by default at the cluster level anymore, the global config setting, the I also walked back the unnecessary cosmetic changes (function renames, blank lines, reworded doc comments) so the diff stays focused on the actual new functionality. Ready for another round when you have time — hopefully closer to what you had in mind this time. |
Remove cluster-level burstable config, PodAutoscalerInternalBuilder, and podsGuaranteedQOS field. IsBurstable() now takes no arguments and resolves purely from spec.options.burstable or the preview annotation. Also: delete dead containerResourcesForStatus, tighten removeLimitSentinel comment, revert resourceName rename in pod_patcher, and collapse the IsBurstable test matrix. Assisted-by: Claude:claude-opus-4-7
ea54032 to
bebf5b2
Compare
Summary
1.
CPURequestsRemoveLimitsMemoryRequestsAndLimitscontrolled valueImplements the new
CPURequestsRemoveLimitsMemoryRequestsAndLimitscontainercontrolledValuesenum from the datadog-operator (c59fc90).When set on a container constraint:
RequestsAndLimitssemantics).The removal flows via a sentinel quantity (
-1) stored onContainerResources.Limits[cpu]byapplyVerticalConstraintsand recognised by the pod patcher, which actively deletes any pre-existing CPU limit from the live pod (an absent entry would mean "leave untouched").2. Recognise
spec.options.burstableinIsBurstable()IsBurstable()now resolves fromspec.options.burstablefirst, then the existing preview annotation. There is no cluster-level default — burstable mode requires an explicit per-DPA opt-in via either signal.3. Report effective burstable mode in
status.options.burstableBuildStatuspopulatesstatus.options.burstableso consumers can observe the resolved value:spec.options.burstableis set: reported as-is (true or false).true.4. Fix stale-sentinel propagation when the backend stops emitting a vertical recommendation
SetActiveScalingValuespreviously self-assignedp.scalingValues.Verticalwhen the active vertical source wasnil(becauseselectScalingValues(nil)returnsp.scalingValues). This kept any burstable sentinel from the previous cycle alive and causedapplyVerticalConstraints(burstable=false)to early-return, suppressing the rollout when burstable was toggled off. NowVerticalis explicitly reset tonilso the next sync sees "no recommendation" and clears live state.5. Dependency bump
go.mod:datadog-operator/api→v0.0.0-20260515125012-8e158b708444.Test plan
TestApplyVerticalConstraints_CPURequestsRemoveLimits— CPU limit stripped from recommendation, memory limit preserved, sentinel inserted.TestPatchContainerResources_*/TestPatchPod_*— CPU limit actively removed from live pod, idempotent when already absent.IsBurstable priority matrix— spec > annotation > default false.BuildStatusreportsoptions.burstablefor spec (true/false) and for annotation-enabled, omits when neither is set.TestSetActiveScalingValues_NilSource_ClearsVertical— nil vertical source clearsscalingValues.Verticalinstead of self-assigning the stale sentinel.dda inv test --targets=./pkg/clusteragent/autoscaling/workload/...dda inv linter.go --targets ./pkg/clusteragent/autoscaling/workload/...status.options.burstable: true; DPA without explicit opt-in reports no options (defaults to non-burstable).🤖 PR description and code assisted by Claude:claude-opus-4-7