feat: v0.4.0 Policy Engine (namespace-level and label-based rules)#11
shenxianpeng wants to merge 1 commit into main from
Conversation
…e rules

Adds a ConfigMap-driven policy engine that automatically enrolls pods into GPU slice optimization based on namespace or pod label conditions, without requiring the `castops.io/optimize` annotation on each pod.

Key changes:
- internal/webhook/policy.go: `SlicePolicy`/`PolicyRule` types, `PolicyLoader` interface, `ConfigMapPolicyLoader` (reads the `cast-slice-policy` ConfigMap), `MatchPolicy`/`RuleMatches` matching functions
- pod_webhook.go: `WithPolicyLoader` option, policy-aware `resolveSliceRatio` (pod annotation > policy rule ratio > policy rule workload type > default)
- metrics.go: `castslice_policy_enforced_total` counter
- main.go: `--policy-namespace` flag + `POD_NAMESPACE` env var wiring
- config/deploy/deployment.yaml: Role/RoleBinding for ConfigMap read, `POD_NAMESPACE` env var via Downward API
- config/policy/cast-slice-policy-example.yaml: example policy ConfigMap
- Tests: 27 new test cases; coverage 90.5% (above the 80% CI gate)

https://claude.ai/code/session_01UyecCdVqoXvrbmLGHM1oqV
Pull request overview
This PR introduces a ConfigMap-driven Policy Engine for the pod mutating webhook, enabling namespace-level and/or label-based automatic enrollment into GPU slice optimization while preserving annotation-based overrides and fail-open behavior on policy load failures.
Changes:
- Added `SlicePolicy`/`PolicyRule` model plus a `ConfigMapPolicyLoader` and matching logic.
- Updated the pod webhook to optionally evaluate policy rules (via `WithPolicyLoader`) and added a new `castslice_policy_enforced_total` metric.
- Wired policy namespace configuration (`--policy-namespace`/`POD_NAMESPACE`) and added deployment RBAC + a Downward API env var, alongside example policy YAML and a changelog entry.
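For orientation, the `cast-slice-policy` ConfigMap this PR reads might look roughly like the sketch below. This is a hypothetical example, not the file from the PR: the data key name and the rule field names (`namespaces`, `labelSelector`, `sliceRatio`, `workloadType`) are inferred from identifiers in the diff, and the namespaces and values are invented.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cast-slice-policy
  namespace: cast-slice   # the webhook's own namespace (POD_NAMESPACE)
data:
  policy.json: |          # key name is a guess; the code reads policyConfigMapKey
    {
      "rules": [
        {
          "namespaces": ["ml-inference"],
          "labelSelector": {"team": "search"},
          "sliceRatio": 8
        },
        {
          "namespaces": ["batch"],
          "workloadType": "training"
        }
      ]
    }
```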
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| main.go | Adds policy namespace selection and wires a ConfigMap-backed policy loader into the webhook handler. |
| internal/webhook/policy.go | Implements policy data types, ConfigMap loader, and rule matching helpers. |
| internal/webhook/policy_test.go | Adds unit tests for rule matching and ConfigMapPolicyLoader behaviors. |
| internal/webhook/pod_webhook.go | Adds policy evaluation flow, updates slice ratio resolution precedence, and tracks policy-driven enforcement. |
| internal/webhook/pod_webhook_test.go | Adds policy-path test cases covering matching, overrides, metrics, and error paths. |
| internal/webhook/metrics.go | Registers the new castslice_policy_enforced_total counter. |
| config/deploy/deployment.yaml | Adds Role/RoleBinding for ConfigMap reads and injects POD_NAMESPACE via Downward API. |
| config/policy/cast-slice-policy-example.yaml | Provides an example policy ConfigMap for operators. |
| CHANGELOG.md | Documents v0.4.0 feature additions and changes. |
```go
if pod.Labels[k] != v {
	return false
}
```
RuleMatches compares pod.Labels[k] to v without checking label presence. If a policy rule sets a labelSelector value to the empty string, this will incorrectly match pods that lack the label entirely (nil/absent labels read as ""). Consider checking for key existence (using the "ok" result from the map lookup) so that missing labels never satisfy a selector entry.
Suggested change:

```go
if pod.Labels == nil {
	return false
}
if podVal, ok := pod.Labels[k]; !ok || podVal != v {
	return false
}
```
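To see the failure mode concretely, here is a minimal standalone sketch (not the PR's actual code) of both comparison styles, run against a pod with no labels and a selector entry whose value is the empty string:

```go
package main

import "fmt"

// ruleMatchesNoPresence mimics the original comparison: a missing label
// reads as "" on a nil/absent map entry, so an empty-string selector
// value matches pods that lack the label entirely.
func ruleMatchesNoPresence(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// ruleMatchesPresence uses the two-value lookup: a missing key never
// satisfies a selector entry, even when the selector value is "".
func ruleMatchesPresence(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	sel := map[string]string{"team": ""} // selector with an empty value
	fmt.Println(ruleMatchesNoPresence(nil, sel)) // true  (bug: label is absent, yet the rule matches)
	fmt.Println(ruleMatchesPresence(nil, sel))   // false (missing label rejected)
}
```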
```go
if rule != nil && rule.SliceRatio > 0 {
	if rule.SliceRatio > maxSliceRatio {
		return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
	}
	return rule.SliceRatio, nil
```
PolicyRule.SliceRatio is only validated when it’s > 0; negative values (e.g. -1 in policy JSON) are silently treated as “unset” and fall through to workloadType/default. That’s inconsistent with the stated 1–64 range and can mask policy misconfiguration. Consider validating that a non-zero SliceRatio is within [1, 64] (reject < 0 and > 64; decide explicitly whether 0 is allowed as “unset”).
Suggested change:

```go
if rule != nil {
	if rule.SliceRatio < 0 || rule.SliceRatio > maxSliceRatio {
		return 0, fmt.Errorf("policy sliceRatio %d must be between 1 and %d, or 0 to use workloadType/default", rule.SliceRatio, maxSliceRatio)
	}
	if rule.SliceRatio > 0 {
		return rule.SliceRatio, nil
	}
```
```go
// Load fetches and parses the policy ConfigMap. Returns (nil, nil) when the
// ConfigMap does not exist — absence of a policy is not an error.
func (l *ConfigMapPolicyLoader) Load(ctx context.Context) (*SlicePolicy, error) {
	cm := &corev1.ConfigMap{}
	key := types.NamespacedName{Namespace: l.namespace, Name: policyConfigMapName}
	if err := l.reader.Get(ctx, key, cm); err != nil {
		if errors.IsNotFound(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("get policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
	}

	raw, ok := cm.Data[policyConfigMapKey]
	if !ok {
		return nil, nil
	}

	var policy SlicePolicy
	if err := json.Unmarshal([]byte(raw), &policy); err != nil {
		return nil, fmt.Errorf("parse policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
	}
```
ConfigMapPolicyLoader.Load performs a ConfigMap Get and JSON unmarshal every time policy is evaluated (i.e., on each non-opt-in Pod admission). This adds avoidable admission-path latency and CPU overhead, especially if the policy grows. Consider caching the parsed policy in-memory (e.g., keyed by ConfigMap resourceVersion with a short TTL, or via an informer/watch) and only re-parsing when it changes.
```markdown
### Changed
- `NewPodGPUSliceHandler` now accepts variadic `Option` arguments (`WithPolicyLoader`) — fully backward-compatible; existing callers need no changes
- `resolveSliceRatio` extended with policy-rule fallback levels (priority: pod annotation > policy rule explicit ratio > policy rule workload type > default 1)
```
The changelog’s stated resolveSliceRatio priority order is incomplete/ambiguous vs the actual implementation (which is slice-ratio annotation > workload-type annotation > policy sliceRatio > policy workloadType > default). Updating this line to match the real precedence will help operators reason about overrides correctly.
Suggested change:

```markdown
- `resolveSliceRatio` extended with policy-rule fallback levels (priority: `castops.io/slice-ratio` annotation > `castops.io/workload-type` annotation > policy rule `sliceRatio` > policy rule `workloadType` > default 1)
```
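For intuition, the five-level precedence in the corrected wording can be sketched as a chain of fallbacks. This is an illustration, not the PR's code: the type, the ratio table, and the helper name are invented, and the real implementation lives in `pod_webhook.go`.

```go
package main

import (
	"fmt"
	"strconv"
)

// policyRule is a hypothetical stand-in for the PR's PolicyRule.
type policyRule struct {
	SliceRatio   int
	WorkloadType string
}

// workloadRatios maps workload types to ratios; the values are illustrative.
var workloadRatios = map[string]int{"inference": 8, "training": 2}

// resolveRatio mirrors the precedence: slice-ratio annotation >
// workload-type annotation > policy sliceRatio > policy workloadType > default 1.
func resolveRatio(annotations map[string]string, rule *policyRule) int {
	if s, ok := annotations["castops.io/slice-ratio"]; ok {
		if n, err := strconv.Atoi(s); err == nil {
			return n
		}
	}
	if wt, ok := annotations["castops.io/workload-type"]; ok {
		if n, ok := workloadRatios[wt]; ok {
			return n
		}
	}
	if rule != nil {
		if rule.SliceRatio > 0 {
			return rule.SliceRatio
		}
		if n, ok := workloadRatios[rule.WorkloadType]; ok {
			return n
		}
	}
	return 1
}

func main() {
	rule := &policyRule{SliceRatio: 4}
	fmt.Println(resolveRatio(map[string]string{"castops.io/slice-ratio": "16"}, rule)) // 16: annotation wins
	fmt.Println(resolveRatio(nil, rule))                                              // 4: policy ratio
	fmt.Println(resolveRatio(nil, &policyRule{WorkloadType: "training"}))             // 2: policy workload type
	fmt.Println(resolveRatio(nil, nil))                                               // 1: default
}
```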
Summary

- Adds a ConfigMap-driven `SlicePolicy` that automatically enrolls pods into GPU slice optimization based on namespace and/or pod label conditions; no per-pod annotation required
- Pods already opted in via `castops.io/optimize: "true"` are unaffected; pod-level annotations always override policy values
- New metric `castslice_policy_enforced_total` tracks policy-driven mutations separately from annotation-driven ones

How it works
Create a `cast-slice-policy` ConfigMap in the `cast-slice` namespace. Rules are evaluated in order; first match wins. Namespace and label conditions are AND-ed within a single rule. See `config/policy/cast-slice-policy-example.yaml` for a full example.

Priority order (highest → lowest)
1. `castops.io/slice-ratio` pod annotation
2. `castops.io/workload-type` pod annotation
3. `PolicyRule.sliceRatio`
4. `PolicyRule.workloadType`

Files changed
- internal/webhook/policy.go: `SlicePolicy`, `PolicyRule`, `PolicyLoader`, `ConfigMapPolicyLoader`, `MatchPolicy`, `RuleMatches`
- internal/webhook/pod_webhook.go: `WithPolicyLoader` option, policy-aware `resolveSliceRatio`, fail-open policy evaluation in `Handle`
- internal/webhook/metrics.go: `castslice_policy_enforced_total` counter
- main.go: `--policy-namespace` flag + `POD_NAMESPACE` env var wiring
- internal/webhook/policy_test.go
- internal/webhook/pod_webhook_test.go
- config/deploy/deployment.yaml: `POD_NAMESPACE` Downward API injection
- config/policy/cast-slice-policy-example.yaml
- CHANGELOG.md

Test plan
- `go test -race -count=2 ./internal/...`: all pass, no race conditions
- `gofmt -l .`: no formatting issues
- `policy_enforced_total` not incremented for annotation-driven mutations
- `sliceRatio > 64` rejected
- invalid `workloadType` rejected
- `ConfigMapPolicyLoader`: not-found, get error, missing key, invalid JSON, valid JSON