feat: v0.4.0 Policy Engine (namespace-level and label-based rules) by shenxianpeng · Pull Request #11 · castops/cast-slice

shenxianpeng · 2026-03-28T06:33:14Z

Summary

Policy Engine: Adds a ConfigMap-driven SlicePolicy that automatically enrolls pods into GPU slice optimization based on namespace and/or pod label conditions — no per-pod annotation required
Fail-open design: Policy load errors are non-fatal; the webhook allows pods through and falls back to annotation-only behavior
Full backward compatibility: Existing pods with castops.io/optimize: "true" are unaffected; pod-level annotations always override policy values
New metric: castslice_policy_enforced_total tracks policy-driven mutations separately from annotation-driven ones
27 new test cases; coverage raised from 77.8% to 90.5%
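
The fail-open behavior in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Policy`, `Loader`, `decide`, and `errUnavailable` are simplified, hypothetical stand-ins for `SlicePolicy`, `PolicyLoader`, and the logic in `Handle`, and the policy here matches on namespace only.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a simplified stand-in for SlicePolicy: it only lists namespaces.
type Policy struct {
	Namespaces []string
}

// Loader is a simplified stand-in for PolicyLoader.
type Loader func() (*Policy, error)

// errUnavailable simulates a failed ConfigMap read.
var errUnavailable = errors.New("policy ConfigMap unavailable")

// decide reports whether a pod is enrolled and whether the policy caused it.
func decide(optIn bool, namespace string, load Loader) (enrolled, viaPolicy bool) {
	if optIn {
		// Annotation path: the policy is not consulted at all.
		return true, false
	}
	policy, err := load()
	if err != nil || policy == nil {
		// Fail open: act as if no policy exists and admit the pod unchanged.
		return false, false
	}
	for _, ns := range policy.Namespaces {
		if ns == namespace {
			return true, true
		}
	}
	return false, false
}

func main() {
	broken := func() (*Policy, error) { return nil, errUnavailable }
	working := func() (*Policy, error) {
		return &Policy{Namespaces: []string{"ml-team"}}, nil
	}

	fmt.Println(decide(false, "ml-team", broken))  // false false
	fmt.Println(decide(true, "ml-team", broken))   // true false
	fmt.Println(decide(false, "ml-team", working)) // true true
}
```

The second return value corresponds to the case that castslice_policy_enforced_total is meant to count: a mutation driven by policy rather than by annotation.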

How it works

Create a cast-slice-policy ConfigMap in the cast-slice namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cast-slice-policy
  namespace: cast-slice
data:
  policy.json: |
    {
      "rules": [
        {
          "namespaces": ["ml-team", "gpu-lab"],
          "workloadType": "training"
        },
        {
          "labelSelector": {"cast-slice/workload": "inference"},
          "workloadType": "inference"
        },
        {
          "namespaces": ["dev"],
          "labelSelector": {"gpu": "shared"},
          "sliceRatio": 8
        }
      ]
    }

Rules are evaluated in order; first match wins. Namespace and label conditions are AND-ed within a single rule. See config/policy/cast-slice-policy-example.yaml for a full example.
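
A rough sketch of the matching semantics, using the three rules from the ConfigMap above. The names `Rule`, `ruleMatches`, and `matchPolicy` are illustrative rather than the PR's actual `PolicyRule`, `RuleMatches`, and `MatchPolicy`. Two points are assumptions: an omitted condition places no constraint, and a rule with no conditions at all matches nothing.

```go
package main

import "fmt"

// Rule is a simplified stand-in for PolicyRule.
type Rule struct {
	Namespaces    []string
	LabelSelector map[string]string
	WorkloadType  string
	SliceRatio    int
}

// ruleMatches applies the namespace and label conditions with AND semantics.
func ruleMatches(r Rule, namespace string, labels map[string]string) bool {
	if len(r.Namespaces) == 0 && len(r.LabelSelector) == 0 {
		return false // assumption: a rule without conditions matches nothing
	}
	if len(r.Namespaces) > 0 {
		inList := false
		for _, ns := range r.Namespaces {
			if ns == namespace {
				inList = true
				break
			}
		}
		if !inList {
			return false
		}
	}
	for key, want := range r.LabelSelector {
		// The label must be present and equal to the selector value.
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// matchPolicy returns the index of the first matching rule, or -1.
func matchPolicy(rules []Rule, namespace string, labels map[string]string) int {
	for i, r := range rules {
		if ruleMatches(r, namespace, labels) {
			return i
		}
	}
	return -1
}

// exampleRules mirrors the policy.json shown above.
var exampleRules = []Rule{
	{Namespaces: []string{"ml-team", "gpu-lab"}, WorkloadType: "training"},
	{LabelSelector: map[string]string{"cast-slice/workload": "inference"}, WorkloadType: "inference"},
	{Namespaces: []string{"dev"}, LabelSelector: map[string]string{"gpu": "shared"}, SliceRatio: 8},
}

func main() {
	inference := map[string]string{"cast-slice/workload": "inference"}
	fmt.Println(matchPolicy(exampleRules, "ml-team", inference)) // 0: first match wins
	fmt.Println(matchPolicy(exampleRules, "prod", inference))    // 1
	fmt.Println(matchPolicy(exampleRules, "dev", nil))           // -1: label condition fails
}
```

Because the first match wins, an inference-labeled pod in ml-team is handled by the namespace rule, so rule order in the ConfigMap matters.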

Priority order (highest → lowest)

castops.io/slice-ratio pod annotation
castops.io/workload-type pod annotation
PolicyRule.sliceRatio
PolicyRule.workloadType
Default ratio of 1
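
The precedence list above could be expressed along these lines. This is a sketch under stated assumptions, not the PR's `resolveSliceRatio`: the `workloadRatios` values are placeholders (the real mapping from workload type to ratio is not shown here), and the error handling for invalid annotations is assumed. The policy branch follows the diff quoted later in the review.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	annSliceRatio   = "castops.io/slice-ratio"
	annWorkloadType = "castops.io/workload-type"
	maxSliceRatio   = 64
)

// workloadRatios holds placeholder values for illustration only.
var workloadRatios = map[string]int{"training": 2, "inference": 4}

// Rule is a simplified stand-in for PolicyRule.
type Rule struct {
	SliceRatio   int
	WorkloadType string
}

func ratioForWorkload(workloadType string) (int, error) {
	ratio, ok := workloadRatios[workloadType]
	if !ok {
		return 0, fmt.Errorf("unknown workload type %q", workloadType)
	}
	return ratio, nil
}

// resolveSliceRatio walks the five levels from highest to lowest priority.
func resolveSliceRatio(annotations map[string]string, rule *Rule) (int, error) {
	// 1. castops.io/slice-ratio pod annotation
	if raw, ok := annotations[annSliceRatio]; ok {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 || n > maxSliceRatio {
			return 0, fmt.Errorf("invalid %s %q: want 1-%d", annSliceRatio, raw, maxSliceRatio)
		}
		return n, nil
	}
	// 2. castops.io/workload-type pod annotation
	if workloadType, ok := annotations[annWorkloadType]; ok {
		return ratioForWorkload(workloadType)
	}
	// 3. PolicyRule.sliceRatio
	if rule != nil && rule.SliceRatio > 0 {
		if rule.SliceRatio > maxSliceRatio {
			return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
		}
		return rule.SliceRatio, nil
	}
	// 4. PolicyRule.workloadType
	if rule != nil && rule.WorkloadType != "" {
		return ratioForWorkload(rule.WorkloadType)
	}
	// 5. Default ratio of 1
	return 1, nil
}

func main() {
	rule := &Rule{SliceRatio: 8, WorkloadType: "inference"}
	fmt.Println(resolveSliceRatio(map[string]string{annSliceRatio: "16"}, rule)) // 16 <nil>
	fmt.Println(resolveSliceRatio(nil, rule))                                     // 8 <nil>
	fmt.Println(resolveSliceRatio(nil, nil))                                      // 1 <nil>
}
```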

Files changed

| File | Change |
| --- | --- |
| `internal/webhook/policy.go` | New: `SlicePolicy`, `PolicyRule`, `PolicyLoader`, `ConfigMapPolicyLoader`, `MatchPolicy`, `RuleMatches` |
| `internal/webhook/pod_webhook.go` | `WithPolicyLoader` option, policy-aware `resolveSliceRatio`, fail-open policy evaluation in `Handle` |
| `internal/webhook/metrics.go` | `castslice_policy_enforced_total` counter |
| `main.go` | `--policy-namespace` flag + `POD_NAMESPACE` env var wiring |
| `internal/webhook/policy_test.go` | New: unit tests for loader, matcher, JSON parsing |
| `internal/webhook/pod_webhook_test.go` | 27 new policy-path test cases |
| `config/deploy/deployment.yaml` | RBAC Role/RoleBinding for ConfigMap read; `POD_NAMESPACE` Downward API injection |
| `config/policy/cast-slice-policy-example.yaml` | New: annotated example policy |
| `CHANGELOG.md` | v0.4.0 entry |

Test plan

…e rules

Adds a ConfigMap-driven policy engine that automatically enrolls pods into GPU slice optimization based on namespace or pod label conditions, without requiring the castops.io/optimize annotation on each pod.

Key changes:
- internal/webhook/policy.go: SlicePolicy/PolicyRule types, PolicyLoader interface, ConfigMapPolicyLoader (reads cast-slice-policy ConfigMap), MatchPolicy/RuleMatches matching functions
- pod_webhook.go: WithPolicyLoader option, policy-aware resolveSliceRatio (pod annotation > policy rule ratio > policy rule workload type > default)
- metrics.go: castslice_policy_enforced_total counter
- main.go: --policy-namespace flag + POD_NAMESPACE env var wiring
- config/deploy/deployment.yaml: Role/RoleBinding for ConfigMap read, POD_NAMESPACE env var via Downward API
- config/policy/cast-slice-policy-example.yaml: example policy ConfigMap
- Tests: 27 new test cases; coverage 90.5% (>80% CI gate)

https://claude.ai/code/session_01UyecCdVqoXvrbmLGHM1oqV

Copilot

Pull request overview

This PR introduces a ConfigMap-driven Policy Engine for the pod mutating webhook, enabling namespace-level and/or label-based automatic enrollment into GPU slice optimization while preserving annotation-based overrides and fail-open behavior on policy load failures.

Changes:

Added SlicePolicy / PolicyRule model plus a ConfigMapPolicyLoader and matching logic.
Updated the pod webhook to optionally evaluate policy rules (via WithPolicyLoader) and added a new castslice_policy_enforced_total metric.
Wired policy namespace configuration (--policy-namespace / POD_NAMESPACE) and added deployment RBAC + Downward API env var, alongside example policy YAML and changelog entry.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `main.go` | Adds policy namespace selection and wires a ConfigMap-backed policy loader into the webhook handler. |
| `internal/webhook/policy.go` | Implements policy data types, ConfigMap loader, and rule matching helpers. |
| `internal/webhook/policy_test.go` | Adds unit tests for rule matching and ConfigMapPolicyLoader behaviors. |
| `internal/webhook/pod_webhook.go` | Adds policy evaluation flow, updates slice ratio resolution precedence, and tracks policy-driven enforcement. |
| `internal/webhook/pod_webhook_test.go` | Adds policy-path test cases covering matching, overrides, metrics, and error paths. |
| `internal/webhook/metrics.go` | Registers the new `castslice_policy_enforced_total` counter. |
| `config/deploy/deployment.yaml` | Adds Role/RoleBinding for ConfigMap reads and injects `POD_NAMESPACE` via Downward API. |
| `config/policy/cast-slice-policy-example.yaml` | Provides an example policy ConfigMap for operators. |
| `CHANGELOG.md` | Documents v0.4.0 feature additions/changes. |

Copilot · 2026-03-28T06:56:45Z

internal/webhook/policy.go

+		if pod.Labels[k] != v {
+			return false
+		}


RuleMatches compares pod.Labels[k] to v without checking label presence. If a policy rule sets a labelSelector value to the empty string, this will incorrectly match pods that lack the label entirely (nil/absent labels read as ""). Consider checking for key existence (using the "ok" result from the map lookup) so that missing labels never satisfy a selector entry.

Suggested change

-		if pod.Labels[k] != v {
-			return false
-		}
+		if pod.Labels == nil {
+			return false
+		}
+		if podVal, ok := pod.Labels[k]; !ok || podVal != v {
+			return false
+		}
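
To make the empty-string case concrete, here is a small standalone comparison of the two lookups. The function names `matchLoose` and `matchStrict` are illustrative and do not appear in the PR; they operate on plain maps rather than on a pod object.

```go
package main

import "fmt"

// matchLoose mirrors the flagged comparison: a missing label reads as "".
func matchLoose(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// matchStrict uses the comma-ok form so a missing label never matches.
func matchStrict(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"gpu": ""}
	var unlabeled map[string]string // nil, like a pod with no labels

	fmt.Println(matchLoose(selector, unlabeled))                     // true
	fmt.Println(matchStrict(selector, unlabeled))                    // false
	fmt.Println(matchStrict(selector, map[string]string{"gpu": ""})) // true
}
```

In Go, indexing a nil map yields the zero value with ok set to false, so the comma-ok lookup already covers pods without any labels; the separate nil check in the suggestion is harmless but not strictly required.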

Copilot · 2026-03-28T06:56:46Z

internal/webhook/pod_webhook.go

+	if rule != nil && rule.SliceRatio > 0 {
+		if rule.SliceRatio > maxSliceRatio {
+			return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
+		}
+		return rule.SliceRatio, nil


PolicyRule.SliceRatio is only validated when it’s > 0; negative values (e.g. -1 in policy JSON) are silently treated as “unset” and fall through to workloadType/default. That’s inconsistent with the stated 1–64 range and can mask policy misconfiguration. Consider validating that a non-zero SliceRatio is within [1, 64] (reject < 0 and > 64; decide explicitly whether 0 is allowed as “unset”).

Suggested change

-	if rule != nil && rule.SliceRatio > 0 {
-		if rule.SliceRatio > maxSliceRatio {
-			return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
-		}
-		return rule.SliceRatio, nil
+	if rule != nil {
+		if rule.SliceRatio < 0 || rule.SliceRatio > maxSliceRatio {
+			return 0, fmt.Errorf("policy sliceRatio %d must be between 1 and %d, or 0 to use workloadType/default", rule.SliceRatio, maxSliceRatio)
+		}
+		if rule.SliceRatio > 0 {
+			return rule.SliceRatio, nil
+		}
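
The range check proposed here can be isolated into a small helper for illustration. `policyRatio` is a hypothetical name, and treating 0 as "unset" is the convention the suggestion proposes rather than something the PR is confirmed to adopt.

```go
package main

import "fmt"

const maxSliceRatio = 64

// policyRatio validates a policy sliceRatio value.
// It returns (ratio, true, nil) for 1..maxSliceRatio, (0, false, nil) for 0
// (treated as unset), and an error for anything else.
func policyRatio(sliceRatio int) (ratio int, set bool, err error) {
	if sliceRatio < 0 || sliceRatio > maxSliceRatio {
		return 0, false, fmt.Errorf(
			"policy sliceRatio %d must be between 1 and %d, or 0 to use workloadType/default",
			sliceRatio, maxSliceRatio)
	}
	if sliceRatio == 0 {
		return 0, false, nil
	}
	return sliceRatio, true, nil
}

func main() {
	for _, v := range []int{-1, 0, 8, 65} {
		ratio, set, err := policyRatio(v)
		fmt.Printf("input=%d ratio=%d set=%t err=%v\n", v, ratio, set, err)
	}
}
```

With this shape, a negative value in policy JSON surfaces as an error instead of silently falling through to workloadType or the default.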

Copilot · 2026-03-28T06:56:46Z

internal/webhook/policy.go

+// Load fetches and parses the policy ConfigMap. Returns (nil, nil) when the
+// ConfigMap does not exist — absence of a policy is not an error.
+func (l *ConfigMapPolicyLoader) Load(ctx context.Context) (*SlicePolicy, error) {
+	cm := &corev1.ConfigMap{}
+	key := types.NamespacedName{Namespace: l.namespace, Name: policyConfigMapName}
+	if err := l.reader.Get(ctx, key, cm); err != nil {
+		if errors.IsNotFound(err) {
+			return nil, nil
+		}
+		return nil, fmt.Errorf("get policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
+	}
+
+	raw, ok := cm.Data[policyConfigMapKey]
+	if !ok {
+		return nil, nil
+	}
+
+	var policy SlicePolicy
+	if err := json.Unmarshal([]byte(raw), &policy); err != nil {
+		return nil, fmt.Errorf("parse policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
+	}


ConfigMapPolicyLoader.Load performs a ConfigMap Get and JSON unmarshal every time policy is evaluated (i.e., on each non-opt-in Pod admission). This adds avoidable admission-path latency and CPU overhead, especially if the policy grows. Consider caching the parsed policy in-memory (e.g., keyed by ConfigMap resourceVersion with a short TTL, or via an informer/watch) and only re-parsing when it changes.
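
One way to read the TTL part of this suggestion is a thin caching wrapper around the loader. The sketch below is an assumption-laden illustration, not code from the PR: `cachingLoader`, `newCachingLoader`, and `fakeClock` are hypothetical names, `Policy` is a simplified stand-in for `SlicePolicy`, and the resourceVersion and informer variants are not shown.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Policy is a simplified stand-in for SlicePolicy.
type Policy struct {
	Rules []string
}

// cachingLoader is a hypothetical wrapper that reuses a loaded policy
// until its TTL expires.
type cachingLoader struct {
	mu      sync.Mutex
	load    func() (*Policy, error)
	ttl     time.Duration
	now     func() time.Time
	cached  *Policy
	expires time.Time
	filled  bool
}

func newCachingLoader(load func() (*Policy, error), ttl time.Duration, now func() time.Time) *cachingLoader {
	return &cachingLoader{load: load, ttl: ttl, now: now}
}

// Load returns the cached policy while it is fresh; otherwise it refetches.
// Errors are not cached, so the next call retries the underlying loader.
func (c *cachingLoader) Load() (*Policy, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.filled && c.now().Before(c.expires) {
		return c.cached, nil
	}
	policy, err := c.load()
	if err != nil {
		return nil, err
	}
	// A nil policy (no ConfigMap) is cached too, via the filled flag.
	c.cached, c.expires, c.filled = policy, c.now().Add(c.ttl), true
	return policy, nil
}

// fakeClock is a manually advanced clock for the demo.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	fetches := 0
	clock := &fakeClock{}
	loader := newCachingLoader(func() (*Policy, error) {
		fetches++
		return &Policy{}, nil
	}, 30*time.Second, clock.Now)

	loader.Load()
	loader.Load() // served from cache
	clock.Advance(31 * time.Second)
	loader.Load() // TTL expired, fetched again
	fmt.Println("underlying fetches:", fetches) // 2
}
```

Holding the mutex across the fetch keeps the sketch simple and prevents duplicate reads, at the cost of serializing admissions during a refresh. A TTL also means policy edits take up to one TTL to apply, which is the trade-off the resourceVersion or informer approach would avoid.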

Copilot · 2026-03-28T06:56:46Z

CHANGELOG.md

+
+### Changed
+- `NewPodGPUSliceHandler` now accepts variadic `Option` arguments (`WithPolicyLoader`) — fully backward-compatible; existing callers need no changes
+- `resolveSliceRatio` extended with policy-rule fallback levels (priority: pod annotation > policy rule explicit ratio > policy rule workload type > default 1)


The changelog’s stated resolveSliceRatio priority order is incomplete/ambiguous vs the actual implementation (which is slice-ratio annotation > workload-type annotation > policy sliceRatio > policy workloadType > default). Updating this line to match the real precedence will help operators reason about overrides correctly.

Suggested change

-- `resolveSliceRatio` extended with policy-rule fallback levels (priority: pod annotation > policy rule explicit ratio > policy rule workload type > default 1)
+- `resolveSliceRatio` extended with policy-rule fallback levels (priority: `castops.io/slice-ratio` annotation > `castops.io/workload-type` annotation > policy rule `sliceRatio` > policy rule `workloadType` > default 1)

shenxianpeng requested a review from Copilot March 28, 2026 06:53

Copilot started reviewing on behalf of shenxianpeng March 28, 2026 06:53

Copilot AI reviewed Mar 28, 2026

shenxianpeng added the enhancement New feature or request label Mar 28, 2026

shenxianpeng wants to merge 1 commit into main from claude/policy-engine-v0.4-CjGOm


3 participants
