feat: v0.4.0 Policy Engine (namespace-level and label-based rules) #11

Open
shenxianpeng wants to merge 1 commit into main from claude/policy-engine-v0.4-CjGOm

@shenxianpeng shenxianpeng commented Mar 28, 2026

Summary

  • Policy Engine: Adds a ConfigMap-driven SlicePolicy that automatically enrolls pods into GPU slice optimization based on namespace and/or pod label conditions — no per-pod annotation required
  • Fail-open design: Policy load errors are non-fatal; the webhook allows pods through and falls back to annotation-only behavior
  • Full backward compatibility: Existing pods with castops.io/optimize: "true" are unaffected; pod-level annotations always override policy values
  • New metric: castslice_policy_enforced_total tracks policy-driven mutations separately from annotation-driven ones
  • 27 new test cases; coverage raised from 77.8% to 90.5%
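
The fail-open and metric behavior described in the bullets above can be sketched in plain Go. This is a minimal illustration, not the PR's `Handle` implementation: the `decision` struct, the `policyEval` function type, and the helper names are hypothetical, and the sketch assumes that pods carrying the opt-in annotation skip policy evaluation entirely.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation key taken from the PR description.
const optimizeAnnotation = "castops.io/optimize"

// decision is a hypothetical summary of what the webhook does with a pod.
type decision struct {
	Allowed        bool // the admission response is "allowed"
	Mutated        bool // the GPU slice mutation was applied
	PolicyEnforced bool // castslice_policy_enforced_total would be incremented
}

// policyEval is a hypothetical stand-in for "load the policy and match it
// against the pod". It reports whether any rule matched.
type policyEval func() (matched bool, err error)

// decide sketches the flow: annotation opt-in first, then policy, and a
// policy load error never blocks the pod.
func decide(annotations map[string]string, eval policyEval) decision {
	if annotations[optimizeAnnotation] == "true" {
		// Annotation-driven mutation: the policy metric is not touched.
		return decision{Allowed: true, Mutated: true}
	}
	matched, err := eval()
	if err != nil {
		// Fail open: the pod is allowed through, unmutated.
		return decision{Allowed: true}
	}
	if matched {
		return decision{Allowed: true, Mutated: true, PolicyEnforced: true}
	}
	return decision{Allowed: true}
}

// Hypothetical evaluators used for the demonstration.
func policyMatches() (bool, error)   { return true, nil }
func policyLoadFails() (bool, error) { return false, errors.New("configmap unavailable") }

func main() {
	optIn := map[string]string{optimizeAnnotation: "true"}
	fmt.Printf("annotation opt-in: %+v\n", decide(optIn, policyLoadFails))
	fmt.Printf("policy match:      %+v\n", decide(nil, policyMatches))
	fmt.Printf("policy load error: %+v\n", decide(nil, policyLoadFails))
}
```

The sketch covers load errors only. Per the test plan, an out-of-range policy `sliceRatio` or an unknown policy `workloadType` is rejected rather than ignored.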

How it works

Create a cast-slice-policy ConfigMap in the cast-slice namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cast-slice-policy
  namespace: cast-slice
data:
  policy.json: |
    {
      "rules": [
        {
          "namespaces": ["ml-team", "gpu-lab"],
          "workloadType": "training"
        },
        {
          "labelSelector": {"cast-slice/workload": "inference"},
          "workloadType": "inference"
        },
        {
          "namespaces": ["dev"],
          "labelSelector": {"gpu": "shared"},
          "sliceRatio": 8
        }
      ]
    }

Rules are evaluated in order; first match wins. Namespace and label conditions are AND-ed within a single rule. See config/policy/cast-slice-policy-example.yaml for a full example.
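
As a rough sketch of these matching semantics, the following Go program parses the example `policy.json` and applies first-match-wins evaluation. The struct fields follow the JSON keys shown above; the lower-case function names and the treatment of an omitted condition as "no constraint" are assumptions and may differ from the real `MatchPolicy` and `RuleMatches`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PolicyRule mirrors the JSON keys in the example ConfigMap.
type PolicyRule struct {
	Namespaces    []string          `json:"namespaces,omitempty"`
	LabelSelector map[string]string `json:"labelSelector,omitempty"`
	WorkloadType  string            `json:"workloadType,omitempty"`
	SliceRatio    int               `json:"sliceRatio,omitempty"`
}

type SlicePolicy struct {
	Rules []PolicyRule `json:"rules"`
}

const examplePolicy = `{
  "rules": [
    {"namespaces": ["ml-team", "gpu-lab"], "workloadType": "training"},
    {"labelSelector": {"cast-slice/workload": "inference"}, "workloadType": "inference"},
    {"namespaces": ["dev"], "labelSelector": {"gpu": "shared"}, "sliceRatio": 8}
  ]
}`

func parsePolicy(raw string) (SlicePolicy, error) {
	var p SlicePolicy
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

// ruleMatches AND-s the namespace and label conditions of one rule.
// Assumption: a condition that is omitted imposes no constraint.
func ruleMatches(r PolicyRule, namespace string, labels map[string]string) bool {
	if len(r.Namespaces) > 0 {
		found := false
		for _, ns := range r.Namespaces {
			if ns == namespace {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	for k, want := range r.LabelSelector {
		// The label must be present and equal to the selector value.
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// matchPolicy returns the index of the first matching rule, or -1.
func matchPolicy(p SlicePolicy, namespace string, labels map[string]string) int {
	for i, r := range p.Rules {
		if ruleMatches(r, namespace, labels) {
			return i
		}
	}
	return -1
}

func main() {
	p, err := parsePolicy(examplePolicy)
	if err != nil {
		panic(err)
	}
	fmt.Println(matchPolicy(p, "ml-team", nil))                               // 0
	fmt.Println(matchPolicy(p, "dev", map[string]string{"gpu": "shared"}))    // 2
	fmt.Println(matchPolicy(p, "dev", map[string]string{"gpu": "exclusive"})) // -1
}
```

In the last call the namespace matches the third rule but the label does not, so the AND-ed rule fails and no rule applies.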

Priority order (highest → lowest)

  1. castops.io/slice-ratio pod annotation
  2. castops.io/workload-type pod annotation
  3. PolicyRule.sliceRatio
  4. PolicyRule.workloadType
  5. Default ratio of 1
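
The precedence list above can be sketched as a single resolution function. This is an illustration under stated assumptions, not the PR's `resolveSliceRatio`: the signature is hypothetical, and the `workloadRatios` values are placeholders because the PR text does not state which ratio each workload type maps to.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	sliceRatioAnnotation   = "castops.io/slice-ratio"
	workloadTypeAnnotation = "castops.io/workload-type"
	maxSliceRatio          = 64
	defaultSliceRatio      = 1
)

// workloadRatios holds PLACEHOLDER values chosen for this sketch only.
var workloadRatios = map[string]int{"training": 2, "inference": 8}

// PolicyRule carries only the two fields relevant to ratio resolution.
type PolicyRule struct {
	WorkloadType string
	SliceRatio   int
}

func ratioForWorkload(wt string) (int, error) {
	r, ok := workloadRatios[wt]
	if !ok {
		return 0, fmt.Errorf("unknown workload type %q", wt)
	}
	return r, nil
}

// resolveSliceRatio walks the five priority levels from highest to lowest.
// rule may be nil when no policy rule matched the pod.
func resolveSliceRatio(annotations map[string]string, rule *PolicyRule) (int, error) {
	// 1. castops.io/slice-ratio pod annotation
	if raw, ok := annotations[sliceRatioAnnotation]; ok {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 || n > maxSliceRatio {
			return 0, fmt.Errorf("invalid %s annotation %q", sliceRatioAnnotation, raw)
		}
		return n, nil
	}
	// 2. castops.io/workload-type pod annotation
	if wt, ok := annotations[workloadTypeAnnotation]; ok {
		return ratioForWorkload(wt)
	}
	// 3. PolicyRule.sliceRatio (0 is treated as "unset" in this sketch)
	if rule != nil && rule.SliceRatio != 0 {
		if rule.SliceRatio < 1 || rule.SliceRatio > maxSliceRatio {
			return 0, fmt.Errorf("policy sliceRatio %d out of range", rule.SliceRatio)
		}
		return rule.SliceRatio, nil
	}
	// 4. PolicyRule.workloadType
	if rule != nil && rule.WorkloadType != "" {
		return ratioForWorkload(rule.WorkloadType)
	}
	// 5. Default ratio
	return defaultSliceRatio, nil
}

func main() {
	rule := &PolicyRule{WorkloadType: "training", SliceRatio: 16}
	fmt.Println(resolveSliceRatio(map[string]string{sliceRatioAnnotation: "4"}, rule))
	fmt.Println(resolveSliceRatio(nil, rule))
	fmt.Println(resolveSliceRatio(nil, nil))
}
```

Keeping each level as an early return makes the ordering visible in the code and means a lower level is never consulted once a higher one applies.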

Files changed

| File | Change |
|---|---|
| internal/webhook/policy.go | New: SlicePolicy, PolicyRule, PolicyLoader, ConfigMapPolicyLoader, MatchPolicy, RuleMatches |
| internal/webhook/pod_webhook.go | WithPolicyLoader option, policy-aware resolveSliceRatio, fail-open policy evaluation in Handle |
| internal/webhook/metrics.go | castslice_policy_enforced_total counter |
| main.go | --policy-namespace flag + POD_NAMESPACE env var wiring |
| internal/webhook/policy_test.go | New: unit tests for loader, matcher, JSON parsing |
| internal/webhook/pod_webhook_test.go | 27 new policy-path test cases |
| config/deploy/deployment.yaml | RBAC Role/RoleBinding for ConfigMap read; POD_NAMESPACE Downward API injection |
| config/policy/cast-slice-policy-example.yaml | New: annotated example policy |
| CHANGELOG.md | v0.4.0 entry |

Test plan

  • go test -race -count=2 ./internal/... — all pass, no race conditions
  • Coverage: 90.5% (CI gate: ≥80%)
  • gofmt -l . — no formatting issues
  • Backward compat: all 13 original v0.3 tests still pass unchanged
  • Policy namespace match mutates pod without opt-in annotation
  • Policy label match mutates pod without opt-in annotation
  • AND logic: both namespace + label must match when both specified
  • First-match-wins ordering
  • Pod annotation overrides policy workload type and ratio
  • policy_enforced_total not incremented for annotation-driven mutations
  • Fail-open on policy load error (pod allowed, not mutated)
  • Policy sliceRatio > 64 rejected
  • Unknown policy workloadType rejected
  • ConfigMapPolicyLoader: not-found, get error, missing key, invalid JSON, valid JSON

…e rules

Adds a ConfigMap-driven policy engine that automatically enrolls pods into
GPU slice optimization based on namespace or pod label conditions, without
requiring the castops.io/optimize annotation on each pod.

Key changes:
- internal/webhook/policy.go: SlicePolicy/PolicyRule types, PolicyLoader
  interface, ConfigMapPolicyLoader (reads cast-slice-policy ConfigMap),
  MatchPolicy/RuleMatches matching functions
- pod_webhook.go: WithPolicyLoader option, policy-aware resolveSliceRatio
  (pod annotation > policy rule ratio > policy rule workload type > default)
- metrics.go: castslice_policy_enforced_total counter
- main.go: --policy-namespace flag + POD_NAMESPACE env var wiring
- config/deploy/deployment.yaml: Role/RoleBinding for ConfigMap read,
  POD_NAMESPACE env var via Downward API
- config/policy/cast-slice-policy-example.yaml: example policy ConfigMap
- Tests: 27 new test cases; coverage 90.5% (>80% CI gate)

https://claude.ai/code/session_01UyecCdVqoXvrbmLGHM1oqV

Copilot AI left a comment

Pull request overview

This PR introduces a ConfigMap-driven Policy Engine for the pod mutating webhook, enabling namespace-level and/or label-based automatic enrollment into GPU slice optimization while preserving annotation-based overrides and fail-open behavior on policy load failures.

Changes:

  • Added SlicePolicy / PolicyRule model plus a ConfigMapPolicyLoader and matching logic.
  • Updated the pod webhook to optionally evaluate policy rules (via WithPolicyLoader) and added a new castslice_policy_enforced_total metric.
  • Wired policy namespace configuration (--policy-namespace / POD_NAMESPACE) and added deployment RBAC + Downward API env var, alongside example policy YAML and changelog entry.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| main.go | Adds policy namespace selection and wires a ConfigMap-backed policy loader into the webhook handler. |
| internal/webhook/policy.go | Implements policy data types, ConfigMap loader, and rule matching helpers. |
| internal/webhook/policy_test.go | Adds unit tests for rule matching and ConfigMapPolicyLoader behaviors. |
| internal/webhook/pod_webhook.go | Adds policy evaluation flow, updates slice ratio resolution precedence, and tracks policy-driven enforcement. |
| internal/webhook/pod_webhook_test.go | Adds policy-path test cases covering matching, overrides, metrics, and error paths. |
| internal/webhook/metrics.go | Registers the new castslice_policy_enforced_total counter. |
| config/deploy/deployment.yaml | Adds Role/RoleBinding for ConfigMap reads and injects POD_NAMESPACE via Downward API. |
| config/policy/cast-slice-policy-example.yaml | Provides an example policy ConfigMap for operators. |
| CHANGELOG.md | Documents v0.4.0 feature additions/changes. |

Comment on lines +128 to +130
if pod.Labels[k] != v {
    return false
}
Copilot AI Mar 28, 2026

RuleMatches compares pod.Labels[k] to v without checking label presence. If a policy rule sets a labelSelector value to the empty string, this will incorrectly match pods that lack the label entirely (nil/absent labels read as ""). Consider checking for key existence (using the "ok" result from the map lookup) so that missing labels never satisfy a selector entry.

Suggested change
- if pod.Labels[k] != v {
-     return false
- }
+ if pod.Labels == nil {
+     return false
+ }
+ if podVal, ok := pod.Labels[k]; !ok || podVal != v {
+     return false
+ }
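
To make the difference concrete, here is a small self-contained Go sketch contrasting the two comparisons. The function names are hypothetical and operate on plain maps rather than on `pod.Labels`.

```go
package main

import "fmt"

// matchesByValue mirrors the check under review: it compares values only,
// so a missing label reads as "" and can equal an empty selector value.
func matchesByValue(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// matchesByPresence uses the comma-ok form so that a missing label never
// satisfies a selector entry.
func matchesByPresence(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"gpu": ""}
	var noLabels map[string]string // nil map, like a pod without labels

	fmt.Println(matchesByValue(noLabels, selector))    // true: false positive
	fmt.Println(matchesByPresence(noLabels, selector)) // false
}
```

Reading from a nil map is safe in Go and yields the zero value with `ok == false`, so the comma-ok lookup already covers pods without labels; the separate `pod.Labels == nil` guard in the suggestion is harmless but not strictly required.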

Comment on lines +102 to +106
if rule != nil && rule.SliceRatio > 0 {
    if rule.SliceRatio > maxSliceRatio {
        return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
    }
    return rule.SliceRatio, nil
Copilot AI Mar 28, 2026

PolicyRule.SliceRatio is only validated when it’s > 0; negative values (e.g. -1 in policy JSON) are silently treated as “unset” and fall through to workloadType/default. That’s inconsistent with the stated 1–64 range and can mask policy misconfiguration. Consider validating that a non-zero SliceRatio is within [1, 64] (reject < 0 and > 64; decide explicitly whether 0 is allowed as “unset”).

Suggested change
- if rule != nil && rule.SliceRatio > 0 {
-     if rule.SliceRatio > maxSliceRatio {
-         return 0, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", rule.SliceRatio, maxSliceRatio)
-     }
-     return rule.SliceRatio, nil
+ if rule != nil {
+     if rule.SliceRatio < 0 || rule.SliceRatio > maxSliceRatio {
+         return 0, fmt.Errorf("policy sliceRatio %d must be between 1 and %d, or 0 to use workloadType/default", rule.SliceRatio, maxSliceRatio)
+     }
+     if rule.SliceRatio > 0 {
+         return rule.SliceRatio, nil
+     }
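
The behavioral difference between the two checks can be seen by running both over a few boundary values. The sketch below isolates the range logic into two hypothetical helpers; each returns the ratio, whether an explicit ratio was accepted, and any validation error.

```go
package main

import "fmt"

const maxSliceRatio = 64

// originalCheck mirrors the reviewed snippet: only values > 0 are examined,
// so a negative value falls through as if it were unset.
func originalCheck(r int) (ratio int, set bool, err error) {
	if r > 0 {
		if r > maxSliceRatio {
			return 0, false, fmt.Errorf("policy sliceRatio %d exceeds maximum of %d", r, maxSliceRatio)
		}
		return r, true, nil
	}
	return 0, false, nil
}

// suggestedCheck mirrors the suggestion: any non-zero value outside
// [1, maxSliceRatio] is an error, and 0 still means "unset".
func suggestedCheck(r int) (ratio int, set bool, err error) {
	if r < 0 || r > maxSliceRatio {
		return 0, false, fmt.Errorf("policy sliceRatio %d must be between 1 and %d, or 0 to use workloadType/default", r, maxSliceRatio)
	}
	if r > 0 {
		return r, true, nil
	}
	return 0, false, nil
}

func main() {
	for _, r := range []int{-1, 0, 1, 64, 65} {
		_, setA, errA := originalCheck(r)
		_, setB, errB := suggestedCheck(r)
		fmt.Printf("sliceRatio=%3d original: set=%t rejected=%t | suggested: set=%t rejected=%t\n",
			r, setA, errA != nil, setB, errB != nil)
	}
}
```

The two versions differ only for negative inputs: the original silently ignores `-1`, while the suggested version surfaces it as a misconfiguration.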

Comment on lines +70 to +90
// Load fetches and parses the policy ConfigMap. Returns (nil, nil) when the
// ConfigMap does not exist — absence of a policy is not an error.
func (l *ConfigMapPolicyLoader) Load(ctx context.Context) (*SlicePolicy, error) {
    cm := &corev1.ConfigMap{}
    key := types.NamespacedName{Namespace: l.namespace, Name: policyConfigMapName}
    if err := l.reader.Get(ctx, key, cm); err != nil {
        if errors.IsNotFound(err) {
            return nil, nil
        }
        return nil, fmt.Errorf("get policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
    }

    raw, ok := cm.Data[policyConfigMapKey]
    if !ok {
        return nil, nil
    }

    var policy SlicePolicy
    if err := json.Unmarshal([]byte(raw), &policy); err != nil {
        return nil, fmt.Errorf("parse policy ConfigMap %s/%s: %w", l.namespace, policyConfigMapName, err)
    }
Copilot AI Mar 28, 2026

ConfigMapPolicyLoader.Load performs a ConfigMap Get and JSON unmarshal every time policy is evaluated (i.e., on each non-opt-in Pod admission). This adds avoidable admission-path latency and CPU overhead, especially if the policy grows. Consider caching the parsed policy in-memory (e.g., keyed by ConfigMap resourceVersion with a short TTL, or via an informer/watch) and only re-parsing when it changes.
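
One possible shape for the TTL variant of this idea is a decorator around the `PolicyLoader` interface. The sketch below is an assumption-laden illustration: `cachingLoader`, `countingLoader`, and `demoLoads` are hypothetical names, `SlicePolicy` is reduced to a placeholder, and invalidation by resourceVersion or informer is not shown.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// SlicePolicy is a simplified placeholder for the real policy type.
type SlicePolicy struct{ Rules []string }

// PolicyLoader follows the Load signature shown in the diff above.
type PolicyLoader interface {
	Load(ctx context.Context) (*SlicePolicy, error)
}

// cachingLoader serves a parsed policy from memory until the TTL expires.
type cachingLoader struct {
	inner PolicyLoader
	ttl   time.Duration
	now   func() time.Time // injectable clock, useful in tests

	mu      sync.Mutex
	policy  *SlicePolicy
	loaded  bool // distinguishes "cached nil policy" from "nothing cached"
	expires time.Time
}

func newCachingLoader(inner PolicyLoader, ttl time.Duration) *cachingLoader {
	return &cachingLoader{inner: inner, ttl: ttl, now: time.Now}
}

func (c *cachingLoader) Load(ctx context.Context) (*SlicePolicy, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded && c.now().Before(c.expires) {
		return c.policy, nil
	}
	p, err := c.inner.Load(ctx)
	if err != nil {
		// Errors are not cached, so the next admission retries the load.
		return nil, err
	}
	c.policy, c.loaded, c.expires = p, true, c.now().Add(c.ttl)
	return p, nil
}

// countingLoader is a fake backend that records how often it is called.
type countingLoader struct{ calls int }

func (l *countingLoader) Load(context.Context) (*SlicePolicy, error) {
	l.calls++
	return &SlicePolicy{}, nil
}

// demoLoads performs n loads with a fake clock that advances stepSeconds
// after each load, and returns the number of backend calls (TTL is 60s).
func demoLoads(n, stepSeconds int) int {
	inner := &countingLoader{}
	clock := time.Unix(0, 0)
	c := newCachingLoader(inner, time.Minute)
	c.now = func() time.Time { return clock }
	for i := 0; i < n; i++ {
		if _, err := c.Load(context.Background()); err != nil {
			panic(err)
		}
		clock = clock.Add(time.Duration(stepSeconds) * time.Second)
	}
	return inner.calls
}

func main() {
	fmt.Println("loads within TTL:", demoLoads(3, 0))
	fmt.Println("loads past TTL:  ", demoLoads(3, 61))
}
```

A TTL trades freshness for latency: policy edits take up to one TTL to apply. Because `Load` may legitimately return `(nil, nil)` when no ConfigMap exists, the sketch tracks a separate `loaded` flag so that "no policy" is cached too.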


### Changed
- `NewPodGPUSliceHandler` now accepts variadic `Option` arguments (`WithPolicyLoader`) — fully backward-compatible; existing callers need no changes
- `resolveSliceRatio` extended with policy-rule fallback levels (priority: pod annotation > policy rule explicit ratio > policy rule workload type > default 1)
Copilot AI Mar 28, 2026

The changelog’s stated resolveSliceRatio priority order is incomplete/ambiguous vs the actual implementation (which is slice-ratio annotation > workload-type annotation > policy sliceRatio > policy workloadType > default). Updating this line to match the real precedence will help operators reason about overrides correctly.

Suggested change
- - `resolveSliceRatio` extended with policy-rule fallback levels (priority: pod annotation > policy rule explicit ratio > policy rule workload type > default 1)
+ - `resolveSliceRatio` extended with policy-rule fallback levels (priority: `castops.io/slice-ratio` annotation > `castops.io/workload-type` annotation > policy rule `sliceRatio` > policy rule `workloadType` > default 1)

@shenxianpeng shenxianpeng added the enhancement New feature or request label Mar 28, 2026
