3 changes: 2 additions & 1 deletion .claude-plugin/plugin.json
@@ -22,5 +22,6 @@
"secrets",
"autoscaling"
],
"mcpServers": "./.claude-mcp.json"
"mcpServers": "./.claude-mcp.json",
"hooks": "./hooks/hooks.json"
Comment (Collaborator):

The official `plugin.json` reference documents that hooks are auto-loaded from the default location, which is also `hooks/hooks.json`. Are you sure the hooks were ignored when you installed and used the plugin?

Comment (Collaborator, Author):

When I tested with Sonnet, it wasn't loading the hooks and was attempting to call the CLI directly which caused it to ignore the explicit hooks call. I added this to the plugin.json, reloaded the plugin and ran it again. It leveraged the new hooks after that.

}
1 change: 1 addition & 0 deletions .codex-plugin/plugin.json
@@ -23,6 +23,7 @@
"autoscaling"
],
"skills": "./skills/",
"rules": "./rules/",
Comment (Collaborator):

`rules` is not a valid property in Codex's `plugin.json` and is silently ignored.

Comment (Collaborator, Author):

We can remove the rules call. I was hoping it would be called from the initial load as an additional reference but if it's being ignored, then we can remove it.

"mcpServers": "./.mcp.json",
"apps": "./.app.json",
"interface": {
3 changes: 3 additions & 0 deletions .gitignore
@@ -38,6 +38,9 @@ Thumbs.db
.idea/
.vscode/

# Claude Code local session data (hooks are distributed via .claude-plugin/plugin.json)
.claude/

# Control Plane local artifacts
*-bootstrap.json
*.bak.yaml
14 changes: 14 additions & 0 deletions GEMINI.md
@@ -18,6 +18,20 @@ The plugin auto-configures the Control Plane MCP Server. Your `CPLN_TOKEN` (prom

**Never write a cpln command from memory.** Before constructing a command, consult `rules/cli-conventions.md` (command structure, shared flags, resource command map, hallucination traps) and `skills/cpln/SKILL.md` (setup, workflows, examples). Verify exact flag names with `cpln <command> --help` or the MCP suggest tool (`mcp__cpln__cpln_suggest`).

## CLI Guardrails

These commands do not exist — never generate them:

- `cpln secret create` → use type-specific: `cpln secret create-opaque`, `create-aws`, `create-tls`, etc.
- `cpln apply` without `--file` → always: `cpln apply --file manifest.yaml`
- `cpln <resource> list` → use `cpln <resource> get` (no args = list all)

These are too destructive to run without explicit user confirmation in the conversation:

- `cpln gvc delete-all-workloads` — destroys every workload in the GVC
- `cpln volumeset shrink` — permanent data loss on the old volume
- Any `cpln <resource> delete` — surface the org, GVC, resource name, and blast radius before proceeding

Comment (Collaborator):

Turns out CLAUDE.md and GEMINI.md aren't shipped to users; they're only loaded when developing on this repo. Let's remove this change; anything user-facing belongs in a skill.

Comment (Collaborator):

Quick correction: GEMINI.md does ship to users (Gemini CLI loads it every session via `contextFileName`), so it's not dev-only like CLAUDE.md. I think the right framing is to treat it as a guardrails file: short, always-on rules like destructive-op confirmations and CLI conventions.

## Key Conventions

- CLI commands use `cpln` prefix (e.g., `cpln apply --file manifest.yaml`)
10 changes: 10 additions & 0 deletions agents/k8s-migrator.md
@@ -146,6 +146,16 @@ This approach requires manual work to parameterize the converted output, but giv

## Docker Compose Migration (`cpln stack`)

> **Firewall default mismatch — read before writing native manifests.**
> `cpln stack` defaults external outbound to **open** for all services that expose ports. Native Control Plane workload manifests default external outbound to **blocked**. If you are writing CPLN manifests by hand (rather than using `cpln stack` directly), you must add explicit outbound rules for every external API, database, or service your workload calls — otherwise it silently cannot reach anything outside the platform. This is the most common failure mode for manual Docker Compose migrations.
>
> ```yaml
> firewallConfig:
> external:
> outboundAllowCIDR:
> - 0.0.0.0/0 # or restrict to specific CIDRs/hostnames
> ```

Comment (Collaborator):

Small fix: the note says `cpln stack` opens outbound "for all services that expose ports." The port part is wrong; that rule applies to inbound. Outbound is opened for every service, except those with `network_mode: none`.

Suggested rewrite:

`cpln stack` defaults external outbound to **open** for every service it generates, except those with `network_mode: none`. Native Control Plane workload manifests default external outbound to **blocked**.

Everything else looks good to me.

### Key Differences

1. **Service URLs must be rewritten**: `http://service-name:port` → `http://workload-name.GVC_NAME.cpln.local:port`
4 changes: 3 additions & 1 deletion agents/secret-setup-wizard.md
@@ -126,7 +126,7 @@ Use `cpln://secret/NAME` to reference the full secret, or `cpln://secret/NAME.KE

| Secret Type | Available Keys | Example |
|:---|:---|:---|
| opaque | `payload` (decoded if base64 runtime decode enabled), or omit key for raw JSON with `payload` + `encoding` | `cpln://secret/my-api-key.payload` |
| opaque | `payload` — **see encoding warning below** | `cpln://secret/my-api-key.payload` |
| dictionary | user-defined keys | `cpln://secret/db-config.DB_HOST` |
| userpass | `username`, `password` | `cpln://secret/creds.password` |
| tls | `key`, `cert`, `chain` | `cpln://secret/my-tls.cert` |
@@ -137,6 +137,8 @@ Use `cpln://secret/NAME` to reference the full secret, or `cpln://secret/NAME.KE
| nats-account | `accountId`, `privateKey` | `cpln://secret/my-nats.accountId` |
| any type | omit key for full secret as JSON | `cpln://secret/my-secret` |

**Opaque `.payload` encoding warning:** If the secret was created with base64 encoding (common when storing binary content — certs, keys, binary tokens — via the console or API), the `.payload` reference returns the base64-encoded string, not the decoded value. The workload receives it as a base64 string and typically fails with a cryptographic or parse error. To get the decoded value at runtime, the secret must have runtime decoding enabled (`encoding: base64` + runtime decode on the secret spec), or use the full secret reference (`cpln://secret/NAME`) and decode in application code. For plaintext secrets (API keys, connection strings, passwords), `.payload` works as expected. **Before injecting an opaque secret as `.payload`, ask the user: was this secret created with base64 encoding?**
Comment (Collaborator), on lines 130 to +140:

This warning is inaccurate. Opaque `.payload` always delivers the original value the user stored. There is no "runtime decoding" flag. `encoding: 'base64'` is purely a storage setting so binary content can survive the JSON API; the backend forwards it to Kubernetes as-is, and Kubernetes decodes it back to the original bytes at injection time. Asking the user "was this created with base64 encoding?" before injecting `.payload` adds friction without changing what the workload sees.

Suggested replacement:

> **Opaque `.payload` reference:** `.payload` always delivers the value the user originally stored. If the secret was created with `encoding: 'base64'` (used to store binary content such as binaries, certs or keys that aren't valid UTF-8), the actuator forwards the base64 to Kubernetes as-is and Kubernetes decodes it back to the original bytes when injecting as an env var or mounting as a file, no application-side decoding required. **Caveat:** env vars on most container runtimes don't reliably carry null bytes or non-UTF-8 content, so for opaque secrets whose decoded value is binary, mount as a volume instead of injecting as an env var.

Just so you know, with encoding: 'base64':

  1. You store the base64 string in payload with encoding base64.
  2. The backend forwards that base64 string into the K8s Secret.data.payload field as-is (because K8s already requires data values to be base64-encoded, that's just the K8s format).
  3. K8s decodes it once at injection time, so the workload receives the original decoded bytes (the binary the base64 represented).

Important: `encoding: 'base64'` is the right choice when you have binary content that you base64-encoded just to fit it into a JSON string and you want the workload to see the binary. If you want the workload to see the literal base64 string itself (e.g., it's a token that happens to look like base64 and your app expects it as text), use `encoding: 'plain'` and store the string as plaintext; otherwise Kubernetes will decode it and your app will get the underlying bytes instead of the string. Let's consider saying that here in the AI plugin as well.
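
The encode-once, decode-once round trip described in the steps above can be illustrated locally with the standard `base64` utility. This is only a sketch under stated assumptions: it makes no Control Plane or Kubernetes calls, the helper names `encode_payload` and `decode_payload` are hypothetical, and the sample values are placeholders.

```bash
#!/usr/bin/env bash
# Local illustration only; no Control Plane or Kubernetes calls are made.
# encode_payload / decode_payload are hypothetical helper names.
set -eu

# Step 1: the base64 string you would store in `payload` with encoding 'base64'.
encode_payload() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Step 3: the single decode that the comment says happens at injection time.
decode_payload() {
  printf '%s' "$1" | base64 -d
}

original='s3cr3t-value'
stored=$(encode_payload "$original")
echo "stored payload:    $stored"
echo "workload receives: $(decode_payload "$stored")"

# A text token that merely looks like base64 would also be decoded once,
# which is why the comment recommends encoding 'plain' for such values.
token='aGVsbG8='
echo "token stored with encoding base64 arrives as: $(decode_payload "$token")"
```

The last line shows the pitfall from the note above: a value that only looks like base64 is still decoded, so the application would see the underlying bytes rather than the literal string.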


**As volume mount:** Export the workload, add a volume, and apply:

```bash
2 changes: 2 additions & 0 deletions agents/workload-troubleshooter/diagnostics.md
@@ -73,6 +73,8 @@ cpln policy add-binding my-secret-policy --permission reveal --identity //gvc/MY

**Or use `mcp__cpln__create_policy`** — creates the policy with bindings in one call. Params: `name` (required), `targetKind` (required), `targetLinks` (optional), `addPermissions` (optional array of permission strings), `addIdentities` (optional array of identity links), `org` (uses session context if set, required otherwise).

**If `cpln apply` fails on a policy manifest with a validation error and the YAML looks correct:** check that `targetKind` is a valid resource kind, all `principalLinks` use full resource paths (`//gvc/GVC/identity/NAME`), and `permissions` values are valid for the target kind. The API auto-sorts permissions alphabetically — ordering is not a cause of validation errors.

## C. Port Mismatch

**Symptoms**: Workload shows healthy but returns 502/503, or traffic doesn't reach the container.
37 changes: 37 additions & 0 deletions hooks/hooks.json
@@ -19,7 +19,44 @@
"command": "input=$(cat); cmd=$(echo \"$input\" | jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qE 'cpln\\s+apply' && ! echo \"$cmd\" | grep -qE '--file|--f\\b|-f\\b'; then echo 'BLOCK: cpln apply requires --file flag. Usage: cpln apply --file manifest.yaml' >&2; exit 1; fi"
}
]
},
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "input=$(cat); cmd=$(echo \"$input\" | jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qE 'cpln\\s+gvc\\s+delete-all-workloads'; then echo 'BLOCK: cpln gvc delete-all-workloads destroys every workload in the GVC. This command is too destructive to run from the AI layer. Confirm the org, GVC, and full blast radius in the conversation, then run this command manually in your terminal.' >&2; exit 1; fi"
}
]
},
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "input=$(cat); cmd=$(echo \"$input\" | jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qE 'cpln\\s+volumeset\\s+shrink'; then echo 'BLOCK: cpln volumeset shrink causes permanent data loss on the old volume. This command is too destructive to run from the AI layer. Confirm the org, GVC, volumeset name, and new size in the conversation, then run this command manually in your terminal.' >&2; exit 1; fi"
}
]
},
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "input=$(cat); cmd=$(echo \"$input\" | jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qE 'cpln\\s+\\w+\\s+list\\b'; then echo 'BLOCK: cpln <resource> list does not exist. Use cpln <resource> get (with no arguments to list all, or with a name to get one).' >&2; exit 1; fi"
}
]
},
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "input=$(cat); cmd=$(echo \"$input\" | jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qE 'cpln\\s+(workload|gvc|secret|identity|domain|policy|volumeset|serviceaccount|cloudaccount|agent|group|ipset|mk8s|image)\\s+delete\\b'; then echo 'WARNING: Destructive delete detected. Verify the correct org, GVC (if applicable), and resource name before proceeding. This action cannot be undone.' >&2; fi"
}
]
}
]
}
}
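
The matching logic in the hooks above can be exercised outside the plugin with a small classifier. The following is a minimal sketch, not the plugin's implementation: `classify_cpln_command` and `matches` are hypothetical helpers, the command string is passed as an argument rather than read as JSON from stdin via `jq`, and the patterns use POSIX character classes in place of `\s`, `\w`, and `\b` so they do not depend on GNU grep extensions. Patterns are passed with `-e` so that one beginning with a dash is not parsed as a grep option.

```bash
#!/usr/bin/env bash
# Sketch only: classify_cpln_command and matches are hypothetical helpers,
# not part of the plugin. Patterns use POSIX classes instead of \s, \w, \b.
set -eu

ws='[[:space:]]+'
word='[[:alnum:]_]+'
end='([^[:alnum:]_-]|$)'   # rough stand-in for a trailing word boundary
kinds='workload|gvc|secret|identity|domain|policy|volumeset|serviceaccount|cloudaccount|agent|group|ipset|mk8s|image'

# matches PATTERN TEXT: succeeds when TEXT matches the extended regex.
# -e keeps patterns that start with a dash from being parsed as grep options.
matches() {
  printf '%s\n' "$2" | grep -qE -e "$1"
}

# Prints block, warn, or allow for a single command string.
classify_cpln_command() {
  cmd=$1
  if matches "cpln${ws}apply${end}" "$cmd" &&
     ! matches '(^|[[:space:]])(--file|-f)([[:space:]=]|$)' "$cmd"; then
    echo block
  elif matches "cpln${ws}gvc${ws}delete-all-workloads" "$cmd" ||
       matches "cpln${ws}volumeset${ws}shrink" "$cmd" ||
       matches "cpln${ws}${word}${ws}list${end}" "$cmd"; then
    echo block
  elif matches "cpln${ws}(${kinds})${ws}delete${end}" "$cmd"; then
    echo warn
  else
    echo allow
  fi
}

# Print the classification for a few sample commands.
for sample in \
  'cpln apply manifest.yaml' \
  'cpln apply --file manifest.yaml' \
  'cpln workload list' \
  'cpln workload delete my-app --gvc my-gvc' \
  'cpln workload get'; do
  printf '%-45s -> %s\n' "$sample" "$(classify_cpln_command "$sample")"
done
```

Keeping the classification in one function makes it straightforward to check new patterns against sample commands before wiring them into `hooks.json`.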

3 changes: 2 additions & 1 deletion rules/cli-conventions.md
@@ -195,7 +195,7 @@ cpln logs '{gvc="GVC", workload="WORKLOAD"}' --org ORG --tail

| Type | Command | Required Flags |
|------|---------|---------------|
| Opaque | `create-opaque` | `--file` or `--payload` |
| Opaque | `create-opaque` | `--name`, `--file` (path or `-` for stdin). No `--payload` flag — write value to a file or pipe via stdin. Add `--encoding plain` for plaintext values (default encoding is base64). |
| Dictionary | `create-dictionary` | `--entry KEY=VAL` (repeatable) |
| Username/Password | `create-userpass` | `--username`, `--password` |
| AWS | `create-aws` | `--access-key`, `--secret-key` |
@@ -283,6 +283,7 @@ Flags: `--address`, `--location`, `--replica`.
| `cpln workload update --identity X` | `cpln workload update REF --set spec.identityLink=//identity/X` |
| `cpln secret update --data '{}'` | `cpln secret edit REF` or `cpln apply --file` |
| `cpln gvc update --location LOC` | `cpln gvc update REF --set 'spec.staticPlacement.locationLinks+=//location/LOC'` |
| `spec.containers[N].resources.requests/limits` in manifests | Use flat `cpu` and `memory` directly on the container: `cpu: 50m` / `memory: 128Mi`. Kubernetes-style nested `resources` is not a valid field and will cause a 400 at apply time. |

## The Verification Rule

23 changes: 22 additions & 1 deletion rules/cpln-guardrails.md
@@ -71,7 +71,7 @@ If any of those is unclear, ask. Propose what looks right and request confirmati

> Before I run this, I want to confirm the target. Your active profile appears to be `<name>` (org: `<org>`, GVC: `<gvc>`). Should I use that, or a different org / profile / GVC?

For **read-only** commands (`get`, `query`, `audit`, `logs`, `permissions`, `access-report`, `eventlog`), defaulting to the active profile is acceptable — but **announce the target before running**: *"Using profile `<name>` → org `<org>`, GVC `<gvc>`"* — so the user can correct course before output is produced.
For **read-only** commands (`get`, `query`, `audit`, `logs`, `permissions`, `access-report`, `eventlog`), defaulting to the active profile is acceptable — but **announce the exact target before running and pause one turn for correction**: *"Using profile `<name>` → org `<org>`, GVC `<gvc>`. Reading now — let me know if that's the wrong environment."* Do not run the command in the same turn as the announcement. This one-turn pause is especially important in multi-GVC or multi-org environments where reading the wrong environment leads to debugging the wrong workload, which is a common and expensive mistake.

**Why this rule exists.** Operating on the wrong org or GVC has caused production deletes, cross-environment secret leaks, and accidental cross-tenant changes. The cost of asking is one extra turn; the cost of acting on the wrong context is irreversible.

@@ -106,6 +106,26 @@ Missing any one step = silent failure at runtime. This is the #1 support issue.
- **Cron**: Deploys to ALL GVC locations, no overrides. Cannot expose ports.
- **Workload type is immutable** after creation. Changing type requires delete + recreate.

### Resource Protection — Suggest Before Any Production Resource

Before creating or modifying any resource a user identifies as production-critical, proactively suggest the `cpln/protected` tag. This is a platform-level safeguard that causes the API to reject any delete attempt — it works regardless of who (or what) tries to delete the resource, and does not require a conversation.

```bash
# Protect a workload
cpln workload tag WORKLOAD --tag cpln/protected=true --gvc GVC --org ORG

# Protect a GVC
cpln gvc tag GVC --tag cpln/protected=true --org ORG

# Protect a volumeset
cpln volumeset tag VS --tag cpln/protected=true --gvc GVC --org ORG

# Remove protection before a confirmed intentional delete
cpln workload tag WORKLOAD --remove-tag cpln/protected --gvc GVC --org ORG
```

When a delete is requested on a protected resource: (1) surface the protection, (2) confirm the user explicitly wants to remove it, (3) remove the tag, (4) proceed with the normal destructive-operation confirmation flow.

### Destructive Operations — Always Confirm With Blast Radius

Some operations cannot be undone, or have effects that reach beyond the resource being changed. **Before any destructive operation listed below, the AI MUST present a structured summary AND wait for explicit user confirmation — even when permissions are set to bypass / auto-approve.** Permission mode is about Claude Code's tool-prompt UX; this rule is conversation-level safety and is independent. Bypass permissions does NOT authorize destructive product operations.
@@ -337,6 +357,7 @@ Before submitting work with Control Plane:
- [ ] Service account keys in CI/CD (not user tokens)
- [ ] No `docker.io/` prefix on external images
- [ ] `cpln apply --ready` used for deployments
- [ ] For distroless or minimal Alpine images: confirm `sleep` binary is present, or set a custom preStop hook — if `sleep` is absent in any container, all containers receive SIGKILL immediately on shutdown, bypassing the grace period

## Resources

1 change: 1 addition & 0 deletions rules/workload-manifest-reference.md
@@ -118,6 +118,7 @@ Probe types: exactly one of `exec`, `grpc`, `tcpSocket`, `httpGet` (xor constrai

| Error | Fix |
|:---|:---|
| `spec.containers[N].resources` present | Remove it — Control Plane does not use Kubernetes-style `resources.requests/limits`. Set `cpu` and `memory` directly on the container object: `cpu: 50m`, `memory: 128Mi`. This returns a 400 with `"resources" is not allowed`. |
Comment (Collaborator):

Good. Let's also mention here (and in the similar note in the other file) that `cpln workload update` can also set CPU and memory, e.g. `cpln workload update my-workload --set spec.containers.<name>.cpu=25m`. You can verify this, and see the other fields a user can change, with `cpln workload update --help`.

| Memory-to-CPU ratio exceeded | 1024Mi memory needs at least 128m CPU (ratio 8:1) |
| GPU with Capacity AI | Disable Capacity AI when using GPU |
| Concurrency on standard/stateful | Use rps, cpu, memory, latency, or keda instead |
3 changes: 2 additions & 1 deletion skills/access-control/SKILL.md
@@ -113,7 +113,7 @@ bindings:
```

**Constraints:**
- Each binding's permissions must be **sorted alphabetically and unique** (validation rule).
- Each binding's permissions must be **unique**. The API auto-sorts them alphabetically — you don't need to sort manually.
Comment (Collaborator):

"Duplicate permissions in the same binding will cause a validation error" is wrong. The API silently de-duplicates them on write, duplicates don't trigger an error, they just get dropped. Ordering is also auto-normalized: the API sorts the array on write regardless of input order.

Suggested replacement:

Permission ordering and duplicates don't matter; the API normalizes both. You don't need to sort permissions alphabetically in your manifests; the platform sorts them on write. Duplicate permissions within the same binding are silently de-duplicated, not rejected, and the request still succeeds. (Avoid duplicates anyway for manifest clarity and clean diffs.)

Or maybe we don't mention this at all; it isn't important enough to point out.
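
The normalization described in this comment (sort on write, drop duplicates) can be pictured with a local sketch. It assumes nothing about the platform's implementation: `normalize_permissions` is a hypothetical helper that does not call the Control Plane API, and the permission names are placeholder values.

```bash
#!/usr/bin/env bash
# Local illustration only; normalize_permissions is a hypothetical helper and
# does not call the Control Plane API.
set -eu

# Sort alphabetically and drop duplicates, one permission per line.
normalize_permissions() {
  printf '%s\n' "$@" | LC_ALL=C sort -u
}

# Unsorted input with a duplicate entry.
normalize_permissions view manage edit view
```

Run as written, this prints `edit`, `manage`, and `view`, one per line, which is the shape a binding's permission list would take after the sort-and-de-duplicate behavior the comment describes.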

- A policy can have up to **50 bindings**, each with up to **200 principal links**.
- The same principal can appear in multiple bindings (different permission sets).

@@ -278,6 +278,7 @@ bindings:
## Gotchas

- **Policies fail silently when wrong.** A typo in `targetKind`, a missing principal link, or an invalid permission name produces a policy that exists but grants nothing. Always verify with `cpln policy access-report POLICY_NAME` after creation.
- **Permission ordering doesn't matter — the API auto-sorts.** You do not need to sort permissions alphabetically in your manifests; the platform sorts them on write. Duplicate permissions in the same binding will cause a validation error.
- **Built-in policies cannot be modified or deleted.** Origins `builtin` are read-only; create your own with `default` origin.
- **`reveal` (not `read`) is the permission for accessing secret values.** This is the most common permission-name mistake.
- **Identity links are GVC-scoped.** Use `//gvc/GVC/identity/NAME`, not `//identity/NAME`.
12 changes: 12 additions & 0 deletions skills/autoscaling-capacity/SKILL.md
@@ -170,7 +170,19 @@ spec:

### Event-Driven KEDA (Standard, Redis Queue)

**Prerequisite:** KEDA must be enabled on the GVC before any workload can use `metric: keda`. Applying a workload with `metric: keda` to a GVC without KEDA enabled will silently not scale — no error event, the workload just ignores queue depth.

```yaml
# Step 1: Enable KEDA on the GVC (one-time setup)
kind: gvc
name: my-gvc
spec:
  keda:
    enabled: true
```

Comment (Collaborator):

This is correct. One small enhancement worth adding: most real KEDA triggers (SQS, Pub/Sub, Azure queues, etc.) need cloud credentials, so spec.keda usually also takes identityLink and/or secrets. The minimal enabled: true is fine as the prerequisite check, but the example could mention these for production use:

kind: gvc
name: my-gvc
spec:
  keda:
    enabled: true
    identityLink: //identity/keda-id   # required for cloud-resource triggers
    secrets:                            # optional, for TriggerAuthentication
      - //secret/queue-creds

```yaml
# Step 2: Configure the workload
kind: workload
name: queue-processor
spec:
7 changes: 7 additions & 0 deletions skills/cpln/SKILL.md
@@ -105,6 +105,13 @@ cpln policy add-binding secret-access \
# Inject the secret into the workload
cpln workload update my-app --gvc my-gvc \
--set spec.containers.main.env.DB_PASSWORD.value=cpln://secret/db-password.payload

# ALWAYS verify the injection landed — --set exits 0 even if the container name
# doesn't match, silently writing to a path that doesn't exist in the spec.
cpln workload get my-app --gvc my-gvc -o json \
| jq '.spec.containers[] | select(.name == "main") | .env'
# If DB_PASSWORD is absent from the output, the container name was wrong.
# Re-run with the correct name from: cpln workload get my-app --gvc my-gvc -o json | jq '[.spec.containers[].name]'
```

## Workflow: GitOps with cpln apply
2 changes: 1 addition & 1 deletion skills/workload-security/SKILL.md
@@ -319,7 +319,7 @@ Full `spec.rolloutOptions` configuration:

### Critical Warnings

- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately
- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately — the entire grace period is skipped. This silently affects distroless images, scratch-based images, and some minimal Alpine builds. Verify with `cpln workload exec WORKLOAD --gvc GVC -- which sleep` before relying on the grace period. If `sleep` is absent, either add it to the image or configure an explicit preStop hook that does not depend on it.
Comment (Collaborator):

Going back over this, I think part of what I originally wrote is also off, and the additions were stacked on top of it; sorry about that. The "ALL containers get SIGKILL immediately, the entire grace period is skipped" framing isn't accurate: if the preStop hook fails because `sleep` is missing, Kubernetes still delivers SIGTERM and still honors the full `terminationGracePeriodSeconds`. What actually gets lost is the request-draining delay, so the load balancer may still send traffic at the moment SIGTERM arrives. preStop is also per-container, and on K8s 1.33+ Control Plane uses the native `lifecycle.preStop.sleep` hook with no binary dependency, so distroless/scratch images are fine there; the risk only applies on older clusters. Also, `which sleep` won't work on distroless (there is no `which`); something like `/bin/sleep 0` is a better check.

That said, I don't think this is worth mentioning in the AI plugin. It's a narrow edge case (distroless/scratch on K8s < 1.33) and the failure mode is subtle rather than catastrophic, so let's remove it completely.

- If a custom preStop hook throws an error in **any** container, ALL containers receive SIGKILL immediately

### Custom PreStop Hook