A Kubernetes operator for AI agents that keeps the LLM out of the cluster's write path. One CRD, any runtime (OpenClaw, NanoClaw, ZeroClaw, PicoClaw, IronClaw, HermesClaw, HermesRS, K8sOps), self-healing from day one.
k8s4claw keeps the LLM out of the cluster's write path. The agent's ServiceAccount cannot patch workload objects. The LLM can only submit an ops-intent annotation on a Claw CR; a Go reconciler validates the intent's JSON shape against a 5-action allowlist (plus generation guard) and is the only component that mutates workloads. The reconciler also writes an Ed25519 audit receipt when the action runs through the auto-execute path — the signature is for the audit trail, not for authorization, and is non-blocking if signing fails.
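The validation step described above can be sketched in a few lines of Go. This is a minimal illustration, not the project's reconciler: the `Intent` fields, the five action names, and the reading of "generation guard" as "the intent must name the CR generation it was computed against" are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Intent is a hypothetical shape for the ops-intent annotation payload.
type Intent struct {
	Action     string `json:"action"`
	Generation int64  `json:"generation"`
}

// allowedActions stands in for the 5-action allowlist. The names are
// illustrative assumptions, not the project's actual action set.
var allowedActions = map[string]bool{
	"bump-memory": true,
	"bump-cpu":    true,
	"restart":     true,
	"scale":       true,
	"rollback":    true,
}

var (
	ErrMalformed  = errors.New("malformed intent")
	ErrNotAllowed = errors.New("action not in allowlist")
	ErrStale      = errors.New("stale generation")
)

// ValidateIntent checks the JSON shape, the allowlist, and the generation
// guard, in that order. Only an intent that passes all three is returned.
func ValidateIntent(raw string, observedGeneration int64) (Intent, error) {
	var in Intent
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // reject fields outside the expected shape
	if err := dec.Decode(&in); err != nil {
		return Intent{}, fmt.Errorf("%w: %v", ErrMalformed, err)
	}
	if !allowedActions[in.Action] {
		return Intent{}, fmt.Errorf("%w: %q", ErrNotAllowed, in.Action)
	}
	if in.Generation != observedGeneration {
		return Intent{}, fmt.Errorf("%w: intent=%d observed=%d",
			ErrStale, in.Generation, observedGeneration)
	}
	return in, nil
}

func main() {
	_, err := ValidateIntent(`{"action":"delete-namespace","generation":3}`, 3)
	fmt.Println("rejected:", err)

	in, err := ValidateIntent(`{"action":"bump-memory","generation":3}`, 3)
	fmt.Println("accepted:", in.Action, err)
}
```

The point of the sketch is that a prompt-injected agent can at most produce a string that this function rejects; it never holds credentials that could apply the change itself.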
| | LLMops tools with kubectl RBAC | k8s4claw |
|---|---|---|
| Who mutates StatefulSets | the LLM | the reconciler, never the LLM |
| Blast radius if prompt-injected | everything the SA can touch | bounded by the intent allowlist |
| Audit | kubectl audit logs | Ed25519-signed receipts + K8s audit |
This is the main architectural distinction. Everything else (runtime registry, IPC bus, auto-update, archival) is infrastructure for running AI agents on K8s.
See threat model · comparison · claw4k8s design
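To make the "signed receipt, non-blocking" behavior concrete, here is a hedged sketch using Go's standard `crypto/ed25519` package. The `Receipt` fields, the JSON-then-sign encoding, and the function names are assumptions for illustration; the project's real receipt format is not shown in this README.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
)

// Receipt is a hypothetical audit record for one executed action.
type Receipt struct {
	Claw      string `json:"claw"`
	Action    string `json:"action"`
	Signature string `json:"signature,omitempty"`
}

// NewKeyPair generates an Ed25519 key pair from the system CSPRNG.
func NewKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// SignReceipt signs the receipt's JSON encoding (signature field cleared).
func SignReceipt(r Receipt, key ed25519.PrivateKey) (Receipt, error) {
	if len(key) != ed25519.PrivateKeySize {
		return r, fmt.Errorf("invalid private key size %d", len(key))
	}
	r.Signature = ""
	payload, err := json.Marshal(r)
	if err != nil {
		return r, err
	}
	r.Signature = base64.StdEncoding.EncodeToString(ed25519.Sign(key, payload))
	return r, nil
}

// VerifyReceipt recomputes the payload and checks the signature.
func VerifyReceipt(r Receipt, pub ed25519.PublicKey) bool {
	sig, err := base64.StdEncoding.DecodeString(r.Signature)
	if err != nil || len(pub) != ed25519.PublicKeySize {
		return false
	}
	r.Signature = ""
	payload, err := json.Marshal(r)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, payload, sig)
}

// RecordAction signs best-effort: a signing failure is logged and the
// unsigned receipt is returned, so the audit step never blocks the action.
func RecordAction(r Receipt, key ed25519.PrivateKey) Receipt {
	signed, err := SignReceipt(r, key)
	if err != nil {
		log.Printf("audit signing failed (non-blocking): %v", err)
		return r
	}
	return signed
}

func main() {
	pub, priv, err := NewKeyPair()
	if err != nil {
		log.Fatal(err)
	}
	r := RecordAction(Receipt{Claw: "my-agent", Action: "bump-memory"}, priv)
	fmt.Println("verified:", VerifyReceipt(r, pub))
}
```

Because the signature only feeds the audit trail, `RecordAction` has no error return: authorization has already happened in the allowlist check before any receipt is written.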
Real OOM → ClawOpsController detects → rule matched → intent applied by reconciler → Ed25519 audit receipt. 90 seconds, end to end.
Running AI agents in production means solving the same problems over and over: secret management, persistent storage, graceful updates, inter-service communication, and observability. k8s4claw wraps all of this into a single Claw CRD so you can focus on what your agent does, not how it runs.
On top of that, claw4k8s lets agents self-heal without ever granting the LLM direct cluster-mutation rights. Deterministic rules auto-fix common issues (OOM → bump memory); novel issues escalate to a Companion Claw that proposes a fix, routes to human approval via Slack, and only then is the signed intent applied — still through the same reconciler, still bounded by the same allowlist.
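The split between deterministic rules and escalation can be sketched as a single decision function. Everything specific here is assumed for illustration: the 50% memory step, the `maxMemoryMi` ceiling, the `bump-memory` action name, and the `Decision` type are not taken from the project.

```go
package main

import "fmt"

// maxMemoryMi is a hypothetical ceiling above which a rule stops auto-fixing.
const maxMemoryMi = 4096

// Decision is what the rule engine hands back: either an intent action to
// submit through the reconciler, or a request to escalate.
type Decision struct {
	Action   string
	MemoryMi int
	Escalate bool
}

// Decide maps a pod failure reason to a deterministic fix. Anything without
// a matching rule, or a rule that has run out of headroom, is escalated.
func Decide(reason string, memoryMi int) Decision {
	switch reason {
	case "OOMKilled":
		next := memoryMi + memoryMi/2 // assumed policy: raise the limit by 50%
		if next > maxMemoryMi {
			next = maxMemoryMi
		}
		if next > memoryMi {
			return Decision{Action: "bump-memory", MemoryMi: next}
		}
	}
	return Decision{MemoryMi: memoryMi, Escalate: true}
}

func main() {
	fmt.Printf("%+v\n", Decide("OOMKilled", 512))
	fmt.Printf("%+v\n", Decide("OOMKilled", maxMemoryMi))
	fmt.Printf("%+v\n", Decide("ImagePullBackOff", 512))
}
```

In this sketch the rule path never calls an LLM, which is what makes it safe to run without approval; only the escalation branch would involve the Companion Claw and a human.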
| Capability | k8sgpt | kubectl-ai | Holmes (Robusta) | k8s4claw + claw4k8s |
|---|---|---|---|---|
| LLM has kubectl / patch RBAC | — | human approves | yes | no (by design) |
| Diagnoses cluster issues | ✓ | ✓ | ✓ | ✓ |
| Fixes without human approval | — | — | — | ✓ (rule-based) |
| Fixes with human approval | — | ✓ | ✓ | ✓ (LLM escalation) |
| Agents manage their own infra | — | — | — | ✓ |
| Cryptographic audit trail | — | — | — | ✓ (Ed25519) |
| Graceful LLM fallback | — | — | partial | ✓ (notification) |
| Primary target | diagnostic CLI | kubectl wrapper | SRE incidents | AI agent self-management |
The wedge: claw4k8s is the first K8s operator where AI agents manage their own infrastructure, with the LLM kept out of the write path by RBAC rather than by a review step. Dogfooding as the product. See full comparison and threat model.
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        OP[k8s4claw Operator]

        subgraph "Claw Pod"
            INIT["claw-init<br/>(config merge)"]
            RT["Runtime Container<br/>(OpenClaw / NanoClaw / ...)"]
            IPC["IPC Bus Sidecar<br/>(WAL + DLQ + backpressure)"]
            CH["Channel Sidecar<br/>(Slack / Webhook / ...)"]
            ARC["Archive Sidecar<br/>(S3 upload)"]
        end

        STS[StatefulSet]
        SVC[Service]
        CM[ConfigMap]
        SA[ServiceAccount]
        PDB[PodDisruptionBudget]
        PVC[(PVCs<br/>session / output / workspace)]
        SEC[/Secrets/]

        OP -->|manages| STS
        OP -->|manages| SVC
        OP -->|manages| CM
        OP -->|manages| SA
        OP -->|manages| PDB
        STS -->|creates| PVC
        STS -.->|runs| INIT
        STS -.->|runs| RT
        STS -.->|runs| IPC
        STS -.->|runs| CH
        STS -.->|runs| ARC
        CH <-->|"UDS<br/>bus.sock"| IPC
        IPC <-->|"WS / TCP / UDS / SSE"| RT
        RT -->|reads| CM
        RT -->|reads| SEC
        ARC -->|mounts| PVC
    end

    REG[(OCI Registry)]
    OBJ[(S3 / MinIO)]
    EXT["External Service<br/>(Slack API, etc.)"]

    OP -.->|"polls tags<br/>(auto-update)"| REG
    ARC -->|uploads| OBJ
    CH <-->|API calls| EXT
```
The IPC Bus is a native sidecar that routes JSON messages between channel sidecars and the AI runtime:
```
Channel Sidecar ──UDS──► IPC Bus ──Bridge──► Runtime Container
                        ┌───────┐
                        │  WAL  │
                        │  DLQ  │
                        │ Ring  │
                        │Buffer │
                        └───────┘
```
- WAL — append-only write-ahead log for at-least-once delivery
- DLQ — BoltDB dead letter queue for messages exceeding retry limits
- Backpressure — ring buffer with high/low watermark flow control
- Bridge protocols — WebSocket (OpenClaw), TCP (PicoClaw), UDS (NanoClaw), SSE (ZeroClaw)
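The backpressure item above can be illustrated with a small ring buffer that pauses intake at a high watermark and resumes at a low one. This is a sketch under assumptions: the `Ring` type, its method names, and the single-goroutine design (no locking) are invented for the example and do not describe the IPC Bus implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Ring is a fixed-capacity FIFO with high/low watermark flow control.
// It is not safe for concurrent use; a real bus would add locking.
type Ring struct {
	buf        []string
	head, size int
	high, low  int
	paused     bool
}

// NewRing requires 0 <= low < high <= capacity.
func NewRing(capacity, high, low int) (*Ring, error) {
	if capacity <= 0 || low < 0 || low >= high || high > capacity {
		return nil, errors.New("need 0 <= low < high <= capacity")
	}
	return &Ring{buf: make([]string, capacity), high: high, low: low}, nil
}

// Push enqueues msg and reports false when the buffer is full. Crossing the
// high watermark sets the paused flag so producers can slow down.
func (r *Ring) Push(msg string) bool {
	if r.size == len(r.buf) {
		return false
	}
	r.buf[(r.head+r.size)%len(r.buf)] = msg
	r.size++
	if r.size >= r.high {
		r.paused = true
	}
	return true
}

// Pop dequeues the oldest message. Draining to the low watermark clears the
// paused flag, which gives the hysteresis between pause and resume.
func (r *Ring) Pop() (string, bool) {
	if r.size == 0 {
		return "", false
	}
	msg := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	if r.size <= r.low {
		r.paused = false
	}
	return msg, true
}

// Paused reports whether producers should currently hold back.
func (r *Ring) Paused() bool { return r.paused }

func main() {
	r, err := NewRing(4, 3, 1)
	if err != nil {
		panic(err)
	}
	for _, m := range []string{"a", "b", "c"} {
		r.Push(m)
	}
	fmt.Println("paused after 3 pushes:", r.Paused())
	r.Pop()
	r.Pop()
	fmt.Println("paused after draining to 1:", r.Paused())
}
```

Using two watermarks instead of one threshold avoids flapping: once paused, the producer stays paused until the consumer has made real progress.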
The `spec.runtime` column shows the exact value to put in `spec.runtime` on a Claw CR. Names ending in `claw` are values of k8s4claw's internal runtime type enum; they are wrappers around upstream projects, not forks.
| `spec.runtime` | Language | Upstream / Use Case | Gateway | Probe |
|---|---|---|---|---|
| `openclaw` | Go | WebSocket AI gateway + Anthropic SDK (first-party, verified end-to-end) | 18900 | HTTP |
| `hermesclaw` | Python | NousResearch upstream Hermes Agent (build instructions in `runtimes/hermesclaw/`) | 8642 | HTTP |
| `hermesrs` | Rust | Runs hermes-agent-rs, the Rust Hermes (verified end-to-end) | 8080 | HTTP |
| `k8sops` | Go | Companion Claw runtime used by claw4k8s for self-healing | 18910 | HTTP |
| `custom` | Any | Bring your own runtime image | — | — |
- Kubernetes cluster (v1.28+, or kind / minikube for local dev)
- kubectl configured
- Go 1.23+ (for building from source)
Option A: Helm (recommended, v0.2.1+):

```bash
helm install k8s4claw oci://ghcr.io/prismer-ai/charts/k8s4claw --version 0.2.1 \
  --namespace k8s4claw-system --create-namespace \
  --set webhook.certManager.enabled=true  # requires cert-manager pre-installed
```

Or from source:

```bash
git clone https://github.com/Prismer-AI/k8s4claw.git
cd k8s4claw
helm install k8s4claw charts/k8s4claw --namespace k8s4claw-system --create-namespace
```

Option B: From source with Make:
```bash
git clone https://github.com/Prismer-AI/k8s4claw.git
cd k8s4claw

# Install CRDs into the cluster
make install

# Run operator locally (or deploy with `make deploy`)
make run
```

Create a Secret holding the LLM API keys:

```bash
kubectl create secret generic llm-api-keys \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx
```

Define a Claw that references the Secret:

```yaml
# my-agent.yaml
apiVersion: claw.prismer.ai/v1alpha1
kind: Claw
metadata:
  name: my-agent
spec:
  runtime: openclaw
  config:
    model: "claude-sonnet-4"
  credentials:
    secretRef:
      name: llm-api-keys
  persistence:
    session:
      enabled: true
      size: 2Gi
      mountPath: /data/session
    workspace:
      enabled: true
      size: 10Gi
      mountPath: /workspace
```

Apply the manifest:

```bash
kubectl apply -f my-agent.yaml

# Watch it come up
kubectl get claw my-agent -w
```

To connect a channel, define a ClawChannel:

```yaml
apiVersion: claw.prismer.ai/v1alpha1
kind: ClawChannel
metadata:
  name: slack-team
spec:
  type: slack
  mode: bidirectional
  credentials:
    secretRef:
      name: slack-bot-token
  config:
    appId: "A0123456789"
```

Then reference it in your Claw:
```yaml
spec:
  channels:
    - name: slack-team
      mode: bidirectional
```

The unique wedge: AI agents manage their own Kubernetes infrastructure. See architecture diagrams.
- ClawOpsController: watches Pod status (OOMKilled, CrashLoop, HighCPU, Evicted) and auto-executes low-risk fixes from a deterministic rule engine.
- Intent annotation pattern: agents never patch StatefulSets directly; a single reconciler consumes intents through a 5-action allowlist with generation-based idempotency. Zero controller contention.
- Companion Claw (LLM agent): handles novel issues. It analyzes the problem, proposes a fix, and routes it to human approval via Slack (ClawChannel integration).
- Ed25519 audit receipts: auto-executed actions get a signed receipt for the audit trail when signing succeeds (signing is non-blocking); pure-Go signer with optional `signet` CLI fallback.
- Graceful LLM fallback: 3 retries with exponential backoff, then degrades to human notification, so remediation never stalls on an unavailable LLM.
- ClawOpsEscalation CRD: dual-purpose audit record and workflow state machine (Pending → Analyzing → Proposed → AwaitingApproval → Approved → Executed).
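The graceful LLM fallback in the list above (retries with exponential backoff, then a human notification) can be sketched as follows. The function names, the injected `sleep` and `notify` callbacks, and the reading of "3 retries" as one initial attempt plus three retries are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrLLMUnavailable is a placeholder error for a failed LLM call.
var ErrLLMUnavailable = errors.New("llm unavailable")

// Delay returns the wait before retry n (1-based): base, 2*base, 4*base, ...
func Delay(base time.Duration, retry int) time.Duration {
	d := base
	for i := 1; i < retry; i++ {
		d *= 2
	}
	return d
}

// NoSleep is a stand-in for time.Sleep that returns immediately.
func NoSleep(time.Duration) {}

// ProposeWithFallback calls ask once, then retries up to `retries` times
// with exponential backoff. If every attempt fails it calls notify with the
// last error and reports false, so the caller can hand over to a human.
func ProposeWithFallback(
	ask func() (string, error),
	notify func(error),
	retries int,
	base time.Duration,
	sleep func(time.Duration),
) (string, bool) {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		if attempt > 0 {
			sleep(Delay(base, attempt))
		}
		proposal, err := ask()
		if err == nil {
			return proposal, true
		}
		lastErr = err
	}
	notify(lastErr)
	return "", false
}

func main() {
	ask := func() (string, error) { return "", ErrLLMUnavailable }
	notify := func(err error) { fmt.Println("notify human:", err) }

	_, ok := ProposeWithFallback(ask, notify, 3, 10*time.Millisecond, time.Sleep)
	fmt.Println("proposal available:", ok)
}
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays; production code would pass `time.Sleep` or a context-aware wait.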
- `Claw` CRD manages StatefulSet, Service, ConfigMap, ServiceAccount, PDB, PVCs, NetworkPolicy, Ingress, and RBAC in a single declarative resource
- Per-runtime resource defaults, liveness/readiness probes, graceful shutdown tuning
- Webhook validation: credential requirements, PVC immutability, runtime type lock, NetworkPolicy mandatory for the `k8sops` runtime
- Finalizer-based cleanup with `Retain` / `Delete` / `Archive` reclaim policies
- OCI registry polling on cron schedule
- Semver constraint filtering (`^1.x`, `~2.0.0`)
- Health-verified rollouts with configurable timeout
- Automatic rollback + circuit breaker after N consecutive failures
- Session, output, and workspace PVCs via StatefulSet `volumeClaimTemplates`
- CSI VolumeSnapshot on cron schedule with retention pruning
- S3-compatible archival sidecar (S3, MinIO, GCS, R2) with lifecycle policies
- ClawChannel CRD — declarative channel definitions with reference counting
- Built-in sidecars: Slack, Discord, Webhook (more coming)
- Custom sidecar support for any protocol
- Bidirectional / inbound / outbound modes
- WAL — at-least-once delivery via BoltDB write-ahead log
- DLQ — dead letter queue for messages exceeding retry limits
- Backpressure — ring buffer with high/low watermark flow control
- Protocol bridges — WebSocket (OpenClaw), TCP (PicoClaw), UDS (NanoClaw), SSE (ZeroClaw)
- Pod Security Standards: `runAsNonRoot`, `readOnlyRootFilesystem`, `seccompProfile=RuntimeDefault`, `drop=[ALL]` capabilities
- NetworkPolicy defaults: default-deny + selective allow
- Per-instance ServiceAccount with `automountServiceAccountToken=false`
- ExternalSecrets integration for secrets rotation
- Ed25519 cryptographic audit for all ops actions
- ClawSelfConfig CRD — agents can modify their own skills, config, workspace files, and env vars
- Scoped allowlist via `spec.selfConfigure.allowedActions` (skills, config, workspaceFiles, envVars)
- Rate limits on self-mutation
- Prometheus metrics per Claw instance (reconcile latency, phase transitions, remediation actions, LLM latency)
- K8s Events on all phase transitions
- Status subresource with detailed conditions (RuntimeReady, AutoUpdateStatus, ChannelStatus)
- PrometheusRule + ServiceMonitor templates in the Helm chart
Create a Claw programmatically with the Go SDK:

```go
import (
	"context"
	"log"

	"github.com/Prismer-AI/k8s4claw/sdk"
)

ctx := context.Background()

client, err := sdk.NewClient()
if err != nil {
	log.Fatal(err)
}

// Create a Claw running the OpenClaw runtime.
claw, err := client.Create(ctx, &sdk.ClawSpec{
	Runtime: sdk.OpenClaw,
	Config: &sdk.RuntimeConfig{
		Environment: map[string]string{"MODEL": "claude-sonnet-4"},
	},
})
if err != nil {
	log.Fatal(err)
}
log.Printf("created: %+v", claw)
```

Common Make targets:

```bash
make build          # Build operator binary
make build-ipcbus   # Build IPC Bus binary
make test           # Run tests (requires setup-envtest)
make lint           # Lint
make vet            # Run go vet
make fmt            # Run gofmt + goimports
make manifests      # Generate CRD YAML
make generate       # Generate deepcopy
make docker-build   # Build container image
```

See CONTRIBUTING.md for the full development guide.
- Operator Core Design
- IPC Bus + Resilience Design
- Auto-Update Controller Design
- claw4k8s Autonomous Ops Design — self-healing + LLM escalation + Ed25519 audit
- claw4k8s Implementation Plan — task-by-task breakdown
- claw4k8s Architecture Diagrams — Mermaid flowcharts of the full auto-remediation loop
- vs k8sgpt / kubectl-ai / Holmes — positioning comparison
Apache-2.0