k8s4claw

A Kubernetes operator for AI agents that keeps the LLM out of the cluster's write path. One CRD, any runtime (OpenClaw, NanoClaw, ZeroClaw, PicoClaw, IronClaw, HermesClaw, HermesRS, K8sOps), self-healing from day one.

The core idea

k8s4claw keeps the LLM out of the cluster's write path. The agent's ServiceAccount cannot patch workload objects. The LLM can only submit an ops-intent annotation on a Claw CR; a Go reconciler validates the intent's JSON shape against a 5-action allowlist (plus generation guard) and is the only component that mutates workloads. The reconciler also writes an Ed25519 audit receipt when the action runs through the auto-execute path — the signature is for the audit trail, not for authorization, and is non-blocking if signing fails.

	LLMops tools with `kubectl` RBAC	k8s4claw
Who mutates StatefulSets	the LLM	the reconciler, never the LLM
Blast radius if prompt-injected	everything the SA can touch	bounded by the intent allowlist
Audit	Kubernetes API audit logs	Ed25519-signed receipts + K8s audit

This is the main architectural distinction. Everything else (runtime registry, IPC bus, auto-update, archival) is infrastructure for running AI agents on K8s.

See threat model · comparison · claw4k8s design

Demo: a real OOM is detected by ClawOpsController, a rule is matched, the reconciler applies the intent, and an Ed25519 audit receipt is written. 90 seconds, end to end.

Why k8s4claw?

Running AI agents in production means solving the same problems over and over: secret management, persistent storage, graceful updates, inter-service communication, and observability. k8s4claw wraps all of this into a single Claw CRD so you can focus on what your agent does, not how it runs.

On top of that, claw4k8s lets agents self-heal without ever granting the LLM direct cluster-mutation rights. Deterministic rules auto-fix common issues (OOM → bump memory); novel issues escalate to a Companion Claw that proposes a fix, routes to human approval via Slack, and only then is the signed intent applied — still through the same reconciler, still bounded by the same allowlist.

How is this different?

Capability	k8sgpt	kubectl-ai	Holmes (Robusta)	k8s4claw + claw4k8s
LLM has `kubectl` / patch RBAC	—	human approves	yes	no (by design)
Diagnoses cluster issues	✓	✓	✓	✓
Fixes without human approval	—	—	—	✓ (rule-based)
Fixes with human approval	—	✓	✓	✓ (LLM escalation)
Agents manage their own infra	—	—	—	✓
Cryptographic audit trail	—	—	—	✓ (Ed25519)
Graceful LLM fallback	—	—	partial	✓ (notification)
Primary target	diagnostic CLI	kubectl wrapper	SRE incidents	AI agent self-management

The wedge: claw4k8s is the first K8s operator where AI agents manage their own infrastructure, with the LLM kept out of the write path by RBAC rather than by a review step. Dogfooding as the product. See full comparison and threat model.

Architecture

graph TB
    subgraph "Kubernetes Cluster"
        OP[k8s4claw Operator]

        subgraph "Claw Pod"
            INIT["claw-init<br/>(config merge)"]
            RT["Runtime Container<br/>(OpenClaw / NanoClaw / ...)"]
            IPC["IPC Bus Sidecar<br/>(WAL + DLQ + backpressure)"]
            CH["Channel Sidecar<br/>(Slack / Webhook / ...)"]
            ARC["Archive Sidecar<br/>(S3 upload)"]
        end

        STS[StatefulSet]
        SVC[Service]
        CM[ConfigMap]
        SA[ServiceAccount]
        PDB[PodDisruptionBudget]
        PVC[(PVCs<br/>session / output / workspace)]
        SEC[/Secrets/]

        OP -->|manages| STS
        OP -->|manages| SVC
        OP -->|manages| CM
        OP -->|manages| SA
        OP -->|manages| PDB
        STS -->|creates| PVC

        STS -.->|runs| INIT
        STS -.->|runs| RT
        STS -.->|runs| IPC
        STS -.->|runs| CH
        STS -.->|runs| ARC

        CH <-->|"UDS<br/>bus.sock"| IPC
        IPC <-->|"WS / TCP / UDS / SSE"| RT
        RT -->|reads| CM
        RT -->|reads| SEC
        ARC -->|mounts| PVC
    end

    REG[(OCI Registry)]
    OBJ[(S3 / MinIO)]
    EXT["External Service<br/>(Slack API, etc.)"]

    OP -.->|"polls tags<br/>(auto-update)"| REG
    ARC -->|uploads| OBJ
    CH <-->|API calls| EXT

IPC Bus Detail

The IPC Bus is a native sidecar that routes JSON messages between channel sidecars and the AI runtime:

Channel Sidecar ──UDS──► IPC Bus ──Bridge──► Runtime Container
                        │  WAL  │
                        │  DLQ  │
                        │ Ring  │
                        │Buffer │
                        └───────┘

WAL — append-only write-ahead log for at-least-once delivery
DLQ — BoltDB dead letter queue for messages exceeding retry limits
Backpressure — ring buffer with high/low watermark flow control
Bridge protocols — WebSocket (OpenClaw), TCP (PicoClaw), UDS (NanoClaw), SSE (ZeroClaw)
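To make the backpressure item concrete, here is a small sketch of a ring buffer with high/low watermark flow control. It is an illustration under assumptions, not the IPC Bus implementation: the type name, the string payloads, and the watermark values are hypothetical.

```go
package main

import "fmt"

// Ring is a fixed-capacity FIFO with hysteresis: producers are asked to
// pause once the fill level reaches the high watermark and may resume only
// after it drains to the low watermark.
type Ring struct {
	buf        []string
	head, size int
	high, low  int
	paused     bool
}

// NewRing assumes 0 <= low < high <= capacity.
func NewRing(capacity, high, low int) *Ring {
	return &Ring{buf: make([]string, capacity), high: high, low: low}
}

// Push appends a message and reports false when the buffer is full.
func (r *Ring) Push(msg string) bool {
	if r.size == len(r.buf) {
		return false
	}
	r.buf[(r.head+r.size)%len(r.buf)] = msg
	r.size++
	if r.size >= r.high {
		r.paused = true
	}
	return true
}

// Pop removes the oldest message and reports false when the buffer is empty.
func (r *Ring) Pop() (string, bool) {
	if r.size == 0 {
		return "", false
	}
	msg := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	if r.size <= r.low {
		r.paused = false
	}
	return msg, true
}

// Paused reports whether producers should currently hold off.
func (r *Ring) Paused() bool { return r.paused }

func main() {
	r := NewRing(8, 6, 2)
	for i := 0; i < 6; i++ {
		r.Push(fmt.Sprintf("msg-%d", i))
	}
	fmt.Println("paused after filling to high watermark:", r.Paused())
}
```

Using two watermarks instead of one threshold avoids rapid pause/resume flapping when the fill level hovers around a single limit.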

Supported Runtimes

The `spec.runtime` column shows the exact value to put in spec.runtime on a Claw CR. Names ending in claw are values of k8s4claw's internal runtime type enum; they are wrappers around upstream projects, not forks.

`spec.runtime`	Language	Upstream / Use Case	Gateway	Probe
`openclaw`	Go	WebSocket AI gateway + Anthropic SDK (first-party, verified end-to-end)	18900	HTTP
`hermesclaw`	Python	NousResearch upstream Hermes Agent — build instructions in runtimes/hermesclaw/	8642	HTTP
`hermesrs`	Rust	Runs hermes-agent-rs — Rust Hermes (verified end-to-end)	8080	HTTP
`k8sops`	Go	Companion Claw runtime used by claw4k8s for self-healing	18910	HTTP
`custom`	Any	Bring your own runtime image	—	—

Quick Start

Prerequisites

Kubernetes cluster (v1.28+, or kind / minikube for local dev)
kubectl configured
Go 1.23+ (for building from source)

1. Install CRDs and run the operator

Option A: Helm (recommended, v0.2.1+):

helm install k8s4claw oci://ghcr.io/prismer-ai/charts/k8s4claw --version 0.2.1 \
  --namespace k8s4claw-system --create-namespace \
  --set webhook.certManager.enabled=true  # requires cert-manager pre-installed

Or from source:

git clone https://github.com/Prismer-AI/k8s4claw.git
cd k8s4claw
helm install k8s4claw charts/k8s4claw --namespace k8s4claw-system --create-namespace

Option B: From source with Make:

git clone https://github.com/Prismer-AI/k8s4claw.git
cd k8s4claw

# Install CRDs into the cluster
make install

# Run operator locally (or deploy with `make deploy`)
make run

2. Create a Secret for your LLM API keys

kubectl create secret generic llm-api-keys \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx

3. Deploy your first AI agent

# my-agent.yaml
apiVersion: claw.prismer.ai/v1alpha1
kind: Claw
metadata:
  name: my-agent
spec:
  runtime: openclaw
  config:
    model: "claude-sonnet-4"
  credentials:
    secretRef:
      name: llm-api-keys
  persistence:
    session:
      enabled: true
      size: 2Gi
      mountPath: /data/session
    workspace:
      enabled: true
      size: 10Gi
      mountPath: /workspace

kubectl apply -f my-agent.yaml

# Watch it come up
kubectl get claw my-agent -w

4. Connect a Slack channel (optional)

apiVersion: claw.prismer.ai/v1alpha1
kind: ClawChannel
metadata:
  name: slack-team
spec:
  type: slack
  mode: bidirectional
  credentials:
    secretRef:
      name: slack-bot-token
  config:
    appId: "A0123456789"

Then reference it in your Claw:

spec:
  channels:
    - name: slack-team
      mode: bidirectional

Features

claw4k8s — Autonomous Self-Healing (v0.2.0+)

The unique wedge: AI agents manage their own Kubernetes infrastructure. See architecture diagrams.

ClawOpsController — watches Pod status (OOMKilled, CrashLoop, HighCPU, Evicted) and auto-executes low-risk fixes from a deterministic rule engine
Intent annotation pattern — agents never patch StatefulSets directly; a single reconciler consumes intents through a 5-action allowlist with generation-based idempotency. Zero controller contention.
Companion Claw (LLM agent) — handles novel issues. Analyzes, proposes, routes to human approval via Slack (ClawChannel integration).
Ed25519 audit receipts — auto-executed actions get a signed receipt for the audit trail when signing succeeds (signing is non-blocking); pure-Go signer with optional signet CLI fallback.
Graceful LLM fallback — 3 retries with exponential backoff, then degrades to human notification, so an unavailable LLM never stalls remediation.
ClawOpsEscalation CRD — dual-purpose audit + workflow state machine (Pending → Analyzing → Proposed → AwaitingApproval → Approved → Executed).
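The escalation workflow listed above can be modeled as a simple forward-only state machine. The sketch below encodes only the six phases named in this README; the real CRD may define additional phases (for example for rejection or failure) that are not shown here, and the function name is hypothetical.

```go
package main

import "fmt"

// Phase mirrors the ClawOpsEscalation phases named in the text.
type Phase string

const (
	Pending          Phase = "Pending"
	Analyzing        Phase = "Analyzing"
	Proposed         Phase = "Proposed"
	AwaitingApproval Phase = "AwaitingApproval"
	Approved         Phase = "Approved"
	Executed         Phase = "Executed"
)

// next encodes the single permitted successor of each non-terminal phase.
var next = map[Phase]Phase{
	Pending:          Analyzing,
	Analyzing:        Proposed,
	Proposed:         AwaitingApproval,
	AwaitingApproval: Approved,
	Approved:         Executed,
}

// Advance returns nil only when from -> to is the one allowed forward step,
// so a proposal can never reach Executed without passing through Approved.
func Advance(from, to Phase) error {
	want, ok := next[from]
	if !ok {
		return fmt.Errorf("phase %q has no successor", from)
	}
	if to != want {
		return fmt.Errorf("illegal transition %s -> %s (expected %s)", from, to, want)
	}
	return nil
}

func main() {
	fmt.Println(Advance(Pending, Analyzing))
	fmt.Println(Advance(Proposed, Executed))
}
```

Because every transition is checked, the escalation object doubles as an audit record: its phase history shows that approval preceded execution.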

Declarative Lifecycle Management

Claw CRD manages StatefulSet, Service, ConfigMap, ServiceAccount, PDB, PVCs, NetworkPolicy, Ingress, RBAC in a single declarative resource
Per-runtime resource defaults, liveness/readiness probes, graceful shutdown tuning
Webhook validation: credential requirements, PVC immutability, runtime type lock, NetworkPolicy mandatory for k8sops runtime
Finalizer-based cleanup with Retain / Delete / Archive reclaim policies

Auto-Update with Circuit Breaker

OCI registry polling on cron schedule
Semver constraint filtering (^1.x, ~2.0.0)
Health-verified rollouts with configurable timeout
Automatic rollback + circuit breaker after N consecutive failures
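The "circuit breaker after N consecutive failures" behavior might look roughly like this. It is a sketch under assumptions: the type and method names are hypothetical, and how the operator actually resets the breaker is not described in this README, so `Reset` is shown as an explicit manual step.

```go
package main

import "fmt"

// Breaker stops further rollout attempts after a run of consecutive
// failed (rolled-back) updates.
type Breaker struct {
	threshold int
	failures  int
	open      bool
}

func NewBreaker(threshold int) *Breaker {
	return &Breaker{threshold: threshold}
}

// Allow reports whether a new update attempt may start.
func (b *Breaker) Allow() bool { return !b.open }

// Record registers the outcome of a health-verified rollout. A healthy
// rollout clears the failure streak; a failed one extends it and opens
// the breaker once the threshold is reached.
func (b *Breaker) Record(healthy bool) {
	if healthy {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open = true
	}
}

// Reset closes the breaker, for example after an operator intervenes.
func (b *Breaker) Reset() {
	b.failures = 0
	b.open = false
}

func main() {
	b := NewBreaker(3)
	for _, healthy := range []bool{false, false, false} {
		b.Record(healthy)
	}
	fmt.Println("updates allowed:", b.Allow())
}
```

Counting consecutive failures rather than total failures means an occasional bad release does not permanently disable auto-update.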

Persistence & Archival

Session, output, and workspace PVCs via StatefulSet volumeClaimTemplates
CSI VolumeSnapshot on cron schedule with retention pruning
S3-compatible archival sidecar (S3, MinIO, GCS, R2) with lifecycle policies

Communication Channels

ClawChannel CRD — declarative channel definitions with reference counting
Built-in sidecars: Slack, Discord, Webhook (more coming)
Custom sidecar support for any protocol
Bidirectional / inbound / outbound modes

IPC Bus — Reliable In-Pod Messaging

WAL — at-least-once delivery via BoltDB write-ahead log
DLQ — dead letter queue for messages exceeding retry limits
Backpressure — ring buffer with high/low watermark flow control
Protocol bridges — WebSocket (OpenClaw), TCP (PicoClaw), UDS (NanoClaw), SSE (ZeroClaw)
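The at-least-once guarantee follows from a simple rule: record the message before attempting delivery, and replay anything that was never acknowledged. The sketch below shows that rule with an in-memory stand-in; it does not use BoltDB or any on-disk format, and all names are hypothetical.

```go
package main

import "fmt"

type entry struct {
	seq   uint64
	msg   string
	acked bool
}

// WAL is an in-memory stand-in for an append-only log. A real log would
// persist each entry to disk before Append returns.
type WAL struct {
	next    uint64
	entries []entry
}

// Append records msg before any delivery attempt and returns its
// sequence number.
func (w *WAL) Append(msg string) uint64 {
	w.next++
	w.entries = append(w.entries, entry{seq: w.next, msg: msg})
	return w.next
}

// Ack marks the entry with the given sequence number as delivered.
func (w *WAL) Ack(seq uint64) {
	for i := range w.entries {
		if w.entries[i].seq == seq {
			w.entries[i].acked = true
			return
		}
	}
}

// Pending returns unacknowledged messages in append order. Replaying
// them after a restart yields at-least-once delivery: a message whose
// ack was lost is sent again, so receivers must tolerate duplicates.
func (w *WAL) Pending() []string {
	var out []string
	for _, e := range w.entries {
		if !e.acked {
			out = append(out, e.msg)
		}
	}
	return out
}

func main() {
	var w WAL
	first := w.Append("hello")
	w.Append("world")
	w.Ack(first)
	fmt.Println(w.Pending())
}
```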

Security & Compliance

Pod Security Standards: runAsNonRoot, readOnlyRootFilesystem, seccompProfile=RuntimeDefault, drop=[ALL] capabilities
NetworkPolicy defaults: default-deny + selective allow
Per-instance ServiceAccount with automountServiceAccountToken=false
ExternalSecrets integration for secrets rotation
Ed25519-signed audit receipts for auto-executed ops actions (signing is non-blocking)

Self-Configuration

ClawSelfConfig CRD — agents can modify their own skills, config, workspace files, and env vars
Scoped allowlist via spec.selfConfigure.allowedActions (skills, config, workspaceFiles, envVars)
Rate limits on self-mutation

Observability

Prometheus metrics per Claw instance (reconcile latency, phase transitions, remediation actions, LLM latency)
K8s Events on all phase transitions
Status subresource with detailed conditions (RuntimeReady, AutoUpdateStatus, ChannelStatus)
PrometheusRule + ServiceMonitor templates in the Helm chart

Go SDK

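package main

import (
    "context"
    "log"
    "github.com/Prismer-AI/k8s4claw/sdk"
)

func main() {
    ctx := context.Background()
    client, err := sdk.NewClient()
    if err != nil {
        log.Fatal(err)
    }
    // Create a Claw on the OpenClaw runtime, passing the model via environment.
    claw, err := client.Create(ctx, &sdk.ClawSpec{
        Runtime: sdk.OpenClaw,
        Config: &sdk.RuntimeConfig{
            Environment: map[string]string{"MODEL": "claude-sonnet-4"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("created: %+v", claw)
}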

Development

make build          # Build operator binary
make build-ipcbus   # Build IPC Bus binary
make test           # Run tests (requires setup-envtest)
make lint           # Lint
make vet            # Run go vet
make fmt            # Run gofmt + goimports
make manifests      # Generate CRD YAML
make generate       # Generate deepcopy
make docker-build   # Build container image

See CONTRIBUTING.md for the full development guide.

Design Documents

Operator Core Design
IPC Bus + Resilience Design
Auto-Update Controller Design
claw4k8s Autonomous Ops Design — self-healing + LLM escalation + Ed25519 audit
claw4k8s Implementation Plan — task-by-task breakdown
claw4k8s Architecture Diagrams — Mermaid flowcharts of the full auto-remediation loop
vs k8sgpt / kubectl-ai / Holmes — positioning comparison

License

Apache-2.0
