diff --git a/README.md b/README.md index f71c00b..ad8c0c2 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,6 @@ +**Here is your improved README** with a new **Troubleshooting** section added: + +```markdown
# KERNO @@ -23,6 +26,11 @@
--- +## Contributing + +We welcome contributions! Whether it's fixing a bug, improving documentation, adding a diagnostic rule, or working on eBPF — every contribution helps. + +See [CONTRIBUTING.md](CONTRIBUTING.md) to get started. ## What is Kerno? @@ -98,217 +106,120 @@ That's the entire debugging loop - from page to root cause - in a single command ## How Kerno compares -| | Watches | K8s-Native | Incident Report | SLO Mapping | AI Analysis | Install Time | -|---|:---:|:---:|:---:|:---:|:---:|:---:| -| Prometheus + Grafana | Application | Partial | No | No | No | Hours | -| Datadog APM | Application | Partial | No | Partial | Yes | Hours | -| Cilium Tetragon | Security | **Yes** | No | No | No | Minutes | -| Inspektor Gadget | Container | **Yes** | No | No | No | Minutes | -| Pixie | Application | **Yes** | No | No | No | Minutes | -| **Kerno** | **Kernel** | **Yes** | **Yes** | **Yes** | **Yes** | **< 1 min** | +| Tool | Watches | K8s-Native | Incident Report | Root Cause Analysis | AI Analysis | Install Time | +|-------------------------|-------------|------------|-----------------|---------------------|-------------|--------------| +| Prometheus + Grafana | Application | Partial | No | No | No | Hours | +| Datadog APM | Application | Partial | No | Partial | Yes | Hours | +| Cilium Tetragon | Security | Yes | No | No | No | Minutes | +| Inspektor Gadget | Container | Yes | No | No | No | Minutes | +| Pixie | Application | Yes | No | No | No | Minutes | +| **Kerno** | **Kernel** | **Yes** | **Yes** | **Yes** | **Yes** | **< 1 min** | -Kerno is the only eBPF tool in the Kubernetes ecosystem that produces a ranked, human-readable **incident report** - not a firehose of events, not another dashboard, not a query language to learn. +**Kerno is the only eBPF tool** in the Kubernetes ecosystem that produces a **ranked, human-readable incident report** — not just raw events or another dashboard. 
--- ## Quick Start -> **Requires:** kernel **5.8+** with BTF (every major managed K8s qualifies: EKS, GKE, AKS, DOKS, Linode, Civo). For raw manifests/Helm you'll need cluster-admin. +> **Requires:** kernel **5.8+** with BTF (every major managed K8s qualifies: EKS, GKE, AKS, DOKS, Linode, Civo). -### 1 · Kubernetes (primary) +### 1. Kubernetes (Recommended) ```bash helm install kerno ./deploy/helm/kerno \ -n kerno-system --create-namespace ``` -Within 30 seconds Kerno is running as a DaemonSet on every node, watching the kernel via eBPF, exposing `/metrics` for Prometheus, and ready for `kerno doctor`. - ```bash -# Cluster-wide incident report - 30 seconds of real kernel data +# Cluster-wide incident report (30 seconds) kubectl -n kerno-system exec ds/kerno -- kerno doctor -# CI-friendly: machine-readable JSON, exits non-zero on critical findings +# JSON output for CI/CD kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json --exit-code -# AI-enriched root cause analysis (set the API key once) -kubectl -n kerno-system set env ds/kerno KERNO_AI_API_KEY=sk-... +# With AI analysis kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai ``` -ServiceMonitor for the Prometheus Operator is built-in. Raw manifests live at [`deploy/k8s/`](deploy/k8s/) if you don't use Helm. - ---- - -### 2 · Bare metal · VMs · EC2 · GCE - -The same binary, the same command. No Kubernetes required. +### 2. 
Bare Metal / VMs / EC2 / GCE ```bash curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash sudo kerno doctor ``` -Long-lived systemd service with `/metrics` for Prometheus: - -```bash -curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash -s -- --daemon -journalctl -u kerno -f -``` - -### 3 · Docker (ad-hoc, any host with a privileged daemon) - -```bash -docker run --rm --privileged --pid=host \ - -v /sys/kernel/debug:/sys/kernel/debug:ro \ - -v /sys/kernel/btf:/sys/kernel/btf:ro \ - -v /sys/fs/bpf:/sys/fs/bpf \ - -v /proc:/proc:ro \ - ghcr.io/optiqor/kerno:latest doctor -``` +--- -Multi-arch (`linux/amd64`, `linux/arm64`) images published to GHCR on every release. +## Troubleshooting -### Shell Completion +### eBPF program fails to load -Enable tab completion for your shell: +**Error:** `failed to load BPF program` or `permission denied` -**Bash:** +**Solutions:** +- Make sure your kernel is **5.8+** and has **BTF** enabled +- Run with proper capabilities (Kerno already uses minimum required) +- On some systems you may need: ```bash -# Load completions for current session -source <(kerno completion bash) - -# Persist across sessions -echo 'source <(kerno completion bash)' >> ~/.bashrc +sudo sysctl -w kernel.unprivileged_bpf_disabled=0 ``` -**Zsh:** - -```bash -# Enable completions (add to ~/.zshrc if not already present) -echo 'autoload -U compinit; compinit' >> ~/.zshrc - -# Load completions for current session -autoload -U compinit && compinit -kerno completion zsh > "${fpath[1]}/_kerno" - -# Persist across sessions - run once, then start new shell -kerno completion zsh > "${fpath[1]}/_kerno" -``` +### `kerno doctor` shows no output / empty report -**Fish:** +**Possible causes:** +- eBPF programs failed to load silently +- Very short collection window +**Fix:** ```bash -# Load completions for current session -kerno completion fish | source - -# Persist across sessions -kerno 
completion fish > ~/.config/fish/completions/kerno.fish +kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno +kubectl -n kerno-system exec ds/kerno -- kerno doctor --duration 30s ``` -**PowerShell:** +### Prometheus metrics not appearing -```powershell -# Add to your PowerShell profile -kerno completion powershell > kerno.ps1 -. ./kerno.ps1 +**Check:** +```bash +kubectl -n kerno-system port-forward ds/kerno 9090:9090 +curl http://localhost:9090/metrics ``` ---- - -## Kubernetes Deployment +Make sure `serviceMonitor.enabled: true` in Helm values if using Prometheus Operator. -Kerno is designed from day one to run as a Kubernetes DaemonSet. One pod per node, one eBPF agent per kernel, zero API server load. +### AI features not working -```mermaid -flowchart TB - subgraph Cluster["Kubernetes Cluster"] - direction TB - subgraph Node1["Worker Node 1"] - K1["Kerno Pod
<br/>DaemonSet"] - W1["Workload Pods"] - end - subgraph Node2["Worker Node 2"] - K2["Kerno Pod<br/>DaemonSet"] - W2["Workload Pods"] - end - subgraph Node3["Worker Node N"] - K3["Kerno Pod<br/>
DaemonSet"] - W3["Workload Pods"] - end - end +**Error:** `AI analysis failed` or empty AI output - K1 -->|:9090/metrics| Prom["Prometheus"] - K2 -->|:9090/metrics| Prom - K3 -->|:9090/metrics| Prom - Prom --> GF["Grafana"] - - K1 -.enriches.-> W1 - K2 -.enriches.-> W2 - K3 -.enriches.-> W3 - - style K1 fill:#e94560,stroke:#fff,color:#fff - style K2 fill:#e94560,stroke:#fff,color:#fff - style K3 fill:#e94560,stroke:#fff,color:#fff - style Prom fill:#0f3460,stroke:#fff,color:#fff - style GF fill:#16213e,stroke:#fff,color:#fff - style W1 fill:#533483,stroke:#fff,color:#fff - style W2 fill:#533483,stroke:#fff,color:#fff - style W3 fill:#533483,stroke:#fff,color:#fff +**Fix:** +```bash +kubectl -n kerno-system set env ds/kerno \ + KERNO_AI_PROVIDER=anthropic \ + KERNO_AI_API_KEY=sk-... ``` -### Pod enrichment - no API server load - -Kerno tags every finding with pod, namespace, node, and workload labels. No `client-go` informers, no watch connections - Kerno reads `/var/lib/kubelet/pods` directly, so even a failing API server doesn't blind the agent. Exactly when you need it most. - -### Host mounts - the minimum necessary +Currently supported providers: `anthropic`, `openai`, `ollama`. -| Mount | Why | -|---|---| -| `/sys/kernel/debug` | tracepoints, kprobes | -| `/sys/kernel/btf` | CO-RE type resolution | -| `/sys/fs/bpf` | BPF map pinning | -| `/proc` | PID → cgroup → pod resolution | -| `/sys/fs/cgroup` | container resource accounting | -| `/sys/class/net` | per-interface TCP counters | -| `/sys/block` | per-device disk stats | +### Running on unsupported kernel -### Security posture +Kerno gracefully degrades. If some eBPF programs fail to load, those collectors are skipped. You will see warnings in the logs: -- Runs with the **minimum capabilities needed** - `CAP_BPF`, `CAP_PERFMON`, `CAP_SYS_PTRACE`, `CAP_NET_ADMIN`, `CAP_DAC_READ_SEARCH` (not `CAP_SYS_ADMIN` for the hot path). -- Read-only root filesystem, `ProtectSystem=strict` via systemd on bare metal. 
-- No outbound network calls. AI integration is opt-in and goes through your configured provider only. - -### Helm values - -```yaml -image: - repository: ghcr.io/optiqor/kerno - tag: v0.1.0 - -resources: - requests: { cpu: 100m, memory: 128Mi } - limits: { cpu: "1", memory: 512Mi } +```bash +kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno | grep -i bpf +``` -prometheus: - enabled: true - port: 9090 +### Permission issues on bare metal -serviceMonitor: # Prometheus Operator - enabled: true - interval: 15s +Make sure you run with `sudo`: -nodeSelector: - monitoring: "true" +```bash +sudo kerno doctor ``` -### Verify +Or run the systemd service (recommended for production): ```bash -kubectl -n kerno-system get ds kerno -kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno -kubectl -n kerno-system exec ds/kerno -- kerno doctor +curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash -s -- --daemon ``` --- @@ -321,32 +232,32 @@ kubectl -n kerno-system exec ds/kerno -- kerno doctor ### Incident Diagnosis -- **`kerno doctor`** - 30-second cluster-wide diagnostic, ranked findings, fix suggestions -- **`kerno explain`** - AI-powered kernel error explanation (no root needed) -- **`kerno predict`** - surface failures before they page you +- **`kerno doctor`** — 30-second cluster-wide diagnostic report +- **`kerno explain`** — AI-powered kernel error explanation +- **`kerno predict`** — Predict failures before they happen ### Real-Time Tracing -- **`kerno trace syscall`** - per-pod syscall latency streaming -- **`kerno trace disk`** - block I/O latency by device, op, process -- **`kerno trace sched`** - CPU scheduler run queue delays +- **`kerno trace syscall`** — Per-pod syscall latency +- **`kerno trace disk`** — Block I/O latency +- **`kerno trace sched`** — CPU scheduler delays ### Continuous Monitoring -- **`kerno watch tcp`** - TCP connections, RTT, retransmits -- **`kerno watch oom`** - OOM kill alerts with pod 
context -- **`kerno watch fd`** - FD leak detection via growth rate -- **`kerno start`** - daemon mode with Prometheus metrics +- **`kerno watch tcp`** — TCP retransmits & RTT +- **`kerno watch oom`** — OOM kill alerts +- **`kerno watch fd`** — File descriptor leak detection +- **`kerno start`** — Run as daemon with Prometheus metrics ### Integrations -- **Prometheus** - 16 metrics at `/metrics`, ServiceMonitor support -- **Kubernetes** - Helm chart + pod enrichment (no API server load) -- **AI Providers** - Anthropic, OpenAI, Ollama (optional, opt-in) -- **Systemd** - unit/slice enrichment on bare metal +- **Prometheus** + **ServiceMonitor** +- **Kubernetes** (Helm + pod enrichment) +- **AI Providers** (Anthropic, OpenAI, Ollama) +- **Systemd** enrichment on bare metal @@ -356,351 +267,53 @@ kubectl -n kerno-system exec ds/kerno -- kerno doctor ## How It Works -Kerno runs as a lightweight Go agent with six tiny eBPF programs attached to stable tracepoints. When `kerno doctor` runs, it collects 30 seconds of real kernel data, evaluates 11 diagnostic rules deterministically, and emits a ranked incident report. No sampling. No guesswork. No query language. - -### Architecture - -```mermaid -flowchart TB - subgraph Kernel["KERNEL SPACE · eBPF Programs"] - direction LR - P1["syscall
<br/>latency"] - P2["tcp<br/>monitor"] - P3["oom<br/>track"] - P4["disk<br/>io"] - P5["sched<br/>delay"] - P6["fd<br/>track"] - end - - RB[("Ring Buffers<br/>256KB per program<br/>zero-copy mmap")] - - subgraph UserSpace["USER SPACE · Go"] - direction TB - Loader["BPF Loaders<br/>cilium/ebpf"] - Collector["Collectors<br/>percentile aggregation"] - Signals[("Signals Snapshot<br/>single source of truth")] - Adapter["Environment Adapter<br/>k8s · systemd · bare metal"] - end - - subgraph Outputs["OUTPUTS"] - direction TB - Doctor["Doctor Engine<br/>11 diagnostic rules"] - AI["AI Layer (optional)<br/>root cause analysis"] - Prom["Prometheus<br/>/metrics :9090"] - CLI["Terminal<br/>
pretty · JSON"] - end - - P1 & P2 & P3 & P4 & P5 & P6 --> RB - RB --> Loader - Loader --> Collector - Collector --> Signals - Adapter -.enriches.-> Signals - Signals --> Doctor - Signals --> Prom - Doctor --> AI - AI --> CLI - Doctor --> CLI - - classDef kernel fill:#1a1a2e,stroke:#e94560,color:#fff,stroke-width:2px - classDef user fill:#0f3460,stroke:#16213e,color:#fff,stroke-width:2px - classDef output fill:#16213e,stroke:#533483,color:#fff,stroke-width:2px - classDef buffer fill:#533483,stroke:#e94560,color:#fff,stroke-width:3px - classDef ai fill:#e94560,stroke:#fff,color:#fff,stroke-width:2px - - class P1,P2,P3,P4,P5,P6 kernel - class Loader,Collector,Signals,Adapter user - class Doctor,Prom,CLI output - class RB buffer - class AI ai -``` - -### Core principles - -1. **Deterministic first.** The rule engine is pure Go, testable, and runs whether AI is on or off. Every finding has a clear cause, threshold, and fix. -2. **Zero-copy hot path.** Kernel events land in eBPF ring buffers and are drained via `mmap` - microsecond overhead, no serialization cost. -3. **No API server load.** Pod enrichment reads the kubelet's local pod manifests. The agent survives API server outages - the moment you need it most. -4. **AI is a post-processor.** Optional. Opt-in. Never touches the hot path. The deterministic engine always runs; AI enriches, it never replaces. -5. **Graceful degradation.** If an eBPF program fails to load on a weird kernel, that collector is skipped with a clear warning. The rest keep working. - -### Data flow - -```mermaid -sequenceDiagram - participant K as Kernel
(eBPF) - participant R as Ring Buffer - participant C as Collectors - participant D as Doctor Engine - participant A as AI Layer - participant U as On-call Engineer - - K->>R: syscall/tcp/oom/io events - Note over K,R: Zero-copy, microsecond overhead - R->>C: drain events - C->>C: aggregate into p50/p95/p99 - C->>D: Signals snapshot - D->>D: evaluate 11 rules - alt AI enabled - D->>A: findings + signals - A->>A: correlate + explain - A->>U: incident report + root cause - else AI disabled - D->>U: deterministic incident report - end -``` - ---- - -## The Diagnostic Rules +Kerno uses **6 lightweight eBPF programs** to collect kernel data with almost zero overhead. When you run `kerno doctor`, it collects 30 seconds of real data, runs 11 deterministic diagnostic rules, and produces a human-readable report. -Kerno runs 11 deterministic rules against every snapshot. Every rule is explainable, configurable, and covered by tests. - -| # | Rule | Triggers When | Severity | -|---|------|---------------|:---:| -| 1 | Disk I/O Bottleneck | fsync p99 > 50ms or write p99 > 200ms | WARN / CRIT | -| 2 | OOM Kill Occurred | Any OOM event in window | CRIT | -| 3 | TCP Retransmit Storm | Retransmit rate > 2% | CRIT | -| 4 | TCP RTT Degradation | RTT p99 > 10ms | WARN | -| 5 | Scheduler Contention | Runqueue delay p99 > 5ms | WARN / CRIT | -| 6 | FD Leak | FD growth > 10/sec sustained | WARN (with ETA) | -| 7 | Syscall Latency High | Any syscall p99 > 100ms | WARN / CRIT | -| 8 | OOM Imminent | Memory > 90% + positive growth | WARN / CRIT (with ETA) | -| 9 | Syscall Error Rate | Error rate > 1% per syscall | WARN / CRIT | -| 10 | Memory Pressure | RSS usage > 90% | WARN | -| 11 | Network Latency | Connection RTT > 100ms | WARN | +AI is **optional** and only used for root cause explanation — it never replaces the core rule engine. --- ## Usage -### Incident diagnosis - "what broke just now?" 
- ```bash -# The golden command +# Main diagnostic command kubectl -n kerno-system exec ds/kerno -- kerno doctor -# Quick 10-second check -kubectl -n kerno-system exec ds/kerno -- kerno doctor --duration 10s - -# JSON for CI/CD, runbooks, Slack bots (non-zero exit on critical) -kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json --exit-code - -# AI-powered root cause analysis +# With AI analysis kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai -# Explain a kernel error (no root, no cluster needed) -kerno explain "BUG: kernel NULL pointer dereference" -dmesg | tail -5 | kerno explain - -# Predict failures before they page you -kubectl -n kerno-system exec ds/kerno -- kerno predict --snapshots 5 --interval 15s -``` - -### Real-time tracing - "watch it happen" - -```bash -# Every syscall event streaming +# Real-time tracing kubectl -n kerno-system exec ds/kerno -- kerno trace syscall - -# Only syscalls from a specific pod's PID -kubectl -n kerno-system exec ds/kerno -- kerno trace syscall --pid 1234 - -# Postgres disk writes over 5ms -kubectl -n kerno-system exec ds/kerno -- kerno trace disk --process postgres --op write --threshold 5ms - -# Scheduler delays over 10ms -kubectl -n kerno-system exec ds/kerno -- kerno trace sched --threshold 10ms -``` - -### Continuous monitoring - "alert me when…" - -```bash -# TCP connections with retransmits -kubectl -n kerno-system exec ds/kerno -- kerno watch tcp --retransmits - -# Any OOM kill, with pod context -kubectl -n kerno-system exec ds/kerno -- kerno watch oom --alert - -# Processes leaking FDs -kubectl -n kerno-system exec ds/kerno -- kerno watch fd --threshold 10 -``` - ---- - -## Prometheus Metrics - -The DaemonSet exposes 16 metrics at `:9090/metrics`. ServiceMonitor is included when the Prometheus Operator is installed. - -
-View all 16 metrics - -| Metric | Type | What It Measures | -|---|:---:|---| -| `kerno_syscall_duration_nanoseconds` | Summary | Syscall latency (p50, p95, p99) | -| `kerno_syscall_total` | Counter | Total syscall events | -| `kerno_tcp_rtt_nanoseconds` | Summary | TCP round-trip time | -| `kerno_tcp_retransmits_total` | Counter | TCP retransmissions | -| `kerno_tcp_connections_total` | Counter | TCP connection events | -| `kerno_oom_kills_total` | Counter | OOM kill events | -| `kerno_disk_io_duration_nanoseconds` | Summary | Disk I/O latency | -| `kerno_disk_io_bytes_total` | Counter | Disk I/O bytes | -| `kerno_sched_delay_nanoseconds` | Summary | CPU run queue delay | -| `kerno_fd_open_total` | Counter | FD open operations | -| `kerno_fd_close_total` | Counter | FD close operations | -| `kerno_collector_events_total` | Counter | Events per collector | -| `kerno_collector_errors_total` | Counter | Errors per collector | -| `kerno_bpf_programs_loaded` | Gauge | Loaded eBPF programs | -| `kerno_info` | Gauge | Build version | - -Health endpoints: `/healthz` and `/readyz` return JSON status. - -
- ---- - -## Environment & AI - -**Environment auto-detection.** Kerno picks one of three adapters and enriches every event - no configuration required: - -- **Kubernetes** (in-cluster token present) → pod, namespace, node, deployment -- **Systemd** (PID 1 is systemd) → unit, slice, scope -- **Bare metal** → hostname, cgroup path - -**AI (optional).** The AI layer runs **after** the deterministic rule engine - it correlates cross-signals and explains root causes, it never replaces rules. Three providers (**Anthropic**, **OpenAI**, **Ollama** for air-gapped), three privacy modes (`full` / `redacted` / `summary`), TTL cache + token-bucket rate limiting, graceful fallback to a deterministic template on failure. No LLM SDK dependencies - pure `net/http`. - -```bash -kubectl -n kerno-system set env ds/kerno \ - KERNO_AI_API_KEY=sk-... \ - KERNO_AI_PROVIDER=anthropic -kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai -``` - ---- - -## Configuration - -Kerno works with **zero configuration**. For custom setups, mount a `config.yaml` or use `KERNO_*` env vars: - -```yaml -log_level: info - -collectors: - syscall_latency: true - tcp_monitor: true - oom_track: true - disk_io: true - sched_delay: true - fd_track: true - -doctor: - duration: 30s - thresholds: - syscall_p99_warning_ns: 100000000 # 100ms - syscall_p99_critical_ns: 500000000 # 500ms - tcp_retransmit_pct: 2.0 # 2% - oom_memory_pct: 90.0 # 90% - disk_p99_warning_ns: 50000000 # 50ms - disk_p99_critical_ns: 200000000 # 200ms - sched_delay_warning_ns: 5000000 # 5ms - sched_delay_critical_ns: 20000000 # 20ms - fd_growth_per_sec: 10.0 - -prometheus: - enabled: true - addr: ":9090" - -ai: - enabled: false - provider: anthropic - privacy_mode: summary ``` -**Precedence:** CLI flags > environment variables (`KERNO_*`) > config file > defaults. - ---- - -## Roadmap - -See [TODO.md](TODO.md) for the full plan. 
Headlines: - -- **v0.1** - DaemonSet, 6 eBPF collectors, 11 rules, Prometheus, AI post-processor, 7 chaos scenarios, 13-phase verify pipeline - **shipped, all gates green on kernel 6.17** -- **v0.2** - CRD for cluster-wide incident policies, OpenTelemetry OTLP export, Grafana dashboards, sliding-window aggregation -- **v0.3** - historical incident replay, SLO-linked alerts, Slack / PagerDuty integrations -- **v1.0** - multi-cluster control plane, managed offering (Optiqor Cloud) - --- ## Building from Source ```bash -# Requirements: Go 1.25+ -# Optional for real eBPF: clang 14+, libbpf-dev, llvm, bpftool - -make build # Build binary (uses BPF stubs - no clang needed) -make generate # Run bpf2go to produce *_bpfel.go from C sources -make bpf # Compile eBPF C programs to .o -make bpf-verify # Build the standalone kernel-verifier load harness -make test # Run unit tests -make test-race # Run with race detector -make lint # golangci-lint -make check # vet + test + lint -make verify # Comprehensive 13-phase production-readiness check -make manpage # Generate man pages for all CLI commands -make demo # Record demo.gif via vhs (needs vhs + ttyd + ffmpeg) -make demo-cast # Record demo.cast via asciinema (alternative to vhs) -make docker # Build Docker image -``` - -**Reproducing the verifier proof end-to-end:** - -```bash -# Install eBPF toolchain -sudo apt-get install -y clang llvm libbpf-dev linux-tools-$(uname -r) jq - -# Build, generate, verify everything in one shot -make verify # exits 0 only if all 62 checks pass -``` - -**Inducing real incidents to demo or test rule firing:** - -```bash -sudo tc qdisc add dev lo root netem loss 30% # optional, for tcp-loss -kerno chaos --induce --intensity high --duration 30s - -# Available scenarios (kerno chaos --list): -# cpu scheduler_contention -# disk-sat disk_io_bottleneck -# fd-leak fd_leak -# memory oom_imminent -# tcp-churn scheduler_contention -# tcp-loss tcp_retransmit_storm -# cascade multiple +make build +make verify 
# Full production readiness check +make docker ``` -In another shell, `sudo kerno doctor` will catch the induced incident. - --- ## Contributing -Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for: +See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, commit conventions, and review process. -- Development setup and prerequisites -- Commit message conventions (Conventional Commits) -- Code review process -- DCO sign-off requirement - -For security reports, see [SECURITY.md](SECURITY.md). +For security issues, see [SECURITY.md](SECURITY.md). --- ## License -Apache License 2.0 - see [LICENSE](LICENSE). +Apache License 2.0 — see [LICENSE](LICENSE).
--- -If Kerno saved your on-call shift, consider leaving a **⭐** it helps other engineers find the project. +If Kerno helped you during an incident, consider giving it a **⭐**. It helps others discover the project. -
+ \ No newline at end of file diff --git a/internal/ai/fallback.go b/internal/ai/fallback.go index 6119189..16a0c9e 100644 --- a/internal/ai/fallback.go +++ b/internal/ai/fallback.go @@ -133,5 +133,32 @@ func detectSimpleCorrelations(findings []doctor.Finding) []doctor.Correlation { }) } + // Memory + cgroup memory → container pressure impacting host. + if signals["memory"] && signals["cgroupMemory"] { + correlations = append(correlations, doctor.Correlation{ + Signals: []string{"memory", "cgroupMemory"}, + Description: "Host memory pressure combined with container memory limits suggests multiple containers competing for resources.", + Confidence: 0.85, + }) + } + + // Scheduler + syscall → CPU contention causing syscall queueing. + if signals["sched"] && signals["syscall"] { + correlations = append(correlations, doctor.Correlation{ + Signals: []string{"sched", "syscall"}, + Description: "High scheduler delays combined with syscall latency indicates CPU contention is causing system call queueing.", + Confidence: 0.80, + }) + } + + // TCP + memory → network buffer exhaustion. + if signals["tcp"] && signals["memory"] { + correlations = append(correlations, doctor.Correlation{ + Signals: []string{"tcp", "memory"}, + Description: "TCP issues combined with memory pressure may indicate network buffer exhaustion or connection pool limits.", + Confidence: 0.75, + }) + } + return correlations } diff --git a/internal/ai/gemini.go b/internal/ai/gemini.go new file mode 100644 index 0000000..044495e --- /dev/null +++ b/internal/ai/gemini.go @@ -0,0 +1,184 @@ +// Copyright 2026 Optiqor contributors +// SPDX-License-Identifier: Apache-2.0 + +package ai + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "time" +) + +// GeminiProvider implements the Provider interface for Google Gemini API. +// Uses raw HTTP + JSON — no SDK dependency. 
+type GeminiProvider struct { + apiKey string + model string + endpoint string + maxTokens int + temperature float64 + client *http.Client +} + +// NewGeminiProvider creates a new Gemini provider. +func NewGeminiProvider(cfg ProviderConfig) *GeminiProvider { + endpoint := cfg.Endpoint + if endpoint == "" { + endpoint = "https://generativelanguage.googleapis.com/v1beta" + } + + model := cfg.Model + if model == "" { + model = "gemini-1.5-flash" // Default to fast model + } + + maxTokens := cfg.MaxTokens + if maxTokens == 0 { + maxTokens = 4096 + } + + temperature := cfg.Temperature + if temperature == 0 { + temperature = 0.7 + } + + return &GeminiProvider{ + apiKey: cfg.APIKey, + model: model, + endpoint: endpoint, + maxTokens: maxTokens, + temperature: temperature, + client: &http.Client{ + Timeout: 60 * time.Second, + }, + } +} + +// Name returns "gemini". +func (p *GeminiProvider) Name() string { + return "gemini" +} + +// Complete sends a completion request to the Gemini API. +func (p *GeminiProvider) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error) { + if p.apiKey == "" { + return nil, fmt.Errorf("gemini: API key not configured (set KERNO_AI_API_KEY)") + } + + // Build the request payload. + payload := geminiRequest{ + Contents: []geminiContent{ + { + Parts: []geminiPart{ + {Text: req.SystemPrompt + "\n\n" + req.UserPrompt}, + }, + }, + }, + GenerationConfig: geminiGenerationConfig{ + Temperature: p.temperature, + MaxOutputTokens: p.maxTokens, + }, + } + + body, err := json.Marshal(payload) + if err != nil { + return nil, fmt.Errorf("gemini: marshaling request: %w", err) + } + + // Build the URL with API key. 
+ url := fmt.Sprintf("%s/models/%s:generateContent?key=%s", + p.endpoint, p.model, p.apiKey) + + httpReq, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body)) + if err != nil { + return nil, fmt.Errorf("gemini: creating request: %w", err) + } + + httpReq.Header.Set("Content-Type", "application/json") + + // Send the request. + resp, err := p.client.Do(httpReq) + if err != nil { + return nil, fmt.Errorf("gemini: request failed: %w", err) + } + defer resp.Body.Close() + + respBody, err := io.ReadAll(resp.Body) + if err != nil { + return nil, fmt.Errorf("gemini: reading response: %w", err) + } + + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("gemini: HTTP %d: %s", resp.StatusCode, string(respBody)) + } + + // Parse the response. + var geminiResp geminiResponse + if err := json.Unmarshal(respBody, &geminiResp); err != nil { + return nil, fmt.Errorf("gemini: parsing response: %w", err) + } + + // Extract text from candidates. + if len(geminiResp.Candidates) == 0 { + return nil, fmt.Errorf("gemini: no candidates in response") + } + + candidate := geminiResp.Candidates[0] + if len(candidate.Content.Parts) == 0 { + return nil, fmt.Errorf("gemini: no parts in candidate content") + } + + text := candidate.Content.Parts[0].Text + + // Extract token usage. 
+ tokensUsed := 0 + if geminiResp.UsageMetadata != nil { + tokensUsed = geminiResp.UsageMetadata.PromptTokenCount + + geminiResp.UsageMetadata.CandidatesTokenCount + } + + return &CompletionResponse{ + Text: text, + TokensUsed: tokensUsed, + Model: p.model, + }, nil +} + +// ─── Gemini API Types ─────────────────────────────────────────────────────── + +type geminiRequest struct { + Contents []geminiContent `json:"contents"` + GenerationConfig geminiGenerationConfig `json:"generationConfig"` +} + +type geminiContent struct { + Parts []geminiPart `json:"parts"` +} + +type geminiPart struct { + Text string `json:"text"` +} + +type geminiGenerationConfig struct { + Temperature float64 `json:"temperature"` + MaxOutputTokens int `json:"maxOutputTokens"` +} + +type geminiResponse struct { + Candidates []geminiCandidate `json:"candidates"` + UsageMetadata *geminiUsageMetadata `json:"usageMetadata,omitempty"` +} + +type geminiCandidate struct { + Content geminiContent `json:"content"` +} + +type geminiUsageMetadata struct { + PromptTokenCount int `json:"promptTokenCount"` + CandidatesTokenCount int `json:"candidatesTokenCount"` + TotalTokenCount int `json:"totalTokenCount"` +} diff --git a/internal/ai/gemini_test.go b/internal/ai/gemini_test.go new file mode 100644 index 0000000..3cea770 --- /dev/null +++ b/internal/ai/gemini_test.go @@ -0,0 +1,196 @@ +// Copyright 2026 Optiqor contributors +// SPDX-License-Identifier: Apache-2.0 + +package ai + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + "time" +) + +func TestGeminiProvider_Complete(t *testing.T) { + tests := []struct { + name string + response geminiResponse + wantText string + wantTokens int + wantModel string + wantStatusCode int + wantError bool + }{ + { + name: "successful completion", + response: geminiResponse{ + Candidates: []geminiCandidate{ + { + Content: geminiContent{ + Parts: []geminiPart{ + {Text: "This is a test response from Gemini."}, + }, + }, + }, + }, + 
UsageMetadata: &geminiUsageMetadata{ + PromptTokenCount: 10, + CandidatesTokenCount: 8, + TotalTokenCount: 18, + }, + }, + wantText: "This is a test response from Gemini.", + wantTokens: 18, + wantModel: "gemini-1.5-flash", + wantStatusCode: http.StatusOK, + wantError: false, + }, + { + name: "no usage metadata", + response: geminiResponse{ + Candidates: []geminiCandidate{ + { + Content: geminiContent{ + Parts: []geminiPart{ + {Text: "Response without metadata"}, + }, + }, + }, + }, + }, + wantText: "Response without metadata", + wantTokens: 0, + wantModel: "gemini-1.5-flash", + wantStatusCode: http.StatusOK, + wantError: false, + }, + { + name: "API error", + response: geminiResponse{}, + wantStatusCode: http.StatusUnauthorized, + wantError: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + // Create mock server + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + // Verify request method and headers + if r.Method != "POST" { + t.Errorf("Expected POST request, got %s", r.Method) + } + if r.Header.Get("Content-Type") != "application/json" { + t.Errorf("Expected Content-Type: application/json, got %s", r.Header.Get("Content-Type")) + } + + // Send response + w.WriteHeader(tt.wantStatusCode) + if tt.wantStatusCode == http.StatusOK { + json.NewEncoder(w).Encode(tt.response) + } else { + w.Write([]byte(`{"error": {"message": "API error"}}`)) + } + })) + defer server.Close() + + // Create provider with test server endpoint + provider := NewGeminiProvider(ProviderConfig{ + Name: "gemini", + Model: "gemini-1.5-flash", + APIKey: "test-key", + Endpoint: server.URL, + MaxTokens: 1000, + Temperature: 0.7, + }) + + // Create completion request + req := CompletionRequest{ + SystemPrompt: "You are a helpful assistant.", + UserPrompt: "Hello, world!", + MaxTokens: 1000, + Temperature: 0.7, + } + + // Call Complete + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer 
cancel() + + resp, err := provider.Complete(ctx, req) + + // Check error expectation + if tt.wantError { + if err == nil { + t.Fatal("Expected error, got nil") + } + return + } + + if err != nil { + t.Fatalf("Unexpected error: %v", err) + } + + // Verify response + if resp.Text != tt.wantText { + t.Errorf("Text = %q, want %q", resp.Text, tt.wantText) + } + if resp.TokensUsed != tt.wantTokens { + t.Errorf("TokensUsed = %d, want %d", resp.TokensUsed, tt.wantTokens) + } + if resp.Model != tt.wantModel { + t.Errorf("Model = %q, want %q", resp.Model, tt.wantModel) + } + }) + } +} + +func TestGeminiProvider_Name(t *testing.T) { + provider := NewGeminiProvider(ProviderConfig{}) + if got := provider.Name(); got != "gemini" { + t.Errorf("Name() = %q, want %q", got, "gemini") + } +} + +func TestGeminiProvider_NoAPIKey(t *testing.T) { + provider := NewGeminiProvider(ProviderConfig{ + Name: "gemini", + Model: "gemini-1.5-flash", + Endpoint: "https://example.com", + }) + + req := CompletionRequest{ + SystemPrompt: "test", + UserPrompt: "test", + } + + ctx := context.Background() + _, err := provider.Complete(ctx, req) + + if err == nil { + t.Fatal("Expected error for missing API key, got nil") + } + + if err.Error() != "gemini: API key not configured (set KERNO_AI_API_KEY)" { + t.Errorf("Unexpected error message: %v", err) + } +} + +func TestGeminiProvider_Defaults(t *testing.T) { + provider := NewGeminiProvider(ProviderConfig{ + APIKey: "test-key", + }) + + if provider.model != "gemini-1.5-flash" { + t.Errorf("Default model = %q, want %q", provider.model, "gemini-1.5-flash") + } + if provider.endpoint != "https://generativelanguage.googleapis.com/v1beta" { + t.Errorf("Default endpoint = %q, want %q", provider.endpoint, "https://generativelanguage.googleapis.com/v1beta") + } + if provider.maxTokens != 4096 { + t.Errorf("Default maxTokens = %d, want %d", provider.maxTokens, 4096) + } + if provider.temperature != 0.7 { + t.Errorf("Default temperature = %f, want %f", 
provider.temperature, 0.7) + } +} diff --git a/internal/ai/provider.go b/internal/ai/provider.go index 90c8b39..0aa8fdb 100644 --- a/internal/ai/provider.go +++ b/internal/ai/provider.go @@ -81,7 +81,9 @@ func NewProvider(cfg ProviderConfig) (Provider, error) { return NewOpenAIProvider(cfg), nil case "ollama": return NewOllamaProvider(cfg), nil + case "gemini": + return NewGeminiProvider(cfg), nil default: - return nil, fmt.Errorf("unknown AI provider %q: must be anthropic, openai, or ollama", cfg.Name) + return nil, fmt.Errorf("unknown AI provider %q: must be anthropic, openai, ollama, or gemini", cfg.Name) } } diff --git a/internal/bpf/errors.go b/internal/bpf/errors.go new file mode 100644 index 0000000..04b2415 --- /dev/null +++ b/internal/bpf/errors.go @@ -0,0 +1,116 @@ +// Copyright 2026 Optiqor contributors +// SPDX-License-Identifier: Apache-2.0 + +package bpf + +import ( + "fmt" + "strings" +) + +// LoadError represents an eBPF program load failure with additional context. +type LoadError struct { + Program string // Program name (e.g., "syscall_latency") + Err error // Underlying error + Hint string // User-facing hint on how to fix +} + +// Error implements the error interface. +func (e *LoadError) Error() string { + if e.Hint != "" { + return fmt.Sprintf("%s: %v (hint: %s)", e.Program, e.Err, e.Hint) + } + return fmt.Sprintf("%s: %v", e.Program, e.Err) +} + +// Unwrap returns the underlying error. +func (e *LoadError) Unwrap() error { + return e.Err +} + +// WrapLoadError wraps an eBPF load error with program context and a helpful hint. +func WrapLoadError(program string, err error) error { + if err == nil { + return nil + } + + hint := classifyLoadError(err) + return &LoadError{ + Program: program, + Err: err, + Hint: hint, + } +} + +// classifyLoadError analyzes an error and returns a user-friendly fix hint. 
+func classifyLoadError(err error) string { + if err == nil { + return "" + } + + msg := strings.ToLower(err.Error()) + + switch { + case strings.Contains(msg, "operation not permitted") || strings.Contains(msg, "permission denied") || strings.Contains(msg, "eperm"): + return "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities" + + case strings.Contains(msg, "memlock") || strings.Contains(msg, "rlimit"): + return "increase memlock limit: ulimit -l unlimited (or run as root)" + + case strings.Contains(msg, "btf") && strings.Contains(msg, "not found"): + return "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)" + + case strings.Contains(msg, "vmlinux"): + return "missing /sys/kernel/btf/vmlinux — kernel must be compiled with BTF support" + + case strings.Contains(msg, "verifier") || strings.Contains(msg, "invalid"): + return "BPF verifier rejected the program — may need newer kernel or different approach" + + case strings.Contains(msg, "no such file") && strings.Contains(msg, "tracepoint"): + return "tracepoint not available on this kernel — try kernel 5.10+ or file an issue" + + case strings.Contains(msg, "program too large"): + return "program exceeds BPF complexity limit — file an issue with kernel version" + + case strings.Contains(msg, "unknown") && strings.Contains(msg, "attach type"): + return "attach type not supported on this kernel — requires 5.15+" + + case strings.Contains(msg, "busy") || strings.Contains(msg, "in use"): + return "resource already in use — another BPF program may be attached" + + case strings.Contains(msg, "libbpf"): + return "libbpf error — ensure libbpf-dev is installed and up to date" + + default: + return "check kernel version (5.8+ required), BTF support, and capabilities" + } +} + +// IsPermissionError returns true if the error is related to insufficient permissions. 
+func IsPermissionError(err error) bool { + if err == nil { + return false + } + msg := strings.ToLower(err.Error()) + return strings.Contains(msg, "operation not permitted") || + strings.Contains(msg, "permission denied") || + strings.Contains(msg, "eperm") +} + +// IsBTFError returns true if the error is related to missing BTF support. +func IsBTFError(err error) bool { + if err == nil { + return false + } + msg := strings.ToLower(err.Error()) + return strings.Contains(msg, "btf") || strings.Contains(msg, "vmlinux") +} + +// IsVerifierError returns true if the error is from the BPF verifier. +func IsVerifierError(err error) bool { + if err == nil { + return false + } + msg := strings.ToLower(err.Error()) + return strings.Contains(msg, "verifier") || strings.Contains(msg, "invalid") +} diff --git a/internal/bpf/errors_test.go b/internal/bpf/errors_test.go new file mode 100644 index 0000000..d511866 --- /dev/null +++ b/internal/bpf/errors_test.go @@ -0,0 +1,275 @@ +// Copyright 2026 Optiqor contributors +// SPDX-License-Identifier: Apache-2.0 + +package bpf + +import ( + "errors" + "strings" + "testing" +) + +func TestWrapLoadError(t *testing.T) { + tests := []struct { + name string + program string + err error + wantHint string + wantContain string + }{ + { + name: "permission denied", + program: "syscall_latency", + err: errors.New("operation not permitted"), + wantHint: "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities", + wantContain: "syscall_latency", + }, + { + name: "BTF missing", + program: "tcp_monitor", + err: errors.New("btf not found"), + wantHint: "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)", + wantContain: "tcp_monitor", + }, + { + name: "verifier error", + program: "disk_io", + err: errors.New("verifier rejected program"), + wantHint: "BPF verifier rejected the program — may need newer kernel or different approach", + wantContain: "disk_io", + }, + { + name: "memlock limit", + program: "oom_track", + err: 
errors.New("memlock rlimit exceeded"), + wantHint: "increase memlock limit: ulimit -l unlimited (or run as root)", + wantContain: "oom_track", + }, + { + name: "nil error", + program: "test", + err: nil, + wantHint: "", + wantContain: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + wrapped := WrapLoadError(tt.program, tt.err) + + if tt.err == nil { + if wrapped != nil { + t.Errorf("WrapLoadError(nil) = %v, want nil", wrapped) + } + return + } + + if wrapped == nil { + t.Fatal("WrapLoadError returned nil for non-nil error") + } + + var loadErr *LoadError + if !errors.As(wrapped, &loadErr) { + t.Fatal("Wrapped error is not a *LoadError") + } + + if loadErr.Program != tt.program { + t.Errorf("Program = %q, want %q", loadErr.Program, tt.program) + } + + if loadErr.Hint != tt.wantHint { + t.Errorf("Hint = %q, want %q", loadErr.Hint, tt.wantHint) + } + + errStr := wrapped.Error() + if !strings.Contains(errStr, tt.wantContain) { + t.Errorf("Error() = %q, want it to contain %q", errStr, tt.wantContain) + } + + // Test Unwrap + if !errors.Is(wrapped, tt.err) { + t.Error("Unwrap() should return the original error") + } + }) + } +} + +func TestIsPermissionError(t *testing.T) { + tests := []struct { + name string + err error + want bool + }{ + { + name: "operation not permitted", + err: errors.New("operation not permitted"), + want: true, + }, + { + name: "permission denied", + err: errors.New("permission denied"), + want: true, + }, + { + name: "EPERM", + err: errors.New("error: EPERM"), + want: true, + }, + { + name: "other error", + err: errors.New("btf not found"), + want: false, + }, + { + name: "nil error", + err: nil, + want: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := IsPermissionError(tt.err); got != tt.want { + t.Errorf("IsPermissionError() = %v, want %v", got, tt.want) + } + }) + } +} + +func TestIsBTFError(t *testing.T) { + tests := []struct { + name string + err error + want 
bool + }{ + { + name: "btf not found", + err: errors.New("btf not found"), + want: true, + }, + { + name: "vmlinux missing", + err: errors.New("vmlinux not available"), + want: true, + }, + { + name: "permission error", + err: errors.New("operation not permitted"), + want: false, + }, + { + name: "nil error", + err: nil, + want: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := IsBTFError(tt.err); got != tt.want { + t.Errorf("IsBTFError() = %v, want %v", got, tt.want) + } + }) + } +} + +func TestIsVerifierError(t *testing.T) { + tests := []struct { + name string + err error + want bool + }{ + { + name: "verifier rejected", + err: errors.New("verifier rejected program"), + want: true, + }, + { + name: "invalid instruction", + err: errors.New("invalid BPF instruction"), + want: true, + }, + { + name: "btf error", + err: errors.New("btf not found"), + want: false, + }, + { + name: "nil error", + err: nil, + want: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := IsVerifierError(tt.err); got != tt.want { + t.Errorf("IsVerifierError() = %v, want %v", got, tt.want) + } + }) + } +} + +func TestClassifyLoadError(t *testing.T) { + tests := []struct { + name string + err error + wantHint string + }{ + { + name: "permission denied", + err: errors.New("operation not permitted"), + wantHint: "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities", + }, + { + name: "memlock limit", + err: errors.New("memlock rlimit exceeded"), + wantHint: "increase memlock limit: ulimit -l unlimited (or run as root)", + }, + { + name: "BTF missing", + err: errors.New("btf not found"), + wantHint: "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)", + }, + { + name: "vmlinux missing", + err: errors.New("/sys/kernel/btf/vmlinux: no such file"), + wantHint: "missing /sys/kernel/btf/vmlinux — kernel must be compiled with BTF support", + }, + { + name: "verifier rejection", + err: 
errors.New("BPF verifier rejected: invalid"), + wantHint: "BPF verifier rejected the program — may need newer kernel or different approach", + }, + { + name: "tracepoint unavailable", + err: errors.New("no such file or directory: tracepoint"), + wantHint: "tracepoint not available on this kernel — try kernel 5.10+ or file an issue", + }, + { + name: "program too large", + err: errors.New("program too large: exceeds complexity limit"), + wantHint: "program exceeds BPF complexity limit — file an issue with kernel version", + }, + { + name: "unknown error", + err: errors.New("some unknown error"), + wantHint: "check kernel version (5.8+ required), BTF support, and capabilities", + }, + { + name: "nil error", + err: nil, + wantHint: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + hint := classifyLoadError(tt.err) + if hint != tt.wantHint { + t.Errorf("classifyLoadError() = %q, want %q", hint, tt.wantHint) + } + }) + } +}