diff --git a/README.md b/README.md
index f71c00b..ad8c0c2 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,6 @@
+**Here is your improved README** with a new **Troubleshooting** section added:
+
+```markdown
# KERNO
@@ -23,6 +26,11 @@
---
+## Contributing
+
+We welcome contributions! Whether it's fixing a bug, improving documentation, adding a diagnostic rule, or working on eBPF — every contribution helps.
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) to get started.
## What is Kerno?
@@ -98,217 +106,120 @@ That's the entire debugging loop - from page to root cause - in a single command
## How Kerno compares
-| | Watches | K8s-Native | Incident Report | SLO Mapping | AI Analysis | Install Time |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Prometheus + Grafana | Application | Partial | No | No | No | Hours |
-| Datadog APM | Application | Partial | No | Partial | Yes | Hours |
-| Cilium Tetragon | Security | **Yes** | No | No | No | Minutes |
-| Inspektor Gadget | Container | **Yes** | No | No | No | Minutes |
-| Pixie | Application | **Yes** | No | No | No | Minutes |
-| **Kerno** | **Kernel** | **Yes** | **Yes** | **Yes** | **Yes** | **< 1 min** |
+| Tool | Watches | K8s-Native | Incident Report | Root Cause Analysis | AI Analysis | Install Time |
+|-------------------------|-------------|------------|-----------------|---------------------|-------------|--------------|
+| Prometheus + Grafana | Application | Partial | No | No | No | Hours |
+| Datadog APM | Application | Partial | No | Partial | Yes | Hours |
+| Cilium Tetragon | Security | Yes | No | No | No | Minutes |
+| Inspektor Gadget | Container | Yes | No | No | No | Minutes |
+| Pixie | Application | Yes | No | No | No | Minutes |
+| **Kerno** | **Kernel** | **Yes** | **Yes** | **Yes** | **Yes** | **< 1 min** |
-Kerno is the only eBPF tool in the Kubernetes ecosystem that produces a ranked, human-readable **incident report** - not a firehose of events, not another dashboard, not a query language to learn.
+**Kerno is the only eBPF tool** in the Kubernetes ecosystem that produces a **ranked, human-readable incident report** — not just raw events or another dashboard.
---
## Quick Start
-> **Requires:** kernel **5.8+** with BTF (every major managed K8s qualifies: EKS, GKE, AKS, DOKS, Linode, Civo). For raw manifests/Helm you'll need cluster-admin.
+> **Requires:** kernel **5.8+** with BTF (every major managed K8s qualifies: EKS, GKE, AKS, DOKS, Linode, Civo).
-### 1 · Kubernetes (primary)
+### 1. Kubernetes (Recommended)
```bash
helm install kerno ./deploy/helm/kerno \
-n kerno-system --create-namespace
```
-Within 30 seconds Kerno is running as a DaemonSet on every node, watching the kernel via eBPF, exposing `/metrics` for Prometheus, and ready for `kerno doctor`.
-
```bash
-# Cluster-wide incident report - 30 seconds of real kernel data
+# Cluster-wide incident report (30 seconds)
kubectl -n kerno-system exec ds/kerno -- kerno doctor
-# CI-friendly: machine-readable JSON, exits non-zero on critical findings
+# JSON output for CI/CD
kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json --exit-code
-# AI-enriched root cause analysis (set the API key once)
-kubectl -n kerno-system set env ds/kerno KERNO_AI_API_KEY=sk-...
+# With AI analysis
kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
```
-ServiceMonitor for the Prometheus Operator is built-in. Raw manifests live at [`deploy/k8s/`](deploy/k8s/) if you don't use Helm.
-
----
-
-### 2 · Bare metal · VMs · EC2 · GCE
-
-The same binary, the same command. No Kubernetes required.
+### 2. Bare Metal / VMs / EC2 / GCE
```bash
curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash
sudo kerno doctor
```
-Long-lived systemd service with `/metrics` for Prometheus:
-
-```bash
-curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash -s -- --daemon
-journalctl -u kerno -f
-```
-
-### 3 · Docker (ad-hoc, any host with a privileged daemon)
-
-```bash
-docker run --rm --privileged --pid=host \
- -v /sys/kernel/debug:/sys/kernel/debug:ro \
- -v /sys/kernel/btf:/sys/kernel/btf:ro \
- -v /sys/fs/bpf:/sys/fs/bpf \
- -v /proc:/proc:ro \
- ghcr.io/optiqor/kerno:latest doctor
-```
+---
-Multi-arch (`linux/amd64`, `linux/arm64`) images published to GHCR on every release.
+## Troubleshooting
-### Shell Completion
+### eBPF program fails to load
-Enable tab completion for your shell:
+**Error:** `failed to load BPF program` or `permission denied`
-**Bash:**
+**Solutions:**
+- Make sure your kernel is **5.8+** and has **BTF** enabled
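+- Run Kerno with the capabilities it needs (`CAP_BPF`, `CAP_PERFMON`, `CAP_SYS_PTRACE`, `CAP_NET_ADMIN`, `CAP_DAC_READ_SEARCH`); Kerno already requests only this minimum set
+- If Kerno runs without root or `CAP_BPF`, the host may have unprivileged BPF disabled. Re-enabling it weakens host security, so treat this as a last resort: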
```bash
-# Load completions for current session
-source <(kerno completion bash)
-
-# Persist across sessions
-echo 'source <(kerno completion bash)' >> ~/.bashrc
+sudo sysctl -w kernel.unprivileged_bpf_disabled=0
```
-**Zsh:**
-
-```bash
-# Enable completions (add to ~/.zshrc if not already present)
-echo 'autoload -U compinit; compinit' >> ~/.zshrc
-
-# Load completions for current session
-autoload -U compinit && compinit
-kerno completion zsh > "${fpath[1]}/_kerno"
-
-# Persist across sessions - run once, then start new shell
-kerno completion zsh > "${fpath[1]}/_kerno"
-```
+### `kerno doctor` shows no output / empty report
-**Fish:**
+**Possible causes:**
+- eBPF programs failed to load silently
+- Very short collection window
+**Fix:**
```bash
-# Load completions for current session
-kerno completion fish | source
-
-# Persist across sessions
-kerno completion fish > ~/.config/fish/completions/kerno.fish
+kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno
+kubectl -n kerno-system exec ds/kerno -- kerno doctor --duration 30s
```
-**PowerShell:**
+### Prometheus metrics not appearing
-```powershell
-# Add to your PowerShell profile
-kerno completion powershell > kerno.ps1
-. ./kerno.ps1
+**Check:**
+```bash
+kubectl -n kerno-system port-forward ds/kerno 9090:9090
+curl http://localhost:9090/metrics
```
----
-
-## Kubernetes Deployment
+Make sure `serviceMonitor.enabled: true` is set in your Helm values if you use the Prometheus Operator.
-Kerno is designed from day one to run as a Kubernetes DaemonSet. One pod per node, one eBPF agent per kernel, zero API server load.
+### AI features not working
-```mermaid
-flowchart TB
- subgraph Cluster["Kubernetes Cluster"]
- direction TB
- subgraph Node1["Worker Node 1"]
- K1["Kerno Pod
DaemonSet"]
- W1["Workload Pods"]
- end
- subgraph Node2["Worker Node 2"]
- K2["Kerno Pod
DaemonSet"]
- W2["Workload Pods"]
- end
- subgraph Node3["Worker Node N"]
- K3["Kerno Pod
DaemonSet"]
- W3["Workload Pods"]
- end
- end
+**Error:** `AI analysis failed` or empty AI output
- K1 -->|:9090/metrics| Prom["Prometheus"]
- K2 -->|:9090/metrics| Prom
- K3 -->|:9090/metrics| Prom
- Prom --> GF["Grafana"]
-
- K1 -.enriches.-> W1
- K2 -.enriches.-> W2
- K3 -.enriches.-> W3
-
- style K1 fill:#e94560,stroke:#fff,color:#fff
- style K2 fill:#e94560,stroke:#fff,color:#fff
- style K3 fill:#e94560,stroke:#fff,color:#fff
- style Prom fill:#0f3460,stroke:#fff,color:#fff
- style GF fill:#16213e,stroke:#fff,color:#fff
- style W1 fill:#533483,stroke:#fff,color:#fff
- style W2 fill:#533483,stroke:#fff,color:#fff
- style W3 fill:#533483,stroke:#fff,color:#fff
+**Fix:**
+```bash
+kubectl -n kerno-system set env ds/kerno \
+ KERNO_AI_PROVIDER=anthropic \
+ KERNO_AI_API_KEY=sk-...
```
-### Pod enrichment - no API server load
-
-Kerno tags every finding with pod, namespace, node, and workload labels. No `client-go` informers, no watch connections - Kerno reads `/var/lib/kubelet/pods` directly, so even a failing API server doesn't blind the agent. Exactly when you need it most.
-
-### Host mounts - the minimum necessary
+Currently supported providers: `anthropic`, `openai`, `ollama`.
-| Mount | Why |
-|---|---|
-| `/sys/kernel/debug` | tracepoints, kprobes |
-| `/sys/kernel/btf` | CO-RE type resolution |
-| `/sys/fs/bpf` | BPF map pinning |
-| `/proc` | PID → cgroup → pod resolution |
-| `/sys/fs/cgroup` | container resource accounting |
-| `/sys/class/net` | per-interface TCP counters |
-| `/sys/block` | per-device disk stats |
+### Running on an unsupported kernel
-### Security posture
+Kerno gracefully degrades. If some eBPF programs fail to load, those collectors are skipped. You will see warnings in the logs:
-- Runs with the **minimum capabilities needed** - `CAP_BPF`, `CAP_PERFMON`, `CAP_SYS_PTRACE`, `CAP_NET_ADMIN`, `CAP_DAC_READ_SEARCH` (not `CAP_SYS_ADMIN` for the hot path).
-- Read-only root filesystem, `ProtectSystem=strict` via systemd on bare metal.
-- No outbound network calls. AI integration is opt-in and goes through your configured provider only.
-
-### Helm values
-
-```yaml
-image:
- repository: ghcr.io/optiqor/kerno
- tag: v0.1.0
-
-resources:
- requests: { cpu: 100m, memory: 128Mi }
- limits: { cpu: "1", memory: 512Mi }
+```bash
+kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno | grep -i bpf
+```
-prometheus:
- enabled: true
- port: 9090
+### Permission issues on bare metal
-serviceMonitor: # Prometheus Operator
- enabled: true
- interval: 15s
+Make sure you run with `sudo`:
-nodeSelector:
- monitoring: "true"
+```bash
+sudo kerno doctor
```
-### Verify
+Or run the systemd service (recommended for production):
```bash
-kubectl -n kerno-system get ds kerno
-kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno
-kubectl -n kerno-system exec ds/kerno -- kerno doctor
+curl -sfL https://raw.githubusercontent.com/optiqor/kerno/main/scripts/install.sh | sudo bash -s -- --daemon
```
---
@@ -321,32 +232,32 @@ kubectl -n kerno-system exec ds/kerno -- kerno doctor
### Incident Diagnosis
-- **`kerno doctor`** - 30-second cluster-wide diagnostic, ranked findings, fix suggestions
-- **`kerno explain`** - AI-powered kernel error explanation (no root needed)
-- **`kerno predict`** - surface failures before they page you
+- **`kerno doctor`** — 30-second cluster-wide diagnostic report
+- **`kerno explain`** — AI-powered kernel error explanation
+- **`kerno predict`** — Predict failures before they happen
### Real-Time Tracing
-- **`kerno trace syscall`** - per-pod syscall latency streaming
-- **`kerno trace disk`** - block I/O latency by device, op, process
-- **`kerno trace sched`** - CPU scheduler run queue delays
+- **`kerno trace syscall`** — Per-pod syscall latency
+- **`kerno trace disk`** — Block I/O latency
+- **`kerno trace sched`** — CPU scheduler delays
### Continuous Monitoring
-- **`kerno watch tcp`** - TCP connections, RTT, retransmits
-- **`kerno watch oom`** - OOM kill alerts with pod context
-- **`kerno watch fd`** - FD leak detection via growth rate
-- **`kerno start`** - daemon mode with Prometheus metrics
+- **`kerno watch tcp`** — TCP retransmits & RTT
+- **`kerno watch oom`** — OOM kill alerts
+- **`kerno watch fd`** — File descriptor leak detection
+- **`kerno start`** — Run as daemon with Prometheus metrics
### Integrations
-- **Prometheus** - 16 metrics at `/metrics`, ServiceMonitor support
-- **Kubernetes** - Helm chart + pod enrichment (no API server load)
-- **AI Providers** - Anthropic, OpenAI, Ollama (optional, opt-in)
-- **Systemd** - unit/slice enrichment on bare metal
+- **Prometheus** + **ServiceMonitor**
+- **Kubernetes** (Helm + pod enrichment)
+- **AI Providers** (Anthropic, OpenAI, Ollama)
+- **Systemd** enrichment on bare metal
|
@@ -356,351 +267,53 @@ kubectl -n kerno-system exec ds/kerno -- kerno doctor
## How It Works
-Kerno runs as a lightweight Go agent with six tiny eBPF programs attached to stable tracepoints. When `kerno doctor` runs, it collects 30 seconds of real kernel data, evaluates 11 diagnostic rules deterministically, and emits a ranked incident report. No sampling. No guesswork. No query language.
-
-### Architecture
-
-```mermaid
-flowchart TB
- subgraph Kernel["KERNEL SPACE · eBPF Programs"]
- direction LR
- P1["syscall
latency"]
- P2["tcp
monitor"]
- P3["oom
track"]
- P4["disk
io"]
- P5["sched
delay"]
- P6["fd
track"]
- end
-
- RB[("Ring Buffers
256KB per program
zero-copy mmap")]
-
- subgraph UserSpace["USER SPACE · Go"]
- direction TB
- Loader["BPF Loaders
cilium/ebpf"]
- Collector["Collectors
percentile aggregation"]
- Signals[("Signals Snapshot
single source of truth")]
- Adapter["Environment Adapter
k8s · systemd · bare metal"]
- end
-
- subgraph Outputs["OUTPUTS"]
- direction TB
- Doctor["Doctor Engine
11 diagnostic rules"]
- AI["AI Layer (optional)
root cause analysis"]
- Prom["Prometheus
/metrics :9090"]
- CLI["Terminal
pretty · JSON"]
- end
-
- P1 & P2 & P3 & P4 & P5 & P6 --> RB
- RB --> Loader
- Loader --> Collector
- Collector --> Signals
- Adapter -.enriches.-> Signals
- Signals --> Doctor
- Signals --> Prom
- Doctor --> AI
- AI --> CLI
- Doctor --> CLI
-
- classDef kernel fill:#1a1a2e,stroke:#e94560,color:#fff,stroke-width:2px
- classDef user fill:#0f3460,stroke:#16213e,color:#fff,stroke-width:2px
- classDef output fill:#16213e,stroke:#533483,color:#fff,stroke-width:2px
- classDef buffer fill:#533483,stroke:#e94560,color:#fff,stroke-width:3px
- classDef ai fill:#e94560,stroke:#fff,color:#fff,stroke-width:2px
-
- class P1,P2,P3,P4,P5,P6 kernel
- class Loader,Collector,Signals,Adapter user
- class Doctor,Prom,CLI output
- class RB buffer
- class AI ai
-```
-
-### Core principles
-
-1. **Deterministic first.** The rule engine is pure Go, testable, and runs whether AI is on or off. Every finding has a clear cause, threshold, and fix.
-2. **Zero-copy hot path.** Kernel events land in eBPF ring buffers and are drained via `mmap` - microsecond overhead, no serialization cost.
-3. **No API server load.** Pod enrichment reads the kubelet's local pod manifests. The agent survives API server outages - the moment you need it most.
-4. **AI is a post-processor.** Optional. Opt-in. Never touches the hot path. The deterministic engine always runs; AI enriches, it never replaces.
-5. **Graceful degradation.** If an eBPF program fails to load on a weird kernel, that collector is skipped with a clear warning. The rest keep working.
-
-### Data flow
-
-```mermaid
-sequenceDiagram
- participant K as Kernel
(eBPF)
- participant R as Ring Buffer
- participant C as Collectors
- participant D as Doctor Engine
- participant A as AI Layer
- participant U as On-call Engineer
-
- K->>R: syscall/tcp/oom/io events
- Note over K,R: Zero-copy, microsecond overhead
- R->>C: drain events
- C->>C: aggregate into p50/p95/p99
- C->>D: Signals snapshot
- D->>D: evaluate 11 rules
- alt AI enabled
- D->>A: findings + signals
- A->>A: correlate + explain
- A->>U: incident report + root cause
- else AI disabled
- D->>U: deterministic incident report
- end
-```
-
----
-
-## The Diagnostic Rules
+Kerno uses **six lightweight eBPF programs** attached to stable tracepoints to collect kernel data with microsecond-level overhead. When you run `kerno doctor`, it collects 30 seconds of real data, evaluates 11 deterministic diagnostic rules, and produces a ranked, human-readable report.
-Kerno runs 11 deterministic rules against every snapshot. Every rule is explainable, configurable, and covered by tests.
-
-| # | Rule | Triggers When | Severity |
-|---|------|---------------|:---:|
-| 1 | Disk I/O Bottleneck | fsync p99 > 50ms or write p99 > 200ms | WARN / CRIT |
-| 2 | OOM Kill Occurred | Any OOM event in window | CRIT |
-| 3 | TCP Retransmit Storm | Retransmit rate > 2% | CRIT |
-| 4 | TCP RTT Degradation | RTT p99 > 10ms | WARN |
-| 5 | Scheduler Contention | Runqueue delay p99 > 5ms | WARN / CRIT |
-| 6 | FD Leak | FD growth > 10/sec sustained | WARN (with ETA) |
-| 7 | Syscall Latency High | Any syscall p99 > 100ms | WARN / CRIT |
-| 8 | OOM Imminent | Memory > 90% + positive growth | WARN / CRIT (with ETA) |
-| 9 | Syscall Error Rate | Error rate > 1% per syscall | WARN / CRIT |
-| 10 | Memory Pressure | RSS usage > 90% | WARN |
-| 11 | Network Latency | Connection RTT > 100ms | WARN |
+AI is **optional** and only used for root cause explanation — it never replaces the core rule engine.
---
## Usage
-### Incident diagnosis - "what broke just now?"
-
```bash
-# The golden command
+# Main diagnostic command
kubectl -n kerno-system exec ds/kerno -- kerno doctor
-# Quick 10-second check
-kubectl -n kerno-system exec ds/kerno -- kerno doctor --duration 10s
-
-# JSON for CI/CD, runbooks, Slack bots (non-zero exit on critical)
-kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json --exit-code
-
-# AI-powered root cause analysis
+# With AI analysis
kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
-# Explain a kernel error (no root, no cluster needed)
-kerno explain "BUG: kernel NULL pointer dereference"
-dmesg | tail -5 | kerno explain
-
-# Predict failures before they page you
-kubectl -n kerno-system exec ds/kerno -- kerno predict --snapshots 5 --interval 15s
-```
-
-### Real-time tracing - "watch it happen"
-
-```bash
-# Every syscall event streaming
+# Real-time tracing
kubectl -n kerno-system exec ds/kerno -- kerno trace syscall
-
-# Only syscalls from a specific pod's PID
-kubectl -n kerno-system exec ds/kerno -- kerno trace syscall --pid 1234
-
-# Postgres disk writes over 5ms
-kubectl -n kerno-system exec ds/kerno -- kerno trace disk --process postgres --op write --threshold 5ms
-
-# Scheduler delays over 10ms
-kubectl -n kerno-system exec ds/kerno -- kerno trace sched --threshold 10ms
-```
-
-### Continuous monitoring - "alert me when…"
-
-```bash
-# TCP connections with retransmits
-kubectl -n kerno-system exec ds/kerno -- kerno watch tcp --retransmits
-
-# Any OOM kill, with pod context
-kubectl -n kerno-system exec ds/kerno -- kerno watch oom --alert
-
-# Processes leaking FDs
-kubectl -n kerno-system exec ds/kerno -- kerno watch fd --threshold 10
-```
-
----
-
-## Prometheus Metrics
-
-The DaemonSet exposes 16 metrics at `:9090/metrics`. ServiceMonitor is included when the Prometheus Operator is installed.
-
-
-View all 16 metrics
-
-| Metric | Type | What It Measures |
-|---|:---:|---|
-| `kerno_syscall_duration_nanoseconds` | Summary | Syscall latency (p50, p95, p99) |
-| `kerno_syscall_total` | Counter | Total syscall events |
-| `kerno_tcp_rtt_nanoseconds` | Summary | TCP round-trip time |
-| `kerno_tcp_retransmits_total` | Counter | TCP retransmissions |
-| `kerno_tcp_connections_total` | Counter | TCP connection events |
-| `kerno_oom_kills_total` | Counter | OOM kill events |
-| `kerno_disk_io_duration_nanoseconds` | Summary | Disk I/O latency |
-| `kerno_disk_io_bytes_total` | Counter | Disk I/O bytes |
-| `kerno_sched_delay_nanoseconds` | Summary | CPU run queue delay |
-| `kerno_fd_open_total` | Counter | FD open operations |
-| `kerno_fd_close_total` | Counter | FD close operations |
-| `kerno_collector_events_total` | Counter | Events per collector |
-| `kerno_collector_errors_total` | Counter | Errors per collector |
-| `kerno_bpf_programs_loaded` | Gauge | Loaded eBPF programs |
-| `kerno_info` | Gauge | Build version |
-
-Health endpoints: `/healthz` and `/readyz` return JSON status.
-
-
-
----
-
-## Environment & AI
-
-**Environment auto-detection.** Kerno picks one of three adapters and enriches every event - no configuration required:
-
-- **Kubernetes** (in-cluster token present) → pod, namespace, node, deployment
-- **Systemd** (PID 1 is systemd) → unit, slice, scope
-- **Bare metal** → hostname, cgroup path
-
-**AI (optional).** The AI layer runs **after** the deterministic rule engine - it correlates cross-signals and explains root causes, it never replaces rules. Three providers (**Anthropic**, **OpenAI**, **Ollama** for air-gapped), three privacy modes (`full` / `redacted` / `summary`), TTL cache + token-bucket rate limiting, graceful fallback to a deterministic template on failure. No LLM SDK dependencies - pure `net/http`.
-
-```bash
-kubectl -n kerno-system set env ds/kerno \
- KERNO_AI_API_KEY=sk-... \
- KERNO_AI_PROVIDER=anthropic
-kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
-```
-
----
-
-## Configuration
-
-Kerno works with **zero configuration**. For custom setups, mount a `config.yaml` or use `KERNO_*` env vars:
-
-```yaml
-log_level: info
-
-collectors:
- syscall_latency: true
- tcp_monitor: true
- oom_track: true
- disk_io: true
- sched_delay: true
- fd_track: true
-
-doctor:
- duration: 30s
- thresholds:
- syscall_p99_warning_ns: 100000000 # 100ms
- syscall_p99_critical_ns: 500000000 # 500ms
- tcp_retransmit_pct: 2.0 # 2%
- oom_memory_pct: 90.0 # 90%
- disk_p99_warning_ns: 50000000 # 50ms
- disk_p99_critical_ns: 200000000 # 200ms
- sched_delay_warning_ns: 5000000 # 5ms
- sched_delay_critical_ns: 20000000 # 20ms
- fd_growth_per_sec: 10.0
-
-prometheus:
- enabled: true
- addr: ":9090"
-
-ai:
- enabled: false
- provider: anthropic
- privacy_mode: summary
```
-**Precedence:** CLI flags > environment variables (`KERNO_*`) > config file > defaults.
-
----
-
-## Roadmap
-
-See [TODO.md](TODO.md) for the full plan. Headlines:
-
-- **v0.1** - DaemonSet, 6 eBPF collectors, 11 rules, Prometheus, AI post-processor, 7 chaos scenarios, 13-phase verify pipeline - **shipped, all gates green on kernel 6.17**
-- **v0.2** - CRD for cluster-wide incident policies, OpenTelemetry OTLP export, Grafana dashboards, sliding-window aggregation
-- **v0.3** - historical incident replay, SLO-linked alerts, Slack / PagerDuty integrations
-- **v1.0** - multi-cluster control plane, managed offering (Optiqor Cloud)
-
---
## Building from Source
```bash
-# Requirements: Go 1.25+
-# Optional for real eBPF: clang 14+, libbpf-dev, llvm, bpftool
-
-make build # Build binary (uses BPF stubs - no clang needed)
-make generate # Run bpf2go to produce *_bpfel.go from C sources
-make bpf # Compile eBPF C programs to .o
-make bpf-verify # Build the standalone kernel-verifier load harness
-make test # Run unit tests
-make test-race # Run with race detector
-make lint # golangci-lint
-make check # vet + test + lint
-make verify # Comprehensive 13-phase production-readiness check
-make manpage # Generate man pages for all CLI commands
-make demo # Record demo.gif via vhs (needs vhs + ttyd + ffmpeg)
-make demo-cast # Record demo.cast via asciinema (alternative to vhs)
-make docker # Build Docker image
-```
-
-**Reproducing the verifier proof end-to-end:**
-
-```bash
-# Install eBPF toolchain
-sudo apt-get install -y clang llvm libbpf-dev linux-tools-$(uname -r) jq
-
-# Build, generate, verify everything in one shot
-make verify # exits 0 only if all 62 checks pass
-```
-
-**Inducing real incidents to demo or test rule firing:**
-
-```bash
-sudo tc qdisc add dev lo root netem loss 30% # optional, for tcp-loss
-kerno chaos --induce --intensity high --duration 30s
-
-# Available scenarios (kerno chaos --list):
-# cpu scheduler_contention
-# disk-sat disk_io_bottleneck
-# fd-leak fd_leak
-# memory oom_imminent
-# tcp-churn scheduler_contention
-# tcp-loss tcp_retransmit_storm
-# cascade multiple
+make build
+make verify # Full production readiness check
+make docker
```
-In another shell, `sudo kerno doctor` will catch the induced incident.
-
---
## Contributing
-Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for:
+See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, commit conventions, and review process.
-- Development setup and prerequisites
-- Commit message conventions (Conventional Commits)
-- Code review process
-- DCO sign-off requirement
-
-For security reports, see [SECURITY.md](SECURITY.md).
+For security issues, see [SECURITY.md](SECURITY.md).
---
## License
-Apache License 2.0 - see [LICENSE](LICENSE).
+Apache License 2.0 — see [LICENSE](LICENSE).
---
-If Kerno saved your on-call shift, consider leaving a **⭐** it helps other engineers find the project.
+If Kerno helped you during an incident, consider giving it a **⭐**. It helps others discover the project.
-
+
\ No newline at end of file
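
Reviewer note on the new Troubleshooting section: its first check (kernel 5.8+ with BTF) can be automated. The following is a minimal sketch, not Kerno's own preflight logic; the helper name `kernelAtLeast` is hypothetical, and the sketch assumes a Linux host where `/proc/sys/kernel/osrelease` and `/sys/kernel/btf/vmlinux` are the places to look.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a release string such as "5.15.0-91-generic"
// is at least wantMajor.wantMinor. Unparseable input yields false.
// (Hypothetical helper for illustration; not part of Kerno.)
func kernelAtLeast(release string, wantMajor, wantMinor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor field may carry a suffix ("8-rc1"), so keep only leading digits.
	minorDigits := parts[1]
	for i, r := range minorDigits {
		if r < '0' || r > '9' {
			minorDigits = minorDigits[:i]
			break
		}
	}
	minor, err := strconv.Atoi(minorDigits)
	if err != nil {
		return false
	}
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	release := strings.TrimSpace(string(raw))
	fmt.Printf("kernel %s >= 5.8: %v\n", release, kernelAtLeast(release, 5, 8))

	// /sys/kernel/btf/vmlinux is present when the kernel exposes BTF type data.
	_, err = os.Stat("/sys/kernel/btf/vmlinux")
	fmt.Println("BTF available:", err == nil)
}
```

Running this on a node before installing gives a quick yes/no for both requirements without loading any eBPF program.
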
diff --git a/internal/ai/fallback.go b/internal/ai/fallback.go
index 6119189..16a0c9e 100644
--- a/internal/ai/fallback.go
+++ b/internal/ai/fallback.go
@@ -133,5 +133,32 @@ func detectSimpleCorrelations(findings []doctor.Finding) []doctor.Correlation {
})
}
+ // Memory + cgroup memory → container pressure impacting host.
+ if signals["memory"] && signals["cgroupMemory"] {
+ correlations = append(correlations, doctor.Correlation{
+ Signals: []string{"memory", "cgroupMemory"},
+ Description: "Host memory pressure combined with container memory limits suggests that multiple containers are competing for resources.",
+ Confidence: 0.85,
+ })
+ }
+
+ // Scheduler + syscall → CPU contention causing syscall queueing.
+ if signals["sched"] && signals["syscall"] {
+ correlations = append(correlations, doctor.Correlation{
+ Signals: []string{"sched", "syscall"},
+ Description: "High scheduler delays combined with syscall latency indicate that CPU contention is causing system call queueing.",
+ Confidence: 0.80,
+ })
+ }
+
+ // TCP + memory → network buffer exhaustion.
+ if signals["tcp"] && signals["memory"] {
+ correlations = append(correlations, doctor.Correlation{
+ Signals: []string{"tcp", "memory"},
+ Description: "TCP issues combined with memory pressure may indicate network buffer exhaustion or connection pool limits.",
+ Confidence: 0.75,
+ })
+ }
+
return correlations
}
diff --git a/internal/ai/gemini.go b/internal/ai/gemini.go
new file mode 100644
index 0000000..044495e
--- /dev/null
+++ b/internal/ai/gemini.go
@@ -0,0 +1,184 @@
+// Copyright 2026 Optiqor contributors
+// SPDX-License-Identifier: Apache-2.0
+
+package ai
+
+import (
+ "bytes"
+ "context"
+ "encoding/json"
+ "fmt"
+ "io"
+ "net/http"
+ "time"
+)
+
+// GeminiProvider implements the Provider interface for the Google Gemini API.
+// Uses raw HTTP + JSON — no SDK dependency.
+type GeminiProvider struct {
+ apiKey string
+ model string
+ endpoint string
+ maxTokens int
+ temperature float64
+ client *http.Client
+}
+
+// NewGeminiProvider creates a new Gemini provider.
+func NewGeminiProvider(cfg ProviderConfig) *GeminiProvider {
+ endpoint := cfg.Endpoint
+ if endpoint == "" {
+ endpoint = "https://generativelanguage.googleapis.com/v1beta"
+ }
+
+ model := cfg.Model
+ if model == "" {
+ model = "gemini-1.5-flash" // Default to fast model
+ }
+
+ maxTokens := cfg.MaxTokens
+ if maxTokens == 0 {
+ maxTokens = 4096
+ }
+
+ temperature := cfg.Temperature
+ if temperature == 0 {
+ temperature = 0.7 // NOTE: an explicit temperature of 0 is also replaced, because 0 doubles as "unset"
+ }
+
+ return &GeminiProvider{
+ apiKey: cfg.APIKey,
+ model: model,
+ endpoint: endpoint,
+ maxTokens: maxTokens,
+ temperature: temperature,
+ client: &http.Client{
+ Timeout: 60 * time.Second,
+ },
+ }
+}
+
+// Name returns "gemini".
+func (p *GeminiProvider) Name() string {
+ return "gemini"
+}
+
+// Complete sends a completion request to the Gemini API.
+func (p *GeminiProvider) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error) {
+ if p.apiKey == "" {
+ return nil, fmt.Errorf("gemini: API key not configured (set KERNO_AI_API_KEY)")
+ }
+
+ // Build the request payload.
+ payload := geminiRequest{
+ Contents: []geminiContent{
+ {
+ Parts: []geminiPart{
+ {Text: req.SystemPrompt + "\n\n" + req.UserPrompt},
+ },
+ },
+ },
+ GenerationConfig: geminiGenerationConfig{
+ Temperature: p.temperature,
+ MaxOutputTokens: p.maxTokens,
+ },
+ }
+
+ body, err := json.Marshal(payload)
+ if err != nil {
+ return nil, fmt.Errorf("gemini: marshaling request: %w", err)
+ }
+
+ // Build the URL. The API key is sent in a header so it cannot leak through *url.Error text or access logs.
+ url := fmt.Sprintf("%s/models/%s:generateContent", p.endpoint, p.model)
+
+ httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
+ if err != nil {
+ return nil, fmt.Errorf("gemini: creating request: %w", err)
+ }
+
+ httpReq.Header.Set("Content-Type", "application/json")
+ httpReq.Header.Set("x-goog-api-key", p.apiKey)
+
+ // Send the request.
+ resp, err := p.client.Do(httpReq)
+ if err != nil {
+ return nil, fmt.Errorf("gemini: request failed: %w", err)
+ }
+ defer resp.Body.Close()
+
+ respBody, err := io.ReadAll(resp.Body)
+ if err != nil {
+ return nil, fmt.Errorf("gemini: reading response: %w", err)
+ }
+
+ if resp.StatusCode != http.StatusOK {
+ return nil, fmt.Errorf("gemini: HTTP %d: %s", resp.StatusCode, string(respBody))
+ }
+
+ // Parse the response.
+ var geminiResp geminiResponse
+ if err := json.Unmarshal(respBody, &geminiResp); err != nil {
+ return nil, fmt.Errorf("gemini: parsing response: %w", err)
+ }
+
+ // Extract text from candidates.
+ if len(geminiResp.Candidates) == 0 {
+ return nil, fmt.Errorf("gemini: no candidates in response")
+ }
+
+ candidate := geminiResp.Candidates[0]
+ if len(candidate.Content.Parts) == 0 {
+ return nil, fmt.Errorf("gemini: no parts in candidate content")
+ }
+
+ text := candidate.Content.Parts[0].Text
+
+ // Extract token usage.
+ tokensUsed := 0
+ if geminiResp.UsageMetadata != nil {
+ tokensUsed = geminiResp.UsageMetadata.PromptTokenCount +
+ geminiResp.UsageMetadata.CandidatesTokenCount
+ }
+
+ return &CompletionResponse{
+ Text: text,
+ TokensUsed: tokensUsed,
+ Model: p.model,
+ }, nil
+}
+
+// ─── Gemini API Types ───────────────────────────────────────────────────────
+
+type geminiRequest struct {
+ Contents []geminiContent `json:"contents"`
+ GenerationConfig geminiGenerationConfig `json:"generationConfig"`
+}
+
+type geminiContent struct {
+ Parts []geminiPart `json:"parts"`
+}
+
+type geminiPart struct {
+ Text string `json:"text"`
+}
+
+type geminiGenerationConfig struct {
+ Temperature float64 `json:"temperature"`
+ MaxOutputTokens int `json:"maxOutputTokens"`
+}
+
+type geminiResponse struct {
+ Candidates []geminiCandidate `json:"candidates"`
+ UsageMetadata *geminiUsageMetadata `json:"usageMetadata,omitempty"`
+}
+
+type geminiCandidate struct {
+ Content geminiContent `json:"content"`
+}
+
+type geminiUsageMetadata struct {
+ PromptTokenCount int `json:"promptTokenCount"`
+ CandidatesTokenCount int `json:"candidatesTokenCount"`
+ TotalTokenCount int `json:"totalTokenCount"`
+}
diff --git a/internal/ai/gemini_test.go b/internal/ai/gemini_test.go
new file mode 100644
index 0000000..3cea770
--- /dev/null
+++ b/internal/ai/gemini_test.go
@@ -0,0 +1,196 @@
+// Copyright 2026 Optiqor contributors
+// SPDX-License-Identifier: Apache-2.0
+
+package ai
+
+import (
+ "context"
+ "encoding/json"
+ "net/http"
+ "net/http/httptest"
+ "testing"
+ "time"
+)
+
+func TestGeminiProvider_Complete(t *testing.T) {
+ tests := []struct {
+ name string
+ response geminiResponse
+ wantText string
+ wantTokens int
+ wantModel string
+ wantStatusCode int
+ wantError bool
+ }{
+ {
+ name: "successful completion",
+ response: geminiResponse{
+ Candidates: []geminiCandidate{
+ {
+ Content: geminiContent{
+ Parts: []geminiPart{
+ {Text: "This is a test response from Gemini."},
+ },
+ },
+ },
+ },
+ UsageMetadata: &geminiUsageMetadata{
+ PromptTokenCount: 10,
+ CandidatesTokenCount: 8,
+ TotalTokenCount: 18,
+ },
+ },
+ wantText: "This is a test response from Gemini.",
+ wantTokens: 18,
+ wantModel: "gemini-1.5-flash",
+ wantStatusCode: http.StatusOK,
+ wantError: false,
+ },
+ {
+ name: "no usage metadata",
+ response: geminiResponse{
+ Candidates: []geminiCandidate{
+ {
+ Content: geminiContent{
+ Parts: []geminiPart{
+ {Text: "Response without metadata"},
+ },
+ },
+ },
+ },
+ },
+ wantText: "Response without metadata",
+ wantTokens: 0,
+ wantModel: "gemini-1.5-flash",
+ wantStatusCode: http.StatusOK,
+ wantError: false,
+ },
+ {
+ name: "API error",
+ response: geminiResponse{},
+ wantStatusCode: http.StatusUnauthorized,
+ wantError: true,
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ // Create mock server
+ server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+ // Verify request method and headers
+ if r.Method != "POST" {
+ t.Errorf("Expected POST request, got %s", r.Method)
+ }
+ if r.Header.Get("Content-Type") != "application/json" {
+ t.Errorf("Expected Content-Type: application/json, got %s", r.Header.Get("Content-Type"))
+ }
+
+ // Send response
+ w.WriteHeader(tt.wantStatusCode)
+ if tt.wantStatusCode == http.StatusOK {
+ _ = json.NewEncoder(w).Encode(tt.response) // write errors are irrelevant to this mock
+ } else {
+ _, _ = w.Write([]byte(`{"error": {"message": "API error"}}`))
+ }
+ }))
+ defer server.Close()
+
+ // Create provider with test server endpoint
+ provider := NewGeminiProvider(ProviderConfig{
+ Name: "gemini",
+ Model: "gemini-1.5-flash",
+ APIKey: "test-key",
+ Endpoint: server.URL,
+ MaxTokens: 1000,
+ Temperature: 0.7,
+ })
+
+ // Create completion request
+ req := CompletionRequest{
+ SystemPrompt: "You are a helpful assistant.",
+ UserPrompt: "Hello, world!",
+ MaxTokens: 1000,
+ Temperature: 0.7,
+ }
+
+ // Call Complete
+ ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+ defer cancel()
+
+ resp, err := provider.Complete(ctx, req)
+
+ // Check error expectation
+ if tt.wantError {
+ if err == nil {
+ t.Fatal("Expected error, got nil")
+ }
+ return
+ }
+
+ if err != nil {
+ t.Fatalf("Unexpected error: %v", err)
+ }
+
+ // Verify response
+ if resp.Text != tt.wantText {
+ t.Errorf("Text = %q, want %q", resp.Text, tt.wantText)
+ }
+ if resp.TokensUsed != tt.wantTokens {
+ t.Errorf("TokensUsed = %d, want %d", resp.TokensUsed, tt.wantTokens)
+ }
+ if resp.Model != tt.wantModel {
+ t.Errorf("Model = %q, want %q", resp.Model, tt.wantModel)
+ }
+ })
+ }
+}
+
+func TestGeminiProvider_Name(t *testing.T) {
+ provider := NewGeminiProvider(ProviderConfig{})
+ if got := provider.Name(); got != "gemini" {
+ t.Errorf("Name() = %q, want %q", got, "gemini")
+ }
+}
+
+func TestGeminiProvider_NoAPIKey(t *testing.T) {
+ provider := NewGeminiProvider(ProviderConfig{
+ Name: "gemini",
+ Model: "gemini-1.5-flash",
+ Endpoint: "https://example.com",
+ })
+
+ req := CompletionRequest{
+ SystemPrompt: "test",
+ UserPrompt: "test",
+ }
+
+ ctx := context.Background()
+ _, err := provider.Complete(ctx, req)
+
+ if err == nil {
+ t.Fatal("Expected error for missing API key, got nil")
+ }
+
+ if err.Error() != "gemini: API key not configured (set KERNO_AI_API_KEY)" {
+ t.Errorf("Unexpected error message: %v", err)
+ }
+}
+
+func TestGeminiProvider_Defaults(t *testing.T) {
+ provider := NewGeminiProvider(ProviderConfig{
+ APIKey: "test-key",
+ })
+
+ if provider.model != "gemini-1.5-flash" {
+ t.Errorf("Default model = %q, want %q", provider.model, "gemini-1.5-flash")
+ }
+ if provider.endpoint != "https://generativelanguage.googleapis.com/v1beta" {
+ t.Errorf("Default endpoint = %q, want %q", provider.endpoint, "https://generativelanguage.googleapis.com/v1beta")
+ }
+ if provider.maxTokens != 4096 {
+ t.Errorf("Default maxTokens = %d, want %d", provider.maxTokens, 4096)
+ }
+ if provider.temperature != 0.7 {
+ t.Errorf("Default temperature = %f, want %f", provider.temperature, 0.7)
+ }
+}
diff --git a/internal/ai/provider.go b/internal/ai/provider.go
index 90c8b39..0aa8fdb 100644
--- a/internal/ai/provider.go
+++ b/internal/ai/provider.go
@@ -81,7 +81,9 @@ func NewProvider(cfg ProviderConfig) (Provider, error) {
return NewOpenAIProvider(cfg), nil
case "ollama":
return NewOllamaProvider(cfg), nil
+ case "gemini":
+ return NewGeminiProvider(cfg), nil
default:
- return nil, fmt.Errorf("unknown AI provider %q: must be anthropic, openai, or ollama", cfg.Name)
+ return nil, fmt.Errorf("unknown AI provider %q: must be anthropic, openai, ollama, or gemini", cfg.Name)
}
}
diff --git a/internal/bpf/errors.go b/internal/bpf/errors.go
new file mode 100644
index 0000000..04b2415
--- /dev/null
+++ b/internal/bpf/errors.go
@@ -0,0 +1,116 @@
+// Copyright 2026 Optiqor contributors
+// SPDX-License-Identifier: Apache-2.0
+
+package bpf
+
+import (
+ "fmt"
+ "strings"
+)
+
+// LoadError represents an eBPF program load failure with additional context.
+type LoadError struct {
+ Program string // Program name (e.g., "syscall_latency")
+ Err error // Underlying error
+ Hint string // User-facing hint on how to fix
+}
+
+// Error implements the error interface.
+func (e *LoadError) Error() string {
+ if e.Hint != "" {
+ return fmt.Sprintf("%s: %v (hint: %s)", e.Program, e.Err, e.Hint)
+ }
+ return fmt.Sprintf("%s: %v", e.Program, e.Err)
+}
+
+// Unwrap returns the underlying error.
+func (e *LoadError) Unwrap() error {
+ return e.Err
+}
+
+// WrapLoadError wraps an eBPF load error with program context and a helpful hint.
+func WrapLoadError(program string, err error) error {
+ if err == nil {
+ return nil
+ }
+
+ hint := classifyLoadError(err)
+ return &LoadError{
+ Program: program,
+ Err: err,
+ Hint: hint,
+ }
+}
+
+// classifyLoadError returns a user-facing fix hint chosen by case-insensitive substring match on the error message; the first matching case wins.
+func classifyLoadError(err error) string {
+ if err == nil {
+ return ""
+ }
+
+ msg := strings.ToLower(err.Error())
+
+ switch {
+ case IsPermissionError(err):
+ return "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities"
+
+ case strings.Contains(msg, "memlock") || strings.Contains(msg, "rlimit"):
+ return "increase memlock limit: ulimit -l unlimited (or run as root)"
+
+ case strings.Contains(msg, "btf") && strings.Contains(msg, "not found"):
+ return "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)"
+
+ case strings.Contains(msg, "vmlinux"):
+ return "missing /sys/kernel/btf/vmlinux — kernel must be compiled with BTF support"
+
+ case IsVerifierError(err):
+ return "BPF verifier rejected the program — may need newer kernel or different approach"
+
+ case strings.Contains(msg, "no such file") && strings.Contains(msg, "tracepoint"):
+ return "tracepoint not available on this kernel — try kernel 5.10+ or file an issue"
+
+ case strings.Contains(msg, "program too large"):
+ return "program exceeds BPF complexity limit — file an issue with kernel version"
+
+ case strings.Contains(msg, "unknown") && strings.Contains(msg, "attach type"):
+ return "attach type not supported on this kernel — requires 5.15+"
+
+ case strings.Contains(msg, "busy") || strings.Contains(msg, "in use"):
+ return "resource already in use — another BPF program may be attached"
+
+ case strings.Contains(msg, "libbpf"):
+ return "libbpf error — ensure libbpf-dev is installed and up to date"
+
+ default:
+ return "check kernel version (5.8+ required), BTF support, and capabilities"
+ }
+}
+
+// IsPermissionError returns true if the error is related to insufficient permissions.
+func IsPermissionError(err error) bool {
+ if err == nil {
+ return false
+ }
+ msg := strings.ToLower(err.Error())
+ return strings.Contains(msg, "operation not permitted") ||
+ strings.Contains(msg, "permission denied") ||
+ strings.Contains(msg, "eperm")
+}
+
+// IsBTFError returns true if the error is related to missing BTF support.
+func IsBTFError(err error) bool {
+ if err == nil {
+ return false
+ }
+ msg := strings.ToLower(err.Error())
+ return strings.Contains(msg, "btf") || strings.Contains(msg, "vmlinux")
+}
+
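+// IsVerifierError reports whether the error message mentions the verifier or contains "invalid"; the latter is a broad heuristic that also matches a plain EINVAL ("invalid argument").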
+func IsVerifierError(err error) bool {
+ if err == nil {
+ return false
+ }
+ msg := strings.ToLower(err.Error())
+ return strings.Contains(msg, "verifier") || strings.Contains(msg, "invalid")
+}
diff --git a/internal/bpf/errors_test.go b/internal/bpf/errors_test.go
new file mode 100644
index 0000000..d511866
--- /dev/null
+++ b/internal/bpf/errors_test.go
@@ -0,0 +1,275 @@
+// Copyright 2026 Optiqor contributors
+// SPDX-License-Identifier: Apache-2.0
+
+package bpf
+
+import (
+ "errors"
+ "strings"
+ "testing"
+)
+
+func TestWrapLoadError(t *testing.T) {
+ tests := []struct {
+ name string
+ program string
+ err error
+ wantHint string
+ wantContain string
+ }{
+ {
+ name: "permission denied",
+ program: "syscall_latency",
+ err: errors.New("operation not permitted"),
+ wantHint: "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities",
+ wantContain: "syscall_latency",
+ },
+ {
+ name: "BTF missing",
+ program: "tcp_monitor",
+ err: errors.New("btf not found"),
+ wantHint: "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)",
+ wantContain: "tcp_monitor",
+ },
+ {
+ name: "verifier error",
+ program: "disk_io",
+ err: errors.New("verifier rejected program"),
+ wantHint: "BPF verifier rejected the program — may need newer kernel or different approach",
+ wantContain: "disk_io",
+ },
+ {
+ name: "memlock limit",
+ program: "oom_track",
+ err: errors.New("memlock rlimit exceeded"),
+ wantHint: "increase memlock limit: ulimit -l unlimited (or run as root)",
+ wantContain: "oom_track",
+ },
+ {
+ name: "nil error",
+ program: "test",
+ err: nil,
+ wantHint: "",
+ wantContain: "",
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ wrapped := WrapLoadError(tt.program, tt.err)
+
+ if tt.err == nil {
+ if wrapped != nil {
+ t.Errorf("WrapLoadError(nil) = %v, want nil", wrapped)
+ }
+ return
+ }
+
+ if wrapped == nil {
+ t.Fatal("WrapLoadError returned nil for non-nil error")
+ }
+
+ var loadErr *LoadError
+ if !errors.As(wrapped, &loadErr) {
+ t.Fatal("Wrapped error is not a *LoadError")
+ }
+
+ if loadErr.Program != tt.program {
+ t.Errorf("Program = %q, want %q", loadErr.Program, tt.program)
+ }
+
+ if loadErr.Hint != tt.wantHint {
+ t.Errorf("Hint = %q, want %q", loadErr.Hint, tt.wantHint)
+ }
+
+ errStr := wrapped.Error()
+ if !strings.Contains(errStr, tt.wantContain) {
+ t.Errorf("Error() = %q, want it to contain %q", errStr, tt.wantContain)
+ }
+
+ // Test Unwrap
+ if !errors.Is(wrapped, tt.err) {
+ t.Error("Unwrap() should return the original error")
+ }
+ })
+ }
+}
+
+func TestIsPermissionError(t *testing.T) {
+ tests := []struct {
+ name string
+ err error
+ want bool
+ }{
+ {
+ name: "operation not permitted",
+ err: errors.New("operation not permitted"),
+ want: true,
+ },
+ {
+ name: "permission denied",
+ err: errors.New("permission denied"),
+ want: true,
+ },
+ {
+ name: "EPERM",
+ err: errors.New("error: EPERM"),
+ want: true,
+ },
+ {
+ name: "other error",
+ err: errors.New("btf not found"),
+ want: false,
+ },
+ {
+ name: "nil error",
+ err: nil,
+ want: false,
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ if got := IsPermissionError(tt.err); got != tt.want {
+ t.Errorf("IsPermissionError() = %v, want %v", got, tt.want)
+ }
+ })
+ }
+}
+
+func TestIsBTFError(t *testing.T) {
+ tests := []struct {
+ name string
+ err error
+ want bool
+ }{
+ {
+ name: "btf not found",
+ err: errors.New("btf not found"),
+ want: true,
+ },
+ {
+ name: "vmlinux missing",
+ err: errors.New("vmlinux not available"),
+ want: true,
+ },
+ {
+ name: "permission error",
+ err: errors.New("operation not permitted"),
+ want: false,
+ },
+ {
+ name: "nil error",
+ err: nil,
+ want: false,
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ if got := IsBTFError(tt.err); got != tt.want {
+ t.Errorf("IsBTFError() = %v, want %v", got, tt.want)
+ }
+ })
+ }
+}
+
+func TestIsVerifierError(t *testing.T) {
+ tests := []struct {
+ name string
+ err error
+ want bool
+ }{
+ {
+ name: "verifier rejected",
+ err: errors.New("verifier rejected program"),
+ want: true,
+ },
+ {
+ name: "invalid instruction",
+ err: errors.New("invalid BPF instruction"),
+ want: true,
+ },
+ {
+ name: "btf error",
+ err: errors.New("btf not found"),
+ want: false,
+ },
+ {
+ name: "nil error",
+ err: nil,
+ want: false,
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ if got := IsVerifierError(tt.err); got != tt.want {
+ t.Errorf("IsVerifierError() = %v, want %v", got, tt.want)
+ }
+ })
+ }
+}
+
+func TestClassifyLoadError(t *testing.T) {
+ tests := []struct {
+ name string
+ err error
+ wantHint string
+ }{
+ {
+ name: "permission denied",
+ err: errors.New("operation not permitted"),
+ wantHint: "run with sudo or grant CAP_BPF+CAP_PERFMON+CAP_SYS_ADMIN capabilities",
+ },
+ {
+ name: "memlock limit",
+ err: errors.New("memlock rlimit exceeded"),
+ wantHint: "increase memlock limit: ulimit -l unlimited (or run as root)",
+ },
+ {
+ name: "BTF missing",
+ err: errors.New("btf not found"),
+ wantHint: "kernel needs CONFIG_DEBUG_INFO_BTF=y (requires kernel 5.8+)",
+ },
+ {
+ name: "vmlinux missing",
+ err: errors.New("/sys/kernel/btf/vmlinux: no such file"),
+ wantHint: "missing /sys/kernel/btf/vmlinux — kernel must be compiled with BTF support",
+ },
+ {
+ name: "verifier rejection",
+ err: errors.New("BPF verifier rejected: invalid"),
+ wantHint: "BPF verifier rejected the program — may need newer kernel or different approach",
+ },
+ {
+ name: "tracepoint unavailable",
+ err: errors.New("no such file or directory: tracepoint"),
+ wantHint: "tracepoint not available on this kernel — try kernel 5.10+ or file an issue",
+ },
+ {
+ name: "program too large",
+ err: errors.New("program too large: exceeds complexity limit"),
+ wantHint: "program exceeds BPF complexity limit — file an issue with kernel version",
+ },
+ {
+ name: "unknown error",
+ err: errors.New("some unknown error"),
+ wantHint: "check kernel version (5.8+ required), BTF support, and capabilities",
+ },
+ {
+ name: "nil error",
+ err: nil,
+ wantHint: "",
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ hint := classifyLoadError(tt.err)
+ if hint != tt.wantHint {
+ t.Errorf("classifyLoadError() = %q, want %q", hint, tt.wantHint)
+ }
+ })
+ }
+}