feat(bpf): detect DNS resolution failures via UDP/53 eBPF (#22) #201

Open
Anushreer22 wants to merge 21 commits into optiqor:main from Anushreer22:feat/dns-monitor-ebpf-22

@Anushreer22

Contributor

What

Implements DNS visibility for Kerno. DNS resolution failure is the #2 silent
killer on Kubernetes. Adds an eBPF program that hooks the sendmsg/recvmsg
syscalls, filtered to UDP port 53, measures per-pod DNS latency and failure
rates query by query, and fires doctor rules when CoreDNS is slow or dropping queries.

Files Changed

| File | Purpose |
| --- | --- |
| internal/bpf/c/dns_monitor.c | eBPF C program: hooks sys_enter_sendmsg + sys_enter_recvmsg, filters port 53, tracks latency via dns_inflight hash map |
| internal/bpf/dns_monitor.go | Go loader: attaches both tracepoints, reads ring buffer (build tag: ebpf) |
| internal/bpf/gen_stub.go | dnsMonitorObjects stub for non-ebpf builds |
| internal/bpf/events.go | DNSEvent struct + DNSEventType constants matching C struct exactly |
| internal/bpf/loader.go | EventDNSMonitor = 8 added |
| internal/collector/dns.go | DNSCollector: aggregates request/response/failure rates, p50/p95/p99 latency histogram, per-process stats, 5s timeout reaper goroutine |
| internal/collector/signals.go | DNSSnapshot + DNSConsumerEntry types added to Signals |
| internal/doctor/rules.go | dns_high_latency + dns_failure_rate rules wired into Evaluate() |
| internal/chaos/dns.go | dns-flood chaos scenario firing concurrent DNS lookups |
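
For reviewers who want a feel for the p50/p95/p99 figures that DNSCollector reports, here is a minimal sketch of a nearest-rank percentile over raw latency samples. It is an illustration only: the name percentileNs and the choice of sorting exact samples (rather than the bucketed histogram this PR describes) are assumptions, not the code in internal/collector/dns.go.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentileNs returns the nearest-rank percentile p (0 < p <= 100) of
// latency samples given in nanoseconds. It returns 0 for an empty input
// and does not modify the caller's slice.
// Hypothetical helper for illustration; not part of the PR.
func percentileNs(samples []uint64, p float64) uint64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Nearest rank: the smallest 1-based rank that covers p percent of samples.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Ten made-up latencies, 1ms through 10ms, expressed in nanoseconds.
	var samples []uint64
	for ms := uint64(1); ms <= 10; ms++ {
		samples = append(samples, ms*1_000_000)
	}
	for _, p := range []float64{50, 95, 99} {
		fmt.Printf("p%.0f = %dms\n", p, percentileNs(samples, p)/1_000_000)
	}
}
```

A real histogram trades this exactness for bounded memory, which matters when the collector runs continuously.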

Doctor Rules

| Rule | WARNING | CRITICAL |
| --- | --- | --- |
| dns_high_latency | P99 > 100ms | P99 > 500ms |
| dns_failure_rate | > 1% timeouts | > 5% timeouts |
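
The thresholds above could be evaluated roughly as follows. This is a hedged sketch: the Severity type and the functions classify, dnsHighLatency, and dnsFailureRate are hypothetical names chosen for illustration, and the real rules in internal/doctor/rules.go may be structured differently. The comparisons are strict, matching the ">" in the rule definitions.

```go
package main

import "fmt"

// Severity is a hypothetical result type for this sketch.
type Severity int

const (
	SeverityOK Severity = iota
	SeverityWarning
	SeverityCritical
)

func (s Severity) String() string {
	switch s {
	case SeverityWarning:
		return "WARNING"
	case SeverityCritical:
		return "CRITICAL"
	default:
		return "OK"
	}
}

// classify compares value against strict ">" thresholds.
func classify(value, warn, crit float64) Severity {
	switch {
	case value > crit:
		return SeverityCritical
	case value > warn:
		return SeverityWarning
	default:
		return SeverityOK
	}
}

// dnsHighLatency applies the P99 thresholds: > 100ms WARNING, > 500ms CRITICAL.
func dnsHighLatency(p99Millis float64) Severity {
	return classify(p99Millis, 100, 500)
}

// dnsFailureRate applies the timeout thresholds: > 1% WARNING, > 5% CRITICAL.
// With no queries observed there is nothing to report.
func dnsFailureRate(timeouts, total uint64) Severity {
	if total == 0 {
		return SeverityOK
	}
	pct := float64(timeouts) * 100 / float64(total)
	return classify(pct, 1, 5)
}

func main() {
	fmt.Println(dnsHighLatency(250))     // between 100ms and 500ms
	fmt.Println(dnsFailureRate(6, 100))  // 6% timeouts
	fmt.Println(dnsFailureRate(0, 1000)) // no timeouts
}
```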

How It Works

  1. eBPF hooks sys_enter_sendmsg — records send timestamp in dns_inflight map keyed by (pid, query_id)
  2. eBPF hooks sys_enter_recvmsg — emits recv event; userspace looks up send timestamp to compute latency
  3. A reaper goroutine runs every 5s to expire in-flight queries with no response → counted as failures
  4. Filter is enforced in the eBPF program (port == 53), not userspace — overhead < 0.1% CPU
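
Steps 1 to 3 can be sketched in userspace Go as a small in-flight tracker keyed by (pid, query_id). Everything here is an assumption made for illustration: the names inflightTracker, onSend, onRecv, and reap are hypothetical, timestamps are plain nanosecond counters of the kind a kernel monotonic clock yields, and the actual PR keeps part of this state in the dns_inflight eBPF map rather than in a Go map.

```go
package main

import (
	"fmt"
	"sync"
)

// queryKey mirrors the (pid, query_id) key described in step 1.
type queryKey struct {
	PID     uint32
	QueryID uint16
}

// inflightTracker is a hypothetical userspace stand-in for the in-flight map.
type inflightTracker struct {
	mu        sync.Mutex
	timeoutNs uint64
	sent      map[queryKey]uint64 // send timestamp in nanoseconds
	failures  uint64
}

func newInflightTracker(timeoutNs uint64) *inflightTracker {
	return &inflightTracker{timeoutNs: timeoutNs, sent: make(map[queryKey]uint64)}
}

// onSend records when a query left the process.
func (t *inflightTracker) onSend(pid uint32, id uint16, nowNs uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.sent[queryKey{pid, id}] = nowNs
}

// onRecv matches a response to its query and returns the latency.
// ok is false when no matching send was seen (or it was already reaped).
func (t *inflightTracker) onRecv(pid uint32, id uint16, nowNs uint64) (latencyNs uint64, ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := queryKey{pid, id}
	sentNs, found := t.sent[k]
	if !found {
		return 0, false
	}
	delete(t.sent, k)
	if nowNs < sentNs {
		return 0, true // guard against out-of-order timestamps
	}
	return nowNs - sentNs, true
}

// reap expires queries older than the timeout and counts them as failures.
// A collector would call this from a goroutine driven by a time.Ticker.
func (t *inflightTracker) reap(nowNs uint64) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	expired := 0
	for k, sentNs := range t.sent {
		if nowNs >= sentNs && nowNs-sentNs >= t.timeoutNs {
			delete(t.sent, k)
			expired++
		}
	}
	t.failures += uint64(expired)
	return expired
}

func main() {
	tr := newInflightTracker(5_000_000_000) // 5s timeout
	tr.onSend(42, 0xBEEF, 1_000_000_000)
	lat, ok := tr.onRecv(42, 0xBEEF, 1_012_000_000)
	fmt.Println(lat/1_000_000, "ms", ok)

	tr.onSend(42, 0x0001, 2_000_000_000) // never answered
	fmt.Println("expired:", tr.reap(7_000_000_000))
}
```

Passing timestamps in explicitly, instead of calling time.Now inside the tracker, keeps the matching logic deterministic and easy to test.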

Acceptance Criteria

  • eBPF filter at kernel level (port == 53)
  • dns_high_latency: P99 > 100ms = WARNING, > 500ms = CRITICAL
  • dns_failure_rate: > 1% = WARNING, > 5% = CRITICAL
  • Per-process enrichment via comm + PID
  • 5-second failure timeout (no matching response = failed)
  • kerno chaos --induce dns-flood pairs with dns_high_latency rule
  • Stub added to gen_stub.go so non-ebpf builds compile cleanly
  • Bare metal compatible (chaos falls back to 8.8.8.8 if 127.0.0.53 unavailable)
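
The bare-metal fallback in the last criterion might look like the sketch below. It assumes both resolvers listen on port 53 and takes the availability check as an injected function; pickResolver and the probe parameter are hypothetical and not taken from internal/chaos/dns.go.

```go
package main

import "fmt"

// Resolver addresses named in the acceptance criteria; port 53 is assumed.
const (
	stubResolver   = "127.0.0.53:53"
	publicResolver = "8.8.8.8:53"
)

// pickResolver returns the local stub resolver when probe reports it usable,
// and falls back to the public resolver otherwise (including a nil probe).
func pickResolver(probe func(addr string) bool) string {
	if probe != nil && probe(stubResolver) {
		return stubResolver
	}
	return publicResolver
}

func main() {
	// Simulate a host without a local stub resolver.
	unavailable := func(string) bool { return false }
	fmt.Println(pickResolver(unavailable))
}
```

Injecting the probe keeps the selection logic testable without network access; a real probe would need to send an actual query, since opening a UDP socket alone does not prove a resolver is listening.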

Testing

# Trigger the rules with chaos
kerno chaos --induce dns-flood --duration 30s

# Verify doctor fires DNS findings
kerno doctor

# Verify eBPF verifier passes on kernel 5.15+
make bpf-verify

Closes #22

@Anushreer22 Anushreer22 requested a review from btwshivam as a code owner June 8, 2026 02:37
@github-actions

github-actions Bot commented Jun 8, 2026

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs a Signed-off-by: line; git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added labels on Jun 8, 2026: level:critical (Touches BPF, security, or release surfaces; auto-applied), testing (Tests and test coverage), area/bpf (eBPF programs and loaders), area/doctor (Diagnostic engine and rules)
- Converted 7 event decode tests to table-driven style
- Added exact size, short buffer, and oversized buffer cases
- Merged TestDecodeSyscallEventTooShort into the table
- All tests pass with t.Parallel()

Signed-off-by: Anushree R <anushreer695@gmail.com>
- internal/bpf/c/dns_monitor.c: eBPF program hooking sys_enter_sendmsg +
  sys_enter_recvmsg, filtering port 53, tracking latency via dns_inflight map
- internal/bpf/dns_monitor.go: Go loader with go:build ebpf tag
- internal/bpf/gen_stub.go: dnsMonitorObjects stub for non-ebpf builds
- internal/bpf/events.go: DNSEvent struct + DNSEventType constants
- internal/bpf/loader.go: EventDNSMonitor = 8
- internal/collector/dns.go: DNSCollector with 5s failure reaper
- internal/collector/signals.go: DNSSnapshot + DNSConsumerEntry types
- internal/doctor/rules.go: dns_high_latency + dns_failure_rate rules
- internal/chaos/dns.go: dns-flood chaos scenario

Closes optiqor#22
@Anushreer22 Anushreer22 force-pushed the feat/dns-monitor-ebpf-22 branch from 48baa58 to 8793161 Compare June 8, 2026 02:41
Signed-off-by: Anushree R <anushreer695@gmail.com>
Signed-off-by: Anushree R <anushreer695@gmail.com>
…param lint

Signed-off-by: Anushree R <anushreer695@gmail.com>
…ther loaders

Signed-off-by: Anushree R <anushreer695@gmail.com>
Signed-off-by: Anushree R <anushreer695@gmail.com>
Signed-off-by: Anushree R <anushreer695@gmail.com>
Signed-off-by: Anushree R <anushreer695@gmail.com>
@github-actions github-actions Bot added the area/ops label (Operations, deployment, runtime ergonomics) on Jun 9, 2026
Development

Successfully merging this pull request may close these issues.

Doctor: detect DNS resolution failures via UDP/53 eBPF (K8s #2 silent killer)
