feat(collector): add rate limiting, drop counters and BPF backpressure #212

Open
ariktadas144 wants to merge 5 commits into optiqor:main from ariktadas144:feat/collector-ratelimit-backpressure


@ariktadas144

Closes #46.

Summary

Implements Phase 9.2.4: bounded-overhead guarantees for kerno collectors under sustained high event rates.

Changes

1. Ringbuf drop observability

  • kerno_drop_count PERCPU_ARRAY map added to kerno.h
  • KERNO_RECORD_DROP() called at every bpf_ringbuf_reserve() NULL-check site (8 sites across all 6 programs)
  • DropMapper interface + DropMap() on all 6 loaders
  • pollDropMaps() goroutine in start.go polls every 5s and increments kerno_ringbuf_drops_total{program,cpu} with observed deltas
  • Works with cilium/ebpf v0.21.0 which has no reader-side drop API

2. Adaptive userspace sampling

  • internal/collector/aggregator/ratelimit.go: token-bucket + probabilistic sampler, safe for concurrent use
  • Allow() wired into all 5 collector record() paths
  • kerno_collector_sampled_total{collector} increments on every rate-limited event
  • p99 accuracy verified within ±5% at 80% sampling (0.02% actual error)

3. BPF-side backpressure

  • KERNO_BACKPRESSURE() macro + cpu_backpressure per-CPU map in kerno.h
  • Guard at all 8 bpf_ringbuf_reserve() sites across 6 programs

4. Overhead control loop

  • Reads /proc/self/stat every 5s, updates kerno_overhead_pct
  • Controlled by collectors.sampling.target_overhead_pct (default 1.0%)

5. Config

collectors:
  rate_limits:
    syscall_latency: 500000
    sched_delay: 200000
  sampling:
    enabled: true
    target_overhead_pct: 1.0
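
As a rough picture of how this configuration could map onto Go types: the type names `CollectorRateLimits` and `CollectorSamplingConfig` come from this PR, but the fields, struct tags, `CollectorsConfig` wrapper, and helper methods below are assumptions for illustration. The unit of each rate limit is not stated here; the sketch treats it as an opaque integer.

```go
package main

import "fmt"

// CollectorRateLimits maps a collector name to its configured limit.
// (Assumed shape; the real type may be a struct with one field per collector.)
type CollectorRateLimits map[string]int

// CollectorSamplingConfig mirrors the "sampling" block (assumed fields).
type CollectorSamplingConfig struct {
	Enabled           bool    `yaml:"enabled"`
	TargetOverheadPct float64 `yaml:"target_overhead_pct"`
}

// CollectorsConfig is a hypothetical wrapper for the "collectors" block.
type CollectorsConfig struct {
	RateLimits CollectorRateLimits     `yaml:"rate_limits"`
	Sampling   CollectorSamplingConfig `yaml:"sampling"`
}

// applyDefaults fills in the documented default target of 1.0%.
func (c *CollectorsConfig) applyDefaults() {
	if c.Sampling.TargetOverheadPct <= 0 {
		c.Sampling.TargetOverheadPct = 1.0
	}
}

// limitFor returns the configured limit for a collector, if any.
func (c CollectorsConfig) limitFor(name string) (int, bool) {
	v, ok := c.RateLimits[name]
	return v, ok
}

func main() {
	cfg := CollectorsConfig{
		RateLimits: CollectorRateLimits{"syscall_latency": 500000, "sched_delay": 200000},
		Sampling:   CollectorSamplingConfig{Enabled: true},
	}
	cfg.applyDefaults()
	fmt.Println(cfg.Sampling.TargetOverheadPct) // 1
	fmt.Println(cfg.limitFor("sched_delay"))    // 200000 true
}
```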

New metrics

| Metric | Type | Description |
| --- | --- | --- |
| kerno_ringbuf_drops_total | Counter | BPF-side drops at ringbuf_reserve, by program and CPU |
| kerno_collector_sampled_total | Counter | Userspace rate-limiter drops, by collector |
| kerno_overhead_pct | Gauge | kerno CPU overhead; alert if > 2% |

Test results

All packages pass. 4 new TestRateLimiter_* tests including p99 accuracy.

Acceptance criteria

  • Drop counters non-zero under overflow (BPF map polled every 5s)
  • Sampling counters non-zero when budget exceeded
  • kerno_overhead_pct at /metrics
  • p99 within ±5% at 80% sampling
  • All 6 eBPF programs guarded at 8 sites
  • All existing tests pass
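
For readers who want to see what a p99-under-sampling check looks like, the sketch below compares the p99 of a synthetic latency stream with the p99 of a random subset. It does not reproduce the PR's TestRateLimiter_* tests: the exponential distribution, the nearest-rank percentile, the seed, and the reading of "80% sampling" as keeping each event with probability 0.8 are all assumptions.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// percentile returns the nearest-rank p-quantile of an ascending slice.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	rank := int(math.Ceil(p * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// p99RelErr draws n synthetic values, keeps each with probability keep, and
// returns the relative error of the sampled p99 against the full p99.
func p99RelErr(n int, keep float64, seed int64) float64 {
	r := rand.New(rand.NewSource(seed))
	full := make([]float64, 0, n)
	kept := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		v := r.ExpFloat64()
		full = append(full, v)
		if r.Float64() < keep {
			kept = append(kept, v)
		}
	}
	sort.Float64s(full)
	sort.Float64s(kept)
	want := percentile(full, 0.99)
	got := percentile(kept, 0.99)
	return math.Abs(got-want) / want
}

func main() {
	relErr := p99RelErr(200000, 0.8, 1)
	fmt.Printf("within 5%%: %v\n", relErr < 0.05)
}
```

Uniform random sampling leaves the shape of the distribution unchanged, which is why a tail quantile such as p99 stays close to the unsampled value when enough events are kept.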

Signed-off-by: ariktadas144 <ariktadas144@gmail.com>
@ariktadas144 ariktadas144 requested a review from btwshivam as a code owner June 10, 2026 11:34
@github-actions

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs a Signed-off-by: line; git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added the following labels on Jun 10, 2026:

  • level:critical: Touches BPF, security, or release surfaces (auto-applied)
  • testing: Tests and test coverage
  • area/bpf: eBPF programs and loaders
  • area/doctor: Diagnostic engine and rules
  • area/perf: Performance and throughput
  • area/ops: Operations, deployment, runtime ergonomics
@ariktadas144 ariktadas144 changed the title from "Feat/collector ratelimit backpressure" to "feat(collector): add rate limiting, drop counters and BPF backpressure" Jun 10, 2026
…Performance and overhead section

Signed-off-by: ariktadas144 <ariktadas144@gmail.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Jun 10, 2026
@ariktadas144

Author

@btwshivam All CI checks are now passing. Here's a summary of what was addressed:

Lint fixes

  • Added ok-check on atomic.Value type assertions in ratelimit.go to satisfy errcheck
  • Verified math/rand/v2 import is correct (resolves gosec false positive)
  • Ran gofmt across all changed files to fix trailing blank lines

Implementation summary

  • Token-bucket + probabilistic sampler in internal/collector/aggregator/ratelimit.go; p99 accuracy verified within ±5% at 80% sampling (0.02% actual error)
  • Allow() wired into all 5 collector record() paths (syscall, tcp, sched, fd, diskio)
  • kerno_drop_count PERCPU_ARRAY BPF map + KERNO_RECORD_DROP() at all 8 bpf_ringbuf_reserve() NULL-check sites across 6 programs
  • DropMapper interface + pollDropMaps() goroutine polls drop maps every 5s and increments kerno_ringbuf_drops_total{program,cpu}
  • KERNO_BACKPRESSURE() macro + guard at all 8 ringbuf reserve sites
  • Overhead control loop reading /proc/self/stat every 5s, updating kerno_overhead_pct
  • Config extended with CollectorRateLimits and CollectorSamplingConfig
  • README "Performance and overhead" section added

Happy to make any changes based on your feedback.

Development

Successfully merging this pull request may close these issues.

Collectors: rate limiting + drop counters under load (Phase 9.2.4)
