feat: eBPF event deduplication before CEL rule evaluation by slashben · Pull Request #762 · kubescape/node-agent

slashben · 2026-03-29T13:31:11Z

Summary

Adds a lock-free, fixed-size dedup cache (pkg/dedupcache/) using packed atomic.Uint64 slots (48-bit xxhash key + 16-bit expiry bucket at 64ms granularity)
Computes per-event-type dedup keys for: open, network, dns, capabilities, http, ssh, symlink, hardlink, ptrace, syscall. Events that should never be deduplicated (exec, exit, fork, randomx, kmod, bpf, unshare, iouring) are excluded.
Integrates dedup check in EventHandlerFactory.ProcessEvent() — duplicate events skip RuleManager (CEL evaluation), ContainerProfileManager, and MalwareManager while still flowing to metrics, DNS manager, and network stream
Configurable via EventDedupConfig (enabled by default, 2^18 = 262K slots ≈ 2MB)
Adds Prometheus counter node_agent_dedup_events_total{event_type, result} for observability

Motivation

Node-Agent Performance Epic §1.3 — targets 10% of the 20% CPU reduction goal by deduplicating repetitive eBPF events (e.g., repeated file opens, DNS lookups) before they reach the expensive CEL rule evaluation engine.

Design

See design/node-agent-performance-epic/ebpf-event-deduplication.md for the full design document.

Benchmarks

DedupCache.CheckAndSet: ~6ns/op miss, ~0.8ns/op hit, 0 allocations
Key computation: 19–44ns/op depending on event type, 0 allocations
Race detector: clean under 100-goroutine concurrent hammer test

Test plan

Unit tests for cache correctness (insert, TTL expiry, slot collision, pack/unpack)
Concurrent hammer test (100 goroutines × 10,000 ops)
Deterministic key output tests and collision resistance checks
Benchmarks for cache and all key functions
Race detector clean
Config test updated with new defaults
Full go test ./... passes (pre-existing eBPF permission failures only)
Kind cluster benchmark: baseline vs dedup CPU comparison

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Configurable event deduplication to reduce redundant event processing; dedup metadata exposed on events.
- Metrics reporting for deduplicated vs passed events by event type.
Tests
- Unit tests and benchmarks added for dedup cache and key-generation logic.
Documentation
- Deduplication benchmark, results, and reproduction guide added.
Chores
- Performance benchmark tooling and CI workflow added; dependency manifest updated.

Add a lock-free dedup cache that prevents structurally identical eBPF events from reaching the expensive CEL rule engine. Events are keyed by type-specific fields (mntns + pid + relevant attributes) using xxhash, with per-event-type TTL windows (2-10s). The cache uses packed atomic uint64 slots (48-bit key + 16-bit expiry bucket) for zero-lock concurrent access from the 3,000-goroutine worker pool. Consumers opt in to skipping duplicates: RuleManager, ContainerProfileManager, and MalwareManager skip; Metrics, DNSManager, NetworkStream, and RulePolicy always process. No events are dropped — the Duplicate flag is advisory. Benchmarks: cache check ~7ns/op, key computation 24-52ns/op, 0 allocations. Implements design/node-agent-performance-epic/ebpf-event-deduplication.md §1.3 (targets 10% of the 20% CPU reduction goal). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>

coderabbitai · 2026-03-29T13:31:30Z

📝 Walkthrough

Walkthrough

Adds an optional event deduplication feature: a lock-free DedupCache with xxhash key generators, integration into event enrichment and handler pipeline (marks and skips duplicates), new metrics and config defaults, early-skip in rule reporting for duplicates, supporting tests/benchmarks, benchmark tooling, and go.mod dependency promotions.

Changes

Cohort / File(s)	Summary
Dependency Management `go.mod`	Promoted `github.com/cespare/xxhash/v2 v2.3.0` and `google.golang.org/protobuf v1.36.11` from indirect to direct `require`.
Configuration `pkg/config/config.go`, `pkg/config/config_test.go`	Added `EventDedupConfig` and `Config.EventDedup` with defaults (`enabled=true`, `slotsExponent=18`); tests updated.
Deduplication Core `pkg/dedupcache/dedup_cache.go`, `pkg/dedupcache/dedup_cache_test.go`	New lock-free `DedupCache` using atomic uint64 slots (48-bit key + 16-bit expiry bucket), `NewDedupCache`, `CheckAndSet`, TTL behavior, unit tests and benchmarks including concurrency and collision scenarios.
Deduplication Key Generators `pkg/dedupcache/keys.go`, `pkg/dedupcache/keys_test.go`	Added xxhash-based exported Compute* functions for Open, Network, DNS, Capabilities, HTTP, SSH, Symlink, Hardlink, Ptrace, Syscall; tests and benchmarks for determinism and input sensitivity.
Event Model `pkg/ebpf/events/enriched_event.go`	`EnrichedEvent` gains `Duplicate bool` and `DedupBucket uint16`.
Container Watcher Integration `pkg/containerwatcher/v2/container_watcher.go`	Sets `DedupBucket` during enrichment and conditionally constructs `DedupCache` based on config.
Event Handler Factory `pkg/containerwatcher/v2/event_handler_factory.go`	`NewEventHandlerFactory` accepts `dedupCache`; `ProcessEvent` computes per-event keys/TTL, calls `CheckAndSet`, sets `Duplicate`, reports dedup metrics, and skips specified handlers for duplicates.
Metrics Manager `pkg/metricsmanager/metrics_manager_interface.go`, `pkg/metricsmanager/metrics_manager_mock.go`, `pkg/metricsmanager/prometheus/prometheus.go`	Added `ReportDedupEvent(eventType, duplicate)` to interface and mock; Prometheus adds `node_agent_dedup_events_total` counter and implements reporting.
Rule Manager `pkg/rulemanager/rule_manager.go`	`ReportEnrichedEvent` returns immediately for `enrichedEvent.Duplicate`, skipping subsequent processing and exports.
Benchmarking Tooling `benchmark/compare-metrics.py`, `benchmark/dedup-bench.sh`, `benchmark/get_data_from_prometheus.py`, `benchmark/requirements.txt`, `benchmark/load-simulator/daemonset.yaml`	New end-to-end benchmark scripts and helpers: Prometheus data collector, bench orchestrator, comparator, requirements, and load-simulator manifest; outputs CSV/PNG/JSON for comparisons.
Docs `docs/dedup-testing/benchmark-results.md`	Added benchmark results and reproduction notes.
CI Workflow `.github/workflows/benchmark.yaml`, `.github/workflows/incluster-comp-pr-merged.yaml`	New `Performance Benchmark` workflow (reusable) and a job added to `incluster-comp-pr-merged.yaml` to call it for release-labeled PRs.

Sequence Diagram

sequenceDiagram
    participant EventSrc as Event Source
    participant CWatcher as Container Watcher
    participant EHandler as Event Handler Factory
    participant DCache as Dedup Cache
    participant Metrics as Metrics Manager
    participant RuleMgr as Rule Manager

    EventSrc->>CWatcher: Emit raw event
    CWatcher->>CWatcher: Enrich event, set DedupBucket
    CWatcher->>EHandler: enrichAndProcess(enrichedEvent)
    EHandler->>EHandler: Compute dedup key & TTL
    EHandler->>DCache: CheckAndSet(key, ttlBuckets, currentBucket)
    alt Duplicate
        DCache-->>EHandler: true
        EHandler->>EHandler: enrichedEvent.Duplicate = true
        EHandler->>Metrics: ReportDedupEvent(type, duplicate=true)
    else Not duplicate
        DCache-->>EHandler: false
        EHandler->>Metrics: ReportDedupEvent(type, duplicate=false)
    end
    EHandler->>RuleMgr: ReportEnrichedEvent(enrichedEvent)
    alt enrichedEvent.Duplicate == true
        RuleMgr-->>EHandler: return (skip processing)
    else
        RuleMgr-->>EHandler: continue (enrich, evaluate, export)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through hashes, bucketed time and byte,
I tucked duplicates out of sight, neat and tight.
Keys spun by xxhash, slots set to keep,
One event, one heartbeat — no echoed repeat. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.53% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the primary change: introducing eBPF event deduplication before CEL rule evaluation, which aligns directly with the changeset's core objective.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/ebpf-event-dedup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

pkg/metricsmanager/metrics_manager_mock.go (1)
67-67: Consider recording dedup results in the mock.

This keeps the interface satisfied, but it also makes the new metric path unassertable in unit tests. Tracking duplicate/passed counts here would keep dedup observability testable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/metrics_manager_mock.go` at line 67, The
MetricsMock.ReportDedupEvent stub should record dedup outcomes so tests can
assert on them: update the MetricsMock struct to include fields (e.g.,
DedupDuplicateCount, DedupPassedCount and/or a DedupEvents slice) and a
sync.Mutex for thread-safety, then implement ReportDedupEvent(eventType
utils.EventType, duplicate bool) to lock, increment the appropriate counter (or
append a record with eventType and duplicate), and unlock; expose simple
getter/accessor methods or make the fields public so unit tests can inspect
counts/events.
pkg/metricsmanager/prometheus/prometheus.go (1)
395-400: Avoid per-event label-map allocations here.

ReportDedupEvent runs on the dedup hot path, but With(prometheus.Labels{...}) creates a fresh map on every call. That adds avoidable allocation/GC pressure right where this PR is trying to save CPU. WithLabelValues(string(eventType), result) or a tiny cached lookup would keep this path much cheaper.
♻️ Minimal change
-	p.dedupEventCounter.With(prometheus.Labels{eventTypeLabel: string(eventType), "result": result}).Inc()
+	p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/prometheus/prometheus.go` around lines 395 - 400,
ReportDedupEvent currently allocates a new map on every hot-path call by using
p.dedupEventCounter.With(prometheus.Labels{...}), causing GC/alloc pressure;
change it to use the zero-allocation variant WithLabelValues(string(eventType),
result) or implement a small cached map/observer keyed by eventType/result to
avoid per-call map allocations - update the ReportDedupEvent function to call
p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc() (or use a
precomputed label handle cache) and keep references to dedupEventCounter and
eventTypeLabel names to locate the code.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/config/config.go`:
- Around line 28-32: The EventDedupConfig.SlotsExponent is used unchecked when
creating the dedup cache and can cause an overflow/panic in
dedupcache.NewDedupCache; update LoadConfig to validate SlotsExponent (e.g.,
ensure 0 <= SlotsExponent <= 31) before using it, returning an error if out of
range. Locate the struct EventDedupConfig and the code path in LoadConfig (and
any other places that call dedupcache.NewDedupCache) and add explicit bounds
checking for SlotsExponent, with a clear error message referencing the invalid
value, so callers cannot pass an exponent >31 (or your chosen safe max) into
dedupcache.NewDedupCache.

In `@pkg/containerwatcher/v2/container_watcher.go`:
- Line 473: The DedupBucket is being recomputed per event which can change
mid-batch; compute the bucket once for the entire queue batch and reuse it. Move
the calculation (uint16(time.Now().UnixNano() / (64 * 1_000_000))) out of the
per-event code and store it in a local variable (e.g., dedupBucket) at the start
of the batch processing routine, then assign enrichedEvent.DedupBucket =
dedupBucket for each event; ensure the calculation and assignment use the same
expression so behavior is unchanged except now stable across the batch.

In `@pkg/containerwatcher/v2/event_handler_factory.go`:
- Around line 193-196: The dedupSkipSet currently contains
containerProfileAdapter (and malwareManager), which causes ReportDroppedEvent
calls within containerProfileAdapter to be skipped for duplicate events; modify
the event flow so dropped-event reporting is always executed before
deduplication: extract the dropped-event handling path (calls to
ReportDroppedEvent and checks of enrichedEvent.HasDroppedEvents) out of
containerProfileAdapter and invoke it unconditionally prior to evaluating
factory.dedupSkipSet, or alternatively call the adapter's ReportDroppedEvent
logic explicitly before the duplicate-skip check; ensure the same change is
applied to the second occurrence referenced around the other dedupSkipSet
insertion so kernel-drop notifications are never suppressed.

In `@pkg/dedupcache/dedup_cache.go`:
- Around line 36-46: The expiry comparison in DedupCache.CheckAndSet uses a
wrapping uint16 bucket which breaks across the 65535→0 rollover; change the
expiry to a non-wrapping time base by extending the stored expiry field to a
larger integer (e.g., use a uint32/uint64 expiry inside the packed value),
update pack and unpack to encode/decode that wider expiry, ensure CheckAndSet
stores pack(key, uint64(currentBucket)+uint64(ttlBuckets)) and compares
storedExpiry > uint64(currentBucket) using the wider type, and add a regression
test that simulates buckets across the 65535→0 boundary to verify dedup
behavior; alternatively, if you prefer epoching, implement explicit rollover
detection in CheckAndSet that clears/epochs slots when currentBucket <
lastSeenBucket and add the same wrap test.
- Around line 16-21: NewDedupCache currently computes size := uint64(1) <<
slotsExponent without validating slotsExponent which can yield size==0 (when
slotsExponent>=64) or unreasonably large sizes (e.g., >30) causing panics or
huge allocations; add a bounds check at the start of NewDedupCache to reject
invalid exponents (e.g., require slotsExponent > 0 and <= 30 or another
project-safe maximum), and return an error or panic with a clear message instead
of constructing a zero/huge slice; ensure the check prevents creating a
DedupCache with zero-length slots and protects callers of CheckAndSet from
out-of-range accesses.

In `@pkg/dedupcache/keys.go`:
- Around line 41-48: ComputeNetworkKey (and the other key-builder helpers that
hash multiple variable-length strings such as the capability+syscall builder,
the host+path+rawQuery builder, and the oldPath+newPath builder) currently write
adjacent strings directly into the xxhash stream which can cause ambiguous
collisions; update each function (starting with ComputeNetworkKey) to prefix
every string component with its length (or another unambiguous delimiter) before
calling h.WriteString so the concatenation is unambiguous—e.g., call
writeUint32/Uint16/Uint64 for the string length then write the string itself for
each variable-length field; apply the same change to all similar helpers
mentioned in the comment so every multi-string hash is length-delimited.

---

Nitpick comments:
In `@pkg/metricsmanager/metrics_manager_mock.go`:
- Line 67: The MetricsMock.ReportDedupEvent stub should record dedup outcomes so
tests can assert on them: update the MetricsMock struct to include fields (e.g.,
DedupDuplicateCount, DedupPassedCount and/or a DedupEvents slice) and a
sync.Mutex for thread-safety, then implement ReportDedupEvent(eventType
utils.EventType, duplicate bool) to lock, increment the appropriate counter (or
append a record with eventType and duplicate), and unlock; expose simple
getter/accessor methods or make the fields public so unit tests can inspect
counts/events.

In `@pkg/metricsmanager/prometheus/prometheus.go`:
- Around line 395-400: ReportDedupEvent currently allocates a new map on every
hot-path call by using p.dedupEventCounter.With(prometheus.Labels{...}), causing
GC/alloc pressure; change it to use the zero-allocation variant
WithLabelValues(string(eventType), result) or implement a small cached
map/observer keyed by eventType/result to avoid per-call map allocations -
update the ReportDedupEvent function to call
p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc() (or use a
precomputed label handle cache) and keep references to dedupEventCounter and
eventTypeLabel names to locate the code.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d640beeb-a69e-4b2d-9fa9-a9145caec339

📥 Commits

Reviewing files that changed from the base of the PR and between 2b05e16 and e1cdc49.

📒 Files selected for processing (14)

go.mod
pkg/config/config.go
pkg/config/config_test.go
pkg/containerwatcher/v2/container_watcher.go
pkg/containerwatcher/v2/event_handler_factory.go
pkg/dedupcache/dedup_cache.go
pkg/dedupcache/dedup_cache_test.go
pkg/dedupcache/keys.go
pkg/dedupcache/keys_test.go
pkg/ebpf/events/enriched_event.go
pkg/metricsmanager/metrics_manager_interface.go
pkg/metricsmanager/metrics_manager_mock.go
pkg/metricsmanager/prometheus/prometheus.go
pkg/rulemanager/rule_manager.go

coderabbitai · 2026-03-29T13:46:45Z

pkg/config/config.go

+// EventDedupConfig controls eBPF event deduplication before CEL rule evaluation.
+type EventDedupConfig struct {
+	Enabled       bool  `mapstructure:"enabled"`
+	SlotsExponent uint8 `mapstructure:"slotsExponent"`
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -C3 'type EventDedupConfig|SlotsExponent|eventDedup::slotsExponent|NewDedupCache\(' pkg/config pkg/containerwatcher/v2 pkg/dedupcache rg -n -C4 'func NewDedupCache|slotsExponent|1\s*<<|make\(|panic|return .*error' pkg/dedupcache

Repository: kubescape/node-agent

Length of output: 6520

🏁 Script executed:

#!/bin/bash set -euo pipefail # Search for LoadConfig function and any validation around eventDedup rg -n -A 30 'func LoadConfig' pkg/config/config.go | head -60 # Search for any bounds checking or max exponent constant rg -n 'max.*exponent|exponent.*max|MAX.*EXPONENT|slotsExponent.*<|slotsExponent.*>|validate|bounds' pkg/ -i # Check if eventDedup appears anywhere else with potential constraints rg -n 'eventDedup' pkg/ -C 2

Repository: kubescape/node-agent

Length of output: 7611

🏁 Script executed:

#!/bin/bash set -euo pipefail # Get complete LoadConfig function from config.go sed -n '145,260p' pkg/config/config.go # Also check if there's a Validate method on Config struct rg -n 'func.*Config.*Validate|func \(c \*Config\)' pkg/config/ -A 5

Repository: kubescape/node-agent

Length of output: 6085

Add validation for SlotsExponent to prevent invalid cache allocations.

The slotsExponent field in EventDedupConfig is passed directly to dedupcache.NewDedupCache(...) without bounds checking. Since NewDedupCache performs an unconstrained bit-shift (size := uint64(1) << slotsExponent), an invalid exponent (e.g., > 63) can cause undefined behavior or runtime panic. Add explicit validation in LoadConfig to ensure slotsExponent stays within a safe range (e.g., 0–31, matching typical cache sizes up to 2 GB).

Also applies to: 79, 193–194

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/config/config.go` around lines 28 - 32, The EventDedupConfig.SlotsExponent is used unchecked when creating the dedup cache and can cause an overflow/panic in dedupcache.NewDedupCache; update LoadConfig to validate SlotsExponent (e.g., ensure 0 <= SlotsExponent <= 31) before using it, returning an error if out of range. Locate the struct EventDedupConfig and the code path in LoadConfig (and any other places that call dedupcache.NewDedupCache) and add explicit bounds checking for SlotsExponent, with a clear error message referencing the invalid value, so callers cannot pass an exponent >31 (or your chosen safe max) into dedupcache.NewDedupCache.

coderabbitai · 2026-03-29T13:46:45Z

pkg/containerwatcher/v2/container_watcher.go


 func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry) {
 	enrichedEvent := cw.eventEnricher.EnrichEvents(entry)
+	enrichedEvent.DedupBucket = uint16(time.Now().UnixNano() / (64 * 1_000_000))


⚠️ Potential issue | 🟠 Major

Keep a stable dedup bucket for the whole queue batch.

Line 473 recomputes the bucket per event even though EnrichedEvent.DedupBucket is documented as batch-cached. A large drain can cross a 64ms boundary mid-batch and let same-batch duplicates miss dedup.

🧩 One way to keep the bucket stable

func (cw *ContainerWatcher) processQueueBatch() { batchSize := cw.cfg.EventBatchSize processedCount := 0 + dedupBucket := uint16(time.Now().UnixNano() / (64 * 1_000_000)) for !cw.orderedEventQueue.Empty() && processedCount < batchSize { event, ok := cw.orderedEventQueue.PopEvent() if !ok { break } - cw.enrichAndProcess(event) + cw.enrichAndProcess(event, dedupBucket) processedCount++ } } -func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry) { +func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry, dedupBucket uint16) { enrichedEvent := cw.eventEnricher.EnrichEvents(entry) - enrichedEvent.DedupBucket = uint16(time.Now().UnixNano() / (64 * 1_000_000)) + enrichedEvent.DedupBucket = dedupBucket

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/containerwatcher/v2/container_watcher.go` at line 473, The DedupBucket is being recomputed per event which can change mid-batch; compute the bucket once for the entire queue batch and reuse it. Move the calculation (uint16(time.Now().UnixNano() / (64 * 1_000_000))) out of the per-event code and store it in a local variable (e.g., dedupBucket) at the start of the batch processing routine, then assign enrichedEvent.DedupBucket = dedupBucket for each event; ensure the calculation and assignment use the same expression so behavior is unchanged except now stable across the batch.

coderabbitai · 2026-03-29T13:46:45Z

pkg/containerwatcher/v2/event_handler_factory.go

+	// Populate dedupSkipSet: managers that skip processing when event is duplicate.
+	// RuleManager checks enrichedEvent.Duplicate internally.
+	factory.dedupSkipSet[containerProfileAdapter] = struct{}{}
+	factory.dedupSkipSet[malwareManager] = struct{}{}


⚠️ Potential issue | 🟠 Major

Don't let dedup suppress dropped-event reporting.

containerProfileAdapter does more than profile updates: at Lines 101-104 it also calls ReportDroppedEvent. Once the whole adapter is in dedupSkipSet, any duplicate event with HasDroppedEvents()==true skips that call too, so the profile manager can miss kernel-drop notifications and keep treating the profile as complete when it isn't. Please split the dropped-event path out of the adapter, or run that part before the duplicate skip.

Also applies to: 332-335

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/containerwatcher/v2/event_handler_factory.go` around lines 193 - 196, The dedupSkipSet currently contains containerProfileAdapter (and malwareManager), which causes ReportDroppedEvent calls within containerProfileAdapter to be skipped for duplicate events; modify the event flow so dropped-event reporting is always executed before deduplication: extract the dropped-event handling path (calls to ReportDroppedEvent and checks of enrichedEvent.HasDroppedEvents) out of containerProfileAdapter and invoke it unconditionally prior to evaluating factory.dedupSkipSet, or alternatively call the adapter's ReportDroppedEvent logic explicitly before the duplicate-skip check; ensure the same change is applied to the second occurrence referenced around the other dedupSkipSet insertion so kernel-drop notifications are never suppressed.

coderabbitai · 2026-03-29T13:46:45Z

pkg/dedupcache/dedup_cache.go

+func NewDedupCache(slotsExponent uint8) *DedupCache {
+	size := uint64(1) << slotsExponent
+	return &DedupCache{
+		slots: make([]atomic.Uint64, size),
+		mask:  size - 1,
+	}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat pkg/dedupcache/dedup_cache.go

Repository: kubescape/node-agent

Length of output: 1705

🏁 Script executed:

cat > /tmp/test_shift.go << 'EOF' package main import "fmt" func main() { // Test shift behavior with counts >= 64 for exp := uint8(62); exp < 66; exp++ { size := uint64(1) << exp mask := size - 1 fmt.Printf("exponent=%d, size=%d, mask=%d\n", exp, size, mask) } // Direct test with 64 and 65 fmt.Println("\nDirect shift tests:") fmt.Printf("uint64(1) << 64 = %d\n", uint64(1) << 64) fmt.Printf("uint64(1) << 65 = %d\n", uint64(1) << 65) // What happens with a zero-sized slice? fmt.Println("\nSlice access with zero size:") slice := make([]uint64, 0) fmt.Printf("len(slice) = %d, cap(slice) = %d\n", len(slice), cap(slice)) // Try to access it (this will panic) defer func() { if r := recover(); r != nil { fmt.Printf("Panic caught: %v\n", r) } }() _ = slice[0] fmt.Println("Access succeeded (unexpected)") } EOF go run /tmp/test_shift.go

Repository: kubescape/node-agent

Length of output: 300

🏁 Script executed:

cat > /tmp/test_shift.go << 'EOF' package main import "fmt" func main() { // Test shift behavior with runtime variable counts >= 64 for exp := uint8(62); exp < 66; exp++ { size := uint64(1) << exp mask := size - 1 fmt.Printf("exponent=%d, size=%d, mask=%d\n", exp, size, mask) } // What happens with a zero-sized slice? fmt.Println("\nSlice behavior:") size := uint64(0) slice := make([]uint64, size) fmt.Printf("len(slice) = %d, cap(slice) = %d\n", len(slice), cap(slice)) // Try to access it (this will panic) defer func() { if r := recover(); r != nil { fmt.Printf("Panic caught: %v\n", r) } }() _ = slice[0] fmt.Println("Access succeeded (unexpected)") } EOF go run /tmp/test_shift.go

Repository: kubescape/node-agent

Length of output: 399

🏁 Script executed:

rg "NewDedupCache" --type go -B 2 -A 2

Repository: kubescape/node-agent

Length of output: 2734

🏁 Script executed:

rg "EventDedup" --type go -B 5 -A 5

Repository: kubescape/node-agent

Length of output: 5619

🏁 Script executed:

rg "SlotsExponent" --type go -B 3 -A 3

Repository: kubescape/node-agent

Length of output: 1406

🏁 Script executed:

rg "EventDedup" --type go -A 20 | grep -A 20 "Validate\|validate"

Repository: kubescape/node-agent

Length of output: 46

🏁 Script executed:

rg "func.*Validate" pkg/config --type go -A 10

Repository: kubescape/node-agent

Length of output: 46

🏁 Script executed:

cat pkg/config/config.go | head -200

Repository: kubescape/node-agent

Length of output: 13131

🏁 Script executed:

cat pkg/config/config.go | tail -100

Repository: kubescape/node-agent

Length of output: 2894

Validate slotsExponent before constructing the cache; invalid values cause a panic.

NewDedupCache is exported, but uint64(1) << slotsExponent produces zero once slotsExponent reaches 64. This results in a zero-length slots slice and mask underflow to 18446744073709551615. At runtime, CheckAndSet will panic with "index out of range" when it tries to access c.slots[idx]. Additionally, exponents much below 64 but still unreasonable (e.g., > 30) create unbounded memory allocations. Add a bounds check in the constructor to reject invalid exponents.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/dedupcache/dedup_cache.go` around lines 16 - 21, NewDedupCache currently computes size := uint64(1) << slotsExponent without validating slotsExponent which can yield size==0 (when slotsExponent>=64) or unreasonably large sizes (e.g., >30) causing panics or huge allocations; add a bounds check at the start of NewDedupCache to reject invalid exponents (e.g., require slotsExponent > 0 and <= 30 or another project-safe maximum), and return an error or panic with a clear message instead of constructing a zero/huge slice; ensure the check prevents creating a DedupCache with zero-length slots and protects callers of CheckAndSet from out-of-range accesses.

coderabbitai · 2026-03-29T13:46:45Z

pkg/dedupcache/dedup_cache.go

+func (c *DedupCache) CheckAndSet(key uint64, ttlBuckets uint16, currentBucket uint16) bool {
+	idx := key & c.mask
+
+	stored := c.slots[idx].Load()
+	storedKey, storedExpiry := unpack(stored)
+	if storedKey == (key & 0xFFFFFFFFFFFF0000) && storedExpiry > currentBucket {
+		return true // duplicate
+	}
+
+	c.slots[idx].Store(pack(key, currentBucket+ttlBuckets))
+	return false


⚠️ Potential issue | 🔴 Critical

uint16 expiry rolls over every ~70 minutes.

The caller feeds CheckAndSet a uint16 64ms time bucket, so storedExpiry > currentBucket breaks as soon as the counter wraps at 65535 -> 0: fresh entries inserted just before rollover stop matching right after rollover, and stale slots can become "live" again on a later cycle. That means dedup will silently misclassify events after a long-running agent uptime. Please switch expiry tracking to a non-wrapping time base, or clear/epoch the cache on bucket rollover, and add a regression test around the wrap boundary.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/dedupcache/dedup_cache.go` around lines 36 - 46, The expiry comparison in DedupCache.CheckAndSet uses a wrapping uint16 bucket which breaks across the 65535→0 rollover; change the expiry to a non-wrapping time base by extending the stored expiry field to a larger integer (e.g., use a uint32/uint64 expiry inside the packed value), update pack and unpack to encode/decode that wider expiry, ensure CheckAndSet stores pack(key, uint64(currentBucket)+uint64(ttlBuckets)) and compares storedExpiry > uint64(currentBucket) using the wider type, and add a regression test that simulates buckets across the 65535→0 boundary to verify dedup behavior; alternatively, if you prefer epoching, implement explicit rollover detection in CheckAndSet that clears/epochs slots when currentBucket < lastSeenBucket and add the same wrap test.

coderabbitai · 2026-03-29T13:46:45Z

pkg/dedupcache/keys.go

+func ComputeNetworkKey(mntns uint64, pid uint32, dstAddr string, dstPort uint16, proto string) uint64 {
+	h := xxhash.New()
+	writeUint64(h, mntns)
+	writeUint32(h, pid)
+	h.WriteString(dstAddr)
+	writeUint16(h, dstPort)
+	h.WriteString(proto)
+	return h.Sum64()


⚠️ Potential issue | 🔴 Critical

Length-delimit the string components before hashing.

Several key builders concatenate adjacent variable-length strings directly into the hash stream (capability+syscall, host+path+rawQuery, oldPath+newPath, etc.). Different tuples can therefore serialize to the exact same bytes before xxhash is even applied—for example, host="fo", path="o/bar" and host="foo", path="/bar". Because duplicates bypass security handlers, this turns into false dedup for distinct events. Please prefix each string with its length, or another unambiguous delimiter, before writing it.

🔧 Safer serialization pattern

+func writeString(h *xxhash.Digest, s string) { + writeUint32(h, uint32(len(s))) + h.WriteString(s) +} + func ComputeHTTPKey(mntns uint64, pid uint32, direction string, method string, host string, path string, rawQuery string) uint64 { h := xxhash.New() writeUint64(h, mntns) writeUint32(h, pid) - h.WriteString(direction) - h.WriteString(method) - h.WriteString(host) - h.WriteString(path) - h.WriteString(rawQuery) + writeString(h, direction) + writeString(h, method) + writeString(h, host) + writeString(h, path) + writeString(h, rawQuery) return h.Sum64() }

Apply the same pattern to every helper that hashes more than one variable-length string.

Also applies to: 61-67, 71-80, 93-109

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/dedupcache/keys.go` around lines 41 - 48, ComputeNetworkKey (and the other key-builder helpers that hash multiple variable-length strings such as the capability+syscall builder, the host+path+rawQuery builder, and the oldPath+newPath builder) currently write adjacent strings directly into the xxhash stream which can cause ambiguous collisions; update each function (starting with ComputeNetworkKey) to prefix every string component with its length (or another unambiguous delimiter) before calling h.WriteString so the concatenation is unambiguous—e.g., call writeUint32/Uint16/Uint64 for the string length then write the string itself for each variable-length field; apply the same change to all similar helpers mentioned in the comment so every multi-string hash is length-delimited.

Benchmark comparing v0.3.71 (baseline) vs dedup branch on a kind cluster with 1000 open/s, 100 http/s load. Results show -16% avg CPU, -29% peak CPU, with 91-99% dedup ratios on high-frequency event types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/dedup-testing/benchmark-results.md`:
- Line 87: Replace the misspelled directory name "perfornamce" with the correct
spelling "performance" in the markdown line that references the directory (the
string "perfornamce" in the diff); update any other occurrences of the same typo
in the document (search for "perfornamce") so all directory references use
"performance".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a3b9631-0ff1-4184-a380-896c0145c38b

📥 Commits

Reviewing files that changed from the base of the PR and between e1cdc49 and 0de1bad.

⛔ Files ignored due to path filters (4)

docs/dedup-testing/after_cpu_usage.png is excluded by !**/*.png
docs/dedup-testing/after_memory_usage.png is excluded by !**/*.png
docs/dedup-testing/before_cpu_usage.png is excluded by !**/*.png
docs/dedup-testing/before_memory_usage.png is excluded by !**/*.png

📒 Files selected for processing (1)

docs/dedup-testing/benchmark-results.md

coderabbitai · 2026-03-30T04:31:39Z

docs/dedup-testing/benchmark-results.md

+## Reproducing
+
+```bash
+cd perfornamce


⚠️ Potential issue | 🟡 Minor

Fix typo in directory path.

The directory name contains a typo: "perfornamce" should be "performance".

📝 Proposed fix

-cd perfornamce +cd performance

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

cd perfornamce

cd performance

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/dedup-testing/benchmark-results.md` at line 87, Replace the misspelled directory name "perfornamce" with the correct spelling "performance" in the markdown line that references the directory (the string "perfornamce" in the diff); update any other occurrences of the same typo in the document (search for "perfornamce") so all directory references use "performance".

Add benchmark scripts and CI workflow that runs A/B performance benchmarks on Kind clusters, comparing baseline vs PR node-agent images. Posts results as PR comments and uploads artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

.github/workflows/incluster-comp-pr-merged.yaml (1)

339-345: Benchmark doesn't gate release promotion.

create-release-and-retag still does not depend on this new job, so the image can be retagged before the benchmark finishes or even if it fails. If this prerelease benchmark is supposed to validate the image before promotion, add benchmark to the downstream needs/condition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/incluster-comp-pr-merged.yaml around lines 339 - 345, The
benchmark job is not gating release promotion: ensure the downstream job that
performs promotion (create-release-and-retag) depends on the new benchmark job
and only runs when it succeeds; update the workflow so the
create-release-and-retag job's needs includes "benchmark" (or add a conditional
that checks success of needs.benchmark) so the image cannot be retagged before
or if the benchmark fails.

benchmark/dedup-bench.sh (1)

64-72: Use the pinned benchmark requirements here.

benchmark/requirements.txt already pins the Python stack, but local runs bypass it and install unpinned pandas matplotlib requests instead. That lets the same benchmark run against different dependency versions locally vs CI, and the existing .venv is never refreshed when the pins change.

Possible fix

     # Set up Python venv with deps
     local venv_dir="${SCRIPT_DIR}/.venv"
     if [[ ! -d "$venv_dir" ]]; then
-        log "Creating Python venv and installing dependencies..."
+        log "Creating Python venv..."
         python3 -m venv "$venv_dir"
-        "$venv_dir/bin/pip" install pandas matplotlib requests
     fi
+    log "Installing benchmark Python dependencies..."
+    "$venv_dir/bin/pip" install -r "$SCRIPT_DIR/requirements.txt"
     # Use the venv python for the rest of the script
     export PATH="${venv_dir}/bin:${PATH}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmark/dedup-bench.sh` around lines 64 - 72, The script creates a venv in
venv_dir but installs unpinned packages; change the install step to install from
the pinned requirements file (benchmark/requirements.txt) instead of hardcoded
packages and ensure pip runs with upgrade to pick up changed pins (replace the
"$venv_dir/bin/pip" install pandas matplotlib requests call with an install from
"${SCRIPT_DIR}/requirements.txt" using --upgrade/--requirement). Keep the rest
of the venv handling (export PATH with venv_dir) the same so local runs match CI
and pinned deps are applied when the venv is created or refreshed.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/benchmark.yaml:
- Around line 38-46: The workflow currently downloads kind from
https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64 and pipes Helm's installer
from https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
directly in the "Install Kind" and "Install Helm" steps; replace these with
pinned, verifiable installs: for the "Install Kind" step either use a maintained
setup action (e.g., actions/setup-kind@vX) or download the tagged release asset
and verify it against a trusted SHA256 checksum before installing; for the
"Install Helm" step use a pinned setup action (e.g., azure/setup-helm@vX or
another official action) or call the Helm install script from a specific release
tag (not main) and verify its checksum/signature, ensuring both steps log
failures and do not pipe unverified scripts into bash.
- Around line 113-119: The workflow uses peter-evans/create-or-update-comment@v4
with an unsupported input `comment-tag`, causing new comments each run; remove
`comment-tag` and implement a sticky-comment flow by first using an action like
peter-evans/find-comment to search for the existing comment via `body-includes`
(e.g., matching text from `report.md`), capture its `comment-id`, then pass that
`comment-id` into create-or-update-comment (or use `edit-mode`) along with
`body-path: report.md` and `issue-number` to update the existing comment;
alternatively, switch the job trigger to `pull_request_target` to avoid fork PR
permission issues as documented in the create-or-update-comment README.

In `@benchmark/compare-metrics.py`:
- Around line 37-42: compute_resource_stats currently filters node-agent rows
but computes avg/peak across pods (per-pod), undercounting total footprint on
multi-agent clusters; change it to first aggregate Value across node-agent pods
per timestamp (e.g., group the filtered na DataFrame by the timestamp column,
such as "Timestamp", and sum the "Value" for each timestamp) and then return
{"avg": summed_values.mean(), "peak": summed_values.max()} while keeping the
initial filter (na = df[df["Pod"].str.contains("node-agent", na=False)]) and the
same return shape.

In `@benchmark/dedup-bench.sh`:
- Around line 338-345: The loop over queries writes Prometheus responses without
checking HTTP or Prometheus-level errors; update the block that performs
requests.get (inside the for name, query in queries.items() loop) to call
resp.raise_for_status() after the request and then parse resp.json() into a
variable and verify it contains "status" == "success" before writing the file to
output_dir; if either the HTTP status or the Prometheus "status" is an error,
print a clear error including name and response content and fail the benchmark
run (e.g., raise an exception or call sys.exit(1)) instead of silently
continuing.
- Around line 267-269: The remove_load_simulator() function currently deletes
the load-simulator namespace with --wait=false which returns immediately and
causes the after phase to recreate the namespace while it may still be
Terminating; change remove_load_simulator() to wait for the namespace to be
fully deleted before returning (e.g., use kubectl wait --for=delete
namespace/load-simulator or poll kubectl get namespace load-simulator until it
no longer exists) so subsequent apply steps (the recreate/apply logic) cannot
run against a terminating namespace.

In `@benchmark/load-simulator/daemonset.yaml`:
- Around line 15-16: The load-simulator container image is not pinned (image
field under the container named "load-simulator"), causing implicit :latest
resolution and potential benchmark drift; update the image reference in the
daemonset to an immutable reference by replacing
quay.io/benarmosec/load-simulator-f866d4884d08e4a0d1907b094978b48d with either a
specific versioned tag (e.g., :v1.2.3) or, preferably, a content digest
(quay.io/benarmosec/load-simulator@sha256:<digest>) so deployments from
benchmark/dedup-bench.sh are reproducible.

---

Nitpick comments:
In @.github/workflows/incluster-comp-pr-merged.yaml:
- Around line 339-345: The benchmark job is not gating release promotion: ensure
the downstream job that performs promotion (create-release-and-retag) depends on
the new benchmark job and only runs when it succeeds; update the workflow so the
create-release-and-retag job's needs includes "benchmark" (or add a conditional
that checks success of needs.benchmark) so the image cannot be retagged before
or if the benchmark fails.

In `@benchmark/dedup-bench.sh`:
- Around line 64-72: The script creates a venv in venv_dir but installs unpinned
packages; change the install step to install from the pinned requirements file
(benchmark/requirements.txt) instead of hardcoded packages and ensure pip runs
with upgrade to pick up changed pins (replace the "$venv_dir/bin/pip" install
pandas matplotlib requests call with an install from
"${SCRIPT_DIR}/requirements.txt" using --upgrade/--requirement). Keep the rest
of the venv handling (export PATH with venv_dir) the same so local runs match CI
and pinned deps are applied when the venv is created or refreshed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a349e24d-9f4e-433d-8cd5-a515c1a65c9b

📥 Commits

Reviewing files that changed from the base of the PR and between 0de1bad and c9e3d43.

📒 Files selected for processing (7)

.github/workflows/benchmark.yaml
.github/workflows/incluster-comp-pr-merged.yaml
benchmark/compare-metrics.py
benchmark/dedup-bench.sh
benchmark/get_data_from_prometheus.py
benchmark/load-simulator/daemonset.yaml
benchmark/requirements.txt

coderabbitai · 2026-04-05T04:59:53Z

.github/workflows/benchmark.yaml

+      - name: Install Kind
+        run: |
+          curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64
+          chmod +x ./kind
+          sudo mv ./kind /usr/local/bin/kind
+
+      - name: Install Helm
+        run: |
+          curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash


⚠️ Potential issue | 🟠 Major

Pin and verify the bootstrap downloads.

kind is fetched as a raw binary and Helm is installed by piping a script from Helm's main branch straight into bash. That makes this job non-reproducible and opens a supply-chain hole on a workflow that has PR write permissions. Prefer versioned, checksum-verified artifacts or pinned setup actions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/benchmark.yaml around lines 38 - 46, The workflow currently downloads kind from https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64 and pipes Helm's installer from https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 directly in the "Install Kind" and "Install Helm" steps; replace these with pinned, verifiable installs: for the "Install Kind" step either use a maintained setup action (e.g., actions/setup-kind@vX) or download the tagged release asset and verify it against a trusted SHA256 checksum before installing; for the "Install Helm" step use a pinned setup action (e.g., azure/setup-helm@vX or another official action) or call the Helm install script from a specific release tag (not main) and verify its checksum/signature, ensuring both steps log failures and do not pipe unverified scripts into bash.

coderabbitai · 2026-04-05T04:59:53Z

.github/workflows/benchmark.yaml

+      - name: Comment on PR
+        uses: peter-evans/create-or-update-comment@v4
+        if: github.event_name == 'pull_request' && always()
+        with:
+          issue-number: ${{ github.event.pull_request.number }}
+          body-path: report.md
+          comment-tag: benchmark-results


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "action.yml" -o -name "action.yaml" | head -20

Repository: kubescape/node-agent

Length of output: 46

🏁 Script executed:

# Check the benchmark.yaml file cat -n .github/workflows/benchmark.yaml | sed -n '110,125p'

Repository: kubescape/node-agent

Length of output: 755

🏁 Script executed:

# Fetch the action.yml from peter-evans/create-or-update-comment@v4 curl -s https://raw.githubusercontent.com/peter-evans/create-or-update-comment/v4/action.yml | head -100

Repository: kubescape/node-agent

Length of output: 1533

🏁 Script executed:

# Search for information about fork PRs and GITHUB_TOKEN curl -s https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions 2>/dev/null | grep -i fork || echo "Need web search"

Repository: kubescape/node-agent

Length of output: 79

🏁 Script executed:

# Also check if find-comment action exists curl -s https://raw.githubusercontent.com/peter-evans/find-comment/main/action.yml 2>/dev/null | head -50

Repository: kubescape/node-agent

Length of output: 1370

🏁 Script executed:

# Verify the fork PR + GITHUB_TOKEN restriction curl -s https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows | grep -A 5 "fork" | head -20 || echo "Checking GitHub Actions documentation about fork restrictions..."

Repository: kubescape/node-agent

Length of output: 46

🏁 Script executed:

# Check if there are any recent issues or documentation about fork PRs and GITHUB_TOKEN curl -s https://api.github.com/repos/peter-evans/create-or-update-comment/issues?state=all&per_page=5 2>/dev/null | head -50

Repository: kubescape/node-agent

Length of output: 46

🏁 Script executed:

# Check the README or docs in the action repo curl -s https://raw.githubusercontent.com/peter-evans/create-or-update-comment/v4/README.md | grep -A 10 -B 5 "fork\|permission\|GITHUB_TOKEN" | head -40

Repository: kubescape/node-agent

Length of output: 2913

Remove the unsupported comment-tag input and implement a sticky comment strategy.

The create-or-update-comment@v4 action does not have a comment-tag input—it supports issue-number, comment-id, body/body-path, and edit-mode. The comment-tag parameter will be silently ignored, causing the action to create a new comment on each run instead of updating a sticky one.

Additionally, this action cannot post comments in public repositories when triggered by fork PRs; attempting this will result in Resource not accessible by integration. Use the find-comment action to locate an existing comment by body-includes and pass its comment-id to create-or-update-comment, or switch to the pull_request_target event as documented in the action README.

🧰 Tools

🪛 actionlint (1.7.12)

[error] 119-119: input "comment-tag" is not defined in action "peter-evans/create-or-update-comment@v4". available inputs are "append-separator", "body", "body-file", "body-path", "comment-id", "edit-mode", "issue-number", "reactions", "reactions-edit-mode", "repository", "token"

(action)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/benchmark.yaml around lines 113 - 119, The workflow uses peter-evans/create-or-update-comment@v4 with an unsupported input `comment-tag`, causing new comments each run; remove `comment-tag` and implement a sticky-comment flow by first using an action like peter-evans/find-comment to search for the existing comment via `body-includes` (e.g., matching text from `report.md`), capture its `comment-id`, then pass that `comment-id` into create-or-update-comment (or use `edit-mode`) along with `body-path: report.md` and `issue-number` to update the existing comment; alternatively, switch the job trigger to `pull_request_target` to avoid fork PR permission issues as documented in the create-or-update-comment README.

coderabbitai · 2026-04-05T04:59:53Z

benchmark/compare-metrics.py

+def compute_resource_stats(df: pd.DataFrame) -> dict:
+    """Filter to node-agent pods and compute avg/peak."""
+    na = df[df["Pod"].str.contains("node-agent", na=False)]
+    if na.empty:
+        return {"avg": 0.0, "peak": 0.0}
+    return {"avg": na["Value"].mean(), "peak": na["Value"].max()}


⚠️ Potential issue | 🟠 Major

Aggregate node-agent pods before computing avg/peak.

On clusters with more than one node-agent pod, this reports average per-pod usage and busiest-single-pod peak, not total node-agent footprint. Since benchmark/dedup-bench.sh creates a multi-node kind cluster at Line 97-108, the PR comment can understate real resource cost. Sum Value across node-agent pods per timestamp first, or relabel the table as per-pod.

Possible fix

def compute_resource_stats(df: pd.DataFrame) -> dict: """Filter to node-agent pods and compute avg/peak.""" na = df[df["Pod"].str.contains("node-agent", na=False)] if na.empty: return {"avg": 0.0, "peak": 0.0} - return {"avg": na["Value"].mean(), "peak": na["Value"].max()} + total_by_time = na.groupby("Time", as_index=False)["Value"].sum() + return { + "avg": total_by_time["Value"].mean(), + "peak": total_by_time["Value"].max(), + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def compute_resource_stats(df: pd.DataFrame) -> dict:

"""Filter to node-agent pods and compute avg/peak."""

na = df[df["Pod"].str.contains("node-agent", na=False)]

if na.empty:

return {"avg": 0.0, "peak": 0.0}

return {"avg": na["Value"].mean(), "peak": na["Value"].max()}

def compute_resource_stats(df: pd.DataFrame) -> dict:

"""Filter to node-agent pods and compute avg/peak."""

na = df[df["Pod"].str.contains("node-agent", na=False)]

if na.empty:

return {"avg": 0.0, "peak": 0.0}

total_by_time = na.groupby("Time", as_index=False)["Value"].sum()

return {

"avg": total_by_time["Value"].mean(),

"peak": total_by_time["Value"].max(),

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/compare-metrics.py` around lines 37 - 42, compute_resource_stats currently filters node-agent rows but computes avg/peak across pods (per-pod), undercounting total footprint on multi-agent clusters; change it to first aggregate Value across node-agent pods per timestamp (e.g., group the filtered na DataFrame by the timestamp column, such as "Timestamp", and sum the "Value" for each timestamp) and then return {"avg": summed_values.mean(), "peak": summed_values.max()} while keeping the initial filter (na = df[df["Pod"].str.contains("node-agent", na=False)]) and the same return shape.

coderabbitai · 2026-04-05T04:59:53Z

benchmark/dedup-bench.sh

+remove_load_simulator() {
+    log "Removing load simulator..."
+    kubectl delete namespace load-simulator --wait=false 2>/dev/null || true


⚠️ Potential issue | 🟠 Major

Don't reuse the namespace before deletion completes.

kubectl delete namespace ... --wait=false returns immediately, but the after phase recreates load-simulator almost right away. If the namespace is still Terminating, the ConfigMap/DaemonSet apply becomes flaky.

Minimal fix

remove_load_simulator() { log "Removing load simulator..." - kubectl delete namespace load-simulator --wait=false 2>/dev/null || true + kubectl delete namespace load-simulator --wait=true --timeout=300s 2>/dev/null || true }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

remove_load_simulator() {

log "Removing load simulator..."

kubectl delete namespace load-simulator --wait=false 2>/dev/null || true

remove_load_simulator() {

log "Removing load simulator..."

kubectl delete namespace load-simulator --wait=true --timeout=300s 2>/dev/null || true

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/dedup-bench.sh` around lines 267 - 269, The remove_load_simulator() function currently deletes the load-simulator namespace with --wait=false which returns immediately and causes the after phase to recreate the namespace while it may still be Terminating; change remove_load_simulator() to wait for the namespace to be fully deleted before returning (e.g., use kubectl wait --for=delete namespace/load-simulator or poll kubectl get namespace load-simulator until it no longer exists) so subsequent apply steps (the recreate/apply logic) cannot run against a terminating namespace.

coderabbitai · 2026-04-05T04:59:53Z

benchmark/dedup-bench.sh

+for name, query in queries.items():
+    try:
+        resp = requests.get(f'{url}/api/v1/query', params={'query': query, 'time': end.isoformat()}, timeout=30)
+        with open(os.path.join(output_dir, f'{name}.json'), 'w') as f:
+            json.dump(resp.json(), f, indent=2)
+        print(f'Collected {name}')
+    except Exception as e:
+        print(f'Warning: {name}: {e}')


⚠️ Potential issue | 🟠 Major

Fail the benchmark when Prometheus queries fail.

This block writes whatever JSON comes back and prints Collected ... without checking resp.raise_for_status() or Prometheus "status". A bad query or 5xx will quietly degrade into “no data available” later instead of failing the run.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/dedup-bench.sh` around lines 338 - 345, The loop over queries writes Prometheus responses without checking HTTP or Prometheus-level errors; update the block that performs requests.get (inside the for name, query in queries.items() loop) to call resp.raise_for_status() after the request and then parse resp.json() into a variable and verify it contains "status" == "success" before writing the file to output_dir; if either the HTTP status or the Prometheus "status" is an error, print a clear error including name and response content and fail the benchmark run (e.g., raise an exception or call sys.exit(1)) instead of silently continuing.

coderabbitai · 2026-04-05T04:59:53Z

benchmark/load-simulator/daemonset.yaml

+        - name: load-simulator
+          image: quay.io/benarmosec/load-simulator-f866d4884d08e4a0d1907b094978b48d


⚠️ Potential issue | 🟠 Major

Pin the load-simulator image immutably.

This reference has neither a tag nor a digest, so Kubernetes will resolve :latest implicitly. Because benchmark/dedup-bench.sh deploys this workload for every before/after run, the generator can drift over time and invalidate benchmark comparisons. Use an immutable digest, or at least an explicit versioned tag.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/load-simulator/daemonset.yaml` around lines 15 - 16, The load-simulator container image is not pinned (image field under the container named "load-simulator"), causing implicit :latest resolution and potential benchmark drift; update the image reference in the daemonset to an immutable reference by replacing quay.io/benarmosec/load-simulator-f866d4884d08e4a0d1907b094978b48d with either a specific versioned tag (e.g., :v1.2.3) or, preferably, a content digest (quay.io/benarmosec/load-simulator@sha256:<digest>) so deployments from benchmark/dedup-bench.sh are reproducible.

github-actions · 2026-04-05T05:21:50Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.162	0.158	-2.5%
Peak CPU (cores)	0.172	0.167	-3.0%
Avg Memory (MiB)	404.116	313.546	-22.4%
Peak Memory (MiB)	409.445	316.445	-22.7%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	0	0	N/A
dns	1200	0	0.0%
hardlink	6000	0	0.0%
http	1700	119458	98.6%
network	900	77998	98.9%
open	34901	621204	94.7%
symlink	6000	0	0.0%
syscall	982	1903	66.0%

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	9
dns_counter	1440	1402
exec_counter	7205	7016
network_counter	94746	92271
open_counter	788615	770123
syscall_counter	3642	3480

coderabbitai bot reviewed Mar 29, 2026

View reviewed changes

matthyx added this to KS PRs tracking Mar 29, 2026

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

matthyx moved this to WIP in KS PRs tracking Mar 31, 2026

coderabbitai bot reviewed Apr 5, 2026

View reviewed changes

		- name: load-simulator
		image: quay.io/benarmosec/load-simulator-f866d4884d08e4a0d1907b094978b48d

Conversation

slashben commented Mar 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Design

Benchmarks

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 5, 2026

Performance Benchmark Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

slashben commented Mar 29, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 29, 2026 •

edited

Loading