Skip to content

fix(metrics): track unique cardinality label tuples#210

Open
ramnnn2006 wants to merge 1 commit into
optiqor:mainfrom
ramnnn2006:fix/metrics-cardinality-label-tuples
Open

fix(metrics): track unique cardinality label tuples#210
ramnnn2006 wants to merge 1 commit into
optiqor:mainfrom
ramnnn2006:fix/metrics-cardinality-label-tuples

Conversation

@ramnnn2006

Copy link
Copy Markdown
Contributor

What

this PR fixes a bug in cardinalityOK where seen[metric] was incremented on every event instead of tracking distinct label combinations. it also refactors the underlying data structure and updates the test suite to cover the correct semantics.

Why

Fixes #117

cardinalityOK was acting as a raw event counter, not a cardinality tracker. one busy (syscall, comm) pair could exhaust the 5000 limit in under a second and permanently silence all observations for that category, including series that were already being tracked fine. the bug meant the cap was effectively unreachable in any real workload.

How

the refactor changes seen from map[string]int to map[string]map[string]struct{} so each metric key maps to a bounded set of distinct label tuples instead of an event count. added a labelKey param to cardinalityOK and updated all 5 callers to compute their label values before the guard and pass them joined with \x00:

  • recordSyscall passes sc + "\x00" + comm
  • recordTCP passes src + "\x00" + dst + "\x00" + comm
  • recordDiskIO passes dev + "\x00" + op
  • recordSchedDelay and recordFD pass comm

known tuples always pass through even when the set is full. only genuinely new combinations get dropped once the cap is hit.

Testing

rewrote TestCardinalityLimit to test the actual cardinality verified event counting, which is the buggy behavior):

  • N unique keys fill the bucket, last call returns true

  • N+1 unique keys, last one is correctly rejected

  • known key passes even when the bucket is full (the exact cas

  • saturating one metric has no effect on a separate metric

  • go build ./... passes (GOOS=linux for eBPF-dependent packages)

  • go test ./... passes

  • go vet ./... passes (non-eBPF packages)

  • golangci-lint run ./... passes

  • Tested locally

  • N/A — pure refactor/bug fix, no eBPF programs touched

  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs

  • ./scripts/verify.sh passes

Checklist

  • PR title follows Conventional Commits (fix(metrics): track unique label tuples in cardinalityOK)
  • all commits are DCO-signed (git commit -s)
  • no unrelated changes pulled in
  • documentation updated where user-visible behavior changehanged)
  • added/updated tests for new code paths
  • if a new doctor rule, paired with a chaos scenario in `sable)

cardinalityOK was incrementing seen[metric] on every event, so a
single busy (syscall, comm) pair could exhaust the 5000 limit in
under a second and permanently silence all observations for that
category, including already-tracked series.

fix: change seen from map[string]int to map[string]map[string]struct{}
and add a labelKey param to cardinalityOK. each caller now passes the
actual label tuple (joined with \x00) so only genuinely new combinations
count toward the cap. known tuples always pass through even when full.

updated TestCardinalityLimit to test new semantics: unique keys fill
the bucket, known keys still pass when full, new keys are rejected,
different metrics stay isolated.

Signed-off-by: ramnnn2006 <phrkvvp@gmail.com>
@ramnnn2006 ramnnn2006 requested a review from btwshivam as a code owner June 9, 2026 12:55
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs Signed-off-by:git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added level:intermediate 50-200 lines or 3-5 files (auto-applied) testing Tests and test coverage labels Jun 9, 2026
@ramnnn2006

Copy link
Copy Markdown
Contributor Author

hi @btwshivam , could you review one
i think this PR fits type:bug , type:refactor , type:testing and quality:clean labels here since i feel it fixes a real bug, changes the data structure, and updates the tests.
also gssoc:approved is required for the contribution to count.

happy to keep working on more issues in the repo, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

level:intermediate 50-200 lines or 3-5 files (auto-applied) testing Tests and test coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(metrics): cardinality guard counts events, stalls metric for run

1 participant