Skip to content

refactor(metrics): replace swap-and-drain delta collection with in-place iteration#3420

Merged
cijothomas merged 6 commits intoopen-telemetry:mainfrom
bryantbiggs:refactor/inplace-delta-collection
Apr 25, 2026
Merged

refactor(metrics): replace swap-and-drain delta collection with in-place iteration#3420
cijothomas merged 6 commits intoopen-telemetry:mainfrom
bryantbiggs:refactor/inplace-delta-collection

Conversation

@bryantbiggs
Copy link
Copy Markdown
Contributor

@bryantbiggs bryantbiggs commented Mar 14, 2026

Splits from #3392, per maintainer feedback. Refs #2328.

Summary

  • Replaces the two-HashMap swap-and-drain pattern in collect_and_reset with in-place iteration using TrackerEntry status tracking
  • TrackerEntry wraps each aggregator with has_been_updated (AtomicBool) to track whether measurements occurred since the last collection cycle
  • Removes has_no_attribute_value — the no_attribute_tracker.has_been_updated flag now serves the same purpose
  • collect_and_reset now iterates under a read lock (no write lock in steady state), exports only updated entries, and evicts stale entries under a write lock with TOCTOU re-check
  • Adds drain_and_reset for Observable/async instruments that need true staleness detection via map clearing
  • Eliminates O(n) write-lock acquisitions on the hot path per collection cycle when attribute sets are reused (the common case)

User-facing behavior change

Recovery from cardinality overflow now requires 2 collect cycles instead of 1 — the first cycle marks entries as stale (no updates), the second cycle evicts them. This trade-off enables the in-place iteration pattern that avoids write-lock contention in steady state.

Locking trade-offs (discussed with @cijothomas and @utpilla in #3392)

Scenario Old (swap-and-drain) New (in-place)
Steady state (reused attrs) O(n) write locks per cycle Read locks only
New/returning attrs during collect Write lock for insert Write lock for insert (competes with read lock held by collect)
Stale eviction Implicit (map cleared) Write lock proportional to stale count

Test plan

  • All 149 existing metrics tests pass
  • counter_aggregation_overflow_delta tests updated for two-phase eviction semantics
  • No changes to cumulative collection path (collect_readonly)
  • cargo fmt --all -- --check passes
  • cargo clippy -p opentelemetry_sdk --features metrics -- -D warnings passes

Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 94.85294% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.3%. Comparing base (8e95e16) to head (35f7e52).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-sdk/src/metrics/internal/mod.rs 93.9% 7 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3420   +/-   ##
=====================================
  Coverage   83.2%   83.3%           
=====================================
  Files        128     128           
  Lines      25086   25177   +91     
=====================================
+ Hits       20896   20978   +82     
- Misses      4190    4199    +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 14, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
Comment thread opentelemetry-sdk/CHANGELOG.md Outdated
@bryantbiggs bryantbiggs force-pushed the refactor/inplace-delta-collection branch from c055494 to 0e888e0 Compare March 14, 2026 21:13
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 14, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 15, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs
@bryantbiggs bryantbiggs force-pushed the refactor/inplace-delta-collection branch from 0e888e0 to 1f5fd4c Compare March 15, 2026 17:21
@bryantbiggs
Copy link
Copy Markdown
Contributor Author

CI integration test failure (counter_non_tokio) is pre-existing flaky test — UUID collision from parallel tests sharing the collector. Previous run on this branch passed. Our changes only touch opentelemetry-sdk/src/metrics/internal/mod.rs.

@bryantbiggs
Copy link
Copy Markdown
Contributor Author

Opened #3424 to fix the flaky integration test that showed up in CI here.

Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs Outdated
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs Outdated
…ace iteration

Replace the two-HashMap swap-and-drain pattern in collect_and_reset with
in-place iteration using TrackerEntry status tracking. TrackerEntry wraps
each aggregator with has_been_updated (AtomicBool) and bound_count
(AtomicUsize) fields.

collect_and_reset now:
- Iterates under a read lock (no write lock in steady state)
- Exports only entries updated since last collection
- Evicts stale unbound entries under a write lock with TOCTOU re-check

A new drain_and_reset method preserves the old map-clearing behavior for
Observable/async instruments that need staleness detection.

This eliminates O(n) write-lock acquisitions on the hot path per
collection cycle when attribute sets are reused (the common case).

Splits from open-telemetry#3392. Refs open-telemetry#2328.
Per review feedback, bound_count belongs in the bound instruments PR
since it's not used by the delta collection refactor alone.
Remove redundant has_no_attribute_value field from ValueMap — the
no_attribute_tracker.has_been_updated flag now serves the same purpose.

Rewrite changelog entry to focus on user-facing behavior: cardinality
overflow recovery now requires 2 collect cycles.
… HashMap keys

When a measurement arrives with attributes in non-sorted order, measure() inserts
two keys into the trackers HashMap for the same logical attribute set (original
order + sorted canonical order), both pointing to the same Arc<TrackerEntry>.

Adds a targeted test for collect_and_reset stale eviction confirming that both
keys are removed, leaving no zombie entries in the map.
@bryantbiggs bryantbiggs force-pushed the refactor/inplace-delta-collection branch from 485f6e6 to 4b47197 Compare April 24, 2026 00:28
The no_attribute_tracker correctly used Release/Acquire, but
per-attribute trackers used Relaxed. Since concurrent measure()
and collect_and_reset() can both hold RwLock read locks (which
don't synchronize with each other), the flag is the sole ordering
point for aggregator data visibility. Relaxed is insufficient on
weakly-ordered architectures (ARM).
Copy link
Copy Markdown
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@cijothomas cijothomas added this pull request to the merge queue Apr 25, 2026
Merged via the queue into open-telemetry:main with commit 1ed4755 Apr 25, 2026
28 checks passed
@bryantbiggs bryantbiggs deleted the refactor/inplace-delta-collection branch April 25, 2026 01:49
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 25, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 29, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 30, 2026
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants