
feat(metrics): add bound instruments behind experimental feature flag#3392

Closed
bryantbiggs wants to merge 20 commits into open-telemetry:main from bryantbiggs:bound-instruments

Conversation

@bryantbiggs
Contributor

@bryantbiggs bryantbiggs commented Feb 25, 2026

Refs #1374, #2328

Summary

Adds BoundCounter and BoundHistogram to the public API behind the experimental_metrics_bound_instruments feature flag. These types cache the resolved aggregator reference for a fixed attribute set, allowing subsequent measurements to bypass per-call sort, dedup, hash, and HashMap lookup entirely.

let counter = meter.u64_counter("requests").build();
let bound = counter.bind(&[KeyValue::new("method", "GET")]);
bound.add(1); // ~1.9ns — no attribute lookup

Architecture

TrackerEntry

pub(crate) struct TrackerEntry<A: Aggregator> {
    pub(crate) aggregator: A,
    pub(crate) has_been_updated: AtomicBool,
    pub(crate) bound_count: AtomicUsize,
}

has_been_updated tracks whether any measurement occurred since the last collect. bound_count tracks how many live BoundCounter/BoundHistogram handles reference this entry.

Delta collection (collect_and_reset)

In-place iteration under a read lock (no write lock in steady state):

  1. Export only entries where has_been_updated was true, then reset the flag
  2. Entries with bound_count > 0 are never evicted — they persist across idle cycles but produce no export when there are no updates
  3. Stale unbound entries (has_been_updated == false && bound_count == 0) are evicted under a write lock with a TOCTOU re-check
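The three steps above can be sketched with a std-only model (assumed simplifications: attributes are plain `String` keys and the aggregator is a single `AtomicU64` sum; `ValueMap` and `TrackerEntry` here are illustrative stand-ins, not the PR's actual types):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

struct TrackerEntry {
    sum: AtomicU64,
    has_been_updated: AtomicBool,
    bound_count: AtomicUsize,
}

struct ValueMap {
    trackers: RwLock<HashMap<String, Arc<TrackerEntry>>>,
}

impl ValueMap {
    fn collect_and_reset(&self) -> Vec<(String, u64)> {
        let mut exported = Vec::new();
        let mut stale = Vec::new();
        {
            // Steady state: only a read lock is held during iteration.
            let trackers = self.trackers.read().unwrap();
            for (attrs, entry) in trackers.iter() {
                if entry.has_been_updated.swap(false, Ordering::AcqRel) {
                    exported.push((attrs.clone(), entry.sum.swap(0, Ordering::AcqRel)));
                } else if entry.bound_count.load(Ordering::Acquire) == 0 {
                    stale.push(attrs.clone()); // idle AND unbound: eviction candidate
                }
            }
        }
        if !stale.is_empty() {
            // Write lock only when there is actually something to evict.
            let mut trackers = self.trackers.write().unwrap();
            for attrs in stale {
                // TOCTOU re-check: a concurrent measure() or bind() may have raced us.
                let still_stale = trackers.get(&attrs).map_or(false, |e| {
                    !e.has_been_updated.load(Ordering::Acquire)
                        && e.bound_count.load(Ordering::Acquire) == 0
                });
                if still_stale {
                    trackers.remove(&attrs);
                }
            }
        }
        exported
    }
}

fn entry(sum: u64, updated: bool, bound: usize) -> Arc<TrackerEntry> {
    Arc::new(TrackerEntry {
        sum: AtomicU64::new(sum),
        has_been_updated: AtomicBool::new(updated),
        bound_count: AtomicUsize::new(bound),
    })
}

fn demo() -> (Vec<(String, u64)>, usize) {
    let vm = ValueMap { trackers: RwLock::new(HashMap::new()) };
    {
        let mut t = vm.trackers.write().unwrap();
        t.insert("hot".into(), entry(5, true, 0));         // updated: exported
        t.insert("idle-bound".into(), entry(0, false, 1)); // idle but bound: kept
        t.insert("stale".into(), entry(0, false, 0));      // idle, unbound: evicted
    }
    let mut exported = vm.collect_and_reset();
    exported.sort();
    let remaining = vm.trackers.read().unwrap().len();
    (exported, remaining)
}

fn main() {
    let (exported, remaining) = demo();
    println!("exported={exported:?} remaining={remaining}");
}
```

The demo exports only the updated entry, keeps the idle-but-bound one, and evicts the stale one.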

Cardinality overflow handling

ValueMap::bind() returns Option<Arc<TrackerEntry>>, yielding None when at the cardinality limit. Bound handle types use a Direct/Fallback enum:

  • Direct: normal case — dedicated tracker, single-digit ns, no map lookup per call
  • Fallback: at overflow — delegates every call to the unbound Measure::call() path

This gives correct overflow attribution (same behavior as unbound at overflow) with automatic recovery when delta collect frees cardinality space. An otel_debug! log fires when fallback occurs.

Aggregator types that don't support direct binding (ExponentialHistogram, LastValue, PrecomputedSum) also use the fallback path gracefully instead of panicking.
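The Direct/Fallback dispatch can be sketched with a std-only model (assumptions: `Unbound`, `Tracker`, and the free `bind()` function are illustrative stand-ins, and the real overflow sentinel is an attribute set, not the literal string `"overflow"`):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

struct Tracker(AtomicU64);

struct Unbound {
    map: Mutex<HashMap<String, Arc<Tracker>>>,
    limit: usize,
}

impl Unbound {
    // Unbound path: full map lookup per call, overflow sentinel at the limit.
    fn call(&self, attrs: &str, v: u64) {
        let mut map = self.map.lock().unwrap();
        let key = if map.contains_key(attrs) || map.len() < self.limit {
            attrs.to_string()
        } else {
            "overflow".to_string()
        };
        map.entry(key)
            .or_insert_with(|| Arc::new(Tracker(AtomicU64::new(0))))
            .0
            .fetch_add(v, Ordering::Relaxed);
    }
}

enum BoundHandle {
    Direct(Arc<Tracker>),                              // dedicated tracker
    Fallback { unbound: Arc<Unbound>, attrs: String }, // at overflow
}

fn bind(unbound: &Arc<Unbound>, attrs: &str) -> BoundHandle {
    let mut map = unbound.map.lock().unwrap();
    if map.contains_key(attrs) || map.len() < unbound.limit {
        let t = map
            .entry(attrs.to_string())
            .or_insert_with(|| Arc::new(Tracker(AtomicU64::new(0))))
            .clone();
        BoundHandle::Direct(t)
    } else {
        BoundHandle::Fallback { unbound: Arc::clone(unbound), attrs: attrs.to_string() }
    }
}

impl BoundHandle {
    fn add(&self, v: u64) {
        match self {
            BoundHandle::Direct(t) => { t.0.fetch_add(v, Ordering::Relaxed); }
            BoundHandle::Fallback { unbound, attrs } => unbound.call(attrs, v),
        }
    }
}

fn demo() -> (u64, u64) {
    let unbound = Arc::new(Unbound { map: Mutex::new(HashMap::new()), limit: 2 });
    let a = bind(&unbound, "a"); // Direct: under the limit
    unbound.call("b", 1);        // second slot now taken
    let c = bind(&unbound, "c"); // Fallback: at the limit
    a.add(3);
    c.add(5);                    // routed through the overflow sentinel
    let map = unbound.map.lock().unwrap();
    let get = |k: &str| map.get(k).map_or(0, |t| t.0.load(Ordering::Relaxed));
    (get("a"), get("overflow"))
}

fn main() {
    let (direct, overflow) = demo();
    println!("direct={direct} overflow={overflow}");
}
```

If delta collect later evicts stale entries, the Fallback handle's next `call()` finds free space and creates a proper entry, which is the automatic-recovery property described above.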

Feature flag

All public API, traits, noop impls, and internal plumbing are gated behind experimental_metrics_bound_instruments. The testing feature enables it by default.

Benchmark Results

Apple M4 Max, 16 cores (12 performance + 4 efficiency), macOS 15.4:

| Benchmark | Time | vs Unbound |
| --- | --- | --- |
| Counter Unbound | 53.2 ns | |
| Counter Bound | 1.87 ns | ~28x faster |
| Histogram Unbound | 58.6 ns | |
| Histogram Bound | 6.57 ns | ~8.9x faster |

EC2 results across 9 instance configurations (c7i/c7a/c7g × large/xlarge/4xlarge) show 5-28x counter speedup and 5-9x histogram speedup depending on architecture. Full results in the expandable sections below.

EC2 Counter Results

| Instance | Unbound | Bound | Speedup |
| --- | --- | --- | --- |
| c7i.large | 86.2 ns | 7.47 ns | 11.5x |
| c7i.xlarge | 88.9 ns | 7.62 ns | 11.7x |
| c7i.4xlarge | 90.1 ns | 7.65 ns | 11.8x |
| c7a.large | 77.4 ns | 2.71 ns | 28.6x |
| c7a.xlarge | 75.0 ns | 2.71 ns | 27.7x |
| c7a.4xlarge | 75.0 ns | 2.71 ns | 27.7x |
| c7g.large | 111.8 ns | 6.87 ns | 16.3x |
| c7g.xlarge | 111.4 ns | 6.84 ns | 16.3x |
| c7g.4xlarge | 112.9 ns | 6.73 ns | 16.8x |

EC2 Histogram Results

| Instance | Unbound | Bound | Speedup |
| --- | --- | --- | --- |
| c7i.large | 97.1 ns | 17.1 ns | 5.7x |
| c7i.xlarge | 99.0 ns | 17.9 ns | 5.5x |
| c7i.4xlarge | 98.5 ns | 18.4 ns | 5.4x |
| c7a.large | 92.5 ns | 13.7 ns | 6.8x |
| c7a.xlarge | 92.4 ns | 13.7 ns | 6.7x |
| c7a.4xlarge | 92.5 ns | 13.7 ns | 6.8x |
| c7g.large | 133.2 ns | 27.7 ns | 4.8x |
| c7g.xlarge | 133.5 ns | 27.6 ns | 4.8x |
| c7g.4xlarge | 133.8 ns | 27.2 ns | 4.9x |
Previous experiments (EXP-002, EXP-003) and why bound instruments are the right approach

EXP-002: Micro-optimizations on existing architecture

  • Changes: Relaxed atomic ordering, stale entry eviction during delta collect
  • Result: <2% improvement — the HashMap bottleneck is untouched

EXP-003: Sharded HashMap with pre-hashing

  • Changes: 16 Mutex-guarded shards, pre-computed foldhash, PassthroughHasher
  • Result: +10-33% regression across all architectures. HashedAttributes::from_raw() on every measure() call is more expensive than the existing two-lookup pattern.

Key insight

Profiling shows 78% of CPU in HashMap key hashing (52%) and key comparison (26%). The only way to meaningfully improve measure() throughput is to skip the lookup entirely. Bound instruments achieve this by moving the cost to bind-time (once) rather than measure-time (every call).
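A toy illustration of that trade, assuming nothing beyond std: both loops record the same totals, but only the first pays the hash-and-compare on every call, while the second resolves the `Arc` once and then only touches the atomic:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

fn demo(iters: u64) -> (u64, u64) {
    let mut map: HashMap<Vec<(&str, &str)>, Arc<AtomicU64>> = HashMap::new();
    let attrs = vec![("method", "GET"), ("status", "200")];
    map.insert(attrs.clone(), Arc::new(AtomicU64::new(0)));

    // Unbound: hash + compare the full attribute key on every measurement.
    for _ in 0..iters {
        map.get(&attrs).unwrap().fetch_add(1, Ordering::Relaxed);
    }
    let unbound_total = map.get(&attrs).unwrap().swap(0, Ordering::Relaxed);

    // Bound: pay the lookup once at "bind" time, then one atomic add per call.
    let bound = Arc::clone(map.get(&attrs).unwrap());
    for _ in 0..iters {
        bound.fetch_add(1, Ordering::Relaxed);
    }
    (unbound_total, bound.load(Ordering::Relaxed))
}

fn main() {
    let (unbound, bound) = demo(10_000);
    println!("unbound={unbound} bound={bound}");
}
```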

Test Coverage

15 tests covering:

  • Cumulative and delta temporality for both Counter and Histogram
  • Bound + unbound sharing the same data point (same attribute set)
  • Idle delta cycles (no update = no export, but handle persists)
  • Cardinality overflow fallback to unbound path (Counter and Histogram)
  • Recovery after delta eviction frees cardinality space
  • Overflow fallback handles working across multiple delta cycles
  • Drop enabling eviction of stale entries
  • Multiple bound handles sharing the same tracker (ref-counting)
  • Binding with empty attributes

Impact on existing code

  • Zero regression on unbound measurement path
  • No changes to collect path semantics for unbound instruments
  • All existing tests pass
  • Additive public API behind feature flag — no breaking changes

Related

@bryantbiggs bryantbiggs force-pushed the bound-instruments branch 2 times, most recently from 9ce05fc to 1bb4efe on February 25, 2026 21:36
@codecov

codecov Bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 89.47368% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.4%. Comparing base (9650783) to head (21542a9).
⚠️ Report is 14 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| opentelemetry-sdk/src/metrics/mod.rs | 93.7% | 27 Missing ⚠️ |
| opentelemetry-sdk/src/metrics/internal/mod.rs | 90.8% | 11 Missing ⚠️ |
| ...pentelemetry-sdk/src/metrics/internal/aggregate.rs | 0.0% | 6 Missing ⚠️ |
| opentelemetry-sdk/src/metrics/noop.rs | 0.0% | 5 Missing ⚠️ |
| opentelemetry/src/metrics/noop.rs | 0.0% | 5 Missing ⚠️ |
| opentelemetry/src/metrics/instruments/histogram.rs | 60.0% | 4 Missing ⚠️ |
| ...-sdk/src/metrics/internal/exponential_histogram.rs | 62.5% | 3 Missing ⚠️ |
| ...entelemetry-sdk/src/metrics/internal/last_value.rs | 25.0% | 3 Missing ⚠️ |
| ...emetry-sdk/src/metrics/internal/precomputed_sum.rs | 40.0% | 3 Missing ⚠️ |
| opentelemetry/src/metrics/instruments/counter.rs | 66.6% | 3 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3392     +/-   ##
=======================================
+ Coverage   83.2%   83.4%   +0.1%     
=======================================
  Files        128     128             
  Lines      25045   25688    +643     
=======================================
+ Hits       20858   21428    +570     
- Misses      4187    4260     +73     


attrs,
tracker.aggregator.clone_and_reset(&self.config),
));
tracker.evicted.store(true, Ordering::Release);
Member


thinking out loud:
do we need to evict bound instruments after delta collect()? Can we define the behavior as: bound instruments remain forever, unless the user explicitly calls unbind()?
If there are no new updates in a collect() cycle, then we need to not export anything, but the value map keeps the tracker.

Contributor Author


Addressed in 936993f — replaced the swap-and-drain collect_and_reset with in-place delta collection that supports bound instrument persistence.

What changed:

TrackerEntry now has two status fields instead of evicted:

  • has_been_updated: AtomicBool — set on every measure()/bound.add() call, checked and reset during collect. Only entries updated since last collect are exported. No updates in a cycle = no export, but the tracker stays in the map.
  • bound_count: AtomicUsize — incremented on bind(), decremented on Drop of the bound handle. Entries with bound_count > 0 are never evicted, even if they had no updates in a cycle.

Delta collect behavior:

  1. Iterate trackers under a read lock (no write lock needed in steady state)
  2. Export only entries where has_been_updated was true, then reset the flag
  3. Collect stale candidates: entries where has_been_updated=false AND bound_count=0 AND not overflow sentinel
  4. If stale candidates exist, acquire write lock, re-check has_been_updated under the lock (TOCTOU guard against concurrent measure()), then evict

Bound handles no longer check an evicted flag or have a fallback path — they always go direct to the aggregator. The Drop impl decrements bound_count so the entry becomes eligible for eviction after all handles are dropped.
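The ref-counting lifecycle can be sketched as follows (std-only; `BoundHandle::bind` and the bare `TrackerEntry` are simplified stand-ins for the PR's handle types):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct TrackerEntry {
    bound_count: AtomicUsize,
}

struct BoundHandle {
    entry: Arc<TrackerEntry>,
}

impl BoundHandle {
    // bind() bumps bound_count so collect never evicts this entry.
    fn bind(entry: &Arc<TrackerEntry>) -> Self {
        entry.bound_count.fetch_add(1, Ordering::AcqRel);
        BoundHandle { entry: Arc::clone(entry) }
    }
}

impl Drop for BoundHandle {
    fn drop(&mut self) {
        // Once this reaches zero the collect path may evict the entry.
        self.entry.bound_count.fetch_sub(1, Ordering::AcqRel);
    }
}

fn demo() -> (usize, usize) {
    let entry = Arc::new(TrackerEntry { bound_count: AtomicUsize::new(0) });
    let h1 = BoundHandle::bind(&entry);
    let h2 = BoundHandle::bind(&entry);
    let while_bound = entry.bound_count.load(Ordering::Acquire);
    drop(h1);
    drop(h2);
    let after_drop = entry.bound_count.load(Ordering::Acquire);
    (while_bound, after_drop)
}

fn main() {
    let (while_bound, after_drop) = demo();
    println!("while_bound={while_bound} after_drop={after_drop}");
}
```

Multiple handles sharing one tracker keep it alive until the last one is dropped, matching the ref-counting test described in the PR.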

The drain_and_reset path is preserved for async/Observable instruments (PrecomputedSum) that need true staleness detection via map clearing.

This design was informed by a previous experiment branch (feat/delta-collect-optimization, part of #2328) which validated the has_been_updated + in-place iteration pattern on EC2 benchmarks. The bound_count mechanism extends it to prevent eviction of bound entries.

self.count.fetch_add(1, Ordering::SeqCst);
new_tracker
} else {
// Over cardinality limit — bind to the overflow tracker
Member


this is tricky... when the user called bind(), we may have been over the limit, and it gets bound to overflow. But after a delta collect, unused points are reclaimed, yet the bound instrument still goes to overflow?
Or does the delta collect set evicted to true, so it falls back to the slow path?

We may have to see if bound instruments should be exempt from the cardinality limit? (That is not good and would violate the spec too.) Open for ideas.

Contributor Author

@bryantbiggs bryantbiggs Feb 26, 2026


Addressed in 7b531c4 with an overflow fallback to the unbound path.

Solution: bind() returns Option, None at overflow

ValueMap::bind() now returns Option<Arc<TrackerEntry>>, yielding None when at/over the cardinality limit. The bound handle types (BoundSumHandle, BoundHistogramHandle) use a Direct/Fallback enum:

  • Direct: normal case — dedicated tracker, single-digit ns, no map lookup on each call
  • Fallback: at overflow — delegates every .call() to the unbound Measure::call() path

This gives us:

  1. Correct overflow attribution — same behavior as unbound measurements at overflow, no data merging between distinct bound handles sharing an overflow tracker
  2. Automatic recovery — when delta collect evicts stale entries and frees cardinality space, measure.call() creates a proper entry for those attrs on the next invocation
  3. No bound_count inflation on the overflow tracker (was a problem with the previous approach of binding directly to it)

Trade-off: a Fallback handle is no faster than unbound (full HashMap lookup each time). But if you're at the cardinality limit when calling bind(), you're already in degraded mode — the user still gets correct data, just not the fast-path benefit. The guidance remains: bind early during initialization before cardinality fills up.

Benchmarks confirm zero regression on the Direct hot path from the enum match:

| Benchmark | Before | After |
| --- | --- | --- |
| Counter Bound | 1.81 ns | 1.83 ns |
| Histogram Bound | 6.57 ns | 6.57 ns |
| Counter Unbound | 51.40 ns | 51.75 ns |
| Histogram Unbound | 55.91 ns | 57.35 ns |

Added 4 new tests covering overflow + delta interaction scenarios.

@cijothomas
Member

cijothomas commented Feb 26, 2026

Looks very promising! Left some comments in code, and response to the open questions below. Will need to close some edge cases (cardinality cap, evict etc.). Also more test cases on this would give higher confidence!

Should BoundCounter/BoundHistogram auto-rebind after delta eviction, or is explicit bind() sufficient?
I left a comment in the code a bit ago; I think we should not evict the bound instrument after each collect(). It should be still kept bound. But don't export if no updates in a collect() cycle.

Should bound instruments be behind a feature flag initially?
Yes

Is there interest in bound UpDownCounter or Gauge support?
Not now. Just Counter alone is sufficient for proof of concept and to gather feedback. This PR already makes it happen for histogram, which is awesome.

Naming: bind() vs with_attributes() vs something else?

bind() sounds good. This was the original name for this idea when it was part of OTel spec in 2020, but got removed for scope control.

Comment thread opentelemetry-sdk/benches/bound_instruments.rs
@bryantbiggs
Contributor Author

Benchmark Snapshot — Before Delta Collect Refactor

Baseline numbers on the current bound-instruments branch (swap-and-drain collect_and_reset with evicted flag) before porting the in-place delta collection design. This serves as the "before" watermark.

Hardware: Apple M4 Max (MacBook Pro), macOS 15.4

| Benchmark | Time |
| --- | --- |
| Counter_Unbound_Delta | 48.10 ns |
| Counter_Bound_Delta | 1.70 ns |
| Histogram_Unbound_Delta | 54.01 ns |
| Histogram_Bound_Delta | 9.90 ns |
| Counter_Bound_Multithread/2 | 20.97 µs |
| Counter_Bound_Multithread/4 | 36.03 µs |
| Counter_Bound_Multithread/8 | 68.19 µs |

Speedups: Counter 28.3x, Histogram 5.5x (bound vs unbound)

@bryantbiggs
Contributor Author

Benchmark Snapshot — After Delta Collect Refactor (936993f)

Numbers after replacing swap-and-drain collect_and_reset with in-place delta collection using has_been_updated flag and bound_count persistence.

Hardware: Apple M4 Max (MacBook Pro), macOS 15.4

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| Counter_Unbound_Delta | 48.10 ns | 51.40 ns | +6.9% (noise) |
| Counter_Bound_Delta | 1.70 ns | 1.81 ns | +6.5% (noise) |
| Histogram_Unbound_Delta | 54.01 ns | 55.91 ns | +3.5% (noise) |
| Histogram_Bound_Delta | 9.90 ns | 6.57 ns | -33.6% |
| Counter_Bound_Multithread/2 | 20.97 µs | 20.68 µs | -1.4% (noise) |
| Counter_Bound_Multithread/4 | 36.03 µs | 35.12 µs | -2.5% |
| Counter_Bound_Multithread/8 | 68.19 µs | 66.14 µs | -3.0% |

Key observations:

  • Histogram bound path improved 34% (9.90ns → 6.57ns) — eliminating the evicted flag check and fallback branch removes a conditional on the hot path
  • Counter bound and unbound paths show +3-7% which is run-to-run variance on a laptop (sub-nanosecond absolute differences). EC2 benchmarks would show this as noise.
  • Multithread bound counters show slight improvement from removing the evicted flag load

Speedups vs unbound (after refactor): Counter 28.4x, Histogram 8.5x (up from 5.5x before)

@bryantbiggs
Contributor Author

Benchmark Snapshot — After Overflow Fallback (7b531c4)

Numbers after adding the Direct/Fallback enum to bound handles for cardinality overflow handling. bind() now returns an Option; None at overflow falls back to the unbound Measure::call() path.

Hardware: Apple M4 Max (MacBook Pro), macOS 15.4

| Benchmark | Before (936993f) | After (7b531c4) | Change |
| --- | --- | --- | --- |
| Counter_Unbound_Delta | 51.40 ns | 51.75 ns | +0.7% (noise) |
| Counter_Bound_Delta | 1.81 ns | 1.83 ns | +1.1% (noise) |
| Histogram_Unbound_Delta | 55.91 ns | 57.35 ns | +2.6% (noise) |
| Histogram_Bound_Delta | 6.57 ns | 6.57 ns | 0.0% |
| Counter_Bound_Multithread/2 | 20.68 µs | 21.05 µs | +1.8% (noise) |
| Counter_Bound_Multithread/4 | 35.12 µs | 34.18 µs | -2.7% (noise) |
| Counter_Bound_Multithread/8 | 66.14 µs | 63.63 µs | -3.8% |

Key observations:

  • Zero regression on the Direct hot path from the enum match — compiler optimizes the single-variant branch prediction perfectly
  • All differences are within run-to-run noise (sub-nanosecond absolute)
  • The Fallback path (not benchmarked here) performs identically to unbound since it delegates to Measure::call()

Speedups vs unbound (unchanged): Counter 28.3x, Histogram 8.7x

@bryantbiggs
Contributor Author

Known limitation: bind(&[]) vs counter.add(_, &[]) produce separate data points

bind(&[]) goes through the hashmap path via ValueMap::bind_sorted(vec![]), while counter.add(_, &[]) uses the dedicated no_attribute_tracker fast path. This means they don't share aggregation — a user mixing both patterns on empty attributes would see two data points instead of one.

This is an edge case (binding with empty attributes is unusual — the whole point of bind() is to pre-resolve a fixed attribute set), and fixing it would require either routing bind(&[]) to the no_attribute_tracker (with bound_count support added to it) or routing empty-attr measure() calls through the hashmap. Either change touches the hot path for all instruments, not just bound ones.

Should we: (a) document this as a known limitation for now, (b) fix it in this PR, or (c) track as a follow-up issue?

@bryantbiggs bryantbiggs changed the title feat: add BoundCounter and BoundHistogram for pre-bound attribute sets feat(metrics): add bound instruments behind experimental feature flag Feb 26, 2026
@bryantbiggs bryantbiggs force-pushed the bound-instruments branch 3 times, most recently from 235ab3d to 825dc53 on February 26, 2026 16:28
@bryantbiggs bryantbiggs marked this pull request as ready for review February 26, 2026 17:03
@bryantbiggs bryantbiggs requested a review from a team as a code owner February 26, 2026 17:03
Comment thread opentelemetry-sdk/src/metrics/mod.rs
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs Outdated
let mut stale_entries: Vec<Arc<TrackerEntry<A>>> = Vec::new();

{
let Ok(trackers) = self.trackers.read() else {
Member


the previous design of maintaining 2 trackers allowed us to require a lock only for a single-instruction swap. But now, we need the read() lock held throughout the iteration through the entries, and then the write() lock throughout the stale-entry cleanup.

Given we are not clearing entries for delta aggressively in each collect, this might be less impactful than it appears. The typical pattern is to re-use the majority of trackers, so they don't get evicted, and hence the update/hot path won't need to get the write lock. New attribute sets (or ones that went unused and are making a comeback) would slightly suffer as they compete for the write lock with collect, and the time they wait is directly proportional to the overall count of trackers.

This may be a reasonable tradeoff to accept, as the previous situation was definitely bad due to force-clearing after every collect(), which forced a write lock for every attribute set after every collect().

Need some more time to think through this fully, but feel free to comment if my observations are correct.

Contributor Author


Your observations are correct. Here's how the locking trade-off breaks down:

Old design (swap-and-drain): Write lock on trackers held only for mem::swap (pointer swap — near instant), then iteration happens on a second map. But every delta collect clears all entries, so the next measure() for every attribute set must take a write lock to re-insert — O(n) write lock acquisitions on the hot path per collection cycle.

New design (in-place iteration): Read lock held throughout iteration, then write lock only if there are stale entries to evict.

Steady state (the common case): Attribute sets are reused across cycles → zero stale entries → no write lock during collect at all. Both measure() and collect_and_reset hold only read locks and run concurrently. This is strictly better than the old design where every attr set had to re-acquire a write lock after every collect.

Edge case — new/returning attrs during collect: As you noted, a measure() that needs to insert a new entry competes for the write lock with the read lock held by collect. Wait time is proportional to total tracker count. But this only affects never-before-seen attribute sets arriving during the iteration window. Once inserted, they're back on the read-lock fast path.

Stale eviction write lock: Duration proportional to the number of stale entries (the retain() call), not total tracker count. In steady state this never fires.

Bound instruments: Completely unaffected — they hold Arc<TrackerEntry> directly, bypassing the map and all locking entirely.

Atomics safety during concurrent read-lock iteration: has_been_updated.swap(false) and the aggregator read/reset use atomics internally (AtomicI64/AtomicU64 for sums, Mutex for histograms). If a concurrent measure() sets has_been_updated back to true after the swap, that measurement is captured in the next cycle — not lost.

Net: the new design trades a brief read-lock hold during collect for eliminating O(n) write-lock acquisitions on the hot path every cycle. The only regression is the narrow window where brand-new attribute sets must wait for collect's read lock to release before inserting.
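The "captured in the next cycle, not lost" property can be modeled in a few lines of std-only code (the `Entry`/`measure`/`collect` names are illustrative, and this single-threaded sketch ignores the histogram Mutex case):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

struct Entry {
    sum: AtomicU64,
    has_been_updated: AtomicBool,
}

fn measure(e: &Entry, v: u64) {
    e.sum.fetch_add(v, Ordering::Relaxed);
    // Re-arms the flag even if collect just swapped it to false.
    e.has_been_updated.store(true, Ordering::Release);
}

fn collect(e: &Entry) -> Option<u64> {
    if e.has_been_updated.swap(false, Ordering::AcqRel) {
        Some(e.sum.swap(0, Ordering::AcqRel)) // export and reset
    } else {
        None // idle cycle: no export
    }
}

fn demo() -> (Option<u64>, Option<u64>, Option<u64>) {
    let e = Entry { sum: AtomicU64::new(0), has_been_updated: AtomicBool::new(false) };
    measure(&e, 7);
    let first = collect(&e);
    measure(&e, 3); // a measurement landing right after the swap
    let second = collect(&e); // ...is exported by the following cycle
    let idle = collect(&e);
    (first, second, idle)
}

fn main() {
    println!("{:?}", demo());
}
```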

Member


Thanks for confirming! My conclusion is same - this is a trade-off, and I think this is a reasonable balance.

Briefly chatted with @utpilla on this as well (who has most experience with all perf things we have in this area!).

Member

@cijothomas cijothomas left a comment


Repeating my earlier sentiment - the bound instruments design is solid and the benchmark results are impressive!

After reviewing the full diff, I think we should split this into two PRs:
PR 1: In-place delta collection (avoids draining the HashMap)
PR 2: Bound instruments (builds on PR 1)

The main reason: the delta collection refactor changes locking semantics and aggregator reset behavior for every existing user (it's not behind the feature flag, and we don't really want to guard it behind one either), so it deserves focused review on its own.

Splitting also makes each PR easier to review and means any issues found in the collect path refactor won't block the bound instruments design discussion.

What do you think?

bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 14, 2026
…ace iteration

Replace the two-HashMap swap-and-drain pattern in collect_and_reset with
in-place iteration using TrackerEntry status tracking. TrackerEntry wraps
each aggregator with has_been_updated (AtomicBool) and bound_count
(AtomicUsize) fields.

collect_and_reset now:
- Iterates under a read lock (no write lock in steady state)
- Exports only entries updated since last collection
- Evicts stale unbound entries under a write lock with TOCTOU re-check

A new drain_and_reset method preserves the old map-clearing behavior for
Observable/async instruments that need staleness detection.

This eliminates O(n) write-lock acquisitions on the hot path per
collection cycle when attribute sets are reused (the common case).

Splits from open-telemetry#3392. Refs open-telemetry#2328.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 14, 2026
Add BoundCounter and BoundHistogram types that cache resolved aggregator
references for a fixed attribute set. Created via Counter::bind() and
Histogram::bind(), bound instruments bypass per-call attribute lookup
for significant performance improvements (~28x for counters, ~9x for
histograms).

Architecture:
- TrackerEntry.bound_count tracks live handles, preventing eviction
- Direct/Fallback enum handles cardinality overflow gracefully
- Unsupported aggregators (ExpoHistogram, LastValue, PrecomputedSum)
  fall back to unbound path instead of panicking

All public API, traits, and internal plumbing are gated behind the
experimental_metrics_bound_instruments feature flag. Includes 15 tests
covering cumulative/delta temporality, overflow fallback, recovery
after eviction, bound+unbound sharing, idle cycles, drop semantics,
and empty attributes.

Splits from open-telemetry#3392. Refs open-telemetry#1374.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 15, 2026
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Mar 15, 2026
- Add explicit Send + Sync bounds on bind() return type for Rust 1.75
- Add TrackerMap type alias to satisfy clippy type_complexity
- Remove needless borrow in benchmark
Add integration tests for BoundCounter and BoundHistogram covering
cumulative, delta, and mixed bound/unbound measurement paths.
Add changelog entries to both opentelemetry and opentelemetry-sdk.
…rsistence

Replace swap-and-drain delta collection with in-place iteration using a
has_been_updated flag on TrackerEntry. This eliminates the evicted flag
and fallback path from bound instruments, and removes the second
trackers_for_collect HashMap entirely.

Key changes:

TrackerEntry now has two status fields:
- has_been_updated: AtomicBool — set on every measure()/bound.add() call,
  checked and reset during collect. Only entries updated since last collect
  are exported. This implements the "no updates = no export" semantic.
- bound_count: AtomicUsize — incremented on bind(), decremented on Drop
  of the bound handle. Entries with bound_count > 0 are never evicted
  from the map, even if they had no updates in a cycle.

collect_and_reset() (sync instruments, delta temporality):
- Iterates trackers under a read lock (no write lock needed in steady state)
- Exports only entries where has_been_updated was true, resets the flag
- Evicts stale unbound entries (has_been_updated=false, bound_count=0)
- Preserves bound entries and overflow sentinel indefinitely
- TOCTOU re-check under write lock prevents race with concurrent measure()

drain_and_reset() (async instruments like PrecomputedSum):
- Preserved for Observable instruments that need true staleness detection
- Uses std::mem::take instead of the old two-map swap pattern

Bound instrument handles (BoundSumHandle, BoundHistogramHandle):
- Always use the fast path — no eviction check, no fallback
- Set has_been_updated on every call so the tracker is exported at collect
- Decrement bound_count on Drop so the entry can be evicted after unbind

This design was informed by the delta-collect-optimization branch (EXP-002)
which validated the has_been_updated + in-place iteration pattern. The
bound_count mechanism extends it to prevent eviction of bound entries,
addressing the maintainer's feedback that bound instruments should persist
until explicitly unbound.

Part of open-telemetry#1374, open-telemetry#2328
…unbound path

When bind() is called at/over the cardinality limit, return None from
ValueMap::bind() instead of binding to the shared overflow tracker.
BoundSumHandle and BoundHistogramHandle now use a Direct/Fallback enum:
- Direct: fast path with dedicated tracker (normal case)
- Fallback: delegates to unbound Measure::call() path, ensuring correct
  overflow attribution and automatic recovery when delta collect frees space

Also fixes pre-existing counter_aggregation_overflow_delta test failure
caused by collect_and_reset's two-phase eviction (stale entries need a
second collect cycle to be evicted, unlike the old drain_and_reset).

Adds 4 new tests for overflow and delta interaction scenarios.
Add `experimental_metrics_bound_instruments` feature flag to both
opentelemetry and opentelemetry-sdk crates. All bound instrument
public API (Counter::bind, Histogram::bind, BoundCounter,
BoundHistogram, BoundSyncInstrument) and internal plumbing
(BoundMeasure, Measure::bind, ValueMap::bind) are gated behind
this flag.

Follows existing naming convention (experimental_metrics_*).
The testing feature enables it by default for test coverage.
… coverage

Add benchmark results as code comments in benches/bound_instruments.rs
with Apple M4 Max hardware info, following the project convention.

Add 7 new tests for bound instruments covering:
- Histogram delta temporality with reset verification
- Histogram bound+unbound sharing same data point
- Histogram idle delta cycle (no update = no export)
- Histogram overflow fallback to unbound path
- Counter drop enabling eviction of stale entries
- Multiple bound handles sharing same tracker (ref-counting)
- Counter with empty attributes via bind()
…ported aggregators

Aggregator types that don't support direct binding (ExponentialHistogram,
LastValue, PrecomputedSum) now gracefully fall back to the unbound
Measure::call() path instead of panicking at runtime. This prevents
crashes when bind() is called on instruments with view-remapped
aggregations (e.g., a Counter mapped to Base2ExponentialHistogram).

Also adds otel_debug! log when bind() hits cardinality overflow, and
#[must_use] on BoundCounter/BoundHistogram to catch accidental drops.
…d instruments

- Use stable rustfmt import ordering for cfg-gated imports
- Gate otel_debug import behind experimental_metrics_bound_instruments feature
- Fix clippy iter_kv_map warning in trace_asserter (pre-existing)
- Fix clippy collapsible-if and redundant-closure warnings
- Update CHANGELOG.md entries for both opentelemetry and opentelemetry-sdk
TrackerEntry::has_been_updated now starts as false (no recording yet)
instead of true. The measure() paths explicitly set it to true after
the first update(). This prevents phantom exports if a tracker is
ever created without an immediate recording (e.g., via bind()).

Also adds a TODO capturing maintainer feedback about permanent
overflow-fallback for bound instruments created at cardinality limit.
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 24, 2026
…ace iteration

Replace the two-HashMap swap-and-drain pattern in collect_and_reset with
in-place iteration using TrackerEntry status tracking. TrackerEntry wraps
each aggregator with has_been_updated (AtomicBool) and bound_count
(AtomicUsize) fields.

collect_and_reset now:
- Iterates under a read lock (no write lock in steady state)
- Exports only entries updated since last collection
- Evicts stale unbound entries under a write lock with TOCTOU re-check

A new drain_and_reset method preserves the old map-clearing behavior for
Observable/async instruments that need staleness detection.

This eliminates O(n) write-lock acquisitions on the hot path per
collection cycle when attribute sets are reused (the common case).

Splits from open-telemetry#3392. Refs open-telemetry#2328.
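The collection flow described in this commit can be sketched with std types. This is a simplified mock, not the SDK's code: attribute sets are plain `String` keys, the aggregator is a single counter, and `collect_and_reset` demonstrates the read-lock scan, delta reset, and write-lock eviction with TOCTOU re-check.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

pub struct TrackerEntry {
    pub sum: AtomicU64,                // stand-in for the real aggregator
    pub has_been_updated: AtomicBool,  // any measurement since last collect?
    pub bound_count: AtomicUsize,      // live bound handles on this entry
}

pub fn collect_and_reset(
    map: &RwLock<HashMap<String, Arc<TrackerEntry>>>,
) -> Vec<(String, u64)> {
    let mut exported = Vec::new();
    let mut stale = Vec::new();
    {
        // Steady state: read lock only.
        let guard = map.read().unwrap();
        for (attrs, entry) in guard.iter() {
            if entry.has_been_updated.swap(false, Ordering::AcqRel) {
                // Updated since last collect: export and reset (delta).
                exported.push((attrs.clone(), entry.sum.swap(0, Ordering::AcqRel)));
            } else if entry.bound_count.load(Ordering::Acquire) == 0 {
                // Idle and unbound: candidate for eviction.
                stale.push(attrs.clone());
            }
        }
    }
    if !stale.is_empty() {
        // Eviction takes the write lock, re-checking each candidate in
        // case a measurement or bind() raced with the scan (TOCTOU).
        let mut guard = map.write().unwrap();
        for attrs in stale {
            let evict = match guard.get(&attrs) {
                Some(e) => {
                    !e.has_been_updated.load(Ordering::Acquire)
                        && e.bound_count.load(Ordering::Acquire) == 0
                }
                None => false,
            };
            if evict {
                guard.remove(&attrs);
            }
        }
    }
    exported
}
```

Note how an idle bound entry (bound_count > 0, has_been_updated false) is neither exported nor evicted, matching the behavior described for idle delta cycles.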
@cijothomas
Member

Closing in favor of #3421
@cijothomas cijothomas closed this Apr 25, 2026
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 25, 2026
Add BoundCounter and BoundHistogram types that cache resolved aggregator
references for a fixed attribute set. Created via Counter::bind() and
Histogram::bind(), bound instruments bypass per-call attribute lookup
for significant performance improvements (~28x for counters, ~9x for
histograms).

Architecture:
- TrackerEntry.bound_count tracks live handles, preventing eviction
- Direct/Fallback enum handles cardinality overflow gracefully
- Unsupported aggregators (ExpoHistogram, LastValue, PrecomputedSum)
  fall back to unbound path instead of panicking

All public API, traits, and internal plumbing are gated behind the
experimental_metrics_bound_instruments feature flag. Includes 15 tests
covering cumulative/delta temporality, overflow fallback, recovery
after eviction, bound+unbound sharing, idle cycles, drop semantics,
and empty attributes.

Splits from open-telemetry#3392. Refs open-telemetry#1374.
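The ref-counting that ties drop semantics to eviction can be sketched as follows. This is a hypothetical mock of the mechanism, not the SDK's code: bind() bumps bound_count, Drop decrements it, and collection logic only evicts entries whose count is zero.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Default)]
pub struct TrackerEntry {
    pub bound_count: AtomicUsize, // live bound handles referencing this entry
}

pub struct BoundCounter {
    entry: Arc<TrackerEntry>,
}

impl BoundCounter {
    // Stand-in for Counter::bind(): claims a reference on the tracker,
    // preventing it from being evicted while this handle is alive.
    pub fn bind(entry: &Arc<TrackerEntry>) -> Self {
        entry.bound_count.fetch_add(1, Ordering::AcqRel);
        BoundCounter { entry: entry.clone() }
    }
}

impl Drop for BoundCounter {
    // Dropping the last handle returns the entry to "evictable if idle".
    fn drop(&mut self) {
        self.entry.bound_count.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Multiple handles on the same attribute set share one tracker, so the count only reaches zero when every handle is dropped.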
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 29, 2026
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Apr 30, 2026