perf(metrics): partition ValueMap into per-thread shards to reduce lock contention#3473
souyang-dev wants to merge 4 commits into open-telemetry:main
Conversation
Codecov Report

Additional details and impacted files

@@ Coverage Diff @@
## main #3473 +/- ##
=======================================
+ Coverage 83.7% 83.9% +0.1%
=======================================
Files 126 126
Lines 25386 25784 +398
=======================================
+ Hits 21255 21635 +380
- Misses 4131 4149 +18
count: AtomicUsize::new(0),
config,
cardinality_limit,
all_shards_cardinality_limit: per_shard_capacity.saturating_mul(n),
Pinging @cijothomas because metrics is really his area :) !
I'm not sure about this part but it might be a gap in my mental model. The metrics SDK spec defines aggregation_cardinality_limit as "the maximum number of data points allowed to be emitted in a collection cycle by a single instrument" and says that, for a given metric, the cardinality limit is a "hard limit" on the number of metric points collected during a collection cycle:
It looks like with this change, the effective limit becomes roughly configured_limit * num_shards where num_shards is the smaller of CPU count or 64, so a configured limit of 2000 can produce many more than 2000 points.
I think sharding needs to remain an internal implementation detail and preserve the configured stream cardinality limit. The user-visible impact is that a limit chosen to cap SDK memory usage and downstream time-series cardinality would no longer provide that cap; the same configuration could produce very different numbers of exported series depending on CPU count.
Previous sharding attempts were also blocked on this same problem; it is non-trivial to solve with a sharding approach, unfortunately. We'd need to solve the contention issue differently, in a way where the user consciously opts in to this model. (I'll write up what I have in mind soon.)
Thank you @scottgerring for the comments.
I have updated the patch to enforce the cardinality limit across all shards.
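For illustration, here is a minimal sketch of enforcing a single cross-shard limit with a shared atomic counter; the names and structure are hypothetical and not the actual patch:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

// Hypothetical sketch: all shards share one atomic counter, so the
// configured stream cardinality limit stays a hard limit regardless of
// how many shards exist.
struct ShardedValueMap {
    shards: Vec<RwLock<HashMap<Vec<String>, u64>>>,
    tracked: AtomicUsize,     // attribute sets tracked across *all* shards
    cardinality_limit: usize, // the user-configured stream limit
}

impl ShardedValueMap {
    fn new(num_shards: usize, cardinality_limit: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
            tracked: AtomicUsize::new(0),
            cardinality_limit,
        }
    }

    fn record(&self, shard_idx: usize, attrs: Vec<String>, value: u64) {
        let mut shard = self.shards[shard_idx].write().unwrap();
        if let Some(v) = shard.get_mut(&attrs) {
            *v += value; // existing series: no limit check needed
            return;
        }
        // New attribute set: claim a slot against the global limit first.
        if self.tracked.fetch_add(1, Ordering::AcqRel) < self.cardinality_limit {
            shard.insert(attrs, value);
        } else {
            self.tracked.fetch_sub(1, Ordering::AcqRel);
            // Over the limit: record under the overflow attribute set instead
            // (elided here). Note this counter is conservative: the same
            // attribute set stored in two shards merges into one data point
            // at collect time.
        }
    }
}
```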
perf(metrics): partition ValueMap into per-thread shards to reduce lock contention

Two changes are made:
1. Partition ValueMap into shards and let different threads update different shards to minimize lock contention.
2. Implement a merge function for Aggregators, so the collector can merge the shards.

Comply with the OTel spec: aggregation_cardinality_limit is "the maximum number of data points allowed to be emitted in a collection cycle by a single instrument".

Use only 1 shard if the aggregator needs to preserve the order of updates, so all updates will be serialized.
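A minimal sketch of this shard-count policy, assuming the min(CPU count, 64) cap mentioned in the review comment above (the helper name is hypothetical):

```rust
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical helper: choose the shard count for an aggregator.
// Order-preserving aggregators (e.g. last-value) use a single shard so all
// updates stay serialized; everything else uses min(CPU count, 64).
fn shard_count(preserve_order: bool) -> usize {
    if preserve_order {
        return 1;
    }
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
        .min(64)
}
```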
1. Hold a shard write lock to protect the flag swap and delta collection.
2. Downscale both the source and destination histogram to the same scale before merging.
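A minimal sketch of the downscale-then-merge idea for base-2 exponential histogram buckets, using simplified types rather than the SDK's actual data structures: lowering the scale by d maps bucket index i to i >> d, after which counts at equal indexes can simply be added.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified exponential-histogram buckets keyed by index.
struct ExpoBuckets {
    scale: i32,
    counts: BTreeMap<i32, u64>, // bucket index -> count
}

impl ExpoBuckets {
    // Lower the resolution to `target_scale` (must be <= current scale).
    // Reducing the scale by `d` merges every 2^d adjacent buckets: index i
    // maps to i >> d (arithmetic shift, so negative indexes round down).
    fn downscale(&mut self, target_scale: i32) {
        let d = self.scale - target_scale;
        if d <= 0 {
            return;
        }
        let mut merged = BTreeMap::new();
        for (&idx, &count) in &self.counts {
            *merged.entry(idx >> d).or_insert(0u64) += count;
        }
        self.counts = merged;
        self.scale = target_scale;
    }

    // Merge `other` into `self`, first bringing both to the common scale.
    fn merge(&mut self, mut other: ExpoBuckets) {
        let common = self.scale.min(other.scale);
        self.downscale(common);
        other.downscale(common);
        for (idx, count) in other.counts {
            *self.counts.entry(idx).or_insert(0) += count;
        }
    }
}
```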
Metrics ValueMap is protected by a RwLock. It can be a source of contention under high concurrency. What's more, the cost to update a histogram is even higher because you need to grab a mutex lock after the read lock before you can update a bucket.
There is a previous attempt to shard the ValueMap (#1564), but that PR partitions metrics updates based on attribute sets, so updates to the same attribute set still land on the same shard; it is therefore not effective at reducing lock contention when most updates target a small set of attributes.
Fixes #
This PR partitions the single HashMap into shards. Each thread updates a different shard based on its thread id, which minimizes lock contention. If the number of shards is >= the number of CPUs, you won't see lock contention when updating metrics.
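As an illustration of the thread-id routing, here is a sketch with a hypothetical helper, not the exact code in this PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

// Hypothetical helper: map the current thread to a shard index. With at
// least as many shards as CPUs, concurrently running threads rarely contend
// on the same shard's lock, even when they record identical attribute sets.
fn shard_for_current_thread(num_shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    thread::current().id().hash(&mut hasher);
    (hasher.finish() as usize) % num_shards
}
```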
Here are benchmark results (metrics_histogram, metrics) with AMD EPYC 9555 CPUs (255 CPU threads).
The numbers in the tables are updates/second in millions. "shards=16" is this PR with 16 shards, "shards=32" is this PR with 32 shards.
Counter update throughput (cargo run --release --package stress --bin metrics -- N)
Histogram update throughput (cargo run --release --package stress --bin metrics_histogram -- N)
Changes
Two changes are made:
1. Partition ValueMap into shards and let different threads update different shards to minimize lock contention.
2. Implement a merge function for Aggregators, so the collector can merge the shards (see the sketch below).
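A minimal sketch of the merge idea, using a simplified sum aggregator; the trait and types here are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;

// Hypothetical trait: each aggregator knows how to absorb another instance
// of itself, so the collector can fold per-thread shards into one result.
trait Merge {
    fn merge(&mut self, other: Self);
}

// Simple sum aggregation keyed by attribute set (illustrative only).
#[derive(Default)]
struct SumAggregator {
    points: HashMap<Vec<String>, i64>,
}

impl Merge for SumAggregator {
    fn merge(&mut self, other: Self) {
        for (attrs, value) in other.points {
            *self.points.entry(attrs).or_insert(0) += value;
        }
    }
}

// Collector side: drain every shard and fold it into a single aggregator,
// so the exported stream sees one data point per attribute set.
fn collect(shards: Vec<SumAggregator>) -> SumAggregator {
    shards
        .into_iter()
        .fold(SumAggregator::default(), |mut acc, shard| {
            acc.merge(shard);
            acc
        })
}
```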
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes