fix(sdk): resolve OTLP exporter deadlock on single-threaded tokio runtimes #3356

bryantbiggs wants to merge 2 commits into open-telemetry:main
Conversation
Force-pushed 014b873 to 0e23ee9
Codecov Report

❌ Patch coverage is

@@            Coverage Diff            @@
##             main   #3356     +/-   ##
=======================================
- Coverage    82.2%   81.8%    -0.5%
=======================================
  Files         128     125       -3
  Lines       24626   23497    -1129
=======================================
- Hits        20266   19236    -1030
+ Misses       4360    4261      -99
Force-pushed 0e23ee9 to 5728c74
…times

Replace `futures_executor::block_on()` with tokio `Handle::block_on()` on dedicated background threads to properly drive the IO reactor. This fixes deadlocks when `shutdown()`/`force_flush()` is called on single-threaded tokio runtimes or multi-thread runtimes with 1 worker thread.

Changes:
- Add `BlockingStrategy` utility that captures the tokio runtime handle at construction and uses `Handle::block_on()` from background threads, falling back to `futures_executor::block_on()` without tokio
- Update BatchSpanProcessor, BatchLogProcessor, and PeriodicReader to use BlockingStrategy on their dedicated worker threads
- Merge Tokio/TokioCurrentThread into a single auto-detecting Tokio type
- Remove experimental async runtime modules and features:
  - experimental_metrics_periodicreader_with_async_runtime
  - experimental_logs_batch_log_processor_with_async_runtime
  - experimental_trace_batch_span_processor_with_async_runtime
  - rt-tokio-current-thread feature and TokioCurrentThread struct

Fixes: open-telemetry#2802
Refs: open-telemetry#2643, open-telemetry#2539, open-telemetry#2715, open-telemetry#2071
Force-pushed 5728c74 to 8a18420
Big PRs need focus time to review. Any chance you can do shorter PRs? One signal in one PR would be much easier, and then subsequent PRs will be even easier as the pattern is established and accepted.
Pull request overview
This PR fixes critical deadlock issues that occur when OTLP processors call shutdown() or force_flush() on single-threaded tokio runtimes. The root causes were: (1) experimental async runtime modules that blocked the only available thread while waiting for responses, and (2) thread-based processors calling async exporters without tokio runtime context.
The solution aligns the Rust SDK with other OpenTelemetry language implementations by using dedicated OS threads for background processing, with a new BlockingStrategy utility that enters tokio context via Handle::enter() before blocking, making tokio types available without deadlocking.
Changes:
- Introduced `BlockingStrategy` utility to safely call async exporters from synchronous worker threads
- Merged `Tokio` and `TokioCurrentThread` runtime types, with automatic runtime flavor detection
- Removed experimental async runtime features and related modules
- Updated all batch processors to use `BlockingStrategy` instead of direct `futures_executor::block_on()`
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| opentelemetry-sdk/src/util.rs | Added new BlockingStrategy utility for safe async-to-sync bridging |
| opentelemetry-sdk/src/runtime.rs | Merged TokioCurrentThread into Tokio with auto-detection of runtime flavor |
| opentelemetry-sdk/src/trace/span_processor.rs | Updated BatchSpanProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/logs/batch_log_processor.rs | Updated BatchLogProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/metrics/periodic_reader.rs | Updated PeriodicReader to use BlockingStrategy |
| opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs | Deleted experimental async span processor module |
| opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs | Deleted experimental async log processor module |
| opentelemetry-sdk/src/metrics/periodic_reader_with_async_runtime.rs | Deleted experimental async metrics reader module |
| opentelemetry-sdk/src/trace/runtime_tests.rs | Deleted runtime-specific tests |
| opentelemetry-sdk/src/trace/mod.rs | Removed references to deleted span_processor_with_async_runtime module |
| opentelemetry-sdk/src/logs/mod.rs | Removed references to deleted log_processor_with_async_runtime module |
| opentelemetry-sdk/src/metrics/mod.rs | Removed references to deleted periodic_reader_with_async_runtime module |
| opentelemetry-sdk/src/testing/trace/span_exporters.rs | Updated feature flag from rt-tokio-current-thread to rt-tokio |
| opentelemetry-sdk/src/lib.rs | Updated documentation to reflect merged runtime types |
| opentelemetry-sdk/Cargo.toml | Removed experimental feature flags and rt-tokio-current-thread |
.enable_all()
.build()
.expect("failed to create Tokio current thread runtime for OpenTelemetry");

Inconsistent indentation on this builder chain. The `.enable_all()`, `.build()`, and `.expect()` calls should be indented to align with the `tokio::runtime::Builder` call on line 93.
Hey @bryantbiggs, thanks for the interest in this! The runtime trait has been something we've been thinking about for some time and would love to be able to stabilise. Can you clarify if you have tried using the

Having said that, there is definitely effort to be put into stabilising the runtime trait so we can take it out from behind the feature guard, and this approach could certainly go in that direction, but it is nuanced and cross-cutting enough that I would suggest we start with an RFC. Echoing Cijo's comments as well on the PR front: we'd love help here, but it's gnarly enough that I think we should be a bit more incremental/cautious.
Thanks for the feedback @cijothomas and @scottgerring. Totally understood on the PR size — this is way bigger than anyone wants to review in one go, and I should have started a conversation before jumping to code.

Before we talk about breaking this down, would it make sense to first align on whether the overall direction is reasonable? Happy to do that however works best for you — an RFC as Scott suggested, a design discussion in a separate issue, or just working through it here. The core problem I'm trying to address is that the default code path (no experimental features) deadlocks with tonic/gRPC exporters on constrained tokio runtimes. Once we agree on the right approach, I can break the work into smaller signal-by-signal PRs as Cijo suggested. Let me know what works.

Scott — to answer your specific questions:

Re:

Here's a minimal reproduction (full project as a gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178). Tested with published

Example 1 — PeriodicReader +

```rust
use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");
        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();
        let provider = SdkMeterProvider::builder().with_reader(reader).build();
        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);
        provider.force_flush(); // hangs forever
    });
}
```

Example 2 —

```rust
let rt = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(1)
    .enable_all()
    .build()
    .unwrap();
rt.block_on(async {
    // ... same exporter/reader/provider setup ...
    // calling force_flush from a spawned task blocks the only worker thread
    tokio::spawn(async move {
        provider.force_flush(); // hangs forever
    }).await;
});
```

Verified results:

The deadlock chain:

The gist has all four examples (including the working multi-thread control case) that you can run locally to verify.
Hey @bryantbiggs thanks for the quick turnaround and detail! At a quick glance over the diff, it looks to me like there are two separate things in this PR:
Are you happy to address the former independently of the latter? As I understand it, this would address your issue and would avoid pulling in the more involved part with the runtime abstraction. I caveat this by adding that I haven't had time to go into this in detail yet, so please correct me if I am missing something! I also note that discussions about the Runtime abstraction have been long-running and nuanced, and I think if we can remove that from the scope of this it will be much easier to review and get a PR through.
yes! let me see what I can do to break it down and split those. thank you for taking a look!
I'd suggest getting the first one (the bug fix) in as a PR and having a chat about the second (what to do with runtime abstractions); it probably needs an ADR as there is a fair bit involved. You can also find us in the CNCF slack (https://communityinviter.com/apps/cloud-native/cncf) in #opentelemetry-rust if you like! I'm off for the weekend, but I will have cycles next week. Have a good one!
Add tests with TokioSpawn*Exporter mocks that call tokio::spawn() inside export(), simulating tonic/gRPC exporters. These prove that BlockingStrategy correctly provides tokio runtime context on the processor's dedicated OS thread, preventing deadlocks on constrained multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).
Problem

OTLP processors deadlock when `shutdown()`/`force_flush()` is called on single-threaded tokio runtimes (or multi-thread runtimes with 1 worker, e.g. 1-vCPU k8s pods).

Two root causes:

1. The `experimental_*_with_async_runtime` processor modules spawn tasks on the user's tokio runtime, then call `futures_executor::block_on(oneshot_receiver)` to wait for the response. On single-threaded runtimes this blocks the only available thread — deadlock.
2. The thread-based processors (PeriodicReader, BatchLogProcessor, BatchSpanProcessor) call `futures_executor::block_on(exporter.export(...))` on dedicated OS threads. When the exporter uses tonic/gRPC, the export future needs tokio runtime context — bare `futures_executor` doesn't provide that context, causing hangs or panics depending on the exporter.

Fixes #2802
Refs: #2643, #2539, #2715, #2071
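For intuition, the dedicated-OS-thread processor shape described in (2) can be sketched with std-only Rust (hypothetical, heavily simplified names; real processors batch, apply timeouts, and call async exporters):

```rust
use std::sync::mpsc;
use std::thread;

// Minimal sketch of a thread-based processor: a dedicated OS thread owns the
// buffer and the export path; the application sends telemetry over a channel,
// and shutdown is just a message plus a join, with no async runtime involved.
enum Message {
    Export(String),
    Shutdown,
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let mut exported = Vec::new();
        for msg in rx {
            match msg {
                // A real processor would batch and call the exporter here.
                Message::Export(span) => exported.push(span),
                Message::Shutdown => break,
            }
        }
        exported
    });

    tx.send(Message::Export("span-1".into())).unwrap();
    tx.send(Message::Export("span-2".into())).unwrap();
    tx.send(Message::Shutdown).unwrap();
    println!("{:?}", worker.join().unwrap()); // prints ["span-1", "span-2"]
}
```

The trouble arises exactly where the sketch says "call the exporter": when that call is an async tonic future, blocking on it without tokio context is what hangs.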
Why this approach
Every other OpenTelemetry SDK uses dedicated OS threads for background processing, and none of them expose async runtime configuration to users:

- Go: `BatchSpanProcessor` and `PeriodicReader` run on background goroutines. No async runtime concept exists. Shutdown uses `sync.Once` + `sync.WaitGroup`.
- Java: dedicated threads (`DaemonThreadFactory`) or `ScheduledExecutorService`. No async runtime exposure.
- Python: `threading.Thread` for batch processing. The SDK is entirely synchronous internally — it does not use `asyncio` at all.
- .NET: dedicated threads with `AutoResetEvent`/`ManualResetEvent` for signaling. Despite .NET having native `async`/`await`, the OTel SDK deliberately uses OS threads to avoid sync-over-async deadlocks.

The Rust SDK is the only OTel implementation that has this deadlock problem because it's the only one where exporters are async (tonic, reqwest) while the SDK's background threads need to call them synchronously. The
`experimental_*_with_async_runtime` modules attempted to solve this by integrating with the user's async runtime, but this created the deadlock path described above.

This PR aligns the Rust SDK with every other language implementation: dedicated OS threads for background work, with the tokio runtime context entered via `Handle::enter()` before calling `futures_executor::block_on()`. This makes tokio types (spawn, timers, IO resources) available on the worker threads without taking ownership of the reactor — IO continues to be driven by the runtime's own threads. This avoids the "Cannot drop a runtime in a context where blocking is not allowed" panic that `Handle::block_on()` can trigger when the runtime's lifecycle doesn't perfectly match the worker thread's.

Changes
New: `BlockingStrategy` utility (util.rs)
- Uses `Handle::enter()` + `futures_executor::block_on()` on worker threads to provide tokio context
- Falls back to plain `futures_executor::block_on()` when no tokio runtime is available

Updated processors to use `BlockingStrategy`:
- `BatchSpanProcessor` — created at construction, passed to worker thread
- `BatchLogProcessor` — same pattern
- `PeriodicReader` — stored in `PeriodicReaderInner`, used in `collect_and_export`

Merged `Tokio`/`TokioCurrentThread`:
- `Tokio::spawn` now auto-detects the runtime flavor via `Handle::try_current()` + `runtime_flavor()` and delegates to `tokio::spawn`
- `TokioCurrentThread` struct removed

Removed experimental async runtime modules and features:
- `experimental_metrics_periodicreader_with_async_runtime` feature + `periodic_reader_with_async_runtime.rs`
- `experimental_logs_batch_log_processor_with_async_runtime` feature + `log_processor_with_async_runtime.rs`
- `experimental_trace_batch_span_processor_with_async_runtime` feature + `span_processor_with_async_runtime.rs`
- `rt-tokio-current-thread` feature
- `runtime_tests.rs`

Not changed:
- `SimpleSpanProcessor`/`SimpleLogProcessor` — these run on the caller's thread (possibly a tokio worker) where `Handle::block_on()` would panic, so they keep `futures_executor::block_on()`. This is an inherent limitation of synchronous-on-every-event processors.
- `NoAsync` runtime type — still used by OTLP retry logic
- `opentelemetry-otlp`, `opentelemetry-proto`, and other crates

Breaking changes
All removed items were behind `experimental_*` feature flags, not stable API.

| Removed | Migration |
|---|---|
| `experimental_metrics_periodicreader_with_async_runtime` feature | `PeriodicReader` |
| `experimental_logs_batch_log_processor_with_async_runtime` feature | `BatchLogProcessor` |
| `experimental_trace_batch_span_processor_with_async_runtime` feature | `BatchSpanProcessor` |
| `rt-tokio-current-thread` feature | `rt-tokio` (now auto-detects) |
| `runtime::TokioCurrentThread` struct | `runtime::Tokio` |

Test results
- `cargo check -p opentelemetry_sdk --all-features` — pass
- `cargo clippy -p opentelemetry_sdk --no-default-features -- -Dwarnings` — pass
- `cargo clippy -p opentelemetry_sdk --all-features -- -Dwarnings` — pass
- `cargo test -p opentelemetry_sdk --features="testing"` — 295 passed, 0 failed, 3 ignored (pre-existing)
- `cargo check -p opentelemetry-otlp --all-features` — pass
- `cargo test -p opentelemetry-otlp` — 43 passed

Merge conflict risk
PRs #3223, #3267, #3257, #3211, #3139, #2962 touch some of the same files. May need coordination on merge order.