fix(sdk): resolve exporter deadlock on constrained tokio runtimes#3380
fix(sdk): resolve exporter deadlock on constrained tokio runtimes#3380bryantbiggs wants to merge 2 commits intoopen-telemetry:mainfrom
Conversation
d5e1ec9 to
4c2b288
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3380 +/- ##
======================================
Coverage 83.2% 83.3%
======================================
Files 128 128
Lines 25045 25164 +119
======================================
+ Hits 20858 20965 +107
- Misses 4187 4199 +12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c3840f7 to
ed5ff32
Compare
f9608c4 to
1825e7c
Compare
The default thread-based processors (BatchSpanProcessor, BatchLogProcessor, PeriodicReader) call futures_executor::block_on() on their dedicated worker threads. When the exporter uses tonic/gRPC, the export future depends on tokio tasks (e.g. tonic's Buffer worker) that can only be polled by tokio worker threads. If all tokio worker threads are blocked (single-threaded runtime, or multi-thread with 1 worker), this creates a circular wait. Add BlockingStrategy that captures the tokio runtime handle at construction time and enters the runtime context via Handle::enter() before calling futures_executor::block_on(). This makes tokio types available on the dedicated background threads without taking ownership of the reactor. Falls back to plain futures_executor::block_on() without tokio. Fixes: open-telemetry#2802
Add tests with TokioSpawn*Exporter mocks that call tokio::spawn() inside export(), simulating tonic/gRPC exporters. These prove that BlockingStrategy correctly provides tokio runtime context on the processor's dedicated OS thread, preventing deadlocks on constrained multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).
1825e7c to
d37cca1
Compare
| #[cfg(feature = "rt-tokio")] | ||
| Self::TokioHandle(handle) => { | ||
| let _guard = handle.enter(); | ||
| futures_executor::block_on(future) |
There was a problem hiding this comment.
It makes more sense to use Hanlder::block_on instead.
However, this still will cause problems with single-threaded tokio runtimes if block_on is called on the tokio thread:
When this is used on a current_thread runtime, only the Runtime::block_on method can drive the IO and timer drivers, but the Handle::block_on method cannot drive them. This means that, when using this method on a current_thread runtime, anything that relies on IO or timers will not work unless there is another thread currently calling Runtime::block_on on the same runtime.
This also applies to futures_executor::block_on. However, it seems that this is only used in the batch processing and that they always seem to spawn their own threads, is that correct?
I would change your tests to include tokio::time::sleep in the mock exporters, to be sure the timer and IO drivers can really run.
Summary
BlockingStrategythat captures the tokio runtime handle at construction and enters the runtime context viaHandle::enter()before callingfutures_executor::block_on()on dedicated background threadsBatchSpanProcessor,BatchLogProcessor, andPeriodicReaderto useBlockingStrategyinstead of barefutures_executor::block_on()futures_executor::block_on()when no tokio runtime is availableProblem
The default thread-based processors call
futures_executor::block_on(exporter.export(...))on their dedicated worker threads. When the exporter uses tonic/gRPC, the export future depends on tokio tasks (e.g. tonic'sBufferworker spawned viatokio::spawn) that can only be polled by tokio worker threads. If all tokio worker threads are blocked — single-threaded runtime (current_thread), ormulti_threadwith 1 worker (common in 1-vCPU k8s pods) — this creates a circular wait: the worker thread waits for the export to complete, but the export can't complete because no tokio thread is available to poll the Buffer worker.Reproduction and detailed analysis in #3356 (comment): #3356 (comment)
Minimal repro gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178
force_flush()shutdown()current_threadErr(Timeout(5s)), worker thread stays stuckmulti_thread(1)+tokio::spawnmulti_thread(default workers)Solution
BlockingStrategy::new()is called during processor construction (while inside the tokio runtime context). It callsHandle::try_current()to capture the runtime handle. On the dedicated background thread,blocking_strategy.block_on(future)enters the runtime context viaHandle::enter()before callingfutures_executor::block_on(). This makes tokio types (spawn, timers, IO) available without taking ownership of the reactor — IO continues to be driven by the runtime's own threads.When no tokio runtime is available (e.g., non-tokio environments), it falls back to plain
futures_executor::block_on()— preserving existing behavior.Scope
This is the scoped-down version of #3356, containing only the bug fix as suggested by @scottgerring in #3356 (comment). The experimental async runtime removal is intentionally excluded and can be discussed separately.
Fixes #2802
Test plan
cargo check -p opentelemetry_sdk --all-featurescargo check -p opentelemetry_sdk --no-default-featurescargo clippy -p opentelemetry_sdk --all-features -- -Dwarningscargo test -p opentelemetry_sdk --features="testing"— 295 passed, 0 failed, 3 ignored (pre-existing)cargo check -p opentelemetry-otlp --all-features