Skip to content

fix(sdk): resolve exporter deadlock on constrained tokio runtimes#3380

Open
bryantbiggs wants to merge 2 commits intoopen-telemetry:mainfrom
bryantbiggs:fix/blocking-strategy-deadlock
Open

fix(sdk): resolve exporter deadlock on constrained tokio runtimes#3380
bryantbiggs wants to merge 2 commits intoopen-telemetry:mainfrom
bryantbiggs:fix/blocking-strategy-deadlock

Conversation

@bryantbiggs
Copy link
Copy Markdown
Contributor

Summary

  • Add BlockingStrategy that captures the tokio runtime handle at construction and enters the runtime context via Handle::enter() before calling futures_executor::block_on() on dedicated background threads
  • Update BatchSpanProcessor, BatchLogProcessor, and PeriodicReader to use BlockingStrategy instead of bare futures_executor::block_on()
  • Falls back to plain futures_executor::block_on() when no tokio runtime is available

Problem

The default thread-based processors call futures_executor::block_on(exporter.export(...)) on their dedicated worker threads. When the exporter uses tonic/gRPC, the export future depends on tokio tasks (e.g. tonic's Buffer worker spawned via tokio::spawn) that can only be polled by tokio worker threads. If all tokio worker threads are blocked — single-threaded runtime (current_thread), or multi_thread with 1 worker (common in 1-vCPU k8s pods) — this creates a circular wait: the worker thread waits for the export to complete, but the export can't complete because no tokio thread is available to poll the Buffer worker.

Reproduction and detailed analysis in #3356 (comment): #3356 (comment)

Minimal repro gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178

Scenario force_flush() shutdown()
current_thread Hangs forever Returns Err(Timeout(5s)), worker thread stays stuck
multi_thread(1) + tokio::spawn Hangs forever (entire runtime freezes) Same pattern
multi_thread(default workers) Returns immediately Returns immediately

Solution

BlockingStrategy::new() is called during processor construction (while inside the tokio runtime context). It calls Handle::try_current() to capture the runtime handle. On the dedicated background thread, blocking_strategy.block_on(future) enters the runtime context via Handle::enter() before calling futures_executor::block_on(). This makes tokio types (spawn, timers, IO) available without taking ownership of the reactor — IO continues to be driven by the runtime's own threads.

When no tokio runtime is available (e.g., non-tokio environments), it falls back to plain futures_executor::block_on() — preserving existing behavior.

Scope

This is the scoped-down version of #3356, containing only the bug fix as suggested by @scottgerring in #3356 (comment). The experimental async runtime removal is intentionally excluded and can be discussed separately.

Fixes #2802

Test plan

  • cargo check -p opentelemetry_sdk --all-features
  • cargo check -p opentelemetry_sdk --no-default-features
  • cargo clippy -p opentelemetry_sdk --all-features -- -Dwarnings
  • cargo test -p opentelemetry_sdk --features="testing" — 295 passed, 0 failed, 3 ignored (pre-existing)
  • cargo check -p opentelemetry-otlp --all-features

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 89.51613% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.3%. Comparing base (9650783) to head (f9608c4).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-sdk/src/metrics/periodic_reader.rs 82.3% 6 Missing ⚠️
opentelemetry-sdk/src/logs/batch_log_processor.rs 88.8% 4 Missing ⚠️
opentelemetry-sdk/src/trace/span_processor.rs 92.6% 3 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #3380    +/-   ##
======================================
  Coverage   83.2%   83.3%            
======================================
  Files        128     128            
  Lines      25045   25164   +119     
======================================
+ Hits       20858   20965   +107     
- Misses      4187    4199    +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The default thread-based processors (BatchSpanProcessor, BatchLogProcessor,
PeriodicReader) call futures_executor::block_on() on their dedicated worker
threads. When the exporter uses tonic/gRPC, the export future depends on
tokio tasks (e.g. tonic's Buffer worker) that can only be polled by tokio
worker threads. If all tokio worker threads are blocked (single-threaded
runtime, or multi-thread with 1 worker), this creates a circular wait.

Add BlockingStrategy that captures the tokio runtime handle at construction
time and enters the runtime context via Handle::enter() before calling
futures_executor::block_on(). This makes tokio types available on the
dedicated background threads without taking ownership of the reactor.
Falls back to plain futures_executor::block_on() without tokio.

Fixes: open-telemetry#2802
Add tests with TokioSpawn*Exporter mocks that call tokio::spawn()
inside export(), simulating tonic/gRPC exporters. These prove that
BlockingStrategy correctly provides tokio runtime context on the
processor's dedicated OS thread, preventing deadlocks on constrained
multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).
@bryantbiggs bryantbiggs force-pushed the fix/blocking-strategy-deadlock branch from 1825e7c to d37cca1 Compare April 29, 2026 14:18
#[cfg(feature = "rt-tokio")]
Self::TokioHandle(handle) => {
let _guard = handle.enter();
futures_executor::block_on(future)
Copy link
Copy Markdown

@thalesfragoso thalesfragoso May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to use Hanlder::block_on instead.

However, this still will cause problems with single-threaded tokio runtimes if block_on is called on the tokio thread:

When this is used on a current_thread runtime, only the Runtime::block_on method can drive the IO and timer drivers, but the Handle::block_on method cannot drive them. This means that, when using this method on a current_thread runtime, anything that relies on IO or timers will not work unless there is another thread currently calling Runtime::block_on on the same runtime.

This also applies to futures_executor::block_on. However, it seems that this is only used in the batch processing and that they always seem to spawn their own threads, is that correct?

I would change your tests to include tokio::time::sleep in the mock exporters, to be sure the timer and IO drivers can really run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OTLP MetricExporter deadlock issue

2 participants