fix(sdk): resolve OTLP exporter deadlock on single-threaded tokio runtimes #3356

bryantbiggs wants to merge 2 commits into open-telemetry:main
Conversation
Force-pushed 014b873 to 0e23ee9
Codecov Report

❌ Patch coverage is

@@            Coverage Diff            @@
##             main   #3356     +/-   ##
=======================================
- Coverage    82.2%   81.8%    -0.5%
=======================================
  Files         128     125       -3
  Lines       24626   23497    -1129
=======================================
- Hits        20266   19236    -1030
+ Misses       4360    4261      -99
Force-pushed 0e23ee9 to 5728c74
…times

Replace `futures_executor::block_on()` with tokio `Handle::block_on()` on dedicated background threads to properly drive the IO reactor. This fixes deadlocks when `shutdown()`/`force_flush()` is called on single-threaded tokio runtimes or multi-thread runtimes with 1 worker thread.

Changes:
- Add `BlockingStrategy` utility that captures the tokio runtime handle at construction and uses `Handle::block_on()` from background threads, falling back to `futures_executor::block_on()` without tokio
- Update BatchSpanProcessor, BatchLogProcessor, and PeriodicReader to use BlockingStrategy on their dedicated worker threads
- Merge Tokio/TokioCurrentThread into a single auto-detecting Tokio type
- Remove experimental async runtime modules and features:
  - experimental_metrics_periodicreader_with_async_runtime
  - experimental_logs_batch_log_processor_with_async_runtime
  - experimental_trace_batch_span_processor_with_async_runtime
  - rt-tokio-current-thread feature and TokioCurrentThread struct

Fixes: open-telemetry#2802
Refs: open-telemetry#2643, open-telemetry#2539, open-telemetry#2715, open-telemetry#2071
Force-pushed 5728c74 to 8a18420
Big PRs need focus time to review. Any chance you can do shorter PRs? One signal in one PR would be much easier, and then subsequent PRs will be even easier as the pattern is established and accepted.
Pull request overview
This PR fixes critical deadlock issues that occur when OTLP processors call shutdown() or force_flush() on single-threaded tokio runtimes. The root causes were: (1) experimental async runtime modules that blocked the only available thread while waiting for responses, and (2) thread-based processors calling async exporters without tokio runtime context.
The solution aligns the Rust SDK with other OpenTelemetry language implementations by using dedicated OS threads for background processing, with a new BlockingStrategy utility that enters tokio context via Handle::enter() before blocking, making tokio types available without deadlocking.
Changes:
- Introduced `BlockingStrategy` utility to safely call async exporters from synchronous worker threads
- Merged `Tokio` and `TokioCurrentThread` runtime types, with automatic runtime flavor detection
- Removed experimental async runtime features and related modules
- Updated all batch processors to use `BlockingStrategy` instead of direct `futures_executor::block_on()`
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| opentelemetry-sdk/src/util.rs | Added new BlockingStrategy utility for safe async-to-sync bridging |
| opentelemetry-sdk/src/runtime.rs | Merged TokioCurrentThread into Tokio with auto-detection of runtime flavor |
| opentelemetry-sdk/src/trace/span_processor.rs | Updated BatchSpanProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/logs/batch_log_processor.rs | Updated BatchLogProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/metrics/periodic_reader.rs | Updated PeriodicReader to use BlockingStrategy |
| opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs | Deleted experimental async span processor module |
| opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs | Deleted experimental async log processor module |
| opentelemetry-sdk/src/metrics/periodic_reader_with_async_runtime.rs | Deleted experimental async metrics reader module |
| opentelemetry-sdk/src/trace/runtime_tests.rs | Deleted runtime-specific tests |
| opentelemetry-sdk/src/trace/mod.rs | Removed references to deleted span_processor_with_async_runtime module |
| opentelemetry-sdk/src/logs/mod.rs | Removed references to deleted log_processor_with_async_runtime module |
| opentelemetry-sdk/src/metrics/mod.rs | Removed references to deleted periodic_reader_with_async_runtime module |
| opentelemetry-sdk/src/testing/trace/span_exporters.rs | Updated feature flag from rt-tokio-current-thread to rt-tokio |
| opentelemetry-sdk/src/lib.rs | Updated documentation to reflect merged runtime types |
| opentelemetry-sdk/Cargo.toml | Removed experimental feature flags and rt-tokio-current-thread |
.enable_all()
.build()
.expect("failed to create Tokio current thread runtime for OpenTelemetry");

Inconsistent indentation on this builder chain. The `.enable_all()`, `.build()`, and `.expect()` calls should be indented to align with the `tokio::runtime::Builder` call on line 93.
Hey @bryantbiggs, thanks for the interest in this! The runtime trait has been something we've been thinking about for some time and would love to be able to stabilise. Can you clarify if you have tried using the

Having said that, there is definitely effort to be put into stabilising the runtime trait so we can take it out from behind the feature guard, and this approach could certainly go in that direction, but it is nuanced and cross-cutting enough that I would suggest we start with an RFC. Echoing Cijo's comments as well on the PR front: we'd love help here, but it's gnarly enough that I think we should be a bit more incremental/cautious.
Thanks for the feedback @cijothomas and @scottgerring. Totally understood on the PR size — this is way bigger than anyone wants to review in one go, and I should have started a conversation before jumping to code.

Before we talk about breaking this down, would it make sense to first align on whether the overall direction is reasonable? Happy to do that however works best for you — an RFC as Scott suggested, a design discussion in a separate issue, or just working through it here. The core problem I'm trying to address is that the default code path (no experimental features) deadlocks with tonic/gRPC exporters on constrained tokio runtimes. Once we agree on the right approach, I can break the work into smaller signal-by-signal PRs as Cijo suggested. Let me know what works.

Scott — to answer your specific questions:

Re:

Here's a minimal reproduction (full project as a gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178). Tested with published

Example 1 — PeriodicReader +

```rust
use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");
        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();
        let provider = SdkMeterProvider::builder().with_reader(reader).build();
        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);
        provider.force_flush(); // hangs forever
    });
}
```

Example 2 —

```rust
let rt = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(1)
    .enable_all()
    .build()
    .unwrap();
rt.block_on(async {
    // ... same exporter/reader/provider setup ...
    // calling force_flush from a spawned task blocks the only worker thread
    tokio::spawn(async move {
        provider.force_flush(); // hangs forever
    }).await;
});
```

Verified results:

The deadlock chain:

The gist has all four examples (including the working multi-thread control case) that you can run locally to verify.
Hey @bryantbiggs thanks for the quick turnaround and detail! At a quick glance over the diff, it looks to me like there are two separate things in this PR:
Are you happy to address the former independently of the latter? As I understand it, this would address your issue and would avoid pulling in the more involved part with the runtime abstraction. I caveat this by adding that I haven't had time to go into this in detail yet, so please correct me if I am missing something! I also note that discussions about the Runtime abstraction have been long-running and nuanced, and I think if we can remove that from the scope of this it will be much easier to review and get a PR through.
yes! let me see what I can do to break it down and split those. thank you for taking a look!
I'd suggest getting the first one (the bug fix) in as a PR and having a chat about the second (what to do with runtime abstractions); it probably needs an ADR as there is a fair bit involved. You can also find us in the CNCF slack (https://communityinviter.com/apps/cloud-native/cncf) in #opentelemetry-rust if you like! I'm off for the weekend, but I will have cycles next week. Have a good one!
Add tests with TokioSpawn*Exporter mocks that call tokio::spawn() inside export(), simulating tonic/gRPC exporters. These prove that BlockingStrategy correctly provides tokio runtime context on the processor's dedicated OS thread, preventing deadlocks on constrained multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).
Problem

OTLP processors deadlock when `shutdown()`/`force_flush()` is called on single-threaded tokio runtimes (or multi-thread runtimes with 1 worker, e.g. 1-vCPU k8s pods).

Two root causes:

1. The `experimental_*_with_async_runtime` processor modules spawn tasks on the user's tokio runtime, then call `futures_executor::block_on(oneshot_receiver)` to wait for the response. On single-threaded runtimes this blocks the only available thread — deadlock.
2. The thread-based processors (PeriodicReader, BatchLogProcessor, BatchSpanProcessor) call `futures_executor::block_on(exporter.export(...))` on dedicated OS threads. When the exporter uses tonic/gRPC, the export future needs tokio runtime context — bare `futures_executor` doesn't provide that context, causing hangs or panics depending on the exporter.

Fixes #2802
Refs: #2643, #2539, #2715, #2071
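For intuition, the dedicated-OS-thread processor shape described in (2) can be sketched with std-only Rust (hypothetical, heavily simplified names; real processors batch, apply timeouts, and call async exporters):

```rust
use std::sync::mpsc;
use std::thread;

// Minimal sketch of a thread-based processor: a dedicated OS thread owns the
// buffer and the export path; the application sends telemetry over a channel,
// and shutdown is just a message plus a join, with no async runtime involved.
enum Message {
    Export(String),
    Shutdown,
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let mut exported = Vec::new();
        for msg in rx {
            match msg {
                // A real processor would batch and call the exporter here.
                Message::Export(span) => exported.push(span),
                Message::Shutdown => break,
            }
        }
        exported
    });

    tx.send(Message::Export("span-1".into())).unwrap();
    tx.send(Message::Export("span-2".into())).unwrap();
    tx.send(Message::Shutdown).unwrap();
    println!("{:?}", worker.join().unwrap()); // prints ["span-1", "span-2"]
}
```

The trouble arises exactly where the sketch says "call the exporter": when that call is an async tonic future, blocking on it without tokio context is what hangs.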
Why this approach
Every other OpenTelemetry SDK uses dedicated OS threads for background processing, and none of them expose async runtime configuration to users:

- Go: `BatchSpanProcessor` and `PeriodicReader` run on background goroutines. No async runtime concept exists. Shutdown uses `sync.Once` + `sync.WaitGroup`.
- Java: dedicated threads (`DaemonThreadFactory`) or `ScheduledExecutorService`. No async runtime exposure.
- Python: `threading.Thread` for batch processing. The SDK is entirely synchronous internally — it does not use `asyncio` at all.
- .NET: dedicated threads with `AutoResetEvent`/`ManualResetEvent` for signaling. Despite .NET having native `async`/`await`, the OTel SDK deliberately uses OS threads to avoid sync-over-async deadlocks.

The Rust SDK is the only OTel implementation that has this deadlock problem because it's the only one where exporters are async (tonic, reqwest) while the SDK's background threads need to call them synchronously. The
`experimental_*_with_async_runtime` modules attempted to solve this by integrating with the user's async runtime, but this created the deadlock path described above.

This PR aligns the Rust SDK with every other language implementation: dedicated OS threads for background work, with the tokio runtime context entered via `Handle::enter()` before calling `futures_executor::block_on()`. This makes tokio types (spawn, timers, IO resources) available on the worker threads without taking ownership of the reactor — IO continues to be driven by the runtime's own threads. This avoids the "Cannot drop a runtime in a context where blocking is not allowed" panic that `Handle::block_on()` can trigger when the runtime's lifecycle doesn't perfectly match the worker thread's.

Changes
New: `BlockingStrategy` utility (util.rs)
- Uses `Handle::enter()` + `futures_executor::block_on()` on worker threads to provide tokio context
- Falls back to plain `futures_executor::block_on()` when no tokio runtime is available

Updated processors to use `BlockingStrategy`:
- `BatchSpanProcessor` — created at construction, passed to worker thread
- `BatchLogProcessor` — same pattern
- `PeriodicReader` — stored in `PeriodicReaderInner`, used in `collect_and_export`

Merged `Tokio`/`TokioCurrentThread`:
- `Tokio::spawn` now auto-detects the runtime flavor via `Handle::try_current()` + `runtime_flavor()` and delegates to `tokio::spawn`
- `TokioCurrentThread` struct removed

Removed experimental async runtime modules and features:
- `experimental_metrics_periodicreader_with_async_runtime` feature + `periodic_reader_with_async_runtime.rs`
- `experimental_logs_batch_log_processor_with_async_runtime` feature + `log_processor_with_async_runtime.rs`
- `experimental_trace_batch_span_processor_with_async_runtime` feature + `span_processor_with_async_runtime.rs`
- `rt-tokio-current-thread` feature
- `runtime_tests.rs`

Not changed:
- `SimpleSpanProcessor`/`SimpleLogProcessor` — these run on the caller's thread (possibly a tokio worker) where `Handle::block_on()` would panic, so they keep `futures_executor::block_on()`. This is an inherent limitation of synchronous-on-every-event processors.
- `NoAsync` runtime type — still used by OTLP retry logic
- `opentelemetry-otlp`, `opentelemetry-proto`, and other crates

Breaking changes
All removed items were behind `experimental_*` feature flags, not stable API.

| Removed | Migration |
|---|---|
| `experimental_metrics_periodicreader_with_async_runtime` feature | `PeriodicReader` |
| `experimental_logs_batch_log_processor_with_async_runtime` feature | `BatchLogProcessor` |
| `experimental_trace_batch_span_processor_with_async_runtime` feature | `BatchSpanProcessor` |
| `rt-tokio-current-thread` feature | `rt-tokio` (now auto-detects) |
| `runtime::TokioCurrentThread` struct | `runtime::Tokio` |

Test results
- `cargo check -p opentelemetry_sdk --all-features` — pass
- `cargo clippy -p opentelemetry_sdk --no-default-features -- -Dwarnings` — pass
- `cargo clippy -p opentelemetry_sdk --all-features -- -Dwarnings` — pass
- `cargo test -p opentelemetry_sdk --features="testing"` — 295 passed, 0 failed, 3 ignored (pre-existing)
- `cargo check -p opentelemetry-otlp --all-features` — pass
- `cargo test -p opentelemetry-otlp` — 43 passed

Merge conflict risk
PRs #3223, #3267, #3257, #3211, #3139, #2962 touch some of the same files. May need coordination on merge order.