
fix(sdk): resolve OTLP exporter deadlock on single-threaded tokio runtimes#3356

Closed
bryantbiggs wants to merge 2 commits into open-telemetry:main from bryantbiggs:fix/otlp-deadlock-runtime-consolidation

Conversation

@bryantbiggs
Contributor

@bryantbiggs bryantbiggs commented Feb 19, 2026

Problem

OTLP processors deadlock when shutdown()/force_flush() is called on single-threaded tokio runtimes (or multi-thread with 1 worker, e.g. 1-vCPU k8s pods).

Two root causes:

  1. The experimental_*_with_async_runtime processor modules spawn tasks on the user's tokio runtime, then call futures_executor::block_on(oneshot_receiver) to wait for the response. On single-threaded runtimes this blocks the only available thread — deadlock.

  2. The thread-based processors (PeriodicReader, BatchLogProcessor, BatchSpanProcessor) call futures_executor::block_on(exporter.export(...)) on dedicated OS threads. When the exporter uses tonic/gRPC, the export future needs tokio runtime context — bare futures_executor doesn't provide that context, causing hangs or panics depending on the exporter.
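The control flow in (2) can be reduced to a std-only sketch (the Message enum, spawn_worker, and force_flush names here are illustrative, not the SDK's actual types): a dedicated OS thread owns the exporter and services control messages, while the caller blocks on a reply channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the SDK's internal control messages.
pub enum Message {
    Flush(mpsc::Sender<bool>),
    Shutdown,
}

/// Spawn a worker thread that services control messages, mirroring the
/// dedicated-OS-thread pattern used by BatchSpanProcessor and PeriodicReader.
pub fn spawn_worker() -> mpsc::Sender<Message> {
    let (tx, rx) = mpsc::channel::<Message>();
    thread::spawn(move || {
        for msg in rx {
            match msg {
                Message::Flush(reply) => {
                    // The real processors call
                    // futures_executor::block_on(exporter.export(..)) here; if
                    // that future needs a tokio runtime whose threads are
                    // blocked, this reply is never sent.
                    let _ = reply.send(true);
                }
                Message::Shutdown => break,
            }
        }
    });
    tx
}

/// Caller-side force_flush: send a Flush message and block on the reply,
/// like the SDK's std::sync::mpsc::Receiver::recv() wait.
pub fn force_flush(tx: &mpsc::Sender<Message>) -> bool {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Message::Flush(reply_tx)).expect("worker thread gone");
    reply_rx.recv().unwrap_or(false)
}
```

In this toy version the worker replies immediately, so the round trip completes; the deadlock only appears once the worker's export future depends on a blocked runtime.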

Fixes #2802
Refs: #2643, #2539, #2715, #2071

Why this approach

Every other OpenTelemetry SDK uses dedicated OS threads for background processing and none of them expose async runtime configuration to users:

  • Go: spawns a goroutine for BatchSpanProcessor and PeriodicReader. No async runtime concept exists. Shutdown uses sync.Once + sync.WaitGroup.
  • Java: creates dedicated daemon threads (via DaemonThreadFactory) or uses ScheduledExecutorService. No async runtime exposure.
  • Python: creates a dedicated daemon threading.Thread for batch processing. The SDK is entirely synchronous internally — it does not use asyncio at all.
  • .NET: creates dedicated background threads with AutoResetEvent/ManualResetEvent for signaling. Despite .NET having native async/await, the OTel SDK deliberately uses OS threads to avoid sync-over-async deadlocks.

The Rust SDK is the only OTel implementation that has this deadlock problem because it's the only one where exporters are async (tonic, reqwest) while the SDK's background threads need to call them synchronously. The experimental_*_with_async_runtime modules attempted to solve this by integrating with the user's async runtime, but this created the deadlock path described above.

This PR aligns the Rust SDK with every other language implementation: dedicated OS threads for background work, with the tokio runtime context entered via Handle::enter() before calling futures_executor::block_on(). This makes tokio types (spawn, timers, IO resources) available on the worker threads without taking ownership of the reactor — IO continues to be driven by the runtime's own threads. This avoids the "Cannot drop a runtime in a context where blocking is not allowed" panic that Handle::block_on() can trigger when the runtime's lifecycle doesn't perfectly match the worker thread's.

Changes

New: BlockingStrategy utility (util.rs)

  • Captures the tokio runtime handle at construction time (when called from within a tokio context)
  • Uses Handle::enter() + futures_executor::block_on() on worker threads to provide tokio context
  • Falls back to plain futures_executor::block_on() when no tokio runtime is available

Updated processors to use BlockingStrategy:

  • BatchSpanProcessor — created at construction, passed to worker thread
  • BatchLogProcessor — same pattern
  • PeriodicReader — stored in PeriodicReaderInner, used in collect_and_export

Merged Tokio/TokioCurrentThread:

  • Tokio::spawn now auto-detects runtime flavor via Handle::try_current() + runtime_flavor()
  • Multi-thread: spawns via tokio::spawn
  • Current-thread: spawns a separate OS thread with its own runtime
  • Deleted TokioCurrentThread struct

Removed experimental async runtime modules and features:

  • experimental_metrics_periodicreader_with_async_runtime feature + periodic_reader_with_async_runtime.rs
  • experimental_logs_batch_log_processor_with_async_runtime feature + log_processor_with_async_runtime.rs
  • experimental_trace_batch_span_processor_with_async_runtime feature + span_processor_with_async_runtime.rs
  • rt-tokio-current-thread feature
  • runtime_tests.rs

Not changed:

  • SimpleSpanProcessor/SimpleLogProcessor — these run on the caller's thread (possibly a tokio worker) where Handle::block_on() would panic, so they keep futures_executor::block_on(). This is an inherent limitation of synchronous-on-every-event processors.
  • NoAsync runtime type — still used by OTLP retry logic
  • opentelemetry-otlp, opentelemetry-proto, or other crates

Breaking changes

All removed items were behind opt-in feature flags (experimental_* plus rt-tokio-current-thread), not stable default API.

| Removed | Migration |
| --- | --- |
| experimental_metrics_periodicreader_with_async_runtime feature | Use default thread-based PeriodicReader |
| experimental_logs_batch_log_processor_with_async_runtime feature | Use default thread-based BatchLogProcessor |
| experimental_trace_batch_span_processor_with_async_runtime feature | Use default thread-based BatchSpanProcessor |
| rt-tokio-current-thread feature | Use rt-tokio (now auto-detects) |
| runtime::TokioCurrentThread struct | Use runtime::Tokio |

Test results

  • cargo check -p opentelemetry_sdk --all-features — pass
  • cargo clippy -p opentelemetry_sdk --no-default-features -- -Dwarnings — pass
  • cargo clippy -p opentelemetry_sdk --all-features -- -Dwarnings — pass
  • cargo test -p opentelemetry_sdk --features="testing" — 295 passed, 0 failed, 3 ignored (pre-existing)
  • cargo check -p opentelemetry-otlp --all-features — pass
  • cargo test -p opentelemetry-otlp — 43 passed

Merge conflict risk

PRs #3223, #3267, #3257, #3211, #3139, #2962 touch some of the same files. May need coordination on merge order.

@bryantbiggs bryantbiggs requested a review from a team as a code owner February 19, 2026 22:12
@bryantbiggs bryantbiggs force-pushed the fix/otlp-deadlock-runtime-consolidation branch from 014b873 to 0e23ee9 Compare February 19, 2026 22:14
@codecov

codecov Bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 65.07937% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.8%. Comparing base (3c41f29) to head (ed6e932).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| opentelemetry-sdk/src/runtime.rs | 0.0% | 21 Missing ⚠️ |
| opentelemetry-sdk/src/logs/batch_log_processor.rs | 92.3% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3356     +/-   ##
=======================================
- Coverage   82.2%   81.8%   -0.5%     
=======================================
  Files        128     125      -3     
  Lines      24626   23497   -1129     
=======================================
- Hits       20266   19236   -1030     
+ Misses      4360    4261     -99     



Replace bare `futures_executor::block_on()` with `Handle::enter()` +
`futures_executor::block_on()` on dedicated background threads so the
export futures have tokio runtime context while the runtime's own
threads keep driving the IO reactor. This fixes deadlocks when
shutdown()/force_flush() is called on single-threaded tokio runtimes or
multi-thread runtimes with 1 worker thread.

Changes:
- Add `BlockingStrategy` utility that captures the tokio runtime handle
  at construction and enters it before blocking on background threads,
  falling back to plain `futures_executor::block_on()` without tokio
- Update BatchSpanProcessor, BatchLogProcessor, and PeriodicReader to
  use BlockingStrategy on their dedicated worker threads
- Merge Tokio/TokioCurrentThread into single auto-detecting Tokio type
- Remove experimental async runtime modules and features:
  - experimental_metrics_periodicreader_with_async_runtime
  - experimental_logs_batch_log_processor_with_async_runtime
  - experimental_trace_batch_span_processor_with_async_runtime
  - rt-tokio-current-thread feature and TokioCurrentThread struct

Fixes: open-telemetry#2802
Refs: open-telemetry#2643, open-telemetry#2539, open-telemetry#2715, open-telemetry#2071
@bryantbiggs bryantbiggs force-pushed the fix/otlp-deadlock-runtime-consolidation branch from 5728c74 to 8a18420 Compare February 19, 2026 22:33
@cijothomas
Member

Big PRs need focus time to review. Any chance you can do shorter PRs? One signal per PR would be much easier, and subsequent PRs will be even easier once the pattern is established and accepted.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes critical deadlock issues that occur when OTLP processors call shutdown() or force_flush() on single-threaded tokio runtimes. The root causes were: (1) experimental async runtime modules that blocked the only available thread while waiting for responses, and (2) thread-based processors calling async exporters without tokio runtime context.

The solution aligns the Rust SDK with other OpenTelemetry language implementations by using dedicated OS threads for background processing, with a new BlockingStrategy utility that enters tokio context via Handle::enter() before blocking, making tokio types available without deadlocking.

Changes:

  • Introduced BlockingStrategy utility to safely call async exporters from synchronous worker threads
  • Merged Tokio and TokioCurrentThread runtime types with automatic runtime flavor detection
  • Removed experimental async runtime features and related modules
  • Updated all batch processors to use BlockingStrategy instead of direct futures_executor::block_on()

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| opentelemetry-sdk/src/util.rs | Added new BlockingStrategy utility for safe async-to-sync bridging |
| opentelemetry-sdk/src/runtime.rs | Merged TokioCurrentThread into Tokio with auto-detection of runtime flavor |
| opentelemetry-sdk/src/trace/span_processor.rs | Updated BatchSpanProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/logs/batch_log_processor.rs | Updated BatchLogProcessor to use BlockingStrategy |
| opentelemetry-sdk/src/metrics/periodic_reader.rs | Updated PeriodicReader to use BlockingStrategy |
| opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs | Deleted experimental async span processor module |
| opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs | Deleted experimental async log processor module |
| opentelemetry-sdk/src/metrics/periodic_reader_with_async_runtime.rs | Deleted experimental async metrics reader module |
| opentelemetry-sdk/src/trace/runtime_tests.rs | Deleted runtime-specific tests |
| opentelemetry-sdk/src/trace/mod.rs | Removed references to deleted span_processor_with_async_runtime module |
| opentelemetry-sdk/src/logs/mod.rs | Removed references to deleted log_processor_with_async_runtime module |
| opentelemetry-sdk/src/metrics/mod.rs | Removed references to deleted periodic_reader_with_async_runtime module |
| opentelemetry-sdk/src/testing/trace/span_exporters.rs | Updated feature flag from rt-tokio-current-thread to rt-tokio |
| opentelemetry-sdk/src/lib.rs | Updated documentation to reflect merged runtime types |
| opentelemetry-sdk/Cargo.toml | Removed experimental feature flags and rt-tokio-current-thread |


Comment on lines +94 to +96
.enable_all()
.build()
.expect("failed to create Tokio current thread runtime for OpenTelemetry");

Copilot AI Feb 20, 2026


Inconsistent indentation on this builder chain. The .enable_all(), .build(), and .expect() calls should be indented to align with the tokio::runtime::Builder call on line 93.


@scottgerring scottgerring self-requested a review February 20, 2026 08:59
@scottgerring
Member

scottgerring commented Feb 20, 2026

Hey @bryantbiggs thanks for the interest in this! The runtime trait has been something we've been thinking about for some time and would love to be able to stabilise.

Can you clarify whether you have tried the rt-tokio-current-thread feature? If you are using it and are able to cause a deadlock, could you share a repro? If there is a deadlock when the configuration is set up properly we should certainly look at addressing that (and I don't think this is very well documented at the moment: if you are using the unstable runtime feature and a single-threaded runtime, then you also need rt-tokio-current-thread).

Having said that, there is definitely effort to be put into stabilising the runtime trait so we can take it out from behind the feature guard, and this approach could certainly go in that direction, but it is nuanced and cross-cutting enough that I would suggest we start with an RFC. I echo Cijo's comments on the PR front as well: we'd love help here, but it's gnarly enough that I think we should be a bit more incremental and cautious.

@bryantbiggs
Contributor Author

Thanks for the feedback @cijothomas and @scottgerring. Totally understood on the PR size — this is way bigger than anyone wants to review in one go, and I should have started a conversation before jumping to code.

Before we talk about breaking this down, would it make sense to first align on whether the overall direction is reasonable? Happy to do that however works best for you — an RFC as Scott suggested, a design discussion in a separate issue, or just working through it here. The core problem I'm trying to address is that the default code path (no experimental features) deadlocks with tonic/gRPC exporters on constrained tokio runtimes. Once we agree on the right approach, I can break the work into smaller signal-by-signal PRs as Cijo suggested. Let me know what works.


Scott — to answer your specific questions:

Re: rt-tokio-current-thread — the deadlocks I'm hitting are on the default code path, not the experimental async-runtime processors. The default thread-based processors (PeriodicReader, BatchSpanProcessor, BatchLogProcessor) call futures_executor::block_on(exporter.export(...)) on their dedicated worker threads. rt-tokio-current-thread only applies to the experimental processors — it enables TokioCurrentThread which is only usable with the experimental_*_with_async_runtime builders that accept a Runtime parameter. The default PeriodicReader::builder(exporter) doesn't take a runtime parameter at all, so there's no way to opt into the current-thread workaround on the default path.

Here's a minimal reproduction (full project as a gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178). Tested with published opentelemetry_sdk v0.31.0 and opentelemetry-otlp v0.31.0, no experimental features enabled.

Example 1 — PeriodicReader + current_thread runtime (same as #[tokio::test] default):

use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");

        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();

        let provider = SdkMeterProvider::builder().with_reader(reader).build();

        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);

        provider.force_flush(); // hangs forever
    });
}

Example 2 — multi_thread(1 worker) + tokio::spawn (simulates a 1-vCPU k8s pod):

let rt = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(1)
    .enable_all()
    .build()
    .unwrap();

rt.block_on(async {
    // ... same exporter/reader/provider setup ...

    // calling force_flush from a spawned task blocks the only worker thread
    tokio::spawn(async move {
        provider.force_flush(); // hangs forever
    }).await;
});

Verified results:

| Scenario | force_flush() | shutdown() |
| --- | --- | --- |
| current_thread | Hangs forever (recv() has no timeout) | Returns Err(Timeout(5s)), but worker thread stays stuck |
| multi_thread(1) + tokio::spawn | Hangs forever (entire runtime freezes) | Same pattern |
| multi_thread(default workers) | Returns immediately (connection error, no hang) | Returns immediately |

The deadlock chain: force_flush() sends a Flush message to the PeriodicReader's dedicated OS thread and blocks on std::sync::mpsc::Receiver::recv() (no timeout). The worker thread receives the message and calls futures_executor::block_on(exporter.export(rm)). Tonic's Channel internally spawns a Buffer worker as a tokio task (via tokio::spawn) at channel creation time — the export future sends the gRPC request to this Buffer worker and awaits the response. The Buffer worker can only be polled by tokio worker threads. If all tokio worker threads are blocked by the recv() call, the Buffer worker can't process the request, the export never completes, and the worker thread never responds — circular wait.
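The circular wait described above can be reduced to a std-only toy model (all names here are illustrative; recv_timeout replaces the real recv() so the sketch terminates instead of hanging): each side waits on a message only the other side could send.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Toy model of the circular wait: the "runtime thread" blocks on a reply,
/// but the party that would produce the reply (tonic's Buffer worker in the
/// real scenario) needs a turn that the blocked thread can never give it.
pub fn simulate_circular_wait() -> bool {
    let (to_worker_tx, to_worker_rx) = mpsc::channel::<()>();
    let (reply_tx, reply_rx) = mpsc::channel::<()>();

    // "Worker thread": before replying it needs a signal from the runtime
    // thread (standing in for the Buffer worker being polled), which never
    // comes because that thread is blocked below.
    thread::spawn(move || {
        if to_worker_rx.recv_timeout(Duration::from_millis(100)).is_ok() {
            let _ = reply_tx.send(());
        }
        // On timeout, reply_tx is dropped without ever sending: the flush
        // never completes, like the worker thread staying stuck.
    });

    // "Runtime thread": blocks on the reply without yielding, so it can
    // never send on to_worker_tx. recv_timeout keeps the sketch finite;
    // the real code uses recv() and hangs forever.
    let flushed = reply_rx.recv_timeout(Duration::from_millis(200)).is_ok();
    drop(to_worker_tx);
    flushed
}
```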

The gist has all four examples (including the working multi-thread control case) that you can run locally to verify.

@scottgerring
Member

Hey @bryantbiggs thanks for the quick turnaround and detail!
That's ... not great that that's not in the experimental flags.

At a quick glance over the diff, it looks to me like there are two separate things in this PR:

  • A change addressing the issue you discuss and provided the repro for (the BlockingStrategy piece)
  • Removal of a pile of adjacent experimental stuff - this appears to be the majority of the diff, but is as far as I can see at a quick glance unrelated to the former

Are you happy to address the former independent of the latter? As I understand it this would address your issue, and would avoid pulling in the more involved part with the runtime abstraction.

I caveat this by adding - I haven't had time to go into this in detail yet, so please correct me if I am missing something! I also note that discussions about the Runtime abstraction have been long running and nuanced and I think if we can remove that from the scope of this it will be much easier to review and get a PR through.

@bryantbiggs
Contributor Author

yes! let me see what I can do to break it down and split those. thank you for taking a look!

@scottgerring
Member

scottgerring commented Feb 20, 2026

I'd suggest getting the first (the bug fix) in as a PR and having a chat about the second (what to do with the runtime abstractions); it probably needs an ADR as there is a fair bit involved.

You can also find us in the CNCF slack https://communityinviter.com/apps/cloud-native/cncf and #opentelemetry-rust if you like!

I'm off for the weekend, but I will have cycles next week. Have a good one!

@bryantbiggs
Contributor Author

Split the bug fix into a standalone PR as suggested: #3380

That PR contains only the BlockingStrategy piece — 4 files, +68/-6 lines. The experimental async runtime removal is excluded and can be discussed separately (likely needs an ADR per Scott's suggestion).

Closing this in favor of #3380.

@bryantbiggs bryantbiggs deleted the fix/otlp-deadlock-runtime-consolidation branch February 20, 2026 21:54
bryantbiggs added a commit to bryantbiggs/opentelemetry-rust that referenced this pull request Feb 21, 2026
Add tests with TokioSpawn*Exporter mocks that call tokio::spawn()
inside export(), simulating tonic/gRPC exporters. These prove that
BlockingStrategy correctly provides tokio runtime context on the
processor's dedicated OS thread, preventing deadlocks on constrained
multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).

Development

Successfully merging this pull request may close these issues.

OTLP MetricExporter deadlock issue

4 participants