Skip to content

Review thread safety of client/API/Evaluation context #96

@toddbaert

Description

@toddbaert

Review the thread safety of the SDK to make sure there's no potential concurrency issues, particularly around state maintained in the global API object, clients, and evaluation context objects.

See here for a similar discussion in the Java SDK and others.

Background

Python's GIL is sometimes mistaken for making threading a non-issue (like Node's single-threaded model), but they're not the same. The GIL only makes individual bytecode operations atomic; compound operations like check-then-act or read-modify-write are not safe. Providers commonly run background threads for connection management and emit events from them, so these races aren't just theoretical.

Audit

Here's what we found. The only locks in the SDK today are in _event_support.py (and even those have some gaps). Everything else is unprotected.

  • ProviderRegistry singleton (provider/_registry.py) - this is the big one. No locks at all. set_provider() and set_default_provider() have check-then-act patterns with side effects (initialize/shutdown), so concurrent calls can double-shutdown, lose providers, or initialize providers that never get registered. clear_providers() has a window where shut-down providers are still returned to callers.
  • Global _hooks list (hook/__init__.py) - add_hooks does _hooks = _hooks + hooks which is a read-concat-rebind; two concurrent calls can lose one set of hooks
  • api.clear_providers() (api.py) - calls provider_registry.clear_providers() then _event_support.clear() as two separate steps; event handlers still exist but providers are gone in between
  • _event_support.py - has RLock (good), but immediate handler execution in add_*_handler runs outside the lock, and clear() acquires two separate locks non-atomically
  • Global _evaluation_context (evaluation_context/__init__.py) - no lock; TOCTOU across the evaluation pipeline (context can change between read and use)
  • Global transaction context propagator (transaction_context/__init__.py) - no lock; read-then-call-method races with propagator swap
  • Client instance state (client.py) - self.hooks has the same lost-update pattern as global hooks; _assert_provider_status reads self.provider again instead of using the reference captured earlier in the evaluation flow, so the status check can be for a different provider than the one actually used
  • AbstractProvider._on_emit (provider/__init__.py) - emit() does if hasattr(self, "_on_emit"): self._on_emit(...) which is a TOCTOU; detach() during shutdown can delete _on_emit while a background thread is between the check and the call

Definition of done

  • issues created for thread safety of global API, client, and evaluation context

Metadata

Metadata

Assignees

No one assigned

    Labels

    1.0-releaseRequired for a 1.0 release

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions