
Potential race condition during startup #22

@jhillyerd

tl;dr: during Stratus startup, there is a tiny window in which an incoming socket message goes to the owning process/supervisor instead of the Stratus actor.

Backstory

I have a project that talks to Home Assistant (HA) over a websocket. After upgrading to Stratus 2.0, my project stopped working. I eventually traced it back to not receiving the initial auth_required message from HA, which should be sent immediately upon connecting. I couldn't figure out the cause, so I let an LLM grind on it for quite a while, and eventually it found a race condition: there is a tiny window of time in which a message goes to the owning process/supervisor instead of the Stratus actor. My local Home Assistant instance must be responding fast enough to fit in that window.

I will send a PR which fixes the issue for me shortly. It's small, but not exactly elegant. :)

The LLM's summary:

Root Cause

There is a race condition in stratus.start/1 (in build/packages/stratus/src/stratus.gleam) between:

  1. Handshake completion - The WebSocket handshake is performed by the calling process (in our case, the OTP supervisor context)
  2. Socket data arriving - Home Assistant sends auth_required immediately after handshake
  3. Socket ownership transfer - controlling_process/2 transfers socket ownership to the new actor

The timeline is:

1. Handshake completes (socket owned by supervisor context)
2. Home Assistant sends auth_required message
3. SSL socket message arrives in SUPERVISOR's mailbox
4. controlling_process() transfers socket to actor
5. Actor starts with selector, but auth_required is already in supervisor's mailbox
6. Actor never receives auth_required, stays in AuthPending state forever
7. Dependent actors time out waiting for HA responses
8. Supervisor crashes with init_failed
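
To make the ordering problem concrete, here is a minimal pure-Gleam model of the timeline above. It does not use Stratus or real sockets; the types and the deliver function are hypothetical names invented for illustration, and the only assumption is the one the timeline already makes: a socket message lands in the mailbox of whichever process owns the socket at the moment the data arrives.

```gleam
import gleam/list

// Hypothetical model types, not part of Stratus.
pub type Owner {
  Caller
  Actor
}

pub type Event {
  // Data arrives on the socket.
  Data(String)
  // controlling_process hands the socket to the actor.
  Transfer
}

pub type Mailboxes {
  Mailboxes(caller: List(String), actor: List(String))
}

// Replays events in order. The socket starts out owned by the caller,
// and each Data event is appended to the current owner's mailbox.
pub fn deliver(events: List(Event)) -> Mailboxes {
  let #(_, boxes) =
    list.fold(events, #(Caller, Mailboxes([], [])), fn(state, event) {
      let #(owner, boxes) = state
      case event, owner {
        Transfer, _ -> #(Actor, boxes)
        Data(msg), Caller -> #(
          owner,
          Mailboxes(..boxes, caller: list.append(boxes.caller, [msg])),
        )
        Data(msg), Actor -> #(
          owner,
          Mailboxes(..boxes, actor: list.append(boxes.actor, [msg])),
        )
      }
    })
  boxes
}
```

In this model, deliver([Data("auth_required"), Transfer]) leaves the message in the caller's mailbox and the actor's mailbox empty, which matches steps 3 to 6. Swapping the order to [Transfer, Data("auth_required")] puts it in the actor's mailbox, which is the case where HA responds slowly enough for startup to succeed.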

Evidence

From debug output:

[DEBG] Handshake successful
"HA init: stratus.selecting called"
[EROR] Supervisor received unexpected message: [Ssl(Sslsocket(...), <<compressed data>>)]

The SSL socket message with the auth_required data went to the supervisor instead of the HA actor.

The Bug Location

In stratus.gleam lines ~490-508:

|> actor.start
|> result.map_error(ActorFailed)
|> result.try(fn(started) {
  // PROBLEM: Socket is still owned by calling process at this point!
  // Any incoming data goes to calling process's mailbox.
  case transport {
    Tcp -> tcp.controlling_process(handshake_response.socket, started.pid)
    Ssl -> ssl.controlling_process(handshake_response.socket, started.pid)
  }
  // ...
})

The socket ownership transfer happens after actor.start returns, but by that time the server may have already sent data that ends up in the wrong mailbox.
