Skip to content

Make JSON-RPC request id allocation collision-free and resilient to cancellation#84

Open
PeterPalenik wants to merge 13 commits intohappyleavesaoc:masterfrom
PeterPalenik:fix/protocol-race-keyerror
Open

Make JSON-RPC request id allocation collision-free and resilient to cancellation#84
PeterPalenik wants to merge 13 commits intohappyleavesaoc:masterfrom
PeterPalenik:fix/protocol-race-keyerror

Conversation

@PeterPalenik
Copy link
Copy Markdown

@PeterPalenik PeterPalenik commented May 2, 2026

Opening this after chasing down a bunch of symptoms downstream in Music
Assistant: occasional KeyError: 385 / KeyError: 518 followed by
Server disconnected: <id>, also reported in
music-assistant/support#4435 (where a commenter
independently traced it to this library). The trail led back to this
library's JSON-RPC request handling.

What was broken

Three things compound under concurrent load:

  1. Request ids came from random.randint(1, 1000). With ~50 in-flight
    calls the birthday paradox already gives a >70% collision probability.
    When two callers pick the same id, the second one's _buffer[id] = ...
    overwrites the first; responses get cross-routed and the original waiter
    hangs forever.

  2. When a response arrived for an id no longer in _buffer (cancellation,
    late server reply, or the collision case above), handle_response did
    self._buffer[id] and the KeyError propagated out. That surfaced in
    the asyncio transport and disconnected the snapserver session.

  3. Cancelled request() calls (e.g. via asyncio.wait_for timeout) left
    _buffer[id] behind, so the dict slowly grew and stale responses had a
    chance to land in it later.

Fix

A monotonic counter under threading.Lock replaces random.randint. The
lock is overkill for asyncio under current CPython, but I'd rather have it
than rely on itertools.count atomicity, which is implementation-defined
under PEP 703 free-threaded builds.

handle_response uses .get() and silently drops responses for unknown
ids, with a DEBUG log so future regressions stay diagnosable. And
request() runs under try/finally so _buffer.pop(id, None) always runs.

Testing

I patched the lib into my running Music Assistant container (snapserver
0.35.0 on the other side) and ran 200 concurrent server.rpc_version()
calls:

Before After
Successful 174 / 200 (87%) 200 / 200 (100%)
Timeouts (10s deadline) 26 0
Wall time 10.0s 0.1s

Same script, same snapserver, only protocol.py differs. The 26 timeouts
match the ~19 collision pairs the birthday paradox predicts for
randint(1,1000) at that load.

@PeterPalenik PeterPalenik force-pushed the fix/protocol-race-keyerror branch from e51f8c7 to ce48ff9 Compare May 2, 2026 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant