fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438) #516
Conversation
fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438)

The 'disconnected' handler in setReconnect() scheduled a setTimeout without storing its ID, making it impossible to cancel. When 'failed' fired shortly after, the immediate reconnect() could fail and reset isReconnecting before the 1500ms timer fired, allowing a second independent retry loop to spawn.

Changes:
- Store _disconnectTimerId and _retryTimerId references
- Clear both timers in stop() and at the start of reconnect()
- Add _reconnectGeneration counter so stale retry callbacks are no-ops

Co-Authored-By: craig.johnston <cjohn@dolby.com>
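The core store-and-clear pattern from the commit message can be sketched as follows. This is a minimal illustration: the variable name mirrors `_disconnectTimerId` from the commit, but everything else is hypothetical, not the actual `BaseWebRTC.js` code.

```javascript
// Sketch of the store-and-clear pattern described in the commit message.
// Only the name disconnectTimerId comes from the PR; the rest is illustrative.
let disconnectTimerId = null

function scheduleDisconnectRetry (fn) {
  // Keep the timer ID so stop()/reconnect() can cancel the pending attempt.
  disconnectTimerId = setTimeout(fn, 1500)
}

function cancelDisconnectRetry () {
  clearTimeout(disconnectTimerId) // clearTimeout(null) is a safe no-op
  disconnectTimerId = null
}
```

Without the stored ID, the scheduled callback cannot be cancelled, which is exactly the gap the commit closes.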
Avoid incrementing _reconnectGeneration when reconnect() is called while isReconnecting is true (nested no-op calls from initConnection). This prevents stale generation checks from permanently breaking the retry loop. Co-Authored-By: craig.johnston <cjohn@dolby.com>
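The ordering subtlety described in this commit can be illustrated with a hypothetical sketch (not the actual source): the `isReconnecting` guard must run before the generation is bumped, so a nested no-op call cannot invalidate the pending retry callback.

```javascript
// Hypothetical sketch of the guard ordering described above: check
// isReconnecting BEFORE bumping the generation, so a nested no-op call
// (e.g. from initConnection) leaves the pending retry's generation valid.
let isReconnecting = false
let reconnectGeneration = 0

function reconnect () {
  if (isReconnecting) return // nested call: no-op, generation untouched
  isReconnecting = true
  reconnectGeneration++ // only a real reconnect cycle starts a new generation
}

reconnect() // first call starts generation 1
reconnect() // nested call leaves the generation (and any pending retry) intact
```

If the increment ran before the guard, every nested call would stale-out the in-flight retry callback and permanently break the loop, which is the failure mode this commit avoids.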
```javascript
  this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') })
} else if (state === 'disconnected') {
  this.alreadyDisconnected = true
  setTimeout(() => this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') }), 1500)
```
Do we actually need a separate timer when we receive a disconnected, why couldn't we immediately invoke the reconnect here? Seems like 1500ms is a hack of some sorts. Either way it should work the way it is.
Good question. The 1500ms delay is pre-existing behavior that this PR preserves intentionally. The original intent appears to be giving the WebRTC ICE agent time to recover from a transient 'disconnected' state before triggering a full reconnect — ICE can sometimes transition disconnected → connected on its own if it's just a brief network blip. The immediate path (line 113) only fires when alreadyDisconnected && firstReconnection, i.e., the second consecutive disconnect or a 'failed' state, which are more definitive signals.
Agreed it feels like a heuristic/hack — a cleaner approach might be to wait for 'failed' exclusively rather than using a time-based guess. But that's a behavioral change worth its own discussion/ticket. This PR keeps the scope narrow to just fixing the timer leak and race condition.
@Yousif-CS if you want more changes (e.g. changing this to align), just ask for them. Devin will do them for you.
Summary
Fixes a race condition in `BaseWebRTC.reconnect()` where two independent retry loops can run simultaneously, destructively competing by each calling `stop()` and tearing down the other's connection.

Root cause: The `'disconnected'` handler in `setReconnect()` schedules a `setTimeout(reconnect, 1500)` without storing the timer ID. If `'failed'` fires shortly after and triggers an immediate `reconnect()` that fails, `isReconnecting` resets to `false` before the 1500ms timer fires, allowing the delayed callback to pass all guards and spawn a second parallel retry loop. Neither `stop()` nor `reconnect()` could cancel the pending timer since no reference was stored.

Fix (single file: `BaseWebRTC.js`):
- Store `_disconnectTimerId` (1500ms delay) and `_retryTimerId` (exponential backoff retry)
- Clear both timers in `stop()` and at the top of `reconnect()`
- Add a `_reconnectGeneration` counter: incremented on each `reconnect()` entry, checked in the retry callback so stale timers from a previous reconnect cycle become no-ops

Tracked in OPTI-2438.
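Taken together, the fix can be sketched as the following pattern. Only the field names (`_disconnectTimerId`, `_retryTimerId`, `_reconnectGeneration`, `isReconnecting`) come from this PR; the class and its logic are an illustrative sketch, and the real `BaseWebRTC.js` implementation differs in detail.

```javascript
// Illustrative sketch of the timer-tracking + generation-guard pattern.
// Field names follow the PR description; the logic is simplified.
class ReconnectSketch {
  constructor () {
    this._disconnectTimerId = null
    this._retryTimerId = null
    this._reconnectGeneration = 0
    this.isReconnecting = false
  }

  _clearTimers () {
    clearTimeout(this._disconnectTimerId) // clearTimeout(null) is a no-op
    clearTimeout(this._retryTimerId)
    this._disconnectTimerId = null
    this._retryTimerId = null
  }

  onDisconnected () {
    // Store the ID so stop()/reconnect() can cancel the 1500ms grace timer.
    this._disconnectTimerId = setTimeout(() => this.reconnect(), 1500)
  }

  reconnect () {
    if (this.isReconnecting) return // nested no-op: leave pending retry intact
    this._clearTimers()
    this.isReconnecting = true
    const generation = ++this._reconnectGeneration
    this._retryTimerId = setTimeout(() => {
      if (generation !== this._reconnectGeneration) return // stale cycle: no-op
      // ...attempt the actual reconnection here...
    }, 2000)
  }

  stop () {
    this._clearTimers()
    this.isReconnecting = false
  }
}
```

Because `stop()` cancels both the grace timer and any scheduled retry, no callback from a torn-down cycle can fire afterwards, which is what prevents the second parallel loop.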
Review & Testing Checklist for Human
- The retry callback captures `generation` in its closure and compares it against `_reconnectGeneration` before firing. Confirm this correctly prevents stale retry callbacks without accidentally blocking legitimate retries.
- The immediate path calls `reconnect()` directly, which clears timers and checks `isReconnecting` at entry. Verify this is sufficient deduplication, especially under rapid `disconnected → failed → disconnected` state transitions.
- Only `codec.test.js` is in the test suite. Consider adding tests that mock PeerConnection state transitions and verify only a single retry loop is active at any time. At minimum, manually test with a network disruption scenario (e.g., toggle network on a viewer with `autoReconnect=true`) and confirm only one reconnect sequence runs.

Notes
- `clearTimeout(null)` is a safe no-op, so the redundant clearing in `stop()` (called from within `reconnect()` after timers are already nulled) is harmless.
- This PR does not restructure the broader reconnect state machine (`isReconnecting`, `stopReconnection`, `alreadyDisconnected`, `firstReconnection`), which has its own complexity worth revisiting separately.

Link to Devin session: https://dolby.devinenterprise.com/sessions/85dde289bf444b2f8be31933bf10cb7d
Requested by: @craig-johnston