Skip to content

fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438)#516

Open
devin-ai-integration[bot] wants to merge 2 commits into
mainfrom
devin/1777004319-fix-reconnect-race-condition
Open

fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438)#516
devin-ai-integration[bot] wants to merge 2 commits into
mainfrom
devin/1777004319-fix-reconnect-race-condition

Conversation

@devin-ai-integration
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot commented Apr 24, 2026

Summary

Fixes a race condition in BaseWebRTC.reconnect() where two independent retry loops can run simultaneously, destructively competing by each calling stop() and tearing down the other's connection.

Root cause: The 'disconnected' handler in setReconnect() schedules a setTimeout(reconnect, 1500) without storing the timer ID. If 'failed' fires shortly after and triggers an immediate reconnect() that fails, isReconnecting resets to false before the 1500ms timer fires — allowing the delayed callback to pass all guards and spawn a second parallel retry loop. Neither stop() nor reconnect() could cancel the pending timer since no reference was stored.

Fix (single file — BaseWebRTC.js):

  • Store timer references in _disconnectTimerId (1500ms delay) and _retryTimerId (exponential backoff retry)
  • Clear both timers in stop() and at the top of reconnect()
  • Add a _reconnectGeneration counter: incremented on each reconnect() entry, checked in the retry callback so stale timers from a previous reconnect cycle become no-ops

Tracked in OPTI-2438.

Review & Testing Checklist for Human

  • Verify generation counter logic: The retry timer (catch block, line ~161) captures generation in its closure and compares against _reconnectGeneration before firing. Confirm this correctly prevents stale retry callbacks without accidentally blocking legitimate retries.
  • Confirm the disconnect timer (line ~116) doesn't need a generation guard: It calls reconnect() directly, which clears timers and checks isReconnecting at entry. Verify this is sufficient deduplication, especially under rapid disconnected → failed → disconnected state transitions.
  • No unit tests exist for reconnection logic — only codec.test.js is in the test suite. Consider adding tests that mock PeerConnection state transitions and verify only a single retry loop is active at any time. At minimum, manually test with a network disruption scenario (e.g., toggle network on a viewer with autoReconnect=true) and confirm only one reconnect sequence runs.

Notes

  • clearTimeout(null) is a safe no-op, so the redundant clearing in stop() (called from within reconnect() after timers are already nulled) is harmless.
  • The fix is intentionally minimal — it does not refactor the existing flag-based state machine (isReconnecting, stopReconnection, alreadyDisconnected, firstReconnection), which has its own complexity worth revisiting separately.

Link to Devin session: https://dolby.devinenterprise.com/sessions/85dde289bf444b2f8be31933bf10cb7d
Requested by: @craig-johnston


Open in Devin Review

…eferences (OPTI-2438)

The 'disconnected' handler in setReconnect() scheduled a setTimeout
without storing its ID, making it impossible to cancel. When 'failed'
fired shortly after, the immediate reconnect() could fail and reset
isReconnecting before the 1500ms timer fired, allowing a second
independent retry loop to spawn.

Changes:
- Store _disconnectTimerId and _retryTimerId references
- Clear both timers in stop() and at the start of reconnect()
- Add _reconnectGeneration counter so stale retry callbacks are no-ops

Co-Authored-By: craig.johnston <cjohn@dolby.com>
@devin-ai-integration
Copy link
Copy Markdown
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 24, 2026

⚠️ No Changeset found

Latest commit: 98d4fb4

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

devin-ai-integration[bot]

This comment was marked as resolved.

Avoid incrementing _reconnectGeneration when reconnect() is called
while isReconnecting is true (nested no-op calls from initConnection).
This prevents stale generation checks from permanently breaking the
retry loop.

Co-Authored-By: craig.johnston <cjohn@dolby.com>
this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') })
} else if (state === 'disconnected') {
this.alreadyDisconnected = true
setTimeout(() => this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') }), 1500)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need a separate timer when we receive a disconnected, why couldn't we immediately invoke the reconnect here? Seems like 1500ms is a hack of some sorts. Either way it should work the way it is.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The 1500ms delay is pre-existing behavior that this PR preserves intentionally. The original intent appears to be giving the WebRTC ICE agent time to recover from a transient 'disconnected' state before triggering a full reconnect — ICE can sometimes transition disconnected → connected on its own if it's just a brief network blip. The immediate path (line 113) only fires when alreadyDisconnected && firstReconnection, i.e., the second consecutive disconnect or a 'failed' state, which are more definitive signals.

Agreed it feels like a heuristic/hack — a cleaner approach might be to wait for 'failed' exclusively rather than using a time-based guess. But that's a behavioral change worth its own discussion/ticket. This PR keeps the scope narrow to just fixing the timer leak and race condition.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yousif-CS if you want more changes (e.g. changing this to align), just ask for them. Devin will do them for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants