fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438) #516
Conversation
fix: prevent parallel reconnect loops by storing and clearing timer references (OPTI-2438)

The 'disconnected' handler in setReconnect() scheduled a setTimeout without storing its ID, making it impossible to cancel. When 'failed' fired shortly after, the immediate reconnect() could fail and reset isReconnecting before the 1500ms timer fired, allowing a second independent retry loop to spawn.

Changes:
- Store _disconnectTimerId and _retryTimerId references
- Clear both timers in stop() and at the start of reconnect()
- Add _reconnectGeneration counter so stale retry callbacks are no-ops

Co-Authored-By: craig.johnston <cjohn@dolby.com>
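The core store-and-clear pattern from the commit message can be sketched as follows. This is a minimal illustration: the variable name mirrors `_disconnectTimerId` from the commit, but everything else is hypothetical, not the actual `BaseWebRTC.js` code.

```javascript
// Sketch of the store-and-clear pattern described in the commit message.
// Only the name disconnectTimerId comes from the PR; the rest is illustrative.
let disconnectTimerId = null

function scheduleDisconnectRetry (fn) {
  // Keep the timer ID so stop()/reconnect() can cancel the pending attempt.
  disconnectTimerId = setTimeout(fn, 1500)
}

function cancelDisconnectRetry () {
  clearTimeout(disconnectTimerId) // clearTimeout(null) is a safe no-op
  disconnectTimerId = null
}
```

Without the stored ID, the scheduled callback cannot be cancelled, which is exactly the gap the commit closes.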
Avoid incrementing _reconnectGeneration when reconnect() is called while isReconnecting is true (nested no-op calls from initConnection). This prevents stale generation checks from permanently breaking the retry loop. Co-Authored-By: craig.johnston <cjohn@dolby.com>
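The ordering subtlety described in this commit can be illustrated with a hypothetical sketch (not the actual source): the `isReconnecting` guard must run before the generation is bumped, so a nested no-op call cannot invalidate the pending retry callback.

```javascript
// Hypothetical sketch of the guard ordering described above: check
// isReconnecting BEFORE bumping the generation, so a nested no-op call
// (e.g. from initConnection) leaves the pending retry's generation valid.
let isReconnecting = false
let reconnectGeneration = 0

function reconnect () {
  if (isReconnecting) return // nested call: no-op, generation untouched
  isReconnecting = true
  reconnectGeneration++ // only a real reconnect cycle starts a new generation
}

reconnect() // first call starts generation 1
reconnect() // nested call leaves the generation (and any pending retry) intact
```

If the increment ran before the guard, every nested call would stale-out the in-flight retry callback and permanently break the loop, which is the failure mode this commit avoids.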
```javascript
  this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') })
} else if (state === 'disconnected') {
  this.alreadyDisconnected = true
  setTimeout(() => this.reconnect({ error: new Error('Connection state change: RTCPeerConnectionState disconnected') }), 1500)
```
Do we actually need a separate timer when we receive a disconnected, why couldn't we immediately invoke the reconnect here? Seems like 1500ms is a hack of some sorts. Either way it should work the way it is.
Good question. The 1500ms delay is pre-existing behavior that this PR preserves intentionally. The original intent appears to be giving the WebRTC ICE agent time to recover from a transient 'disconnected' state before triggering a full reconnect — ICE can sometimes transition disconnected → connected on its own if it's just a brief network blip. The immediate path (line 113) only fires when alreadyDisconnected && firstReconnection, i.e., the second consecutive disconnect or a 'failed' state, which are more definitive signals.
Agreed it feels like a heuristic/hack — a cleaner approach might be to wait for 'failed' exclusively rather than using a time-based guess. But that's a behavioral change worth its own discussion/ticket. This PR keeps the scope narrow to just fixing the timer leak and race condition.
@Yousif-CS if you want more changes (e.g. changing this to align), just ask for them. Devin will do them for you.
Summary
Fixes a race condition in `BaseWebRTC.reconnect()` where two independent retry loops can run simultaneously, destructively competing by each calling `stop()` and tearing down the other's connection.

Root cause: The `'disconnected'` handler in `setReconnect()` schedules a `setTimeout(reconnect, 1500)` without storing the timer ID. If `'failed'` fires shortly after and triggers an immediate `reconnect()` that fails, `isReconnecting` resets to `false` before the 1500ms timer fires, allowing the delayed callback to pass all guards and spawn a second parallel retry loop. Neither `stop()` nor `reconnect()` could cancel the pending timer since no reference was stored.

Fix (single file: `BaseWebRTC.js`):
- Store `_disconnectTimerId` (1500ms delay) and `_retryTimerId` (exponential backoff retry)
- Clear both timers in `stop()` and at the top of `reconnect()`
- Add a `_reconnectGeneration` counter: incremented on each `reconnect()` entry, checked in the retry callback so stale timers from a previous reconnect cycle become no-ops

Tracked in OPTI-2438.
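Taken together, the fix can be sketched as the following pattern. Only the field names (`_disconnectTimerId`, `_retryTimerId`, `_reconnectGeneration`, `isReconnecting`) come from this PR; the class and its logic are an illustrative sketch, and the real `BaseWebRTC.js` implementation differs in detail.

```javascript
// Illustrative sketch of the timer-tracking + generation-guard pattern.
// Field names follow the PR description; the logic is simplified.
class ReconnectSketch {
  constructor () {
    this._disconnectTimerId = null
    this._retryTimerId = null
    this._reconnectGeneration = 0
    this.isReconnecting = false
  }

  _clearTimers () {
    clearTimeout(this._disconnectTimerId) // clearTimeout(null) is a no-op
    clearTimeout(this._retryTimerId)
    this._disconnectTimerId = null
    this._retryTimerId = null
  }

  onDisconnected () {
    // Store the ID so stop()/reconnect() can cancel the 1500ms grace timer.
    this._disconnectTimerId = setTimeout(() => this.reconnect(), 1500)
  }

  reconnect () {
    if (this.isReconnecting) return // nested no-op: leave pending retry intact
    this._clearTimers()
    this.isReconnecting = true
    const generation = ++this._reconnectGeneration
    this._retryTimerId = setTimeout(() => {
      if (generation !== this._reconnectGeneration) return // stale cycle: no-op
      // ...attempt the actual reconnection here...
    }, 2000)
  }

  stop () {
    this._clearTimers()
    this.isReconnecting = false
  }
}
```

Because `stop()` cancels both the grace timer and any scheduled retry, no callback from a torn-down cycle can fire afterwards, which is what prevents the second parallel loop.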
Review & Testing Checklist for Human
- The retry callback captures `generation` in its closure and compares it against `_reconnectGeneration` before firing. Confirm this correctly prevents stale retry callbacks without accidentally blocking legitimate retries.
- The immediate path calls `reconnect()` directly, which clears timers and checks `isReconnecting` at entry. Verify this is sufficient deduplication, especially under rapid `disconnected → failed → disconnected` state transitions.
- Only `codec.test.js` is in the test suite. Consider adding tests that mock PeerConnection state transitions and verify only a single retry loop is active at any time. At minimum, manually test with a network disruption scenario (e.g., toggle network on a viewer with `autoReconnect=true`) and confirm only one reconnect sequence runs.

Notes
- `clearTimeout(null)` is a safe no-op, so the redundant clearing in `stop()` (called from within `reconnect()` after timers are already nulled) is harmless.
- This PR does not restructure the broader reconnect state machine (`isReconnecting`, `stopReconnection`, `alreadyDisconnected`, `firstReconnection`), which has its own complexity worth revisiting separately.

Link to Devin session: https://dolby.devinenterprise.com/sessions/85dde289bf444b2f8be31933bf10cb7d
Requested by: @craig-johnston