fix(cdp): coalesce concurrent reconnects, fix flapping breaker, plug load-event listener leak#18
Conversation
…load-event listener leak
### Reconnect race
`withAutoReconnect` used an `isReconnecting` boolean flag and busy-
waited on it for up to 6s. Two concurrent operations entering the
catch branch at the same time (which happens during the `comet_ask`
polling loop, where `safeEvaluate` may be in flight while a separate
`extractAgentStatus` evaluate is also running) BOTH observe
`isReconnecting === false`, BOTH increment `reconnectAttempts`, BOTH
set the flag, and BOTH call `this.reconnect()` — which closes and
reopens `this.client`. The second close races with the first
connect, corrupting `this.client` and producing cascading errors
("not open / readyState") that re-enter the same code path.
Replaced with a single in-flight `reconnectPromise` cache. New callers
that find a pending promise simply `await` it; only the first caller
actually starts a reconnect. The promise is cleared in `.finally()`
so the next genuine failure restarts the cycle.
### Flapping breaker
`reconnectAttempts` was reset to 0 on every successful operation. A
connection that succeeds every Nth try ("flapping") would silently
never trip `maxReconnectAttempts`, hiding a degraded transport. The
counter now resets only after 3 consecutive successes.
### Connection-error substring list
`'not found'` was in the list of strings classified as a connection
error. It also matches the legitimate `Error('No file input found...')`
thrown by upload code paths, triggering an unnecessary reconnect
cascade on a normal application error. Removed.
### Sidecar in fresh-start fallback
The fresh-start path inside `withAutoReconnect` picked
`targets.find(t => t.url.includes('perplexity'))` without excluding
the sidecar — same bug as `reconnect()` itself (fixed in the
companion sidecar-filter PR). Aligned here too so this PR is
self-contained.
### Page.loadEventFired listener leak
`navigateWithRetry` did `Promise.race([loadEventFired(), timeoutPromise])`.
When the timeout wins, the `Page.loadEventFired` listener stays
registered on the CDP client. Over a long session with repeated
timeouts, listeners accumulate (CDP eventually emits a
"MaxListenersExceededWarning").
Replaced with `client.once('Page.loadEventFired', onLoad)` plus an
explicit `client.removeListener` on the timeout branch, so the
listener is always removed exactly once.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RapierCraft
left a comment
There was a problem hiding this comment.
Code Review: PR #18 — Reconnect Coalescing, Flapping Breaker Fix, Listener Leak Plug
Reviewed commit: ebf055e
Overall Assessment: APPROVE with minor observations
This is a well-reasoned set of reliability fixes. The PR description accurately diagnoses each problem and the implementations are sound. I found no bugs, race conditions, or resource leaks in the proposed code.
1. Reconnection Coalescing (ensureSingleReconnect + reconnectPromise)
Verdict: Correct
The pattern of caching a single in-flight promise and clearing it in .finally() is the textbook solution for coalescing concurrent async operations. Key observations:
-
Race-free: Only the first caller entering
ensureSingleReconnect()whenreconnectPromise === nullcreates the promise. All subsequent callers get the same promise reference. Since JS is single-threaded (no preemption between the null-check and assignment), there is no TOCTOU race. -
Delay calculation uses snapshot:
const attempt = this.reconnectAttemptscaptures the counter at call time. This is correct — if two callers enterwithAutoReconnect's catch branch concurrently, only the first increments and initiates; the second awaits the existing promise via the guard at the top ofwithAutoReconnect. -
One subtlety worth noting (not a bug): When
ensureSingleReconnectis called from the catch block,this.reconnectAttemptshas already been incremented on line 318 (this.reconnectAttempts++). InsideensureSingleReconnect, the delay formula usesattempt - 1as the exponent (Math.max(attempt - 1, 0)). So the first failure (attempts=1) yields300 * 1.3^0 = 300ms, which matches the old behavior. Good. -
Error propagation: If the reconnect promise rejects, ALL waiters receive the rejection. The
try { await this.reconnectPromise; } catch { /* fall through to retry */ }at the top ofwithAutoReconnectcorrectly swallows this so waiters can attempt their own operation. The caller that initiated the reconnect (in the catch block) will also get the rejection and fall into the fresh-start path. This is correct behavior.
2. Circuit Breaker / Flapping Fix (consecutiveSuccesses)
Verdict: Correct
- Counter increments on success, resets to 0 on error. After 3 consecutive successes,
reconnectAttemptsresets to 0. - A flapping connection (succeed-fail-succeed-fail) never accumulates 3 consecutive successes, so the breaker counter never resets, and it will eventually trip
maxReconnectAttempts. This correctly fixes the described bug.
Minor observation (non-blocking): The threshold of 3 is hardcoded. Consider extracting to a named constant (STABILITY_THRESHOLD = 3) for clarity, but this is purely cosmetic given the well-written comment above the field declaration.
3. Load Event Listener Leak Fix
Verdict: Correct
I verified against chrome-remote-interface v0.33 internals:
- The CRI client extends
EventEmitter - Protocol events are emitted as
this.emit('Page.loadEventFired', params, sessionId)(chrome.js:287) - Using
client.once('Page.loadEventFired', onLoad)correctly subscribes to exactly one firing client.removeListener('Page.loadEventFired', onLoad)correctly unsubscribes when timeout wins
The implementation properly ensures the listener is cleaned up in exactly one of two ways:
- Load fires first:
onLoadruns, clears the timeout viaclearTimeout(timer), resolves the promise.onceauto-removes the listener. - Timeout fires first: Timer callback explicitly removes the listener via
removeListener, then rejects.
No leak possible. The old Promise.race([client.Page.loadEventFired(), timeout]) left listeners registered because Page.loadEventFired() internally calls chrome.once(eventName, fulfill) — when the timeout wins, that once listener remains until the next Page.loadEventFired event fires (which may never come for a stuck page), or until the client is destroyed.
4. Removal of 'not found' from connection errors
Verdict: Correct and important
This prevents false-positive reconnect triggers from legitimate application errors like "No file input found" in upload code. Good catch.
5. Sidecar URL exclusion in fresh-start fallback
Verdict: Correct
The fallback now prefers a non-sidecar Perplexity tab but falls back to any Perplexity tab (including sidecar) if no non-sidecar tab exists. This is defensive and appropriate.
Potential Concerns (all non-blocking)
-
No
consecutiveSuccessesreset on reconnect: After a successful reconnect,consecutiveSuccessesis not explicitly reset. The next successful operation will set it to 1. This means after a reconnect, you need 3 more successful ops before the breaker resets. This seems intentional (proving stability post-reconnect) but worth confirming. -
reconnectPromisecleared infinallybefore the retryoperation()call: Afterawait this.ensureSingleReconnect()resolves on line 323 (in the catch path), the promise is already cleared (via.finally()). If the subsequentreturn await operation()on line 324 fails with another connection error, the next recursion throughwithAutoReconnectwould create a new reconnect promise. This is correct — each retry cycle gets its own reconnect attempt, and the coalescing only applies to concurrent callers during a single reconnect window. -
Thread safety of
reconnectAttempts++: Multiple concurrent callers could both hitthis.reconnectAttempts++before either reachesensureSingleReconnect. Since JS is single-threaded, the increment is atomic — but theoretically two concurrent callers hitting the catch block for different errors could both increment the counter (bumping it by 2 for one "logical" failure). In practice this is harmless — it just means the breaker trips slightly sooner under concurrent failures, which is conservative and safe.
Summary
All three fixes are well-implemented with no bugs or leaks found. The code is clean, well-commented, and the PR description is thorough. The public API is unchanged.
Recommend: merge.
Summary
Three independent reliability fixes around the auto-reconnect path.
1. Reconnect race
`withAutoReconnect` used an `isReconnecting` boolean and busy-waited on it. Two concurrent operations entering the catch branch at the same time (very real during the `comet_ask` polling loop where a `safeEvaluate` may be in flight while `extractAgentStatus` is also evaluating) both observe the flag as false, both increment the counter, both set it, both call `this.reconnect()` — which closes and reopens `this.client`. The second close races with the first connect, corrupting state and producing cascading `not open / readyState` errors that re-enter the same path.
Replaced with a single in-flight `reconnectPromise` cache. New callers that find a pending promise simply `await` it; only the first caller actually starts a reconnect. The promise is cleared in `.finally()` so the next genuine failure restarts the cycle.
2. Flapping breaker
`reconnectAttempts` was reset to 0 on every successful operation. A flapping connection that succeeds every Nth try would silently never trip `maxReconnectAttempts`, hiding a degraded transport. Counter now resets only after 3 consecutive successes.
3. Page.loadEventFired listener leak
`navigateWithRetry` did `Promise.race([loadEventFired(), timeout])`. When the timeout wins, the underlying `Page.loadEventFired` listener stays registered. Over a long session with repeated timeouts, listeners accumulate (eventually a `MaxListenersExceededWarning` from EventEmitter).
Replaced with `client.once('Page.loadEventFired', onLoad)` plus an explicit `removeListener` on the timeout branch, so the listener is always removed exactly once.
Bonus cleanups
Compatibility
Public API unchanged. Internal state (
isReconnectingboolean) is replaced byreconnectPromise— no consumers reach into private state.Test plan
🤖 Generated with Claude Code