Skip to content

fix(cdp): coalesce concurrent reconnects, fix flapping breaker, plug load-event listener leak#18

Merged
RapierCraft merged 1 commit into
RapierCraft:mainfrom
sergei-aronsen:fix/reconnect-race
May 16, 2026
Merged

fix(cdp): coalesce concurrent reconnects, fix flapping breaker, plug load-event listener leak#18
RapierCraft merged 1 commit into
RapierCraft:mainfrom
sergei-aronsen:fix/reconnect-race

Conversation

@sergei-aronsen
Copy link
Copy Markdown

Summary

Three independent reliability fixes around the auto-reconnect path.

1. Reconnect race

`withAutoReconnect` used an `isReconnecting` boolean and busy-waited on it. Two concurrent operations entering the catch branch at the same time (very real during the `comet_ask` polling loop where a `safeEvaluate` may be in flight while `extractAgentStatus` is also evaluating) both observe the flag as false, both increment the counter, both set it, both call `this.reconnect()` — which closes and reopens `this.client`. The second close races with the first connect, corrupting state and producing cascading `not open / readyState` errors that re-enter the same path.

Replaced with a single in-flight `reconnectPromise` cache. New callers that find a pending promise simply `await` it; only the first caller actually starts a reconnect. The promise is cleared in `.finally()` so the next genuine failure restarts the cycle.

2. Flapping breaker

`reconnectAttempts` was reset to 0 on every successful operation. A flapping connection that succeeds every Nth try would silently never trip `maxReconnectAttempts`, hiding a degraded transport. Counter now resets only after 3 consecutive successes.

3. Page.loadEventFired listener leak

`navigateWithRetry` did `Promise.race([loadEventFired(), timeout])`. When the timeout wins, the underlying `Page.loadEventFired` listener stays registered. Over a long session with repeated timeouts, listeners accumulate (eventually a `MaxListenersExceededWarning` from EventEmitter).

Replaced with `client.once('Page.loadEventFired', onLoad)` plus an explicit `removeListener` on the timeout branch, so the listener is always removed exactly once.

Bonus cleanups

  • Removed `'not found'` from the connection-error substring list — it falsely matched the legitimate `Error('No file input found...')` thrown by upload code, triggering unnecessary reconnects on normal application errors.
  • Fresh-start fallback inside `withAutoReconnect` now excludes the sidecar URL when picking a target tab (matches the rationale in the companion sidecar-filter PR).

Compatibility

Public API unchanged. Internal state (isReconnecting boolean) is replaced by reconnectPromise — no consumers reach into private state.

Test plan

  • `npm run test:unit` passes
  • Manual: forced disconnect during a polling `comet_ask` recovers cleanly without cascading errors
  • Manual: `navigateWithRetry` to a slow URL eventually succeeds without `MaxListenersExceededWarning`
  • Manual: a flapping connection (succeed/fail alternating) eventually trips the breaker instead of looping forever

🤖 Generated with Claude Code

…load-event listener leak

### Reconnect race

`withAutoReconnect` used an `isReconnecting` boolean flag and busy-
waited on it for up to 6s. Two concurrent operations entering the
catch branch at the same time (which happens during the `comet_ask`
polling loop, where `safeEvaluate` may be in flight while a separate
`extractAgentStatus` evaluate is also running) BOTH observe
`isReconnecting === false`, BOTH increment `reconnectAttempts`, BOTH
set the flag, and BOTH call `this.reconnect()` — which closes and
reopens `this.client`. The second close races with the first
connect, corrupting `this.client` and producing cascading errors
("not open / readyState") that re-enter the same code path.

Replaced with a single in-flight `reconnectPromise` cache. New callers
that find a pending promise simply `await` it; only the first caller
actually starts a reconnect. The promise is cleared in `.finally()`
so the next genuine failure restarts the cycle.

### Flapping breaker

`reconnectAttempts` was reset to 0 on every successful operation. A
connection that succeeds every Nth try ("flapping") would silently
never trip `maxReconnectAttempts`, hiding a degraded transport. The
counter now resets only after 3 consecutive successes.

### Connection-error substring list

`'not found'` was in the list of strings classified as a connection
error. It also matches the legitimate `Error('No file input found...')`
thrown by upload code paths, triggering an unnecessary reconnect
cascade on a normal application error. Removed.

### Sidecar in fresh-start fallback

The fresh-start path inside `withAutoReconnect` picked
`targets.find(t => t.url.includes('perplexity'))` without excluding
the sidecar — same bug as `reconnect()` itself (fixed in the
companion sidecar-filter PR). Aligned here too so this PR is
self-contained.

### Page.loadEventFired listener leak

`navigateWithRetry` did `Promise.race([loadEventFired(), timeoutPromise])`.
When the timeout wins, the `Page.loadEventFired` listener stays
registered on the CDP client. Over a long session with repeated
timeouts, listeners accumulate (CDP eventually emits a
"MaxListenersExceededWarning").

Replaced with `client.once('Page.loadEventFired', onLoad)` plus an
explicit `client.removeListener` on the timeout branch, so the
listener is always removed exactly once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@RapierCraft RapierCraft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #18 — Reconnect Coalescing, Flapping Breaker Fix, Listener Leak Plug

Reviewed commit: ebf055e

Overall Assessment: APPROVE with minor observations

This is a well-reasoned set of reliability fixes. The PR description accurately diagnoses each problem and the implementations are sound. I found no bugs, race conditions, or resource leaks in the proposed code.


1. Reconnection Coalescing (ensureSingleReconnect + reconnectPromise)

Verdict: Correct

The pattern of caching a single in-flight promise and clearing it in .finally() is the textbook solution for coalescing concurrent async operations. Key observations:

  • Race-free: Only the first caller entering ensureSingleReconnect() when reconnectPromise === null creates the promise. All subsequent callers get the same promise reference. Since JS is single-threaded (no preemption between the null-check and assignment), there is no TOCTOU race.

  • Delay calculation uses snapshot: const attempt = this.reconnectAttempts captures the counter at call time. This is correct — if two callers enter withAutoReconnect's catch branch concurrently, only the first increments and initiates; the second awaits the existing promise via the guard at the top of withAutoReconnect.

  • One subtlety worth noting (not a bug): When ensureSingleReconnect is called from the catch block, this.reconnectAttempts has already been incremented on line 318 (this.reconnectAttempts++). Inside ensureSingleReconnect, the delay formula uses attempt - 1 as the exponent (Math.max(attempt - 1, 0)). So the first failure (attempts=1) yields 300 * 1.3^0 = 300ms, which matches the old behavior. Good.

  • Error propagation: If the reconnect promise rejects, ALL waiters receive the rejection. The try { await this.reconnectPromise; } catch { /* fall through to retry */ } at the top of withAutoReconnect correctly swallows this so waiters can attempt their own operation. The caller that initiated the reconnect (in the catch block) will also get the rejection and fall into the fresh-start path. This is correct behavior.

2. Circuit Breaker / Flapping Fix (consecutiveSuccesses)

Verdict: Correct

  • Counter increments on success, resets to 0 on error. After 3 consecutive successes, reconnectAttempts resets to 0.
  • A flapping connection (succeed-fail-succeed-fail) never accumulates 3 consecutive successes, so the breaker counter never resets, and it will eventually trip maxReconnectAttempts. This correctly fixes the described bug.

Minor observation (non-blocking): The threshold of 3 is hardcoded. Consider extracting to a named constant (STABILITY_THRESHOLD = 3) for clarity, but this is purely cosmetic given the well-written comment above the field declaration.

3. Load Event Listener Leak Fix

Verdict: Correct

I verified against chrome-remote-interface v0.33 internals:

  • The CRI client extends EventEmitter
  • Protocol events are emitted as this.emit('Page.loadEventFired', params, sessionId) (chrome.js:287)
  • Using client.once('Page.loadEventFired', onLoad) correctly subscribes to exactly one firing
  • client.removeListener('Page.loadEventFired', onLoad) correctly unsubscribes when timeout wins

The implementation properly ensures the listener is cleaned up in exactly one of two ways:

  1. Load fires first: onLoad runs, clears the timeout via clearTimeout(timer), resolves the promise. once auto-removes the listener.
  2. Timeout fires first: Timer callback explicitly removes the listener via removeListener, then rejects.

No leak possible. The old Promise.race([client.Page.loadEventFired(), timeout]) left listeners registered because Page.loadEventFired() internally calls chrome.once(eventName, fulfill) — when the timeout wins, that once listener remains until the next Page.loadEventFired event fires (which may never come for a stuck page), or until the client is destroyed.

4. Removal of 'not found' from connection errors

Verdict: Correct and important

This prevents false-positive reconnect triggers from legitimate application errors like "No file input found" in upload code. Good catch.

5. Sidecar URL exclusion in fresh-start fallback

Verdict: Correct

The fallback now prefers a non-sidecar Perplexity tab but falls back to any Perplexity tab (including sidecar) if no non-sidecar tab exists. This is defensive and appropriate.


Potential Concerns (all non-blocking)

  1. No consecutiveSuccesses reset on reconnect: After a successful reconnect, consecutiveSuccesses is not explicitly reset. The next successful operation will set it to 1. This means after a reconnect, you need 3 more successful ops before the breaker resets. This seems intentional (proving stability post-reconnect) but worth confirming.

  2. reconnectPromise cleared in finally before the retry operation() call: After await this.ensureSingleReconnect() resolves on line 323 (in the catch path), the promise is already cleared (via .finally()). If the subsequent return await operation() on line 324 fails with another connection error, the next recursion through withAutoReconnect would create a new reconnect promise. This is correct — each retry cycle gets its own reconnect attempt, and the coalescing only applies to concurrent callers during a single reconnect window.

  3. Thread safety of reconnectAttempts++: Multiple concurrent callers could both hit this.reconnectAttempts++ before either reaches ensureSingleReconnect. Since JS is single-threaded, the increment is atomic — but theoretically two concurrent callers hitting the catch block for different errors could both increment the counter (bumping it by 2 for one "logical" failure). In practice this is harmless — it just means the breaker trips slightly sooner under concurrent failures, which is conservative and safe.


Summary

All three fixes are well-implemented with no bugs or leaks found. The code is clean, well-commented, and the PR description is thorough. The public API is unchanged.

Recommend: merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants