Skip to content

fix(screenshot): wait for paint via Page.lifecycleEvent#7

Merged
RapierCraft merged 10 commits into
RapierCraft:mainfrom
madisonrickert:upstream-pr/screenshot-paint-wait
May 16, 2026
Merged

fix(screenshot): wait for paint via Page.lifecycleEvent#7
RapierCraft merged 10 commits into
RapierCraft:mainfrom
madisonrickert:upstream-pr/screenshot-paint-wait

Conversation

@madisonrickert
Copy link
Copy Markdown
Contributor

Summary

  • Subscribes to Page.lifecycleEvent at connect time and accumulates events per (frameId, loaderId) for the current document
  • Adds waitForLifecycle(name, timeoutMs) — returns synchronously if the event already fired for the current document, otherwise waits for the next occurrence (or times out)
  • screenshot() awaits firstContentfulPaint up to 2s before capture

Problem

screenshot() proceeds as soon as the CDP connection is up. On a freshly navigated tab the renderer may not have painted yet, so the captured frame is blank or shows the previous document. document.readyState === 'complete' is too weak a signal — for SPAs it fires when the bundle loads, often before answer content has rendered.

Fix

Lighthouse and Puppeteer use Page.lifecycleEvent to know when the renderer has actually painted. This PR does the same:

// At connect():
await this.client.Page.setLifecycleEventsEnabled({ enabled: true });
this.client.Page.lifecycleEvent((params) => {
  // accumulate (frameId, loaderId) → Set<eventName>
});

// In screenshot():
await this.waitForLifecycle('firstContentfulPaint', 2000);

Properties:

  • No regression on steady-state pages — FCP has already fired, so waitForLifecycle returns synchronously.
  • Bounded blast radius — 2s timeout means a non-painting page degrades to the prior behavior, never hangs longer.
  • Best-effort — if setLifecycleEventsEnabled fails (older protocol versions, etc.), the lifecycle map stays empty and waitForLifecycle returns false immediately.

The new waitForLifecycle helper is also reusable for future ops that want to gate on networkAlmostIdle, load, etc.

Test Plan

  • npm run build passes
  • Verified screenshot on steady-state page returns immediately (FCP cached)
  • Verified screenshot post-navigation waits briefly then captures the new document, not the old one

Dependency

Builds on #6 (the 0×0 viewport fix). The first commit in the diff is #6's commit; once #6 merges, this PR will rebase down to a single commit. Happy to rebase manually if preferred.

Copy link
Copy Markdown
Owner

@RapierCraft RapierCraft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: #7 — fix(screenshot): wait for paint via Page.lifecycleEvent

Verdict: CHANGES REQUESTED

Reviewed commit: HEAD | Scope: src/cdp-client.ts | Depends on: PR #6

Findings

# Finding Severity Confidence
1 Listener leak in waitForLifecycle: (this.client as any)?.removeListener?.('Page.lifecycleEvent', listener) is not the correct API for chrome-remote-interface. The CRI library uses .off() or the event-specific deregistration pattern, not .removeListener(). The listener registered via .on('Page.lifecycleEvent', listener) is never actually removed, leaking a listener on every screenshot that waits for FCP. HIGH CONFIRMED
2 Race condition in waitForLifecycle: The synchronous cache check (for...of frameLifecycle.values()) happens before the listener is registered. If FCP fires in that gap, neither the cache check nor the listener catches it, causing a needless 2s timeout. Fix: register listener first, then check cache. MEDIUM CONFIRMED
3 No cleanup in disconnect(): frameLifecycle map and lifecycleListener are not cleared or deregistered when disconnecting. Map grows unbounded across reconnections to different tabs. MEDIUM CONFIRMED
4 waitForLifecycle is public with no documented contract — should be private since it's an internal implementation detail LOW CONFIRMED
5 Inherits PR #6's || vs && viewport check issue (see PR #6 review) HIGH CONFIRMED

Recommendation

Do not merge until findings #1-3 are resolved:

  • #1: Use (this.client as any).off('Page.lifecycleEvent', listener) or store a reference and use CRI's deregistration API
  • #2: Register the .on() listener before checking the cache, or use a single code path
  • #3: Clear frameLifecycle and remove lifecycleListener in the disconnect() method

This PR depends on #6 — merge #6 first (after its fix), then rebase this.


Review complete. 1 agent (security + infrastructure). 3 blocking findings.

@RapierCraft
Copy link
Copy Markdown
Owner

Hey @madisonrickert — love the lifecycle-based approach, it's how Lighthouse/Puppeteer do it and is much better than readyState polling. Three issues to fix before merging:

1. Listener leak in waitForLifecycle (HIGH)

try { (this.client as any)?.removeListener?.('Page.lifecycleEvent', listener); } catch { /* ignore */ }

chrome-remote-interface doesn't expose removeListener on the client object — this is a no-op, so every waitForLifecycle call that waits (doesn't return synchronously) permanently leaks a listener. Use (this.client as any).off('Page.lifecycleEvent', listener) or store a reference from the CRI registration API.

2. Race condition (MEDIUM)

The synchronous cache check runs before the listener is registered:

// 1. check cache — FCP not found
for (const { events } of this.frameLifecycle.values()) { ... }
// ← FCP fires HERE, updates frameLifecycle, but listener isn't registered yet
// 2. register listener — misses FCP, waits full 2s timeout
return new Promise<boolean>((resolve) => { ... });

Fix: register the listener first, then check the cache inside the promise.

3. No cleanup in disconnect() (MEDIUM)

frameLifecycle map and lifecycleListener are never cleared or deregistered on disconnect. The map grows unbounded across reconnections. Add to disconnect():

this.frameLifecycle.clear();
this.lifecycleListener = null;

Once these are addressed, this is ready to merge (after #6 lands first since this depends on it).

madisonrickert and others added 6 commits May 12, 2026 14:53
Page.captureScreenshot can stall for ~2 minutes when the connected
target reports a 0x0 layout viewport — observed when the MCP attaches
to the Perplexity sidecar panel (visibilityState='hidden') in some
Comet window states (cold launch, no real browsing tabs in front).
Without captureBeyondViewport, captureScreenshot captures the layout
viewport and waits for compositor frames that never arrive.

When Page.getLayoutMetrics reports a zero-size viewport, supply an
explicit clip + captureBeyondViewport=true so the renderer produces
a frame immediately. No-op on real foreground tabs (where the layout
viewport is non-zero, so the clip path doesn't activate).

Closes RapierCraft#2.
…bility

Pulls the screenshot-with-fallback logic out of the CometCDPClient
method body and into an exported free function backed by a minimal
ScreenshotPageAPI interface — same seam pattern as CometAIClient,
so unit tests can substitute a hand-written fake page without a live
CDP connection.

Also promotes the inline 1280x800 fallback clip to a named
SCREENSHOT_FALLBACK_CLIP constant, with the explanatory comment
attached to the constant rather than buried mid-method.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers:
- normal viewport (1024x768) -> no clip; format forwarded as-is
- cssLayoutViewport missing -> falls back to layoutViewport
- jpeg format forwarded
- bringToFront is invoked before capture (best-effort)
- 0x0, 0xN, Nx0, both-fields-absent viewports -> fallback clip +
  captureBeyondViewport=true
- getLayoutMetrics throws -> fallback clip
- bringToFront throws -> swallowed; capture still runs
- captureScreenshot returns empty data -> throws descriptive error

The 0xN / Nx0 cases document the wrapper's defensive || semantics:
any zero dimension triggers the fallback, on the basis that a
0-pixel screenshot is useless regardless of whether the same
compositor stall reproduces. The hang itself is only confirmed for
0x0 in live testing; the wrapper-logic tests here don't claim
otherwise.

Adds tests/unit/fakes/fake-screenshot-page.ts following the existing
FakeCdpClient style (records calls, programmable responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h dims zero

Switches the screenshot-fallback condition from `||` to `&&` so the
fallback clip only activates on a strict 0x0 layout viewport — the only
viewport state we have evidence for actually stalling
Page.captureScreenshot.

Live CDP testing against a running Comet instance:

- Sidecar (visibilityState=hidden, cold launch): natural 0x0
  viewport; naked Page.captureScreenshot timed out at 15s; the
  wrapper completed in ~1s with valid PNG data.
- Forced 0x0 via Emulation.setDeviceMetricsOverride on a normal
  page: Chromium silently rejected the zero values and reported the
  natural viewport (1024x768) instead of 0x0.
- Forced 0xN and Nx0 via Emulation.setDeviceMetricsOverride:
  Chromium rejected the single-zero dimension the same way,
  substituting the natural value for the missing axis.

So a 0xN or Nx0 layout viewport is not a state we can observe in
practice. The `||` form was defensive against a hypothetical that
doesn't appear to be reachable; `&&` matches the predicate to the
evidence and avoids guessing on inputs we can't generate.

Unit-test expectations for the 0xN / Nx0 boundary cases move from
"fallback clip is applied" to "fallback clip is not applied",
documented inline. The 0x0, missing-viewport-fields, and
getLayoutMetrics-throws cases still trigger the fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… tests

The 0×N and N×0 boundary tests previously asserted only the predicate
outcome ("no fallback clip applied") while letting the fake return a
canned valid-PNG byte string — which doesn't match what a real
encoder would do. PNG IHDR and JPEG SOF both require both dimensions
to be >= 1, so a zero-dim viewport can't produce a valid image
regardless of which encoder runs.

Update those two tests to:
  1. Configure the fake's captureScreenshot to return empty data
     (modeling the real encoder's behavior on a zero-dim viewport).
  2. Assert that the wrapper surfaces the failure via the existing
     empty-data guard rather than masking it with a synthetic 1280×800
     capture.

The call shape assertion (`{ format: "png" }`, i.e. no clip /
captureBeyondViewport) is preserved so the predicate decision is still
covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous catch block applied the fallback clip for any throw from
Page.getLayoutMetrics, which masked transport-level failures (websocket
dropped, target detached) behind a synthetic 1280x800 capture. Tighten
the discriminator to only recover from Chrome-side rejections, and
propagate everything else so the caller sees the real failure.

Mechanics:

- Bump chrome-remote-interface from ^0.33.2 to ^0.34.0 to pick up the
  newly-exposed ProtocolError class (added in 82d1d60b upstream).
- @types/chrome-remote-interface hasn't declared ProtocolError yet
  (DefinitelyTyped/DefinitelyTyped#74992 in flight), so bridge the gap
  with a small local cast at import time. Comment marks the cast for
  removal once the upstream types land.
- Switch the catch block to `if (err instanceof ProtocolError)
  { clip = SCREENSHOT_FALLBACK_CLIP } else { throw err }`.

Tests:

- Fake gains `setMetricsToThrowProtocolError()` and
  `setMetricsToThrowGeneric()` (replaces the single `setMetricsToThrow`).
- Splits the existing "getLayoutMetrics throws -> fallback" test into:
  - ProtocolError throw -> fallback clip applied (recovery path)
  - generic Error throw -> propagates, no captureScreenshot attempted

Safety analysis vs. the broad catch we're replacing:
  - getLayoutMetrics returns 0x0:   fallback (unchanged)
  - ProtocolError on unsupported:    fallback (unchanged)
  - 0x0 + ProtocolError on probe:    fallback (unchanged — captureScreenshot
                                     still gets the clip)
  - websocket dropped mid-probe:     propagates (was: silently fell back, then
                                     captureScreenshot also failed with the
                                     same error one call later — same end
                                     state, cleaner provenance)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
madisonrickert added a commit to madisonrickert/Perplexity-Comet-MCP that referenced this pull request May 12, 2026
Addresses review feedback on RapierCraft#7:

- Capture the unsubscribe function returned by client.Page.lifecycleEvent()
  in connect(), store on this.lifecycleUnsubscribe. CRI's domain-event
  callbacks return `() => chrome.removeListener(rawEventName, handler)`
  (api.js:49), which is the idiomatic deregistration path. Replaces the
  prior `(client as any).removeListener(...)` call, which worked (Chrome
  extends EventEmitter) but bypassed the typed API.

- In waitForLifecycle, use the same unsubscribe-function pattern instead
  of `client.on(...)` + `client.removeListener(...)`. Register the
  listener before scanning the cache: single-threaded JS makes the
  synchronous-only path safe today, but ordering matters if anything
  upstream (CRI internals, scheduler) ever inserts a microtask between
  the two operations.

- Mark waitForLifecycle private; only screenshot() uses it.

- Clear frameLifecycle, lifecycleListener, and lifecycleUnsubscribe in
  disconnect() so stale state from the prior tab doesn't carry across a
  disconnect/reconnect cycle.
madisonrickert and others added 4 commits May 12, 2026 16:24
Previously screenshot() proceeded as soon as the connection was up,
which meant a screenshot taken mid-navigation could capture an
unrendered or stale frame. Lighthouse and Puppeteer use the CDP
Page.lifecycleEvent stream to know when the renderer has actually
painted; we now do the same.

- At connect(): enable Page.setLifecycleEventsEnabled and accumulate
  events per (frameId, loaderId).
- New waitForLifecycle(name, timeoutMs) returns immediately if the
  event has already fired for the current document, otherwise waits
  for the next occurrence.
- screenshot() awaits firstContentfulPaint up to 2s before capture.
  Returns immediately for steady-state pages; bounds blast radius if
  the page never paints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review feedback on RapierCraft#7:

- Capture the unsubscribe function returned by client.Page.lifecycleEvent()
  in connect(), store on this.lifecycleUnsubscribe. CRI's domain-event
  callbacks return `() => chrome.removeListener(rawEventName, handler)`
  (api.js:49), which is the idiomatic deregistration path. Replaces the
  prior `(client as any).removeListener(...)` call, which worked (Chrome
  extends EventEmitter) but bypassed the typed API.

- In waitForLifecycle, use the same unsubscribe-function pattern instead
  of `client.on(...)` + `client.removeListener(...)`. Register the
  listener before scanning the cache: single-threaded JS makes the
  synchronous-only path safe today, but ordering matters if anything
  upstream (CRI internals, scheduler) ever inserts a microtask between
  the two operations.

- Mark waitForLifecycle private; only screenshot() uses it.

- Clear frameLifecycle, lifecycleListener, and lifecycleUnsubscribe in
  disconnect() so stale state from the prior tab doesn't carry across a
  disconnect/reconnect cycle.
…estability

Mirrors the captureScreenshotWithFallback pattern: move the lifecycle-wait
logic into a free function that takes a `LifecyclePageAPI` slice and a
`FrameLifecycleMap`, so unit tests can drive it with a hand-written fake
instead of a live CDP connection. Drops the private wrapper method on
CometCDPClient (no other callers) and inlines the call site in screenshot().

No behavior change.
Covers the three exit paths (cache hit, listener fires, timeout) plus
error handling when the CRI lifecycleEvent registration throws. Pins the
review-fix invariants:

- listener is unsubscribed on every exit path (no leaked handlers across
  calls — guards the listener-leak finding)
- listener is registered before the cache scan (defensive ordering against
  any future microtask insertion between cache check and registration)
- unrelated lifecycle events on the same connection do not resolve the
  Promise — only the requested eventName matters

Uses a FakeLifecyclePage that satisfies LifecyclePageAPI and records
registrations/unsubscribes/live-handler-count so the leak invariants can
be asserted directly.
@madisonrickert madisonrickert force-pushed the upstream-pr/screenshot-paint-wait branch from e05add6 to 191ea5c Compare May 12, 2026 23:33
@madisonrickert
Copy link
Copy Markdown
Contributor Author

madisonrickert commented May 12, 2026

Branch is rebased onto the current head of #6 so it picks up the unit-test infra from #10, and the lifecycle code now lives behind the same free-function pattern as captureScreenshotWithFallback.

Notes on the analysis behind each finding:

Listener leak. chrome-remote-interface's Chrome class extends Node's EventEmitter (chrome.js:15), so the prior removeListener call wasn't actually a no-op. That said, the idiomatic CRI pattern is to keep the unsubscribe function CRI itself returns from client.Page.lifecycleEvent(handler): api.js:49 returns exactly () => chrome.removeListener(rawEventName, handler). Switched to that. Cleaner, drops the as any cast, and disconnect() uses it too.

Race condition. In single-threaded Node, the cache scan, Promise construction, and listener registration execute in one synchronous block with no event-loop turn between them. CRI events arrive via WebSocket I/O callbacks, which only fire on event-loop ticks. So there isn't an actual race today. But listener-first ordering is defensive against any future microtask insertion (CRI internals, scheduler changes), and it costs nothing. Swapped, with a test that pins the ordering.

Disconnect cleanup. connect() already calls this.frameLifecycle.clear(), so the map doesn't grow unbounded across reconnections. The real residue is stale events between disconnect() and the next connect() plus a dangling listener reference. Added the cleanup as you suggested.

waitForLifecycle privacy. Removed the class method entirely. The logic now lives as a top-level waitForLifecycle(page, frameLifecycle, eventName, timeoutMs) function (callable from outside the class for future ops on networkAlmostIdle / load, per the PR description), and screenshot() calls it directly.

The new tests cover the three exit paths (cache hit, listener fires, timeout) and pin the listener-leak invariant (unsubscribe count on every path) plus the defensive ordering. npx vitest run shows 51/51.

Just pending #6 landing still.

@RapierCraft
Copy link
Copy Markdown
Owner

Code Review: fix(screenshot): wait for paint via Page.lifecycleEvent

Reviewed commit: 191ea5c

Overall Assessment: APPROVE (with minor observations)

This is a well-engineered PR. The lifecycle waiting logic is correct, the cleanup paths are complete, error handling is defensive without masking real failures, and the test coverage is thorough. The PR is safe to merge.


Detailed Findings

1. Lifecycle Event Handling — Correct

waitForLifecycle race condition analysis: The implementation is safe. The listener is registered before scanning the cache (line ~193 vs ~195 in the diff), which closes the classic TOCTOU window. Since JavaScript is single-threaded, no event can arrive between listener registration and the cache scan that would be lost. The comment at line 165-167 correctly documents this.

Cleanup: The finish() guard (if (done) return) prevents double-resolution if the timer fires after the event arrives (or vice versa). Both the listener and timer are cleaned up on every exit path (cache hit, event arrival, timeout, registration failure). No leaks.

2. Timeout Handling — Sound

The 2-second timeout is well-chosen: short enough to avoid user-visible stalls, long enough for typical FCP on SPAs. On timeout, waitForLifecycle returns false (not an exception), so screenshot() degrades gracefully to the prior behavior — captures whatever is on screen. This matches the "bounded blast radius" claim in the PR description.

3. Lifecycle Map Management — Correct

  • connect(): Clears the map and unsubscribes any prior listener before re-subscribing. This prevents stale events from a previous document leaking into the new connection.
  • disconnect(): Cleans up listener, nulls references, clears map. Correct order (unsubscribe before client.close()).
  • LoaderId rotation: When a new loaderId arrives for a frameId, the entry is replaced (new Set([name])), which correctly invalidates stale events from the prior navigation. This is the key correctness property for SPA navigations.

4. captureScreenshotWithFallback — Well-Separated Concerns

The 0x0 viewport fallback (from PR #6) is cleanly extracted into a testable free function. The ProtocolError vs generic error discrimination is correct — only Chrome-side rejections trigger the fallback; transport failures propagate.

5. Minor Observations (non-blocking)

5a. ProtocolError import pattern: The cast (CDP as unknown as { ProtocolError: ... }).ProtocolError works at runtime but relies on an undocumented export. The comment notes this is a workaround pending DefinitelyTyped updates — acceptable.

5b. Lifecycle listener type: this.lifecycleListener is typed as ((params: any) => void) | null. The any is pragmatic given CRI's typing limitations, but the actual params are well-validated inside the handler (if (!frameId || !loaderId || !name) return).

5c. waitForLifecycle passes this.client!.Page directly: This works because LifecyclePageAPI only requires lifecycleEvent(handler) which CRI's Page domain provides. The type narrowing via the interface is a good pattern for testability.

5d. The !v?.clientWidth && !v?.clientHeight check: This uses falsy coercion, so 0 triggers the fallback. Since viewport dimensions are numeric, this is correct — width=0 AND height=0 means degenerate. A viewport with width=0 but height=800 would NOT trigger the fallback (intentional, per the comments and tests).

6. Test Quality

The tests cover all meaningful paths:

  • Cache-hit (synchronous resolution)
  • Listener path (async resolution)
  • Timeout path (with fake timers)
  • Error handling (registration failure)
  • Cleanup verification (no leaked handlers)
  • Defensive ordering assertion

The fakes are minimal and well-typed. The FakeScreenshotPage correctly mirrors the ProtocolError construction pattern from the source.

7. Dependency Bump

chrome-remote-interface 0.33.2 -> 0.34.0: Required for the ProtocolError export. The @types/chrome-remote-interface stays at ^0.33.0 — no runtime risk from the types lag since the cast pattern handles it.


Summary

Category Status
Correctness Pass — no race conditions, proper TOCTOU ordering
Cleanup/Leaks Pass — all exit paths clean up listener + timer
Timeout behavior Pass — bounded, graceful degradation
Error handling Pass — defensive without masking transport failures
Test coverage Pass — all paths exercised with fake timers
Breaking changes None — steady-state pages resolve synchronously

Verdict: Clean. No blocking issues found. Safe to merge after PR #6 lands (as noted in the PR description).

@RapierCraft RapierCraft merged commit be984af into RapierCraft:main May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants