feat(notebook-cloud): asset fetch resilience and degraded output states (#3421 PR 4)#5
Closed
quillaid wants to merge 12 commits into
Closed
feat(notebook-cloud): asset fetch resilience and degraded output states (#3421 PR 4)#5quillaid wants to merge 12 commits into
quillaid wants to merge 12 commits into
Conversation
…nd cloud (nteract#3581) * refactor(notebook): lift outline interaction into shared hooks Extract useActiveOutlineItemId and useOutlineSelection from apps/notebook/src/App.tsx into src/components/notebook/outline-interaction.ts. These hooks manage scroll-synced active item tracking and outline-to-focused-cell selection coupling, used by both desktop and cloud. Also extract getOutlineStatusLabel as useOutlineStatusLabel(), built on the shared useNotebookQueueProjection, into outline-status-label.ts. Desktop App.tsx switches from ~160 lines of local implementation to importing the shared hooks. Behavior-preserving: the extraction is verbatim, and desktop tests pass with the new imports. Tests in outline-interaction.test.ts cover selection-clearing on focus change, focus coupling when setFocusedCellId is provided, and graceful degradation when setFocusedCellId is undefined. * feat(notebook-cloud): wire shared outline interaction in the viewer Use useActiveOutlineItemId, useOutlineSelection, and useOutlineStatusLabel from shared code. Add getOutlineStatusLabel to the view model call, wire activeOutlineItemId, and pass outlineCellIds + activeOutlineItemId + selectedOutlineCellId to NotebookDocumentRail. Removes ~30 lines of local state + handlers. The viewer now tracks the active outline item on scroll and shows execution status labels, matching desktop's outline behavior. Extend viewer-render-props.test.ts to assert the viewer source references the shared hooks and rail props, preventing re-forking.
…teract#3570) * feat(notebook): quiet connection/identity slot with desktop mount (nteract#3421) NotebookConnectionIdentity fills the reserved identity slot: the self actor's flattened avatar (shared actor projection, initials/icon only) paired with a connectivity dot — connectionStatus$'s first UI consumer. Runtime-state stores are deliberately not blanked while a transport reconnects, so the pulsing dot is what makes frozen kernel/execution chrome interpretable during the offline window. Hard quiet-chrome constraints from the pulled nteract#3273/nteract#3290/nteract#3337/nteract#3349 designs, enforced in component + tests: - renders NOTHING for a purely local desktop session (isRemoteNotebookContext: access.source !== 'local' or runtime.target.kind === 'runtime_peer') — conditionality is why nteract#3290 was pulled; - flat rounded-md border-border/70 bg-muted/35, never rounded-full+shadow; - icon-only at every width: no visible text pill; label and status live in sr-only copy and the title tooltip; - state expresses as dot color (statusTone vocabulary: emerald online, amber pulse connecting/reconnecting, muted offline) and wrapper opacity, never copy; - errors/reconnect prompts stay in the notices stack; connection state never rides the kernel status surface. The component takes a structural NotebookConnectionStatusSource (an rxjs-free Observable subset), so shared src/ gains no dependency; NotebookActorAvatar is exported from NotebookIdentity with an optional status-tone override to carry the connection dot. Desktop mount: trailingControls on <NotebookToolbar> in App.tsx -> identityControls at the right end of the command toolbar, driven by the host transport's connectionStatus$ (stable for the app's lifetime). A purely local session renders nothing, exactly as before nteract#3273. * feat(notebook-cloud): mount the connection slot on a switching status bridge (nteract#3421) CloudConnectionStatusBridge (live-sync.ts): one BehaviorSubject-backed connection-status source that stays stable for the session's lifetime. The transport survives transport-level reconnects, but initial-connect attempts and escalation teardowns still REPLACE it — the host facade's delegation captures whichever transport exists at subscribe time and never switches, so a subscriber could watch a dead object forever. The bridge attaches to each replacement transport via onTransportCreated, dedups across switches, and on escalation teardown detaches BEFORE the dispose and reports 'reconnecting' — the disposed transport's terminal 'offline' (manual-disconnect vocabulary) never surfaces while the session is retrying. useCloudViewerSession exposes connectionStatus$ from the bridge; notebook-viewer.tsx fills the formerly-null identityControls slot with NotebookConnectionIdentity driven by it. Tests: bridge replacement + dedup, teardown-retry masking, and the PR-2 reconnect loop reflected through one stable subscription; the viewer-render-props quiet-chrome regression now pins the new mount instead of identityControls={null}. * docs(adr): record PR-3 landed shape - Component name (NotebookConnectionIdentity), centralized isRemoteNotebookContext gate, structural rxjs-free status source. - Desktop mount via host transport; cloud mount via CloudConnectionStatusBridge with teardown-retry masking semantics. - The slot is what makes the PR-2 stale-chrome limitation interpretable: the dot stays live through 'reconnecting'. - Tests paragraph updated to the landed coverage, including the identityControls={null} pin becoming a pin on the new mount. * fix(notebook): real desktop dot, scoped link copy, SR announcements (nteract#3421) Per the PR-3 adversarial review: - The desktop dot was a constant: TauriTransport's connectionStatus$ initializes 'online' and the app never disconnects it, so the slot could never transition — emerald through a daemon restart, the exact window its doc comment promised to make interpretable. The mount now feeds createDesktopConnectionStatusSource, derived from the host facade's daemon lifecycle events (daemon:ready -> online with ready-cache backfill for late mounts; daemon:disconnected -> reconnecting, since the host auto-reconnects; daemon:unavailable -> offline). The IPC transport's own subject stays untouched and honest about IPC. - Scoped copy: connectionLabel names the link the source measures ('Daemon connection' on desktop) in the title/sr-only copy and the announcements, so the dot never asserts daemon<->room health it does not observe (recorded as future work). - getCurrent(): optional synchronous snapshot on the source contract; first paint shows the real status instead of a one-frame 'connecting' flash. - Live region: status CHANGES (never the initial value — the first source delivery is the baseline) are announced via a polite sr-only region using the scoped copy; sources dedup so flaps don't spam. - aria-hidden on the avatar so SR users hear the sr-only copy, not 'AL Alice — Connected'. Tests hardened per review: fixtures use valid runtime-target shapes (attached/ready + label); no-visible-text now clone-and-strips so wrapper-level text nodes are caught; flat-never-raised walks the whole subtree banning shadow tokens (rounded-full stays wrapper-scoped — Avatar internals are legitimately round) with classList token checks; unmount-unsubscribe pinned via listenerCount; new tests for getCurrent first paint, live-region change-only announcements, aria-hidden, and scoped copy. Desktop coverage added: daemon-lifecycle source unit tests, a composition test feeding REAL desktopNotebookShellCapabilities output into isRemoteNotebookContext, toolbar trailingControls flow with real projection fixtures, and a source-text pin on the App.tsx mount (daemon source + scoped label, never the IPC transport). * fix(notebook-cloud): close the bridge gap, implement the slot source contract (nteract#3421) - Auth-refresh gap: the live-room effect cleanup detached the bridge and the refreshing re-run early-returns without attaching, so the bridge served stale 'online' while the transport was fully disconnected. The cleanup now uses noteTeardownRetry() — the gap reads as 'reconnecting' (a transition), and the disposed transports' terminal 'offline' still never surfaces. Harmless at real unmount. - CloudConnectionStatusBridge implements the slot's source contract directly (subscribe + getCurrent), and the session exposes the bridge itself so first paint reads the synchronous snapshot. - Tests: plain detach() severance (no late emissions, no spurious status), the source contract (replay + snapshot + unsubscribe); viewer-render-props slot pin is now module-scoped via viewerFileContaining with bounded windows and pins capabilities={shellCapabilities} (no cross-corpus false-pass), plus source guardrails for the session wiring order: attach-on-transport-created and retry-before-dispose in all three teardown paths (scheduleReconnect, poison-pill discard, effect cleanup). * docs(adr): correct the PR-3 desktop source and record review fixes - Desktop is fed by the daemon-lifecycle source, not the IPC transport (whose status is constant in practice); copy scoped via connectionLabel; daemon<->room link health recorded as future work. - getCurrent first-paint snapshot, change-only live-region announcements, aria-hidden avatar. - Bridge cleanup semantics: teardown paths AND the effect cleanup report 'reconnecting' so the auth-refresh re-run gap never reads as stale 'online'. - Tests paragraph updated to the landed coverage. * feat(notebook-cloud): debounced sustained-reconnecting line in the notices stack The connection/identity slot stays an 8px dot by design, so a sustained outage was only legible to someone watching the far right of the header. Surface ONE quiet notices-stack line ("Reconnecting." / "Your edits are kept locally and will sync when the connection returns.") once the live room status has stayed "reconnecting" for 3s, and clear it on online. The debounce lives in SustainedReconnectingTracker: reconnecting arms one timer, sub-debounce flaps cancel it silently, repeated reconnecting deliveries never re-arm or duplicate, and a replacement transport's pre-handshake "connecting" neither arms nor clears. Transport-level connectionError strings (socket drops, handshake timeouts, missed liveness pongs, browser offline) now route through this debounced line instead of the immediate per-drop "Live room needs attention." warning; access and sign-in diagnostics keep their dedicated notices. The chip and dot visuals are untouched. * fix(notebook-cloud): classify only exact link-loss shapes as reconnect noise The broad prefix match (/^cloud (sync|room) /) swallowed every transport-wrapped failure into the debounced calm "Reconnecting." line - including genuinely terminal ones. A mid-session auth/access revocation surfaces as wrapped connect-target or socket failures, and the actionable diagnostics path only runs pre-ready, so an affected user saw an infinite quiet reconnect with no sign-in CTA. The classifier now matches the transport's own link-loss reasons shape by shape, anchored (socket close/failure, browser offline, handshake timeout, liveness ping/pong misses). Connect-target resolution failures, socket creation failures, protocol decode failures, and the session escalation's "cloud room rejected frame: ..." keep the existing "Live room needs attention." warning with its action. Tests pin both routes, including the rendered notice for non-link failures. * test(notebook-cloud): render useSustainedReconnecting through a real hook harness The tracker had thorough unit coverage but the hook itself was never executed, leaving its React wiring unpinned - dropping tracker.dispose() from the effect cleanup passed every suite. Render the real hook against a BehaviorSubject-shaped fake source: debounce boundary drives the returned flag, sub-debounce flaps stay silent, unmount unsubscribes and disposes the pending timer (timer count reaches zero), and a source identity change mid-outage pins the intended clear-then-re-arm-via-replay behavior.
The dynamic import() promise for the runtimed WASM module is cached for the life of the page, but the failure path only reset initialized/ initializedSource/resolvedModule — never loadedModule/loadedModuleSource. One transient import failure therefore pinned a rejected promise forever: every retry (including retryLiveConnection reconnect attempts) re-awaited the same rejection. Clear the cached import when it rejects (guarded so a newer load is never clobbered), inject the importer for tests, and pin the regression: fail once -> retry succeeds with a fresh import.
…allback Wrap both halves of the runtimed WASM load in the blob-resolver retry ladder ([150, 500, 1500] ms): - module import(): every rejection retries (import exposes no status), then a content-hashed filename falls back to its stable-name copy (runtimed_wasm.js) for one more ladder — both names are deployed for exactly this deploy-window skew. - .wasm binary: URL-shaped inputs are fetched here and handed to wasm-bindgen as a Response. Thrown fetch errors and 404/409/425/429/5xx retry (404 deliberately retryable for deploy propagation); other statuses bail to the stable-name fallback immediately. Non-network schemes (file:, data:) pass through untouched. Touch point is runtimed-wasm-client.ts only — the snapshot and live connect paths both flow through initializeRuntimedWasmClient. Exhausted ladders stay terminal for the connect attempt: connectCloudSyncRuntime's catch still disconnects the transport (stopping its reconnect loop) and the session surfaces the existing error notice, so a hard persistent 404 cannot spin forever. retryLiveConnection re-enters cleanly because the rejection also clears the module/init caches. Tests pin the exact ladder delays under mocked timers, retryable vs bail statuses, hashed->stable fallback for both assets, non-URL passthrough, and the fully-failed-attempt -> retry-succeeds regression.
…derer bundle A renderer bundle fetch failure used to be terminal for every mounted output iframe: the provider stored the error, loadingPromise was only cleared for the NEXT provider mount, and already-mounted frames never recovered — N blank wells per page until reload. Restructure the provider around a module-level load store (useSyncExternalStore) shared by every provider instance: - fetch failures retry on the [150, 500, 1500] ms ladder before any error is surfaced, so transient asset-origin blips recover invisibly; - the context value now exposes retry(), which starts a fresh ladder after a terminal failure and no-ops while loading/after success; - because the bundle state is module-level, one successful retry un-blanks every mounted consumer (and every provider instance) at once; - the historical retry-on-next-mount behavior is preserved (a fresh mount after a terminal failure kicks a new ladder). Tests pin the exact ladder delays under fake timers, the quiet mid-ladder state, terminal error + retry() recovery propagating to all mounted consumers (including across provider instances), retry() idempotence, basePath fetch failures, and the missing-config error.
A terminal renderer-bundle failure used to render as N silent blank 24px
wells plus a cross-origin error cascade (one renderer-bundle-provider-
error diagnostic per mounted frame, console.error per output). Two
surfaces replace that — and only these two; asset health never feeds the
connection dot or CloudConnectionStatusBridge, which models room
transport health exclusively:
- OutputArea's isolated branch now gets the ErrorBoundary +
OutputErrorFallback-with-retry treatment the in-DOM branch already had.
On terminal bundle failure the frame is replaced by a visible fallback
whose Retry calls the shared provider retry(); because the bundle state
is module-level, one click un-blanks every output at once. Frames are
not mounted at all in that state, so the per-iframe error cascade is
gone. Render errors inside the isolated branch are caught by the new
boundary with resetKeys=[outputs].
- CloudNotebookNotices gains one quiet warning line ('Output renderer
unavailable.' + Retry) driven by the single module-level provider state
read in NotebookViewer — N identical failures aggregate structurally
into ONE notice, never per-output spam.
Tests: per-well fallback + retry wiring + no-frame-mount/no-cascade +
recovery propagation + in-DOM untouched + boundary catch/reset
(OutputAreaIsolatedFallback), and the single aggregated notice with
retry action kept out of connection vocabulary (viewer-notices).
Finish the content-hash story (PR nteract#3449's runtime-wasm pattern applied to the remaining un-hashed sidecars) so the renderer-assets origin serves them fully immutable: - copy-viewer-assets now writes isolated-renderer.js/.css and sift_wasm.wasm under BOTH stable and content-hashed names (renderer-sidecar-assets.mjs) plus a renderer-sidecar-assets.json manifest; the worker's existing isContentHashedAssetPathname already rewards the hashed pathnames with year-long immutable caching. - the shell worker reads the manifest per-request (regex-validated, like runtime-wasm-assets.json) and threads the names through viewer config as rendererAssets; missing/invalid manifests fall back to the stable names, which remain the documented fallback and stay deployed. - IsolatedRendererProvider accepts assetNames for the bundle fetch; resolveSiftWasmUrl accepts siftWasmAssetName, retiring the ?v= query on the cloud path when the name is content-hashed (the stable name keeps ?v= as its buster; daemon/desktop paths are unchanged). The name crosses the sandbox boundary via host context, so it is validated against the sift_wasm shape before use; the parent-page prefetch warm uses the same resolution. Tests: manifest/hashed-name emission and per-content hashing (scripts), shell config defaults + hashed manifest + invalid-manifest fallback (html), provider hashed/stable filename fetches, sift URL hashed-no-query / stable-keeps-query / sandbox-input rejection, hashed prefetch URL, and worker immutable-vs-revalidate pins for the new names.
…ap pinning, pair skew, and hangs Review findings 0/2/3/15/19/20/23 from the PR-4 review round: - Module-map busting: same-specifier import() retries replay the pinned failure network-free (the HTML spec caches failed module fetches per specifier for the page lifetime; Chromium/WebKit conform). Rungs >= 1 now carry a unique ?retry=N query (monotonic, reset-able in tests); stableRuntimedAssetHref preserves the query on fallback. - Coupled pair fallback (nteract#3416 skew class): hashed glue must never run against a stable binary or vice versa. A module stable-fallback makes the binary skip its hashed name outright; a binary stable-fallback forces a stable re-import of the (hashed) module before init. - Per-rung timeouts (20s): each import/fetch rung is raced against a deadline (plus best-effort AbortSignal.timeout on fetch) so a black-holed attempt becomes a failed rung instead of wedging the shared init singleton that every caller - including retryLiveConnection - awaits. Abandoned losers have their rejections observed. - Body-cancel hygiene on the non-retryable throw path (was retryable-only). - Terminal failures now carry the 'runtimed WASM asset failed: ' prefix (new runtimed-wasm-failure.ts, dependency-free) so the notices layer can classify them for a Retry affordance (wired in a follow-up commit). - Ladder comment documents the accepted worst case: sequential ladders, ~4.3s of sleeps + ~10 round trips on a fully-skewed page; zero on the healthy path. Tests: exact-boundary pins for the final module rung (1499/1500) and all three binary rungs (independent of the module ladder), 409/425 in the retryable set, both coupling directions, hung-rung recovery for module and binary, busted-href sequences, and the failure-prefix classification.
…ck, and output presence Review findings 1/5/6/8/12/14/18/22/23 from the PR-4 review round: - Hashed->stable fallback (mirrors the wasm client): when content-hashed bundle names exhaust the retry ladder (deploy-window skew - the page predates a deploy that replaced the hashed copies), the stable names are tried once before the failure goes terminal. js and css fall back as a PAIR so they always come from the same deploy tier. retry() after a deploy can now actually succeed instead of re-404ing forever. - Sticky lastError: a retry kicked from a terminal error keeps the failure visible while the ladder runs. OutputArea keeps the fallback mounted (frames remount only when the bundle actually loads), so a hopeless Retry click no longer churns N blank cross-origin iframes and flaps the degraded state per click. - Auto-recovery nudge: one window 'online' listener in the provider re-kicks a terminally failed load (startRendererBundleLoad is idempotent and generation-guarded) - a >2.2s outage no longer pins the degraded state after everything else self-heals. - Per-rung settle deadline (20s, race + best-effort AbortSignal) on the bundle fetches so a black-holed fetch becomes a failed, retryable rung instead of a permanent isLoading that gates off all degraded UI. - Isolated-output presence (useRegisterIsolatedOutput / useHasIsolatedOutputs): OutputArea reports its shouldIsolate so page-level surfaces can gate the asset notice on something actually being degraded (wired in the cloud viewer in a follow-up commit). Tests: stable-pair fallback after hashed exhaustion, sticky lastError through an in-flight retry, online re-kick recovery, presence tracking across mounts, fallback-stays-mounted during retry (no frame churn), ErrorBoundary resetKeys auto-reset on a NEW outputs array (was only manually-reset-tested), and OutputArea never rendering the page-level notice itself (pins N-wells -> ONE notice structurally).
Review findings 5/17/18 (sift half) from the PR-4 review round: sift's wasm is fetched lazily INSIDE the sandboxed output frame at first table render — a long-lived tab whose hashed sift_wasm name vanished across a deploy got a silent 404 with no ladder, no fallback, and no notice. The seam that actually sees the failure is sift's loader, so: - @nteract/sift setWasmUrl gains an optional fallbackUrl; ensureModule retries it once when loading the primary URL fails (skipped when identical/absent). An ABI mismatch on the fallback still fails loudly at instantiation — no worse than no fallback. - resolveSiftWasmUrls pairs the hashed primary with the stable + ?v= copy (null when the name is already stable); sift-renderer threads the pair through setWasmUrl from host context. Tests: predicate fallback retry/terminal/identical/success-shortcut via an injectable wasm loader; paired-URL resolution; and the validation regex boundaries (short hash, uppercase hex, wrong stem/extension all fall back to stable + ?v=, matching the worker's hashed predicate).
…nd quieter asset notices Review findings 4/7/8/13/16/21/22/24 from the PR-4 review round: - Terminal runtimed-WASM failures (prefixed by the wasm client) now get a dedicated 'Notebook engine failed to load.' notice whose action is Retry wired to retryLiveConnection — the documented re-entry finally has a UI surface. Previously this class fell through to 'Live room needs attention.' whose only action, 'Use anonymous', destroys a signed-in session without fixing anything (the legacy action remains the fallback when no retry callback is wired). - The notice sanitizer redacts http(s) URLs (it only covered ws/wss); wasm asset hrefs were the first http(s) URLs threaded into this surface. Query/fragment are stripped, host/pathname stay legible. - The 'Rich outputs are paused' notice is gated on useHasIsolatedOutputs: pure-markdown/text notebooks no longer warn about outputs they do not have while the provider's eager preload fails. - The notice reads error ?? lastError, so an in-flight retry no longer unmounts/remounts it per click (pairs with the sticky OutputArea fallback from the previous commit). - normalizeRendererAssetNames is exported and tested: the client half of the manifest-missing contract (old shell config + new stable-named viewer JS during gradual worker deploys) was unpinned. Tests: dedicated wasm-failure notice with non-destructive Retry (and the legacy action without a callback), http(s) sanitizer redaction (token query stripped), normalizeRendererAssetNames absent/partial/complete, the sidecar manifest parse-throw (non-JSON body) fallback, and the invalid-shape fixture now mixes a valid hashed css so it pins whole-manifest rejection rather than per-field fallback.
… pair Review finding 11 (frontend half): the renderer bundle ladder uses its own delay constant, so its rung 2 lower bound and rung 3 had no coverage at all (cumulative advances pass for any shorter delay). The exhaustion test now splits each rung at 149/150, 499/500, 1499/1500. Also updates the sift-renderer plugin test for setWasmUrl's new optional fallback argument (stable name -> undefined fallback) and adds a case pinning the hashed-name + stable-fallback pair threading from host context.
6d47246 to
7e16918
Compare
Owner
Author
|
Promoted upstream now that nteract#3570 merged — see the cross-fork PR on nteract/nteract. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR 4 — final leg of the nteract#3421 stack (intra-fork: base is the nteract#3570 branch so the diff shows only this PR; promotes to nteract/nteract when nteract#3570 merges).
What this does
Asset failures stop producing blank viewers and silent iframe cascades:
import()of the runtimed WASM module was pinned for the life of the page — every retry re-awaited the same dead promise. This silently defeatedretryLiveConnectionsince the day it shipped.retryLiveConnection.retry(),online-event re-kick, sticky fallback wells (no iframe churn during in-flight retries).ErrorBoundary+OutputErrorFallback-with-retry treatment the in-DOM branch had; N identical failures aggregate into ONE quiet notice (gated on isolated outputs actually existing). Asset health has zero coupling to the connection dot/bridge — by contract.isolated-renderer.js/.css+sift_wasm.wasmget hashed filenames + manifest (the runtime-wasm-assets pattern), returning the renderer-assets origin to fullyimmutablecaching (fix(notebook-cloud): revalidate un-hashed renderer assets instead of immutable nteract/nteract#3416 → fix(cloud): load hashed runtime wasm assets nteract/nteract#3449 arc complete). Manifest missing/invalid → stable names, no deploy-ordering hazard.Review
4-lens adversarial review + per-finding verification (one resume across a session boundary — journaled agents cached): 25 confirmed findings fixed in one consolidated round, 5 refuted. Accepted-as-documented: worst-case ~4.3s sequential skew ladders (commented at the ladder).
Verification
vp 2162 passed / notebook-cloud 816 passed / sift package 177 passed;
tsc+cargo xtask lintclean; post-rebase spot runs green, CI confirms.This completes the four-PR stack from nteract#3421. Remaining tracked follow-ups: access-revocation diagnostics (fast-follow), PR 5 (offline-merge surfacing), PR 6 (instant first paint), and the nteract#3580 persistence-side actor-rejection hardening (carved off to the Bedrock workstream).
🤖 Generated with Claude Code