fix: [SDK-4336] guard IndexedDB Options writes from iOS Safari PWA wedge #1468
Testing instructions for reviewers
The sandbox commit ( You'll need:
One-time setup
git checkout sherwin/sdk-4336
# 1. Start the tunnel FIRST so you know your host. Free ngrok now issues
# *.ngrok-free.dev, and the host is RANDOM per restart, so every new host
# needs a fresh build (step 3) AND a dashboard Site URL update.
ngrok http https://localhost:4001
# 2. Capture the BARE host — no scheme, no port. Pasting the full https:// URL
# bakes a broken `https://https://...` origin; a stray `:4001` also fails
# because ngrok serves 443.
HOST=$(curl -s localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' | sed 's#https://##')
echo "$HOST" # e.g. abc1-2-3.ngrok-free.dev
# 3. Build the dev SDK against your tunnel host. NO_DEV_PORT=true drops the port
# so the device fetches assets from https://$HOST/... (443) through the tunnel.
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod
# 4. Serve the prebuilt SDK. This MUST run from the preview/ directory — that is
# what binds https://localhost:4001. Do NOT use `vp dev --filter @onesignal/preview`
# (the package is named `preview`, so the filter matches nothing and vp serves
# the root config on :4000), and do NOT use `vp run start` (it re-runs build:dev
# and re-bakes BUILD_ORIGIN=localhost, clobbering step 3).
cd preview && SDK_ENV=dev vp dev
Sanity-check the baked origin before testing (exactly one
rg "BUILD_ORIGIN = " build/releases/Dev-OneSignalSDK.page.js
# var BUILD_ORIGIN = "abc1-2-3.ngrok-free.dev";
Then set the OneSignal app's Site URL to
Reproduce the bug (control)
The control is the repro sandbox without the fix.
git checkout main
git checkout sherwin/sdk-4336 -- preview # sandbox pages, no fix
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod
cd preview && SDK_ENV=dev vp dev
Verify the fix
git checkout sherwin/sdk-4336
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod
cd preview && SDK_ENV=dev vp dev
Reset state on the device (delete the PWA from the Home Screen, then Settings → Apps → Safari → Advanced → Website Data → remove the tunnel host's data), reinstall, and run the same sequence (including a force-quit/reopen, since that's the most common trigger). On the wedged load you should see roughly:
One Options write times out at 1500ms and trips the circuit breaker; the remaining Options writes short-circuit silently (no per-write log). Total init on the wedged load is ~1.5s instead of indefinite. No
What I confirmed on my device
Captured logs are in the SDK-4336 Linear ticket if you want to compare to your own run.
Pre-merge
Before merging, drop the sandbox commit (
Note: we'll also roll back the
@claude review
would fadi/sdk-4336-options-write-timeout be enough?
647e220 to b9d03ef
…erge)
Adds the reproducible demo we used to verify the SDK-4336 fix on a real iOS Safari PWA. Lets a reviewer reproduce the original 30-minute init hang on `main` and confirm the fix branch resolves it.
What's included:
- `preview/pageA.html`, `preview/pageB.html`: minimal two-page sandbox with a Register button on Page A, designed to exercise the navigation-after-push-subscription flow described in the ticket. Page A persists `app_id` to `localStorage` so subsequent in-page navigations don't lose it (the in-page links don't carry the query string, which would otherwise produce `InvalidAppIdError`).
- Both pages set `apple-mobile-web-app-capable` so they install as a standalone PWA when added to the Home Screen.
- `preview/OneSignalSDKWorker.js` and `preview/push/onesignal/OneSignalSDKWorker.js` now `importScripts` from `self.location.origin` instead of a hardcoded `https://localhost:4001`, so the worker resolves correctly when the dev server is exposed via an ngrok / Cloudflare tunnel for on-device testing.
- `preview/vite.config.ts` disables HMR (the WebSocket can't reach a device through a tunnel and floods the console with unhandled-rejection spam from `ws.send`), strips Vite's auto-injected `/@vite/client` from HTML responses for the same reason, and sends `Cache-Control: no-store` for SDK assets and HTML so iOS Safari / PWA doesn't pin a stale build during a debug session.
This commit is intentionally **not** part of the SDK fix; it should be reverted before merging the PR. Kept in branch history so we can re-introduce it if SDK-4336 surfaces again.
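The worker change described above can be pictured with a small sketch. `sdkWorkerUrl` and the example path are hypothetical (the real worker files may simply build the string inline); the only point is that the script URL is derived from the origin that served the worker rather than from a hardcoded `https://localhost:4001`.

```typescript
// Hypothetical helper, not the SDK's code: resolves the SDK script against
// whatever origin served the worker (localhost, ngrok, Cloudflare tunnel).
function sdkWorkerUrl(workerOrigin: string, sdkPath: string): string {
  return new URL(sdkPath, workerOrigin).href;
}

// Inside the service worker this would be used roughly as:
//   importScripts(sdkWorkerUrl(self.location.origin, "/<path to the dev SDK worker build>"));
```

Because the origin is read at runtime, the same worker file should work unchanged whether the page is loaded from localhost or through a tunnel host.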
iOS 26 Safari PWA can leave the `ONE_SIGNAL_SDK_DB.Options` object store in a state where every `readwrite` request stalls indefinitely after the user navigates back into the PWA following a push subscription. The request never fires `success`, `error`, `abort`, or `complete`, so `OneSignal.init()` blocks on the first Options `put` until WebKit's internal transaction watchdog finally aborts it ~30 minutes later. Reads on the same connection still work, `readwrite` on other stores still works, and reopening the database does not clear the wedge; only the `Options` store readwrite path is poisoned. A separate IDB at the same origin is unaffected, so this is a per-database WebKit bug, not the NetworkProcess crash family already tracked in WebKit bugs 273827 / 277615 / 309386.
Wrap `db.put`/`db.delete` on `Options` with a 1.5s hard timeout. On timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op so init can continue. Other stores keep their existing behavior.
The values written to `Options` are non-critical session metadata (`pageTitle`, `persistNotification`, webhook URLs, click-handler config, `lastPushToken`, `isPushEnabled`, etc.) that the service worker reads with sensible fallbacks if missing or stale, so push delivery remains unaffected.
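A minimal sketch of the timeout described in this commit message, under stated assumptions: `writeWithTimeout`, its parameters, and the exact warning text are illustrative and not taken from the SDK source. It shows the key behavior, which is that a timeout resolves (rather than rejects) so the caller proceeds as if the write had been a no-op.

```typescript
// Sketch only: names and signature are assumptions, not the SDK's code.
const OPTIONS_WRITE_TIMEOUT_MS = 1500; // the 1.5s budget from the commit message

function writeWithTimeout(
  label: string,
  op: () => Promise<unknown>,
  timeoutMs: number = OPTIONS_WRITE_TIMEOUT_MS,
  warn: (message: string) => void = console.warn,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // If the write never settles, give up and resolve as a no-op.
    const timer = setTimeout(() => {
      warn(`[SDK-4336] ${label} timed out after ${timeoutMs}ms`);
      resolve();
    }, timeoutMs);

    // Wrapping op() in a promise also turns a synchronous throw into a rejection.
    new Promise<unknown>((settle) => settle(op())).then(
      () => {
        clearTimeout(timer);
        resolve();
      },
      (error: unknown) => {
        clearTimeout(timer);
        reject(error); // has no effect if the timeout already resolved
      },
    );
  });
}
```

A call site would then look something like `await writeWithTimeout("db.put(Options)", () => db.put("Options", entry))`, where `db` and `entry` stand in for whatever the caller already has.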
Once a single Options `readwrite` request times out we know the store is poisoned for the rest of the page's lifetime: fresh connections inherit the same WebKit lock state, and we have no signal that would let us probe whether the wedge has cleared mid-session. Today every remaining Options write in `initSaveState` + `saveInitOptions` still arms its own 1.5s timer and walks to the timeout independently, which adds up to ~12s of init latency on the first navigation back into a wedged PWA.
Add a module-scoped `optionsWriteWedged` flag. When the first Options write times out, set the flag and resolve subsequent Options writes as no-ops immediately, logging a `[SDK-4336]` warning so the skip is visible in telemetry. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh with the regular timeout. With this in place, init on a wedged page completes in ~1.5s instead of ~12s.
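The circuit breaker can be sketched as follows. Only the `optionsWriteWedged` flag name comes from the commit message; `guardedOptionsWrite`, its signature, and the log strings are assumptions made for illustration.

```typescript
// Module-scoped, so it resets whenever the page (and this module) reloads.
let optionsWriteWedged = false;
const isOptionsWriteWedged = (): boolean => optionsWriteWedged;

// Hypothetical guard: the first timeout trips the breaker, later writes skip.
function guardedOptionsWrite(
  label: string,
  op: () => Promise<unknown>,
  timeoutMs: number = 1500,
  warn: (message: string) => void = console.warn,
): Promise<void> {
  if (optionsWriteWedged) {
    // Breaker already tripped: do not arm another timer, just skip.
    warn(`[SDK-4336] ${label} skipped (Options store wedged)`);
    return Promise.resolve();
  }
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => {
      optionsWriteWedged = true; // trip the breaker for the page lifetime
      warn(`[SDK-4336] ${label} timed out; tripping circuit breaker`);
      resolve();
    }, timeoutMs);
    new Promise<unknown>((settle) => settle(op())).then(
      () => {
        clearTimeout(timer);
        resolve();
      },
      (error: unknown) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

If a wedged load performs eight sequential Options writes (an assumption consistent with the figures quoted here), the unguarded cost is 8 × 1.5s = 12s, while the breaker pays the 1.5s timeout once and skips the other seven.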
The first SDK-4336 commit only protected `Options` writes, but on-device
verification (logs12.txt) showed that once init completes, the
`OperationRepo` queue still wedges: `_executeOperations` awaits a
`db.put('operations', ...)` (or a downstream model-store `_persist`)
that never settles, leaving `runningOps = true` forever and spamming
`Ops in progress` every 500ms. This is the same iOS 26 Safari PWA
WebKit lock poisoning we saw on `Options`, just affecting different
stores once init is no longer the first thing to write.
Generalize the workaround:
- Rename `optionsWriteWedged` → `readwriteWedged` and apply the timeout
+ circuit breaker to every readwrite op (`put`, `delete`, `clear`),
not just `Options`.
- Once any readwrite times out, mark the DB readwrite path wedged for
the rest of the page's lifetime. All subsequent readwrites
short-circuit to a no-op resolve, with a `[SDK-4336]` warning logged
for telemetry.
- Reads (`get`, `getAll`) and `objectStoreNames`/`close` are unchanged.
The values we drop on a wedged page are either session metadata the
service worker re-derives from network state on the next visit, or
queued operations whose effects (subscription create/update/delete,
identity changes, etc.) are idempotent server-side and will be
re-attempted on the next page load. The alternative is letting the
operation queue spin forever, which is materially worse.
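A rough sketch of the generalization described in this commit message: every readwrite method goes through one guard, while reads pass straight through. The `Db` interface, `withGuardedWrites`, and the log strings are hypothetical; only the `readwriteWedged` flag name and the `put`/`delete`/`clear` split come from the text.

```typescript
// Hypothetical minimal database surface; the real SDK wrapper differs.
interface Db {
  get(store: string, key: string): Promise<unknown>;
  put(store: string, value: unknown): Promise<void>;
  delete(store: string, key: string): Promise<void>;
  clear(store: string): Promise<void>;
}

let readwriteWedged = false;
const isReadwriteWedged = (): boolean => readwriteWedged;

function guardReadwrite(
  label: string,
  op: () => Promise<void>,
  timeoutMs: number,
  warn: (message: string) => void,
): Promise<void> {
  if (readwriteWedged) {
    warn(`[SDK-4336] ${label} skipped (readwrite path wedged)`);
    return Promise.resolve();
  }
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => {
      readwriteWedged = true; // any store timing out wedges all readwrites
      warn(`[SDK-4336] ${label} timed out`);
      resolve();
    }, timeoutMs);
    op().then(
      () => { clearTimeout(timer); resolve(); },
      (error: unknown) => { clearTimeout(timer); reject(error); },
    );
  });
}

function withGuardedWrites(
  db: Db,
  timeoutMs: number = 1500,
  warn: (message: string) => void = console.warn,
): Db {
  return {
    get: (store, key) => db.get(store, key), // reads are left unchanged
    put: (store, value) =>
      guardReadwrite(`db.put(${store})`, () => db.put(store, value), timeoutMs, warn),
    delete: (store, key) =>
      guardReadwrite(`db.delete(${store})`, () => db.delete(store, key), timeoutMs, warn),
    clear: (store) =>
      guardReadwrite(`db.clear(${store})`, () => db.clear(store), timeoutMs, warn),
  };
}
```

The trade-off is the one the message spells out: after a single timeout on any store, later writes to every store are dropped for the page lifetime, which is only acceptable if those writes are recoverable on the next load.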
This reverts commit 3c41181.
Shortens the two `Log._warn` strings in `withOptionsWriteTimeout` (still tagged with `[SDK-4336]`) and bumps the `page.es6.js` and `sw.js` size-limit entries to fit the circuit-breaker code added by the SDK-4336 fix.
Combine `withOptionsWriteTimeout` and the per-method `if (storeName === 'Options')` branches into a single `guardOptionsWrite(storeName, label, op)` helper, and condense the explanatory comment block. Also drop the `[SDK-4336]` prefix from runtime warnings — the messages stand on their own and the ticket is captured in the commit log.
Two refinements to the Options-store guard:
1. `db.put`/`db.delete` now `await dbPromise` *before* invoking `guardOptionsWrite`, so the timeout scopes only to the readwrite request itself. Previously the 1500ms budget covered both DB open/upgrade and the put, so a slow `open()` (cross-tab `blocked` event during a schema upgrade, `terminated()` callback re-opening, or v5/v6 migrations on slow hardware) could false-trip the breaker and silently drop subsequent Options writes for the page lifetime.
2. Export `isOptionsWriteWedged()` and use it from `initSaveState` to defer the new-appId commit when the Options reset got circuit-broken mid-flight. Without this, the `Ids.appId` write (unguarded, since the guard is Options-only) would succeed while the previous app's `isPushEnabled` / `lastPushId` / `lastPushToken` / `lastOptedIn` stayed put, and the `previousAppId !== appId` gate would keep the reset branch from re-entering on later loads, leaving cross-app contamination permanent. Skipping the commit lets a future non-wedged load complete the reset.
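The second refinement is easier to follow with a toy model. Everything below is a hypothetical stand-in: `FakeDb`, `applyAppId`, and the choice to model the reset as deletes are assumptions, and only the key names and `isOptionsWriteWedged` come from the commit message.

```typescript
// In-memory stand-in for the SDK database. `wedged` simulates a tripped
// breaker, in which case guarded Options writes are silently dropped.
class FakeDb {
  readonly ids = new Map<string, string>();
  readonly options = new Map<string, unknown>();
  wedged: boolean;

  constructor(wedged: boolean) {
    this.wedged = wedged;
  }

  isOptionsWriteWedged(): boolean {
    return this.wedged;
  }

  async deleteOption(key: string): Promise<void> {
    if (this.wedged) return; // guard short-circuits: nothing is written
    this.options.delete(key);
  }
}

// Keys the commit message names as belonging to the previous app.
const PREVIOUS_APP_KEYS = ["isPushEnabled", "lastPushId", "lastPushToken", "lastOptedIn"];

type AppIdChange = "unchanged" | "deferred" | "committed";

async function applyAppId(db: FakeDb, appId: string): Promise<AppIdChange> {
  if (db.ids.get("appId") === appId) return "unchanged";
  await Promise.all(PREVIOUS_APP_KEYS.map((key) => db.deleteOption(key)));
  // If the reset was dropped, keep the old appId so the
  // `previousAppId !== appId` gate fires again on a later load.
  if (db.isOptionsWriteWedged()) return "deferred";
  db.ids.set("appId", appId);
  return "committed";
}
```

On a wedged load the function returns `"deferred"` and leaves both stores untouched; once a later load is not wedged, the same call clears the stale keys and only then commits the new app ID.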
Apply Fadi's review feedback on the Options-write guard:
- Rewrite `guardOptionsWrite` to the leaner `Promise.race([op(), timeout])` form with `.finally(clearTimeout)`, dropping the explicit `settled`-flag plumbing. `Promise.race` keeps a handler on `op`, so a post-timeout rejection stays handled (no unhandled rejection).
- Drop the per-call "skipped (Options store wedged)" warning; the single "timed out" log on first wedge is the actionable signal and the rest would just be noise that bloats the bundle.
- Drop the redundant `String(key)` cast in the delete label (Options keys are always strings).
- Drop the "App ID change reset deferred" warning in `initSaveState`; the surrounding comment already documents the deferral.
Add unit tests for the new behavior:
- `client.test.ts`: circuit breaker trips on an Options-write timeout and short-circuits subsequent Options writes while leaving other stores unaffected; the timeout is cleared when a write resolves first.
- `init.test.ts`: `initSaveState` defers the new-appId commit (and the `Ids` clears) when the breaker is tripped, so a wedged app-ID switch doesn't strand the previous app's metadata.
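For readers who have not seen the `Promise.race` form, here is a sketch of what such a guard might look like. The `guardOptionsWrite(storeName, label, op)` parameter order follows the earlier commit message; the extra `timeoutMs` and `warn` parameters, the sentinel, and the body are assumptions rather than the merged code.

```typescript
let optionsWriteWedged = false;
const TIMED_OUT = Symbol("timed out"); // sentinel distinguishing timeout from a real result

function guardOptionsWrite(
  storeName: string,
  label: string,
  op: () => Promise<unknown>,
  timeoutMs: number = 1500,
  warn: (message: string) => void = console.warn,
): Promise<unknown> {
  if (storeName !== "Options") return op(); // other stores are not guarded
  if (optionsWriteWedged) return Promise.resolve(); // silent short-circuit

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
  });

  // Promise.race subscribes to op()'s promise, so a rejection that arrives
  // after the timeout has won is still handled and does not surface as an
  // unhandled rejection.
  return Promise.race([op(), timeout])
    .then((result) => {
      if (result !== TIMED_OUT) return result;
      optionsWriteWedged = true;
      warn(`${label} timed out`);
      return undefined;
    })
    .finally(() => clearTimeout(timer));
}
```

Compared with a hand-rolled `settled` flag, the race form has one place where the timer is cleared and no branching on who finished first, which is presumably why it was preferred in review.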
b9d03ef to 7716b57
Verified 30min timeout (pre-fix).
Description
1 Line Summary
Stops `OneSignal.init()` from hanging for ~30 minutes on iOS 26 Safari PWAs by capping `Options`-store readwrite operations with a 1.5s hard timeout and tripping a page-scoped circuit breaker once the wedge is detected.
Note: A bug report (315804) was filed upstream with WebKit to investigate this.
Details
Root cause
On iOS 26 Safari running as a Home-Screen PWA (`display: standalone`), the navigation back into the app after a successful push subscription leaves the `ONE_SIGNAL_SDK_DB` IndexedDB in a poisoned state where every `readwrite` request on the `Options` object store stalls indefinitely. The `IDBRequest` is created and goes to `pending`, but no event ever fires: no `success`, no `error`, no transaction `complete`/`abort`, no `IDBDatabase.onclose`. WebKit's internal transaction watchdog only forcibly aborts the wedged transaction after roughly 30 minutes; until then `await db.put('Options', ...)` never settles.
`OneSignal.init()` always writes to `Options` very early: `initSaveState` writes `pageTitle`, and `saveInitOptions` writes 7+ entries (webhook URLs, persistNotification, click-handler config, `lastPushToken`, `isPushEnabled`, etc.). Without this guard, the very first of those writes blocks init forever and the support team observes "init hanging for 30 minutes, then eventually recovers", which is exactly the watchdog timer.
What we ruled out and how:
`UnknownError: Connection to Indexed Database server lost` and fire `IDBDatabase.onclose`. We never see either.
`TransactionInactiveError` synchronously. We throw nothing.
`IDBRequest` is dead.
`get`, `getAll`) on the same DB connection still work.
`readwrite` on other stores still works. Closing and re-opening the database returns a fresh `IDBDatabase` whose first `readwrite` on `Options` wedges identically.
`readwrite put` completed in 11 ms while the real DB was hung. The wedge is per-database, not origin-wide.
Filed upstream as WebKit bug 315804 ("A readwrite IDBTransaction never fires oncomplete, onerror, or onabort after the user subscribes to web push in an installed PWA") with a link back to this PR for steps on how to reproduce. The source comment in `client.ts` also references the bug so the workaround can be removed once WebKit ships a fix.
Fix
Three commits, each independently revertable:
- `81415fdd` fix: fail-fast Options writes. Wrap `db.put('Options', ...)` and `db.delete('Options', ...)` with a 1500ms hard timeout. On timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op so init proceeds. Other stores keep their existing behavior. The values that don't get persisted are session metadata that the SW reads with sensible fallbacks if missing or stale, so push delivery is unaffected.
- `1ae8abdf` perf: short-circuit after first wedge. Once a single `Options` `readwrite` times out we know the store is poisoned for the rest of this page's lifetime. Add a module-scoped `optionsWriteWedged` flag so the remaining 7 Options writes in `initSaveState` + `saveInitOptions` short-circuit immediately instead of each independently paying the 1500ms timeout. Cuts init latency on a wedged page from ~12s to ~1.5s. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh.
- `88e4cd59` chore(preview): repro sandbox (will be reverted before merge). Two-page demo + dev-server hardening that lets a reviewer reproduce the original 30-minute hang on `main` and confirm this branch resolves it. Detailed testing instructions in a separate PR comment.
A 4th commit (`3c41181b`) generalizing the timeout to all readwrite stores was tried and reverted (`83cbff87`) because we couldn't validate it on device: every subsequent on-device repro happened with an empty operation queue. Parked in branch history with the validation steps captured in a Linear comment on SDK-4336.
Systems Affected
Validation
Tests
Info
`client.test.ts` round-trip tests (every `Options` write goes through the new wrapper, so all 12 client tests are effectively also covering the timeout path's happy case).
`[SDK-4336] db.put(Options) timed out … Tripping circuit breaker` warning and 7 follow-up `db.put(Options) skipped` warnings, then `internalInit` → SW handshake → `sessionInit` proceed normally.
`client.test.ts` uses `fake-indexeddb`, which doesn't reproduce the WebKit-specific wedge, so a synthetic timeout test would only verify the timer plumbing, which is straightforward enough that the on-device evidence is more meaningful.
Checklist
Programming Checklist
Interfaces:
Functions:
Typescript:
Other:
`elem of array` syntax. Prefer `forEach` or use `map` `context` if possible.
Screenshots
Info
N/A — runtime correctness fix, no UI changes. See on-device console logs in the SDK-4336 Linear ticket and chat history.
Checklist
Related Tickets
SDK-4336