fix: [SDK-4336] guard IndexedDB Options writes from iOS Safari PWA wedge #1468

Merged
sherwinski merged 10 commits into main from sherwin/sdk-4336 on Jun 2, 2026

Conversation

@sherwinski sherwinski commented May 28, 2026

Contributor

Description

1 Line Summary

Stops OneSignal.init() from hanging for ~30 minutes on iOS 26 Safari PWAs by capping Options-store readwrite operations with a 1.5s hard timeout and tripping a page-scoped circuit breaker once the wedge is detected.

Note: WebKit bug 315804 was filed upstream to track this.

Details

Root cause

On iOS 26 Safari running as a Home-Screen PWA (display: standalone), the navigation back into the app after a successful push subscription leaves the ONE_SIGNAL_SDK_DB IndexedDB in a poisoned state where every readwrite request on the Options object store stalls indefinitely. The IDBRequest is created and goes to pending, but no event ever fires — no success, no error, no transaction complete / abort, no IDBDatabase.onclose. WebKit's internal transaction watchdog only forcibly aborts the wedged transaction after roughly 30 minutes; until then await db.put('Options', ...) never settles.

OneSignal.init() always writes to Options very early — initSaveState writes pageTitle, saveInitOptions writes 7+ entries (webhook URLs, persistNotification, click-handler config, lastPushToken, isPushEnabled, etc.). Without this guard, the very first of those writes blocks init forever and the support team observes "init hanging for 30 minutes, then eventually recovers" — exactly the watchdog timer.

What we ruled out and how:

  • NetworkProcess crash family (WebKit bugs 273827 / 277615 / 309386). Those throw UnknownError: Connection to Indexed Database server lost and fire IDBDatabase.onclose. We never see either.
  • Process suspension (202705). That throws TransactionInactiveError synchronously. We throw nothing.
  • PWA total freeze (211018). All other JS keeps running — timers fire, fetch works, the SW heartbeat keeps ticking — only this one IDBRequest is dead.
  • Connection-level wedge. Reads (get, getAll) on the same DB connection still work. readwrite on other stores still works. Closing and re-opening the database returns a fresh IDBDatabase whose first readwrite on Options wedges identically.
  • Origin-wide IDB wedge. A diagnostic probe that opened a different IndexedDB name at the same origin and issued a readwrite put completed in 11 ms while the real DB was hung. The wedge is per-database, not origin-wide.
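
The elimination above can be read as a decision procedure over observed symptoms. The following is a minimal sketch of that reasoning only; the type, field, and function names are hypothetical and do not exist in the SDK, and the bug numbers are the ones cited in the list.

```typescript
// Hypothetical observation record; field names are illustrative, not SDK types.
type IdbObservation = {
  thrownErrorName?: string;   // e.g. "UnknownError", "TransactionInactiveError"
  oncloseFired: boolean;      // did IDBDatabase.onclose fire?
  otherJsRunning: boolean;    // do timers, fetch, and the SW heartbeat keep working?
  reopenFixesWrites: boolean; // does a fresh IDBDatabase accept readwrite on Options?
  otherDbWritesWork: boolean; // does a readwrite put on a different DB name settle?
};

type Diagnosis =
  | "network-process-crash"  // WebKit bugs 273827 / 277615 / 309386
  | "process-suspension"     // WebKit bug 202705
  | "pwa-total-freeze"       // WebKit bug 211018
  | "connection-level-wedge"
  | "origin-wide-wedge"
  | "per-database-wedge";    // the behavior reported as WebKit bug 315804

function classifyIdbHang(o: IdbObservation): Diagnosis {
  // The crash family surfaces an UnknownError and fires onclose.
  if (o.thrownErrorName === "UnknownError" || o.oncloseFired) return "network-process-crash";
  // Suspension throws synchronously instead of hanging.
  if (o.thrownErrorName === "TransactionInactiveError") return "process-suspension";
  // A total freeze would stop all JS, not just one request.
  if (!o.otherJsRunning) return "pwa-total-freeze";
  // If reopening the database helped, only the connection was bad.
  if (o.reopenFixesWrites) return "connection-level-wedge";
  // If a second database at the same origin also hangs, the origin is affected.
  if (!o.otherDbWritesWork) return "origin-wide-wedge";
  return "per-database-wedge";
}
```

Feeding in what was observed on device (nothing thrown, no onclose, other JS running, reopen not helping, probe database writable) lands on the per-database case.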

Filed upstream as WebKit bug 315804 ("A readwrite IDBTransaction never fires oncomplete, onerror, or onabort after the user subscribes to web push in an installed PWA") with a link back to this PR for reproduction steps. The source comment in client.ts also references the bug so the workaround can be removed once WebKit ships a fix.

Fix

Three commits, each independently revertable:

  1. 81415fdd — fix: fail-fast Options writes. Wrap db.put('Options', ...) and db.delete('Options', ...) with a 1500ms hard timeout. On timeout, log a [SDK-4336] warning and resolve the promise as a no-op so init proceeds. Other stores keep their existing behavior. The values that don't get persisted are session metadata that the SW reads with sensible fallbacks if missing or stale, so push delivery is unaffected.

  2. 1ae8abdf — perf: short-circuit after first wedge. Once a single Options readwrite times out we know the store is poisoned for the rest of this page's lifetime. Add a module-scoped optionsWriteWedged flag so the remaining 7 Options writes in initSaveState + saveInitOptions short-circuit immediately instead of each independently paying the 1500ms timeout. Cuts init latency on a wedged page from ~12s to ~1.5s. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh.

  3. 88e4cd59 — chore(preview): repro sandbox (will be reverted before merge). Two-page demo + dev-server hardening that lets a reviewer reproduce the original 30-minute hang on main and confirm this branch resolves it. Detailed testing instructions in a separate PR comment.
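
For readers who want the shape of commits 1 and 2 without opening the diff, here is a minimal sketch of a fail-fast write guard with a page-scoped breaker. It is not the code in client.ts: `guardedWrite`, `writeWedged`, and the warning text are hypothetical, and the write itself is passed in as a callback so the sketch does not depend on IndexedDB.

```typescript
// Hypothetical names throughout; this is an illustration, not the SDK's client.ts.
const WRITE_TIMEOUT_MS = 1500;
let writeWedged = false; // module-scoped, so it resets on every page load

function guardedWrite(
  label: string,
  op: () => Promise<unknown>,
  timeoutMs: number = WRITE_TIMEOUT_MS,
): Promise<void> {
  // Breaker already tripped: skip the write without arming another timer.
  if (writeWedged) return Promise.resolve();

  return new Promise<void>((resolve, reject) => {
    let settled = false;

    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      writeWedged = true; // trip the breaker for the rest of the page lifetime
      console.warn(`${label} timed out after ${timeoutMs}ms`);
      resolve(); // resolve as a no-op so the caller can proceed
    }, timeoutMs);

    op().then(
      () => {
        if (settled) return;
        settled = true;
        clearTimeout(timer);
        resolve();
      },
      (err: unknown) => {
        if (settled) return; // a late rejection after the timeout is ignored
        settled = true;
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

With a write that never settles, the first call resolves after the timeout and every later call returns immediately without invoking its callback, which is the latency saving described in commit 2.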

A 4th commit (3c41181b) generalizing the timeout to all readwrite stores was tried and reverted (83cbff87) because we couldn't validate it on device — every subsequent on-device repro happened with an empty operation queue. Parked in branch history with the validation steps captured in a Linear comment on SDK-4336.

Systems Affected

  • WebSDK
  • Backend
  • Dashboard

Validation

Tests

Info

  • Full suite: 512/512 pass, lint clean, formatter clean.
  • The fix path is exercised by the existing client.test.ts round-trip tests (every Options write goes through the new wrapper, so all 12 client tests are effectively also covering the timeout path's happy case).
  • On-device verification on iPhone running iOS 26.4 (logs11.txt → logs14.txt in the chat): pre-fix init hung indefinitely; post-fix init completes within ~1.5s on the wedged navigation, with a [SDK-4336] db.put(Options) timed out … Tripping circuit breaker warning and 7 follow-up db.put(Options) skipped warnings, then internalInit → SW handshake → sessionInit proceed normally.
  • I haven't added a unit test specifically for the timeout path: the existing client.test.ts uses fake-indexeddb, which doesn't reproduce the WebKit-specific wedge, so a synthetic timeout test would only verify the timer plumbing. That plumbing is simple enough that the on-device evidence is more meaningful.

Checklist

  • All the automated tests pass or I explained why that is not possible
  • I have personally tested this on my machine or explained why that is not possible
  • I have included test coverage for these changes or explained why they are not needed

Programming Checklist

Interfaces:

  • Don't use default export
  • New interfaces are in model files

Functions:

  • Don't use default export
  • All function signatures have return types
  • Helpers should not access any data but rather be given the data to operate on.

TypeScript:

  • No TypeScript warnings
  • Avoid silencing null/undefined warnings with the exclamation point

Other:

  • Iteration: refrain from using elem of array syntax. Prefer forEach or use map
  • Avoid using global OneSignal accessor for context if possible.

Screenshots

Info

N/A — runtime correctness fix, no UI changes. See on-device console logs in the SDK-4336 Linear ticket and chat history.

Checklist

  • I have included screenshots/recordings of the intended results or explained why they are not needed

Related Tickets

SDK-4336

sherwinski commented May 28, 2026

Contributor Author

Testing instructions for reviewers

The sandbox commit (chore(preview): add iOS PWA repro sandbox) is included only so you can reproduce the bug end-to-end on a real iOS device. It will be reverted before merge — please don't ship it.

You'll need:

  • A real iOS device running iOS 26.x (the bug doesn't surface in the iOS Simulator's WebKit). Mine: iPhone running iOS 26.4.
  • An ngrok / Cloudflare tunnel account so the iOS device can reach your dev machine's HTTPS server.
  • A OneSignal app whose Site URL exactly matches the tunnel host you'll be using, with a custom service-worker integration pointing to push/onesignal/OneSignalSDKWorker.js. iOS Web Push enforces an exact origin match, so a pre-existing app pointing at https://localhost:4001 fails immediately with Error: Can only be used on: https://localhost:4001. Easiest path: create a fresh disposable test app in the dashboard.

One-time setup

git checkout sherwin/sdk-4336

# 1. Start the tunnel FIRST so you know your host. Free ngrok now issues
#    *.ngrok-free.dev, and the host is RANDOM per restart, so every new host
#    needs a fresh build (step 3) AND a dashboard Site URL update.
ngrok http https://localhost:4001

# 2. Capture the BARE host — no scheme, no port. Pasting the full https:// URL
#    bakes a broken `https://https://...` origin; a stray `:4001` also fails
#    because ngrok serves 443.
HOST=$(curl -s localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' | sed 's#https://##')
echo "$HOST"   # e.g. abc1-2-3.ngrok-free.dev

# 3. Build the dev SDK against your tunnel host. NO_DEV_PORT=true drops the port
#    so the device fetches assets from https://$HOST/... (443) through the tunnel.
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod

# 4. Serve the prebuilt SDK. This MUST run from the preview/ directory — that is
#    what binds https://localhost:4001. Do NOT use `vp dev --filter @onesignal/preview`
#    (the package is named `preview`, so the filter matches nothing and vp serves
#    the root config on :4000), and do NOT use `vp run start` (it re-runs build:dev
#    and re-bakes BUILD_ORIGIN=localhost, clobbering step 3).
cd preview && SDK_ENV=dev vp dev

Sanity-check the baked origin before testing (exactly one https://, bare host, no port):

rg "BUILD_ORIGIN = " build/releases/Dev-OneSignalSDK.page.js
# var BUILD_ORIGIN = "abc1-2-3.ngrok-free.dev";

Then set the OneSignal app's Site URL to https://$HOST/.
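
The "bare host" rule from step 2 is easy to get wrong when pasting by hand. As an illustration of what the `sed` step is meant to produce, here is a small sketch that normalizes either a full URL or a host-with-port down to the bare host. `toBareHost` is a hypothetical helper and is not part of the repo's tooling.

```typescript
// Hypothetical helper; the setup steps use a shell pipeline for this instead.
function toBareHost(input: string): string {
  // Accept either a full URL or an already-bare host (optionally with a port).
  const url = new URL(input.includes("://") ? input : `https://${input}`);
  // `hostname` excludes both the scheme and the port.
  return url.hostname;
}

console.log(toBareHost("https://abc1-2-3.ngrok-free.dev"));
console.log(toBareHost("abc1-2-3.ngrok-free.dev:4001"));
```

Both calls yield `abc1-2-3.ngrok-free.dev`, which is the form `BUILD_ORIGIN` expects.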

Reproduce the bug (control)

The control is the repro sandbox without the fix. main does not contain the sandbox pages (pageA.html / pageB.html live only in the sandbox commit), so don't git checkout main and open them — overlay just the preview/ dir onto main instead:

git checkout main
git checkout sherwin/sdk-4336 -- preview          # sandbox pages, no fix
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod
cd preview && SDK_ENV=dev vp dev

  1. On the iPhone, open https://$HOST/pageA.html?app_id=<APP_ID> in Safari.
  2. Tap the Share icon → Add to Home Screen → Add. Keep Open as a Web App toggled on (the bug only reproduces in a standalone PWA, not a Safari tab/bookmark).
  3. Open the installed PWA from the Home Screen. Confirm there's no Safari URL bar — that's how you know it's standalone.
  4. Tap Go to Page B → tap Go to Page A → tap Register → accept the system prompt.
  5. Verify you actually subscribed — this is the precondition for the wedge, which only happens after a push subscription exists. Attach your Mac (Safari → Develop → [device] → [the PWA's web view]) and run:
    await OneSignal.User.PushSubscription.id   // must be a non-null GUID
    OneSignal.Notifications.permission         // true
    If id is null, init() will keep completing normally and the bug will not reproduce — fix the subscription first (service worker reachable at /push/onesignal/OneSignalSDKWorker.js, dashboard app config correct).
  6. Now trigger a fresh init() while subscribed. The wedge isn't tied to one specific gesture — any path that re-runs OneSignal.init() in the standalone PWA after the subscription exists will hit it. The likely ways you'll encounter it in the wild, any one of which reproduces:
    • In-app navigation: tap Go to Page B → tap Go to Page A (each page calls init()).
    • Relaunch the app: force-quit the PWA (swipe it out of the app switcher) and reopen it from the Home Screen. This is the most common real-world trigger — users closing and reopening the installed web app.
    • Cold start after backgrounding: leave the PWA backgrounded long enough that iOS tears down the web view, then reopen.
  7. On that fresh load, with the inspector attached before you trigger it: the body renders and you'll see the init() / CoreModule.init() debug lines, then the console freezes with no OneSignal initialized (Nms) log line. Leave it ~30 minutes and it eventually resumes by itself — that's the WebKit watchdog firing.

    If you relaunched the PWA (rather than navigating), the Web Inspector won't follow automatically — re-select the PWA's web view under Safari → Develop → [device] after reopening to see the frozen console.

Note: init() completing on the earlier loads (before you're subscribed) is expected and correct. Only a load that runs init() after a push subscription exists wedges.

Verify the fix

git checkout sherwin/sdk-4336
BUILD_ORIGIN=$HOST NO_DEV_PORT=true vp run build:dev-prod
cd preview && SDK_ENV=dev vp dev

Reset state on the device (delete the PWA from the Home Screen, then Settings → Apps → Safari → Advanced → Website Data → remove the tunnel host's data), reinstall, and run the same sequence (including a force-quit/reopen, since that's the most common trigger). On the wedged load you should see roughly:

[Log]     !!!! [SDK-4336 PAGE A] OneSignal initialize
[Debug]   init()
...
[Warning] db.put(Options) timed out
...
[Log]     !!!! [SDK-4336 PAGE A] OneSignal initialized (~1500ms)

One Options write times out at 1500ms and trips the circuit breaker; the remaining Options writes short-circuit silently (no per-write log). Total init on the wedged load is ~1.5s instead of indefinite. No Ops in progress runaway loop, no exceptions, push delivery still works.

What I confirmed on my device

  • Pre-fix: init never completed in 30+ minutes of waiting (reproduced via both in-app navigation and force-quit/reopen after registering).
  • With the fast-fail timeout only (before the short-circuit optimization): init completes after ~12s (8 separate Options writes each timing out individually at 1.5s).
  • With the timeout + circuit breaker (shipped behavior): init completes after ~1.5s (one timeout, the rest short-circuit).
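
The two latency figures follow directly from the timeout and the write count. A quick worked version, assuming the 8 Options writes and the 1.5s timeout reported above, and ignoring everything else init does:

```typescript
const TIMEOUT_MS = 1500;  // per-write hard timeout from this PR
const OPTIONS_WRITES = 8; // Options writes observed during init on the wedged load

// Without the breaker, each write waits out its own timeout in sequence.
const withoutBreakerMs = OPTIONS_WRITES * TIMEOUT_MS;

// With the breaker, only the first write pays; the rest return immediately.
const withBreakerMs = TIMEOUT_MS;

console.log(`${withoutBreakerMs / 1000}s without breaker, ${withBreakerMs / 1000}s with breaker`);
```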

Captured logs are in the SDK-4336 Linear ticket if you want to compare to your own run.

Pre-merge

Before merging, drop the sandbox commit (chore(preview): add iOS PWA repro sandbox) — e.g. git rebase -i origin/main and drop that commit — then git push --force-with-lease. Only the fix commits need to ship.

Comment thread preview/vite.config.ts Dismissed
@sherwinski

Contributor Author

Note: we'll also roll back the [SDK-4336]... logs before merging.

@sherwinski sherwinski requested a review from fadi-george May 28, 2026 21:27
@sherwinski

Contributor Author

@claude review

Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts
@sherwinski sherwinski requested a review from fadi-george May 28, 2026 23:26
Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/helpers/init.ts Outdated
Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts
@fadi-george

Contributor

would fadi/sdk-4336-options-write-timeout be enough?

sherwinski commented Jun 1, 2026

Contributor Author

Superseded by #1469 (and the underlying core fix 7f5df216, which landed directly on main while this PR was in review). The new PR keeps just the additive hardening (circuit breaker + appId-deferral + clearTimeout + timeout warn log) on top of the main commit, dropping the demo sandbox and other churn.

@sherwinski sherwinski closed this Jun 1, 2026
@sherwinski sherwinski reopened this Jun 1, 2026
Comment thread src/shared/helpers/init.ts Outdated
Comment thread src/shared/database/client.ts
…erge)

Adds the reproducible demo we used to verify the SDK-4336 fix on a real
iOS Safari PWA. Lets a reviewer reproduce the original 30-minute init
hang on `main` and confirm the fix branch resolves it.

What's included:

- `preview/pageA.html`, `preview/pageB.html` — minimal two-page sandbox
  with a Register button on Page A, designed to exercise the
  navigation-after-push-subscription flow described in the ticket.
  Page A persists `app_id` to `localStorage` so subsequent in-page
  navigations don't lose it (the in-page links don't carry the query
  string, which would otherwise produce `InvalidAppIdError`).
- Both pages set `apple-mobile-web-app-capable` so they install as a
  standalone PWA when added to the Home Screen.
- `preview/OneSignalSDKWorker.js` and `preview/push/onesignal/OneSignalSDKWorker.js`
  now `importScripts` from `self.location.origin` instead of a
  hardcoded `https://localhost:4001`, so the worker resolves correctly
  when the dev server is exposed via an ngrok / Cloudflare tunnel for
  on-device testing.
- `preview/vite.config.ts` disables HMR (the WebSocket can't reach a
  device through a tunnel and floods the console with
  unhandled-rejection spam from `ws.send`), strips Vite's auto-injected
  `/@vite/client` from HTML responses for the same reason, and sends
  `Cache-Control: no-store` for SDK assets and HTML so iOS Safari /
  PWA doesn't pin a stale build during a debug session.
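
The `app_id` persistence described for Page A can be sketched as below. This is an assumption about the shape of the logic, not a copy of pageA.html: the function name, the storage key, and the `KeyValueStore` interface are hypothetical, and the store is injected so the sketch runs outside a browser.

```typescript
// Minimal subset of the Web Storage interface, so a Map-backed fake can be used.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const APP_ID_KEY = "app_id"; // hypothetical storage key

function resolveAppId(search: string, store: KeyValueStore): string | null {
  const fromQuery = new URLSearchParams(search).get("app_id");
  if (fromQuery) {
    // First load carries ?app_id=...; remember it for later navigations.
    store.setItem(APP_ID_KEY, fromQuery);
    return fromQuery;
  }
  // In-page links drop the query string, so fall back to the stored value.
  return store.getItem(APP_ID_KEY);
}
```

In a page this would be called as `resolveAppId(window.location.search, window.localStorage)`; a `null` result corresponds to the case that would otherwise end in `InvalidAppIdError`.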

This commit is intentionally **not** part of the SDK fix; it should be
reverted before merging the PR. Kept in branch history so we can
re-introduce it if SDK-4336 surfaces again.

iOS 26 Safari PWA can leave the `ONE_SIGNAL_SDK_DB.Options` object store
in a state where every `readwrite` request stalls indefinitely after the
user navigates back into the PWA following a push subscription. The
request never fires `success`, `error`, `abort`, or `complete`, so
`OneSignal.init()` blocks on the first Options `put` until WebKit's
internal transaction watchdog finally aborts it ~30 minutes later. Reads
on the same connection still work, `readwrite` on other stores still
works, and reopening the database does not clear the wedge — only the
`Options` store readwrite path is poisoned. A separate IDB at the same
origin is unaffected, so this is a per-database WebKit bug, not the
NetworkProcess crash family already tracked in WebKit bugs 273827 /
277615 / 309386.

Wrap `db.put`/`db.delete` on `Options` with a 1.5s hard timeout. On
timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op
so init can continue. Other stores keep their existing behavior. The
values written to `Options` are non-critical session metadata
(`pageTitle`, `persistNotification`, webhook URLs, click-handler
config, `lastPushToken`, `isPushEnabled`, etc.) that the service worker
reads with sensible fallbacks if missing or stale, so push delivery
remains unaffected.

Once a single Options `readwrite` request times out we know the store
is poisoned for the rest of the page's lifetime — fresh connections
inherit the same WebKit lock state, and we have no signal that would
let us probe whether the wedge has cleared mid-session. Today every
remaining Options write in `initSaveState` + `saveInitOptions` still
arms its own 1.5s timer and walks to the timeout independently, which
adds up to ~12s of init latency on the first navigation back into a
wedged PWA.

Add a module-scoped `optionsWriteWedged` flag. When the first Options
write times out, set the flag and resolve subsequent Options writes as
no-ops immediately, logging a `[SDK-4336]` warning so the skip is
visible in telemetry. The flag is page-scoped (resets on navigation),
so a subsequent navigation will probe the wedge fresh with the regular
timeout. With this in place, init on a wedged page completes in ~1.5s
instead of ~12s.

The first SDK-4336 commit only protected `Options` writes, but on-device
verification (logs12.txt) showed that once init completes, the
`OperationRepo` queue still wedges: `_executeOperations` awaits a
`db.put('operations', ...)` (or a downstream model-store `_persist`)
that never settles, leaving `runningOps = true` forever and spamming
`Ops in progress` every 500ms. This is the same iOS 26 Safari PWA
WebKit lock poisoning we saw on `Options`, just affecting different
stores once init is no longer the first thing to write.

Generalize the workaround:

- Rename `optionsWriteWedged` → `readwriteWedged` and apply the timeout
  + circuit breaker to every readwrite op (`put`, `delete`, `clear`),
  not just `Options`.
- Once any readwrite times out, mark the DB readwrite path wedged for
  the rest of the page's lifetime. All subsequent readwrites
  short-circuit to a no-op resolve, with a `[SDK-4336]` warning logged
  for telemetry.
- Reads (`get`, `getAll`) and `objectStoreNames`/`close` are unchanged.

The values we drop on a wedged page are either session metadata the
service worker re-derives from network state on the next visit, or
queued operations whose effects (subscription create/update/delete,
identity changes, etc.) are idempotent server-side and will be
re-attempted on the next page load. The alternative is letting the
operation queue spin forever, which is materially worse.

Shortens the two `Log._warn` strings in `withOptionsWriteTimeout` (still
tagged with `[SDK-4336]`) and bumps the `page.es6.js` and `sw.js`
size-limit entries to fit the circuit-breaker code added by the
SDK-4336 fix.

Combine `withOptionsWriteTimeout` and the per-method `if (storeName ===
'Options')` branches into a single `guardOptionsWrite(storeName, label,
op)` helper, and condense the explanatory comment block. Also drop the
`[SDK-4336]` prefix from runtime warnings — the messages stand on their
own and the ticket is captured in the commit log.

Two refinements to the Options-store guard:

1. `db.put`/`db.delete` now `await dbPromise` *before* invoking
   `guardOptionsWrite`, so the timeout scopes only to the readwrite
   request itself. Previously the 1500ms budget covered both DB
   open/upgrade and the put, so a slow `open()` (cross-tab `blocked`
   event during a schema upgrade, `terminated()` callback re-opening,
   or v5/v6 migrations on slow hardware) could false-trip the breaker
   and silently drop subsequent Options writes for the page lifetime.

2. Export `isOptionsWriteWedged()` and use it from `initSaveState` to
   defer the new-appId commit when the Options reset got circuit-broken
   mid-flight. Without this, the `Ids.appId` write (unguarded — the
   guard is Options-only) would succeed while the previous app's
   `isPushEnabled` / `lastPushId` / `lastPushToken` / `lastOptedIn`
   stayed put, and the `previousAppId !== appId` gate would keep the
   reset branch from re-entering on later loads — leaving cross-app
   contamination permanent. Skipping the commit lets a future
   non-wedged load complete the reset.
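
The second refinement is easier to follow with the control flow written out. The sketch below uses in-memory maps in place of the real `Ids` and `Options` stores, and every name in it (`DemoStores`, `deleteOptionGuarded`, `commitAppId`) is hypothetical; only the four Options keys and the deferral rule come from the commit message.

```typescript
// In-memory stand-ins for the SDK's stores; purely illustrative.
interface DemoStores {
  ids: Map<string, string>;
  options: Map<string, unknown>;
}

const PER_APP_OPTION_KEYS = ["isPushEnabled", "lastPushId", "lastPushToken", "lastOptedIn"];

// Stand-in for a guarded Options delete: silently skipped once the breaker trips.
async function deleteOptionGuarded(
  stores: DemoStores,
  key: string,
  isWedged: () => boolean,
): Promise<void> {
  if (isWedged()) return;
  stores.options.delete(key);
}

// Returns true if the new appId was committed, false if the commit was deferred.
async function commitAppId(
  stores: DemoStores,
  appId: string,
  isWedged: () => boolean,
): Promise<boolean> {
  const previousAppId = stores.ids.get("appId");
  if (previousAppId === appId) return true; // nothing to reset

  for (const key of PER_APP_OPTION_KEYS) {
    await deleteOptionGuarded(stores, key, isWedged);
  }

  // If the reset was circuit-broken, keep the old appId so the
  // previousAppId !== appId gate re-enters the reset on a later load.
  if (isWedged()) return false;

  stores.ids.set("appId", appId);
  return true;
}
```

On a wedged load the old appId and the old app's metadata both stay in place; on the next healthy load the same call clears the metadata and commits the new appId together.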
Apply Fadi's review feedback on the Options-write guard:

- Rewrite `guardOptionsWrite` to the leaner `Promise.race([op(), timeout])`
  form with `.finally(clearTimeout)`, dropping the explicit `settled`-flag
  plumbing. `Promise.race` keeps a handler on `op`, so a post-timeout
  rejection stays handled (no unhandled rejection).
- Drop the per-call "skipped (Options store wedged)" warning; the single
  "timed out" log on first wedge is the actionable signal and the rest
  would just be noise that bloats the bundle.
- Drop the redundant `String(key)` cast in the delete label (Options keys
  are always strings).
- Drop the "App ID change reset deferred" warning in `initSaveState`; the
  surrounding comment already documents the deferral.
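
As a rough illustration of the `Promise.race` plus `.finally(clearTimeout)` form named in the first bullet, here is a self-contained sketch. `raceWithTimeout` and its `onTimeout` callback are hypothetical; the real `guardOptionsWrite` may differ in signature and in how it trips the breaker.

```typescript
// Hypothetical helper showing the race-and-clear pattern, not the SDK's function.
function raceWithTimeout<T>(
  op: () => Promise<T>,
  timeoutMs: number,
  onTimeout: () => void,
): Promise<T | void> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. trip a breaker and log once
      resolve();   // resolve as a no-op so the caller proceeds
    }, timeoutMs);
  });

  // Promise.race subscribes to both promises, so if op() rejects after the
  // timeout has already won, that rejection still has a handler attached.
  // The finally() clears the timer when op() settles first.
  return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
}
```

Compared with tracking a `settled` flag by hand, the race form leaves only one piece of state to manage, the timer handle.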

Add unit tests for the new behavior:

- `client.test.ts`: circuit breaker trips on an Options-write timeout and
  short-circuits subsequent Options writes while leaving other stores
  unaffected; the timeout is cleared when a write resolves first.
- `init.test.ts`: `initSaveState` defers the new-appId commit (and the
  `Ids` clears) when the breaker is tripped, so a wedged app-ID switch
  doesn't strand the previous app's metadata.

@fadi-george

Contributor

Verified 30min timeout (pre-fix).

@sherwinski sherwinski merged commit 07d28de into main Jun 2, 2026
4 checks passed
@sherwinski sherwinski deleted the sherwin/sdk-4336 branch June 2, 2026 16:31
@github-actions github-actions Bot mentioned this pull request Jun 2, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: OneSignal.init() hangs on subsequent page navigations in iOS Safari PWA after push subscription is registered

3 participants