fix(tray): stop "looks connected but isn't shipping" silent failure #132
Merged
A 10+ hour outage on 2026-05-28 surfaced three architectural bugs in the tray's auth-loss handling. All three fed each other:

## Bug 1: hangar worker unilaterally sabotages the sync worker

On a 401 from the hangar push (`POST /v1/me/hangar`), the hangar worker called `clear_persisted_device_token()` AND set `account_status.auth_lost = true` tray-wide. The intent was "if our push 401s, the sync worker's calls will also 401, save a round-trip". But the hangar push and sync drain are independent surfaces, and they can race: the sync worker captures `api_url`/`access_token` at respawn-time, so a fresh post-pair token can be live in the sync worker's locals while the hangar worker is still using the pre-pair config snapshot. The hangar 401 in that race wipes the just-paired token AND flips global `auth_lost`, for everyone.

Fix: hangar logs + records `hangar_stats.last_error` and bails its own cycle. Tray-wide `auth_lost` decisions stay with the sync worker's own 401 handlers.

## Bug 2: sync worker with auth_lost loops silently forever

`sync.rs::spawn_lane` checked `if !auth_lost` and skipped the drain on every tick when set. Workers stayed alive but did no work: no HTTP, no `sync_stats` updates, no log lines. The user saw 10+ hours of zero activity with no diagnostic signal.

Fix: when the worker sees `auth_lost`, it logs once, emits `sync-paused` so the UI can react, and BREAKS out of the loop. The worker re-spawns on the next `respawn()` (pair_device / save_config / set_sync_preset), so re-pair recovers automatically.

## Bug 3: health pill shows green forever on frozen sync_stats

The pill derived from `sync_stats.last_success_at` with no staleness check, so a worker that died at noon would have its "last green reading" displayed until the user restarted the app.

Fix: extend `deriveSyncHealth` with a staleness threshold (2× `bulk interval_secs`, floored at 60s). Adds `STALE` and `PAUSED` variants. Precedence is now OFF > PAUSED > IDLE > ERR > STALE > OK. ERR beats STALE because a known error is more useful than "we don't know"; STALE catches silent-failure modes like bug #2.

Plus a `sync-paused` listener in SettingsPane that surfaces a notice ("Sync paused: the server rejected this uplink's token. Re-pair the device to resume.") and clears on toggle-back-on, mirroring the existing `sync-revoked` notice plumbing.

## Tests

- 210 starstats-client cargo tests pass (no new tests; the `sync.rs` change is a control-flow refactor exercised by the existing fixture surface).
- 162 tray-ui vitest tests pass (was 160; added STALE and PAUSED tests in SettingsPane.test.tsx). Existing OK test updated to use a fresh timestamp so the new staleness check doesn't fire.
- `cargo fmt -p starstats-client --check` clean.
- `cargo clippy -p starstats-client --bin starstats-client --tests -- -D warnings` clean.
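The health derivation described under Bug 3 can be sketched roughly as follows. This is a minimal illustration, not the tray's actual `deriveSyncHealth`: the `SyncSnapshot` shape, its field names, the `staleAfterMs` helper, and the meaning of IDLE (taken here as "no attempt recorded yet") are all assumptions. The 60s floor is read as applying to the threshold as a whole.

```typescript
// Hypothetical types for illustration; the real tray-ui types may differ.
type SyncHealth = "OFF" | "PAUSED" | "IDLE" | "ERR" | "STALE" | "OK";

interface SyncSnapshot {
  enabled: boolean;             // remote sync toggle (assumed field)
  paused: boolean;              // a `sync-paused` event was received
  lastAttemptAt: number | null; // epoch ms of the latest drain attempt
  lastError: string | null;     // last recorded error, if any
  bulkIntervalSecs: number;     // bulk lane interval in seconds
}

// Staleness threshold: 2x the bulk interval, floored at 60 seconds.
function staleAfterMs(bulkIntervalSecs: number): number {
  return Math.max(2 * bulkIntervalSecs, 60) * 1000;
}

// Checks run in precedence order: OFF > PAUSED > IDLE > ERR > STALE > OK.
function deriveSyncHealth(s: SyncSnapshot, nowMs: number): SyncHealth {
  if (!s.enabled) return "OFF";
  if (s.paused) return "PAUSED";
  if (s.lastAttemptAt === null) return "IDLE"; // assumption: IDLE = nothing attempted yet
  if (s.lastError !== null) return "ERR";      // a known error beats "we don't know"
  if (nowMs - s.lastAttemptAt > staleAfterMs(s.bulkIntervalSecs)) return "STALE";
  return "OK";
}
```

Because ERR is checked before STALE, a worker that failed and then went quiet still reports its last known error, while a worker that went quiet without ever recording an error falls through to STALE instead of staying green.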
This was referenced May 29, 2026
ntatschner added a commit that referenced this pull request May 30, 2026
…137)

After a real device-revoke 401, `clear_persisted_device_token` (in the drain_lane / fetch_me_check paths) blanks `access_token` and `claimed_handle` on disk. The sync worker then sees `auth_lost=true` on its next iteration, exits, and emits `sync-paused`, which the Settings pane uses to flip the health pill to PAUSED.

What was missing: a `config-changed` emit. The React-side config state stays at whatever `setConfig(...)` last received, so even though disk says "unpaired", the SettingsPane reads `isPaired = !!access_token && !!claimed_handle` from React state and keeps the "Paired as TheCodeSaiyan" card mounted. The user had to click Unpair manually to force a refresh before they could enter a fresh pairing code.

Now: after emitting `sync-paused`, also `config::load()` the freshly-cleared config and emit `config-changed`. App.tsx's existing listener (`setConfig(e.payload)`) propagates the empty remote_sync block down to SettingsPane, which naturally transitions from the "Paired as" card to the "enter pairing code" input.

Surfaced 2026-05-29 on the live `tray-v1.8.12` build (PR #132 shipped today). User test confirmed PAUSED state fires correctly but the pair UI didn't transition without manual Unpair.

`clear_persisted_device_token` itself doesn't emit because it's called from drain_lane / fetch_me_check contexts without `AppHandle` in scope. Plumbing the handle through would touch every caller; the simpler move is to make the spawn_lane auth_lost branch (which already has the handle) responsible for the user-visible emit.

No tests added: the existing fixture surface doesn't reach the spawn_lane loop, and the emit itself is unit-untestable without mocking Tauri's event bus. The change is exercised end-to-end on the next user-facing repro.

Co-authored-by: Nigel Tatschner <n Tatschner@gmail.com>
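To illustrate why `sync-paused` alone left the "Paired as" card mounted, here is a small model of the UI-side state. It is a sketch under assumptions, not the App.tsx or SettingsPane code: `UiState`, `TrayEvent`, and `applyEvent` are hypothetical names, the config is reduced to the two fields named above, and the handle value is a placeholder.

```typescript
// Hypothetical, simplified config shape: only the two fields named in the text.
interface RemoteSync {
  access_token: string;
  claimed_handle: string;
}
interface Config {
  remote_sync: RemoteSync;
}

// UI-side state: the last config received plus the paused flag.
interface UiState {
  config: Config;
  paused: boolean;
}

// The two events discussed above, modeled as a discriminated union.
type TrayEvent =
  | { name: "sync-paused" }
  | { name: "config-changed"; payload: Config };

// Same predicate as quoted in the commit message.
function isPaired(cfg: Config): boolean {
  return !!cfg.remote_sync.access_token && !!cfg.remote_sync.claimed_handle;
}

// Pure stand-in for the listeners: `sync-paused` only flips the flag,
// `config-changed` replaces the config the UI holds.
function applyEvent(state: UiState, ev: TrayEvent): UiState {
  if (ev.name === "sync-paused") {
    return { ...state, paused: true };
  }
  return { ...state, config: ev.payload };
}

const pairedConfig: Config = {
  remote_sync: { access_token: "tok", claimed_handle: "ExampleHandle" },
};
// What disk looks like after the token is blanked.
const clearedConfig: Config = {
  remote_sync: { access_token: "", claimed_handle: "" },
};

const start: UiState = { config: pairedConfig, paused: false };
// Before the fix: only `sync-paused` arrives, so the UI config is stale.
const pausedOnly = applyEvent(start, { name: "sync-paused" });
// After the fix: `config-changed` follows with the cleared config.
const pausedAndRefreshed = applyEvent(pausedOnly, {
  name: "config-changed",
  payload: clearedConfig,
});
```

In this model `isPaired(pausedOnly.config)` is still true even though the paused flag is set, which matches the stuck card; only after the `config-changed` payload is applied does `isPaired` become false.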
## Summary

Diagnosed and fixed during the 2026-05-28 tray outage. PR #127's piggyback fix worked but exposed three deeper architectural bugs feeding "looks connected but isn't shipping" symptoms:

1. Hangar worker no longer clears the device token or sets `auth_lost` on a 401. A race between fresh sync-worker captures and stale hangar-worker captures meant the hangar's 401 could wipe a just-paired token. Now: hangar records its own failure (`hangar_stats.last_error`) and bails its cycle; tray-wide `auth_lost` decisions stay with the sync worker's own `/v1/ingest` and `/v1/auth/me` 401 handlers.
2. Sync worker with `auth_lost=true` now EXITS instead of looping silently. Previously it skipped drain every tick, never updating `sync_stats`, never logging. Now: log once, emit `sync-paused` for the UI, break out of the loop. Worker re-spawns on the next `respawn()` (pair_device / save_config / set_sync_preset) so re-pair recovers automatically.
3. `deriveSyncHealth` gains `STALE` (last_attempt > 2× bulk interval ago, no error) and `PAUSED` (`sync-paused` event received). Precedence: OFF > PAUSED > IDLE > ERR > STALE > OK. ERR beats STALE because a known error is more useful than "we don't know"; STALE catches the silent-failure mode from bug 2 going forward.

A new SettingsPane notice handles `sync-paused` ("Sync paused: the server rejected this uplink's token. Re-pair the device to resume.") and clears on toggle-back-on, mirroring the existing `sync-revoked` plumbing.

## Test plan

- `cargo test -p starstats-client --bin starstats-client` → 210 passed
- `pnpm --filter tray-ui run test:run` → 162 passed (was 160; added STALE + PAUSED variants)
- `cargo fmt -p starstats-client --check` clean
- `cargo clippy -p starstats-client --bin starstats-client --tests -- -D warnings` clean
- Expect the log line `sync worker exiting: auth_lost is set — waiting for re-pair`, the health pill to flip to PAUSED, and the Settings pane to show the "Sync paused" notice.

## Related

Follows [[synthetic-config-laundering]] (PR #127) and pairs with PR #131 (server-side auth message disambiguation). #131 makes future occurrences of this bug class diagnosable in one log line; this PR makes the symptom self-recovering when re-pair fires.