fix(tray): stop "looks connected but isn't shipping" silent failure #132

Merged: ntatschner merged 1 commit into next from fix/tray-auth-loss-architecture on May 28, 2026
@ntatschner
Collaborator

Summary

Diagnosed and fixed during the 2026-05-28 tray outage (PR #127's piggyback fix worked but exposed three deeper architectural bugs feeding "looks connected but isn't shipping" symptoms):

  1. Hangar worker no longer unilaterally clears the system-wide token + flips global auth_lost on a 401. A race between fresh sync-worker captures and stale hangar-worker captures meant the hangar's 401 could wipe a just-paired token. Now: hangar records its own failure (hangar_stats.last_error) and bails its cycle; tray-wide auth_lost decisions stay with the sync worker's own /v1/ingest and /v1/auth/me 401 handlers.
  2. Sync worker with auth_lost=true now EXITS instead of looping silently. Previously it skipped drain every tick, never updating sync_stats, never logging. Now: log once, emit sync-paused for the UI, break out of the loop. Worker re-spawns on the next respawn() (pair_device / save_config / set_sync_preset) so re-pair recovers automatically.
  3. Health pill staleness detection. deriveSyncHealth gains STALE (last_attempt > 2× bulk interval ago, no error) and PAUSED (sync-paused event received). Precedence: OFF > PAUSED > IDLE > ERR > STALE > OK. ERR beats STALE because a known error is more useful than "we don't know"; STALE catches the silent-failure mode from bug #2 going forward.

A new SettingsPane notice handles sync-paused ("Sync paused: the server rejected this uplink's token. Re-pair the device to resume.") and clears on toggle-back-on, mirroring the existing sync-revoked plumbing.

Test plan

  • cargo test -p starstats-client --bin starstats-client → 210 passed
  • pnpm --filter tray-ui run test:run → 162 passed (was 160; added STALE + PAUSED variants)
  • cargo fmt -p starstats-client --check clean
  • cargo clippy -p starstats-client --bin starstats-client --tests -- -D warnings clean
  • Manual: pair, then revoke the device server-side (or unpair from web). Expect the sync worker to log "sync worker exiting: auth_lost is set — waiting for re-pair", the health pill to flip to PAUSED, and the Settings pane to show the "Sync paused" notice.
  • Manual: pair, then KILL the worker (or wait for the bulk-interval to elapse with no activity). Expect health pill to flip from OK → STALE after ~2× interval_secs without misrepresenting state as green.

Related

Follows [[synthetic-config-laundering]] (PR #127) and pairs with PR #131 (server-side auth message disambiguation). #131 makes future occurrences of this bug class diagnosable in one log line; this PR makes the symptom self-recovering when re-pair fires.

A 10+ hour outage on 2026-05-28 surfaced three architectural bugs in
the tray's auth-loss handling. All three fed each other:

## Bug 1 — hangar worker unilaterally sabotages the sync worker

On a 401 from the hangar push (`POST /v1/me/hangar`), the hangar
worker called `clear_persisted_device_token()` AND set
`account_status.auth_lost = true` tray-wide. The intent was "if our
push 401s, the sync worker's calls will also 401, save a round-trip".
But the hangar push and sync drain are independent surfaces, and they
can race: the sync worker captures `api_url`/`access_token` at
respawn-time, so a fresh post-pair token can be live in the sync
worker's locals while the hangar worker is still using the pre-pair
config snapshot. The hangar 401 in that race wipes the just-paired
token AND flips global auth_lost — for everyone. Fix: hangar logs +
records `hangar_stats.last_error` and bails its own cycle. Tray-wide
auth_lost decisions stay with the sync worker's own 401 handlers.

## Bug 2 — sync worker with auth_lost loops silently forever

`sync.rs::spawn_lane` checked `if !auth_lost` and skipped the drain
on every tick when set. Workers stayed alive but did no work — no
HTTP, no `sync_stats` updates, no log lines. The user saw 10+ hours
of zero activity with no diagnostic signal. Fix: when the worker
sees `auth_lost`, it logs once, emits `sync-paused` so the UI can
react, and BREAKS out of the loop. The worker re-spawns on the next
`respawn()` (pair_device / save_config / set_sync_preset), so re-pair
recovers automatically.

## Bug 3 — health pill shows green forever on frozen sync_stats

The pill derived from `sync_stats.last_success_at` with no staleness
check, so a worker that died at noon would have its "last green
reading" displayed until the user restarted the app. Fix: extend
`deriveSyncHealth` with a staleness threshold (2× the bulk `interval_secs`,
floored at 60s). Adds `STALE` and `PAUSED` variants. Precedence is
now OFF > PAUSED > IDLE > ERR > STALE > OK — ERR beats STALE because
a known error is more useful than "we don't know"; STALE catches
silent-failure modes like bug #2.
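
The precedence above can be sketched as a pure function. This is a minimal
illustration, not the tray's actual `deriveSyncHealth`: the `SyncSnapshot`
shape, its field names, and the reading of IDLE as "enabled but no attempt
recorded yet" are assumptions made for the example. Only the variant names,
their ordering, and the 2× interval threshold floored at 60s come from the
description.

```typescript
// Hypothetical sketch: field names and the IDLE condition are assumptions.
type SyncHealth = "OFF" | "PAUSED" | "IDLE" | "ERR" | "STALE" | "OK";

interface SyncSnapshot {
  enabled: boolean;               // remote sync is toggled on
  paused: boolean;                // a sync-paused event has been received
  lastAttemptAtMs: number | null; // epoch millis of the last drain attempt
  lastError: string | null;       // last recorded error, if any
  bulkIntervalSecs: number;       // bulk interval in seconds
}

// 2x the bulk interval, floored at 60 seconds, expressed in milliseconds.
function staleThresholdMs(bulkIntervalSecs: number): number {
  return Math.max(2 * bulkIntervalSecs, 60) * 1000;
}

// Checks run in precedence order: OFF > PAUSED > IDLE > ERR > STALE > OK.
function deriveSyncHealthSketch(s: SyncSnapshot, nowMs: number): SyncHealth {
  if (!s.enabled) return "OFF";
  if (s.paused) return "PAUSED";
  if (s.lastAttemptAtMs === null) return "IDLE";
  // A known error is reported even when the reading is also stale.
  if (s.lastError !== null) return "ERR";
  if (nowMs - s.lastAttemptAtMs > staleThresholdMs(s.bulkIntervalSecs)) {
    return "STALE";
  }
  return "OK";
}
```

With a 300s bulk interval the threshold works out to 600s; with a 10s
interval the floor applies and the threshold is 60s.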

Plus a `sync-paused` listener in SettingsPane that surfaces a notice
("Sync paused: the server rejected this uplink's token. Re-pair the
device to resume.") and clears on toggle-back-on, mirroring the
existing `sync-revoked` notice plumbing.
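
The notice behaviour can be modelled as a small state transition. This is a
hedged sketch rather than the SettingsPane code: the `NoticeEvent` type, the
`sync-toggled` event name, and `nextNotice` are hypothetical names invented
for the example, and leaving the notice untouched when sync is toggled off is
an assumption. In the real pane the trigger would presumably be a Tauri event
listener for `sync-paused`.

```typescript
// Hypothetical model of the notice state; only the "sync-paused" event name
// and the notice text come from the PR description.
type NoticeEvent =
  | { kind: "sync-paused" }
  | { kind: "sync-toggled"; enabled: boolean };

const SYNC_PAUSED_NOTICE =
  "Sync paused: the server rejected this uplink's token. Re-pair the device to resume.";

function nextNotice(current: string | null, ev: NoticeEvent): string | null {
  switch (ev.kind) {
    case "sync-paused":
      // The worker exited because auth_lost was set: surface the notice.
      return SYNC_PAUSED_NOTICE;
    case "sync-toggled":
      // Toggling sync back on clears the notice; toggling it off is assumed
      // to leave whatever notice is currently shown.
      return ev.enabled ? null : current;
  }
}
```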

## Tests

- 210 starstats-client cargo tests pass (no new tests — the sync.rs
  change is a control-flow refactor exercised by the existing fixture
  surface).
- 162 tray-ui vitest tests pass (was 160; added STALE and PAUSED
  tests in SettingsPane.test.tsx). Existing OK test updated to use a
  fresh timestamp so the new staleness check doesn't fire.
- `cargo fmt -p starstats-client --check` clean.
- `cargo clippy -p starstats-client --bin starstats-client --tests
  -- -D warnings` clean.
@ntatschner ntatschner merged commit 510ba55 into next May 28, 2026
11 checks passed
@ntatschner ntatschner deleted the fix/tray-auth-loss-architecture branch May 28, 2026 20:17
ntatschner added a commit that referenced this pull request May 30, 2026
…137)

After a real device-revoke 401, `clear_persisted_device_token` (in
the drain_lane / fetch_me_check paths) blanks `access_token` and
`claimed_handle` on disk. The sync worker then sees `auth_lost=true`
on its next iteration, exits, and emits `sync-paused` — which the
Settings pane uses to flip the health pill to PAUSED.

What was missing: a `config-changed` emit. The React-side config
state stays at whatever `setConfig(...)` last received, so even
though disk says "unpaired", the SettingsPane reads `isPaired =
!!access_token && !!claimed_handle` from React state and keeps the
"Paired as TheCodeSaiyan" card mounted. The user had to click
Unpair manually to force a refresh before they could enter a fresh
pairing code.

Now: after emitting `sync-paused`, also `config::load()` the
freshly-cleared config and emit `config-changed`. App.tsx's
existing listener (`setConfig(e.payload)`) propagates the empty
remote_sync block down to SettingsPane, which naturally transitions
from the "Paired as" card to the "enter pairing code" input.
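
The stale-state problem and its fix can be illustrated without React or
Tauri. The sketch below is an assumption-laden model: `RemoteSyncConfig`,
`ConfigState`, and `onConfigChanged` are hypothetical names, and the nullable
field types are guesses. Only the `isPaired` expression and the idea that a
`config-changed` payload replaces the UI's config come from the commit
message.

```typescript
// Hypothetical shape of the remote_sync block as the UI sees it.
interface RemoteSyncConfig {
  access_token: string | null;
  claimed_handle: string | null;
}

// Same expression the commit message quotes for SettingsPane.
function isPaired(c: RemoteSyncConfig): boolean {
  return !!c.access_token && !!c.claimed_handle;
}

// Stand-in for the React-side config state: it only changes when a
// config-changed payload is delivered, never by reading disk directly.
class ConfigState {
  private config: RemoteSyncConfig;

  constructor(initial: RemoteSyncConfig) {
    this.config = initial;
  }

  // Models the existing listener that applies the event payload.
  onConfigChanged(payload: RemoteSyncConfig): void {
    this.config = payload;
  }

  get paired(): boolean {
    return isPaired(this.config);
  }
}

// The UI was last told the device is paired.
const ui = new ConfigState({ access_token: "tok", claimed_handle: "ExampleHandle" });

// The backend clears the token on disk after a 401. Without an emit the UI
// still believes it is paired.
const clearedOnDisk: RemoteSyncConfig = { access_token: null, claimed_handle: null };
const pairedBeforeEmit = ui.paired;

// With the fix, the cleared config is loaded and emitted, so the UI updates.
ui.onConfigChanged(clearedOnDisk);
const pairedAfterEmit = ui.paired;
```

Here `pairedBeforeEmit` is `true` and `pairedAfterEmit` is `false`, which is
the transition from the "Paired as" card to the pairing-code input.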

Surfaced 2026-05-29 on the live `tray-v1.8.12` build (PR #132
shipped today). User test confirmed PAUSED state fires correctly
but the pair UI didn't transition without manual Unpair.

`clear_persisted_device_token` itself doesn't emit because it's
called from drain_lane / fetch_me_check contexts without
`AppHandle` in scope. Plumbing the handle through would touch every
caller; the simpler move is to make the spawn_lane auth_lost branch
(which already has the handle) responsible for the user-visible
emit. No tests added — the existing fixture surface doesn't reach
the spawn_lane loop, and the emit itself is unit-untestable without
mocking Tauri's event bus. The change is exercised end-to-end on
the next user-facing repro.

Co-authored-by: Nigel Tatschner <n Tatschner@gmail.com>