
fix(controllers): retry storage updates on conflict, log other errors #34

Open
indyjonesnl wants to merge 4 commits into calfonso:main from indyjonesnl:upstream/storage-update-retry

Conversation

@indyjonesnl

Problem

Four controllers used `let _ = self.storage.update(...).await` to write status updates, silently swallowing every storage error. On a benign resource-version `Conflict` the update vanished — the controller's view of reality diverged from etcd. On a hard error nothing was logged, so failures were invisible.
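The shape of the bug and the fix can be shown in a std-only sketch. `StorageError` and `update` here are hypothetical stand-ins for the project's real storage types, not its actual API:

```rust
// Minimal sketch of the silenced-error anti-pattern vs. an explicit match.
// StorageError and update() are mock stand-ins, not the project's real types.

#[derive(Debug)]
enum StorageError {
    Conflict,   // benign: the resource version moved underneath us
    Io(String), // hard error: must at least be logged
}

// Mock update that fails with the given error, or succeeds on None.
fn update(fail: Option<StorageError>) -> Result<(), StorageError> {
    match fail {
        Some(e) => Err(e),
        None => Ok(()),
    }
}

fn main() {
    // Before: every error, benign Conflict or hard I/O failure, vanishes here.
    let _ = update(Some(StorageError::Io("disk full".into())));

    // After: Conflict is handled explicitly, anything else is surfaced.
    match update(Some(StorageError::Io("disk full".into()))) {
        Ok(()) => {}
        Err(StorageError::Conflict) => { /* re-read and re-apply once */ }
        Err(e) => eprintln!("status update failed: {e:?}"),
    }
}
```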

Fix

Each controller now matches `Err` explicitly:

  • apiservice: `update_condition` writes the `Available` condition. On Conflict, re-read and re-apply once. Other errors log via `tracing::error!` with the apiservice name.
  • statefulset: Three `reconcile()` sites set `deletionTimestamp` on pods (terminal-pod cleanup, scale-down, rolling update). Routed through a new private `update_pod_with_retry()` helper: re-read + reapply once on Conflict, no-op on NotFound, log other errors.
  • scheduler: Six `let _ = ...` sites covering priority resolution, nominatedNodeName fallback after failed binding, and PodScheduled=Unschedulable condition writes. Same helper pattern.
  • deployment: Three improvements bundled because they all hit `reconcile_deployment`:
    1. Drop `tokio::task::block_in_place(|| Handle::current().block_on(...))` from the available-replica fallback — the surrounding fn is already `async`. Replace with a sequential `.await` accumulator.
    2. Add exponential backoff with jitter (10ms → 500ms) to the deployment-revision CAS retry. Without jitter, contended deployments thrashed storage.
    3. Convert four silenced `let _ = self.storage.update(...)` sites (orphan-RS adoption, RS/deployment revision annotation refresh, RS replica annotations) to log non-Conflict errors via `tracing::error!`.
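
The retry shape shared by the statefulset and scheduler helpers can be sketched as follows. This is a synchronous, std-only mock under stated assumptions: the real `update_pod_with_retry()` is async and talks to the project's storage layer, while `MockStorage` and its one-Conflict-then-succeed behavior are inventions for illustration:

```rust
// Sketch of the Conflict/NotFound/log pattern: re-read + re-apply once on
// Conflict, no-op on NotFound, log everything else. MockStorage is a
// hypothetical stand-in for the real async storage layer.

use std::cell::Cell;

#[derive(Debug)]
enum StorageError {
    Conflict,
    NotFound,
    Io(String),
}

struct MockStorage {
    conflicts_left: Cell<u32>, // how many Conflicts to inject before success
}

impl MockStorage {
    fn get(&self, _key: &str) -> Result<String, StorageError> {
        Ok("fresh-pod-spec".to_string())
    }
    fn update(&self, _key: &str, _pod: &str) -> Result<(), StorageError> {
        if self.conflicts_left.get() > 0 {
            self.conflicts_left.set(self.conflicts_left.get() - 1);
            Err(StorageError::Conflict)
        } else {
            Ok(())
        }
    }
}

fn update_pod_with_retry(storage: &MockStorage, key: &str, pod: &str) {
    match storage.update(key, pod) {
        Ok(()) => {}
        Err(StorageError::Conflict) => {
            // Our resource version is stale: re-read and re-apply exactly once.
            match storage.get(key) {
                Ok(fresh) => {
                    if let Err(e) = storage.update(key, &fresh) {
                        eprintln!("pod {key}: update failed after retry: {e:?}");
                    }
                }
                // Pod vanished between reconcile and update: nothing to do.
                Err(StorageError::NotFound) => {}
                Err(e) => eprintln!("pod {key}: re-read failed: {e:?}"),
            }
        }
        // Pod already gone: marking it for deletion is a no-op.
        Err(StorageError::NotFound) => {}
        Err(e) => eprintln!("pod {key}: update failed: {e:?}"),
    }
}

fn main() {
    let storage = MockStorage { conflicts_left: Cell::new(1) };
    update_pod_with_retry(&storage, "default/web-0", "pod-spec");
    // The single injected Conflict was absorbed by the one-shot retry.
    assert_eq!(storage.conflicts_left.get(), 0);
}
```

The single-retry design is deliberate: one Conflict is the common benign case, while repeated Conflicts suggest genuine contention better left to the next reconcile pass.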

Verification

`cargo build --workspace --locked` clean. Full test suite is green when stacked on #32; this PR adds no new test failures relative to the baseline.

Dependency

Depends on #32 for the test suite to show green.

indyjonesnl and others added 4 commits May 14, 2026 17:29
update_condition discarded all storage errors with 'let _ = ...'. On a
benign resource-version Conflict the Available condition was lost; on
other errors the failure was invisible. Match Conflict explicitly,
re-read and reapply once; log other errors with apiservice name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sites in reconcile() used 'let _ = self.storage.update(...)' to
mark a pod for deletion (terminal pod cleanup, scale-down, rolling
update). Errors were silently dropped — a benign resource-version
Conflict meant the pod never got its deletionTimestamp, and harder
errors went unlogged. Route all three through update_pod_with_retry()
which re-reads + reapplies once on Conflict, no-ops on NotFound, and
logs other errors with the pod key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six 'let _ = self.storage.update(...)' sites silently dropped storage
errors during priority resolution, nominatedNodeName fallback after
failed binding, and PodScheduled=Unschedulable condition writes. On a
benign resource-version Conflict these updates vanished, leaving pods
without the metadata downstream code expected. Route them through a
new update_pod_with_retry() helper: re-read + reapply once on Conflict,
no-op on NotFound, log other errors with pod key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues:

- The available-replica fallback used tokio::task::block_in_place +
  Handle::current().block_on(...) inside an already-async fn. Replace
  with a sequential .await accumulator — same semantics, no executor
  stall under load.

- The deployment-revision CAS loop retried 3 times with no backoff,
  thrashing storage on contended deployments. Add exponential backoff
  with jitter (10ms doubling to 500ms), and split Conflict (retry)
  from other Err (log + break).

- Four 'let _ = self.storage.update(...)' sites silently dropped errors
  on orphan-RS adoption, RS revision annotation refresh, deployment
  revision annotation refresh, and RS replica annotations. Log
  non-Conflict failures so they're visible in the controller log.
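
The backoff schedule described above (10ms doubling, capped at 500ms, with jitter) can be sketched std-only. The xorshift generator and the function name `backoff_with_jitter` are hypothetical stand-ins; the real loop wraps the deployment-revision CAS retry and presumably uses a proper RNG:

```rust
// Std-only sketch of exponential backoff with full jitter: the sleep is
// uniform in [0, base], where base doubles from 10ms up to a 500ms cap.

use std::time::Duration;

fn backoff_with_jitter(attempt: u32, seed: &mut u64) -> Duration {
    // Exponential envelope: 10ms, 20ms, 40ms, ..., capped at 500ms.
    let base_ms = (10u64 << attempt.min(16)).min(500);
    // Cheap xorshift PRNG stand-in; any real RNG works here.
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    // Full jitter: pick a delay uniformly in [0, base_ms].
    Duration::from_millis(*seed % (base_ms + 1))
}

fn main() {
    let mut seed = 0x9E3779B97F4A7C15u64; // any nonzero seed
    for attempt in 0..5 {
        let delay = backoff_with_jitter(attempt, &mut seed);
        // Each delay stays inside the capped exponential envelope.
        assert!(delay <= Duration::from_millis((10u64 << attempt).min(500)));
        println!("attempt {attempt}: sleep {delay:?} before CAS retry");
    }
}
```

Jitter is what stops contended deployments from thrashing: without it, every controller that lost the same CAS race retries on the same schedule and collides again.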

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
