
fix(controllers): retry storage updates on conflict, log other errors #34

Open
indyjonesnl wants to merge 4 commits into calfonso:main from indyjonesnl:upstream/storage-update-retry

Conversation

@indyjonesnl

Problem

Four controllers used `let _ = self.storage.update(...).await` to write status updates, silently swallowing every storage error. On a benign resource-version `Conflict` the update vanished — the controller's view of reality diverged from etcd. On a hard error nothing was logged, so failures were invisible.
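The shape of the bug and the fix can be shown in a std-only sketch. `StorageError` and `update` here are hypothetical stand-ins for the project's real storage types, not its actual API:

```rust
// Minimal sketch of the silenced-error anti-pattern vs. an explicit match.
// StorageError and update() are mock stand-ins, not the project's real types.

#[derive(Debug)]
enum StorageError {
    Conflict,   // benign: the resource version moved underneath us
    Io(String), // hard error: must at least be logged
}

// Mock update that fails with the given error, or succeeds on None.
fn update(fail: Option<StorageError>) -> Result<(), StorageError> {
    match fail {
        Some(e) => Err(e),
        None => Ok(()),
    }
}

fn main() {
    // Before: every error, benign Conflict or hard I/O failure, vanishes here.
    let _ = update(Some(StorageError::Io("disk full".into())));

    // After: Conflict is handled explicitly, anything else is surfaced.
    match update(Some(StorageError::Io("disk full".into()))) {
        Ok(()) => {}
        Err(StorageError::Conflict) => { /* re-read and re-apply once */ }
        Err(e) => eprintln!("status update failed: {e:?}"),
    }
}
```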

Fix

Each controller now matches `Err` explicitly:

  • apiservice: `update_condition` writes the `Available` condition. On Conflict, re-read and re-apply once. Other errors log via `tracing::error!` with the apiservice name.
  • statefulset: Three `reconcile()` sites set `deletionTimestamp` on pods (terminal-pod cleanup, scale-down, rolling update). Routed through a new private `update_pod_with_retry()` helper: re-read + reapply once on Conflict, no-op on NotFound, log other errors.
  • scheduler: Six `let _ = ...` sites covering priority resolution, nominatedNodeName fallback after failed binding, and PodScheduled=Unschedulable condition writes. Same helper pattern.
  • deployment: Three improvements bundled because they all hit `reconcile_deployment`:
    1. Drop `tokio::task::block_in_place(|| Handle::current().block_on(...))` from the available-replica fallback — the surrounding fn is already `async`. Replace with a sequential `.await` accumulator.
    2. Add exponential backoff with jitter (10ms → 500ms) to the deployment-revision CAS retry. Without jitter, contended deployments thrashed storage.
    3. Convert four silenced `let _ = self.storage.update(...)` sites (orphan-RS adoption, RS/deployment revision annotation refresh, RS replica annotations) to log non-Conflict errors via `tracing::error!`.
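
The retry shape shared by the statefulset and scheduler helpers can be sketched as follows. This is a synchronous, std-only mock under stated assumptions: the real `update_pod_with_retry()` is async and talks to the project's storage layer, while `MockStorage` and its one-Conflict-then-succeed behavior are inventions for illustration:

```rust
// Sketch of the Conflict/NotFound/log pattern: re-read + re-apply once on
// Conflict, no-op on NotFound, log everything else. MockStorage is a
// hypothetical stand-in for the real async storage layer.

use std::cell::Cell;

#[derive(Debug)]
enum StorageError {
    Conflict,
    NotFound,
    Io(String),
}

struct MockStorage {
    conflicts_left: Cell<u32>, // how many Conflicts to inject before success
}

impl MockStorage {
    fn get(&self, _key: &str) -> Result<String, StorageError> {
        Ok("fresh-pod-spec".to_string())
    }
    fn update(&self, _key: &str, _pod: &str) -> Result<(), StorageError> {
        if self.conflicts_left.get() > 0 {
            self.conflicts_left.set(self.conflicts_left.get() - 1);
            Err(StorageError::Conflict)
        } else {
            Ok(())
        }
    }
}

fn update_pod_with_retry(storage: &MockStorage, key: &str, pod: &str) {
    match storage.update(key, pod) {
        Ok(()) => {}
        Err(StorageError::Conflict) => {
            // Our resource version is stale: re-read and re-apply exactly once.
            match storage.get(key) {
                Ok(fresh) => {
                    if let Err(e) = storage.update(key, &fresh) {
                        eprintln!("pod {key}: update failed after retry: {e:?}");
                    }
                }
                // Pod vanished between reconcile and update: nothing to do.
                Err(StorageError::NotFound) => {}
                Err(e) => eprintln!("pod {key}: re-read failed: {e:?}"),
            }
        }
        // Pod already gone: marking it for deletion is a no-op.
        Err(StorageError::NotFound) => {}
        Err(e) => eprintln!("pod {key}: update failed: {e:?}"),
    }
}

fn main() {
    let storage = MockStorage { conflicts_left: Cell::new(1) };
    update_pod_with_retry(&storage, "default/web-0", "pod-spec");
    // The single injected Conflict was absorbed by the one-shot retry.
    assert_eq!(storage.conflicts_left.get(), 0);
}
```

The single-retry design is deliberate: one Conflict is the common benign case, while repeated Conflicts suggest genuine contention better left to the next reconcile pass.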

Verification

`cargo build --workspace --locked` clean. Full test suite is green when stacked on #32; this PR adds no new test failures relative to the baseline.

Dependency

Depends on #32 for the test suite to show green.

indyjonesnl and others added 4 commits May 14, 2026 17:29
update_condition discarded all storage errors with 'let _ = ...'. On a
benign resource-version Conflict the Available condition was lost; on
other errors the failure was invisible. Match Conflict explicitly,
re-read and reapply once; log other errors with apiservice name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sites in reconcile() used 'let _ = self.storage.update(...)' to
mark a pod for deletion (terminal pod cleanup, scale-down, rolling
update). Errors were silently dropped — a benign resource-version
Conflict meant the pod never got its deletionTimestamp, and harder
errors went unlogged. Route all three through update_pod_with_retry()
which re-reads + reapplies once on Conflict, no-ops on NotFound, and
logs other errors with the pod key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six 'let _ = self.storage.update(...)' sites silently dropped storage
errors during priority resolution, nominatedNodeName fallback after
failed binding, and PodScheduled=Unschedulable condition writes. On a
benign resource-version Conflict these updates vanished, leaving pods
without the metadata downstream code expected. Route them through a
new update_pod_with_retry() helper: re-read + reapply once on Conflict,
no-op on NotFound, log other errors with pod key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues:

- The available-replica fallback used tokio::task::block_in_place +
  Handle::current().block_on(...) inside an already-async fn. Replace
  with a sequential .await accumulator — same semantics, no executor
  stall under load.

- The deployment-revision CAS loop retried 3 times with no backoff,
  thrashing storage on contended deployments. Add exponential backoff
  with jitter (10ms doubling to 500ms), and split Conflict (retry)
  from other Err (log + break).

- Four 'let _ = self.storage.update(...)' sites silently dropped errors
  on orphan-RS adoption, RS revision annotation refresh, deployment
  revision annotation refresh, and RS replica annotations. Log
  non-Conflict failures so they're visible in the controller log.
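
The backoff schedule described above (10ms doubling, capped at 500ms, with jitter) can be sketched std-only. The xorshift generator and the function name `backoff_with_jitter` are hypothetical stand-ins; the real loop wraps the deployment-revision CAS retry and presumably uses a proper RNG:

```rust
// Std-only sketch of exponential backoff with full jitter: the sleep is
// uniform in [0, base], where base doubles from 10ms up to a 500ms cap.

use std::time::Duration;

fn backoff_with_jitter(attempt: u32, seed: &mut u64) -> Duration {
    // Exponential envelope: 10ms, 20ms, 40ms, ..., capped at 500ms.
    let base_ms = (10u64 << attempt.min(16)).min(500);
    // Cheap xorshift PRNG stand-in; any real RNG works here.
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    // Full jitter: pick a delay uniformly in [0, base_ms].
    Duration::from_millis(*seed % (base_ms + 1))
}

fn main() {
    let mut seed = 0x9E3779B97F4A7C15u64; // any nonzero seed
    for attempt in 0..5 {
        let delay = backoff_with_jitter(attempt, &mut seed);
        // Each delay stays inside the capped exponential envelope.
        assert!(delay <= Duration::from_millis((10u64 << attempt).min(500)));
        println!("attempt {attempt}: sleep {delay:?} before CAS retry");
    }
}
```

Jitter is what stops contended deployments from thrashing: without it, every controller that lost the same CAS race retries on the same schedule and collides again.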

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
