fix(controllers): retry storage updates on conflict, log other errors#34
Open
indyjonesnl wants to merge 4 commits into
Open
fix(controllers): retry storage updates on conflict, log other errors#34indyjonesnl wants to merge 4 commits into
indyjonesnl wants to merge 4 commits into
Conversation
update_condition discarded all storage errors with 'let _ = ...'. On a benign resource-version Conflict the Available condition was lost; on other errors the failure was invisible. Match Conflict explicitly, re-read and reapply once; log other errors with apiservice name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sites in reconcile() used 'let _ = self.storage.update(...)' to mark a pod for deletion (terminal pod cleanup, scale-down, rolling update). Errors were silently dropped — a benign resource-version Conflict meant the pod never got its deletionTimestamp, and harder errors went unlogged. Route all three through update_pod_with_retry() which re-reads + reapplies once on Conflict, no-ops on NotFound, and logs other errors with the pod key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six 'let _ = self.storage.update(...)' sites silently dropped storage errors during priority resolution, nominatedNodeName fallback after failed binding, and PodScheduled=Unschedulable condition writes. On a benign resource-version Conflict these updates vanished, leaving pods without the metadata downstream code expected. Route them through a new update_pod_with_retry() helper: re-read + reapply once on Conflict, no-op on NotFound, log other errors with pod key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues: - The available-replica fallback used tokio::task::block_in_place + Handle::current().block_on(...) inside an already-async fn. Replace with a sequential .await accumulator — same semantics, no executor stall under load. - The deployment-revision CAS loop retried 3 times with no backoff, thrashing storage on contended deployments. Add exponential backoff with jitter (10ms doubling to 500ms), and split Conflict (retry) from other Err (log + break). - Four 'let _ = self.storage.update(...)' sites silently dropped errors on orphan-RS adoption, RS revision annotation refresh, deployment revision annotation refresh, and RS replica annotations. Log non-Conflict failures so they're visible in the controller log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Four controllers used `let _ = self.storage.update(...).await` to write status updates, silently swallowing every storage error. On a benign resource-version `Conflict` the update vanished — the controller's view of reality diverged from etcd. On a hard error nothing was logged, so failures were invisible.
Fix
Each controller now matches `Err` explicitly:
Verification
`cargo build --workspace --locked` clean. Full test suite is green when stacked on #32; this PR adds no new test failures relative to the baseline.
Dependency
Depends on #32 for the test suite to show green.