fix(clouddriver-azure): retry VMSS resize on Azure 409 concurrent write conflicts #129
Open
sjungling wants to merge 3 commits into
…te conflicts
Azure returns 409 ConflictingConcurrentWriteNotAllowed when a prior write (e.g. disableCluster) is still in flight on the same VMSS. The scaleDown step was failing terminally because (1) the apply() call was outside the executeOp retry harness entirely, and (2) canRetry() did not recognize 409 as a retryable status code.
natedanner
approved these changes
Jun 18, 2026
…resizeServerGroup
AzureBaseClientSpec verifies that executeOp retries transient errors including the newly-added 409 Conflict, and that 404 is not retried. AzureComputeClientSpec verifies that resizeServerGroup calls executeOp for both the get and apply steps, and that a null VMSS (not found) skips the apply without NPE.
…ne executeOp closure
Matches the pattern already used by disableServerGroup and disableServerGroupWithLoadBalancer: wrapping the entire read-modify-write sequence in a single executeOp so retries (including the new 409 backoff) re-fetch a fresh VMSS reference instead of reusing a stale one. Also extracts shared test helpers (managementExceptionWithStatus, getUnsafe) into AzureClientSpecBase to remove duplication across the three client specs.
Summary
- Treat HTTP 409 as retryable in AzureBaseClient.canRetry() so the retry harness backs off and retries on Azure concurrent write conflicts
- Consolidate the get and apply steps of AzureComputeClient.resizeServerGroup() into one executeOp closure, matching the pattern already established by the disable methods; retries now re-fetch a fresh VMSS reference rather than reusing a stale one

Root cause
Investigated from pipeline execution 01KVE8QYGXK6TK349JW2E6SPVA (deploy-devazformonitoringatlas2, 2026-06-18 14:03 PDT). The scaleDown stage failed terminally with an Azure 409 ConflictingConcurrentWriteNotAllowed error.

Azure holds an exclusive write lock on a VMSS for the duration of any update. The disableCluster step completed from Spinnaker's perspective, but the underlying Azure VMSS update was still in flight when scaleDown immediately issued a resize, which hit the lock and got a 409. Azure explicitly documents this as transient and instructs callers to retry.

Prior art: the single-closure pattern
Three existing methods in AzureNetworkClient already wrap the full read-modify-write sequence in one executeOp closure. resizeServerGroup now matches them:

- disableServerGroup (app gateway path)
- disableServerGroupWithLoadBalancer
- enableServerGroupWithAppGateway

All three wrap vmss.update().apply() inside the same closure as the getByResourceGroup call, so a failed write retries the entire read-modify-write sequence with a fresh VMSS reference.

Before
    sequenceDiagram
        participant S as Spinnaker (Orca)
        participant CD as Clouddriver
        participant AZ as Azure VMSS API
        S->>CD: scaleDown(v054, capacity=0)
        CD->>AZ: getByResourceGroup(v054)
        Note over CD,AZ: inside executeOp() - retried on failure
        AZ-->>CD: 200 OK, vmss object
        CD->>AZ: vmss.update().withCapacity(0).apply()
        Note over CD,AZ: NOT inside executeOp() - no retry
        AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
        Note over CD: ManagementException caught<br/>resourceNotFound? No → rethrow
        CD-->>S: AtomicOperationException
        S-->>S: Stage TERMINAL - pipeline fails

After
    sequenceDiagram
        participant S as Spinnaker (Orca)
        participant CD as Clouddriver
        participant AZ as Azure VMSS API
        S->>CD: scaleDown(v054, capacity=0)
        rect rgb(220, 240, 255)
            Note over CD,AZ: executeOp() - attempt 1
            CD->>AZ: getByResourceGroup(v054)
            AZ-->>CD: 200 OK, vmss object
            CD->>AZ: vmss.update().withCapacity(0).apply()
            AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
            Note over CD: canRetry(409)? Yes → sleep(backoff)
        end
        rect rgb(220, 240, 255)
            Note over CD,AZ: executeOp() - attempt 2 (fresh VMSS fetch)
            CD->>AZ: getByResourceGroup(v054)
            AZ-->>CD: 200 OK, vmss object
            CD->>AZ: vmss.update().withCapacity(0).apply()
            AZ-->>CD: 200 OK
        end
        CD-->>S: success
        S-->>S: Stage SUCCEEDED - pipeline continues

Test plan
- AzureBaseClientSpec: verifies executeOp retries on 409, 408, and 5xx; does not retry on 404; throws after exhausting retries
- AzureComputeClientSpec: integration test verifies the VMSS is re-fetched on each retry (proving the get lives inside the closure); null guard verified; structural test verifies a single executeOp call
- disableCluster and scaleDown race
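
The retry behavior exercised by the test plan above can be illustrated with a minimal, standalone Java sketch. This is not the clouddriver code: RetrySketch, StatusException, the executeOp signature (attempt limit and backoff parameters), and resizeWithConflicts are all hypothetical names invented for this illustration, and the linear backoff is an assumption. Only the retryable status set (408, 409, 5xx, but not 404) and the idea of keeping the fetch and the write in one retried closure come from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RetrySketch {

    /** Hypothetical stand-in for the SDK's ManagementException; carries only the HTTP status. */
    static class StatusException extends RuntimeException {
        final int status;

        StatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Retryable statuses named in the test plan: 408, 409, and 5xx. A 404 is not retried. */
    static boolean canRetry(int status) {
        return status == 408 || status == 409 || (status >= 500 && status <= 599);
    }

    /** Runs op up to maxAttempts times, sleeping (assumed linear backoff) after retryable failures. */
    static <T> T executeOp(Supplier<T> op, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (StatusException e) {
                if (!canRetry(e.status) || attempt >= maxAttempts) {
                    throw e; // non-retryable status, or retries exhausted
                }
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
    }

    /**
     * Simulates a resize whose first `conflicts` writes fail with 409.
     * Returns {fetchCount, applyCount}. Both steps sit inside one closure,
     * so every retry repeats the fetch before the write.
     */
    static int[] resizeWithConflicts(int conflicts) {
        AtomicInteger fetches = new AtomicInteger();
        AtomicInteger applies = new AtomicInteger();
        executeOp(() -> {
            fetches.incrementAndGet();                    // stands in for getByResourceGroup(...)
            if (applies.incrementAndGet() <= conflicts) { // stands in for update().withCapacity(n).apply()
                throw new StatusException(409);
            }
            return "resized";
        }, 3, 1L);
        return new int[] {fetches.get(), applies.get()};
    }

    public static void main(String[] args) {
        int[] counts = resizeWithConflicts(1);
        System.out.println("fetches=" + counts[0] + " applies=" + counts[1]);
    }
}
```

With one simulated conflict, the closure runs twice, so the fetch count equals the apply count. If the fetch lived outside the retried closure, as in the "Before" diagram, it would run once and the retry would reuse the same stale object.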