fix(clouddriver-azure): retry VMSS resize on Azure 409 concurrent write conflicts by sjungling · Pull Request #129 · moderneinc/spinnaker

sjungling · 2026-06-18T21:35:50Z

Summary

Add HTTP 409 (Conflict) to AzureBaseClient.canRetry() so the retry harness backs off and retries on Azure concurrent write conflicts
Merge the get + apply in AzureComputeClient.resizeServerGroup() into one executeOp closure, matching the pattern already established by the disable methods — retries now re-fetch a fresh VMSS reference rather than reusing a stale one
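
As a rough illustration of the first change, the retry predicate could look like the sketch below. It assumes canRetry() receives a plain HTTP status code, which this description does not confirm. The class name and the integer signature are hypothetical, and the 408 and 5xx cases are taken from the test plan rather than from the actual source.

```java
import java.util.Set;

// Hypothetical sketch, not the actual AzureBaseClient code.
class RetryableStatusSketch {
    // 4xx statuses treated as transient here. 409 is the addition this PR describes;
    // 408 is listed in the test plan.
    private static final Set<Integer> RETRYABLE_4XX = Set.of(408, 409);

    /** Assumed signature: the real canRetry() may take an exception rather than an int. */
    static boolean canRetry(int statusCode) {
        boolean serverError = statusCode >= 500 && statusCode <= 599;
        return serverError || RETRYABLE_4XX.contains(statusCode);
    }

    public static void main(String[] args) {
        // Print the decision for a few representative status codes.
        for (int status : new int[] {404, 408, 409, 503}) {
            System.out.println(status + " retryable? " + canRetry(status));
        }
    }
}
```

Keeping 404 out of the retryable set matters for resize: a missing VMSS will not appear on a later attempt, so retrying would only delay the failure.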

Root cause

Investigated from pipeline execution 01KVE8QYGXK6TK349JW2E6SPVA (deploy-devaz for monitoringatlas2, 2026-06-18 14:03 PDT). The scaleDown stage failed terminally with:

Status code 409, ConflictingConcurrentWriteNotAllowed
"The operation was interrupted by a conflicting concurrent write on the same entity. Please retry later."

Azure holds an exclusive write lock on a VMSS for the duration of any update. The disableCluster step completed from Spinnaker's perspective but the underlying Azure VMSS update was still in flight when scaleDown immediately issued a resize — hitting the lock and getting a 409. Azure explicitly documents this as transient and instructs callers to retry.

Prior art — the single-closure pattern

Three existing methods in AzureNetworkClient already wrap the full read-modify-write sequence in one executeOp closure. resizeServerGroup now matches them.

All three wrap vmss.update().apply() inside the same closure as the getByResourceGroup call, so a failed write retries the entire read-modify-write sequence with a fresh VMSS reference.
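
The single-closure shape can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the clouddriver code: StatusException stands in for the Azure SDK's ManagementException, FakeVmssApi is a hypothetical in-memory stand-in for the VMSS API whose first write fails with 409, and the executeOp signature (attempt limit, linear backoff) is invented for the example.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a read-modify-write wrapped in one retry closure.
class SingleClosureRetrySketch {
    /** Stand-in for the SDK exception; carries only an HTTP status. */
    static class StatusException extends RuntimeException {
        final int status;
        StatusException(int status) { super("HTTP " + status); this.status = status; }
    }

    /** Fake VMSS API: the first write fails with 409, later writes succeed. */
    static class FakeVmssApi {
        int fetches = 0;
        int applies = 0;
        int capacity = 3;

        int fetch() { return ++fetches; } // each call models a fresh VMSS reference

        void apply(int fetchedRef, int newCapacity) {
            applies++;
            if (applies == 1) throw new StatusException(409); // concurrent write in flight
            capacity = newCapacity;
        }
    }

    static boolean canRetry(int status) {
        return status == 408 || status == 409 || (status >= 500 && status <= 599);
    }

    /** Assumed harness shape: retries the whole closure, sleeping between attempts. */
    static <T> T executeOp(Supplier<T> op, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (StatusException e) {
                if (!canRetry(e.status) || attempt >= maxAttempts) throw e;
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    /** Read and write live in the same closure, so every retry re-fetches. */
    static int resize(FakeVmssApi api, int newCapacity) {
        return executeOp(() -> {
            int ref = api.fetch();       // read
            api.apply(ref, newCapacity); // modify + write
            return api.capacity;
        }, 3, 1L);
    }

    public static void main(String[] args) {
        FakeVmssApi api = new FakeVmssApi();
        int capacity = resize(api, 0);
        System.out.println("capacity=" + capacity + " fetches=" + api.fetches + " applies=" + api.applies);
    }
}
```

In this model the fetch count equals the apply count, which is the property the single closure is meant to guarantee: no write is ever issued against a reference obtained before the failed attempt.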

Before

sequenceDiagram
    participant S as Spinnaker (Orca)
    participant CD as Clouddriver
    participant AZ as Azure VMSS API

    S->>CD: scaleDown(v054, capacity=0)
    CD->>AZ: getByResourceGroup(v054)
    Note over CD,AZ: inside executeOp() — retried on failure
    AZ-->>CD: 200 OK, vmss object
    CD->>AZ: vmss.update().withCapacity(0).apply()
    Note over CD,AZ: NOT inside executeOp() — no retry
    AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
    Note over CD: ManagementException caught<br/>resourceNotFound? No → rethrow
    CD-->>S: AtomicOperationException
    S-->>S: Stage TERMINAL — pipeline fails

After

sequenceDiagram
    participant S as Spinnaker (Orca)
    participant CD as Clouddriver
    participant AZ as Azure VMSS API

    S->>CD: scaleDown(v054, capacity=0)

    rect rgb(220, 240, 255)
        Note over CD,AZ: executeOp() — attempt 1
        CD->>AZ: getByResourceGroup(v054)
        AZ-->>CD: 200 OK, vmss object
        CD->>AZ: vmss.update().withCapacity(0).apply()
        AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
        Note over CD: canRetry(409)? Yes → sleep(backoff)
    end

    rect rgb(220, 240, 255)
        Note over CD,AZ: executeOp() — attempt 2 (fresh VMSS fetch)
        CD->>AZ: getByResourceGroup(v054)
        AZ-->>CD: 200 OK, vmss object
        CD->>AZ: vmss.update().withCapacity(0).apply()
        AZ-->>CD: 200 OK
    end

    CD-->>S: success
    S-->>S: Stage SUCCEEDED — pipeline continues

Test plan

AzureBaseClientSpec — verifies executeOp retries on 409, 408, 5xx; does not retry on 404; throws after exhausting retries
AzureComputeClientSpec — integration test verifies VMSS is re-fetched on each retry (proving get lives inside the closure); null guard verified; structural test verifies single executeOp call
Deploy pipeline for an Azure tenant completes without manual intervention when disableCluster and scaleDown race
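
The retry expectations in the first test plan item can be expressed as attempt counts. The sketch below is a standalone model, not the AzureBaseClientSpec itself: StatusException, the executeOp signature, and the attemptsFor helper are all hypothetical names introduced for illustration, and backoff sleeps are omitted to keep it short.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical model of the retry behaviour the specs are described as checking.
class RetryHarnessChecks {
    /** Stand-in for the SDK exception; carries only an HTTP status. */
    static class StatusException extends RuntimeException {
        final int status;
        StatusException(int status) { super("HTTP " + status); this.status = status; }
    }

    static boolean canRetry(int status) {
        return status == 408 || status == 409 || (status >= 500 && status <= 599);
    }

    /** Assumed harness shape: rethrows on non-retryable status or when attempts run out. */
    static <T> T executeOp(Supplier<T> op, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (StatusException e) {
                if (!canRetry(e.status) || attempt >= maxAttempts) throw e;
            }
        }
    }

    /** Runs an operation that always fails with the given status and counts the attempts. */
    static int attemptsFor(int status, int maxAttempts) {
        AtomicInteger calls = new AtomicInteger();
        try {
            executeOp(() -> {
                calls.incrementAndGet();
                throw new StatusException(status);
            }, maxAttempts);
        } catch (StatusException expected) {
            // The harness gave up: immediately for a non-retryable status,
            // or after maxAttempts for a retryable one.
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("404 attempts: " + attemptsFor(404, 3));
        System.out.println("409 attempts: " + attemptsFor(409, 3));
    }
}
```

A non-retryable status should produce exactly one attempt, while a retryable status that never clears should produce exactly maxAttempts attempts before the exception surfaces.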

…te conflicts

Azure returns 409 ConflictingConcurrentWriteNotAllowed when a prior write (e.g. disableCluster) is still in flight on the same VMSS. The scaleDown step was failing terminally because (1) the apply() call was outside the executeOp retry harness entirely, and (2) canRetry() did not recognize 409 as a retryable status code.

…resizeServerGroup

AzureBaseClientSpec verifies that executeOp retries transient errors, including the newly added 409 Conflict, and that 404 is not retried. AzureComputeClientSpec verifies that resizeServerGroup calls executeOp for both the get and apply steps, and that a null VMSS (not found) skips the apply without an NPE.

…ne executeOp closure

Matches the pattern already used by disableServerGroup and disableServerGroupWithLoadBalancer: wrapping the entire read-modify-write sequence in a single executeOp so retries (including the new 409 backoff) re-fetch a fresh VMSS reference instead of reusing a stale one.

Also extracts shared test helpers (managementExceptionWithStatus, getUnsafe) into AzureClientSpecBase to remove duplication across the three client specs.

moderne-meeseeks Bot assigned sjungling Jun 18, 2026

natedanner approved these changes Jun 18, 2026


sjungling requested a review from shanman190 June 18, 2026 21:40

sjungling added 2 commits June 18, 2026 14:45

sjungling wants to merge 3 commits into main from ci/investigate-pipeline-failure-E286
