
fix(clouddriver-azure): retry VMSS resize on Azure 409 concurrent write conflicts #129

Open
sjungling wants to merge 3 commits into main from ci/investigate-pipeline-failure-E286

Conversation


@sjungling sjungling commented Jun 18, 2026

Member

Summary

  • Add HTTP 409 (Conflict) to AzureBaseClient.canRetry() so the retry harness backs off and retries on Azure concurrent write conflicts
  • Merge the get and apply steps in AzureComputeClient.resizeServerGroup() into one executeOp closure, matching the pattern already established by the disable methods, so that retries re-fetch a fresh VMSS reference rather than reusing a stale one
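
A minimal sketch of the first change, written in Java for illustration. The real AzureBaseClient.canRetry() signature and its full set of retryable conditions are not shown in this PR, so the class name, the method shape, and the exact status set here are assumptions based on the codes listed in the test plan (409, 408, 5xx retried; 404 not retried).

```java
import java.util.Set;

final class RetryableStatusSketch {
    // Hypothetical set of retryable client-error codes; 409 is the addition this PR describes.
    private static final Set<Integer> RETRYABLE_4XX = Set.of(408, 409);

    /** Returns true when a failed call with this HTTP status is worth retrying. */
    static boolean canRetry(int statusCode) {
        boolean serverError = statusCode >= 500 && statusCode <= 599;
        return serverError || RETRYABLE_4XX.contains(statusCode);
    }

    public static void main(String[] args) {
        for (int status : new int[] {404, 408, 409, 503}) {
            System.out.println(status + " retryable: " + canRetry(status));
        }
    }
}
```

Keeping the decision in one predicate means the retry harness itself does not change when a new transient status is recognized.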

Root cause

Investigated from pipeline execution 01KVE8QYGXK6TK349JW2E6SPVA (deploy-devaz for monitoringatlas2, 2026-06-18 14:03 PDT). The scaleDown stage failed terminally with:

Status code 409, ConflictingConcurrentWriteNotAllowed
"The operation was interrupted by a conflicting concurrent write on the same entity. Please retry later."

Azure holds an exclusive write lock on a VMSS for the duration of any update. The disableCluster step had completed from Spinnaker's perspective, but the underlying Azure VMSS update was still in flight when scaleDown immediately issued a resize. The resize hit the lock and received a 409. Azure explicitly documents this as transient and instructs callers to retry.

Prior art — the single-closure pattern

Three existing methods in AzureNetworkClient already wrap the full read-modify-write sequence in one executeOp closure, and resizeServerGroup now matches them.

All three wrap vmss.update().apply() inside the same closure as the getByResourceGroup call, so a failed write retries the entire read-modify-write sequence with a fresh VMSS reference.
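
The single-closure pattern can be sketched roughly as follows, using a fake API rather than the Azure SDK. Everything here is hypothetical: StatusException stands in for the SDK's ManagementException, FakeVmssApi stands in for the VMSS client, and this executeOp is a simplified harness without backoff. The point is only that the fetch and the write sit in the same closure, so a retried write begins with a new fetch.

```java
import java.util.function.Supplier;

final class SingleClosureRetrySketch {
    /** Stand-in for the SDK's ManagementException; carries only an HTTP status. */
    static final class StatusException extends RuntimeException {
        final int status;
        StatusException(int status) { super("HTTP " + status); this.status = status; }
    }

    /** Fake VMSS API: the first write conflicts (409), later writes succeed. */
    static final class FakeVmssApi {
        int fetches = 0;
        int writes = 0;
        int capacity = 3;

        int fetchVersion() { return ++fetches; } // each fetch hands back a fresh reference

        void applyCapacity(int fetchedVersion, int newCapacity) {
            writes++;
            if (writes == 1) throw new StatusException(409);
            capacity = newCapacity;
        }
    }

    /** Hypothetical retry harness: re-runs the whole operation on a retryable status. */
    static <T> T executeOp(Supplier<T> op, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (StatusException e) {
                boolean retryable = e.status == 408 || e.status == 409 || e.status >= 500;
                if (!retryable || attempt >= maxAttempts) throw e;
                // A real harness would sleep with backoff here; omitted to keep the sketch fast.
            }
        }
    }

    /** Read and write live in one closure, so every retry starts with a fresh fetch. */
    static int resize(FakeVmssApi api, int newCapacity) {
        return executeOp(() -> {
            int version = api.fetchVersion();
            api.applyCapacity(version, newCapacity);
            return version;
        }, 3);
    }

    public static void main(String[] args) {
        FakeVmssApi api = new FakeVmssApi();
        int version = resize(api, 0);
        System.out.println("fetches=" + api.fetches + " writes=" + api.writes
            + " capacity=" + api.capacity + " appliedVersion=" + version);
    }
}
```

In this sketch the second attempt applies the write against the second fetched reference, which is the behavior the PR describes for resizeServerGroup.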

Before

sequenceDiagram
    participant S as Spinnaker (Orca)
    participant CD as Clouddriver
    participant AZ as Azure VMSS API

    S->>CD: scaleDown(v054, capacity=0)
    CD->>AZ: getByResourceGroup(v054)
    Note over CD,AZ: inside executeOp() — retried on failure
    AZ-->>CD: 200 OK, vmss object
    CD->>AZ: vmss.update().withCapacity(0).apply()
    Note over CD,AZ: NOT inside executeOp() — no retry
    AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
    Note over CD: ManagementException caught<br/>resourceNotFound? No → rethrow
    CD-->>S: AtomicOperationException
    S-->>S: Stage TERMINAL — pipeline fails

After

sequenceDiagram
    participant S as Spinnaker (Orca)
    participant CD as Clouddriver
    participant AZ as Azure VMSS API

    S->>CD: scaleDown(v054, capacity=0)

    rect rgb(220, 240, 255)
        Note over CD,AZ: executeOp() — attempt 1
        CD->>AZ: getByResourceGroup(v054)
        AZ-->>CD: 200 OK, vmss object
        CD->>AZ: vmss.update().withCapacity(0).apply()
        AZ-->>CD: 409 ConflictingConcurrentWriteNotAllowed
        Note over CD: canRetry(409)? Yes → sleep(backoff)
    end

    rect rgb(220, 240, 255)
        Note over CD,AZ: executeOp() — attempt 2 (fresh VMSS fetch)
        CD->>AZ: getByResourceGroup(v054)
        AZ-->>CD: 200 OK, vmss object
        CD->>AZ: vmss.update().withCapacity(0).apply()
        AZ-->>CD: 200 OK
    end

    CD-->>S: success
    S-->>S: Stage SUCCEEDED — pipeline continues

Test plan

  • AzureBaseClientSpec — verifies executeOp retries on 409, 408, 5xx; does not retry on 404; throws after exhausting retries
  • AzureComputeClientSpec — integration test verifies VMSS is re-fetched on each retry (proving get lives inside the closure); null guard verified; structural test verifies single executeOp call
  • Deploy pipeline for an Azure tenant completes without manual intervention when disableCluster and scaleDown race
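
The retry expectations in the first bullet can be illustrated by counting attempts against a simplified harness. This is a sketch only, not the Spock specs from the PR: StatusException, the limit of 3 attempts, and the harness itself are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RetryAttemptCountSketch {
    /** Stand-in for the SDK's ManagementException; carries only an HTTP status. */
    static final class StatusException extends RuntimeException {
        final int status;
        StatusException(int status) { super("HTTP " + status); this.status = status; }
    }

    static final int MAX_ATTEMPTS = 3; // assumed limit, not taken from the PR

    /** Simplified harness: retries retryable statuses, rethrows once attempts run out. */
    static void executeOp(Runnable op) {
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();
                return;
            } catch (StatusException e) {
                boolean retryable = e.status == 408 || e.status == 409 || e.status >= 500;
                if (!retryable || attempt >= MAX_ATTEMPTS) throw e;
            }
        }
    }

    /** Runs an operation that always fails with the given status; returns how often it ran. */
    static int attemptsFor(int status) {
        AtomicInteger calls = new AtomicInteger();
        try {
            executeOp(() -> { calls.incrementAndGet(); throw new StatusException(status); });
        } catch (StatusException expected) {
            // The failure always surfaces eventually; only the attempt count differs.
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("404 attempts: " + attemptsFor(404));
        System.out.println("409 attempts: " + attemptsFor(409));
    }
}
```

A non-retryable 404 runs once, while a persistent 409 runs until the attempt limit is reached and then propagates.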

…te conflicts

Azure returns 409 ConflictingConcurrentWriteNotAllowed when a prior write
(e.g. disableCluster) is still in flight on the same VMSS. The scaleDown
step was failing terminally because (1) the apply() call was outside the
executeOp retry harness entirely, and (2) canRetry() did not recognize 409
as a retryable status code.
@sjungling sjungling requested a review from shanman190 June 18, 2026 21:40
…resizeServerGroup

AzureBaseClientSpec verifies that executeOp retries transient errors including
the newly-added 409 Conflict, and that 404 is not retried. AzureComputeClientSpec
verifies that resizeServerGroup calls executeOp for both the get and apply steps,
and that a null VMSS (not found) skips the apply without NPE.
…ne executeOp closure

Matches the pattern already used by disableServerGroup and
disableServerGroupWithLoadBalancer: wrapping the entire read-modify-write
sequence in a single executeOp so retries (including the new 409 backoff)
re-fetch a fresh VMSS reference instead of reusing a stale one.

Also extracts shared test helpers (managementExceptionWithStatus, getUnsafe)
into AzureClientSpecBase to remove duplication across the three client specs.