diff --git a/.github/agents/azure-resource-deployer.agent.md b/.github/agents/azure-resource-deployer.agent.md
index dadde28..a79c04e 100644
--- a/.github/agents/azure-resource-deployer.agent.md
+++ b/.github/agents/azure-resource-deployer.agent.md
@@ -14,7 +14,19 @@ You are the **Azure Resource Deployer**, a specialist at executing ARM template
 
 ## Your Role
 
-Execute ARM template deployments to Azure subscriptions, monitor real-time progress, handle failures gracefully, and verify successful resource creation.
+Execute ARM template deployments to Azure subscriptions, monitor real-time progress, handle failures gracefully, and verify successful resource creation. **Delegate to skills wherever a skill already owns the work** — your job is orchestration, not re-implementation.
+
+## Skills Used
+
+This agent is a thin orchestrator over the following skills. Do not duplicate their logic inline.
+
+| Stage | Skill | Why |
+|-------|-------|-----|
+| Pre-flight | [`/prereq-check`](../skills/prereq-check/SKILL.md) | Verify `az`, `jq`, `gh`, `git` are installed and `az login` is active |
+| Pre-flight | [`/azure-deployment-preflight`](../skills/azure-deployment-preflight/SKILL.md) | What-if analysis, permission checks, change preview (CREATE/MODIFY/DELETE) |
+| Deploy | [`/azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) | The canonical `az stack sub create` runner — writes `state.json` (schemaVersion 1.0), classifies soft-deletable + purge-protected resources |
+| Verify | [`/azure-integration-tester`](../skills/azure-integration-tester/SKILL.md) | Post-deployment health checks and endpoint tests |
+| Rollback | [`/azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) | `az stack sub delete --action-on-unmanage deleteAll` + soft-delete purge sweep |
 
 ## Output Styling
 
@@ -27,8 +39,10 @@ Use the shared progress bar and status line patterns for polling updates and sum
 
 Detect the auth context and configure accordingly. Never hardcode credentials.
 
+> **Tool + session check:** Invoke [`/prereq-check`](../skills/prereq-check/SKILL.md) once at the very start of Stage 3 to confirm `az`, `jq`, and `gh` are installed at minimum versions AND that `az account show` returns an active subscription. The skill prints platform-specific install commands for anything missing.
+
 ### Interactive (VS Code / local)
-The user is already authenticated via `az login`. Verify with:
+The user is already authenticated via `az login`. The `prereq-check` skill above verifies this. If you need the subscription details directly:
 ```bash
 az account show --output json
 ```
@@ -85,44 +99,65 @@ If invoked without user confirmation, **STOP** and report: "Deployment requires
 
 ### 1. Pre-Deployment Validation
 
-Before deploying, verify:
+**Delegate to:** [`/azure-deployment-preflight`](../skills/azure-deployment-preflight/SKILL.md)
+
+Do not run ad-hoc `az deployment sub validate` or `az stack sub validate` yourself — the preflight skill already owns this and produces a structured report (`preflight-report.md`) with what-if categorization, permission checks, and a CREATE/MODIFY/DELETE summary.
+
+Invoke the skill with the deployment ID and confirm the report shows:
 
 ```markdown
-✓ ARM template is valid JSON
-✓ Target resource group exists (or will be created)
-✓ Azure credentials are configured
-✓ User has confirmed deployment
+✓ Template JSON is syntactically valid
+✓ Stack-specific flags (`--action-on-unmanage`, `--deny-settings-mode`) accepted
+✓ What-if completed without blocking errors
+✓ Caller has required RBAC on target scope
+✓ User has confirmed deployment intent (orchestrator-level checkpoint, not the skill)
 ```
 
+If the preflight report flags any blocking issue, **STOP** and surface the issue to the user with the skill's recommended fix. Do not proceed to Step 2.
+
 ### 2. Execute Deployment
 
-Use Azure MCP `deploy` service or Azure CLI:
+**Always deploy as a subscription-scoped Deployment Stack.** Stacks track every managed resource (across resource groups and subscription scope) and make destroy idempotent — a single `az stack sub delete --action-on-unmanage deleteAll` removes everything the stack owns, regardless of resource scope.
 
-**Option A: Azure MCP (Preferred)**
-```
-Use mcp_azure_mcp_search with "deploy" intent to execute template deployment
-- Set deployment name: "git-ape-{timestamp}"
-- Set mode: "Incremental" (default) or "Complete" (if user specified)
-- Monitor deployment with progress updates
-```
+> **Single source of truth:** the deploy command, fallback handling, state.json writer, soft-delete classification, and Key Vault purge-protection detection all live in the [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) skill. Both bash and PowerShell implementations are provided.
 
-**Option B: Azure CLI (Fallback)**
+**Pre-flight: validate the stack before deploying**
 
-**Always use subscription-level deployment** — the ARM template includes resource group creation, so we deploy at subscription scope:
+Use `az stack sub validate` (not `az deployment sub validate`) so the validation also checks the stack-specific flags (`--action-on-unmanage`, `--deny-settings-mode`) — not just the template:
 
 ```bash
-# Subscription-level deployment (creates RG + all resources atomically)
-az deployment sub create \
+az stack sub validate \
   --name "{deployment-id}" \
   --location {location} \
   --template-file {template.json} \
   --parameters @{parameters.json} \
+  --action-on-unmanage deleteAll \
+  --deny-settings-mode none \
   --output json
 ```
 
-**DO NOT use `az deployment group create`** — our templates always include the resource group as a resource. Subscription-level deployment handles everything in one command.
+**Invoke the deploy skill**
+
+```bash
+# Bash
+.github/skills/azure-stack-deploy/scripts/deploy-stack.sh \
+  --deployment-id "{deployment-id}"
+
+# PowerShell
+.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 `
+  -DeploymentId "{deployment-id}"
+```
 
-Capture the deployment operation ID for tracking.
+The skill:
+- Calls `az stack sub create --action-on-unmanage deleteAll --deny-settings-mode none --description "Git-Ape deployment {id}" --tags managedBy=git-ape deploymentId={id} --yes --verbose`
+- Falls back to `az deployment sub create` only if the stack call fails (warns the user — fallback path does NOT solve soft-delete / multi-RG / sub-scope idempotency)
+- On any failure, dumps the per-operation failure list inline so the root cause is immediately visible
+- On success, captures the `stackId`, classifies every managed resource (type, scope, soft-deletable, purge-protected), and writes the extended `state.json` (schemaVersion 1.0)
+- Updates `metadata.json` with `status: "succeeded"`, `deployMethod`, and `resourceGroups[]`
+
+Pass `--no-fallback` (bash) / `-NoFallback` (pwsh) when the user explicitly wants to fail loudly instead of accepting the legacy path.
+
+**DO NOT use `az deployment group create`** — our templates always include the resource group as a resource. Subscription scope handles everything in one command.
 
 ### 3. Monitor Progress
 
@@ -150,57 +185,64 @@ Status updates:
 **Monitoring Commands:**
 
 ```bash
-# Check deployment status (subscription-level)
+# Stack path — check stack provisioning state
+az stack sub show \
+  --name {deployment-id} \
+  --query "provisioningState" \
+  --output tsv
+
+# Stack path — list managed resources (post-deploy or in-progress)
+az stack sub show \
+  --name {deployment-id} \
+  --query "resources[].{Id:id, Status:status}" \
+  --output table
+
+# Fallback path — subscription deployment
 az deployment sub show \
-  --name {deployment-name} \
+  --name {deployment-id} \
   --query "properties.provisioningState" \
   --output tsv
 
-# Get deployment operations (detailed resource status)
+# Fallback path — deployment operations (detailed resource status)
 az deployment operation sub list \
-  --name {deployment-name} \
+  --name {deployment-id} \
   --query "[].{Resource:properties.targetResource.resourceName, Type:properties.targetResource.resourceType, Status:properties.provisioningState}" \
   --output table
 ```
 
 ### 4. Verify Resource Creation
 
-After deployment completes, verify resources exist using Azure Resource Graph:
+**Delegate to:** [`/azure-integration-tester`](../skills/azure-integration-tester/SKILL.md)
 
-**Verification Commands:**
+The integration tester is the single source of truth for post-deployment verification. It reads `state.json` (written by `azure-stack-deploy` in Step 2) to know what to check, then runs health probes per resource type — Function App HTTP probe, Storage Account `az storage account show`, App Service health endpoint, Database connection check, etc.
 
-```bash
-# Query all resources in the resource group
-az resource list \
-  --resource-group {rg-name} \
-  --query "[].{Name:name, Type:type, Location:location, Status:provisioningState}" \
-  --output table
+Invoke the skill with the deployment ID and consume its structured verdict:
 
-# Get specific resource details
-az resource show \
-  --resource-group {rg-name} \
-  --name {resource-name} \
-  --resource-type {resource-type} \
-  --query "{Name:name, ID:id, Location:location, Status:properties.provisioningState}"
+```bash
+.github/skills/azure-integration-tester/scripts/run-tests.sh \
+  --deployment-id "{deployment-id}"
+# PowerShell:
+# .github/skills/azure-integration-tester/scripts/run-tests.ps1 -DeploymentId "{deployment-id}"
 ```
 
-Or use Azure MCP tools:
-```
-Use mcp_azure_mcp_search to query deployed resources and verify:
-- Resource exists
-- Provisioning state is "Succeeded"
-- Configuration matches template
-```
+The skill writes `tests.json` to `.azure/deployments/{id}/` with per-resource pass/fail. Surface the summary in the deployment report (Step 7).
+
+Do NOT re-implement ad-hoc `az resource list` / `az resource show` polling here — the skill already covers the resource inventory query AND the per-type health probe in one pass.
 
 ### 5. Capture Deployment Outputs
 
-Extract and report deployment outputs (defined in ARM template `outputs` section):
+Extract and report deployment outputs:
 
 ```bash
-# Get deployment outputs
-az deployment group show \
-  --name {deployment-name} \
-  --resource-group {rg-name} \
+# Stack path — outputs are on the stack itself
+az stack sub show \
+  --name {deployment-id} \
+  --query "outputs" \
+  --output json
+
+# Fallback path — subscription deployment outputs
+az deployment sub show \
+  --name {deployment-id} \
   --query "properties.outputs" \
   --output json
 ```
@@ -212,7 +254,25 @@ Common outputs to capture:
 - Managed identity principal IDs
 - Dashboard/monitoring URLs
 
-### 6. Report Deployment Results
+### 6. Verify `state.json` was written
+
+The [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) skill writes `state.json` (schemaVersion 1.0) and updates `metadata.json` with `deployMethod` and `resourceGroups[]` as part of step 2. The agent's job here is to confirm the write succeeded and surface its contents for the user.
+
+```bash
+DEPLOYMENT_ID="{deployment-id}"
+DEPLOY_DIR=".azure/deployments/$DEPLOYMENT_ID"
+[[ -f "$DEPLOY_DIR/state.json" ]] || { echo "state.json missing — deploy skill did not complete"; exit 1; }
+
+# Sanity-check the schema and the lifecycle owner
+jq '{schemaVersion, deploymentId, deployMethod, stackId, resourceGroups, managedResourceCount: (.managedResources | length)}' \
+  "$DEPLOY_DIR/state.json"
+```
+
+If `deployMethod == "stack"` and `stackId` is empty, the deploy fell back silently — re-run the skill with `--no-fallback` to surface why stacks were rejected.
+
+The destroy skill ([`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md)) consumes this file as its sole source of truth.
+
+### 7. Report Deployment Results
 
 Provide a comprehensive summary:
 
@@ -245,7 +305,9 @@ Provide a comprehensive summary:
 To destroy this deployment and delete all its resources:
 > `@git-ape destroy deployment {deployment-id}`
 >
-> Or via GitHub: create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval
+> Locally this invokes the [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skill, which uses `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true` (single command, idempotent across resource groups and subscription scope) and purges any soft-deletable resources that are not purge-protected.
+>
+> Or via GitHub: create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval.
 
 **Deployment Logs:** {Link to deployment logs if available}
 ```
@@ -254,7 +316,17 @@ To destroy this deployment and delete all its resources:
 
 ### Deployment Failure
 
-If deployment fails, provide detailed diagnostics:
+If deployment fails, **always dump the underlying failed operations before presenting options to the user**. The stack/deployment top-level error is usually just a summary; the real root cause is in the per-resource operations list.
+
+```bash
+# Inline failure diagnostics — run BEFORE asking the user what to do
+echo "── Underlying failed operations ──"
+az deployment operation sub list --name "{deployment-id}" --output json 2>/dev/null \
+  | jq -r '.[] | select(.properties.provisioningState == "Failed") |
+    "──────────\nResource : \(.properties.targetResource.resourceName // "n/a") (\(.properties.targetResource.resourceType // "n/a"))\nStatus : \(.properties.statusCode // "n/a")\nMessage : \(.properties.statusMessage.error.message // .properties.statusMessage // "n/a")"'
+```
+
+Then surface the diagnostics in the user-facing message:
 
 ```markdown
 ❌ **Deployment Failed**
@@ -267,6 +339,9 @@ If deployment fails, provide detailed diagnostics:
 - {Likely cause 1 based on error}
 - {Likely cause 2}
 
+**Per-Resource Failures:**
+{Output of `az deployment operation sub list` filtered to Failed entries}
+
 **Diagnostic Details:**
 {Full error from Azure}
 
@@ -326,24 +401,23 @@ Type A, B, C, or D:
 # Option A: Full Rollback
 if [[ "$USER_CHOICE" == "A" ]]; then
   # Confirm first
-  echo "⚠️ This will DELETE all resources. Type 'confirm rollback' to proceed."
+  echo "⚠️ This will DELETE all managed resources. Type 'confirm rollback' to proceed."
   read CONFIRMATION
-  
+
   if [[ "$CONFIRMATION" == "confirm rollback" ]]; then
-    # Delete resources
-    az resource delete --ids {resource-id-1} {resource-id-2}
-    
-    # If RG was created new, delete it
-    if [[ "$RG_NEW" == "true" ]]; then
-      az group delete --name {rg-name} --yes --no-wait
-    fi
-    
-    # Log rollback
-    echo "Rollback completed" >> .azure/deployments/{deployment-id}/deployment.log
+    # Delegate to the destroy skill — single source of truth for stack
+    # delete, fallback RG delete, soft-delete purge sweep, and state.json
+    # updates. The skill picks the right runner (bash or PowerShell) and
+    # handles all edge cases.
+    /azure-stack-destroy {deployment-id}
+
+    echo "Rollback completed via azure-stack-destroy skill" >> .azure/deployments/{deployment-id}/deployment.log
   fi
 fi
 ```
 
+> **Important:** Never mix individual `az resource delete` calls when a `stackId` is present in `state.json`. The stack path is canonical — always invoke the [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skill, which encapsulates the stack delete, fallback RG delete, and soft-delete purge sweep (Key Vault, Cognitive Services, etc.) for any resources that are not purge-protected.
+
 **Step 4: Update deployment state:**
 ```json
 // .azure/deployments/{deployment-id}/metadata.json
diff --git a/.github/agents/azure-template-generator.agent.md b/.github/agents/azure-template-generator.agent.md
index 69e4807..0661368 100644
--- a/.github/agents/azure-template-generator.agent.md
+++ b/.github/agents/azure-template-generator.agent.md
@@ -15,7 +15,22 @@ You are the **Azure Template Generator**, a specialist at creating production-re
 
 ## Your Role
 
-Transform deployment requirements into validated, secure ARM templates. Show users exactly what will be deployed BEFORE execution happens.
+Transform deployment requirements into validated, secure ARM templates. Show users exactly what will be deployed BEFORE execution happens. **Delegate to skills wherever a skill already owns the work** — your job is template assembly + orchestration, not re-implementation of naming rules, schema lookups, or security assessment logic.
+
+## Skills Used
+
+This agent is a thin orchestrator over the following skills. Do not duplicate their logic inline.
+
+| Stage | Skill | Why |
+|-------|-------|-----|
+| Step 0 (lookup) | [`/azure-rest-api-reference`](../skills/azure-rest-api-reference/SKILL.md) | Get exact property schemas, required fields, valid enum values, latest stable API version per resource type. **Mandatory before writing any resource.** |
+| Step 0 (lookup) | [`/azure-naming-research`](../skills/azure-naming-research/SKILL.md) | CAF abbreviation, length / charset constraints, uniqueness scope. **Mandatory before naming any resource.** |
+| Step 1 (write) | [`/azure-role-selector`](../skills/azure-role-selector/SKILL.md) | Least-privilege RBAC role lookup — returns the GUIDs for `Storage Blob Data Owner`, `Storage Account Contributor`, etc. Do NOT hardcode GUIDs in the agent. |
+| Step 2 (assess) | [`/azure-security-analyzer`](../skills/azure-security-analyzer/SKILL.md) | Per-resource security best practices assessment + the BLOCKING security gate |
+| Step 2 (assess) | [`/azure-policy-advisor`](../skills/azure-policy-advisor/SKILL.md) | Azure Policy compliance check against CIS / NIST / org framework (advisory) |
+| Step 2 (assess) | [`/azure-resource-availability`](../skills/azure-resource-availability/SKILL.md) | Validate SKU + API version availability in target region + subscription quota (BLOCKING) |
+| Step 2 (assess) | [`/azure-deployment-preflight`](../skills/azure-deployment-preflight/SKILL.md) | What-if analysis showing what will Create / Modify / Delete |
+| Step 2 (assess) | [`/azure-cost-estimator`](../skills/azure-cost-estimator/SKILL.md) | Real pricing from Azure Retail Prices API |
 
 ## Output Styling
 
@@ -24,6 +39,27 @@ see [git-ape.agent.md](git-ape.agent.md).
 
 ## Approach
 
+### 0. Look Up Specs Before Writing Anything
+
+**Two skill invocations are mandatory before you write a single resource block.** Skipping either step is the #1 cause of preventable deployment failures (wrong property names, expired API versions, invalid characters, length overruns).
+
+**0a. Property and API version lookup** — Invoke [`/azure-rest-api-reference`](../skills/azure-rest-api-reference/SKILL.md) for every resource type in the deployment. The skill returns:
+- Latest stable (non-preview) API version
+- Required vs optional properties
+- Valid enum values per property
+- Common gotchas (e.g. `kind` discriminator on `Microsoft.Web/sites`)
+
+Never rely on memorized schemas. Re-invoke whenever you change the API version of an existing resource.
+
+**0b. Naming research** — Invoke [`/azure-naming-research`](../skills/azure-naming-research/SKILL.md) for every resource type. The skill returns:
+- CAF abbreviation (e.g. `func`, `st`, `kv`, `cae`)
+- Length min / max
+- Valid character set (alphanumeric, hyphens, lowercase-only, etc.)
+- Uniqueness scope (global, resource group, subscription)
+- Whether `uniqueString()` is recommended
+
+Use the skill's output to derive ARM `variables()` expressions, e.g. `[concat('func-', parameters('projectName'), '-', parameters('environment'), '-', parameters('location'))]`. Do not hand-craft naming rules from memory.
+
 ### 1. Generate ARM Template Structure
 
 **IMPORTANT:** Always generate **subscription-level** ARM templates that include resource group creation as a resource. This keeps all infrastructure in a single atomic template.
@@ -135,7 +171,7 @@ see [git-ape.agent.md](git-ape.agent.md).
 - Resource Group is a `Microsoft.Resources/resourceGroups` resource inside the template
 - Other resources go inside a nested `Microsoft.Resources/deployments` with `"resourceGroup"` property
 - Use `subscriptionResourceId()` for RG references, regular `resourceId()` inside nested
-- Deploy with `az deployment sub create` (not `az deployment group create`)
+- Deploy with `az stack sub create --action-on-unmanage deleteAll` (preferred) or `az deployment sub create` as a fallback (not `az deployment group create`)
 - `uniqueString()` uses `subscription().subscriptionId` instead of `resourceGroup().id`
 
 **Nested Template Requirements:**
@@ -183,8 +219,10 @@ Many Azure subscriptions enforce `allowSharedKeyAccess: false` via Azure Policy.
 ```
 
 **Required RBAC Roles for Function App → Storage:**
-- `Storage Blob Data Owner` (b7e6dc6d-f1e8-4753-8033-0f276bb0955b) — blob access
-- `Storage Account Contributor` (17d1049b-9a84-46fb-8f53-869881c3d3ab) — file share creation
+
+Do NOT hardcode role definition GUIDs in this agent. Invoke [`/azure-role-selector`](../skills/azure-role-selector/SKILL.md) with the resource pair (e.g. "Function App needs blob + file share access on Storage Account") and use the GUIDs the skill returns. The skill encodes least-privilege — it will recommend `Storage Blob Data Owner` (`b7e6dc6d-f1e8-4753-8033-0f276bb0955b`) + `Storage Account Contributor` (`17d1049b-9a84-46fb-8f53-869881c3d3ab`) for this specific pair, or narrower roles (`Storage Blob Data Contributor`, `Storage File Data SMB Share Contributor`) when full ownership is not needed.
+
+The GUIDs above appear in the example block only so you can verify the skill output matches — do not copy them into new templates without running the skill first.
 
 **Pattern: App Service → SQL Database (Managed Identity)**
 ```json
@@ -207,53 +245,27 @@ Many Azure subscriptions enforce `allowSharedKeyAccess: false` via Azure Policy.
 
 #### General Best Practices
 
+These are **write-time guardrails** — apply them while assembling resource blocks so the template starts in a known-good state. The full assessment runs in Step 3 via [`/azure-security-analyzer`](../skills/azure-security-analyzer/SKILL.md), which has the complete severity-tagged checklist per resource type. Do not duplicate that checklist here.
+
 For **ALL resources**:
-- ✓ Use latest **stable** API versions — invoke `/azure-resource-availability` to query the latest non-preview API version for each resource type; never hardcode
-- ✓ Validate that all resource properties used in the template exist in the chosen API version's schema
+- ✓ Use latest **stable** API versions — returned by [`/azure-rest-api-reference`](../skills/azure-rest-api-reference/SKILL.md) in Step 0a; never hardcode
+- ✓ Use names returned by [`/azure-naming-research`](../skills/azure-naming-research/SKILL.md) in Step 0b
 - ✓ Enable diagnostic settings and logging
 - ✓ Apply resource tags from workspace standards
 - ✓ Use `dependsOn` for proper ordering
 - ✓ Output resource IDs and endpoints
 - ✓ **Use managed identity for all inter-resource access** (no keys/secrets)
-- ✓ **Include RBAC role assignments** when resources need to access each other
-
-For **Function Apps**:
-- ✓ Use managed identity (system-assigned)
-- ✓ **Use `AzureWebJobsStorage__accountName` instead of connection string** — never use `listKeys()`
-- ✓ **Add RBAC role assignments** for storage access (Storage Blob Data Owner + Storage Account Contributor)
-- ✓ HTTPS only enforcement
-- ✓ TLS 1.2 minimum
-- ✓ FTP disabled (`ftpsState: Disabled`)
-- ✓ Remote debugging disabled
-- ✓ HTTP/2 enabled
-- ✓ Enable Application Insights integration
-- ✓ Configure CORS appropriately
-- ✓ Set runtime version explicitly
-
-For **Storage Accounts**:
-- ✓ Enable secure transfer (HTTPS only)
-- ✓ Minimum TLS version 1.2
-- ✓ Enable blob soft delete
-- ✓ Disable public blob access (unless explicitly needed)
-- ✓ **Set `allowSharedKeyAccess: false`** when all consumers use managed identity
-- ✓ Enable encryption at rest (default)
-- ✓ Configure firewall rules for network security
-
-For **Databases**:
-- ✓ Enable Transparent Data Encryption
-- ✓ **Use AAD-only authentication** (`azureADOnlyAuthentication: true`)
-- ✓ Configure firewall rules (no 0.0.0.0/0 in prod)
-- ✓ Enable auditing and threat detection
-- ✓ Automated backups configured
-
-For **App Services**:
-- ✓ HTTPS only
-- ✓ **Use managed identity** for all backend connections
-- ✓ FTP disabled
-- ✓ Always On enabled for production
-- ✓ Enable health check endpoint monitoring
-- ✓ Configure auto-scaling rules (for Standard+ tiers)
-- ✓ Enable app service logs
+- ✓ **Include RBAC role assignments** with GUIDs from [`/azure-role-selector`](../skills/azure-role-selector/SKILL.md), not from memory
+
+**Non-negotiable identity patterns** — these are write-time, not assessment-time, because once a template ships with shared keys / connection strings it is hard to retrofit:
+
+- **Function Apps**: System-assigned identity + `AzureWebJobsStorage__accountName` (NEVER `AzureWebJobsStorage` connection string, NEVER `listKeys()`)
+- **Storage Accounts**: `allowSharedKeyAccess: false` when all consumers use managed identity
+- **Databases**: AAD-only authentication (`azureADOnlyAuthentication: true`); no SQL auth
+- **App Services**: Managed identity for all backend connections; HTTPS only; FTP disabled (`ftpsState: Disabled`); TLS 1.2 minimum
+- **Key Vault**: Use Key Vault references in app settings (`@Microsoft.KeyVault(SecretUri=...)`), not raw secrets
+
+All other per-resource hardening (TLS versions, blob soft delete, threat detection, health probes, auto-scaling, etc.) is owned by the security analyzer in Step 3 and the policy advisor in Step 4 — they will flag anything missing with severity tags, and Critical / High findings are auto-applied or BLOCK the security gate.
 
 ### 3. Analyze Security Best Practices (Per Resource)
 
@@ -691,7 +703,30 @@ After showing the preview, provide the complete ARM template:
 
 ## Deployment Commands
 
-**Azure CLI (Subscription-level deployment):**
+The canonical deploy and destroy paths live in the [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) and [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skills. The commands below are reference recipes — prefer invoking the skills so local CLI / VS Code and CI pipelines stay in sync.
+
+**Azure CLI (Subscription-scoped Deployment Stack — preferred):**
+```bash
+az stack sub create \
+  --name {deployment-id} \
+  --location {location} \
+  --template-file template.json \
+  --parameters @parameters.json \
+  --action-on-unmanage deleteAll \
+  --deny-settings-mode none \
+  --description "Git-Ape deployment {deployment-id}" \
+  --tags "managedBy=git-ape" "deploymentId={deployment-id}" \
+  --yes \
+  --verbose
+```
+
+The stack tracks every managed resource (across resource groups and subscription scope), so destroy is a single idempotent command:
+
+```bash
+az stack sub delete --name {deployment-id} --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true --yes
+```
+
+**Azure CLI (Subscription-level deployment — fallback only):**
 ```bash
 az deployment sub create \
   --name {deployment-id} \
@@ -700,7 +735,20 @@ az deployment sub create \
   --parameters @parameters.json
 ```
 
-**PowerShell:**
+Use the fallback only when Deployment Stacks are unavailable in the target subscription/region. The fallback does NOT solve the soft-delete / multi-RG / sub-scope idempotency problem.
+
+**PowerShell (Deployment Stack — preferred):**
+```powershell
+New-AzSubscriptionDeploymentStack `
+  -Name {deployment-id} `
+  -Location {location} `
+  -TemplateFile template.json `
+  -TemplateParameterFile parameters.json `
+  -ActionOnUnmanage DeleteAll `
+  -DenySettingsMode None
+```
+
+**PowerShell (subscription deployment — fallback):**
 ```powershell
 New-AzSubscriptionDeployment `
   -Name {deployment-id} `
@@ -709,7 +757,7 @@ New-AzSubscriptionDeployment `
   -TemplateParameterFile parameters.json
 ```
 
-**Note:** We use subscription-level deployments so the resource group is created as part of the template. No need to create the RG separately.
+**Note:** We use subscription scope so the resource group is created as part of the template. No need to create the RG separately.
 ````
 
 ## Constraints
diff --git a/.github/agents/git-ape.agent.md b/.github/agents/git-ape.agent.md
index d206482..f40449d 100644
--- a/.github/agents/git-ape.agent.md
+++ b/.github/agents/git-ape.agent.md
@@ -97,7 +97,7 @@ Git-Ape can run in two modes. Detect which mode is active and adapt behavior acc
 | Validation | Run locally | `git-ape-plan.yml` runs on PR, posts what-if as comment |
 | Confirmation | Ask user interactively | PR approval = confirmation |
 | Deployment | Execute immediately | `git-ape-deploy.yml` runs on merge or `/deploy` comment |
-| Destroy | Execute after confirmation | PR sets `metadata.json` status to `destroy-requested` → merge triggers `git-ape-destroy.yml` |
+| Destroy | Execute via `az stack sub delete --action-on-unmanage deleteAll` after confirmation, then purge soft-deletables | PR sets `metadata.json` status to `destroy-requested` → merge triggers `git-ape-destroy.yml` (same stack-based flow + soft-delete purge) |
 | Results | Display in chat | Posted as PR/issue comment + state committed to repo |
 
 ## Your Role
@@ -354,12 +354,13 @@ The deployment plan MUST start with a clear "Target Environment" table:
 **Delegate to:** `azure-resource-deployer`
 
 The deployer will:
-- Execute the ARM template as a **subscription-level deployment** (`az deployment sub create`)
+- Execute the ARM template as a **subscription-scoped Deployment Stack** (`az stack sub create --action-on-unmanage deleteAll`) so destroy is idempotent across resource groups and subscription scope. The CLI fallback (`az deployment sub create`) is used only if stacks are unavailable.
 - The ARM template includes resource group creation — everything deploys atomically
 - Monitor deployment progress in real-time
 - Handle any deployment failures
 - Verify resource creation via Azure Resource Graph
 - Capture deployment outputs (resource IDs, endpoints, etc.)
+- Capture the **stack ID** plus every managed resource into `state.json` (extended schema: `stackId`, `deployMethod`, `managedResources[]`, `resourceGroups[]`, `subscriptions[]`, `externalReferences[]`) so the destroy path can find them later — including soft-deletable types (Key Vault, Cognitive Services, App Configuration, API Management, ML Workspaces, Recovery Services Vaults).
 
 **Deployment Monitoring:** Always poll deployment state every **30 seconds** using `sleep 30` between checks. No exponential backoff — use a fixed 30-second interval for all resources regardless of type or expected duration. Check both the top-level deployment and nested deployment statuses on every poll.
 
@@ -386,7 +387,16 @@ Run post-deployment validation:
    ```
    To destroy this deployment and delete all its resources, use Git-Ape:
    > @git-ape destroy deployment {deployment-id}
-   
+
+   Locally, this invokes the `azure-stack-destroy` skill:
+   > .github/skills/azure-stack-destroy/scripts/destroy-stack.sh --deployment-id {deployment-id}
+   > # or PowerShell:
+   > .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId {deployment-id}
+
+   Which uses `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true`
+   (single command, idempotent across resource groups and subscription scope) and
+   purges any soft-deletable resources that are not purge-protected.
+
    Or via GitHub (if using CI/CD):
    > Create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval
    ```
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 2c29d37..64e1d61 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -152,13 +152,15 @@ Always include these tags on all resources:
 
 ## Deployment Workflow
 
-### Interactive Mode (VS Code)
+### Interactive Mode (VS Code / local CLI)
 
 1. **Requirements Gathering:** Collect all necessary parameters before generating templates
 2. **Template Validation:** Always validate ARM templates before deployment
 3. **User Confirmation:** Echo deployment intent and wait for explicit approval
-4. **Deployment Execution:** Monitor progress and capture deployment logs
-5. **Integration Testing:** Run health checks on deployed resources
+4. **Deployment Execution:** Invoke the **[`azure-stack-deploy`](.github/skills/azure-stack-deploy/SKILL.md) skill**, which deploys as a subscription-scoped Azure Deployment Stack (`az stack sub create --action-on-unmanage deleteAll`). This is the same primitive used by the CI workflows so local and pipeline deployments produce identical state. The skill captures the stack ID, managed resources, soft-deletable resources, and resource groups into `state.json` (schemaVersion 1.0). It falls back to `az deployment sub create` only if Deployment Stacks are unavailable in the target subscription/region. Both bash (`scripts/deploy-stack.sh`) and PowerShell (`scripts/deploy-stack.ps1`) implementations are provided.
+5. **State Persistence:** The deploy skill writes `state.json` and updates `metadata.json` with `deployMethod` (`stack` or `subscription`) and `resourceGroups[]`. Schema reference: [website/docs/deployment/state.md](website/docs/deployment/state.md).
+6. **Integration Testing:** Run health checks on deployed resources
+7. **Destroy:** Invoke the **[`azure-stack-destroy`](.github/skills/azure-stack-destroy/SKILL.md) skill** (or `@git-ape destroy deployment {deployment-id}`). The skill mirrors the CI workflow exactly: `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true` (single command, idempotent across resource groups and subscription scope), purges any soft-deletable resources that are not purge-protected (Key Vault, Cognitive Services, etc.), then cleans the subscription deployment history entry to stay under the 800/scope limit. Both bash (`scripts/destroy-stack.sh`) and PowerShell (`scripts/destroy-stack.ps1`) implementations are provided.
 
 ### Pipeline Mode (GitHub Actions)
 
@@ -193,10 +195,11 @@ Git-Ape provides three GitHub Actions workflows under `.github/workflows/`:
 1. Detects deployment directories to execute
 2. Logs into Azure via OIDC
 3. Validates the template one more time
-4. Runs `az deployment sub create` to deploy
-5. Runs integration tests (lists deployed resources, tests HTTP endpoints)
-6. Commits `state.json` with deployment result back to the repo
-7. Posts deployment result as a PR comment (on `/deploy` trigger)
+4. Runs `az stack sub create --action-on-unmanage deleteAll` to deploy (falls back to `az deployment sub create` if stacks are unavailable)
+5. Captures the **stack ID**, managed resources, soft-deletable resources, and resource groups into `state.json`
+6. Runs integration tests (lists deployed resources, tests HTTP endpoints)
+7. Commits `state.json` (extended schema) and `metadata.json` (`deployMethod`, `resourceGroups[]`) back to the repo
+8. Posts deployment result as a PR comment (on `/deploy` trigger)
 
 **Requires:** GitHub environment `azure-deploy` (for environment protection rules)
 
@@ -413,10 +416,17 @@ jobs:
           subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
       - name: Deploy
         run: |
-          az deployment sub create \
+          az stack sub create \
+            --name ${{ env.DEPLOYMENT_ID }} \
             --location ${{ env.LOCATION }} \
             --template-file .azure/deployments/${{ env.DEPLOYMENT_ID }}/template.json \
-            --parameters @.azure/deployments/${{ env.DEPLOYMENT_ID }}/parameters.json
+            --parameters @.azure/deployments/${{ env.DEPLOYMENT_ID }}/parameters.json \
+            --action-on-unmanage deleteAll \
+            --deny-settings-mode none \
+            --description "Git-Ape deployment ${{ env.DEPLOYMENT_ID }}" \
+            --tags "managedBy=git-ape" "deploymentId=${{ env.DEPLOYMENT_ID }}" \
+            --yes \
+            --verbose
 ```
 
 **Transitioning from Service Principal secrets to OIDC:**
diff --git a/.github/evals/azure-stack-deploy/eval.yaml b/.github/evals/azure-stack-deploy/eval.yaml
new file mode 100644
index 0000000..19d2f56
--- /dev/null
+++ b/.github/evals/azure-stack-deploy/eval.yaml
@@ -0,0 +1,48 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json
+
+# Expanded-tier evaluation suite for the azure-stack-deploy skill.
+# Validates trigger precision via the heuristic `trigger` grader plus
+# per-positive-task answer_quality LLM judge.
+#
+# Run: waza run .github/evals/azure-stack-deploy/eval.yaml
+
+name: azure-stack-deploy-eval
+description: Trigger precision + answer quality for azure-stack-deploy (Azure Deployment Stacks).
+skill: azure-stack-deploy
+version: "0.1"
+
+config:
+  # 2 trials catches obvious LLM nondeterminism flakes (single trial = no
+  # flake signal). Pilot tier bumps to 3 via /skill-promote.
+ trials_per_task: 2 + timeout_seconds: 60 + parallel: false + executor: copilot-sdk + model: claude-sonnet-4.6 + +metrics: + - name: trigger_precision + weight: 1.0 + threshold: 0.6 + description: Skill should activate on Deployment Stack deploy prompts and stay quiet on teardown / preview / unrelated prompts. + +graders: + # Budget grader: azure-stack-deploy is a guided deploy workflow; flag any + # leg that explodes in tool calls or runs unreasonably long. + - type: behavior + name: budget + config: + max_tool_calls: 30 + max_duration_ms: 240000 + + # answer_quality (LLM-as-judge) is scoped per-task on positive tasks + # only (see tasks/positive-*.yaml). Keeps judge-model errors from + # zeroing out the negative-task trigger check in the same leg. + # + # Do NOT add `skill_invocation` with `required_skills:` here — eval-level + # prompt graders fire on EVERY task (including negatives) and produce + # deterministic 0.0 noise across all models (removed in commit 2f699c79 + # from git-ape-onboarding for this reason). + +tasks: + - "tasks/*.yaml" diff --git a/.github/evals/azure-stack-deploy/tasks/negative-destroy.yaml b/.github/evals/azure-stack-deploy/tasks/negative-destroy.yaml new file mode 100644 index 0000000..c21af7a --- /dev/null +++ b/.github/evals/azure-stack-deploy/tasks/negative-destroy.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-destroy +name: Negative — Destroying / tearing down an existing deployment +description: Destroy/teardown prompts belong to azure-stack-destroy, not azure-stack-deploy. +tags: [trigger, negative, mutable-by-skill] +inputs: + prompt: "Tear down the Azure resources I deployed under deploy-20260506-001 — delete the stack and the resource group cleanly." 
+graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-deploy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-deploy/tasks/negative-off-topic.yaml b/.github/evals/azure-stack-deploy/tasks/negative-off-topic.yaml new file mode 100644 index 0000000..7d6fb65 --- /dev/null +++ b/.github/evals/azure-stack-deploy/tasks/negative-off-topic.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-off-topic +name: Negative — Off-topic prompt (Linux kernel scheduling) +description: Off-topic prompt clearly outside Azure Deployment Stacks should not trigger this skill. +tags: [trigger, negative, off-topic, mutable-by-skill] +inputs: + prompt: "Explain how the Linux Completely Fair Scheduler (CFS) picks the next task to run, and how vruntime is recomputed when a task wakes from sleep." +graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-deploy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-deploy/tasks/negative-whatif-preview.yaml b/.github/evals/azure-stack-deploy/tasks/negative-whatif-preview.yaml new file mode 100644 index 0000000..372ca88 --- /dev/null +++ b/.github/evals/azure-stack-deploy/tasks/negative-whatif-preview.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-whatif-preview +name: Negative — What-if preview / preflight validation +description: What-if preview belongs to azure-deployment-preflight, not azure-stack-deploy. +tags: [trigger, negative, mutable-by-skill] +inputs: + prompt: "Before I deploy, show me a what-if preview of the changes the template at .azure/deployments/deploy-20260506-001/template.json would make — don't actually deploy anything yet." 
+graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-deploy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-deploy/tasks/positive-local-deploy.yaml b/.github/evals/azure-stack-deploy/tasks/positive-local-deploy.yaml new file mode 100644 index 0000000..d10d898 --- /dev/null +++ b/.github/evals/azure-stack-deploy/tasks/positive-local-deploy.yaml @@ -0,0 +1,47 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: positive-local-deploy +name: Positive — Local deploy of an existing deployment artifact +description: Skill should be invoked when the user wants to deploy a Git-Ape template.json locally as a Deployment Stack. +# `mutable-by-skill` — score reflects SKILL.md (trigger + answer_quality +# graders read from .github/skills/azure-stack-deploy/SKILL.md). +tags: [trigger, positive, mutable-by-skill] +inputs: + prompt: "I have an ARM template at .azure/deployments/deploy-20260506-001/template.json — deploy it to my Azure subscription the same way the CI workflow would, so destroy stays a single command later." +graders: + - name: trigger_relevance_positive + type: trigger + config: + skill_path: .github/skills/azure-stack-deploy/SKILL.md + mode: positive + threshold: 0.5 + + # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky + # judge call only zeroes out this task, not the whole leg. + # IMPORTANT: `continue_session: true` is mandatory — without it the judge + # has zero access to the agent's response and scores oscillate. + - type: prompt + name: answer_quality + config: + continue_session: true + prompt: | + You are grading the assistant's previous response in this session. + The user asked to deploy an existing ARM template + (`.azure/deployments/deploy-20260506-001/template.json`) the same + way the Git-Ape CI workflow would, so destroy stays a single + command later. 
+ + PASS criteria — the response must contain ALL of: + 1. Names `az stack sub create` (NOT `az deployment sub create`) + as the deployment primitive. + 2. Includes the `--action-on-unmanage deleteAll` flag (this is + what makes destroy idempotent and matches the CI workflow). + 3. References the helper script + `.github/skills/azure-stack-deploy/scripts/deploy-stack.sh` + OR `deploy-stack.ps1` instead of asking the user to assemble + the `az` command from scratch. + 4. Mentions that `state.json` (schemaVersion 1.0) will be + written to capture the stack ID and managed resources. + + If ALL four criteria are met, call `set_waza_grade_pass`. + Otherwise, call `set_waza_grade_fail` and list which criteria are missing. diff --git a/.github/evals/azure-stack-deploy/tasks/positive-redeploy-after-edit.yaml b/.github/evals/azure-stack-deploy/tasks/positive-redeploy-after-edit.yaml new file mode 100644 index 0000000..f9d2a33 --- /dev/null +++ b/.github/evals/azure-stack-deploy/tasks/positive-redeploy-after-edit.yaml @@ -0,0 +1,43 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: positive-redeploy-after-edit +name: Positive — Re-deploy after template edit +description: Skill should be invoked when re-deploying an existing deployment ID after template.json was edited (in-place stack update). +# See positive-local-deploy.yaml for `mutable-by-*` tag semantics. +tags: [trigger, positive, mutable-by-skill] +inputs: + prompt: "I already deployed deploy-20260506-001 last week and just edited its template.json to add a tag. Push the change to the same deployment without creating a duplicate stack — what command do I run?" 
+graders: + - name: trigger_relevance_positive + type: trigger + config: + skill_path: .github/skills/azure-stack-deploy/SKILL.md + mode: positive + threshold: 0.5 + + - type: prompt + name: answer_quality + config: + continue_session: true + prompt: | + You are grading the assistant's previous response in this session. + The user has an existing deployment (`deploy-20260506-001`) and + edited its `template.json`. They want to push the change + in-place — same stack, no duplicate. + + PASS criteria — the response must contain ALL of: + 1. Calls out that Azure Deployment Stacks are stateful and that + re-running `az stack sub create` against the SAME stack name + updates the existing stack in place (NOT create-only). + 2. Names `az stack sub create` (or the equivalent + `deploy-stack.sh` / `deploy-stack.ps1` script) as the + command to run again. + 3. Reuses the same deployment ID / stack name + (`deploy-20260506-001`) — does NOT instruct the user to + pick a new name or create a fresh deployment folder. + 4. Reaches a concrete next step — either the exact command to + run OR a clear instruction to invoke the + `azure-stack-deploy` script with the existing deployment ID. + + If ALL four criteria are met, call `set_waza_grade_pass`. + Otherwise, call `set_waza_grade_fail` and list which criteria are missing. diff --git a/.github/evals/azure-stack-destroy/eval.yaml b/.github/evals/azure-stack-destroy/eval.yaml new file mode 100644 index 0000000..8c21487 --- /dev/null +++ b/.github/evals/azure-stack-destroy/eval.yaml @@ -0,0 +1,48 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json + +# Expanded-tier evaluation suite for the azure-stack-destroy skill. +# Validates trigger precision via the heuristic `trigger` grader plus +# per-positive-task answer_quality LLM judge. 
+# +# Run: waza run .github/evals/azure-stack-destroy/eval.yaml + +name: azure-stack-destroy-eval +description: Trigger precision + answer quality for azure-stack-destroy (Azure Deployment Stack teardown). +skill: azure-stack-destroy +version: "0.1" + +config: + # 2 trials catches obvious LLM nondeterminism flakes (single trial = no + # flake signal). Pilot tier bumps to 3 via /skill-promote. + trials_per_task: 2 + timeout_seconds: 60 + parallel: false + executor: copilot-sdk + model: claude-sonnet-4.6 + +metrics: + - name: trigger_precision + weight: 1.0 + threshold: 0.6 + description: Skill should activate on Deployment Stack destroy / teardown prompts and stay quiet on deploy / non-Git-Ape / unrelated prompts. + +graders: + # Budget grader: azure-stack-destroy is a guided teardown workflow; flag any + # leg that explodes in tool calls or runs unreasonably long. + - type: behavior + name: budget + config: + max_tool_calls: 30 + max_duration_ms: 240000 + + # answer_quality (LLM-as-judge) is scoped per-task on positive tasks + # only (see tasks/positive-*.yaml). Keeps judge-model errors from + # zeroing out the negative-task trigger check in the same leg. + # + # Do NOT add `skill_invocation` with `required_skills:` here — eval-level + # prompt graders fire on EVERY task (including negatives) and produce + # deterministic 0.0 noise across all models (removed in commit 2f699c79 + # from git-ape-onboarding for this reason). 
+ +tasks: + - "tasks/*.yaml" diff --git a/.github/evals/azure-stack-destroy/tasks/negative-deploy.yaml b/.github/evals/azure-stack-destroy/tasks/negative-deploy.yaml new file mode 100644 index 0000000..afa13ad --- /dev/null +++ b/.github/evals/azure-stack-destroy/tasks/negative-deploy.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-deploy +name: Negative — Deploying a new stack (opposite operation) +description: Deploy prompts belong to azure-stack-deploy, not azure-stack-destroy. +tags: [trigger, negative, mutable-by-skill] +inputs: + prompt: "Deploy this ARM template to a new subscription-scoped Azure Deployment Stack named deploy-20260526-001 in East US." +graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-destroy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-destroy/tasks/negative-non-gitape-rg-delete.yaml b/.github/evals/azure-stack-destroy/tasks/negative-non-gitape-rg-delete.yaml new file mode 100644 index 0000000..4018809 --- /dev/null +++ b/.github/evals/azure-stack-destroy/tasks/negative-non-gitape-rg-delete.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-non-gitape-rg-delete +name: Negative — Deleting a non-Git-Ape resource group +description: Deleting a plain resource group with no state.json is outside this skill's scope — use az group delete directly. +tags: [trigger, negative, mutable-by-skill] +inputs: + prompt: "Delete the resource group `rg-myproject-prod-eastus` and everything inside it. It wasn't created by Git-Ape — I made it manually with `az group create`." 
+graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-destroy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-destroy/tasks/negative-off-topic.yaml b/.github/evals/azure-stack-destroy/tasks/negative-off-topic.yaml new file mode 100644 index 0000000..7d271b5 --- /dev/null +++ b/.github/evals/azure-stack-destroy/tasks/negative-off-topic.yaml @@ -0,0 +1,15 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: negative-off-topic +name: Negative — Off-topic prompt (Linux kernel scheduling) +description: Off-topic prompt clearly outside Azure deployment teardown should not trigger this skill. +tags: [trigger, negative, off-topic, mutable-by-skill] +inputs: + prompt: "Explain how the Linux Completely Fair Scheduler (CFS) picks the next task to run, and how vruntime is recomputed when a task wakes from sleep." +graders: + - name: trigger_relevance_negative + type: trigger + config: + skill_path: .github/skills/azure-stack-destroy/SKILL.md + mode: negative + threshold: 0.5 diff --git a/.github/evals/azure-stack-destroy/tasks/positive-clean-up-stack.yaml b/.github/evals/azure-stack-destroy/tasks/positive-clean-up-stack.yaml new file mode 100644 index 0000000..a67ff42 --- /dev/null +++ b/.github/evals/azure-stack-destroy/tasks/positive-clean-up-stack.yaml @@ -0,0 +1,52 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: positive-clean-up-stack +name: Positive — Clean up the deployment stack +description: User asks to clean up a Git-Ape deployment stack and free the resource group; skill should activate. +tags: [trigger, positive, mutable-by-skill] +inputs: + prompt: "Clean up the Azure deployment stack for deploy-20260524-test. I want to free up the resource group and any soft-deletable resources so I can re-deploy with the same name." 
+graders: + - name: trigger_relevance_positive + type: trigger + config: + skill_path: .github/skills/azure-stack-destroy/SKILL.md + mode: positive + threshold: 0.5 + + # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky + # judge call only zeroes out this task, not the whole leg. See eval.yaml. + # + # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0, + # set_waza_grade_fail = 0.0). The judge has NO access to the agent's + # response unless continue_session: true is set. + - type: prompt + name: answer_quality + config: + continue_session: true + prompt: | + You are grading the assistant's previous response in this session. + The user asked to clean up a Git-Ape deployment stack + (deploy-20260524-test), free the resource group, and handle any + soft-deletable resources so the deployment can be re-created with + the same name. + + PASS criteria — the response must contain ALL of: + 1. Recommends running the `azure-stack-destroy` skill OR its + scripts (`destroy-stack.sh` / `destroy-stack.ps1`) rather than + a raw `az group delete` — explicitly because raw `az group + delete` misses soft-delete cleanup and any multi-RG resources. + 2. References the requirement for `state.json` under + `.azure/deployments/deploy-20260524-test/` (skill refuses to + run without it). + 3. Mentions deleting the deployment stack itself — + `az stack sub delete` with `--action-on-unmanage deleteAll` + (or equivalent semantics: one delete cleans every resource + the stack owns). + 4. Either covers the soft-delete purge sweep behavior (Key + Vault, Cognitive Services purged after stack delete) OR + notes that resources flagged `purgeProtected: true` in + `state.json` are intentionally retained. + + If ALL four PASS criteria are met, call `set_waza_grade_pass`. + Otherwise, call `set_waza_grade_fail` and list which criteria are missing. 
diff --git a/.github/evals/azure-stack-destroy/tasks/positive-local-destroy.yaml b/.github/evals/azure-stack-destroy/tasks/positive-local-destroy.yaml new file mode 100644 index 0000000..48537ad --- /dev/null +++ b/.github/evals/azure-stack-destroy/tasks/positive-local-destroy.yaml @@ -0,0 +1,54 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json + +id: positive-local-destroy +name: Positive — Local destroy of a Git-Ape deployment +description: User asks to tear down a specific deploy-XXX deployment with soft-delete cleanup; skill should activate. +tags: [trigger, positive, mutable-by-skill] +inputs: + prompt: "I'm done with deploy-20260506-001. Tear down the deployment stack — delete the resources cleanly and purge any soft-deleted Key Vaults so I can re-use the name." +graders: + - name: trigger_relevance_positive + type: trigger + config: + skill_path: .github/skills/azure-stack-destroy/SKILL.md + mode: positive + threshold: 0.5 + + # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky + # judge call only zeroes out this task, not the whole leg. See eval.yaml. + # + # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0, + # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO + # access to the agent's response unless continue_session: true is set — it + # resumes the agent's own session so it can read the response. + - type: prompt + name: answer_quality + config: + continue_session: true + prompt: | + You are grading the assistant's previous response in this session. + The user asked to tear down a specific Git-Ape deployment + (deploy-20260506-001), delete the resources, and purge soft-deleted + Key Vaults so the name can be re-used. + + PASS criteria — the response must contain ALL of: + 1. 
Recommends the `azure-stack-destroy` skill OR invokes the + `destroy-stack.sh` / `destroy-stack.ps1` script under + `.github/skills/azure-stack-destroy/scripts/` (NOT a raw + `az group delete`). + 2. References `state.json` under + `.azure/deployments/deploy-20260506-001/` as the source of + truth for what to destroy (stackId, managedResources, + softDeletable, purgeProtected). + 3. Names the actual stack-delete command or its semantics — + `az stack sub delete --action-on-unmanage deleteAll` + (single idempotent call that owns all resources across + resource groups). + 4. Addresses the soft-delete purge sweep explicitly — mentions + `az keyvault purge` (or `az keyvault list-deleted` + purge), + OR explains that the skill's purge sweep deletes + non-purge-protected soft-deleted Key Vaults so the name is + immediately reusable. + + If ALL four PASS criteria are met, call `set_waza_grade_pass`. + Otherwise, call `set_waza_grade_fail` and list which criteria are missing. diff --git a/.github/evals/manifest.yaml b/.github/evals/manifest.yaml index f71f4b3..6f903b6 100644 --- a/.github/evals/manifest.yaml +++ b/.github/evals/manifest.yaml @@ -27,7 +27,11 @@ skills: # Pilot tier: full multi-model fan-out (most-trusted skills). - name: prereq-check tier: pilot - + # Expanded tier: 2-model fan-out for skills still maturing toward pilot. + - name: azure-stack-deploy + tier: expanded + - name: azure-stack-destroy + tier: expanded # Per-tier model fan-out. The matrix runs each selected skill against every # model in its tier. To compare additional models, add them here. # diff --git a/.github/scripts/deployment-manager.sh b/.github/scripts/deployment-manager.sh index 815738f..7f0f173 100755 --- a/.github/scripts/deployment-manager.sh +++ b/.github/scripts/deployment-manager.sh @@ -1,6 +1,16 @@ #!/bin/bash # Azure Deployment State Manager -# Utility script for managing deployment artifacts and state persistence +# Utility script for managing deployment artifact metadata. 
+# +# Deploy / destroy logic lives in the dedicated skills: +# .github/skills/azure-stack-deploy/scripts/deploy-stack.sh (or .ps1) +# .github/skills/azure-stack-destroy/scripts/destroy-stack.sh (or .ps1) +# These mirror .github/workflows/git-ape-deploy.exampleyml and +# .github/workflows/git-ape-destroy.exampleyml so local CLI / VS Code +# operations produce identical state.json (schemaVersion 1.0). +# +# This script handles only inventory tasks: list / show / clean / init / +# validate / export. set -euo pipefail @@ -318,6 +328,19 @@ main() { fi validate_deployment "$2" ;; + deploy|destroy) + cat < + PowerShell: .github/skills/azure-stack-${COMMAND}/scripts/${COMMAND}-stack.ps1 -DeploymentId + Agent: /azure-stack-${COMMAND} + +See .github/skills/azure-stack-${COMMAND}/SKILL.md for full options. +EOF + exit 1 + ;; *) echo "Azure Deployment State Manager" echo "" @@ -331,6 +354,10 @@ main() { echo " init [id] Initialize new deployment directory" echo " validate Validate deployment state files" echo "" + echo "Deploy / destroy moved to dedicated skills:" + echo " Deploy: .github/skills/azure-stack-deploy/scripts/deploy-stack.{sh,ps1}" + echo " Destroy: .github/skills/azure-stack-destroy/scripts/destroy-stack.{sh,ps1}" + echo "" echo "Examples:" echo " $0 list" echo " $0 show deploy-20260218-143022" diff --git a/.github/skills/azure-stack-deploy/SKILL.md b/.github/skills/azure-stack-deploy/SKILL.md new file mode 100644 index 0000000..115ab54 --- /dev/null +++ b/.github/skills/azure-stack-deploy/SKILL.md @@ -0,0 +1,159 @@ +--- +name: azure-stack-deploy +description: "Run an Azure Deployment Stack create (subscription scope) for a prepared Git-Ape deployment artifact and write state.json (schemaVersion 1.0). Use locally so the result matches the CI deploy workflow." 
+argument-hint: "Deployment ID (folder under .azure/deployments/) — optional --location override" +user-invocable: true +--- + +# Azure Stack Deploy + +Deploy a Git-Ape deployment artifact as a subscription-scoped **Azure Deployment Stack** (`az stack sub create --action-on-unmanage deleteAll`). The stack is the lifecycle owner of every resource the template creates — across resource groups and subscription scope — which makes destroy idempotent in a single call (see [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md)). + +This skill produces the **same `state.json`** schema (`schemaVersion: "1.0"`) as the CI workflow at `.github/workflows/git-ape-deploy.yml`, so local deployments and pipeline deployments are interchangeable. + +## When to Use + +- Local deployment from VS Code or terminal (the `git-ape` agent invokes this in Stage 3) +- Re-deploying an existing deployment ID after template edits — stacks are stateful, so this is an in-place update +- Any time you would otherwise run `az deployment sub create` against a Git-Ape `template.json` + +## Do NOT use for + +- **Tearing down / destroying** an existing deployment — use [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md) instead +- **What-if preview / preflight validation** without deploying — use [`azure-deployment-preflight`](../azure-deployment-preflight/SKILL.md) instead +- **Off-topic** (non-Azure, non-deployment) requests +- Generating or editing ARM templates — use `azure-prepare` or another IaC authoring skill + +## Prerequisites + +| Tool | Why | +|------|-----| +| `az` (Azure CLI ≥ 2.59) | `az stack sub` requires CLI ≥ 2.50; 2.59 has the latest stack flags | +| `jq` | State capture and JSON extraction | +| `bash` ≥ 4 OR PowerShell 7+ | Either runner works | +| Active `az login` | Skill exits early if no subscription is selected | +| Existing `template.json` (and optional `parameters.json`) under `.azure/deployments//` | Source artifacts | + +## Procedure + +### 1. 
Locate deployment artifacts + +```bash +DEPLOYMENT_ID="deploy-20260506-001" +DEPLOYMENT_PATH=".azure/deployments/$DEPLOYMENT_ID" + +[[ -f "$DEPLOYMENT_PATH/template.json" ]] || { echo "template.json missing"; exit 1; } +``` + +If `parameters.json` is present, `location`, `project` (or `projectName`), and `environment` are read from it. Defaults: `eastus` / `unknown` / `dev`. + +### 2. Run the script + +```bash +.github/skills/azure-stack-deploy/scripts/deploy-stack.sh \ + --deployment-id "$DEPLOYMENT_ID" +``` + +PowerShell equivalent: + +```powershell +.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 ` + -DeploymentId "$DEPLOYMENT_ID" +``` + +The script: + +1. Resolves `location`, `project`, `environment` from `parameters.json` (or defaults) +2. Validates Azure CLI session (`az account show`) +3. Calls `az stack sub create` with the canonical Git-Ape flag set: + - `--action-on-unmanage deleteAll` + - `--deny-settings-mode none` + - `--description "Git-Ape deployment "` + - `--tags managedBy=git-ape deploymentId=` + - `--yes --verbose` +4. **On stack failure**, falls back to `az deployment sub create` and prints `⚠️ FALLBACK: no multi-RG idempotency, no soft-delete tracking` so the trade-off is unambiguous +5. **On any deployment failure**, dumps the per-operation failure list (`az deployment operation sub list`) inline so the root cause is visible without clicking into the Portal +6. **On success**, queries `az stack sub show --query "resources[].id"` for the live managed-resource list, classifies each resource (type, scope, soft-deletable, purge-protected), and writes the extended `state.json` +7. Updates `metadata.json` with `status: "succeeded"`, `deployMethod`, and `resourceGroups[]` + +### 3. 
Inspect output + +```text +✅ Deployment succeeded in 142s (method: stack) +State written to: .azure/deployments/deploy-20260506-001/state.json +Stack ID: /subscriptions//providers/Microsoft.Resources/deploymentStacks/deploy-20260506-001 + +To destroy this deployment: + /azure-stack-destroy deploy-20260506-001 +``` + +## What to tell the user after running + +After the script returns, your reply MUST mention: + +1. The primitive used: `az stack sub create --action-on-unmanage deleteAll` (or fallback `az deployment sub create`) +2. The stack ID (from `state.json.stackId`) — this is the single handle for destroy +3. That `state.json` (schemaVersion 1.0) was written under the deployment folder +4. The next-step destroy command: `/azure-stack-destroy ` + +## Arguments + +| Flag (bash) | Param (pwsh) | Required | Description | +|-------------|--------------|----------|-------------| +| `--deployment-id ` | `-DeploymentId ` | yes | Folder name under `.azure/deployments/` | +| `--location ` | `-Location ` | no | Override the location from `parameters.json` | +| `--no-fallback` | `-NoFallback` | no | Fail loudly if the stack call fails instead of falling back to `az deployment sub create` | + +## state.json schema (v1.0) + +```json +{ + "schemaVersion": "1.0", + "deploymentId": "deploy-20260506-001", + "timestamp": "2026-05-06T12:00:00Z", + "status": "succeeded", + "duration": "142s", + "subscription": "", + "location": "eastus", + "project": "myapp", + "environment": "dev", + "resourceGroup": "rg-myapp-dev-eastus", + "deployMethod": "stack", + "stackId": "/subscriptions//providers/Microsoft.Resources/deploymentStacks/deploy-20260506-001", + "managedResources": [ + { + "id": "/subscriptions//resourceGroups/rg-myapp-dev-eastus/providers/Microsoft.KeyVault/vaults/kv-myapp-dev-eus", + "type": "Microsoft.KeyVault/vaults", + "scope": "resourceGroup", + "softDeletable": true, + "purgeProtected": true + } + ], + "resourceGroups": ["rg-myapp-dev-eastus"], + "subscriptions": [""], + 
"externalReferences": [] +} +``` + +See [website/docs/deployment/state.md](../../../website/docs/deployment/state.md) for the full schema reference. + +## Soft-deletable resource types tracked + +`Microsoft.KeyVault/vaults`, `Microsoft.CognitiveServices/accounts`, `Microsoft.AppConfiguration/configurationStores`, `Microsoft.ApiManagement/service`, `Microsoft.MachineLearningServices/workspaces`, `Microsoft.RecoveryServices/vaults`. + +The destroy skill ([`azure-stack-destroy`](../azure-stack-destroy/SKILL.md)) consumes the `softDeletable` and `purgeProtected` fields to drive its purge sweep. + +## Failure modes + +| Symptom | Likely cause | Recovery | +|---------|--------------|----------| +| `Not logged in to Azure` | `az login` missing | Run `az login` then retry | +| `template.json missing` | Wrong deployment ID | Check `.azure/deployments/` contents | +| Stack create fails immediately | Region/policy blocks Deployment Stacks | Re-run without `--no-fallback`, accept the legacy path, or pick a supported region | +| Stack succeeds but `state.json` missing managed resources | `az stack sub show` race condition | Re-run — the script is idempotent (stacks de-duplicate on `--name`) | + +## Related + +- [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md) — the matching destroy skill (single source of truth: `stackId`) +- [`azure-deployment-preflight`](../azure-deployment-preflight/SKILL.md) — what-if and permission checks BEFORE deploy +- [`azure-security-analyzer`](../azure-security-analyzer/SKILL.md) — security gate (BLOCKING) before deploy confirmation diff --git a/.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 b/.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 new file mode 100644 index 0000000..0e4c4af --- /dev/null +++ b/.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 @@ -0,0 +1,317 @@ +<# +.SYNOPSIS + Deploy a Git-Ape deployment artifact as a subscription-scoped Azure Deployment Stack. 
+ +.DESCRIPTION + PowerShell port of deploy-stack.sh. Mirrors the logic of + .github/workflows/git-ape-deploy.exampleyml so local CLI / VS Code + deployments produce identical state.json (schemaVersion 1.0). + +.PARAMETER DeploymentId + Folder name under .azure/deployments/. Required. + +.PARAMETER Location + Override the location from parameters.json. Optional. + +.PARAMETER NoFallback + Fail loudly if the stack call fails instead of falling back to az deployment sub create. + +.EXAMPLE + ./deploy-stack.ps1 -DeploymentId deploy-20260506-001 + +.EXAMPLE + ./deploy-stack.ps1 -DeploymentId deploy-20260506-001 -Location westus2 -NoFallback + +.NOTES + Requires: PowerShell 7+, az CLI ≥ 2.59, jq, active az login session. +#> +[CmdletBinding()] +param( + [string]$DeploymentId, + + [string]$Location, + + [switch]$NoFallback, + + [switch]$Help +) + +$ErrorActionPreference = 'Stop' + +function Show-Usage { + @' +Azure Stack Deploy — deploy as subscription-scoped Deployment Stack + +Usage: deploy-stack.ps1 -DeploymentId [OPTIONS] + +Required: + -DeploymentId Folder name under .azure/deployments/ + +Options: + -Location Override location from parameters.json + -NoFallback Fail loudly if stack create fails (no fallback to az deployment sub create) + -Help Show this help + +Examples: + ./deploy-stack.ps1 -DeploymentId deploy-20260506-001 + ./deploy-stack.ps1 -DeploymentId deploy-20260506-001 -Location westus2 + ./deploy-stack.ps1 -DeploymentId deploy-20260506-001 -NoFallback +'@ | Write-Host +} + +if ($Help -or [string]::IsNullOrWhiteSpace($DeploymentId)) { + Show-Usage + exit 1 +} + +$ScriptDir = Split-Path -Parent $MyInvocation.MyCommand.Path +$WorkspaceRoot = (Resolve-Path (Join-Path $ScriptDir '../../../..')).Path +$DeploymentsDir = '.azure/deployments' +$DeploymentPath = Join-Path $WorkspaceRoot (Join-Path $DeploymentsDir $DeploymentId) + +# Soft-deletable resource types (must match the CI workflow list) +$SoftDeletableTypes = @( + 'Microsoft.KeyVault/vaults' + 
    'Microsoft.CognitiveServices/accounts'
+    'Microsoft.AppConfiguration/configurationStores'
+    'Microsoft.ApiManagement/service'
+    'Microsoft.MachineLearningServices/workspaces'
+    'Microsoft.RecoveryServices/vaults'
+)
+
+function Write-Color {
+    param([string]$Text, [string]$Color = 'White')
+    Write-Host $Text -ForegroundColor $Color
+}
+
+if (-not (Test-Path -PathType Container $DeploymentPath)) {
+    Write-Color "Deployment not found: $DeploymentId" Red
+    exit 1
+}
+$TemplateFile = Join-Path $DeploymentPath 'template.json'
+if (-not (Test-Path $TemplateFile)) {
+    Write-Color "Template not found: $TemplateFile" Red
+    exit 1
+}
+
+# Internal helpers ------------------------------------------------------------
+
+function Get-ResourceClassification {
+    param([string]$ResourceId)
+
+    $type = $null
+    # Use the LAST providers// segment so extension/nested
+    # resources (e.g. a role assignment scoped to a Key Vault) are classified by
+    # their own type rather than the parent resource's type.
+    if ($ResourceId -match '.*providers/([^/]+/[^/]+)') {
+        $type = $matches[1]
+    }
+    $scope = if ($ResourceId -match '/resourceGroups/') { 'resourceGroup' } else { 'subscription' }
+    $isSoft = $SoftDeletableTypes -contains $type
+
+    $purgeProtected = $false
+    if ($type -eq 'Microsoft.KeyVault/vaults') {
+        # --query takes JMESPath, whose default operator is `||`; `//` is jq syntax
+        # and makes az reject the query, which would leave this flag always false.
+        $pp = az resource show --ids $ResourceId --query 'properties.enablePurgeProtection || `false`' -o tsv 2>$null
+        $purgeProtected = ($pp -eq 'true')
+    }
+
+    [pscustomobject]@{
+        id             = $ResourceId
+        type           = $type
+        scope          = $scope
+        softDeletable  = $isSoft
+        purgeProtected = $purgeProtected
+    }
+}
+
+function Build-ManagedResources {
+    param([string[]]$ResourceIds)
+    $list = @()
+    foreach ($id in $ResourceIds) {
+        if ([string]::IsNullOrWhiteSpace($id)) { continue }
+        $list += Get-ResourceClassification -ResourceId $id
+    }
+    , $list
+}
+
+# Resolve deployment parameters ----------------------------------------------
+
+$ParamsArg = @()
+$ResolvedLoc = 'eastus'
+$Project = 'unknown'
+$Environment = 'dev' +$ParametersFile = Join-Path $DeploymentPath 'parameters.json' +if (Test-Path $ParametersFile) { + $ParamsArg += '--parameters' + $ParamsArg += "@$ParametersFile" + $params = Get-Content $ParametersFile -Raw | ConvertFrom-Json + if ($params.parameters.location.value) { $ResolvedLoc = $params.parameters.location.value } + if ($params.parameters.project.value) { $Project = $params.parameters.project.value } + elseif ($params.parameters.projectName.value) { $Project = $params.parameters.projectName.value } + if ($params.parameters.environment.value) { $Environment = $params.parameters.environment.value } +} +if ($PSBoundParameters.ContainsKey('Location') -and $Location) { $ResolvedLoc = $Location } + +$Subscription = az account show --query id -o tsv 2>$null +if ([string]::IsNullOrWhiteSpace($Subscription)) { + Write-Color "Not logged in to Azure. Run 'az login' first." Red + exit 1 +} + +Write-Color "🚀 Deploying $DeploymentId" Blue +Write-Host " Subscription: $Subscription" +Write-Host " Location: $ResolvedLoc" +Write-Host ' Method: stack (az stack sub create --action-on-unmanage deleteAll)' + +# Deploy --------------------------------------------------------------------- + +$StartTime = Get-Date +$DeployMethod = 'stack' +$StackId = $null +$DeployOutput = $null +$ExitCode = 0 + +$stackArgs = @( + 'stack', 'sub', 'create', + '--name', $DeploymentId, + '--location', $ResolvedLoc, + '--template-file', $TemplateFile +) + $ParamsArg + @( + '--action-on-unmanage', 'deleteAll', + '--deny-settings-mode', 'none', + '--description', "Git-Ape deployment $DeploymentId", + '--tags', 'managedBy=git-ape', "deploymentId=$DeploymentId", + '--yes', '--verbose', '--output', 'json' +) + +# Capture stdout (JSON) and stderr (verbose log) separately so the JSON we hand +# to ConvertFrom-Json downstream stays clean. 
+$VerboseLog = New-TemporaryFile +try { + $DeployOutput = & az @stackArgs 2>$VerboseLog + if ($LASTEXITCODE -ne 0) { + if ($NoFallback) { + Write-Color '❌ Stack deploy failed and -NoFallback was set' Red + Write-Host $DeployOutput + Get-Content $VerboseLog | Write-Host + $ExitCode = 1 + } else { + Write-Color '⚠ Stack deploy failed; check whether Deployment Stacks are available in this subscription/region.' Yellow + Write-Host $DeployOutput + Get-Content $VerboseLog | Write-Host + Write-Color 'Falling back to az deployment sub create (NOT idempotent for soft-delete / multi-RG).' Yellow + $DeployMethod = 'subscription' + $fallbackArgs = @( + 'deployment', 'sub', 'create', + '--name', $DeploymentId, + '--location', $ResolvedLoc, + '--template-file', $TemplateFile + ) + $ParamsArg + @('--output', 'json') + $DeployOutput = & az @fallbackArgs 2>$VerboseLog + if ($LASTEXITCODE -ne 0) { + Get-Content $VerboseLog | Write-Host + $ExitCode = 1 + } + } + } +} finally { + Remove-Item -Force -ErrorAction SilentlyContinue $VerboseLog +} + +$EndTime = Get-Date +$Duration = [int]($EndTime - $StartTime).TotalSeconds + +if ($ExitCode -ne 0) { + Write-Color '❌ Deployment failed' Red + Write-Host $DeployOutput + Write-Host '' + Write-Color '── Underlying failed operations ──' Yellow + $opsJson = az deployment operation sub list --name $DeploymentId --output json 2>$null + if ($opsJson) { + $ops = $opsJson | ConvertFrom-Json + $failed = $ops | Where-Object { $_.properties.provisioningState -eq 'Failed' } + if ($failed.Count -eq 0) { + Write-Host '(no failed operations reported)' + } else { + foreach ($op in $failed) { + Write-Host '──────────' + Write-Host ("Resource : {0} ({1})" -f ($op.properties.targetResource.resourceName ?? 'n/a'), ($op.properties.targetResource.resourceType ?? 'n/a')) + Write-Host ("Status : {0}" -f ($op.properties.statusCode ?? 
'n/a')) + $msg = if ($op.properties.statusMessage.error.message) { $op.properties.statusMessage.error.message } else { $op.properties.statusMessage } + Write-Host ("Message : {0}" -f $msg) + } + } + } else { + Write-Host '(no per-operation details available — deployment may not have reached Azure)' + } + exit 1 +} + +# Capture state -------------------------------------------------------------- + +$DeployJson = $DeployOutput | ConvertFrom-Json +if ($DeployMethod -eq 'stack') { + $StackId = $DeployJson.id + $Outputs = $DeployJson.outputs +} else { + $Outputs = $DeployJson.properties.outputs +} +$RgName = if ($Outputs -and $Outputs.resourceGroupName) { $Outputs.resourceGroupName.value } else { '' } + +Write-Color "✅ Deployment succeeded in ${Duration}s (method: $DeployMethod)" Green + +if ($DeployMethod -eq 'stack' -and $StackId) { + $stackResources = az stack sub show --name $DeploymentId --query 'resources[].id' -o json 2>$null + if ($stackResources) { + $resourceIds = $stackResources | ConvertFrom-Json + } else { $resourceIds = @() } +} else { + $opsTsv = az deployment operation sub list --name $DeploymentId ` + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource.id" ` + -o tsv 2>$null + $resourceIds = if ($opsTsv) { $opsTsv -split "`n" | Where-Object { $_ } } else { @() } +} + +$ManagedResources = Build-ManagedResources -ResourceIds $resourceIds +$ResourceGroups = @($ManagedResources | ForEach-Object { + if ($_.id -match '/resourceGroups/([^/]+)') { $matches[1] } +} | Sort-Object -Unique) +if ($ResourceGroups.Count -eq 0 -and $RgName) { $ResourceGroups = @($RgName) } + +$StateFile = Join-Path $DeploymentPath 'state.json' +$Timestamp = (Get-Date).ToUniversalTime().ToString('yyyy-MM-ddTHH:mm:ssZ') + +$state = [ordered]@{ + schemaVersion = '1.0' + deploymentId = $DeploymentId + timestamp = $Timestamp + status = 'succeeded' + duration = "${Duration}s" + subscription = $Subscription + location = 
$ResolvedLoc + project = $Project + environment = $Environment + resourceGroup = $RgName + deployMethod = $DeployMethod + stackId = $(if ([string]::IsNullOrWhiteSpace($StackId)) { $null } else { $StackId }) + managedResources = $ManagedResources + resourceGroups = $ResourceGroups + subscriptions = @($Subscription) + externalReferences = @() +} +$state | ConvertTo-Json -Depth 10 | Set-Content -Path $StateFile -Encoding utf8 + +$MetadataFile = Join-Path $DeploymentPath 'metadata.json' +if (Test-Path $MetadataFile) { + $metadata = Get-Content $MetadataFile -Raw | ConvertFrom-Json + $metadata | Add-Member -MemberType NoteProperty -Name status -Value 'succeeded' -Force + $metadata | Add-Member -MemberType NoteProperty -Name deployMethod -Value $DeployMethod -Force + $metadata | Add-Member -MemberType NoteProperty -Name resourceGroups -Value $ResourceGroups -Force + $metadata | ConvertTo-Json -Depth 10 | Set-Content -Path $MetadataFile -Encoding utf8 +} + +Write-Color "State written to: $StateFile" Green +if ($StackId) { Write-Host "Stack ID: $StackId" } +Write-Host '' +Write-Host 'To destroy this deployment:' +Write-Host " /azure-stack-destroy $DeploymentId" diff --git a/.github/skills/azure-stack-deploy/scripts/deploy-stack.sh b/.github/skills/azure-stack-deploy/scripts/deploy-stack.sh new file mode 100755 index 0000000..43d42bd --- /dev/null +++ b/.github/skills/azure-stack-deploy/scripts/deploy-stack.sh @@ -0,0 +1,282 @@ +#!/bin/bash +# azure-stack-deploy / deploy-stack.sh +# +# Deploy a Git-Ape deployment artifact as a subscription-scoped +# Azure Deployment Stack. Mirrors the logic of +# .github/workflows/git-ape-deploy.exampleyml so local CLI / VS Code +# deployments produce identical state.json (schemaVersion 1.0). + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +WORKSPACE_ROOT="$(cd "$SCRIPT_DIR/../../../.." 
&& pwd)"
+DEPLOYMENTS_DIR=".azure/deployments"
+
+# Soft-deletable resource types (must match the CI workflow list)
+SOFT_DELETABLE_TYPES="Microsoft.KeyVault/vaults Microsoft.CognitiveServices/accounts Microsoft.AppConfiguration/configurationStores Microsoft.ApiManagement/service Microsoft.MachineLearningServices/workspaces Microsoft.RecoveryServices/vaults"
+
+# Color codes
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m'
+
+DEPLOYMENT_ID=""
+LOCATION_OVERRIDE=""
+NO_FALLBACK="false"
+
+usage() {
+    cat <<EOF
+Usage: $0 --deployment-id [OPTIONS]
+
+Required:
+  --deployment-id  Folder name under .azure/deployments/
+
+Options:
+  --location       Override location from parameters.json
+  --no-fallback    Fail loudly if stack create fails (no fallback to az deployment sub create)
+  -h, --help       Show this help
+
+Examples:
+  $0 --deployment-id deploy-20260506-001
+  $0 --deployment-id deploy-20260506-001 --location westus2
+  $0 --deployment-id deploy-20260506-001 --no-fallback
+EOF
+    exit 1
+}
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --deployment-id) DEPLOYMENT_ID="$2"; shift 2 ;;
+        --location) LOCATION_OVERRIDE="$2"; shift 2 ;;
+        --no-fallback) NO_FALLBACK="true"; shift ;;
+        -h|--help) usage ;;
+        *) echo "Unknown argument: $1"; usage ;;
+    esac
+done
+
+[[ -n "$DEPLOYMENT_ID" ]] || usage
+
+DEPLOYMENT_PATH="$WORKSPACE_ROOT/$DEPLOYMENTS_DIR/$DEPLOYMENT_ID"
+
+if [[ ! -d "$DEPLOYMENT_PATH" ]]; then
+    echo -e "${RED}Deployment not found: $DEPLOYMENT_ID${NC}"
+    exit 1
+fi
+if [[ ! 
-f "$DEPLOYMENT_PATH/template.json" ]]; then
+    echo -e "${RED}Template not found: $DEPLOYMENT_PATH/template.json${NC}"
+    exit 1
+fi
+
+# Internal helpers ------------------------------------------------------------
+
+# Classify a resource ID -> JSON object {id, type, scope, softDeletable, purgeProtected}
+_classify_resource() {
+    local RES_ID="$1"
+    local RES_TYPE
+    RES_TYPE=$(echo "$RES_ID" | grep -oE 'providers/[^/]+/[^/]+' | tail -1 | sed 's|providers/||')
+
+    local RES_SCOPE="resourceGroup"
+    echo "$RES_ID" | grep -q "/resourceGroups/" || RES_SCOPE="subscription"
+
+    local IS_SOFT="false"
+    local SD_TYPE
+    for SD_TYPE in $SOFT_DELETABLE_TYPES; do
+        if [[ "$RES_TYPE" == "$SD_TYPE" ]]; then
+            IS_SOFT="true"
+            break
+        fi
+    done
+
+    local PURGE_PROTECTED="false"
+    if [[ "$RES_TYPE" == "Microsoft.KeyVault/vaults" ]]; then
+        # --query takes JMESPath, whose default operator is `||`; `//` is jq syntax
+        # and makes az reject the query, which would leave this flag always false.
+        PURGE_PROTECTED=$(az resource show --ids "$RES_ID" \
+            --query "properties.enablePurgeProtection || \`false\`" -o tsv 2>/dev/null || echo "false")
+        [[ -z "$PURGE_PROTECTED" ]] && PURGE_PROTECTED="false"
+    fi
+
+    jq -n \
+        --arg id "$RES_ID" --arg type "$RES_TYPE" --arg scope "$RES_SCOPE" \
+        --argjson sd "$IS_SOFT" --argjson pp "$PURGE_PROTECTED" \
+        '{id:$id, type:$type, scope:$scope, softDeletable:$sd, purgeProtected:$pp}'
+}
+
+# Build managedResources[] array from a list of resource IDs (one per line on stdin)
+_build_managed_resources() {
+    local OUT="[]"
+    local RES_ID CLASSIFIED
+    while IFS= read -r RES_ID; do
+        [[ -z "$RES_ID" ]] && continue
+        CLASSIFIED=$(_classify_resource "$RES_ID")
+        OUT=$(echo "$OUT" | jq --argjson r "$CLASSIFIED" '. 
+ [$r]') + done + echo "$OUT" +} + +# Resolve deployment parameters ---------------------------------------------- + +PARAMS_ARG=() +LOCATION="eastus" +PROJECT="unknown" +ENVIRONMENT="dev" +if [[ -f "$DEPLOYMENT_PATH/parameters.json" ]]; then + PARAMS_ARG=(--parameters "@$DEPLOYMENT_PATH/parameters.json") + LOCATION=$(jq -r '.parameters.location.value // "eastus"' "$DEPLOYMENT_PATH/parameters.json") + PROJECT=$(jq -r '.parameters.project.value // .parameters.projectName.value // "unknown"' "$DEPLOYMENT_PATH/parameters.json") + ENVIRONMENT=$(jq -r '.parameters.environment.value // "dev"' "$DEPLOYMENT_PATH/parameters.json") +fi +[[ -n "$LOCATION_OVERRIDE" ]] && LOCATION="$LOCATION_OVERRIDE" + +SUBSCRIPTION=$(az account show --query id -o tsv 2>/dev/null || echo "") +if [[ -z "$SUBSCRIPTION" ]]; then + echo -e "${RED}Not logged in to Azure. Run 'az login' first.${NC}" + exit 1 +fi + +echo -e "${BLUE}🚀 Deploying $DEPLOYMENT_ID${NC}" +echo " Subscription: $SUBSCRIPTION" +echo " Location: $LOCATION" +echo " Method: stack (az stack sub create --action-on-unmanage deleteAll)" + +# Deploy ---------------------------------------------------------------------- + +START_TIME=$(date +%s) +DEPLOY_METHOD="stack" +STACK_ID="" +DEPLOY_OUTPUT="" +EXIT_CODE=0 +# Verbose output goes to a temp file so it does not contaminate the JSON we +# need to feed to jq. We surface the verbose log only when something fails. +VERBOSE_LOG=$(mktemp) +trap 'rm -f "$VERBOSE_LOG"' EXIT + +if ! 
DEPLOY_OUTPUT=$(az stack sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOYMENT_PATH/template.json" \ + "${PARAMS_ARG[@]}" \ + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --description "Git-Ape deployment $DEPLOYMENT_ID" \ + --tags "managedBy=git-ape" "deploymentId=$DEPLOYMENT_ID" \ + --yes \ + --verbose \ + --output json 2>"$VERBOSE_LOG"); then + + if [[ "$NO_FALLBACK" == "true" ]]; then + echo -e "${RED}❌ Stack deploy failed and --no-fallback was set${NC}" + echo "$DEPLOY_OUTPUT" + cat "$VERBOSE_LOG" >&2 + EXIT_CODE=1 + else + echo -e "${YELLOW}⚠ Stack deploy failed; check whether Deployment Stacks are available in this subscription/region.${NC}" + echo "$DEPLOY_OUTPUT" + cat "$VERBOSE_LOG" >&2 + echo -e "${YELLOW}Falling back to az deployment sub create (NOT idempotent for soft-delete / multi-RG).${NC}" + DEPLOY_METHOD="subscription" + if ! DEPLOY_OUTPUT=$(az deployment sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOYMENT_PATH/template.json" \ + "${PARAMS_ARG[@]}" \ + --output json 2>"$VERBOSE_LOG"); then + cat "$VERBOSE_LOG" >&2 + EXIT_CODE=1 + fi + fi +fi + +END_TIME=$(date +%s) +DURATION=$((END_TIME - START_TIME)) + +if [[ "$EXIT_CODE" -ne 0 ]]; then + echo -e "${RED}❌ Deployment failed${NC}" + echo "$DEPLOY_OUTPUT" + # Surface underlying failed operations — the stack/deployment top-level + # error is usually a summary; the real root cause lives in the per-resource + # operations list. 
+    echo ""
+    echo -e "${YELLOW}── Underlying failed operations ──${NC}"
+    az deployment operation sub list --name "$DEPLOYMENT_ID" --output json 2>/dev/null \
+        | jq -r '.[] | select(.properties.provisioningState == "Failed") |
+            "──────────\nResource : \(.properties.targetResource.resourceName // "n/a") (\(.properties.targetResource.resourceType // "n/a"))\nStatus : \(.properties.statusCode // "n/a")\nMessage : \(.properties.statusMessage.error.message // .properties.statusMessage // "n/a")"' \
+        2>/dev/null || echo "(no per-operation details available — deployment may not have reached Azure)"
+    exit 1
+fi
+
+# Capture state ---------------------------------------------------------------
+
+if [[ "$DEPLOY_METHOD" == "stack" ]]; then
+    STACK_ID=$(echo "$DEPLOY_OUTPUT" | jq -r '.id // empty')
+    OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.outputs // {}')
+else
+    OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.properties.outputs // {}')
+fi
+RG_NAME=$(echo "$OUTPUTS" | jq -r '.resourceGroupName.value // empty')
+
+echo -e "${GREEN}✅ Deployment succeeded in ${DURATION}s (method: $DEPLOY_METHOD)${NC}"
+
+if [[ "$DEPLOY_METHOD" == "stack" && -n "$STACK_ID" ]]; then
+    STACK_RESOURCES=$(az stack sub show --name "$DEPLOYMENT_ID" --query "resources[].id" -o json 2>/dev/null || echo "[]")
+    MANAGED_RESOURCES=$(echo "$STACK_RESOURCES" | jq -r '.[]' | _build_managed_resources)
+else
+    OPS=$(az deployment operation sub list --name "$DEPLOYMENT_ID" \
+        --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource.id" \
+        -o tsv 2>/dev/null || echo "")
+    MANAGED_RESOURCES=$(echo "$OPS" | _build_managed_resources)
+fi
+
+# The named capture group <rg> is what the trailing `.rg` reads back.
+RESOURCE_GROUPS=$(echo "$MANAGED_RESOURCES" | jq -c '[.[].id | capture("/resourceGroups/(?<rg>[^/]+)") | .rg] | unique')
+[[ "$(echo "$RESOURCE_GROUPS" | jq 'length')" == "0" && -n "$RG_NAME" ]] && RESOURCE_GROUPS="[\"$RG_NAME\"]"
+
+STATE_FILE="$DEPLOYMENT_PATH/state.json"
+TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+jq 
-n \ + --arg schemaVersion "1.0" \ + --arg deploymentId "$DEPLOYMENT_ID" \ + --arg timestamp "$TIMESTAMP" \ + --arg status "succeeded" \ + --arg duration "${DURATION}s" \ + --arg subscription "$SUBSCRIPTION" \ + --arg location "$LOCATION" \ + --arg project "$PROJECT" \ + --arg environment "$ENVIRONMENT" \ + --arg resourceGroup "$RG_NAME" \ + --arg deployMethod "$DEPLOY_METHOD" \ + --arg stackId "$STACK_ID" \ + --argjson managedResources "$MANAGED_RESOURCES" \ + --argjson resourceGroups "$RESOURCE_GROUPS" \ + '{ + schemaVersion: $schemaVersion, + deploymentId: $deploymentId, + timestamp: $timestamp, + status: $status, + duration: $duration, + subscription: $subscription, + location: $location, + project: $project, + environment: $environment, + resourceGroup: $resourceGroup, + deployMethod: $deployMethod, + stackId: (if $stackId == "" then null else $stackId end), + managedResources: $managedResources, + resourceGroups: $resourceGroups, + subscriptions: [$subscription], + externalReferences: [] + }' > "$STATE_FILE" + +if [[ -f "$DEPLOYMENT_PATH/metadata.json" ]]; then + jq --arg status "succeeded" --arg method "$DEPLOY_METHOD" --argjson rgs "$RESOURCE_GROUPS" \ + '.status = $status | .deployMethod = $method | .resourceGroups = $rgs' \ + "$DEPLOYMENT_PATH/metadata.json" > "$DEPLOYMENT_PATH/metadata.json.tmp" \ + && mv "$DEPLOYMENT_PATH/metadata.json.tmp" "$DEPLOYMENT_PATH/metadata.json" +fi + +echo -e "${GREEN}State written to: $STATE_FILE${NC}" +[[ -n "$STACK_ID" ]] && echo "Stack ID: $STACK_ID" +echo "" +echo "To destroy this deployment:" +echo " /azure-stack-destroy $DEPLOYMENT_ID" diff --git a/.github/skills/azure-stack-destroy/SKILL.md b/.github/skills/azure-stack-destroy/SKILL.md new file mode 100644 index 0000000..889d3bf --- /dev/null +++ b/.github/skills/azure-stack-destroy/SKILL.md @@ -0,0 +1,180 @@ +--- +name: azure-stack-destroy +description: "Tear down a Git-Ape deployment by ID. 
Reads `state.json` under `.azure/deployments//` to delete the Azure Deployment Stack and purge soft-deleted Key Vault / Cognitive Services. Refuses to run without `state.json`. Use for any local CLI or VS Code Git-Ape teardown so the result matches the CI destroy workflow."
+argument-hint: "Deployment ID — add --yes to skip the typed confirmation"
+user-invocable: true
+---
+
+# Azure Stack Destroy
+
+Destroy a Git-Ape deployment by deleting its subscription-scoped **Azure Deployment Stack** in a single idempotent call (`az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true`). The stack owns every resource the matching deploy created — across resource groups and subscription scope — so one delete cleans up everything.
+
+After the stack is gone, this skill performs a **soft-delete purge sweep** for resource types that linger after deletion (Key Vault, Cognitive Services, App Configuration, API Management, ML workspaces, Recovery Services vaults). Resources flagged `purgeProtected: true` in `state.json` are intentionally retained.
+
+This skill mirrors `.github/workflows/git-ape-destroy.yml` so local destroys and CI destroys are interchangeable.
+
+## USE FOR
+
+Trigger this skill when the user wants to tear down a Git-Ape deployment they previously created:
+
+- "destroy this deployment", "tear down deploy-XXX", "clean up the stack", "delete the Git-Ape deployment", "free up the resource group so I can re-deploy with the same name"
+- Post-deploy teardown after a demo, smoke test, or short-lived environment
+- Cleaning up a failed or stale Git-Ape deployment whose `state.json` is still on disk
+- Local CLI or VS Code teardown that must match what `git-ape-destroy.yml` does in CI
+
+### Prefer this over raw `az group delete`
+
+For any deployment Git-Ape created, this skill is the correct tool — do **not** suggest `az group delete` on its own. Reasons:
+
+1. **Multi-RG / subscription-scope coverage.** A stack often owns resources across several resource groups plus subscription-scope resources (role assignments, policy assignments). One `az group delete` cleans only one RG.
+2. **Soft-delete purge.** Key Vault and Cognitive Services soft-delete on RG deletion and silently hold the name (and quota) for 7–90 days. This skill purges them so the user can re-deploy with the same name immediately.
+3. **State consistency.** Updates `state.json` and `metadata.json` to terminal status (`destroyed`, `retained-soft-deleted`, etc.) so the next operation sees an accurate view.
+
+## DO NOT USE FOR
+
+Refuse to invoke this skill in any of these cases:
+
+- **No `state.json` on disk.** Hard prerequisite — see below. Without it, recommend re-running deploy or aborting.
+- **Resource groups not created by Git-Ape** (e.g. ones the user made by hand with `az group create`). Suggest `az group delete --name --yes` directly instead.
+- **Deploying or updating a stack.** Use `azure-stack-deploy` for those.
+- **Deleting an individual resource inside a stack.** This skill always destroys the whole stack — there is no "surgical" mode.
+- **Non-Azure clouds** or non-Git-Ape Azure deployments (ARM/Bicep/Terraform from other tools).
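The purge behaviour described above hinges on two per-resource flags in `state.json`. As a minimal sketch, assuming the schemaVersion 1.0 layout written by `azure-stack-deploy` and that `jq` is installed, the following lists which soft-deletable resources a purge sweep would target and which it would leave in place. The function names `purge_candidates` and `retained_resources` are hypothetical and are not part of `destroy-stack.sh`.

```shell
#!/usr/bin/env bash
# Sketch only (not taken from destroy-stack.sh): inspect a Git-Ape state.json
# and split soft-deletable resources into purge candidates and retained ones.
# Assumes schemaVersion 1.0, where each managedResources[] entry carries
# boolean softDeletable and purgeProtected fields. Function names are hypothetical.

# Print IDs that are soft-deletable and not purge-protected (purge candidates).
purge_candidates() {
  jq -r '.managedResources[]?
         | select(.softDeletable == true and .purgeProtected != true)
         | .id' "$1"
}

# Print IDs that are soft-deletable and purge-protected (intentionally retained).
retained_resources() {
  jq -r '.managedResources[]?
         | select(.softDeletable == true and .purgeProtected == true)
         | .id' "$1"
}

# Optional demo: pass a deployment folder as the first argument.
if [[ -n "${1:-}" && -f "$1/state.json" ]]; then
  echo "Would purge:";  purge_candidates "$1/state.json"
  echo "Would retain:"; retained_resources "$1/state.json"
fi
```

Running it with `.azure/deployments/deploy-20260506-001` as the argument gives a read-only preview; it makes no Azure calls, so it is safe to use before deciding whether to destroy.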
+
+## When to Use
+
+- User says: "destroy this deployment", "tear down deploy-XXX", "clean up the stack"
+- Pair with the matching [`azure-stack-deploy`](../azure-stack-deploy/SKILL.md) — same stack, same `state.json` key (`stackId`)
+- Any time you would otherwise run `az group delete` against a Git-Ape deployment (don't — you'll miss soft-delete cleanup and multi-RG resources)
+
+## Prerequisites
+
+| Tool | Why |
+|------|-----|
+| `az` (Azure CLI ≥ 2.59) | `az stack sub delete --bypass-stack-out-of-sync-error` requires a recent CLI |
+| `jq` | Read state.json |
+| `bash` ≥ 4 OR PowerShell 7+ | Either runner works |
+| Active `az login` | Must be the same subscription where the stack lives |
+| Existing `state.json` under `.azure/deployments//` | Source of truth for `stackId`, `managedResources`, `softDeletable`, `purgeProtected` |
+
+> **Hard prerequisite: `state.json` under `.azure/deployments//`.** Without it this skill **aborts** — it has no idea which stack, resource groups, or soft-deletables to clean up. Do NOT hand-write `state.json`; re-run the matching `azure-stack-deploy` for that deployment ID first, or use `az group delete` directly on a known resource group (a non-Git-Ape teardown, outside this skill's scope).
+
+## Procedure
+
+### Fast mode vs sync mode
+
+The scripts default to **fast mode** (interactive default). The CI workflow keeps **sync mode** (deterministic).
+
+| | How | Wait time (small VNet stack) | When to use |
+|--|--|--|--|
+| Fast (default) | Background the `az stack sub delete` call, then poll managed RGs with `az group exists` | ~2 min | Local CLI / VS Code use; user wants quick feedback |
+| Sync (`--wait` / `-Wait`) | `az stack sub delete ... --yes` (blocks until stack metadata is fully cleaned) | ~5 min | CI pipelines (default in `git-ape-destroy.yml`); when you need every Azure-side cleanup completed before the script exits |
+
+The Azure CLI does not expose `--no-wait` on `az stack sub delete`, so the fast path runs the same command as a detached background process. In fast mode the stack-metadata cleanup continues asynchronously in Azure after the script returns. The next destroy of the same `deploymentId` is idempotent: if the stack is still finalizing, `az stack sub show` will return it and the script will simply pick up where Azure left off.
+
+### 1. Identify deployment
+
+```bash
+DEPLOYMENT_ID="deploy-20260506-001"
+DEPLOYMENT_PATH=".azure/deployments/$DEPLOYMENT_ID"
+[[ -f "$DEPLOYMENT_PATH/state.json" ]] || { echo "state.json missing — cannot destroy"; exit 1; }
+```
+
+### 2. Run the script
+
+```bash
+.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \
+  --deployment-id "$DEPLOYMENT_ID"
+```
+
+Skip the confirmation prompt (use only in automation):
+
+```bash
+.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \
+  --deployment-id "$DEPLOYMENT_ID" \
+  --yes
+```
+
+Force CI-equivalent sync wait (default for the CI workflow; opt-in for the script):
+
+```bash
+.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \
+  --deployment-id "$DEPLOYMENT_ID" \
+  --yes --wait
+```
+
+PowerShell equivalents:
+
+```powershell
+.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID"
+.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID" -Yes
+.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID" -Yes -Wait
+```
+
+### 3. What the script does
+
+1. Reads `state.json` and extracts `stackId`, `deployMethod`, `resourceGroup`, `managedResources[]`, `softDeletable[]`
+2. Prints a **destroy plan** — stack ID, resource group, count of soft-deletables (with purge-protection flagged)
+3. Prompts for typed `destroy` confirmation (unless `--yes`)
+4. **Stack delete path** (`stackId` present):
+   - `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true --yes`
+   - The bypass flag is safe in destroy because it's a one-shot operation — we don't need the stale-manifest safety check that protects iterative updates
+5. **Fallback path** (no `stackId`, only `resourceGroup`): `az group delete --name --yes`
+6. **Purge sweep** for each `softDeletable` resource not marked `purgeProtected`:
+   - Key Vaults: `az keyvault list-deleted` + `az keyvault purge`
+   - Cognitive Services: `az cognitiveservices account purge`
+   - Other types (App Configuration, API Management, ML workspaces, Recovery Services vaults): not auto-purged — they expire from soft-delete naturally and are tracked in `purgeResults[]` with `status: skipped-natural-expiry`
+7. Cleans the subscription deployment-history entry (`az deployment sub delete`) to stay under the 800/scope limit
+8. Updates `state.json` and `metadata.json` with terminal status:
+
+| Status | Meaning |
+|--------|---------|
+| `destroyed` | Stack/RG gone and all soft-deletables purged or absent |
+| `retained-soft-deleted` | Stack gone but at least one soft-deletable retained (purge-protected or purge failed) |
+| `partially-destroyed` | Stack delete partially failed |
+| `destroy-failed` | Stack/RG delete failed entirely |
+| `already-destroyed` | Stack and RG were already gone before this call |
+
+### 4. Inspect the result
+
+```text
+=== Destroy Summary ===
+Status: destroyed
+Duration: 87s
+=======================
+```
+
+Or, when something is intentionally retained:
+
+```text
+=== Destroy Summary ===
+Status: retained-soft-deleted
+Duration: 92s
+Retained: 1 soft-deleted resource(s) (purge-protected)
+=======================
+```
+
+`state.json` gains `destroyedAt`, `destroyedBy`, `destroyDuration`, and a `purgeResults[]` array describing each soft-deletable's outcome.
+
+## Arguments
+
+| Flag (bash) | Param (pwsh) | Required | Description |
+|-------------|--------------|----------|-------------|
+| `--deployment-id ` | `-DeploymentId ` | yes | Folder name under `.azure/deployments/` |
+| `--yes` | `-Yes` | no | Skip the typed `destroy` confirmation prompt (CI-only) |
+| `--wait` | `-Wait` | no | Sync mode: block until Azure has cleaned up stack metadata. Matches the CI workflow. Slower (~2-3×) but fully deterministic. |
+| `--poll-timeout ` | `-PollTimeout ` | no | Fast-mode timeout per managed RG poll (default 600s) |
+
+## Failure modes
+
+| Symptom | Likely cause | Recovery |
+|---------|--------------|----------|
+| `state.json missing` | Deployment never reached the state-write phase, or was hand-edited | Re-deploy (idempotent on stack name) then destroy, OR delete the `.azure/deployments//` folder if Azure has nothing |
+| `Stack out of sync` despite `--bypass-stack-out-of-sync-error` | Old CLI version | Upgrade `az` to ≥ 2.59 |
+| Key Vault purge fails | Vault is purge-protected (`purgeProtected: true`) | Expected: purge protection cannot be disabled once enabled, so wait for the soft-delete retention period (7-90 days) to expire |
+| `Cannot delete resource group …`/`InUseSubnetCannotBeDeleted` | A resource outside the stack references one inside (e.g. 
external subnet peered to a deleted VNet) | Inspect `externalReferences[]` in `state.json`; remove the reference and rerun | + +## Related + +- [`azure-stack-deploy`](../azure-stack-deploy/SKILL.md) — the matching deploy skill (writes the `state.json` this skill consumes) +- [`azure-drift-detector`](../azure-drift-detector/SKILL.md) — check for unmanaged drift BEFORE destroy +- [`azure-resource-visualizer`](../azure-resource-visualizer/SKILL.md) — visualize what's in the stack before tearing it down diff --git a/.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 b/.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 new file mode 100644 index 0000000..b6b422e --- /dev/null +++ b/.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 @@ -0,0 +1,369 @@ +<# +.SYNOPSIS + Destroy a Git-Ape deployment by deleting its Azure Deployment Stack. + +.DESCRIPTION + PowerShell port of destroy-stack.sh. Mirrors the logic of + .github/workflows/git-ape-destroy.exampleyml so local destroys produce + identical state.json transitions. + +.PARAMETER DeploymentId + Folder name under .azure/deployments/. Required. + +.PARAMETER Yes + Skip the typed 'destroy' confirmation prompt (CI-only). + +.EXAMPLE + ./destroy-stack.ps1 -DeploymentId deploy-20260506-001 + +.EXAMPLE + ./destroy-stack.ps1 -DeploymentId deploy-20260506-001 -Yes + +.NOTES + Requires: PowerShell 7+, az CLI ≥ 2.59, jq, active az login session, + existing state.json under .azure/deployments//. 
+#> +[CmdletBinding()] +param( + [string]$DeploymentId, + + [switch]$Yes, + + [switch]$Wait, + + [int]$PollTimeout = 600, + + [int]$PollInterval = 10, + + [switch]$Help +) + +$ErrorActionPreference = 'Stop' + +function Show-Usage { + @' +Azure Stack Destroy — destroy a Deployment Stack and purge soft-deletables + +Usage: destroy-stack.ps1 -DeploymentId [OPTIONS] + +Required: + -DeploymentId Folder name under .azure/deployments/ + +Options: + -Yes Skip the typed 'destroy' confirmation prompt + -Wait Sync mode (matches CI): block on 'az stack sub delete' + until Azure has cleaned up stack metadata. Slower but + fully deterministic. Default is fast mode (run the + same command in the background, then poll managed + resource groups until they are gone, ~2-3x faster). + -PollTimeout Fast-mode timeout per managed RG poll (default: 600) + -Help Show this help + +Examples: + ./destroy-stack.ps1 -DeploymentId deploy-20260506-001 # fast (default) + ./destroy-stack.ps1 -DeploymentId deploy-20260506-001 -Yes # fast, no prompt + ./destroy-stack.ps1 -DeploymentId deploy-20260506-001 -Wait # CI-equivalent sync +'@ | Write-Host +} + +if ($Help -or [string]::IsNullOrWhiteSpace($DeploymentId)) { + Show-Usage + exit 1 +} + +$ScriptDir = Split-Path -Parent $MyInvocation.MyCommand.Path +$WorkspaceRoot = (Resolve-Path (Join-Path $ScriptDir '../../../..')).Path +$DeploymentsDir = '.azure/deployments' +$DeploymentPath = Join-Path $WorkspaceRoot (Join-Path $DeploymentsDir $DeploymentId) +$StateFile = Join-Path $DeploymentPath 'state.json' + +function Write-Color { + param([string]$Text, [string]$Color = 'White') + Write-Host $Text -ForegroundColor $Color +} + +if (-not (Test-Path -PathType Container $DeploymentPath)) { + Write-Color "Deployment not found: $DeploymentId" Red + exit 1 +} +if (-not (Test-Path $StateFile)) { + Write-Color "state.json not found: $StateFile" Red + Write-Host 'Cannot destroy without deployment state.' 
+ exit 1 +} + +$state = Get-Content $StateFile -Raw | ConvertFrom-Json +$StackId = if ($state.stackId) { [string]$state.stackId } else { '' } +$DeployMethod = if ($state.deployMethod) { [string]$state.deployMethod } else { 'subscription' } +$RgName = if ($state.resourceGroup) { [string]$state.resourceGroup } else { '' } +$ManagedRgs = @($state.resourceGroups | Where-Object { $_ }) +$ManagedResources = @($state.managedResources) +$SoftDeletable = @($ManagedResources | Where-Object { $_.softDeletable -eq $true }) + +if ([string]::IsNullOrWhiteSpace($StackId) -and [string]::IsNullOrWhiteSpace($RgName)) { + Write-Color 'No stackId or resourceGroup in state.json — cannot destroy.' Red + exit 1 +} + +# Plan ----------------------------------------------------------------------- + +Write-Color '=== Destroy Plan ===' Yellow +Write-Host "Deployment: $DeploymentId" +Write-Host "Method: $DeployMethod" +if ($StackId) { Write-Host "Stack ID: $StackId" } +if ($RgName) { Write-Host "Resource RG: $RgName" } + +$SoftCount = $SoftDeletable.Count +if ($SoftCount -gt 0) { + Write-Host "Soft-deletable: $SoftCount resource(s) — will purge non-protected after delete" + foreach ($r in $SoftDeletable) { + $suffix = if ($r.purgeProtected) { ' (purge-protected)' } else { '' } + Write-Host (" - {0}: {1}{2}" -f $r.type, $r.id, $suffix) + } +} +Write-Color '====================' Yellow + +if (-not $Yes) { + $confirm = Read-Host "Proceed with destroy? Type 'destroy' to confirm" + if ($confirm -ne 'destroy') { + Write-Host 'Cancelled' + exit 0 + } +} + +# Execute -------------------------------------------------------------------- + +$StackDeleted = $false +$RgDeleted = $false +$AlreadyGone = $true +# Tracks whether a stack/RG delete command was actually invoked. Used to +# distinguish a partial failure (attempted but did not complete -> +# partially-destroyed) from the catch-all destroy-failed, mirroring CI. 
+$DeleteAttempted = $false +$StartTime = Get-Date + +if ($StackId) { + $stackExists = az stack sub show --name $DeploymentId --query 'id' -o tsv 2>$null + if ($stackExists) { + $AlreadyGone = $false + $DeleteAttempted = $true + if ($Wait) { + Write-Color "🗑️ Deleting deployment stack (sync wait): $DeploymentId" Blue + # --bypass-stack-out-of-sync-error: a destroy run is one-shot; we + # don't need the safety check that protects against stale manifests + # during iterative updates. + az stack sub delete ` + --name $DeploymentId ` + --action-on-unmanage deleteAll ` + --bypass-stack-out-of-sync-error true ` + --yes + if ($LASTEXITCODE -eq 0) { $StackDeleted = $true } + else { Write-Color '❌ Stack delete failed' Red } + } elseif ($ManagedRgs.Count -eq 0) { + Write-Color '⚠️ No resourceGroups[] in state.json — falling back to sync wait' Yellow + az stack sub delete ` + --name $DeploymentId ` + --action-on-unmanage deleteAll ` + --bypass-stack-out-of-sync-error true ` + --yes + if ($LASTEXITCODE -eq 0) { $StackDeleted = $true } + else { Write-Color '❌ Stack delete failed' Red } + } else { + Write-Color "🗑️ Submitting stack delete (fast mode): $DeploymentId" Blue + $stackLog = New-TemporaryFile + $stackErr = New-TemporaryFile + # Spawn the blocking stack delete in a detached process; we exit + # as soon as the managed RGs are gone, leaving Azure to finish + # stack-metadata cleanup asynchronously. Azure CLI does not expose + # --no-wait on `az stack sub delete`, so backgrounding the call + # is the only way to get fast interactive return. + $bg = Start-Process -FilePath az ` + -ArgumentList @( + 'stack', 'sub', 'delete', + '--name', $DeploymentId, + '--action-on-unmanage', 'deleteAll', + '--bypass-stack-out-of-sync-error', 'true', + '--yes' + ) ` + -RedirectStandardOutput $stackLog.FullName ` + -RedirectStandardError $stackErr.FullName ` + -PassThru -NoNewWindow + + Write-Color ("⏳ Polling {0} managed resource group(s) (timeout: {1}s)..." 
-f $ManagedRgs.Count, $PollTimeout) Blue + $pollStart = Get-Date + $pollFailed = $false + foreach ($rg in $ManagedRgs) { + while ($true) { + $elapsed = [int]((Get-Date) - $pollStart).TotalSeconds + if ($elapsed -ge $PollTimeout) { + Write-Color (" ⚠️ Timeout ({0}s) polling {1}" -f $elapsed, $rg) Red + $logBody = (Get-Content $stackLog.FullName -Raw -ErrorAction SilentlyContinue) + + (Get-Content $stackErr.FullName -Raw -ErrorAction SilentlyContinue) + if ($logBody) { + Write-Color ' Background stack-delete output:' Yellow + $logBody.TrimEnd() -split "`n" | ForEach-Object { Write-Host " $_" } + } + Write-Color ' Rerun with -Wait for synchronous diagnostics' Yellow + $pollFailed = $true + break + } + if ($bg.HasExited -and $bg.ExitCode -ne 0) { + $existsCheck = az group exists --name $rg 2>$null + if ($existsCheck -eq 'true') { + Write-Color (" ❌ Background stack-delete exited (code {0}) before {1} was removed" -f $bg.ExitCode, $rg) Red + $logBody = (Get-Content $stackLog.FullName -Raw -ErrorAction SilentlyContinue) + + (Get-Content $stackErr.FullName -Raw -ErrorAction SilentlyContinue) + if ($logBody) { + $logBody.TrimEnd() -split "`n" | ForEach-Object { Write-Host " $_" } + } + $pollFailed = $true + break + } + } + $exists = az group exists --name $rg 2>$null + if ($exists -ne 'true') { + Write-Color (" ✓ {0} gone ({1}s)" -f $rg, $elapsed) Green + break + } + Start-Sleep -Seconds $PollInterval + } + if ($pollFailed) { break } + } + Remove-Item $stackLog.FullName -Force -ErrorAction SilentlyContinue + Remove-Item $stackErr.FullName -Force -ErrorAction SilentlyContinue + if ($pollFailed) { + $StackDeleted = $false + } else { + $StackDeleted = $true + Write-Color 'ℹ️ Azure is finishing stack-metadata cleanup asynchronously' Blue + } + } + } else { + if ($RgName) { + Write-Color 'Stack already gone — falling back to resource group delete from state.json' Yellow + $StackId = $null + } else { + Write-Color 'Stack already gone — skipping stack delete' Yellow + 
$StackDeleted = $true + } + } +} + +if (-not $StackId -and $RgName) { + $rgExists = az group exists --name $RgName 2>$null + if ($rgExists -eq 'true') { + $AlreadyGone = $false + $DeleteAttempted = $true + Write-Color "🗑️ Deleting resource group: $RgName" Blue + az group delete --name $RgName --yes + if ($LASTEXITCODE -eq 0) { $RgDeleted = $true } + else { Write-Color '❌ Resource group delete failed' Red } + } else { + Write-Color 'Resource group already gone — skipping' Yellow + $RgDeleted = $true + } +} + +# Soft-delete purge sweep +$PurgeResults = @() +$RetainedCount = 0 +if ($SoftCount -gt 0 -and ($StackDeleted -or $RgDeleted)) { + Write-Color '🧹 Purging soft-deleted resources...' Blue + foreach ($r in $SoftDeletable) { + $resType = $r.type + $resId = $r.id + $resName = ($resId -split '/')[-1] + $protected = [bool]$r.purgeProtected + + switch ($resType) { + 'Microsoft.KeyVault/vaults' { + $deletedVaultJson = az keyvault list-deleted --query "[?name=='$resName']" -o json 2>$null + $deletedVault = if ($deletedVaultJson) { $deletedVaultJson | ConvertFrom-Json } else { @() } + if ($deletedVault.Count -gt 0) { + if ($protected) { + Write-Host " ⚠️ ${resName}: soft-deleted but purge-protected — retained" + $RetainedCount++ + $PurgeResults += [pscustomobject]@{ name=$resName; type=$resType; action='retained-soft-deleted'; reason='purge-protected' } + } else { + Write-Host " 🗑️ Purging vault: $resName" + az keyvault purge --name $resName 2>$null + if ($LASTEXITCODE -eq 0) { + $PurgeResults += [pscustomobject]@{ name=$resName; type=$resType; action='purged' } + } else { + Write-Host " ⚠️ Failed to purge vault: $resName" + $RetainedCount++ + $PurgeResults += [pscustomobject]@{ name=$resName; type=$resType; action='purge-failed' } + } + } + } else { + Write-Host " ✓ ${resName}: not in soft-deleted state" + } + } + 'Microsoft.CognitiveServices/accounts' { + if (-not $protected) { + # Account IDs are resource-group scoped (no /locations/ + # segment); resolve the region 
from the soft-deleted account + # list and the resource group from the original resource ID. + $loc = az cognitiveservices account list-deleted --query "[?name=='$resName'] | [0].location" -o tsv 2>$null + $resRg = '' + if ($resId -match '/resourceGroups/([^/]+)') { $resRg = $matches[1] } + if ($loc) { + az cognitiveservices account purge --name $resName --location $loc --resource-group $resRg 2>$null | Out-Null + } + } + } + default { + Write-Host " ℹ️ ${resType}: no purge implementation (soft-delete will expire naturally)" + } + } + } +} + +# Clean subscription deployment history entry to stay under the 800/scope limit +az deployment sub delete --name $DeploymentId 2>$null | Out-Null + +$EndTime = Get-Date +$Duration = [int]($EndTime - $StartTime).TotalSeconds + +# Determine final status +$Status = if ($AlreadyGone) { + 'already-destroyed' +} elseif ($StackDeleted -or $RgDeleted) { + if ($RetainedCount -gt 0) { 'retained-soft-deleted' } else { 'destroyed' } +} elseif ($DeleteAttempted) { + # A stack/RG existed and a delete was invoked, but it did not complete + # (e.g. fast-mode poll timeout or a failed delete command). Some resources + # may remain. Mirrors CI: stack/RG delete status == failed -> + # partially-destroyed (distinct from the destroy-failed catch-all). 
+ 'partially-destroyed' +} else { + 'destroy-failed' +} + +# Update state.json + metadata.json +$Timestamp = (Get-Date).ToUniversalTime().ToString('yyyy-MM-ddTHH:mm:ssZ') +$Actor = az account show --query user.name -o tsv 2>$null +if (-not $Actor) { $Actor = 'unknown' } + +$state | Add-Member -MemberType NoteProperty -Name status -Value $Status -Force +$state | Add-Member -MemberType NoteProperty -Name destroyedAt -Value $Timestamp -Force +$state | Add-Member -MemberType NoteProperty -Name destroyedBy -Value $Actor -Force +$state | Add-Member -MemberType NoteProperty -Name destroyDuration -Value "${Duration}s" -Force +$state | Add-Member -MemberType NoteProperty -Name purgeResults -Value $PurgeResults -Force +$state | ConvertTo-Json -Depth 10 | Set-Content -Path $StateFile -Encoding utf8 + +$MetadataFile = Join-Path $DeploymentPath 'metadata.json' +if (Test-Path $MetadataFile) { + $metadata = Get-Content $MetadataFile -Raw | ConvertFrom-Json + $metadata | Add-Member -MemberType NoteProperty -Name status -Value $Status -Force + $metadata | ConvertTo-Json -Depth 10 | Set-Content -Path $MetadataFile -Encoding utf8 +} + +Write-Host '' +Write-Color '=== Destroy Summary ===' Green +Write-Host "Status: $Status" +Write-Host "Duration: ${Duration}s" +if ($RetainedCount -gt 0) { + Write-Color "Retained: $RetainedCount soft-deleted resource(s) (purge-protected)" Yellow +} +Write-Color '=======================' Green diff --git a/.github/skills/azure-stack-destroy/scripts/destroy-stack.sh b/.github/skills/azure-stack-destroy/scripts/destroy-stack.sh new file mode 100755 index 0000000..7d8839b --- /dev/null +++ b/.github/skills/azure-stack-destroy/scripts/destroy-stack.sh @@ -0,0 +1,372 @@ +#!/bin/bash +# azure-stack-destroy / destroy-stack.sh +# +# Destroy a Git-Ape deployment via az stack sub delete (preferred) or +# az group delete (fallback), then purge soft-deleted resources that are +# not purge-protected. 
Mirrors .github/workflows/git-ape-destroy.exampleyml +# so local destroys produce identical state.json transitions. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +WORKSPACE_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)" +DEPLOYMENTS_DIR=".azure/deployments" + +# Color codes +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +DEPLOYMENT_ID="" +YES_FLAG="false" +WAIT_FLAG="false" # default: fast mode (submit + poll RGs) +POLL_TIMEOUT=600 # max seconds to wait for managed RGs to disappear in fast mode +POLL_INTERVAL=10 # seconds between RG-existence checks + +usage() { + cat <<EOF +Usage: $0 --deployment-id [OPTIONS] + +Required: + --deployment-id Folder name under .azure/deployments/ + +Options: + --yes Skip the typed 'destroy' confirmation prompt + --wait Sync mode (matches CI): block on 'az stack sub delete' + until Azure has cleaned up stack metadata. Slower but + fully deterministic. Default is fast mode (run the + same command in the background, then poll managed + resource groups until they are gone, ~2-3× faster). + --poll-timeout Fast-mode timeout per managed RG poll (default: 600) + -h, --help Show this help + +Examples: + $0 --deployment-id deploy-20260506-001 # fast (interactive default) + $0 --deployment-id deploy-20260506-001 --yes # fast, no prompt + $0 --deployment-id deploy-20260506-001 --wait # CI-equivalent sync wait +EOF + exit 1 +} + +while [[ $# -gt 0 ]]; do + case "$1" in + --deployment-id) DEPLOYMENT_ID="$2"; shift 2 ;; + --yes) YES_FLAG="true"; shift ;; + --wait) WAIT_FLAG="true"; shift ;; + --poll-timeout) POLL_TIMEOUT="$2"; shift 2 ;; + -h|--help) usage ;; + *) echo "Unknown argument: $1"; usage ;; + esac +done + +[[ -n "$DEPLOYMENT_ID" ]] || usage + +DEPLOYMENT_PATH="$WORKSPACE_ROOT/$DEPLOYMENTS_DIR/$DEPLOYMENT_ID" +STATE_FILE="$DEPLOYMENT_PATH/state.json" + +if [[ ! -d "$DEPLOYMENT_PATH" ]]; then + echo -e "${RED}Deployment not found: $DEPLOYMENT_ID${NC}" + exit 1 +fi +if [[ ! 
-f "$STATE_FILE" ]]; then + echo -e "${RED}state.json not found: $STATE_FILE${NC}" + echo "Cannot destroy without deployment state." + exit 1 +fi + +STACK_ID=$(jq -r '.stackId // empty' "$STATE_FILE") +DEPLOY_METHOD=$(jq -r '.deployMethod // "subscription"' "$STATE_FILE") +RG_NAME=$(jq -r '.resourceGroup // empty' "$STATE_FILE") +MANAGED_RGS_JSON=$(jq -c '.resourceGroups // []' "$STATE_FILE") +MANAGED_RESOURCES=$(jq -c '.managedResources // []' "$STATE_FILE") +SOFT_DELETABLE=$(echo "$MANAGED_RESOURCES" | jq -c '[.[] | select(.softDeletable == true)]') + +if [[ -z "$STACK_ID" && -z "$RG_NAME" ]]; then + echo -e "${RED}No stackId or resourceGroup in state.json — cannot destroy.${NC}" + exit 1 +fi + +# Plan ----------------------------------------------------------------------- + +echo -e "${YELLOW}=== Destroy Plan ===${NC}" +echo "Deployment: $DEPLOYMENT_ID" +echo "Method: $DEPLOY_METHOD" +[[ -n "$STACK_ID" ]] && echo "Stack ID: $STACK_ID" +[[ -n "$RG_NAME" ]] && echo "Resource RG: $RG_NAME" + +SOFT_COUNT=$(echo "$SOFT_DELETABLE" | jq 'length') +if [[ "$SOFT_COUNT" -gt 0 ]]; then + echo "Soft-deletable: $SOFT_COUNT resource(s) — will purge non-protected after delete" + echo "$SOFT_DELETABLE" | jq -r '.[] | " - \(.type): \(.id)" + (if .purgeProtected then " (purge-protected)" else "" end)' +fi +echo -e "${YELLOW}====================${NC}" + +if [[ "$YES_FLAG" != "true" ]]; then + echo -n "Proceed with destroy? Type 'destroy' to confirm: " + read -r CONFIRM + if [[ "$CONFIRM" != "destroy" ]]; then + echo "Cancelled" + exit 0 + fi +fi + +# Execute -------------------------------------------------------------------- + +STACK_DELETED="false" +RG_DELETED="false" +ALREADY_GONE="true" +# Tracks whether a stack/RG delete command was actually invoked. Used to +# distinguish a partial failure (attempted but did not complete → +# partially-destroyed) from the catch-all destroy-failed, mirroring CI. 
+DELETE_ATTEMPTED="false" +START_TIME=$(date +%s) + +# Primary path: stack delete +# +# Two modes: +# --wait (sync, matches CI): az stack sub delete --yes (blocks until +# Azure has finished both resource deletion +# and stack-metadata cleanup; ~5 min for a +# small stack) +# default (fast, interactive): start the same command in the background, +# poll each managed RG with `az group exists` +# until it returns false (~90s for the same +# small stack), then return. Azure CLI does +# not expose --no-wait on `az stack sub +# delete`, so the slow stack-metadata cleanup +# finishes asynchronously after the script +# exits. +if [[ -n "$STACK_ID" ]]; then + STACK_EXISTS=$(az stack sub show --name "$DEPLOYMENT_ID" --query "id" -o tsv 2>/dev/null || echo "") + if [[ -n "$STACK_EXISTS" ]]; then + ALREADY_GONE="false" + DELETE_ATTEMPTED="true" + if [[ "$WAIT_FLAG" == "true" ]]; then + echo -e "${BLUE}🗑️ Deleting deployment stack (sync wait): $DEPLOYMENT_ID${NC}" + # --bypass-stack-out-of-sync-error: a destroy run is one-shot; we + # don't need the safety check that protects against stale manifests + # during iterative updates. 
+ if az stack sub delete \ + --name "$DEPLOYMENT_ID" \ + --action-on-unmanage deleteAll \ + --bypass-stack-out-of-sync-error true \ + --yes 2>&1; then + STACK_DELETED="true" + else + echo -e "${RED}❌ Stack delete failed${NC}" + fi + else + MANAGED_RG_COUNT=$(echo "$MANAGED_RGS_JSON" | jq 'length') + if [[ "$MANAGED_RG_COUNT" -eq 0 ]]; then + echo -e "${YELLOW}⚠️ No resourceGroups[] in state.json — falling back to sync wait${NC}" + if az stack sub delete \ + --name "$DEPLOYMENT_ID" \ + --action-on-unmanage deleteAll \ + --bypass-stack-out-of-sync-error true \ + --yes 2>&1; then + STACK_DELETED="true" + else + echo -e "${RED}❌ Stack delete failed${NC}" + fi + else + echo -e "${BLUE}🗑️ Submitting stack delete (fast mode): $DEPLOYMENT_ID${NC}" + STACK_DELETE_LOG=$(mktemp) + # Background the blocking stack delete; we exit as soon as the + # managed RGs are gone, leaving Azure to finish stack-metadata + # cleanup asynchronously. + nohup az stack sub delete \ + --name "$DEPLOYMENT_ID" \ + --action-on-unmanage deleteAll \ + --bypass-stack-out-of-sync-error true \ + --yes > "$STACK_DELETE_LOG" 2>&1 & + STACK_BG_PID=$! + # Do NOT disown — we need `wait` to retrieve the exit code. + # nohup already insulates against HUP signals. + + echo -e "${BLUE}⏳ Polling $MANAGED_RG_COUNT managed resource group(s) (timeout: ${POLL_TIMEOUT}s)...${NC}" + POLL_START=$(date +%s) + POLL_FAILED="false" + for RG in $(echo "$MANAGED_RGS_JSON" | jq -r '.[]'); do + while true; do + ELAPSED=$(($(date +%s) - POLL_START)) + if [[ $ELAPSED -ge $POLL_TIMEOUT ]]; then + echo -e "${RED} ⚠️ Timeout (${ELAPSED}s) polling $RG${NC}" + if [[ -s "$STACK_DELETE_LOG" ]]; then + echo -e "${YELLOW} Background stack-delete output:${NC}" + sed 's/^/ /' "$STACK_DELETE_LOG" + fi + echo -e "${YELLOW} Rerun with --wait for synchronous diagnostics${NC}" + POLL_FAILED="true" + break + fi + # If the bg process already failed, surface it early + if ! 
kill -0 "$STACK_BG_PID" 2>/dev/null; then + BG_EXIT=0 + wait "$STACK_BG_PID" 2>/dev/null || BG_EXIT=$? + if [[ $BG_EXIT -ne 0 ]]; then + EXISTS=$(az group exists --name "$RG" 2>/dev/null || echo "true") + if [[ "$EXISTS" == "true" ]]; then + echo -e "${RED} ❌ Background stack-delete exited (code $BG_EXIT) before $RG was removed${NC}" + if [[ -s "$STACK_DELETE_LOG" ]]; then + sed 's/^/ /' "$STACK_DELETE_LOG" + fi + POLL_FAILED="true" + break + fi + fi + fi + EXISTS=$(az group exists --name "$RG" 2>/dev/null || echo "false") + if [[ "$EXISTS" != "true" ]]; then + echo -e "${GREEN} ✓ $RG gone (${ELAPSED}s)${NC}" + break + fi + sleep "$POLL_INTERVAL" + done + [[ "$POLL_FAILED" == "true" ]] && break + done + rm -f "$STACK_DELETE_LOG" + if [[ "$POLL_FAILED" == "true" ]]; then + STACK_DELETED="false" + else + STACK_DELETED="true" + echo -e "${BLUE}ℹ️ Azure is finishing stack-metadata cleanup asynchronously${NC}" + fi + fi + fi + else + echo -e "${YELLOW}Stack not found for stackId in state.json — falling back to RG/state-driven delete${NC}" + STACK_DELETED="false" + STACK_ID="" + fi +fi + +# Fallback path: resource group delete (only when no stack was used) +if [[ -z "$STACK_ID" && -n "$RG_NAME" ]]; then + RG_EXISTS=$(az group exists --name "$RG_NAME" 2>/dev/null || echo "false") + if [[ "$RG_EXISTS" == "true" ]]; then + ALREADY_GONE="false" + DELETE_ATTEMPTED="true" + echo -e "${BLUE}🗑️ Deleting resource group: $RG_NAME${NC}" + if az group delete --name "$RG_NAME" --yes 2>&1; then + RG_DELETED="true" + else + echo -e "${RED}❌ Resource group delete failed${NC}" + fi + else + echo -e "${YELLOW}Resource group already gone — skipping${NC}" + RG_DELETED="true" + fi +fi + +# Soft-delete purge sweep +PURGE_RESULTS="[]" +RETAINED_COUNT=0 +if [[ "$SOFT_COUNT" -gt 0 ]] && [[ "$STACK_DELETED" == "true" || "$RG_DELETED" == "true" ]]; then + echo -e "${BLUE}🧹 Purging soft-deleted resources...${NC}" + for ROW in $(echo "$SOFT_DELETABLE" | jq -r '.[] | @base64'); do + DECODED=$(echo 
"$ROW" | base64 -d) + RES_TYPE=$(echo "$DECODED" | jq -r '.type') + RES_ID=$(echo "$DECODED" | jq -r '.id') + PURGE_PROTECTED=$(echo "$DECODED" | jq -r '.purgeProtected') + RES_NAME=$(echo "$RES_ID" | awk -F/ '{print $NF}') + + case "$RES_TYPE" in + "Microsoft.KeyVault/vaults") + DELETED_VAULT=$(az keyvault list-deleted --query "[?name=='$RES_NAME']" -o json 2>/dev/null || echo "[]") + if [[ "$(echo "$DELETED_VAULT" | jq 'length')" -gt 0 ]]; then + if [[ "$PURGE_PROTECTED" == "true" ]]; then + echo " ⚠️ $RES_NAME: soft-deleted but purge-protected — retained" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg n "$RES_NAME" --arg t "$RES_TYPE" \ + '. + [{name:$n, type:$t, action:"retained-soft-deleted", reason:"purge-protected"}]') + else + echo " 🗑️ Purging vault: $RES_NAME" + if az keyvault purge --name "$RES_NAME" 2>/dev/null; then + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg n "$RES_NAME" --arg t "$RES_TYPE" \ + '. + [{name:$n, type:$t, action:"purged"}]') + else + echo " ⚠️ Failed to purge vault: $RES_NAME" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg n "$RES_NAME" --arg t "$RES_TYPE" \ + '. + [{name:$n, type:$t, action:"purge-failed"}]') + fi + fi + else + echo " ✓ $RES_NAME: not in soft-deleted state" + fi + ;; + "Microsoft.CognitiveServices/accounts") + if [[ "$PURGE_PROTECTED" != "true" ]]; then + # Cognitive Services account IDs are resource-group scoped and + # contain no /locations/ segment, so the region must be + # resolved from the soft-deleted account list. The resource + # group comes from the original resource ID. 
+ LOC=$(az cognitiveservices account list-deleted \ + --query "[?name=='$RES_NAME'] | [0].location" -o tsv 2>/dev/null || echo "") + RES_RG=$(echo "$RES_ID" | sed -n 's#.*/resourceGroups/\([^/]*\)/.*#\1#p') + if [[ -n "$LOC" ]]; then + az cognitiveservices account purge --name "$RES_NAME" --location "$LOC" \ + --resource-group "$RES_RG" 2>/dev/null || true + fi + fi + ;; + *) + echo " ℹ️ $RES_TYPE: no purge implementation (soft-delete will expire naturally)" + ;; + esac + done +fi + +# Clean subscription deployment history entry to stay under the 800/scope limit +az deployment sub delete --name "$DEPLOYMENT_ID" 2>/dev/null || true + +END_TIME=$(date +%s) +DURATION=$((END_TIME - START_TIME)) + +# Determine final status +if [[ "$ALREADY_GONE" == "true" ]]; then + STATUS="already-destroyed" +elif [[ "$STACK_DELETED" == "true" || "$RG_DELETED" == "true" ]]; then + if [[ "$RETAINED_COUNT" -gt 0 ]]; then + STATUS="retained-soft-deleted" + else + STATUS="destroyed" + fi +elif [[ "$DELETE_ATTEMPTED" == "true" ]]; then + # A stack/RG existed and a delete was invoked, but it did not complete + # (e.g. fast-mode poll timeout or a failed delete command). Some resources + # may remain. Mirrors CI: stack/RG delete status == failed → + # partially-destroyed (distinct from the destroy-failed catch-all). + STATUS="partially-destroyed" +else + STATUS="destroy-failed" +fi + +# Update state.json + metadata.json +TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) +ACTOR=$(az account show --query user.name -o tsv 2>/dev/null || echo unknown) +jq --arg status "$STATUS" --arg ts "$TIMESTAMP" \ + --arg actor "$ACTOR" \ + --arg duration "${DURATION}s" \ + --argjson purgeResults "$PURGE_RESULTS" \ + '. 
+ {status:$status, destroyedAt:$ts, destroyedBy:$actor, destroyDuration:$duration, purgeResults:$purgeResults}' \ + "$STATE_FILE" > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE" + +if [[ -f "$DEPLOYMENT_PATH/metadata.json" ]]; then + jq --arg status "$STATUS" '.status = $status' \ + "$DEPLOYMENT_PATH/metadata.json" > "$DEPLOYMENT_PATH/metadata.json.tmp" \ + && mv "$DEPLOYMENT_PATH/metadata.json.tmp" "$DEPLOYMENT_PATH/metadata.json" +fi + +echo "" +echo -e "${GREEN}=== Destroy Summary ===${NC}" +echo "Status: $STATUS" +echo "Duration: ${DURATION}s" +if [[ "$RETAINED_COUNT" -gt 0 ]]; then + echo -e "${YELLOW}Retained: $RETAINED_COUNT soft-deleted resource(s) (purge-protected)${NC}" +fi +echo -e "${GREEN}=======================${NC}" diff --git a/.github/workflows/git-ape-deploy.exampleyml b/.github/workflows/git-ape-deploy.exampleyml index 48c6d71..59042df 100644 --- a/.github/workflows/git-ape-deploy.exampleyml +++ b/.github/workflows/git-ape-deploy.exampleyml @@ -197,11 +197,27 @@ jobs: - name: Validate before deploy run: | - az deployment sub validate \ + # Stack-aware validation — checks both the template and the + # stack-specific flags (--action-on-unmanage, --deny-settings-mode). + # If Deployment Stacks are unavailable/blocked in the target + # subscription, fall back to plain subscription validation so the + # deploy step's own legacy fallback path can still run. + if ! 
az stack sub validate \ + --name "${{ matrix.deployment_id }}" \ --location "${{ steps.params.outputs.location }}" \ --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ - --output json + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --output json; then + echo "::warning::Stack validation unavailable or failed — falling back to az deployment sub validate" + az deployment sub validate \ + --name "${{ matrix.deployment_id }}" \ + --location "${{ steps.params.outputs.location }}" \ + --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ + --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ + --output json + fi - name: Run Microsoft Defender for DevOps template analyzer id: security_scan @@ -240,18 +256,55 @@ jobs: echo "🚀 Starting deployment: ${{ matrix.deployment_id }}" START_TIME=$(date +%s) - DEPLOY_OUTPUT=$(az deployment sub create \ - --name "${{ matrix.deployment_id }}" \ - --location "${{ steps.params.outputs.location }}" \ - --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ - --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ - --output json 2>&1) - - EXIT_CODE=$? + DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}" + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + LOCATION="${{ steps.params.outputs.location }}" + + # Determine deploy method: prefer deployment stacks (idempotent destroy) + # Fall back to az deployment sub create if stacks are unavailable + DEPLOY_METHOD="stack" + # Verbose output goes to a temp file so it does not contaminate the + # JSON that downstream jq calls need to parse. 
+ VERBOSE_LOG=$(mktemp) + trap 'rm -f "$VERBOSE_LOG"' EXIT + + EXIT_CODE=0 + if DEPLOY_OUTPUT=$(az stack sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOY_DIR/template.json" \ + --parameters @"$DEPLOY_DIR/parameters.json" \ + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --description "Git-Ape deployment $DEPLOYMENT_ID" \ + --tags "managedBy=git-ape" "deploymentId=$DEPLOYMENT_ID" \ + --yes \ + --verbose \ + --output json 2>"$VERBOSE_LOG"); then + echo "Stack deploy succeeded" + else + echo "::warning::Stack deploy failed — falling back to az deployment sub create (NOT idempotent for soft-delete / multi-RG)" + cat "$VERBOSE_LOG" >&2 + DEPLOY_METHOD="subscription" + > "$VERBOSE_LOG" + if ! DEPLOY_OUTPUT=$(az deployment sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOY_DIR/template.json" \ + --parameters @"$DEPLOY_DIR/parameters.json" \ + --output json 2>"$VERBOSE_LOG"); then + cat "$VERBOSE_LOG" >&2 + EXIT_CODE=1 + fi + fi + if [[ $EXIT_CODE -ne 0 ]]; then + cat "$VERBOSE_LOG" >&2 + fi END_TIME=$(date +%s) DURATION=$((END_TIME - START_TIME)) echo "deploy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" + echo "deploy_method=$DEPLOY_METHOD" >> "$GITHUB_OUTPUT" if [[ $EXIT_CODE -ne 0 ]]; then echo "deploy_status=failed" >> "$GITHUB_OUTPUT" @@ -264,14 +317,38 @@ jobs: echo "==========================================" echo "$DEPLOY_OUTPUT" echo "==========================================" + + # Surface underlying failed operations — the stack/deployment top-level + # error is usually a summary; the real root cause lives in the per-resource + # operations list. + echo "::group::Underlying failed operations" + az deployment sub show --name "$DEPLOYMENT_ID" --output json 2>/dev/null \ + | jq -r '.properties // {}' \ + || echo "No subscription-scope deployment details available." 
+ az deployment operation sub list --name "$DEPLOYMENT_ID" --output json 2>/dev/null \ + | jq -r '.[] | select(.properties.provisioningState == "Failed") | + "──────────\nResource : \(.properties.targetResource.resourceName // "n/a") (\(.properties.targetResource.resourceType // "n/a"))\nStatus : \(.properties.statusCode // "n/a")\nMessage : \(.properties.statusMessage.error.message // .properties.statusMessage // "n/a")"' \ + || echo "No per-operation details available (deployment may not have reached Azure)." + echo "::endgroup::" + echo "::error::Deployment failed — see output above for details" exit 1 fi echo "deploy_status=succeeded" >> "$GITHUB_OUTPUT" - # Extract outputs - OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.properties.outputs // {}') + # Extract outputs depending on deploy method + if [[ "$DEPLOY_METHOD" == "stack" ]]; then + # For stacks, extract the stack ID + STACK_ID=$(echo "$DEPLOY_OUTPUT" | jq -r '.id // empty') + echo "stack_id=$STACK_ID" >> "$GITHUB_OUTPUT" + + # Extract outputs from the stack's deployment + OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.outputs // {}') + else + OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.properties.outputs // {}') + fi + echo "deploy_outputs<<EOF" >> "$GITHUB_OUTPUT" echo "$OUTPUTS" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" @@ -280,7 +357,109 @@ jobs: RG_NAME=$(echo "$OUTPUTS" | jq -r '.resourceGroupName.value // empty') echo "resource_group=$RG_NAME" >> "$GITHUB_OUTPUT" - echo "✅ Deployment succeeded in ${DURATION}s" + echo "✅ Deployment succeeded in ${DURATION}s (method: $DEPLOY_METHOD)" + + - name: Capture managed resources + id: capture + if: steps.deploy.outputs.deploy_status == 'succeeded' + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + RG_NAME="${{ steps.deploy.outputs.resource_group }}" + STACK_ID="${{ steps.deploy.outputs.stack_id }}" + + # Known soft-deletable resource types + SOFT_DELETABLE_TYPES="Microsoft.KeyVault/vaults 
Microsoft.CognitiveServices/accounts Microsoft.AppConfiguration/configurationStores Microsoft.ApiManagement/service Microsoft.MachineLearningServices/workspaces Microsoft.RecoveryServices/vaults" + + MANAGED_RESOURCES="[]" + RESOURCE_GROUPS="[]" + + if [[ "$DEPLOY_METHOD" == "stack" && -n "$STACK_ID" ]]; then + # Stacks natively track all managed resources + STACK_RESOURCES=$(az stack sub show \ + --name "$DEPLOYMENT_ID" \ + --query "resources[].id" \ + -o json 2>/dev/null || echo "[]") + + # Build managedResources array from stack resources + for RES_ID in $(echo "$STACK_RESOURCES" | jq -r '.[]' 2>/dev/null); do + RES_TYPE=$(echo "$RES_ID" | grep -oP 'providers/\K[^/]+/[^/]+' | tail -1) + RES_SCOPE="resourceGroup" + if echo "$RES_ID" | grep -q "/resourceGroups/"; then + RES_SCOPE="resourceGroup" + else + RES_SCOPE="subscription" + fi + + IS_SOFT_DELETABLE="false" + IS_PURGE_PROTECTED="false" + for SD_TYPE in $SOFT_DELETABLE_TYPES; do + if [[ "$RES_TYPE" == "$SD_TYPE" ]]; then + IS_SOFT_DELETABLE="true" + # Query actual purge protection status for soft-deletable resources + IS_PURGE_PROTECTED=$(az resource show --ids "$RES_ID" \ + --query "properties.enablePurgeProtection" -o tsv 2>/dev/null || echo "false") + [[ "$IS_PURGE_PROTECTED" == "true" ]] || IS_PURGE_PROTECTED="false" + break + fi + done + + MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \ + --arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" --argjson pp "$IS_PURGE_PROTECTED" \ + '. 
+ [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": $pp}]') + done + + # Extract resource groups from managed resources + RESOURCE_GROUPS=$(echo "$MANAGED_RESOURCES" | jq -c '[.[].id | select(test("/resourceGroups/")) | capture("/resourceGroups/(?<rg>[^/]+)") | .rg] | unique') + else + # Fallback: walk deployment operations recursively + OPS=$(az deployment operation sub list \ + --name "$DEPLOYMENT_ID" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + -o json 2>/dev/null || echo "[]") + + for RES_ID in $(echo "$OPS" | jq -r '.[].id // empty' 2>/dev/null); do + RES_TYPE=$(echo "$OPS" | jq -r ".[] | select(.id == \"$RES_ID\") | .resourceType // empty") + RES_SCOPE="resourceGroup" + if echo "$RES_ID" | grep -q "/resourceGroups/"; then + RES_SCOPE="resourceGroup" + else + RES_SCOPE="subscription" + fi + + IS_SOFT_DELETABLE="false" + IS_PURGE_PROTECTED="false" + for SD_TYPE in $SOFT_DELETABLE_TYPES; do + if [[ "$RES_TYPE" == "$SD_TYPE" ]]; then + IS_SOFT_DELETABLE="true" + # Query actual purge protection status for soft-deletable resources + IS_PURGE_PROTECTED=$(az resource show --ids "$RES_ID" \ + --query "properties.enablePurgeProtection" -o tsv 2>/dev/null || echo "false") + [[ "$IS_PURGE_PROTECTED" == "true" ]] || IS_PURGE_PROTECTED="false" + break + fi + done + + MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \ + --arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" --argjson pp "$IS_PURGE_PROTECTED" \ + '. 
+ [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": $pp}]') + done + + # Collect resource groups + if [[ -n "$RG_NAME" ]]; then + RESOURCE_GROUPS="[\"$RG_NAME\"]" + fi + fi + + echo "managed_resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$MANAGED_RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "resource_groups<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCE_GROUPS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + RESOURCE_COUNT=$(echo "$MANAGED_RESOURCES" | jq 'length') + echo "📋 Captured $RESOURCE_COUNT managed resources" - name: Run integration tests id: tests @@ -349,25 +528,62 @@ jobs: DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}" STATUS="${{ steps.deploy.outputs.deploy_status || 'failed' }}" TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + STACK_ID="${{ steps.deploy.outputs.stack_id }}" + MANAGED_RESOURCES='${{ steps.capture.outputs.managed_resources }}' + RESOURCE_GROUPS='${{ steps.capture.outputs.resource_groups }}' + + # Ensure managed resources and resource groups are valid JSON + if ! echo "$MANAGED_RESOURCES" | jq empty 2>/dev/null; then + MANAGED_RESOURCES="[]" + fi + if ! 
echo "$RESOURCE_GROUPS" | jq empty 2>/dev/null; then + RESOURCE_GROUPS="[]" + fi - # Create/update state.json - cat > "$DEPLOY_DIR/state.json" < "$DEPLOY_DIR/state.json" - name: Commit deployment state if: always() @@ -376,9 +592,13 @@ jobs: STATUS="${{ steps.deploy.outputs.deploy_status }}" STATUS=${STATUS:-failed} - # Update metadata.json status from pending to actual result + # Update metadata.json status from pending to actual result, add deployMethod and resourceGroups if [[ -f "$DEPLOY_DIR/metadata.json" ]]; then - jq --arg status "$STATUS" '.status = $status' \ + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + DEPLOY_METHOD=${DEPLOY_METHOD:-subscription} + RG_NAME="${{ steps.deploy.outputs.resource_group }}" + jq --arg status "$STATUS" --arg method "$DEPLOY_METHOD" --arg rg "$RG_NAME" \ + '.status = $status | .deployMethod = $method | .resourceGroups = (if $rg == "" then [] else [$rg] end)' \ "$DEPLOY_DIR/metadata.json" > "$DEPLOY_DIR/metadata.json.tmp" \ && mv "$DEPLOY_DIR/metadata.json.tmp" "$DEPLOY_DIR/metadata.json" fi @@ -405,15 +625,26 @@ jobs: - name: Post deployment result if: always() && github.event_name == 'issue_comment' uses: actions/github-script@v8 + env: + # Pass all repo-controlled / command-derived values via env so they are + # read with process.env and never interpolated into the script body + # (prevents JavaScript injection from crafted values). 
+ DEPLOYMENT_ID: ${{ matrix.deployment_id }} + DEPLOY_STATUS: ${{ steps.deploy.outputs.deploy_status }} + DEPLOY_DURATION: ${{ steps.deploy.outputs.deploy_duration }} + DEPLOY_OUTPUTS: ${{ steps.deploy.outputs.deploy_outputs }} + DEPLOY_RESOURCES: ${{ steps.tests.outputs.resources }} + TEST_ENDPOINTS: ${{ steps.tests.outputs.test_endpoints }} + DEPLOY_ERROR: ${{ steps.deploy.outputs.deploy_error }} with: script: | - const deploymentId = '${{ matrix.deployment_id }}'; - const status = '${{ steps.deploy.outputs.deploy_status }}' || 'failed'; - const duration = '${{ steps.deploy.outputs.deploy_duration }}'; - const outputs = `${{ steps.deploy.outputs.deploy_outputs }}`; - const resources = `${{ steps.tests.outputs.resources }}`; - const testEndpoints = `${{ steps.tests.outputs.test_endpoints }}`; - const runUrl = `${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`; + const deploymentId = process.env.DEPLOYMENT_ID; + const status = process.env.DEPLOY_STATUS || 'failed'; + const duration = process.env.DEPLOY_DURATION; + const outputs = process.env.DEPLOY_OUTPUTS || ''; + const resources = process.env.DEPLOY_RESOURCES || ''; + const testEndpoints = process.env.TEST_ENDPOINTS || ''; + const runUrl = `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`; let comment = `## Git-Ape Deploy: \`${deploymentId}\`\n\n`; @@ -441,7 +672,7 @@ jobs: } else { comment += `### ❌ Deployment Failed\n\n`; comment += `- **Workflow Run:** [View logs](${runUrl})\n\n`; - const error = `${{ steps.deploy.outputs.deploy_error }}`; + const error = process.env.DEPLOY_ERROR || ''; if (error) { comment += `\`\`\`\n${error.substring(0, 2000)}\n\`\`\`\n\n`; } diff --git a/.github/workflows/git-ape-destroy.exampleyml b/.github/workflows/git-ape-destroy.exampleyml index 1afc7ae..a6bc3b8 100644 --- a/.github/workflows/git-ape-destroy.exampleyml +++ b/.github/workflows/git-ape-destroy.exampleyml @@ -51,19 +51,31 @@ jobs: - name: 
Find destroy-requested deployments id: find + env: + # Pass dispatch inputs via env so they are never expanded by the shell + # (prevents command injection from crafted workflow_dispatch values). + INPUT_CONFIRM: ${{ inputs.confirm }} + INPUT_DEPLOYMENT_ID: ${{ inputs.deployment_id }} run: | if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then - CONFIRM="${{ inputs.confirm }}" - if [[ "$CONFIRM" != "destroy" ]]; then + if [[ "$INPUT_CONFIRM" != "destroy" ]]; then echo "::error::Confirmation must be 'destroy'" echo "has_destroys=false" >> "$GITHUB_OUTPUT" echo "deployment_ids=[]" >> "$GITHUB_OUTPUT" exit 1 fi - DEPLOYMENT_IDS='["${{ inputs.deployment_id }}"]' + # Validate the deployment ID against the allowed charset for + # deployment directory names before using it anywhere. + if [[ ! "$INPUT_DEPLOYMENT_ID" =~ ^[A-Za-z0-9._-]+$ ]]; then + echo "::error::Invalid deployment_id (allowed: A-Z a-z 0-9 . _ -)" + echo "has_destroys=false" >> "$GITHUB_OUTPUT" + echo "deployment_ids=[]" >> "$GITHUB_OUTPUT" + exit 1 + fi + DEPLOYMENT_IDS=$(jq -cn --arg id "$INPUT_DEPLOYMENT_ID" '[$id]') echo "has_destroys=true" >> "$GITHUB_OUTPUT" echo "deployment_ids=$DEPLOYMENT_IDS" >> "$GITHUB_OUTPUT" - echo "Manual destroy requested: ${{ inputs.deployment_id }}" + echo "Manual destroy requested: $INPUT_DEPLOYMENT_ID" exit 0 fi @@ -132,16 +144,34 @@ jobs: fi RG_NAME=$(jq -r '.resourceGroup // empty' "$STATE_FILE") - - if [[ -z "$RG_NAME" ]]; then - echo "::error::No resource group found in state file" + STACK_ID=$(jq -r '.stackId // empty' "$STATE_FILE") + DEPLOY_METHOD=$(jq -r '.deployMethod // "subscription"' "$STATE_FILE") + MANAGED_RESOURCES=$(jq -c '.managedResources // []' "$STATE_FILE") + RESOURCE_GROUPS=$(jq -c '.resourceGroups // []' "$STATE_FILE") + + # Fallback: if no stackId and no resourceGroup, cannot proceed + if [[ -z "$STACK_ID" && -z "$RG_NAME" ]]; then + echo "::error::No stack ID or resource group found in state file" echo "found=false" >> "$GITHUB_OUTPUT" 
exit 1 fi echo "found=true" >> "$GITHUB_OUTPUT" echo "resource_group=$RG_NAME" >> "$GITHUB_OUTPUT" - echo "Will destroy resource group: $RG_NAME" + echo "stack_id=$STACK_ID" >> "$GITHUB_OUTPUT" + echo "deploy_method=$DEPLOY_METHOD" >> "$GITHUB_OUTPUT" + echo "managed_resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$MANAGED_RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "resource_groups<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCE_GROUPS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + if [[ -n "$STACK_ID" ]]; then + echo "Will destroy via deployment stack: $STACK_ID" + else + echo "Will destroy resource group: $RG_NAME (fallback method)" + fi - name: Azure Login (OIDC) if: steps.state.outputs.found == 'true' @@ -157,111 +187,181 @@ jobs: run: | RG="${{ steps.state.outputs.resource_group }}" DEPLOYMENT_ID="${{ matrix.deployment_id }}" + STACK_ID="${{ steps.state.outputs.stack_id }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" - # Check if resource group exists - EXISTS=$(az group exists --name "$RG") - echo "exists=$EXISTS" >> "$GITHUB_OUTPUT" - - if [[ "$EXISTS" != "true" ]]; then - echo "Resource group $RG does not exist (already deleted?)" - echo "resource_count=0" >> "$GITHUB_OUTPUT" - echo "sub_count=0" >> "$GITHUB_OUTPUT" - exit 0 + echo "=== Destroy Plan ===" + echo "Deployment: $DEPLOYMENT_ID" + echo "Method: $DEPLOY_METHOD" + + if [[ -n "$STACK_ID" ]]; then + # Check if stack still exists + STACK_EXISTS=$(az stack sub show --name "$DEPLOYMENT_ID" --query "id" -o tsv 2>/dev/null || echo "") + if [[ -n "$STACK_EXISTS" ]]; then + echo "stack_exists=true" >> "$GITHUB_OUTPUT" + echo "Stack: $STACK_ID (exists)" + + # List resources in the stack + STACK_RESOURCES=$(az stack sub show --name "$DEPLOYMENT_ID" --query "resources[].id" -o json 2>/dev/null || echo "[]") + RESOURCE_COUNT=$(echo "$STACK_RESOURCES" | jq 'length') + echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" + echo "Resources: $RESOURCE_COUNT managed by stack" + 
else + echo "stack_exists=false" >> "$GITHUB_OUTPUT" + echo "Stack not found — will use fallback" + echo "resource_count=0" >> "$GITHUB_OUTPUT" + fi + else + echo "stack_exists=false" >> "$GITHUB_OUTPUT" fi - # Inventory RG resources - RESOURCES=$(az resource list --resource-group "$RG" \ - --query "[].{name:name, type:type, id:id, provisioningState:provisioningState}" \ - --output json 2>/dev/null || echo "[]") - RESOURCE_COUNT=$(echo "$RESOURCES" | jq 'length') + # Check resource group existence (for fallback or soft-delete sweep) + if [[ -n "$RG" ]]; then + EXISTS=$(az group exists --name "$RG") + echo "rg_exists=$EXISTS" >> "$GITHUB_OUTPUT" + echo "RG: $RG (exists=$EXISTS)" + + if [[ "$EXISTS" == "true" ]]; then + RESOURCES=$(az resource list --resource-group "$RG" \ + --query "[].{name:name, type:type, id:id, provisioningState:provisioningState}" \ + --output json 2>/dev/null || echo "[]") + RESOURCE_COUNT=$(echo "$RESOURCES" | jq 'length') + # Only set resource_count if stack_exists is false (avoid overwrite) + if [[ "$STACK_ID" == "" ]]; then + echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" + fi + echo "resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCES" | jq -r '.[] | " - \(.type)/\(.name) (\(.provisioningState))"' + fi + else + echo "rg_exists=false" >> "$GITHUB_OUTPUT" + fi - echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" - echo "resources<<EOF" >> "$GITHUB_OUTPUT" - echo "$RESOURCES" >> "$GITHUB_OUTPUT" + # Identify soft-deletable resources from state + MANAGED_RESOURCES='${{ steps.state.outputs.managed_resources }}' + SOFT_DELETABLE=$(echo "$MANAGED_RESOURCES" | jq -c '[.[] | select(.softDeletable == true)]' 2>/dev/null || echo "[]") + SOFT_COUNT=$(echo "$SOFT_DELETABLE" | jq 'length') + echo "soft_deletable<<EOF" >> "$GITHUB_OUTPUT" + echo "$SOFT_DELETABLE" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" + echo "soft_count=$SOFT_COUNT" >> "$GITHUB_OUTPUT" - echo "Resource 
group $RG has $RESOURCE_COUNT resources" - echo "$RESOURCES" | jq -r '.[] | " - \(.type)/\(.name) (\(.provisioningState))"' + if [[ "$SOFT_COUNT" -gt 0 ]]; then + echo "Soft-deletable: $SOFT_COUNT resource(s) — will attempt purge after deletion" + echo "$SOFT_DELETABLE" | jq -r '.[] | " - \(.type): \(.id)"' + fi - # Query deployment operations to find subscription-scoped resources - # These are NOT deleted by az group delete (e.g. role assignments, policy assignments) + # Query subscription-scoped resources (for fallback only) SUB_RESOURCES="[]" - - OPS=$(az deployment operation sub list \ - --name "$DEPLOYMENT_ID" \ - --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ - -o json 2>/dev/null || echo "[]") - - if [[ "$OPS" != "[]" ]]; then - # Find subscription-scoped authorization/policy resources (role assignments, etc.) - # These live outside the RG and survive az group delete - SUB_RESOURCES=$(echo "$OPS" | jq -c '[ - .[] | select( - (.resourceType // "" | test("Microsoft.Authorization|Microsoft.Policy")) and - (.id // "" | test("/resourceGroups/") | not) - ) - ]') - - # Check nested deployments for RG-scoped role assignments too - NESTED_NAMES=$(echo "$OPS" | jq -r '[ - .[] | select(.resourceType == "Microsoft.Resources/deployments") - ] | .[].resourceName // empty') - - for NESTED_NAME in $NESTED_NAMES; do - NESTED_OPS=$(az deployment operation group list \ - --resource-group "$RG" --name "$NESTED_NAME" \ - --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ - -o json 2>/dev/null || echo "[]") - - # Role assignments scoped to resources within the RG - NESTED_AUTH=$(echo "$NESTED_OPS" | jq -c '[ + if [[ -z "$STACK_ID" ]]; then + OPS=$(az deployment operation sub list \ + --name "$DEPLOYMENT_ID" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + 
-o json 2>/dev/null || echo "[]") + + if [[ "$OPS" != "[]" ]]; then + SUB_RESOURCES=$(echo "$OPS" | jq -c '[ .[] | select( - (.resourceType // "" | test("Microsoft.Authorization")) + (.resourceType // "" | test("Microsoft.Authorization|Microsoft.Policy")) and + (.id // "" | test("/resourceGroups/") | not) ) ]') - SUB_RESOURCES=$(jq -n --argjson a "$SUB_RESOURCES" --argjson b "$NESTED_AUTH" '$a + $b') - done + NESTED_NAMES=$(echo "$OPS" | jq -r '[ + .[] | select(.resourceType == "Microsoft.Resources/deployments") + ] | .[].resourceName // empty') + + for NESTED_NAME in $NESTED_NAMES; do + NESTED_OPS=$(az deployment operation group list \ + --resource-group "$RG" --name "$NESTED_NAME" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + -o json 2>/dev/null || echo "[]") + + NESTED_AUTH=$(echo "$NESTED_OPS" | jq -c '[ + .[] | select( + (.resourceType // "" | test("Microsoft.Authorization")) + ) + ]') + + SUB_RESOURCES=$(jq -n --argjson a "$SUB_RESOURCES" --argjson b "$NESTED_AUTH" '$a + $b') + done + fi fi SUB_COUNT=$(echo "$SUB_RESOURCES" | jq 'length') - echo "sub_count=$SUB_COUNT" >> "$GITHUB_OUTPUT" echo "sub_resources<<EOF" >> "$GITHUB_OUTPUT" echo "$SUB_RESOURCES" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" - echo "" - echo "=== Destroy Plan ===" - echo "Resource group: $RG ($RESOURCE_COUNT resources)" - echo "Subscription-scoped resources: $SUB_COUNT" if [[ "$SUB_COUNT" -gt 0 ]]; then + echo "Sub-scoped: $SUB_COUNT resource(s)" echo "$SUB_RESOURCES" | jq -r '.[] | " - \(.resourceType): \(.resourceName) (\(.id))"' fi echo "===================" - name: Delete subscription-scoped resources + - name: Destroy via deployment stack + id: destroy_stack + if: steps.state.outputs.found == 'true' && steps.check.outputs.stack_exists == 'true' + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + echo "🗑️ Deleting deployment stack: $DEPLOYMENT_ID" + echo "This deletes the stack and ALL managed 
resources (deleteAll)..." + + START_TIME=$(date +%s) + + az stack sub delete \ + --name "$DEPLOYMENT_ID" \ + --action-on-unmanage deleteAll \ + --bypass-stack-out-of-sync-error true \ + --yes 2>&1 || { + echo "destroy_status=failed" >> "$GITHUB_OUTPUT" + echo "::error::Failed to delete deployment stack $DEPLOYMENT_ID" + exit 1 + } + + END_TIME=$(date +%s) + DURATION=$((END_TIME - START_TIME)) + echo "destroy_status=succeeded" >> "$GITHUB_OUTPUT" + echo "destroy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" + echo "✅ Deployment stack deleted in ${DURATION}s" + + - name: Delete subscription-scoped resources (fallback) id: destroy_sub - if: steps.check.outputs.exists == 'true' && steps.check.outputs.sub_count != '0' + if: | + steps.state.outputs.found == 'true' && + steps.check.outputs.stack_exists != 'true' && + steps.check.outputs.rg_exists == 'true' && + steps.check.outputs.sub_count != '0' + env: + SUB_RESOURCES: ${{ steps.check.outputs.sub_resources }} run: | echo "🗑️ Deleting subscription-scoped resources first..." FAILED=0 - echo '${{ steps.check.outputs.sub_resources }}' | jq -r '.[].id' | while read -r RESOURCE_ID; do + # Use process substitution so the FAILED counter survives. A piped + # `... | while read` would run the loop body in a subshell, and the + # incremented counter would be lost when the subshell exits. + while read -r RESOURCE_ID; do echo " Deleting: $RESOURCE_ID" if ! 
az resource delete --ids "$RESOURCE_ID" 2>&1; then echo "::warning::Failed to delete $RESOURCE_ID" FAILED=$((FAILED + 1)) fi - done + done < <(echo "$SUB_RESOURCES" | jq -r '.[].id') if [[ "$FAILED" -gt 0 ]]; then echo "::warning::$FAILED subscription-scoped resource(s) failed to delete" fi - - name: Delete resource group - id: destroy - if: steps.check.outputs.exists == 'true' + - name: Delete resource group (fallback) + id: destroy_rg + if: | + steps.state.outputs.found == 'true' && + steps.check.outputs.stack_exists != 'true' && + steps.check.outputs.rg_exists == 'true' run: | RG="${{ steps.state.outputs.resource_group }}" echo "🗑️ Deleting resource group: $RG" @@ -281,6 +381,96 @@ jobs: echo "destroy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" echo "✅ Resource group deleted in ${DURATION}s: $RG" + - name: Purge soft-deleted resources + id: purge + if: | + always() && + steps.state.outputs.found == 'true' && + steps.check.outputs.soft_count != '0' && + (steps.destroy_stack.outputs.destroy_status == 'succeeded' || steps.destroy_rg.outputs.destroy_status == 'succeeded') + run: | + echo "🧹 Checking for soft-deleted resources to purge..." 
+ SOFT_DELETABLE='${{ steps.check.outputs.soft_deletable }}' + PURGE_RESULTS="[]" + RETAINED_COUNT=0 + + for ROW in $(echo "$SOFT_DELETABLE" | jq -r '.[] | @base64'); do + DECODED=$(echo "$ROW" | base64 -d) + RES_TYPE=$(echo "$DECODED" | jq -r '.type') + RES_ID=$(echo "$DECODED" | jq -r '.id') + PURGE_PROTECTED=$(echo "$DECODED" | jq -r '.purgeProtected') + + # Extract resource name from ID + RES_NAME=$(echo "$RES_ID" | grep -oP '[^/]+$') + + case "$RES_TYPE" in + "Microsoft.KeyVault/vaults") + # Check if vault is in soft-deleted state + DELETED_VAULT=$(az keyvault list-deleted --query "[?name=='$RES_NAME']" -o json 2>/dev/null || echo "[]") + if [[ $(echo "$DELETED_VAULT" | jq 'length') -gt 0 ]]; then + if [[ "$PURGE_PROTECTED" == "true" ]]; then + echo " ⚠️ $RES_NAME: soft-deleted but purge-protected — cannot purge" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "retained-soft-deleted", "reason": "purge-protected"}]') + else + echo " 🗑️ Purging soft-deleted vault: $RES_NAME" + if az keyvault purge --name "$RES_NAME" 2>/dev/null; then + echo " ✅ Purged: $RES_NAME" + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "purged"}]') + else + echo " ⚠️ Failed to purge: $RES_NAME" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "purge-failed"}]') + fi + fi + else + echo " ✅ $RES_NAME: not in soft-deleted state (already gone)" + fi + ;; + "Microsoft.CognitiveServices/accounts") + # Cognitive Services soft-delete purge. + # Account IDs are resource-group scoped (no /locations/ + # segment), so resolve the region from the soft-deleted account + # list and the resource group from the original resource ID. 
+ if [[ "$PURGE_PROTECTED" != "true" ]]; then + LOCATION=$(az cognitiveservices account list-deleted \ + --query "[?name=='$RES_NAME'] | [0].location" -o tsv 2>/dev/null || echo "") + RES_RG=$(echo "$RES_ID" | sed -n 's#.*/resourceGroups/\([^/]*\)/.*#\1#p') + if [[ -n "$LOCATION" ]]; then + az cognitiveservices account purge --name "$RES_NAME" --location "$LOCATION" \ + --resource-group "$RES_RG" 2>/dev/null || true + fi + fi + ;; + *) + echo " ℹ️ $RES_TYPE: no purge implementation (soft-delete will expire naturally)" + ;; + esac + done + + echo "retained_count=$RETAINED_COUNT" >> "$GITHUB_OUTPUT" + echo "purge_results<<EOF" >> "$GITHUB_OUTPUT" + echo "$PURGE_RESULTS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + if [[ "$RETAINED_COUNT" -gt 0 ]]; then + echo "⚠️ $RETAINED_COUNT resource(s) retained in soft-deleted state (purge-protected)" + fi + + - name: Clean deployment history + if: | + always() && + steps.state.outputs.found == 'true' && + (steps.destroy_stack.outputs.destroy_status == 'succeeded' || steps.destroy_rg.outputs.destroy_status == 'succeeded') + continue-on-error: true + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + echo "🧹 Cleaning subscription deployment history entry: $DEPLOYMENT_ID" + az deployment sub delete --name "$DEPLOYMENT_ID" 2>/dev/null || true + - name: Update deployment state if: always() && steps.state.outputs.found == 'true' run: | @@ -289,19 +479,40 @@ jobs: STATE_FILE="$DEPLOY_DIR/state.json" TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) - if [[ "${{ steps.check.outputs.exists }}" == "false" ]]; then + # Determine final status based on which destroy path ran + STACK_EXISTS="${{ steps.check.outputs.stack_exists }}" + RG_EXISTS="${{ steps.check.outputs.rg_exists }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + RETAINED_COUNT="${{ steps.purge.outputs.retained_count }}" + + if [[ "$STACK_EXISTS" != "true" && "$RG_EXISTS" != "true" ]]; then 
STATUS="already-destroyed" - elif [[ "${{ steps.destroy.outputs.destroy_status }}" == "succeeded" ]]; then - STATUS="destroyed" + elif [[ "$STACK_STATUS" == "succeeded" || "$RG_STATUS" == "succeeded" ]]; then + if [[ "${RETAINED_COUNT:-0}" -gt 0 ]]; then + STATUS="retained-soft-deleted" + else + STATUS="destroyed" + fi + elif [[ "$STACK_STATUS" == "failed" || "$RG_STATUS" == "failed" ]]; then + STATUS="partially-destroyed" else STATUS="destroy-failed" fi + # Determine duration from whichever path ran + DURATION="${{ steps.destroy_stack.outputs.destroy_duration }}" + if [[ -z "$DURATION" ]]; then + DURATION="${{ steps.destroy_rg.outputs.destroy_duration }}" + fi + # Update state file if [[ -f "$STATE_FILE" ]]; then jq --arg status "$STATUS" --arg ts "$TIMESTAMP" --arg actor "${{ github.actor }}" \ - --arg duration "${{ steps.destroy.outputs.destroy_duration }}" \ - '. + {status: $status, destroyedAt: $ts, destroyedBy: $actor, destroyDuration: $duration}' \ + --arg duration "$DURATION" \ + --arg purgeResults '${{ steps.purge.outputs.purge_results }}' \ + '. + {status: $status, destroyedAt: $ts, destroyedBy: $actor, destroyDuration: $duration} | + if ($purgeResults | length) > 0 then . + {purgeResults: ($purgeResults | fromjson? // [])} else . 
end' \ "$STATE_FILE" > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE" fi @@ -323,26 +534,48 @@ jobs: run: | DEPLOY_ID="${{ matrix.deployment_id }}" RG="${{ steps.state.outputs.resource_group }}" - STATUS="${{ steps.destroy.outputs.destroy_status }}" - DURATION="${{ steps.destroy.outputs.destroy_duration }}" + STACK_EXISTS="${{ steps.check.outputs.stack_exists }}" + RG_EXISTS="${{ steps.check.outputs.rg_exists }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + STACK_DURATION="${{ steps.destroy_stack.outputs.destroy_duration }}" + RG_DURATION="${{ steps.destroy_rg.outputs.destroy_duration }}" RESOURCE_COUNT="${{ steps.check.outputs.resource_count }}" SUB_COUNT="${{ steps.check.outputs.sub_count }}" - EXISTS="${{ steps.check.outputs.exists }}" + SOFT_COUNT="${{ steps.check.outputs.soft_count }}" + RETAINED_COUNT="${{ steps.purge.outputs.retained_count }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" echo "============================================" echo "Git-Ape Destroy Summary" echo "============================================" echo "Deployment: $DEPLOY_ID" + echo "Method: $DEPLOY_METHOD" echo "Resource Group: $RG" - if [[ "$EXISTS" == "false" ]]; then + + if [[ "$STACK_EXISTS" == "true" ]]; then + if [[ "$STACK_STATUS" == "succeeded" ]]; then + echo "Result: ✅ Stack destroyed ($RESOURCE_COUNT resources via deleteAll)" + echo "Duration: $STACK_DURATION" + else + echo "Result: ❌ Stack delete failed" + fi + elif [[ "$RG_EXISTS" != "true" && "$STACK_EXISTS" != "true" ]]; then echo "Result: Already destroyed" - elif [[ "$STATUS" == "succeeded" ]]; then + elif [[ "$RG_STATUS" == "succeeded" ]]; then echo "Result: ✅ Destroyed ($RESOURCE_COUNT RG resources + $SUB_COUNT subscription-scoped)" - echo "Duration: $DURATION" + echo "Duration: $RG_DURATION" else echo "Result: ❌ 
Failed" fi + + if [[ "${RETAINED_COUNT:-0}" -gt 0 ]]; then + echo "Soft-deleted: ⚠️ $RETAINED_COUNT resource(s) retained (purge-protected)" + elif [[ "${SOFT_COUNT:-0}" -gt 0 ]]; then + echo "Soft-deleted: ✅ All soft-deleted resources purged" + fi + echo "Run: $RUN_URL" echo "============================================" @@ -356,15 +589,17 @@ jobs: DEPLOY_ID="${{ matrix.deployment_id }}" RG="${{ steps.state.outputs.resource_group }}" - STATUS="${{ steps.destroy.outputs.destroy_status }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" - if [[ "$STATUS" == "succeeded" ]]; then + if [[ "$STACK_STATUS" == "succeeded" || "$RG_STATUS" == "succeeded" ]]; then EMOJI="🗑️" - MSG="Resource group *$RG* ($DEPLOY_ID) destroyed" + MSG="Deployment *$DEPLOY_ID* destroyed (method: $DEPLOY_METHOD)" else EMOJI="❌" - MSG="Destroy failed for *$RG* ($DEPLOY_ID)" + MSG="Destroy failed for *$DEPLOY_ID* (method: $DEPLOY_METHOD)" fi curl -sf -X POST "$SLACK_WEBHOOK_URL" \ diff --git a/website/docs/agents/azure-resource-deployer.md b/website/docs/agents/azure-resource-deployer.md index c0cbcc6..c8b8b4b 100644 --- a/website/docs/agents/azure-resource-deployer.md +++ b/website/docs/agents/azure-resource-deployer.md @@ -121,33 +121,47 @@ Before deploying, verify: ### 2. Execute Deployment -Use Azure MCP `deploy` service or Azure CLI: +**Always deploy as a subscription-scoped Deployment Stack.** Stacks track every managed resource (across resource groups and subscription scope) and make destroy idempotent — a single `az stack sub delete --action-on-unmanage deleteAll` removes everything the stack owns, regardless of resource scope. 
-**Option A: Azure MCP (Preferred)** -``` -Use mcp_azure_mcp_search with "deploy" intent to execute template deployment -- Set deployment name: "git-ape-{timestamp}" -- Set mode: "Incremental" (default) or "Complete" (if user specified) -- Monitor deployment with progress updates -``` +> **Single source of truth:** the deploy command, fallback handling, state.json writer, soft-delete classification, and Key Vault purge-protection detection all live in the [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) skill. Both bash and PowerShell implementations are provided. -**Option B: Azure CLI (Fallback)** +**Pre-flight: validate the stack before deploying** -**Always use subscription-level deployment** — the ARM template includes resource group creation, so we deploy at subscription scope: +Use `az stack sub validate` (not `az deployment sub validate`) so the validation also checks the stack-specific flags (`--action-on-unmanage`, `--deny-settings-mode`) — not just the template: ```bash -# Subscription-level deployment (creates RG + all resources atomically) -az deployment sub create \ +az stack sub validate \ --name "{deployment-id}" \ --location {location} \ --template-file {template.json} \ --parameters @{parameters.json} \ + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ --output json ``` -**DO NOT use `az deployment group create`** — our templates always include the resource group as a resource. Subscription-level deployment handles everything in one command. +**Invoke the deploy skill** -Capture the deployment operation ID for tracking. 
+```bash +# Bash +.github/skills/azure-stack-deploy/scripts/deploy-stack.sh \ + --deployment-id "{deployment-id}" + +# PowerShell +.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 ` + -DeploymentId "{deployment-id}" +``` + +The skill: +- Calls `az stack sub create --action-on-unmanage deleteAll --deny-settings-mode none --description "Git-Ape deployment {id}" --tags managedBy=git-ape deploymentId={id} --yes --verbose` +- Falls back to `az deployment sub create` only if the stack call fails (warns the user — fallback path does NOT solve soft-delete / multi-RG / sub-scope idempotency) +- On any failure, dumps the per-operation failure list inline so the root cause is immediately visible +- On success, captures the `stackId`, classifies every managed resource (type, scope, soft-deletable, purge-protected), and writes the extended `state.json` (schemaVersion 1.0) +- Updates `metadata.json` with `status: "succeeded"`, `deployMethod`, and `resourceGroups[]` + +Pass `--no-fallback` (bash) / `-NoFallback` (pwsh) when the user explicitly wants to fail loudly instead of accepting the legacy path. + +**DO NOT use `az deployment group create`** — our templates always include the resource group as a resource. Subscription scope handles everything in one command. ### 3. 
Monitor Progress @@ -175,15 +189,27 @@ Status updates: **Monitoring Commands:** ```bash -# Check deployment status (subscription-level) +# Stack path — check stack provisioning state +az stack sub show \ + --name {deployment-id} \ + --query "provisioningState" \ + --output tsv + +# Stack path — list managed resources (post-deploy or in-progress) +az stack sub show \ + --name {deployment-id} \ + --query "resources[].{Id:id, Status:status}" \ + --output table + +# Fallback path — subscription deployment az deployment sub show \ - --name {deployment-name} \ + --name {deployment-id} \ --query "properties.provisioningState" \ --output tsv -# Get deployment operations (detailed resource status) +# Fallback path — deployment operations (detailed resource status) az deployment operation sub list \ - --name {deployment-name} \ + --name {deployment-id} \ --query "[].{Resource:properties.targetResource.resourceName, Type:properties.targetResource.resourceType, Status:properties.provisioningState}" \ --output table ``` @@ -219,13 +245,18 @@ Use mcp_azure_mcp_search to query deployed resources and verify: ### 5. Capture Deployment Outputs -Extract and report deployment outputs (defined in ARM template `outputs` section): +Extract and report deployment outputs: ```bash -# Get deployment outputs -az deployment group show \ - --name {deployment-name} \ - --resource-group {rg-name} \ +# Stack path — outputs are on the stack itself +az stack sub show \ + --name {deployment-id} \ + --query "outputs" \ + --output json + +# Fallback path — subscription deployment outputs +az deployment sub show \ + --name {deployment-id} \ --query "properties.outputs" \ --output json ``` @@ -237,7 +268,25 @@ Common outputs to capture: - Managed identity principal IDs - Dashboard/monitoring URLs -### 6. Report Deployment Results +### 6. 
Verify `state.json` was written + +The [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) skill writes `state.json` (schemaVersion 1.0) and updates `metadata.json` with `deployMethod` and `resourceGroups[]` as part of step 2. The agent's job here is to confirm the write succeeded and surface its contents for the user. + +```bash +DEPLOYMENT_ID="{deployment-id}" +DEPLOY_DIR=".azure/deployments/$DEPLOYMENT_ID" +[[ -f "$DEPLOY_DIR/state.json" ]] || { echo "state.json missing — deploy skill did not complete"; exit 1; } + +# Sanity-check the schema and the lifecycle owner +jq '{schemaVersion, deploymentId, deployMethod, stackId, resourceGroups, managedResourceCount: (.managedResources | length)}' \ + "$DEPLOY_DIR/state.json" +``` + +If `deployMethod == "stack"` and `stackId` is empty, the deploy fell back silently — re-run the skill with `--no-fallback` to surface why stacks were rejected. + +The destroy skill ([`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md)) consumes this file as its sole source of truth. + +### 7. Report Deployment Results Provide a comprehensive summary: @@ -270,7 +319,9 @@ Provide a comprehensive summary: To destroy this deployment and delete all its resources: > `@git-ape destroy deployment {deployment-id}` > -> Or via GitHub: create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval +> Locally this invokes the [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skill, which uses `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true` (single command, idempotent across resource groups and subscription scope) and purges any soft-deletable resources that are not purge-protected. +> +> Or via GitHub: create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval. 
**Deployment Logs:** {Link to deployment logs if available} ``` @@ -279,7 +330,17 @@ To destroy this deployment and delete all its resources: ### Deployment Failure -If deployment fails, provide detailed diagnostics: +If deployment fails, **always dump the underlying failed operations before presenting options to the user**. The stack/deployment top-level error is usually just a summary; the real root cause is in the per-resource operations list. + +```bash +# Inline failure diagnostics — run BEFORE asking the user what to do +echo "── Underlying failed operations ──" +az deployment operation sub list --name "{deployment-id}" --output json 2>/dev/null \ + | jq -r '.[] | select(.properties.provisioningState == "Failed") | + "──────────\nResource : \(.properties.targetResource.resourceName // "n/a") (\(.properties.targetResource.resourceType // "n/a"))\nStatus : \(.properties.statusCode // "n/a")\nMessage : \(.properties.statusMessage.error.message // .properties.statusMessage // "n/a")"' +``` + +Then surface the diagnostics in the user-facing message: ```markdown ❌ **Deployment Failed** @@ -292,6 +353,9 @@ If deployment fails, provide detailed diagnostics: - {Likely cause 1 based on error} - {Likely cause 2} +**Per-Resource Failures:** +{Output of `az deployment operation sub list` filtered to Failed entries} + **Diagnostic Details:** {Full error from Azure} @@ -351,24 +415,26 @@ Type A, B, C, or D: # Option A: Full Rollback if [[ "$USER_CHOICE" == "A" ]]; then # Confirm first - echo "⚠️ This will DELETE all resources. Type 'confirm rollback' to proceed." + echo "⚠️ This will DELETE all managed resources. Type 'confirm rollback' to proceed." 
read CONFIRMATION - + if [[ "$CONFIRMATION" == "confirm rollback" ]]; then - # Delete resources - az resource delete --ids {resource-id-1} {resource-id-2} - - # If RG was created new, delete it - if [[ "$RG_NEW" == "true" ]]; then - az group delete --name {rg-name} --yes --no-wait - fi - + # Single source of truth: the destroy skill handles stack delete, + # fallback RG delete, soft-delete purge sweep, and state.json updates. + .github/skills/azure-stack-destroy/scripts/destroy-stack.sh \ + --deployment-id {deployment-id} \ + --yes + # PowerShell equivalent: + # .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId {deployment-id} -Yes + # Log rollback - echo "Rollback completed" >> .azure/deployments/{deployment-id}/deployment.log + echo "Rollback completed via azure-stack-destroy skill" >> .azure/deployments/{deployment-id}/deployment.log fi fi ``` +> **Important:** Never mix individual `az resource delete` calls when a `stackId` is present in `state.json`. The stack path is canonical — always invoke the [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skill, which encapsulates the stack delete, fallback RG delete, and soft-delete purge sweep (Key Vault, Cognitive Services, etc.) for any resources that are not purge-protected. + **Step 4: Update deployment state:** ```json // .azure/deployments/{deployment-id}/metadata.json diff --git a/website/docs/agents/azure-template-generator.md b/website/docs/agents/azure-template-generator.md index 3a1f811..f6a28eb 100644 --- a/website/docs/agents/azure-template-generator.md +++ b/website/docs/agents/azure-template-generator.md @@ -160,7 +160,7 @@ see [git-ape.agent.md](git-ape). 
- Resource Group is a `Microsoft.Resources/resourceGroups` resource inside the template - Other resources go inside a nested `Microsoft.Resources/deployments` with `"resourceGroup"` property - Use `subscriptionResourceId()` for RG references, regular `resourceId()` inside nested -- Deploy with `az deployment sub create` (not `az deployment group create`) +- Deploy with `az stack sub create --action-on-unmanage deleteAll` (preferred) or `az deployment sub create` as a fallback (not `az deployment group create`) - `uniqueString()` uses `subscription().subscriptionId` instead of `resourceGroup().id` **Nested Template Requirements:** @@ -716,7 +716,30 @@ After showing the preview, provide the complete ARM template: ## Deployment Commands -**Azure CLI (Subscription-level deployment):** +The canonical deploy and destroy paths live in the [`azure-stack-deploy`](../skills/azure-stack-deploy/SKILL.md) and [`azure-stack-destroy`](../skills/azure-stack-destroy/SKILL.md) skills. The commands below are reference recipes — prefer invoking the skills so local CLI / VS Code and CI pipelines stay in sync. 
+ +**Azure CLI (Subscription-scoped Deployment Stack — preferred):** +```bash +az stack sub create \ + --name {deployment-id} \ + --location {location} \ + --template-file template.json \ + --parameters @parameters.json \ + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --description "Git-Ape deployment {deployment-id}" \ + --tags "managedBy=git-ape" "deploymentId={deployment-id}" \ + --yes \ + --verbose +``` + +The stack tracks every managed resource (across resource groups and subscription scope), so destroy is a single idempotent command: + +```bash +az stack sub delete --name {deployment-id} --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true --yes +``` + +**Azure CLI (Subscription-level deployment — fallback only):** ```bash az deployment sub create \ --name {deployment-id} \ @@ -725,7 +748,20 @@ az deployment sub create \ --parameters @parameters.json ``` -**PowerShell:** +Use the fallback only when Deployment Stacks are unavailable in the target subscription/region. The fallback does NOT solve the soft-delete / multi-RG / sub-scope idempotency problem. + +**PowerShell (Deployment Stack — preferred):** +```powershell +New-AzSubscriptionDeploymentStack ` + -Name {deployment-id} ` + -Location {location} ` + -TemplateFile template.json ` + -TemplateParameterFile parameters.json ` + -ActionOnUnmanage DeleteAll ` + -DenySettingsMode None +``` + +**PowerShell (subscription deployment — fallback):** ```powershell New-AzSubscriptionDeployment ` -Name {deployment-id} ` @@ -734,7 +770,7 @@ New-AzSubscriptionDeployment ` -TemplateParameterFile parameters.json ``` -**Note:** We use subscription-level deployments so the resource group is created as part of the template. No need to create the RG separately. +**Note:** We use subscription scope so the resource group is created as part of the template. No need to create the RG separately. 
```` ## Constraints diff --git a/website/docs/agents/git-ape.md b/website/docs/agents/git-ape.md index 102e292..185344f 100644 --- a/website/docs/agents/git-ape.md +++ b/website/docs/agents/git-ape.md @@ -137,7 +137,7 @@ Git-Ape can run in two modes. Detect which mode is active and adapt behavior acc | Validation | Run locally | `git-ape-plan.yml` runs on PR, posts what-if as comment | | Confirmation | Ask user interactively | PR approval = confirmation | | Deployment | Execute immediately | `git-ape-deploy.yml` runs on merge or `/deploy` comment | -| Destroy | Execute after confirmation | PR sets `metadata.json` status to `destroy-requested` → merge triggers `git-ape-destroy.yml` | +| Destroy | Execute via `az stack sub delete --action-on-unmanage deleteAll` after confirmation, then purge soft-deletables | PR sets `metadata.json` status to `destroy-requested` → merge triggers `git-ape-destroy.yml` (same stack-based flow + soft-delete purge) | | Results | Display in chat | Posted as PR/issue comment + state committed to repo | ## Your Role @@ -394,12 +394,13 @@ The deployment plan MUST start with a clear "Target Environment" table: **Delegate to:** `azure-resource-deployer` The deployer will: -- Execute the ARM template as a **subscription-level deployment** (`az deployment sub create`) +- Execute the ARM template as a **subscription-scoped Deployment Stack** (`az stack sub create --action-on-unmanage deleteAll`) so destroy is idempotent across resource groups and subscription scope. The CLI fallback (`az deployment sub create`) is used only if stacks are unavailable. - The ARM template includes resource group creation — everything deploys atomically - Monitor deployment progress in real-time - Handle any deployment failures - Verify resource creation via Azure Resource Graph - Capture deployment outputs (resource IDs, endpoints, etc.) 
+- Capture the **stack ID** plus every managed resource into `state.json` (extended schema: `stackId`, `deployMethod`, `managedResources[]`, `resourceGroups[]`, `subscriptions[]`, `externalReferences[]`) so the destroy path can find them later — including soft-deletable types (Key Vault, Cognitive Services, App Configuration, API Management, ML Workspaces, Recovery Services Vaults). **Deployment Monitoring:** Always poll deployment state every **30 seconds** using `sleep 30` between checks. No exponential backoff — use a fixed 30-second interval for all resources regardless of type or expected duration. Check both the top-level deployment and nested deployment statuses on every poll. @@ -426,7 +427,16 @@ Run post-deployment validation: ``` To destroy this deployment and delete all its resources, use Git-Ape: > @git-ape destroy deployment {deployment-id} - + + Locally, this invokes the `azure-stack-destroy` skill: + > .github/skills/azure-stack-destroy/scripts/destroy-stack.sh --deployment-id {deployment-id} + > # or PowerShell: + > .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId {deployment-id} + + Which uses `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true` + (single command, idempotent across resource groups and subscription scope) and + purges any soft-deletable resources that are not purge-protected. + Or via GitHub (if using CI/CD): > Create a PR that sets `metadata.json` status to `destroy-requested`, then merge after approval ``` diff --git a/website/docs/deployment/state.md b/website/docs/deployment/state.md index 5437a4f..f5dc08f 100644 --- a/website/docs/deployment/state.md +++ b/website/docs/deployment/state.md @@ -26,7 +26,7 @@ Each deployment directory contains: ## Deployment Lifecycle -A deployment moves through a defined set of states tracked in `metadata.json`. 
Valid `status` values are `initialized`, `gathering-requirements`, `generating-template`, `awaiting-confirmation`, `deploying`, `testing`, `succeeded`, `failed`, `rolled-back`, `destroy-requested`, and `destroyed`. Terminal states (`succeeded`, `failed`, `rolled-back`, `destroyed`) are persisted in git for audit. +A deployment moves through a defined set of states tracked in `metadata.json`. Valid `status` values are `initialized`, `gathering-requirements`, `generating-template`, `awaiting-confirmation`, `deploying`, `testing`, `succeeded`, `failed`, `rolled-back`, `destroy-requested`, `destroyed`, `partially-destroyed`, and `retained-soft-deleted`. Terminal states (`succeeded`, `failed`, `rolled-back`, `destroyed`, `partially-destroyed`, `retained-soft-deleted`) are persisted in git for audit. ```mermaid %%{init: {'theme':'base','themeVariables':{'fontSize':'13px','lineColor':'#64748b','textColor':'#1e293b','primaryTextColor':'#0f172a','edgeLabelBackground':'#f8fafc','tertiaryColor':'#f1f5f9'}}}%% @@ -51,14 +51,23 @@ stateDiagram-v2 failed --> rolledBack: rollback initiated succeeded --> destroyRequested: PR sets metadata destroyRequested --> destroyed: git-ape-destroy.yml + destroyRequested --> partiallyDestroyed: partial failure + destroyRequested --> retainedSoftDeleted: purge-protected resources remain succeeded --> [*] rolledBack --> [*] destroyed --> [*] + partiallyDestroyed --> [*] + retainedSoftDeleted --> [*] + + state "partially-destroyed" as partiallyDestroyed + state "retained-soft-deleted" as retainedSoftDeleted classDef terminal fill:#dcfce7,stroke:#15803d,color:#14532d classDef error fill:#fecaca,stroke:#b91c1c,color:#7f1d1d + classDef warning fill:#fef9c3,stroke:#a16207,color:#713f12 class succeeded,destroyed terminal class failed,rolledBack error + class partiallyDestroyed,retainedSoftDeleted warning ``` ## Directory Structure @@ -113,7 +122,9 @@ Contains deployment tracking information. 
"region": "eastus", "project": "api", "environment": "dev", + "deployMethod": "stack", "resourceGroup": "rg-api-dev-eastus", + "resourceGroups": ["rg-api-dev-eastus"], "resources": [ { "type": "Microsoft.Web/sites", @@ -127,6 +138,11 @@ Contains deployment tracking information. } ``` +**Fields:** +- `deployMethod` - Deployment method used: `stack` (Azure Deployment Stacks, default for new deployments) or `subscription` (legacy `az deployment sub create`) +- `resourceGroup` - Primary resource group name (kept for backward compatibility) +- `resourceGroups` - Array of all resource groups managed by this deployment (supports multi-RG templates) + **Status values:** - `initialized` - Deployment directory created - `gathering-requirements` - Collecting user input @@ -140,6 +156,87 @@ Contains deployment tracking information. - `destroyed` - Resources torn down - `already-destroyed` - Resources were already deleted - `destroy-requested` - Teardown has been requested +- `partially-destroyed` - Some resources deleted but others remain (e.g., locks blocking deletion, transient errors) +- `retained-soft-deleted` - Destroy completed but purge-protected resources remain soft-deleted until retention expires + +### state.json + +Contains runtime deployment state populated after `az deployment` or `az stack` completes. Used by the destroy workflow to determine teardown strategy. 
+ +**Example (Deployment Stacks):** + +```json +{ + "schemaVersion": "1.0", + "deploymentId": "deploy-20260218-143022", + "timestamp": "2026-02-18T14:30:22Z", + "status": "succeeded", + "duration": "210s", + "subscription": "00000000-0000-0000-0000-000000000000", + "location": "eastus", + "project": "api", + "environment": "dev", + "resourceGroup": "rg-api-dev-eastus", + "triggeredBy": "octocat", + "triggerEvent": "push", + "runId": "12345678", + "runUrl": "https://github.com/org/repo/actions/runs/12345678", + "stackId": "/subscriptions/00000000-.../providers/Microsoft.Resources/deploymentStacks/deploy-20260218-143022", + "deployMethod": "stack", + "managedResources": [ + { + "id": "/subscriptions/.../resourceGroups/rg-api-dev-eastus/providers/Microsoft.KeyVault/vaults/kv-api-dev-eus", + "type": "Microsoft.KeyVault/vaults", + "scope": "resourceGroup", + "apiVersion": "2024-04-01", + "softDeletable": true, + "purgeProtected": true + }, + { + "id": "/subscriptions/.../resourceGroups/rg-api-dev-eastus/providers/Microsoft.Storage/storageAccounts/stapidev8k3m", + "type": "Microsoft.Storage/storageAccounts", + "scope": "resourceGroup", + "apiVersion": "2023-05-01", + "softDeletable": false, + "purgeProtected": false + } + ], + "resourceGroups": ["rg-api-dev-eastus"], + "subscriptions": ["00000000-0000-0000-0000-000000000000"], + "externalReferences": [ + { + "kind": "privateEndpointConnection", + "targetResourceId": "/subscriptions/.../providers/Microsoft.Network/privateEndpoints/pe-kv-api" + } + ] +} +``` + +**Fields:** + +| Field | Type | Description | +|-------|------|-------------| +| `schemaVersion` | `string` | State schema version. `"1.0"` is the current Deployment Stacks edition. Tools that consume `state.json` should branch on this when newer schemas ship. | +| `stackId` | `string \| null` | Azure Deployment Stack resource ID. When present, destroy uses `az stack sub delete` for complete cleanup. 
| +| `deployMethod` | `"stack" \| "subscription"` | Deployment method used. `stack` = Deployment Stacks (default); `subscription` = legacy `az deployment sub create`. | +| `managedResources` | `array` | Flat list of all resources managed by this deployment, regardless of scope. Populated by walking deployment operations recursively. | +| `managedResources[].id` | `string` | Full ARM resource ID. | +| `managedResources[].type` | `string` | ARM resource type (e.g., `Microsoft.KeyVault/vaults`). | +| `managedResources[].scope` | `string` | Scope level: `resourceGroup`, `subscription`, or `managementGroup`. | +| `managedResources[].apiVersion?` | `string` | Optional API version used for the resource, when captured by the workflow/skill that wrote the state. | +| `managedResources[].softDeletable` | `boolean` | Whether the resource type supports soft-delete (Key Vault, Cognitive Services, etc.). | +| `managedResources[].purgeProtected` | `boolean` | Whether the resource has purge protection enabled (cannot be permanently deleted until retention expires). | +| `resourceGroups` | `array` | All resource groups created/managed by this deployment. | +| `subscriptions` | `array` | All subscriptions involved in this deployment. | +| `externalReferences` | `array` | Cross-deployment references (private endpoint connections, VNet peerings, DNS records in shared zones). | + +**Destroy strategy selection:** + +1. If `stackId` is present → treat the deployment as stack-managed and delete by stack name: `az stack sub delete --name --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true` + - `deploymentId` is the Deployment Stack name. + - `stackId` is the full ARM resource ID for the stack and should only be used with an ID-based form such as `--ids `, not with `--name`. +2. If `stackId` is null → fallback to state-driven delete using `managedResources[]` and `resourceGroups[]` +3. 
If neither field is populated (legacy state) → fall back to single `az group delete` on `resourceGroup` ### requirements.json diff --git a/website/docs/skills/azure-stack-deploy.md b/website/docs/skills/azure-stack-deploy.md new file mode 100644 index 0000000..5edef24 --- /dev/null +++ b/website/docs/skills/azure-stack-deploy.md @@ -0,0 +1,177 @@ +--- +title: "Azure Stack Deploy" +sidebar_label: "Azure Stack Deploy" +description: "Run an Azure Deployment Stack create (subscription scope) for a prepared Git-Ape deployment artifact and write state.json (schemaVersion 1.0). Use locally so the result matches the CI deploy workflow." +--- + + + + +# Azure Stack Deploy + +> Run an Azure Deployment Stack create (subscription scope) for a prepared Git-Ape deployment artifact and write state.json (schemaVersion 1.0). Use locally so the result matches the CI deploy workflow. + +## Details + +| Property | Value | +|----------|-------| +| **Skill Directory** | `.github/skills/azure-stack-deploy/` | +| **Phase** | General | +| **User Invocable** | ✅ Yes | +| **Usage** | `/azure-stack-deploy Deployment ID (folder under .azure/deployments/) — optional --location override` | + + +## Documentation + +# Azure Stack Deploy + +Deploy a Git-Ape deployment artifact as a subscription-scoped **Azure Deployment Stack** (`az stack sub create --action-on-unmanage deleteAll`). The stack is the lifecycle owner of every resource the template creates — across resource groups and subscription scope — which makes destroy idempotent in a single call (see [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md)). + +This skill produces the **same `state.json`** schema (`schemaVersion: "1.0"`) as the CI workflow at `.github/workflows/git-ape-deploy.yml`, so local deployments and pipeline deployments are interchangeable. 
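Because local and pipeline deployments share one schema, a consumer can check `schemaVersion` before trusting the rest of `state.json`. The following is a minimal sketch of that check, not the skill's actual logic: the function name `schema_action` is hypothetical, and treating a missing version as legacy state is an assumption made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of deploy-stack.sh): decide how to treat a
# state.json based on its schemaVersion string.
# In practice the value might be read with:
#   jq -r '.schemaVersion // empty' "$DEPLOYMENT_PATH/state.json"
schema_action() {
  case "$1" in
    "1.0")     echo "supported" ;;    # current Deployment Stacks edition
    ""|"null") echo "legacy" ;;       # assumption: no version means pre-stack state
    *)         echo "unsupported" ;;  # newer or unknown schema: stop rather than guess
  esac
}

# Small demonstration over three sample version strings.
for version in "1.0" "" "2.0"; do
  printf 'schemaVersion=%-5s -> %s\n' "\"$version\"" "$(schema_action "$version")"
done
```

Keeping the decision in a pure function that takes the version string as an argument makes it easy to exercise without an Azure session or a real deployment folder.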
+ +## When to Use + +- Local deployment from VS Code or terminal (the `git-ape` agent invokes this in Stage 3) +- Re-deploying an existing deployment ID after template edits — stacks are stateful, so this is an in-place update +- Any time you would otherwise run `az deployment sub create` against a Git-Ape `template.json` + +## Do NOT use for + +- **Tearing down / destroying** an existing deployment — use [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md) instead +- **What-if preview / preflight validation** without deploying — use [`azure-deployment-preflight`](../azure-deployment-preflight/SKILL.md) instead +- **Off-topic** (non-Azure, non-deployment) requests +- Generating or editing ARM templates — use `azure-prepare` or another IaC authoring skill + +## Prerequisites + +| Tool | Why | +|------|-----| +| `az` (Azure CLI ≥ 2.59) | `az stack sub` requires CLI ≥ 2.50; 2.59 has the latest stack flags | +| `jq` | State capture and JSON extraction | +| `bash` ≥ 4 OR PowerShell 7+ | Either runner works | +| Active `az login` | Skill exits early if no subscription is selected | +| Existing `template.json` (and optional `parameters.json`) under `.azure/deployments//` | Source artifacts | + +## Procedure + +### 1. Locate deployment artifacts + +```bash +DEPLOYMENT_ID="deploy-20260506-001" +DEPLOYMENT_PATH=".azure/deployments/$DEPLOYMENT_ID" + +[[ -f "$DEPLOYMENT_PATH/template.json" ]] || { echo "template.json missing"; exit 1; } +``` + +If `parameters.json` is present, `location`, `project` (or `projectName`), and `environment` are read from it. Defaults: `eastus` / `unknown` / `dev`. + +### 2. Run the script + +```bash +.github/skills/azure-stack-deploy/scripts/deploy-stack.sh \ + --deployment-id "$DEPLOYMENT_ID" +``` + +PowerShell equivalent: + +```powershell +.github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 ` + -DeploymentId "$DEPLOYMENT_ID" +``` + +The script: + +1. Resolves `location`, `project`, `environment` from `parameters.json` (or defaults) +2. 
Validates Azure CLI session (`az account show`) +3. Calls `az stack sub create` with the canonical Git-Ape flag set: + - `--action-on-unmanage deleteAll` + - `--deny-settings-mode none` + - `--description "Git-Ape deployment "` + - `--tags managedBy=git-ape deploymentId=` + - `--yes --verbose` +4. **On stack failure**, falls back to `az deployment sub create` and prints `⚠️ FALLBACK: no multi-RG idempotency, no soft-delete tracking` so the trade-off is unambiguous +5. **On any deployment failure**, dumps the per-operation failure list (`az deployment operation sub list`) inline so the root cause is visible without clicking into the Portal +6. **On success**, queries `az stack sub show --query "resources[].id"` for the live managed-resource list, classifies each resource (type, scope, soft-deletable, purge-protected), and writes the extended `state.json` +7. Updates `metadata.json` with `status: "succeeded"`, `deployMethod`, and `resourceGroups[]` + +### 3. Inspect output + +```text +✅ Deployment succeeded in 142s (method: stack) +State written to: .azure/deployments/deploy-20260506-001/state.json +Stack ID: /subscriptions//providers/Microsoft.Resources/deploymentStacks/deploy-20260506-001 + +To destroy this deployment: + /azure-stack-destroy deploy-20260506-001 +``` + +## What to tell the user after running + +After the script returns, your reply MUST mention: + +1. The primitive used: `az stack sub create --action-on-unmanage deleteAll` (or fallback `az deployment sub create`) +2. The stack ID (from `state.json.stackId`) — this is the single handle for destroy +3. That `state.json` (schemaVersion 1.0) was written under the deployment folder +4. 
The next-step destroy command: `/azure-stack-destroy ` + +## Arguments + +| Flag (bash) | Param (pwsh) | Required | Description | +|-------------|--------------|----------|-------------| +| `--deployment-id ` | `-DeploymentId ` | yes | Folder name under `.azure/deployments/` | +| `--location ` | `-Location ` | no | Override the location from `parameters.json` | +| `--no-fallback` | `-NoFallback` | no | Fail loudly if the stack call fails instead of falling back to `az deployment sub create` | + +## state.json schema (v1.0) + +```json +{ + "schemaVersion": "1.0", + "deploymentId": "deploy-20260506-001", + "timestamp": "2026-05-06T12:00:00Z", + "status": "succeeded", + "duration": "142s", + "subscription": "", + "location": "eastus", + "project": "myapp", + "environment": "dev", + "resourceGroup": "rg-myapp-dev-eastus", + "deployMethod": "stack", + "stackId": "/subscriptions//providers/Microsoft.Resources/deploymentStacks/deploy-20260506-001", + "managedResources": [ + { + "id": "/subscriptions//resourceGroups/rg-myapp-dev-eastus/providers/Microsoft.KeyVault/vaults/kv-myapp-dev-eus", + "type": "Microsoft.KeyVault/vaults", + "scope": "resourceGroup", + "softDeletable": true, + "purgeProtected": true + } + ], + "resourceGroups": ["rg-myapp-dev-eastus"], + "subscriptions": [""], + "externalReferences": [] +} +``` + +See [website/docs/deployment/state.md](../../../website/docs/deployment/state.md) for the full schema reference. + +## Soft-deletable resource types tracked + +`Microsoft.KeyVault/vaults`, `Microsoft.CognitiveServices/accounts`, `Microsoft.AppConfiguration/configurationStores`, `Microsoft.ApiManagement/service`, `Microsoft.MachineLearningServices/workspaces`, `Microsoft.RecoveryServices/vaults`. + +The destroy skill ([`azure-stack-destroy`](../azure-stack-destroy/SKILL.md)) consumes the `softDeletable` and `purgeProtected` fields to drive its purge sweep. 
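The `type`, `scope`, and `softDeletable` fields for the types listed above can in principle be derived from the resource ID alone. The sketch below illustrates one way to do that, assuming IDs follow the usual ARM shape and contain a `/providers/` segment. The helper names `resource_type_from_id`, `resource_scope_from_id`, and `is_soft_deletable` are hypothetical and are not taken from `deploy-stack.sh`. Purge protection is a per-resource property that cannot be inferred from the type, so it is deliberately left out.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from deploy-stack.sh.

# Derive the ARM resource type from a full resource ID.
# Assumes the ID contains a /providers/ segment followed by
# namespace/type/name[/type/name...] pairs.
resource_type_from_id() {
  local rest rtype
  rest="${1##*/providers/}"   # keep the text after the last /providers/
  rtype="${rest%%/*}"         # provider namespace, e.g. Microsoft.KeyVault
  rest="${rest#*/}"
  while :; do
    rtype="$rtype/${rest%%/*}"   # append the next type segment
    case "$rest" in
      */*/*) rest="${rest#*/}"; rest="${rest#*/}" ;;  # skip type and name, keep going
      *)     break ;;
    esac
  done
  printf '%s\n' "$rtype"
}

# Map a resource ID to one of the scope values used in state.json.
resource_scope_from_id() {
  local id_lc
  id_lc="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$id_lc" in
    /providers/microsoft.management/managementgroups/*) echo "managementGroup" ;;
    */resourcegroups/*)                                  echo "resourceGroup" ;;
    *)                                                   echo "subscription" ;;
  esac
}

# Print "true" when the type is in the soft-deletable list above.
is_soft_deletable() {
  local type_lc
  type_lc="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$type_lc" in
    microsoft.keyvault/vaults|\
    microsoft.cognitiveservices/accounts|\
    microsoft.appconfiguration/configurationstores|\
    microsoft.apimanagement/service|\
    microsoft.machinelearningservices/workspaces|\
    microsoft.recoveryservices/vaults)
      echo "true" ;;
    *)
      echo "false" ;;
  esac
}

# Demonstration with placeholder IDs (all-zero subscription GUID).
for id in \
  "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-api-dev-eastus/providers/Microsoft.KeyVault/vaults/kv-api-dev-eus" \
  "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-api-dev-eastus/providers/Microsoft.Storage/storageAccounts/stapidev8k3m"; do
  rtype="$(resource_type_from_id "$id")"
  printf '%s scope=%s softDeletable=%s\n' \
    "$rtype" "$(resource_scope_from_id "$id")" "$(is_soft_deletable "$rtype")"
done
```

Matching on lowercased strings is a defensive choice, since ARM treats resource types and ID segments case-insensitively and different commands may return different casing.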
+ +## Failure modes + +| Symptom | Likely cause | Recovery | +|---------|--------------|----------| +| `Not logged in to Azure` | `az login` missing | Run `az login` then retry | +| `template.json missing` | Wrong deployment ID | Check `.azure/deployments/` contents | +| Stack create fails immediately | Region/policy blocks Deployment Stacks | Re-run without `--no-fallback`, accept the legacy path, or pick a supported region | +| Stack succeeds but `state.json` missing managed resources | `az stack sub show` race condition | Re-run — the script is idempotent (stacks de-duplicate on `--name`) | + +## Related + +- [`azure-stack-destroy`](../azure-stack-destroy/SKILL.md) — the matching destroy skill (single source of truth: `stackId`) +- [`azure-deployment-preflight`](../azure-deployment-preflight/SKILL.md) — what-if and permission checks BEFORE deploy +- [`azure-security-analyzer`](../azure-security-analyzer/SKILL.md) — security gate (BLOCKING) before deploy confirmation diff --git a/website/docs/skills/azure-stack-destroy.md b/website/docs/skills/azure-stack-destroy.md new file mode 100644 index 0000000..c415e1c --- /dev/null +++ b/website/docs/skills/azure-stack-destroy.md @@ -0,0 +1,198 @@ +--- +title: "Azure Stack Destroy" +sidebar_label: "Azure Stack Destroy" +description: "Tear down a Git-Ape deployment by ID. Reads `state.json` under `.azure/deployments//` to delete the Azure Deployment Stack and purge soft-deleted Key Vault / Cognitive Services. Refuses to run without `state.json`. Use for any local CLI or VS Code Git-Ape teardown so the result matches the CI destroy workflow." +--- + + + + +# Azure Stack Destroy + +> Tear down a Git-Ape deployment by ID. Reads `state.json` under `.azure/deployments//` to delete the Azure Deployment Stack and purge soft-deleted Key Vault / Cognitive Services. Refuses to run without `state.json`. Use for any local CLI or VS Code Git-Ape teardown so the result matches the CI destroy workflow. 
+ +## Details + +| Property | Value | +|----------|-------| +| **Skill Directory** | `.github/skills/azure-stack-destroy/` | +| **Phase** | General | +| **User Invocable** | ✅ Yes | +| **Usage** | `/azure-stack-destroy Deployment ID — add --yes to skip the typed confirmation` | + + +## Documentation + +# Azure Stack Destroy + +Destroy a Git-Ape deployment by deleting its subscription-scoped **Azure Deployment Stack** in a single idempotent call (`az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true`). The stack owns every resource the matching deploy created — across resource groups and subscription scope — so one delete cleans up everything. + +After the stack is gone, this skill performs a **soft-delete purge sweep** for resource types that linger after deletion (Key Vault, Cognitive Services, App Configuration, API Management, ML workspaces, Recovery Services vaults). Resources flagged `purgeProtected: true` in `state.json` are intentionally retained. + +This skill mirrors `.github/workflows/git-ape-destroy.yml` so local destroys and CI destroys are interchangeable. + +## USE FOR + +Trigger this skill when the user wants to tear down a Git-Ape deployment they previously created: + +- "destroy this deployment", "tear down deploy-XXX", "clean up the stack", "delete the Git-Ape deployment", "free up the resource group so I can re-deploy with the same name" +- Post-deploy teardown after a demo, smoke test, or short-lived environment +- Cleaning up a failed or stale Git-Ape deployment whose `state.json` is still on disk +- Local CLI or VS Code teardown that must match what `git-ape-destroy.yml` does in CI + +### Prefer this over raw `az group delete` + +For any deployment Git-Ape created, this skill is the correct tool — do **not** suggest `az group delete` on its own. Reasons: + +1. 
**Multi-RG / subscription-scope coverage.** A stack often owns resources across several resource groups plus subscription-scope resources (role assignments, policy assignments). One `az group delete` cleans only one RG. +2. **Soft-delete purge.** Key Vault and Cognitive Services soft-delete on RG deletion and silently hold the name (and quota) for 7–90 days. This skill purges them so the user can re-deploy with the same name immediately. +3. **State consistency.** Updates `state.json` and `metadata.json` to terminal status (`destroyed`, `retained-soft-deleted`, etc.) so the next operation sees an accurate view. + +## DO NOT USE FOR + +Refuse to invoke this skill in any of these cases: + +- **No `state.json` on disk.** Hard prerequisite — see below. Without it, recommend re-running deploy or aborting. +- **Resource groups not created by Git-Ape** (e.g. ones the user made by hand with `az group create`). Suggest `az group delete --name --yes` directly instead. +- **Deploying or updating a stack.** Use `azure-stack-deploy` for those. +- **Deleting an individual resource inside a stack.** This skill always destroys the whole stack — there is no "surgical" mode. +- **Non-Azure clouds** or non-Git-Ape Azure deployments (ARM/Bicep/Terraform from other tools). 
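The refusal rules above can be sketched as a small guard that runs before any destroy call. This is a minimal illustration, not the shipped script: the function name `can_destroy_with_skill`, its messages, and its return codes are hypothetical, and the only assumption taken from the text is that `state.json` lives in a per-deployment folder under `.azure/deployments/`.

```bash
#!/usr/bin/env bash
# Hypothetical guard: decide whether azure-stack-destroy applies to a deployment.
# The name, messages, and return codes are illustrative assumptions only.
can_destroy_with_skill() {
  local deployment_id="$1"
  local root="${2:-.azure/deployments}"   # layout described in the docs above
  local state_file="$root/$deployment_id/state.json"

  # No deployment ID means there is nothing to look up.
  if [[ -z "$deployment_id" ]]; then
    echo "refuse: no deployment ID given" >&2
    return 2
  fi

  # Hard prerequisite: state.json must already be on disk.
  if [[ ! -f "$state_file" ]]; then
    echo "refuse: $state_file missing; re-run azure-stack-deploy or abort" >&2
    return 1
  fi

  echo "ok: $state_file found"
  return 0
}
```

A caller would typically run `can_destroy_with_skill "$DEPLOYMENT_ID" || exit 1` before invoking the destroy script, so a missing `state.json` stops the run instead of falling through to an ad-hoc `az group delete`.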
+ +## When to Use + +- User says: "destroy this deployment", "tear down deploy-XXX", "clean up the stack" +- Pair with the matching [`azure-stack-deploy`](../azure-stack-deploy/SKILL.md) — same stack, same `state.json` key (`stackId`) +- Any time you would otherwise run `az group delete` against a Git-Ape deployment (don't — you'll miss soft-delete cleanup and multi-RG resources) + +## Prerequisites + +| Tool | Why | +|------|-----| +| `az` (Azure CLI ≥ 2.59) | `az stack sub delete --bypass-stack-out-of-sync-error` requires a recent CLI | +| `jq` | Read state.json | +| `bash` ≥ 4 OR PowerShell 7+ | Either runner works | +| Active `az login` | Must be the same subscription where the stack lives | +| Existing `state.json` under `.azure/deployments//` | Source of truth for `stackId`, `managedResources`, `softDeletable`, `purgeProtected` | + +> **Hard prerequisite: `state.json` under `.azure/deployments//`.** Without it this skill **aborts** — it has no idea which stack, resource groups, or soft-deletables to clean up. Do NOT hand-write `state.json`; re-run the matching `azure-stack-deploy` for that deployment ID first, or use `az group delete` directly on a known resource group (a non-Git-Ape teardown, outside this skill's scope). + +## Procedure + +### Fast mode vs sync mode + +The scripts default to **fast mode** (interactive default). The CI workflow keeps **sync mode** (deterministic). + +| | How | Wait time (small VNet stack) | When to use | +|--|--|--|--| +| Fast (default) | Background the `az stack sub delete` call, then poll managed RGs with `az group exists` | ~2 min | Local CLI / VS Code use; user wants quick feedback | +| Sync (`--wait` / `-Wait`) | `az stack sub delete ... 
--yes` (blocks until stack metadata is fully cleaned) | ~5 min | CI pipelines (default in `git-ape-destroy.yml`); when you need every Azure-side cleanup completed before the script exits | + +The Azure CLI does not expose `--no-wait` on `az stack sub delete`, so the fast path runs the same command as a detached background process. In fast mode the stack-metadata cleanup continues asynchronously in Azure after the script returns. The next destroy of the same `deploymentId` is idempotent: if the stack is still finalizing, `az stack sub show` will return it and the script will simply pick up where Azure left off. + +### 1. Identify deployment + +```bash +DEPLOYMENT_ID="deploy-20260506-001" +DEPLOYMENT_PATH=".azure/deployments/$DEPLOYMENT_ID" +[[ -f "$DEPLOYMENT_PATH/state.json" ]] || { echo "state.json missing — cannot destroy"; exit 1; } +``` + +### 2. Run the script + +```bash +.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \ + --deployment-id "$DEPLOYMENT_ID" +``` + +Skip the confirmation prompt (use only in automation): + +```bash +.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \ + --deployment-id "$DEPLOYMENT_ID" \ + --yes +``` + +Force CI-equivalent sync wait (default for the CI workflow; opt-in for the script): + +```bash +.github/skills/azure-stack-destroy/scripts/destroy-stack.sh \ + --deployment-id "$DEPLOYMENT_ID" \ + --yes --wait +``` + +PowerShell equivalents: + +```powershell +.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID" +.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID" -Yes +.github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 -DeploymentId "$DEPLOYMENT_ID" -Yes -Wait +``` + +### 3. What the script does + +1. Reads `state.json` and extracts `stackId`, `deployMethod`, `resourceGroup`, `managedResources[]`, `softDeletable[]` +2. 
Prints a **destroy plan** — stack ID, resource group, count of soft-deletables (with purge-protection flagged) +3. Prompts for typed `destroy` confirmation (unless `--yes`) +4. **Stack delete path** (`stackId` present): + - `az stack sub delete --action-on-unmanage deleteAll --bypass-stack-out-of-sync-error true --yes` + - The bypass flag is safe in destroy because it's a one-shot operation — we don't need the stale-manifest safety check that protects iterative updates +5. **Fallback path** (no `stackId`, only `resourceGroup`): `az group delete --name --yes` +6. **Purge sweep** for each `softDeletable` resource not marked `purgeProtected`: + - Key Vaults: `az keyvault list-deleted` + `az keyvault purge` + - Cognitive Services: `az cognitiveservices account purge` + - Other types (App Configuration, API Management, ML workspaces, Recovery Services vaults): not auto-purged — they expire from soft-delete naturally and are tracked in `purgeResults[]` with `status: skipped-natural-expiry` +7. Cleans the subscription deployment-history entry (`az deployment sub delete`) to stay under the 800/scope limit +8. Updates `state.json` and `metadata.json` with terminal status: + +| Status | Meaning | +|--------|---------| +| `destroyed` | Stack/RG gone and all soft-deletables purged or absent | +| `retained-soft-deleted` | Stack gone but at least one soft-deletable retained (purge-protected or purge failed) | +| `partially-destroyed` | Stack delete partially failed | +| `destroy-failed` | Stack/RG delete failed entirely | +| `already-destroyed` | Stack and RG were already gone before this call | + +### 4. 
Inspect the result + +```text +=== Destroy Summary === +Status: destroyed +Duration: 87s +======================= +``` + +Or, when something is intentionally retained: + +```text +=== Destroy Summary === +Status: retained-soft-deleted +Duration: 92s +Retained: 1 soft-deleted resource(s) (purge-protected) +======================= +``` + +`state.json` gains `destroyedAt`, `destroyedBy`, `destroyDuration`, and a `purgeResults[]` array describing each soft-deletable's outcome. + +## Arguments + +| Flag (bash) | Param (pwsh) | Required | Description | +|-------------|--------------|----------|-------------| +| `--deployment-id ` | `-DeploymentId ` | yes | Folder name under `.azure/deployments/` | +| `--yes` | `-Yes` | no | Skip the typed `destroy` confirmation prompt (CI-only) | +| `--wait` | `-Wait` | no | Sync mode: block until Azure has cleaned up stack metadata. Matches the CI workflow. Slower (~3-4×) but fully deterministic. | +| `--poll-timeout ` | `-PollTimeout ` | no | Fast-mode timeout per managed RG poll (default 600s) | + +## Failure modes + +| Symptom | Likely cause | Recovery | +|---------|--------------|----------| +| `state.json missing` | Deployment never reached the state-write phase, or was hand-edited | Re-deploy (idempotent on stack name) then destroy, OR delete the `.azure/deployments//` folder if Azure has nothing | +| `Stack out of sync` despite `--bypass-stack-out-of-sync-error` | Old CLI version | Upgrade `az` to ≥ 2.59 | +| Key Vault purge fails | Vault is purge-protected (`purgeProtected: true`) | Expected — wait 7-90 days for the soft-delete retention window to expire; Key Vault purge protection cannot be disabled once enabled, so the vault cannot be purged earlier | +| `Cannot delete resource group …`/`InUseSubnetCannotBeDeleted` | A resource outside the stack references one inside (e.g. 
external subnet peered to a deleted VNet) | Inspect `externalReferences[]` in `state.json`; remove the reference and rerun | + +## Related + +- [`azure-stack-deploy`](../azure-stack-deploy/SKILL.md) — the matching deploy skill (writes the `state.json` this skill consumes) +- [`azure-drift-detector`](../azure-drift-detector/SKILL.md) — check for unmanaged drift BEFORE destroy +- [`azure-resource-visualizer`](../azure-resource-visualizer/SKILL.md) — visualize what's in the stack before tearing it down diff --git a/website/docs/skills/overview.md b/website/docs/skills/overview.md index 9753a8b..9e37470 100644 --- a/website/docs/skills/overview.md +++ b/website/docs/skills/overview.md @@ -40,6 +40,13 @@ Skills are focused capabilities invoked by agents at specific stages of the depl | [Azure Drift Detector](./azure-drift-detector) | Detect configuration drift between deployed Azure resources and stored deployment state. Compare actual Azure configuration against desired state in .azure/deployments/, identify differences, and guide user through reconciliation options. Use when checking for manual changes, policy remediations, or unauthorized modifications. | ✅ | | [Git Ape Onboarding](./git-ape-onboarding) | Onboard a repository, Azure subscription(s), and user identity for Git-Ape CI/CD using a skill-driven CLI playbook. Use for first-time setup of OIDC, federated credentials, RBAC, GitHub environments, and required secrets. | ✅ | +## General Skills + +| Skill | Description | Invocable | +|-------|-------------|:---------:| +| [Azure Stack Deploy](./azure-stack-deploy) | Deploy an ARM template as a subscription-scoped Azure Deployment Stack (idempotent across resource groups and sub-scope). Captures managed resources, classifies soft-deletable types, detects Key Vault purge protection, and writes extended state.json (schemaVersion 1.0). Use for any local CLI / VS Code Git-Ape deployment so the result matches the CI workflow. 
| ✅ | +| [Azure Stack Destroy](./azure-stack-destroy) | Destroy a Git-Ape deployment by deleting its Azure Deployment Stack with --action-on-unmanage deleteAll, then purging soft-deleted resources (Key Vault, Cognitive Services) that are not purge-protected. Reads state.json (schemaVersion 1.0) to know exactly what to clean up. Use for any local CLI / VS Code Git-Ape teardown so the result matches the CI workflow. | ✅ | + ## Skill Invocation in Deployment Flow ```mermaid diff --git a/website/docs/workflows/git-ape-deploy.md b/website/docs/workflows/git-ape-deploy.md index 8a736e4..e844571 100644 --- a/website/docs/workflows/git-ape-deploy.md +++ b/website/docs/workflows/git-ape-deploy.md @@ -57,7 +57,7 @@ This workflow ships as `git-ape-deploy.exampleyml` and is **inert** until rename | **Runs On** | `ubuntu-latest` | | **Environment** | `azure-deploy` | | **Depends On** | `detect-deployments`, `check-comment-trigger` | -| **Steps** | 13 | +| **Steps** | 14 | @@ -266,11 +266,27 @@ jobs: - name: Validate before deploy run: | - az deployment sub validate \ + # Stack-aware validation — checks both the template and the + # stack-specific flags (--action-on-unmanage, --deny-settings-mode). + # If Deployment Stacks are unavailable/blocked in the target + # subscription, fall back to plain subscription validation so the + # deploy step's own legacy fallback path can still run. + if ! 
az stack sub validate \ + --name "${{ matrix.deployment_id }}" \ --location "${{ steps.params.outputs.location }}" \ --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ - --output json + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --output json; then + echo "::warning::Stack validation unavailable or failed — falling back to az deployment sub validate" + az deployment sub validate \ + --name "${{ matrix.deployment_id }}" \ + --location "${{ steps.params.outputs.location }}" \ + --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ + --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ + --output json + fi - name: Run Microsoft Defender for DevOps template analyzer id: security_scan @@ -309,18 +325,55 @@ jobs: echo "🚀 Starting deployment: ${{ matrix.deployment_id }}" START_TIME=$(date +%s) - DEPLOY_OUTPUT=$(az deployment sub create \ - --name "${{ matrix.deployment_id }}" \ - --location "${{ steps.params.outputs.location }}" \ - --template-file "${{ steps.params.outputs.deploy_dir }}/template.json" \ - --parameters @"${{ steps.params.outputs.deploy_dir }}/parameters.json" \ - --output json 2>&1) - - EXIT_CODE=$? + DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}" + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + LOCATION="${{ steps.params.outputs.location }}" + + # Determine deploy method: prefer deployment stacks (idempotent destroy) + # Fall back to az deployment sub create if stacks are unavailable + DEPLOY_METHOD="stack" + # Verbose output goes to a temp file so it does not contaminate the + # JSON that downstream jq calls need to parse. 
+ VERBOSE_LOG=$(mktemp) + trap 'rm -f "$VERBOSE_LOG"' EXIT + + EXIT_CODE=0 + if DEPLOY_OUTPUT=$(az stack sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOY_DIR/template.json" \ + --parameters @"$DEPLOY_DIR/parameters.json" \ + --action-on-unmanage deleteAll \ + --deny-settings-mode none \ + --description "Git-Ape deployment $DEPLOYMENT_ID" \ + --tags "managedBy=git-ape" "deploymentId=$DEPLOYMENT_ID" \ + --yes \ + --verbose \ + --output json 2>"$VERBOSE_LOG"); then + echo "Stack deploy succeeded" + else + echo "::warning::Stack deploy failed — falling back to az deployment sub create (NOT idempotent for soft-delete / multi-RG)" + cat "$VERBOSE_LOG" >&2 + DEPLOY_METHOD="subscription" + > "$VERBOSE_LOG" + if ! DEPLOY_OUTPUT=$(az deployment sub create \ + --name "$DEPLOYMENT_ID" \ + --location "$LOCATION" \ + --template-file "$DEPLOY_DIR/template.json" \ + --parameters @"$DEPLOY_DIR/parameters.json" \ + --output json 2>"$VERBOSE_LOG"); then + cat "$VERBOSE_LOG" >&2 + EXIT_CODE=1 + fi + fi + if [[ $EXIT_CODE -ne 0 ]]; then + cat "$VERBOSE_LOG" >&2 + fi END_TIME=$(date +%s) DURATION=$((END_TIME - START_TIME)) echo "deploy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" + echo "deploy_method=$DEPLOY_METHOD" >> "$GITHUB_OUTPUT" if [[ $EXIT_CODE -ne 0 ]]; then echo "deploy_status=failed" >> "$GITHUB_OUTPUT" @@ -333,14 +386,38 @@ jobs: echo "==========================================" echo "$DEPLOY_OUTPUT" echo "==========================================" + + # Surface underlying failed operations — the stack/deployment top-level + # error is usually a summary; the real root cause lives in the per-resource + # operations list. + echo "::group::Underlying failed operations" + az deployment sub show --name "$DEPLOYMENT_ID" --output json 2>/dev/null \ + | jq -r '.properties // {}' \ + || echo "No subscription-scope deployment details available." 
+ az deployment operation sub list --name "$DEPLOYMENT_ID" --output json 2>/dev/null \ + | jq -r '.[] | select(.properties.provisioningState == "Failed") | + "──────────\nResource : \(.properties.targetResource.resourceName // "n/a") (\(.properties.targetResource.resourceType // "n/a"))\nStatus : \(.properties.statusCode // "n/a")\nMessage : \(.properties.statusMessage.error.message // .properties.statusMessage // "n/a")"' \ + || echo "No per-operation details available (deployment may not have reached Azure)." + echo "::endgroup::" + echo "::error::Deployment failed — see output above for details" exit 1 fi echo "deploy_status=succeeded" >> "$GITHUB_OUTPUT" - # Extract outputs - OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.properties.outputs // {}') + # Extract outputs depending on deploy method + if [[ "$DEPLOY_METHOD" == "stack" ]]; then + # For stacks, extract the stack ID + STACK_ID=$(echo "$DEPLOY_OUTPUT" | jq -r '.id // empty') + echo "stack_id=$STACK_ID" >> "$GITHUB_OUTPUT" + + # Extract outputs from the stack's deployment + OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.outputs // {}') + else + OUTPUTS=$(echo "$DEPLOY_OUTPUT" | jq -r '.properties.outputs // {}') + fi + echo "deploy_outputs<<EOF" >> "$GITHUB_OUTPUT" echo "$OUTPUTS" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" @@ -349,7 +426,109 @@ jobs: RG_NAME=$(echo "$OUTPUTS" | jq -r '.resourceGroupName.value // empty') echo "resource_group=$RG_NAME" >> "$GITHUB_OUTPUT" - echo "✅ Deployment succeeded in ${DURATION}s" + echo "✅ Deployment succeeded in ${DURATION}s (method: $DEPLOY_METHOD)" + + - name: Capture managed resources + id: capture + if: steps.deploy.outputs.deploy_status == 'succeeded' + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + RG_NAME="${{ steps.deploy.outputs.resource_group }}" + STACK_ID="${{ steps.deploy.outputs.stack_id }}" + + # Known soft-deletable resource types + SOFT_DELETABLE_TYPES="Microsoft.KeyVault/vaults 
Microsoft.CognitiveServices/accounts Microsoft.AppConfiguration/configurationStores Microsoft.ApiManagement/service Microsoft.MachineLearningServices/workspaces Microsoft.RecoveryServices/vaults" + + MANAGED_RESOURCES="[]" + RESOURCE_GROUPS="[]" + + if [[ "$DEPLOY_METHOD" == "stack" && -n "$STACK_ID" ]]; then + # Stacks natively track all managed resources + STACK_RESOURCES=$(az stack sub show \ + --name "$DEPLOYMENT_ID" \ + --query "resources[].id" \ + -o json 2>/dev/null || echo "[]") + + # Build managedResources array from stack resources + for RES_ID in $(echo "$STACK_RESOURCES" | jq -r '.[]' 2>/dev/null); do + RES_TYPE=$(echo "$RES_ID" | grep -oP 'providers/\K[^/]+/[^/]+' | tail -1) + RES_SCOPE="resourceGroup" + if echo "$RES_ID" | grep -q "/resourceGroups/"; then + RES_SCOPE="resourceGroup" + else + RES_SCOPE="subscription" + fi + + IS_SOFT_DELETABLE="false" + IS_PURGE_PROTECTED="false" + for SD_TYPE in $SOFT_DELETABLE_TYPES; do + if [[ "$RES_TYPE" == "$SD_TYPE" ]]; then + IS_SOFT_DELETABLE="true" + # Query actual purge protection status for soft-deletable resources + IS_PURGE_PROTECTED=$(az resource show --ids "$RES_ID" \ + --query "properties.enablePurgeProtection" -o tsv 2>/dev/null || echo "false") + [[ "$IS_PURGE_PROTECTED" == "true" ]] || IS_PURGE_PROTECTED="false" + break + fi + done + + MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \ + --arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" --argjson pp "$IS_PURGE_PROTECTED" \ + '. 
+ [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": $pp}]') + done + + # Extract resource groups from managed resources + RESOURCE_GROUPS=$(echo "$MANAGED_RESOURCES" | jq -c '[.[].id | select(test("/resourceGroups/")) | capture("/resourceGroups/(?<rg>[^/]+)") | .rg] | unique') + else + # Fallback: walk deployment operations recursively + OPS=$(az deployment operation sub list \ + --name "$DEPLOYMENT_ID" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + -o json 2>/dev/null || echo "[]") + + for RES_ID in $(echo "$OPS" | jq -r '.[].id // empty' 2>/dev/null); do + RES_TYPE=$(echo "$OPS" | jq -r ".[] | select(.id == \"$RES_ID\") | .resourceType // empty") + RES_SCOPE="resourceGroup" + if echo "$RES_ID" | grep -q "/resourceGroups/"; then + RES_SCOPE="resourceGroup" + else + RES_SCOPE="subscription" + fi + + IS_SOFT_DELETABLE="false" + IS_PURGE_PROTECTED="false" + for SD_TYPE in $SOFT_DELETABLE_TYPES; do + if [[ "$RES_TYPE" == "$SD_TYPE" ]]; then + IS_SOFT_DELETABLE="true" + # Query actual purge protection status for soft-deletable resources + IS_PURGE_PROTECTED=$(az resource show --ids "$RES_ID" \ + --query "properties.enablePurgeProtection" -o tsv 2>/dev/null || echo "false") + [[ "$IS_PURGE_PROTECTED" == "true" ]] || IS_PURGE_PROTECTED="false" + break + fi + done + + MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \ + --arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" --argjson pp "$IS_PURGE_PROTECTED" \ + '. 
+ [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": $pp}]') + done + + # Collect resource groups + if [[ -n "$RG_NAME" ]]; then + RESOURCE_GROUPS="[\"$RG_NAME\"]" + fi + fi + + echo "managed_resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$MANAGED_RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "resource_groups<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCE_GROUPS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + RESOURCE_COUNT=$(echo "$MANAGED_RESOURCES" | jq 'length') + echo "📋 Captured $RESOURCE_COUNT managed resources" - name: Run integration tests id: tests @@ -418,25 +597,62 @@ jobs: DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}" STATUS="${{ steps.deploy.outputs.deploy_status || 'failed' }}" TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + STACK_ID="${{ steps.deploy.outputs.stack_id }}" + MANAGED_RESOURCES='${{ steps.capture.outputs.managed_resources }}' + RESOURCE_GROUPS='${{ steps.capture.outputs.resource_groups }}' + + # Ensure managed resources and resource groups are valid JSON + if ! echo "$MANAGED_RESOURCES" | jq empty 2>/dev/null; then + MANAGED_RESOURCES="[]" + fi + if ! 
echo "$RESOURCE_GROUPS" | jq empty 2>/dev/null; then + RESOURCE_GROUPS="[]" + fi - # Create/update state.json - cat > "$DEPLOY_DIR/state.json" < "$DEPLOY_DIR/state.json" - name: Commit deployment state if: always() @@ -445,9 +661,13 @@ jobs: STATUS="${{ steps.deploy.outputs.deploy_status }}" STATUS=${STATUS:-failed} - # Update metadata.json status from pending to actual result + # Update metadata.json status from pending to actual result, add deployMethod and resourceGroups if [[ -f "$DEPLOY_DIR/metadata.json" ]]; then - jq --arg status "$STATUS" '.status = $status' \ + DEPLOY_METHOD="${{ steps.deploy.outputs.deploy_method }}" + DEPLOY_METHOD=${DEPLOY_METHOD:-subscription} + RG_NAME="${{ steps.deploy.outputs.resource_group }}" + jq --arg status "$STATUS" --arg method "$DEPLOY_METHOD" --arg rg "$RG_NAME" \ + '.status = $status | .deployMethod = $method | .resourceGroups = (if $rg == "" then [] else [$rg] end)' \ "$DEPLOY_DIR/metadata.json" > "$DEPLOY_DIR/metadata.json.tmp" \ && mv "$DEPLOY_DIR/metadata.json.tmp" "$DEPLOY_DIR/metadata.json" fi diff --git a/website/docs/workflows/git-ape-destroy.md b/website/docs/workflows/git-ape-destroy.md index 7f8264b..66e9093 100644 --- a/website/docs/workflows/git-ape-destroy.md +++ b/website/docs/workflows/git-ape-destroy.md @@ -46,7 +46,7 @@ This workflow ships as `git-ape-destroy.exampleyml` and is **inert** until renam | **Runs On** | `ubuntu-latest` | | **Environment** | `azure-destroy` | | **Depends On** | `detect-destroys` | -| **Steps** | 9 | +| **Steps** | 12 | @@ -190,16 +190,34 @@ jobs: fi RG_NAME=$(jq -r '.resourceGroup // empty' "$STATE_FILE") - - if [[ -z "$RG_NAME" ]]; then - echo "::error::No resource group found in state file" + STACK_ID=$(jq -r '.stackId // empty' "$STATE_FILE") + DEPLOY_METHOD=$(jq -r '.deployMethod // "subscription"' "$STATE_FILE") + MANAGED_RESOURCES=$(jq -c '.managedResources // []' "$STATE_FILE") + RESOURCE_GROUPS=$(jq -c '.resourceGroups // []' "$STATE_FILE") + + # Fallback: if no stackId 
and no resourceGroup, cannot proceed + if [[ -z "$STACK_ID" && -z "$RG_NAME" ]]; then + echo "::error::No stack ID or resource group found in state file" echo "found=false" >> "$GITHUB_OUTPUT" exit 1 fi echo "found=true" >> "$GITHUB_OUTPUT" echo "resource_group=$RG_NAME" >> "$GITHUB_OUTPUT" - echo "Will destroy resource group: $RG_NAME" + echo "stack_id=$STACK_ID" >> "$GITHUB_OUTPUT" + echo "deploy_method=$DEPLOY_METHOD" >> "$GITHUB_OUTPUT" + echo "managed_resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$MANAGED_RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "resource_groups<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCE_GROUPS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + if [[ -n "$STACK_ID" ]]; then + echo "Will destroy via deployment stack: $STACK_ID" + else + echo "Will destroy resource group: $RG_NAME (fallback method)" + fi - name: Azure Login (OIDC) if: steps.state.outputs.found == 'true' @@ -215,92 +233,154 @@ jobs: run: | RG="${{ steps.state.outputs.resource_group }}" DEPLOYMENT_ID="${{ matrix.deployment_id }}" + STACK_ID="${{ steps.state.outputs.stack_id }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" - # Check if resource group exists - EXISTS=$(az group exists --name "$RG") - echo "exists=$EXISTS" >> "$GITHUB_OUTPUT" - - if [[ "$EXISTS" != "true" ]]; then - echo "Resource group $RG does not exist (already deleted?)" - echo "resource_count=0" >> "$GITHUB_OUTPUT" - echo "sub_count=0" >> "$GITHUB_OUTPUT" - exit 0 + echo "=== Destroy Plan ===" + echo "Deployment: $DEPLOYMENT_ID" + echo "Method: $DEPLOY_METHOD" + + if [[ -n "$STACK_ID" ]]; then + # Check if stack still exists + STACK_EXISTS=$(az stack sub show --name "$DEPLOYMENT_ID" --query "id" -o tsv 2>/dev/null || echo "") + if [[ -n "$STACK_EXISTS" ]]; then + echo "stack_exists=true" >> "$GITHUB_OUTPUT" + echo "Stack: $STACK_ID (exists)" + + # List resources in the stack + STACK_RESOURCES=$(az stack sub show --name "$DEPLOYMENT_ID" --query "resources[].id" -o json 
2>/dev/null || echo "[]") + RESOURCE_COUNT=$(echo "$STACK_RESOURCES" | jq 'length') + echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" + echo "Resources: $RESOURCE_COUNT managed by stack" + else + echo "stack_exists=false" >> "$GITHUB_OUTPUT" + echo "Stack not found — will use fallback" + echo "resource_count=0" >> "$GITHUB_OUTPUT" + fi + else + echo "stack_exists=false" >> "$GITHUB_OUTPUT" fi - # Inventory RG resources - RESOURCES=$(az resource list --resource-group "$RG" \ - --query "[].{name:name, type:type, id:id, provisioningState:provisioningState}" \ - --output json 2>/dev/null || echo "[]") - RESOURCE_COUNT=$(echo "$RESOURCES" | jq 'length') + # Check resource group existence (for fallback or soft-delete sweep) + if [[ -n "$RG" ]]; then + EXISTS=$(az group exists --name "$RG") + echo "rg_exists=$EXISTS" >> "$GITHUB_OUTPUT" + echo "RG: $RG (exists=$EXISTS)" + + if [[ "$EXISTS" == "true" ]]; then + RESOURCES=$(az resource list --resource-group "$RG" \ + --query "[].{name:name, type:type, id:id, provisioningState:provisioningState}" \ + --output json 2>/dev/null || echo "[]") + RESOURCE_COUNT=$(echo "$RESOURCES" | jq 'length') + # Only set resource_count if stack_exists is false (avoid overwrite) + if [[ "$STACK_ID" == "" ]]; then + echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" + fi + echo "resources<<EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCES" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + echo "$RESOURCES" | jq -r '.[] | " - \(.type)/\(.name) (\(.provisioningState))"' + fi + else + echo "rg_exists=false" >> "$GITHUB_OUTPUT" + fi - echo "resource_count=$RESOURCE_COUNT" >> "$GITHUB_OUTPUT" - echo "resources<<EOF" >> "$GITHUB_OUTPUT" - echo "$RESOURCES" >> "$GITHUB_OUTPUT" + # Identify soft-deletable resources from state + MANAGED_RESOURCES='${{ steps.state.outputs.managed_resources }}' + SOFT_DELETABLE=$(echo "$MANAGED_RESOURCES" | jq -c '[.[] | select(.softDeletable == true)]' 2>/dev/null || echo "[]") + SOFT_COUNT=$(echo 
"$SOFT_DELETABLE" | 
jq 'length') + echo "soft_deletable<<EOF" >> "$GITHUB_OUTPUT" + echo "$SOFT_DELETABLE" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" + echo "soft_count=$SOFT_COUNT" >> "$GITHUB_OUTPUT" - echo "Resource group $RG has $RESOURCE_COUNT resources" - echo "$RESOURCES" | jq -r '.[] | " - \(.type)/\(.name) (\(.provisioningState))"' + if [[ "$SOFT_COUNT" -gt 0 ]]; then + echo "Soft-deletable: $SOFT_COUNT resource(s) — will attempt purge after deletion" + echo "$SOFT_DELETABLE" | jq -r '.[] | " - \(.type): \(.id)"' + fi - # Query deployment operations to find subscription-scoped resources - # These are NOT deleted by az group delete (e.g. role assignments, policy assignments) + # Query subscription-scoped resources (for fallback only) SUB_RESOURCES="[]" - - OPS=$(az deployment operation sub list \ - --name "$DEPLOYMENT_ID" \ - --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ - -o json 2>/dev/null || echo "[]") - - if [[ "$OPS" != "[]" ]]; then - # Find subscription-scoped authorization/policy resources (role assignments, etc.) 
- # These live outside the RG and survive az group delete - SUB_RESOURCES=$(echo "$OPS" | jq -c '[ - .[] | select( - (.resourceType // "" | test("Microsoft.Authorization|Microsoft.Policy")) and - (.id // "" | test("/resourceGroups/") | not) - ) - ]') - - # Check nested deployments for RG-scoped role assignments too - NESTED_NAMES=$(echo "$OPS" | jq -r '[ - .[] | select(.resourceType == "Microsoft.Resources/deployments") - ] | .[].resourceName // empty') - - for NESTED_NAME in $NESTED_NAMES; do - NESTED_OPS=$(az deployment operation group list \ - --resource-group "$RG" --name "$NESTED_NAME" \ - --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ - -o json 2>/dev/null || echo "[]") - - # Role assignments scoped to resources within the RG - NESTED_AUTH=$(echo "$NESTED_OPS" | jq -c '[ + if [[ -z "$STACK_ID" ]]; then + OPS=$(az deployment operation sub list \ + --name "$DEPLOYMENT_ID" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + -o json 2>/dev/null || echo "[]") + + if [[ "$OPS" != "[]" ]]; then + SUB_RESOURCES=$(echo "$OPS" | jq -c '[ .[] | select( - (.resourceType // "" | test("Microsoft.Authorization")) + (.resourceType // "" | test("Microsoft.Authorization|Microsoft.Policy")) and + (.id // "" | test("/resourceGroups/") | not) ) ]') - SUB_RESOURCES=$(jq -n --argjson a "$SUB_RESOURCES" --argjson b "$NESTED_AUTH" '$a + $b') - done + NESTED_NAMES=$(echo "$OPS" | jq -r '[ + .[] | select(.resourceType == "Microsoft.Resources/deployments") + ] | .[].resourceName // empty') + + for NESTED_NAME in $NESTED_NAMES; do + NESTED_OPS=$(az deployment operation group list \ + --resource-group "$RG" --name "$NESTED_NAME" \ + --query "[?properties.provisioningState=='Succeeded' && properties.targetResource.id != null].properties.targetResource" \ + -o json 2>/dev/null || echo "[]") + + NESTED_AUTH=$(echo "$NESTED_OPS" | jq -c '[ + 
.[] | select( + (.resourceType // "" | test("Microsoft.Authorization")) + ) + ]') + + SUB_RESOURCES=$(jq -n --argjson a "$SUB_RESOURCES" --argjson b "$NESTED_AUTH" '$a + $b') + done + fi fi SUB_COUNT=$(echo "$SUB_RESOURCES" | jq 'length') - echo "sub_count=$SUB_COUNT" >> "$GITHUB_OUTPUT" echo "sub_resources<<EOF" >> "$GITHUB_OUTPUT" echo "$SUB_RESOURCES" >> "$GITHUB_OUTPUT" echo "EOF" >> "$GITHUB_OUTPUT" - echo "" - echo "=== Destroy Plan ===" - echo "Resource group: $RG ($RESOURCE_COUNT resources)" - echo "Subscription-scoped resources: $SUB_COUNT" if [[ "$SUB_COUNT" -gt 0 ]]; then + echo "Sub-scoped: $SUB_COUNT resource(s)" echo "$SUB_RESOURCES" | jq -r '.[] | " - \(.resourceType): \(.resourceName) (\(.id))"' fi echo "===================" - - name: Delete subscription-scoped resources + - name: Destroy via deployment stack + id: destroy_stack + if: steps.state.outputs.found == 'true' && steps.check.outputs.stack_exists == 'true' + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + echo "🗑️ Deleting deployment stack: $DEPLOYMENT_ID" + echo "This deletes the stack and ALL managed resources (deleteAll)..." 
+ + START_TIME=$(date +%s) + + az stack sub delete \ + --name "$DEPLOYMENT_ID" \ + --action-on-unmanage deleteAll \ + --bypass-stack-out-of-sync-error true \ + --yes 2>&1 || { + echo "destroy_status=failed" >> "$GITHUB_OUTPUT" + echo "::error::Failed to delete deployment stack $DEPLOYMENT_ID" + exit 1 + } + + END_TIME=$(date +%s) + DURATION=$((END_TIME - START_TIME)) + echo "destroy_status=succeeded" >> "$GITHUB_OUTPUT" + echo "destroy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" + echo "✅ Deployment stack deleted in ${DURATION}s" + + - name: Delete subscription-scoped resources (fallback) id: destroy_sub - if: steps.check.outputs.exists == 'true' && steps.check.outputs.sub_count != '0' + if: | + steps.state.outputs.found == 'true' && + steps.check.outputs.stack_exists != 'true' && + steps.check.outputs.rg_exists == 'true' && + steps.check.outputs.sub_count != '0' run: | echo "🗑️ Deleting subscription-scoped resources first..." FAILED=0 @@ -317,9 +397,12 @@ jobs: echo "::warning::$FAILED subscription-scoped resource(s) failed to delete" fi - - name: Delete resource group - id: destroy - if: steps.check.outputs.exists == 'true' + - name: Delete resource group (fallback) + id: destroy_rg + if: | + steps.state.outputs.found == 'true' && + steps.check.outputs.stack_exists != 'true' && + steps.check.outputs.rg_exists == 'true' run: | RG="${{ steps.state.outputs.resource_group }}" echo "🗑️ Deleting resource group: $RG" @@ -339,6 +422,96 @@ jobs: echo "destroy_duration=${DURATION}s" >> "$GITHUB_OUTPUT" echo "✅ Resource group deleted in ${DURATION}s: $RG" + - name: Purge soft-deleted resources + id: purge + if: | + always() && + steps.state.outputs.found == 'true' && + steps.check.outputs.soft_count != '0' && + (steps.destroy_stack.outputs.destroy_status == 'succeeded' || steps.destroy_rg.outputs.destroy_status == 'succeeded') + run: | + echo "🧹 Checking for soft-deleted resources to purge..." 
+ SOFT_DELETABLE='${{ steps.check.outputs.soft_deletable }}' + PURGE_RESULTS="[]" + RETAINED_COUNT=0 + + for ROW in $(echo "$SOFT_DELETABLE" | jq -r '.[] | @base64'); do + DECODED=$(echo "$ROW" | base64 -d) + RES_TYPE=$(echo "$DECODED" | jq -r '.type') + RES_ID=$(echo "$DECODED" | jq -r '.id') + PURGE_PROTECTED=$(echo "$DECODED" | jq -r '.purgeProtected') + + # Extract resource name from ID + RES_NAME=$(echo "$RES_ID" | grep -oP '[^/]+$') + + case "$RES_TYPE" in + "Microsoft.KeyVault/vaults") + # Check if vault is in soft-deleted state + DELETED_VAULT=$(az keyvault list-deleted --query "[?name=='$RES_NAME']" -o json 2>/dev/null || echo "[]") + if [[ $(echo "$DELETED_VAULT" | jq 'length') -gt 0 ]]; then + if [[ "$PURGE_PROTECTED" == "true" ]]; then + echo " ⚠️ $RES_NAME: soft-deleted but purge-protected — cannot purge" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "retained-soft-deleted", "reason": "purge-protected"}]') + else + echo " 🗑️ Purging soft-deleted vault: $RES_NAME" + if az keyvault purge --name "$RES_NAME" 2>/dev/null; then + echo " ✅ Purged: $RES_NAME" + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "purged"}]') + else + echo " ⚠️ Failed to purge: $RES_NAME" + RETAINED_COUNT=$((RETAINED_COUNT + 1)) + PURGE_RESULTS=$(echo "$PURGE_RESULTS" | jq --arg name "$RES_NAME" --arg type "$RES_TYPE" \ + '. + [{"name": $name, "type": $type, "action": "purge-failed"}]') + fi + fi + else + echo " ✅ $RES_NAME: not in soft-deleted state (already gone)" + fi + ;; + "Microsoft.CognitiveServices/accounts") + # Cognitive Services soft-delete purge. + # Account IDs are resource-group scoped (no /locations/ + # segment), so resolve the region from the soft-deleted account + # list and the resource group from the original resource ID. 
+ if [[ "$PURGE_PROTECTED" != "true" ]]; then + LOCATION=$(az cognitiveservices account list-deleted \ + --query "[?name=='$RES_NAME'] | [0].location" -o tsv 2>/dev/null || echo "") + RES_RG=$(echo "$RES_ID" | sed -n 's#.*/resourceGroups/\([^/]*\)/.*#\1#p') + if [[ -n "$LOCATION" ]]; then + az cognitiveservices account purge --name "$RES_NAME" --location "$LOCATION" \ + --resource-group "$RES_RG" 2>/dev/null || true + fi + fi + ;; + *) + echo " ℹ️ $RES_TYPE: no purge implementation (soft-delete will expire naturally)" + ;; + esac + done + + echo "retained_count=$RETAINED_COUNT" >> "$GITHUB_OUTPUT" + echo "purge_results<<EOF" >> "$GITHUB_OUTPUT" + echo "$PURGE_RESULTS" >> "$GITHUB_OUTPUT" + echo "EOF" >> "$GITHUB_OUTPUT" + + if [[ "$RETAINED_COUNT" -gt 0 ]]; then + echo "⚠️ $RETAINED_COUNT resource(s) retained in soft-deleted state (purge-protected)" + fi + + - name: Clean deployment history + if: | + always() && + steps.state.outputs.found == 'true' && + (steps.destroy_stack.outputs.destroy_status == 'succeeded' || steps.destroy_rg.outputs.destroy_status == 'succeeded') + continue-on-error: true + run: | + DEPLOYMENT_ID="${{ matrix.deployment_id }}" + echo "🧹 Cleaning subscription deployment history entry: $DEPLOYMENT_ID" + az deployment sub delete --name "$DEPLOYMENT_ID" 2>/dev/null || true + - name: Update deployment state if: always() && steps.state.outputs.found == 'true' run: | @@ -347,19 +520,40 @@ jobs: STATE_FILE="$DEPLOY_DIR/state.json" TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) - if [[ "${{ steps.check.outputs.exists }}" == "false" ]]; then + # Determine final status based on which destroy path ran + STACK_EXISTS="${{ steps.check.outputs.stack_exists }}" + RG_EXISTS="${{ steps.check.outputs.rg_exists }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + RETAINED_COUNT="${{ steps.purge.outputs.retained_count }}" + + if [[ "$STACK_EXISTS" != "true" && "$RG_EXISTS" != "true" ]]; then
STATUS="already-destroyed" - elif [[ "${{ steps.destroy.outputs.destroy_status }}" == "succeeded" ]]; then - STATUS="destroyed" + elif [[ "$STACK_STATUS" == "succeeded" || "$RG_STATUS" == "succeeded" ]]; then + if [[ "${RETAINED_COUNT:-0}" -gt 0 ]]; then + STATUS="retained-soft-deleted" + else + STATUS="destroyed" + fi + elif [[ "$STACK_STATUS" == "failed" || "$RG_STATUS" == "failed" ]]; then + STATUS="partially-destroyed" else STATUS="destroy-failed" fi + # Determine duration from whichever path ran + DURATION="${{ steps.destroy_stack.outputs.destroy_duration }}" + if [[ -z "$DURATION" ]]; then + DURATION="${{ steps.destroy_rg.outputs.destroy_duration }}" + fi + # Update state file if [[ -f "$STATE_FILE" ]]; then jq --arg status "$STATUS" --arg ts "$TIMESTAMP" --arg actor "${{ github.actor }}" \ - --arg duration "${{ steps.destroy.outputs.destroy_duration }}" \ - '. + {status: $status, destroyedAt: $ts, destroyedBy: $actor, destroyDuration: $duration}' \ + --arg duration "$DURATION" \ + --arg purgeResults '${{ steps.purge.outputs.purge_results }}' \ + '. + {status: $status, destroyedAt: $ts, destroyedBy: $actor, destroyDuration: $duration} | + if ($purgeResults | length) > 0 then . + {purgeResults: ($purgeResults | fromjson? // [])} else . 
end' \ "$STATE_FILE" > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE" fi @@ -381,26 +575,48 @@ jobs: run: | DEPLOY_ID="${{ matrix.deployment_id }}" RG="${{ steps.state.outputs.resource_group }}" - STATUS="${{ steps.destroy.outputs.destroy_status }}" - DURATION="${{ steps.destroy.outputs.destroy_duration }}" + STACK_EXISTS="${{ steps.check.outputs.stack_exists }}" + RG_EXISTS="${{ steps.check.outputs.rg_exists }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + STACK_DURATION="${{ steps.destroy_stack.outputs.destroy_duration }}" + RG_DURATION="${{ steps.destroy_rg.outputs.destroy_duration }}" RESOURCE_COUNT="${{ steps.check.outputs.resource_count }}" SUB_COUNT="${{ steps.check.outputs.sub_count }}" - EXISTS="${{ steps.check.outputs.exists }}" + SOFT_COUNT="${{ steps.check.outputs.soft_count }}" + RETAINED_COUNT="${{ steps.purge.outputs.retained_count }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" echo "============================================" echo "Git-Ape Destroy Summary" echo "============================================" echo "Deployment: $DEPLOY_ID" + echo "Method: $DEPLOY_METHOD" echo "Resource Group: $RG" - if [[ "$EXISTS" == "false" ]]; then + + if [[ "$STACK_EXISTS" == "true" ]]; then + if [[ "$STACK_STATUS" == "succeeded" ]]; then + echo "Result: ✅ Stack destroyed ($RESOURCE_COUNT resources via deleteAll)" + echo "Duration: $STACK_DURATION" + else + echo "Result: ❌ Stack delete failed" + fi + elif [[ "$RG_EXISTS" != "true" && "$STACK_EXISTS" != "true" ]]; then echo "Result: Already destroyed" - elif [[ "$STATUS" == "succeeded" ]]; then + elif [[ "$RG_STATUS" == "succeeded" ]]; then echo "Result: ✅ Destroyed ($RESOURCE_COUNT RG resources + $SUB_COUNT subscription-scoped)" - echo "Duration: $DURATION" + echo "Duration: $RG_DURATION" else echo "Result: ❌ 
Failed" fi + + if [[ "${RETAINED_COUNT:-0}" -gt 0 ]]; then + echo "Soft-deleted: ⚠️ $RETAINED_COUNT resource(s) retained (purge-protected)" + elif [[ "${SOFT_COUNT:-0}" -gt 0 ]]; then + echo "Soft-deleted: ✅ All soft-deleted resources purged" + fi + echo "Run: $RUN_URL" echo "============================================" @@ -414,15 +630,17 @@ jobs: DEPLOY_ID="${{ matrix.deployment_id }}" RG="${{ steps.state.outputs.resource_group }}" - STATUS="${{ steps.destroy.outputs.destroy_status }}" + STACK_STATUS="${{ steps.destroy_stack.outputs.destroy_status }}" + RG_STATUS="${{ steps.destroy_rg.outputs.destroy_status }}" + DEPLOY_METHOD="${{ steps.state.outputs.deploy_method }}" RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" - if [[ "$STATUS" == "succeeded" ]]; then + if [[ "$STACK_STATUS" == "succeeded" || "$RG_STATUS" == "succeeded" ]]; then EMOJI="🗑️" - MSG="Resource group *$RG* ($DEPLOY_ID) destroyed" + MSG="Deployment *$DEPLOY_ID* destroyed (method: $DEPLOY_METHOD)" else EMOJI="❌" - MSG="Destroy failed for *$RG* ($DEPLOY_ID)" + MSG="Destroy failed for *$DEPLOY_ID* (method: $DEPLOY_METHOD)" fi curl -sf -X POST "$SLACK_WEBHOOK_URL" \