fix(localbox): improve deployment reliability (VHDX copy, VM verification, task trigger, RP polling, SYSTEM context) #3394
Open
nekdima wants to merge 5 commits into microsoft:main
When Copy-Item for AzL-node.vhdx or GUI.vhdx fails silently (piped to Out-Null), the downstream New-AzLocalNodeVM and Set-AzLocalNodeVhdx functions proceed with non-existent files, causing Mount-VHD to fail with 'is not an existing virtual hard disk file' after 5 retries. The deployment then silently fails at Step 4/11 with no node VMs created.

Root cause observed in production (50-participant MicroHack deployment):
- AzL-node.vhdx copy to V:\VMs failed silently
- New-AzLocalNodeVM created data/S2D disks but not the OS VHDX
- Set-AzLocalNodeVhdx failed on Mount-VHD for the missing VHDX
- Script continued to Step 5 (Start VMs) with no node VMs

Fix:
1. Replace silent Copy-Item with verified copy + 3x retry loop
   - Validates source exists before copy
   - Verifies destination exists and size matches after copy
   - Retries with 10s delay on failure
   - Throws terminating error after all retries exhausted
2. Add post-creation verification for node VMs
   - Checks VHDX file exists after New-AzLocalNodeVM
   - Checks Hyper-V VM exists after creation
   - Fails fast with clear error instead of proceeding with missing VMs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
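The verified copy with retry described in this commit could look roughly like the sketch below. The function name `Copy-VhdxWithVerification` and its parameters are hypothetical and not taken from New-LocalBoxCluster.ps1; the sketch also assumes `-Destination` is a full file path rather than a folder.

```powershell
# Hypothetical helper illustrating the verified copy + retry pattern.
function Copy-VhdxWithVerification {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,   # assumed to be a file path
        [int] $MaxRetries = 3,
        [int] $RetryDelaySeconds = 10
    )

    # Validate the source before attempting any copy.
    if (-not (Test-Path -LiteralPath $Source -PathType Leaf)) {
        throw "Source VHDX not found: $Source"
    }
    $sourceSize = (Get-Item -LiteralPath $Source).Length

    for ($attempt = 1; $attempt -le $MaxRetries; $attempt++) {
        try {
            # -ErrorAction Stop turns copy failures into catchable exceptions.
            Copy-Item -LiteralPath $Source -Destination $Destination -Force -ErrorAction Stop

            # Verify the destination exists and matches the source size.
            $dest = Get-Item -LiteralPath $Destination -ErrorAction SilentlyContinue
            if ($dest -and $dest.Length -eq $sourceSize) {
                return
            }
            Write-Warning "Attempt $attempt of $MaxRetries failed verification (destination missing or size mismatch)."
        }
        catch {
            Write-Warning "Attempt $attempt of $MaxRetries failed: $($_.Exception.Message)"
        }

        if ($attempt -lt $MaxRetries) {
            Start-Sleep -Seconds $RetryDelaySeconds
        }
    }

    # All retries exhausted: stop the deployment instead of continuing silently.
    throw "Failed to copy '$Source' to '$Destination' after $MaxRetries attempts."
}
```

Throwing after the last attempt is what keeps later steps from running against a missing VHDX; the size comparison is a cheap check and does not detect content corruption.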
immediately start the task after registration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add mandatory RP registration check + polling before step 10 (Validate) in New-LocalBoxCluster.ps1. Prevents 'ArcIntegration requirements not met' error (step 90) caused by providers not fully registered when the HCI validator runs. Registers if needed and polls for up to 5 minutes.

Affected providers: KubernetesConfiguration, ExtendedLocation, HybridContainerService, HybridCompute, AzureStackHCI, ResourceConnector, Kubernetes, EdgeMarketplace.

Root cause: az provider register --wait returns before full propagation to HCI node context, causing race condition at scale (3+ subs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
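A minimal sketch of the register-then-poll pattern, using the Az PowerShell cmdlets `Get-AzResourceProvider` and `Register-AzResourceProvider`. The function name `Wait-ResourceProviderRegistration` is hypothetical, the `Microsoft.` namespace prefix on the provider names is an assumption, and the 15-second interval and 5-minute timeout follow the values given in the PR description.

```powershell
# Assumed full namespaces for the eight providers named in the commit message.
$mandatoryProviders = @(
    'Microsoft.KubernetesConfiguration', 'Microsoft.ExtendedLocation',
    'Microsoft.HybridContainerService',  'Microsoft.HybridCompute',
    'Microsoft.AzureStackHCI',           'Microsoft.ResourceConnector',
    'Microsoft.Kubernetes',              'Microsoft.EdgeMarketplace'
)

# Hypothetical helper: registers missing providers, then polls until all are Registered.
function Wait-ResourceProviderRegistration {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string[]] $ProviderNamespace,
        [int] $TimeoutSeconds = 300,
        [int] $PollIntervalSeconds = 15
    )

    # Kick off registration for anything not yet registered.
    foreach ($ns in $ProviderNamespace) {
        # Get-AzResourceProvider can return several objects; collapse to unique states.
        $state = (Get-AzResourceProvider -ProviderNamespace $ns).RegistrationState | Select-Object -Unique
        if ($state -ne 'Registered') {
            Register-AzResourceProvider -ProviderNamespace $ns | Out-Null
        }
    }

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        # Collect the providers that are still lagging so they can be logged by name.
        $pending = @(foreach ($ns in $ProviderNamespace) {
            $state = (Get-AzResourceProvider -ProviderNamespace $ns).RegistrationState | Select-Object -Unique
            if ($state -ne 'Registered') { $ns }
        })
        if ($pending.Count -eq 0) { return $true }

        Write-Host "Waiting for providers: $($pending -join ', ')" -ForegroundColor DarkGray
        Start-Sleep -Seconds $PollIntervalSeconds
    } while ((Get-Date) -lt $deadline)

    # Non-blocking fallback: warn and let the caller decide whether to continue.
    Write-Warning "Providers still not registered after $TimeoutSeconds seconds: $($pending -join ', ')"
    return $false
}

# Example call (requires an authenticated Az session):
# Wait-ResourceProviderRegistration -ProviderNamespace $mandatoryProviders
```

Returning `$false` on timeout instead of throwing mirrors the non-blocking fallback the PR describes; a stricter caller could throw on that result.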
Changes based on GPT-5.4-pro and o3-pro review:

Fix #1 (VHDX copy):
- Replace -ErrorAction SilentlyContinue with try/catch + -ErrorAction Stop to surface root cause on failure instead of swallowing errors
- Move $maxRetries before loop for clarity
- Add -ErrorAction SilentlyContinue to Get-Item destination check

Fix #2 (Node VM verification):
- Query VHDX path via Get-VMHardDiskDrive instead of assuming naming convention (avoids false failures if module changes disk layout)
- Add null check on MAC address returned by New-AzLocalNodeVM
- Fix double space before -LocalBoxConfig

Fix #3 (Scheduled task trigger):
- CRITICAL: Remove Start-ScheduledTask that fires before Hyper-V reboot, causing LocalBoxLogonScript to fail with missing Hyper-V module and leaving corrupt state for the post-reboot AtLogOn execution

Fix #4 (RP registration polling):
- Add Select-Object -Unique to .RegistrationState to handle array return from Get-AzResourceProvider (prevents System.Object[] in log output)
- Log which specific provider is lagging during polling
- Change -ForegroundColor Gray to DarkGray for terminal compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
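The node VM verification from Fix #2 might be sketched as follows. `Assert-NodeVmCreated` is a hypothetical name and not the code in the PR; `Get-VM` and `Get-VMHardDiskDrive` are the standard Hyper-V module cmdlets, and the disk path is read from the attached drives rather than derived from a naming convention.

```powershell
# Hypothetical helper: fail fast if a node VM, its disks, or its MAC address are missing.
function Assert-NodeVmCreated {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $VMName,
        [string] $MacAddress   # value returned by the VM creation step
    )

    # The creation step is expected to return a MAC address; treat empty as failure.
    if ([string]::IsNullOrWhiteSpace($MacAddress)) {
        throw "VM creation for '$VMName' returned no MAC address."
    }

    # The Hyper-V VM itself must exist.
    $vm = Get-VM -Name $VMName -ErrorAction SilentlyContinue
    if (-not $vm) {
        throw "Hyper-V VM '$VMName' was not created."
    }

    # Ask Hyper-V which disks are attached instead of assuming file names.
    $disks = @(Get-VMHardDiskDrive -VMName $VMName -ErrorAction SilentlyContinue)
    if ($disks.Count -eq 0) {
        throw "VM '$VMName' has no virtual hard disks attached."
    }

    # Every attached disk must exist on disk.
    foreach ($disk in $disks) {
        if (-not (Test-Path -LiteralPath $disk.Path)) {
            throw "VHDX '$($disk.Path)' attached to '$VMName' does not exist."
        }
    }
}
```

Calling this right after each VM is created surfaces a missing OS disk at the point of failure rather than at a later Mount-VHD call.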
Force-pushed from d5f922c to e63b33c
Wrap shortcut creation, Hyper-V shortcut, Windows Terminal config, and VSCode extension install in try-catch blocks. When running as SYSTEM via scheduled task (headless CI), the Desktop folder doesn't exist and code.exe isn't in PATH. These are non-critical UI setup steps that should not crash the entire bootstrap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
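One way to express the "non-critical step" idea is a small wrapper plus a `Get-Command` guard, sketched below. Both function names are hypothetical, and the shortcut name and the extension ID `ms-vscode.powershell` are illustrative choices rather than values taken from LocalBoxLogonScript.ps1.

```powershell
# Hypothetical wrapper: run a UI setup step, log a warning on failure, never abort.
function Invoke-NonCriticalStep {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [scriptblock] $Action
    )
    try {
        & $Action | Out-Null
        return $true
    }
    catch {
        Write-Warning "Skipping non-critical step '$Name': $($_.Exception.Message)"
        return $false
    }
}

# Hypothetical usage: desktop shortcut and VS Code extensions, both optional.
function Initialize-DesktopExperience {
    param([string[]] $VSCodeExtensions = @('ms-vscode.powershell'))  # illustrative ID

    $null = Invoke-NonCriticalStep -Name 'Hyper-V Manager shortcut' -Action {
        # Under SYSTEM the profile may have no Desktop folder at all.
        $desktop = Join-Path $Env:USERPROFILE 'Desktop'
        if (-not (Test-Path -LiteralPath $desktop)) {
            throw "Desktop folder not found: $desktop"
        }
        $shell = New-Object -ComObject WScript.Shell
        $shortcut = $shell.CreateShortcut((Join-Path $desktop 'Hyper-V Manager.lnk'))
        $shortcut.TargetPath = Join-Path $Env:SystemRoot 'System32\virtmgmt.msc'
        $shortcut.Save()
    }

    # Only call the VS Code CLI when it is resolvable in the current PATH.
    if (Get-Command code -ErrorAction SilentlyContinue) {
        foreach ($extension in $VSCodeExtensions) {
            $null = Invoke-NonCriticalStep -Name "VS Code extension $extension" -Action {
                code --install-extension $extension
            }
        }
    }
    else {
        Write-Warning 'code not found in PATH; skipping VS Code extension install.'
    }
}
```

Note that `try/catch` only intercepts terminating errors, so cmdlets inside a wrapped step need `-ErrorAction Stop` (or an explicit `throw`, as above) for a failure to be caught and logged.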
Summary
While preparing LocalBox environments for MicroHack workshops (50+ participants, 3+ subscriptions), we hit repeated deployment failures. Colleagues have also reported random cluster deployment failures independently. The root causes were hard to diagnose — scripts continued past errors silently, and failures only surfaced much later as cryptic downstream errors.
Changes
Fix 1: VHDX copy verification with retry (New-LocalBoxCluster.ps1)
The VHDX copy path could fail silently, allowing VM creation to continue without valid OS disks and later fail at Mount-VHD. Replaced the copy flow with source pre-check, try/catch + -ErrorAction Stop, post-copy size validation, and up to 3 retries with 10-second backoff.

Fix 2: Node VM post-creation verification (New-LocalBoxCluster.ps1)
After New-AzLocalNodeVM, there was no validation that the Hyper-V VM and attached VHDX were actually present. Added explicit verification using Get-VM and Get-VMHardDiskDrive, plus null-check validation for the returned MAC address before continuing.

Fix 3: Missing scheduled task trigger (Bootstrap.ps1)
LocalBoxLogonScript was registered without a trigger, so it never fired in headless deployments. Added the missing logon trigger (-AtLogOn) to align behavior with the existing task registration pattern.

Fix 4: Resource provider registration polling (New-LocalBoxCluster.ps1)
az provider register --wait can return before registration fully propagates to HCI node context. Added explicit polling for 8 mandatory providers (15-second interval, 5-minute timeout) before the existing 600-second sleep, with non-blocking timeout fallback to preserve progress while improving reliability.

Fix 5: SYSTEM context handling (LocalBoxLogonScript.ps1)
When the LocalBoxLogonScript scheduled task runs as SYSTEM (via the AtLogOn trigger in headless CI), it crashes at line 47 because $Env:USERPROFILE\Desktop does not exist for the SYSTEM profile, and code.exe (VS Code) is not in the SYSTEM PATH. These are non-critical UI setup steps (desktop shortcuts, VS Code extensions) that should not abort the entire bootstrap. Wrapped all UI-related setup blocks in try/catch and added a Get-Command code guard for VSCode extension install.

Files Modified
- New-LocalBoxCluster.ps1
- Bootstrap.ps1
- LocalBoxLogonScript.ps1

Testing
- Get-VMHardDiskDrive ✅, missing VM throw ✅, no-VHD throw ✅, MAC null-check ✅
- Select-Object -Unique returns scalar ✅, registered/unregistered comparison ✅, no System.Object[] in logs ✅

Production Deployment Evidence
These fixes have been validated across 60+ GitHub Actions workflow runs deploying LocalBox to 3 Azure subscriptions simultaneously. The SYSTEM context fix (fix 5) was the final missing piece; without it, the scheduled task would crash before starting nested VMs in headless deployments.