
fix(localbox): improve deployment reliability — VHDX copy, VM verification, task trigger, RP polling, SYSTEM context#3394

Open
nekdima wants to merge 5 commits into microsoft:main from nekdima:fix/localbox-reliability

nekdima commented Apr 13, 2026

Summary

While preparing LocalBox environments for MicroHack workshops (50+ participants, 3+ subscriptions), we hit repeated deployment failures. Colleagues have also reported random cluster deployment failures independently. The root causes were hard to diagnose — scripts continued past errors silently, and failures only surfaced much later as cryptic downstream errors.

Changes

Fix 1 — VHDX copy verification with retry (New-LocalBoxCluster.ps1)

The VHDX copy path could fail silently, allowing VM creation to continue without valid OS disks and later fail at Mount-VHD. Replaced the copy flow with source pre-check, try/catch + -ErrorAction Stop, post-copy size validation, and up to 3 retries with 10-second backoff.
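
For reviewers who want the shape of the new copy flow without opening the diff, here is a minimal sketch of the pattern described above. It is not the code from New-LocalBoxCluster.ps1: the function name `Copy-VerifiedFile` and its parameters are hypothetical, and `$Destination` is assumed to be a full file path rather than a directory.

```powershell
# Illustrative sketch only. Copy-VerifiedFile and its parameters are hypothetical
# names, not the helper used in New-LocalBoxCluster.ps1.
function Copy-VerifiedFile {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,   # full file path (assumption)
        [int] $MaxRetries = 3,
        [int] $DelaySeconds = 10
    )

    # Pre-check: fail immediately if the source is missing.
    if (-not (Test-Path -LiteralPath $Source -PathType Leaf)) {
        throw "Source file not found: $Source"
    }
    $expectedSize = (Get-Item -LiteralPath $Source).Length

    for ($attempt = 1; $attempt -le $MaxRetries; $attempt++) {
        try {
            # -ErrorAction Stop turns a copy failure into a terminating error,
            # so the catch block below sees it instead of it being swallowed.
            Copy-Item -LiteralPath $Source -Destination $Destination -Force -ErrorAction Stop

            # Post-copy check: destination must exist and match the source size.
            $copied = Get-Item -LiteralPath $Destination -ErrorAction SilentlyContinue
            if ($copied -and $copied.Length -eq $expectedSize) {
                return $copied
            }
            throw "Size mismatch after copy (expected $expectedSize bytes)."
        }
        catch {
            Write-Warning "Copy attempt $attempt of $MaxRetries failed: $($_.Exception.Message)"
            if ($attempt -lt $MaxRetries) { Start-Sleep -Seconds $DelaySeconds }
        }
    }

    # All retries exhausted: stop the deployment here rather than at Mount-VHD.
    throw "Failed to copy '$Source' to '$Destination' after $MaxRetries attempts."
}
```

The final `throw` is the important part: a failed copy stops the script at the copy step with the underlying error visible, instead of surfacing later as a missing-disk error.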

Fix 2 — Node VM post-creation verification (New-LocalBoxCluster.ps1)

After New-AzLocalNodeVM, there was no validation that the Hyper-V VM and attached VHDX were actually present. Added explicit verification using Get-VM and Get-VMHardDiskDrive, plus null-check validation for the returned MAC address before continuing.
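
A rough sketch of the verification described above, assuming the Hyper-V PowerShell module is available. `Assert-NodeVMReady` is a hypothetical name used only for illustration; the actual change in New-LocalBoxCluster.ps1 may be structured differently.

```powershell
# Illustrative sketch only. Requires the Hyper-V PowerShell module.
# Assert-NodeVMReady is a hypothetical helper name.
function Assert-NodeVMReady {
    param(
        [Parameter(Mandatory)] [string] $VMName,
        [string] $MacAddress   # value returned by the VM creation step
    )

    # The VM must be registered with Hyper-V.
    $vm = Get-VM -Name $VMName -ErrorAction SilentlyContinue
    if (-not $vm) {
        throw "VM '$VMName' was not created."
    }

    # Ask Hyper-V which disks are attached instead of assuming a file naming convention.
    $disks = @(Get-VMHardDiskDrive -VMName $VMName -ErrorAction SilentlyContinue)
    if ($disks.Count -eq 0) {
        throw "VM '$VMName' has no virtual hard disk attached."
    }
    foreach ($disk in $disks) {
        if (-not (Test-Path -LiteralPath $disk.Path)) {
            throw "Disk file missing for VM '$VMName': $($disk.Path)"
        }
    }

    # The MAC address must be non-empty before it is used downstream.
    if ([string]::IsNullOrWhiteSpace($MacAddress)) {
        throw "No MAC address was returned for VM '$VMName'."
    }
}
```

Each check throws with the VM name in the message, so a missing VM or disk is reported at creation time rather than at a later step.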

Fix 3 — Missing scheduled task trigger (Bootstrap.ps1)

LocalBoxLogonScript was registered without a trigger, so it never fired in headless deployments. Added the missing logon trigger (-AtLogOn) to align behavior with the existing task registration pattern.
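
For context, registering a task with a logon trigger generally looks like the sketch below. Only the task name and the `-AtLogOn` trigger come from this PR; the wrapper function `Register-LogonTask`, the script path, and the account are assumptions for illustration, and the real registration in Bootstrap.ps1 may pass different settings.

```powershell
# Illustrative sketch only (Windows, elevated session).
# Register-LogonTask, the script path, and the default account are assumptions.
function Register-LogonTask {
    param(
        [Parameter(Mandatory)] [string] $TaskName,
        [Parameter(Mandatory)] [string] $ScriptPath,
        [string] $User = 'SYSTEM'
    )

    $action  = New-ScheduledTaskAction -Execute 'PowerShell.exe' `
                   -Argument "-ExecutionPolicy Bypass -File `"$ScriptPath`""

    # Without a trigger the task is registered but never fires on its own.
    $trigger = New-ScheduledTaskTrigger -AtLogOn

    # Register only; the task is deliberately not started here.
    Register-ScheduledTask -TaskName $TaskName -Action $action -Trigger $trigger `
        -User $User -RunLevel Highest -Force
}

# Example call (placeholder path):
# Register-LogonTask -TaskName 'LocalBoxLogonScript' -ScriptPath 'C:\Temp\LocalBoxLogonScript.ps1'
```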

Fix 4 — Resource provider registration polling (New-LocalBoxCluster.ps1)

az provider register --wait can return before registration fully propagates to HCI node context. Added explicit polling for 8 mandatory providers (15-second interval, 5-minute timeout) before the existing 600-second sleep. The timeout is non-blocking: if it elapses, the script continues rather than aborting the deployment.
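
The polling loop can be sketched roughly as follows. This is a simplified illustration, not the code added to New-LocalBoxCluster.ps1: `Wait-ProviderRegistration` and its `-GetState` parameter are hypothetical, introduced so the loop can be exercised without an Azure session. The `Select-Object -Unique` step reflects that a provider lookup can return several entries with the same state.

```powershell
# Illustrative sketch only. Wait-ProviderRegistration and -GetState are hypothetical.
function Wait-ProviderRegistration {
    param(
        [Parameter(Mandatory)] [string[]] $Providers,
        # Script block that takes a provider namespace and returns its state(s).
        [Parameter(Mandatory)] [scriptblock] $GetState,
        [int] $TimeoutSeconds = 300,
        [int] $IntervalSeconds = 15
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        # Collapse duplicate states so an array of identical values compares as one.
        $pending = @($Providers | Where-Object {
            $states = @(& $GetState $_ | Select-Object -Unique)
            -not ($states.Count -eq 1 -and $states[0] -eq 'Registered')
        })
        if ($pending.Count -eq 0) { return $true }

        if ((Get-Date) -ge $deadline) {
            # Non-blocking fallback: name the laggards and let the caller continue.
            Write-Warning "Timed out waiting for: $($pending -join ', ')"
            return $false
        }

        Write-Host "Waiting for: $($pending -join ', ')" -ForegroundColor DarkGray
        Start-Sleep -Seconds $IntervalSeconds
    }
}

# Example call, assuming the Az.Resources module and a signed-in session:
# Wait-ProviderRegistration -Providers 'Microsoft.AzureStackHCI', 'Microsoft.HybridCompute' `
#     -GetState { param($ns) (Get-AzResourceProvider -ProviderNamespace $ns).RegistrationState }
```

Returning `$false` on timeout instead of throwing keeps the fallback non-blocking; the caller decides whether to proceed.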

Fix 5 — SYSTEM context handling (LocalBoxLogonScript.ps1)

When the LocalBoxLogonScript scheduled task runs as SYSTEM (via the AtLogOn trigger in headless CI), it crashes at line 47 because $Env:USERPROFILE\Desktop does not exist for the SYSTEM profile, and code.exe (VS Code) is not in the SYSTEM PATH. These are non-critical UI setup steps (desktop shortcuts, VS Code extensions) that should not abort the entire bootstrap. Wrapped all UI-related setup blocks in try/catch and added a Get-Command code guard for VSCode extension install.
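
The guard pattern is roughly the one sketched below. `Invoke-OptionalStep` is a hypothetical helper used here to keep the example short; the actual change wraps the individual blocks in try/catch directly. The extension ID in the usage comment is only an example.

```powershell
# Illustrative sketch only. Invoke-OptionalStep is a hypothetical helper.
function Invoke-OptionalStep {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [scriptblock] $Action
    )

    # Promote non-terminating errors inside the step so the catch block sees them.
    $ErrorActionPreference = 'Stop'
    try {
        & $Action | Out-Null
        return $true
    }
    catch {
        # Non-critical UI setup: log and carry on with the rest of the bootstrap.
        Write-Warning "Skipping '$Name': $($_.Exception.Message)"
        return $false
    }
}

# Example usage (comments only, so the sketch has no side effects):
#
# Invoke-OptionalStep -Name 'Desktop shortcuts' -Action {
#     $desktop = Join-Path $Env:USERPROFILE 'Desktop'
#     if (-not (Test-Path -LiteralPath $desktop)) {
#         New-Item -ItemType Directory -Path $desktop -Force
#     }
#     # ... create shortcuts ...
# }
#
# if (Get-Command code -ErrorAction SilentlyContinue) {
#     Invoke-OptionalStep -Name 'VS Code extensions' -Action {
#         code --install-extension ms-vscode.powershell
#     }
# }
```

The `Get-Command code` check avoids invoking a CLI that is not on the SYSTEM PATH, and the try/catch keeps a missing Desktop folder from aborting the steps that start the nested VMs.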

Files Modified

| File | Lines | Fixes |
|---|---|---|
| New-LocalBoxCluster.ps1 | +104 / -4 | Fixes 1, 2, 4 |
| Bootstrap.ps1 | +3 / -1 | Fix 3 |
| LocalBoxLogonScript.ps1 | +43 / -21 | Fix 5 |

Testing

  • Syntax validation: all three files parse cleanly
  • Tested on Windows Server 2022 VM (Standard_D4s_v5) with Hyper-V in Azure (Sweden Central)
  • VHDX copy (fix 1): copy + size verification ✅, source-missing throw ✅, size-mismatch detection ✅
  • VM verification (fix 2): VM + VHDX verify via Get-VMHardDiskDrive ✅, missing VM throw ✅, no-VHD throw ✅, MAC null-check ✅
  • Task trigger (fix 3): trigger registration confirmed ✅, task not auto-started ✅
  • RP polling (fix 4): Select-Object -Unique returns scalar ✅, registered/unregistered comparison ✅, no System.Object[] in logs ✅
  • SYSTEM context (fix 5): Desktop folder creation fallback ✅, VSCode skip when not in PATH ✅, script continues past UI errors ✅

Production Deployment Evidence

These fixes have been validated across 60+ GitHub Actions workflow runs deploying LocalBox to 3 Azure subscriptions simultaneously. The SYSTEM context fix (fix 5) was the final missing piece — without it, the scheduled task would crash before starting nested VMs in headless deployments.

nekdima and others added 4 commits April 13, 2026 21:38
When Copy-Item for AzL-node.vhdx or GUI.vhdx fails silently (piped to
Out-Null), the downstream New-AzLocalNodeVM and Set-AzLocalNodeVhdx
functions proceed with non-existent files, causing Mount-VHD to fail
with 'is not an existing virtual hard disk file' after 5 retries. The
deployment then silently fails at Step 4/11 with no node VMs created.

Root cause observed in production (50-participant MicroHack deployment):
- AzL-node.vhdx copy to V:\VMs failed silently
- New-AzLocalNodeVM created data/S2D disks but not the OS VHDX
- Set-AzLocalNodeVhdx failed on Mount-VHD for the missing VHDX
- Script continued to Step 5 (Start VMs) with no node VMs

Fix:
1. Replace silent Copy-Item with verified copy + 3x retry loop
   - Validates source exists before copy
   - Verifies destination exists and size matches after copy
   - Retries with 10s delay on failure
   - Throws terminating error after all retries exhausted
2. Add post-creation verification for node VMs
   - Checks VHDX file exists after New-AzLocalNodeVM
   - Checks Hyper-V VM exists after creation
   - Fails fast with clear error instead of proceeding with missing VMs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

immediately start the task after registration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add mandatory RP registration check + polling before step 10 (Validate)
in New-LocalBoxCluster.ps1. Prevents 'ArcIntegration requirements not met'
error (step 90) caused by providers not fully registered when the HCI
validator runs. Registers if needed and polls for up to 5 minutes.

Affected providers: KubernetesConfiguration, ExtendedLocation,
HybridContainerService, HybridCompute, AzureStackHCI, ResourceConnector,
Kubernetes, EdgeMarketplace.

Root cause: az provider register --wait returns before full propagation
to HCI node context, causing race condition at scale (3+ subs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Changes based on GPT-5.4-pro and o3-pro review:

Fix #1 (VHDX copy):
- Replace -ErrorAction SilentlyContinue with try/catch + -ErrorAction Stop
  to surface root cause on failure instead of swallowing errors
- Move $maxRetries before loop for clarity
- Add -ErrorAction SilentlyContinue to Get-Item destination check

Fix #2 (Node VM verification):
- Query VHDX path via Get-VMHardDiskDrive instead of assuming naming
  convention (avoids false failures if module changes disk layout)
- Add null check on MAC address returned by New-AzLocalNodeVM
- Fix double space before -LocalBoxConfig

Fix #3 (Scheduled task trigger):
- CRITICAL: Remove Start-ScheduledTask that fires before Hyper-V reboot,
  causing LocalBoxLogonScript to fail with missing Hyper-V module and
  leaving corrupt state for the post-reboot AtLogOn execution

Fix #4 (RP registration polling):
- Add Select-Object -Unique to .RegistrationState to handle array return
  from Get-AzResourceProvider (prevents System.Object[] in log output)
- Log which specific provider is lagging during polling
- Change -ForegroundColor Gray to DarkGray for terminal compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
nekdima force-pushed the fix/localbox-reliability branch from d5f922c to e63b33c on April 13, 2026 19:58
Wrap shortcut creation, Hyper-V shortcut, Windows Terminal config,
and VSCode extension install in try-catch blocks. When running as
SYSTEM via scheduled task (headless CI), the Desktop folder doesn't
exist and code.exe isn't in PATH. These are non-critical UI setup
steps that should not crash the entire bootstrap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
nekdima changed the title from "fix(localbox): improve deployment reliability — VHDX copy, VM verification, task trigger, RP polling" to "fix(localbox): improve deployment reliability — VHDX copy, VM verification, task trigger, RP polling, SYSTEM context" on Apr 14, 2026