Add Azure Batch scalability example #82

Draft
pstreef wants to merge 14 commits into main from pstreef/azure-batch-example

pstreef commented Apr 10, 2026

Contributor

Problem

Solution

Add an Azure Batch example with full Terraform IaC:

  • Azure Batch pool with auto-scaling and container support
  • Automation Account + Runbook for job orchestration
  • Key Vault integration for secrets (fetched by the runbook and forwarded to tasks as env vars)
  • Managed Identity for authentication

Follows the same chunk/processor fan-out pattern as the AWS and GCP examples.

Pending

Not yet integration-tested. Testing requires the Contributor role on an Azure resource group in order to create the Batch Account, Automation Account, and Key Vault access policies.

pstreef added 14 commits March 20, 2026 10:56
Adds GCP Batch and Azure Batch alongside the existing AWS Batch
example, all VM-based to avoid known Kubernetes issues with
resource-intensive LST builds.

Address issues from spec review: correct GCP Terraform resource name
(google_batch_job), add Cloud Workflows as scheduling intermediary,
fix Azure task submission pattern, add missing files to directory
structure (TROUBLESHOOTING.md, terraform.tfvars.example), document
CSV download and container registry per platform, note existing
chunk.sh bug to fix during migration.

9-task plan covering AWS migration, GCP Batch, Azure Batch,
top-level README, and root README updates. Reviewed and fixed:
GCP chunk logic moved to Workflow, Azure task submission loop,
pool image SKU match, timestamp lifecycle, NSG variable.

Restructure for multi-platform support. Fixes in chunk.sh:
- Use $local_csv_file instead of $csv_file for wc -l (fails on S3 URLs)
- Fix off-by-one in seq step ($chunk_size, not $chunk_size + 1)
- Add set -euo pipefail
Also add terraform.tfvars.example and update relative paths in README.
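The chunk.sh fixes listed above can be illustrated with a minimal sketch. It is not the script from this PR: the function names `chunk_ranges` and `count_rows` and the parameter names are hypothetical. It only shows why the `seq` step should be exactly the chunk size and why the line count is taken from a local file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print one "start end" pair (1-based, inclusive) per chunk of rows.
# total_rows and chunk_size are hypothetical parameter names.
chunk_ranges() {
  local total_rows=$1 chunk_size=$2 start end
  # Step by exactly chunk_size. Stepping by chunk_size + 1 would skip
  # one row between consecutive chunks.
  for start in $(seq 1 "$chunk_size" "$total_rows"); do
    end=$(( start + chunk_size - 1 ))
    if [ "$end" -gt "$total_rows" ]; then
      end=$total_rows
    fi
    echo "$start $end"
  done
}

# Count lines in a local copy of the CSV. wc -l reads local files only,
# so it cannot be pointed at an s3:// URL.
count_rows() {
  local local_csv_file=$1
  wc -l < "$local_csv_file" | tr -d ' '
}
```

For example, `chunk_ranges 10 4` yields three ranges that cover rows 1 through 10 with no gaps.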

Explains the shared architecture pattern across AWS/GCP/Azure batch
services and documents why VM-based approaches are recommended over
Kubernetes based on real customer experiences.

Terraform IaC with Cloud Workflows for job orchestration, Cloud
Scheduler for cron triggers, Secret Manager for credentials, and
service accounts with least-privilege IAM. Uses Compute Engine VMs
(n2-standard-4) with auto-scaling to zero.

Terraform IaC with Azure Automation for scheduling, Key Vault for
secrets, managed identity for passwordless auth, and auto-scaling
pool with container support. Uses Standard_D4s_v5 VMs.

Reflect the new AWS/GCP/Azure batch options in the scalability
stage description, directory tree, and comparison table.

Remove docs/superpowers/ directory containing internal AI workflow
references that shouldn't be in a public repo. Soften the K8s cost
inefficiency claim to cite industry reports and use "order of
magnitude" instead of specific multiplier.

- Azure: runbook now fetches secrets from Key Vault and passes them
  as env vars to chunk task, which forwards them to processor tasks
- Azure: chunk.sh passes --account-endpoint for az batch auth
- Azure: add acr_resource_group_name variable for cross-RG ACR
- Azure: remove unused disk_size_gb variable
- GCP: fix workflow service_account to use .email not .id
- AWS: add empty CSV guard for consistency with Azure/GCP
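The secret forwarding described in the first Azure item above (chunk task to processor tasks) could look roughly like the sketch below. The helper name `env_settings` and the variable names are hypothetical, and this is not the code from this PR. The sketch assumes the secrets are already exported in the chunk task's environment and only builds the `NAME=value` pairs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit NAME=value for each named variable that is exported and non-empty.
# The variable names passed in are hypothetical placeholders.
env_settings() {
  local name value
  for name in "$@"; do
    # printenv exits non-zero for an unset variable, so default to empty.
    value=$(printenv "$name" || true)
    if [ -n "$value" ]; then
      printf '%s=%s\n' "$name" "$value"
    fi
  done
}
```

The resulting pairs might then be passed to `az batch task create` through its `--environment-settings` option, together with `--account-endpoint`. Treat those flag names as an assumption and confirm them with `az batch task create --help` for the installed CLI version.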

- GCP: replace math.ceil (not available in Cloud Workflows) with integer ceiling division
- GCP: move jobId computation to init step (yamlencode doesn't evaluate expressions in connector args)
- GCP: add roles/artifactregistry.reader for batch task SA to pull container images
- AWS: fix reference to nonexistent ingest_job_definition (should be processor_job_definition)
- Azure: remove unsupported tags attribute on azurerm_batch_pool
- AWS: fix chunk.sh to use $local_csv_file instead of $csv_file for line count
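The integer ceiling division mentioned in the first GCP item above needs no `ceil` function. The sketch below shows the arithmetic in shell rather than in Cloud Workflows syntax, and the function name `ceil_div` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ceil(total / size) using only integer arithmetic, for size > 0.
# Adding size - 1 before dividing rounds any remainder up to the next chunk.
ceil_div() {
  local total=$1 size=$2
  echo $(( (total + size - 1) / size ))
}
```

With 10 rows and a chunk size of 4, this gives 3 chunks, and an exact multiple such as 8 rows gives 2.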

Prevents the azurerm provider from trying to register resource providers
at the subscription level, which fails for users without subscription-level
permissions.

Azure Batch example needs integration testing which requires elevated
Azure RBAC permissions. Splitting it out to unblock the GCP example
which is tested and ready.

Azure Batch example with full Terraform IaC:
- Azure Batch pool with auto-scaling and container support
- Automation Account + Runbook for job orchestration
- Key Vault integration for secrets
- Managed Identity for authentication

Terraform validates but not yet integration-tested (requires Contributor
RBAC on the target resource group).
Base automatically changed from pstreef/multi-platform-examples to main April 16, 2026 06:52