scttfrdmn/burstlab

BurstLab

BurstLab provisions disposable "mock on-prem" HPC clusters on AWS that replicate what real HPC environments actually run, with the AWS Plugin for Slurm v2 pre-configured for cloud bursting. It is a learning platform: every config, every design decision, and every AWS resource is documented so that a Solutions Architect (SA) or customer can understand exactly what was done and why.

This is not a canned demo. It is a transferable architecture. An SA can stand up a BurstLab cluster that matches a customer's Slurm version and OS, walk through the bursting configuration live, and hand over the IaC when the meeting ends.


Architecture

VPC 10.0.0.0/16
├── management subnet  10.0.0.0/24  (us-west-2a) — head node, EIP, SSH entry
├── on-prem subnet     10.0.1.0/24  (us-west-2a) — static compute nodes (no public IPs)
├── cloud subnet A     10.0.2.0/24  (us-west-2a) — burst nodes
└── cloud subnet B     10.0.3.0/24  (us-west-2b) — burst nodes (multi-AZ)

Head node  (m7a.2xlarge):     slurmctld + slurmdbd + munge + NAT (iptables masquerade)
Compute nodes (m7a.2xlarge × 4):  slurmd, private only, internet via head node NAT
Burst nodes (m7a.2xlarge):    launched by Plugin v2 via EC2 CreateFleet, same NAT path
EFS:  /u (user homes) and /opt/slurm (Slurm binaries + config) shared across all nodes

The head node is the cluster's NAT gateway — on-prem compute and burst nodes route all outbound traffic through it. This mirrors real on-prem HPC environments where compute nodes live on isolated private networks with no direct internet access.
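
The NAT role described above can be sketched as a pair of commands. This is a minimal illustration, not the repo's actual init script: the function name `nat_commands` and the interface name `eth0` are assumptions, and the function only prints the commands so it can be read or dry-run without root.

```bash
#!/usr/bin/env bash
# Sketch: print (not run) the commands that make a Linux host act as a NAT
# gateway for a private CIDR. Interface name and function name are assumptions.

nat_commands() {
  local cidr="$1" iface="$2"
  # Let the kernel forward packets between interfaces.
  echo "sysctl -w net.ipv4.ip_forward=1"
  # Masquerade traffic that originates in the private range and leaves it.
  echo "iptables -t nat -A POSTROUTING -s ${cidr} ! -d ${cidr} -o ${iface} -j MASQUERADE"
}

# VPC CIDR from the diagram above; eth0 is a hypothetical interface name.
nat_commands "10.0.0.0/16" "eth0"
```

On EC2, an instance that forwards traffic for other hosts also needs its source/destination check disabled, and the private subnets' route tables must point their default route at it.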


What's in This Repo

burstlab/
├── README.md                        # This file
├── docs/
│   ├── prerequisites.md             # AWS quota requirements and pre-flight check
│   ├── quickstart.md                # Step-by-step: zero to running cluster (Gen 1)
│   ├── generations.md               # Why five generations exist; which to choose
│   ├── slurm-intro.md               # Slurm concepts and commands from zero
│   ├── architecture.md              # Network, EFS, NAT, IAM, and security design
│   ├── slurm-gen1-deep-dive.md      # Every slurm.conf directive explained (Gen 1)
│   ├── slurm-gen2-deep-dive.md      # Gen 2 config changes from Gen 1
│   ├── slurm-gen3-deep-dive.md      # Gen 3 config changes, cloud_reg_addrs
│   ├── plugin-v2-setup.md           # Plugin v2 setup: configs, debugging, IAM
│   ├── sa-guide.md                  # How to use BurstLab with customers
│   └── workloads/                   # Workloads overlay: data staging demos
│       ├── overview.md              # Scenario selection guide, storage tier matrix
│       ├── scenario1-compute.md     # GROMACS + Spack on burst nodes
│       ├── scenario2-roda.md        # RODA public datasets via s5cmd/rclone/Mountpoint
│       ├── scenario3-ephemeral-efs.md  # Job-scoped EFS: create → compute → destroy
│       └── scenario4-ephemeral-fsx.md  # Job-scoped FSx Lustre linked to S3
│
├── terraform/
│   ├── modules/
│   │   ├── vpc/                     # VPC, subnets, security groups, route tables
│   │   ├── head-node/               # Head node EC2, EIP, NAT routing
│   │   ├── compute-nodes/           # Static on-prem compute EC2 instances
│   │   ├── shared-storage/          # EFS filesystem and mount targets
│   │   ├── iam/                     # Head node and burst node IAM roles
│   │   └── burst-config/            # Plugin v2 config files and launch template
│   ├── generations/
│   │   ├── gen1-slurm2205-rocky8/      # Gen 1: Rocky 8 + Slurm 22.05
│   │   ├── gen2-slurm2311-rocky9/      # Gen 2: Rocky 9 + Slurm 23.11
│   │   ├── gen3-slurm2405-rocky10/     # Gen 3: Rocky 10 + Slurm 24.05
│   │   ├── gen4-slurm2311-ubuntu2204/  # Gen 4: Ubuntu 22.04 + Slurm 23.11
│   │   └── gen5-slurm2405-ubuntu2404/  # Gen 5: Ubuntu 24.04 + Slurm 24.05
│   └── workloads/                   # Overlay: attaches to existing generation clusters
│       ├── base/                    # S3 bucket, transfer tools, script deploy
│       ├── scenario1-compute/       # Spack + GROMACS install
│       ├── scenario2-roda/          # S3 read policy + results bucket
│       ├── scenario3-ephemeral-efs/ # EFS lifecycle IAM policies
│       ├── scenario3-wrapper/       # efs-sbatch drop-in wrapper deploy
│       ├── scenario3-prolog-epilog/ # EFS prolog/epilog + slurm.conf patch
│       ├── scenario4-ephemeral-fsx/ # FSx + S3 policies, service-linked role
│       ├── scenario4-wrapper/       # fsx-sbatch drop-in wrapper deploy
│       ├── scenario4-prolog-epilog/ # FSx prolog/epilog + slurm.conf patch
│       └── scenario4-burst-buffer/  # Lua burst buffer plugin deploy
│
├── configs/
│   ├── gen1-slurm2205-rocky8/       # Gen 1 config templates (Rocky 8)
│   ├── gen2-slurm2311-rocky9/       # Gen 2 config templates (Rocky 9)
│   ├── gen3-slurm2405-rocky10/      # Gen 3 config templates (Rocky 10)
│   ├── gen4-slurm2311-ubuntu2204/   # Gen 4 config templates (Ubuntu 22.04)
│   └── gen5-slurm2405-ubuntu2404/   # Gen 5 config templates (Ubuntu 24.04)
│
├── scripts/
│   ├── check-quotas.sh              # Pre-flight AWS quota check
│   ├── validate-cluster.sh          # Post-deploy health check (40 checks)
│   ├── demo-burst.sh                # Interactive burst demo (run as alice via SSH)
│   ├── teardown.sh                  # Graceful cluster shutdown + terraform destroy
│   ├── userdata/                    # Cloud-init scripts for each node type
│   │   ├── head-node-init.sh.tpl
│   │   ├── compute-node-init.sh.tpl
│   │   └── burst-node-init.sh.tpl
│   └── workloads/                   # Workloads overlay scripts (deployed to EFS)
│       ├── install-transfer-tools.sh  # rclone, s5cmd, Mountpoint
│       ├── install-spack.sh           # Spack + Lmod via AWS binary cache
│       ├── install-gromacs.sh         # GROMACS via Spack
│       ├── lib/
│       │   ├── efs-lifecycle.sh       # EFS create/wait/destroy helpers
│       │   └── fsx-lifecycle.sh       # FSx Lustre create/wait/flush/destroy helpers
│       └── jobs/
│           ├── scenario1/             # GROMACS job scripts
│           ├── scenario2/             # RODA data access job scripts
│           ├── scenario3/             # Ephemeral EFS: chain, wrapper, prolog/epilog
│           └── scenario4/             # Ephemeral FSx: chain, wrapper, prolog/epilog, BB
│
└── ami/
    ├── rocky8-slurm2205.pkr.hcl     # Packer: Rocky 8 + Slurm 22.05 (Gen 1)
    ├── rocky9-slurm2311.pkr.hcl     # Packer: Rocky 9 + Slurm 23.11 (Gen 2)
    ├── rocky10-slurm2405.pkr.hcl    # Packer: Rocky 10 + Slurm 24.05 (Gen 3)
    ├── ubuntu2204-slurm2311.pkr.hcl # Packer: Ubuntu 22.04 + Slurm 23.11 (Gen 4)
    └── ubuntu2404-slurm2405.pkr.hcl # Packer: Ubuntu 24.04 + Slurm 24.05 (Gen 5)

Getting Started

Before deploying, check your AWS quota headroom — a low vCPU quota is the most common reason a deploy fails partway through:

bash scripts/check-quotas.sh --profile aws --region us-west-2

See docs/prerequisites.md for full requirements and how to request quota increases. See docs/generations.md to choose the right generation for your customer.

See docs/quickstart.md for the full step-by-step walkthrough with time estimates.
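
As a rough worked example of the quota arithmetic, the sketch below estimates the vCPUs a cluster consumes. It assumes every node is an m7a.2xlarge (8 vCPUs each) as in the architecture above; the function name `required_vcpus` and the burst node count are illustrative choices, not values taken from `check-quotas.sh`.

```bash
#!/usr/bin/env bash
# Sketch: estimate the vCPU quota a cluster needs.
# Assumes all nodes are m7a.2xlarge (8 vCPUs); burst count is your own estimate.

VCPUS_PER_NODE=8   # m7a.2xlarge

required_vcpus() {
  local head_nodes="$1" compute_nodes="$2" burst_nodes="$3"
  echo $(( (head_nodes + compute_nodes + burst_nodes) * VCPUS_PER_NODE ))
}

# 1 head node + 4 static compute nodes, before any burst nodes launch: 40
echo "baseline:   $(required_vcpus 1 4 0) vCPUs"
# Same cluster with a hypothetical peak of 10 burst nodes: 120
echo "with burst: $(required_vcpus 1 4 10) vCPUs"
```

If the account quota is below the "with burst" figure, the static cluster may deploy but burst launches can fail later.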

Short version (Gen 1 — the recommended default):

# 1. Build the AMI (~15-20 minutes)
cd ami/
packer build -var "aws_profile=aws" rocky8-slurm2205.pkr.hcl

# 2. Configure and deploy (~5 minutes)
cd terraform/generations/gen1-slurm2205-rocky8/
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars: set key_name and head_node_ami
terraform init && terraform apply

# 3. Wait for cluster init (~10-15 minutes), then connect
ssh -i ~/.ssh/your-key.pem rocky@<head_node_public_ip>
sudo tail -f /var/log/burstlab-init.log   # watch init progress
sinfo                                      # should show local + cloud partitions
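
The final check in step 3 can be scripted. The sketch below is one possible way to do it, assuming the partitions are literally named `local` and `cloud` (the source only says both kinds should appear) and using a hypothetical helper called `has_partitions` that reads partition names from stdin.

```bash
#!/usr/bin/env bash
# Sketch: check that a list of partition names (one per line on stdin)
# contains every expected partition. Partition names are assumptions.

has_partitions() {
  local list want
  list="$(cat)"
  for want in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string.
    printf '%s\n' "$list" | grep -qxF -- "$want" || return 1
  done
}

# On the head node, live data could be fed in like this:
#   sinfo -h -o '%R' | has_partitions local cloud && echo "cluster ready"
printf 'local\ncloud\n' | has_partitions local cloud && echo "both partitions present"
```

Reading from stdin keeps the check independent of Slurm, so it can be tried on a laptop before the cluster exists.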

Slurm Generations

BurstLab provides five complete generations spanning two OS families (RHEL/Rocky and Ubuntu). Each is an independently deployable cluster matching a specific customer environment.

RHEL/Rocky Track

| Generation | OS | Slurm | Key Features | When to Use |
|---|---|---|---|---|
| Gen 1 | Rocky 8 | 22.05.x | Python 3.6 boto3 shim; cgroup v1; FSx ✅ | RHEL/Rocky 8, Slurm 22.x (largest installed base) |
| Gen 2 | Rocky 9 | 23.11.x | Python 3.9 native; cgroup v2; FSx ✅ | RHEL/Rocky 9, Slurm 23.x |
| Gen 3 | Rocky 10 | 24.05.x | cloud_reg_addrs; cgroup v2 only; Ed25519; FSx requires burstlab-lustre | RHEL/Rocky 10, Slurm 24.x; greenfield |

Ubuntu Track

| Generation | OS | Slurm | Key Features | When to Use |
|---|---|---|---|---|
| Gen 4 | Ubuntu 22.04 | 23.11.x | apt/AppArmor; Python 3.10; cgroup v2; FSx requires burstlab-lustre | Ubuntu 22.04, Slurm 23.x; academic/cloud-native/NVIDIA |
| Gen 5 | Ubuntu 24.04 | 24.05.x | apt/AppArmor; Python 3.12; cloud_reg_addrs; FSx requires burstlab-lustre | Ubuntu 24.04, Slurm 24.x; latest LTS |

If you do not know the customer's exact OS and Slurm version, start with Gen 1 for RHEL/Rocky customers or Gen 4 for Ubuntu customers; otherwise deploy the generation that matches. Most HPC teams struggling with cloud bursting today are on Rocky 8 with Slurm 22.05.

FSx Lustre on Gen 3-5: AWS doesn't provide Lustre packages for Rocky 10 or Ubuntu LTS versions. Install Lustre clients from burstlab-lustre to enable FSx support on Gen 3, Gen 4, and Gen 5 clusters. EFS workloads work on all generations without additional setup.

See docs/generations.md for detailed comparison, decision tables, and architectural differences between generations. See README-ubuntu.md for Ubuntu-specific quick start.
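
The selection rule can be written down as a small lookup. This is only a sketch of the tables above: the function name `generation_dir` and the input spellings (`rocky8`, `ubuntu22.04`, and so on) are assumptions, while the directory names come from `terraform/generations/`.

```bash
#!/usr/bin/env bash
# Sketch: map a customer's OS and Slurm release to a generation directory.
# Input spellings are assumptions; directory names match terraform/generations/.

generation_dir() {
  local os="$1" slurm="$2"
  case "${os}:${slurm}" in
    rocky8:22.05)      echo "gen1-slurm2205-rocky8" ;;
    rocky9:23.11)      echo "gen2-slurm2311-rocky9" ;;
    rocky10:24.05)     echo "gen3-slurm2405-rocky10" ;;
    ubuntu22.04:23.11) echo "gen4-slurm2311-ubuntu2204" ;;
    ubuntu24.04:24.05) echo "gen5-slurm2405-ubuntu2404" ;;
    # No exact match: fall back to the recommended default per OS family.
    rocky*|rhel*)      echo "gen1-slurm2205-rocky8" ;;
    ubuntu*)           echo "gen4-slurm2311-ubuntu2204" ;;
    *)                 return 1 ;;
  esac
}

echo "terraform/generations/$(generation_dir rocky9 23.11)/"
```

The fallback branches encode the "start with Gen 1 or Gen 4" advice; an unknown OS family returns a non-zero status rather than guessing.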


Workloads Track

The workloads overlay demonstrates how HPC applications consume and produce data in a cloud bursting environment. It builds on top of any deployed generation cluster without modifying the core infrastructure.

| Scenario | Story | Storage |
|---|---|---|
| 1: Compute | GROMACS + Spack, no data staging | EFS only |
| 2: RODA | Read public AWS datasets (NOAA GOES-16) | S3 read-only |
| 3: Ephemeral EFS | Job-scoped NFS scratch with three lifecycle approaches | EFS ephemeral |
| 4: Ephemeral FSx | Job-scoped Lustre scratch linked to S3 with three lifecycle approaches | FSx + S3 |

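Scenarios 3 and 4 support several ways to trigger the storage lifecycle, ranging from explicit to transparent. Approach C (burst buffer) applies to Scenario 4 only; the examples below use the Scenario 4 (FSx) commands: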

| Approach | How to submit | User sees |
|---|---|---|
| 0: Chain | bash submit-chain.sh | Three job IDs with dependencies |
| A: Wrapper | fsx-sbatch myjob.sh | One job ID |
| B: Prolog/Epilog | sbatch --comment=fsx:1200 myjob.sh | One job ID; storage created silently |
| C: Burst Buffer | sbatch myjob.sh (with #BB directive) | BF → R → CG state transitions |

# Deploy the base overlay (once per cluster)
cd terraform/workloads/base/
terraform init && terraform apply

# Deploy a scenario and run it
cd terraform/workloads/scenario4-ephemeral-fsx/
terraform init && terraform apply
ssh alice@<head_node_ip>
bash /opt/slurm/etc/workloads/jobs/scenario4/submit-chain.sh

See docs/workloads/overview.md for scenario selection, storage tier decision matrix, and granularity modes (per-job, per-array, per-campaign). See docs/workloads/transparent-lifecycle.md for a full comparison of the three transparent lifecycle approaches.
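
To make the chain approach concrete, here is a minimal sketch of how three dependent jobs can be submitted. It is not the repo's `submit-chain.sh`: the job script names are hypothetical, and the `SBATCH` variable exists only so the function can be dry-run with a stub. The `--parsable` and `--dependency` options are standard `sbatch` flags.

```bash
#!/usr/bin/env bash
# Sketch of a three-job chain: create storage -> compute -> destroy storage.
# Script names are hypothetical; override SBATCH with a stub for a dry run.

SBATCH="${SBATCH:-sbatch}"

submit_chain() {
  local create_id compute_id destroy_id
  # --parsable makes sbatch print the job ID instead of a sentence.
  create_id="$("$SBATCH" --parsable create-storage.sh)" || return 1
  # Compute starts only if storage creation succeeded.
  compute_id="$("$SBATCH" --parsable --dependency="afterok:${create_id}" compute.sh)" || return 1
  # Cleanup runs once compute ends, whether it succeeded or failed.
  destroy_id="$("$SBATCH" --parsable --dependency="afterany:${compute_id}" destroy-storage.sh)" || return 1
  echo "${create_id} ${compute_id} ${destroy_id}"
}

# On the head node: submit_chain   (prints the three job IDs)
```

Using `afterany` for the cleanup job is a design choice in this sketch so that scratch storage is not left behind when the compute job fails.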


Why BurstLab Exists

On-prem HPC environments share a common set of problems when attempting cloud bursting for the first time:

  • Config drift: slurm.conf has diverged between the head node and login nodes
  • Missing plugins: the serializer/json plugin is absent in some Slurm 22.05 builds, preventing slurmctld from starting
  • SelectType mismatches: select/linear on one node, select/cons_tres on another
  • Broken accounting: slurmdbd not running or not configured, which Plugin v2 requires
  • IAM gaps: iam:PassRole missing, preventing EC2 Fleet from launching burst instances

BurstLab eliminates the "can we even get it working" phase. The Terraform and config templates represent known-good configurations for each Slurm generation. An SA can deploy a matching cluster in under 30 minutes and demonstrate working cloud bursting before a customer engagement even begins.
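
The config drift and SelectType problems listed above are easy to check for mechanically. The sketch below compares one directive across two copies of slurm.conf; the helper names `conf_value` and `same_setting` are illustrative, and the matching is deliberately simple (exact-case key, last uncommented match).

```bash
#!/usr/bin/env bash
# Sketch: compare a single directive across two copies of slurm.conf.
# Helper names are assumptions; matching is exact-case and keeps the last hit.

conf_value() {
  local key="$1" file="$2"
  # Print the value of uncommented "Key=value" lines, then keep the last one.
  sed -n "s/^[[:space:]]*${key}=\([^[:space:]#]*\).*/\1/p" "$file" | tail -n 1
}

same_setting() {
  local key="$1" file_a="$2" file_b="$3"
  [ "$(conf_value "$key" "$file_a")" = "$(conf_value "$key" "$file_b")" ]
}

# Demo with two throwaway files standing in for head-node and login-node copies.
tmp="$(mktemp -d)"
printf 'SelectType=select/linear\n'    > "$tmp/head.conf"
printf 'SelectType=select/cons_tres\n' > "$tmp/login.conf"
if same_setting SelectType "$tmp/head.conf" "$tmp/login.conf"; then
  echo "SelectType matches"
else
  echo "SelectType drift detected"
fi
rm -rf "$tmp"
```

In practice the two files would be fetched from the head node and a login node before comparing.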


Design Principles

  1. Correctness over cleverness. Every config file should be something an HPC sysadmin can read and understand. No magic.
  2. Ephemeral by default. terraform destroy cleans up everything. No orphaned resources.
  3. Match reality. Rocky Linux 8 with the same repo and package constraints customers face — not some idealized image.
  4. Configs are the product. The IaC is scaffolding. The real value is the known-good slurm.conf, partitions.json, IAM policies, and security groups for each generation.
  5. Document the why. Every directive, every AWS resource, every design choice has an explanation. The code is the documentation.

Documentation

| Doc | Audience | Contents |
|---|---|---|
| prerequisites.md | Everyone | AWS quota requirements and pre-flight check |
| quickstart.md | Everyone | Step-by-step deploy with time estimates (Gen 1) |
| generations.md | Everyone | Why five generations exist; which to choose |
| README-ubuntu.md | Ubuntu users | Ubuntu-specific quick start (Gen 4 & 5) |
| slurm-intro.md | Everyone | Slurm concepts and commands from zero |
| architecture.md | SAs, technical customers | Network, EFS, NAT, IAM deep dive |
| slurm-gen1-deep-dive.md | SAs, HPC admins | Every slurm.conf directive for Gen 1 |
| slurm-gen2-deep-dive.md | SAs, HPC admins | Gen 2 config changes from Gen 1 |
| slurm-gen3-deep-dive.md | SAs, HPC admins | Gen 3 config changes, cloud_reg_addrs |
| plugin-v2-setup.md | SAs, HPC admins | Plugin v2 setup, configs, debugging |
| sa-guide.md | SAs | How to run a customer demo |
| workloads/overview.md | SAs | Workloads overlay: scenario guide, storage tiers |
| workloads/scenario1-compute.md | SAs | GROMACS + Spack demo |
| workloads/scenario2-roda.md | SAs | RODA public datasets, s5cmd/rclone/Mountpoint |
| workloads/user-guide.md | Cluster users | How to use fsx-sbatch, efs-sbatch, fsx-list/restore/purge |
| workloads/scenario3-ephemeral-efs.md | SAs | Ephemeral EFS: chain, wrapper, prolog/epilog |
| workloads/scenario4-ephemeral-fsx.md | SAs | Ephemeral FSx Lustre + S3: chain, wrapper, prolog/epilog, burst buffer |
| workloads/transparent-lifecycle.md | SAs | Approach comparison: chain vs wrapper vs prolog/epilog vs burst buffer |

Contributing and Extending

BurstLab is intentionally structured for extension:

  • New Slurm version: Copy configs/gen1-slurm2205-rocky8/ to a new directory, update the config templates, add a Packer template, add a Terraform generation module.
  • New OS: Swap the Packer source AMI and update the repo-fix steps in the init scripts.
  • New instance type: Change instance types in terraform.tfvars and update the CPU/memory values in partitions.json.tpl and slurm.conf.tpl to match.
  • Spot instances: Set "PurchasingOption": "spot" in partitions.json.tpl and add a SpotOptions block with your interruption strategy.
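
When adding a new Slurm version, the existing directories suggest a naming pattern of `gen<N>-slurm<version without dots>-<os><os version without dots>`. The sketch below builds such a name and prints the copy command from the first bullet; the helper name `generation_name` and the "Gen 6" inputs are hypothetical, and nothing is actually copied.

```bash
#!/usr/bin/env bash
# Sketch: build a generation directory name in the style used by this repo.
# The helper name and the Gen 6 inputs below are hypothetical examples.

generation_name() {
  local gen_number="$1" slurm_version="$2" os_name="$3" os_version="$4"
  local slurm_tag os_tag
  # Drop the dots: 22.05 -> 2205, 24.04 -> 2404.
  slurm_tag="$(printf '%s' "$slurm_version" | tr -d '.')"
  os_tag="$(printf '%s' "$os_version" | tr -d '.')"
  echo "gen${gen_number}-slurm${slurm_tag}-${os_name}${os_tag}"
}

# Print (not run) the copy step for a hypothetical sixth generation.
new_dir="$(generation_name 6 25.05 rocky 10)"
echo "cp -r configs/gen1-slurm2205-rocky8 configs/${new_dir}"
```

The same name would then be reused for the Terraform generation module so that configs and IaC stay paired.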

About

Disposable mock on-prem HPC clusters on AWS with Slurm cloud bursting. Five generations: Rocky 8/9/10 + Ubuntu 22.04/24.04 with Slurm 22.05/23.11/24.05. Terraform + Packer IaC for SA demos and customer architectures.
