BurstLab

BurstLab provisions disposable "mock on-prem" HPC clusters on AWS that replicate what real HPC environments actually run, with the AWS Plugin for Slurm v2 pre-configured for cloud bursting. It is a learning platform — every config, every design decision, and every AWS resource is documented so an SA or customer can understand exactly what was done and why.

This is not a canned demo. It is a transferable architecture. An SA can stand up a BurstLab cluster that matches a customer's Slurm version and OS, walk through the bursting configuration live, and hand over the IaC when the meeting ends.

Architecture

VPC 10.0.0.0/16
├── management subnet  10.0.0.0/24  (us-west-2a) — head node, EIP, SSH entry
├── on-prem subnet     10.0.1.0/24  (us-west-2a) — static compute nodes (no public IPs)
├── cloud subnet A     10.0.2.0/24  (us-west-2a) — burst nodes
└── cloud subnet B     10.0.3.0/24  (us-west-2b) — burst nodes (multi-AZ)

Head node  (m7a.2xlarge):     slurmctld + slurmdbd + munge + NAT (iptables masquerade)
Compute nodes (m7a.2xlarge × 4):  slurmd, private only, internet via head node NAT
Burst nodes (m7a.2xlarge):    launched by Plugin v2 via EC2 CreateFleet, same NAT path
EFS:  /u (user homes) and /opt/slurm (Slurm binaries + config) shared across all nodes

The head node is the cluster's NAT gateway — on-prem compute and burst nodes route all outbound traffic through it. This mirrors real on-prem HPC environments where compute nodes live on isolated private networks with no direct internet access.
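
As a rough illustration of what "head node as NAT gateway" involves, the sketch below prints (or, with DRY_RUN=0, runs) the two steps a Linux NAT host typically needs: enabling IPv4 forwarding and adding a masquerade rule for the VPC CIDR. The function names, the DRY_RUN switch, and the interface name eth0 are assumptions for illustration and are not taken from BurstLab's init scripts. On EC2, the instance's source/destination check must also be disabled before it can forward traffic for other hosts.

```shell
# Hypothetical sketch; not BurstLab's actual head-node init script.
# DRY_RUN=1 (the default) prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# setup_nat <vpc_cidr> <outbound_interface>
setup_nat() {
  cidr="$1"
  iface="$2"
  # Allow the kernel to forward packets between interfaces.
  run sysctl -w net.ipv4.ip_forward=1
  # Masquerade traffic that originates in the VPC and leaves it.
  run iptables -t nat -A POSTROUTING -s "$cidr" ! -d "$cidr" -o "$iface" -j MASQUERADE
}

# eth0 is an assumed interface name; check `ip link` on the real head node.
setup_nat 10.0.0.0/16 eth0
```

Keeping the rule scoped to the VPC CIDR means traffic between subnets is routed unchanged and only internet-bound packets are rewritten.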

What's in This Repo

burstlab/
├── README.md                        # This file
├── docs/
│   ├── prerequisites.md             # AWS quota requirements and pre-flight check
│   ├── quickstart.md                # Step-by-step: zero to running cluster (Gen 1)
│   ├── generations.md               # Why five generations exist; which to choose
│   ├── slurm-intro.md               # Slurm concepts and commands from zero
│   ├── architecture.md              # Network, EFS, NAT, IAM, and security design
│   ├── slurm-gen1-deep-dive.md      # Every slurm.conf directive explained (Gen 1)
│   ├── slurm-gen2-deep-dive.md      # Gen 2 config changes from Gen 1
│   ├── slurm-gen3-deep-dive.md      # Gen 3 config changes, cloud_reg_addrs
│   ├── plugin-v2-setup.md           # Plugin v2 setup: configs, debugging, IAM
│   ├── sa-guide.md                  # How to use BurstLab with customers
│   └── workloads/                   # Workloads overlay: data staging demos
│       ├── overview.md              # Scenario selection guide, storage tier matrix
│       ├── scenario1-compute.md     # GROMACS + Spack on burst nodes
│       ├── scenario2-roda.md        # RODA public datasets via s5cmd/rclone/Mountpoint
│       ├── scenario3-ephemeral-efs.md  # Job-scoped EFS: create → compute → destroy
│       └── scenario4-ephemeral-fsx.md  # Job-scoped FSx Lustre linked to S3
│
├── terraform/
│   ├── modules/
│   │   ├── vpc/                     # VPC, subnets, security groups, route tables
│   │   ├── head-node/               # Head node EC2, EIP, NAT routing
│   │   ├── compute-nodes/           # Static on-prem compute EC2 instances
│   │   ├── shared-storage/          # EFS filesystem and mount targets
│   │   ├── iam/                     # Head node and burst node IAM roles
│   │   └── burst-config/            # Plugin v2 config files and launch template
│   ├── generations/
│   │   ├── gen1-slurm2205-rocky8/      # Gen 1: Rocky 8 + Slurm 22.05
│   │   ├── gen2-slurm2311-rocky9/      # Gen 2: Rocky 9 + Slurm 23.11
│   │   ├── gen3-slurm2405-rocky10/     # Gen 3: Rocky 10 + Slurm 24.05
│   │   ├── gen4-slurm2311-ubuntu2204/  # Gen 4: Ubuntu 22.04 + Slurm 23.11
│   │   └── gen5-slurm2405-ubuntu2404/  # Gen 5: Ubuntu 24.04 + Slurm 24.05
│   └── workloads/                   # Overlay: attaches to existing generation clusters
│       ├── base/                    # S3 bucket, transfer tools, script deploy
│       ├── scenario1-compute/       # Spack + GROMACS install
│       ├── scenario2-roda/          # S3 read policy + results bucket
│       ├── scenario3-ephemeral-efs/ # EFS lifecycle IAM policies
│       ├── scenario3-wrapper/       # efs-sbatch drop-in wrapper deploy
│       ├── scenario3-prolog-epilog/ # EFS prolog/epilog + slurm.conf patch
│       ├── scenario4-ephemeral-fsx/ # FSx + S3 policies, service-linked role
│       ├── scenario4-wrapper/       # fsx-sbatch drop-in wrapper deploy
│       ├── scenario4-prolog-epilog/ # FSx prolog/epilog + slurm.conf patch
│       └── scenario4-burst-buffer/  # Lua burst buffer plugin deploy
│
├── configs/
│   ├── gen1-slurm2205-rocky8/       # Gen 1 config templates (Rocky 8)
│   ├── gen2-slurm2311-rocky9/       # Gen 2 config templates (Rocky 9)
│   ├── gen3-slurm2405-rocky10/      # Gen 3 config templates (Rocky 10)
│   ├── gen4-slurm2311-ubuntu2204/   # Gen 4 config templates (Ubuntu 22.04)
│   └── gen5-slurm2405-ubuntu2404/   # Gen 5 config templates (Ubuntu 24.04)
│
├── scripts/
│   ├── check-quotas.sh              # Pre-flight AWS quota check
│   ├── validate-cluster.sh          # Post-deploy health check (40 checks)
│   ├── demo-burst.sh                # Interactive burst demo (run as alice via SSH)
│   ├── teardown.sh                  # Graceful cluster shutdown + terraform destroy
│   ├── userdata/                    # Cloud-init scripts for each node type
│   │   ├── head-node-init.sh.tpl
│   │   ├── compute-node-init.sh.tpl
│   │   └── burst-node-init.sh.tpl
│   └── workloads/                   # Workloads overlay scripts (deployed to EFS)
│       ├── install-transfer-tools.sh  # rclone, s5cmd, Mountpoint
│       ├── install-spack.sh           # Spack + Lmod via AWS binary cache
│       ├── install-gromacs.sh         # GROMACS via Spack
│       ├── lib/
│       │   ├── efs-lifecycle.sh       # EFS create/wait/destroy helpers
│       │   └── fsx-lifecycle.sh       # FSx Lustre create/wait/flush/destroy helpers
│       └── jobs/
│           ├── scenario1/             # GROMACS job scripts
│           ├── scenario2/             # RODA data access job scripts
│           ├── scenario3/             # Ephemeral EFS: chain, wrapper, prolog/epilog
│           └── scenario4/             # Ephemeral FSx: chain, wrapper, prolog/epilog, BB
│
└── ami/
    ├── rocky8-slurm2205.pkr.hcl     # Packer: Rocky 8 + Slurm 22.05 (Gen 1)
    ├── rocky9-slurm2311.pkr.hcl     # Packer: Rocky 9 + Slurm 23.11 (Gen 2)
    ├── rocky10-slurm2405.pkr.hcl    # Packer: Rocky 10 + Slurm 24.05 (Gen 3)
    ├── ubuntu2204-slurm2311.pkr.hcl # Packer: Ubuntu 22.04 + Slurm 23.11 (Gen 4)
    └── ubuntu2404-slurm2405.pkr.hcl # Packer: Ubuntu 24.04 + Slurm 24.05 (Gen 5)

Getting Started

Before deploying, check your AWS quota headroom — a low vCPU quota is the most common reason a deploy fails partway through:

bash scripts/check-quotas.sh --profile aws --region us-west-2

See docs/prerequisites.md for full requirements and how to request quota increases. See docs/generations.md to choose the right generation for your customer.
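
To get a feel for the numbers before running the quota check, the sketch below estimates the vCPU footprint of a cluster. It assumes every node is an m7a.2xlarge (8 vCPUs each) and uses a burst ceiling of 10 nodes purely as an example; the function name is hypothetical and is not part of check-quotas.sh.

```shell
# vcpus_needed <static_nodes> <max_burst_nodes> <vcpus_per_instance>
# Prints the total vCPUs the cluster can consume at full burst.
vcpus_needed() {
  echo $(( ($1 + $2) * $3 ))
}

# 1 head node + 4 static compute nodes, plus an assumed ceiling of
# 10 burst nodes, all m7a.2xlarge (8 vCPUs each).
vcpus_needed 5 10 8   # prints 120

# To compare against the account limit, one option is Service Quotas.
# The quota code below is believed to be the "Running On-Demand Standard
# instances" vCPU quota; verify it in the console before relying on it:
#   aws service-quotas get-service-quota --service-code ec2 \
#     --quota-code L-1216C47A --query Quota.Value \
#     --profile aws --region us-west-2
```

The static footprint alone (head node plus four compute nodes) is 40 vCPUs, so an account with a low default quota can fail before any burst node launches.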

See docs/quickstart.md for the full step-by-step walkthrough with time estimates.

Short version (Gen 1 — the recommended default):

# 1. Build the AMI (~15-20 minutes)
cd ami/
packer build -var "aws_profile=aws" rocky8-slurm2205.pkr.hcl

# 2. Configure and deploy (~5 minutes)
cd terraform/generations/gen1-slurm2205-rocky8/
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars: set key_name and head_node_ami
terraform init && terraform apply

# 3. Wait for cluster init (~10-15 minutes), then connect
ssh -i ~/.ssh/your-key.pem rocky@<head_node_public_ip>
sudo tail -f /var/log/burstlab-init.log   # watch init progress
sinfo                                      # should show local + cloud partitions
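
If you would rather script the final check than eyeball the sinfo output, something like the following could work. It is a minimal sketch: the helper name is hypothetical, and the partition names local and cloud are assumptions based on the comment above, so adjust them to whatever the deployed slurm.conf defines.

```shell
# has_partitions <name>...
# Reads `sinfo -h -o "%P %a %D %T"` lines on stdin and succeeds only if
# every named partition appears at least once.
has_partitions() {
  # sinfo marks the default partition with a trailing "*"; strip it.
  seen="$(awk '{ sub(/\*$/, "", $1); print $1 }' | sort -u)"
  for p in "$@"; do
    if ! echo "$seen" | grep -qxF "$p"; then
      echo "missing partition: $p" >&2
      return 1
    fi
  done
  return 0
}

# Usage on the head node (partition names are assumptions):
#   sinfo -h -o "%P %a %D %T" | has_partitions local cloud && echo "partitions OK"
```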

Slurm Generations

BurstLab provides five complete generations spanning two OS families (RHEL/Rocky and Ubuntu). Each is an independently deployable cluster matching a specific customer environment.

RHEL/Rocky Track

Generation	OS	Slurm	Key Features	When to Use
Gen 1	Rocky 8	22.05.x	Python 3.6 boto3 shim; cgroup v1; FSx ✅	RHEL/Rocky 8, Slurm 22.x — largest installed base
Gen 2	Rocky 9	23.11.x	Python 3.9 native; cgroup v2; FSx ✅	RHEL/Rocky 9, Slurm 23.x
Gen 3	Rocky 10	24.05.x	`cloud_reg_addrs`; cgroup v2 only; Ed25519; FSx requires burstlab-lustre	RHEL/Rocky 10, Slurm 24.x; greenfield

Ubuntu Track

Generation	OS	Slurm	Key Features	When to Use
Gen 4	Ubuntu 22.04	23.11.x	apt/AppArmor; Python 3.10; cgroup v2; FSx requires burstlab-lustre	Ubuntu 22.04, Slurm 23.x; academic/cloud-native/NVIDIA
Gen 5	Ubuntu 24.04	24.05.x	apt/AppArmor; Python 3.12; `cloud_reg_addrs`; FSx requires burstlab-lustre	Ubuntu 24.04, Slurm 24.x; latest LTS

If you do not know the customer's specific OS and Slurm version, start with Gen 1 for RHEL/Rocky customers or Gen 4 for Ubuntu customers. Most HPC teams struggling with cloud bursting today are on Rocky 8 with Slurm 22.05.

FSx Lustre on Gen 3-5: AWS doesn't provide Lustre packages for Rocky 10 or Ubuntu LTS versions. Install Lustre clients from burstlab-lustre to enable FSx support on Gen 3, Gen 4, and Gen 5 clusters. EFS workloads work on all generations without additional setup.

See docs/generations.md for detailed comparison, decision tables, and architectural differences between generations. See README-ubuntu.md for Ubuntu-specific quick start.

Workloads Track

The workloads overlay demonstrates how HPC applications consume and produce data in a cloud bursting environment. It builds on top of any deployed generation cluster without modifying the core infrastructure.

Scenario	Story	Storage
1 — Compute	GROMACS + Spack, no data staging	EFS only
2 — RODA	Read public AWS datasets (NOAA GOES-16)	S3 read-only
3 — Ephemeral EFS	Job-scoped NFS scratch with three lifecycle approaches	EFS ephemeral
4 — Ephemeral FSx	Job-scoped Lustre scratch linked to S3 with three lifecycle approaches	FSx + S3

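Scenarios 3 and 4 support several ways to trigger the storage lifecycle, from explicit to transparent. The examples below use the Scenario 4 (FSx) commands; the burst buffer approach (C) is available for Scenario 4 only: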

Approach	How to submit	User sees
0 — Chain	`bash submit-chain.sh`	Three job IDs with dependencies
A — Wrapper	`fsx-sbatch myjob.sh`	One job ID
B — Prolog/Epilog	`sbatch --comment=fsx:1200 myjob.sh`	One job ID; storage created silently
C — Burst Buffer	`sbatch myjob.sh` (with `#BB` directive)	BF → R → CG state transitions
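
For approach B, the prolog has to turn the job comment into a storage request. The sketch below shows one way that parsing could look. It is not the repo's prolog: the function name is hypothetical, and reading the number as a capacity in GiB is an assumption (1200 GiB happens to be the smallest FSx for Lustre file system size).

```shell
# parse_storage_comment <comment>
# Accepts "fsx:<number>" and prints "fsx <number>".
# Returns 1 for anything else so ordinary jobs pass through untouched.
parse_storage_comment() {
  case "$1" in
    fsx:*) ;;
    *) return 1 ;;
  esac
  kind="${1%%:*}"
  size="${1#*:}"
  # Reject empty or non-numeric sizes.
  case "$size" in
    ''|*[!0-9]*) return 1 ;;
  esac
  echo "$kind $size"
}

parse_storage_comment "fsx:1200"   # prints: fsx 1200

# Inside a prolog, the comment could be fetched with something like:
#   squeue -h -j "$SLURM_JOB_ID" -o %k
```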

# Deploy the base overlay (once per cluster)
cd terraform/workloads/base/
terraform init && terraform apply

# Deploy a scenario and run it
cd terraform/workloads/scenario4-ephemeral-fsx/
terraform init && terraform apply
ssh alice@<head_node_ip>
bash /opt/slurm/etc/workloads/jobs/scenario4/submit-chain.sh
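
The chain approach (approach 0) amounts to three sbatch submissions linked by dependencies. The sketch below shows the general pattern only and is not the contents of submit-chain.sh; the function name and the three script names are placeholders. It assumes teardown should run even when the compute job fails, which is why the last step uses afterany rather than afterok.

```shell
# submit_chain <create_script> <compute_script> <destroy_script>
# Submits three dependent jobs and prints their IDs on one line.
submit_chain() {
  create_id="$(sbatch --parsable "$1")" || return 1
  # --parsable may print "jobid;cluster"; keep only the job ID.
  create_id="${create_id%%;*}"

  # Compute starts only if storage creation succeeded.
  compute_id="$(sbatch --parsable --dependency="afterok:${create_id}" "$2")" || return 1
  compute_id="${compute_id%%;*}"

  # Teardown runs once compute ends, whether it succeeded or failed.
  destroy_id="$(sbatch --parsable --dependency="afterany:${compute_id}" "$3")" || return 1
  destroy_id="${destroy_id%%;*}"

  echo "$create_id $compute_id $destroy_id"
}

# Usage (script names are placeholders):
#   submit_chain create-fs.sh compute.sh destroy-fs.sh
```

This is why the user sees three job IDs with this approach: each stage is an ordinary Slurm job, and the scheduler enforces the ordering.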

See docs/workloads/overview.md for scenario selection, storage tier decision matrix, and granularity modes (per-job, per-array, per-campaign). See docs/workloads/transparent-lifecycle.md for a full comparison of the three transparent lifecycle approaches.

Why BurstLab Exists

On-prem HPC environments share a common set of problems when attempting cloud bursting for the first time:

Config drift: slurm.conf has diverged between the head node and login nodes
Missing plugins: the serializer/json plugin is absent in some Slurm 22.05 builds, preventing slurmctld from starting
SelectType mismatches: select/linear on one node, select/cons_tres on another
Broken accounting: slurmdbd not running or not configured, which Plugin v2 requires
IAM gaps: iam:PassRole missing, preventing EC2 Fleet from launching burst instances

BurstLab eliminates the "can we even get it working" phase. The Terraform and config templates represent known-good configurations for each Slurm generation. Once the AMI is built, an SA can deploy a matching cluster in under 30 minutes and demonstrate working cloud bursting before a customer engagement even begins.
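
Two of the problems listed above, config drift and SelectType mismatches, can be spotted by comparing slurm.conf copies from different nodes. The helpers below are a minimal sketch with hypothetical names; they assume each setting sits on its own Key=Value line starting in the first column, and they ignore Include files.

```shell
# conf_value <key> <slurm.conf path>
# Prints the value of the last uncommented "Key=Value" line.
# Keys are matched case-insensitively, as Slurm does.
conf_value() {
  awk -F= -v key="$1" '
    /^[[:space:]]*#/ { next }
    tolower($1) == tolower(key) { sub(/^[^=]*=/, ""); val = $0 }
    END { if (val != "") print val }
  ' "$2"
}

# same_setting <key> <file_a> <file_b>: succeeds only if both files agree.
same_setting() {
  [ "$(conf_value "$1" "$2")" = "$(conf_value "$1" "$3")" ]
}

# files_match <file_a> <file_b>: succeeds only if the files are byte-identical.
files_match() {
  cmp -s "$1" "$2"
}

# Usage, after copying a login node's slurm.conf to /tmp/login-slurm.conf:
#   same_setting SelectType /opt/slurm/etc/slurm.conf /tmp/login-slurm.conf \
#     || echo "SelectType differs between nodes"
```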

Design Principles

Correctness over cleverness. Every config file should be something an HPC sysadmin can read and understand. No magic.
Ephemeral by default. terraform destroy cleans up everything. No orphaned resources.
Match reality. Rocky Linux 8 with the same repo and package constraints customers face — not some idealized image.
Configs are the product. The IaC is scaffolding. The real value is the known-good slurm.conf, partitions.json, IAM policies, and security groups for each generation.
Document the why. Every directive, every AWS resource, every design choice has an explanation. The code is the documentation.

Documentation

Doc	Audience	Contents
prerequisites.md	Everyone	AWS quota requirements and pre-flight check
quickstart.md	Everyone	Step-by-step deploy with time estimates (Gen 1)
generations.md	Everyone	Why five generations exist; which to choose
README-ubuntu.md	Ubuntu users	Ubuntu-specific quick start (Gen 4 & 5)
slurm-intro.md	Everyone	Slurm concepts and commands from zero
architecture.md	SAs, technical customers	Network, EFS, NAT, IAM deep dive
slurm-gen1-deep-dive.md	SAs, HPC admins	Every slurm.conf directive for Gen 1
slurm-gen2-deep-dive.md	SAs, HPC admins	Gen 2 config changes from Gen 1
slurm-gen3-deep-dive.md	SAs, HPC admins	Gen 3 config changes, `cloud_reg_addrs`
plugin-v2-setup.md	SAs, HPC admins	Plugin v2 setup, configs, debugging
sa-guide.md	SAs	How to run a customer demo
workloads/overview.md	SAs	Workloads overlay: scenario guide, storage tiers
workloads/scenario1-compute.md	SAs	GROMACS + Spack demo
workloads/scenario2-roda.md	SAs	RODA public datasets, s5cmd/rclone/Mountpoint
workloads/user-guide.md	Cluster users	How to use fsx-sbatch, efs-sbatch, fsx-list/restore/purge
workloads/scenario3-ephemeral-efs.md	SAs	Ephemeral EFS: chain, wrapper, prolog/epilog
workloads/scenario4-ephemeral-fsx.md	SAs	Ephemeral FSx Lustre + S3: chain, wrapper, prolog/epilog, burst buffer
workloads/transparent-lifecycle.md	SAs	Approach comparison: chain vs wrapper vs prolog/epilog vs burst buffer

Contributing and Extending

BurstLab is intentionally structured for extension:

New Slurm version: Copy configs/gen1-slurm2205-rocky8/ to a new directory, update the config templates, then add a matching Packer template under ami/ and a Terraform generation directory under terraform/generations/.
New OS: Swap the Packer source AMI and update the repo-fix steps in the init scripts.
New instance type: Change instance types in terraform.tfvars and update the CPU/memory values in partitions.json.tpl and slurm.conf.tpl to match.
Spot instances: Set "PurchasingOption": "spot" in partitions.json.tpl and add a SpotOptions block with your interruption strategy.
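
When changing instance types, the CPU and memory values in the two templates have to agree with what slurmd detects on the instance. The sketch below derives a starting point from the instance's vCPU count and memory. It is only an estimate: the function name is hypothetical, and the 5% memory reserve is an assumption meant to keep RealMemory below what the OS reports. Running slurmd -C on a booted instance gives the authoritative values.

```shell
# slurm_specs <vcpus> <mem_gib> [reserve_pct]
# Prints CPUs and RealMemory (in MB) with a safety margin on memory.
slurm_specs() {
  vcpus="$1"
  mem_gib="$2"
  reserve_pct="${3:-5}"
  real_mem=$(( mem_gib * 1024 * (100 - reserve_pct) / 100 ))
  echo "CPUs=${vcpus} RealMemory=${real_mem}"
}

# m7a.2xlarge: 8 vCPUs, 32 GiB
slurm_specs 8 32   # prints: CPUs=8 RealMemory=31129
```

Erring low on RealMemory is the safer direction, since a node that registers with less memory than slurm.conf declares is marked invalid by the controller.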
