feat: Add ESMFold protein structure prediction job bundle by folouiseAWS · Pull Request #229 · aws-deadline/deadline-cloud-samples

folouiseAWS · 2026-05-29T17:27:25Z

Adds a Deadline Cloud sample for protein folding via Meta's ESMFold v1 (facebook/esmfold_v1, MIT). Demonstrates a 4-step splitter-then-fanout pipeline pattern (SplitFasta -> Fold -> Render -> Validate) for variable-count GPU ML inference workloads on a service-managed fleet.

The bundle:

- Parses a user FASTA, validates sequences, round-robins them across N GPU tasks longest-first for balanced runtimes
- Runs ESMFold inference with three layers of validation (structural sanity, mean pLDDT self-consistency, and optional TM-score / RMSD / pLDDT calibration vs experimental references)
- Renders matplotlib CA-trace structures colored by per-residue pLDDT
- Produces a per-sequence calibration plot showing model confidence vs actual error against ground-truth structures (when refs are provided)

Reuses the splitter manifest pattern from copy_s3_prefix_to_job_attachments and the conda queue-environment pattern shared across ML bundles.
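The longest-first round-robin split described above can be sketched roughly as follows. This is a minimal illustration, not the bundle's actual splitter: the function names, the 20-letter alphabet, and the `MAX_LENGTH` cap are assumptions made for the example.

```python
from typing import List, Tuple

# Assumptions for illustration only: the bundle's real alphabet and
# length limit are not shown in this PR.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MAX_LENGTH = 1024  # hypothetical cap

Record = Tuple[str, str]  # (sequence id, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Return (id, sequence) pairs from FASTA text."""
    records: List[Record] = []
    current_id = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records.append((current_id, "".join(chunks).upper()))
            # Use the first whitespace-delimited header token as the id.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"seq_{len(records)}"
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks).upper()))
    return records


def validate(record: Record) -> None:
    """Raise ValueError for empty, over-long, or non-amino-acid sequences."""
    seq_id, seq = record
    if not seq:
        raise ValueError(f"{seq_id}: empty sequence")
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"{seq_id}: length {len(seq)} exceeds {MAX_LENGTH}")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"{seq_id}: invalid residues {bad}")


def split_longest_first(records: List[Record], parallelism: int) -> List[List[Record]]:
    """Sort by length (descending), then deal records out round-robin."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    manifests: List[List[Record]] = [[] for _ in range(parallelism)]
    for i, record in enumerate(ordered):
        manifests[i % parallelism].append(record)
    return manifests
```

Dealing the sorted records out one at a time means each task receives a mix of long and short sequences, so no single GPU task ends up with all of the expensive ones.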

Fixes: N/A — new sample, not tied to a tracked issue

What was the problem/requirement? (What/Why)

The samples repo had ML/GPU bundles for image generation (flux2_klein_lora) and 3D reconstruction (gsplat_pipeline) but no example demonstrating the splitter-then-fanout
pattern for variable-count GPU inference workloads. Protein structure prediction is a clean fit: the input is a FASTA file with N records (where N is unknown until the

- SplitFasta (1 task, CPU): parses the input FASTA, validates each sequence (length and amino-acid alphabet), and round-robins the records across Parallelism per-task manifest files in a shared workspace directory. Sorts longest-first so per-task runtimes balance.
- Fold (Parallelism tasks, GPU): loads facebook/esmfold_v1 from HuggingFace, runs inference on each sequence in the assigned manifest, validates the output PDB (atom count, NaN check, pLDDT range), and emits one .pdb plus a summary.json per sequence. Uses the same HF cache pattern (HF_HOME pointed at OutputDir/.hf_cache) as other ML bundles.
- Render (1 task, CPU): produces a matplotlib backbone trace per predicted structure, colored by per-residue pLDDT confidence.
- Validate (1 task, CPU, optional): when a ReferencePdbDir is provided, computes TM-score (via biotite's native tm_score), RMSD, and a per-residue pLDDT/error Pearson correlation against each experimental reference. Writes a validation.csv and a per-sequence calibration.png.
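The structural sanity checks in the Fold step (atom count, NaN check, pLDDT range) could look something like the sketch below. It is not the bundle's code: `check_pdb_text` is a hypothetical helper, and it assumes pLDDT has already been written into the B-factor column on a 0 to 100 scale, using the standard fixed-column PDB ATOM layout.

```python
import math
from typing import List


def check_pdb_text(pdb_text: str, expected_residues: int) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []
    ca_count = 0
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        try:
            # Fixed PDB columns: x 31-38, y 39-46, z 47-54, B-factor 61-66.
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            plddt = float(line[60:66])  # assumption: pLDDT stored as B-factor
        except ValueError:
            problems.append(f"unparseable ATOM record: {line[:30]!r}")
            continue
        if not all(math.isfinite(v) for v in (x, y, z)):
            problems.append(f"non-finite coordinate: {line[:30]!r}")
        # NaN fails this comparison too, so it is reported as out of range.
        if not 0.0 <= plddt <= 100.0:
            problems.append(f"pLDDT out of [0, 100]: {line[:30]!r}")
        if line[12:16].strip() == "CA":
            ca_count += 1
    if ca_count != expected_residues:
        problems.append(f"expected {expected_residues} CA atoms, found {ca_count}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to write all findings into a per-sequence summary before failing the task.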

The splitter + manifest pass-through pattern is borrowed from copy_s3_prefix_to_job_attachments. The conda-queue-environment pattern is shared with the other GPU/ML
bundles.

What is the impact of this change?

- Adds a new Deadline Cloud sample for scientific GPU compute with no impact on existing bundles.
- All inputs/outputs go through standard job attachments; no new IAM permissions required.
- The bundle's CondaPackages resolves cleanly against the same cuda_farm CloudFormation farm template the other GPU samples target.

How was this change tested?

Submitted to a cuda_farm queue (g5 fleet, A10G workers) and iterated through real failures and fixes:

Verified end-to-end pipeline. Folded the demo set (1L2Y, 2JOF, 1VII) on the farm. All four steps green:
SplitFasta → SUCCEEDED
Fold → SUCCEEDED (5.12 s for 1VII, ~1 s for trp-cages)
Render → SUCCEEDED (3 PNGs produced)
Validate → SUCCEEDED (validation.csv with TM-score + Pearson r)
Verified validation against ground truth. Fetched experimental PDBs from RCSB for the three demo sequences, ran with ReferencePdbDir set, confirmed:
seq_id,tm_score,rmsd,aligned_residues,mean_plddt,plddt_error_pearson
1l2y,0.5589,0.549,20,85.54,-0.2767
1vii,0.6555,2.170,36,86.90,-0.5888
2jof,0.4530,0.967,20,87.25,-0.2876
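RMSD < 1 Å for both trp-cages (1L2Y, 2JOF) indicates the predictions closely match the experimental structures. Pearson r between per-residue pLDDT and per-residue error is negative for all three, meaning residues with higher confidence tend to have lower error, which is consistent with calibrated model confidence.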
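To make the calibration metric concrete, here is a small sketch of how a pLDDT/error Pearson correlation and an RMSD can be computed from CA coordinates. It assumes the predicted and reference coordinates are already superimposed and index-aligned (the PR delegates that step to biotite); the function names and the toy numbers are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def per_residue_error(pred_ca: Sequence[Point], ref_ca: Sequence[Point]) -> List[float]:
    """CA-CA distance per residue for superimposed, index-aligned structures."""
    return [math.dist(p, r) for p, r in zip(pred_ca, ref_ca)]


def rmsd(pred_ca: Sequence[Point], ref_ca: Sequence[Point]) -> float:
    """Root-mean-square of the per-residue distances."""
    errors = per_residue_error(pred_ca, ref_ca)
    if not errors:
        raise ValueError("no aligned residues")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; NaN when undefined (too few points or zero variance)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data: confidence drops as the error grows, so r should be negative.
    plddt = [92.0, 88.0, 71.0, 55.0]
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    pred = [(0.1, 0.0, 0.0), (1.2, 0.0, 0.0), (2.9, 0.0, 0.0), (4.8, 0.0, 0.0)]
    print(pearson_r(plddt, per_residue_error(pred, ref)))
```

A negative value is the desired sign here because pLDDT measures confidence while the second variable measures error.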
Visually verified the rendered output by downloading the predicted structures and comparing CA-traces against experimental references. The two images embedded in the
README (1vii_structure.png, 1vii_calibration.png) are real outputs from a successful farm run, copied into .readme_images/ for documentation.
Iterated through real bugs during testing and fixed each: a WorkspacePath: dataFlow issue blocking manifest pass-through, a pLDDT scale conversion ([0,1] → [0,100])
needed for AlphaFold-convention rendering, and a switch from a custom TM-score implementation to biotite's native one for correctness on alignment-aware residue matching.
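The pLDDT scale fix mentioned above might be handled along these lines. This is a sketch under assumptions: `to_percent` uses a simple "all values at most 1.0" heuristic to detect the fractional scale, and the four color bands use the thresholds and hex values commonly associated with the AlphaFold database palette, quoted from memory. Whether the bundle uses discrete bands or a continuous colormap is not stated in this PR.

```python
from typing import List, Sequence

# Assumed palette: (lower threshold, hex color), highest band first.
PLDDT_BANDS = [
    (90.0, "#0053D6"),  # very high confidence
    (70.0, "#65CBF3"),  # confident
    (50.0, "#FFDB13"),  # low
    (0.0, "#FF7D45"),   # very low
]


def to_percent(plddt: Sequence[float]) -> List[float]:
    """Rescale to [0, 100] when the values look like fractions in [0, 1]."""
    if plddt and max(plddt) <= 1.0:
        return [100.0 * v for v in plddt]
    return [float(v) for v in plddt]


def band_color(value: float) -> str:
    """Map a pLDDT value on the 0-100 scale to its band color."""
    for threshold, color in PLDDT_BANDS:
        if value >= threshold:
            return color
    return PLDDT_BANDS[-1][1]


def residue_colors(plddt: Sequence[float]) -> List[str]:
    """Per-residue colors for either input scale."""
    return [band_color(v) for v in to_percent(plddt)]
```

Normalizing once at the boundary keeps the rendering and validation code working on a single scale, which avoids plots where every residue falls into the lowest band.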

Was this change documented?

Yes — README.md covers prerequisites (farm setup, VRAM table, service quotas), submission examples (smoke test, with-validation, custom inputs), parameter reference,
output layout, validation strategy (three layers including the calibration plot), and notes on weight caching and monomer-only scope. sample_inputs/README.md documents
the demo FASTA contents and how to fetch reference PDBs from RCSB.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Louise Fox <208544511+folouiseAWS@users.noreply.github.com>

+        "tm_score": round(tm, 4),
+        "rmsd": round(rmsd, 3),
+        "aligned_residues": int(len(ref_indices)),
+        "plddt_error_pearson": round(pearson, 4) if pearson == pearson else float("nan"),
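The `pearson == pearson` test in the last line works because NaN is the only float value that does not compare equal to itself. A more explicit equivalent using `math.isnan` might look like this; `rounded_or_nan` is a hypothetical helper name, not something in the bundle.

```python
import math


def rounded_or_nan(value: float, digits: int = 4) -> float:
    """Round a float to `digits` places, passing NaN through unchanged."""
    return float("nan") if math.isnan(value) else round(value, digits)
```

Spelling the intent out with `math.isnan` tends to read more clearly to reviewers and to static analyzers that flag comparisons of a value with itself.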


leon-li-inspire · 2026-05-29T21:03:21Z

+    import matplotlib
+    matplotlib.use("Agg")
+    import matplotlib.pyplot as plt
+    from matplotlib.collections import LineCollection


This import has not been used.

crowecawcaw · 2026-05-29T21:44:27Z

+    import torch
+    from transformers import AutoTokenizer, EsmForProteinFolding


Nit: better to put at top of file if possible

crowecawcaw · 2026-05-29T21:45:43Z

+
+    Side effect: writes a per-residue calibration plot to plot_path.
+
+    Uses biotite >=1.2 native tm_score + superimpose_structural_homologs


Nit: put this version requirement in the conda packages string

crowecawcaw · 2026-05-29T21:46:01Z

+    import numpy as np
+    import matplotlib
+    matplotlib.use("Agg")
+    import matplotlib.pyplot as plt
+    from biotite.structure.io.pdb import PDBFile
+    from biotite.structure import superimpose_structural_homologs, tm_score


Nit: put these imports at top of file if possible, same below

folouiseAWS requested a review from a team as a code owner May 29, 2026 17:27

github-advanced-security AI found potential problems May 29, 2026

Comment thread job_bundles/esmfold_predict/scripts/validate.py

github-actions Bot added the waiting-on-maintainers Waiting on the maintainers to review. label May 29, 2026

leon-li-inspire reviewed May 29, 2026

leon-li-inspire approved these changes May 29, 2026

crowecawcaw reviewed May 29, 2026

crowecawcaw approved these changes May 29, 2026

Merge branch 'mainline' into feat/esmfold-predict

307e1a8

karthikbekalp enabled auto-merge (squash) May 29, 2026 22:13

karthikbekalp merged commit 35aa89c into aws-deadline:mainline May 29, 2026
4 checks passed

karthikbekalp merged 2 commits into aws-deadline:mainline from folouiseAWS:feat/esmfold-predict

5 participants
