feat: Add ESMFold protein structure prediction job bundle#229

Merged
karthikbekalp merged 2 commits into aws-deadline:mainline from folouiseAWS:feat/esmfold-predict
May 29, 2026
Conversation

@folouiseAWS
Contributor

Adds a Deadline Cloud sample for protein folding via Meta's ESMFold v1 (facebook/esmfold_v1, MIT). Demonstrates a 4-step splitter-then-fanout pipeline pattern (SplitFasta -> Fold -> Render -> Validate) for variable-count GPU ML inference workloads on a service-managed fleet.

The bundle:

  • Parses a user FASTA, validates sequences, round-robins them across N GPU tasks longest-first for balanced runtimes
  • Runs ESMFold inference with three layers of validation (structural sanity, mean pLDDT self-consistency, and optional TM-score / RMSD / pLDDT calibration vs experimental references)
  • Renders matplotlib CA-trace structures colored by per-residue pLDDT
  • Produces a per-sequence calibration plot showing model confidence vs actual error against ground-truth structures (when refs are provided)

Reuses the splitter manifest pattern from copy_s3_prefix_to_job_attachments and the conda queue-environment pattern shared across ML bundles.

Fixes: N/A — new sample, not tied to a tracked issue

What was the problem/requirement? (What/Why)

The samples repo had ML/GPU bundles for image generation (flux2_klein_lora) and 3D reconstruction (gsplat_pipeline) but no example demonstrating the splitter-then-fanout
pattern for variable-count GPU inference workloads. Protein structure prediction is a clean fit: the input is a FASTA file with N records (where N is unknown until the

  1. SplitFasta (1 task, CPU): parses the input FASTA, validates each sequence (length and amino-acid alphabet), and round-robins the records across Parallelism per-task
    manifest files in a shared workspace directory. Sorts longest-first so per-task runtimes balance.
  2. Fold (Parallelism tasks, GPU): loads facebook/esmfold_v1 from HuggingFace, runs inference on each sequence in the assigned manifest, validates the output PDB (atom
    count, NaN check, pLDDT range), and emits one .pdb plus a summary.json per sequence. Uses the same HF cache pattern (HF_HOME pointed at OutputDir/.hf_cache) as other ML
    bundles.
  3. Render (1 task, CPU): produces a matplotlib backbone trace per predicted structure, colored by per-residue pLDDT confidence.
  4. Validate (1 task, CPU, optional): when a ReferencePdbDir is provided, computes TM-score (via biotite's native tm_score), RMSD, and a per-residue pLDDT/error Pearson
    correlation against each experimental reference. Writes a validation.csv and a per-sequence calibration.png.
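
The Fold step's output checks (atom count, NaN check, pLDDT range) could look roughly like the sketch below. This is not the code from the bundle: the function name `check_pdb_text` is hypothetical, the atom-count check is simplified to "one CA atom per input residue", and it assumes pLDDT is stored in the PDB B-factor column on a 0 to 100 scale.

```python
import math


def check_pdb_text(pdb_text, expected_residues):
    """Return a list of problems found in a predicted PDB (empty list = OK).

    Hypothetical helper. Assumes fixed-column PDB ATOM records with
    pLDDT in the B-factor field (columns 61-66) on a 0-100 scale.
    """
    problems = []
    ca_count = 0
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        try:
            # x, y, z occupy columns 31-38, 39-46, 47-54 (1-indexed).
            coords = [float(line[i:i + 8]) for i in (30, 38, 46)]
            plddt = float(line[60:66])
        except ValueError:
            problems.append("unparseable ATOM record")
            continue
        # NaN check: float("nan") parses fine, so test finiteness explicitly.
        if not all(math.isfinite(c) for c in coords):
            problems.append("non-finite coordinate")
        # pLDDT range check; a NaN pLDDT also fails this comparison.
        if not 0.0 <= plddt <= 100.0:
            problems.append("pLDDT out of range")
        if line[12:16].strip() == "CA":
            ca_count += 1
    # Simplified atom-count check: one CA per residue of the input sequence.
    if ca_count != expected_residues:
        problems.append(
            f"expected {expected_residues} CA atoms, found {ca_count}"
        )
    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to record every failure for a sequence in its summary.json.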

The splitter + manifest pass-through pattern is borrowed from copy_s3_prefix_to_job_attachments. The conda-queue-environment pattern is shared with the other GPU/ML
bundles.
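
As a rough illustration of the splitter step, here is a minimal sketch of validating FASTA records and dealing them round-robin, longest first, into `Parallelism` manifests. The function names, the 20-letter alphabet, and the `max_len` default are assumptions for illustration; the bundle's actual limits and manifest file format are not shown here.

```python
from typing import List, Tuple

# Assumption: only the 20 standard amino acids are accepted.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (seq_id, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Use the first whitespace-delimited token as the sequence id.
            header = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid(seq: str, max_len: int = 1024) -> bool:
    """Length and alphabet check (max_len is an illustrative default)."""
    return 0 < len(seq) <= max_len and set(seq) <= VALID_AA


def split_round_robin(records, parallelism: int):
    """Sort longest-first, then deal records across `parallelism` manifests."""
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    manifests = [[] for _ in range(parallelism)]
    for i, rec in enumerate(ordered):
        manifests[i % parallelism].append(rec)
    return manifests
```

Dealing the sorted records one at a time means each manifest receives a mix of long and short sequences, which is what keeps per-task runtimes roughly balanced. In the bundle each manifest would then be written to its own file in the shared workspace directory for the matching Fold task to read.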

What is the impact of this change?

  • Adds a new Deadline Cloud sample for scientific GPU compute with no impact on existing bundles.
  • All inputs/outputs go through standard job attachments; no new IAM permissions required.
  • The bundle's CondaPackages resolves cleanly against the same cuda_farm CloudFormation farm template the other GPU samples target.

How was this change tested?

Submitted to a cuda_farm queue (g5 fleet, A10G workers) and iterated through real failures and fixes:

  • Verified end-to-end pipeline. Folded the demo set (1L2Y, 2JOF, 1VII) on the farm. All four steps green:
    SplitFasta → SUCCEEDED
    Fold → SUCCEEDED (5.12 s for 1VII, ~1 s for trp-cages)
    Render → SUCCEEDED (3 PNGs produced)
    Validate → SUCCEEDED (validation.csv with TM-score + Pearson r)
  • Verified validation against ground truth. Fetched experimental PDBs from RCSB for the three demo sequences, ran with ReferencePdbDir set, confirmed:
    seq_id,tm_score,rmsd,aligned_residues,mean_plddt,plddt_error_pearson
    1l2y,0.5589,0.549,20,85.54,-0.2767
    1vii,0.6555,2.170,36,86.90,-0.5888
    2jof,0.4530,0.967,20,87.25,-0.2876
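  • RMSD < 1 Å for both trp-cages indicates the predicted backbones closely match the experimental structures. Pearson r is negative for all three, meaning residues with higher pLDDT tend to have lower error, which is the expected direction for a calibrated confidence score.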
  • Visually verified the rendered output by downloading the predicted structures and comparing CA-traces against experimental references. The two images embedded in the
    README (1vii_structure.png, 1vii_calibration.png) are real outputs from a successful farm run, copied into .readme_images/ for documentation.
  • Iterated through real bugs during testing and fixed each: a WorkspacePath: dataFlow issue blocking manifest pass-through, a pLDDT scale conversion ([0,1] → [0,100])
    needed for AlphaFold-convention rendering, and a switch from a custom TM-score implementation to biotite's native one for correctness on alignment-aware residue matching.
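
To make the pLDDT scale conversion and the calibration metric concrete, here is a minimal pure-Python sketch. It is not the bundle's code: the function names are hypothetical, and the "max value <= 1.0" rule for detecting a [0,1] scale is an assumed heuristic.

```python
import math


def to_percent_scale(plddt):
    """Convert pLDDT from [0, 1] to [0, 100] if it looks unscaled.

    Assumption: values whose maximum is <= 1.0 are on the [0, 1] scale.
    """
    values = [float(v) for v in plddt]
    if values and max(values) <= 1.0:
        return [v * 100.0 for v in values]
    return values


def pearson_r(xs, ys):
    """Pearson correlation; returns NaN when it is undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers (not from the farm run): confidence falls as error rises.
plddt = to_percent_scale([0.9, 0.8, 0.5, 0.4])
error_angstrom = [0.3, 0.5, 1.8, 2.4]
r = pearson_r(plddt, error_angstrom)
print("negative correlation" if r < 0 else "non-negative correlation")
```

A negative r is the desired outcome here, because pLDDT measures confidence while the second series measures error. When reporting the value, `math.isnan(r)` is a more readable NaN test than comparing a value with itself.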

Was this change documented?

Yes — README.md covers prerequisites (farm setup, VRAM table, service quotas), submission examples (smoke test, with-validation, custom inputs), parameter reference,
output layout, validation strategy (three layers including the calibration plot), and notes on weight caching and monomer-only scope. sample_inputs/README.md documents
the demo FASTA contents and how to fetch reference PDBs from RCSB.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Louise Fox <208544511+folouiseAWS@users.noreply.github.com>
@folouiseAWS folouiseAWS requested a review from a team as a code owner May 29, 2026 17:27
"tm_score": round(tm, 4),
"rmsd": round(rmsd, 3),
"aligned_residues": int(len(ref_indices)),
"plddt_error_pearson": round(pearson, 4) if pearson == pearson else float("nan"),
@github-actions github-actions bot added the waiting-on-maintainers label May 29, 2026
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

This import has not been used.

Comment on lines +118 to +119
import torch
from transformers import AutoTokenizer, EsmForProteinFolding
Contributor

Nit: better to put at top of file if possible


Side effect: writes a per-residue calibration plot to plot_path.

Uses biotite >=1.2 native tm_score + superimpose_structural_homologs
Contributor

Nit: put this version requirement in the conda packages string

Comment on lines +37 to +42
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from biotite.structure.io.pdb import PDBFile
from biotite.structure import superimpose_structural_homologs, tm_score
Contributor

Nit: put these imports at top of file if possible, same below

@karthikbekalp karthikbekalp enabled auto-merge (squash) May 29, 2026 22:13
@karthikbekalp karthikbekalp merged commit 35aa89c into aws-deadline:mainline May 29, 2026
4 checks passed