feat: Add ESMFold protein structure prediction job bundle #229
Merged
karthikbekalp merged 2 commits on May 29, 2026
Adds a Deadline Cloud sample for protein folding via Meta's ESMFold v1 (facebook/esmfold_v1, MIT). Demonstrates a 4-step splitter-then-fanout pipeline pattern (SplitFasta -> Fold -> Render -> Validate) for variable-count GPU ML inference workloads on a service-managed fleet.

The bundle:
- Parses a user FASTA, validates sequences, round-robins them across N GPU tasks longest-first for balanced runtimes
- Runs ESMFold inference with three layers of validation (structural sanity, mean pLDDT self-consistency, and optional TM-score / RMSD / pLDDT calibration vs experimental references)
- Renders matplotlib CA-trace structures colored by per-residue pLDDT
- Produces a per-sequence calibration plot showing model confidence vs actual error against ground-truth structures (when refs are provided)

Reuses the splitter manifest pattern from copy_s3_prefix_to_job_attachments and the conda queue-environment pattern shared across ML bundles.

Signed-off-by: Louise Fox <208544511+folouiseAWS@users.noreply.github.com>
"tm_score": round(tm, 4),
"rmsd": round(rmsd, 3),
"aligned_residues": int(len(ref_indices)),
"plddt_error_pearson": round(pearson, 4) if pearson == pearson else float("nan"),
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
This import has not been used.
leon-li-inspire approved these changes on May 29, 2026
crowecawcaw reviewed on May 29, 2026
Comment on lines +118 to +119
import torch
from transformers import AutoTokenizer, EsmForProteinFolding
Contributor
Nit: better to put at top of file if possible
crowecawcaw approved these changes on May 29, 2026
Side effect: writes a per-residue calibration plot to plot_path.

Uses biotite >=1.2 native tm_score + superimpose_structural_homologs
Contributor
Nit: put this version requirement in the conda packages string
Comment on lines +37 to +42
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from biotite.structure.io.pdb import PDBFile
from biotite.structure import superimpose_structural_homologs, tm_score
Contributor
Nit: put these imports at top of file if possible, same below
Fixes: N/A — new sample, not tied to a tracked issue
What was the problem/requirement? (What/Why)
The samples repo had ML/GPU bundles for image generation (flux2_klein_lora) and 3D reconstruction (gsplat_pipeline) but no example demonstrating the splitter-then-fanout pattern for variable-count GPU inference workloads. Protein structure prediction is a clean fit: the input is a FASTA file with N records (where N is unknown until the
manifest files in a shared workspace directory. Sorts longest-first so per-task runtimes balance.
count, NaN check, pLDDT range), and emits one .pdb plus a summary.json per sequence. Uses the same HF cache pattern (HF_HOME pointed at OutputDir/.hf_cache) as other ML bundles.
correlation against each experimental reference. Writes a validation.csv and a per-sequence calibration.png.
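As a rough illustration of the SplitFasta assignment described above (sort longest-first, then round-robin across N tasks), here is a minimal Python sketch. The function names `parse_fasta` and `assign_round_robin` are hypothetical, and this is not the bundle's actual splitter, which also validates sequences and produces manifest files.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (seq_id, sequence) tuples."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier")
            # The ID is the first whitespace-delimited token of the header.
            seq_id, chunks = tokens[0], []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def assign_round_robin(records, num_tasks):
    """Sort records longest-first, then deal them across tasks in turn."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    buckets = [[] for _ in range(num_tasks)]
    for i, rec in enumerate(ordered):
        buckets[i % num_tasks].append(rec)
    return buckets


if __name__ == "__main__":
    demo = ">a\nAAAA\n>b some description\nBB\nB\n>c\nCCCCCC\n"
    for task_index, bucket in enumerate(assign_round_robin(parse_fasta(demo), 2)):
        print(task_index, [seq_id for seq_id, _ in bucket])
```

Dealing after sorting places the longest sequences on different tasks, which is one simple way to keep per-task runtimes roughly balanced when runtime grows with sequence length.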
The splitter + manifest pass-through pattern is borrowed from copy_s3_prefix_to_job_attachments. The conda-queue-environment pattern is shared with the other GPU/ML bundles.
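For readers unfamiliar with the Fold step's structural sanity check (count, NaN check, pLDDT range), the following is a minimal sketch of what such a check could look like. It assumes that "count" means one CA coordinate and one pLDDT value per input residue and that pLDDT is on a 0-100 scale; the function name `check_structure` and its arguments are hypothetical and do not come from the bundle.

```python
import numpy as np


def check_structure(ca_coords, plddt, expected_len):
    """Return a list of problem strings; an empty list means all checks passed."""
    problems = []
    ca_coords = np.asarray(ca_coords, dtype=float)
    plddt = np.asarray(plddt, dtype=float)

    # Count check (assumed meaning): one CA atom and one pLDDT per residue.
    if ca_coords.shape != (expected_len, 3):
        problems.append(f"expected {expected_len} CA coordinates, got shape {ca_coords.shape}")
    if plddt.shape != (expected_len,):
        problems.append(f"expected {expected_len} pLDDT values, got shape {plddt.shape}")

    # NaN check: a failed inference can leave NaN in coordinates or scores.
    if np.isnan(ca_coords).any() or np.isnan(plddt).any():
        problems.append("NaN found in coordinates or pLDDT")

    # Range check on the non-NaN scores (0-100 scale is an assumption).
    finite = plddt[~np.isnan(plddt)]
    if finite.size and (finite.min() < 0 or finite.max() > 100):
        problems.append("pLDDT outside the 0-100 range")

    return problems


if __name__ == "__main__":
    print(check_structure(np.zeros((3, 3)), [80.0, 90.0, 70.0], 3))
```

Returning a list of problems rather than raising on the first failure makes it easy to record every issue for a sequence in one place, for example in a per-sequence summary file.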
What is the impact of this change?
How was this change tested?
Submitted to a cuda_farm queue (g5 fleet, A10G workers) and iterated through real failures and fixes:
SplitFasta → SUCCEEDED
Fold → SUCCEEDED (5.12 s for 1VII, ~1 s for trp-cages)
Render → SUCCEEDED (3 PNGs produced)
Validate → SUCCEEDED (validation.csv with TM-score + Pearson r)
seq_id,tm_score,rmsd,aligned_residues,mean_plddt,plddt_error_pearson
1l2y,0.5589,0.549,20,85.54,-0.2767
1vii,0.6555,2.170,36,86.90,-0.5888
2jof,0.4530,0.967,20,87.25,-0.2876
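The `plddt_error_pearson` column above can be read with the help of a small sketch. It assumes the per-residue error is a distance between predicted and reference CA atoms after superposition, which this page does not spell out; the function name is hypothetical. Under that assumption a negative value is the desirable sign, since it means residues with higher confidence tend to have lower error.

```python
import numpy as np


def plddt_error_pearson(plddt, error):
    """Pearson r between per-residue pLDDT and per-residue error.

    Returns NaN when the correlation is undefined (fewer than two
    residues, or either input is constant).
    """
    plddt = np.asarray(plddt, dtype=float)
    error = np.asarray(error, dtype=float)
    if plddt.shape != error.shape:
        raise ValueError("plddt and error must have the same length")
    if plddt.size < 2 or np.std(plddt) == 0 or np.std(error) == 0:
        return float("nan")
    return float(np.corrcoef(plddt, error)[0, 1])


if __name__ == "__main__":
    # Confidence falls exactly as error rises, so r is -1 (up to rounding).
    print(plddt_error_pearson([90, 80, 70, 60], [0.5, 1.0, 1.5, 2.0]))
```

The explicit NaN return mirrors the `pearson == pearson` guard in the diff excerpt earlier on this page, which only rounds the value when it is not NaN.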
README (1vii_structure.png, 1vii_calibration.png) are real outputs from a successful farm run, copied into .readme_images/ for documentation.
needed for AlphaFold-convention rendering, and a switch from a custom TM-score implementation to biotite's native one for correctness on alignment-aware residue matching.
Was this change documented?
Yes: README.md covers prerequisites (farm setup, VRAM table, service quotas), submission examples (smoke test, with-validation, custom inputs), parameter reference, output layout, validation strategy (three layers including the calibration plot), and notes on weight caching and monomer-only scope. sample_inputs/README.md documents the demo FASTA contents and how to fetch reference PDBs from RCSB.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.