A complete, step-by-step guide to reproducing the protein-protein complex structure prediction and relaxation benchmark using the Protein-Protein Docking Benchmark 5.5 (BM5.5).
This document traces every script, validates each step, and provides verification checkpoints so you can confirm correctness at each stage.
- Prerequisites
- Step 0: Download BM5.5 Dataset
- Step 1: Clean PDB Structures
- Step 2: Download FASTA Sequences
- Step 3: Organize FASTAs
- Step 4: Prepare Boltz-1 Input
- Step 5: Run AlphaFold 2.3.2
- Step 6: Run Boltz-1
- Step 7: Organize Predictions for Relaxation
- Step 8: Run Rosetta Relaxation
- Step 9: MolProbity Validation
- Step 10: Collect Metrics
- Pipeline Audit
- Troubleshooting
| Software | Version | Installation |
|---|---|---|
| Python | 3.10+ | System or conda |
| AlphaFold | 2.3.2 | GitHub or cluster module |
| Boltz-1 | 0.4.1 | pip install boltz or cluster module |
| Rosetta | 3.15 | RosettaCommons |
| Phenix | Latest | Phenix |
| Reduce | Latest | Included with Phenix or SBGrid |
| PyMOL | 2.x+ | PyMOL |
| requests | Latest | pip install requests |
# AlphaFold 2.3.2
AF2_MINICONDA=/sb/apps/alphafold232/miniconda3
AF2_REPO=/sb/apps/alphafold232/alphafold
AF2_DATADIR=/csbtmp/alphafold-data.230
# Boltz-1 v0.4.1
BOLTZ_MINICONDA=/sb/apps/boltz1-v0.4.1/miniconda3
# Rosetta 3.15
RELAX=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/source/bin/relax.linuxgccrelease
ROSETTA_DB=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/database
CLEAN_SCRIPT=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/tools/protein_tools/scripts/clean_pdb.py
# MolProbity (SBGrid)
REDUCE=/programs/x86_64-linux/system/sbgrid_bin/reduce
PHENIX=/programs/x86_64-linux/system/sbgrid_bin/phenix.molprobity

Create your working directory:
mkdir -p benchmarking/{merged,cleaned,fasta,data,structures,test}
cd benchmarking

The Protein-Protein Docking Benchmark 5.5 is maintained by the Weng Lab at UMass Medical School.
Download page: https://zlab.umassmed.edu/benchmark/
The benchmark provides:
- Bound structures — crystal structures of the complex
- Unbound structures — individual partner structures crystallized separately
- Metadata — complex categories (Antibody-Antigen, Enzyme-Inhibitor, etc.)
# Download the benchmark tables and structures
# The exact URL may change; check the Weng Lab benchmark page
wget https://zlab.umassmed.edu/benchmark/benchmark5.5.tgz
tar -xzf benchmark5.5.tgz
# The archive contains structures in the format:
# {PDBID}_l_b.pdb — Ligand, bound conformation
# {PDBID}_l_u.pdb — Ligand, unbound conformation
# {PDBID}_r_b.pdb — Receptor, bound conformation
# {PDBID}_r_u.pdb — Receptor, unbound conformation

Naming convention:
| Suffix | Meaning |
|---|---|
| `_l_` | Ligand (smaller binding partner) |
| `_r_` | Receptor (larger binding partner) |
| `_b` | Bound conformation (from the complex crystal) |
| `_u` | Unbound conformation (separate crystal structure) |
For this benchmark, we need the complete complex structures (both chains together). If you downloaded the split bound structures, merge them:
# Move split structures to structures/ for reference
mv benchmark5.5/structures/* structures/
# Create merged complexes from bound structures
for pdb_id in $(ls structures/*_r_b.pdb | sed 's/.*\///' | sed 's/_r_b.pdb//'); do
cat "structures/${pdb_id}_r_b.pdb" "structures/${pdb_id}_l_b.pdb" | \
grep -E '^(ATOM|HETATM|TER)' > "merged/${pdb_id}.pdb"
echo "END" >> "merged/${pdb_id}.pdb"
done

Alternatively, download the full complex PDB files directly from RCSB:
# For each PDB ID in the benchmark
for pdb_id in $(cat benchmark5.5/pdb_list.txt); do
wget -q "https://files.rcsb.org/download/${pdb_id}.pdb" -O "merged/${pdb_id}.pdb"
sleep 0.2 # Be polite to RCSB
done

# Should have 246 PDB files (257 BM5.5 minus 11 excluded Boltz OOM targets)
ls merged/*.pdb | wc -l
# Expected: 246
# Each PDB should have ATOM records with multiple chains
grep -c '^ATOM' merged/1AK4.pdb
# Expected: several thousand lines
# Check chain IDs present
awk '/^ATOM/{print substr($0,22,1)}' merged/1AK4.pdb | sort -u
# Expected: A, D (or similar multi-chain output)

What it does:
- For each PDB in `merged/`, extracts unique chain IDs
- Runs Rosetta's `clean_pdb.py` on each chain individually
- Merges cleaned chains into a single PDB with proper TER records
- Renumbers atom serials sequentially
Why this step exists:
- Rosetta's `clean_pdb.py` standardizes atom naming to Rosetta conventions
- Removes HETATM records (water, ligands, ions)
- Fixes non-standard residue naming
- Required for Rosetta relaxation to work correctly
bash scripts/clean_pdbs.sh merged/ cleaned/ $CLEAN_SCRIPT

Input: merged/*.pdb (raw PDB files from RCSB/BM5.5)
Output: cleaned/{PDBID}.pdb (one file per complex, all chains merged)
Potential issues identified:
- Chain ID `" "` (space) is mapped to `"_"` for Rosetta compatibility — this is correct behavior
- The script uses a temporary directory per PDB and cleans up after — safe
- Atom serial renumbering starts at 1 per file — correct
- Only ATOM and HETATM lines are kept — intentional (removes REMARK, HEADER, etc.)
Edge cases:
- PDBs with no ATOM records are skipped (logged as `[SKIP]`)
- If Rosetta's `clean_pdb.py` fails on a chain, that chain will be missing from the output (check logs)
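The merge-and-renumber behavior described above can be sketched in Python. This is a minimal illustration, not the actual `clean_pdbs.sh`; `merge_chains` is a hypothetical helper:

```python
def merge_chains(chain_pdbs):
    """Merge per-chain PDB texts: sequential atom serials starting at 1,
    a TER record after each chain, and END at the bottom."""
    out, serial = [], 1
    for text in chain_pdbs:
        for line in text.splitlines():
            if line.startswith(("ATOM", "HETATM")):
                # Columns 7-11 hold the atom serial in the fixed-width PDB format
                out.append(f"{line[:6]}{serial:5d}{line[11:]}")
                serial += 1
        out.append("TER")
    out.append("END")
    return "\n".join(out) + "\n"
```

Running it on two single-atom chains yields serials 1 and 2 with a TER between them, matching the verification checks below.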
# Should have same count as merged/
ls cleaned/*.pdb | wc -l
# Expected: same as merged/
# Each cleaned PDB should have TER between chains and END at the bottom
tail -5 cleaned/1AK4.pdb
# Expected: ATOM lines, then TER, then END
# Check that atom serials are sequential starting from 1
head -1 cleaned/1AK4.pdb
# Expected: ATOM 1 ...
# Compare chain count (should be preserved)
awk '/^ATOM/{print substr($0,22,1)}' cleaned/1AK4.pdb | sort -u
# Expected: same chains as in merged/1AK4.pdb

What it does:
- Scans `merged/` for PDB files
- Extracts 4-character PDB IDs from filenames
- Downloads FASTA sequences from RCSB (with fallback to PDBe)
- Handles obsolete entries by looking up replacement PDB IDs
- Logs all results to `fasta_download_log.csv`
Why not extract sequences from the PDB files directly?
- RCSB FASTA includes all chains with proper headers
- Headers contain chain IDs needed for Boltz format conversion
- More reliable than parsing PDB ATOM records for sequence
python3 scripts/download_fastas.py merged/ fasta/

Input: merged/ directory with PDB files
Output: fasta/{PDBID}.fasta + fasta/fasta_download_log.csv
Endpoint priority:
1. `https://www.rcsb.org/fasta/entry/{PDB}` (primary)
2. `https://www.rcsb.org/pdb/download/downloadFastaFiles.do?...` (fallback)
3. `https://www.ebi.ac.uk/pdbe/entry/pdb/{PDB}/fasta` (PDBe fallback)
Retry logic: 2 retries per endpoint with exponential backoff (1.5^i seconds)
Rate limiting: 0.15s delay between PDB IDs (polite to servers)
Obsolete handling: If all endpoints fail, checks the RCSB entry API for a replacement ID
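The endpoint-priority and backoff logic can be sketched as follows. The `fetch_fasta` helper and its injectable `get`/`sleep` hooks are illustrative, not the actual `download_fastas.py` (which uses `requests` directly):

```python
import time

# Endpoint templates, in the documented priority order (second RCSB fallback omitted)
ENDPOINTS = [
    "https://www.rcsb.org/fasta/entry/{pdb}",
    "https://www.ebi.ac.uk/pdbe/entry/pdb/{pdb}/fasta",
]

def fetch_fasta(pdb_id, get, retries=2, sleep=time.sleep):
    """Try each endpoint in order; retry each with exponential backoff (1.5**i s)."""
    for template in ENDPOINTS:
        url = template.format(pdb=pdb_id.lower())
        for attempt in range(retries + 1):
            try:
                text = get(url)
                if text and text.startswith(">"):
                    return text
            except OSError:
                pass  # network error: back off, then retry or move on
            sleep(1.5 ** attempt)
    return None  # caller falls back to the obsolete-entry lookup
```

Injecting `get` and `sleep` keeps the retry policy testable without touching the network.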
Known special cases from our run:
- `1A2K`, `3RVW` — obsolete PDB IDs; FASTAs downloaded from replacement entries (5BXQ, 5VPG)
- 4 non-standard IDs: `BAAD`, `BOYV`, `BP57`, `CP57` — sequences extracted from ATOM records
DNA/RNA exclusion policy: DNA/RNA chains are excluded from all prediction FASTAs. BM5.5 is a protein-protein docking benchmark; neither AlphaFold nor Boltz supports nucleic acid prediction. Targets with DNA in the crystal structure (3P57, 1H9D) have their FASTAs filtered to protein chains only.
# Check download count
grep -c 'ok' fasta/fasta_download_log.csv
# Expected: ~260-266
# Check failures
grep 'fail' fasta/fasta_download_log.csv
# Expected: 0-3 failures
# Verify FASTA format (should start with >)
head -2 fasta/1AK4.fasta
# Expected:
# >1AK4_1|Chains A, B|...
# MNGKII...
# Check that multi-chain complexes have multiple > headers
grep -c '^>' fasta/1AK4.fasta
# Expected: 2+ (one per chain group)

What it does:
- Takes a flat directory of `{PDBID}.fasta` files
- Creates one subdirectory per PDB ID under `data/`
- Copies (or moves) the FASTA file into the subdirectory
- Optionally renames it (e.g., to `sequence.fasta`)
python3 scripts/organize_fastas.py fasta/ data/ --rename sequence.fasta

Input: fasta/*.fasta
Output: data/{PDBID}/sequence.fasta
Behavior:
- PDB ID extracted from the filename stem (uppercased)
- Creates the directory if it doesn't exist
- Skips existing files unless `--overwrite` is passed
- Copies by default; the `--move` flag deletes the source after copying
No issues identified. This is a straightforward file organizer.
# Should have one directory per successfully downloaded FASTA
ls data/ | wc -l
# Expected: ~260-266
# Each directory should have sequence.fasta
ls data/1AK4/sequence.fasta
# Expected: exists
# Verify content matches original
diff fasta/1AK4.fasta data/1AK4/sequence.fasta
# Expected: no differences

What it does:
- Reads `data/*/sequence.fasta` files
- Parses chain IDs from the RCSB FASTA headers
- Rewrites each sequence with Boltz-compatible headers: `>{CHAIN}|PROTEIN|`
- Outputs `boltz_input.fasta` in each subdirectory
Why this is needed: Boltz-1 requires a specific FASTA format where each chain header contains:
- The chain letter
- The molecule type (`PROTEIN`)
- Separated by pipes
Standard RCSB headers look like: >1AK4_1|Chains A, B|CYCLOPHILIN A|Homo sapiens
Boltz needs: >A|PROTEIN|
python3 scripts/prepare_boltz_fastas.py data/

Input: data/*/sequence.fasta
Output: data/*/boltz_input.fasta
Chain parsing logic:
1. Looks for a `Chain[s] A, B, C` pattern in the header (regex)
2. Fallback: splits on `|` and takes the first character of the second field
3. Last resort: defaults to chain `A`
Known behavior:
- If the RCSB header says `Chains A, B` (homo-dimer with identical chains), the script writes both `>A|PROTEIN|` and `>B|PROTEIN|` with the same sequence — this is correct for Boltz multimer input
- Single-chain entries get a single `>A|PROTEIN|` header
Potential issue: If a header has an unusual format that doesn't match the regex, it falls back to A. This could produce incorrect chain assignments for some entries. Verify output manually for critical targets.
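The header-to-Boltz conversion can be sketched with a regex. This is a simplified stand-in for `prepare_boltz_fastas.py` (the pipe-splitting fallback tier is omitted; `boltz_headers` is a hypothetical name):

```python
import re

def boltz_headers(rcsb_header, seq):
    """Convert one RCSB FASTA record into Boltz-style records, one per chain."""
    # Appending '|' handles headers that end right after the chain list
    m = re.search(r"Chains?\s+([A-Za-z0-9,\s]+?)\|", rcsb_header + "|")
    if m:
        chains = [c.strip() for c in m.group(1).split(",") if c.strip()]
    else:
        chains = ["A"]  # last-resort default, as in the pipeline
    return "".join(f">{c}|PROTEIN|\n{seq}\n" for c in chains)
```

Note how the default-to-`A` branch triggers exactly when the regex misses, which is the caveat flagged above.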
# Check that boltz_input.fasta was created
ls data/1AK4/boltz_input.fasta
# Expected: exists
# Verify Boltz format
head -4 data/1AK4/boltz_input.fasta
# Expected:
# >A|PROTEIN|
# MNGKII...
# >D|PROTEIN|
# PIVQNL...
# Verify chain count matches sequence.fasta
grep -c '^>' data/1AK4/sequence.fasta
grep -c '^>' data/1AK4/boltz_input.fasta
# Expected: should match (or be equal/greater if chains are split)
# Validate ALL Boltz FASTAs have correct format
for f in data/*/boltz_input.fasta; do
if ! grep -q '|PROTEIN|' "$f"; then
echo "BAD: $f"
fi
done
# Expected: no output (all valid)

What they do:
- Submit SLURM GPU jobs to run AlphaFold 2.3.2
- Auto-detect monomer vs multimer based on sequence count
- Produce 10 models per target: 5 AMBER-relaxed (`ranked_*.pdb`) + 5 unrelaxed (`unrelaxed_model_*.pdb`)
- AMBER relaxation is treated as the 7th relaxation protocol (alongside the 6 Rosetta protocols)
- Skip completed targets (completion guard checks for `ranking_debug.json`)
- Clean up intermediates (MSAs, pickles) to manage disk usage
# Standard (64GB RAM, covers most targets)
sbatch --array=1-257%10 scripts/run/af_array.slurm
# High-memory (128GB RAM, for large multimer complexes)
sbatch --array=<TASK_IDS> scripts/run/af_array_highmem.slurm

Resources: 1 node, 6 tasks, 1 A6000 GPU, 64GB RAM, 48h
FASTA selection: Prefers sequence.fasta over boltz_input.fasta
Preset detection: Counts > lines; if >1 → multimer, else monomer
Output: af_out/sequence/ranked_{0-4}.pdb (AMBER-relaxed) + af_out/sequence/unrelaxed_model_*.pdb (raw predictions)
Key flags:
- `--nouse_gpu_relax` — runs AMBER relaxation on CPU (avoids GPU memory issues)
- `--models_to_relax=all` — AMBER-relaxes all 5 ranked models (not just the best)
- `--max_template_date=9999-12-31` — allows all templates (no date cutoff)
- Monomer: uses the `pdb70` database
- Multimer: uses `pdb_seqres` + `uniprot`, with `num_multimer_predictions_per_model=1`
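The monomer/multimer auto-detection is just a header count; a minimal sketch (the function name is illustrative):

```python
def detect_preset(fasta_text):
    """AF preset selection: more than one '>' header means multimer, else monomer."""
    n_chains = sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
    return "multimer" if n_chains > 1 else "monomer"
```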
Database preset: Full databases (equivalent to --db_preset=full_dbs) with reduced_dbs
fallback on HHblits failure. Primary uses HHblits for BFD/UniRef30 searches; fallback uses
jackhmmer with small_bfd. The run_af() function wraps the prediction call and retries
with --db_preset=reduced_dbs --small_bfd_database_path=... if full_dbs fails.
AMBER safety: If AMBER relaxation crashes but unrelaxed models exist, the script preserves unrelaxed output instead of deleting everything during the reduced_dbs retry.
Database paths (ACCRE, /csbtmp/alphafold-data.230/):
- `uniref90/uniref90.fasta`
- `mgnify/mgy_clusters_2022_05.fa`
- `uniref30/UniRef30_2021_03`
- `bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt`
- `pdb70/pdb70`
- `pdb_mmcif/mmcif_files/`
- `pdb_mmcif/obsolete.dat`
Known OOM targets at 64GB (require 128GB highmem script): 1AHW, 1ATN, 1DFJ, 1DQJ, 1E6J, 1FC2, 1IRA, 1JWH, 1MLC (all large multimer complexes)
AMBER relaxation failures — ALL RESOLVED:
7 targets (1ATN, 1DFJ, 1FC2, 1WEJ, 2BTF, 4CPA, 5JMO) originally failed AMBER relaxation
due to non-standard residues (X/Z) in FASTA sequences. Root cause identified by Blue:
AlphaFold can't place atoms for X (unknown AA) and Z (ambiguous Glu/Gln), causing AMBER's
_check_residues_are_well_defined() to reject the model. Fix: trimmed X/Z from FASTA
terminals. All 7 now have full 10 models (5 ranked + 5 unrelaxed).
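The FASTA fix amounts to stripping X/Z from sequence termini only, leaving internal residues untouched. A minimal sketch (hypothetical helper; handles wrapped sequence lines):

```python
def trim_fasta_terminals(fasta_text):
    """Strip X (unknown) and Z (ambiguous Glu/Gln) from the ends of each
    sequence record; internal X/Z residues are left alone."""
    out = []
    for block in fasta_text.split(">")[1:]:
        header, _, seq = block.partition("\n")
        seq = "".join(seq.split())  # join wrapped sequence lines
        out.append(f">{header}\n{seq.strip('XZ')}\n")
    return "".join(out)
```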
HHblits failures (31 targets, resubmitted with reduced_dbs fallback):
If the main job was submitted before the fallback logic was added to the script, HHblits
failures will not trigger the reduced_dbs retry. Clean failed af_out/ directories and
resubmit with the current script version.
After cleanup, 10 models per target are retained:
data/{PDBID}/af_out/
└── sequence/
├── ranked_0.pdb # Best AMBER-relaxed model (0-indexed)
├── ranked_1.pdb
├── ranked_2.pdb
├── ranked_3.pdb
├── ranked_4.pdb # 5th AMBER-relaxed model
├── unrelaxed_model_1_*.pdb # Raw AF prediction (1-indexed, AF convention)
├── unrelaxed_model_2_*.pdb
├── unrelaxed_model_3_*.pdb
├── unrelaxed_model_4_*.pdb
├── unrelaxed_model_5_*.pdb
└── ranking_debug.json # Maps ranked (0-indexed) to model (1-indexed)
Numbering note: AlphaFold uses 0-indexed ranked_*.pdb (ordered by confidence) and
1-indexed unrelaxed_model_* (unordered). The ranking_debug.json file maps between them.
For example, ranked_0.pdb may correspond to unrelaxed_model_3 if model 3 scored highest.
Intermediate files removed by cleanup: MSAs, feature pickles, result pickles,
relaxed_model_*.pdb (duplicates of ranked PDBs), timings.
# Check how many targets have ranking_debug.json (completion marker)
completed=0
for d in data/*/af_out/*/; do
if [ -f "$d/ranking_debug.json" ]; then
completed=$((completed + 1))
fi
done
echo "AlphaFold complete: $completed"
# Expected: 246/246 (all active benchmark targets)
# Verify a specific prediction
ls data/1AK4/af_out/sequence/ranked_*.pdb | wc -l
# Expected: 5
# Check model quality (ranking_debug.json shows pLDDT/pTM scores)
cat data/1AK4/af_out/sequence/ranking_debug.json | python3 -m json.tool | head -10

What they do:
- Run Boltz-1 v0.4.1 structure prediction
- Use MSA server for fast multiple sequence alignment
- Produce 5 diffusion samples per target
# Count targets
N=$(ls -d data/*/ | wc -l)
# Submit array job
sbatch --array=1-${N} scripts/boltz_array.slurm

Or for a single target:
sbatch --chdir=data/1AK4 scripts/boltz_single.slurm boltz_input.fasta

Resources: 1 node, 1 CPU, 1 L40S GPU, 256GB RAM, 2 days
FASTA selection: Prefers boltz_input.fasta over sequence.fasta
Header validation: Checks that all headers match >X|PROTEIN| format
Key parameters:
- `diffusion_samples=5` — generates 5 models
- `recycling_steps=10` — structure refinement iterations
- `sampling_steps=200` — diffusion sampling steps
- `output_format=pdb` — outputs PDB files (not mmCIF)
- `use_msa_server` — uses the ColabFold MSA server instead of local databases
Output format: --output_format pdb (PDB files, not mmCIF)
Skip guard: Checks for existing boltz_model_*.pdb files; skips if >= 5 found
Force re-run: Set FORCE=1 environment variable
Potential issues:
- 256GB RAM is very high; Boltz typically needs 16-64GB. This is set conservatively.
- `--use_msa_server` requires internet access from compute nodes (may be blocked on some clusters)
- The MSA server can be slow or rate-limited during peak usage
data/{PDBID}/boltz_out_dir/
└── boltz_results_boltz_input/
├── predictions/boltz_input/
│ ├── boltz_input_model_0.pdb
│ ├── boltz_input_model_1.pdb
│ ├── boltz_input_model_2.pdb
│ ├── boltz_input_model_3.pdb
│ └── boltz_input_model_4.pdb
├── msa/ # Generated MSAs
└── processed/ # Processed features
# Check how many targets have Boltz models
completed=0
for d in data/*/boltz_out_dir/; do
count=$(find "$d" -name 'boltz_input_model_*.pdb' -o -name 'boltz_model_*.pdb' 2>/dev/null | wc -l)
if [ "$count" -ge 1 ]; then
completed=$((completed + 1))
fi
done
echo "Boltz complete: $completed"
# Expected: 246/246 (all active benchmark targets, 5 models each)
# 11 targets excluded from benchmark due to Boltz OOM:
# 9 full OOM (>3000 total residues): 1DE4, 1K5D, 1N2C, 1WDW, 1ZM4, 3BIW, 3L89, 4GXU, 6EY6
# 2 partial OOM (1/5 models only): 1GXD, 3EO1
# Verify a specific prediction
find data/1AK4/boltz_out_dir -name '*.pdb' | wc -l
# Expected: 5

Before running relaxation, predictions need to be organized into the test/ directory structure:
for pdb_id in $(ls data/); do
# Skip if no predictions
af_count=$(ls data/$pdb_id/af_out/sequence/ranked_*.pdb 2>/dev/null | wc -l)
boltz_count=$(find data/$pdb_id/boltz_out_dir -name 'boltz_input_model_*.pdb' 2>/dev/null | wc -l)
if [ "$af_count" -lt 5 ] && [ "$boltz_count" -lt 5 ]; then
echo "SKIP: $pdb_id (AF=$af_count, Boltz=$boltz_count)"
continue
fi
mkdir -p test/$pdb_id/AF test/$pdb_id/Boltz
# Copy crystal structure
if [ -f cleaned/$pdb_id.pdb ]; then
cp cleaned/$pdb_id.pdb test/$pdb_id/$pdb_id.pdb
fi
# Copy AF predictions
cp data/$pdb_id/af_out/sequence/ranked_*.pdb test/$pdb_id/AF/ 2>/dev/null
# Copy Boltz predictions
find data/$pdb_id/boltz_out_dir -name 'boltz_input_model_*.pdb' \
-exec cp {} test/$pdb_id/Boltz/ \; 2>/dev/null
done

# Check structure
ls test/1AK4/
# Expected: 1AK4.pdb AF/ Boltz/
ls test/1AK4/AF/
# Expected: ranked_0.pdb through ranked_4.pdb
ls test/1AK4/Boltz/
# Expected: boltz_input_model_0.pdb through boltz_input_model_4.pdb

7 relaxation protocols total: 1 AMBER (computed during AF prediction) + 6 Rosetta.
AMBER relaxation via OpenMM is performed during AlphaFold prediction (Step 5):
- Force field: AMBER ff14SB
- Energy tolerance: 2.39 kcal/mol
- Position restraint stiffness: 10.0 kcal/mol/A^2
- Compute: CPU (`--nouse_gpu_relax`)
- Output: `ranked_*.pdb` files (already generated in Step 5)
No separate script needed — AMBER relaxation is built into AlphaFold's --models_to_relax=all.
What it does:
- For each PDB directory in `test/`:
  - Finds the crystal structure PDB
  - Runs 6 relaxation protocols x 5 replicates = 30 relaxed structures
  - Each replicate uses `-nstruct 1` with a different suffix (`_r1` through `_r5`)
| Key | Protocol | Scoring Function | Rosetta Flags |
|---|---|---|---|
| `cart_beta` | Cartesian | beta_nov16 | `-relax:cartesian -beta_nov16 -score:weights beta_nov16_cart` |
| `cart_ref15` | Cartesian | ref2015 | `-relax:cartesian -score:weights ref2015_cart` |
| `dual_beta` | Dualspace | beta_nov16 | `-relax:dualspace -beta_nov16 -score:weights beta_nov16_cart -nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths` |
| `dual_ref15` | Dualspace | ref2015 | `-relax:dualspace -score:weights ref2015_cart -nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths` |
| `norm_beta` | Normal FastRelax | beta_nov16 | `-beta_nov16 -score:weights beta_nov16` |
| `norm_ref15` | Normal FastRelax | ref2015 | `-score:weights ref2015` |
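Assembling one relax invocation from the protocol table might look like the following sketch. The `relax_cmd` helper is illustrative (not the actual SLURM script); flags are taken from the protocol table and the common flags documented in this step:

```python
# Per-protocol flags, transcribed from the protocol table
PROTOCOLS = {
    "cart_beta":  "-relax:cartesian -beta_nov16 -score:weights beta_nov16_cart",
    "cart_ref15": "-relax:cartesian -score:weights ref2015_cart",
    "dual_beta":  ("-relax:dualspace -beta_nov16 -score:weights beta_nov16_cart "
                   "-nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths"),
    "dual_ref15": ("-relax:dualspace -score:weights ref2015_cart "
                   "-nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths"),
    "norm_beta":  "-beta_nov16 -score:weights beta_nov16",
    "norm_ref15": "-score:weights ref2015",
}

# Flags shared by all protocols
COMMON = ("-ignore_zero_occupancy false -nstruct 1 -no_nstruct_label "
          "-out:pdb_gz -flip_HNQ -fa_max_dis 9.0 -optimization:default_max_cycles 200")

def relax_cmd(relax_bin, database, pdb, protocol, replicate):
    """One Rosetta relax invocation: replicate suffix + protocol flags + common flags."""
    return [relax_bin, "-database", database, "-s", pdb,
            "-out:suffix", f"_r{replicate}"] + PROTOCOLS[protocol].split() + COMMON.split()
```

Iterating `relax_cmd` over 6 protocols x 5 replicates reproduces the 30 runs per structure.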
# Crystal structure relaxation
N=$(ls -d test/*/ | wc -l)
sbatch --array=1-${N} scripts/relax_predictions.slurm

For AI prediction relaxation, the script needs to be modified to iterate over AF and Boltz models within each PDB directory. The extended version (used in our pipeline) does:
# For each PDB in test/:
# For each source (AF, Boltz):
# For each model (5 models):
# For each protocol (6 protocols):
# For each replicate (5 replicates):
#         Run Rosetta relax → test/{PDB}/relax/{source}/{model}/{protocol}/

Resources: 1 node, 1 CPU, 4GB RAM, 48h
No GPU required — Rosetta relax is CPU-only
Common Rosetta flags:
- `-ignore_zero_occupancy false` — processes all atoms regardless of occupancy
- `-nstruct 1` — one structure per run (replicates via suffix)
- `-no_nstruct_label` — don't add `_0001` to the output filename
- `-out:pdb_gz` — gzip output (saves disk space)
- `-flip_HNQ` — optimize His/Asn/Gln hydrogen placement
- `-fa_max_dis 9.0` — maximum interaction distance for scoring
- `-optimization:default_max_cycles 200` — convergence iterations
Potential issues:
- 48h timeout: a single PDB with 6 protocols x 5 replicates = 30 runs. Each relax takes 5-60 minutes depending on size, so large complexes (>800 residues) may time out.
- 4GB RAM is tight for large complexes. May need 8-16GB for >600 residues.
- The script uses `${CFG[$key]}` without quoting — this works because the values contain no spaces in critical positions, but it's technically fragile.
test/{PDBID}/
├── cartesian_beta/
│ ├── {PDBID}_r1.pdb.gz through _r5.pdb.gz
│ ├── log/
│ │ └── {PDBID}_cart_beta_r1.log through _r5.log
│ └── relax.fasc
├── cartesian_ref15/
│ └── [same structure]
├── dualspace_beta/
├── dualspace_ref15/
├── normal_beta/
└── normal_ref15/
For AI predictions:
test/{PDBID}/relax/
├── AF/
│ ├── ranked_0/
│ │ ├── cartesian_beta/ranked_0_r1.pdb.gz ... _r5.pdb.gz
│ │ ├── cartesian_ref15/
│ │ ├── dualspace_beta/
│ │ ├── dualspace_ref15/
│ │ ├── normal_beta/
│ │ └── normal_ref15/
│ ├── ranked_1/
│ └── ... (ranked_2 through ranked_4)
└── Boltz/
├── boltz_input_model_0/
└── ... (model_1 through model_4)
# Count relaxed structures for a single PDB (crystal only)
find test/1AK4/ -maxdepth 2 -name '*.pdb.gz' | wc -l
# Expected: 30 (6 protocols x 5 replicates)
# Count relaxed structures including AI predictions
find test/1AK4/relax/ -name '*.pdb.gz' | wc -l
# Expected: 300 (10 models x 6 protocols x 5 replicates)
# Verify a relaxation log looks normal (check for "Total weighted score")
tail -5 test/1AK4/cartesian_beta/log/1AK4_cart_beta_r1.log
# Expected: should end with score information, no errors
# Check that .fasc files contain scores
head -3 test/1AK4/cartesian_beta/relax.fasc
# Expected: SCORE: header line followed by data

What it does:
- For each PDB in the test set:
- Validates the experimental crystal structure
- Validates 5 raw AlphaFold predictions
- Validates 5 raw Boltz predictions
- Validates all relaxed structures (30 crystal + 300 prediction relaxations)
- Outputs per-protein CSV files with 16+ metrics per structure
- Supports resume (skips already-validated structures)
bash scripts/run_molprobity.sh test/ molprobity_results/

Dependencies: reduce (hydrogen addition) + phenix.molprobity (validation)
Process per structure:
1. Decompress `.pdb.gz` if needed
2. Add hydrogens with `reduce -FLIP -Quiet` (try without `-FLIP` if that fails)
3. Run `phenix.molprobity input_H.pdb keep_hydrogens=True` with a 600s timeout
4. Parse `molprobity.out` for metrics
5. Write to the per-protein CSV
Metrics collected:
| Metric | Description | Good Values |
|---|---|---|
| Clashscore | All-atom steric clashes per 1000 atoms | < 10 |
| Ramachandran Favored | % residues in favored regions | > 95% |
| Ramachandran Outliers | % residues in outlier regions | < 0.5% |
| Rotamer Outliers | % side-chain rotamer outliers | < 2% |
| C-beta Deviations | Backbone geometry deviations | < 5 |
| Bond RMSZ | Bond length deviation from ideal | < 1.5 |
| Angle RMSZ | Bond angle deviation from ideal | < 1.5 |
| MolProbity Score | Combined quality metric | < 2.0 |
Potential issues:
- `reduce` may fail on structures with non-standard residues (falls back to no hydrogens)
- The 600s timeout may be insufficient for very large complexes
- Parsing relies on a specific `molprobity.out` format — may break with different Phenix versions
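The parsing step can be approximated with a tolerant "name = value" scan over the summary block. This is a sketch assuming that layout in `molprobity.out`; the actual script's regexes may differ:

```python
import re

def parse_molprobity_out(text):
    """Pull numeric 'name = value' pairs from a molprobity.out-style summary.
    Trailing units like '%' are ignored; non-numeric lines are skipped."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z\- ]*?)\s*=\s*([-\d.]+)", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics
```

A parser this loose survives small format drift between Phenix versions better than hard-coded line offsets.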
Structures validated per PDB: 341 total
- 1 crystal structure
- 10 raw predictions (5 AF + 5 Boltz)
- 30 crystal relaxations (6 protocols x 5 reps)
- 300 prediction relaxations (10 models x 6 protocols x 5 reps)
# Check that CSV files were created
ls molprobity_results/csvs/ | wc -l
# Expected: 20 (one per PDB in test subset)
# Check line count (should be 342 = 1 header + 341 structures)
wc -l molprobity_results/csvs/1AK4_validation.csv
# Expected: 342
# Check for failures
grep -c 'FAILED' molprobity_results/csvs/1AK4_validation.csv
# Expected: 0
# Merge all CSVs for analysis
head -1 molprobity_results/csvs/$(ls molprobity_results/csvs/ | head -1) > molprobity_results/all_validation.csv
tail -n +2 -q molprobity_results/csvs/*_validation.csv >> molprobity_results/all_validation.csv
wc -l molprobity_results/all_validation.csv
# Expected: ~6821 (1 header + 341 x 20 PDBs)

What it does:
- Loads multiple structures into PyMOL
- Aligns each to a reference structure (crystal)
- Calculates C-alpha RMSD
- Extracts Rosetta total_score from PDB REMARK lines or scorefile
- Outputs TSV with RMSD and energy per structure
# Load structures
load test/1AK4/1AK4.pdb, crystal
load test/1AK4/AF/ranked_0.pdb, af_ranked0
# ... load more structures
# Run the plugin
run scripts/collect_metrics.py
collect_metrics ref=crystal, sel=polymer and name CA, do_fit=1, out=metrics.tsv

For relaxed structures with a Rosetta scorefile:
collect_metrics ref=crystal, scorefile=test/1AK4/cartesian_beta/relax.fasc, out=metrics.tsv

RMSD calculation:
- Uses PyMOL's `cmd.align()` with `cycles=0` (no outlier rejection) — gives the true RMSD
- Selection: `polymer and name CA` — C-alpha atoms only
- `transform=1` — performs optimal superposition before measuring
Energy extraction priority:
1. Check the external scorefile (`.fasc` or `.sc`) if provided
2. Parse REMARK lines in the PDB for `total_score`
3. Return `None` if neither is found
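Step 2 of that priority order might look like the following sketch (hypothetical helper; the exact REMARK layout shown in the docstring is an assumption):

```python
def energy_from_pdb(pdb_text):
    """Fallback energy extraction: scan REMARK lines for a total_score value.

    Assumes a line shaped like 'REMARK total_score -123.456'; returns None
    if no such line exists (an external scorefile, when given, takes priority).
    """
    for line in pdb_text.splitlines():
        if line.startswith("REMARK") and "total_score" in line:
            for token in line.split():
                try:
                    return float(token)  # first numeric token on the line
                except ValueError:
                    continue
    return None
```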
Output: TSV file with columns: object, rmsd_to_{ref}, pairs, energy_total_score
No issues identified. This script is straightforward and correct.
# Check output file
head -5 metrics.tsv
# Expected: tab-separated with RMSD values
# RMSD values should be:
# - Crystal to crystal: NaN (self)
# - Crystal relaxed: 0.5-2.0 A
# - AF predictions: 1.0-5.0 A (depends on quality)
# - AF relaxed: 1.5-6.0 A

| Step | Script | Status | Issues |
|---|---|---|---|
| 0 | Download BM5.5 | Manual | URL may change; verify on Weng Lab site |
| 1 | `clean_pdbs.sh` | PASS | Space chain ID mapping works correctly |
| 2 | `download_fastas.py` | PASS | 2-3 PDB IDs may fail (obsolete/invalid) |
| 3 | `organize_fastas.py` | PASS | No issues |
| 4 | `prepare_boltz_fastas.py` | PASS with caveat | Unusual FASTA headers may default to chain A |
| 5 | `af_array.slurm` | PASS | 64GB RAM; 128GB highmem variant for large complexes; 7 AMBER failures resolved via FASTA fix |
| 5 | `af_array_highmem.slurm` | PASS | 128GB RAM for large multimer complexes |
| 6 | `boltz_array.slurm` | PASS | MSA server needs internet from compute node; 246/246 complete |
| 6 | `boltz_single.slurm` | PASS | Same MSA server caveat |
| 7 | Organize predictions | Manual | No script — documented above |
| 8 | `relax_predictions.slurm` | PASS with caveat | Crystal structures; 48h may time out for large complexes |
| 8 | `relax_test.slurm` | PASS | AI predictions (AF + Boltz); per-model array indexing |
| 8 | `relax_finish.slurm` | PASS with caveat | Checkpoint recovery; uses `$protocol` instead of `$PROTOCOL` (line 108 bug) |
| 9 | `run_molprobity.sh` | PASS | Phenix version-dependent output parsing |
| 10 | `collect_metrics.py` | PASS | No issues |
The pipeline is scientifically sound and technically correct. The main risks are:
- Resource limits — some SLURM scripts use tight memory/time limits that may fail on large complexes
- External dependencies — MSA server availability, RCSB API stability
- Manual steps — Step 7 (organizing predictions) has no script and is error-prone
- Increase array job memory to 16GB for relaxation, 64GB for AlphaFold array
- Create a script for Step 7 (organizing predictions) to eliminate manual errors
- Add a master pipeline script that chains all steps with dependency checking
- Add a `.gitignore` for large intermediate files (MSAs, pickles)
ResourceExhaustedError: OOM when allocating tensor
Fix: Increase --mem in SLURM script. Large complexes (>600 residues) may need 64-128GB.
ConnectionError: MSA server not responding
Fix: Check internet access from compute nodes. If blocked, pre-compute MSAs using local databases:
boltz predict input.fasta --msa_dir ./precomputed_msas/ --output_format pdb

ERROR: Unable to open file
Fix: Check that the Rosetta database path (-database) is correct and accessible from compute nodes.
NA,NA,NA,NA...
Fix: Check Phenix version compatibility. The parsing regex expects specific output format from phenix.molprobity. Run manually and compare output format:
phenix.molprobity input.pdb keep_hydrogens=True
cat molprobity.out

RuntimeError: HHblits failed on the input
Cause: SLURM copies the submission script at job submission time. If the reduced_dbs
fallback was added to af_array.slurm AFTER the job was submitted, running tasks use the
old version without fallback.
Fix:
1. Identify failed targets: `for d in data/*/af_out; do [ -d "$d" ] && [ ! -f "$d"/*/ranking_debug.json ] && ls "$d"/*/unrelaxed_model_*.pdb 2>/dev/null | wc -l | grep -q '^0$' && echo "$(basename $(dirname $d))"; done`
2. Clean failed output: `rm -rf data/{TARGET}/af_out`
3. Resubmit with the current script: `sbatch --array=<IDS> af_array.slurm`
Disk quota exceeded
Fix: AF intermediates consume ~1 GB per target (MSAs, pickles). Clean manually:
for d in data/*/af_out/*/; do
rm -rf "$d/msas" "$d/features.pkl" "$d/result_model_"*.pkl "$d/timings.json" 2>/dev/null
# Remove relaxed_model_*.pdb (duplicates of ranked_*.pdb)
rm -f "$d/relaxed_model_"*.pdb 2>/dev/null
done

Monitor with du -sh /data/p_csb_meiler/agarwm5/protein_ideal_test/benchmarking/.
[FAIL] 3RVW: no FASTA found
Fix: Some PDB IDs are obsolete. Check https://www.rcsb.org/structure/{PDBID} for replacement entries. Update your PDB list accordingly.