A complete, step-by-step guide to reproducing the protein-protein complex structure prediction and relaxation benchmark using the Protein-Protein Docking Benchmark 5.5 (BM5.5).
This document traces every script, validates each step, and provides verification checkpoints so you can confirm correctness at each stage.
- Prerequisites
- Step 0: Download BM5.5 Dataset
- Step 1: Clean PDB Structures
- Step 2: Download FASTA Sequences
- Step 3: Organize FASTAs
- Step 4: Prepare Boltz-1 Input
- Step 5: Run AlphaFold 2.3.2
- Step 6: Run Boltz-1
- Step 7: Organize Predictions for Relaxation
- Step 8: Run Rosetta Relaxation
- Step 9: MolProbity Validation
- Step 10: Collect Metrics
- Pipeline Audit
- Troubleshooting
| Software | Version | Installation |
|---|---|---|
| Python | 3.10+ | System or conda |
| AlphaFold | 2.3.2 | GitHub or cluster module |
| Boltz-1 | 0.4.1 | pip install boltz or cluster module |
| Rosetta | 3.15 | RosettaCommons |
| Phenix | Latest | Phenix |
| Reduce | Latest | Included with Phenix or SBGrid |
| PyMOL | 2.x+ | PyMOL |
| requests | Latest | pip install requests |
# AlphaFold 2.3.2
AF2_MINICONDA=/sb/apps/alphafold232/miniconda3
AF2_REPO=/sb/apps/alphafold232/alphafold
AF2_DATADIR=/csbtmp/alphafold-data.230
# Boltz-1 v0.4.1
BOLTZ_MINICONDA=/sb/apps/boltz1-v0.4.1/miniconda3
# Rosetta 3.15
RELAX=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/source/bin/relax.linuxgccrelease
ROSETTA_DB=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/database
CLEAN_SCRIPT=/data/p_csb_meiler/apps/rosetta/rosetta-3.15/main/tools/protein_tools/scripts/clean_pdb.py
# MolProbity (SBGrid)
REDUCE=/programs/x86_64-linux/system/sbgrid_bin/reduce
PHENIX=/programs/x86_64-linux/system/sbgrid_bin/phenix.molprobity

Create your working directory:
mkdir -p benchmarking/{merged,cleaned,fasta,data,structures,test}
cd benchmarking

The Protein-Protein Docking Benchmark 5.5 is maintained by the Weng Lab at UMass Medical School.
Download page: https://zlab.umassmed.edu/benchmark/
The benchmark provides:
- Bound structures — crystal structures of the complex
- Unbound structures — individual partner structures crystallized separately
- Metadata — complex categories (Antibody-Antigen, Enzyme-Inhibitor, etc.)
# Download the benchmark tables and structures
# The exact URL may change; check the Weng Lab benchmark page
wget https://zlab.umassmed.edu/benchmark/benchmark5.5.tgz
tar -xzf benchmark5.5.tgz
# The archive contains structures in the format:
# {PDBID}_l_b.pdb — Ligand, bound conformation
# {PDBID}_l_u.pdb — Ligand, unbound conformation
# {PDBID}_r_b.pdb — Receptor, bound conformation
# {PDBID}_r_u.pdb — Receptor, unbound conformation

Naming convention:
| Suffix | Meaning |
|---|---|
| `_l_` | Ligand (smaller binding partner) |
| `_r_` | Receptor (larger binding partner) |
| `_b` | Bound conformation (from the complex crystal) |
| `_u` | Unbound conformation (separate crystal structure) |
For this benchmark, we need the complete complex structures (both chains together). If you downloaded the split bound structures, merge them:
# Move split structures to structures/ for reference
mv benchmark5.5/structures/* structures/
# Create merged complexes from bound structures
for pdb_id in $(ls structures/*_r_b.pdb | sed 's/.*\///' | sed 's/_r_b.pdb//'); do
cat "structures/${pdb_id}_r_b.pdb" "structures/${pdb_id}_l_b.pdb" | \
grep -E '^(ATOM|HETATM|TER)' > "merged/${pdb_id}.pdb"
echo "END" >> "merged/${pdb_id}.pdb"
done

Alternatively, download the full complex PDB files directly from RCSB:
# For each PDB ID in the benchmark
for pdb_id in $(cat benchmark5.5/pdb_list.txt); do
wget -q "https://files.rcsb.org/download/${pdb_id}.pdb" -O "merged/${pdb_id}.pdb"
sleep 0.2 # Be polite to RCSB
done

# Should have 246 PDB files (257 BM5.5 minus 11 excluded Boltz OOM targets)
ls merged/*.pdb | wc -l
# Expected: 246
# Each PDB should have ATOM records with multiple chains
grep -c '^ATOM' merged/1AK4.pdb
# Expected: several thousand lines
# Check chain IDs present
awk '/^ATOM/{print substr($0,22,1)}' merged/1AK4.pdb | sort -u
# Expected: A, D (or similar multi-chain output)

What it does:
- For each PDB in `merged/`, extracts unique chain IDs
- Runs Rosetta's `clean_pdb.py` on each chain individually
- Merges cleaned chains into a single PDB with proper TER records
- Renumbers atom serials sequentially
Why this step exists:
- Rosetta's `clean_pdb.py` standardizes atom naming to Rosetta conventions
- Removes HETATM records (water, ligands, ions)
- Fixes non-standard residue naming
- Required for Rosetta relaxation to work correctly
bash scripts/clean_pdbs.sh merged/ cleaned/ $CLEAN_SCRIPT

Input: merged/*.pdb (raw PDB files from RCSB/BM5.5)
Output: cleaned/{PDBID}.pdb (one file per complex, all chains merged)
Potential issues identified:
- Chain ID `" "` (space) is mapped to `"_"` for Rosetta compatibility — this is correct behavior
- The script uses a temporary directory per PDB and cleans up after — safe
- Atom serial renumbering starts at 1 per file — correct
- Only ATOM and HETATM lines are kept — intentional (removes REMARK, HEADER, etc.)
Edge cases:
- PDBs with no ATOM records are skipped (logged as `[SKIP]`)
- If Rosetta's `clean_pdb.py` fails on a chain, that chain will be missing from the output (check logs)
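The merge-and-renumber behavior described above can be sketched in Python. This is a minimal illustration, not the actual `clean_pdbs.sh`; `merge_chains` is a hypothetical helper:

```python
def merge_chains(chain_pdbs):
    """Merge per-chain PDB texts: sequential atom serials starting at 1,
    a TER record after each chain, and END at the bottom."""
    out, serial = [], 1
    for text in chain_pdbs:
        for line in text.splitlines():
            if line.startswith(("ATOM", "HETATM")):
                # Columns 7-11 hold the atom serial in the fixed-width PDB format
                out.append(f"{line[:6]}{serial:5d}{line[11:]}")
                serial += 1
        out.append("TER")
    out.append("END")
    return "\n".join(out) + "\n"
```

Running it on two single-atom chains yields serials 1 and 2 with a TER between them, matching the verification checks below.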
# Should have same count as merged/
ls cleaned/*.pdb | wc -l
# Expected: same as merged/
# Each cleaned PDB should have TER between chains and END at the bottom
tail -5 cleaned/1AK4.pdb
# Expected: ATOM lines, then TER, then END
# Check that atom serials are sequential starting from 1
head -1 cleaned/1AK4.pdb
# Expected: ATOM 1 ...
# Compare chain count (should be preserved)
awk '/^ATOM/{print substr($0,22,1)}' cleaned/1AK4.pdb | sort -u
# Expected: same chains as in merged/1AK4.pdb

What it does:
- Scans `merged/` for PDB files
- Extracts 4-character PDB IDs from filenames
- Downloads FASTA sequences from RCSB (with fallback to PDBe)
- Handles obsolete entries by looking up replacement PDB IDs
- Logs all results to `fasta_download_log.csv`
Why not extract sequences from the PDB files directly?
- RCSB FASTA includes all chains with proper headers
- Headers contain chain IDs needed for Boltz format conversion
- More reliable than parsing PDB ATOM records for sequence
python3 scripts/download_fastas.py merged/ fasta/

Input: merged/ directory with PDB files
Output: fasta/{PDBID}.fasta + fasta/fasta_download_log.csv
Endpoint priority:
1. `https://www.rcsb.org/fasta/entry/{PDB}` (primary)
2. `https://www.rcsb.org/pdb/download/downloadFastaFiles.do?...` (fallback)
3. `https://www.ebi.ac.uk/pdbe/entry/pdb/{PDB}/fasta` (PDBe fallback)
Retry logic: 2 retries per endpoint with exponential backoff (1.5^i seconds)
Rate limiting: 0.15s delay between PDB IDs (polite to servers)
Obsolete handling: If all endpoints fail, checks the RCSB entry API for a replacement ID
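The endpoint-priority and backoff logic can be sketched as follows. The `fetch_fasta` helper and its injectable `get`/`sleep` hooks are illustrative, not the actual `download_fastas.py` (which uses `requests` directly):

```python
import time

# Endpoint templates, in the documented priority order (second RCSB fallback omitted)
ENDPOINTS = [
    "https://www.rcsb.org/fasta/entry/{pdb}",
    "https://www.ebi.ac.uk/pdbe/entry/pdb/{pdb}/fasta",
]

def fetch_fasta(pdb_id, get, retries=2, sleep=time.sleep):
    """Try each endpoint in order; retry each with exponential backoff (1.5**i s)."""
    for template in ENDPOINTS:
        url = template.format(pdb=pdb_id.lower())
        for attempt in range(retries + 1):
            try:
                text = get(url)
                if text and text.startswith(">"):
                    return text
            except OSError:
                pass  # network error: back off, then retry or move on
            sleep(1.5 ** attempt)
    return None  # caller falls back to the obsolete-entry lookup
```

Injecting `get` and `sleep` keeps the retry policy testable without touching the network.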
Known special cases from our run:
- `1A2K`, `3RVW` — obsolete PDB IDs; FASTAs downloaded from replacement entries (5BXQ, 5VPG)
- 4 non-standard IDs: `BAAD`, `BOYV`, `BP57`, `CP57` — sequences extracted from ATOM records
DNA/RNA exclusion policy: DNA/RNA chains are excluded from all prediction FASTAs. BM5.5 is a protein-protein docking benchmark; neither AlphaFold nor Boltz supports nucleic acid prediction. Targets with DNA in the crystal structure (3P57, 1H9D) have their FASTAs filtered to protein chains only.
# Check download count
grep -c 'ok' fasta/fasta_download_log.csv
# Expected: ~260-266
# Check failures
grep 'fail' fasta/fasta_download_log.csv
# Expected: 0-3 failures
# Verify FASTA format (should start with >)
head -2 fasta/1AK4.fasta
# Expected:
# >1AK4_1|Chains A, B|...
# MNGKII...
# Check that multi-chain complexes have multiple > headers
grep -c '^>' fasta/1AK4.fasta
# Expected: 2+ (one per chain group)

What it does:
- Takes a flat directory of `{PDBID}.fasta` files
- Creates one subdirectory per PDB ID under `data/`
- Copies (or moves) the FASTA file into the subdirectory
- Optionally renames it (e.g., to `sequence.fasta`)
python3 scripts/organize_fastas.py fasta/ data/ --rename sequence.fasta

Input: fasta/*.fasta
Output: data/{PDBID}/sequence.fasta
Behavior:
- PDB ID extracted from the filename stem (uppercased)
- Creates the directory if it doesn't exist
- Skips existing files unless `--overwrite` is passed
- Copies by default; the `--move` flag deletes the source after copying
No issues identified. This is a straightforward file organizer.
# Should have one directory per successfully downloaded FASTA
ls data/ | wc -l
# Expected: ~260-266
# Each directory should have sequence.fasta
ls data/1AK4/sequence.fasta
# Expected: exists
# Verify content matches original
diff fasta/1AK4.fasta data/1AK4/sequence.fasta
# Expected: no differences

What it does:
- Reads `data/*/sequence.fasta` files
- Parses chain IDs from the RCSB FASTA headers
- Rewrites each sequence with Boltz-compatible headers: `>{CHAIN}|PROTEIN|`
- Outputs `boltz_input.fasta` in each subdirectory
Why this is needed: Boltz-1 requires a specific FASTA format where each chain header contains:
- The chain letter
- The molecule type (`PROTEIN`)
- Separated by pipes
Standard RCSB headers look like: >1AK4_1|Chains A, B|CYCLOPHILIN A|Homo sapiens
Boltz needs: >A|PROTEIN|
python3 scripts/prepare_boltz_fastas.py data/

Input: data/*/sequence.fasta
Output: data/*/boltz_input.fasta
Chain parsing logic:
1. Looks for a `Chain[s] A, B, C` pattern in the header (regex)
2. Fallback: splits on `|` and takes the first character of the second field
3. Last resort: defaults to chain `A`
Known behavior:
- If the RCSB header says `Chains A, B` (homo-dimer with identical chains), the script writes both `>A|PROTEIN|` and `>B|PROTEIN|` with the same sequence — this is correct for Boltz multimer input
- Single-chain entries get a single `>A|PROTEIN|` header
Potential issue: If a header has an unusual format that doesn't match the regex, it falls back to A. This could produce incorrect chain assignments for some entries. Verify output manually for critical targets.
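The header-to-Boltz conversion can be sketched with a regex. This is a simplified stand-in for `prepare_boltz_fastas.py` (the pipe-splitting fallback tier is omitted; `boltz_headers` is a hypothetical name):

```python
import re

def boltz_headers(rcsb_header, seq):
    """Convert one RCSB FASTA record into Boltz-style records, one per chain."""
    # Appending '|' handles headers that end right after the chain list
    m = re.search(r"Chains?\s+([A-Za-z0-9,\s]+?)\|", rcsb_header + "|")
    if m:
        chains = [c.strip() for c in m.group(1).split(",") if c.strip()]
    else:
        chains = ["A"]  # last-resort default, as in the pipeline
    return "".join(f">{c}|PROTEIN|\n{seq}\n" for c in chains)
```

Note how the default-to-`A` branch triggers exactly when the regex misses, which is the caveat flagged above.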
# Check that boltz_input.fasta was created
ls data/1AK4/boltz_input.fasta
# Expected: exists
# Verify Boltz format
head -4 data/1AK4/boltz_input.fasta
# Expected:
# >A|PROTEIN|
# MNGKII...
# >D|PROTEIN|
# PIVQNL...
# Verify chain count matches sequence.fasta
grep -c '^>' data/1AK4/sequence.fasta
grep -c '^>' data/1AK4/boltz_input.fasta
# Expected: should match (or be equal/greater if chains are split)
# Validate ALL Boltz FASTAs have correct format
for f in data/*/boltz_input.fasta; do
if ! grep -q '|PROTEIN|' "$f"; then
echo "BAD: $f"
fi
done
# Expected: no output (all valid)

What they do:
- Submit SLURM GPU jobs to run AlphaFold 2.3.2
- Auto-detect monomer vs multimer based on sequence count
- Produce 10 models per target: 5 AMBER-relaxed (`ranked_*.pdb`) + 5 unrelaxed (`unrelaxed_model_*.pdb`)
- AMBER relaxation is treated as the 7th relaxation protocol (alongside the 6 Rosetta protocols)
- Skip completed targets (completion guard checks for `ranking_debug.json`)
- Clean up intermediates (MSAs, pickles) to manage disk usage
# Standard (64GB RAM, covers most targets)
sbatch --array=1-257%10 scripts/run/af_array.slurm
# High-memory (128GB RAM, for large multimer complexes)
sbatch --array=<TASK_IDS> scripts/run/af_array_highmem.slurm

Resources: 1 node, 6 tasks, 1 A6000 GPU, 64GB RAM, 48h
FASTA selection: Prefers sequence.fasta over boltz_input.fasta
Preset detection: Counts > lines; if >1 → multimer, else monomer
Output: af_out/sequence/ranked_{0-4}.pdb (AMBER-relaxed) + af_out/sequence/unrelaxed_model_*.pdb (raw predictions)
Key flags:
- `--nouse_gpu_relax` — runs AMBER relaxation on CPU (avoids GPU memory issues)
- `--models_to_relax=all` — AMBER-relaxes all 5 ranked models (not just the best)
- `--max_template_date=9999-12-31` — allows all templates (no date cutoff)
- Monomer: uses the `pdb70` database
- Multimer: uses `pdb_seqres` + `uniprot`, with `num_multimer_predictions_per_model=1`
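The monomer/multimer auto-detection is just a header count; a minimal sketch (the function name is illustrative):

```python
def detect_preset(fasta_text):
    """AF preset selection: more than one '>' header means multimer, else monomer."""
    n_chains = sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
    return "multimer" if n_chains > 1 else "monomer"
```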
Database preset: Full databases (equivalent to --db_preset=full_dbs) with reduced_dbs
fallback on HHblits failure. Primary uses HHblits for BFD/UniRef30 searches; fallback uses
jackhmmer with small_bfd. The run_af() function wraps the prediction call and retries
with --db_preset=reduced_dbs --small_bfd_database_path=... if full_dbs fails.
AMBER safety: If AMBER relaxation crashes but unrelaxed models exist, the script preserves unrelaxed output instead of deleting everything during the reduced_dbs retry.
Database paths (ACCRE, /csbtmp/alphafold-data.230/):
- `uniref90/uniref90.fasta`
- `mgnify/mgy_clusters_2022_05.fa`
- `uniref30/UniRef30_2021_03`
- `bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt`
- `pdb70/pdb70`
- `pdb_mmcif/mmcif_files/`
- `pdb_mmcif/obsolete.dat`
Known OOM targets at 64GB (require 128GB highmem script): 1AHW, 1ATN, 1DFJ, 1DQJ, 1E6J, 1FC2, 1IRA, 1JWH, 1MLC (all large multimer complexes)
AMBER relaxation failures — ALL RESOLVED:
7 targets (1ATN, 1DFJ, 1FC2, 1WEJ, 2BTF, 4CPA, 5JMO) originally failed AMBER relaxation
due to non-standard residues (X/Z) in FASTA sequences. Root cause identified by Blue:
AlphaFold can't place atoms for X (unknown AA) and Z (ambiguous Glu/Gln), causing AMBER's
_check_residues_are_well_defined() to reject the model. Fix: trimmed X/Z from FASTA
terminals. All 7 now have full 10 models (5 ranked + 5 unrelaxed).
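The FASTA fix amounts to stripping X/Z from sequence termini only, leaving internal residues untouched. A minimal sketch (hypothetical helper; handles wrapped sequence lines):

```python
def trim_fasta_terminals(fasta_text):
    """Strip X (unknown) and Z (ambiguous Glu/Gln) from the ends of each
    sequence record; internal X/Z residues are left alone."""
    out = []
    for block in fasta_text.split(">")[1:]:
        header, _, seq = block.partition("\n")
        seq = "".join(seq.split())  # join wrapped sequence lines
        out.append(f">{header}\n{seq.strip('XZ')}\n")
    return "".join(out)
```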
HHblits failures (31 targets, resubmitted with reduced_dbs fallback):
If the main job was submitted before the fallback logic was added to the script, HHblits
failures will not trigger the reduced_dbs retry. Clean failed af_out/ directories and
resubmit with the current script version.
After cleanup, 10 models per target are retained:
data/{PDBID}/af_out/
└── sequence/
├── ranked_0.pdb # Best AMBER-relaxed model (0-indexed)
├── ranked_1.pdb
├── ranked_2.pdb
├── ranked_3.pdb
├── ranked_4.pdb # 5th AMBER-relaxed model
├── unrelaxed_model_1_*.pdb # Raw AF prediction (1-indexed, AF convention)
├── unrelaxed_model_2_*.pdb
├── unrelaxed_model_3_*.pdb
├── unrelaxed_model_4_*.pdb
├── unrelaxed_model_5_*.pdb
└── ranking_debug.json # Maps ranked (0-indexed) to model (1-indexed)
Numbering note: AlphaFold uses 0-indexed ranked_*.pdb (ordered by confidence) and
1-indexed unrelaxed_model_* (unordered). The ranking_debug.json file maps between them.
For example, ranked_0.pdb may correspond to unrelaxed_model_3 if model 3 scored highest.
Intermediate files removed by cleanup: MSAs, feature pickles, result pickles,
relaxed_model_*.pdb (duplicates of ranked PDBs), timings.
# Check how many targets have ranking_debug.json (completion marker)
completed=0
for d in data/*/af_out/*/; do
if [ -f "$d/ranking_debug.json" ]; then
completed=$((completed + 1))
fi
done
echo "AlphaFold complete: $completed"
# Expected: 246/246 (all active benchmark targets)
# Verify a specific prediction
ls data/1AK4/af_out/sequence/ranked_*.pdb | wc -l
# Expected: 5
# Check model quality (ranking_debug.json shows pLDDT/pTM scores)
cat data/1AK4/af_out/sequence/ranking_debug.json | python3 -m json.tool | head -10

What they do:
- Run Boltz-1 v0.4.1 structure prediction
- Use MSA server for fast multiple sequence alignment
- Produce 5 diffusion samples per target
# Count targets
N=$(ls -d data/*/ | wc -l)
# Submit array job
sbatch --array=1-${N} scripts/boltz_array.slurm

Or for a single target:
sbatch --chdir=data/1AK4 scripts/boltz_single.slurm boltz_input.fasta

Resources: 1 node, 1 CPU, 1 L40S GPU, 256GB RAM, 2 days
FASTA selection: Prefers boltz_input.fasta over sequence.fasta
Header validation: Checks that all headers match >X|PROTEIN| format
Key parameters:
- `diffusion_samples=5` — generates 5 models
- `recycling_steps=10` — structure refinement iterations
- `sampling_steps=200` — diffusion sampling steps
- `output_format=pdb` — outputs PDB files (not mmCIF)
- `use_msa_server` — uses the ColabFold MSA server instead of local databases
Output format: --output_format pdb (PDB files, not mmCIF)
Skip guard: Checks for existing boltz_model_*.pdb files; skips if >= 5 found
Force re-run: Set FORCE=1 environment variable
Potential issues:
- 256GB RAM is very high; Boltz typically needs 16-64GB. This is set conservatively.
- `--use_msa_server` requires internet access from compute nodes (may be blocked on some clusters)
- The MSA server can be slow or rate-limited during peak usage
data/{PDBID}/boltz_out_dir/
└── boltz_results_boltz_input/
├── predictions/boltz_input/
│ ├── boltz_input_model_0.pdb
│ ├── boltz_input_model_1.pdb
│ ├── boltz_input_model_2.pdb
│ ├── boltz_input_model_3.pdb
│ └── boltz_input_model_4.pdb
├── msa/ # Generated MSAs
└── processed/ # Processed features
# Check how many targets have Boltz models
completed=0
for d in data/*/boltz_out_dir/; do
count=$(find "$d" -name 'boltz_input_model_*.pdb' -o -name 'boltz_model_*.pdb' 2>/dev/null | wc -l)
if [ "$count" -ge 1 ]; then
completed=$((completed + 1))
fi
done
echo "Boltz complete: $completed"
# Expected: 246/246 (all active benchmark targets, 5 models each)
# 11 targets excluded from benchmark due to Boltz OOM:
# 9 full OOM (>3000 total residues): 1DE4, 1K5D, 1N2C, 1WDW, 1ZM4, 3BIW, 3L89, 4GXU, 6EY6
# 2 partial OOM (1/5 models only): 1GXD, 3EO1
# Verify a specific prediction
find data/1AK4/boltz_out_dir -name '*.pdb' | wc -l
# Expected: 5

Before running relaxation, predictions need to be organized into the test/ directory structure:
for pdb_id in $(ls data/); do
# Skip if no predictions
af_count=$(ls data/$pdb_id/af_out/sequence/ranked_*.pdb 2>/dev/null | wc -l)
boltz_count=$(find data/$pdb_id/boltz_out_dir -name 'boltz_input_model_*.pdb' 2>/dev/null | wc -l)
if [ "$af_count" -lt 5 ] && [ "$boltz_count" -lt 5 ]; then
echo "SKIP: $pdb_id (AF=$af_count, Boltz=$boltz_count)"
continue
fi
mkdir -p test/$pdb_id/AF test/$pdb_id/Boltz
# Copy crystal structure
if [ -f cleaned/$pdb_id.pdb ]; then
cp cleaned/$pdb_id.pdb test/$pdb_id/$pdb_id.pdb
fi
# Copy AF predictions
cp data/$pdb_id/af_out/sequence/ranked_*.pdb test/$pdb_id/AF/ 2>/dev/null
# Copy Boltz predictions
find data/$pdb_id/boltz_out_dir -name 'boltz_input_model_*.pdb' \
-exec cp {} test/$pdb_id/Boltz/ \; 2>/dev/null
done

# Check structure
ls test/1AK4/
# Expected: 1AK4.pdb AF/ Boltz/
ls test/1AK4/AF/
# Expected: ranked_0.pdb through ranked_4.pdb
ls test/1AK4/Boltz/
# Expected: boltz_input_model_0.pdb through boltz_input_model_4.pdb

7 relaxation protocols total: 1 AMBER (computed during AF prediction) + 6 Rosetta.
AMBER relaxation via OpenMM is performed during AlphaFold prediction (Step 5):
- Force field: AMBER ff14SB
- Energy tolerance: 2.39 kcal/mol
- Position restraint stiffness: 10.0 kcal/mol/A^2
- Compute: CPU (`--nouse_gpu_relax`)
- Output: `ranked_*.pdb` files (already generated in Step 5)
No separate script needed — AMBER relaxation is built into AlphaFold's --models_to_relax=all.
What it does:
- For each PDB directory in `test/`:
  - Finds the crystal structure PDB
  - Runs 6 relaxation protocols x 5 replicates = 30 relaxed structures
  - Each replicate uses `-nstruct 1` with a different suffix (`_r1` through `_r5`)
| Key | Protocol | Scoring Function | Rosetta Flags |
|---|---|---|---|
| `cart_beta` | Cartesian | beta_nov16 | `-relax:cartesian -beta_nov16 -score:weights beta_nov16_cart` |
| `cart_ref15` | Cartesian | ref2015 | `-relax:cartesian -score:weights ref2015_cart` |
| `dual_beta` | Dualspace | beta_nov16 | `-relax:dualspace -beta_nov16 -score:weights beta_nov16_cart -nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths` |
| `dual_ref15` | Dualspace | ref2015 | `-relax:dualspace -score:weights ref2015_cart -nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths` |
| `norm_beta` | Normal FastRelax | beta_nov16 | `-beta_nov16 -score:weights beta_nov16` |
| `norm_ref15` | Normal FastRelax | ref2015 | `-score:weights ref2015` |
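Assembling one relax invocation from the protocol table might look like the following sketch. The `relax_cmd` helper is illustrative (not the actual SLURM script); flags are taken from the protocol table and the common flags documented in this step:

```python
# Per-protocol flags, transcribed from the protocol table
PROTOCOLS = {
    "cart_beta":  "-relax:cartesian -beta_nov16 -score:weights beta_nov16_cart",
    "cart_ref15": "-relax:cartesian -score:weights ref2015_cart",
    "dual_beta":  ("-relax:dualspace -beta_nov16 -score:weights beta_nov16_cart "
                   "-nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths"),
    "dual_ref15": ("-relax:dualspace -score:weights ref2015_cart "
                   "-nonideal -relax:minimize_bond_angles -relax:minimize_bond_lengths"),
    "norm_beta":  "-beta_nov16 -score:weights beta_nov16",
    "norm_ref15": "-score:weights ref2015",
}

# Flags shared by all protocols
COMMON = ("-ignore_zero_occupancy false -nstruct 1 -no_nstruct_label "
          "-out:pdb_gz -flip_HNQ -fa_max_dis 9.0 -optimization:default_max_cycles 200")

def relax_cmd(relax_bin, database, pdb, protocol, replicate):
    """One Rosetta relax invocation: replicate suffix + protocol flags + common flags."""
    return [relax_bin, "-database", database, "-s", pdb,
            "-out:suffix", f"_r{replicate}"] + PROTOCOLS[protocol].split() + COMMON.split()
```

Iterating `relax_cmd` over 6 protocols x 5 replicates reproduces the 30 runs per structure.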
# Crystal structure relaxation
N=$(ls -d test/*/ | wc -l)
sbatch --array=1-${N} scripts/relax_predictions.slurm

For AI prediction relaxation, the script needs to be modified to iterate over AF and Boltz models within each PDB directory. The extended version (used in our pipeline) does:
# For each PDB in test/:
# For each source (AF, Boltz):
# For each model (5 models):
# For each protocol (6 protocols):
# For each replicate (5 replicates):
#         Run Rosetta relax → test/{PDB}/relax/{source}/{model}/{protocol}/

Resources: 1 node, 1 CPU, 4GB RAM, 48h
No GPU required — Rosetta relax is CPU-only
Common Rosetta flags:
- `-ignore_zero_occupancy false` — processes all atoms regardless of occupancy
- `-nstruct 1` — one structure per run (replicates via suffix)
- `-no_nstruct_label` — don't add `_0001` to the output filename
- `-out:pdb_gz` — gzip output (saves disk space)
- `-flip_HNQ` — optimize His/Asn/Gln hydrogen placement
- `-fa_max_dis 9.0` — maximum interaction distance for scoring
- `-optimization:default_max_cycles 200` — convergence iterations
Potential issues:
- 48h timeout: a single PDB with 6 protocols x 5 replicates = 30 runs. Each relax takes 5-60 minutes depending on size, so large complexes (>800 residues) may time out.
- 4GB RAM is tight for large complexes. May need 8-16GB for >600 residues.
- The script uses `${CFG[$key]}` without quoting — this works because the values contain no spaces in critical positions, but it's technically fragile.
test/{PDBID}/
├── cartesian_beta/
│ ├── {PDBID}_r1.pdb.gz through _r5.pdb.gz
│ ├── log/
│ │ └── {PDBID}_cart_beta_r1.log through _r5.log
│ └── relax.fasc
├── cartesian_ref15/
│ └── [same structure]
├── dualspace_beta/
├── dualspace_ref15/
├── normal_beta/
└── normal_ref15/
For AI predictions:
test/{PDBID}/relax/
├── AF/
│ ├── ranked_0/
│ │ ├── cartesian_beta/ranked_0_r1.pdb.gz ... _r5.pdb.gz
│ │ ├── cartesian_ref15/
│ │ ├── dualspace_beta/
│ │ ├── dualspace_ref15/
│ │ ├── normal_beta/
│ │ └── normal_ref15/
│ ├── ranked_1/
│ └── ... (ranked_2 through ranked_4)
└── Boltz/
├── boltz_input_model_0/
└── ... (model_1 through model_4)
# Count relaxed structures for a single PDB (crystal only)
find test/1AK4/ -maxdepth 2 -name '*.pdb.gz' | wc -l
# Expected: 30 (6 protocols x 5 replicates)
# Count relaxed structures including AI predictions
find test/1AK4/relax/ -name '*.pdb.gz' | wc -l
# Expected: 300 (10 models x 6 protocols x 5 replicates)
# Verify a relaxation log looks normal (check for "Total weighted score")
tail -5 test/1AK4/cartesian_beta/log/1AK4_cart_beta_r1.log
# Expected: should end with score information, no errors
# Check that .fasc files contain scores
head -3 test/1AK4/cartesian_beta/relax.fasc
# Expected: SCORE: header line followed by data

What it does:
- For each PDB in the test set:
- Validates the experimental crystal structure
- Validates 5 raw AlphaFold predictions
- Validates 5 raw Boltz predictions
- Validates all relaxed structures (30 crystal + 300 prediction relaxations)
- Outputs per-protein CSV files with 16+ metrics per structure
- Supports resume (skips already-validated structures)
bash scripts/run_molprobity.sh test/ molprobity_results/

Dependencies: reduce (hydrogen addition) + phenix.molprobity (validation)
Process per structure:
1. Decompress `.pdb.gz` if needed
2. Add hydrogens with `reduce -FLIP -Quiet` (try without `-FLIP` if that fails)
3. Run `phenix.molprobity input_H.pdb keep_hydrogens=True` with a 600s timeout
4. Parse `molprobity.out` for metrics
5. Write to the per-protein CSV
Metrics collected:
| Metric | Description | Good Values |
|---|---|---|
| Clashscore | All-atom steric clashes per 1000 atoms | < 10 |
| Ramachandran Favored | % residues in favored regions | > 95% |
| Ramachandran Outliers | % residues in outlier regions | < 0.5% |
| Rotamer Outliers | % side-chain rotamer outliers | < 2% |
| C-beta Deviations | Backbone geometry deviations | < 5 |
| Bond RMSZ | Bond length deviation from ideal | < 1.5 |
| Angle RMSZ | Bond angle deviation from ideal | < 1.5 |
| MolProbity Score | Combined quality metric | < 2.0 |
Potential issues:
- `reduce` may fail on structures with non-standard residues (falls back to no hydrogens)
- The 600s timeout may be insufficient for very large complexes
- Parsing relies on a specific `molprobity.out` format — may break with different Phenix versions
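The parsing step can be approximated with a tolerant "name = value" scan over the summary block. This is a sketch assuming that layout in `molprobity.out`; the actual script's regexes may differ:

```python
import re

def parse_molprobity_out(text):
    """Pull numeric 'name = value' pairs from a molprobity.out-style summary.
    Trailing units like '%' are ignored; non-numeric lines are skipped."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z\- ]*?)\s*=\s*([-\d.]+)", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics
```

A parser this loose survives small format drift between Phenix versions better than hard-coded line offsets.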
Structures validated per PDB: 341 total
- 1 crystal structure
- 10 raw predictions (5 AF + 5 Boltz)
- 30 crystal relaxations (6 protocols x 5 reps)
- 300 prediction relaxations (10 models x 6 protocols x 5 reps)
# Check that CSV files were created
ls molprobity_results/csvs/ | wc -l
# Expected: 20 (one per PDB in test subset)
# Check line count (should be 342 = 1 header + 341 structures)
wc -l molprobity_results/csvs/1AK4_validation.csv
# Expected: 342
# Check for failures
grep -c 'FAILED' molprobity_results/csvs/1AK4_validation.csv
# Expected: 0
# Merge all CSVs for analysis
head -1 molprobity_results/csvs/$(ls molprobity_results/csvs/ | head -1) > molprobity_results/all_validation.csv
tail -n +2 -q molprobity_results/csvs/*_validation.csv >> molprobity_results/all_validation.csv
wc -l molprobity_results/all_validation.csv
# Expected: ~6821 (1 header + 341 x 20 PDBs)

What it does:
- Loads multiple structures into PyMOL
- Aligns each to a reference structure (crystal)
- Calculates C-alpha RMSD
- Extracts Rosetta total_score from PDB REMARK lines or scorefile
- Outputs TSV with RMSD and energy per structure
# Load structures
load test/1AK4/1AK4.pdb, crystal
load test/1AK4/AF/ranked_0.pdb, af_ranked0
# ... load more structures
# Run the plugin
run scripts/collect_metrics.py
collect_metrics ref=crystal, sel=polymer and name CA, do_fit=1, out=metrics.tsv

For relaxed structures with a Rosetta scorefile:
collect_metrics ref=crystal, scorefile=test/1AK4/cartesian_beta/relax.fasc, out=metrics.tsv

RMSD calculation:
- Uses PyMOL's `cmd.align()` with `cycles=0` (no outlier rejection) — gives the true RMSD
- Selection: `polymer and name CA` — C-alpha atoms only
- `transform=1` — performs optimal superposition before measuring
Energy extraction priority:
1. Check the external scorefile (`.fasc` or `.sc`) if provided
2. Parse REMARK lines in the PDB for `total_score`
3. Return `None` if neither is found
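Step 2 of that priority order might look like the following sketch (hypothetical helper; the exact REMARK layout shown in the docstring is an assumption):

```python
def energy_from_pdb(pdb_text):
    """Fallback energy extraction: scan REMARK lines for a total_score value.

    Assumes a line shaped like 'REMARK total_score -123.456'; returns None
    if no such line exists (an external scorefile, when given, takes priority).
    """
    for line in pdb_text.splitlines():
        if line.startswith("REMARK") and "total_score" in line:
            for token in line.split():
                try:
                    return float(token)  # first numeric token on the line
                except ValueError:
                    continue
    return None
```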
Output: TSV file with columns: object, rmsd_to_{ref}, pairs, energy_total_score
No issues identified. This script is straightforward and correct.
# Check output file
head -5 metrics.tsv
# Expected: tab-separated with RMSD values
# RMSD values should be:
# - Crystal to crystal: NaN (self)
# - Crystal relaxed: 0.5-2.0 A
# - AF predictions: 1.0-5.0 A (depends on quality)
# - AF relaxed: 1.5-6.0 A

| Step | Script | Status | Issues |
|---|---|---|---|
| 0 | Download BM5.5 | Manual | URL may change; verify on Weng Lab site |
| 1 | `clean_pdbs.sh` | PASS | Space chain ID mapping works correctly |
| 2 | `download_fastas.py` | PASS | 2-3 PDB IDs may fail (obsolete/invalid) |
| 3 | `organize_fastas.py` | PASS | No issues |
| 4 | `prepare_boltz_fastas.py` | PASS with caveat | Unusual FASTA headers may default to chain A |
| 5 | `af_array.slurm` | PASS | 64GB RAM; 128GB highmem variant for large complexes; 7 AMBER failures resolved via FASTA fix |
| 5 | `af_array_highmem.slurm` | PASS | 128GB RAM for large multimer complexes |
| 6 | `boltz_array.slurm` | PASS | MSA server needs internet from compute node; 246/246 complete |
| 6 | `boltz_single.slurm` | PASS | Same MSA server caveat |
| 7 | Organize predictions | Manual | No script — documented above |
| 8 | `relax_predictions.slurm` | PASS with caveat | Crystal structures; 48h may time out for large complexes |
| 8 | `relax_test.slurm` | PASS | AI predictions (AF + Boltz); per-model array indexing |
| 8 | `relax_finish.slurm` | PASS with caveat | Checkpoint recovery; uses `$protocol` instead of `$PROTOCOL` (line 108 bug) |
| 9 | `run_molprobity.sh` | PASS | Phenix version-dependent output parsing |
| 10 | `collect_metrics.py` | PASS | No issues |
The pipeline is scientifically sound and technically correct. The main risks are:
- Resource limits — some SLURM scripts use tight memory/time limits that may fail on large complexes
- External dependencies — MSA server availability, RCSB API stability
- Manual steps — Step 7 (organizing predictions) has no script and is error-prone
- Increase array job memory to 16GB for relaxation, 64GB for AlphaFold array
- Create a script for Step 7 (organizing predictions) to eliminate manual errors
- Add a master pipeline script that chains all steps with dependency checking
- Add a `.gitignore` for large intermediate files (MSAs, pickles)
ResourceExhaustedError: OOM when allocating tensor
Fix: Increase --mem in SLURM script. Large complexes (>600 residues) may need 64-128GB.
ConnectionError: MSA server not responding
Fix: Check internet access from compute nodes. If blocked, pre-compute MSAs using local databases:
boltz predict input.fasta --msa_dir ./precomputed_msas/ --output_format pdb

ERROR: Unable to open file
Fix: Check that the Rosetta database path (-database) is correct and accessible from compute nodes.
NA,NA,NA,NA...
Fix: Check Phenix version compatibility. The parsing regex expects specific output format from phenix.molprobity. Run manually and compare output format:
phenix.molprobity input.pdb keep_hydrogens=True
cat molprobity.out

RuntimeError: HHblits failed on the input
Cause: SLURM copies the submission script at job submission time. If the reduced_dbs
fallback was added to af_array.slurm AFTER the job was submitted, running tasks use the
old version without fallback.
Fix:
1. Identify failed targets: `for d in data/*/af_out; do [ -d "$d" ] && [ ! -f "$d"/*/ranking_debug.json ] && ls "$d"/*/unrelaxed_model_*.pdb 2>/dev/null | wc -l | grep -q '^0$' && echo "$(basename $(dirname $d))"; done`
2. Clean failed output: `rm -rf data/{TARGET}/af_out`
3. Resubmit with the current script: `sbatch --array=<IDS> af_array.slurm`
Disk quota exceeded
Fix: AF intermediates consume ~1 GB per target (MSAs, pickles). Clean manually:
for d in data/*/af_out/*/; do
rm -rf "$d/msas" "$d/features.pkl" "$d/result_model_"*.pkl "$d/timings.json" 2>/dev/null
# Remove relaxed_model_*.pdb (duplicates of ranked_*.pdb)
rm -f "$d/relaxed_model_"*.pdb 2>/dev/null
done

Monitor with du -sh /data/p_csb_meiler/agarwm5/protein_ideal_test/benchmarking/.
[FAIL] 3RVW: no FASTA found
Fix: Some PDB IDs are obsolete. Check https://www.rcsb.org/structure/{PDBID} for replacement entries. Update your PDB list accordingly.