PlaszymeGNN is a deep learning framework designed to predict enzyme–plastic interactions, supporting the discovery of novel plastic-degrading enzymes within the context of synthetic biology.
PlaszymeGNN adopts a dual-tower architecture that accepts arbitrary inputs of enzymes and polymers.
This enables predictions on previously unseen plastic molecules, providing broader generalization and novel discovery.
Figure: The dual-tower architecture of PlaszymeGNN (Protein tower + Polymer tower + Interaction head).
- Protein tower: GNN / GVP backbones, with optional ESM embeddings for enriched sequence and structure representation.
- Polymer tower: Graph-based representation with polymer-optimized descriptors.
We innovatively normalize traditional descriptors to better reflect polymer properties and to generalize beyond known categories. - Interaction head: Multiple fusion modules (cosine similarity, bilinear layers etc) to capture enzyme–polymer binding patterns.
-
Enzyme discovery for plastic degradation
Accelerating the identification of novel enzymes for plastic degradation pathways. -
Synthetic biology pathway design
Leveraging arbitrary enzyme–polymer predictions to guide metabolic pathway engineering in synthetic biology. -
Novel polymer screening
Predicting enzyme interactions with previously unseen plastic molecules, enabling exploration of new biodegradable materials. -
Experimental prioritization
Providing ranked recommendations of enzyme–polymer pairs to reduce experimental cost and time. -
Understanding recognition mechanisms
Revealing structural principles of enzyme–plastic recognition to inform rational design and directed evolution.
git clone <https://github.com/Tsutayaaa/Plaszyme.git>
cd PlaszymeWe recommend using Conda to manage environments.
The general dependencies are provided in environment.yml.
conda env create -f environment.yml
conda activate plaszyme_gnn
PyTorch
is the deep learning backbone of this project.
Install the correct version of PyTorch for your system. Refer to the PyTorch official installation guide.
Example (CUDA 12.8):
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128CPU-only (or Mac):
pip3 install torch torchvision
PyG (PyTorch Geometric) is the core library we use to build graph-based protein and polymer representations.
Install following the PyG installation guide.
Check that PyTorch and PyG are installed correctly:
python -c "import torch, torch_geometric; print(torch.__version__, torch_geometric.__version__)"Expected output:
2.7.1 2.6.1
Finally, install PlaszymeGNN itself so that the src/ modules
can be imported properly in training and prediction scripts.
pip install -e .This command uses the pyproject.toml configuration to register
the plaszyme package in editable mode.
After installation, check that the package can be imported successfully:
python -c "import plaszyme; print('PlaszymeGNN package loaded successfully')"If no error is shown, you are ready to run the training and prediction scripts.
PlaszymeGNN provides a list-wise scoring and analysis pipeline for enzyme–polymer interactions via scripts/predict_listwise.py. It supports confidence metrics, rank-based outputs, and batch evaluation.
Edit the constants at the top of scripts/predict_listwise.py:
# --- Model & data paths ---
MODEL_PATH = "../weights/gnn_bilinear/best_bilinear.pt" # <- REQUIRED: your trained .pt weights
PT_OUT_ROOT = "./data/processed/graphs_pt" # cached graphs (.pt)
SDF_ROOT = "./data/plastics_sdf/10_mers" # plastic library (.sdf/.mol)
PDB_ROOT = os.getcwd() # graph builder root (keep default)
# --- Prediction ---
OUTPUT_DIR = "../prediction_results"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
TOP_K = 10
INCLUDE_CONFIDENCE = True
# --- Plastics (two modes) ---
USE_ALL_SDF_PLASTICS = True # when PLASTIC_INPUT=None, load all from SDF_ROOT
DEFAULT_PLASTICS = [ ... ] # fallback when SDF_ROOT is empty
PLASTIC_INPUT = None # provide explicit .sdf/.mol to override the library (see below)
# --- Proteins (PDB) ---
# PDB_INPUT accepts: single file / list of files / a directory
PDB_INPUT = "/path/to/one.pdb" # or ["a.pdb", "b.pdb"] or "/path/to/pdb_dir"About MODEL_PATH (your .pt weights)
- Point
MODEL_PATHto the exact.ptcheckpoint you want to use (e.g.,../weights/gnn_bilinear/best_bilinear.pt). - The backbone (GNN/GVP/Seq) and interaction head are inferred from the checkpoint’s internal cfg; you don’t need to set them separately in the script.
Supports a single file, a list of files, or a directory (the script collects all .pdb files).
Examples:
PDB_INPUT = "./data/sample_pdb/X0001.pdb"
PDB_INPUT = ["./data/sample_pdb/X0001.pdb", "./data/sample_pdb/X0002.pdb"]
PDB_INPUT = "./data/sample_pdb/" # directoryGraphs are built via GNNProteinGraphBuilder/GVPProteinGraphBuilder and cached to PT_OUT_ROOT.
Notes:
- We recommend using AlphaFold or ColabFold predicted PDBs that contain:
- Only residue-level atomic coordinates (no crystallographic water, ligands, or alternate conformations).
- A single chain (default reader will prioritize chain A if multiple chains exist).
- A single model (multi-model ensembles are not supported).
- Overly complex PDBs (e.g., multi-chain complexes, additional heteroatoms) may cause parsing errors or unexpected graph construction issues.
Two modes are supported—explicit input takes precedence:
- Library mode (default)
- If
PLASTIC_INPUT = None, all.sdf/.molfiles underSDF_ROOTare used. SDF_ROOTis empty, the script falls back toDEFAULT_PLASTICS(names are resolved as SDF_ROOT/{name}.sdf/.mol if present).
- licit files (override library)
PLASTIC_INPUTto a file, a list (files and/or folders), or a folder. The script collects all.sdf/.molit finds and ignoresSDF_ROOT/DEFAULT_PLASTICS.
PLASTIC_INPUT = "/data/polymers/PET.sdf" # single file
PLASTIC_INPUT = ["/data/polymers/PET.sdf", "/data/custom_dir"] # mix of files/folders
PLASTIC_INPUT = "/data/polymers_dir" # folder onlyNotes:
- The model natively supports
.sdfand.molformats. - Structural complexity is not strictly required — minimal coordinate files are sufficient for descriptor extraction.
- Advanced option: By making a light modification to the featurizer code, you can also provide SMILES strings directly as polymer inputs instead of
.sdf/.mol.
(This is disabled by default to maintain consistency, but can be adapted if you only have SMILES libraries.)
- Single protein vs entire library
PDB_INPUT = "/data/pdbs/X0001.pdb"
PLASTIC_INPUT = None- Many proteins vs custom SDF folder
PDB_INPUT = "/data/pdbs/"
PLASTIC_INPUT = "/data/my_polymers_sdf/"- Single protein vs single SDF
PDB_INPUT = "/data/pdbs/X0001.pdb"
PLASTIC_INPUT = "/data/custom/PET.sdf"
TOP_K = 1For each protein, the script writes a CSV to OUTPUT_DIR:
{PDB_BASENAME}_predictions.csv, sorted by interaction_score (top TOP_K rows).
If INCLUDE_CONFIDENCE=True, the following columns are produced:
| Column | Meaning |
|---|---|
| rank | Rank by interaction_score (descending) |
| plastic_name | Polymer name or explicit file stem |
| interaction_score | Interaction score from the interaction head |
| confidence_score | Softmax over scores in the batch (T=0.1) |
| relative_strength | Min–max normalized score |
| score_percentile | Percentile within the batch score distribution |
| embedding_similarity | Cosine similarity in the projected feature space |
| prediction_category | Rule-based label (High / Medium / Low / No Significant Interaction) |
Single-plastic special case: if only one plastic is scored, the category reduces to a simple threshold: interaction_score > 0 → High Interaction; otherwise No Significant Interaction.
DEVICE = "cuda" / "cpu" # auto-detected by default; can force it
TOP_K = 10 # export top-K rows only
INCLUDE_CONFIDENCE = True # include confidence & analysis columns
VERBOSE = True # logging verbosityThe active backbone and interaction head are defined by the loaded checkpoint’s cfg (inside MODEL_PATH).
- Model file not found: check the
MODEL_PATHabsolute/relative path and filename. - No PDB files found: verify
PDB_INPUTtype/path and.pdbsuffix. - No valid plastic features extracted: ensure valid
.sdf/.mol(or correctSDF_ROOT). - RDKit errors: prefer installing rdkit via conda-forge for best compatibility.
- Graph build failed: the script will try to load cached graphs from
PT_OUT_ROOT. - Unxpected category with a single plastic: the single-item case uses the strict threshold (score > 0).
# 1) Activate env and install the project (editable)
conda activate plaszyme_gnn
pip install -e .
# 2) Run with defaults (paths in the script header)
python scripts/train_listwise.pyBy default, the script uses the hyperparameters and paths defined in TrainConfig inside train_listwise.py. Customize via a JSON config to avoid editing code:
python scripts/train_listwise.py --config configs/train_listwise.jsonExample configs/train_listwise.json (edit paths to your data):
{
"device": "cuda",
"enz_backbone": "GNN",
"pdb_root": "./data/sample_pdb",
"pt_out_root": "./data/processed/graphs_pt",
"sdf_root": "./data/plastics_sdf/10_mers",
"train_csv": "./data/matrix/plastics_onehot_trainset.csv",
"test_csv": "./data/matrix/plastics_onehot_testset.csv",
"out_dir": "./train_results/gnn_bilinear",
"emb_dim_enz": 128,
"emb_dim_pl": 128,
"proj_dim": 128,
"batch_size": 10,
"lr": 3e-4,
"weight_decay": 1e-4,
"epochs": 100,
"max_list_len": 10,
"temp": 0.2,
"alpha": 0.4,
"interaction": "gated",
"bilinear_rank": 64,
"lambda_w_reg": 1e-4,
"enable_list_mitigation": false,
"seed": 42
}- We recommend using AlphaFold/ColabFold predictions:
- Single chain, single model, residue coordinates only.
- If a PDB contains multiple chains/models, the builder defaults to chain A.
- Example folder:
./data/sample_pdb
- Provided as SDF/MOL oligomers (e.g., 10-mers) in:
./data/plastics_sdf/10_mers - Internal featurizer computes polymer-optimized, normalized descriptors.
- Advanced users: accept SMILES input with minor code changes.
- Format: rows = enzymes, columns = plastics, entries = {0,1} indicating known degradation.
- Example:
| enzyme_id | PET | PCL | PLA | PBAT | ... |
|---|---|---|---|---|---|
| X0004 | 1 | 0 | 0 | 1 | ... |
| X0006 | 0 | 1 | 0 | 0 | ... |
| X0009 | 0 | 0 | 1 | 0 | ... |
| X0010 | 0 | 0 | 1 | 0 | ... |
| ... | ... | ... | ... | ... | ... |
| X0020 | 1 | 0 | 0 | 0 | ... |
| X0021 | 0 | 1 | 0 | 0 | ... |
| X0024 | 0 | 0 | 1 | 0 | ... |
Here:
enzyme_id= PDB filename- Each plastic column = polymer class
- Value
1= enzyme reported to degrade this polymer
All example datasets can be constructed or extended from PlaszymeDB,
a curated database of plastic-degrading enzymes, polymers, and their reported interactions.
GitHub: https://github.com/Tsutayaaa/PlaszymeDB
WebApp: http://plaszyme.org/plaszymeDB
PlaszymeGNN adopts a dual-tower architecture:
-
Protein tower (enzyme encoder)
"GNN"— graph convolutional backbone (default)"GVP"— geometric vector perceptron (vector/scalar features)"MLP"— sequence pooling + MLP baseline
(All support optional ESM embeddings during graph building.)
-
Polymer tower (plastic encoder)
PolymerTowerthat consumes polymer-optimized, normalized descriptors (from SDF/MOL; lengths padded).
-
TwinProjector
TwinProjectormaps both towers into the same latent space (linear, Xavier init).
-
Interaction head
- Fuses enzyme & plastic embeddings into a scalar score.
Minimal JSON to choose the protein tower:
{ "enz_backbone": "GNN" }
Select via interaction:
| Name | Key | Mechanism | Notes |
|---|---|---|---|
| Cosine | "cos" |
cosine(z_e, z_p) | Simple & stable baseline |
| Bilinear | "bilinear" |
eᵀ W p | Strong but may overfit; add reg |
| Factorized Bilinear | "factorized_bilinear" |
eᵀ (UᵀV) p (low-rank) | Parameter-efficient; add ortho |
| Hadamard | "hadamard_mlp" |
(e ⊙ p) ↦ linear | Lightweight, fast |
| Gated (default) | "gated" |
learnable gating on e ⊙ p | Good accuracy-speed tradeoff |
Example snippet:
{
"interaction": "gated",
"bilinear_rank": 64,
"lambda_w_reg": 1e-4,
"ortho_reg": 0.0
}
- Per-enzyme list of plastics trained with listwise InfoNCE.
- Temperature
tempcontrols sharpness. - Diversity on positive plastics reduces mode collapse.
- Center/variance regularization stabilizes scales.
Typical settings:
{
"temp": 0.2,
"alpha": 0.4,
"plastic_diversify": true,
"lambda_diversify": 0.05,
"var_target": 1.0,
"lambda_center": 1e-3
}
Imbalance in long lists is mitigated by:
- Keeping all positives.
- Mixing hard negatives (highest scores) with random negatives.
- Normalizing by sublist length.
Enable & tune:
{
"enable_list_mitigation": true,
"max_list_len_train": 10,
"neg_per_item": 32,
"hard_neg_cand": 32,
"hard_neg_ratio": 0.5
}
In addition to joint training with enzymes, the polymer tower can be pretrained independently using a plastic co-occurrence matrix (e.g., from biodegradation assays or curated databases).
-
Why it works:
Plastics that are frequently degraded together by similar enzymes tend to share chemical substructures and functional motifs.
By learning to approximate this co-occurrence similarity, the polymer tower acquires a chemically meaningful embedding space even before being paired with enzymes. -
Benefits:
- Provides a stronger initialization for downstream enzyme–plastic prediction.
- Improves generalization to rare or unseen polymers.
- Reduces overfitting when training data is sparse.
Optionally pretrain the polymer tower using a plastic co-occurrence matrix:
{
"use_plastic_pretrain": true,
"co_matrix_csv": "./data/matrix/plastic_co_matrix.csv",
"pretrain_epochs": 10,
"pretrain_loss_mode": "contrastive" // or "mse"
}
The script exports a plastic_pretrained.pt in out_dir after pretraining.
- Logs:
train.log,run_config.txt,run_config.json - Checkpoints:
best_<interaction>.pt,last_<interaction>.pt - Curves:
loss.png,hit.png,score.png - Evaluation:
test_metrics.csvand optionaltest_score_matrix.csv
The script will automatically evaluate on test_csv using the best checkpoint.
- Start with
GNN + gated; then tryfactorized_bilinearfor sharper ranking with fewer params. - Clean PDBs reduce graph-build errors (avoid multi-model, altloc clutter).
- Fix seeds for reproducibility.
- Watch
emb_diag.csv(generated internally) for embedding collapse or scale drift.
The repository follows a modular design to separate core library code, training/evaluation scripts, and documentation assets.
├── LICENSE
├── README.md
├── doc/
├── environment.yml # Conda environment specification
├── pyproject.toml # Python package configuration (PEP 621)
├── data/
│ ├── plastics_sdf/ # Polymer structure files
│ │ ├── 10_mers/ # Polymer fragments (10-mer SDF format)
│ │ └── 3_mers/ # Smaller polymer fragments (3-mer MOL format)
│ ├── processed/ # Pre-processed graph representations
│ │ └── graphs_pt/ # PyTorch Geometric graph tensors
│ └── sample_pdb/ # Example enzyme PDB structures
├── scripts/ # Training and prediction entry points
│ ├── evaluate_tools/
│ │ └── evaluate_testset.py # Evaluation utilities for test sets
│ ├── predict_listwise.py # Prediction script for enzyme–plastic interactions
│ └── train_listwise.py # Main training script
├── src/ # Source code (installed as `plaszyme`)
│ └── plaszyme/
│ ├── builders/ # Protein graph construction modules
│ │ ├── base_builder.py # Base builder class (common logic)
│ │ ├── gnn_builder.py # Graph builder for GNN-based models
│ │ ├── gvp_builder.py # Graph builder for GVP (vector-scalar graphs)
│ │ └── sequence_embedder.py # Sequence embedding (ESM / one-hot / custom)
│ │
│ ├── heads/ # Residue-level supervision heads
│ │ ├── residue_activity_head.py # Predict residue activity/intensity
│ │ └── residue_role_head.py # Predict residue roles (interaction/reactant/spectator)
│ │
│ ├── models/ # Backbone networks and interaction modules
│ │ ├── gnn/ # Graph Neural Network backbones
│ │ │ └── backbone.py
│ │ ├── gvp/ # Geometric Vector Perceptron backbones
│ │ │ ├── backbone.py
│ │ │ └── gvp_local/ # Atom-level GVP extensions
│ │ │ ├── atom3d.py
│ │ │ ├── data.py
│ │ │ └── models.py
│ │ ├── seq_mlp/ # Baseline sequence-only encoder
│ │ │ └── backbone.py
│ │ ├── plastic_backbone.py # Polymer Tower (plastic feature encoder)
│ │ └── interaction_head.py # Fusion layers (cosine, bilinear, gated, etc.)
│ │
│ ├── plastic/ # Polymer feature extraction
│ │ ├── descriptors_rdkit.py # RDKit-based descriptor featurizer
│ │ └── rdkit_features.yaml # Descriptor configuration file
│ │
│ ├── readers/ # Structure parsing utilities
│ │ └── pdb_reader.py # Optimized for AlphaFold/ColabFold single-chain PDBs
│ │
│ └── viz_graph.py # Protein–polymer graph visualization
│
└── weights/ # Saved weights (multiple interaction heads)Protein graph construction and embeddings
base_builder.py→ Base class for graph buildersgnn_builder.py→ Builds residue-level graphs for GNNgvp_builder.py→ Builds vector–scalar graphs for GVPsequence_embedder.py→ Sequence-level embeddings (ESM / one-hot)
Residue-level supervision modules
residue_activity_head.py→ Predict activity scoresresidue_role_head.py→ Predict functional roles (interaction / spectator)
Core neural architectures
gnn/backbone.py→ Graph Neural Network backbonegvp/backbone.py→ Geometric Vector Perceptron backbonegvp_local/→ Atom-level extensions for GVPseq_mlp/backbone.py→ Baseline sequence MLPplastic_backbone.py→ Polymer Tower (plastic descriptors → embeddings)interaction_head.py→ Flexible scoring heads (cosine, bilinear, gated, etc.)
Polymer-specific feature extraction
descriptors_rdkit.py→ RDKit-based descriptor featurizerrdkit_features.yaml→ Descriptor config file
Input parsing utilities
pdb_reader.py→ Parses PDB files (optimized for AlphaFold / ColabFold single-chain structures)
Visualization tool for protein–polymer interaction graphs.
This project is part of the iGEM 2025 Competition, developed by the XJTLU-AI-China team.
This project was developed in the context of the iGEM 2025 competition.
- XJTLU-AI-China iGEM 2025 Team: for initiating and developing the PlaszymeGNN project.
- School of Science, Xi’an Jiaotong-Liverpool University (XJTLU): for institutional support.
- Prof. Chun Chan (Kevin) (School of Science, XJTLU), our Principal Investigator (PI): for providing invaluable guidance and mentorship throughout the project.
- GVP: for providing the Geometric Vector Perceptron backbone implementation.
- ESM: for enabling powerful protein sequence and structure embeddings.
- We also gratefully acknowledge the open-source community, whose tools and resources have made this work possible.
This project is licensed under the MIT License.
You are free to use, modify, and distribute this project, provided that proper attribution is given.

