TRACE-GFN: Transformer for Reaction-Aware Compound Exploration with GFlowNet in QSAR-Guided Molecular Design
TRACE-GFN is a generative flow network (GFlowNet) framework for designing drug-like molecules through interpretable chemical reaction pathways. Please refer to the paper for more detailed information.
The implementation was tested on Ubuntu with Python 3.11 and CUDA 12.1.
bash install.sh
source .venv/bin/activate

# Install dependencies using uv
uv sync
# Activate virtual environment
source .venv/bin/activate
# Install PyTorch Geometric dependencies (CUDA 12.1)
uv pip install torch_scatter torch_sparse torch_cluster \
-f https://data.pyg.org/whl/torch-2.1.2+cu121.html

If you don't have CUDA available, modify the PyTorch installation in pyproject.toml to use CPU-only versions.
Download the trained Transformer weights from Figshare and place them as Transformer.pth in the src/gflownet/models/ckpts/Transformer/ directory. The following command downloads the weights directly to that location:
wget -O src/gflownet/models/ckpts/Transformer/Transformer.pth https://ndownloader.figshare.com/files/46402633
The checkpoint directory should then look like this:
models/
└── ckpts/
    ├── GCN/
    │   └── GCN.pth
    └── Transformer/
        └── Transformer.pth
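Before launching a run, it can help to confirm that both checkpoint files are where the tree above expects them. The following is a minimal sketch, assuming it is run from the repository root and that models/ sits under src/gflownet/ (as the Transformer path suggests); `missing_checkpoints` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Relative checkpoint paths, taken from the directory tree above.
EXPECTED_CHECKPOINTS = (
    "src/gflownet/models/ckpts/GCN/GCN.pth",
    "src/gflownet/models/ckpts/Transformer/Transformer.pth",
)


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All checkpoints found.")
```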
Generate molecules optimized for DRD2 binding starting from a specific compound:
python -u src/gflownet/tasks/qsar_reactions.py \
--protein_name "DRD2" \
--init_compound_idx 1 \
--condition 16.0 \
--max_depth 5

| Argument | Description | Default |
|---|---|---|
| --protein_name | Target protein: "DRD2", "AKT1", or "CXCR4" | Required |
| --init_compound_idx | Index of starting material | Required |
| --condition | Temperature parameter (higher = more exploitation) | 16.0 |
| --max_depth | Maximum number of reaction steps | 5 |
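The effect of --condition can be illustrated with a toy calculation. The sketch below assumes, as is common for temperature-conditioned GFlowNets, that the sampler targets a distribution proportional to R(x)^β, with β set by --condition. Whether TRACE-GFN applies the temperature in exactly this way is an assumption, and the reward values are placeholders, not model predictions.

```python
def tempered_distribution(rewards, beta):
    """Normalize rewards raised to the power beta (p_i proportional to R_i ** beta)."""
    powered = [r ** beta for r in rewards]
    total = sum(powered)
    return [p / total for p in powered]


# Placeholder rewards for three hypothetical molecules (not real predictions).
rewards = [0.5, 0.7, 0.9]

for beta in (1.0, 16.0):
    probs = tempered_distribution(rewards, beta)
    print(f"beta={beta}:", [round(p, 3) for p in probs])
```

With β = 1 the three candidates are sampled roughly in proportion to their rewards; with β = 16 almost all probability mass moves to the highest-reward candidate, which is the sense in which a higher value means more exploitation.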
Starting materials are specified in SMILES format in:
src/gflownet/data/{PROTEIN_NAME}/init_compound_{IDX}.smi
For example, src/gflownet/data/DRD2/init_compound_1.smi contains:
OC1CCc2cc(F)ccc21
You can create custom starting materials by adding new .smi files with your desired SMILES strings.
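A custom starting material can also be written programmatically. This is a minimal sketch, assuming each .smi file holds a single SMILES string on one line (as in the example above); `write_init_compound` is a hypothetical helper, and the RDKit check only runs if RDKit is importable.

```python
from pathlib import Path


def write_init_compound(data_root, protein_name, idx, smiles):
    """Write one SMILES string to {data_root}/{protein_name}/init_compound_{idx}.smi."""
    smiles = smiles.strip()
    if not smiles:
        raise ValueError("SMILES string is empty")

    # Optional sanity check: validate the SMILES only when RDKit is installed.
    try:
        from rdkit import Chem
    except ImportError:
        Chem = None
    if Chem is not None and Chem.MolFromSmiles(smiles) is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")

    path = Path(data_root) / protein_name / f"init_compound_{idx}.smi"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(smiles + "\n")
    return path
```

For example, `write_init_compound("src/gflownet/data", "DRD2", 99, "OC1CCc2cc(F)ccc21")` would create init_compound_99.smi, which could then be selected with --init_compound_idx 99. The index 99 is arbitrary; pick one that is not already in use so existing files are not overwritten.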
DRD2 optimization with high exploration:
python -u src/gflownet/tasks/qsar_reactions.py \
--protein_name "DRD2" \
--init_compound_idx 1 \
--condition 16.0 \
--max_depth 5

AKT1 optimization with conservative exploration:
python -u src/gflownet/tasks/qsar_reactions.py \
--protein_name "AKT1" \
--init_compound_idx 6 \
--condition 16.0 \
--max_depth 5

CXCR4 optimization using probabilistic sampling:
python -u src/gflownet/tasks/qsar_reactions.py \
--protein_name "CXCR4" \
--init_compound_idx 11 \
--condition 16.0 \
--max_depth 5

Training outputs are saved to:
./logs/{PROTEIN_NAME}_reactions_{TIMESTAMP}/
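To locate the output of the most recent run for a given target, something like the following can be used. It is a sketch that relies only on the directory naming pattern above; `latest_run_dir` is a hypothetical helper, and it sorts by modification time because the exact timestamp format is not documented here.

```python
from pathlib import Path


def latest_run_dir(protein_name, logs_root="./logs"):
    """Return the most recently modified run directory for a target, or None."""
    pattern = f"{protein_name}_reactions_*"
    candidates = [p for p in Path(logs_root).glob(pattern) if p.is_dir()]
    # Use modification time rather than parsing the timestamp in the name.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(latest_run_dir("DRD2"))
```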
The training progress is logged to Weights & Biases (wandb) under the project {PROTEIN_NAME}_TRACER-GFN. Metrics include:
- Rewards (binding affinity predictions)
- Loss values (trajectory balance, GCN, Transformer)
- Sampling diversity (unique molecule rate)
- Training time and throughput
Key hyperparameters can be modified in src/gflownet/config.py or via command-line arguments.
TRACE-GFN includes pre-trained QSAR models for three protein targets:
- DRD2
  - Relevant for: Antipsychotics, Parkinson's disease treatments
  - QSAR model: src/gflownet/models/qsar_DRD2_optimized.pkl
- AKT1
  - Relevant for: Cancer therapies, metabolic disorders
  - QSAR model: src/gflownet/models/qsar_AKT1_optimized.pkl
- CXCR4
  - Relevant for: HIV treatments, cancer metastasis inhibitors
  - QSAR model: src/gflownet/models/qsar_CXCR4_optimized.pkl
To add a new protein target:
- Train a QSAR model (e.g., using Morgan fingerprints and Random Forest)
- Save the model as src/gflownet/models/qsar_{PROTEIN_NAME}_optimized.pkl
- Update the protein name options in src/gflownet/tasks/qsar_reactions.py
- Prepare starting materials in src/gflownet/data/{PROTEIN_NAME}/init_compound_*.smi
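The first two steps above might be sketched as follows, using RDKit and scikit-learn. Several things here are assumptions rather than the repository's actual recipe: the fingerprint settings (radius 2, 2048 bits), the choice of a classifier rather than a regressor, and the idea that the .pkl file holds a bare pickled estimator. Check how qsar_reactions.py loads and calls the bundled models before relying on this. `featurize` and `train_and_save` are hypothetical helper names.

```python
import pickle


def featurize(smiles_list, radius=2, n_bits=2048):
    """Convert SMILES strings to a 2D array of Morgan fingerprint bits."""
    # Imported lazily so this module loads even where RDKit is not installed.
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((0,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to n_bits
        rows.append(arr)
    return np.stack(rows)


def train_and_save(smiles_list, labels, out_path):
    """Fit a Random Forest on Morgan fingerprints and pickle it to out_path."""
    from sklearn.ensemble import RandomForestClassifier

    X = featurize(smiles_list)
    # Assumption: binary active/inactive labels; hyperparameters are defaults.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, labels)
    with open(out_path, "wb") as fh:
        pickle.dump(model, fh)
    return model
```

Given your own lists of SMILES strings and activity labels, a call such as `train_and_save(smiles, labels, "src/gflownet/models/qsar_MYTARGET_optimized.pkl")` would produce a file following the naming convention above, where MYTARGET is a placeholder for the new protein name.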
TRACE-GFN uses reaction templates derived from the USPTO dataset:
- Template library: src/gflownet/data/label_template.json (1000 templates)
- Training data: src/gflownet/data/USPTO/ (tokenized reaction examples)
Reaction templates are represented as SMARTS patterns that define molecular transformations. The GCN learns to predict which templates are applicable to each molecule based on structural features.
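To make the idea of a SMARTS reaction template concrete, the sketch below applies one with RDKit. The amide-coupling pattern is a generic textbook example and is not taken from label_template.json, whose internal format is not described here; `apply_template` is a hypothetical helper. An empty result means the template does not match the given reactants, which is the rule-based counterpart of the applicability that the GCN learns to predict.

```python
def apply_template(reaction_smarts, reactant_smiles):
    """Apply a reaction SMARTS to reactants and return sorted unique product SMILES."""
    # Imported lazily so this module loads even where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    rxn = AllChem.ReactionFromSmarts(reaction_smarts)
    reactants = tuple(Chem.MolFromSmiles(s) for s in reactant_smiles)
    if any(m is None for m in reactants):
        raise ValueError("Could not parse one of the reactant SMILES")

    products = set()
    for product_set in rxn.RunReactants(reactants):
        for mol in product_set:
            try:
                Chem.SanitizeMol(mol)
            except Exception:
                continue  # skip products that are not chemically valid
            products.add(Chem.MolToSmiles(mol))
    return sorted(products)


# Generic amide coupling (carboxylic acid + amine), for illustration only.
AMIDE_COUPLING = "[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]"
```

For instance, `apply_template(AMIDE_COUPLING, ["CC(=O)O", "NC"])` couples acetic acid with methylamine to give N-methylacetamide, while `apply_template(AMIDE_COUPLING, ["CC", "NC"])` returns an empty list because ethane has no carboxylic acid group.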
TRACE-GFN/
├── src/gflownet/
│   ├── tasks/
│   │   └── qsar_reactions.py                # Main entry point
│   ├── models/
│   │   ├── GCN/                             # Graph convolution network
│   │   ├── Transformer/                     # Product generation model
│   │   ├── mlp.py                           # Partition function predictor
│   │   └── qsar_*.pkl                       # Pre-trained QSAR models
│   ├── algo/
│   │   ├── trajectory_balance_synthesis.py  # Training objective
│   │   └── reaction_sampling.py             # Trajectory generation
│   ├── data/
│   │   ├── DRD2/, AKT1/, CXCR4/             # Protein-specific data
│   │   ├── USPTO/                           # Reaction templates
│   │   └── sampling_iterator.py             # Data loading
│   ├── trainer.py                           # Base trainer class
│   ├── online_trainer.py                    # Online training implementation
│   └── config.py                            # Configuration dataclasses
├── install.sh                               # Installation script
└── pyproject.toml                           # Dependencies