
DeepSynthesis/Ullmann-Ma-Reaction-Prediction

Ullmann-Ma Coupling Reaction Yield Prediction

A machine learning project for predicting reaction yields in Ullmann-Ma coupling reactions using XGBoost and molecular descriptors.

Overview

This project implements a data-driven approach to predicting reaction yields for Ullmann-Ma coupling reactions. It combines molecular descriptors (RDKit and QM) with reaction conditions (reagent amounts, time, and temperature) to train an XGBoost regression model.

Features

  • Multiple Descriptor Support: RDKit molecular descriptors, Quantum Mechanical (QM) descriptors, or a combination of both
  • Model Training: XGBoost regressor with cross-validation
  • Hyperparameter Optimization: Bayesian optimization using Hyperopt
  • Interpretability: SHAP analysis for feature importance
  • Visualization: Dataset distribution analysis and prediction plots

Project Structure

ullmann_ma_prediction/
├── data/                          # Data files
│   ├── Ullmann-Ma_Dataset.xlsx   # Original dataset
│   ├── descriptor/               # QM molecular descriptors
│   └── *.csv                     # Additional descriptor files
├── src/                          # Source code
│   ├── modules/                  # Core modules
│   │   ├── config.py             # Configuration settings
│   │   ├── data_processing.py    # Data loading and feature construction
│   │   ├── molecule_descriptors.py # RDKit descriptor calculation
│   │   ├── modeling.py           # Model training and prediction
│   │   ├── shap_analysis.py      # SHAP analysis
│   │   └── visualization.py      # Plotting utilities
│   ├── 0-dataset_visualization.py    # Dataset visualization
│   ├── 1-training_and_prediciton.py  # Model training and prediction
│   ├── 2-shap_plot.py            # SHAP feature importance plotting
│   ├── 3-predict_all_posibilities.py # Virtual screening
│   ├── 4-reactants_with_reagents.py  # Reactant-reagent analysis
│   ├── figures/                  # Output figures
│   ├── model/                    # Trained models
│   └── results/                  # Prediction results
└── .venv/                        # Python virtual environment

Installation

  1. Clone the repository:
git clone <repository-url>
cd ullmann_ma_prediction
  2. Create and activate the virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install the required dependencies:
pip install -r requirements.txt

Key Dependencies

  • Python 3.10+
  • RDKit: Molecular descriptor calculation
  • XGBoost: Machine learning model
  • SHAP: Model interpretability
  • Pandas, NumPy, Matplotlib, Seaborn: Data processing and visualization

Dataset

The dataset (data/Ullmann-Ma_Dataset.xlsx) contains reaction records with the following fields:

| Column         | Description                          |
|----------------|--------------------------------------|
| entry          | Unique reaction identifier           |
| ligand         | Ligand SMILES                        |
| metal          | Metal catalyst SMILES                |
| reactant1      | First reactant SMILES                |
| reactant2      | Second reactant SMILES               |
| base           | Base SMILES                          |
| solvent        | Solvent SMILES                       |
| ligand_amount  | Ligand amount (eq)                   |
| metal_amount   | Metal amount (eq)                    |
| base_amount    | Base amount (eq)                     |
| solvent_amount | Solvent amount (μL)                  |
| time           | Reaction time (hours)                |
| temperature    | Reaction temperature (°C)            |
| yield          | Reaction yield (%), target variable  |
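
Before training, it can help to confirm that a loaded table actually carries these columns. The following is a minimal sketch of such a check. `missing_columns` is a hypothetical helper written for illustration and is not part of the repository; it accepts any iterable of column names, for example `df.columns` from a pandas DataFrame.

```python
# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "entry", "ligand", "metal", "reactant1", "reactant2", "base", "solvent",
    "ligand_amount", "metal_amount", "base_amount", "solvent_amount",
    "time", "temperature", "yield",
]


def missing_columns(columns):
    """Return the expected columns that are absent, in table order.

    Hypothetical helper for illustration; not part of the repository.
    """
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


if __name__ == "__main__":
    # A deliberately incomplete header to show what the check reports.
    header = ["entry", "ligand", "metal", "yield"]
    print(missing_columns(header))
```

An empty list means every documented field is present.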

Usage

1. Dataset Visualization

Generate distribution plots for the dataset:

python src/0-dataset_visualization.py

2. Model Training

Train the model with cross-validation:

from src.modules.config import XGB_PARAMS
from src.modules.data_processing import load_and_clean_data, construct_reaction_features_qm
from src.modules.modeling import train_with_cv

# Load data
df = load_and_clean_data()
df_valid, X = construct_reaction_features_qm(df)
y = df_valid["yield"].values.astype(float)

# Train with cross-validation
results_df, train_metrics, val_metrics, fold_metrics = train_with_cv(
    X, df_valid["entry"], y, XGB_PARAMS, model_type="xgb"
)

Or run the main training script:

cd src
python 1-training_and_prediciton.py

Available descriptor types (the desc_type argument):

  • qm: Quantum Mechanical descriptors
  • rdkit: RDKit molecular descriptors
  • combined: Both QM and RDKit descriptors
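
For readers unfamiliar with cross-validation, the sketch below shows the general idea in plain Python: the samples are split into folds, each fold is held out once for validation, and an error metric is computed per fold. It is not the repository's `train_with_cv`. The number of folds, the shuffling, the metrics, and the toy yield values are all assumptions made for illustration, and a training-set mean stands in for the XGBoost model.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs whose validation parts partition the data."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        val_idx = indices[k::n_splits]  # every n_splits-th shuffled sample
        val_set = set(val_idx)
        train_idx = [i for i in indices if i not in val_set]
        yield train_idx, val_idx


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy yields (%), invented for this example.
    yields = [12.0, 45.0, 78.0, 33.0, 90.0, 5.0, 61.0, 27.0, 84.0, 50.0]
    for fold, (train_idx, val_idx) in enumerate(kfold_indices(len(yields))):
        # Stand-in "model": predict the training mean for every validation sample.
        baseline = sum(yields[i] for i in train_idx) / len(train_idx)
        y_val = [yields[i] for i in val_idx]
        print(fold, round(mae(y_val, [baseline] * len(y_val)), 2))
```

Averaging the per-fold metrics gives a less optimistic estimate of model quality than scoring on the training data alone.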

3. SHAP Analysis

Perform SHAP analysis to understand feature importance:

python src/2-shap_plot.py

Or enable SHAP analysis during training:

main(desc_type="qm", task="train_with_cv", do_shap=True)

4. Virtual Screening

Predict yields for all possible reagent combinations:

python src/3-predict_all_posibilities.py
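
Enumerating "all possible combinations" amounts to taking a Cartesian product over the candidate values of each varied column. The sketch below illustrates that step only. `enumerate_conditions` is a hypothetical helper, and the candidate bases, solvents, and temperatures are placeholders chosen for the example; which columns the script actually varies is not stated in this README.

```python
from itertools import product


def enumerate_conditions(options):
    """Return one dict per combination of candidate values.

    `options` maps a column name to its list of candidate values.
    Hypothetical helper for illustration; not part of the repository.
    """
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    # Placeholder candidates: K2CO3 / Cs2CO3, DMSO / DMF, two temperatures.
    candidates = {
        "base": ["O=C([O-])[O-].[K+].[K+]", "O=C([O-])[O-].[Cs+].[Cs+]"],
        "solvent": ["CS(C)=O", "CN(C)C=O"],
        "temperature": [80, 100],
    }
    grid = enumerate_conditions(candidates)
    print(len(grid))  # 2 * 2 * 2 = 8 combinations
```

Each resulting dict would then be featurized like a dataset row and passed to the trained model. The grid grows multiplicatively, so the candidate lists are worth keeping short.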

5. Reactant-Reagent Analysis

Analyze relationships between reactants and reagents:

python src/4-reactants_with_reagents.py

Model Configuration

Default model hyperparameters are defined in src/modules/config.py:

XGB_PARAMS = {
    "n_estimators": 280,
    "max_depth": 12,
    "learning_rate": 0.0304,
    "subsample": 0.946,
    "gamma": 4.507,
    "colsample_bytree": 0.539,
    "min_child_weight": 10,
    "random_state": 42,
}
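
The keys in this dictionary match keyword arguments of XGBoost's scikit-learn style `XGBRegressor`, so the dictionary can presumably be unpacked straight into the constructor. How `modeling.py` actually consumes it is not shown here, so treat the following as a sketch under that assumption. `build_model` and `with_overrides` are hypothetical helpers.

```python
# Values copied from the configuration shown above.
XGB_PARAMS = {
    "n_estimators": 280,
    "max_depth": 12,
    "learning_rate": 0.0304,
    "subsample": 0.946,
    "gamma": 4.507,
    "colsample_bytree": 0.539,
    "min_child_weight": 10,
    "random_state": 42,
}


def with_overrides(params, **overrides):
    """Return a copy of `params` with some values replaced; the original is untouched."""
    return {**params, **overrides}


def build_model(params=XGB_PARAMS):
    """Create an untrained XGBoost regressor from a parameter dict."""
    # Imported lazily so the configuration can be inspected without XGBoost installed.
    from xgboost import XGBRegressor

    return XGBRegressor(**params)
```

For a quick experiment, `build_model(with_overrides(XGB_PARAMS, max_depth=6))` would give a shallower model without editing the shared configuration.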

Output

The project generates the following outputs:

  • Figures:

    • src/figures/analysis_plots/: Dataset distribution plots
    • src/figures/shap/: SHAP analysis plots
    • src/figures/total_prediction_results.png: Predicted vs. actual yields
  • Results:

    • src/results/prediction_results.xlsx: Cross-validation results
    • src/results/expanded_df.csv: Virtual screening results
  • Models:

    • src/model/best_xgb_model.pkl: Trained XGBoost model
    • src/model/best_xgb_model_scaler.pkl: Data scaler
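
The saved model and scaler can in principle be reloaded for prediction without retraining. The sketch below assumes both files were written with Python's `pickle` module (the `.pkl` extension suggests this, but joblib is also common) and that the objects follow the scikit-learn conventions `scaler.transform(X)` and `model.predict(X)`. `load_pickle` and `predict_yields` are hypothetical helpers.

```python
import pickle
from pathlib import Path

# Paths as listed in the output section above.
MODEL_PATH = Path("src/model/best_xgb_model.pkl")
SCALER_PATH = Path("src/model/best_xgb_model_scaler.pkl")


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a source you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_yields(model, scaler, features):
    """Scale a feature matrix, then predict yields with the trained model.

    Assumes scikit-learn style objects: scaler.transform(X) and model.predict(X).
    """
    return model.predict(scaler.transform(features))
```

A typical call would be `predict_yields(load_pickle(MODEL_PATH), load_pickle(SCALER_PATH), X_new)`, where `X_new` must have the same feature columns, in the same order, as the training matrix.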

Descriptor Preparation

RDKit Descriptors

RDKit descriptors are automatically calculated from SMILES strings:

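from src.modules.molecule_descriptors import compute_rdkit_descriptors, build_descriptor_dict

# Example input: SMILES strings of the molecules to featurize
smiles_list = ["Brc1ccccc1", "NCc1ccccc1"]
descriptors = build_descriptor_dict(smiles_list)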

QM Descriptors

QM descriptors must be pre-calculated and stored as CSV files in the data/descriptor/ directory, in the following format:

  • One file per reaction component: ligand_desc.csv, metal_desc.csv, reactant1_desc.csv, reactant2_desc.csv, base_desc.csv, solvent_desc.csv
  • Each file must have a "SMILES" column, followed by one column per descriptor
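
To make the file format concrete, the sketch below reads one such CSV into a lookup table keyed by SMILES. It uses only the standard library. `load_descriptor_table` is a hypothetical helper, and the descriptor names and values in the demo are invented; only the "SMILES" column name comes from the format described above.

```python
import csv
import io


def load_descriptor_table(csv_file):
    """Read a descriptor CSV into {SMILES: [descriptor values as floats]}.

    Returns the lookup dict and the list of descriptor column names.
    Hypothetical helper for illustration; not part of the repository.
    """
    reader = csv.DictReader(csv_file)
    names = [name for name in reader.fieldnames if name != "SMILES"]
    table = {row["SMILES"]: [float(row[name]) for name in names] for row in reader}
    return table, names


if __name__ == "__main__":
    # Made-up descriptor names and values, standing in for a file such as solvent_desc.csv.
    demo = io.StringIO("SMILES,desc_a,desc_b\nCS(C)=O,1.5,-0.2\nCN(C)C=O,0.7,3.0\n")
    table, names = load_descriptor_table(demo)
    print(names, table["CS(C)=O"])
```

With a real file, pass an open handle instead, for example `open("data/descriptor/solvent_desc.csv", newline="")`. A SMILES string that appears in the dataset but not in the descriptor file raises a `KeyError` on lookup, which is a useful early signal that a descriptor is missing.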

License

This project is for research purposes.
