Ullmann-Ma Coupling Reaction Yield Prediction

A machine learning project for predicting reaction yields in Ullmann-Ma coupling reactions using XGBoost and molecular descriptors.

Overview

This project implements a data-driven approach to predicting reaction yields for Ullmann-Ma coupling reactions. It combines molecular descriptors (RDKit and QM) with reaction conditions to train an XGBoost regression model.

Features

Multiple Descriptor Support: RDKit molecular descriptors, quantum mechanical (QM) descriptors, or both combined
Model Training: XGBoost regressor with cross-validation
Hyperparameter Optimization: Bayesian optimization using Hyperopt
Interpretability: SHAP analysis for feature importance
Visualization: Dataset distribution analysis and prediction plots

Project Structure

ullmann_ma_prediction/
├── data/                          # Data files
│   ├── Ullmann-Ma_Dataset.xlsx   # Original dataset
│   ├── descriptor/               # QM molecular descriptors
│   └── *.csv                     # Additional descriptor files
├── src/                          # Source code
│   ├── modules/                  # Core modules
│   │   ├── config.py             # Configuration settings
│   │   ├── data_processing.py    # Data loading and feature construction
│   │   ├── molecule_descriptors.py # RDKit descriptor calculation
│   │   ├── modeling.py           # Model training and prediction
│   │   ├── shap_analysis.py      # SHAP analysis
│   │   └── visualization.py      # Plotting utilities
│   ├── 0-dataset_visualization.py    # Dataset visualization
│   ├── 1-training_and_prediciton.py  # Model training and prediction
│   ├── 2-shap_plot.py            # SHAP feature importance plotting
│   ├── 3-predict_all_posibilities.py # Virtual screening
│   ├── 4-reactants_with_reagents.py  # Reactant-reagent analysis
│   ├── figures/                  # Output figures
│   ├── model/                    # Trained models
│   └── results/                  # Prediction results
└── .venv/                        # Python virtual environment

Installation

Clone the repository:

git clone <repository-url>
cd ullmann_ma_prediction

Create and activate the virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install required dependencies:

pip install -r requirements.txt

Key Dependencies

Python 3.10+
RDKit: Molecular descriptor calculation
XGBoost: Machine learning model
SHAP: Model interpretability
Pandas, NumPy, Matplotlib, Seaborn: Data processing and visualization

Dataset

The dataset (data/Ullmann-Ma_Dataset.xlsx) contains reaction records with the following fields:

Column	Description
entry	Unique reaction identifier
ligand	Ligand SMILES
metal	Metal catalyst SMILES
reactant1	First reactant SMILES
reactant2	Second reactant SMILES
base	Base SMILES
solvent	Solvent SMILES
ligand_amount	Ligand amount (eq)
metal_amount	Metal amount (eq)
base_amount	Base amount (eq)
solvent_amount	Solvent amount (μL)
time	Reaction time (hours)
temperature	Reaction temperature (°C)
yield	Reaction yield (%) - Target variable
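
Before training, it can help to confirm that every record has the fields listed above. The following is a minimal sketch, not part of the project code: the function name `find_invalid_records` is hypothetical, and it assumes records are plain dictionaries keyed by the column names in the table and that yields are percentages between 0 and 100.

```python
from typing import Iterable, Mapping

# Column names taken from the dataset table above.
SMILES_COLUMNS = ["ligand", "metal", "reactant1", "reactant2", "base", "solvent"]
NUMERIC_COLUMNS = [
    "ligand_amount", "metal_amount", "base_amount",
    "solvent_amount", "time", "temperature", "yield",
]
REQUIRED_COLUMNS = ["entry", *SMILES_COLUMNS, *NUMERIC_COLUMNS]


def _is_missing(value: object) -> bool:
    """Treat None, empty strings and float NaN as missing."""
    return value is None or value == "" or (isinstance(value, float) and value != value)


def find_invalid_records(records: Iterable[Mapping[str, object]]) -> list[tuple[object, str]]:
    """Return (entry, reason) pairs for records that fail basic checks."""
    problems = []
    for record in records:
        entry = record.get("entry")
        missing = [c for c in REQUIRED_COLUMNS if _is_missing(record.get(c))]
        if missing:
            problems.append((entry, f"missing: {', '.join(missing)}"))
            continue
        try:
            yield_pct = float(record["yield"])
        except (TypeError, ValueError):
            problems.append((entry, "yield is not numeric"))
            continue
        # Assumption: yields are percentages, so values outside 0-100 are suspect.
        if not 0.0 <= yield_pct <= 100.0:
            problems.append((entry, f"yield out of range: {yield_pct}"))
    return problems
```

If the spreadsheet is read with pandas, the records could be obtained with `df.to_dict("records")`. Whether `load_and_clean_data` already performs similar checks is not stated here.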

Usage

1. Dataset Visualization

Generate distribution plots for the dataset:

python src/0-dataset_visualization.py

2. Model Training

Train the model with cross-validation:

from src.modules.config import XGB_PARAMS
from src.modules.data_processing import load_and_clean_data, construct_reaction_features_qm
from src.modules.modeling import train_with_cv

# Load data
df = load_and_clean_data()
df_valid, X = construct_reaction_features_qm(df)
y = df_valid["yield"].values.astype(float)

# Train with cross-validation
results_df, train_metrics, val_metrics, fold_metrics = train_with_cv(
    X, df_valid["entry"], y, XGB_PARAMS, model_type="xgb"
)

Or run the main training script:

cd src
python 1-training_and_prediciton.py

Available descriptor types:

qm: Quantum Mechanical descriptors
rdkit: RDKit molecular descriptors
combined: Both QM and RDKit descriptors
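
The internals of `train_with_cv` are not shown in this README. As a rough illustration of what k-fold cross-validation and per-fold scoring involve, here is a dependency-free sketch. The names `kfold_indices` and `regression_metrics`, the default of five folds, and the choice of MAE, RMSE and R² as metrics are all assumptions, not the project's actual implementation.

```python
import math
import random


def kfold_indices(n_samples: int, n_splits: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx
        start = stop


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when all true values are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }
```

Each fold's model would be fitted on the `train_idx` rows and scored on the `val_idx` rows, so that every reaction is predicted exactly once by a model that did not see it during training.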

3. SHAP Analysis

Perform SHAP analysis to understand feature importance:

python src/2-shap_plot.py

Or enable SHAP analysis during training:

main(desc_type="qm", task="train_with_cv", do_shap=True)
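
A common way to turn per-sample SHAP values into a global feature ranking is to average their absolute values per feature, which is the quantity a SHAP summary bar plot displays. The sketch below shows only that aggregation step. The function name is hypothetical, and the example assumes the SHAP values are already available as a samples-by-features matrix.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP value| over all samples, largest first."""
    n_samples = len(shap_values)
    if n_samples == 0:
        raise ValueError("shap_values is empty")
    n_features = len(feature_names)
    totals = [0.0] * n_features
    for row in shap_values:
        if len(row) != n_features:
            raise ValueError("row length does not match feature_names")
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = [(name, total / n_samples) for name, total in zip(feature_names, totals)]
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

With the `shap` package, such a matrix is typically obtained for tree models via `shap.TreeExplainer(model).shap_values(X)`. Whether `2-shap_plot.py` computes its plots this way is not shown here.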

4. Virtual Screening

Predict yields for all possible reagent combinations:

python src/3-predict_all_posibilities.py
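
Enumerating "all possible reagent combinations" amounts to taking a Cartesian product over the candidate ligands, metals, bases and solvents for a given reactant pair. The sketch below illustrates the idea with `itertools.product`. The function name, the candidate SMILES and the fixed conditions are illustrative assumptions and are not taken from the dataset or from `3-predict_all_posibilities.py`.

```python
from itertools import product


def enumerate_reactions(reactant_pair: dict, candidates: dict, fixed_conditions: dict):
    """Yield one reaction record per combination of candidate reagents."""
    roles = list(candidates)
    for combo in product(*(candidates[role] for role in roles)):
        record = dict(reactant_pair)
        record.update(zip(roles, combo))
        record.update(fixed_conditions)
        yield record


# Illustrative inputs only (hypothetical screening set).
reactants = {"reactant1": "Brc1ccccc1", "reactant2": "NCc1ccccc1"}
candidates = {
    "ligand": ["OC(=O)C1CCCN1", "CN(C)CC(=O)O"],
    "metal": ["[Cu]I"],
    "base": ["O=C([O-])[O-].[K+].[K+]", "[O-]P(=O)([O-])[O-].[K+].[K+].[K+]"],
    "solvent": ["CS(C)=O", "CN(C)C=O"],
}
conditions = {"time": 24, "temperature": 90}

screening_set = list(enumerate_reactions(reactants, candidates, conditions))
```

Each record would then go through the same feature construction as the training data before being passed to the trained model. The number of records grows multiplicatively with each candidate list, so large libraries may need to be processed in batches.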

5. Reactant-Reagent Analysis

Analyze relationships between reactants and reagents:

python src/4-reactants_with_reagents.py

Model Configuration

Default model hyperparameters are defined in src/modules/config.py:

XGB_PARAMS = {
    "n_estimators": 280,
    "max_depth": 12,
    "learning_rate": 0.0304,
    "subsample": 0.946,
    "gamma": 4.507,
    "colsample_bytree": 0.539,
    "min_child_weight": 10,
    "random_state": 42,
}
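
To experiment with different settings without editing `config.py`, one option is to derive a modified copy of the defaults. The helper below is a hypothetical sketch and is not part of the project. It rejects keys that are absent from the defaults so that a misspelled hyperparameter name fails loudly instead of being silently ignored.

```python
# Copy of the defaults listed above so the snippet runs on its own;
# in the project this would be `from src.modules.config import XGB_PARAMS`.
XGB_PARAMS = {
    "n_estimators": 280,
    "max_depth": 12,
    "learning_rate": 0.0304,
    "subsample": 0.946,
    "gamma": 4.507,
    "colsample_bytree": 0.539,
    "min_child_weight": 10,
    "random_state": 42,
}


def with_overrides(defaults: dict, **overrides) -> dict:
    """Return a new parameter dict; reject keys that are not in the defaults."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **overrides}


# Example: shallower trees, more boosting rounds (values chosen arbitrarily).
shallow_params = with_overrides(XGB_PARAMS, max_depth=6, n_estimators=500)
```

The resulting dictionary can be passed to `train_with_cv` in place of `XGB_PARAMS`. All keys are constructor parameters of XGBoost's scikit-learn interface, so it can also be unpacked as `xgboost.XGBRegressor(**shallow_params)`.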

Output

The project generates the following outputs:

Figures:
- src/figures/analysis_plots/: Dataset distribution plots
- src/figures/shap/: SHAP analysis plots
- src/figures/total_prediction_results.png: Prediction vs. actual yields
Results:
- src/results/prediction_results.xlsx: Cross-validation results
- src/results/expanded_df.csv: Virtual screening results
Models:
- src/model/best_xgb_model.pkl: Trained XGBoost model
- src/model/best_xgb_model_scaler.pkl: Data scaler
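
The saved model and scaler can be reloaded for later predictions. The sketch below assumes both `.pkl` files were written with Python's standard `pickle` module. If the project used `joblib` instead, `joblib.load` would be the matching call. The function name `load_model_and_scaler` is hypothetical.

```python
import pickle
from pathlib import Path


def load_model_and_scaler(model_dir="src/model", stem="best_xgb_model"):
    """Load the pickled model and its scaler from model_dir."""
    model_dir = Path(model_dir)
    # Assumption: files follow the naming shown in the Output list above.
    with open(model_dir / f"{stem}.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open(model_dir / f"{stem}_scaler.pkl", "rb") as fh:
        scaler = pickle.load(fh)
    return model, scaler
```

Assuming the scaler follows the scikit-learn `transform` convention, new feature rows would be predicted with `model.predict(scaler.transform(X_new))`. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.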

Descriptor Preparation

RDKit Descriptors

RDKit descriptors are automatically calculated from SMILES strings:

from src.modules.molecule_descriptors import compute_rdkit_descriptors, build_descriptor_dict

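# Example input (illustrative SMILES, not taken from the dataset)
smiles_list = ["Brc1ccccc1", "NCc1ccccc1"]
descriptors = build_descriptor_dict(smiles_list)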

QM Descriptors

QM descriptors must be pre-calculated and stored as CSV files in the data/descriptor/ directory, one file per reaction component:

ligand_desc.csv, metal_desc.csv, reactant1_desc.csv, reactant2_desc.csv, base_desc.csv, solvent_desc.csv
Each file must contain a "SMILES" column plus one column per descriptor.
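
To make the expected file layout concrete, the sketch below reads the six descriptor files into lookup tables keyed by SMILES and concatenates the descriptors of one reaction into a single feature vector. It illustrates the idea only and is not the project's `construct_reaction_features_qm`. All function names are hypothetical, and it assumes every non-SMILES column is numeric.

```python
import csv
from pathlib import Path

COMPONENTS = ["ligand", "metal", "reactant1", "reactant2", "base", "solvent"]


def load_descriptor_table(csv_path):
    """Map each SMILES string to its list of descriptor values (floats)."""
    table = {}
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames is None or "SMILES" not in reader.fieldnames:
            raise ValueError(f"{csv_path} has no 'SMILES' column")
        descriptor_columns = [c for c in reader.fieldnames if c != "SMILES"]
        for row in reader:
            # Assumption: every descriptor column holds a numeric value.
            table[row["SMILES"]] = [float(row[c]) for c in descriptor_columns]
    return table


def load_all_descriptors(descriptor_dir="data/descriptor"):
    """Load <component>_desc.csv for each reaction component."""
    directory = Path(descriptor_dir)
    return {name: load_descriptor_table(directory / f"{name}_desc.csv") for name in COMPONENTS}


def reaction_feature_vector(record, tables):
    """Concatenate the descriptors of every component of one reaction."""
    features = []
    for name in COMPONENTS:
        # Raises KeyError if a SMILES string has no pre-calculated descriptors.
        features.extend(tables[name][record[name]])
    return features
```

Reaction conditions such as amounts, time and temperature could be appended to the same vector. A `KeyError` from `reaction_feature_vector` indicates a molecule whose descriptors still need to be calculated.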

License

This project is for research purposes.

References

Ullmann-Ma coupling reaction dataset
XGBoost: https://xgboost.readthedocs.io/
SHAP: https://shap.readthedocs.io/
RDKit: https://www.rdkit.org/
