A machine learning project for predicting reaction yields in Ullmann-Ma coupling reactions using XGBoost and molecular descriptors.
This project implements a data-driven approach to predicting reaction yields for Ullmann-Ma coupling reactions. It combines molecular descriptors (RDKit and QM) with reaction conditions to train an XGBoost regression model.
- Multiple Descriptor Support: RDKit molecular descriptors, Quantum Mechanical (QM) descriptors, or a combination of both
- Model Training: XGBoost regressor with cross-validation
- Hyperparameter Optimization: Bayesian optimization using Hyperopt
- Interpretability: SHAP analysis for feature importance
- Visualization: Dataset distribution analysis and prediction plots
ullmann_ma_prediction/
├── data/ # Data files
│ ├── Ullmann-Ma_Dataset.xlsx # Original dataset
│ ├── descriptor/ # QM molecular descriptors
│ └── *.csv # Additional descriptor files
├── src/ # Source code
│ ├── modules/ # Core modules
│ │ ├── config.py # Configuration settings
│ │ ├── data_processing.py # Data loading and feature construction
│ │ ├── molecule_descriptors.py # RDKit descriptor calculation
│ │ ├── modeling.py # Model training and prediction
│ │ ├── shap_analysis.py # SHAP analysis
│ │ └── visualization.py # Plotting utilities
│ ├── 0-dataset_visualization.py # Dataset visualization
│ ├── 1-training_and_prediciton.py # Model training and prediction
│ ├── 2-shap_plot.py # SHAP feature importance plotting
│ ├── 3-predict_all_posibilities.py # Virtual screening
│ ├── 4-reactants_with_reagents.py # Reactant-reagent analysis
│ ├── figures/ # Output figures
│ ├── model/ # Trained models
│ └── results/ # Prediction results
└── .venv/ # Python virtual environment
- Clone the repository:
git clone <repository-url>
cd ullmann_ma_prediction
- Create and activate the virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install required dependencies:
pip install -r requirements.txt
- Python 3.10+
- RDKit: Molecular descriptor calculation
- XGBoost: Machine learning model
- SHAP: Model interpretability
- Pandas, NumPy, Matplotlib, Seaborn: Data processing and visualization
The dataset (data/Ullmann-Ma_Dataset.xlsx) contains reaction records with the following fields:
| Column | Description |
|---|---|
| entry | Unique reaction identifier |
| ligand | Ligand SMILES |
| metal | Metal catalyst SMILES |
| reactant1 | First reactant SMILES |
| reactant2 | Second reactant SMILES |
| base | Base SMILES |
| solvent | Solvent SMILES |
| ligand_amount | Ligand amount (eq) |
| metal_amount | Metal amount (eq) |
| base_amount | Base amount (eq) |
| solvent_amount | Solvent amount (μL) |
| time | Reaction time (hours) |
| temperature | Reaction temperature (°C) |
| yield | Reaction yield (%) - Target variable |
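A quick sanity check of the fields listed above can catch malformed rows before training. The sketch below is illustrative only: `check_dataset` is a hypothetical helper (not part of the project's modules), and the 0-100 % yield filter is an assumption about what counts as a usable record.

```python
import pandas as pd

# Column names taken from the table above.
SMILES_COLUMNS = ["ligand", "metal", "reactant1", "reactant2", "base", "solvent"]
NUMERIC_COLUMNS = [
    "ligand_amount", "metal_amount", "base_amount",
    "solvent_amount", "time", "temperature", "yield",
]
REQUIRED_COLUMNS = ["entry"] + SMILES_COLUMNS + NUMERIC_COLUMNS


def check_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: validate columns and drop rows with unusable yields."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    out = df.copy()
    # Coerce numeric fields; unparseable values become NaN.
    for col in NUMERIC_COLUMNS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Assumption: keep only rows whose yield is a number within 0-100 %.
    mask = out["yield"].between(0, 100)
    return out[mask].reset_index(drop=True)


# Usage (reading .xlsx files with pandas requires openpyxl):
# df = check_dataset(pd.read_excel("data/Ullmann-Ma_Dataset.xlsx"))
```

The project's own `load_and_clean_data` may apply different or additional cleaning rules; this sketch only mirrors the column list documented here.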
Generate distribution plots for the dataset:
python src/0-dataset_visualization.py
Train the model with cross-validation:
from src.modules.config import XGB_PARAMS
from src.modules.data_processing import load_and_clean_data, construct_reaction_features_qm
from src.modules.modeling import train_with_cv
# Load data
df = load_and_clean_data()
df_valid, X = construct_reaction_features_qm(df)
y = df_valid["yield"].values.astype(float)
# Train with cross-validation
results_df, train_metrics, val_metrics, fold_metrics = train_with_cv(
X, df_valid["entry"], y, XGB_PARAMS, model_type="xgb"
)
Or run the main training script:
cd src
python 1-training_and_prediciton.py
Available descriptor types:
- qm: Quantum Mechanical descriptors
- rdkit: RDKit molecular descriptors
- combined: Both QM and RDKit descriptors
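One plausible reading of the three descriptor types is a switch between two feature blocks and their column-wise concatenation. The sketch below illustrates that idea under stated assumptions: `select_features` is a hypothetical helper, and the project's actual feature construction may differ.

```python
import numpy as np


def select_features(qm: np.ndarray, rdkit: np.ndarray, desc_type: str = "qm") -> np.ndarray:
    """Hypothetical helper: pick the feature block for a descriptor type."""
    if desc_type == "qm":
        return qm
    if desc_type == "rdkit":
        return rdkit
    if desc_type == "combined":
        if qm.shape[0] != rdkit.shape[0]:
            raise ValueError("QM and RDKit matrices must have the same number of rows")
        # Column-wise concatenation: still one row per reaction.
        return np.hstack([qm, rdkit])
    raise ValueError(f"Unknown desc_type: {desc_type!r}")


# Example with made-up shapes: 5 reactions, 3 QM and 4 RDKit features.
qm_block = np.zeros((5, 3))
rdkit_block = np.ones((5, 4))
print(select_features(qm_block, rdkit_block, "combined").shape)  # (5, 7)
```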
Perform SHAP analysis to understand feature importance:
python src/2-shap_plot.py
Or enable SHAP analysis during training:
main(desc_type="qm", task="train_with_cv", do_shap=True)
Predict yields for all possible reagent combinations:
python src/3-predict_all_posibilities.py
Analyze relationships between reactants and reagents:
python src/4-reactants_with_reagents.py
Default model hyperparameters are defined in src/modules/config.py:
XGB_PARAMS = {
"n_estimators": 280,
"max_depth": 12,
"learning_rate": 0.0304,
"subsample": 0.946,
"gamma": 4.507,
"colsample_bytree": 0.539,
"min_child_weight": 10,
"random_state": 42,
}
The project generates the following outputs:
- Figures:
  - src/figures/analysis_plots/: Dataset distribution plots
  - src/figures/shap/: SHAP analysis plots
  - src/figures/total_prediction_results.png: Prediction vs. actual yields
- Results:
  - src/results/prediction_results.xlsx: Cross-validation results
  - src/results/expanded_df.csv: Virtual screening results
- Models:
  - src/model/best_xgb_model.pkl: Trained XGBoost model
  - src/model/best_xgb_model_scaler.pkl: Data scaler
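The virtual screening results listed above come from predicting yields for every reagent combination. As a rough illustration of the enumeration step only, the sketch below builds one row per combination with `itertools.product`. `enumerate_conditions` is a hypothetical helper and the SMILES are illustrative placeholders, not values from the dataset; how the project script actually builds its table is not shown here.

```python
from itertools import product

import pandas as pd


def enumerate_conditions(options: dict[str, list[str]]) -> pd.DataFrame:
    """Hypothetical helper: one row per combination of the given options."""
    names = list(options)
    rows = product(*(options[name] for name in names))
    return pd.DataFrame(list(rows), columns=names)


# Illustrative SMILES only, not taken from the dataset.
candidates = {
    "ligand": ["NCCN", "OC(=O)C(=O)O"],
    "base": ["O=C([O-])[O-].[K+].[K+]", "[O-]P(=O)([O-])[O-].[K+].[K+].[K+]"],
    "solvent": ["CS(C)=O", "CN(C)C=O", "CCO"],
}
expanded_df = enumerate_conditions(candidates)
print(len(expanded_df))  # 2 * 2 * 3 = 12 combinations
```

Each row could then be converted to descriptors and passed to the trained model. Note that the grid grows multiplicatively with every added option list.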
RDKit descriptors are automatically calculated from SMILES strings:
from src.modules.molecule_descriptors import compute_rdkit_descriptors, build_descriptor_dict
descriptors = build_descriptor_dict(smiles_list)
QM descriptors should be pre-calculated and stored in the data/descriptor/ directory as CSV files:
- ligand_desc.csv, metal_desc.csv, reactant1_desc.csv, reactant2_desc.csv, base_desc.csv, solvent_desc.csv
- Each file must have a "SMILES" column plus descriptor columns
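Given that file layout, a descriptor lookup by SMILES could look roughly like the sketch below. It is a minimal illustration, not the project's loader: `load_descriptor_table` and `lookup_features` are hypothetical helpers, and the descriptor column names and values in the demo CSV are made up.

```python
import io

import numpy as np
import pandas as pd


def load_descriptor_table(source) -> pd.DataFrame:
    """Read a descriptor CSV and index it by its SMILES column."""
    table = pd.read_csv(source)
    if "SMILES" not in table.columns:
        raise ValueError("Descriptor file needs a 'SMILES' column")
    # Keep the first row for any repeated SMILES so the index is unique.
    return table.drop_duplicates("SMILES").set_index("SMILES")


def lookup_features(smiles: list[str], table: pd.DataFrame) -> np.ndarray:
    """Return one descriptor row per SMILES; unknown SMILES give NaN rows."""
    return table.reindex(smiles).to_numpy(dtype=float)


# In-memory stand-in for a file such as data/descriptor/solvent_desc.csv;
# the column names and values are made up.
demo_csv = io.StringIO("SMILES,desc_a,desc_b\nCCO,1.0,2.0\nCS(C)=O,3.0,4.0\n")
solvent_table = load_descriptor_table(demo_csv)
features = lookup_features(["CS(C)=O", "CCO", "C1CCOC1"], solvent_table)
print(features.shape)  # (3, 2)
```

Rows of NaN flag molecules without pre-calculated descriptors, which makes it easy to filter out reactions that cannot be featurized.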
This project is for research purposes.
- Ullmann-Ma coupling reaction dataset
- XGBoost: https://xgboost.readthedocs.io/
- SHAP: https://shap.readthedocs.io/
- RDKit: https://www.rdkit.org/