Predicting the stability of cobaltocenium derivatives

Author: Shehani Wetthasinghe

Last modified: 02/25/2023

According to the previous experimental work, cobaltocenium derivatives (CoCp₂OH) are regarded as promising anion exchange membrane (AEM) components because of their excellent thermal and alkaline stability under operating conditions of fuel cells.
In this project, I provide a useful guidance for experimentalists to find out the stable cobaltocenium derivatives by introducing a machine learning model.
The stability can be interpreted as measure of bond dissociation energy (BDE).
After an extensive analysis, I found that BDE can be predicted using few chemical properties of the fragments of cobaltocenium derivatives based on its substituents.

Here are the substituents used in this study:

Check out following publications for more information about the theory model;

Data Generation

I generated the data set with every possible di-substituted cobatocenium derivatives for 42 substituents. So here I share the python codes that I used for;

generating input files for QChem calculations
data extraction from QChem output files
property calculation and data analysis

This is the procedure need to follow to add a new substituent;

Guidlines to generate input files

First, QChem input files for di-substituted CoCp₂OH are required to generate with a given substituent set. To do that, xyz coordinates for di-substuted CoCp₂OH are creating by;

connecting one group directly to 15^th atom (C)
connecting the second group to 4^th atom (C) with a coordination rotation through z axis

Steps:

Optimize the geometry of mono-substituted cobaltocenium with the new substituent group. (input file: Cobaltocenium_project/mono_sub_opt_sample.in)
Extract the xyz coordinates of optimized output files using "extract_subcor.py"
Do any required changes in rem/pcm//basis set/charge/multiplicity in corresponding data files or di_input_generator.py (rem.dat, pcm.dat, basis.dat, "char" or "multi" variables in di_input_generator.py)
Run "di_input_generator.py"

Guidlines to submit mutiple QChem files

Run "multi_cal.sh" file to submit the geometry optimization calculations in the serial manner. (It followes the list of di-substituted CoCp₂OH in "der.dat" file which automatically generate by running the "di_input_generator.py" )

Guidlines to data extraction, property calculation and data analysis

As the next task, the required data is needed to extract from the QChem output files and then calculate further chemical properties. For these calculations, previously calculated data for fragments is also needed. (ex: cp_homo.dat, cp_lumo.dat, cocp_homo.dat, cocp_lumo.dat, h_charge.dat)
Geometry optimaization calculations can be performed using "cocp_frag_opt.in" and "cp_frag_opt.in" for CoCpOH and Cp fragments respectively. Then required data can be extracted to following data files using "frag_extract.py"
- Energy of substituted CoCpOH fragment: cocp.dat
- Energy of substituted Cp fragment: cp.dat
- Energy of highest occupied molecular orbital (HOMO) for substituted CoCpOH fragment: cocp_homo.dat
- Energy of lowest unoccupied molecular orbital (LUMO) for substituted CoCpOH fragment: cocp_lumo.dat
- Energy of HOMO for substituted Cp fragment: cp_homo.dat
- Energy of LUMO for substituted Cp fragment: cp_lumo.dat
For the Hirshfeld charge calculation;
- Create the input file for geometry optimization of substituted benzene using "h_charge_opt.in"
- Perfom the Hirshfeld charge calculation by adding the optimized geometry to "h_charge_sub.in"
- Use "h_charge_calculator.py" to calculate the sum of Hirshfeld charge on C₆H₅ for each substituent and generate the "h_charge.dat" file
Once all required data files are created, "data.py" can be used to calculate BDE and other chemical properties for di-substituted complexes. (Note: Do the necessary changes in data.py file as in the comments)

The extracted data has published in Open Source Framework (OSF) and the link to the database is attached below;

https://osf.io/6za8c/

Machine Learning Models

NOTE: Before applying the machine learning techniquies, derivatives below the DFT accuracy 3 kcal/mol of BDE were removed.

The following heat map illustrates the correlation between the features of fragments based on the generated data.

First, I tried out few regression machine learning models with the default parameters. The following table shows their performence;

Model	Train R²	Test R²	Train RMSE (kcal/mol)	Test RMSE (kcal/mol)
Linear Regression	0.459858	0.400263	4.857049	5.170292
Decision Tree	1	0.500963	0	4.716295
Bagged Tree	0.968559	0.758548	1.171838	3.280576
KNN	0.699967	0.428702	3.619956	5.046217
Random Forest	0.982086	0.795065	0.884542	3.022334
SVG	0.479298	0.378495	4.768846	5.263284
XG Boost	0.905036	0.844032	2.036564	2.636647

Since test R² scores are very poor compared to train R² scores, all models are suffering from over-fitting issue and it is also confirmed by the large test RMSE scores over train RMSE.
Out of these 7 models, XG Boost, random forest and bagged tree are selected based on the test scores for further optimizations.

The following table illustrates the performance of the models after tunning the hyperparameters and carrying out the cross validation (folds = 5);

Model	Train R²	Test R²	Train RMSE (kcal/mol)	Test RMSE (kcal/mol)
XG Boost	0.986399	0.845785	0.770731	2.621787
Random Forest	0.979996	0.798638	0.9347	2.995871
Bagged Tree	0.980953	0.801947	0.912076	2.971157

Out of the optimized models, XG Boost machine learning model is the best performing model by having the highest R² and lowest RMSE test scores.
Yet, the test scores are lower than the train scores even for the optimized models.
According to the test results of each model illustrated above, there is a similarity in the pattern of deviated points from the trend line when actual BDE $>$ 30 kcal/mol.
These outliers are the ones which cause to generate lower test scores. Therefore, the deviated derivatives were investigated further.
First, the common outliers for all three optimized models were identified by calculating $\lvert$ Actual BDE - Predicted BDE $\rvert$.
Below figire, shows the list top 9 outliers where $\lvert$ Actual BDE - Predicted BDE $\rvert$ $>$ 4 kcal/mol.
Seven of the nine outliers consist (CO)NH₂ group either in CoCpY' or CpY. Consequently, the (CO)NH₂ derivatives were sliced out and the distribution of the data was re-analyzed across each input feature.

As the above figure, the (CO)NH₂ derivatives do not appear as outliers when the entire data set is distributed across each input feature. However, since the input features do not provide a clear indication of (CO)NH₂ derivatives as outliers, it may be necessary to analyze a different property.
The properties of the substituted CoCp₂OH complex that were not utilized as input features for the machine learning models, but were employed to determine the BDE, as well as the HOMO and LUMO energies of the complex, were analyzed in the subsequent step. The selected properties are listed below;
- Energy of Co[CpY][CpY']OH
- Energy of CoCpY'energy
- CpY energy
- BDE (target variable)
- HOMO energy of Co[CpY][CpY']OH
- LUMO energy of Co[CpY][CpY']OH

Based on the data presented in above plots, it is evident that (CO)NH₂ derivatives appear as significant outliers in the distribution of HOMO energy of the complex.
Due to the peculiar behavior of (CO)NH₂ derivatives, they were removed from the data set and then tried with the selected machine learning models to investigate whether they make a significant difference in the performance.
The performance of the selected machine learning models with the refined data is presented in below grid and the summary is mentioned in Table.
Both train and test scores have decreased but removing (CO)NH₂ outliers from the data set made a significant difference on test scores of all 3 machine learning models. Out of 3 machine learning models, optimized XG Boost model has achieved the best prediction accuracy for unseen data which is 97%.

Model	Train R²	Test R²	Train RMSE (kcal/mol)	Test RMSE (kcal/mol)
XG Boost	0.9846	0.9677	0.7128	0.9465
Random Forest	0.9894	0.9349	0.5929	1.3442
Bagged Tree	0.9885	0.9371	0.6152	1.3214

Conclusion

The optimized XG Boost can be recommnded as a potential machine learning model to predict the stability of new di-substituted CoCp₂OH derivatives with the best accuracy.

Future work

The majority of (CO)NH₂ derivatives have higher BDE values with lower HOMO energies which implies that (CO)NH₂ stabilizes the complex by acting as electron withdrawing group.
However, this observation appears to contradict our previous work and experimental results, which showed that electron-donating groups stabilize CoCp₂OH derivatives.
Therefore, the impact of (CO)NH₂ on the electronic structure of CoCp₂OH is more complex than my initially thought.
Further investigation is needed to fully understand how (CO)NH₂ impacts the electronic structure and stability of CoCp₂OH.

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Hirshfeld.ipynb		Hirshfeld.ipynb
Initial Bagged Tree Model_no_conh2.png		Initial Bagged Tree Model_no_conh2.png
Initial Decision Tree Model_no_conh2.png		Initial Decision Tree Model_no_conh2.png
Initial KNN Model_no_conh2.png		Initial KNN Model_no_conh2.png
Initial Random Forest Model_no_conh2.png		Initial Random Forest Model_no_conh2.png
Initial SVG Model_no_conh2.png		Initial SVG Model_no_conh2.png
Initial XG Boost Model_no_conh2.png		Initial XG Boost Model_no_conh2.png
LICENSE		LICENSE
Linear Regression Model_no_conh2.png		Linear Regression Model_no_conh2.png
Optimized Bagged Tree Model_no_conh2.png		Optimized Bagged Tree Model_no_conh2.png
Optimized Random Forest Model_no_conh2.png		Optimized Random Forest Model_no_conh2.png
Optimized XG Boost Model_no_conh2.png		Optimized XG Boost Model_no_conh2.png
README.md		README.md
all_feat.png		all_feat.png
all_feat_out.png		all_feat_out.png
bdeVScomplexhomo.png		bdeVScomplexhomo.png
bdeVsfeatures.ipynb		bdeVsfeatures.ipynb
bdeVsfeatures_new.ipynb		bdeVsfeatures_new.ipynb
bdeVsfeatures_newdataset.ipynb		bdeVsfeatures_newdataset.ipynb
bde_plots.ipynb		bde_plots.ipynb
bdevsCoCpY' Hirshfeld charge.png		bdevsCoCpY' Hirshfeld charge.png
bdevsCpY Hirshfeld charge.png		bdevsCpY Hirshfeld charge.png
bdevsEHOMO(CoCpY'OH) (Eh).png		bdevsEHOMO(CoCpY'OH) (Eh).png
bdevsEHOMO(Co[CpY][CpY']OH) (Eh).png		bdevsEHOMO(Co[CpY][CpY']OH) (Eh).png
bdevsEHOMO(CpY) (Eh).png		bdevsEHOMO(CpY) (Eh).png
bdevsELUMO(CoCpY'OH) (Eh).png		bdevsELUMO(CoCpY'OH) (Eh).png
bdevsELUMO(Co[CpY][CpY']OH) (Eh).png		bdevsELUMO(Co[CpY][CpY']OH) (Eh).png
bdevsELUMO(CpY) (Eh).png		bdevsELUMO(CpY) (Eh).png
bdevscocp_energy_out.png		bdevscocp_energy_out.png
bdevscomplexenergy_out.png		bdevscomplexenergy_out.png
bdevscp_energy_out.png		bdevscp_energy_out.png
bt_opt.png		bt_opt.png
bt_opt_refined.png		bt_opt_refined.png
cobaltocenium_data.xlsx		cobaltocenium_data.xlsx
cobaltocenium_data_new.xlsx		cobaltocenium_data_new.xlsx
cocp.dat		cocp.dat
cocp_energy_out.xlsx		cocp_energy_out.xlsx
cocp_frag_opt.in		cocp_frag_opt.in
cocp_homo.dat		cocp_homo.dat
cocp_lumo.dat		cocp_lumo.dat
complex_energy_out.xlsx		complex_energy_out.xlsx
complex_energy_out_new.xlsx		complex_energy_out_new.xlsx
complex_feat.png		complex_feat.png
cp.dat		cp.dat
cp_energy_out.xlsx		cp_energy_out.xlsx
cp_frag_opt.in		cp_frag_opt.in
cp_homo.dat		cp_homo.dat
cp_lumo.dat		cp_lumo.dat
data.py		data.py
der.dat		der.dat
di_input_generator.py		di_input_generator.py
dist_CoCpY' Hirshfeld charge.png		dist_CoCpY' Hirshfeld charge.png
dist_CpY Hirshfeld charge.png		dist_CpY Hirshfeld charge.png
dist_E$_{HOMO}$(CoCpY'OH).png		dist_E$_{HOMO}$(CoCpY'OH).png
dist_E$_{HOMO}$(CpY).png		dist_E$_{HOMO}$(CpY).png
dist_E$_{LUMO}$(CoCpY'OH).png		dist_E$_{LUMO}$(CoCpY'OH).png
dist_E$_{LUMO}$(CpY).png		dist_E$_{LUMO}$(CpY).png
extract_subcor.py		extract_subcor.py
frag_extract.py		frag_extract.py
h_charge.dat		h_charge.dat
h_charge_calculator.py		h_charge_calculator.py
h_charge_opt.in		h_charge_opt.in
h_charge_sub.in		h_charge_sub.in
homo_complex_outliers.xlsx		homo_complex_outliers.xlsx
ml_code.ipynb		ml_code.ipynb
ml_code_2.ipynb		ml_code_2.ipynb
ml_code_2_acidity1.ipynb		ml_code_2_acidity1.ipynb
ml_code_2_corrected.ipynb		ml_code_2_corrected.ipynb
ml_code_no_cohn2.ipynb		ml_code_no_cohn2.ipynb
ml_model.png		ml_model.png
model_code_2_acidity2.ipynb		model_code_2_acidity2.ipynb
mono_sub_opt_sample.in		mono_sub_opt_sample.in
multi_cal.sh		multi_cal.sh
opt_rt.png		opt_rt.png
pa_table.png		pa_table.png
pol.dat		pol.dat
pred_bdevsCoCpY' Hirshfeld charge.png		pred_bdevsCoCpY' Hirshfeld charge.png
pred_bdevsCpY Hirshfeld charge.png		pred_bdevsCpY Hirshfeld charge.png
pred_bdevsEHOMO(CoCpY'OH) (Eh).png		pred_bdevsEHOMO(CoCpY'OH) (Eh).png
pred_bdevsEHOMO(CpY) (Eh).png		pred_bdevsEHOMO(CpY) (Eh).png
pred_bdevsELUMO(CoCpY'OH) (Eh).png		pred_bdevsELUMO(CoCpY'OH) (Eh).png
pred_bdevsELUMO(CpY) (Eh).png		pred_bdevsELUMO(CpY) (Eh).png
proton_affinity_norm.ipynb		proton_affinity_norm.ipynb
proton_afinity.png		proton_afinity.png
proton_afinity_data.xlsx		proton_afinity_data.xlsx
proton_afinity_plot.ipynb		proton_afinity_plot.ipynb
proton_afinity_table.png		proton_afinity_table.png
r2_opt_models.png		r2_opt_models.png
rf_opt_refined.png		rf_opt_refined.png
rmse_opt_models.png		rmse_opt_models.png
rt_opt.png		rt_opt.png
std_proton_afinity.png		std_proton_afinity.png
substituents_image.tif		substituents_image.tif
top9_outlier.png		top9_outlier.png
top9_outlier.xlsx		top9_outlier.xlsx
xgb_opt.png		xgb_opt.png
xgb_opt_fi.png		xgb_opt_fi.png
xgb_opt_fi_shap.png		xgb_opt_fi_shap.png
xgb_opt_refined.png		xgb_opt_refined.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predicting the stability of cobaltocenium derivatives

Data Generation

Machine Learning Models

Conclusion

Future work

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Predicting the stability of cobaltocenium derivatives

Data Generation

Machine Learning Models

Conclusion

Future work

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages