SchwartzLabURI/PML

 
 

PML

Phylogenetic Machine Learning and Sequence Simulations

Add contents here

Simulation

Species tree simulation

Empirical species trees

0. Infer trees of empirical datasets in IQ-TREE

For consistency, and because not all studies release their tree files, the trees were inferred de novo from the provided alignments.

Navigate to the datasets/ directory and, for each dataset (Fong, Liu, Wickett, McGowen), follow the instructions in the README to download the appropriate dataset and run the accompanying IQ-TREE shell script.

1. Prepare the starting trees for simulation

Navigate to the empirical_tree_generation directory. Run 1_empirical_simulation_prep.sh to set up directories for simulations, then run 2_empirical_tree_processor.sh to edit the species trees inferred in IQ-TREE and generate parameters for downstream simulations.

cd empirical_tree_generation
sbatch 1_empirical_simulation_prep.sh
sbatch 2_empirical_tree_processor.sh

2. Simulate loci in SimPhy

Navigate to the simulation_scripts directory and run 1_simulation_prep.sh to organize data and generate directories for downstream simulations. Set up the batch script for SimPhy by copying 3_run_simphy_CLEAN.sh to 3_run_simphy.sh. Run 2_prep_simphy.sh to generate a list of SimPhy commands for simulating loci from the species trees generated previously, then run those commands with 3_run_simphy.sh:

cd simulation_scripts
sbatch 1_simulation_prep.sh
sbatch 2_prep_simphy.sh
sbatch --array=1-20 3_run_simphy.sh
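
The `--array=1-20` option starts 20 Slurm array tasks, and each task sees its own index in the `SLURM_ARRAY_TASK_ID` environment variable. One common way for an array script to pick its work is to read the matching line of a command list. The sketch below illustrates that pattern only: the file name `simphy_commands.txt` and the helper `nth_command` are hypothetical, and 3_run_simphy.sh may divide the commands differently.

```shell
# Hypothetical helper: print line N of a command list (one command per line).
nth_command() {
    sed -n "${2}p" "$1"
}

# Inside an array job, Slurm sets SLURM_ARRAY_TASK_ID to the task's index (1-20 here).
# The file name below is an assumption, not necessarily what 2_prep_simphy.sh writes.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f simphy_commands.txt ]; then
    cmd=$(nth_command simphy_commands.txt "$SLURM_ARRAY_TASK_ID")
    echo "Task $SLURM_ARRAY_TASK_ID would run: $cmd"
fi
```

Outside a Slurm job the `if` block does nothing, so the snippet can be sourced safely to try `nth_command` on any text file.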

Make sure all loci have been generated by running 4_post_simphy.sh:

4_post_simphy.sh

Navigate into the subdirectories for the Fong, Liu, McGowen, and Wickett datasets and check that all expected gene trees have been simulated:

wc -l gene_trees.tre
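
The same check can be repeated for every dataset with a small loop. This is a minimal sketch, assuming one gene tree per line in gene_trees.tre; the helper `check_gene_trees`, the directory names, and the expected count in the example call are placeholders to adapt to your run.

```shell
# Hypothetical helper: compare the number of lines (one tree per line) in
# DIR/gene_trees.tre with an expected count.
check_gene_trees() {
    dir="$1"; expected="$2"
    file="$dir/gene_trees.tre"
    if [ ! -f "$file" ]; then
        echo "$dir: MISSING"
        return 1
    fi
    n=$(wc -l < "$file" | tr -d ' ')
    if [ "$n" -eq "$expected" ]; then
        echo "$dir: OK ($n)"
    else
        echo "$dir: INCOMPLETE ($n of $expected)"
        return 1
    fi
}

# Example call (commented out): the directory names and the count are assumptions.
# for d in Fong Liu McGowen Wickett; do check_gene_trees "$d" 2000; done
```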

If all expected gene trees have been saved in gene_trees.tre, run INDELible:

sbatch 5_INDELible.sh

3. Run assessment programs

Make sure that MAFFT, AMAS, IQ-TREE2, FastSP, and HyPhy are installed, and correct paths/modules in the submission scripts as needed.
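
A quick way to confirm that the tools are reachable before submitting jobs is to look them up on `PATH`. The sketch below is not part of the repository; `check_tools` is a hypothetical helper, and the executable names in the example call are assumptions that depend on how each program was installed (for instance, whether modules must be loaded first).

```shell
# Report which of the named executables cannot be found on PATH.
# Returns a non-zero status if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (commented out); the executable names are assumptions.
# check_tools mafft AMAS.py iqtree2 hyphy java
```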

Navigate to 3_feature_assessment and run assessment scripts according to numbered order:

sbatch 1_prep.sh

When complete, submit the alignment job (the 8 array elements correspond to 2000 files split into bins of 250 files):

sbatch --array=1-8 2_mafft.sh
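
The array size follows from the bin arithmetic: 2000 files in bins of 250 gives 8 tasks. The sketch below shows one way a task index could map to a range of file indices; `bin_range` is a hypothetical helper, and 2_mafft.sh may assign files to bins differently.

```shell
# Print the first and last file index handled by array task TASK_ID
# when files are split into bins of BIN_SIZE.
bin_range() {
    task_id="$1"; bin_size="$2"
    start=$(( (task_id - 1) * bin_size + 1 ))
    end=$(( task_id * bin_size ))
    echo "$start $end"
}

for t in 1 2 8; do
    echo "task $t: files $(bin_range "$t" 250)"
done
```

With a bin size of 250, task 1 covers files 1 to 250 and task 8 covers files 1751 to 2000.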

When complete, submit jobs to run AMAS (properties), FastSP (alignment accuracy), and IQ-TREE (trees):

sbatch 3_run_amas.sh
sbatch 4_run_FastSP.sh
sbatch 5_iqtree_concat.sh
sbatch --array=1-8 6_iqtree_array.sh

When complete, submit a job to prepare a species tree for each locus to assess rates with HyPhy:

sbatch 7_prune_rescale_species_tree.sh

When complete, submit jobs to run HyPhy to assess site rates for each locus. Because separate trees were estimated for the Train and Test datasets, the rate assessments are also run separately. Also submit a job to reconstruct a coalescent tree of the Test subdataset using ASTRAL:

sbatch --array=1-4 8_rate_assessment_Train.sh
sbatch --array=5-8 9_rate_assessment_Test.sh
sbatch 10_run_astral.sh
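
Instead of waiting for each step by hand, Slurm job dependencies can enforce the "when complete" ordering used in this section. The sketch below is an optional convenience, not part of the repository: `submit_after` and the `SBATCH` override are hypothetical names, while `--parsable` and `--dependency=afterok:` are standard `sbatch` options. On multi-cluster setups `--parsable` may print the job ID followed by a semicolon and the cluster name, which this sketch does not handle.

```shell
# SBATCH can be overridden (for example SBATCH=echo) to preview the commands
# without submitting anything.
SBATCH="${SBATCH:-sbatch}"

# submit_after DEP_JOB_ID [sbatch options...] SCRIPT
# Submits SCRIPT so that it starts only after job DEP_JOB_ID finishes
# successfully. Pass an empty string as DEP_JOB_ID for no dependency.
submit_after() {
    dep="$1"; shift
    if [ -n "$dep" ]; then
        "$SBATCH" --parsable --dependency="afterok:$dep" "$@"
    else
        "$SBATCH" --parsable "$@"
    fi
}

# Usage (commented out so that sourcing this file submits nothing):
# prep_id=$(submit_after "" 1_prep.sh)
# mafft_id=$(submit_after "$prep_id" --array=1-8 2_mafft.sh)
# submit_after "$mafft_id" 3_run_amas.sh
```

A dependency on an array job ID waits for every task of that array, so a step chained after `2_mafft.sh` would start only once all 8 alignment tasks have succeeded.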

Run assessment script

For simulated datasets the true trees are known, so instead of using pruned and rescaled inferred trees, we prepare the true simulated trees.

Run the following to prepare true simulated trees:

sbatch 11_prune_simul_trees.sh

When complete, run the main feature assessment script:

12_assess_properties.sh

The output files are ML_train.txt and ML_test.txt. These files are used for some of the downstream analyses; however, to train or apply the machine learning model, the ML_train and ML_test files for all datasets are combined and certain columns are excluded (for example, the wRF column is excluded when training with RF as the proxy for phylogenetic utility).
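
The combine-and-exclude step is handled by the data preparation scripts in the later sections. As an illustration only, the sketch below shows the idea with `awk`, assuming the files are tab-delimited, share the same header row, and contain a column literally named `wRF`; the helper name `combine_drop` and the example paths are hypothetical.

```shell
# combine_drop COLUMN FILE...
# Concatenate tab-delimited files that share a header row, keep the header
# once, and drop the column whose header equals COLUMN.
combine_drop() {
    col="$1"; shift
    awk -F'\t' -v OFS='\t' -v col="$col" '
        FNR == 1 {
            if (NR != 1) next                 # skip repeated headers
            for (i = 1; i <= NF; i++) if ($i == col) drop = i
        }
        {
            out = ""; sep = ""
            for (i = 1; i <= NF; i++) {
                if (i == drop) continue       # omit the excluded column
                out = out sep $i; sep = OFS
            }
            print out
        }
    ' "$@"
}

# Example call (commented out); the paths are assumptions.
# combine_drop wRF Fong/ML_train.txt Liu/ML_train.txt > combined_train_RF.txt
```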

4. Model Training

Navigate to 4_model_training and run scripts according to numbered order:

First, prepare a training dataset for Robinson-Foulds and weighted Robinson-Foulds models:

sbatch 1_prep_data.sh

Next, train two random forest regression models on Robinson-Foulds and weighted Robinson-Foulds data. If you tune all hyperparameters, expect this step to take some time:

sbatch 2_RF_model.sh
sbatch 3_wRF_model.sh

Compute first- and second-order Friedman's H-statistics and generate partial dependence plots to examine feature interactions:

sbatch 4_investigate_interactions.sh

DNN and SVM Model Training

Navigate to 4_model_training/svm_and_dnn_models

To train a deep neural network for both the RF and wRF datasets, run:

sbatch 6_DNN.sh 

Outputs will be generated in the directories DNN_results and wRF_DNN_results. Each hyperparameter combination will be in its own subdirectory. To summarize results, run:

8_load_best_model.sh 

To train a support vector machine regressor for both the RF and wRF datasets, run:

7_SVM.sh

SVM models will be saved to 4_model_training/svm_and_dnn_models.

As part of model training, .png images containing R^2 scores and SHAP feature importance plots will be generated for all six models (RF and wRF models for the random forest regressor, DNN, and SVM models).

5. Locus Utility Prediction

Following model training, predict which loci from simulated datasets have the greatest phylogenetic utility. Navigate to 5_locus_utility_prediction and run the following to prepare data, predict locus utility, and generate locus subsets:

sbatch 1_prep_data.sh
sbatch 2_run_predict_locus_utility.sh
sbatch 3_subset_generator.sh

Next, infer concatenated phylogenies from the locus subsets in IQ-TREE:

sbatch 4_run_concat_iqtree_array.sh
sbatch 4a_run_concat_iqtree_wrf_array.sh

Finally, infer coalescent phylogenies from the gene trees of the locus subsets in ASTRAL:

sbatch 5_run_astral_array.sh
sbatch 6_run_astral_array_wrf.sh

6. Empirical Feature Assessment

Now we are ready to assess features from empirical datasets. Navigate to 6_empirical_features and run the following:

sbatch 0_data_prep.sh
sbatch 1_run_prep.sh
sbatch 2_run_mafft.sh

The McGowen dataset needs to be trimmed separately to remove empty alignments. Next, run AMAS for all datasets:

sbatch 3c_trim_McGowen.sh
sbatch 3_run_amas.sh
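
To see whether any alignment files are empty before running AMAS, a simple scan can help. This sketch uses one possible reading of "empty" (a file with no FASTA header lines) and is not a description of what 3c_trim_McGowen.sh does; the helper `list_empty_fasta`, the default `fasta` extension, and the example directory are assumptions.

```shell
# Hypothetical helper: list files in DIR with extension EXT (default: fasta)
# that contain no FASTA header lines, i.e. no lines starting with ">".
list_empty_fasta() {
    dir="$1"; ext="${2:-fasta}"
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue          # the pattern matched no files
        if ! grep -q '^>' "$f"; then
            echo "$f"
        fi
    done
}

# Example call (commented out); the directory name is an assumption.
# list_empty_fasta McGowen/alignments fasta
```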

Next, run FastSP to analyze alignment error:

sbatch 4_run_FastSP.sh

Run IQ-TREE to generate a concatenated tree from all alignments, and run an array job to infer gene trees for all loci:

sbatch 5_run_iqtree_concat.sh
sbatch 6_run_iqtree_array.sh

Now generate a pruned and rescaled species tree for each locus for downstream analysis in HyPhy:

sbatch 7_prune_rescale_species_tree.sh

When complete, submit jobs to run HyPhy to assess site rates for each locus:

sbatch 8_rate_assessment.sh

Infer a coalescent tree in ASTRAL from all loci gene trees:

sbatch 9_run_astral.sh

Assess all feature properties for all loci for each empirical dataset:

sbatch 10_assess_properties.sh

Prepare data for locus utility prediction, then predict locus utility:

sbatch 11a_prep_data.sh
sbatch 11_run_predict_locus_utility.sh

Generate loci subsets based on predicted phylogenetic utility:

sbatch 12_subset_generator.sh

Generate concatenated and coalescent trees from the subsets chosen by the Robinson-Foulds and weighted Robinson-Foulds random forest regressor models for each dataset:

sbatch 13a_wRF_run_concat_iqtree_array.sh
sbatch 13_run_concat_iqtree_array.sh
sbatch 14_run_astral_array.sh
sbatch 15_run_astral_array_wrf.sh

7. Generate summary statistics and figures

Navigate to 7_assessment_and_figures and run the following:

sbatch 1_fig1.sh
sbatch 2_fig2.sh
sbatch 3_fig3.sh
sbatch 4_fig4_5.sh
sbatch 5_fig6.sh
sbatch 6_fig_7.sh

UPDATED TO THIS POINT ONLY
