-
Notifications
You must be signed in to change notification settings - Fork 8
Tutorial and examples
This page describes several standard use cases. Note that all command lines should be run from the root of the github repository.
(I'll soon add another example to illustrate gene transfers)
Our example dataset consists of 8 gene families covering 13 primate species (extracted from the ENSEMBL database). I have sampled 1000 gene trees per gene family using MrBayes. The example data can be found under data/primates/. I also provide the reference rooted species tree.
In practice, for your own analyses, you will have to run Bayesian gene tree inference yourself, with your favorite tool (e.g. PhyloBayes, MrBayes, etc.) to generate a distribution of trees (we recommend at least 1000) for each gene family.
In this example, I assume that the species tree is known, and that we only want to perform standard gene tree reconciliation.
alerax -f data/primates/primate_families.txt -s data/primates/speciesTree.newick -p output_reconciliation --gene-tree-samples 100
You can see from the logs that the transfer rate is minimal (10^-7).
You can see an overview of the number of events happening at each branch of the species tree in output_reconciliation/reconciliations/perspecies_eventcount.txt. There were a few DL events but no HGT. In this example, there seems to be very little uncertainty because most of the 100 samples seem to recover the same events.
The family family_381 seems to have quite a lot of gene copies. If you open output_reconciliation/reconciliations/perspecies_eventcount.txt, you will get a summary of the per-species events (averaged over all 100 samples). There was 3 duplications at the root of the species tree (in my run, the label of the root was Node_PanUUUTroglodytes_MicrocebusUUUMurinus_0). You can visualize individual reconciled gene tree samples with Thirdkind. For instance,if you use a local installation of thirdkind (you can also use the web interface):
thirdkind -f output_reconciliation/reconciliations/all/family_381_11.xml
You should observe 3 duplications at the root and then a few losses along the tree:

This corresponds to duplications that occurred much earlier in the vertebrate tree (the gene families were extracted from a vertebrate dataset).
To run species tree inference, you need to add the argument --species-tree-search HYBRID. Here we also skip the reconciliation step with --gene-tree-samples 0, and start from a random species tree with -s random. You could also use -s MiniNJ to start from a more plausible tree.
alerax -f data/primates/primate_families.txt -s random -p output_species_inference --gene-tree-samples 0 --species-tree-search HYBRID
The inferred tree can be found here: output_reconciliation/species_trees/inferred_species_tree.newick.
You can find other files in the same directories with support information. For instance, you can get information about the RELL support for the root position in output_species_inference/species_trees/species_tree_root_rell.newick. Here is a picture to help you understand how to interpret the position of the scores:

The support is a value between 0 (no bootstrap supports this root position) and 1000 (all bootstraps support this position). Note that RELL tends to overestimate the support values. The support values are attached to the branches at which the root could be placed. The two branches around the inferred root have the same value (here 669) corresponding to the support for the inferred root. In this example, the terminal branch that leads to Pongo Abeli is the second best supported root position (298), which is most likely due to the fact that we have very few families to have a good support for the true position.
The different support values that we provide in AleRax are quite primitive. If you have several alternative species trees or species tree root positions that you want to test, we recommend to run AleRax on each of them separately, such that AleRax optimizes the model parameters for each candidate, and to then use a tool such as CONSEL to compare the different runs. CONSEL usually takes as input the per-site (log) likelihoods of different trees computed from the same MSA. Here, we'll give it per-family likelihoods for different species trees. AleRax outputs the per-family likelihoods into per_fam_likelihoods.txt, but we provide a script to do the job for you. In this example we will compare three root positions for the primate species tree (the true root, rooting at Pongo Abeli, and rooting at Homo Sapiens). We run AleRax on the three corresponding species tree files (with different output directories!!!!) and we do not sample any gene trees because we only need the per-family likelihoods.
alerax -f data/primates/primate_families.txt -s data/primates/speciesTree.newick -p output_testTrue --gene-tree-samples 0
alerax -f data/primates/primate_families.txt -s data/primates/pongoRootSpeciesTree.newick -p output_testPongo --gene-tree-samples 0
alerax -f data/primates/primate_families.txt -s data/primates/homoRootSpeciesTree.newick -p output_testHomo --gene-tree-samples 0
Then we can run a script to generate the input file for consel (consel.txt):
python scripts/generate_consel_file.py consel.txt output_testTrue/ output_testHomo/ output_testPongo/
The script will tell you which run corresponds to which item in consel:

You can then run consel (but you have to install consel yourself):
makermt --puzzle consel.txt
consel consel.rmt
catpv consel.pv
I get the following output, indicating a high support for the correct root (item 1):

By default, AleRax assumes that all families and all species have the same DTL probabilities. You can estimate the per-family DTL probabilities with --model-parameterization PER-FAMILY:
alerax -f data/primates/primate_families.txt -s data/primates/speciesTree.newick -p output_familyrates --gene-tree-samples 100 --model-parameterization PER-FAMILY
You can check the different estimated DTL probabilities by looking in the directory output_familyrates/model_parameters/. For instance, family_12270 gets the minimum D and T probabilities (10^-7) but a non-null loss rate, and family_381 has a few gene transfers.