This README describes the steps that need to be taken to run the runs of homozygosity pipeline as presented in Bortoluzzi et al. (2020).
If you are using any of these scripts, please cite the following papers:
- Bortoluzzi, C., Bosse, M., Derks, M.F., Crooijmans, R.P., Groenen, M.A. and Megens, H.J., 2020. The type of bottleneck matters: Insights into the deleterious variation landscape of small managed populations. Evolutionary applications, 13(2), pp.330-341.
- Bosse, M., Megens, H.J., Madsen, O., Paudel, Y., Frantz, L.A., Schook, L.B., Crooijmans, R.P. and Groenen, M.A., 2012. Regions of homozygosity in the porcine genome: consequence of demography and the recombination landscape. PLoS genetics, 8(11), p.e1003100.
This pipeline consists of various Bash and Python scripts, each described in the order they should be executed.
Before running the pipeline, make sure you have installed the following programs:
- DeepVariant (v1.9.0): https://github.com/google/deepvariant?tab=readme-ov-file
- Samtools (v1.22.1): https://github.com/samtools/samtools
- Python >v3.6
If you want to run this pipeline, you can simply clone the github repository using the code below:
git clone git@github.com:cbortoluzzi/ROHs.git
In this step, you will use DeepVariant to perform variant calling from sequencing reads:
deepvariant.sh <reference_genome> <bam> <model>
Required inputs:
-
Reference genome (
<reference_genome>): the reference genome must be in uncompressed .fasta fomrat and indexed. Use the following command to index it:samtools faidx <reference_genome>. -
BAM file (
<bam>): the input alignment of an individual. Like the reference genome, the alignment must also be indexed:samtools index <bam>. -
Model type (
<model>): this specifies the DeepVariant model to use. UseWGSfor Illumina whole-genome sequencing data,PACBIOfor PacBio data, orONT_R104for Oxford Nanopore R10.4.1 chemistry Simplex and Duplex reads.
Outputs:
- VCF file: this is the standard VCF file containing all called variants.
- XML report: a report in XML format generated by DeepVariant containing summary statistics. For more information about the VCF stats report, refer to the DeepVariant documentation: https://github.com/google/deepvariant/blob/r1.9/docs/deepvariant-vcf-stats-report.md.
IMPORTANT: before running the script, make sure to change in the bash script the paths to DeepVariant and the directory where your data is stored.
In this step, you will use Samtools to compute the genome-wide coverage of each individual alignment file in BAM format:
samtools_depth.sh <bam>
Required inputs:
- BAM file (
<bam>): the indexed alignment file for the individual for which you want to calculate the genome-wide coverage.
Output:
- Coverage file: a text file (
.cov) reporting the genome-wide coverage. You will need this file in step 3 and 4.
In this step, you will use a python script to filter variants called in step 1. This is an important step to retain high quality variants:
python filter_variants.py -h
Parameters:
--vcf This is the VCF file generated by DeepVariant in step 1.
--cov This is the text file generated by Samtools in step 2.
--q This is the minimum Phred-quality score that a variant must have to be retained (default value: 15).
--dp This is the minimum read depth that a variant must have to be retained (default value: 6).
--gq This is the minimum genotype quality that a variant must have to be retained (default value: 20).
--o This is the name of the output directory where the filtered VCF file will be saved.
To run this script, make sure to have the following modules installed in python:
PyVCF 0.6.8pathlib 1.0.1subprocessargparse
To run this script in bash, you can refer to the example bash script filter_variants.sh.
This script filters variants in the VCF file based on three key parameters: minimum Phred-quality score, minimum read depth, and minimum genotype quality. The parameters -q, -dp, and -gq allow you to modify the default values for these thresholds.
You only need to specify these three parameters if you want to change the default settings. If you are fine with the defaults, you can omit them when running the script.
I recommend reviewing the XML report generated in Step 1 to assess the quality of your data. This report can help you decide if any parameters need adjusting. For example, a default genotype quality of 20 may be too low or too high for your specific dataset, and tweaking it could improve your variant filtering.
IMPORTANT: the script also filters variants if their read depth is above 2 x the average genome-wide coverage. If you want to allow for a higher maximum read depth, you will need to manually change the script at line 33. For example, if you want a maximum read depth of 2.5 x the average genome-wide coverage, change line 33 with max_depth = 2.5 * float(avg_genome_coverage).
In this step, you will use a python script to calculate a corrected measure of heterozygosity for each individual VCF file, employing a window size approach.
calculate_genome_wide_heterozygosity.py -h
Parameters:
--vcf This is the VCF file filtered in step 3.
--bam This is the BAM file.
--cov This is the text file generated by Samtools in step 2.
--dp This is the minimum read depth that a variant must have to be retained (default value: 6).
--w This is the window size to use to calculated the corrected measure of heterozygisity (default value: 10,000 bp).
--o This is the name of the output directory where the filtered VCF file will be saved.
To run this script in bash, you can refer to the example bash script RoH_pipeline_pt1.sh. Make sure to have samtools installed or loaded if you installed it as a module.
IMPORTANT: the script also filters variants if their read depth is above 2 x the average genome-wide coverage. If you want to allow for a higher maximum read depth, you will need to manually change the script at line 52. For example, if you want a maximum read depth of 2.5 x the average genome-wide coverage, change line 52 with max_depth = 2.5 * float(avg_genome_coverage).
The output file is going to look like the example below:
1 0 10000 8765 10 11.41
1 10000 20000 7865 16 20.34
1 20000 30000 6536 2 3.06
where the entries are, in order:
chromosome
start position of the window
end position of the window
total number of sites meeting our coverage criteria as calculated from the BAM file
total number of sites meeting our coverage criteria as calculated from the VCF file
corrected heterozygosity
The heterozygosity is "corrected" because it adjusts for the number of sites excluded from the 10 Kb window due to failing our coverage criteria. Therefore, if a 10 Kb window has 8765 well-covered sites and 10 well-covered heterozygous sites, the corrected heterozygosity is calculated as:
corrected_heterozygosity = (window size / cov_sites) * heterozygous_sites = (10000 / 8765) * 10 = 11.41
IMPORTANT: since the VCF file used in this step is the one generated in step 3, it is recommended to apply the same coverage filtering criteria here to maintain consistency.
In this step, you will use a python script to identify runs of homozygosity (RoH), which are here defined as stretches with lower than expected heterozygosity in an individual's genome.
runs_of_homozygosity.py -h
Parameters:
--het This is the tab delimited file obtained from step 4 on the genome-wide heterozygosity calculated using a window approach.
--w This is the window size that was used in the previous step (default value: 10,000 bp).
--t1 This is the first threshold in the pipeline and it's about the number of consecutive bins or windows to consider at a time when selecting candidate RoHs (default value: 10 bins).
--t2 This is the second threshold in the pipeline and it's about the minimum number of well covered sites within each window to consider when selecting candidate RoHs (default value: 6,000 sites).
--t3 This is the third threshold in the pipeline and it's about the fraction under which to retain a candidate RoH (default value: 0.25)
--o This is the name of the output directory where the RoHs will be saved.
To run this script in bash, you can refer to the example bash script RoH_pipeline_pt2.sh.
The identification of RoHs is based on the work of Bosse et al. (2012). Briefly, the script will consider 10 (--t1) consecutive bins at a time and only bins that will have at least 6,000 well-covered sites (--t2) will be retained and considered.
The average heterozygosity will then be calculated within this initial 10 consecutive bins. The 10 consecutive bins will then be retained as candidate ROHs only if the heterozygosity within these 10 consecutive bins is below 0.25 (--t3) the average genome-wide heterozygosity. All 10 consecutive bins that do not meet these
criteria will contribute to the genome‐wide heterozygosity level outside ROHs.
It is possible that within the 10 consecutive bins there are bins that show a peak in heterozygosity. These peaks are often symptomatic of local assembly or alignment errors. To avoid including them in the final RoHs, we relax the threshold within candidate homozygous stretches allowing for maximally twice the average heterozygosity in a bin.
If you experience issues or need support, either submit a request or contact me at: chiara[dot]bortoluzzi[dot]2[at]gmail[dot]com
Chiara Bortoluzzi, Data Manager, Environmental Bioinformatics Group, SIB Swiss Institute of Bioinformatics, Switzerland
- 0.3
- Upload of the runs of homozygosity script and relative documentation
- 0.2
- New release with documentation
- 0.1
- Initial Release
This pipeline was inspired by: Bosse, M., Megens, H.J., Madsen, O., Paudel, Y., Frantz, L.A., Schook, L.B., Crooijmans, R.P. and Groenen, M.A., 2012. Regions of homozygosity in the porcine genome: consequence of demography and the recombination landscape. PLoS genetics, 8(11), p.e1003100.