GeCKO provides four workflows (DataCleaning, ReadMapping, VariantCalling, and VcfFiltering), which can easily be chained to perform the different analyses needed to transform raw reads into filtered genotypes (a VCF file). This GitHub project focuses on the use case detailed in the paper, illustrating the features of GeCKO through a step-by-step analysis that traces the impact of domestication on allele diversity in durum wheat.
Below is a step-by-step guide to reproduce the use case analyses. For more information on the GeCKO project, please see the GeCKO GitHub repository.
This dataset is quite large; if your goal is simply to test your GeCKO installation, please use the small test dataset provided in the EXAMPLE folder of each workflow (e.g. the data cleaning example).
We consider 120 accessions representing the three major subspecies involved in the transition steps described above (Figure 2): the wild form T. turgidum dicoccoides (n=30, denoted DD), the first domesticated (solid-rachis) form T. turgidum dicoccum (n=30, DC), and the non-hulled cultivated form T. turgidum durum (n=60). The latter subspecies has been divided into two subgroups depending on whether the varieties originated in the pre- or post-Green Revolution period: the first group consists of lines derived from local varieties called "landraces" (DP, n=30), and the second consists of "elite" varieties registered in Europe after the Green Revolution (1970-1990; DE, n=30).
- Installing GeCKO (getting GeCKO workflows)
GeCKO workflows rely on Snakemake, which ensures reproducible and scalable data analysis and makes workflow installation straightforward: the only requirements are to have Snakemake and conda available in your environment (see the GeCKO install procedure for detailed information).
- Downloading dataset
The input data needed to reproduce the use case analysis are provided in a dedicated datagouv.fr repository. Note that this dataset contains the input files needed to reproduce the GeCKO analyses as well as GeCKO's expected output files. The latter can be used to check that GeCKO ran as expected, or to reproduce the population genetic analysis without running the GeCKO workflows.
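To check that a run matched the expected results, the decompressed contents can be compared directly. Here is a minimal self-contained sketch (the two small files are created on the spot for illustration; in practice you would point at your actual GeCKO output and the corresponding expected file from the dataset):

```shell
# Sketch: check that a GeCKO output matches the expected file from the
# downloaded dataset. The two files are created here only to make the
# example self-contained; replace them with your actual output and the
# expected file (e.g. variants_calling_converted.vcf.gz).
printf 'chr1A\t100\t.\tA\tG\n' | gzip > produced.vcf.gz
printf 'chr1A\t100\t.\tA\tG\n' | gzip > expected.vcf.gz

# Compare decompressed contents: gzip files with identical content can still
# differ byte-wise (embedded timestamps), so do not compare the .gz files directly.
gzip -dc produced.vcf.gz > produced.txt
gzip -dc expected.vcf.gz > expected.txt
if cmp -s produced.txt expected.txt; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```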
- Clone the GeCKO use case GitHub repository
By doing so you will get i) the GeCKO config files needed to reproduce the use case, and ii) the scripts and conda environment files needed to reproduce the population genetic analysis.
git clone git@github.com:GE2POP/GeCKO_UseCase.git
This should be done on a cluster with SGE or SLURM available.
- Copy the fastq.gz files downloaded from the dataverse into the appropriate RAWDATA folder
# for DEV_Cap009.2b
cp dataverse_files/DEV_Cap009.2b_R1.fastq.gz DEV_Cap009.2b/RAWDATA
cp dataverse_files/DEV_Cap009.2b_R2.fastq.gz DEV_Cap009.2b/RAWDATA
# for DEV_Cap010.2
cp dataverse_files/DEV_Cap010.2_R1.fastq.gz DEV_Cap010.2/RAWDATA
cp dataverse_files/DEV_Cap010.2_R2.fastq.gz DEV_Cap010.2/RAWDATA
- Get the wheat genome reference
The genome is downloaded from NCBI, chromosome names are shortened, and samtools is used to restrict the reference to the full chromosomes (eliminating the remaining contigs/scaffolds). As the wheat genome is very large (> 10 Gb), this step is quite time-consuming.
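The name-shortening step can be sketched as follows. The NCBI-style header and the sed pattern below are illustrative assumptions, not the exact patterns used by GeCKO_use_case_getWheatRef.sh (which additionally uses samtools faidx to extract the full chromosomes):

```shell
# Create a tiny FASTA with an NCBI-style verbose header (illustrative only).
cat > mini.fa <<'EOF'
>NC_000001.1 Triticum turgidum chromosome 1A, whole genome shotgun sequence
ACGTACGTACGT
EOF

# Shorten each header to a compact chromosome name by extracting the
# "chromosome XX" field (the sed pattern is an assumption about the headers).
sed -E 's/^>.*chromosome ([0-9]+[A-Z]).*/>chr\1/' mini.fa > mini_renamed.fa

head -1 mini_renamed.fa   # -> >chr1A
```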
If you don't have samtools on your HPC you can use the conda environment we provide:
conda env create --file conda_UC_GECKO_getRef.yml
conda activate UC-GECKO-getRef
chmod +x GeCKO_use_case_getWheatRef.sh
./GeCKO_use_case_getWheatRef.sh
conda deactivate
- Modify config files
The information regarding the fastq files, read indexes, etc. is, by default, retrieved from the config file CONFIG/config_WorkflowName.yml; these are all set with the correct values to reproduce the use case analyses.
The only files you need to adapt are the cluster profile files, used by default to provide cluster-specific information (e.g. job queue names) for each workflow:
- DEV_Cap009.2b/CONFIG/DC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap010.2/CONFIG/DC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/RM_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/VC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/VF_CLUSTER_PROFILE_SLURM/config.yaml
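These config.yaml files are standard Snakemake cluster profiles; the fields you typically need to adjust are the submission partition/queue and the resource defaults. A plausible fragment is sketched below — the key names and values are illustrative and may differ from the profiles shipped with the use case, so check the provided config.yaml files for the actual structure:

```yaml
# Illustrative SLURM profile fragment -- adapt the partition to your cluster.
# The actual keys in the provided config.yaml may differ.
cluster: "sbatch --partition={resources.partition} --mem={resources.mem_mb} --cpus-per-task={threads}"
default-resources:
  - partition=agap_normal   # replace with a partition that exists on your cluster
  - mem_mb=4000
jobs: 50
```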
- Launch GeCKO workflows
Assuming you cloned the GeCKO repository into your home directory, you should have a ~/GeCKO folder containing the GeCKO workflows, and all analyses can be launched as follows (each command needs to finish before launching the next one):
conda activate GeCKO_env
WORKFLOW_PATH=/home/user/GeCKO
# DataCleaning
cd DEV_Cap009.2b
${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50
cd ../DEV_Cap010.2
${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50
# ReadMapping
cd ../DEV_Cap009_and_Cap010
${WORKFLOW_PATH}/runGeCKO.sh --workflow ReadMapping --config-file CONFIG/config_ReadMapping.yml --cluster-profile CONFIG/RM_CLUSTER_PROFILE_SLURM --jobs 50
# VariantCalling
${WORKFLOW_PATH}/runGeCKO.sh --workflow VariantCalling --config-file CONFIG/config_VariantCalling.yml --cluster-profile CONFIG/VC_CLUSTER_PROFILE_SLURM --jobs 50
# VcfFiltering
${WORKFLOW_PATH}/runGeCKO.sh --workflow VcfFiltering --config-file CONFIG/config_VcfFiltering.yml --cluster-profile CONFIG/VF_CLUSTER_PROFILE_SLURM --jobs 50
Note that this launches the commands on the master node; it is advised to submit each command as a job on your cluster. For the first DataCleaning workflow, this looks like:
# DataCleaning
cd DEV_Cap009.2b
sbatch --partition=agap_normal --mem=10G --wrap="${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50"
- Copy the GeCKO output files into GeCKO_UseCase/Population_genetic_analyses/
The files to be copied can be taken from the GeCKO output folders (if you have reproduced the GeCKO analysis) or from the dataverse_files folder:
cd GeCKO_UseCase/Population_genetic_analyses/
cp ~/dataverse_files/variants_calling_converted.vcf.gz .
#cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/VARIANTS_CALLING/GENOTYPE_GVCFS/variants_calling_converted.vcf.gz .
cp ~/dataverse_files/04__Genotype_Locus1_Sample_Locus2_Filtered.vcf.gz .
# cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/VCF_FILTERING/04__Genotype_Locus1_Sample_Locus2_Filtered.vcf.gz .
cp ~/dataverse_files/auto_zones.bed .
# cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/READ_MAPPING/EXTRACTED_BAMS/REFERENCE_zones/auto_zones.bed .
cp ~/dataverse_files/labels_groups.txt .
gzip -d variants_calling_converted.vcf.gz
- Create conda environments
# for egglib
conda env create --file conda_UC_GECKO.yml
# for R analyses
conda env create --file conda_UC_GECKO_R.yml
chmod +x *.sh
- Compute population statistics
conda activate UC-GECKO
./GeCKO_use_case_part1.sh
conda deactivate
- Compute molecular diversity and generate plots for the Fst-scan and PCA analyses
conda activate UC-GECKO_R
./GeCKO_use_case_part2.sh
conda deactivate
The molecular diversity of each population is provided in DRI.tsv. DRI values can be obtained by taking the ratio between the molecular diversities computed for the two populations being compared.
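The ratio computation can be sketched as follows. The two-column layout (population label, diversity value) and the values themselves are assumptions for illustration; adjust the command to the actual layout of DRI.tsv:

```shell
# Create a small diversity table mimicking DRI.tsv (values are made up;
# the real file layout may differ).
printf 'DD\t0.0040\nDC\t0.0030\nDP\t0.0025\nDE\t0.0020\n' > diversity_example.tsv

# DRI between two populations = ratio of their molecular diversities,
# e.g. wild (DD) over elite (DE):
awk -F'\t' '$1=="DD"{a=$2} $1=="DE"{b=$2} END{printf "%.1f\n", a/b}' diversity_example.tsv
# -> 2.0
```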