GeCKO provides four workflows (DataCleaning, ReadMapping, VariantCalling, and VcfFiltering), which can easily be chained to perform the different analyses needed to transform raw reads into filtered genotypes (a VCF file). This GitHub project focuses on the use case detailed in the paper, illustrating the features of GeCKO through a step-by-step analysis that traces the impact of domestication on allele diversity in durum wheat.
Below is a step-by-step guide to reproduce the use case analyses. For more information on the GeCKO project, please see the GeCKO GitHub repository.
This dataset is quite large; if your goal is simply to test your GeCKO installation, please use the small test dataset provided in the EXAMPLE folder of each workflow (e.g. the data cleaning example).
We consider 120 accessions representing the three major subspecies involved in the transition steps described above (Figure 2): the wild form T. turgidum dicoccoides (n=30, denoted DD), the first domesticated (solid-rachis) form T. turgidum dicoccum (n=30, DC), and the non-hulled cultivated form T. turgidum durum (n=60). The latter subspecies has been divided into two subgroups depending on whether the varieties originated in the pre- or post-Green Revolution period: the first group consists of lines derived from local varieties called "landraces" (DP, n=30), and the second consists of "elite" varieties registered in Europe after the Green Revolution (1970-1990; DE, n=30).
- Installing GeCKO (getting GeCKO workflows)
GeCKO workflows rely on Snakemake, which ensures reproducible and scalable data analysis and makes workflow installation straightforward: the only requirements are to have Snakemake and conda available in your environment (see the GeCKO install procedure for detailed information).
- Downloading dataset
The input data needed to reproduce the use case analysis are provided in a dedicated datagouv.fr repository. Note that this dataset contains the input files needed to reproduce the GeCKO analyses as well as GeCKO's expected output files. The latter can be used to check that GeCKO ran as expected, or to reproduce the population genetic analysis without running the GeCKO workflows.
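To check that a run matched the expected results, the decompressed contents can be compared directly. Here is a minimal self-contained sketch (the two small files are created on the spot for illustration; in practice you would point at your actual GeCKO output and the corresponding expected file from the dataset):

```shell
# Sketch: check that a GeCKO output matches the expected file from the
# downloaded dataset. The two files are created here only to make the
# example self-contained; replace them with your actual output and the
# expected file (e.g. variants_calling_converted.vcf.gz).
printf 'chr1A\t100\t.\tA\tG\n' | gzip > produced.vcf.gz
printf 'chr1A\t100\t.\tA\tG\n' | gzip > expected.vcf.gz

# Compare decompressed contents: gzip files with identical content can still
# differ byte-wise (embedded timestamps), so do not compare the .gz files directly.
gzip -dc produced.vcf.gz > produced.txt
gzip -dc expected.vcf.gz > expected.txt
if cmp -s produced.txt expected.txt; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```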
- Clone the GeCKO use case GitHub repository
By doing so you will get i) the GeCKO config files needed to reproduce the use case, and ii) the scripts and conda environment files needed to reproduce the population genetic analysis.
git clone git@github.com:GE2POP/GeCKO_UseCase.git
This should be done on a cluster with SGE or SLURM available.
- Copy the fastq.gz files downloaded from the dataverse into the appropriate RAWDATA folder
# for DEV_Cap009.2b
cp dataverse_files/DEV_Cap009.2b_R1.fastq.gz DEV_Cap009.2b/RAWDATA
cp dataverse_files/DEV_Cap009.2b_R2.fastq.gz DEV_Cap009.2b/RAWDATA
# for DEV_Cap010.2
cp dataverse_files/DEV_Cap010.2_R1.fastq.gz DEV_Cap010.2/RAWDATA
cp dataverse_files/DEV_Cap010.2_R2.fastq.gz DEV_Cap010.2/RAWDATA
- Get the wheat genome reference
The genome is downloaded from NCBI, chromosome names are shortened, and samtools is used to restrict the reference to the full chromosomes (eliminating the remaining contigs/scaffolds). As the wheat genome is very large (> 10 Gb), this step is quite time-consuming.
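The name-shortening step can be sketched as follows. The NCBI-style header and the sed pattern below are illustrative assumptions, not the exact patterns used by GeCKO_use_case_getWheatRef.sh (which additionally uses samtools faidx to extract the full chromosomes):

```shell
# Create a tiny FASTA with an NCBI-style verbose header (illustrative only).
cat > mini.fa <<'EOF'
>NC_000001.1 Triticum turgidum chromosome 1A, whole genome shotgun sequence
ACGTACGTACGT
EOF

# Shorten each header to a compact chromosome name by extracting the
# "chromosome XX" field (the sed pattern is an assumption about the headers).
sed -E 's/^>.*chromosome ([0-9]+[A-Z]).*/>chr\1/' mini.fa > mini_renamed.fa

head -1 mini_renamed.fa   # -> >chr1A
```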
If you don't have samtools on your HPC you can use the conda environment we provide:
conda env create --file conda_UC_GECKO_getRef.yml
conda activate UC-GECKO-getRef
chmod +x GeCKO_use_case_getWheatRef.sh
./GeCKO_use_case_getWheatRef.sh
conda deactivate
- Modify config files
The information regarding the fastq files, read indexes, etc. is, by default, retrieved from the config file CONFIG/config_WorkflowName.yml; these are all set with the correct values to reproduce the use case analyses.
The only files you need to adapt are the cluster profile files, used by default to provide cluster-specific information (e.g. job queue names) for each workflow:
- DEV_Cap009.2b/CONFIG/DC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap010.2/CONFIG/DC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/RM_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/VC_CLUSTER_PROFILE_SLURM/config.yaml
- DEV_Cap009_and_Cap010/CONFIG/VF_CLUSTER_PROFILE_SLURM/config.yaml
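These config.yaml files are standard Snakemake cluster profiles; the fields you typically need to adjust are the submission partition/queue and the resource defaults. A plausible fragment is sketched below — the key names and values are illustrative and may differ from the profiles shipped with the use case, so check the provided config.yaml files for the actual structure:

```yaml
# Illustrative SLURM profile fragment -- adapt the partition to your cluster.
# The actual keys in the provided config.yaml may differ.
cluster: "sbatch --partition={resources.partition} --mem={resources.mem_mb} --cpus-per-task={threads}"
default-resources:
  - partition=agap_normal   # replace with a partition that exists on your cluster
  - mem_mb=4000
jobs: 50
```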
- Launch GeCKO workflows
Assuming you cloned the GeCKO repository into your home directory, you should have a ~/GeCKO folder containing the GeCKO workflows, and all analyses can be launched as follows (each command needs to finish before launching the next one):
conda activate GeCKO_env
WORKFLOW_PATH=/home/user/GeCKO
# DataCleaning
cd DEV_Cap009.2b
${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50
cd ../DEV_Cap010.2
${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50
# ReadMapping
cd ../DEV_Cap009_and_Cap010
${WORKFLOW_PATH}/runGeCKO.sh --workflow ReadMapping --config-file CONFIG/config_ReadMapping.yml --cluster-profile CONFIG/RM_CLUSTER_PROFILE_SLURM --jobs 50
# VariantCalling
${WORKFLOW_PATH}/runGeCKO.sh --workflow VariantCalling --config-file CONFIG/config_VariantCalling.yml --cluster-profile CONFIG/VC_CLUSTER_PROFILE_SLURM --jobs 50
# VcfFiltering
${WORKFLOW_PATH}/runGeCKO.sh --workflow VcfFiltering --config-file CONFIG/config_VcfFiltering.yml --cluster-profile CONFIG/VF_CLUSTER_PROFILE_SLURM --jobs 50
Note that this launches the commands on the master node; it is advised to submit each command as a job on your cluster. For the first DataCleaning workflow, this looks like:
# DataCleaning
cd DEV_Cap009.2b
sbatch --partition=agap_normal --mem=10G --wrap="${WORKFLOW_PATH}/runGeCKO.sh --workflow DataCleaning --config-file CONFIG/config_DataCleaning.yml --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --jobs 50"
- Copy the GeCKO output files into GeCKO_UseCase/Population_genetic_analyses/
The files to be copied can be taken from the GeCKO output folders (if you have reproduced the GeCKO analysis) or from the dataverse_files folder:
cd GeCKO_UseCase/Population_genetic_analyses/
cp ~/dataverse_files/variants_calling_converted.vcf.gz .
#cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/VARIANTS_CALLING/GENOTYPE_GVCFS/variants_calling_converted.vcf.gz .
cp ~/dataverse_files/04__Genotype_Locus1_Sample_Locus2_Filtered.vcf.gz .
# cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/VCF_FILTERING/04__Genotype_Locus1_Sample_Locus2_Filtered.vcf.gz .
cp ~/dataverse_files/auto_zones.bed .
# cp DEV_Cap009_and_Cap010/WORKFLOWS_OUTPUTS/READ_MAPPING/EXTRACTED_BAMS/REFERENCE_zones/auto_zones.bed .
cp ~/dataverse_files/labels_groups.txt .
gzip -d variants_calling_converted.vcf.gz
- Create conda environments
# for egglib
conda env create --file conda_UC_GECKO.yml
# for R analyses
conda env create --file conda_UC_GECKO_R.yml
chmod +x *.sh
- Compute population statistics
conda activate UC-GECKO
./GeCKO_use_case_part1.sh
conda deactivate
- Compute molecular diversity and generate plots for the Fst-scan and PCA analyses
conda activate UC-GECKO_R
./GeCKO_use_case_part2.sh
conda deactivate
The molecular diversity of each population is provided in DRI.tsv. DRI values can be obtained by taking the ratio between the molecular diversities computed for the two populations being compared.
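The ratio computation can be sketched as follows. The two-column layout (population label, diversity value) and the values themselves are assumptions for illustration; adjust the command to the actual layout of DRI.tsv:

```shell
# Create a small diversity table mimicking DRI.tsv (values are made up;
# the real file layout may differ).
printf 'DD\t0.0040\nDC\t0.0030\nDP\t0.0025\nDE\t0.0020\n' > diversity_example.tsv

# DRI between two populations = ratio of their molecular diversities,
# e.g. wild (DD) over elite (DE):
awk -F'\t' '$1=="DD"{a=$2} $1=="DE"{b=$2} END{printf "%.1f\n", a/b}' diversity_example.tsv
# -> 2.0
```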