WESworkflow

A Practical Workflow for Correcting Kit-specific Effects in Whole-Exome Sequencing Data

This repository provides an end-to-end workflow for processing whole-exome sequencing (WES) data, from raw reads to gene-level feature matrices (CADD-weighted Allele Fraction), batch effect assessment, and imputation-based correction.

Requirements

Bash, R (≥4.0), Python (for PARC)
Tools: BWA, CrossMap (optional), samtools, Picard, DeepVariant, GLnexus, bcftools, Beagle, ANNOVAR
Additional R packages listed in the R scripts.
External resources:
- Reference genome (GRCh38)
- Reference panel (provided via Zenodo)
- Genetic maps (for Beagle)
- ANNOVAR databases
- CADD prescored database

Repository structure

Data/                      # example inputs and auxiliary resources
Data_pre_processing/       # QC + trimming + alignment / liftover
Variant_calling/           # DeepVariant + GLnexus
Variant_post_processing/   # genotype imputation with BEAGLE + ANNOVAR/CADD annotation
Variant_to_gene/           # gene-level variant aggregation
Gene-level imputation/     # clustering, GMM for threshold determination, gene-level imputation imputation

Workflow

1. Data preprocessing

Scripts:

Data_pre_processing/QC/trimm.sh
Data_pre_processing/QC/fastqc/
Data_pre_processing/Alignment/alignment.sh
Data_pre_processing/Liftover/liftover_bams.sh

Steps:

QC and trimming
alignment to GRCh38
duplicate marking
optional liftover

2. Variant calling

Scripts:

Variant_calling/Calling/run_deepvariant.sh
Variant_calling/Joint_genotyping/run_GLnexus.sh

Steps:

per-sample variant calling (DeepVariant)
joint genotyping (GLnexus)

3. Genotype imputation

Script:

Variant_post_processing/1_genotype_imputation.sh

Steps:

normalization and filtering
genotype conformation
imputation (Beagle)

4. Variant annotation

Script:

Variant_post_processing/2_annotation.sh

Steps:

functional annotation (ANNOVAR)
filtering to coding/splicing variants
adding CADD scores

5. Gene-level feature generation

Scripts:

Variant_to_gene/gene_aggregation.sh
Variant_to_gene/cal_features_multi.py

Steps:

aggregation of variants per gene
computation of gene-level features

6. Feature analysis and imputation

Scripts:

Gene-level imputation/
├── 1_features_loading.R
├── 2_clustering.R
├── 3_GMM.R
├── 4_feature_imputation.R

Steps:

feature loading
clustering (PARC)
detection rate modeling (GMM) for feature imputation thresholds
MNAR-aware gene-level KNN imputation

Data

Example metadata:

Data/example/sample_path_map_example.tsv

BED regions:

Data/bed/refGene_exons_splice5.nochr.bed

LoF annotations:

Data/gnomad_lofs/

Reference panel (EUR subset):
- provided separately via Zenodo (see DOI)

Notes

All scripts use /path/to/... placeholders — paths must be adapted
Large datasets (VCF, BAM, reference panels) are not included
Workflow is modular — individual steps can be run independently

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WESworkflow

Requirements

Repository structure

Workflow

1. Data preprocessing

2. Variant calling

3. Genotype imputation

4. Variant annotation

5. Gene-level feature generation

6. Feature analysis and imputation

Data

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Data		Data
Data_pre_processing		Data_pre_processing
Gene-level imputation		Gene-level imputation
Variant_calling		Variant_calling
Variant_post_processing		Variant_post_processing
Variant_to_gene		Variant_to_gene
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

WESworkflow

Requirements

Repository structure

Workflow

1. Data preprocessing

2. Variant calling

3. Genotype imputation

4. Variant annotation

5. Gene-level feature generation

6. Feature analysis and imputation

Data

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages