A Practical Workflow for Correcting Kit-specific Effects in Whole-Exome Sequencing Data
This repository provides an end-to-end workflow for processing whole-exome sequencing (WES) data, from raw reads to gene-level feature matrices (CADD-weighted Allele Fraction), batch effect assessment, and imputation-based correction.
- Bash, R (≥4.0), Python (for PARC)
- Tools: BWA, CrossMap (optional), samtools, Picard, DeepVariant, GLnexus, bcftools, Beagle, ANNOVAR
- Additional R packages listed in the R scripts.
- External resources:
  - Reference genome (GRCh38)
  - Reference panel (provided via Zenodo)
  - Genetic maps (for Beagle)
  - ANNOVAR databases
  - CADD prescored database
Data/ # example inputs and auxiliary resources
Data_pre_processing/ # QC + trimming + alignment / liftover
Variant_calling/ # DeepVariant + GLnexus
Variant_post_processing/ # genotype imputation with Beagle + ANNOVAR/CADD annotation
Variant_to_gene/ # gene-level variant aggregation
Gene-level imputation/ # clustering, GMM for threshold determination, gene-level imputation
Scripts:
Data_pre_processing/QC/trimm.sh
Data_pre_processing/QC/fastqc/
Data_pre_processing/Alignment/alignment.sh
Data_pre_processing/Liftover/liftover_bams.sh
Steps:
- QC and trimming
- alignment to GRCh38
- duplicate marking
- optional liftover
Scripts:
Variant_calling/Calling/run_deepvariant.sh
Variant_calling/Joint_genotyping/run_GLnexus.sh
Steps:
- per-sample variant calling (DeepVariant)
- joint genotyping (GLnexus)
Script:
Variant_post_processing/1_genotype_imputation.sh
Steps:
- normalization and filtering
- conforming genotypes to the reference panel
- imputation (Beagle)
Script:
Variant_post_processing/2_annotation.sh
Steps:
- functional annotation (ANNOVAR)
- filtering to coding/splicing variants
- adding CADD scores
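
The filtering and scoring steps above can be pictured with a small Python sketch. It is illustrative only and is not taken from 2_annotation.sh: the record fields (`func`, `chrom`, `pos`, `ref`, `alt`), the in-memory score table, and the toy values are all assumptions. The labels `exonic`, `splicing`, and `exonic;splicing` follow the region categories ANNOVAR reports for gene-based annotation.

```python
# Illustrative sketch: field names, the score table, and all values are made up.
CODING_OR_SPLICING = {"exonic", "splicing", "exonic;splicing"}

def annotate_with_cadd(variants, cadd_scores):
    """Keep coding/splicing variants and attach a CADD score when one exists."""
    kept = []
    for v in variants:
        if v["func"] not in CODING_OR_SPLICING:
            continue  # drop intronic, intergenic, UTR, etc.
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        annotated = dict(v)  # copy so the input records stay untouched
        annotated["cadd"] = cadd_scores.get(key)  # None if the variant is not prescored
        kept.append(annotated)
    return kept

variants = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G", "func": "exonic"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T", "func": "intronic"},
    {"chrom": "2", "pos": 3000, "ref": "G", "alt": "A", "func": "splicing"},
]
cadd_scores = {("1", 1000, "A", "G"): 23.4}
annotated = annotate_with_cadd(variants, cadd_scores)
```

Variants without a prescored entry are kept with a missing score here rather than dropped; whether the actual script does the same is not stated in this README.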
Scripts:
Variant_to_gene/gene_aggregation.sh
Variant_to_gene/cal_features_multi.py
Steps:
- aggregation of variants per gene
- computation of gene-level features
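
As a rough illustration of what a gene-level CADD-weighted allele fraction could look like, the sketch below uses an assumed definition: for one sample, each variant's alternate-allele fraction (dosage / 2) is weighted by its CADD score and normalized by the summed CADD scores of the gene's variants. This formula, the function name, and the toy records are assumptions for illustration; the definition actually used is the one in Variant_to_gene/cal_features_multi.py.

```python
from collections import defaultdict

def cadd_weighted_af(records):
    """Assumed definition: sum(cadd * dosage / 2) / sum(cadd), per gene.

    records: iterable of (gene, cadd, alt_dosage) for one sample,
    with alt_dosage in {0, 1, 2}.
    """
    weighted = defaultdict(float)
    total_weight = defaultdict(float)
    for gene, cadd, dosage in records:
        weighted[gene] += cadd * (dosage / 2.0)
        total_weight[gene] += cadd
    return {
        gene: (weighted[gene] / total_weight[gene] if total_weight[gene] > 0 else 0.0)
        for gene in total_weight
    }

# Toy records with hypothetical gene names and scores.
records = [("GENE_A", 20.0, 1), ("GENE_A", 10.0, 0), ("GENE_B", 30.0, 2)]
features = cadd_weighted_af(records)
```

Under this assumed definition a gene's feature lies between 0 (no alternate alleles) and 1 (homozygous alternate at every scored variant).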
Scripts:
Gene-level imputation/
├── 1_features_loading.R
├── 2_clustering.R
├── 3_GMM.R
└── 4_feature_imputation.R
Steps:
- feature loading
- clustering (PARC)
- detection rate modeling (GMM) for feature imputation thresholds
- MNAR-aware gene-level KNN imputation
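
The last step can be sketched in Python as plain k-nearest-neighbour imputation on a samples-by-genes matrix. This is a generic illustration, not the logic of 4_feature_imputation.R: it assumes the entries to impute have already been marked as `None` upstream (for example by the detection-rate thresholds), and the distance measure, `k`, and the toy matrix are arbitrary choices.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with the mean of that gene over the k nearest samples.

    matrix: list of rows (samples), each a list of floats or None (genes).
    Distance is the root-mean-square difference over genes observed in both samples.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in matrix]  # work on a copy
    for i, row in enumerate(matrix):
        for g, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other samples that have this gene observed and share
            # at least one observed gene with the current sample.
            donors = [(distance(row, other), other[g])
                      for j, other in enumerate(matrix)
                      if j != i and other[g] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][g] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Toy matrix: sample 0 is missing the third gene.
matrix = [
    [0.10, 0.20, None],
    [0.12, 0.22, 0.50],
    [0.11, 0.19, 0.54],
    [0.90, 0.80, 0.05],
]
filled = knn_impute(matrix, k=2)
```

In the toy matrix, sample 0 is closest to samples 1 and 2, so its missing value is filled from those two and the dissimilar sample 3 is ignored.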
- Example metadata:
Data/example/sample_path_map_example.tsv
- BED regions:
Data/bed/refGene_exons_splice5.nochr.bed
- LoF annotations:
Data/gnomad_lofs/
- Reference panel (EUR subset):
  - provided separately via Zenodo (see DOI)
- All scripts use /path/to/... placeholders; paths must be adapted
- Large datasets (VCF, BAM, reference panels) are not included
- Workflow is modular — individual steps can be run independently
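
Because every script ships with placeholder paths, it can help to check for leftovers before running a step. The helper below is a minimal sketch and is not part of the repository: the function name and the example script content are hypothetical, and the only assumption about the real scripts is that unadapted paths contain the string `/path/to/`.

```python
def find_placeholder_lines(script_text, marker="/path/to/"):
    """Return (line_number, line) pairs for lines that still contain the marker."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if marker in line:
            hits.append((number, line.strip()))
    return hits

# Hypothetical script content, for illustration only.
example_script = """#!/bin/bash
REF=/path/to/reference.fa
OUT=/data/project/output
"""
hits = find_placeholder_lines(example_script)
```

An empty result means no line in the given text still contains the placeholder marker.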