A Nextflow pipeline for trimming, aligning, deduplicating, and joint variant-calling short-read sequencing data against a reference genome.
Given paired-end FASTQ files and a reference genome, the pipeline produces:
- Per-sample QC reports (fastp, FastQC, MultiQC, qualimap)
- Sorted, deduplicated, read-group-tagged BAMs
- Per-sample gVCFs (HaplotypeCaller)
- A jointly called, hard-filtered, recoded multi-sample VCF (SNPs and indels)

| # | Process | Tool(s) | Purpose |
|---|---|---|---|
| 1 | index_genome | samtools, GATK | Build .fai and .dict for the reference |
| 2 | fastp | fastp | Adapter and quality trimming |
| 3 | fastqc + multiqc | FastQC, MultiQC | Read-level QC reports |
| 4 | alignment | NextGenMap + samtools sort | Produce sorted BAM |
| 5 | check_duplicates | Picard AddOrReplaceReadGroups | Add read-group tags |
| 6 | remove_duplicates | GATK MarkDuplicates | Remove sequencing duplicates |
| 7 | qualimap + qualimap_collate | qualimap | Alignment QC + collated report |
| 8 | (optional) base_recal1/2/3 | GATK BQSR | Base-quality recalibration; only runs if --knownSites is provided |
| 9 | haplotype_caller | GATK HaplotypeCaller (gVCF mode) | Per-sample gVCF |
| 10 | genomicsdb_import | GATK GenomicsDBImport | Per-interval sample combine (parallelized) |
| 11 | genotype_gvcfs | GATK GenotypeGVCFs | Per-interval joint genotyping |
| 12 | gather_vcfs | GATK GatherVcfs | Recombine intervals in reference .dict order |
| 13 | downstream_filter | GATK SelectVariants + VariantFiltration | Split SNPs/indels and apply hard filters |
| 14 | recode_vcfs | vcftools | Mask genotypes with DP < 3; preserve INFO column |
| 15 | fill_tags | bcftools +fill-tags | Refresh AC/AN/AF/NS/F_MISSING; output bgzip + tabix |


- `--fqPattern <glob>`: Absolute path glob for paired FASTQs, e.g. `'/path/to/data/*_{1,2}.fq.gz'`. The literal `{1,2}` defines the pair.
- `--savePath <dir>`: Where published outputs are written. Subdirectories are created automatically (`raw_bams/`, `final_bam_files/`, `fastqc/`, `multiqc/`, `qualimap/`, `raw_snps/`, `filtered_snps/`).
- `--refGenome <file.fna>`: Absolute path to the reference FASTA.
- `--intervals <file>`: Newline-separated list of intervals (one per line) for parallel joint genotyping. If omitted, intervals are derived from each contig in the reference `.dict`.
- `--knownSites <file.vcf.gz>`: A VCF of known variants. If supplied, BQSR (three iterations) runs before HaplotypeCaller. If omitted, BQSR is skipped.
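
When a custom `--intervals` file is wanted, one possible starting point is the contig list already stored in the reference `.dict`. The sketch below is a minimal example, assuming the dictionary follows the usual SAM header layout (`@SQ` lines with an `SN:` field) and that plain contig names are acceptable interval entries for this pipeline. The names `dict_to_intervals`, `ref.dict`, and `intervals.list` are hypothetical and not part of the repo.

```sh
# Print one contig name per line from a sequence dictionary (.dict).
# Each @SQ header line carries the contig name in its SN: field.
dict_to_intervals() {
  awk -F '\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SN:/) { print substr($i, 4); break }
  }' "$1"
}

# Hypothetical usage (file names are placeholders):
#   dict_to_intervals ref.dict > intervals.list
```

The resulting list can be edited by hand (for example, to drop contigs you do not want genotyped) before it is passed to `--intervals`.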
- Nextflow 25.x or newer
- Singularity (or Apptainer) — every process runs in a container; no local tool installs needed
- A SLURM cluster (the default executor; configurable in `nextflow.config`)
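
A quick pre-flight check along these lines can confirm the requirements above on a login node. It is a sketch only: the helper names `missing_tools` and `have_container_runtime` are made up for illustration, and `sbatch` and `squeue` are assumed to be the SLURM client commands available on your cluster.

```sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Either Singularity or Apptainer satisfies the container requirement.
have_container_runtime() {
  command -v singularity >/dev/null 2>&1 || command -v apptainer >/dev/null 2>&1
}

# Hypothetical usage:
#   missing_tools nextflow sbatch squeue
#   have_container_runtime || echo "no singularity/apptainer on PATH"
```

An empty result from `missing_tools` means every listed command was found.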
The repo ships a minimal nextflow.config. Cluster-specific options (account, partition, Tower token) should go in a personal config that you point Nextflow at with -c. Example for Purdue Negishi (~/config/nextflow.config):

    singularity.enabled = true
    tower.accessToken = 'YOUR_TOWER_TOKEN' // optional, only needed for -with-tower

    executor {
        name = 'slurm'
        queueSize = 10
    }

    process {
        cache = 'lenient'
        clusterOptions = '-A bharpur -p cpu' // Negishi requires explicit -A and -p
        errorStrategy = 'retry'
    }

Then invoke:

    nextflow \
      -c ~/config/nextflow.config \
      run /depot/bharpur/apps/nf_bennyben/main.nf \
      --fqPattern '/depot/bharpur/data/projects/example/data/*_{1,2}.fq.gz' \
      --savePath /depot/bharpur/data/projects/example/output \
      --refGenome /depot/bharpur/data/ref_genomes/VMAN/GCF_014083535.2_V.mandarinia_Nanaimo_p1.0_genomic.fna \
      -w /scratch/$USER/example_work \
      -bg -resume -with-tower

- `-bg`: run in the background; safe to close the terminal.
- `-resume`: pick up from cached tasks. Required for any incremental restart.
- `-with-tower`: stream progress to the Seqera platform. Requires `TOWER_ACCESS_TOKEN` in the environment or `tower.accessToken` in a config.
- `-w <dir>`: set the `work/` directory. Recommended: point at scratch (e.g. `/scratch/$USER/...`) so heavy intermediates don't accumulate on depot. Do not change between runs you want to `-resume`: the cache lives there.
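
Before launching, it can be worth confirming that every sample matched by `--fqPattern` has both mates, since a missing `_1` or `_2` file is easy to overlook. The helper below is a hypothetical sketch (`unpaired_samples` is not part of the pipeline) and assumes the `*_{1,2}.fq.gz` naming used in the example command, with files paired on the prefix before `_1`/`_2`.

```sh
# List sample prefixes in a directory that lack one mate of the
# <prefix>_1.fq.gz / <prefix>_2.fq.gz pair. Prints nothing when all are paired.
# The body runs in a subshell so the cd does not affect the caller.
unpaired_samples() (
  cd "$1" || exit 1
  for f in *_1.fq.gz *_2.fq.gz; do
    [ -e "$f" ] || continue            # skip a glob that matched nothing
    prefix=${f%_[12].fq.gz}
    if [ ! -e "${prefix}_1.fq.gz" ] || [ ! -e "${prefix}_2.fq.gz" ]; then
      printf '%s\n' "$prefix"
    fi
  done
)

# Hypothetical usage:
#   unpaired_samples /depot/bharpur/data/projects/example/data
```

Each unpaired prefix is printed once, because only one of its two files exists.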
- Logs: `.nextflow.log` in the launch directory records every task. Search for `ERROR` or `terminated with an error` to find the failure.
- Per-task logs: a failure message points at a `work/XX/YYYYYY...` directory. Inside, the most useful files are:
  - `.command.log`: combined stdout/stderr from the task
  - `.command.run`: the wrapper SLURM script
  - `.exitcode`: the task's exit code
  - `.command.sh`: the actual command Nextflow generated
- Session lock errors on `-resume`: a previous run was killed without cleanup. Confirm nothing else is holding it (`lsof <work>/.nextflow/cache/*/db/LOCK` should be empty), then `rm` the LOCK file and rerun.
- Stopping a running pipeline:
  - Foreground: Ctrl-C.
  - Background: `kill $(cat .nextflow.pid)`, then `scancel` any in-flight SLURM jobs.
- SLURM jobs visible: `squeue -u $USER` lists your in-flight per-task jobs (each is tagged with its process and sample/interval).
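
To locate failed tasks without reading through `.nextflow.log`, the per-task `.exitcode` files can be scanned directly. This is a rough sketch that takes the work directory passed with `-w` as its argument; `failed_tasks` is a hypothetical helper name, not something the pipeline provides.

```sh
# Print "<exit code><TAB><task directory>" for every task under a work
# directory whose .exitcode file holds a non-zero value.
failed_tasks() {
  find "$1" -type f -name .exitcode | while IFS= read -r f; do
    code=$(cat "$f")
    [ "$code" = "0" ] || printf '%s\t%s\n' "$code" "$(dirname "$f")"
  done
}

# Hypothetical usage:
#   failed_tasks /scratch/$USER/example_work
```

Each directory it reports contains the `.command.log` and `.command.sh` for that task, which is usually the next place to look.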