nf_bennyben

A Nextflow pipeline for trimming, aligning, deduplicating, and joint variant-calling short-read sequencing data against a reference genome.

What it does

Given paired-end FASTQ files and a reference genome, the pipeline produces:

Per-sample QC reports (fastp, FastQC, MultiQC, qualimap)
Sorted, deduplicated, read-group-tagged BAMs
Per-sample gVCFs (HaplotypeCaller)
A jointly called, hard-filtered, recoded multi-sample VCF (SNPs and indels)

Processes, in order

| # | Process | Tool(s) | Purpose |
|---|---------|---------|---------|
| 1 | `index_genome` | samtools, GATK | Build `.fai` and `.dict` for the reference |
| 2 | `fastp` | fastp | Adapter and quality trimming |
| 3 | `fastqc` + `multiqc` | FastQC, MultiQC | Read-level QC reports |
| 4 | `alignment` | NextGenMap + samtools sort | Produce sorted BAM |
| 5 | `check_duplicates` | Picard AddOrReplaceReadGroups | Add read-group tags |
| 6 | `remove_duplicates` | GATK MarkDuplicates | Remove sequencing duplicates |
| 7 | `qualimap` + `qualimap_collate` | qualimap | Alignment QC + collated report |
| 8 | (optional) `base_recal1/2/3` | GATK BQSR | Base-quality recalibration; only runs if `--knownSites` is provided |
| 9 | `haplotype_caller` | GATK HaplotypeCaller (gVCF mode) | Per-sample gVCF |
| 10 | `genomicsdb_import` | GATK GenomicsDBImport | Per-interval sample combine (parallelized) |
| 11 | `genotype_gvcfs` | GATK GenotypeGVCFs | Per-interval joint genotyping |
| 12 | `gather_vcfs` | GATK GatherVcfs | Recombine intervals in reference `.dict` order |
| 13 | `downstream_filter` | GATK SelectVariants + VariantFiltration | Split SNPs/indels and apply hard filters |
| 14 | `recode_vcfs` | vcftools | Mask genotypes with `DP < 3`; preserve INFO column |
| 15 | `fill_tags` | bcftools `+fill-tags` | Refresh AC/AN/AF/NS/F_MISSING; output bgzip + tabix |

Parameters

Required

--fqPattern <glob> — Absolute path glob for paired FASTQs, e.g. '/path/to/data/*_{1,2}.fq.gz'. The literal {1,2} defines the pair.
--savePath <dir> — Where published outputs are written. Subdirectories are created automatically (raw_bams/, final_bam_files/, fastqc/, multiqc/, qualimap/, raw_snps/, filtered_snps/).
--refGenome <file.fna> — Absolute path to the reference FASTA.
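
Because the {1,2} in --fqPattern defines the pair, a sample with only one mate on disk is worth catching before launch. The following is a minimal sketch of such a check, not something the pipeline provides: check_fastq_pairs is a hypothetical helper, and it assumes the files follow the _1.fq.gz / _2.fq.gz naming used in the example glob.

```shell
# check_fastq_pairs DIR
# Hypothetical pre-launch helper; not part of the pipeline.
# Assumes mates are named <sample>_1.fq.gz and <sample>_2.fq.gz.
# Prints one line per file whose mate is missing; returns non-zero if any
# mate is missing or if no FASTQ files are found at all.
check_fastq_pairs() {
    dir=$1
    status=0
    found=0
    for f in "$dir"/*_1.fq.gz "$dir"/*_2.fq.gz; do
        [ -e "$f" ] || continue      # an unmatched glob stays literal; skip it
        found=1
        case $f in
            *_1.fq.gz) mate="${f%_1.fq.gz}_2.fq.gz" ;;
            *)         mate="${f%_2.fq.gz}_1.fq.gz" ;;
        esac
        if [ ! -e "$mate" ]; then
            echo "MISSING mate for $(basename "$f"): $(basename "$mate")"
            status=1
        fi
    done
    [ "$found" -eq 1 ] || { echo "no FASTQ files matched in $dir" >&2; return 1; }
    return "$status"
}
```

For example, check_fastq_pairs /path/to/data prints nothing and returns 0 when every file has its mate.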

Optional

--intervals <file> — Newline-separated list of intervals (one per line) for parallel joint genotyping. If omitted, intervals are derived from each contig in the reference .dict.
--knownSites <file.vcf.gz> — A VCF of known variants. If supplied, BQSR (three iterations) runs before HaplotypeCaller. If omitted, BQSR is skipped.
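
If you want to supply your own --intervals file, one way to build it is from the reference .fai index. The sketch below is only an illustration: fai_to_intervals is a hypothetical helper, and it assumes the pipeline accepts GATK-style contig:start-end strings, one per line. The optional minimum length is also an assumption; contigs you drop here will not be genotyped.

```shell
# fai_to_intervals FAI [MIN_LEN]
# Hypothetical helper; not part of the pipeline.
# Reads a samtools .fai index (column 1 = contig name, column 2 = length)
# and prints one interval per line as contig:1-length, skipping contigs
# shorter than MIN_LEN (default 0, i.e. keep everything).
fai_to_intervals() {
    awk -F '\t' -v min="${2:-0}" \
        '$2 + 0 >= min + 0 { print $1 ":1-" $2 }' "$1"
}
```

A possible use is fai_to_intervals ref.fna.fai > intervals.list, then passing --intervals intervals.list.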

Requirements

Nextflow 25.x or newer
Singularity (or Apptainer) — every process runs in a container; no local tool installs needed
A SLURM cluster (the default executor; configurable in nextflow.config)
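
The requirements above can be checked from a login node before the first run. This is a rough sketch rather than an official preflight: missing_tools and have_container_runtime are hypothetical helpers, and they only test whether a command is on PATH, not its version.

```shell
# Hypothetical preflight helpers; not part of the pipeline.

# missing_tools CMD...
# Prints each command that is not on PATH; returns non-zero if any is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo "$tool"; rc=1; }
    done
    return "$rc"
}

# have_container_runtime
# Returns 0 if either singularity or apptainer is on PATH.
have_container_runtime() {
    command -v singularity >/dev/null 2>&1 ||
        command -v apptainer >/dev/null 2>&1
}
```

For example, missing_tools nextflow sbatch squeue lists whatever is absent, and have_container_runtime || echo "no singularity or apptainer found" covers the container requirement. The Nextflow version itself can be read with nextflow -version.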

Running on a SLURM cluster

The repo ships a minimal nextflow.config. Cluster-specific options (account, partition, Tower token) should go in a personal config that you point Nextflow at with -c. Example for Purdue Negishi (~/config/nextflow.config):

singularity.enabled = true
tower.accessToken = 'YOUR_TOWER_TOKEN'    // optional, only needed for -with-tower

executor {
    name = 'slurm'
    queueSize = 10
}

process {
    cache = 'lenient'
    clusterOptions = '-A bharpur -p cpu'  // Negishi requires explicit -A and -p
    errorStrategy = 'retry'
}

Then invoke:

nextflow \
  -c ~/config/nextflow.config \
  run /depot/bharpur/apps/nf_bennyben/main.nf \
  --fqPattern '/depot/bharpur/data/projects/example/data/*_{1,2}.fq.gz' \
  --savePath /depot/bharpur/data/projects/example/output \
  --refGenome /depot/bharpur/data/ref_genomes/VMAN/GCF_014083535.2_V.mandarinia_Nanaimo_p1.0_genomic.fna \
  -w /scratch/$USER/example_work \
  -bg -resume -with-tower

Useful Nextflow flags

-bg — run in the background; safe to close the terminal.
-resume — pick up from cached tasks. Required for any incremental restart.
-with-tower — stream progress to the Seqera platform. Requires TOWER_ACCESS_TOKEN in the environment or tower.accessToken in a config.
-w <dir> — set the work/ directory. Recommended: point at scratch (e.g. /scratch/$USER/...) so heavy intermediates don't accumulate on depot. Do not change between runs you want to -resume — the cache lives there.

Troubleshooting

Logs: .nextflow.log in the launch directory records every task. Search it for "ERROR" or "terminated with an error" to find the failure.
Per-task logs: a failure message points at a work/XX/YYYYYY... directory. Inside, the most useful files are:
- .command.log — combined stdout/stderr from the task
- .command.run — the wrapper SLURM script
- .exitcode — the task's exit code
- .command.sh — the actual command Nextflow generated
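Session lock errors on -resume: usually a previous run was killed without cleanup. The lock lives under .nextflow/ in the launch directory, not in the -w work directory. Confirm nothing else is holding it (lsof .nextflow/cache/*/db/LOCK, run from the launch directory, should print nothing), then rm the LOCK file and rerun.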
Stopping a running pipeline:
- Foreground: Ctrl-C.
- Background: kill $(cat .nextflow.pid), then scancel any in-flight SLURM jobs.
Listing SLURM jobs: squeue -u $USER lists your in-flight per-task jobs (each is tagged with its process and sample/interval).
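
The log checks described in this section can be wrapped in two small shell functions. This is a sketch under the assumption that you run it from the launch directory; find_failures and show_task are hypothetical names, not commands shipped with the pipeline or with Nextflow.

```shell
# Hypothetical troubleshooting helpers; not part of the pipeline.

# find_failures [LOG]
# Prints matching lines, with line numbers, from the Nextflow log.
# Defaults to .nextflow.log in the current directory.
find_failures() {
    grep -nE 'ERROR|terminated with an error' "${1:-.nextflow.log}"
}

# show_task WORKDIR
# Prints the exit code, the generated command, and the last 20 lines of the
# combined log for one task work directory (work/XX/YYYYYY...).
show_task() {
    task=$1
    echo "exit code: $(cat "$task/.exitcode" 2>/dev/null || echo unknown)"
    echo "--- .command.sh ---"
    cat "$task/.command.sh"
    echo "--- .command.log (last 20 lines) ---"
    tail -n 20 "$task/.command.log"
}
```

A typical sequence is find_failures to locate the failing task, then show_task with the work directory named in the error message.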
