Nextflow Tutorial

A hands-on tutorial for learning Nextflow, a workflow management system for bioinformatics pipelines. This tutorial progresses from basic "Hello World" examples to a complete parallel BLAST workflow.

Setup
Tutorial Overview
Tutorial Examples
Running the Examples
Project Structure
Resources

Setup

Prerequisites

Access to an HPC environment with Nextflow and BLAST+ installed
Basic knowledge of command-line operations
Understanding of FASTA format and BLAST basics (for later examples)

Initial Configuration

Load Nextflow module:
```
module load nextflow
```
Set Nextflow home directory:
```
export NXF_HOME='/work/idoerg/<netid>/.nextflow'
```
Replace <netid> with your actual network ID.
Verify installation:
```
nextflow -h
```

Tutorial Overview

This tutorial demonstrates key Nextflow concepts through seven progressively complex examples:

Basic output - Writing to stdout
File output - Creating output files
Publishing results - Using publishDir directive
Multi-language support - Running Python scripts
Multiple processes - Chaining multiple processes
Real-world pipeline - Complete BLAST workflow
Parallel processing - Splitting work across chunks

Tutorial Examples

01_hello.nf - Basic Hello World

Concepts: Process definition, stdout, basic workflow

A simple introduction to Nextflow that prints a message to standard output.

nextflow run code/01_hello.nf

Key concepts:

nextflow.enable.dsl=2 - Enables DSL2 syntax
process blocks define computational tasks
output directive specifies what the process produces
workflow block orchestrates process execution
.view() displays output to console

02_hellowrite2file.nf - Writing to Files

Concepts: File outputs, output paths

Demonstrates how to write process output to a file instead of stdout.

nextflow run code/02_hellowrite2file.nf

Key concepts:

path 'result.txt' specifies file output
Output files are stored in Nextflow's work/ directory by default

03_hellofile.nf - Publishing Results

Concepts: publishDir, output management

Shows how to copy results to a designated output directory.

nextflow run code/03_hellofile.nf

Key concepts:

publishDir directive copies results to specified location
mode: 'copy' creates a copy (alternatives: 'symlink', 'move')
Results appear in the output/ directory

04_hellopython.nf - Multi-Language Scripts

Concepts: Shebang, Python in Nextflow

Demonstrates running Python code within a Nextflow process.

nextflow run code/04_hellopython.nf

Key concepts:

#!/usr/bin/python shebang specifies interpreter
Any scripting language can be used (Python, R, Perl, etc.)
Script content goes in the triple-quoted string

05_hellomultiprocess.nf - Multiple Processes

Concepts: Process composition, parallel execution

Shows how to define and run multiple independent processes.

nextflow run code/05_hellomultiprocess.nf

Key concepts:

Multiple processes can be defined in one workflow
Processes without dependencies run in parallel
Each process has its own execution environment

06_blast.nf - Complete BLAST Pipeline

Concepts: Input parameters, process chaining, data dependencies

A real bioinformatics pipeline that downloads a database, creates BLAST indices, and performs sequence alignment.

nextflow run code/06_blast.nf -profile hpc_modules

Key concepts:

params define configurable workflow parameters
input and output directives chain processes together
Processes execute only when their inputs are ready
Profile configuration (in nextflow.config) manages environment

Pipeline steps:

DOWNLOAD_UNIPROT_FASTA - Downloads yeast proteins from UniProt
MAKE_BLAST_DB - Creates BLAST database from downloaded sequences
BLASTP - Runs protein BLAST search with query sequences

07_blastparallel.nf - Parallel BLAST

Concepts: Channel operations, data parallelization, result aggregation

Extends the BLAST pipeline with parallel processing by splitting the query file into chunks.

nextflow run code/07_blastparallel.nf -profile hpc_modules

Key concepts:

Channel.fromPath() creates channels from files
.splitFasta() divides FASTA files into chunks
.map() transforms channel data
.collect() aggregates results from parallel processes
tuple passes multiple values together

Pipeline steps:

DOWNLOAD_UNIPROT_FASTA - Downloads database
MAKE_BLAST_DB - Creates BLAST indices
BLASTP - Runs BLAST on each chunk in parallel
MERGE_BLAST - Combines all chunk results into single file

Performance benefit: Processing 5 sequences at once (configurable via params.chunkSize) allows parallel execution and faster completion.

Running the Examples

Basic Execution

# Run without profile (for simple examples)
nextflow run code/01_hello.nf

# Run with HPC modules profile (for BLAST examples)
nextflow run code/06_blast.nf -profile hpc_modules

Common Options

# Resume from last successful step
nextflow run code/07_blastparallel.nf -resume -profile hpc_modules

# Override parameters
nextflow run code/07_blastparallel.nf --chunkSize 10 -profile hpc_modules

# View execution timeline
nextflow run code/07_blastparallel.nf -profile hpc_modules -with-timeline timeline.html

Cleaning Up

# Remove work directory after successful run
rm -rf work/

# Clean up Nextflow cache
nextflow clean -f

Performance Comparison: Nextflow vs Bash

To compare execution times between Nextflow and traditional bash scripts, you can use the time command:

Timing the Bash Script (06_blast.sh)

# Time the bash script execution
time ./code/06_blast.sh

Timing the Nextflow Pipeline (06_blast.nf)

# Time the Nextflow pipeline
time nextflow run code/06_blast.nf -profile hpc_modules

# For subsequent runs (using cache)
time nextflow run code/06_blast.nf -profile hpc_modules -resume

Timing Output Explained

The time command shows three values:

real - Total wall clock time (actual elapsed time)
user - CPU time spent in user mode
sys - CPU time spent in kernel mode

Example Comparison

# Clean run (no cached results)
rm -rf work/ work_bash/ results/

# Run bash version
echo "=== Bash Script ==="
time ./code/06_blast.sh

# Run Nextflow version
echo "=== Nextflow Pipeline ==="
time nextflow run code/06_blast.nf -profile hpc_modules

Key Observations:

First run: Nextflow has overhead for workflow management
Cached runs: Nextflow's -resume skips completed steps (huge time saver!)
Parallel workflows (07_blastparallel.nf): Nextflow shows real advantage with parallelization
Reproducibility: Nextflow tracks all intermediate steps automatically

Timing Parallel Execution

# Compare sequential vs parallel BLAST
echo "=== Sequential (06_blast.nf) ==="
time nextflow run code/06_blast.nf -profile hpc_modules

echo "=== Parallel (07_blastparallel.nf) ==="
time nextflow run code/07_blastparallel.nf -profile hpc_modules --chunkSize 5

Project Structure

nextflow_tutorial/
├── README.md                    # This file
├── nextflow.config              # Configuration profiles
├── code/                        # Tutorial scripts
│   ├── 01_hello.nf             # Basic output
│   ├── 02_hellowrite2file.nf   # File output
│   ├── 03_hellofile.nf         # Publishing results
│   ├── 04_hellopython.nf       # Python integration
│   ├── 05_hellomultiprocess.nf # Multiple processes
│   ├── 06_blast.nf             # BLAST pipeline (Nextflow)
│   ├── 06_blast.sh             # BLAST pipeline (bash - for comparison)
│   └── 07_blastparallel.nf     # Parallel BLAST
├── data/                        # Input data
│   └── query.fasta             # Query sequences
├── output/                      # Simple example outputs
├── results/                     # BLAST results
├── work/                        # Nextflow working directory
└── work_bash/                   # Bash script working directory

Configuration Profiles

The nextflow.config file defines two profiles:

hpc_modules

For HPC environments using environment modules:

nextflow run code/06_blast.nf -profile hpc_modules

Loads BLAST+ module before each task
Uses local executor

local

For local execution without modules:

nextflow run code/01_hello.nf -profile local

Basic local execution
Assumes tools are in PATH

Key Nextflow Concepts

Processes

Self-contained computational tasks
Define inputs, outputs, and script to execute
Can use any scripting language

Channels

Asynchronous queues that connect processes
Pass data between processes
Support operations like map, filter, split, collect

Workflows

Define the execution logic
Connect processes through channels
Can be nested and modularized

Directives

publishDir - Where to save results
input - What data the process receives
output - What data the process produces
script - The command(s) to execute

Resources

Official Documentation

Nextflow Documentation
Nextflow Patterns
nf-core - Community curated pipelines

Learning Resources

Community

Troubleshooting

Common Issues

Issue: "command not found" errors

Solution: Ensure modules are loaded or tools are in PATH. Use -profile hpc_modules for BLAST examples.

Issue: "work directory too large"

Solution: Run nextflow clean -f to remove cached work files.

Issue: Pipeline fails partway through

Solution: Use -resume flag to continue from last successful step.

Issue: Permission denied errors

Solution: Check file permissions and ensure NXF_HOME is writable.

Tips for Success

Always use -resume - Saves time by skipping completed steps
Start simple - Master basic examples before complex pipelines
Check work directory - Inspect intermediate files in work/ for debugging
Use .view() - Add to channels to see what data is flowing through
Read error messages - Nextflow provides detailed logs and error traces

Next Steps

After completing this tutorial:

Explore nf-core pipelines for production-ready workflows
Learn about containers (Docker/Singularity) for reproducibility
Study executor configuration for HPC clusters
Build your own pipeline for your research needs

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Nextflow Tutorial

Table of Contents

Setup

Prerequisites

Initial Configuration

Tutorial Overview

Tutorial Examples

01_hello.nf - Basic Hello World

02_hellowrite2file.nf - Writing to Files

03_hellofile.nf - Publishing Results

04_hellopython.nf - Multi-Language Scripts

05_hellomultiprocess.nf - Multiple Processes

06_blast.nf - Complete BLAST Pipeline

07_blastparallel.nf - Parallel BLAST

Running the Examples

Basic Execution

Common Options

Cleaning Up

Performance Comparison: Nextflow vs Bash

Timing the Bash Script (06_blast.sh)

Timing the Nextflow Pipeline (06_blast.nf)

Timing Output Explained

Example Comparison

Timing Parallel Execution

Project Structure

Configuration Profiles

hpc_modules

local

Key Nextflow Concepts

Processes

Channels

Workflows

Directives

Resources

Official Documentation

Learning Resources

Community

Troubleshooting

Common Issues

Tips for Success

Next Steps