chipnexus-processing-scripts/README.Rmd at master · mlweilert/chipnexus-processing-scripts · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
---
title: "ChIP-nexus Processing Pipeline"
author: "Melanie Weilert and Jeff Johnston"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  rmarkdown::github_document:
  pdf_document:
    toc: yes
  word_document:
    toc: yes
header-includes:
- \usepackage{fancyhdr}
- \usepackage{lipsum}
- \pagestyle{fancy}
- \fancyfoot[CO,CE]{'Stowers Institute of Medical Research'}
- \fancyfoot[LE,RO]{\thepage}
- \fancypagestyle{plain}{\pagestyle{fancy}}
editor_options:
  chunk_output_type: console
---

# Introduction

The purpose of this script is to provide a readable workflow for processing ChIP-nexus data (http://www.nature.com/nbt/journal/v33/n4/full/nbt.3121.html). ChIP‐nexus (Chromatin‐Immunoprecipitation with nucleotide resolution using exonuclease digestion, unique barcode and single ligation) is a ChIP‐exo protocol that makes use of a unique barcode to identify duplicate reads and a novel library preparation strategy that is adopted from iCLIP.

Other alignment tools can be used at the discretion of the investigator, but the preprocessing_fastq.R and the process_bam.R are meant to act as helpers for easy downstream analysis of ChIP-nexus data.

Detailed information on ChIP-nexus experimental protocols can be found at: http://research.stowers.org/zeitlingerlab/protocols.html

# Computational Setup

The following Unix and R tools need to be set up as follows:

## UNIX environment

Before running any of the pipelines below, you will first need to ensure that your UNIX environment is properly configured. The pipelines currently require the following:

 - R 3.2.5
 - Bioconductor 3.2
 - bowtie 1.1.2
 - gzcat
 - cutadapt 1.8.1
 - samtools 1.3.1

### bowtie

The pipelines look for bowtie to be installed at `~/apps/bowtie/bowtie`:

```{bash, eval=F}
~/apps/bowtie/bowtie --version
```

If the above command does not work, download and uncompress bowtie 1.1.2 into `~/apps/bowtie`.

Modify your `~/.bashrc` to set the `BOWTIE_INDEXES` environment variable and add `~/bin` to your path.

```{bash, eval=F}
export BOWTIE_INDEXES=[path to bowtie indexes]
export PATH=~/bin:$PATH
```

### cutadapt

The ChIP-nexus pipelines use `cutadapt` to perform adapter trimming. Install version 1.8.1 locally via `pip` and ensure it is in your PATH:

```{bash, eval=F}
pip install --user cutadapt==1.8.1
~/.local/bin/cutadapt --version
mkdir -p ~/bin
ln -s ~/.local/bin/cutadapt ~/bin
```

### gzcat

The UNIX command `zcat` should be symlinked to `gzcat`. The pipelines use `gzcat` instead of `zcat` for compatability with Mac OS X.

```{bash, eval=F}
ln -s `which zcat` ~/bin/gzcat
```

### samtools

Verify that `samtools` version 1.2 or higher is available:

```{bash, eval=F}
samtools --version
```

## R and Bioconductor environment

Verify that you have the proper version of R and Bioconductor installed:

```{bash, eval=F}
Rscript --version
Rscript -e "library(BiocInstaller)"
```

Start an R session and verify you have the necessary packages installed:

```{r r_packages}
library(dplyr)
library(pander)
cran_packages <- c("magrittr", "RCurl", "optparse", "stringr", "ascii", "data.table")
bioc_packages <- c("GenomicAlignments", "Rsamtools", "rtracklayer",
                   "GenomicRanges", "chipseq", "ShortRead")
packages <- data_frame(Package = cran_packages, Source = "CRAN") %>%
            rbind(data_frame(Package = bioc_packages, Source = "BioC")) %>%
            mutate(Installed = Package %in% rownames(installed.packages()))
packages %>% pander(caption="Installation status of required packages")
```


# Running the ChIP-nexus pipeline

# Option 1: Run using Snakemake [recommended]

In order to provide a reproducible and easy-to-read workflow for anyone looking at your processing steps, we highly recommend using Snakemake to run your pipeline. More details on this tools can be used here: https://snakemake.readthedocs.io/en/stable/. The tutorial they use is an aligner for genomes, so it will be useful to look over.

1. **Install** Snakemake on your machine: `pip install snakemake`
2. **Organize** your files to be the same format as this GitHub repo.
+ `data/fastq/{factor}_nexus_{replicate}.fastq` files are present
+ `bowtie_indexes/{genome_prefix}` files are present
+ `scripts/*` files are present
3. **Edit** the `Snakefile` in a text editor (lines 37-41) so that the pipeline workflow knows what samples to process.
4. In your directory that contains the `Snakefile`, type `snakemake -np` to simulate what will be run in this pipeline.
5. If (4) returns no errors, then type `snakemake` and watch your pipeline run for each sample/replicate combination!

Please look at the `data/fastq/sample_nexus_1.fastq` example that the `Snakefile` shows for a basic example.

# Option 2: Run each script manuscript through each sample [not recommended]

## Step 1: Preprocessing FASTQ reads

Given a sequencing file of ChIP-nexus reads, this step removes reads that don't have the proper fixed barcode, moves the random barcode sequences to the FASTQ read name, and removes both the fixed and random barcodes from the reads. Note that this step is highly recommended, but optional.

Additionally, it is recommended that ChIP-nexus FASTQ reads are sequenced single-end due to the imbalance between additional information gained and cost/time constraints by seuqencing paired end. However, if you wish to preprocess and align ChIP-nexus FASTQ files that are paired end, please follow the relevant chunks of instructions below:

### 1.1. R-Based Approach

#### 1.1.1. Single-end Sequencing/Alignment

Inputs:

-f, --file: Path of ChIP-nexus FASTQ file to process [required]
-o, --output: Output FASTQ file (gzip compressed) [required]
-t, --trim: Pre-trim all reads to this length before processing [default=0]
-k, --keep: Minimum number of bases required after barcode to keep read [default=18]
-b, --barcode: Barcode sequences (comma-separated) that follow random barcode") [default="CTGA"]
-r, --randombarcode: Number of bases at the start of each read used for random barcode [default=5]
-c, --chunksize: Number of reads to process at once (in thousands) [default=1000]
-p, --processors: Number of simultaneous processing cores to utilize [default=2]

Outputs: gzip-compressed FASTQ file with filtered reads

Example Use Case:

```{bash, eval=F}
Rscript scripts/preprocess_fastq.r -f chipnexus_sample.fastq.gz \
                           -k 22 -b CTGA -r 5 -p 4 -c 1000 \
                           -o chipnexus_sample_processed.fastq.gz
```

#### 1.1.2. Single-end Sequencing/Alignment

Inputs
-f, --first: Path of forward strand ChIP-nexus FASTQ file to process [required]
-s, --second: Path of reverse strand ChIP-nexus FASTQ file to process [required]
-o, --output: Output FASTQ file (gzip compressed) [required]
-t, --trim: Pre-trim all reads to this length before processing [default=0]
-k, --keep: Minimum number of bases required after barcode to keep read [default=18]
-b, --barcode: Barcode sequences (comma-separated) that follow random barcode") [default="CTGA"]
-r, --randombarcode: Number of bases at the start of each read used for random barcode [default=5]
-c, --chunksize: Number of reads to process at once (in thousands) [default=1000]
-p, --processors: Number of simultaneous processing cores to utilize [default=2]

Outputs: gzip-compressed FASTQ file with filtered reads

Example Use Case:

```{bash, eval=F}
Rscript scripts/preprocess_paired_fastq.r -f chipnexus_sample_read1.fastq.gz -s chipnexus_sample_read2.fastq.gz \
                           -k 22 -b CTGA -r 5 -p 4 -c 1000 \
                           -o chipnexus_sample_processed.fastq.gz
```

### 1.2. Nim-Based Approach: nimnexus trim (C, C++, JavaScript executable)

Scripts compiled using Nim (developed by Brent Pedersen, https://github.com/brentp/bpbio) and developed by Ziga Avsec are also available and recommended for use in this preprocessing step, especially if R is not currently installed on your server or package incompatibilities arise.

Note that while actual run time is similar between the R-based and Nim-based approaches, this script may be highly parallelized because computational requirements are low per worker.

The Github repository can be found here: https://github.com/Avsecz/nimnexus in addition to installation instructions. As of 11/27/2018, this approach can only work with single-end sequencing results.

## Step 2: Alignment

This step aligns the processed FASTQ file to the genome using bowtie and cutadapt with the appropriate indexing. Manual command line entry or use of a premade script (scripts/align_chipnexus.sh) can be used.

Note: Multiple barcodes are used in the cutadapt command in order to account for to the possibility of adapter degradation.

### Example Use Case:

```{bash, eval=F}
#Manual command line
bowtie -S -p 4 --chunkmbs 512 -k 1 -m 1 -v 2 --best --strata \
        bowtie_indexes/dm6 <(cutadapt -m 22 -O 4 -e 0.2 --quiet \
                       -a AGATCGGAAGAGCACACGTCTGGATCCACGACGCTCTTCC \
                       -a GATCGGAAGAGCACACGTCTGGATCCACGACGCTCTTCC \
                       -a ATCGGAAGAGCACACGTCTGGATCCACGACGCTCTTCC \
                       -a TCGGAAGAGCACACGTCTGGATCCACGACGCTCTTCC \
                       -a CGGAAGAGCACACGTCTGGATCCACGACGCTCTTCC \
                       chipnexus_sample_processed.fastq.gz) | samtools view -F 4 -Sbo chipnexus_sample.bam -
#Script
scripts/align_chipnexus.sh chipnexus_sample_processed.fastq.gz bowtie_indexes/dm6
```

### 2.1. Optional: Sort your BAM file for reduced storage and quicker processing downstream.

```{bash, eval=F}
samtools sort chipnexus_sample.bam
```

## Step 3: Process BAM

### 3.1. R-Based Approach

This step removes reads that align to the same genomic position and have identical random barcodes, resizes the reads to a width of 1 (the first base sequenced), and saves the results as a GRanges object.

Inputs:

-f, --file: Path of BAM file to process [required]
-o, --output: Output FASTQ file (gzip compressed) [required]
-p, --paired: Call option if the alignment was conducted in paired-end mode [default=FALSE]

#### Example Use Case:

``` {bash, eval=F}
Rscript process_bam.r -f chipnexus_sample.bam -n chipnexus_sample
```

### 3.2. Nim-Based Approach: nimnexus dedup (C, C++, JavaScript executable)

Scripts compiled using Nim (developed by Brent Pedersen, https://github.com/brentp/bpbio) and developed by Ziga Avsec are also available for use in this deduplication step, especially if R is not currently installed on your server or package incompatibilities arise.

The Github repository can be found here: https://github.com/Avsecz/nimnexus in addition to installation instructions.

## Step 4: Generate individual strand BigWigs

This step generates a positive and negative strand BigWig file from the supplied GRanges object.

Inputs:

-f, --file: Path of GRanges file to process [required]
-n, --name: Output prefix for BW file [required]
-m, --normalization: Call option if the output BW file should be normalized to RPM [default=FALSE]

### Example Use Case

``` {bash, eval=F}
Rscript split_granges_by_strand.r -r chipnexus_sample.granges.rds
```

These bigwig files can be used for downstream analysis and loaded into an IGV/UCSC/etc genome browser for survery of results.