GeneExpressionAnalysis/RNASeq_Analysis_Tutorial.Rmd at master · MerrimanLab/GeneExpressionAnalysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: "DESeq_RNASeq_Analysis_Tutorial"
author: "Nick Burns"
date: "15 September 2016"
output: html_document
---

## Exploratory analysis for RNASeq data

We will work through a very basic exploration of RNASeq data. This is largely based on the larger tutorials available here (http://www.bioconductor.org/help/workflows/rnaseqGene/) and here (http://www.nathalievilla.org/doc/pdf/tutorial-rnaseq.pdf). This work flow will cover:

  1. Normalise raw read counts, so that they can be compared in an unbiased fashion
  2. Pre-filter the data to exlude genes whose expression levels remain relatively constant over all samples. 'Samples' may be consecutive time points, or they may be case-control samples.
  3. Find "similar" groups of genes and visualise the general behaviour of these groups.

### The data

We are going to use data from the 'pasilla' package in R. This dataset contains gene expression counts for a case-control study using _Drosophila melanogaster_ cell cultures. More info is available here (http://www.nathalievilla.org/doc/pdf/tutorial-rnaseq.pdf, Section 3)

```{r}
library(data.table)
if (!require(pasilla)) {
    source("https://bioconductor.org/biocLite.R")
    biocLite("pasilla")
}

data_file <- system.file("extdata/pasilla_gene_counts.tsv", package = "pasilla")
raw_counts <- read.table(data_file, header = TRUE, row.names = 1)
gene_names <- rownames(raw_counts)
raw_counts <- data.table(raw_counts)

# filter out genes with very low expression (sum(expression) <= 10)
idx <- which(rowSums(raw_counts) > 10)
raw_counts <- raw_counts[idx]
gene_names <- gene_names[idx]

raw_counts
```

There are 4 control samples (untreated 1..4) and 3 case samples (treated 1..3). Next, we will normalise the raw counts so that we can compare them across samples.

### Data normalisation

Typically, RNASeq read counts are normalised using a method called RPKM (reads mapped per kilobase). This accounts for detection bias towards longer genes, by accounting for the gene length. The resulting RPKM-normalised reads allow you to compare genes to one another, as all genes are on an even playing field. However, in our example we are really interested in comparing the differences between _samples_, so we don't necessarily need to use RPKM normalisation. Instead, we will use log2-cpm.

Instead of RPKM, we are going to use a log-cpm (log2 counts per million). This results in expression counts which are comaprable between samples, and hte change in gene expression from one sample to the next can be interpretted as a log-fold-change in expression.

Log2-cpm counts are calculated for each sample individually, as follows:

```
for each sample (S_i):
  total_reads := sum(expression)

  for each gene:
    reads_per_million := read_count(g) / (total_reads / 1000000)
    log_cpm := log2(0.5 + reads_per_million)
```

Log2-cpm normalisation is implemented in the limma and edgeR packages (see the ```voom``` transformation). However, below we will calculate it directly.

```{r}
log_cpm <- function (x) {
    log2(0.5 + x / sum(x) * 1000000)
}
norm_counts <- raw_counts[, lapply(.SD, log_cpm)]      # this is some data.table magic
                                                       # it calculates the log2-cpm values across all samples (columns)
norm_counts
```


Below, we show the distribution of the normalised counts across all samples:

```{r}
par(mfrow = c(1, 2))
boxplot(norm_counts, main = "Distribution of log2-cpm normalised counts")
hist(norm_counts[, untreated1], breaks = 50, main = "Untreated1")
par(mfrow = c(1, 1))
```

As we would expect, the distribution of normalised counts looks the same for each sample - this is good, because it means we can make unbiased comparisons between samples. In addition, the distribution is clearly bimodal, which indicates that some genes are 'off' and some are 'on'. A nice sanity check that all appears to be as it should.

### Exploratory analysis

First, let's see if there is any obvious differences between our samples (here treated vs. untreated). We will do this by comparing the distance between each sample, based on the normalised expression data.

```{r}
library(RColorBrewer)
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(16)

by_sample <- t(as.matrix(norm_counts))
heatmap(as.matrix(dist(by_sample)), col = rev(hmcol))
```

There are two obvious groups above: treated vs. untreated. This is what we would expect of course, but it is nice to confirm our hunch. There is something about the gene expression which is different between treated and untreated samples, though we don't know what. Let's try to identify these important genes.

We have no hope of clustering 14000 genes, it is just too many. But many of these genes will be uninteresting. So let's filter out any genes which do not vary all that much. Below, we will filter out any genes with less than a 2-fold change across the samples and cluster the rest:

```{r}
threshold <- log2(2.5)
high_var_genes <- which(apply(by_sample, 2, sd) > threshold)
colnames(by_sample) <- gene_names

heatmap(t(by_sample[, high_var_genes]), col = rev(hmcol))
```

Clearly, there are some obvious gene clusters which have different levels of expression between the treatment groups. This is great! The choice of threshold is somewhat arbitrary, and here results in only 71 genes. But we would intuitively expect that these 71 genes are those that exhibit the greatest difference in expression between treatment and control. Plenty of great info here to follow up on.

### Differential expression

The heatmap above suggests that there are substantial differences in the gene expression between treated and untreated samples. Differential expression (DE) analysis can give us a more robust statistical test of these differences.

**Caveat:** DE analysis requires replicates to work properly. Here, we have 4 untreated samples and 3 treated samples, so this should work well. In a time-series analysis you would have to have multiple samples measured at each time point to achieve the same effect. DE analysis is hihgly unlikely to yield meaningful results without replicates.

Below, we will use the DESeq2 lirbary to perform DE analysis.

```{r}
library(DESeq2)
treatment <- data.frame(sample_name = colnames(raw_counts),
                        treatment_group = c("A", "A", "A", "A", "B", "B", "B"))
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                              colData = treatment,
                              design = ~ treatment_group)
dds <- DESeq(dds)
res <- as.data.table(results(dds))
res[, Gene := gene_names]
res[order(padj, decreasing = FALSE), .(Gene, log2FoldChange, padj)][1:10]
```

There is clear evidence for DE between treated and untreated samples. Above, the top 10 differentially expressed genes have been printed. Although we haven't done so here, if you compare the top 10 - 20 DE genes to the heatmap above you will note that 3 or 4 of the gene clusters are strongly represented in the top 20 DE genes.