Diff of /vignettes/bulkRNA.Rmd [000000] .. [d06c2b]

Switch to unified view

a b/vignettes/bulkRNA.Rmd
1
---
2
title: "General bulk RNA-seq analysis"
3
author: "Yulin Lyu <lvyulin@pku.edu.cn>"
4
date: "`r Sys.Date()`"
5
output: 
6
  html_document: 
7
    toc: yes
8
    toc_float: yes
9
    theme: lumen
10
---
11
12
```{r setup, include = FALSE}
13
knitr::opts_chunk$set(
14
  collapse = F,
15
  comment = "## "
16
)
17
```
18
19
```{css, echo = FALSE}
20
pre {
21
  max-height: 300px;
22
  overflow-y: auto;
23
  background-color: AliceBlue;
24
}
25
26
pre[class] {
27
  max-height: 200px;
28
}
29
```
30
31
## Preprocess
32
33
Detailed source code is located here [`bulkRNApre.R`](https://github.com/sajuukLyu/Protocols-4pub/blob/main/a_PreprocessPipeline/bulkRNApre.R) with a [vignette](https://sajuuklyu.github.io/Protocols-4pub/exampleData/RNA/bulkRNApre.html) explaining each step. The flow chart of preprocess pipeline of bulk RNA-seq shows in the bottom.
34
35
![](../mermaidPlot/bulkRNApre.svg){width="50%"}
36
37
### Software preparation
38
39
-   [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc)
40
41
-   [multiQC](https://multiqc.info)
42
43
-   [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)
44
45
-   [STAR](https://github.com/alexdobin/STAR)
46
47
-   [RSeQC](http://rseqc.sourceforge.net)
48
49
-   [featureCounts](http://bioinf.wehi.edu.au/featureCounts/)
50
51
By default, the path to these installed software should be added to `$PATH` in order to use them directly. Or we can install them in a virtual environment like `conda`. Specifically, we can use the full path to the software in commands if necessary.
52
53
### 0. Dictionary design
54
55
After all the preprocess, the dictionary structure will look like this:
56
57
```{bash, eval = FALSE}
58
0_fastq  1_qc  2_trim  3_qc  4_map  5_infer  6_count  code  raw
59
```
60
61
The raw sub-dir contains the raw fastq files need (download from public databases, see [`downloadData.R`](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/downloadData.R) for detail)
62
63
```{bash, eval = FALSE}
64
total 13G
65
drwxr-xr-x  2 lvyl lic 4.0K Mar 31 00:36 ./
66
drwxr-xr-x 12 lvyl lic 4.0K Mar 31 16:51 ../
67
-rw-r--r--  1 lvyl lic 925M Mar 30 23:17 SRR11459602.sra_1.fastq.gz
68
-rw-r--r--  1 lvyl lic 955M Mar 30 23:17 SRR11459602.sra_2.fastq.gz
69
-rw-r--r--  1 lvyl lic 927M Mar 30 23:24 SRR11459603.sra_1.fastq.gz
70
-rw-r--r--  1 lvyl lic 947M Mar 30 23:24 SRR11459603.sra_2.fastq.gz
71
-rw-r--r--  1 lvyl lic 899M Mar 30 23:32 SRR11459604.sra_1.fastq.gz
72
-rw-r--r--  1 lvyl lic 954M Mar 30 23:32 SRR11459604.sra_2.fastq.gz
73
-rw-r--r--  1 lvyl lic 894M Mar 30 23:39 SRR11459605.sra_1.fastq.gz
74
-rw-r--r--  1 lvyl lic 894M Mar 30 23:39 SRR11459605.sra_2.fastq.gz
75
-rw-r--r--  1 lvyl lic 891M Mar 30 23:55 SRR11459606.sra_1.fastq.gz
76
-rw-r--r--  1 lvyl lic 907M Mar 30 23:55 SRR11459606.sra_2.fastq.gz
77
-rw-r--r--  1 lvyl lic 832M Mar 31 00:07 SRR11459607.sra_1.fastq.gz
78
-rw-r--r--  1 lvyl lic 840M Mar 31 00:07 SRR11459607.sra_2.fastq.gz
79
-rw-r--r--  1 lvyl lic 853M Mar 31 00:20 SRR11459608.sra_1.fastq.gz
80
-rw-r--r--  1 lvyl lic 868M Mar 31 00:20 SRR11459608.sra_2.fastq.gz
81
```
82
83
There are some special designing in the pipeline.
84
85
-   We use symbol links to rename raw files. There are two major advantages: 1. clarify the original file of renamed files. 2. save disk space, deleting raw files will release the used space. Hard links could not do these.
86
87
```{bash, eval = FALSE}
88
total 8.0K
89
drwxr-xr-x  2 lvyl lic 4.0K Mar 31 01:04 ./
90
drwxr-xr-x 12 lvyl lic 4.0K Mar 31 16:51 ../
91
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-ES1_1.fq.gz -> ../raw/SRR11459602.sra_1.fastq.gz
92
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-ES1_2.fq.gz -> ../raw/SRR11459602.sra_2.fastq.gz
93
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-ES2_1.fq.gz -> ../raw/SRR11459603.sra_1.fastq.gz
94
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-ES2_2.fq.gz -> ../raw/SRR11459603.sra_2.fastq.gz
95
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-H1_1.fq.gz -> ../raw/SRR11459604.sra_1.fastq.gz
96
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 F-H1_2.fq.gz -> ../raw/SRR11459604.sra_2.fastq.gz
97
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-ES1_1.fq.gz -> ../raw/SRR11459605.sra_1.fastq.gz
98
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-ES1_2.fq.gz -> ../raw/SRR11459605.sra_2.fastq.gz
99
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-ES2_1.fq.gz -> ../raw/SRR11459606.sra_1.fastq.gz
100
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-ES2_2.fq.gz -> ../raw/SRR11459606.sra_2.fastq.gz
101
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-H1_1.fq.gz -> ../raw/SRR11459607.sra_1.fastq.gz
102
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-H1_2.fq.gz -> ../raw/SRR11459607.sra_2.fastq.gz
103
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-iPS_1.fq.gz -> ../raw/SRR11459608.sra_1.fastq.gz
104
lrwxrwxrwx  1 lvyl lic   33 Mar 31 00:59 XF-iPS_2.fq.gz -> ../raw/SRR11459608.sra_2.fastq.gz
105
```
106
107
-   We split the multiple commands into batched scripts in some time consuming steps. For example, if we have 100 mapping tasks, we could split these tasks to 10 batched scripts with 10 commands one by one in each script. In this way, we can run these commands 'parallelly'. If each script uses 20 cores, we will use total 10 \* 20 = 200 cores at the same time in a local server. However, we can submit these tasks to a computing cluster with the 10 scripts running on 10 different nodes. A special function is designed to fulfill this demand.
108
109
```{bash, eval = FALSE}
110
code/
111
├── 1_qc.sh
112
├── 2_trim
113
│   ├── batch0.sh
114
│   ├── batch1.sh
115
│   ├── batch2.sh
116
│   ├── batch3.sh
117
│   ├── batch4.sh
118
│   ├── batch5.sh
119
│   ├── batch6.sh
120
│   └── submit.sh
121
├── 3_qc.sh
122
├── 4_map
123
│   ├── batch0.sh
124
│   ├── batch1.sh
125
│   ├── batch2.sh
126
│   ├── batch3.sh
127
│   ├── batch4.sh
128
│   ├── batch5.sh
129
│   ├── batch6.sh
130
│   └── submit.sh
131
├── 5_infer.sh
132
└── 6_count.sh
133
```
134
135
### 1. Quality Control
136
137
The quality control is performed using FastQC. Then multiple QC reports can be integrated to one report using multiQC.
138
139
![](../exampleData/RNA/qc/rawQC.svg){width="50%"}
140
141
### 2. Trimming adapters
142
143
After removing the Illumina universal adapter `TruSeq3-PE-2.fa` using Trimmomatic, we can get clean reads with no adapter left.
144
145
![](../exampleData/RNA/qc/trim.svg){width="50%"}
146
147
![](../exampleData/RNA/qc/trimQC.png){width="50%"}
148
149
### 3. Mapping reads to reference genome
150
151
The mapping is performed using STAR.
152
153
![](../exampleData/RNA/qc/map.svg){width="50%"}
154
155
### 4. Counting reads of each feature
156
157
Before counting, we need to clearify whether sequencing are strand-specific. This inferance is performed using RSeQC.
158
The result will be noisy if we set wrong parameters to count strand-specific sequencing reads.
159
160
![](../exampleData/RNA/qc/count.svg){width="50%"}
161
162
After counting, the file `counts.txt` will be used to perform downstream analysis.
163
164
## Downstream analysis
165
166
### 1. Loading count data
167
168
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_2_loadCount.R).
169
170
First, some packages are loaded in convenient.
171
172
```{r, message = FALSE, warning = FALSE}
173
library(tidyverse)
174
library(magrittr)
175
```
176
177
All the raw count files are located in specified dir. Here we only have one.
178
179
```{r}
180
dataPath <- "data"
181
(usedData <- list.files(dataPath, "txt", full = T))
182
```
183
184
Next reading all the count files while removing their fist 7 columns of gene annotation data.
185
186
```{r, cache = TRUE}
187
dataMtx <- map(usedData, ~ read_delim(.x, delim = "\t", comment = "#")) %>%
188
  map(~ .x[-1:-7]) %>%
189
  purrr::reduce(cbind) %>%
190
  as.matrix()
191
head(dataMtx)
192
```
193
194
These gene annotation data are also useful, so we can seperate and save them to load when in need.
195
196
```{r, cache = TRUE}
197
anno <- readRDS("../../data/hg19anno.rds")
198
head(anno)
199
```
200
201
We can give `dataMtx` meaningful row names as gene names and column names as sample name.
202
203
```{r}
204
rownames(dataMtx) <- scater::uniquifyFeatureNames(anno$Geneid, anno$gene_name)
205
colnames(dataMtx) %<>% str_replace_all(".*map/|Aligned.*", "")
206
head(dataMtx)
207
```
208
209
Then the count matrix should be saved for later use.
210
211
```{r, eval = FALSE}
212
saveRDS(dataMtx, "dataMtx.rds")
213
```
214
215
### 2. Differentially expressed gene analysis {.tabset}
216
217
#### Using DESeq2
218
219
For samples with duplicates, we can use DESeq2 to perform DEG analysis.
220
221
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkRNAana_2b_DEseq2.R).
222
223
```{r, message = FALSE, warning = FALSE}
224
library(glue)
225
library(DESeq2)
226
```
227
228
We can select part of samples to analyze. Here we keep all.
229
230
```{r}
231
colnames(dataMtx)
232
usedMtx <- dataMtx[, 1:7]
233
```
234
235
A `SummarizedExperiment` object should be created using count data.
236
We can set condition factors to tell DESeq2 how these samples are grouped.
237
Then we can create a `DESeqDataSet` using these data.
238
239
```{r}
240
# dir.create("DESeq2")
241
242
se <- SummarizedExperiment(assays = list(counts = as.matrix(usedMtx)))
243
se$condition <- factor(
244
  rep(c("F", "XF"), c(3, 4)),
245
  levels = c("F", "XF")
246
)
247
248
dds <- DESeqDataSet(se, design = ~ condition)
249
dds
250
```
251
252
Lowly expressed genes among samples will be filtered.
253
254
```{r}
255
rs <- rowMeans(usedMtx)
256
geneKeep <- rs > quantile(rs, 0.4)
257
sum(geneKeep)
258
dds <- dds[geneKeep, ]
259
```
260
261
The variance stabilizing transformated matrix will be used for visualization later.
262
263
```{r, eval = FALSE}
264
vsd <- vst(dds, blind = T)
265
saveRDS(vsd, "DESeq2/vsd.rds")
266
```
267
268
DEG analysis will be performed between groups of samples.
269
270
```{r}
271
dds <- DESeq(dds)
272
273
resultsNames(dds)
274
```
275
276
The results of DEG can be exported as Excel readable .csv files.
277
278
```{r, eval = FALSE}
279
DEGres <- list()
280
DEGres$XF_vs_F <- lfcShrink(dds, coef = "condition_XF_vs_F", type = "apeglm")
281
282
iwalk(DEGres, ~ write.csv(.x, glue("DESeq2/{.y}.DEG.csv")))
283
284
saveRDS(dds, "DESeq2/dds.rds")
285
```
286
287
#### Using DESeq
288
289
For samples with no replicates, we can use DESeq to get a 'p value' for each gene between one sample and another. It is not very useful for well designed experiments.
290
However, we can still use it for visulization.
291
292
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkRNAana_2a_DESeq.R).
293
294
```{r, eval = FALSE}
295
library(DESeq)
296
297
dataMtx <- readRDS("dataMtx.rds")
298
299
colnames(dataMtx)
300
usedMtx <- dataMtx[c(
301
  c("wanted sample names"),
302
  NULL
303
)]
304
```
305
306
After loading data and selecting wanted samples, we can create a `CountDataSet` object with count data matrix and condition factor.
307
Note that we should set the parameters `method = "blind"` and `sharingMode = "fit-only"` here if we have no replicates.
308
309
```{r, eval = FALSE}
310
dir.create("DESeq")
311
312
condition <- factor(
313
  c("a", "b", "c"),
314
  levels = c("a", "b", "c")
315
)
316
317
cds <- newCountDataSet(usedMtx, condition) %>%
318
  estimateSizeFactors() %>%
319
  estimateDispersions(method = "blind", sharingMode = "fit-only")
320
```
321
322
The variance stabilizing transformated matrix will be used for visualization later.
323
324
```{r, eval = FALSE}
325
vsd <- varianceStabilizingTransformation(cds)
326
saveRDS(vsd, "DESeq/vsd.rds")
327
```
328
329
Perform DEG test using `nbinomTest` if there is no other choice.
330
331
```{r, eval = FALSE}
332
# DEGres <- list()
333
# DEGres$a_vs_b <- nbinomTest(cds, "a", "b")
334
# DEGres$a_vs_c <- nbinomTest(cds, "a", "c")
335
336
# iwalk(DEGres, ~ write_csv(.x, glue("DESeq/{.y}.DEG.csv")))
337
```
338
339
### {-}
340
341
### 3. Ploting samples and DEGs
342
343
We can perform PCA and visualize the results.
344
345
![](../exampleData/RNA/graphics/PCA.png){width="50%"}
346
347
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_PCA.R).
348
349
For DEGs, there are some ways to visualize the difference between group of samples.
350
351
-   volcano plot
352
353
![](../exampleData/RNA/graphics/XF_vs_F.volcano.png){width="50%"}
354
355
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkRNA_volcano.R).
356
357
-   MA plot
358
359
![](../exampleData/RNA/graphics/XF_vs_F.MAplot.png){width="50%"}
360
361
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkRNA_MAplot.R).
362
363
Heatmap is a good way to describe the comprehensive expression pattern of used samples.
364
We can plot heatmaps with a large amount of genes with some interested ones labeled on the side.
365
366
![](../exampleData/RNA/graphics/heatmapMany.png){width="50%"}
367
368
Or we can just plot heatmaps with selected interesting genes.
369
370
![](../exampleData/RNA/graphics/heatmapFew.png){width="50%"}
371
372
Detailed codes for this two plots can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkRNA_heatmap.R).
373
374
### 4. Performing GO analysis
375
376
To understand the biology function of differentially expressed genes, GO analysis will be performed using `clusterProfiler`.
377
378
```{r, warning = FALSE, message = FALSE}
379
library(data.table)
380
library(org.Hs.eg.db)
381
library(clusterProfiler)
382
```
383
384
DEGs will be loaded as `data.table` with `NA`s of adjusted p value are set to 1.
385
386
```{r}
387
diffData <- fread("DESeq2/XF_vs_F.DEG.csv")
388
colnames(diffData)[1] <- "gene"
389
390
diffData[is.na(padj), padj := 1][]
391
```
392
393
The genes with fold change > 2 or < -2 with significant difference are defined as up- or down-regulated, respectively.
394
395
```{r}
396
diffData[, type := "ns"]
397
diffData[log2FoldChange > 1 & padj < 0.05, type := "up"][log2FoldChange < -1 & padj < 0.05, type := "down"][]
398
```
399
400
These gene sets are converted to `Entrez ID` and perform GO enrichment test.
401
402
```{r, eval = FALSE}
403
geneList <- list()
404
geneList$XFup <- diffData[type == "up", gene]
405
geneList$Fup <- diffData[type == "down", gene]
406
407
egoList <- map(geneList, ~ {
408
  enrichGO(
409
    gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "SYMBOL")$ENTREZID),
410
    OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T)
411
})
412
413
iwalk(egoList, ~ write.csv(.x@result, str_c(.y, ".GO.csv")))
414
```
415
416
The top enriched terms of each gene set can be visualized using histogram.
417
418
![](../exampleData/RNA/graphics/GO.png){width="50%"}
419
420
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkRNA_GO.R).
421
422
Other content will be added soon.
423
424
```{r}
425
sessionInfo()
426
```