a b/vignettes/bulkATAC.Rmd
1
---
2
title: "General bulk ATAC-seq analysis"
3
author: "Yulin Lyu <lvyulin@pku.edu.cn>"
4
date: "`r Sys.Date()`"
5
output: 
6
  html_document: 
7
    toc: yes
8
    toc_float: yes
9
    theme: lumen
10
---
11
12
```{r setup, include = FALSE}
13
knitr::opts_chunk$set(
14
  collapse = F,
15
  comment = "## "
16
)
17
```
18
19
```{css, echo = FALSE}
20
pre {
21
  max-height: 300px;
22
  overflow-y: auto;
23
  background-color: AliceBlue;
24
}
25
26
pre[class] {
27
  max-height: 200px;
28
}
29
```
30
31
## Preprocess
32
33
Detailed source code is located here [`bulkATACpre.R`](https://github.com/sajuukLyu/Protocols-4pub/blob/main/a_PreprocessPipeline/bulkATACpre.R) with a [vignette](https://sajuuklyu.github.io/Protocols-4pub/exampleData/ATAC/bulkATACpre.html) explaining each step. The flow chart of preprocess pipeline of bulk ATAC-seq shows in the bottom.
34
35
![](../mermaidPlot/bulkATACpre.svg){width="50%"}
36
37
### Software preparation
38
39
-   [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc)
40
41
-   [multiQC](https://multiqc.info)
42
43
-   [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)
44
45
-   [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
46
47
-   [Samtools](http://www.htslib.org)
48
49
-   [Picard](http://broadinstitute.github.io/picard)
50
51
-   [bedtools](https://bedtools.readthedocs.io/en/latest/index.html)
52
53
-   [MACS2](https://github.com/macs3-project/MACS/wiki)
54
55
-   [deepTools](https://deeptools.readthedocs.io/en/develop/index.html)
56
57
-   [IDR](https://github.com/nboley/idr)
58
59
-   [Homer](http://homer.ucsd.edu/homer/index.html)
60
61
By default, the path to these installed software should be added to `$PATH` in order to use them directly. Or we can install them in a virtual environment like `conda`. Specifically, we can use the full path to the software in commands if necessary.
62
63
64
65
## Downstream analysis
66
67
### 1. Quality control for samples
68
69
At the beginning, we should insure that the quality of our data are good enough to perform later analysis.
70
There are some metrics to evaluate the quality of bulk ATAC-seq data, for example, the insert fragment length distribution and TSS enrichment.
71
Since we have already calculated these metrics on the preprocess stage, we could easily visualize the data quality in a better way.
72
73
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_1_QC.R).
74
75
-   TSS enrichment
76
77
![](../exampleData/ATAC/graphics/TSSenrich.png){width="50%"}
78
79
-   Insert fragment length distribution
80
81
![](../exampleData/ATAC/graphics/fragLength.png){width="50%"}
82
83
### 2. Loading count data
84
85
Then we could load the data and get a peak matrix following the [`DiffBind`](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf) pipeline.
86
87
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_2_loadCount.R).
88
89
```{r, eval = FALSE}
90
# grammar
91
library(tidyverse)
92
library(magrittr)
93
library(glue)
94
library(data.table)
95
96
# analysis
97
library(DiffBind)
98
```
99
100
In this process, we need to load the `.bam` files after the filter and consensus peaks of each sample got in preprocess pipeline. These files are very large, so they are not included here.
101
102
```{r, eval = FALSE}
103
bamFile <- list.files("bam", ".bam$", full.names = T)
104
usedSample <- str_remove_all(bamFile, ".*bam/|.filter.*")
105
sampleType <- str_remove(usedSample, "-[123]$")
106
sampleRep <- str_extract(usedSample, "[123]$") %>% as.numeric()
107
108
sampleSheet <- data.table(
109
  SampleID = usedSample,
110
  Condition = sampleType,
111
  Replicate = sampleRep,
112
  bamReads = bamFile,
113
  Peaks = str_c("peak/", sampleType, ".narrowPeak"),
114
  PeakCaller = "narrow"
115
)
116
117
dbaAll <- dba(sampleSheet = sampleSheet, minOverlap = 2)
118
119
dbaAll$chrmap %<>% str_c("chr", .)
120
dbaAll <- dba.blacklist(dbaAll, blacklist = DBA_BLACKLIST_HG19, greylist = F)
121
dbaAll$chrmap %<>% str_remove("chr")
122
123
dbaAll <- dba.count(dbaAll, minOverlap = 2, fragmentSize = 200, summits = 250)
124
dbaAll <- dba.normalize(dbaAll, background = T, normalize = DBA_NORM_NATIVE)
125
dbaAll <- dba.contrast(dbaAll, minMembers = 2, categories = DBA_CONDITION)
126
dbaAll <- dba.analyze(dbaAll, bBlacklist = F, bGreylist = F)
127
128
saveRDS(dbaAll, "dbaAll.rds")
129
130
diffList <- list()
131
diffList$F_vs_P <- dba.report(dbaAll, contrast = 1)
132
diffList$F_vs_XF <- dba.report(dbaAll, contrast = 2)
133
diffList$XF_vs_P <- dba.report(dbaAll, contrast = 3)
134
135
saveRDS(diffList, "diffList.rds")
136
137
peakMeta <- as.data.table(dbaAll$peaks[[1]][, 1:3])
138
139
saveRDS(peakMeta, "peakMeta.rds")
140
141
peakMtx <- map(dbaAll$peaks, ~ {.x$Score}) %>% reduce(cbind) %>% set_colnames(dbaAll$samples$SampleID) %>% as.data.table()
142
143
saveRDS(peakMtx, "peakMtx.rds")
144
```
145
146
After this process, we have done theree things:
147
148
1. Peaks of each sample are merged together and expanded to the same length (501 bp). The positions of every peak regions are saved to `peakMeta.rds`.
149
150
2. The normalized accessibility scores of each peak in each sample are calculated based on the number of reads overlapping peak regions. The peak matrix is saved to `peakMtx.rds`.
151
152
3. The differentially accessible peaks (DAP) between group of samples are calculated using `DESeq2` model. The results are saved to `diffList.rds`. 
153
154
Using these data, we can do some visualization work.
155
156
### 3. Ploting samples and DAPs
157
158
We can perform PCA and visualize the results.
159
160
![](../exampleData/ATAC/graphics/PCA.png){width="50%"}
161
162
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_PCA.R).
163
164
The DAPs among groups of samples can be clustered to some peak groups.
165
166
![](../exampleData/ATAC/graphics/heatmapPeak.png){width="50%"}
167
168
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_heatmapPeak.R).
169
170
Further, we could select the group 1 and group 4 of these peak sets and visualize in a more detail way, to show the 'real' accessibility signal of the peak regions of each sample.
171
This can be done in a similar way as TSS enrichment calculation process above.
172
173
![](../exampleData/ATAC/graphics/heatmapTrack.png){width="50%"}
174
175
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_heatmapTrack.R).
176
177
Sometimes it is also necessary to show some track plots, for example:
178
179
![](../exampleData/ATAC/graphics/SP5_track.png){width="80%"}
180
181
![](../exampleData/ATAC/graphics/CAT_track.png){width="80%"}
182
183
Detailed codes for this plot type can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_trackPlot.R).
184
185
### 4. Annotating peaks
186
187
There are some ways to annotate a peak to a nearby gene, here we use a simple way using `ChIPseeker`.
188
189
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_3_annotatePeak.R).
190
191
```{r, eval = FALSE}
192
library(ChIPseeker)
193
library(GenomicFeatures)
194
```
195
196
The reference genome and peak group data from heatmap visualization process is needed.
197
198
```{r, eval = FALSE}
199
txdb <- makeTxDbFromGFF("../../data/hg19genes.gtf")
200
201
peakMeta <- readRDS("peakMeta.rds")
202
peakMeta[, id := paste(Chr, Start, End, sep = "_")][]
203
204
peakGroup <- readRDS("peakGroup.rds")
205
peakIndex <- map(peakGroup, ~ {match(.x, peakMeta$id)})
206
```
207
208
The annotation of peaks are saved to `peakAnnoList.rds`.
209
210
```{r, eval = FALSE}
211
homerPeak <- peakMeta
212
homerPeak[, Chr := str_c("chr", Chr)][]
213
214
fwrite(homerPeak, "allPeak.homer", sep = "\t", col.names = F)
215
216
peakGRlist <- map(peakIndex, ~ {
217
  GRanges(
218
    seqnames = peakMeta[.x, Chr],
219
    ranges = IRanges(
220
      start = peakMeta[.x, Start],
221
      end = peakMeta[.x, End]))})
222
223
peakAnnoList <- map(peakGRlist, annotatePeak, TxDb = txdb, tssRegion = c(-2000, 2000))
224
225
map(peakGRlist, ~ {as.data.table(.x)[, c("width", "strand") := NULL][, id := 1:.N][, seqnames := str_c("chr", seqnames)]}) %>%
226
  iwalk(~ fwrite(.x, str_c("peakGroup_", .y, ".bed"), sep = "\t", col.names = F))
227
228
saveRDS(peakAnnoList, "peakAnnoList.rds")
229
```
230
231
We can plot a histogram to show the region of genome of each group of peak.
232
233
![](../exampleData/ATAC/graphics/peakAnnoDisp.png){width="50%"}
234
235
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_peakAnnoDisp.R).
236
237
### 5. Performing GO analysis
238
239
We can perform GO analysis to the genes which annotated to TSS region of each peak group.
240
241
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_4_GO.R).
242
243
```{r, eval = FALSE}
244
library(org.Hs.eg.db)
245
library(clusterProfiler)
246
```
247
248
```{r, eval = FALSE}
249
anno <- readRDS("../../data/hg19anno.rds")
250
peakAnnoList <- readRDS("peakAnnoList.rds")
251
252
id2gene <- structure(anno$gene_name, names = anno$Geneid)
253
254
annoGene <- map(peakAnnoList, ~ {.x@anno$geneId})
255
256
egoList <- map(annoGene, ~ {
257
  enrichGO(
258
    gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "ENSEMBL")$ENTREZID),
259
    OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T)
260
})
261
names(egoList)
262
263
iwalk(egoList, ~ write.csv(.x@result, str_c("peakGroup_", .y, ".GO.csv")))
264
```
265
266
### 6. Motif enrichment and footprint annotation
267
268
How the groups of peaks are regulated by TF? We can perform motif enrichment analysis and annotate each peak with motifs possibly binding to it using `Homer`.
269
270
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_5_peakTF.R).
271
272
![](../exampleData/ATAC/graphics/motifEnrich.png){width="50%"}
273
274
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_motifEnrich.R).
275
276
We can see some of the motifs are highly enriched in different peak groups.
277
278
The annotation file containing peak-motif matrix `allPeak.txt` here is not included because it is too big.
279
Some of its columns are peak to gene annotation message, so we keep it as a independent file `peakAnnoHomer.rds`. However, only three columns of this file is needed for later analysis.
280
281
Here, we connect peaks around a gene region to that gene while leaving the peaks at intergenic regions connect to no gene.
282
283
The motif annotation data `motifAnno` contain the relationship between motif names and relative TFs.
284
285
```{r, eval = FALSE}
286
peakAnno <- fread("motif/allPeak.txt")
287
colnames(peakAnno)[1] %<>% str_remove(" .*")
288
289
saveRDS(peakAnno[, PeakID:`GC%`], "peakAnnoHomer.rds")
290
peakGene <- readRDS("peakAnnoHomer.rds")
291
peakGene <- peakGene[, c("PeakID", "Annotation", "Gene Name")]
292
colnames(peakGene)[3] <- "gene"
293
peakGene[, Annotation := str_remove(Annotation, " \\(.*")][]
294
peakGene[Annotation == "Intergenic", gene := "no"][]
295
setkey(peakGene, PeakID)
296
297
motifAnno <- readRDS("motifAnno.rds")
298
299
name2symbol <- structure(
300
  motifAnno$symbol,
301
  names = motifAnno$`Motif Name`
302
)
303
304
peakGroup <- readRDS("peakGroup.rds")
305
```
306
307
Here, only the motif-peak relationships with odd ratio high enough are kept.
308
The TF-motif relationship, motif-peak relationship, and peak-gene relationship are combined to get the TF-gene regulation network.
309
The weights of edges of this network are defined as the sum of the scores if a TF have multiple TF-motif-peak-gene path.
310
311
```{r, eval = FALSE}
312
selCols <- peakAnno[, Chr:`GC%`] %>% colnames()
313
peakAnno[, (selCols) := NULL]
314
315
motifMeta <- name2symbol[str_remove(colnames(peakAnno)[2:ncol(peakAnno)], " .*")]
316
317
motifMeta <- data.table(
318
  name = names(motifMeta),
319
  symbol = motifMeta
320
)
321
322
motifMeta[, rep := 1:.N, by = symbol][]
323
motifMeta[rep != 1, symbol := str_c(symbol, "_", rep)][]
324
colnames(peakAnno)[2:ncol(peakAnno)] <- motifMeta$symbol
325
326
peakTFmtx <- melt(peakAnno, id.vars = "PeakID", variable.name = "TF", value.name = "val")
327
peakTFbind <- peakTFmtx[val > -log(1e-4)]
328
peakTFbind[, .N, by = PeakID]$N %>% table()
329
peakTFbind[, .N, by = PeakID]$N %>% hist(breaks = 50)
330
setkey(peakTFbind, PeakID)
331
332
peakGroup[c("1", "4")] %>% imap(~ {
333
  n_TFpeak <- peakTFbind[.x][, gene := peakGene[PeakID, gene]][!is.na(val)]
334
  n_TFgene <- n_TFpeak[!gene %in% c("no", ""), .(val = sum(val)), by = .(TF, gene)] %>% set_colnames(c("Source", "Target", "Weight"))
335
  n_TFgene[, Weight := rescale(Weight, to = c(0.05, 1))]
336
  n_node <- table(c(as.vector(n_TFgene$Source), n_TFgene$Target)) %>% as.data.table() %>% set_colnames(c("Id", "Weight"))
337
  n_node[, `:=` (Label = Id, type = ifelse(Id %in% n_TFgene$Source, "tf", "gene"))]
338
  
339
  fwrite(n_TFgene, str_c("network/", .y, "_TFgene.csv"), sep = ",")
340
  fwrite(n_node, str_c("network/", .y, "_node.csv"), sep = ",")
341
})
342
```
343
344
The output graph node file `n_node.csv` and graph edge file `n_TFgene.csv` can directly load to graph visualization software [`gephi`](https://gephi.org).
345
Using `Force Atlas 2` arithmetic, we can put each node at a proper position, then the data can export to a `.gexf` file then load to R.
346
347
After some transformation, the graph can be visualized, showing important TFs regulationg the most number of genes through each peak group.
348
349
![](../exampleData/ATAC/graphics/g1_network.png){width="80%"}
350
351
![](../exampleData/ATAC/graphics/g4_network.png){width="80%"}
352
353
Detailed codes for this plot type can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_network.R).
354
355
Other content will be added soon.
356
357
```{r}
358
sessionInfo()
359
```
360