|
a |
|
b/vignettes/bulkATAC.Rmd |
|
|
1 |
--- |
|
|
2 |
title: "General bulk ATAC-seq analysis" |
|
|
3 |
author: "Yulin Lyu <lvyulin@pku.edu.cn>" |
|
|
4 |
date: "`r Sys.Date()`" |
|
|
5 |
output: |
|
|
6 |
html_document: |
|
|
7 |
toc: yes |
|
|
8 |
toc_float: yes |
|
|
9 |
theme: lumen |
|
|
10 |
--- |
|
|
11 |
|
|
|
12 |
```{r setup, include = FALSE} |
|
|
13 |
knitr::opts_chunk$set( |
|
|
14 |
collapse = F, |
|
|
15 |
comment = "## " |
|
|
16 |
) |
|
|
17 |
``` |
|
|
18 |
|
|
|
19 |
```{css, echo = FALSE} |
|
|
20 |
pre { |
|
|
21 |
max-height: 300px; |
|
|
22 |
overflow-y: auto; |
|
|
23 |
background-color: AliceBlue; |
|
|
24 |
} |
|
|
25 |
|
|
|
26 |
pre[class] { |
|
|
27 |
max-height: 200px; |
|
|
28 |
} |
|
|
29 |
``` |
|
|
30 |
|
|
|
31 |
## Preprocess |
|
|
32 |
|
|
|
33 |
Detailed source code is located here [`bulkATACpre.R`](https://github.com/sajuukLyu/Protocols-4pub/blob/main/a_PreprocessPipeline/bulkATACpre.R) with a [vignette](https://sajuuklyu.github.io/Protocols-4pub/exampleData/ATAC/bulkATACpre.html) explaining each step. The flow chart of preprocess pipeline of bulk ATAC-seq shows in the bottom. |
|
|
34 |
|
|
|
35 |
{width="50%"} |
|
|
36 |
|
|
|
37 |
### Software preparation |
|
|
38 |
|
|
|
39 |
- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc) |
|
|
40 |
|
|
|
41 |
- [multiQC](https://multiqc.info) |
|
|
42 |
|
|
|
43 |
- [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) |
|
|
44 |
|
|
|
45 |
- [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) |
|
|
46 |
|
|
|
47 |
- [Samtools](http://www.htslib.org) |
|
|
48 |
|
|
|
49 |
- [Picard](http://broadinstitute.github.io/picard) |
|
|
50 |
|
|
|
51 |
- [bedtools](https://bedtools.readthedocs.io/en/latest/index.html) |
|
|
52 |
|
|
|
53 |
- [MACS2](https://github.com/macs3-project/MACS/wiki) |
|
|
54 |
|
|
|
55 |
- [deepTools](https://deeptools.readthedocs.io/en/develop/index.html) |
|
|
56 |
|
|
|
57 |
- [IDR](https://github.com/nboley/idr) |
|
|
58 |
|
|
|
59 |
- [Homer](http://homer.ucsd.edu/homer/index.html) |
|
|
60 |
|
|
|
61 |
By default, the path to these installed software should be added to `$PATH` in order to use them directly. Or we can install them in a virtual environment like `conda`. Specifically, we can use the full path to the software in commands if necessary. |
|
|
62 |
|
|
|
63 |
|
|
|
64 |
|
|
|
65 |
## Downstream analysis |
|
|
66 |
|
|
|
67 |
### 1. Quality control for samples |
|
|
68 |
|
|
|
69 |
At the beginning, we should insure that the quality of our data are good enough to perform later analysis. |
|
|
70 |
There are some metrics to evaluate the quality of bulk ATAC-seq data, for example, the insert fragment length distribution and TSS enrichment. |
|
|
71 |
Since we have already calculated these metrics on the preprocess stage, we could easily visualize the data quality in a better way. |
|
|
72 |
|
|
|
73 |
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_1_QC.R). |
|
|
74 |
|
|
|
75 |
- TSS enrichment |
|
|
76 |
|
|
|
77 |
{width="50%"} |
|
|
78 |
|
|
|
79 |
- Insert fragment length distribution |
|
|
80 |
|
|
|
81 |
{width="50%"} |
|
|
82 |
|
|
|
83 |
### 2. Loading count data |
|
|
84 |
|
|
|
85 |
Then we could load the data and get a peak matrix following the [`DiffBind`](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf) pipeline. |
|
|
86 |
|
|
|
87 |
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_2_loadCount.R). |
|
|
88 |
|
|
|
89 |
```{r, eval = FALSE} |
|
|
90 |
# grammar |
|
|
91 |
library(tidyverse) |
|
|
92 |
library(magrittr) |
|
|
93 |
library(glue) |
|
|
94 |
library(data.table) |
|
|
95 |
|
|
|
96 |
# analysis |
|
|
97 |
library(DiffBind) |
|
|
98 |
``` |
|
|
99 |
|
|
|
100 |
In this process, we need to load the `.bam` files after the filter and consensus peaks of each sample got in preprocess pipeline. These files are very large, so they are not included here. |
|
|
101 |
|
|
|
102 |
```{r, eval = FALSE} |
|
|
103 |
bamFile <- list.files("bam", ".bam$", full.names = T) |
|
|
104 |
usedSample <- str_remove_all(bamFile, ".*bam/|.filter.*") |
|
|
105 |
sampleType <- str_remove(usedSample, "-[123]$") |
|
|
106 |
sampleRep <- str_extract(usedSample, "[123]$") %>% as.numeric() |
|
|
107 |
|
|
|
108 |
sampleSheet <- data.table( |
|
|
109 |
SampleID = usedSample, |
|
|
110 |
Condition = sampleType, |
|
|
111 |
Replicate = sampleRep, |
|
|
112 |
bamReads = bamFile, |
|
|
113 |
Peaks = str_c("peak/", sampleType, ".narrowPeak"), |
|
|
114 |
PeakCaller = "narrow" |
|
|
115 |
) |
|
|
116 |
|
|
|
117 |
dbaAll <- dba(sampleSheet = sampleSheet, minOverlap = 2) |
|
|
118 |
|
|
|
119 |
dbaAll$chrmap %<>% str_c("chr", .) |
|
|
120 |
dbaAll <- dba.blacklist(dbaAll, blacklist = DBA_BLACKLIST_HG19, greylist = F) |
|
|
121 |
dbaAll$chrmap %<>% str_remove("chr") |
|
|
122 |
|
|
|
123 |
dbaAll <- dba.count(dbaAll, minOverlap = 2, fragmentSize = 200, summits = 250) |
|
|
124 |
dbaAll <- dba.normalize(dbaAll, background = T, normalize = DBA_NORM_NATIVE) |
|
|
125 |
dbaAll <- dba.contrast(dbaAll, minMembers = 2, categories = DBA_CONDITION) |
|
|
126 |
dbaAll <- dba.analyze(dbaAll, bBlacklist = F, bGreylist = F) |
|
|
127 |
|
|
|
128 |
saveRDS(dbaAll, "dbaAll.rds") |
|
|
129 |
|
|
|
130 |
diffList <- list() |
|
|
131 |
diffList$F_vs_P <- dba.report(dbaAll, contrast = 1) |
|
|
132 |
diffList$F_vs_XF <- dba.report(dbaAll, contrast = 2) |
|
|
133 |
diffList$XF_vs_P <- dba.report(dbaAll, contrast = 3) |
|
|
134 |
|
|
|
135 |
saveRDS(diffList, "diffList.rds") |
|
|
136 |
|
|
|
137 |
peakMeta <- as.data.table(dbaAll$peaks[[1]][, 1:3]) |
|
|
138 |
|
|
|
139 |
saveRDS(peakMeta, "peakMeta.rds") |
|
|
140 |
|
|
|
141 |
peakMtx <- map(dbaAll$peaks, ~ {.x$Score}) %>% reduce(cbind) %>% set_colnames(dbaAll$samples$SampleID) %>% as.data.table() |
|
|
142 |
|
|
|
143 |
saveRDS(peakMtx, "peakMtx.rds") |
|
|
144 |
``` |
|
|
145 |
|
|
|
146 |
After this process, we have done theree things: |
|
|
147 |
|
|
|
148 |
1. Peaks of each sample are merged together and expanded to the same length (501 bp). The positions of every peak regions are saved to `peakMeta.rds`. |
|
|
149 |
|
|
|
150 |
2. The normalized accessibility scores of each peak in each sample are calculated based on the number of reads overlapping peak regions. The peak matrix is saved to `peakMtx.rds`. |
|
|
151 |
|
|
|
152 |
3. The differentially accessible peaks (DAP) between group of samples are calculated using `DESeq2` model. The results are saved to `diffList.rds`. |
|
|
153 |
|
|
|
154 |
Using these data, we can do some visualization work. |
|
|
155 |
|
|
|
156 |
### 3. Ploting samples and DAPs |
|
|
157 |
|
|
|
158 |
We can perform PCA and visualize the results. |
|
|
159 |
|
|
|
160 |
{width="50%"} |
|
|
161 |
|
|
|
162 |
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_PCA.R). |
|
|
163 |
|
|
|
164 |
The DAPs among groups of samples can be clustered to some peak groups. |
|
|
165 |
|
|
|
166 |
{width="50%"} |
|
|
167 |
|
|
|
168 |
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_heatmapPeak.R). |
|
|
169 |
|
|
|
170 |
Further, we could select the group 1 and group 4 of these peak sets and visualize in a more detail way, to show the 'real' accessibility signal of the peak regions of each sample. |
|
|
171 |
This can be done in a similar way as TSS enrichment calculation process above. |
|
|
172 |
|
|
|
173 |
{width="50%"} |
|
|
174 |
|
|
|
175 |
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_heatmapTrack.R). |
|
|
176 |
|
|
|
177 |
Sometimes it is also necessary to show some track plots, for example: |
|
|
178 |
|
|
|
179 |
{width="80%"} |
|
|
180 |
|
|
|
181 |
{width="80%"} |
|
|
182 |
|
|
|
183 |
Detailed codes for this plot type can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_trackPlot.R). |
|
|
184 |
|
|
|
185 |
### 4. Annotating peaks |
|
|
186 |
|
|
|
187 |
There are some ways to annotate a peak to a nearby gene, here we use a simple way using `ChIPseeker`. |
|
|
188 |
|
|
|
189 |
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_3_annotatePeak.R). |
|
|
190 |
|
|
|
191 |
```{r, eval = FALSE} |
|
|
192 |
library(ChIPseeker) |
|
|
193 |
library(GenomicFeatures) |
|
|
194 |
``` |
|
|
195 |
|
|
|
196 |
The reference genome and peak group data from heatmap visualization process is needed. |
|
|
197 |
|
|
|
198 |
```{r, eval = FALSE} |
|
|
199 |
txdb <- makeTxDbFromGFF("../../data/hg19genes.gtf") |
|
|
200 |
|
|
|
201 |
peakMeta <- readRDS("peakMeta.rds") |
|
|
202 |
peakMeta[, id := paste(Chr, Start, End, sep = "_")][] |
|
|
203 |
|
|
|
204 |
peakGroup <- readRDS("peakGroup.rds") |
|
|
205 |
peakIndex <- map(peakGroup, ~ {match(.x, peakMeta$id)}) |
|
|
206 |
``` |
|
|
207 |
|
|
|
208 |
The annotation of peaks are saved to `peakAnnoList.rds`. |
|
|
209 |
|
|
|
210 |
```{r, eval = FALSE} |
|
|
211 |
homerPeak <- peakMeta |
|
|
212 |
homerPeak[, Chr := str_c("chr", Chr)][] |
|
|
213 |
|
|
|
214 |
fwrite(homerPeak, "allPeak.homer", sep = "\t", col.names = F) |
|
|
215 |
|
|
|
216 |
peakGRlist <- map(peakIndex, ~ { |
|
|
217 |
GRanges( |
|
|
218 |
seqnames = peakMeta[.x, Chr], |
|
|
219 |
ranges = IRanges( |
|
|
220 |
start = peakMeta[.x, Start], |
|
|
221 |
end = peakMeta[.x, End]))}) |
|
|
222 |
|
|
|
223 |
peakAnnoList <- map(peakGRlist, annotatePeak, TxDb = txdb, tssRegion = c(-2000, 2000)) |
|
|
224 |
|
|
|
225 |
map(peakGRlist, ~ {as.data.table(.x)[, c("width", "strand") := NULL][, id := 1:.N][, seqnames := str_c("chr", seqnames)]}) %>% |
|
|
226 |
iwalk(~ fwrite(.x, str_c("peakGroup_", .y, ".bed"), sep = "\t", col.names = F)) |
|
|
227 |
|
|
|
228 |
saveRDS(peakAnnoList, "peakAnnoList.rds") |
|
|
229 |
``` |
|
|
230 |
|
|
|
231 |
We can plot a histogram to show the region of genome of each group of peak. |
|
|
232 |
|
|
|
233 |
{width="50%"} |
|
|
234 |
|
|
|
235 |
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_peakAnnoDisp.R). |
|
|
236 |
|
|
|
237 |
### 5. Performing GO analysis |
|
|
238 |
|
|
|
239 |
We can perform GO analysis to the genes which annotated to TSS region of each peak group. |
|
|
240 |
|
|
|
241 |
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_4_GO.R). |
|
|
242 |
|
|
|
243 |
```{r, eval = FALSE} |
|
|
244 |
library(org.Hs.eg.db) |
|
|
245 |
library(clusterProfiler) |
|
|
246 |
``` |
|
|
247 |
|
|
|
248 |
```{r, eval = FALSE} |
|
|
249 |
anno <- readRDS("../../data/hg19anno.rds") |
|
|
250 |
peakAnnoList <- readRDS("peakAnnoList.rds") |
|
|
251 |
|
|
|
252 |
id2gene <- structure(anno$gene_name, names = anno$Geneid) |
|
|
253 |
|
|
|
254 |
annoGene <- map(peakAnnoList, ~ {.x@anno$geneId}) |
|
|
255 |
|
|
|
256 |
egoList <- map(annoGene, ~ { |
|
|
257 |
enrichGO( |
|
|
258 |
gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "ENSEMBL")$ENTREZID), |
|
|
259 |
OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T) |
|
|
260 |
}) |
|
|
261 |
names(egoList) |
|
|
262 |
|
|
|
263 |
iwalk(egoList, ~ write.csv(.x@result, str_c("peakGroup_", .y, ".GO.csv"))) |
|
|
264 |
``` |
|
|
265 |
|
|
|
266 |
### 6. Motif enrichment and footprint annotation |
|
|
267 |
|
|
|
268 |
How the groups of peaks are regulated by TF? We can perform motif enrichment analysis and annotate each peak with motifs possibly binding to it using `Homer`. |
|
|
269 |
|
|
|
270 |
Complete codes of this part can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/b_DownstreamAnalysisScript/bulkATACana_5_peakTF.R). |
|
|
271 |
|
|
|
272 |
{width="50%"} |
|
|
273 |
|
|
|
274 |
Detailed codes for this plot can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_motifEnrich.R). |
|
|
275 |
|
|
|
276 |
We can see some of the motifs are highly enriched in different peak groups. |
|
|
277 |
|
|
|
278 |
The annotation file containing peak-motif matrix `allPeak.txt` here is not included because it is too big. |
|
|
279 |
Some of its columns are peak to gene annotation message, so we keep it as a independent file `peakAnnoHomer.rds`. However, only three columns of this file is needed for later analysis. |
|
|
280 |
|
|
|
281 |
Here, we connect peaks around a gene region to that gene while leaving the peaks at intergenic regions connect to no gene. |
|
|
282 |
|
|
|
283 |
The motif annotation data `motifAnno` contain the relationship between motif names and relative TFs. |
|
|
284 |
|
|
|
285 |
```{r, eval = FALSE} |
|
|
286 |
peakAnno <- fread("motif/allPeak.txt") |
|
|
287 |
colnames(peakAnno)[1] %<>% str_remove(" .*") |
|
|
288 |
|
|
|
289 |
saveRDS(peakAnno[, PeakID:`GC%`], "peakAnnoHomer.rds") |
|
|
290 |
peakGene <- readRDS("peakAnnoHomer.rds") |
|
|
291 |
peakGene <- peakGene[, c("PeakID", "Annotation", "Gene Name")] |
|
|
292 |
colnames(peakGene)[3] <- "gene" |
|
|
293 |
peakGene[, Annotation := str_remove(Annotation, " \\(.*")][] |
|
|
294 |
peakGene[Annotation == "Intergenic", gene := "no"][] |
|
|
295 |
setkey(peakGene, PeakID) |
|
|
296 |
|
|
|
297 |
motifAnno <- readRDS("motifAnno.rds") |
|
|
298 |
|
|
|
299 |
name2symbol <- structure( |
|
|
300 |
motifAnno$symbol, |
|
|
301 |
names = motifAnno$`Motif Name` |
|
|
302 |
) |
|
|
303 |
|
|
|
304 |
peakGroup <- readRDS("peakGroup.rds") |
|
|
305 |
``` |
|
|
306 |
|
|
|
307 |
Here, only the motif-peak relationships with odd ratio high enough are kept. |
|
|
308 |
The TF-motif relationship, motif-peak relationship, and peak-gene relationship are combined to get the TF-gene regulation network. |
|
|
309 |
The weights of edges of this network are defined as the sum of the scores if a TF have multiple TF-motif-peak-gene path. |
|
|
310 |
|
|
|
311 |
```{r, eval = FALSE} |
|
|
312 |
selCols <- peakAnno[, Chr:`GC%`] %>% colnames() |
|
|
313 |
peakAnno[, (selCols) := NULL] |
|
|
314 |
|
|
|
315 |
motifMeta <- name2symbol[str_remove(colnames(peakAnno)[2:ncol(peakAnno)], " .*")] |
|
|
316 |
|
|
|
317 |
motifMeta <- data.table( |
|
|
318 |
name = names(motifMeta), |
|
|
319 |
symbol = motifMeta |
|
|
320 |
) |
|
|
321 |
|
|
|
322 |
motifMeta[, rep := 1:.N, by = symbol][] |
|
|
323 |
motifMeta[rep != 1, symbol := str_c(symbol, "_", rep)][] |
|
|
324 |
colnames(peakAnno)[2:ncol(peakAnno)] <- motifMeta$symbol |
|
|
325 |
|
|
|
326 |
peakTFmtx <- melt(peakAnno, id.vars = "PeakID", variable.name = "TF", value.name = "val") |
|
|
327 |
peakTFbind <- peakTFmtx[val > -log(1e-4)] |
|
|
328 |
peakTFbind[, .N, by = PeakID]$N %>% table() |
|
|
329 |
peakTFbind[, .N, by = PeakID]$N %>% hist(breaks = 50) |
|
|
330 |
setkey(peakTFbind, PeakID) |
|
|
331 |
|
|
|
332 |
peakGroup[c("1", "4")] %>% imap(~ { |
|
|
333 |
n_TFpeak <- peakTFbind[.x][, gene := peakGene[PeakID, gene]][!is.na(val)] |
|
|
334 |
n_TFgene <- n_TFpeak[!gene %in% c("no", ""), .(val = sum(val)), by = .(TF, gene)] %>% set_colnames(c("Source", "Target", "Weight")) |
|
|
335 |
n_TFgene[, Weight := rescale(Weight, to = c(0.05, 1))] |
|
|
336 |
n_node <- table(c(as.vector(n_TFgene$Source), n_TFgene$Target)) %>% as.data.table() %>% set_colnames(c("Id", "Weight")) |
|
|
337 |
n_node[, `:=` (Label = Id, type = ifelse(Id %in% n_TFgene$Source, "tf", "gene"))] |
|
|
338 |
|
|
|
339 |
fwrite(n_TFgene, str_c("network/", .y, "_TFgene.csv"), sep = ",") |
|
|
340 |
fwrite(n_node, str_c("network/", .y, "_node.csv"), sep = ",") |
|
|
341 |
}) |
|
|
342 |
``` |
|
|
343 |
|
|
|
344 |
The output graph node file `n_node.csv` and graph edge file `n_TFgene.csv` can directly load to graph visualization software [`gephi`](https://gephi.org). |
|
|
345 |
Using `Force Atlas 2` arithmetic, we can put each node at a proper position, then the data can export to a `.gexf` file then load to R. |
|
|
346 |
|
|
|
347 |
After some transformation, the graph can be visualized, showing important TFs regulationg the most number of genes through each peak group. |
|
|
348 |
|
|
|
349 |
{width="80%"} |
|
|
350 |
|
|
|
351 |
{width="80%"} |
|
|
352 |
|
|
|
353 |
Detailed codes for this plot type can be find [here](https://github.com/sajuukLyu/Protocols-4pub/blob/main/c_VisualizationScript/Visulz_bulkATAC_network.R). |
|
|
354 |
|
|
|
355 |
Other content will be added soon. |
|
|
356 |
|
|
|
357 |
```{r} |
|
|
358 |
sessionInfo() |
|
|
359 |
``` |
|
|
360 |
|