Detailed source code is located here bulkATACpre.R
with a vignette explaining each step. The flow chart of preprocess pipeline of bulk ATAC-seq shows in the bottom.
By default, the path to these installed software should be added to $PATH
in order to use them directly. Or we can install them in a virtual environment like conda
. Specifically, we can use the full path to the software in commands if necessary.
At the beginning, we should insure that the quality of our data are good enough to perform later analysis. There are some metrics to evaluate the quality of bulk ATAC-seq data, for example, the insert fragment length distribution and TSS enrichment. Since we have already calculated these metrics on the preprocess stage, we could easily visualize the data quality in a better way.
Complete codes of this part can be find here.
Then we could load the data and get a peak matrix following the DiffBind
pipeline.
Complete codes of this part can be find here.
# grammar
library(tidyverse)
library(magrittr)
library(glue)
library(data.table)
# analysis
library(DiffBind)
In this process, we need to load the .bam
files after the filter and consensus peaks of each sample got in preprocess pipeline. These files are very large, so they are not included here.
bamFile <- list.files("bam", ".bam$", full.names = T)
usedSample <- str_remove_all(bamFile, ".*bam/|.filter.*")
sampleType <- str_remove(usedSample, "-[123]$")
sampleRep <- str_extract(usedSample, "[123]$") %>% as.numeric()
sampleSheet <- data.table(
SampleID = usedSample,
Condition = sampleType,
Replicate = sampleRep,
bamReads = bamFile,
Peaks = str_c("peak/", sampleType, ".narrowPeak"),
PeakCaller = "narrow"
)
dbaAll <- dba(sampleSheet = sampleSheet, minOverlap = 2)
dbaAll$chrmap %<>% str_c("chr", .)
dbaAll <- dba.blacklist(dbaAll, blacklist = DBA_BLACKLIST_HG19, greylist = F)
dbaAll$chrmap %<>% str_remove("chr")
dbaAll <- dba.count(dbaAll, minOverlap = 2, fragmentSize = 200, summits = 250)
dbaAll <- dba.normalize(dbaAll, background = T, normalize = DBA_NORM_NATIVE)
dbaAll <- dba.contrast(dbaAll, minMembers = 2, categories = DBA_CONDITION)
dbaAll <- dba.analyze(dbaAll, bBlacklist = F, bGreylist = F)
saveRDS(dbaAll, "dbaAll.rds")
diffList <- list()
diffList$F_vs_P <- dba.report(dbaAll, contrast = 1)
diffList$F_vs_XF <- dba.report(dbaAll, contrast = 2)
diffList$XF_vs_P <- dba.report(dbaAll, contrast = 3)
saveRDS(diffList, "diffList.rds")
peakMeta <- as.data.table(dbaAll$peaks[[1]][, 1:3])
saveRDS(peakMeta, "peakMeta.rds")
peakMtx <- map(dbaAll$peaks, ~ {.x$Score}) %>% reduce(cbind) %>% set_colnames(dbaAll$samples$SampleID) %>% as.data.table()
saveRDS(peakMtx, "peakMtx.rds")
After this process, we have done theree things:
Peaks of each sample are merged together and expanded to the same length (501 bp). The positions of every peak regions are saved to peakMeta.rds
.
The normalized accessibility scores of each peak in each sample are calculated based on the number of reads overlapping peak regions. The peak matrix is saved to peakMtx.rds
.
The differentially accessible peaks (DAP) between group of samples are calculated using DESeq2
model. The results are saved to diffList.rds
.
Using these data, we can do some visualization work.
We can perform PCA and visualize the results.
Detailed codes for this plot can be find here.
The DAPs among groups of samples can be clustered to some peak groups.
Detailed codes for this plot can be find here.
Further, we could select the group 1 and group 4 of these peak sets and visualize in a more detail way, to show the ‘real’ accessibility signal of the peak regions of each sample. This can be done in a similar way as TSS enrichment calculation process above.
Detailed codes for this plot can be find here.
Sometimes it is also necessary to show some track plots, for example:
Detailed codes for this plot type can be find here.
There are some ways to annotate a peak to a nearby gene, here we use a simple way using ChIPseeker
.
Complete codes of this part can be find here.
library(ChIPseeker)
library(GenomicFeatures)
The reference genome and peak group data from heatmap visualization process is needed.
txdb <- makeTxDbFromGFF("../../data/hg19genes.gtf")
peakMeta <- readRDS("peakMeta.rds")
peakMeta[, id := paste(Chr, Start, End, sep = "_")][]
peakGroup <- readRDS("peakGroup.rds")
peakIndex <- map(peakGroup, ~ {match(.x, peakMeta$id)})
The annotation of peaks are saved to peakAnnoList.rds
.
homerPeak <- peakMeta
homerPeak[, Chr := str_c("chr", Chr)][]
fwrite(homerPeak, "allPeak.homer", sep = "\t", col.names = F)
peakGRlist <- map(peakIndex, ~ {
GRanges(
seqnames = peakMeta[.x, Chr],
ranges = IRanges(
start = peakMeta[.x, Start],
end = peakMeta[.x, End]))})
peakAnnoList <- map(peakGRlist, annotatePeak, TxDb = txdb, tssRegion = c(-2000, 2000))
map(peakGRlist, ~ {as.data.table(.x)[, c("width", "strand") := NULL][, id := 1:.N][, seqnames := str_c("chr", seqnames)]}) %>%
iwalk(~ fwrite(.x, str_c("peakGroup_", .y, ".bed"), sep = "\t", col.names = F))
saveRDS(peakAnnoList, "peakAnnoList.rds")
We can plot a histogram to show the region of genome of each group of peak.
Detailed codes for this plot can be find here.
We can perform GO analysis to the genes which annotated to TSS region of each peak group.
Complete codes of this part can be find here.
library(org.Hs.eg.db)
library(clusterProfiler)
anno <- readRDS("../../data/hg19anno.rds")
peakAnnoList <- readRDS("peakAnnoList.rds")
id2gene <- structure(anno$gene_name, names = anno$Geneid)
annoGene <- map(peakAnnoList, ~ {.x@anno$geneId})
egoList <- map(annoGene, ~ {
enrichGO(
gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "ENSEMBL")$ENTREZID),
OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T)
})
names(egoList)
iwalk(egoList, ~ write.csv(.x@result, str_c("peakGroup_", .y, ".GO.csv")))
How the groups of peaks are regulated by TF? We can perform motif enrichment analysis and annotate each peak with motifs possibly binding to it using Homer
.
Complete codes of this part can be find here.
Detailed codes for this plot can be find here.
We can see some of the motifs are highly enriched in different peak groups.
The annotation file containing peak-motif matrix allPeak.txt
here is not included because it is too big. Some of its columns are peak to gene annotation message, so we keep it as a independent file peakAnnoHomer.rds
. However, only three columns of this file is needed for later analysis.
Here, we connect peaks around a gene region to that gene while leaving the peaks at intergenic regions connect to no gene.
The motif annotation data motifAnno
contain the relationship between motif names and relative TFs.
peakAnno <- fread("motif/allPeak.txt")
colnames(peakAnno)[1] %<>% str_remove(" .*")
saveRDS(peakAnno[, PeakID:`GC%`], "peakAnnoHomer.rds")
peakGene <- readRDS("peakAnnoHomer.rds")
peakGene <- peakGene[, c("PeakID", "Annotation", "Gene Name")]
colnames(peakGene)[3] <- "gene"
peakGene[, Annotation := str_remove(Annotation, " \\(.*")][]
peakGene[Annotation == "Intergenic", gene := "no"][]
setkey(peakGene, PeakID)
motifAnno <- readRDS("motifAnno.rds")
name2symbol <- structure(
motifAnno$symbol,
names = motifAnno$`Motif Name`
)
peakGroup <- readRDS("peakGroup.rds")
Here, only the motif-peak relationships with odd ratio high enough are kept. The TF-motif relationship, motif-peak relationship, and peak-gene relationship are combined to get the TF-gene regulation network. The weights of edges of this network are defined as the sum of the scores if a TF have multiple TF-motif-peak-gene path.
selCols <- peakAnno[, Chr:`GC%`] %>% colnames()
peakAnno[, (selCols) := NULL]
motifMeta <- name2symbol[str_remove(colnames(peakAnno)[2:ncol(peakAnno)], " .*")]
motifMeta <- data.table(
name = names(motifMeta),
symbol = motifMeta
)
motifMeta[, rep := 1:.N, by = symbol][]
motifMeta[rep != 1, symbol := str_c(symbol, "_", rep)][]
colnames(peakAnno)[2:ncol(peakAnno)] <- motifMeta$symbol
peakTFmtx <- melt(peakAnno, id.vars = "PeakID", variable.name = "TF", value.name = "val")
peakTFbind <- peakTFmtx[val > -log(1e-4)]
peakTFbind[, .N, by = PeakID]$N %>% table()
peakTFbind[, .N, by = PeakID]$N %>% hist(breaks = 50)
setkey(peakTFbind, PeakID)
peakGroup[c("1", "4")] %>% imap(~ {
n_TFpeak <- peakTFbind[.x][, gene := peakGene[PeakID, gene]][!is.na(val)]
n_TFgene <- n_TFpeak[!gene %in% c("no", ""), .(val = sum(val)), by = .(TF, gene)] %>% set_colnames(c("Source", "Target", "Weight"))
n_TFgene[, Weight := rescale(Weight, to = c(0.05, 1))]
n_node <- table(c(as.vector(n_TFgene$Source), n_TFgene$Target)) %>% as.data.table() %>% set_colnames(c("Id", "Weight"))
n_node[, `:=` (Label = Id, type = ifelse(Id %in% n_TFgene$Source, "tf", "gene"))]
fwrite(n_TFgene, str_c("network/", .y, "_TFgene.csv"), sep = ",")
fwrite(n_node, str_c("network/", .y, "_node.csv"), sep = ",")
})
The output graph node file n_node.csv
and graph edge file n_TFgene.csv
can directly load to graph visualization software gephi
. Using Force Atlas 2
arithmetic, we can put each node at a proper position, then the data can export to a .gexf
file then load to R.
After some transformation, the graph can be visualized, showing important TFs regulationg the most number of genes through each peak group.
Detailed codes for this plot type can be find here.
Other content will be added soon.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.27 R6_2.4.0 jsonlite_1.6
## [4] magrittr_1.5 evaluate_0.14 rlang_0.4.7
## [7] stringi_1.5.3 jquerylib_0.1.3 bslib_0.2.4
## [10] rmarkdown_2.7 tools_3.6.3 stringr_1.4.0
## [13] xfun_0.22 yaml_2.2.0 compiler_3.6.3
## [16] htmltools_0.5.1.1 knitr_1.25 sass_0.3.1