Preprocess

Detailed source code is located here bulkATACpre.R with a vignette explaining each step. The flow chart of preprocess pipeline of bulk ATAC-seq shows in the bottom.

Software preparation

By default, the path to these installed software should be added to $PATH in order to use them directly. Or we can install them in a virtual environment like conda. Specifically, we can use the full path to the software in commands if necessary.

Downstream analysis

1. Quality control for samples

At the beginning, we should insure that the quality of our data are good enough to perform later analysis. There are some metrics to evaluate the quality of bulk ATAC-seq data, for example, the insert fragment length distribution and TSS enrichment. Since we have already calculated these metrics on the preprocess stage, we could easily visualize the data quality in a better way.

Complete codes of this part can be find here.

  • TSS enrichment

  • Insert fragment length distribution

2. Loading count data

Then we could load the data and get a peak matrix following the DiffBind pipeline.

Complete codes of this part can be find here.

# grammar
library(tidyverse)
library(magrittr)
library(glue)
library(data.table)

# analysis
library(DiffBind)

In this process, we need to load the .bam files after the filter and consensus peaks of each sample got in preprocess pipeline. These files are very large, so they are not included here.

bamFile <- list.files("bam", ".bam$", full.names = T)
usedSample <- str_remove_all(bamFile, ".*bam/|.filter.*")
sampleType <- str_remove(usedSample, "-[123]$")
sampleRep <- str_extract(usedSample, "[123]$") %>% as.numeric()

sampleSheet <- data.table(
  SampleID = usedSample,
  Condition = sampleType,
  Replicate = sampleRep,
  bamReads = bamFile,
  Peaks = str_c("peak/", sampleType, ".narrowPeak"),
  PeakCaller = "narrow"
)

dbaAll <- dba(sampleSheet = sampleSheet, minOverlap = 2)

dbaAll$chrmap %<>% str_c("chr", .)
dbaAll <- dba.blacklist(dbaAll, blacklist = DBA_BLACKLIST_HG19, greylist = F)
dbaAll$chrmap %<>% str_remove("chr")

dbaAll <- dba.count(dbaAll, minOverlap = 2, fragmentSize = 200, summits = 250)
dbaAll <- dba.normalize(dbaAll, background = T, normalize = DBA_NORM_NATIVE)
dbaAll <- dba.contrast(dbaAll, minMembers = 2, categories = DBA_CONDITION)
dbaAll <- dba.analyze(dbaAll, bBlacklist = F, bGreylist = F)

saveRDS(dbaAll, "dbaAll.rds")

diffList <- list()
diffList$F_vs_P <- dba.report(dbaAll, contrast = 1)
diffList$F_vs_XF <- dba.report(dbaAll, contrast = 2)
diffList$XF_vs_P <- dba.report(dbaAll, contrast = 3)

saveRDS(diffList, "diffList.rds")

peakMeta <- as.data.table(dbaAll$peaks[[1]][, 1:3])

saveRDS(peakMeta, "peakMeta.rds")

peakMtx <- map(dbaAll$peaks, ~ {.x$Score}) %>% reduce(cbind) %>% set_colnames(dbaAll$samples$SampleID) %>% as.data.table()

saveRDS(peakMtx, "peakMtx.rds")

After this process, we have done theree things:

  1. Peaks of each sample are merged together and expanded to the same length (501 bp). The positions of every peak regions are saved to peakMeta.rds.

  2. The normalized accessibility scores of each peak in each sample are calculated based on the number of reads overlapping peak regions. The peak matrix is saved to peakMtx.rds.

  3. The differentially accessible peaks (DAP) between group of samples are calculated using DESeq2 model. The results are saved to diffList.rds.

Using these data, we can do some visualization work.

3. Ploting samples and DAPs

We can perform PCA and visualize the results.

Detailed codes for this plot can be find here.

The DAPs among groups of samples can be clustered to some peak groups.

Detailed codes for this plot can be find here.

Further, we could select the group 1 and group 4 of these peak sets and visualize in a more detail way, to show the ‘real’ accessibility signal of the peak regions of each sample. This can be done in a similar way as TSS enrichment calculation process above.

Detailed codes for this plot can be find here.

Sometimes it is also necessary to show some track plots, for example:

Detailed codes for this plot type can be find here.

4. Annotating peaks

There are some ways to annotate a peak to a nearby gene, here we use a simple way using ChIPseeker.

Complete codes of this part can be find here.

library(ChIPseeker)
library(GenomicFeatures)

The reference genome and peak group data from heatmap visualization process is needed.

txdb <- makeTxDbFromGFF("../../data/hg19genes.gtf")

peakMeta <- readRDS("peakMeta.rds")
peakMeta[, id := paste(Chr, Start, End, sep = "_")][]

peakGroup <- readRDS("peakGroup.rds")
peakIndex <- map(peakGroup, ~ {match(.x, peakMeta$id)})

The annotation of peaks are saved to peakAnnoList.rds.

homerPeak <- peakMeta
homerPeak[, Chr := str_c("chr", Chr)][]

fwrite(homerPeak, "allPeak.homer", sep = "\t", col.names = F)

peakGRlist <- map(peakIndex, ~ {
  GRanges(
    seqnames = peakMeta[.x, Chr],
    ranges = IRanges(
      start = peakMeta[.x, Start],
      end = peakMeta[.x, End]))})

peakAnnoList <- map(peakGRlist, annotatePeak, TxDb = txdb, tssRegion = c(-2000, 2000))

map(peakGRlist, ~ {as.data.table(.x)[, c("width", "strand") := NULL][, id := 1:.N][, seqnames := str_c("chr", seqnames)]}) %>%
  iwalk(~ fwrite(.x, str_c("peakGroup_", .y, ".bed"), sep = "\t", col.names = F))

saveRDS(peakAnnoList, "peakAnnoList.rds")

We can plot a histogram to show the region of genome of each group of peak.

Detailed codes for this plot can be find here.

5. Performing GO analysis

We can perform GO analysis to the genes which annotated to TSS region of each peak group.

Complete codes of this part can be find here.

library(org.Hs.eg.db)
library(clusterProfiler)
anno <- readRDS("../../data/hg19anno.rds")
peakAnnoList <- readRDS("peakAnnoList.rds")

id2gene <- structure(anno$gene_name, names = anno$Geneid)

annoGene <- map(peakAnnoList, ~ {.x@anno$geneId})

egoList <- map(annoGene, ~ {
  enrichGO(
    gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "ENSEMBL")$ENTREZID),
    OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T)
})
names(egoList)

iwalk(egoList, ~ write.csv(.x@result, str_c("peakGroup_", .y, ".GO.csv")))

6. Motif enrichment and footprint annotation

How the groups of peaks are regulated by TF? We can perform motif enrichment analysis and annotate each peak with motifs possibly binding to it using Homer.

Complete codes of this part can be find here.

Detailed codes for this plot can be find here.

We can see some of the motifs are highly enriched in different peak groups.

The annotation file containing peak-motif matrix allPeak.txt here is not included because it is too big. Some of its columns are peak to gene annotation message, so we keep it as a independent file peakAnnoHomer.rds. However, only three columns of this file is needed for later analysis.

Here, we connect peaks around a gene region to that gene while leaving the peaks at intergenic regions connect to no gene.

The motif annotation data motifAnno contain the relationship between motif names and relative TFs.

peakAnno <- fread("motif/allPeak.txt")
colnames(peakAnno)[1] %<>% str_remove(" .*")

saveRDS(peakAnno[, PeakID:`GC%`], "peakAnnoHomer.rds")
peakGene <- readRDS("peakAnnoHomer.rds")
peakGene <- peakGene[, c("PeakID", "Annotation", "Gene Name")]
colnames(peakGene)[3] <- "gene"
peakGene[, Annotation := str_remove(Annotation, " \\(.*")][]
peakGene[Annotation == "Intergenic", gene := "no"][]
setkey(peakGene, PeakID)

motifAnno <- readRDS("motifAnno.rds")

name2symbol <- structure(
  motifAnno$symbol,
  names = motifAnno$`Motif Name`
)

peakGroup <- readRDS("peakGroup.rds")

Here, only the motif-peak relationships with odd ratio high enough are kept. The TF-motif relationship, motif-peak relationship, and peak-gene relationship are combined to get the TF-gene regulation network. The weights of edges of this network are defined as the sum of the scores if a TF have multiple TF-motif-peak-gene path.

selCols <- peakAnno[, Chr:`GC%`] %>% colnames()
peakAnno[, (selCols) := NULL]

motifMeta <- name2symbol[str_remove(colnames(peakAnno)[2:ncol(peakAnno)], " .*")]

motifMeta <- data.table(
  name = names(motifMeta),
  symbol = motifMeta
)

motifMeta[, rep := 1:.N, by = symbol][]
motifMeta[rep != 1, symbol := str_c(symbol, "_", rep)][]
colnames(peakAnno)[2:ncol(peakAnno)] <- motifMeta$symbol

peakTFmtx <- melt(peakAnno, id.vars = "PeakID", variable.name = "TF", value.name = "val")
peakTFbind <- peakTFmtx[val > -log(1e-4)]
peakTFbind[, .N, by = PeakID]$N %>% table()
peakTFbind[, .N, by = PeakID]$N %>% hist(breaks = 50)
setkey(peakTFbind, PeakID)

peakGroup[c("1", "4")] %>% imap(~ {
  n_TFpeak <- peakTFbind[.x][, gene := peakGene[PeakID, gene]][!is.na(val)]
  n_TFgene <- n_TFpeak[!gene %in% c("no", ""), .(val = sum(val)), by = .(TF, gene)] %>% set_colnames(c("Source", "Target", "Weight"))
  n_TFgene[, Weight := rescale(Weight, to = c(0.05, 1))]
  n_node <- table(c(as.vector(n_TFgene$Source), n_TFgene$Target)) %>% as.data.table() %>% set_colnames(c("Id", "Weight"))
  n_node[, `:=` (Label = Id, type = ifelse(Id %in% n_TFgene$Source, "tf", "gene"))]
  
  fwrite(n_TFgene, str_c("network/", .y, "_TFgene.csv"), sep = ",")
  fwrite(n_node, str_c("network/", .y, "_node.csv"), sep = ",")
})

The output graph node file n_node.csv and graph edge file n_TFgene.csv can directly load to graph visualization software gephi. Using Force Atlas 2 arithmetic, we can put each node at a proper position, then the data can export to a .gexf file then load to R.

After some transformation, the graph can be visualized, showing important TFs regulationg the most number of genes through each peak group.

Detailed codes for this plot type can be find here.

Other content will be added soon.

sessionInfo()
##  R version 3.6.3 (2020-02-29)
##  Platform: x86_64-pc-linux-gnu (64-bit)
##  Running under: Ubuntu 20.04.2 LTS
##  
##  Matrix products: default
##  BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
##  LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##  
##  locale:
##   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##   [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##   [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
##  [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
##  
##  attached base packages:
##  [1] stats     graphics  grDevices utils     datasets  methods   base     
##  
##  loaded via a namespace (and not attached):
##   [1] digest_0.6.27     R6_2.4.0          jsonlite_1.6     
##   [4] magrittr_1.5      evaluate_0.14     rlang_0.4.7      
##   [7] stringi_1.5.3     jquerylib_0.1.3   bslib_0.2.4      
##  [10] rmarkdown_2.7     tools_3.6.3       stringr_1.4.0    
##  [13] xfun_0.22         yaml_2.2.0        compiler_3.6.3   
##  [16] htmltools_0.5.1.1 knitr_1.25        sass_0.3.1