Detailed source code is located here bulkRNApre.R
with a vignette explaining each step. The flow chart of preprocess pipeline of bulk RNA-seq shows in the bottom.
By default, the path to these installed software should be added to $PATH
in order to use them directly. Or we can install them in a virtual environment like conda
. Specifically, we can use the full path to the software in commands if necessary.
After all the preprocess, the dictionary structure will look like this:
0_fastq 1_qc 2_trim 3_qc 4_map 5_infer 6_count code raw
The raw sub-dir contains the raw fastq files need (download from public databases, see downloadData.R
for detail)
total 13G
drwxr-xr-x 2 lvyl lic 4.0K Mar 31 00:36 ./
drwxr-xr-x 12 lvyl lic 4.0K Mar 31 16:51 ../
-rw-r--r-- 1 lvyl lic 925M Mar 30 23:17 SRR11459602.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 955M Mar 30 23:17 SRR11459602.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 927M Mar 30 23:24 SRR11459603.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 947M Mar 30 23:24 SRR11459603.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 899M Mar 30 23:32 SRR11459604.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 954M Mar 30 23:32 SRR11459604.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 894M Mar 30 23:39 SRR11459605.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 894M Mar 30 23:39 SRR11459605.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 891M Mar 30 23:55 SRR11459606.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 907M Mar 30 23:55 SRR11459606.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 832M Mar 31 00:07 SRR11459607.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 840M Mar 31 00:07 SRR11459607.sra_2.fastq.gz
-rw-r--r-- 1 lvyl lic 853M Mar 31 00:20 SRR11459608.sra_1.fastq.gz
-rw-r--r-- 1 lvyl lic 868M Mar 31 00:20 SRR11459608.sra_2.fastq.gz
There are some special designing in the pipeline.
total 8.0K
drwxr-xr-x 2 lvyl lic 4.0K Mar 31 01:04 ./
drwxr-xr-x 12 lvyl lic 4.0K Mar 31 16:51 ../
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-ES1_1.fq.gz -> ../raw/SRR11459602.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-ES1_2.fq.gz -> ../raw/SRR11459602.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-ES2_1.fq.gz -> ../raw/SRR11459603.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-ES2_2.fq.gz -> ../raw/SRR11459603.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-H1_1.fq.gz -> ../raw/SRR11459604.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 F-H1_2.fq.gz -> ../raw/SRR11459604.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-ES1_1.fq.gz -> ../raw/SRR11459605.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-ES1_2.fq.gz -> ../raw/SRR11459605.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-ES2_1.fq.gz -> ../raw/SRR11459606.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-ES2_2.fq.gz -> ../raw/SRR11459606.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-H1_1.fq.gz -> ../raw/SRR11459607.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-H1_2.fq.gz -> ../raw/SRR11459607.sra_2.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-iPS_1.fq.gz -> ../raw/SRR11459608.sra_1.fastq.gz
lrwxrwxrwx 1 lvyl lic 33 Mar 31 00:59 XF-iPS_2.fq.gz -> ../raw/SRR11459608.sra_2.fastq.gz
code/
├── 1_qc.sh
├── 2_trim
│ ├── batch0.sh
│ ├── batch1.sh
│ ├── batch2.sh
│ ├── batch3.sh
│ ├── batch4.sh
│ ├── batch5.sh
│ ├── batch6.sh
│ └── submit.sh
├── 3_qc.sh
├── 4_map
│ ├── batch0.sh
│ ├── batch1.sh
│ ├── batch2.sh
│ ├── batch3.sh
│ ├── batch4.sh
│ ├── batch5.sh
│ ├── batch6.sh
│ └── submit.sh
├── 5_infer.sh
└── 6_count.sh
The quality control is performed using FastQC. Then multiple QC reports can be integrated to one report using multiQC.
After removing the Illumina universal adapter TruSeq3-PE-2.fa
using Trimmomatic, we can get clean reads with no adapter left.
The mapping is performed using STAR.
Before counting, we need to clearify whether sequencing are strand-specific. This inferance is performed using RSeQC. The result will be noisy if we set wrong parameters to count strand-specific sequencing reads.
After counting, the file counts.txt
will be used to perform downstream analysis.
Complete codes of this part can be find here.
First, some packages are loaded in convenient.
library(tidyverse)
library(magrittr)
All the raw count files are located in specified dir. Here we only have one.
dataPath <- "data"
(usedData <- list.files(dataPath, "txt", full = T))
## [1] "data/counts.txt"
Next reading all the count files while removing their fist 7 columns of gene annotation data.
dataMtx <- map(usedData, ~ read_delim(.x, delim = "\t", comment = "#")) %>%
map(~ .x[-1:-7]) %>%
purrr::reduce(cbind) %>%
as.matrix()
## Parsed with column specification:
## cols(
## Geneid = col_character(),
## Chr = col_character(),
## Start = col_character(),
## End = col_character(),
## Strand = col_character(),
## Length = col_double(),
## gene_name = col_character(),
## `4_map/F-ES1Aligned.out.bam` = col_double(),
## `4_map/F-ES2Aligned.out.bam` = col_double(),
## `4_map/F-H1Aligned.out.bam` = col_double(),
## `4_map/XF-ES1Aligned.out.bam` = col_double(),
## `4_map/XF-ES2Aligned.out.bam` = col_double(),
## `4_map/XF-H1Aligned.out.bam` = col_double(),
## `4_map/XF-iPSAligned.out.bam` = col_double()
## )
head(dataMtx)
## 4_map/F-ES1Aligned.out.bam 4_map/F-ES2Aligned.out.bam
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
## [4,] 1 0
## [5,] 0 0
## [6,] 252 224
## 4_map/F-H1Aligned.out.bam 4_map/XF-ES1Aligned.out.bam
## [1,] 1 0
## [2,] 0 0
## [3,] 0 0
## [4,] 0 2
## [5,] 0 0
## [6,] 176 1187
## 4_map/XF-ES2Aligned.out.bam 4_map/XF-H1Aligned.out.bam
## [1,] 1 0
## [2,] 0 0
## [3,] 0 0
## [4,] 2 2
## [5,] 0 0
## [6,] 1230 932
## 4_map/XF-iPSAligned.out.bam
## [1,] 4
## [2,] 0
## [3,] 0
## [4,] 0
## [5,] 0
## [6,] 613
These gene annotation data are also useful, so we can seperate and save them to load when in need.
anno <- readRDS("../../data/hg19anno.rds")
head(anno)
## Geneid Chr
## 1 ENSG00000243485 1;1;1;1;1;1
## 2 ENSG00000237613 1;1;1;1;1
## 3 ENSG00000186092 1
## 4 ENSG00000238009 1;1;1;1;1;1;1;1;1;1;1;1;1
## 5 ENSG00000239945 1;1
## 6 ENSG00000237683 1;1
## Start
## 1 29554;30267;30366;30564;30976;30976
## 2 34554;35245;35277;35721;35721
## 3 69091
## 4 89295;92091;92230;110953;112700;112700;112700;120721;120775;129055;129055;129081;133374
## 5 89551;90287
## 6 134901;137621
## End
## 1 30039;30667;30503;30667;31109;31097
## 2 35174;35481;35481;36073;36081
## 3 70008
## 4 91629;92240;92240;111357;112804;112804;112804;120932;120932;129173;129217;129223;133566
## 5 90050;91105
## 6 135802;139379
## Strand Length gene_name
## 1 +;+;+;+;+;+ 1021 MIR1302-10
## 2 -;-;-;-;- 1219 FAM138A
## 3 + 918 OR4F5
## 4 -;-;-;-;-;-;-;-;-;-;-;-;- 3569 RP11-34P13.7
## 5 -;- 1319 RP11-34P13.8
## 6 -;- 2661 AL627309.1
We can give dataMtx
meaningful row names as gene names and column names as sample name.
rownames(dataMtx) <- scater::uniquifyFeatureNames(anno$Geneid, anno$gene_name)
colnames(dataMtx) %<>% str_replace_all(".*map/|Aligned.*", "")
head(dataMtx)
## F-ES1 F-ES2 F-H1 XF-ES1 XF-ES2 XF-H1 XF-iPS
## MIR1302-10 0 0 1 0 1 0 4
## FAM138A 0 0 0 0 0 0 0
## OR4F5 0 0 0 0 0 0 0
## RP11-34P13.7 1 0 0 2 2 2 0
## RP11-34P13.8 0 0 0 0 0 0 0
## AL627309.1 252 224 176 1187 1230 932 613
Then the count matrix should be saved for later use.
saveRDS(dataMtx, "dataMtx.rds")
For samples with duplicates, we can use DESeq2 to perform DEG analysis.
Complete codes of this part can be find here.
library(glue)
library(DESeq2)
We can select part of samples to analyze. Here we keep all.
colnames(dataMtx)
## [1] "F-ES1" "F-ES2" "F-H1" "XF-ES1" "XF-ES2" "XF-H1" "XF-iPS"
usedMtx <- dataMtx[, 1:7]
A SummarizedExperiment
object should be created using count data. We can set condition factors to tell DESeq2 how these samples are grouped. Then we can create a DESeqDataSet
using these data.
# dir.create("DESeq2")
se <- SummarizedExperiment(assays = list(counts = as.matrix(usedMtx)))
se$condition <- factor(
rep(c("F", "XF"), c(3, 4)),
levels = c("F", "XF")
)
dds <- DESeqDataSet(se, design = ~ condition)
## converting counts to integer mode
dds
## class: DESeqDataSet
## dim: 32738 7
## metadata(1): version
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(0):
## colnames(7): F-ES1 F-ES2 ... XF-H1 XF-iPS
## colData names(1): condition
Lowly expressed genes among samples will be filtered.
rs <- rowMeans(usedMtx)
geneKeep <- rs > quantile(rs, 0.4)
sum(geneKeep)
## [1] 19562
dds <- dds[geneKeep, ]
The variance stabilizing transformated matrix will be used for visualization later.
vsd <- vst(dds, blind = T)
saveRDS(vsd, "DESeq2/vsd.rds")
DEG analysis will be performed between groups of samples.
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resultsNames(dds)
## [1] "Intercept" "condition_XF_vs_F"
The results of DEG can be exported as Excel readable .csv files.
DEGres <- list()
DEGres$XF_vs_F <- lfcShrink(dds, coef = "condition_XF_vs_F", type = "apeglm")
iwalk(DEGres, ~ write.csv(.x, glue("DESeq2/{.y}.DEG.csv")))
saveRDS(dds, "DESeq2/dds.rds")
For samples with no replicates, we can use DESeq to get a ‘p value’ for each gene between one sample and another. It is not very useful for well designed experiments. However, we can still use it for visulization.
Complete codes of this part can be find here.
library(DESeq)
dataMtx <- readRDS("dataMtx.rds")
colnames(dataMtx)
usedMtx <- dataMtx[c(
c("wanted sample names"),
NULL
)]
After loading data and selecting wanted samples, we can create a CountDataSet
object with count data matrix and condition factor. Note that we should set the parameters method = "blind"
and sharingMode = "fit-only"
here if we have no replicates.
dir.create("DESeq")
condition <- factor(
c("a", "b", "c"),
levels = c("a", "b", "c")
)
cds <- newCountDataSet(usedMtx, condition) %>%
estimateSizeFactors() %>%
estimateDispersions(method = "blind", sharingMode = "fit-only")
The variance stabilizing transformated matrix will be used for visualization later.
vsd <- varianceStabilizingTransformation(cds)
saveRDS(vsd, "DESeq/vsd.rds")
Perform DEG test using nbinomTest
if there is no other choice.
# DEGres <- list()
# DEGres$a_vs_b <- nbinomTest(cds, "a", "b")
# DEGres$a_vs_c <- nbinomTest(cds, "a", "c")
# iwalk(DEGres, ~ write_csv(.x, glue("DESeq/{.y}.DEG.csv")))
We can perform PCA and visualize the results.
Detailed codes for this plot can be find here.
For DEGs, there are some ways to visualize the difference between group of samples.
Detailed codes for this plot can be find here.
Detailed codes for this plot can be find here.
Heatmap is a good way to describe the comprehensive expression pattern of used samples. We can plot heatmaps with a large amount of genes with some interested ones labeled on the side.
Or we can just plot heatmaps with selected interesting genes.
Detailed codes for this two plots can be find here.
To understand the biology function of differentially expressed genes, GO analysis will be performed using clusterProfiler
.
library(data.table)
library(org.Hs.eg.db)
library(clusterProfiler)
DEGs will be loaded as data.table
with NA
s of adjusted p value are set to 1.
diffData <- fread("DESeq2/XF_vs_F.DEG.csv")
colnames(diffData)[1] <- "gene"
diffData[is.na(padj), padj := 1][]
## gene baseMean log2FoldChange lfcSE
## 1: AL627309.1 650.219270 2.124178784 0.2881312
## 2: RP11-34P13.14 3.886484 0.036843965 0.2206781
## 3: AP006222.2 4.870546 0.025704071 0.2175207
## 4: RP4-669L17.10 26.311563 0.143448138 0.2707604
## 5: RP11-206L10.3 22.037879 0.870527760 0.8516016
## ---
## 19558: AC011841.1 32.805143 0.093406180 0.2395464
## 19559: AL354822.1 18.898812 0.010123712 0.2168090
## 19560: PNRC2_ENSG00000215700 108.159850 -0.008642293 0.2152445
## 19561: SRSF10_ENSG00000215699 268.299806 -0.028366678 0.1671313
## 19562: CU459201.1 9.041196 0.007793443 0.2105759
## pvalue padj
## 1: 7.097370e-15 2.965104e-11
## 2: 3.821245e-02 1.000000e+00
## 3: 2.837381e-01 1.000000e+00
## 4: 6.858933e-02 3.549276e-01
## 5: 6.466336e-03 1.004265e-01
## ---
## 19558: 8.653553e-02 3.924275e-01
## 19559: NA 1.000000e+00
## 19560: 7.245608e-01 9.135938e-01
## 19561: 7.895098e-01 9.382771e-01
## 19562: 8.778994e-01 9.677162e-01
The genes with fold change > 2 or < -2 with significant difference are defined as up- or down-regulated, respectively.
diffData[, type := "ns"]
diffData[log2FoldChange > 1 & padj < 0.05, type := "up"][log2FoldChange < -1 & padj < 0.05, type := "down"][]
## gene baseMean log2FoldChange lfcSE
## 1: AL627309.1 650.219270 2.124178784 0.2881312
## 2: RP11-34P13.14 3.886484 0.036843965 0.2206781
## 3: AP006222.2 4.870546 0.025704071 0.2175207
## 4: RP4-669L17.10 26.311563 0.143448138 0.2707604
## 5: RP11-206L10.3 22.037879 0.870527760 0.8516016
## ---
## 19558: AC011841.1 32.805143 0.093406180 0.2395464
## 19559: AL354822.1 18.898812 0.010123712 0.2168090
## 19560: PNRC2_ENSG00000215700 108.159850 -0.008642293 0.2152445
## 19561: SRSF10_ENSG00000215699 268.299806 -0.028366678 0.1671313
## 19562: CU459201.1 9.041196 0.007793443 0.2105759
## pvalue padj type
## 1: 7.097370e-15 2.965104e-11 up
## 2: 3.821245e-02 1.000000e+00 ns
## 3: 2.837381e-01 1.000000e+00 ns
## 4: 6.858933e-02 3.549276e-01 ns
## 5: 6.466336e-03 1.004265e-01 ns
## ---
## 19558: 8.653553e-02 3.924275e-01 ns
## 19559: NA 1.000000e+00 ns
## 19560: 7.245608e-01 9.135938e-01 ns
## 19561: 7.895098e-01 9.382771e-01 ns
## 19562: 8.778994e-01 9.677162e-01 ns
These gene sets are converted to Entrez ID
and perform GO enrichment test.
geneList <- list()
geneList$XFup <- diffData[type == "up", gene]
geneList$Fup <- diffData[type == "down", gene]
egoList <- map(geneList, ~ {
enrichGO(
gene = na.omit(select(org.Hs.eg.db, keys = .x, columns = "ENTREZID", keytype = "SYMBOL")$ENTREZID),
OrgDb = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 1, qvalueCutoff = 1, readable = T)
})
iwalk(egoList, ~ write.csv(.x@result, str_c(.y, ".GO.csv")))
The top enriched terms of each gene set can be visualized using histogram.
Detailed codes for this plot can be find here.
Other content will be added soon.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] clusterProfiler_3.14.3 org.Hs.eg.db_3.10.0
## [3] AnnotationDbi_1.48.0 data.table_1.12.8
## [5] DESeq2_1.26.0 SummarizedExperiment_1.16.0
## [7] DelayedArray_0.12.0 BiocParallel_1.20.0
## [9] matrixStats_0.56.0 Biobase_2.46.0
## [11] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
## [13] IRanges_2.20.0 S4Vectors_0.24.0
## [15] BiocGenerics_0.32.0 glue_1.4.2
## [17] magrittr_1.5 forcats_0.4.0
## [19] stringr_1.4.0 dplyr_1.0.2
## [21] purrr_0.3.3 readr_1.3.1
## [23] tidyr_1.0.0 tibble_3.0.3
## [25] ggplot2_3.3.2 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.5
## [3] Hmisc_4.2-0 fastmatch_1.1-0
## [5] plyr_1.8.6 igraph_1.2.5
## [7] splines_3.6.3 scater_1.14.0
## [9] urltools_1.7.3 digest_0.6.27
## [11] htmltools_0.5.1.1 GOSemSim_2.12.0
## [13] viridis_0.5.1 GO.db_3.10.0
## [15] fansi_0.4.0 checkmate_1.9.4
## [17] memoise_1.1.0 cluster_2.1.0
## [19] graphlayouts_0.6.0 annotate_1.64.0
## [21] modelr_0.1.5 prettyunits_1.0.2
## [23] enrichplot_1.6.1 colorspace_1.4-1
## [25] ggrepel_0.8.1 blob_1.2.0
## [27] rvest_0.3.5 haven_2.2.0
## [29] xfun_0.22 crayon_1.3.4
## [31] RCurl_1.95-4.12 jsonlite_1.6
## [33] genefilter_1.68.0 survival_3.1-8
## [35] polyclip_1.10-0 gtable_0.3.0
## [37] zlibbioc_1.32.0 XVector_0.26.0
## [39] BiocSingular_1.2.0 SingleCellExperiment_1.8.0
## [41] scales_1.0.0 DOSE_3.12.0
## [43] DBI_1.0.0 Rcpp_1.0.5
## [45] progress_1.2.2 viridisLite_0.3.0
## [47] xtable_1.8-4 htmlTable_1.13.2
## [49] gridGraphics_0.5-0 europepmc_0.3
## [51] foreign_0.8-75 bit_1.1-14
## [53] rsvd_1.0.2 Formula_1.2-3
## [55] htmlwidgets_1.5.1 httr_1.4.1
## [57] fgsea_1.12.0 RColorBrewer_1.1-2
## [59] acepack_1.4.1 ellipsis_0.3.0
## [61] farver_2.0.1 pkgconfig_2.0.3
## [63] XML_3.98-1.20 nnet_7.3-13
## [65] sass_0.3.1 dbplyr_1.4.2
## [67] locfit_1.5-9.1 ggplotify_0.0.5
## [69] tidyselect_1.1.0 rlang_0.4.7
## [71] reshape2_1.4.3 munsell_0.5.0
## [73] cellranger_1.1.0 tools_3.6.3
## [75] cli_2.0.0 generics_0.0.2
## [77] RSQLite_2.1.2 ggridges_0.5.1
## [79] broom_0.7.1 evaluate_0.14
## [81] yaml_2.2.0 knitr_1.25
## [83] bit64_0.9-7 fs_1.3.1
## [85] tidygraph_1.1.2 ggraph_2.0.2
## [87] DO.db_2.9 xml2_1.2.2
## [89] compiler_3.6.3 rstudioapi_0.10
## [91] beeswarm_0.2.3 reprex_0.3.0
## [93] tweenr_1.0.1 geneplotter_1.64.0
## [95] bslib_0.2.4 stringi_1.5.3
## [97] lattice_0.20-41 Matrix_1.2-18
## [99] vctrs_0.3.4 pillar_1.4.6
## [101] lifecycle_0.2.0 BiocManager_1.30.10
## [103] triebeard_0.3.0 jquerylib_0.1.3
## [105] BiocNeighbors_1.4.1 cowplot_1.0.0
## [107] bitops_1.0-6 irlba_2.3.3
## [109] qvalue_2.18.0 R6_2.4.0
## [111] latticeExtra_0.6-28 gridExtra_2.3
## [113] vipor_0.4.5 MASS_7.3-51.5
## [115] assertthat_0.2.1 withr_2.1.2
## [117] GenomeInfoDbData_1.2.2 hms_0.5.2
## [119] grid_3.6.3 rpart_4.1-15
## [121] rvcheck_0.1.8 rmarkdown_2.7
## [123] DelayedMatrixStats_1.8.0 ggforce_0.3.1
## [125] lubridate_1.7.4 base64enc_0.1-3
## [127] ggbeeswarm_0.6.0