a b/examples/walkthrough_A549dataset.Rmd
1
2
---
3
title: "Walkthrough with A549 dataset"
4
author: "Suoqin Jin, Lihua Zhang"
5
output: html_document
6
mainfont: Arial
7
vignette: >
8
  %\VignetteIndexEntry{Integrative analysis of single cell multi-omics data using scAI}
9
  %\VignetteEngine{knitr::rmarkdown}
10
  %\VignetteEncoding{UTF-8}
11
---
12
13
```{r setup, include = FALSE}
14
knitr::opts_chunk$set(
15
  collapse = TRUE,
16
  comment = "#>",
17
  root.dir = './'
18
)
19
```
20
21
This walkthrough will perform integrative analysis of paired single cell RNA-seq and ATAC-seq data of A549 cells.
22
This data describes lung adenocarcinoma-derived A549 cells after 0, 1 and 3 hours of 100 nM dexamethasone treatment, including scRNA-seq and scATAC-seq data of 2641 co-assayed cells.
23
The data is downloaded from GEO (GSM3271040 and GSM3271041) that were co-assayed using the sci-CAR protocol (Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380-1385 (2018)).
24
25
26
Load the required libraries
27
```{r message=FALSE,warning=FALSE}
28
library(scAI)
29
library(dplyr)
30
library(cowplot)
31
```
32
33
## Load data
34
The algorithm takes a list of two digital data matrices as input. Genes/loci should be in rows and cells in columns. rownames and colnames should be included. Before running the scAI model, we need to normalize the data to account for library size and select highly variable features. Our preprocessed data matrices after normalization and feature selection were provided for this walkthrough.
35
36
```{r}
37
load("/Users/suoqinjin/Documents/scAI/data/data_A549.rda")
38
X <- data_A549$data # List of data matrix
39
labels <- data_A549$labels # the collected time of cells, which is used for validation
40
```
41
42
## Create a scAI object
43
```{r}
44
scAI_outs <- create_scAIobject(raw.data = X)
45
```
46
47
## Preprocess data
48
Perform quality control to remove low-quality cells and genes, and normalize the data.
49
Since this is a preprocessed data, we do not need to normalize the data. Thus we set `assay = NULL`.
50
51
```{r, results='asis'}
52
scAI_outs <- preprocessing(scAI_outs, assay = NULL, minFeatures = 200, minCells = 1,
53
                           libararyflag = F, logNormalize = F)
54
55
```
56
57
Add cell information into *pData* slot of the object
58
```{r}
59
scAI_outs <- addpData(scAI_outs, pdata = labels, pdata.name = "Time")
60
```
61
62
## Run scAI model
63
As depending on the random initilization the results might differ, we run scAI multiple times (e.g. nrun = 5) and output the best result. User can also output results from all runs by setting *keep_all = TRUE*. The key parameters here are the number of component/clusters (k). The `selectK` function can aid in selecting k. A suitable k is the one at which the magnitude of cophenetic correlation begins to fall.
64
```{r}
65
scAI_outs <- run_scAI(scAI_outs, K = 2, nrun = 5)
66
67
```
68
69
## Visualize the inferred biologically relevant components
70
We plot the heatmap of the three learned low-rank matrices using hierarchical clustering. The time information of cells are used for validation.
71
```{r, fig.width=7,fig.height = 8, fig.wide = TRUE, fig.align = "center"}
72
lmHeatmap(scAI_outs, color.by = "Time")
73
74
```
75
76
## Visualize cells onto the low-dimensional space
77
We can visualize cells onto the low-dimensional space generated by t-SNE, FIt-sne or UMAP.
78
Here, we perform comparison of the visualization of raw ATAC-seq data with the aggregated data. Cells are colored by the collected time.
79
```{r, fig.width=7,fig.height = 3.5,  fig.wide = TRUE, fig.align = "center"}
80
cell_coords.ori <- reducedDims(scAI_outs, data.use = scAI_outs@norm.data$ATAC, do.scale = F, method = "umap", return.object = F)
81
cell_coords.agg <- reducedDims(scAI_outs, data.use = scAI_outs@agg.data, do.scale = F, method = "umap", return.object = F)
82
83
gg1 <- cellVisualization(scAI_outs, cell_coords.ori, color.by = "Time",show.legend = F, title = "scATAC-seq")
84
gg2 <- cellVisualization(scAI_outs, cell_coords.agg, color.by = "Time", ylabel = NULL, title = "Aggregated scATAC-seq")
85
cowplot::plot_grid(gg1, gg2)
86
87
```
88
89
## Identify enriched features in each factor
90
```{r}
91
markers_RNA <- identifyFactorMarkers(scAI_outs, assay = 'RNA', n.top = 5)
92
markers_ATAC <- identifyFactorMarkers(scAI_outs, assay = 'ATAC', n.top = 5)
93
94
```
95
96
### Ranking the features (genes/loci) and highlighting the important markers in each factor
97
```{r, fig.width=4, fig.height=3,  fig.wide = TRUE, fig.align = "center"}
98
# show the markers of GR activation
99
feature_genes = c('ZSWIM6','BASP1','TXNRD1','NR3C1','CKB','ABHD12','CDH16','NFKBIA','PER1','SCNN1A');
100
feature_loci <- c('5-17070007-17070367','5-17113334-17113956','5-17118743-17119143','5-17189512-17190061','5-17216210-17217826','5-17249185-17249758',
101
                  '20-25370461-25372119','16-66928981-66929756','12-6376107-6377003')
102
feature_loci_motif <- c('FOXA1','RELA/SP1','SP1','SMAD3','NR3C1/YY1','CEBPB/GATA3',
103
                        'CEBPB','SMAD3','CREB1/NR3C1')
104
105
featureRankingPlot(scAI_outs, assay = 'RNA', feature.show = feature_genes, top.p = 0.5, ylabel = "Gene score")
106
featureRankingPlot(scAI_outs, assay = 'ATAC', feature.show = feature_loci, feature.show.names = feature_loci_motif, top.p = 0.5, ylabel = "Locus score")
107
```
108
109
## Embedding cells, genes, loci and factors into 2D-dimensions using our new visualization method VscAI
110
111
```{r message=FALSE,warning=FALSE}
112
scAI_outs <- getEmbeddings(scAI_outs)
113
114
```
115
116
### Visualization of the embedded cells, genes, loci and factors
117
User can provide a vector of the features (e.g., key marker genes/loci) to explore the biological meaning of the cell groups and enhance the interpretation of the data. 
118
```{r, fig.width=10,fig.height=4, fig.align = "center"}
119
gg1 <- VscAIplot(scAI_outs, gene.use = feature_genes, loci.use = NULL, loci.use.names = NULL, color.by = "Time")
120
gg2 <- VscAIplot(scAI_outs, gene.use = NULL, loci.use = feature_loci, loci.use.names = feature_loci_motif, color.by = "Time")
121
cowplot::plot_grid(gg1, gg2)
122
```
123
124
## Feature plot
125
We can overlay the expression of features onto the low-dimensional space, e.g., VscAI, tsne, umap
126
```{r, fig.width=9, fig.height=2.5,  fig.wide = TRUE, fig.align = "center"}
127
featureVisualization(scAI_outs, assay = "RNA", feature.use = c('NR3C1','CKB','ABHD12','NFKBIA'),  method = "VscAI", nCol = 4, cell.size = 0.1, show.legend = F, show.legend.combined = T)
128
```
129
130