a b/vignettes/MOVICS-VIGNETTE.Rmd
1
---
2
title: "MOVICS: <br>Multi-Omics integration and VIsualization in Cancer Subtyping"
3
author:
4
  name: "<b>Xiaofan Lu</b>"
5
  affiliation: "<b>State Key Laboratory of Natural Medicines, <br>Research Center of Biostatistics and Computational Pharmacy, <br>China Pharmaceutical University, Nanjing, China</b>"
6
  email: "<b>xlu.cpu@foxmail.com</b>"
7
date: "<b>`r Sys.Date()`</b>"
8
output: 
9
  prettydoc::html_pretty:
10
    theme: cayman
11
    highlight: github
12
vignette: >
13
  %\VignetteIndexEntry{MOVICS-VIGNETTE}
14
  %\VignetteEngine{knitr::rmarkdown}
15
  %\VignetteEncoding{UTF-8}
16
---
17
18
<style>
19
p.caption {
20
  font-size: 0.8em;
21
}
22
23
sup {
24
    line-height: 0;
25
    font-size: 0.83em;
26
    vertical-align: super;
27
}
28
29
.math {
30
  font-size: small;
31
}
32
</style>
33
34
```{r, include = FALSE}
35
knitr::opts_chunk$set(
36
  collapse = TRUE,
37
  comment = "#>"
38
)
39
```
40
41
```{r, echo=FALSE}
42
Sys.setenv(LANGUAGE = "en")
43
```
44
45
# CONTENTS
46
* [1. Introduction](#Section.1)
47
* [2. Installation](#Section.2)
48
* [3. Real Data Scenario](#Section.3)
49
* [4. MOVICS Pipeline](#Section.4)
50
  * [4.1 Pipeline Introduction](#Section.4.1)
51
  * [4.2 Steps of Pipeline](#Section.4.2)
52
    * [4.2.1 GET Module](#Section.4.2.1)
53
      * [1) get data from example files](#Section.4.2.1.1)
54
      * [2) get elites by reducing data dimension](#Section.4.2.1.2)
55
      * [3) get optimal number for clustering](#Section.4.2.1.3)
56
      * [4) get results from single algorithm](#Section.4.2.1.4)
57
      * [5) get results from multiple algorithms at once](#Section.4.2.1.5)
58
      * [6) get consensus from different algorithms](#Section.4.2.1.6)
59
      * [7) get quantification of similarity using silhoutte](#Section.4.2.1.7)
60
      * [8) get multi-omics heatmap based on clustering result](#Section.4.2.1.8)
61
    * [4.2.2 COMP Module](#Section.4.2.2)
62
      * [1) compare survival outcome](#Section.4.2.2.1)
63
      * [2) compare clinical features](#Section.4.2.2.2)
64
      * [3) compare mutational frequency](#Section.4.2.2.3)
65
      * [4) compare total mutation burden](#Section.4.2.2.4)
66
      * [5) compare fraction genome altered](#Section.4.2.2.5)
67
      * [6) compare drug sensitivity](#Section.4.2.2.6)
68
      * [7) compare agreement with other subtypes](#Section.4.2.2.7)
69
    * [4.2.3 RUN Module](#Section.4.2.3)
70
      * [1) run differential expression analysis](#Section.4.2.3.1)
71
      * [2) run biomarker identification procedure](#Section.4.2.3.2)
72
      * [3) run gene set enrichment analysis](#Section.4.2.3.3)
73
      * [4) run gene set variation analysis](#Section.4.2.3.4)
74
      * [5) run nearest template prediction in external cohort](#Section.4.2.3.5)
75
      * [6) run partition around medoids classifier](#Section.4.2.3.6)
76
      * [7) run consistency evaluation using Kappa statistics](#Section.4.2.3.7)
77
* [5. Little Trick](#Section.5)
78
* [6. Summary](#Section.6)
79
* [7. Session Information](#Section.7)
80
* [8. Citing MOVICS](#Section.8)
81
* [REFERENCES](#Section.9)
82
83
# <a id="Section.1" style="color:#159957;">1. Introduction</a>
84
Recent advances in next-generation sequencing, microarrays and mass spectrometry for omics data production have enabled the generation and collection of different modalities of high-dimensional molecular data<sup>1</sup>. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges<sup>2</sup>. This vignette aims to show how to use the package named MOVICS to perform multi-omics integrative clustering and visualization for cancer subtyping researches. This R package provides a unified interface for 10 state-of-the-art multi-omics clustering algorithms, and standardizes the output for each algorithm so as to form a pipeline for downstream analyses. Ten algorithms are CIMLR, iClusterBayes, MoCluster, COCA, ConsensusClustering, IntNMF, LRAcluster, NEMO, PINSPlus, and SNF where the former three methods can also perform the process of feature selection. For cancer subtyping studies, MOVICS also forms a pipeline for most commonly used downstream analyses for further subtype characterization and creates editable publication-quality illustrations (see more details below). Please note that MOVICS currently supports up to 6 omics data for jointly clustering and users must provide at least 2 omics datasets as input. Okay then, let us make our hands dirty to figure out how MOVICS works.
85
86
# <a id="Section.2" style="color:#159957;">2. Installation</a>
87
It is essential that you have R 4.0.1 or above already installed on your computer or server. MOVICS is a pipeline that utilizes many other R packages that are currently available from CRAN, Bioconductor and GitHub. For all of the steps of the pipeline to work, make sure that you have upgraded Bioconductor to newest version (BiocManager v3.11).
88
After you have R and Bioconductor installed properly, you can start to install MOVICS. The easiest way to install it is by typing the following code into your R session:
89
```{r, warning=FALSE, message=FALSE, echo=TRUE, eval=FALSE}
90
if (!requireNamespace("BiocManager", quietly = TRUE))
91
    install.packages("BiocManager")
92
if (!require("devtools")) 
93
    install.packages("devtools")
94
devtools::install_github("xlucpu/MOVICS")
95
```
96
97
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE}
98
library("dplyr")
99
library("knitr")
100
library("kableExtra")
101
#library("devtools")
102
#load_all()
103
```
104
105
When you are installing MOVICS, you may encounter some errors saying that some packages are not installed. These errors are caused by recursively depending on R packages, so if one package was not installed properly on your computer, MOVICS would fail. To solve these errors, you simply need to check those error messages, find out which packages required are missing, then install it with command `BiocManager::install("YourErrorPackage")` or `install.packages("YourErrorPackage")` or `devtools::install_github("username/YourErrorPackage")` directly. After that, retry installing MOVICS, it may take several times, but eventually it should work. Or, we highly suggest that you referred to the `Imports` in the [DESCRIPTION](https://github.com/xlucpu/MOVICS/blob/master/DESCRIPTION) file, try to install all the R dependencies, and then install MOVICS. We also summarized several common problems that you may meet when installing MOVICS, please refer to [Troubleshooting](https://github.com/xlucpu/MOVICS) if applicable. After installation, you should be able to load the MOVICS package in your R session:
106
```{r, eval=TRUE, warning=FALSE, message=FALSE}
107
library("MOVICS")
108
```
109
110
# <a id="Section.3" style="color:#159957;">3. Real Data Scenario</a>
111
This package contains two pre-processed real datasets of breast cancer. One dataset is `brca.tcga.RData` which is a list that includes 643 samples with four complete omics data types of breast cancer retrieved from TCGA-BRCA cohort<sup>3</sup> (*i.e.*, mRNA expression, lncRNA expression, DNA methylation profiling and somatic mutation matrix), and corresponding clinicopathological information (*i.e.*, age, pathological stage, PAM50 subtype, vital status and overall suvival time); such data list also contains corresponding RNA-Seq raw count table and Fragments Per Kilobase Million (FPKM) data in order to test the functions for downstream analyses (*e.g.*, differential expression analysis, drug sensitivity analysis, *etc*). The other one, `brca.yau.RData`, is an external validation dataset which contains gene expression profiles and clinicopathological information that downloaded from BRCA-YAU cohort<sup>4</sup> with 682 samples (one sample without annotation of PAM50 subtype was removed), which can be used to test the predictive functions available in MOVICS. These two datasets can be loaded like below:
112
```{r, eval=TRUE}
113
# load example data of breast cancer
114
load(system.file("extdata", "brca.tcga.RData", package = "MOVICS", mustWork = TRUE))
115
load(system.file("extdata", "brca.yau.RData",  package = "MOVICS", mustWork = TRUE))
116
```
117
118
Since in most cases, multi-omics clustering procedure is rather time-consuming, the TCGA-BRCA dataset was therefore first pre-processed to extract top 500 mRNAs, 500 lncRNA, 1,000 promoter CGI probes/genes with high variation using statistics of median absolute deviation (MAD), and 30 genes that mutated in at least 3% of the entire cohort. All these features are used in the following clustering analyses.
119
120
# <a id="Section.4" style="color:#159957;">4. MOVICS Pipeline</a>
121
## <a id="Section.4.1" style="color:#159957;">4.1 Pipeline Introduction</a>
122
123
<p align="center">![cloud](pkg_pipeline.jpg)</p>
124
125
MOVICS Pipeline diagram above outlines the concept for this package and such pipeline comprises separate functions that can be run individually (see details about argument list by typing `?function_name()` in R session). In above graphical abstract, all functions of MOVICS are included, which are categorized into three modules using three colors:
126
127
* <div style="color:red">**GET Module: get subtypes through multi-omics integrative clustering**</div> 
128
    * `getElites()`: get elites which are those features that pass the filtering procedure and are used for analyses
129
    * `getClustNum()`: get optimal cluster number by calculating clustering prediction index (CPI) and Gap-statistics
130
    * `get%algorithm_name%()`: get results from one specific multi-omics integrative clustering algorithm with detailed parameters
131
    * `getMOIC()`: get a list of results from multiple multi-omics integrative clustering algorithm with parameters by default
132
    * `getConsensusMOIC()`: get a consensus matrix that indicates the clustering robustness across different clustering algorithms and generate a consensus heatmap
133
    * `getSilhouette()`: get quantification of sample similarity using silhoutte score approach
134
    * `getStdiz()`: get a standardized data for generating comprehensive multi-omics heatmap
135
    * `getMoHeatmap()`: get a comprehensive multi-omics heatmap based on clustering results  
136
</br>
137
* <div style="color:green">**COMP Module: compare subtypes from multiple perspectives**</div> 
138
    * `compSurv()`: compare survival outcome and generate a Kalan-Meier curve with pairwise comparison if possible
139
    * `compClinvar()`: compare and summarize clinical features among different identified subtypes
140
    * `compMut()`: compare mutational frequency and generate an OncoPrint with significant mutations
141
    * `compTMB()`: compare total mutation burden among subtypes and generate distribution of Transitions and Transversions
142
    * `compFGA()`: compare fraction genome altered among subtypes and generate a barplot for distribution comparison
143
    * `compDrugsen()`: compare estimated half maximal inhibitory concentration ($IC_{50}$) for drug sensitivity and generate a boxviolin for distribution comparison
144
    * `compAgree()`: compare agreement of current subtypes with other pre-existed classifications and generate an alluvial diagram and an agreement barplot  
145
</br>
146
* <div style="color:blue">**RUN Module: run marker identification and verify subtypes**</div> 
147
    * `runDEA()`: run differential expression analysis with three popular methods for choosing, including edgeR, DESeq2, and limma
148
    * `runMarker()`: run biomarker identification to determine uniquely and significantly differential expressed genes for each subtype
149
    * `runGSEA()`: run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap 
150
    * `runGSVA()`: run gene set variation analysis to calculate enrichment score of each sample based on given gene set list of interest
151
    * `runNTP()`: run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
152
    * `runPAM()`: run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
153
    * `runKappa()`: run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes
154
    
155
## <a id="Section.4.2" style="color:#159957;">4.2 Steps of Pipeline</a>
156
Basically, the above three connected modules explain the workflow of this R package. MOVICS first identifies the cancer subtype (CS) by using one or multiple clustering algorithms; if multiple clustering algorithms are specified, it is highly recommended to perform a consensus clustering based on different subtyping results in order to derive stable and robust subtypes. Second, after having subtypes it is natural to exploit the heterogeneity of subtypes from as many angles as possible. Third, each subtype should have a list of subtype-specific biomarkers for reproducing such subtype in external cohorts. Therefore, let us follow this analytic pipeline and step into each module and each function in MOVICS. 
157
158
### <a id="Section.4.2.1" style="color:#159957;">4.2.1 GET Module</a>
159
#### <a id="Section.4.2.1.1" style="color:#159957;">1) get data from example files</a>
160
Data used for this vignette should be first extracted from the list of `brca.tcga`, including 4 types of multi-omics data. Except for omics data, this list also contains RNA-Seq raw count, FPKM matrix, MAF data, segmented copy number, clinical and survival information for downstream analyses. **Notably, all these omics data used for clustering share exactly the same samples with exactly the same order, and please make sure of this when using MOVICS on your own data.** 
161
```{r, eval=TRUE}
162
# print name of example data
163
names(brca.tcga)
164
names(brca.yau)
165
166
# extract multi-omics data
167
mo.data   <- brca.tcga[1:4]
168
169
# extract raw count data for downstream analyses
170
count     <- brca.tcga$count
171
172
# extract fpkm data for downstream analyses
173
fpkm      <- brca.tcga$fpkm
174
175
# extract maf for downstream analysis
176
maf       <- brca.tcga$maf
177
178
# extract segmented copy number for downstream analyses
179
segment   <- brca.tcga$segment
180
181
# extract survival information
182
surv.info <- brca.tcga$clin.info
183
```
184
185
#### <a id="Section.4.2.1.2" style="color:#159957;">2) get elites by reducing data dimension</a>
186
Although all these omics data have been already processed (filtered from the original dataset), I am still pleased to show you how to use `getElites()` function to filter out features that meet some stringent requirements, and those features that are preserved in this procedure are considered elites by MOVICS. Five filtering methods are provided here, namely `mad` for median absolute deviation, `sd` for standard deviation, `pca` for principal components analysis, `cox` for univariate Cox proportional hazards regression, and `freq` for binary omics data. This function also handles missing values coded in $NA$ by removing them directly or imputing them by $k$ nearest neighbors using a Euclidean metric through argument of `na.action`. Let me show you how to use `getElites()` below:
187
```{r, eval=TRUE}
188
# scenario 1: considering we are dealing with an expression data that have 2 rows with NA values
189
tmp       <- brca.tcga$mRNA.expr # get expression data
190
dim(tmp) # check data dimension
191
tmp[1,1]  <- tmp[2,2] <- NA # set 2 rows with NA values
192
tmp[1:3,1:3] # check data
193
elite.tmp <- getElites(dat       = tmp,
194
                       method    = "mad",
195
                       na.action = "rm", # NA values will be removed
196
                       elite.pct = 1) # elite.pct equals to 1 means all (100%) features after NA removal will be selected even using mad method
197
dim(elite.tmp$elite.dat) # check dimension again and see that we have removed 2 rows with NA data
198
199
elite.tmp <- getElites(dat       = tmp,
200
                       method    = "mad",
201
                       na.action = "impute", # NA values will be imputed
202
                       elite.pct = 1) 
203
dim(elite.tmp$elite.dat) # all data kept
204
elite.tmp$elite.dat[1:3,1:3] # NA values have been imputed 
205
206
# scenario 2: considering we are dealing with continuous data and use mad or sd to select elites
207
tmp       <- brca.tcga$mRNA.expr # get expression data with 500 features
208
elite.tmp <- getElites(dat       = tmp,
209
                       method    = "mad",
210
                       elite.pct = 0.1) # this time only top 10% features with high mad values are kept
211
dim(elite.tmp$elite.dat) # get 50 elite left
212
213
elite.tmp <- getElites(dat       = tmp,
214
                       method    = "sd",
215
                       elite.num = 100, # this time only top 100 features with high sd values are kept
216
                       elite.pct = 0.1) # this time elite.pct argument will be disabled because elite.num has been already indicated.
217
dim(elite.tmp$elite.dat) # get 100 elites left
218
219
# scenario 3: 
220
# considering we are dealing with continuous data and use pca to select elites
221
tmp       <- brca.tcga$mRNA.expr # get expression data with 500 features
222
elite.tmp <- getElites(dat       = tmp,
223
                       method    = "pca",
224
                       pca.ratio = 0.95) # ratio of PCs is selected
225
dim(elite.tmp$elite.dat) # get 204 elite (PCs) left
226
227
# scenario 4: considering we are dealing with data and use cox to select elite
228
tmp       <- brca.tcga$mRNA.expr # get expression data 
229
elite.tmp <- getElites(dat       = tmp,
230
                       method    = "cox",
231
                       surv.info = surv.info, # must provide survival information with 'futime' and 'fustat'
232
                       p.cutoff  = 0.05,
233
                       elite.num = 100) # this time elite.num argument will be disabled because cox method refers to p.cutoff to select elites
234
dim(elite.tmp$elite.dat) # get 125 elites
235
table(elite.tmp$unicox$pvalue < 0.05) # 125 genes have nominal pvalue < 0.05 in univariate Cox regression
236
237
tmp       <- brca.tcga$mut.status # get mutation data 
238
elite.tmp <- getElites(dat       = tmp,
239
                       method    = "cox",
240
                       surv.info = surv.info, # must provide survival information with 'futime' and 'fustat'
241
                       p.cutoff  = 0.05,
242
                       elite.num = 100) # this time elite.num argument will be disabled because cox method refers to p.cutoff to select elites
243
dim(elite.tmp$elite.dat) # get 3 elites
244
table(elite.tmp$unicox$pvalue < 0.05) # 3 mutations have nominal pvalue < 0.05
245
246
# scenario 5: considering we are dealing with mutation data using freq to select elites
247
tmp       <- brca.tcga$mut.status # get mutation data 
248
rowSums(tmp) # check mutation frequency
249
elite.tmp <- getElites(dat       = tmp,
250
                       method    = "freq", # must set as 'freq'
251
                       elite.num = 80, # note: in this scenario elite.num refers to frequency of mutation
252
                       elite.pct = 0.1) # discard because elite.num has been already indicated
253
rowSums(elite.tmp$elite.dat) # only genes that are mutated in over than 80 samples are kept as elites
254
255
elite.tmp <- getElites(dat       = tmp,
256
                       method    = "freq", # must set as 'freq'
257
                       elite.pct = 0.2) # note: in this scenario elite.pct refers to frequency of mutation / sample size
258
rowSums(elite.tmp$elite.dat) # only genes that are mutated in over than 0.2*643=128.6 samples are kept as elites
259
260
# get mo.data list just like below (not run)
261
# mo.data <- list(omics1 = elite.tmp$elite.dat,
262
#                 omics2 = ...)
263
```
264
265
Now I think I have made this clear for you concerning how to reduce data dimension for clustering analysis. Since the `mo.data` has been already prepared, I am going to stick on this data list, and take you further to MOVICS.
266
267
#### <a id="Section.4.2.1.3" style="color:#159957;">3) get optimal number for clustering</a>
268
The most important parameter to estimate in any clustering study is the optimum number of clusters $k$ for the data, where $k$ needs to be small enough to reduce noise but large enough to retain important information. Herein MOVICS refers to CPI<sup>5</sup> and Gaps-statistics<sup>6</sup> to estimate the number of clusters by using `getClustNum()` function. 
269
```{r, fig.align="center", fig.width=6, fig.height=6, fig.cap="Figure 1. Identification of optimal cluster number by calculating CPI (blue line) and Gaps-statistics (red line) in TCGA-BRCA cohort.", eval=TRUE}
270
# identify optimal clustering number (may take a while)
271
optk.brca <- getClustNum(data        = mo.data,
272
                         is.binary   = c(F,F,F,T), # note: the 4th data is somatic mutation which is a binary matrix
273
                         try.N.clust = 2:8, # try cluster number from 2 to 8
274
                         fig.name    = "CLUSTER NUMBER OF TCGA-BRCA")
275
```
276
277
The above estimation of clustering number gives an arbitrary $k$ of 3. However, the popular PAM50 classifier for breast cancer has 5 classifications and if taking a close look at the descriptive figure, it can be noticed that both CPI and Gaps-statistics does not decline too much at $k$ of 5. Therefore, considering these knowledge, $k$ of 5 is chosen as the optimal clustering number for further analyses.
278
279
#### <a id="Section.4.2.1.4" style="color:#159957;">4) get results from single algorithm</a>
280
In this part I will first show how to use MOVICS to perform multi-omics integrative clustering by specifying one algorithm with detailed parameters. For example, let us try iClusterBayes method like below:
281
```{r, eval=TRUE}
282
# perform iClusterBayes (may take a while)
283
iClusterBayes.res <- getiClusterBayes(data        = mo.data,
284
                                      N.clust     = 5,
285
                                      type        = c("gaussian","gaussian","gaussian","binomial"),
286
                                      n.burnin    = 1800,
287
                                      n.draw      = 1200,
288
                                      prior.gamma = c(0.5, 0.5, 0.5, 0.5),
289
                                      sdev        = 0.05,
290
                                      thin        = 3)
291
```
292
293
Otherwise, a unified function can be used for all algorithms individually with detailed parameters, that is, `getMOIC()`.
294
```{r, eval=FALSE}
295
iClusterBayes.res <- getMOIC(data        = mo.data,
296
                             N.clust     = 5,
297
                             methodslist = "iClusterBayes", # specify only ONE algorithm here
298
                             type        = c("gaussian","gaussian","gaussian","binomial"), # data type corresponding to the list
299
                             n.burnin    = 1800,
300
                             n.draw      = 1200,
301
                             prior.gamma = c(0.5, 0.5, 0.5, 0.5),
302
                             sdev        = 0.05,
303
                             thin        = 3)
304
```
305
306
By specifying only one algorithm (*i.e.*, iClusterBayes) in the argument of `methodslist`, the above `getMOIC()` will return exactly the same results from `getiClusterBayes()` if the same parameters are provided. The returned result contains a `clust.res` object that has two columns: `clust` to indicate the subtype which the sample belongs to, and `samID` records the corresponding sample name. For algorithms that provide feature selection procedure (*i.e.*, iClusterBayes, CIMLR, and MoCluster), the result also contains a `feat.res` object that stores the information of such procedure. To those algorithms involving hierarchical clustering (*e.g.*, COCA, ConsensusClustering), the corresponding dendrogram for sample clustering will be also returned as `clust.dend`, which is useful if the users want to put them at the heatmap.
307
308
#### <a id="Section.4.2.1.5" style="color:#159957;">5) get results from multiple algorithms at once</a>
309
If you simultaneously specify a list of algorithms to `methodslist` argument in `getMOIC()`, it will automatically perform each algorithm with default parameters one by one, and a list of results derived from specified algorithms will be finally returned. Now that iClusterBayes has been finished, let us try other 9 algorithms at once with parameters by default. This will take some time, so have a coffee break.
310
```{r, fig.show='hide', message=TRUE, eval=TRUE}
311
# perform multi-omics integrative clustering with the rest of 9 algorithms
312
moic.res.list <- getMOIC(data        = mo.data,
313
                         methodslist = list("SNF", "PINSPlus", "NEMO", "COCA", "LRAcluster", "ConsensusClustering", "IntNMF", "CIMLR", "MoCluster"),
314
                         N.clust     = 5,
315
                         type        = c("gaussian", "gaussian", "gaussian", "binomial"))
316
317
# attach iClusterBayes.res as a list using append() to moic.res.list with 9 results already
318
moic.res.list <- append(moic.res.list, 
319
                        list("iClusterBayes" = iClusterBayes.res))
320
321
# save moic.res.list to local path
322
save(moic.res.list, file = "moic.res.list.rda")
323
```
324
325
```{r, echo=FALSE, eval=FALSE}
326
load(file = "iClusterBayes.res.rda")
327
load(file = "moic.res.list.rda")
328
```
329
330
#### <a id="Section.4.2.1.6" style="color:#159957;">6) get consensus from different algorithms</a>
331
Since now all the clustering results from 10 algorithms are in hand, MOVICS borrows the idea of consensus ensembles<sup>7</sup> for later integration of the clustering results derived from different algorithms, so as to improve the clustering robustness. To be specific, if $t_{max}$ algorithms are specified where $2\le t_{max} \le10$, `getConsensusMOIC()` calculates a matrix $M_{n\times n}^{(t)}$ per algorithm where $n$ is the number of samples, and $M_{ij}^{(t)}=1$ only when the sample $i$ and $j$ are clustered in the same subtype, otherwise $M_{ij}^{(t)}=0$. After get all results from specified algorithms, MOVICS calculates a consensus matrix $CM=\sum_{t=1}^{t_{max}}M^{(t)}$, and $cm_{ij}\in[0,10]$. Such matrix represents a robust pairwise similarities for samples because it considers different multi-omics integrative clustering algorithms. MOVICS then searches for a stable clustering result by applying hierarchical clustering to $CM$. 
332
The simplest way to perform `getConsensusMOIC()` is to pass a list of object returned by multiple `get%algorithm_name%()` or by `getMOIC()` with specific argument of `methodslist`. This can be done like below:
333
```{r, fig.align="center", fig.width=7, fig.height=6, fig.cap="Figure 2. Consensus heatmap based on results from 10 multi-omics integrative clustering algorithms with cluster number of 5.", eval=TRUE}
334
cmoic.brca <- getConsensusMOIC(moic.res.list = moic.res.list,
335
                               fig.name      = "CONSENSUS HEATMAP",
336
                               distance      = "euclidean",
337
                               linkage       = "average")
338
```
339
340
Remember, you must choose at least two methods to perform the above consensus clustering by `getConsensusMOIC()`, and this function will return a consensus matrix (a probability matrix represents how many times samples belonging to the same subtype can be clustered together by different multi-omics clustering methods) and a corresponding consensus heatmap. Ideally, the consensus heatmap will show a perfect diagonal rectangle, and the input values are 0 and 1 only because all algorithms derived the same clustering results.
341
342
#### <a id="Section.4.2.1.7" style="color:#159957;">7) get quantification of similarity using silhoutte</a>
343
If consensus ensembles are used with multiple algorithms, except for the consensus heatmap, MOVICS provides a function `getSilhoutte()` to quantify and visualize the sample similarity more specifically given the identified clusters. Such function depends on a `sil` object that stores silhoutte score which was returned by `getConsensusMOIC()`.
344
```{r, fig.align="center", fig.width=5, fig.height=5.5, fig.cap="Figure 3. Quantification of sample similarity using silhoutte score based on consensus ensembles result.", eval=TRUE}
345
getSilhouette(sil      = cmoic.brca$sil, # a sil object returned by getConsensusMOIC()
346
              fig.path = getwd(),
347
              fig.name = "SILHOUETTE",
348
              height   = 5.5,
349
              width    = 5)
350
```
351
352
#### <a id="Section.4.2.1.8" style="color:#159957;">7) get multi-omics heatmap based on clustering result</a>
353
Genome-wide heatmaps are widely used to graphically display potential underlying patterns within the large genomic dataset. They have been used to reveal information about how the samples/genes cluster together and provide insights into potential sample biases or other artifacts. Herein MOVICS provides `getMoHeatmap` to visually deal with pre-clustered multi-omics data and generate an exquisite heatmap for publication requirement.
354
Before using `getMoHeatmap()`, omics data should be properly processed by using the function of `getStdiz()` which returns a list storing normalized omics data. Omics data, especially for expression (*e.g.*, RNA and protein), should be centered (`centerFlag = TRUE`) or scaled (`scaleFlag = TRUE`) or z-scored (both centered and scaled). Generally, DNA methylation data ($\beta$ matrix ranging from 0 to 1) and somatic mutation (0 and 1 binary matrix) should not be normalized. However, it is a good choice to convert the methylation $\beta$ value to M value following the formula of $M=log_2\frac{\beta}{1-\beta}$ for its stronger signal in visualization and M value is suitable for normalization.
355
This function also provides an argument of `halfwidth` for continuous omics data; such argument is used to truncate the 'extremum' after normalization; specifically, normalized values that exceed the `halfwidth` boundaries will be replaced by the `halfwidth`, which is beneficial to map colors in heatmap.
356
```{r, eval=TRUE}
357
# convert beta value to M value for stronger signal
358
indata <- mo.data
359
indata$meth.beta <- log2(indata$meth.beta / (1 - indata$meth.beta))
360
361
# data normalization for heatmap
362
plotdata <- getStdiz(data       = indata,
363
                     halfwidth  = c(2,2,2,NA), # no truncation for mutation
364
                     centerFlag = c(T,T,T,F), # no center for mutation
365
                     scaleFlag  = c(T,T,T,F)) # no scale for mutation
366
```
367
368
As I mentioned earlier, several algorithms also provide feature selection; those selected features show a complex cross-talk with other omics data and might have special biological significance that drives the heterogeneity of cancers. Therefore I show below how to generate a comprehensive heatmap based on a single algorithm (*e.g.*, iClusterBayes) with selected features by using `getMoHeatmap()`. However in the first place, features must be extracted:
369
```{r, eval=TRUE}
370
feat   <- iClusterBayes.res$feat.res
371
feat1  <- feat[which(feat$dataset == "mRNA.expr"),][1:10,"feature"] 
372
feat2  <- feat[which(feat$dataset == "lncRNA.expr"),][1:10,"feature"]
373
feat3  <- feat[which(feat$dataset == "meth.beta"),][1:10,"feature"]
374
feat4  <- feat[which(feat$dataset == "mut.status"),][1:10,"feature"]
375
annRow <- list(feat1, feat2, feat3, feat4)
376
```
377
378
The `feat.res` contained in `iClusterBayes.res` is sorted by posterior probability of features for each omics data. In this manner, the top 10 features are selected for each omics data and a feature list is generated and named as `annRow` for heatmap raw annotation (Do not worry, I will talk about how to attach column annotation for samples in a minute). 
379
```{r, fig.align="center", fig.width=9, fig.height=8.5, fig.cap="Figure 4. Comprehensive heatmap of multi-omics integrative clustering by iClusterBayes with annotation of potential drivers.", eval=TRUE}
380
# set color for each omics data
381
# if no color list specified all subheatmaps will be unified to green and red color pattern
382
mRNA.col   <- c("#00FF00", "#008000", "#000000", "#800000", "#FF0000")
383
lncRNA.col <- c("#6699CC", "white"  , "#FF3C38")
384
meth.col   <- c("#0074FE", "#96EBF9", "#FEE900", "#F00003")
385
mut.col    <- c("grey90" , "black")
386
col.list   <- list(mRNA.col, lncRNA.col, meth.col, mut.col)
387
388
# comprehensive heatmap (may take a while)
389
getMoHeatmap(data          = plotdata,
390
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
391
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
392
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
393
             clust.res     = iClusterBayes.res$clust.res, # cluster results
394
             clust.dend    = NULL, # no dendrogram
395
             show.rownames = c(F,F,F,F), # specify for each omics data
396
             show.colnames = FALSE, # show no sample names
397
             annRow        = annRow, # mark selected features
398
             color         = col.list,
399
             annCol        = NULL, # no annotation for samples
400
             annColors     = NULL, # no annotation color
401
             width         = 10, # width of each subheatmap
402
             height        = 5, # height of each subheatmap
403
             fig.name      = "COMPREHENSIVE HEATMAP OF ICLUSTERBAYES")
404
```
405
406
Of course, since there are a total of 10 results stored in the `moic.res.list`, you can also choose any one of them to create the heatmap. Here I select COCA which also returns sample dendrogram like below:
407
```{r, fig.align="center", fig.width=9, fig.height=9, fig.cap="Figure 5. Comprehensive heatmap of multi-omics integrative clustering by COCA with dendrogram for samples.", eval=TRUE}
408
# comprehensive heatmap (may take a while)
409
getMoHeatmap(data          = plotdata,
410
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
411
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
412
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
413
             clust.res     = moic.res.list$COCA$clust.res, # cluster results
414
             clust.dend    = moic.res.list$COCA$clust.dend, # show dendrogram for samples
415
             color         = col.list,
416
             width         = 10, # width of each subheatmap
417
             height        = 5, # height of each subheatmap
418
             fig.name      = "COMPREHENSIVE HEATMAP OF COCA")
419
```
420
421
Now go back to the consensus result of `cmoic.brca` that integrates 10 algorithms and this time samples' annotation is also provided to generate the heatmap. Since the core function of `getMoHeatmap()` is based on ComplexHeatmap R package, **when creating annotations, you should always use `circlize::colorRamp2()` function to generate the color mapping function for continuous variables** (*e.g.*, age in this example).
422
```{r, fig.align="center", fig.width=9, fig.height=9.5, fig.cap="Figure 6. Comprehensive heatmap based on consensus across 10 algorithms with clinicopathological annotation.", eval=TRUE}
423
# extract PAM50, pathologic stage and age for sample annotation
424
annCol    <- surv.info[,c("PAM50", "pstage", "age"), drop = FALSE]
425
426
# generate corresponding colors for sample annotation
427
annColors <- list(age    = circlize::colorRamp2(breaks = c(min(annCol$age),
428
                                                           median(annCol$age),
429
                                                           max(annCol$age)), 
430
                                                colors = c("#0000AA", "#555555", "#AAAA00")),
431
                  PAM50  = c("Basal" = "blue",
432
                            "Her2"   = "red",
433
                            "LumA"   = "yellow",
434
                            "LumB"   = "green",
435
                            "Normal" = "black"),
436
                  pstage = c("T1"    = "green",
437
                             "T2"    = "blue",
438
                             "T3"    = "red",
439
                             "T4"    = "yellow", 
440
                             "TX"    = "black"))
441
442
# comprehensive heatmap (may take a while)
443
getMoHeatmap(data          = plotdata,
444
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
445
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
446
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
447
             clust.res     = cmoic.brca$clust.res, # consensusMOIC results
448
             clust.dend    = NULL, # show no dendrogram for samples
449
             show.rownames = c(F,F,F,F), # specify for each omics data
450
             show.colnames = FALSE, # show no sample names
451
             show.row.dend = c(F,F,F,F), # show no dendrogram for features
452
             annRow        = NULL, # no selected features
453
             color         = col.list,
454
             annCol        = annCol, # annotation for samples
455
             annColors     = annColors, # annotation color
456
             width         = 10, # width of each subheatmap
457
             height        = 5, # height of each subheatmap
458
             fig.name      = "COMPREHENSIVE HEATMAP OF CONSENSUSMOIC")
459
```
460
461
### <a id="Section.4.2.2" style="color:#159957;">4.2.2 COMP Module</a>
462
After identification of cancer subtypes, it is essential to further characterize each subtype by discovering their difference from multiple aspects. To this end, MOVICS provides commonly used downstream analyses in cancer subtyping researches for easily cohesion with results derived from <span style="color:red">**GET Module**</span>. Now let us check them out.
463
464
#### <a id="Section.4.2.2.1" style="color:#159957;">1) compare survival outcome</a>
465
I start with comparing the prognosis among different subtypes. MOVICS provides function of `compSurv()` which not only calculates the overall nominal *P* value by log-rank test, but also performs pairwise comparison and derives adjusted *P* values if more than two subtypes are identified. These information will be all printed in the Kaplan-Meier Curve which is convenient for researchers to refer. Except for clustering results (*e.g.*, `cmoic.brca` in this example), you must additionally provide `surv.info` argument which should be a data.frame (must has row names of samples) that stores a `futime` column for survival time (**unit of day**) and another `fustat` column for survival outcome (0 = censor; 1 = event). x-year survival probability can be also estimated if specifying argument of `xyrs.est`.
466
```{r, fig.align="center", fig.width=6, fig.height=7, fig.cap="Figure 7. Kaplan-Meier survival curve of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.", eval=TRUE}
467
# survival comparison
468
surv.brca <- compSurv(moic.res         = cmoic.brca,
469
                      surv.info        = surv.info,
470
                      convt.time       = "m", # convert day unit to month
471
                      surv.median.line = "h", # draw horizontal line at median survival
472
                      xyrs.est         = c(5,10), # estimate 5 and 10-year survival
473
                      fig.name         = "KAPLAN-MEIER CURVE OF CONSENSUSMOIC")
474
475
print(surv.brca)
476
```
477
478
#### <a id="Section.4.2.2.2" style="color:#159957;">2) compare clinical features</a>
479
Then compare the clinical features among different subtypes. MOVICS provides function of `compClinvar()` which enables summarizing both continuous and categorical variables and performing proper statistical tests. This function can give a table in .docx format that is easy to use in medical research papers.
480
```{r, eval=TRUE}
481
clin.brca <- compClinvar(moic.res      = cmoic.brca,
482
                         var2comp      = surv.info, # data.frame needs to summarize (must has row names of samples)
483
                         strata        = "Subtype", # stratifying variable (e.g., Subtype in this example)
484
                         factorVars    = c("PAM50","pstage","fustat"), # features that are considered categorical variables
485
                         nonnormalVars = "futime", # feature(s) that are considered using nonparametric test
486
                         exactVars     = "pstage", # feature(s) that are considered using exact test
487
                         doWord        = TRUE, # generate .docx file in local path
488
                         tab.name      = "SUMMARIZATION OF CLINICAL FEATURES")
489
490
```
491
492
```{r, eval=FALSE}
493
print(clin.brca$compTab)
494
```
495
496
```{r, echo=FALSE, eval=TRUE}
497
clin.brca$compTab %>%
498
  kbl(caption = "Table 1. Comparison of clinical features among 5 identified subtype of breast cancer in TCGA-BRCA cohort.") %>%
499
  kable_classic(full_width = TRUE, html_font = "Calibri")
500
```
501
  
502
#### <a id="Section.4.2.2.3" style="color:#159957;">3) compare mutational frequency</a>
503
Subype-specific mutation might be promising as therapeutic target. Hence, I then compare the mutational frequency among different clusters. MOVICS provides function of `compMut()` which deals with binary mutational data (0 = wild; 1 = mutated). This function applies independent testing (*i.e.*, Fisher\' s exact test or $\chi^2$ test) for each mutation, and generates a table with statistical results (.docx file also if specified), and creates an OncoPrint using those differentially mutated genes.
504
```{r, fig.align="center", fig.width=8, fig.height=3, fig.cap="Figure 8. Mutational OncoPrint of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.", eval=TRUE}
505
# mutational frequency comparison
506
mut.brca <- compMut(moic.res     = cmoic.brca,
507
                    mut.matrix   = brca.tcga$mut.status, # binary somatic mutation matrix
508
                    doWord       = TRUE, # generate table in .docx format
509
                    doPlot       = TRUE, # draw OncoPrint
510
                    freq.cutoff  = 0.05, # keep those genes that mutated in at least 5% of samples
511
                    p.adj.cutoff = 0.05, # keep those genes with adjusted p value < 0.05 to draw OncoPrint
512
                    innerclust   = TRUE, # perform clustering within each subtype
513
                    annCol       = annCol, # same annotation for heatmap
514
                    annColors    = annColors, # same annotation color for heatmap
515
                    width        = 6, 
516
                    height       = 2,
517
                    fig.name     = "ONCOPRINT FOR SIGNIFICANT MUTATIONS",
518
                    tab.name     = "INDEPENDENT TEST BETWEEN SUBTYPE AND MUTATION")
519
```
520
521
```{r, eval=FALSE}
522
print(mut.brca)
523
```
524
525
```{r, echo=FALSE, eval=TRUE}
526
mut.brca %>%
527
  kbl(caption = "Table 2. Comparison of mutational frequency among 5 identified subtype of breast cancer in TCGA-BRCA cohort.") %>%
528
  kable_classic(full_width = TRUE, html_font = "Calibri")
529
```
530
531
#### <a id="Section.4.2.2.4" style="color:#159957;">4) compare total mutation burden</a>
532
Needless to say, immunotherapy is becoming a pillar of modern cancer treatment. Recent analyses have linked the tumoral genomic landscape with antitumor immunity. In particular, an emerging picture suggests that tumor-specific genomic lesions are associated with immune checkpoint activation and the extent and duration of responses for patients to immunotherapy. These lesions contains high mutation burdens <sup>8</sup> and aneuploidy<sup>9</sup>. To quantify these genomic alterations that may affect immunotherapy, MOVICS provides two functions to calculate total mutation burden (TMB) and fraction genome altered (FGA). To be specific, TMB refers to the number of mutations that are found in the tumor genome, while FGA is the percentage of genome that has been affected by copy number gains or losses. Both attributes are useful for genetic researchers as they provide them with more in-depth information on the genomic make-up of the tumors. Let me show you how to use these two functions by starting with `compTMB()`. First of all, the input maf data for this function must have the following 10 columns at least:
533
```{r, eval=TRUE}
534
head(maf)
535
```
536
537
```{r, echo=FALSE, eval=FALSE}
538
head(maf) %>%
539
  kbl(caption = "Table . Demo of MAF data with eligible column names.") %>%
540
  kable_classic(full_width = TRUE, html_font = "Calibri")
541
```
542
Run `compTMB()` then. By default, this function considers nonsynonymous variants only when counting somatic mutation frequency, including Frame_Shift_Del, Frame_Shift_Ins, Splice_Site, Translation_Start_Site, Nonsense_Mutation, Nonstop_Mutation, In_Frame_Del, In_Frame_Ins, and Missense_Mutation. In addition to calculating TMB, this function also classifies Single Nucleotide Variants into Transitions and Transversions (TiTv), and depicts the distribution of both TMB and TiTv.
543
```{r, fig.align="center", fig.width=6, fig.height=6, fig.cap="Figure 9. Comparison of TMB and TiTv among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.", eval=TRUE}
544
# compare TMB
545
tmb.brca <- compTMB(moic.res     = cmoic.brca,
546
                    maf          = maf,
547
                    rmDup        = TRUE, # remove duplicated variants per sample
548
                    rmFLAGS      = FALSE, # keep FLAGS mutations
549
                    exome.size   = 38, # estimated exome size
550
                    test.method  = "nonparametric", # statistical testing method
551
                    fig.name     = "DISTRIBUTION OF TMB AND TITV")
552
```
553
554
```{r, eval=FALSE}
555
head(tmb.brca$TMB.dat)
556
```
557
558
```{r, echo=FALSE, eval=TRUE}
559
head(tmb.brca$TMB.dat) %>%
560
  kbl(caption = "Table 3. Demo of comparison of TMB among 5 identified subtype of breast cancer in TCGA-BRCA cohort.") %>%
561
  kable_classic(full_width = TRUE, html_font = "Calibri")
562
```
563
564
#### <a id="Section.4.2.2.5" style="color:#159957;">5) compare fraction genome altered</a>
565
Next, `compFGA()` calculates not only FGA but also computes specific gain (FGG) or loss (FGL) per sample within each subtype. To get this function worked, an eligible input of segmented copy number should be prepared with exactly the same column name like below: 
566
```{r}
567
# change column names of segment data
568
colnames(segment) <- c("sample","chrom","start","end","value")
569
```
570
571
Let's see how the input looks like:
572
```{r, eval=TRUE}
573
head(segment)
574
```
575
576
```{r, echo=FALSE, eval=FALSE}
577
head(segment) %>%
578
  kbl(caption = "Table . Demo of segmented copy number data with eligible column names.") %>%
579
  kable_classic(full_width = TRUE, html_font = "Calibri")
580
```
581
582
Then this function can be run without any difficulties. **Notably, if your CNA calling procedure did not provide a segmented copy number as value column but the original copy number, argument of `iscopynumber` must be switched to `TRUE` instead**.
583
```{r, fig.align="center", fig.width=10, fig.height=2.5, fig.cap="Figure 10. Barplot of fraction genome altered among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.", eval=TRUE}
584
# compare FGA, FGG, and FGL
585
fga.brca <- compFGA(moic.res     = cmoic.brca,
586
                    segment      = segment,
587
                    iscopynumber = FALSE, # this is a segmented copy number file
588
                    cnathreshold = 0.2, # threshold to determine CNA gain or loss
589
                    test.method  = "nonparametric", # statistical testing method
590
                    fig.name     = "BARPLOT OF FGA")
591
```
592
593
```{r, eval=FALSE}
594
head(fga.brca$summary)
595
```
596
597
```{r, echo=FALSE, eval=TRUE}
598
head(fga.brca$summary) %>%
599
  kbl(caption = "Table 4. Demo of comparison of fraction genome altered among 5 identified subtype of breast cancer in TCGA-BRCA cohort.") %>%
600
  kable_classic(full_width = TRUE, html_font = "Calibri")
601
```
602
603
#### <a id="Section.4.2.2.6" style="color:#159957;">6) compare drug sensitivity</a>
604
Predicting response to medication is particularly important for drugs with a narrow therapeutic index, for example chemotherapeutic agents, because response is highly variable and side effects are potentially lethal. Therefore, Paul Geeleher et al. (2014)<sup>10</sup> used baseline gene expression and $in\;vitro$ drug sensitivity derived from cell lines, coupled with $in\;vivo$ baseline tumor gene expression, to predict patients' response to drugs. Paul developped an R package pRRophetic for prediction of clinical chemotherapeutic response from tumor gene expression levels<sup>11</sup>, and now this function has been involved in MOVICS to examine difference of drug sensitivity among different subtypes. Here I estimate the $IC_{50}$ of *Cisplatin* and *Paclitaxel* for 5 identified subtypes of breast cancer in TCGA cohort.
605
```{r, fig.show = "hold", out.width = "50%", fig.align = "default", fig.width=8, fig.height=6, fig.cap="Figure 11. Boxviolins for estimated IC50 of Cisplatin and Paclitaxel among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.", eval=TRUE}
606
# drug sensitivity comparison
607
drug.brca <- compDrugsen(moic.res    = cmoic.brca,
608
                         norm.expr   = fpkm[,cmoic.brca$clust.res$samID], # double guarantee sample order
609
                         drugs       = c("Cisplatin", "Paclitaxel"), # a vector of names of drug in GDSC
610
                         tissueType  = "breast", # choose specific tissue type to construct ridge regression model
611
                         test.method = "nonparametric", # statistical testing method
612
                         prefix      = "BOXVIOLIN OF ESTIMATED IC50") 
613
```
614
615
```{r, eval=FALSE}
616
head(drug.brca$Cisplatin)
617
```
618
619
```{r, echo=FALSE, eval=TRUE}
620
head(drug.brca$Cisplatin) %>%
621
  kbl(caption = "Table 5. Demo of estimated IC50 for Cisplatin among 5 identified subtype of breast cancer in TCGA-BRCA cohort.") %>%
622
  kable_classic(full_width = TRUE, html_font = "Calibri")
623
```
624
625
#### <a id="Section.4.2.2.7" style="color:#159957;">7) compare agreement with other subtypes</a>
626
At present, many cancers have traditional classifications, and evaluating the consistency of new subtypes with previous classifications is critical to reflect the robustness of clustering analysis and to determine potential but novel subtypes. To measure the agreement (similarity) between the current subtypes and other pre-existed classifications, MOVICS provides function of `compAgree()` to calculate four statistics: Rand Index (RI)<sup>12</sup>, Adjusted Mutual Information (AMI)<sup>13</sup>, Jaccard Index (JI)<sup>14</sup>, and Fowlkes-Mallows (FM)<sup>15</sup>; all these measurements range from 0 to 1 and the larger the value is, the more similar the two appraises are. This function can also generate an alluvial diagram to visualize the agreement of two appraises with the current subtypes as reference. 
627
```{r, fig.align="center", fig.width=8, fig.height=5, fig.cap="Figure 12. Agreement of 5 identified subtypes of breast cancer with PAM50 classification and pathological stage in TCGA-BRCA cohort.", eval=TRUE}
628
# customize the factor level for pstage
629
surv.info$pstage <- factor(surv.info$pstage, levels = c("TX","T1","T2","T3","T4"))
630
631
# agreement comparison (support up to 6 classifications include current subtype)
632
agree.brca <- compAgree(moic.res  = cmoic.brca,
633
                        subt2comp = surv.info[,c("PAM50","pstage")],
634
                        doPlot    = TRUE,
635
                        box.width = 0.2,
636
                        fig.name  = "AGREEMENT OF CONSENSUSMOIC WITH PAM50 AND PSTAGE")
637
```
638
639
```{r, eval=FALSE}
640
print(agree.brca)
641
```
642
643
```{r, echo=FALSE, eval=TRUE}
644
agree.brca %>%
645
  kbl(caption = "Table 6. Agreement of 5 identified subtypes with PAM50 classification and pathological stage in TCGA-BRCA cohort.") %>%
646
  kable_classic(full_width = TRUE, html_font = "Calibri")
647
```
648
649
### <a id="Section.4.2.3" style="color:#159957;">4.2.3 RUN Module</a>
650
Take a deep breath, we successfully enter the last <span style="color:blue">**RUN Module**</span>, and victory is in sight. In this module, MOVICS aims to characterize different subtypes by identifying their potential predictive biomarkers and functional pathways. Identifying and applying molecular biomarkers to predict subtype with efficiency is particularly important for disease management and treatment, thus improving clinical outcome. In this context, MOVICS searches for subtype-specific biomarkers by starting with differential expression analysis (DEA). 
651
652
#### <a id="Section.4.2.3.1" style="color:#159957;">1) run differential expression analysis</a>
653
MOVICS provides `runDEA()` function which embeds three state-of-the-art DEA approches to identify differentially expressed genes (DEGs), including edgeR and DESeq2 for RNA-Seq count data and limma for microarray profile or normalized expression data. **Notably, since `runDEA()` checks the data scale automatically when choosing limma algorithm, it is recommended to provide a microarray expression profile or normalized expression data (*e.g.*, RSEM, FPKM, TPM) without z-score or log2 transformation.**
654
This step is also quite time-consuming, especially in the case of using raw count data, so release your hand from the keyboard, close your eyes, and relax for a moment.
655
```{r, eval=TRUE}
656
# run DEA with edgeR
657
runDEA(dea.method = "edger",
658
       expr       = count, # raw count data
659
       moic.res   = cmoic.brca,
660
       prefix     = "TCGA-BRCA") # prefix of figure name
661
662
# run DEA with DESeq2
663
runDEA(dea.method = "deseq2",
664
       expr       = count,
665
       moic.res   = cmoic.brca,
666
       prefix     = "TCGA-BRCA")
667
668
# run DEA with limma
669
runDEA(dea.method = "limma",
670
       expr       = fpkm, # normalized expression data
671
       moic.res   = cmoic.brca,
672
       prefix     = "TCGA-BRCA")
673
```
674
675
Each identified cancer subtype will be compared with the rest (Others) and the corresponding .txt file will be stored according to the argument of `res.path`. By default, these files will be saved under the current working directory.
676
677
#### <a id="Section.4.2.3.2" style="color:#159957;">2) run biomarker identification procedure</a>
678
In this procedure, the most differentially expressed genes sorted by log2FoldChange are chosen as the biomarkers for each subtype (200 biomarkers for each subtype by default). These biomarkers should pass the significance threshold (*e.g.*, nominal $P$ value < 0.05 and adjusted $P$ value < 0.05) and must not overlap with any biomarkers identified for other subtypes.
679
```{r, fig.align="center", fig.width=7, fig.height=6, fig.cap="Figure 13. Heatmap of subtype-specific upregulated biomarkers using edgeR for 5 identified subtypes in TCGA-BRCA cohort.", eval=TRUE}
680
# choose edgeR result to identify subtype-specific up-regulated biomarkers
681
marker.up <- runMarker(moic.res      = cmoic.brca,
682
                       dea.method    = "edger", # name of DEA method
683
                       prefix        = "TCGA-BRCA", # MUST be the same of argument in runDEA()
684
                       dat.path      = getwd(), # path of DEA files
685
                       res.path      = getwd(), # path to save marker files
686
                       p.cutoff      = 0.05, # p cutoff to identify significant DEGs
687
                       p.adj.cutoff  = 0.05, # padj cutoff to identify significant DEGs
688
                       dirct         = "up", # direction of dysregulation in expression
689
                       n.marker      = 100, # number of biomarkers for each subtype
690
                       doplot        = TRUE, # generate diagonal heatmap
691
                       norm.expr     = fpkm, # use normalized expression as heatmap input
692
                       annCol        = annCol, # sample annotation in heatmap
693
                       annColors     = annColors, # colors for sample annotation
694
                       show_rownames = FALSE, # show no rownames (biomarker name)
695
                       fig.name      = "UPREGULATED BIOMARKER HEATMAP")
696
```
697
698
```{r, eval=FALSE}
699
# check the upregulated biomarkers
700
head(marker.up$templates)
701
```
702
703
```{r, echo=FALSE, eval=TRUE}
704
head(marker.up$templates) %>%
705
  kbl(caption = "Table 7. Demo of subtype-specific upregulated biomarkers for 5 identified subtypes of breast cancer in TCGA-BRCA cohort.") %>%
706
  kable_classic(full_width = TRUE, html_font = "Calibri")
707
```
708
709
Then try results derived from limma to identify subtype-specific downregulated biomarkers. Please feel free to try DESeq2.
710
```{r, fig.align="center", fig.width=7, fig.height=6, fig.cap="Figure 14. Heatmap of subtype-specific downregulated biomarkers using limma for 5 identified subtypes in TCGA-BRCA cohort.", eval=TRUE}
711
# choose limma result to identify subtype-specific down-regulated biomarkers
712
marker.dn <- runMarker(moic.res      = cmoic.brca,
713
                       dea.method    = "limma",
714
                       prefix        = "TCGA-BRCA",
715
                       dirct         = "down",
716
                       n.marker      = 50, # switch to 50
717
                       doplot        = TRUE,
718
                       norm.expr     = fpkm,
719
                       annCol        = annCol,
720
                       annColors     = annColors,
721
                       fig.name      = "DOWNREGULATED BIOMARKER HEATMAP")
722
```
723
724
#### <a id="Section.4.2.3.3" style="color:#159957;">3) run gene set enrichment analysis</a>
725
Similarly, GSEA is run for each subtype based on its corresponding DEA result to identify subtype-specific functional pathways. To this end, I have prepared a geneset background which includes all gene sets derived from GO biological processes (`c5.bp.v7.1.symbols.gmt`) from The Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). You can download other interested background for your own study.
726
```{r, eval=TRUE}
727
# MUST locate ABSOLUTE path of msigdb file
728
MSIGDB.FILE <- system.file("extdata", "c5.bp.v7.1.symbols.xls", package = "MOVICS", mustWork = TRUE)
729
```
730
731
Likewise, these identified specific pathways should pass the significance threshold (*e.g.*, nominal $P$ value < 0.05 and adjusted $P$ value < 0.25) and must not overlap with any pathways identified for other subtypes. After having the subtype-specific pathways, genes that are inside the pathways are retrieved to calculate a single sample enrichment score by using GSVA R package. Subsequently, subtype-specific enrichment score will be represented by the mean or median (mean by default) value within the subtype, and will be further visualized by diagonal heatmap.
732
```{r, fig.align="center", fig.width=10, fig.height=8, fig.cap="Figure 15. Heatmap of subtype-specific upregulated pathways using edgeR algorithm for 5 identified subtypes in TCGA-BRCA cohort.", eval=TRUE}
733
# run GSEA to identify up-regulated GO pathways using results from edgeR
734
gsea.up <- runGSEA(moic.res     = cmoic.brca,
735
                   dea.method   = "edger", # name of DEA method
736
                   prefix       = "TCGA-BRCA", # MUST be the same of argument in runDEA()
737
                   dat.path     = getwd(), # path of DEA files
738
                   res.path     = getwd(), # path to save GSEA files
739
                   msigdb.path  = MSIGDB.FILE, # MUST be the ABSOLUTE path of msigdb file
740
                   norm.expr    = fpkm, # use normalized expression to calculate enrichment score
741
                   dirct        = "up", # direction of dysregulation in pathway
742
                   p.cutoff     = 0.05, # p cutoff to identify significant pathways
743
                   p.adj.cutoff = 0.25, # padj cutoff to identify significant pathways
744
                   gsva.method  = "gsva", # method to calculate single sample enrichment score
745
                   norm.method  = "mean", # normalization method to calculate subtype-specific enrichment score
746
                   fig.name     = "UPREGULATED PATHWAY HEATMAP")
747
```
748
749
Check some columns of GSEA results for the first subtype (CS1).
750
```{r, eval=FALSE}
751
print(gsea.up$gsea.list$CS1[1:6,3:6])
752
```
753
754
```{r, echo=FALSE, eval=TRUE}
755
gsea.up$gsea.list$CS1[1:6,3:6] %>%
756
  kbl(caption = "Table 8. Demo of GSEA results for the first cancer subtype (CS1) of breast cancer in TCGA-BRCA cohort.") %>%
757
  kable_classic(full_width = TRUE, html_font = "Calibri")
758
```
759
760
Also check results of subtype-specific enrichment scores.
761
```{r, eval=FALSE}
762
head(round(gsea.up$grouped.es,3))
763
```
764
765
```{r, echo=FALSE, eval=TRUE}
766
head(round(gsea.up$grouped.es,3)) %>%
767
  kbl(caption = "Table 9. Demo of subtype-specific enrichment scores among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.") %>%
768
  kable_classic(full_width = TRUE, html_font = "Calibri")
769
```
770
771
Then try results derived from DESeq2 to identify subtype-specific downregulated pathways. Please feel free to try limma.
772
```{r, fig.align="center", fig.width=10, fig.height=8, fig.cap="Figure 16. Heatmap of subtype-specific downregulated pathways using limma algorithm for 5 identified subtypes in TCGA-BRCA cohort.", eval=TRUE}
773
# run GSEA to identify down-regulated GO pathways using results from DESeq2
774
gsea.dn <- runGSEA(moic.res     = cmoic.brca,
775
                   dea.method   = "deseq2",
776
                   prefix       = "TCGA-BRCA",
777
                   msigdb.path  = MSIGDB.FILE,
778
                   norm.expr    = fpkm,
779
                   dirct        = "down",
780
                   p.cutoff     = 0.05,
781
                   p.adj.cutoff = 0.25,
782
                   gsva.method  = "ssgsea", # switch to ssgsea
783
                   norm.method  = "median", # switch to median
784
                   fig.name     = "DOWNREGULATED PATHWAY HEATMAP") 
785
```
786
787
#### <a id="Section.4.2.3.4" style="color:#159957;">4) run gene set variation analysis</a>
788
For all the new defined molecular subtypes, it is critical to depict its characteristics validated by different signatures of gene sets. MOVICS provides a simple function which uses gene set variation analysis to calculate enrichment score of each sample in each subtype based on given gene set list of interest. First, we must prepare a gene list of interest in hand saved as [GMT format](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29).
789
```{r, eval=TRUE}
790
# MUST locate ABSOLUTE path of gene set file
791
GSET.FILE <- 
792
  system.file("extdata", "gene sets of interest.gmt", package = "MOVICS", mustWork = TRUE)
793
```
794
795
Then we can calculate single sample enrichment score based on specified method and given gene set of interest.
796
```{r, fig.align="center", fig.width=8, fig.height=5, fig.cap="Figure 17. Heatmap of enrichment score of gene set of interest for 5 identified subtypes in TCGA-BRCA cohort.", eval=TRUE}
797
# run GSVA to estimate single sample enrichment score based on given gene set of interest
798
gsva.res <- 
799
  runGSVA(moic.res      = cmoic.brca,
800
          norm.expr     = fpkm,
801
          gset.gmt.path = GSET.FILE, # ABSOLUTE path of gene set file
802
          gsva.method   = "gsva", # method to calculate single sample enrichment score
803
          annCol        = annCol,
804
          annColors     = annColors,
805
          fig.path      = getwd(),
806
          fig.name      = "GENE SETS OF INTEREST HEATMAP",
807
          height        = 5,
808
          width         = 8)
809
810
# check raw enrichment score
811
print(gsva.res$raw.es[1:3,1:3])
812
813
# check z-scored and truncated enrichment score
814
print(gsva.res$scaled.es[1:3,1:3])
815
```
816
817
#### <a id="Section.4.2.3.5" style="color:#159957;">4) run nearest template prediction in external cohort</a>
818
Oh wait, did we forget something? Yeah there is a dataset that has not been used yet, so let's see if we can use those subtype-specific biomarkers to verify the current breast cancer subtypes in the external Yau cohort. In this part, our core purpose is to predict the possible subtypes of each sample in the external dataset. In most cases, this is a multi-classification problem, and the identified biomarkers may be difficult to be overall matched in the external cohort, so it might not be reliable to use model-based predictive algorithms. Therefore, MOVICS provides two model-free approaches for subtype prediction in validation cohort. First, MOVICS switches to nearest template prediction (NTP) which can be flexibly applied to cross-platform, cross-species, and multiclass predictions without any optimization of analysis parameters<sup>16</sup>. Only one thing have to do is generating a template file, which has been fortunately prepared already.
819
```{r, fig.align="center", fig.width=6, fig.height=6, fig.cap="Figure 18. Heatmap of NTP in Yau cohort using subtype-specific upregulated biomarkers identified from TCGA-BRCA cohort", eval=TRUE}
820
# run NTP in Yau cohort by using up-regulated biomarkers
821
yau.ntp.pred <- runNTP(expr       = brca.yau$mRNA.expr,
822
                       templates  = marker.up$templates, # the template has been already prepared in runMarker()
823
                       scaleFlag  = TRUE, # scale input data (by default)
824
                       centerFlag = TRUE, # center input data (by default)
825
                       doPlot     = TRUE, # to generate heatmap
826
                       fig.name   = "NTP HEATMAP FOR YAU") 
827
```
828
829
```{r, eval=FALSE}
830
head(yau.ntp.pred$ntp.res)
831
```
832
833
```{r, echo=FALSE, eval=TRUE}
834
head(yau.ntp.pred$ntp.res) %>%
835
  kbl(caption = "Table 10. Demo of predicted subtypes in Yau cohort by NTP using subtype-specific upregulated biomarkers identified from TCGA-BRCA cohort.") %>%
836
  kable_classic(full_width = TRUE, html_font = "Calibri")
837
```
838
839
Above an object `yau.ntp.pred` is prepared which is similar in structure of object returned by `getMOIC()` but only stores `clust.res` which can be passed to functions within <span style="color:green">**COMP Module**</span> if additional data is available. For example, I herein first compare the survival outcome of the predicted 5 cancer subtypes in Yau cohort.
840
```{r, fig.align="center", fig.width=6, fig.height=7, fig.cap="Figure 19. Kaplan-Meier survival curve of predicted 5 subtypes of breast cancer in Yau cohort.", eval=TRUE}
841
# compare survival outcome in Yau cohort
842
surv.yau <- compSurv(moic.res         = yau.ntp.pred,
843
                     surv.info        = brca.yau$clin.info,
844
                     convt.time       = "m", # switch to month
845
                     surv.median.line = "hv", # switch to both
846
                     fig.name         = "KAPLAN-MEIER CURVE OF NTP FOR YAU") 
847
848
print(surv.yau)
849
```
850
851
I further check the agreement between the predicted subtype and PAM50 classification.
852
```{r, fig.align="center", fig.width=8, fig.height=5, fig.cap="Figure 20. Agreement of predicted 5 subtypes of breast cancer with PAM50 classification in Yau cohort.", eval=TRUE}
853
# compare agreement in Yau cohort
854
agree.yau <- compAgree(moic.res  = yau.ntp.pred,
855
                       subt2comp = brca.yau$clin.info[, "PAM50", drop = FALSE],
856
                       doPlot    = TRUE,
857
                       fig.name  = "YAU PREDICTEDMOIC WITH PAM50")
858
859
```
860
861
```{r, eval=FALSE}
862
print(agree.yau)
863
```
864
865
```{r, echo=FALSE, eval=TRUE}
866
agree.yau %>%
867
  kbl(caption = "Table 11. Agreement of 5 predicted subtypes of breast cancer with PAM50 classification in Yau cohort.") %>%
868
  kable_classic(full_width = TRUE, html_font = "Calibri")
869
```
870
871
It is clearly that the predicted subtypes of breast cancer in Yau cohort can distinguish prognosis well and also show similar pattern of agreement with PAM50 classification as compared to TCGA-BRCA to some extent.
872
873
#### <a id="Section.4.2.3.6" style="color:#159957;">5) run partition around medoids classifier</a>
874
In addition to NTP, MOVICS provides another model-free approach to predict subtypes. To be specific, `runPAM()` first trains a partition around medoids (PAM) classifier in the discovery (training) cohort (*i.e.*, TCGA-BRCA) to predict the subtype for patients in the external validation (testing) cohort (*i.e.*, BRCA-Yau), and each sample in the validation cohort was assigned to a subtype label whose centroid had the highest Pearson correlation with the sample<sup>17</sup>. Finally, the in-group proportion (IGP) statistic will be performed to evaluate the similarity and reproducibility of the acquired subtypes between discovery and validation cohorts<sup>18</sup>.
875
```{r, eval=TRUE}
876
yau.pam.pred <- runPAM(train.expr  = fpkm,
877
                       moic.res    = cmoic.brca,
878
                       test.expr   = brca.yau$mRNA.expr)
879
```
880
881
The `yau.pam.pred` object also stores `clust.res` which can be passed to other functions, and the users can check IGP information like below:
882
```{r, eval=TRUE}
883
print(yau.pam.pred$IGP)
884
```
885
886
#### <a id="Section.4.2.3.7" style="color:#159957;">6) run consistency evaluation using Kappa statistics</a>
887
Want to know how accurate the NTP or PAM is when using discovery cohort? Want to know how consistent the different prediction results are? Use `runKappa()` then:
888
```{r, fig.show = "hold", out.width = "33.3%", fig.align = "default", fig.width=10, fig.height=9, fig.cap="Figure 21. Consistency heatmap using Kappa statistics.", eval=TRUE}
889
# predict subtype in discovery cohort using NTP
890
tcga.ntp.pred <- runNTP(expr      = fpkm,
891
                        templates = marker.up$templates,
892
                        doPlot    = FALSE) 
893
894
# predict subtype in discovery cohort using PAM
895
tcga.pam.pred <- runPAM(train.expr  = fpkm,
896
                        moic.res    = cmoic.brca,
897
                        test.expr   = fpkm)
898
899
# check consistency between current and NTP-predicted subtype in discovery TCGA-BRCA
900
runKappa(subt1     = cmoic.brca$clust.res$clust,
901
         subt2     = tcga.ntp.pred$clust.res$clust,
902
         subt1.lab = "CMOIC",
903
         subt2.lab = "NTP",
904
         fig.name  = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and NTP")
905
906
# check consistency between current and PAM-predicted subtype in discovery TCGA-BRCA
907
runKappa(subt1     = cmoic.brca$clust.res$clust,
908
         subt2     = tcga.pam.pred$clust.res$clust,
909
         subt1.lab = "CMOIC",
910
         subt2.lab = "PAM",
911
         fig.name  = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and PAM")
912
913
# check consistency between NTP and PAM-predicted subtype in validation Yau-BRCA
914
runKappa(subt1     = yau.ntp.pred$clust.res$clust,
915
         subt2     = yau.pam.pred$clust.res$clust,
916
         subt1.lab = "NTP",
917
         subt2.lab = "PAM",
918
         fig.name  = "CONSISTENCY HEATMAP FOR YAU")
919
```
920
921
# <a id="Section.5" style="color:#159957;">5. Little Trick</a>
922
Indeed, the `moic.res` parameter is necessary for almost all the functions, especially for downstream analyses. You may deem that if the `moic.res` object is not obtained properly through the module of <span style="color:red">**GET**</span>, downstream analyses cannot run smoothly. But in fact, you can fool downstream modules (<span style="color:green">**COMP**</span> and <span style="color:blue">**RUN**</span>) with little tricks. The core information stored in `moic.res` is a data.frame named `clust.res` with a `samID` column (character) for samples name (should be exactly the same with row names) and a `clust` column (integer) for subtype indicator. Another information required for some functions in downstream analyses is `mo.method`, a string value that is usually considered as prefix for output files. For example, if I want to check the prognostic value of PAM50 classification, the only thing I have to prepare is a pseudo `moic.res` object with a customized `mo.method` just like below:
923
```{r, eval=TRUE}
924
# include original clinical information as `clust.res` and a string value for `mo.method` to a list
925
pseudo.moic.res                 <- list("clust.res" = surv.info,
926
                                        "mo.method" = "PAM50")
927
928
# make pseudo samID
929
pseudo.moic.res$clust.res$samID <- rownames(pseudo.moic.res$clust.res)
930
931
# make pseudo clust using a mapping relationship
932
pseudo.moic.res$clust.res$clust <- sapply(pseudo.moic.res$clust.res$PAM50,
933
                                          switch,
934
                                          "Basal"   = 1, # relabel Basal as 1
935
                                          "Her2"    = 2, # relabel Her2 as 2
936
                                          "LumA"    = 3, # relabel LumA as 3
937
                                          "LumB"    = 4, # relabel LumnB as 4
938
                                          "Normal"  = 5) # relabel Normal as 5
939
```
940
941
Let's check how the pseudo `moic.res` object looks like.
942
```{r, eval=TRUE}
943
head(pseudo.moic.res$clust.res)
944
```
945
946
```{r, echo=FALSE, eval=FALSE}
947
head(pseudo.moic.res$clust.res) %>%
948
  kbl(caption = "Table . Demo of pseudo object for downstream analyses in MOVICS.") %>%
949
  kable_classic(full_width = TRUE, html_font = "Calibri")
950
```
951
952
Ok everything is ready, just keep in mind how your interested subtypes are mapped and feel free to use such pseudo object to perform downstream analyses such as `compSurv()` as follows:
953
```{r, fig.align="center", fig.width=6, fig.height=7, fig.cap="Figure 22. Kaplan-Meier survival curve of PAM50 subtypes of breast cancer with pseudo input in TCGA-BRCA cohort.", eval=TRUE}
954
# survival comparison
955
pam50.brca <- compSurv(moic.res         = pseudo.moic.res,
956
                       surv.info        = surv.info,
957
                       convt.time       = "y", # convert day unit to year
958
                       surv.median.line = "h", # draw horizontal line at median survival
959
                       fig.name         = "KAPLAN-MEIER CURVE OF PAM50 BY PSEUDO")
960
```
961
962
# <a id="Section.6" style="color:#159957;">6. Summary</a>
963
I have introduced the pipeline and described the functions included in MOVICS in almost detail. This package is rather young and I hope to improve it further in the future, along with more functionality. If you have any questions, bug reports, or suggestions for improving MOVICS, please email them to xlu.cpu@foxmail.com. Feedback is crucial for improving a package that tries to seamlessly incorporate functionality and flexibility from many other useful tools.
964
965
# <a id="Section.7" style="color:#159957;">7. Session Information</a>
966
```{r, echo=TRUE, eval=TRUE}
967
sessionInfo()
968
```
969
970
```{r, echo=FALSE, eval=FALSE}
971
save.image("MOVICS.RData")
972
```
973
974
# <a id="Section.8" style="color:#159957;">8. Citing MOVICS</a> 
975
MOVICS is a wrapper incorporating algorithms and functions from many other sources. The majority of functions used in MOVICS are incorporated directly from other packages or are drawn from specific published algorithms. Hence, below I provide guidance on which papers should be cited alongside MOVICS when using these functions:
976
977
<p align="center">![cloud](pkg_cloud.jpg)</p>
978
979
For now, if you use MOVICS R package in published research, please cite:
980
981
  >+ Lu, X., Meng, J., Zhou, Y., Jiang, L., and Yan, F. (2020). MOVICS: an R package for multi-omics integration and visualization in cancer subtyping. bioRxiv, 2020.2009.2015.297820. [doi.org/10.1101/2020.09.15.297820]
982
  
983
Please cite corresponding literature below for multi-omic clustering algorithm if used:
984
985
  >+ CIMLR: Ramazzotti D, Lal A, Wang B, Batzoglou S, Sidow A (2018). Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nat Commun, 9(1):4453.
986
  + iClusterBayes: Mo Q, Shen R, Guo C, Vannucci M, Chan KS, Hilsenbeck SG (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics, 19(1):71-86.
987
  + Mocluster: Meng C, Helm D, Frejno M, Kuster B (2016). moCluster: Identifying Joint Patterns Across Multiple Omics Data Sets. J Proteome Res, 15(3):755-765.
988
  + COCA: Hoadley KA, Yau C, Wolf DM, et al (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158(4):929-944.
989
  + ConsensusClustering: Monti S, Tamayo P, Mesirov J, et al (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Mach Learn, 52:91-118.
990
  + IntNMF: Chalise P, Fridley BL (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PLoS One, 12(5):e0176278.
991
  + LRAcluster: Wu D, Wang D, Zhang MQ, Gu J (2015). Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics, 16(1):1022.
992
  + NEMO: Rappoport N, Shamir R (2019). NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics, 35(18):3348-3356.
993
  + PINSPlus: Nguyen H, Shrestha S, Draghici S, Nguyen T (2019). PINSPlus: a tool for tumor subtype discovery in integrated genomic data. Bioinformatics, 35(16):2843-2846.
994
  + SNF: Wang B, Mezlini AM, Demir F, et al (2014). Similarity network fusion for aggregating data types on a genomic scale. Nat Methods, 11(3):333-337.
995
  
996
Heatmap generated by MOVICS uses ComplexHeatmap R package, so please cite if any heatmap is created:
997
998
  >+ Gu Z, Eils R, Schlesner M (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32(18):2847–2849.
999
1000
If FLAGS is removed when calculating TMB, please cite the following two at the same time, otherwise only the first one:
1001
1002
  >+ Mayakonda A, Lin D C, Assenov Y, et al. (2018). Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res, 28(11):1747-1756.
1003
  + Shyr C, Tarailo-Graovac M, Gottlieb M, Lee JJ, van Karnebeek C, Wasserman WW. (2014). FLAGS, frequently mutated genes in public exomes. BMC Med Genomics, 7(1): 1-14.
1004
  
1005
If calculating FGA, please cite:
1006
1007
  >+ Cerami E, Gao J, Dogrusoz U, et al. (2012). The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov, 2(5):401-404.
1008
  + Gao J, Aksoy B A, Dogrusoz U, et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal, 6(269):pl1-pl1.
1009
  
1010
Drug sensitivity analysis is based on pRRophetic R package, so please cite:
1011
1012
  >+ Geeleher P, Cox N, Huang R S (2014). pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels. PLoS One, 9(9):e107468.
1013
  + Geeleher P, Cox N J, Huang R S (2014). Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol, 15(3):1-12.
1014
1015
\  
1016
Please cite corresponding literature below if using differential expression analysis:
1017
1018
  >+ edgeR: 
1019
  >   + Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140.
1020
  >   + McCarthy DJ, Chen Y, Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40(10):4288-4297.
1021
  + DESeq2: Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol, 15(12):550-558.
1022
  + limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43(7):e47.
1023
1024
\  
1025
Gene set enrichment analysis uses clusterProfiler R package with the following paper:
1026
1027
  >+ Yu G, Wang L, Han Y, He Q (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5):284-287.
1028
  + Subramanian A, Tamayo P, Mootha V K, et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci, 102(43):15545-15550.
1029
  
1030
\  
1031
For single sample enrichment analysis, please cite literature below appropriately:
1032
1033
   >+ ssgsea: Barbie, D.A. et al (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(5):108-112.
1034
   >+ gsva: Hänzelmann, S., Castelo, R. and Guinney, J. (2013). GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics, 14(1):7.
1035
   >+ zscore: Lee, E. et al (2008). Inferring pathway activity toward precise disease classification. PLoS Comp Biol, 4(11):e1000217.
1036
   >+ plage: Tomfohr, J. et al (2005). Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics, 6(1):1-11.
1037
1038
External validation uses nearest template prediction as implanted in CMScaller R package. Please cite:
1039
1040
  >+ Eide P W, Bruun J, Lothe R A, et al. (2017). CMScaller: an R package for consensus molecular subtyping of colorectal cancer pre-clinical models. Sci Rep, 7(1):1-8.
1041
  + Hoshida, Y. (2010). Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS One, 5(11):e15543.
1042
1043
If using partition around medoids classifier, then cite:
1044
1045
  >+ Tibshirani R, Hastie T, Narasimhan B and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci, 99,6567–6572.
1046
  + Kapp A V, Tibshirani R. (2007). Are clusters found in one dataset present in another dataset? Biostatistics, 8(1):9-31.
1047
1048
# <a id="Section.9" style="color:#159957;">REFERENCES</a> 
1049
>1. Pierre-Jean M, Deleuze J F, Le Floch E, et al (2019). Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Brief Bioinformatics.
1050
2. Rappoport N, Shamir R (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res, 46(20):10546-10562.
1051
3. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature, 2012, 490(7418):61-78.
1052
4. Yau C, Esserman L, Moore D H, et al. (2010). A multigene predictor of metastatic outcome in early stage hormone receptor-negative and triple-negative breast cancer. Breast Cancer Res, 5(12):1-15.
1053
5. Chalise P, Fridley BL (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PLoS One, 12(5):e0176278.
1054
6. Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. J R Stat Soc Series B Stat Methodol, 63(2):411-423.
1055
7. Strehl, Alexander; Ghosh, Joydeep. (2002). Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Res, 3(Dec):583–617.
1056
8. Le DT, Uram JN, Wang H, et al. (2015). PD-1 Blockade in tumors with mismatch-repair deficiency. N Engl J Med, 372(26):2509–2520.
1057
9. Davoli T, Uno H, Wooten E C, et al. (2017). Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science, 355(6322):261-U75.
1058
10. Geeleher P, Cox N J, Huang R S (2014). Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol, 15(3):1-12.
1059
11. Geeleher P, Cox N, Huang R S (2014). pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels. PLoS One, 9(9):e107468.
1060
12. Rand W M. (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc, 66(336):846-850.
1061
13. Vinh, N. X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison. Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09. p. 1.
1062
14. Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). Dictionary of Bioinformatics and Computational Biology.
1063
15. Fowlkes E B, Mallows C L. (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc, 78(383):553-569.
1064
16. Hoshida, Y. (2010). Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS One, 5(11):e15543.
1065
17. Tibshirani R, Hastie T, Narasimhan B and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci, 99(10):6567–6572.
1066
18. Kapp A V, Tibshirani R. (2007). Are clusters found in one dataset present in another dataset?, Biostatistics, 8(1):9-31.