|
a |
|
b/vignettes/vignette.Rmd |
|
|
1 |
--- |
|
|
2 |
title: "timeOmics" |
|
|
3 |
author: |
|
|
4 |
- name: "Antoine Bodein" |
|
|
5 |
affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada" |
|
|
6 |
email: "antoine.bodein.1@ulaval.ca" |
|
|
7 |
- name: "Olivier Chapleur" |
|
|
8 |
affiliation: "Hydrosystems and Biopresses Research Unit, Irstea, Antony, France" |
|
|
9 |
- name: "Kim-Anh Lê Cao" |
|
|
10 |
affiliation: "Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia" |
|
|
11 |
- name: "Arnaud Droit" |
|
|
12 |
affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada" |
|
|
13 |
package: timeOmics |
|
|
14 |
output: |
|
|
15 |
BiocStyle::html_document |
|
|
16 |
|
|
|
17 |
vignette: > |
|
|
18 |
%\VignetteIndexEntry{timeOmics} |
|
|
19 |
%\VignetteEngine{knitr::rmarkdown} |
|
|
20 |
%\VignetteEncoding{UTF-8} |
|
|
21 |
|
|
|
22 |
bibliography: ["mybib.bib"] |
|
|
23 |
biblio-style: apalike |
|
|
24 |
link-citations: true |
|
|
25 |
--- |
|
|
26 |
|
|
|
27 |
```{r, echo = FALSE} |
|
|
28 |
knitr::opts_chunk$set(eval = TRUE, |
|
|
29 |
echo = TRUE, |
|
|
30 |
fig.align = "center", |
|
|
31 |
warning = FALSE, |
|
|
32 |
message = FALSE) |
|
|
33 |
``` |
|
|
34 |
|
|
|
35 |
# Introduction |
|
|
36 |
|
|
|
37 |
***timeOmics*** is a generic data-driven framework to integrate multi-Omics longitudinal data (**A.**) measured on the same biological samples and select key temporal features with strong associations within the same sample group. |
|
|
38 |
|
|
|
39 |
The main steps of ***timeOmics*** are: |
|
|
40 |
|
|
|
41 |
* a pre-processing step (**B.**) Normalize and filter low-expressed features, except those not varying over time, |
|
|
42 |
* a modelling step (**C.**) Capture inter-individual variability in biological/technical replicates and accommodate heterogeneous experimental designs, |
|
|
43 |
* a clustering step (**D.**) Group features with the same expression profile over time. Feature selection step can also be used to identify a signature per cluster, |
|
|
44 |
* a post-hoc validation step (**E.**) Ensure clustering quality. |
|
|
45 |
|
|
|
46 |
This framework is presented on both single-Omic and multi-Omics situations. |
|
|
47 |
|
|
|
48 |
 |
|
|
49 |
|
|
|
50 |
For more details please check: |
|
|
51 |
*Bodein A, Chapleur O, Droit A and Lê Cao K-A (2019) A Generic Multivariate Framework for the Integration of Microbiome Longitudinal Studies With Other Data Types. Front. Genet. 10:963. <a href="http://dx.doi.org/10.3389/fgene.2019.00963"> doi:10.3389/fgene.2019.00963</a>* |
|
|
52 |
|
|
|
53 |
# Start |
|
|
54 |
|
|
|
55 |
## Installation |
|
|
56 |
|
|
|
57 |
### Lastest Bioconductor Release |
|
|
58 |
|
|
|
59 |
```r |
|
|
60 |
## install BiocManager if not installed |
|
|
61 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
|
|
62 |
install.packages("BiocManager") |
|
|
63 |
## install timeOmics |
|
|
64 |
BiocManager::install('timeOmics') |
|
|
65 |
``` |
|
|
66 |
### Lastest `Github` version |
|
|
67 |
|
|
|
68 |
```r |
|
|
69 |
install.packages("devtools") |
|
|
70 |
# then load |
|
|
71 |
library(devtools) |
|
|
72 |
install_github("abodein/timeOmics") |
|
|
73 |
``` |
|
|
74 |
|
|
|
75 |
## Load the package |
|
|
76 |
|
|
|
77 |
```{r, message=FALSE, warning=FALSE} |
|
|
78 |
library(timeOmics) |
|
|
79 |
``` |
|
|
80 |
|
|
|
81 |
## Useful package to run this vignette |
|
|
82 |
|
|
|
83 |
```{r, message=F} |
|
|
84 |
library(tidyverse) |
|
|
85 |
``` |
|
|
86 |
|
|
|
87 |
# Required data |
|
|
88 |
|
|
|
89 |
Each omics technology produces count or abundance tables with samples in rows and features in columns (genes, proteins, species, ...). |
|
|
90 |
In multi-Omics, each *block* has the same rows and a variable number of columns depending on the technology and number of identified features. |
|
|
91 |
|
|
|
92 |
We assume each *block* *(omics)* is a matrix/data.frame with samples in **rows** (similar in each *block*) and features in **columns** (variable number of column). Normalization steps applied to each block will be covered in the next section. |
|
|
93 |
|
|
|
94 |
For this example, we will use a part of simulated data based on the above-mentioned article and generated as follow: |
|
|
95 |
|
|
|
96 |
Twenty reference time profiles, were generated on 9 *equally spaced*\* time points and assigned to 4 clusters (5 profiles each). These ground truth profiles were then used to simulate new profiles. |
|
|
97 |
The profiles from the 5 individuals were then modelled with `lmms` [@straube2015linear]. |
|
|
98 |
Please check [@bodein2019generic] for more details about the simulated data. |
|
|
99 |
|
|
|
100 |
To illustrate the filtering step implemented later, we add an extra noisy profile resulting in a matrix of (9x5) x (20+1). |
|
|
101 |
|
|
|
102 |
\* *It is not mandatory to have equally spaced time points in your data.* |
|
|
103 |
|
|
|
104 |
```{r} |
|
|
105 |
data("timeOmics.simdata") |
|
|
106 |
sim.data <- timeOmics.simdata$sim |
|
|
107 |
|
|
|
108 |
dim(sim.data) |
|
|
109 |
head(sim.data[,1:6]) |
|
|
110 |
``` |
|
|
111 |
|
|
|
112 |
|
|
|
113 |
# Data preprocessing |
|
|
114 |
|
|
|
115 |
Every analysis starts with a pre-processing step that includes normalization and data cleaning. |
|
|
116 |
In longitudinal multi-omics analysis we have a 2-step pre-processing procedure. |
|
|
117 |
|
|
|
118 |
## Platform-specific |
|
|
119 |
|
|
|
120 |
Platform-specific pre-processing is the type of normalization normally used without time component. |
|
|
121 |
It may differ depending on the type of technology. |
|
|
122 |
|
|
|
123 |
The user can apply normalization steps *(log, scale, rle, ...)* and filtering steps *(low count removal, ...)*. |
|
|
124 |
|
|
|
125 |
It is also possible to handle microbiome data with Centered Log Ratio transformation as described [here](http://mixomics.org/mixmc/pre-processing/). |
|
|
126 |
|
|
|
127 |
That is why we let the user apply their favorite method of normalization. |
|
|
128 |
|
|
|
129 |
|
|
|
130 |
## Time-specific |
|
|
131 |
|
|
|
132 |
In a longitudinal context, one can be interested only in features that vary over time and filter out molecules with a low variation coefficient. |
|
|
133 |
|
|
|
134 |
To do so, we can first naively set a threshold on the variation coefficient and keep those features that exceed the threshold. |
|
|
135 |
|
|
|
136 |
```{r} |
|
|
137 |
remove.low.cv <- function(X, cutoff = 0.5){ |
|
|
138 |
# var.coef |
|
|
139 |
cv <- unlist(lapply(as.data.frame(X), |
|
|
140 |
function(x) abs(sd(x)/mean(x)))) |
|
|
141 |
return(X[,cv > cutoff]) |
|
|
142 |
} |
|
|
143 |
|
|
|
144 |
data.filtered <- remove.low.cv(sim.data, 0.5) |
|
|
145 |
``` |
|
|
146 |
|
|
|
147 |
|
|
|
148 |
# Time Modelling |
|
|
149 |
|
|
|
150 |
The next step is the modelling of each feature (molecule) as a function of time. |
|
|
151 |
|
|
|
152 |
We rely on a *Linear Mixed Model Splines* framework (package `lmms`) to model the features expression as a function of time by taking into account inter-individual variability. |
|
|
153 |
|
|
|
154 |
`lmms` fits 4 different types of models described and indexed as below and assigns the best fit for each of the feature. |
|
|
155 |
|
|
|
156 |
* 0 = linear model, |
|
|
157 |
* 1 = linear mixed effect model spline (LMMS) with defined basis, |
|
|
158 |
* 2 = LMMS taking subject-specific random intercepts, |
|
|
159 |
* 3 = LMMS with subject-specific intercept and slope. |
|
|
160 |
|
|
|
161 |
The package also has an interesting feature for filtering profiles which are not differentially expressed over time, with statistical testing (see `lmms::lmmsDE`). |
|
|
162 |
|
|
|
163 |
Once run, `lmms` summarizes each feature into a unique time profile. |
|
|
164 |
|
|
|
165 |
## `lmms` example |
|
|
166 |
|
|
|
167 |
`lmms` requires a data.frame with features in columns and samples in rows. |
|
|
168 |
|
|
|
169 |
For more information about `lmms` modelling parameters, please check `?lmms::lmmSpline` |
|
|
170 |
|
|
|
171 |
*** Package `lmms` was removed from the CRAN repository (Archived on 2020-09-11). |
|
|
172 |
https://cran.r-project.org/web/packages/lmms/index.html *** |
|
|
173 |
|
|
|
174 |
`lmms` package is still available and can be installed as follow: |
|
|
175 |
|
|
|
176 |
```r |
|
|
177 |
devtools::install_github("cran/lmms") |
|
|
178 |
library(lmms) |
|
|
179 |
``` |
|
|
180 |
|
|
|
181 |
|
|
|
182 |
```{r, message=FALSE} |
|
|
183 |
# numeric vector containing the sample time point information |
|
|
184 |
time <- timeOmics.simdata$time |
|
|
185 |
head(time) |
|
|
186 |
``` |
|
|
187 |
|
|
|
188 |
```{r,eval=FALSE} |
|
|
189 |
# example of lmms |
|
|
190 |
lmms.output <- lmms::lmmSpline(data = data.filtered, time = time, |
|
|
191 |
sampleID = rownames(data.filtered), deri = FALSE, |
|
|
192 |
basis = "p-spline", numCores = 4, timePredict = 1:9, |
|
|
193 |
keepModels = TRUE) |
|
|
194 |
modelled.data <- t(slot(lmms.output, 'predSpline')) |
|
|
195 |
``` |
|
|
196 |
|
|
|
197 |
```{r, warning=FALSE, message=FALSE} |
|
|
198 |
lmms.output <- timeOmics.simdata$lmms.output |
|
|
199 |
modelled.data <- timeOmics.simdata$modelled |
|
|
200 |
``` |
|
|
201 |
|
|
|
202 |
The `lmms` object provides a list of models for each feature. |
|
|
203 |
It also includes the new predicted splines *(modelled data)* in the `predSpline` slot. |
|
|
204 |
The produced table contains features in columns and now, **times in rows**. |
|
|
205 |
|
|
|
206 |
Let's plot the modeled profiles. |
|
|
207 |
|
|
|
208 |
```{r} |
|
|
209 |
# gather data |
|
|
210 |
data.gathered <- modelled.data %>% as.data.frame() %>% |
|
|
211 |
rownames_to_column("time") %>% |
|
|
212 |
mutate(time = as.numeric(time)) %>% |
|
|
213 |
pivot_longer(names_to="feature", values_to = 'value', -time) |
|
|
214 |
|
|
|
215 |
# plot profiles |
|
|
216 |
ggplot(data.gathered, aes(x = time, y = value, color = feature)) + geom_line() + |
|
|
217 |
theme_bw() + ggtitle("`lmms` profiles") + ylab("Feature expression") + |
|
|
218 |
xlab("Time") |
|
|
219 |
``` |
|
|
220 |
|
|
|
221 |
## Profile filtering |
|
|
222 |
|
|
|
223 |
Straight line modelling can occur when the inter-individual variation is too high. |
|
|
224 |
To remove the noisy profiles, we have implemented a 2-phase test procedure. |
|
|
225 |
|
|
|
226 |
* *Breusch-Pagan test*, which tests the homoscedasticity of the residuals. |
|
|
227 |
* *Cutoff on MSE* (mean squared error), to remove feature for which the residuals are too dispersed arround the fitted line. |
|
|
228 |
This threshold is determined automatically. The MSE for a linear model must not exceed the MSE of more complex models. |
|
|
229 |
|
|
|
230 |
```{r} |
|
|
231 |
filter.res <- lmms.filter.lines(data = data.filtered, |
|
|
232 |
lmms.obj = lmms.output, time = time) |
|
|
233 |
profile.filtered <- filter.res$filtered |
|
|
234 |
``` |
|
|
235 |
|
|
|
236 |
# Single-Omic longitudinal clustering |
|
|
237 |
|
|
|
238 |
To achieve clustering with multivariate ordination methods, we rely on the `mixOmics` package [@rohart2017mixomics]. |
|
|
239 |
|
|
|
240 |
## Principal Component Analysis |
|
|
241 |
|
|
|
242 |
From the modelled data, we use a PCA to cluster features with similar expression profiles over time. |
|
|
243 |
|
|
|
244 |
PCA is an unsupervised reduction dimension technique which uses uncorrelated |
|
|
245 |
intrumental variable (i.e. principal components) to summarize as much information |
|
|
246 |
(*variance*) as possible from the data. |
|
|
247 |
|
|
|
248 |
In PCA, each component is associated to a loading vector of length P (number of features/profiles). |
|
|
249 |
For a given set of component, we can extract a set of strongly correlated profiles by |
|
|
250 |
considering features with the top absolute coefficients in the loading vectors. |
|
|
251 |
|
|
|
252 |
Those profiles are linearly combined to define each component, and thus, explain similar information on a given component. |
|
|
253 |
Different clusters are therefore obtained on each component of the PCA. |
|
|
254 |
Each cluster is then further separated into two sets of profiles which we denote as “positive” or “negative” based on the sign of the coefficients in the loading vectors |
|
|
255 |
Sign indicates how the features can be assign into 2 clusters. |
|
|
256 |
|
|
|
257 |
At the end of this procedure, each component create 2 clusters and each feature is assigned to a cluster according to the maximum contribution on a component and the sign of that contribution. |
|
|
258 |
|
|
|
259 |
*(see also `?mixOmics::pca` for more details about PCA and available options)* |
|
|
260 |
|
|
|
261 |
|
|
|
262 |
### Longitudinal clustering |
|
|
263 |
|
|
|
264 |
To optimize the number of clusters, the number of PCA components needs to be optimized (`getNcomp`). |
|
|
265 |
The quality of clustering is assessed using the silhouette coefficient. |
|
|
266 |
The number of components that maximizes the silhouette coefficient will provide the best clustering. |
|
|
267 |
|
|
|
268 |
|
|
|
269 |
```{r} |
|
|
270 |
# run pca |
|
|
271 |
pca.res <- pca(X = profile.filtered, ncomp = 5, scale=FALSE, center=FALSE) |
|
|
272 |
|
|
|
273 |
# tuning ncomp |
|
|
274 |
pca.ncomp <- getNcomp(pca.res, max.ncomp = 5, X = profile.filtered, |
|
|
275 |
scale = FALSE, center=FALSE) |
|
|
276 |
|
|
|
277 |
pca.ncomp$choice.ncomp |
|
|
278 |
plot(pca.ncomp) |
|
|
279 |
``` |
|
|
280 |
|
|
|
281 |
In this plot, we can observe that the highest silhouette coefficient is achieved when `ncomp = 2` (4 clusters). |
|
|
282 |
|
|
|
283 |
```{r} |
|
|
284 |
# final model |
|
|
285 |
pca.res <- pca(X = profile.filtered, ncomp = 2, scale = FALSE, center=FALSE) |
|
|
286 |
``` |
|
|
287 |
|
|
|
288 |
All information about the cluster is displayed below (`getCluster`). |
|
|
289 |
Once run, the procedure will indicate the assignement of each molecule to either the `positive` or `negative` cluster of a given component based on the sign of loading vector (contribution). |
|
|
290 |
|
|
|
291 |
```{r} |
|
|
292 |
# extract cluster |
|
|
293 |
pca.cluster <- getCluster(pca.res) |
|
|
294 |
head(pca.cluster) |
|
|
295 |
``` |
|
|
296 |
|
|
|
297 |
### A word about the multivariate models |
|
|
298 |
|
|
|
299 |
Multivariate models provide a set of graphical methods to extract useful information about samples or variables (R functions from mixOmics). |
|
|
300 |
|
|
|
301 |
The sample plot, or more accurately here, the timepoint plot projects the samples/timpoints into the reduced space represented by the principal components (or latent structures). |
|
|
302 |
It displays the similarity (points are closed to each other) or dissimilarities between samples/timepoints. |
|
|
303 |
|
|
|
304 |
```{r} |
|
|
305 |
plotIndiv(pca.res) |
|
|
306 |
``` |
|
|
307 |
|
|
|
308 |
Associations between variables can be displayed on a circle correlation. |
|
|
309 |
The variables are projected on the plane defined two principal components. |
|
|
310 |
Their projections are inside a circle of radius 1 centered and of unit variance. |
|
|
311 |
Strongly associated (or correlated) variables are projected in the same direction from the origin. The greater the distance from the origin the stronger the association. |
|
|
312 |
|
|
|
313 |
```{r} |
|
|
314 |
plotVar(pca.res) |
|
|
315 |
``` |
|
|
316 |
|
|
|
317 |
Lastly, the strenght of the variables on a component can be displayed by an horizontal barplot. |
|
|
318 |
|
|
|
319 |
```{r} |
|
|
320 |
plotLoadings(pca.res) |
|
|
321 |
``` |
|
|
322 |
|
|
|
323 |
|
|
|
324 |
|
|
|
325 |
### Plot PCA longitudinal clusters |
|
|
326 |
|
|
|
327 |
Clustered profiles can be displayed with `plotLong`. |
|
|
328 |
|
|
|
329 |
The user can set the parameters `scale` and `center` to scale/center all time profiles. |
|
|
330 |
|
|
|
331 |
(*See also `mixOmics::plotVar(pca.res)` for variable representation*) |
|
|
332 |
|
|
|
333 |
```{r} |
|
|
334 |
plotLong(pca.res, scale = FALSE, center = FALSE, |
|
|
335 |
title = "PCA longitudinal clustering") |
|
|
336 |
``` |
|
|
337 |
|
|
|
338 |
|
|
|
339 |
## *sparse* PCA |
|
|
340 |
|
|
|
341 |
The previous clustering used all features. |
|
|
342 |
*sparse* PCA is an optional step to define a cluster signature per cluster. |
|
|
343 |
It selects the features with the highest loading scores for each component in order to determine a signature. |
|
|
344 |
|
|
|
345 |
*(see also `?mixOmics::spca` for more details about sPCA and available options)* |
|
|
346 |
|
|
|
347 |
### `keepX` optimization |
|
|
348 |
|
|
|
349 |
To find the right number of features to keep per component (`keepX`) and thus per cluster, the silhouette coefficient is assessed for a list of selected features (`test.keepX`) on each component. |
|
|
350 |
|
|
|
351 |
We plot below the silhouette coefficient corresponding to each sub-cluster (positive or negative contibution) with respect to the number of features selected. A large decrease indicates that the clusters are not homogeneous and therefore the new added features should not be included in the final model. |
|
|
352 |
|
|
|
353 |
This method tends to select the smallest possible signature, so if the user wishes to set a minimum number of features per component, this parameter will have to be adjusted accordingly. |
|
|
354 |
|
|
|
355 |
|
|
|
356 |
```{r} |
|
|
357 |
tune.spca.res <- tuneCluster.spca(X = profile.filtered, ncomp = 2, |
|
|
358 |
test.keepX = c(2:10)) |
|
|
359 |
# selected features in each component |
|
|
360 |
tune.spca.res$choice.keepX |
|
|
361 |
plot(tune.spca.res) |
|
|
362 |
``` |
|
|
363 |
|
|
|
364 |
In the above graph, evolution of silhouette coefficient per component and per contribution is plotted as a function of `keepX`. |
|
|
365 |
|
|
|
366 |
```{r} |
|
|
367 |
# final model |
|
|
368 |
spca.res <- spca(X = profile.filtered, ncomp = 2, |
|
|
369 |
keepX = tune.spca.res$choice.keepX, scale = FALSE) |
|
|
370 |
plotLong(spca.res) |
|
|
371 |
``` |
|
|
372 |
|
|
|
373 |
|
|
|
374 |
# multi-Omics longitudinal clustering |
|
|
375 |
|
|
|
376 |
In this type of scenario, the user has 2 or more blocks of omics data from the same experiment |
|
|
377 |
(i. e. gene expression and metabolite concentration) and he is interested in discovering |
|
|
378 |
which genes and metabolites have a common expression profile. |
|
|
379 |
This may reveal dynamic biological mechanisms. |
|
|
380 |
|
|
|
381 |
The clustering strategy with more than one block of data is the same as longitudinal clustering with PCA and is based on integrative methods using Projection on Latent Structures (PLS). |
|
|
382 |
|
|
|
383 |
With 2 blocks, it is then necessary to use PLS. |
|
|
384 |
With more than 2 blocks, the user has to use a multi-block PLS. |
|
|
385 |
|
|
|
386 |
|
|
|
387 |
## Projection on Latent Structures (PLS) |
|
|
388 |
|
|
|
389 |
In the following section, *PLS* is used to cluster time profiles coming from **2 blocks** of data. |
|
|
390 |
*PLS* accepts 2 data.frames with the same number of rows (corresponding samples). |
|
|
391 |
|
|
|
392 |
*(see also `?mixOmics::pls` for more details about PLS and available options)* |
|
|
393 |
|
|
|
394 |
### Ncomp and Clustering |
|
|
395 |
|
|
|
396 |
Like *PCA*, number of components of PLS model and thus number of clusters needs to be optimized (`getNcomp`). |
|
|
397 |
|
|
|
398 |
```{r} |
|
|
399 |
X <- profile.filtered |
|
|
400 |
Y <- timeOmics.simdata$Y |
|
|
401 |
|
|
|
402 |
pls.res <- pls(X,Y, ncomp = 5, scale = FALSE) |
|
|
403 |
pls.ncomp <- getNcomp(pls.res, max.ncomp = 5, X=X, Y=Y, scale = FALSE) |
|
|
404 |
pls.ncomp$choice.ncomp |
|
|
405 |
plot(pls.ncomp) |
|
|
406 |
``` |
|
|
407 |
|
|
|
408 |
In this plot, we can observe that the highest silhouette coefficient is achieved when `ncomp = 2` (4 clusters). |
|
|
409 |
|
|
|
410 |
```{r} |
|
|
411 |
# final model |
|
|
412 |
pls.res <- pls(X,Y, ncomp = 2, scale = FALSE) |
|
|
413 |
|
|
|
414 |
# info cluster |
|
|
415 |
head(getCluster(pls.res)) |
|
|
416 |
# plot clusters |
|
|
417 |
plotLong(pls.res, title = "PLS longitudinal clustering", legend = TRUE) |
|
|
418 |
``` |
|
|
419 |
|
|
|
420 |
### Signature with *sparse* PLS |
|
|
421 |
|
|
|
422 |
As with *PCA*, it is possible to use the *sparse* PLS to get a signature of the clusters. |
|
|
423 |
|
|
|
424 |
`tuneCluster.spls` choose the correct number of feature to keep on block X `test.keepX` as well as |
|
|
425 |
the correct number of feature to keep on block Y `test.keepY` among a list provided by the user and are tested for each of the components. |
|
|
426 |
|
|
|
427 |
*(see also `?mixOmics::spls` for more details about `spls` and available options)* |
|
|
428 |
|
|
|
429 |
```{r} |
|
|
430 |
tune.spls <- tuneCluster.spls(X, Y, ncomp = 2, test.keepX = c(4:10), test.keepY <- c(2,4,6)) |
|
|
431 |
|
|
|
432 |
# selected features in each component on block X |
|
|
433 |
tune.spls$choice.keepX |
|
|
434 |
# selected features in each component on block Y |
|
|
435 |
tune.spls$choice.keepY |
|
|
436 |
|
|
|
437 |
# final model |
|
|
438 |
spls.res <- spls(X,Y, ncomp = 2, scale = FALSE, |
|
|
439 |
keepX = tune.spls$choice.keepX, keepY = tune.spls$choice.keepY) |
|
|
440 |
|
|
|
441 |
# spls cluster |
|
|
442 |
spls.cluster <- getCluster(spls.res) |
|
|
443 |
|
|
|
444 |
# longitudinal cluster plot |
|
|
445 |
plotLong(spls.res, title = "sPLS clustering") |
|
|
446 |
``` |
|
|
447 |
|
|
|
448 |
## Multi-block (s)PLS longitudinal clustering |
|
|
449 |
|
|
|
450 |
With more than **2 blocks** of data, it is necessary to use *multi-block PLS* to identify cluster of similar profile from **3 and more blocks** of data. |
|
|
451 |
|
|
|
452 |
This methods accepts a list of data.frame as `X` (same corresponding rows) and a Y data.frame. |
|
|
453 |
|
|
|
454 |
*(see also `?mixOmics::block.pls` for more details about block PLS and available options)* |
|
|
455 |
|
|
|
456 |
### Ncomp and Clustering |
|
|
457 |
|
|
|
458 |
```{r} |
|
|
459 |
X <- list("X" = profile.filtered, "Z" = timeOmics.simdata$Z) |
|
|
460 |
Y <- as.matrix(timeOmics.simdata$Y) |
|
|
461 |
|
|
|
462 |
block.pls.res <- block.pls(X=X, Y=Y, ncomp = 5, |
|
|
463 |
scale = FALSE, mode = "canonical") |
|
|
464 |
block.ncomp <- getNcomp(block.pls.res,X=X, Y=Y, |
|
|
465 |
scale = FALSE, mode = "canonical") |
|
|
466 |
block.ncomp$choice.ncomp |
|
|
467 |
plot(block.ncomp) |
|
|
468 |
``` |
|
|
469 |
|
|
|
470 |
In this plot, we can observe that the highest silhouette coefficient is achieved when `ncomp = 1` (2 clusters). |
|
|
471 |
|
|
|
472 |
```{r} |
|
|
473 |
# final model |
|
|
474 |
block.pls.res <- block.pls(X=X, Y=Y, ncomp = 1, scale = FALSE, mode = "canonical") |
|
|
475 |
# block.pls cluster |
|
|
476 |
block.pls.cluster <- getCluster(block.pls.res) |
|
|
477 |
|
|
|
478 |
# longitudinal cluster plot |
|
|
479 |
plotLong(block.pls.res) |
|
|
480 |
``` |
|
|
481 |
|
|
|
482 |
### Signature with multi-block *sparse* PLS |
|
|
483 |
|
|
|
484 |
As with *PCA* and *PLS*, it is possible to use the *sparse* multi-block PLS to get a signature of the clusters. |
|
|
485 |
|
|
|
486 |
`tuneCluster.block.spls` choose the correct number of feature to keep on each block of X `test.keepX` as well as |
|
|
487 |
the correct number of feature to keep on block Y `test.keepY` among a list provided by the user. |
|
|
488 |
|
|
|
489 |
*(see also `?mixOmics::block.spls` for more details about block sPLS and available options)* |
|
|
490 |
|
|
|
491 |
|
|
|
492 |
```{r} |
|
|
493 |
test.list.keepX <- list("X" = 4:10, "Z" = c(2,4,6,8)) |
|
|
494 |
test.keepY <- c(2,4,6) |
|
|
495 |
|
|
|
496 |
tune.block.res <- tuneCluster.block.spls(X= X, Y= Y, |
|
|
497 |
test.list.keepX=test.list.keepX, |
|
|
498 |
test.keepY= test.keepY, |
|
|
499 |
scale=FALSE, |
|
|
500 |
mode = "canonical", ncomp = 1) |
|
|
501 |
# ncomp = 1 given by the getNcomp() function |
|
|
502 |
|
|
|
503 |
# selected features in each component on block X |
|
|
504 |
tune.block.res$choice.keepX |
|
|
505 |
# selected features in each component on block Y |
|
|
506 |
tune.block.res$choice.keepY |
|
|
507 |
|
|
|
508 |
# final model |
|
|
509 |
block.pls.res <- block.spls(X=X, Y=Y, |
|
|
510 |
ncomp = 1, |
|
|
511 |
scale = FALSE, |
|
|
512 |
mode = "canonical", |
|
|
513 |
keepX = tune.block.res$choice.keepX, |
|
|
514 |
keepY = tune.block.res$choice.keepY) |
|
|
515 |
|
|
|
516 |
head(getCluster(block.pls.res)) |
|
|
517 |
plotLong(block.pls.res) |
|
|
518 |
``` |
|
|
519 |
|
|
|
520 |
|
|
|
521 |
# Post-hoc evaluation |
|
|
522 |
|
|
|
523 |
Interpretation based on correlations between profiles must be made with caution as it is highly likely to be |
|
|
524 |
spurious. Proportional distances has been proposed as an alternative to measure association a posteriori on the identified signature. |
|
|
525 |
|
|
|
526 |
In the following graphs, we represent all the proportionality distance within clusters and the distance of features inside the clusters with entire background set. |
|
|
527 |
|
|
|
528 |
We also use a Wilcoxon U-test to compare the within cluster median compared to the entire background set. |
|
|
529 |
|
|
|
530 |
|
|
|
531 |
```{r, eval=TRUE} |
|
|
532 |
# example fro multiblock analysis |
|
|
533 |
res <- proportionality(block.pls.res) |
|
|
534 |
# distance between pairs of features |
|
|
535 |
head(res$propr.distance)[1:6] |
|
|
536 |
|
|
|
537 |
# u-test pvalue by clusters |
|
|
538 |
pval.propr <- res$pvalue |
|
|
539 |
knitr::kable(pval.propr) |
|
|
540 |
plot(res) |
|
|
541 |
``` |
|
|
542 |
|
|
|
543 |
In addition to the Wilcoxon test, proportionality distance dispersion within and with entire background set is represented by cluster in the above graph. |
|
|
544 |
|
|
|
545 |
Here, for cluster `1`, the proportionality distance is calculated between pairs of feature from the same cluster `1` (inside). |
|
|
546 |
Then the distance is calculated between each feature of cluster `1` and every feature of cluster `-1` (outside). |
|
|
547 |
|
|
|
548 |
The same is applied on features from cluster `-1`. |
|
|
549 |
|
|
|
550 |
So we see that the intra-cluster distance is lower than the distances with the entire background set. |
|
|
551 |
Which is confirmed by the Wilcoxon test and this ensures a good clustering. |
|
|
552 |
|
|
|
553 |
# References |