---
title: "timeOmics"
author: 
- name: "Antoine Bodein"
  affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada"
  email: "antoine.bodein.1@ulaval.ca"
- name: "Olivier Chapleur"
  affiliation: "Hydrosystems and Bioprocesses Research Unit, Irstea, Antony, France"
- name: "Kim-Anh Lê Cao"
  affiliation: "Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia"
- name: "Arnaud Droit"
  affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada"
package: timeOmics
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{timeOmics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ["mybib.bib"]
biblio-style: apalike
link-citations: true
---

```{r, echo = FALSE}
knitr::opts_chunk$set(eval = TRUE, 
                      echo = TRUE,
                      fig.align = "center",
                      warning = FALSE,
                      message = FALSE)
```

# Introduction

***timeOmics*** is a generic data-driven framework to integrate multi-Omics longitudinal data (**A.**) measured on the same biological samples and to select key temporal features with strong associations within the same sample group.

The main steps of ***timeOmics*** are:

* a pre-processing step (**B.**) to normalize and filter low-expressed features, as well as features that do not vary over time,
* a modelling step (**C.**) to capture inter-individual variability in biological/technical replicates and accommodate heterogeneous experimental designs,
* a clustering step (**D.**) to group features with the same expression profile over time; a feature selection step can also be used to identify a signature per cluster,
* a post-hoc validation step (**E.**) to ensure clustering quality.

This framework is presented in both single-Omics and multi-Omics situations.

![Framework Overview](./img/method_overview.png)

For more details please check:  
*Bodein A, Chapleur O, Droit A and Lê Cao K-A (2019) A Generic Multivariate Framework for the Integration of Microbiome Longitudinal Studies With Other Data Types. Front. Genet. 10:963. [doi:10.3389/fgene.2019.00963](http://dx.doi.org/10.3389/fgene.2019.00963)*

# Start

## Installation

### Latest Bioconductor Release

```r
## install BiocManager if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## install timeOmics
BiocManager::install('timeOmics')
```

### Latest `GitHub` version

```r
## install devtools if not installed
install.packages("devtools")
# then load it and install timeOmics from GitHub
library(devtools)
install_github("abodein/timeOmics")
```

## Load the package

```{r, message=FALSE, warning=FALSE}
library(timeOmics)
```

## Useful packages to run this vignette

```{r, message=FALSE}
library(tidyverse)
```

# Required data

Each omics technology produces count or abundance tables with samples in rows and features in columns (genes, proteins, species, ...).
In multi-Omics, each *block* has the same rows and a variable number of columns depending on the technology and the number of identified features.

We assume that each *block* *(omics)* is a matrix/data.frame with samples in **rows** (identical across *blocks*) and features in **columns** (variable number of columns). Normalization steps applied to each block are covered in the next section.

For this example, we use a subset of simulated data based on the above-mentioned article, generated as follows:

Twenty reference time profiles were generated on 9 *equally spaced*\* time points and assigned to 4 clusters (5 profiles each). These ground-truth profiles were then used to simulate new profiles for 5 individuals.
The profiles of the 5 individuals were then modelled with `lmms` [@straube2015linear]. 
Please check [@bodein2019generic] for more details about the simulated data.

To illustrate the filtering step implemented later, we add an extra noisy profile, resulting in a matrix of (9x5) x (20+1).

\* *It is not mandatory to have equally spaced time points in your data.*

```{r}
data("timeOmics.simdata")
sim.data <- timeOmics.simdata$sim

dim(sim.data) 
head(sim.data[,1:6])
```

# Data preprocessing

Every analysis starts with a pre-processing step that includes normalization and data cleaning.
In longitudinal multi-omics analysis, we use a 2-step pre-processing procedure.

## Platform-specific

Platform-specific pre-processing is the type of normalization normally used in the absence of a time component.
It may differ depending on the type of technology.

The user can apply normalization steps *(log, scale, rle, ...)* and filtering steps *(low count removal, ...)*.

It is also possible to handle microbiome data with the Centered Log Ratio transformation, as described [here](http://mixomics.org/mixmc/pre-processing/).
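
The idea of the CLR can be sketched in a few lines of base R: each count is log-transformed and centred by the sample's mean log, which is equivalent to dividing by the sample's geometric mean. The OTU names and counts below are made up for illustration.

```r
# minimal CLR sketch on one hypothetical microbiome sample (toy counts)
counts <- c(OTU1 = 120, OTU2 = 35, OTU3 = 8, OTU4 = 540)
# log of each count minus the mean log = log(count / geometric mean)
clr <- log(counts) - mean(log(counts))
sum(clr)  # CLR values are centred: they sum to (numerically) zero
```

In practice, zero counts must be handled (e.g. with a pseudo-count) before taking logs.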

Because the appropriate choice depends on the technology, we let the user apply their preferred normalization method.

## Time-specific

In a longitudinal context, one may only be interested in features that vary over time, and filter out molecules with a low coefficient of variation.

To do so, we can first naively set a threshold on the coefficient of variation and keep only the features that exceed it.

```{r}
remove.low.cv <- function(X, cutoff = 0.5){
  # absolute coefficient of variation (sd/mean) of each column
  cv <- unlist(lapply(as.data.frame(X), 
                      function(x) abs(sd(x)/mean(x))))
  return(X[, cv > cutoff])
}

data.filtered <- remove.low.cv(sim.data, 0.5)
```
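
To make the criterion concrete, here is a toy check (with made-up values): a flat profile falls below the 0.5 cutoff and would be filtered out, while a varying one exceeds it and is kept.

```r
# toy check of the coefficient-of-variation criterion (made-up values)
flat    <- c(1.0, 1.1, 0.9, 1.0)   # barely changes over time
varying <- c(0.2, 1.5, 3.1, 0.4)   # strong temporal variation
abs(sd(flat) / mean(flat))         # ~0.08, below the cutoff: filtered out
abs(sd(varying) / mean(varying))   # ~1.02, above the cutoff: kept
```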

# Time Modelling

The next step is the modelling of each feature (molecule) as a function of time.

We rely on the *Linear Mixed Model Splines* framework (package `lmms`) to model feature expression as a function of time, taking inter-individual variability into account.

`lmms` fits 4 different types of models, described and indexed as below, and assigns the best fit to each feature.

* 0 = linear model, 
* 1 = linear mixed effect model spline (LMMS) with defined basis, 
* 2 = LMMS taking subject-specific random intercepts, 
* 3 = LMMS with subject-specific intercept and slope.

The package also has an interesting feature for filtering profiles which are not differentially expressed over time, with statistical testing (see `lmms::lmmsDE`).

Once run, `lmms` summarizes each feature into a unique time profile.

## `lmms` example

`lmms` requires a data.frame with features in columns and samples in rows.

For more information about the `lmms` modelling parameters, please check `?lmms::lmmSpline`.

***Package `lmms` was removed from the CRAN repository (archived on 2020-09-11, https://cran.r-project.org/web/packages/lmms/index.html).***

The `lmms` package is still available and can be installed as follows:

```r
devtools::install_github("cran/lmms")
library(lmms)
```

```{r, message=FALSE}
# numeric vector containing the sample time point information
time <- timeOmics.simdata$time
head(time)
```

```{r, eval=FALSE}
# example of lmms modelling (not evaluated here)
lmms.output <- lmms::lmmSpline(data = data.filtered, time = time,
                         sampleID = rownames(data.filtered), deri = FALSE,
                         basis = "p-spline", numCores = 4, timePredict = 1:9,
                         keepModels = TRUE)
modelled.data <- t(slot(lmms.output, 'predSpline'))
```

```{r, warning=FALSE, message=FALSE}
# precomputed lmms output shipped with the package data
lmms.output <- timeOmics.simdata$lmms.output
modelled.data <- timeOmics.simdata$modelled
```

The `lmms` object provides a list of models for each feature. 
It also includes the predicted splines *(modelled data)* in the `predSpline` slot.
The resulting table contains features in columns and, now, **times in rows**.

Let's plot the modelled profiles.

```{r}
# gather data into long format for ggplot2
data.gathered <- modelled.data %>% as.data.frame() %>% 
  rownames_to_column("time") %>%
  mutate(time = as.numeric(time)) %>%
  pivot_longer(names_to = "feature", values_to = 'value', -time)

# plot profiles
ggplot(data.gathered, aes(x = time, y = value, color = feature)) + geom_line() +
  theme_bw() + ggtitle("`lmms` profiles") + ylab("Feature expression") +
  xlab("Time")
```

## Profile filtering

Straight-line modelling can occur when the inter-individual variation is too high.
To remove noisy profiles, we have implemented a 2-phase test procedure:

* a *Breusch-Pagan test*, which tests the homoscedasticity of the residuals,
* a *cutoff on the MSE* (mean squared error), to remove features whose residuals are too dispersed around the fitted line.
This threshold is determined automatically: the MSE of a linear model must not exceed the MSE of the more complex models.

```{r}
filter.res <- lmms.filter.lines(data = data.filtered, 
                                lmms.obj = lmms.output, time = time)
profile.filtered <- filter.res$filtered
```
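
To build intuition for the MSE criterion, here is a toy base-R check (all values made up, and not the exact computation performed by `lmms.filter.lines`): the residual dispersion around a straight-line fit is near zero for a clear trend, but large for a noise-driven profile.

```r
# toy illustration of the MSE criterion (made-up values)
time.pts <- 1:9
smooth <- 2 + 0.5 * time.pts                                  # clear trend
noisy  <- c(2.1, -0.5, 1.8, 0.3, -1.2, 2.0, -0.8, 1.1, 0.1)   # no trend
mse <- function(y) mean(residuals(lm(y ~ time.pts))^2)
mse(smooth)  # ~0: residuals hug the fitted line
mse(noisy)   # large: dispersed residuals, the profile would be flagged
```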

# Single-Omic longitudinal clustering

To achieve clustering with multivariate ordination methods, we rely on the `mixOmics` package [@rohart2017mixomics].

## Principal Component Analysis

From the modelled data, we use a PCA to cluster features with similar expression profiles over time.

PCA is an unsupervised dimension reduction technique which uses uncorrelated 
instrumental variables (i.e. principal components) to summarize as much information 
(*variance*) as possible from the data.

In PCA, each component is associated with a loading vector of length P (the number of features/profiles).
For a given component, we can extract a set of strongly correlated profiles by 
considering the features with the top absolute coefficients in the loading vector.

These profiles are linearly combined to define each component and thus explain similar information on a given component. 
Different clusters are therefore obtained on each component of the PCA. 
Each cluster is then further separated into two sets of profiles, denoted "positive" or "negative" based on the sign of the corresponding coefficients in the loading vector.

At the end of this procedure, each component creates 2 clusters, and each feature is assigned to a cluster according to its maximum contribution across components and the sign of that contribution.

*(see also `?mixOmics::pca` for more details about PCA and available options)*
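
The assignment rule can be sketched in a few lines of base R. The loading matrix below is a made-up toy example, not `mixOmics` output; the `getCluster` function used later handles this internally.

```r
# toy sketch of the cluster assignment rule (made-up loading matrix)
load.mat <- matrix(c( 0.8, -0.1,
                     -0.7,  0.2,
                      0.1,  0.9),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("A", "B", "C"), c("PC1", "PC2")))
# each feature goes to the component with its largest absolute loading...
comp <- apply(abs(load.mat), 1, which.max)
# ...and the sign of that loading splits the component into 2 clusters
signs <- sign(load.mat[cbind(seq_len(nrow(load.mat)), comp)])
comp * signs  # A -> 1, B -> -1, C -> 2
```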

### Longitudinal clustering

To optimize the number of clusters, the number of PCA components needs to be tuned (`getNcomp`).
The quality of the clustering is assessed using the silhouette coefficient.
The number of components that maximizes the silhouette coefficient provides the best clustering.
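
As a reminder of what the silhouette coefficient measures, here is a toy computation with the `cluster` package (a recommended package shipped with standard R distributions); the data and labels below are made up.

```r
# toy silhouette computation (made-up 1-d data and labels)
library(cluster)
vals <- c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8)  # two well-separated groups
grp  <- c(1, 1, 1, 2, 2, 2)              # their cluster labels
mean(silhouette(grp, dist(vals))[, "sil_width"])  # close to 1: good clustering
```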

```{r}
# run pca
pca.res <- pca(X = profile.filtered, ncomp = 5, scale = FALSE, center = FALSE)

# tuning ncomp
pca.ncomp <- getNcomp(pca.res, max.ncomp = 5, X = profile.filtered, 
                      scale = FALSE, center = FALSE)

pca.ncomp$choice.ncomp
plot(pca.ncomp)
```

In this plot, we can observe that the highest silhouette coefficient is achieved with `ncomp = 2` (4 clusters).

```{r}
# final model
pca.res <- pca(X = profile.filtered, ncomp = 2, scale = FALSE, center = FALSE)
```

All the information about the clusters can be extracted with `getCluster` (below). 
Once run, the procedure indicates the assignment of each molecule to either the `positive` or `negative` cluster of a given component, based on the sign of its loading (contribution).

```{r}
# extract cluster
pca.cluster <- getCluster(pca.res)
head(pca.cluster)
```

### A word about the multivariate models

Multivariate models provide a set of graphical methods to extract useful information about the samples or the variables (R functions from `mixOmics`).

The sample plot, or more accurately here the timepoint plot, projects the samples/timepoints into the reduced space spanned by the principal components (or latent structures).
It displays the similarities (points close to each other) or dissimilarities between samples/timepoints.

```{r}
plotIndiv(pca.res)
```

Associations between variables can be displayed on a correlation circle plot.
The variables are projected onto the plane defined by two principal components.
Their projections lie inside a circle of radius 1 centered at the origin. 
Strongly associated (or correlated) variables are projected in the same direction from the origin; the greater the distance from the origin, the stronger the association.

```{r}
plotVar(pca.res)
```

Lastly, the strength of the variables' contributions to a component can be displayed as a horizontal barplot.

```{r}
plotLoadings(pca.res)
```

### Plot PCA longitudinal clusters

Clustered profiles can be displayed with `plotLong`. 

The user can set the parameters `scale` and `center` to scale/center all time profiles.

(*See also `mixOmics::plotVar(pca.res)` for a variable representation*)

```{r}
plotLong(pca.res, scale = FALSE, center = FALSE, 
         title = "PCA longitudinal clustering")
```

## *sparse* PCA

The previous clustering used all features. 
*sparse* PCA is an optional step to define a signature per cluster. 
It selects the features with the highest loading scores on each component in order to determine a signature.

*(see also `?mixOmics::spca` for more details about sPCA and available options)*

### `keepX` optimization

To find the right number of features to keep per component (`keepX`), and thus per cluster, the silhouette coefficient is assessed for a list of numbers of selected features (`test.keepX`) on each component.

Below, we plot the silhouette coefficient corresponding to each sub-cluster (positive or negative contribution) with respect to the number of selected features. A large decrease indicates that the clusters are no longer homogeneous and that the newly added features should not be included in the final model.

This method tends to select the smallest possible signature, so if the user wishes to set a minimum number of features per component, `test.keepX` will have to be adjusted accordingly.

```{r}
tune.spca.res <- tuneCluster.spca(X = profile.filtered, ncomp = 2, 
                                  test.keepX = c(2:10))
# selected features in each component
tune.spca.res$choice.keepX
plot(tune.spca.res)
```

In the above graph, the evolution of the silhouette coefficient per component and per contribution is plotted as a function of `keepX`.

```{r}
# final model
spca.res <- spca(X = profile.filtered, ncomp = 2, 
                 keepX = tune.spca.res$choice.keepX, scale = FALSE)
plotLong(spca.res)
```

# multi-Omics longitudinal clustering 

In this scenario, the user has 2 or more blocks of omics data from the same experiment 
(e.g. gene expression and metabolite concentration) and is interested in discovering 
which genes and metabolites share a common expression profile over time.
This may reveal dynamic biological mechanisms.

The clustering strategy with more than one block of data is the same as the longitudinal clustering with PCA and is based on integrative methods using Projection to Latent Structures (PLS).

With 2 blocks, PLS is used.
With more than 2 blocks, multi-block PLS is used.

## Projection to Latent Structures (PLS)

In the following section, *PLS* is used to cluster time profiles coming from **2 blocks** of data.
*PLS* accepts 2 data.frames with the same number of rows (corresponding samples).

*(see also `?mixOmics::pls` for more details about PLS and available options)*

### Ncomp and Clustering 

As with *PCA*, the number of components of the PLS model, and thus the number of clusters, needs to be tuned (`getNcomp`).

```{r}
X <- profile.filtered
Y <- timeOmics.simdata$Y

pls.res <- pls(X, Y, ncomp = 5, scale = FALSE)
pls.ncomp <- getNcomp(pls.res, max.ncomp = 5, X = X, Y = Y, scale = FALSE)
pls.ncomp$choice.ncomp
plot(pls.ncomp)
```

In this plot, we can observe that the highest silhouette coefficient is achieved with `ncomp = 2` (4 clusters).

```{r}
# final model
pls.res <- pls(X, Y, ncomp = 2, scale = FALSE)

# info cluster
head(getCluster(pls.res))
# plot clusters
plotLong(pls.res, title = "PLS longitudinal clustering", legend = TRUE)
```

### Signature with *sparse* PLS

As with *PCA*, it is possible to use *sparse* PLS to get a signature per cluster.

`tuneCluster.spls` chooses the number of features to keep on block X (`test.keepX`) as well as 
the number of features to keep on block Y (`test.keepY`) among candidate values provided by the user; each candidate is tested on each component.

*(see also `?mixOmics::spls` for more details about `spls` and available options)*

```{r}
tune.spls <- tuneCluster.spls(X, Y, ncomp = 2, test.keepX = c(4:10), test.keepY = c(2,4,6))

# selected features in each component on block X
tune.spls$choice.keepX
# selected features in each component on block Y
tune.spls$choice.keepY

# final model
spls.res <- spls(X, Y, ncomp = 2, scale = FALSE, 
                 keepX = tune.spls$choice.keepX, keepY = tune.spls$choice.keepY)

# spls cluster
spls.cluster <- getCluster(spls.res)

# longitudinal cluster plot
plotLong(spls.res, title = "sPLS clustering")
```

## Multi-block (s)PLS longitudinal clustering

With more than **2 blocks** of data, *multi-block PLS* is used to identify clusters of similar profiles from **3 or more blocks** of data.

This method accepts a list of data.frames as `X` (with matching rows) and a Y data.frame.

*(see also `?mixOmics::block.pls` for more details about block PLS and available options)*

### Ncomp and Clustering 

```{r}
X <- list("X" = profile.filtered, "Z" = timeOmics.simdata$Z)
Y <- as.matrix(timeOmics.simdata$Y)

block.pls.res <- block.pls(X = X, Y = Y, ncomp = 5, 
                           scale = FALSE, mode = "canonical")
block.ncomp <- getNcomp(block.pls.res, X = X, Y = Y, 
                        scale = FALSE, mode = "canonical")
block.ncomp$choice.ncomp
plot(block.ncomp)
```

In this plot, we can observe that the highest silhouette coefficient is achieved with `ncomp = 1` (2 clusters).

```{r}
# final model
block.pls.res <- block.pls(X = X, Y = Y, ncomp = 1, scale = FALSE, mode = "canonical")
# block.pls cluster
block.pls.cluster <- getCluster(block.pls.res)

# longitudinal cluster plot
plotLong(block.pls.res)
```

### Signature with multi-block *sparse* PLS

As with *PCA* and *PLS*, it is possible to use multi-block *sparse* PLS to get a signature per cluster.

`tuneCluster.block.spls` chooses the number of features to keep on each block of X (`test.list.keepX`) as well as 
the number of features to keep on block Y (`test.keepY`) among candidate values provided by the user.

*(see also `?mixOmics::block.spls` for more details about block sPLS and available options)*

```{r}
test.list.keepX <- list("X" = 4:10, "Z" = c(2,4,6,8))
test.keepY <- c(2,4,6)

tune.block.res <- tuneCluster.block.spls(X = X, Y = Y, 
                                         test.list.keepX = test.list.keepX, 
                                         test.keepY = test.keepY, 
                                         scale = FALSE, 
                                         mode = "canonical", ncomp = 1)
# ncomp = 1, as chosen by the getNcomp() function above

# selected features in each component on block X
tune.block.res$choice.keepX
# selected features in each component on block Y
tune.block.res$choice.keepY

# final model
block.pls.res <- block.spls(X = X, Y = Y, 
                            ncomp = 1, 
                            scale = FALSE, 
                            mode = "canonical", 
                            keepX = tune.block.res$choice.keepX, 
                            keepY = tune.block.res$choice.keepY)

head(getCluster(block.pls.res))
plotLong(block.pls.res)
```

# Post-hoc evaluation

Interpretation based on correlations between profiles must be made with caution, as such correlations are highly likely to be spurious. 
Proportionality distances have been proposed as an alternative to measure association a posteriori on the identified signature.

In the following graphs, we represent the proportionality distances within clusters and the distances between the features inside each cluster and the entire background set.

We also use a Wilcoxon U-test to compare the within-cluster median to that of the entire background set.
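
The intuition behind proportionality can be sketched with the log-ratio variance, which is zero when two positive profiles are exactly proportional. This is a minimal base-R sketch with made-up profiles, not necessarily the exact distance computed by `proportionality()`.

```r
# toy sketch: log-ratio variance as a proportionality measure (made-up profiles)
p1 <- c(1, 2, 4, 8)
p2 <- 3 * p1             # perfectly proportional to p1
p3 <- c(8, 1, 5, 2)      # unrelated profile
var(log(p1 / p2))        # 0: p1 and p2 are proportional
var(log(p1 / p3))        # > 0: p1 and p3 are not
```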

```{r, eval=TRUE}
# example from the multi-block analysis
res <- proportionality(block.pls.res)
# distance between pairs of features
head(res$propr.distance)[1:6]

# u-test pvalue by clusters
pval.propr <- res$pvalue
knitr::kable(pval.propr)
plot(res)
```

In addition to the Wilcoxon test, the dispersion of proportionality distances within each cluster and with the entire background set is represented per cluster in the above graph.

Here, for cluster `1`, the proportionality distance is computed between pairs of features from the same cluster `1` (inside). 
The distance is then computed between each feature of cluster `1` and every feature of cluster `-1` (outside).

The same is applied to the features of cluster `-1`.

We see that the intra-cluster distance is lower than the distances to the entire background set, 
which is confirmed by the Wilcoxon test and indicates a good clustering.

# References