|
a |
|
b/docs/getting-started-model-analysis.md |
|
|
1 |
# Getting Started with Model Analysis |
|
|
2 |
|
|
|
3 |
The following functions are provided to assist in the analysis and visualisation |
|
|
4 |
of a model fitted by `mixOmics`. Focus has been given to sPLS-DA models where a |
|
|
5 |
single block is being fitted (single-omics) and to DIABLO models where multiple |
|
|
6 |
blocks are being fitted simulataneously (multi-omics). The code below will |
|
|
7 |
generate known trained models and corresponding performance tests for each model |
|
|
8 |
type, which are then used to demonstrate the functions in the rest of this |
|
|
9 |
document. |
|
|
10 |
|
|
|
11 |
```R |
|
|
12 |
library(OmicsFold) |
|
|
13 |
library(mixOmics) |
|
|
14 |
library(dplyr) |
|
|
15 |
|
|
|
16 |
####################################### |
|
|
17 |
## sPLS-DA model and performance result |
|
|
18 |
####################################### |
|
|
19 |
|
|
|
20 |
# Prepare data |
|
|
21 |
data(srbct) |
|
|
22 |
splsda.X <- srbct$gene |
|
|
23 |
splsda.Y <- srbct$class |
|
|
24 |
|
|
|
25 |
# Model fit |
|
|
26 |
splsda.ncomp <- 3 |
|
|
27 |
splsda.keepX <- c(9, 280, 30) |
|
|
28 |
splsda.model <- splsda(splsda.X, splsda.Y, |
|
|
29 |
ncomp = splsda.ncomp, keepX = splsda.keepX) |
|
|
30 |
|
|
|
31 |
# Preformance result -- takes about 1 min to run |
|
|
32 |
set.seed(1234) # for reproducibility |
|
|
33 |
splsda.perf <- perf(splsda.model, validation = "Mfold", folds = 5, |
|
|
34 |
dist = 'max.dist', nrepeat = 10, |
|
|
35 |
progressBar = FALSE) |
|
|
36 |
|
|
|
37 |
|
|
|
38 |
###################################### |
|
|
39 |
## DIABLO model and performance result |
|
|
40 |
###################################### |
|
|
41 |
|
|
|
42 |
# Prepare data |
|
|
43 |
data('breast.TCGA') |
|
|
44 |
diablo.data <- list(mRNA = breast.TCGA$data.train$mrna, |
|
|
45 |
miRNA = breast.TCGA$data.train$mirna, |
|
|
46 |
proteomics = breast.TCGA$data.train$protein) |
|
|
47 |
diablo.Y <- breast.TCGA$data.train$subtype |
|
|
48 |
diablo.design <- matrix(0.1, ncol = length(diablo.data), |
|
|
49 |
nrow = length(diablo.data), |
|
|
50 |
dimnames = list(names(diablo.data), names(diablo.data))) |
|
|
51 |
diag(diablo.design) <- 0 |
|
|
52 |
|
|
|
53 |
# Model fit |
|
|
54 |
diablo.ncomp <- 2 |
|
|
55 |
diablo.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7)) |
|
|
56 |
diablo.model <- block.splsda(X = diablo.data, Y = diablo.Y, |
|
|
57 |
ncomp = diablo.ncomp, keepX = diablo.keepX, |
|
|
58 |
design = diablo.design) |
|
|
59 |
|
|
|
60 |
# Preformance result -- takes about 1 min to run |
|
|
61 |
set.seed(4321) # for reproducibility |
|
|
62 |
diablo.perf <- perf(diablo.model, validation = 'Mfold', M = 10, nrepeat = 10, |
|
|
63 |
dist = 'centroids.dist') |
|
|
64 |
``` |
|
|
65 |
|
|
|
66 |
## Model variance analysis |
|
|
67 |
|
|
|
68 |
OmicsFold provides the [`get.block.centroids()`](#get-block-centroids) function |
|
|
69 |
to extract the centroids of variance across blocks in a DIABLO model and export |
|
|
70 |
them as a data frame and plot. The |
|
|
71 |
[`get.model.variance()`](#get-model-variance) function is also provided which |
|
|
72 |
will take either an sPLS-DA model or a DIABLO model as input and creates a data |
|
|
73 |
frame showing the percentage contributions of each component to the model |
|
|
74 |
variance. |
|
|
75 |
|
|
|
76 |
### Get block centroids |
|
|
77 |
|
|
|
78 |
Applying the `get.block.centroids()` function to the DIABLO model from above |
|
|
79 |
gives the following: |
|
|
80 |
|
|
|
81 |
```R |
|
|
82 |
get.block.centroids(diablo.model) |
|
|
83 |
## A tibble: 150 x 4 |
|
|
84 |
## Groups: sample [150] |
|
|
85 |
## sample label comp1 comp2 |
|
|
86 |
## <int> <fct> <dbl> <dbl> |
|
|
87 |
## 1 1 Basal 2.38 2.22 |
|
|
88 |
## 2 2 Basal 1.74 2.03 |
|
|
89 |
## 3 3 Basal 2.28 -0.0206 |
|
|
90 |
## 4 4 Basal 1.54 1.46 |
|
|
91 |
## 5 5 Basal 1.93 -1.17 |
|
|
92 |
## ... with 145 more rows |
|
|
93 |
``` |
|
|
94 |
|
|
|
95 |
In addition to the tibble provided, a plot is produced showing the centroid |
|
|
96 |
across the blocks in the first and second component for each sample in the data. |
|
|
97 |
Each point in the plot represents a sample and is the mean of the three blocks |
|
|
98 |
loaded for the DIABLO model. The circles show the distribution of all the |
|
|
99 |
points sharing the same label and gives a good indication of the separation of |
|
|
100 |
classifications in each of the first two components. |
|
|
101 |
|
|
|
102 |
 |
|
|
103 |
|
|
|
104 |
### Get model variance |
|
|
105 |
|
|
|
106 |
The `get.model.variance()` function can be applied to the model to identify how |
|
|
107 |
much variation each component captures in the model for a given block: |
|
|
108 |
|
|
|
109 |
```R |
|
|
110 |
get.model.variance(diablo.model, "mRNA") |
|
|
111 |
## Proportion Cumulative |
|
|
112 |
## comp1 0.794465823 0.7944658 |
|
|
113 |
## comp2 0.008624793 0.8030906 |
|
|
114 |
|
|
|
115 |
get.model.variance(diablo.model, "miRNA") |
|
|
116 |
## Proportion Cumulative |
|
|
117 |
## comp1 0.669987181 0.6699872 |
|
|
118 |
## comp2 0.001283394 0.6712706 |
|
|
119 |
|
|
|
120 |
get.model.variance(diablo.model, "proteomics") |
|
|
121 |
## Proportion Cumulative |
|
|
122 |
## comp1 0.770439163 0.7704392 |
|
|
123 |
## comp2 0.002689493 0.7731287 |
|
|
124 |
``` |
|
|
125 |
|
|
|
126 |
Here we can see that the first component captures the vast majority of the |
|
|
127 |
variance for all three blocks but that slightly less of the variance in `miRNA` |
|
|
128 |
was captured than in `mRNA` or `proteomics`. |
|
|
129 |
|
|
|
130 |
## Feature analysis for sPLS-DA models |
|
|
131 |
|
|
|
132 |
OmicsFold provides functions for analysing the features selecting for fitting |
|
|
133 |
sPLS-DA models. |
|
|
134 |
|
|
|
135 |
- [`get.loadings.table()`](#get-feature-loadings-as-a-table) outputs a data |
|
|
136 |
frame of loadings for features selected in each component of the model. |
|
|
137 |
- [`feature.selection.stability()`](#get-spls-da-feature-selection-stability) |
|
|
138 |
gives the loadings from above for a single component, alongside stability |
|
|
139 |
scores for the selected features. |
|
|
140 |
- [`merge.feature.stability()`](#merge-feature-stability-on-a-table-of-loadings) |
|
|
141 |
takes a table such as that generated by `get.loadings.table()` and the output |
|
|
142 |
from `feature.selection.stability()` merging them into a single data frame. |
|
|
143 |
- [`blockrank.splsda()`](#measure-variable-importance) takes an sPLS-DA model |
|
|
144 |
object and calculates a score measuring each features contribution to the variance in the response |
|
|
145 |
|
|
|
146 |
### Get feature loadings as a table |
|
|
147 |
|
|
|
148 |
By calling `get.loadings.table()` with the sPLS-DA model as a parameter, a data |
|
|
149 |
frame will be produced listing the loadings of each feature selected for the |
|
|
150 |
sparse feature set. A separate column is included for each component and the |
|
|
151 |
features are arranged according to their loading value in the earliest component |
|
|
152 |
they are selected for. Ordering is highest loading first. Where a feature is |
|
|
153 |
selected in more than one component, it appears only once in the table, showing |
|
|
154 |
both loadings values. Where features are not selected for a component, the |
|
|
155 |
loading value is `NA`. |
|
|
156 |
|
|
|
157 |
The code example below demonstrates typical output from this function. The |
|
|
158 |
output has been truncated for brevity in this getting started guide. The actual |
|
|
159 |
output is 320 lines long. |
|
|
160 |
|
|
|
161 |
```R |
|
|
162 |
get.loadings.table(splsda.model) |
|
|
163 |
## comp1 comp2 comp3 |
|
|
164 |
## g123 0.64922048 NA NA |
|
|
165 |
## g846 0.44828969 NA NA |
|
|
166 |
## g1606 0.30641602 NA NA |
|
|
167 |
## ... |
|
|
168 |
## g1389 NA -2.312917e-01 NA |
|
|
169 |
## g246 NA -2.001212e-01 NA |
|
|
170 |
## g545 NA -1.986890e-01 NA |
|
|
171 |
## ... |
|
|
172 |
## g340 NA -4.226175e-02 NA |
|
|
173 |
## g742 NA 4.204118e-02 0.14003503 |
|
|
174 |
## g1143 NA -4.124087e-02 NA |
|
|
175 |
## ... |
|
|
176 |
## g174 NA NA -0.35418530 |
|
|
177 |
## g1896 NA NA -0.29939099 |
|
|
178 |
## g603 NA NA -0.27261886 |
|
|
179 |
``` |
|
|
180 |
|
|
|
181 |
### Get sPLS-DA feature selection stability |
|
|
182 |
|
|
|
183 |
Calling `feature.selection.stability()` with the model, performance test result |
|
|
184 |
and a component number as parameters generates a data frame showing the loadings |
|
|
185 |
values are were returned above, but for a single component only. The second |
|
|
186 |
column in the data frame gives the stability value of the feature. This is |
|
|
187 |
expressed as a fraction and indicates how many of the models in the performance |
|
|
188 |
test included this feature in the sparse set. High stability indicates that a |
|
|
189 |
feature is very important to the fit of the model and is therefore probably |
|
|
190 |
important to separating the classes the model is trying to discriminate. |
|
|
191 |
|
|
|
192 |
The example below shows extraction of the stability values for the 9 top |
|
|
193 |
features included in component 1. Note how all stabilities are an exact |
|
|
194 |
multiple of 0.02. This is because our performance test used 10 repeats and 5 |
|
|
195 |
folds, so 50 models were fitted, and each time a model selects a feature, this |
|
|
196 |
adds 0.02 to the stability of that feature. |
|
|
197 |
|
|
|
198 |
```R |
|
|
199 |
feature.selection.stability(splsda.model, splsda.perf, 1) |
|
|
200 |
## value stability |
|
|
201 |
## g123 0.64922048 0.50 |
|
|
202 |
## g846 0.44828969 0.50 |
|
|
203 |
## g1606 0.30641602 0.40 |
|
|
204 |
## g335 0.30031495 0.48 |
|
|
205 |
## g836 0.26392532 0.46 |
|
|
206 |
## g783 0.22022625 0.28 |
|
|
207 |
## g758 0.22019689 0.38 |
|
|
208 |
## g1386 0.16248889 0.38 |
|
|
209 |
## g585 0.02058632 0.12 |
|
|
210 |
``` |
|
|
211 |
|
|
|
212 |
### Merge feature stability on a table of loadings |
|
|
213 |
|
|
|
214 |
After getting the loadings table for all the components of the model, it is |
|
|
215 |
probably desirable to create a table with stability values shown in a final |
|
|
216 |
column. `merge.feature.stability()` can be called with the loadings table and |
|
|
217 |
stability table as parameters. The returned data frame adds a column for the |
|
|
218 |
stability values matching the features in the loadings table. `NA` is used |
|
|
219 |
where the stability value for a feature was not in the stability table. |
|
|
220 |
|
|
|
221 |
Output for the loadings table is truncated in the example below for brevity. |
|
|
222 |
You will see the full output in your own R instance. |
|
|
223 |
|
|
|
224 |
```R |
|
|
225 |
# Prepare the loadings table and stability value table we intend to merge on. |
|
|
226 |
loadings.table <- get.loadings.table(splsda.model) |
|
|
227 |
stability.comp3 <- feature.selection.stability(splsda.model, splsda.perf, 3) |
|
|
228 |
|
|
|
229 |
# Check the raw loadings table before merging any stability values. |
|
|
230 |
loadings.table |
|
|
231 |
## comp1 comp2 comp3 |
|
|
232 |
## g123 0.64922048 NA NA |
|
|
233 |
## g846 0.44828969 NA NA |
|
|
234 |
## g742 NA 4.204118e-02 0.14003503 |
|
|
235 |
## g1143 NA -4.124087e-02 NA |
|
|
236 |
## g174 NA NA -0.35418530 |
|
|
237 |
## g1896 NA NA -0.29939099 |
|
|
238 |
|
|
|
239 |
# Merge on stability values for component 3. |
|
|
240 |
merge.feature.stability(loadings.table, stability.comp3) |
|
|
241 |
## feature comp1 comp2 comp3 stability |
|
|
242 |
## g123 0.64922048 NA NA NA |
|
|
243 |
## g846 0.44828969 NA NA NA |
|
|
244 |
## g742 NA 4.204118e-02 0.14003503 0.76 |
|
|
245 |
## g1143 NA -4.124087e-02 NA NA |
|
|
246 |
## g174 NA NA -0.35418530 0.34 |
|
|
247 |
## g1896 NA NA -0.29939099 0.32 |
|
|
248 |
``` |
|
|
249 |
|
|
|
250 |
### Measure variable importance |
|
|
251 |
|
|
|
252 |
With a trained model object `blockrank.splsda()` will create a single item list containing a vector of BlockRank scores for that model. For more detail see section [BlockRank](#blockrank) below. |
|
|
253 |
|
|
|
254 |
|
|
|
255 |
## Feature analysis for DIABLO models |
|
|
256 |
|
|
|
257 |
OmicsFold provides a number of functions for analysing the output from DIABLO |
|
|
258 |
models produced in `mixOmics`. |
|
|
259 |
|
|
|
260 |
- [`get.diablo.top.loadings()`](#get-diablo-top-loadings) outputs a data frame |
|
|
261 |
of feature loadings for a specified block in the DIABLO model. |
|
|
262 |
- [`diablo.selection.stability()`](#get-diablo-feature-selection-stability) |
|
|
263 |
extracts a data frame of stability scores from a DIABLO model performance test |
|
|
264 |
for features in a particular block and component. |
|
|
265 |
- [`get.diablo.top.loadings.with.stability()`](#get-diablo-top-loadings-with-stability) |
|
|
266 |
provides a concatenated data frame of feature loadings and stability scores |
|
|
267 |
for a given block in the DIABLO model. |
|
|
268 |
- [`blockrank.diablo()`](#measure-feature-importance) takes a DIABLO model |
|
|
269 |
object and calculates a score measuring each features contribution to the variance in the response |
|
|
270 |
- [`plot.feature.stability()`](#plot-feature-stability) provides a simple |
|
|
271 |
histogram plot of the highest stability features in a single component and |
|
|
272 |
block, allowing visual comparison of their stability scores. |
|
|
273 |
- [`find.feature.associations()`](#find-diablo-feature-associations) outputs a |
|
|
274 |
matrix of values indicating possible feature associations in the model based |
|
|
275 |
on consistent feature measurement across samples. |
|
|
276 |
- [`export.matrix.as.network()`](#export-diablo-matrix-as-a-network) exports a |
|
|
277 |
data frame showing relationships between features where feature association |
|
|
278 |
strength exceeds a threshold. The data frame is also written to a CSV file |
|
|
279 |
which can be imported into a network plotting application such as |
|
|
280 |
[Cytoscape](https://cytoscape.org/). |
|
|
281 |
|
|
|
282 |
### Get DIABLO top loadings |
|
|
283 |
|
|
|
284 |
`get.diablo.top.loadings()` allows the export of loadings from a DIABLO model |
|
|
285 |
into a data frame. The resulting data frame has columns for each component of |
|
|
286 |
the model containing loadings for the features selected in that component. Rows |
|
|
287 |
are sorted according to the absolute value of the loadings so that the highest |
|
|
288 |
impact features are at the top of the table. All features for the first |
|
|
289 |
component are listed before listing begins for features of the second and |
|
|
290 |
subsequent components. If a feature is selected in more than one component, |
|
|
291 |
only one row is given for that feature but multiple columns will contain a |
|
|
292 |
loading for that feature. Where a feature was not selected in a component, the |
|
|
293 |
`NA` value is shown for the loading. |
|
|
294 |
|
|
|
295 |
In the example below, the loadings for the `proteomics` block are being arranged |
|
|
296 |
in the data frame. Note that feature `AR` has been selected in both components. |
|
|
297 |
Because its loading value in component 2 is `-0.11183919`, this would normally |
|
|
298 |
place `AR` below `c-Kit`, but features are sorted by lower numbered components |
|
|
299 |
first. |
|
|
300 |
|
|
|
301 |
```R |
|
|
302 |
get.diablo.top.loadings(diablo.model$loadings$proteomics) |
|
|
303 |
## comp1 comp2 |
|
|
304 |
## ER-alpha -0.74995295 NA |
|
|
305 |
## GATA3 -0.62448770 NA |
|
|
306 |
## ASNS 0.16825721 NA |
|
|
307 |
## Cyclin_B1 0.12026614 NA |
|
|
308 |
## AR -0.06842432 -0.11183919 |
|
|
309 |
## JNK2 -0.01137347 NA |
|
|
310 |
## HER2 NA -0.67182918 |
|
|
311 |
## HER2_pY1248 NA -0.62130938 |
|
|
312 |
## EGFR_pY1068 NA -0.33229248 |
|
|
313 |
## c-Kit NA 0.17791016 |
|
|
314 |
## HER3_pY1289 NA -0.08706226 |
|
|
315 |
## XRCC1 NA 0.02149536 |
|
|
316 |
``` |
|
|
317 |
|
|
|
318 |
### Get DIABLO feature selection stability |
|
|
319 |
|
|
|
320 |
`diablo.selection.stability()` is very similar to the [equivalent function for |
|
|
321 |
sPLS-DA models](#get-spls-da-feature-selection-stability). By passing the |
|
|
322 |
performance test result, a selected component and a selected block as |
|
|
323 |
parameters, a data frame will be generated for the features in that block and |
|
|
324 |
component, showing the stability scores for those features. Stability is a |
|
|
325 |
fraction indicating the number of models in the performance test which selected |
|
|
326 |
this feature when being fitted. 100 models were fitted when running the DIABLO |
|
|
327 |
performance test (10 repeats with 10 folds), so the stability scores shown are |
|
|
328 |
an exact multiple of `0.01`. Features are presented in the data frame in |
|
|
329 |
reverse stability score order so that the most stable features are shown first. |
|
|
330 |
Those with a score of `1.0` are essential for the model to achieve the best fit |
|
|
331 |
and are therefore also likely to be relevant to the discrimination of classes |
|
|
332 |
the model is trying to achieve. |
|
|
333 |
|
|
|
334 |
The code below outputs the feature selection stability for the `proteomics` |
|
|
335 |
block in component 1. |
|
|
336 |
|
|
|
337 |
```R |
|
|
338 |
diablo.selection.stability(diablo.perf, comp = 1, block = 'proteomics') |
|
|
339 |
## feature stability |
|
|
340 |
## ER-alpha 1.00 |
|
|
341 |
## GATA3 1.00 |
|
|
342 |
## ASNS 0.99 |
|
|
343 |
## Cyclin_B1 0.98 |
|
|
344 |
## AR 0.80 |
|
|
345 |
## JNK2 0.54 |
|
|
346 |
## PR 0.41 |
|
|
347 |
## Cyclin_E1 0.22 |
|
|
348 |
## INPP4B 0.12 |
|
|
349 |
``` |
|
|
350 |
|
|
|
351 |
### Get DIABLO top loadings with stability |
|
|
352 |
|
|
|
353 |
`get.diablo.top.loadings.with.stability()` allows the combination of information |
|
|
354 |
from both the previous functions. By passing in the model, performance test and |
|
|
355 |
block name, a data frame is generated with loadings of the selected features for |
|
|
356 |
each component alongside stability scores for those features. Where a feature |
|
|
357 |
exists in more than one component, the stability score shown is for the lowest |
|
|
358 |
numbered component. |
|
|
359 |
|
|
|
360 |
The code below generates a data frame containing loading values and stability |
|
|
361 |
scores for the selected features in the `proteomics` block. Note that, unless |
|
|
362 |
specified, the number of features returned will be up to `20` by default. |
|
|
363 |
|
|
|
364 |
```R |
|
|
365 |
get.diablo.top.loadings.with.stability(diablo.model, diablo.perf, |
|
|
366 |
block = 'proteomics', feature.count = 50) |
|
|
367 |
## comp1 comp2 stability |
|
|
368 |
## ER-alpha -0.74995295 NA 1.00 |
|
|
369 |
## GATA3 -0.62448770 NA 1.00 |
|
|
370 |
## ASNS 0.16825721 NA 0.99 |
|
|
371 |
## Cyclin_B1 0.12026614 NA 0.98 |
|
|
372 |
## AR -0.06842432 -0.11183919 0.80 |
|
|
373 |
## JNK2 -0.01137347 NA 0.54 |
|
|
374 |
## HER2 NA -0.67182918 1.00 |
|
|
375 |
## HER2_pY1248 NA -0.62130938 1.00 |
|
|
376 |
## EGFR_pY1068 NA -0.33229248 1.00 |
|
|
377 |
## c-Kit NA 0.17791016 1.00 |
|
|
378 |
## HER3_pY1289 NA -0.08706226 0.95 |
|
|
379 |
## XRCC1 NA 0.02149536 0.56 |
|
|
380 |
``` |
|
|
381 |
|
|
|
382 |
### Measure feature importance |
|
|
383 |
|
|
|
384 |
With a trained model object `blockrank.diablo()` will create a list of vectors of BlockRank scores for each feature. For more detail see section [BlockRank](#blockrank) below. |
|
|
385 |
|
|
|
386 |
### Plot feature stability |
|
|
387 |
|
|
|
388 |
`plot.feature.stability()` takes a single parameter with a table of stabilities, |
|
|
389 |
as obtained from |
|
|
390 |
[`diablo.selection.stability()`](#get-diablo-feature-selection-stability), and |
|
|
391 |
creates a bar-chart-style plot of stability for those features. The shape of |
|
|
392 |
the plot will give and indication for how consistently the models choose the |
|
|
393 |
same features during the performance tests. A plot that stays high for the |
|
|
394 |
first features and then drops sharply to a low level for the remainder shows |
|
|
395 |
that there is a clear preference for specific features and that those features |
|
|
396 |
should be of value to look at more closely in relation to the biological |
|
|
397 |
context. If the curve is closer to a straight line and only a very small number |
|
|
398 |
of features achieve close to `1.0` stability, then the feature selection is less |
|
|
399 |
stable and it may be harder to draw conclusions about which are relevant to the |
|
|
400 |
biological context. |
|
|
401 |
|
|
|
402 |
The code below prepares a stability data frame for the DIABLO performance test |
|
|
403 |
for the `miRNA` block in component 2. It then plots the stability scores as a |
|
|
404 |
bar-chart-style plot as shown below the code. This plot shows a reasonably |
|
|
405 |
sharp drop off after about 10 features, which in turn means that the high |
|
|
406 |
stability features on the left are likely to have good relevance to the |
|
|
407 |
biological context. |
|
|
408 |
|
|
|
409 |
```R |
|
|
410 |
stability <- diablo.selection.stability(diablo.perf, comp = 2, block = 'miRNA') |
|
|
411 |
plot.feature.stability(stability) |
|
|
412 |
``` |
|
|
413 |
|
|
|
414 |
 |
|
|
415 |
|
|
|
416 |
### Export DIABLO matrix as a network |
|
|
417 |
|
|
|
418 |
`export.matrix.as.network()` takes the output of the associations function above |
|
|
419 |
and both filters and converts the matrix to a flat format as a data frame as |
|
|
420 |
well as writing that to a CSV file. The data frame has one line per association |
|
|
421 |
and does not show reverse associations. Only associations which meet or exceed |
|
|
422 |
the given cutoff threshold are included. The association can be positive or |
|
|
423 |
negative to satisfy inclusion by the cutoff. The data frame shows both sides of |
|
|
424 |
the association as two feature columns and the strength of the association as a |
|
|
425 |
value column. The block column provides the name of the block containing the |
|
|
426 |
left hand feature. |
|
|
427 |
|
|
|
428 |
The CSV file specified is created and contains the same information as the data |
|
|
429 |
frame. The format of the CSV file is importable into programs for plotting |
|
|
430 |
networks, such as [Cytoscape](https://cytoscape.org/), which makes it easier to |
|
|
431 |
visualise how the features in the model might be relevant to each other. |
|
|
432 |
|
|
|
433 |
The code example below prepares the association matrix and then prepares a |
|
|
434 |
vector of strings where the string names are the names of the blocks each of the |
|
|
435 |
features are present in. It is also possible to provide the parameter |
|
|
436 |
`block.feature.count` instead of the `block.association` in the unlikely event |
|
|
437 |
that each block contains the same number of features. The output shown has been |
|
|
438 |
truncated to only show the associations included from the matrix above. The |
|
|
439 |
actual list of associations is much longer. A CSV file named `network.csv` is |
|
|
440 |
created and populated with the information shown in the data frame output. |
|
|
441 |
|
|
|
442 |
```R |
|
|
443 |
associations <- find.feature.associations(diablo.model, block.count = 3) |
|
|
444 |
block.association <- character(0) |
|
|
445 |
for (block.name in names(diablo.model$keepX)) { |
|
|
446 |
num.features.in.block <- diablo.model$keepX[[block.name]][1] |
|
|
447 |
block.association <- append(block.association, |
|
|
448 |
rep(block.name, num.features.in.block)) |
|
|
449 |
} |
|
|
450 |
export.matrix.as.network(associations, filename = "network.csv", cutoff = 0.7, |
|
|
451 |
block.association = block.association) |
|
|
452 |
## feature.1 feature.2 value block |
|
|
453 |
## KDM4B ZNF552 0.8247985 mRNA |
|
|
454 |
## KDM4B PREX1 0.7235040 mRNA |
|
|
455 |
## ZNF552 LRIG1 0.7072751 mRNA |
|
|
456 |
## ZNF552 PREX1 0.7308222 mRNA |
|
|
457 |
``` |
|
|
458 |
|
|
|
459 |
## Network functions |
|
|
460 |
OmicsFold provides a set of functions to extract and process data from a DIABLO model |
|
|
461 |
into a multi-omic interaction network visualization. The network functions identify associations between |
|
|
462 |
top features in the model and reformat them into a network object |
|
|
463 |
for direct visualization in R or in an external |
|
|
464 |
program such as Cytoscape. |
|
|
465 |
|
|
|
466 |
### Extract feature associations |
|
|
467 |
|
|
|
468 |
To extract feature associations from a multi-omic DIABLO model, use the |
|
|
469 |
`find.feature.associations()` function. |
|
|
470 |
|
|
|
471 |
```R |
|
|
472 |
associations <- find.feature.associations(diablo.model, feature_number=50, score_type="blockrank") |
|
|
473 |
``` |
|
|
474 |
|
|
|
475 |
The function requires the final trained DIABLO model. The `feature_number` parameter can be |
|
|
476 |
adjusted to select the number of top discriminative features to be returned. The `score_type` function |
|
|
477 |
can be set to either "blockrank" or "loading" to determine which feature ranking metric to use. |
|
|
478 |
The function will return a matrix of associations between the most important features in the model according to either |
|
|
479 |
blockrank or loading scores. |
|
|
480 |
|
|
|
481 |
### Filter and reformat network |
|
|
482 |
|
|
|
483 |
To filter and reformat the association matrix for network visualization, use the |
|
|
484 |
`filter.network()` function. |
|
|
485 |
|
|
|
486 |
```R |
|
|
487 |
omicsfold_network <- filter.network(diablo.model=diablo.model, associations=associations, cutoff=0.7, |
|
|
488 |
feature_list=c("phenol.sulfate", "Alistipes_putredinis"), remove_intrablock = FALSE) |
|
|
489 |
``` |
|
|
490 |
|
|
|
491 |
The function requires the final trained DIABLO model and the association matrix from `find.feature.associations()`. |
|
|
492 |
A correlation cut off can be set with `cutoff` to remove associations between features that |
|
|
493 |
do not meet the supplied threshold. A vector of feature names can be supplied to filter the network to only include |
|
|
494 |
associations to features of interest with `feature_list`, and associations between features from the same block |
|
|
495 |
can be removed with `remove_intrablock`. The function returns a dataframe describing associations |
|
|
496 |
between features and the blocks that they belong to. The dataframe can |
|
|
497 |
be exported to Cytoscape or applied to the `plot.network()` function for visualization in R. |
|
|
498 |
|
|
|
499 |
### Plot network |
|
|
500 |
|
|
|
501 |
To visualize the multi-omic network, use the `plot.network()` function. |
|
|
502 |
|
|
|
503 |
```R |
|
|
504 |
plot.network(omicsfold_network=omicsfold_network, diablo.model=diablo.model) |
|
|
505 |
``` |
|
|
506 |
|
|
|
507 |
 |
|
|
508 |
|
|
|
509 |
The function requires the network dataframe from the `filter.network()` function and the final trained DIABLO model. Nodes in the network represent |
|
|
510 |
features and edges represent correlations between nodes. Nodes are colored by block and edges are colored by correlation value, with darker red edges |
|
|
511 |
indicating the higher end of the correlation values in the network and darker orange edges indicating the lower end of correlation values in the network. |
|
|
512 |
The function uses the Kamada-Kawai network layout algorithm to position nodes so that more strongly connected nodes will be pulled together, |
|
|
513 |
while more weakly connected nodes will pushed away from other nodes. |
|
|
514 |
|
|
|
515 |
## BlockRank |
|
|
516 |
|
|
|
517 |
OmicsFold implements a novel approach to feature analysis called BlockRank. |
|
|
518 |
BlockRank allows the direct comparison of feature importance between different |
|
|
519 |
data blocks, and can also be used to measure feature importance in a singleomic |
|
|
520 |
setting. |
|
|
521 |
|
|
|
522 |
The BlockRank Score for each feature is a number between 0 and 1, with 0 |
|
|
523 |
indicating an insignificant feature and 1 indicating the most important feature |
|
|
524 |
in the model. |
|
|
525 |
|
|
|
526 |
### BlockRank for sPLS-DA |
|
|
527 |
|
|
|
528 |
To measure the BlockRank score for a single-omic sPLS-DA model use the |
|
|
529 |
`blockrank.splsda()` function. |
|
|
530 |
|
|
|
531 |
```R |
|
|
532 |
blockrank.scores <- blockrank.splsda(splsda.model) |
|
|
533 |
``` |
|
|
534 |
|
|
|
535 |
`blockrank.scores` is a list of one numeric vector, the vector containing a |
|
|
536 |
score for each feature in the model data. |
|
|
537 |
|
|
|
538 |
### BlockRank for DIABLO |
|
|
539 |
|
|
|
540 |
To measure the BlockRank score for a multi-omic DIABLO model use the |
|
|
541 |
`blockrank.diablo()` function. |
|
|
542 |
|
|
|
543 |
```R |
|
|
544 |
blockrank.scores <- blockrank.diablo(diablo.model) |
|
|
545 |
``` |
|
|
546 |
|
|
|
547 |
`blockrank.scores` is a list of numeric vectors. Each vector corresponds to one |
|
|
548 |
block of data, and the vector contains a score for each feature in the block. |
|
|
549 |
|
|
|
550 |
### Plotting BlockRank Scores |
|
|
551 |
|
|
|
552 |
The function `plot.blockrank.scores()` returns a bar chart of the top 'n' BlockRank scores. |
|
|
553 |
|
|
|
554 |
```R |
|
|
555 |
plot.blockrank.scores(blockrank.scores, |
|
|
556 |
nscores = 20, |
|
|
557 |
feature.font.size = 8, |
|
|
558 |
model = "DIABLO", |
|
|
559 |
data_source = "Breast TCGA") |
|
|
560 |
``` |
|
|
561 |
 |
|
|
562 |
|
|
|
563 |
The function requires a list of BlockRank scores to plot, and optionally has parameters for how many scores are displayed, `nscores`; the font size of the feature labels, `feature.font.size`; the name of type of model plotted for the title of the graph, `model`, and the name of the data source for the graph title, `data_source`. |
|
|
564 |
|
|
|
565 |
## Model predictivity |
|
|
566 |
|
|
|
567 |
`plot.predicted.projection()` is useful when an sPLS-DA single-omics model is |
|
|
568 |
used to predict the classifications of unseen data and you want to visualise the |
|
|
569 |
accuracy of that prediction. The built in `predict()` function from `mixOmics` |
|
|
570 |
will take a trained model and some unseen observations of the same set of |
|
|
571 |
features, returning an object which is complicated to parse, but contains the |
|
|
572 |
predicted classifications of the new data. |
|
|
573 |
|
|
|
574 |
The projection plot generated by our function contains a large centroid point in |
|
|
575 |
the first two components for each of the classifications surrounded by smaller |
|
|
576 |
points for each observation in the unseen set. The smaller points are coloured |
|
|
577 |
according to their actual classification. A predictive model will render the |
|
|
578 |
smaller points nearer to the large points of the same colour. Smaller points of |
|
|
579 |
the wrong colour compared to their closest large point were predicted |
|
|
580 |
incorrectly. |
|
|
581 |
|
|
|
582 |
The code example below prepares a new sPLS-DA model from a smaller proportion of |
|
|
583 |
the full dataset, described as the training data set. This training data set is |
|
|
584 |
used to fit the predictive model and then that model is made to predict the |
|
|
585 |
outcome of the held back observations. |
|
|
586 |
|
|
|
587 |
Printing the projection plot generated by our function renders the plot shown |
|
|
588 |
below the code. Here we can see that the predictions are mostly accurate with |
|
|
589 |
the exception of one `EWS` point being too close to the `RMS` centroid. There |
|
|
590 |
are also some other points that are situated approximately halfway between two |
|
|
591 |
centroids, but they still appear to be closest to the correct centroid. The |
|
|
592 |
model is therefore able to reasonably accurately predict the classification of |
|
|
593 |
the unseen set with at least 1 error shown. |
|
|
594 |
|
|
|
595 |
```R |
|
|
596 |
# Prepare data |
|
|
597 |
data(srbct) |
|
|
598 |
all.X <- srbct$gene |
|
|
599 |
all.Y <- srbct$class |
|
|
600 |
all.index <- 1:length(all.Y) |
|
|
601 |
|
|
|
602 |
# Subset the data for a training set |
|
|
603 |
set.seed(1234) # for reproducibility of subsetting |
|
|
604 |
train.index <- sort(sample(all.index, length(all.index) * 0.8)) |
|
|
605 |
train.X <- all.X[train.index,] |
|
|
606 |
train.Y <- all.Y[train.index] |
|
|
607 |
|
|
|
608 |
# Model fit with the training set |
|
|
609 |
predict.model <- splsda(train.X, train.Y, ncomp = 3, keepX = c(9, 280, 30)) |
|
|
610 |
|
|
|
611 |
# Prepare the prediction set |
|
|
612 |
predict.index <- setdiff(all.index, train.index) |
|
|
613 |
predict.X <- all.X[predict.index,] |
|
|
614 |
predict.Y <- all.Y[predict.index] |
|
|
615 |
|
|
|
616 |
# Create a prediction for the predict set and plot using plot.predicted.projection() |
|
|
617 |
prediction <- predict(predict.model, predict.X) |
|
|
618 |
projection.plot <- plot.predicted.projection(prediction, predict.Y) |
|
|
619 |
print(projection.plot) |
|
|
620 |
``` |
|
|
621 |
|
|
|
622 |
 |
|
|
623 |
|
|
|
624 |
## Utility functions |
|
|
625 |
|
|
|
626 |
`center.truncate()` is a helper function for plot labels, though it may have |
|
|
627 |
other uses. Where labels are long, as is typical for features of some -omics |
|
|
628 |
data, this can cause plots to be dominated by axis labels. By passing a string |
|
|
629 |
to this function, it will be truncated to exactly 43 characters or fewer. |
|
|
630 |
Characters in the middle of the string will be replaced by `...` instead. |
|
|
631 |
|
|
|
632 |
The examples below show both a single string being truncated, as well as how to |
|
|
633 |
apply the truncation to a vector of strings using the built in R function |
|
|
634 |
`sapply()`. |
|
|
635 |
|
|
|
636 |
```R |
|
|
637 |
center.truncate("When a string is particularly long it will be truncated back to 43 characters") |
|
|
638 |
## "When a string is par...back to 43 characters" |
|
|
639 |
|
|
|
640 |
labels = c("Short name", |
|
|
641 |
"Very long name would need truncating for plotting", |
|
|
642 |
"Plots don't work well when the axis is forced over by long labels") |
|
|
643 |
unname(sapply(labels, center.truncate)) |
|
|
644 |
## [1] "Short name" |
|
|
645 |
## [2] "Very long name would...uncating for plotting" |
|
|
646 |
## [3] "Plots don't work wel...d over by long labels" |
|
|
647 |
``` |