Switch to unified view

a b/docs/getting-started-model-analysis.md
1
# Getting Started with Model Analysis
2
3
The following functions are provided to assist in the analysis and visualisation
4
of a model fitted by `mixOmics`.  Focus has been given to sPLS-DA models where a
5
single block is being fitted (single-omics) and to DIABLO models where multiple
6
blocks are being fitted simulataneously (multi-omics).  The code below will
7
generate known trained models and corresponding performance tests for each model
8
type, which are then used to demonstrate the functions in the rest of this
9
document.
10
11
```R
12
library(OmicsFold)
13
library(mixOmics)
14
library(dplyr)
15
16
#######################################
17
## sPLS-DA model and performance result
18
#######################################
19
20
# Prepare data
21
data(srbct)
22
splsda.X <- srbct$gene
23
splsda.Y <- srbct$class
24
25
# Model fit
26
splsda.ncomp <- 3
27
splsda.keepX <- c(9, 280, 30)
28
splsda.model <- splsda(splsda.X, splsda.Y,
29
                       ncomp = splsda.ncomp, keepX = splsda.keepX)
30
31
# Preformance result -- takes about 1 min to run
32
set.seed(1234) # for reproducibility
33
splsda.perf <- perf(splsda.model, validation = "Mfold", folds = 5,
34
                    dist = 'max.dist', nrepeat = 10,
35
                    progressBar = FALSE)
36
37
38
######################################
39
## DIABLO model and performance result
40
######################################
41
42
# Prepare data
43
data('breast.TCGA')
44
diablo.data <- list(mRNA = breast.TCGA$data.train$mrna, 
45
                    miRNA = breast.TCGA$data.train$mirna, 
46
                    proteomics = breast.TCGA$data.train$protein)
47
diablo.Y <- breast.TCGA$data.train$subtype
48
diablo.design <- matrix(0.1, ncol = length(diablo.data),
49
                        nrow = length(diablo.data), 
50
                        dimnames = list(names(diablo.data), names(diablo.data)))
51
diag(diablo.design) <- 0
52
53
# Model fit
54
diablo.ncomp <- 2
55
diablo.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))
56
diablo.model <- block.splsda(X = diablo.data, Y = diablo.Y,
57
                             ncomp = diablo.ncomp, keepX = diablo.keepX,
58
                             design = diablo.design)
59
60
# Preformance result -- takes about 1 min to run
61
set.seed(4321) # for reproducibility
62
diablo.perf <- perf(diablo.model, validation = 'Mfold', M = 10, nrepeat = 10, 
63
                    dist = 'centroids.dist')
64
```
65
66
## Model variance analysis
67
68
OmicsFold provides the [`get.block.centroids()`](#get-block-centroids) function
69
to extract the centroids of variance across blocks in a DIABLO model and export
70
them as a data frame and plot.  The
71
[`get.model.variance()`](#get-model-variance) function is also provided which
72
will take either an sPLS-DA model or a DIABLO model as input and creates a data
73
frame showing the percentage contributions of each component to the model
74
variance.
75
76
### Get block centroids
77
78
Applying the `get.block.centroids()` function to the DIABLO model from above
79
gives the following:
80
81
```R
82
get.block.centroids(diablo.model)
83
## A tibble: 150 x 4
84
## Groups:   sample [150]
85
##    sample label comp1   comp2
86
##     <int> <fct> <dbl>   <dbl>
87
##  1      1 Basal  2.38  2.22  
88
##  2      2 Basal  1.74  2.03  
89
##  3      3 Basal  2.28 -0.0206
90
##  4      4 Basal  1.54  1.46  
91
##  5      5 Basal  1.93 -1.17  
92
## ... with 145 more rows
93
```
94
95
In addition to the tibble provided, a plot is produced showing the centroid
96
across the blocks in the first and second component for each sample in the data.
97
Each point in the plot represents a sample and is the mean of the three blocks
98
loaded for the DIABLO model.  The circles show the distribution of all the
99
points sharing the same label and gives a good indication of the separation of
100
classifications in each of the first two components.
101
102
![Block Centroids Plot](imgs/block-centroids-plot.png)
103
104
### Get model variance
105
106
The `get.model.variance()` function can be applied to the model to identify how
107
much variation each component captures in the model for a given block:
108
109
```R
110
get.model.variance(diablo.model, "mRNA")
111
##        Proportion Cumulative
112
## comp1 0.794465823  0.7944658
113
## comp2 0.008624793  0.8030906
114
115
get.model.variance(diablo.model, "miRNA")
116
##        Proportion Cumulative
117
## comp1 0.669987181  0.6699872
118
## comp2 0.001283394  0.6712706
119
120
get.model.variance(diablo.model, "proteomics")
121
##        Proportion Cumulative
122
## comp1 0.770439163  0.7704392
123
## comp2 0.002689493  0.7731287
124
```
125
126
Here we can see that the first component captures the vast majority of the
127
variance for all three blocks but that slightly less of the variance in `miRNA`
128
was captured than in `mRNA` or `proteomics`.
129
130
## Feature analysis for sPLS-DA models
131
132
OmicsFold provides functions for analysing the features selecting for fitting
133
sPLS-DA models.
134
135
- [`get.loadings.table()`](#get-feature-loadings-as-a-table) outputs a data
136
  frame of loadings for features selected in each component of the model.
137
- [`feature.selection.stability()`](#get-spls-da-feature-selection-stability)
138
  gives the loadings from above for a single component, alongside stability
139
  scores for the selected features.
140
- [`merge.feature.stability()`](#merge-feature-stability-on-a-table-of-loadings)
141
  takes a table such as that generated by `get.loadings.table()` and the output
142
  from `feature.selection.stability()` merging them into a single data frame.
143
- [`blockrank.splsda()`](#measure-variable-importance) takes an sPLS-DA model 
144
  object and calculates a score measuring each features contribution to the variance in the response
145
146
### Get feature loadings as a table
147
148
By calling `get.loadings.table()` with the sPLS-DA model as a parameter, a data
149
frame will be produced listing the loadings of each feature selected for the
150
sparse feature set.  A separate column is included for each component and the
151
features are arranged according to their loading value in the earliest component
152
they are selected for.  Ordering is highest loading first.  Where a feature is
153
selected in more than one component, it appears only once in the table, showing
154
both loadings values.  Where features are not selected for a component, the
155
loading value is `NA`.
156
157
The code example below demonstrates typical output from this function.  The
158
output has been truncated for brevity in this getting started guide.  The actual
159
output is 320 lines long.
160
161
```R
162
get.loadings.table(splsda.model)
163
##            comp1         comp2       comp3
164
## g123  0.64922048            NA          NA
165
## g846  0.44828969            NA          NA
166
## g1606 0.30641602            NA          NA
167
## ...
168
## g1389         NA -2.312917e-01          NA
169
## g246          NA -2.001212e-01          NA
170
## g545          NA -1.986890e-01          NA
171
## ...
172
## g340          NA -4.226175e-02          NA
173
## g742          NA  4.204118e-02  0.14003503
174
## g1143         NA -4.124087e-02          NA
175
## ...
176
## g174          NA            NA -0.35418530
177
## g1896         NA            NA -0.29939099
178
## g603          NA            NA -0.27261886
179
```
180
181
### Get sPLS-DA feature selection stability
182
183
Calling `feature.selection.stability()` with the model, performance test result
184
and a component number as parameters generates a data frame showing the loadings
185
values are were returned above, but for a single component only.  The second
186
column in the data frame gives the stability value of the feature.  This is
187
expressed as a fraction and indicates how many of the models in the performance
188
test included this feature in the sparse set.  High stability indicates that a
189
feature is very important to the fit of the model and is therefore probably
190
important to separating the classes the model is trying to discriminate.
191
192
The example below shows extraction of the stability values for the 9 top
193
features included in component 1.  Note how all stabilities are an exact
194
multiple of 0.02.  This is because our performance test used 10 repeats and 5
195
folds, so 50 models were fitted, and each time a model selects a feature, this
196
adds 0.02 to the stability of that feature.
197
198
```R
199
feature.selection.stability(splsda.model, splsda.perf, 1)
200
##            value stability
201
## g123  0.64922048      0.50
202
## g846  0.44828969      0.50
203
## g1606 0.30641602      0.40
204
## g335  0.30031495      0.48
205
## g836  0.26392532      0.46
206
## g783  0.22022625      0.28
207
## g758  0.22019689      0.38
208
## g1386 0.16248889      0.38
209
## g585  0.02058632      0.12
210
```
211
212
### Merge feature stability on a table of loadings
213
214
After getting the loadings table for all the components of the model, it is
215
probably desirable to create a table with stability values shown in a final
216
column.  `merge.feature.stability()` can be called with the loadings table and
217
stability table as parameters.  The returned data frame adds a column for the
218
stability values matching the features in the loadings table.  `NA` is used
219
where the stability value for a feature was not in the stability table.
220
221
Output for the loadings table is truncated in the example below for brevity.
222
You will see the full output in your own R instance.
223
224
```R
225
# Prepare the loadings table and stability value table we intend to merge on.
226
loadings.table <- get.loadings.table(splsda.model)
227
stability.comp3 <- feature.selection.stability(splsda.model, splsda.perf, 3)
228
229
# Check the raw loadings table before merging any stability values.
230
loadings.table
231
##            comp1         comp2       comp3
232
## g123  0.64922048            NA          NA
233
## g846  0.44828969            NA          NA
234
## g742          NA  4.204118e-02  0.14003503
235
## g1143         NA -4.124087e-02          NA
236
## g174          NA            NA -0.35418530
237
## g1896         NA            NA -0.29939099
238
239
# Merge on stability values for component 3.
240
merge.feature.stability(loadings.table, stability.comp3)
241
## feature      comp1         comp2       comp3 stability
242
##    g123 0.64922048            NA          NA        NA
243
##    g846 0.44828969            NA          NA        NA
244
##    g742         NA  4.204118e-02  0.14003503      0.76
245
##   g1143         NA -4.124087e-02          NA        NA
246
##    g174         NA            NA -0.35418530      0.34
247
##   g1896         NA            NA -0.29939099      0.32
248
```
249
250
### Measure variable importance
251
252
With a trained model object `blockrank.splsda()` will create a single item list containing a vector of BlockRank scores for that model. For more detail see section [BlockRank](#blockrank) below.
253
254
255
## Feature analysis for DIABLO models
256
257
OmicsFold provides a number of functions for analysing the output from DIABLO
258
models produced in `mixOmics`.
259
260
- [`get.diablo.top.loadings()`](#get-diablo-top-loadings) outputs a data frame
261
  of feature loadings for a specified block in the DIABLO model.
262
- [`diablo.selection.stability()`](#get-diablo-feature-selection-stability)
263
  extracts a data frame of stability scores from a DIABLO model performance test
264
  for features in a particular block and component.
265
- [`get.diablo.top.loadings.with.stability()`](#get-diablo-top-loadings-with-stability)
266
  provides a concatenated data frame of feature loadings and stability scores
267
  for a given block in the DIABLO model.
268
- [`blockrank.diablo()`](#measure-feature-importance) takes a DIABLO model 
269
  object and calculates a score measuring each features contribution to the variance in the response
270
- [`plot.feature.stability()`](#plot-feature-stability) provides a simple
271
  histogram plot of the highest stability features in a single component and
272
  block, allowing visual comparison of their stability scores.
273
- [`find.feature.associations()`](#find-diablo-feature-associations) outputs a
274
  matrix of values indicating possible feature associations in the model based
275
  on consistent feature measurement across samples.
276
- [`export.matrix.as.network()`](#export-diablo-matrix-as-a-network) exports a
277
  data frame showing relationships between features where feature association
278
  strength exceeds a threshold.  The data frame is also written to a CSV file
279
  which can be imported into a network plotting application such as
280
  [Cytoscape](https://cytoscape.org/).
281
282
### Get DIABLO top loadings
283
284
`get.diablo.top.loadings()` allows the export of loadings from a DIABLO model
285
into a data frame.  The resulting data frame has columns for each component of
286
the model containing loadings for the features selected in that component.  Rows
287
are sorted according to the absolute value of the loadings so that the highest
288
impact features are at the top of the table.  All features for the first
289
component are listed before listing begins for features of the second and
290
subsequent components.  If a feature is selected in more than one component,
291
only one row is given for that feature but multiple columns will contain a
292
loading for that feature.  Where a feature was not selected in a component, the
293
`NA` value is shown for the loading.
294
295
In the example below, the loadings for the `proteomics` block are being arranged
296
in the data frame.  Note that feature `AR` has been selected in both components.
297
Because its loading value in component 2 is `-0.11183919`, this would normally
298
place `AR` below `c-Kit`, but features are sorted by lower numbered components
299
first.
300
301
```R
302
get.diablo.top.loadings(diablo.model$loadings$proteomics)
303
##                   comp1       comp2
304
## ER-alpha    -0.74995295          NA
305
## GATA3       -0.62448770          NA
306
## ASNS         0.16825721          NA
307
## Cyclin_B1    0.12026614          NA
308
## AR          -0.06842432 -0.11183919
309
## JNK2        -0.01137347          NA
310
## HER2                 NA -0.67182918
311
## HER2_pY1248          NA -0.62130938
312
## EGFR_pY1068          NA -0.33229248
313
## c-Kit                NA  0.17791016
314
## HER3_pY1289          NA -0.08706226
315
## XRCC1                NA  0.02149536
316
```
317
318
### Get DIABLO feature selection stability
319
320
`diablo.selection.stability()` is very similar to the [equivalent function for
321
sPLS-DA models](#get-spls-da-feature-selection-stability).  By passing the
322
performance test result, a selected component and a selected block as
323
parameters, a data frame will be generated for the features in that block and
324
component, showing the stability scores for those features.  Stability is a
325
fraction indicating the number of models in the performance test which selected
326
this feature when being fitted.  100 models were fitted when running the DIABLO
327
performance test (10 repeats with 10 folds), so the stability scores shown are
328
an exact multiple of `0.01`.  Features are presented in the data frame in
329
reverse stability score order so that the most stable features are shown first.
330
Those with a score of `1.0` are essential for the model to achieve the best fit
331
and are therefore also likely to be relevant to the discrimination of classes
332
the model is trying to achieve.
333
334
The code below outputs the feature selection stability for the `proteomics`
335
block in component 1.
336
337
```R
338
diablo.selection.stability(diablo.perf, comp = 1, block = 'proteomics')
339
##   feature stability
340
##  ER-alpha      1.00
341
##     GATA3      1.00
342
##      ASNS      0.99
343
## Cyclin_B1      0.98
344
##        AR      0.80
345
##      JNK2      0.54
346
##        PR      0.41
347
## Cyclin_E1      0.22
348
##    INPP4B      0.12
349
```
350
351
### Get DIABLO top loadings with stability
352
353
`get.diablo.top.loadings.with.stability()` allows the combination of information
354
from both the previous functions.  By passing in the model, performance test and
355
block name, a data frame is generated with loadings of the selected features for
356
each component alongside stability scores for those features.  Where a feature
357
exists in more than one component, the stability score shown is for the lowest
358
numbered component.
359
360
The code below generates a data frame containing loading values and stability
361
scores for the selected features in the `proteomics` block.  Note that, unless
362
specified, the number of features returned will be up to `20` by default.
363
364
```R
365
get.diablo.top.loadings.with.stability(diablo.model, diablo.perf,
366
                                       block = 'proteomics', feature.count = 50)
367
##                   comp1       comp2 stability
368
## ER-alpha    -0.74995295          NA      1.00
369
## GATA3       -0.62448770          NA      1.00
370
## ASNS         0.16825721          NA      0.99
371
## Cyclin_B1    0.12026614          NA      0.98
372
## AR          -0.06842432 -0.11183919      0.80
373
## JNK2        -0.01137347          NA      0.54
374
## HER2                 NA -0.67182918      1.00
375
## HER2_pY1248          NA -0.62130938      1.00
376
## EGFR_pY1068          NA -0.33229248      1.00
377
## c-Kit                NA  0.17791016      1.00
378
## HER3_pY1289          NA -0.08706226      0.95
379
## XRCC1                NA  0.02149536      0.56
380
```
381
382
### Measure feature importance
383
384
With a trained model object `blockrank.diablo()` will create a list of vectors of BlockRank scores for each feature. For more detail see section [BlockRank](#blockrank) below.
385
386
### Plot feature stability
387
388
`plot.feature.stability()` takes a single parameter with a table of stabilities,
389
as obtained from
390
[`diablo.selection.stability()`](#get-diablo-feature-selection-stability), and
391
creates a bar-chart-style plot of stability for those features.  The shape of
392
the plot will give and indication for how consistently the models choose the
393
same features during the performance tests.  A plot that stays high for the
394
first features and then drops sharply to a low level for the remainder shows
395
that there is a clear preference for specific features and that those features
396
should be of value to look at more closely in relation to the biological
397
context.  If the curve is closer to a straight line and only a very small number
398
of features achieve close to `1.0` stability, then the feature selection is less
399
stable and it may be harder to draw conclusions about which are relevant to the
400
biological context.
401
402
The code below prepares a stability data frame for the DIABLO performance test
403
for the `miRNA` block in component 2.  It then plots the stability scores as a
404
bar-chart-style plot as shown below the code.  This plot shows a reasonably
405
sharp drop off after about 10 features, which in turn means that the high
406
stability features on the left are likely to have good relevance to the
407
biological context.
408
409
```R
410
stability <- diablo.selection.stability(diablo.perf, comp = 2, block = 'miRNA')
411
plot.feature.stability(stability)
412
```
413
414
![Feature Stability Plot](img/../imgs/plot-feature-stability.png)
415
416
### Export DIABLO matrix as a network
417
418
`export.matrix.as.network()` takes the output of the associations function above
419
and both filters and converts the matrix to a flat format as a data frame as
420
well as writing that to a CSV file.  The data frame has one line per association
421
and does not show reverse associations.  Only associations which meet or exceed
422
the given cutoff threshold are included.  The association can be positive or
423
negative to satisfy inclusion by the cutoff.  The data frame shows both sides of
424
the association as two feature columns and the strength of the association as a
425
value column.  The block column provides the name of the block containing the
426
left hand feature.
427
428
The CSV file specified is created and contains the same information as the data
429
frame.  The format of the CSV file is importable into programs for plotting
430
networks, such as [Cytoscape](https://cytoscape.org/), which makes it easier to
431
visualise how the features in the model might be relevant to each other.
432
433
The code example below prepares the association matrix and then prepares a
434
vector of strings where the string names are the names of the blocks each of the
435
features are present in.  It is also possible to provide the parameter
436
`block.feature.count` instead of the `block.association` in the unlikely event
437
that each block contains the same number of features.  The output shown has been
438
truncated to only show the associations included from the matrix above. The
439
actual list of associations is much longer.  A CSV file named `network.csv` is
440
created and populated with the information shown in the data frame output.
441
442
```R
443
associations <- find.feature.associations(diablo.model, block.count = 3)
444
block.association <- character(0)
445
for (block.name in names(diablo.model$keepX)) {
446
  num.features.in.block <- diablo.model$keepX[[block.name]][1]
447
  block.association <- append(block.association,
448
                              rep(block.name, num.features.in.block))
449
}
450
export.matrix.as.network(associations, filename = "network.csv", cutoff = 0.7,
451
                         block.association = block.association)
452
##    feature.1    feature.2      value      block
453
##        KDM4B       ZNF552  0.8247985       mRNA
454
##        KDM4B        PREX1  0.7235040       mRNA
455
##       ZNF552        LRIG1  0.7072751       mRNA
456
##       ZNF552        PREX1  0.7308222       mRNA
457
```
458
459
## Network functions
460
OmicsFold provides a set of functions to extract and process data from a DIABLO model 
461
into a multi-omic interaction network visualization. The network functions identify associations between 
462
top features in the model and reformat them into a network object 
463
for direct visualization in R or in an external
464
program such as Cytoscape.
465
466
### Extract feature associations
467
468
To extract feature associations from a multi-omic DIABLO model, use the 
469
`find.feature.associations()` function.
470
471
```R
472
associations <- find.feature.associations(diablo.model, feature_number=50, score_type="blockrank")
473
```
474
475
The function requires the final trained DIABLO model. The `feature_number` parameter can be 
476
adjusted to select the number of top discriminative features to be returned. The `score_type` function
477
can be set to either "blockrank" or "loading" to determine which feature ranking metric to use.
478
The function will return a matrix of associations between the most important features in the model according to either 
479
blockrank or loading scores.
480
481
### Filter and reformat network
482
483
To filter and reformat the association matrix for network visualization, use the 
484
`filter.network()` function.
485
486
```R
487
omicsfold_network <- filter.network(diablo.model=diablo.model, associations=associations, cutoff=0.7, 
488
feature_list=c("phenol.sulfate", "Alistipes_putredinis"), remove_intrablock = FALSE)
489
```
490
491
The function requires the final trained DIABLO model and the association matrix from `find.feature.associations()`.
492
A correlation cut off can be set with `cutoff` to remove associations between features that 
493
do not meet the supplied threshold. A vector of feature names can be supplied to filter the network to only include
494
associations to features of interest with `feature_list`, and associations between features from the same block
495
can be removed with `remove_intrablock`. The function returns a dataframe describing associations 
496
between features and the blocks that they belong to. The dataframe can
497
be exported to Cytoscape or applied to the `plot.network()` function for visualization in R. 
498
499
### Plot network
500
501
To visualize the multi-omic network, use the `plot.network()` function. 
502
503
```R
504
plot.network(omicsfold_network=omicsfold_network, diablo.model=diablo.model)
505
```
506
507
![image](https://user-images.githubusercontent.com/49563541/182643757-33a50d8d-153c-496e-af17-dab19cddcdb8.png)
508
509
The function requires the network dataframe from the `filter.network()` function and the final trained DIABLO model. Nodes in the network represent 
510
features and edges represent correlations between nodes. Nodes are colored by block and edges are colored by correlation value, with darker red edges 
511
indicating the higher end of the correlation values in the network and darker orange edges indicating the lower end of correlation values in the network.
512
The function uses the Kamada-Kawai network layout algorithm to position nodes so that more strongly connected nodes will be pulled together,
513
while more weakly connected nodes will pushed away from other nodes.
514
515
## BlockRank
516
517
OmicsFold implements a novel approach to feature analysis called BlockRank. 
518
BlockRank allows the direct comparison of feature importance between different 
519
data blocks, and can also be used to measure feature importance in a singleomic 
520
setting.
521
522
The BlockRank Score for each feature is a number between 0 and 1, with 0 
523
indicating an insignificant feature and 1 indicating the most important feature 
524
in the model.
525
526
### BlockRank for sPLS-DA
527
528
To measure the BlockRank score for a single-omic sPLS-DA model use the 
529
`blockrank.splsda()` function.
530
531
```R
532
blockrank.scores <- blockrank.splsda(splsda.model)
533
```
534
535
`blockrank.scores` is a list of one numeric vector, the vector containing a 
536
score for each feature in the model data.
537
538
### BlockRank for DIABLO
539
540
To measure the BlockRank score for a multi-omic DIABLO model use the 
541
`blockrank.diablo()` function.
542
543
```R
544
blockrank.scores <- blockrank.diablo(diablo.model)
545
```
546
547
`blockrank.scores` is a list of numeric vectors. Each vector corresponds to one 
548
block of data, and the vector contains a score for each feature in the block.
549
550
### Plotting BlockRank Scores
551
552
The function `plot.blockrank.scores()` returns a bar chart of the top 'n' BlockRank scores.
553
554
```R
555
plot.blockrank.scores(blockrank.scores,
556
             nscores = 20,
557
             feature.font.size = 8,
558
             model = "DIABLO",
559
             data_source = "Breast TCGA")
560
```
561
![Example Blockrank Plot](imgs/blockrank_diablo.png)
562
563
The function requires a list of BlockRank scores to plot, and optionally has parameters for how many scores are displayed, `nscores`; the font size of the feature labels, `feature.font.size`; the name of type of model plotted for the title of the graph, `model`, and the name of the data source for the graph title, `data_source`.
564
565
## Model predictivity
566
567
`plot.predicted.projection()` is useful when an sPLS-DA single-omics model is
568
used to predict the classifications of unseen data and you want to visualise the
569
accuracy of that prediction.  The built in `predict()` function from `mixOmics`
570
will take a trained model and some unseen observations of the same set of
571
features, returning an object which is complicated to parse, but contains the
572
predicted classifications of the new data.
573
574
The projection plot generated by our function contains a large centroid point in
575
the first two components for each of the classifications surrounded by smaller
576
points for each observation in the unseen set.  The smaller points are coloured
577
according to their actual classification.  A predictive model will render the
578
smaller points nearer to the large points of the same colour.  Smaller points of
579
the wrong colour compared to their closest large point were predicted
580
incorrectly.
581
582
The code example below prepares a new sPLS-DA model from a smaller proportion of
583
the full dataset, described as the training data set.  This training data set is
584
used to fit the predictive model and then that model is made to predict the
585
outcome of the held back observations.
586
587
Printing the projection plot generated by our function renders the plot shown
588
below the code.  Here we can see that the predictions are mostly accurate with
589
the exception of one `EWS` point being too close to the `RMS` centroid.  There
590
are also some other points that are situated approximately halfway between two
591
centroids, but they still appear to be closest to the correct centroid.  The
592
model is therefore able to reasonably accurately predict the classification of
593
the unseen set with at least 1 error shown.
594
595
```R
596
# Prepare data
597
data(srbct)
598
all.X <- srbct$gene
599
all.Y <- srbct$class
600
all.index <- 1:length(all.Y)
601
602
# Subset the data for a training set
603
set.seed(1234) # for reproducibility of subsetting
604
train.index <- sort(sample(all.index, length(all.index) * 0.8))
605
train.X <- all.X[train.index,]
606
train.Y <- all.Y[train.index]
607
608
# Model fit with the training set
609
predict.model <- splsda(train.X, train.Y, ncomp = 3, keepX = c(9, 280, 30))
610
611
# Prepare the prediction set
612
predict.index <- setdiff(all.index, train.index)
613
predict.X <- all.X[predict.index,]
614
predict.Y <- all.Y[predict.index]
615
616
# Create a prediction for the predict set and plot using plot.predicted.projection()
617
prediction <- predict(predict.model, predict.X)
618
projection.plot <- plot.predicted.projection(prediction, predict.Y)
619
print(projection.plot)
620
```
621
622
![Prediction Projection Plot](imgs/plot-predicted-projection.png)
623
624
## Utility functions
625
626
`center.truncate()` is a helper function for plot labels, though it may have
627
other uses.  Where labels are long, as is typical for features of some -omics
628
data, this can cause plots to be dominated by axis labels.  By passing a string
629
to this function, it will be truncated to exactly 43 characters or fewer.
630
Characters in the middle of the string will be replaced by `...` instead.
631
632
The examples below show both a single string being truncated, as well as how to
633
apply the truncation to a vector of strings using the built in R function
634
`sapply()`.
635
636
```R
637
center.truncate("When a string is particularly long it will be truncated back to 43 characters")
638
## "When a string is par...back to 43 characters"
639
640
labels = c("Short name",
641
           "Very long name would need truncating for plotting",
642
           "Plots don't work well when the axis is forced over by long labels")
643
unname(sapply(labels, center.truncate))
644
## [1] "Short name"
645
## [2] "Very long name would...uncating for plotting"
646
## [3] "Plots don't work wel...d over by long labels"
647
```