[e26484]: / docs / getting-started-model-analysis.md

Download this file

648 lines (531 with data), 29.1 kB

Getting Started with Model Analysis

The following functions are provided to assist in the analysis and visualisation
of a model fitted by mixOmics. Focus has been given to sPLS-DA models where a
single block is being fitted (single-omics) and to DIABLO models where multiple
blocks are being fitted simulataneously (multi-omics). The code below will
generate known trained models and corresponding performance tests for each model
type, which are then used to demonstrate the functions in the rest of this
document.

library(OmicsFold)
library(mixOmics)
library(dplyr)

#######################################
## sPLS-DA model and performance result
#######################################

# Prepare data
data(srbct)
splsda.X <- srbct$gene
splsda.Y <- srbct$class

# Model fit
splsda.ncomp <- 3
splsda.keepX <- c(9, 280, 30)
splsda.model <- splsda(splsda.X, splsda.Y,
                       ncomp = splsda.ncomp, keepX = splsda.keepX)

# Preformance result -- takes about 1 min to run
set.seed(1234) # for reproducibility
splsda.perf <- perf(splsda.model, validation = "Mfold", folds = 5,
                    dist = 'max.dist', nrepeat = 10,
                    progressBar = FALSE)


######################################
## DIABLO model and performance result
######################################

# Prepare data
data('breast.TCGA')
diablo.data <- list(mRNA = breast.TCGA$data.train$mrna, 
                    miRNA = breast.TCGA$data.train$mirna, 
                    proteomics = breast.TCGA$data.train$protein)
diablo.Y <- breast.TCGA$data.train$subtype
diablo.design <- matrix(0.1, ncol = length(diablo.data),
                        nrow = length(diablo.data), 
                        dimnames = list(names(diablo.data), names(diablo.data)))
diag(diablo.design) <- 0

# Model fit
diablo.ncomp <- 2
diablo.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))
diablo.model <- block.splsda(X = diablo.data, Y = diablo.Y,
                             ncomp = diablo.ncomp, keepX = diablo.keepX,
                             design = diablo.design)

# Preformance result -- takes about 1 min to run
set.seed(4321) # for reproducibility
diablo.perf <- perf(diablo.model, validation = 'Mfold', M = 10, nrepeat = 10, 
                    dist = 'centroids.dist')

Model variance analysis

OmicsFold provides the get.block.centroids() function
to extract the centroids of variance across blocks in a DIABLO model and export
them as a data frame and plot. The
get.model.variance() function is also provided which
will take either an sPLS-DA model or a DIABLO model as input and creates a data
frame showing the percentage contributions of each component to the model
variance.

Get block centroids

Applying the get.block.centroids() function to the DIABLO model from above
gives the following:

get.block.centroids(diablo.model)
## A tibble: 150 x 4
## Groups:   sample [150]
##    sample label comp1   comp2
##     <int> <fct> <dbl>   <dbl>
##  1      1 Basal  2.38  2.22  
##  2      2 Basal  1.74  2.03  
##  3      3 Basal  2.28 -0.0206
##  4      4 Basal  1.54  1.46  
##  5      5 Basal  1.93 -1.17  
## ... with 145 more rows

In addition to the tibble provided, a plot is produced showing the centroid
across the blocks in the first and second component for each sample in the data.
Each point in the plot represents a sample and is the mean of the three blocks
loaded for the DIABLO model. The circles show the distribution of all the
points sharing the same label and gives a good indication of the separation of
classifications in each of the first two components.

Block Centroids Plot

Get model variance

The get.model.variance() function can be applied to the model to identify how
much variation each component captures in the model for a given block:

get.model.variance(diablo.model, "mRNA")
##        Proportion Cumulative
## comp1 0.794465823  0.7944658
## comp2 0.008624793  0.8030906

get.model.variance(diablo.model, "miRNA")
##        Proportion Cumulative
## comp1 0.669987181  0.6699872
## comp2 0.001283394  0.6712706

get.model.variance(diablo.model, "proteomics")
##        Proportion Cumulative
## comp1 0.770439163  0.7704392
## comp2 0.002689493  0.7731287

Here we can see that the first component captures the vast majority of the
variance for all three blocks but that slightly less of the variance in miRNA
was captured than in mRNA or proteomics.

Feature analysis for sPLS-DA models

OmicsFold provides functions for analysing the features selecting for fitting
sPLS-DA models.

  • get.loadings.table() outputs a data
    frame of loadings for features selected in each component of the model.
  • feature.selection.stability()
    gives the loadings from above for a single component, alongside stability
    scores for the selected features.
  • merge.feature.stability()
    takes a table such as that generated by get.loadings.table() and the output
    from feature.selection.stability() merging them into a single data frame.
  • blockrank.splsda() takes an sPLS-DA model
    object and calculates a score measuring each features contribution to the variance in the response

Get feature loadings as a table

By calling get.loadings.table() with the sPLS-DA model as a parameter, a data
frame will be produced listing the loadings of each feature selected for the
sparse feature set. A separate column is included for each component and the
features are arranged according to their loading value in the earliest component
they are selected for. Ordering is highest loading first. Where a feature is
selected in more than one component, it appears only once in the table, showing
both loadings values. Where features are not selected for a component, the
loading value is NA.

The code example below demonstrates typical output from this function. The
output has been truncated for brevity in this getting started guide. The actual
output is 320 lines long.

get.loadings.table(splsda.model)
##            comp1         comp2       comp3
## g123  0.64922048            NA          NA
## g846  0.44828969            NA          NA
## g1606 0.30641602            NA          NA
## ...
## g1389         NA -2.312917e-01          NA
## g246          NA -2.001212e-01          NA
## g545          NA -1.986890e-01          NA
## ...
## g340          NA -4.226175e-02          NA
## g742          NA  4.204118e-02  0.14003503
## g1143         NA -4.124087e-02          NA
## ...
## g174          NA            NA -0.35418530
## g1896         NA            NA -0.29939099
## g603          NA            NA -0.27261886

Get sPLS-DA feature selection stability

Calling feature.selection.stability() with the model, performance test result
and a component number as parameters generates a data frame showing the loadings
values are were returned above, but for a single component only. The second
column in the data frame gives the stability value of the feature. This is
expressed as a fraction and indicates how many of the models in the performance
test included this feature in the sparse set. High stability indicates that a
feature is very important to the fit of the model and is therefore probably
important to separating the classes the model is trying to discriminate.

The example below shows extraction of the stability values for the 9 top
features included in component 1. Note how all stabilities are an exact
multiple of 0.02. This is because our performance test used 10 repeats and 5
folds, so 50 models were fitted, and each time a model selects a feature, this
adds 0.02 to the stability of that feature.

feature.selection.stability(splsda.model, splsda.perf, 1)
##            value stability
## g123  0.64922048      0.50
## g846  0.44828969      0.50
## g1606 0.30641602      0.40
## g335  0.30031495      0.48
## g836  0.26392532      0.46
## g783  0.22022625      0.28
## g758  0.22019689      0.38
## g1386 0.16248889      0.38
## g585  0.02058632      0.12

Merge feature stability on a table of loadings

After getting the loadings table for all the components of the model, it is
probably desirable to create a table with stability values shown in a final
column. merge.feature.stability() can be called with the loadings table and
stability table as parameters. The returned data frame adds a column for the
stability values matching the features in the loadings table. NA is used
where the stability value for a feature was not in the stability table.

Output for the loadings table is truncated in the example below for brevity.
You will see the full output in your own R instance.

# Prepare the loadings table and stability value table we intend to merge on.
loadings.table <- get.loadings.table(splsda.model)
stability.comp3 <- feature.selection.stability(splsda.model, splsda.perf, 3)

# Check the raw loadings table before merging any stability values.
loadings.table
##            comp1         comp2       comp3
## g123  0.64922048            NA          NA
## g846  0.44828969            NA          NA
## g742          NA  4.204118e-02  0.14003503
## g1143         NA -4.124087e-02          NA
## g174          NA            NA -0.35418530
## g1896         NA            NA -0.29939099

# Merge on stability values for component 3.
merge.feature.stability(loadings.table, stability.comp3)
## feature      comp1         comp2       comp3 stability
##    g123 0.64922048            NA          NA        NA
##    g846 0.44828969            NA          NA        NA
##    g742         NA  4.204118e-02  0.14003503      0.76
##   g1143         NA -4.124087e-02          NA        NA
##    g174         NA            NA -0.35418530      0.34
##   g1896         NA            NA -0.29939099      0.32

Measure variable importance

With a trained model object blockrank.splsda() will create a single item list containing a vector of BlockRank scores for that model. For more detail see section BlockRank below.

Feature analysis for DIABLO models

OmicsFold provides a number of functions for analysing the output from DIABLO
models produced in mixOmics.

  • get.diablo.top.loadings() outputs a data frame
    of feature loadings for a specified block in the DIABLO model.
  • diablo.selection.stability()
    extracts a data frame of stability scores from a DIABLO model performance test
    for features in a particular block and component.
  • get.diablo.top.loadings.with.stability()
    provides a concatenated data frame of feature loadings and stability scores
    for a given block in the DIABLO model.
  • blockrank.diablo() takes a DIABLO model
    object and calculates a score measuring each features contribution to the variance in the response
  • plot.feature.stability() provides a simple
    histogram plot of the highest stability features in a single component and
    block, allowing visual comparison of their stability scores.
  • find.feature.associations() outputs a
    matrix of values indicating possible feature associations in the model based
    on consistent feature measurement across samples.
  • export.matrix.as.network() exports a
    data frame showing relationships between features where feature association
    strength exceeds a threshold. The data frame is also written to a CSV file
    which can be imported into a network plotting application such as
    Cytoscape.

Get DIABLO top loadings

get.diablo.top.loadings() allows the export of loadings from a DIABLO model
into a data frame. The resulting data frame has columns for each component of
the model containing loadings for the features selected in that component. Rows
are sorted according to the absolute value of the loadings so that the highest
impact features are at the top of the table. All features for the first
component are listed before listing begins for features of the second and
subsequent components. If a feature is selected in more than one component,
only one row is given for that feature but multiple columns will contain a
loading for that feature. Where a feature was not selected in a component, the
NA value is shown for the loading.

In the example below, the loadings for the proteomics block are being arranged
in the data frame. Note that feature AR has been selected in both components.
Because its loading value in component 2 is -0.11183919, this would normally
place AR below c-Kit, but features are sorted by lower numbered components
first.

get.diablo.top.loadings(diablo.model$loadings$proteomics)
##                   comp1       comp2
## ER-alpha    -0.74995295          NA
## GATA3       -0.62448770          NA
## ASNS         0.16825721          NA
## Cyclin_B1    0.12026614          NA
## AR          -0.06842432 -0.11183919
## JNK2        -0.01137347          NA
## HER2                 NA -0.67182918
## HER2_pY1248          NA -0.62130938
## EGFR_pY1068          NA -0.33229248
## c-Kit                NA  0.17791016
## HER3_pY1289          NA -0.08706226
## XRCC1                NA  0.02149536

Get DIABLO feature selection stability

diablo.selection.stability() is very similar to the equivalent function for
sPLS-DA models
. By passing the
performance test result, a selected component and a selected block as
parameters, a data frame will be generated for the features in that block and
component, showing the stability scores for those features. Stability is a
fraction indicating the number of models in the performance test which selected
this feature when being fitted. 100 models were fitted when running the DIABLO
performance test (10 repeats with 10 folds), so the stability scores shown are
an exact multiple of 0.01. Features are presented in the data frame in
reverse stability score order so that the most stable features are shown first.
Those with a score of 1.0 are essential for the model to achieve the best fit
and are therefore also likely to be relevant to the discrimination of classes
the model is trying to achieve.

The code below outputs the feature selection stability for the proteomics
block in component 1.

diablo.selection.stability(diablo.perf, comp = 1, block = 'proteomics')
##   feature stability
##  ER-alpha      1.00
##     GATA3      1.00
##      ASNS      0.99
## Cyclin_B1      0.98
##        AR      0.80
##      JNK2      0.54
##        PR      0.41
## Cyclin_E1      0.22
##    INPP4B      0.12

Get DIABLO top loadings with stability

get.diablo.top.loadings.with.stability() allows the combination of information
from both the previous functions. By passing in the model, performance test and
block name, a data frame is generated with loadings of the selected features for
each component alongside stability scores for those features. Where a feature
exists in more than one component, the stability score shown is for the lowest
numbered component.

The code below generates a data frame containing loading values and stability
scores for the selected features in the proteomics block. Note that, unless
specified, the number of features returned will be up to 20 by default.

get.diablo.top.loadings.with.stability(diablo.model, diablo.perf,
                                       block = 'proteomics', feature.count = 50)
##                   comp1       comp2 stability
## ER-alpha    -0.74995295          NA      1.00
## GATA3       -0.62448770          NA      1.00
## ASNS         0.16825721          NA      0.99
## Cyclin_B1    0.12026614          NA      0.98
## AR          -0.06842432 -0.11183919      0.80
## JNK2        -0.01137347          NA      0.54
## HER2                 NA -0.67182918      1.00
## HER2_pY1248          NA -0.62130938      1.00
## EGFR_pY1068          NA -0.33229248      1.00
## c-Kit                NA  0.17791016      1.00
## HER3_pY1289          NA -0.08706226      0.95
## XRCC1                NA  0.02149536      0.56

Measure feature importance

With a trained model object blockrank.diablo() will create a list of vectors of BlockRank scores for each feature. For more detail see section BlockRank below.

Plot feature stability

plot.feature.stability() takes a single parameter with a table of stabilities,
as obtained from
diablo.selection.stability(), and
creates a bar-chart-style plot of stability for those features. The shape of
the plot will give and indication for how consistently the models choose the
same features during the performance tests. A plot that stays high for the
first features and then drops sharply to a low level for the remainder shows
that there is a clear preference for specific features and that those features
should be of value to look at more closely in relation to the biological
context. If the curve is closer to a straight line and only a very small number
of features achieve close to 1.0 stability, then the feature selection is less
stable and it may be harder to draw conclusions about which are relevant to the
biological context.

The code below prepares a stability data frame for the DIABLO performance test
for the miRNA block in component 2. It then plots the stability scores as a
bar-chart-style plot as shown below the code. This plot shows a reasonably
sharp drop off after about 10 features, which in turn means that the high
stability features on the left are likely to have good relevance to the
biological context.

stability <- diablo.selection.stability(diablo.perf, comp = 2, block = 'miRNA')
plot.feature.stability(stability)

Feature Stability Plot

Export DIABLO matrix as a network

export.matrix.as.network() takes the output of the associations function above
and both filters and converts the matrix to a flat format as a data frame as
well as writing that to a CSV file. The data frame has one line per association
and does not show reverse associations. Only associations which meet or exceed
the given cutoff threshold are included. The association can be positive or
negative to satisfy inclusion by the cutoff. The data frame shows both sides of
the association as two feature columns and the strength of the association as a
value column. The block column provides the name of the block containing the
left hand feature.

The CSV file specified is created and contains the same information as the data
frame. The format of the CSV file is importable into programs for plotting
networks, such as Cytoscape, which makes it easier to
visualise how the features in the model might be relevant to each other.

The code example below prepares the association matrix and then prepares a
vector of strings where the string names are the names of the blocks each of the
features are present in. It is also possible to provide the parameter
block.feature.count instead of the block.association in the unlikely event
that each block contains the same number of features. The output shown has been
truncated to only show the associations included from the matrix above. The
actual list of associations is much longer. A CSV file named network.csv is
created and populated with the information shown in the data frame output.

associations <- find.feature.associations(diablo.model, block.count = 3)
block.association <- character(0)
for (block.name in names(diablo.model$keepX)) {
  num.features.in.block <- diablo.model$keepX[[block.name]][1]
  block.association <- append(block.association,
                              rep(block.name, num.features.in.block))
}
export.matrix.as.network(associations, filename = "network.csv", cutoff = 0.7,
                         block.association = block.association)
##    feature.1    feature.2      value      block
##        KDM4B       ZNF552  0.8247985       mRNA
##        KDM4B        PREX1  0.7235040       mRNA
##       ZNF552        LRIG1  0.7072751       mRNA
##       ZNF552        PREX1  0.7308222       mRNA

Network functions

OmicsFold provides a set of functions to extract and process data from a DIABLO model
into a multi-omic interaction network visualization. The network functions identify associations between
top features in the model and reformat them into a network object
for direct visualization in R or in an external
program such as Cytoscape.

Extract feature associations

To extract feature associations from a multi-omic DIABLO model, use the
find.feature.associations() function.

associations <- find.feature.associations(diablo.model, feature_number=50, score_type="blockrank")

The function requires the final trained DIABLO model. The feature_number parameter can be
adjusted to select the number of top discriminative features to be returned. The score_type function
can be set to either "blockrank" or "loading" to determine which feature ranking metric to use.
The function will return a matrix of associations between the most important features in the model according to either
blockrank or loading scores.

Filter and reformat network

To filter and reformat the association matrix for network visualization, use the
filter.network() function.

omicsfold_network <- filter.network(diablo.model=diablo.model, associations=associations, cutoff=0.7, 
feature_list=c("phenol.sulfate", "Alistipes_putredinis"), remove_intrablock = FALSE)

The function requires the final trained DIABLO model and the association matrix from find.feature.associations().
A correlation cut off can be set with cutoff to remove associations between features that
do not meet the supplied threshold. A vector of feature names can be supplied to filter the network to only include
associations to features of interest with feature_list, and associations between features from the same block
can be removed with remove_intrablock. The function returns a dataframe describing associations
between features and the blocks that they belong to. The dataframe can
be exported to Cytoscape or applied to the plot.network() function for visualization in R.

Plot network

To visualize the multi-omic network, use the plot.network() function.

plot.network(omicsfold_network=omicsfold_network, diablo.model=diablo.model)

image

The function requires the network dataframe from the filter.network() function and the final trained DIABLO model. Nodes in the network represent
features and edges represent correlations between nodes. Nodes are colored by block and edges are colored by correlation value, with darker red edges
indicating the higher end of the correlation values in the network and darker orange edges indicating the lower end of correlation values in the network.
The function uses the Kamada-Kawai network layout algorithm to position nodes so that more strongly connected nodes will be pulled together,
while more weakly connected nodes will pushed away from other nodes.

BlockRank

OmicsFold implements a novel approach to feature analysis called BlockRank.
BlockRank allows the direct comparison of feature importance between different
data blocks, and can also be used to measure feature importance in a singleomic
setting.

The BlockRank Score for each feature is a number between 0 and 1, with 0
indicating an insignificant feature and 1 indicating the most important feature
in the model.

BlockRank for sPLS-DA

To measure the BlockRank score for a single-omic sPLS-DA model use the
blockrank.splsda() function.

blockrank.scores <- blockrank.splsda(splsda.model)

blockrank.scores is a list of one numeric vector, the vector containing a
score for each feature in the model data.

BlockRank for DIABLO

To measure the BlockRank score for a multi-omic DIABLO model use the
blockrank.diablo() function.

blockrank.scores <- blockrank.diablo(diablo.model)

blockrank.scores is a list of numeric vectors. Each vector corresponds to one
block of data, and the vector contains a score for each feature in the block.

Plotting BlockRank Scores

The function plot.blockrank.scores() returns a bar chart of the top 'n' BlockRank scores.

plot.blockrank.scores(blockrank.scores,
             nscores = 20,
             feature.font.size = 8,
             model = "DIABLO",
             data_source = "Breast TCGA")

Example Blockrank Plot

The function requires a list of BlockRank scores to plot, and optionally has parameters for how many scores are displayed, nscores; the font size of the feature labels, feature.font.size; the name of type of model plotted for the title of the graph, model, and the name of the data source for the graph title, data_source.

Model predictivity

plot.predicted.projection() is useful when an sPLS-DA single-omics model is
used to predict the classifications of unseen data and you want to visualise the
accuracy of that prediction. The built in predict() function from mixOmics
will take a trained model and some unseen observations of the same set of
features, returning an object which is complicated to parse, but contains the
predicted classifications of the new data.

The projection plot generated by our function contains a large centroid point in
the first two components for each of the classifications surrounded by smaller
points for each observation in the unseen set. The smaller points are coloured
according to their actual classification. A predictive model will render the
smaller points nearer to the large points of the same colour. Smaller points of
the wrong colour compared to their closest large point were predicted
incorrectly.

The code example below prepares a new sPLS-DA model from a smaller proportion of
the full dataset, described as the training data set. This training data set is
used to fit the predictive model and then that model is made to predict the
outcome of the held back observations.

Printing the projection plot generated by our function renders the plot shown
below the code. Here we can see that the predictions are mostly accurate with
the exception of one EWS point being too close to the RMS centroid. There
are also some other points that are situated approximately halfway between two
centroids, but they still appear to be closest to the correct centroid. The
model is therefore able to reasonably accurately predict the classification of
the unseen set with at least 1 error shown.

# Prepare data
data(srbct)
all.X <- srbct$gene
all.Y <- srbct$class
all.index <- 1:length(all.Y)

# Subset the data for a training set
set.seed(1234) # for reproducibility of subsetting
train.index <- sort(sample(all.index, length(all.index) * 0.8))
train.X <- all.X[train.index,]
train.Y <- all.Y[train.index]

# Model fit with the training set
predict.model <- splsda(train.X, train.Y, ncomp = 3, keepX = c(9, 280, 30))

# Prepare the prediction set
predict.index <- setdiff(all.index, train.index)
predict.X <- all.X[predict.index,]
predict.Y <- all.Y[predict.index]

# Create a prediction for the predict set and plot using plot.predicted.projection()
prediction <- predict(predict.model, predict.X)
projection.plot <- plot.predicted.projection(prediction, predict.Y)
print(projection.plot)

Prediction Projection Plot

Utility functions

center.truncate() is a helper function for plot labels, though it may have
other uses. Where labels are long, as is typical for features of some -omics
data, this can cause plots to be dominated by axis labels. By passing a string
to this function, it will be truncated to exactly 43 characters or fewer.
Characters in the middle of the string will be replaced by ... instead.

The examples below show both a single string being truncated, as well as how to
apply the truncation to a vector of strings using the built in R function
sapply().

center.truncate("When a string is particularly long it will be truncated back to 43 characters")
## "When a string is par...back to 43 characters"

labels = c("Short name",
           "Very long name would need truncating for plotting",
           "Plots don't work well when the axis is forced over by long labels")
unname(sapply(labels, center.truncate))
## [1] "Short name"
## [2] "Very long name would...uncating for plotting"
## [3] "Plots don't work wel...d over by long labels"