The following functions are provided to assist in the analysis and visualisation
of a model fitted by mixOmics
. Focus has been given to sPLS-DA models where a
single block is being fitted (single-omics) and to DIABLO models where multiple
blocks are being fitted simulataneously (multi-omics). The code below will
generate known trained models and corresponding performance tests for each model
type, which are then used to demonstrate the functions in the rest of this
document.
library(OmicsFold)
library(mixOmics)
library(dplyr)
#######################################
## sPLS-DA model and performance result
#######################################
# Prepare data
data(srbct)
splsda.X <- srbct$gene
splsda.Y <- srbct$class
# Model fit
splsda.ncomp <- 3
splsda.keepX <- c(9, 280, 30)
splsda.model <- splsda(splsda.X, splsda.Y,
ncomp = splsda.ncomp, keepX = splsda.keepX)
# Preformance result -- takes about 1 min to run
set.seed(1234) # for reproducibility
splsda.perf <- perf(splsda.model, validation = "Mfold", folds = 5,
dist = 'max.dist', nrepeat = 10,
progressBar = FALSE)
######################################
## DIABLO model and performance result
######################################
# Prepare data
data('breast.TCGA')
diablo.data <- list(mRNA = breast.TCGA$data.train$mrna,
miRNA = breast.TCGA$data.train$mirna,
proteomics = breast.TCGA$data.train$protein)
diablo.Y <- breast.TCGA$data.train$subtype
diablo.design <- matrix(0.1, ncol = length(diablo.data),
nrow = length(diablo.data),
dimnames = list(names(diablo.data), names(diablo.data)))
diag(diablo.design) <- 0
# Model fit
diablo.ncomp <- 2
diablo.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))
diablo.model <- block.splsda(X = diablo.data, Y = diablo.Y,
ncomp = diablo.ncomp, keepX = diablo.keepX,
design = diablo.design)
# Preformance result -- takes about 1 min to run
set.seed(4321) # for reproducibility
diablo.perf <- perf(diablo.model, validation = 'Mfold', M = 10, nrepeat = 10,
dist = 'centroids.dist')
OmicsFold provides the get.block.centroids()
function
to extract the centroids of variance across blocks in a DIABLO model and export
them as a data frame and plot. The
get.model.variance()
function is also provided which
will take either an sPLS-DA model or a DIABLO model as input and creates a data
frame showing the percentage contributions of each component to the model
variance.
Applying the get.block.centroids()
function to the DIABLO model from above
gives the following:
get.block.centroids(diablo.model)
## A tibble: 150 x 4
## Groups: sample [150]
## sample label comp1 comp2
## <int> <fct> <dbl> <dbl>
## 1 1 Basal 2.38 2.22
## 2 2 Basal 1.74 2.03
## 3 3 Basal 2.28 -0.0206
## 4 4 Basal 1.54 1.46
## 5 5 Basal 1.93 -1.17
## ... with 145 more rows
In addition to the tibble provided, a plot is produced showing the centroid
across the blocks in the first and second component for each sample in the data.
Each point in the plot represents a sample and is the mean of the three blocks
loaded for the DIABLO model. The circles show the distribution of all the
points sharing the same label and gives a good indication of the separation of
classifications in each of the first two components.
The get.model.variance()
function can be applied to the model to identify how
much variation each component captures in the model for a given block:
get.model.variance(diablo.model, "mRNA")
## Proportion Cumulative
## comp1 0.794465823 0.7944658
## comp2 0.008624793 0.8030906
get.model.variance(diablo.model, "miRNA")
## Proportion Cumulative
## comp1 0.669987181 0.6699872
## comp2 0.001283394 0.6712706
get.model.variance(diablo.model, "proteomics")
## Proportion Cumulative
## comp1 0.770439163 0.7704392
## comp2 0.002689493 0.7731287
Here we can see that the first component captures the vast majority of the
variance for all three blocks but that slightly less of the variance in miRNA
was captured than in mRNA
or proteomics
.
OmicsFold provides functions for analysing the features selecting for fitting
sPLS-DA models.
get.loadings.table()
outputs a datafeature.selection.stability()
merge.feature.stability()
get.loadings.table()
and the outputfeature.selection.stability()
merging them into a single data frame.blockrank.splsda()
takes an sPLS-DA model By calling get.loadings.table()
with the sPLS-DA model as a parameter, a data
frame will be produced listing the loadings of each feature selected for the
sparse feature set. A separate column is included for each component and the
features are arranged according to their loading value in the earliest component
they are selected for. Ordering is highest loading first. Where a feature is
selected in more than one component, it appears only once in the table, showing
both loadings values. Where features are not selected for a component, the
loading value is NA
.
The code example below demonstrates typical output from this function. The
output has been truncated for brevity in this getting started guide. The actual
output is 320 lines long.
get.loadings.table(splsda.model)
## comp1 comp2 comp3
## g123 0.64922048 NA NA
## g846 0.44828969 NA NA
## g1606 0.30641602 NA NA
## ...
## g1389 NA -2.312917e-01 NA
## g246 NA -2.001212e-01 NA
## g545 NA -1.986890e-01 NA
## ...
## g340 NA -4.226175e-02 NA
## g742 NA 4.204118e-02 0.14003503
## g1143 NA -4.124087e-02 NA
## ...
## g174 NA NA -0.35418530
## g1896 NA NA -0.29939099
## g603 NA NA -0.27261886
Calling feature.selection.stability()
with the model, performance test result
and a component number as parameters generates a data frame showing the loadings
values are were returned above, but for a single component only. The second
column in the data frame gives the stability value of the feature. This is
expressed as a fraction and indicates how many of the models in the performance
test included this feature in the sparse set. High stability indicates that a
feature is very important to the fit of the model and is therefore probably
important to separating the classes the model is trying to discriminate.
The example below shows extraction of the stability values for the 9 top
features included in component 1. Note how all stabilities are an exact
multiple of 0.02. This is because our performance test used 10 repeats and 5
folds, so 50 models were fitted, and each time a model selects a feature, this
adds 0.02 to the stability of that feature.
feature.selection.stability(splsda.model, splsda.perf, 1)
## value stability
## g123 0.64922048 0.50
## g846 0.44828969 0.50
## g1606 0.30641602 0.40
## g335 0.30031495 0.48
## g836 0.26392532 0.46
## g783 0.22022625 0.28
## g758 0.22019689 0.38
## g1386 0.16248889 0.38
## g585 0.02058632 0.12
After getting the loadings table for all the components of the model, it is
probably desirable to create a table with stability values shown in a final
column. merge.feature.stability()
can be called with the loadings table and
stability table as parameters. The returned data frame adds a column for the
stability values matching the features in the loadings table. NA
is used
where the stability value for a feature was not in the stability table.
Output for the loadings table is truncated in the example below for brevity.
You will see the full output in your own R instance.
# Prepare the loadings table and stability value table we intend to merge on.
loadings.table <- get.loadings.table(splsda.model)
stability.comp3 <- feature.selection.stability(splsda.model, splsda.perf, 3)
# Check the raw loadings table before merging any stability values.
loadings.table
## comp1 comp2 comp3
## g123 0.64922048 NA NA
## g846 0.44828969 NA NA
## g742 NA 4.204118e-02 0.14003503
## g1143 NA -4.124087e-02 NA
## g174 NA NA -0.35418530
## g1896 NA NA -0.29939099
# Merge on stability values for component 3.
merge.feature.stability(loadings.table, stability.comp3)
## feature comp1 comp2 comp3 stability
## g123 0.64922048 NA NA NA
## g846 0.44828969 NA NA NA
## g742 NA 4.204118e-02 0.14003503 0.76
## g1143 NA -4.124087e-02 NA NA
## g174 NA NA -0.35418530 0.34
## g1896 NA NA -0.29939099 0.32
With a trained model object blockrank.splsda()
will create a single item list containing a vector of BlockRank scores for that model. For more detail see section BlockRank below.
OmicsFold provides a number of functions for analysing the output from DIABLO
models produced in mixOmics
.
get.diablo.top.loadings()
outputs a data framediablo.selection.stability()
get.diablo.top.loadings.with.stability()
blockrank.diablo()
takes a DIABLO model plot.feature.stability()
provides a simplefind.feature.associations()
outputs aexport.matrix.as.network()
exports aget.diablo.top.loadings()
allows the export of loadings from a DIABLO model
into a data frame. The resulting data frame has columns for each component of
the model containing loadings for the features selected in that component. Rows
are sorted according to the absolute value of the loadings so that the highest
impact features are at the top of the table. All features for the first
component are listed before listing begins for features of the second and
subsequent components. If a feature is selected in more than one component,
only one row is given for that feature but multiple columns will contain a
loading for that feature. Where a feature was not selected in a component, the
NA
value is shown for the loading.
In the example below, the loadings for the proteomics
block are being arranged
in the data frame. Note that feature AR
has been selected in both components.
Because its loading value in component 2 is -0.11183919
, this would normally
place AR
below c-Kit
, but features are sorted by lower numbered components
first.
get.diablo.top.loadings(diablo.model$loadings$proteomics)
## comp1 comp2
## ER-alpha -0.74995295 NA
## GATA3 -0.62448770 NA
## ASNS 0.16825721 NA
## Cyclin_B1 0.12026614 NA
## AR -0.06842432 -0.11183919
## JNK2 -0.01137347 NA
## HER2 NA -0.67182918
## HER2_pY1248 NA -0.62130938
## EGFR_pY1068 NA -0.33229248
## c-Kit NA 0.17791016
## HER3_pY1289 NA -0.08706226
## XRCC1 NA 0.02149536
diablo.selection.stability()
is very similar to the equivalent function for
sPLS-DA models. By passing the
performance test result, a selected component and a selected block as
parameters, a data frame will be generated for the features in that block and
component, showing the stability scores for those features. Stability is a
fraction indicating the number of models in the performance test which selected
this feature when being fitted. 100 models were fitted when running the DIABLO
performance test (10 repeats with 10 folds), so the stability scores shown are
an exact multiple of 0.01
. Features are presented in the data frame in
reverse stability score order so that the most stable features are shown first.
Those with a score of 1.0
are essential for the model to achieve the best fit
and are therefore also likely to be relevant to the discrimination of classes
the model is trying to achieve.
The code below outputs the feature selection stability for the proteomics
block in component 1.
diablo.selection.stability(diablo.perf, comp = 1, block = 'proteomics')
## feature stability
## ER-alpha 1.00
## GATA3 1.00
## ASNS 0.99
## Cyclin_B1 0.98
## AR 0.80
## JNK2 0.54
## PR 0.41
## Cyclin_E1 0.22
## INPP4B 0.12
get.diablo.top.loadings.with.stability()
allows the combination of information
from both the previous functions. By passing in the model, performance test and
block name, a data frame is generated with loadings of the selected features for
each component alongside stability scores for those features. Where a feature
exists in more than one component, the stability score shown is for the lowest
numbered component.
The code below generates a data frame containing loading values and stability
scores for the selected features in the proteomics
block. Note that, unless
specified, the number of features returned will be up to 20
by default.
get.diablo.top.loadings.with.stability(diablo.model, diablo.perf,
block = 'proteomics', feature.count = 50)
## comp1 comp2 stability
## ER-alpha -0.74995295 NA 1.00
## GATA3 -0.62448770 NA 1.00
## ASNS 0.16825721 NA 0.99
## Cyclin_B1 0.12026614 NA 0.98
## AR -0.06842432 -0.11183919 0.80
## JNK2 -0.01137347 NA 0.54
## HER2 NA -0.67182918 1.00
## HER2_pY1248 NA -0.62130938 1.00
## EGFR_pY1068 NA -0.33229248 1.00
## c-Kit NA 0.17791016 1.00
## HER3_pY1289 NA -0.08706226 0.95
## XRCC1 NA 0.02149536 0.56
With a trained model object blockrank.diablo()
will create a list of vectors of BlockRank scores for each feature. For more detail see section BlockRank below.
plot.feature.stability()
takes a single parameter with a table of stabilities,
as obtained from
diablo.selection.stability()
, and
creates a bar-chart-style plot of stability for those features. The shape of
the plot will give and indication for how consistently the models choose the
same features during the performance tests. A plot that stays high for the
first features and then drops sharply to a low level for the remainder shows
that there is a clear preference for specific features and that those features
should be of value to look at more closely in relation to the biological
context. If the curve is closer to a straight line and only a very small number
of features achieve close to 1.0
stability, then the feature selection is less
stable and it may be harder to draw conclusions about which are relevant to the
biological context.
The code below prepares a stability data frame for the DIABLO performance test
for the miRNA
block in component 2. It then plots the stability scores as a
bar-chart-style plot as shown below the code. This plot shows a reasonably
sharp drop off after about 10 features, which in turn means that the high
stability features on the left are likely to have good relevance to the
biological context.
stability <- diablo.selection.stability(diablo.perf, comp = 2, block = 'miRNA')
plot.feature.stability(stability)
export.matrix.as.network()
takes the output of the associations function above
and both filters and converts the matrix to a flat format as a data frame as
well as writing that to a CSV file. The data frame has one line per association
and does not show reverse associations. Only associations which meet or exceed
the given cutoff threshold are included. The association can be positive or
negative to satisfy inclusion by the cutoff. The data frame shows both sides of
the association as two feature columns and the strength of the association as a
value column. The block column provides the name of the block containing the
left hand feature.
The CSV file specified is created and contains the same information as the data
frame. The format of the CSV file is importable into programs for plotting
networks, such as Cytoscape, which makes it easier to
visualise how the features in the model might be relevant to each other.
The code example below prepares the association matrix and then prepares a
vector of strings where the string names are the names of the blocks each of the
features are present in. It is also possible to provide the parameter
block.feature.count
instead of the block.association
in the unlikely event
that each block contains the same number of features. The output shown has been
truncated to only show the associations included from the matrix above. The
actual list of associations is much longer. A CSV file named network.csv
is
created and populated with the information shown in the data frame output.
associations <- find.feature.associations(diablo.model, block.count = 3)
block.association <- character(0)
for (block.name in names(diablo.model$keepX)) {
num.features.in.block <- diablo.model$keepX[[block.name]][1]
block.association <- append(block.association,
rep(block.name, num.features.in.block))
}
export.matrix.as.network(associations, filename = "network.csv", cutoff = 0.7,
block.association = block.association)
## feature.1 feature.2 value block
## KDM4B ZNF552 0.8247985 mRNA
## KDM4B PREX1 0.7235040 mRNA
## ZNF552 LRIG1 0.7072751 mRNA
## ZNF552 PREX1 0.7308222 mRNA
OmicsFold provides a set of functions to extract and process data from a DIABLO model
into a multi-omic interaction network visualization. The network functions identify associations between
top features in the model and reformat them into a network object
for direct visualization in R or in an external
program such as Cytoscape.
To extract feature associations from a multi-omic DIABLO model, use the
find.feature.associations()
function.
associations <- find.feature.associations(diablo.model, feature_number=50, score_type="blockrank")
The function requires the final trained DIABLO model. The feature_number
parameter can be
adjusted to select the number of top discriminative features to be returned. The score_type
function
can be set to either "blockrank" or "loading" to determine which feature ranking metric to use.
The function will return a matrix of associations between the most important features in the model according to either
blockrank or loading scores.
To filter and reformat the association matrix for network visualization, use the
filter.network()
function.
omicsfold_network <- filter.network(diablo.model=diablo.model, associations=associations, cutoff=0.7,
feature_list=c("phenol.sulfate", "Alistipes_putredinis"), remove_intrablock = FALSE)
The function requires the final trained DIABLO model and the association matrix from find.feature.associations()
.
A correlation cut off can be set with cutoff
to remove associations between features that
do not meet the supplied threshold. A vector of feature names can be supplied to filter the network to only include
associations to features of interest with feature_list
, and associations between features from the same block
can be removed with remove_intrablock
. The function returns a dataframe describing associations
between features and the blocks that they belong to. The dataframe can
be exported to Cytoscape or applied to the plot.network()
function for visualization in R.
To visualize the multi-omic network, use the plot.network()
function.
plot.network(omicsfold_network=omicsfold_network, diablo.model=diablo.model)
The function requires the network dataframe from the filter.network()
function and the final trained DIABLO model. Nodes in the network represent
features and edges represent correlations between nodes. Nodes are colored by block and edges are colored by correlation value, with darker red edges
indicating the higher end of the correlation values in the network and darker orange edges indicating the lower end of correlation values in the network.
The function uses the Kamada-Kawai network layout algorithm to position nodes so that more strongly connected nodes will be pulled together,
while more weakly connected nodes will pushed away from other nodes.
OmicsFold implements a novel approach to feature analysis called BlockRank.
BlockRank allows the direct comparison of feature importance between different
data blocks, and can also be used to measure feature importance in a singleomic
setting.
The BlockRank Score for each feature is a number between 0 and 1, with 0
indicating an insignificant feature and 1 indicating the most important feature
in the model.
To measure the BlockRank score for a single-omic sPLS-DA model use the
blockrank.splsda()
function.
blockrank.scores <- blockrank.splsda(splsda.model)
blockrank.scores
is a list of one numeric vector, the vector containing a
score for each feature in the model data.
To measure the BlockRank score for a multi-omic DIABLO model use the
blockrank.diablo()
function.
blockrank.scores <- blockrank.diablo(diablo.model)
blockrank.scores
is a list of numeric vectors. Each vector corresponds to one
block of data, and the vector contains a score for each feature in the block.
The function plot.blockrank.scores()
returns a bar chart of the top 'n' BlockRank scores.
plot.blockrank.scores(blockrank.scores,
nscores = 20,
feature.font.size = 8,
model = "DIABLO",
data_source = "Breast TCGA")
The function requires a list of BlockRank scores to plot, and optionally has parameters for how many scores are displayed, nscores
; the font size of the feature labels, feature.font.size
; the name of type of model plotted for the title of the graph, model
, and the name of the data source for the graph title, data_source
.
plot.predicted.projection()
is useful when an sPLS-DA single-omics model is
used to predict the classifications of unseen data and you want to visualise the
accuracy of that prediction. The built in predict()
function from mixOmics
will take a trained model and some unseen observations of the same set of
features, returning an object which is complicated to parse, but contains the
predicted classifications of the new data.
The projection plot generated by our function contains a large centroid point in
the first two components for each of the classifications surrounded by smaller
points for each observation in the unseen set. The smaller points are coloured
according to their actual classification. A predictive model will render the
smaller points nearer to the large points of the same colour. Smaller points of
the wrong colour compared to their closest large point were predicted
incorrectly.
The code example below prepares a new sPLS-DA model from a smaller proportion of
the full dataset, described as the training data set. This training data set is
used to fit the predictive model and then that model is made to predict the
outcome of the held back observations.
Printing the projection plot generated by our function renders the plot shown
below the code. Here we can see that the predictions are mostly accurate with
the exception of one EWS
point being too close to the RMS
centroid. There
are also some other points that are situated approximately halfway between two
centroids, but they still appear to be closest to the correct centroid. The
model is therefore able to reasonably accurately predict the classification of
the unseen set with at least 1 error shown.
# Prepare data
data(srbct)
all.X <- srbct$gene
all.Y <- srbct$class
all.index <- 1:length(all.Y)
# Subset the data for a training set
set.seed(1234) # for reproducibility of subsetting
train.index <- sort(sample(all.index, length(all.index) * 0.8))
train.X <- all.X[train.index,]
train.Y <- all.Y[train.index]
# Model fit with the training set
predict.model <- splsda(train.X, train.Y, ncomp = 3, keepX = c(9, 280, 30))
# Prepare the prediction set
predict.index <- setdiff(all.index, train.index)
predict.X <- all.X[predict.index,]
predict.Y <- all.Y[predict.index]
# Create a prediction for the predict set and plot using plot.predicted.projection()
prediction <- predict(predict.model, predict.X)
projection.plot <- plot.predicted.projection(prediction, predict.Y)
print(projection.plot)
center.truncate()
is a helper function for plot labels, though it may have
other uses. Where labels are long, as is typical for features of some -omics
data, this can cause plots to be dominated by axis labels. By passing a string
to this function, it will be truncated to exactly 43 characters or fewer.
Characters in the middle of the string will be replaced by ...
instead.
The examples below show both a single string being truncated, as well as how to
apply the truncation to a vector of strings using the built in R function
sapply()
.
center.truncate("When a string is particularly long it will be truncated back to 43 characters")
## "When a string is par...back to 43 characters"
labels = c("Short name",
"Very long name would need truncating for plotting",
"Plots don't work well when the axis is forced over by long labels")
unname(sapply(labels, center.truncate))
## [1] "Short name"
## [2] "Very long name would...uncating for plotting"
## [3] "Plots don't work wel...d over by long labels"