The Maui Class

class maui.Maui(n_hidden=[1500], n_latent=80, batch_size=100, epochs=400, architecture='stacked', initial_beta_val=0, kappa=1.0, max_beta_val=1, learning_rate=0.0005, epsilon_std=1.0, batch_normalize_inputs=True, batch_normalize_intermediaries=True, batch_normalize_embedding=True, relu_intermediaries=True, relu_embedding=True, input_dim=None)[source]

Maui (Multi-omics Autoencoder Integration) model.

Trains a variational autoencoder to find latent factors in multi-modal data.

n_hidden: array (default [1500])
The sizes of the hidden layers of the autoencoder architecture. Each element of the array specifies the number of nodes in successive layers of the autoencoder
n_latent: int (default 80)
The size of the latent layer (number of latent features)
batch_size: int (default 100)
The size of the mini-batches used for training the network
epochs: int (default 400)
The number of epochs to use for training the network
architecture:
One of ‘stacked’ or ‘deep’. If ‘stacked’, will use a stacked VAE model, where the intermediate layers are also variational. If ‘deep’, will train a deep VAE where the intermediate layers are regular (ReLU) units, and only the middle (latent) layer is variational.
c_index(survival, clinical_only=True, duration_column='duration', observed_column='observed', cox_penalties=[0.1, 1, 10, 100, 1000, 10000], cv_folds=5, sel_clin_alpha=0.05, sel_clin_penalty=0)[source]

Computes Harrell’s c-Index for a Cox Proportional Hazards regression modeling survival by the latent factors in z.

z: pd.DataFrame (n_samples, n_latent factors)
survival: pd.DataFrame of survival information and relevant covariates
(such as sex, age at diagnosis, or tumor stage)
clinical_only: Compute the c-Index for a model containing only
individually clinically relevant latent factors (see select_clinical_factors)
duration_column: the name of the column in survival containing the
duration (time between diagnosis and death or last followup)
observed_column: the name of the column in survival
indicating whether time of death is known
cox_penalties: penalty coefficients to try in the Cox PH solver (see lifelines.CoxPHFitter).
Returns the best c given by the different penalties (by cross-validation)

cv_folds: number of cross-validation folds used to compute c
sel_clin_penalty: CPH penalizer to use when selecting clinical factors
sel_clin_alpha: significance level when selecting clinical factors

cs: array, Harrell’s c-Index, an AUC-like metric for survival prediction accuracy,
one value per cv_fold
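
To make the metric concrete, here is a minimal pure-Python sketch of Harrell’s c-Index for a single set of risk scores. It is an illustration only, not maui’s implementation (which, per the description above, fits Cox models via lifelines and cross-validates them). The function name `harrell_c_index` is hypothetical, and skipping pairs with tied durations is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(durations, observed, risk_scores):
    """Fraction of comparable pairs whose predicted risks are ordered correctly."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(durations)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if durations[i] < durations[j] else (j, i)
        # Assumption: pairs with tied durations are skipped. A pair is
        # comparable only if the shorter time ends in an observed event.
        if durations[a] == durations[b] or not observed[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher predicted risk, earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    return concordant / comparable


durations = [2, 4, 6, 8]
observed = [True, True, False, True]
perfect = harrell_c_index(durations, observed, [4.0, 3.0, 2.0, 1.0])
inverted = harrell_c_index(durations, observed, [1.0, 2.0, 3.0, 4.0])
uninformative = harrell_c_index(durations, observed, [1.0, 1.0, 1.0, 1.0])
```

A value of 1 means every comparable pair is ranked correctly, 0.5 corresponds to uninformative predictions, and 0 means every pair is ranked in reverse.
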
cluster(k=None, optimal_k_method='ami', optimal_k_range=range(3, 10), ami_y=None, kmeans_kwargs={'n_init': 1000, 'n_jobs': 2})[source]

Cluster the samples using k-means based on the latent factors.

k: optional, the number of clusters to find.
if not given, will attempt to find optimal k.
optimal_k_method: supported methods are ‘ami’ and ‘silhouette’. Otherwise, callable.
If ‘ami’, will pick the K which gives the best AMI (adjusted mutual information) with external labels. If ‘silhouette’, will pick the K which gives the best mean silhouette coefficient. If callable, should have signature scorer(yhat) and return a scalar score.

optimal_k_range: array-like, range of Ks to try to find optimal K among
ami_y: array-like (n_samples), the ground-truth labels to use
when picking K by “best AMI against ground-truth” method.

kmeans_kwargs: optional, kwargs for initialization of sklearn.cluster.KMeans

yhat: Series (n_samples) cluster labels for each sample
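
As an illustration of the callable form of optimal_k_method, the sketch below defines a scorer with the documented signature scorer(yhat) that returns a scalar. The scorer itself (`balance_scorer`, which rewards evenly sized clusters) is hypothetical and not part of maui, and it assumes that higher scores are preferred when picking K.

```python
import math
from collections import Counter


def balance_scorer(yhat):
    """Hypothetical scorer: normalized entropy of cluster sizes (1.0 = balanced)."""
    counts = Counter(yhat)
    if len(counts) < 2:
        return 0.0          # a single cluster carries no balance information
    n = len(yhat)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


balanced = balance_scorer([0, 0, 1, 1, 2, 2])
skewed = balance_scorer([0, 0, 0, 0, 0, 1])
```

Under these assumptions, such a function could be passed as optimal_k_method in place of ‘ami’ or ‘silhouette’.
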

compute_auc(y, **kwargs)[source]

Compute area under the ROC curve for predicting the labels in y using the latent features previously inferred.

y: labels to predict
**kwargs: arguments for compute_roc

aucs: pd.Series, auc per class as well as mean

compute_roc(y, **kwargs)[source]

Compute the Receiver Operating Characteristic (ROC) curve for SVM prediction of labels y from the latent factors. Computes both the ROC curves (true positive rate, false positive rate) and the area under the ROC curve (AUC). The ROC and AUC are computed for each class (the classes are inferred from y), as well as a “mean” ROC, computed by averaging the class ROCs. Only samples in the index of y will be considered.

y: array-like (n_samples,), the labels of the samples to predict
**kwargs: arguments for utils.compute_roc

roc_curves: dict, one key per class as well as “mean”, each value is a dataframe
containing the tpr (true positive rate) and fpr (false positive rate) defining the ROC of that class (or the mean).
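
The following sketch shows how the tpr and fpr columns of a single class’s ROC, and the corresponding area, can be derived from classifier scores. It is a stdlib-only illustration for one binary (one-vs-rest) problem. The helper names `roc_curve_points` and `auc` are hypothetical, and maui’s own computation (SVM prediction, per-class curves, averaging) is not reproduced here.

```python
def roc_curve_points(labels, scores):
    """Return (fpr, tpr) lists; labels are 0/1, higher score means more positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples sharing this score are counted.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


fpr, tpr = roc_curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)
```
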
drop_unexplanatory_factors(threshold=0.02)[source]

Drops factors which have a low R^2 score in a univariate linear model predicting the features x from a column of the latent factors z.

threshold: threshold for R^2, latent factors below this threshold
are dropped.
z_filt: (n_samples, n_factors) DataFrame of latent factor values,
with only those columns from the input z which have an R^2 above the threshold when using that column as an input to a linear model predicting x.
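
The filtering rule can be sketched as follows. For a univariate linear model with an intercept, R^2 equals the squared Pearson correlation between the factor and the feature. How maui aggregates R^2 over many features is not stated above, so averaging across features is an assumption, as are the helper names and the toy column names.

```python
def r_squared(z_col, x_col):
    """R^2 of a least-squares line (with intercept) predicting x_col from z_col."""
    n = len(z_col)
    mz, mx = sum(z_col) / n, sum(x_col) / n
    szz = sum((z - mz) ** 2 for z in z_col)
    sxx = sum((x - mx) ** 2 for x in x_col)
    szx = sum((z - mz) * (x - mx) for z, x in zip(z_col, x_col))
    if szz == 0 or sxx == 0:
        return 0.0
    return szx ** 2 / (szz * sxx)   # squared Pearson correlation


def explanatory_factors(z, x, threshold=0.02):
    """Names of factors whose mean R^2 over all features exceeds threshold."""
    kept = []
    for name, z_col in z.items():
        scores = [r_squared(z_col, x_col) for x_col in x.values()]
        if sum(scores) / len(scores) > threshold:   # assumption: uniform average
            kept.append(name)
    return kept


# Columns are keyed by name; each list holds one value per sample.
z = {"LF0": [1.0, 2.0, 3.0, 4.0], "LF1": [1.0, -1.0, -1.0, 1.0]}
x = {"gene_a": [2.1, 3.9, 6.2, 7.8], "gene_b": [1.0, 2.0, 3.0, 4.0]}
kept = explanatory_factors(z, x)
```

Here LF1 is essentially uncorrelated with both features, so only LF0 survives the default threshold of 0.02.
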
fit(X, y=None, X_validation=None)[source]

Train autoencoder model

X: dict with multi-modal dataframes, containing training data, e.g.
{‘mRNA’: df1, ‘SNP’: df2}, where df1, df2, etc. are (n_features, n_samples) pandas.DataFrames. The sample names must match, the feature names need not.
X_validation: optional, dict with multi-modal dataframes, containing validation data
which will be used to compute validation loss during training

y: Not used.

self : Maui object

fit_transform(X, y=None, X_validation=None, encoder='mean')[source]

Train autoencoder model, and return the latent factor representation of the data X.

X: dict with multi-modal dataframes, containing training data, e.g.
{‘mRNA’: df1, ‘SNP’: df2}, where df1, df2, etc. are (n_features, n_samples) pandas.DataFrames. The sample names must match, the feature names need not.
X_validation: optional, dict with multi-modal dataframes, containing validation data
which will be used to compute validation loss during training

y: Not used.

z: DataFrame (n_samples, n_latent_factors)
Latent factors representation of the data X.
get_linear_weights()[source]

Get linear model coefficients obtained from fitting linear models predicting feature values from latent factors. One model is fit per latent factor, and the coefficients are stored in the matrix.

W: (n_features, n_latent_factors) DataFrame
w_{ij} is the coefficient associated with feature i in a linear model predicting it from latent factor j.
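
As a worked illustration of what w_{ij} represents, the sketch below computes the least-squares slope of each feature on each latent factor, one univariate model at a time. It assumes each model includes an intercept, which the description above does not state. The function name `linear_weights` and the toy column names are hypothetical.

```python
def linear_weights(z, x):
    """W[feature][factor]: slope of a least-squares line predicting feature from factor."""
    weights = {}
    for feature, x_col in x.items():
        n = len(x_col)
        mx = sum(x_col) / n
        weights[feature] = {}
        for factor, z_col in z.items():
            mz = sum(z_col) / n
            cov = sum((zv - mz) * (xv - mx) for zv, xv in zip(z_col, x_col))
            var = sum((zv - mz) ** 2 for zv in z_col)
            weights[feature][factor] = cov / var   # slope = cov(z, x) / var(z)
    return weights


# Columns are keyed by name; each list holds one value per sample.
z = {"LF0": [0.0, 1.0, 2.0, 3.0], "LF1": [1.0, 0.0, 1.0, 0.0]}
x = {"gene_a": [1.0, 3.0, 5.0, 7.0], "gene_b": [4.0, 2.0, 4.0, 2.0]}
W = linear_weights(z, x)
```

In this toy data gene_a equals 2 * LF0 + 1, so its weight on LF0 is 2.
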
static load(dir)[source]

Load a maui model from disk, which was previously saved using save()

dir: The directory from which to load the maui model

maui_model: a maui model that was previously saved to disk

merge_similar_latent_factors(distance_in='z', distance_metric='correlation', linkage_method='complete', distance_threshold=0.17, merge_fn=<function mean>, plot_dendrogram=True, plot_dendro_ax=None)[source]

Merge latent factors in z whose distance is below a certain threshold. Used to squeeze down latent factor representations if there are many co-linear latent factors.

distance_in: If ‘z’, latent factors will be merged based on their distance
to each other in ‘z’. If ‘w’, factors will be merged based on their distance in ‘w’ (see get_linear_weights())
distance_metric: The distance metric based on which to merge latent factors.
One which is supported by scipy.spatial.distance.pdist()
linkage_method: The linkage method used to cluster latent factors. One which
is supported by scipy.cluster.hierarchy.linkage().
distance_threshold: Latent factors with distance below this threshold
will be merged
merge_fn: Function used to determine value of merged latent factor.
The default is numpy.mean(), meaning the merged latent factor will have the mean value of the inputs.
plot_dendrogram: Boolean. If true, a dendrogram will be plotted showing
which latent factors are merged and the threshold.

plot_dendro_ax: A matplotlib axis object to plot the dendrogram on (optional)

z: (n_samples, n_factors) pd.DataFrame of latent factors
where some have been merged
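
The merging procedure can be sketched in pure Python as complete-linkage agglomeration on correlation distance, followed by averaging the members of each cluster. This is an illustration under the default settings described above, not maui’s code, which according to the parameter descriptions relies on scipy. The helper names and the “+”-joined names of merged factors are hypothetical.

```python
from itertools import combinations
from statistics import mean


def correlation_distance(u, v):
    """1 - Pearson correlation between two factors."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return 1.0 - num / den


def merge_factors(z, threshold=0.17):
    """Join clusters of factors while their farthest members are closer than threshold."""
    clusters = [[name] for name in z]

    def complete_linkage(c1, c2):
        return max(correlation_distance(z[a], z[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        dist, i, j = min((complete_linkage(clusters[i], clusters[j]), i, j)
                         for i, j in combinations(range(len(clusters)), 2))
        if dist >= threshold:
            break
        clusters[i].extend(clusters.pop(j))   # i < j, so index i stays valid
    # Each merged factor takes the per-sample mean of its members.
    return {"+".join(c): [mean(vals) for vals in zip(*(z[n] for n in c))]
            for c in clusters}


z = {
    "LF0": [1.0, 2.0, 3.0, 4.0],
    "LF1": [1.1, 2.0, 2.9, 4.2],   # nearly collinear with LF0
    "LF2": [4.0, 1.0, 3.0, 2.0],
}
merged = merge_factors(z)
```

LF0 and LF1 have a correlation distance well below 0.17 and are merged into one factor, while LF2 is left unchanged.
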
save(destdir)[source]

Save a maui model to disk, so that it may be reloaded later using load()

destdir: destination directory in which to save model files

select_clinical_factors(survival, duration_column='duration', observed_column='observed', alpha=0.05, cox_penalizer=0)[source]

Select latent factors which are predictive of survival. This is accomplished by fitting a Cox Proportional Hazards (CPH) model to each latent factor, while controlling for known covariates, and only keeping those latent factors whose coefficient in the CPH is nonzero (adjusted p-value < alpha).

survival: pd.DataFrame of survival information and relevant covariates
(such as sex, age at diagnosis, or tumor stage)
duration_column: the name of the column in survival containing the
duration (time between diagnosis and death or last followup)
observed_column: the name of the column in survival
indicating whether time of death is known
alpha: threshold for p-value of CPH coefficients to call a latent
factor clinically relevant (p < alpha)

cox_penalizer: penalty coefficient in Cox PH solver (see lifelines.CoxPHFitter)

z_clinical: pd.DataFrame, subset of the latent factors which have been
determined to have clinical value (are individually predictive of survival, controlling for covariates)
transform(X, encoder='mean')[source]

Transform X into the latent space that was previously learned using fit or fit_transform, and return the latent factor representation.

X: dict with multi-modal dataframes, containing the data to transform, e.g.
{‘mRNA’: df1, ‘SNP’: df2}, where df1, df2, etc. are (n_features, n_samples) pandas.DataFrames.
encoder: the mode of the encoder to be used. one of ‘mean’ or ‘sample’,
where ‘mean’ indicates the encoder network only uses the mean estimates for each successive layer. ‘sample’ indicates the encoder should sample from the distribution specified from each successive layer, and results in non-reproducible embeddings.
z: DataFrame (n_samples, n_latent_factors)
Latent factors representation of the data X.
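
The difference between the two encoder modes can be sketched for a single variational layer. The sketch assumes the layer outputs a mean and a log-variance per latent dimension, a common VAE parameterization that is not confirmed by the text above. The function `encode` is hypothetical and stands in for one layer only, whereas maui applies the chosen mode at each successive layer.

```python
import math
import random


def encode(mu, log_var, encoder="mean", rng=None):
    """Return a latent vector from per-dimension Gaussian parameters."""
    if encoder == "mean":
        return list(mu)   # deterministic: use the mean estimates only
    if encoder == "sample":
        rng = rng or random.Random()
        # Draw z = mu + sigma * eps with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, log_var)]
    raise ValueError("encoder must be 'mean' or 'sample'")


mu, log_var = [0.5, -1.0], [0.0, -2.0]
z_mean_a = encode(mu, log_var, "mean")
z_mean_b = encode(mu, log_var, "mean")
z_sample_a = encode(mu, log_var, "sample", random.Random(1))
z_sample_b = encode(mu, log_var, "sample", random.Random(2))
```

Repeated calls in ‘mean’ mode return identical embeddings, while ‘sample’ mode yields a different draw on each call unless the random generator is seeded identically.
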