Maui Utilities

The maui.utils module contains utility functions for multi-omics analysis using maui.

maui.utils.compute_harrells_c(z, survival, duration_column='duration', observed_column='observed', cox_penalties=[0.1, 1, 10, 100, 1000, 10000], cv_folds=5)

Computes Harrell’s c-index for a Cox Proportional Hazards regression modeling survival by the latent factors in z.

z: pd.DataFrame (n_samples, n_latent_factors)
survival: pd.DataFrame of survival information and relevant covariates (such as sex, age at diagnosis, or tumor stage)
duration_column: the name of the column in survival containing the duration (time between diagnosis and death or last follow-up)
observed_column: the name of the column in survival indicating whether the time of death is known
cox_penalties: penalty coefficients to try in the Cox PH solver (see lifelines.CoxPHFitter). The best c obtained across the penalties (by cross-validation) is returned.
cv_folds: number of cross-validation folds used to compute c

cs: array, Harrell’s c-index, an AUC-like metric for survival prediction accuracy; one value per cross-validation fold
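To make the returned metric concrete, here is a minimal pure-Python sketch of Harrell's c-index itself. It does not call maui or lifelines and leaves out the Cox model and the cross-validation; the function name `harrells_c` and the convention that a higher risk score means an earlier expected event are assumptions made for illustration.

```python
def harrells_c(durations, observed, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i has an observed event
            # strictly before sample j's duration (event or censoring).
            if observed[i] and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event got the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    return concordant / comparable


durations = [2, 4, 6, 8, 10]
observed = [True, True, False, True, False]

print(harrells_c(durations, observed, [5, 4, 3, 2, 1]))  # 1.0, perfect ranking
print(harrells_c(durations, observed, [1, 1, 1, 1, 1]))  # 0.5, uninformative
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the metric reads like an AUC.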
maui.utils.compute_roc(z, y, classifier=LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0), cv_folds=10)

Compute the ROC (false positive rate, true positive rate) using cross-validation.

z: DataFrame (n_samples, n_latent_factors) of latent factor values
y: Series (n_samples,) of ground-truth labels to try to predict
classifier: Classifier object to use, default LinearSVC(C=.001)

roc_curves: dict, one key per class as well as “mean”; each value is a DataFrame containing the tpr (true positive rate) and fpr (false positive rate) defining the ROC of that class (or the mean ROC).
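The fpr and tpr columns described above can be illustrated with a small pure-Python sketch for a single class. It does not call maui or scikit-learn and skips the classifier and the cross-validation; `roc_curve_points` and `auc` are hypothetical helper names, and the scores are made-up decision values.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]              # 1 marks the class of interest
scores = [0.1, 0.4, 0.35, 0.8]     # classifier decision values

fpr, tpr = roc_curve_points(y_true, scores)
area = auc(fpr, tpr)
print(fpr)   # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)   # [0.0, 0.5, 0.5, 1.0, 1.0]
print(area)  # 0.75
```

For a multi-class problem the same sweep would be repeated once per class, treating that class as positive and all others as negative.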
maui.utils.correlate_factors_and_features(z, concatenated_data, pval_threshold=0.001)

Compute the Pearson correlation of latent factors with input features.

z: (n_samples, n_factors) DataFrame of latent factor values, output of maui model
concatenated_data: (n_samples, n_features) DataFrame of concatenated multi-omics data

feature_s: DataFrame (n_features, n_latent_factors)
Latent factors representation of the data X.
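As a reminder of the quantity involved, the following pure-Python sketch computes the Pearson correlation between every feature and every latent factor and arranges the results with features as rows and factors as columns. It does not call maui and does not apply any p-value threshold; `pearson_r`, the factor names, and the feature names are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Columns of z (latent factors) and of the concatenated data (features),
# each measured on the same four samples.
z = {"LF0": [1.0, 2.0, 3.0, 4.0], "LF1": [1.0, -1.0, 1.0, -1.0]}
features = {"gene_a": [2.0, 4.0, 6.0, 8.0], "gene_b": [4.0, 3.0, 2.0, 1.0]}

# corr[feature][factor], i.e. an (n_features, n_latent_factors) layout.
corr = {
    f_name: {z_name: pearson_r(f_vals, z_vals) for z_name, z_vals in z.items()}
    for f_name, f_vals in features.items()
}

for f_name, row in corr.items():
    print(f_name, {k: round(v, 3) for k, v in row.items()})
```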
maui.utils.estimate_kaplan_meier(y, survival, duration_column='duration', observed_column='observed')

Estimate survival curves for groups defined in y based on survival data in survival

y: pd.Series of groups (clusters, subtypes); the index is the sample names
survival: pd.DataFrame with the same index as y, with columns for the duration (survival time for each patient) and whether or not the death was observed. If the death was not observed (censored), the duration is the time of the last follow-up.
duration_column: the name of the column in survival with the duration
observed_column: the name of the column in survival with True/False values for whether death was observed or not

km_estimates: pd.DataFrame; the index is the timeline, and the columns are the survival functions (estimated by Kaplan-Meier) for each class, as defined in y.
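The Kaplan-Meier estimate behind each column can be sketched in a few lines of pure Python. This is an illustration of the estimator only, not maui's implementation; `kaplan_meier` and the cluster labels are hypothetical, and the data are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    survival = 1.0
    curve = [(0, 1.0)]
    event_times = sorted({t for t, o in zip(durations, observed) if o})
    for t in event_times:
        # Everyone whose duration reaches t is still at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Group label, duration, and event indicator for six samples.
y = ["c0", "c0", "c0", "c1", "c1", "c1"]
durations = [1, 2, 3, 2, 4, 6]
observed = [True, True, False, False, True, True]

# One survival curve per group defined in y.
curves = {}
for group in sorted(set(y)):
    d = [t for t, label in zip(durations, y) if label == group]
    o = [e for e, label in zip(observed, y) if label == group]
    curves[group] = kaplan_meier(d, o)

print(curves["c1"])  # [(0, 1.0), (4, 0.5), (6, 0.0)]
```

Censored samples never trigger a drop in the curve, but they do count toward the at-risk total until their last follow-up.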
maui.utils.filter_factors_by_r2(z, x, threshold=0.02)

Filter latent factors by the R^2 of a linear model predicting features x from latent factors z.

z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
x: (n_samples, n_features) DataFrame of concatenated multi-omics data

z_filtered: (n_samples, n_factors) DataFrame of latent factor values, with only those columns from the input z which have an R^2 above the threshold when using that column as an input to a linear model predicting x.
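A rough pure-Python sketch of this filter follows. It does not call maui; it assumes that a factor's score is the R^2 of a one-variable least-squares fit averaged uniformly over all features, which is an assumption about the aggregation rather than something the description above states. The names `r_squared` and `filter_by_r2` are hypothetical.

```python
def r_squared(z_col, x_col):
    """R^2 of the least-squares line predicting x_col from z_col."""
    n = len(z_col)
    mean_z, mean_x = sum(z_col) / n, sum(x_col) / n
    s_zz = sum((a - mean_z) ** 2 for a in z_col)
    s_zx = sum((a - mean_z) * (b - mean_x) for a, b in zip(z_col, x_col))
    slope = s_zx / s_zz
    intercept = mean_x - slope * mean_z
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(z_col, x_col))
    ss_tot = sum((b - mean_x) ** 2 for b in x_col)
    return 1 - ss_res / ss_tot


def filter_by_r2(z, x, threshold=0.02):
    """Keep factors whose mean R^2 over all features exceeds the threshold."""
    kept = {}
    for name, z_col in z.items():
        mean_r2 = sum(r_squared(z_col, x_col) for x_col in x.values()) / len(x)
        if mean_r2 > threshold:
            kept[name] = z_col
    return kept


z = {"LF0": [1.0, 2.0, 3.0, 4.0], "LF1": [1.0, -1.0, -1.0, 1.0]}
x = {"gene_a": [2.0, 4.0, 6.0, 8.0], "gene_b": [4.0, 3.0, 2.0, 1.0]}

kept = list(filter_by_r2(z, x))
print(kept)  # ['LF0']
```

Here LF1 is uncorrelated with both features, so its R^2 is zero and it is dropped.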
maui.utils.map_factors_to_feaures_using_linear_models(z, x)

Get feature <-> latent factors mapping from linear models. Runs one univariate (multi-output) linear model per latent factor in z, predicting the values of the features x, in order to get weights between inputs and outputs.

z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
x: (n_samples, n_features) DataFrame of concatenated multi-omics data

W: (n_features, n_latent_factors) DataFrame; w_{ij} is the coefficient associated with feature i in a linear model predicting it from latent factor j.
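For a single input variable, the least-squares coefficient has a closed form: the covariance of the factor and the feature divided by the variance of the factor. The sketch below builds W that way in pure Python. It does not call maui and ignores any preprocessing maui may apply before fitting; `slope` and the example names are hypothetical.

```python
def slope(z_col, x_col):
    """Least-squares coefficient of x_col regressed on z_col (with intercept)."""
    n = len(z_col)
    mean_z, mean_x = sum(z_col) / n, sum(x_col) / n
    s_zz = sum((a - mean_z) ** 2 for a in z_col)
    s_zx = sum((a - mean_z) * (b - mean_x) for a, b in zip(z_col, x_col))
    return s_zx / s_zz


z = {"LF0": [1.0, 2.0, 3.0, 4.0], "LF1": [1.0, -1.0, -1.0, 1.0]}
x = {"gene_a": [2.0, 4.0, 6.0, 8.0], "gene_b": [4.0, 3.0, 2.0, 1.0]}

# W[feature][factor] is w_ij: rows are features i, columns are factors j.
W = {
    feature: {factor: slope(z_col, x_col) for factor, z_col in z.items()}
    for feature, x_col in x.items()
}

print(W["gene_a"])  # {'LF0': 2.0, 'LF1': 0.0}
print(W["gene_b"])  # {'LF0': -1.0, 'LF1': 0.0}
```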
maui.utils.merge_factors(z, l=None, threshold=0.17, merge_fn=<function mean>, metric='correlation', linkage='single', plot_dendro=True, plot_dendro_ax=None)

Merge latent factors in z which form clusters, as defined by hierarchical clustering where a cluster is formed by cutting at a pre-set threshold, i.e. merge factors if their distance to one-another is below threshold.

z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
metric: Distance metric to merge factors by, one which is supported by scipy.spatial.distance.pdist()
linkage: The kind of linkage used to form the hierarchical clustering, one which is supported by scipy.cluster.hierarchy.linkage()
l: As an alternative to supplying metric and linkage, supply a linkage matrix of your own choice, such as one computed by scipy.cluster.hierarchy.linkage()
threshold: The distance threshold. Latent factors whose distance is below the threshold will be merged to form a single latent factor.
merge_fn: A function which will be used to merge latent factors. The default is numpy.mean(), i.e. the newly formed (merged) latent factor will be the mean of the merged ones. Supply any function here which has the same interface, i.e. takes a matrix and an axis.
plot_dendro: Boolean. If True, the function will plot a dendrogram showing which latent factors are merged and the threshold.
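With the default single linkage, cutting the dendrogram at a distance threshold is equivalent to grouping factors that are connected by a chain of pairwise distances below the threshold. The pure-Python sketch below uses that equivalence with correlation distance (1 minus Pearson r) and merges each group by its mean. It does not call maui or scipy, draws no dendrogram, and the naming of merged factors (`"LF0_LF1"`) is a hypothetical convention.

```python
import math


def correlation_distance(a, b):
    """1 minus the Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1 - cov / norm


def merge_factors_sketch(z, threshold=0.17):
    """Merge factors linked by distances below threshold, using the mean."""
    names = list(z)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the cluster.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if correlation_distance(z[a], z[b]) < threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    merged = {}
    for members in clusters.values():
        n_samples = len(z[members[0]])
        merged["_".join(members)] = [
            sum(z[m][k] for m in members) / len(members) for k in range(n_samples)
        ]
    return merged


z = {
    "LF0": [1.0, 2.0, 3.0, 4.0],
    "LF1": [1.1, 2.0, 2.9, 4.2],   # nearly identical to LF0
    "LF2": [4.0, 1.0, 3.0, 2.0],   # unrelated to the other two
}
merged = merge_factors_sketch(z)
print(list(merged))  # ['LF0_LF1', 'LF2']
```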
maui.utils.multivariate_logrank_test(y, survival, duration_column='duration', observed_column='observed')

Compute the multivariate log-rank test for differential survival among the groups defined by y in the survival data in survival, under the null-hypothesis that all groups have the same survival function (i.e. test whether at least one group has different survival rates)

y: pd.Series of groups (clusters, subtypes); the index is the sample names
survival: pd.DataFrame with the same index as y, with columns for the duration (survival time for each patient) and whether or not the death was observed. If the death was not observed (censored), the duration is the time of the last follow-up.
duration_column: the name of the column in survival with the duration
observed_column: the name of the column in survival with True/False values for whether death was observed or not

test_statistic: the test statistic (chi-square)
p_value: the associated p-value
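To show where the chi-square statistic comes from, here is a pure-Python sketch of the log-rank test restricted to the two-group case, where the statistic has one degree of freedom and the p-value can be written with the complementary error function. It does not call maui or lifelines, and the general k-group version (which needs a covariance matrix) is not shown; `logrank_two_groups` is a hypothetical name.

```python
import math


def logrank_two_groups(durations, observed, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]

    event_times = sorted({t for t, o in zip(durations, observed) if o})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk and numbers of deaths at time t, overall and in group 1.
        n = sum(1 for d in durations if d >= t)
        n1 = sum(1 for d, g in zip(durations, groups) if d >= t and g == first)
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        deaths1 = sum(
            1 for d, o, g in zip(durations, observed, groups)
            if d == t and o and g == first
        )
        observed_minus_expected += deaths1 - deaths * n1 / n
        if n > 1:
            variance += deaths * (n1 / n) * (1 - n1 / n) * (n - deaths) / (n - 1)

    statistic = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


durations = [1, 2, 3, 4, 5, 6]
observed = [True] * 6
groups = ["a", "a", "a", "b", "b", "b"]

statistic, p_value = logrank_two_groups(durations, observed, groups)
print(round(statistic, 3), round(p_value, 3))
```

A small p-value indicates that the survival functions of the two groups differ; with identical event patterns in both groups the statistic is zero and the p-value is one.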

maui.utils.scale(df)

Scale and center data

df: pd.DataFrame (n_features, n_samples) non-scaled data

scaled: pd.DataFrame (n_features, n_samples) scaled data
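The sketch below shows one common meaning of "scale and center" for data laid out as (n_features, n_samples): each feature (row) is shifted to zero mean and divided by its standard deviation across samples. That the scaling is per feature and uses the population standard deviation is an assumption, not something stated above; the code is pure Python, does not call maui, and `scale_rows` is a hypothetical name.

```python
import math


def scale_rows(rows):
    """Center each row to mean 0 and scale it to unit (population) variance."""
    scaled = {}
    for name, values in rows.items():
        n = len(values)
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        scaled[name] = [(v - mean) / std for v in values]
    return scaled


# Rows are features, columns are samples.
df = {"gene_a": [2.0, 4.0, 6.0, 8.0], "gene_b": [1.0, 1.0, 3.0, 3.0]}
scaled = scale_rows(df)
print(scaled["gene_b"])  # [-1.0, -1.0, 1.0, 1.0]
```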

maui.utils.select_clinical_factors(z, survival, duration_column='duration', observed_column='observed', alpha=0.05, cox_penalizer=0)

Select latent factors which are predictive of survival. This is accomplished by fitting a Cox Proportional Hazards (CPH) model to each latent factor, while controlling for known covariates, and only keeping those latent factors whose coefficient in the CPH is nonzero (adjusted p-value < alpha).

z: pd.DataFrame (n_samples, n_latent_factors) of latent factor values
survival: pd.DataFrame of survival information and relevant covariates (such as sex, age at diagnosis, or tumor stage)
duration_column: the name of the column in survival containing the duration (time between diagnosis and death or last follow-up)
observed_column: the name of the column in survival indicating whether the time of death is known
alpha: threshold for the p-value of CPH coefficients to call a latent factor clinically relevant (p < alpha)
cox_penalizer: penalty coefficient in the Cox PH solver (see lifelines.CoxPHFitter)

z_clinical: pd.DataFrame, subset of the latent factors which have been determined to have clinical value (are individually predictive of survival, controlling for covariates)
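The final selection step, comparing adjusted p-values with alpha, can be sketched as follows. The description above does not say which multiple-testing adjustment is used, so the Benjamini-Hochberg procedure here is an assumption chosen for illustration. The p-values are made up and stand in for the per-factor Cox model results; the code is pure Python and does not call maui or lifelines.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values of each factor's coefficient in its own Cox model.
raw_p = {"LF0": 0.001, "LF1": 0.04, "LF2": 0.03, "LF3": 0.5}
alpha = 0.05

adjusted = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
selected = [name for name in raw_p if adjusted[name] < alpha]
print(selected)  # ['LF0']
```

In this example LF1 and LF2 would pass an unadjusted 0.05 cutoff, but neither survives the adjustment.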