Maui Utilities
The maui.utils module contains utility functions for multi-omics analysis using maui.
maui.utils.compute_harrells_c(z, survival, duration_column='duration', observed_column='observed', cox_penalties=[0.1, 1, 10, 100, 1000, 10000], cv_folds=5)

Computes Harrell's c-Index for a Cox Proportional Hazards regression modeling survival by the latent factors in z.

Parameters:
- z: pd.DataFrame (n_samples, n_latent_factors)
- survival: pd.DataFrame of survival information and relevant covariates (such as sex, age at diagnosis, or tumor stage)
- duration_column: the name of the column in survival containing the duration (time between diagnosis and death or last followup)
- observed_column: the name of the column in survival indicating whether time of death is known
- cox_penalties: penalty coefficients for the Cox PH solver (see lifelines.CoxPHFitter) to try. The best c given by the different penalties (by cross-validation) is returned
- cv_folds: number of cross-validation folds to compute C

Returns:
- cs: array, Harrell's c-Index, an AUC-like metric for survival prediction accuracy; one value per cv_fold
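To make the metric itself concrete, here is a minimal pure-Python sketch of Harrell's c-Index on toy data. It is not maui's implementation (which fits a Cox model with lifelines and cross-validates); the function name `harrells_c` and the toy values are hypothetical.

```python
def harrells_c(durations, observed, risk_scores):
    """Fraction of comparable pairs whose risk ordering matches the survival ordering."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i has an observed event
            # strictly before sample j's event or censoring time.
            if observed[i] and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk and shorter survival
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risks count as half
    return concordant / comparable


# Hypothetical toy data: five patients, two of them censored.
durations = [2, 4, 6, 8, 10]
observed = [True, True, False, True, False]
risk_scores = [0.9, 0.7, 0.8, 0.2, 0.1]

c_index = harrells_c(durations, observed, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the index reads like an AUC.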
maui.utils.compute_roc(z, y, classifier=LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0), cv_folds=10)

Compute the ROC (false positive rate, true positive rate) using cross-validation.

Parameters:
- z: DataFrame (n_samples, n_latent_factors) of latent factor values
- y: Series (n_samples,) of ground-truth labels to try to predict
- classifier: Classifier object to use, default LinearSVC(C=.001)

Returns:
- roc_curves: dict, one key per class as well as "mean"; each value is a DataFrame containing the tpr (true positive rate) and fpr (false positive rate) defining that class's (or the mean) ROC.
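As an illustration of what one entry of such a dictionary holds, the following sketch builds the fpr/tpr points of a single class from classifier scores by sweeping a decision threshold. It leaves out the cross-validation and the per-class averaging that the function performs; `roc_points`, `auc`, and the toy data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr))
    )


# Hypothetical one-vs-rest labels (1 = the class of interest) and scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
print(list(zip(fpr, tpr)), area)
```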
maui.utils.correlate_factors_and_features(z, concatenated_data, pval_threshold=0.001)

Compute Pearson correlation of latent factors with input features.

Parameters:
- z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
- concatenated_data: (n_samples, n_features) DataFrame of concatenated multi-omics data

Returns:
- feature_s: DataFrame (n_features, n_latent_factors) of correlations of the input features with the latent factors.
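The shape of the result can be illustrated with a small pure-Python sketch that computes Pearson's r for every feature/factor pair. It omits the p-value handling controlled by pval_threshold, and the helper `pearson` and the toy columns are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy data: 5 samples, 2 latent factors, 3 features (one list per column).
factors = {"LF0": [1, 2, 3, 4, 5], "LF1": [2, 1, 4, 3, 5]}
features = {
    "gene_a": [2, 4, 6, 8, 10],
    "gene_b": [5, 4, 3, 2, 1],
    "gene_c": [1, 3, 2, 5, 4],
}

# One row per feature, one column per latent factor.
corr = {
    feat: {fac: pearson(x_col, z_col) for fac, z_col in factors.items()}
    for feat, x_col in features.items()
}
for feat, row in corr.items():
    print(feat, row)
```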
maui.utils.estimate_kaplan_meier(y, survival, duration_column='duration', observed_column='observed')

Estimate survival curves for groups defined in y based on survival data in survival.

Parameters:
- y: pd.Series, groups (clusters, subtypes). The index is the sample names
- survival: pd.DataFrame with the same index as y, with columns for the duration (survival time for each patient) and whether or not the death was observed. If the death was not observed (censored), the duration is the time of the last followup.
- duration_column: the name of the column in survival with the duration
- observed_column: the name of the column in survival with True/False values for whether death was observed or not

Returns:
- km_estimates: pd.DataFrame, index is the timeline, columns are survival functions (estimated by Kaplan-Meier) for each class, as defined in y.
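For readers unfamiliar with the estimator, the sketch below computes a Kaplan-Meier (product-limit) curve per group in pure Python. It only illustrates the estimator and is not how maui computes it; `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return {event_time: survival probability} using the product-limit estimator."""
    survival = 1.0
    curve = {}
    for t in sorted({d for d, o in zip(durations, observed) if o}):
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival *= 1 - deaths / at_risk
        curve[t] = survival
    return curve


# Hypothetical toy data: two groups of four patients; False marks a censored patient.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
durations = [5, 8, 12, 20, 2, 3, 4, 9]
observed = [True, False, True, False, True, True, True, False]

curves = {}
for g in sorted(set(groups)):
    d = [dur for dur, grp in zip(durations, groups) if grp == g]
    o = [obs for obs, grp in zip(observed, groups) if grp == g]
    curves[g] = kaplan_meier(d, o)

print(curves)
```

Censored patients never trigger a drop in the curve, but they stay in the at-risk count until their last followup.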
maui.utils.filter_factors_by_r2(z, x, threshold=0.02)

Filter latent factors by the R^2 of a linear model predicting features x from latent factors z.

Parameters:
- z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
- x: (n_samples, n_features) DataFrame of concatenated multi-omics data

Returns:
- z_filtered: (n_samples, n_factors) DataFrame of latent factor values, with only those columns from the input z which have an R^2 above the threshold when using that column as an input to a linear model predicting x.
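The filtering rule can be sketched in pure Python as follows. The sketch assumes one least-squares line (with intercept) per factor/feature pair and averages R^2 over features; the source does not say how R^2 is aggregated across features, so that averaging is an assumption, as are the helper `r_squared` and the toy data.

```python
def r_squared(z_col, x_col):
    """R^2 of the least-squares line x = intercept + slope * z."""
    n = len(z_col)
    mean_z, mean_x = sum(z_col) / n, sum(x_col) / n
    szz = sum((v - mean_z) ** 2 for v in z_col)
    szx = sum((v - mean_z) * (w - mean_x) for v, w in zip(z_col, x_col))
    slope = szx / szz
    intercept = mean_x - slope * mean_z
    ss_res = sum((w - (intercept + slope * v)) ** 2 for v, w in zip(z_col, x_col))
    ss_tot = sum((w - mean_x) ** 2 for w in x_col)
    return 1 - ss_res / ss_tot


# Hypothetical toy data: LF0 drives both features, LF1 is uncorrelated with them.
factors = {"LF0": [-2, -1, 0, 1, 2], "LF1": [2, -1, -2, -1, 2]}
features = {"gene_a": [-3, -1, 1, 3, 5], "gene_b": [2, 1, 0, -1, -2]}
threshold = 0.02

# Assumption: a factor's score is its R^2 averaged over all features.
mean_r2 = {
    name: sum(r_squared(z_col, x_col) for x_col in features.values()) / len(features)
    for name, z_col in factors.items()
}
z_filtered = {name: col for name, col in factors.items() if mean_r2[name] > threshold}
print(mean_r2, list(z_filtered))
```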
maui.utils.map_factors_to_feaures_using_linear_models(z, x)

Get feature <-> latent factors mapping from linear models. Runs one univariate (multi-output) linear model per latent factor in z, predicting the values of the features x, in order to get weights between inputs and outputs.

Parameters:
- z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
- x: (n_samples, n_features) DataFrame of concatenated multi-omics data

Returns:
- W: (n_features, n_latent_factors) DataFrame. w_{ij} is the coefficient associated with feature i in a linear model predicting it from latent factor j.
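A minimal sketch of how such a weight matrix can be obtained: for a univariate model with an intercept, the coefficient is cov(z_j, x_i) / var(z_j). Fitting an intercept is an assumption here, and the helper `slope` and the toy data are hypothetical.

```python
def slope(z_col, x_col):
    """Least-squares slope of x regressed on z (intercept fitted implicitly)."""
    n = len(z_col)
    mean_z, mean_x = sum(z_col) / n, sum(x_col) / n
    szx = sum((v - mean_z) * (w - mean_x) for v, w in zip(z_col, x_col))
    szz = sum((v - mean_z) ** 2 for v in z_col)
    return szx / szz


# Hypothetical toy data: gene_a = 2 * LF0 + 1 and gene_b = -LF0.
factors = {"LF0": [-2, -1, 0, 1, 2], "LF1": [2, -1, -2, -1, 2]}
features = {"gene_a": [-3, -1, 1, 3, 5], "gene_b": [2, 1, 0, -1, -2]}

# W[feature][factor] plays the role of w_{ij}.
W = {
    feat: {fac: slope(z_col, x_col) for fac, z_col in factors.items()}
    for feat, x_col in features.items()
}
print(W)
```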
maui.utils.merge_factors(z, l=None, threshold=0.17, merge_fn=<function mean>, metric='correlation', linkage='single', plot_dendro=True, plot_dendro_ax=None)

Merge latent factors in z which form clusters, as defined by hierarchical clustering where a cluster is formed by cutting at a pre-set threshold, i.e. merge factors if their distance to one another is below the threshold.

Parameters:
- z: (n_samples, n_factors) DataFrame of latent factor values, output of a maui model
- metric: Distance metric to merge factors by, one which is supported by scipy.spatial.distance.pdist()
- linkage: The kind of linkage to form hierarchical clustering, one which is supported by scipy.cluster.hierarchy.linkage()
- l: As an alternative to supplying metric and linkage, supply a linkage matrix of your own choice, such as one computed by scipy.cluster.hierarchy.linkage()
- threshold: The distance threshold. Latent factors whose distance is below the threshold will be merged to form a single latent factor
- merge_fn: A function which will be used to merge latent factors. The default is numpy.mean(), i.e. the newly formed (merged) latent factor will be the mean of the merged ones. Supply any function here which has the same interface, i.e. takes a matrix and an axis.
- plot_dendro: Boolean. If True, the function will plot a dendrogram showing which latent factors are merged and the threshold.
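The merging rule for the default settings (correlation distance, single linkage, mean) can be sketched without scipy. Cutting a single-linkage tree at a threshold groups factors that are connected by a chain of pairwise distances below that threshold, so the sketch uses a small union-find instead of a linkage matrix. This is a substitute technique for illustration, not the function's code; all names, including the "+"-joined labels for merged factors, are hypothetical.

```python
import math


def correlation_distance(a, b):
    """1 - Pearson correlation, the 'correlation' metric for two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1 - cov / norm


def merge_close_factors(factors, threshold):
    """Merge factors linked by pairwise distances below threshold, using their mean."""
    names = list(factors)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if correlation_distance(factors[a], factors[b]) < threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    merged = {}
    for members in clusters.values():
        columns = [factors[m] for m in members]
        merged["+".join(members)] = [sum(vals) / len(vals) for vals in zip(*columns)]
    return merged


# Hypothetical toy data: LF0 and LF1 are nearly identical, LF2 is unrelated.
factors = {
    "LF0": [1, 2, 3, 4, 5],
    "LF1": [1.1, 2.0, 3.2, 3.9, 5.1],
    "LF2": [5, 1, 4, 2, 3],
}
merged = merge_close_factors(factors, threshold=0.17)
print(merged)
```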
maui.utils.multivariate_logrank_test(y, survival, duration_column='duration', observed_column='observed')

Compute the multivariate log-rank test for differential survival among the groups defined by y in the survival data in survival, under the null hypothesis that all groups have the same survival function (i.e. test whether at least one group has different survival rates).

Parameters:
- y: pd.Series, groups (clusters, subtypes). The index is the sample names
- survival: pd.DataFrame with the same index as y, with columns for the duration (survival time for each patient) and whether or not the death was observed. If the death was not observed (censored), the duration is the time of the last followup.
- duration_column: the name of the column in survival with the duration
- observed_column: the name of the column in survival with True/False values for whether death was observed or not

Returns:
- test_statistic: the test statistic (chi-square)
- p_value: the associated p_value
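To show where the chi-square statistic comes from, here is a pure-Python sketch of the log-rank test for the special case of two groups (one degree of freedom). The multivariate test generalizes this to any number of groups; the function name and toy data are hypothetical, and this is not the function's implementation.

```python
import math


def logrank_two_groups(durations, observed, in_group_a):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    observed_a = expected_a = variance = 0.0
    for t in sorted({d for d, o in zip(durations, observed) if o}):
        n = sum(1 for d in durations if d >= t)  # at risk overall
        n_a = sum(1 for d, g in zip(durations, in_group_a) if d >= t and g)
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        deaths_a = sum(
            1 for d, o, g in zip(durations, observed, in_group_a) if d == t and o and g
        )
        observed_a += deaths_a
        expected_a += deaths * n_a / n  # expected deaths in A under the null
        if n > 1:
            variance += deaths * (n - deaths) / (n - 1) * (n_a / n) * (1 - n_a / n)
    statistic = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical toy data: the first four patients form group A.
durations = [5, 8, 12, 20, 2, 3, 4, 9]
observed = [True, False, True, False, True, True, True, False]
in_group_a = [True, True, True, True, False, False, False, False]

statistic, p_value = logrank_two_groups(durations, observed, in_group_a)
print(statistic, p_value)
```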
maui.utils.scale(df)

Scale and center data.

Parameters:
- df: pd.DataFrame (n_features, n_samples) non-scaled data

Returns:
- scaled: pd.DataFrame (n_features, n_samples) scaled data
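A sketch of what scaling and centering typically means for data laid out as (n_features, n_samples): each feature (row) is shifted to zero mean and divided by its standard deviation. The use of the population standard deviation is an assumption, and `scale_rows` and the toy data are hypothetical.

```python
import statistics


def scale_rows(rows):
    """Center each feature to mean 0 and scale it to unit (population) standard deviation."""
    scaled = {}
    for name, values in rows.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population std (ddof=0)
        scaled[name] = [(v - mean) / std for v in values]
    return scaled


# Hypothetical toy data: one list of sample values per feature.
df = {"gene_a": [1, 2, 3, 4, 5], "gene_b": [10, 30, 20, 50, 40]}
scaled = scale_rows(df)
print(scaled)
```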
maui.utils.select_clinical_factors(z, survival, duration_column='duration', observed_column='observed', alpha=0.05, cox_penalizer=0)

Select latent factors which are predictive of survival. This is accomplished by fitting a Cox Proportional Hazards (CPH) model to each latent factor, while controlling for known covariates, and only keeping those latent factors whose coefficient in the CPH is nonzero (adjusted p-value < alpha).

Parameters:
- survival: pd.DataFrame of survival information and relevant covariates (such as sex, age at diagnosis, or tumor stage)
- duration_column: the name of the column in survival containing the duration (time between diagnosis and death or last followup)
- observed_column: the name of the column in survival indicating whether time of death is known
- alpha: threshold for p-value of CPH coefficients to call a latent factor clinically relevant (p < alpha)
- cox_penalizer: penalty coefficient in Cox PH solver (see lifelines.CoxPHFitter)

Returns:
- z_clinical: pd.DataFrame, subset of the latent factors which have been determined to have clinical value (are individually predictive of survival, controlling for covariates)
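The final selection step (adjusted p-value < alpha) can be sketched as below. The source does not say which multiple-testing adjustment is applied, so the Benjamini-Hochberg procedure used here is an assumption, and the per-factor p-values are made up; in practice they would come from the fitted Cox models.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, keyed like the input dict."""
    names = sorted(pvalues, key=pvalues.get)  # ascending raw p-values
    m = len(names)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        name = names[rank - 1]
        running_min = min(running_min, pvalues[name] * m / rank)
        adjusted[name] = running_min
    return adjusted


# Hypothetical p-values of each latent factor's Cox coefficient.
pvalues = {"LF0": 0.001, "LF1": 0.04, "LF2": 0.03, "LF3": 0.6}
alpha = 0.05

adjusted = benjamini_hochberg(pvalues)
selected = [name for name in pvalues if adjusted[name] < alpha]
print(adjusted, selected)
```

In this toy case LF1 and LF2 have raw p-values below alpha but are dropped after adjustment, which is the point of adjusting when one model is fitted per factor.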