# Bioinformatics 2018 - Distinguishing prognostic and predictive biomarkers: An information theoretic approach

Information theoretic predictive biomarker ranking

**Date:** 02/02/2018

**Paper:** Distinguishing prognostic and predictive biomarkers: An information theoretic approach

**Authors:** Konstantinos Sechidis, Konstantinos Papangelou, Paul D. Metcalfe, David Svensson, James Weatherall and Gavin Brown

**Platform:** R Version 3.3.1

**Required packages:** MASS, infotheo

**Maintainer:** Konstantinos Sechidis (konstantinos.sechidis@manchester.ac.uk)

**Description:** Derives rankings that capture predictive biomarker strength through univariate (INFO) or higher-order (INFO+) methods.

**Functions:**

```INFOplus.Output_Categorical.Covariates_Categorical(data,labels,treatment,top_k)$ranking```

This function returns the predictive ranking. Its input arguments are:

**data:** A matrix containing the covariates (biomarkers). Columns correspond to covariates and rows to examples (patients). For this function the covariates are categorical (nominal).

**labels:** A vector containing the output (target) label for each patient; for this function it takes categorical (nominal) values.

**treatment:** A vector describing the treatment allocation (i.e. T=0 control group, T=1 experimental treatment).

**top_k:** The number of top-ranked predictive biomarkers to return.
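
To make the expected input shapes concrete, here is a minimal sketch that builds toy inputs matching the argument descriptions above. All values are made up, the integer coding of categories is an assumption, and the requirement that `top_k` not exceed the number of biomarkers is a sensible guess rather than something the repository documents. The ranking call is left as a comment because it needs the repository sources to be loaded.

```r
# Toy inputs with the shapes described above (all values are made up).
set.seed(1)
n_patients   <- 8  # rows: examples (patients)
n_biomarkers <- 3  # columns: covariates (biomarkers)

# Categorical covariates, coded here as small integers (an assumption)
data <- matrix(sample(0:2, n_patients * n_biomarkers, replace = TRUE),
               nrow = n_patients, ncol = n_biomarkers)

# One categorical outcome label per patient
labels <- sample(0:1, n_patients, replace = TRUE)

# Treatment allocation: T = 0 control group, T = 1 experimental treatment
treatment <- rep(c(0, 1), length.out = n_patients)

# Number of top-ranked biomarkers to request (assumed to be <= ncol(data))
top_k <- 2

# Basic consistency checks before calling a ranking function
stopifnot(nrow(data) == length(labels),
          nrow(data) == length(treatment),
          all(treatment %in% c(0, 1)),
          top_k <= ncol(data))

# With the repository sources loaded, the call would look like:
# INFOplus.Output_Categorical.Covariates_Categorical(data, labels, treatment, top_k)$ranking
```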

Furthermore, we provide functions for other data types:

```INFOplus.Output_Categorical.Covariates_Continuous```: The covariates can be either all continuous or mixed (continuous and categorical). Continuous covariates are discretised using Scott's rule by default.

```INFOplus.Output_Survival.Covariates_Categorical```: For survival (time-to-event) output targets and categorical covariates.

```INFOplus.Output_Survival.Covariates_Continuous```: For survival (time-to-event) output targets and continuous or mixed (continuous and categorical) covariates.
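
As an illustration of the Scott's rule discretisation mentioned above, the following sketch bins a hypothetical continuous biomarker using base R. It assumes equal-width bins whose number comes from `grDevices::nclass.scott`; the repository's own discretisation code may differ in its details.

```r
# Scott's rule chooses a bin width h = 3.5 * sd(x) * n^(-1/3);
# nclass.scott() converts that width into a number of bins.
set.seed(42)
x <- rnorm(500)  # hypothetical continuous biomarker

# Number of bins suggested by Scott's rule
n_bins <- grDevices::nclass.scott(x)

# Equal-width binning; labels = FALSE returns integer codes 1..n_bins
x_binned <- cut(x, breaks = n_bins, labels = FALSE)

# Inspect how many patients fall into each bin
table(x_binned)
```

The resulting integer codes can then be treated as a categorical covariate.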

Finally, we provide the same functions for deriving the univariate INFO ranking (e.g. ```INFO.Output_Categorical.Covariates_Categorical```).

## Example

We provide a source file (```Functions-GenerateData.R```) to generate the synthetic scenarios presented in the paper. The following example shows how to derive the predictive rankings using our code.

```
## Load libraries
library(MASS)     # To generate synthetic data by sampling a multivariate normal
library(infotheo) # Information theoretic library

## Load sources
source("Functions-GenerateData.R")               # Function to generate synthetic data
source("InformationTheory-PredictiveRankings.R") # Functions to derive predictive rankings

###################################
##### Generate synthetic data #####
###################################
model <- 3           # Which model to use (1, 2, 3, 4, 5, 6, 7) - details in the paper
theta_pred <- 1      # Strength of predictive part
num_features <- 20   # Number of covariates
sample_size <- 2000  # Number of examples

dataset <- Generate.Data(sample_size, num_features, theta_pred, model)

# The methods will return the top-k biomarkers
top_k <- 5

#######################################################
# Ranking the biomarkers on their predictive strength #
#######################################################
# INFO, which captures first order interactions (returns the top_k = 5 biomarkers)
INFO.Output_Categorical.Covariates_Categorical(dataset$data, dataset$labels, dataset$treatment)$ranking[1:top_k] # this function returns the ranking

# INFO+, which captures second order interactions (returns the top_k = 5 biomarkers)
INFOplus.Output_Categorical.Covariates_Categorical(dataset$data, dataset$labels, dataset$treatment, top_k)$ranking # this function returns the ranking
```
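
For intuition about what a univariate information theoretic ranking can look like, here is a self-contained sketch in base R. It assumes the score for each biomarker is the conditional mutual information between treatment and outcome given that biomarker, estimated from empirical frequencies. This is an illustration only and is not taken from the repository's code; the helper names `joint_entropy` and `cond_mutual_info` and the toy data are hypothetical.

```r
# Plug-in entropy (in nats) of the joint distribution of discrete vectors
joint_entropy <- function(...) {
  counts <- table(...)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# I(T; Y | X) = H(T, X) + H(Y, X) - H(T, Y, X) - H(X)
cond_mutual_info <- function(trt, y, x) {
  joint_entropy(trt, x) + joint_entropy(y, x) -
    joint_entropy(trt, y, x) - joint_entropy(x)
}

# Toy data: the treatment only changes the response rate when x_pred == 1
set.seed(7)
n <- 4000
treatment <- rbinom(n, 1, 0.5)
x_pred    <- rbinom(n, 1, 0.5)  # biomarker that modifies the treatment effect
x_noise   <- rbinom(n, 1, 0.5)  # irrelevant biomarker
labels    <- rbinom(n, 1, 0.2 + 0.6 * treatment * x_pred)

# Score each biomarker and rank by decreasing score
scores <- c(x_pred  = cond_mutual_info(treatment, labels, x_pred),
            x_noise = cond_mutual_info(treatment, labels, x_noise))
ranking <- names(sort(scores, decreasing = TRUE))
print(ranking)
```

In this toy setting the irrelevant biomarker still receives a positive score, because the treatment has an overall effect on the outcome; it is the difference between scores that drives the ranking.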