a/README.md b/README.md
## A Denoised Multi-omics Integration Framework for Cancer Subtype Classification and Survival Prediction

---

### What we do?

- We developed a new feature selection method, Feature Selection with Distribution (FSD), for multi-omics data denoising and feature selection.

- We developed a biologically informed deep learning algorithm for multi-omics integration to predict cancer subtypes and patient survival.

- Commonly used feature selection methods (ANOVA, RFE, LASSO, and PCA) are included for comparison.

- Several machine learning and deep learning algorithms, including Random Forest, XGBoost, SVM, DNN, MOGONET<sup>1</sup>, and Moanna<sup>2</sup>, are included for comparison. MOGONET uses graph convolutional networks for multi-omics integration, and Moanna is an autoencoder-based neural network.

---

<div align=center>
<img src="https://github.com/BioAI-kits/AttentionMOI/blob/master/img/Figure1.jpg?raw=true" />
</div>

**Introduction of project**. The availability of high-throughput sequencing data creates opportunities to understand human diseases comprehensively, but also challenges in training machine learning models on such high-dimensional data. Here, we propose a denoised multi-omics integration framework for cancer subtype classification and survival prediction. First, we designed a distribution-based feature denoising algorithm, Feature Selection with Distribution (FSD), to reduce the dimensionality of omics features. Second, we introduced a multi-omics integration framework, Attention Multi-Omics Integration (AttentionMOI), which is inspired by the central dogma of biology. We demonstrated that FSD improved model performance with either single-omics or multi-omics data in 13 TCGA cancers for survival prediction and kidney cancer subtype identification. Our integration framework also outperformed traditional artificial intelligence models and current multi-omics integration algorithms on high-dimensional features. Furthermore, the features identified by FSD were related to cancer prognosis and could be considered as biomarkers.

---

### Install

You can install the program and its dependencies via pip. We recommend using conda to create a virtual environment with Python 3.9 or higher.

(optional) Create a virtual environment

```bash
conda create -n env_moi python=3.9

conda activate env_moi  # Activate the environment
```

Install

```bash
pip install AttentionMOI
```

### Parameters

After installation, a `moi` command is available in your terminal. It is the only interface to the program, and you use it to build an omics model.

First, run the following command to get detailed help information.

```
moi -h
```

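If you want to confirm from a script that the installation exposed the command, one option is to look it up on `PATH`. The sketch below is only an illustration using Python's standard library; the helper name `moi_available` is ours and is not part of AttentionMOI.

```python
import shutil


def moi_available(command: str = "moi") -> bool:
    """Return True if `command` can be found on the current PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if moi_available():
        print("moi found; run `moi -h` for help")
    else:
        print("moi not found; check that the environment is activated")
```
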
The parameters are also described below.


**1. Input**

The input file format is described below; you can also refer to the example data we provide (https://github.com/BioAI-kits/AttentionMOI/tree/master/AttentionMOI/example).

f | omic_file

> REQUIRED: File path for omics files (each should be a matrix)

**NOTE: The file must be in CSV format, such as rna.csv. It can also be gzip-compressed, such as rna.csv.gz.** Example: The first line is the header, containing patient_id followed by the gene (feature) names.

>  patient_id,A1BG,A1CF,A2BP1,A2LD1,....
>
>  TCGA.KL.8323,3.3491,0.0,0.0,5.8939,....
>
>  TCGA.KL.8324,2.922,0.5557,0.5557,6.4226,....

n | omic_name

> REQUIRED: Omic names for the omics files, in the same order as the omics files

l | label_file

> REQUIRED: File path for the label file

**NOTE: The file must be in CSV format, such as label.csv. It can also be gzip-compressed, such as label.csv.gz.** Example: The first line is the header; patient_id and label are the sample name and the sample classification label, respectively.

> patient_id,label
>
> TCGA.KL.8328,0
>
> TCGA.KL.8339,0
>
> TCGA.KM.8439,1
>
> TCGA.KM.8441,1
>
> TCGA.KM.8442,1

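As a rough pre-flight check of the formats above, the sketch below reads an omics matrix and a label file with Python's standard library and reports patients that appear in only one of them. It assumes only what is described here (a `patient_id` first column and optional gzip compression); the helper names are hypothetical, and AttentionMOI itself may handle mismatched samples differently.

```python
import csv
import gzip


def read_rows(path):
    """Read a .csv or .csv.gz file into a list of dicts keyed by the header."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))


def patient_ids(rows):
    """Collect the patient_id column; raises KeyError if the column is missing."""
    return {row["patient_id"] for row in rows}


def compare_ids(omic_path, label_path):
    """Return (only_in_omic, only_in_label) as sorted lists of patient IDs."""
    omic_ids = patient_ids(read_rows(omic_path))
    label_ids = patient_ids(read_rows(label_path))
    return sorted(omic_ids - label_ids), sorted(label_ids - omic_ids)
```
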
**2. Output**

o | outdir

> OPTIONAL: Path of the output directory, default=./output

**3. Feature selection**

method

> OPTIONAL: Feature selection method, chosen from ANOVA, RFE, LASSO, PCA. By default, no feature selection is performed

percentile

> OPTIONAL: Percentage of features to keep for ANOVA (integer between 1 and 100), only used with ANOVA, default=30

num_pc

> OPTIONAL: Number of principal components (PCs) to keep for PCA (integer), only used with PCA, default=50

FSD

> OPTIONAL: Whether to use FSD to mitigate noise in the omics data. FSD is off by default; pass --FSD to enable it

i | iteration

> OPTIONAL: Number of FSD iterations (integer), default=10

s | seed

> OPTIONAL: Random seed for FSD (integer), default=0

threshold

> OPTIONAL: FSD threshold for selecting features (float), default=0.8 (keep features that are selected in 80 percent of FSD iterations)

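To make the `iteration` and `threshold` parameters concrete, the sketch below shows only a final voting step: given which features each iteration selected, it keeps those selected in at least a `threshold` fraction of the iterations. How FSD chooses features within one iteration is not described in this README, so the per-iteration selections here are made-up inputs, the function name is hypothetical, and reading the threshold as "at least" is our assumption.

```python
from collections import Counter


def vote_features(selections, threshold=0.8):
    """Keep features selected in at least `threshold` of the iterations.

    selections: one collection of selected feature names per iteration.
    Returns the kept feature names in sorted order.
    """
    if not selections:
        return []
    # Count each feature at most once per iteration.
    counts = Counter(name for chosen in selections for name in set(chosen))
    n_iter = len(selections)
    return sorted(name for name, hits in counts.items() if hits / n_iter >= threshold)


if __name__ == "__main__":
    # Hypothetical selections from 5 iterations (illustrative only).
    runs = [
        {"A1BG", "A1CF", "A2LD1"},
        {"A1BG", "A2LD1"},
        {"A1BG", "A1CF", "A2LD1"},
        {"A1BG", "A2LD1"},
        {"A1BG", "A1CF"},
    ]
    print(vote_features(runs, threshold=0.8))  # ['A1BG', 'A2LD1']
```
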
**4. Building Model**

m | model

> OPTIONAL: Model name, chosen from DNN, Net (Net for AttentionMOI), RF, XGboost, svm, mogonet, moanna, default=DNN.

t | test_size

> OPTIONAL: Proportion of the data held out for testing when splitting into training and test sets (float), default=0.3 (30 percent of the data for testing)

b | batch

> OPTIONAL: Mini-batch size for model training (integer), default=32

e | epoch

> OPTIONAL: Number of epochs for model training (integer), default=300

r | lr

> OPTIONAL: Learning rate for model training (float), default=0.0001

w | weight_decay

> OPTIONAL: weight_decay parameter for model training (float), default=0.0001

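As an illustration of what `test_size` means, the sketch below shuffles a list of patient IDs and holds out the given fraction for testing. It is a standard-library stand-in, not AttentionMOI's code: the function name and its `seed` argument are hypothetical, and the tool's actual splitting (for example, whether it stratifies by label or how it rounds) is not documented here.

```python
import random


def split_patients(patient_ids, test_size=0.3, seed=0):
    """Shuffle patient IDs and hold out a `test_size` fraction for testing.

    Returns (train_ids, test_ids).
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    ids = list(patient_ids)
    # A private Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]


if __name__ == "__main__":
    train, test = split_patients([f"P{i}" for i in range(10)])
    print(len(train), len(test))  # 7 3
```
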
---

### Example

Example (data can be downloaded from https://github.com/BioAI-kits/AttentionMOI):

```
moi -f GBM_exp.csv.gz -f GBM_met.csv.gz -f GBM_logRatio.csv.gz -n rna -n met -n cnv -l GBM_label.csv --FSD -m Net -o GBM_Result
```

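When scripting several runs, it can help to assemble the command programmatically so that each `-f` file stays paired with its `-n` name. The sketch below builds the argument list for the example above using only the flags shown in this README; the function name `build_moi_command` is hypothetical, and nothing is executed unless you pass the resulting list to `subprocess.run` yourself.

```python
def build_moi_command(omics, label_file, model="Net", outdir="GBM_Result", use_fsd=True):
    """Build the `moi` argument list.

    omics: list of (omic_name, file_path) pairs; the order is preserved so
    that the -n names line up with the -f files.
    """
    if not omics:
        raise ValueError("at least one omics file is required")
    cmd = ["moi"]
    for _, path in omics:
        cmd += ["-f", path]
    for name, _ in omics:
        cmd += ["-n", name]
    cmd += ["-l", label_file]
    if use_fsd:
        cmd.append("--FSD")
    cmd += ["-m", model, "-o", outdir]
    return cmd


if __name__ == "__main__":
    omics = [
        ("rna", "GBM_exp.csv.gz"),
        ("met", "GBM_met.csv.gz"),
        ("cnv", "GBM_logRatio.csv.gz"),
    ]
    print(" ".join(build_moi_command(omics, "GBM_label.csv")))
```
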
---

### Ref.

1. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification

2. Moanna: Multi-Omics Autoencoder-Based Neural Network Algorithm for Predicting Breast Cancer Subtypes


---

All rights reserved.