Diff of /README.md [000000] .. [3513e2]

## A Denoised Multi-omics Integration Framework for Cancer Subtype Classification and Survival Prediction

---

### What we do

- We developed a new feature selection method, Feature Selection with Distribution (FSD), for multi-omics data denoising and feature selection.

- We developed a biologically informed deep learning algorithm for multi-omics integration to predict cancer subtypes and patient survival.

- Commonly used feature selection methods (ANOVA, RFE, LASSO, and PCA) are included for comparison.

- Several machine learning and deep learning algorithms, including Random Forest, XGBoost, SVM, DNN, MOGONET<sup>1</sup>, and Moanna<sup>2</sup>, are included for comparison on multi-omics integration. MOGONET uses graph convolutional networks for multi-omics integration, and Moanna is an autoencoder-based neural network.

---

<div align=center>
<img src="https://github.com/BioAI-kits/AttentionMOI/blob/master/img/Figure1.jpg" />
</div>

**Introduction of project**. The availability of high-throughput sequencing data creates opportunities to understand human diseases comprehensively, but also poses challenges for training machine learning models on such high-dimensional data. Here, we propose a denoised multi-omics integration framework for cancer subtype classification and survival prediction. First, a distribution-based feature denoising algorithm, Feature Selection with Distribution (FSD), was designed to reduce the dimensionality of omics features. Second, we introduced a multi-omics integration framework, Attention Multi-Omics Integration (AttentionMOI), which is inspired by the central dogma of biology. We demonstrated that FSD improved model performance with either single-omics or multi-omics data in 13 TCGA cancers for survival prediction and in kidney cancer subtype identification. Our integration framework also outperformed traditional artificial intelligence models and current multi-omics integration algorithms when the number of features was high. Furthermore, the features identified by FSD were related to cancer prognosis and could be considered as biomarkers.

---

### Install

You can install the program and its dependencies via pip. We recommend using conda to create a virtual environment with Python 3.9 or higher.

(optional) Create a virtual environment

```bash
conda create -n env_moi python=3.9

conda activate env_moi  # Activate the environment
```

Install

```bash
pip install AttentionMOI
```

### Parameters

After installation, a `moi` command is available in your terminal. It is the only interface to the program, and you use it to build an omics model.

First, run the following command to get detailed help information.

```
moi -h
```

The parameters are also described below:


**1. Input**

The input file format is described below. You can also refer to the example data we provide (https://github.com/BioAI-kits/AttentionMOI/tree/master/AttentionMOI/example).

f | omic_file

> REQUIRED: File path for omics files (each file should be a matrix)

**NOTE: The file must be in CSV format, such as rna.csv. It can also be compressed with gzip, such as rna.csv.gz.** Example: the first line is the header, containing patient_id followed by the gene (feature) names.

>  patient_id,A1BG,A1CF,A2BP1,A2LD1,....
>
>  TCGA.KL.8323,3.3491,0.0,0.0,5.8939,....
>
>  TCGA.KL.8324,2.922,0.5557,0.5557,6.4226,....

n | omic_name

> REQUIRED: Omic names for the omics files, given in the same order as the omics files

l | label_file

> REQUIRED: File path for the label file

**NOTE: The file must be in CSV format, such as label.csv. It can also be compressed with gzip, such as label.csv.gz.** Example: the first line is the header; patient_id and label are the sample name and the sample classification label, respectively.

> patient_id,label
>
> TCGA.KL.8328,0
>
> TCGA.KL.8339,0
>
> TCGA.KM.8439,1
>
> TCGA.KM.8441,1
>
> TCGA.KM.8442,1


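
Before running `moi`, it can help to confirm that an omics file and the label file follow the layout above and refer to the same patients. The sketch below is not part of AttentionMOI; it uses only the Python standard library, the helper names `read_table` and `shared_patients` are hypothetical, and the patient IDs, gene names, and values are toy data. How `moi` itself treats patients that appear in only one file is not stated here, so treat this as a pre-flight check only.

```python
import csv
import gzip
import os
import tempfile


def read_table(path):
    """Read a CSV (optionally gzip-compressed) keyed by its first column."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = {row[0]: row[1:] for row in reader if row}
    # Drop the patient_id column name; the rest are feature (or label) names.
    return header[1:], rows


def shared_patients(omic_rows, label_rows):
    """Return the patient IDs present in both tables, sorted."""
    return sorted(set(omic_rows) & set(label_rows))


# Toy data only: made-up patients, genes, and values for illustration.
workdir = tempfile.mkdtemp()
omic_path = os.path.join(workdir, "rna.csv.gz")
label_path = os.path.join(workdir, "label.csv")

with gzip.open(omic_path, "wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["patient_id", "GENE_A", "GENE_B"])
    writer.writerow(["P001", 3.35, 0.0])
    writer.writerow(["P002", 2.92, 0.56])

with open(label_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["patient_id", "label"])
    writer.writerow(["P001", 0])
    writer.writerow(["P002", 1])
    writer.writerow(["P003", 1])  # this patient has no omics row

features, omic_rows = read_table(omic_path)
_, label_rows = read_table(label_path)
common = shared_patients(omic_rows, label_rows)
missing = sorted(set(label_rows) - set(omic_rows))
print("features:", features)
print("patients in both files:", common)
print("labelled but missing omics:", missing)
```

The same `read_table` helper works for both plain and gzip-compressed files because it picks the opener from the file extension.
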
**2. Output**

o | outdir

> OPTIONAL: Path of the output directory, default=./output


**3. Feature selection**

method

> OPTIONAL: Feature selection method, chosen from ANOVA, RFE, LASSO, PCA. By default, no feature selection is applied

percentile

> OPTIONAL: Percentage of features to keep for ANOVA (integer between 1 and 100), only used with ANOVA, default=30

num_pc

> OPTIONAL: Number of PCs to keep for PCA (integer), only used with PCA, default=50

FSD

> OPTIONAL: Whether to use FSD to mitigate noise in the omics data. FSD is off by default; pass --FSD to enable it

i | iteration

> OPTIONAL: Number of FSD iterations (integer), default=10

s | seed

> OPTIONAL: Random seed for FSD (integer), default=0

threshold

> OPTIONAL: FSD threshold for selecting features (float), default=0.8 (keep features that are selected in 80 percent of FSD iterations)


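
The interaction between the iteration count and the threshold can be illustrated with a small sketch. This is not the FSD algorithm itself, which is not described in this README; it only shows the final thresholding step under the assumption that each iteration yields a set of selected features and that the comparison is inclusive (selected in at least 80 percent of iterations). The function name `select_by_frequency` and the toy runs are hypothetical.

```python
def select_by_frequency(selections, threshold=0.8):
    """Keep features chosen in at least `threshold` of the iterations.

    `selections` holds one entry per iteration; each entry is the set of
    feature names selected in that iteration.
    """
    if not selections:
        return []
    counts = {}
    for chosen in selections:
        for feature in chosen:
            counts[feature] = counts.get(feature, 0) + 1
    total = len(selections)
    # Inclusive comparison is an assumption, not confirmed by the README.
    return sorted(f for f, c in counts.items() if c / total >= threshold)


# Hypothetical outcome of 10 iterations (the documented default count):
# A1BG is selected every time, A1CF in 8 of the 10 iterations.
runs = [{"A1BG", "A1CF"}] * 8 + [{"A1BG"}] * 2
kept_default = select_by_frequency(runs, threshold=0.8)
kept_strict = select_by_frequency(runs, threshold=0.9)
print(kept_default)
print(kept_strict)
```

Raising the threshold keeps only the features that are selected most consistently across iterations, at the cost of a smaller feature set.
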
**4. Building Model**

m | model

> OPTIONAL: Model name, chosen from DNN, Net (Net for AttentionMOI), RF, XGboost, svm, mogonet, moanna, default=DNN.

t | test_size

> OPTIONAL: Proportion of the data held out for testing when splitting into training and test sets (float), default=0.3 (30 percent of the data for testing)

b | batch

> OPTIONAL: Mini-batch size for model training (integer), default=32

e | epoch

> OPTIONAL: Number of epochs for model training (integer), default=300

r | lr

> OPTIONAL: Learning rate for model training (float), default=0.0001

w | weight_decay

> OPTIONAL: weight_decay parameter for model training (float), default=0.0001

---

### Example

Example (data can be downloaded from https://github.com/BioAI-kits/AttentionMOI):

```
moi -f GBM_exp.csv.gz -f GBM_met.csv.gz -f GBM_logRatio.csv.gz -n rna -n met -n cnv -l GBM_label.csv --FSD -m Net -o GBM_Result
```

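
Because each `-f` file must be matched by a `-n` name in the same order, it can be convenient to build the command from (file, name) pairs so the two lists cannot drift apart. The sketch below is a hypothetical helper, not part of AttentionMOI; it uses only the flags that appear in the example above (`-f`, `-n`, `-l`, `--FSD`, `-m`, `-o`) and only prints the command.

```python
import shlex


def build_moi_command(omics, label_file, model="Net", outdir="GBM_Result", use_fsd=True):
    """Assemble the `moi` argument list from (file, name) pairs."""
    cmd = ["moi"]
    for path, _ in omics:
        cmd += ["-f", path]
    # Names are emitted in the same order as the files they describe.
    for _, name in omics:
        cmd += ["-n", name]
    cmd += ["-l", label_file]
    if use_fsd:
        cmd.append("--FSD")
    cmd += ["-m", model, "-o", outdir]
    return cmd


omics = [
    ("GBM_exp.csv.gz", "rna"),
    ("GBM_met.csv.gz", "met"),
    ("GBM_logRatio.csv.gz", "cnv"),
]
cmd = build_moi_command(omics, "GBM_label.csv")
command_line = shlex.join(cmd)
print(command_line)

# To run it for real (requires AttentionMOI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting issues if a file path contains spaces.
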

---

### Ref.

1. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification

2. Moanna: Multi-Omics Autoencoder-Based Neural Network Algorithm for Predicting Breast Cancer Subtypes

---

All rights reserved.