# COVID-19 EHR Benchmarks

> A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records

![cover](assets/cover.png)

The TJH datasets and presentation slides are available in the GitHub releases.

**This repo is no longer active. Please check our latest repo: https://github.com/yhzhu99/pyehr**

## Prediction Tasks

- [x] (Early) Mortality outcome prediction
- [x] Length-of-stay prediction
- [x] Multi-task/Two-stage prediction

## Model Zoo

### Machine Learning Models

- [x] Random forest (RF)
- [x] Decision tree (DT)
- [x] Gradient boosting decision tree (GBDT)
- [x] XGBoost
- [x] CatBoost

### Deep Learning Models

- [x] Multi-layer perceptron (MLP)
- [x] Recurrent neural network (RNN)
- [x] Long short-term memory network (LSTM)
- [x] Gated recurrent unit (GRU)
- [x] Temporal convolutional network (TCN)
- [x] Transformer

### EHR Predictive Models

- [x] RETAIN
- [x] StageNet
- [x] Dr. Agent
- [x] AdaCare
- [x] ConCare
- [x] GRASP

## Code Description

```shell
app/
    apis/
        ml_{task}.py # machine learning pipelines
        dl_{task}.py # deep learning pipelines
    core/
        evaluation/ # evaluation metrics
        utils/
    datasets/ # dataset loader scripts
    models/
        backbones/ # feature extractors
        classifiers/ # prediction heads
        losses/ # task-related loss functions
        build_model.py # assembles backbones and heads
configs/
    _base_/ # common configs
        datasets/ # dataset basic info, training epochs and dataset split strategy
            {dataset}.yaml
        db.yaml # database settings (optional)
    {config_name}.yaml # detailed model settings
checkpoints/ # model checkpoints are stored here
datasets/ # raw/processed datasets and pre-processing scripts
main.py # main entry point
requirements.txt # code dependencies
```
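
The `build_model.py` step pairs a feature-extracting backbone from `models/backbones/` with a prediction head from `models/classifiers/`. Below is a minimal, framework-free sketch of that composition pattern; the names (`MeanBackbone`, `LinearHead`, `build_model`'s signature) are illustrative stand-ins, not the repo's actual classes, which are PyTorch modules.

```python
# Sketch of the backbone + head composition pattern (illustrative names only).

class MeanBackbone:
    """Stand-in feature extractor: averages features over time steps."""
    def __call__(self, x):  # x: list of T feature vectors, each of length D
        T, D = len(x), len(x[0])
        return [sum(step[d] for step in x) / T for d in range(D)]

class LinearHead:
    """Stand-in prediction head: weighted sum of the extracted features."""
    def __init__(self, weights):
        self.weights = weights
    def __call__(self, h):
        return sum(w * v for w, v in zip(self.weights, h))

def build_model(backbone, head):
    """Chain a backbone and a head into one callable pipeline."""
    def model(x):
        return head(backbone(x))
    return model

model = build_model(MeanBackbone(), LinearHead([0.5, 2.0]))
risk = model([[1.0, 0.0], [3.0, 2.0]])  # two time steps, two features
```

Because the pipeline is assembled from two independent parts, any backbone can be paired with any prediction head without changing either one.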

## Requirements

- Python 3.7+
- PyTorch 1.10+
- CUDA 10.2+ (if you plan to use a GPU)

Note:

- Most models can be run quickly on CPU.
- A GPU with 12 GB of memory is required to run the ConCare model on the CDSL dataset.
- The TCN model may run much faster on CPU.

## Usage

- Install the requirements.

    ```bash
    pip install -r requirements.txt [-i https://pypi.tuna.tsinghua.edu.cn/simple] # the bracketed mirror option is optional
    ```

- Download the TJH dataset from [An interpretable mortality prediction model for COVID-19 patients](https://www.nature.com/articles/s42256-020-0180-7), unzip it, and put it in the `datasets/tongji/raw_data/` folder.
- Run the preprocessing notebook. (You can skip this step if you have already done it.)
- The CDSL dataset follows the same process. You need to apply for access to the [Covid Data Save Lives Dataset](https://www.hmhospitales.com/coronavirus/covid-data-save-lives/english-version) if necessary.
- Run the following command to train models.

    ```bash
    python main.py --cfg configs/xxx.yaml [--train] [--cuda CUDA_NUM] [--db]
    # Notes:
    # 1) Pass --train to train; without it, only the inference stage runs.
    # 2) To use CUDA, pass --cuda 0/1/2/...
    # 3) If you have configured database settings, pass --db to upload performance results to the database after training.
    ```

## Data Format

The shapes and meanings of the tensors fed to the models are as follows:

- `x.pkl`: (N, T, D) tensor, where N is the number of patients, T is the number of time steps, and D is the number of features. Along the $D$ dimension, the first $x$ features are demographic features and the next $y$ features are lab test features, where $x + y = D$.
- `y.pkl`: (N, T, 2) tensor, where the 2 values are [outcome, length-of-stay] for each time step.
- `visits_length.pkl`: (N, ) tensor, where each value is the number of visits for that patient.
- `missing_mask.pkl`: same shape as `x.pkl`; tells whether features are imputed. `1`: existing, `0`: missing.

Pre-processed data are stored in the `datasets/{dataset}/processed_data/` folder.

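As a quick sanity check, the pickled tensors can be loaded and their shapes verified like this (a sketch: `load_processed` is an illustrative helper, not part of the repo, and the folder argument is assumed to point at a `processed_data/` directory):

```python
import pickle

import numpy as np

def load_processed(folder):
    """Load the four pickled tensors and verify that their shapes agree."""
    tensors = {}
    for name in ("x", "y", "visits_length", "missing_mask"):
        with open(f"{folder}/{name}.pkl", "rb") as f:
            tensors[name] = np.asarray(pickle.load(f))
    n, t, d = tensors["x"].shape
    assert tensors["y"].shape == (n, t, 2)             # [outcome, length-of-stay]
    assert tensors["visits_length"].shape == (n,)      # visits per patient
    assert tensors["missing_mask"].shape == (n, t, d)  # 1: existing, 0: missing
    return tensors
```
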
## Database preparation [Optional]

Example `db.yaml` settings; put it in `configs/_base_/db.yaml`:

```yaml
engine: postgresql # or mysql
username: db_user
password: db_password
host: xx.xxx.com
port: 5432
database: db_name
```
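
These fields are typically assembled into a single connection URL. A hedged sketch of that assembly (`make_db_url` is illustrative, not a helper from this repo):

```python
def make_db_url(cfg):
    """Build an SQLAlchemy-style database URL from the db.yaml fields."""
    return "{engine}://{username}:{password}@{host}:{port}/{database}".format(**cfg)

cfg = {
    "engine": "postgresql",
    "username": "db_user",
    "password": "db_password",
    "host": "xx.xxx.com",
    "port": 5432,
    "database": "db_name",
}
url = make_db_url(cfg)  # postgresql://db_user:db_password@xx.xxx.com:5432/db_name
```
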

Create a `perflog` table in your database:

```sql
-- postgresql example
create table perflog
(
    id serial
        constraint perflog_pk
            primary key,
    record_time integer,
    model_name text,
    performance text,
    hidden_dim integer,
    dataset text,
    model_type text,
    config text,
    task text
);

-- mysql example
create table perflog
(
    id int auto_increment,
    record_time int null,
    model_name text null,
    task text null,
    performance text null,
    hidden_dim int null,
    dataset text null,
    model_type text null,
    config text null,
    constraint perflog_id_uindex
        unique (id)
);

alter table perflog
    add primary key (id);
```
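
To illustrate the kind of record a `--db` run would write, here is an equivalent table and insert using Python's built-in `sqlite3` as a stand-in for PostgreSQL/MySQL. The column set matches the table above, but the inserted values are made-up placeholders, not the repo's actual record format:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for the configured database
conn.execute("""
    create table perflog (
        id integer primary key autoincrement,
        record_time integer,
        model_name text,
        performance text,
        hidden_dim integer,
        dataset text,
        model_type text,
        config text,
        task text
    )
""")
conn.execute(
    "insert into perflog (record_time, model_name, performance, hidden_dim,"
    " dataset, model_type, config, task) values (?, ?, ?, ?, ?, ?, ?, ?)",
    (int(time.time()), "gru", '{"auroc": 0.91}', 64, "tj", "dl",
     "tj_outcome_gru_ep100_kf10_bs64_hid64", "outcome"),
)
conn.commit()
```
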

## Configs

Below are the configurations after hyperparameter selection.

<details>

<summary>ML models</summary>

```bash
hm_los_catboost_kf10_md6_iter150_lr0.1_test
hm_los_decision_tree_kf10_md10_test
hm_los_gbdt_kf10_lr0.1_ss0.8_ne100_test
hm_los_random_forest_kf10_md10_mss2_ne100_test
hm_los_xgboost_kf10_lr0.01_md5_cw3_test
hm_outcome_catboost_kf10_md3_iter150_lr0.1_test
hm_outcome_decision_tree_kf10_md10_test
hm_outcome_gbdt_kf10_lr0.1_ss0.6_ne100_test
hm_outcome_random_forest_kf10_md20_mss10_ne100_test
hm_outcome_xgboost_kf10_lr0.1_md7_cw3_test
tj_los_catboost_kf10_md3_iter150_lr0.1_test
tj_los_decision_tree_kf10_md10_test
tj_los_gbdt_kf10_lr0.1_ss0.8_ne100_test
tj_los_random_forest_kf10_md20_mss5_ne100_test
tj_los_xgboost_kf10_lr0.01_md5_cw1_test
tj_outcome_catboost_kf10_md3_iter150_lr0.1_test
tj_outcome_decision_tree_kf10_md10_test
tj_outcome_gbdt_kf10_lr0.1_ss0.6_ne100_test
tj_outcome_random_forest_kf10_md20_mss2_ne10_test
tj_outcome_xgboost_kf10_lr0.1_md5_cw5_test
```

</details>

<details>
<summary>DL/EHR models</summary>

```bash
tj_outcome_grasp_ep100_kf10_bs64_hid64
tj_los_grasp_ep100_kf10_bs64_hid128
tj_outcome_concare_ep100_kf10_bs64_hid128
tj_los_concare_ep100_kf10_bs64_hid128
tj_outcome_agent_ep100_kf10_bs64_hid128
tj_los_agent_ep100_kf10_bs64_hid64
tj_outcome_adacare_ep100_kf10_bs64_hid64
tj_los_adacare_ep100_kf10_bs64_hid64
tj_outcome_transformer_ep100_kf10_bs64_hid128
tj_los_transformer_ep100_kf10_bs64_hid64
tj_outcome_tcn_ep100_kf10_bs64_hid128
tj_los_tcn_ep100_kf10_bs64_hid128
tj_outcome_stagenet_ep100_kf10_bs64_hid64
tj_los_stagenet_ep100_kf10_bs64_hid64
tj_outcome_rnn_ep100_kf10_bs64_hid64
tj_los_rnn_ep100_kf10_bs64_hid128
tj_outcome_retain_ep100_kf10_bs64_hid128
tj_los_retain_ep100_kf10_bs64_hid128
tj_outcome_mlp_ep100_kf10_bs64_hid64
tj_los_mlp_ep100_kf10_bs64_hid128
tj_outcome_lstm_ep100_kf10_bs64_hid64
tj_los_lstm_ep100_kf10_bs64_hid128
tj_outcome_gru_ep100_kf10_bs64_hid64
tj_los_gru_ep100_kf10_bs64_hid128
tj_multitask_rnn_ep100_kf10_bs64_hid64
tj_multitask_lstm_ep100_kf10_bs64_hid128
tj_multitask_gru_ep100_kf10_bs64_hid128
tj_multitask_transformer_ep100_kf10_bs64_hid128
tj_multitask_tcn_ep100_kf10_bs64_hid64
tj_multitask_mlp_ep100_kf10_bs64_hid128
tj_multitask_adacare_ep100_kf10_bs64_hid128
tj_multitask_agent_ep100_kf10_bs64_hid64
tj_multitask_concare_ep100_kf10_bs64_hid128
tj_multitask_stagenet_ep100_kf10_bs64_hid64
tj_multitask_grasp_ep100_kf10_bs64_hid128
tj_multitask_retain_ep100_kf10_bs64_hid64
hm_outcome_mlp_ep100_kf10_bs64_hid64
hm_los_mlp_ep100_kf10_bs64_hid128
hm_outcome_lstm_ep100_kf10_bs64_hid64
hm_los_lstm_ep100_kf10_bs64_hid128
hm_outcome_gru_ep100_kf10_bs64_hid64
hm_los_gru_ep100_kf10_bs64_hid128
hm_outcome_grasp_ep100_kf10_bs64_hid64
hm_los_grasp_ep100_kf10_bs64_hid64
hm_outcome_concare_ep100_kf10_bs64_hid128
hm_los_concare_ep100_kf10_bs64_hid64
hm_outcome_agent_ep100_kf10_bs64_hid128
hm_los_agent_ep100_kf10_bs64_hid64
hm_outcome_adacare_ep100_kf10_bs64_hid64
hm_los_adacare_ep100_kf10_bs64_hid128
hm_outcome_transformer_ep100_kf10_bs64_hid128
hm_los_transformer_ep100_kf10_bs64_hid128
hm_outcome_tcn_ep100_kf10_bs64_hid64
hm_los_tcn_ep100_kf10_bs64_hid128
hm_outcome_stagenet_ep100_kf10_bs64_hid64
hm_los_stagenet_ep100_kf10_bs64_hid64
hm_outcome_rnn_ep100_kf10_bs64_hid64
hm_los_rnn_ep100_kf10_bs64_hid128
hm_outcome_retain_ep100_kf10_bs64_hid128
hm_los_retain_ep100_kf10_bs64_hid128
hm_multitask_rnn_ep100_kf10_bs512_hid128
hm_multitask_lstm_ep100_kf10_bs512_hid64
hm_multitask_gru_ep100_kf10_bs512_hid128
hm_multitask_transformer_ep100_kf10_bs512_hid64
hm_multitask_tcn_ep100_kf10_bs512_hid64
hm_multitask_mlp_ep100_kf10_bs512_hid128
hm_multitask_adacare_ep100_kf10_bs512_hid128
hm_multitask_agent_ep100_kf10_bs512_hid128
hm_multitask_concare_ep100_kf10_bs64_hid128
hm_multitask_stagenet_ep100_kf10_bs512_hid128
hm_multitask_grasp_ep100_kf10_bs512_hid64
hm_multitask_retain_ep100_kf10_bs512_hid128
```
</details>

<details>
<summary>Two-stage configs</summary>

```bash
tj_twostage_adacare_kf10.yaml
tj_twostage_agent_kf10.yaml
tj_twostage_concare_kf10.yaml
tj_twostage_gru_kf10.yaml
tj_twostage_lstm_kf10.yaml
tj_twostage_mlp_kf10.yaml
tj_twostage_retain_kf10.yaml
tj_twostage_rnn_kf10.yaml
tj_twostage_stagenet_kf10.yaml
tj_twostage_tcn_kf10.yaml
tj_twostage_transformer_kf10.yaml
tj_twostage_grasp_kf10.yaml
hm_twostage_adacare_kf10.yaml
hm_twostage_agent_kf10.yaml
hm_twostage_concare_kf10.yaml
hm_twostage_gru_kf10.yaml
hm_twostage_lstm_kf10.yaml
hm_twostage_mlp_kf10.yaml
hm_twostage_retain_kf10.yaml
hm_twostage_rnn_kf10.yaml
hm_twostage_stagenet_kf10.yaml
hm_twostage_tcn_kf10.yaml
hm_twostage_transformer_kf10.yaml
hm_twostage_grasp_kf10.yaml
```
</details>

## Contributing

We appreciate all contributions to improve covid-emr-benchmarks. Pull requests and issues are welcome!

## Contributors

[Yinghao Zhu](https://github.com/yhzhu99), [Wenqing Wang](https://github.com/ericaaaaaaaa), [Junyi Gao](https://github.com/v1xerunt)

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@misc{https://doi.org/10.48550/arxiv.2209.07805,
  doi = {10.48550/ARXIV.2209.07805},
  url = {https://arxiv.org/abs/2209.07805},
  author = {Gao, Junyi and Zhu, Yinghao and Wang, Wenqing and Wang, Yasha and Tang, Wen and Ma, Liantao},
  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care: Choosing the Best Model for COVID-19 Prognosis},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

## License

This project is released under the [GPL-2.0 license](LICENSE).