# COVID-19 EHR Benchmarks

> A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records

TJH datasets and presentation slides are available in GitHub releases.

**This repo is not active. Please check our latest repo https://github.com/yhzhu99/pyehr**

## Prediction Tasks

- [x] (Early) Mortality outcome prediction
- [x] Length-of-stay prediction
- [x] Multi-task/Two-stage prediction

## Model Zoo

### Machine Learning Models

- [x] Random forest (RF)
- [x] Decision tree (DT)
- [x] Gradient boosting decision tree (GBDT)
- [x] XGBoost
- [x] CatBoost

### Deep Learning Models

- [x] Multi-layer perceptron (MLP)
- [x] Recurrent neural network (RNN)
- [x] Long short-term memory network (LSTM)
- [x] Gated recurrent units (GRU)
- [x] Temporal convolutional networks (TCN)
- [x] Transformer

### EHR Predictive Models

- [x] RETAIN
- [x] StageNet
- [x] Dr. Agent
- [x] AdaCare
- [x] ConCare
- [x] GRASP

## Code Description

```shell
app/
    apis/
        ml_{task}.py        # machine learning pipelines
        dl_{task}.py        # deep learning pipelines
    core/
        evaluation/         # evaluation metrics
        utils/
    datasets/               # dataset loader scripts
    models/
        backbones/          # feature extractors
        classifiers/        # prediction heads
        losses/             # task-related loss functions
        build_model.py      # concat backbones and heads
configs/
    _base_/                 # common configs
        datasets/           # dataset basic info, training epochs and dataset split strategy
            {dataset}.yaml
        db.yaml             # database settings (optional)
    {config_name}.yaml      # detailed model settings
checkpoints/                # model checkpoints are stored here
datasets/                   # raw/processed dataset and pre-process script
main.py                     # main entry point
requirements.txt            # code dependencies
```
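The "concat backbones and heads" step in `build_model.py` pairs a feature-extractor backbone with a prediction head. A minimal PyTorch sketch of that pattern (the class names here are illustrative, not the repo's actual modules):

```python
import torch
import torch.nn as nn

class GRUBackbone(nn.Module):
    """Feature extractor: encodes (N, T, D) sequences into hidden states."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        out, _ = self.gru(x)          # (N, T, hidden_dim)
        return out

class OutcomeHead(nn.Module):
    """Prediction head: maps hidden states to a per-step mortality risk."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # (N, T)

def build_model(input_dim, hidden_dim):
    # Chain the extractor and the head into one model.
    return nn.Sequential(GRUBackbone(input_dim, hidden_dim),
                         OutcomeHead(hidden_dim))

model = build_model(input_dim=10, hidden_dim=64)
risk = model(torch.randn(4, 8, 10))  # 4 patients, 8 time steps, 10 features
```

Swapping the backbone (LSTM, TCN, Transformer, …) or the head (outcome, length-of-stay) changes the experiment without touching the training loop.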

## Requirements

- Python 3.7+
- PyTorch 1.10+
- CUDA 10.2+ (if you plan to use GPU)

Note:

- Most models can be run quickly on CPU.
- A GPU with at least 12 GB of memory is required to run the ConCare model on the CDSL dataset.
- The TCN model may run much faster on CPU.

## Usage

- Install requirements.

```bash
pip install -r requirements.txt [-i https://pypi.tuna.tsinghua.edu.cn/simple] # [xxx] is optional
```

- Download the TJH dataset from [An interpretable mortality prediction model for COVID-19 patients](https://www.nature.com/articles/s42256-020-0180-7), unzip it, and put it in the `datasets/tongji/raw_data/` folder.
- Run the preprocessing notebook. (You can skip this step if preprocessing has already been done as part of the training process below.)
- The CDSL dataset follows the same process; access must be applied for first: [Covid Data Save Lives Dataset](https://www.hmhospitales.com/coronavirus/covid-data-save-lives/english-version)
- Run the following commands to train models.

```bash
python main.py --cfg configs/xxx.yaml [--train] [--cuda CUDA_NUM] [--db]
# Note:
# 1) pass --train for training; without it, only the inference stage runs
# 2) if you plan to use CUDA, pass --cuda 0/1/2/...
# 3) if you have configured database settings, pass --db to upload performance to the database after training
```
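The flag handling above might be wired up roughly like this (a sketch with `argparse`, not the repo's actual `main.py`):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train/evaluate a benchmark model")
    parser.add_argument("--cfg", required=True,
                        help="path to the model config YAML")
    parser.add_argument("--train", action="store_true",
                        help="run training; omit to run inference only")
    parser.add_argument("--cuda", type=int, default=None,
                        help="CUDA device index; omit to run on CPU")
    parser.add_argument("--db", action="store_true",
                        help="upload performance to the configured database")
    return parser.parse_args(argv)

# Example: train on GPU 0 with a hypothetical config name.
args = parse_args(["--cfg", "configs/tj_outcome_gru_ep100_kf10_bs64_hid64.yaml",
                   "--train", "--cuda", "0"])
```

With `--train` and `--db` as `store_true` flags, simply omitting them selects inference-only mode and skips the database upload.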

## Data Format

The shape and meaning of the tensors fed to the models are as follows:

- `x.pkl`: (N, T, D) tensor, where N is the number of patients, T is the number of time steps, and D is the number of features. Along the $D$ dimension, the first $x$ features are demographic features and the next $y$ features are lab test features, where $x + y = D$.
- `y.pkl`: (N, T, 2) tensor, where the 2 values are [outcome, length-of-stay] for each time step.
- `visits_length.pkl`: (N, ) tensor, where the value is the number of visits for each patient.
- `missing_mask.pkl`: same shape as `x.pkl`; indicates whether each feature value is observed or imputed. `1`: observed, `0`: missing.

Pre-processed data are stored in the `datasets/{dataset}/processed_data/` folder.
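The tensor layout above can be sketched with toy data (sizes and the demographic/lab split here are illustrative):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy sizes: N=4 patients, T=8 time steps, D=10 features.
N, T, D = 4, 8, 10
x = np.random.randn(N, T, D)
y = np.zeros((N, T, 2))             # [outcome, length-of-stay] per time step
visits_length = np.full((N,), T)    # number of visits per patient
missing_mask = np.ones((N, T, D))   # 1: observed, 0: imputed

tensors = {"x": x, "y": y,
           "visits_length": visits_length, "missing_mask": missing_mask}
with tempfile.TemporaryDirectory() as processed_dir:
    # Mirror the datasets/{dataset}/processed_data/ layout.
    for name, arr in tensors.items():
        with open(Path(processed_dir) / f"{name}.pkl", "wb") as f:
            pickle.dump(arr, f)
    with open(Path(processed_dir) / "x.pkl", "rb") as f:
        x_loaded = pickle.load(f)

assert x_loaded.shape == (N, T, D)
assert missing_mask.shape == x.shape  # the mask mirrors x exactly
```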

## Database Preparation [Optional]

Example `db.yaml` settings; put the file in `configs/_base_/db.yaml`.

```yaml
engine: postgresql # or mysql
username: db_user
password: db_password
host: xx.xxx.com
port: 5432
database: db_name
```
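These fields typically get assembled into a database connection URL. A sketch, with a dict literal standing in for the parsed `db.yaml`:

```python
# Stand-in for the parsed configs/_base_/db.yaml (values from the example above).
db = {
    "engine": "postgresql",
    "username": "db_user",
    "password": "db_password",
    "host": "xx.xxx.com",
    "port": 5432,
    "database": "db_name",
}

# Standard URL shape accepted by most DB clients (e.g. SQLAlchemy).
url = "{engine}://{username}:{password}@{host}:{port}/{database}".format(**db)
```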

Create the `perflog` table in your database:

```sql
-- PostgreSQL example
create table perflog
(
    id serial
        constraint perflog_pk
            primary key,
    record_time integer,
    model_name text,
    performance text,
    hidden_dim integer,
    dataset text,
    model_type text,
    config text,
    task text
);

-- MySQL example
create table perflog
(
    id int auto_increment,
    record_time int null,
    model_name text null,
    task text null,
    performance text null,
    hidden_dim int null,
    dataset text null,
    model_type text null,
    config text null,
    constraint perflog_id_uindex
        unique (id)
);

alter table perflog
    add primary key (id);
```
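To exercise the schema without a running server, the same table can be tried in SQLite (illustration only; the repo targets PostgreSQL/MySQL, and the row values here are made up):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table perflog (
        id integer primary key autoincrement,
        record_time integer,
        model_name text,
        performance text,
        hidden_dim integer,
        dataset text,
        model_type text,
        config text,
        task text
    )
""")

# Performance metrics are stored as text, so serialize them (e.g. as JSON).
conn.execute(
    "insert into perflog (record_time, model_name, performance, hidden_dim,"
    " dataset, model_type, config, task) values (?, ?, ?, ?, ?, ?, ?, ?)",
    (int(time.time()), "gru", json.dumps({"auroc": 0.91}), 64,
     "tj", "dl", "tj_outcome_gru_ep100_kf10_bs64_hid64", "outcome"),
)
row = conn.execute("select model_name, task from perflog").fetchone()
```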

## Configs

Below are the configurations selected after hyperparameter tuning.

<details>
<summary>ML models</summary>

```bash
hm_los_catboost_kf10_md6_iter150_lr0.1_test
hm_los_decision_tree_kf10_md10_test
hm_los_gbdt_kf10_lr0.1_ss0.8_ne100_test
hm_los_random_forest_kf10_md10_mss2_ne100_test
hm_los_xgboost_kf10_lr0.01_md5_cw3_test
hm_outcome_catboost_kf10_md3_iter150_lr0.1_test
hm_outcome_decision_tree_kf10_md10_test
hm_outcome_gbdt_kf10_lr0.1_ss0.6_ne100_test
hm_outcome_random_forest_kf10_md20_mss10_ne100_test
hm_outcome_xgboost_kf10_lr0.1_md7_cw3_test
tj_los_catboost_kf10_md3_iter150_lr0.1_test
tj_los_decision_tree_kf10_md10_test
tj_los_gbdt_kf10_lr0.1_ss0.8_ne100_test
tj_los_random_forest_kf10_md20_mss5_ne100_test
tj_los_xgboost_kf10_lr0.01_md5_cw1_test
tj_outcome_catboost_kf10_md3_iter150_lr0.1_test
tj_outcome_decision_tree_kf10_md10_test
tj_outcome_gbdt_kf10_lr0.1_ss0.6_ne100_test
tj_outcome_random_forest_kf10_md20_mss2_ne10_test
tj_outcome_xgboost_kf10_lr0.1_md5_cw5_test
```

</details>

<details>
<summary>DL/EHR models</summary>

```bash
tj_outcome_grasp_ep100_kf10_bs64_hid64
tj_los_grasp_ep100_kf10_bs64_hid128
tj_outcome_concare_ep100_kf10_bs64_hid128
tj_los_concare_ep100_kf10_bs64_hid128
tj_outcome_agent_ep100_kf10_bs64_hid128
tj_los_agent_ep100_kf10_bs64_hid64
tj_outcome_adacare_ep100_kf10_bs64_hid64
tj_los_adacare_ep100_kf10_bs64_hid64
tj_outcome_transformer_ep100_kf10_bs64_hid128
tj_los_transformer_ep100_kf10_bs64_hid64
tj_outcome_tcn_ep100_kf10_bs64_hid128
tj_los_tcn_ep100_kf10_bs64_hid128
tj_outcome_stagenet_ep100_kf10_bs64_hid64
tj_los_stagenet_ep100_kf10_bs64_hid64
tj_outcome_rnn_ep100_kf10_bs64_hid64
tj_los_rnn_ep100_kf10_bs64_hid128
tj_outcome_retain_ep100_kf10_bs64_hid128
tj_los_retain_ep100_kf10_bs64_hid128
tj_outcome_mlp_ep100_kf10_bs64_hid64
tj_los_mlp_ep100_kf10_bs64_hid128
tj_outcome_lstm_ep100_kf10_bs64_hid64
tj_los_lstm_ep100_kf10_bs64_hid128
tj_outcome_gru_ep100_kf10_bs64_hid64
tj_los_gru_ep100_kf10_bs64_hid128
tj_multitask_rnn_ep100_kf10_bs64_hid64
tj_multitask_lstm_ep100_kf10_bs64_hid128
tj_multitask_gru_ep100_kf10_bs64_hid128
tj_multitask_transformer_ep100_kf10_bs64_hid128
tj_multitask_tcn_ep100_kf10_bs64_hid64
tj_multitask_mlp_ep100_kf10_bs64_hid128
tj_multitask_adacare_ep100_kf10_bs64_hid128
tj_multitask_agent_ep100_kf10_bs64_hid64
tj_multitask_concare_ep100_kf10_bs64_hid128
tj_multitask_stagenet_ep100_kf10_bs64_hid64
tj_multitask_grasp_ep100_kf10_bs64_hid128
tj_multitask_retain_ep100_kf10_bs64_hid64
hm_outcome_mlp_ep100_kf10_bs64_hid64
hm_los_mlp_ep100_kf10_bs64_hid128
hm_outcome_lstm_ep100_kf10_bs64_hid64
hm_los_lstm_ep100_kf10_bs64_hid128
hm_outcome_gru_ep100_kf10_bs64_hid64
hm_los_gru_ep100_kf10_bs64_hid128
hm_outcome_grasp_ep100_kf10_bs64_hid64
hm_los_grasp_ep100_kf10_bs64_hid64
hm_outcome_concare_ep100_kf10_bs64_hid128
hm_los_concare_ep100_kf10_bs64_hid64
hm_outcome_agent_ep100_kf10_bs64_hid128
hm_los_agent_ep100_kf10_bs64_hid64
hm_outcome_adacare_ep100_kf10_bs64_hid64
hm_los_adacare_ep100_kf10_bs64_hid128
hm_outcome_transformer_ep100_kf10_bs64_hid128
hm_los_transformer_ep100_kf10_bs64_hid128
hm_outcome_tcn_ep100_kf10_bs64_hid64
hm_los_tcn_ep100_kf10_bs64_hid128
hm_outcome_stagenet_ep100_kf10_bs64_hid64
hm_los_stagenet_ep100_kf10_bs64_hid64
hm_outcome_rnn_ep100_kf10_bs64_hid64
hm_los_rnn_ep100_kf10_bs64_hid128
hm_outcome_retain_ep100_kf10_bs64_hid128
hm_los_retain_ep100_kf10_bs64_hid128
hm_multitask_rnn_ep100_kf10_bs512_hid128
hm_multitask_lstm_ep100_kf10_bs512_hid64
hm_multitask_gru_ep100_kf10_bs512_hid128
hm_multitask_transformer_ep100_kf10_bs512_hid64
hm_multitask_tcn_ep100_kf10_bs512_hid64
hm_multitask_mlp_ep100_kf10_bs512_hid128
hm_multitask_adacare_ep100_kf10_bs512_hid128
hm_multitask_agent_ep100_kf10_bs512_hid128
hm_multitask_concare_ep100_kf10_bs64_hid128
hm_multitask_stagenet_ep100_kf10_bs512_hid128
hm_multitask_grasp_ep100_kf10_bs512_hid64
hm_multitask_retain_ep100_kf10_bs512_hid128
```

</details>
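The DL/EHR config names encode their hyperparameters as `{dataset}_{task}_{model}_ep{epochs}_kf{folds}_bs{batch}_hid{hidden}`. A small parser sketch (this naming convention is inferred from the list above, not documented by the repo):

```python
import re

# dataset: tj (TJH) or hm (CDSL); task: outcome, los, or multitask.
PATTERN = re.compile(
    r"(?P<dataset>tj|hm)_(?P<task>outcome|los|multitask)_(?P<model>\w+?)"
    r"_ep(?P<epochs>\d+)_kf(?P<folds>\d+)_bs(?P<batch>\d+)_hid(?P<hidden>\d+)"
)

def parse_config_name(name):
    """Split a config name into its hyperparameter fields."""
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized config name: {name}")
    fields = m.groupdict()
    for key in ("epochs", "folds", "batch", "hidden"):
        fields[key] = int(fields[key])
    return fields

cfg = parse_config_name("tj_outcome_gru_ep100_kf10_bs64_hid64")
```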

<details>
<summary>Two-stage configs</summary>

```bash
tj_twostage_adacare_kf10.yaml
tj_twostage_agent_kf10.yaml
tj_twostage_concare_kf10.yaml
tj_twostage_gru_kf10.yaml
tj_twostage_lstm_kf10.yaml
tj_twostage_mlp_kf10.yaml
tj_twostage_retain_kf10.yaml
tj_twostage_rnn_kf10.yaml
tj_twostage_stagenet_kf10.yaml
tj_twostage_tcn_kf10.yaml
tj_twostage_transformer_kf10.yaml
tj_twostage_grasp_kf10.yaml
hm_twostage_adacare_kf10.yaml
hm_twostage_agent_kf10.yaml
hm_twostage_concare_kf10.yaml
hm_twostage_gru_kf10.yaml
hm_twostage_lstm_kf10.yaml
hm_twostage_mlp_kf10.yaml
hm_twostage_retain_kf10.yaml
hm_twostage_rnn_kf10.yaml
hm_twostage_stagenet_kf10.yaml
hm_twostage_tcn_kf10.yaml
hm_twostage_transformer_kf10.yaml
hm_twostage_grasp_kf10.yaml
```

</details>

## Contributing

We appreciate all contributions to improve covid-emr-benchmarks. Pull Requests and Issues are welcome!

## Contributors

[Yinghao Zhu](https://github.com/yhzhu99), [Wenqing Wang](https://github.com/ericaaaaaaaa), [Junyi Gao](https://github.com/v1xerunt)

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@misc{https://doi.org/10.48550/arxiv.2209.07805,
  doi = {10.48550/ARXIV.2209.07805},
  url = {https://arxiv.org/abs/2209.07805},
  author = {Gao, Junyi and Zhu, Yinghao and Wang, Wenqing and Wang, Yasha and Tang, Wen and Ma, Liantao},
  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care: Choosing the Best Model for COVID-19 Prognosis},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

## License

This project is released under the [GPL-2.0 license](LICENSE).