# Automatic Assignment of ICD codes

## Introduction
This repo contains code for the automatic assignment of ICD codes to medical/clinical text. The data used is the MIMIC-III dataset. Several models have been tried, ranging from linear machine learning models to the state-of-the-art pretrained NLP model BERT.

## Structure of the project

At the root of the project, you will find:

- **main.py**: used for training and testing the different models
- **requirements.txt**: contains the minimum dependencies for running the project
- **w2vmodel.model**: gensim word2vec model trained on MIMIC-III discharge summaries
- **src**: a folder that contains:
  - **bert**: utilities and files for the pretrained BERT model
  - **cnn**: utilities and files for the CNN model
  - **hybrid**: utilities and files for the hybrid (LSTM+CNN) model
  - **rnn**: utilities and files for the LSTM and GRU models
  - **ovr**: utilities and files for different machine learning models (e.g. LR, SVM, Naive Bayes)
  - **fit.py**: training code shared by the LSTM and CNN models
  - **test_results.py**: inference code for trained models, used by both the LSTM and CNN models
  - **utils.py**: general utility code used by all the models

## Dependencies
The dependencies are listed in the `requirements.txt` file.
They can be installed with:
```bash
pip install -r requirements.txt
```
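
For orientation, a requirements file for this kind of project typically has the shape below. This is purely illustrative — the library choices and version pins are assumptions, and the repo's own `requirements.txt` is the authoritative list:

```
# illustrative only; see the repo's requirements.txt for the real pins
torch
transformers
gensim
scikit-learn
pandas
```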

## How to use the code

Launch `main.py` with the following arguments:

- `train_path`: path of the training data.
- `test_path`: path of the test data.
- `model_name`: one of the 6 models implemented: `['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']`. Defaults to `'bert'`.
- `icd_type`: the type of ICD labels to train on, one of `['icd9cat', 'icd9code', 'icd10cat', 'icd10code']`. Defaults to `'icd9cat'`.
- `epochs`: number of epochs.
- `batch_size`: batch size. Defaults to 16 (for the BERT model).
- `val_split`: validation split of the training data. Defaults to 2/7 (train:val:test = 5:2:3).
- `learning_rate`: defaults to 2e-5 (for the BERT model).
- `w2vmodel`: path of the pretrained gensim word2vec model.
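
The default `val_split` of 2/7 follows from the overall 5:2:3 ratio: the training file holds the train + val portions (7 parts out of 10), and 2 of those 7 parts are held out for validation. A quick arithmetic check of that reading (plain Python, illustrative only):

```python
from fractions import Fraction

# The training file holds the train + val portions: 7 parts out of 10 overall.
train_plus_val = Fraction(7, 10)
test = 1 - train_plus_val              # 3/10 of the full data

# Holding out 2/7 of the training file for validation...
val_split = Fraction(2, 7)
val = train_plus_val * val_split       # 7/10 * 2/7 = 2/10
train = train_plus_val - val           # 5/10

# ...recovers the stated train:val:test = 5:2:3 ratio.
assert (train, val, test) == (Fraction(1, 2), Fraction(1, 5), Fraction(3, 10))
```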

***Example***
```bash
python main.py --train_path train.csv --test_path test.csv --model_name cnn
```
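
The flags and defaults documented above can be mirrored with a minimal `argparse` sketch — a hypothetical parser, not the project's actual code (in particular, the `epochs` default of 3 is an assumption; the others follow the list above):

```python
import argparse

# Hypothetical parser mirroring the documented CLI flags.
parser = argparse.ArgumentParser(description="Train an ICD-code assignment model")
parser.add_argument("--train_path", required=True)
parser.add_argument("--test_path", required=True)
parser.add_argument("--model_name", default="bert",
                    choices=["bert", "hybrid", "lstm", "gru", "cnn", "ovr"])
parser.add_argument("--icd_type", default="icd9cat",
                    choices=["icd9cat", "icd9code", "icd10cat", "icd10code"])
parser.add_argument("--epochs", type=int, default=3)  # assumed default
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--val_split", type=float, default=2 / 7)
parser.add_argument("--learning_rate", type=float, default=2e-5)
parser.add_argument("--w2vmodel", default="w2vmodel.model")

# Parsing the example invocation above: unspecified flags fall back to defaults.
args = parser.parse_args(
    ["--train_path", "train.csv", "--test_path", "test.csv", "--model_name", "cnn"]
)
print(args.model_name, args.batch_size)  # cnn 16
```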

## Data
The data used for training can be downloaded from:
- [train data](https://drive.google.com/file/d/1--ZVpt614neHN9erxmsg6s6aGInThJ22/view?usp=sharing)
- [test data](https://drive.google.com/file/d/1-4tp0og0I7KyNMoqF2_t1smu0_GqQCVf/view?usp=sharing)