This repo contains code for assigning ICD codes to medical/clinical text. The data used is the MIMIC-III dataset. Several models have been tried, from linear machine learning models to the state-of-the-art pretrained NLP model BERT.
At the root of the project, you will have:
The dependencies are listed in the `requirements.txt` file.
They can be installed with:
```bash
pip install -r requirements.txt
```
## How to use the code
Launch train.py with the following arguments:
- `train_path`: path to the training data.
- `test_path`: path to the test data.
- `model_name`: one of the implemented models: ['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']. Defaults to 'bert'.
- `icd_type`: type of ICD labels to train on, one of ['icd9cat', 'icd9code', 'icd10cat', 'icd10code']. Defaults to 'icd9cat'.
- `epochs`: number of training epochs.
- `batch_size`: batch size. Defaults to 16 (for the BERT model).
- `val_split`: fraction of the training data held out for validation. Defaults to 2/7 (train:val:test = 5:2:3).
- `learning_rate`: learning rate. Defaults to 2e-5 (for the BERT model).
- `w2vmodel`: path to a pretrained gensim word2vec model.
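A minimal sketch of how the arguments above could be declared with `argparse`. This is an assumption about the interface, not the repo's actual parser: `build_parser` is a hypothetical name, and since no default is documented for `epochs` or `w2vmodel`, both are left as `None` here.

```python
import argparse

# Choices and defaults are taken from the argument list above.
MODELS = ["bert", "hybrid", "lstm", "gru", "cnn", "ovr"]
ICD_TYPES = ["icd9cat", "icd9code", "icd10cat", "icd10code"]


def build_parser():
    """Build a command-line parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Train an ICD coding model.")
    parser.add_argument("--train_path", required=True, help="path to the training data")
    parser.add_argument("--test_path", required=True, help="path to the test data")
    parser.add_argument("--model_name", choices=MODELS, default="bert")
    parser.add_argument("--icd_type", choices=ICD_TYPES, default="icd9cat")
    # No default is documented for epochs, so None is a placeholder.
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--val_split", type=float, default=2 / 7)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    # No default is documented for the word2vec path either.
    parser.add_argument("--w2vmodel", default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--train_path", "train.csv", "--test_path", "test.csv", "--model_name", "cnn"]
)
print(args.model_name, args.icd_type, args.batch_size)
```

Restricting `model_name` and `icd_type` with `choices` makes the parser reject unsupported values before any training starts.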
***Example***
```bash
python main.py --train_path train.csv --test_path test.csv --model_name cnn
```
The data used for training can be downloaded from:
- train data
- test data
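As a rough illustration of the default `val_split`, the sketch below holds out 2/7 of the training rows for validation. The function name `split_train_val`, the shuffling, and the row counts are assumptions for illustration, not the repo's actual code or the real dataset sizes.

```python
import random


def split_train_val(rows, val_split=2 / 7, seed=0):
    """Shuffle rows and hold out a val_split fraction for validation (hypothetical helper)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = round(len(rows) * val_split)
    # The first n_val shuffled rows become validation, the rest stay in training.
    return rows[n_val:], rows[:n_val]


# Hypothetical sizes: 700 rows in the training file and 300 in the test file (7:3).
n_test = 300
train_rows, val_rows = split_train_val(range(700))
print(len(train_rows), len(val_rows), n_test)  # 500 200 300
```

With these sizes the three parts come out at 5:2:3, which is why the default fraction is 2/7 of the training file rather than 2/10 of all the data.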