# Automatic Assignment of ICD codes

## Introduction

This repo contains code for assigning ICD codes to medical/clinical text. The data used is the MIMIC-III dataset. Several models have been tried, ranging from linear machine learning models to the state-of-the-art pretrained NLP model BERT.

## Structure of the project

At the root of the project, you will have:

- `main.py`: used for training and testing the different models
- `requirements.txt`: lists the minimum dependencies for running the project
- `w2vmodel.model`: gensim word2vec model trained on MIMIC-III discharge summaries
- `src`: a folder that contains:
  - `bert`: utilities and files for the pretrained BERT model
  - `cnn`: utilities and files for the CNN model
  - `hybrid`: utilities and files for the hybrid (LSTM+CNN) model
  - `rnn`: utilities and files for the LSTM and GRU models
  - `ovr`: utilities and files for the classical machine learning models (such as LR, SVM, Naive Bayes)
  - `fit.py`: training code for both the LSTM and CNN models
  - `test_results.py`: inference code for trained models, used for both the LSTM and CNN models
  - `utils.py`: general utility code used by all the models

## Dependencies

The dependencies are listed in the `requirements.txt` file.
They can be installed with:
```bash
pip install -r requirements.txt
```

## How to use the code

Launch `main.py` with the following arguments:

- `train_path`: path to the training data.
- `test_path`: path to the test data.
- `model_name`: one of the 6 implemented models: ['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']. Defaults to 'bert'.
- `icd_type`: type of ICD label to train on, one of ['icd9cat', 'icd9code', 'icd10cat', 'icd10code']. Defaults to 'icd9cat'.
- `epochs`: number of training epochs.
- `batch_size`: batch size. Defaults to 16 (for the BERT model).
- `val_split`: fraction of the training data held out for validation. Defaults to 2/7 (train:val:test = 5:2:3).
- `learning_rate`: learning rate. Defaults to 2e-5 (for the BERT model).
- `w2vmodel`: path to the pretrained gensim word2vec model.
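
For orientation, the arguments above could be declared with `argparse` roughly as follows. This is only a sketch built from the list above, not the actual contents of `main.py`: the function names `build_parser` and `overall_proportions` are hypothetical, `epochs` is left without a default because none is documented, and the 30% test share is an assumption read off the 5:2:3 ratio.

```python
import argparse

MODEL_NAMES = ["bert", "hybrid", "lstm", "gru", "cnn", "ovr"]
ICD_TYPES = ["icd9cat", "icd9code", "icd10cat", "icd10code"]


def build_parser():
    """Hypothetical parser mirroring the documented arguments (not the real main.py)."""
    parser = argparse.ArgumentParser(
        description="Train and test ICD code assignment models"
    )
    parser.add_argument("--train_path", required=True, help="path to the training data")
    parser.add_argument("--test_path", required=True, help="path to the test data")
    parser.add_argument("--model_name", choices=MODEL_NAMES, default="bert")
    parser.add_argument("--icd_type", choices=ICD_TYPES, default="icd9cat")
    # No default for epochs is documented, so none is assumed here.
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--val_split", type=float, default=2 / 7)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--w2vmodel", default=None,
                        help="path to the pretrained gensim word2vec model")
    return parser


def overall_proportions(val_split, test_fraction=0.3):
    """Return (train, val, test) fractions of the whole dataset.

    Assumes the test set holds `test_fraction` of all data and that
    `val_split` is taken from the remaining (training) portion.
    """
    train_val = 1.0 - test_fraction
    val = train_val * val_split
    return train_val - val, val, test_fraction


if __name__ == "__main__":
    # Same arguments as a typical command-line invocation with the CNN model.
    args = build_parser().parse_args(
        ["--train_path", "train.csv", "--test_path", "test.csv", "--model_name", "cnn"]
    )
    print(args.model_name, args.icd_type, args.batch_size)
    train, val, test = overall_proportions(args.val_split)
    print(round(train, 2), round(val, 2), round(test, 2))
```

With the default `val_split` of 2/7, the helper shows how holding out 2/7 of the training portion leads to the overall 5:2:3 ratio, provided the test set is 30% of the data.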

***Example***
```bash
python main.py --train_path train.csv --test_path test.csv --model_name cnn
```

## Data

The data used for training can be downloaded from:
- train data
- test data