--- a +++ b/README.md @@ -0,0 +1,53 @@ +# Automatic Assignment of ICD codes + +## Introduction +This repo contains codes for assignment of ICD codes to medical/clinical text. Data used here is the MIMICIII dataset. Different models have been tried from linear machine learning models to state of the art pretrained NLP model BERT. + +## Structure of the project + +At the root of the project, you will have: + +- **main.py**: used for training and testing different models +- **requirements.txt**: contains the minimum dependencies for running the project +- **w2vmodel.model**: gensim word2vec model trained on MIMICIII discharge summaries +- **src**: a folder that contains: + - **bert**: contains utilities and files for pretrained bert model + - **cnn**: contains utilities and files for CNN model + - **hybrid**: contains utilities and files for the hybrid model (LSTM+CNN) model + - **rnn**: contains utilities and files for LSTM and GRU models + - **ovr**: contains utilities and files for different Machine Learning Models (like LR, SVM, NaiveBayes) + - **fit.py**: training code for both LSTM and CNN models + - **test_results.py**: inferencing code for trained model used for both LSTM and CNN models + - **utils.py**: genearal utility codes used for all the models + +## Dependencies + The dependencies are mentioned in the `requirements.txt` file. + They can be installed by: + ```bash +pip install -r requirements.txt +``` + +## How to use the code + +Launch train.py with the following arguments: + +- `train_path`: path of the training data. +- `test_path`: path of the test data +- `model_name`: one of the 5 models implemented ['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']. Default to 'bert' +- `icd_type`: training on different types of icd labels, ['icd9cat', 'icd9code', 'icd10cat', 'icd10code']. Default to 'icd9cat' +- `epochs`: number of epochs +- `batch_size`: batch size, default to 16 (for bert model). +- `val_split`: validation split of the training data, default = 2/7 (train:val:split = 5:2:3) +- `learning_rate`: default to 2e-5 (for bert model) +- `w2vmodel`: path for pretrained gensim word2vec model. + +***Example*** +```bash +python main.py --train_path train.csv --test_path test.csv --model_name cnn +``` + +## Data +The data used for training can be downloaded from: +- [train data](https://drive.google.com/file/d/1--ZVpt614neHN9erxmsg6s6aGInThJ22/view?usp=sharing) +- [test data](https://drive.google.com/file/d/1-4tp0og0I7KyNMoqF2_t1smu0_GqQCVf/view?usp=sharing) +