# MIMIC coding
This notebook walks through training and testing an ICD-9 coding classifier built on and for the MIMIC-III dataset. It uses the `mimic_icd9_coding` module. The example trains a neural network classifier, but any other classifier may be used instead.

Set the root directory for EHRKit below, and optionally the data directories.

In [1]:
ROOT_EHR_DIR = '<EHRKit Path>' # set your root EHRKit directory here (with the '/' at the end)
import sys
import os
sys.path.append(os.path.dirname(ROOT_EHR_DIR))
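
The trailing `/` matters because `os.path.dirname` drops everything after the last separator, and the data paths further down are built by plain string concatenation. A small illustration, using a made-up install location (`/home/user/EHRKit` is hypothetical):

```python
import os

# Hypothetical install location, used only for illustration
with_slash = '/home/user/EHRKit/'
without_slash = '/home/user/EHRKit'

# With the trailing '/', dirname returns the EHRKit directory itself
print(os.path.dirname(with_slash))     # /home/user/EHRKit
# Without it, dirname returns the parent directory instead
print(os.path.dirname(without_slash))  # /home/user
```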

In [2]:
# Set your MIMIC paths here.
# Put all of the individual MIMIC CSV files in MIMIC_PATH (with the '/' at the end). The file names should be all caps, such as NOTEEVENTS.csv.
# Keep OUTPUT_DATA_PATH empty; the processed data will be written there.
OUTPUT_DATA_PATH = ROOT_EHR_DIR + 'data/output_data/'
MIMIC_PATH = ROOT_EHR_DIR + 'data/mimic_data/'
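
Before running the preparation step, it can help to confirm that the expected CSV files are actually in place. The sketch below is one possible check, not part of `mimic_icd9_coding`: the helper name `missing_mimic_files` is hypothetical, and the full list of files the preparation step reads is not given here, so only `NOTEEVENTS.csv` (the one file named above) is checked by default.

```python
import os
import tempfile

def missing_mimic_files(mimic_path, required=("NOTEEVENTS.csv",)):
    """Return the names in `required` that are not present in `mimic_path`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(mimic_path, name))]

# Demonstration on a temporary directory standing in for MIMIC_PATH
with tempfile.TemporaryDirectory() as tmp:
    print(missing_mimic_files(tmp))  # ['NOTEEVENTS.csv']
    open(os.path.join(tmp, "NOTEEVENTS.csv"), "w").close()
    print(missing_mimic_files(tmp))  # []
```

In the notebook, the same helper could be called as `missing_mimic_files(MIMIC_PATH)`, extending `required` with whatever other tables your setup needs.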

In [3]:
from mimic_icd9_coding.coding_pipeline import codingPipeline
from mimic_icd9_coding.utils.mimic_data_preparation import run_mimic_prep


In [4]:
# Run the mimic label preparation, examine the top n most common labels, and create a dataset of labels and text
run_mimic_prep(output_folder=OUTPUT_DATA_PATH, mimic_data_path=MIMIC_PATH)


In [5]:
# Create and train the MIMIC preprocessing and model pipeline. The default model is a random forest,
# but any model can be passed in (scikit-learn or any other model that fits and predicts in the same manner).
# Here we pass in a neural network (MLPClassifier) instead.
# Note that the resulting classification report lists the metrics for each of the 138 ICD-9 codes that this model covers.
from sklearn.neural_network import MLPClassifier

print("Building basic tfidf pipeline")
# max_iter=100 gives better results; for a quick first run, max_iter=10 is enough
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=100, verbose=True)
my_mimic_pipeline = codingPipeline(verbose=True, model=clf, data_path=OUTPUT_DATA_PATH)
print("Pipeline complete")



Building basic tfidf pipeline
Iteration 1, loss = 39.26339982
Iteration 2, loss = 27.15691598
Iteration 3, loss = 25.43022223
Iteration 4, loss = 24.39200925
Iteration 5, loss = 23.53421875
Iteration 6, loss = 22.71761089
Iteration 7, loss = 21.98377343
Iteration 8, loss = 21.40524055
Iteration 9, loss = 20.91691075
Iteration 10, loss = 20.46976534
Iteration 11, loss = 20.05602722
Iteration 12, loss = 19.67969932
Iteration 13, loss = 19.32472493
Iteration 14, loss = 18.98996470
Iteration 15, loss = 18.69047703
Iteration 16, loss = 18.41708954
Iteration 17, loss = 18.16937539
Iteration 18, loss = 17.94176193
Iteration 19, loss = 17.72638875
Iteration 20, loss = 17.52546337
Iteration 21, loss = 17.33147929
Iteration 22, loss = 17.14397179
Iteration 23, loss = 16.97036583
Iteration 24, loss = 16.80744903
Iteration 25, loss = 16.65087996
Iteration 26, loss = 16.50350788
Iteration 27, loss = 16.36437565
Iteration 28, loss = 16.22898538
Iteration 29, loss = 16.10017515
Iteration 30, loss = 1
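
To make "fits and predicts in the same manner" concrete, here is a rough sketch of the scikit-learn style contract: a `fit(X, y)` method that returns the object and a `predict(X)` method that returns one prediction per input. The class `MostCommonCodesClassifier` is hypothetical and exists only for illustration. The exact input formats that `codingPipeline` passes to its model are not shown in this notebook, so this class is not guaranteed to plug in as-is, and computing an AUROC may require further methods such as `predict_proba`.

```python
from collections import Counter

class MostCommonCodesClassifier:
    """Hypothetical baseline: always predicts the k most frequent codes seen in training."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # y is assumed here to be a list of code lists, one per note
        counts = Counter(code for codes in y for code in codes)
        self.top_codes_ = [code for code, _ in counts.most_common(self.k)]
        return self

    def predict(self, X):
        # One prediction per input, ignoring the input content entirely
        return [list(self.top_codes_) for _ in X]

# Toy data, made up for illustration
notes = ["note a", "note b", "note c"]
labels = [["424", "441"], ["424", "785"], ["424", "441"]]
baseline = MostCommonCodesClassifier(k=2).fit(notes, labels)
print(baseline.predict(["a new note"]))  # [['424', '441']]
```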

In [6]:
# Check the AUROC of the trained model
auroc = my_mimic_pipeline.auroc
print("Auroc is {:.2f}".format(auroc))

Auroc is 0.69
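
For intuition about what this number measures: for a single code, the AUROC is the probability that a randomly chosen note with the code receives a higher score than a randomly chosen note without it, so 0.5 is chance level and 1.0 is a perfect ranking. The sketch below computes it for one binary label in plain Python. How the pipeline aggregates across the 138 codes is not stated in this notebook, and the labels and scores here are made up.

```python
def auroc(y_true, y_score):
    """AUROC for binary labels: chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    # Count positive/negative pairs that are ranked correctly; ties count as half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example for a single code (made-up labels and scores)
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3]
print("{:.2f}".format(auroc(y_true, y_score)))  # 0.75
```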


## Let's test out the model on a specific note!

In [7]:
# Load the data into the pipeline. load_data() simply stores the data on the pipeline object;
# this is not done automatically because keeping the data around increases memory use.
my_mimic_pipeline.load_data()
df = my_mimic_pipeline.data

In [8]:
# Run the model on a single note; at least for this example, the predicted codes match the true codes
pred = my_mimic_pipeline.predict(df['TEXT'].iloc[10])
true = df['TARGET'].iloc[10]
print("Predicted ICD9 codes: {}".format(pred))
print("True ICD9 codes: {}".format(true))

Predicted ICD9 codes: [('424', '441', '785')]
True ICD9 codes: ['424', '441', '785']
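
Note that the prediction above comes back as a list holding one tuple, while the true codes are a flat list, so a direct `==` comparison would fail even though the codes agree. A small sketch of how the two could be compared for one note (the helper `code_overlap` is hypothetical, not part of the module):

```python
def code_overlap(pred, true):
    """Precision and recall of predicted codes against true codes for one note."""
    # The prediction shown above is a list holding one tuple of codes, so unwrap it
    if pred and isinstance(pred[0], (tuple, list)):
        pred_codes = set(pred[0])
    else:
        pred_codes = set(pred)
    true_codes = set(true)
    hits = pred_codes & true_codes
    precision = len(hits) / len(pred_codes) if pred_codes else 0.0
    recall = len(hits) / len(true_codes) if true_codes else 0.0
    return precision, recall

# Values copied from the output above
pred = [('424', '441', '785')]
true = ['424', '441', '785']
print(code_overlap(pred, true))  # (1.0, 1.0)
```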
