## Part 2: Pre-processing EHR data
by: Sparkle Russell-Puleri and Dorian Puleri

#### Background: Detailed review of Doctor AI: Predicting Clinical Events via Recurrent Neural Networks (Choi et al. 2016)
The intent of this tutorial is to provide a detailed walkthrough of how EHR data should be pre-processed for use in RNNs with PyTorch. This paper is one of the few that provide a code base for taking a detailed look at how we can build generic temporal models that predict future clinical events. However, while this highly cited paper is open source (written using Theano: https://github.com/mp2893/doctorai), it assumes quite a bit about its readers. As such, we have modernized the code for ease of use in Python 3+ and provided a detailed explanation of each step, so that anyone with a computer and access to healthcare data can begin developing innovative solutions to healthcare challenges.

### Important Disclaimer: 
This data set was artificially created with two patients in Part 1 of this series to give readers a clear understanding of the structure of EHR data. Please note that each EHR system is designed to meet a specific provider's needs, and this is just a basic example of the data typically contained in most systems. It is also key to note that this tutorial begins after all of the desired inclusion and exclusion criteria related to your research question have been applied. Therefore, at this step your data should already be fully wrangled and cleaned.

In [2]:
import pandas as pd
import numpy as np
from time import time
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings
from datetime import datetime
import torch
import pickle
from collections import defaultdict
warnings.filterwarnings('ignore')
sns.set(style='white')
%autosave 180

Autosaving every 180 seconds


### Load data: A quick review of the artificial EHR data we created in Part 1


<img src="img/overview.png" style="height:500px">

### Step 1: Create mappings of patient IDs 
In this step we are going to create two dictionaries: one that maps each patient to the `Admission ID` of each of his or her visits, and one that maps each `Admission ID` to the date of that visit.

<img src="img/step1.png" style="height:500px">

In [6]:
print('Creating visit date mapping')
patHashMap = {}  # PatientID -> list of that patient's Admission IDs
visitMap = {}  # Admission ID -> list containing that visit's admission date

with open('data/Admissions_Table.csv', 'r') as data:
    data.readline()  # read and discard the header line
    for line in data:
        feature = line.strip().split(',')  # split the row on commas to isolate the columns
        visitDateID = datetime.strptime(feature[3], '%Y-%m-%d')  # parse the admission date
        patHashMap.setdefault(feature[1], []).append(feature[0])  # add this Admission ID to the patient's visits
        visitMap.setdefault(feature[0], []).append(visitDateID)  # record the admission date for this Admission ID

Creating visit date mapping


In [7]:
#Patient ID- visit mapping
patHashMap

{'A1234-B456': ['A1234-12', 'A1234-34', 'A1234-15'],
 'B1234-C456': ['B1234-13', 'B1234-34']}

In [8]:
# Patient Admission ID- visit date mapping
visitMap

{'A1234-12': [datetime.datetime(2019, 1, 3, 0, 0)],
 'A1234-34': [datetime.datetime(2019, 2, 3, 0, 0)],
 'A1234-15': [datetime.datetime(2019, 4, 3, 0, 0)],
 'B1234-13': [datetime.datetime(2018, 1, 3, 0, 0)],
 'B1234-34': [datetime.datetime(2018, 2, 3, 0, 0)]}
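The same two mappings can also be built with the standard `csv` module, which reads rows by column name instead of by position. The sketch below is a minimal, self-contained illustration: the column names (`Admission_ID`, `PatientID`, `AdmissionStartDate`) and the in-memory sample rows are assumptions made for the example, not the actual layout of `Admissions_Table.csv`.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample standing in for the admissions file; column names are assumed.
sample_csv = """Admission_ID,PatientID,AdmissionStartDate
A1234-12,A1234-B456,2019-01-03
A1234-34,A1234-B456,2019-02-03
B1234-13,B1234-C456,2018-01-03
"""

pat_to_visits = defaultdict(list)  # PatientID -> [Admission IDs]
visit_to_date = {}                 # Admission ID -> admission date

for row in csv.DictReader(io.StringIO(sample_csv)):
    pat_to_visits[row["PatientID"]].append(row["Admission_ID"])
    visit_to_date[row["Admission_ID"]] = datetime.strptime(
        row["AdmissionStartDate"], "%Y-%m-%d"
    )

print(dict(pat_to_visits))
```

Looking columns up by name makes the code less sensitive to a change in column order, at the cost of requiring a header row with known names.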

### Step 2: Create Diagnosis Code Mapped to each unique patient and visit
This step, like all subsequent steps, matters because each patient's diagnosis codes must stay attached to the correct visit and in the correct visit order.

<img src="img/step2.png" style="height:500px">

In [9]:
print('Creating Diagnosis-Visit mapping')
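visitDxMap = {}  # Admission ID -> list of diagnosis codes assigned during that visit

with open('data/Diagnosis_Table.csv', 'r') as data:
    data.readline()  # read and discard the header line
    for line in data:
        feature = line.strip().split(',')
        # keep the part of the code before the decimal point and prefix it with 'D_' to mark it as a diagnosis
        visitDxMap.setdefault(feature[0], []).append('D_' + feature[4].split('.')[0])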

Creating Diagnosis-Visit mapping


In [10]:
visitDxMap # Mapping of each Admission ID to each diagnosis code assigned during that visit

{'A1234-12': ['D_E11', 'D_I25', 'D_I25'],
 'A1234-34': ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784'],
 'A1234-15': ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789'],
 'B1234-13': ['D_M05', 'D_Z13', 'D_O99'],
 'B1234-34': ['D_M05', 'D_Z13', 'D_O99', 'D_D37']}
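To make the string handling in this step concrete, the sketch below applies the same transformation to a few raw codes. The helper name `normalize_dx` and the sample codes are hypothetical and only serve the illustration.

```python
def normalize_dx(raw_code):
    """Keep the part of the code before the decimal point and prefix it with 'D_'."""
    return 'D_' + raw_code.strip().split('.')[0]


# Hypothetical raw codes, with and without a decimal part.
raw_codes = ['E11.9', 'I25.10', '780.2', 'Z13']
normalized = [normalize_dx(code) for code in raw_codes]
print(normalized)
```

Note that dropping the decimal part reduces granularity: two codes that differ only after the decimal point, such as `E11.9` and `E11.65`, end up as the same token `D_E11`.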

### Step 3: Embed diagnosis codes into the Patient-Admission mapping
This step combines the mappings built so far: for every patient, each Admission ID is replaced by a tuple holding that visit's date (from `visitMap`) and the diagnosis codes assigned during it (from `visitDxMap`). The result is, for each patient, a list of visits in which every visit carries its own list of diagnosis codes.

<img src="img/step3.png" style="height:500px">

In [11]:
print("Sorting visit mapping")
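patDxVisitOrderMap = {}
for patid, visitDates in patHashMap.items():  # visitDates holds the patient's Admission IDs
    # pair each visit's date with its diagnosis codes, then order the visits chronologically
    sorted_list = sorted([(visitMap[visitDateID], visitDxMap[visitDateID]) for visitDateID in visitDates],
                         key=lambda visit: visit[0])
    patDxVisitOrderMap[patid] = sorted_list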

Sorting visit mapping


In [12]:
patDxVisitOrderMap

{'A1234-B456': [([datetime.datetime(2019, 1, 3, 0, 0)],
   ['D_E11', 'D_I25', 'D_I25']),
  ([datetime.datetime(2019, 2, 3, 0, 0)],
   ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784']),
  ([datetime.datetime(2019, 4, 3, 0, 0)],
   ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789'])],
 'B1234-C456': [([datetime.datetime(2018, 1, 3, 0, 0)],
   ['D_M05', 'D_Z13', 'D_O99']),
  ([datetime.datetime(2018, 2, 3, 0, 0)],
   ['D_M05', 'D_Z13', 'D_O99', 'D_D37'])]}
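Because an RNN consumes each patient's visits as a sequence, chronological order matters. If the rows in the admissions file are not already ordered by date, the visits can be sorted explicitly. A minimal sketch, using hypothetical visits for a single patient that are deliberately listed out of order:

```python
from datetime import datetime

# Hypothetical visits for one patient, in the same (dates, codes) shape as above.
visits = [
    ([datetime(2019, 4, 3)], ['D_E11', 'D_786']),
    ([datetime(2019, 1, 3)], ['D_E11', 'D_I25']),
    ([datetime(2019, 2, 3)], ['D_E11', 'D_780']),
]

# Sort on the admission date only, so ties never fall back to comparing code lists.
ordered = sorted(visits, key=lambda visit: visit[0][0])
ordered_dates = [visit[0][0].strftime('%Y-%m-%d') for visit in ordered]
print(ordered_dates)
```

Sorting the whole tuple keeps each date together with its diagnosis codes, so the codes cannot drift to the wrong visit.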

### Step 4a: Extract patient IDs, visit dates and diagnosis
In this step, we will split the mapping into three parallel lists: patient IDs, visit dates and diagnosis codes. The list of diagnosis codes will then be used in Step 4b to convert the code strings into integers for modeling.

<img src="img/step4a.png" style="height:500px">

In [13]:
print("Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding")
patIDs = [patid for patid, visitDate in patDxVisitOrderMap.items()]
datesList = [[visit[0][0] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]
DxsCodesList = [[visit[1] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]

Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding


In [14]:
patIDs

['A1234-B456', 'B1234-C456']

In [15]:
datesList

[[datetime.datetime(2019, 1, 3, 0, 0),
  datetime.datetime(2019, 2, 3, 0, 0),
  datetime.datetime(2019, 4, 3, 0, 0)],
 [datetime.datetime(2018, 1, 3, 0, 0), datetime.datetime(2018, 2, 3, 0, 0)]]

In [16]:
DxsCodesList

[[['D_E11', 'D_I25', 'D_I25'],
  ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784'],
  ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789']],
 [['D_M05', 'D_Z13', 'D_O99'], ['D_M05', 'D_Z13', 'D_O99', 'D_D37']]]

### Step 4b: Create a dictionary of the unique diagnosis codes assigned at each visit for each unique patient
Here we need to make sure that the codes are not only converted to integers but also kept in the order in which they were assigned to each patient at each visit.

<img src="img/step4b.png" style="height:500px">

In [76]:
print('Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation')
DxCodeDictionary = {}
encodedDxs = []
for patient in DxsCodesList:
    encodedPatientDxs = []
    for visit in patient:
        encodedVisit = []
        for code in visit:
            if code in DxCodeDictionary:
                encodedVisit.append(DxCodeDictionary[code])
            else:
                DxCodeDictionary[code] = len(DxCodeDictionary)
                encodedVisit.append(DxCodeDictionary[code])
        encodedPatientDxs.append(encodedVisit)
    encodedDxs.append(encodedPatientDxs)

Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation


In [78]:
DxCodeDictionary # Dictionary of all unique codes in the entire dataset aka: Our Code Vocabulary

{'D_E11': 0,
 'D_I25': 1,
 'D_780': 2,
 'D_784': 3,
 'D_786': 4,
 'D_401': 5,
 'D_789': 6,
 'D_M05': 7,
 'D_Z13': 8,
 'D_O99': 9,
 'D_D37': 10}

In [79]:
encodedDxs # Nested list (patients -> visits -> codes) with each diagnosis code replaced by its integer index

[[[0, 1, 1], [0, 1, 1, 2, 3], [0, 1, 1, 4, 5, 6]], [[7, 8, 9], [7, 8, 9, 10]]]
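For interpretation, the integer indices can be translated back into diagnosis codes by inverting the vocabulary. The sketch below uses a small subset of the dictionary shown above; the variable names are illustrative only.

```python
# Subset of the code vocabulary built in this step.
dx_code_dictionary = {'D_E11': 0, 'D_I25': 1, 'D_780': 2, 'D_784': 3}

# Invert the mapping: integer index -> diagnosis code.
index_to_code = {index: code for code, index in dx_code_dictionary.items()}

# Second visit of the first patient, as encoded above.
encoded_visit = [0, 1, 1, 2, 3]
decoded_visit = [index_to_code[index] for index in encoded_visit]
print(decoded_visit)
```

The inversion is safe because every code receives a distinct index, so no two codes map to the same integer.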

### Step 5: Dump the data into pickled lists of lists

In [84]:
outFile = 'ArtificialEHR_Data'
print('Dumping files into a pickled list')
# write each object to its own file; protocol -1 selects the highest pickle protocol available
for suffix, obj in [('.patIDs', patIDs), ('.dates', datesList),
                    ('.encodedDxs', encodedDxs), ('.Dxdictionary', DxCodeDictionary)]:
    with open(outFile + suffix, 'wb') as f:  # the with-block closes the file after writing
        pickle.dump(obj, f, -1)

Dumping files into a pickled list
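The pickled files can be read back with `pickle.load` when the data is needed for modeling. The sketch below is a self-contained round trip in a temporary directory, using a small stand-in for `encodedDxs` so that it does not depend on the files written above.

```python
import os
import pickle
import tempfile

# Small stand-in for encodedDxs: patients -> visits -> integer codes.
encoded = [[[0, 1, 1], [0, 1, 1, 2, 3]], [[7, 8, 9]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ArtificialEHR_Data.encodedDxs')
    with open(path, 'wb') as f:
        pickle.dump(encoded, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == encoded)
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.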


### Full Script

In [None]:
print('Creating visit date mapping')
patHashMap = {}  # PatientID -> list of that patient's Admission IDs
visitMap = {}  # Admission ID -> list containing that visit's admission date

# column indices below match the cells in Steps 1 and 2
with open('data/Admissions_Table.csv', 'r') as data:
    data.readline()  # read and discard the header line
    for line in data:
        feature = line.strip().split(',')
        visitDateID = datetime.strptime(feature[3], '%Y-%m-%d')
        patHashMap.setdefault(feature[1], []).append(feature[0])
        visitMap.setdefault(feature[0], []).append(visitDateID)

print('Creating Diagnosis-Visit mapping')
visitDxMap = {}  # Admission ID -> list of diagnosis codes assigned during that visit

with open('data/Diagnosis_Table.csv', 'r') as data:
    data.readline()  # read and discard the header line
    for line in data:
        feature = line.strip().split(',')
        visitDxMap.setdefault(feature[0], []).append('D_' + feature[4].split('.')[0])

print("Sorting visit mapping")
patDxVisitOrderMap = {}
for patid, visitDates in patHashMap.items():  # visitDates holds the patient's Admission IDs
    sorted_list = sorted([(visitMap[visitDateID], visitDxMap[visitDateID]) for visitDateID in visitDates],
                         key=lambda visit: visit[0])
    patDxVisitOrderMap[patid] = sorted_list

print("Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding")
patIDs = [patid for patid, visitDate in patDxVisitOrderMap.items()]
datesList = [[visit[0][0] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]
DxsCodesList = [[visit[1] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]

print('Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation')
DxCodeDictionary = {}
encodedDxs = []
for patient in DxsCodesList:
    encodedPatientDxs = []
    for visit in patient:
        encodedVisit = []
        for code in visit:
            if code in DxCodeDictionary:
                encodedVisit.append(DxCodeDictionary[code])
            else:
                DxCodeDictionary[code] = len(DxCodeDictionary)
                encodedVisit.append(DxCodeDictionary[code])
        encodedPatientDxs.append(encodedVisit)
    encodedDxs.append(encodedPatientDxs)