{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Pre-processing EHR data\n",
"by: Sparkle Russell-Puleri and Dorian Puleri"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Background: Detailed review of Doctor AI: Predicting Clinical Events via Recurrent Neural Networks (Choi et al., 2016)\n",
"The intent of this tutorial is to provide a detailed walkthrough of how EHR data should be pre-processed for use in RNNs using PyTorch. This paper is one of the few that provides a code base for taking a detailed look at how we can build generic temporal models that predict future clinical events. However, while the code for this highly cited paper is open source (written using Theano: https://github.com/mp2893/doctorai), it assumes quite a bit about its readers. As such, we have modernized the code for ease of use in Python 3+ and provided a detailed explanation of each step to allow anyone with a computer and access to healthcare data to begin developing innovative solutions to healthcare challenges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Important Disclaimer: \n",
"This data set was artificially created with two patients in Part 1 of this series to help provide readers with a clear understanding of the structure of EHR data. Please note that each EHR system is designed to meet a specific provider's needs, and this is just a basic example of the data typically contained in most systems. Additionally, note that this tutorial begins after all of the desired exclusion and inclusion criteria related to your research question have been applied. Therefore, at this step your data should already be fully wrangled and cleaned."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"application/javascript": [
"IPython.notebook.set_autosave_interval(180000)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Autosaving every 180 seconds\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from time import time\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import sys\n",
"import warnings\n",
"from datetime import datetime\n",
"import torch\n",
"import pickle\n",
"from collections import defaultdict\n",
"warnings.filterwarnings('ignore')\n",
"sns.set(style='white')\n",
"%autosave 180"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load data: A quick review of the artificial EHR data we created in Part 1:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/overview.png\" style=\"height:500px\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Create mappings of patient IDs \n",
"In this step we are going to create a dictionary that maps each patient to his or her visits, each identified by an `Admission ID`, and a second dictionary that maps each `Admission ID` to its visit date."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/step1.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating visit date mapping\n"
]
}
],
"source": [
"print('Creating visit date mapping')\n",
"patHashMap = {}  # PatientID -> list of Admission IDs, one per visit\n",
"visitMap = {}  # Admission ID -> list holding the date of that visit\n",
"\n",
"with open('data/Admissions_Table.csv', 'r') as data:\n",
"    data.readline()  # read and discard the header row\n",
"    for line in data:\n",
"        feature = line.strip().split(',')  # split the row on commas to isolate the columns\n",
"        visitDateID = datetime.strptime(feature[3], '%Y-%m-%d')  # parse the visit date\n",
"        patHashMap.setdefault(feature[1], []).append(feature[0])  # add this Admission ID to its PatientID\n",
"        visitMap.setdefault(feature[0], []).append(visitDateID)  # record the visit date for this Admission ID"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A1234-B456': ['A1234-12', 'A1234-34', 'A1234-15'],\n",
" 'B1234-C456': ['B1234-13', 'B1234-34']}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Patient ID to Admission ID (visit) mapping\n",
"patHashMap"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A1234-12': [datetime.datetime(2019, 1, 3, 0, 0)],\n",
" 'A1234-34': [datetime.datetime(2019, 2, 3, 0, 0)],\n",
" 'A1234-15': [datetime.datetime(2019, 4, 3, 0, 0)],\n",
" 'B1234-13': [datetime.datetime(2018, 1, 3, 0, 0)],\n",
" 'B1234-34': [datetime.datetime(2018, 2, 3, 0, 0)]}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Admission ID to visit date mapping\n",
"visitMap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Map diagnosis codes to each unique patient visit\n",
"This step, like all subsequent steps, must keep each patient's diagnosis codes in the correct visit order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/step2.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating Diagnosis-Visit mapping\n"
]
}
],
"source": [
"print('Creating Diagnosis-Visit mapping')\n",
"visitDxMap = {}  # Admission ID -> list of diagnosis codes assigned during that visit\n",
"\n",
"with open('data/Diagnosis_Table.csv', 'r') as data:\n",
"    data.readline()  # read and discard the header row\n",
"    for line in data:\n",
"        feature = line.strip().split(',')\n",
"        # keep the part of the code before the decimal point and prefix it with 'D_'\n",
"        visitDxMap.setdefault(feature[0], []).append('D_' + feature[4].split('.')[0])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A1234-12': ['D_E11', 'D_I25', 'D_I25'],\n",
" 'A1234-34': ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784'],\n",
" 'A1234-15': ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789'],\n",
" 'B1234-13': ['D_M05', 'D_Z13', 'D_O99'],\n",
" 'B1234-34': ['D_M05', 'D_Z13', 'D_O99', 'D_D37']}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"visitDxMap # Mapping of each Admission ID to each diagnosis code assigned during that visit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Embed diagnosis codes into the patient-admission mapping\n",
"This step adds the codes assigned to each patient directly into a dictionary that combines the patient-admission ID mapping `patHashMap` with the visit date mapping `visitMap`. This gives us, for each patient, a list of visits, each holding the visit date and the list of diagnosis codes received during that visit."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/step3.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sorting visit mapping\n"
]
}
],
"source": [
"print(\"Sorting visit mapping\")\n",
"patDxVisitOrderMap = {}\n",
"# Note: no explicit sort is applied; visits keep the order in which they appear in Admissions_Table.csv\n",
"for patid, visitDates in patHashMap.items():\n",
"    sorted_list = [(visitMap[visitDateID], visitDxMap[visitDateID]) for visitDateID in visitDates]\n",
"    patDxVisitOrderMap[patid] = sorted_list"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A1234-B456': [([datetime.datetime(2019, 1, 3, 0, 0)],\n",
" ['D_E11', 'D_I25', 'D_I25']),\n",
" ([datetime.datetime(2019, 2, 3, 0, 0)],\n",
" ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784']),\n",
" ([datetime.datetime(2019, 4, 3, 0, 0)],\n",
" ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789'])],\n",
" 'B1234-C456': [([datetime.datetime(2018, 1, 3, 0, 0)],\n",
" ['D_M05', 'D_Z13', 'D_O99']),\n",
" ([datetime.datetime(2018, 2, 3, 0, 0)],\n",
" ['D_M05', 'D_Z13', 'D_O99', 'D_D37'])]}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"patDxVisitOrderMap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4a: Extract patient IDs, visit dates and diagnosis codes\n",
"In this step, we split the mapping into three parallel lists: patient IDs, visit dates, and diagnosis codes. The list of diagnosis codes is then used in Step 4b to convert the code strings into integers for modeling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/step4a.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding\n"
]
}
],
"source": [
"print(\"Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding\")\n",
"patIDs = [patid for patid, visitDate in patDxVisitOrderMap.items()]\n",
"datesList = [[visit[0][0] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]\n",
"DxsCodesList = [[visit[1] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['A1234-B456', 'B1234-C456']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"patIDs"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[datetime.datetime(2019, 1, 3, 0, 0),\n",
" datetime.datetime(2019, 2, 3, 0, 0),\n",
" datetime.datetime(2019, 4, 3, 0, 0)],\n",
" [datetime.datetime(2018, 1, 3, 0, 0), datetime.datetime(2018, 2, 3, 0, 0)]]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"datesList"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[['D_E11', 'D_I25', 'D_I25'],\n",
" ['D_E11', 'D_I25', 'D_I25', 'D_780', 'D_784'],\n",
" ['D_E11', 'D_I25', 'D_I25', 'D_786', 'D_401', 'D_789']],\n",
" [['D_M05', 'D_Z13', 'D_O99'], ['D_M05', 'D_Z13', 'D_O99', 'D_D37']]]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DxsCodesList"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4b: Create a dictionary of the unique diagnosis codes assigned at each visit for each unique patient\n",
"Here we need to make sure that the codes are not only converted to integers but also kept in the order in which they were assigned to each patient."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/step4b.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation\n"
]
}
],
"source": [
"print('Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation')\n",
"DxCodeDictionary = {}  # diagnosis code string -> integer index, assigned in order of first appearance\n",
"encodedDxs = []  # patient -> visit -> list of integer-encoded codes\n",
"for patient in DxsCodesList:\n",
"    encodedPatientDxs = []\n",
"    for visit in patient:\n",
"        encodedVisit = []\n",
"        for code in visit:\n",
"            if code not in DxCodeDictionary:\n",
"                DxCodeDictionary[code] = len(DxCodeDictionary)  # next unused integer\n",
"            encodedVisit.append(DxCodeDictionary[code])\n",
"        encodedPatientDxs.append(encodedVisit)\n",
"    encodedDxs.append(encodedPatientDxs)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'D_E11': 0,\n",
" 'D_I25': 1,\n",
" 'D_780': 2,\n",
" 'D_784': 3,\n",
" 'D_786': 4,\n",
" 'D_401': 5,\n",
" 'D_789': 6,\n",
" 'D_M05': 7,\n",
" 'D_Z13': 8,\n",
" 'D_O99': 9,\n",
" 'D_D37': 10}"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DxCodeDictionary # Dictionary of all unique codes in the entire dataset aka: Our Code Vocabulary"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[[0, 1, 1], [0, 1, 1, 2, 3], [0, 1, 1, 4, 5, 6]], [[7, 8, 9], [7, 8, 9, 10]]]"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encodedDxs # Nested list (patient -> visit -> codes) of integer-encoded diagnosis codes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Dump the data into pickled lists of lists"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dumping files into a pickled list\n"
]
}
],
"source": [
"outFile = 'ArtificialEHR_Data'\n",
"print('Dumping files into a pickled list')\n",
"# write each object to its own file; protocol -1 selects the highest pickle protocol available\n",
"for suffix, obj in [('.patIDs', patIDs), ('.dates', datesList),\n",
"                    ('.encodedDxs', encodedDxs), ('.Dxdictionary', DxCodeDictionary)]:\n",
"    with open(outFile + suffix, 'wb') as f:\n",
"        pickle.dump(obj, f, -1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Full Script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Creating visit date mapping')\n",
"patHashMap = {}  # PatientID -> list of Admission IDs, one per visit\n",
"visitMap = {}  # Admission ID -> list holding the date of that visit\n",
"\n",
"# column indices match the step-by-step cells above\n",
"with open('data/Admissions_Table.csv', 'r') as data:\n",
"    data.readline()  # read and discard the header row\n",
"    for line in data:\n",
"        feature = line.strip().split(',')\n",
"        visitDateID = datetime.strptime(feature[3], '%Y-%m-%d')\n",
"        patHashMap.setdefault(feature[1], []).append(feature[0])\n",
"        visitMap.setdefault(feature[0], []).append(visitDateID)\n",
"\n",
"print('Creating Diagnosis-Visit mapping')\n",
"visitDxMap = {}  # Admission ID -> list of diagnosis codes assigned during that visit\n",
"\n",
"with open('data/Diagnosis_Table.csv', 'r') as data:\n",
"    data.readline()  # read and discard the header row\n",
"    for line in data:\n",
"        feature = line.strip().split(',')\n",
"        visitDxMap.setdefault(feature[0], []).append('D_' + feature[4].split('.')[0])\n",
"\n",
"print(\"Sorting visit mapping\")\n",
"patDxVisitOrderMap = {}\n",
"# Note: no explicit sort is applied; visits keep the order in which they appear in Admissions_Table.csv\n",
"for patid, visitDates in patHashMap.items():\n",
"    sorted_list = [(visitMap[visitDateID], visitDxMap[visitDateID]) for visitDateID in visitDates]\n",
"    patDxVisitOrderMap[patid] = sorted_list\n",
"\n",
"print(\"Extracting patient IDs, visit dates and diagnosis codes into individual lists for encoding\")\n",
"patIDs = [patid for patid, visitDate in patDxVisitOrderMap.items()]\n",
"datesList = [[visit[0][0] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]\n",
"DxsCodesList = [[visit[1] for visit in visitDate] for patid, visitDate in patDxVisitOrderMap.items()]\n",
"\n",
"print('Encoding string Dx codes to integers and mapping the encoded integer value to the ICD-10 code for interpretation')\n",
"DxCodeDictionary = {}  # diagnosis code string -> integer index\n",
"encodedDxs = []  # patient -> visit -> list of integer-encoded codes\n",
"for patient in DxsCodesList:\n",
"    encodedPatientDxs = []\n",
"    for visit in patient:\n",
"        encodedVisit = []\n",
"        for code in visit:\n",
"            if code not in DxCodeDictionary:\n",
"                DxCodeDictionary[code] = len(DxCodeDictionary)\n",
"            encodedVisit.append(DxCodeDictionary[code])\n",
"        encodedPatientDxs.append(encodedVisit)\n",
"    encodedDxs.append(encodedPatientDxs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}