{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctor AI Pytorch Minimial Implementation in Pytorch:\n",
"by: Sparkle Russell-Puleri and Dorian Puleri\n",
"\n",
"We will now apply the knowledge gained from the GRUs tutorial and part 1 of this series to a larger publicly available EHR dataset.This study will utilize the MIMIC III electronic health record (EHR) dataset, which is comprised of over 58,000 hospital admissions for 38,645 adults and 7 ,875 neonates. This dataset is a collection of de-identified intensive care unit stays at the Beth Israel Deaconess Medical Center from June 2001- October 2012. Despite being de-identified, this EHR dataset contains information about the patients’ demographics, vital sign measurements made at the bedside (~1/hr), laboratory test results, billing codes, medications, caregiver notes, imaging reports, and mortality (during and after hospitalization). Using the pre-processing methods demonstrated on artificially generated dataset in (Part 1 & Part 2) we will create a companion cohort for use in this study."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Architecture"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"img/Model_arch.png\" style=\"height:500px\">"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/javascript": [
"IPython.notebook.set_autosave_interval(120000)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Autosaving every 120 seconds\n"
]
}
],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"from torch.autograd import Variable\n",
"from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n",
"import torch.nn.functional as F\n",
"import numpy as np\n",
"import itertools\n",
"import pickle\n",
"import sys, random\n",
"np.random.seed(0)\n",
"torch.manual_seed(0)\n",
"%autosave 120"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking for GPU availability\n",
"This model was trained on a GPU enabled system...highly recommended."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training on CPU!\n"
]
}
],
"source": [
"# check if GPU is available\n",
"if(torch.cuda.is_available()):\n",
" print('Training on GPU!')\n",
"else: \n",
" print('Training on CPU!')\n",
" \n",
"device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load data\n",
"The data pre-processed datasets will be loaded and split into a train, test and validation set at a `75%:15%:10%` ratio."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def load_data(sequences, labels):\n",
" dataSize = len(labels)\n",
" idx = np.random.permutation(dataSize)\n",
" nTest = int(np.ceil(0.15 * dataSize))\n",
" nValid = int(np.ceil(0.10 * dataSize))\n",
"\n",
" test_idx = idx[:nTest]\n",
" valid_idx = idx[nTest:nTest+nValid]\n",
" train_idx = idx[nTest+nValid:]\n",
"\n",
" train_x = sequences[train_idx]\n",
" train_y = labels[train_idx]\n",
" test_x = sequences[test_idx]\n",
" test_y = labels[test_idx]\n",
" valid_x = sequences[valid_idx]\n",
" valid_y = labels[valid_idx]\n",
"\n",
" train_x = [sorted(seq) for seq in train_x]\n",
" train_y = [sorted(seq) for seq in train_y]\n",
" valid_x = [sorted(seq) for seq in valid_x]\n",
" valid_y = [sorted(seq) for seq in valid_y]\n",
" test_x = [sorted(seq) for seq in test_x]\n",
" test_y = [sorted(seq) for seq in test_y]\n",
"\n",
" train = (train_x, train_y)\n",
" test = (test_x, test_y)\n",
" valid = (valid_x, valid_y)\n",
" return (train, test, valid)"
]
},
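{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the split ratios on toy data. This is a sketch: it assumes `sequences` and `labels` are numpy object arrays of nested visit lists, as produced by the pre-processing in Parts 1 and 2; the `toy_seqs` array below is made up purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy sanity check for load_data (illustrative only; toy_seqs is synthetic)\n",
"toy_seqs = np.array([[[0, 2], [1]], [[3], [0, 1], [2]], [[1, 2], [3]]] * 10, dtype=object)\n",
"toy_train, toy_test, toy_valid = load_data(toy_seqs, toy_seqs)\n",
"print(len(toy_train[0]), len(toy_test[0]), len(toy_valid[0]))  # ~75% / 15% / 10% of 30 samples"
]
},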
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Padding the inputs:\n",
"The input tensors were padded with zeros, note that the inputs are padded to allow the RNN to handle the variable length inputs. A mask was then created to provide the algorithm information about the padding. Note this can be done using Pytorch's utility `pad_pack_sequence` function. However, given the nested nature of this dataset, the encoded inputs were first multi-one hot encoded. This off-course creates a high-dimenisonal sparse inputs, however the dimensionallity was then projected into a lower-dimensional space using an embedding layer."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def padding(seqs, labels, vocab, n_classes):\n",
" lengths = np.array([len(seq) for seq in seqs]) - 1 # remove the last list in each patient's sequences for labels\n",
" n_samples = len(lengths)\n",
" maxlen = np.max(lengths)\n",
"\n",
" x = torch.zeros(maxlen, n_samples, vocab) # maxlen = number of visits, n_samples = samples\n",
" y = torch.zeros(maxlen, n_samples, n_classes)\n",
" mask = torch.zeros(maxlen, n_samples)\n",
" for idx, (seq,label) in enumerate(zip(seqs,labels)):\n",
" for xvec, subseq in zip(x[:,idx,:], seq[:-1]):\n",
" xvec[subseq] = 1.\n",
" for yvec, subseq in zip(y[:,idx,:], label[1:]):\n",
" yvec[subseq] = 1.\n",
" mask[:lengths[idx], idx] = 1.\n",
" \n",
" return x, y, lengths, mask"
]
},
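{
"cell_type": "markdown",
"metadata": {},
"source": [
"For flat (non-nested) sequences, the `pack_padded_sequence`/`pad_packed_sequence` utilities mentioned above can handle the padding bookkeeping directly. Below is a minimal sketch, assuming a recent PyTorch and a toy batch of dense per-visit vectors rather than the nested code lists used in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: letting PyTorch handle variable-length sequences (toy data, not MIMIC-III)\n",
"toy_batch = [torch.randn(5, 200), torch.randn(3, 200), torch.randn(2, 200)]  # 3 patients, 5/3/2 visits\n",
"toy_lengths = torch.tensor([len(s) for s in toy_batch])\n",
"\n",
"padded = nn.utils.rnn.pad_sequence(toy_batch)                   # (maxlen, batch, features)\n",
"packed = pack_padded_sequence(padded, toy_lengths, enforce_sorted=False)\n",
"gru = nn.GRU(input_size=200, hidden_size=200)\n",
"out, h_n = gru(packed)                                          # padded steps are skipped\n",
"out, out_lengths = pad_packed_sequence(out)                     # back to (maxlen, batch, hidden)\n",
"print(out.shape, out_lengths)"
]
},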
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## GRU Class:\n",
"This class contains randomly initiated weights needed to begin calculating the hidden states of the alogrithms. Note, in this paper the author used embedding matrix ($W_{emb}$) generated using the skip-gram algorithm, which outperformed the randomly initialized approached shown in this step."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1)\n",
"class EHRNN(nn.Module):\n",
" def __init__(self, inputDimSize, hiddenDimSize,embSize, batchSize, numClass):\n",
" super(EHRNN, self).__init__()\n",
"\n",
" self.hiddenDimSize = hiddenDimSize\n",
" self.inputDimSize = inputDimSize\n",
" self.embSize = embSize\n",
" self.numClass = numClass\n",
" self.batchSize = batchSize\n",
"\n",
" #Initialize random weights\n",
" self.W_z = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
" self.W_r = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
" self.W_h = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
"\n",
" self.U_z = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
" self.U_r = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
" self.U_h = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
"\n",
" self.b_z = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
" self.b_r = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
" self.b_h = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
"\n",
" \n",
" self.params = [self.W_z, self.W_r, self.W_h, \n",
" self.U_z, self.U_r, self.U_h,\n",
" self.b_z, self.b_r, self.b_h]\n",
"\n",
" \n",
" def forward(self,emb,h):\n",
" z = torch.sigmoid(torch.matmul(emb, self.W_z) + torch.matmul(h, self.U_z) + self.b_z)\n",
" r = torch.sigmoid(torch.matmul(emb, self.W_r) + torch.matmul(h, self.U_r) + self.b_r)\n",
" h_tilde = torch.tanh(torch.matmul(emb, self.W_h) + torch.matmul(r * h, self.U_h) + self.b_h)\n",
" h = z * h + ((1. - z) * h_tilde)\n",
" return h\n",
" \n",
" \n",
" def init_hidden(self):\n",
" return Variable(torch.zeros(self.batchSize,self.hiddenDimSize))"
]
},
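{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick shape check of a single GRU step on random tensors (a sketch with made-up dimensions; the real dimensions are set when the full model is built below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: one GRU step on random data to verify shapes\n",
"toy_gru = EHRNN(inputDimSize=100, hiddenDimSize=8, embSize=16, batchSize=4, numClass=100)\n",
"toy_emb = torch.randn(4, 16)   # a batch of 4 embedded visits\n",
"toy_h = toy_gru.init_hidden()  # zeros of shape (batchSize, hiddenDimSize)\n",
"print(toy_gru(toy_emb, toy_h).shape)  # torch.Size([4, 8])"
]
},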
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Custom Layer for handling two layer GRU\n",
"The purpose of this class, is to perform the intially embedding followed by caluculating the hidden states and performing dropout between the layers."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(1)\n",
"class build_EHRNN(nn.Module):\n",
" def __init__(self, inputDimSize=4894, hiddenDimSize=[200,200], batchSize=100, embSize=200,numClass=4894, dropout=0.5,logEps=1e-8):\n",
" super(build_EHRNN, self).__init__()\n",
" \n",
" self.inputDimSize = inputDimSize\n",
" self.hiddenDimSize = hiddenDimSize\n",
" self.numClass = numClass\n",
" self.embSize = embSize\n",
" self.batchSize = batchSize\n",
" self.dropout = nn.Dropout(p=0.5)\n",
" self.logEps = logEps\n",
" \n",
" \n",
" # Embedding inputs\n",
" self.W_emb = nn.Parameter(torch.randn(self.inputDimSize, self.embSize).cuda())\n",
" self.b_emb = nn.Parameter(torch.zeros(self.embSize).cuda())\n",
" \n",
" self.W_out = nn.Parameter(torch.randn(self.hiddenDimSize, self.numClass).cuda())\n",
" self.b_out = nn.Parameter(torch.zeros(self.numClass).cuda())\n",
" \n",
" self.params = [self.W_emb, self.W_out, \n",
" self.b_emb, self.b_out] \n",
" \n",
" def forward(self,x, y, h, lengths, mask):\n",
" self.emb = torch.tanh(torch.matmul(x, self.W_emb) + self.b_emb)\n",
" input_values = self.emb\n",
" self.outputs = [input_values]\n",
" for i, hiddenSize in enumerate([self.hiddenDimSize, self.hiddenDimSize]): # iterate over layers\n",
" rnn = EHRNN(self.inputDimSize,hiddenSize,self.embSize,self.batchSize,self.numClass) # calculate hidden states\n",
" hidden_state = []\n",
" h = self.init_hidden().cuda()\n",
" for i,seq in enumerate(input_values): # loop over sequences in each batch\n",
" h = rnn(seq, h) \n",
" hidden_state.append(h) \n",
" hidden_state = self.dropout(torch.stack(hidden_state)) # apply dropout between layers\n",
" input_values = hidden_state\n",
" \n",
" y_linear = torch.matmul(hidden_state, self.W_out) + self.b_out # fully connected layer\n",
" yhat = F.softmax(y_linear, dim=1) # yhat\n",
" yhat = yhat*mask[:,:,None] # apply mask\n",
" \n",
" # Loss calculation\n",
" cross_entropy = -(y * torch.log(yhat + self.logEps) + (1. - y) * torch.log(1. - yhat + self.logEps))\n",
" last_step = -torch.mean(y[-1] * torch.log(yhat[-1] + self.logEps) + (1. - y[-1]) * torch.log(1. - yhat[-1] + self.logEps))\n",
" prediction_loss = torch.sum(torch.sum(cross_entropy, dim=0),dim=1)/ torch.cuda.FloatTensor(lengths)\n",
" cost = torch.mean(prediction_loss) + 0.000001 * (self.W_out ** 2).sum() # regularize\n",
" return (yhat, hidden_state, cost)\n",
"\n",
" def init_hidden(self):\n",
" return torch.zeros(self.batchSize, self.hiddenDimSize) # initial state"
]
},
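{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cost computed in `forward` is a masked, per-code cross-entropy averaged over each patient's visits, plus a small L2 penalty on the output weights:\n",
"\n",
"$$\\mathcal{L} = \\frac{1}{N}\\sum_{n=1}^{N}\\frac{1}{T_n}\\sum_{t=1}^{T_n}\\sum_{c}-\\left[y_{t,c}\\log(\\hat{y}_{t,c}+\\epsilon) + (1-y_{t,c})\\log(1-\\hat{y}_{t,c}+\\epsilon)\\right] + \\lambda\\lVert W_{out}\\rVert_2^2$$\n",
"\n",
"where $T_n$ is the number of labeled visits for patient $n$, $\\epsilon$ = `logEps` = $10^{-8}$ guards against $\\log(0)$, and $\\lambda = 10^{-6}$."
]
},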
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiate Model\n",
"Instantiate model and provided parameters and be sure to send it to a GPU enabled device to speed up matrix computations."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"model = build_EHRNN(4894, 200, 100, 200, 4894,0.5,1e-8)\n",
"model = model.to(device)"
]
},
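{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check that the stacked GRU layers are registered with the model, so the optimizer will update their weights:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: with the GRU layers created in __init__, their weights\n",
"# appear in model.parameters() and will be updated by the optimizer\n",
"print(\"Trainable parameters: {:,}\".format(sum(p.numel() for p in model.parameters())))"
]
},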
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"print(\"Loading Data\")\n",
"train, test, valid = load_data(sequences, labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batch Size\n",
"Keep only enough samples to make the specified bactch size."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"batchSize = 100\n",
"n_batches = int(np.ceil(float(len(train[0])) / float(batchSize)))-1\n",
"n_batches_valid = int(np.ceil(float(len(valid[0])) / float(batchSize)))-1"
]
},
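{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a shape check, pad one training batch and inspect the resulting tensors (a sketch; it assumes `train` has been loaded above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: shape check on one padded batch\n",
"bX = train[0][:batchSize]\n",
"bY = train[1][:batchSize]\n",
"x, y, lengths, mask = padding(bX, bY, 4894, 4894)\n",
"print(x.shape, y.shape, mask.shape)  # (maxlen, batch, vocab), (maxlen, batch, n_classes), (maxlen, batch)"
]
},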
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train model\n",
"This model is a minimal implementation fo the Dr.AI algorithm created by Edward Choi, while functional it requires significant tuning. This will be demonstrated in a subsequent tutorial."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"optimizer = torch.optim.Adadelta(model.parameters(), lr = 0.01, rho=0.90)\n",
"max_epochs = 10\n",
"\n",
"loss_all = []\n",
"iteration = 0\n",
" \n",
"for e in range(max_epochs):\n",
" for index in random.sample(range(n_batches), n_batches):\n",
" batchX = train[0][:n_batches*batchSize][index*batchSize:(index+1)*batchSize]\n",
" batchY = train[1][:n_batches*batchSize][index*batchSize:(index+1)*batchSize]\n",
" \n",
" optimizer.zero_grad()\n",
" \n",
" x, y, lengths, mask = padding(batchX, batchY, 4894, 4894)\n",
" \n",
" if torch.cuda.is_available():\n",
" x, y, lenghts, mask = x.cuda(), y.cuda(), lengths, mask.cuda()\n",
" \n",
" outputs, hidden, cost = model(x,y, h, lengths, mask)\n",
" \n",
" if torch.cuda.is_available():\n",
" cost.cuda()\n",
" cost.backward()\n",
" nn.utils.clip_grad_norm_(model.parameters(), 5)\n",
" optimizer.step()\n",
" \n",
" loss_all.append(cost.item())\n",
" iteration +=1\n",
" if iteration % 10 == 0:\n",
" # Calculate Accuracy \n",
" losses = []\n",
" model.eval()\n",
" for index in random.sample(range(n_batches_valid), n_batches_valid):\n",
" validX = valid[0][:n_batches_valid*batchSize][index*batchSize:(index+1)*batchSize]\n",
" validY = valid[1][:n_batches_valid*batchSize][index*batchSize:(index+1)*batchSize]\n",
"\n",
" x, y, lengths, mask = padding(validX, validY, 4894, 4894)\n",
"\n",
" if torch.cuda.is_available():\n",
" x, y, lenghts, mask = x.cuda(), y.cuda(), lenghts, mask.cuda()\n",
"\n",
" outputs, hidden_val, cost_val = model(x,y, h, lengths, mask)\n",
" losses.append(cost_val)\n",
" model.train()\n",
"\n",
" print(\"Epoch: {}/{}...\".format(e+1, max_epochs),\n",
" \"Step: {}...\".format(iteration),\n",
" \"Training Loss: {:.4f}...\".format(np.mean(loss_all)),\n",
" \"Val Loss: {:.4f}\".format(torch.mean(torch.tensor(losses))))"
]
},
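{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plotting the per-iteration training loss gives a quick read on convergence (a sketch; run after the training loop completes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Sketch: visualize the training loss collected in loss_all\n",
"plt.plot(loss_all)\n",
"plt.xlabel('Iteration')\n",
"plt.ylabel('Training loss')\n",
"plt.title('Doctor AI minimal implementation')\n",
"plt.show()"
]
},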
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final Notes/ Next Steps:\n",
"This should serve as starter code to get the model up and running. As noted before, a significant amount of tuning will be required as this was built using custom classes. We will walkthrough the proces in a future tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References:\n",
"1. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks (https://arxiv.org/abs/1511.05942)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}