{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Doctor AI Pytorch Minimial Implementation in Pytorch:\n",
    "by: Sparkle Russell-Puleri and Dorian Puleri\n",
    "\n",
    "We will now apply the knowledge gained from the GRUs tutorial and part 1 of this series to a larger publicly available EHR dataset.This study will utilize the MIMIC III electronic health record (EHR) dataset, which is comprised of over 58,000 hospital admissions for 38,645 adults and 7 ,875 neonates. This dataset is a collection of de-identified intensive care unit stays at the Beth Israel Deaconess Medical Center from June 2001- October 2012. Despite being de-identified, this EHR dataset contains information about the patients’ demographics, vital sign measurements made at the bedside (~1/hr), laboratory test results, billing codes, medications, caregiver notes, imaging reports, and mortality (during and after hospitalization). Using the pre-processing methods demonstrated on artificially generated dataset in (Part 1 & Part 2) we will create a companion cohort for use in this study."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"img/Model_arch.png\" style=\"height:500px\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/javascript": [
       "IPython.notebook.set_autosave_interval(120000)"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Autosaving every 120 seconds\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.autograd import Variable\n",
    "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n",
    "import torch.nn.functional as F\n",
    "import numpy as np\n",
    "import itertools\n",
    "import pickle\n",
    "import sys, random\n",
    "np.random.seed(0)\n",
    "torch.manual_seed(0)\n",
    "%autosave 120"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checking for GPU availability\n",
    "This model was trained on a GPU enabled system...highly recommended."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training on CPU!\n"
     ]
    }
   ],
   "source": [
    "# check if GPU is available\n",
    "if(torch.cuda.is_available()):\n",
    "    print('Training on GPU!')\n",
    "else: \n",
    "    print('Training on CPU!')\n",
    "    \n",
    "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data\n",
    "The data pre-processed datasets will be loaded and split into a train, test and validation set at a `75%:15%:10%` ratio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data(sequences, labels):\n",
    "    dataSize = len(labels)\n",
    "    idx = np.random.permutation(dataSize)\n",
    "    nTest = int(np.ceil(0.15 * dataSize))\n",
    "    nValid = int(np.ceil(0.10 * dataSize))\n",
    "\n",
    "    test_idx = idx[:nTest]\n",
    "    valid_idx = idx[nTest:nTest+nValid]\n",
    "    train_idx = idx[nTest+nValid:]\n",
    "\n",
    "    train_x = sequences[train_idx]\n",
    "    train_y = labels[train_idx]\n",
    "    test_x = sequences[test_idx]\n",
    "    test_y = labels[test_idx]\n",
    "    valid_x = sequences[valid_idx]\n",
    "    valid_y = labels[valid_idx]\n",
    "\n",
    "    train_x = [sorted(seq) for seq in train_x]\n",
    "    train_y = [sorted(seq) for seq in train_y]\n",
    "    valid_x = [sorted(seq) for seq in valid_x]\n",
    "    valid_y = [sorted(seq) for seq in valid_y]\n",
    "    test_x = [sorted(seq) for seq in test_x]\n",
    "    test_y = [sorted(seq) for seq in test_y]\n",
    "\n",
    "    train = (train_x, train_y)\n",
    "    test = (test_x, test_y)\n",
    "    valid = (valid_x, valid_y)\n",
    "    return (train, test, valid)"
   ]
  },
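  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check of `load_data` on a small synthetic cohort (the toy sequences below are an assumption for illustration only; the real inputs are the pickled MIMIC-III sequences produced in Parts 1 & 2). It confirms the `75%:15%:10%` split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy cohort: each patient is a list of visits, each visit a list of code indices.\n",
    "# The ragged nesting makes numpy build a 1-D object array, as load_data expects.\n",
    "toy_seqs = np.array([[[0, 2], [1]], [[3], [0, 1], [2]]] * 50, dtype=object)\n",
    "toy_labels = np.array([[[0, 2], [1]], [[3], [0, 1], [2]]] * 50, dtype=object)\n",
    "toy_train, toy_test, toy_valid = load_data(toy_seqs, toy_labels)\n",
    "print(len(toy_train[0]), len(toy_test[0]), len(toy_valid[0]))  # 75 15 10"
   ]
  },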
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Padding the inputs:\n",
    "The input tensors were padded with zeros, note that the inputs are padded to allow the RNN to handle the variable length inputs. A mask was then created to provide the algorithm information about the padding. Note this can be done using Pytorch's utility `pad_pack_sequence` function. However, given the nested nature of this dataset, the encoded inputs were first multi-one hot encoded. This off-course creates a high-dimenisonal sparse inputs, however the dimensionallity was then projected into a lower-dimensional space using an embedding layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def padding(seqs, labels, vocab, n_classes):\n",
    "    lengths = np.array([len(seq) for seq in seqs]) - 1 # remove the last list in each patient's sequences for labels\n",
    "    n_samples = len(lengths)\n",
    "    maxlen = np.max(lengths)\n",
    "\n",
    "    x = torch.zeros(maxlen, n_samples, vocab) # maxlen = number of visits, n_samples = samples\n",
    "    y = torch.zeros(maxlen, n_samples, n_classes)\n",
    "    mask = torch.zeros(maxlen, n_samples)\n",
    "    for idx, (seq,label) in enumerate(zip(seqs,labels)):\n",
    "        for xvec, subseq in zip(x[:,idx,:], seq[:-1]):\n",
    "            xvec[subseq] = 1.\n",
    "        for yvec, subseq in zip(y[:,idx,:], label[1:]):\n",
    "            yvec[subseq] = 1.\n",
    "        mask[:lengths[idx], idx] = 1.\n",
    "        \n",
    "    return x, y, lengths, mask"
   ]
  },
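  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of `padding` on two synthetic patients (toy data with an assumed vocabulary of 5 codes and 5 output classes). Each visit becomes a multi-one-hot row, and `mask` flags the real (non-padded) time steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "toy_x = [[[0, 2], [1], [3, 4]], [[1], [2, 3]]]  # patient 0: 3 visits, patient 1: 2 visits\n",
    "toy_y = [[[0, 2], [1], [3, 4]], [[1], [2, 3]]]\n",
    "x, y, lengths, mask = padding(toy_x, toy_y, 5, 5)\n",
    "print(x.shape, y.shape)  # (maxlen, n_samples, vocab): torch.Size([2, 2, 5]) for both\n",
    "print(lengths)           # [2 1] -- visits minus one, since the last visit is label-only\n",
    "print(mask)              # ones at real time steps, zeros at padding"
   ]
  },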
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GRU Class:\n",
    "This class contains randomly initiated weights needed to begin calculating the hidden states of the alogrithms. Note, in this paper the author used embedding matrix ($W_{emb}$) generated using the skip-gram algorithm, which outperformed the randomly initialized approached shown in this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.manual_seed(1)\n",
    "class EHRNN(nn.Module):\n",
    "    def __init__(self, inputDimSize, hiddenDimSize,embSize, batchSize, numClass):\n",
    "        super(EHRNN, self).__init__()\n",
    "\n",
    "        self.hiddenDimSize = hiddenDimSize\n",
    "        self.inputDimSize = inputDimSize\n",
    "        self.embSize = embSize\n",
    "        self.numClass = numClass\n",
    "        self.batchSize = batchSize\n",
    "\n",
    "        #Initialize random weights\n",
    "        self.W_z = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
    "        self.W_r = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
    "        self.W_h = nn.Parameter(torch.randn(self.embSize, self.hiddenDimSize).cuda())\n",
    "\n",
    "        self.U_z = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
    "        self.U_r = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
    "        self.U_h = nn.Parameter(torch.randn(self.hiddenDimSize, self.hiddenDimSize).cuda())\n",
    "\n",
    "        self.b_z = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
    "        self.b_r = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
    "        self.b_h = nn.Parameter(torch.zeros(self.hiddenDimSize).cuda())\n",
    "\n",
    "        \n",
    "        self.params = [self.W_z, self.W_r, self.W_h, \n",
    "                       self.U_z, self.U_r, self.U_h,\n",
    "                       self.b_z, self.b_r, self.b_h]\n",
    "\n",
    "        \n",
    "    def forward(self,emb,h):\n",
    "        z = torch.sigmoid(torch.matmul(emb, self.W_z)  + torch.matmul(h, self.U_z) + self.b_z)\n",
    "        r = torch.sigmoid(torch.matmul(emb, self.W_r)  + torch.matmul(h, self.U_r) + self.b_r)\n",
    "        h_tilde = torch.tanh(torch.matmul(emb, self.W_h)  + torch.matmul(r * h, self.U_h) + self.b_h)\n",
    "        h = z * h + ((1. - z) * h_tilde)\n",
    "        return h\n",
    "    \n",
    "                           \n",
    "    def init_hidden(self):\n",
    "        return Variable(torch.zeros(self.batchSize,self.hiddenDimSize))"
   ]
  },
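  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the GRU update equations concrete, here is a standalone single-step sketch on plain CPU tensors (the shapes are assumptions chosen for illustration). It mirrors the arithmetic in `EHRNN.forward` without needing the class or a GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One GRU step: batch of 4 visits, embedding and hidden size of 8 (assumed)\n",
    "emb = torch.randn(4, 8)   # embedded input at one time step\n",
    "h = torch.zeros(4, 8)     # previous hidden state\n",
    "W_z, W_r, W_h = (torch.randn(8, 8) for _ in range(3))\n",
    "U_z, U_r, U_h = (torch.randn(8, 8) for _ in range(3))\n",
    "b_z = b_r = b_h = torch.zeros(8)\n",
    "\n",
    "z = torch.sigmoid(emb @ W_z + h @ U_z + b_z)           # update gate\n",
    "r = torch.sigmoid(emb @ W_r + h @ U_r + b_r)           # reset gate\n",
    "h_tilde = torch.tanh(emb @ W_h + (r * h) @ U_h + b_h)  # candidate state\n",
    "h_next = z * h + (1. - z) * h_tilde                    # interpolated new state\n",
    "print(h_next.shape)  # torch.Size([4, 8])"
   ]
  },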
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom Layer for handling two layer GRU\n",
    "The purpose of this class, is to perform the intially embedding followed by caluculating the hidden states and performing dropout between the layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.manual_seed(1)\n",
    "class build_EHRNN(nn.Module):\n",
    "    def __init__(self, inputDimSize=4894, hiddenDimSize=[200,200], batchSize=100, embSize=200,numClass=4894, dropout=0.5,logEps=1e-8):\n",
    "        super(build_EHRNN, self).__init__()\n",
    "        \n",
    "        self.inputDimSize = inputDimSize\n",
    "        self.hiddenDimSize = hiddenDimSize\n",
    "        self.numClass = numClass\n",
    "        self.embSize = embSize\n",
    "        self.batchSize = batchSize\n",
    "        self.dropout = nn.Dropout(p=0.5)\n",
    "        self.logEps = logEps\n",
    "        \n",
    "        \n",
    "        # Embedding inputs\n",
    "        self.W_emb = nn.Parameter(torch.randn(self.inputDimSize, self.embSize).cuda())\n",
    "        self.b_emb = nn.Parameter(torch.zeros(self.embSize).cuda())\n",
    "        \n",
    "        self.W_out = nn.Parameter(torch.randn(self.hiddenDimSize, self.numClass).cuda())\n",
    "        self.b_out = nn.Parameter(torch.zeros(self.numClass).cuda())\n",
    "         \n",
    "        self.params = [self.W_emb, self.W_out, \n",
    "                       self.b_emb, self.b_out] \n",
    "    \n",
    "    def forward(self,x, y, h, lengths, mask):\n",
    "        self.emb = torch.tanh(torch.matmul(x, self.W_emb) + self.b_emb)\n",
    "        input_values = self.emb\n",
    "        self.outputs = [input_values]\n",
    "        for i, hiddenSize in enumerate([self.hiddenDimSize, self.hiddenDimSize]):  # iterate over layers\n",
    "            rnn = EHRNN(self.inputDimSize,hiddenSize,self.embSize,self.batchSize,self.numClass) # calculate hidden states\n",
    "            hidden_state = []\n",
    "            h = self.init_hidden().cuda()\n",
    "            for i,seq in enumerate(input_values): # loop over sequences in each batch\n",
    "                h = rnn(seq, h)                    \n",
    "                hidden_state.append(h)    \n",
    "            hidden_state = self.dropout(torch.stack(hidden_state))    # apply dropout between layers\n",
    "            input_values = hidden_state\n",
    "       \n",
    "        y_linear = torch.matmul(hidden_state, self.W_out)  + self.b_out # fully connected layer\n",
    "        yhat = F.softmax(y_linear, dim=1)  # yhat\n",
    "        yhat = yhat*mask[:,:,None]   # apply mask\n",
    "        \n",
    "        # Loss calculation\n",
    "        cross_entropy = -(y * torch.log(yhat + self.logEps) + (1. - y) * torch.log(1. - yhat + self.logEps))\n",
    "        last_step = -torch.mean(y[-1] * torch.log(yhat[-1] + self.logEps) + (1. - y[-1]) * torch.log(1. - yhat[-1] + self.logEps))\n",
    "        prediction_loss = torch.sum(torch.sum(cross_entropy, dim=0),dim=1)/ torch.cuda.FloatTensor(lengths)\n",
    "        cost = torch.mean(prediction_loss) + 0.000001 * (self.W_out ** 2).sum() # regularize\n",
    "        return (yhat, hidden_state, cost)\n",
    "\n",
    "    def init_hidden(self):\n",
    "        return torch.zeros(self.batchSize, self.hiddenDimSize)  # initial state"
   ]
  },
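  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cost above is a masked, per-visit binary cross-entropy, normalized by each patient's true number of visits. The sketch below reproduces that computation on toy tensors (all shapes are assumptions) so the masking and normalization are easy to inspect outside the class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxlen, batch, n_classes, logEps = 3, 2, 5, 1e-8  # assumed toy dimensions\n",
    "yhat = torch.rand(maxlen, batch, n_classes)\n",
    "y = torch.randint(0, 2, (maxlen, batch, n_classes)).float()\n",
    "mask = torch.tensor([[1., 1.], [1., 0.], [0., 0.]])  # patient 0: 2 real steps, patient 1: 1\n",
    "lengths = mask.sum(dim=0)                            # true sequence lengths\n",
    "\n",
    "yhat = yhat * mask[:, :, None]  # zero predictions at padded steps\n",
    "y = y * mask[:, :, None]        # padded label steps are zero, as padding() produces\n",
    "ce = -(y * torch.log(yhat + logEps) + (1. - y) * torch.log(1. - yhat + logEps))\n",
    "loss_per_patient = ce.sum(dim=0).sum(dim=1) / lengths\n",
    "print(loss_per_patient.mean())  # matches the unregularized part of the cost"
   ]
  },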
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate Model\n",
    "Instantiate model and provided parameters and be sure to send it to a GPU enabled device to speed up matrix computations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "model = build_EHRNN(4894, 200, 100, 200, 4894,0.5,1e-8)\n",
    "model = model.to(device)"
   ]
  },
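  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check that the embedding, GRU, and output weights are all registered (and therefore trainable), we can count the model's parameters (a sanity-check sketch, not part of the original pipeline)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_params = sum(p.numel() for p in model.parameters())\n",
    "print('trainable parameters: {:,}'.format(n_params))\n",
    "for name, p in model.named_parameters():\n",
    "    print(name, tuple(p.shape))  # includes layers.0.* and layers.1.* GRU weights"
   ]
  },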
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Loading Data\")\n",
    "train, test, valid = load_data(sequences, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Size\n",
    "Keep only enough samples to make the specified bactch size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "batchSize = 100\n",
    "n_batches = int(np.ceil(float(len(train[0])) / float(batchSize)))-1\n",
    "n_batches_valid = int(np.ceil(float(len(valid[0])) / float(batchSize)))-1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train model\n",
    "This model is a minimal implementation fo the Dr.AI algorithm created by Edward Choi, while functional it requires significant tuning. This will be demonstrated in a subsequent tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = torch.optim.Adadelta(model.parameters(), lr = 0.01, rho=0.90)\n",
    "max_epochs = 10\n",
    "\n",
    "loss_all = []\n",
    "iteration = 0\n",
    "        \n",
    "for e in range(max_epochs):\n",
    "    for index in random.sample(range(n_batches), n_batches):\n",
    "        batchX = train[0][:n_batches*batchSize][index*batchSize:(index+1)*batchSize]\n",
    "        batchY = train[1][:n_batches*batchSize][index*batchSize:(index+1)*batchSize]\n",
    "        \n",
    "        optimizer.zero_grad()\n",
    "        \n",
    "        x, y, lengths, mask = padding(batchX, batchY, 4894, 4894)\n",
    "        \n",
    "        if torch.cuda.is_available():\n",
    "            x, y, lenghts, mask = x.cuda(), y.cuda(), lengths, mask.cuda()\n",
    "        \n",
    "        outputs, hidden, cost = model(x,y, h, lengths, mask)\n",
    "        \n",
    "        if torch.cuda.is_available():\n",
    "            cost.cuda()\n",
    "        cost.backward()\n",
    "        nn.utils.clip_grad_norm_(model.parameters(), 5)\n",
    "        optimizer.step()\n",
    "        \n",
    "        loss_all.append(cost.item())\n",
    "        iteration +=1\n",
    "        if iteration % 10 == 0:\n",
    "            # Calculate Accuracy         \n",
    "            losses = []\n",
    "            model.eval()\n",
    "            for index in random.sample(range(n_batches_valid), n_batches_valid):\n",
    "                validX = valid[0][:n_batches_valid*batchSize][index*batchSize:(index+1)*batchSize]\n",
    "                validY = valid[1][:n_batches_valid*batchSize][index*batchSize:(index+1)*batchSize]\n",
    "\n",
    "                x, y, lengths, mask = padding(validX, validY, 4894, 4894)\n",
    "\n",
    "                if torch.cuda.is_available():\n",
    "                    x, y, lenghts, mask = x.cuda(), y.cuda(), lenghts, mask.cuda()\n",
    "\n",
    "                outputs, hidden_val, cost_val = model(x,y, h, lengths, mask)\n",
    "                losses.append(cost_val)\n",
    "            model.train()\n",
    "\n",
    "            print(\"Epoch: {}/{}...\".format(e+1, max_epochs),\n",
    "                          \"Step: {}...\".format(iteration),\n",
    "                          \"Training Loss: {:.4f}...\".format(np.mean(loss_all)),\n",
    "                          \"Val Loss: {:.4f}\".format(torch.mean(torch.tensor(losses))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final Notes/ Next Steps:\n",
    "This should serve as starter code to get the model up and running. As noted before, a significant amount of tuning will be required as this was built using custom classes. We will walkthrough the proces in a future tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### References:\n",
    "1. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks (https://arxiv.org/abs/1511.05942)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}