AI-for-Healthcare-EHR / Git / [d88862] /4. Applying AI to Wearable Device Data/3. Activity Classification/walkthroughs/activity-classifier/activity

Downloads: 1
[d88862]: / 4. Applying AI to Wearable Device Data / 3. Activity Classification / walkthroughs / activity-classifier / activity_classifier.ipynb
History
Download this file
362 lines (361 with data), 10.2 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Activity Classifier\n",
    "\n",
    "We've explored the data, examined the literature, chosen our features, and pre-processed all the data. Now it's time to finally build the classifier!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import some of the libraries that we will need"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy as sp\n",
    "import scipy.signal\n",
    "import scipy.stats\n",
    "\n",
    "import activity_classifier_utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fs = 256\n",
    "data = activity_classifier_utils.LoadWristPPGDataset()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Extraction\n",
    "\n",
    "Train on 10 second long non-overlapping windows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "window_length_s = 10\n",
    "window_shift_s = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import activity_classifier_utils\n",
    "\n",
    "window_length = window_length_s * fs\n",
    "window_shift = window_shift_s * fs\n",
    "labels, subjects, features = [], [], []\n",
    "for subject, activity, df in data:\n",
    "    for i in range(0, len(df) - window_length, window_shift):\n",
    "        window = df[i: i + window_length]\n",
    "        accx = window.accx.values\n",
    "        accy = window.accy.values\n",
    "        accz = window.accz.values\n",
    "        features.append(activity_classifier_utils.Featurize(accx, accy, accz, fs=fs))\n",
    "        labels.append(activity)\n",
    "        subjects.append(subject)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = np.array(labels)\n",
    "subjects = np.array(subjects)\n",
    "features = np.array(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a Random Forest Classifier using sklearn\n",
    "\n",
    "If you've done machine learning in Python before, you've more than likely used `sklearn`. ML for wearable data is no different. Let's use sklearn to train a random forest to classify our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define hyperparameters\n",
    "\n",
    "Let's build a forest with 100 trees where each tree has a maximum depth of 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_estimators = 100\n",
    "max_tree_depth = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and train the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = RandomForestClassifier(n_estimators=n_estimators,\n",
    "                             max_depth=max_tree_depth,\n",
    "                             random_state=42)\n",
    "clf.fit(features, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performance Evaluation\n",
    "\n",
    "### Confusion Matrix\n",
    "\n",
    "One way to evaluate the performance of a multi-class classifier is to look at a confusion matrix. The confusion matrix shows how many datapoints were misclassified and what they were misclassified as."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_true = ['bike', 'run', 'run', 'walk']\n",
    "y_pred = ['run', 'run', 'bike', 'walk']\n",
    "class_names = ['bike', 'run', 'walk']\n",
    "cm = confusion_matrix(y_true, y_pred, labels=class_names)\n",
    "activity_classifier_utils.PlotConfusionMatrix(cm, class_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Leave-One-Subject-Out Cross Validation\n",
    "\n",
    "You may have seen leave-one-out cross validation. Leave-one-subject-out cross validation is similar.\n",
    "\n",
    "For many biomedical signal applications you have many datapoints per subject. In this case we have 611 datapoints from only 8 subjects. Because there might be a lot of similarity in how an individual performs a specific activity, leaving some of that person's data in the training set and then testing on it might lead us to believe our model generalizes better than it actually would if it encounters a brand new person who it has never seen in the training set. \n",
    "\n",
    "For this reason we do leave-one-subject-out cross validation.  This is why we kept track of which subject each datapoint belonged to in the `subjects` array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import LeaveOneGroupOut"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class_names = np.array(['bike', 'run', 'walk'])\n",
    "logo = LeaveOneGroupOut()\n",
    "cm = np.zeros((3, 3), dtype='int')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for train_ind, test_ind in logo.split(features, labels, subjects):\n",
    "    # For each cross-validation fold...\n",
    "    \n",
    "    # Split up the dataset into a training and test set.\n",
    "    # The test set has all the data from just one subject\n",
    "    X_train, y_train = features[train_ind], labels[train_ind]\n",
    "    X_test, y_test = features[test_ind], labels[test_ind]\n",
    "    \n",
    "    # Train the classifier\n",
    "    clf.fit(X_train, y_train)\n",
    "    \n",
    "    # Run the classifier on the test set\n",
    "    y_pred = clf.predict(X_test)\n",
    "    \n",
    "    # Compute the confusion matrix for the test predictions\n",
    "    c = confusion_matrix(y_test, y_pred, labels=class_names)\n",
    "    \n",
    "    # Aggregate this confusion matrix with the ones from previous\n",
    "    # folds.\n",
    "    cm += c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot Confusion Matrices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class_names = ['bike', 'run', 'walk']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "activity_classifier_utils.PlotConfusionMatrix(cm, class_names,\n",
    "                                              title='classifier performance', normalize=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "activity_classifier_utils.PlotConfusionMatrix(cm, class_names, \n",
    "                                              title='normalized classifier performance',\n",
    "                                              normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We seem to be really good at classifying `run`. We don't really mistake `run` for either `bike` or `walk` and don't misclassify the other classes as `run` often.\n",
    "\n",
    "Our biggest mistake seems to be misclassifying `bike` as `walk`. We do that 42% of the time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute Classification Accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An overall measure of classifier performance is the classification accuracy. This is the percent of time that we make a correct classification. There are other metrics to evaluate classifier performance, and using a single metric can be misleading depending on your dataset. See the further resources section for this lesson to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compute the classification accuracy from the confusion matrix as follows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(np.sum(np.diag(cm)) / np.sum(np.sum(cm)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've build an activity classifier. This is a good first step. Can we do better?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}