{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hyperparameter Tuning And Regularization\n",
    "\n",
    "We ended the last video with a classification accuracy of 77%. However, there are a few more knobs we can turn to improve the performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our growing set of imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy as sp\n",
    "import scipy.signal\n",
    "import scipy.stats\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.model_selection import LeaveOneGroupOut\n",
    "\n",
    "import activity_classifier_utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data and Extract Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fs = 256\n",
    "data = activity_classifier_utils.LoadWristPPGDataset()\n",
    "labels, subjects, features = activity_classifier_utils.GenerateFeatures(data,\n",
    "                                                                        fs,\n",
    "                                                                        window_length_s=10,\n",
    "                                                                        window_shift_s=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter Tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define hyperparameters\n",
    "\n",
    "How many trees should we really use, and how deep should they be? At first we made our best guesses, but now we can explore this hyperparameter space and see how the performance changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_estimators_opt = [2, 10, 20, 50, 100, 150, 300]\n",
    "max_tree_depth_opt = range(2, 7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class_names = np.array(['bike', 'run', 'walk'])\n",
    "logo = LeaveOneGroupOut()\n",
    "accuracy_table = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "for n_estimators, max_tree_depth in itertools.product(n_estimators_opt, max_tree_depth_opt):\n",
    "    # Iterate over each pair of hyperparameters\n",
    "    cm = np.zeros((3, 3), dtype='int')                       # Create a new confusion matrix\n",
    "    clf = RandomForestClassifier(n_estimators=n_estimators,  # and a new classifier for each\n",
    "                                 max_depth=max_tree_depth,   # pair of hyperparameters\n",
    "                                 random_state=42,\n",
    "                                 class_weight='balanced')\n",
    "    for train_ind, test_ind in logo.split(features, labels, subjects):\n",
    "        # Do leave-one-subject-out cross validation as before.\n",
    "        X_train, y_train = features[train_ind], labels[train_ind]\n",
    "        X_test, y_test = features[test_ind], labels[test_ind]\n",
    "        clf.fit(X_train, y_train)\n",
    "        y_pred = clf.predict(X_test)\n",
    "        c = confusion_matrix(y_test, y_pred, labels=class_names)\n",
    "        cm += c\n",
    "    # For each pair of hyperparameters, compute the classification accuracy\n",
    "    classification_accuracy = np.sum(np.diag(cm)) / np.sum(np.sum(cm))\n",
    "    \n",
    "    # Store the hyperparameters and the classification accuracy that resulted\n",
    "    # from the model created with them.\n",
    "    accuracy_table.append((n_estimators, max_tree_depth, classification_accuracy))"
   ]
  },
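  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small aside, the accuracy computation in the loop above can be checked by hand on a toy confusion matrix. The sketch below is only an illustration: the counts are made up (they are not results from this dataset), and the names `toy_cm` and `toy_class_names` are hypothetical. Rows are true classes and columns are predicted classes, so the diagonal holds the correctly classified windows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Made-up counts for illustration only (rows = true class, columns = predicted class).\n",
    "toy_class_names = ['bike', 'run', 'walk']\n",
    "toy_cm = np.array([[50, 0, 10],\n",
    "                   [0, 40, 0],\n",
    "                   [5, 5, 90]])\n",
    "\n",
    "# Overall accuracy: the diagonal (correct windows) divided by all windows.\n",
    "toy_accuracy = np.trace(toy_cm) / toy_cm.sum()\n",
    "\n",
    "# Per-class recall: the fraction of each true class that was predicted correctly.\n",
    "toy_recall = np.diag(toy_cm) / toy_cm.sum(axis=1)\n",
    "\n",
    "print('accuracy = {:0.2f}'.format(toy_accuracy))\n",
    "for name, r in zip(toy_class_names, toy_recall):\n",
    "    print('{} recall = {:0.2f}'.format(name, r))"
   ]
  },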
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_table_df = pd.DataFrame(accuracy_table,\n",
    "                                 columns=['n_estimators', 'max_tree_depth', 'accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_table_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_table_df.loc[accuracy_table_df.accuracy.idxmax()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just by reducing the maximum tree depth to 2, we have significantly increased our classification accuracy, from 77% to 89%. By limiting the depth to 2, we are **regularizing** our model: shallower trees have less capacity to fit noise in the training data. Regularization is an important topic in ML and one of our main tools for avoiding overfitting. This is why we see an increase in the cross-validated performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, we used the entire dataset many times to figure out the optimal hyperparameters. In some sense, this is also overfitting. Our 89% classification accuracy is likely too optimistic and does not reflect the generalized performance. In the next video, we will see what our actual generalized performance might be when we also use our dataset to optimize hyperparameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Nested Cross Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a more accurate idea of the performance, we'd ideally pick the best hyperparameters on a subset of the data, and then evaluate the resulting model on a hold-out set. This is similar to a train-validation-test split. When we don't have enough data to separate our dataset into 3 parts, we can nest the hyperparameter selection in another layer of cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Be patient, this takes a while. On my machine -- 3.3 GHz Intel Core i7 on a MacBook Pro 2016 -- it took less than 8 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class_names = ['bike', 'run', 'walk']\n",
    "\n",
    "# Store the confusion matrix for the outer CV fold.\n",
    "nested_cv_cm = np.zeros((3, 3), dtype='int')\n",
    "splits = 0\n",
    "\n",
    "for train_val_ind, test_ind in logo.split(features, labels, subjects):\n",
    "    # Split the dataset into a test set and a training + validation set.\n",
    "    # Model parameters (the random forest tree nodes) will be trained on the training set.\n",
    "    # Hyperparameters (how many trees and the max depth) will be selected on the validation set.\n",
    "    # Generalization error will be computed on the test set.\n",
    "    X_train_val, y_train_val = features[train_val_ind], labels[train_val_ind]\n",
    "    subjects_train_val = subjects[train_val_ind]\n",
    "    X_test, y_test = features[test_ind], labels[test_ind]\n",
    "    \n",
    "    # Keep track of the best hyperparameters for this training + validation set.\n",
    "    best_hyper_params = None\n",
    "    best_accuracy = 0\n",
    "    \n",
    "    for n_estimators, max_tree_depth in itertools.product(n_estimators_opt,\n",
    "                                                          max_tree_depth_opt):\n",
    "        # Optimize hyperparameters as above.\n",
    "        inner_cm = np.zeros((3, 3), dtype='int')\n",
    "        clf = RandomForestClassifier(n_estimators=n_estimators,\n",
    "                                     max_depth=max_tree_depth,\n",
    "                                     random_state=42,\n",
    "                                     class_weight='balanced')\n",
    "        for train_ind, validation_ind in logo.split(X_train_val, y_train_val,\n",
    "                                                    subjects_train_val):\n",
    "            X_train, y_train = X_train_val[train_ind], y_train_val[train_ind]\n",
    "            X_val, y_val = X_train_val[validation_ind], y_train_val[validation_ind]\n",
    "            clf.fit(X_train, y_train)\n",
    "            y_pred = clf.predict(X_val)\n",
    "            c = confusion_matrix(y_val, y_pred, labels=class_names)\n",
    "            inner_cm += c\n",
    "        classification_accuracy = np.sum(np.diag(inner_cm)) / np.sum(np.sum((inner_cm)))\n",
    "        \n",
    "        # Keep track of the best pair of hyperparameters.\n",
    "        if classification_accuracy > best_accuracy:\n",
    "            best_accuracy = classification_accuracy\n",
    "            best_hyper_params = (n_estimators, max_tree_depth)\n",
    "    \n",
    "    # Create a model with the best pair of hyperparameters for this training + validation set.\n",
    "    best_clf = RandomForestClassifier(n_estimators=best_hyper_params[0],\n",
    "                                      max_depth=best_hyper_params[1],\n",
    "                                      random_state=42,  # same seed as the inner loop, for reproducibility\n",
    "                                      class_weight='balanced')\n",
    "    \n",
    "    # Finally, train this model and test it on the test set.\n",
    "    best_clf.fit(X_train_val, y_train_val)\n",
    "    y_pred = best_clf.predict(X_test)\n",
    "    \n",
    "    # Aggregate confusion matrices for each CV fold.\n",
    "    c = confusion_matrix(y_test, y_pred, labels=class_names)\n",
    "    nested_cv_cm += c\n",
    "    splits += 1\n",
    "    print('Done split {}'.format(splits))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nested cross-validation accuracy computed in the next cell is lower than what we saw before. This is because we are no longer tuning our hyperparameters on the same data we use to evaluate model performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(np.diag(nested_cv_cm)) / np.sum(np.sum(nested_cv_cm))"
   ]
  },
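  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nested loop above can also be sketched more compactly with scikit-learn's `GridSearchCV` as the inner layer. This is a minimal sketch under stated assumptions, not a drop-in replacement: it runs on synthetic stand-in data from `make_classification` with 4 made-up subjects, every `toy_*` name is hypothetical, and the grid is deliberately small so it finishes quickly. Note also that `GridSearchCV` averages the accuracy of each inner fold instead of pooling confusion matrices, so it may not pick exactly the same hyperparameters as the hand-written loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut\n",
    "\n",
    "# Synthetic stand-in data: 120 windows, 8 features, 3 classes, 4 made-up subjects.\n",
    "toy_X, toy_y = make_classification(n_samples=120, n_features=8, n_informative=4,\n",
    "                                   n_classes=3, random_state=0)\n",
    "toy_subjects = np.repeat(np.arange(4), 30)\n",
    "\n",
    "toy_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}\n",
    "n_correct = 0\n",
    "chosen_params = []\n",
    "\n",
    "# Outer loop: hold out one subject at a time as the test set.\n",
    "for train_val_ind, test_ind in LeaveOneGroupOut().split(toy_X, toy_y, toy_subjects):\n",
    "    # Inner loop: GridSearchCV picks hyperparameters with leave-one-subject-out\n",
    "    # cross-validation on the remaining subjects, then refits on all of them.\n",
    "    search = GridSearchCV(RandomForestClassifier(random_state=42, class_weight='balanced'),\n",
    "                          param_grid=toy_grid, cv=LeaveOneGroupOut(), scoring='accuracy')\n",
    "    search.fit(toy_X[train_val_ind], toy_y[train_val_ind],\n",
    "               groups=toy_subjects[train_val_ind])\n",
    "    chosen_params.append(search.best_params_)\n",
    "    # Score the refit model on the held-out subject only.\n",
    "    n_correct += np.sum(search.predict(toy_X[test_ind]) == toy_y[test_ind])\n",
    "\n",
    "toy_nested_accuracy = n_correct / len(toy_y)\n",
    "print('Nested CV accuracy on synthetic data = {:0.2f}'.format(toy_nested_accuracy))"
   ]
  },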
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Importance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to regularize our model and increase performance (besides reducing the tree depth) is to reduce the number of features we use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `RandomForestClassifier` can tell us how important the features are in classifying the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = RandomForestClassifier(n_estimators=100,\n",
    "                             max_depth=4,\n",
    "                             random_state=42,\n",
    "                             class_weight='balanced')\n",
    "activity_classifier_utils.LOSOCVPerformance(features, labels, subjects, clf)\n",
    "clf.feature_importances_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what the 10 most important features are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(list(zip(clf.feature_importances_, activity_classifier_utils.FeatureNames())), reverse=True)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train our original model on just the 10 best features as determined by the `RandomForestClassifier`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_features = sorted(zip(clf.feature_importances_, np.arange(len(clf.feature_importances_))), reverse=True)\n",
    "best_feature_indices = list(zip(*sorted_features))[1]\n",
    "X = features[:, best_feature_indices[:10]]"
   ]
  },
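  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same top-k selection can be written with `np.argsort`, which some readers may find easier to follow than sorting zipped tuples. The sketch below is an illustration on synthetic data from `make_classification`, not on the wrist PPG features; the names `fi_X`, `fi_y`, `fi_clf` and `top_k` are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Synthetic stand-in data: 200 samples, 12 features, only 3 of them informative.\n",
    "fi_X, fi_y = make_classification(n_samples=200, n_features=12, n_informative=3,\n",
    "                                 n_redundant=0, random_state=0)\n",
    "\n",
    "fi_clf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)\n",
    "fi_clf.fit(fi_X, fi_y)\n",
    "\n",
    "# feature_importances_ has one entry per feature; larger means more important.\n",
    "fi_importances = fi_clf.feature_importances_\n",
    "\n",
    "# argsort sorts ascending, so reverse it and keep the first top_k indices.\n",
    "top_k = 5\n",
    "fi_top_indices = np.argsort(fi_importances)[::-1][:top_k]\n",
    "\n",
    "# Keep only the columns for the top_k most important features.\n",
    "fi_X_reduced = fi_X[:, fi_top_indices]\n",
    "print(fi_top_indices, fi_X_reduced.shape)"
   ]
  },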
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cm = activity_classifier_utils.LOSOCVPerformance(X, labels, subjects, clf)\n",
    "activity_classifier_utils.PlotConfusionMatrix(cm, class_names, normalize=True)\n",
    "print('Classification accuracy = {:0.2f}'.format(np.sum(np.diag(cm)) / np.sum(np.sum(cm))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We no longer misclassify `bike` as `walk`. We've improved our classifier performance by 15%, just by picking the most important features! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}