{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "91c37298",
   "metadata": {},
   "source": [
    "# Phase 2 \n",
    "## UK Smokers Prediction ML Project (Predictive Analysis)\n",
    "### 2 June 2024\n",
    "\n",
    "Wong Yi Wei (Ethan) S3966890"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8899f5a9",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [1.0 Introduction](#1)<br>\n",
    "    * [1.1 Phase 1 Summary](#1.1)<br>\n",
    "    * [1.2 Report Overview](#1.2)<br>\n",
    "    * [1.3 Overview of Methodology](#1.3)<br>\n",
    "* [2.0 Data Preparation](#2.0)<br>\n",
    "    * [2.1 Data Import and Consistency Check](#2.1)<br>\n",
    "    * [2.2 Encoding Categorical Features](#2.2)<br>\n",
    "        * [2.2.1 Encoding Target Feature](#2.2.1)<br>\n",
    "        * [2.2.2 Encoding Categorical Descriptive Feature](#2.2.2)<br>\n",
    "        * [2.2.3 Feature Scaling](#2.2.3)<br>\n",
    "* [3.0 Predictive Modeling](#3.0)<br>\n",
    "    * [3.1 Feature Selection](#3.1)<br>\n",
    "        * [3.1.1 Full Set of Features](#3.1.1)<br>\n",
    "        * [3.1.2 F-Score](#3.1.2)<br>\n",
    "        * [3.1.3 Random Forest Importance (RFI)](#3.1.3)<br>\n",
    "        * [3.1.4 spFSR](#3.1.4)<br>\n",
    "        * [3.1.5 Performance Comparison using Paired T-tests](#3.1.5)<br> \n",
    "    * [3.2 Model Fitting and Tuning](#3.2)<br>\n",
    "        * [3.2.1 Data Sampling & Train-Test Splitting](#3.2.1)<br>\n",
    "        * [3.2.2 K-Nearest Neighbors (KNN)](#3.2.2)<br>\n",
    "        * [3.2.3 Decision tree (DT)](#3.2.3)<br>\n",
    "        * [3.2.4 Gaussian Naive Bayes (NB)](#3.2.4)<br>\n",
    "        * [3.2.5 Model Comparison](#3.2.5)<br>\n",
    "* [4.0 Critique and Limitations](#4.0)<br>\n",
    "* [5.0 Summary and Conclusions](#5.0)<br>\n",
    "    * [5.1 Project Summary](#5.1)<br>\n",
    "    * [5.2 Summary of Findings](#5.2)<br>\n",
    "    * [5.3 Conclusion](#5.3)<br>\n",
    "* [6.0 References](#6.0)<br>\n",
    "\n",
    "\n",
    "***\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "94a23419",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "from sklearn import preprocessing\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold\n",
    "import sklearn.metrics as metrics\n",
    "from sklearn import feature_selection as fs\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import StratifiedKFold, GridSearchCV\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.preprocessing import PowerTransformer\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense, Dropout\n",
    "from sklearn.metrics import accuracy_score, roc_auc_score\n",
    "from tensorflow.keras.optimizers import SGD\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "# Show all columns when displaying DataFrames\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9c32206",
   "metadata": {},
   "source": [
    "***\n",
    "# 1.0 Introduction <a class=\"anchor\" id=\"1\"></a>\n",
    "\n",
    "## 1.1 Phase 1 Summary <a class=\"anchor\" id=\"1.1\"></a>\n",
    "\n",
    "In Phase 1 of this project, we obtained the UK smoking dataset from Kaggle, accessed from: \n",
    "[https://www.kaggle.com/datasets/mexwell/uk-smoking-data?resource=download] (MacQuarrie 2024). From that dataset, we produced a clean and tidy dataset for analysis in Phase 2, where we perform predictive modelling to predict smokers in the United Kingdom based on the demographic features in the data. During Phase 1 we also dealt with missing values and performed data exploration and visualisation. Through exploring the data, we found that there were a large number of smokers in the sample, that a higher number of smokers were young, and that some smokers smoked only on weekdays or only on weekends. \n",
    "\n",
    "## 1.2 Report Overview <a class=\"anchor\" id=\"1.2\"></a>\n",
    "\n",
    "In Phase 2, covered in this report, we first import the data from Phase 1 and prepare it further by encoding features where necessary. We then select features using several feature selection methods and compare the methods with paired t-tests to find the best-performing one. Next, we fit several algorithms and tune each as required. Finally, we compare the fitted models on several metrics, using the same training and test data for every model. The goal of this phase is to identify the best model for predictive analysis.\n",
    "\n",
    "## 1.3 Overview of Methodology <a class=\"anchor\" id=\"1.3\"></a>\n",
    "\n",
    "In machine learning, feature selection is important because it identifies the most important features for model performance (*Feature Selection* 2024). We use four feature selection approaches: the full set of features, F-Score, Random Forest Importance (RFI), and spFSR. The full-set approach uses all descriptive features without any selection, so no information is lost. However, it may be computationally demanding for large datasets because every feature is included (Rosidi 2023). F-score is a filter-based method that evaluates each feature independently against the target variable, scoring each one by the strength of its relationship with the target. This method is also used when splitting a classification tree to identify the importance of features (Yeung et al 2023). A limitation of this method is that it does not reveal mutual information among features, that is, information carried by combinations of two or more features (Chen and Lin 2006). Random Forest Importance computes the importance of each feature from the decrease in node impurity, such as information gain, when splitting on that feature (Akmand 2022). Features with higher importance scores are treated as more relevant. spFSR (feature selection and ranking via Simultaneous Perturbation Stochastic Approximation) is a stochastic optimization-based feature selection method that seeks to maximise classification accuracy by perturbing the feature subset (Akmand 2022). It searches for a locally optimal set of features using a performance measure such as the accuracy rate.\n",
    "\n",
    "K-Nearest Neighbours (KNN) is an instance-based learning algorithm that classifies a new data point by the majority class of its k nearest neighbours in the feature space (*What is k-nearest neighbors (KNN) algorithm?* n.d.). It first computes distances using a metric such as Euclidean, Manhattan, or Minkowski, and then selects the k closest points. A Decision Tree is a non-parametric supervised learning algorithm whose structure starts at a root node and branches into decision nodes (*What is a decision tree?* n.d.). It selects the best feature at each node and splits the data based on an impurity measure such as Gini impurity or entropy. Gaussian Naïve Bayes is a probabilistic classification technique which assumes that, within each class, each feature follows a normal distribution and contributes independently to the prediction (Martins 2023). It estimates the prior probability of each class, together with the mean and variance of each feature per class, from the training data. Its tunable parameter is variance smoothing, rather than the Laplace smoothing used by other Naive Bayes variants.\n",
    "\n",
    "A paired t-test is a statistical method for determining whether there is a significant difference between the means of two paired groups (Gleichmann N 2020). If the p-value of the test is smaller than the significance level of 0.05, there is a statistically significant difference between the two groups. A confusion matrix summarises the performance of a classification model by comparing the predicted values with the true values (Sharma et al. 2022). It contains true positives (positive cases correctly predicted as positive), false positives (negative cases incorrectly predicted as positive), true negatives (negative cases correctly predicted as negative), and false negatives (positive cases incorrectly predicted as negative). A classification report is a summary of the classification metrics for each class within the machine learning model (*What is the difference between a confusion matrix and a classification report?* n.d.). This includes recall, the proportion of actual positive samples that the model correctly identified (Evidently n.d.). The F1 score in the classification report is the harmonic mean of precision and recall and is used to evaluate model performance (Sharma 2023).\n",
    "\n",
    "***\n",
    "# 2.0 Data Preparation <a class=\"anchor\" id=\"2.0\"></a>\n",
    "\n",
    "## 2.1 Data Import and Consistency Check <a class=\"anchor\" id=\"2.1\"></a>\n",
    "\n",
    "Before any predictive modelling, we import the data from Phase 1, which was exported to a CSV file, and check the data structure and types to ensure consistency with Phase 1 of the report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a2ce621b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>marital_status</th>\n",
       "      <th>highest_qualification</th>\n",
       "      <th>nationality</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>gross_income</th>\n",
       "      <th>region</th>\n",
       "      <th>smoke</th>\n",
       "      <th>amt_weekends</th>\n",
       "      <th>amt_weekdays</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Male</td>\n",
       "      <td>Young</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>No Qualification</td>\n",
       "      <td>British</td>\n",
       "      <td>White</td>\n",
       "      <td>2,600 to 5,200</td>\n",
       "      <td>The North</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Female</td>\n",
       "      <td>Middle-Aged</td>\n",
       "      <td>Single</td>\n",
       "      <td>No Qualification</td>\n",
       "      <td>British</td>\n",
       "      <td>White</td>\n",
       "      <td>Under 2,600</td>\n",
       "      <td>The North</td>\n",
       "      <td>True</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Male</td>\n",
       "      <td>Middle-Aged</td>\n",
       "      <td>Married</td>\n",
       "      <td>Degree</td>\n",
       "      <td>English</td>\n",
       "      <td>White</td>\n",
       "      <td>28,600 to 36,400</td>\n",
       "      <td>The North</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Female</td>\n",
       "      <td>Middle-Aged</td>\n",
       "      <td>Married</td>\n",
       "      <td>Degree</td>\n",
       "      <td>English</td>\n",
       "      <td>White</td>\n",
       "      <td>10,400 to 15,600</td>\n",
       "      <td>The North</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Female</td>\n",
       "      <td>Young</td>\n",
       "      <td>Married</td>\n",
       "      <td>GCSE/O Level</td>\n",
       "      <td>British</td>\n",
       "      <td>White</td>\n",
       "      <td>2,600 to 5,200</td>\n",
       "      <td>The North</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   gender          age marital_status highest_qualification nationality  \\\n",
       "0    Male        Young       Divorced      No Qualification     British   \n",
       "1  Female  Middle-Aged         Single      No Qualification     British   \n",
       "2    Male  Middle-Aged        Married                Degree     English   \n",
       "3  Female  Middle-Aged        Married                Degree     English   \n",
       "4  Female        Young        Married          GCSE/O Level     British   \n",
       "\n",
       "  ethnicity      gross_income     region  smoke  amt_weekends  amt_weekdays  \n",
       "0     White    2,600 to 5,200  The North  False             0             0  \n",
       "1     White       Under 2,600  The North   True            12            12  \n",
       "2     White  28,600 to 36,400  The North  False             0             0  \n",
       "3     White  10,400 to 15,600  The North  False             0             0  \n",
       "4     White    2,600 to 5,200  The North  False             0             0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Import Data\n",
    "smoke=pd.read_csv(\"Phase2.csv\")\n",
    "smoke.iloc[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ceb346cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1561, 11)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Shape\n",
    "smoke.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "311785ae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>marital_status</th>\n",
       "      <th>highest_qualification</th>\n",
       "      <th>nationality</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>gross_income</th>\n",
       "      <th>region</th>\n",
       "      <th>smoke</th>\n",
       "      <th>amt_weekends</th>\n",
       "      <th>amt_weekdays</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561</td>\n",
       "      <td>1561.000000</td>\n",
       "      <td>1561.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>Female</td>\n",
       "      <td>Young</td>\n",
       "      <td>Married</td>\n",
       "      <td>No Qualification</td>\n",
       "      <td>English</td>\n",
       "      <td>White</td>\n",
       "      <td>5,200 to 10,400</td>\n",
       "      <td>The North</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>885</td>\n",
       "      <td>553</td>\n",
       "      <td>759</td>\n",
       "      <td>523</td>\n",
       "      <td>773</td>\n",
       "      <td>1457</td>\n",
       "      <td>394</td>\n",
       "      <td>400</td>\n",
       "      <td>1166</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.187060</td>\n",
       "      <td>3.487508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.757758</td>\n",
       "      <td>7.666444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>60.000000</td>\n",
       "      <td>55.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        gender    age marital_status highest_qualification nationality  \\\n",
       "count     1561   1561           1561                  1561        1561   \n",
       "unique       2      3              5                     8           6   \n",
       "top     Female  Young        Married      No Qualification     English   \n",
       "freq       885    553            759                   523         773   \n",
       "mean       NaN    NaN            NaN                   NaN         NaN   \n",
       "std        NaN    NaN            NaN                   NaN         NaN   \n",
       "min        NaN    NaN            NaN                   NaN         NaN   \n",
       "25%        NaN    NaN            NaN                   NaN         NaN   \n",
       "50%        NaN    NaN            NaN                   NaN         NaN   \n",
       "75%        NaN    NaN            NaN                   NaN         NaN   \n",
       "max        NaN    NaN            NaN                   NaN         NaN   \n",
       "\n",
       "       ethnicity     gross_income     region  smoke  amt_weekends  \\\n",
       "count       1561             1561       1561   1561   1561.000000   \n",
       "unique         5                8          7      2           NaN   \n",
       "top        White  5,200 to 10,400  The North  False           NaN   \n",
       "freq        1457              394        400   1166           NaN   \n",
       "mean         NaN              NaN        NaN    NaN      4.187060   \n",
       "std          NaN              NaN        NaN    NaN      8.757758   \n",
       "min          NaN              NaN        NaN    NaN      0.000000   \n",
       "25%          NaN              NaN        NaN    NaN      0.000000   \n",
       "50%          NaN              NaN        NaN    NaN      0.000000   \n",
       "75%          NaN              NaN        NaN    NaN      0.000000   \n",
       "max          NaN              NaN        NaN    NaN     60.000000   \n",
       "\n",
       "        amt_weekdays  \n",
       "count    1561.000000  \n",
       "unique           NaN  \n",
       "top              NaN  \n",
       "freq             NaN  \n",
       "mean        3.487508  \n",
       "std         7.666444  \n",
       "min         0.000000  \n",
       "25%         0.000000  \n",
       "50%         0.000000  \n",
       "75%         0.000000  \n",
       "max        55.000000  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Summary Statistics\n",
    "smoke.describe(include='all')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ab364e86",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gender \n",
      " ['Male' 'Female'] \n",
      "\n",
      "age \n",
      " ['Young' 'Middle-Aged' 'Old'] \n",
      "\n",
      "marital_status \n",
      " ['Divorced' 'Single' 'Married' 'Widowed' 'Separated'] \n",
      "\n",
      "highest_qualification \n",
      " ['No Qualification' 'Degree' 'GCSE/O Level' 'GCSE/CSE' 'Other/Sub Degree'\n",
      " 'Higher/Sub Degree' 'ONC/BTEC' 'A Levels'] \n",
      "\n",
      "nationality \n",
      " ['British' 'English' 'Scottish' 'Other' 'Welsh' 'Irish'] \n",
      "\n",
      "ethnicity \n",
      " ['White' 'Mixed' 'Black' 'Asian' 'Chinese'] \n",
      "\n",
      "gross_income \n",
      " ['2,600 to 5,200' 'Under 2,600' '28,600 to 36,400' '10,400 to 15,600'\n",
      " '15,600 to 20,800' 'Above 36,400' '5,200 to 10,400' '20,800 to 28,600'] \n",
      "\n",
      "region \n",
      " ['The North' 'Midlands & East Anglia' 'London' 'South East' 'South West'\n",
      " 'Wales' 'Scotland'] \n",
      "\n",
      "smoke \n",
      " [False  True] \n",
      "\n",
      "amt_weekends \n",
      " [ 0 12  6  8 15  5 20 25  4 30 10 40  9  7  2 50 16 35 18  1  3 60 24 45] \n",
      "\n",
      "amt_weekdays \n",
      " [ 0 12  6  8  2 20 15 25  4 10 30  3 40  9  5 50  7 18 35  1 55 16 24 45] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "#Unique Values\n",
    "colnames=[\"gender\",\"age\",\"marital_status\",\"highest_qualification\",\n",
    "          \"nationality\",\"ethnicity\",\"gross_income\",\"region\",\"smoke\",\n",
    "         \"amt_weekends\",\"amt_weekdays\"]\n",
    "\n",
    "for i in colnames:\n",
    "    print(i,'\\n',smoke[i].unique(),'\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "04d3f588",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Missing Values: \n",
      " gender                   0\n",
      "age                      0\n",
      "marital_status           0\n",
      "highest_qualification    0\n",
      "nationality              0\n",
      "ethnicity                0\n",
      "gross_income             0\n",
      "region                   0\n",
      "smoke                    0\n",
      "amt_weekends             0\n",
      "amt_weekdays             0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "#Missing Values\n",
    "print(\"Missing Values:\",'\\n',smoke.isnull().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b6bca579",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "gender                   object\n",
       "age                      object\n",
       "marital_status           object\n",
       "highest_qualification    object\n",
       "nationality              object\n",
       "ethnicity                object\n",
       "gross_income             object\n",
       "region                   object\n",
       "smoke                      bool\n",
       "amt_weekends              int64\n",
       "amt_weekdays              int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Data Types\n",
    "smoke.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "14cc1007",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "gender                     object\n",
       "age                      category\n",
       "marital_status           category\n",
       "highest_qualification    category\n",
       "nationality              category\n",
       "ethnicity                  object\n",
       "gross_income             category\n",
       "region                   category\n",
       "smoke                        bool\n",
       "amt_weekends                int64\n",
       "amt_weekdays                int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Data Type Conversion\n",
    "#String Conversion\n",
    "smoke['gender']=smoke['gender'].astype('str')\n",
    "\n",
    "#Category Conversion\n",
    "cat=['marital_status','nationality','region']\n",
    "\n",
    "for i in cat:\n",
    "    smoke[i]=smoke[i].astype('category')\n",
    "\n",
    "from pandas.api.types import CategoricalDtype\n",
    "\n",
    "age_cat=CategoricalDtype(categories=['Young','Middle-Aged',\"Old\"], ordered = True)\n",
    "\n",
    "smoke['age']=smoke['age'].astype(age_cat)\n",
    "\n",
    "quali=CategoricalDtype(categories=[\"No Qualification\",\"GCSE/CSE\",\"GCSE/O Level\",\"ONC/BTEC\",\"A Levels\",\n",
    "                                   \"Other/Sub Degree\",\"Degree\",\"Higher/Sub Degree\"], ordered = True)\n",
    "\n",
    "smoke['highest_qualification']=smoke['highest_qualification'].astype(quali)\n",
    "\n",
    "income=CategoricalDtype(categories=[\"Under 2,600\",\"2,600 to 5,200\",\"5,200 to 10,400\",\"10,400 to 15,600\",\n",
    "                                    \"15,600 to 20,800\",\"20,800 to 28,600\",\"28,600 to 36,400\",\n",
    "                                    \"Above 36,400\"], ordered = True)\n",
    "\n",
    "smoke['gross_income']=smoke['gross_income'].astype(income)\n",
    "\n",
    "#Check Conversion\n",
    "smoke.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0edc65ac",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "English     49.520\n",
       "British     32.735\n",
       "Scottish     8.584\n",
       "Other        4.228\n",
       "Welsh        3.587\n",
       "Irish        1.345\n",
       "Name: nationality, dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Check value spread for nationality\n",
    "smoke['nationality'].value_counts(normalize=True).mul(100).round(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9fbcfa35",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "White      93.338\n",
       "Asian       2.434\n",
       "Black       2.050\n",
       "Chinese     1.281\n",
       "Mixed       0.897\n",
       "Name: ethnicity, dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Check values spread for ethnicity\n",
    "smoke['ethnicity'].value_counts(normalize=True).mul(100).round(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c4755a3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Midlands & East Anglia    25.625\n",
       "The North                 25.625\n",
       "South East                14.798\n",
       "London                    11.019\n",
       "South West                 9.353\n",
       "Scotland                   9.033\n",
       "Wales                      4.548\n",
       "Name: region, dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Check value spread for region\n",
    "smoke['region'].value_counts(normalize=True).mul(100).round(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "72db458a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['White', 'Other']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Change ethnicity values to 'Other'\n",
    "smoke.loc[smoke['ethnicity'] != \"White\", 'ethnicity'] = 'Other'\n",
    "smoke['ethnicity']=smoke['ethnicity'].astype('category')\n",
    "smoke['ethnicity'].unique().tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1b24508",
   "metadata": {},
   "source": [
    "From the code cells above, we have successfully imported the data from the Phase 1 report labelled ```smoke.csv```. Data import was imported without any errors such as loss of data or missing values. However, upon checking the data types of all columns, columns ```gender```, ```age```, ```marital_status```, ```highest_qualification```, ```nationality```, ```ethnicity```, ```gross_income```, and ```region``` are incorrect and has been addressed appropriately.\n",
    "\n",
    "Additionally, since we have not checked the value spread for the categorical columns ```nationality```, ```ethnicity```, and ```region```, we have checked and found that for ```ethnicity```, more than 90% of the respondents were white, and this issue is addressed by replacing all other respondents as 'Other'.\n",
    "\n",
    "## 2.2 Encoding Categorical Features <a class=\"anchor\" id=\"2.2\"></a>\n",
    "\n",
    "Now that changes are made as appropriate, we will now encode all categorical features as it is essential to encode both the target and descriptive features into numerical features. That is, encoding the categorical values with numerical values. \n",
    "\n",
    "### 2.2.1 Encoding Target Feature <a class=\"anchor\" id=\"2.2.1\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8be5b531",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False    1166\n",
       "True      395\n",
       "Name: smoke, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Encode target feature\n",
    "Data = smoke.drop(columns='smoke')\n",
    "target = smoke['smoke']\n",
    "target.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "58d057e9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False    1166\n",
       "True      395\n",
       "Name: smoke, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Replace with binary values\n",
    "target = target.replace({\"False\": 0, \"True\": 1})\n",
    "target.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c55ebf30",
   "metadata": {},
   "source": [
    "### 2.2.2 Encoding Categorical Descriptive Feature <a class=\"anchor\" id=\"2.2.2\"></a>\n",
    "\n",
    "In this section we will encode the categorical descriptive features by using one-hot encoding. Furthermore, we will also define the dummy variables for categorical descriptive variables with levels for feature selection."
   ]
  },
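  {
   "cell_type": "markdown",
   "id": "b7e41c02",
   "metadata": {},
   "source": [
    "The encoding pattern described above can be sketched on a toy data frame. This is a minimal illustration with hypothetical values (not the smoke data): a two-level feature is reduced to a single 0/1 column with ```drop_first=True```, and a multi-level feature is expanded into one indicator column per level.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame (hypothetical values, not taken from smoke.csv)\n",
    "toy = pd.DataFrame({\n",
    "    'gender': ['Male', 'Female', 'Female'],\n",
    "    'region': ['Wales', 'London', 'Scotland'],\n",
    "})\n",
    "\n",
    "# Two-level feature: keep one indicator column (the first level is dropped)\n",
    "gender_flag = pd.get_dummies(toy['gender'], drop_first=True).iloc[:, 0].astype(int)\n",
    "\n",
    "# Multi-level feature: one indicator column per level\n",
    "region_flags = pd.get_dummies(toy['region'], prefix='region').astype(int)\n",
    "\n",
    "encoded = pd.concat([gender_flag.rename('gender'), region_flags], axis=1)\n",
    "print(encoded.columns.tolist())\n",
    "```\n",
    "\n",
    "Keeping every level of a multi-level feature means each row has exactly one indicator set to 1 within that feature."
   ]
  },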
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8b253dc9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['gender']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Show categorical columns\n",
    "categorical_cols = Data.columns[Data.dtypes == object].tolist()\n",
    "categorical_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "13a7b971",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['gender', 'amt_weekends', 'amt_weekdays', 'age_Young',\n",
       "       'age_Middle-Aged', 'age_Old', 'marital_status_Divorced',\n",
       "       'marital_status_Married', 'marital_status_Separated',\n",
       "       'marital_status_Single', 'marital_status_Widowed',\n",
       "       'highest_qualification_No Qualification',\n",
       "       'highest_qualification_GCSE/CSE', 'highest_qualification_GCSE/O Level',\n",
       "       'highest_qualification_ONC/BTEC', 'highest_qualification_A Levels',\n",
       "       'highest_qualification_Other/Sub Degree',\n",
       "       'highest_qualification_Degree',\n",
       "       'highest_qualification_Higher/Sub Degree', 'nationality_British',\n",
       "       'nationality_English', 'nationality_Irish', 'nationality_Other',\n",
       "       'nationality_Scottish', 'nationality_Welsh', 'ethnicity_Other',\n",
       "       'ethnicity_White', 'gross_income_Under 2,600',\n",
       "       'gross_income_2,600 to 5,200', 'gross_income_5,200 to 10,400',\n",
       "       'gross_income_10,400 to 15,600', 'gross_income_15,600 to 20,800',\n",
       "       'gross_income_20,800 to 28,600', 'gross_income_28,600 to 36,400',\n",
       "       'gross_income_Above 36,400', 'region_London',\n",
       "       'region_Midlands & East Anglia', 'region_Scotland', 'region_South East',\n",
       "       'region_South West', 'region_The North', 'region_Wales'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#One-hot encoding\n",
    "for col in categorical_cols:\n",
    "    if (Data[col].nunique() == 2):\n",
    "        Data[col] = pd.get_dummies(Data[col], drop_first = True)\n",
    "        \n",
    "Data = pd.get_dummies(Data)\n",
    "\n",
    "Data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "e280eb7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>amt_weekends</th>\n",
       "      <th>amt_weekdays</th>\n",
       "      <th>age_Young</th>\n",
       "      <th>age_Middle-Aged</th>\n",
       "      <th>age_Old</th>\n",
       "      <th>marital_status_Divorced</th>\n",
       "      <th>marital_status_Married</th>\n",
       "      <th>marital_status_Separated</th>\n",
       "      <th>marital_status_Single</th>\n",
       "      <th>marital_status_Widowed</th>\n",
       "      <th>highest_qualification_No Qualification</th>\n",
       "      <th>highest_qualification_GCSE/CSE</th>\n",
       "      <th>highest_qualification_GCSE/O Level</th>\n",
       "      <th>highest_qualification_ONC/BTEC</th>\n",
       "      <th>highest_qualification_A Levels</th>\n",
       "      <th>highest_qualification_Other/Sub Degree</th>\n",
       "      <th>highest_qualification_Degree</th>\n",
       "      <th>highest_qualification_Higher/Sub Degree</th>\n",
       "      <th>nationality_British</th>\n",
       "      <th>nationality_English</th>\n",
       "      <th>nationality_Irish</th>\n",
       "      <th>nationality_Other</th>\n",
       "      <th>nationality_Scottish</th>\n",
       "      <th>nationality_Welsh</th>\n",
       "      <th>ethnicity_Other</th>\n",
       "      <th>ethnicity_White</th>\n",
       "      <th>gross_income_Under 2,600</th>\n",
       "      <th>gross_income_2,600 to 5,200</th>\n",
       "      <th>gross_income_5,200 to 10,400</th>\n",
       "      <th>gross_income_10,400 to 15,600</th>\n",
       "      <th>gross_income_15,600 to 20,800</th>\n",
       "      <th>gross_income_20,800 to 28,600</th>\n",
       "      <th>gross_income_28,600 to 36,400</th>\n",
       "      <th>gross_income_Above 36,400</th>\n",
       "      <th>region_London</th>\n",
       "      <th>region_Midlands &amp; East Anglia</th>\n",
       "      <th>region_Scotland</th>\n",
       "      <th>region_South East</th>\n",
       "      <th>region_South West</th>\n",
       "      <th>region_The North</th>\n",
       "      <th>region_Wales</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>770</th>\n",
       "      <td>1</td>\n",
       "      <td>30</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1367</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>571</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1050</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      gender  amt_weekends  amt_weekdays  age_Young  age_Middle-Aged  age_Old  \\\n",
       "770        1            30            30          0                0        1   \n",
       "541        0             0             0          1                0        0   \n",
       "1367       0             5             2          1                0        0   \n",
       "571        1             0             0          1                0        0   \n",
       "1050       0             5             5          1                0        0   \n",
       "\n",
       "      marital_status_Divorced  marital_status_Married  \\\n",
       "770                         0                       0   \n",
       "541                         0                       1   \n",
       "1367                        0                       1   \n",
       "571                         0                       0   \n",
       "1050                        0                       0   \n",
       "\n",
       "      marital_status_Separated  marital_status_Single  marital_status_Widowed  \\\n",
       "770                          0                      0                       1   \n",
       "541                          0                      0                       0   \n",
       "1367                         0                      0                       0   \n",
       "571                          0                      1                       0   \n",
       "1050                         0                      1                       0   \n",
       "\n",
       "      highest_qualification_No Qualification  highest_qualification_GCSE/CSE  \\\n",
       "770                                        1                               0   \n",
       "541                                        0                               0   \n",
       "1367                                       0                               0   \n",
       "571                                        1                               0   \n",
       "1050                                       0                               0   \n",
       "\n",
       "      highest_qualification_GCSE/O Level  highest_qualification_ONC/BTEC  \\\n",
       "770                                    0                               0   \n",
       "541                                    0                               0   \n",
       "1367                                   0                               0   \n",
       "571                                    0                               0   \n",
       "1050                                   0                               0   \n",
       "\n",
       "      highest_qualification_A Levels  highest_qualification_Other/Sub Degree  \\\n",
       "770                                0                                       0   \n",
       "541                                0                                       0   \n",
       "1367                               1                                       0   \n",
       "571                                0                                       0   \n",
       "1050                               1                                       0   \n",
       "\n",
       "      highest_qualification_Degree  highest_qualification_Higher/Sub Degree  \\\n",
       "770                              0                                        0   \n",
       "541                              1                                        0   \n",
       "1367                             0                                        0   \n",
       "571                              0                                        0   \n",
       "1050                             0                                        0   \n",
       "\n",
       "      nationality_British  nationality_English  nationality_Irish  \\\n",
       "770                     0                    1                  0   \n",
       "541                     0                    1                  0   \n",
       "1367                    1                    0                  0   \n",
       "571                     0                    1                  0   \n",
       "1050                    1                    0                  0   \n",
       "\n",
       "      nationality_Other  nationality_Scottish  nationality_Welsh  \\\n",
       "770                   0                     0                  0   \n",
       "541                   0                     0                  0   \n",
       "1367                  0                     0                  0   \n",
       "571                   0                     0                  0   \n",
       "1050                  0                     0                  0   \n",
       "\n",
       "      ethnicity_Other  ethnicity_White  gross_income_Under 2,600  \\\n",
       "770                 0                1                         0   \n",
       "541                 0                1                         0   \n",
       "1367                0                1                         0   \n",
       "571                 0                1                         0   \n",
       "1050                0                1                         1   \n",
       "\n",
       "      gross_income_2,600 to 5,200  gross_income_5,200 to 10,400  \\\n",
       "770                             0                             0   \n",
       "541                             0                             1   \n",
       "1367                            1                             0   \n",
       "571                             0                             1   \n",
       "1050                            0                             0   \n",
       "\n",
       "      gross_income_10,400 to 15,600  gross_income_15,600 to 20,800  \\\n",
       "770                               0                              0   \n",
       "541                               0                              0   \n",
       "1367                              0                              0   \n",
       "571                               0                              0   \n",
       "1050                              0                              0   \n",
       "\n",
       "      gross_income_20,800 to 28,600  gross_income_28,600 to 36,400  \\\n",
       "770                               1                              0   \n",
       "541                               0                              0   \n",
       "1367                              0                              0   \n",
       "571                               0                              0   \n",
       "1050                              0                              0   \n",
       "\n",
       "      gross_income_Above 36,400  region_London  region_Midlands & East Anglia  \\\n",
       "770                           0              0                              1   \n",
       "541                           0              0                              1   \n",
       "1367                          0              0                              0   \n",
       "571                           0              0                              1   \n",
       "1050                          0              0                              0   \n",
       "\n",
       "      region_Scotland  region_South East  region_South West  region_The North  \\\n",
       "770                 0                  0                  0                 0   \n",
       "541                 0                  0                  0                 0   \n",
       "1367                0                  0                  0                 0   \n",
       "571                 0                  0                  0                 0   \n",
       "1050                0                  1                  0                 0   \n",
       "\n",
       "      region_Wales  \n",
       "770              0  \n",
       "541              0  \n",
       "1367             1  \n",
       "571              0  \n",
       "1050             0  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Random sample for Data\n",
    "Data.sample(5, random_state=999)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bbc9191",
   "metadata": {},
   "source": [
    "### 2.2.3 Feature Scaling <a class=\"anchor\" id=\"2.2.3\"></a>\n",
    "\n",
    "In this section we will perform a min-max scaling of descriptive features, and making a copy of the data for future column name references as the data will be converted into ```numPy``` array."
   ]
  },
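  {
   "cell_type": "markdown",
   "id": "c3d95a17",
   "metadata": {},
   "source": [
    "To make the scaling step concrete, the sketch below applies the min-max formula by hand to a toy data frame with hypothetical values (not the smoke data). It uses plain NumPy rather than a library scaler, and it assumes every column has at least two distinct values so that the range is non-zero.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame (hypothetical values, not taken from smoke.csv)\n",
    "toy = pd.DataFrame({'amt_weekends': [0, 5, 30, 60], 'gender': [0, 1, 1, 0]})\n",
    "\n",
    "# Keep the column names, since a NumPy array does not carry them\n",
    "col_names = toy.columns.tolist()\n",
    "\n",
    "# Min-max scaling: x' = (x - min) / (max - min), applied per column\n",
    "values = toy.to_numpy(dtype=float)\n",
    "col_min = values.min(axis=0)\n",
    "col_range = values.max(axis=0) - col_min\n",
    "scaled = (values - col_min) / col_range\n",
    "\n",
    "# Rebuild a labelled frame from the array when the names are needed again\n",
    "scaled_df = pd.DataFrame(scaled, columns=col_names)\n",
    "print(scaled_df.round(3))\n",
    "```\n",
    "\n",
    "Columns that are already 0/1 indicators are left unchanged by this transformation, while count-like columns are mapped onto the interval from 0 to 1."
   ]
  },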
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4ffb9fe1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>amt_weekends</th>\n",
       "      <th>amt_weekdays</th>\n",
       "      <th>age_Young</th>\n",
       "      <th>age_Middle-Aged</th>\n",
       "      <th>age_Old</th>\n",
       "      <th>marital_status_Divorced</th>\n",
       "      <th>marital_status_Married</th>\n",
       "      <th>marital_status_Separated</th>\n",
       "      <th>marital_status_Single</th>\n",
       "      <th>marital_status_Widowed</th>\n",
       "      <th>highest_qualification_No Qualification</th>\n",
       "      <th>highest_qualification_GCSE/CSE</th>\n",
       "      <th>highest_qualification_GCSE/O Level</th>\n",
       "      <th>highest_qualification_ONC/BTEC</th>\n",
       "      <th>highest_qualification_A Levels</th>\n",
       "      <th>highest_qualification_Other/Sub Degree</th>\n",
       "      <th>highest_qualification_Degree</th>\n",
       "      <th>highest_qualification_Higher/Sub Degree</th>\n",
       "      <th>nationality_British</th>\n",
       "      <th>nationality_English</th>\n",
       "      <th>nationality_Irish</th>\n",
       "      <th>nationality_Other</th>\n",
       "      <th>nationality_Scottish</th>\n",
       "      <th>nationality_Welsh</th>\n",
       "      <th>ethnicity_Other</th>\n",
       "      <th>ethnicity_White</th>\n",
       "      <th>gross_income_Under 2,600</th>\n",
       "      <th>gross_income_2,600 to 5,200</th>\n",
       "      <th>gross_income_5,200 to 10,400</th>\n",
       "      <th>gross_income_10,400 to 15,600</th>\n",
       "      <th>gross_income_15,600 to 20,800</th>\n",
       "      <th>gross_income_20,800 to 28,600</th>\n",
       "      <th>gross_income_28,600 to 36,400</th>\n",
       "      <th>gross_income_Above 36,400</th>\n",
       "      <th>region_London</th>\n",
       "      <th>region_Midlands &amp; East Anglia</th>\n",
       "      <th>region_Scotland</th>\n",
       "      <th>region_South East</th>\n",
       "      <th>region_South West</th>\n",
       "      <th>region_The North</th>\n",
       "      <th>region_Wales</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>770</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.545455</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1367</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.083333</td>\n",
       "      <td>0.036364</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>571</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1050</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.083333</td>\n",
       "      <td>0.090909</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      gender  amt_weekends  amt_weekdays  age_Young  age_Middle-Aged  age_Old  \\\n",
       "770      1.0      0.500000      0.545455        0.0              0.0      1.0   \n",
       "541      0.0      0.000000      0.000000        1.0              0.0      0.0   \n",
       "1367     0.0      0.083333      0.036364        1.0              0.0      0.0   \n",
       "571      1.0      0.000000      0.000000        1.0              0.0      0.0   \n",
       "1050     0.0      0.083333      0.090909        1.0              0.0      0.0   \n",
       "\n",
       "      marital_status_Divorced  marital_status_Married  \\\n",
       "770                       0.0                     0.0   \n",
       "541                       0.0                     1.0   \n",
       "1367                      0.0                     1.0   \n",
       "571                       0.0                     0.0   \n",
       "1050                      0.0                     0.0   \n",
       "\n",
       "      marital_status_Separated  marital_status_Single  marital_status_Widowed  \\\n",
       "770                        0.0                    0.0                     1.0   \n",
       "541                        0.0                    0.0                     0.0   \n",
       "1367                       0.0                    0.0                     0.0   \n",
       "571                        0.0                    1.0                     0.0   \n",
       "1050                       0.0                    1.0                     0.0   \n",
       "\n",
       "      highest_qualification_No Qualification  highest_qualification_GCSE/CSE  \\\n",
       "770                                      1.0                             0.0   \n",
       "541                                      0.0                             0.0   \n",
       "1367                                     0.0                             0.0   \n",
       "571                                      1.0                             0.0   \n",
       "1050                                     0.0                             0.0   \n",
       "\n",
       "      highest_qualification_GCSE/O Level  highest_qualification_ONC/BTEC  \\\n",
       "770                                  0.0                             0.0   \n",
       "541                                  0.0                             0.0   \n",
       "1367                                 0.0                             0.0   \n",
       "571                                  0.0                             0.0   \n",
       "1050                                 0.0                             0.0   \n",
       "\n",
       "      highest_qualification_A Levels  highest_qualification_Other/Sub Degree  \\\n",
       "770                              0.0                                     0.0   \n",
       "541                              0.0                                     0.0   \n",
       "1367                             1.0                                     0.0   \n",
       "571                              0.0                                     0.0   \n",
       "1050                             1.0                                     0.0   \n",
       "\n",
       "      highest_qualification_Degree  highest_qualification_Higher/Sub Degree  \\\n",
       "770                            0.0                                      0.0   \n",
       "541                            1.0                                      0.0   \n",
       "1367                           0.0                                      0.0   \n",
       "571                            0.0                                      0.0   \n",
       "1050                           0.0                                      0.0   \n",
       "\n",
       "      nationality_British  nationality_English  nationality_Irish  \\\n",
       "770                   0.0                  1.0                0.0   \n",
       "541                   0.0                  1.0                0.0   \n",
       "1367                  1.0                  0.0                0.0   \n",
       "571                   0.0                  1.0                0.0   \n",
       "1050                  1.0                  0.0                0.0   \n",
       "\n",
       "      nationality_Other  nationality_Scottish  nationality_Welsh  \\\n",
       "770                 0.0                   0.0                0.0   \n",
       "541                 0.0                   0.0                0.0   \n",
       "1367                0.0                   0.0                0.0   \n",
       "571                 0.0                   0.0                0.0   \n",
       "1050                0.0                   0.0                0.0   \n",
       "\n",
       "      ethnicity_Other  ethnicity_White  gross_income_Under 2,600  \\\n",
       "770               0.0              1.0                       0.0   \n",
       "541               0.0              1.0                       0.0   \n",
       "1367              0.0              1.0                       0.0   \n",
       "571               0.0              1.0                       0.0   \n",
       "1050              0.0              1.0                       1.0   \n",
       "\n",
       "      gross_income_2,600 to 5,200  gross_income_5,200 to 10,400  \\\n",
       "770                           0.0                           0.0   \n",
       "541                           0.0                           1.0   \n",
       "1367                          1.0                           0.0   \n",
       "571                           0.0                           1.0   \n",
       "1050                          0.0                           0.0   \n",
       "\n",
       "      gross_income_10,400 to 15,600  gross_income_15,600 to 20,800  \\\n",
       "770                             0.0                            0.0   \n",
       "541                             0.0                            0.0   \n",
       "1367                            0.0                            0.0   \n",
       "571                             0.0                            0.0   \n",
       "1050                            0.0                            0.0   \n",
       "\n",
       "      gross_income_20,800 to 28,600  gross_income_28,600 to 36,400  \\\n",
       "770                             1.0                            0.0   \n",
       "541                             0.0                            0.0   \n",
       "1367                            0.0                            0.0   \n",
       "571                             0.0                            0.0   \n",
       "1050                            0.0                            0.0   \n",
       "\n",
       "      gross_income_Above 36,400  region_London  region_Midlands & East Anglia  \\\n",
       "770                         0.0            0.0                            1.0   \n",
       "541                         0.0            0.0                            1.0   \n",
       "1367                        0.0            0.0                            0.0   \n",
       "571                         0.0            0.0                            1.0   \n",
       "1050                        0.0            0.0                            0.0   \n",
       "\n",
       "      region_Scotland  region_South East  region_South West  region_The North  \\\n",
       "770               0.0                0.0                0.0               0.0   \n",
       "541               0.0                0.0                0.0               0.0   \n",
       "1367              0.0                0.0                0.0               0.0   \n",
       "571               0.0                0.0                0.0               0.0   \n",
       "1050              0.0                1.0                0.0               0.0   \n",
       "\n",
       "      region_Wales  \n",
       "770            0.0  \n",
       "541            0.0  \n",
       "1367           1.0  \n",
       "571            0.0  \n",
       "1050           0.0  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Min-max scaling\n",
    "Data_copy = Data.copy()\n",
    "\n",
    "Data_scaler = preprocessing.MinMaxScaler()\n",
    "Data_scaler.fit(Data)\n",
    "Data = Data_scaler.fit_transform(Data)\n",
    "\n",
    "pd.DataFrame(Data, columns=Data_copy.columns).sample(5, random_state=999)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67e901ea",
   "metadata": {},
   "source": [
    "***\n",
    "# 3.0 Predictive Modeling <a class=\"anchor\" id=\"3.0\"></a>\n",
    "\n",
    "## 3.1 Feature Selection <a class=\"anchor\" id=\"3.1\"></a>\n",
    "\n",
    "Here we will select the best features within the dataset by assessing and comparing our performance of the classifiers using all features. We will compare the F-Score, Random Forest Importance (RFI), and spFSR feature selection methods and compare the performance amongst all methods to select the best features.\n",
    "\n",
    "### 3.1.1 Full Set of Features <a class=\"anchor\" id=\"3.1.1\"></a>\n",
    "\n",
    "Here we will assess the performance using all features within the dataset by using the stratified 5-fold cross-validation with 3 repetitions, and the random_state is set to 999 for future references for analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "60512c39",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1.        , 0.99679487, 1.        , 1.        , 1.        ,\n",
       "       0.99680511, 1.        , 1.        , 1.        , 1.        ,\n",
       "       1.        , 0.99679487, 1.        , 1.        , 1.        ])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Performance with Full Set of Features\n",
    "clf = DecisionTreeClassifier(random_state=999)\n",
    "\n",
    "cv_method = RepeatedStratifiedKFold(n_splits=5, \n",
    "                                     n_repeats=3,\n",
    "                                     random_state=999)\n",
    "\n",
    "scoring_metric = 'accuracy'\n",
    "\n",
    "cv_results_full = cross_val_score(estimator=clf,\n",
    "                             X=Data,\n",
    "                             y=target, \n",
    "                             cv=cv_method, \n",
    "                             scoring=scoring_metric)\n",
    "\n",
    "cv_results_full"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "5d3283ee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.999"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#CV mean results\n",
    "cv_results_full.mean().round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93cb12f5",
   "metadata": {},
   "source": [
    "### 3.1.2 F-Score <a class=\"anchor\" id=\"3.1.2\"></a>\n",
    "\n",
    "In this section we will use the F-Score feature selection method as it filters the features and measures the relationship between each descriptive features and the target feature through the F-Score distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "851acedd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1,  2,  5,  9,  7,  3, 17, 13, 12,  6])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#F-score\n",
    "num_features = 10\n",
    "fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=num_features)\n",
    "fs_fit_fscore.fit_transform(Data, target)\n",
    "fs_indices_fscore = np.argsort(np.nan_to_num(fs_fit_fscore.scores_))[::-1][0:num_features]\n",
    "fs_indices_fscore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a22e4820",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['amt_weekends', 'amt_weekdays', 'age_Old', 'marital_status_Single',\n",
       "       'marital_status_Married', 'age_Young',\n",
       "       'highest_qualification_Degree',\n",
       "       'highest_qualification_GCSE/O Level',\n",
       "       'highest_qualification_GCSE/CSE', 'marital_status_Divorced'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Best features for F-score ranked\n",
    "best_features_fscore = Data_copy.columns[fs_indices_fscore].values\n",
    "best_features_fscore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "425bbc76",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3240.4118294 , 2451.33850509,   58.21575874,   45.53830706,\n",
       "         40.63115417,   37.99583642,   16.77980932,   16.0445205 ,\n",
       "         10.09145764,    9.22759182])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Score of best features for F-score\n",
    "feature_importances_fscore = fs_fit_fscore.scores_[fs_indices_fscore]\n",
    "feature_importances_fscore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "fcf03a35",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 793
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualize features\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline \n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\"ggplot\")\n",
    "\n",
    "def plot_imp(best_features, scores, method_name):   \n",
    "    plt.barh(best_features, scores)\n",
    "    plt.title(method_name + ' Feature Importances')\n",
    "    plt.xlabel(\"Importance\")\n",
    "    plt.ylabel(\"Features\")\n",
    "    plt.show()\n",
    "\n",
    "plot_imp(best_features_fscore, feature_importances_fscore, 'Figure 1: F-Score')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "6c0261b9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1561, 10)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Shape of F-score\n",
    "Data[:, fs_indices_fscore].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "3cb4cf0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.999"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#CV mean results for F score\n",
    "cv_results_fscore = cross_val_score(estimator=clf,\n",
    "                             X=Data[:, fs_indices_fscore],\n",
    "                             y=target, \n",
    "                             cv=cv_method, \n",
    "                             scoring=scoring_metric)\n",
    "cv_results_fscore.mean().round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f847eb5",
   "metadata": {},
   "source": [
    "From ```Figure 1``` we can observe that for the F-Score Feature Selection, the most important feature is \"age\", and the least important feature is \"region\". Furthermore, we can observe that the average cross-validation is consistent with our performance with the full set of features.\n",
    "\n",
    "### 3.1.3 Random Forest Importance (RFI) <a class=\"anchor\" id=\"3.1.3\"></a>\n",
    "\n",
    "Here we will select the best features using the Random Forest Importance (RFI), which adds additional randomness to the model and searches for the best features amongst the subset of features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "c8ab4625",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['amt_weekends', 'amt_weekdays', 'age_Old', 'marital_status_Single',\n",
       "       'marital_status_Married', 'age_Young',\n",
       "       'highest_qualification_GCSE/O Level', 'gender',\n",
       "       'highest_qualification_Degree', 'gross_income_5,200 to 10,400'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Random Forest Importance\n",
    "num_features = 10 \n",
    "model_rfi = RandomForestClassifier(n_estimators=100)\n",
    "model_rfi.fit(Data, target)\n",
    "fs_indices_rfi = np.argsort(model_rfi.feature_importances_)[::-1][0:num_features]\n",
    "\n",
    "#Best features for RFI ranked\n",
    "best_features_rfi = Data_copy.columns[fs_indices_rfi].values\n",
    "best_features_rfi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "4974ae75",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.51768691, 0.39560584, 0.00977245, 0.00670413, 0.00559715,\n",
       "       0.00460095, 0.00340457, 0.00327924, 0.00298661, 0.00258061])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Scores for RFI features\n",
    "feature_importances_rfi = model_rfi.feature_importances_[fs_indices_rfi]\n",
    "feature_importances_rfi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "a2347ee1",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 793
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualize features\n",
    "plot_imp(best_features_rfi, feature_importances_rfi, 'Figure 2: Random Forest')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "9739ee84",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.999"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#CV mean results for RFI\n",
    "cv_results_rfi = cross_val_score(estimator=clf,\n",
    "                             X=Data[:, fs_indices_rfi],\n",
    "                             y=target, \n",
    "                             cv=cv_method, \n",
    "                             scoring=scoring_metric)\n",
    "cv_results_rfi.mean().round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0a7d1f5",
   "metadata": {},
   "source": [
    "In ```Figure 2``` we can observe that the most important feature with the highest significant importance is amt_weekends and amt_weekdays. Interestingly, the other features weren't as significant.\n",
    "\n",
    "### 3.1.4 spFSR <a class=\"anchor\" id=\"3.1.4\"></a>\n",
    "\n",
    "spSFR is a feature selection method using the binary stochastic approximation to select the best features in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "30885172",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "SpFSR-INFO: Wrapper: DecisionTreeClassifier(random_state=999)\n",
      "SpFSR-INFO: Hot start: True\n",
      "SpFSR-INFO: Hot start range: 0.2\n",
      "SpFSR-INFO: Feature weighting: False\n",
      "SpFSR-INFO: Scoring metric: accuracy\n",
      "SpFSR-INFO: Number of jobs: 1\n",
      "SpFSR-INFO: Number of observations in the dataset: 1561\n",
      "SpFSR-INFO: Number of observations used: 1561\n",
      "SpFSR-INFO: Number of features available: 42\n",
      "SpFSR-INFO: Number of features to select: 10\n",
      "SpFSR-INFO: iter_no: 0, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 0, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 1, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 2, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 3, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 4, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 5, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 6, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 7, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 8, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 9, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 10, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 10, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 11, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 12, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 13, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 14, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 15, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 16, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 17, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 18, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 19, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 20, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 20, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 21, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 22, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 23, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 24, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 25, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 26, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 27, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 28, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 29, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 30, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 30, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 31, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 32, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 33, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 34, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 35, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 36, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 37, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 38, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 39, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 40, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 40, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 41, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 42, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 43, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 44, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 45, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 46, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 47, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 48, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 49, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 50, num_ft: 10, value: 0.999, st_dev: 0.002, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 50, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 51, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 52, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 53, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 54, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 55, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 56, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 57, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 58, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 59, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 60, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 60, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 61, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 62, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 63, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 64, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 65, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 66, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 67, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 68, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 69, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 70, num_ft: 10, value: 0.999, st_dev: 0.003, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 70, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 71, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 72, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 73, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 74, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 75, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 76, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 77, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 78, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 79, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 80, num_ft: 10, value: 0.999, st_dev: 0.003, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 80, same feature stall limit reached, initializing search...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "SpFSR-INFO: ===> iter_no: 81, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 82, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 83, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 84, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 85, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 86, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 87, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 88, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 89, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 90, num_ft: 10, value: 0.999, st_dev: 0.001, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 90, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 91, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 92, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 93, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 94, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 95, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 96, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 97, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 98, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: ===> iter_no: 99, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: iter_no: 100, num_ft: 10, value: 0.999, st_dev: 0.002, best: 0.999 @ iter_no 0\n",
      "SpFSR-INFO: ===> iter_no: 100, same feature stall limit reached, initializing search...\n",
      "SpFSR-INFO: SpFSR completed in 0.07 minutes.\n",
      "SpFSR-INFO: Best value = 0.999 with 10 features and 100 total iterations.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#spFSR\n",
    "from spFSR import SpFSR\n",
    "\n",
    "sp_engine = SpFSR(x=Data, y=target, pred_type='c', wrapper=clf, scoring='accuracy')\n",
    "\n",
    "np.random.seed(999)\n",
    "sp_output = sp_engine.run(num_features=num_features).results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "78328440",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 5, 9, 7, 3, 17, 6, 4, 13]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Best features for spFSR ranked\n",
    "fs_indices_spfsr = sp_output.get('selected_features')\n",
    "fs_indices_spfsr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "d63e0f4a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['amt_weekends', 'amt_weekdays', 'age_Old', 'marital_status_Single',\n",
       "       'marital_status_Married', 'age_Young',\n",
       "       'highest_qualification_Degree', 'marital_status_Divorced',\n",
       "       'age_Middle-Aged', 'highest_qualification_GCSE/O Level'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Best features for spFSR ranked cont.\n",
    "best_features_spfsr = Data_copy.columns[fs_indices_spfsr].values\n",
    "best_features_spfsr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "59706c92",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.1       ,  0.08665648, -0.09739153, -0.09761024, -0.09778658,\n",
       "       -0.09834239, -0.09895688, -0.09909979, -0.09910913, -0.09922988])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Scores of features for spFSR\n",
    "feature_importances_spfsr = sp_output.get('selected_ft_importance')\n",
    "feature_importances_spfsr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "3d099118",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 793
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualize features\n",
    "plot_imp(best_features_spfsr, feature_importances_spfsr, 'Figure 3: spFSR')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "2da7c093",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.999"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#CV mean results for spFSR\n",
    "cv_results_spfsr = cross_val_score(estimator=clf,\n",
    "                             X=Data[:, fs_indices_spfsr],\n",
    "                             y=target, \n",
    "                             cv=cv_method, \n",
    "                             scoring=scoring_metric)\n",
    "cv_results_spfsr.mean().round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bb893dd",
   "metadata": {},
   "source": [
    "```Figure 3``` above shows the spFSR feature importances. As with the previous feature selection methods, amt_weekends and amt_weekdays are by far the most important features."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26da91f2",
   "metadata": {},
   "source": [
    "### 3.1.5 Performance Comparison using Paired T-Tests <a class=\"anchor\" id=\"3.1.5\"></a>\n",
    "\n",
    "To compare the feature selection methods, we use paired t-tests to determine whether any differences in their performance are statistically significant. Because different random states would give each method different data partitions, we evaluate all of them on the same partitions and compare the full set of features, F-score, random forest importance, and spFSR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "fec4432a",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Full Set of Features: 0.999\n",
      "F-Score: 0.999\n",
      "RFI: 0.999\n",
      "spFSR: 0.999\n"
     ]
    }
   ],
   "source": [
    "#CV mean results comparison\n",
    "print('Full Set of Features:', cv_results_full.mean().round(3))\n",
    "print('F-Score:', cv_results_fscore.mean().round(3))\n",
    "print('RFI:', cv_results_rfi.mean().round(3))\n",
    "print('spFSR:', cv_results_spfsr.mean().round(3)) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9471db18",
   "metadata": {},
   "source": [
    "The results above indicate that all the feature selection methods perform similarly, each with a mean cross-validation score of 0.999. We therefore perform paired t-tests to check whether any differences between the methods are statistically significant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "e4093a94",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "spFSR v. F Score\n",
      "nan\n",
      "\n",
      " spFSR v. RFI\n",
      "nan\n",
      "\n",
      " spFSR v. Full\n",
      "nan\n",
      "\n",
      " F Score v. RFI\n",
      "nan\n",
      "\n",
      " F Score v. Full\n",
      "nan\n",
      "\n",
      " RFI v. Full\n",
      "nan\n"
     ]
    }
   ],
   "source": [
    "#Paired T-test for feature selection\n",
    "from scipy import stats\n",
    "\n",
    "print('spFSR v. F Score')\n",
    "print(stats.ttest_rel(cv_results_spfsr, cv_results_fscore).pvalue.round(3))\n",
    "\n",
    "print('\\n','spFSR v. RFI')\n",
    "print(stats.ttest_rel(cv_results_spfsr, cv_results_rfi).pvalue.round(3))\n",
    "\n",
    "print('\\n','spFSR v. Full')\n",
    "print(stats.ttest_rel(cv_results_spfsr, cv_results_full).pvalue.round(3))\n",
    "\n",
    "print('\\n','F Score v. RFI')\n",
    "print(stats.ttest_rel(cv_results_fscore, cv_results_rfi).pvalue.round(3))\n",
    "\n",
    "print('\\n','F Score v. Full')\n",
    "print(stats.ttest_rel(cv_results_fscore, cv_results_full).pvalue.round(3))\n",
    "\n",
    "print('\\n','RFI v. Full')\n",
    "print(stats.ttest_rel(cv_results_rfi, cv_results_full).pvalue.round(3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abebeb36",
   "metadata": {},
   "source": [
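    "Every test above returns *nan* instead of a p-value. This is not a coding error: the paired t-test divides the mean of the per-fold score differences by their standard deviation, and a *nan* result indicates that both are zero, that is, the compared methods have identical scores on every fold. There is therefore no evidence of a performance difference, and any of the four feature sets could be used. Since there are no differences, we shall use the full set of features for further analysis."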
   ]
  },
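  {
   "cell_type": "markdown",
   "id": "5e1c0a71",
   "metadata": {},
   "source": [
    "As a minimal illustration of how a paired t-test can end up undefined, the sketch below computes the paired t-statistic by hand using only the Python standard library. The function name `paired_t_statistic` and the example fold scores are hypothetical and are not part of the analysis above. When two methods have identical fold scores, both the mean and the standard deviation of the differences are zero, so the statistic is 0/0.\n",
    "\n",
    "```python\n",
    "import math\n",
    "import statistics\n",
    "\n",
    "\n",
    "def paired_t_statistic(scores_a, scores_b):\n",
    "    # Per-fold differences between the two methods\n",
    "    diffs = [a - b for a, b in zip(scores_a, scores_b)]\n",
    "    mean_diff = statistics.fmean(diffs)\n",
    "    sd_diff = statistics.stdev(diffs)  # sample standard deviation\n",
    "    if sd_diff == 0:\n",
    "        # Zero variance: 0/0 is undefined, nonzero/0 is infinite\n",
    "        return math.nan if mean_diff == 0 else math.copysign(math.inf, mean_diff)\n",
    "    return mean_diff / (sd_diff / math.sqrt(len(diffs)))\n",
    "\n",
    "\n",
    "# Hypothetical fold scores: identical scores give an undefined statistic\n",
    "same = [0.999, 0.999, 0.999, 0.999, 0.999]\n",
    "print(paired_t_statistic(same, same))\n",
    "```\n",
    "\n",
    "With identical fold scores the sketch prints `nan`, so no p-value can be derived from the statistic."
   ]
  },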
  {
   "cell_type": "markdown",
   "id": "b7f4aea0",
   "metadata": {},
   "source": [
    "## 3.2 Model Fitting and Tuning <a class=\"anchor\" id=\"3.2\"></a>\n",
    "\n",
    "### 3.2.1 Data Sampling & Train-Test Splitting <a class=\"anchor\" id=\"3.2.1\"></a>\n",
    "\n",
    "Here we search for the optimal hyperparameters for each algorithm. Since the dataset is small (1561 observations), we use a 70:30 split for training data and test data, respectively. We also use 5-fold stratified cross-validation for hyperparameter tuning for all algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "178d5c51",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data shape: (1561, 42)\n",
      "Target shape: (1561,)\n"
     ]
    }
   ],
   "source": [
    "#Data and target shape\n",
    "print('Data shape:',Data.shape)\n",
    "print('Target shape:',target.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "65cbd47d",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data train shape: (1092, 42)\n",
      "Data test shape: (469, 42)\n",
      "Target train shape: (1092,)\n",
      "Target test shape: (469,)\n"
     ]
    }
   ],
   "source": [
    "#Train test splitting\n",
    "D_train, D_test, t_train, t_test = train_test_split(Data, \n",
    "                                                    target, \n",
    "                                                    test_size = 0.3, \n",
    "                                                    random_state=8)\n",
    "\n",
    "print('Data train shape:',D_train.shape)\n",
    "print('Data test shape:',D_test.shape)\n",
    "print('Target train shape:',t_train.shape)\n",
    "print('Target test shape:',t_test.shape)"
   ]
  },
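  {
   "cell_type": "markdown",
   "id": "5e1c0a72",
   "metadata": {},
   "source": [
    "The training and test sizes printed above can be checked with a short calculation. This sketch assumes the rounding convention that scikit-learn's `train_test_split` applies when only `test_size` is given as a fraction: the test count is rounded up and the remainder goes to training. The helper name `split_sizes` is hypothetical.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def split_sizes(n_samples, test_size):\n",
    "    # Test count rounded up, remainder used for training\n",
    "    n_test = math.ceil(test_size * n_samples)\n",
    "    return n_samples - n_test, n_test\n",
    "\n",
    "\n",
    "print(split_sizes(1561, 0.3))  # (1092, 469)\n",
    "```\n",
    "\n",
    "The result matches the 1092 training and 469 test observations reported by the split."
   ]
  },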
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "66ee236b",
   "metadata": {},
   "outputs": [],
   "source": [
    "#CV method\n",
    "cv_method = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da21126b",
   "metadata": {},
   "source": [
    "### 3.2.2 K-Nearest Neighbors (KNN) <a class=\"anchor\" id=\"3.2.2\"></a>\n",
    "\n",
    "Here we will use the K-Nearest Neighbors (KNN) algorithm for predictive analysis. Before fitting the final model, we tune the number of neighbors (1 to 7) and the Minkowski distance parameter p, with values of 1 (Manhattan), 2 (Euclidean), and 5 (a higher-order Minkowski distance). The optimal parameters are found with a grid search that uses the cross-validation method defined above."
   ]
  },
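  {
   "cell_type": "markdown",
   "id": "5e1c0a73",
   "metadata": {},
   "source": [
    "The `p` parameter sets the order of the Minkowski distance that KNN uses to measure how far apart two observations are. As a minimal sketch (the function name `minkowski_distance` and the two points are hypothetical), the same pair of points gives a different distance for each value of `p`:\n",
    "\n",
    "```python\n",
    "def minkowski_distance(a, b, p):\n",
    "    # Sum of absolute coordinate differences raised to p, then the p-th root\n",
    "    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)\n",
    "\n",
    "\n",
    "a, b = (0, 0), (3, 4)\n",
    "print(minkowski_distance(a, b, 1))  # Manhattan: 7.0\n",
    "print(minkowski_distance(a, b, 2))  # Euclidean: 5.0\n",
    "print(minkowski_distance(a, b, 5))  # between 4 and 5\n",
    "```\n",
    "\n",
    "As `p` grows, the largest coordinate difference dominates the distance, which is why the three settings can rank neighbors differently."
   ]
  },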
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "d90b3eb1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 21 candidates, totalling 105 fits\n"
     ]
    }
   ],
   "source": [
    "#Fit KNN algorithm\n",
    "model_KNN = KNeighborsClassifier()\n",
    "params_KNN = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7], \n",
    "              'p': [1, 2, 5]}\n",
    "\n",
    "gs_KNN = GridSearchCV(estimator=model_KNN, \n",
    "                      param_grid=params_KNN, \n",
    "                      cv=cv_method,\n",
    "                      verbose=1, \n",
    "                      scoring='accuracy',\n",
    "                      return_train_score=True)\n",
    "\n",
    "gs_KNN.fit(D_train, t_train);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "8a0f192a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimal Parameters for KNN: {'n_neighbors': 7, 'p': 1}\n",
      "Score of Best Parameters for KNN: 0.8324284696912573\n"
     ]
    }
   ],
   "source": [
    "#Parameter and score for KNN\n",
    "print(f\"Optimal Parameters for KNN: {gs_KNN.best_params_}\")\n",
    "print(f\"Score of Best Parameters for KNN: {gs_KNN.best_score_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1911142e",
   "metadata": {},
   "source": [
    "From the output above, the optimal parameters for KNN are 7 neighbors and a p value of 1 (Manhattan distance), with a mean cross-validation accuracy of about 0.832. We will now look at the other parameter combinations to see whether they perform noticeably differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "a44934b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Function for table\n",
    "def get_search_results(gs):\n",
    "\n",
    "    def model_result(scores, params):\n",
    "        scores = {'mean_score': np.mean(scores),\n",
    "             'std_score': np.std(scores),\n",
    "             'min_score': np.min(scores),\n",
    "             'max_score': np.max(scores)}\n",
    "        return pd.Series({**params,**scores})\n",
    "\n",
    "    models = []\n",
    "    scores = []\n",
    "\n",
    "    for i in range(gs.n_splits_):\n",
    "        key = f\"split{i}_test_score\"\n",
    "        r = gs.cv_results_[key]        \n",
    "        scores.append(r.reshape(-1,1))\n",
    "\n",
    "    all_scores = np.hstack(scores)\n",
    "    for p, s in zip(gs.cv_results_['params'], all_scores):\n",
    "        models.append((model_result(s, p)))\n",
    "\n",
    "    pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)\n",
    "\n",
    "    columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']\n",
    "    columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]\n",
    "\n",
    "    return pipe_results[columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "1eede93a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_score</th>\n",
       "      <th>std_score</th>\n",
       "      <th>max_score</th>\n",
       "      <th>min_score</th>\n",
       "      <th>n_neighbors</th>\n",
       "      <th>p</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>0.832428</td>\n",
       "      <td>0.028838</td>\n",
       "      <td>0.876147</td>\n",
       "      <td>0.799087</td>\n",
       "      <td>7.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>0.831515</td>\n",
       "      <td>0.032102</td>\n",
       "      <td>0.880734</td>\n",
       "      <td>0.794521</td>\n",
       "      <td>7.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>0.829680</td>\n",
       "      <td>0.031592</td>\n",
       "      <td>0.876147</td>\n",
       "      <td>0.794521</td>\n",
       "      <td>7.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0.814109</td>\n",
       "      <td>0.028129</td>\n",
       "      <td>0.857798</td>\n",
       "      <td>0.785388</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0.814105</td>\n",
       "      <td>0.025826</td>\n",
       "      <td>0.853211</td>\n",
       "      <td>0.788991</td>\n",
       "      <td>6.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mean_score  std_score  max_score  min_score  n_neighbors    p\n",
       "18    0.832428   0.028838   0.876147   0.799087          7.0  1.0\n",
       "19    0.831515   0.032102   0.880734   0.794521          7.0  2.0\n",
       "20    0.829680   0.031592   0.876147   0.794521          7.0  5.0\n",
       "16    0.814109   0.028129   0.857798   0.785388          6.0  2.0\n",
       "15    0.814105   0.025826   0.853211   0.788991          6.0  1.0"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Parameter table\n",
    "results_KNN = get_search_results(gs_KNN)\n",
    "results_KNN.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00ab4633",
   "metadata": {},
   "source": [
    "The table above shows that the five best parameter combinations score within about 0.02 of each other. This gap is smaller than the fold-to-fold standard deviation of roughly 0.03, so the differences between these combinations are not meaningful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "de8c2d42",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Text(0.5, 1.0, 'Figure 4: KNN Performance Comparison')]"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 578
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualize parameters\n",
    "results_KNN_1 = pd.DataFrame(gs_KNN.cv_results_['params'])\n",
    "results_KNN_1['mean_score']=gs_KNN.cv_results_['mean_test_score']\n",
    "results_KNN_1['distance']=results_KNN_1['p'].replace([1,2,5],[\"Manhattan\",\"Euclidean\",\"Minkowski\"])\n",
    "\n",
    "sns.lineplot(x=results_KNN_1['n_neighbors'],\n",
    "            y=results_KNN_1['mean_score'],\n",
    "            hue=results_KNN_1['distance']).set(title=\"Figure 4: KNN Performance Comparison\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c7ca1c9",
   "metadata": {},
   "source": [
    "```Figure 4``` above shows the line plot for the KNN performance comparison, where we can observe that as the number of neighbors increases, the mean score increases. Furthermore, Manhattan distance performs best at 7 neighbors, which is consistent with the grid search results above.\n",
    "\n",
    "### 3.2.3 Decision Tree (DT) <a class=\"anchor\" id=\"3.2.3\"></a>\n",
    "\n",
    "In this section we tune the Decision Tree (DT) algorithm over the split criterion (gini and entropy), a max depth from 1 to 8, and a minimum sample split of 2 or 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "3e25ed4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 32 candidates, totalling 160 fits\n"
     ]
    }
   ],
   "source": [
    "#DT algorithm\n",
    "df_classifier = DecisionTreeClassifier(random_state=999)\n",
    "\n",
    "params_DT = {'criterion': ['gini', 'entropy'],\n",
    "             'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],\n",
    "             'min_samples_split': [2, 3]}\n",
    "\n",
    "gs_DT = GridSearchCV(estimator=df_classifier, \n",
    "                     param_grid=params_DT, \n",
    "                     cv=cv_method,\n",
    "                     verbose=1, \n",
    "                     scoring='accuracy')\n",
    "\n",
    "gs_DT.fit(Data, target);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "0131bdda",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimal Parameters for DT: {'criterion': 'gini', 'max_depth': 2, 'min_samples_split': 2}\n",
      "Score of Best Parameters for DT: 0.9993589743589745\n"
     ]
    }
   ],
   "source": [
    "#Optimal and score of parameters for DT\n",
    "print(f\"Optimal Parameters for DT: {gs_DT.best_params_}\")\n",
    "print(f\"Score of Best Parameters for DT: {gs_DT.best_score_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "738f6a79",
   "metadata": {},
   "source": [
    "From the output above, the optimal parameters for DT are a criterion of gini, a max depth of 2, and a minimum sample split of 2, with a score of 0.9994. We will now look at the other parameter combinations to see whether there are any notable differences between them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "08f477ba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['criterion', 'max_depth', 'min_samples_split', 'test_score'], dtype='object')"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Output table for parameters for DT\n",
    "results_DT = pd.DataFrame(gs_DT.cv_results_['params'])\n",
    "results_DT['test_score'] = gs_DT.cv_results_['mean_test_score']\n",
    "results_DT.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "bcd6a39f",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>criterion</th>\n",
       "      <th>max_depth</th>\n",
       "      <th>min_samples_split</th>\n",
       "      <th>test_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>gini</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0.996795</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>gini</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0.996795</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>gini</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>gini</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>gini</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>gini</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>gini</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>gini</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>gini</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>gini</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>gini</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>gini</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>gini</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>gini</td>\n",
       "      <td>8</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>entropy</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0.996795</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>entropy</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0.996795</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>entropy</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>entropy</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>entropy</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>entropy</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>entropy</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>entropy</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>entropy</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>entropy</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>entropy</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>entropy</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>entropy</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>entropy</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>entropy</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>entropy</td>\n",
       "      <td>8</td>\n",
       "      <td>3</td>\n",
       "      <td>0.999359</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   criterion  max_depth  min_samples_split  test_score\n",
       "0       gini          1                  2    0.996795\n",
       "1       gini          1                  3    0.996795\n",
       "2       gini          2                  2    0.999359\n",
       "3       gini          2                  3    0.999359\n",
       "4       gini          3                  2    0.999359\n",
       "5       gini          3                  3    0.999359\n",
       "6       gini          4                  2    0.999359\n",
       "7       gini          4                  3    0.999359\n",
       "8       gini          5                  2    0.999359\n",
       "9       gini          5                  3    0.999359\n",
       "10      gini          6                  2    0.999359\n",
       "11      gini          6                  3    0.999359\n",
       "12      gini          7                  2    0.999359\n",
       "13      gini          7                  3    0.999359\n",
       "14      gini          8                  2    0.999359\n",
       "15      gini          8                  3    0.999359\n",
       "16   entropy          1                  2    0.996795\n",
       "17   entropy          1                  3    0.996795\n",
       "18   entropy          2                  2    0.999359\n",
       "19   entropy          2                  3    0.999359\n",
       "20   entropy          3                  2    0.999359\n",
       "21   entropy          3                  3    0.999359\n",
       "22   entropy          4                  2    0.999359\n",
       "23   entropy          4                  3    0.999359\n",
       "24   entropy          5                  2    0.999359\n",
       "25   entropy          5                  3    0.999359\n",
       "26   entropy          6                  2    0.999359\n",
       "27   entropy          6                  3    0.999359\n",
       "28   entropy          7                  2    0.999359\n",
       "29   entropy          7                  3    0.999359\n",
       "30   entropy          8                  2    0.999359\n",
       "31   entropy          8                  3    0.999359"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Output table for parameters for DT cont.\n",
    "results_DT"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad26d9ea",
   "metadata": {},
   "source": [
    "The table above shows that the parameter combinations barely differ. Every combination with a max depth of 1 scores 0.996795, and every combination with a max depth of 2 or more scores 0.999359, regardless of the criterion or the minimum sample split."
   ]
  },
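  {
   "cell_type": "markdown",
   "id": "5e1c0a74",
   "metadata": {},
   "source": [
    "Because many candidates share the top score, it helps to see how a single winner can be chosen among ties. The sketch below is a simplified stand-in rather than scikit-learn's implementation: it enumerates the 32 candidates in the same order as the table above, assigns the mean scores reported there, and keeps the first candidate with the highest score, which is the assumed tie-breaking rule. The names `grid`, `candidates`, `scores` and `best_index` are hypothetical.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "# Same grid as params_DT\n",
    "grid = {'criterion': ['gini', 'entropy'],\n",
    "        'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],\n",
    "        'min_samples_split': [2, 3]}\n",
    "\n",
    "# Enumerate candidates with the last parameter varying fastest\n",
    "candidates = [dict(zip(grid, values)) for values in product(*grid.values())]\n",
    "\n",
    "# Mean CV scores copied from the table: only max_depth = 1 scores lower\n",
    "scores = [0.996795 if c['max_depth'] == 1 else 0.999359 for c in candidates]\n",
    "\n",
    "# max() returns the first maximal element, so ties go to the earliest candidate\n",
    "best_index = max(range(len(scores)), key=scores.__getitem__)\n",
    "print(len(candidates), best_index, candidates[best_index])\n",
    "```\n",
    "\n",
    "Under this rule the selected candidate is row 2 of the table (gini, max depth of 2, minimum sample split of 2), which agrees with the optimal parameters reported earlier."
   ]
  },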
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "0dbd7a1e",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 596
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualising parameters\n",
    "for i in ['gini', 'entropy']:\n",
    "    temp = results_DT[results_DT['criterion'] == i]\n",
    "    temp_average = temp.groupby('max_depth').agg({'test_score': 'mean'})\n",
    "    plt.plot(temp_average, marker = '.', label = i)\n",
    "    \n",
    "    \n",
    "plt.legend()\n",
    "plt.xlabel('Max Depth')\n",
    "plt.ylabel(\"Mean CV Score\")\n",
    "plt.title(\"Figure 5: DT Performance Comparison\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5e3de9f",
   "metadata": {},
   "source": [
    "```Figure 5``` above shows the performance comparison for DT with gini and entropy criterion. Interestingly, due to similar values both criterion follows the same line, and as max depth increases, so does mean cv score. In this case, we will use the recommended DT parameters of criterion gini, max depth of 2, and minimum sample split of 2.\n",
    "\n",
    "### 3.2.4 Gaussian Naive Bayes (NB) <a class=\"anchor\" id=\"3.2.4\"></a>\n",
    "\n",
    "In this section we will use the Gaussian Naive Bayes (NB) algorithm by optimizing the variance of Laplace smoothing by first performing a power transformation on the data before fitting the model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "a0746673",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 100 candidates, totalling 500 fits\n"
     ]
    }
   ],
   "source": [
    "#NB algorithm\n",
    "np.random.seed(999)\n",
    "\n",
    "nb_classifier = GaussianNB()\n",
    "\n",
    "params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}\n",
    "\n",
    "gs_NB = GridSearchCV(estimator=nb_classifier, \n",
    "                     param_grid=params_NB, \n",
    "                     cv=cv_method,\n",
    "                     verbose=1, \n",
    "                     scoring='accuracy')\n",
    "\n",
    "Data_transformed = PowerTransformer().fit_transform(Data)\n",
    "\n",
    "gs_NB.fit(Data_transformed, target);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "b3d6299e",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimal Parameters for NB: {'var_smoothing': 3.5111917342151273e-09}\n",
      "Score of Best Paramaters for NB: 0.9961599901695749\n"
     ]
    }
   ],
   "source": [
    "#Optimal and score of best parameter for NB\n",
    "print(f\"Optimal Parameters for NB: {gs_NB.best_params_}\")\n",
    "print(f\"Score of Best Paramaters for NB: {gs_NB.best_score_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5943c129",
   "metadata": {},
   "source": [
    "From the output above, we can observe that the optimal parameters for NB are with variance of Laplace smoothing value of 3.5111917342151273e-09, which is a very low number. Additionally, the score for the optimal parameters are 0.99616. We will now look at the other parameters to see whether if there is any significant differences between other parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "cc8907ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Table for parameters for NB\n",
    "results_NB = pd.DataFrame(gs_NB.cv_results_['params'])\n",
    "results_NB['test_score'] = gs_NB.cv_results_['mean_test_score']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "75bc7ed0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_score</th>\n",
       "      <th>std_score</th>\n",
       "      <th>max_score</th>\n",
       "      <th>min_score</th>\n",
       "      <th>var_smoothing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>0.99616</td>\n",
       "      <td>0.003728</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.990415</td>\n",
       "      <td>1.000000e-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>0.99616</td>\n",
       "      <td>0.003728</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.990415</td>\n",
       "      <td>1.232847e-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>0.99616</td>\n",
       "      <td>0.003728</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.990415</td>\n",
       "      <td>1.519911e-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>0.99616</td>\n",
       "      <td>0.003728</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.990415</td>\n",
       "      <td>1.873817e-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>0.99616</td>\n",
       "      <td>0.003728</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.990415</td>\n",
       "      <td>2.310130e-09</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mean_score  std_score  max_score  min_score  var_smoothing\n",
       "99     0.99616   0.003728        1.0   0.990415   1.000000e-09\n",
       "98     0.99616   0.003728        1.0   0.990415   1.232847e-09\n",
       "97     0.99616   0.003728        1.0   0.990415   1.519911e-09\n",
       "96     0.99616   0.003728        1.0   0.990415   1.873817e-09\n",
       "95     0.99616   0.003728        1.0   0.990415   2.310130e-09"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Table for parameters for NB\n",
    "results_NB_1 = get_search_results(gs_NB)\n",
    "results_NB_1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5c21d8f",
   "metadata": {},
   "source": [
    "From the table output above, we can observe that the differences between the parameter combinations has no significant differences between all parameters. Interestingly, many of the combinations yielded similar mean score results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "17909e02",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 459,
       "width": 587
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Visualizing parameters for NB\n",
    "plt.plot(results_NB['var_smoothing'], results_NB['test_score'], marker = '.')    \n",
    "plt.xlabel('Var. Smoothing')\n",
    "plt.ylabel(\"Mean CV Score\")\n",
    "plt.title(\"Figure 6: NB Performance Comparison\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fefe5a28",
   "metadata": {},
   "source": [
    "From ```Figure 6``` above which shows the performance comparison for NB, we can observe that there are many convergence around the variable smoothing of 0.0, and as the variable smoothing increases, mean cv score decreases. Hence, we will continue with our recommended hyperparameter for variance smoothing of 3.5111917342151273e-09.\n",
    " \n",
    "### 3.2.5 Model Comparison <a class=\"anchor\" id=\"3.2.5\"></a>\n",
    "\n",
    "In this section we will evaluate the performance of the algorithms using the optimal parameters we have evaluated for KNN, DT, and NB to compare the algorithms. We will also be using the full feature selection as the feature selection method chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "68b4ad6e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KNN Results with Best Parameters: 0.7932280942576069\n",
      "DT Results with Best Parameters: 0.997872340425532\n",
      "NB Results with Best Parameters: 0.9914893617021278\n"
     ]
    }
   ],
   "source": [
    "#Model using optimal hyperparameters\n",
    "\n",
    "#KNN\n",
    "cv_results_KNN = cross_val_score(estimator=gs_KNN.best_estimator_,\n",
    "                                X=D_test,\n",
    "                                y=t_test,\n",
    "                                cv=cv_method,\n",
    "                                n_jobs=-2,\n",
    "                                scoring=scoring_metric)\n",
    "\n",
    "#DT\n",
    "cv_results_DT = cross_val_score(estimator=gs_DT.best_estimator_,\n",
    "                                X=D_test,\n",
    "                                y=t_test,\n",
    "                                cv=cv_method,\n",
    "                                n_jobs=-2,\n",
    "                                scoring=scoring_metric)\n",
    "\n",
    "#NB\n",
    "cv_results_NB = cross_val_score(estimator=gs_NB.best_estimator_,\n",
    "                                X=D_test,\n",
    "                                y=t_test,\n",
    "                                cv=cv_method,\n",
    "                                n_jobs=-2,\n",
    "                                scoring=scoring_metric)\n",
    "\n",
    "#Scores comparison\n",
    "print(\"KNN Results with Best Parameters:\",cv_results_KNN.mean())\n",
    "print(\"DT Results with Best Parameters:\",cv_results_DT.mean())\n",
    "print(\"NB Results with Best Parameters:\",cv_results_NB.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc0e9f20",
   "metadata": {},
   "source": [
    "From the above output, we can observe that for DT and NB, there are very high scores, but DT resulted in a higher score. Interestingly, KNN resulted in a significant score difference against the other algorithms. We will perform T test comparison to further evaluate the models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "12a4c7b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KNN v. DT: 0.0001\n",
      "KNN v. NB: 0.0002\n",
      "DT  v. NB: 0.208\n"
     ]
    }
   ],
   "source": [
    "#P-value comparison\n",
    "print(\"KNN v. DT:\", stats.ttest_rel(cv_results_KNN, cv_results_DT).pvalue.round(4))\n",
    "print(\"KNN v. NB:\", stats.ttest_rel(cv_results_KNN, cv_results_NB).pvalue.round(4))\n",
    "print(\"DT  v. NB:\", stats.ttest_rel(cv_results_DT, cv_results_NB).pvalue.round(4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "823bda15",
   "metadata": {},
   "source": [
    "From the output above, we can see that for the t test comparison, KNN and DT had a p value of 0.0001. Since p-value is less than the test statistic, 0.05, we can conclude that there is statistical difference between the KNN and DT. KNN and NB had a p value of 0.0002. Since p-value is less than test statistic, 0.05, we can conclude that there is statistical difference between KNN and NB. However, for DT and NB there was a p value of 0.208. Since p-value is greater than test statistic, 0.05, we can conclude that there is no statistical differences between KNN and NB. Interestingly, only DT and NB had a statistical difference, which meant that DT could be better than NB, or vice versa. Hence, we will continue to explore this with a confusion matrix to evaluate their performance."
   ]
  },
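  {
   "cell_type": "markdown",
   "id": "7c2e41b9",
   "metadata": {},
   "source": [
    "To make the comparison above concrete, the sketch below computes a paired t statistic by hand in plain Python. The fold scores are hypothetical values chosen for illustration, not the scores produced in this notebook, and `paired_t_statistic` is an illustrative helper rather than the `stats.ttest_rel` call used above.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from statistics import mean, stdev\n",
    "\n",
    "def paired_t_statistic(scores_a, scores_b):\n",
    "    # Work on per-fold differences, since both models share the same folds.\n",
    "    diffs = [a - b for a, b in zip(scores_a, scores_b)]\n",
    "    # t = mean difference divided by its standard error.\n",
    "    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))\n",
    "\n",
    "# Hypothetical per-fold accuracies for two models.\n",
    "model_a = [0.80, 0.78, 0.82, 0.79, 0.81]\n",
    "model_b = [0.99, 1.00, 0.99, 1.00, 0.98]\n",
    "\n",
    "t_stat = paired_t_statistic(model_a, model_b)\n",
    "print(round(t_stat, 2))\n",
    "```\n",
    "\n",
    "With 5 paired scores there are 4 degrees of freedom, for which the two-sided critical value at the 0.05 level is about 2.776; an absolute t statistic above that corresponds to a p-value below 0.05."
   ]
  },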
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "6f1319ac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KNN Confusion Matrix \n",
      " [[344  11]\n",
      " [ 69  45]] \n",
      "\n",
      "DT Confusion Matrix \n",
      " [[355   0]\n",
      " [  0 114]] \n",
      "\n",
      "NB Confusion Matrix \n",
      " [[  0 355]\n",
      " [  0 114]]\n"
     ]
    }
   ],
   "source": [
    "#Confusion Matrix\n",
    "pred_KNN=gs_KNN.predict(D_test)\n",
    "pred_DT=gs_DT.predict(D_test)\n",
    "pred_NB=gs_NB.predict(D_test)\n",
    "\n",
    "print(\"KNN Confusion Matrix\",\"\\n\",metrics.confusion_matrix(t_test, pred_KNN),\"\\n\")\n",
    "print(\"DT Confusion Matrix\",\"\\n\",metrics.confusion_matrix(t_test, pred_DT),\"\\n\")\n",
    "print(\"NB Confusion Matrix\",\"\\n\",metrics.confusion_matrix(t_test, pred_NB))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "293ac1c9",
   "metadata": {},
   "source": [
    "In the above confusion matrix outputs, KNN had the 11 false positives and 69 false negatives. DT has performed well with no false positives or false negatives. However, the worst performing would be NB as there were 0 true positives and 355 false positives. With this in mind, we will continue with the model comparison with a classification report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "12b9dc79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KNN Classification Report \n",
      "               precision    recall  f1-score   support\n",
      "\n",
      "       False       0.83      0.97      0.90       355\n",
      "        True       0.80      0.39      0.53       114\n",
      "\n",
      "    accuracy                           0.83       469\n",
      "   macro avg       0.82      0.68      0.71       469\n",
      "weighted avg       0.83      0.83      0.81       469\n",
      " \n",
      "\n",
      "DT Classification Report \n",
      "               precision    recall  f1-score   support\n",
      "\n",
      "       False       1.00      1.00      1.00       355\n",
      "        True       1.00      1.00      1.00       114\n",
      "\n",
      "    accuracy                           1.00       469\n",
      "   macro avg       1.00      1.00      1.00       469\n",
      "weighted avg       1.00      1.00      1.00       469\n",
      " \n",
      "\n",
      "NB Classification Report \n",
      "               precision    recall  f1-score   support\n",
      "\n",
      "       False       0.00      0.00      0.00       355\n",
      "        True       0.24      1.00      0.39       114\n",
      "\n",
      "    accuracy                           0.24       469\n",
      "   macro avg       0.12      0.50      0.20       469\n",
      "weighted avg       0.06      0.24      0.10       469\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/opt/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/opt/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    }
   ],
   "source": [
    "#Classification Report\n",
    "print(\"KNN Classification Report \\n\",metrics.classification_report(t_test,pred_KNN),\"\\n\")\n",
    "print(\"DT Classification Report \\n\",metrics.classification_report(t_test,pred_DT),\"\\n\")\n",
    "print(\"NB Classification Report \\n\",metrics.classification_report(t_test,pred_NB))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64564968",
   "metadata": {},
   "source": [
    "The above outputs show the classification report for KNN, DT, and NB, respectively. From the output we can observe that only DT had a recall and f1-score of 1, which meant that it was able to predict and identify the smokers and non-smokers without any misclassification. KNN had a recall value of 0.97 and 0.39, and f1-score of 0.90 and 0.53 for smokers and non smokers. Although the values are high and close to 1 for smokers, it is still not fully accurate, and non smokers values are relatively low meaning that there are higher misclassification within the prediction data. The worst model was with NB where there were recall values of 0 and 1, and f1-score values of 0 and 0.39. This meant that there were 0 accurate non-smokers and many misclassification within the data.\n",
    "\n",
    "This is consistent with our previous comparisons where the best model were DT and the inaccurate models were KNN and NB. Hence, with these comparisons it is suffice to say that the best algorithm for the smoker dataset is the Decision Trees (DT) model.\n",
    "\n",
    "***\n",
    "# 4.0 Critique and Limitations <a class=\"anchor\" id=\"4.0\"></a>\n",
    "\n",
    "Throughout both phases, one of the strengths of the dataset was that there was not much missing values which were difficult to replace with, and the dataset were mostly clean and tidy. However, since there were only 1561 observations within the dataset, we were unable to fully and accurately test our data for modeling. This comes as a weakness as if we had a bigger dataset with more observations, we could be able to produce a much more accurate model for selection.\n",
    "\n",
    "Furthermore, with a rather small dataset, we had to resort with a smaller test data during the performance comparison phase, and had to resort with using the 5 repeated 3 fold cross validation method. However, since we involved this method, there were limited effects of overfitting which is helpful for prediction analysis as overfit data often results in incorrect predictions. Therefore, future updates on this project with a larger dataset would be recommended to ensure the correctness of our results.\n",
    "\n",
    "***\n",
    "# 5.0 Summary and Conclusions <a class=\"anchor\" id=\"5.0\"></a>\n",
    "\n",
    "## 5.1 Project Summary <a class=\"anchor\" id=\"5.1\"></a>\n",
    "\n",
    "In the Phase 1 of the project, we have successfully obtained a clean and tidy dataset, and to prepare the dataset for analysis in Phase 2. During Phase 1 after importing the data, firstly we have used the ```unique``` function to identify the unique values for each columns to identify any inconsistencies. This was useful as we found that there were missing values in features amt_weekdays, amt_weekends and type. Furthermore, we have identified inconsistencies for features nationality, ethnicity, and gross_income where there were values of \"refused\" and \"unknown\". \n",
    "\n",
    "To tackle these issues we first replaced the \"refused\" value to \"unknown\" value for consistencies. Then, we removed the type column as it was not relevant to our research objectives. However, to rectify the missing values for amt_weekdays and amt_weekends, we first found out that the reason these values were missing was because the respondent was not a smoker. Hence, we simply replaced all missing values with 0 values as they were not smokers and did not smoke on weekdays and weekends. Then, we discretized the numeric features of age by classifying them with values of 'Young', 'Middle-Aged', and 'Old' using the ```qcut``` function from pandas. \n",
    "\n",
    "Furthermore, in Phas  1 we also dealt with incorrect data types for the features, and dealt with them as necessary. Once the data cleaning and preprocessing were completed, we then used a multitude of plots to visualize the data through One-Variable, Two-Variable, and Three-Variable plots.\n",
    "\n",
    "With Phase 1 completed, we then moved on to Phase 2 by first exporting the data from Phase 1 as a csv file, and imported it into our Phase 2 for analysis. During Phase 2, after importing the data we first checked the dataset for any inconsistencies with our Phase 1 output. We found that there were inconsistencies within the data types, however as the encoding only allowed object type features, we decided to leave it as object features as necessary. Then, we performed encoding by first separating our target and data features. Target features were first encoded using binary values as there were only true and false values within the target feature. Then, we performed an one-hot encoding for our data features and also performed a feature scaling using min-max scaling of descriptive features as it was required for feature selection. \n",
    "\n",
    "Then, we moved on with feature selection by using selection methods of full feature selection, F-score, Random Forest Importance, and spFSR methods. Then, we compared the feature selection methods through paired t-test to identify our best selection method. Since there were no statistically significant differences between all selection methods, we simply selected the full feature selection method. \n",
    "\n",
    "Once feature selection was completed, to ensure consistencies of the test and training data, we used a 70:30 split of the data for future analysis. We also fitted our models with different algorithms to identify the best hyperparameters for the corresponding model, and to tune the model as required. Here we have used the K-Nearest Neighbors, Decision Trees, and Gaussian Naive Bayes algorithms and identified the best hyperparameters for each models. For each model, we have also tested the performance of each hyperparameters to aid in identifying the best parameters and to compare if there were any significant differences. However, we also compared our models to find the best models which worked for our dataset. For this, we have fitted each model with their best respective hyperparameters and fitted them with the same split data. We first compared the scores of each models, then compared the p-values using Paired T-test method, and also generated a confusion matrix and classification report. \n",
    "Through this, we were able to identify the best model for our dataset.\n",
    "\n",
    "## 5.2 Summary of Findings <a class=\"anchor\" id=\"5.2\"></a>\n",
    "\n",
    "A comprehensive summary of your findings. That is, what exactly did you find about your particular problem?\n",
    "\n",
    "For both phases of the project, we have first found that missing values within the amt_weekdays, amt_weekends, and type was due to the respondent not being a smoker, and that some smokers only smoke on weekends and some on weekdays. During our data visualizations, we also found that there were more female than male smokers within the dataset, and more smokers within the younger population.\n",
    "\n",
    "Additionally, during the Phase 2 of the project, we have identified that for all feature selection method except for full feature selection method, amt_weekends and amt_weekdays were identified as the most important descriptive feature against our target feature, which then is followed by age, but were not as significant as amt_weekends and amt_weekdays. When comparing the feature selection methods, we identified that there were no statistical differences between all feature selection methods as they all yielded the same results. \n",
    "\n",
    "For the models, we have first identified the optimal hyperparameter for K Nearest Neighbours are number of neighbors of 7, and distance of 1 (Manhattan distance), with a score of 0.8324. Decision tree algorithm was identified with having the optimal hyperparameter of gini criterion, max depth of 2, and a minimum sample split of 2. For Gaussian Naive Bayes, the optimal hyperparameter were a variance of Laplace smoothing value of 3.5111917342151273e-09. \n",
    "\n",
    "During our model comparisons, we identified that the best algorithm was Decision Trees as it yielded the best scores, had no false positive or negatives within the confusion matrix, and had no misclassification in the classification report.\n",
    "\n",
    "## 5.3 Conclusion <a class=\"anchor\" id=\"5.3\"></a>\n",
    "\n",
    "In summary, we have cleaned the data during Phase 1 and prepared the data for analysis in Phase 2. In this Phase 2 report, the Decision Tree model selected by the full feature selection produces the highest cross-validation score, and the most accurate values on the training data. For this reason the Decision Tree model is selected as the best performing model compared to the K-Nearest Neighbors and Gaussian Naive Bayes models. In relation to our goals and objectives set out in Phase 1, we have successfully attained in goal to obtain a clean and tidy dataset in Phase 1, and here in Phase 2 we have successfully identified the best model for further prediction analysis.\n",
    "***\n",
    "# 6.0 References <a class=\"anchor\" id=\"6.0\"></a>\n",
    "\n",
    "Akmand D (2022) *SK Part 2: Feature Selection and Ranking*, GitHub website, accessed 27 May 2024. https://github.com/akmand/ml_tutorials/blob/master/SK2.ipynb\n",
    "\n",
    "Akmand D (2022) *spFSR*, GitHub website, accessed 27 May 2024. https://github.com/akmand/spFSR\n",
    "\n",
    "Chen Y and Lin C (2006) ‘Combining SVMs with Various Feature Selection Strategies’, *Studies in Fuzziness and Soft Computing*, vol 207, doi: 10.1007/978-3-540-35488-8_13.\n",
    "\n",
    "Evidently (n.d.) *Accuracy vs. precision vs. recall in machine learning: what's the difference?*, Evidently AI website, accessed 27 May 2024. https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall#:~:text=Recall%20is%20a%20metric%20that,the%20number%20of%20positive%20instances.\n",
    "\n",
    "*Feature Selection* (2024) Heavy.AI website, accessed 27 May 2024. https://www.heavy.ai/technical-glossary/feature-selection\n",
    "\n",
    "Gleichmann N (2020) *Paired vs Unpaired T-Test: Differences, Assumptions, and Hypotheses*, Technology Networks Informatics website, accessed 27 May 2024. https://www.technologynetworks.com/informatics/articles/paired-vs-unpaired-t-test-differences-assumptions-and-hypotheses-330826#:~:text=A%20paired%20t%2Dtest%20(also,difference%20between%20the%20two%20groups.\n",
    "\n",
    "MacQuarrie M (2024) *UK Smoking Data* [data set], Kaggle website, accessed 10 April 2024. https://www.kaggle.com/datasets/mexwell/uk-smoking-data?resource=download\n",
    "\n",
    "Martins C (2023) *Gaussian Naive Bayes Explained With Scikit-Learn*, builtin website, accessed 27 May 2024. https://builtin.com/artificial-intelligence/gaussian-naive-bayes#\n",
    "\n",
    "Rosidi N (2023) *Advanced Feature Selection Techniques for Machine Learning Models*, KDnuggets website, accessed 27 May 2024. https://www.kdnuggets.com/2023/06/advanced-feature-selection-techniques-machine-learning-models.html#:~:text=Exhaustive%20feature%20selection%20compares%20the,ensures%20the%20best%20feature%20subset.\n",
    "\n",
    "Sharma D, Chatterjee M, Kaur G and Vavilala S (2022) ‘3 - Deep learning applications for disease diagnosis’, *Academic Press*, pp. 31 – 51, doi: 10.1016/B978-0-12-824145-5.00005-8.\n",
    "\n",
    "Sharma N (2023) *Understanding and Applying F1 Score: AI Evaluation Essentials with Hands-On Coding Example*, arize website, accessed 27 May 2024, https://arize.com/blog-course/f1-score/#:~:text=F1%20score%20is%20a%20measure,better%20understanding%20of%20model%20performance.\n",
    "\n",
    "*What is a decision tree?* (n.d.) IBM website, accessed 27 May 2024. https://www.ibm.com/topics/decision-trees#:~:text=A%20decision%20tree%20is%20a,internal%20nodes%20and%20leaf%20nodes.\n",
    "\n",
    "*What is k-nearest neighbors (KNN) algorithm?* (n.d.) IBM website, accessed 27 May 2024, https://www.ibm.com/topics/knn#:~:text=The%20k%2Dnearest%20neighbors%20(KNN,used%20in%20machine%20learning%20today.\n",
    "\n",
    "*What is the difference between a confusion matrix and a classification report?* (n.d.) LinkedIn website, accessed 27 May 2024, https://www.linkedin.com/advice/3/what-difference-between-confusion-matrix-classification-hsehf#:~:text=A%20classification%20report%20is%20a,these%20metrics%20across%20all%20classes.\n",
    "\n",
    "Yeung C, Bunker R and Fujii K (2023) ‘F-Score for the XGBoost model’, *PLOS One*, doi: https://doi.org/10.1371/journal.pone.0284318.s001"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}