--- a
+++ b/Final_Project.ipynb
@@ -0,0 +1,2284 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e8ccf29a",
+   "metadata": {},
+   "source": [
+    "## Predictive Modeling of Smoking Status Using Bio-Signals:\n",
+    "## Leveraging AI and Machine Learning for Improved Smoking Cessation Strategies\n",
+    "\n",
+    "\n",
+    "### AIM OF PROJECT:\n",
+    "The objective of this study is to develop a machine learning model that uses bio-signals to accurately predict an individual's smoking status. By leveraging AI and machine learning techniques, we aim to provide a reliable tool for identifying whether an individual is a smoker or a non-smoker. This predictive model can aid in the development of effective smoking cessation strategies and interventions, ultimately contributing to improved public health outcomes. The model will consider various bio-signals, such as physiological, behavioral, or environmental factors, to achieve accurate and reliable predictions. The goal is to create a practical and accessible solution that can assist healthcare professionals in assessing an individual's smoking status and providing personalized recommendations for smoking cessation.\n",
+    "\n",
+    "\n",
+    "### Table of Contents:\n",
+    "\n",
+    "> - INTRODUCTION\n",
+    "> - DATA PREPROCESSING\n",
+    "> - DATA EXPLORATION\n",
+    "> - FEATURE SELECTION\n",
+    "> - TRAINING\n",
+    "> - EVALUATION \n",
+    "> - PARAMETER TUNING\n",
+    "> - CONCLUSIONS\n",
+    "> - REFERENCES\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95dca433",
+   "metadata": {},
+   "source": [
+    "### 1. INTRODUCTION\n",
+    "\n",
+    "> Smoking is a well-known contributor to a number of health problems and has a major effect on public health globally. To create successful smoking cessation methods and interventions, it is essential to determine each individual's smoking status precisely. Traditional techniques for determining smoking status frequently rely on self-reporting, which can be subjective and unreliable. Predicting smoking status from bio-signals using AI and machine learning has attracted attention as a solution to this problem.\n",
+    "\n",
+    "> The goal of this project is to develop a predictive modelling strategy that uses AI and machine learning algorithms to accurately predict a person's smoking status. By analysing bio-signals, including physiological, behavioural, and environmental aspects, the model is intended to give healthcare providers a trustworthy tool for assessing smoking status and directing individualised smoking cessation efforts.\n",
+    "\n",
+    "> The report covers the project's phases: data preprocessing, dataset exploration, feature selection, algorithm selection, training, evaluation, parameter tuning, and deployment. Each step is explained, showing the approaches used and the justification for the choices made.\n",
+    "\n",
+    "> A prediction model with high accuracy in predicting smoking status is the anticipated result of this investigation. The model's usefulness and accessibility will be emphasised to ensure that healthcare professionals may use it in practical contexts. The created model can help enhance smoking cessation tactics and ultimately lessen the harmful effects of smoking on public health by enabling precise identification of smoking status.\n",
+    "\n",
+    "> The report closes with a summary of its conclusions, its caveats, and suggestions for further research. The references section lists the cited sources and other material used in the study."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "21ea682a",
+   "metadata": {},
+   "source": [
+    "### 2. DATA PREPROCESSING"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 156,
+   "id": "714c7eac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the pandas library\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 157,
+   "id": "da685198",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>age</th>\n",
+       "      <th>height(cm)</th>\n",
+       "      <th>weight(kg)</th>\n",
+       "      <th>waist(cm)</th>\n",
+       "      <th>eyesight(left)</th>\n",
+       "      <th>eyesight(right)</th>\n",
+       "      <th>hearing(left)</th>\n",
+       "      <th>hearing(right)</th>\n",
+       "      <th>systolic</th>\n",
+       "      <th>relaxation</th>\n",
+       "      <th>...</th>\n",
+       "      <th>HDL</th>\n",
+       "      <th>LDL</th>\n",
+       "      <th>hemoglobin</th>\n",
+       "      <th>Urine protein</th>\n",
+       "      <th>serum creatinine</th>\n",
+       "      <th>AST</th>\n",
+       "      <th>ALT</th>\n",
+       "      <th>Gtp</th>\n",
+       "      <th>dental caries</th>\n",
+       "      <th>smoking</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>35</td>\n",
+       "      <td>170</td>\n",
+       "      <td>85</td>\n",
+       "      <td>97.0</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>118</td>\n",
+       "      <td>78</td>\n",
+       "      <td>...</td>\n",
+       "      <td>70</td>\n",
+       "      <td>142</td>\n",
+       "      <td>19.8</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>61</td>\n",
+       "      <td>115</td>\n",
+       "      <td>125</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>20</td>\n",
+       "      <td>175</td>\n",
+       "      <td>110</td>\n",
+       "      <td>110.0</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>119</td>\n",
+       "      <td>79</td>\n",
+       "      <td>...</td>\n",
+       "      <td>71</td>\n",
+       "      <td>114</td>\n",
+       "      <td>15.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.1</td>\n",
+       "      <td>19</td>\n",
+       "      <td>25</td>\n",
+       "      <td>30</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>45</td>\n",
+       "      <td>155</td>\n",
+       "      <td>65</td>\n",
+       "      <td>86.0</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>110</td>\n",
+       "      <td>80</td>\n",
+       "      <td>...</td>\n",
+       "      <td>57</td>\n",
+       "      <td>112</td>\n",
+       "      <td>13.7</td>\n",
+       "      <td>3</td>\n",
+       "      <td>0.6</td>\n",
+       "      <td>1090</td>\n",
+       "      <td>1400</td>\n",
+       "      <td>276</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>45</td>\n",
+       "      <td>165</td>\n",
+       "      <td>80</td>\n",
+       "      <td>94.0</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>158</td>\n",
+       "      <td>88</td>\n",
+       "      <td>...</td>\n",
+       "      <td>46</td>\n",
+       "      <td>91</td>\n",
+       "      <td>16.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>32</td>\n",
+       "      <td>36</td>\n",
+       "      <td>36</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>20</td>\n",
+       "      <td>165</td>\n",
+       "      <td>60</td>\n",
+       "      <td>81.0</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>0.1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>109</td>\n",
+       "      <td>64</td>\n",
+       "      <td>...</td>\n",
+       "      <td>47</td>\n",
+       "      <td>92</td>\n",
+       "      <td>14.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.2</td>\n",
+       "      <td>26</td>\n",
+       "      <td>28</td>\n",
+       "      <td>15</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>38979</th>\n",
+       "      <td>40</td>\n",
+       "      <td>165</td>\n",
+       "      <td>60</td>\n",
+       "      <td>80.0</td>\n",
+       "      <td>0.4</td>\n",
+       "      <td>0.6</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>107</td>\n",
+       "      <td>60</td>\n",
+       "      <td>...</td>\n",
+       "      <td>61</td>\n",
+       "      <td>72</td>\n",
+       "      <td>12.3</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>18</td>\n",
+       "      <td>18</td>\n",
+       "      <td>21</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>38980</th>\n",
+       "      <td>45</td>\n",
+       "      <td>155</td>\n",
+       "      <td>55</td>\n",
+       "      <td>75.0</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>1.2</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>126</td>\n",
+       "      <td>72</td>\n",
+       "      <td>...</td>\n",
+       "      <td>76</td>\n",
+       "      <td>131</td>\n",
+       "      <td>12.5</td>\n",
+       "      <td>2</td>\n",
+       "      <td>0.6</td>\n",
+       "      <td>23</td>\n",
+       "      <td>11</td>\n",
+       "      <td>12</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>38981</th>\n",
+       "      <td>40</td>\n",
+       "      <td>170</td>\n",
+       "      <td>105</td>\n",
+       "      <td>124.0</td>\n",
+       "      <td>0.6</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>141</td>\n",
+       "      <td>85</td>\n",
+       "      <td>...</td>\n",
+       "      <td>48</td>\n",
+       "      <td>138</td>\n",
+       "      <td>17.1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>24</td>\n",
+       "      <td>23</td>\n",
+       "      <td>35</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>38982</th>\n",
+       "      <td>40</td>\n",
+       "      <td>160</td>\n",
+       "      <td>55</td>\n",
+       "      <td>75.0</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>95</td>\n",
+       "      <td>69</td>\n",
+       "      <td>...</td>\n",
+       "      <td>79</td>\n",
+       "      <td>116</td>\n",
+       "      <td>12.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.6</td>\n",
+       "      <td>24</td>\n",
+       "      <td>20</td>\n",
+       "      <td>17</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>38983</th>\n",
+       "      <td>55</td>\n",
+       "      <td>175</td>\n",
+       "      <td>60</td>\n",
+       "      <td>81.1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>114</td>\n",
+       "      <td>66</td>\n",
+       "      <td>...</td>\n",
+       "      <td>64</td>\n",
+       "      <td>137</td>\n",
+       "      <td>13.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>18</td>\n",
+       "      <td>12</td>\n",
+       "      <td>16</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>38984 rows × 23 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "       age  height(cm)  weight(kg)  waist(cm)  eyesight(left)  \\\n",
+       "0       35         170          85       97.0             0.9   \n",
+       "1       20         175         110      110.0             0.7   \n",
+       "2       45         155          65       86.0             0.9   \n",
+       "3       45         165          80       94.0             0.8   \n",
+       "4       20         165          60       81.0             1.5   \n",
+       "...    ...         ...         ...        ...             ...   \n",
+       "38979   40         165          60       80.0             0.4   \n",
+       "38980   45         155          55       75.0             1.5   \n",
+       "38981   40         170         105      124.0             0.6   \n",
+       "38982   40         160          55       75.0             1.5   \n",
+       "38983   55         175          60       81.1             1.0   \n",
+       "\n",
+       "       eyesight(right)  hearing(left)  hearing(right)  systolic  relaxation  \\\n",
+       "0                  0.9              1               1       118          78   \n",
+       "1                  0.9              1               1       119          79   \n",
+       "2                  0.9              1               1       110          80   \n",
+       "3                  0.7              1               1       158          88   \n",
+       "4                  0.1              1               1       109          64   \n",
+       "...                ...            ...             ...       ...         ...   \n",
+       "38979              0.6              1               1       107          60   \n",
+       "38980              1.2              1               1       126          72   \n",
+       "38981              0.5              1               1       141          85   \n",
+       "38982              1.5              1               1        95          69   \n",
+       "38983              1.0              1               1       114          66   \n",
+       "\n",
+       "       ...  HDL  LDL  hemoglobin  Urine protein  serum creatinine   AST   ALT  \\\n",
+       "0      ...   70  142        19.8              1               1.0    61   115   \n",
+       "1      ...   71  114        15.9              1               1.1    19    25   \n",
+       "2      ...   57  112        13.7              3               0.6  1090  1400   \n",
+       "3      ...   46   91        16.9              1               0.9    32    36   \n",
+       "4      ...   47   92        14.9              1               1.2    26    28   \n",
+       "...    ...  ...  ...         ...            ...               ...   ...   ...   \n",
+       "38979  ...   61   72        12.3              1               0.5    18    18   \n",
+       "38980  ...   76  131        12.5              2               0.6    23    11   \n",
+       "38981  ...   48  138        17.1              1               0.8    24    23   \n",
+       "38982  ...   79  116        12.0              1               0.6    24    20   \n",
+       "38983  ...   64  137        13.9              1               1.0    18    12   \n",
+       "\n",
+       "       Gtp  dental caries  smoking  \n",
+       "0      125              1        1  \n",
+       "1       30              1        0  \n",
+       "2      276              0        0  \n",
+       "3       36              0        0  \n",
+       "4       15              0        0  \n",
+       "...    ...            ...      ...  \n",
+       "38979   21              1        0  \n",
+       "38980   12              0        0  \n",
+       "38981   35              1        1  \n",
+       "38982   17              0        1  \n",
+       "38983   16              0        1  \n",
+       "\n",
+       "[38984 rows x 23 columns]"
+      ]
+     },
+     "execution_count": 157,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Reading dataset using pandas library\n",
+    "train_data = pd.read_csv(\"E:/MA336 - AI and ML/Project_AI/train_dataset.csv\")\n",
+    "train_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8e1cd0b9",
+   "metadata": {},
+   "source": [
+    "The dataset contains 38,984 observations and 23 columns: 22 predictor variables plus the response variable `smoking` in the last column.\n",
+    ">Description of each variable in the dataset:\n",
+    "> - ***age:***             Age of the individual in 5-year intervals.\n",
+    "> - ***height(cm):***      Height of the individual in centimeters.\n",
+    "> - ***weight(kg):***      Weight of the individual in kilograms.\n",
+    "> - ***waist(cm):***       Waist circumference of the individual in centimeters.\n",
+    "> - ***eyesight(left):***  Eyesight measurement of the left eye.\n",
+    "> - ***eyesight(right):*** Eyesight measurement of the right eye.\n",
+    "> - ***hearing(left):***   Hearing measurement of the left ear.\n",
+    "> - ***hearing(right):***  Hearing measurement of the right ear.\n",
+    "> - ***systolic:***        Systolic blood pressure measurement.\n",
+    "> - ***relaxation:***      Diastolic blood pressure measurement.\n",
+    "> - ***fasting blood sugar:***  Fasting blood sugar level.\n",
+    "> - ***Cholesterol:***     Total cholesterol level.\n",
+    "> - ***triglyceride:***    Triglyceride level.\n",
+    "> - ***HDL:***             High-density lipoprotein (HDL) cholesterol level.\n",
+    "> - ***LDL:***             Low-density lipoprotein (LDL) cholesterol level.\n",
+    "> - ***hemoglobin:***      Hemoglobin level in the blood.\n",
+    "> - ***Urine protein:***   Presence of protein in the urine.\n",
+    "> - ***serum creatinine:***  Serum creatinine level.\n",
+    "> - ***AST:***             Level of glutamic oxaloacetic transaminase (AST) enzyme.\n",
+    "> - ***ALT:***             Level of glutamic pyruvic transaminase (ALT) enzyme.\n",
+    "> - ***Gtp:***             Level of γ-GTP enzyme.\n",
+    "> - ***dental caries:***   Presence of dental caries (tooth decay).\n",
+    "> - ***smoking:***         Smoking status of the individual (1 for smoker, 0 for non-smoker).\n",
+    "\n",
+    "The aim is to predict smoking status from bio-signals.\n",
+    "\n",
+    "Target variable:\n",
+    "\n",
+    "***smoking*** = 1 for a smoker, ***smoking*** = 0 for a non-smoker.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 158,
+   "id": "d326b597",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 38984 entries, 0 to 38983\n",
+      "Data columns (total 23 columns):\n",
+      " #   Column               Non-Null Count  Dtype  \n",
+      "---  ------               --------------  -----  \n",
+      " 0   age                  38984 non-null  int64  \n",
+      " 1   height(cm)           38984 non-null  int64  \n",
+      " 2   weight(kg)           38984 non-null  int64  \n",
+      " 3   waist(cm)            38984 non-null  float64\n",
+      " 4   eyesight(left)       38984 non-null  float64\n",
+      " 5   eyesight(right)      38984 non-null  float64\n",
+      " 6   hearing(left)        38984 non-null  int64  \n",
+      " 7   hearing(right)       38984 non-null  int64  \n",
+      " 8   systolic             38984 non-null  int64  \n",
+      " 9   relaxation           38984 non-null  int64  \n",
+      " 10  fasting blood sugar  38984 non-null  int64  \n",
+      " 11  Cholesterol          38984 non-null  int64  \n",
+      " 12  triglyceride         38984 non-null  int64  \n",
+      " 13  HDL                  38984 non-null  int64  \n",
+      " 14  LDL                  38984 non-null  int64  \n",
+      " 15  hemoglobin           38984 non-null  float64\n",
+      " 16  Urine protein        38984 non-null  int64  \n",
+      " 17  serum creatinine     38984 non-null  float64\n",
+      " 18  AST                  38984 non-null  int64  \n",
+      " 19  ALT                  38984 non-null  int64  \n",
+      " 20  Gtp                  38984 non-null  int64  \n",
+      " 21  dental caries        38984 non-null  int64  \n",
+      " 22  smoking              38984 non-null  int64  \n",
+      "dtypes: float64(5), int64(18)\n",
+      "memory usage: 6.8 MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Basic Dataset information\n",
+    "train_data.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 159,
+   "id": "b21183eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a copy of the dataset so that the original DataFrame is never altered.\n",
+    "train_copy = train_data.copy()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 160,
+   "id": "f1ecf4d2",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Column 'age' has 0 missing values.\n",
+      "Column 'height(cm)' has 0 missing values.\n",
+      "Column 'weight(kg)' has 0 missing values.\n",
+      "Column 'waist(cm)' has 0 missing values.\n",
+      "Column 'eyesight(left)' has 0 missing values.\n",
+      "Column 'eyesight(right)' has 0 missing values.\n",
+      "Column 'hearing(left)' has 0 missing values.\n",
+      "Column 'hearing(right)' has 0 missing values.\n",
+      "Column 'systolic' has 0 missing values.\n",
+      "Column 'relaxation' has 0 missing values.\n",
+      "Column 'fasting blood sugar' has 0 missing values.\n",
+      "Column 'Cholesterol' has 0 missing values.\n",
+      "Column 'triglyceride' has 0 missing values.\n",
+      "Column 'HDL' has 0 missing values.\n",
+      "Column 'LDL' has 0 missing values.\n",
+      "Column 'hemoglobin' has 0 missing values.\n",
+      "Column 'Urine protein' has 0 missing values.\n",
+      "Column 'serum creatinine' has 0 missing values.\n",
+      "Column 'AST' has 0 missing values.\n",
+      "Column 'ALT' has 0 missing values.\n",
+      "Column 'Gtp' has 0 missing values.\n",
+      "Column 'dental caries' has 0 missing values.\n",
+      "Column 'smoking' has 0 missing values.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Check for missing values in each column\n",
+    "missing_values = train_data.isnull().sum()\n",
+    "\n",
+    "# Iterate over columns and print the number of missing values\n",
+    "for column, count in missing_values.items():\n",
+    "    print(f\"Column '{column}' has {count} missing values.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5dad71cb",
+   "metadata": {},
+   "source": [
+    "Hence, the dataset has no missing values."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd2ffc55",
+   "metadata": {},
+   "source": [
+    "#### Removing outliers using IQR Method\n",
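+    "The IQR method is a robust way to identify outliers: values lying more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) are flagged, and rows containing any flagged value are removed."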
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 161,
+   "id": "07ed30bc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[IQR SELECTED] Number of rows in training data before removing outliers: 38984\n",
+      "[IQR SELECTED] Number of rows in training data after removing outliers: 36528\n"
+     ]
+    }
+   ],
+   "source": [
+    "from scipy.stats import iqr\n",
+    "\n",
+    "# Print the number of rows in the training data before removing outliers\n",
+    "print(f\"[IQR SELECTED] Number of rows in training data before removing outliers: {len(train_copy)}\")\n",
+    "\n",
+    "# Identify and remove outliers from the training set using IQR method\n",
+    "Q1 = train_copy.quantile(0.25)\n",
+    "Q3 = train_copy.quantile(0.75)\n",
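+    "# Note: with the default axis=None, scipy.stats.iqr returns one IQR computed over\n",
+    "# all values in the DataFrame, so the same spread is applied to every column.\n",
+    "IQR = iqr(train_copy)\n",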
+    "lower_bound = Q1 - 1.5 * IQR\n",
+    "upper_bound = Q3 + 1.5 * IQR\n",
+    "train_copy = train_copy[(train_copy >= lower_bound) & (train_copy <= upper_bound)].dropna()\n",
+    "\n",
+    "\n",
+    "# Print the number of rows in the training data after removing outliers\n",
+    "print(f\"[IQR SELECTED] Number of rows in training data after removing outliers: {len(train_copy)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d77d4b5",
+   "metadata": {},
+   "source": [
+    "#### Removing Duplicates"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 162,
+   "id": "fa0e984f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of duplicates: 5159\n",
+      "Number of observations after removing duplicates: 31369\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Check for duplicates\n",
+    "duplicates = train_copy.duplicated()\n",
+    "\n",
+    "# Count the number of duplicates\n",
+    "num_duplicates = duplicates.sum()\n",
+    "print(\"Number of duplicates:\", num_duplicates)\n",
+    "\n",
+    "# Remove duplicates from the DataFrame\n",
+    "train_copy = train_copy.drop_duplicates()\n",
+    "\n",
+    "# Check the number of observations after removing duplicates\n",
+    "num_observations = len(train_copy)\n",
+    "print(\"Number of observations after removing duplicates:\", num_observations)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "595badb7",
+   "metadata": {},
+   "source": [
+    "### 3. DATA EXPLORATION"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 163,
+   "id": "223b36a2",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<style type=\"text/css\">\n",
+       "#T_ca6ab_row0_col0, #T_ca6ab_row0_col1, #T_ca6ab_row0_col2, #T_ca6ab_row0_col3, #T_ca6ab_row0_col4, #T_ca6ab_row0_col5, #T_ca6ab_row0_col6, #T_ca6ab_row0_col7, #T_ca6ab_row1_col0, #T_ca6ab_row1_col1, #T_ca6ab_row1_col2, #T_ca6ab_row1_col3, #T_ca6ab_row1_col4, #T_ca6ab_row1_col5, #T_ca6ab_row1_col6, #T_ca6ab_row1_col7, #T_ca6ab_row2_col0, #T_ca6ab_row2_col1, #T_ca6ab_row2_col2, #T_ca6ab_row2_col3, #T_ca6ab_row2_col4, #T_ca6ab_row2_col5, #T_ca6ab_row2_col6, #T_ca6ab_row2_col7, #T_ca6ab_row3_col0, #T_ca6ab_row3_col1, #T_ca6ab_row3_col2, #T_ca6ab_row3_col3, #T_ca6ab_row3_col4, #T_ca6ab_row3_col5, #T_ca6ab_row3_col6, #T_ca6ab_row3_col7, #T_ca6ab_row4_col0, #T_ca6ab_row4_col1, #T_ca6ab_row4_col2, #T_ca6ab_row4_col3, #T_ca6ab_row4_col4, #T_ca6ab_row4_col5, #T_ca6ab_row4_col6, #T_ca6ab_row4_col7, #T_ca6ab_row5_col0, #T_ca6ab_row5_col1, #T_ca6ab_row5_col2, #T_ca6ab_row5_col3, #T_ca6ab_row5_col4, #T_ca6ab_row5_col5, #T_ca6ab_row5_col6, #T_ca6ab_row5_col7, #T_ca6ab_row6_col0, #T_ca6ab_row6_col1, #T_ca6ab_row6_col2, #T_ca6ab_row6_col3, #T_ca6ab_row6_col4, #T_ca6ab_row6_col5, #T_ca6ab_row6_col6, #T_ca6ab_row6_col7, #T_ca6ab_row7_col0, #T_ca6ab_row7_col1, #T_ca6ab_row7_col2, #T_ca6ab_row7_col3, #T_ca6ab_row7_col4, #T_ca6ab_row7_col5, #T_ca6ab_row7_col6, #T_ca6ab_row7_col7, #T_ca6ab_row8_col0, #T_ca6ab_row8_col1, #T_ca6ab_row8_col2, #T_ca6ab_row8_col3, #T_ca6ab_row8_col4, #T_ca6ab_row8_col5, #T_ca6ab_row8_col6, #T_ca6ab_row8_col7, #T_ca6ab_row9_col0, #T_ca6ab_row9_col1, #T_ca6ab_row9_col2, #T_ca6ab_row9_col3, #T_ca6ab_row9_col4, #T_ca6ab_row9_col5, #T_ca6ab_row9_col6, #T_ca6ab_row9_col7, #T_ca6ab_row10_col0, #T_ca6ab_row10_col1, #T_ca6ab_row10_col2, #T_ca6ab_row10_col3, #T_ca6ab_row10_col4, #T_ca6ab_row10_col5, #T_ca6ab_row10_col6, #T_ca6ab_row10_col7, #T_ca6ab_row11_col0, #T_ca6ab_row11_col1, #T_ca6ab_row11_col2, #T_ca6ab_row11_col3, #T_ca6ab_row11_col4, #T_ca6ab_row11_col5, #T_ca6ab_row11_col6, #T_ca6ab_row11_col7, #T_ca6ab_row12_col0, #T_ca6ab_row12_col1, 
#T_ca6ab_row12_col2, #T_ca6ab_row12_col3, #T_ca6ab_row12_col4, #T_ca6ab_row12_col5, #T_ca6ab_row12_col6, #T_ca6ab_row12_col7, #T_ca6ab_row13_col0, #T_ca6ab_row13_col1, #T_ca6ab_row13_col2, #T_ca6ab_row13_col3, #T_ca6ab_row13_col4, #T_ca6ab_row13_col5, #T_ca6ab_row13_col6, #T_ca6ab_row13_col7, #T_ca6ab_row14_col0, #T_ca6ab_row14_col1, #T_ca6ab_row14_col2, #T_ca6ab_row14_col3, #T_ca6ab_row14_col4, #T_ca6ab_row14_col5, #T_ca6ab_row14_col6, #T_ca6ab_row14_col7, #T_ca6ab_row15_col0, #T_ca6ab_row15_col1, #T_ca6ab_row15_col2, #T_ca6ab_row15_col3, #T_ca6ab_row15_col4, #T_ca6ab_row15_col5, #T_ca6ab_row15_col6, #T_ca6ab_row15_col7, #T_ca6ab_row16_col0, #T_ca6ab_row16_col1, #T_ca6ab_row16_col2, #T_ca6ab_row16_col3, #T_ca6ab_row16_col4, #T_ca6ab_row16_col5, #T_ca6ab_row16_col6, #T_ca6ab_row16_col7, #T_ca6ab_row17_col0, #T_ca6ab_row17_col1, #T_ca6ab_row17_col2, #T_ca6ab_row17_col3, #T_ca6ab_row17_col4, #T_ca6ab_row17_col5, #T_ca6ab_row17_col6, #T_ca6ab_row17_col7, #T_ca6ab_row18_col0, #T_ca6ab_row18_col1, #T_ca6ab_row18_col2, #T_ca6ab_row18_col3, #T_ca6ab_row18_col4, #T_ca6ab_row18_col5, #T_ca6ab_row18_col6, #T_ca6ab_row18_col7, #T_ca6ab_row19_col0, #T_ca6ab_row19_col1, #T_ca6ab_row19_col2, #T_ca6ab_row19_col3, #T_ca6ab_row19_col4, #T_ca6ab_row19_col5, #T_ca6ab_row19_col6, #T_ca6ab_row19_col7, #T_ca6ab_row20_col0, #T_ca6ab_row20_col1, #T_ca6ab_row20_col2, #T_ca6ab_row20_col3, #T_ca6ab_row20_col4, #T_ca6ab_row20_col5, #T_ca6ab_row20_col6, #T_ca6ab_row20_col7, #T_ca6ab_row21_col0, #T_ca6ab_row21_col1, #T_ca6ab_row21_col2, #T_ca6ab_row21_col3, #T_ca6ab_row21_col4, #T_ca6ab_row21_col5, #T_ca6ab_row21_col6, #T_ca6ab_row21_col7, #T_ca6ab_row22_col0, #T_ca6ab_row22_col1, #T_ca6ab_row22_col2, #T_ca6ab_row22_col3, #T_ca6ab_row22_col4, #T_ca6ab_row22_col5, #T_ca6ab_row22_col6, #T_ca6ab_row22_col7 {\n",
+       "  font-size: 15px;\n",
+       "  color: #000000;\n",
+       "  border: 1.5px solid black;\n",
+       "}\n",
+       "</style>\n",
+       "<table id=\"T_ca6ab\">\n",
+       "  <thead>\n",
+       "    <tr>\n",
+       "      <th class=\"blank level0\" >&nbsp;</th>\n",
+       "      <th id=\"T_ca6ab_level0_col0\" class=\"col_heading level0 col0\" >count</th>\n",
+       "      <th id=\"T_ca6ab_level0_col1\" class=\"col_heading level0 col1\" >mean</th>\n",
+       "      <th id=\"T_ca6ab_level0_col2\" class=\"col_heading level0 col2\" >std</th>\n",
+       "      <th id=\"T_ca6ab_level0_col3\" class=\"col_heading level0 col3\" >min</th>\n",
+       "      <th id=\"T_ca6ab_level0_col4\" class=\"col_heading level0 col4\" >25%</th>\n",
+       "      <th id=\"T_ca6ab_level0_col5\" class=\"col_heading level0 col5\" >50%</th>\n",
+       "      <th id=\"T_ca6ab_level0_col6\" class=\"col_heading level0 col6\" >75%</th>\n",
+       "      <th id=\"T_ca6ab_level0_col7\" class=\"col_heading level0 col7\" >max</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row0\" class=\"row_heading level0 row0\" >age</th>\n",
+       "      <td id=\"T_ca6ab_row0_col0\" class=\"data row0 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row0_col1\" class=\"data row0 col1\" >44.160318</td>\n",
+       "      <td id=\"T_ca6ab_row0_col2\" class=\"data row0 col2\" >12.149882</td>\n",
+       "      <td id=\"T_ca6ab_row0_col3\" class=\"data row0 col3\" >20.000000</td>\n",
+       "      <td id=\"T_ca6ab_row0_col4\" class=\"data row0 col4\" >40.000000</td>\n",
+       "      <td id=\"T_ca6ab_row0_col5\" class=\"data row0 col5\" >40.000000</td>\n",
+       "      <td id=\"T_ca6ab_row0_col6\" class=\"data row0 col6\" >55.000000</td>\n",
+       "      <td id=\"T_ca6ab_row0_col7\" class=\"data row0 col7\" >85.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row1\" class=\"row_heading level0 row1\" >height(cm)</th>\n",
+       "      <td id=\"T_ca6ab_row1_col0\" class=\"data row1 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row1_col1\" class=\"data row1 col1\" >164.505563</td>\n",
+       "      <td id=\"T_ca6ab_row1_col2\" class=\"data row1 col2\" >9.236820</td>\n",
+       "      <td id=\"T_ca6ab_row1_col3\" class=\"data row1 col3\" >130.000000</td>\n",
+       "      <td id=\"T_ca6ab_row1_col4\" class=\"data row1 col4\" >160.000000</td>\n",
+       "      <td id=\"T_ca6ab_row1_col5\" class=\"data row1 col5\" >165.000000</td>\n",
+       "      <td id=\"T_ca6ab_row1_col6\" class=\"data row1 col6\" >170.000000</td>\n",
+       "      <td id=\"T_ca6ab_row1_col7\" class=\"data row1 col7\" >190.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row2\" class=\"row_heading level0 row2\" >weight(kg)</th>\n",
+       "      <td id=\"T_ca6ab_row2_col0\" class=\"data row2 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row2_col1\" class=\"data row2 col1\" >65.471166</td>\n",
+       "      <td id=\"T_ca6ab_row2_col2\" class=\"data row2 col2\" >12.738713</td>\n",
+       "      <td id=\"T_ca6ab_row2_col3\" class=\"data row2 col3\" >30.000000</td>\n",
+       "      <td id=\"T_ca6ab_row2_col4\" class=\"data row2 col4\" >55.000000</td>\n",
+       "      <td id=\"T_ca6ab_row2_col5\" class=\"data row2 col5\" >65.000000</td>\n",
+       "      <td id=\"T_ca6ab_row2_col6\" class=\"data row2 col6\" >75.000000</td>\n",
+       "      <td id=\"T_ca6ab_row2_col7\" class=\"data row2 col7\" >135.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row3\" class=\"row_heading level0 row3\" >waist(cm)</th>\n",
+       "      <td id=\"T_ca6ab_row3_col0\" class=\"data row3 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row3_col1\" class=\"data row3 col1\" >81.705155</td>\n",
+       "      <td id=\"T_ca6ab_row3_col2\" class=\"data row3 col2\" >9.233237</td>\n",
+       "      <td id=\"T_ca6ab_row3_col3\" class=\"data row3 col3\" >54.000000</td>\n",
+       "      <td id=\"T_ca6ab_row3_col4\" class=\"data row3 col4\" >75.000000</td>\n",
+       "      <td id=\"T_ca6ab_row3_col5\" class=\"data row3 col5\" >81.300000</td>\n",
+       "      <td id=\"T_ca6ab_row3_col6\" class=\"data row3 col6\" >87.800000</td>\n",
+       "      <td id=\"T_ca6ab_row3_col7\" class=\"data row3 col7\" >129.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row4\" class=\"row_heading level0 row4\" >eyesight(left)</th>\n",
+       "      <td id=\"T_ca6ab_row4_col0\" class=\"data row4 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row4_col1\" class=\"data row4 col1\" >1.014855</td>\n",
+       "      <td id=\"T_ca6ab_row4_col2\" class=\"data row4 col2\" >0.503095</td>\n",
+       "      <td id=\"T_ca6ab_row4_col3\" class=\"data row4 col3\" >0.100000</td>\n",
+       "      <td id=\"T_ca6ab_row4_col4\" class=\"data row4 col4\" >0.800000</td>\n",
+       "      <td id=\"T_ca6ab_row4_col5\" class=\"data row4 col5\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row4_col6\" class=\"data row4 col6\" >1.200000</td>\n",
+       "      <td id=\"T_ca6ab_row4_col7\" class=\"data row4 col7\" >9.900000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row5\" class=\"row_heading level0 row5\" >eyesight(right)</th>\n",
+       "      <td id=\"T_ca6ab_row5_col0\" class=\"data row5 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row5_col1\" class=\"data row5 col1\" >1.008850</td>\n",
+       "      <td id=\"T_ca6ab_row5_col2\" class=\"data row5 col2\" >0.494946</td>\n",
+       "      <td id=\"T_ca6ab_row5_col3\" class=\"data row5 col3\" >0.100000</td>\n",
+       "      <td id=\"T_ca6ab_row5_col4\" class=\"data row5 col4\" >0.800000</td>\n",
+       "      <td id=\"T_ca6ab_row5_col5\" class=\"data row5 col5\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row5_col6\" class=\"data row5 col6\" >1.200000</td>\n",
+       "      <td id=\"T_ca6ab_row5_col7\" class=\"data row5 col7\" >9.900000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row6\" class=\"row_heading level0 row6\" >hearing(left)</th>\n",
+       "      <td id=\"T_ca6ab_row6_col0\" class=\"data row6 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row6_col1\" class=\"data row6 col1\" >1.024897</td>\n",
+       "      <td id=\"T_ca6ab_row6_col2\" class=\"data row6 col2\" >0.155814</td>\n",
+       "      <td id=\"T_ca6ab_row6_col3\" class=\"data row6 col3\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row6_col4\" class=\"data row6 col4\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row6_col5\" class=\"data row6 col5\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row6_col6\" class=\"data row6 col6\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row6_col7\" class=\"data row6 col7\" >2.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row7\" class=\"row_heading level0 row7\" >hearing(right)</th>\n",
+       "      <td id=\"T_ca6ab_row7_col0\" class=\"data row7 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row7_col1\" class=\"data row7 col1\" >1.025758</td>\n",
+       "      <td id=\"T_ca6ab_row7_col2\" class=\"data row7 col2\" >0.158415</td>\n",
+       "      <td id=\"T_ca6ab_row7_col3\" class=\"data row7 col3\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row7_col4\" class=\"data row7 col4\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row7_col5\" class=\"data row7 col5\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row7_col6\" class=\"data row7 col6\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row7_col7\" class=\"data row7 col7\" >2.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row8\" class=\"row_heading level0 row8\" >systolic</th>\n",
+       "      <td id=\"T_ca6ab_row8_col0\" class=\"data row8 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row8_col1\" class=\"data row8 col1\" >121.136664</td>\n",
+       "      <td id=\"T_ca6ab_row8_col2\" class=\"data row8 col2\" >13.569889</td>\n",
+       "      <td id=\"T_ca6ab_row8_col3\" class=\"data row8 col3\" >71.000000</td>\n",
+       "      <td id=\"T_ca6ab_row8_col4\" class=\"data row8 col4\" >111.000000</td>\n",
+       "      <td id=\"T_ca6ab_row8_col5\" class=\"data row8 col5\" >120.000000</td>\n",
+       "      <td id=\"T_ca6ab_row8_col6\" class=\"data row8 col6\" >130.000000</td>\n",
+       "      <td id=\"T_ca6ab_row8_col7\" class=\"data row8 col7\" >223.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row9\" class=\"row_heading level0 row9\" >relaxation</th>\n",
+       "      <td id=\"T_ca6ab_row9_col0\" class=\"data row9 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row9_col1\" class=\"data row9 col1\" >75.743218</td>\n",
+       "      <td id=\"T_ca6ab_row9_col2\" class=\"data row9 col2\" >9.593874</td>\n",
+       "      <td id=\"T_ca6ab_row9_col3\" class=\"data row9 col3\" >40.000000</td>\n",
+       "      <td id=\"T_ca6ab_row9_col4\" class=\"data row9 col4\" >70.000000</td>\n",
+       "      <td id=\"T_ca6ab_row9_col5\" class=\"data row9 col5\" >76.000000</td>\n",
+       "      <td id=\"T_ca6ab_row9_col6\" class=\"data row9 col6\" >81.000000</td>\n",
+       "      <td id=\"T_ca6ab_row9_col7\" class=\"data row9 col7\" >146.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row10\" class=\"row_heading level0 row10\" >fasting blood sugar</th>\n",
+       "      <td id=\"T_ca6ab_row10_col0\" class=\"data row10 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row10_col1\" class=\"data row10 col1\" >98.004654</td>\n",
+       "      <td id=\"T_ca6ab_row10_col2\" class=\"data row10 col2\" >16.464372</td>\n",
+       "      <td id=\"T_ca6ab_row10_col3\" class=\"data row10 col3\" >46.000000</td>\n",
+       "      <td id=\"T_ca6ab_row10_col4\" class=\"data row10 col4\" >89.000000</td>\n",
+       "      <td id=\"T_ca6ab_row10_col5\" class=\"data row10 col5\" >95.000000</td>\n",
+       "      <td id=\"T_ca6ab_row10_col6\" class=\"data row10 col6\" >103.000000</td>\n",
+       "      <td id=\"T_ca6ab_row10_col7\" class=\"data row10 col7\" >234.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row11\" class=\"row_heading level0 row11\" >Cholesterol</th>\n",
+       "      <td id=\"T_ca6ab_row11_col0\" class=\"data row11 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row11_col1\" class=\"data row11 col1\" >195.880806</td>\n",
+       "      <td id=\"T_ca6ab_row11_col2\" class=\"data row11 col2\" >35.519681</td>\n",
+       "      <td id=\"T_ca6ab_row11_col3\" class=\"data row11 col3\" >55.000000</td>\n",
+       "      <td id=\"T_ca6ab_row11_col4\" class=\"data row11 col4\" >171.000000</td>\n",
+       "      <td id=\"T_ca6ab_row11_col5\" class=\"data row11 col5\" >194.000000</td>\n",
+       "      <td id=\"T_ca6ab_row11_col6\" class=\"data row11 col6\" >218.000000</td>\n",
+       "      <td id=\"T_ca6ab_row11_col7\" class=\"data row11 col7\" >348.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row12\" class=\"row_heading level0 row12\" >triglyceride</th>\n",
+       "      <td id=\"T_ca6ab_row12_col0\" class=\"data row12 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row12_col1\" class=\"data row12 col1\" >117.028914</td>\n",
+       "      <td id=\"T_ca6ab_row12_col2\" class=\"data row12 col2\" >57.650741</td>\n",
+       "      <td id=\"T_ca6ab_row12_col3\" class=\"data row12 col3\" >8.000000</td>\n",
+       "      <td id=\"T_ca6ab_row12_col4\" class=\"data row12 col4\" >73.000000</td>\n",
+       "      <td id=\"T_ca6ab_row12_col5\" class=\"data row12 col5\" >104.000000</td>\n",
+       "      <td id=\"T_ca6ab_row12_col6\" class=\"data row12 col6\" >150.000000</td>\n",
+       "      <td id=\"T_ca6ab_row12_col7\" class=\"data row12 col7\" >290.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row13\" class=\"row_heading level0 row13\" >HDL</th>\n",
+       "      <td id=\"T_ca6ab_row13_col0\" class=\"data row13 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row13_col1\" class=\"data row13 col1\" >57.722560</td>\n",
+       "      <td id=\"T_ca6ab_row13_col2\" class=\"data row13 col2\" >14.412297</td>\n",
+       "      <td id=\"T_ca6ab_row13_col3\" class=\"data row13 col3\" >4.000000</td>\n",
+       "      <td id=\"T_ca6ab_row13_col4\" class=\"data row13 col4\" >47.000000</td>\n",
+       "      <td id=\"T_ca6ab_row13_col5\" class=\"data row13 col5\" >56.000000</td>\n",
+       "      <td id=\"T_ca6ab_row13_col6\" class=\"data row13 col6\" >66.000000</td>\n",
+       "      <td id=\"T_ca6ab_row13_col7\" class=\"data row13 col7\" >157.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row14\" class=\"row_heading level0 row14\" >LDL</th>\n",
+       "      <td id=\"T_ca6ab_row14_col0\" class=\"data row14 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row14_col1\" class=\"data row14 col1\" >114.941280</td>\n",
+       "      <td id=\"T_ca6ab_row14_col2\" class=\"data row14 col2\" >32.910135</td>\n",
+       "      <td id=\"T_ca6ab_row14_col3\" class=\"data row14 col3\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row14_col4\" class=\"data row14 col4\" >92.000000</td>\n",
+       "      <td id=\"T_ca6ab_row14_col5\" class=\"data row14 col5\" >113.000000</td>\n",
+       "      <td id=\"T_ca6ab_row14_col6\" class=\"data row14 col6\" >136.000000</td>\n",
+       "      <td id=\"T_ca6ab_row14_col7\" class=\"data row14 col7\" >265.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row15\" class=\"row_heading level0 row15\" >hemoglobin</th>\n",
+       "      <td id=\"T_ca6ab_row15_col0\" class=\"data row15 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row15_col1\" class=\"data row15 col1\" >14.572973</td>\n",
+       "      <td id=\"T_ca6ab_row15_col2\" class=\"data row15 col2\" >1.563132</td>\n",
+       "      <td id=\"T_ca6ab_row15_col3\" class=\"data row15 col3\" >4.900000</td>\n",
+       "      <td id=\"T_ca6ab_row15_col4\" class=\"data row15 col4\" >13.500000</td>\n",
+       "      <td id=\"T_ca6ab_row15_col5\" class=\"data row15 col5\" >14.700000</td>\n",
+       "      <td id=\"T_ca6ab_row15_col6\" class=\"data row15 col6\" >15.700000</td>\n",
+       "      <td id=\"T_ca6ab_row15_col7\" class=\"data row15 col7\" >21.100000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row16\" class=\"row_heading level0 row16\" >Urine protein</th>\n",
+       "      <td id=\"T_ca6ab_row16_col0\" class=\"data row16 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row16_col1\" class=\"data row16 col1\" >1.080238</td>\n",
+       "      <td id=\"T_ca6ab_row16_col2\" class=\"data row16 col2\" >0.383653</td>\n",
+       "      <td id=\"T_ca6ab_row16_col3\" class=\"data row16 col3\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row16_col4\" class=\"data row16 col4\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row16_col5\" class=\"data row16 col5\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row16_col6\" class=\"data row16 col6\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row16_col7\" class=\"data row16 col7\" >6.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row17\" class=\"row_heading level0 row17\" >serum creatinine</th>\n",
+       "      <td id=\"T_ca6ab_row17_col0\" class=\"data row17 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row17_col1\" class=\"data row17 col1\" >0.883216</td>\n",
+       "      <td id=\"T_ca6ab_row17_col2\" class=\"data row17 col2\" >0.217669</td>\n",
+       "      <td id=\"T_ca6ab_row17_col3\" class=\"data row17 col3\" >0.100000</td>\n",
+       "      <td id=\"T_ca6ab_row17_col4\" class=\"data row17 col4\" >0.700000</td>\n",
+       "      <td id=\"T_ca6ab_row17_col5\" class=\"data row17 col5\" >0.900000</td>\n",
+       "      <td id=\"T_ca6ab_row17_col6\" class=\"data row17 col6\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row17_col7\" class=\"data row17 col7\" >11.600000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row18\" class=\"row_heading level0 row18\" >AST</th>\n",
+       "      <td id=\"T_ca6ab_row18_col0\" class=\"data row18 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row18_col1\" class=\"data row18 col1\" >24.974625</td>\n",
+       "      <td id=\"T_ca6ab_row18_col2\" class=\"data row18 col2\" >10.147233</td>\n",
+       "      <td id=\"T_ca6ab_row18_col3\" class=\"data row18 col3\" >6.000000</td>\n",
+       "      <td id=\"T_ca6ab_row18_col4\" class=\"data row18 col4\" >19.000000</td>\n",
+       "      <td id=\"T_ca6ab_row18_col5\" class=\"data row18 col5\" >23.000000</td>\n",
+       "      <td id=\"T_ca6ab_row18_col6\" class=\"data row18 col6\" >28.000000</td>\n",
+       "      <td id=\"T_ca6ab_row18_col7\" class=\"data row18 col7\" >156.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row19\" class=\"row_heading level0 row19\" >ALT</th>\n",
+       "      <td id=\"T_ca6ab_row19_col0\" class=\"data row19 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row19_col1\" class=\"data row19 col1\" >25.212758</td>\n",
+       "      <td id=\"T_ca6ab_row19_col2\" class=\"data row19 col2\" >16.668417</td>\n",
+       "      <td id=\"T_ca6ab_row19_col3\" class=\"data row19 col3\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row19_col4\" class=\"data row19 col4\" >15.000000</td>\n",
+       "      <td id=\"T_ca6ab_row19_col5\" class=\"data row19 col5\" >20.000000</td>\n",
+       "      <td id=\"T_ca6ab_row19_col6\" class=\"data row19 col6\" >30.000000</td>\n",
+       "      <td id=\"T_ca6ab_row19_col7\" class=\"data row19 col7\" >161.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row20\" class=\"row_heading level0 row20\" >Gtp</th>\n",
+       "      <td id=\"T_ca6ab_row20_col0\" class=\"data row20 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row20_col1\" class=\"data row20 col1\" >33.777519</td>\n",
+       "      <td id=\"T_ca6ab_row20_col2\" class=\"data row20 col2\" >26.767146</td>\n",
+       "      <td id=\"T_ca6ab_row20_col3\" class=\"data row20 col3\" >3.000000</td>\n",
+       "      <td id=\"T_ca6ab_row20_col4\" class=\"data row20 col4\" >17.000000</td>\n",
+       "      <td id=\"T_ca6ab_row20_col5\" class=\"data row20 col5\" >25.000000</td>\n",
+       "      <td id=\"T_ca6ab_row20_col6\" class=\"data row20 col6\" >40.000000</td>\n",
+       "      <td id=\"T_ca6ab_row20_col7\" class=\"data row20 col7\" >174.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row21\" class=\"row_heading level0 row21\" >dental caries</th>\n",
+       "      <td id=\"T_ca6ab_row21_col0\" class=\"data row21 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row21_col1\" class=\"data row21 col1\" >0.212630</td>\n",
+       "      <td id=\"T_ca6ab_row21_col2\" class=\"data row21 col2\" >0.409175</td>\n",
+       "      <td id=\"T_ca6ab_row21_col3\" class=\"data row21 col3\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row21_col4\" class=\"data row21 col4\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row21_col5\" class=\"data row21 col5\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row21_col6\" class=\"data row21 col6\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row21_col7\" class=\"data row21 col7\" >1.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th id=\"T_ca6ab_level0_row22\" class=\"row_heading level0 row22\" >smoking</th>\n",
+       "      <td id=\"T_ca6ab_row22_col0\" class=\"data row22 col0\" >31369.000000</td>\n",
+       "      <td id=\"T_ca6ab_row22_col1\" class=\"data row22 col1\" >0.349262</td>\n",
+       "      <td id=\"T_ca6ab_row22_col2\" class=\"data row22 col2\" >0.476744</td>\n",
+       "      <td id=\"T_ca6ab_row22_col3\" class=\"data row22 col3\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row22_col4\" class=\"data row22 col4\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row22_col5\" class=\"data row22 col5\" >0.000000</td>\n",
+       "      <td id=\"T_ca6ab_row22_col6\" class=\"data row22 col6\" >1.000000</td>\n",
+       "      <td id=\"T_ca6ab_row22_col7\" class=\"data row22 col7\" >1.000000</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n"
+      ],
+      "text/plain": [
+       "<pandas.io.formats.style.Styler at 0x21c824944f0>"
+      ]
+     },
+     "execution_count": 163,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Descriptive statistics of all numeric columns (predictors and the smoking target)\n",
+    "descriptive_stats = train_copy.describe()\n",
+    "\n",
+    "# Apply style to the table\n",
+    "styled_table = descriptive_stats.T.style.set_properties(**{\n",
+    "    \"font-size\": \"15px\",\n",
+    "    \"color\": \"#000000\",\n",
+    "    \"border\": \"1.5px solid black\"\n",
+    "})\n",
+    "\n",
+    "# Display the styled table\n",
+    "styled_table"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2614053",
+   "metadata": {},
+   "source": [
+    "From the descriptive statistics of the predictor variables in the train_data dataset, we can derive several insights and interpretations:\n",
+    "> - Age: The average age of the individuals in the dataset is approximately 44 years, with a standard deviation of 12.06. The minimum and maximum ages are 20 and 85 years, respectively.\n",
+    "> - Height and Weight: The average height is around 164.69 cm, with a standard deviation of 9.19. The average weight is approximately 65.94 kg, with a standard deviation of 12.90.\n",
+    "> - Waist Circumference: The average waist circumference is about 82.06 cm, with a standard deviation of 9.33. The minimum and maximum waist circumferences are 51 cm and 129 cm, respectively.\n",
+    "> - Eyesight and Hearing: The average eyesight values for both the left and right eyes are close to 1, indicating relatively good eyesight. The average hearing values for both the left and right ears are only slightly above 1, meaning that most individuals are coded 1, which suggests normal hearing ability.\n",
+    "> - Blood Pressure: The average systolic blood pressure is 121.48, with a standard deviation of 13.64. The average relaxation (diastolic) blood pressure is 75.99, with a standard deviation of 9.66.\n",
+    "> - Cholesterol Levels: The dataset includes variables such as total cholesterol, triglyceride levels, HDL cholesterol, and LDL cholesterol. The mean values and standard deviations provide insights into the cholesterol distribution within the population.\n",
+    "> - Hemoglobin: The average hemoglobin level is 14.62, with a standard deviation of 1.57. Hemoglobin levels can provide insights into the individual's blood health and oxygen-carrying capacity.\n",
+    "> - Urine Protein and Serum Creatinine: The average urine protein level is approximately 1.09, and the average serum creatinine level is around 0.89. These variables are indicators of kidney function and can be useful in assessing renal health.\n",
+    "> - AST and ALT: AST and ALT are liver enzyme levels. The mean AST value is 26.20, and the mean ALT value is 27.15. These values can provide insights into liver health.\n",
+    "> - Gtp: The average Gtp level is 39.91, with a standard deviation of 49.69. Elevated Gtp levels can indicate liver dysfunction or damage.\n",
+    "> - Dental Caries: The average dental caries value is 0.21. As this is a binary (0/1) variable, about 21% of the individuals have dental caries, a relatively low prevalence.\n",
+    "> - Smoking: The dataset includes a binary variable for smoking, with a mean value of 0.37. This indicates that approximately 37% of the individuals in the dataset are smokers, while the remaining 63% are non-smokers.\n",
+    "\n",
+    "These descriptive statistics provide an overview of the distribution and range of each variable in the dataset, allowing us to identify any potential outliers, assess the variability of the data, and gain initial insights into the characteristics of the study population."
+   ]
+  },
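+  {
+   "cell_type": "markdown",
+   "id": "a3f1c0de",
+   "metadata": {},
+   "source": [
+    "The outlier check mentioned above can be sketched with Tukey's IQR rule. This is a minimal illustration and not part of the original analysis: the helper `iqr_fences` and the factor `k=1.5` are assumptions, while the quartiles and maxima are copied from the `train_copy.describe()` table above.\n",
+    "\n",
+    "```python\n",
+    "def iqr_fences(q1, q3, k=1.5):\n",
+    "    # Tukey's rule: values outside [q1 - k*IQR, q3 + k*IQR] are flagged\n",
+    "    iqr = q3 - q1\n",
+    "    return q1 - k * iqr, q3 + k * iqr\n",
+    "\n",
+    "# (25%, 75%, max) taken from the train_copy.describe() table above\n",
+    "summary = {\n",
+    "    'Gtp': (17.0, 40.0, 174.0),\n",
+    "    'serum creatinine': (0.7, 1.0, 11.6),\n",
+    "    'height(cm)': (160.0, 170.0, 190.0),\n",
+    "}\n",
+    "\n",
+    "for name, (q1, q3, col_max) in summary.items():\n",
+    "    low, high = iqr_fences(q1, q3)\n",
+    "    print(f'{name}: upper fence {high:.2f}, max {col_max}, beyond fence: {col_max > high}')\n",
+    "```\n",
+    "\n",
+    "A maximum beyond the upper fence only shows that the column contains values worth inspecting; whether they are measurement errors or genuine extremes requires domain knowledge."
+   ]
+  },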
+  {
+   "cell_type": "code",
+   "execution_count": 164,
+   "id": "824b6e0d",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "image/png": "\n",
+      "text/plain": [
+       "<Figure size 700x700 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "#Data visualisation using pie chart\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# Set custom colors\n",
+    "colors = ['#ff9999', '#66b3ff']\n",
+    "\n",
+    "# Create the figure and axes\n",
+    "fig, ax = plt.subplots(figsize=[7, 7], facecolor='#f2f2f2')\n",
+    "\n",
+    "# Plot the pie chart\n",
+    "explode = [0, 0.15]\n",
+    "labels = train_data[\"smoking\"].value_counts().index\n",
+    "sizes = train_data[\"smoking\"].value_counts().values\n",
+    "ax.pie(sizes, explode=explode, labels=labels, autopct='%1.3f%%', shadow=True, startangle=90, colors=colors)\n",
+    "\n",
+    "# Add a title\n",
+    "ax.set_title(\"Smoking Percentage\")\n",
+    "\n",
+    "# Display the pie chart\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "025dd96f",
+   "metadata": {},
+   "source": [
+    "As can be seen in the pie chart above and in the descriptive statistics, approximately 37% of the individuals in the dataset are smokers, while the remaining 63% are non-smokers. \n",
+    "This means there is a class imbalance in the dataset. We deal with this in the training section of this report."
+   ]
+  },
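+  {
+   "cell_type": "markdown",
+   "id": "b47d92e1",
+   "metadata": {},
+   "source": [
+    "How the imbalance is handled is left to the training section. As one common option (an assumption here, not necessarily the approach used in this report), each class can be weighted inversely to its frequency. The sketch below uses made-up labels with the 63/37 split and a hypothetical helper `balanced_class_weights`.\n",
+    "\n",
+    "```python\n",
+    "from collections import Counter\n",
+    "\n",
+    "def balanced_class_weights(labels):\n",
+    "    # weight_c = n_samples / (n_classes * count_c)\n",
+    "    counts = Counter(labels)\n",
+    "    n, k = len(labels), len(counts)\n",
+    "    return {c: n / (k * cnt) for c, cnt in counts.items()}\n",
+    "\n",
+    "# Made-up labels with a 63/37 split (illustration only)\n",
+    "labels = [0] * 63 + [1] * 37\n",
+    "weights = balanced_class_weights(labels)\n",
+    "print(weights)\n",
+    "```\n",
+    "\n",
+    "With these weights the minority class (smokers) counts for more per sample, so both classes contribute equally in total to a weighted loss."
+   ]
+  },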
+  {
+   "cell_type": "markdown",
+   "id": "c18a9c81",
+   "metadata": {},
+   "source": [
+    "#### Correlation Between Smoking and Input Features"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 165,
+   "id": "636e512f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "\n",
+      "text/plain": [
+       "<Figure size 1000x600 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# Pearson correlation of every feature in train_copy with the target 'smoking'\n",
+    "correlations = train_copy.corr()['smoking'].drop('smoking')\n",
+    "\n",
+    "# Sort the correlations in descending order\n",
+    "correlations_sorted = correlations.sort_values(ascending=False)\n",
+    "\n",
+    "# Plot the correlations\n",
+    "plt.figure(figsize=(10, 6))\n",
+    "correlations_sorted.plot(kind='bar')\n",
+    "plt.xlabel('Input Features')\n",
+    "plt.ylabel('Correlation')\n",
+    "plt.title('Correlation between Smoking and Input Features')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b87b5bb3",
+   "metadata": {},
+   "source": [
+    "Based on the correlations between the \"smoking\" variable and the other input features, here are some insights and interpretations:\n",
+    "\n",
+    "***Positive Correlations:***\n",
+    "\n",
+    "Height (0.394), weight (0.297), waist (0.217), triglyceride (0.232), hemoglobin (0.398), AST (0.073), ALT (0.151), and Gtp (0.304) show positive correlations with smoking. This suggests that there might be a tendency for individuals who smoke to have higher values in these variables.\n",
+    "Dental caries (0.109) also shows a positive correlation, indicating a possible association between smoking and dental caries.\n",
+    "\n",
+    "***Negative Correlations:***\n",
+    "\n",
+    "Age (-0.171), HDL (-0.177), LDL (-0.047), and Cholesterol (-0.044) show negative correlations with smoking. This suggests that smoking might be associated with lower values in these variables.\n",
+    "\n",
+    "***Weak Correlations:***\n",
+    "\n",
+    "Urine protein (0.006), eyesight (left and right: 0.061 and 0.066), hearing (left and right: -0.022 and -0.019), systolic (0.061), relaxation (0.094), and fasting blood sugar (0.081) show only weak correlations with smoking. Serum creatinine (0.218) is positively correlated with smoking, at a level comparable to waist (0.217).\n",
+    "\n",
+    "As none of these values is close to 1 or -1, further analysis and domain knowledge are necessary to draw meaningful conclusions about the relationship between smoking and these variables."
+   ]
+  },
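+  {
+   "cell_type": "markdown",
+   "id": "c58e03f2",
+   "metadata": {},
+   "source": [
+    "Because `smoking` is a 0/1 variable, each correlation above is a point-biserial correlation: it is proportional to the difference between the smokers' mean and the non-smokers' mean of the feature. The sketch below checks this on made-up values (not taken from the dataset); the helper names `pearson` and `point_biserial` are hypothetical.\n",
+    "\n",
+    "```python\n",
+    "from math import sqrt\n",
+    "from statistics import mean, pstdev\n",
+    "\n",
+    "def pearson(x, y):\n",
+    "    # Pearson correlation: covariance divided by the product of the standard deviations\n",
+    "    mx, my = mean(x), mean(y)\n",
+    "    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)\n",
+    "    return cov / (pstdev(x) * pstdev(y))\n",
+    "\n",
+    "def point_biserial(values, labels):\n",
+    "    # Same quantity, written in terms of the two group means\n",
+    "    g1 = [v for v, l in zip(values, labels) if l == 1]\n",
+    "    g0 = [v for v, l in zip(values, labels) if l == 0]\n",
+    "    p = len(g1) / len(values)\n",
+    "    return (mean(g1) - mean(g0)) / pstdev(values) * sqrt(p * (1 - p))\n",
+    "\n",
+    "# Made-up hemoglobin-like values and smoking labels (illustration only)\n",
+    "values = [13.1, 14.0, 15.2, 16.0, 13.5, 15.8, 14.4, 16.3]\n",
+    "labels = [0, 0, 1, 1, 0, 1, 0, 1]\n",
+    "\n",
+    "r = pearson(values, labels)\n",
+    "r_pb = point_biserial(values, labels)\n",
+    "print(r, r_pb)\n",
+    "```\n",
+    "\n",
+    "Both functions return the same number, so a positive correlation simply means that smokers have a higher average value of the feature, relative to its overall spread."
+   ]
+  },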
+  {
+   "cell_type": "markdown",
+   "id": "d76b55d6",
+   "metadata": {},
+   "source": [
+    "### 4. FEATURE SELECTION"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3dd29098",
+   "metadata": {},
+   "source": [
+    "#### Method 1. Random Forest classifier\n",
+    "Implementation of a feature importance analysis using the ***Random Forest classifier*** in scikit-learn. As tree-based methods are not sensitive to the scale of the features, no scaling is performed before feature selection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 166,
+   "id": "983f6aea",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "\n",
+      "text/plain": [
+       "<Figure size 800x500 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "X = train_copy.iloc[:, :-1]\n",
+    "y = train_copy.smoking\n",
+    "\n",
+    "# Create a Random Forest classifier\n",
+    "rf = RandomForestClassifier(n_estimators=100, random_state=42)\n",
+    "\n",
+    "# Fit the model to the data\n",
+    "rf.fit(X, y)\n",
+    "\n",
+    "feature_names = X.columns \n",
+    "\n",
+    "# Get feature importance scores\n",
+    "importance_scores = rf.feature_importances_\n",
+    "\n",
+    "# Sort features by importance in ascending order: barh draws the first item\n",
+    "# at the bottom, so the most important feature ends up at the top of the plot\n",
+    "feature_importance = sorted(zip(importance_scores, feature_names))\n",
+    "\n",
+    "# Extract the feature names and importance scores\n",
+    "features = [feature_name for importance, feature_name in feature_importance]\n",
+    "importances = [importance for importance, feature_name in feature_importance]\n",
+    "\n",
+    "# Plot the feature importances\n",
+    "plt.figure(figsize=(8, 5))\n",
+    "plt.barh(features, importances)\n",
+    "plt.xlabel('Importance Score')\n",
+    "plt.ylabel('Features')\n",
+    "plt.title('Feature Importances')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a63a93d",
+   "metadata": {},
+   "source": [
+    "#### Method 2. Lasso regression\n",
+    "Implementation of a feature importance analysis using ***Lasso regression*** in scikit-learn.\n",
+    "Lasso is a linear regression model with L1 regularization, fitted here to the 0/1 smoking label. It works better when the features are on a similar scale. Hence, scaling is performed before feature selection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 99,
+   "id": "a7522491",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "\n",
+      "text/plain": [
+       "<Figure size 800x400 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from sklearn.linear_model import Lasso\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# Create a scaler object\n",
+    "scaler = StandardScaler()\n",
+    "\n",
+    "# Scale the features\n",
+    "X_scaled = scaler.fit_transform(train_copy.iloc[:, :-1])\n",
+    "\n",
+    "# Create a Lasso model\n",
+    "lasso = Lasso(alpha=0.1)\n",
+    "\n",
+    "# Fit the model to the scaled data\n",
+    "lasso.fit(X_scaled, train_copy.smoking)\n",
+    "\n",
+    "# Get the coefficients and corresponding feature names\n",
+    "coefficients = lasso.coef_\n",
+    "feature_names = train_copy.columns[:-1]\n",
+    "\n",
+    "# Create a bar plot for feature importance\n",
+    "plt.figure(figsize=(8, 4))\n",
+    "plt.bar(feature_names, coefficients)\n",
+    "plt.xticks(rotation=90)\n",
+    "plt.xlabel('Features')\n",
+    "plt.ylabel('Coefficient')\n",
+    "plt.title('Feature Importance based on Lasso Coefficients')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bd40058",
+   "metadata": {},
+   "source": [
+    "From both Method 1 and Method 2, it can be seen that ***height(cm), hemoglobin, and Gtp*** are the most important features for this dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e85e1b0",
+   "metadata": {},
+   "source": [
+    "### 5. MODEL TRAINING"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d6d31b7",
+   "metadata": {},
+   "source": [
+    "#### Step1 Standardization process\n",
+    "It is important to standardize features so that each is centered at 0 with a standard deviation of 1 when comparing measurements that have different units. This addresses the issue of variables being measured on different scales, which can lead to unequal contributions in the analysis and potential bias. For instance, a variable with a range of 0 to 1000 would overshadow a variable with a range of 0 to 1 if not standardized. By transforming the data to comparable scales, we can avoid this problem. Common data standardization techniques aim to equalize the range and/or variability of the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 167,
+   "id": "de2a405c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scaler = StandardScaler()\n",
+    "scaler.fit(train_copy.iloc[:, :-1])\n",
+    "X_scaled = scaler.transform(train_copy.iloc[:, :-1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "04b3377d",
+   "metadata": {},
+   "source": [
+    "#### Step2 Balancing Dataset\n",
+    "As the dataset is imbalanced, it needs to be balanced before training.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 168,
+   "id": "175c9a28",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Training Set Class before Balance: \n",
+      " 0    20413\n",
+      "1    10956\n",
+      "Name: smoking, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Show the class balance in the training set before oversampling\n",
+    "print('Training Set Class before Balance: \\n', train_copy.smoking.value_counts())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1a555364",
+   "metadata": {},
+   "source": [
+    "RandomOverSampler is a technique for handling imbalanced datasets by randomly duplicating samples from the minority class. The `sampling_strategy` parameter controls which classes are resampled and to what size. In this case, \"auto\" is equivalent to \"not majority\": every class except the majority class is oversampled until it has as many samples as the majority class."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 169,
+   "id": "ae232d8e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from imblearn.over_sampling import RandomOverSampler"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 170,
+   "id": "cdfd672a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create an instance of RandomOverSampler\n",
+    "ros = RandomOverSampler(sampling_strategy=\"auto\", random_state=11)\n",
+    "\n",
+    "# Resample the training data\n",
+    "x_rovs, y_rovs = ros.fit_resample(X_scaled, train_copy[\"smoking\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 171,
+   "id": "26406207",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Training Set Class after Balance: \n",
+      " 1    20413\n",
+      "0    20413\n",
+      "Name: smoking, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Show the class balance in the training set after oversampling\n",
+    "print('Training Set Class after Balance: \\n', y_rovs.value_counts())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f163df40",
+   "metadata": {},
+   "source": [
+    "#### Step3 Splitting the dataset into a training set and a testing set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 172,
+   "id": "6c206f15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "import numpy as np\n",
+    "\n",
+    "# Set the random seed for reproducibility\n",
+    "random_seed = 42\n",
+    "np.random.seed(random_seed)\n",
+    "\n",
+    "# Split the balanced dataset into train (80%) and test (20%) sets\n",
+    "X_train, X_test, y_train, y_test = train_test_split(x_rovs, y_rovs, test_size=0.2, random_state=random_seed)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "409199ab",
+   "metadata": {},
+   "source": [
+    "#### Step4 Training Model "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91c32798",
+   "metadata": {},
+   "source": [
+    "With a dataset size of 38,984 rows and 23 columns, and all input features being numeric, this is a suitable dataset for applying various binary classification models. Here are a few modeling techniques commonly used for binary classification tasks with numerical input features:\n",
+    "\n",
+    "> ***Logistic Regression:*** This is a popular and interpretable model that estimates the probabilities of belonging to each class using a logistic function.\n",
+    "\n",
+    "> ***Support Vector Machines (SVM):*** SVMs aim to find a hyperplane that maximally separates the data points of different classes in a high-dimensional space.\n",
+    "\n",
+    "> ***Random Forest:*** This ensemble model consists of multiple decision trees and can handle both numerical and categorical features. It provides good predictive performance and feature importance rankings.\n",
+    "\n",
+    "> ***Gradient Boosting Algorithms:*** Models like XGBoost or LightGBM are gradient boosting algorithms that sequentially train weak classifiers and combine their predictions to improve overall accuracy.\n",
+    "\n",
+    "It is a good idea to try out different models and evaluate their performance using appropriate evaluation metrics such as accuracy, precision, recall, and F1 score. Additionally, let's use techniques like cross-validation to assess the generalization ability of the models and tune hyperparameters for optimal performance."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1dfbf937",
+   "metadata": {},
+   "source": [
+    "### 6. EVALUATION"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74837a36",
+   "metadata": {},
+   "source": [
+    "In the code below, we first initialize the models (Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost) with their respective parameter settings. Then, we iterate over the models and use the `cross_val_score()` function to perform cross-validation with 5 folds (`cv=5`), evaluating the models on accuracy (`scoring='accuracy'`). The mean accuracy across all cross-validation folds is calculated and printed for each model. Finally, each model is fitted on the full training split and scored on the held-out test split."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 140,
+   "id": "75236aa4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Logistic Regression Cross-Validation Accuracy: 0.7368952847519902\n",
+      "Support Vector Machines Cross-Validation Accuracy: 0.7616962645437845\n",
+      "Random Forest Cross-Validation Accuracy: 0.7847826086956522\n",
+      "XGBoost Cross-Validation Accuracy: 0.78162890385793\n",
+      "Logistic Regression Test Accuracy: 0.7384276267450404\n",
+      "Support Vector Machines Test Accuracy: 0.7647563066372766\n",
+      "Random Forest Test Accuracy: 0.7893705608621112\n",
+      "XGBoost Test Accuracy: 0.7871662992897379\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.model_selection import cross_val_score\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.svm import SVC\n",
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "from xgboost import XGBClassifier\n",
+    "\n",
+    "# Initialize the models\n",
+    "logreg = LogisticRegression(max_iter=10000, solver='sag')\n",
+    "svm = SVC()\n",
+    "random_forest = RandomForestClassifier(n_estimators=150, max_depth=15, min_samples_split=2, min_samples_leaf=5)\n",
+    "xgb = XGBClassifier()\n",
+    "\n",
+    "# Perform cross-validation and evaluate models\n",
+    "models = [logreg, svm, random_forest, xgb]\n",
+    "model_names = [\"Logistic Regression\", \"Support Vector Machines\", \"Random Forest\", \"XGBoost\"]\n",
+    "\n",
+    "for model, name in zip(models, model_names):\n",
+    "    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
+    "    mean_accuracy = scores.mean()\n",
+    "    print(f\"{name} Cross-Validation Accuracy: {mean_accuracy}\")\n",
+    "\n",
+    "# Train and evaluate the models on the test set\n",
+    "for model, name in zip(models, model_names):\n",
+    "    model.fit(X_train, y_train)\n",
+    "    test_accuracy = model.score(X_test, y_test)\n",
+    "    print(f\"{name} Test Accuracy: {test_accuracy}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 144,
+   "id": "60d0a343",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Model</th>\n",
+       "      <th>Cross-Validation Accuracy</th>\n",
+       "      <th>Test Accuracy</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Logistic Regression</td>\n",
+       "      <td>0.736895</td>\n",
+       "      <td>0.738428</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Support Vector Machines</td>\n",
+       "      <td>0.761696</td>\n",
+       "      <td>0.764756</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Random Forest</td>\n",
+       "      <td>0.784783</td>\n",
+       "      <td>0.789371</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>XGBoost</td>\n",
+       "      <td>0.781629</td>\n",
+       "      <td>0.787166</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                     Model  Cross-Validation Accuracy  Test Accuracy\n",
+       "0      Logistic Regression                   0.736895       0.738428\n",
+       "1  Support Vector Machines                   0.761696       0.764756\n",
+       "2            Random Forest                   0.784783       0.789371\n",
+       "3                  XGBoost                   0.781629       0.787166"
+      ]
+     },
+     "execution_count": 144,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Define the accuracy values\n",
+    "data = {\n",
+    "    'Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest', 'XGBoost'],\n",
+    "    'Cross-Validation Accuracy': [0.7368952847519902, 0.7616962645437845, 0.7847826086956522, 0.78162890385793],\n",
+    "    'Test Accuracy': [0.7384276267450404, 0.7647563066372766, 0.7893705608621112, 0.7871662992897379]\n",
+    "}\n",
+    "\n",
+    "# Create a DataFrame from the data\n",
+    "df = pd.DataFrame(data)\n",
+    "\n",
+    "# Display the DataFrame\n",
+    "df\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4aea77cd",
+   "metadata": {},
+   "source": [
+    "From the accuracy values above, we can draw several insights about the performance of the models:\n",
+    "\n",
+    "***Cross-Validation Accuracy:***\n",
+    "\n",
+    "- Logistic Regression: 0.737\n",
+    "- Support Vector Machines: 0.762\n",
+    "- Random Forest: 0.785\n",
+    "- XGBoost: 0.782\n",
+    "\n",
+    "The cross-validation accuracy gives us an estimate of how well the models perform on data they were not fitted on. Based on these values, the Random Forest model has the highest cross-validation accuracy (0.785), with XGBoost close behind at 0.782. The Logistic Regression and Support Vector Machines models have lower cross-validation accuracies, but still show reasonable performance.\n",
+    "\n",
+    "***Test Accuracy:***\n",
+    "\n",
+    "- Logistic Regression: 0.738\n",
+    "- Support Vector Machines: 0.765\n",
+    "- Random Forest: 0.789\n",
+    "- XGBoost: 0.787\n",
+    "\n",
+    "The test accuracy is measured on the 20% hold-out split. Because oversampling was applied before the split, duplicated minority-class rows can appear in both the training and test sets, so these values are likely optimistic estimates of performance on truly unseen data. From the test accuracy values, the Random Forest model achieves the highest accuracy (0.789), followed by XGBoost (0.787) and Support Vector Machines (0.765). The Logistic Regression model has the lowest test accuracy but still performs reasonably well at 0.738.\n",
+    "\n",
+    "Based on these insights, it can be concluded that the ***Random Forest model*** appears to be the most promising in terms of both cross-validation and test accuracy, although its margin over XGBoost is small (under 0.01 on both measures). "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc020a50",
+   "metadata": {},
+   "source": [
+    "### 7. PARAMETER TUNING"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cfde57dd",
+   "metadata": {},
+   "source": [
+    "To tune hyperparameters, we can use techniques such as grid search or randomized search in combination with cross-validation. These techniques involve systematically searching through different combinations of hyperparameter values and evaluating the model's performance using cross-validation.\n",
+    "Since Random Forest achieved the highest accuracy among the models above, we tune its hyperparameters with grid search to try to improve its accuracy further."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 130,
+   "id": "bf54236e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Best Hyperparameters: {'random_forest__max_depth': None, 'random_forest__min_samples_leaf': 1, 'random_forest__min_samples_split': 2, 'random_forest__n_estimators': 200}\n",
+      "Best Cross-Validation Score: 0.8164740150642356\n",
+      "Test Accuracy: 0.8269654665686995\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.model_selection import GridSearchCV\n",
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "from imblearn.over_sampling import RandomOverSampler\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "\n",
+    "# Define the parameter grid for the Random Forest classifier\n",
+    "param_grid = {\n",
+    "    'random_forest__n_estimators': [100, 200, 300],\n",
+    "    'random_forest__max_depth': [None, 5, 10, 15],\n",
+    "    'random_forest__min_samples_split': [2, 5, 10],\n",
+    "    'random_forest__min_samples_leaf': [1, 3, 5]\n",
+    "}\n",
+    "\n",
+    "# Initialize the Random Forest classifier\n",
+    "random_forest = RandomForestClassifier()\n",
+    "\n",
+    "# Create a pipeline with scaling\n",
+    "pipeline = Pipeline([\n",
+    "    ('scaler', StandardScaler()),\n",
+    "    ('random_forest', random_forest)\n",
+    "])\n",
+    "\n",
+    "# Rebalance the training split (the data was balanced before splitting, so the random 80% split is only approximately balanced)\n",
+    "oversampler = RandomOverSampler(random_state=42)\n",
+    "X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)\n",
+    "\n",
+    "# Perform grid search with cross-validation\n",
+    "grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)\n",
+    "grid_search.fit(X_train_oversampled, y_train_oversampled)\n",
+    "\n",
+    "# Print the best hyperparameters and the corresponding cross-validation score\n",
+    "print(\"Best Hyperparameters:\", grid_search.best_params_)\n",
+    "print(\"Best Cross-Validation Score:\", grid_search.best_score_)\n",
+    "\n",
+    "# Evaluate the model on the test set\n",
+    "test_accuracy = grid_search.score(X_test, y_test)\n",
+    "print(\"Test Accuracy:\", test_accuracy)\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42e3bc9c",
+   "metadata": {},
+   "source": [
+    "Now let's predict the target variable for the ***test dataset***."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 173,
+   "id": "1a1244a1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>age</th>\n",
+       "      <th>height(cm)</th>\n",
+       "      <th>weight(kg)</th>\n",
+       "      <th>waist(cm)</th>\n",
+       "      <th>eyesight(left)</th>\n",
+       "      <th>eyesight(right)</th>\n",
+       "      <th>hearing(left)</th>\n",
+       "      <th>hearing(right)</th>\n",
+       "      <th>systolic</th>\n",
+       "      <th>relaxation</th>\n",
+       "      <th>...</th>\n",
+       "      <th>triglyceride</th>\n",
+       "      <th>HDL</th>\n",
+       "      <th>LDL</th>\n",
+       "      <th>hemoglobin</th>\n",
+       "      <th>Urine protein</th>\n",
+       "      <th>serum creatinine</th>\n",
+       "      <th>AST</th>\n",
+       "      <th>ALT</th>\n",
+       "      <th>Gtp</th>\n",
+       "      <th>dental caries</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>40</td>\n",
+       "      <td>170</td>\n",
+       "      <td>65</td>\n",
+       "      <td>75.1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>120</td>\n",
+       "      <td>70</td>\n",
+       "      <td>...</td>\n",
+       "      <td>260</td>\n",
+       "      <td>41</td>\n",
+       "      <td>132</td>\n",
+       "      <td>15.7</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>24</td>\n",
+       "      <td>26</td>\n",
+       "      <td>32</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>45</td>\n",
+       "      <td>170</td>\n",
+       "      <td>75</td>\n",
+       "      <td>89.0</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>1.2</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>100</td>\n",
+       "      <td>67</td>\n",
+       "      <td>...</td>\n",
+       "      <td>345</td>\n",
+       "      <td>49</td>\n",
+       "      <td>140</td>\n",
+       "      <td>15.7</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.1</td>\n",
+       "      <td>26</td>\n",
+       "      <td>28</td>\n",
+       "      <td>138</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>30</td>\n",
+       "      <td>180</td>\n",
+       "      <td>90</td>\n",
+       "      <td>94.0</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>115</td>\n",
+       "      <td>72</td>\n",
+       "      <td>...</td>\n",
+       "      <td>103</td>\n",
+       "      <td>53</td>\n",
+       "      <td>103</td>\n",
+       "      <td>13.5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>19</td>\n",
+       "      <td>29</td>\n",
+       "      <td>30</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>60</td>\n",
+       "      <td>170</td>\n",
+       "      <td>50</td>\n",
+       "      <td>73.0</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>118</td>\n",
+       "      <td>78</td>\n",
+       "      <td>...</td>\n",
+       "      <td>70</td>\n",
+       "      <td>65</td>\n",
+       "      <td>108</td>\n",
+       "      <td>14.1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.3</td>\n",
+       "      <td>31</td>\n",
+       "      <td>28</td>\n",
+       "      <td>33</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>30</td>\n",
+       "      <td>170</td>\n",
+       "      <td>65</td>\n",
+       "      <td>78.0</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>110</td>\n",
+       "      <td>70</td>\n",
+       "      <td>...</td>\n",
+       "      <td>210</td>\n",
+       "      <td>45</td>\n",
+       "      <td>103</td>\n",
+       "      <td>14.7</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>21</td>\n",
+       "      <td>21</td>\n",
+       "      <td>19</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16703</th>\n",
+       "      <td>60</td>\n",
+       "      <td>165</td>\n",
+       "      <td>65</td>\n",
+       "      <td>82.0</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>101</td>\n",
+       "      <td>68</td>\n",
+       "      <td>...</td>\n",
+       "      <td>131</td>\n",
+       "      <td>41</td>\n",
+       "      <td>110</td>\n",
+       "      <td>13.5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>78</td>\n",
+       "      <td>75</td>\n",
+       "      <td>33</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16704</th>\n",
+       "      <td>60</td>\n",
+       "      <td>155</td>\n",
+       "      <td>70</td>\n",
+       "      <td>93.0</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>134</td>\n",
+       "      <td>70</td>\n",
+       "      <td>...</td>\n",
+       "      <td>259</td>\n",
+       "      <td>53</td>\n",
+       "      <td>60</td>\n",
+       "      <td>13.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>19</td>\n",
+       "      <td>28</td>\n",
+       "      <td>28</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16705</th>\n",
+       "      <td>40</td>\n",
+       "      <td>155</td>\n",
+       "      <td>50</td>\n",
+       "      <td>67.2</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>134</td>\n",
+       "      <td>80</td>\n",
+       "      <td>...</td>\n",
+       "      <td>50</td>\n",
+       "      <td>64</td>\n",
+       "      <td>131</td>\n",
+       "      <td>13.4</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.7</td>\n",
+       "      <td>16</td>\n",
+       "      <td>10</td>\n",
+       "      <td>14</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16706</th>\n",
+       "      <td>35</td>\n",
+       "      <td>165</td>\n",
+       "      <td>70</td>\n",
+       "      <td>76.1</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>114</td>\n",
+       "      <td>68</td>\n",
+       "      <td>...</td>\n",
+       "      <td>43</td>\n",
+       "      <td>74</td>\n",
+       "      <td>118</td>\n",
+       "      <td>14.3</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.2</td>\n",
+       "      <td>19</td>\n",
+       "      <td>28</td>\n",
+       "      <td>30</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16707</th>\n",
+       "      <td>25</td>\n",
+       "      <td>180</td>\n",
+       "      <td>80</td>\n",
+       "      <td>87.0</td>\n",
+       "      <td>1.2</td>\n",
+       "      <td>0.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>110</td>\n",
+       "      <td>70</td>\n",
+       "      <td>...</td>\n",
+       "      <td>107</td>\n",
+       "      <td>53</td>\n",
+       "      <td>86</td>\n",
+       "      <td>15.9</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1.1</td>\n",
+       "      <td>23</td>\n",
+       "      <td>27</td>\n",
+       "      <td>23</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>16708 rows × 22 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "       age  height(cm)  weight(kg)  waist(cm)  eyesight(left)  \\\n",
+       "0       40         170          65       75.1             1.0   \n",
+       "1       45         170          75       89.0             0.7   \n",
+       "2       30         180          90       94.0             1.0   \n",
+       "3       60         170          50       73.0             0.5   \n",
+       "4       30         170          65       78.0             1.5   \n",
+       "...    ...         ...         ...        ...             ...   \n",
+       "16703   60         165          65       82.0             0.7   \n",
+       "16704   60         155          70       93.0             0.8   \n",
+       "16705   40         155          50       67.2             0.9   \n",
+       "16706   35         165          70       76.1             1.0   \n",
+       "16707   25         180          80       87.0             1.2   \n",
+       "\n",
+       "       eyesight(right)  hearing(left)  hearing(right)  systolic  relaxation  \\\n",
+       "0                  0.9              1               1       120          70   \n",
+       "1                  1.2              1               1       100          67   \n",
+       "2                  0.8              1               1       115          72   \n",
+       "3                  0.7              1               1       118          78   \n",
+       "4                  1.0              1               1       110          70   \n",
+       "...                ...            ...             ...       ...         ...   \n",
+       "16703              1.0              1               1       101          68   \n",
+       "16704              1.0              1               1       134          70   \n",
+       "16705              0.8              1               1       134          80   \n",
+       "16706              1.0              1               1       114          68   \n",
+       "16707              0.9              1               1       110          70   \n",
+       "\n",
+       "       ...  triglyceride  HDL  LDL  hemoglobin  Urine protein  \\\n",
+       "0      ...           260   41  132        15.7              1   \n",
+       "1      ...           345   49  140        15.7              1   \n",
+       "2      ...           103   53  103        13.5              1   \n",
+       "3      ...            70   65  108        14.1              1   \n",
+       "4      ...           210   45  103        14.7              1   \n",
+       "...    ...           ...  ...  ...         ...            ...   \n",
+       "16703  ...           131   41  110        13.5              1   \n",
+       "16704  ...           259   53   60        13.9              1   \n",
+       "16705  ...            50   64  131        13.4              1   \n",
+       "16706  ...            43   74  118        14.3              1   \n",
+       "16707  ...           107   53   86        15.9              1   \n",
+       "\n",
+       "       serum creatinine  AST  ALT  Gtp  dental caries  \n",
+       "0                   0.8   24   26   32              0  \n",
+       "1                   1.1   26   28  138              0  \n",
+       "2                   1.0   19   29   30              0  \n",
+       "3                   1.3   31   28   33              0  \n",
+       "4                   0.8   21   21   19              0  \n",
+       "...                 ...  ...  ...  ...            ...  \n",
+       "16703               0.8   78   75   33              0  \n",
+       "16704               0.7   19   28   28              1  \n",
+       "16705               0.7   16   10   14              0  \n",
+       "16706               1.2   19   28   30              1  \n",
+       "16707               1.1   23   27   23              0  \n",
+       "\n",
+       "[16708 rows x 22 columns]"
+      ]
+     },
+     "execution_count": 173,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Reading dataset using pandas library\n",
+    "test_data = pd.read_csv(\"E:/MA336 - AI and ML/Project_AI/test_dataset.csv\")\n",
+    "test_data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 174,
+   "id": "b4f3c39a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Random Forest Accuracy: 0.8268430075924565\n"
+     ]
+    }
+   ],
+   "source": [
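+    "from sklearn.metrics import accuracy_score\n",
+    "\n",
+    "# Refit the Random Forest with the best hyperparameters found by the grid search\n",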
+    "rf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=2, min_samples_leaf=1)\n",
+    "rf.fit(X_train, y_train)\n",
+    "rf_predictions = rf.predict(X_test)\n",
+    "rf_accuracy = accuracy_score(y_test, rf_predictions)\n",
+    "print(\"Random Forest Accuracy:\", rf_accuracy)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 182,
+   "id": "e486cd54",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Input Features: [ 40.  170.   65.   75.1   1.    0.9   1.    1.  120.   70.  102.  225.\n",
+      " 260.   41.  132.   15.7   1.    0.8  24.   26.   32.    0. ]\n",
+      "Predicted Value: 1\n",
+      "\n",
+      "Input Features: [ 45.  170.   75.   89.    0.7   1.2   1.    1.  100.   67.   96.  258.\n",
+      " 345.   49.  140.   15.7   1.    1.1  26.   28.  138.    0. ]\n",
+      "Predicted Value: 1\n",
+      "\n",
+      "Input Features: [ 30.  180.   90.   94.    1.    0.8   1.    1.  115.   72.   88.  177.\n",
+      " 103.   53.  103.   13.5   1.    1.   19.   29.   30.    0. ]\n",
+      "Predicted Value: 0\n",
+      "\n",
+      "Input Features: [ 60.  170.   50.   73.    0.5   0.7   1.    1.  118.   78.   86.  187.\n",
+      "  70.   65.  108.   14.1   1.    1.3  31.   28.   33.    0. ]\n",
+      "Predicted Value: 0\n",
+      "\n",
+      "Input Features: [ 30.  170.   65.   78.    1.5   1.    1.    1.  110.   70.   87.  190.\n",
+      " 210.   45.  103.   14.7   1.    0.8  21.   21.   19.    0. ]\n",
+      "Predicted Value: 0\n",
+      "\n",
+      "Input Features: [ 55.  175.   60.   75.    1.    1.    1.    1.  100.   64.   93.  186.\n",
+      "  80.   86.   84.   15.4   3.    1.   39.   20.   35.    0. ]\n",
+      "Predicted Value: 1\n",
+      "\n",
+      "Input Features: [ 40.  160.   55.   69.    1.5   1.5   1.    1.  112.   78.   90.  177.\n",
+      "  68.   78.   85.   12.4   1.    0.5  15.    9.   14.    0. ]\n",
+      "Predicted Value: 0\n",
+      "\n",
+      "Input Features: [ 55.  175.   60.   80.    1.2   1.5   1.    1.  137.   89.   80.  199.\n",
+      "  35.   68.  124.   16.    1.    1.1  23.   19.   17.    0. ]\n",
+      "Predicted Value: 1\n",
+      "\n",
+      "Input Features: [ 55.  160.   50.   68.    0.8   0.5   1.    1.  137.   87.   90.  176.\n",
+      "  36.   67.  102.   13.6   1.    0.7  15.   14.   13.    0. ]\n",
+      "Predicted Value: 0\n",
+      "\n",
+      "Input Features: [ 75.  145.   50.   81.    0.5   0.5   2.    2.  148.   86.  121.  192.\n",
+      " 109.   81.   89.   14.    1.    0.6  28.   24.   17.    1. ]\n",
+      "Predicted Value: 0\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "import numpy as np\n",
+    "\n",
+    "# Training data: assumes 'smoking' is the last column, so all other columns are features\n",
+    "X_tr = train_data.iloc[:, :-1]\n",
+    "y_tr = train_data.smoking\n",
+    "\n",
+    "# Unlabeled test data (named X_new so the earlier hold-out split X_test is not overwritten)\n",
+    "X_new = test_data\n",
+    "\n",
+    "# Create a Random Forest classifier with the tuned hyperparameters\n",
+    "rf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=2, min_samples_leaf=1)\n",
+    "\n",
+    "# Train the model\n",
+    "rf.fit(X_tr, y_tr)\n",
+    "\n",
+    "# Predict smoking status for the unlabeled test data\n",
+    "predictions = rf.predict(X_new)\n",
+    "\n",
+    "# Set the number of rows to print\n",
+    "num_rows = 10\n",
+    "\n",
+    "# Print the predicted values along with the input features of the test data\n",
+    "for features, prediction in zip(test_data.values[:num_rows], predictions[:num_rows]):\n",
+    "    print(\"Input Features:\", features)\n",
+    "    print(\"Predicted Value:\", prediction)\n",
+    "    print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78de9440",
+   "metadata": {},
+   "source": [
+    "The output above shows the predicted value for each observation. Only the first 10 rows are printed to keep the output short. Increasing num_rows displays more predictions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9e6a181",
+   "metadata": {},
+   "source": [
+    "***Results***\n",
+    "\n",
+    ">The predictive model was evaluated, validated, and tuned. We trained several machine learning algorithms, including Logistic Regression, Support Vector Machines, Random Forest, and XGBoost, to predict smoking status from bio-signals, and used cross-validation to estimate each model's performance on unseen data. The Random Forest model had the highest initial accuracy, and parameter tuning with grid search optimized its performance further.\n",
+    "\n",
+    ">After fine-tuning the hyperparameters, the Random Forest model achieved even higher accuracy. The best hyperparameters for the model were found to be 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, and 'n_estimators': 200. With these optimized hyperparameters, the model achieved a cross-validation accuracy of 81.6% and a test accuracy of 82.7%.\n",
+    "\n",
+    ">The parameter tuning process improved the model's performance by finding the best-performing combination of hyperparameters among those searched. By fine-tuning the Random Forest model, we were able to enhance its predictive power, resulting in higher accuracy and improved reliability for identifying smoking status from bio-signals."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7ebaa85",
+   "metadata": {},
+   "source": [
+    "### 8. CONCLUSIONS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b4fbed4",
+   "metadata": {},
+   "source": [
+    "In conclusion, this study developed a machine learning model that uses bio-signals to predict an individual's smoking status. Combining machine learning techniques with hyperparameter tuning improved the model's accuracy. The Random Forest model, after hyperparameter optimization, emerged as the most promising model, achieving a cross-validation accuracy of 81.6% and a test accuracy of 82.7%. These results indicate that the predictive model can identify smoking status with reasonable accuracy and can support healthcare professionals in assessing and addressing smoking behaviors. By leveraging bio-signals and machine learning, this work contributes to the development of reliable tools for smoking cessation strategies, which may ultimately lead to improved public health outcomes and a reduction in the harmful effects of smoking."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c15aa32",
+   "metadata": {},
+   "source": [
+    "### 9. REFERENCES"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "41ee617e",
+   "metadata": {},
+   "source": [
+    "https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}