The goal is to predict smokers and drinkers using body-signal data.

•   Manual observation to identify variables and get familiar with the dataset

The dataset contains medical records of individuals between the ages of 20 and 85. The records hold physical-examination details such as height, weight, sight, hearing, and blood pressure, along with other health signals from blood tests of cholesterol, liver function, kidney function, etc.

Using these health statistics, we try to classify individuals as smokers (distinguishing, for example, those who quit, regular smokers, and chain smokers) and also check whether they consume alcohol.

We can use the physical-examination data to identify outlier values and remove those records.

We can correlate the HDL, LDL, total cholesterol, and haemoglobin values with smoking status, and the kidney and liver function tests (SGOT, urine protein, etc.) with drinking status.

Then, based on new test records for an individual, we can try to predict what type of smoker he or she would be.

# Exploratory Data Analysis (EDA) 

The purpose of EDA is to understand the dataset, cleanse it, and analyse the relationships between variables.

Import libraries such as numpy, pandas, matplotlib, and seaborn for the analysis.

.shape returns the number of rows by the number of columns for the dataset. My output was (991346, 24), meaning the dataset has 991346 rows and 24 columns.

.head() returns the first 5 rows of the dataset. This is useful for seeing example values for each variable.

https://github.com/SaarthChahal/ML-DL/blob/main/first%205%20rows%20of%20dataset.png

.columns returns the names of all the columns in the dataset.

https://github.com/SaarthChahal/ML-DL/blob/main/dataset%20columns.png

After this I worked on getting a better understanding of the different values each variable takes.

.nunique(axis=0) returns the number of unique values for each variable.

.describe() summarizes the count, mean, standard deviation, min, and max for the numeric variables.

https://github.com/SaarthChahal/ML-DL/blob/main/describe%20dataset.png

data.describe().apply(lambda s: s.apply(lambda x: format(x, 'f'))) prints the same summary with the floats in plain (non-scientific) notation.
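
A minimal sketch of these EDA steps (the CSV file name below is a hypothetical placeholder):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the body-signal dataset; the file name here is an assumption
data = pd.read_csv('smoking_drinking_dataset.csv')

print(data.shape)            # (rows, columns), e.g. (991346, 24)
print(data.head())           # first 5 rows
print(data.columns)          # column names
print(data.nunique(axis=0))  # number of unique values per variable

# Same summary as .describe(), with floats in plain (non-scientific) notation
print(data.describe().apply(lambda s: s.apply(lambda x: format(x, 'f'))))
```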

**Cleaning the dataset by removing outliers and nulls**

Used .dropna(axis=0) to remove any rows with null values. There were no null values, so the cleaned data still returned the same (991346, 24) for data_cleaned.shape.

Removed outliers by using variable.between(lower_limit, upper_limit) and variable < limit.

The analysis cannot take string datatypes as arguments, so the gender variable labelled 'sex' had to be either encoded as a number for male/female or dropped entirely; I chose to drop the column using .drop('variable').

I also encoded/converted the DRK_YN variable's string values (Y or N) to the numbers 1 or 0. A sketch of these cleaning steps follows below.
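
A minimal sketch of the cleaning, reusing the DataFrame from the EDA sketch above; the outlier limits and the 'height'/'waistline' column names are illustrative assumptions:

```python
# Drop any rows with null values (this dataset turned out to have none)
data_cleaned = data.dropna(axis=0)

# Remove outliers; these limits and column names are illustrative assumptions
data_cleaned = data_cleaned[data_cleaned['height'].between(130, 200)]
data_cleaned = data_cleaned[data_cleaned['waistline'] < 200]

# Drop the string-valued gender column instead of encoding it
data_cleaned = data_cleaned.drop('sex', axis=1)

# Encode the drinking flag as a number: Y -> 1, N -> 0
data_cleaned['DRK_YN'] = data_cleaned['DRK_YN'].map({'Y': 1, 'N': 0})

print(data_cleaned.shape)  # (985543, 23) after cleaning
```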
https://github.com/SaarthChahal/ML-DL/blob/main/cleaned%20dataset.png

**From the shape output (985543, 23): cleaning removed 5803 records and 1 column.**

**Data plotting exercise** to analyze the relationships between variables: calculate the correlation matrix.

There are too many variables to produce a readable correlation matrix and heatmap in one go, so I created two smaller variable arrays, one for the smoking correlations and one for the drinking correlations; a plotting sketch follows below.
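
A sketch of the correlation heatmap and pairplot calls; the column subset below is an illustrative assumption:

```python
# Correlation matrix and heatmap over a smaller, smoking-related subset
smoke_cols = ['SMK_stat_type_cd', 'tot_chole', 'HDL_chole',
              'LDL_chole', 'triglyceride', 'hemoglobin']
sns.heatmap(data_cleaned[smoke_cols].corr(), annot=True, cmap='coolwarm')
plt.show()

# Pairwise scatterplots between some key variables (sampled to keep it fast)
sns.pairplot(data_cleaned[smoke_cols].sample(5000))
plt.show()
```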

Heatmap for smokers:
https://github.com/SaarthChahal/ML-DL/blob/main/heatmap%20for%20smokers.png

Heatmap for drinkers:
https://github.com/SaarthChahal/ML-DL/blob/main/heatmap%20for%20drinkers.png

Scatterplot for total cholesterol of smokers:
https://github.com/SaarthChahal/ML-DL/blob/main/total%20cholestrol%20scatterplot%20for%20smokers.png

Using sns.pairplot() I created scatterplots between some of the key variables:
https://github.com/SaarthChahal/ML-DL/blob/main/scatterplots%20for%20pairs%20of%20variables.png

# Model training module

**LINEAR REGRESSION MODEL**

Train a linear regression model.

We first need to split the data into an X1 array (cholesterol features) to train on and a y1 array (SMK_stat_type_cd) holding the target variable, and likewise into an X2 array (kidney-function features) and a y2 array (DRK_YN).

Train/test split: the test set is 40 % and the training set is 60 %.

Load the linear regression model, fit it on the training data, and generate predictions; a sketch follows below.
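
A sketch of the split and fit; the exact feature lists are assumptions based on the description above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Feature lists are assumptions based on the description above
X1 = data_cleaned[['tot_chole', 'HDL_chole', 'LDL_chole',
                   'triglyceride', 'hemoglobin']]         # cholesterol-related
y1 = data_cleaned['SMK_stat_type_cd']                     # smoking status

X2 = data_cleaned[['serum_creatinine', 'urine_protein']]  # kidney function
y2 = data_cleaned['DRK_YN']                               # drinking flag

# 60 % train / 40 % test split
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y1, test_size=0.4, random_state=101)

lm1 = LinearRegression()
lm1.fit(X1_train, y1_train)
predictions1 = lm1.predict(X1_test)
```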

**Model evaluation**

Let's evaluate the model by checking out its coefficients and how we can interpret them.

Learning model intercept 1 (smokers): -1.2726375396245835

Learning model intercept 2 (drinkers): 0.4141733052412185

Coefficients for smokers: https://github.com/SaarthChahal/ML-DL/blob/main/coefficient.png

Coefficients for drinkers: https://github.com/SaarthChahal/ML-DL/blob/main/coefficient2.png
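
A common way to inspect the fitted intercept and coefficients (reusing `lm1` and `X1` from the sketch above):

```python
print('Intercept:', lm1.intercept_)  # e.g. -1.2726... for the smokers model

# One fitted coefficient per feature, labelled by column name
coef_df = pd.DataFrame(lm1.coef_, X1.columns, columns=['Coefficient'])
print(coef_df)
```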

Interpreting the coefficients:

For every one-unit change in smoking status there is a negative impact on cholesterol (reflected as a negative coefficient), and an increase in triglyceride and hemoglobin, which negatively affects the health indicators.

**Prediction from the model**

Prediction scatterplot for smokers: https://github.com/SaarthChahal/ML-DL/blob/main/prediction%20scatterplot%20for%20smokers.png

Displot prediction for smokers: https://github.com/SaarthChahal/ML-DL/blob/main/displot%20method%20for%20smokers.png

Prediction scatterplot for drinkers: https://github.com/SaarthChahal/ML-DL/blob/main/scatterplot%20prediction%20for%20drinkers.png

Displot prediction for drinkers: https://github.com/SaarthChahal/ML-DL/blob/main/displot%20method%20for%20smokers.png
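
A sketch of how such plots are typically produced with matplotlib and seaborn (reusing `y1_test` and `predictions1` from above):

```python
# Predicted vs. actual values for the smokers model
plt.scatter(y1_test, predictions1, alpha=0.1)
plt.xlabel('Actual smoking status')
plt.ylabel('Predicted smoking status')
plt.show()

# Residual distribution; a roughly bell-shaped histogram suggests a sane fit
sns.displot(y1_test - predictions1, bins=50)
plt.show()
```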

**Regression Evaluation Metrics**

Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute values of the errors. It is the easiest to understand, because it is simply the average error.

**Mean Squared Error** (MSE) is the mean of the squared errors. It is more popular than MAE because it "punishes" larger errors, which tends to be useful in the real world.

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors. It is even more popular than MSE because it is interpretable in the units of "y".
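
A sketch of computing all three with scikit-learn (reusing the test targets and predictions from above):

```python
import numpy as np
from sklearn import metrics

print('MAE:',  metrics.mean_absolute_error(y1_test, predictions1))
print('MSE:',  metrics.mean_squared_error(y1_test, predictions1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y1_test, predictions1)))
```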

Regression evaluation metrics for smokers:

MAE: 1.1094765424193946

MSE: 1.8638480376557032

RMSE: 1.3652281998463491

Regression evaluation metrics for drinkers:

MAE: 0.4814403810202582

MSE: 0.23847155682695156

RMSE: 0.4883354961775271

List of variables used: https://github.com/SaarthChahal/ML-DL/blob/main/variables.png

**DECISION TREE AND RANDOM FOREST MODEL**

|**Decision Tree**|**Random Forest**|
|-----------------|-----------------|
|A decision tree is a tree-like model of decisions along with their possible outcomes in a diagram.|A classification algorithm consisting of many decision trees combined to get a more accurate result than a single tree.|
|There is always scope for overfitting, caused by the presence of variance.|The random forest algorithm avoids and prevents overfitting by using multiple trees.|
|The results can be less accurate.|This gives more accurate and precise results.|
|Decision trees require little computation, reducing implementation time, but carry lower accuracy.|This consumes more computation; generating and analyzing the trees is time-consuming.|
|It is easy to visualize; the only task is to fit the decision tree model.|Visualization is complex, as it determines the pattern behind the data.|

This is a supervised machine learning problem: we have both the features (data on health parameters) and the targets (smoking and drinking status) that we want to predict.

The reported averages include the macro average (the unweighted mean per label), the weighted average (the support-weighted mean per label), and the sample average (only for multilabel classification). The micro average (computed from the total true positives, false negatives, and false positives) is only shown for multi-label problems, or multi-class problems with a subset of classes, because otherwise it corresponds to accuracy and would be the same for all metrics.

**Classification metrics**

Precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Precision is intuitively the ability of the classifier not to label a negative sample as positive.

Recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Recall is intuitively the ability of the classifier to find all the positive samples.

F-score: the F-beta score can be interpreted as a weighted harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0. The F-beta score weights recall more than precision by a factor of beta; beta == 1.0 means recall and precision are equally important.
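
A sketch of fitting both classifiers and printing the reports below (reusing the smokers split from above; `n_estimators` is an assumption):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Decision tree on the smoking features/target
dtree = DecisionTreeClassifier()
dtree.fit(X1_train, y1_train)
print(classification_report(y1_test, dtree.predict(X1_test)))

# Random forest: an ensemble of trees, usually more accurate
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X1_train, y1_train)
print(classification_report(y1_test, rfc.predict(X1_test)))
```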

Smokers metrics for the decision tree:

                  precision    recall  f1-score   support

               1       0.77      0.75      0.76    179903
               2       0.31      0.32      0.32     52132
               3       0.40      0.41      0.40     63628

        accuracy                           0.60    295663
       macro avg       0.49      0.49      0.49    295663
    weighted avg       0.61      0.60      0.61    295663

Smokers metrics for the random forest model:

                  precision    recall  f1-score   support

               1       0.80      0.85      0.82    179903
               2       0.43      0.32      0.37     52132
               3       0.52      0.54      0.53     63628

        accuracy                           0.69    295663
       macro avg       0.58      0.57      0.57    295663
    weighted avg       0.67      0.69      0.68    295663

Drinkers metrics for the decision tree:

                  precision    recall  f1-score   support

               0       0.63      0.63      0.63    147705
               1       0.63      0.63      0.63    147958

        accuracy                           0.63    295663
       macro avg       0.63      0.63      0.63    295663
    weighted avg       0.63      0.63      0.63    295663

Drinkers metrics for the random forest model:

                  precision    recall  f1-score   support

               0       0.72      0.72      0.72    147705
               1       0.72      0.72      0.72    147958

        accuracy                           0.72    295663
       macro avg       0.72      0.72      0.72    295663
    weighted avg       0.72      0.72      0.72    295663

**Confusion matrix**

A confusion matrix is used to evaluate the accuracy of a classification; it measures the quality of a classifier's output on the data.

The diagonal elements represent the number of points for which the predicted label equals the true label, while the off-diagonal elements are those mislabeled by the classifier.

The higher the diagonal values of the confusion matrix, the better, indicating many correct predictions.
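
A sketch of producing the matrices below (reusing the classifiers fitted above):

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y1_test, dtree.predict(X1_test)))  # decision tree
print(confusion_matrix(y1_test, rfc.predict(X1_test)))    # random forest
```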

Decision tree confusion matrix for smokers:

    [[135735  20884  23284]
     [ 19324  16859  15949]
     [ 21749  15999  25880]]

Random forest confusion matrix for smokers:

    [[152378  11722  15803]
     [ 19136  16504  16492]
     [ 19491   9826  34311]]

Decision tree confusion matrix for drinkers:

```
[[93021 54684]
 [54020 93938]]
```

Random forest confusion matrix for drinkers:

```
[[106993  40712]
 [ 41075 106883]]
```

# Conclusion

To compare linear regression, decision trees, and random forests for this supervised learning task, let's assess them on performance, accuracy, computational speed, and memory usage. The specific results may vary with the dataset and implementation, but a general comparison follows:

## Linear Regression

*   Performance: Linear regression performs well when the relationship between the features and the target variable is roughly linear.
*   Accuracy: Accuracy is moderate for linear relationships but may suffer when the data has complex, non-linear patterns.
*   Computational Speed: Linear regression is very computationally efficient and quick to train.
*   Memory Usage: Linear regression models have a small memory footprint, as they store only one coefficient per feature.

## Decision Tree

*   Performance: Decision trees are versatile and can capture both linear and non-linear relationships in data.
*   Accuracy: Decision trees can provide high accuracy but may overfit when not properly pruned.
*   Computational Speed: Building decision trees is usually fast, especially for small to moderately sized datasets.
*   Memory Usage: Decision trees can consume moderate memory, especially if they are deep.

## Random Forest

*   Performance: Random forests are robust and handle non-linearity well, making them suitable for a wide range of problems.
*   Accuracy: Random forests often provide high accuracy and reduce overfitting through ensemble techniques.
*   Computational Speed: Training a random forest involves building multiple decision trees, which can be slower than linear regression but faster than training a single deep decision tree on large datasets.
*   Memory Usage: Random forests consume more memory than linear regression but are typically more memory-efficient than deep decision trees.

### In summary

Linear regression is computationally efficient and interpretable but may not perform well with complex, non-linear data.

Decision trees are versatile and can capture non-linear relationships but may overfit and require careful pruning.

Random forests are robust, accurate, and handle non-linearity well, making them suitable for a wide range of tasks. They offer a balance between accuracy and computational efficiency.

### Recommendations

I would recommend the random forest model, as it provides the **highest accuracy** here. Its computational speed is slower by comparison, but speed is not crucial for analyzing data on smokers and drinkers.