a/README.md b/README.md
# Stroke-Predictions
XGBoost model to predict stroke risk from clinical patient information, aiding in early detection and prevention.

# Data
Data is from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
- 10 clinical features
  - Gender, Age, Hypertension, Heart Disease, Marriage Status, Work Type, Residence Type, Average Glucose Level, BMI, Smoking Status
- 1 binary target variable
  - Stroke
- 5110 observations
- Class Imbalance: 7% of samples are positive for having had a stroke
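
The class imbalance noted above can be checked directly from the raw file. The following is a minimal sketch, assuming the target column is named `stroke` and holds 0/1 values; the column name and the tiny inline sample are illustrative assumptions, not taken from this repository.

```python
import csv
import io


def positive_share(rows, target="stroke"):
    """Return the fraction of rows whose target column equals 1."""
    labels = [int(row[target]) for row in rows]
    return sum(labels) / len(labels)


# Made-up four-row sample standing in for the real CSV file.
sample_csv = io.StringIO("id,age,stroke\n1,67,1\n2,34,0\n3,52,0\n4,80,0\n")
share = positive_share(csv.DictReader(sample_csv))
print(f"positive share: {share:.2%}")
```

To run it on the real data, pass a `csv.DictReader` opened on the downloaded file instead of the inline sample.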

# Exploratory Analyses
<img src="https://github.com/adey4/Stroke-Predictions/blob/main/age_stroke.png?raw=true" width=500 height=400>

On average, patients who have had a stroke are ~26 years older than those who have not.
<br />
<br />
<br />

<img src="https://github.com/adey4/Stroke-Predictions/blob/main/marriage_stroke.png?raw=true" width=500 height=400>

Patients who have been married are ~4x more likely to have had a stroke than those who have never been married.
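
Both comparisons above reduce to simple group statistics: a difference in mean age and a ratio of stroke rates. The sketch below shows one way to compute them; the field names `age`, `ever_married`, and `stroke` and the records themselves are made up for illustration and are not the project's data.

```python
from statistics import mean


def mean_age_gap(records):
    """Mean age of stroke patients minus mean age of everyone else."""
    stroke_ages = [r["age"] for r in records if r["stroke"] == 1]
    other_ages = [r["age"] for r in records if r["stroke"] == 0]
    return mean(stroke_ages) - mean(other_ages)


def stroke_rate_ratio(records, flag="ever_married"):
    """Stroke rate where `flag` is true divided by the rate where it is false."""
    with_flag = [r["stroke"] for r in records if r[flag]]
    without_flag = [r["stroke"] for r in records if not r[flag]]
    return mean(with_flag) / mean(without_flag)


# Hypothetical records, chosen only to make the arithmetic easy to follow.
records = [
    {"age": 70, "ever_married": True, "stroke": 1},
    {"age": 60, "ever_married": True, "stroke": 1},
    {"age": 50, "ever_married": True, "stroke": 0},
    {"age": 40, "ever_married": True, "stroke": 0},
    {"age": 80, "ever_married": False, "stroke": 1},
    {"age": 30, "ever_married": False, "stroke": 0},
    {"age": 20, "ever_married": False, "stroke": 0},
    {"age": 20, "ever_married": False, "stroke": 0},
]

print(mean_age_gap(records))
print(stroke_rate_ratio(records))
```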

# Final Model
XGBClassifier with `eta = 0.3`, `lambda = 1`, `max_depth = 5`, and `min_child_weight = 2`
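
The hyperparameters above use XGBoost's native names. In the scikit-learn style wrapper, `eta` is exposed as `learning_rate` and `lambda` as `reg_lambda`. The sketch below shows how such a configuration might be passed to `XGBClassifier`; the helper names `sklearn_kwargs` and `build_model` are hypothetical, and this is not necessarily how the project constructs its model.

```python
# Hyperparameters as listed above, using XGBoost's native parameter names.
NATIVE_PARAMS = {"eta": 0.3, "lambda": 1, "max_depth": 5, "min_child_weight": 2}

# The scikit-learn wrapper uses different keyword names for two of them.
SKLEARN_ALIASES = {"eta": "learning_rate", "lambda": "reg_lambda"}


def sklearn_kwargs(native_params):
    """Translate native parameter names into XGBClassifier keyword names."""
    return {SKLEARN_ALIASES.get(name, name): value
            for name, value in native_params.items()}


def build_model(native_params=NATIVE_PARAMS):
    """Build the classifier; xgboost is imported lazily and must be installed."""
    from xgboost import XGBClassifier
    return XGBClassifier(**sklearn_kwargs(native_params))


print(sklearn_kwargs(NATIVE_PARAMS))
```

Calling `build_model()` returns an unfitted classifier that can then be trained with `fit(X_train, y_train)`.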

# Model Evaluation

<img src="https://github.com/adey4/Stroke-Predictions/blob/main/cf_matrix.png?raw=true" width=450 height=400>

The final model was chosen to maximize recall, since false negatives are worse than false positives when predicting stroke. A false negative may leave an at-risk patient without proper treatment, while a false positive only leads to an unnecessary checkup.

The final model achieved a recall of 0.07 and an accuracy of 0.94, with a precision of 0.40 and an F1-score of 0.13. The high accuracy largely reflects the class imbalance: with 7% positive samples, predicting "no stroke" for every patient would already score about 0.93.
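
All four reported metrics follow from the counts in a confusion matrix. The sketch below shows the arithmetic; the counts are made up to be of a similar order to the reported scores and are not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: few true positives, many missed strokes.
metrics = classification_metrics(tp=4, fp=6, fn=50, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy stays high because true negatives dominate, while recall stays low because most positive cases are missed.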

# Recommendations
Further research and model development are required to obtain a model suitable for production use. The final model's recall of 0.07 is too low for stroke prediction and may lead to dangerous false negatives. The model has not learnt to predict the positive class well because of the low proportion of positive samples (7%) in the dataset. A future project could explore resampling techniques, such as oversampling with SMOTE, to reduce the class imbalance in the target and increase recall.
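
To illustrate the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, standard-library illustration for numeric features only, not the implementation of any particular library. The function name `smote_like` and the sample points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of numeric minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Made-up two-feature minority samples (for example, scaled age and glucose).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
print(len(new_points))
```

In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would be the more sensible choice, and categorical features would need encoding first. Resampling should be applied to the training split only, so that the evaluation still reflects the real class distribution.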

# Contact
For further information, contact ankitkdey@gmail.com