# Stroke-Predictions
XGBoost model to predict stroke risk from clinical patient information, aiding in early detection and prevention.

# Data
Data is from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
- 10 clinical features
  - Gender, Age, Hypertension, Heart Disease, Marriage Status, Work Type, Residence Type, Average Glucose Level, BMI, Smoking Status
- 1 binary target variable
  - Stroke
- 5110 observations
- Class Imbalance: 7% of samples are positive for having had a stroke

# Exploratory Analyses
<img src="https://github.com/adey4/Stroke-Predictions/blob/main/age_stroke.png?raw=true" width=500 height=400>

Those who have had a stroke are ~26 years older than those who have not, on average.
<br />
<br />
<br />

<img src="https://github.com/adey4/Stroke-Predictions/blob/main/marriage_stroke.png?raw=true" width=500 height=400>

Those who have ever been married are ~4x more likely to have had a stroke than those who have never been married.

# Final Model
XGBClassifier with `eta = 0.3`, `lambda = 1`, `max_depth = 5`, and `min_child_weight = 2`
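
The names listed above are XGBoost's native parameter names. A minimal sketch of how they could be passed to the scikit-learn wrapper, assuming `XGBClassifier` was constructed with keyword arguments, where `eta` and `lambda` are normally spelled `learning_rate` and `reg_lambda`. The helper names `to_sklearn_kwargs` and `build_model` are hypothetical and not taken from this repository.

```python
# Hyperparameters as listed in this README (XGBoost's native parameter names).
NATIVE_PARAMS = {"eta": 0.3, "lambda": 1, "max_depth": 5, "min_child_weight": 2}

# Keyword names used by the scikit-learn wrapper for the two that differ.
SKLEARN_ALIASES = {"eta": "learning_rate", "lambda": "reg_lambda"}


def to_sklearn_kwargs(native_params):
    """Rename native XGBoost parameter names to XGBClassifier keyword names."""
    return {
        SKLEARN_ALIASES.get(name, name): value
        for name, value in native_params.items()
    }


def build_model():
    """Build the classifier (hypothetical helper).

    xgboost is imported lazily so the name mapping above can be used
    even where the library is not installed.
    """
    from xgboost import XGBClassifier

    return XGBClassifier(**to_sklearn_kwargs(NATIVE_PARAMS))
```

Calling `build_model().fit(X_train, y_train)` would then train with these settings, where `X_train` and `y_train` stand in for an encoded training split.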

# Model Evaluation

<img src="https://github.com/adey4/Stroke-Predictions/blob/main/cf_matrix.png?raw=true" width=450 height=400>

The final model was chosen to maximize recall, since false negatives are worse than false positives when predicting stroke. A false negative may leave an at-risk patient without proper treatment, while a false positive leads only to an unnecessary checkup.

The final model showed a recall of 0.07, a precision of 0.40, an F1-score of 0.13, and an accuracy of 0.94. The high accuracy largely reflects the class imbalance, since most samples are negative.
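
To make the gap between accuracy and recall concrete, the sketch below computes the four metrics from confusion-matrix counts. The counts are hypothetical, chosen only to land near the reported values; they are not read from the confusion matrix image above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


# Hypothetical test set of 1000 patients, 54 of whom had a stroke.
model = classification_metrics(tp=4, fp=6, fn=50, tn=940)

# Baseline that predicts "no stroke" for everyone on the same hypothetical set.
baseline = classification_metrics(tp=0, fp=0, fn=54, tn=946)

for name, scores in (("model", model), ("always negative", baseline)):
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

With these made-up counts, the always-negative baseline reaches a slightly higher accuracy than the model while its recall is zero, which is why accuracy alone says little on an imbalanced target.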

# Recommendations
Further research and model development are required to obtain a model suitable for production use. The final model's recall of 0.07 is still low for stroke prediction and may lead to dangerous false negatives. The model has not learnt to predict the positive class well because of the low proportion of positive samples (7%) in the dataset. A future project could explore resampling techniques, such as oversampling with SMOTE, to reduce the class imbalance in the target and increase recall.
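
As a rough illustration of the resampling idea, the sketch below uses plain random oversampling, which duplicates existing minority rows. It is a simpler stand-in for SMOTE, which instead synthesizes new minority points by interpolating between neighbours. The function name `random_oversample` and the toy data are assumptions made for illustration; resampling should be applied to the training split only, so that duplicated rows do not leak into evaluation.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or len(minority) >= len(majority):
        # Nothing to balance: return copies of the inputs unchanged.
        return list(rows), list(labels)
    # Sample minority indices with replacement to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (made up): 7 positives and 93 negatives, mirroring the 7% positive rate.
toy_rows = [[float(i)] for i in range(100)]
toy_labels = [1 if i < 7 else 0 for i in range(100)]

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(len(balanced_labels), sum(balanced_labels))
```

Whether this improves recall on the stroke data is untested here; it only shows the mechanics of balancing the target before training.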

# Contact
For further information, contact ankitkdey@gmail.com
