# Stroke Prediction Model

## Introduction
This repository hosts a machine learning project that predicts the likelihood of a stroke from health indicators and demographic information. It covers the full workflow of a typical data science project: data exploration, preprocessing, model training, evaluation, and suggestions for improvement.

## Data Exploration
We began with an exploratory data analysis (EDA), which involved:
- Examining summary statistics and distributions of the features.
- Identifying missing values, particularly in the `bmi` feature.
- Visualizing data distributions and potential correlations between features.
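
The first two checks above can be sketched without any third-party dependencies. The snippet below is a minimal illustration, not the project's actual EDA code: the sample rows and the `age` column are made up for the example, and only the `bmi` column name comes from this README.

```python
import statistics

# Hypothetical sample rows; only the `bmi` column name comes from this README.
rows = [
    {"age": 67.0, "bmi": 36.6},
    {"age": 61.0, "bmi": None},  # missing BMI
    {"age": 80.0, "bmi": 32.5},
    {"age": 49.0, "bmi": 34.4},
    {"age": 79.0, "bmi": None},  # missing BMI
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def summarize(rows, column):
    """Basic summary statistics for one numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


print(missing_counts(rows))
print(summarize(rows, "bmi"))
```

With a tabular library such as pandas, the same information is usually available from built-in summary methods; the plain-Python version is shown only to make each step explicit.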

## Data Preprocessing
Preprocessing was a critical step because the data contained missing values and categorical variables that had to be converted to a numeric format. The following steps were taken:
- Missing values in `bmi` were imputed with the median of the feature.
- Categorical variables were converted to integers using label encoding.
- Numerical features were standardized to have a mean of zero and a standard deviation of one.
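
The three steps above can be sketched in plain Python as follows. This is an illustrative sketch under assumptions rather than the project's own pipeline: the `gender` values and BMI numbers are invented, and the helper names (`impute_median`, `label_encode`, `standardize`) are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def label_encode(values):
    """Map each distinct category to an integer, in sorted order of the labels."""
    mapping = {label: index for index, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation.

    Assumes the values are not all identical, otherwise the deviation is zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Invented example data.
bmi = [36.6, None, 32.5, 34.4, None]
gender = ["Male", "Female", "Female", "Male", "Female"]

bmi_filled = impute_median(bmi)
gender_codes, gender_mapping = label_encode(gender)
bmi_scaled = standardize(bmi_filled)

print(bmi_filled)
print(gender_codes, gender_mapping)
print(bmi_scaled)
```

In a real pipeline, the median, the label mapping, and the scaling statistics would normally be computed on the training split only and then reused on validation data, to avoid leaking information.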

## Modeling
We built models step by step, in increasing order of complexity:
1. **Logistic Regression**: Served as our baseline model.
2. **Random Forest**: A more flexible model capable of capturing non-linear patterns.
3. **Gradient Boosting and XGBoost**: Added for their strong performance on tabular data; both can be adapted to imbalanced datasets through sample or class weighting (for example, XGBoost's `scale_pos_weight` parameter).
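
A minimal sketch of this progression, assuming scikit-learn is installed. It trains on synthetic imbalanced data from `make_classification` rather than the project's dataset, uses default hyperparameters that are not taken from the project, and leaves out XGBoost to avoid an extra dependency (`xgboost.XGBClassifier` could be added to the dictionary in the same way).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 5% positives); not the project's dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Stratify so both splits keep roughly the same class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

f1_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    # zero_division=0 avoids a warning if a model predicts no positives at all.
    f1_by_model[name] = f1_score(y_val, predictions, zero_division=0)

for name, score in f1_by_model.items():
    print(f"{name}: minority-class F1 = {score:.3f}")
```

The scores depend on the synthetic data and the library version, so they say nothing about how these models perform on the stroke data.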

## Evaluation Metrics Explained
Model performance was evaluated with several metrics suited to imbalanced datasets:
- **Precision**: The fraction of predicted positives that are truly positive: TP / (TP + FP), where TP and FP count true and false positives.
- **Recall**: The fraction of actual positives that the model identifies: TP / (TP + FN), where FN counts false negatives.
- **F1-Score**: The harmonic mean of precision and recall.
- **Support**: The number of instances of each class in the validation set.
- **Accuracy**: The fraction of all predictions that are correct. It was considered but not treated as the primary metric, because class imbalance lets a model score highly by always predicting the majority class.
- **Macro Average**: The unweighted mean of a metric across classes, so each class counts equally.
- **Weighted Average**: The mean of a metric across classes, weighted by each class's support.
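
To make these definitions concrete, here is a small dependency-free sketch that computes them from scratch. The labels are invented for illustration and are not results from this project; `classification_summary` is a hypothetical helper name.

```python
def classification_summary(y_true, y_pred):
    """Per-class precision, recall, F1, and support, plus accuracy and averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,
        }
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
    weighted_f1 = sum(m["f1"] * m["support"] for m in per_class.values()) / total
    return per_class, accuracy, macro_f1, weighted_f1


# Made-up labels: 8 negatives (0) and 2 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

per_class, accuracy, macro_f1, weighted_f1 = classification_summary(y_true, y_pred)
print(per_class[1])  # minority class: precision 0.5, recall 0.5, f1 0.5, support 2
print(accuracy, macro_f1, weighted_f1)  # 0.8 0.6875 0.8
```

In this toy case accuracy and weighted F1 are both 0.8 even though only half of the positives are found, while the macro average (0.6875) reflects the weaker minority-class performance more clearly.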

## Handling Imbalanced Data
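To address the class imbalance in the dataset, we explored:
- **SMOTE** (Synthetic Minority Over-sampling Technique): Oversamples the minority class by generating synthetic examples.
- **Class Weight Adjustment**: Penalizes errors on the minority class more heavily so that the model pays more attention to it.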
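
Both ideas can be illustrated with a short plain-Python sketch. It is a simplified illustration, not the project's code: `smote_like_oversample` is a hypothetical helper that only captures SMOTE's core interpolation step, the feature vectors are invented, and the weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count). In practice a library implementation, such as `SMOTE` from imbalanced-learn, would typically be used instead.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Invented example: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)  # the minority class receives the larger weight

# Invented two-dimensional minority-class feature vectors.
minority = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=4)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling of this kind should be applied to the training split only, never to validation data.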

## Results and Discussion
We compared the models' predictions and performance metrics to identify the best-performing model. The analysis revealed a strong bias toward the majority class, which prompted the use of SMOTE and class weight adjustments.

## Future Work
Future improvements could include:
- Advanced feature engineering.
- Hyperparameter optimization.
- Exploration of alternative resampling strategies.

## Installation and Usage
To replicate this project, clone the repository and install the required dependencies listed in `requirements.txt`:

```bash
git clone https://github.com/mayankbaluni/StrokeRiskPredictor.git
cd StrokeRiskPredictor
pip install -r requirements.txt
```

## Contact
For any queries or suggestions, feel free to contact me at [mayankbaluni@gmail.com](mailto:mayankbaluni@gmail.com).