About Dataset
Overview
This dataset is curated to support research in stroke risk prediction, enabling the development of models that estimate:
- Binary Classification: Whether a person is at risk of stroke.
- Regression Analysis: The percentage likelihood of stroke occurrence.
It is designed for use in machine learning and deep learning applications in medical AI and predictive healthcare. The dataset is balanced, with 50% of records for individuals at risk and 50% not at risk.
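The 50/50 balance described above is easy to verify before training. The following is a minimal sketch, assuming the binary target has already been read into a Python list; the `at_risk` values here are hypothetical stand-ins, not records from the dataset.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's binary "At Risk" column.
at_risk = [1, 0, 1, 0, 0, 1, 1, 0]


def class_shares(labels):
    """Return the fraction of records that fall into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


shares = class_shares(at_risk)
print(shares)
```

On the full dataset, both shares should come out close to 0.5 if the stated balance holds.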
Dataset Generation Process
The dataset was created through a combination of medical literature review, expert consultation, and statistical modeling. Feature distributions and relationships reflect real-world clinical patterns.
Medical References & Sources
The dataset is grounded in established risk factors from trusted medical sources, including:
- American Stroke Association (ASA): Guidelines on stroke risk and early symptoms.
- Mayo Clinic & Cleveland Clinic: Literature on cardiovascular and stroke risk.
- Harrison's Principles of Internal Medicine (20th Ed.)
- Stroke Prevention, Treatment, and Rehabilitation (Oxford University Press, 2021)
- The Stroke Book (Cambridge Medicine, 2nd Ed.)
- World Health Organization (WHO) reports on stroke risk and prevention
Features of the Dataset
1. Symptoms (Primary Predictors)
All symptom features are binary (1 = present, 0 = absent):
- Chest Pain
- Shortness of Breath
- Irregular Heartbeat
- Fatigue & Weakness
- Dizziness
- Swelling (Edema)
- Pain in Neck/Jaw/Shoulder/Back
- Excessive Sweating
- Persistent Cough
- Nausea/Vomiting
- High Blood Pressure
- Chest Discomfort (Activity)
- Cold Hands/Feet
- Snoring/Sleep Apnea
- Anxiety/Feeling of Doom
2. Target Variables (Predicted Outcomes)
- At Risk (Binary): 1 if the person is at risk of stroke, 0 otherwise
- Stroke Risk (%): Estimated probability of stroke (0 to 100)
3. Demographic Feature
- Age: Stroke risk increases significantly with age
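The layout described above (binary symptom flags, age, and two targets) can be sketched as a small loading routine. This is an illustration only: the column names (`chest_pain`, `at_risk`, `stroke_risk_pct`, and so on) and the three sample rows are hypothetical, and the real file's headers may differ.

```python
import csv
import io

# Hypothetical column names and rows; the real file's headers may differ.
SAMPLE_CSV = """chest_pain,dizziness,high_blood_pressure,age,at_risk,stroke_risk_pct
1,0,1,67,1,72.5
0,0,0,29,0,12.0
1,1,1,54,1,61.0
"""

TARGET_COLUMNS = ("at_risk", "stroke_risk_pct")


def split_features_targets(csv_text):
    """Split rows into feature dicts, binary labels, and risk percentages."""
    features, labels, risk_pct = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        labels.append(int(row["at_risk"]))
        risk_pct.append(float(row["stroke_risk_pct"]))
        # Everything that is not a target is a predictor (symptom flags and age).
        features.append(
            {name: int(value) for name, value in row.items()
             if name not in TARGET_COLUMNS}
        )
    return features, labels, risk_pct


X, y_class, y_reg = split_features_targets(SAMPLE_CSV)
```

Keeping both targets out of the feature set matters here: the risk percentage and the binary label describe the same outcome, so leaving either one among the predictors would leak the answer.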
Why This Dataset is Accurate and Useful?
Balanced Data Distribution:
- 50% at risk, 50% not at risk
- Reduces the chance that a model becomes biased toward either class
Medically-Inspired Feature Engineering:
- Features validated via clinical guidelines and expert opinion
- Age is a weighted predictor
- Symptom severity is implicitly encoded
Diverse Risk Factors Included:
- Cardiovascular: chest pain, high BP, heartbeat irregularity
- Neurological: dizziness, fatigue, anxiety
- Sleep-related: snoring, sleep apnea
Scalable and ML-Ready:
- Supports classification and regression
- Works with ML methods (XGBoost, SVM, Random Forest) and DL frameworks (PyTorch, TensorFlow)
- Suitable for Explainable AI (XAI)
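Since the dataset supports both classification and regression, it can help to compute a trivial baseline for each task before fitting any of the models listed above. The sketch below is not part of the dataset's own tooling; the target values are hypothetical, and the two helper functions are illustrative names.

```python
from collections import Counter
from statistics import mean

# Hypothetical targets; in practice these come from the dataset's two target columns.
at_risk = [1, 0, 1, 0, 1, 0]
stroke_risk_pct = [80.0, 20.0, 70.0, 30.0, 60.0, 40.0]


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def mean_predictor_mae(values):
    """Mean absolute error obtained by always predicting the mean value."""
    centre = mean(values)
    return mean(abs(v - centre) for v in values)


baseline_accuracy = majority_class_accuracy(at_risk)
baseline_mae = mean_predictor_mae(stroke_risk_pct)
```

With balanced classes, the majority-class baseline sits near 50% accuracy, so any classifier's accuracy can be read directly against that floor. A regressor should likewise beat the mean-predictor error to be worth using.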
Dataset Usage & Applications
- Predictive Analytics: Early detection and prevention of stroke
- Healthcare Chatbots: Real-time triage and risk advice
- Medical Research: Studying patterns in stroke risk
- Explainable AI: Understanding how models assess stroke likelihood