About Dataset

📌 Overview

This dataset is curated to support research in stroke risk prediction, enabling the development of models that estimate:

  • Binary Classification: Whether a person is at risk of stroke.
  • Regression Analysis: The percentage likelihood of stroke occurrence.

It is designed for use in machine learning and deep learning applications in medical AI and predictive healthcare. The dataset is balanced, with 50% of records for individuals at risk and 50% not at risk.

📜 Dataset Generation Process

The dataset was created through a combination of medical literature review, expert consultation, and statistical modeling. Feature distributions and the relationships between features are intended to reflect real-world clinical patterns.

📖 Medical References & Sources

The dataset is grounded in established risk factors from trusted medical sources, including:

  • American Stroke Association (ASA): Guidelines on stroke risk and early symptoms.
  • Mayo Clinic & Cleveland Clinic: Literature on cardiovascular and stroke risk.
  • Harrison’s Principles of Internal Medicine (20th Ed.)
  • Stroke Prevention, Treatment, and Rehabilitation (Oxford University Press, 2021)
  • The Stroke Book (Cambridge Medicine, 2nd Ed.)
  • World Health Organization (WHO) reports on stroke risk and prevention

🔬 Features of the Dataset

1️⃣ Symptoms (Primary Predictors)

All symptom features are binary (1 = present, 0 = absent):

  • Chest Pain
  • Shortness of Breath
  • Irregular Heartbeat
  • Fatigue & Weakness
  • Dizziness
  • Swelling (Edema)
  • Pain in Neck/Jaw/Shoulder/Back
  • Excessive Sweating
  • Persistent Cough
  • Nausea/Vomiting
  • High Blood Pressure
  • Chest Discomfort (Activity)
  • Cold Hands/Feet
  • Snoring/Sleep Apnea
  • Anxiety/Feeling of Doom
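
Because every symptom is encoded as 0 or 1, a simple sanity check can flag records that break that encoding. The sketch below is only illustrative: the column names in `SYMPTOM_COLUMNS` and the helper `invalid_symptom_values` are hypothetical, and the actual headers in the dataset file may differ.

```python
# Hypothetical column names, one per symptom listed above.
# The real headers in the dataset file may be spelled differently.
SYMPTOM_COLUMNS = [
    "chest_pain",
    "shortness_of_breath",
    "irregular_heartbeat",
    "fatigue_weakness",
    "dizziness",
    "swelling_edema",
    "pain_neck_jaw_shoulder_back",
    "excessive_sweating",
    "persistent_cough",
    "nausea_vomiting",
    "high_blood_pressure",
    "chest_discomfort_activity",
    "cold_hands_feet",
    "snoring_sleep_apnea",
    "anxiety_feeling_of_doom",
]


def invalid_symptom_values(record):
    """Return {column: value} for symptom fields that are missing or not 0/1."""
    problems = {}
    for column in SYMPTOM_COLUMNS:
        value = record.get(column)  # None when the column is missing
        if value not in (0, 1):
            problems[column] = value
    return problems


# Toy record for illustration only (not taken from the dataset).
record = {column: 0 for column in SYMPTOM_COLUMNS}
record["dizziness"] = 1
record["persistent_cough"] = 2  # deliberately invalid
print(invalid_symptom_values(record))  # {'persistent_cough': 2}
```

An empty result means every symptom field in the record follows the 1 = present, 0 = absent convention.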

2️⃣ Target Variables (Predicted Outcomes)

  • At Risk (Binary): 1 if the person is at risk of stroke, 0 otherwise
  • Stroke Risk (%): Estimated probability of stroke, expressed as a percentage (0–100)
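
Since the dataset carries two targets, a loading step usually separates the feature columns from both of them. The following is a minimal sketch using only the standard library. The column names (`at_risk`, `stroke_risk_pct`, `age`, and the two symptom columns) are assumptions, and the two CSV rows are toy values, not records from the dataset.

```python
import csv
import io

# Toy rows with hypothetical column names. The values are illustrative
# only and are not taken from the dataset.
TOY_CSV = """chest_pain,dizziness,age,at_risk,stroke_risk_pct
1,0,67,1,72.5
0,0,34,0,18.0
"""

TARGET_COLUMNS = ("at_risk", "stroke_risk_pct")


def split_features_and_targets(csv_text):
    """Return (features, binary labels, risk percentages) parsed from CSV text."""
    features, y_class, y_reg = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = int(row["at_risk"])
        risk = float(row["stroke_risk_pct"])
        # Enforce the ranges stated in the target definitions above.
        if label not in (0, 1):
            raise ValueError(f"at_risk must be 0 or 1, got {label}")
        if not 0.0 <= risk <= 100.0:
            raise ValueError(f"stroke_risk_pct must be in [0, 100], got {risk}")
        y_class.append(label)
        y_reg.append(risk)
        features.append(
            {name: float(value) for name, value in row.items()
             if name not in TARGET_COLUMNS}
        )
    return features, y_class, y_reg


X, y_class, y_reg = split_features_and_targets(TOY_CSV)
```

Keeping both target columns out of the feature set matters here: leaving the risk percentage among the inputs of the binary classifier (or the reverse) would leak one target into the other.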

3️⃣ Demographic Feature

  • Age: Stroke risk increases significantly with age

⚡ Why Is This Dataset Accurate and Useful?

✅ Balanced Data Distribution:

  • 50% at risk, 50% not at risk
  • Reduces the risk of a model being biased toward one class
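
The stated 50/50 split is easy to verify after loading the labels. This is a small sketch, assuming the binary target has already been read into a list. The helper name `class_shares` and the toy labels are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of records that fall into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (not taken from the dataset).
labels = [1, 0, 1, 0, 0, 1]
print(class_shares(labels))  # {1: 0.5, 0: 0.5}
```

It is worth repeating the check on each train/test split, since a random split of a balanced dataset is not guaranteed to stay exactly balanced.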

✅ Medically-Inspired Feature Engineering:

  • Features validated via clinical guidelines and expert opinion
  • Age is a weighted predictor
  • Symptom severity is implicitly encoded

✅ Diverse Risk Factors Included:

  • Cardiovascular: chest pain, high BP, heartbeat irregularity
  • Neurological: dizziness, fatigue, anxiety
  • Sleep-related: snoring, sleep apnea

✅ Scalable and ML-Ready:

  • Supports classification and regression
  • Works with ML models (XGBoost, SVM, random forest) and DL frameworks (PyTorch, TensorFlow)
  • Suitable for Explainable AI (XAI)
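
Because the dataset supports both tasks, each needs its own evaluation metric. One common pairing, shown here as a plain-Python sketch, is accuracy for the binary target and mean absolute error for the risk percentage. The choice of metrics is an assumption rather than something the dataset prescribes, and all values below are toy numbers.

```python
def accuracy(y_true, y_pred):
    """Fraction of binary predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between predicted and true risk percentages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy predictions for illustration only (not model output on the dataset).
at_risk_true = [1, 0, 1, 0]
at_risk_pred = [1, 0, 0, 0]
risk_true = [80.0, 20.0, 60.0, 10.0]
risk_pred = [70.0, 25.0, 60.0, 15.0]

print(accuracy(at_risk_true, at_risk_pred))          # 0.75
print(mean_absolute_error(risk_true, risk_pred))     # 5.0
```

Mean absolute error is reported in percentage points, so it can be read directly against the 0–100 scale of the risk target.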

📂 Dataset Usage & Applications

  • ✅ Predictive Analytics: Early detection and prevention of stroke
  • ✅ Healthcare Chatbots: Real-time triage and risk advice
  • ✅ Medical Research: Studying patterns in stroke risk
  • ✅ Explainable AI: Understanding how models assess stroke likelihood