About Dataset

📌 Overview

This dataset is curated to support research in stroke risk prediction, enabling the development of models that estimate:

  • Binary Classification: Whether a person is at risk of stroke.
  • Regression Analysis: The percentage likelihood of stroke occurrence.

It is designed for use in machine learning and deep learning applications in medical AI and predictive healthcare. The dataset is balanced, with 50% of records for individuals at risk and 50% not at risk.

📜 Dataset Generation Process

The dataset was created through a combination of medical literature review, expert consultation, and statistical modeling. Feature distributions and relationships reflect real-world clinical patterns.

📖 Medical References & Sources

The dataset is grounded in established risk factors from trusted medical sources, including:

  • American Stroke Association (ASA): Guidelines on stroke risk and early symptoms.
  • Mayo Clinic & Cleveland Clinic: Literature on cardiovascular and stroke risk.
  • Harrison’s Principles of Internal Medicine (20th Ed.)
  • Stroke Prevention, Treatment, and Rehabilitation (Oxford University Press, 2021)
  • The Stroke Book (Cambridge Medicine, 2nd Ed.)
  • World Health Organization (WHO) reports on stroke risk and prevention

🔬 Features of the Dataset

1️⃣ Symptoms (Primary Predictors)

All symptom features are binary (1 = present, 0 = absent):

  • Chest Pain
  • Shortness of Breath
  • Irregular Heartbeat
  • Fatigue & Weakness
  • Dizziness
  • Swelling (Edema)
  • Pain in Neck/Jaw/Shoulder/Back
  • Excessive Sweating
  • Persistent Cough
  • Nausea/Vomiting
  • High Blood Pressure
  • Chest Discomfort (Activity)
  • Cold Hands/Feet
  • Snoring/Sleep Apnea
  • Anxiety/Feeling of Doom
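
Because every symptom column is expected to hold only 0 or 1, a quick validation pass can catch malformed rows before training. The following is a minimal sketch using only the standard library. The column names are assumptions copied from the labels above (the real CSV headers may differ), and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed column names, copied from the symptom labels in this README;
# the real CSV headers may differ.
SYMPTOM_COLUMNS = ["Chest Pain", "Dizziness", "High Blood Pressure"]


def find_non_binary(rows, columns):
    """Return (row_index, column, value) for every cell that is not 0 or 1."""
    problems = []
    for i, row in enumerate(rows):
        for col in columns:
            value = row[col].strip()
            if value not in ("0", "1"):
                problems.append((i, col, value))
    return problems


# Tiny made-up sample standing in for the real file.
SAMPLE_CSV = """Chest Pain,Dizziness,High Blood Pressure
1,0,1
0,2,1
1,1,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = find_non_binary(rows, SYMPTOM_COLUMNS)
print(problems)  # [(1, 'Dizziness', '2')]
```

To run the same check on the actual file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and extend `SYMPTOM_COLUMNS` to all fifteen symptoms.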

2️⃣ Target Variables (Predicted Outcomes)

  • At Risk (Binary): 1 if the person is at risk of stroke, 0 otherwise
  • Stroke Risk (%): Estimated probability of stroke (0–100)
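
Since the file carries two targets, both should be removed from the feature set before training either model; otherwise one target leaks into the prediction of the other. The sketch below illustrates this split for a single row. The column names `At Risk (Binary)` and `Stroke Risk (%)` are assumptions based on the labels above, and the example record is invented.

```python
# Assumed target column names; check them against the actual CSV header.
CLASS_TARGET = "At Risk (Binary)"
REG_TARGET = "Stroke Risk (%)"


def split_record(record):
    """Separate one row into (features, class_label, risk_percent)."""
    # Every column that is not a target is treated as a numeric feature.
    features = {
        name: float(value)
        for name, value in record.items()
        if name not in (CLASS_TARGET, REG_TARGET)
    }
    label = int(record[CLASS_TARGET])
    risk = float(record[REG_TARGET])
    if label not in (0, 1):
        raise ValueError(f"class label must be 0 or 1, got {label}")
    if not 0.0 <= risk <= 100.0:
        raise ValueError(f"risk must lie in [0, 100], got {risk}")
    return features, label, risk


# Invented example row, as csv.DictReader would yield it (all strings).
record = {
    "Chest Pain": "1",
    "Dizziness": "0",
    "At Risk (Binary)": "1",
    "Stroke Risk (%)": "71.5",
}
features, label, risk = split_record(record)
print(features, label, risk)
```

The classification task then uses `label` as its target and the regression task uses `risk`, with the same `features` in both cases.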

3️⃣ Demographic Feature

  • Age: Stroke risk increases significantly with age

⚡ Why Is This Dataset Accurate and Useful?

✅ Balanced Data Distribution:

  • 50% at risk, 50% not at risk
  • Reduces the risk of a model favoring the majority class
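
The stated 50/50 split is easy to confirm after loading the labels. This is a small sketch, assuming the binary target has already been read into a list of integers; the label values shown are made up.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of records in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up labels standing in for the "at risk" column of the real file.
labels = [1, 0, 1, 0, 0, 1]
shares = class_shares(labels)
print(shares)  # {1: 0.5, 0: 0.5}
```

Any train/test split should be checked the same way, since a random split of a balanced file is not guaranteed to stay exactly balanced.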

✅ Medically-Inspired Feature Engineering:

  • Features validated via clinical guidelines and expert opinion
  • Age is a weighted predictor
  • Symptom severity is implicitly encoded

✅ Diverse Risk Factors Included:

  • Cardiovascular: chest pain, high blood pressure, irregular heartbeat
  • Neurological: dizziness, fatigue, anxiety
  • Sleep-related: snoring, sleep apnea

✅ Scalable and ML-Ready:

  • Supports classification and regression
  • Works with classical ML models (XGBoost, SVM, random forest) and DL frameworks (PyTorch, TensorFlow)
  • Suitable for Explainable AI (XAI)
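
As a dependency-free illustration of the classification use case, the sketch below trains a plain logistic regression with batch gradient descent on binary inputs. It is a stand-in for the libraries listed above, not a replacement for them. The toy rows and the labeling rule (two or more symptoms present) are invented for the demo and are not the rule used to build this dataset.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    """Probability of the positive class for one feature row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            error = predict_proba(weights, bias, row) - target
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data: all combinations of three binary symptoms.
# Hypothetical rule for the demo only: at risk if two or more are present.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [1 if sum(row) >= 2 else 0 for row in X]

weights, bias = train_logistic(X, y)
predictions = [1 if predict_proba(weights, bias, row) >= 0.5 else 0 for row in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

The learned weights are directly readable (one per symptom), which is also the simplest form of the explainability mentioned above. For the regression target, a regressor from one of the listed libraries would be fitted on the same features against the percentage column.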

πŸ“‚ Dataset Usage & Applications

  • ✅ Predictive Analytics: Early detection and prevention of stroke
  • ✅ Healthcare Chatbots: Real-time triage and risk advice
  • ✅ Medical Research: Studying patterns in stroke risk
  • ✅ Explainable AI: Understanding how models assess stroke likelihood