<h1>About Dataset</h1>

<h2>Overview</h2>
<p>
This dataset is curated to support research in <strong>stroke risk prediction</strong>, enabling the development of models for two tasks:
</p>
<ul>
<li><strong>Binary Classification:</strong> Whether a person is at risk of stroke.</li>
<li><strong>Regression Analysis:</strong> The percentage likelihood of stroke occurrence.</li>
</ul>
<p>
It is designed for use in machine learning and deep learning applications in medical AI and predictive healthcare. The dataset is balanced, with 50% of records labeled at risk and 50% labeled not at risk.
</p>

<h2>Dataset Generation Process</h2>
<p>
The dataset was created through a combination of <strong>medical literature review</strong>, expert consultation, and <strong>statistical modeling</strong>. Feature distributions and relationships reflect real-world clinical patterns.
</p>

<h2>Medical References & Sources</h2>
<p>
The dataset is grounded in established risk factors from trusted medical sources, including:
</p>
<ul>
<li>American Stroke Association (ASA): Guidelines on stroke risk and early symptoms.</li>
<li>Mayo Clinic & Cleveland Clinic: Literature on cardiovascular and stroke risk.</li>
<li><em>Harrison’s Principles of Internal Medicine</em> (20th Ed.)</li>
<li><em>Stroke Prevention, Treatment, and Rehabilitation</em> (Oxford University Press, 2021)</li>
<li><em>The Stroke Book</em> (Cambridge Medicine, 2nd Ed.)</li>
<li>World Health Organization (WHO) reports on stroke risk and prevention</li>
</ul>

<h2>Features of the Dataset</h2>

<h3>1️⃣ Symptoms (Primary Predictors)</h3>
<p>All features are binary (1 = present, 0 = absent):</p>
<ul>
<li>Chest Pain</li>
<li>Shortness of Breath</li>
<li>Irregular Heartbeat</li>
<li>Fatigue & Weakness</li>
<li>Dizziness</li>
<li>Swelling (Edema)</li>
<li>Pain in Neck/Jaw/Shoulder/Back</li>
<li>Excessive Sweating</li>
<li>Persistent Cough</li>
<li>Nausea/Vomiting</li>
<li>High Blood Pressure</li>
<li>Chest Discomfort (Activity)</li>
<li>Cold Hands/Feet</li>
<li>Snoring/Sleep Apnea</li>
<li>Anxiety/Feeling of Doom</li>
</ul>
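<p>A loader may want to confirm the 1/0 encoding described above before any training. The following is a minimal sketch, assuming each record is a dictionary keyed by snake_case column names. These names (for example <code>chest_pain</code>) and the helper <code>find_invalid_flags</code> are hypothetical and may not match the column headers in the actual dataset file.</p>

```python
# Hypothetical column names for the 15 symptom flags listed above;
# the real dataset file may name them differently.
SYMPTOM_COLUMNS = [
    "chest_pain", "shortness_of_breath", "irregular_heartbeat",
    "fatigue_weakness", "dizziness", "swelling_edema",
    "pain_neck_jaw_shoulder_back", "excessive_sweating",
    "persistent_cough", "nausea_vomiting", "high_blood_pressure",
    "chest_discomfort_activity", "cold_hands_feet",
    "snoring_sleep_apnea", "anxiety_feeling_of_doom",
]


def find_invalid_flags(record):
    """Return the symptom columns whose value is missing or not 0/1."""
    return [
        name for name in SYMPTOM_COLUMNS
        if record.get(name) not in (0, 1)
    ]


# Illustrative record (made-up values, not taken from the dataset).
example = {name: 0 for name in SYMPTOM_COLUMNS}
example["dizziness"] = 1
example["persistent_cough"] = 2  # deliberately invalid

print(find_invalid_flags(example))
```

<p>An empty result means every symptom flag in the record is present and encoded as 0 or 1.</p>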

<h3>2️⃣ Target Variables (Predicted Outcomes)</h3>
<ul>
<li><strong>At Risk (Binary):</strong> 1 if the person is at risk of stroke, 0 otherwise</li>
<li><strong>Stroke Risk (%):</strong> Estimated probability of stroke (0–100)</li>
</ul>
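<p>The two targets correspond to two learning tasks that share the same inputs. As a rough sketch, assuming hypothetical column names <code>at_risk</code> and <code>stroke_risk_pct</code> (the actual file may use different headers), a record can be split into a feature dictionary plus one label per task:</p>

```python
# Hypothetical target column names; check the actual file header.
BINARY_TARGET = "at_risk"            # 1 = at risk, 0 = otherwise
PERCENT_TARGET = "stroke_risk_pct"   # estimated risk, 0 to 100


def split_record(record):
    """Split one record into (features, class_label, risk_percent)."""
    label = record[BINARY_TARGET]
    percent = record[PERCENT_TARGET]
    if label not in (0, 1):
        raise ValueError(f"binary target must be 0 or 1, got {label!r}")
    if not 0 <= percent <= 100:
        raise ValueError(f"risk percentage out of range: {percent!r}")
    # Everything that is not a target is treated as a model input.
    features = {
        key: value for key, value in record.items()
        if key not in (BINARY_TARGET, PERCENT_TARGET)
    }
    return features, label, percent


# Illustrative record (made-up values, not taken from the dataset).
row = {"age": 61, "dizziness": 1, "chest_pain": 0,
       "at_risk": 1, "stroke_risk_pct": 57.5}
features, label, percent = split_record(row)
print(features)  # inputs shared by both tasks
print(label)     # classification target
print(percent)   # regression target
```

<p>Keeping both target columns out of the feature dictionary avoids leaking one target into a model trained to predict the other.</p>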

<h3>3️⃣ Demographic Feature</h3>
<ul>
<li><strong>Age:</strong> Stroke risk increases significantly with age</li>
</ul>

<h2>Why Is This Dataset Accurate and Useful?</h2>

<h4>✅ Balanced Data Distribution:</h4>
<ul>
<li>50% at risk, 50% not at risk</li>
<li>Helps keep models from favoring either class</li>
</ul>
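<p>The 50/50 split is straightforward to check once the binary target column has been loaded. A minimal sketch, assuming the labels have already been read into a Python list; the helper name <code>class_shares</code> is illustrative only.</p>

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of records in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Illustrative labels (made up): four at risk, four not at risk.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
shares = class_shares(labels)
print(shares)

# The largest class share is the accuracy of always predicting that class.
# With a balanced target this baseline is 0.5.
majority_baseline = max(shares.values())
print(majority_baseline)
```

<p>Comparing a classifier's accuracy against this majority-class baseline is one simple way to see how much the features contribute.</p>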

<h4>✅ Medically Inspired Feature Engineering:</h4>
<ul>
<li>Features are validated against clinical guidelines and expert opinion</li>
<li>Age is a weighted predictor</li>
<li>Symptom severity is implicitly encoded</li>
</ul>

<h4>✅ Diverse Risk Factors Included:</h4>
<ul>
<li>Cardiovascular: chest pain, high BP, heartbeat irregularity</li>
<li>Neurological: dizziness, fatigue, anxiety</li>
<li>Sleep-related: snoring, sleep apnea</li>
</ul>

<h4>✅ Scalable and ML-Ready:</h4>
<ul>
<li>Supports classification and regression</li>
<li>Works with ML (XGBoost, SVM, RF) and DL frameworks (PyTorch, TensorFlow)</li>
<li>Suitable for Explainable AI (XAI)</li>
</ul>
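<p>Before fitting any of the model families named above, a common first step is a train/test split that preserves the class balance in both parts. The following is a standard-library sketch under that assumption; the function <code>stratified_split</code> and its parameters are illustrative and not part of the dataset. Libraries such as scikit-learn offer an equivalent through the <code>stratify</code> argument of <code>train_test_split</code>.</p>

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)

    # Group record indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # Take the same fraction of every class for the test set.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Illustrative labels (made up): ten at risk, ten not at risk.
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))  # at-risk records in the test set
```

<p>The returned index lists can be used to select matching rows for both the classification and the regression target.</p>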

<h2>Dataset Usage & Applications</h2>
<ul>
<li>✅ <strong>Predictive Analytics:</strong> Early detection and prevention of stroke</li>
<li>✅ <strong>Healthcare Chatbots:</strong> Real-time triage and risk advice</li>
<li>✅ <strong>Medical Research:</strong> Studying patterns in stroke risk</li>
<li>✅ <strong>Explainable AI:</strong> Understanding how models assess stroke likelihood</li>
</ul>