--- a
+++ b/README.md
@@ -0,0 +1,101 @@
+<h1>About Dataset</h1>
+
+<h2>Overview</h2>
+<p>
+This dataset is curated to support research in <strong>stroke risk prediction</strong>, enabling the development of models that estimate:
+</p>
+<ul>
+  <li><strong>Binary Classification:</strong> Whether a person is at risk of stroke.</li>
+  <li><strong>Regression Analysis:</strong> The percentage likelihood of stroke occurrence.</li>
+</ul>
+<p>
+It is designed for use in machine learning and deep learning applications in medical AI and predictive healthcare. The dataset is balanced, with 50% of records for individuals at risk and 50% not at risk.
+</p>
+
+<h2>Dataset Generation Process</h2>
+<p>
+The dataset was created through a combination of <strong>medical literature review</strong>, expert consultation, and <strong>statistical modeling</strong>. Feature distributions and relationships reflect real-world clinical patterns.
+</p>
+
+<h2>Medical References & Sources</h2>
+<p>
+The dataset is grounded in established risk factors from trusted medical sources, including:
+</p>
+<ul>
+  <li>American Stroke Association (ASA): Guidelines on stroke risk and early symptoms.</li>
+  <li>Mayo Clinic & Cleveland Clinic: Literature on cardiovascular and stroke risk.</li>
+  <li><em>Harrison's Principles of Internal Medicine</em> (20th Ed.)</li>
+  <li><em>Stroke Prevention, Treatment, and Rehabilitation</em> (Oxford University Press, 2021)</li>
+  <li><em>The Stroke Book</em> (Cambridge Medicine, 2nd Ed.)</li>
+  <li>World Health Organization (WHO) reports on stroke risk and prevention</li>
+</ul>
+
+<h2>Features of the Dataset</h2>
+
+<h3>1. Symptoms (Primary Predictors)</h3>
+<p>All features are binary (1 = present, 0 = absent):</p>
+<ul>
+  <li>Chest Pain</li>
+  <li>Shortness of Breath</li>
+  <li>Irregular Heartbeat</li>
+  <li>Fatigue & Weakness</li>
+  <li>Dizziness</li>
+  <li>Swelling (Edema)</li>
+  <li>Pain in Neck/Jaw/Shoulder/Back</li>
+  <li>Excessive Sweating</li>
+  <li>Persistent Cough</li>
+  <li>Nausea/Vomiting</li>
+  <li>High Blood Pressure</li>
+  <li>Chest Discomfort (Activity)</li>
+  <li>Cold Hands/Feet</li>
+  <li>Snoring/Sleep Apnea</li>
+  <li>Anxiety/Feeling of Doom</li>
+</ul>
+
+<h3>2. Target Variables (Predicted Outcomes)</h3>
+<ul>
+  <li><strong>At Risk (Binary):</strong> 1 if the person is at risk of stroke, 0 otherwise</li>
+  <li><strong>Stroke Risk (%):</strong> Estimated likelihood of stroke, as a percentage (0–100)</li>
+</ul>
+
+<h3>3. Demographic Feature</h3>
+<ul>
+  <li><strong>Age:</strong> Stroke risk increases significantly with age</li>
+</ul>
+
+<h2>Why This Dataset Is Accurate and Useful</h2>
+
+<h4>Balanced Data Distribution:</h4>
+<ul>
+  <li>50% at risk, 50% not at risk</li>
+  <li>Reduces the risk of a model favoring one class</li>
+</ul>
+
+<h4>Medically Inspired Feature Engineering:</h4>
+<ul>
+  <li>Features validated via clinical guidelines and expert opinion</li>
+  <li>Age is a weighted predictor</li>
+  <li>Symptom severity is implicitly encoded</li>
+</ul>
+
+<h4>Diverse Risk Factors Included:</h4>
+<ul>
+  <li>Cardiovascular: chest pain, high BP, heartbeat irregularity</li>
+  <li>Neurological: dizziness, fatigue, anxiety</li>
+  <li>Sleep-related: snoring, sleep apnea</li>
+</ul>
+
+<h4>Scalable and ML-Ready:</h4>
+<ul>
+  <li>Supports classification and regression</li>
+  <li>Works with ML models (XGBoost, SVM, random forest) and DL frameworks (PyTorch, TensorFlow)</li>
+  <li>Suitable for Explainable AI (XAI)</li>
+</ul>
+
+<h2>Dataset Usage & Applications</h2>
+<ul>
+  <li><strong>Predictive Analytics:</strong> Early detection and prevention of stroke</li>
+  <li><strong>Healthcare Chatbots:</strong> Real-time triage and risk advice</li>
+  <li><strong>Medical Research:</strong> Studying patterns in stroke risk</li>
+  <li><strong>Explainable AI:</strong> Understanding how models assess stroke likelihood</li>
+</ul>
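
The schema this README describes (15 binary symptom flags, age, a binary target, and a percentage target) could be sanity-checked before training along the following lines. This is only a minimal sketch: the column names (`chest_pain`, `at_risk`, `stroke_risk_pct`, and so on) and the helper functions are hypothetical, since the README lists the features but not the file's actual headers, and the rows are random placeholders rather than records from the dataset.

```python
import random

# Hypothetical column names: the README lists the symptoms but not the
# real file headers, so adjust these to match the actual data.
SYMPTOMS = [
    "chest_pain", "shortness_of_breath", "irregular_heartbeat",
    "fatigue_weakness", "dizziness", "swelling_edema",
    "pain_neck_jaw_shoulder_back", "excessive_sweating",
    "persistent_cough", "nausea_vomiting", "high_blood_pressure",
    "chest_discomfort_activity", "cold_hands_feet",
    "snoring_sleep_apnea", "anxiety_feeling_of_doom",
]


def validate_record(row):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for name in SYMPTOMS:
        if row.get(name) not in (0, 1):
            problems.append(f"{name} must be 0 or 1")
    if row.get("at_risk") not in (0, 1):
        problems.append("at_risk must be 0 or 1")
    risk = row.get("stroke_risk_pct")
    if not isinstance(risk, (int, float)) or not 0 <= risk <= 100:
        problems.append("stroke_risk_pct must be within 0-100")
    age = row.get("age")
    if not isinstance(age, (int, float)) or age < 0:
        problems.append("age must be a non-negative number")
    return problems


def at_risk_share(rows):
    """Fraction of records labelled at risk; the README states 0.5."""
    return sum(r["at_risk"] for r in rows) / len(rows)


def make_placeholder_rows(n, seed=0):
    """Random stand-in rows for demonstration only, not real data."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = {name: rng.randint(0, 1) for name in SYMPTOMS}
        row["age"] = rng.randint(18, 90)
        row["at_risk"] = i % 2  # alternate labels to mimic a 50/50 split
        row["stroke_risk_pct"] = round(rng.uniform(0, 100), 1)
        rows.append(row)
    return rows


rows = make_placeholder_rows(100)
assert all(validate_record(r) == [] for r in rows)
print(f"at-risk share: {at_risk_share(rows):.2f}")
```

With the real file, the placeholder generator would be replaced by whatever loader fits the data (for example `csv.DictReader` plus integer conversion), and the same two checks would confirm the binary encoding and the 50/50 split stated above.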