# Patient Readmission Risk Prediction

## Problem Statement

Patient readmissions within 30 days of discharge pose a significant challenge in healthcare. High readmission rates not only signal suboptimal care but also lead to financial penalties under programs such as the Hospital Readmissions Reduction Program (HRRP). This project aims to predict the likelihood of patient readmission, using SQL for data preparation, Python for predictive modeling, and Tableau for data visualization.
## Table of Contents

- [Problem Statement](#problem-statement)
- [Project Overview](#project-overview)
- [Folder Structure](#folder-structure)
- [Data Collection and Preparation](#data-collection-and-preparation)
- [Predictive Analytics](#predictive-analytics)
- [Data Visualization](#data-visualization)
- [Technical Details](#technical-details)
- [Key Features](#key-features)
- [Outcome and Impact](#outcome-and-impact)
## Project Overview

The project is structured into the following phases:

1. **Data Collection and Preparation**: SQL-based extraction, transformation, and loading (ETL) of patient data.
2. **Predictive Analytics**: Development of a logistic regression model that predicts 30-day readmission risk.
3. **Data Visualization**: Interactive dashboards and visual representations of key metrics, built with Tableau and Python libraries.
## Folder Structure

- **Data_loading_in_MySQL.ipynb**: Notebook that loads the patient data into a MySQL relational database.
- **generating_data.ipynb**: Notebook that generates synthetic patient data simulating real-world scenarios.
- **readmission_risk_prediction.ipynb**: Jupyter notebook that runs the SQL queries and implements the predictive model in Python.
- **Visualisation_script.ipynb**: Notebook that creates visualizations with seaborn and matplotlib, complementing the interactive Tableau dashboards.
## Data Collection and Preparation

### 1. Data Loading
- **MySQL Operations**: The `Data_loading_in_MySQL.ipynb` notebook performs the following steps:
  - **Database Connection**: Uses the `mysql.connector` Python library to establish a connection to the MySQL server.
  - **Data Insertion**: Reads a CSV file into a pandas DataFrame and inserts the data with the appropriate SQL commands. The notebook handles data type errors, particularly in date fields, which it parses with `datetime.strptime`. It also preserves the integrity of the generated data, so that records in every table relate to records in the others and together form a relational database.
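The date handling and insertion steps described above could look roughly like the following sketch. The table name `patients`, its columns, and the accepted date formats are assumptions made for illustration; the notebook's actual schema and formats are not documented here. `insert_patients` accepts any open DB-API connection, such as one returned by `mysql.connector.connect(...)`.

```python
from datetime import date, datetime
from typing import Optional

# Assumed input formats; adjust to whatever the CSV actually contains.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(raw: object) -> Optional[date]:
    """Return a date for a raw CSV value, or None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None


def insert_patients(conn, rows) -> int:
    """Insert (patient_id, age, admission_date) tuples in one batch.

    The table and column names are hypothetical. `%s` is the placeholder
    style used by mysql.connector; values are passed separately from the
    SQL text rather than formatted into it.
    """
    sql = (
        "INSERT INTO patients (patient_id, age, admission_date) "
        "VALUES (%s, %s, %s)"
    )
    prepared = [(pid, age, parse_date(adm)) for pid, age, adm in rows]
    cursor = conn.cursor()
    try:
        cursor.executemany(sql, prepared)
        conn.commit()
    finally:
        cursor.close()
    return len(prepared)
```

Unparseable dates become `None` here, which the database stores as `NULL`; rejecting the row instead would be an equally reasonable choice.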
### 2. Data Generation
- **Synthetic Data Creation**: The `generating_data.ipynb` notebook:
  - Creates synthetic patient data using numpy and pandas, simulating patient attributes such as age, comorbidity count, medication count, and lab results.
  - **Feature Engineering**: Derives additional fields such as `readmission_risk` using a combination of boolean logic and statistical methods, so that the synthetic data resembles real-world scenarios.
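As a rough illustration of the generation step above, the sketch below draws patient attributes from simple distributions and derives `readmission_risk` from boolean rules plus random noise. The distributions, thresholds, noise rate, and the `Low`/`Medium`/`High` labels are all assumptions; the notebook's actual rules may differ.

```python
import numpy as np
import pandas as pd


def generate_patients(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Generate n synthetic patients with a derived readmission_risk label."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "patient_id": np.arange(1, n + 1),
        "age": rng.integers(18, 91, size=n),            # 18 to 90 inclusive
        "comorbidity_count": rng.poisson(1.5, size=n),
        "medication_count": rng.poisson(4.0, size=n),
        "lab_result": rng.normal(100.0, 15.0, size=n).round(1),
    })

    # Boolean logic: count how many (assumed) risk flags each patient raises.
    flags = (
        (df["age"] >= 65).astype(int)
        + (df["comorbidity_count"] >= 3).astype(int)
        + (df["medication_count"] >= 6).astype(int)
    )
    # Statistical component: bump roughly 10% of patients up by one point.
    noise = (rng.random(n) < 0.10).astype(int)
    score = (flags + noise).to_numpy()

    df["readmission_risk"] = np.select(
        [score >= 2, score == 1], ["High", "Medium"], default="Low"
    )
    return df
```

Fixing the seed keeps the generated dataset reproducible between runs, which helps when the same data is later loaded into MySQL and modeled.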
## Predictive Analytics

### 3. Risk Prediction
- **Data Extraction**: In `readmission_risk_prediction.ipynb`, patient data is extracted from the MySQL database using SQL queries.
- **Feature Encoding**: Uses pandas to encode categorical variables as numerical features suitable for modeling, with techniques such as one-hot encoding.
- **Correlation Matrix**: Computes correlations between features with the pandas `.corr()` method, helping identify key predictors of readmission.
- **Predictive Modeling**: Implements logistic regression using `sklearn.linear_model.LogisticRegression` to predict the probability of a patient being readmitted within 30 days.
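The encoding, correlation, and modeling steps above can be sketched as follows. This assumes the patient data has already been extracted into a DataFrame with a binary target column, here hypothetically named `readmitted`; the train/test split and the ROC AUC check are additions for illustration and are not described in the notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_readmission_model(df: pd.DataFrame, target: str = "readmitted"):
    """One-hot encode features, inspect correlations, and fit the model."""
    # One-hot encode categorical columns; numeric columns pass through.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True, dtype=float)
    y = df[target].astype(int)

    # Correlation of each encoded feature with the target, via .corr().
    corr = pd.concat([X, y], axis=1).corr()
    corr_with_target = corr[target].drop(target)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Probability of the positive class (readmitted within 30 days).
    proba = model.predict_proba(X_test)[:, 1]
    return model, corr_with_target, roc_auc_score(y_test, proba)
```

Returning the held-out ROC AUC alongside the model gives a quick check on how well the predicted probabilities rank patients, without committing to a particular decision threshold.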
## Data Visualization

### 4. Visualization
- **Boxplots and Heatmaps**: `Visualisation_script.ipynb` includes:
  - **Age Distribution Analysis**: Boxplot visualization using seaborn’s `sns.boxplot()` to compare age across different readmission risk levels.
  - **Correlation Heatmap**: Uses seaborn’s `sns.heatmap()` to display a correlation matrix, identifying relationships between variables such as length of stay, comorbidity count, and medication count.
  - **Additional Visualizations**: Multiple boxplots and scatter plots that show how factors such as medication duration and lab results relate to readmission risk.

- **Tableau Dashboards**:
  - Interactive dashboards created in Tableau, providing healthcare professionals with tools to explore and filter data dynamically. Dashboards include trend analyses, patient segmentation, and real-time risk scoring.
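A minimal sketch of the age boxplot and correlation heatmap described above is shown below. The column names `age` and `readmission_risk`, the figure layout, and the color map are assumptions. The `matplotlib.use("Agg")` line is only there so the sketch runs as a plain script without a display; in a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_overview(df: pd.DataFrame, risk_col: str = "readmission_risk"):
    """Draw an age-by-risk boxplot next to a correlation heatmap."""
    fig, (ax_box, ax_heat) = plt.subplots(1, 2, figsize=(12, 5))

    # Compare the age distribution across readmission risk levels.
    sns.boxplot(data=df, x=risk_col, y="age", ax=ax_box)
    ax_box.set_title("Age by readmission risk")

    # Correlation matrix over the numeric columns only.
    corr = df.select_dtypes(include="number").corr()
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax_heat)
    ax_heat.set_title("Feature correlations")

    fig.tight_layout()
    return fig
```

Calling `plot_overview(df).savefig("overview.png")` would write both panels to a single image; the file name is arbitrary.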
## Technical Details

- **SQL Expertise**: Proficient use of SQL for data extraction, manipulation, and integration within a Python environment.
- **Data Processing**: Extensive use of pandas for data cleaning, transformation, and feature engineering.
- **Machine Learning**: Implementation of logistic regression for classification tasks, utilizing scikit-learn.
- **Visualization Tools**: Expertise in seaborn and matplotlib for static plots; proficiency in Tableau for interactive dashboards.
## Key Features

- **ETL Process**: ETL pipeline that extracts, transforms, and loads patient data into a MySQL database.
- **Predictive Analytics**: Logistic regression model that estimates each patient's 30-day readmission risk, offering actionable insights.
- **Interactive Dashboards**: Tableau dashboards that allow for dynamic exploration of patient data.
## Outcome and Impact

### Outcome
- **Predictive Insights**: The model predicts patient readmission risk, allowing healthcare providers to take preemptive actions, such as closer monitoring or additional follow-up care.
### Impact
- **Reduction in Readmission Rates**: Targeted interventions can lower the number of 30-day readmissions, improving patient outcomes.
- **Cost Efficiency**: Avoidance of financial penalties under HRRP and reduced overall healthcare costs through better resource allocation.
- **Enhanced Patient Care**: Improved discharge planning and patient management based on data-driven insights.

---