Patient Readmission Risk Prediction
Problem Statement
Patient readmissions within 30 days of discharge pose a significant challenge in healthcare. High readmission rates are not only indicators of suboptimal care but also lead to financial penalties under programs such as the Hospital Readmissions Reduction Program (HRRP). This project aims to predict the likelihood of patient readmission by leveraging advanced SQL for data preparation, Python for predictive modeling, and Tableau for data visualization.
Table of Contents
Project Overview
The project is structured into the following phases:
- Data Collection and Preparation: SQL-based extraction, transformation, and loading (ETL) of patient data.
- Predictive Analytics: Developing a logistic regression model to predict 30-day readmission risks.
- Data Visualization: Using Tableau and Python libraries to create interactive dashboards and visual representations of key metrics.
Folder Structure
- Data_loading_in_MySQL.ipynb: MySQL-based data loading script to manage and insert patient data into a relational database.
- generating_data.ipynb: Python script to generate synthetic patient data, simulating real-world scenarios.
- readmission_risk_prediction.ipynb: Jupyter notebook for running SQL queries and implementing predictive models using Python.
- Visualisation_script.ipynb: Notebook to create visualizations using seaborn and matplotlib, along with Tableau for interactive dashboards.
Data Collection and Preparation
1. Data Loading
- MySQL Operations: The
Data_loading_in_MySQL.ipynb
script performs the following SQL operations:
- Database Connection: Utilizes the
mysql.connector
Python library to establish a connection to the MySQL server.
- Data Insertion: Reads a CSV file into a pandas DataFrame, and inserts the data using appropriate SQL commands. The script includes error handling for data types, particularly for date fields, using the
datetime.strptime
function for proper formatting.Ensured the generated data integrity and that all the tables have related data creating a relational database.
2. Data Generation
- Synthetic Data Creation: The
generating_data.ipynb
script:
- Creates synthetic patient data using numpy and pandas, simulating various patient attributes such as age, comorbidity count, medication count, and lab results.
- Feature Engineering: Derives additional fields like
readmission_risk
using a combination of boolean logic and statistical methods, ensuring the synthetic data is representative of real-world scenarios.
Predictive Analytics
3. Risk Prediction
- Data Extraction: In
readmission_risk_prediction.ipynb
, patient data is extracted from the MySQL database using SQL queries.
- Feature Encoding: Uses pandas to encode categorical variables into numerical formats suitable for modeling, utilizing techniques such as one-hot encoding.
- Correlation Matrix: Computes correlations between features using pandas
.corr()
method, helping identify key predictors of readmission.
- Predictive Modeling: Implements logistic regression using
sklearn.linear_model.LogisticRegression
to predict the probability of a patient being readmitted within 30 days.
Data Visualization
4. Visualization
Technical Details
- SQL Expertise: Proficient use of SQL for data extraction, manipulation, and integration within a Python environment.
- Data Processing: Extensive use of pandas for data cleaning, transformation, and feature engineering.
- Machine Learning: Implementation of logistic regression for classification tasks, utilizing scikit-learn.
- Visualization Tools: Expertise in seaborn and matplotlib for static plots; proficiency in Tableau for interactive dashboards.
Key Features
- ETL Process: Comprehensive ETL pipeline that extracts, transforms, and loads data into a MySQL database.
- Predictive Analytics: Develops a logistic regression model that accurately predicts patient readmission, offering actionable insights.
- Interactive Dashboards: (If applicable) Tableau dashboards that allow for dynamic exploration of patient data.
Outcome and Impact
Outcome
- Predictive Insights: The model predicts patient readmission risk, allowing healthcare providers to take preemptive actions, such as closer monitoring or additional follow-up care.
Impact
- Reduction in Readmission Rates: Targeted interventions can lower the number of 30-day readmissions, improving patient outcomes.
- Cost Efficiency: Avoidance of financial penalties under HRRP and reduced overall healthcare costs through better resource allocation.
- Enhanced Patient Care: Improved discharge planning and patient management based on data-driven insights.