Multi-Class-Prediction-of-Obesity-Risk
This project extends our earlier work for the Kaggle competition "Multi-Class Prediction of Obesity Risk", in which we placed within the top 5%. Here we improve the original models and productionize the pipeline using best practices learned in MGSC-695-076. For security reasons, no access keys are shared in this repository.
Tech Stack: Apache Kafka, MLflow, Azure ML, VS Code, Poetry, AutoGluon, H2O, PyCaret, FLAML, PandasAI, Docker, Streamlit, Postman, FastAPI, SHAP
Project Overview
1. Data Preparation and Simulation
- Data Source: Original Kaggle CSV data split into Model Development and Hold-Out datasets.
- Live Data Simulation: Used Apache Kafka for simulating real-time data feeds.
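To make the simulation concrete, here is a minimal sketch of replaying hold-out rows as a Kafka stream; the broker address, topic name, and CSV path are assumptions, not the repository's actual configuration.

```python
# Minimal sketch: replay hold-out CSV rows as a real-time Kafka feed.
# Assumes a local broker at localhost:9092, a topic named "obesity-live",
# and a CSV at data/holdout.csv (all hypothetical names).
import json
import time

import pandas as pd
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for record in pd.read_csv("data/holdout.csv").to_dict(orient="records"):
    producer.send("obesity-live", value=record)  # one observation per message
    time.sleep(0.5)                              # throttle to mimic a live feed

producer.flush()
```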
2. Azure Machine Learning Setup
- Workspace Configuration: Established Azure ML Workspace with RBAC.
- Team Roles: Assigned roles for Data Science, Data Engineering, ML Engineering, and Governance.
3. Exploratory Data Analysis (EDA)
- Comprehensive Analysis:
- Univariate Analysis: Leveraged PandasAI for detailed insights.
- Bivariate Analysis: Used pairplots and interaction plots.
- Dimensionality Reduction: Applied PCA with K-Medians clustering.
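A minimal sketch of this step: project standardized features with PCA, then cluster the components. scikit-learn has no K-Medians estimator, so KMeans stands in purely as an illustration; the file path and cluster count are assumptions.

```python
# Minimal sketch of PCA followed by clustering on the components.
# KMeans stands in for the K-Medians step; paths and parameters are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("data/train.csv").select_dtypes("number")  # hypothetical path

components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(components)
print(pd.Series(labels).value_counts())
```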
4. Data Preprocessing
- Feature Engineering: Created new features informed by the EDA insights.
- Normalization and Scaling: Scaled numeric features to comparable ranges.
- Missing Data Handling: Handled missing values with strategies suited to each feature.
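A minimal sketch of such a preprocessing pipeline (imputation, scaling, one-hot encoding); the column names follow the Kaggle schema and are illustrative rather than the project's exact engineered feature set.

```python
# Minimal preprocessing sketch: impute, scale numeric features, encode categoricals.
# Column names are illustrative; adjust to the actual engineered feature set.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["Age", "Height", "Weight"]
categorical_features = ["Gender", "family_history_with_overweight"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])
```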
5. Dependency Management
- Poetry Integration: Managed dependencies for reproducibility.
6. Model Development and Optimization
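As a sketch of what this step can look like with the tools listed under Technologies Used (Optuna tuning a LightGBM model); the search space, metric, and data below are placeholders rather than the project's actual configuration.

```python
# Minimal sketch: Optuna tuning a LightGBM classifier on stand-in data.
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Stand-in data; the real study runs on the preprocessed obesity-risk features.
X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_classes=3, random_state=42)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    return cross_val_score(LGBMClassifier(**params), X, y, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```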
7. Experiment Tracking and Management
- MLflow & Azure MLflow Integration:
- Tracked global and local metrics and the target distribution.
- SHAP Analysis: Utilized SHAP values for explainability and error analysis.
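A minimal sketch of the tracking-and-explainability flow: log a metric and model to MLflow, then compute SHAP values for the fitted tree model. The experiment name and data are placeholders; the real runs log against the team's (Azure) MLflow tracking server.

```python
# Minimal sketch: MLflow run logging plus SHAP explainability on stand-in data.
import mlflow
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

mlflow.set_experiment("obesity-risk")  # hypothetical experiment name
with mlflow.start_run():
    model = LGBMClassifier(n_estimators=200).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", accuracy_score(y_valid, model.predict(X_valid)))
    mlflow.sklearn.log_model(model, artifact_path="model")

    explainer = shap.TreeExplainer(model)        # SHAP values for error analysis
    shap_values = explainer.shap_values(X_valid)
```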
8. Deployment Strategies
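A minimal sketch of the kind of FastAPI scoring service implied by the tech stack (FastAPI served behind Docker); the route, input schema, and model path are assumptions.

```python
# Minimal sketch of a FastAPI scoring endpoint; schema and model path are assumptions.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Obesity Risk API")
model = joblib.load("models/model.pkl")  # hypothetical serialized pipeline

class PatientFeatures(BaseModel):
    Age: float
    Height: float
    Weight: float
    Gender: str

@app.post("/predict")
def predict(features: PatientFeatures):
    row = pd.DataFrame([features.dict()])  # single-row frame for the pipeline
    prediction = model.predict(row)[0]
    return {"obesity_risk_class": str(prediction)}
```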
9. User Interface and Interaction
- Streamlit Application: User-friendly interface integrated with APIs.
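A minimal sketch of a Streamlit front end calling the prediction API; the URL and input fields are assumptions, and the actual app lives in Streamlit/app.py.

```python
# Minimal sketch of a Streamlit UI posting inputs to the prediction API.
import requests
import streamlit as st

st.title("Obesity Risk Prediction")

age = st.number_input("Age", min_value=1, max_value=120, value=30)
height = st.number_input("Height (m)", value=1.70)
weight = st.number_input("Weight (kg)", value=70.0)
gender = st.selectbox("Gender", ["Male", "Female"])

if st.button("Predict"):
    payload = {"Age": age, "Height": height, "Weight": weight, "Gender": gender}
    response = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
    st.write(response.json())  # display the predicted obesity-risk class
```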
10. Model Monitoring and Drift Management
- Monitoring Strategy: Drift detection, automated endpoint management.
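One drift check such a strategy can rely on is the Population Stability Index between training data and the live feed; a minimal sketch follows (the 0.2 alert threshold is a common rule of thumb, not the project's setting).

```python
# Minimal sketch of a PSI-based drift check between reference and live samples.
import numpy as np

def population_stability_index(reference, live, bins=10):
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))  # quantile bin edges
    ref = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    liv = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] / len(live)
    ref, liv = np.clip(ref, 1e-6, None), np.clip(liv, 1e-6, None)  # avoid log(0)
    return float(np.sum((liv - ref) * np.log(liv / ref)))

rng = np.random.default_rng(42)
psi = population_stability_index(rng.normal(70, 10, 5000), rng.normal(75, 12, 5000))
print(f"PSI = {psi:.3f}  (values above ~0.2 typically warrant investigation/retraining)")
```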
11. Azure ML Designer Integration
- UI-Based Experiments: Ran experiments in Azure ML Designer (UI) alongside SDK v2, additionally for learning purposes.
12. Additional Expert Considerations
- Cross-Validation: Ensured model generalizability.
- Model Governance: Versioning, lineage tracking, compliance.
- Scalability and Optimization: Performance tests, scalability checks.
- Feedback Loop: Integrated feedback for continuous improvement.
13. Branches:
- Main: For Final Product [Owner - Team]
- Experiments: For ML Experiments and tracking [Owners - Arham, Krishan]
- ArchDevelopment: For CI/CD [Owner - Nandani]
- Streamlit: For front end [Owner - Nandani]
- Data Engineering: For Kafka Streaming [Owner - Yash]
- Backup: For Backup [Owner - Aasna, Mahrukh]
Technologies Used
- Data Analysis/Model Training: Python, Jupyter Notebooks
- Experiment Tracking: MLflow
- Model Building: PyCaret, LightGBM, XGBoost, CatBoost
- Hyperparameter Optimization: Optuna
- Containerization: Docker
- Real-time Data Streaming: Kafka
- Version Control and CI/CD: Git, GitHub Actions
- Cloud Deployment: Azure Machine Learning, Azure Blob Storage
- User Interface: Streamlit
- Dependency and Environment Management: Poetry
How to Run the Code
Prerequisites
- Python 3.8+
- Poetry
- Docker
- Azure Account
- Kafka
Setup
- Clone the Repository

  ```bash
  git clone https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk.git
  cd Multi-Class-Prediction-of-Obesity-Risk
  ```
- Install Dependencies

  ```bash
  poetry install
  ```
- Set Up Environment Variables

  Create a .env file in the root directory and add the necessary environment variables (a sketch of loading them follows these setup steps). Example:

  ```env
  AZURE_SUBSCRIPTION_ID=your_subscription_id
  AZURE_RESOURCE_GROUP=your_resource_group
  AZURE_WORKSPACE_NAME=your_workspace_name
  ```
- Start Docker

  Ensure Docker is running on your machine. Build and run the Docker containers:

  ```bash
  docker-compose up --build
  ```
- Run Streamlit Application

  ```bash
  streamlit run Streamlit/app.py
  ```
- Run Jupyter Notebooks

  Start Jupyter Lab to run and explore notebooks:

  ```bash
  poetry run jupyter lab
  ```
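The environment variables set earlier are read at runtime; a minimal sketch of loading them, assuming python-dotenv is available in the Poetry environment:

```python
# Minimal sketch: load .env values for the Azure connection (python-dotenv assumed).
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

azure_config = {
    "subscription_id": os.getenv("AZURE_SUBSCRIPTION_ID"),
    "resource_group": os.getenv("AZURE_RESOURCE_GROUP"),
    "workspace_name": os.getenv("AZURE_WORKSPACE_NAME"),
}
print({k: bool(v) for k, v in azure_config.items()})  # confirm values are set without printing secrets
```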
Deployment
- Azure ML Deployment

  - Configure your Azure workspace by setting up the necessary resources.
  - Use the provided Azure scripts to deploy models and services.

  ```bash
  poetry run python deploy/deploy_to_azure.py
  ```
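For orientation, a minimal sketch of what a managed online endpoint deployment looks like with Azure ML SDK v2; this is not the contents of deploy/deploy_to_azure.py, and the endpoint name, model reference, and instance type are assumptions.

```python
# Minimal sketch of an Azure ML SDK v2 managed online endpoint deployment.
# Not the project's deploy script; names, model reference, and SKU are assumptions.
import os

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
    resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
    workspace_name=os.environ["AZURE_WORKSPACE_NAME"],
)

endpoint = ManagedOnlineEndpoint(name="obesity-risk-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="obesity-risk-endpoint",
    model="azureml:obesity-risk-model:1",  # assumes a registered MLflow model
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```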
- CI/CD Setup

  - Ensure GitHub Actions are configured correctly.
  - Push changes to the repository to trigger CI/CD pipelines.

  ```bash
  git add .
  git commit -m "Your commit message"
  git push origin main
  ```
Monitoring and Maintenance
- Model Monitoring: Utilize integrated monitoring tools to track model performance and detect drift.
- Endpoint Management: Automated endpoint management to ensure availability and performance.
Business Case
Our solution targets healthcare providers for early identification of at-risk patients, public health officials for data-driven policy making, and insurance companies for premium adjustment based on individual risk. The economic impact includes significant healthcare cost savings and revenue generation from tailored wellness programs.
Acknowledgements
This project is an effort by the team to tackle the global health crisis of obesity by employing advanced data science and machine learning techniques, aiming to make a significant impact in the healthcare sector.
Meet the Team
- Product Manager - Aasna
- Machine Learning Engineer - Arham
- ML Ops - Krishan
- Data Engineer - Yash
- Cloud SME - Nandani
- Business Analyst - Mahrukh