# Multi-Class-Prediction-of-Obesity-Risk

This project extends our solution to the Kaggle competition "Multi Class Prediction of Obesity Risk", where we placed within the top 5%. It improves the models and productionizes the project using best practices learned in class MGSC-695-076. For security reasons, no access keys are shared.

Tech Stack: Apache Kafka, MLflow, Azure ML, VS Code, Poetry, AutoGluon, H2O, PyCaret, FLAML, PandasAI, Docker, Streamlit, Postman, FastAPI, SHAP

## Project Overview

#### 1. Data Preparation and Simulation

- **Data Source:** Original Kaggle CSV data split into Model Development and Hold-Off datasets.
- **Live Data Simulation:** Used Apache Kafka for simulating real-time data feeds.
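
As a rough illustration of the live-data simulation, the sketch below replays CSV rows as JSON messages, one at a time. It is not the project's actual producer: the function names, the sample columns, and the `send` callback are assumptions made for this example. The callback stands in for a real Kafka producer so the snippet runs without a broker.

```python
import csv
import io
import json
import time
from typing import Callable, Iterable, Iterator


def rows_as_messages(csv_lines: Iterable[str]) -> Iterator[bytes]:
    """Yield each CSV row as a UTF-8 encoded JSON message."""
    for row in csv.DictReader(csv_lines):
        yield json.dumps(row).encode("utf-8")


def replay(csv_lines: Iterable[str],
           send: Callable[[bytes], None],
           delay_seconds: float = 0.0) -> int:
    """Send rows one at a time, pausing between them to mimic a live feed."""
    sent = 0
    for message in rows_as_messages(csv_lines):
        send(message)
        sent += 1
        if delay_seconds:
            time.sleep(delay_seconds)
    return sent


# Hypothetical two-row sample standing in for the hold-off CSV.
sample = io.StringIO("id,Age,Weight\n1,21,64.0\n2,35,88.5\n")
received = []  # collects messages in place of a Kafka topic
count = replay(sample, received.append)
```

With the `kafka-python` package, `send` could be something like `lambda m: producer.send("obesity-stream", m)` on a `KafkaProducer`; the topic name here is a placeholder.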

<!-- Slide 6 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide6.png">
</p>

<!-- Slide 7 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide7.png">
</p>

<!-- Slide 8 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide8.png">
</p>

#### 2. Azure Machine Learning Setup

- **Workspace Configuration:** Established Azure ML Workspace with RBAC.
- **Team Roles:** Assigned roles for Data Science, Data Engineering, ML Engineering, and Governance.

#### 3. Exploratory Data Analysis (EDA)

- **Comprehensive Analysis:**
  - **Univariate Analysis:** Leveraged PandasAI for detailed insights.
  - **Bivariate Analysis:** Used pairplots and interaction plots.
  - **Dimensionality Reduction:** Applied PCA with KMediansClustering.

<!-- Slide 9 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide9.png">
</p>

<!-- Slide 10 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide10.png">
</p>

<!-- Slide 11 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide11.png">
</p>

<!-- Slide 12 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide12.png">
</p>

<!-- Slide 13 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide13.png">
</p>

#### 4. Data Preprocessing

- **Feature Engineering:** Enhanced performance based on EDA insights.
- **Normalization and Scaling:** Ensured optimal feature scaling.
- **Missing Data Handling:** Applied appropriate strategies for missing data.
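
The README does not say which imputation or scaling strategies were used, so the following is only a minimal sketch under assumed choices: median imputation for missing values followed by z-score standardization. The function names and the sample `weights` column are hypothetical.

```python
from statistics import mean, median, pstdev
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical numeric column with one missing value.
weights = [60.0, None, 80.0, 100.0]
filled = impute_median(weights)
scaled = standardize(filled)
```

In a real pipeline the median, mean, and standard deviation would be computed on the training split only and reused for the hold-off data, so that no information leaks from the evaluation set.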

#### Step 9: EDA [Owner to Update Step]
<!-- Slide 14 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide14.png">
</p>

#### 5. Dependency Management

- **Poetry Integration:** Managed dependencies for reproducibility.

<!-- Slide 15 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide15.png">
</p>

#### 6. Model Development and Optimization

- **State-of-the-Art Models:**
  - Custom models like XGBoost, LightGBM, CatBoost.
  - **Hyperparameter Tuning:** Used Optuna for optimization.

- **AutoML Exploration:**
  - Explored PyCaret, AutoGluon, H2O for benchmarking.
  - **Advanced Techniques:** Stacked models, Isolation Forest, custom loss functions.

#### 7. Experiment Tracking and Management

- **MLflow & Azure MLflow Integration:**
  - Tracked global and local metrics, target distribution.
  - **SHAP Analysis:** Utilized SHAP values for explainability and error analysis.

<!-- Slide 16 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide16.png">
</p>

<!-- Slide 17 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide17.png">
</p>

<!-- Slide 18 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide18.png">
</p>

<!-- Slide 19 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide19.png">
</p>

<!-- Slide 20 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide20.png">
</p>

<!-- Slide 21 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide21.png">
</p>

#### 8. Deployment Strategies

- **Containerization:** Used FastAPI and Docker.
- **Azure Deployment:** Azure Container Instances, planned Kubernetes.

- **Conversion to Azure Scripts:**
  - Converted Jupyter notebooks to Python scripts for Azure jobs.
  - **Azure Pipelines:** CI/CD with GitHub Actions and Azure Container Registry.

#### 9. User Interface and Interaction

- **Streamlit Application:** User-friendly interface integrated with APIs.

#### 10. Model Monitoring and Drift Management

- **Monitoring Strategy:** Drift detection, automated endpoint management.
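
The README does not name the drift metric used. As one possible approach, the sketch below computes the Population Stability Index (PSI) between a reference distribution of predicted classes and a live one. The function names, the class labels, and the 0.25 alert threshold (a common rule of thumb) are assumptions for illustration only.

```python
import math
from typing import Dict, Sequence


def class_shares(labels: Sequence[str],
                 classes: Sequence[str],
                 eps: float = 1e-6) -> Dict[str, float]:
    """Share of each class, floored at eps so the logarithm stays defined."""
    total = len(labels)
    return {c: max(sum(1 for label in labels if label == c) / total, eps)
            for c in classes}


def population_stability_index(reference: Sequence[str],
                               live: Sequence[str]) -> float:
    """PSI = sum over classes of (live - ref) * ln(live / ref)."""
    classes = sorted(set(reference) | set(live))
    ref = class_shares(reference, classes)
    cur = class_shares(live, classes)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in classes)


# Hypothetical predicted-class samples: training-time vs. live traffic.
reference = ["Normal_Weight"] * 50 + ["Obesity_Type_I"] * 50
live = ["Normal_Weight"] * 20 + ["Obesity_Type_I"] * 80

drift_score = population_stability_index(reference, live)
drift_detected = drift_score > 0.25  # assumed alert threshold
```

The same function can be applied to any categorical feature; numeric features would first need to be binned.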

#### 11. Azure ML Designer Integration

- **UI-Based Experiments:** Additionally ran experiments in Azure ML Designer for learning purposes, using both SDK v2 and the UI.

<!-- Slide 22 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide22.png">
</p>

<!-- Slide 23 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide23.png">
</p>

<!-- Slide 24 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide24.png">
</p>

<!-- Slide 25 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide25.png">
</p>

<!-- Slide 26 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide26.png">
</p>

<!-- Slide 27 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide27.png">
</p>

<!-- Slide 28 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide28.png">
</p>

<!-- Slide 29 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide29.png">
</p>

<!-- Slide 30 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide30.png">
</p>

<!-- Slide 31 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide31.png">
</p>

<!-- Slide 32 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide32.png">
</p>

<!-- Slide 33 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide33.png">
</p>

<!-- Slide 34 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide34.png">
</p>

<!-- Slide 35 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide35.png">
</p>

#### 12. Additional Expert Considerations

- **Cross-Validation:** Ensured model generalizability.
- **Model Governance:** Versioning, lineage tracking, compliance.
- **Scalability and Optimization:** Performance tests, scalability checks.
- **Feedback Loop:** Integrated feedback for continuous improvement.
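
The README does not describe the cross-validation scheme. For a multi-class target, a stratified split is a common choice, so the sketch below shows one way to build stratified k-fold index splits using only the standard library. The function name, the toy labels, and the round-robin assignment are assumptions, not the project's implementation.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_folds(labels: Sequence[str],
                     k: int,
                     seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (training_indices, validation_indices) pairs.

    Each class is shuffled and dealt round-robin across the folds,
    so every fold keeps roughly the overall class proportions.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    splits = []
    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(i for f, fold in enumerate(folds)
                          if f != held_out for i in fold)
        splits.append((training, validation))
    return splits


# Toy imbalanced target: six samples of class "A", three of class "B".
labels = ["A"] * 6 + ["B"] * 3
splits = stratified_folds(labels, k=3)
```

Each validation fold here holds two "A" samples and one "B" sample, mirroring the 2:1 ratio of the full set.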

#### 13. Branches
1. Main: For Final Product [Owner - Team]
2. Experiments: For ML Experiments and tracking [Owners - Arham, Krishan]
3. ArchDevelopment: For CI/CD [Owner - Nandani]
4. Streamlit: For front end [Owner - Nandani]
5. Data Engineering: For Kafka Streaming [Owner - Yash]
6. Backup: For Backup [Owners - Aasna, Mahrukh]

### Technologies Used

- **Data Analysis/Model Training:** Python, Jupyter Notebooks
- **Experiment Tracking:** MLflow
- **Model Building:** PyCaret, LightGBM, XGBoost, CatBoost
- **Hyperparameter Optimization:** Optuna
- **Containerization:** Docker
- **Realtime Data Streaming:** Kafka
- **Version Control and CI/CD:** Git, GitHub Actions
- **Cloud Deployment:** Azure Machine Learning, Azure Blob Storage
- **User Interface:** Streamlit
- **Dependency and Environment Management:** Poetry

## How to Run the Code

### Prerequisites

- **Python 3.8+**
- **Poetry**
- **Docker**
- **Azure Account**
- **Kafka**

### Setup

1. **Clone the Repository**

    ```bash
    git clone https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk.git
    cd Multi-Class-Prediction-of-Obesity-Risk
    ```

2. **Install Dependencies**

    ```bash
    poetry install
    ```

3. **Set Up Environment Variables**

    Create a `.env` file in the root directory and add the necessary environment variables. Example:

    ```env
    AZURE_SUBSCRIPTION_ID=your_subscription_id
    AZURE_RESOURCE_GROUP=your_resource_group
    AZURE_WORKSPACE_NAME=your_workspace_name
    ```

4. **Start Docker**

    Ensure Docker is running on your machine. Build and run the Docker containers:

    ```bash
    docker-compose up --build
    ```

5. **Run Streamlit Application**

    ```bash
    streamlit run Streamlit/app.py
    ```

6. **Run Jupyter Notebooks**

    Start Jupyter Lab to run and explore notebooks:

    ```bash
    poetry run jupyter lab
    ```
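
For the environment variables in step 3, a script may want to fail early when one is missing. The sketch below parses `.env`-style text and validates the three Azure variables named above; the helper names are hypothetical, and the repository may load its configuration differently (for example through a dedicated dotenv package).

```python
from typing import Dict, Mapping

# The three variables listed in the example .env file.
REQUIRED = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_WORKSPACE_NAME",
)


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def azure_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the required settings, raising KeyError if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


example = """
AZURE_SUBSCRIPTION_ID=your_subscription_id
AZURE_RESOURCE_GROUP=your_resource_group
AZURE_WORKSPACE_NAME=your_workspace_name
"""
settings = azure_settings(parse_env_file(example))
```

When the variables are already exported in the shell, `azure_settings(os.environ)` performs the same check after an `import os`.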

### Deployment

1. **Azure ML Deployment**

    - Configure your Azure workspace by setting up the necessary resources.
    - Use the provided Azure scripts to deploy models and services.

    ```bash
    poetry run python deploy/deploy_to_azure.py
    ```

2. **CI/CD Setup**

    - Ensure GitHub Actions are configured correctly.
    - Push changes to the repository to trigger CI/CD pipelines.

    ```bash
    git add .
    git commit -m "Your commit message"
    git push origin main
    ```

### Monitoring and Maintenance

- **Model Monitoring:** Utilize integrated monitoring tools to track model performance and detect drift.
- **Endpoint Management:** Automated endpoint management to ensure availability and performance.

### Business Case

Our solution targets healthcare providers for early identification of at-risk patients, public health officials for data-driven policy making, and insurance companies for premium adjustment based on individual risk. The economic impact includes significant healthcare cost savings and revenue generation from tailored wellness programs.

### Acknowledgements

This project is an effort by the team to tackle the global health crisis of obesity by employing advanced data science and machine learning techniques, aiming to make a significant impact in the healthcare sector.

### Meet the Team
1. Product Manager - Aasna
2. Machine Learning Engineer - Arham
3. ML Ops - Krishan
4. Data Engineer - Yash
5. Cloud SME - Nandani
6. Business Analyst - Mahrukh

----

<!-- Slide 36 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide36.png">
</p>

<!-- Slide 37 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide37.png">
</p>

<!-- Slide 38 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide38.png">
</p>

<!-- Slide 2 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide2.png">
</p>

<!-- Slide 3 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide3.png">
</p>

<!-- Slide 4 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide4.png">
</p>

<!-- Slide 5 -->
<p align="center">
  <img src="https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk/blob/main/16-README-Support-Files/Slide5.png">
</p>