## **<span style="color:red">[COMPREHENSIVE ANALYSIS AND PREDICTION OF OBESITY RISK LEVELS USING MACHINE LEARNING TECHNIQUES WITH 90.99% ACCURACY](https://nbviewer.org/github/Anamicca23/Obesity-Risk-Level-Prediction--Project-using-ML/blob/master/prediction-of-obesity-risk-levels-using-ml%20project-Final.ipynb)</span>**
2
**Author**: **Anamika Kumari**
3
4
# Introduction:
5
6
Obesity is a pressing global health concern, with significant implications for morbidity, mortality, and healthcare costs. Worldwide obesity has nearly tripled since 1975, and an estimated 30% of the global population is now overweight or obese. This trend underscores the need to address the risks associated with excess weight. Obesity contributes to many health complications, including diabetes, heart disease, osteoarthritis, sleep apnea, stroke, and high blood pressure, which reduce life expectancy and increase mortality. Predicting obesity risk effectively is therefore important for targeting interventions and promoting public health.
7
8
# Approach:
9
10
- **Data Collection and Preprocessing:** 
11
    - We will gather comprehensive datasets containing information on demographics, lifestyle habits, dietary patterns, physical activity levels, and medical history. 
12
    - We will preprocess the data to handle missing values, normalize features, and encode categorical variables.
13
14
- **Exploratory Data Analysis (EDA):** 
15
    - We will perform exploratory data analysis to gain insights into the distribution of variables, identify patterns, and explore correlations between features and obesity risk levels. 
16
    - Visualization techniques will be employed to present key findings effectively.
17
18
- **Feature Engineering:** 
19
    - We will engineer new features and transformations to enhance the predictive power of our models. 
20
    - This may involve creating interaction terms, deriving new variables, or transforming existing features to improve model performance.
21
22
- **Model Development:** 
23
    - We will employ advanced machine learning techniques, including ensemble methods such as Random Forest, Gradient Boosting (XGBoost, LightGBM), and possibly deep learning approaches, to develop predictive models for obesity risk classification. 
24
    - We will train and fine-tune these models using appropriate evaluation metrics and cross-validation techniques to ensure robustness and generalization.
25
26
- **Model Evaluation:** 
27
    - We will evaluate the performance of our models using various metrics such as accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC). 
28
    - We will also conduct sensitivity analysis and interpretability assessments to understand the factors driving predictions and identify areas for improvement.
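
The evaluation metrics named in the last step can be illustrated with a small, dependency-free sketch. It computes accuracy and macro-averaged precision, recall, and F1-score from two label lists. The function name `macro_metrics` and the toy labels are assumptions made for illustration; the notebook may well rely on `sklearn.metrics` instead, and AUC-ROC is left out because it needs predicted probabilities.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1  # predicted this class, but it was wrong
            fn[actual] += 1     # missed the class that was actually present

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels (invented for illustration), using integer class codes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_metrics(y_true, y_pred))
```

Macro averaging weights every class equally, which is one reasonable choice for a seven-class target; a weighted average would instead favor the more frequent classes.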
30
31
<img src="https://www.limarp.com/wp-content/uploads/2023/02/obesity-risk-factors.png" alt="Obesity-Risk-Factors" width="1500">
32
33
34
# About Obesity Risk Level Prediction-Project:
35
<details>
36
<summary><b><span style="color:blue">Understanding Obesity and Risk Prediction:</span></b></summary>
37
38
<ul>
39
  <li><b>Understanding Obesity:</b>
40
    <ul>
41
      <li>Obesity stems from excessive body fat accumulation, influenced by genetic, environmental, and behavioral factors.</li>
42
      <li>Risk prediction involves analyzing demographics, lifestyle habits, and physical activity to classify individuals into obesity risk categories.</li>
43
    </ul>
44
  </li>
45
  <li><b>Global Impact:</b>
46
    <ul>
47
      <li>Worldwide obesity has nearly tripled since 1975, and an estimated 30% of the global population is now overweight or obese.</li>
48
      <li>Urgent action is needed to develop effective risk prediction and management strategies.</li>
49
    </ul>
50
  </li>
51
  <li><b>Factors Influencing Risk:</b>
52
    <ul>
53
      <li>Obesity risk is shaped by demographics, lifestyle habits, diet, physical activity, and medical history.</li>
54
      <li>Analyzing these factors reveals insights into obesity's mechanisms and identifies high-risk populations.</li>
55
    </ul>
56
  </li>
57
  <li><b>Data-Driven Approach:</b>
58
    <ul>
59
      <li>Advanced machine learning and large datasets enable the development of predictive models for stratifying obesity risk.</li>
60
      <li>These models empower healthcare professionals and policymakers to implement tailored interventions for improved public health outcomes.</li>
61
    </ul>
62
  </li>
63
  <li><b>Proactive Health Initiatives:</b>
64
    <ul>
65
      <li>Our proactive approach aims to combat obesity by leveraging data and technology for personalized prevention and management.</li>
66
      <li>By predicting obesity risk, we aspire to create a future where interventions are precise, impactful, and tailored to individual needs.</li>
67
    </ul>
68
  </li>
69
</ul>
70
71
<p><b>Source:</b> <a href="https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight">World Health Organization.</a> (2022). Obesity and overweight.</p>
72
</details>
73
74
<details>
75
<summary><b><span style="color:blue">Dataset Overview:</span></b></summary>
76
77
<p>The dataset contains 17 attributes in total, covering eating habits, physical activity, and demographic variables.</p>
78
79
<h3>Key Attributes Related to Eating Habits:</h3>
80
<ul>
81
  <li><b>Frequent Consumption of High-Caloric Food (FAVC):</b> Indicates whether the individual frequently consumes high-caloric food.</li>
82
  <li><b>Frequency of Consumption of Vegetables (FCVC):</b> Measures the frequency of consuming vegetables.</li>
83
  <li><b>Number of Main Meals (NCP):</b> Represents the count of main meals consumed per day.</li>
84
  <li><b>Consumption of Food Between Meals (CAEC):</b> Describes the pattern of food consumption between main meals.</li>
85
  <li><b>Consumption of Water Daily (CH2O):</b> Quantifies the daily water intake.</li>
86
  <li><b>Consumption of Alcohol (CALC):</b> Indicates the frequency of alcohol consumption.</li>
87
</ul>
88
89
<h3>Attributes Related to Physical Condition:</h3>
90
<ul>
91
  <li><b>Calories Consumption Monitoring (SCC):</b> Reflects the extent to which individuals monitor their calorie intake.</li>
92
  <li><b>Physical Activity Frequency (FAF):</b> Measures the frequency of engaging in physical activities.</li>
93
  <li><b>Time Using Technology Devices (TUE):</b> Indicates the duration spent using technology devices.</li>
94
  <li><b>Transportation Used (MTRANS):</b> Describes the mode of transportation typically used.</li>
95
</ul>
96
97
<p>Additionally, the dataset includes essential demographic variables such as gender, age, height, and weight, providing a comprehensive overview of individuals' characteristics.</p>
98
99
<h3>Target Variable:</h3>
100
<p>The target variable, <code>NObeyesdad</code>, represents different obesity risk levels, categorized as:</p>
101
<ul>
102
  <li>Underweight (BMI &lt; 18.5): 0</li>
  <li>Normal (18.5 &lt;= BMI &lt; 20): 1</li>
  <li>Overweight I (20 &lt;= BMI &lt; 25): 2</li>
  <li>Overweight II (25 &lt;= BMI &lt; 30): 3</li>
  <li>Obesity I (30 &lt;= BMI &lt; 35): 4</li>
  <li>Obesity II (35 &lt;= BMI &lt; 40): 5</li>
  <li>Obesity III (BMI &gt;= 40): 6</li>
109
</ul>
110
</details>
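
The encoding listed under Target Variable in the Dataset Overview can be sketched as a small lookup. This is a minimal illustration that assumes BMI is computed as weight in kilograms divided by height in meters squared and that the thresholds are exactly those listed in this README; the function names `bmi` and `risk_level` are hypothetical, and the labels in the actual dataset may not be derived from BMI alone.

```python
# Upper bounds (exclusive) and integer codes, copied from the list in this README.
_UPPER_BOUNDS = [
    (18.5, 0),  # Underweight
    (20.0, 1),  # Normal
    (25.0, 2),  # Overweight I
    (30.0, 3),  # Overweight II
    (35.0, 4),  # Obesity I
    (40.0, 5),  # Obesity II
]
_OBESITY_III = 6  # BMI >= 40


def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def risk_level(bmi_value):
    """Map a BMI value to the 0-6 code used in this README."""
    for upper_bound, code in _UPPER_BOUNDS:
        if bmi_value < upper_bound:
            return code
    return _OBESITY_III


# Example with made-up measurements: 70 kg at 1.75 m.
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2), risk_level(example_bmi))
```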
111
112
113
## Table of Contents:
114
115
116
<details>
117
<summary><strong>Section: 1. Introduction</strong></summary>
118
119
| No. | Topic                                      |
120
|-----|--------------------------------------------|
121
| 1.  | [What is Obesity?](#What-is-Obesity:)      |
122
| 2.  | [Understanding Obesity and Risk Prediction](#Understanding-Obesity-and-Risk-Prediction:)|
123
| 3.  | [Dataset Overview](#Dataset-Overview:)     |
124
125
126
127
</details>
128
129
<details>
130
<summary><strong>Section: 2. Importing Libraries and Dataset</strong></summary>
131
132
| No. | Topic                                |
133
|-----|--------------------------------------|
134
| 1.  | [Importing Relevant Libraries](#Importing-Relevant-Libraries:) |
135
| 2.  | [Loading Datasets](#Loading-Datasets:)                        |
136
137
138
</details>
139
140
<details>
141
<summary><strong>Section: 3. Descriptive Analysis</strong></summary>
142
143
| No. | Topic                                                               |
144
|-----|---------------------------------------------------------------------|
145
| 1.  | [Summary Statistic of dataframe](#1.-Summary-Statistic-of-dataframe:) |
146
| 2.  | [The unique values present in dataset](#2.-The-unique-values-present-in-dataset:) |
147
| 3.  | [The count of unique value in the NObeyesdad column](#3.-The-count-of-unique-value-in-the-NObeyesdad-column:) |
148
| 4.  | [Categorical and numerical Variables Analysis](#4.-Categorical-and-numerical-Variables-Analysis:) |
149
|     |   - [a. Extracting column names for categorical, numerical, and categorical but cardinal variables](#a.-Extracting-column-names-for-categorical,-numerical,-and-categorical-but-cardinal-variables:) |
150
|     |   - [b. Summary Of All Categorical Variables](#b.-Summary-Of-All-Categorical-Variables:) |
151
|     |   - [c. Summary Of All Numerical Variables](#c.-Summary-Of-All-Numerical-Variables:) |
152
153
154
</details>
155
156
<details>
157
<summary><strong>Section: 4. Data Preprocessing</strong></summary>
158
159
| No. | Topic                                                      |
160
|-----|------------------------------------------------------------|
161
| 1.  | [Typeconversion of dataframe](#1.-Typeconversion-of-dataframe:) |
162
| 2.  | [Renaming the Columns](#2.-Renaming-the-Columns:)          |
163
| 3.  | [Detecting Columns with Large or Infinite Values](#3.-Detecting-Columns-with-Large-or-Infinite-Values:) |
164
165
166
</details>
167
168
<details>
169
<summary><strong>Section: 5. Exploratory Data Analysis and Visualization-EDAV</strong></summary>
170
171
<details>
172
<summary><strong>1. Univariate Analysis</strong></summary>
173
  
174
| No. | Topic                                                      |
175
|-----|------------------------------------------------------------|
176
| a.  | [Countplots for all Variables](#a.-Countplots-for-all-Variables:) |
177
| b.  | [Analyzing Individual Variables Using Histogram](#b.-Analyzing-Individual-Variables-Using-Histogram:) |
178
| c.  | [KDE Plots of Numerical Columns](#c.-KDE-Plots-of-Numerical-Columns:) |
179
| d.  | [Pie Chart and Barplot for categorical variables](#d.-Pie-Chart-and-Barplot-for-categorical-variables:) |
180
| e.  | [Violin Plot and Box Plot for Numerical variables](#e.-Violin-Plot-and-Box-Plot-for-Numerical-variables:) |
181
182
183
</details>
184
185
<details>
186
<summary><strong>2. Bivariate Analysis</strong></summary>
187
188
| No. | Topic                                                                   |
189
|-----|-------------------------------------------------------------------------|
190
| a.  | [Scatter plot: AGE V/s Weight with Obesity Level](#a.-Scatter-plot:-AGE-V/s-Weight-with-Obesity-Level:) |
191
| b.  | [Scatter plot: AGE V/s Height with Obesity Level](#b.-Scatter-plot:-AGE-V/s-Height-with-Obesity-Level:) |
192
| c.  | [Scatter plot: Height V/s Weight with Obesity Level](#c.-Scatter-plot:-Height-V/s-Weight-with-Obesity-Level:) |
193
| d.  | [Scatter plot: AGE V/s Weight with Overweighted Family History](#d.-Scatter-plot:-AGE-V/s-Weight-with-Overweighted-Family-History:) |
194
| e.  | [Scatter plot: AGE V/s height with Overweighted Family History](#e.-Scatter-plot:-AGE-V/s-height-with-Overweighted-Family-History:) |
195
| f.  | [Scatter plot: Height V/s Weight with Overweighted Family History](#f.-Scatter-plot:-Height-V/s-Weight-with-Overweighted-Family-History:) |
196
| g.  | [Scatter plot: AGE V/s Weight with Transport use](#g.-Scatter-plot:-AGE-V/s-Weight-with-Transport-use:) |
197
| h.  | [Scatter plot: AGE V/s Height with Transport use](#h.-Scatter-plot:-AGE-V/s-Height-with-Transport-use:) |
198
| i.  | [Scatter plot: Height V/s Weight with Transport use](#i.-Scatter-plot:-Height-V/s-Weight-with-Transport-use:) |
199
200
</details>
201
202
<details>
203
<summary><strong>3. Multivariate Analysis</strong></summary>
204
205
| No. | Topic                                                                   |
206
|-----|-------------------------------------------------------------------------|
207
| a.  | [Pair Plot of Variables against Obesity Levels](#a.-Pair-Plot-of-Variables-against-Obesity-Levels:) |
208
| b.  | [Correlation heatmap for Pearson's correlation coefficient](#b.-Correlation-heatmap-for-Pearson's-correlation-coefficient:) |
209
| c.  | [Correlation heatmap for Kendall's tau correlation coefficient](#c.-Correlation-heatmap-for-Kendall's-tau-correlation-coefficient:) |
210
| d.  | [3D Scatter Plot of Numerical Columns against Obesity Level](#d.-3D-Scatter-Plot-of-Numerical-Columns-against-Obesity-Level:) |
211
212
213
<details>
214
<summary><strong>e. Cluster Analysis</strong></summary>
215
216
| No. | Topic                                                                        |
217
|-----|------------------------------------------------------------------------------|
218
| I.  | [K-Means Clustering on Obesity level](#I.-K-Means-Clustering-on-Obesity-level:) |
219
| II. | [PCA Plot of numerical variables against obesity level](#II.-PCA-Plot-of-numerical-variables-against-obesity-level:) |
220
221
222
</details>
223
224
</details>
225
226
<details>
227
<summary><strong>4. Outlier Analysis</strong></summary>
228
229
<details>
230
<summary><strong>a. Univariate Outlier Analysis</strong></summary>
231
232
| No. | Topic                                                         |
233
|-----|---------------------------------------------------------------|
234
| I.  | [Boxplot Outlier Analysis](#I.-Boxplot-Outlier-Analysis:)     |
235
| II. | [Detecting outliers using Z-Score](#II.-Detecting-outliers-using-Z-Score:) |
236
| III.| [Detecting outliers using Interquartile Range (IQR)](#III.-Detecting-outliers-using-Interquartile-Range-(IQR):) |
237
238
</details>
239
240
<details>
241
<summary><strong>b. Multivariate Outlier Analysis</strong></summary>
242
243
| No. | Topic                                                         |
244
|-----|---------------------------------------------------------------|
245
| I.   | [Detecting Multivariate Outliers Using Mahalanobis Distance](#I.-Detecting-Multivariate-Outliers-Using-Mahalanobis-Distance:) |
246
| II.  | [Detecting Multivariate Outliers Using Principal Component Analysis (PCA)](#II.-Detecting-Multivariate-Outliers-Using-Principal-Component-Analysis-(PCA):) |
247
| III. | [Detecting Cluster-Based Outliers Using KMeans Clustering](#III.-Detecting-Cluster-Based-Outliers-Using-KMeans-Clustering:) |
248
249
250
</details>
251
252
</details>
253
254
<details>
255
<summary><strong>5. Feature Engineering:</strong></summary>
256
257
| No. | Topic                                                              |
258
|-----|--------------------------------------------------------------------|
259
| a.  | [Encoding Categorical to numerical variables](#a.-Encoding-Categorical-to-numerical-variables:) |
260
| b.  | [BMI(Body Mass Index) Calculation](#b.-BMI(Body-Mass-Index)-Calculation:) |
261
| c.  | [Total Meal Consumed:](#c.-Total-Meal-Consumed:)                   |
262
| d.  | [Total Activity Frequency Calculation](#d.-Total-Activity-Frequency-Calculation:) |
263
| e.  | [Ageing process analysis](#e.-Ageing-process-analysis:)            |
264
265
266
</details>
267
268
</details>
269
270
<details>
271
<summary><strong>Section: 6. Analysis & Prediction Using Machine Learning(ML) Model</strong></summary>
272
273
| No. | Topic                                                                   |
274
|-----|-------------------------------------------------------------------------|
275
| 1.  | [Feature Importance Analysis and Visualization](#1.-Feature-Importance-Analysis-and-Visualization:) |
276
|     |   a. [Feature Importance Analysis  using Random Forest Classifier](#a.-Feature-Importance-Analysis--using-Random-Forest-Classifier:) |
277
|     |   b. [Feature Importance Analysis using XGBoost(XGB) Model](#b.-Feature-Importance-Analysis-using-XGBoost(XGB)-Model:) |
278
|     |   c. [Feature Importance Analysis Using (LightGBM) Classifier Model](#c.-Feature-Importance-Analysis-Using-(LightGBM)-Classifier-Model:) |
279
| 2.  | [Data visualization after Feature Engineering](#2.-Data-visualization-after-Feature-Engineering:) |
280
|     |   a. [Bar plot of numerical variables](#a.-Bar-plot-of-numerical-variables:) |
281
|     |   b. [PairPlot of Numerical Variables](#b.-PairPlot-of-Numerical-Variables:) |
282
|     |   c. [Correlation Heatmap of Numerical Variables](#c.-Correlation-Heatmap-of-Numerical-Variables:) |
283
284
285
</details>
286
287
<details>
288
<summary><strong>Section: 7. Prediction of Obesity Risk Level Using Machine learning(ML) Models</strong></summary>
289
290
| No. | Topic                                                                                                    |
291
|-----|----------------------------------------------------------------------------------------------------------|
292
| 1.  | [Machine Learning Model Creation: XGBoost and LightGBM - Powering The Predictions! 🚀](#1.-Machine-Learning-Model-Creation:-XGBoost-and-LightGBM---Powering-The-Predictions!-🚀) |
293
| 2.  | [Cutting-edge Machine Learning Model Evaluation: XGBoosting and LightGBM 🤖](#2.-Cutting-edge-Machine-Learning-Model-Evaluation:-XGBoosting-and-LightGBM-🤖) |
294
| 3.  | [Test Data Preprocessing for Prediction](#3.-Test-Data-Preprocessing-for-Prediction:) |
295
| 4.  | [Showcase Predicted Encdd_Obesity_Level Values on Test Dataset 📊](#4.-Showcase-Predicted-Encdd_Obesity_Level-Values-on-Test-Dataset-📊) |
296
297
298
</details>
299
300
<details>
301
<summary><strong>Section: 8. Conclusion: 📝</strong></summary>
302
303
| No. | Topic                                                                                  |
304
|-----|----------------------------------------------------------------------------------------|
305
| 1.  | [Conclusion: 📝](#Conclusion:-📝)                                                      |
306
| 2.  | [It's time to make Submission:](#It's-time-to-make-Submission:)                        |
307
308
309
</details>
310
311
If the notebook does not render in the GitHub repository, you can view it on nbviewer [here](https://nbviewer.org/github/Anamicca23/Obesity-Risk-Level-Prediction--Project-using-ML/blob/master/prediction-of-obesity-rlevels-using-ml-lightgbm_jupyter%20notebook.ipynb).
312
313
##  🎯 Project Objectives:
314
315
1. **Machine Learning Model Development**: 
316
   Develop a robust machine learning model leveraging advanced techniques to accurately predict obesity risk levels.
317
318
2. **Data Analysis and Feature Engineering**: 
319
   Conduct thorough analysis of demographics, lifestyle habits, and physical activity data to identify key factors influencing obesity risk. Implement effective feature engineering strategies to enhance model performance.
320
321
3. **Achieve 100% Accuracy**: 
322
   Strive for the highest achievable accuracy in predicting obesity risk levels, with 100% accuracy as the aspirational target. Employ rigorous model evaluation techniques and optimize model parameters accordingly.
323
324
4. **Actionable Insights**: 
325
   Provide actionable insights derived from the predictive model to facilitate targeted interventions and public health strategies. Enable healthcare professionals and policymakers to make informed decisions for obesity prevention and management.
326
327
5. **Documentation and Presentation**: 
328
   Ensure comprehensive documentation of the model development process and findings. Prepare clear and concise presentations to communicate results effectively to stakeholders.
329
330
331
## 🚀 Prerequisites:
332
333
- **Machine Learning Basics**: Understanding of supervised learning, model evaluation, and feature engineering.
334
- **Python Proficiency**: Proficiency in Python, including libraries like NumPy, Pandas, and Scikit-learn.
335
- **Data Analysis Skills**: Ability to perform EDA, preprocess datasets, and visualize data.
336
- **Jupyter Notebooks**: Familiarity with Jupyter Notebooks for interactive coding and documentation.
337
- **Health Data Understanding**: Basic knowledge of obesity, BMI calculation, and health-related datasets.
338
- **Computational Resources**: Access to a computer with sufficient processing power and memory.
339
- **Environment Setup**: Python environment setup with necessary libraries installed.
340
- **Version Control**: Familiarity with Git and GitHub for collaboration and project management.
341
- **Documentation Skills**: Ability to document methodologies and results effectively using markdown.
342
- **Passion for Data Science**: Genuine interest in data science and public health projects.
343
344
345
## Industry Relevance:
346
347
This project is highly relevant to the industry across several critical areas:
348
349
- **Healthcare Analytics**: Leveraging advanced machine learning techniques, this project facilitates predictive analysis in healthcare, enabling personalized interventions and preventive strategies.
350
351
- **Precision Medicine**: Accurately predicting obesity risk levels contributes to the advancement of precision medicine, allowing for tailored treatments and interventions based on individual health profiles.
352
353
- **Public Health Initiatives**: By providing actionable insights derived from data analysis, this project assists in formulating targeted public health initiatives to reduce obesity rates and improve population health outcomes.
354
355
- **Data-driven Decision Making**: Empowering healthcare professionals and policymakers with data-driven insights facilitates informed decision-making processes, optimizing resource allocation and intervention strategies.
356
357
- **Technology Integration**: Integrating machine learning models into healthcare systems enhances diagnostic capabilities, risk assessment, and patient management, driving efficiency and improving healthcare delivery.
358
359
- **Preventive Healthcare**: Emphasizing predictive analytics for obesity risk levels supports preventive healthcare initiatives, focusing on early detection and intervention to mitigate health risks and improve overall well-being.
360
361
362
363
364
<details>
365
<summary><strong>Libraries and Packages Requirement</strong></summary>
366
367
To execute this project, ensure the following libraries and packages are installed:
368
369
- **Python Standard Libraries**:
370
    - `os`: Operating system functionality
371
    - `pickle`: Serialization protocol for Python objects
372
    - `warnings`: Control over warning messages
373
    - `collections`: Container datatypes
374
    - `csv`: CSV file reading and writing
375
    - `sys`: System-specific parameters and functions
376
377
- **Data Processing and Analysis**:
378
    - `numpy`: Numerical computing library
379
    - `pandas`: Data manipulation and analysis library
380
381
- **Data Visualization**:
382
    - `matplotlib.pyplot`: Data visualization library
383
    - `seaborn`: Statistical data visualization library
384
    - `altair`: Declarative statistical visualization library
385
    - `mpl_toolkits.mplot3d`: 3D plotting toolkit
386
    - `tabulate`: Pretty-print tabular data
387
    - `colorama`: Terminal text styling library
388
389
- **Machine Learning and Model Evaluation**:
390
    - `scipy.stats`: Statistical functions
391
    - `sklearn.cluster`: Clustering algorithms
392
    - `sklearn.preprocessing`: Data preprocessing techniques
393
    - `sklearn.decomposition`: Dimensionality reduction techniques
394
    - `sklearn.ensemble`: Ensemble learning algorithms
395
    - `xgboost`: Extreme Gradient Boosting library
396
    - `lightgbm`: Light Gradient Boosting Machine library
397
398
- **Miscellaneous**:
399
    - `IPython.display.Image`: Displaying images in IPython
400
    - `sklearn.metrics`: Metrics for model evaluation
401
    - `sklearn.model_selection`: Model selection and evaluation tools
402
    - `sklearn.preprocessing.LabelEncoder`: Encode labels with a value between 0 and n_classes-1
403
    - `scipy.stats.pearsonr`: Pearson correlation coefficient and p-value for testing non-correlation
404
    - `scipy.stats.chi2`: Chi-square distribution
405
406
Make sure to have these libraries installed in your Python environment before running the code.
407
408
</details>
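
One way to confirm that the third-party packages listed above are importable before opening the notebook is a short check like the following. It is only a sketch: the list `REQUIRED` and the helper `missing_packages` are names chosen for this example, and the entries are import names (for instance `sklearn`, which is installed from PyPI as `scikit-learn`).

```python
from importlib.util import find_spec

# Import names of the third-party packages listed in this README.
REQUIRED = [
    "numpy", "pandas", "scipy", "sklearn", "matplotlib", "seaborn",
    "altair", "tabulate", "colorama", "xgboost", "lightgbm",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```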
409
410
411
<details>
412
<summary><strong>Tech Stack Used:</strong></summary>
413
414
<details>
415
<summary><strong>Programming Languages</strong></summary>
416
417
- **Python**: Used for data processing, analysis, machine learning model development, and scripting tasks.
418
419
</details>
420
421
<details>
422
<summary><strong>Libraries and Frameworks</strong></summary>
423
424
- **NumPy**: For numerical computing and array operations.
425
- **Pandas**: For data manipulation and analysis.
426
- **Matplotlib**: For static, interactive, and animated visualizations.
427
- **Seaborn**: For statistical data visualization.
428
- **Scikit-learn**: For machine learning algorithms and model evaluation.
429
- **XGBoost**: For gradient boosting algorithms.
430
- **LightGBM**: For gradient boosting algorithms with faster training speed and higher efficiency.
431
- **Altair**: For declarative statistical visualization.
432
- **IPython.display**: For displaying images in IPython.
433
- **Tabulate**: For pretty-printing tabular data.
434
- **Colorama**: For terminal text styling.
435
- **SciPy**: For scientific computing and statistical functions.
436
437
</details>
438
439
<details>
440
<summary><strong>Tools and Utilities</strong></summary>
441
442
- **Jupyter Notebook**: For interactive computing and data exploration.
443
- **Git**: For version control and collaboration.
444
- **GitHub**: For hosting project repositories and collaboration.
445
- **Travis CI**: For continuous integration and automated testing.
446
- **CircleCI**: For continuous integration and automated testing.
447
- **GitHub Actions**: For continuous integration and automated workflows directly within GitHub.
448
449
</details>
450
451
<details>
452
<summary><strong>Data Storage and Processing</strong></summary>
453
454
- **CSV Files**: For storing structured data.
455
- **Pickle**: For serializing and deserializing Python objects.
456
457
</details>
458
459
<details>
460
<summary><strong>Development Environment</strong></summary>
461
462
- **Operating System**: Platform-independent (Windows, macOS, Linux).
463
- **Integrated Development Environment (IDE)**: Any Python-compatible IDE like PyCharm, VS Code, or Jupyter Lab.
464
465
</details>
466
467
<details>
468
<summary><strong>Documentation and Collaboration</strong></summary>
469
470
- **Markdown**: For documenting project details, README files, and collaboration.
471
- **GitHub Wiki**: For project documentation and knowledge sharing.
472
- **Google Docs**: For collaborative documentation and note-taking.
473
474
</details>
475
476
<details>
477
<summary><strong>Version Control Requirements</strong></summary>
478
479
To manage code changes and collaboration effectively, the following version control tools and practices are recommended for this project:
480
481
1. **Git Installation**:
482
    - Download and install Git from the [official Git website](https://git-scm.com/downloads).
483
    - Ensure Git is properly configured on your system, including setting up your username and email address.
484
485
2. **GitHub Repository**:
486
    - Create a GitHub account if you don't have one.
487
    - Set up a new repository for the project on GitHub.
488
    - Initialize the local project directory as a Git repository using the following command:
489
        ```bash
490
        git init
491
        ```
492
493
3. **Collaboration Workflow**:
494
    - Follow a standard Git workflow, such as the feature branch workflow or Gitflow, for managing branches and code changes.
495
    - Utilize pull requests for code review and collaboration between team members.
496
    - Ensure consistent and descriptive commit messages to track changes effectively.
497
498
4. **Continuous Integration (CI)**:
499
    - Integrate a CI/CD pipeline with GitHub using platforms like Travis CI, CircleCI, or GitHub Actions.
500
    - Configure automated tests to run on each push or pull request to ensure code quality and reliability.
501
502
5. **Code Review**:
503
    - Conduct thorough code reviews for all pull requests to maintain code quality and ensure adherence to coding standards.
504
    - Provide constructive feedback and suggestions for improvement during code reviews.
505
506
By following these version control practices, you can streamline collaboration, track changes effectively, and ensure the stability and reliability of the project codebase.
507
508
</details>
509
</details>
510
511
512
## Installation Requirements:
513
514
To set up the environment for this project, follow these steps:
515
516
1. **Python Installation**:
517
    Ensure Python is installed on your system. You can download it from the [official Python website](https://www.python.org/downloads/).
518
519
2. **Virtual Environment (Optional but Recommended)**:
520
    - Install virtualenv: `pip install virtualenv`
521
    - Create a virtual environment: `virtualenv env`
522
    - Activate the virtual environment:
523
        - On Windows: `.\env\Scripts\activate`
524
        - On macOS and Linux: `source env/bin/activate`
525
526
3. **Required Libraries**:
527
    - Install necessary libraries using pip:
528
        ```bash
529
        pip install numpy pandas scipy scikit-learn matplotlib seaborn altair tabulate colorama jupyter xgboost lightgbm
530
        ```
531
    - These libraries cover data analysis, visualization, and machine learning tasks; XGBoost and LightGBM provide the gradient boosting models. The full list appears in the Libraries and Packages Requirement section above.
532
533
4. **Jupyter Notebook Installation** (Optional but Recommended):
534
    - Install Jupyter Notebook: `pip install notebook`
535
    - Launch Jupyter Notebook: `jupyter notebook`
536
537
5. **Git Installation** (Optional but Recommended):
538
    - Download and install Git from the [official Git website](https://git-scm.com/downloads).
539
540
6. **Project Repository**:
541
    - Clone the project repository from GitHub:
542
        ```bash
543
        git clone https://github.com/Anamicca23/Obesity-Risk-Level-Prediction--Project-using-ML
544
        ```
545
    - Alternatively, download the project files directly from the repository.
546
547
7. **Data Source**:
548
    - Ensure you have access to the dataset required for the project (provided in this repository).
    - Alternatively, download the dataset from the Kaggle competition page: [See here](https://www.kaggle.com/competitions/playground-series-s4e2)
550
551
8. **Environment Setup**:
552
    - Set up the project environment by installing all required dependencies listed in the project's requirements.txt file:
553
        ```bash
554
        pip install -r requirements.txt
555
        ```
556
557
9. **Run Jupyter Notebook**:
558
    - Navigate to the project directory containing the Jupyter Notebook file and launch Jupyter Notebook:
559
        ```bash
560
        jupyter notebook
561
        ```
562
563
10. **Project Configuration**:
564
    - Customize any project configurations or settings as necessary, such as file paths, model parameters, or data preprocessing steps.
565
566
11. **Documentation and Notes**:
567
    - Keep documentation and notes handy for reference during the project, including datasets, code snippets, and research papers related to obesity prediction and machine learning techniques.
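
Once the dataset from step 7 is in place, a quick sanity check of the class balance might look like the sketch below. It uses only `csv` and `collections` from the standard-library list above. The helper name `class_distribution`, the file name `train.csv` mentioned in the comment, and the sample rows are assumptions made for illustration; only the target column name `NObeyesdad` comes from this README.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_file, target_column="NObeyesdad"):
    """Count how many rows fall into each class of the target column."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or target_column not in reader.fieldnames:
        raise KeyError(f"column {target_column!r} not found in CSV header")
    return Counter(row[target_column] for row in reader)


# Invented sample rows standing in for the real file. With the real data,
# something like `with open("train.csv", newline="") as f:` could be used
# instead (the file name is an assumption).
sample = io.StringIO(
    "id,Gender,Age,NObeyesdad\n"
    "0,Male,24,Overweight_Level_II\n"
    "1,Female,18,Normal_Weight\n"
    "2,Female,31,Normal_Weight\n"
    "3,Male,44,Obesity_Type_I\n"
)
counts = class_distribution(sample)
for label, count in counts.most_common():
    print(label, count)
```

A strongly imbalanced distribution would be a reason to look beyond plain accuracy when comparing models.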
568
569
570
# Outcome and Analysis:
571
 **Model Evaluation Matrix:**
572
![Model Evaluation Matrix](https://github.com/Anamicca23/Obesity-Risk-Level-Prediction--Project-using-ML/assets/99593529/a924cce4-7d30-4690-9296-b24c74f69771)
573
574
 **Best Model Performance for Obesity Risk-Level Prediction:**
575
![best Model](https://github.com/Anamicca23/Obesity-Risk-Level-Prediction--Project-using-ML/assets/99593529/bf08169e-1edc-434a-8cb6-8bb982ad29f1)
576
577
**Result:**
578
  -  Based on the evaluation metrics, the models performed similarly, with only minor differences in accuracy, precision, recall, and F1-score. LightGBM achieved the highest accuracy at approximately 90.99%, followed by XGBoost at approximately 90.87%, the ensemble model (which combines predictions from XGBoost and LightGBM) at approximately 90.80%, and CatBoost at approximately 90.56%.
579
580
Considering the performance metrics and confusion matrices, LightGBM appears to have a slight edge over the other models in terms of accuracy and F1-score, with similar performance in precision and recall. However, the differences in performance among the models are relatively small, indicating that they are all capable of producing reliable predictions.
581
582
Therefore, based on the evaluation results, LightGBM appears to be the best model for predicting obesity risk levels.
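
The README does not say how the ensemble combines the XGBoost and LightGBM predictions. One common option is soft voting, which averages the two models' class probabilities and picks the most likely class. The sketch below assumes that approach; the function name `soft_vote`, the `weight_a` parameter, and the probability values are all hypothetical.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Blend two models' class-probability rows and return predicted class indices.

    prob_a and prob_b are lists of rows, one row per sample, where each row
    holds one probability per class. weight_a is the share given to model A.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same number of samples")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")

    predictions = []
    for row_a, row_b in zip(prob_a, prob_b):
        if len(row_a) != len(row_b):
            raise ValueError("both models must score the same number of classes")
        blended = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(row_a, row_b)]
        # Index of the class with the highest blended probability.
        predictions.append(max(range(len(blended)), key=blended.__getitem__))
    return predictions


# Invented probabilities for two samples and three classes.
model_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
print(soft_vote(model_a, model_b))
```

In practice the probability rows could come from calling `predict_proba` on each fitted classifier, and `weight_a` could be tuned on a validation split.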
583
584
  - Through our comprehensive analysis and predictive modeling efforts, we aim to achieve accurate classification of individuals into different obesity risk categories. This outcome will enable healthcare professionals to identify high-risk individuals, tailor interventions, and allocate resources effectively. Furthermore, our insights into the factors influencing obesity risk will inform public health policies and initiatives aimed at prevention and management. By leveraging data-driven approaches and advanced machine learning techniques, we aspire to make significant strides towards combating the global obesity epidemic and promoting healthier communities.
585
586
Enjoy the project!