a/README.md b/README.md

<h1 style="text-align: center;">ICU Admission Prediction for COVID-19 Patients Using Machine Learning</h1>

---

# **Table of Contents**
1. [Abstract](#abstract)
2. [Introduction](#introduction)
   - [Background](#background)
   - [Problem Statement](#problem-statement)
   - [Objective](#objective)
3. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
   - [Data Cleaning](#data-cleaning)
   - [Descriptive Statistics](#descriptive-statistics)
   - [Visualizations](#visualizations)
   - [Feature Balance](#feature-balance)
4. [Feature Engineering](#feature-engineering)
   - [Feature Selection](#feature-selection)
   - [New Feature Creation](#new-feature-creation)
   - [Encoding Categorical Data](#encoding-categorical-data)
5. [Model Selection and Training](#model-selection-and-training)
   - [Preprocessing Parameter Tuning](#preprocessing-parameter-tuning)
   - [Model Hyperparameter Tuning](#model-hyperparameter-tuning)
   - [Grid Search Process](#grid-search-process)
   - [Selecting the Best Model](#selecting-the-best-model)
6. [Results](#results)
   - [Model Performance](#model-performance)
7. [Discussion](#discussion)
   - [Handling Missing Values](#handling-missing-values)
   - [Feature Engineering](#feature-engineering-1)
   - [Model Selection](#model-selection)
   - [Feature Importance and Data Shuffling](#feature-importance-and-data-shuffling)
   - [Handling Unknown Values](#handling-unknown-values)
   - [Refining Missing Value Imputation](#refining-missing-value-imputation)
8. [Setup Instructions](#setup-instructions)

---

# **Abstract**

This project predicts whether a COVID-19 patient will need ICU admission from clinical data using machine learning models. The dataset includes key patient information such as vital signs, blood test results, underlying conditions, and demographic data. Exploratory data analysis (EDA) was conducted to understand patterns and relationships within the data, and features were selected and engineered to improve prediction accuracy. Several machine learning models, including Random Forest, SVM, and XGBoost, were trained and evaluated, with the F1-score as the main selection metric. The models were tuned through hyperparameter optimization with grid search. The outcome is a predictive model that can help healthcare professionals allocate ICU resources more effectively, potentially reducing mortality rates and improving the efficiency of hospital operations during the COVID-19 pandemic.

---

# **Introduction**

## **Background**
During the COVID-19 pandemic, hospitals faced unprecedented pressure from a surge in critically ill patients requiring ICU care. Predicting ICU admission from clinical data became crucial for optimizing hospital resources and ensuring that care reached those who needed it most. Early and accurate prediction of ICU needs can support better resource allocation, improve patient outcomes, and reduce mortality rates.

## **Problem Statement**
The goal of this project is to develop a machine learning model that predicts whether a COVID-19 patient will require ICU admission, based on clinical features such as vital signs, blood test results, and underlying health conditions.

## **Objective**
This project uses machine learning techniques to analyze clinical data and estimate the likelihood of ICU admission for COVID-19 patients. By predicting ICU needs accurately, the model is intended to help healthcare providers make informed decisions, manage hospital resources, and improve patient care during the pandemic.

---

# **Setup Instructions**
- Go to the repository root directory, where the `Dockerfile` is located.
- Build the image with `docker build -t model .`
- Start the container with `docker run -d -p 8000:8000 model`
- Open the `request.py` file, which is provided for easier testing.
- Place your test dataset in the root directory as `dataset.xlsx`.
- Run the `request.py` script to test the model.

---

# **Exploratory Data Analysis (EDA)**

## **Data Cleaning**:
   - **Frequency of missing values:**

      ![frequency of missing values](./images/missing_values_hist.png)

      According to the chart, almost 175 columns have between 1000 and 1100 missing values. This suggests that many columns will be challenging to fill, particularly because this is a medical dataset.
   <br><br>

   - **Columns with Most Missing Values:**

      ![most missing values](./images/most_missing_values.png)

      In the initial stages of data cleaning, missing values were handled using several techniques. The `_remove_missing_values` function filtered columns based on a defined threshold for missing data: columns with a share of missing values above this threshold were excluded to maintain dataset quality. Additionally, sparse rows (those with many missing values) were removed using the `_remove_sparse_row` function, so that only rows with sufficient information were retained. However, this row-removal step can be skipped to make the model more robust, as real-world test data may include missing values that need to be filled during prediction. Data correction, normalization, and scaling were applied where necessary to ensure feature consistency.

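The two filtering steps described above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the function names, signatures, thresholds, and toy data are hypothetical and are not the project's `_remove_missing_values` or `_remove_sparse_row` implementations.

```python
import numpy as np
import pandas as pd


def drop_columns_by_missing(df: pd.DataFrame, max_missing_frac: float) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is <= max_missing_frac."""
    missing_frac = df.isna().mean()  # per-column fraction of NaN
    return df.loc[:, missing_frac <= max_missing_frac]


def drop_sparse_rows(df: pd.DataFrame, max_missing_frac: float) -> pd.DataFrame:
    """Keep only rows whose fraction of missing values is <= max_missing_frac."""
    missing_frac = df.isna().mean(axis=1)  # per-row fraction of NaN
    return df.loc[missing_frac <= max_missing_frac]


# Toy data with hypothetical column names; 'lab_b' is 75% missing.
toy = pd.DataFrame({
    "heart_rate": [80.0, 95.0, np.nan, 110.0],
    "lab_a": [1.2, np.nan, np.nan, 0.9],
    "lab_b": [np.nan, np.nan, np.nan, 3.1],
})

cleaned_cols = drop_columns_by_missing(toy, max_missing_frac=0.5)  # drops 'lab_b'
cleaned = drop_sparse_rows(cleaned_cols, max_missing_frac=0.5)     # drops the all-NaN row
```

Filtering columns before rows, as here, means a row is judged only on the columns that survive; the reverse order could give a different result.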
## **Descriptive Statistics**:
   - **Some Features After Preprocessing and Filling Missing Values:**

      ![outliers](./images/outliers.png)

      According to the chart, some columns contain a small percentage of outliers, which could affect the final results, especially in combination with missing data. However, because missing values were imputed with column-specific techniques, these outliers are not expected to have a significant effect on the final outcome.

   - The `describe()` method was used to produce summary statistics for each feature, such as the mean, median (50th percentile), and standard deviation. This gave an overview of the central tendency and variability of the data. Exploratory analysis also included examining the distribution of values across critical features such as patient age, heart rate, blood pressure, and blood markers, which helped identify unusual patterns.

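As a small illustration of the summary step above, the sketch below calls pandas `describe()` on a toy table. The column names and values are made up for the example and do not come from the project's dataset.

```python
import pandas as pd

# Toy vitals table; column names and values are illustrative only.
vitals = pd.DataFrame({
    "age": [34, 51, 67, 72, 80],
    "heart_rate": [72, 88, 95, 110, 102],
})

# describe() returns count, mean, std, min, quartiles, and max per numeric column.
summary = vitals.describe()
print(summary)

# The median appears as the "50%" row and spread as "std";
# variance is not part of describe() and has to be requested separately.
variances = vitals.var()
```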
## **Visualizations**:
   - **Some Features Are More Effective at Distinguishing Labels:**

      ![cluster](./images/cluster.png)
   - **Some Features Differ Markedly by Label:**

      ![age vs target](./images/age_vs_target.png)

      Visualizations were generated to show the distribution of key variables such as vital signs and blood test results. Correlation plots were used to identify potential relationships between features and the target (e.g., the correlation between oxygen saturation and ICU admission). Scatter plots, histograms, and box plots were also used to identify outliers and understand the spread of the data.

## **Feature Balance**:
   - **Labels Before Preprocessing:**

      ![imbalance before](./images/imbalanced_before.png)

   - **Labels After Preprocessing:**

      ![imbalance after](./images/imbalanced_after.png)

      A key part of EDA was determining whether the dataset was imbalanced with respect to ICU admissions. Initial checks showed that it was: a smaller percentage of patients required ICU admission. This imbalance was addressed through careful model evaluation and metric selection (using the F1-score rather than accuracy alone), so that the model's performance could be judged reliably for both classes.

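To illustrate why the F1-score is a safer metric than accuracy on imbalanced labels, the sketch below scores a trivial "never ICU" predictor on toy labels. The data and the helper function are hypothetical and written for this example only; they are not taken from the project code.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = ICU admission (minority class), 0 = no ICU admission.
y_true = [0] * 8 + [1] * 2
label_counts = Counter(y_true)  # reveals the 8-to-2 imbalance

# A useless model that always predicts "no ICU" still reaches 80% accuracy...
always_negative = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)

# ...but its F1-score on the ICU class is 0, which exposes the problem.
f1_icu = f1_for_class(y_true, always_negative, positive=1)
```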

**More charts and interpretations are available in `EDA.ipynb`.**

---

# **Feature Engineering**

## **Feature Selection**:
   Feature selection combined statistical methods with domain knowledge. A `VarianceThreshold` was used to remove low-variance features, since near-constant features carry little information for the model. Additionally, correlation analysis was applied to remove redundant features and retain those more strongly related to ICU admission. Mappings from the `_mapping()` function were applied to variables such as age, gender, hypertension, and other conditions to convert categorical values into numerical representations.

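A minimal sketch of the two statistical filters described above, using scikit-learn's `VarianceThreshold` followed by a pairwise correlation filter. The toy columns, the variance threshold of `0.0`, and the correlation cutoff of `0.95` are assumptions chosen for illustration, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix; names and values are illustrative only.
X = pd.DataFrame({
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],         # zero variance
    "oxygen_sat": [0.98, 0.91, 0.85, 0.95, 0.80],
    "oxygen_sat_copy": [0.98, 0.91, 0.85, 0.95, 0.80],  # duplicate of oxygen_sat
    "resp_rate": [14.0, 22.0, 30.0, 16.0, 19.0],
})

# Step 1: keep only features whose variance is above the threshold.
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
X_var = X.loc[:, selector.get_support()]

# Step 2: for every highly correlated pair, drop the later column.
corr = X_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_selected = X_var.drop(columns=to_drop)
```

Note that `VarianceThreshold` works on raw variances, so the effect of a given threshold depends on how each feature is scaled.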
## **New Feature Creation**:
   Feature engineering involved combining existing features into new, more meaningful ones. For instance, blood test results were transformed into aggregated metrics such as the median, mean, or difference from baseline values (`albumin_median`, `albumin_diff`, etc.). These new features helped capture patient trends over time, which are important for predicting ICU admission.

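The aggregation idea can be sketched with a pandas `groupby`. The long-format layout, the `patient_id` column, and the definition of `albumin_diff` as the maximum minus the minimum reading are assumptions made for this example; the project may define the difference relative to a baseline in another way.

```python
import pandas as pd

# Long-format toy measurements (hypothetical layout): one row per lab reading.
readings = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "albumin": [3.5, 3.1, 2.9, 4.0, 4.2],
})

# Aggregate each patient's readings into summary features.
albumin_features = readings.groupby("patient_id")["albumin"].agg(
    albumin_median="median",
    albumin_mean="mean",
    albumin_min="min",
    albumin_max="max",
)

# Assumption: "diff" is the spread between the highest and lowest reading.
albumin_features["albumin_diff"] = (
    albumin_features["albumin_max"] - albumin_features["albumin_min"]
)
```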
## **Encoding Categorical Data**:
   Categorical features, such as `age`, `gender`, `hypertension`, and `other conditions`, were mapped to numerical values using custom-defined mappings (see the `_mapping()` function). These included binary mappings (e.g., a flag for age greater than 65), so that the features could be used directly by machine learning algorithms and all categorical features were encoded consistently for model training.

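A minimal sketch of dictionary-based encoding in the spirit of the mapping step above. The raw column values, the mapping tables, and the output column names are hypothetical; the project's `_mapping()` function may use different categories and codes.

```python
import pandas as pd

# Hypothetical raw columns and category labels, for illustration only.
patients = pd.DataFrame({
    "age": [45, 70, 66, 65],
    "gender": ["male", "female", "female", "male"],
    "hypertension": ["yes", "no", "yes", "no"],
})

# Explicit mapping tables keep the encoding identical for training and test data.
GENDER_MAP = {"male": 0, "female": 1}
YES_NO_MAP = {"no": 0, "yes": 1}

encoded = pd.DataFrame({
    "age_above_65": (patients["age"] > 65).astype(int),  # binary age flag
    "gender": patients["gender"].map(GENDER_MAP),
    "hypertension": patients["hypertension"].map(YES_NO_MAP),
})
```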

---

# **Model Selection and Training**

The model selection process involved an extensive grid search over both preprocessing parameters and machine learning model configurations. Combinations of preprocessing parameters and model hyperparameters were tested systematically to optimize performance.

## **Preprocessing Parameter Tuning**
The preprocessing stage was tuned using the following parameters:
- **Missing Value Percentage (`missing_value_per`)**: This parameter controls the threshold for removing columns based on missing values. Thresholds ranging from 10% to 50% were tested to balance data retention against data quality.
- **Variance Threshold (`variance_threshold`)**: This threshold was used to remove features with low variance, as they may not provide meaningful information for the model. The grid tested thresholds from 0.0 to 0.2.
- **Minimum Null Percentage (`min_null_per`)**: Held constant at 0.5, this parameter ensured that rows with excessive missing data were removed, while rows below the threshold were retained to preserve valuable data for model training.

For each combination of these preprocessing parameters, the data was split into training and test sets with an 80/20 ratio, with shuffling enabled to randomize the split.

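The parameter sweep and split described above can be sketched as follows. Only the ranges come from the text; the individual grid values, the toy data, and the fixed `random_state` are assumptions added so that the example is reproducible.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split

# Assumed grid values; the text only states the ranges that were explored.
preprocess_grid = {
    "missing_value_per": [0.1, 0.2, 0.3, 0.4, 0.5],
    "variance_threshold": [0.0, 0.05, 0.1, 0.2],
    "min_null_per": [0.5],  # held constant
}

# Expand the grid into one dict per parameter combination.
keys = list(preprocess_grid)
combinations = [dict(zip(keys, values)) for values in product(*preprocess_grid.values())]

# Toy data standing in for the features and ICU labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

for params in combinations:
    # A real run would apply the preprocessing configured by `params` here.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42  # random_state is an assumption
    )
```

With these assumed values the sweep covers 5 x 4 x 1 = 20 preprocessing configurations, each of which is then combined with every model hyperparameter setting.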
1. **Preprocessing Pipeline**:  
   The `Preprocess` class was applied to the training data to drop columns with too many missing values, remove low-variance features, and remove sparse rows. After these transformations, the `MissingValue` class was used to fill any remaining missing values. The imputation method was chosen by simulating missing data, filling it with several candidate methods, and selecting the method with the lowest mean absolute error (MAE).

2. **Mapping and Consistency**:  
   A key step in preprocessing was ensuring that the same transformations were applied to both the training and test datasets. The `_mapping()` function was used to keep feature mappings, such as categorical encodings, consistent and to ensure that the test data columns matched those of the training data.

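The imputation-selection idea from step 1 can be sketched as below: hide a share of the known values, fill them with each candidate, and keep the candidate with the lowest MAE. The function name, the two candidate methods, and the masking fraction are assumptions for illustration; the project's `MissingValue` class may compare other methods.

```python
import numpy as np
import pandas as pd


def pick_imputation_method(column: pd.Series, mask_frac: float = 0.3, seed: int = 0) -> str:
    """Hide some known values, impute them, and return the method with the lowest MAE."""
    rng = np.random.default_rng(seed)
    known = column.dropna()
    n_hidden = max(1, int(len(known) * mask_frac))
    hidden_idx = rng.choice(known.index.to_numpy(), size=n_hidden, replace=False)

    masked = column.copy()
    masked.loc[hidden_idx] = np.nan  # simulate missing data on a copy

    # Candidate fill values, computed only from what is still observed.
    candidates = {
        "mean": masked.mean(),
        "median": masked.median(),
    }
    # Mean absolute error between the hidden true values and each fill value.
    errors = {
        name: float(np.abs(known.loc[hidden_idx] - fill).mean())
        for name, fill in candidates.items()
    }
    return min(errors, key=errors.get)


# Toy column with one extreme value and one genuinely missing entry.
lab_values = pd.Series([1.0, 1.2, 0.9, 1.1, np.nan, 1.0, 1.3, 0.8, 1.1, 25.0])
chosen = pick_imputation_method(lab_values)
```

Running the selection per column lets skewed columns and roughly symmetric columns end up with different fill strategies.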
## **Model Hyperparameter Tuning**
Once the data was preprocessed, different machine learning models were evaluated using grid search to find the best configuration. Tuning focused on Random Forest and XGBoost, as they performed better than SVM and Logistic Regression in initial experiments.

1. **Random Forest Hyperparameters**:
   - **Number of Trees (`n_estimators`)**: Values of 150 and 250 were tested to determine the optimal number of trees in the forest.
   - **Tree Depth (`max_depth`)**: Tested at 10 and 20, this parameter limits the maximum depth of the trees to reduce overfitting while still capturing sufficient complexity.
   - **Minimum Samples to Split (`min_samples_split`)**: Values of 2, 5, and 10 were tested for the minimum number of samples required to split a node.
   - **Minimum Samples per Leaf (`min_samples_leaf`)**: Values of 1 and 2 were tested for the smallest number of samples allowed in a leaf node.
   - **Bootstrap (`bootstrap`)**: This was set to `True` so that each tree is built on a bootstrap sample, which helps reduce variance.

2. **XGBoost Hyperparameters**:
   - **Number of Boosting Rounds (`n_estimators`)**: Tested at 100 and 200 to find the optimal number of boosting iterations.
   - **Maximum Depth (`max_depth`)**: Tested at 3 and 6 to control the depth of trees, balancing model complexity and generalization.
   - **Learning Rate (`learning_rate`)**: Values of 0.1 and 0.2 were tested to control the step size at each iteration.
   - **Subsample (`subsample`)**: Values of 0.6 and 1.0 were tested for the fraction of samples used to train each tree.
   - **Column Subsample (`colsample_bytree`)**: Values of 0.8 and 1.0 were tested for the fraction of features sampled when building each tree.

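The values listed above translate directly into two search grids. The sketch below writes them as plain dictionaries and counts how many configurations a full grid search would evaluate per model; the variable and function names are illustrative.

```python
from math import prod

# Grids transcribed from the values listed above.
rf_grid = {
    "n_estimators": [150, 250],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True],
}
xgb_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 6],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.6, 1.0],
    "colsample_bytree": [0.8, 1.0],
}


def grid_size(grid: dict) -> int:
    """Number of hyperparameter combinations a full grid search would try."""
    return prod(len(values) for values in grid.values())


rf_candidates = grid_size(rf_grid)    # 2 * 2 * 3 * 2 * 1 = 24
xgb_candidates = grid_size(xgb_grid)  # 2 * 2 * 2 * 2 * 2 = 32
```

Because every model configuration is evaluated for every preprocessing configuration, the total number of training runs is the product of the two grid sizes.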
## **Grid Search Process**
For each combination of preprocessing parameters and model hyperparameters, the training and test datasets were prepared. The `MLModelSelector` class was then used to train the models and evaluate their performance on the test set, so that every combination of hyperparameters was tested.

For every combination of preprocessing and model parameters:
- **Training**: The model was trained on the preprocessed training data (`X_train` and `y_train`).
- **Evaluation**: The model was tested on the preprocessed test data (`X_test` and `y_test`), and performance was measured using evaluation metrics such as accuracy or F1-score.
- **Best Score Selection**: The combination of preprocessing parameters and model hyperparameters with the highest score was selected.

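The train, evaluate, and select loop above can be sketched with scikit-learn as follows. This is not the project's `MLModelSelector` class: the synthetic data, the deliberately tiny grid, and the use of `ParameterGrid` are assumptions made to keep the example short and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic, imbalanced stand-in for the preprocessed clinical data.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [20, 40], "max_depth": [3, 6]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)                      # training
    score = f1_score(y_test, model.predict(X_test))  # evaluation
    if score > best_score:                           # best score selection
        best_score, best_params = score, params

print("Best Model Params:", best_params)
print("Best Overall Score:", best_score)
```

Selecting on the same test set that is later reported tends to give an optimistic score; a separate validation split or cross-validation is a common way to get a less biased estimate.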
## **Selecting the Best Model**
The best overall model was selected by comparing the performance scores of all grid search iterations. The combination of preprocessing parameters and model hyperparameters that produced the highest score was recorded as the optimal solution. The chosen model was either Random Forest or XGBoost, depending on performance during the search. This approach is intended to yield a model that is not only well tuned but also able to cope with data complexities such as missing values and class imbalance.

After the entire process had run, the best preprocessing parameters, model parameters, and overall score were printed for final evaluation.

---

# **Results**

## **Model Performance**

After running the model with various parameters, including preprocessing options, the best model parameters were selected based on their F1 score and then saved. Evaluating both the model and the preprocessing parameters produced the following results:

| Iteration | Score |
|-----------|-------|
| 1 | 0.7454545454545455 |
| 2 | 0.7454545454545455 |
| 3 | 0.7484030554078361 |
| 4 | 0.7546958304853042 |
| ... | ... |
| 20 | 0.8853833897195243 |
| 21 | 0.8853833897195243 |
| 22 | 0.8853833897195243 |
| 23 | 0.8853833897195243 |
| 24 | 0.8853833897195243 |
| 25 | 0.8853833897195243 |

- Best Preprocess Params: `{'missing_value_per': 0.4, 'variance_threshold': 0.05, 'min_null_per': 0.5}`
- Best Model Params: `{'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 250}`
- Best Overall Score: `0.8853833897195243`

The detailed results were logged to MLflow and are as follows:

![mlflow1](./images/mlflow1.png)

![mlflow2](./images/mlflow2.png)

![mlflow3](./images/mlflow3.png)

---

# **Discussion**
213
# **Discussion**
231
214
232
## **Handling Missing Values**

Dropping rows with many missing values initially seemed like a viable option. However, the test data may also contain many rows with missing values, and those rows still need a prediction, so the better approach for this project was to fill the missing values instead. This lets the model handle unseen test data with missing entries, which should improve generalization.

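The trade-off between dropping and filling can be illustrated with a minimal sketch. The column names and values below are hypothetical, and the median is only one possible fill rule, not necessarily the one used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative, not the project's.
train = pd.DataFrame({"heart_rate": [80.0, np.nan, 95.0, 70.0],
                      "lactate": [1.2, 2.0, np.nan, 1.6]})
test = pd.DataFrame({"heart_rate": [np.nan, 88.0],
                     "lactate": [1.9, np.nan]})

# Dropping incomplete rows would leave no test rows to score here.
rows_left_after_drop = len(test.dropna())

# Learn the fill values on the training split only, then reuse them on test.
fill_values = train.median()
test_filled = test.fillna(fill_values)

print(rows_left_after_drop)
print(test_filled)
```

Computing the fill values on the training split and reusing them on the test split keeps information from the test data out of preprocessing.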
## **Feature Engineering**

Several feature engineering techniques were considered. PCA is often useful for dimensionality reduction, but it was not a good fit for this project because feature selection already discards a large number of features. KMeans clustering was a better option: it allowed us to add a new feature derived from the important features that remain after feature reduction, which improved model performance.

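As a rough sketch of the KMeans idea, the snippet below appends a cluster label as an extra column. The synthetic data, the scaling step, and `n_clusters=2` are assumptions made for illustration; the project's actual settings are not stated in this section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the features kept after selection (assumption).
X_train = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
X_test = rng.normal(5, 1, (10, 3))

# KMeans is distance based, so the features are scaled first.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(scaler.transform(X_train))

# Append the cluster label as one extra column.
X_train_aug = np.column_stack([X_train, kmeans.labels_])
X_test_aug = np.column_stack([X_test, kmeans.predict(scaler.transform(X_test))])

print(X_train_aug.shape, X_test_aug.shape)
```

The clustering is fitted on training data only, and test rows receive a label through `predict`, so the new feature is built the same way for both splits.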
## **Model Selection**

Various algorithms were tested before choosing models for grid search and hyperparameter tuning. Initial results showed that Random Forest and XGBoost outperformed SVM and Logistic Regression for most parameter sets. We therefore focused on optimizing Random Forest and XGBoost, which showed the best potential for this project.

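A first-pass comparison of this kind might look like the sketch below. It runs on synthetic data with default accuracy scoring, both of which are assumptions, and XGBoost is left out only to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean 5-fold cross-validated accuracy per candidate.
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in candidates.items()}
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {mean_scores[name]:.3f}")
```

Only the top candidates from such a pass would then go into the more expensive grid search.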
## **Feature Importance and Data Shuffling**

The dataset's plots highlighted two or three features as particularly strong for classification. However, results were inconsistent across different random splits of the data, with some splits yielding lower performance. This suggests that although these features are important, model performance is sensitive to how the data is shuffled and split, which needs further investigation.

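One way to quantify this sensitivity is to repeat the train/test split with several seeds and look at the spread of the scores. The sketch below uses synthetic data and accuracy as the metric (both assumptions), with the Random Forest settings reported under "Best Model Params".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with only a few informative features (assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=3, random_state=0)

scores = []
for seed in range(5):
    # Only the split seed changes; the model seed stays fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = RandomForestClassifier(
        bootstrap=True, max_depth=10, min_samples_leaf=2,
        min_samples_split=10, n_estimators=250, random_state=0)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
print(f"mean={scores.mean():.3f} std={scores.std():.3f} "
      f"min={scores.min():.3f} max={scores.max():.3f}")
```

A large gap between the minimum and maximum score points to split sensitivity; repeated or stratified k-fold cross-validation would give a more stable estimate than any single split.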
## **Handling Unknown Values**

For features with unknown values, especially those where most of the data falls into a single category, we assigned the majority value to the missing entries. This kept the dataset consistent and gave the model representative values for those features.

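Majority-value filling can be sketched in a few lines of pandas. The column name and the 0.7 dominance threshold are hypothetical, chosen only to illustrate the rule described above.

```python
import numpy as np
import pandas as pd

# Hypothetical feature where one category dominates.
df = pd.DataFrame(
    {"comorbidity_flag": [0, 0, 0, np.nan, 1, 0, np.nan, 0]})

dominance_threshold = 0.7  # assumption, not taken from the project

col = df["comorbidity_flag"]
# Share of each known value; NaN is excluded by default.
shares = col.value_counts(normalize=True)
majority_value = shares.index[0]
majority_share = shares.iloc[0]

# Fill with the majority value only when one category clearly dominates.
if majority_share >= dominance_threshold:
    df["comorbidity_flag"] = col.fillna(majority_value)

print(df["comorbidity_flag"].tolist())
```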
## **Refining Missing Value Imputation**

Initially, missing values were imputed with a single method across the entire dataset. Further analysis showed that some features had a dominant value, so we switched to choosing the imputation method for each feature individually. For each feature, we treated some known values as missing, filled them with several candidate imputation methods, and calculated the mean absolute error (MAE) of each method against the true values. The method with the lowest MAE was then applied to that feature.
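
The selection procedure above can be sketched as follows. The helper `mae_per_strategy`, the three candidate strategies, the 20% masking fraction, and the toy columns are all hypothetical choices for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def mae_per_strategy(series, strategies, mask_fraction=0.2, seed=0):
    """Hide some known values, fill them per strategy, return MAE per strategy."""
    rng = np.random.default_rng(seed)
    known = series.dropna()
    n_hidden = max(1, int(len(known) * mask_fraction))
    hidden_idx = rng.choice(known.index.to_numpy(), size=n_hidden, replace=False)
    truth = known.loc[hidden_idx]           # values we pretend are missing
    visible = known.drop(index=hidden_idx)  # values the strategy may use
    return {name: float((truth - fill(visible)).abs().mean())
            for name, fill in strategies.items()}


# Candidate strategies: each maps the visible values to one fill value.
strategies = {
    "mean": lambda s: s.mean(),
    "median": lambda s: s.median(),
    "mode": lambda s: s.mode().iloc[0],
}

# Hypothetical features: one with a dominant value, one spread out.
df = pd.DataFrame({
    "dominant_value": [0.0] * 18 + [10.0] * 2 + [np.nan] * 2,
    "spread_out": np.linspace(1, 20, 20).tolist() + [np.nan] * 2,
})

chosen = {}
for col in df.columns:
    errors = mae_per_strategy(df[col], strategies)
    best = min(errors, key=errors.get)  # lowest MAE wins
    chosen[col] = best
    df[col] = df[col].fillna(strategies[best](df[col].dropna()))

print(chosen)
```

Because the masked values are drawn at random, the chosen method can vary with the seed; averaging the MAE over several seeds would make the choice more stable.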