This project aims to predict whether a COVID-19 patient will need ICU admission based on clinical data using machine learning models. The dataset includes key patient information such as vital signs, blood test results, underlying conditions, and demographic data. Exploratory data analysis (EDA) was conducted to understand patterns and relationships within the data. Features were selected and engineered to improve prediction accuracy. Several machine learning models, including Random Forest, SVM, and XGBoost, were trained and evaluated using appropriate performance metrics. The models were fine-tuned through hyperparameter optimization techniques such as grid search. The outcome is a predictive model that can help healthcare professionals allocate ICU resources more effectively, potentially reducing mortality rates and improving the overall efficiency of hospital operations during the COVID-19 pandemic.
During the COVID-19 pandemic, hospitals faced unprecedented pressure due to a surge in critically ill patients requiring ICU care. The ability to predict ICU admission for patients based on clinical data became crucial for optimizing hospital resources and ensuring that care was delivered to those who needed it the most. Early and accurate prediction of ICU needs can assist in better resource allocation, improve patient outcomes, and reduce mortality rates.
The goal of this project is to develop a machine learning model that can predict whether a COVID-19 patient will require ICU admission based on clinical features such as vital signs, blood test results, and underlying health conditions.
This project aims to use machine learning techniques to analyze clinical data and predict the likelihood of ICU admission for COVID-19 patients. By accurately predicting ICU needs, the model will help healthcare providers make informed decisions, manage hospital resources, and improve patient care during the pandemic.
To build and run the service with Docker:

```bash
docker build -t model .
docker run -d -p 8000:8000 model
```
A `request.py` script is included for easier testing. Place `dataset.xlsx` in the root directory and use the `request.py` script to test the running service.
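The exact payload depends on the served model's input schema; the following is a minimal sketch of what such a test request might look like, assuming a JSON `/predict` endpoint (the endpoint path and feature names are assumptions, not necessarily what `request.py` uses):

```python
# Minimal sketch of a test request. The "/predict" path and the feature
# names below are assumptions; check request.py for the real schema.
import requests

sample_patient = {"age": 1, "gender": 0, "hypertension": 1}  # illustrative values

response = requests.post("http://localhost:8000/predict", json=sample_patient)
print(response.status_code, response.json())
```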
Frequency of missing values:
According to the chart, almost 175 columns have between 1,000 and 1,100 missing values. This suggests that many columns will be challenging to fill, considering we are dealing with a medical dataset.
In the initial stages of data cleaning, missing values were handled using various techniques. The `_remove_missing_values` function was used to filter columns based on a defined threshold for missing data: columns with missing values above this threshold were excluded to maintain dataset quality. Additionally, sparse rows (those with many missing values) were removed using the `_remove_sparse_row` function to retain only rows with sufficient information. However, this step can be skipped to make the model more robust, as real-world test data may include missing values that need to be filled during prediction. Data correction, normalization, and scaling were applied where necessary to ensure feature consistency.
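A minimal sketch of what these two helpers might look like, assuming they operate on a pandas DataFrame (the actual signatures in the project may differ):

```python
import pandas as pd

def _remove_missing_values(df: pd.DataFrame, missing_value_per: float) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds the threshold."""
    keep = df.columns[df.isna().mean() <= missing_value_per]
    return df[keep]

def _remove_sparse_row(df: pd.DataFrame, min_null_per: float) -> pd.DataFrame:
    """Drop rows in which more than `min_null_per` of the values are missing."""
    return df[df.isna().mean(axis=1) <= min_null_per]
```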
According to the chart, some columns have a small percentage of outliers, which could affect the final results, especially in combination with missing data. However, since we used column-specific missing value imputation techniques, the outliers should have only a limited impact on the final outcome.
The `describe()` method was utilized to provide summary statistics for the dataset, such as the mean, median, and variance of each feature. This allowed for an understanding of the central tendencies and the variability of the data. Exploratory analysis also included calculating the distribution of values across critical features such as patient age, heart rate, blood pressure, and blood markers, which helped identify any unusual patterns.

Some Features Tend to Be More Effective for Distinguishing Labels:
Some Features Differ Markedly by Label:
Visualizations were generated to show the distribution of key variables such as vital signs and blood test results. Correlation plots were used to identify potential relationships between features (e.g., the correlation between oxygen saturation and ICU admission). Scatter plots, histograms, and box plots were also used to identify outliers and understand the data spread.
A key part of EDA involved determining whether the dataset was imbalanced with respect to ICU admissions. Initial checks showed that the data was imbalanced, as a smaller percentage of patients required ICU admission. This imbalance was handled using techniques such as careful model evaluation and metric selection (such as F1-score) to ensure the model’s performance was reliable for both classes.
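For illustration, per-class performance can be inspected with scikit-learn's metrics (the labels below are dummies; in the project these would be the test labels and the model's predictions):

```python
from sklearn.metrics import classification_report, f1_score

# Dummy labels purely to illustrate the metric calls.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

print(f1_score(y_true, y_pred))               # F1 on the positive (ICU) class
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```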
Further interesting charts and interpretations can be found in `EDA.ipynb`.
Feature selection was performed using both statistical and domain knowledge-based methods. A `VarianceThreshold` was used to remove low-variance features, ensuring that only the most informative features were retained. Additionally, correlation analysis was applied to remove redundant features and retain those with a stronger impact on predicting ICU admission. Specific mappings from the `_mapping()` function were applied to variables like age, gender, hypertension, and other conditions to convert categorical values into numerical representations.
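A sketch of this two-step selection (the correlation cutoff of 0.95 is an assumption; the variance threshold values come from the grid described later):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def select_features(X: pd.DataFrame, variance_threshold: float = 0.05,
                    corr_limit: float = 0.95) -> pd.DataFrame:
    """Drop near-constant features, then one of each highly correlated pair."""
    vt = VarianceThreshold(threshold=variance_threshold)
    vt.fit(X)
    X = X.loc[:, vt.get_support()]

    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_limit).any()]
    return X.drop(columns=to_drop)
```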
Feature engineering involved combining existing features to create new, more meaningful ones. For instance, blood test results were transformed into aggregated metrics such as the median, mean, or differences from baseline values (`albumin_median`, `albumin_diff`, etc.). These new features helped capture patient trends over time, which are vital for predicting ICU admission.
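As a sketch, if the raw data contained repeated albumin measurements across time windows, the aggregates named above could be derived like this (the `albumin_w*` column names are hypothetical):

```python
import pandas as pd

# Hypothetical repeated measurements of albumin across three time windows.
df = pd.DataFrame({"albumin_w1": [3.1, 2.8], "albumin_w2": [3.4, 2.5],
                   "albumin_w3": [3.3, 2.2]})
albumin_cols = ["albumin_w1", "albumin_w2", "albumin_w3"]

df["albumin_median"] = df[albumin_cols].median(axis=1)
df["albumin_mean"] = df[albumin_cols].mean(axis=1)
df["albumin_diff"] = df["albumin_w3"] - df["albumin_w1"]  # change from baseline
```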
Categorical features, such as `age`, `gender`, `hypertension`, and other conditions, were mapped to numerical values using custom-defined mappings (as seen in the `_mapping()` function). This included binary mappings (e.g., for age greater than 65), allowing these features to be used effectively by machine learning algorithms. This transformation ensured that all categorical features were appropriately encoded for model training.
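A simplified illustration of the kind of mappings `_mapping()` applies (the category labels and numeric codes here are assumptions):

```python
import pandas as pd

# Illustrative mappings only; the real _mapping() defines its own codes.
AGE_MAP = {"above_65": 1, "below_65": 0}   # binary: older than 65?
GENDER_MAP = {"male": 0, "female": 1}
YES_NO_MAP = {"yes": 1, "no": 0}

def apply_mappings(df: pd.DataFrame) -> pd.DataFrame:
    df["age"] = df["age"].map(AGE_MAP)
    df["gender"] = df["gender"].map(GENDER_MAP)
    df["hypertension"] = df["hypertension"].map(YES_NO_MAP)
    return df
```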
The model selection process for this project involved an extensive grid search to find the best preprocessing parameters and machine learning model configurations. We employed a systematic approach that tested various combinations of preprocessing and model hyperparameters to optimize performance.
The preprocessing stage was tuned using the following parameters:
- Missing Value Percentage (`missing_value_per`): This parameter controls the threshold for removing columns based on missing values. Thresholds ranging from 10% to 50% were tested to find the optimal balance between data retention and quality.
- Variance Threshold (`variance_threshold`): This threshold was used to remove features with low variance, as they may not provide meaningful information for the model. The grid tested thresholds from 0.0 to 0.2.
- Minimum Null Percentage (`min_null_per`): Set at a constant value of 0.5, this parameter ensured that rows with excessive missing data were removed, while rows below this threshold were retained to preserve valuable data for model training.
For each combination of these preprocessing parameters, the training data was split into training and test sets, with an 80/20 split ratio and shuffling enabled to ensure randomization.
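In scikit-learn terms the split corresponds to something like the following (the `random_state` is an assumption; dummy data makes the snippet runnable):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Dummy data in place of the preprocessed clinical features and ICU labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42  # 80/20 split, shuffled
)
```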
Preprocessing Pipeline:
The `Preprocess` class was applied to the training data to handle missing values, reduce variance, and remove sparse rows. After these transformations, the `MissingValue` class was used to fill any remaining missing values. The imputation method was chosen by simulating missing data, filling it using various methods, and selecting the method with the lowest mean absolute error (MAE).
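A sketch of that selection logic for a single numeric column (the candidate methods and the 10% mask fraction are illustrative):

```python
import pandas as pd

def best_imputation(series: pd.Series, mask_frac: float = 0.1, seed: int = 0) -> str:
    """Hide a fraction of the known values, impute them with each candidate
    method, and return the method with the lowest mean absolute error."""
    known = series.dropna()
    hidden = known.sample(frac=mask_frac, random_state=seed)
    visible = known.drop(hidden.index)

    candidates = {
        "mean": visible.mean(),
        "median": visible.median(),
        "mode": visible.mode().iloc[0],
    }
    maes = {name: (hidden - fill).abs().mean() for name, fill in candidates.items()}
    return min(maes, key=maes.get)
```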
Mapping and Consistency:
A key step in preprocessing was ensuring that the same transformations were applied to both the training and test datasets. This was done using the `_mapping()` function to maintain consistency in feature mappings, such as categorical encodings, and to ensure that all test data columns matched the training data.
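Column alignment can be achieved with a reindex (a sketch; the project may handle this inside its mapping step):

```python
import pandas as pd

X_train = pd.DataFrame({"age": [1, 0], "gender": [0, 1], "hypertension": [1, 1]})
X_test = pd.DataFrame({"gender": [1], "age": [0], "extra": [5]})

# Keep exactly the training columns, in the same order; columns missing from
# the test data appear as NaN and are handled by the imputation step.
X_test = X_test.reindex(columns=X_train.columns)
```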
Once the data was preprocessed, different machine learning models were evaluated using grid search to find the best model configuration. The models considered for tuning included Random Forest and XGBoost, as they performed better than SVM and Logistic Regression in initial experiments.
Random Forest Hyperparameters:
- Number of Trees (`n_estimators`): Values of 150 and 250 were tested to determine the optimal number of trees in the forest.
- Maximum Depth (`max_depth`): Tested at 10 and 20, this parameter controls the maximum depth of the trees to prevent overfitting while capturing sufficient complexity.
- Minimum Samples Split (`min_samples_split`): Set at 2, 5, and 10 to evaluate the minimum number of samples required to split a node.
- Minimum Samples Leaf (`min_samples_leaf`): Values of 1 and 2 were tested to determine the smallest number of samples allowed in a leaf node.
- Bootstrap (`bootstrap`): Set to `True` to enable bootstrapping when building the trees, reducing variance.
XGBoost Hyperparameters:
- Number of Estimators (`n_estimators`): Tested at 100 and 200 to find the optimal number of boosting iterations.
- Maximum Depth (`max_depth`): Set at 3 and 6 to control the depth of the trees, balancing model complexity and generalization.
- Learning Rate (`learning_rate`): Values of 0.1 and 0.2 were tested to control the step size at each boosting iteration.

For each combination of preprocessing parameters and model hyperparameters, the training and testing datasets were prepared. The `MLModelSelector` class was used to train the models and evaluate their performance on the test set. Each model was trained using the grid search process, where every combination of hyperparameters was tested.
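Expressed as grid dictionaries, the search space above looks like this (the variable names are illustrative):

```python
# Hyperparameter grids as described above; variable names are illustrative.
rf_param_grid = {
    "n_estimators": [150, 250],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True],
}

xgb_param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 6],
    "learning_rate": [0.1, 0.2],
}
```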
For every combination of preprocessing and model parameters (a condensed sketch follows this list):
- Training: The model was trained on the preprocessed training data (`X_train` and `y_train`).
- Evaluation: The model was tested on the preprocessed test data (`X_test` and `y_test`), and performance was measured using evaluation metrics such as accuracy and F1-score.
- Best Score Selection: The best-performing combination of preprocessing parameters and model hyperparameters was selected based on the highest score obtained during this process.
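The following is a simplified, runnable stand-in for that outer loop, using scikit-learn's `GridSearchCV` with dummy data in place of the project's `Preprocess` and `MLModelSelector` classes (whose exact interfaces are not shown here):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Dummy data standing in for each preprocessed variant of the dataset.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

rf_param_grid = {"n_estimators": [150, 250], "max_depth": [10, 20]}  # reduced grid

best = {"score": -1.0}
for mv_per, var_th in product([0.1, 0.3, 0.5], [0.0, 0.05, 0.2]):
    pre_params = {"missing_value_per": mv_per,
                  "variance_threshold": var_th,
                  "min_null_per": 0.5}
    # In the project, the data would be re-preprocessed with pre_params here.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          rf_param_grid, scoring="f1")
    search.fit(X_train, y_train)
    score = search.score(X_test, y_test)  # F1 on the held-out split

    if score > best["score"]:
        best = {"score": score, "pre": pre_params, "model": search.best_params_}

print("Best Preprocess Params:", best["pre"])
print("Best Model Params:", best["model"])
print("Best Overall Score:", best["score"])
```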
The best overall model was selected by comparing the performance scores of each grid search iteration. The combination of preprocessing parameters and model hyperparameters that resulted in the highest score was recorded as the optimal solution. The chosen model was either Random Forest or XGBoost, depending on the performance during the search. Ultimately, this approach ensures that the model is not only well-tuned but also capable of handling various data complexities, including missing values and imbalanced features.
After running the entire process, the best preprocessing parameters, model parameters, and overall score were printed for final evaluation.
After running the model with various parameters, including preprocessing options, the best model parameters were selected based on their F1 score performance. These optimal parameters were then saved. By evaluating both the model and preprocessing parameters, the following results were achieved:
```
1   0.7454545454545455
2   0.7454545454545455
3   0.7484030554078361
4   0.7546958304853042
...
20  0.8853833897195243
21  0.8853833897195243
22  0.8853833897195243
23  0.8853833897195243
24  0.8853833897195243
25  0.8853833897195243

Best Preprocess Params: {'missing_value_per': 0.4, 'variance_threshold': 0.05, 'min_null_per': 0.5}
Best Model Params: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 250}
Best Overall Score: 0.8853833897195243
```
The detailed results were also logged to MLflow.
Initially, removing rows with significant missing values seemed like a viable option. However, given that the test data may include many rows with missing values, the better approach for this project was to fill the missing values instead. This ensures that the model is more robust and capable of handling unseen test data with missing entries, ultimately improving generalization.
Several feature engineering techniques were considered. While PCA is often useful for dimensionality reduction, it was not ideal for this project due to the large number of features that would be discarded during feature selection. KMeans clustering, on the other hand, was a better option, as it allowed us to add a new feature based on the remaining important features after feature reduction, improving model performance.
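A sketch of how such a cluster feature could be added after feature reduction (the number of clusters is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Dummy matrix standing in for the reduced set of important features.
X, _ = make_classification(n_samples=200, n_features=8, random_state=0)

# Fit KMeans on the retained features and append the cluster id as a new feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # n_clusters assumed
cluster_ids = kmeans.fit_predict(X)
X_aug = np.column_stack([X, cluster_ids])
```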
When selecting models for grid search and hyperparameter tuning, various algorithms were tested. Initial results showed that Random Forest and XGBoost outperformed SVM and Logistic Regression in most parameter sets. As a result, we focused on optimizing Random Forest and XGBoost, which demonstrated the best potential for this project.
An analysis of the dataset's plots highlighted two or three features as particularly strong for classification. However, when testing with different random splits of the data, the results were inconsistent, with some splits yielding lower performance. This indicates that while these features are important, their usefulness is sensitive to the data shuffling process and may need further investigation.
For features with unknown values, especially those where the majority of the data falls into a single category, we assigned the majority value to the missing entries. This approach helped maintain consistency in the dataset and ensured the model received representative data for those features.
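In pandas terms this is mode imputation, e.g.:

```python
import pandas as pd

# Toy example: fill unknown entries with the majority (most frequent) category.
s = pd.Series(["no", "no", "yes", None, "no", None])
s = s.fillna(s.mode().iloc[0])  # unknowns become "no"
```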
Initially, missing values were imputed across the entire dataset. However, after further analysis, we found that some features had a dominant value. To improve the accuracy of the model, we imputed missing values for each feature individually. By treating some known values as missing and testing various imputation methods, we calculated the mean absolute error (MAE) for each method. The technique with the lowest MAE was then applied to that feature, ensuring the most accurate imputation possible.