Card

Smoker Prediction

This project focuses on building a Smoker Prediction model using classical machine learning techniques. The system predicts if the person is smoker or not based on biological signals.

Table of Contents

  • Overview
  • What's New
  • Features
  • Project Structure
  • Data
  • Modeling
  • Installation
  • Usage
  • Future Work
  • Contributing
  • License

Overview

The Modes utilizes both classical and advanced machine learning models to Predict if the person is smoker or not based on some biological signals. We developed Over 250 combination or different data Preprocessing and modeling using Automated Piplines . The dataset I used for training and evaluation was Smoker Status Prediction using Bio-Signals on kaggle .
dataset link : https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals

What's New

In this project, we started with a brute force combination technique, testing over 250+ unique combinations of data preprocessing and modeling with default parameters. Each combination was saved with evaluation metrics such as ROC, Accuracy, F1 score, and F-beta score.

Additional Techniques Used:

  • Feature Selection Filter Methods were applied to reduce model complexity:
  • Using Subsamples
  • Correlation, ANOVA, VIF analysis
  • Overlap analysis
  • Outlier removal
  • We experimented with using only Boosting methods without any scaling, transformation, or sampling.
  • We applied Optuna to get the best parameters for each model.

For the most promising combinations, we further optimized them using Grid Search to fine-tune hyperparameters and improve performance.

We also visualized the generated dataset to compare the performance of different methods at each step of preprocessing and modeling to find the optimal combination, not just the best method for each individual step.

Features

  • Predicts person Smoking status.

Project Structure

Data

The dataset I used for training and evaluation was Smoker Status Prediction using Bio-Signals on kaggle .
dataset link : https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals

Data Preprocessing:

  • Features Engineering : from the domain knowledge I get some equations for Features Crossing .
  • Anomaly Detection : checke the anomaly and drop it
  • Sampling , Transformation , Scaling , Dimentionality Reduction : I Used several " over 250 " Piplines to Preprocess the data "look at Smokers_Prediction.ipynb"

Modeling

Models tested include:

  • Logistic Regression
  • RandomForestClassifier
  • DecisionTreeClassifier
  • SVC (linear, poly, RBF)
  • KNeighborsClassifier
  • XGBClassifier
  • GradientBoostingClassifier
  • CatBoostingClassifier
  • AdaBoostingClassifier
  • HistGradientBoostingClassifier

Installation

  • Clone this repository:
git clone https://github.com/ahmedomer13218/Smokers-Predictions-ML-project-

cd Smokers-Predictions-ML-project-

Usage

Future Work

  • Streamlit app
  • Deep learning techniques

Contributing

We welcome contributions! If you'd like to contribute, please fork the repository and make your changes. After testing, submit a pull request for review.

License

This project is licensed under the MIT License - see the LICENSE file for details.