--- a +++ b/README.md @@ -0,0 +1,154 @@ +# Lung Cancer Detection - v1.1 + +The machine learning project pipeline for lung cancer analysis and prediction at a low cost, to assist individuals in understanding their risk of lung cancer. It also supports decision making, health awareness, based on their lifestyle habits. + +## Project Directory Structure + +``` +lung-cancer-detection/ # Root folder. +├── api/ # Deploying model using flask for production. +├── data/ # Different set of dataset. +| ├── input/ # Holdout set (training, testing). +| ├── processed/ # cleaned set (original, synthetic). +| ├── raw/ # un-processed set (original, synthetic). +├── figures/ # Visualization charts. +| ├── eda/ # Exploratory analysis chart images. +| | ├── original/ # Chart images for original part. +| | ├── synthetic/ # Chart images for synthetic part. +| ├── model/ # Model evaluation chart images. +├── models/ # Saved trained model. +├── notebooks/ # Experimentation and analysis notebooks. +| ├── data/ # Notebooks for processing and preparations set. +| ├── eda/ # Exploratory analysis notebooks (original, synthetic). +| ├── model/ # Ml notebooks experimentation +| ├── evaluation/ # Notebook for training, validation and testing. +| ├── inference/ # Notebook for making prediction. +├── scripts/ # Automated python scripts. +| ├── data/ # Scripts for processing and preparations set. +| ├── model/ # Scripts for model training, testing & inference. +├── tests/ # Unit testing scripts (integration, functional). +├── .gitignore # Tells Git which files to ignore when committing your project. +├── LICENSE # Author license. +├── README.md # Project documentations for developers. +├── requirements.txt # Project installation dependencies. +``` + +## Model Pipeline Workflow + +``` +1. **Processing** - remove missing or duplicated data, feature engineering. +2. **Preparation** - feature selection, remove duplicated data, holdout split (train/test set). +3. **Training + cross val** - training + validation (training set), model selection. +4. **Testing** - model testing (test set). +5. **Inference** - make prediction for new data. +``` + +## Model Performance + + **Metrics** + + ``` + 1. **Accuracy** - 93% + 2. **Precision** - 95% + 3. **Recall** - 91% + 4. **F1** - 93% + ``` + + **Matrix** + + ``` + TP: 43 - TN: 40 - FP: 2 - FN: 4 + ``` + + **AUC** + + ``` + AUC - 0.97 + ``` + + **Class Report** + + ``` + Class 0: Precision - 91%, Recall - 95%, F1 - 93% | Total - 42 + Class 1: Precision - 96%, Recall - 91%, F1 - 93% | Total - 47 + ``` + +The model used was gradient boosting (GB). + +## Getting Started +Install this project on your local machine and here are following steps. + +### Installation + + **Clone the Repository** + + ``` + $ git clone https://github.com/nordszamora/lung-cancer-detection.git + + $ cd lung-cancer-detection/ + + $ pip install -r requirements.txt + ``` + +### Automated Scripts + 1. **Run data scripts** + + ``` + $ cd scripts/ + + $ cd data/ + + $ python processing.py + + $ python preparation.py + ``` + + 2. **Run model scripts** + + ``` + $ cd scripts/ + + $ cd model/ + + $ python training_validation.py + + $ python testing.py + + $ python inference.py + ``` + +### Serving Model + + 1. **Run flask api** + + ``` + $ cd api/ + + $ python app.py + ``` + + 2. **Test api endpoint** + + ``` + curl -X POST http://localhost:5000/api/v1/predict -H "Content-Type: application/json" -d '{"gender": 1, "age": 43, "smoking": 2, "yellow_skin": 2, "fatigue": 2, "wheezing": 2, "coughing": 2, "shortness_of_breath": 2, "swallowing_difficulty": 2, "chest_pain": 2, "chronic_disease": 1}' + ``` + +### Unit Testing + + **Run pytest** + + ``` + $ cd tests/ + + $ pytest + ``` + +#### Data source: +See: ([kaggle](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer)) + +#### Note: +I used a SMOTE to generate a synthetic value due to poorly imbalance dataset. + +## License + +This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.