# Lung Cancer Detection - v1.1

A low-cost machine learning pipeline for lung cancer analysis and prediction, built to help individuals understand their risk of lung cancer. It also supports decision making and health awareness based on lifestyle habits.

## Project Directory Structure

```
lung-cancer-detection/               # Root folder.
├── api/                               # Flask app that serves the model in production.
├── data/                              # Datasets.
|   ├── input/                           # Holdout sets (training, testing).
|   ├── processed/                       # Cleaned sets (original, synthetic).
|   ├── raw/                             # Unprocessed sets (original, synthetic).
├── figures/                           # Visualization charts.
|   ├── eda/                             # Exploratory analysis chart images.
|   |   ├── original/                      # Chart images for the original data.
|   |   ├── synthetic/                     # Chart images for the synthetic data.
|   ├── model/                           # Model evaluation chart images.
├── models/                            # Saved trained models.
├── notebooks/                         # Experimentation and analysis notebooks.
|   ├── data/                            # Notebooks for data processing and preparation.
|   ├── eda/                             # Exploratory analysis notebooks (original, synthetic).
|   ├── model/                           # ML experimentation notebooks.
|   |   ├── evaluation/                    # Notebooks for training, validation and testing.
|   |   ├── inference/                     # Notebooks for making predictions.
├── scripts/                           # Automated Python scripts.
|   ├── data/                            # Scripts for data processing and preparation.
|   ├── model/                           # Scripts for model training, testing & inference.
├── tests/                             # Test scripts (unit, integration, functional).
├── .gitignore                         # Tells Git which files to ignore when committing.
├── LICENSE                            # Project license.
├── README.md                          # Project documentation for developers.
├── requirements.txt                   # Project installation dependencies.
```

## Model Pipeline Workflow

1. **Processing** - remove missing or duplicated data, feature engineering.
2. **Preparation** - feature selection, remove duplicated data, holdout split (train/test sets).
3. **Training + cross-validation** - training and validation on the training set, model selection.
4. **Testing** - evaluate the selected model on the test set.
5. **Inference** - make predictions on new data.

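The first three steps above can be sketched in plain Python. This is only an illustration of the ideas (deduplication, a holdout split, and k-fold indices over the training set); the function names and the sample rows are hypothetical and are not taken from the project's scripts.

```python
import random


def drop_missing_and_duplicates(rows):
    """Step 1: drop rows with missing values, then exact duplicates (order kept)."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None for value in row):
            continue  # row has a missing value
        if row in seen:
            continue  # exact duplicate of an earlier row
        seen.add(row)
        cleaned.append(row)
    return cleaned


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Step 2: shuffle once and hold out a test set that training never sees."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k=5):
    """Step 3: yield (train_idx, val_idx) pairs over the training set only."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical rows: (age, smoking, label). One duplicate, one missing value.
rows = [(43, 2, 1), (43, 2, 1), (51, 1, 0), (60, None, 1),
        (38, 1, 0), (67, 2, 1), (55, 2, 1)]
clean = drop_missing_and_duplicates(rows)
train, test = holdout_split(clean)
folds = list(k_fold_indices(len(train), k=2))
```

The test set is split off before any cross-validation, so model selection in step 3 never touches the rows used for the final evaluation in step 4.
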
## Model Performance

  **Metrics**

  1. **Accuracy** - 93%
  2. **Precision** - 95%
  3. **Recall** - 91%
  4. **F1** - 93%

  **Confusion Matrix**

  ```
  TP: 43 - TN: 40 - FP: 2 - FN: 4
  ```

  **AUC**

  ```
  AUC - 0.97
  ```

  **Class Report**

  ```
  Class 0: Precision - 91%, Recall - 95%, F1 - 93% | Total - 42
  Class 1: Precision - 96%, Recall - 91%, F1 - 93% | Total - 47
  ```

The model used is gradient boosting (GB).

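The reported figures can be cross-checked from the confusion matrix counts alone. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the helper name `classification_metrics` is hypothetical and is not part of the project. With these counts, precision for class 1 works out to roughly 0.956, which appears as 95% in the metrics list and as 96% in the class report.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the confusion matrix, with class 1 as the positive class.
class_1 = classification_metrics(tp=43, tn=40, fp=2, fn=4)

# Swapping the roles of the two classes gives the class 0 row of the report.
class_0 = classification_metrics(tp=40, tn=43, fp=4, fn=2)

for name, value in class_1.items():
    print(f"{name}: {value:.4f}")
```
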
## Getting Started

Follow the steps below to install and run this project on your local machine.

### Installation

   **Clone the repository and install dependencies**

   ```
   $ git clone https://github.com/nordszamora/lung-cancer-detection.git
   $ cd lung-cancer-detection/
   $ pip install -r requirements.txt
   ```

### Automated Scripts
94
   1. **Run data scripts**
95
96
   ```
97
   $ cd scripts/
98
99
   $ cd data/
100
101
   $ python processing.py
102
   
103
   $ python preparation.py
104
   ```
105
106
   2. **Run model scripts** (from the project root)

   ```
   $ cd scripts/model/
   $ python training_validation.py
   $ python testing.py
   $ python inference.py
   ```

### Serving Model

   1. **Run the Flask API**

   ```
   $ cd api/
   $ python app.py
   ```

   2. **Test the API endpoint**

   ```
   curl -X POST http://localhost:5000/api/v1/predict \
     -H "Content-Type: application/json" \
     -d '{"gender": 1, "age": 43, "smoking": 2, "yellow_skin": 2, "fatigue": 2, "wheezing": 2, "coughing": 2, "shortness_of_breath": 2, "swallowing_difficulty": 2, "chest_pain": 2, "chronic_disease": 1}'
   ```

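The same request can be sent from Python using only the standard library. This is a minimal sketch: the URL and field names are copied from the curl example, while the helper names `build_request` and `predict` are hypothetical, and the assumption that the endpoint replies with JSON is not confirmed by this README.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
API_URL = "http://localhost:5000/api/v1/predict"

# Same payload as the curl example.
sample = {
    "gender": 1, "age": 43, "smoking": 2, "yellow_skin": 2, "fatigue": 2,
    "wheezing": 2, "coughing": 2, "shortness_of_breath": 2,
    "swallowing_difficulty": 2, "chest_pain": 2, "chronic_disease": 1,
}


def build_request(features, url=API_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, url=API_URL, timeout=10):
    """Send the request and decode the reply (assumed here to be JSON)."""
    with urllib.request.urlopen(build_request(features, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Flask server running locally, calling `predict(sample)` sends the request; the shape of the returned object depends on the API and is not documented here.
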
### Unit Testing

   **Run pytest**

   ```
   $ cd tests/
   $ pytest
   ```

#### Data source:
See: [Kaggle](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer)

#### Note:
I used SMOTE to generate synthetic samples because the dataset is highly imbalanced.

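The core idea behind SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea with the standard library only; this README does not say which SMOTE implementation the project uses, and the function name `smote_sketch` and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, so oversampling should be applied to the training data only, never to the held-out test set.
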
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.