<p align="center">
  <center><h1> LUNG CANCER PREDICTION INSIDE CONTAINERS 🐋 </h1></center>
</p>

<p align="center">
  <img width="900" height="400" src="https://miro.medium.com/max/3840/1*6ORJX1A5NYom1ClGa7xwjQ.jpeg">
</p>

Hello guys! Back with another article. In this article I will go through how we can train machine learning models inside containers. For the container technology I will be using Docker.

# Why Docker 🐋🐋 ??

Because Docker containers encapsulate everything an application needs to run (and only those things), they allow applications to be shuttled easily between environments. Any host with the Docker runtime installed — be it a developer’s laptop or a public cloud instance — can run a Docker container.

In this article I will be using a lung cancer dataset. This is a `classification problem`: given the inputs, the model will predict whether a particular person is affected by lung cancer or not.

<p align="center">
  <img width="900" height="150" src="https://miro.medium.com/max/1094/1*ig1fQCpMMyKqA2-1prrKFw.jpeg">
</p>

```
docker run -it --name lung_cancer_os centos:7
```

So, I have launched Docker on AWS inside an EC2 instance and started a container from the `centos:7` image. You have to install Python 3 inside the container. Here the container name is `lung_cancer_os`.

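If you are starting from a fresh EC2 instance, Docker itself has to be installed and started before the command above will work. A minimal sketch, assuming a yum-based image such as Amazon Linux (package and service names may differ on your distribution):

```
# install and start the Docker engine (yum-based distro assumed)
yum install docker -y
systemctl start docker

# then launch the CentOS 7 container with an interactive terminal
docker run -it --name lung_cancer_os centos:7
```
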
<p align="center">
  <img width="900" height="100" src="https://miro.medium.com/max/1094/1*QBZIQWmDSvV3XSA4Kbtcew.jpeg">
</p>

```
yum install python3 -y
```

Now you have to create a `requirement.txt` file as shown below. In this file you list all the libraries you want to install, so that they can be installed in one go; this saves time and makes the installation repeatable.

<p align="center">
  <img width="900" height="150" src="https://miro.medium.com/max/1094/1*hg17URIYWEU6KSxJY90L3Q.jpeg">
</p>

```
pandas
numpy
scikit-learn
joblib
```

Now you have to install all the libraries listed in the file.

<p align="center">
  <img width="900" height="150" src="https://miro.medium.com/max/1094/1*dDUyvb_tIIx4XYWmdnCtTQ.jpeg">
</p>

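The screenshot above shows the installation; the command itself is a one-liner (assuming the file is named `requirement.txt`, as created above):

```
pip3 install -r requirement.txt
```
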
After installing, you can check the list of all the installed libraries.

```
pip3 list
```

<p align="center">
  <img width="900" height="250" src="https://miro.medium.com/max/1094/1*QLGlwYe2p-5mDo96_RucyA.jpeg">
</p>

Now we have to read the lung cancer data and make predictions with different classification algorithms.

# Importing Required Libraries:

```
import numpy as np   # importing numpy for numerical calculations
import pandas as pd  # importing pandas for creating data frames
from sklearn.model_selection import train_test_split     # train_test_split for splitting the data into training and testing
from sklearn.linear_model import LogisticRegression      # for logistic regression
from sklearn.ensemble import RandomForestClassifier      # for random forest classifier
from sklearn.ensemble import GradientBoostingClassifier  # for gradient boosting classifier
from sklearn.metrics import accuracy_score               # importing metrics for measuring accuracy
from sklearn.metrics import mean_squared_error           # for calculating mean squared errors
```

# Reading Lung Cancer Data:
Now we will read the lung cancer data using the `pandas.read_csv()` function.

```
df = pd.read_csv("lung_cancer_examples.csv")  # reading csv data
print(df.sample())                            # printing a random sample row
```

<p align="center">
  <img width="900" height="150" src="https://miro.medium.com/max/1094/1*Xp5PjuNdUTEIFOfZqGm8mQ.jpeg">
</p>

In the above output the columns are not neatly aligned; this is because I have used print(df) inside Docker rather than displaying the frame in a notebook.
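If you want the console output to line up better, pandas has display options for this; a minimal sketch (the chosen width is an arbitrary assumption):

```
import pandas as pd

pd.set_option('display.width', 120)          # allow wider console lines
pd.set_option('display.max_columns', None)   # never truncate columns
print(df.head())                             # first five rows, nicely aligned
```
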
Now we have to select the features and divide them into training and testing data.

```
X = df.drop(['Name', 'Surname', 'Result'], axis=1)  # dropping the non-feature columns
y = df.iloc[:, -1]                                  # the last column (Result) is the target
```

As there are only a few features, and anyone can see that `Name` and `Surname` are not useful for predicting `lung cancer`, while `Result` is the dependent variable rather than an independent feature, I removed all three; the rest are kept as features inside the `X` variable. The `y` (dependent) variable will contain only the Result column.
Now we have to divide the data into a training and a testing part, so here we will use the `train_test_split` function from the sklearn library. Here I am taking the `training data as 80%` and the `testing data as 20%`.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
```
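
A quick sanity check is to print the shapes of the pieces; the first dimensions of the train and test sets should be roughly 80% and 20% of the rows (the exact counts depend on your copy of the CSV):

```
print(X_train.shape, X_test.shape)  # e.g. (47, 4) and (12, 4) for a 59-row file with four feature columns
print(y_train.shape, y_test.shape)  # matching numbers of labels
```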

<p align="center">
  <img width="900" height="200" src="https://miro.medium.com/max/1094/1*A-9Ui59K21OR_9fqy73Qvg.gif">
</p>

Now we are ready to train the different classification models. We will take *logistic regression* first, then the *gradient boosting classifier*, followed by the *random forest classifier*.

# Logistic Regression:
LogisticRegression is meant for classification problems, and since we have a classification dataset we can apply it directly. LogisticRegression uses the sigmoid function to squash the model's output into a probability between 0 and 1. To use the LogisticRegression() function we have to import it from the linear_model module of the sklearn library.

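For reference, the sigmoid mentioned above is a one-liner; a small illustrative sketch (the probe values are arbitrary):

```
import numpy as np

def sigmoid(z):
    # squashes any real-valued score into the (0, 1) range,
    # which logistic regression reads as a probability
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5   -> right on the decision boundary
print(sigmoid(4))   # ~0.98 -> confidently the positive class
```
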
```
class logistic_regression:    # creating a logistic regression class
    def logistic(self, X_train, y_train):    # creates a model, trains it and prints its accuracy
        # Here we create a logistic regression model; to do that we fit the training data (X_train, y_train) into the model_lr object
        from sklearn.linear_model import LogisticRegression
        model_lr = LogisticRegression()    # creating a logistic model
        model_lr.fit(X_train, y_train)
        # here we predict with the Logistic Regression model on X_test [testing data]
        self.y_pred_lr = model_lr.predict(X_test)
        # print("Mean square error for logistic regression model:", mean_squared_error(y_test, self.y_pred_lr))  # would give the mean squared error of the model
        # accuracy_score takes y_test (actual values) and y_pred_lr (predicted values) and gives the accuracy of the model
        print("Logistic Regression model Accuracy               :", accuracy_score(y_test, self.y_pred_lr)*100, "%")
    def mean_absolute_error(self):
        # note: this squares the error terms, so it is really the mean squared error
        print("Mean absolute error of Logistic Regression       :", np.square(y_test - self.y_pred_lr).mean())
    def variance_bias(self):
        Variance = np.var(self.y_pred_lr)     # calculating the variance of the predicted output
        print("Variance of LogisticRegression model is          :", Variance)
        SSE = np.mean((np.mean(self.y_pred_lr) - y_test)**2)  # calculating the squared error of the mean prediction
        Bias = SSE - Variance                 # calculating Bias as the difference between SSE and Variance
        print("Bias of LogisticRegression model is              :", Bias)
```

Inside the logistic_regression class I have created three functions, i.e., logistic(), mean_absolute_error() and variance_bias().
The logistic() function trains the model using LogisticRegression() and prints the model accuracy. It accepts the parameters X_train and y_train (training data). Inside this function, model_lr = LogisticRegression() creates a model and saves it into the model_lr variable; model_lr.fit(X_train, y_train) then trains the model, and after training we use the testing data, i.e., X_test, for prediction and check the accuracy of the model with the accuracy_score() function.
The mean_absolute_error() function returns the error made during prediction, using the formula np.square(y_test - y_pred_lr).mean(): it squares the differences between y_test (actual values) and y_pred_lr (predicted values) and then takes the mean of the squared values. Here error = y_test - y_pred_lr (note that because of the squaring, this is really the mean squared error).
The variance_bias() function helps to obtain the bias and variance of the model. The variance is obtained with numpy's var() function applied to the predicted values, i.e., y_pred_lr. To obtain the bias we use the variance: first we calculate the squared error with SSE = np.mean((np.mean(y_pred_lr) - y_test)**2), which takes the mean of y_pred_lr, computes the error mean(y_pred_lr) - y_test, squares it and averages. The variance is then subtracted from SSE to give the bias. A smaller variance means less variety in the predictions. The goal is to balance bias and variance so that the model neither overfits nor underfits; if both variance and bias grow large, model accuracy suffers. Every machine learning algorithm will give different values of bias and variance.
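
To see the bias/variance arithmetic in isolation, here is a tiny worked example with made-up labels (the numbers are purely illustrative):

```
import numpy as np

y_test_demo = np.array([1, 0, 1, 1])   # hypothetical actual labels
y_pred_demo = np.array([1, 0, 0, 1])   # hypothetical predicted labels

variance = np.var(y_pred_demo)                             # 0.25: spread of the predictions
sse = np.mean((np.mean(y_pred_demo) - y_test_demo) ** 2)   # 0.25: squared error of the mean prediction
bias = sse - variance                                      # 0.0: bias as defined in the class above
print(variance, sse, bias)
```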

# GradientBoostingClassifier:

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.

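The stage-wise behaviour is controlled mainly by two knobs; a minimal sketch (the values shown are sklearn's defaults, not tuned for this dataset):

```
from sklearn.ensemble import GradientBoostingClassifier

# n_estimators is the number of boosting stages (trees added one by one),
# learning_rate shrinks the contribution of each new tree
model_gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
```
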
```
class gradient_boosting:    # creating a gradient boosting class
    def gb(self, X_train, y_train):    # creates a model, trains it and prints its accuracy
        # Here we create a gradient boosting model; to do that we fit the training data (X_train, y_train) into the model_gbc object
        from sklearn.ensemble import GradientBoostingClassifier
        model_gbc = GradientBoostingClassifier()    # creating a gradient boosting model
        model_gbc.fit(X_train, y_train)
        # here we predict with the Gradient Boosting model on X_test [testing data]
        self.y_pred_gbc = model_gbc.predict(X_test)
        # accuracy_score takes y_test (actual values) and y_pred_gbc (predicted values) and gives the accuracy of the model
        print("Gradient Boosting model Accuracy                 :", accuracy_score(y_test, self.y_pred_gbc)*100, "%")
    def mean_absolute_error(self):
        # note: this squares the error terms, so it is really the mean squared error
        print("Mean absolute error of Gradient Boosting         :", np.square(y_test - self.y_pred_gbc).mean())
    def variance_bias(self):
        Variance = np.var(self.y_pred_gbc)     # calculating the variance of the predicted output
        print("Variance of GradientBoostingClassifier model is  :", Variance)
        SSE = np.mean((np.mean(self.y_pred_gbc) - y_test)**2)  # calculating the squared error of the mean prediction
        Bias = SSE - Variance                  # calculating Bias as the difference between SSE and Variance
        print("Bias of GradientBoostingClassifier model is      :", Bias)
```

Inside the gradient_boosting class I have created three functions, i.e., gb(), mean_absolute_error() and variance_bias().

The gb() function trains the model using GradientBoostingClassifier() and prints the model accuracy. It accepts the parameters X_train and y_train (training data). Inside this function, model_gbc = GradientBoostingClassifier() creates a model and saves it into the model_gbc variable; model_gbc.fit(X_train, y_train) then trains the model, and after training we use the testing data, i.e., X_test, for prediction and check the accuracy of the model with the accuracy_score() function.

The mean_absolute_error() function returns the error made during prediction, using the formula np.square(y_test - y_pred_gbc).mean(): it squares the differences between y_test (actual values) and y_pred_gbc (predicted values) and then takes the mean of the squared values. Here error = y_test - y_pred_gbc.

The variance_bias() function helps to obtain the bias and variance of the model. The variance is obtained with numpy's var() function applied to the predicted values, i.e., y_pred_gbc. To obtain the bias we use the variance: first we calculate the squared error with SSE = np.mean((np.mean(y_pred_gbc) - y_test)**2), which takes the mean of y_pred_gbc, computes the error mean(y_pred_gbc) - y_test, squares it and averages. The variance is then subtracted from SSE to give the bias. As before, the goal is to balance bias and variance so that the model neither overfits nor underfits.

# RandomForestClassifier:

RandomForestClassifier is meant for classification problems, and since we have a classification dataset we can apply it directly. RandomForestClassifier uses decision trees as building blocks: a group of trees forms the forest, and their combined votes give the prediction. Trees are helpful for taking decisions; up to a point, letting the trees grow deeper allows the forest to capture more detail. To use the RandomForestClassifier() function we have to import the ensemble module from the sklearn library. Ensemble means that several algorithms (here, trees) are assembled into a group, hence the module name. The ensemble module also contains other algorithms, e.g., AdaBoostClassifier and GradientBoostingClassifier.

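A minimal sketch of those knobs (illustrative values, not tuned for this dataset):

```
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees in the forest,
# max_depth caps how deep each tree may grow
model_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=10)
```
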
```
class random_forest_classifier:
    def random_forest(self, X_train, y_train):
        # Here we create a random forest model using RandomForestClassifier; to do that we fit the training data (X_train, y_train) into the model_rf object
        from sklearn.ensemble import RandomForestClassifier
        self.model_rf = RandomForestClassifier()
        self.model_rf.fit(X_train, y_train)
        # here we predict with the Random Forest Classifier model on X_test [testing data]
        self.y_pred_rf = self.model_rf.predict(X_test)
        print("Mean square error for random forest model :", mean_squared_error(y_test, self.y_pred_rf))  # gives the mean squared error of the model
        # accuracy_score takes y_test (actual values) and y_pred_rf (predicted values) and gives the accuracy of the model
        print("Random Forest model accuracy               :", accuracy_score(y_test, self.y_pred_rf)*100, "%")
    def mean_absolute_error(self):
        # note: this squares the error terms, so it is really the mean squared error
        print("Mean absolute error of Random Forest       :", np.square(y_test - self.y_pred_rf).mean())
    def variance_bias(self):
        Variance = np.var(self.y_pred_rf)     # calculating the variance of the predicted output
        print("Variance of RandomForest model is          :", Variance)
        SSE = np.mean((np.mean(self.y_pred_rf) - y_test)**2)  # calculating the squared error of the mean prediction
        Bias = SSE - Variance                 # calculating Bias as the difference between SSE and Variance
        print("Bias of RandomForest model is              :", Bias)
```

Inside the random_forest_classifier class I have created three functions, i.e., random_forest(), mean_absolute_error() and variance_bias().

The random_forest() function trains the model using RandomForestClassifier() and prints the model accuracy. It accepts the parameters X_train and y_train (training data). Inside this function, model_rf = RandomForestClassifier() creates a model and saves it into the model_rf variable; model_rf.fit(X_train, y_train) then trains the model, and after training we use the testing data, i.e., X_test, for prediction and check the accuracy of the model with the accuracy_score() function.

The mean_absolute_error() function returns the error made during prediction, using the formula np.square(y_test - y_pred_rf).mean(): it squares the differences between y_test (actual values) and y_pred_rf (predicted values) and then takes the mean of the squared values. Here error = y_test - y_pred_rf.

The variance_bias() function helps to obtain the bias and variance of the model. The variance is obtained with numpy's var() function applied to the predicted values, i.e., y_pred_rf. To obtain the bias we use the variance: first we calculate the squared error with SSE = np.mean((np.mean(y_pred_rf) - y_test)**2), which takes the mean of y_pred_rf, computes the error mean(y_pred_rf) - y_test, squares it and averages. The variance is then subtracted from SSE to give the bias. As before, the goal is to balance bias and variance so that the model neither overfits nor underfits.

# Training and Testing Models:

Now we have to train and test the logistic regression, gradient boosting classifier and random forest classifier models.

```
print("-------LUNG CANCER PREDICTION USING LOGISTIC REGRESSION--------")
# calling the class logistic_regression and creating an object
logistic = logistic_regression()
# calling the logistic function that accepts two parameters, i.e., X_train, y_train;
# it prints the accuracy of the logistic regression model
logistic.logistic(X_train, y_train)
logistic.mean_absolute_error()        # printing the mean absolute error
logistic.variance_bias()              # printing the variance and bias

print("-------LUNG CANCER PREDICTION USING GRADIENT BOOSTING CLASSIFIER--------")
# calling the class gradient_boosting and creating an object
gbc = gradient_boosting()
# calling the gb function that accepts two parameters, i.e., X_train, y_train;
# it prints the accuracy of the GradientBoostingClassifier model
gbc.gb(X_train, y_train)
gbc.mean_absolute_error()             # printing the mean absolute error
gbc.variance_bias()                   # printing the variance and bias

print("-------LUNG CANCER PREDICTION USING RANDOM FOREST CLASSIFIER--------")
# calling the class random_forest_classifier and creating an object
rf_classifier = random_forest_classifier()
# calling the random_forest function that prints the accuracy of the random forest model
rf_classifier.random_forest(X_train, y_train)
rf_classifier.mean_absolute_error()   # printing the mean absolute error
rf_classifier.variance_bias()         # printing the variance and bias
```

<p align="center">
  <img width="900" height="400" src="https://miro.medium.com/max/1094/1*1n_ec0E96HwHPKkfw9wuzQ.jpeg">
</p>

From the above results we can conclude that logistic regression performs best, followed by the random forest classifier and the gradient boosting classifier. The models reach such high (even 100%) scores mainly because the amount of data is small. Hope you like this article; comments will be appreciated. 😊😊

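One loose end: we installed joblib in `requirement.txt` but never used it. A minimal sketch of how the best model could be persisted inside the container (the file name is my own choice; self.model_rf comes from the random forest class above):

```
import joblib

# save the fitted random forest to disk so the container can reuse it without retraining
joblib.dump(rf_classifier.model_rf, "lung_cancer_model.pkl")

# later (or in another script): load it back and predict
model = joblib.load("lung_cancer_model.pkl")
print(model.predict(X_test))
```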