--- a
+++ b/outputs/data_analysis.txt
@@ -0,0 +1,95 @@
+------------ PRE-PROCESSED DATA ANALYSIS ------------ 
+ 
+We perform data analysis on each feature of the PLCO and NLST datasets.
+Number of participants: 
+  - PLCO: 55161
+  - NLST: 48595
+ 
+--- Feature analysis --- 
+
+Age: This feature captures the person’s age. 
+--------------  -----  ------  -----  ------
+Age             PLCO   PLCO %  NLST   NLST %
+<= 50           0      0.0     1      0.0
+50 < ... <= 60  27337  49.6    24861  51.2
+60 < ... <= 70  25120  45.5    20901  43.0
+> 70            2704   4.9     2832   5.8
+Missing         0      0.0     0      0.0
+--------------  -----  ------  -----  ------
+
+Smoking cessation age: This feature describes the age at which the person stopped smoking. 
+---------------------  -----  ------  -----  ------
+Smoking cessation age  PLCO   PLCO %  NLST   NLST %
+<= 30                  10470  19.0    2      0.0
+30 < ... <= 40         11886  21.5    130    0.3
+40 < ... <= 50         11447  20.8    7025   14.5
+50 < ... <= 60         8649   15.7    14071  29.0
+> 60                   1942   3.5     4378   9.0
+Missing                10767  19.5    22989  47.3
+---------------------  -----  ------  -----  ------
+
+Smoking status: This feature indicates whether the person was an active (current) or a former cigarette smoker at the beginning of the study.
+--------------  -----  ------  -----  ------
+Smoking status  PLCO   PLCO %  NLST   NLST %
+Active          9965   18.1    22842  47.0
+Former          45196  81.9    25753  53.0
+Missing         0      0.0     0      0.0
+--------------  -----  ------  -----  ------
+
+Pack-years: This feature refers to the number of packs smoked per day multiplied by the number of years during which the person smoked. 
+---------------  -----  ------  -----  ------
+Pack-years       PLCO   PLCO %  NLST   NLST %
+<= 25            26981  48.9    8      0.0
+25 < ... <= 50   16147  29.3    26746  55.0
+50 < ... <= 100  9448   17.1    19544  40.2
+> 100            1434   2.6     2297   4.7
+Missing          1151   2.1     0      0.0
+---------------  -----  ------  -----  ------
+
+Smoking onset age: This feature indicates the age at which the person started smoking. 
+-----------------  -----  ------  -----  ------
+Smoking onset age  PLCO   PLCO %  NLST   NLST %
+<= 15              10169  18.4    17927  36.9
+15 < ... <= 20     33760  61.2    25411  52.3
+> 20               10950  19.9    5256   10.8
+Missing            282    0.5     1      0.0
+-----------------  -----  ------  -----  ------
+
+Years smoked: This feature describes the total number of years during which the person smoked. 
+--------------  -----  ------  -----  ------
+Years smoked    PLCO   PLCO %  NLST   NLST %
+<= 10           8800   16.0    2      0.0
+10 < ... <= 20  11761  21.3    292    0.6
+20 < ... <= 30  11532  20.9    5134   10.6
+30 < ... <= 40  13037  23.6    21620  44.5
+> 40            8963   16.2    21547  44.3
+Missing         1068   1.9     0      0.0
+--------------  -----  ------  -----  ------
+
+Lung cancer family history: This feature indicates whether the person has a close family member (parent, sibling, or child) who had lung cancer.
+--------------------------  -----  ------  -----  ------
+Lung cancer family history  PLCO   PLCO %  NLST   NLST %
+No                          48415  87.8    37302  76.8
+Yes                         6323   11.5    10598  21.8
+Missing                     423    0.8     695    1.4
+--------------------------  -----  ------  -----  ------
+
+BMI: This feature describes the person’s body mass index. 
+------------------------------------  -----  ------  -----  ------
+Body Mass Index                       PLCO   PLCO %  NLST   NLST %
+Underweight (... <= 18.4)             295    0.5     347    0.7
+Healthy weight (18.5 <= ... <= 24.9)  17556  31.8    13404  27.6
+Overweight (25 <= ... <= 29.9)        23920  43.4    20894  43.0
+Obesity (... >= 30)                   12631  22.9    13696  28.2
+Missing                               759    1.4     234    0.5
+------------------------------------  -----  ------  -----  ------
+
+Lung cancer: This feature indicates whether the person was diagnosed with lung cancer.
+-----------  -----  ------  -----  ------
+Lung cancer  PLCO   PLCO %  NLST   NLST %
+Negative     52409  95.0    47084  96.9
+Positive     2752   5.0     1511   3.1
+Missing      0      0.0     0      0.0
+-----------  -----  ------  -----  ------
+
+