Diff of /outputs/data_analysis.txt [000000] .. [9ab7c1]

a b/outputs/data_analysis.txt
------------ PRE-PROCESSED DATA ANALYSIS ------------

We perform data analysis on each feature of the PLCO and NLST datasets.
Number of participants:
  - PLCO: 55161
  - NLST: 48595

--- Feature analysis ---

Age: This feature captures the person’s age.
--------------  -----  ------  -----  ------
Age             PLCO   PLCO %  NLST   NLST %
<= 50           0      0.0     1      0.0
50 < ... <= 60  27337  49.6    24861  51.2
60 < ... <= 70  25120  45.5    20901  43.0
> 70            2704   4.9     2832   5.8
Missing         0      0.0     0      0.0
--------------  -----  ------  -----  ------

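The PLCO % and NLST % columns in these tables appear to be each bin's count divided by the cohort's total number of participants, rounded to one decimal place. A minimal Python sketch of that calculation follows, using the PLCO age counts from the table above. The names `N_PARTICIPANTS` and `percent_of_cohort` are hypothetical and are not taken from the analysis code.

```python
# Cohort sizes as reported at the top of this analysis.
N_PARTICIPANTS = {"PLCO": 55161, "NLST": 48595}


def percent_of_cohort(count: int, cohort: str) -> float:
    """Return count as a percentage of the cohort size, to one decimal."""
    return round(100 * count / N_PARTICIPANTS[cohort], 1)


# PLCO age counts copied from the table above.
plco_age_counts = {
    "<= 50": 0,
    "50 < ... <= 60": 27337,
    "60 < ... <= 70": 25120,
    "> 70": 2704,
    "Missing": 0,
}

# Percentage of the PLCO cohort falling in each age bin.
plco_age_percent = {
    label: percent_of_cohort(n, "PLCO") for label, n in plco_age_counts.items()
}
```

For the 50 < ... <= 60 bin this gives 49.6, which matches the PLCO % column.
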
Smoking cessation age: This feature describes the age at which the person stopped smoking.
---------------------  -----  ------  -----  ------
Smoking cessation age  PLCO   PLCO %  NLST   NLST %
<= 30                  10470  19.0    2      0.0
30 < ... <= 40         11886  21.5    130    0.3
40 < ... <= 50         11447  20.8    7025   14.5
50 < ... <= 60         8649   15.7    14071  29.0
> 60                   1942   3.5     4378   9.0
Missing                10767  19.5    22989  47.3
---------------------  -----  ------  -----  ------

Smoking status: This feature indicates whether the person was a current or former cigarette smoker at the beginning of the study.
--------------  -----  ------  -----  ------
Smoking status  PLCO   PLCO %  NLST   NLST %
Active          9965   18.1    22842  47.0
Former          45196  81.9    25753  53.0
Missing         0      0.0     0      0.0
--------------  -----  ------  -----  ------

Pack-years: This feature is the number of packs smoked per day multiplied by the number of years during which the person smoked.
---------------  -----  ------  -----  ------
Pack-years       PLCO   PLCO %  NLST   NLST %
<= 25            26981  48.9    8      0.0
25 < ... <= 50   16147  29.3    26746  55.0
50 < ... <= 100  9448   17.1    19544  40.2
> 100            1434   2.6     2297   4.7
Missing          1151   2.1     0      0.0
---------------  -----  ------  -----  ------

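The pack-years definition above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code that produced the table: the function names are hypothetical, and returning `None` for a missing input is an assumed convention.

```python
from typing import Optional


def pack_years(packs_per_day: Optional[float],
               years_smoked: Optional[float]) -> Optional[float]:
    """Packs smoked per day multiplied by the number of years smoked.

    Returns None when either input is missing (assumed handling).
    """
    if packs_per_day is None or years_smoked is None:
        return None
    return packs_per_day * years_smoked


def pack_years_bin(value: Optional[float]) -> str:
    """Map a pack-years value to the bin labels used in the table above."""
    if value is None:
        return "Missing"
    if value <= 25:
        return "<= 25"
    if value <= 50:
        return "25 < ... <= 50"
    if value <= 100:
        return "50 < ... <= 100"
    return "> 100"
```

For example, a person who smoked 1.5 packs per day for 20 years has 30 pack-years and falls in the 25 < ... <= 50 bin.
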
Smoking onset age: This feature indicates the age at which the person started smoking.
-----------------  -----  ------  -----  ------
Smoking onset age  PLCO   PLCO %  NLST   NLST %
<= 15              10169  18.4    17927  36.9
15 < ... <= 20     33760  61.2    25411  52.3
> 20               10950  19.9    5256   10.8
Missing            282    0.5     1      0.0
-----------------  -----  ------  -----  ------

Years smoked: This feature describes the total number of years during which the person smoked.
--------------  -----  ------  -----  ------
Years smoked    PLCO   PLCO %  NLST   NLST %
<= 10           8800   16.0    2      0.0
10 < ... <= 20  11761  21.3    292    0.6
20 < ... <= 30  11532  20.9    5134   10.6
30 < ... <= 40  13037  23.6    21620  44.5
> 40            8963   16.2    21547  44.3
Missing         1068   1.9     0      0.0
--------------  -----  ------  -----  ------

Lung cancer family history: This feature indicates whether the person has a close family member (parent, sibling, or child) who had lung cancer.
--------------------------  -----  ------  -----  ------
Lung cancer family history  PLCO   PLCO %  NLST   NLST %
No                          48415  87.8    37302  76.8
Yes                         6323   11.5    10598  21.8
Missing                     423    0.8     695    1.4
--------------------------  -----  ------  -----  ------

BMI: This feature describes the person’s body mass index.
------------------------------------  -----  ------  -----  ------
Body Mass Index                       PLCO   PLCO %  NLST   NLST %
Underweight (... <= 18.4)             295    0.5     347    0.7
Healthy weight (18.5 <= ... <= 24.9)  17556  31.8    13404  27.6
Overweight (25 <= ... <= 29.9)        23920  43.4    20894  43.0
Obesity (... >= 30)                   12631  22.9    13696  28.2
Missing                               759    1.4     234    0.5
------------------------------------  -----  ------  -----  ------

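The BMI categories above can be illustrated with a short Python sketch. Body mass index is weight in kilograms divided by the square of height in metres. The function names are hypothetical, and rounding to one decimal before binning is an assumption made here because the table's thresholds (18.4, 18.5, 24.9, 25, ...) are stated to one decimal; the original analysis may handle this differently.

```python
from typing import Optional


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: Optional[float]) -> str:
    """Map a BMI value to the category labels used in the table above."""
    if value is None:
        return "Missing"
    # Assumption: categories are defined on BMI rounded to one decimal.
    value = round(value, 1)
    if value <= 18.4:
        return "Underweight"
    if value <= 24.9:
        return "Healthy weight"
    if value <= 29.9:
        return "Overweight"
    return "Obesity"
```

For example, a person weighing 70 kg at 1.75 m tall has a BMI of about 22.9 and falls in the healthy weight category.
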
Lung cancer: This feature indicates whether the person was diagnosed with lung cancer.
-----------  -----  ------  -----  ------
Lung cancer  PLCO   PLCO %  NLST   NLST %
Negative     52409  95.0    47084  96.9
Positive     2752   5.0     1511   3.1
Missing      0      0.0     0      0.0
-----------  -----  ------  -----  ------