--- a
+++ b/outputs/data_analysis.txt
@@ -0,0 +1,95 @@
+------------ PRE-PROCESSED DATA ANALYSIS ------------
+
+We analyze each feature of the PLCO and NLST datasets.
+Number of participants:
+ - PLCO: 55161
+ - NLST: 48595
+
+--- Feature analysis ---
+
+Age: This feature captures the person’s age.
+--------------  -----  ------  -----  ------
+Age              PLCO  PLCO %   NLST  NLST %
+<= 50               0     0.0      1     0.0
+50 < ... <= 60  27337    49.6  24861    51.2
+60 < ... <= 70  25120    45.5  20901    43.0
+> 70             2704     4.9   2832     5.8
+Missing             0     0.0      0     0.0
+--------------  -----  ------  -----  ------
+
+Smoking cessation age: This feature describes the age at which the person stopped smoking.
+---------------------  -----  ------  -----  ------
+Smoking cessation age   PLCO  PLCO %   NLST  NLST %
+<= 30                  10470    19.0      2     0.0
+30 < ... <= 40         11886    21.5    130     0.3
+40 < ... <= 50         11447    20.8   7025    14.5
+50 < ... <= 60          8649    15.7  14071    29.0
+> 60                    1942     3.5   4378     9.0
+Missing                10767    19.5  22989    47.3
+---------------------  -----  ------  -----  ------
+
+Smoking status: This feature describes whether the person is a current or a former cigarette smoker at the beginning of the study.
+--------------  -----  ------  -----  ------
+Smoking status   PLCO  PLCO %   NLST  NLST %
+Active           9965    18.1  22842    47.0
+Former          45196    81.9  25753    53.0
+Missing             0     0.0      0     0.0
+--------------  -----  ------  -----  ------
+
+Pack-years: This feature refers to the number of packs smoked per day multiplied by the number of years during which the person smoked.
+---------------  -----  ------  -----  ------
+Pack years        PLCO  PLCO %   NLST  NLST %
+<= 25            26981    48.9      8     0.0
+25 < ... <= 50   16147    29.3  26746    55.0
+50 < ... <= 100   9448    17.1  19544    40.2
+> 100             1434     2.6   2297     4.7
+Missing           1151     2.1      0     0.0
+---------------  -----  ------  -----  ------
+
+Smoking onset age: This feature indicates the age at which the person started smoking.
+-----------------  -----  ------  -----  ------
+Smoking onset age   PLCO  PLCO %   NLST  NLST %
+<= 15              10169    18.4  17927    36.9
+15 < ... <= 20     33760    61.2  25411    52.3
+> 20               10950    19.9   5256    10.8
+Missing              282     0.5      1     0.0
+-----------------  -----  ------  -----  ------
+
+Years smoked: This feature describes the total number of years during which the person smoked.
+--------------  -----  ------  -----  ------
+Smoking years    PLCO  PLCO %   NLST  NLST %
+<= 10            8800    16.0      2     0.0
+10 < ... <= 20  11761    21.3    292     0.6
+20 < ... <= 30  11532    20.9   5134    10.6
+30 < ... <= 40  13037    23.6  21620    44.5
+> 40             8963    16.2  21547    44.3
+Missing          1068     1.9      0     0.0
+--------------  -----  ------  -----  ------
+
+Lung cancer family history: This feature describes whether the person has close family members (parents, siblings or children) who had lung cancer.
+--------------------------  -----  ------  -----  ------
+Lung cancer family history   PLCO  PLCO %   NLST  NLST %
+No                          48415    87.8  37302    76.8
+Yes                          6323    11.5  10598    21.8
+Missing                       423     0.8    695     1.4
+--------------------------  -----  ------  -----  ------
+
+BMI: This feature describes the person’s body mass index.
+------------------------------------  -----  ------  -----  ------
+Body Mass Index                        PLCO  PLCO %   NLST  NLST %
+Underweight (... <= 18.4)               295     0.5    347     0.7
+Healthy weight (18.5 <= ... <= 24.9)  17556    31.8  13404    27.6
+Overweight (25 <= ... <= 29.9)        23920    43.4  20894    43.0
+Obesity (... >= 30)                   12631    22.9  13696    28.2
+Missing                                 759     1.4    234     0.5
+------------------------------------  -----  ------  -----  ------
+
+Lung cancer: This feature indicates whether the person was diagnosed with lung cancer.
+-----------  -----  ------  -----  ------
+Lung cancer   PLCO  PLCO %   NLST  NLST %
+Negative     52409    95.0  47084    96.9
+Positive      2752     5.0   1511     3.1
+Missing          0     0.0      0     0.0
+-----------  -----  ------  -----  ------
+
+
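The report defines pack-years as packs smoked per day multiplied by years smoked, and every table counts participants in right-closed bins (for example `25 < ... <= 50`) with a separate `Missing` row. The percentages appear to be taken over all participants in a cohort, missing values included (for instance, 27337 / 55161 is about 49.6%). The sketch below illustrates that arithmetic in Python. It is not the code that produced this file: the names `pack_years` and `binned_table` are hypothetical, and representing a missing value as `None` is an assumption.

```python
from bisect import bisect_left


def pack_years(packs_per_day, years_smoked):
    """Packs smoked per day multiplied by the number of years smoked."""
    return packs_per_day * years_smoked


def binned_table(values, edges):
    """Count values in right-closed bins: <= e0, (e0, e1], ..., > e_last.

    `values` is assumed to be non-empty; None is counted as Missing.
    Percentages are taken over all values, including missing ones.
    Returns a list of (label, count, percent) rows.
    """
    labels = [f"<= {edges[0]}"]
    labels += [f"{lo} < ... <= {hi}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f"> {edges[-1]}")

    counts = [0] * len(labels)
    missing = 0
    for v in values:
        if v is None:
            missing += 1
        else:
            # bisect_left places a value equal to an edge in the lower bin,
            # which gives the "<=" upper bound used in the tables.
            counts[bisect_left(edges, v)] += 1

    total = len(values)
    rows = [(lab, n, round(100 * n / total, 1)) for lab, n in zip(labels, counts)]
    rows.append(("Missing", missing, round(100 * missing / total, 1)))
    return rows


if __name__ == "__main__":
    # Four made-up participants, one with unknown smoking history.
    sample = [pack_years(1, 20), pack_years(1.5, 30), pack_years(2, 40), None]
    for label, count, pct in binned_table(sample, [25, 50, 100]):
        print(f"{label:<16}{count:>6}{pct:>8}")
```

The bin edges `[25, 50, 100]` match the pack-years table; passing `[50, 60, 70]` would give the age bins.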