Diff of /hcc-description.txt [000000] .. [6b6586]

a b/hcc-description.txt
{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fmodern\fcharset0 Courier;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw11900\paperh16840\margl1440\margr1440\vieww25100\viewh13180\viewkind0
\deftab720
\pard\pardeftab720\partightenfactor0

\f0\fs26 \cf0 \expnd0\expndtw0\kerning0
Citation Request:\
Please include this citation if you plan to use this database:\
\
\pard\pardeftab720\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 Miriam Seoane Santos, Pedro Henriques Abreu, Pedro J. Garc\'eda-Laencina, Ad\'e9lia Sim\'e3o, Armando Carvalho, \'93A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients\'94, Journal of Biomedical Informatics, 58, 49-59, 2015.\expnd0\expndtw0\kerning0
\
\
\
1. Title: Hepatocellular Carcinoma Dataset (HCC dataset)\
\
2. Source Information\
   -- Donors of database:\
    \kerning1\expnd0\expndtw0 Miriam Seoane Santos (miriams@student.dei.uc.pt)\
    Pedro Henriques Abreu (pha@dei.uc.pt)\
    Department of Informatics Engineering, University of Coimbra, Portugal\
\
    Armando Carvalho (aspcarvalho@gmail.com)\
    Ad\'e9lia Sim\'e3o (adeliasimao@gmail.com)\expnd0\expndtw0\kerning0
 \
    Hospital and University Centre of Coimbra\
\
   -- Date: Feb, 2015\
\
\
3. Past Usage:\
\kerning1\expnd0\expndtw0 Miriam Seoane Santos, Pedro Henriques Abreu, Pedro J. Garc\'eda-Laencina, Ad\'e9lia Sim\'e3o, Armando Carvalho, \'93A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients\'94, Journal of Biomedical Informatics, 58, 49-59, 2015.\
\
\pard\pardeftab720\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
    -- Proposed a cluster-based oversampling approach that is robust to small and imbalanced datasets and accounts for the heterogeneity between HCC patients. The new approach is based on K-means clustering and a modification of the SMOTE algorithm.\
\
    -- The approach was coupled with neural networks (NN) and logistic regression (LR) and compared to baseline approaches that do not consider clustering and/or oversampling.\
\
    -- The target was the first-year survival of the patients, and the results were evaluated in terms of Accuracy, AUC values and F-measure.\
\
    -- Data imputation was performed with KNN using the HEOM metric.\
\
    -- The proposed approach (in particular, the Augmented Sets Approach) coupled with NN presented better results regarding Accuracy (0.7519), AUC (0.7) and F-measure (0.6650).\
\
\
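A minimal sketch of the KNN imputation with the HEOM (Heterogeneous Euclidean-Overlap Metric) distance mentioned under Past Usage, not the implementation used in the cited paper. The function names heom and knn_impute, the default k=1, and the mode/mean aggregation of the neighbours are assumptions made for illustration. Records are lists of strings with "?" marking missing values.\
\
```python
import math

MISSING = "?"  # missing-value marker used by the dataset


def heom(a, b, is_nominal, ranges):
    """HEOM distance between two records given as lists of strings."""
    total = 0.0
    for x, y, nominal, rng in zip(a, b, is_nominal, ranges):
        if x == MISSING or y == MISSING:
            d = 1.0  # a missing value counts as maximal distance
        elif nominal:
            d = 0.0 if x == y else 1.0  # overlap distance for nominal attributes
        elif rng == 0:
            d = 0.0  # constant attribute, no information
        else:
            d = abs(float(x) - float(y)) / rng  # range-normalised difference
        total += d * d
    return math.sqrt(total)


def knn_impute(records, is_nominal, ranges, k=1):
    """Fill each missing entry from the k nearest records that observe it.

    Assumption: nominal attributes take the neighbours' mode and numeric
    attributes take the neighbours' mean.
    """
    imputed = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value != MISSING:
                continue
            donors = [r for n, r in enumerate(records)
                      if n != i and r[j] != MISSING]
            donors.sort(key=lambda r: heom(rec, r, is_nominal, ranges))
            nearest = [r[j] for r in donors[:k]]
            if not nearest:
                continue  # nobody observes this attribute; leave it missing
            if is_nominal[j]:
                imputed[i][j] = max(set(nearest), key=nearest.count)
            else:
                imputed[i][j] = sum(float(v) for v in nearest) / len(nearest)
    return imputed
```
\
Here ranges holds max minus min for each numeric attribute (it is ignored for nominal ones), and is_nominal flags the nominal columns.\
\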
4. Relevant Information:\
\pard\pardeftab720\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 The HCC dataset was obtained at a University Hospital in Portugal and contains demographic, risk-factor, laboratory and overall survival features of 165 real patients diagnosed with HCC. The dataset contains 49 features selected according to the EASL-EORTC (European Association for the Study of the Liver - European Organisation for Research and Treatment of Cancer) Clinical Practice Guidelines, which are the current state of the art in the management of HCC.\
\
This is a heterogeneous dataset, with 23 quantitative variables and 26 qualitative variables. Overall, missing data represents 10.22% of the whole dataset, and only eight patients (4.85%) have complete information in all fields. The target variable is survival at 1 year, encoded as a binary variable: 0 (dies) and 1 (lives). A certain degree of class imbalance is also present (63 cases labeled as \'93dies\'94 and 102 as \'93lives\'94).\
\
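The proportions quoted above can be checked with a few lines of arithmetic. This sketch uses only the counts stated in this description; the variable names are illustrative.\
\
```python
# Counts taken from the dataset description.
n_dies, n_lives = 63, 102
complete_cases = 8

n_patients = n_dies + n_lives  # 165

share_dies = round(100 * n_dies / n_patients, 2)              # 38.18
share_lives = round(100 * n_lives / n_patients, 2)            # 61.82
share_complete = round(100 * complete_cases / n_patients, 2)  # 4.85
imbalance_ratio = round(n_lives / n_dies, 2)                  # 1.62

print(n_patients, share_dies, share_lives, share_complete, imbalance_ratio)
```
\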
A detailed description of the HCC dataset (features\'92 type/scale, range, mean/mode and missing data percentages) is provided in Santos et al., \'93A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients\'94, Journal of Biomedical Informatics, 58, 49-59, 2015.\
\
\pard\pardeftab720\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
\
5. Number of Instances: 165\
\
6. Number of Attributes: 49 + the class attribute\
\
\
7. Attribute Information:\
\
Name                              Data Type   Abbreviation    Range/Possible Values   Missing Values (%)\
----                              ---------   ------------    ---------------------   ------------------\
\pard\pardeftab720\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0
Gender                            nominal     Gender          (1=Male;0=Female)       0\
Symptoms                          nominal     Symptoms        (1=Yes;0=No)            10.91\
Alcohol                           nominal     Alcohol         (1=Yes;0=No)            0\
Hepatitis B Surface Antigen       nominal     HBsAg           (1=Yes;0=No)            10.3\
Hepatitis B e Antigen             nominal     HBeAg           (1=Yes;0=No)            23.64\
Hepatitis B Core Antibody         nominal     HBcAb           (1=Yes;0=No)            14.55\
Hepatitis C Virus Antibody        nominal     HCVAb           (1=Yes;0=No)            5.45\
Cirrhosis                         nominal     Cirrhosis       (1=Yes;0=No)            0\
Endemic Countries                 nominal     Endemic         (1=Yes;0=No)            23.64\
Smoking                           nominal     Smoking         (1=Yes;0=No)            24.85\
Diabetes                          nominal     Diabetes        (1=Yes;0=No)            1.82\
Obesity                           nominal     Obesity         (1=Yes;0=No)            6.06\
Hemochromatosis                   nominal     Hemochro        (1=Yes;0=No)            13.94\
Arterial Hypertension             nominal     AHT             (1=Yes;0=No)            1.82\
Chronic Renal Insufficiency       nominal     CRI             (1=Yes;0=No)            1.21\
Human Immunodeficiency Virus      nominal     HIV             (1=Yes;0=No)            8.48\
Nonalcoholic Steatohepatitis      nominal     NASH            (1=Yes;0=No)            13.33\
Esophageal Varices                nominal     Varices         (1=Yes;0=No)            31.52\
Splenomegaly                      nominal     Spleno          (1=Yes;0=No)            9.09\
Portal Hypertension               nominal     PHT             (1=Yes;0=No)            6.67\
Portal Vein Thrombosis            nominal     PVT             (1=Yes;0=No)            1.82\
Liver Metastasis                  nominal     Metastasis      (1=Yes;0=No)            2.42\
Radiological Hallmark             nominal     Hallmark        (1=Yes;0=No)            1.21\
Age at diagnosis                  integer     Age             20-93                   0\
Grams of Alcohol per day          continuous  Grams/day       0-500                   29.09\
Packs of cigarettes per year      continuous  Packs/year      0-510                   32.12\
Performance Status*               ordinal     PS              [0,1,2,3,4,5]           0\
Encephalopathy degree*            ordinal     Encephalopathy  [1,2,3]                 0.61\
Ascites degree*                   ordinal     Ascites         [1,2,3]                 1.21\
International Normalised Ratio*   continuous  INR             0.84-4.82               2.42\
Alpha-Fetoprotein (ng/mL)         continuous  AFP             1.2-1810346             4.85\
Haemoglobin (g/dL)                continuous  Hemoglobin      5-18.7                  1.82\
Mean Corpuscular Volume (fl)      continuous  MCV             69.5-119.6              1.82\
Leukocytes (G/L)                  continuous  Leucocytes      2.2-13000               1.82\
Platelets (G/L)                   continuous  Platelets       1.71-459000             1.82\
Albumin (mg/dL)                   continuous  Albumin         1.9-4.9                 3.64\
Total Bilirubin (mg/dL)           continuous  Total Bil       0.3-40.5                3.03\
Alanine transaminase (U/L)        continuous  ALT             11-420                  2.42\
Aspartate transaminase (U/L)      continuous  AST             17-553                  1.82\
Gamma glutamyl transferase (U/L)  continuous  GGT             23-1575                 1.82\
Alkaline phosphatase (U/L)        continuous  ALP             1.28-980                1.82\
Total Proteins (g/dL)             continuous  TP              3.9-102                 6.67\
Creatinine (mg/dL)                continuous  Creatinine      0.2-7.6                 4.24\
Number of Nodules                 integer     Nodules         0-5                     1.21\
Major dimension of nodule (cm)    continuous  Major Dim       1.5-22                  12.12\
Direct Bilirubin (mg/dL)          continuous  Dir. Bil        0.1-29.3                26.67\
Iron (mcg/dL)                     continuous  Iron            0-244                   47.88\
Oxygen Saturation (%)             continuous  Sat             0-126                   48.48\
Ferritin (ng/mL)                  continuous  Ferritin        0-2230                  48.48\
Class Attribute                   nominal     Class           (1=lives;0=dies)        0\
\
(*) Additional Info:\
PS: [0=Active;1=Restricted;2=Ambulatory;3=Selfcare;4=Disabled;5=Dead]. In this dataset, PS only takes values from 0 to 4.\
Encephalopathy degree: [1=None;2=Grade I/II;3=Grade III/IV]\
Ascites degree: [1=None;2=Mild;3=Moderate to Severe]\
\pard\pardeftab720\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
\
\pard\pardeftab720\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 More information on the HCC dataset\'92s features can be found in Santos et al., \'93A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients\'94, Journal of Biomedical Informatics, 58, 49-59, 2015.\
\pard\pardeftab720\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
\
\
8. Missing Attribute Values: Denoted by \'93?\'94. The percentage of missing values for each attribute is given in the attribute table above.\
\
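A short sketch of how the "?" marker might be handled when reading the data. It assumes a comma-separated layout, which this description does not specify, and uses a made-up four-row sample rather than real HCC records. The helper names read_rows and missing_percentages are hypothetical.\
\
```python
import csv

MISSING = "?"  # missing-value marker stated in this description


def read_rows(lines):
    """Parse comma-separated lines, mapping the '?' marker to None."""
    rows = []
    for raw in csv.reader(lines):
        cells = [cell.strip() for cell in raw]
        rows.append([None if cell == MISSING else cell for cell in cells])
    return rows


def missing_percentages(rows):
    """Percentage of missing entries in each column, rounded to 2 decimals."""
    n = len(rows)
    return [round(100 * sum(v is None for v in col) / n, 2)
            for col in zip(*rows)]


# Made-up sample with three columns; not taken from the dataset.
sample = ["1,?,67", "0,1,?", "1,0,?", "?,1,54"]
rows = read_rows(sample)
print(missing_percentages(rows))
```
\
On the real data, the same per-column calculation over 165 rows should match the Missing Values (%) column of the attribute table, for example 18 missing entries out of 165 gives 10.91.\
\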
9. Class Distribution:\
    2 classes:\
    63 patients labeled as \'93dies\'94 (0)\
    102 patients labeled as \'93lives\'94 (1)\
}