Diff of /cluster/README [000000] .. [748a59]

We clustered trials in the following four files from the
../nci_data/dataset1-trials directory, based on each trial's curated
inclusion criteria, expressed in part through UMLS codes, as stored
in the "Boolean" column:
- "Hemoglobin_CTEP Trials_072018"
- "Platelets_CTEP Trials_072018"
- "WBC_CTEP Trials_072018"
- "HIV_CTEPTrials_072018"

First, we excluded trials with missing criteria in the Boolean
column:
- "Hemoglobin_CTEP Trials_072018"
    Excluding 173 of 347 rows
    After exclusion, 174 rows remain
- "Platelets_CTEP Trials_072018"
    Excluding 148 of 342 rows
    After exclusion, 194 rows remain
- "WBC_CTEP Trials_072018"
    Excluding 276 of 342 rows
    After exclusion, 66 rows remain
- "HIV_CTEPTrials_072018"
    Excluding 123 of 342 rows
    After exclusion, 219 rows remain
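
The exclusion step can be sketched as follows. This is a minimal
illustration, not the original script: the CSV layout, the
"trial_id" column, and the helper name exclude_missing are
assumptions made for the example. Only the "Boolean" column name
comes from the description above.

```python
import csv
import io


def exclude_missing(rows, column="Boolean"):
    """Drop rows whose `column` value is missing or blank."""
    kept = [row for row in rows if (row.get(column) or "").strip()]
    print(f"Excluding {len(rows) - len(kept)} of {len(rows)} rows")
    print(f"After exclusion, {len(kept)} rows remain")
    return kept


# Hypothetical miniature input; the real files have more columns
# and may not be stored as CSV.
sample = io.StringIO(
    "trial_id,Boolean\n"
    "T1,C64848 >= 8g/dL\n"
    "T2,\n"
    "T3,(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)\n"
)
rows = list(csv.DictReader(sample))
kept = exclude_missing(rows)
```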

Next, we parsed each trial's Boolean expression in a crude way to
obtain the information needed for the feature extraction below. We
primarily focused on extracting (1) triples that represent
individual criteria and (2) operators ("AND" and "OR"). As an
example of a criterion triple, the criterion "C64848 >= 8g/dL" maps
to the triple ('C64848', '>=', '8g/dL').

The parsing described above results in a sequence of operators and a
sequence of triples for each trial. We ignore the nesting of
disjunctions and conjunctions. Although this is a significant
simplification, the approach seems to work well, perhaps because the
features described below capture pairwise relations. In some cases,
we encountered incomplete triples, perhaps due to manual annotation
errors. We filled in the missing elements with a placeholder value.
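
A crude parser of this kind might look like the sketch below. The
regular expressions, the function name parse_boolean, and the
placeholder string "MISSING" are assumptions for illustration; the
actual patterns and placeholder value are not documented here. The
sketch ignores parentheses, assumes values contain no spaces, and
only fills in a missing left or right element.

```python
import re

# Hypothetical placeholder; the value actually used is not stated.
PLACEHOLDER = "MISSING"

# A criterion is taken to be: code, comparison operator, value.
# Either the code or the value may be empty (an incomplete triple).
CRITERION = re.compile(r"([^\s()<>=!]*)\s*(>=|<=|!=|=|>|<)\s*([^\s()]*)")
OPERATOR = re.compile(r"\b(AND|OR)\b")


def parse_boolean(expression):
    """Return (operators, triples), ignoring any nesting."""
    operators = OPERATOR.findall(expression)
    triples = [
        tuple(part or PLACEHOLDER for part in match.groups())
        for match in CRITERION.finditer(expression)
    ]
    return operators, triples


operators, triples = parse_boolean(
    "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)"
)
print(operators)
print(triples)
```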

We define the following features based on the sequence of operators
and the sequence of triples for each trial. Each triple contains
three elements: the left, center, and right elements. The counts
below are taken over the triples or operators in each trial's list.
Most commonly, each element or triple occurs only 0 or 1 times.
However, in some cases, elements or entire triples are repeated,
e.g., the same UMLS code occurs in several triples, or a triple is
repeated in two clauses. The features are as follows:
- The count of each triple element (left, center, or right).
- The count of each pair of triple elements.
- The count of each triple.
- The count of each pair of triples.
- The count of each operator.
- The count of each pair of operators.

For example, "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)" maps to
the following non-zero features:
  - The count of each triple element (left, center, or right).
    'l_count_C64848': 2.0,
    'c_count_>=': 2.0,
    'r_count_5.6mmol/L': 1.0,
    'r_count_9g/dL': 1.0,
  - The count of each pair of triple elements.
    'lc_count_(C64848, >=)': 2.0,
    'lr_count_(C64848, 5.6mmol/L)': 1.0,
    'lr_count_(C64848, 9g/dL)': 1.0,
    'cr_count_(>=, 5.6mmol/L)': 1.0,
    'cr_count_(>=, 9g/dL)': 1.0,
  - The count of each triple.
    "triple_count_('C64848', '>=', '5.6mmol/L')": 1.0,
    "triple_count_('C64848', '>=', '9g/dL')": 1.0,
  - The count of each pair of triples.
    "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '5.6mmol/L')": 1.0,
    "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '9g/dL')": 1.0,
    "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '5.6mmol/L')": 1.0,
    "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '9g/dL')": 1.0,
  - The count of each operator.
    'operator_count_OR': 1.0,
  - The count of each pair of operators.
    'operator_pair_count_OR_OR': 1.0,
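
The sketch below reproduces the feature names and values of this
example. It is a reconstruction, not the original code: the
function name extract_features is hypothetical, and the choice to
count pairs over every ordered combination of a list with itself
(self-pairs included) is inferred from the example, where a single
OR yields 'operator_pair_count_OR_OR'. How repeated triples or
operators were counted in the original is not stated.

```python
from collections import Counter
from itertools import product


def extract_features(operators, triples):
    """Count elements, element pairs, triples, operators, and pairs."""
    features = Counter()
    for left, center, right in triples:
        # Single elements of the triple.
        features[f"l_count_{left}"] += 1
        features[f"c_count_{center}"] += 1
        features[f"r_count_{right}"] += 1
        # Pairs of elements within the same triple.
        features[f"lc_count_({left}, {center})"] += 1
        features[f"lr_count_({left}, {right})"] += 1
        features[f"cr_count_({center}, {right})"] += 1
        # The whole triple, named by its tuple representation.
        features[f"triple_count_{(left, center, right)}"] += 1
    # Ordered pairs of triples, self-pairs included (assumption).
    for first, second in product(triples, repeat=2):
        features[f"triple_pair_count_{first}_{second}"] += 1
    for operator in operators:
        features[f"operator_count_{operator}"] += 1
    # Ordered pairs of operators, self-pairs included (assumption).
    for first, second in product(operators, repeat=2):
        features[f"operator_pair_count_{first}_{second}"] += 1
    return {name: float(count) for name, count in features.items()}


features = extract_features(
    ["OR"],
    [("C64848", ">=", "9g/dL"), ("C64848", ">=", "5.6mmol/L")],
)
for name, value in sorted(features.items()):
    print(f"{name!r}: {value}")
```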

Based on these features, we carry out hierarchical clustering using
complete linkage and cosine similarity. We tried a few variants, and
this combination seemed to give the best results. It makes sense
that cosine similarity works well for these sparse count features.
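
One way to run such a clustering is with SciPy, as sketched below.
Whether the original analysis used SciPy is an assumption, and the
toy feature dictionaries and the helper name cluster_trials are
hypothetical. SciPy's "cosine" metric is the cosine distance, i.e.
1 minus the cosine similarity.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def cluster_trials(feature_dicts):
    """Complete-linkage clustering of trials under cosine distance."""
    names = sorted({name for feats in feature_dicts for name in feats})
    # Dense trial-by-feature count matrix; absent features count as 0.
    matrix = np.array(
        [[feats.get(name, 0.0) for name in names] for feats in feature_dicts]
    )
    # Cosine distance is undefined for an all-zero row, so every
    # trial is assumed to have at least one non-zero feature.
    return linkage(matrix, method="complete", metric="cosine")


# Hypothetical toy features for three trials.
trials = [
    {"l_count_C64848": 1.0, "r_count_8g/dL": 1.0},
    {"l_count_C64848": 1.0, "r_count_8g/dL": 1.0, "operator_count_OR": 1.0},
    {"operator_count_AND": 2.0},
]
linkage_matrix = cluster_trials(trials)
# no_plot=True returns the dendrogram layout without drawing it.
leaf_order = dendrogram(linkage_matrix, no_plot=True)["leaves"]
print(linkage_matrix)
print(leaf_order)
```

Each row of the linkage matrix records one merge: the two cluster
indices, the distance at which they merge, and the size of the new
cluster. Calling dendrogram without no_plot=True draws the tree and
requires matplotlib.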

We plot dendrograms representing each clustering in
*.clustering.pdf. We report features (alongside the original data)
in *.features.csv. We report the linkage matrix in
*.linkage_matrix.csv.