b/cluster/README

We clustered trials in the following four files from the
../nci_data/dataset1-trials directory, based on each trial's curated
inclusion criteria, expressed in part through UMLS codes, as stored
in the "Boolean" column:
- "Hemoglobin_CTEP Trials_072018"
- "Platelets_CTEP Trials_072018"
- "WBC_CTEP Trials_072018"
- "HIV_CTEPTrials_072018"

First, we excluded trials with missing criteria in the Boolean
column:
- "Hemoglobin_CTEP Trials_072018"
  Excluding 173 of 347 rows
  After exclusion, 174 rows remain
- "Platelets_CTEP Trials_072018"
  Excluding 148 of 342 rows
  After exclusion, 194 rows remain
- "WBC_CTEP Trials_072018"
  Excluding 276 of 342 rows
  After exclusion, 66 rows remain
- "HIV_CTEPTrials_072018"
  Excluding 123 of 342 rows
  After exclusion, 219 rows remain

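The exclusion step above can be sketched roughly as follows. This is
a minimal illustration, not the original script: it assumes each file
has already been loaded as a list of dict rows (as csv.DictReader
would yield), and the "Trial" column name and the helper name
drop_missing_boolean are hypothetical. Only the "Boolean" column name
comes from the data description above.

```python
def drop_missing_boolean(rows, column="Boolean"):
    """Return (kept_rows, excluded_count), dropping rows whose
    `column` value is absent, None, or only whitespace."""
    kept = [r for r in rows if (r.get(column) or "").strip()]
    return kept, len(rows) - len(kept)


# Toy rows; the "Trial" column is a made-up identifier column.
rows = [
    {"Trial": "T1", "Boolean": "C64848 >= 8g/dL"},
    {"Trial": "T2", "Boolean": ""},
    {"Trial": "T3", "Boolean": None},
]

kept, excluded = drop_missing_boolean(rows)
print(f"Excluding {excluded} of {len(rows)} rows")
print(f"After exclusion, {len(kept)} rows remain")
```
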
Next, we parsed each Boolean expression in a crude way to obtain the
information needed for the feature extraction below. We primarily
focused on extracting (1) triples that represent individual criteria
and (2) operators ("AND" and "OR"). As an example of a criterion
triple, the criterion "C64848 >= 8g/dL" maps to the triple
('C64848', '>=', '8g/dL').

The parsing described above yields a sequence of operators and a
sequence of triples for each trial. We ignore the nesting of
disjunctions and conjunctions. Although this is a significant
simplification, the approach seems to work well regardless, perhaps
because of the pairwise relations captured by the features described
below. In some cases, we encountered incomplete triples, perhaps due
to manual annotation error. We filled in the missing elements with a
placeholder value.

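One way such a crude parse might look is sketched below. The regular
expressions, the set of comparison operators, the "MISSING"
placeholder string, and the code "C00000" are all assumptions made
for illustration; the README does not specify the actual grammar or
placeholder. As in the text, nesting is ignored: the function only
collects triples and operators in order of appearance.

```python
import re

# Assumed shape of a criterion: CODE COMPARATOR VALUE, where VALUE
# may be absent (an incomplete triple).
TRIPLE_RE = re.compile(r"(\w+)\s*(>=|<=|!=|=|>|<)\s*([^\s()]+)?")
OPERATOR_RE = re.compile(r"\b(AND|OR)\b")
PLACEHOLDER = "MISSING"  # hypothetical placeholder value


def parse_boolean(expr):
    """Return (triples, operators) found in expr, ignoring nesting."""
    triples = [
        tuple(g if g is not None else PLACEHOLDER for g in m.groups())
        for m in TRIPLE_RE.finditer(expr)
    ]
    operators = OPERATOR_RE.findall(expr)
    return triples, operators


triples, operators = parse_boolean(
    "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)"
)
print(triples)
print(operators)

# An incomplete criterion: the right-hand element is filled in.
incomplete, _ = parse_boolean("(C64848 >= ) AND (C00000 >= 100)")
print(incomplete)
```
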
We define the following features based on the sequence of operators
and the sequence of triples for each trial. Each triple contains
three elements: the left, center, and right elements. The counts
below are taken over the triples or operators in each trial's list.
Most commonly, each element or triple occurs only 0 or 1 times.
However, in some cases, elements or entire triples are repeated,
e.g., the same UMLS code occurs in several triples, or a triple is
repeated in two clauses. The features are as follows:
- The count of each triple element (left, center, or right).
- The count of each pair of triple elements.
- The count of each triple.
- The count of each pair of triples.
- The count of each operator.
- The count of each pair of operators.

For example, "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)" maps to
the following non-zero features:
- The count of each triple element (left, center, or right).
  'l_count_C64848': 2.0,
  'c_count_>=': 2.0,
  'r_count_5.6mmol/L': 1.0,
  'r_count_9g/dL': 1.0,
- The count of each pair of triple elements.
  'lc_count_(C64848, >=)': 2.0,
  'lr_count_(C64848, 5.6mmol/L)': 1.0,
  'lr_count_(C64848, 9g/dL)': 1.0,
  'cr_count_(>=, 5.6mmol/L)': 1.0,
  'cr_count_(>=, 9g/dL)': 1.0,
- The count of each triple.
  "triple_count_('C64848', '>=', '5.6mmol/L')": 1.0,
  "triple_count_('C64848', '>=', '9g/dL')": 1.0,
- The count of each pair of triples.
  "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '5.6mmol/L')": 1.0,
  "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '9g/dL')": 1.0,
  "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '5.6mmol/L')": 1.0,
  "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '9g/dL')": 1.0,
- The count of each operator.
  'operator_count_OR': 1.0,
- The count of each pair of operators.
  'operator_pair_count_OR_OR': 1.0,

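The feature definitions can be illustrated with the sketch below,
which reproduces the feature names and values of the example above.
The function name extract_features is hypothetical, and the pairing
convention (ordered pairs drawn from the Cartesian product of a list
with itself, so self-pairs are included) is inferred from the example
rather than stated in the README; the original code may handle
repeated triples or operators differently.

```python
from collections import Counter
from itertools import product


def extract_features(triples, operators):
    """Count elements, element pairs, triples, triple pairs,
    operators, and operator pairs for one trial."""
    f = Counter()
    for l, c, r in triples:
        f[f"l_count_{l}"] += 1.0
        f[f"c_count_{c}"] += 1.0
        f[f"r_count_{r}"] += 1.0
        f[f"lc_count_({l}, {c})"] += 1.0
        f[f"lr_count_({l}, {r})"] += 1.0
        f[f"cr_count_({c}, {r})"] += 1.0
        f[f"triple_count_{(l, c, r)}"] += 1.0
    # Assumed convention: ordered pairs, including self-pairs.
    for a, b in product(triples, repeat=2):
        f[f"triple_pair_count_{a}_{b}"] += 1.0
    for op in operators:
        f[f"operator_count_{op}"] += 1.0
    for a, b in product(operators, repeat=2):
        f[f"operator_pair_count_{a}_{b}"] += 1.0
    return dict(f)


features = extract_features(
    [("C64848", ">=", "9g/dL"), ("C64848", ">=", "5.6mmol/L")],
    ["OR"],
)
for name, value in sorted(features.items()):
    print(f"{name!r}: {value},")
```
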
Based on these features, we carry out hierarchical clustering using
complete linkage and cosine similarity. We tried a few variants, and
this combination seemed to give the best results. It makes sense
that cosine similarity works well for these sparse count features.

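To make the two ingredients concrete, the sketch below computes a
cosine distance (one minus cosine similarity) between sparse feature
dicts and the complete-linkage distance between two clusters, i.e.,
the largest pairwise distance between their members. It is a
standard-library illustration only, with made-up toy features; the
README does not say which library was used. In Python, SciPy's
scipy.cluster.hierarchy.linkage accepts method="complete" and
metric="cosine" and would be one way to run the full clustering.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def complete_linkage_distance(cluster_1, cluster_2):
    """Largest pairwise distance between members of two clusters."""
    return max(cosine_distance(a, b) for a in cluster_1 for b in cluster_2)


# Toy trials: t2 has the same features as t1 with doubled counts,
# t3 shares no features with either.
t1 = {"l_count_C64848": 1.0, "c_count_>=": 1.0}
t2 = {"l_count_C64848": 2.0, "c_count_>=": 2.0}
t3 = {"operator_count_AND": 1.0}

print(cosine_distance(t1, t2))  # close to 0: same direction
print(cosine_distance(t1, t3))  # no shared features
print(complete_linkage_distance([t1, t2], [t3]))
```

Because the distance depends only on the direction of the count
vectors, a trial that repeats the same criteria more often is not
pushed away from one that states them once.
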
We plot dendrograms representing each clustering in
*.clustering.pdf. We report features (alongside the original data)
in *.features.csv. We report the linkage matrix in
*.linkage_matrix.csv.