--- a
+++ b/cluster/README
@@ -0,0 +1,90 @@
+We clustered trials in the following four files from the
+../nci_data/dataset1-trials directory, based on each trial's curated
+inclusion criteria, expressed in part through UMLS codes, as stored
+in the "Boolean" column:
+- "Hemoglobin_CTEP Trials_072018"
+- "Platelets_CTEP Trials_072018"
+- "WBC_CTEP Trials_072018"
+- "HIV_CTEPTrials_072018"
+
+First, we excluded trials with missing criteria in the Boolean
+column:
+- "Hemoglobin_CTEP Trials_072018"
+  Excluding 173 of 347 rows
+  After exclusion, 174 rows remain
+- "Platelets_CTEP Trials_072018"
+  Excluding 148 of 342 rows
+  After exclusion, 194 rows remain
+- "WBC_CTEP Trials_072018"
+  Excluding 276 of 342 rows
+  After exclusion, 66 rows remain
+- "HIV_CTEPTrials_072018"
+  Excluding 123 of 342 rows
+  After exclusion, 219 rows remain
+
+Next, we parsed each Boolean expression in a crude way to obtain the
+information needed for the feature extraction below. We primarily
+focused on extracting (1) triples that represent individual criteria
+and (2) operators ("AND" and "OR"). As an example of a criterion
+triple, the criterion "C64848 >= 8g/dL" maps to the triple
+('C64848', '>=', '8g/dL').
+
+The parsing described above results in a sequence of operators and a
+sequence of triples for each trial. We ignore nesting of criteria
+disjunctions and conjunctions. While this is a significant
+simplification, our approach seems to work well in spite of it,
+perhaps due to the pairwise relations captured by the features
+described below. In some cases, we encountered triples that were
+incomplete, perhaps due to manual annotation error. We filled in
+the missing elements of these triples using a placeholder value.
+
+We define the following features based on the sequence of operators
+and the sequence of triples for each trial. Each triple contains
+three elements: the left, center, and right elements. The counts
+below are taken over the triples or operators in each trial's list.
+Most commonly, each element or triple occurs only 0 or 1 times.
+However, in some cases, elements or entire triples are repeated,
+e.g., the same UMLS code occurs in several triples, or a triple is
+repeated in two clauses. The features are as follows:
+- The count of each triple element (left, center, or right).
+- The count of each pair of triple elements.
+- The count of each triple.
+- The count of each pair of triples.
+- The count of each operator.
+- The count of each pair of operators.
+
+For example, "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)" maps to
+the following non-zero features:
+ - The count of each triple element (left, center, or right).
+   'l_count_C64848': 2.0,
+   'c_count_>=': 2.0,
+   'r_count_5.6mmol/L': 1.0,
+   'r_count_9g/dL': 1.0,
+ - The count of each pair of triple elements.
+   'lc_count_(C64848, >=)': 2.0,
+   'lr_count_(C64848, 5.6mmol/L)': 1.0,
+   'lr_count_(C64848, 9g/dL)': 1.0,
+   'cr_count_(>=, 5.6mmol/L)': 1.0,
+   'cr_count_(>=, 9g/dL)': 1.0,
+ - The count of each triple.
+   "triple_count_('C64848', '>=', '5.6mmol/L')": 1.0,
+   "triple_count_('C64848', '>=', '9g/dL')": 1.0,
+ - The count of each pair of triples.
+   "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '5.6mmol/L')": 1.0,
+   "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '9g/dL')": 1.0,
+   "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '5.6mmol/L')": 1.0,
+   "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '9g/dL')": 1.0,
+ - The count of each operator.
+   'operator_count_OR': 1.0,
+ - The count of each pair of operators.
+   'operator_pair_count_OR_OR': 1.0,
+
+Based on these features, we carry out hierarchical clustering using
+complete linkage and cosine similarity. We tried a few variants and
+this combination seemed to give the best results. It makes sense
+that cosine similarity works well for these sparse count features.
+
+We plot dendrograms representing each clustering in
+*.clustering.pdf. We report features (alongside the original data)
+in *.features.csv. We report the linkage matrix in
+*.linkage_matrix.csv.
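
The feature extraction described above can be sketched roughly as
follows. This is a minimal illustration, not the code used to produce
the reported files: the regular expressions and the function names
(parse_expression, extract_features, cosine_similarity) are
hypothetical, pairs of triples and pairs of operators are assumed to be
all ordered pairs drawn from each trial's list (which matches the
worked example), and the placeholder handling for incomplete triples is
omitted.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical patterns; the real "Boolean" column may need more cases.
TRIPLE_RE = re.compile(r"(\w+)\s*(>=|<=|!=|=|>|<)\s*([^\s()]+)")
OPERATOR_RE = re.compile(r"\b(AND|OR)\b")


def parse_expression(text):
    """Return (triples, operators) as flat lists, ignoring nesting."""
    triples = TRIPLE_RE.findall(text)      # list of (left, center, right)
    operators = OPERATOR_RE.findall(text)  # list of "AND" / "OR"
    return triples, operators


def extract_features(text):
    """Map one Boolean expression to a sparse dict of count features."""
    triples, operators = parse_expression(text)
    feats = Counter()
    for triple in triples:
        left, center, right = triple
        # Counts of single elements, keyed by position.
        feats[f"l_count_{left}"] += 1.0
        feats[f"c_count_{center}"] += 1.0
        feats[f"r_count_{right}"] += 1.0
        # Counts of element pairs within a triple.
        feats[f"lc_count_({left}, {center})"] += 1.0
        feats[f"lr_count_({left}, {right})"] += 1.0
        feats[f"cr_count_({center}, {right})"] += 1.0
        # Counts of whole triples.
        feats[f"triple_count_{triple}"] += 1.0
    # Assumption: pairs are all ordered pairs, including self-pairs.
    for a, b in product(triples, repeat=2):
        feats[f"triple_pair_count_{a}_{b}"] += 1.0
    for op in operators:
        feats[f"operator_count_{op}"] += 1.0
    for a, b in product(operators, repeat=2):
        feats[f"operator_pair_count_{a}_{b}"] += 1.0
    return dict(feats)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling extract_features on the example expression above yields the
same non-zero features as listed in the worked example. Because the
features are stored as sparse dicts, cosine_similarity only touches
keys that are present in a trial, which is convenient when most counts
are zero.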