We clustered trials in the following four files from the ../nci_data/dataset1-trials directory, based on each trial's curated inclusion criteria, expressed in part through UMLS codes, as stored in the "Boolean" column: - "Hemoglobin_CTEP Trials_072018" - "Platelets_CTEP Trials_072018" - "WBC_CTEP Trials_072018" - "HIV_CTEPTrials_072018" First, we excluded trials with missing criteria in the Boolean column: - "Hemoglobin_CTEP Trials_072018" Excluding 173 of 347 rows After exclusion, 174 rows remain - "Platelets_CTEP Trials_072018" Excluding 148 of 342 rows After exclusion, 194 rows remain - "WBC_CTEP Trials_072018" Excluding 276 of 342 rows After exclusion, 66 rows remain - "HIV_CTEPTrials_072018" Excluding 123 of 342 rows After exclusion, 219 rows remain Next, we parsed the boolean expression in a crude way to obtain information needed for the feature extraction below. We primarily focused on extracting (1) triples that represent individual criteria and (2) operators ("AND" and "OR"). As an example of a criterion triple, the criterion "C64848 >= 8g/dL" maps to the triple ('C64848', '>=', '8g/dL'). The parsing described above results in a sequence of operators and a sequence of triples for each trial. We ignore nesting of criteria disjunctions and conjunctions. While this is a significant simplification, our approach seems to work well despite the simplification, perhaps due to the pairwise relations captured by the features described below. In some cases, we encountered triples that were incomplete, perhaps due to manual annotation error. We filled in this missing data using a placeholder value. We define the following features based on the sequence of operators and the sequence of triples for each trial. Each triple contains three elements, the left, center, and right elements. The counts below are taken over the triples or operators in each trial's list. Most commonly, each element or triple only occurs 0 or 1 times. However, in some cases, elements or entire triples are repeated, e.g., the same UMLS code occurs in several triples, or a triple is repeated in two clauses. The features are as follows: - The count of each triple element (left, center, or right). - The count of each pair of triple elements. - The count of each triple. - The count of each pair of triples. - The count of each operator. - The count of each pair of operators. For example, "(C64848 >= 9g/dL) OR (C64848 >= 5.6mmol/L)' maps to the following non-zero features: - The count of each triple element (left, center, or right). 'l_count_C64848': 2.0, 'c_count_>=': 2.0, 'r_count_5.6mmol/L': 1.0, 'r_count_9g/dL': 1.0, - The count of each pair of triple elements. 'lc_count_(C64848, >=)': 2.0, 'lr_count_(C64848, 5.6mmol/L)': 1.0, 'lr_count_(C64848, 9g/dL)': 1.0, 'cr_count_(>=, 5.6mmol/L)': 1.0, 'cr_count_(>=, 9g/dL)': 1.0, - The count of each triple. "triple_count_('C64848', '>=', '5.6mmol/L')": 1.0, "triple_count_('C64848', '>=', '9g/dL')": 1.0, - The count of each pair of triples. "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '5.6mmol/L')": 1.0, "triple_pair_count_('C64848', '>=', '5.6mmol/L')_('C64848', '>=', '9g/dL')": 1.0, "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '5.6mmol/L')": 1.0, "triple_pair_count_('C64848', '>=', '9g/dL')_('C64848', '>=', '9g/dL')": 1.0 - The count of each operator. 'operator_count_OR': 1.0, - The count of each pair of operators. 'operator_pair_count_OR_OR': 1.0, Based on these features, we carry out hierarchical clustering using complete linkage and cosine similarity. I tried a few variants and this combination seemed to give the best results. It makes sense that cosine similarity works well for these sparse count features. We plot dendrograms representing each clustering in *.clustering.pdf. We report features (alongside the original data) in *.features.csv. We report the linkage matrix in *.linkage_matrix.csv.