# Clinical Trial Matching

For the clinical trial matching, we will start by reading and processing the data from the second dataset so that it can be used for making predictions.

The first document to process is the patient descriptions. We choose the first patient in order to be able to illustrate and test the application of the fine-tuned NER model using one single example.

In [7]:
import pandas as pd
import xml.etree.ElementTree as ET
import re

In [1]:
patient_path='/content/patient_descriptions.topics'

The content of the text file (patient_path) is splitted into individual sections based on the "TOP" delimiter, initializes an empty dictionary (patient_dict) and then processes each section to extract the "NUM" and "TITLE" information. The result is a dictionary, where the key is the patient description abd the value is the textual description.

In [2]:
with open(patient_path, 'r') as file:
    file_content = file.read()

sections = file_content.split('<TOP>\n')
patient_dict = {}

for section in sections:
    if section.strip():
        num_start = section.find('<NUM>') + len('<NUM>')
        num_end = section.find('</NUM>')
        title_start = section.find('<TITLE>') + len('<TITLE>')
        title_end = section.find('<NUM>', title_start)

        num = section[num_start:num_end].strip()
        title = section[title_start:title_end].strip()
        title = title.replace("</TOP>", "").replace("\n", "")

        patient_dict[num] = title

This is the patient description, we will use to match with trial descriptions:

In [3]:
descrip_1 = list(patient_dict.values())[0]
descrip_1

'A 58-year-old African-American woman presents to the ER with episodic pressing/burning anterior chest pain that began two days earlier for the first time in her life. The pain started while she was walking, radiates to the back, and is accompanied by nausea, diaphoresis and mild dyspnea, but is not increased on inspiration. The latest episode of pain ended half an hour prior to her arrival. She is known to have hypertension and obesity. She denies smoking, diabetes, hypercholesterolemia, or a family history of heart disease. She currently takes no medications. Physical examination is normal. The EKG shows nonspecific changes.        '

In the 'matches.txt' file the patient numbers with selected trial descriptions are given, along with a number, either 0, 1 or 2.

- 0 means that it is not a match
- 1 means that it may be a match
- 2 is a match between the two

This information is stored in a table for the patient

In [6]:
def get_matches_for_patient(input_number):
    with open('/content/matches.txt', 'r') as file:
        lines = file.readlines()

    trials = []
    matches = []

    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) == 4:
            number = int(parts[0])
            trial = parts[2]
            match = int(parts[3])
            if number == input_number:
                trials.append(trial)
                matches.append(match)

    data = pd.DataFrame({'trial': trials, 'match': matches})
    return data

input_number = 20141
result = get_matches_for_patient(input_number)

print(result)

           trial  match
0    NCT00000408      0
1    NCT00000492      1
2    NCT00000501      0
3    NCT00001853      0
4    NCT00004727      0
..           ...    ...
138  NCT02507050      0
139  NCT02516839      0
140  NCT02532699      2
141  NCT02608255      1
142  NCT02626741      0

[143 rows x 2 columns]


The matching is used for simplified demonstration with 10 randomly selected trials, these are the selected files with their match value:

- NCT00143195: 2
- NCT01253486: 2
- NCT00809029: 2
- NCT01724567: 2
- NCT00162344: 1
- NCT01162902: 1
- NCT00000408: 0
- NCT00623454: 0
- NCT00848250: 0
- NCT00356707: 0


Next, only helpful information is extracted from the trial descriptions. The documents contain a lot of meta data, with the following relevant information being extracted: a short summary, a detailed description and the inclusion criteria.

In [8]:
def extract_criteria_text(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    brief_summary_text = ""
    detailed_description_text = ""
    inclusion_criteria_text = ""

    for textblock in root.findall(".//detailed_description/textblock"):
        detailed_description_text = textblock.text.strip()

    for textblock in root.findall(".//brief_summary/textblock"):
        brief_summary_text = textblock.text.strip()

    criteria_element = root.find(".//criteria")
    if criteria_element is not None:
        criteria_text = criteria_element.find(".//textblock").text

        criteria_text_lower = criteria_text.lower()
        inclusion_criteria_start = criteria_text_lower.find("inclusion criteria")
        exclusion_criteria_start = criteria_text_lower.find("exclusion criteria")

        if inclusion_criteria_start != -1:
            if exclusion_criteria_start != -1:
                inclusion_criteria_text = criteria_text[inclusion_criteria_start:exclusion_criteria_start]
            else:
                inclusion_criteria_text = criteria_text[inclusion_criteria_start:]

    full_text = inclusion_criteria_text + brief_summary_text + detailed_description_text
    full_text = full_text.replace("\n", "")
    full_text = re.sub(r'\s+', ' ', full_text)
    return full_text

In order to improve the matching and also include exclusion criteria, these are also extracted.

In [9]:
def extract_criteria_text_exclusion(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    exclusion_criteria_text = ""

    criteria_element = root.find(".//criteria")
    if criteria_element is not None:
        criteria_text = criteria_element.find(".//textblock").text

        inclusion_criteria = ""
        exclusion_criteria = ""

        criteria_text_lower = criteria_text.lower()
        if "exclusion criteria" in criteria_text_lower:
            exclusion_start = criteria_text_lower.find("exclusion criteria")
            exclusion_criteria = criteria_text[exclusion_start:]

        exclusion_criteria = exclusion_criteria.replace("\n", "")
        exclusion_criteria = re.sub(r'\s+', ' ', exclusion_criteria)
        return exclusion_criteria

Here are some examples for included and excluded data:

In [10]:
xml_file = '/content/trials/NCT00000408.xml'

print(extract_criteria_text(xml_file))
print(extract_criteria_text_exclusion(xml_file))

Inclusion Criteria: - Must live in the United States - Must understand and write English - Must have access to a computer with e-mail and expect to have this access for at least 3 years - Must be 18 years old - Must have seen a doctor for back pain at least once in the past year Back pain is one of the most common of all symptoms. It is also a great cause of days lost from work and visits to health care providers. This study will develop and evaluate an approach to low back pain that allows subjects to talk with each other and with health professionals via an Internet discussion group. Results we will look at include health behaviors, such as exercise; health status, such as pain and disability; and health care use, such as number of visits to doctors and other health care providers. Anyone 18 years old or older who lives in the United States and has ongoing Internet access can take part in the study. All subjects must have back pain and meet the eligibility criteria listed below.This 

## Model

BioClinicalBERT with best configuartion will be fine-tuned. Code used from fine-tuning file.

In [None]:
import os
import itertools
import numpy as np
!pip3 install datasets
!pip3 install transformers
from datasets import Dataset, DatasetDict
from datasets import load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import torch
!pip install seqeval
!pip install evaluate
import evaluate
metric = evaluate.load("seqeval")
import ast
import pandas as pd
from sklearn.model_selection import train_test_split
! pip install -U accelerate
! pip install -U transformers

In [12]:
data = pd.read_csv("data.csv")

In [13]:
def convert_tags(tags_string):
    return ast.literal_eval(tags_string)

data['tags'] = data['tags'].apply(convert_tags)
data['numeric_tags'] = data['numeric_tags'].apply(convert_tags)
data['tokens'] = data['tokens'].apply(convert_tags)

In [14]:
label_dict = {'O': 0, 'B-Age': 1, 'I-Age': 2, 'B-Sex': 3, 'I-Sex': 4, 'B-Clinical_event': 5,
              'I-Clinical_event': 6, 'B-Nonbiological_location': 7, 'I-Nonbiological_location': 8,
              'B-Sign_symptom': 9, 'I-Sign_symptom': 10, 'B-Biological_structure': 11, 'I-Biological_structure': 12,
              'B-Detailed_description': 13, 'I-Detailed_description': 14, 'B-History': 15, 'I-History': 16, 'B-Family_history': 17,
              'I-Family_history': 18, 'B-Diagnostic_procedure': 19, 'I-Diagnostic_procedure': 20, 'B-Distance': 21,
              'I-Distance': 22, 'B-Lab_value': 23, 'I-Lab_value': 24, 'B-Disease_disorder': 25, 'I-Disease_disorder': 26,
              'B-Shape': 27, 'I-Shape': 28, 'B-Coreference': 29, 'I-Coreference': 30, 'B-Volume': 31, 'I-Volume': 32,
              'B-Therapeutic_procedure': 33, 'I-Therapeutic_procedure': 34, 'B-Area': 35, 'I-Area': 36, 'B-Duration': 37,
              'I-Duration': 38, 'B-Date': 39, 'I-Date': 40, 'B-Color': 41, 'I-Color': 42, 'B-Frequency': 43, 'I-Frequency': 44,
              'B-Texture': 45, 'I-Texture': 46, 'B-Biological_attribute': 47, 'I-Biological_attribute': 48, 'B-Severity': 49,
              'I-Severity': 50, 'B-Activity': 51, 'I-Activity': 52, 'B-Outcome': 53, 'I-Outcome': 54, 'B-Personal_background': 55,
              'I-Personal_background': 56, 'B-Medication': 57, 'I-Medication': 58, 'B-Dosage': 59, 'I-Dosage': 60, 'B-Other_event': 61,
              'I-Other_event': 62, 'B-Administration': 63, 'I-Administration': 64, 'B-Occupation': 65, 'I-Occupation': 66,
              'B-Other_entity': 67, 'I-Other_entity': 68, 'B-Time': 69, 'I-Time': 70, 'B-Subject': 71, 'I-Subject': 72,
              'B-Quantitative_concept': 73, 'I-Quantitative_concept': 74, 'B-Height': 75, 'I-Height': 76, 'B-Mass': 77, 'I-Mass': 78,
              'B-Weight': 79, 'I-Weight': 80, 'B-Qualitative_concept': 81, 'I-Qualitative_concept': 82}

In [15]:
id2label = {i: label for i, label in enumerate(label_dict)}
label2id = {v: k for k, v in id2label.items()}

In [16]:
X = data["sentence"]
y = data["tags"]
numeric_tags = data["numeric_tags"]

X_train, X_rest, y_train, y_rest, numeric_tags_train, numeric_tags_rest = train_test_split(X, y, numeric_tags, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test, numeric_tags_valid, numeric_tags_test = train_test_split(X_rest, y_rest, numeric_tags_rest, test_size=0.5, random_state=42)

train_df = pd.DataFrame({"tags": y_train, "sentence": X_train, "numeric_tags": numeric_tags_train, "tokens": data["tokens"][X_train.index]})
valid_df = pd.DataFrame({"tags": y_valid, "sentence": X_valid, "numeric_tags": numeric_tags_valid, "tokens": data["tokens"][X_valid.index]})
test_df = pd.DataFrame({"tags": y_test, "sentence": X_test, "numeric_tags": numeric_tags_test, "tokens": data["tokens"][X_test.index]})

train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)

data_dict = DatasetDict({
    "train": train_dataset,
    "validation": valid_dataset,
    "test": test_dataset
})

In [17]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word: # New word
            current_word = word_id
            try:
              label = -100 if word_id is None else labels[word_id]
            except:
              label = -100
            new_labels.append(label)
        elif word_id is None: # Special token
            new_labels.append(-100)
        else: # Same word as previous token
            try:
              label = labels[word_id]
              if label % 2 == 1: # if praefix is B-, it gets changed to I-
                  label += 1
            except:
              label = -100
            new_labels.append(label)

    return new_labels

In [18]:
def tokenize_and_align_labels(examples, tokenizer):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["numeric_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [19]:
def get_tokenized_data(tokenizer, data=data_dict ):
    tokenized_datasets = data.map(
        lambda examples: tokenize_and_align_labels(examples, tokenizer),
        batched=True,
        remove_columns=data["train"].column_names,
    )
    return tokenized_datasets

In [20]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [21]:
def get_args(name_to_save, learning_rate=1e-4, num_train_epochs=3, weight_decay=0.01):
  args = TrainingArguments(
    name_to_save,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay
  )
  return args

In [22]:
def get_model(checkpoint):
  return AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
)

In [23]:
def get_trainer(model, args, training_set, eval_set, data_collator, tokenizer):
  return Trainer(
    model=model,
    args=args,
    train_dataset=training_set,
    eval_dataset=eval_set,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [24]:
checkpoint_clinicalBERT = "emilyalsentzer/Bio_ClinicalBERT"
model_clinicalBERT = get_model(checkpoint_clinicalBERT)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
tokenizer_clinicalBERT = AutoTokenizer.from_pretrained(checkpoint_clinicalBERT)
data_collator_clinicalBERT = DataCollatorForTokenClassification(tokenizer=tokenizer_clinicalBERT)

In [26]:
tokenized_datasets_clinicalBERT = get_tokenized_data(tokenizer_clinicalBERT)

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/454 [00:00<?, ? examples/s]

Map:   0%|          | 0/455 [00:00<?, ? examples/s]

In [27]:
trainer_clinicalBERT = get_trainer(model_clinicalBERT,
                                   get_args("bioclinicalBERT_ner"),
                                    tokenized_datasets_clinicalBERT["train"],
                                    tokenized_datasets_clinicalBERT["validation"],
                                    data_collator_clinicalBERT,
                                    tokenizer_clinicalBERT)

In [28]:
trainer_clinicalBERT.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.926511,0.433478,0.58375,0.497514,0.727637
2,1.181800,0.8749,0.507201,0.601667,0.55041,0.751611
3,0.643400,0.891894,0.535612,0.620417,0.574903,0.762144


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1365, training_loss=0.7824189126709878, metrics={'train_runtime': 229.307, 'train_samples_per_second': 47.53, 'train_steps_per_second': 5.953, 'total_flos': 326628269271252.0, 'train_loss': 0.7824189126709878, 'epoch': 3.0})

In [29]:
model = model_clinicalBERT
tokenizer = tokenizer_clinicalBERT

*Copy prediction-functions from fine-tuning file to predict entities*

In [30]:
from transformers import pipeline

In [31]:
classifier = pipeline("ner", model=model, tokenizer=tokenizer)
model.to("cpu")

def predict_tags(sentence):
  prediction = classifier(sentence)
  return prediction

An example to try if everything works:

In [32]:
predict_tags(patient_dict["20154"])

[{'entity': 'B-Age',
  'score': 0.99116594,
  'index': 2,
  'word': '82',
  'start': 3,
  'end': 5},
 {'entity': 'I-Age',
  'score': 0.9957867,
  'index': 3,
  'word': '-',
  'start': 5,
  'end': 6},
 {'entity': 'I-Age',
  'score': 0.99554574,
  'index': 4,
  'word': 'year',
  'start': 6,
  'end': 10},
 {'entity': 'I-Age',
  'score': 0.9956742,
  'index': 5,
  'word': '-',
  'start': 10,
  'end': 11},
 {'entity': 'I-Age',
  'score': 0.9952171,
  'index': 6,
  'word': 'old',
  'start': 11,
  'end': 14},
 {'entity': 'B-Sex',
  'score': 0.9897379,
  'index': 7,
  'word': 'woman',
  'start': 15,
  'end': 20},
 {'entity': 'B-Clinical_event',
  'score': 0.96544045,
  'index': 8,
  'word': 'comes',
  'start': 21,
  'end': 26},
 {'entity': 'B-Nonbiological_location',
  'score': 0.9009364,
  'index': 11,
  'word': 'emergency',
  'start': 34,
  'end': 43},
 {'entity': 'I-Nonbiological_location',
  'score': 0.9545496,
  'index': 12,
  'word': 'department',
  'start': 44,
  'end': 54},
 {'entity':

In [33]:
def create_token_table(sentence):
    sentence = sentence.replace("\u2005", "").replace("\u200a", "").replace("\u2009", "").replace("\n", "")
    tokens = sentence.split()

    token_table = []
    start = 0

    for token in tokens:
        end = start + len(token)
        token_info = {
            "token": token,
            "start": start,
            "end": end
        }
        token_table.append(token_info)

        start = end + 1

    return token_table

Originally, tags were fulfilled with a low probability. However, since matching was more difficult than expected, this threshold was removed again.

In [34]:
def preprocess_entities(sentence_info):
    #sentence_info = [info for info in sentence_info if info['score'] >= 0.4]
    cleaned_entities = []
    current_entity = None
    current_start = None
    current_end = None

    for info in sentence_info:
        entity = info['entity']
        start = info['start']
        end = info['end']

        if current_entity is None:
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_start = start
            current_end = end
        elif current_end == start:
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_end = end
        else:
            cleaned_entities.append((current_entity, current_start, current_end))
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_start = start
            current_end = end

    if current_entity is not None:
        cleaned_entities.append((current_entity, current_start, current_end))

    return cleaned_entities

In [35]:
def generate_tags(predicted_entities, sentence_entities):
    tags = ["O"] * len(sentence_entities)

    for entity, start, end in predicted_entities:
        for i, token_info in enumerate(sentence_entities):
            token_start = token_info['start']
            token_end = token_info['end']

            if token_start == start and token_end == end:
                tags[i] = entity
    return tags

In [36]:
def format_tags(tags):
    formatted_tags = []
    current_entity = None

    for tag in tags:
        if tag == 'O':
            formatted_tags.append(tag)
            current_entity = None
        else:
            entity_type = tag
            if current_entity == entity_type:
                formatted_tags.append('I-' + entity_type)
            else:
                formatted_tags.append('B-' + entity_type)
                current_entity = entity_type

    return formatted_tags

In [37]:
def get_sentence_tags(sentence):
  token_table = create_token_table(sentence)
  sentence_info = predict_tags(sentence)
  predicted_entities = preprocess_entities(sentence_info)
  general_tags = generate_tags(predicted_entities, token_table)
  return general_tags

# Turn entities into values

Now that every function is defined, we can start with making predictions.

In [38]:
from nltk import tokenize
import nltk
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [39]:
descrip_1 = list(patient_dict.values())[0]
list_tags = get_sentence_tags(descrip_1)
list_descrip = descrip_1.split()

In [40]:
result_dict = {}
for key, value in zip(list_tags, list_descrip):
    if key in result_dict:
        result_dict[key].append(value)
    else:
        result_dict[key] = [value]

In [41]:
if 'O' in result_dict:
    del result_dict['O']

In [42]:
result_dict

{'Age': ['58-year-old'],
 'Personal_background': ['African-American'],
 'Sex': ['woman'],
 'Clinical_event': ['presents'],
 'Nonbiological_location': ['ER'],
 'Detailed_description': ['episodic', 'pressing/burning'],
 'Biological_structure': ['anterior', 'chest'],
 'Sign_symptom': ['pain', 'pain', 'diaphoresis', 'increased', 'pain'],
 'Date': ['two', 'days', 'earlier', 'half', 'an', 'prior'],
 'Duration': ['time'],
 'Severity': ['mild'],
 'Time': ['hour'],
 'History': ['hypertension', 'history', 'of', 'heart'],
 'Family_history': ['family'],
 'Diagnostic_procedure': ['Physical', 'examination', 'EKG']}

To turn the entities into embeddings, we use BERT finetuned with Medical Texts (https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO).

In [43]:
model_name = "pritamdeka/S-PubMedBert-MS-MARCO"
tokenizer_embeddings = AutoTokenizer.from_pretrained(model_name)
model_embeddings = AutoModel.from_pretrained(model_name)

Downloading (…)okenizer_config.json:   0%|          | 0.00/388 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/461k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/667 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [44]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_embeddings = model_embeddings.to(device)

With `create_word_vectors`, every entity cluster will be turned into one vector.

In [45]:
def create_word_vectors(texts, tokenizer=tokenizer_embeddings, model=model_embeddings, device=device):
    embeddings = []
    for text in texts:
        tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            model = model.to(device)
            output = model(**tokens)
            embeddings.append(output.last_hidden_state.mean(dim=1).squeeze().cpu().numpy())
    return embeddings

First, the patient descriptions will be turned into a semantic vector by applying the above defined function.

In [46]:
vector_dict_patient = {}
for key, values in result_dict.items():
    vectors = create_word_vectors(values)
    if vectors:
        averaged_vector = np.mean(vectors, axis=0)
        vector_dict_patient[key] = averaged_vector

print(vector_dict_patient)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'Age': array([-4.48633522e-01, -5.57422221e-01, -6.25807941e-01, -6.50893092e-01,
       -2.29620993e-01, -1.82942122e-01, -6.47665381e-01,  8.48993361e-01,
       -1.27741784e-01, -2.12214768e-01,  2.88278669e-01,  5.50653994e-01,
        1.88696459e-01, -1.27048090e-01, -1.07246256e+00,  4.82443459e-02,
        2.16187134e-01,  3.06083590e-01, -7.88488537e-02,  7.07307756e-01,
        5.30802667e-01,  8.10717225e-01, -1.98618099e-01, -1.29397362e-01,
       -2.42943540e-01,  7.89049119e-02, -1.42334998e-01,  2.71630615e-01,
        6.71719909e-01,  4.65820253e-01,  1.33993447e-01,  3.55350077e-01,
       -5.45750260e-01,  9.65225875e-01,  1.82438582e-01, -2.94252276e-01,
        1.23208717e-01, -2.88575411e-01,  3.93170893e-01, -8.43763500e-02,
        2.91413993e-01,  2.67682046e-01,  9.93976220e-02, -1.12106480e-01,
       -9.48969275e-02,  6.60806119e-01, -1.08379729e-01, -5.39819002e-01,
        1.16810776e-01, -1.40077382e-01, -1.92926452e-01, -1.13017648e-01,
        1.2060924

Now, all vectors get averaged, resulting in one vector for all entities.

In [47]:
vector_list = [torch.tensor(v) if isinstance(v, np.ndarray) else v for v in vector_dict_patient.values()]
average_vector_patient = torch.stack(vector_list).mean(dim=0)
print(average_vector_patient)

tensor([-2.6287e-01, -5.5217e-01, -6.4275e-01, -4.0841e-01, -2.1664e-01,
        -2.1445e-01, -5.2317e-01,  3.8331e-01,  2.2948e-02,  1.9140e-01,
         1.5245e-01,  2.5273e-01, -2.4588e-02, -3.8370e-01, -5.1313e-01,
        -3.3356e-02,  6.3320e-03,  2.9500e-01,  5.5257e-02,  3.4000e-01,
         4.5481e-01,  3.5739e-01, -6.3879e-02, -9.4655e-02, -1.4587e-01,
         7.2917e-02, -1.0249e-02, -5.0220e-02,  4.6808e-01, -2.2201e-01,
         1.6824e-01,  4.4901e-01, -3.6082e-01,  6.7337e-01, -1.1133e-01,
        -3.3917e-01, -1.7223e-02, -2.1815e-01,  7.2779e-01,  6.5206e-02,
         2.2852e-01,  3.0957e-01,  1.8308e-01,  2.3772e-02,  1.1512e-01,
         3.8970e-01,  6.4321e-02, -5.2707e-01,  2.4572e-01, -2.1427e-01,
        -2.5681e-01, -5.1376e-02,  1.1637e-01, -2.9529e-02, -5.6034e-01,
        -1.6303e-02, -2.5902e-02,  2.0966e-01, -3.7153e-01,  2.5158e-01,
        -1.5748e-01,  5.5396e-01, -1.3207e-01, -1.4678e-01,  7.7694e-02,
        -1.0562e-01, -4.6436e-01, -2.5583e-01, -4.0

Now the experiments are processed into embeddings. We first read the randomly selected files and then proceed as with the patient description.

In [48]:
folder_path = '/content/trials'
os.chdir(folder_path)
xml_files = [f for f in os.listdir() if f.endswith('.xml')]

In [None]:
trials_embedding_inclusion = {}
trials_embedding_exclusion = {}

The result of extracting the entities of each trial is output.

In [None]:
for xml_file in xml_files:
    tree = ET.parse(xml_file)
    root = tree.getroot()

    text = extract_criteria_text(xml_file)
    model.to("cpu")
    try:
      print(xml_file)
      list1_tags = get_sentence_tags(text)
    except:
      print(xml_file)
      continue
    liste2_tokens = text.split()

    ergebnis_dict = {}
    for key, value in zip(list1_tags, liste2_tokens):
        if key in ergebnis_dict:
            ergebnis_dict[key].append(value)
        else:
            ergebnis_dict[key] = [value]
    if 'O' in ergebnis_dict:
      del ergebnis_dict['O']

    print(ergebnis_dict)

    vector_dict_patient = {}
    for key, values in ergebnis_dict.items():
        vectors = create_word_vectors(values)
        if vectors:
            averaged_vector = np.mean(vectors, axis=0)
            vector_dict_patient[key] = averaged_vector

    vector_list = [torch.tensor(v) if isinstance(v, np.ndarray) else v for v in vector_dict_patient.values()]

    average_vector_patient = torch.stack(vector_list).mean(dim=0)

    trials_embedding_inclusion.update({xml_file: average_vector_patient})

NCT00623454.xml
{'Biological_structure': ['chest', 'chest', 'chest'], 'Sign_symptom': ['pain', 'palpitation', 'pain', 'palpitations', 'pain', 'palpitation', 'pain'], 'Date': ['6', 'month', 'ago', '18', 'and', '65', 'years', '6', 'month'], 'Nonbiological_location': ['Cardiological', 'Out-patient', 'Hospital', 'Cardiological', 'Out-patient'], 'Therapeutic_procedure': ['coping', 'course', 'coping', 'cognitive', 'behaviour'], 'Detailed_description': ['three', 'sessions']}
NCT01162902.xml
{'Detailed_description': ['Stable'], 'Disease_disorder': ['angina'], 'Sign_symptom': ['lesions'], 'Diagnostic_procedure': ['angiography'], 'Medication': ['PCI', 'therapy', 'calcium-channel', 'beta']}
NCT00356707.xml
{'Diagnostic_procedure': ['BWHS'], 'Personal_background': ['American']}
NCT00000408.xml
{'Nonbiological_location': ['States']}
NCT00848250.xml
{'Family_history': ['Infants', 'children', '(newborn', 'to', '17', 'years', 'of', 'children', 'with', 'congenital', 'heart'], 'Therapeutic_procedure': [

We will proceed the exact same way, just using the exclusion criteria. The only thing we have to adjust is, in the case of not having exclusion criteria, we will add a tensor with 0's, so that there will be no problems when continuing preprocessing the data.

In [None]:
for xml_file in xml_files:
    tree = ET.parse(xml_file)
    root = tree.getroot()

    text = extract_criteria_text_exclusion(xml_file)
    model.to("cpu")
    list1_tags = get_sentence_tags(text)
    liste2_tokens = text.split()

    ergebnis_dict = {}
    for key, value in zip(list1_tags, liste2_tokens):
        if key in ergebnis_dict:
            ergebnis_dict[key].append(value)
        else:
            ergebnis_dict[key] = [value]
    if 'O' in ergebnis_dict:
      del ergebnis_dict['O']

    print(ergebnis_dict)

    vector_dict_patient = {}
    for key, values in ergebnis_dict.items():
        vectors = create_word_vectors(values)
        if vectors:
            averaged_vector = np.mean(vectors, axis=0)
            vector_dict_patient[key] = averaged_vector

    try:
      vector_list = [torch.tensor(v) if isinstance(v, np.ndarray) else v for v in vector_dict_patient.values()]
      average_vector_patient = torch.stack(vector_list).mean(dim=0)
      trials_embedding_exclusion.update({xml_file: average_vector_patient})
    except:
      trials_embedding_exclusion.update({xml_file: torch.zeros(torch.Size([768]))})

{'Detailed_description': ['do', 'speak', 'norwegian'], 'History': ['mentally', 'retarded', 'psychosis', 'last', 'alcohol', 'drug', 'misuse'], 'Duration': ['6', 'month']}
{'Date': ['within', 'one'], 'Disease_disorder': ['Angina', 'Liver', 'function', 'abnormality', 'renal', 'failure', 'heart', 'uncorrectable', 'hematologic', 'disease', 'pregnant', 'diabetes'], 'History': ['testing', 'Severe', 'Uncontrolled'], 'Diagnostic_procedure': ['life', 'span']}
{}
{'Therapeutic_procedure': ['Pregnancy', 'Back', 'surgery', 'legal', 'proceedings'], 'Duration': ['past', '6', 'months', 'next', '6', 'months', 'the', 'last', '6', 'months', 'past', 'year'], 'Biological_structure': ['Back', 'Back', 'back', 'back', 'back', 'crotch', 'area', 'back'], 'Sign_symptom': ['pain', 'pain', 'pain', 'Difficulty', 'with', 'pain', 'pain', 'Numbness', 'pain'], 'Disease_disorder': ['car', 'accident', 'injury', 'sciatica', 'systemic', 'disease', 'rheumatic', 'physical', 'or', 'mental', 'health', 'condition', 'sciatica', 

In [None]:
import torch
import torch.nn.functional as F
import numpy as np

`calculate_cosine_similarity` calculates the cosine similarity between the query embedding, which is the embedded single patient description, and a dictionary of embeddings, which is the dictionary of trial descriptions.

In [None]:
def calculate_cosine_similarity(query_embedding, embedding_dict):
    similarities = []

    for key, value_embedding in embedding_dict.items():
        similarity = F.cosine_similarity(query_embedding, value_embedding, dim=0).item()
        similarities.append((key, similarity))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities

Now, we will compute the similarities between the patient description and the inclusion criteria with detailed description and the summary. The results will be sorted ascending with its similarity value.

In [None]:
similarities_in = calculate_cosine_similarity(average_vector_patient, trials_embedding_inclusion)
for id, similarity in similarities_in:
    match_value = result.loc[result['trial'] == id.replace(".xml", ""), 'match'].values[0]
    print(f'ID: {id}, Ähnlichkeit: {similarity:.4f}, Ground Truth: {match_value}')

ID: NCT00162344.xml, Ähnlichkeit: 0.9739, Ground Truth: 1
ID: NCT00623454.xml, Ähnlichkeit: 0.9733, Ground Truth: 0
ID: NCT00848250.xml, Ähnlichkeit: 0.9722, Ground Truth: 0
ID: NCT01724567.xml, Ähnlichkeit: 0.9688, Ground Truth: 2
ID: NCT00809029.xml, Ähnlichkeit: 0.9677, Ground Truth: 2
ID: NCT01253486.xml, Ähnlichkeit: 0.9675, Ground Truth: 2
ID: NCT00143195.xml, Ähnlichkeit: 0.9659, Ground Truth: 2
ID: NCT01162902.xml, Ähnlichkeit: 0.9582, Ground Truth: 1
ID: NCT00356707.xml, Ähnlichkeit: 0.9332, Ground Truth: 0
ID: NCT00000408.xml, Ähnlichkeit: 0.8835, Ground Truth: 0


The same thing is done, now only the patient's description is compared with the exclusion criteria. This means that the higher the match, the worse the fit.

In [None]:
similarities_ex = calculate_cosine_similarity(average_vector_patient, trials_embedding_exclusion)
for id, similarity in similarities_ex:
    try:
      match_value = result.loc[result['trial'] == id.replace(".xml", ""), 'match'].values[0]
    except:
      match_value = -100
    print(f'ID: {id}, Ähnlichkeit: {similarity:.4f}, Ground Truth: {match_value}')

ID: NCT00809029.xml, Ähnlichkeit: 1.0000, Ground Truth: 2
ID: NCT01724567.xml, Ähnlichkeit: 0.9772, Ground Truth: 2
ID: NCT01162902.xml, Ähnlichkeit: 0.9743, Ground Truth: 1
ID: NCT00000408.xml, Ähnlichkeit: 0.9743, Ground Truth: 0
ID: NCT00143195.xml, Ähnlichkeit: 0.9734, Ground Truth: 2
ID: NCT00623454.xml, Ähnlichkeit: 0.9715, Ground Truth: 0
ID: NCT00848250.xml, Ähnlichkeit: 0.9669, Ground Truth: 0
ID: NCT00162344.xml, Ähnlichkeit: 0.9645, Ground Truth: 1
ID: NCT01253486.xml, Ähnlichkeit: 0.9303, Ground Truth: 2
ID: NCT00356707.xml, Ähnlichkeit: 0.0000, Ground Truth: 0


To take the exclusion criteria into account, a new function is defined.
`rank_trials_with_penalty` takes the two lists of trial similarities, one based on inclusion criteria and the other on exclusion criteria, and computes a combined similarity score for each trial. It applies a penalty to trials with higher similarity to the exclusion criteria, controlled by the weight parameter. Finally, it returns a list of ranked trials based on their combined similarity scores, with higher scores indicating a better match to the patient's description.

In [None]:
def rank_trials_with_penalty(similarities_inclusion, similarities_exclusion, weight):
    ranked_trials = []

    for id, similarity_inclusion in similarities_inclusion:
        similarity_exclusion = next((sim for sim_id, sim in similarities_exclusion if sim_id == id), None)

        if similarity_exclusion is not None:
            weighted_similarity = (1 - weight) * similarity_inclusion + weight * similarity_exclusion
            ranked_trials.append((id, weighted_similarity))

    ranked_trials.sort(key=lambda x: x[1], reverse=True)

    return ranked_trials

In [None]:
ranked_trials = rank_trials_with_penalty(similarities_in, similarities_ex, weight=0.3)

rankings = []
relevant_items = []
maybe_relevant_items = []

for id, weighted_similarity in ranked_trials:
    rankings.append(id)
    if (match_value > 1):
      relevant_items.append(id)
    if (match_value > 0):
      maybe_relevant_items.append(id)
    match_value = result.loc[result['trial'] == id.replace(".xml", ""), 'match'].values[0]
    print(f'ID: {id}, Gewichtete Ähnlichkeit: {weighted_similarity:.4f}, Ground Truth: {match_value}')

ID: NCT00809029.xml, Gewichtete Ähnlichkeit: 0.9774, Ground Truth: 2
ID: NCT00623454.xml, Gewichtete Ähnlichkeit: 0.9728, Ground Truth: 0
ID: NCT01724567.xml, Gewichtete Ähnlichkeit: 0.9713, Ground Truth: 2
ID: NCT00162344.xml, Gewichtete Ähnlichkeit: 0.9710, Ground Truth: 1
ID: NCT00848250.xml, Gewichtete Ähnlichkeit: 0.9706, Ground Truth: 0
ID: NCT00143195.xml, Gewichtete Ähnlichkeit: 0.9682, Ground Truth: 2
ID: NCT01162902.xml, Gewichtete Ähnlichkeit: 0.9630, Ground Truth: 1
ID: NCT01253486.xml, Gewichtete Ähnlichkeit: 0.9564, Ground Truth: 2
ID: NCT00000408.xml, Gewichtete Ähnlichkeit: 0.9108, Ground Truth: 0
ID: NCT00356707.xml, Gewichtete Ähnlichkeit: 0.6533, Ground Truth: 0


For evaluating this ranking, we choose normalized Discounted Cumulative Gain. This metric takes into account the relevance of documents and position in the ranking. It rates relevant documents higher and gives them more weight if they are higher in the ranking, which suits our ranking.

In [None]:
def discounted_cumulative_gain(rankings, relevant_items):
    dcg = 0.0
    for i, item in enumerate(rankings):
        if item in relevant_items:
            dcg += 1.0 / (i + 1)
    return dcg

def normalized_discounted_cumulative_gain(rankings, relevant_items):
    dcg = discounted_cumulative_gain(rankings, relevant_items)
    ideal_rankings = sorted(relevant_items, reverse=True)
    idcg = discounted_cumulative_gain(ideal_rankings, relevant_items)
    return dcg / idcg if idcg > 0 else 0.0

Because nDCG can not divide between different relevance rankings, we will do two evaluations, one only with really relevant trials, the other one with conditional matches.

In [None]:
ndcg = normalized_discounted_cumulative_gain(rankings, relevant_items)
ndcg

0.48190476190476195

In [None]:
ndcg = normalized_discounted_cumulative_gain(rankings, maybe_relevant_items)
ndcg

0.5424360220278588

The nDCG values of 0.48 and 0.54 indicate the quality of the ranking, with higher values being better.

Further discussion of results can be read in the paper.