# Finetuning Transformer Models for NER

This notebook covers the following steps:

- fine-tune BERT, RoBERTa and Bio_ClinicalBERT for NER
- evaluate and compare their performance
- choose the best-performing model and optimize its hyperparameters
- evaluate the final model on the test set

In [None]:
# Install the required libraries first so that the imports below succeed.
!pip3 install datasets
!pip3 install transformers

import os
import itertools
import pandas as pd
import numpy as np
import torch
# `load_metric` is no longer part of recent `datasets` releases;
# the metric is loaded with the `evaluate` library further below.
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

In [None]:
import ast

In [None]:
from datasets import DatasetDict, Dataset
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("data.csv")
data

| Unnamed: 0 | sentence | tags | tokens | numeric_tags |
|---|---|---|---|---|
| 0 | CASE: A 28-year-old previously healthy man pre... | ['O', 'O', 'B-Age', 'B-History', 'I-History', ... | ['case', 'a', '28-year-old', 'previously', 'he... | [0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9] |
| 1 | The symptoms occurred during rest, 2–3 times p... | ['O', 'B-Coreference', 'O', 'O', 'B-Clinical_e... | ['the', 'symptoms', 'occurred', 'during', 'res... | [0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14... |
| 2 | Except for a grade 2/6 holosystolic tricuspid ... | ['O', 'O', 'O', 'B-Lab_value', 'I-Lab_value', ... | ['except', 'for', 'a', 'grade', '2/6', 'holosy... | [0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1... |
| 3 | An electrocardiogram (ECG) revealed normal sin... | ['O', 'B-Diagnostic_procedure', 'O', 'O', 'B-L... | ['an', 'electrocardiogram', 'ecg', 'revealed',... | [0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,... |
| 4 | Transthoracic echocardiography demonstrated th... | ['B-Biological_structure', 'B-Diagnostic_proce... | ['transthoracic', 'echocardiography', 'demonst... | [11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ... |
| ... | ... | ... | ... | ... |
| 4537 | MHL was diagnosed (Fig.3). | ['B-Disease_disorder', 'O', 'O', 'O'] | ['mhl', 'was', 'diagnosed', 'fig3'] | [25, 0, 0, 0] |
| 4538 | Immunohistochemistry results (Fig.4) were the ... | ['B-Diagnostic_procedure', 'I-Diagnostic_proce... | ['immunohistochemistry', 'results', 'fig4', 'w... | [19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ... |
| 4539 | After 9 days of recovery, the patient returned... | ['O', 'B-Duration', 'I-Duration', 'O', 'B-Ther... | ['after', '9', 'days', 'of', 'recovery', 'the'... | [0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9] |
| 4540 | A follow-up examination, which included blood ... | ['O', 'B-Clinical_event', 'O', 'O', 'O', 'B-Di... | ['a', 'follow-up', 'examination', 'which', 'in... | [0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,... |


When the file is read in, the lists in the columns 'tags', 'numeric_tags' and 'tokens' are parsed as strings, so they need to be converted back into lists.

In [None]:
def convert_tags(tags_string):
    return ast.literal_eval(tags_string)

data['tags'] = data['tags'].apply(convert_tags)
data['numeric_tags'] = data['numeric_tags'].apply(convert_tags)
data['tokens'] = data['tokens'].apply(convert_tags)
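
As a small illustration of this conversion, the sketch below applies `ast.literal_eval` to two strings shaped like the CSV cells. The sample strings are shortened stand-ins (an assumption for the example), not actual rows from `data.csv`.

```python
import ast

# Shortened stand-ins for CSV cells (assumed shape, not actual rows).
tags_cell = "['O', 'B-Age', 'I-Age']"
numeric_cell = "[0, 1, 2]"

# literal_eval parses the string as a Python literal and returns a real list.
tags = ast.literal_eval(tags_cell)
numeric_tags = ast.literal_eval(numeric_cell)

print(type(tags_cell).__name__, "->", type(tags).__name__)
print(tags, numeric_tags)
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it does not execute arbitrary code found in the file.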

We will need the label dictionary (label name to numeric ID) again later, so it is redefined here.

In [None]:
label_dict = {'O': 0, 'B-Age': 1, 'I-Age': 2, 'B-Sex': 3, 'I-Sex': 4, 'B-Clinical_event': 5,
              'I-Clinical_event': 6, 'B-Nonbiological_location': 7, 'I-Nonbiological_location': 8,
              'B-Sign_symptom': 9, 'I-Sign_symptom': 10, 'B-Biological_structure': 11, 'I-Biological_structure': 12,
              'B-Detailed_description': 13, 'I-Detailed_description': 14, 'B-History': 15, 'I-History': 16, 'B-Family_history': 17,
              'I-Family_history': 18, 'B-Diagnostic_procedure': 19, 'I-Diagnostic_procedure': 20, 'B-Distance': 21,
              'I-Distance': 22, 'B-Lab_value': 23, 'I-Lab_value': 24, 'B-Disease_disorder': 25, 'I-Disease_disorder': 26,
              'B-Shape': 27, 'I-Shape': 28, 'B-Coreference': 29, 'I-Coreference': 30, 'B-Volume': 31, 'I-Volume': 32,
              'B-Therapeutic_procedure': 33, 'I-Therapeutic_procedure': 34, 'B-Area': 35, 'I-Area': 36, 'B-Duration': 37,
              'I-Duration': 38, 'B-Date': 39, 'I-Date': 40, 'B-Color': 41, 'I-Color': 42, 'B-Frequency': 43, 'I-Frequency': 44,
              'B-Texture': 45, 'I-Texture': 46, 'B-Biological_attribute': 47, 'I-Biological_attribute': 48, 'B-Severity': 49,
              'I-Severity': 50, 'B-Activity': 51, 'I-Activity': 52, 'B-Outcome': 53, 'I-Outcome': 54, 'B-Personal_background': 55,
              'I-Personal_background': 56, 'B-Medication': 57, 'I-Medication': 58, 'B-Dosage': 59, 'I-Dosage': 60, 'B-Other_event': 61,
              'I-Other_event': 62, 'B-Administration': 63, 'I-Administration': 64, 'B-Occupation': 65, 'I-Occupation': 66,
              'B-Other_entity': 67, 'I-Other_entity': 68, 'B-Time': 69, 'I-Time': 70, 'B-Subject': 71, 'I-Subject': 72,
              'B-Quantitative_concept': 73, 'I-Quantitative_concept': 74, 'B-Height': 75, 'I-Height': 76, 'B-Mass': 77, 'I-Mass': 78,
              'B-Weight': 79, 'I-Weight': 80, 'B-Qualitative_concept': 81, 'I-Qualitative_concept': 82}

In [None]:
id2label = {i: label for i, label in enumerate(label_dict)}
label2id = {v: k for k, v in id2label.items()}
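
The dictionary follows a regular pattern: 'O' is 0, every B- tag has an odd ID, and its I- tag has the following even ID. The sketch below rebuilds the first entries from a list of entity types to make that pattern explicit. The helper `build_label_map` and the `demo_` names are hypothetical and only serve the illustration.

```python
def build_label_map(entity_types):
    """Return {'O': 0, 'B-X': 1, 'I-X': 2, ...} for the given entity types."""
    label_map = {"O": 0}
    for entity in entity_types:
        label_map[f"B-{entity}"] = len(label_map)
        label_map[f"I-{entity}"] = len(label_map)
    return label_map

# Only the first three entity types of the notebook's dictionary.
demo_label_dict = build_label_map(["Age", "Sex", "Clinical_event"])
demo_id2label = {idx: label for label, idx in demo_label_dict.items()}
demo_label2id = {label: idx for idx, label in demo_id2label.items()}

print(demo_label_dict)
# Every B- tag has an odd ID; the matching I- tag is the next (even) ID.
print(all(idx % 2 == 1 for label, idx in demo_label_dict.items()
          if label.startswith("B-")))
```

The label alignment step further down relies on this odd/even layout when it turns a B- label into the matching I- label.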

Next, the dataset is split into training, validation, and test sets (80/10/10). The code below creates one dataframe per split, converts each into a dataset with the Hugging Face `Dataset.from_pandas` method, and finally collects them in a `DatasetDict` named `data_dict`.

In [None]:
X = data["sentence"]
y = data["tags"]
numeric_tags = data["numeric_tags"]

X_train, X_rest, y_train, y_rest, numeric_tags_train, numeric_tags_rest = train_test_split(X, y, numeric_tags, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test, numeric_tags_valid, numeric_tags_test = train_test_split(X_rest, y_rest, numeric_tags_rest, test_size=0.5, random_state=42)

train_df = pd.DataFrame({"tags": y_train, "sentence": X_train, "numeric_tags": numeric_tags_train, "tokens": data["tokens"][X_train.index]})
valid_df = pd.DataFrame({"tags": y_valid, "sentence": X_valid, "numeric_tags": numeric_tags_valid, "tokens": data["tokens"][X_valid.index]})
test_df = pd.DataFrame({"tags": y_test, "sentence": X_test, "numeric_tags": numeric_tags_test, "tokens": data["tokens"][X_test.index]})

train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)

data_dict = DatasetDict({
    "train": train_dataset,
    "validation": valid_dataset,
    "test": test_dataset
})

Now the data is in the right format: "train" for fine-tuning, "validation" for evaluation during training, and "test" for testing the model's performance.
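
The sizes of the three splits can be checked by hand. The sketch below assumes that `train_test_split` rounds the test part up when `test_size` is a fraction, and that the data has 4542 rows (the sum of the split sizes shown by the `map` progress bars in this notebook). The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (train, test) sizes, assuming the test part is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_total = 4542  # assumed total number of rows
n_train, n_rest = split_sizes(n_total, 0.2)   # first split: 80 % / 20 %
n_valid, n_test = split_sizes(n_rest, 0.5)    # second split: halve the rest
print(n_train, n_valid, n_test)
```

Under these assumptions the result is 3633 training, 454 validation and 455 test examples.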

# Prepare Data

The sentences need to be converted to token ids before the model can make sense of them.

The preprocessing is shown in detail for the first pre-trained model, BERT.

Later on, we also need seqeval, a framework for evaluating token classification predictions.

In [None]:
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
!pip install seqeval
!pip install evaluate
import evaluate
metric = evaluate.load("seqeval")

To begin, the tokenizer is initialized. To tokenize a pre-tokenized input, we add `is_split_into_words=True`:

In [None]:
checkpoint_bert = "bert-base-uncased"
tokenizer_bert = AutoTokenizer.from_pretrained(checkpoint_bert)
data_collator_bert = DataCollatorForTokenClassification(tokenizer=tokenizer_bert)

In [None]:
inputs = tokenizer_bert(data_dict["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()  # adds [CLS]/[SEP] and splits some words into subtokens

['[CLS]',
 'then',
 'the',
 'tram',
 'flap',
 'was',
 'harvested',
 'from',
 'right',
 'rec',
 '##tus',
 'abd',
 '##omi',
 '##nis',
 'fig',
 '##2',
 '##b',
 'and',
 'was',
 'deep',
 '##ith',
 '##elial',
 '##ized',
 'fig',
 '##2',
 '##c',
 '[SEP]']

The tokenizer added the special tokens ([CLS] at the beginning and [SEP] at the end) and left most of the words untouched; some were split into subtokens.
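
To see how a word ends up as several subtokens, here is a minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenization. It is a toy version, not the Hugging Face implementation, and `toy_vocab` is a hypothetical mini-vocabulary chosen so that it reproduces a few of the splits shown above. The sketch also records which word each subtoken came from.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry matches this word
        pieces.append(piece)
        start = end
    return pieces

def tokenize_words(words, vocab):
    """Return subtokens plus the index of the word each one came from."""
    tokens, word_ids = ["[CLS]"], [None]
    for idx, word in enumerate(words):
        for piece in wordpiece(word, vocab):
            tokens.append(piece)
            word_ids.append(idx)
    tokens.append("[SEP]")
    word_ids.append(None)
    return tokens, word_ids

# Hypothetical mini-vocabulary, not BERT's real vocabulary.
toy_vocab = {"right", "rec", "##tus", "abd", "##omi", "##nis"}
tokens, word_ids = tokenize_words(["right", "rectus", "abdominis"], toy_vocab)
print(tokens)
print(word_ids)
```

The special tokens map to `None`, and all subtokens of one word share the same word index, which is the information needed to align labels with tokens.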

The labels need more processing because the input IDs returned by the tokenizer are longer than the label lists in the dataset, which produces mismatches. We need another function to align each label with its word. The word IDs show which word every token belongs to:

In [None]:
inputs.word_ids()

[None,
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 8,
 9,
 9,
 9,
 10,
 10,
 10,
 11,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 None]

The function `align_labels_with_tokens` expands the label list to match the tokens. Special tokens get a label of -100, which is ignored by the loss function. The first subtoken of a word keeps the word's label; the following subtokens get the same label, with a B- prefix changed to I-.

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:  # First subtoken of a new word
            current_word = word_id
            try:
                label = -100 if word_id is None else labels[word_id]
            except IndexError:  # No label available for this word
                label = -100
            new_labels.append(label)
        elif word_id is None:  # Special token
            new_labels.append(-100)
        else:  # Same word as the previous token
            try:
                label = labels[word_id]
                if label % 2 == 1:  # B- labels have odd IDs; switch to the matching I- label
                    label += 1
            except IndexError:  # No label available for this word
                label = -100
            new_labels.append(label)

    return new_labels

An example:

In [None]:
labels = data_dict["train"][0]["numeric_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 33, 34, 0, 33, 0, 11, 12, 12, 0, 0, 0, 33, 0]
[-100, 0, 0, 33, 34, 0, 33, 0, 11, 12, 12, 12, 12, 12, 0, 0, 0, 0, 0, 33, 34, 34, 34, 0, 0, 0, -100]


The special tokens at the beginning and end are now represented by -100, and the additional subtokens are labeled like the word they belong to.
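
The effect on a single word is easiest to see in isolation. The following sketch uses a simplified, hypothetical `align` helper (without the error handling of the notebook's function) on two words: one labeled 33 (B-Therapeutic_procedure) that is split into four subtokens, and one labeled 0 ('O') that is split into three.

```python
def align(labels, word_ids):
    """Simplified alignment: -100 for special tokens, B- -> I- inside a word."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special token
        elif word_id != previous:
            aligned.append(labels[word_id])  # first subtoken keeps the label
        else:
            label = labels[word_id]
            # Odd IDs are B- tags; the matching I- tag is the next ID.
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned

word_labels = [33, 0]
word_ids = [None, 0, 0, 0, 0, 1, 1, 1, None]
print(align(word_labels, word_ids))
```

The B- label is kept only on the first subtoken, so each entity still starts exactly once after tokenization.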

To preprocess the whole dataset, we need to tokenize all the inputs and apply `align_labels_with_tokens()` on all the labels.

In [None]:
def tokenize_and_align_labels(examples, tokenizer):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["numeric_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

By calling the function `get_tokenized_data`, each dataset will be preprocessed in the shown way.

In [None]:
def get_tokenized_data(tokenizer, data=data_dict):
    tokenized_datasets = data.map(
        lambda examples: tokenize_and_align_labels(examples, tokenizer),
        batched=True,
        remove_columns=data["train"].column_names,
    )
    return tokenized_datasets

In [None]:
tokenized_datasets_bert = get_tokenized_data(tokenizer_bert)

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/454 [00:00<?, ? examples/s]

Map:   0%|          | 0/455 [00:00<?, ? examples/s]

The data has now finally been preprocessed.

The labels should be padded in exactly the same way as the inputs so that they stay the same size, which can be achieved using a `DataCollatorForTokenClassification`. It takes the tokenizer used to preprocess the inputs. Here is a batch built from the first two training examples:

In [None]:
batch = data_collator_bert([tokenized_datasets_bert["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,   33,   34,    0,   33,    0,   11,   12,   12,   12,
           12,   12,    0,    0,    0,    0,    0,   33,   34,   34,   34,    0,
            0,    0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100],
        [-100,   39,   40,   40,    0,    0,    0,    0,    0,    0,   49,   11,
           12,    9,    0,    9,    0,    0,    0,    0,   33,    0,    0,    7,
            8,    8,    0,    0,    0,    0,    0,   13,   25,   26,   26,   26,
         -100]])
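
What the collator does to the labels can be sketched in plain Python. This is a simplified illustration that assumes right-side padding with -100; the helper `pad_labels` and the two short label lists are made up for the example and are not the library's code or data.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels))
            for labels in batch_labels]

# Two toy label lists of different lengths.
short = [-100, 0, 33, 34, -100]
long = [-100, 39, 40, 40, 0, 0, 9, -100]
padded = pad_labels([short, long])
for row in padded:
    print(row)
```

Because the padding value is -100, the padded positions are ignored by the loss just like the special tokens.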

seqeval takes lists of labels as strings, not integers, so we need to fully decode the predictions and labels before passing them to the metric. The first training example looks like this:

In [None]:
data_dict["train"][0]

{'tags': ['O',
  'O',
  'B-Therapeutic_procedure',
  'I-Therapeutic_procedure',
  'O',
  'B-Therapeutic_procedure',
  'O',
  'B-Biological_structure',
  'I-Biological_structure',
  'I-Biological_structure',
  'O',
  'O',
  'O',
  'B-Therapeutic_procedure',
  'O'],
 'sentence': 'Then the TRAM flap was harvested from right rectus abdominis (Fig.2B) and was deepithelialized (Fig.2C).',
 'numeric_tags': [0, 0, 33, 34, 0, 33, 0, 11, 12, 12, 0, 0, 0, 33, 0],
 'tokens': ['then',
  'the',
  'tram',
  'flap',
  'was',
  'harvested',
  'from',
  'right',
  'rectus',
  'abdominis',
  'fig2b',
  'and',
  'was',
  'deepithelialized',
  'fig2c'],
 '__index_level_0__': 4418}

To test the metric, we create fake predictions from these labels by changing the value at index 2. These are the true labels:

In [None]:
labels = data_dict["train"][0]["tags"]
labels

['O',
 'O',
 'B-Therapeutic_procedure',
 'I-Therapeutic_procedure',
 'O',
 'B-Therapeutic_procedure',
 'O',
 'B-Biological_structure',
 'I-Biological_structure',
 'I-Biological_structure',
 'O',
 'O',
 'O',
 'B-Therapeutic_procedure',
 'O']

Changing index 2 to 'O' and computing the metric gives:

In [None]:
predictions = labels.copy()
predictions[2] = 'O'
metric.compute(predictions=[predictions], references=[labels])

{'Biological_structure': {'precision': 1.0,
  'recall': 1.0,
  'f1': 1.0,
  'number': 1},
 'Therapeutic_procedure': {'precision': 0.6666666666666666,
  'recall': 0.6666666666666666,
  'f1': 0.6666666666666666,
  'number': 3},
 'overall_precision': 0.75,
 'overall_recall': 0.75,
 'overall_f1': 0.75,
 'overall_accuracy': 0.9333333333333333}

We get the precision, recall, and F1 score for each separate entity, as well as overall metrics.
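
The overall value of 0.75 follows from counting whole entities rather than single tokens. The sketch below is a simplified approximation of this entity-level scoring, not seqeval's actual implementation. It assumes that an I- tag without a preceding tag of the same type also opens an entity; `extract_entities` and `entity_scores` are hypothetical helpers.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a list of BIO tags."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        prefix, _, entity = tag.partition("-")
        begins = prefix == "B" or (prefix == "I" and entity != current)
        if current is not None and (prefix == "O" or begins):
            entities.add((current, start, i - 1))
            current = None
        if begins:
            start, current = i, entity
    return entities

def entity_scores(reference, prediction):
    """Entity-level precision and recall (exact span and type match)."""
    true_spans = extract_entities(reference)
    pred_spans = extract_entities(prediction)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    return precision, recall

reference = ["O", "O", "B-Therapeutic_procedure", "I-Therapeutic_procedure",
             "O", "B-Therapeutic_procedure", "O", "B-Biological_structure",
             "I-Biological_structure", "I-Biological_structure", "O", "O",
             "O", "B-Therapeutic_procedure", "O"]
prediction = list(reference)
prediction[2] = "O"  # same fake prediction as above
print(entity_scores(reference, prediction))  # (0.75, 0.75)
```

The reference contains four entities. The fake prediction also yields four, but the first one covers only index 3 instead of indices 2 to 3, so three of four spans match. A single wrong token therefore costs a whole entity, while the token-level accuracy stays at 14/15.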

# Prepare for Fine-tuning

To fine-tune multiple models, we define general helper functions that can be called with different parameters.

In [None]:
from transformers import AutoModelForTokenClassification
! pip install -U accelerate
! pip install -U transformers
from transformers import TrainingArguments
from transformers import Trainer

To enable the Trainer to calculate a metric after each epoch, we define the `compute_metrics()` function. This function receives arrays containing predictions and labels, and it returns a dictionary containing metric names and their values.

Within the compute_metrics() function, we first convert the logits to predictions by taking the argmax. Following this, we convert both the labels and predictions to strings. We filter out all instances where the label is -100 and then proceed to utilize the `metric.compute()` method with the obtained results.

In [None]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
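
The decoding steps can be followed on a toy input. The sketch below uses plain Python lists instead of NumPy arrays and a three-label mapping; `demo_id2label`, `decode`, and the scores are assumptions made for the illustration, not model output.

```python
demo_id2label = {0: "O", 1: "B-Age", 2: "I-Age"}

def decode(logits, labels):
    """Argmax over the class scores, then drop positions labeled -100."""
    true_labels, true_predictions = [], []
    for token_scores, token_labels in zip(logits, labels):
        # Index of the highest score for every token position.
        predicted_ids = [max(range(len(s)), key=s.__getitem__)
                         for s in token_scores]
        kept = [(p, l) for p, l in zip(predicted_ids, token_labels)
                if l != -100]
        true_predictions.append([demo_id2label[p] for p, _ in kept])
        true_labels.append([demo_id2label[l] for _, l in kept])
    return true_predictions, true_labels

# One sentence, four token positions, three classes (toy scores).
logits = [[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.4, 0.9], [0.3, 0.2, 0.1]]]
labels = [[-100, 1, 2, -100]]
predictions_out, labels_out = decode(logits, labels)
print(predictions_out, labels_out)
```

The first and last positions stand for the special tokens and are dropped, so only the two real tokens reach the metric.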

Next, we define our `TrainingArguments`. The hyperparameters have default values, which will be replaced when optimizing the model.

In [None]:
def get_args(name_to_save, learning_rate=2e-5, num_train_epochs=3, weight_decay=0.01):
  args = TrainingArguments(
    name_to_save,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay
  )
  return args

When defining the model we have to pass along some information on the number of labels we have. id2label and label2id contain the mappings from ID to label and vice versa:

In [None]:
def get_model(checkpoint):
  return AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Finally, we pass everything to the `Trainer`.

In [None]:
def get_trainer(model, args, training_set, eval_set, data_collator, tokenizer):
  return Trainer(
    model=model,
    args=args,
    train_dataset=training_set,
    eval_dataset=eval_set,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

## BERT

First, we try BERT base uncased.

In [None]:
model_bert = get_model(checkpoint_bert)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The trainer is set up with the BERT-specific tokenizer, data collator and tokenized datasets.

In [None]:
trainer_bert = get_trainer(model_bert, get_args("bert_ner"),
                           tokenized_datasets_bert["train"],
                           tokenized_datasets_bert["validation"],
                           data_collator_bert,
                           tokenizer_bert)

In [None]:
trainer_bert.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.164763 | 0.37683 | 0.493333 | 0.427283 | 0.685781 |
| 2 | 1.665000 | 1.011111 | 0.4393 | 0.53375 | 0.481941 | 0.719987 |
| 3 | 1.017200 | 0.983526 | 0.458201 | 0.54125 | 0.496275 | 0.728789 |




TrainOutput(global_step=1365, training_loss=1.2219703534583906, metrics={'train_runtime': 233.4984, 'train_samples_per_second': 46.677, 'train_steps_per_second': 5.846, 'total_flos': 305075418355890.0, 'train_loss': 1.2219703534583906, 'epoch': 3.0})
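
The step count in `TrainOutput` can be reproduced by hand. The sketch assumes the `Trainer` default batch size of 8 on a single device without gradient accumulation, since `get_args` does not set a batch size.

```python
import math

train_examples = 3633  # size of the training split reported by `map`
batch_size = 8         # assumed: Trainer default, single device
epochs = 3             # default value in get_args

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This gives 455 steps per epoch and 1365 steps in total, matching `global_step=1365`. It would also explain the "No log" entry for the first epoch, assuming the default logging interval of 500 steps: the first training loss is only logged after the first epoch has ended.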

In [None]:
metrics_bert = trainer_bert.evaluate()
print(metrics_bert)



{'eval_loss': 0.9835257530212402, 'eval_precision': 0.4582010582010582, 'eval_recall': 0.54125, 'eval_f1': 0.4962750716332378, 'eval_accuracy': 0.7287894030851777, 'eval_runtime': 4.201, 'eval_samples_per_second': 108.069, 'eval_steps_per_second': 13.568, 'epoch': 3.0}


In [None]:
model_bert.save_pretrained('model_bert')
tokenizer_bert.save_pretrained('tokenizer_bert')

('tokenizer_bert/tokenizer_config.json',
 'tokenizer_bert/special_tokens_map.json',
 'tokenizer_bert/vocab.txt',
 'tokenizer_bert/added_tokens.json',
 'tokenizer_bert/tokenizer.json')

## RoBERTa

Next, we fine-tune RoBERTa. Its tokenizer needs `add_prefix_space=True` to accept pre-tokenized input.

In [None]:
checkpoint_roberta = "roberta-base"
tokenizer_roberta = AutoTokenizer.from_pretrained(checkpoint_roberta, add_prefix_space=True)
data_collator_roberta = DataCollatorForTokenClassification(tokenizer=tokenizer_roberta)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
tokenized_datasets_roberta = get_tokenized_data(tokenizer_roberta)
model_roberta = get_model(checkpoint_roberta)

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/454 [00:00<?, ? examples/s]

Map:   0%|          | 0/455 [00:00<?, ? examples/s]

Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer_roberta = get_trainer(model_roberta,
                                   get_args("roberta_ner"),
                                    tokenized_datasets_roberta["train"],
                                    tokenized_datasets_roberta["validation"],
                                    data_collator_roberta,
                                    tokenizer_roberta)

In [None]:
trainer_roberta.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.120588 | 0.40974 | 0.525833 | 0.460584 | 0.692872 |
| 2 | 1.606100 | 0.963525 | 0.472543 | 0.562917 | 0.513786 | 0.734035 |
| 3 | 0.989300 | 0.935671 | 0.497052 | 0.562083 | 0.527571 | 0.739817 |




TrainOutput(global_step=1365, training_loss=1.1860921000386333, metrics={'train_runtime': 241.4337, 'train_samples_per_second': 45.143, 'train_steps_per_second': 5.654, 'total_flos': 297221582158254.0, 'train_loss': 1.1860921000386333, 'epoch': 3.0})

In [None]:
metrics_roberta = trainer_roberta.evaluate()
print(metrics_roberta)



{'eval_loss': 0.9356706142425537, 'eval_precision': 0.4970523212969786, 'eval_recall': 0.5620833333333334, 'eval_f1': 0.5275713727023855, 'eval_accuracy': 0.7398170521228857, 'eval_runtime': 2.461, 'eval_samples_per_second': 184.474, 'eval_steps_per_second': 23.161, 'epoch': 3.0}


In [None]:
model_roberta.save_pretrained('model_roberta')
tokenizer_roberta.save_pretrained('tokenizer_roberta')

('tokenizer_roberta/tokenizer_config.json',
 'tokenizer_roberta/special_tokens_map.json',
 'tokenizer_roberta/vocab.json',
 'tokenizer_roberta/merges.txt',
 'tokenizer_roberta/added_tokens.json',
 'tokenizer_roberta/tokenizer.json')

## BioClinicalBERT

Lastly, we fine-tune Bio_ClinicalBERT.

In [None]:
checkpoint_clinicalBERT = "emilyalsentzer/Bio_ClinicalBERT"
model_clinicalBERT = get_model(checkpoint_clinicalBERT)

Downloading (…)lve/main/config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokenizer_clinicalBERT = AutoTokenizer.from_pretrained(checkpoint_clinicalBERT)
data_collator_clinicalBERT = DataCollatorForTokenClassification(tokenizer=tokenizer_clinicalBERT)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

In [None]:
tokenized_datasets_clinicalBERT = get_tokenized_data(tokenizer_clinicalBERT)

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/454 [00:00<?, ? examples/s]

Map:   0%|          | 0/455 [00:00<?, ? examples/s]

In [None]:
trainer_clinicalBERT = get_trainer(model_clinicalBERT,
                                   get_args("bioclinicalBERT_ner"),
                                    tokenized_datasets_clinicalBERT["train"],
                                    tokenized_datasets_clinicalBERT["validation"],
                                    data_collator_clinicalBERT,
                                    tokenizer_clinicalBERT)

In [None]:
trainer_clinicalBERT.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.086799 | 0.41481 | 0.5275 | 0.464417 | 0.704842 |
| 2 | 1.608300 | 0.952119 | 0.46669 | 0.554583 | 0.506855 | 0.737384 |
| 3 | 0.944900 | 0.940906 | 0.471297 | 0.554167 | 0.509383 | 0.74155 |




TrainOutput(global_step=1365, training_loss=1.158196775555174, metrics={'train_runtime': 215.9306, 'train_samples_per_second': 50.475, 'train_steps_per_second': 6.321, 'total_flos': 326628269271252.0, 'train_loss': 1.158196775555174, 'epoch': 3.0})

In [None]:
metrics_clinicalBERT = trainer_clinicalBERT.evaluate()
print(metrics_clinicalBERT)



{'eval_loss': 0.9409064054489136, 'eval_precision': 0.4712969525159461, 'eval_recall': 0.5541666666666667, 'eval_f1': 0.5093833780160858, 'eval_accuracy': 0.7415500707435938, 'eval_runtime': 2.8944, 'eval_samples_per_second': 156.853, 'eval_steps_per_second': 19.693, 'epoch': 3.0}


In [None]:
model_clinicalBERT.save_pretrained('model_clinicalBERT')
tokenizer_clinicalBERT.save_pretrained('tokenizer_clinicalBERT')

('tokenizer_clinicalBERT/tokenizer_config.json',
 'tokenizer_clinicalBERT/special_tokens_map.json',
 'tokenizer_clinicalBERT/vocab.txt',
 'tokenizer_clinicalBERT/added_tokens.json',
 'tokenizer_clinicalBERT/tokenizer.json')

**Which model is the best?**

For a quick overview, we compare the metrics from the first evaluation runs:

In [None]:
print(f"Bert:\nAccuracy: {metrics_bert['eval_accuracy']}\nF1-Score: {metrics_bert['eval_f1']}")
print(f"Bio_ClinicalBERT:\nAccuracy: {metrics_clinicalBERT['eval_accuracy']}\nF1-Score: {metrics_clinicalBERT['eval_f1']}")
print(f"RoBERTa:\nAccuracy: {metrics_roberta['eval_accuracy']}\nF1-Score: {metrics_roberta['eval_f1']}")

Bert:
Accuracy: 0.7287894030851777
F1-Score: 0.4962750716332378
Bio_ClinicalBERT:
Accuracy: 0.7415500707435938
F1-Score: 0.5093833780160858
RoBERTa:
Accuracy: 0.7398170521228857
F1-Score: 0.5275713727023855


All three models seem to perform similarly well. To find out whether their scores differ significantly, we will perform statistical tests on metric values obtained with 5-fold cross-validation.
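
The notebook does not name the statistical test at this point. One common choice for comparing two models on the same folds is a paired t-test, sketched below with the standard library only. The function `paired_t_statistic` is a hypothetical helper, and the fold scores are made-up numbers for illustration, not results from this notebook.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences (n - 1 degrees of freedom)."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error

# Made-up fold scores for illustration only; not results from this notebook.
toy_model_a = [0.52, 0.53, 0.51, 0.54, 0.52]
toy_model_b = [0.50, 0.50, 0.50, 0.51, 0.50]
print(paired_t_statistic(toy_model_a, toy_model_b))
```

With five folds there are four degrees of freedom, so an absolute t value above roughly 2.776 would indicate a difference at the two-sided 5 % level. Pairing the scores is reasonable only if both models are evaluated on identical folds, for example by using the same `random_state` for the fold split.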

In [None]:
from sklearn.model_selection import KFold
from statistics import mean

In [None]:
def get_dataset(data_frame, tokenizer):
    dataset = Dataset.from_pandas(data_frame)

    tokenized_dataset = dataset.map(
        lambda examples: tokenize_and_align_labels(examples, tokenizer),
        batched=True
    )

    return tokenized_dataset

In [None]:
def get_cross_validation_scores(checkpoint):
  kf = KFold(n_splits=5, shuffle=True, random_state=42)
  if checkpoint=='roberta-base':
    tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
  else:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  accuracy_scores = []
  f1_scores = []

  for train, test in kf.split(data):
    train_data = data.iloc[train]
    test_data = data.iloc[test]

    train_fold = get_dataset(train_data, tokenizer)
    valid_fold = get_dataset(test_data, tokenizer)

    model = get_model(checkpoint)
    args = get_args("checkpoint_for_cv")
    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
    trainer = get_trainer(model, args, train_fold, valid_fold, data_collator, tokenizer)
    trainer.train()
    metrics = trainer.evaluate()
    accuracy_scores.append(metrics['eval_accuracy'])
    f1_scores.append(metrics['eval_f1'])
  return [accuracy_scores, f1_scores]

In [None]:
def print_mean_scores(scores):
  avg_accuracy = sum(scores[0]) / len(scores[0])
  avg_f1 = sum(scores[1]) / len(scores[1])

  print(f'Avg Accuracy: {avg_accuracy}, Avg F1-Score: {avg_f1}')

In [None]:
cross_validation_scores_bert = get_cross_validation_scores("bert-base-uncased")

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.156941 | 0.382589 | 0.470197 | 0.421893 | 0.685283 |
| 2 | 1.626900 | 1.030604 | 0.455342 | 0.51265 | 0.4823 | 0.71783 |
| 3 | 0.998900 | 1.024186 | 0.454118 | 0.539022 | 0.492941 | 0.724816 |




Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.156984 | 0.396164 | 0.482838 | 0.435228 | 0.689652 |
| 2 | 1.632200 | 1.049778 | 0.45598 | 0.535481 | 0.492543 | 0.716842 |
| 3 | 1.001100 | 1.012401 | 0.469376 | 0.537376 | 0.50108 | 0.725277 |




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.176537 | 0.405291 | 0.460868 | 0.431297 | 0.686747 |
| 2 | 1.610800 | 1.039953 | 0.450009 | 0.519076 | 0.482081 | 0.712596 |
| 3 | 1.001500 | 1.019376 | 0.472344 | 0.538042 | 0.503057 | 0.72057 |




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.145955 | 0.410415 | 0.485792 | 0.444933 | 0.694829 |
| 2 | 1.628900 | 1.03963 | 0.442393 | 0.540744 | 0.486649 | 0.715414 |
| 3 | 1.006800 | 1.0216 | 0.468283 | 0.536774 | 0.500195 | 0.722399 |




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.110955 | 0.410031 | 0.427358 | 0.418515 | 0.695315 |
| 2 | 1.643800 | 0.999179 | 0.439949 | 0.528207 | 0.480055 | 0.716991 |
| 3 | 1.013800 | 0.979439 | 0.459369 | 0.535613 | 0.49457 | 0.727154 |




In [None]:
cross_validation_scores_bert

[[0.7248159303882196,
  0.7252774251668646,
  0.7205695509309967,
  0.7223990410449304,
  0.727153881836967],
 [0.4929411764705882,
  0.5010799136069115,
  0.5030574806359559,
  0.5001947040498443,
  0.4945695897023331]]

In [None]:
print_mean_scores(cross_validation_scores_bert)

Avg Accuracy: 0.723926030666099, Avg F1-Score: 0.49762739069204354


In [None]:
cross_validation_scores_roberta = get_cross_validation_scores('roberta-base')

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.097066,0.415056,0.483491,0.446667,0.698825
2,1.582200,0.991852,0.476973,0.535163,0.504395,0.723353
3,0.986500,0.976402,0.480274,0.555961,0.515353,0.730453




Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.107128,0.423097,0.504527,0.460238,0.696007
2,1.574000,1.016745,0.473899,0.539061,0.504384,0.716966
3,0.979800,0.979545,0.494173,0.55359,0.522197,0.729516




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.109957,0.444037,0.475692,0.45932,0.700108
2,1.540400,0.983117,0.483996,0.543928,0.512215,0.723806
3,0.982000,0.959463,0.498349,0.559407,0.527116,0.730821




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.104467,0.442596,0.511492,0.474557,0.70229
2,1.553700,0.989619,0.47126,0.551609,0.508279,0.722118
3,0.985000,0.957092,0.496408,0.563101,0.527655,0.731903




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.06623,0.42974,0.446308,0.437867,0.70602
2,1.558500,0.948279,0.472836,0.549771,0.50841,0.729094
3,0.987400,0.939218,0.488368,0.557831,0.520793,0.738636




In [None]:
cross_validation_scores_roberta

[[0.7304531176040278,
  0.7295161631100782,
  0.7308211170069251,
  0.7319032561319545,
  0.7386363636363636],
 [0.5153532743714597,
  0.5221968417916377,
  0.5271158586688579,
  0.5276554087126775,
  0.5207930859176411]]

In [None]:
print_mean_scores(cross_validation_scores_roberta)

Avg Accuracy: 0.7322660034978699, Avg F1-Score: 0.5226228938924548
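
The helper `print_mean_scores` is defined earlier in the notebook. As a rough sketch of what it presumably computes (an assumption, since its definition is not repeated here), the two averages can be reproduced from the score lists with the standard library. The function name `mean_scores` is hypothetical.

```python
from statistics import mean

# Scores copied from the RoBERTa output above:
# first list = accuracy per fold, second list = F1-score per fold
cross_validation_scores_roberta = [
    [0.7304531176040278, 0.7295161631100782, 0.7308211170069251,
     0.7319032561319545, 0.7386363636363636],
    [0.5153532743714597, 0.5221968417916377, 0.5271158586688579,
     0.5276554087126775, 0.5207930859176411],
]


def mean_scores(scores):
    """Hypothetical helper: return (mean accuracy, mean F1) over all folds."""
    accuracies, f1_scores = scores
    return mean(accuracies), mean(f1_scores)


avg_acc, avg_f1 = mean_scores(cross_validation_scores_roberta)
print(f"Avg Accuracy: {avg_acc:.4f}, Avg F1-Score: {avg_f1:.4f}")
```

Rounded to four decimals, this matches the averages printed above for RoBERTa.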


In [None]:
cross_validation_scores_bio_clinicalBERT = get_cross_validation_scores('emilyalsentzer/Bio_ClinicalBERT')

Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.091431,0.415973,0.503645,0.45563,0.703202
2,1.583000,0.992685,0.463863,0.529803,0.494645,0.725022
3,0.935500,0.983385,0.467625,0.551244,0.506003,0.733577




Map:   0%|          | 0/3633 [00:00<?, ? examples/s]

Map:   0%|          | 0/909 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.133353,0.411875,0.506844,0.454451,0.699443
2,1.606100,1.019712,0.456397,0.537797,0.493765,0.721098
3,0.949900,1.004915,0.472793,0.552537,0.509564,0.725859




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.114807,0.44284,0.488119,0.464378,0.702966
2,1.577600,0.988521,0.465235,0.53826,0.49909,0.722519
3,0.951400,0.977412,0.481189,0.557663,0.516611,0.730365




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.098683,0.436341,0.516298,0.472964,0.701761
2,1.597500,0.990811,0.460678,0.55934,0.505237,0.725541
3,0.956900,0.973427,0.476918,0.567697,0.518363,0.731709




Map:   0%|          | 0/3634 [00:00<?, ? examples/s]

Map:   0%|          | 0/908 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.054887,0.441123,0.478981,0.459273,0.71428
2,1.590800,0.953492,0.453199,0.53997,0.492794,0.72998
3,0.954000,0.952928,0.470479,0.551949,0.507968,0.736268




In [None]:
cross_validation_scores_bio_clinicalBERT

[[0.7335766423357665,
  0.7258590900364753,
  0.7303647716069668,
  0.7317092094033673,
  0.736267647407759],
 [0.5060027553631176,
  0.5095640353432371,
  0.5166111279410279,
  0.5183630640083946,
  0.5079683271524508]]

In [None]:
print_mean_scores(cross_validation_scores_bio_clinicalBERT)

Avg Accuracy: 0.7315554721580669, Avg F1-Score: 0.5117018619616457


To find out whether the mean accuracies of the models differ significantly, we use the Python library Pingouin to perform three paired t-tests, one for each pair of models.

In [None]:
!pip install pingouin
import pingouin as pg

In [None]:
bert_acc = [0.7248159303882196, 0.7252774251668646, 0.7205695509309967, 0.7223990410449304, 0.727153881836967]  # accuracy per fold
roberta_acc = [0.7304531176040278, 0.7295161631100782, 0.7308211170069251, 0.7319032561319545, 0.7386363636363636]
clinicalbert_acc = [0.7335766423357665, 0.7258590900364753, 0.7303647716069668, 0.7317092094033673, 0.736267647407759]

Because three pairwise t-tests are run on the same data, a Bonferroni correction is needed: the alpha level of 0.05 is divided by the number of tests, which gives a new, stricter alpha level.

In [None]:
alpha_adjusted = 0.05 / 3
alpha_adjusted

0.016666666666666666

In [None]:
pg.ttest(bert_acc, roberta_acc, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-5.892911,4,two-sided,0.004147,"[-0.01, -0.0]",2.596799,13.337,0.987303


The t-test between BERT and RoBERTa shows a highly significant difference in the means (p = .004, below the adjusted alpha of .0167). The cross-validation values show that RoBERTa's accuracy is higher in every fold, so it is the better model of the two.
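
To make the reported `T` value less of a black box, the paired t statistic can also be computed by hand. This is only a sketch of the textbook formula (mean of the per-fold differences divided by its standard error), not Pingouin's internal implementation; the function name `paired_t_statistic` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Per-fold accuracies copied from the cell above
bert_acc = [0.7248159303882196, 0.7252774251668646, 0.7205695509309967,
            0.7223990410449304, 0.727153881836967]
roberta_acc = [0.7304531176040278, 0.7295161631100782, 0.7308211170069251,
               0.7319032561319545, 0.7386363636363636]


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for two paired samples of equal length."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference (sample standard deviation, n - 1)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


t_stat, dof = paired_t_statistic(bert_acc, roberta_acc)
print(f"T = {t_stat:.4f}, dof = {dof}")
```

The result agrees with the `T` and `dof` columns of the Pingouin output up to rounding. The negative sign simply reflects that BERT's accuracies are passed first and are lower.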

In [None]:
pg.ttest(bert_acc, clinicalbert_acc, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-4.315672,4,two-sided,0.01249,"[-0.01, -0.0]",2.281356,6.003,0.960982


The t-test between BERT and ClinicalBERT also shows a significant difference in the mean values (p = .012, below the adjusted alpha of .0167). The values show that ClinicalBERT performs better than BERT.

In [None]:
pg.ttest(roberta_acc, clinicalbert_acc, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,0.617701,4,two-sided,0.570204,"[-0.0, 0.0]",0.188366,0.463,0.062652


The last t-test, between RoBERTa and BioClinicalBERT, shows no significant difference (p = .570, well above the adjusted alpha of .0167), so neither model is significantly better. This means that we are free to decide which model we proceed with. We choose BioClinicalBERT.
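
The three decisions can be summarized in a few lines. This sketch only re-applies the Bonferroni-adjusted threshold to the p-values printed above; the dictionary layout is an illustrative choice.

```python
# p-values copied from the three Pingouin outputs above
p_values = {
    ("BERT", "RoBERTa"): 0.004147,
    ("BERT", "Bio_ClinicalBERT"): 0.01249,
    ("RoBERTa", "Bio_ClinicalBERT"): 0.570204,
}

alpha = 0.05
# Bonferroni correction: divide alpha by the number of comparisons
alpha_adjusted = alpha / len(p_values)

# A comparison counts as significant only if p is below the adjusted alpha
significant = {pair: p < alpha_adjusted for pair, p in p_values.items()}

for (model_a, model_b), is_significant in significant.items():
    verdict = "significant" if is_significant else "not significant"
    print(f"{model_a} vs. {model_b}: {verdict}")
```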

# Model optimization

Though we could not find an outstanding model, we will use BioClinicalBERT and tune its hyperparameters to try to improve its performance.

We will vary the following hyperparameters: learning rate, number of epochs and weight decay. Their values can be easily adjusted by calling the `get_args` function with different arguments.
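
How `get_args` is implemented is not shown in this part of the notebook. A minimal sketch of such a helper, assuming it simply forwards keyword arguments with defaults, might look as follows. The defaults for `learning_rate` (2e-5) and `num_train_epochs` (3) follow the values mentioned in the text; the default `weight_decay` of 0.01 is purely an assumption. The sketch returns a plain dictionary that could be unpacked into `transformers.TrainingArguments(**kwargs)`; the name `get_args_sketch` is hypothetical.

```python
def get_args_sketch(output_dir, learning_rate=2e-5, num_train_epochs=3,
                    weight_decay=0.01):
    """Hypothetical stand-in for get_args: collect the tunable hyperparameters."""
    return {
        "output_dir": output_dir,
        "learning_rate": learning_rate,
        "num_train_epochs": num_train_epochs,
        "weight_decay": weight_decay,  # assumed default, not taken from the notebook
    }


baseline = get_args_sketch("baseline")
higher_lr = get_args_sketch("model_1", 1e-4)

# Only the run name and the learning rate differ between the two configurations
changed = {key for key in baseline if baseline[key] != higher_lr[key]}
print(sorted(changed))
```

Keeping every other value at its default makes it easy to attribute a change in the metrics to a single hyperparameter.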

For comparison, the previous results:

In [None]:
metrics_clinicalBERT

{'eval_loss': 0.9409064054489136,
 'eval_precision': 0.4712969525159461,
 'eval_recall': 0.5541666666666667,
 'eval_f1': 0.5093833780160858,
 'eval_accuracy': 0.7415500707435938,
 'eval_runtime': 2.8944,
 'eval_samples_per_second': 156.853,
 'eval_steps_per_second': 19.693,
 'epoch': 3.0}

### Learning rate


Until now we used a learning rate of 2e-5, which is now increased to 1e-4 for fine-tuning.

In [None]:
model_1 = get_model('emilyalsentzer/Bio_ClinicalBERT')
trainer_1 = get_trainer(model_1,
                        get_args("model_1", 1e-4),
                        tokenized_datasets_clinicalBERT["train"],
                        tokenized_datasets_clinicalBERT["validation"],
                        data_collator_clinicalBERT,
                        tokenizer_clinicalBERT
                        )
trainer_1.train()
trainer_1.evaluate()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.921146,0.439488,0.587083,0.502676,0.726851
2,1.181500,0.863842,0.519725,0.598333,0.556266,0.756092
3,0.636900,0.888697,0.526278,0.613333,0.566481,0.760887




{'eval_loss': 0.8886973261833191,
 'eval_precision': 0.5262781551662495,
 'eval_recall': 0.6133333333333333,
 'eval_f1': 0.5664806619203385,
 'eval_accuracy': 0.7608866530419746,
 'eval_runtime': 2.1777,
 'eval_samples_per_second': 208.479,
 'eval_steps_per_second': 26.175,
 'epoch': 3.0}

With a learning rate of 1e-4, the model performs better than before. All metrics improved; for example, accuracy increased from 0.74 to 0.76 and the F1-score from 0.51 to 0.57.

### Number of Epochs

The evaluation metrics after each epoch show only limited improvement from the 2nd to the 3rd epoch. As a test, we now train for 4 epochs.

In [None]:
model_2 = get_model('emilyalsentzer/Bio_ClinicalBERT')
trainer_2 = get_trainer(model_2,
                        get_args("model_2", learning_rate=1e-4, num_train_epochs=4),
                        tokenized_datasets_clinicalBERT["train"],
                        tokenized_datasets_clinicalBERT["validation"],
                        data_collator_clinicalBERT,
                        tokenizer_clinicalBERT
                        )
trainer_2.train()
trainer_2.evaluate()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.94218,0.438356,0.573333,0.496841,0.723
2,1.212700,0.870196,0.518204,0.610833,0.560719,0.750904
3,0.672200,0.902824,0.533818,0.611667,0.570097,0.760022
4,0.410000,0.988483,0.534858,0.61375,0.571595,0.757507




{'eval_loss': 0.9884827733039856,
 'eval_precision': 0.5348583877995643,
 'eval_recall': 0.61375,
 'eval_f1': 0.5715948777648429,
 'eval_accuracy': 0.757506681339412,
 'eval_runtime': 2.1422,
 'eval_samples_per_second': 211.929,
 'eval_steps_per_second': 26.608,
 'epoch': 4.0}

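Fine-tuning with 4 epochs shows a slightly lower accuracy after the 4th epoch (0.758, compared with 0.760 after the 3rd), while the validation loss rises from 0.90 to 0.99 and the training loss keeps falling. This indicates that the model starts to overfit the data. Thus we will not change this parameter and keep 3 epochs.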
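
The overfitting signal can be read directly from the logged table. The following sketch copies the per-epoch values of the 4-epoch run and picks the best epoch by validation loss and by accuracy; the tuple layout is an illustrative choice.

```python
# (epoch, training loss, validation loss, F1, accuracy), copied from the 4-epoch run above
history = [
    (1, None, 0.942180, 0.496841, 0.723000),
    (2, 1.212700, 0.870196, 0.560719, 0.750904),
    (3, 0.672200, 0.902824, 0.570097, 0.760022),
    (4, 0.410000, 0.988483, 0.571595, 0.757507),
]

# Lowest validation loss and highest accuracy do not have to coincide
best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[4])

print(f"Lowest validation loss: epoch {best_by_loss[0]}")
print(f"Highest accuracy: epoch {best_by_accuracy[0]}")
```

Validation loss is lowest after the 2nd epoch and accuracy peaks after the 3rd, so a 4th epoch adds nothing on either criterion.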

### Weight Decay

Because the dataset is rather small and has many labels, the model may overfit quickly. Weight decay counteracts this by penalizing large weights, and a larger value means stronger regularization. We will try a value of 0.001.

In [None]:
model_3 = get_model('emilyalsentzer/Bio_ClinicalBERT')
trainer_3 = get_trainer(model_3,
                        get_args("model_3", learning_rate=1e-4, weight_decay=0.001),
                        tokenized_datasets_clinicalBERT["train"],
                        tokenized_datasets_clinicalBERT["validation"],
                        data_collator_clinicalBERT,
                        tokenizer_clinicalBERT
                        )
trainer_3.train()
trainer_3.evaluate()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.924542,0.44252,0.585417,0.504036,0.728895
2,1.176200,0.841584,0.525887,0.617917,0.568199,0.76183
3,0.636200,0.895939,0.528126,0.614167,0.567906,0.759865




{'eval_loss': 0.8959393501281738,
 'eval_precision': 0.528126119670369,
 'eval_recall': 0.6141666666666666,
 'eval_f1': 0.5679059911385089,
 'eval_accuracy': 0.7598648011318975,
 'eval_runtime': 2.1635,
 'eval_samples_per_second': 209.845,
 'eval_steps_per_second': 26.346,
 'epoch': 3.0}

In this case, a weight decay of 0.001 makes accuracy peak earlier (0.762 after the 2nd epoch), but accuracy then drops to 0.760 and the validation loss rises from 0.84 to 0.90, which again points towards overfitting. That's why we keep the original value.

# Precise Evaluation of the best Model with the Test Data

In [None]:
from transformers import pipeline
from collections import Counter
from sklearn.metrics import classification_report

To use the fine-tuned model for inference, we wrap it in a `pipeline()`. After instantiating the pipeline for NER with the model, we pass it the text.

In [None]:
model_1.to("cpu")  # move the model to the CPU first (or to "cuda", depending on what is available)
classifier = pipeline("ner", model=model_1, tokenizer=tokenizer_clinicalBERT)

def predict_tags(sentence):
  return classifier(sentence)

In [None]:
classifier("The left femoral artery was cannulated.")

[{'entity': 'B-Biological_structure',
  'score': 0.99034506,
  'index': 2,
  'word': 'left',
  'start': 4,
  'end': 8},
 {'entity': 'I-Biological_structure',
  'score': 0.99015266,
  'index': 3,
  'word': 'f',
  'start': 9,
  'end': 10},
 {'entity': 'I-Biological_structure',
  'score': 0.9918446,
  'index': 4,
  'word': '##em',
  'start': 10,
  'end': 12},
 {'entity': 'I-Biological_structure',
  'score': 0.9898303,
  'index': 5,
  'word': '##oral',
  'start': 12,
  'end': 16},
 {'entity': 'I-Biological_structure',
  'score': 0.9824426,
  'index': 6,
  'word': 'artery',
  'start': 17,
  'end': 23},
 {'entity': 'B-Therapeutic_procedure',
  'score': 0.9038626,
  'index': 8,
  'word': 'can',
  'start': 28,
  'end': 31},
 {'entity': 'I-Therapeutic_procedure',
  'score': 0.91639614,
  'index': 9,
  'word': '##nu',
  'start': 31,
  'end': 33},
 {'entity': 'I-Therapeutic_procedure',
  'score': 0.95666164,
  'index': 10,
  'word': '##lated',
  'start': 33,
  'end': 38}]

The classifier correctly extracts the entities from the sentence. However, they are returned not as a list of tags per word, but as one dictionary per sub-token with its entity information. To compare the output with the ground-truth tag lists from the original dataset, the results have to be post-processed.

A first step is to extract each token from each sentence with its start and end value.

In [None]:
def create_token_table(sentence):
    sentence = sentence.replace("\u2005", "").replace("\u200a", "").replace("\u2009", "").replace("\n", "")
    tokens = sentence.split()

    token_table = []
    start = 0

    for token in tokens:
        end = start + len(token)  # end offset = start offset + token length

        token_info = {
            "token": token,
            "start": start,
            "end": end
        }
        token_table.append(token_info)

        start = end + 1

    return token_table

Now the pipeline predictions are processed. Only predictions with a confidence score of 0.4 or higher are used. In addition, adjacent sub-tokens are merged, which reverses the tokenization into sub-tokens.

In [None]:
def preprocess_entities(sentence_info):
    sentence_info = [info for info in sentence_info if info['score'] >= 0.4]

    cleaned_entities = []
    current_entity = None
    current_start = None
    current_end = None

    for info in sentence_info:
        entity = info['entity']
        start = info['start']
        end = info['end']

        if current_entity is None:
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_start = start
            current_end = end
        elif current_end == start:
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_end = end
        else:
            cleaned_entities.append((current_entity, current_start, current_end))
            current_entity = entity.replace('B-', '').replace('I-', '')
            current_start = start
            current_end = end

    if current_entity is not None:
        cleaned_entities.append((current_entity, current_start, current_end))

    return cleaned_entities
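
To illustrate the merging step, here is a compact re-statement of the same idea applied to the pipeline output for "The left femoral artery was cannulated." It is a sketch, not a replacement for `preprocess_entities`: the score filter is omitted and the function name `merge_adjacent` is hypothetical.

```python
# (entity, start, end) for each sub-token, copied from the pipeline output above
sub_tokens = [
    ("B-Biological_structure", 4, 8),     # left
    ("I-Biological_structure", 9, 10),    # f
    ("I-Biological_structure", 10, 12),   # ##em
    ("I-Biological_structure", 12, 16),   # ##oral
    ("I-Biological_structure", 17, 23),   # artery
    ("B-Therapeutic_procedure", 28, 31),  # can
    ("I-Therapeutic_procedure", 31, 33),  # ##nu
    ("I-Therapeutic_procedure", 33, 38),  # ##lated
]


def merge_adjacent(sub_tokens):
    """Merge sub-tokens whose character spans touch into word-level spans."""
    merged = []
    for entity, start, end in sub_tokens:
        label = entity.split("-", 1)[1]  # drop the B-/I- prefix
        if merged and merged[-1][2] == start:
            # The previous span ends where this one starts: same word continues
            merged[-1] = (label, merged[-1][1], end)
        else:
            merged.append((label, start, end))
    return merged


for span in merge_adjacent(sub_tokens):
    print(span)
```

The eight sub-tokens collapse into four word-level spans: "left", "femoral", "artery" and "cannulated".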

In [None]:
def generate_tags(predicted_entities, sentence_entities):
    tags = ["O"] * len(sentence_entities)

    for entity, start, end in predicted_entities:
        for i, token_info in enumerate(sentence_entities):
            token_start = token_info['start']
            token_end = token_info['end']

            if token_start == start and token_end == end:
                tags[i] = entity
    return tags
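
Because a tag is only assigned when start and end offsets match exactly, it is worth checking how whitespace tokens and predicted spans line up. The sketch below uses `re.finditer` instead of the running counter in `create_token_table` (a swapped-in technique with the same purpose) and hard-codes the word-level spans that `preprocess_entities` would return for the example sentence.

```python
import re

sentence = "The left femoral artery was cannulated."

# Whitespace-separated tokens with their character offsets
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]

# Word-level spans for the example sentence: (label, start, end)
predicted = [
    ("Biological_structure", 4, 8),
    ("Biological_structure", 9, 16),
    ("Biological_structure", 17, 23),
    ("Therapeutic_procedure", 28, 38),
]

# Exact-offset lookup: a token gets a label only if both offsets match
span_to_label = {(start, end): label for label, start, end in predicted}
tags = [span_to_label.get((start, end), "O") for _, start, end in tokens]

for (token, start, end), tag in zip(tokens, tags):
    print(f"{token!r:15} {start:>2}-{end:<2} {tag}")
```

In this sketch the last token "cannulated." spans 28 to 39 because of the trailing period, while the predicted span ends at 38, so the token keeps the tag 'O'. This may be one reason why sentence-final tokens are missed, though that is an assumption rather than something verified on the full test set.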

With the `format_tags` function, we turn the raw tags into BIO tags, just like in the original data.

In [None]:
def format_tags(tags):
    formatted_tags = []
    current_entity = None

    for tag in tags:
        if tag == 'O':
            formatted_tags.append(tag)
            current_entity = None
        else:
            entity_type = tag
            if current_entity == entity_type:
                formatted_tags.append('I-' + entity_type)
            else:
                formatted_tags.append('B-' + entity_type)
                current_entity = entity_type

    return formatted_tags
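
A short, self-contained illustration of this conversion follows. The function `to_bio` is a hypothetical compact variant of `format_tags`, written only to show the expected input and output.

```python
def to_bio(raw_tags):
    """Prefix entity tags with B- (first token of a run) or I- (continuation)."""
    bio_tags = []
    previous = None
    for tag in raw_tags:
        if tag == "O":
            bio_tags.append("O")
            previous = None
        else:
            # Same entity type as the previous token: continue the entity
            prefix = "I-" if tag == previous else "B-"
            bio_tags.append(prefix + tag)
            previous = tag
    return bio_tags


raw = ["O", "Biological_structure", "Biological_structure",
       "Biological_structure", "O", "Therapeutic_procedure"]
print(to_bio(raw))
```

One consequence of this scheme is that two directly adjacent entities of the same type are always merged into a single `B-`/`I-` run, because the raw tags carry no boundary information.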

The function `get_sentence_tags` automates the whole process of turning a single sentence into a list of tags in BIO format.

In [None]:
def get_sentence_tags(sentence):
  token_table = create_token_table(sentence) # convert sentence to info table (where which word starts)
  sentence_info = predict_tags(sentence) # predict tags of sentence
  predicted_entities = preprocess_entities(sentence_info) # change format of pred tags
  general_tags = generate_tags(predicted_entities, token_table) # use sentence info and tags to get list
  boi_tags = format_tags(general_tags) # bring data in boi format
  return boi_tags

The predicted tags are stored in a list so that they can be compared with the original data.

In [None]:
predicted_entities = []
for sentence in data_dict["test"]["sentence"]:
  predicted_entities.append(get_sentence_tags(sentence))

To compute the classification report, we flatten the predicted and ground-truth tags into lists.

In [None]:
ground_truth_tags = []

for sublist in data_dict["test"]["tags"]:
    ground_truth_tags.extend(sublist)

In [None]:
pred_tags = []

for sublist in predicted_entities:
    pred_tags.extend(sublist)

In [None]:
count_ground_truth = Counter(ground_truth_tags)
count_ground_truth

Counter({'B-Diagnostic_procedure': 442,
         'I-Diagnostic_procedure': 377,
         'O': 4521,
         'B-Biological_structure': 248,
         'B-Sign_symptom': 312,
         'B-Detailed_description': 295,
         'I-Detailed_description': 189,
         'B-Clinical_event': 60,
         'B-Nonbiological_location': 27,
         'I-Biological_structure': 212,
         'B-Therapeutic_procedure': 78,
         'B-Lab_value': 213,
         'I-Lab_value': 209,
         'B-Volume': 3,
         'I-Volume': 5,
         'I-Sign_symptom': 131,
         'B-Disease_disorder': 116,
         'I-Disease_disorder': 78,
         'B-Distance': 5,
         'I-Distance': 8,
         'B-Severity': 26,
         'I-Therapeutic_procedure': 33,
         'B-Administration': 12,
         'B-Medication': 76,
         'B-Date': 51,
         'I-Date': 106,
         'B-Coreference': 23,
         'I-Coreference': 4,
         'B-History': 31,
         'I-History': 141,
         'B-Duration': 23,
         'I-Durati

In [None]:
count_predictions = Counter(pred_tags)
count_predictions

Counter({'B-Diagnostic_procedure': 422,
         'I-Diagnostic_procedure': 273,
         'O': 5627,
         'B-Biological_structure': 238,
         'I-Biological_structure': 149,
         'B-Sign_symptom': 232,
         'B-Detailed_description': 227,
         'I-Detailed_description': 89,
         'B-Clinical_event': 43,
         'B-Therapeutic_procedure': 68,
         'B-Lab_value': 173,
         'I-Therapeutic_procedure': 24,
         'B-Date': 67,
         'I-Date': 87,
         'B-Nonbiological_location': 28,
         'I-Lab_value': 62,
         'B-Volume': 1,
         'B-Disease_disorder': 95,
         'I-Sign_symptom': 50,
         'B-Distance': 2,
         'B-Severity': 22,
         'B-History': 39,
         'I-History': 71,
         'B-Duration': 23,
         'B-Medication': 67,
         'B-Coreference': 17,
         'I-Medication': 16,
         'I-Disease_disorder': 44,
         'I-Nonbiological_location': 15,
         'B-Administration': 9,
         'I-Duration': 14,
       

Finally, we can print the classification report.

In [None]:
report = classification_report(ground_truth_tags, pred_tags)
print(report)

                          precision    recall  f1-score   support

              B-Activity       0.00      0.00      0.00         6
        B-Administration       0.78      0.58      0.67        12
                   B-Age       1.00      0.79      0.88        19
  B-Biological_attribute       0.00      0.00      0.00         2
  B-Biological_structure       0.69      0.66      0.67       248
        B-Clinical_event       0.86      0.62      0.72        60
                 B-Color       1.00      0.40      0.57         5
           B-Coreference       0.29      0.22      0.25        23
                  B-Date       0.61      0.80      0.69        51
  B-Detailed_description       0.53      0.41      0.46       295
  B-Diagnostic_procedure       0.75      0.72      0.73       442
      B-Disease_disorder       0.55      0.45      0.49       116
              B-Distance       0.00      0.00      0.00         5
                B-Dosage       0.50      0.36      0.42        11
         



The Bio_ClinicalBERT model achieved an accuracy of 0.73 on the test dataset, meaning that 73% of the tokens received the correct tag (including the 'O' tag).

The model struggles with certain entities, such as 'B-Activity' or 'I-Shape'. These entities are relatively rare in the dataset, which may explain why the model has difficulty learning them.

In contrast, the model performs comparatively well on specific entities, such as 'B-Clinical_event', 'B-Diagnostic_procedure', and 'B-Sex'. In the report above, 'B-Clinical_event' reaches an F1-score of 0.72 and 'B-Diagnostic_procedure' 0.73, while 'B-Age' reaches 0.88 with a precision of 1.00.

Lastly, here is one example sentence with its ground-truth tags and its predicted tags:

In [None]:
data_dict["test"]["sentence"][10]

'The negative cardiolipin test excluded tabetic crises.'

In [None]:
data_dict["test"]["tags"][10]

['O',
 'B-Lab_value',
 'B-Diagnostic_procedure',
 'I-Diagnostic_procedure',
 'O',
 'B-Disease_disorder',
 'I-Disease_disorder']

In [None]:
predicted_entities[10]

['O',
 'B-Lab_value',
 'B-Diagnostic_procedure',
 'I-Diagnostic_procedure',
 'O',
 'B-Disease_disorder',
 'O']
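
Token-level accuracy, the kind of accuracy that `classification_report` reports for flat tag lists, can be checked by hand on this example. The sketch below compares the two lists position by position.

```python
# Tags copied from the two outputs above
ground_truth = ["O", "B-Lab_value", "B-Diagnostic_procedure",
                "I-Diagnostic_procedure", "O", "B-Disease_disorder",
                "I-Disease_disorder"]
predicted = ["O", "B-Lab_value", "B-Diagnostic_procedure",
             "I-Diagnostic_procedure", "O", "B-Disease_disorder", "O"]

# Count positions where prediction and ground truth agree
matches = sum(truth == pred for truth, pred in zip(ground_truth, predicted))
accuracy = matches / len(ground_truth)

# Collect (position, expected, predicted) for every disagreement
mismatches = [(i, truth, pred)
              for i, (truth, pred) in enumerate(zip(ground_truth, predicted))
              if truth != pred]

print(f"Token accuracy: {accuracy:.3f}")
print(mismatches)
```

Six of the seven tokens are tagged correctly. The only miss is the final token "crises.", whose 'I-Disease_disorder' tag is predicted as 'O'.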