# Preprocessing data

The MACCROBAT2018 dataset is a rich collection of annotated clinical language appropriate for training biomedical natural language processing systems. Each clinical case report comes as a .txt file (free text) and an .ann file (annotated entities), both of which need to be processed.

We want a dataframe with sentences, tokens and their corresponding tags.

First import the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import glob
import nltk
import re
nltk.download('punkt')
import os

The function `get_simple_table` processes .ann files and extracts the annotation data. It parses lines, splitting them into relevant components, and stores them in a dataframe. The resulting dataframe contains columns for ID, type, start, end, and text.

In [None]:
def get_simple_table(raw_ann):
  with open(raw_ann, 'r') as file:
      lines = file.readlines()

  data = []
  for line in lines:
      if line.startswith('T') or line.startswith('E'):
          line_data = line.split('\t')
          if len(line_data) >= 3:
              entity_id, entity_info, entity_text = line_data[0], line_data[1], line_data[2].strip()
              entity_info_split = entity_info.split(' ')
              if len(entity_info_split) >= 3:
                  entity_type, start, end = entity_info_split[0], entity_info_split[1], entity_info_split[2]
                  data.append([entity_id, entity_type, start, end, entity_text])

  return pd.DataFrame(data, columns=['ID', 'Type', 'Start', 'End', 'Text'])
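
To make the parsing step concrete, here is a minimal standalone sketch that applies the same splitting logic to a few lines. The sample lines and the helper name `parse_ann_line` are hypothetical and not taken from the dataset.

```python
# Hypothetical sample lines in brat standoff format (not taken from the dataset)
sample_lines = [
    "T1\tAge 18 29\t28-year-old\n",
    "T2\tSex 49 52\tman\n",
    "E1\tClinical_event:T3\n",  # event line: only two tab-separated fields
]

def parse_ann_line(line):
    """Return (id, type, start, end, text) for a text-bound annotation, else None."""
    fields = line.split("\t")
    if len(fields) < 3:
        return None
    info = fields[1].split(" ")
    if len(info) < 3:
        return None
    # Offsets stay strings here, as they do in get_simple_table
    return fields[0], info[0], info[1], info[2], fields[2].strip()

parsed = [parse_ann_line(line) for line in sample_lines]
print(parsed)
```

The event line is dropped because it has no third tab-separated field, which is the same effect the `len(line_data) >= 3` check has in `get_simple_table`.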

The function `get_BOI_table` takes the dataframe returned by `get_simple_table` and splits every entity into its words. The first word of an entity receives the prefix 'B-' (beginning) and each following word the prefix 'I-' (inside); start and end positions are computed per word. Words outside any entity are tagged 'O' later, in `get_annotated_data`. The resulting dataframe has the columns 'Type', 'Start', 'End', and 'Text' and contains the data in the BOI format.

In [None]:
def get_BOI_table(simple_table):
  rows = []

  for index, row in simple_table.iterrows():
      entity_start = int(row['Start'])

      # re.finditer reports where each word actually sits inside the entity text,
      # so a repeated word is not mapped to the offset of its first occurrence
      for i, match in enumerate(re.finditer(r'\S+', row['Text'])):
          new_type = f"{'B-' if i == 0 else 'I-'}{row['Type']}"

          rows.append({
              'Type': new_type,
              'Start': entity_start + match.start(),
              'End': entity_start + match.end(),
              'Text': match.group()
          })

  # Build the dataframe once instead of concatenating inside the loop
  return pd.DataFrame(rows, columns=['Type', 'Start', 'End', 'Text'])
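
As an illustration of the B-/I- split on a single entity, the following standalone sketch tags each word and computes its character offsets. The entity, its start offset and the helper name `split_entity` are hypothetical. It uses `re.finditer` so that a repeated word (the second "2" below) gets its own offset rather than the offset of its first occurrence.

```python
import re

def split_entity(entity_type, start, text):
    """Split one entity into (tag, start, end, word) tuples with B-/I- prefixes."""
    rows = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        prefix = "B-" if i == 0 else "I-"
        rows.append((prefix + entity_type,
                     start + match.start(),
                     start + match.end(),
                     match.group()))
    return rows

# Hypothetical entity starting at character offset 40 of its document
rows = split_entity("Frequency", 40, "2 times in 2 weeks")
for row in rows:
    print(row)
```

Only the first word carries the 'B-' prefix; every following word of the same entity is tagged 'I-'.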

`get_text` takes a file path as input, reads the file and returns its contents as a string.

In [None]:
def get_text(raw_text):
  with open(raw_text, 'r') as file:
    text = file.read()
  return text

The function `get_annotated_data` extracts text and entity tags from the input raw_text file using the BOI_table. It tokenizes the text into sentences, then tokenizes each sentence into words while considering punctuation. It matches word positions with entity tags from the BOI_table and constructs a dataframe with sentence-text and corresponding entity tags for each word.

In [None]:
def get_annotated_data(raw_text, BOI_table):
  sentences = nltk.sent_tokenize(get_text(raw_text))
  data = []

  pos = 0 # Start index for first word
  for sentence in sentences:
      words = sentence.split(" ")
      sentence_words = []
      sentence_tags = []

      for word in words:
          curr_word = word
          punctuation = '"!@#$%^&*()_+[]<>?:.,;'
          for c in word:
            if c in punctuation:
              curr_word = curr_word.replace(c, "")

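          start = pos
          end_mit = start + len(word)        # end offset with punctuation ("mit" = with)
          end_ohne = start + len(curr_word)  # end offset without punctuation ("ohne" = without)
          # A word is tagged only if its span matches a BOI_table row exactly
          tags = BOI_table[(BOI_table['Start'] == start) & (BOI_table['End'] == end_ohne)]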
          sentence_words.append(word)

          if tags.empty:
              sentence_tags.append('O')
          else:
              sentence_tags.append(tags.iloc[0,0])

          pos = end_mit + 1

      data.append({
          'sentence': ' '.join(sentence_words),
          'tags': sentence_tags
      })

  return pd.DataFrame(data)
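
The offset matching can be followed on a tiny example. This is a simplified sketch under stated assumptions: the text and the spans are hypothetical, a plain dictionary stands in for the BOI table, and words are assumed to be separated by exactly one space.

```python
# Hypothetical text and spans, standing in for a .txt file and its BOI table
text = "A 28-year-old man presented."
spans = {(2, 13): "B-Age", (14, 17): "B-Sex"}  # (start, end) -> tag

punctuation = '"!@#$%^&*()_+[]<>?:.,;'
words = text.split(" ")
tags = []
pos = 0  # character offset of the current word

for word in words:
    stripped = "".join(c for c in word if c not in punctuation)
    # Look up the span that starts at pos and ends where the stripped word ends
    tags.append(spans.get((pos, pos + len(stripped)), "O"))
    pos += len(word) + 1  # skip the word and the single space after it

print(list(zip(words, tags)))
```

Because the lookup requires an exact match on both offsets, any drift in `pos` (for example from double spaces or several line breaks between sentences) would silently turn entity words into 'O'.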

With all functions defined, the text and annotation files in the 'MACCROBAT' directory can be processed. The loop iterates through the file pairs, extracts the entity tags from each annotation file and associates them with the text data. The resulting dataframe contains sentences and their corresponding tags.

In [None]:
path = './MACCROBAT'

txt_files = glob.glob(os.path.join(path, '*.txt'))
ann_files = glob.glob(os.path.join(path, '*.ann'))

txt_files.sort()
ann_files.sort()

dataframe = pd.DataFrame(columns=["sentence", "tags"])

for txt_file, ann_file in zip(txt_files, ann_files):
  simple_table = get_simple_table(ann_file)
  boi_table = get_BOI_table(simple_table)
  annotated_data = get_annotated_data(txt_file, boi_table)
  dataframe = pd.concat([dataframe, annotated_data], ignore_index=True)
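
Pairing the two sorted lists with `zip` relies on every .txt file having a matching .ann file. A small sketch of a more defensive pairing by base name is shown here; the helper `pair_by_stem` and the file names are hypothetical.

```python
import os

def pair_by_stem(txt_files, ann_files):
    """Pair .txt and .ann paths that share a base name; also return unmatched stems."""
    def stem(path):
        return os.path.splitext(os.path.basename(path))[0]

    txt_by_stem = {stem(p): p for p in txt_files}
    ann_by_stem = {stem(p): p for p in ann_files}
    common = sorted(txt_by_stem.keys() & ann_by_stem.keys())
    unmatched = sorted(txt_by_stem.keys() ^ ann_by_stem.keys())
    return [(txt_by_stem[s], ann_by_stem[s]) for s in common], unmatched

# Hypothetical file names for illustration
pairs, unmatched = pair_by_stem(
    ["./MACCROBAT/100.txt", "./MACCROBAT/101.txt"],
    ["./MACCROBAT/100.ann", "./MACCROBAT/102.ann"],
)
print(pairs)
print(unmatched)
```

If the directory is complete, `unmatched` is empty and the pairs are the same ones the sorted `zip` produces.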

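The unique values in the 'Type' column of `boi_table` are printed. Because `boi_table` is overwritten in every loop iteration, this shows only the tags of the last processed file.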

In [None]:
unique_values = boi_table['Type'].unique()
print(unique_values)

['B-Age' 'I-Age' 'B-Sex' 'B-Clinical_event' 'B-Nonbiological_location'
 'B-Sign_symptom' 'B-Biological_structure' 'I-Sign_symptom'
 'B-Detailed_description' 'I-Detailed_description' 'B-History' 'I-History'
 'B-Family_history' 'I-Family_history' 'B-Diagnostic_procedure'
 'I-Diagnostic_procedure' 'I-Biological_structure' 'B-Distance'
 'I-Distance' 'B-Lab_value' 'I-Lab_value' 'B-Disease_disorder' 'B-Shape'
 'I-Shape' 'B-Coreference' 'B-Volume' 'I-Volume' 'B-Therapeutic_procedure'
 'I-Therapeutic_procedure' 'B-Area' 'I-Area' 'B-Duration' 'I-Duration'
 'B-Date' 'I-Date' 'B-Color' 'I-Color']
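
To list the tags of the whole corpus rather than of a single file, the tag lists of all sentences can be flattened. A minimal sketch with a hypothetical stand-in for the 'tags' column:

```python
# Hypothetical 'tags' column: one list of tags per sentence
tag_lists = [
    ["O", "B-Age", "B-Sex"],
    ["B-Medication", "I-Medication", "O"],
]

# Flatten all lists into one set, then sort for a stable display order
all_tags = sorted({tag for tags in tag_lists for tag in tags})
print(all_tags)
```

On the real dataframe, `dataframe['tags'].explode().unique()` should give the same set of values, assuming a pandas version that provides `Series.explode`.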


In [None]:
dataframe

Unnamed: 0,sentence,tags
0,CASE: A 28-year-old previously healthy man pre...,"[O, O, B-Age, B-History, I-History, B-Sex, B-C..."
1,"The symptoms occurred during rest, 2–3 times p...","[O, B-Coreference, O, O, B-Clinical_event, B-F..."
2,Except for a grade 2/6 holosystolic tricuspid ...,"[O, O, O, B-Lab_value, I-Lab_value, B-Detailed..."
3,An electrocardiogram (ECG) revealed normal sin...,"[O, B-Diagnostic_procedure, O, O, B-Lab_value,..."
4,Transthoracic echocardiography demonstrated th...,"[B-Biological_structure, B-Diagnostic_procedur..."
...,...,...
4537,MHL was diagnosed (Fig.3).,"[B-Disease_disorder, O, O, O]"
4538,Immunohistochemistry results (Fig.4) were the ...,"[B-Diagnostic_procedure, I-Diagnostic_procedur..."
4539,"After 9 days of recovery, the patient returned...","[O, B-Duration, I-Duration, O, B-Therapeutic_p..."
4540,"A follow-up examination, which included blood ...","[O, B-Clinical_event, O, O, O, B-Diagnostic_pr..."


In [None]:
def count_unique_tokens_in_column(dataframe, column_name):
    unique_tokens = set()

    for tokens_list in dataframe[column_name]:
        cleaned_tokens = [re.sub(r'^(B-|I-)', '', token) for token in tokens_list]
        unique_tokens.update(cleaned_tokens)

    token_counts = {}
    for token in unique_tokens:
        count = sum(dataframe[column_name].apply(lambda tokens: re.search(fr'\b{re.escape(token)}\b', ' '.join(tokens)) is not None))
        token_counts[token] = count

    sorted_token_counts = dict(sorted(token_counts.items(), key=lambda item: item[1], reverse=True))
    tags_freq = pd.DataFrame(list(sorted_token_counts.items()), columns=['Token', 'Anzahl'])

    return tags_freq

In [None]:
print(count_unique_tokens_in_column(dataframe, "tags"))

                     Token  Anzahl
0                        O    4527
1     Diagnostic_procedure    2215
2             Sign_symptom    1964
3     Detailed_description    1686
4     Biological_structure    1591
5                Lab_value    1400
6         Disease_disorder     946
7    Therapeutic_procedure     665
8                     Date     640
9           Clinical_event     567
10              Medication     567
11                Severity     318
12  Nonbiological_location     307
13             Coreference     272
14                 History     225
15                Duration     220
16                     Age     204
17                  Dosage     195
18                     Sex     190
19          Administration     123
20                Distance      90
21                Activity      70
22               Frequency      68
23                   Shape      56
24          Family_history      53
25     Personal_background      46
26                   Color      46
27                  

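Now we know which tags occur in the entire dataset. Note that 'Anzahl' (German for "count") is the number of sentences that contain a tag at least once, not the number of tagged tokens.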
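
The difference between counting sentences and counting entity mentions can be seen in a small sketch based on `collections.Counter`. The tag lists and the helper `strip_prefix` are hypothetical.

```python
from collections import Counter

# Hypothetical 'tags' column: one list of tags per sentence
tag_lists = [
    ["B-Age", "O", "B-Sex", "O"],
    ["B-Age", "I-Age", "O"],
]

def strip_prefix(tag):
    """Remove a leading 'B-' or 'I-' so that both map to the entity type."""
    return tag[2:] if tag.startswith(("B-", "I-")) else tag

# Number of sentences in which each entity type occurs at least once
sentence_counts = Counter(
    t for tags in tag_lists for t in {strip_prefix(x) for x in tags}
)
# Number of entity mentions: every 'B-' tag starts exactly one mention
mention_counts = Counter(
    t[2:] for tags in tag_lists for t in tags if t.startswith("B-")
)

print(sentence_counts)
print(mention_counts)
```

Which of the two counts is more useful depends on the question: sentence counts show how widely a type is spread, mention counts show how many training examples exist for it.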


Next, the sentences are tokenized, and the tokens are stored in a new column of the dataframe.

In [None]:
def tokenize_sentence(sentence):
    tokens = sentence.split()
    cleaned_tokens = []
    punctuation = '"!@#$^*()_+[]<>?:.,;'
    for word in tokens:
        cleaned_word = ''.join(c for c in word if c not in punctuation).lower()
        cleaned_tokens.append(cleaned_word)

    return cleaned_tokens

dataframe['tokens'] = dataframe['sentence'].apply(tokenize_sentence)
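
The tokenizer keeps one token per whitespace-separated word, which is what keeps tokens and tags aligned. The following standalone sketch repeats the cleaning logic under the hypothetical name `clean` and shows one consequence: a word made up only of punctuation becomes an empty string instead of being dropped. The second sentence is a made-up example.

```python
punctuation = '"!@#$^*()_+[]<>?:.,;'

def clean(sentence):
    """Lower-case each whitespace-separated word and strip the listed punctuation."""
    return ["".join(c for c in w if c not in punctuation).lower()
            for w in sentence.split()]

tokens = clean("MHL was diagnosed (Fig.3).")
with_empty = clean("Weight : 70 kg")  # hypothetical sentence with a lone colon

print(tokens)
print(with_empty)
```

Empty tokens may need special handling later, but removing them at this point would shift the tokens against the 'tags' column.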

For later processing we will need a dictionary that maps the entity tags to unique numbers.

In [None]:
label_dict = {'O': 0, 'B-Age': 1, 'I-Age': 2, 'B-Sex': 3, 'I-Sex': 4, 'B-Clinical_event': 5,
              'I-Clinical_event': 6, 'B-Nonbiological_location': 7, 'I-Nonbiological_location': 8,
              'B-Sign_symptom': 9, 'I-Sign_symptom': 10, 'B-Biological_structure': 11, 'I-Biological_structure': 12,
              'B-Detailed_description': 13, 'I-Detailed_description': 14, 'B-History': 15, 'I-History': 16, 'B-Family_history': 17,
              'I-Family_history': 18, 'B-Diagnostic_procedure': 19, 'I-Diagnostic_procedure': 20, 'B-Distance': 21,
              'I-Distance': 22, 'B-Lab_value': 23, 'I-Lab_value': 24, 'B-Disease_disorder': 25, 'I-Disease_disorder': 26,
              'B-Shape': 27, 'I-Shape': 28, 'B-Coreference': 29, 'I-Coreference': 30, 'B-Volume': 31, 'I-Volume': 32,
              'B-Therapeutic_procedure': 33, 'I-Therapeutic_procedure': 34, 'B-Area': 35, 'I-Area': 36, 'B-Duration': 37,
              'I-Duration': 38, 'B-Date': 39, 'I-Date': 40, 'B-Color': 41, 'I-Color': 42, 'B-Frequency': 43, 'I-Frequency': 44,
              'B-Texture': 45, 'I-Texture': 46, 'B-Biological_attribute': 47, 'I-Biological_attribute': 48, 'B-Severity': 49,
              'I-Severity': 50, 'B-Activity': 51, 'I-Activity': 52, 'B-Outcome': 53, 'I-Outcome': 54, 'B-Personal_background': 55,
              'I-Personal_background': 56, 'B-Medication': 57, 'I-Medication': 58, 'B-Dosage': 59, 'I-Dosage': 60, 'B-Other_event': 61,
              'I-Other_event': 62, 'B-Administration': 63, 'I-Administration': 64, 'B-Occupation': 65, 'I-Occupation': 66,
              'B-Other_entity': 67, 'I-Other_entity': 68, 'B-Time': 69, 'I-Time': 70, 'B-Subject': 71, 'I-Subject': 72,
              'B-Quantitative_concept': 73, 'I-Quantitative_concept': 74, 'B-Height': 75, 'I-Height': 76, 'B-Mass': 77, 'I-Mass': 78,
              'B-Weight': 79, 'I-Weight': 80, 'B-Qualitative_concept': 81, 'I-Qualitative_concept': 82}

In [None]:
id2label = {i: label for i, label in enumerate(label_dict)}
label2id = {v: k for k, v in id2label.items()}

Additionally, the mapped ids of the labels are stored inside the dataframe.

In [None]:
def map_labels_to_ids(label_list):
    return [label2id[label] for label in label_list]

dataframe['numeric_tags'] = dataframe['tags'].apply(map_labels_to_ids)
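
The mapping can be checked with a round trip from labels to ids and back. This sketch uses a shortened stand-in for the full dictionary, and the fallback to 'O' for unknown labels is a hypothetical addition, not part of the notebook's function.

```python
label_dict = {"O": 0, "B-Age": 1, "I-Age": 2}  # shortened stand-in for the full mapping

label2id = dict(label_dict)
id2label = {i: label for label, i in label_dict.items()}

def map_labels_to_ids(labels, unknown="O"):
    # Hypothetical fallback: labels missing from the mapping are treated as 'O'
    return [label2id.get(label, label2id[unknown]) for label in labels]

ids = map_labels_to_ids(["O", "B-Age", "I-Age", "B-Unknown"])
back = [id2label[i] for i in ids]

print(ids)
print(back)
```

A silent fallback hides gaps in the dictionary; raising a `KeyError`, as the plain dictionary lookup does, may be preferable while the label set is still being assembled.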

The dataframe is now preprocessed and has the following columns:

- sentence: the whole sentence, unprocessed
- tags: the tag of each token
- tokens: the tokens of each sentence, lower-cased, with punctuation removed (except -, &, %)
- numeric_tags: the numeric id of each tag

In [None]:
display(dataframe)

Unnamed: 0,sentence,tags,tokens,numeric_tags
0,CASE: A 28-year-old previously healthy man pre...,"[O, O, B-Age, B-History, I-History, B-Sex, B-C...","[case, a, 28-year-old, previously, healthy, ma...","[0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9]"
1,"The symptoms occurred during rest, 2–3 times p...","[O, B-Coreference, O, O, B-Clinical_event, B-F...","[the, symptoms, occurred, during, rest, 2–3, t...","[0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14..."
2,Except for a grade 2/6 holosystolic tricuspid ...,"[O, O, O, B-Lab_value, I-Lab_value, B-Detailed...","[except, for, a, grade, 2/6, holosystolic, tri...","[0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1..."
3,An electrocardiogram (ECG) revealed normal sin...,"[O, B-Diagnostic_procedure, O, O, B-Lab_value,...","[an, electrocardiogram, ecg, revealed, normal,...","[0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,..."
4,Transthoracic echocardiography demonstrated th...,"[B-Biological_structure, B-Diagnostic_procedur...","[transthoracic, echocardiography, demonstrated...","[11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ..."
...,...,...,...,...
4537,MHL was diagnosed (Fig.3).,"[B-Disease_disorder, O, O, O]","[mhl, was, diagnosed, fig3]","[25, 0, 0, 0]"
4538,Immunohistochemistry results (Fig.4) were the ...,"[B-Diagnostic_procedure, I-Diagnostic_procedur...","[immunohistochemistry, results, fig4, were, th...","[19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ..."
4539,"After 9 days of recovery, the patient returned...","[O, B-Duration, I-Duration, O, B-Therapeutic_p...","[after, 9, days, of, recovery, the, patient, r...","[0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9]"
4540,"A follow-up examination, which included blood ...","[O, B-Clinical_event, O, O, O, B-Diagnostic_pr...","[a, follow-up, examination, which, included, b...","[0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,..."


In [None]:
dataframe.to_csv('data.csv', index=False)
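
One thing to keep in mind when the file is read back: list-valued columns are written to CSV as their string representation. The sketch below reproduces this with the standard `csv` module and an in-memory buffer standing in for data.csv, then restores the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# In-memory stand-in for data.csv; the list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence", "tags"])
writer.writerow(["MHL was diagnosed (Fig.3).",
                 str(["B-Disease_disorder", "O", "O", "O"])])

buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw = rows[0]["tags"]          # still a string such as "['B-Disease_disorder', ...]"
tags = ast.literal_eval(raw)   # parsed back into a Python list

print(type(raw).__name__, type(tags).__name__, tags)
```

With pandas, the same conversion can be requested while loading, for example `pd.read_csv('data.csv', converters={'tags': ast.literal_eval})`, and likewise for the 'tokens' and 'numeric_tags' columns.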

In [None]:
from google.colab import files
files.download('data.csv')
