{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Preprocessing data\n",
        "\n",
        "The MACCROBAT2018 dataset is a rich collection of annotated clinical language appropriate for training biomedical natural language processing systems. Each clinical case report comes as a .txt file (free text) and an .ann file (annotated entities), both of which need to be processed.\n",
        "\n",
        "We want a dataframe that holds each sentence together with the corresponding tag for each of its tokens."
      ],
      "metadata": {
        "id": "ItZ4Op-l2lQ8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "First import the necessary libraries."
      ],
      "metadata": {
        "id": "aQQ6T5Hj3WLy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import glob\n",
        "import nltk\n",
        "import re\n",
        "nltk.download('punkt')\n",
        "import os"
      ],
      "metadata": {
        "id": "1k1A2pWlglPB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The function `get_simple_table` processes .ann files and extracts the annotation data. It parses lines, splitting them into relevant components, and stores them in a dataframe. The resulting dataframe contains columns for ID, type, start, end, and text."
      ],
      "metadata": {
        "id": "BL2iE_ss3zgQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_simple_table(raw_ann):\n",
        "  with open(raw_ann, 'r') as file:\n",
        "      lines = file.readlines()\n",
        "\n",
        "  data = []\n",
        "  for line in lines:\n",
        "      if line.startswith('T') or line.startswith('E'):\n",
        "          line_data = line.split('\\t')\n",
        "          if len(line_data) >= 3:\n",
        "              entity_id, entity_info, entity_text = line_data[0], line_data[1], line_data[2].strip()\n",
        "              entity_info_split = entity_info.split(' ')\n",
        "              if len(entity_info_split) >= 3:\n",
        "                  entity_type, start, end = entity_info_split[0], entity_info_split[1], entity_info_split[2]\n",
        "                  data.append([entity_id, entity_type, start, end, entity_text])\n",
        "\n",
        "  return pd.DataFrame(data, columns=['ID', 'Type', 'Start', 'End', 'Text'])"
      ],
      "metadata": {
        "id": "bNaHTbuZodtQ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The function `get_BOI_table` takes the dataframe produced by `get_simple_table` and splits the text of each entity into words. The first word of an entity is tagged 'B-' (beginning) and every following word 'I-' (inside), and each word receives its own start and end position. Words outside any entity are tagged 'O' later, in `get_annotated_data`. The resulting dataframe has the columns 'Type', 'Start', 'End', and 'Text' and holds the data in the BOI format (more commonly written BIO or IOB)."
      ],
      "metadata": {
        "id": "NDCMwGl45YdK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_BOI_table(simple_table):\n",
        "  rows = []\n",
        "\n",
        "  for _, row in simple_table.iterrows():\n",
        "      text = row['Text']\n",
        "      entity_start = int(row['Start'])\n",
        "      cursor = 0  # search position, so repeated words get successive offsets\n",
        "\n",
        "      for i, word in enumerate(text.split()):\n",
        "          # Search from the cursor; a plain index(word) always finds the first occurrence.\n",
        "          offset = text.index(word, cursor)\n",
        "          cursor = offset + len(word)\n",
        "          new_start = entity_start + offset\n",
        "\n",
        "          rows.append({\n",
        "              'Type': f\"{'B-' if i == 0 else 'I-'}{row['Type']}\",\n",
        "              'Start': new_start,\n",
        "              'End': new_start + len(word),\n",
        "              'Text': word\n",
        "          })\n",
        "\n",
        "  # Build the dataframe once instead of concatenating inside the loop.\n",
        "  return pd.DataFrame(rows, columns=['Type', 'Start', 'End', 'Text'])"
      ],
      "metadata": {
        "id": "n7NmygxbzXhu"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "`get_text` takes a file path as input, reads the file and returns its contents as a single string."
      ],
      "metadata": {
        "id": "WzCW05xO6NP-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_text(raw_text):\n",
        "  with open(raw_text, 'r') as file:\n",
        "    text = file.read()\n",
        "  return text"
      ],
      "metadata": {
        "id": "7XHT3mjw6_s8"
      },
      "execution_count": null,
      "outputs": []
    },
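    {
      "cell_type": "markdown",
      "source": [
        "The function `get_annotated_data` combines the text of the input raw_text file with the entity tags in the BOI_table. It splits the text into sentences with NLTK, splits each sentence into words on spaces, and strips punctuation from each word before comparing the word's start and end positions with the entries of the BOI_table. A word without a matching entry is tagged 'O'. Positions are tracked by assuming exactly one separator character between consecutive words and between consecutive sentences. The result is a dataframe with one row per sentence, holding the sentence text and the list of tags for its words."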
      ],
      "metadata": {
        "id": "R8GF6MB_6nVW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_annotated_data(raw_text, BOI_table):\n",
        "  sentences = nltk.sent_tokenize(get_text(raw_text))\n",
        "  data = []\n",
        "\n",
        "  pos = 0 # Start index for first word\n",
        "  for sentence in sentences:\n",
        "      words = sentence.split(\" \")\n",
        "      sentence_words = []\n",
        "      sentence_tags = []\n",
        "\n",
        "      for word in words:\n",
        "          curr_word = word\n",
        "          punctuation = '\"!@#$%^&*()_+[]<>?:.,;'\n",
        "          for c in word:\n",
        "            if c in punctuation:\n",
        "              curr_word = curr_word.replace(c, \"\")\n",
        "\n",
        "          start = pos\n",
        "          end_mit = start + len(word)\n",
        "          end_ohne = start + len(curr_word)\n",
        "          tags = BOI_table[(BOI_table['Start'] == start) & (BOI_table['End'] == end_ohne)]\n",
        "          sentence_words.append(word)\n",
        "\n",
        "          if tags.empty:\n",
        "              sentence_tags.append('O')\n",
        "          else:\n",
        "              sentence_tags.append(tags.iloc[0,0])\n",
        "\n",
        "          pos = end_mit + 1\n",
        "\n",
        "      data.append({\n",
        "          'sentence': ' '.join(sentence_words),\n",
        "          'tags': sentence_tags\n",
        "      })\n",
        "\n",
        "  return pd.DataFrame(data)"
      ],
      "metadata": {
        "id": "zXwbmdK2AEVh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "With all functions defined, we process the collection of text and annotation files in the 'MACCROBAT' directory. The loop pairs each .txt file with its .ann file by sorted file name, extracts the entity tags from the annotations and associates them with the text. The resulting dataframe contains the sentences and their corresponding tags."
      ],
      "metadata": {
        "id": "AM_bOUdo68mc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "path = './MACCROBAT'\n",
        "\n",
        "txt_files = glob.glob(os.path.join(path, '*.txt'))\n",
        "ann_files = glob.glob(os.path.join(path, '*.ann'))\n",
        "\n",
        "txt_files.sort()\n",
        "ann_files.sort()\n",
        "\n",
        "dataframe = pd.DataFrame(columns=[\"sentence\", \"tags\"])\n",
        "\n",
        "for txt_file, ann_file in zip(txt_files, ann_files):\n",
        "  simple_table = get_simple_table(ann_file)\n",
        "  boi_table = get_BOI_table(simple_table)\n",
        "  annotated_data = get_annotated_data(txt_file, boi_table)\n",
        "  dataframe = pd.concat([dataframe, annotated_data], ignore_index=True)"
      ],
      "metadata": {
        "id": "v2jTMwDj_leQ"
      },
      "execution_count": null,
      "outputs": []
    },
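    {
      "cell_type": "markdown",
      "source": [
        "The unique values in the 'Type' column of `boi_table` are printed. Note that `boi_table` holds only the last file processed in the loop above, so this list does not cover the entire dataset."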
      ],
      "metadata": {
        "id": "9TIFWeW17q90"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "unique_values = boi_table['Type'].unique()\n",
        "print(unique_values)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2N8-VRjJEC5t",
        "outputId": "1c3d2e8c-6319-42ae-c71d-30c8ca8d9087"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['B-Age' 'I-Age' 'B-Sex' 'B-Clinical_event' 'B-Nonbiological_location'\n",
            " 'B-Sign_symptom' 'B-Biological_structure' 'I-Sign_symptom'\n",
            " 'B-Detailed_description' 'I-Detailed_description' 'B-History' 'I-History'\n",
            " 'B-Family_history' 'I-Family_history' 'B-Diagnostic_procedure'\n",
            " 'I-Diagnostic_procedure' 'I-Biological_structure' 'B-Distance'\n",
            " 'I-Distance' 'B-Lab_value' 'I-Lab_value' 'B-Disease_disorder' 'B-Shape'\n",
            " 'I-Shape' 'B-Coreference' 'B-Volume' 'I-Volume' 'B-Therapeutic_procedure'\n",
            " 'I-Therapeutic_procedure' 'B-Area' 'I-Area' 'B-Duration' 'I-Duration'\n",
            " 'B-Date' 'I-Date' 'B-Color' 'I-Color']\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "dataframe"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 411
        },
        "id": "07I-FYpOeH5z",
        "outputId": "0b592667-1020-42c0-d03d-45674bd87906"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                               sentence  \\\n",
              "0     CASE: A 28-year-old previously healthy man pre...   \n",
              "1     The symptoms occurred during rest, 2–3 times p...   \n",
              "2     Except for a grade 2/6 holosystolic tricuspid ...   \n",
              "3     An electrocardiogram (ECG) revealed normal sin...   \n",
              "4     Transthoracic echocardiography demonstrated th...   \n",
              "...                                                 ...   \n",
              "4537                         MHL was diagnosed (Fig.3).   \n",
              "4538  Immunohistochemistry results (Fig.4) were the ...   \n",
              "4539  After 9 days of recovery, the patient returned...   \n",
              "4540  A follow-up examination, which included blood ...   \n",
              "4541   No adverse or unanticipated event was presented.   \n",
              "\n",
              "                                                   tags  \n",
              "0     [O, O, B-Age, B-History, I-History, B-Sex, B-C...  \n",
              "1     [O, B-Coreference, O, O, B-Clinical_event, B-F...  \n",
              "2     [O, O, O, B-Lab_value, I-Lab_value, B-Detailed...  \n",
              "3     [O, B-Diagnostic_procedure, O, O, B-Lab_value,...  \n",
              "4     [B-Biological_structure, B-Diagnostic_procedur...  \n",
              "...                                                 ...  \n",
              "4537                      [B-Disease_disorder, O, O, O]  \n",
              "4538  [B-Diagnostic_procedure, I-Diagnostic_procedur...  \n",
              "4539  [O, B-Duration, I-Duration, O, B-Therapeutic_p...  \n",
              "4540  [O, B-Clinical_event, O, O, O, B-Diagnostic_pr...  \n",
              "4541  [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...  \n",
              "\n",
              "[4542 rows x 2 columns]"
            ]
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def count_unique_tokens_in_column(dataframe, column_name):\n",
        "    unique_tokens = set()\n",
        "\n",
        "    for tokens_list in dataframe[column_name]:\n",
        "        cleaned_tokens = [re.sub(r'^(B-|I-)', '', token) for token in tokens_list]\n",
        "        unique_tokens.update(cleaned_tokens)\n",
        "\n",
        "    # Count the rows (sentences) in which each tag occurs at least once.\n",
        "    token_counts = {}\n",
        "    for token in unique_tokens:\n",
        "        count = sum(dataframe[column_name].apply(lambda tokens: re.search(fr'\\b{re.escape(token)}\\b', ' '.join(tokens)) is not None))\n",
        "        token_counts[token] = count\n",
        "\n",
        "    sorted_token_counts = dict(sorted(token_counts.items(), key=lambda item: item[1], reverse=True))\n",
        "    tags_freq = pd.DataFrame(list(sorted_token_counts.items()), columns=['Token', 'Anzahl'])\n",
        "\n",
        "    return tags_freq"
      ],
      "metadata": {
        "id": "537IWK34geSH"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(count_unique_tokens_in_column(dataframe, \"tags\"))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3mDNKpsQge94",
        "outputId": "80a7dfa6-2573-4658-bcb0-dae280479a32"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "                     Token  Anzahl\n",
            "0                        O    4527\n",
            "1     Diagnostic_procedure    2215\n",
            "2             Sign_symptom    1964\n",
            "3     Detailed_description    1686\n",
            "4     Biological_structure    1591\n",
            "5                Lab_value    1400\n",
            "6         Disease_disorder     946\n",
            "7    Therapeutic_procedure     665\n",
            "8                     Date     640\n",
            "9           Clinical_event     567\n",
            "10              Medication     567\n",
            "11                Severity     318\n",
            "12  Nonbiological_location     307\n",
            "13             Coreference     272\n",
            "14                 History     225\n",
            "15                Duration     220\n",
            "16                     Age     204\n",
            "17                  Dosage     195\n",
            "18                     Sex     190\n",
            "19          Administration     123\n",
            "20                Distance      90\n",
            "21                Activity      70\n",
            "22               Frequency      68\n",
            "23                   Shape      56\n",
            "24          Family_history      53\n",
            "25     Personal_background      46\n",
            "26                   Color      46\n",
            "27                    Time      46\n",
            "28                 Subject      42\n",
            "29                 Texture      40\n",
            "30                 Outcome      38\n",
            "31     Qualitative_concept      32\n",
            "32                    Area      31\n",
            "33             Other_event      26\n",
            "34    Quantitative_concept      25\n",
            "35                  Volume      18\n",
            "36            Other_entity      17\n",
            "37              Occupation      13\n",
            "38    Biological_attribute       8\n",
            "39                  Weight       4\n",
            "40                  Height       4\n",
            "41                    Mass       2\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now we know which tags occur in the entire dataset."
      ],
      "metadata": {
        "id": "1r4rKXp27xo6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dataframe"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 494
        },
        "id": "PDH5LlsfAn-p",
        "outputId": "2fc9883a-5130-4ee0-c2a3-0ce231893eb8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                               sentence  \\\n",
              "0     CASE: A 28-year-old previously healthy man pre...   \n",
              "1     The symptoms occurred during rest, 2–3 times p...   \n",
              "2     Except for a grade 2/6 holosystolic tricuspid ...   \n",
              "3     An electrocardiogram (ECG) revealed normal sin...   \n",
              "4     Transthoracic echocardiography demonstrated th...   \n",
              "...                                                 ...   \n",
              "4537                         MHL was diagnosed (Fig.3).   \n",
              "4538  Immunohistochemistry results (Fig.4) were the ...   \n",
              "4539  After 9 days of recovery, the patient returned...   \n",
              "4540  A follow-up examination, which included blood ...   \n",
              "4541   No adverse or unanticipated event was presented.   \n",
              "\n",
              "                                                   tags  \n",
              "0     [O, O, B-Age, B-History, I-History, B-Sex, B-C...  \n",
              "1     [O, B-Coreference, O, O, B-Clinical_event, B-F...  \n",
              "2     [O, O, O, B-Lab_value, I-Lab_value, B-Detailed...  \n",
              "3     [O, B-Diagnostic_procedure, O, O, B-Lab_value,...  \n",
              "4     [B-Biological_structure, B-Diagnostic_procedur...  \n",
              "...                                                 ...  \n",
              "4537                      [B-Disease_disorder, O, O, O]  \n",
              "4538  [B-Diagnostic_procedure, I-Diagnostic_procedur...  \n",
              "4539  [O, B-Duration, I-Duration, O, B-Therapeutic_p...  \n",
              "4540  [O, B-Clinical_event, O, O, O, B-Diagnostic_pr...  \n",
              "4541  [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...  \n",
              "\n",
              "[4542 rows x 2 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-3caf9a33-975f-4ac4-b7a6-2a746733d2fd\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>sentence</th>\n",
              "      <th>tags</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>CASE: A 28-year-old previously healthy man pre...</td>\n",
              "      <td>[O, O, B-Age, B-History, I-History, B-Sex, B-C...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>The symptoms occurred during rest, 2–3 times p...</td>\n",
              "      <td>[O, B-Coreference, O, O, B-Clinical_event, B-F...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Except for a grade 2/6 holosystolic tricuspid ...</td>\n",
              "      <td>[O, O, O, B-Lab_value, I-Lab_value, B-Detailed...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>An electrocardiogram (ECG) revealed normal sin...</td>\n",
              "      <td>[O, B-Diagnostic_procedure, O, O, B-Lab_value,...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Transthoracic echocardiography demonstrated th...</td>\n",
              "      <td>[B-Biological_structure, B-Diagnostic_procedur...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4537</th>\n",
              "      <td>MHL was diagnosed (Fig.3).</td>\n",
              "      <td>[B-Disease_disorder, O, O, O]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4538</th>\n",
              "      <td>Immunohistochemistry results (Fig.4) were the ...</td>\n",
              "      <td>[B-Diagnostic_procedure, I-Diagnostic_procedur...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4539</th>\n",
              "      <td>After 9 days of recovery, the patient returned...</td>\n",
              "      <td>[O, B-Duration, I-Duration, O, B-Therapeutic_p...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4540</th>\n",
              "      <td>A follow-up examination, which included blood ...</td>\n",
              "      <td>[O, B-Clinical_event, O, O, O, B-Diagnostic_pr...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4541</th>\n",
              "      <td>No adverse or unanticipated event was presented.</td>\n",
              "      <td>[O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>4542 rows × 2 columns</p>\n",
              "</div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
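        "Next, each sentence is split on whitespace into tokens; the listed punctuation characters are stripped and the tokens are lowercased. The tokens are stored in a new `tokens` column of the dataframe."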
      ],
      "metadata": {
        "id": "SzWRDXkw8Tn8"
      }
    },
    {
      "cell_type": "code",
      "source": [
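        "def tokenize_sentence(sentence):\n",
        "    \"\"\"Split a sentence on whitespace, strip the listed punctuation and lowercase each token.\"\"\"\n",
        "    tokens = sentence.split()\n",
        "    cleaned_tokens = []\n",
        "    # Characters removed from every token; others such as '-', '&', '%' and '/' are kept.\n",
        "    punctuation = '\"!@#$^*()_+[]<>?:.,;'\n",
        "    for word in tokens:\n",
        "        cleaned_word = ''.join(c for c in word if c not in punctuation).lower()\n",
        "        # A token made up only of punctuation becomes '' and is still appended.\n",
        "        cleaned_tokens.append(cleaned_word)\n",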
        "\n",
        "    return cleaned_tokens\n",
        "\n",
        "dataframe['tokens'] = dataframe['sentence'].apply(tokenize_sentence)"
      ],
      "metadata": {
        "id": "AXYh734Fkna8"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "For later processing we will need a dictionary that maps each entity tag (in its `B-` and `I-` form, plus `O`) to a unique number."
      ],
      "metadata": {
        "id": "Vafri8Il9GuS"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "label_dict = {'O': 0, 'B-Age': 1, 'I-Age': 2, 'B-Sex': 3, 'I-Sex': 4, 'B-Clinical_event': 5,\n",
        "              'I-Clinical_event': 6, 'B-Nonbiological_location': 7, 'I-Nonbiological_location': 8,\n",
        "              'B-Sign_symptom': 9, 'I-Sign_symptom': 10, 'B-Biological_structure': 11, 'I-Biological_structure': 12,\n",
        "              'B-Detailed_description': 13, 'I-Detailed_description': 14, 'B-History': 15, 'I-History': 16, 'B-Family_history': 17,\n",
        "              'I-Family_history': 18, 'B-Diagnostic_procedure': 19, 'I-Diagnostic_procedure': 20, 'B-Distance': 21,\n",
        "              'I-Distance': 22, 'B-Lab_value': 23, 'I-Lab_value': 24, 'B-Disease_disorder': 25, 'I-Disease_disorder': 26,\n",
        "              'B-Shape': 27, 'I-Shape': 28, 'B-Coreference': 29, 'I-Coreference': 30, 'B-Volume': 31, 'I-Volume': 32,\n",
        "              'B-Therapeutic_procedure': 33, 'I-Therapeutic_procedure': 34, 'B-Area': 35, 'I-Area': 36, 'B-Duration': 37,\n",
        "              'I-Duration': 38, 'B-Date': 39, 'I-Date': 40, 'B-Color': 41, 'I-Color': 42, 'B-Frequency': 43, 'I-Frequency': 44,\n",
        "              'B-Texture': 45, 'I-Texture': 46, 'B-Biological_attribute': 47, 'I-Biological_attribute': 48, 'B-Severity': 49,\n",
        "              'I-Severity': 50, 'B-Activity': 51, 'I-Activity': 52, 'B-Outcome': 53, 'I-Outcome': 54, 'B-Personal_background': 55,\n",
        "              'I-Personal_background': 56, 'B-Medication': 57, 'I-Medication': 58, 'B-Dosage': 59, 'I-Dosage': 60, 'B-Other_event': 61,\n",
        "              'I-Other_event': 62, 'B-Administration': 63, 'I-Administration': 64, 'B-Occupation': 65, 'I-Occupation': 66,\n",
        "              'B-Other_entity': 67, 'I-Other_entity': 68, 'B-Time': 69, 'I-Time': 70, 'B-Subject': 71, 'I-Subject': 72,\n",
        "              'B-Quantitative_concept': 73, 'I-Quantitative_concept': 74, 'B-Height': 75, 'I-Height': 76, 'B-Mass': 77, 'I-Mass': 78,\n",
        "              'B-Weight': 79, 'I-Weight': 80, 'B-Qualitative_concept': 81, 'I-Qualitative_concept': 82}"
      ],
      "metadata": {
        "id": "dELLPyt-kiRz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Build both lookups from the ids stored in label_dict rather than from its iteration order.\n",
        "label2id = dict(label_dict)\n",
        "id2label = {idx: label for label, idx in label2id.items()}"
      ],
      "metadata": {
        "id": "UuW18jhimUei"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Additionally, the numeric ids of each sentence's tags are stored in a new `numeric_tags` column of the dataframe."
      ],
      "metadata": {
        "id": "kKQoCjIu9Y8G"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def map_labels_to_ids(label_list):\n",
        "    return [label2id[label] for label in label_list]\n",
        "\n",
        "dataframe['numeric_tags'] = dataframe['tags'].apply(map_labels_to_ids)"
      ],
      "metadata": {
        "id": "ktAuHAQymWLA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now the dataframe is preprocessed and has four columns:\n",
        "\n",
        "- sentence: the whole, unprocessed sentence\n",
        "- tags: the tag of each token in the sentence\n",
        "- tokens: the tokens of each sentence, lowercased, with punctuation filtered (except characters such as -, &, %)\n",
        "- numeric_tags: the numeric id of each tag"
      ],
      "metadata": {
        "id": "m3TL24zj-UsR"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "display(dataframe)"
      ],
      "metadata": {
        "id": "UYNvV-DxmX1x",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 724
        },
        "outputId": "96dcc726-d818-48ca-c76b-eb85cba57bfd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "                                               sentence  \\\n",
              "0     CASE: A 28-year-old previously healthy man pre...   \n",
              "1     The symptoms occurred during rest, 2–3 times p...   \n",
              "2     Except for a grade 2/6 holosystolic tricuspid ...   \n",
              "3     An electrocardiogram (ECG) revealed normal sin...   \n",
              "4     Transthoracic echocardiography demonstrated th...   \n",
              "...                                                 ...   \n",
              "4537                         MHL was diagnosed (Fig.3).   \n",
              "4538  Immunohistochemistry results (Fig.4) were the ...   \n",
              "4539  After 9 days of recovery, the patient returned...   \n",
              "4540  A follow-up examination, which included blood ...   \n",
              "4541   No adverse or unanticipated event was presented.   \n",
              "\n",
              "                                                   tags  \\\n",
              "0     [O, O, B-Age, B-History, I-History, B-Sex, B-C...   \n",
              "1     [O, B-Coreference, O, O, B-Clinical_event, B-F...   \n",
              "2     [O, O, O, B-Lab_value, I-Lab_value, B-Detailed...   \n",
              "3     [O, B-Diagnostic_procedure, O, O, B-Lab_value,...   \n",
              "4     [B-Biological_structure, B-Diagnostic_procedur...   \n",
              "...                                                 ...   \n",
              "4537                      [B-Disease_disorder, O, O, O]   \n",
              "4538  [B-Diagnostic_procedure, I-Diagnostic_procedur...   \n",
              "4539  [O, B-Duration, I-Duration, O, B-Therapeutic_p...   \n",
              "4540  [O, B-Clinical_event, O, O, O, B-Diagnostic_pr...   \n",
              "4541  [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...   \n",
              "\n",
              "                                                 tokens  \\\n",
              "0     [case, a, 28-year-old, previously, healthy, ma...   \n",
              "1     [the, symptoms, occurred, during, rest, 2–3, t...   \n",
              "2     [except, for, a, grade, 2/6, holosystolic, tri...   \n",
              "3     [an, electrocardiogram, ecg, revealed, normal,...   \n",
              "4     [transthoracic, echocardiography, demonstrated...   \n",
              "...                                                 ...   \n",
              "4537                        [mhl, was, diagnosed, fig3]   \n",
              "4538  [immunohistochemistry, results, fig4, were, th...   \n",
              "4539  [after, 9, days, of, recovery, the, patient, r...   \n",
              "4540  [a, follow-up, examination, which, included, b...   \n",
              "4541  [no, adverse, or, unanticipated, event, was, p...   \n",
              "\n",
              "                                           numeric_tags  \n",
              "0            [0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9]  \n",
              "1     [0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14...  \n",
              "2     [0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1...  \n",
              "3     [0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,...  \n",
              "4     [11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ...  \n",
              "...                                                 ...  \n",
              "4537                                      [25, 0, 0, 0]  \n",
              "4538  [19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ...  \n",
              "4539               [0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9]  \n",
              "4540  [0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,...  \n",
              "4541                           [0, 9, 10, 10, 10, 0, 0]  \n",
              "\n",
              "[4542 rows x 4 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-416d3de3-5e25-422b-9808-c43f36230ab6\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>sentence</th>\n",
              "      <th>tags</th>\n",
              "      <th>tokens</th>\n",
              "      <th>numeric_tags</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>CASE: A 28-year-old previously healthy man pre...</td>\n",
              "      <td>[O, O, B-Age, B-History, I-History, B-Sex, B-C...</td>\n",
              "      <td>[case, a, 28-year-old, previously, healthy, ma...</td>\n",
              "      <td>[0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>The symptoms occurred during rest, 2–3 times p...</td>\n",
              "      <td>[O, B-Coreference, O, O, B-Clinical_event, B-F...</td>\n",
              "      <td>[the, symptoms, occurred, during, rest, 2–3, t...</td>\n",
              "      <td>[0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Except for a grade 2/6 holosystolic tricuspid ...</td>\n",
              "      <td>[O, O, O, B-Lab_value, I-Lab_value, B-Detailed...</td>\n",
              "      <td>[except, for, a, grade, 2/6, holosystolic, tri...</td>\n",
              "      <td>[0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>An electrocardiogram (ECG) revealed normal sin...</td>\n",
              "      <td>[O, B-Diagnostic_procedure, O, O, B-Lab_value,...</td>\n",
              "      <td>[an, electrocardiogram, ecg, revealed, normal,...</td>\n",
              "      <td>[0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Transthoracic echocardiography demonstrated th...</td>\n",
              "      <td>[B-Biological_structure, B-Diagnostic_procedur...</td>\n",
              "      <td>[transthoracic, echocardiography, demonstrated...</td>\n",
              "      <td>[11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4537</th>\n",
              "      <td>MHL was diagnosed (Fig.3).</td>\n",
              "      <td>[B-Disease_disorder, O, O, O]</td>\n",
              "      <td>[mhl, was, diagnosed, fig3]</td>\n",
              "      <td>[25, 0, 0, 0]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4538</th>\n",
              "      <td>Immunohistochemistry results (Fig.4) were the ...</td>\n",
              "      <td>[B-Diagnostic_procedure, I-Diagnostic_procedur...</td>\n",
              "      <td>[immunohistochemistry, results, fig4, were, th...</td>\n",
              "      <td>[19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4539</th>\n",
              "      <td>After 9 days of recovery, the patient returned...</td>\n",
              "      <td>[O, B-Duration, I-Duration, O, B-Therapeutic_p...</td>\n",
              "      <td>[after, 9, days, of, recovery, the, patient, r...</td>\n",
              "      <td>[0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4540</th>\n",
              "      <td>A follow-up examination, which included blood ...</td>\n",
              "      <td>[O, B-Clinical_event, O, O, O, B-Diagnostic_pr...</td>\n",
              "      <td>[a, follow-up, examination, which, included, b...</td>\n",
              "      <td>[0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4541</th>\n",
              "      <td>No adverse or unanticipated event was presented.</td>\n",
              "      <td>[O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...</td>\n",
              "      <td>[no, adverse, or, unanticipated, event, was, p...</td>\n",
              "      <td>[0, 9, 10, 10, 10, 0, 0]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>4542 rows × 4 columns</p>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-416d3de3-5e25-422b-9808-c43f36230ab6')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-416d3de3-5e25-422b-9808-c43f36230ab6 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-416d3de3-5e25-422b-9808-c43f36230ab6');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-121bea10-399d-4add-903f-8738ac750e35\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-121bea10-399d-4add-903f-8738ac750e35')\"\n",
              "            title=\"Suggest charts.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-121bea10-399d-4add-903f-8738ac750e35 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Save the processed dataframe to CSV; list-valued columns are written as strings.\n",
        "dataframe.to_csv('data.csv', index=False)"
      ],
      "metadata": {
        "id": "2hSa432BAcjd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Download the CSV to the local machine (works only inside Google Colab).\n",
        "from google.colab import files\n",
        "files.download('data.csv')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 16
        },
        "id": "vVq3E3lwAp1K",
        "outputId": "04d15ecd-bf3f-4a81-8dcc-291d5c8ac758"
      },
      "execution_count": null,
      "outputs": [
      ]
    }
  ]
}