{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Preprocessing data\n",
"\n",
"The MACCROBAT2018 dataset is a rich collection of annotated clinical language suitable for training biomedical natural language processing systems. Each clinical case report comes as a .txt file (free text) and a .ann file (annotated entities), both of which need to be processed.\n",
"\n",
"The goal is a dataframe that holds each sentence, its tokens and their corresponding tags."
],
"metadata": {
"id": "ItZ4Op-l2lQ8"
}
},
{
"cell_type": "markdown",
"source": [
"First import the necessary libraries."
],
"metadata": {
"id": "aQQ6T5Hj3WLy"
}
},
{
"cell_type": "code",
"source": [
"import glob\n",
"import os\n",
"import re\n",
"\n",
"import nltk\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Sentence tokenizer models used by nltk.sent_tokenize\n",
"nltk.download('punkt')"
],
"metadata": {
"id": "1k1A2pWlglPB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The function `get_simple_table` processes .ann files and extracts the annotation data. It parses lines, splitting them into relevant components, and stores them in a dataframe. The resulting dataframe contains columns for ID, type, start, end, and text."
],
"metadata": {
"id": "BL2iE_ss3zgQ"
}
},
{
"cell_type": "code",
"source": [
"def get_simple_table(raw_ann):\n",
"    # Parse a .ann file into a dataframe with one row per annotation\n",
"    with open(raw_ann, 'r') as file:\n",
"        lines = file.readlines()\n",
"\n",
"    data = []\n",
"    for line in lines:\n",
"        # Only lines whose ID starts with 'T' or 'E' are considered\n",
"        if line.startswith(('T', 'E')):\n",
"            line_data = line.split('\\t')\n",
"            # Expected layout: ID <tab> 'Type Start End' <tab> Text\n",
"            if len(line_data) >= 3:\n",
"                entity_id, entity_info, entity_text = line_data[0], line_data[1], line_data[2].strip()\n",
"                entity_info_split = entity_info.split(' ')\n",
"                if len(entity_info_split) >= 3:\n",
"                    entity_type, start, end = entity_info_split[:3]\n",
"                    # Start and End are kept as strings here\n",
"                    data.append([entity_id, entity_type, start, end, entity_text])\n",
"\n",
"    return pd.DataFrame(data, columns=['ID', 'Type', 'Start', 'End', 'Text'])"
],
"metadata": {
"id": "bNaHTbuZodtQ"
},
"execution_count": null,
"outputs": []
},
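{
"cell_type": "markdown",
"source": [
"To make the parsing concrete, the following sketch applies the same tab and space splitting to a single annotation line held in memory. The sample line is invented for illustration and is not taken from MACCROBAT2018; it follows the layout `ID<TAB>Type Start End<TAB>Text` that `get_simple_table` expects."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Hypothetical annotation line (not from the dataset)\n",
"sample_line = 'T1\\tAge 18 29\\t28-year-old\\n'\n",
"\n",
"# First split on tabs, then split the middle field on spaces\n",
"entity_id, entity_info, entity_text = sample_line.split('\\t')\n",
"entity_type, start, end = entity_info.split(' ')\n",
"\n",
"# The offsets are still strings at this point\n",
"print(entity_id, entity_type, start, end, entity_text.strip())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},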
{
"cell_type": "markdown",
"source": [
"The function `get_BOI_table` takes the dataframe produced by `get_simple_table` and splits every entity into its individual words. Each word becomes one row with the columns 'Type', 'Start', 'End' and 'Text', where 'Type' carries the prefix 'B-' for the first word of an entity and 'I-' for every following word, and 'Start' and 'End' are the word's character offsets. Words outside any entity receive the tag 'O' later, in `get_annotated_data`. The resulting dataframe contains the entity words in the BOI format (more commonly called BIO or IOB)."
],
"metadata": {
"id": "NDCMwGl45YdK"
}
},
{
"cell_type": "code",
"source": [
"def get_BOI_table(simple_table):\n",
"    rows = []\n",
"\n",
"    for index, row in simple_table.iterrows():\n",
"        entity_start = int(row['Start'])\n",
"        cursor = 0  # position in row['Text'] up to which words were consumed\n",
"\n",
"        for i, word in enumerate(row['Text'].split()):\n",
"            # Search from the cursor so that repeated words, or words contained\n",
"            # in an earlier word, are not matched at the wrong position\n",
"            offset = row['Text'].index(word, cursor)\n",
"            cursor = offset + len(word)\n",
"\n",
"            rows.append({\n",
"                'Type': f\"{'B-' if i == 0 else 'I-'}{row['Type']}\",\n",
"                'Start': entity_start + offset,\n",
"                'End': entity_start + cursor,\n",
"                'Text': word\n",
"            })\n",
"\n",
"    # Build the dataframe once instead of concatenating row by row\n",
"    return pd.DataFrame(rows, columns=['Type', 'Start', 'End', 'Text'])"
],
"metadata": {
"id": "n7NmygxbzXhu"
},
"execution_count": null,
"outputs": []
},
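{
"cell_type": "markdown",
"source": [
"As an illustration of the tagging scheme, the sketch below expands one multi-word entity into per-word rows. The entity and its offsets are made up for the example, and `expand_entity` is a hypothetical helper, not part of the pipeline. It searches for each word from a moving cursor, so a word such as `in`, which also occurs inside `pain`, gets the offset of the word itself."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"def expand_entity(entity_type, start, text):\n",
"    # Yield (tag, start, end, word) for each whitespace-separated word\n",
"    cursor = 0\n",
"    for i, word in enumerate(text.split()):\n",
"        offset = text.index(word, cursor)  # search only after the previous word\n",
"        cursor = offset + len(word)\n",
"        prefix = 'B-' if i == 0 else 'I-'\n",
"        yield prefix + entity_type, start + offset, start + cursor, word\n",
"\n",
"# Hypothetical entity: type, start offset in the document, covered text\n",
"for row in expand_entity('Sign_symptom', 100, 'pain in the chest'):\n",
"    print(row)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},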
{
"cell_type": "markdown",
"source": [
"`get_text` takes a file path as input, reads the file and returns its contents as a single string."
],
"metadata": {
"id": "WzCW05xO6NP-"
}
},
{
"cell_type": "code",
"source": [
"def get_text(raw_text):\n",
"    with open(raw_text, 'r') as file:\n",
"        text = file.read()\n",
"    return text"
],
"metadata": {
"id": "7XHT3mjw6_s8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
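"The function `get_annotated_data` combines the text of the input raw_text file with the entity tags in the BOI_table. It tokenizes the text into sentences, splits each sentence on single spaces and strips punctuation from a word before comparing its character offsets with the 'Start' and 'End' values in the BOI_table. A word with an exact offset match receives the entity tag; every other word is tagged 'O'. The running offset assumes that consecutive words and sentences are separated by exactly one character. The result is a dataframe with one row per sentence, holding the sentence text and the list of tags for its words."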
],
"metadata": {
"id": "R8GF6MB_6nVW"
}
},
{
"cell_type": "code",
"source": [
"def get_annotated_data(raw_text, BOI_table):\n",
"    sentences = nltk.sent_tokenize(get_text(raw_text))\n",
"    data = []\n",
"    punctuation = '\"!@#$%^&*()_+[]<>?:.,;'\n",
"\n",
"    pos = 0 # Start index for first word\n",
"    for sentence in sentences:\n",
"        words = sentence.split(\" \")\n",
"        sentence_words = []\n",
"        sentence_tags = []\n",
"\n",
"        for word in words:\n",
"            # Strip punctuation so the word length matches the annotated span\n",
"            curr_word = ''.join(c for c in word if c not in punctuation)\n",
"\n",
"            start = pos\n",
"            end_mit = start + len(word)  # end offset with punctuation ('mit' = with)\n",
"            end_ohne = start + len(curr_word)  # end offset without punctuation ('ohne' = without)\n",
"            tags = BOI_table[(BOI_table['Start'] == start) & (BOI_table['End'] == end_ohne)]\n",
"            sentence_words.append(word)\n",
"\n",
"            if tags.empty:\n",
"                sentence_tags.append('O')\n",
"            else:\n",
"                sentence_tags.append(tags.iloc[0, 0])  # 'Type' is the first column\n",
"\n",
"            # Assumes exactly one separator character between consecutive words\n",
"            pos = end_mit + 1\n",
"\n",
"        data.append({\n",
"            'sentence': ' '.join(sentence_words),\n",
"            'tags': sentence_tags\n",
"        })\n",
"\n",
"    return pd.DataFrame(data)"
],
"metadata": {
"id": "zXwbmdK2AEVh"
},
"execution_count": null,
"outputs": []
},
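{
"cell_type": "markdown",
"source": [
"The alignment depends on each word's character offset in the original file. As a minimal sketch of one way to obtain such offsets, assuming whitespace-delimited words, `re.finditer` reports start and end positions directly and does not rely on words being separated by exactly one space. The sample text and the span table below are invented for illustration."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import re\n",
"\n",
"# Hypothetical text with a line break between two words\n",
"text = 'A 28-year-old man\\npresented with fever.'\n",
"# Hypothetical entity spans keyed by (start, end)\n",
"spans = {(2, 13): 'B-Age', (14, 17): 'B-Sex'}\n",
"\n",
"# Each match is one run of non-whitespace characters with its offsets\n",
"for match in re.finditer(r'\\S+', text):\n",
"    tag = spans.get((match.start(), match.end()), 'O')\n",
"    print(match.group(), match.start(), match.end(), tag)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},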
{
"cell_type": "markdown",
"source": [
"Finally, with all functions defined, the text and annotation files in the 'MACCROBAT' directory are processed. The loop iterates through the sorted file pairs, extracts the entity tags from each annotation file and associates them with the text of the matching report. The resulting dataframe contains sentences and their corresponding tags."
],
"metadata": {
"id": "AM_bOUdo68mc"
}
},
{
"cell_type": "code",
"source": [
"path = './MACCROBAT'\n",
"\n",
"# Sorting both lists pairs each .txt file with its .ann file,\n",
"# assuming every report has both files\n",
"txt_files = sorted(glob.glob(os.path.join(path, '*.txt')))\n",
"ann_files = sorted(glob.glob(os.path.join(path, '*.ann')))\n",
"\n",
"frames = []\n",
"for txt_file, ann_file in zip(txt_files, ann_files):\n",
"    simple_table = get_simple_table(ann_file)\n",
"    boi_table = get_BOI_table(simple_table)\n",
"    frames.append(get_annotated_data(txt_file, boi_table))\n",
"\n",
"# Concatenate once; fall back to an empty frame if no files were found\n",
"dataframe = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=[\"sentence\", \"tags\"])"
],
"metadata": {
"id": "v2jTMwDj_leQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
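"The unique values in the 'Type' column of `boi_table` are printed. Because `boi_table` is overwritten in every loop iteration, this shows only the tags of the last processed file, not those of the entire dataset."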
],
"metadata": {
"id": "9TIFWeW17q90"
}
},
{
"cell_type": "code",
"source": [
"unique_values = boi_table['Type'].unique()\n",
"print(unique_values)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2N8-VRjJEC5t",
"outputId": "1c3d2e8c-6319-42ae-c71d-30c8ca8d9087"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['B-Age' 'I-Age' 'B-Sex' 'B-Clinical_event' 'B-Nonbiological_location'\n",
" 'B-Sign_symptom' 'B-Biological_structure' 'I-Sign_symptom'\n",
" 'B-Detailed_description' 'I-Detailed_description' 'B-History' 'I-History'\n",
" 'B-Family_history' 'I-Family_history' 'B-Diagnostic_procedure'\n",
" 'I-Diagnostic_procedure' 'I-Biological_structure' 'B-Distance'\n",
" 'I-Distance' 'B-Lab_value' 'I-Lab_value' 'B-Disease_disorder' 'B-Shape'\n",
" 'I-Shape' 'B-Coreference' 'B-Volume' 'I-Volume' 'B-Therapeutic_procedure'\n",
" 'I-Therapeutic_procedure' 'B-Area' 'I-Area' 'B-Duration' 'I-Duration'\n",
" 'B-Date' 'I-Date' 'B-Color' 'I-Color']\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"dataframe"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 411
},
"id": "07I-FYpOeH5z",
"outputId": "0b592667-1020-42c0-d03d-45674bd87906"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" sentence \\\n",
"0 CASE: A 28-year-old previously healthy man pre... \n",
"1 The symptoms occurred during rest, 2–3 times p... \n",
"2 Except for a grade 2/6 holosystolic tricuspid ... \n",
"3 An electrocardiogram (ECG) revealed normal sin... \n",
"4 Transthoracic echocardiography demonstrated th... \n",
"... ... \n",
"4537 MHL was diagnosed (Fig.3). \n",
"4538 Immunohistochemistry results (Fig.4) were the ... \n",
"4539 After 9 days of recovery, the patient returned... \n",
"4540 A follow-up examination, which included blood ... \n",
"4541 No adverse or unanticipated event was presented. \n",
"\n",
" tags \n",
"0 [O, O, B-Age, B-History, I-History, B-Sex, B-C... \n",
"1 [O, B-Coreference, O, O, B-Clinical_event, B-F... \n",
"2 [O, O, O, B-Lab_value, I-Lab_value, B-Detailed... \n",
"3 [O, B-Diagnostic_procedure, O, O, B-Lab_value,... \n",
"4 [B-Biological_structure, B-Diagnostic_procedur... \n",
"... ... \n",
"4537 [B-Disease_disorder, O, O, O] \n",
"4538 [B-Diagnostic_procedure, I-Diagnostic_procedur... \n",
"4539 [O, B-Duration, I-Duration, O, B-Therapeutic_p... \n",
"4540 [O, B-Clinical_event, O, O, O, B-Diagnostic_pr... \n",
"4541 [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym... \n",
"\n",
"[4542 rows x 2 columns]"
]
},
"metadata": {},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"source": [
"def count_unique_tokens_in_column(dataframe, column_name):\n",
"    # Collect the entity types, ignoring the 'B-'/'I-' prefixes\n",
"    unique_tokens = set()\n",
"    for tokens_list in dataframe[column_name]:\n",
"        cleaned_tokens = [re.sub(r'^(B-|I-)', '', token) for token in tokens_list]\n",
"        unique_tokens.update(cleaned_tokens)\n",
"\n",
"    # For each entity type, count the rows (sentences) in which it occurs at least once\n",
"    token_counts = {}\n",
"    for token in unique_tokens:\n",
"        count = sum(dataframe[column_name].apply(lambda tokens: re.search(fr'\\b{re.escape(token)}\\b', ' '.join(tokens)) is not None))\n",
"        token_counts[token] = count\n",
"\n",
"    sorted_token_counts = dict(sorted(token_counts.items(), key=lambda item: item[1], reverse=True))\n",
"    # 'Anzahl' is German for 'count'\n",
"    tags_freq = pd.DataFrame(list(sorted_token_counts.items()), columns=['Token', 'Anzahl'])\n",
"\n",
"    return tags_freq"
],
"metadata": {
"id": "537IWK34geSH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(count_unique_tokens_in_column(dataframe, \"tags\"))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3mDNKpsQge94",
"outputId": "80a7dfa6-2573-4658-bcb0-dae280479a32"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Token Anzahl\n",
"0 O 4527\n",
"1 Diagnostic_procedure 2215\n",
"2 Sign_symptom 1964\n",
"3 Detailed_description 1686\n",
"4 Biological_structure 1591\n",
"5 Lab_value 1400\n",
"6 Disease_disorder 946\n",
"7 Therapeutic_procedure 665\n",
"8 Date 640\n",
"9 Clinical_event 567\n",
"10 Medication 567\n",
"11 Severity 318\n",
"12 Nonbiological_location 307\n",
"13 Coreference 272\n",
"14 History 225\n",
"15 Duration 220\n",
"16 Age 204\n",
"17 Dosage 195\n",
"18 Sex 190\n",
"19 Administration 123\n",
"20 Distance 90\n",
"21 Activity 70\n",
"22 Frequency 68\n",
"23 Shape 56\n",
"24 Family_history 53\n",
"25 Personal_background 46\n",
"26 Color 46\n",
"27 Time 46\n",
"28 Subject 42\n",
"29 Texture 40\n",
"30 Outcome 38\n",
"31 Qualitative_concept 32\n",
"32 Area 31\n",
"33 Other_event 26\n",
"34 Quantitative_concept 25\n",
"35 Volume 18\n",
"36 Other_entity 17\n",
"37 Occupation 13\n",
"38 Biological_attribute 8\n",
"39 Weight 4\n",
"40 Height 4\n",
"41 Mass 2\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Now we know which tags occur in the entire dataset and in how many sentences each of them appears."
],
"metadata": {
"id": "1r4rKXp27xo6"
}
},
{
"cell_type": "code",
"source": [
"dataframe"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 494
},
"id": "PDH5LlsfAn-p",
"outputId": "2fc9883a-5130-4ee0-c2a3-0ce231893eb8"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" sentence \\\n",
"0 CASE: A 28-year-old previously healthy man pre... \n",
"1 The symptoms occurred during rest, 2–3 times p... \n",
"2 Except for a grade 2/6 holosystolic tricuspid ... \n",
"3 An electrocardiogram (ECG) revealed normal sin... \n",
"4 Transthoracic echocardiography demonstrated th... \n",
"... ... \n",
"4537 MHL was diagnosed (Fig.3). \n",
"4538 Immunohistochemistry results (Fig.4) were the ... \n",
"4539 After 9 days of recovery, the patient returned... \n",
"4540 A follow-up examination, which included blood ... \n",
"4541 No adverse or unanticipated event was presented. \n",
"\n",
" tags \n",
"0 [O, O, B-Age, B-History, I-History, B-Sex, B-C... \n",
"1 [O, B-Coreference, O, O, B-Clinical_event, B-F... \n",
"2 [O, O, O, B-Lab_value, I-Lab_value, B-Detailed... \n",
"3 [O, B-Diagnostic_procedure, O, O, B-Lab_value,... \n",
"4 [B-Biological_structure, B-Diagnostic_procedur... \n",
"... ... \n",
"4537 [B-Disease_disorder, O, O, O] \n",
"4538 [B-Diagnostic_procedure, I-Diagnostic_procedur... \n",
"4539 [O, B-Duration, I-Duration, O, B-Therapeutic_p... \n",
"4540 [O, B-Clinical_event, O, O, O, B-Diagnostic_pr... \n",
"4541 [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym... \n",
"\n",
"[4542 rows x 2 columns]"
]
},
"metadata": {},
"execution_count": 21
}
]
},
{
"cell_type": "markdown",
"source": [
"Next, the sentences are tokenized. The tokens are stored in a new 'tokens' column of the dataframe."
],
"metadata": {
"id": "SzWRDXkw8Tn8"
}
},
{
"cell_type": "code",
"source": [
"def tokenize_sentence(sentence):\n",
"    tokens = sentence.split()\n",
"    cleaned_tokens = []\n",
"    punctuation = '\"!@#$^*()_+[]<>?:.,;'\n",
"    for word in tokens:\n",
"        # Drop punctuation characters and lowercase the word\n",
"        cleaned_word = ''.join(c for c in word if c not in punctuation).lower()\n",
"        cleaned_tokens.append(cleaned_word)\n",
"\n",
"    return cleaned_tokens\n",
"\n",
"dataframe['tokens'] = dataframe['sentence'].apply(tokenize_sentence)"
],
"metadata": {
"id": "AXYh734Fkna8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"For later processing we need a dictionary that maps the entity tags to unique integers."
],
"metadata": {
"id": "Vafri8Il9GuS"
}
},
{
"cell_type": "code",
"source": [
"label_dict = {'O': 0, 'B-Age': 1, 'I-Age': 2, 'B-Sex': 3, 'I-Sex': 4, 'B-Clinical_event': 5,\n",
" 'I-Clinical_event': 6, 'B-Nonbiological_location': 7, 'I-Nonbiological_location': 8,\n",
" 'B-Sign_symptom': 9, 'I-Sign_symptom': 10, 'B-Biological_structure': 11, 'I-Biological_structure': 12,\n",
" 'B-Detailed_description': 13, 'I-Detailed_description': 14, 'B-History': 15, 'I-History': 16, 'B-Family_history': 17,\n",
" 'I-Family_history': 18, 'B-Diagnostic_procedure': 19, 'I-Diagnostic_procedure': 20, 'B-Distance': 21,\n",
" 'I-Distance': 22, 'B-Lab_value': 23, 'I-Lab_value': 24, 'B-Disease_disorder': 25, 'I-Disease_disorder': 26,\n",
" 'B-Shape': 27, 'I-Shape': 28, 'B-Coreference': 29, 'I-Coreference': 30, 'B-Volume': 31, 'I-Volume': 32,\n",
" 'B-Therapeutic_procedure': 33, 'I-Therapeutic_procedure': 34, 'B-Area': 35, 'I-Area': 36, 'B-Duration': 37,\n",
" 'I-Duration': 38, 'B-Date': 39, 'I-Date': 40, 'B-Color': 41, 'I-Color': 42, 'B-Frequency': 43, 'I-Frequency': 44,\n",
" 'B-Texture': 45, 'I-Texture': 46, 'B-Biological_attribute': 47, 'I-Biological_attribute': 48, 'B-Severity': 49,\n",
" 'I-Severity': 50, 'B-Activity': 51, 'I-Activity': 52, 'B-Outcome': 53, 'I-Outcome': 54, 'B-Personal_background': 55,\n",
" 'I-Personal_background': 56, 'B-Medication': 57, 'I-Medication': 58, 'B-Dosage': 59, 'I-Dosage': 60, 'B-Other_event': 61,\n",
" 'I-Other_event': 62, 'B-Administration': 63, 'I-Administration': 64, 'B-Occupation': 65, 'I-Occupation': 66,\n",
" 'B-Other_entity': 67, 'I-Other_entity': 68, 'B-Time': 69, 'I-Time': 70, 'B-Subject': 71, 'I-Subject': 72,\n",
" 'B-Quantitative_concept': 73, 'I-Quantitative_concept': 74, 'B-Height': 75, 'I-Height': 76, 'B-Mass': 77, 'I-Mass': 78,\n",
" 'B-Weight': 79, 'I-Weight': 80, 'B-Qualitative_concept': 81, 'I-Qualitative_concept': 82}"
],
"metadata": {
"id": "dELLPyt-kiRz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"id2label = {i: label for i, label in enumerate(label_dict)}\n",
"label2id = {v: k for k, v in id2label.items()}"
],
"metadata": {
"id": "UuW18jhimUei"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Additionally, the mapped ids of the labels are stored inside the dataframe."
],
"metadata": {
"id": "kKQoCjIu9Y8G"
}
},
{
"cell_type": "code",
"source": [
"def map_labels_to_ids(label_list):\n",
" return [label2id[label] for label in label_list]\n",
"\n",
"dataframe['numeric_tags'] = dataframe['tags'].apply(map_labels_to_ids)"
],
"metadata": {
"id": "ktAuHAQymWLA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now the dataframe is preprocessed:\n",
"\n",
"- sentence: contains the whole sentence, not processed\n",
"- tags: contains for each token its corresponding tag\n",
"- token: contains tokens of each sentence, punctuation (besides -, &, %) filtered, lower case\n",
"- numeric_tags: contains for each tag its corresponding numeric tag"
],
"metadata": {
"id": "m3TL24zj-UsR"
}
},
{
"cell_type": "code",
"source": [
"display(dataframe)"
],
"metadata": {
"id": "UYNvV-DxmX1x",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 724
},
"outputId": "96dcc726-d818-48ca-c76b-eb85cba57bfd"
},
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
" sentence \\\n",
"0 CASE: A 28-year-old previously healthy man pre... \n",
"1 The symptoms occurred during rest, 2–3 times p... \n",
"2 Except for a grade 2/6 holosystolic tricuspid ... \n",
"3 An electrocardiogram (ECG) revealed normal sin... \n",
"4 Transthoracic echocardiography demonstrated th... \n",
"... ... \n",
"4537 MHL was diagnosed (Fig.3). \n",
"4538 Immunohistochemistry results (Fig.4) were the ... \n",
"4539 After 9 days of recovery, the patient returned... \n",
"4540 A follow-up examination, which included blood ... \n",
"4541 No adverse or unanticipated event was presented. \n",
"\n",
" tags \\\n",
"0 [O, O, B-Age, B-History, I-History, B-Sex, B-C... \n",
"1 [O, B-Coreference, O, O, B-Clinical_event, B-F... \n",
"2 [O, O, O, B-Lab_value, I-Lab_value, B-Detailed... \n",
"3 [O, B-Diagnostic_procedure, O, O, B-Lab_value,... \n",
"4 [B-Biological_structure, B-Diagnostic_procedur... \n",
"... ... \n",
"4537 [B-Disease_disorder, O, O, O] \n",
"4538 [B-Diagnostic_procedure, I-Diagnostic_procedur... \n",
"4539 [O, B-Duration, I-Duration, O, B-Therapeutic_p... \n",
"4540 [O, B-Clinical_event, O, O, O, B-Diagnostic_pr... \n",
"4541 [O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym... \n",
"\n",
" tokens \\\n",
"0 [case, a, 28-year-old, previously, healthy, ma... \n",
"1 [the, symptoms, occurred, during, rest, 2–3, t... \n",
"2 [except, for, a, grade, 2/6, holosystolic, tri... \n",
"3 [an, electrocardiogram, ecg, revealed, normal,... \n",
"4 [transthoracic, echocardiography, demonstrated... \n",
"... ... \n",
"4537 [mhl, was, diagnosed, fig3] \n",
"4538 [immunohistochemistry, results, fig4, were, th... \n",
"4539 [after, 9, days, of, recovery, the, patient, r... \n",
"4540 [a, follow-up, examination, which, included, b... \n",
"4541 [no, adverse, or, unanticipated, event, was, p... \n",
"\n",
" numeric_tags \n",
"0 [0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9] \n",
"1 [0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14... \n",
"2 [0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1... \n",
"3 [0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,... \n",
"4 [11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ... \n",
"... ... \n",
"4537 [25, 0, 0, 0] \n",
"4538 [19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ... \n",
"4539 [0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9] \n",
"4540 [0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,... \n",
"4541 [0, 9, 10, 10, 10, 0, 0] \n",
"\n",
"[4542 rows x 4 columns]"
],
"text/html": [
"\n",
" <div id=\"df-416d3de3-5e25-422b-9808-c43f36230ab6\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sentence</th>\n",
" <th>tags</th>\n",
" <th>tokens</th>\n",
" <th>numeric_tags</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CASE: A 28-year-old previously healthy man pre...</td>\n",
" <td>[O, O, B-Age, B-History, I-History, B-Sex, B-C...</td>\n",
" <td>[case, a, 28-year-old, previously, healthy, ma...</td>\n",
" <td>[0, 0, 1, 15, 16, 3, 5, 0, 0, 37, 0, 0, 9]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>The symptoms occurred during rest, 2–3 times p...</td>\n",
" <td>[O, B-Coreference, O, O, B-Clinical_event, B-F...</td>\n",
" <td>[the, symptoms, occurred, during, rest, 2–3, t...</td>\n",
" <td>[0, 29, 0, 0, 5, 43, 44, 44, 44, 0, 13, 14, 14...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Except for a grade 2/6 holosystolic tricuspid ...</td>\n",
" <td>[O, O, O, B-Lab_value, I-Lab_value, B-Detailed...</td>\n",
" <td>[except, for, a, grade, 2/6, holosystolic, tri...</td>\n",
" <td>[0, 0, 0, 23, 24, 13, 11, 9, 10, 0, 0, 0, 0, 1...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>An electrocardiogram (ECG) revealed normal sin...</td>\n",
" <td>[O, B-Diagnostic_procedure, O, O, B-Lab_value,...</td>\n",
" <td>[an, electrocardiogram, ecg, revealed, normal,...</td>\n",
" <td>[0, 19, 0, 0, 23, 19, 20, 0, 0, 9, 10, 10, 10,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Transthoracic echocardiography demonstrated th...</td>\n",
" <td>[B-Biological_structure, B-Diagnostic_procedur...</td>\n",
" <td>[transthoracic, echocardiography, demonstrated...</td>\n",
" <td>[11, 19, 0, 0, 0, 0, 25, 26, 0, 0, 11, 12, 0, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4537</th>\n",
" <td>MHL was diagnosed (Fig.3).</td>\n",
" <td>[B-Disease_disorder, O, O, O]</td>\n",
" <td>[mhl, was, diagnosed, fig3]</td>\n",
" <td>[25, 0, 0, 0]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4538</th>\n",
" <td>Immunohistochemistry results (Fig.4) were the ...</td>\n",
" <td>[B-Diagnostic_procedure, I-Diagnostic_procedur...</td>\n",
" <td>[immunohistochemistry, results, fig4, were, th...</td>\n",
" <td>[19, 20, 0, 0, 0, 0, 19, 20, 0, 19, 0, 19, 0, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4539</th>\n",
" <td>After 9 days of recovery, the patient returned...</td>\n",
" <td>[O, B-Duration, I-Duration, O, B-Therapeutic_p...</td>\n",
" <td>[after, 9, days, of, recovery, the, patient, r...</td>\n",
" <td>[0, 37, 38, 0, 33, 0, 0, 5, 7, 0, 9]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4540</th>\n",
" <td>A follow-up examination, which included blood ...</td>\n",
" <td>[O, B-Clinical_event, O, O, O, B-Diagnostic_pr...</td>\n",
" <td>[a, follow-up, examination, which, included, b...</td>\n",
" <td>[0, 5, 0, 0, 0, 19, 20, 19, 20, 20, 19, 20, 0,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4541</th>\n",
" <td>No adverse or unanticipated event was presented.</td>\n",
" <td>[O, B-Sign_symptom, I-Sign_symptom, I-Sign_sym...</td>\n",
" <td>[no, adverse, or, unanticipated, event, was, p...</td>\n",
" <td>[0, 9, 10, 10, 10, 0, 0]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>4542 rows × 4 columns</p>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-416d3de3-5e25-422b-9808-c43f36230ab6')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-416d3de3-5e25-422b-9808-c43f36230ab6 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-416d3de3-5e25-422b-9808-c43f36230ab6');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-121bea10-399d-4add-903f-8738ac750e35\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-121bea10-399d-4add-903f-8738ac750e35')\"\n",
" title=\"Suggest charts.\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-121bea10-399d-4add-903f-8738ac750e35 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"dataframe.to_csv('data.csv', index=False)"
],
"metadata": {
"id": "2hSa432BAcjd"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from google.colab import files\n",
"files.download('data.csv')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 16
},
"id": "vVq3E3lwAp1K",
"outputId": "04d15ecd-bf3f-4a81-8dcc-291d5c8ac758"
},
"execution_count": null,
"outputs": [
]
}
]
}