{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6_3560E1B1-u"
   },
   "source": [
    "# Overview: Datasets for Medical Question Answering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fKqYB0FZCAR9"
   },
   "source": [
    "In this notebook, we present several datasets used for medical question answering. Each section below introduces one dataset and provides instructions and code for downloading and inspecting its data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7vevzvaKIBGw"
   },
   "source": [
    "# Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gt2p_Ozy9Q9N"
   },
   "source": [
    "Run the cell below to enable access to Google Drive. When prompted, click the link and authorize access to the Google Drive of the desired account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 32906,
     "status": "ok",
     "timestamp": 1633959691899,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "U6tN3m7bDzhf",
    "outputId": "6b5edd5f-9593-4d93-eb2e-34e045de9e44"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mounted at /content/drive\n"
     ]
    }
   ],
   "source": [
    "### Google Colab Mount Drive ###\n",
    "\n",
    "# Load the Drive helper and mount\n",
    "from google.colab import drive\n",
    "\n",
    "# This will prompt for authorization.\n",
    "drive.mount('/content/drive')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 10,
     "status": "ok",
     "timestamp": 1633959691900,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "PrNzRcz59DN3",
    "outputId": "f8678de9-2cc7-48be-d3f0-3f9d5072cff7"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/MyDrive\n"
     ]
    }
   ],
   "source": [
    "%cd drive/MyDrive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hnBdXsoOAc-t"
   },
   "source": [
    "# HeadQA\n",
    "[HeadQA](https://aghie.github.io/head-qa/) is a set of multiple-choice questions covering Medicine, Nursing, Psychology, Chemistry, Pharmacology, and Biology. The questions come from exams taken to access specialized positions in the Spanish healthcare system. The dataset can be downloaded from [Hugging Face Datasets](https://huggingface.co/datasets/head_qa). Loading and inspecting HeadQA is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "je2SXMMpDmsh"
   },
   "outputs": [],
   "source": [
    "!pip install datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Yu-opRu-lEqI"
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "a0-BgDbQong0"
   },
   "source": [
    "The questions and answers are available in both Spanish and English. The default language is Spanish.\n",
    "\n",
    "To load the Spanish version, use `headqa = load_dataset(\"head_qa\")`.\n",
    "\n",
    "To load the English version, use `headqa = load_dataset(\"head_qa\", \"en\")`.\n",
    "\n",
    "In this example, we use the English version.\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 165
    },
    "executionInfo": {
     "elapsed": 6739,
     "status": "ok",
     "timestamp": 1633917727338,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "ZuL-GhQnlH7K",
    "outputId": "236cdecf-e60a-4fe3-f611-24c66882c97f"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "15592a94517044d48679534b94dac0d3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/1.51k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading and preparing dataset head_qa/en (download: 1.67 MiB, generated: 2.65 MiB, post-processed: Unknown size, total: 4.31 MiB) to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e3a9eceb8fe24e7984fbf6050253f010",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/1.75M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a57a9eab61c946a0ae85a31d811d0969",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "176754880c0344508f205a0be50375ae",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6d5deb02400e4802b237bf6b03c5f7ca",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset head_qa downloaded and prepared to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2. Subsequent calls will reuse this data.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a356879fc9cc43d4a74642aaabd0269f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "headqa = load_dataset(\"head_qa\", \"en\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "teHYDn85shzo"
   },
   "source": [
    "The `headqa` object is a [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets. The value for each key is a [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 117,
     "status": "ok",
     "timestamp": 1633917741534,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "0S8PHiPTlKcc",
    "outputId": "0e4262d6-6887-47bb-a4c8-e4c177416284"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],\n",
       "        num_rows: 2657\n",
       "    })\n",
       "    test: Dataset({\n",
       "        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],\n",
       "        num_rows: 2742\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],\n",
       "        num_rows: 1366\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "headqa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "quGLUdg0tmGs"
   },
   "source": [
    "To view an actual data instance, select one of the splits and then specify an index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 115,
     "status": "ok",
     "timestamp": 1633917751065,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "kRE-Q9gutj5n",
    "outputId": "2aa6f05e-7c62-462d-8357-6197a1aef2ed"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'answers': [{'aid': 1, 'atext': 'They are all or nothing.'},\n",
       "  {'aid': 2, 'atext': 'They are hyperpolarizing.'},\n",
       "  {'aid': 3, 'atext': 'They can be added.'},\n",
       "  {'aid': 4, 'atext': 'They spread long distances.'},\n",
       "  {'aid': 5, 'atext': 'They present a refractory period.'}],\n",
       " 'category': 'biology',\n",
       " 'image': '',\n",
       " 'name': 'Cuaderno_2013_1_B',\n",
       " 'qid': 1,\n",
       " 'qtext': 'The excitatory postsynaptic potentials:',\n",
       " 'ra': 3,\n",
       " 'year': '2013'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# display the first training data instance\n",
    "headqa['train'][0] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ie2rDiaQt25w"
   },
   "source": [
    "To get a better sense of what the data looks like, the following function will show some examples picked randomly from the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "r--FW0UEtwbB"
   },
   "outputs": [],
   "source": [
    "from datasets import ClassLabel\n",
    "import random\n",
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "def show_random_elements(dataset, num_examples=10):\n",
    "    assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n",
    "    # Draw distinct row indices without replacement\n",
    "    picks = random.sample(range(len(dataset)), num_examples)\n",
    "    \n",
    "    df = pd.DataFrame(dataset[picks])\n",
    "    for column, typ in dataset.features.items():\n",
    "        if isinstance(typ, ClassLabel):\n",
    "            df[column] = df[column].transform(lambda i: typ.names[i])\n",
    "    display(HTML(df.to_html()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 518
    },
    "executionInfo": {
     "elapsed": 114,
     "status": "ok",
     "timestamp": 1633917759720,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "BF9XHMtfuBGg",
    "outputId": "18733ffc-a4ab-4e0d-e9b4-2ea54dc7b32b"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>year</th>\n",
       "      <th>category</th>\n",
       "      <th>qid</th>\n",
       "      <th>qtext</th>\n",
       "      <th>ra</th>\n",
       "      <th>image</th>\n",
       "      <th>answers</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Cuaderno_2017_1_E</td>\n",
       "      <td>2017</td>\n",
       "      <td>nursery</td>\n",
       "      <td>81</td>\n",
       "      <td>A patient with a venous ulcer in the lower limbs has characteristic symptomatology and clinical manifestations. Which of the following responses is not characteristic of this situation ?:</td>\n",
       "      <td>3</td>\n",
       "      <td></td>\n",
       "      <td>[{'aid': 1, 'atext': 'Thick and hardened skin.'}, {'aid': 2, 'atext': 'Significant edema'}, {'aid': 3, 'atext': 'Intermittent claudication.'}, {'aid': 4, 'atext': 'Normal pulses'}]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Cuaderno_2017_1_P</td>\n",
       "      <td>2017</td>\n",
       "      <td>psychology</td>\n",
       "      <td>67</td>\n",
       "      <td>Studies on sleep in subjects complaining of insomnia show:</td>\n",
       "      <td>2</td>\n",
       "      <td></td>\n",
       "      <td>[{'aid': 1, 'atext': 'That most overestimate the amount of time he actually sleeps.'}, {'aid': 2, 'atext': 'That most underestimate the amount of time that actually sleeps.'}, {'aid': 3, 'atext': 'That most accurately estimate the amount of time he actually sleeps.'}, {'aid': 4, 'atext': 'That the majority estimates with accuracy the amount of time that sleeps only during the siesta.'}]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Cuaderno_2016_1_E</td>\n",
       "      <td>2016</td>\n",
       "      <td>nursery</td>\n",
       "      <td>54</td>\n",
       "      <td>With respect to critical thinking, which of the following terms used by Richard Paul as characteristics of critical thinkers is INCORRECT. The critical thinkers are:</td>\n",
       "      <td>3</td>\n",
       "      <td></td>\n",
       "      <td>[{'aid': 1, 'atext': 'Humble.'}, {'aid': 2, 'atext': 'Realistic'}, {'aid': 3, 'atext': 'Reagents'}, {'aid': 4, 'atext': 'Good communicators.'}]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# randomly choose 3 test data instances\n",
    "show_random_elements(headqa[\"test\"], 3) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m0A75FB-uVj9"
   },
   "source": [
    "In each example, the question text is contained in the field `qtext`. The `answers` field is a list of dictionaries, each with two keys: `aid` contains the index of the choice and `atext` contains the text of the choice. The field `ra` contains the index of the right answer. \\\\\n",
    "The following function helps to better visualize each question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ATz5C_4AuTCb"
   },
   "outputs": [],
   "source": [
    "def show_one(example):\n",
    "    print(f\"Question: {example['qtext']}\")\n",
    "    # Some questions have 4 choices and others 5, so loop instead of hardcoding five indices\n",
    "    for answer in example['answers']:\n",
    "        print(f\"  {answer['aid']} - {answer['atext']}\")\n",
    "    print(f\"\\nGround truth: option {example['ra']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 123,
     "status": "ok",
     "timestamp": 1633917766739,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "Tn-p0XS9uqHO",
    "outputId": "0af0af74-713f-4b08-ce6d-7d3b9563d34a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: The excitatory postsynaptic potentials:\n",
      "  1 - They are all or nothing.\n",
      "  2 - They are hyperpolarizing.\n",
      "  3 - They can be added.\n",
      "  4 - They spread long distances.\n",
      "  5 - They present a refractory period.\n",
      "\n",
      "Ground truth: option 3\n"
     ]
    }
   ],
   "source": [
    "show_one(headqa[\"train\"][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vbA5FZuq9BxK"
   },
   "source": [
    "# BioASQ\n",
    "[BioASQ](http://bioasq.org) organizes challenges on biomedical semantic indexing and question answering (QA). The challenges include a variety of tasks, but in this section we focus only on QA. Among [all challenges](http://bioasq.org/participate/challenges), the two relevant tasks are Task 9b: Biomedical Semantic QA and BioASQ Task Synergy: Biomedical Semantic QA for COVID-19."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2YGRbzyHFBHV"
   },
   "source": [
    "**Task 9b: Biomedical Semantic QA** \\\\\n",
    "[Task 9b](http://participants-area.bioasq.org/general_information/Task9b/) uses a benchmark QA dataset with four types of questions: \\\\\n",
    "\n",
    "\n",
    "1. **Yes/no questions**: These are questions that, strictly speaking, require \"yes\" or \"no\" answers, though of course in practice longer answers will often be desirable. For example, \"Do CpG islands colocalise with transcription start sites?\" is a yes/no question.\n",
    "2. **Factoid questions**: These are questions that, strictly speaking, require a particular entity name (e.g., of a disease, drug, or gene), a number, or a similar short expression as an answer, though again a longer answer may be desirable in practice. For example, \"Which virus is best known as the cause of infectious mononucleosis?\" is a factoid question.\n",
    "3. **List questions**: These are questions that, strictly speaking, require a list of entity names (e.g., a list of gene names), numbers, or similar short expressions as an answer; again, in practice additional information may be desirable. For example, \"Which are the Raf kinase inhibitors?\" is a list question.\n",
    "4. **Summary questions**: These are questions that do not belong in any of the previous categories and can only be answered by producing a short text summarizing the most prominent relevant information. For example, \"What is the treatment of infectious mononucleosis?\" is a summary question. \\\\\n",
    "\n",
    "We will inspect the dataset below.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Uzfud_2OFsht"
   },
   "outputs": [],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 157,
     "status": "ok",
     "timestamp": 1633915684575,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "ZC1SMwcMElu7",
    "outputId": "388c922a-f5df-4dce-f022-36cdcc92af1a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/MyDrive/BioASQ-training9b\n"
     ]
    }
   ],
   "source": [
    "%cd BioASQ-training9b/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R5hlSyC_FA0q"
   },
   "source": [
    "Inspect the README file. According to the README, the 3742 questions are distributed as follows: 1091 factoid, 1033 yes/no, 899 summary, and 719 list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 382,
     "status": "ok",
     "timestamp": 1633915687895,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "bcGZLDchExCT",
    "outputId": "1030e7b6-ecf7-48dc-947a-75db4cd7224b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "== Data purpose ==\n",
      "\n",
      "The data are intended to be used as training and development data for BioASQ 9, which will take place during 2021.\n",
      "There is one file containing the data:\n",
      " - training9b.json\n",
      "\n",
      "\n",
      "The file contains the data of the first seven editions of the challenge: 3742 questions [1] with their relevant documents, snippets, concepts and RDF triples, exact and ideal answers.\n",
      "For more information about the format of the data as well as the instructions for participating at BioASQ please consult: http://participants-area.bioasq.org/general_information/Task9b/\n",
      "\n",
      "Differences with BioASQ-training8b.json \n",
      "\t- 499 new questions added from BioASQ8\n",
      "\t\t- The question with id 5e30e689fbd6abf43b00003a had identical body with 5880e417713cbdfd3d000001. All relevant elements from both questions are available in the merged question with id 5880e417713cbdfd3d000001.\n",
      "\n",
      "\n",
      "\n",
      "\t\t\n",
      "== Citing BioASQ ==\n",
      "When using this data please cite our previous work:\n",
      "\n",
      "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition: George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos and Georgios Paliouras, in BMC bioinformatics, 2015\n",
      "\n",
      "Bib:\n",
      "@article{tsatsaronis2015overview,\n",
      "  title={An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition},\n",
      "  author={Tsatsaronis, George and Balikas, Georgios and Malakasiotis, Prodromos and Partalas, Ioannis and Zschunke, Matthias and Alvers, Michael R and Weissenborn, Dirk and Krithara, Anastasia and Petridis, Sergios and Polychronopoulos, Dimitris and others},\n",
      "  journal={BMC bioinformatics},\n",
      "  volume={16},\n",
      "  number={1},\n",
      "  pages={138},\n",
      "  year={2015},\n",
      "  publisher={BioMed Central Ltd}\n",
      "}\n",
      "\n",
      "\n",
      "Fore more information about the challenge, the organisers and the relevant publications please visit: http://bioasq.org/\n",
      "\n",
      "[1] The distribution of 3742 questions : 1091 factoid, 1033 yesno, 899 summary, 719 list\n"
     ]
    }
   ],
   "source": [
    "!cat README"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BzzljzVp928v"
   },
   "source": [
    "Load the data from the JSON file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FYt1gljFFGZe"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "data_file = \"training9b.json\"\n",
    "# Use a context manager so the file handle is closed after loading\n",
    "with open(data_file) as f:\n",
    "    data = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "50_L4cRi97tM"
   },
   "source": [
    "Inspect the structure of the JSON file. It contains a single key, `questions`, whose value is a list of 3743 questions (one more than the 3742 stated in the README)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 113,
     "status": "ok",
     "timestamp": 1633915793585,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "rDG_8-1zFzHH",
    "outputId": "c4bb2386-5425-40e0-ef7d-47ec10e29895"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['questions'])\n",
      "Total number of questions:  3743\n"
     ]
    }
   ],
   "source": [
    "print(data.keys())\n",
    "print(\"Total number of questions: \", len(data['questions']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 160,
     "status": "ok",
     "timestamp": 1633915695273,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "daufJ0eNF2vH",
    "outputId": "ccc7883c-ddb6-4929-a871-8d91bc1a3ad5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(data['questions'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UrjhK7yL-Hvv"
   },
   "source": [
    "Inspect the structure of a single question.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 142,
     "status": "ok",
     "timestamp": 1633916419529,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "Kyyh1ounGYXX",
    "outputId": "683e54bc-080c-477c-a8a3-824c63e5c63c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'body': 'Is Hirschsprung disease a mendelian or a multifactorial disorder?',\n",
       " 'concepts': ['http://www.disease-ontology.org/api/metadata/DOID:10487',\n",
       "  'http://www.nlm.nih.gov/cgi/mesh/2015/MB_cgi?field=uid&exact=Find+Exact+Term&term=D006627',\n",
       "  'http://www.nlm.nih.gov/cgi/mesh/2015/MB_cgi?field=uid&exact=Find+Exact+Term&term=D020412',\n",
       "  'http://www.disease-ontology.org/api/metadata/DOID:11372'],\n",
       " 'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/15858239',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/15829955',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/6650562',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/12239580',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/21995290',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/23001136',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/15617541',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/8896569',\n",
       "  'http://www.ncbi.nlm.nih.gov/pubmed/20598273'],\n",
       " 'id': '55031181e9bde69634000014',\n",
       " 'ideal_answer': [\"Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the development of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syndromic forms of Hirschsprung's disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model.\"],\n",
       " 'snippets': [{'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15829955',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 131,\n",
       "   'offsetInEndSection': 358,\n",
       "   'text': 'Hirschsprung disease (HSCR) is a multifactorial, non-mendelian disorder in which rare high-penetrance coding sequence mutations in the receptor tyrosine kinase RET contribute to risk in combination with mutations at other genes'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15617541',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 554,\n",
       "   'offsetInEndSection': 992,\n",
       "   'text': \"In this study, we review the identification of genes and loci involved in the non-syndromic common form and syndromic Mendelian forms of Hirschsprung's disease. The majority of the identified genes are related to Mendelian syndromic forms of Hirschsprung's disease. The non-Mendelian inheritance of sporadic non-syndromic Hirschsprung's disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model\"},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/12239580',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 397,\n",
       "   'offsetInEndSection': 939,\n",
       "   'text': 'Coding sequence mutations in e.g. RET, GDNF, EDNRB, EDN3, and SOX10 lead to long-segment (L-HSCR) as well as syndromic HSCR but fail to explain the transmission of the much more common short-segment form (S-HSCR). Furthermore, mutations in the RET gene are responsible for approximately half of the familial and some sporadic cases, strongly suggesting, on the one hand, the importance of non-coding variations and, on the other hand, that additional genes involved in the development of the enteric nervous system still await their discovery'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/12239580',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 941,\n",
       "   'offsetInEndSection': 1279,\n",
       "   'text': 'For almost all of the identified HSCR genes incomplete penetrance of the HSCR phenotype has been reported, probably due to modifier loci. Therefore, HSCR has become a model for a complex oligo-/polygenic disorder in which the relationship between different genes creating a non-mendelian inheritance pattern still remains to be elucidated'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15829955',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 129,\n",
       "   'offsetInEndSection': 358,\n",
       "   'text': ' Hirschsprung disease (HSCR) is a multifactorial, non-mendelian disorder in which rare high-penetrance coding sequence mutations in the receptor tyrosine kinase RET contribute to risk in combination with mutations at other genes.'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/6650562',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 851,\n",
       "   'offsetInEndSection': 1007,\n",
       "   'text': ' The inheritance of Hirschsprung disease is generally consistent with sex-modified multifactorial inheritance with a lower threshold of expression in males.'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15829955',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 131,\n",
       "   'offsetInEndSection': 359,\n",
       "   'text': 'Hirschsprung disease (HSCR) is a multifactorial, non-mendelian disorder in which rare high-penetrance coding sequence mutations in the receptor tyrosine kinase RET contribute to risk in combination with mutations at other genes.'},\n",
       "  {'beginSection': 'title',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/20598273',\n",
       "   'endSection': 'title',\n",
       "   'offsetInBeginSection': 0,\n",
       "   'offsetInEndSection': 131,\n",
       "   'text': 'Differential contributions of rare and common, coding and noncoding Ret mutations to multifactorial Hirschsprung disease liability.'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/21995290',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 0,\n",
       "   'offsetInEndSection': 210,\n",
       "   'text': 'BACKGROUND: RET is the major gene associated to Hirschsprung disease (HSCR) with differential contributions of its rare and common, coding and noncoding mutations to the multifactorial nature of this pathology.'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15858239',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 151,\n",
       "   'offsetInEndSection': 376,\n",
       "   'text': 'In the etiology of Hirschsprung disease various genes play a role; these are: RET, EDNRB, GDNF, EDN3 and SOX10, NTN3, ECE1, Mutations in these genes may result in dominant, recessive or multifactorial patterns of inheritance.'},\n",
       "  {'beginSection': 'title',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/23001136',\n",
       "   'endSection': 'title',\n",
       "   'offsetInBeginSection': 0,\n",
       "   'offsetInEndSection': 83,\n",
       "   'text': \"Chromosomal and related Mendelian syndromes associated with Hirschsprung's disease.\"},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15617541',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 715,\n",
       "   'offsetInEndSection': 818,\n",
       "   'text': \"The majority of the identified genes are related to Mendelian syndromic forms of Hirschsprung's disease\"},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15858239',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 151,\n",
       "   'offsetInEndSection': 375,\n",
       "   'text': 'In the etiology of Hirschsprung disease various genes play a role; these are: RET, EDNRB, GDNF, EDN3 and SOX10, NTN3, ECE1, Mutations in these genes may result in dominant, recessive or multifactorial patterns of inheritance'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/8896569',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 417,\n",
       "   'offsetInEndSection': 615,\n",
       "   'text': 'On the basis of a skewed sex-ratio (M/F = 4/1) and a risk to relatives much higher than the incidence in the general population, HSCR has long been regarded as a sex-modified multifactorial disorder'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/6650562',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 858,\n",
       "   'offsetInEndSection': 1012,\n",
       "   'text': 'The inheritance of Hirschsprung disease is generally consistent with sex-modified multifactorial inheritance with a lower threshold of expression in males'},\n",
       "  {'beginSection': 'abstract',\n",
       "   'document': 'http://www.ncbi.nlm.nih.gov/pubmed/15617541',\n",
       "   'endSection': 'abstract',\n",
       "   'offsetInBeginSection': 820,\n",
       "   'offsetInEndSection': 992,\n",
       "   'text': \"The non-Mendelian inheritance of sporadic non-syndromic Hirschsprung's disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model\"}],\n",
       " 'type': 'summary'}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['questions'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z2S2nCPaAhh8"
   },
   "source": [
    "Inspect keys of each question, different question type has different set of keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 166,
     "status": "ok",
     "timestamp": 1633916452246,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "S4x9hDhRGORq",
    "outputId": "7224ce01-ecb0-42b1-b8e1-7d25e1b4828f"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['body', 'documents', 'ideal_answer', 'concepts', 'type', 'id', 'snippets'])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['questions'][0].keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KKFRPu0vAobm"
   },
   "source": [
    "Check distribution of question types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SVbTDo1zGbTA"
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "type_distribution = Counter([x['type'] for x in data['questions']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 122,
     "status": "ok",
     "timestamp": 1633916469841,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "UZJ2fqucG0eN",
    "outputId": "89956944-16a2-485e-dcd1-dacb15090c5f"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'factoid': 1092, 'list': 719, 'summary': 899, 'yesno': 1033})"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type_distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8416s6D4Au7H"
   },
   "source": [
    "The following function display one or more examples of a specified question type. Use the following function to further explore content of each question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "o0xOZhV1HERH"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "def show_random_question(dataset, qtype=\"factoid\", num_examples=1):\n",
    "  assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n",
    "  picks = []\n",
    "  for _ in range(num_examples):\n",
    "    pick = random.randint(0, len(dataset)-1)\n",
    "    while pick in picks or dataset[pick]['type'] != qtype:\n",
    "      pick = random.randint(0, len(dataset)-1)\n",
    "    picks.append(pick)\n",
    "  picked_questions = [dataset[pick] for pick in picks]\n",
    "  return picked_questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 121,
     "status": "ok",
     "timestamp": 1633916544559,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "XAPtIc9qIovW",
    "outputId": "a0a78748-d25c-4714-cabd-9657895d334c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'body': 'What is the ubiquitin proteome?',\n",
       "  'concepts': ['http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D020543',\n",
       "   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D014452',\n",
       "   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D040901',\n",
       "   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D054875',\n",
       "   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D025801',\n",
       "   'http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016567',\n",
       "   'http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0031386',\n",
       "   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D057149',\n",
       "   'http://www.uniprot.org/uniprot/UBIQ_CERCA'],\n",
       "  'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/22178446',\n",
       "   'http://www.ncbi.nlm.nih.gov/pubmed/23743150',\n",
       "   'http://www.ncbi.nlm.nih.gov/pubmed/23764619'],\n",
       "  'exact_answer': ['The ubiquitin proteome is the entire set ubiquitinated proteins and of their respective ubiquitination sites.'],\n",
       "  'id': '532f1452d6d3ac6a34000030',\n",
       "  'ideal_answer': ['The ubiquitin proteome is the entire set ubiquitinated proteins and of their respective ubiquitination sites.'],\n",
       "  'snippets': [{'beginSection': 'abstract',\n",
       "    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/23764619',\n",
       "    'endSection': 'abstract',\n",
       "    'offsetInBeginSection': 266,\n",
       "    'offsetInEndSection': 422,\n",
       "    'text': 'Mass spectrometry now allows high throughput approaches for the identification of the thousands of ubiquitinated proteins and of their ubiquitination sites.'},\n",
       "   {'beginSection': 'abstract',\n",
       "    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/22178446',\n",
       "    'endSection': 'abstract',\n",
       "    'offsetInBeginSection': 313,\n",
       "    'offsetInEndSection': 543,\n",
       "    'text': 'we used Tandem repeated Ubiquitin Binding Entities (TUBEs) under non-denaturing conditions followed by mass spectrometry analysis to study global ubiquitylation events that may lead to the identification of potential drug targets.'},\n",
       "   {'beginSection': 'abstract',\n",
       "    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/23743150',\n",
       "    'endSection': 'abstract',\n",
       "    'offsetInBeginSection': 174,\n",
       "    'offsetInEndSection': 339,\n",
       "    'text': 'To study the ubiquitin proteome we have established an immunoaffinity purification method for the proteomic analysis of endogenously ubiquitinated protein complexes.'}],\n",
       "  'type': 'factoid'}]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_random_question(data['questions'], \"factoid\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lC4-Ug2rmHxv"
   },
   "source": [
    "The following function will return a list of questions of a particular type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lIHf1lAAmDhX"
   },
   "outputs": [],
   "source": [
    "type_num_dic = {'factoid': 1092, 'list': 719, 'summary': 899, 'yesno': 1033}\n",
    "def get_question_of_type(dataset, qtype):\n",
    "  questions = [q for q in dataset if q['type'] == qtype]\n",
    "  assert len(questions) == type_num_dic[qtype]\n",
    "  return questions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-7-rDhTkBRaK"
   },
   "source": [
    "Get all questions of 'factoid' type and inspect a random element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 109,
     "status": "ok",
     "timestamp": 1633916646422,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "s6aD8HRVmupk",
    "outputId": "ce94c58d-5316-40d9-f8b6-4a3119619a20"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'body': 'What type of mutation is causing the industrial melanism phenotype in peppered moths?',\n",
       "  'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/27251284',\n",
       "   'http://www.ncbi.nlm.nih.gov/pubmed/12298233',\n",
       "   'http://www.ncbi.nlm.nih.gov/pubmed/12140267'],\n",
       "  'exact_answer': ['transposable element insertion'],\n",
       "  'id': '58a877cf38c171fb5b000004',\n",
       "  'ideal_answer': ['The mutation event giving rise to industrial melanism in Britain was the insertion of a large, tandemly repeated, transposable element into the first intron of the gene cortex.'],\n",
       "  'snippets': [{'beginSection': 'title',\n",
       "    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/27251284',\n",
       "    'endSection': 'title',\n",
       "    'offsetInBeginSection': 0,\n",
       "    'offsetInEndSection': 85,\n",
       "    'text': 'The industrial melanism mutation in British peppered moths is a transposable element.'},\n",
       "   {'beginSection': 'abstract',\n",
       "    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/27251284',\n",
       "    'endSection': 'abstract',\n",
       "    'offsetInBeginSection': 685,\n",
       "    'offsetInEndSection': 879,\n",
       "    'text': 'Here we show that the mutation event giving rise to industrial melanism in Britain was the insertion of a large, tandemly repeated, transposable element into the first intron of the gene cortex.'}],\n",
       "  'type': 'factoid'}]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qtype = 'factoid'\n",
    "selected_questions = get_question_of_type(data['questions'], qtype) \n",
    "show_random_question(selected_questions, qtype, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CmEwplU9GKHZ"
   },
   "source": [
    "**Task Synergy** \n",
    "\n",
    "[Task Synergy](http://participants-area.bioasq.org/general_information/TaskSynergy/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "eZA0dRyXGbi9"
   },
   "outputs": [],
   "source": [
    "# similar to Task9b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4_NKv95BKJGE"
   },
   "source": [
    "# MedQuAD \n",
    "[MedQuAD](https://github.com/abachaa/MedQuAD) includes 47,457 medical question-answer pairs created\n",
    "from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.\n",
    "\n",
    "Link to [Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4) for more information on dataset construction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gBVN0xo7p_6S"
   },
   "source": [
    "First, clone MedQuAD github repository. The repository contains 12 folders, each folder in turn contains questions from one of the medical resources. In each folder, there are multiple xml files. We will demontrate how to extract relevant information from these xml files below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 113,
     "status": "ok",
     "timestamp": 1633916719996,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "LPpUn7RHBcVH",
    "outputId": "6418eafc-190a-483e-8961-cc2f020243a6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive\n"
     ]
    }
   ],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 138488,
     "status": "ok",
     "timestamp": 1633742866403,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "O0dozo_PLFVM",
    "outputId": "a4938703-b238-4601-8ff8-bd56760fe71b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'MedQuAD'...\n",
      "remote: Enumerating objects: 11301, done.\u001b[K\n",
      "remote: Total 11301 (delta 0), reused 0 (delta 0), pack-reused 11301\u001b[K\n",
      "Receiving objects: 100% (11301/11301), 11.00 MiB | 3.72 MiB/s, done.\n",
      "Resolving deltas: 100% (6803/6803), done.\n",
      "Checking out files: 100% (11276/11276), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/abachaa/MedQuAD.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 221,
     "status": "ok",
     "timestamp": 1633916872888,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "rW8zygecp7PY",
    "outputId": "70ad0d96-508e-448d-a227-820a4a26ed91"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/MedQuAD\n"
     ]
    }
   ],
   "source": [
    "%cd MedQuAD/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "otUVMTxfr_r_"
   },
   "source": [
    "We will show an example of parsing an xml file using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#id17). <br>\n",
    "\n",
    "Install BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 3152,
     "status": "ok",
     "timestamp": 1633916757742,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "Dy-tD21-tIJN",
    "outputId": "4e46994a-3be9-4494-9f69-6bde21a3136c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: bs4 in /usr/local/lib/python3.7/dist-packages (0.0.1)\n",
      "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (from bs4) (4.6.3)\n"
     ]
    }
   ],
   "source": [
    "!pip install bs4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QYPfTJ95ByWK"
   },
   "source": [
    "The following function parse one xml file specified by filename."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VIT-W2Orr8DO"
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def parse_one_file(filename):\n",
    "  data = open(filename, 'r').read()\n",
    "  soup = BeautifulSoup(data, 'xml')\n",
    "  info_dic = {}\n",
    "  # parse Document tag\n",
    "  document_id = soup.Document['id']\n",
    "  source = soup.Document['source']\n",
    "  url = soup.Document['url']\n",
    "  info_dic['document_id'] = document_id\n",
    "  info_dic['source'] = source\n",
    "  info_dic['url'] = url\n",
    "  # parse focus\n",
    "  focus = soup.Focus.string # or soup.Focus.contents[0]\n",
    "  info_dic['focus'] = focus\n",
    "  # parse semantic group\n",
    "  semantic_group = soup.SemanticGroup.string\n",
    "  info_dic['semantic_group'] = semantic_group\n",
    "  QA_pairs = []\n",
    "  # parse QA pairs\n",
    "  for QAPair in soup.find_all(pid=True):\n",
    "    qid = QAPair.Question['qid']\n",
    "    qtype = QAPair.Question['qtype']\n",
    "    question = QAPair.Question.string\n",
    "    answer = QAPair.Answer.string\n",
    "    QA_pairs.append([qid, qtype, question, answer])\n",
    "  info_dic['QA_pairs'] = QA_pairs\n",
    "  return info_dic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EAJ5k-2RCG7B"
   },
   "source": [
    "The following function prints info_dic in a more readable format. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cT0zEOhy-hH2"
   },
   "outputs": [],
   "source": [
    "def show_example(info_dic):\n",
    "    keys = ['document_id', 'focus', 'semantic_group', 'source']\n",
    "    output = \"\"\n",
    "    for k in keys:\n",
    "      output += \"{:17}{}\\n\".format(k + ':', info_dic[k])\n",
    "    output += \"QA_pairs:\\n\"\n",
    "    for qid, qtype, question, answer in info_dic['QA_pairs']:\n",
    "      answer = ' '.join(answer.strip().split())\n",
    "      output += \"id: {:17}qtype: {}\\n     Question:\\t{}\\n     Answer:\\t{}\\n\".format(qid, qtype, question, answer)\n",
    "    return output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 233,
     "status": "ok",
     "timestamp": 1633917353619,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "lybNbqE88YP4",
    "outputId": "684b2db8-61df-4dec-9bc8-1111b997b9eb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "document_id:     0000001_1\n",
      "focus:           Adult Acute Lymphoblastic Leukemia\n",
      "semantic_group:  Disorders\n",
      "source:          CancerGov\n",
      "QA_pairs:\n",
      "id: 0000001_1-1      qtype: information\n",
      "     Question:\tWhat is (are) Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tKey Points - Adult acute lymphoblastic leukemia (ALL) is a type of cancer in which the bone marrow makes too many lymphocytes (a type of white blood cell). - Leukemia may affect red blood cells, white blood cells, and platelets. - Previous chemotherapy and exposure to radiation may increase the risk of developing ALL. - Signs and symptoms of adult ALL include fever, feeling tired, and easy bruising or bleeding. - Tests that examine the blood and bone marrow are used to detect (find) and diagnose adult ALL. - Certain factors affect prognosis (chance of recovery) and treatment options. Adult acute lymphoblastic leukemia (ALL) is a type of cancer in which the bone marrow makes too many lymphocytes (a type of white blood cell). Adult acute lymphoblastic leukemia (ALL; also called acute lymphocytic leukemia) is a cancer of the blood and bone marrow. This type of cancer usually gets worse quickly if it is not treated. Leukemia may affect red blood cells, white blood cells, and platelets. Normally, the bone marrow makes blood stem cells (immature cells) that become mature blood cells over time. A blood stem cell may become a myeloid stem cell or a lymphoid stem cell. A myeloid stem cell becomes one of three types of mature blood cells: - Red blood cells that carry oxygen and other substances to all tissues of the body. - Platelets that form blood clots to stop bleeding. - Granulocytes (white blood cells) that fight infection and disease. A lymphoid stem cell becomes a lymphoblast cell and then one of three types of lymphocytes (white blood cells): - B lymphocytes that make antibodies to help fight infection. - T lymphocytes that help B lymphocytes make the antibodies that help fight infection. - Natural killer cells that attack cancer cells and viruses. In ALL, too many stem cells become lymphoblasts, B lymphocytes, or T lymphocytes. These cells are also called leukemia cells. These leukemia cells are not able to fight infection very well. 
Also, as the number of leukemia cells increases in the blood and bone marrow, there is less room for healthy white blood cells, red blood cells, and platelets. This may cause infection, anemia, and easy bleeding. The cancer can also spread to the central nervous system (brain and spinal cord). This summary is about adult acute lymphoblastic leukemia. See the following PDQ summaries for information about other types of leukemia: - Childhood Acute Lymphoblastic Leukemia Treatment. - Adult Acute Myeloid Leukemia Treatment. - Childhood Acute Myeloid Leukemia/Other Myeloid Malignancies Treatment. - Chronic Lymphocytic Leukemia Treatment. - Chronic Myelogenous Leukemia Treatment. - Hairy Cell Leukemia Treatment.\n",
      "id: 0000001_1-2      qtype: symptoms\n",
      "     Question:\tWhat are the symptoms of Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tSigns and symptoms of adult ALL include fever, feeling tired, and easy bruising or bleeding. The early signs and symptoms of ALL may be like the flu or other common diseases. Check with your doctor if you have any of the following: - Weakness or feeling tired. - Fever or night sweats. - Easy bruising or bleeding. - Petechiae (flat, pinpoint spots under the skin, caused by bleeding). - Shortness of breath. - Weight loss or loss of appetite. - Pain in the bones or stomach. - Pain or feeling of fullness below the ribs. - Painless lumps in the neck, underarm, stomach, or groin. - Having many infections. These and other signs and symptoms may be caused by adult acute lymphoblastic leukemia or by other conditions.\n",
      "id: 0000001_1-3      qtype: exams and tests\n",
      "     Question:\tHow to diagnose Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tTests that examine the blood and bone marrow are used to detect (find) and diagnose adult ALL. The following tests and procedures may be used: - Physical exam and history : An exam of the body to check general signs of health, including checking for signs of disease, such as infection or anything else that seems unusual. A history of the patient's health habits and past illnesses and treatments will also be taken. - Complete blood count (CBC) with differential : A procedure in which a sample of blood is drawn and checked for the following: - The number of red blood cells and platelets. - The number and type of white blood cells. - The amount of hemoglobin (the protein that carries oxygen) in the red blood cells. - The portion of the blood sample made up of red blood cells. - Blood chemistry studies : A procedure in which a blood sample is checked to measure the amounts of certain substances released into the blood by organs and tissues in the body. An unusual (higher or lower than normal) amount of a substance can be a sign of disease. - Peripheral blood smear : A procedure in which a sample of blood is checked for blast cells, the number and kinds of white blood cells, the number of platelets, and changes in the shape of blood cells. - Bone marrow aspiration and biopsy : The removal of bone marrow, blood, and a small piece of bone by inserting a hollow needle into the hipbone or breastbone. A pathologist views the bone marrow, blood, and bone under a microscope to look for abnormal cells. The following tests may be done on the samples of blood or bone marrow tissue that are removed: - Cytogenetic analysis: A laboratory test in which the cells in a sample of blood or bone marrow are looked at under a microscope to find out if there are certain changes in the chromosomes of lymphocytes. For example, in Philadelphia chromosome positive ALL, part of one chromosome switches places with part of another chromosome. 
This is called the Philadelphia chromosome. - Immunophenotyping : A process used to identify cells, based on the types of antigens or markers on the surface of the cell. This process is used to diagnose the subtype of ALL by comparing the cancer cells to normal cells of the immune system. For example, a cytochemistry study may test the cells in a sample of tissue using chemicals (dyes) to look for certain changes in the sample. A chemical may cause a color change in one type of leukemia cell but not in another type of leukemia cell.\n",
      "id: 0000001_1-4      qtype: outlook\n",
      "     Question:\tWhat is the outlook for Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tCertain factors affect prognosis (chance of recovery) and treatment options. The prognosis (chance of recovery) and treatment options depend on the following: - The age of the patient. - Whether the cancer has spread to the brain or spinal cord. - Whether there are certain changes in the genes, including the Philadelphia chromosome. - Whether the cancer has been treated before or has recurred (come back).\n",
      "id: 0000001_1-5      qtype: susceptibility\n",
      "     Question:\tWho is at risk for Adult Acute Lymphoblastic Leukemia? ?\n",
      "     Answer:\tPrevious chemotherapy and exposure to radiation may increase the risk of developing ALL. Anything that increases your risk of getting a disease is called a risk factor. Having a risk factor does not mean that you will get cancer; not having risk factors doesnt mean that you will not get cancer. Talk with your doctor if you think you may be at risk. Possible risk factors for ALL include the following: - Being male. - Being white. - Being older than 70. - Past treatment with chemotherapy or radiation therapy. - Being exposed to high levels of radiation in the environment (such as nuclear radiation). - Having certain genetic disorders, such as Down syndrome.\n",
      "id: 0000001_1-6      qtype: stages\n",
      "     Question:\tWhat are the stages of Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tKey Points - Once adult ALL has been diagnosed, tests are done to find out if the cancer has spread to the central nervous system (brain and spinal cord) or to other parts of the body. - There is no standard staging system for adult ALL. Once adult ALL has been diagnosed, tests are done to find out if the cancer has spread to the central nervous system (brain and spinal cord) or to other parts of the body. The extent or spread of cancer is usually described as stages. It is important to know whether the leukemia has spread outside the blood and bone marrow in order to plan treatment. The following tests and procedures may be used to determine if the leukemia has spread: - Chest x-ray : An x-ray of the organs and bones inside the chest. An x-ray is a type of energy beam that can go through the body and onto film, making a picture of areas inside the body. - Lumbar puncture : A procedure used to collect a sample of cerebrospinal fluid (CSF) from the spinal column. This is done by placing a needle between two bones in the spine and into the CSF around the spinal cord and removing a sample of the fluid. The sample of CSF is checked under a microscope for signs that leukemia cells have spread to the brain and spinal cord. This procedure is also called an LP or spinal tap. - CT scan (CAT scan): A procedure that makes a series of detailed pictures of the abdomen, taken from different angles. The pictures are made by a computer linked to an x-ray machine. A dye may be injected into a vein or swallowed to help the organs or tissues show up more clearly. This procedure is also called computed tomography, computerized tomography, or computerized axial tomography. - MRI (magnetic resonance imaging): A procedure that uses a magnet, radio waves, and a computer to make a series of detailed pictures of areas inside the body. This procedure is also called nuclear magnetic resonance imaging (NMRI). There is no standard staging system for adult ALL. 
The disease is described as untreated, in remission, or recurrent. Untreated adult ALL The ALL is newly diagnosed and has not been treated except to relieve signs and symptoms such as fever, bleeding, or pain. - The complete blood count is abnormal. - More than 5% of the cells in the bone marrow are blasts (leukemia cells). - There are signs and symptoms of leukemia. Adult ALL in remission The ALL has been treated. - The complete blood count is normal. - 5% or fewer of the cells in the bone marrow are blasts (leukemia cells). - There are no signs or symptoms of leukemia other than in the bone marrow.\n",
      "id: 0000001_1-7      qtype: treatment\n",
      "     Question:\tWhat are the treatments for Adult Acute Lymphoblastic Leukemia ?\n",
      "     Answer:\tKey Points - There are different types of treatment for patients with adult ALL. - The treatment of adult ALL usually has two phases. - Four types of standard treatment are used: - Chemotherapy - Radiation therapy - Chemotherapy with stem cell transplant - Targeted therapy - New types of treatment are being tested in clinical trials. - Biologic therapy - Patients may want to think about taking part in a clinical trial. - Patients can enter clinical trials before, during, or after starting their cancer treatment. - Patients with ALL may have late effects after treatment. - Follow-up tests may be needed. There are different types of treatment for patients with adult ALL. Different types of treatment are available for patients with adult acute lymphoblastic leukemia (ALL). Some treatments are standard (the currently used treatment), and some are being tested in clinical trials. A treatment clinical trial is a research study meant to help improve current treatments or obtain information on new treatments for patients with cancer. When clinical trials show that a new treatment is better than the standard treatment, the new treatment may become the standard treatment. Patients may want to think about taking part in a clinical trial. Some clinical trials are open only to patients who have not started treatment. The treatment of adult ALL usually has two phases. The treatment of adult ALL is done in phases: - Remission induction therapy: This is the first phase of treatment. The goal is to kill the leukemia cells in the blood and bone marrow. This puts the leukemia into remission. - Post-remission therapy: This is the second phase of treatment. It begins once the leukemia is in remission. The goal of post-remission therapy is to kill any remaining leukemia cells that may not be active but could begin to regrow and cause a relapse. This phase is also called remission continuation therapy. 
Treatment called central nervous system (CNS) sanctuary therapy is usually given during each phase of therapy. Because standard doses of chemotherapy may not reach leukemia cells in the CNS (brain and spinal cord), the cells are able to \"find sanctuary\" (hide) in the CNS. Systemic chemotherapy given in high doses, intrathecal chemotherapy, and radiation therapy to the brain are able to reach leukemia cells in the CNS. They are given to kill the leukemia cells and lessen the chance the leukemia will recur (come back). CNS sanctuary therapy is also called CNS prophylaxis. Four types of standard treatment are used: Chemotherapy Chemotherapy is a cancer treatment that uses drugs to stop the growth of cancer cells, either by killing the cells or by stopping them from dividing. When chemotherapy is taken by mouth or injected into a vein or muscle, the drugs enter the bloodstream and can reach cancer cells throughout the body (systemic chemotherapy). When chemotherapy is placed directly into the cerebrospinal fluid (intrathecal chemotherapy), an organ, or a body cavity such as the abdomen, the drugs mainly affect cancer cells in those areas (regional chemotherapy). Combination chemotherapy is treatment using more than one anticancer drug. The way the chemotherapy is given depends on the type and stage of the cancer being treated. Intrathecal chemotherapy may be used to treat adult ALL that has spread, or may spread, to the brain and spinal cord. When used to lessen the chance leukemia cells will spread to the brain and spinal cord, it is called central nervous system (CNS) sanctuary therapy or CNS prophylaxis. See Drugs Approved for Acute Lymphoblastic Leukemia for more information. Radiation therapy Radiation therapy is a cancer treatment that uses high-energy x-rays or other types of radiation to kill cancer cells or keep them from growing. 
There are two types of radiation therapy: - External radiation therapy uses a machine outside the body to send radiation toward the cancer. - Internal radiation therapy uses a radioactive substance sealed in needles, seeds, wires, or catheters that are placed directly into or near the cancer. The way the radiation therapy is given depends on the type of cancer. External radiation therapy may be used to treat adult ALL that has spread, or may spread, to the brain and spinal cord. When used this way, it is called central nervous system (CNS) sanctuary therapy or CNS prophylaxis. External radiation therapy may also be used as palliative therapy to relieve symptoms and improve quality of life. Chemotherapy with stem cell transplant Stem cell transplant is a method of giving chemotherapy and replacing blood-forming cells destroyed by the cancer treatment. Stem cells (immature blood cells) are removed from the blood or bone marrow of the patient or a donor and are frozen and stored. After the chemotherapy is completed, the stored stem cells are thawed and given back to the patient through an infusion. These reinfused stem cells grow into (and restore) the body's blood cells. See Drugs Approved for Acute Lymphoblastic Leukemia for more information. Targeted therapy Targeted therapy is a type of treatment that uses drugs or other substances to identify and attack specific cancer cells without harming normal cells. Targeted therapy drugs called tyrosine kinase inhibitors are used to treat some types of adult ALL. These drugs block the enzyme, tyrosine kinase, that causes stem cells to develop into more white blood cells (blasts) than the body needs. Three of the drugs used are imatinib mesylate (Gleevec), dasatinib, and nilotinib. See Drugs Approved for Acute Lymphoblastic Leukemia for more information. New types of treatment are being tested in clinical trials. This summary section describes treatments that are being studied in clinical trials. 
It may not mention every new treatment being studied. Information about clinical trials is available from the NCI website. Biologic therapy Biologic therapy is a treatment that uses the patient's immune system to fight cancer. Substances made by the body or made in a laboratory are used to boost, direct, or restore the body's natural defenses against cancer. This type of cancer treatment is also called biotherapy or immunotherapy. Patients may want to think about taking part in a clinical trial. For some patients, taking part in a clinical trial may be the best treatment choice. Clinical trials are part of the cancer research process. Clinical trials are done to find out if new cancer treatments are safe and effective or better than the standard treatment. Many of today's standard treatments for cancer are based on earlier clinical trials. Patients who take part in a clinical trial may receive the standard treatment or be among the first to receive a new treatment. Patients who take part in clinical trials also help improve the way cancer will be treated in the future. Even when clinical trials do not lead to effective new treatments, they often answer important questions and help move research forward. Patients can enter clinical trials before, during, or after starting their cancer treatment. Some clinical trials only include patients who have not yet received treatment. Other trials test treatments for patients whose cancer has not gotten better. There are also clinical trials that test new ways to stop cancer from recurring (coming back) or reduce the side effects of cancer treatment. Clinical trials are taking place in many parts of the country. See the Treatment Options section that follows for links to current treatment clinical trials. These have been retrieved from NCI's listing of clinical trials. Patients with ALL may have late effects after treatment. 
Side effects from cancer treatment that begin during or after treatment and continue for months or years are called late effects. Late effects of treatment for ALL may include the risk of second cancers (new types of cancer). Regular follow-up exams are very important for long-term survivors. Follow-up tests may be needed. Some of the tests that were done to diagnose the cancer or to find out the stage of the cancer may be repeated. Some tests will be repeated in order to see how well the treatment is working. Decisions about whether to continue, change, or stop treatment may be based on the results of these tests. Some of the tests will continue to be done from time to time after treatment has ended. The results of these tests can show if your condition has changed or if the cancer has recurred (come back). These tests are sometimes called follow-up tests or check-ups. Treatment Options for Adult Acute Lymphoblastic Leukemia Untreated Adult Acute Lymphoblastic Leukemia Standard treatment of adult acute lymphoblastic leukemia (ALL) during the remission induction phase includes the following: - Combination chemotherapy. - Tyrosine kinase inhibitor therapy with imatinib mesylate, in certain patients. Some of these patients will also have combination chemotherapy. - Supportive care including antibiotics and red blood cell and platelet transfusions. - CNS prophylaxis therapy including chemotherapy (intrathecal and/or systemic) with or without radiation therapy to the brain. Check the list of NCI-supported cancer clinical trials that are now accepting patients with untreated adult acute lymphoblastic leukemia. For more specific results, refine the search by using other search features, such as the location of the trial, the type of treatment, or the name of the drug. Talk with your doctor about clinical trials that may be right for you. General information about clinical trials is available from the NCI website. 
Adult Acute Lymphoblastic Leukemia in Remission Standard treatment of adult ALL during the post-remission phase includes the following: - Chemotherapy. - Tyrosine kinase inhibitor therapy. - Chemotherapy with stem cell transplant. - CNS prophylaxis therapy including chemotherapy (intrathecal and/or systemic) with or without radiation therapy to the brain. Check the list of NCI-supported cancer clinical trials that are now accepting patients with adult acute lymphoblastic leukemia in remission. For more specific results, refine the search by using other search features, such as the location of the trial, the type of treatment, or the name of the drug. Talk with your doctor about clinical trials that may be right for you. General information about clinical trials is available from the NCI website. Recurrent Adult Acute Lymphoblastic Leukemia Standard treatment of recurrent adult ALL may include the following: - Combination chemotherapy followed by stem cell transplant. - Low-dose radiation therapy as palliative care to relieve symptoms and improve the quality of life. - Tyrosine kinase inhibitor therapy with dasatinib for certain patients. Some of the treatments being studied in clinical trials for recurrent adult ALL include the following: - A clinical trial of stem cell transplant using the patient's stem cells. - A clinical trial of biologic therapy. - A clinical trial of new anticancer drugs. Check the list of NCI-supported cancer clinical trials that are now accepting patients with recurrent adult acute lymphoblastic leukemia. For more specific results, refine the search by using other search features, such as the location of the trial, the type of treatment, or the name of the drug. Talk with your doctor about clinical trials that may be right for you. General information about clinical trials is available from the NCI website.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "example_file = \"./1_CancerGov_QA/0000001_1.xml\"\n",
    "example_info_dic = parse_one_file(example_file)\n",
    "print(show_example(example_info_dic))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K9mX2rxDDhQM"
   },
   "source": [
    "The following function parses all XML files in a specified directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 28182,
     "status": "ok",
     "timestamp": 1633917518865,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "J1nqwCrwBoa_",
    "outputId": "61bdddd4-54b1-4200-d62c-ede11150fd62",
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['0000001_1.xml', '0000001_2.xml', '0000001_3.xml', '0000001_4.xml', '0000001_5.xml', '0000001_6.xml', '0000001_7.xml', '0000003_1.xml', '0000003_2.xml', '0000003_3.xml', '0000003_4.xml', '0000003_5.xml', '0000003_6.xml', '0000004_1.xml', '0000004_2.xml', '0000004_3.xml', '0000004_4.xml', '0000004_5.xml', '0000004_6.xml', '0000004_7.xml', '0000005_1.xml', '0000005_2.xml', '0000006_1.xml', '0000006_2.xml', '0000006_3.xml', '0000006_4.xml', '0000006_5.xml', '0000006_6.xml', '0000006_7.xml', '0000006_8.xml', '0000006_9.xml', '0000007_1.xml', '0000007_2.xml', '0000007_3.xml', '0000007_4.xml', '0000007_5.xml', '0000009_1.xml', '0000009_2.xml', '0000010_1.xml', '0000013_1.xml', '0000013_2.xml', '0000013_2_1.xml', '0000013_2_2.xml', '0000013_2_3.xml', '0000013_2_4.xml', '0000013_2_5.xml', '0000013_2_6.xml', '0000013_3.xml', '0000013_3_1.xml', '0000013_3_2.xml', '0000013_3_3.xml', '0000013_3_4.xml', '0000014_1.xml', '0000014_2.xml', '0000014_3.xml', '0000014_4.xml', '0000015_1.xml', '0000016_1.xml', '0000017_1.xml', '0000018_1.xml', '0000019_1.xml', '0000019_2.xml', '0000019_3.xml', '0000019_4.xml', '0000019_5.xml', '0000020_1.xml', '0000021_1.xml', '0000021_2.xml', '0000022_1.xml', '0000023_1.xml', '0000024_1.xml', '0000024_10.xml', '0000024_2.xml', '0000024_3.xml', '0000024_4.xml', '0000024_5.xml', '0000024_6.xml', '0000024_7.xml', '0000024_8.xml', '0000024_9.xml', '0000025_1.xml', '0000025_2.xml', '0000026_1.xml', '0000026_2.xml', '0000026_3.xml', '0000027_1.xml', '0000027_2.xml', '0000027_3.xml', '0000027_4.xml', '0000027_5.xml', '0000028_1.xml', '0000028_2.xml', '0000028_3.xml', '0000028_4.xml', '0000028_5.xml', '0000030_1.xml', '0000031_1.xml', '0000031_2.xml', '0000032_1.xml', '0000032_2.xml', '0000032_3.xml', '0000032_4.xml', '0000033_1.xml', '0000034_1.xml', '0000035_1.xml', '0000036_1.xml', '0000036_2.xml', '0000036_3.xml', '0000037_1.xml', '0000037_2.xml', '0000037_3.xml', '0000037_4.xml', '0000039_1.xml', '0000040_1.xml', '0000041_1.xml', 
'0000043_1.xml']\n",
      "document_id:     0000004_3\n",
      "focus:           AIDS-Related Lymphoma\n",
      "semantic_group:  Disorders\n",
      "source:          CancerGov\n",
      "QA_pairs:\n",
      "id: 0000004_3-1      qtype: information\n",
      "     Question:\tWhat is (are) AIDS-Related Lymphoma ?\n",
      "     Answer:\tKey Points - AIDS-related lymphoma is a disease in which malignant (cancer) cells form in the lymph system of patients who have acquired immunodeficiency syndrome (AIDS). - There are many different types of lymphoma. - Signs of AIDS-related lymphoma include weight loss, fever, and night sweats. - Tests that examine the lymph system and other parts of the body are used to help detect (find) and diagnose AIDS-related lymphoma. - Certain factors affect prognosis (chance of recovery) and treatment options. AIDS-related lymphoma is a disease in which malignant (cancer) cells form in the lymph system of patients who have acquired immunodeficiency syndrome (AIDS). AIDS is caused by the human immunodeficiency virus (HIV), which attacks and weakens the body's immune system. The immune system is then unable to fight infection and disease. People with HIV disease have an increased risk of infection and lymphoma or other types of cancer. A person with HIV disease who develops certain types of infections or cancer is then diagnosed with AIDS. Sometimes, people are diagnosed with AIDS and AIDS-related lymphoma at the same time. For information about AIDS and its treatment, please see the AIDSinfo website. AIDS-related lymphoma is a type of cancer that affects the lymph system, which is part of the body's immune system. The immune system protects the body from foreign substances, infection, and diseases. The lymph system is made up of the following: - Lymph: Colorless, watery fluid that carries white blood cells called lymphocytes through the lymph system. Lymphocytes protect the body against infections and the growth of tumors. - Lymph vessels: A network of thin tubes that collect lymph from different parts of the body and return it to the bloodstream. - Lymph nodes: Small, bean-shaped structures that filter lymph and store white blood cells that help fight infection and disease. 
Lymph nodes are located along the network of lymph vessels found throughout the body. Clusters of lymph nodes are found in the neck, underarm, abdomen, pelvis, and groin. - Spleen: An organ that makes lymphocytes, filters the blood, stores blood cells, and destroys old blood cells. The spleen is on the left side of the abdomen near the stomach. - Thymus: An organ in which lymphocytes grow and multiply. The thymus is in the chest behind the breastbone. - Tonsils: Two small masses of lymph tissue at the back of the throat. The tonsils make lymphocytes. - Bone marrow: The soft, spongy tissue in the center of large bones. Bone marrow makes white blood cells, red blood cells, and platelets. Lymph tissue is also found in other parts of the body such as the brain, stomach, thyroid gland, and skin. Sometimes AIDS-related lymphoma occurs outside the lymph nodes in the bone marrow, liver, meninges (thin membranes that cover the brain) and gastrointestinal tract. Less often, it may occur in the anus, heart, bile duct, gingiva, and muscles. There are many different types of lymphoma. Lymphomas are divided into two general types: - Hodgkin lymphoma. - Non-Hodgkin lymphoma. Both Hodgkin lymphoma and non-Hodgkin lymphoma may occur in patients with AIDS, but non-Hodgkin lymphoma is more common. When a person with AIDS has non-Hodgkin lymphoma, it is called AIDS-related lymphoma. When AIDS-related lymphoma occurs in the central nervous system (CNS), it is called AIDS-related primary CNS lymphoma. Non-Hodgkin lymphomas are grouped by the way their cells look under a microscope. They may be indolent (slow-growing) or aggressive (fast-growing). AIDS-related lymphomas are aggressive. There are two main types of AIDS-related non-Hodgkin lymphoma: - Diffuse large B-cell lymphoma (including B-cell immunoblastic lymphoma). - Burkitt or Burkitt-like lymphoma. 
For more information about lymphoma or AIDS-related cancers, see the following PDQ summaries: - Adult Non-Hodgkin Lymphoma Treatment - Childhood Non-Hodgkin Lymphoma Treatment - Primary CNS Lymphoma Treatment - Kaposi Sarcoma Treatment\n",
      "id: 0000004_3-2      qtype: symptoms\n",
      "     Question:\tWhat are the symptoms of AIDS-Related Lymphoma ?\n",
      "     Answer:\tSigns of AIDS-related lymphoma include weight loss, fever, and night sweats. These and other signs and symptoms may be caused by AIDS-related lymphoma or by other conditions. Check with your doctor if you have any of the following: - Weight loss or fever for no known reason. - Night sweats. - Painless, swollen lymph nodes in the neck, chest, underarm, or groin. - A feeling of fullness below the ribs.\n",
      "id: 0000004_3-3      qtype: exams and tests\n",
      "     Question:\tHow to diagnose AIDS-Related Lymphoma ?\n",
      "     Answer:\tTests that examine the lymph system and other parts of the body are used to help detect (find) and diagnose AIDS-related lymphoma. The following tests and procedures may be used: - Physical exam and history : An exam of the body to check general signs of health, including checking for signs of disease, such as lumps or anything else that seems unusual. A history of the patients health habits and past illnesses and treatments will also be taken. - Complete blood count (CBC): A procedure in which a sample of blood is drawn and checked for the following: - The number of red blood cells, white blood cells, and platelets. - The amount of hemoglobin (the protein that carries oxygen) in the red blood cells. - The portion of the sample made up of red blood cells. - HIV test : A test to measure the level of HIV antibodies in a sample of blood. Antibodies are made by the body when it is invaded by a foreign substance. A high level of HIV antibodies may mean the body has been infected with HIV. - Lymph node biopsy : The removal of all or part of a lymph node. A pathologist views the tissue under a microscope to look for cancer cells. One of the following types of biopsies may be done: - Excisional biopsy : The removal of an entire lymph node. - Incisional biopsy : The removal of part of a lymph node. - Core biopsy : The removal of tissue from a lymph node using a wide needle. - Fine-needle aspiration (FNA) biopsy : The removal of tissue from a lymph node using a thin needle. - Bone marrow aspiration and biopsy : The removal of bone marrow and a small piece of bone by inserting a hollow needle into the hipbone or breastbone. A pathologist views the bone marrow and bone under a microscope to look for signs of cancer. - Chest x-ray : An x-ray of the organs and bones inside the chest. An x-ray is a type of energy beam that can go through the body and onto film, making a picture of areas inside the body.\n",
      "id: 0000004_3-4      qtype: outlook\n",
      "     Question:\tWhat is the outlook for AIDS-Related Lymphoma ?\n",
      "     Answer:\tCertain factors affect prognosis (chance of recovery) and treatment options. The prognosis (chance of recovery) and treatment options depend on the following: - The stage of the cancer. - The age of the patient. - The number of CD4 lymphocytes (a type of white blood cell) in the blood. - The number of places in the body lymphoma is found outside the lymph system. - Whether the patient has a history of intravenous (IV) drug use. - The patient's ability to carry out regular daily activities.\n",
      "id: 0000004_3-5      qtype: stages\n",
      "     Question:\tWhat are the stages of AIDS-Related Lymphoma ?\n",
      "     Answer:\tKey Points - After AIDS-related lymphoma has been diagnosed, tests are done to find out if cancer cells have spread within the lymph system or to other parts of the body. - There are three ways that cancer spreads in the body. - Stages of AIDS-related lymphoma may include E and S. - The following stages are used for AIDS-related lymphoma: - Stage I - Stage II - Stage III - Stage IV - For treatment, AIDS-related lymphomas are grouped based on where they started in the body, as follows: - Peripheral/systemic lymphoma - Primary CNS lymphoma After AIDS-related lymphoma has been diagnosed, tests are done to find out if cancer cells have spread within the lymph system or to other parts of the body. The process used to find out if cancer cells have spread within the lymph system or to other parts of the body is called staging. The information gathered from the staging process determines the stage of the disease. It is important to know the stage in order to plan treatment, but AIDS-related lymphoma is usually advanced when it is diagnosed. The following tests and procedures may be used in the staging process: - Blood chemistry studies : A procedure in which a blood sample is checked to measure the amounts of certain substances released into the blood by organs and tissues in the body. An unusual (higher or lower than normal) amount of a substance can be a sign of disease. The blood sample will be checked for the level of LDH (lactate dehydrogenase). - CT scan (CAT scan): A procedure that makes a series of detailed pictures of areas inside the body, such as the lung, lymph nodes, and liver, taken from different angles. The pictures are made by a computer linked to an x-ray machine. A dye may be injected into a vein or swallowed to help the organs or tissues show up more clearly. This procedure is also called computed tomography, computerized tomography, or computerized axial tomography. 
- PET scan (positron emission tomography scan): A procedure to find malignant tumor cells in the body. A small amount of radioactive glucose (sugar) is injected into a vein. The PET scanner rotates around the body and makes a picture of where glucose is being used in the body. Malignant tumor cells show up brighter in the picture because they are more active and take up more glucose than normal cells do. - MRI (magnetic resonance imaging) with gadolinium : A procedure that uses a magnet, radio waves, and a computer to make a series of detailed pictures of areas inside the body. A substance called gadolinium is injected into the patient through a vein. The gadolinium collects around the cancer cells so they show up brighter in the picture. This procedure is also called nuclear magnetic resonance imaging (NMRI). - Lumbar puncture : A procedure used to collect cerebrospinal fluid (CSF) from the spinal column. This is done by placing a needle between two bones in the spine and into the CSF around the spinal cord and removing a sample of the fluid. The sample of CSF is checked under a microscope for signs that the cancer has spread to the brain and spinal cord. The sample may also be checked for Epstein-Barr virus. This procedure is also called an LP or spinal tap. There are three ways that cancer spreads in the body. Cancer can spread through tissue, the lymph system, and the blood: - Tissue. The cancer spreads from where it began by growing into nearby areas. - Lymph system. The cancer spreads from where it began by getting into the lymph system. The cancer travels through the lymph vessels to other parts of the body. - Blood. The cancer spreads from where it began by getting into the blood. The cancer travels through the blood vessels to other parts of the body. Stages of AIDS-related lymphoma may include E and S. 
AIDS-related lymphoma may be described as follows: - E: \"E\" stands for extranodal and means the cancer is found in an area or organ other than the lymph nodes or has spread to tissues beyond, but near, the major lymphatic areas. - S: \"S\" stands for spleen and means the cancer is found in the spleen. The following stages are used for AIDS-related lymphoma: Stage I Stage I AIDS-related lymphoma is divided into stage I and stage IE. - Stage I: Cancer is found in one lymphatic area (lymph node group, tonsils and nearby tissue, thymus, or spleen). - Stage IE: Cancer is found in one organ or area outside the lymph nodes. Stage II Stage II AIDS-related lymphoma is divided into stage II and stage IIE. - Stage II: Cancer is found in two or more lymph node groups either above or below the diaphragm (the thin muscle below the lungs that helps breathing and separates the chest from the abdomen). - Stage IIE: Cancer is found in one or more lymph node groups either above or below the diaphragm. Cancer is also found outside the lymph nodes in one organ or area on the same side of the diaphragm as the affected lymph nodes. Stage III Stage III AIDS-related lymphoma is divided into stage III, stage IIIE, stage IIIS, and stage IIIE+S. - Stage III: Cancer is found in lymph node groups above and below the diaphragm (the thin muscle below the lungs that helps breathing and separates the chest from the abdomen). - Stage IIIE: Cancer is found in lymph node groups above and below the diaphragm and outside the lymph nodes in a nearby organ or area. - Stage IIIS: Cancer is found in lymph node groups above and below the diaphragm, and in the spleen. - Stage IIIE+S: Cancer is found in lymph node groups above and below the diaphragm, outside the lymph nodes in a nearby organ or area, and in the spleen. 
Stage IV In stage IV AIDS-related lymphoma, the cancer: - is found throughout one or more organs that are not part of a lymphatic area (lymph node group, tonsils and nearby tissue, thymus, or spleen) and may be in lymph nodes near those organs; or - is found in one organ that is not part of a lymphatic area and has spread to organs or lymph nodes far away from that organ; or - is found in the liver, bone marrow, cerebrospinal fluid (CSF), or lungs (other than cancer that has spread to the lungs from nearby areas). Patients who are infected with the Epstein-Barr virus or whose AIDS-related lymphoma affects the bone marrow have an increased risk of the cancer spreading to the central nervous system (CNS). For treatment, AIDS-related lymphomas are grouped based on where they started in the body, as follows: Peripheral/systemic lymphoma Lymphoma that starts in the lymph system or elsewhere in the body, other than the brain, is called peripheral/systemic lymphoma. It may spread throughout the body, including to the brain or bone marrow. It is often diagnosed in an advanced stage. Primary CNS lymphoma Primary CNS lymphoma starts in the central nervous system (brain and spinal cord). It is linked to the Epstein-Barr virus. Lymphoma that starts somewhere else in the body and spreads to the central nervous system is not primary CNS lymphoma.\n",
      "id: 0000004_3-6      qtype: treatment\n",
      "     Question:\tWhat are the treatments for AIDS-Related Lymphoma ?\n",
      "     Answer:\tKey Points - There are different types of treatment for patients with AIDS-related lymphoma. - Treatment of AIDS-related lymphoma combines treatment of the lymphoma with treatment for AIDS. - Four types of standard treatment are used: - Chemotherapy - Radiation therapy - High-dose chemotherapy with stem cell transplant - Targeted therapy - New types of treatment are being tested in clinical trials. - Patients may want to think about taking part in a clinical trial. - Patients can enter clinical trials before, during, or after starting their cancer treatment. - Follow-up tests may be needed. There are different types of treatment for patients with AIDS-related lymphoma. Different types of treatment are available for patients with AIDS-related lymphoma. Some treatments are standard (the currently used treatment), and some are being tested in clinical trials. A treatment clinical trial is a research study meant to help improve current treatments or obtain information on new treatments for patients with cancer. When clinical trials show that a new treatment is better than the standard treatment, the new treatment may become the standard treatment. Patients may want to think about taking part in a clinical trial. Some clinical trials are open only to patients who have not started treatment. Treatment of AIDS-related lymphoma combines treatment of the lymphoma with treatment for AIDS. Patients with AIDS have weakened immune systems and treatment can cause the immune system to become even weaker. For this reason, treating patients who have AIDS-related lymphoma is difficult and some patients may be treated with lower doses of drugs than lymphoma patients who do not have AIDS. Combined antiretroviral therapy (cART) is used to lessen the damage to the immune system caused by HIV. Treatment with combined antiretroviral therapy may allow some patients with AIDS-related lymphoma to safely receive anticancer drugs in standard or higher doses. 
In these patients, treatment may work as well as it does in lymphoma patients who do not have AIDS. Medicine to prevent and treat infections, which can be serious, is also used. For more information about AIDS and its treatment, please see the AIDSinfo website. Four types of standard treatment are used: Chemotherapy Chemotherapy is a cancer treatment that uses drugs to stop the growth of cancer cells, either by killing the cells or by stopping them from dividing. When chemotherapy is taken by mouth or injected into a vein or muscle, the drugs enter the bloodstream and can reach cancer cells throughout the body (systemic chemotherapy). When chemotherapy is placed directly into the cerebrospinal fluid (intrathecal chemotherapy), an organ, or a body cavity such as the abdomen, the drugs mainly affect cancer cells in those areas (regional chemotherapy). Combination chemotherapy is treatment using more than one anticancer drug. The way the chemotherapy is given depends on where the cancer has formed. Intrathecal chemotherapy may be used in patients who are more likely to have lymphoma in the central nervous system (CNS). Chemotherapy is used in the treatment of AIDS-related peripheral/systemic lymphoma. It is not yet known whether it is best to give combined antiretroviral therapy at the same time as chemotherapy or after chemotherapy ends. Colony-stimulating factors are sometimes given together with chemotherapy. This helps lessen the side effects chemotherapy may have on the bone marrow. Radiation therapy Radiation therapy is a cancer treatment that uses high-energy x-rays or other types of radiation to kill cancer cells or keep them from growing. There are two types of radiation therapy: - External radiation therapy uses a machine outside the body to send radiation toward the cancer. - Internal radiation therapy uses a radioactive substance sealed in needles, seeds, wires, or catheters that are placed directly into or near the cancer. 
The way the radiation therapy is given depends on where the cancer has formed. External radiation therapy is used to treat AIDS-related primary CNS lymphoma. High-dose chemotherapy with stem cell transplant High-dose chemotherapy with stem cell transplant is a way of giving high doses of chemotherapy and replacing blood -forming cells destroyed by the cancer treatment. Stem cells (immature blood cells) are removed from the blood or bone marrow of the patient or a donor and are frozen and stored. After the chemotherapy is completed, the stored stem cells are thawed and given back to the patient through an infusion. These reinfused stem cells grow into (and restore) the body's blood cells. Targeted therapy Targeted therapy is a type of treatment that uses drugs or other substances to identify and attack specific cancer cells without harming normal cells. Monoclonal antibody therapy is a type of targeted therapy. Monoclonal antibody therapy is a cancer treatment that uses antibodies made in the laboratory from a single type of immune system cell. These antibodies can identify substances on cancer cells or normal substances that may help cancer cells grow. The antibodies attach to the substances and kill the cancer cells, block their growth, or keep them from spreading. Monoclonal antibodies are given by infusion. These may be used alone or to carry drugs, toxins, or radioactive material directly to cancer cells. Rituximab is used in the treatment of AIDS-related peripheral/systemic lymphoma. New types of treatment are being tested in clinical trials. Information about clinical trials is available from the NCI website. Patients may want to think about taking part in a clinical trial. For some patients, taking part in a clinical trial may be the best treatment choice. Clinical trials are part of the cancer research process. Clinical trials are done to find out if new cancer treatments are safe and effective or better than the standard treatment. 
Many of today's standard treatments for cancer are based on earlier clinical trials. Patients who take part in a clinical trial may receive the standard treatment or be among the first to receive a new treatment. Patients who take part in clinical trials also help improve the way cancer will be treated in the future. Even when clinical trials do not lead to effective new treatments, they often answer important questions and help move research forward. Patients can enter clinical trials before, during, or after starting their cancer treatment. Some clinical trials only include patients who have not yet received treatment. Other trials test treatments for patients whose cancer has not gotten better. There are also clinical trials that test new ways to stop cancer from recurring (coming back) or reduce the side effects of cancer treatment. Clinical trials are taking place in many parts of the country. See the Treatment Options section that follows for links to current treatment clinical trials. These have been retrieved from NCI's listing of clinical trials. Follow-up tests may be needed. Some of the tests that were done to diagnose the cancer or to find out the stage of the cancer may be repeated. Some tests will be repeated in order to see how well the treatment is working. Decisions about whether to continue, change, or stop treatment may be based on the results of these tests. Some of the tests will continue to be done from time to time after treatment has ended. The results of these tests can show if your condition has changed or if the cancer has recurred (come back). These tests are sometimes called follow-up tests or check-ups. Treatment Options for AIDS-Related Lymphoma AIDS-Related Peripheral/Systemic Lymphoma Treatment of AIDS-related peripheral/systemic lymphoma may include the following: - Combination chemotherapy with or without targeted therapy. 
- High-dose chemotherapy and stem cell transplant, for lymphoma that has not responded to treatment or has come back. - Intrathecal chemotherapy for lymphoma that is likely to spread to the central nervous system (CNS). Check the list of NCI-supported cancer clinical trials that are now accepting patients with AIDS-related peripheral/systemic lymphoma. For more specific results, refine the search by using other search features, such as the location of the trial, the type of treatment, or the name of the drug. Talk with your doctor about clinical trials that may be right for you. General information about clinical trials is available from the NCI website. AIDS-Related Primary Central Nervous System Lymphoma Treatment of AIDS-related primary central nervous system lymphoma may include the following: - External radiation therapy. Check the list of NCI-supported cancer clinical trials that are now accepting patients with AIDS-related primary CNS lymphoma. For more specific results, refine the search by using other search features, such as the location of the trial, the type of treatment, or the name of the drug. Talk with your doctor about clinical trials that may be right for you. General information about clinical trials is available from the NCI website.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "dir = './1_CancerGov_QA'\n",
    "# get all files in the specified directory\n",
    "files = [f for f in os.listdir(dir) if os.path.isfile(os.path.join(dir, f))]\n",
    "print(files)\n",
    "dir_dicts = []\n",
    "for f in files:\n",
    "  info_dic = parse_one_file(os.path.join(dir, f))\n",
    "  dir_dicts.append(info_dic)\n",
    "# inspect a random parsed dict in the directory\n",
    "print(show_example(dir_dicts[15]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uMN56--YMFtH"
   },
   "source": [
    "# LiveQA\n",
    "[LiveQA](https://github.com/abachaa/LiveQA_MedicalTask_TREC2017), or TREC-2017 LiveQA: Medical Question Answering Task focuses on consumer health question answering. Details of data creation can be found in the [paper](https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf). There are 634 question-answer pairs for training and 104 for testing.\n",
    "Additional 2,479 judged answers are available with MedQuAD.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8980oxAMFy-a"
   },
   "outputs": [],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MUSJSai0F3jb"
   },
   "source": [
    "Clone github repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uVwbiAVwPHyT"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/abachaa/LiveQA_MedicalTask_TREC2017.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 105,
     "status": "ok",
     "timestamp": 1633918075832,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "1Ki8SdglFA2A",
    "outputId": "e3bc1883-7f57-4b69-9e08-daad9106006e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/LiveQA_MedicalTask_TREC2017\n"
     ]
    }
   ],
   "source": [
    "%cd LiveQA_MedicalTask_TREC2017/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 141,
     "status": "ok",
     "timestamp": 1633918077426,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "yclURBXjFEqe",
    "outputId": "b8c4e758-43fe-4843-a63e-3bfbbfb2a17c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/LiveQA_MedicalTask_TREC2017/TrainingDatasets\n"
     ]
    }
   ],
   "source": [
    "%cd TrainingDatasets/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Yo2Ub0DJK3TU"
   },
   "source": [
    "The following function parse an entire train file. It returns a list of dictionaries, each dictionary corresponds to one of the 200 questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6WVEWeUnFJx1"
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "def parse_data(filename):\n",
    "  data = open(filename, 'r').read()\n",
    "  soup = BeautifulSoup(data, 'xml')\n",
    "  info_dic = []\n",
    "  # training_questions = soup.find_all('NLM-QUESTION') # cannot use soup.NLM-QUESTION because of hyphen\n",
    "  # get list of questions\n",
    "  training_questions = soup.find_all(questionid=True)\n",
    "  print(\"Number of questions: \", len(training_questions))\n",
    "  for example_q in training_questions:\n",
    "    questionid = example_q['questionid']\n",
    "    subject = example_q.SUBJECT.string\n",
    "    message = example_q.MESSAGE.string\n",
    "    # get list of all subqustions\n",
    "    sub_questions = example_q.find_all(\"SUB-QUESTION\") \n",
    "    sub_q_dic = []\n",
    "    for s in sub_questions:\n",
    "      subqid = s['subqid']\n",
    "      focus = s.FOCUS.string\n",
    "      qtype = s.TYPE.string\n",
    "      answers = s.find_all('ANSWER')\n",
    "      answer_dics = []\n",
    "      for a in answers:\n",
    "        answer_dics.append({'answerid': a['answerid'], 'pairid': a['pairid'], 'atext': a.string})\n",
    "      sub_q_dic.append({'subqid': subqid, 'focus': focus, 'qtype': qtype, 'answers': answer_dics})\n",
    "    info_dic.append({'questioniid': questionid, 'subject': subject, 'message': message, 'sub-questions': sub_q_dic})\n",
    "  return info_dic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 267,
     "status": "ok",
     "timestamp": 1633919208814,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "6obze67PHdC1",
    "outputId": "988fe191-200b-48b5-a530-04a651b99386"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of questions:  200\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'message': 'Literature on Cardiac amyloidosis.  Please let me know where I can get literature on Cardiac amyloidosis.  My uncle died yesterday from this disorder.  Since this is such a rare disorder, and to honor his memory, I would like to distribute literature at his funeral service.  I am a retired NIH employee, so I am familiar with the campus in case you have literature at NIH that I can come and pick up.  Thank you ',\n",
       " 'questioniid': 'Q1',\n",
       " 'sub-questions': [{'answers': [{'answerid': 'Q1-S1-A1',\n",
       "     'atext': 'Cardiac amyloidosis is a disorder caused by deposits of an abnormal protein (amyloid) in the heart tissue. These deposits make it hard for the heart to work properly.',\n",
       "     'pairid': '1'},\n",
       "    {'answerid': 'Q1-S1-A2',\n",
       "     'atext': 'The term \"amyloidosis\" refers not to a single disease but to a collection of diseases in which a protein-based infiltrate deposits in tissues as beta-pleated sheets. The subtype of the disease is determined by which protein is depositing; although dozens of subtypes have been described, most are incredibly rare or of trivial importance. This analysis will focus on the main systemic forms of amyloidosis, both of which frequently involve the heart.',\n",
       "     'pairid': '2'}],\n",
       "   'focus': 'cardiac amyloidosis',\n",
       "   'qtype': 'information',\n",
       "   'subqid': 'Q1-S1'}],\n",
       " 'subject': None}"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'TREC-2017-LiveQA-Medical-Train-1.xml'\n",
    "train_data = parse_data(filename)\n",
    "# inspect \n",
    "train_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sRnz45nIOv6E"
   },
   "source": [
    "# MEDIQA2019\n",
    "[MEDIQA2019](https://github.com/abachaa/MEDIQA2019) challenge is an ACL-BioNLP 2019 shared tasks aiming to attract further research effors in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). There is one task for each of the below task. In this section, we focus on [task 3 QA](https://github.com/abachaa/MEDIQA2019/tree/master/MEDIQA_Task3_QA)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 139,
     "status": "ok",
     "timestamp": 1633960491538,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "FcMCAAoXLXmP",
    "outputId": "5b1e26d5-7df9-40ad-dd3b-b414f38ac031"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive\n"
     ]
    }
   ],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "%cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NFD4MYjnO-eQ"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/abachaa/MEDIQA2019.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 127,
     "status": "ok",
     "timestamp": 1633960492960,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "53PXgFC_FJCd",
    "outputId": "d4e7375a-7f87-4b55-a4fe-59165d5090da"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/MEDIQA2019\n"
     ]
    }
   ],
   "source": [
    "%cd MEDIQA2019"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 108,
     "status": "ok",
     "timestamp": 1633919284473,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "G16Z3Z2hP7-L",
    "outputId": "ca7f87e2-37ce-47d4-8021-d1a980eb5968"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/MEDIQA2019/MEDIQA_Task3_QA\n"
     ]
    }
   ],
   "source": [
    "%cd MEDIQA_Task3_QA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KkfXmq3GQEHF"
   },
   "source": [
    "Task description: <br>\n",
    "1) filter/classify the provided answers (1: correct, 0: incorrect) <br>\n",
    "2) re-rank the answers <br>\n",
    "Dataset:\n",
    "TrainingSet1 contains 104 consumer health questions covering different types of questions about diseases and drugs, and the associated answers.\n",
    "TrainingSet2 contains 104 simple qustions about the most frequent diseases, and the associated answers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6frjSJmyo9hT"
   },
   "source": [
    "The following function parses train/val/test xml files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Kz749GCCP-lI"
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "def parse_file(filename):\n",
    "  data = open(filename, 'r').read()\n",
    "  soup = BeautifulSoup(data, 'xml')\n",
    "  questions = soup.find_all('Question')\n",
    "  question_dic = []\n",
    "  for q in questions:\n",
    "    QID = q['QID']\n",
    "    QuestionText = q.QuestionText.STRING\n",
    "    AnswerList = q.AnswerList.find_all('Answer')\n",
    "    answer_list_dic = []\n",
    "    for answer in AnswerList:\n",
    "      AID = answer['AID']\n",
    "      SystemRank = answer['SystemRank']\n",
    "      ReferenceRank = answer['ReferenceRank']\n",
    "      ReferenceScore = answer['ReferenceScore']\n",
    "      AnswerURL = answer.AnswerURL.string\n",
    "      AnswerText = answer.AnswerText.string\n",
    "      answer_list_dic.append({'AID': AID, 'SystemRank': SystemRank, 'ReferenceRank': ReferenceRank,\n",
    "                             'ReferenceScore': ReferenceScore, 'AnswerURL': AnswerURL, 'AnswerText': AnswerText})\n",
    "      question_dic.append({'QID': QID, 'QuestionText': QuestionText, 'Answer_dics': answer_list_dic})\n",
    "    return question_dic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZRX9PF0go8Ph"
   },
   "outputs": [],
   "source": [
    "filename = './MEDIQA_Task3_QA/MEDIQA2019-Task3-QA-TrainingSet1-LiveQAMed.xml'\n",
    "question_dic = parse_file(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 113,
     "status": "ok",
     "timestamp": 1633961665881,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "C7lvxSENsh7m",
    "outputId": "18f3dac2-754a-4694-ca94-d0dfc0f9e7d5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['QID', 'QuestionText', 'Answer_dics'])"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect keys of first example\n",
    "question_dic[0].keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 168,
     "status": "ok",
     "timestamp": 1633961699047,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "IS4NPTpHtHgx",
    "outputId": "e286954a-5bb7-44c9-c08d-46d929014e7a"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'AID': '1_Answer1',\n",
       "  'AnswerText': \"Noonan syndrome: Noonan syndrome is a genetic disorder that prevents normal development in various parts of the body. A person can be affected by Noonan syndrome in a wide variety of ways. These include unusual facial characteristics, short stature, heart defects, other physical problems and possible developmental delays. Noonan syndrome is caused by a genetic mutation and is acquired when a child inherits a copy of an affected gene from a parent (dominant inheritance). It can also occur as a spontaneous mutation, meaning there's no family history involved. Management of Noonan syndrome focuses on controlling the disorder's symptoms and complications. Growth hormone may be used to treat short stature in some people with Noonan syndrome. Signs and symptoms of Noonan syndrome vary greatly among individuals and may be mild to severe. Characteristics may be related to the specific gene containing the mutation. Facial appearance is one of the key clinical features that leads to a diagnosis of Noonan syndrome. These features may be more pronounced in infants and young children, but change with age. In adulthood, these distinct features become more subtle. Features may include the following: - Eyes are wide-set and down-slanting with droopy lids. Irises are pale blue or green. - Ears are low-set and rotated backward. - Nose is depressed at the top, with a wide base and bulbous tip. - Mouth has a deep groove between the nose and the mouth and wide peaks in the upper lip. The crease that runs from the edge of the nose to the corner of the mouth becomes deeply grooved with age. Teeth may be crooked, the inside roof of the mouth (palate) may be highly arched and the lower jaw may be small. - Facial features may appear coarse, but appear sharper with age. The face may appear droopy and expressionless. - Head may appear large with a prominent forehead and a low hairline on the back of the head. - Skin may appear thin and transparent with age. 
Many people with Noonan syndrome are born with some form of heart defect (congenital heart disease), accounting for some of the key signs and symptoms of the disorder. Some heart problems can occur later in life. Some forms of congenital heart disease associated with this disorder include: - Valve disorders. Pulmonary valve stenosis is a narrowing of the pulmonary valve, the flap of tissue that separates the lower right chamber (ventricle) of the heart from the artery that supplies blood to the lungs (pulmonary artery). It's the most common heart problem seen with Noonan syndrome, and it may occur alone or with other heart defects. - Thickening of the heart muscle (hypertrophic cardiomyopathy). This is abnormal growth or thickening of the heart muscle that affects some people with Noonan syndrome. - Other structural defects of the heart. The defects can involve a hole in the wall that separates the two lower chambers of the heart (ventricular septal defect), narrowing of the artery that carries blood to the lungs for oxygen (pulmonary artery stenosis), or narrowing of the major blood vessel (aorta) that carries blood from the heart to the body (aortic coarctation). - Irregular heart rhythm. This can occur with or without structural heart abnormalities. Irregular heart rhythm occurs in the majority of people with Noonan syndrome. Noonan syndrome can affect normal growth. Many children with Noonan syndrome don't grow at a normal rate. Issues may include the following: - Birth weight will likely be normal, but growth slows over time. - Eating difficulties may result in inadequate nutrition and poor weight gain. - Growth hormone levels may be insufficient. - The growth spurt that's usually seen during the teenage years may be delayed. But because this disorder causes bone maturity to be delayed, growth sometimes continues into the late teens. - By adulthood, some people with Noonan syndrome may have normal height, but short stature is more common. 
Some common issues can include: - An unusually shaped chest often with a sunken sternum (pectus excavatum) or raised sternum (pectus carinatum) - Wide-set nipples - Short neck, often with extra folds of skin (webbed neck) or prominent neck muscles (trapezius) - Deformities of the spine Intelligence isn't affected for most people with Noonan syndrome. However, individuals may have: - An increased risk of learning disabilities and mild intellectual disability - A wide range of mental, emotional and behavioral issues that are usually mild - Hearing and vision deficits that may complicate learning A common sign of Noonan syndrome is abnormalities of the eyes and eyelids. These may include: - Problems with the eye muscles, such as cross-eye (strabismus) - Refractive problems, such as astigmatism, nearsightedness (myopia) or farsightedness (hyperopia) - Rapid movement of the eyeballs (nystagmus) - Cataracts Noonan syndrome can cause hearing deficits due to nerve issues or to structural abnormalities in the inner ear bones. Noonan syndrome can cause excessive bleeding and bruising due to clotting defects or having too few platelets. Noonan syndrome can cause problems with the lymphatic system, which drains excess fluid from the body and helps fight infection. These problems: - May show up before or after birth or develop in the teenage years or adulthood - Can be focused in a particular area of the body or widespread - Most commonly cause excess fluid (lymphedema) on the back of the hands or top of the feet Many people, especially males, with Noonan syndrome can have problems with the genitals and kidneys. - Testicles. Undescended testicles (cryptorchidism) are common in males. - Puberty. Puberty may be delayed in both boys and girls. - Fertility. Most females develop normal fertility. In males, however, fertility may not develop normally, often because of undescended testicles. - Kidneys. 
Kidney problems are generally mild and occur in a fairly small number of people with Noonan syndrome. People with Noonan syndrome may have skin conditions, which most commonly are: - Various problems that affect the color and texture of the skin - Curly, coarse hair or sparse hair The signs and symptoms of Noonan syndrome can be subtle. If you suspect you or your child may have the disorder, see your primary care doctor or your child's pediatrician. You or your child may be referred to a geneticist or a cardiologist. If your unborn child is at risk because of a family history of Noonan syndrome, prenatal tests may be available. Noonan syndrome is caused by a genetic mutation. These mutations can occur in multiple genes. Defects in these genes cause the production of proteins that are continually active. Because these genes play a role in the formation of many tissues throughout the body, this constant activation of proteins disrupts the normal process of cell growth and division. The mutations that cause Noonan syndrome can be: - Inherited. Children who have one parent with Noonan syndrome who carries the defective gene (autosomal dominant) have a 50 percent chance of developing the disorder. - Random. Noonan syndrome can develop because of a new mutation in children who don't have a genetic predisposition for the disorder (de novo). A parent with Noonan syndrome has a 50 percent chance (one chance in two) of passing the defective gene on to his or her child. The child who inherits the defective gene may have fewer or more symptoms than the affected parent. Complications can arise that may require special attention, including: - Developmental delays. If your child is affected developmentally, he or she may have difficulty with organization and spatial sense. Sometimes the developmental challenges are significant enough to require a special plan to address your child's learning and educational needs. - Bleeding and bruising. 
Sometimes the excessive bleeding common in Noonan syndrome isn't discovered until a person has dental work or surgery and experiences excessive bleeding. - Lymphatic complications. These usually involve excess fluid that gets stored in various places in the body. Sometimes fluid can collect in the space around the heart and lungs. - Urinary tract complications. Structural abnormalities in the kidneys may increase the risk of urinary tract infections. - Fertility issues. Males may have a low sperm count and other fertility problems because of undescended testicles or testicles that don't function properly. - Increased risk of cancer. There may be an increased risk of developing certain types of cancer, such as leukemia or certain types of tumors. A diagnosis of Noonan syndrome is usually made after a doctor observes some key signs, but this can be difficult because some features are subtle and hard to identify. Sometimes, Noonan syndrome isn't diagnosed until adulthood, only after a person has a child who is more obviously affected by the condition. Molecular genetic testing can help confirm a diagnosis. If there's evidence of heart problems, a doctor who specializes in heart conditions (cardiologist) can assess the type and severity. Although there's no way to repair the gene changes that cause Noonan syndrome, treatments can help minimize its effects. The earlier a diagnosis is made and treatment is started, the greater the benefits. Treatment of the symptoms and complications that occur with Noonan syndrome depends on type and severity. Many of the health and physical issues associated with this syndrome are treated as they would be for anyone with a similar health problem. Taken together, though, the many problems of this disorder require a coordinated team approach. Recommended approaches may include: - Heart treatment. Certain drugs may be effective in treating some kinds of heart problems. 
If there's a problem with the heart's valves, surgery may be necessary. The doctor also may recommend that heart function be evaluated periodically. - Treating low growth rate. Height should be measured three times a year until 3 years of age and then once every year until adulthood to make sure he or she is growing. To evaluate nutrition, the doctor will likely request blood tests. If your child's growth hormone levels are insufficient, growth hormone therapy may be a treatment option. - Addressing learning disabilities. For early childhood developmental delays, ask the doctor about infant stimulation programs. Physical and speech therapies may be helpful for addressing a variety of possible issues. In some cases special education or individualized teaching strategies may be appropriate. - Vision and hearing treatments. Eye exams are recommended at least every two years. Most eye issues can be treated with glasses alone. Surgery may be needed for some conditions, such as cataracts. Hearing screenings are recommended annually during childhood. - Treatment for bleeding and bruising. If there's a history of easy bruising or excessive bleeding, avoid aspirin and aspirin-containing products. In some cases, doctors may prescribe drugs that help the blood to clot. Notify your doctor before any procedures. - Treatment for lymphatic problems. Lymphatic problems can occur in many ways and may not require treatment. If they do require treatment, your doctor can suggest appropriate measures. - Treatment for genital problems. If one or both testicles haven't moved into proper position within the first few months of life (undescended testicle), surgery may be needed. Other evaluations and regular follow-up care may be recommended depending on specific issues, for example, regular dental care. Children, teens and adults should continue to have ongoing, periodic evaluations by their health care professional.\",\n",
       "  'AnswerURL': 'https://www.mayoclinic.org/diseases-conditions/noonan-syndrome/symptoms-causes/syc-20354422',\n",
       "  'ReferenceRank': '2',\n",
       "  'ReferenceScore': '4',\n",
       "  'SystemRank': '1'},\n",
       " {'AID': '1_Answer2',\n",
       "  'AnswerText': 'What are the treatments for Noonan syndrome?: How might Noonan syndrome be treated? Management generally focuses on the specific signs and symptoms present in each person. Treatments for the complications of Noonan syndrome (such as cardiovascular abnormalities) are generally standard and do not differ from treatment in the general population. Developmental disabilities are addressed by early intervention programs and individualized education strategies. Treatment for serious bleeding depends upon the specific factor deficiency or platelet abnormality. Growth hormone treatment increases growth velocity. More detailed information about treatment for Noonan syndrome can be viewed on the GeneReviews Web site.',\n",
       "  'AnswerURL': 'https://rarediseases.info.nih.gov/gard/10955/noonan-syndrome',\n",
       "  'ReferenceRank': '7',\n",
       "  'ReferenceScore': '2',\n",
       "  'SystemRank': '2'},\n",
       " {'AID': '1_Answer3',\n",
       "  'AnswerText': 'Noonan syndrome: Noonan syndrome is a disease that can be passed down through families (inherited). It causes many parts of the body to develop abnormally. Noonan syndrome is linked to defects in several genes. In general, certain proteins involved in growth and development become overactive as a result of these gene changes. Noonan syndrome is an autosomal dominant condition. This means only one parent has to pass down the nonworking gene for the child to have the syndrome. However, some cases may not be inherited. Symptoms include: - Delayed puberty - Down-slanting or wide-set eyes - Hearing loss (varies) - Low-set or abnormally shaped ears - Mild intellectual disability (only in about 25% of cases) - Sagging eyelids (ptosis) - Short stature - Small penis - Undescended testicles - Unusual chest shape (most often a sunken chest called pectus excavatum) - Webbed and short-appearing neck The health care provider will perform a physical exam. This may show signs of heart problems the infant had from birth. These may include pulmonary stenosis and atrial septal defect. Tests depend on the symptoms, but may include: - Platelet count - Blood clotting factor test - EKG, chest x-ray, or echocardiogram - Hearing tests - Growth hormone levels Genetic testing can help diagnose this syndrome. There is no specific treatment. Your provider will suggest treatment to relieve or manage symptoms. Growth hormone has been used successfully to treat short height in some persons with Noonan syndrome. The Noonan Syndrome Foundation is a place where people dealing with this condition can find information and resources. 
Complications may include: - Abnormal bleeding or bruising - Buildup of fluid in tissues of body (lymphedema, cystic hygroma) - Failure to thrive in infants - Leukemia and other cancers - Low self-esteem - Infertility in males if both testes are undescended - Problems with the structure of the heart - Short height - Social problems due to physical symptoms This condition may be found during early infant exams. A geneticist is often needed to diagnose Noonan syndrome. Couples with a family history of Noonan syndrome may want to consider genetic counseling before having children. Updated by: Chad Haldeman-Englert, MD, FACMG, Fullerton Genetics Center, Asheville, NC. Review provided by VeriMed Healthcare Network. Also reviewed by David Zieve, MD, MHA, Isla Ogilvie, PhD, and the A.D.A.M. Editorial team.',\n",
       "  'AnswerURL': 'https://medlineplus.gov/ency/article/001656.htm',\n",
       "  'ReferenceRank': '3',\n",
       "  'ReferenceScore': '3',\n",
       "  'SystemRank': '3'},\n",
       " {'AID': '1_Answer4',\n",
       "  'AnswerText': 'What are the treatments for Noonan syndrome?: These resources address the diagnosis or management of Noonan syndrome: - Gene Review: Gene Review: Noonan Syndrome - Genetic Testing Registry: Noonan syndrome - Genetic Testing Registry: Noonan syndrome 1 - Genetic Testing Registry: Noonan syndrome 2 - Genetic Testing Registry: Noonan syndrome 3 - Genetic Testing Registry: Noonan syndrome 4 - Genetic Testing Registry: Noonan syndrome 5 - Genetic Testing Registry: Noonan syndrome 6 - Genetic Testing Registry: Noonan syndrome 7 - MedlinePlus Encyclopedia: Noonan Syndrome These resources from MedlinePlus offer information about the diagnosis and management of various health conditions: - Diagnostic Tests - Drug Therapy - Surgery and Rehabilitation - Genetic Counseling - Palliative Care',\n",
       "  'AnswerURL': 'https://ghr.nlm.nih.gov/condition/noonan-syndrome',\n",
       "  'ReferenceRank': '8',\n",
       "  'ReferenceScore': '2',\n",
       "  'SystemRank': '4'},\n",
       " {'AID': '1_Answer5',\n",
       "  'AnswerText': \"Noonan syndrome: Noonan syndrome is a genetic disorder that causes abnormal development of multiple parts of the body. Features of Noonan syndrome may include a distinctive facial appearance, short stature , a broad or webbed neck, congenital heart defects , bleeding problems, problems with bone structure (skeletal malformations), and developmental delay . [1] [2] Noonan syndrome may be caused by a mutation in any of several genes, and\\xa0can be classified into subtypes based on the responsible gene . It is typically inherited in an autosomal dominant manner, but many cases are due to a new mutation and are not inherited from either parent. Treatment depends on the symptoms present in each person. [3] Noonan syndrome belongs to a group of related conditions called the RASopathies. These conditions have some overlapping features and are all caused by genetic changes that disrupt the body's RAS pathway, affecting growth and development. Other conditions in this group include: [4] neurofibromatosis type 1 LEOPARD syndrome, also called Noonan syndrome with multiple lentigines\\xa0 Costello syndrome cardiofaciocutaneous syndrome Legius syndrome capillary malformation–arteriovenous malformation syndrome The Human Phenotype Ontology (HPO) provides the following list of features that have been reported in people with this condition. Much of the information in the HPO comes from Orphanet, a European rare disease database. If available, the list includes a rough estimate of how common a feature is (its frequency). Frequencies are based on a specific study and may not be representative of all studies. You can use the MedlinePlus Medical Dictionary for definitions of the terms below. 
Signs and Symptoms Approximate number of patients (when available) Aplasia/Hypoplasia of the abdominal wall musculature Very frequent Cystic hygroma Very frequent Downslanted palpebral fissures Very frequent Dysarthria Very frequent Enlarged thorax Very frequent High forehead Very frequent High palate Very frequent Hypertelorism Very frequent Hypogonadotrophic hypogonadism Very frequent Joint hyperflexibility Very frequent Low-set, posteriorly rotated ears Very frequent Micrognathia Very frequent Midface retrusion Very frequent Pectus carinatum Very frequent Pectus excavatum Very frequent Proptosis Very frequent Ptosis Very frequent Pulmonary artery stenosis Very frequent Thick lower lip vermilion Very frequent Thickened helices Very frequent Thickened nuchal skin fold Very frequent Triangular face Very frequent Webbed neck Very frequent Wide intermamillary distance Very frequent Abnormal bleeding Frequent Abnormal dermatoglyphics Frequent Abnormal hair quantity Frequent Abnormal platelet function Frequent Abnormal pulmonary valve morphology Frequent Abnormality of coagulation Frequent Abnormality of the spleen Frequent Coarse hair Frequent Cryptorchidism Frequent Delayed skeletal maturation Frequent Feeding difficulties in infancy Frequent Hepatomegaly Frequent Low posterior hairline Frequent Muscular hypotonia Frequent Scoliosis Frequent Strabismus Frequent Aplasia of the semicircular canal Occasional Brachydactyly Occasional Clinodactyly of the 5th finger Occasional Hypogonadism Occasional Lymphedema Occasional Melanocytic nevus Occasional Nystagmus Occasional Radioulnar synostosis Occasional Sensorineural hearing impairment Occasional Intellectual disability Very rare Amegakaryocytic thrombocytopenia - Atrial septal defect - Autosomal dominant inheritance - Clinodactyly - Coarctation of aorta - Cubitus valgus - Dental malocclusion - Epicanthus - Failure to thrive in infancy - Heterogeneous - High, narrow palate - Hypertrophic cardiomyopathy - Kyphoscoliosis - 
Male infertility - Myopia - Neurofibrosarcoma - Patent ductus arteriosus - Pectus excavatum of inferior sternum - Postnatal growth retardation - Pulmonic stenosis - Radial deviation of finger - Reduced factor XII activity - Reduced factor XIII activity - Shield chest - Short neck - Superior pectus carinatum - Synovitis - Ventricular septal defect - View complete list of signs and symptoms... Noonan syndrome is inherited in an autosomal dominant manner. [5] This means that having one changed or mutated copy of the responsible gene in each cell is enough to cause the condition. Each child of a person with Noonan syndrome has a 50% (1 in 2) chance to inherit the condition. In other cases, the change in one of the genes that can cause Noonan syndrome is new and not found in either parent. This means that the genetic change was not passed down from either the mother or the father, but instead occurred for the first time (de novo) in the child who has the syndrome. [5]\\xa0 New changes or mutations in a gene can happen by mistake during the making of the egg or the sperm. Making a diagnosis for a genetic or rare disease can often be challenging. Healthcare professionals typically look at a person’s medical history, symptoms, physical exam, and laboratory test results in order to make a diagnosis. The following resources provide information relating to diagnosis and testing for this condition. If you have questions about getting a diagnosis, you should contact a healthcare professional. Testing Resources The Genetic Testing Registry (GTR) provides information about the genetic tests for this condition. The intended audience for the GTR is health care providers and researchers. Patients and consumers with specific questions about a genetic test should contact a health care provider or a genetics professional. Orphanet lists international laboratories offering diagnostic testing for this condition. 
Management of Noonan syndrome generally focuses on the specific signs and symptoms present in each person. Treatments for the complications of Noonan syndrome (such as cardiovascular problems) are generally standard and do not differ from treatment in the general population. [5] Developmental disabilities are addressed by early intervention programs. Some children with Noonan syndrome may need special help in school, including for example, an individualized educational program (IEP). [5]\\xa0 Treatment for bleeding problems depends on the cause. [5] Growth hormone (GH) therapy can increase the rate at which a child with Noonan syndrome grows in most cases. GH therapy during childhood and teen years may also increase final adult height slightly, often enough to reach the low normal range of average height. [5] [6] Management Guidelines GeneReviews provides current, expert-authored, peer-reviewed, full-text articles describing the application of genetic testing to the diagnosis, management, and genetic counseling of patients with specific inherited conditions. Project OrphanAnesthesia is a project whose aim is to create peer-reviewed, readily accessible guidelines for patients with rare diseases and for the anesthesiologists caring for them. The project is a collaborative effort of the German Society of Anesthesiology and Intensive Care, Orphanet, the European Society of Pediatric Anesthesia, anesthetists and rare disease experts with the aim to contribute to patient safety. There is a wide range in the nature and severity of signs and symptoms that may be present in people with Noonan syndrome , so the long-term outlook ( prognosis ) and life expectancy may differ among affected people. Studies generally suggest that long-term outcome depends largely on the presence and severity of congenital heart defects . Death in affected people has been frequently associated with the presence of complex left ventricular disease. 
[7] Studies have indicated that people with Noonan syndrome have a 3-fold higher mortality rate than those in the general population. [8] [9] Some affected people have ongoing health problems due to congenital heart defects, lymphatic vessel dysplasia, urinary tract malformations, blood disorders , or other associated health issues. [8] [9] However, with special care and counseling, the majority of children with Noonan syndrome grow up and function normally as adults. Signs and symptoms tend to lessen with age, and new medical problems associated with the condition are generally not expected to appear in adulthood. [8] The following diseases are related to Noonan syndrome. If you have a question about any of these diseases, you can contact GARD. Noonan syndrome 1 Noonan syndrome 2 Noonan syndrome 3 Noonan syndrome 4 Noonan syndrome 5 Noonan syndrome 6\",\n",
       "  'AnswerURL': 'https://rarediseases.info.nih.gov/diseases/10955/noonan-syndrome',\n",
       "  'ReferenceRank': '5',\n",
       "  'ReferenceScore': '3',\n",
       "  'SystemRank': '5'},\n",
       " {'AID': '1_Answer6',\n",
       "  'AnswerText': 'What are the treatments for Noonan syndrome?: There is no specific treatment. Your doctor will suggest treatment to relieve or manage symptoms. Growth hormone has been used successfully to treat short height in some persons with Noonan syndrome.',\n",
       "  'AnswerURL': 'https://www.nlm.nih.gov/medlineplus/ency/article/001656.htm',\n",
       "  'ReferenceRank': '6',\n",
       "  'ReferenceScore': '2',\n",
       "  'SystemRank': '6'},\n",
       " {'AID': '1_Answer7',\n",
       "  'AnswerText': \"Neurofibromatosis-Noonan syndrome: This condition doesn't have a summary yet. Please see our page(s) on Neurofibromatosis. The Human Phenotype Ontology (HPO) provides the following list of features that have been reported in people with this condition. Much of the information in the HPO comes from Orphanet, a European rare disease database. If available, the list includes a rough estimate of how common a feature is (its frequency). Frequencies are based on a specific study and may not be representative of all studies. You can use the MedlinePlus Medical Dictionary for definitions of the terms below. Signs and Symptoms Approximate number of patients (when available) Abdominal wall muscle weakness Very frequent Abnormality of the helix Very frequent Downslanted palpebral fissures Very frequent Hypertelorism Very frequent Hypertrophic cardiomyopathy Very frequent Low-set, posteriorly rotated ears Very frequent Multiple cafe-au-lait spots Very frequent Ptosis Very frequent Pulmonic stenosis Very frequent Specific learning disability Very frequent Webbed neck Very frequent Abnormality of the lymphatic system Frequent Abnormality of the thorax Frequent Cryptorchidism Frequent Dysphagia Frequent Prolonged bleeding time Frequent Lisch nodules Occasional Autosomal dominant inheritance - Axillary freckling - Cafe-au-lait spot - Cubitus valgus - Delayed speech and language development - Epicanthus - Inguinal freckling - Low posterior hairline - Low-set ears - Macrocephaly - Malar flattening - Midface retrusion - Neurofibromas - Optic glioma - Pectus excavatum of inferior sternum - Posteriorly rotated ears - Prominent nasolabial fold - Scoliosis - Secundum atrial septal defect - Short neck - Superior pectus carinatum - View complete list of signs and symptoms... Making a diagnosis for a genetic or rare disease can often be challenging. 
Healthcare professionals typically look at a person’s medical history, symptoms, physical exam, and laboratory test results in order to make a diagnosis. The following resources provide information relating to diagnosis and testing for this condition. If you have questions about getting a diagnosis, you should contact a healthcare professional. Testing Resources The Genetic Testing Registry (GTR) provides information about the genetic tests for this condition. The intended audience for the GTR is health care providers and researchers. Patients and consumers with specific questions about a genetic test should contact a health care provider or a genetics professional. The following diseases are related to Neurofibromatosis-Noonan syndrome. If you have a question about any of these diseases, you can contact GARD. Neurofibromatosis\",\n",
       "  'AnswerURL': 'https://rarediseases.info.nih.gov/diseases/372/neurofibromatosis-noonan-syndrome',\n",
       "  'ReferenceRank': '10',\n",
       "  'ReferenceScore': '1',\n",
       "  'SystemRank': '7'},\n",
       " {'AID': '1_Answer8',\n",
       "  'AnswerText': \"Noonan syndrome: Noonan syndrome is a condition that affects many areas of the body. It is characterized by mildly unusual facial features, short stature, heart defects, bleeding problems, skeletal malformations, and many other signs and symptoms. People with Noonan syndrome have distinctive facial features such as a deep groove in the area between the nose and mouth (philtrum), widely spaced eyes that are usually pale blue or blue-green in color, and low-set ears that are rotated backward. Affected individuals may have a high arch in the roof of the mouth (high-arched palate), poor teeth alignment, and a small lower jaw (micrognathia). Many children with Noonan syndrome have a short neck, and both children and adults may have excess neck skin (also called webbing) and a low hairline at the back of the neck. Between 50 and 70 percent of individuals with Noonan syndrome have short stature. At birth, they are usually a normal length and weight, but growth slows over time. Abnormal levels of growth hormone, a protein that is necessary for the normal growth of the body's bones and tissues, may contribute to the slow growth. Individuals with Noonan syndrome often have either a sunken chest (pectus excavatum) or a protruding chest (pectus carinatum). Some affected people may also have an abnormal side-to-side curvature of the spine (scoliosis). Most people with Noonan syndrome have some form of critical congenital heart disease. The most common heart defect in these individuals is a narrowing of the valve that controls blood flow from the heart to the lungs (pulmonary valve stenosis). Some have hypertrophic cardiomyopathy, which enlarges and weakens the heart muscle. A variety of bleeding disorders have been associated with Noonan syndrome. Some affected individuals have excessive bruising, nosebleeds, or prolonged bleeding following injury or surgery. 
Rarely, women with Noonan syndrome who have a bleeding disorder have excessive bleeding during menstruation (menorrhagia) or childbirth. Adolescent males with Noonan syndrome typically experience delayed puberty. They go through puberty starting at age 13 or 14 and have a reduced pubertal growth spurt that results in shortened stature. Most males with Noonan syndrome have undescended testes (cryptorchidism), which may contribute to infertility (inability to father a child) later in life. Females with Noonan syndrome can experience delayed puberty but most have normal puberty and fertility. Noonan syndrome can cause a variety of other signs and symptoms. Most children diagnosed with Noonan syndrome have normal intelligence, but a few have special educational needs, and some have intellectual disability. Some affected individuals have vision or hearing problems. Affected infants may have feeding problems, which typically get better by age 1 or 2 years. Infants with Noonan syndrome may be born with puffy hands and feet caused by a buildup of fluid (lymphedema), which can go away on its own. Older individuals can also develop lymphedema, usually in the ankles and lower legs. Some people with Noonan syndrome develop cancer, particularly those involving the blood-forming cells (leukemia). It has been estimated that children with Noonan syndrome have an eightfold increased risk of developing leukemia or other cancers over age-matched peers. Noonan syndrome is one of a group of related conditions, collectively known as RASopathies. These conditions all have similar signs and symptoms and are caused by changes in the same cell signaling pathway. In addition to Noonan syndrome, the RASopathies include cardiofaciocutaneous syndrome, Costello syndrome, neurofibromatosis type 1, Legius syndrome, and Noonan syndrome with multiple lentigines. Noonan syndrome occurs in approximately 1 in 1,000 to 2,500 people. Mutations in multiple genes can cause Noonan syndrome. 
Mutations in the PTPN11 gene cause about half of all cases. SOS1 gene mutations cause an additional 10 to 15 percent, and RAF1 and RIT1 genes each account for about 5 percent of cases. Mutations in other genes each account for a small number of cases. The cause of Noonan syndrome in 15 to 20 percent of people with this disorder is unknown. The PTPN11, SOS1, RAF1, and RIT1 genes all provide instructions for making proteins that are important in the RAS/MAPK cell signaling pathway, which is needed for cell division and growth (proliferation), the process by which cells mature to carry out specific functions (differentiation), and cell movement (migration). Many of the mutations in the genes associated with Noonan syndrome cause the resulting protein to be turned on (active) longer than normal, rather than promptly switching on and off in response to cell signals. This prolonged activation alters normal RAS/MAPK signaling, which disrupts the regulation of cell growth and division, leading to the characteristic features of Noonan syndrome. Rarely, Noonan syndrome is associated with genes that are not involved in the RAS/MAPK cell signaling pathway. Researchers are working to determine how mutations in these genes can lead to the signs and symptoms of Noonan syndrome. This condition is inherited in an autosomal dominant pattern, which means one copy of the altered gene in each cell is sufficient to cause the disorder. Chen PC, Yin J, Yu HW, Yuan T, Fernandez M, Yung CK, Trinh QM, Peltekova VD, Reid JG, Tworog-Dube E, Morgan MB, Muzny DM, Stein L, McPherson JD, Roberts AE, Gibbs RA, Neel BG, Kucherlapati R. Next-generation sequencing identifies rare variants associated with Noonan syndrome. Proc Natl Acad Sci U S A. 2014 Aug 5;111(31):11473-8. doi: 10.1073/pnas.1324128111. Epub 2014 Jul 21.\",\n",
       "  'AnswerURL': 'https://ghr.nlm.nih.gov/condition/noonan-syndrome',\n",
       "  'ReferenceRank': '4',\n",
       "  'ReferenceScore': '3',\n",
       "  'SystemRank': '8'},\n",
       " {'AID': '1_Answer9',\n",
       "  'AnswerText': 'PTPN11 gene (Noonan syndrome): More than 90 mutations causing Noonan syndrome have been identified in the PTPN11 gene. This condition is characterized by mildly unusual facial characteristics, short stature, heart defects, bleeding problems, skeletal malformations, and many other signs and symptoms. Most of the PTPN11 gene mutations replace single amino acids used to make the SHP-2 protein. The resulting protein is either continuously turned on (active) or has prolonged activation, rather than promptly switching on and off in response to other cellular proteins. This increase in protein activity disrupts the regulation of the RAS/MAPK signaling pathway that controls cell functions such as proliferation. This misregulation can result in the heart defects, growth problems, skeletal abnormalities, and other features of Noonan syndrome. Rarely, a person with Noonan syndrome caused by PTPN11 gene mutations will also develop juvenile myelomonocytic leukemia, which is a type of blood cancer that typically affects children or adolescents.',\n",
       "  'AnswerURL': 'https://ghr.nlm.nih.gov/gene/PTPN11',\n",
       "  'ReferenceRank': '9',\n",
       "  'ReferenceScore': '2',\n",
       "  'SystemRank': '9'},\n",
       " {'AID': '1_Answer10',\n",
       "  'AnswerText': \"Noonan syndrome (Symptoms): Signs and symptoms of Noonan syndrome vary greatly among individuals and may be mild to severe. Characteristics may be related to the specific gene containing the mutation. Facial appearance is one of the key clinical features that leads to a diagnosis of Noonan syndrome. These features may be more pronounced in infants and young children, but change with age. In adulthood, these distinct features become more subtle. Features may include the following: - Eyes are wide-set and down-slanting with droopy lids. Irises are pale blue or green. - Ears are low-set and rotated backward. - Nose is depressed at the top, with a wide base and bulbous tip. - Mouth has a deep groove between the nose and the mouth and wide peaks in the upper lip. The crease that runs from the edge of the nose to the corner of the mouth becomes deeply grooved with age. Teeth may be crooked, the inside roof of the mouth (palate) may be highly arched and the lower jaw may be small. - Facial features may appear coarse, but appear sharper with age. The face may appear droopy and expressionless. - Head may appear large with a prominent forehead and a low hairline on the back of the head. - Skin may appear thin and transparent with age. Many people with Noonan syndrome are born with some form of heart defect (congenital heart disease), accounting for some of the key signs and symptoms of the disorder. Some heart problems can occur later in life. Some forms of congenital heart disease associated with this disorder include: - Valve disorders. Pulmonary valve stenosis is a narrowing of the pulmonary valve, the flap of tissue that separates the lower right chamber (ventricle) of the heart from the artery that supplies blood to the lungs (pulmonary artery). It's the most common heart problem seen with Noonan syndrome, and it may occur alone or with other heart defects. - Thickening of the heart muscle (hypertrophic cardiomyopathy). 
This is abnormal growth or thickening of the heart muscle that affects some people with Noonan syndrome. - Other structural defects of the heart. The defects can involve a hole in the wall that separates the two lower chambers of the heart (ventricular septal defect), narrowing of the artery that carries blood to the lungs for oxygen (pulmonary artery stenosis), or narrowing of the major blood vessel (aorta) that carries blood from the heart to the body (aortic coarctation). - Irregular heart rhythm. This can occur with or without structural heart abnormalities. Irregular heart rhythm occurs in the majority of people with Noonan syndrome. Noonan syndrome can affect normal growth. Many children with Noonan syndrome don't grow at a normal rate. Issues may include the following: - Birth weight will likely be normal, but growth slows over time. - Eating difficulties may result in inadequate nutrition and poor weight gain. - Growth hormone levels may be insufficient. - The growth spurt that's usually seen during the teenage years may be delayed. But because this disorder causes bone maturity to be delayed, growth sometimes continues into the late teens. - By adulthood, some people with Noonan syndrome may have normal height, but short stature is more common. Some common issues can include: - An unusually shaped chest often with a sunken sternum (pectus excavatum) or raised sternum (pectus carinatum) - Wide-set nipples - Short neck, often with extra folds of skin (webbed neck) or prominent neck muscles (trapezius) - Deformities of the spine Intelligence isn't affected for most people with Noonan syndrome. However, individuals may have: - An increased risk of learning disabilities and mild intellectual disability - A wide range of mental, emotional and behavioral issues that are usually mild - Hearing and vision deficits that may complicate learning A common sign of Noonan syndrome is abnormalities of the eyes and eyelids. 
These may include: - Problems with the eye muscles, such as cross-eye (strabismus) - Refractive problems, such as astigmatism, nearsightedness (myopia) or farsightedness (hyperopia) - Rapid movement of the eyeballs (nystagmus) - Cataracts Noonan syndrome can cause hearing deficits due to nerve issues or to structural abnormalities in the inner ear bones. Noonan syndrome can cause excessive bleeding and bruising due to clotting defects or having too few platelets. Noonan syndrome can cause problems with the lymphatic system, which drains excess fluid from the body and helps fight infection. These problems: - May show up before or after birth or develop in the teenage years or adulthood - Can be focused in a particular area of the body or widespread - Most commonly cause excess fluid (lymphedema) on the back of the hands or top of the feet Many people, especially males, with Noonan syndrome can have problems with the genitals and kidneys. - Testicles. Undescended testicles (cryptorchidism) are common in males. - Puberty. Puberty may be delayed in both boys and girls. - Fertility. Most females develop normal fertility. In males, however, fertility may not develop normally, often because of undescended testicles. - Kidneys. Kidney problems are generally mild and occur in a fairly small number of people with Noonan syndrome. People with Noonan syndrome may have skin conditions, which most commonly are: - Various problems that affect the color and texture of the skin - Curly, coarse hair or sparse hair The signs and symptoms of Noonan syndrome can be subtle. If you suspect you or your child may have the disorder, see your primary care doctor or your child's pediatrician. You or your child may be referred to a geneticist or a cardiologist. If your unborn child is at risk because of a family history of Noonan syndrome, prenatal tests may be available.\",\n",
       "  'AnswerURL': 'https://www.mayoclinic.org/diseases-conditions/noonan-syndrome/symptoms-causes/syc-20354422',\n",
       "  'ReferenceRank': '1',\n",
       "  'ReferenceScore': '4',\n",
       "  'SystemRank': '10'}]"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect Answers for first example\n",
    "question_dic[0]['Answer_dics']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eLIyTFaaPJSA"
   },
   "source": [
    "# Medication_QA_MedInfo2019\n",
    "[Medication_QA_MedInfo2019](https://github.com/abachaa/Medication_QA_MedInfo2019) is the gold standard corpus for medication qustion answering. The dataset consists of 674 question-answer pairs with annotations of the question focus, type, and the answer source. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LsVrJyNelnkG"
   },
   "outputs": [],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1652,
     "status": "ok",
     "timestamp": 1633753399112,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "_xpKw8BfPT2y",
    "outputId": "f9fff5bd-b322-4a57-8cdd-ffc9ceb7f792"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'Medication_QA_MedInfo2019'...\n",
      "remote: Enumerating objects: 18, done.\u001b[K\n",
      "remote: Total 18 (delta 0), reused 0 (delta 0), pack-reused 18\u001b[K\n",
      "Unpacking objects: 100% (18/18), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/abachaa/Medication_QA_MedInfo2019.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 117,
     "status": "ok",
     "timestamp": 1633959714377,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "BjW2h8c2SdAo",
    "outputId": "05205b65-9e28-4a04-c287-76f912c1fcd8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/MyDrive/Medication_QA_MedInfo2019\n"
     ]
    }
   ],
   "source": [
    "%cd Medication_QA_MedInfo2019/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2JgPB8FQlr81"
   },
   "source": [
    "The dataset is stored in an Excel sheet. Load it into a pandas DataFrame and inspect the first 3 rows.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 337
    },
    "executionInfo": {
     "elapsed": 126,
     "status": "ok",
     "timestamp": 1633960431330,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "3BNa-1-XSqKT",
    "outputId": "e5ced2ca-5bd7-493b-b076-5672ac11c03d"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Question</th>\n",
       "      <th>Focus (Drug)</th>\n",
       "      <th>Question Type</th>\n",
       "      <th>Answer</th>\n",
       "      <th>Section Title</th>\n",
       "      <th>URL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>how does rivatigmine and otc sleep medicine in...</td>\n",
       "      <td>rivastigmine</td>\n",
       "      <td>Interaction</td>\n",
       "      <td>tell your doctor and pharmacist what prescript...</td>\n",
       "      <td>What special precautions should I follow?</td>\n",
       "      <td>https://medlineplus.gov/druginfo/meds/a602009....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>how does valium affect the brain</td>\n",
       "      <td>Valium</td>\n",
       "      <td>Action</td>\n",
       "      <td>Diazepam is a benzodiazepine that exerts anxio...</td>\n",
       "      <td>CLINICAL PHARMACOLOGY</td>\n",
       "      <td>https://dailymed.nlm.nih.gov/dailymed/drugInfo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>what is morphine</td>\n",
       "      <td>morphine</td>\n",
       "      <td>Information</td>\n",
       "      <td>Morphine is a pain medication of the opiate fa...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://en.wikipedia.org/wiki/Morphine</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Question  ...                                                URL\n",
       "0  how does rivatigmine and otc sleep medicine in...  ...  https://medlineplus.gov/druginfo/meds/a602009....\n",
       "1                   how does valium affect the brain  ...  https://dailymed.nlm.nih.gov/dailymed/drugInfo...\n",
       "2                                   what is morphine  ...             https://en.wikipedia.org/wiki/Morphine\n",
       "\n",
       "[3 rows x 6 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "file = 'MedInfo2019-QA-Medications.xlsx'\n",
    "df = pd.read_excel(file)\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 110,
     "status": "ok",
     "timestamp": 1633960464794,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "AI1quZ-XmrKe",
    "outputId": "b88daef4-304b-4341-b959-b84aa31302b2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of examples:  690\n",
      "Column:  Index(['Question', 'Focus (Drug)', 'Question Type', 'Answer', 'Section Title',\n",
      "       'URL'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of examples: \", len(df))\n",
    "print(\"Column: \", df.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 220
    },
    "executionInfo": {
     "elapsed": 126,
     "status": "ok",
     "timestamp": 1633960465520,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "p13cHWqOnB4x",
    "outputId": "d4b325ec-44d5-4ea7-e520-f9392421cf97"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Question</th>\n",
       "      <th>Focus (Drug)</th>\n",
       "      <th>Question Type</th>\n",
       "      <th>Answer</th>\n",
       "      <th>Section Title</th>\n",
       "      <th>URL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>690</td>\n",
       "      <td>689</td>\n",
       "      <td>690</td>\n",
       "      <td>689</td>\n",
       "      <td>617</td>\n",
       "      <td>677</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>651</td>\n",
       "      <td>515</td>\n",
       "      <td>37</td>\n",
       "      <td>652</td>\n",
       "      <td>262</td>\n",
       "      <td>591</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>what does memantine look like</td>\n",
       "      <td>marijuana</td>\n",
       "      <td>Information</td>\n",
       "      <td>No answers</td>\n",
       "      <td>DOSAGE AND ADMINISTRATION</td>\n",
       "      <td>https://medlineplus.gov/marijuana.html</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>4</td>\n",
       "      <td>14</td>\n",
       "      <td>112</td>\n",
       "      <td>8</td>\n",
       "      <td>60</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             Question  ...                                     URL\n",
       "count                             690  ...                                     677\n",
       "unique                            651  ...                                     591\n",
       "top     what does memantine look like  ...  https://medlineplus.gov/marijuana.html\n",
       "freq                                4  ...                                       8\n",
       "\n",
       "[4 rows x 6 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# summary of the dataset\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 191
    },
    "executionInfo": {
     "elapsed": 102,
     "status": "ok",
     "timestamp": 1633960466197,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "cPoMqun1nWuo",
    "outputId": "698711a7-d4cb-4a5d-a829-972dca5cdb93"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Question</th>\n",
       "      <th>Focus (Drug)</th>\n",
       "      <th>Question Type</th>\n",
       "      <th>Answer</th>\n",
       "      <th>Section Title</th>\n",
       "      <th>URL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>405</th>\n",
       "      <td>does marijuana use lead to negative health out...</td>\n",
       "      <td>marijuana</td>\n",
       "      <td>Side effects</td>\n",
       "      <td>Marijuana can cause problems with memory, lear...</td>\n",
       "      <td>Summary</td>\n",
       "      <td>https://medlineplus.gov/marijuana.html</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>432</th>\n",
       "      <td>does marijuana use lead to negative health out...</td>\n",
       "      <td>marijuana</td>\n",
       "      <td>Side effects</td>\n",
       "      <td>Marijuana can cause problems with memory, lear...</td>\n",
       "      <td>Summary</td>\n",
       "      <td>https://medlineplus.gov/marijuana.html</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              Question  ...                                     URL\n",
       "405  does marijuana use lead to negative health out...  ...  https://medlineplus.gov/marijuana.html\n",
       "432  does marijuana use lead to negative health out...  ...  https://medlineplus.gov/marijuana.html\n",
       "\n",
       "[2 rows x 6 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check for duplicated rows\n",
    "df[df.duplicated(keep=False)]"
   ]
  },
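  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary and the duplicate check above suggest a light cleaning pass before the data is used: some rows are exact duplicates, `No answers` is the most frequent `Answer` value, and one `Answer` is missing. The following is a minimal sketch of such a pass and is not part of the original notebook. The helper name `clean_medication_qa` and the toy rows are hypothetical; only the `Answer` column name and the `No answers` placeholder are taken from the outputs above.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def clean_medication_qa(df):\n",
    "    # drop exact duplicate rows first\n",
    "    out = df.drop_duplicates()\n",
    "    # treat missing, empty, and 'No answers' entries as unanswered\n",
    "    answer = out['Answer'].fillna('').astype(str).str.strip()\n",
    "    keep = ~answer.isin(['', 'No answers'])\n",
    "    return out[keep].reset_index(drop=True)\n",
    "\n",
    "# toy rows for illustration only (not taken from the dataset)\n",
    "toy = pd.DataFrame({\n",
    "    'Question': ['q1', 'q1', 'q2', 'q3'],\n",
    "    'Answer': ['a1', 'a1', 'No answers', None],\n",
    "})\n",
    "cleaned = clean_medication_qa(toy)\n",
    "print(len(toy), 'rows ->', len(cleaned), 'rows')\n",
    "```\n",
    "\n",
    "On the real sheet this would be called as `clean_medication_qa(df)`. Whether repeated questions with different answers should also be merged is a separate decision that this sketch leaves open."
   ]
  },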
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "X26D4u9UPqbx"
   },
   "source": [
    "# BiQA\n",
    "[BiQA](https://github.com/lasigeBioTM/BiQA) is a collection of scientific question answering corpora generated from Q&A forums (StackExchange and Reddit), covering Biology, Medical Sciences, and Nutrition.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 121,
     "status": "ok",
     "timestamp": 1633961973131,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "LTXc_NpIuOGS",
    "outputId": "4b8f6669-44a9-4d59-a620-69cae5c0ed2c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive\n"
     ]
    }
   ],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 2761,
     "status": "ok",
     "timestamp": 1633961987971,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "JxSmy7jHuSIL",
    "outputId": "50ca0a05-d3e3-4c52-b72e-213c62f6211e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'BiQA'...\n",
      "remote: Enumerating objects: 64, done.\u001b[K\n",
      "remote: Counting objects: 100% (64/64), done.\u001b[K\n",
      "remote: Compressing objects: 100% (46/46), done.\u001b[K\n",
      "remote: Total 64 (delta 31), reused 43 (delta 18), pack-reused 0\u001b[K\n",
      "Unpacking objects: 100% (64/64), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/lasigeBioTM/BiQA.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 102,
     "status": "ok",
     "timestamp": 1633962009364,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "LsFwlCdVuU2U",
    "outputId": "6b4a1884-3369-43c3-e38d-4f960dd9bfd2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/BiQA\n"
     ]
    }
   ],
   "source": [
    "%cd BiQA/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 94,
     "status": "ok",
     "timestamp": 1633962021052,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "u-wRoG07uap7",
    "outputId": "8c48f7e2-de60-4a81-89bd-9661a1b2d8a6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/BiQA/april2020\n"
     ]
    }
   ],
   "source": [
    "%cd april2020"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 300
    },
    "executionInfo": {
     "elapsed": 132,
     "status": "ok",
     "timestamp": 1633962067575,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "3WXjBo2LudiV",
    "outputId": "bf8b1689-2a40-4a65-c3f9-8b11eb815a3c"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question_id</th>\n",
       "      <th>answer_id</th>\n",
       "      <th>question_text</th>\n",
       "      <th>question_score</th>\n",
       "      <th>pmid</th>\n",
       "      <th>pmtitle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>21216</td>\n",
       "      <td>21219</td>\n",
       "      <td>Why do I only breathe out of one nostril?</td>\n",
       "      <td>286</td>\n",
       "      <td>7876041</td>\n",
       "      <td>EEG changes during forced alternate nostril br...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>56476</td>\n",
       "      <td>56498</td>\n",
       "      <td>Why are so few foods blue?</td>\n",
       "      <td>190</td>\n",
       "      <td>11598230</td>\n",
       "      <td>Why leaves turn red in autumn. The role of ant...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30116</td>\n",
       "      <td>30126</td>\n",
       "      <td>Does DNA have the equivalent of IF-statements,...</td>\n",
       "      <td>153</td>\n",
       "      <td>15922833</td>\n",
       "      <td>Transcriptional interference--a crash course.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>937</td>\n",
       "      <td>939</td>\n",
       "      <td>How many times did terrestrial life emerge fro...</td>\n",
       "      <td>149</td>\n",
       "      <td>15535883</td>\n",
       "      <td>A genomic timescale of prokaryote evolution: i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>937</td>\n",
       "      <td>939</td>\n",
       "      <td>How many times did terrestrial life emerge fro...</td>\n",
       "      <td>149</td>\n",
       "      <td>20204349</td>\n",
       "      <td>The influence of different land uses on the st...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   question_id  ...                                            pmtitle\n",
       "0        21216  ...  EEG changes during forced alternate nostril br...\n",
       "1        56476  ...  Why leaves turn red in autumn. The role of ant...\n",
       "2        30116  ...      Transcriptional interference--a crash course.\n",
       "3          937  ...  A genomic timescale of prokaryote evolution: i...\n",
       "4          937  ...  The influence of different land uses on the st...\n",
       "\n",
       "[5 rows x 6 columns]"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "filename = 'biology_202004.csv'\n",
    "df = pd.read_csv(filename)\n",
    "df.head(5)"
   ]
  },
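  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the preview above, rows 3 and 4 share the same `question_id` and differ only in `pmid` and `pmtitle`, so one question can be linked to several PubMed articles. Below is a minimal sketch of collecting the linked `pmid` values per question, assuming the column names shown above. The toy frame is hypothetical (its question texts are placeholders) and stands in for `df`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# toy frame with the same column names as the preview (placeholder texts)\n",
    "toy = pd.DataFrame({\n",
    "    'question_id': [21216, 937, 937],\n",
    "    'question_text': ['question A', 'question B', 'question B'],\n",
    "    'pmid': [7876041, 15535883, 20204349],\n",
    "})\n",
    "\n",
    "# one row per question, with all linked PubMed IDs gathered into a list\n",
    "per_question = (\n",
    "    toy.groupby(['question_id', 'question_text'], sort=False)['pmid']\n",
    "    .agg(list)\n",
    "    .reset_index()\n",
    ")\n",
    "print(per_question)\n",
    "```\n",
    "\n",
    "Passing `sort=False` keeps the questions in their first-seen order, which in the CSV preview appears to follow descending `question_score`."
   ]
  },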
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9vbHhYAcQDTL"
   },
   "source": [
    "# MASHQA\n",
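    "[MASHQA](https://github.com/mingzhu0527/MASHQA) is a healthcare question answering dataset built from WebMD articles, in which an answer can span multiple sentences of the context.\n",
    "\n",
    "Download the data zip file from the GitHub repo, extract it, and upload the resulting `mashqa_data` folder to Google Drive."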
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 115,
     "status": "ok",
     "timestamp": 1633962202090,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "O8J7zpOBvFCg",
    "outputId": "502aefdb-dcd7-488d-cffc-94184aa7c197"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive\n"
     ]
    }
   ],
   "source": [
    "# If you are in the folder of another dataset, uncomment and run the following command\n",
    "# %cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 103,
     "status": "ok",
     "timestamp": 1633962210584,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "r3b8evdjQHae",
    "outputId": "183cc022-b885-41ee-b837-ae81c2b5f34f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/content/drive/My Drive/mashqa_data\n"
     ]
    }
   ],
   "source": [
    "%cd mashqa_data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 244,
     "status": "ok",
     "timestamp": 1633962214767,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "91tuDLXNXqJd",
    "outputId": "dd446a88-7843-47da-df74-0f72a105ea76"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test_webmd_squad_v2_consec.json   train_webmd_squad_v2_full.json\n",
      "test_webmd_squad_v2_full.json\t  val_webmd_squad_v2_consec.json\n",
      "train_webmd_squad_v2_consec.json  val_webmd_squad_v2_full.json\n"
     ]
    }
   ],
   "source": [
    "!ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "aAF-a9u2X3eC"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "data_file = 'train_webmd_squad_v2_consec.json'\n",
    "# open with a context manager so the file handle is closed after loading\n",
    "with open(data_file) as f:\n",
    "    data = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 118,
     "status": "ok",
     "timestamp": 1633962342631,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "vG7_Fl0_vqqc",
    "outputId": "a04b424d-151d-4d83-97fc-239bf108aec1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['version', 'data'])"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect top-level keys\n",
    "data.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "executionInfo": {
     "elapsed": 144,
     "status": "ok",
     "timestamp": 1633962364043,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "Gdv-AbJWvsbW",
    "outputId": "efd71b03-f290-4911-8548-d0b7e6fd0ef3"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "type": "string"
      },
      "text/plain": [
       "'2.0'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['version']"
   ]
  },
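  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file names and the `'2.0'` version string point to a SQuAD v2.0-style layout: `data['data']` is a list of articles, each with `paragraphs` that hold a `context` and a list of `qas`. In the record printed further below, each `answer_starts` entry looks like a `[start, length]` pair into `context`, one per answer sentence. Below is a minimal sketch of walking that structure under these assumptions. The helper `iter_qas` and the toy record are hypothetical.\n",
    "\n",
    "```python\n",
    "def iter_qas(dataset):\n",
    "    # yield (question, list of answer sentences) for every answerable question\n",
    "    for article in dataset['data']:\n",
    "        for paragraph in article['paragraphs']:\n",
    "            context = paragraph['context']\n",
    "            for qa in paragraph['qas']:\n",
    "                if qa.get('is_impossible'):\n",
    "                    continue\n",
    "                for answer in qa['answers']:\n",
    "                    # assumed layout: [start offset, length] per sentence\n",
    "                    spans = answer.get('answer_starts', [])\n",
    "                    yield qa['question'], [context[s:s + n] for s, n in spans]\n",
    "\n",
    "# toy record mimicking the layout (not real MASHQA content)\n",
    "parts = ['First sentence.', 'Second one here.', 'Third.']\n",
    "context = ' '.join(parts)\n",
    "toy = {'version': '2.0', 'data': [{'title': 'toy', 'paragraphs': [{\n",
    "    'context': context,\n",
    "    'qas': [{'question': 'Which parts?', 'is_impossible': False,\n",
    "             'answers': [{'answer_starts': [\n",
    "                 [context.index(parts[1]), len(parts[1])],\n",
    "                 [context.index(parts[2]), len(parts[2])]]}]}],\n",
    "}]}]}\n",
    "\n",
    "pairs = list(iter_qas(toy))\n",
    "print(pairs)\n",
    "```\n",
    "\n",
    "On the loaded file the same generator would be used as `next(iter_qas(data))`. If the `[start, length]` reading holds, joining the returned sentences with single spaces should reproduce the `text` field of the answer."
   ]
  },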
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 117,
     "status": "ok",
     "timestamp": 1633962544974,
     "user": {
      "displayName": "Matcha Crepe",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s64",
      "userId": "13944455749463804680"
     },
     "user_tz": 240
    },
    "id": "PN-BKiOovxST",
    "outputId": "3bf80d13-2040-48cd-920f-b2ec0dd610bf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n",
      "<class 'dict'>\n",
      "dict_keys(['title', 'paragraphs'])\n",
      "Title:\n",
      " https://www.webmd.com/eye-health/understanding-glaucoma-treatment\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'context': \"Treatment of open-angle glaucoma -- the most common form of the disease -- requires lowering the eye's pressure by increasing the drainage of aqueous humor fluid or decreasing the production of that fluid. Medications can accomplish both of these goals. Surgery and laser treatments are directed at improving the eye's aqueous drainage. If not diagnosed early, open-angle glaucoma may significantly damage vision and even cause blindness. That is why it's so important to have your eye doctor test you regularly for glaucoma. Once diagnosed, glaucoma is usually controlled with eye drops that reduce eye pressure. Glaucoma is a life-long condition and needs continual follow-up with your eye doctor. Both drugs and surgery have high rates of success in treating chronic open-angle glaucoma, but you can help yourself by carefully following the doctor's treatment plan. Some patients may find it difficult to follow a regimen involving two or three different eye drops. Be candid and tell the doctor if you cannot follow the medication schedule or if the eye drops cause unwanted side effects. There are frequently alternative treatments. Because of potential drug interactions, be sure to tell your doctor about any other medical problems you have or other medications you take. If glaucoma drops causes the eyes to become chronically red, consult your doctor about switching to preservative-free glaucoma drops that may alleviate the redness from preservatives. Acute angle-closure glaucoma is different from chronic open-angle glaucoma in several important ways: The symptoms usually occur with relative suddenness; the eye is painful and red. If the high pressure in the eye is not relieved quickly, blindness can occur. On the other hand, treatments for acute angle-closure glaucoma -- usually laser treatment -- are permanent and do not require long-term therapy. 
For this type of glaucoma, making a hole in the iris to allow fluid to drain, called an iridectomy, is the standard treatment to cure it. The unaffected eye also is usually treated to prevent a future attack. However, it's important to get your eyes checked regularly, as some people may develop a case of chronic angle-closure glaucoma later in life, even after laser treatment. If the glaucoma does not respond to medication, or if you cannot tolerate the side effects, your doctor may change medications or recommend one of several surgical techniques: Laser trabeculoplasty creates small laser burns in the area where the fluid drains, improving the outflow rate of aqueous fluid. This relatively brief procedure can often be done in an ophthalmologist's clinic. Trabeculectomy is a surgical procedure that creates a new channel for fluid outflow in cases in which the intraocular pressure is high and the optic nerve damage progresses. Long-term results vary, but generally, the success rate is good. Surgical implants that shunt fluid out of the eye may also be used to decrease pressure in the eye. Remember, all forms of medical or surgical treatment have potential benefits and risks. Before giving your consent, always ask the surgeon to clearly explain any treatment or surgery as well as the proposed benefits, effective alternatives, and potential risks or complications.\",\n",
       " 'qas': [{'answers': [{'answer_span': [19, 20, 21, 22, 23, 24, 25],\n",
       "     'answer_start': 2249,\n",
       "     'answer_starts': [[2249, 304],\n",
       "      [2554, 81],\n",
       "      [2636, 173],\n",
       "      [2810, 64],\n",
       "      [2875, 99],\n",
       "      [2975, 87],\n",
       "      [3063, 190]],\n",
       "     'text': \"If the glaucoma does not respond to medication, or if you cannot tolerate the side effects, your doctor may change medications or recommend one of several surgical techniques: Laser trabeculoplasty creates small laser burns in the area where the fluid drains, improving the outflow rate of aqueous fluid. This relatively brief procedure can often be done in an ophthalmologist's clinic. Trabeculectomy is a surgical procedure that creates a new channel for fluid outflow in cases in which the intraocular pressure is high and the optic nerve damage progresses. Long-term results vary, but generally, the success rate is good. Surgical implants that shunt fluid out of the eye may also be used to decrease pressure in the eye. Remember, all forms of medical or surgical treatment have potential benefits and risks. Before giving your consent, always ask the surgeon to clearly explain any treatment or surgery as well as the proposed benefits, effective alternatives, and potential risks or complications.\"}],\n",
       "   'id': 'dd439956090f55da0c227d582b9629f3',\n",
       "   'is_impossible': False,\n",
       "   'question': 'What surgical techniques are used to treat glaucoma?',\n",
       "   'url': 'https://www.webmd.com/eye-health/qa/what-surgical-techniques-are-used-to-treat-glaucoma'},\n",
       "  {'answers': [{'answer_span': [7, 8, 9, 10, 11, 12],\n",
       "     'answer_start': 700,\n",
       "     'answer_starts': [[700, 168],\n",
       "      [869, 99],\n",
       "      [969, 123],\n",
       "      [1093, 44],\n",
       "      [1138, 140],\n",
       "      [1279, 183]],\n",
       "     'text': \"Both drugs and surgery have high rates of success in treating chronic open-angle glaucoma, but you can help yourself by carefully following the doctor's treatment plan. Some patients may find it difficult to follow a regimen involving two or three different eye drops. Be candid and tell the doctor if you cannot follow the medication schedule or if the eye drops cause unwanted side effects. There are frequently alternative treatments. Because of potential drug interactions, be sure to tell your doctor about any other medical problems you have or other medications you take. If glaucoma drops causes the eyes to become chronically red, consult your doctor about switching to preservative-free glaucoma drops that may alleviate the redness from preservatives.\"}],\n",
       "   'id': '42e1b369ac64b210e9789234f16b0fb2',\n",
       "   'is_impossible': False,\n",
       "   'question': 'What are the best ways to treat glaucoma?',\n",
       "   'url': 'https://www.webmd.com/eye-health/qa/what-are-the-best-ways-to-treat-glaucoma'},\n",
       "  {'answers': [{'answer_span': [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],\n",
       "     'answer_start': 439,\n",
       "     'answer_starts': [[439, 86],\n",
       "      [526, 87],\n",
       "      [614, 85],\n",
       "      [700, 168],\n",
       "      [869, 99],\n",
       "      [969, 123],\n",
       "      [1093, 44],\n",
       "      [1138, 140],\n",
       "      [1279, 183],\n",
       "      [1463, 182],\n",
       "      [1646, 77]],\n",
       "     'text': \"That is why it's so important to have your eye doctor test you regularly for glaucoma. Once diagnosed, glaucoma is usually controlled with eye drops that reduce eye pressure. Glaucoma is a life-long condition and needs continual follow-up with your eye doctor. Both drugs and surgery have high rates of success in treating chronic open-angle glaucoma, but you can help yourself by carefully following the doctor's treatment plan. Some patients may find it difficult to follow a regimen involving two or three different eye drops. Be candid and tell the doctor if you cannot follow the medication schedule or if the eye drops cause unwanted side effects. There are frequently alternative treatments. Because of potential drug interactions, be sure to tell your doctor about any other medical problems you have or other medications you take. If glaucoma drops causes the eyes to become chronically red, consult your doctor about switching to preservative-free glaucoma drops that may alleviate the redness from preservatives. Acute angle-closure glaucoma is different from chronic open-angle glaucoma in several important ways: The symptoms usually occur with relative suddenness; the eye is painful and red. If the high pressure in the eye is not relieved quickly, blindness can occur.\"}],\n",
       "   'id': 'b8d6614d83f60b699f44f3140323aef5',\n",
       "   'is_impossible': False,\n",
       "   'question': 'What should you know about treating open-angle glaucoma?',\n",
       "   'url': 'https://www.webmd.com/eye-health/qa/what-should-you-know-about-treating-openangle-glaucoma'},\n",
       "  {'answers': [{'answer_span': [25],\n",
       "     'answer_start': 3063,\n",
       "     'answer_starts': [[3063, 190]],\n",
       "     'text': 'Before giving your consent, always ask the surgeon to clearly explain any treatment or surgery as well as the proposed benefits, effective alternatives, and potential risks or complications.'}],\n",
       "   'id': '741c9aa5e637dad142bca777fc1efad8',\n",
       "   'is_impossible': False,\n",
       "   'question': 'Is surgery for glaucoma dangerous?',\n",
       "   'url': 'https://www.webmd.com/eye-health/qa/is-surgery-for-glaucoma-dangerous'},\n",
       "  {'answers': [{'answer_span': [13, 14, 15, 16, 17, 18],\n",
       "     'answer_start': 1463,\n",
       "     'answer_starts': [[1463, 182],\n",
       "      [1646, 77],\n",
       "      [1724, 144],\n",
       "      [1869, 137],\n",
       "      [2007, 70],\n",
       "      [2078, 170]],\n",
       "     'text': \"Acute angle-closure glaucoma is different from chronic open-angle glaucoma in several important ways: The symptoms usually occur with relative suddenness; the eye is painful and red. If the high pressure in the eye is not relieved quickly, blindness can occur. On the other hand, treatments for acute angle-closure glaucoma -- usually laser treatment -- are permanent and do not require long-term therapy. For this type of glaucoma, making a hole in the iris to allow fluid to drain, called an iridectomy, is the standard treatment to cure it. The unaffected eye also is usually treated to prevent a future attack. However, it's important to get your eyes checked regularly, as some people may develop a case of chronic angle-closure glaucoma later in life, even after laser treatment.\"}],\n",
       "   'id': 'e5eacc312dfca28ccbfb25ad577a3b0e',\n",
       "   'is_impossible': False,\n",
       "   'question': 'How is acute closed-angle glaucoma treated?',\n",
       "   'url': 'https://www.webmd.com/eye-health/qa/how-is-acute-closedangle-glaucoma-treated'}],\n",
       " 'sent_list': [\"Treatment of open-angle glaucoma -- the most common form of the disease -- requires lowering the eye's pressure by increasing the drainage of aqueous humor fluid or decreasing the production of that fluid.\",\n",
       "  'Medications can accomplish both of these goals.',\n",
       "  \"Surgery and laser treatments are directed at improving the eye's aqueous drainage.\",\n",
       "  'If not diagnosed early, open-angle glaucoma may significantly damage vision and even cause blindness.',\n",
       "  \"That is why it's so important to have your eye doctor test you regularly for glaucoma.\",\n",
       "  'Once diagnosed, glaucoma is usually controlled with eye drops that reduce eye pressure.',\n",
       "  'Glaucoma is a life-long condition and needs continual follow-up with your eye doctor.',\n",
       "  \"Both drugs and surgery have high rates of success in treating chronic open-angle glaucoma, but you can help yourself by carefully following the doctor's treatment plan.\",\n",
       "  'Some patients may find it difficult to follow a regimen involving two or three different eye drops.',\n",
       "  'Be candid and tell the doctor if you cannot follow the medication schedule or if the eye drops cause unwanted side effects.',\n",
       "  'There are frequently alternative treatments.',\n",
       "  'Because of potential drug interactions, be sure to tell your doctor about any other medical problems you have or other medications you take.',\n",
       "  'If glaucoma drops causes the eyes to become chronically red, consult your doctor about switching to preservative-free glaucoma drops that may alleviate the redness from preservatives.',\n",
       "  'Acute angle-closure glaucoma is different from chronic open-angle glaucoma in several important ways: The symptoms usually occur with relative suddenness; the eye is painful and red.',\n",
       "  'If the high pressure in the eye is not relieved quickly, blindness can occur.',\n",
       "  'On the other hand, treatments for acute angle-closure glaucoma -- usually laser treatment -- are permanent and do not require long-term therapy.',\n",
       "  'For this type of glaucoma, making a hole in the iris to allow fluid to drain, called an iridectomy, is the standard treatment to cure it.',\n",
       "  'The unaffected eye also is usually treated to prevent a future attack.',\n",
       "  \"However, it's important to get your eyes checked regularly, as some people may develop a case of chronic angle-closure glaucoma later in life, even after laser treatment.\",\n",
       "  'If the glaucoma does not respond to medication, or if you cannot tolerate the side effects, your doctor may change medications or recommend one of several surgical techniques: Laser trabeculoplasty creates small laser burns in the area where the fluid drains, improving the outflow rate of aqueous fluid.',\n",
       "  \"This relatively brief procedure can often be done in an ophthalmologist's clinic.\",\n",
       "  'Trabeculectomy is a surgical procedure that creates a new channel for fluid outflow in cases in which the intraocular pressure is high and the optic nerve damage progresses.',\n",
       "  'Long-term results vary, but generally, the success rate is good.',\n",
       "  'Surgical implants that shunt fluid out of the eye may also be used to decrease pressure in the eye.',\n",
       "  'Remember, all forms of medical or surgical treatment have potential benefits and risks.',\n",
       "  'Before giving your consent, always ask the surgeon to clearly explain any treatment or surgery as well as the proposed benefits, effective alternatives, and potential risks or complications.'],\n",
       " 'sent_starts': [[0, 205],\n",
       "  [206, 47],\n",
       "  [254, 82],\n",
       "  [337, 101],\n",
       "  [439, 86],\n",
       "  [526, 87],\n",
       "  [614, 85],\n",
       "  [700, 168],\n",
       "  [869, 99],\n",
       "  [969, 123],\n",
       "  [1093, 44],\n",
       "  [1138, 140],\n",
       "  [1279, 183],\n",
       "  [1463, 182],\n",
       "  [1646, 77],\n",
       "  [1724, 144],\n",
       "  [1869, 137],\n",
       "  [2007, 70],\n",
       "  [2078, 170],\n",
       "  [2249, 304],\n",
       "  [2554, 81],\n",
       "  [2636, 173],\n",
       "  [2810, 64],\n",
       "  [2875, 99],\n",
       "  [2975, 87],\n",
       "  [3063, 190]]}"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
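   ],
   "source": [
    "# Inspect the top-level structure of the loaded JSON\n",
    "print(type(data['data']))\n",
    "print(type(data['data'][0]))\n",
    "print(data['data'][0].keys())\n",
    "print('Title:\\n', data['data'][0]['title'])\n",
    "# Show the first paragraph entry: its QA pairs plus the sentence list and offsets\n",
    "paragraphs = data['data'][0]['paragraphs']\n",
    "paragraphs[0]"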
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3DeGfLm_PiaC"
   },
   "source": [
    "# EPIC QA\n",
    "The goal of [EPIC QA](https://bionlp.nlm.nih.gov/epic_qa/) is to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related coronaviruses, and the recommended response to the pandemic.\n",
    "\n",
    "There are two tasks: <br>\n",
    "1) Expert QA (Task A): teams are provided with a set of questions asked by experts and are asked to provide a ranked list of expert-level answers to each question. Answers should provide information that is useful to researchers, scientists, or clinicians. <br>\n",
    "2) Consumer QA (Task B): teams are provided with a set of questions asked by consumers and are asked to provide a ranked list of consumer-friendly answers to each question. Answers should be understandable by the general public. <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2q3rnE0mPpxF"
   },
   "outputs": [],
   "source": [
    "# The data is distributed as JSON files.\n",
    "# The document collection is 1.4 GB."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5YIRDtVbPVgK"
   },
   "source": [
    "# emrQA: A Large Corpus for Question Answering on Electronic Medical Records\n",
    "[emrQA](https://github.com/panushri25/emrQA)\n",
    "\n",
    "Access to the data requires registration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RkZE7cjWPg5g"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0PeFrk9RP5yp"
   },
   "source": [
    "# HealthQA\n",
    "[GitHub](https://github.com/mingzhu0527/HAR)\n",
    "\n",
    "Access to the data must be requested by email."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bLOdGLtftS2f"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyOzA9ZQ9rgw3rcZ3MRrP4pQ",
   "collapsed_sections": [
    "6_3560E1B1-u",
    "hnBdXsoOAc-t",
    "vbA5FZuq9BxK",
    "4_NKv95BKJGE",
    "uMN56--YMFtH",
    "sRnz45nIOv6E",
    "eLIyTFaaPJSA",
    "X26D4u9UPqbx",
    "3DeGfLm_PiaC",
    "5YIRDtVbPVgK",
    "0PeFrk9RP5yp"
   ],
   "name": "MedicalQADataset.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}