--- a
+++ b/freetext_nlp_extraction.ipynb
@@ -0,0 +1,2215 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "13021ce2",
+   "metadata": {},
+   "source": [
+    "# Extracting patient symptoms from medical notes using natural language processing (NLP)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a3713b2",
+   "metadata": {},
+   "source": [
+    "This notebook presents a method of extracting a pre-defined set of patient symptoms from a database of freetext medical notes (a.k.a. blob text). The example uses spaCy as its NLP framework and relies on rule-based matching and negation detection rather than machine learning models.\n",
+    "\n",
+    "The original dataset that was used is not included here as it contains personally identifiable information. However, its properties will still be referred to throughout to justify the need for certain pre-processing and steps in the NLP pipeline. \n",
+    "\n",
+    "If you have any comments or questions please contact the author, Will Mower at william.mower@nhs.net.\n",
+    "\n",
+    "Thanks!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0fbfa4a4",
+   "metadata": {},
+   "source": [
+    "### Import modules used throughout the notebook"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "d7e0f518",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Standard data processing tools in python\n",
+    "import pandas as pd \n",
+    "pd.options.mode.chained_assignment = None\n",
+    "import numpy as np\n",
+    "\n",
+    "#NLP framework of choice\n",
+    "import spacy \n",
+    "from spacy.matcher import Matcher\n",
+    "from spacy.tokens import Span\n",
+    "\n",
+    "#Negation component used within the spaCy pipeline\n",
+    "from negspacy.negation import Negex\n",
+    "from negspacy.termsets import termset\n",
+    "\n",
+    "#Regular expressions\n",
+    "import re\n",
+    "\n",
+    "#Used to extract data from XML and HTML formats\n",
+    "from lxml import etree,html\n",
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "#Misc. python modules\n",
+    "import time\n",
+    "from datetime import datetime\n",
+    "import os \n",
+    "import itertools\n",
+    "\n",
+    "#used to save pandas dataframe \n",
+    "import pickle\n",
+    "\n",
+    "#for extract comparison metrics\n",
+    "from sklearn.metrics import confusion_matrix\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "032b2323",
+   "metadata": {},
+   "source": [
+    "## Import the data files\n",
+    "Two example files have been provided in the GitHub repository. The first, the blob text file, mimics the format of the file that was originally provided for this project containing medical notes that were extracted from an EHR database. \n",
+    "\n",
+    "The second file is an example of the desired medical concepts to be extracted and the various ways that they are typically written in medical notes. Columns can be added to this file for any other concepts that should be extracted from the text.\n",
+    "\n",
+    "### Freetext file\n",
+    "\n",
+    "This file contains one row for each individual medical note; notes are stored at various points throughout a patient's visit to hospital. Each visit has a unique encounter ID, which is also included in the data set so that notes can be grouped by encounter. The original file was already filtered to the relevant encounter IDs and date range, so none of that pre-processing is included here. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "dce87bfd",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>ENCNTR_ID</th>\n",
+       "      <th>BLOB_CONTENTS</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111</td>\n",
+       "      <td>72 h onset of palpitation s , worse when walki...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>111</td>\n",
+       "      <td>Pt has not had recent surgery / immobilisation...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>111</td>\n",
+       "      <td>Pt has not had recent surgery / immobilisation...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>222</td>\n",
+       "      <td>Pre-Arrival Summary  Name:  Doe, John   Curren...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>222</td>\n",
+       "      <td>PC: no chest pains, SOB Pmh: htn, high cholest...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   ENCNTR_ID                                      BLOB_CONTENTS\n",
+       "0        111  72 h onset of palpitation s , worse when walki...\n",
+       "1        111  Pt has not had recent surgery / immobilisation...\n",
+       "2        111  Pt has not had recent surgery / immobilisation...\n",
+       "3        222  Pre-Arrival Summary  Name:  Doe, John   Curren...\n",
+       "4        222  PC: no chest pains, SOB Pmh: htn, high cholest..."
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#import freetext notes with encounter ids\n",
+    "#could be imported as excel, .csv or other\n",
+    "org_freetext_df = pd.read_csv(\"sample_data/sample_blob_text.csv\")\n",
+    "org_freetext_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e609eec0",
+   "metadata": {},
+   "source": [
+    "### \"Symptoms\" list file\n",
+    "\n",
+    "This file contains all of the symptoms, conditions and any other terms that will be searched for in the freetext. Each symptom type is a column header and any alternative forms expressing it are listed in the column.\n",
+    "\n",
+    "> *The terms that are to be extracted from the freetext will be referred to throughout as **symptoms** for simplicity even though many of them are not symptoms but instead conditions etc.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c49ab5e3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Hypertension</th>\n",
+       "      <th>Chronic heart failure</th>\n",
+       "      <th>Cancer</th>\n",
+       "      <th>Prior pe</th>\n",
+       "      <th>PE</th>\n",
+       "      <th>Chest Pain</th>\n",
+       "      <th>Dyspnea</th>\n",
+       "      <th>DOA</th>\n",
+       "      <th>recent surgery</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>HTN</td>\n",
+       "      <td>heart failure</td>\n",
+       "      <td>malignancy</td>\n",
+       "      <td>Prior/previous_PE/pes/DVT/dvts/pulmonary embol...</td>\n",
+       "      <td>DVT</td>\n",
+       "      <td>CP</td>\n",
+       "      <td>DIB</td>\n",
+       "      <td>DOAC</td>\n",
+       "      <td>recent immobilisation</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>high BP</td>\n",
+       "      <td>CHF</td>\n",
+       "      <td>lung ca</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>deep vein thrombosis</td>\n",
+       "      <td>C/P</td>\n",
+       "      <td>SOB</td>\n",
+       "      <td>Apixaban</td>\n",
+       "      <td>recent surg</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>&gt;BP</td>\n",
+       "      <td>CCF</td>\n",
+       "      <td>chemo</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>pulmonary embolism</td>\n",
+       "      <td>angina</td>\n",
+       "      <td>short of breath</td>\n",
+       "      <td>dabigatran</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>high blood pressure</td>\n",
+       "      <td>LVF</td>\n",
+       "      <td>chemotherapy</td>\n",
+       "      <td>MAKE_PHRASE_COMBINATIONS</td>\n",
+       "      <td>dvts</td>\n",
+       "      <td>chest pains</td>\n",
+       "      <td>shortness of breath</td>\n",
+       "      <td>edoxaban</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>carcinoma</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>pes</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>breathless</td>\n",
+       "      <td>rivaroxaban</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          Hypertension Chronic heart failure        Cancer  \\\n",
+       "0                  HTN         heart failure    malignancy   \n",
+       "1              high BP                   CHF       lung ca   \n",
+       "2                  >BP                   CCF         chemo   \n",
+       "3  high blood pressure                   LVF  chemotherapy   \n",
+       "4                  NaN                   NaN     carcinoma   \n",
+       "\n",
+       "                                            Prior pe                    PE  \\\n",
+       "0  Prior/previous_PE/pes/DVT/dvts/pulmonary embol...                   DVT   \n",
+       "1                                                NaN  deep vein thrombosis   \n",
+       "2                                                NaN    pulmonary embolism   \n",
+       "3                           MAKE_PHRASE_COMBINATIONS                 dvts    \n",
+       "4                                                NaN                   pes   \n",
+       "\n",
+       "    Chest Pain              Dyspnea          DOA         recent surgery  \n",
+       "0           CP                  DIB         DOAC  recent immobilisation  \n",
+       "1          C/P                  SOB     Apixaban            recent surg  \n",
+       "2       angina      short of breath   dabigatran                    NaN  \n",
+       "3  chest pains  shortness of breath     edoxaban                    NaN  \n",
+       "4          NaN           breathless  rivaroxaban                    NaN  "
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "freetext_terms_df = pd.read_csv(\"sample_data/sample_symptom_names.csv\")\n",
+    "freetext_terms_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8c275f85",
+   "metadata": {},
+   "source": [
+    "## Data cleaning and pre-processing \n",
+    "### Freetext file\n",
+    "Keep only the encounter ID (ENCNTR_ID) and medical note freetext (BLOB_CONTENTS) columns, renaming both for ease of use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "b86e8b75",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>e_id</th>\n",
+       "      <th>text</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111</td>\n",
+       "      <td>72 h onset of palpitation s , worse when walki...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>111</td>\n",
+       "      <td>Pt has not had recent surgery / immobilisation...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>111</td>\n",
+       "      <td>Pt has not had recent surgery / immobilisation...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>222</td>\n",
+       "      <td>Pre-Arrival Summary  Name:  Doe, John   Curren...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>222</td>\n",
+       "      <td>PC: no chest pains, SOB Pmh: htn, high cholest...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   e_id                                               text\n",
+       "0   111  72 h onset of palpitation s , worse when walki...\n",
+       "1   111  Pt has not had recent surgery / immobilisation...\n",
+       "2   111  Pt has not had recent surgery / immobilisation...\n",
+       "3   222  Pre-Arrival Summary  Name:  Doe, John   Curren...\n",
+       "4   222  PC: no chest pains, SOB Pmh: htn, high cholest..."
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "freetext_df = org_freetext_df[[\"ENCNTR_ID\",\"BLOB_CONTENTS\"]].rename(columns = {\"ENCNTR_ID\":\"e_id\",\"BLOB_CONTENTS\":'text'})\n",
+    "freetext_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8739d56",
+   "metadata": {},
+   "source": [
+    "In the original freetext table there were 736 unique encounter IDs and ~7600 rows of medical notes, so there are on average 10 text entries in the dataframe for each encounter. Information from multiple entries for the same encounter will need to be combined later on. *(Here there are 3 unique encounters and on average 2 entries per encounter)*\n",
+    "\n",
+    "Additionally, there were slightly fewer unique text entries than rows, meaning that some rows are duplicates. Dropping duplicate rows (considering both the encounter ID and the text) removes most of them. Some duplicate text remains because identical entries occur across different encounters (likely default entries); these will not be removed. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "09aa96dc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The number of rows in the sample dataframe:  6\n",
+      "\n",
+      "The number of unique values per column:\n",
+      " e_id    3\n",
+      "text    5\n",
+      "dtype: int64\n",
+      "\n",
+      "After dropping duplicates: 5 rows\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"The number of rows in the sample dataframe: \",freetext_df.shape[0])\n",
+    "print(\"\\nThe number of unique values per column:\\n\", freetext_df.nunique())\n",
+    "\n",
+    "freetext_df.drop_duplicates(inplace=True)\n",
+    "print(f\"\\nAfter dropping duplicates: {freetext_df.shape[0]} rows\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "858ebc0f",
+   "metadata": {},
+   "source": [
+    "### Symptoms dictionary\n",
+    "\n",
+    "The symptom list dataframe is cleaned up (e.g. NaNs are removed). Each symptom is converted to a dictionary entry, using the column name as the key and the set of all the different variations of the symptom as the value. All symptom entries are also converted to lower case.\n",
+    "\n",
+    "Two custom keywords are also included in certain columns: \"make_phrase_combinations\" and \"secondary_matcher\". These signify that the entries in the column should be treated differently depending on the keyword. Their uses are explained alongside the pattern-generation code further down.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "1b12b8db",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'>bp', 'high blood pressure', 'high bp', 'htn', 'hypertension'}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
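+    "#convert freetext options to sets, one per symptom\n",
+    "freetext_terms_df.columns = [col.replace(\"\\n\",\"\").lower() for col in freetext_terms_df.columns]\n",
+    "#add each column name as a term in its own column\n",
+    "#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)\n",
+    "freetext_terms_df = pd.concat([freetext_terms_df, pd.DataFrame([freetext_terms_df.columns], columns = freetext_terms_df.columns)], ignore_index = True)\n",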
+    "\n",
+    "\n",
+    "symptom_dict = {}\n",
+    "for col in freetext_terms_df.columns:\n",
+    "    symptom_dict[col.lower()] = set(freetext_terms_df[col].dropna().str.lower().values)\n",
+    "    \n",
+    "#dictionary entry for hypertension symptom \n",
+    "symptom_dict[\"hypertension\"]"
+   ]
+  },
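+  {
+   "cell_type": "markdown",
+   "id": "sketch-combos-md",
+   "metadata": {},
+   "source": [
+    "The \"make_phrase_combinations\" keyword marks a column whose \"/\"- and \"_\"-separated entry should be expanded into every combination of its alternatives. The cell below is a minimal, stand-alone sketch of that expansion and is not part of the original pipeline; the function name `expand_phrase_combinations` and the shortened example entry are assumptions made for illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-combos-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import itertools\n",
+    "\n",
+    "#hypothetical helper, for illustration only\n",
+    "def expand_phrase_combinations(entry):\n",
+    "    #\"_\" separates the parts of the phrase, \"/\" separates the alternatives for each part\n",
+    "    parts = [[alt.lower() for alt in part.split(\"/\") if alt] for part in entry.split(\"_\")]\n",
+    "    #one phrase per combination of alternatives, joined with spaces\n",
+    "    return [\" \".join(combo) for combo in itertools.product(*parts)]\n",
+    "\n",
+    "#gives ['prior pe', 'prior dvt', 'previous pe', 'previous dvt']\n",
+    "expand_phrase_combinations(\"Prior/previous_PE/DVT\")"
+   ]
+  },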
+  {
+   "cell_type": "markdown",
+   "id": "5e64901e",
+   "metadata": {},
+   "source": [
+    "## Symptoms to pattern matcher format\n",
+    "\n",
+    "For NLP tasks, a series of different processes, each with a particular function, are joined together to form the whole NLP model that can be used to extract information from the inputted block of text. Each process is known as a component and together, the components form an NLP pipeline. spaCy's default NLP pipeline includes a variety of different components that label and modify the words. You can read more about these components and their purposes here: https://spacy.io/usage/processing-pipelines. \n",
+    "\n",
+    "We will be creating a simple NLP pipeline containing only three components: one for sentence identification, one that identifies the patterns we have pre-defined (symptoms and their variations) and another that checks for negation. We will use the default spaCy sentencizer, which splits sentences based on a set of default punctuation characters (see https://spacy.io/api/sentencizer for details).\n",
+    "\n",
+    "### EntityRuler\n",
+    "\n",
+    "The second component will be spaCy's EntityRuler component (https://spacy.io/api/entityruler), which is used for rule-based entity recognition. We will use this to find any instances of the symptoms and their variations in the text and label them as entities.\n",
+    "\n",
+    "The EntityRuler accepts two kinds of pattern: phrase patterns (exact strings, handled internally by a PhraseMatcher) and token patterns (one dictionary per token, handled by a Matcher). The words and phrases that we want to label must therefore be entered in one of the accepted \"Pattern\" formats; token patterns are used here. \n",
+    "\n",
+    "### \"Pattern\" format \n",
+    "The token pattern format requires each word in a phrase to be specified individually, along with the token attribute used for matching. The \"lower\" attribute is used here to match the symptom string against the lower-case form of a word in the text. For this to work the pattern itself must also be in lower case, hence the use of the .lower() method. \n",
+    "\n",
+    "The \"orth\" attribute matches the exact string supplied in the pattern and is used here for any non-alphanumeric characters (such as > in >BP). The \"OP\" pattern parameter, with the value \"?\", marks a token as optional for the pattern to match.\n",
+    "\n",
+    "*Phrase patterns mentioned here are technically Token patterns, as true Phrase patterns match exact, case-sensitive strings - find more details here: https://spacy.io/usage/rule-based-matching#entityruler*\n",
+    "\n",
+    "Each pattern also includes an entity label that will be attributed to any words in the text that are found to match the pattern. In this case, a new entity \"SYM\" (for symptom) will be assigned, rather than a pre-defined spaCy entity. Finally, the pattern is given an ID that can be used to group patterns that refer to the same overarching symptom i.e. \"HTN\" and \">BP\" would both have the id \"hypertension\".\n",
+    "\n",
+    "Certain custom patterns are also defined and manually added to the pattern dictionary for more complex patterns containing optional tokens (shown by \"OP\":\"?\" in token dictionary).\n",
+    "\n",
+    "The function below returns two lists of pattern dictionaries (primary and secondary), with one dictionary for each symptom variation. These will be used in the main extraction function later on.\n",
+    "\n",
+    "### Other custom entities\n",
+    "\n",
+    "Two other pattern generator functions are listed below; one for \"previous medical history\" entities and one for \"family\" entities. These will be used in combination with symptom entities for more refined symptom identification following the NLP extraction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "88d6054f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Sample of pattern dictionaries for the 'prior pe' symptom:\n",
+      "\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pe'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pes'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'dvt'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'dvts'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pulmonary'}, {'lower': 'embolism'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'deep'}, {'lower': 'vein'}, {'lower': 'thrombosis'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'pe'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'pes'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'dvt'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'dvts'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'pulmonary'}, {'lower': 'embolism'}], 'id': 'prior pe'}\n",
+      "{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'deep'}, {'lower': 'vein'}, {'lower': 'thrombosis'}], 'id': 'prior pe'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "def create_symptom_patterns(symp_dict):\n",
+    "    #uses the dictionary keys as the ids for each pattern\n",
+    "    \n",
+    "    symp_patterns = []\n",
+    "    secondary_symp_patterns = []\n",
+    "    for symptom_name,symp_phrase_options_object in symp_dict.items():\n",
+    "        \n",
+    "        symp_phrase_options = list(symp_phrase_options_object)\n",
+    "        \n",
+    "        #deal with phrase combinations separately to avoid having to write them all out\n",
+    "        #data column must contain \"make_phrase_combinations\" string for this\n",
+    "        #see \"prior_pe\" column for example\n",
+    "        \n",
+    "        if \"make_phrase_combinations\" in symp_phrase_options:\n",
+    "            \n",
+    "            phrase = [i for i in list(symp_phrase_options) if \"/\" in i][0]\n",
+    "            \n",
+    "            phrase_combos_list = []\n",
+    "            phrase_components = re.split('_',phrase)\n",
+    "            \n",
+    "            for component_variations in phrase_components:\n",
+    "                #split the component into its \"/\"-separated alternatives\n",
+    "                component_variation_list = component_variations.split('/')\n",
+    "                component_variation_list = [i.lower() for i in component_variation_list if i not in [\"\",None]]\n",
+    "                \n",
+    "                phrase_combos_list.append(component_variation_list)\n",
+    "        \n",
+    "            symp_phrase_options = list(itertools.product(*phrase_combos_list))\n",
+    "            \n",
+    "            # function that converts tuple to string\n",
+    "            def join_tuple_string(strings_tuple) -> str:\n",
+    "                return ' '.join(strings_tuple)\n",
+    "\n",
+    "            # joining all the tuples\n",
+    "            result = map(join_tuple_string, symp_phrase_options)\n",
+    "            \n",
+    "            # converting the result back to a list of phrase strings\n",
+    "            symp_phrase_options = list(result)\n",
+    "        \n",
+    "        #used to include any patterns generated from a column with an entry \"secondary_matcher\"\n",
+    "        #into a phrase matcher that comes after the primary one used for all the other patterns\n",
+    "        \n",
+    "        #needed because the phrase \"prior pe\" in the text would match with \"pe\" instead of \"prior pe\"\n",
+    "        #which means it is dealt with as if there is no mention of \"prior/previous\" in find_pmh_entities\n",
+    "        if \"secondary_matcher\" in symp_phrase_options:\n",
+    "            secondary_patterns = True\n",
+    "            symp_phrase_options.remove(\"secondary_matcher\")\n",
+    "            \n",
+    "        else:\n",
+    "            secondary_patterns = False\n",
+    "            \n",
+    "          \n",
+    "        for phrase in symp_phrase_options:\n",
+    "               \n",
+    "            #split on whitespace or on any non-alphanumeric character\n",
+    "            #the capture group keeps non-space symbols as separate items\n",
+    "            split_phrase = re.split(r'\\s|([^a-zA-Z0-9])',phrase)\n",
+    "            split_phrase = [i.lower() for i in split_phrase if i not in [\"\",None]]\n",
+    "            \n",
+    "            new_pattern = []\n",
+    "            #create phrase pattern that copes with alphanumeric and symbol token patterns\n",
+    "            for i in split_phrase:\n",
+    "                if i.isalnum():\n",
+    "                    #match lower case alphanum sequence \n",
+    "                    new_pattern.append({\"lower\":i})\n",
+    "                else:\n",
+    "                    #match exact symbols\n",
+    "                    new_pattern.append({\"orth\":i})\n",
+    "                    \n",
+    "            if not secondary_patterns:\n",
+    "                symp_patterns.append({\"label\":\"SYM\",\"pattern\":new_pattern,\"id\":symptom_name})\n",
+    "            \n",
+    "            else:\n",
+    "                secondary_symp_patterns.append({\"label\":\"SYM\",\"pattern\":new_pattern,\"id\":symptom_name})\n",
+    "                \n",
+    "    #manual addition of more complex patterns including optional parts \n",
+    "    #(could be generated using a conditional like above but this was a simpler, short-term option)\n",
+    "    #any symptom ids used here must be the same as a column header in the symptom file\n",
+    "    #added once, after the loop over symptoms, so the pattern is not duplicated for every symptom\n",
+    "    symp_patterns.append({\"label\":\"SYM\",\"pattern\":[\n",
+    "        {\"lower\":\"recent\"},\n",
+    "        #any single alphabetic token, optional\n",
+    "        {\"IS_ALPHA\":True,\"OP\":\"?\"},\n",
+    "        {\"lower\":{\"IN\":[\"surg\",\"surgery\"]}}\n",
+    "    ],\"id\":\"recent surgery\"})\n",
+    "        \n",
+    " \n",
+    "    return symp_patterns, secondary_symp_patterns\n",
+    "\n",
+    "#create patterns for previous medical history entities\n",
+    "def create_pmh_patterns():\n",
+    "    \n",
+    "    pmh_list = ['pmh','pmx', 'previous medical history','previous hx', 'previous mh',\\\n",
+    "               'hx of','hs of', 'bg','phx', 'history of','pmhx']\n",
+    "    \n",
+    "    pmh_patterns = []\n",
+    "    for phrase in pmh_list:\n",
+    "        #split on space or non-alphanum character\n",
+    "        #capture non-space alphanum (i.e. symbols)\n",
+    "        split_phrase = re.split(r'\\s|([^a-zA-Z0-9])',phrase)\n",
+    "        split_phrase = [i.lower() for i in split_phrase if i not in [\"\",None]]\n",
+    "\n",
+    "        new_pattern = []\n",
+    "        #create phrase pattern that copes with alpnum and symbols token patterns\n",
+    "        for i in split_phrase:\n",
+    "            if i.isalnum():\n",
+    "                #match lower case alphanum sequence \n",
+    "                new_pattern.append({\"lower\":i})\n",
+    "            else:\n",
+    "                #match exact symbols\n",
+    "                new_pattern.append({\"orth\":i})\n",
+    "                \n",
+    "        pmh_patterns.append({\"label\":\"PMH\",\"pattern\":new_pattern,\"id\":\"pmh\"})\n",
+    "    \n",
+    "    return pmh_patterns\n",
+    "\n",
+    "#create patterns for family history / family entities\n",
+    "def create_family_patterns():\n",
+    "    \n",
+    "    family_hx_list = ['fh','fhx', 'f hx', ]\n",
+    "    family_list = ['family','fam',]\n",
+    "    \n",
+    "    family_patterns = []\n",
+    "    for phrase in family_hx_list:\n",
+    "        #split on space or non-alphanum character\n",
+    "        #capture non-space alphanum (i.e. symbols)\n",
+    "        split_phrase = re.split(r'\\s|([^a-zA-Z0-9])',phrase)\n",
+    "        split_phrase = [i.lower() for i in split_phrase if i not in [\"\",None]]\n",
+    "\n",
+    "        new_pattern = []\n",
+    "        #create phrase pattern that copes with alpnum and symbols token patterns\n",
+    "        for i in split_phrase:\n",
+    "            if i.isalnum():\n",
+    "                #match lower case alphanum sequence \n",
+    "                new_pattern.append({\"lower\":i})\n",
+    "            else:\n",
+    "                #match exact symbols\n",
+    "                new_pattern.append({\"orth\":i})\n",
+    "                \n",
+    "        family_patterns.append({\"label\":\"FHX\",\"pattern\":new_pattern,\"id\":\"family_hx\"})\n",
+    "        \n",
+    "    for phrase in family_list:\n",
+    "        #split on space or non-alphanum character\n",
+    "        #capture non-space alphanum (i.e. symbols)\n",
+    "        split_phrase = re.split(r'\\s|([^a-zA-Z0-9])',phrase)\n",
+    "        split_phrase = [i.lower() for i in split_phrase if i not in [\"\",None]]\n",
+    "\n",
+    "        new_pattern = []\n",
+    "        #create phrase pattern that copes with alpnum and symbols token patterns\n",
+    "        for i in split_phrase:\n",
+    "            if i.isalnum():\n",
+    "                #match lower case alphanum sequence \n",
+    "                new_pattern.append({\"lower\":i})\n",
+    "            else:\n",
+    "                #match exact symbols\n",
+    "                new_pattern.append({\"orth\":i})\n",
+    "                \n",
+    "        family_patterns.append({\"label\":\"FAM\",\"pattern\":new_pattern,\"id\":\"family\"})\n",
+    "        \n",
+    "    return family_patterns\n",
+    "\n",
+    "#functions to create all entity patterns\n",
+    "def create_ent_patterns(symptoms):\n",
+    "    symptom_patterns, secondary_symptom_patterns = create_symptom_patterns(symptoms)\n",
+    "    pmh_patterns = create_pmh_patterns()\n",
+    "    family_patterns = create_family_patterns()\n",
+    "\n",
+    "    all_ent_patterns = symptom_patterns + pmh_patterns + family_patterns\n",
+    "    \n",
+    "    return all_ent_patterns, secondary_symptom_patterns\n",
+    "\n",
+    "all_patterns, secondary_symptom_patterns = create_ent_patterns(symptom_dict)\n",
+    "\n",
+    "print(\"Sample of pattern dictionaries for the 'prior pe' symptom:\\n\")\n",
+    "for i in all_patterns:\n",
+    "    if i['id'] == 'prior pe':\n",
+    "        print(i)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "ed116107",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[{'label': 'SYM',\n",
+       "  'pattern': [{'lower': 'pulmonary'}, {'lower': 'embolism'}],\n",
+       "  'id': 'pe'},\n",
+       " {'label': 'SYM',\n",
+       "  'pattern': [{'lower': 'deep'}, {'lower': 'vein'}, {'lower': 'thrombosis'}],\n",
+       "  'id': 'pe'},\n",
+       " {'label': 'SYM', 'pattern': [{'lower': 'dvts'}], 'id': 'pe'},\n",
+       " {'label': 'SYM', 'pattern': [{'lower': 'dvt'}], 'id': 'pe'},\n",
+       " {'label': 'SYM', 'pattern': [{'lower': 'pe'}], 'id': 'pe'},\n",
+       " {'label': 'SYM', 'pattern': [{'lower': 'pes'}], 'id': 'pe'}]"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "secondary_symptom_patterns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "61e9fc36",
+   "metadata": {},
+   "source": [
+    "## Freetext preprocessing\n",
+    "\n",
+    "The first function defined below runs as part of the extraction process and can be modified to include any pre-processing deemed necessary. It extracts the useful text from any XML strings using the second and third functions in the cell (the original dataset contained XML strings) and converts the freetext to lower case. It also adds a full stop before a set of manually chosen phrases that usually indicate the start of a new section or sentence in the text, which helps negated phrases to be identified correctly. \n",
+    "\n",
+    "A function to remove any entries that are sub-strings of other entries for the same encounter is also defined here.\n",
+    "\n",
+    "> N.B. the preprocessing is defined separately from the NLP pipeline here and modifies the text before it enters the pipeline. However, it could also be added to the pipeline as a custom component."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "d60fdf3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def freetext_preprocessing(blob_df):\n",
+    "    \n",
+    "    freetext = blob_df['text']\n",
+    "    \n",
+    "    freetext = freetext.apply(clean_blob_xml)\n",
+    "     \n",
+    "    #convert all text to lowercase\n",
+    "    freetext = freetext.str.lower()\n",
+    "    \n",
+    "    #add full-stop before these words to stop ongoing negation\n",
+    "    pmh_terms = [\"pmh\",\"pmx\",'pmhx', 'hpc', 'meds','dx','dh','oe:','o/e','author:',\n",
+    "                'past medical history:','social history:','dhx','shx','pc','sh','medications']\n",
+    "    for term in pmh_terms:\n",
+    "        freetext.replace(to_replace = r'\\W{term}\\W'.format(term = term), value = f'. {term} ', regex=True, inplace=True)\n",
+    "     \n",
+    "    #replace each newline (plus any non-word characters after it and an\n",
+    "    #optional second newline) with a full-stop so that a new line or\n",
+    "    #section is treated as a new sentence\n",
+    "    freetext.replace(to_replace = r'\\n\\W*\\n?', value = '. ', regex=True, inplace=True)\n",
+    "\n",
+    "    #collapse multiple full-stops in a row into one\n",
+    "    freetext.replace(to_replace = r'\\W*\\.(\\W*\\.\\W*)*', value = '. ', regex=True, inplace=True)\n",
+    "    \n",
+    "    blob_df['text'] = freetext\n",
+    "    \n",
+    "    blob_df = blob_df[['text','e_id']].groupby(by='e_id').apply(remove_sub_strings)\n",
+    "    \n",
+    "    #drop e_id index \n",
+    "    blob_df = blob_df.droplevel(\"e_id\")\n",
+    "    \n",
+    "    return blob_df\n",
+    "\n",
+    "\n",
+    "#functions to extract text from xml strings \n",
+    "def clean_blob_xml(freetext):\n",
+    "    if freetext[0:5] == \"<?xml\":\n",
+    "        return xml_to_string(freetext)\n",
+    "    else:\n",
+    "        return freetext\n",
+    "    \n",
+    "\n",
+    "def xml_to_string(xml_string):\n",
+    "    #convert xml string to html object with lxml\n",
+    "    root = html.fromstring(str.encode(xml_string))\n",
+    "    #convert html from lxml object to bs4 html object\n",
+    "    #(parser named explicitly to avoid the bs4 'no parser specified' warning)\n",
+    "    soup = BeautifulSoup(html.tostring(root), 'lxml')\n",
+    "    #extract text from html keeping newlines\n",
+    "    html_string = soup.get_text('\\n')\n",
+    "    \n",
+    "    return html_string\n",
+    "\n",
+    "#remove entries that are sub-strings of others for same encounter\n",
+    "def remove_sub_strings(df):\n",
+    "        \n",
+    "    #sort rows by length of string - longer strings can't be sub strings of shorter ones\n",
+    "    df = df.sort_values(by='text',key=lambda x: x.str.len())\n",
+    "    sorted_text = df['text'].values\n",
+    "    \n",
+    "    #flag each entry that is not contained in any of the entries sorted after it\n",
+    "    not_sub_string = [\n",
+    "        not any(text in other for other in sorted_text[idx+1:])\n",
+    "        for idx, text in enumerate(sorted_text)\n",
+    "    ]\n",
+    "\n",
+    "    #drop rows that are sub strings of others\n",
+    "    df = df[not_sub_string]\n",
+    "\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8812b1a",
+   "metadata": {},
+   "source": [
+    "## Negation identification\n",
+    "\n",
+    "The negspacy module, which provides a negation component for spaCy, is used to determine whether any identified symptoms are negated in the text. For example, in \"patient has SOB but *not*  hypertension\", hypertension is negated but SOB isn't. ref: https://github.com/jenojp/negspacy\n",
+    "\n",
+    "The component is based on the NegEx algorithm (https://doi.org/10.1006/jbin.2001.1029) and uses lists of negation patterns to label specific entities within the text.\n",
+    "\n",
+    "The negation pattern types are:\n",
+    " - **pseudo_negations** - phrases that are false triggers, ambiguous negations, or double negatives\n",
+    " - **preceding_negations** - negation phrases that precede an entity\n",
+    " - **following_negations** - negation phrases that follow an entity\n",
+    " - **termination** - phrases that cut a sentence into parts, for purposes of negation detection (e.g. \"but\")\n",
+    "\n",
+    "If a preceding or following negation is found in the text, any entity after or before it, respectively, is classed as negated. Termination patterns stop any negation from passing through them (e.g. in \"doesn't have HT but has CHF\", CHF would not be negated). Pseudo-negations are removed retrospectively if they were initially picked up as preceding or following negations (e.g. \"not necessarily HT\" would initially be negated due to \"not\", but this is reverted because \"not necessarily\" is a pseudo-negation).\n",
+    "\n",
+    "negspacy was initially developed for clinical data and as such, its default term set is designed for clinical use.\n",
+    "\n",
+    "We can edit the termsets by adding or removing patterns to fit our use case. Here, \"nil\" has been added as both a preceding and a following negation, and \"pmh\" and \"pmx\" have been included as terminations as they usually indicate the start of a new section. The full stop added before section headings during preprocessing serves the same purpose as these termination terms, and is arguably easier to interpret.\n",
+    "\n",
+    "See below for examples of patterns from each category in the en_clinical termset:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "627986dd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "pseudo_negations :\n",
+      "['no further', 'not able to be', 'not certain if', 'not certain whether', 'not necessarily', 'without any further', 'without difficulty', 'without further', 'might not', 'not only'] \n",
+      "\n",
+      "preceding_negations :\n",
+      "['without indication of', 'fails to reveal', 'rule out', 'never', 'denied', 'no signs of', 'couldnt', 'nil', 'not demonstrate', 'negative for'] \n",
+      "\n",
+      "following_negations :\n",
+      "['free', 'was not', 'unlikely', 'were not', 'were ruled out', \"wasn't\", \"weren't\", 'was ruled out', 'nil', 'werent'] \n",
+      "\n",
+      "termination :\n",
+      "['still', 'trigger event for', 'however', 'aside from', 'as there are', 'etiology for', 'other possibilities of', 'except', 'origin for', 'source for'] \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "def create_negation_termset():\n",
+    "    ts = termset(\"en_clinical\")\n",
+    "    \n",
+    "    ts.add_patterns({\n",
+    "        \"preceding_negations\":['nil'],\n",
+    "        \"following_negations\":['nil'],\n",
+    "        \"termination\": [\"pmh\",'pmx', ], \n",
+    "    })\n",
+    "    \n",
+    "    return ts\n",
+    "    \n",
+    "tas = create_negation_termset()    \n",
+    "    \n",
+    "for key, items in tas.get_patterns().items():\n",
+    "    print(key,\":\")\n",
+    "    print(items[:10],'\\n')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23491e17",
+   "metadata": {},
+   "source": [
+    "## Generating the NLP pipeline\n",
+    "\n",
+    "The below function brings together the components introduced above to create the NLP pipeline that will be used to extract the symptoms. \n",
+    "\n",
+    "As only a simple NLP pipeline is needed here, we start with a blank pipeline object. The default sentence segmentation component is added to identify sentences, which is necessary for the negation component. The default spaCy pipeline could be used instead and would provide a lot more information about the structure and content of the input text. For an initial, rule-based NLP model, these non-essential components can be omitted for the sake of simplicity and performance. \n",
+    "\n",
+    "### Entity recogniser\n",
+    "Then, the main rule-based entity recogniser is added, using the symptom, PMH and family patterns created by the pattern generator functions above. This identifies any occurrences of the symptoms or their variations in the text and labels them as \"SYM\" entities. Similarly, PMH, FAM and FHX entities are matched and labelled.\n",
+    "\n",
+    "### Secondary entity recogniser\n",
+    "A second rule-based entity recogniser is used for overlapping patterns with a lower priority than those included in the first matcher. For example, for the dataset originally used, we were interested in whether a patient had previously had a pulmonary embolism (PE) or deep vein thrombosis (DVT). Mentions of *prior* pe/dvt can be extracted automatically; however, plain mentions of pe/dvt should only be included in the conclusions about the patient if they are described as having been previously experienced by the patient. As such, if they follow a \"previous medical history\" entity then they can be considered past instances of the concept. \n",
+    "\n",
+    "*Other phrases that do not follow a PMH entity but otherwise imply a previous condition, e.g. \"patient has previously had pe\", are not extracted with this rule-based method as there are too many possible variations. Luckily, most notes are written in the same format and most instances of prior pe are covered by the patterns generated previously.*\n",
+    "\n",
+    "Therefore, we want \"prior pe\" to be matched rather than just \"pe\", which is what would happen if both the \"prior pe\" and \"pe\" patterns were in the same matcher. spaCy prioritises the entities labelled earlier in the pipeline, so we can use a secondary matcher for patterns that may overlap with patterns of higher priority. As we will be checking whether \"pe\" entities are preceded by a \"pmh\" entity, we do not want to lose any \"prior pe\" mention that would otherwise have been matched by the \"pe\" pattern. \n",
+    "\n",
+    "### Negation\n",
+    "Finally, the negation component is added, including a few custom negations that were manually identified as negators in freetext examples from this dataset.\n",
+    "\n",
+    "The resulting NLP model, known in spaCy as a *Language* object, can now be used to extract symptoms from any text fed into it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "15e780f0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#initialise nlp model using symptom patterns and custom negation\n",
+    "def make_nlp_model(symptoms):\n",
+    "    #input: takes dictionary of symptoms and their variations in the format created above\n",
+    "    # as the symptom_dict variable\n",
+    "    \n",
+    "    #load blank spaCy nlp pipeline\n",
+    "    nlp = spacy.blank(\"en\")\n",
+    "    \n",
+    "    #add default sentence segmentation component\n",
+    "    nlp.add_pipe(\"sentencizer\")\n",
+    "    \n",
+    "    #creates new rule-based entity recognition pipeline component\n",
+    "    #matches patterns using the LOWER matcher attribute (looks for strings matching when lower case)\n",
+    "    config = {\n",
+    "       \"phrase_matcher_attr\": \"LOWER\",\n",
+    "       \"validate\": False,\n",
+    "       \"overwrite_ents\": False,\n",
+    "       \"ent_id_sep\": \"||\",\n",
+    "    }\n",
+    "\n",
+    "    #add all symptom phrases as patterns to rule-based entity recog.\n",
+    "    patterns, secondary_patterns = create_ent_patterns(symptoms)\n",
+    "    \n",
+    "    ruler = nlp.add_pipe('entity_ruler',\"phrase_matcher_1\",config = config)\n",
+    "    ruler.add_patterns(patterns)\n",
+    "    \n",
+    "    #secondary matcher for lower priority matches \n",
+    "    secondary_ruler = nlp.add_pipe(\"entity_ruler\",\"phrase_matcher_2\",config = config)\n",
+    "    secondary_ruler.add_patterns(secondary_patterns)\n",
+    "    \n",
+    "    #add component to find negation of symptom entities\n",
+    "    #add custom negation patterns\n",
+    "    ts = create_negation_termset()\n",
+    "    \n",
+    "    nlp.add_pipe(\"negex\", config={\"ent_types\":[\"SYM\",\"PMH\",\"FAM\"],'neg_termset':ts.get_patterns()})\n",
+    "    \n",
+    "    return nlp\n",
+    "\n",
+    "nlp_model = make_nlp_model(symptom_dict)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "45a52ded",
+   "metadata": {},
+   "source": [
+    "### spaCy \"Language\" class object\n",
+    "\n",
+    "The language object returned by the above function, typically called *nlp* (here it's called *nlp_model*), contains all of the information about the NLP pipeline created as well as all of the vocabulary associated with it. \n",
+    "\n",
+    "#### Applying the pipeline\n",
+    "Applying the model to some text produces information about any symptom, PMH and family entities it contains and whether each is negated. To do this, the language object is simply called with the desired text as its input. \n",
+    "\n",
+    "The resulting output is a *doc* object that contains all of this information for this specific text. Below, we extract any entities that have been identified as well as their corresponding negation. As the pipeline only contains the rule-based entity recognisers defined above, only SYM, PMH, FAM and FHX entities will be returned."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "04708758",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Extracted symptoms from:  'pt has recently had surgery. no SOB pmx htn and chf ' \n",
+      "\n",
+      "Word       Identified symptom        Symptom is negated\n",
+      "\n",
+      "SOB        dyspnea                   True  \n",
+      "pmx        pmh                       False \n",
+      "htn        hypertension              False \n",
+      "chf        chronic heart failure     False \n"
+     ]
+    }
+   ],
+   "source": [
+    "text = \"'pt has recently had surgery. no SOB pmx htn and chf '\"\n",
+    "\n",
+    "doc = nlp_model(text)\n",
+    "print(\"Extracted symptoms from: \", text,\"\\n\")\n",
+    "print(\"%-10s %-25s %-6s\" % (\"Word\", \"Identified symptom\",\"Symptom is negated\\n\"))\n",
+    "for i in doc.ents:\n",
+    "    print(f\"{str(i):{10}} {str(i.ent_id_):{25}} {str(i._.negex):{6}}\")\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8be61d2b",
+   "metadata": {},
+   "source": [
+    "## Identifying previous PE/DVT\n",
+    "\n",
+    "The below function checks whether any occurrences of \"pe\" / \"dvt\" are preceded by a PMH entity, indicating that they are previous occurrences of pe/dvt and can be included along with other identified \"prior pe\" entities. If a family (FAM) entity directly precedes a PMH entity, any symptoms that follow are ignored until the next PMH entity; a FHX (family history) entity on its own has the same effect."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "38cd870f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#find instances of pe / dvt symptom and only keep those that occur after pmh entity\n",
+    "\n",
+    "def find_pmh_entities(doc):\n",
+    "    new_doc_ents = []\n",
+    "    \n",
+    "    for sent in doc.sents:\n",
+    "\n",
+    "        entities = list(sent.ents)\n",
+    "        ent_ids = [i.ent_id_ for i in entities]\n",
+    "\n",
+    "        if \"pe\" in ent_ids:\n",
+    "            fam_pmh = False\n",
+    "            after_pmh = False\n",
+    "            fam_on = False\n",
+    "\n",
+    "            #loop through entities \n",
+    "            for entity in entities:\n",
+    "                label = entity.label_\n",
+    "                ent_id = entity.ent_id_\n",
+    "\n",
+    "                #if entity is pe symptom include only if after pmh entity\n",
+    "                if label == \"SYM\":\n",
+    "                    if ent_id == \"pe\":\n",
+    "                        if after_pmh and not fam_pmh:\n",
+    "                            new_doc_ents.append(entity)\n",
+    "\n",
+    "                        else:\n",
+    "                            continue\n",
+    "\n",
+    "                    #for other SYM entities include only if not after fam pmh\n",
+    "                    else:\n",
+    "                        if not fam_pmh:\n",
+    "                            #new_doc_ents.append(make_span(doc,entity))\n",
+    "                            new_doc_ents.append(entity)\n",
+    "\n",
+    "                        else:\n",
+    "                            continue\n",
+    "\n",
+    "                #if family entity found\n",
+    "                elif label == \"FAM\":\n",
+    "                    fam_on = True\n",
+    "                    fam_end_loc = (entity.end)\n",
+    "\n",
+    "                #if pmh entity found\n",
+    "                elif label == \"PMH\":\n",
+    "                    #if a family entity ends at most one token before this pmh entity,\n",
+    "                    #treat it as family history - the symptoms that follow are removed\n",
+    "                    if fam_on and entity.start <= fam_end_loc + 1:\n",
+    "                        fam_pmh = True\n",
+    "                        after_pmh = False\n",
+    "                    else:\n",
+    "                        fam_pmh = False\n",
+    "                        after_pmh = True\n",
+    "                        #include PMH entities\n",
+    "                    new_doc_ents.append(entity)\n",
+    "\n",
+    "                elif label == \"FHX\":\n",
+    "                    fam_pmh = True\n",
+    "                    after_pmh = False\n",
+    "                    new_doc_ents.append(entity)\n",
+    "\n",
+    "                else:\n",
+    "                    print(\"entity not dealt with \")\n",
+    "\n",
+    "        else:\n",
+    "            new_doc_ents += sent.ents\n",
+    "\n",
+    "    doc.set_ents(new_doc_ents)\n",
+    "    \n",
+    "    return doc\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "579cd814",
+   "metadata": {},
+   "source": [
+    "## Using NLP to extract symptoms \n",
+    "\n",
+    "The below cells bring together all of the previous cells. The NLP pipeline is created, and the freetext is pre-processed and then passed through the pipeline. Any \"pe\" entities (as opposed to \"prior pe\") are checked for a preceding PMH entity by the function above, and any positive matches are included as \"prior pe\" matches. For each freetext row, the symptoms and whether or not they are negated are extracted. \n",
+    "\n",
+    "### Combining extracted symptoms \n",
+    "\n",
+    "The same symptom may be found several times in the same text extract, possibly with differing negations. In addition, as mentioned previously, there were multiple freetext rows for most encounters in the original dataset - on average each encounter has ~10 freetext entries. As such, multiple labels for the same symptom must be reconciled once the labels have been found for each text entry. \n",
+    "\n",
+    "The algorithm used to decide a final symptom presence classification of True or False is a simple one and could be adapted for different datasets or to treat different symptoms differently. It works as follows:\n",
+    " - The symptom extracts are collated for each encounter ID\n",
+    " - If a symptom gets no matches from any freetext, the boolean symptom presence label defaults to False\n",
+    " - If there are one or more matches for a symptom, whichever negation label is more common across the matched words determines the final presence label (e.g. if the negation labels are [True, False, True] for the cancer symptom, most mentions are negated, so the final presence label will be False)\n",
+    " - If there are an equal number of each negation label, the presence label defaults to True\n",
+    " \n",
+    "This is performed by the *combine_e_ids* function below and executed at the end of the main *extract_symptoms* function. \n",
+    "\n",
+    "*Note that the truth values returned by the NLP pipeline indicate whether the symptom was negated in the text, so a True label implies that the symptom **isn't** present in the patient. The output truth values indicate presence of the symptom, hence the inversion that is implicit in the function.*\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "d5b4b763",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# input is freetext data table generated below grouped by encounter id.\n",
+    "# the function combines identified symptoms to create single symptom\n",
+    "#presence label\n",
+    "def combine_e_ids(df, include_ratio=False):\n",
+    "    \n",
+    "    new_df = {}\n",
+    "    \n",
+    "    #iterate through each symptom type\n",
+    "    #combine results across all freetext for a given encounter\n",
+    "    for symptom in df['extracted_symps_dict'].values[0].keys():\n",
+    "        #bias towards positive diagnosis if equal number of +/- ive\n",
+    "        symptom_count = 0\n",
+    "        total_count = 0\n",
+    "        total_records = 0\n",
+    "        for d in df['extracted_symps_dict'].values:\n",
+    "            total_records += 1\n",
+    "            for i in d[symptom]:\n",
+    "                #if negation value is True adds 1; 0 if False\n",
+    "                symptom_count += i\n",
+    "                total_count += 1\n",
+    "        \n",
+    "        final_presence_label = None\n",
+    "        \n",
+    "        #if there are no labels either +/- ive then default to negative\n",
+    "        if total_count == 0:\n",
+    "            final_presence_label = False\n",
+    "            \n",
+    "        else:\n",
+    "            #fraction of matches that were negated\n",
+    "            #(a negation label of True counts as 1, False as 0)\n",
+    "            mean_count = symptom_count / total_count\n",
+    "            \n",
+    "            #majority of matches negated, so symptom is classed as absent\n",
+    "            if mean_count > 0.5:\n",
+    "                final_presence_label = False\n",
+    "                \n",
+    "            #at least half of matches not negated, so symptom is classed as present\n",
+    "            else:\n",
+    "                final_presence_label = True\n",
+    "        \n",
+    "        #to include the ratio of True/False labels in the table output\n",
+    "        if include_ratio:\n",
+    "            pos_count = total_count - symptom_count\n",
+    "            neg_count = symptom_count\n",
+    "            final_presence_label = str(final_presence_label) + f\" ({pos_count}+  {neg_count}-)\"\n",
+    "        \n",
+    "        new_df.update({symptom:final_presence_label})\n",
+    "            \n",
+    "    new_df.update({\"Total records\":total_records})\n",
+    "    \n",
+    "    all_text = \"\"\n",
+    "    #generate string containing all notes for an encounter\n",
+    "    for i, text in enumerate(df['text'].values):\n",
+    "        all_text += f\"---------\\n Note {i}\\n---------\\n\"\n",
+    "        all_text += text\n",
+    "        all_text += \"\\n\\n\"\n",
+    "    \n",
+    "    new_df.update({\"All Notes\":all_text})\n",
+    "    \n",
+    "    return pd.DataFrame(index = [0],data = new_df)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "22d90d02",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "before preprocessing  5\n",
+      "after preprocessing  5\n",
+      "Row  0\n",
+      "Time taken for 5 rows: 0 seconds\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>e_id</th>\n",
+       "      <th>hypertension</th>\n",
+       "      <th>chronic heart failure</th>\n",
+       "      <th>cancer</th>\n",
+       "      <th>prior pe</th>\n",
+       "      <th>chest pain</th>\n",
+       "      <th>dyspnea</th>\n",
+       "      <th>doa</th>\n",
+       "      <th>recent surgery</th>\n",
+       "      <th>Total records</th>\n",
+       "      <th>All Notes</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (2+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>2</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npt has not had ...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>222</td>\n",
+       "      <td>True (2+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>False (1+  3-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>2</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npc: no chest pa...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>333</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>True (2+  1-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>1</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npatient known k...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   e_id    hypertension chronic heart failure          cancer        prior pe  \\\n",
+       "0   111  False (0+  0-)         True (1+  0-)  False (0+  0-)   True (2+  0-)   \n",
+       "1   222   True (2+  0-)        False (0+  0-)  False (0+  0-)  False (0+  0-)   \n",
+       "2   333   True (1+  0-)        False (0+  1-)   True (2+  1-)  False (0+  0-)   \n",
+       "\n",
+       "       chest pain         dyspnea             doa  recent surgery  \\\n",
+       "0  False (0+  0-)  False (0+  1-)   True (1+  0-)  False (0+  1-)   \n",
+       "1  False (0+  1-)  False (1+  3-)  False (0+  0-)  False (0+  0-)   \n",
+       "2  False (0+  0-)   True (1+  0-)  False (0+  0-)  False (0+  0-)   \n",
+       "\n",
+       "   Total records                                          All Notes  \n",
+       "0              2  ---------\\n Note 0\\n---------\\npt has not had ...  \n",
+       "1              2  ---------\\n Note 0\\n---------\\npc: no chest pa...  \n",
+       "2              1  ---------\\n Note 0\\n---------\\npatient known k...  "
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#function to search blob text for symptoms and their variations\n",
+    "#listed in freetext_term_df\n",
+    "\n",
+    "\n",
+    "def extract_symptoms(blob_df,symptoms, include_ratio = False, include_uncombined = False):\n",
+    "    #include_ratio - includes the ratio of True and False presence labels \n",
+    "    #that determined the final presence label\n",
+    "\n",
+    "    #include_uncombined - include the uncombined df which contains doc objects \n",
+    "    #for all freetext entries\n",
+    "\n",
+    "    start_time = time.time()\n",
+    "    \n",
+    "    #generate nlp model using symptoms and their variations\n",
+    "    nlp = make_nlp_model(symptoms)\n",
+    "    \n",
+    "    #pre-process freetext - see function definition above for details\n",
+    "    print(\"before preprocessing \",blob_df.shape[0])\n",
+    "    blob_df = freetext_preprocessing(blob_df)\n",
+    "    print(\"after preprocessing \",blob_df.shape[0])\n",
+    "    \n",
+    "    #uncomment to test function on sample of df\n",
+    "    #blob_df = blob_df.head(10)\n",
+    "     \n",
+    "    #one dictionary per freetext entry, mapping each symptom to the list of\n",
+    "    #negation flags (True = negated) for every term found in that entry\n",
+    "    blob_symptoms_dict = []\n",
+    "    \n",
+    "    #one string per freetext entry listing each found symptom with its\n",
+    "    #negation flag in brackets e.g. hypertension (False)\n",
+    "    blob_symptoms = []\n",
+    "    \n",
+    "    #list of the doc objects for each freetext entry\n",
+    "    blob_docs = []\n",
+    "    \n",
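+    "    #sort once up front so that the lists built in the loop below line up\n",
+    "    #row-for-row with blob_df when they are assigned as columns\n",
+    "    blob_df = blob_df.sort_index()\n",
+    "    \n",
+    "    #iterate through freetext df one entry at a time\n",
+    "    for row_id, row in blob_df.iterrows():\n",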
+    "        #print progress\n",
+    "        if row_id % 500 == 0:\n",
+    "            print(\"Row \",row_id)\n",
+    "        \n",
+    "        #run freetext through nlp pipeline\n",
+    "        doc = nlp(row['text'])\n",
+    "        \n",
+    "        #keep instances of pe/dvt only following PMH entity\n",
+    "        doc = find_pmh_entities(doc)\n",
+    "        \n",
+    "        #initialise a dictionary with an empty list for each symptom,\n",
+    "        #to be filled with any found occurrences\n",
+    "        all_symps_names = list(symptoms.keys())\n",
+    "        all_symps_names.remove('pe')\n",
+    "        symp_dict = {term:[] for term in all_symps_names}\n",
+    "        symp_string = None\n",
+    "        \n",
+    "        #loop through all found entities in freetext doc object\n",
+    "        for e in doc.ents:\n",
+    "            \n",
+    "            #if entity is of custom symptom type: SYM\n",
+    "            if e.label_ == \"SYM\":\n",
+    "                \n",
+    "                #convert pe to prior pe to combine the columns\n",
+    "                if e.ent_id_ == 'pe':\n",
+    "                    ent_id = 'prior pe'\n",
+    "                else:\n",
+    "                    ent_id = e.ent_id_\n",
+    "                    \n",
+    "                \n",
+    "                symp_dict[ent_id].append(e._.negex)\n",
+    "                if symp_string is None:\n",
+    "                    symp_string = f\"{ent_id} ({e._.negex}), \"\n",
+    "                else:\n",
+    "                    symp_string += f\"{ent_id} ({e._.negex}), \"\n",
+    "                \n",
+    "        blob_symptoms_dict.append(symp_dict)\n",
+    "        blob_symptoms.append(symp_string)\n",
+    "        blob_docs.append(doc)\n",
+    "    \n",
+    "    blob_df.loc[:,'extracted_symps_'] = blob_symptoms\n",
+    "    blob_df.loc[:,'extracted_symps_dict'] = blob_symptoms_dict\n",
+    "    blob_df.loc[:,'doc'] = blob_docs\n",
+    "    \n",
+    "    \n",
+    "    #group by encounter and combine presence labels\n",
+    "    final_extract_df = blob_df.groupby(by='e_id').apply(combine_e_ids, include_ratio = include_ratio)\n",
+    "    final_extract_df = final_extract_df.reset_index().drop(columns = ['level_1'])\n",
+    "    \n",
+    "    end_time = time.time()\n",
+    "    \n",
+    "    print(f\"Time taken for {blob_df.shape[0]} rows: {round(end_time-start_time)} seconds\")\n",
+    "    \n",
+    "    if include_uncombined:\n",
+    "        return final_extract_df, blob_df, nlp\n",
+    "    \n",
+    "    else:\n",
+    "        return final_extract_df,nlp\n",
+    "       \n",
+    "    \n",
+    "combined_extract_df, uncombined_extract_df ,extract_nlp_model = extract_symptoms(freetext_df, symptom_dict, include_ratio=True, include_uncombined = True)\n",
+    "\n",
+    "\n",
+    "\n",
+    "combined_extract_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "87827889",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>e_id</th>\n",
+       "      <th>extracted_symps_</th>\n",
+       "      <th>extracted_symps_dict</th>\n",
+       "      <th>doc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>pt has not had recent surgery / immobilisation...</td>\n",
+       "      <td>111</td>\n",
+       "      <td>dyspnea (True), chronic heart failure (False),...</td>\n",
+       "      <td>{'hypertension': [], 'chronic heart failure': ...</td>\n",
+       "      <td>(72, h, onset, of, palpitation, s, ,, worse, w...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>72 h onset of palpitation s , worse when walki...</td>\n",
+       "      <td>111</td>\n",
+       "      <td>recent surgery (True), prior pe (False), doa (...</td>\n",
+       "      <td>{'hypertension': [], 'chronic heart failure': ...</td>\n",
+       "      <td>(pt, has, not, had, recent, surgery, /, immobi...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>pc: no chest pains, sob.  pmh  htn, high chole...</td>\n",
+       "      <td>222</td>\n",
+       "      <td>dyspnea (False), dyspnea (True), dyspnea (True...</td>\n",
+       "      <td>{'hypertension': [False], 'chronic heart failu...</td>\n",
+       "      <td>(pre, -, arrival, summary,  , name, :,  , doe,...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>pre-arrival summary  name:  doe, john   curren...</td>\n",
+       "      <td>222</td>\n",
+       "      <td>chest pain (True), dyspnea (True), hypertensio...</td>\n",
+       "      <td>{'hypertension': [False], 'chronic heart failu...</td>\n",
+       "      <td>(pc, :, no, chest, pains, ,, sob, .,  , pmh,  ...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>patient known kidney cancer on no chemo as kno...</td>\n",
+       "      <td>333</td>\n",
+       "      <td>cancer (False), cancer (True), chronic heart f...</td>\n",
+       "      <td>{'hypertension': [False], 'chronic heart failu...</td>\n",
+       "      <td>(patient, known, kidney, cancer, on, no, chemo...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                text  e_id  \\\n",
+       "1  pt has not had recent surgery / immobilisation...   111   \n",
+       "0  72 h onset of palpitation s , worse when walki...   111   \n",
+       "4  pc: no chest pains, sob.  pmh  htn, high chole...   222   \n",
+       "3  pre-arrival summary  name:  doe, john   curren...   222   \n",
+       "5  patient known kidney cancer on no chemo as kno...   333   \n",
+       "\n",
+       "                                    extracted_symps_  \\\n",
+       "1  dyspnea (True), chronic heart failure (False),...   \n",
+       "0  recent surgery (True), prior pe (False), doa (...   \n",
+       "4  dyspnea (False), dyspnea (True), dyspnea (True...   \n",
+       "3  chest pain (True), dyspnea (True), hypertensio...   \n",
+       "5  cancer (False), cancer (True), chronic heart f...   \n",
+       "\n",
+       "                                extracted_symps_dict  \\\n",
+       "1  {'hypertension': [], 'chronic heart failure': ...   \n",
+       "0  {'hypertension': [], 'chronic heart failure': ...   \n",
+       "4  {'hypertension': [False], 'chronic heart failu...   \n",
+       "3  {'hypertension': [False], 'chronic heart failu...   \n",
+       "5  {'hypertension': [False], 'chronic heart failu...   \n",
+       "\n",
+       "                                                 doc  \n",
+       "1  (72, h, onset, of, palpitation, s, ,, worse, w...  \n",
+       "0  (pt, has, not, had, recent, surgery, /, immobi...  \n",
+       "4  (pre, -, arrival, summary,  , name, :,  , doe,...  \n",
+       "3  (pc, :, no, chest, pains, ,, sob, .,  , pmh,  ...  \n",
+       "5  (patient, known, kidney, cancer, on, no, chemo...  "
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "uncombined_extract_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "82ce7a57",
+   "metadata": {},
+   "source": [
+    "### Saving output\n",
+    "For use again in Python, the output can be saved as a 'pickle' file. The advantage of this over a .csv is that it stores Python objects such as the \"doc\" objects as they are, rather than converting them to strings.\n",
+    "\n",
+    "However, it is not necessary to save the pickles to keep using the notebook. Uncomment the functions to save and load the pickles if desired."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "230425a2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def save_pickle(combined_extract_df, blob_df):\n",
+    "    date_time_string = datetime.today().strftime(\"%y.%m.%d-%H%M%S\")\n",
+    "    with open(f'pickle_data/{date_time_string}.pickle', 'wb') as f:\n",
+    "        pickle.dump([combined_extract_df, blob_df],f)\n",
+    "        \n",
+    "def get_latest_pickle():\n",
+    "    folder_dates = os.listdir('pickle_data/')\n",
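+    "    #convert the digits of each file name to an integer so that the\n",
+    "    #timestamps are compared numerically\n",
+    "    int_dates = [int(''.join(i for i in j if i.isdigit())) for j in folder_dates]\n",
+    "    latest_date = folder_dates[np.argmax(int_dates)]\n",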
+    "    file_path = 'pickle_data/' + latest_date \n",
+    "    \n",
+    "    with open(file_path,'rb') as f:\n",
+    "        combined_extract_df, blob_df = pickle.load(f)\n",
+    "        \n",
+    "    return combined_extract_df, blob_df\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "955cb947",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>e_id</th>\n",
+       "      <th>hypertension</th>\n",
+       "      <th>chronic heart failure</th>\n",
+       "      <th>cancer</th>\n",
+       "      <th>prior pe</th>\n",
+       "      <th>chest pain</th>\n",
+       "      <th>dyspnea</th>\n",
+       "      <th>doa</th>\n",
+       "      <th>recent surgery</th>\n",
+       "      <th>Total records</th>\n",
+       "      <th>All Notes</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (2+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>2</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npt has not had ...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>222</td>\n",
+       "      <td>True (2+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>False (1+  3-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>2</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npc: no chest pa...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>333</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>True (2+  1-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>1</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npatient known k...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   e_id    hypertension chronic heart failure          cancer        prior pe  \\\n",
+       "0   111  False (0+  0-)         True (1+  0-)  False (0+  0-)   True (2+  0-)   \n",
+       "1   222   True (2+  0-)        False (0+  0-)  False (0+  0-)  False (0+  0-)   \n",
+       "2   333   True (1+  0-)        False (0+  1-)   True (2+  1-)  False (0+  0-)   \n",
+       "\n",
+       "       chest pain         dyspnea             doa  recent surgery  \\\n",
+       "0  False (0+  0-)  False (0+  1-)   True (1+  0-)  False (0+  1-)   \n",
+       "1  False (0+  1-)  False (1+  3-)  False (0+  0-)  False (0+  0-)   \n",
+       "2  False (0+  0-)   True (1+  0-)  False (0+  0-)  False (0+  0-)   \n",
+       "\n",
+       "   Total records                                          All Notes  \n",
+       "0              2  ---------\\n Note 0\\n---------\\npt has not had ...  \n",
+       "1              2  ---------\\n Note 0\\n---------\\npc: no chest pa...  \n",
+       "2              1  ---------\\n Note 0\\n---------\\npatient known k...  "
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#save_pickle(combined_extract_df, uncombined_extract_df)\n",
+    "#combined_extract_df, blob_df = get_latest_pickle()\n",
+    "combined_extract_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe6aeda6",
+   "metadata": {},
+   "source": [
+    "### Save to csv\n",
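+    "To save the output as a .csv file, the non-symptom columns ('All Notes' and 'Total records') are dropped first. Uncomment the last line of the cell below to write the file."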
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "99835cc4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_to_save = combined_extract_df.drop(columns=['All Notes','Total records']).rename(columns={\"e_id\":\"ENCNTR_ID\"})\n",
+    "#df_to_save.to_csv('nlp_extract.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d22dd4a",
+   "metadata": {},
+   "source": [
+    "## Brief data exploration\n",
+    "The cells below contain various functions for exploring the data and the NLP predictions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "5cee00f3",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>e_id</th>\n",
+       "      <th>hypertension</th>\n",
+       "      <th>chronic heart failure</th>\n",
+       "      <th>cancer</th>\n",
+       "      <th>prior pe</th>\n",
+       "      <th>chest pain</th>\n",
+       "      <th>dyspnea</th>\n",
+       "      <th>doa</th>\n",
+       "      <th>recent surgery</th>\n",
+       "      <th>Total records</th>\n",
+       "      <th>All Notes</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>True (2+  0-)</td>\n",
+       "      <td>False (0+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>True (1+  0-)</td>\n",
+       "      <td>False (0+  1-)</td>\n",
+       "      <td>2</td>\n",
+       "      <td>---------\\n Note 0\\n---------\\npt has not had ...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   e_id    hypertension chronic heart failure          cancer       prior pe  \\\n",
+       "0   111  False (0+  0-)         True (1+  0-)  False (0+  0-)  True (2+  0-)   \n",
+       "\n",
+       "       chest pain         dyspnea            doa  recent surgery  \\\n",
+       "0  False (0+  0-)  False (0+  1-)  True (1+  0-)  False (0+  1-)   \n",
+       "\n",
+       "   Total records                                          All Notes  \n",
+       "0              2  ---------\\n Note 0\\n---------\\npt has not had ...  "
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#keep only encounters where the symptom (\"prior pe\" below) was found at least once,\n",
+    "#i.e. drop those with the default False label and no mentions (0+  0-)\n",
+    "#include_ratio must be True in extract_symptoms for the counts used here to be present\n",
+    "\n",
+    "pe_df = combined_extract_df[~combined_extract_df['prior pe'].str.contains(\"0+  0-\",regex=False)]\n",
+    "pe_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "564eaa5b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def print_encounter_data(e_id, symptom_to_show=None):\n",
+    "    #symptom_to_show is currently unused\n",
+    "    print(\"Combined extract NLP: \\n\",combined_extract_df[combined_extract_df['e_id'] == e_id])\n",
+    "    \n",
+    "    print(\"------\\n Extracted symptoms from notes \\n--------\\n\")\n",
+    "    #only show the notes that belong to the requested encounter\n",
+    "    encounter_notes = uncombined_extract_df[uncombined_extract_df['e_id'] == e_id]\n",
+    "    for note, (row_id, row) in enumerate(encounter_notes.iterrows(), start=1):\n",
+    "        print(f\"NOTE {note}\\n Identified symptoms and if negated: \\n\",repr(row.extracted_symps_),'\\n',)\n",
+    "        spacy.displacy.render(row.doc,style='ent',jupyter=True)\n",
+    "        "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "9df0ad20",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    111\n",
+       "Name: e_id, dtype: int64"
+      ]
+     },
+     "execution_count": 22,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#get the last 10 encounters with any mention of prior pe\n",
+    "pe_df['e_id'].tail(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "6d2f74d0",
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Combined extract NLP: \n",
+      "    e_id    hypertension chronic heart failure          cancer       prior pe  \\\n",
+      "0   111  False (0+  0-)         True (1+  0-)  False (0+  0-)  True (2+  0-)   \n",
+      "\n",
+      "       chest pain         dyspnea            doa  recent surgery  \\\n",
+      "0  False (0+  0-)  False (0+  1-)  True (1+  0-)  False (0+  1-)   \n",
+      "\n",
+      "   Total records                                          All Notes  \n",
+      "0              2  ---------\\n Note 0\\n---------\\npt has not had ...  \n",
+      "------\n",
+      " Extracted symptoms from notes \n",
+      "--------\n",
+      "\n",
+      "NOTE 1\n",
+      " Identified symptoms and if negated: \n",
+      " 'dyspnea (True), chronic heart failure (False), prior pe (False), ' \n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">72 h onset of palpitation s , worse when walking small distances,no \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    sob\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       "  has extensive.  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pmh\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       "  copd , \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    heart failure\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", tb , \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pe\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " this ear.  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pmh\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       "  ex ivdu , hep c mrsa +ve  on crew arrival cyanosed lips  has recent had a lrti ==&gt; finished abs yesterday   plan: bloods + crp + vbg + cxray + ecg likely for ivabs</div></span>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "NOTE 2\n",
+      " Identified symptoms and if negated: \n",
+      " 'recent surgery (True), prior pe (False), doa (False), ' \n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">pt has not had \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    recent surgery\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " / immobilisation or travel.  pt does have \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    history of\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       " \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pe\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " earlier this year, and has was prescribed \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    apixaban\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ".  </div></span>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "NOTE 3\n",
+      " Identified symptoms and if negated: \n",
+      " 'dyspnea (False), dyspnea (True), dyspnea (True), hypertension (False), ' \n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">pre-arrival summary  name:  doe, john   current date:  03/`jul/2018 09:23:46 bst gender:  male date of birth:  12/mar/39 age:  78 years pre-arrival type:   eta:  03/`jul/2018 09:48:00 bst primary care physician:   presenting problem:   pre-arrival user:  referring source:    handover taken from xxx pt has been having \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    sob\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " on exertion since long time but worse in the last 2/7.  pt lives alone and has got carers 3 times /week.  o/e  alert, not distress, able to speak in full sentences, no \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    sob\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", no \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    dib\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ".  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pmh\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       "  pacemaker, \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    htn\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", high cholesterol, breathing problem, leg ulcers under the district nurse rr 17, sat 100%, p 86 reg, bp 103/81, bm 8. 8, temp 37, gcs 15/15</div></span>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "NOTE 4\n",
+      " Identified symptoms and if negated: \n",
+      " 'chest pain (True), dyspnea (True), hypertension (False), ' \n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">pc: no \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    chest pains\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    sob\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ".  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pmh\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       "  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    htn\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", high cholesterol.  need to r/o pe</div></span>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "NOTE 5\n",
+      " Identified symptoms and if negated: \n",
+      " 'cancer (False), cancer (True), chronic heart failure (True), dyspnea (False), cancer (False), hypertension (False), ' \n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">patient known kidney \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    cancer\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " on no \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    chemo\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " as known \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    heart failure\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ".  patient been having increasing \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    shortness of breath\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       " on rest and more at night.  not able to sleep at night due to gasping for breath.  not noticed more swelling than usual to legs.  having lower right side abdo pain.  patient been losing weight.  known to suffer from blood clots.  not been on blood thinners since august.  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    pmh\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PMH</span>\n",
+       "</mark>\n",
+       "  \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    kidney ca\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ".  blood clots, sleep apnoea, \n",
+       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
+       "    htn\n",
+       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">SYM</span>\n",
+       "</mark>\n",
+       ", disc problems.  medications tinzaparin, mst, atirvastain, oromorph.  allergies: nil</div></span>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "\n",
+    "print_encounter_data(111)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "d96db982",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "' patient been losing weight.  known to suffer from blood clots.  not been on blood thinners since august.  ' Encounter ID:  333\n",
+      "'clots.  not been on blood thinners since august.  pmh  kidney ca.  blood clo' Encounter ID:  333\n",
+      "'thinners since august.  pmh  kidney ca.  blood clots, sleep apnoea, htn, disc problems.  medicati' Encounter ID:  333\n"
+     ]
+    }
+   ],
+   "source": [
+    "#Search for occurrences of \"blood\" in the text and print each match with up to 50 characters of context either side\n",
+    "search_string = r\"\\Wblood\\W\"  #raw string so that \\W reaches the regex engine unescaped\n",
+    "#search_string = r'\\Wdvt\\W|deep vein thrombosis'\n",
+    "matching_notes = uncombined_extract_df[uncombined_extract_df['text'].str.contains(search_string, regex=True)]\n",
+    "for row_id, row in matching_notes.iterrows():\n",
+    "    e_id = row.e_id\n",
+    "    medical_note = row.text\n",
+    "    match = re.search(search_string, medical_note)\n",
+    "    while match:\n",
+    "        start_inx, end_inx = match.span()\n",
+    "        #Clamp the context window to the bounds of the (remaining) note\n",
+    "        start = max(start_inx - 50, 0)\n",
+    "        end = min(end_inx + 50, len(medical_note))\n",
+    "        print(repr(medical_note[start:end]), \"Encounter ID: \", e_id)\n",
+    "\n",
+    "        #Continue searching in the text after this match\n",
+    "        medical_note = medical_note[end_inx:]\n",
+    "        match = re.search(search_string, medical_note)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d684f91b",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python (wells)",
+   "language": "python",
+   "name": "wells_env"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}