# Extracting patient symptoms from medical notes using natural language processing (NLP)

This notebook presents a method of extracting a pre-defined set of patient symptoms from a database of freetext medical notes (a.k.a. blob text). This example uses spaCy as its NLP framework of choice and relies on rule-based matching and negation detection (rather than any machine learning models).

The original dataset that was used is not included here as it contains personally identifiable information. However, its properties will still be referred to throughout to justify the need for certain pre-processing and steps in the NLP pipeline. 

If you have any comments or questions please contact the author, Will Mower at william.mower@nhs.net.

Thanks!

### Import modules used throughout the notebook

In [1]:
#Standard data processing tools in python
import pandas as pd 
pd.options.mode.chained_assignment = None
import numpy as np

#NLP framework of choice
import spacy 
from spacy.matcher import Matcher
from spacy.tokens import Span

#Negation component used within the spaCy pipeline
from negspacy.negation import Negex
from negspacy.termsets import termset

#Regular expressions
import re

#Used to extract data from XML and HTML formats
from lxml import etree,html
from bs4 import BeautifulSoup

#Misc. python modules
import time
from datetime import datetime
import os 
import itertools

#used to save pandas dataframe 
import pickle

#for extract comparison metrics
from sklearn.metrics import confusion_matrix


## Import the data files
Two example files have been provided in the GitHub repository. The first, the blob text file, mimics the format of the file that was originally provided for this project containing medical notes that were extracted from an EHR database. 

The second file is an example of the desired medical concepts to be extracted and the various ways that they are typically written in medical notes. Columns can be added to this file for any other concepts that should be extracted from the text.

### Freetext file

This file contains a row for each individual medical note stored at various points throughout a patient's visit to hospital. Each visit has a unique encounter ID, which is also included in the data set to allow notes to be grouped by encounter. The original file was already filtered to the medical records for certain encounter IDs and between certain dates, so none of this pre-processing is included here. 


In [2]:
#import freetext notes with encounter ids
#could be imported as excel, .csv or other
org_freetext_df = pd.read_csv("sample_data/sample_blob_text.csv")
org_freetext_df.head()

Unnamed: 0,ENCNTR_ID,BLOB_CONTENTS
0,111,"72 h onset of palpitation s , worse when walki..."
1,111,Pt has not had recent surgery / immobilisation...
2,111,Pt has not had recent surgery / immobilisation...
3,222,"Pre-Arrival Summary Name: Doe, John Curren..."
4,222,"PC: no chest pains, SOB Pmh: htn, high cholest..."


### "Symptoms" list file

This file contains all of the symptoms, conditions and any other terms that will be searched for in the freetext. Each symptom type is a column header and any alternative forms expressing it are listed in the column.

> *The terms that are to be extracted from the freetext will be referred to throughout as **symptoms** for simplicity even though many of them are not symptoms but instead conditions etc.*

In [3]:
freetext_terms_df = pd.read_csv("sample_data/sample_symptom_names.csv")
freetext_terms_df.head()

Unnamed: 0,Hypertension,Chronic heart failure,Cancer,Prior pe,PE,Chest Pain,Dyspnea,DOA,recent surgery
0,HTN,heart failure,malignancy,Prior/previous_PE/pes/DVT/dvts/pulmonary embol...,DVT,CP,DIB,DOAC,recent immobilisation
1,high BP,CHF,lung ca,,deep vein thrombosis,C/P,SOB,Apixaban,recent surg
2,>BP,CCF,chemo,,pulmonary embolism,angina,short of breath,dabigatran,
3,high blood pressure,LVF,chemotherapy,MAKE_PHRASE_COMBINATIONS,dvts,chest pains,shortness of breath,edoxaban,
4,,,carcinoma,,pes,,breathless,rivaroxaban,


## Data cleaning and pre-processing 
### Freetext file
Keep only the encounter ID (ENCNTR_ID) and medical note freetext (BLOB_CONTENTS) columns, renaming both for ease of use.

In [4]:
freetext_df = org_freetext_df[["ENCNTR_ID","BLOB_CONTENTS"]].rename(columns = {"ENCNTR_ID":"e_id","BLOB_CONTENTS":'text'})
freetext_df.head()

Unnamed: 0,e_id,text
0,111,"72 h onset of palpitation s , worse when walki..."
1,111,Pt has not had recent surgery / immobilisation...
2,111,Pt has not had recent surgery / immobilisation...
3,222,"Pre-Arrival Summary Name: Doe, John Curren..."
4,222,"PC: no chest pains, SOB Pmh: htn, high cholest..."


In the original freetext table there were 736 unique encounter IDs and ~7600 rows of medical notes. As such, there are on average 10 text entries in the dataframe for each encounter. Information from multiple entries for the same encounter will need to be combined later on. *(Here there are 3 unique encounters and on average 2 entries per encounter)*

Additionally, there were slightly fewer unique text entries than the number of rows, meaning that there are some duplicate rows. After removing duplicate rows (considering both encounter IDs and the text), most duplicates are removed. Some duplicate text remains, meaning that there are some duplicate text entries across different encounters (likely default entries); these will not be removed. 

In [5]:
print("The number of rows in the sample dataframe: ",freetext_df.shape[0])
print("\nThe number of unique values per column:\n", freetext_df.nunique())

freetext_df.drop_duplicates(inplace=True)
print(f"\nAfter dropping duplicates: {freetext_df.shape[0]} rows")

The number of rows in the sample dataframe:  6

The number of unique values per column:
 e_id    3
text    5
dtype: int64

After dropping duplicates: 5 rows


### Symptoms dictionary

The symptom list dataframe is cleaned up (e.g. NaN values are removed). Each symptom is converted to a dictionary entry using the column name as the key and the set of all the different variations of the symptom as the value. All symptom entries are also converted to lower case.

Two custom keywords are also included in certain columns: "make_phrase_combinations" and "secondary_matcher". These signify that the entries in the column should be treated differently depending on the keyword. Their uses are explained in the sections that follow.
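
As a rough sketch of what the "make_phrase_combinations" keyword asks for, the snippet below expands one compact entry into every phrase it stands for. It assumes, as in the "Prior pe" column of the sample file, that `_` separates word positions and `/` separates the alternatives at each position. `expand_phrase_combinations` is a hypothetical helper name used only for this illustration.

```python
import itertools

def expand_phrase_combinations(entry):
    # hypothetical helper: "_" splits word positions, "/" splits alternatives
    positions = [
        [option.lower() for option in position.split("/") if option]
        for position in entry.split("_")
    ]
    # one phrase per combination of alternatives, joined with spaces
    return [" ".join(combo) for combo in itertools.product(*positions)]

print(expand_phrase_combinations("Prior/previous_PE/DVT"))
# ['prior pe', 'prior dvt', 'previous pe', 'previous dvt']
```

Writing the entry in this compact form avoids having to list every combination by hand in the symptom file.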


In [6]:
#clean the column names and add each column name as an additional entry in its own column
freetext_terms_df.columns = [col.replace("\n","").lower() for col in freetext_terms_df.columns]
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
freetext_terms_df = pd.concat([freetext_terms_df, pd.DataFrame([freetext_terms_df.columns], columns = freetext_terms_df.columns)], ignore_index = True)


symptom_dict = {}
for col in freetext_terms_df.columns:
    symptom_dict[col.lower()] = set(freetext_terms_df[col].dropna().str.lower().values)
    
#dictionary entry for hypertension symptom 
symptom_dict["hypertension"]

{'>bp', 'high blood pressure', 'high bp', 'htn', 'hypertension'}

## Symptoms to pattern matcher format

For NLP tasks, a series of different processes, each with a particular function, are joined together to form the whole NLP model that can be used to extract information from the input block of text. Each process is known as a component and, together, the components form an NLP pipeline. spaCy's default NLP pipeline includes a variety of different components that label and modify the words. You can read more about these components and their purposes here: https://spacy.io/usage/processing-pipelines. 

We will be creating a simple NLP pipeline built from only three kinds of component: one for sentence identification, one that identifies the patterns we have pre-defined (symptoms and their variations) and another that checks for negation. We will use the default spaCy sentencizer, which splits sentences based on a set of default punctuation characters (see https://spacy.io/api/sentencizer for details).

### EntityRuler

The second component will be spaCy's EntityRuler component (https://spacy.io/api/entityruler), which is used for rule-based entity recognition. We will use this to find any instances of the symptoms and their variations in the text and label them as entities.

The EntityRuler component uses the PhraseMatcher object to match any phrases that are passed through it and, as such, the patterns representing words and phrases that we want to label must be entered in the accepted "Pattern" format. 

### "Pattern" format 
The pattern format requires each word in a phrase to be identified individually along with the matching method used to search for matches. The "lower" matching tag is used here to match the symptom string to the lower case version of a word in the text. In order for the lower case freetext to match with the phrase pattern defined, the phrase pattern must also be in lower case, hence the use of the .lower() method. 

The "orth" matching tag is used to match the exact string supplied in the pattern and here is used for any non-alphanumeric characters (such as > in >BP). The "OP" pattern parameter can be used to mark a token as optional for a pattern to match, using the "?" value.

*Phrase patterns mentioned here are technically Token patterns, as true Phrase patterns match exact, case-sensitive strings - find more details here: https://spacy.io/usage/rule-based-matching#entityruler*

The phrase pattern also includes an entity label that will be attributed to any words in the text that are found to match the pattern. In this case, a new entity "SYM" (for symptom) will be assigned, rather than a pre-defined spaCy entity. Finally, the pattern is given an ID that can be used to group patterns that refer to the same overarching symptom i.e. "HTN" and ">BP" would both have the id "hypertension".
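
To make the format concrete, here is a minimal sketch of how a single phrase could be turned into one pattern dictionary. It only builds the dictionary and does not call spaCy. `phrase_to_pattern` is a hypothetical helper name for this illustration, and the split into pieces is an assumption about how the text will be tokenised.

```python
import re

def phrase_to_pattern(phrase, symptom_id, label="SYM"):
    # split on whitespace, keeping any non-alphanumeric character as its own piece
    pieces = [p.lower() for p in re.split(r"\s|([^a-zA-Z0-9])", phrase) if p]
    # alphanumeric pieces are matched case-insensitively, symbols are matched exactly
    token_patterns = [
        {"lower": p} if p.isalnum() else {"orth": p}
        for p in pieces
    ]
    return {"label": label, "pattern": token_patterns, "id": symptom_id}

print(phrase_to_pattern(">BP", "hypertension"))
# {'label': 'SYM', 'pattern': [{'orth': '>'}, {'lower': 'bp'}], 'id': 'hypertension'}
```

Both ">BP" and "high blood pressure" would share the id "hypertension", so matches can later be grouped by symptom regardless of how they were written.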

Certain custom patterns are also defined and manually added to the pattern dictionary for more complex patterns containing optional tokens (shown by "OP":"?" in token dictionary).

The below function outputs a list containing pattern dictionaries for each symptom variation. This will be used in the main extraction function later on.

### Other custom entities

Two other pattern generator functions are listed below; one for "previous medical history" entities and one for "family" entities. These will be used in combination with symptom entities for more refined symptom identification following the NLP extraction.

In [7]:
def create_symptom_patterns(symp_dict):
    #uses the dictionary keys as the ids for each pattern
    
    symp_patterns = []
    secondary_symp_patterns = []
    for symptom_name,symp_phrase_options_object in symp_dict.items():
        
        symp_phrase_options = list(symp_phrase_options_object)
        
        #deal with phrase combinations separately to avoid having to write them all out
        #data column must contain "make_phrase_combinations" string for this
        #see "prior pe" column for example
        
        if "make_phrase_combinations" in symp_phrase_options:
            
            #the compact entry is the one containing "/" e.g. "prior/previous_pe/dvt"
            phrase = [i for i in symp_phrase_options if "/" in i][0]
            
            #"_" separates word positions and "/" separates the alternatives for each position
            phrase_combos_list = []
            for component_variations in phrase.split('_'):
                component_variation_list = [i.lower() for i in component_variations.split('/') if i not in ["",None]]
                phrase_combos_list.append(component_variation_list)
        
            #join every combination of alternatives into a single phrase string
            symp_phrase_options = [' '.join(combo) for combo in itertools.product(*phrase_combos_list)]
        
        #used to include any patterns generated from a column with an entry "secondary_matcher"
        #into a phrase matcher that comes after the primary one used for all the other patterns
        
        #needed because the phrase "prior pe" in the text would match with "pe" instead of "prior pe"
        #which means it is dealt with as if there is no mention of "prior/previous" in find_pmh_entities
        if "secondary_matcher" in symp_phrase_options:
            secondary_patterns = True
            symp_phrase_options.remove("secondary_matcher")
            
        else:
            secondary_patterns = False
            
          
        for phrase in symp_phrase_options:
               
            #split on space or non-alphanum character
            #capture non-space, non-alphanum characters (i.e. symbols)
            split_phrase = re.split(r'\s|([^a-zA-Z0-9])',phrase)
            split_phrase = [i.lower() for i in split_phrase if i not in ["",None]]
            
            new_pattern = []
            #create phrase pattern that copes with alphanumeric and symbol token patterns
            for i in split_phrase:
                if i.isalnum():
                    #match lower case alphanum sequence 
                    new_pattern.append({"lower":i})
                else:
                    #match exact symbols
                    new_pattern.append({"orth":i})
                    
            if not secondary_patterns:
                symp_patterns.append({"label":"SYM","pattern":new_pattern,"id":symptom_name})
            
            else:
                secondary_symp_patterns.append({"label":"SYM","pattern":new_pattern,"id":symptom_name})
                
    #manual addition of more complex patterns including optional parts 
    #(could be generated using a conditional like above but this was a simpler, short-term option)
    #any symptom ids used here must be the same as a column header in the symptom file
    #added once, after the loop, so that the pattern is not duplicated for every symptom
    symp_patterns.append({"label":"SYM","pattern":[
        {"lower":"recent"},
        #any word
        {"IS_ALPHA":True,"OP":"?"},
        {"lower":{"IN":["surg","surgery"]}}
    ],"id":"recent surgery"})
        
 
    return symp_patterns, secondary_symp_patterns

#create patterns for previous medical history entities
def create_pmh_patterns():
    
    pmh_list = ['pmh','pmx', 'previous medical history','previous hx', 'previous mh',\
               'hx of','hs of', 'bg','phx', 'history of','pmhx']
    
    pmh_patterns = []
    for phrase in pmh_list:
        #split on space or non-alphanum character
        #capture non-space, non-alphanum characters (i.e. symbols)
        split_phrase = re.split(r'\s|([^a-zA-Z0-9])',phrase)
        split_phrase = [i.lower() for i in split_phrase if i not in ["",None]]

        new_pattern = []
        #create phrase pattern that copes with alphanumeric and symbol token patterns
        for i in split_phrase:
            if i.isalnum():
                #match lower case alphanum sequence 
                new_pattern.append({"lower":i})
            else:
                #match exact symbols
                new_pattern.append({"orth":i})
                
        pmh_patterns.append({"label":"PMH","pattern":new_pattern,"id":"pmh"})
    
    return pmh_patterns

#create patterns for family history / family entities
def create_family_patterns():
    
    family_hx_list = ['fh','fhx', 'f hx', ]
    family_list = ['family','fam',]
    
    family_patterns = []
    for phrase in family_hx_list:
        #split on space or non-alphanum character
        #capture non-space, non-alphanum characters (i.e. symbols)
        split_phrase = re.split(r'\s|([^a-zA-Z0-9])',phrase)
        split_phrase = [i.lower() for i in split_phrase if i not in ["",None]]

        new_pattern = []
        #create phrase pattern that copes with alphanumeric and symbol token patterns
        for i in split_phrase:
            if i.isalnum():
                #match lower case alphanum sequence 
                new_pattern.append({"lower":i})
            else:
                #match exact symbols
                new_pattern.append({"orth":i})
                
        family_patterns.append({"label":"FHX","pattern":new_pattern,"id":"family_hx"})
        
    for phrase in family_list:
        #split on space or non-alphanum character
        #capture non-space, non-alphanum characters (i.e. symbols)
        split_phrase = re.split(r'\s|([^a-zA-Z0-9])',phrase)
        split_phrase = [i.lower() for i in split_phrase if i not in ["",None]]

        new_pattern = []
        #create phrase pattern that copes with alphanumeric and symbol token patterns
        for i in split_phrase:
            if i.isalnum():
                #match lower case alphanum sequence 
                new_pattern.append({"lower":i})
            else:
                #match exact symbols
                new_pattern.append({"orth":i})
                
        family_patterns.append({"label":"FAM","pattern":new_pattern,"id":"family"})
        
    return family_patterns

#functions to create all entity patterns
def create_ent_patterns(symptoms):
    symptom_patterns, secondary_symptom_patterns = create_symptom_patterns(symptoms)
    pmh_patterns = create_pmh_patterns()
    family_patterns = create_family_patterns()

    all_ent_patterns = symptom_patterns + pmh_patterns + family_patterns
    
    return all_ent_patterns, secondary_symptom_patterns

all_patterns, secondary_symptom_patterns = create_ent_patterns(symptom_dict)

print("Sample of pattern dictionaries for the prior pe symptom:\n")
for i in all_patterns:
    if i['id'] == 'prior pe':
        print(i)


Sample of pattern dictionaries for the prior pe symptom:

{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pe'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pes'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'dvt'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'dvts'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'pulmonary'}, {'lower': 'embolism'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'prior'}, {'lower': 'deep'}, {'lower': 'vein'}, {'lower': 'thrombosis'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'pe'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'pes'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'dvt'}], 'id': 'prior pe'}
{'label': 'SYM', 'pattern': [{'lower': 'previous'}, {'lower': 'dvts'}], 'id': 'prior p

In [8]:
secondary_symptom_patterns

[{'label': 'SYM',
  'pattern': [{'lower': 'pulmonary'}, {'lower': 'embolism'}],
  'id': 'pe'},
 {'label': 'SYM',
  'pattern': [{'lower': 'deep'}, {'lower': 'vein'}, {'lower': 'thrombosis'}],
  'id': 'pe'},
 {'label': 'SYM', 'pattern': [{'lower': 'dvts'}], 'id': 'pe'},
 {'label': 'SYM', 'pattern': [{'lower': 'dvt'}], 'id': 'pe'},
 {'label': 'SYM', 'pattern': [{'lower': 'pe'}], 'id': 'pe'},
 {'label': 'SYM', 'pattern': [{'lower': 'pes'}], 'id': 'pe'}]

## Freetext preprocessing

The first function defined below runs within the extraction process and can be modified to include any pre-processing deemed necessary. It converts the freetext to lower case and extracts all of the useful text data from the XML string data using the second and third functions in the cell (the original dataset contained XML strings). It also adds a full-stop before a variety of manually specified phrases that usually indicate the start of a new section/sentence in the text. This stops a negation in one section from carrying over into the next and so helps with the correct identification of negated phrases. 

A function to remove any entries that are sub-sets of other entries is also defined here.

> N.B. the preprocessing is defined separately from the NLP pipeline here and will modify the text before it enters the pipeline. However, it could also be added to the pipeline as a custom component.
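
The effect of the full-stop insertion can be sketched on a single string with the standard `re` module. This is a simplified illustration rather than the notebook's function: it works on one string instead of a pandas Series, uses only two of the section terms, and `add_sentence_breaks` is a hypothetical name.

```python
import re

SECTION_TERMS = ["pmh", "meds"]  # small illustrative subset of the section terms

def add_sentence_breaks(text, terms=SECTION_TERMS):
    text = text.lower()
    # put a full-stop in front of each section term
    for term in terms:
        text = re.sub(rf"\W{re.escape(term)}\W", f". {term} ", text)
    # treat line breaks as sentence breaks
    text = re.sub(r"\n\W*\n?", ". ", text)
    # collapse runs of full-stops into a single one
    text = re.sub(r"\W*\.(\W*\.\W*)*", ". ", text)
    return text

note = "No chest pain pmh HTN\n\nmeds: aspirin"
sentences = [" ".join(part.split()) for part in add_sentence_breaks(note).split(".")]
print(sentences)
# ['no chest pain', 'pmh htn', 'meds aspirin']
```

Without the inserted full-stops, "no" in "no chest pain" would sit in the same sentence as "htn", so a sentence-scoped negation component could wrongly mark the hypertension mention as negated.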

In [9]:
def freetext_preprocessing(blob_df):
    
    freetext = blob_df['text']
    
    freetext = freetext.apply(clean_blob_xml)
     
    #convert all text to lowercase
    freetext = freetext.str.lower()
    
    #add full-stop before these words to stop ongoing negation
    pmh_terms = ["pmh","pmx",'pmhx', 'hpc', 'meds','dx','dh','oe:','o/e','author:',
                'past medical history:','social history:','dhx','shx','pc','sh','medications']
    for term in pmh_terms:
        freetext.replace(to_replace = r'\W{term}\W'.format(term = term), value = f'. {term} ', regex=True, inplace=True)
     
    #replace each newline (plus any non-word characters and an optional second newline
    #that follow it) with a full-stop so that a new line/section starts a new sentence
    freetext.replace(to_replace = r'\n\W*\n?', value = '. ', regex=True, inplace=True)

    #remove multiple fullstops in a row
    freetext.replace(to_replace = r'\W*\.(\W*\.\W*)*', value = '. ', regex=True, inplace=True)
    
    blob_df['text'] = freetext
    
    blob_df = blob_df[['text','e_id']].groupby(by='e_id').apply(remove_sub_strings)
    
    #drop e_id index 
    blob_df = blob_df.droplevel("e_id")
    
    return blob_df


#functions to extract text from xml strings 
def clean_blob_xml(freetext):
    if freetext[0:5] == "<?xml":
        return xml_to_string(freetext)
    else:
        return freetext
    

def xml_to_string(xml_string):
    #convert xml string to html object with lxml
    root = html.fromstring(str.encode(xml_string))
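    #convert html from lxml object to bs4 html object 
    #the parser is named explicitly to avoid bs4's "no parser was explicitly specified" warning
    soup = BeautifulSoup(html.tostring(root), "lxml")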
    #extract text from html keeping newlines
    html_string = soup.get_text('\n')
    
    
    return html_string

#remove entries that are sub-sets of others for same encounter
def remove_sub_strings(df):
        
    #sort rows by length of string - longer strings can't be sub strings of shorter ones
    df = df.sort_values(by='text',key=lambda x: x.str.len())
    sorted_text = df['text'].values
    not_sub_string = []
    for idx,text in enumerate(sorted_text):
        if sum([text in i for i in sorted_text[idx+1:]]) == 0:
            not_sub_string.append(True)

        else:
            not_sub_string.append(False)

    #drop rows that are substrings of others        
    df.loc[:, 'not_sub_string'] = not_sub_string
    df = df[df['not_sub_string']].drop(['not_sub_string'], axis=1)

    return df
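
The idea behind `remove_sub_strings` can be shown on a plain list of strings. The sketch below assumes the notes for one encounter are already collected in a list; `drop_contained_notes` is a hypothetical name used only here.

```python
def drop_contained_notes(notes):
    # shortest first: a note can only be contained in a note of equal or greater length
    ordered = sorted(notes, key=len)
    kept = []
    for idx, note in enumerate(ordered):
        # keep the note only if no later (longer or equal) note contains it
        if not any(note in longer for longer in ordered[idx + 1:]):
            kept.append(note)
    return kept

notes = ["no sob", "pt has no sob pmh htn", "chest pain"]
print(drop_contained_notes(notes))
# ['chest pain', 'pt has no sob pmh htn']
```

Dropping contained notes avoids extracting the same mention twice when a short entry was later copied into a longer one for the same encounter.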

## Negation identification

The negspacy module, which contains a negation spaCy component, is used to identify whether any identified symptoms are negated in the text, e.g. for "patient has SOB but *not* hypertension", hypertension is negated but SOB isn't. ref: https://github.com/jenojp/negspacy

The algorithm is based on the NegEx algorithm (https://doi.org/10.1006/jbin.2001.1029) and uses a list of negation patterns to label specific entities within the text.

The negation pattern types are:
 - **pseudo_negations** - phrases that are false triggers, ambiguous negations, or double negatives
 - **preceding_negations** - negation phrases that precede an entity
 - **following_negations** - negation phrases that follow an entity
 - **termination** - phrases that cut a sentence into parts, for purposes of negation detection (e.g., "but")

If either preceding or following negations are found in the text, any entity after or before the negation respectively will be classed as negated. Termination patterns stop any negation passing through them (e.g. in "doesn't have HT but has CHF" CHF would not be negated). Pseudo-negations are removed retrospectively if initially picked up by preceding or following negations (i.e. "not necessarily HT" would initially be negated due to "not" but reverted as "not necessarily" is a pseudo-negation).
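
The scoping rules above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not negspacy's implementation: it assumes single-token terms, ignores pseudo-negations, and the term lists and the `find_negated` name are made up for the example.

```python
def find_negated(tokens, entity_terms, preceding, following, termination):
    negated = {}

    # forward pass: a preceding negation affects entities until a termination term
    active = False
    for tok in tokens:
        if tok in termination:
            active = False
        elif tok in preceding:
            active = True
        elif tok in entity_terms:
            negated[tok] = active

    # backward pass: a following negation affects entities before it, up to a termination term
    active = False
    for tok in reversed(tokens):
        if tok in termination:
            active = False
        elif tok in following:
            active = True
        elif tok in entity_terms:
            negated[tok] = negated[tok] or active

    return negated

rules = dict(
    entity_terms={"sob", "hypertension", "pe"},
    preceding={"no", "not"},
    following={"unlikely"},
    termination={"but", "however"},
)
print(find_negated("patient has sob but not hypertension".split(), **rules))
# {'sob': False, 'hypertension': True}
print(find_negated("pe unlikely however sob present".split(), **rules))
# {'pe': True, 'sob': False}
```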

negspacy was initially developed for clinical data and as such, its default term set is designed for clinical use.

We can edit the termsets by adding or removing patterns to fit our use case. Here, "nil" has been added as both a preceding and a following negation, and "pmh" and "pmx" have been included as termination terms as they usually indicate the start of a new section. The full-stop added before section terms during preprocessing serves the same purpose as these termination terms, but is easier to interpret.

See below for examples of patterns from each category in the en_clinical termset:

In [10]:
def create_negation_termset():
    ts = termset("en_clinical")
    
    ts.add_patterns({
        "preceding_negations":['nil'],
        "following_negations":['nil'],
        "termination": ["pmh",'pmx', ], 
    })
    
    return ts
    
tas = create_negation_termset()    
    
for key, items in tas.get_patterns().items():
    print(key,":")
    print(items[:10],'\n')

pseudo_negations :
['no further', 'not able to be', 'not certain if', 'not certain whether', 'not necessarily', 'without any further', 'without difficulty', 'without further', 'might not', 'not only'] 

preceding_negations :
['without indication of', 'fails to reveal', 'rule out', 'never', 'denied', 'no signs of', 'couldnt', 'nil', 'not demonstrate', 'negative for'] 

following_negations :
['free', 'was not', 'unlikely', 'were not', 'were ruled out', "wasn't", "weren't", 'was ruled out', 'nil', 'werent'] 

termination :
['still', 'trigger event for', 'however', 'aside from', 'as there are', 'etiology for', 'other possibilities of', 'except', 'origin for', 'source for'] 



## Generating the NLP pipeline

The below function brings together the components introduced above to create the NLP pipeline that will be used to extract the symptoms. 

As only a simple NLP pipeline will be used here, we start from a blank pipeline object. The default sentence segmentation component is added to identify sentences, which is necessary for the negation component. The default spaCy pipeline could be used instead and would provide a lot more information about the structure and content of the input text. For an initial, rule-based NLP model, these non-essential components can be omitted for the sake of simplicity and performance. 

### Entity recogniser
Then, the main rule-based entity recogniser is added, using the symptom, PMH and family patterns created by the pattern generator functions above. This will identify any occurrences of the symptoms or their variations in the text and label them as "SYM" entities. Similarly, PMH, FAM and FHX entities are matched and labelled.

### Secondary entity recogniser
A second rule-based entity recogniser is used for overlapping patterns with a lower priority than those included in the first matcher. For example, for the dataset originally used, we were interested in whether a patient had previously had a pulmonary embolism (PE) or deep vein thrombosis (DVT). Mentions of *prior* pe/dvt can automatically be extracted. However, plain mentions of pe/dvt should only be included in the conclusions about the patient if they are mentioned as having been previously experienced by the patient. As such, if they follow a "previous medical history" entity then they can be considered past instances of the concept. 

*Other phrases that do not follow a PMH entity but otherwise imply a previous condition, e.g. "patient has previously had pe", are not extracted with this rule-based method as there are too many possible variations. Luckily, most notes are written in the same format and most instances of prior pe are covered by the patterns generated previously.*

Therefore, we want "prior pe" to be labelled instead of just "pe", which is what would happen if both the "prior pe" and "pe" patterns were in the same matcher. spaCy prioritises the entities labelled earlier in the pipeline, so we can use a secondary matcher to include patterns that may overlap with patterns of higher priority. As we will be checking whether "pe" entities are preceded by a "pmh" entity, we do not want to lose any "prior pe" mentions that would otherwise have matched the shorter "pe" pattern. 
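
The priority idea can be illustrated with plain token spans. The sketch below is a simplified stand-in for what happens when a later component is not allowed to overwrite existing entities. It is not spaCy's implementation, and `add_lower_priority` is a hypothetical name.

```python
def add_lower_priority(primary, secondary):
    # spans are (start, end, ent_id) token offsets with an exclusive end
    taken = {i for start, end, _ in primary for i in range(start, end)}
    kept = list(primary)
    for start, end, ent_id in secondary:
        # a lower-priority span is only added if none of its tokens are already labelled
        if not taken.intersection(range(start, end)):
            kept.append((start, end, ent_id))
            taken.update(range(start, end))
    return sorted(kept)

# tokens: "prior pe and dvt" -> prior(0) pe(1) and(2) dvt(3)
primary = [(0, 2, "prior pe")]
secondary = [(1, 2, "pe"), (3, 4, "pe")]
print(add_lower_priority(primary, secondary))
# [(0, 2, 'prior pe'), (3, 4, 'pe')]
```

The "pe" inside "prior pe" is left alone because its token already belongs to a higher-priority entity, while the stand-alone "dvt" is still labelled by the secondary patterns.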

### Negation
Finally, the negation component is added, including a few custom negations that are added after having been manually identified as negators in freetext examples from this dataset.

The resulting NLP model, known in spaCy as a *Language* object, can now be used to extract symptoms from any sentence fed into it.

In [11]:
#initialise nlp model using symptom patterns and custom negation
def make_nlp_model(symptoms):
    #input: takes dictionary of symptoms and their variations in the format created above
    # as the symptom_dict variable
    
    #load blank spaCy nlp pipeline
    nlp = spacy.blank("en")
    
    #add default sentence segmentation component
    nlp.add_pipe("sentencizer")
    
    #creates new rule-based entity recognition pipeline component
    #matches patterns using the LOWER matcher attribute (looks for strings matching when lower case)
    config = {
       "phrase_matcher_attr": "LOWER",
       "validate": False,
       "overwrite_ents": False,
       "ent_id_sep": "||",
    }

    #add all symptom phrases as patterns to rule-based entity recog.
    patterns, secondary_patterns = create_ent_patterns(symptoms)
    
    ruler = nlp.add_pipe('entity_ruler',"phrase_matcher_1",config = config)
    ruler.add_patterns(patterns)
    
    #secondary matcher for lower priority matches 
    secondary_ruler = nlp.add_pipe("entity_ruler","phrase_matcher_2",config = config)
    secondary_ruler.add_patterns(secondary_patterns)
    
    #add component to find negation of symptom entities
    #add custom negation patterns
    ts = create_negation_termset()
    
    nlp.add_pipe("negex", config={"ent_types":["SYM","PMH","FAM"],'neg_termset':ts.get_patterns()})
    
    return nlp

nlp_model = make_nlp_model(symptom_dict)


### spaCy "Language" class object

The language object that is returned by the above function, typically called *nlp* (here it's called *nlp_model*), contains all of the information about the NLP pipeline created as well as all of the vocabulary associated with it. 

#### Applying the pipeline
Applying the model to some text produces information about any symptom, pmh and family entities contained within it and whether they are negated or not. To do this, the language object is simply called with the desired text as an input variable. 

The resulting output is a *doc* object that contains all of this information for this text specifically. Below, we will extract any entities that have been identified as well as their corresponding negation. As the pipeline only includes the rule-based entity recognisers defined above, only the custom SYM, PMH, FAM and FHX entities will be returned.

In [12]:
text = "'pt has recently had surgery. no SOB pmx htn and chf '"

doc = nlp_model(text)
print("Extracted symptoms from: ", text,"\n")
print("%-10s %-25s %-6s" % ("Word", "Identified symptom","Symptom is negated\n"))
for i in doc.ents:
    print(f"{str(i):{10}} {str(i.ent_id_):{25}} {str(i._.negex):{6}}")
    

Extracted symptoms from:  'pt has recently had surgery. no SOB pmx htn and chf ' 

Word       Identified symptom        Symptom is negated

SOB        dyspnea                   True  
pmx        pmh                       False 
htn        hypertension              False 
chf        chronic heart failure     False 


## Identifying previous PE/DVT

The below function is used to check whether any occurrences of "pe" / "dvt" are preceded by a PMH entity, indicating that they are previous occurrences of pe/dvt and can be included along with other identified "prior pe" entities. If a family entity appears directly before a PMH entity, any phrases identified as PE are ignored until the next PMH entity. The same applies after a FHX (family history) entity on its own.
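
The rule can be sketched without spaCy by representing each entity in a sentence as a `(label, ent_id, start, end)` tuple. This is a simplified illustration under that assumption, and `keep_past_pe` is a hypothetical name. The notebook's own function, which works on spaCy spans, follows in the next cell.

```python
def keep_past_pe(entities):
    # sentences without a plain "pe" entity are left untouched
    if not any(ent_id == "pe" for _, ent_id, _, _ in entities):
        return list(entities)

    kept = []
    after_pmh = False       # a PMH entity has been seen
    family_context = False  # the history being described is the family's
    fam_end = None          # end offset of the last FAM entity
    for label, ent_id, start, end in entities:
        if label == "SYM":
            if family_context:
                continue  # symptoms in a family history are dropped
            if ent_id == "pe" and not after_pmh:
                continue  # plain pe/dvt only counts after a PMH entity
            kept.append((label, ent_id, start, end))
        elif label == "FAM":
            fam_end = end
        elif label == "PMH":
            # a PMH entity directly after a family word means "family history"
            family_context = fam_end is not None and start <= fam_end + 1
            after_pmh = not family_context
            kept.append((label, ent_id, start, end))
        elif label == "FHX":
            family_context, after_pmh = True, False
            kept.append((label, ent_id, start, end))
    return kept

# "pmh pe htn": the pe mention follows a PMH entity, so it is kept
print(keep_past_pe([("PMH", "pmh", 0, 1), ("SYM", "pe", 1, 2), ("SYM", "hypertension", 2, 3)]))
# "family hx of pe": the PMH entity follows a family word, so the pe mention is dropped
print(keep_past_pe([("FAM", "family", 0, 1), ("PMH", "pmh", 1, 3), ("SYM", "pe", 3, 4)]))
```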

In [13]:
#find instances of pe / dvt symptom and only keep those that occur after pmh entity

def find_pmh_entities(doc):
    new_doc_ents = []
    
    for sent in doc.sents:

        #entities found in this sentence and their symptom ids
        entities = list(sent.ents)
        ent_ids = [i.ent_id_ for i in entities]

        if "pe" in ent_ids:
            fam_pmh = False
            after_pmh = False
            fam_on = False

            #loop through entities 
            for entity in entities:
                label = entity.label_
                ent_id = entity.ent_id_

                #if entity is pe symptom include only if after pmh entity
                if label == "SYM":
                    if ent_id == "pe":
                        if after_pmh and not fam_pmh:
                            new_doc_ents.append(entity)

                        else:
                            continue

                    #for other SYM entities include only if not after fam pmh
                    else:
                        if not fam_pmh:
                            #new_doc_ents.append(make_span(doc,entity))
                            new_doc_ents.append(entity)

                        else:
                            continue

                #if family entity found
                elif label == "FAM":
                    fam_on = True
                    fam_end_loc = (entity.end)

                #if pmh entity found
                elif label == "PMH":
                    #if family entity in previous two words - removes the symptoms
                    if fam_on and entity.start <= fam_end_loc + 1:
                        fam_pmh = True
                        after_pmh = False
                    else:
                        fam_pmh = False
                        after_pmh = True
                        #include PMH entities
                    new_doc_ents.append(entity)

                elif label == "FHX":
                    fam_pmh = True
                    after_pmh = False
                    new_doc_ents.append(entity)

                else:
                    print("entity not dealt with ")

        else:
            new_doc_ents += sent.ents

    doc.set_ents(new_doc_ents)
    
    return doc
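
To see the filtering rules in isolation, the following is a minimal sketch that replays the same state machine over plain `(label, ent_id, start, end)` tuples instead of spaCy spans. The helper name `filter_pe_entities` and the tuple layout are assumptions made purely for illustration; they are not part of the notebook's pipeline.

```python
def filter_pe_entities(entities):
    """Apply the PMH/FAM/FHX rules to one sentence's entities.

    `entities` is a list of (label, ent_id, start, end) tuples, a stand-in
    for spaCy spans that is assumed here purely for illustration.
    """
    #sentences without a "pe" entity are passed through unchanged
    if not any(ent_id == "pe" for _, ent_id, _, _ in entities):
        return list(entities)

    kept = []
    fam_pmh = False    #True while inside a family-history context
    after_pmh = False  #True once a patient PMH entity has been seen
    fam_end = None     #token index at which the last FAM entity ended

    for entity in entities:
        label, ent_id, start, end = entity
        if label == "SYM":
            if ent_id == "pe":
                #keep pe only when it follows a patient PMH entity
                if after_pmh and not fam_pmh:
                    kept.append(entity)
            elif not fam_pmh:
                kept.append(entity)
        elif label == "FAM":
            fam_end = end
        elif label == "PMH":
            #a PMH entity right after a family entity starts family history
            if fam_end is not None and start <= fam_end + 1:
                fam_pmh, after_pmh = True, False
            else:
                fam_pmh, after_pmh = False, True
            kept.append(entity)
        elif label == "FHX":
            fam_pmh, after_pmh = True, False
            kept.append(entity)
    return kept


patient_history = [("PMH", "pmh", 0, 1), ("SYM", "pe", 1, 2), ("SYM", "hypertension", 2, 3)]
family_history = [("FAM", "family", 0, 1), ("PMH", "pmh", 1, 2), ("SYM", "pe", 2, 3)]

print(filter_pe_entities(patient_history))  #all three entities are kept
print(filter_pe_entities(family_history))   #only the PMH entity is kept
```

A "pe" mention with no preceding PMH entity is dropped entirely, which is what stops a suspected current PE from being counted as a prior PE.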



## Using NLP to extract symptoms 

The cells below bring together all of the cells executed so far. The NLP pipeline is built, and the freetext is pre-processed and then passed through the pipeline. Any instances of "PE" entities (not "prior pe") are checked for a preceding PMH entity, as defined by the function above, and any positive matches are included as "prior pe" matches. For each freetext row, the symptoms and whether or not they are negated are extracted. 

### Combining extracted symptoms 

The same symptom may be found several times in one text extract, and the matches may have differing negations. In addition, as mentioned previously, there were multiple freetext rows for most encounters in the original dataset; on average each encounter has ~10 freetext entries. As such, multiple labels for the same symptom must be reconciled once the labels have been found for each text entry. 

The algorithm used to decide a final symptom presence classification of True or False is a simple one and could be adapted for different datasets and to deal with different symptoms differently. It works as follows:
 - The symptom extracts are collated for each encounter ID
 - If a symptom gets no matches from any freetext, the boolean symptom presence label defaults to False
 - If there is one or more match for a symptom, whichever negation label is more common across the matched words decides the final presence label (e.g. if the negation labels are [True, False, True] for the cancer symptom, most matches are negated, so the final presence label will be False)
 - If there are an equal number of each negation label, the presence label defaults to True
 
This is performed by the *combine_e_ids* function below and executed at the end of the main *extract_symptoms* function. 

*Note that the Truth values returned by the NLP pipeline represent the truth of whether the symptom was negated in the text, so a True label would imply that the symptom **isn't** present in the patient. The output truth values are for presence of the symptom, hence the inversion that is implicit in the function.*
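
The decision rule above can be sketched for a single symptom as follows. This is a minimal illustration only; the helper name `presence_label` is an assumption and is not used elsewhere in the notebook.

```python
def presence_label(negations):
    """Collapse a list of negation flags into one presence label.

    Each flag is True when a matched mention was negated in the text.
    """
    #no matches at all: default to the symptom being absent
    if not negations:
        return False

    #fraction of matches that were negated (True counts as 1)
    negated_fraction = sum(negations) / len(negations)

    #present unless a strict majority of matches were negated,
    #so a tie resolves to True
    return negated_fraction <= 0.5


print(presence_label([]))                   #False
print(presence_label([True, False, True]))  #False
print(presence_label([True, False]))        #True
print(presence_label([False]))              #True
```

Resolving ties to True biases the result towards recording a symptom as present, matching the default described in the list above.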


In [14]:
# input is freetext data table generated below grouped by encounter id.
# the function combines identified symptoms to create single symptom
#presence label
def combine_e_ids(df, include_ratio=False):
    
    new_df = {}
    
    #iterate through each symptom type
    #combine results across all freetext for a given encounter
    for symptom in df['extracted_symps_dict'].values[0].keys():
        #bias towards positive diagnosis if equal number of +/- ive
        symptom_count = 0
        total_count = 0
        total_records = 0
        for d in df['extracted_symps_dict'].values:
            total_records += 1
            for i in d[symptom]:
                #if negation value is True adds 1; 0 if False
                symptom_count += i
                total_count += 1
        
        final_presence_label = None
        
        #if there are no labels either +/- ive then default to negative
        if total_count == 0:
            final_presence_label = False
            
        else:
            #mean_count is the fraction of matches that were negated
            mean_count = symptom_count / total_count
            
            #majority of matches negated, so the symptom is absent
            if mean_count > 0.5:
                final_presence_label = False
                
            #half or fewer of the matches negated, so the symptom is present
            else:
                final_presence_label = True
        
        #to include the ratio of True/False labels in the table output
        if include_ratio:
            pos_count = total_count - symptom_count
            neg_count = symptom_count
            final_presence_label = str(final_presence_label) + f" ({pos_count}+  {neg_count}-)"
        
        new_df.update({symptom:final_presence_label})
            
    new_df.update({"Total records":total_records})
    
    all_text = ""
    #generate string containing all notes for an encounter
    for i, text in enumerate(df['text'].values):
        all_text += f"---------\n Note {i}\n---------\n"
        all_text += text
        all_text += "\n\n"
    
    new_df.update({"All Notes":all_text})
    
    return pd.DataFrame(index = [0],data = new_df)


In [15]:
#function to search blob text for symptoms and their variations
#listed in freetext_term_df


def extract_symptoms(blob_df,symptoms, include_ratio = False, include_uncombined = False):
    #include_ratio - includes the ratio of True and False presence labels 
    #that determined the final presence label

    #include_uncombined - include the uncombined df which contains doc objects 
    #for all freetext entries

    start_time = time.time()
    
    #generate nlp model using symptoms and their variations
    nlp = make_nlp_model(symptoms)
    
    #pre-process freetext - see function definition above for details
    print("before preprocessing ",blob_df.shape[0])
    blob_df = freetext_preprocessing(blob_df)
    print("after preprocessing ",blob_df.shape[0])
    
    #uncomment to test function on sample of df
    #blob_df = blob_df.head(10)
     
    #dictionary of each symptom and whether any terms (negated or not)
    #were found for symptoms in freetext
    blob_symptoms_dict = []
    
    #string of symptoms for which any terms are found with
    #their negation in brackets e.g. HT (False)
    blob_symptoms = []
    
    #list of the doc objects for each freetext entry
    blob_docs = []
    
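    #iterate through freetext df one entry at a time
    #note: results are collected in sorted-index order but assigned back
    #positionally below, so blob_df should already be sorted by index
    for row_id, row in blob_df.sort_index().iterrows():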
        #print progress
        if row_id % 500 == 0:
            print("Row ",row_id)
        
        #run freetext through nlp pipeline
        doc = nlp(row['text'])
        
        #keep instances of pe/dvt only following PMH entity
        doc = find_pmh_entities(doc)
        
        #initialise dictionary for each symptom that will be filled
        #with any found occurrences
        all_symps_names = list(symptoms.keys())
        all_symps_names.remove('pe')
        symp_dict = {term:[] for term in all_symps_names}
        symp_string = None
        
        #loop through all found entities in freetext doc object
        for e in doc.ents:
            
            #if entity is of custom symptom type: SYM
            if e.label_ == "SYM":
                
                #convert pe to prior pe to combine the columns
                if e.ent_id_ == 'pe':
                    ent_id = 'prior pe'
                else:
                    ent_id = e.ent_id_
                    
                
                symp_dict[ent_id].append(e._.negex)
                if symp_string is None:
                    symp_string = f"{ent_id} ({e._.negex}), "
                else:
                    symp_string += f"{ent_id} ({e._.negex}), "
                
        blob_symptoms_dict.append(symp_dict)
        blob_symptoms.append(symp_string)
        blob_docs.append(doc)
    
    blob_df.loc[:,'extracted_symps_'] = blob_symptoms
    blob_df.loc[:,'extracted_symps_dict'] = blob_symptoms_dict
    blob_df.loc[:,'doc'] = blob_docs
    
    
    #group by encounter and combine presence labels
    final_extract_df = blob_df.groupby(by='e_id').apply(combine_e_ids, include_ratio = include_ratio)
    final_extract_df = final_extract_df.reset_index().drop(columns = ['level_1'])
    
    end_time = time.time()
    
    print(f"Time taken for {blob_df.shape[0]} rows: {round(end_time-start_time)} seconds")
    
    if include_uncombined:
        return final_extract_df, blob_df, nlp
    
    else:
        return final_extract_df,nlp
       
    
combined_extract_df, uncombined_extract_df ,extract_nlp_model = extract_symptoms(freetext_df, symptom_dict, include_ratio=True, include_uncombined = True)



combined_extract_df.head()

before preprocessing  5
after preprocessing  5
Row  0
Time taken for 5 rows: 0 seconds


Unnamed: 0,e_id,hypertension,chronic heart failure,cancer,prior pe,chest pain,dyspnea,doa,recent surgery,Total records,All Notes
0,111,False (0+ 0-),True (1+ 0-),False (0+ 0-),True (2+ 0-),False (0+ 0-),False (0+ 1-),True (1+ 0-),False (0+ 1-),2,---------\n Note 0\n---------\npt has not had ...
1,222,True (2+ 0-),False (0+ 0-),False (0+ 0-),False (0+ 0-),False (0+ 1-),False (1+ 3-),False (0+ 0-),False (0+ 0-),2,---------\n Note 0\n---------\npc: no chest pa...
2,333,True (1+ 0-),False (0+ 1-),True (2+ 1-),False (0+ 0-),False (0+ 0-),True (1+ 0-),False (0+ 0-),False (0+ 0-),1,---------\n Note 0\n---------\npatient known k...


In [16]:
uncombined_extract_df.head()

Unnamed: 0,text,e_id,extracted_symps_,extracted_symps_dict,doc
1,pt has not had recent surgery / immobilisation...,111,"dyspnea (True), chronic heart failure (False),...","{'hypertension': [], 'chronic heart failure': ...","(72, h, onset, of, palpitation, s, ,, worse, w..."
0,"72 h onset of palpitation s , worse when walki...",111,"recent surgery (True), prior pe (False), doa (...","{'hypertension': [], 'chronic heart failure': ...","(pt, has, not, had, recent, surgery, /, immobi..."
4,"pc: no chest pains, sob. pmh htn, high chole...",222,"dyspnea (False), dyspnea (True), dyspnea (True...","{'hypertension': [False], 'chronic heart failu...","(pre, -, arrival, summary, , name, :, , doe,..."
3,"pre-arrival summary name: doe, john curren...",222,"chest pain (True), dyspnea (True), hypertensio...","{'hypertension': [False], 'chronic heart failu...","(pc, :, no, chest, pains, ,, sob, ., , pmh, ..."
5,patient known kidney cancer on no chemo as kno...,333,"cancer (False), cancer (True), chronic heart f...","{'hypertension': [False], 'chronic heart failu...","(patient, known, kidney, cancer, on, no, chemo..."


### Saving output
For use again in Python, the output can be saved as a 'pickle' file. The advantage of this over a .csv is that it stores all of the "doc" objects etc. correctly, rather than converting them to strings.

However, it is not necessary to save the pickles to keep using the notebook. Uncomment the functions to save and load the pickles if desired.
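
The difference between the two formats can be seen with a small standard-library sketch. The `record` dictionary below is a made-up stand-in for one row of the extract, not data from the notebook.

```python
import csv
import io
import pickle

#hypothetical stand-in for one row of the uncombined extract
record = {"e_id": 111, "negations": [True, False]}

#pickle restores the original Python objects and types
restored = pickle.loads(pickle.dumps(record))

#csv writes every cell as text, so the list comes back as a string
buffer = io.StringIO()
csv.writer(buffer).writerow([record["e_id"], record["negations"]])
buffer.seek(0)
csv_row = next(csv.reader(buffer))

print(type(restored["negations"]))  #<class 'list'>
print(csv_row)                      #['111', '[True, False]']
```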

In [17]:
def save_pickle(combined_extract_df, blob_df):
    date_time_string = datetime.today().strftime("%y.%m.%d-%H%M%S")
    #create the output folder if it does not exist yet
    os.makedirs('pickle_data', exist_ok=True)
    with open(f'pickle_data/{date_time_string}.pickle', 'wb') as f:
        pickle.dump([combined_extract_df, blob_df],f)
        
def get_latest_pickle():
    folder_dates = os.listdir('pickle_data/')
    int_dates = [''.join(i for i in j if i.isdigit()) for j in folder_dates]
    latest_date = folder_dates[np.argmax(int_dates)]
    file_path = 'pickle_data/' + latest_date 
    
    with open(file_path,'rb') as f:
        combined_extract_df, blob_df = pickle.load(f)
        
    return combined_extract_df, blob_df
    

In [18]:
#save_pickle(combined_extract_df, uncombined_extract_df)
#combined_extract_df, blob_df = get_latest_pickle()
combined_extract_df.head()

Unnamed: 0,e_id,hypertension,chronic heart failure,cancer,prior pe,chest pain,dyspnea,doa,recent surgery,Total records,All Notes
0,111,False (0+ 0-),True (1+ 0-),False (0+ 0-),True (2+ 0-),False (0+ 0-),False (0+ 1-),True (1+ 0-),False (0+ 1-),2,---------\n Note 0\n---------\npt has not had ...
1,222,True (2+ 0-),False (0+ 0-),False (0+ 0-),False (0+ 0-),False (0+ 1-),False (1+ 3-),False (0+ 0-),False (0+ 0-),2,---------\n Note 0\n---------\npc: no chest pa...
2,333,True (1+ 0-),False (0+ 1-),True (2+ 1-),False (0+ 0-),False (0+ 0-),True (1+ 0-),False (0+ 0-),False (0+ 0-),1,---------\n Note 0\n---------\npatient known k...


### Save to csv
To save the output as a .csv file, the non-symptom columns are dropped first and the encounter ID column is renamed.

In [19]:
df_to_save = combined_extract_df.drop(columns=['All Notes','Total records']).rename(columns={"e_id":"ENCNTR_ID"})
#df_to_save.to_csv('nlp_extract.csv')

## Brief data exploration
The remaining cells contain various functions for exploring the data and the NLP predictions.

In [20]:
#extract all encounters with non-default false instance of a symptom - "prior pe" below
#include_ratio must be True in extract_symptoms function to include ratios used here

pe_df = combined_extract_df[~combined_extract_df['prior pe'].str.contains("0+  0-",regex=False)]
pe_df

Unnamed: 0,e_id,hypertension,chronic heart failure,cancer,prior pe,chest pain,dyspnea,doa,recent surgery,Total records,All Notes
0,111,False (0+ 0-),True (1+ 0-),False (0+ 0-),True (2+ 0-),False (0+ 0-),False (0+ 1-),True (1+ 0-),False (0+ 1-),2,---------\n Note 0\n---------\npt has not had ...
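
When `include_ratio` is True, each cell is a string such as `True (2+  0-)`, so filtering on it relies on exact text matching. As an alternative, the string can be split back into its parts. The sketch below assumes the label format produced by `combine_e_ids`; the helper name `parse_presence_label` is hypothetical.

```python
import re

#matches labels such as "True (2+  0-)", with any amount of spacing
#between the two counts
LABEL_PATTERN = re.compile(r"(True|False) \((\d+)\+\s+(\d+)-\)")


def parse_presence_label(label):
    """Return (present, positive_count, negated_count) for a ratio label."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    present, pos_count, neg_count = match.groups()
    return present == "True", int(pos_count), int(neg_count)


print(parse_presence_label("True (2+  0-)"))   #(True, 2, 0)
print(parse_presence_label("False (1+  3-)"))  #(False, 1, 3)
```

With the counts available as integers, encounters with any match for a symptom can be selected by checking that the two counts do not sum to zero.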


In [21]:
def print_encounter_data(e_id):
    print("Combined extract NLP: \n",combined_extract_df[combined_extract_df['e_id'] == e_id])
    
    print("------\n Extracted symptoms from notes \n--------\n")
    #only show the notes that belong to the requested encounter
    encounter_notes = uncombined_extract_df[uncombined_extract_df['e_id'] == e_id]
    for note, (row_id, row) in enumerate(encounter_notes.iterrows(), start=1):
        print(f"NOTE {note}\n Identified symptoms and if negated: \n",repr(row.extracted_symps_),'\n',)
        spacy.displacy.render(row.doc,style='ent',jupyter=True)
        

In [22]:
#get the last 10 encounters with any mention of prior_pe
pe_df['e_id'].tail(10)

0    111
Name: e_id, dtype: int64

In [23]:

print_encounter_data(111)


Combined extract NLP: 
    e_id    hypertension chronic heart failure          cancer       prior pe  \
0   111  False (0+  0-)         True (1+  0-)  False (0+  0-)  True (2+  0-)   

       chest pain         dyspnea            doa  recent surgery  \
0  False (0+  0-)  False (0+  1-)  True (1+  0-)  False (0+  1-)   

   Total records                                          All Notes  
0              2  ---------\n Note 0\n---------\npt has not had ...  
------
 Extracted symptoms from notes 
--------

NOTE 1
 Identified symptoms and if negated: 
 'dyspnea (True), chronic heart failure (False), prior pe (False), ' 



NOTE 2
 Identified symptoms and if negated: 
 'recent surgery (True), prior pe (False), doa (False), ' 



In [25]:
#search for occurrences of "blood" in text
search_string = r"\Wblood\W"
#search_string = r'\Wdvt\W|deep vein thrombosis'
for row_id, row in uncombined_extract_df[uncombined_extract_df['text'].str.contains(search_string, regex=True)].iterrows():
    e_id = row.e_id
    medical_note = row.text
    match = re.search(search_string, medical_note)
    while match:
        
        start_inx = match.span()[0]
        end_inx = match.span()[1]
        #show up to 50 characters either side of the match
        #(slicing past the end of the note is clamped automatically)
        start = max(start_inx - 50, 0)
        end = end_inx + 50
        print(repr(medical_note[start:end]),"Encounter ID: ",e_id)
        
        medical_note = medical_note[end_inx:]
        match = re.search(search_string, medical_note)
        

' patient been losing weight.  known to suffer from blood clots.  not been on blood thinners since august.  ' Encounter ID:  333
'clots.  not been on blood thinners since august.  pmh  kidney ca.  blood clo' Encounter ID:  333
'thinners since august.  pmh  kidney ca.  blood clots, sleep apnoea, htn, disc problems.  medicati' Encounter ID:  333
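
The same kind of context search can also be written with `re.finditer`, which avoids truncating the note after each match. The sketch below is a standalone illustration: the function name `keyword_in_context`, the sample sentence and the `\b` word-boundary pattern are assumptions rather than part of the notebook.

```python
import re


def keyword_in_context(text, pattern, width=50):
    """Return each match of `pattern` with up to `width` characters
    of context on either side."""
    snippets = []
    for match in re.finditer(pattern, text):
        start = max(match.start() - width, 0)
        end = match.end() + width  #slicing clamps at the end of the text
        snippets.append(text[start:end])
    return snippets


sample = "known to suffer from blood clots. not on blood thinners."
for snippet in keyword_in_context(sample, r"\bblood\b", width=5):
    print(repr(snippet))
```

Unlike the loop above, the context before a match here can include earlier matches, and `\b` also finds the keyword at the very start or end of a note.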
