# Query Extraction

This notebook walks through a query extraction test. Using the `QueryExtraction` module of EHRKit, we can search for specific queries in medical text. This example loads data from the MIMIC-III dataset and uses several different keyword extraction methods to search for a few given query terms.

The user must set the root directory for the EHRKit, and optionally the data directories.

In [10]:
ROOT_EHR_DIR = '/home/lily/br384/EHRKit/' # set your root EHRKit directory here (with the '/' at the end)
import sys
import os
sys.path.append(os.path.dirname(ROOT_EHR_DIR))

In [11]:
# Set your MIMIC path here.
# Put all of the individual MIMIC CSV files in MIMIC_PATH (with the '/' at the end).
# The file names should be all caps, such as NOTEEVENTS.csv.
# Leave the OUTPUT_DATA_PATH directory empty; the processed data will be deposited there.
OUTPUT_DATA_PATH = ROOT_EHR_DIR + 'data/output_data/'
MIMIC_PATH = ROOT_EHR_DIR + 'data/mimic_data/'
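
Before running the preparation step, it can help to confirm that the expected CSV files are actually in place. The sketch below is only an illustration: the helper name `missing_mimic_files` and the example path are hypothetical, and `NOTEEVENTS.csv` is the only file name mentioned above, so extend `required` with whichever tables your preparation step reads.

```python
import os

def missing_mimic_files(mimic_path, required=("NOTEEVENTS.csv",)):
    """Return the names in `required` that are not files inside `mimic_path`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(mimic_path, name))
    ]

# Hypothetical location; substitute your own MIMIC_PATH.
example_path = os.path.join("EHRKit", "data", "mimic_data")
missing = missing_mimic_files(example_path)
if missing:
    print("Missing from", example_path + ":", ", ".join(missing))
```

Because the check builds paths with `os.path.join`, it works whether or not the directory string ends with `/`.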

In [12]:
from mimic_icd9_coding.coding_pipeline import codingPipeline
from mimic_icd9_coding.utils.mimic_data_preparation import run_mimic_prep

In [4]:
# Run the MIMIC label preparation, examine the top n most common labels, and create a dataset of labels and text.
# By default the dataset is saved to Data; pass return_df=True to keep it in the notebook.
run_mimic_prep(output_folder=OUTPUT_DATA_PATH, mimic_data_path=MIMIC_PATH)


In [5]:
# Set up the pipeline on the prepared MIMIC data without running it (run=False)
my_mimic_pipeline = codingPipeline(verbose=False, model=None, data_path=OUTPUT_DATA_PATH, run=False)



## Let's test out query extraction on specific notes!

In [6]:
# Load the data into the pipeline. The pipeline does not do this automatically
# because holding the data uses more memory, so we call load_data() explicitly.
my_mimic_pipeline.load_data()
df = my_mimic_pipeline.data

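The next import must be able to find the root directory of EHRKit: either move into that directory first or rely on the `sys.path` entry added at the top of this notebook, and make sure to change the path accordingly.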


In [7]:
from QueryExtraction.extraction import main_extraction

In [8]:
txt0 = df.iloc[0]['TEXT']
txt1 = df.iloc[1]['TEXT']

word1 = ['heart', 'liver', 'stomach', 'hypertension']

main_extraction(word1, txt0)

Extraction:
SpaCy TextRank:
Captured  0
['mg tablet sig', 'discharge instructions', 'social history', 'history', 'sliding scale recommendations', 'medical history', 'discharge diagnosis', 'discharge', 'sliding scale', 'discharge condition', 'discharge disposition', 'family history', 'midodrine 2.5 mg tablet sig', '300 mg capsule sig', 'upper extremities', 'dr.[**last name', 'reduced sensation distal le', 'insulin srip', 'disp:*60 capsule', 'chronic kidney disease', 'following fluids', 'prior admissions', 'admission labs', 'chronic cough', '30 mg capsule', 'aggressive volume resuscitation', 'admission', 'baseline cr', 'long acting levemir', 'chronic renal insufficiency']
------------------------------
Gensim TextRank:
Captured  1
['discharge', 'discharged', 'diabetes', 'diabetic', 'disp', 'home', 'gastroparesis ckd', 'nausea vomiting', 'controlled', 'control', 'controls', 'date', 'medical', 'medications', 'power', 'extremities', 'vomit', 'insulin', 'sliding', 'admission', 'admissions', 

Captured  0
['medicine allergies', '200s diabetic', 'evidence pneumonia', 'diabetes mellitis', 'neuropathy date', 'labs 2117', 'fracture 2117', 'medications citalopram', 'birth 2082', 'fax 85219', 'outpatient setting', 'ketoacidosis patient', 'vomiting hematocrit', 'radiology cxr', 'hsm doctor', 'lipase 22', '22am wbc', 'aspirin 325mg', 'hospitalizations 12', 'doctor 515', 'chronic renal', 'glucose 466', '118 urean', 'home insulin', 'major surgical', 'years gastroparesis', 'ckd retinopathy', '137 potassium', 'keotacidosis hematemesis', 'vomiting coffee']
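
Two of the methods in the output above are labeled TextRank. As a rough illustration of the idea behind that family of methods, the sketch below ranks single words by running PageRank over a word co-occurrence graph. It is not the spaCy, Gensim, or EHRKit implementation: the function name `textrank_keywords`, the tiny stopword list, and all parameter defaults are assumptions chosen to keep the example short, and it returns single words rather than phrases.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real pipelines use a larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for",
             "with", "was", "is", "were", "by", "at"}

def textrank_keywords(text, window=3, top_n=5, damping=0.85, iterations=50):
    """Rank words by PageRank over a co-occurrence graph (simplified TextRank)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: link each word to the words that follow it
    # within a sliding window of `window` tokens.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    if not neighbors:
        return []

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

note = "insulin dose insulin pump insulin therapy insulin drip"
print(textrank_keywords(note, top_n=3))
```

Words that co-occur with many different neighbors accumulate the highest scores, which is why frequent section words such as "discharge" and "history" dominate the keyword lists above.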


In [9]:
main_extraction(word1, txt1)

Extraction:
SpaCy TextRank:
01-Jun-21 18:08:13 - Initiated a keyword detector instance.
Captured  0
['mg tablet sig', 'non-steroidal induced ulcer', 'non-steroidal induced gastritis', 'blood calcium-7.5', 'coffee ground emesis', 'blood wbc-17.9', 'blood alt-126', 'blood alt-113', 'blood pressure', 'grade ii esophageal varices', 'right ventricular chamber size', 'disp:*60 tablet', 'melena x2 days', 'abdominal pain', 'ventricular systolic function', 'folic acid 1 mg tablet sig', 'hcv cirrhosis', 'guaiac positive brown stool', 'medical history', 'moderate pulmonary artery systolic hypertension', 'discharge', 'social history', 'discharge disposition', 'tarry black stool', 'ef>75%', 'prophylactic medications', '40 mg tablet', 'family history', 'right atrium', 'back pain']
------------------------------
Gensim TextRank:
Captured  2
['daily', 'negative', 'discharge', 'blood', 'medication', 'medical', 'medications', 'days', 'day', 'pain', 'portal', 'times', 'sig', 'egd right', 'history', 'effu

Captured  0
['medicine allergies', '2150 coffee', 'hcv cirrhosis', 'fax 2422', 'birth 2090', 'labs 2150', 'cardiology tte', 'naproxen pantoprazole', 'nsaids blood', 'endoscopy week', 'lastname 52368', 'pneumothorax pleural', 'duodenum radiology', '13pm blood', 'bacteri yeast', 'brief hospital', 'hemodynamic monitoring', 'octreotide drip', 'ultrasound showed', 'ct 186', 'menthol lotion', 'hospital1 times', 'instead nsaids', 'discharge hcv', '59m hcv', 'left ventricular', 'acetaminophen 325', 'past medical', 'blood wbc', 'cardiac tamponade']
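
The `Captured` counts above report how many of the query terms each method recovered. The SpaCy TextRank list for the second note contains 'moderate pulmonary artery systolic hypertension' yet reports `Captured  0`, which appears consistent with whole-keyword matching. The sketch below assumes that behavior; `count_captured` is a hypothetical helper, not the function `main_extraction` uses, and the keyword list is a small subset copied from the output above.

```python
def count_captured(queries, keywords):
    """Return the queries that equal some extracted keyword (case-insensitive)."""
    keyword_set = {k.lower() for k in keywords}
    return [q for q in queries if q.lower() in keyword_set]

queries = ['heart', 'liver', 'stomach', 'hypertension']

# A few keywords copied from the SpaCy TextRank output for the second note.
keywords = [
    'abdominal pain',
    'hcv cirrhosis',
    'moderate pulmonary artery systolic hypertension',
    'right atrium',
]

# Assumed behavior: a query counts only if it matches a whole keyword.
exact = count_captured(queries, keywords)

# Looser alternative: a query counts if it is one of the words inside a keyword.
loose = [q for q in queries if any(q.lower() in k.lower().split() for k in keywords)]

print(len(exact), len(loose))  # prints "0 1"
```

Under the looser rule 'hypertension' is captured through the multi-word phrase, so the choice of matching rule changes the reported counts for phrase-based extractors.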
