# Jupyter Notebook for EHRKit
This Jupyter notebook demonstrates how to run the Naive Bayes summarization model trained on the PubMed corpus. The model can be run on any text, or on a particular EHR extracted from the MIMIC database on Tangra.

## Important
Before running this notebook, read the README files in the /summarization/pubmed_summarization and /pubmed folders. The README in the /pubmed folder contains the scripts needed to download the PubMed dataset in XML format and parse each article. To keep things simple and fast, it is recommended to parse around 500 files, and only their body introductions. The README in the /summarization/pubmed_summarization folder describes how to train the Naive Bayes summarization model on the parsed PubMed articles.

The only change the user needs to make to this code is to set the root directory for EHRKit.

In [1]:
ROOT_EHR_DIR = '/data/lily/br384/clean_EHRKit/EHRKit/'  # Set your root EHRKit directory here (with the '/' at the end)
import sys
import os
sys.path.append(os.path.dirname(ROOT_EHR_DIR))
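
The trailing '/' matters because `os.path.dirname` drops the last path component when the path has no trailing separator, so `sys.path` would receive the parent directory instead of the EHRKit root. A minimal sketch of a more forgiving setup is shown here. `add_root_to_path` is a hypothetical helper, not part of EHRKit; it uses `os.path.normpath` so the path works with or without the trailing '/'.

```python
import os
import sys


def add_root_to_path(root_dir, path_list=None):
    """Append root_dir to a module search path and return the normalized path.

    Hypothetical helper (not part of EHRKit). path_list defaults to sys.path.
    """
    path_list = sys.path if path_list is None else path_list
    # normpath strips any trailing separator, so '/a/b' and '/a/b/' agree.
    normalized = os.path.normpath(root_dir)
    if not os.path.isdir(normalized):
        raise FileNotFoundError(f"EHRKit root not found: {normalized}")
    # Avoid adding the same entry twice when the cell is re-run.
    if normalized not in path_list:
        path_list.append(normalized)
    return normalized
```

Calling `add_root_to_path(ROOT_EHR_DIR)` would then stand in for the `sys.path.append` line, and a mistyped directory fails immediately instead of surfacing later as an import error.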

In [2]:
from ehrkit import ehrkit
from demos import demo

2021-06-01 18:18:12,099 : INFO : Loading faiss with AVX2 support.


## Using the Tangra MIMIC database

Now let's try running the model on an EHR extracted from the Tangra database using the naive_bayes_db() function. Note that if you do not have access to the EHR database on Tangra, you can ignore this part and skip to the next piece of code.

In [3]:
# Download the PubMed files (note: this will take a while).
# Don't forget to comment this out after you have run it once!
!cd {ROOT_EHR_DIR} && bash pubmed/download_pubmed.sh

--2021-06-01 18:18:12--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.A-B.xml.tar.gz
           => ‘non_comm_use.A-B.xml.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.10, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/pmc/oa_bulk ... done.
==> SIZE non_comm_use.A-B.xml.tar.gz ... 2839251385
==> PASV ... done.    ==> RETR non_comm_use.A-B.xml.tar.gz ... done.
Length: 2839251385 (2.6G) (unauthoritative)


2021-06-01 18:19:33 (33.6 MB/s) - ‘non_comm_use.A-B.xml.tar.gz’ saved [2839251385]

--2021-06-01 18:21:20--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.C-H.xml.tar.gz
           => ‘non_comm_use.C-H.xml.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 165.112.9.228, 2607:f220:41e:250::12, ...
Connecting 
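
Instead of commenting the download cell out by hand, one option is to check which archives are already on disk before downloading again. The sketch below assumes you know the directory where `download_pubmed.sh` saves its archives, which this notebook does not state; `missing_archives` is a hypothetical helper, and the archive names are taken from the log output above.

```python
from pathlib import Path


def missing_archives(download_dir, archive_names):
    """Return the archive names that are not yet present in download_dir.

    Hypothetical helper; download_dir is an assumption about where the
    download script saves its files.
    """
    download_dir = Path(download_dir)
    return [name for name in archive_names if not (download_dir / name).is_file()]


# Archive names as they appear in the wget log above.
ARCHIVES = ["non_comm_use.A-B.xml.tar.gz", "non_comm_use.C-H.xml.tar.gz"]
```

If `missing_archives(your_download_dir, ARCHIVES)` returns an empty list, the download cell can be skipped. Note that this only checks that each file exists; an interrupted download would still pass, so comparing file sizes against the `Length` reported in the log would be a stricter check.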

In [5]:
# Parse the PubMed articles. This also takes a while, so it is recommended to
# run it from the command line instead: that is more interactive, easier to run
# in the background, and potentially faster.
from pubmed.parse_articles import run_parser
run_parser()

newer version
Path to XML files: /data/lily/br384/EHRKit/pubmed/xml
Path to parsed PubMed files: /data/lily/br384/EHRKit/pubmed/parsed_articles
Only 0 files could be parsed.
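
The output above reports that 0 files were parsed. Note that the paths it prints sit under /data/lily/br384/EHRKit, not under the `ROOT_EHR_DIR` set in the first cell, which may be why no input was found. A quick way to investigate is to count the files that actually exist in the XML directory. The sketch below uses a hypothetical helper, `count_files`; the default glob pattern `"*"` counts every file because the extension of the extracted articles is not stated in this notebook.

```python
from pathlib import Path


def count_files(directory, pattern="*"):
    """Count regular files under directory (recursively) matching a glob pattern.

    Hypothetical diagnostic helper; returns 0 if the directory does not exist.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    # rglob also yields subdirectories, so keep regular files only.
    return sum(1 for path in directory.rglob(pattern) if path.is_file())
```

For example, `count_files('/data/lily/br384/EHRKit/pubmed/xml')` returning 0 would indicate that the parser is looking in an empty or missing directory, not failing on the articles themselves.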


In [3]:
# Install a few extras to run the summarizer:
import nltk
nltk.download('averaged_perceptron_tagger')

In [11]:
demo.naive_bayes_db()



------------------------------Full EHR------------------------------
Chief Complaint:
   24 Hour Events:
   - called out to floor, Vt code at 7pm before transfer to floor, no
   shock given, pt recovered pulse on own at initiation of CPR.  Post
   arrest EKG without ST elevations
   - likely episodes of atrial tachycardia with baseline possible
   junctional rhythm.  plan cardiology evaluation
   - plan for echo [**3-26**].
   - RLE U/S negative for DVT as source of underlying cellulitis
   Allergies:
   Haldol (Oral) (Haloperidol)
   Unknown;
   Penicillins
   Unknown;
   Augmentin (Oral) (Amox Tr/Potassium Clavulanate)
   Unknown;
   Last dose of Antibiotics:
   Cefipime - [**2115-3-24**] 08:56 PM
   Infusions:
   Other ICU medications:
   Other medications:
   Changes to medical and family history:
   Review of systems is unchanged from admission except as noted below
   Review of systems:
   Flowsheet Data as of  [**2115-3-26**] 06:09 AM
   Vital signs
   Hemodynamic monitoring
 

## General Usage
We can also run this model on any text by calling the naive_bayes(text) function.

In [8]:
text = "Large necrotic inguinal lymph node metastases bilaterally. 2) Successful aspiration of serosanguinous fluid from the fluid components of lymph nodes within both inguinal regions. This fluid was sent for gram stain and culture.  Fine needle aspiration of the solid components of lymph nodes within both inguinal regions was also performed without complication."

In [9]:
print(demo.naive_bayes(text))

This fluid was sent for gram stain and culture.
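
In the run above, the returned summary is one sentence copied verbatim from the input, which suggests that the model selects sentences and does not generate new text. The sketch below checks that property for any summary. Both helpers are hypothetical and not part of EHRKit, and the regex-based sentence splitter is a simplification that may differ from whatever tokenizer the model uses internally.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_extractive(summary, source):
    """Return True if every sentence of summary appears verbatim in source."""
    source_sentences = set(split_sentences(source))
    return all(s in source_sentences for s in split_sentences(summary))
```

With the `text` defined above, `is_extractive(demo.naive_bayes(text), text)` would be expected to return `True` for the output shown. An empty summary also passes, since it contains no sentences to check.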
