{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jupyter Notebook for EHRKit\n",
    "This Jupyter notebook demonstrates how to run the Naive Bayes summarization model trained on the PubMed corpus. The model can be applied to arbitrary text or to a particular EHR extracted from the MIMIC database on Tangra."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Important\n",
    "Before running this notebook, read the README files in the /summarization/pubmed_summarization and /pubmed folders. The README in the /pubmed folder contains the scripts needed to download the PubMed dataset in XML format and parse each article. To save time, it is recommended to parse only around 500 files, and only their body introductions. Finally, the README in the /summarization/pubmed_summarization folder describes how to train the Naive Bayes model for summarization on the parsed PubMed articles."
   ]
  },
  {
   "source": [
    "The only change you need to make to this code is to set the root directory of your EHRKit installation."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "# Set your root EHRKit directory here (with the '/' at the end).\n",
    "ROOT_EHR_DIR = '/data/lily/br384/clean_EHRKit/EHRKit/'\n",
    "sys.path.append(os.path.dirname(ROOT_EHR_DIR))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "2021-06-01 18:18:12,099 : INFO : Loading faiss with AVX2 support.\n"
     ]
    }
   ],
   "source": [
    "from ehrkit import ehrkit\n",
    "from demos import demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the Tangra MIMIC database\n",
    "\n",
    "Now let's run the model on an EHR extracted from the Tangra database using the naive_bayes_db() function. If you do not have access to the EHR database on Tangra, skip the naive_bayes_db() call and continue with the General Usage section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--2021-06-01 18:18:12--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.A-B.xml.tar.gz\n",
      "           => ‘non_comm_use.A-B.xml.tar.gz’\n",
      "Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.10, 2607:f220:41e:250::12, ...\n",
      "Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /pub/pmc/oa_bulk ... done.\n",
      "==> SIZE non_comm_use.A-B.xml.tar.gz ... 2839251385\n",
      "==> PASV ... done.    ==> RETR non_comm_use.A-B.xml.tar.gz ... done.\n",
      "Length: 2839251385 (2.6G) (unauthoritative)\n",
      "\n",
      "non_comm_use.A-B.xm 100%[===================>]   2.64G  31.6MB/s    in 81s     \n",
      "\n",
      "2021-06-01 18:19:33 (33.6 MB/s) - ‘non_comm_use.A-B.xml.tar.gz’ saved [2839251385]\n",
      "\n",
      "--2021-06-01 18:21:20--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.C-H.xml.tar.gz\n",
      "           => ‘non_comm_use.C-H.xml.tar.gz’\n",
      "Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 165.112.9.228, 2607:f220:41e:250::12, ...\n",
      "Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /pub/pmc/oa_bulk ... done.\n",
      "==> SIZE non_comm_use.C-H.xml.tar.gz ... 3871497977\n",
      "==> PASV ... done.    ==> RETR non_comm_use.C-H.xml.tar.gz ... done.\n",
      "Length: 3871497977 (3.6G) (unauthoritative)\n",
      "\n",
      "non_comm_use.C-H.xm 100%[===================>]   3.61G  49.1MB/s    in 73s     \n",
      "\n",
      "2021-06-01 18:22:33 (50.6 MB/s) - ‘non_comm_use.C-H.xml.tar.gz’ saved [3871497977]\n",
      "\n",
      "--2021-06-01 18:24:52--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.I-N.xml.tar.gz\n",
      "           => ‘non_comm_use.I-N.xml.tar.gz’\n",
      "Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.12, 2607:f220:41e:250::7, ...\n",
      "Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /pub/pmc/oa_bulk ... done.\n",
      "==> SIZE non_comm_use.I-N.xml.tar.gz ... 6978237687\n",
      "==> PASV ... done.    ==> RETR non_comm_use.I-N.xml.tar.gz ... done.\n",
      "Length: 6978237687 (6.5G) (unauthoritative)\n",
      "\n",
      "non_comm_use.I-N.xm 100%[===================>]   6.50G  11.3MB/s    in 10m 23s \n",
      "\n",
      "2021-06-01 18:35:17 (10.7 MB/s) - ‘non_comm_use.I-N.xml.tar.gz’ saved [6978237687]\n",
      "\n",
      "--2021-06-01 18:39:29--  ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.O-Z.xml.tar.gz\n",
      "           => ‘non_comm_use.O-Z.xml.tar.gz’\n",
      "Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 2607:f220:41e:250::11, ...\n",
      "Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /pub/pmc/oa_bulk ... done.\n",
      "==> SIZE non_comm_use.O-Z.xml.tar.gz ... 3355282641\n",
      "==> PASV ... done.    ==> RETR non_comm_use.O-Z.xml.tar.gz ... done.\n",
      "Length: 3355282641 (3.1G) (unauthoritative)\n",
      "\n",
      "non_comm_use.O-Z.xm 100%[===================>]   3.12G  29.1MB/s    in 2m 36s  \n",
      "\n",
      "2021-06-01 18:42:05 (20.6 MB/s) - ‘non_comm_use.O-Z.xml.tar.gz’ saved [3355282641]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Download the PubMed files (note that this will take a while).\n",
    "# Comment this line out after the first successful run so the download is not repeated.\n",
    "!cd {ROOT_EHR_DIR} && bash pubmed/download_pubmed.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "newer version\n",
      "Path to XML files: /data/lily/br384/EHRKit/pubmed/xml\n",
      "Path to parsed PubMed files: /data/lily/br384/EHRKit/pubmed/parsed_articles\n",
      "Only 0 files could be parsed.\n"
     ]
    }
   ],
   "source": [
    "# Parse the PubMed articles. This also takes a while; running it from the command line is recommended, since that is more interactive, easier to run in the background, and potentially faster.\n",
    "from pubmed.parse_articles import run_parser\n",
    "run_parser()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the extra NLTK data needed to run the summarizer:\n",
    "import nltk\n",
    "nltk.download('averaged_perceptron_tagger')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "\n\n------------------------------Full EHR------------------------------\nChief Complaint:\n   24 Hour Events:\n   - called out to floor, Vt code at 7pm before transfer to floor, no\n   shock given, pt recovered pulse on own at initiation of CPR.  Post\n   arrest EKG without ST elevations\n   - likely episodes of atrial tachycardia with baseline possible\n   junctional rhythm.  plan cardiology evaluation\n   - plan for echo [**3-26**].\n   - RLE U/S negative for DVT as source of underlying cellulitis\n   Allergies:\n   Haldol (Oral) (Haloperidol)\n   Unknown;\n   Penicillins\n   Unknown;\n   Augmentin (Oral) (Amox Tr/Potassium Clavulanate)\n   Unknown;\n   Last dose of Antibiotics:\n   Cefipime - [**2115-3-24**] 08:56 PM\n   Infusions:\n   Other ICU medications:\n   Other medications:\n   Changes to medical and family history:\n   Review of systems is unchanged from admission except as noted below\n   Review of systems:\n   Flowsheet Data as of  [**2115-3-26**] 06:09 AM\n   Vital signs\n   Hemodynamic monitoring\n   Fluid balance\n                                                                  24 hours\n                                                               Since 12 AM\n   Tmax: 37.1\nC (98.7\n   Tcurrent: 36.2\nC (97.1\n   HR: 57 (44 - 100) bpm\n   BP: 115/56(71) {88/46(60) - 124/68(82)} mmHg\n   RR: 16 (13 - 20) insp/min\n   SpO2: 100%\n   Heart rhythm: SB (Sinus Bradycardia)\n   Height: 65 Inch\n    CVP: 14 (14 - 14)mmHg\n   Bladder pressure: 16 (16 - 16) mmHg\n   Total In:\n                                                                  4,125 mL\n                                                                    776 mL\n   PO:\n   TF:\n                                                                    682 mL\n                                                                    245 mL\n   IVF:\n                                                                  2,693 mL\n                                                                     31 mL\n   Blood products:\n   Total out:\n                                                                    756 mL\n                                                                    245 mL\n   Urine:\n                                                                    756 mL\n                                                                    245 mL\n   NG:\n   Stool:\n   Drains:\n   Balance:\n                                                                  3,369 mL\n                                                                    531 mL\n   Respiratory support\n   O2 Delivery Device: Nasal cannula\n   SpO2: 100%\n   ABG: ///22/\n   Physical Examination\n   General: nonverbal, responds to pain\n   HEENT: Sclera anicteric, dry mucous membranes, oropharynx clear\n   Neck: supple, no LAD\n   Lungs: CTAB anteriorly\n   CV: Heart sounds essentially inaudible while patient\ns overlying\n   son[**Name (NI) 979**] respirations\n   Abdomen: BS+, soft, NT, ND, no rebound tenderness or guarding, PEG\n   present with exudate around tube, but without expanding erythema from\n   PEG site\n   Ext: RLE with erythema, warmth, swelling from thigh to groin; however,\n   erythema does not track across groin to sacrum\n   Sacrum: + 5cm x 4cm eschar\n   Neuro: opens eyes, does not track\n   Labs / Radiology\n   98 K/uL\n   11.2 g/dL\n   142 mg/dL\n   1.0 mg/dL\n   22 mEq/L\n   3.7 mEq/L\n   28 mg/dL\n   129 mEq/L\n   155 mEq/L\n   34.1 %\n   9.9 K/uL\n        [image002.jpg]\n                             [**2115-3-24**]  05:45 PM\n                             [**2115-3-25**]  01:11 AM\n                             [**2115-3-25**]  04:09 AM\n                             [**2115-3-25**]  09:02 AM\n                             [**2115-3-25**]  02:47 PM\n                             [**2115-3-25**]  06:34 PM\n                             [**2115-3-26**]  03:28 AM\n   WBC\n   14.0\n   10.9\n   9.9\n   Hct\n   35.5\n   31.4\n   34.1\n   Plt\n   102\n   93\n   98\n   Cr\n   1.5\n   1.2\n   1.1\n   1.1\n   1.1\n   1.0\n   TropT\n   0.07\n   0.05\n   0.04\n   0.03\n   0.03\n   TCO2\n   20\n   Glucose\n   [**Telephone/Fax (3) 6664**]63\n   210\n   142\n   Other labs: PT / PTT / INR:13.9/32.7/1.2, CK / CKMB /\n   Troponin-T:574/5/0.03, ALT / AST:18/19, Alk Phos / T Bili:66/0.5,\n   Differential-Neuts:83.5 %, Lymph:11.4 %, Mono:2.9 %, Eos:1.9 %,\n   Fibrinogen:611 mg/dL, Lactic Acid:1.1 mmol/L, Albumin:2.4 g/dL, LDH:213\n   IU/L, Ca++:7.3 mg/dL, Mg++:2.3 mg/dL, PO4:2.2 mg/dL\n   Assessment and Plan\n   This is an 87 yo M s/p CVA, nonverbal, with HTN, type 2 DM who presents\n   with fever to 106, likely secondary to extensive LE cellulitis.\n   Additionally has hypernatremia and acute renal failure likely secondary\n   to dehydration.\n   #   Fever / Leukocytosis:\n   Likely attributed to cellulitis. Patient with significant warmth and\n   and erythema over right lower extremity at presentation. Now with\n   decrease in WBC from 14 to 10.9 and has been afebrile following\n   admission. Despite cellulitis being most likely cause of fever and\n   leukocytosis, pneumonia (given chest CT) and peri-PEG infection (given\n   purulent drainage around PEG). RLE U/S negative as source of\n   fever/cellulitis.  CXR without evidence of PNA, likely atelectasis at\n   LLL.\n   - Continue Vanc, cefepime\n   - d/c levo\n   - Follow-up blood cultures\n currently negative\n   - Follow-up wound cultures from PEG site\n GM stain GPC/GNR\n   - RLE doppler negative for DVT as source of cellulitis\n   - follow fever curve.  Remains afebrile\n   #  Hypotension / Bradycardia:\n   Hypotension likely secondary to poor PO intake over several weeks in\n   addition to sepsis from infection (likely cellulitis). Does not appear\n   to be linked to bradycardia (baseline HR in low 50s) at this time;\n   however, will continue to monitor.  BP stable while in MICU.\n   - IVF boluses as needed\n   - Hold antihypertensives (lisinopril, amlodipine, flomax) and baclofen\n   # VT code:\n   Patient with episode of pulseless/unresponsive VT on [**3-25**] 7PM, code\n   called, no shock delivered, CPR initiated but stopped immediately after\n   initiation due to pt spontaneous recovery of pulse.  Awake and\n   responsive afterwards, post-code EKG without ST elevations, appears to\n   be intermittent atrial tachy with baseline bradycardia, troponin/CK\n   continued trend downward since admission.\n   - Cardiology consult\n   - Echo today\n   - Cont telemetry\n   #  Hypernatremia:\n   Significant hypernatremia likely reflects ongoing infection and poor PO\n   intake over the course of several days to week. Initially received NS\n   for rehydration. Now that pressures are maintained without need for\n   frequent boluses, will hold on further IVF and manage hypernatremia\n   with free water via PEG.  Trending downward slowly.\n   - Free water flushes through PEG\n   - cont daily electrolytes check\n   #  [**Last Name (un) **]:\n   Creatinine rose after getting 5L NS in ED and in MICU. Has fallen\n   slightly to 1.1 upon 0900 labs this AM.  Likely from profound\n   dehydration and possible sepsis in setting of cellulitis.  Creatinine\n   trended downward to normal over course of rehydration.\n   - Follow-up Cr with ongoing hydration via PEG and daily electrolytes\n   checks\n   # Hyperglycemia:\n   Patient is a type 2 diabetic.  No evidence of DKA.  Likely elevated in\n   setting of infection.  Taken off insulin gtt without complications.\n   FSBG trending into acceptable range in sliding scale and lantus.\n   - Lantus + HISS and uptitrate doses as needed. Will set lantus at 14 U\n   tonight to cover increased tube feeds.\n   - QID fingersticks\n   #  AMS:\n   Unknown recent baseline. Likely secondary to s/p CVA in [**2109**], s/p\n   traumatic SDH x 2, significant hypernatremia, and infection.  Head CT\n   showed no acute process.  Mental status improved over course of MICU\n   stay with hydration and treatment of infection.\n   - Continue donepezil\n   - vanco/cefepime for infection\n   # Hypocalcemia:\n   Corrected Ca of 7.3 this AM.  Remained stable without issues\n   - Replete and recheck Ca on PM lytes\n   #  Thrombocytopenia:\n   Unknown baseline.  Platelet count most recent was 137 on [**3-15**]. Is 93\n   this AM, stable in 90\ns. DIC labs negative yesterday for signs of DIC.\n   - Follow-up peripheral smear\n   - Cont to follow plts.  Unlikely DIC at this point.\n   # Anemia:\n   Unclear baseline. Down slightly to 31.4 from 35.5 at admission. Not\n   iron deficient by labs (ferritin 1478 and FeSat = 44%).\n   - Trend HCT, stable\n   - Guiaic stools\n   # Troponin leak:\n   When afebrile and s/p 5L NS, HR has fallen to 40s.  No prior EKG for\n   comparison but no evidence of ACS.  Likely demand ischemia in setting\n   of infection and tachcardia. Troponin at presentation was 0.07 and was\n   0.05 this AM.\n   - Continued downward trend of troponin post-VT code, unlikely MI\n   #  Sacral decubitis ulcer:\n   - Wound care via nursing\n   ICU Care\n   Nutrition:\n   Nutren 2.0 (Full) - [**2115-3-25**] 07:13 PM 40 mL/hour.  Will have PEG tube\n   evaluation for concern that tubing is too large for connectors.\n   Glycemic Control:\n   Lines:\n   18 Gauge - [**2115-3-24**] 05:30 PM\n   20 Gauge - [**2115-3-24**] 05:30 PM\n   Multi Lumen - [**2115-3-25**] 01:45 AM will plan to d/c central line today.\n   Will order PICC line.\n   Prophylaxis:\n   DVT:\n   Stress ulcer:\n   VAP:\n   Comments:\n   Communication:  Comments:\n   Code status: Full code\n   Disposition:  will plan to call pt out to floor today.  Rehab screen\n   today.  Will d/[**Initials (NamePattern4) **] [**Last Name (NamePattern4) **]-med.\n\n\n--------------------------------------------------------------------------------\n\n\n------------------------------Predicted Summary Naive Bayes------------------------------\nDespite cellulitis being most likely cause of fever and\n   leukocytosis, pneumonia (given chest CT) and peri-PEG infection (given\n   purulent drainage around PEG).RLE U/S negative as source of\n   fever/cellulitis.- Continue Vanc, cefepime\n   - d/c levo\n   - Follow-up blood cultures\n currently negative\n   - Follow-up wound cultures from PEG site\n GM stain GPC/GNR\n   - RLE doppler negative for DVT as source of cellulitis\n   - follow fever curve.Remains afebrile\n   #  Hypotension / Bradycardia:\n   Hypotension likely secondary to poor PO intake over several weeks in\n   addition to sepsis from infection (likely cellulitis).BP stable while in MICU.- IVF boluses as needed\n   - Hold antihypertensives (lisinopril, amlodipine, flomax) and baclofen\n   # VT code:\n   Patient with episode of pulseless/unresponsive VT on [**3-25**] 7PM, code\n   called, no shock delivered, CPR initiated but stopped immediately after\n   initiation due to pt spontaneous recovery of pulse.Awake and\n   responsive afterwards, post-code EKG without ST elevations, appears to\n   be intermittent atrial tachy with baseline bradycardia, troponin/CK\n   continued trend downward since admission.- Cardiology consult\n   - Echo today\n   - Cont telemetry\n   #  Hypernatremia:\n   Significant hypernatremia likely reflects ongoing infection and poor PO\n   intake over the course of several days to week.Now that pressures are maintained without need for\n   frequent boluses, will hold on further IVF and manage hypernatremia\n   with free water via PEG.Has fallen\n   slightly to 1.1 upon 0900 labs this AM.Likely from profound\n   dehydration and possible sepsis in setting of cellulitis.Likely elevated in\n   setting of infection.- Continue donepezil\n   - vanco/cefepime for infection\n   # Hypocalcemia:\n   Corrected Ca of 7.3 this AM.Remained stable without issues\n   - Replete and recheck Ca on PM lytes\n   #  Thrombocytopenia:\n   Unknown baseline.Is 93\n   this AM, stable in 90\ns. DIC labs negative yesterday for signs of DIC.- Continued downward trend of troponin post-VT code, unlikely MI\n   #  Sacral decubitis ulcer:\n   - Wound care via nursing\n   ICU Care\n   Nutrition:\n   Nutren 2.0 (Full) - [**2115-3-25**] 07:13 PM 40 mL/hour.Glycemic Control:\n   Lines:\n   18 Gauge - [**2115-3-24**] 05:30 PM\n   20 Gauge - [**2115-3-24**] 05:30 PM\n   Multi Lumen - [**2115-3-25**] 01:45 AM will plan to d/c central line today.\n--------------------------------------------------------------------------------\n\n\n"
     ]
    }
   ],
   "source": [
    "demo.naive_bayes_db()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## General Usage\n",
    "The model can also be run on any text by calling the naive_bayes(text) function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Large necrotic inguinal lymph node metastases bilaterally. 2) Successful aspiration of serosanguinous fluid from the fluid components of lymph nodes within both inguinal regions. This fluid was sent for gram stain and culture.  Fine needle aspiration of the solid components of lymph nodes within both inguinal regions was also performed without complication.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "This fluid was sent for gram stain and culture.\n"
     ]
    }
   ],
   "source": [
    "print(demo.naive_bayes(text))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python361064bit061169e3e2c44726897aa17d9d7ef2cc",
   "display_name": "Python 3.6.10 64-bit ('EHRKit': conda)"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  },
  "metadata": {
   "interpreter": {
    "hash": "1452ce364e145c8938a76e90050576b8a2a4d70ee75de50f3361ff243fa2a5f7"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}