<p align="center">
   <img src="https://github.com/Yale-LILY/EHRKit/blob/master/wrapper_functions/EHRLogo.png" alt="EHRKit"/>
</p>


# EHRKit: A Python Natural Language Processing Toolkit for Electronic Health Record Texts

[Python 3.6](https://www.python.org/downloads/release/python-360/)

This library processes medical text in electronic health records. We provide functions for efficient access to [MIMIC-III](https://physionet.org/content/mimiciii-demo/) records, including search by record ID, search for similar records, and search with an input query. We also support several NLP tasks, including abbreviation disambiguation and extractive and abstractive summarization. For a more detailed evaluation, please see this [pre-print](https://arxiv.org/abs/2204.06604).

For general medical text, we integrate third-party libraries, including [Hugging Face](https://huggingface.co/), [scispacy](https://allenai.github.io/scispacy/), [allennlp](https://github.com/allenai/allennlp), and [stanza](https://stanfordnlp.github.io/stanza/), among others. Please check out the special version of this library, [EHRKit-WF](https://github.com/Yale-LILY/EHRKit/tree/master/wrapper_functions).

<p align="center">
   <img src="https://github.com/Yale-LILY/EHRKit-2022/blob/main/ehrkit.jpg" alt="EHRKit"/>
</p>

## Table of Contents

1. [Updates](#updates)
2. [Data](#data)
3. [Setup](#setup)
4. [Toolkit](#toolkit)
5. [Get Involved](#get-involved)
6. [Off-shelf Functions](#off-shelf-functions-for-medical-text-processing-ehrkit-wf)
<!-- 6. [Citation](#get-involved) -->


## Updates
_24_05_2023_ - Released new pretrained models for machine translation. <br/>
_15_03_2022_ - Merged a wrapper function folder to support off-shelf medical text processing. <br/>
_10_03_2022_ - Made all tests available in an ipynb file and updated to the most recent version. <br/>
_12_17_2021_ - Added new folder `collated_tasks` containing Fall 2021 functionalities. <br/>
_05_11_2021_ - Cleaned up the notebooks and updated the readme to clone with depth=1. <br/>
_05_04_2021_ - Added a tests run-through in `tests`. <br/>
_04_22_2021_ - Freezing development. <br/>
_04_22_2021_ - Completed the tutorials and readme. <br/>
_04_20_2021_ - Spring functionality finished: MIMIC classification, summarization, and query extraction. <br/>

## Data
EHRKit is built for use with the Medical Information Mart for Intensive Care III (MIMIC-III) and requires this dataset to be downloaded. The dataset is freely available to the public, but access requires completion of an online training course. Information on accessing MIMIC-III can be found at https://mimic.physionet.org/gettingstarted/access. Once this process is complete, we recommend downloading the MIMIC files to the folder `data/`.

The other dataset required by some of the modules is the [PubMed dataset](https://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/DATASET/), which contains a large number of medical articles. All of the required downloading and parsing is performed in the `pubmed/` folder: first run `bash download_pubmed.sh`, then `python parse_articles.py`. This process is also detailed in the tutorial notebook for summarization, `tutorials/naiveBayes.ipynb`.

## Setup

### Download Repository

EHRKit is distributed as a git repository. Clone it into a directory of your choice; a shallow clone (`--depth=1`) leaves out the old versions and reduces the download size.
```
git clone https://github.com/Yale-LILY/EHRKit.git --depth=1
```

#### Environment Option 1: using conda
See the `environment.yml` file for specific requirements. Aside from basic conda packages, PyTorch and Transformers are required for the more advanced models.


To create the required environment, go to the root directory of EHRKit and run:
```
conda env create -f environment.yml --name <ENV_NAME>
```

For local LiLY lab users on tangra, setup works a little differently:
```
pip install nltk
pip install pymysql
pip install requests
pip install gensim
pip install torch
pip install scikit-learn
pip install spacy
python -m spacy download en_core_web_sm
pip install -U pip setuptools wheel
pip install -U spacy[cuda102]
pip install transformers
```

#### Environment Option 2: using virtualenv
```
cd EHRKit
python3 -m venv ehrvir/
source ehrvir/bin/activate
pip install -r requirements.txt
```
Then you are good to go!

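Before running the tests, it can help to confirm that the environment actually contains the packages named in the pip commands above. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the package list are illustrative assumptions, not part of EHRKit, and your `requirements.txt` may list additional packages.

```python
from importlib.util import find_spec

# Import names for the packages installed in the pip commands above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = ["nltk", "pymysql", "requests", "gensim",
            "torch", "sklearn", "spacy", "transformers"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it inside the activated environment; any name it reports as missing can be installed with `pip install` before moving on to the tests.
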
### Testing

You can test your installation (assuming you are in the `/EHRKit/` folder) and get familiar with the library through `tests/`. Note that this only works once the SQL MIMIC database has been set up.

```
python -m spacy download en_core_web_sm  # some spaCy extras must be downloaded
python -m tests.tests
# If you want to run all the tests, including the longer tests
python -m tests.all_tests
```


Most of the modules access the data through a SQL database. The construction of the database is described in `database_readmes`.

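To give a feel for what SQL-backed access by record ID and by query looks like, here is a small self-contained sketch. It is not EHRKit's own code: an in-memory SQLite database stands in for the MySQL MIMIC database, the three notes are made-up placeholders, the table keeps only a few of the `NOTEEVENTS` columns, and the helper names `get_note` and `search_notes` are hypothetical.

```python
import sqlite3

# In-memory SQLite is a stand-in for the real MIMIC SQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE NOTEEVENTS ('
    'ROW_ID INTEGER PRIMARY KEY, SUBJECT_ID INTEGER, CATEGORY TEXT, "TEXT" TEXT)'
)
# Made-up placeholder notes, not real patient data.
conn.executemany(
    "INSERT INTO NOTEEVENTS VALUES (?, ?, ?, ?)",
    [
        (1, 100, "Discharge summary", "Patient admitted with chest pain; troponin negative."),
        (2, 101, "Radiology", "Chest X-ray shows right lower lobe pneumonia."),
        (3, 100, "Nursing", "Patient resting comfortably, no chest pain reported."),
    ],
)


def get_note(conn, row_id):
    """Return the text of the note with the given ROW_ID, or None if absent."""
    row = conn.execute(
        'SELECT "TEXT" FROM NOTEEVENTS WHERE ROW_ID = ?', (row_id,)
    ).fetchone()
    return row[0] if row else None


def search_notes(conn, query):
    """Return ROW_IDs of notes whose text contains the query string."""
    rows = conn.execute(
        'SELECT ROW_ID FROM NOTEEVENTS WHERE "TEXT" LIKE ? ORDER BY ROW_ID',
        (f"%{query}%",),
    ).fetchall()
    return [r[0] for r in rows]


print(get_note(conn, 2))
print(search_notes(conn, "chest pain"))
```

The queries are parameterized rather than built by string concatenation. With `pymysql` the same pattern applies, but the placeholder is `%s` instead of `?`.
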
#### MIMIC
EHRKit requires the Medical Information Mart for Intensive Care III (MIMIC-III) database to be installed. This database is freely available to the public, but access requires completion of an online training course. Information on accessing MIMIC-III can be found at https://mimic.physionet.org/gettingstarted/access.

Once you have access, put the MIMIC data in the folder `data`.

### Pubmed
Some of the modules also require the PubMed dataset. The download and parsing steps are described under [Data](#data): in the `pubmed/` folder, first run `bash download_pubmed.sh`, then `python parse_articles.py`.

### Getting started
Once the required data has been downloaded, new users are encouraged to explore the Jupyter notebooks in the `tutorials/` folder. These tutorials cover summarization, ICD-9 code classification, and query extraction.

## Toolkit
Jupyter notebook walkthroughs for some of these packages are available in `tutorials/`. These tutorials are the best way for a newcomer to become familiar with the packages, so, in the interest of consolidation, that information is not repeated here.

In addition to these most recent models, a number of other packages do not have full tutorials, as they are built for different kinds of interaction. These packages have their own readmes, written before we switched to the tutorial model. A full list of the modules is below.


### Modules
- `summarization/` has a naive Bayes model developed by Jeremy. This model is built for extractive summarization of medical text and is trained on the PubMed corpus.
- `mimic_icd9_coding/` contains a general pipeline for classifying clinical notes from the MIMIC dataset into ICD-9 billing codes. These codes contain diagnoses, among other information.
- `QueryExtraction/` demonstrates simple tools for performing automated query-based extraction from text, which can easily be run on MIMIC data.
- `extractiveSummarization/` contains an implementation of LexRank for MIMIC-III and PubMed. In the future we will add a test in tests.py that runs it on an EHR. It was developed by Sanya Nijhawan, B.S. CS '20.
- `allennlp/` has scripts used by tests 6.1 and 6.2 in tests.py for efficiently calling functions in the allennlp library.
- `ehrkit/` contains scripts that enable interaction with MIMIC data.
- `pubmed/` contains scripts for downloading the PubMed corpus of biomedical research papers. Once downloaded, the papers are stored inside this directory.
- `tests/` has tests on the MIMIC dataset (in tests.py) and the PubMed corpus (in pubmed_tests.py).
- `collated_tasks/` has a collection of tasks on MIMIC data, including extracting named entities, abbreviations, hyponyms and linked entities, machine translation, sentence segmentation, document clustering, and retrieving similar documents. It also contains auxiliary functions, including retrieving notes from NOTEEVENTS.csv and creating vector representations using bag-of-words or pre-trained transformer models. Tutorials for tasks on non-MIMIC data are also available for de-identification, inference, and medical question answering. Developed by Keen during Fall 2021.


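As a rough illustration of the bag-of-words route to similar-document retrieval mentioned for `collated_tasks/`, the sketch below ranks documents by cosine similarity of raw token counts. It uses only the standard library and is not the toolkit's implementation; the function names and the simple regex tokenizer are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents, top_k=1):
    """Return indices of the top_k documents most similar to the query."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(doc)), index)
        for index, doc in enumerate(documents)
    ]
    # Highest similarity first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


notes = [
    "Patient has chest pain and shortness of breath.",
    "Follow up visit after knee surgery.",
    "No fever, cough resolved.",
]
print(most_similar("chest pain", notes))
```

Raw counts keep the example short. Swapping `bag_of_words` for vectors from a pre-trained transformer model would leave `cosine_similarity` and `most_similar` conceptually unchanged.
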
## Get involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!



# Off-shelf Functions for Medical Text Processing: EHRKit-WF
If you want to process general medical text, please check EHRKit-WF in [wrapper_functions](https://github.com/Yale-LILY/EHRKit/tree/master/wrapper_functions). Note: you do not need MIMIC-III access to run it. We support the following key functions:

- Abbreviation Detection & Expansion
- Hyponym Detection
- Entity Linking
- Named Entity Recognition
- Translation
- Sentencizer
- Document Clustering
- Similar Document Retrieval
- Word Tokenization
- Negation Detection
- Section Detection
- UMLS Concept Extraction