<p align="center">
   <img src="https://github.com/Yale-LILY/EHRKit/blob/master/wrapper_functions/EHRLogo.png" alt="EHRKit"/>
</p>

# EHRKit: A Python Natural Language Processing Toolkit for Electronic Health Record Texts

[![Python 3.6.13](https://img.shields.io/badge/python-3.6.13-green.svg)](https://www.python.org/downloads/release/python-360/)

This library processes medical texts in electronic health records. We provide functions for accessing [MIMIC-III](https://physionet.org/content/mimiciii-demo/) records efficiently, including searching by record ID, retrieving similar records, and searching with an input query. We also support functions for several NLP tasks, including abbreviation disambiguation and extractive and abstractive summarization. For a detailed evaluation, please see this [pre-print](https://arxiv.org/abs/2204.06604).

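As a toy illustration of the query-based record search described above, here is a self-contained sketch that ranks free-text records against a query using IDF-weighted word overlap. It is an illustrative stand-in, not EHRKit's actual API (which operates on the MIMIC-III database); the example records are invented:

```python
import math
from collections import Counter

def rank_records(query, records):
    """Rank records by overlap with the query, weighting rare words higher (IDF)."""
    docs = [r.lower().split() for r in records]
    n = len(docs)
    # Document frequency of each word, then inverse document frequency.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    q = set(query.lower().split())
    # Each record's score is the summed IDF of query words it contains.
    scores = [sum(idf.get(w, 0.0) for w in set(d) if w in q) for d in docs]
    return sorted(range(n), key=lambda i: -scores[i])

records = [
    "patient admitted with acute myocardial infarction",
    "routine follow up for diabetes mellitus type 2",
    "chest pain ruled out myocardial infarction troponin negative",
]
print(rank_records("myocardial infarction", records))  # most relevant records first
```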
Moreover, to work with general medical texts, we integrate third-party libraries, including [Hugging Face](https://huggingface.co/), [scispacy](https://allenai.github.io/scispacy/), [allennlp](https://github.com/allenai/allennlp), [stanza](https://stanfordnlp.github.io/stanza/), and more. Please check out the special version of this library, [EHRKit-WF](https://github.com/Yale-LILY/EHRKit/tree/master/wrapper_functions).

<p align="center">
   <img src="https://github.com/Yale-LILY/EHRKit-2022/blob/main/ehrkit.jpg" alt="EHRKit"/>
</p>

## Table of Contents

1. [Updates](#updates)
2. [Data](#data)
3. [Setup](#setup)
4. [Toolkit](#toolkit)
5. [Get Involved](#get-involved)
6. [Off-shelf Functions](#off-shelf-functions-for-medical-text-processing-ehrkit-wf)
<!-- 6. [Citation](#get-involved) -->

## Updates

_24_05_2023_ - New release: pretrained models for machine translation. <br/>
_15_03_2022_ - Merged a wrapper-function folder to support off-the-shelf medical text processing. <br/>
_10_03_2022_ - Made all tests available in an ipynb file and updated to the most recent version. <br/>
_12_17_2021_ - New folder `collated_tasks` containing Fall 2021 functionalities added. <br/>
_05_11_2021_ - Cleaned up the notebooks; fixed up the readme using depth=1. <br/>
_05_04_2021_ - Tests run-through added in `tests`. <br/>
_04_22_2021_ - Freezing development. <br/>
_04_22_2021_ - Completed the tutorials and readme. <br/>
_04_20_2021_ - Spring functionality finished: MIMIC classification, summarization, and query extraction. <br/>

## Data

EHRKit is built for use with the Medical Information Mart for Intensive Care III (MIMIC-III) database and requires this dataset to be downloaded. The dataset is freely available to the public, but access requires completion of an online training course. Information on accessing MIMIC-III can be found at https://mimic.physionet.org/gettingstarted/access. Once this process is complete, we recommend downloading the MIMIC files to the folder `data/`.

The other dataset required by some of the modules is the [PubMed dataset](https://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/DATASET/), which contains a large number of medical articles. The required downloading and parsing are all performed in the `pubmed/` folder: first run `bash download_pubmed.sh` and then `python parse_articles.py`. This process is also detailed in the tutorial notebook for summarization: `tutorials/naiveBayes.ipynb`.

## Setup

### Download Repository

You can download EHRKit as a git repository; simply clone it into your directory of choice (keep the depth small to leave out old versions and reduce download size):

```
git clone https://github.com/Yale-LILY/EHRKit.git --depth=1
```

#### Environment Option 1: using conda

See the `environment.yml` file for specific requirements. Aside from basic conda packages, PyTorch and transformers are required for the more advanced models.

To create the required environment, go to the root directory of EHRKit and run:

```
conda env create -f environment.yml --name <ENV_NAME>
```

For local LILY lab users on tangra, setup works a little differently:

```
pip install nltk
pip install pymysql
pip install requests
pip install gensim
pip install torch
pip install scikit-learn
pip install spacy
python -m spacy download en_core_web_sm
pip install -U pip setuptools wheel
pip install -U spacy[cuda102]
pip install transformers
```

#### Environment Option 2: using virtualenv

```
cd EHRKit
python3 -m venv ehrvir/
source ehrvir/bin/activate
pip install -r requirements.txt
```

Then you are good to go!


### Testing

You can test your installation (assuming you are in the `EHRKit/` folder) and get familiar with the library through `tests/`. Note that this will only work with the SQL MIMIC database set up.

```
python -m spacy download en_core_web_sm  # some spaCy extras must be downloaded
python tests/tests.py
# If you want to run all the tests, including the longer ones
python tests/all_tests.py
```

Most of the modules access the data through a SQL database. The construction of the database is described in `database_readmes`.

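The access pattern can be illustrated with an in-memory SQLite stand-in for the note table (the real setup targets the MySQL database described in `database_readmes`; the table shape mirrors MIMIC's NOTEEVENTS, and the rows below are invented for illustration):

```python
import sqlite3

# Stand-in for the MIMIC database: an in-memory SQLite table shaped like
# NOTEEVENTS (ROW_ID, SUBJECT_ID, TEXT). Only the query pattern is meant
# to carry over to the real MySQL setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NOTEEVENTS (ROW_ID INTEGER, SUBJECT_ID INTEGER, TEXT TEXT)")
conn.executemany(
    "INSERT INTO NOTEEVENTS VALUES (?, ?, ?)",
    [(1, 101, "Discharge summary for pneumonia."),
     (2, 102, "Progress note: stable overnight.")],
)

def get_note(row_id):
    # Fetch one note by record ID -- the same lookup-by-ID access
    # that the toolkit's database functions provide.
    cur = conn.execute("SELECT TEXT FROM NOTEEVENTS WHERE ROW_ID = ?", (row_id,))
    row = cur.fetchone()
    return row[0] if row else None

print(get_note(1))  # Discharge summary for pneumonia.
```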
### MIMIC

EHRKit requires the Medical Information Mart for Intensive Care III (MIMIC-III) database to be installed. This database is freely available to the public, but access requires completion of an online training course. Information on accessing MIMIC-III can be found at https://mimic.physionet.org/gettingstarted/access.

Once you have been granted access, put the MIMIC data in the folder `data/`.

### PubMed

The [PubMed dataset](https://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/DATASET/) is also required by some of the modules; see the [Data](#data) section above. Downloading and parsing are performed in the `pubmed/` folder: run `bash download_pubmed.sh` and then `python parse_articles.py`.

### Getting started

Once the required data has been downloaded, new users are encouraged to explore the Jupyter notebooks in `tutorials/`, which cover summarization, ICD9 code classification, and query extraction.

## Toolkit

Jupyter notebook walkthroughs for some of these packages are available in `tutorials/`. These tutorials are the best way for a newcomer to get familiar with these modules, and in the interest of consolidation, that information is not repeated here.

In addition to these most recent models, there are a number of other packages that do not have full tutorials, as they are built for different interactions. Readmes are provided for these packages (written before we switched to the tutorial model). A full list of the modules is below.

### Modules
- `summarization/` has a Naive Bayes model developed by Jeremy. This model is built for extractive summarization of medical text and trained on the PubMed corpus.
- `mimic_icd9_coding/` contains a general pipeline for classifying clinical notes from the MIMIC dataset into ICD9 billing codes. These codes contain diagnoses, among other information.
- `QueryExtraction/` demonstrates simple tools for performing automated query-based extraction from text, which can easily be run on MIMIC data.
- `extractiveSummarization/` contains an implementation of LexRank for MIMIC-III and PubMed. In the future we will add a test in tests.py that runs it on an EHR. It was developed by Sanya Nijhawan, B.S. CS '20.
- `allennlp/` has scripts utilized by tests 6.1 and 6.2 in tests.py for efficiently calling functions in the allennlp library.
- `ehrkit/` contains scripts that enable interaction with MIMIC data.
- `pubmed/` contains scripts for downloading the PubMed corpus of biomedical research papers. Once downloaded, the papers are stored inside this directory.
- `tests/` has tests on the MIMIC dataset (in tests.py) and the PubMed corpus (in pubmed_tests.py).
- `collated_tasks/` has a collection of tasks on MIMIC data, including extracting named entities, abbreviations, hyponyms & linked entities, machine translation, sentence segmentation, document clustering, and retrieving similar documents. It also contains auxiliary functions, including retrieving notes from NOTEEVENTS.csv and creating vector representations using bag-of-words or pre-trained transformer models. Tutorials for tasks on non-MIMIC data are also available for de-identification, inference, and medical question answering. Developed by Keen during Fall 2021.

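As a rough sketch of the LexRank idea behind `extractiveSummarization/` (the module's actual implementation may differ), the following scores sentences by centrality in a similarity graph using dependency-free cosine similarity and damped power iteration; the example sentences are invented:

```python
import math
import re
from collections import Counter

def lexrank(sentences, iters=30, d=0.85):
    """Score sentences by graph centrality (a simplified LexRank)."""
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(bags)

    def cos(a, b):
        # Cosine similarity between two word-count vectors.
        num = sum(a[w] * b[w] for w in a if w in b)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    # Similarity graph without self-loops.
    sim = [[0.0 if i == j else cos(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration with damping, as in PageRank
        new = []
        for i in range(n):
            rank = sum(sim[j][i] / sum(sim[j]) * scores[j]
                       for j in range(n) if sum(sim[j]))
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

sents = [
    "The patient was admitted with chest pain.",
    "Chest pain resolved after treatment.",
    "The patient reports chest pain on exertion.",
    "Labs were unremarkable.",
]
scores = lexrank(sents)
# The most central sentence becomes the one-line extractive summary.
summary = sents[max(range(len(sents)), key=scores.__getitem__)]
```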

## Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!

# Off-shelf Functions for Medical Text Processing: EHRKit-WF

If you want to process general medical text, please check out EHRKit-WF in [wrapper_functions](https://github.com/Yale-LILY/EHRKit/tree/master/wrapper_functions). Note: you do not need MIMIC-III access to run this. We support the following key functions:

- Abbreviation Detection & Expansion
- Hyponym Detection
- Entity Linking
- Named Entity Recognition
- Translation
- Sentencizer
- Document Clustering
- Similar Document Retrieval
- Word Tokenization
- Negation Detection
- Section Detection
- UMLS Concept Extraction
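
To give a flavor of one of these functions, here is a minimal NegEx-style negation detector. It is an illustrative sketch only: EHRKit-WF delegates negation detection to third-party libraries rather than rules like these, and the trigger list below is an assumption:

```python
import re

# Hypothetical trigger list for this sketch; real systems use larger lexicons.
NEG_TRIGGERS = ["no", "denies", "without", "ruled out", "negative for"]

def is_negated(text, finding, window=5):
    """Return True if `finding` occurs with a negation trigger in the
    `window` tokens immediately before it (NegEx-style, pre-scope only)."""
    tokens = re.findall(r"\w+", text.lower())
    target = finding.lower().split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            scope = " ".join(tokens[max(0, i - window):i])
            # Match triggers as whole words/phrases, not substrings.
            if any(f" {t} " in f" {scope} " for t in NEG_TRIGGERS):
                return True
    return False

print(is_negated("Patient denies chest pain or dyspnea.", "chest pain"))    # True
print(is_negated("Patient reports chest pain on exertion.", "chest pain"))  # False
```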