# EHRKit-WF: we provide many off-the-shelf methods for processing medical text.
[Python 3.6](https://www.python.org/downloads/release/python-360/)
[Python 3.8](https://www.python.org/downloads/release/python-380/)
## Overview
We integrate various text-processing tools, with a focus on the medical domain, into a single, user-friendly toolkit.

## Installation
### Create virtual environment

```bash
python3 -m venv virtenv/
source virtenv/bin/activate
```
### Install packages
```bash
pip install pip==21.3.1
pip install -U spacy
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install pandas==1.1.5
pip install numpy==1.19.5
pip install transformers==4.12.3
pip install torch==1.10.0
pip install torchvision==0.11.1
pip install sentencepiece==0.1.96
pip install scikit-learn
pip install PyRuSH
pip install stanza==1.3.0
pip install ipywidgets
pip install ipykernel
pip install summa==1.2.0
pip install negspacy==1.0.2
pip install medspacy==0.2.0.0
wget https://github.com/jianlins/PyRuSH/raw/master/conf/rush_rules.tsv -P conf
```

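Because several packages are pinned to exact versions, it can help to confirm that the active environment matches the pins before running anything. The following is a minimal sketch, not part of EHRKit: `PINNED` and `find_mismatches` are hypothetical names introduced for illustration, the pins are copied from the install commands above, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

# Pins copied from the install commands above (hypothetical helper table).
PINNED = {
    "pandas": "1.1.5",
    "numpy": "1.19.5",
    "transformers": "4.12.3",
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "sentencepiece": "0.1.96",
    "stanza": "1.3.0",
    "summa": "1.2.0",
    "negspacy": "1.0.2",
    "medspacy": "0.2.0.0",
}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu113" when comparing.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is present at the expected version; anything printed is a candidate cause for the errors listed under Troubleshooting.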
## Quick Start
Here, we show examples of running a single-document task and a multi-document task.

For a complete run-through of all tasks, run the demo script with `python demo.py`.

For a comprehensive tutorial, check the [tutorial notebook](https://github.com/karenacorn99/LILY-EHRKit/blob/main/EHRKit_tutorials.ipynb).

### Single-document Task Example
Single-document tasks operate on a single free-text record.
```python
from EHRKit import EHRKit

# create kit
kit = EHRKit()

main_record = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

# add main_record
kit.update_and_delete_main_record(main_record)

# call single-document tasks on main_record
kit.get_abbreviations()
# >> [('SBMA', 'Spinal and bulbar muscular atrophy'),
#     ('SBMA', 'Spinal and bulbar muscular atrophy'),
#     ('AR', 'androgen receptor')]
```

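The example output lists the `SBMA` pair twice, once per occurrence in the record. If only distinct pairs are needed, they can be deduplicated in plain Python. This is a small sketch that assumes the result is a list of `(abbreviation, expansion)` tuples as displayed above; the variable names are illustrative only.

```python
# Pairs as shown in the example output: (abbreviation, expansion).
pairs = [
    ("SBMA", "Spinal and bulbar muscular atrophy"),
    ("SBMA", "Spinal and bulbar muscular atrophy"),
    ("AR", "androgen receptor"),
]

# dict.fromkeys keeps the first occurrence of each pair and preserves order.
unique_pairs = list(dict.fromkeys(pairs))

# Build a lookup table from abbreviation to expansion.
expansions = dict(unique_pairs)

print(unique_pairs)
print(expansions["AR"])
```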
### Multi-document Task Example
Multi-document tasks operate on several free-text records.
```python
from EHRKit import EHRKit

# create kit
kit = EHRKit()

# A document about neurons.
record = "Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, " \
    "the cells responsible for receiving sensory input from the external world, for sending motor commands to " \
    "our muscles, and for transforming and relaying the electrical signals at every step in between. More than " \
    "that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do" \
    " interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, " \
    "although it’s not really known)."

# A document about neural networks.
cand1 = "A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of " \
    "data through a process that mimics the way the human brain operates. In this sense, neural networks refer to " \
    "systems of neurons, either organic or artificial in nature."

# A document about aspirin.
cand2 = "Prescription aspirin is used to relieve the symptoms of rheumatoid arthritis (arthritis caused by swelling " \
    "of the lining of the joints), osteoarthritis (arthritis caused by breakdown of the lining of the joints), " \
    "systemic lupus erythematosus (condition in which the immune system attacks the joints and organs and causes " \
    "pain and swelling) and certain other rheumatologic conditions (conditions in which the immune system " \
    "attacks parts of the body)."

# Another document about aspirin.
cand3 = "People can buy aspirin over the counter without a prescription. Everyday uses include relieving headache, " \
    "reducing swelling, and reducing a fever. Taken daily, aspirin can lower the risk of cardiovascular events, " \
    "such as a heart attack or stroke, in people with a high risk. Doctors may administer aspirin immediately" \
    " after a heart attack to prevent further clots and heart tissue death."

# add main_record
kit.update_and_delete_main_record(record)

# add supporting_records
kit.replace_supporting_records([cand1, cand2, cand3])

# perform k-means clustering on the 4 documents
kit.get_clusters(k=2)

# >>                                   note  cluster
# 0    Neurons (also called neurones or ...        0
# 1     A neural network is a series of ...        0
# 2     Prescription aspirin is used to ...        1
# 3     People can buy aspirin over the ...        1
```

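To use the cluster assignments programmatically, the rows can be grouped by label. The following is a minimal sketch that assumes the result has been converted to `(note, cluster)` pairs; the actual return type of `get_clusters` is not specified here, and `group_by_cluster` is a hypothetical helper. The rows are transcribed from the example output above.

```python
from collections import defaultdict

# (note, cluster) pairs transcribed from the example output.
rows = [
    ("Neurons (also called neurones or ...", 0),
    ("A neural network is a series of ...", 0),
    ("Prescription aspirin is used to ...", 1),
    ("People can buy aspirin over the ...", 1),
]


def group_by_cluster(rows):
    """Map each cluster label to the list of notes assigned to it, in input order."""
    groups = defaultdict(list)
    for note, cluster in rows:
        groups[cluster].append(note)
    return dict(groups)


groups = group_by_cluster(rows)
for cluster, notes in sorted(groups.items()):
    print(cluster, len(notes))
```

With `k=2`, the two neuron-related notes end up under one label and the two aspirin notes under the other, matching the table in the example.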
### Key Functions
- Abbreviation Detection & Expansion
- Hyponym Detection
- Entity Linking
- Named Entity Recognition
- Translation
- Sentencizer
- Document Clustering
- Similar Document Retrieval
- Word Tokenization
- Negation Detection
- Section Detection
- UMLS Concept Extraction

### New Release Models for Machine Translation - May, 2023
We fine-tuned Transformer models on the [UFAL data](https://ufal.mff.cuni.cz/ufal_medical_corpus) to support more languages. Feel free to download the [MT5-based](https://huggingface.co/qcz) models; more models can be found in the [SciFive-based](https://huggingface.co/irenelizihui/scifive_ufal_MT_en_es/) release.

## Troubleshooting 🔧

### `ModuleNotFoundError: No module named 'click._bashcomplete'`

You may have a dependency conflict, with the wrong version of click installed. Try `pip install click==7.1.1`.

### The demo.py file outputs "Killed" with no error message.

Your machine most likely does not have enough memory (or other CPU/GPU resources) to run the model, so the operating system killed the process when it ran out of resources.

### `TypeError: 'module' object is not callable`

PyRuSH does not expose `RuSH` the same way on all machines: in some installations the imported name `RuSH` is a module rather than the class. Try replacing the line `rush = RuSH('conf/rush_rules.tsv')` with `rush = RuSH.RuSH('conf/rush_rules.tsv')` in the `utils.py` file.
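One way to make a call site tolerant of both layouts is to resolve the class before calling it. The sketch below is an illustration, not part of EHRKit or PyRuSH: `resolve_callable` is a hypothetical helper, and the stand-in module built with `types.ModuleType` only imitates the situation described above so that the example runs without PyRuSH installed.

```python
import inspect
import types


def resolve_callable(obj, attr_name):
    """Return `obj` unless it is a module, in which case return its attribute `attr_name`."""
    if inspect.ismodule(obj):
        # The import produced a module, so the class lives one level down (e.g. RuSH.RuSH).
        return getattr(obj, attr_name)
    return obj


# Stand-in class and module imitating the two import shapes (hypothetical, for illustration).
class FakeRuSH:
    def __init__(self, rules_path):
        self.rules_path = rules_path


fake_module = types.ModuleType("RuSH")
fake_module.RuSH = FakeRuSH

# Both shapes resolve to the same class.
cls_from_module = resolve_callable(fake_module, "RuSH")
cls_direct = resolve_callable(FakeRuSH, "RuSH")

rush = cls_from_module("conf/rush_rules.tsv")
print(type(rush).__name__, rush.rules_path)
```

Applied to `utils.py`, this would amount to `rush = resolve_callable(RuSH, "RuSH")("conf/rush_rules.tsv")`, assuming `RuSH` is the name imported there.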

### `AttributeError: 'IntervalTree' object has no attribute 'search'`

Another dependency conflict: the `search` method was removed in intervaltree 3.x, so an older release is needed. Try `pip install intervaltree==2.1.0`.