We integrate various text-processing tools, with a focus on the medical domain, into one single, user-friendly toolkit.
python3 -m venv virtenv/
source virtenv/bin/activate
pip install pip==21.3.1
pip install -U spacy
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install pandas==1.1.5
pip install numpy==1.19.5
pip install transformers==4.12.3
pip install torch==1.10.0
pip install torchvision==0.11.1
pip install sentencepiece==0.1.96
pip install sklearn
pip install PyRuSH
pip install stanza==1.3.0
pip install ipywidgets
pip install ipykernel
pip install summa==1.2.0
pip install negspacy==1.0.2
pip install medspacy==0.2.0.0
wget https://github.com/jianlins/PyRuSH/raw/master/conf/rush_rules.tsv -P conf
Here, we show examples of runnng a single-document task and a multi-document task.
For a complete run-through of all tasks, run the demo script by using python demo.py
.
For a comprehensive tutorial, check the tutorial notebook.
Single document tasks operates on a single free-text record.
from EHRKit import EHRkit
# create kit
kit = EHRKit()
main_record = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."
# add main_record
kit.update_and_delete_main_record(main_record)
# call single-document tasks on main_record
kit.get_abbreviations()
>> [('SBMA', 'Spinal and bulbar muscular atrophy'),
('SBMA', 'Spinal and bulbar muscular atrophy'),
('AR', 'androgen receptor')]
Multi-document tasks operate on several free-text records.
from EHRKit import EHRkit
# create kit
kit = EHRKit()
''' A document about neuron.'''
record = "Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, " \
"the cells responsible for receiving sensory input from the external world, for sending motor commands to " \
"our muscles, and for transforming and relaying the electrical signals at every step in between. More than " \
"that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do" \
" interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, " \
"although it’s not really known)."
''' A document about neural network. '''
cand1 = "A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of " \
"data through a process that mimics the way the human brain operates. In this sense, neural networks refer to " \
"systems of neurons, either organic or artificial in nature."
''' A document about aspirin. '''
cand2 = "Prescription aspirin is used to relieve the symptoms of rheumatoid arthritis (arthritis caused by swelling " \
"of the lining of the joints), osteoarthritis (arthritis caused by breakdown of the lining of the joints), " \
"systemic lupus erythematosus (condition in which the immune system attacks the joints and organs and causes " \
"pain and swelling) and certain other rheumatologic conditions (conditions in which the immune system " \
"attacks parts of the body)."
''' Another document about aspirin. '''
cand3 = "People can buy aspirin over the counter without a prescription. Everyday uses include relieving headache, " \
"reducing swelling, and reducing a fever. Taken daily, aspirin can lower the risk of cardiovascular events, " \
"such as a heart attack or stroke, in people with a high risk. Doctors may administer aspirin immediately" \
" after a heart attack to prevent further clots and heart tissue death."
# add main_record
kit.update_and_delete_main_record(record)
# add supporting_records
kit.replace_supporting_records([cand1, cand2, cand3])
# performs k-means clustering on the 4 documents
kit.get_clusters(k=2)
>> note cluster
0 Neurons (also called neurones or ... 0
1 A neural netwrok is a series of ... 0
2 Prescription aspirin is used to ... 1
3 People can buy aspirin over the ... 1
We fine-tuned on the UFAL data to support more languages, feel free to download the Transformer models MT5-based, more models Users can also be found SciFive-based.
ModuleNotFoundError: No module named 'click._bashcomplete'
You may have dependency confusion and have the wrong version of click installed. Try pip install click==7.1.1
.
Your computer does not have enough CPU/GPU/RAM to run this model so your kernel shut down the process because it was starved for resources.
TypeError: 'module' object is not callable
For some reason the PyRuSH module does not behave the same on all machines. Try replacing the line rush = RuSH('conf/rush_rules.tsv')
with rush = RuSH.RuSH('conf/rush_rules.tsv')
in the utils.py
file.
AttributeError: 'IntervalTree' object has no attribute 'search'
Another dependency confusion error: try pip install intervaltree==2.1.0
.