# Medical natural language parsing and utility library

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]
[![Build Status][build-badge]][build-link]

A natural language medical domain parsing library. This library:

- Provides an interface to the [UTS] ([UMLS] Terminology Services) RESTful
  service with data caching (NIH login needed).
- Wraps the [MedCAT] library by parsing medical and clinical text into
  first-class Python objects reflecting the structure of the natural language,
  complete with [UMLS] entity linking with [CUIs] and other domain-specific
  features.
- Combines non-medical (such as POS and NER tags) and medical features (such as
  [CUIs]) in one API and resulting data structure and/or as a [Pandas] data
  frame.
- Provides [cui2vec] as a [word embedding model] for either fast indexing and
  access or to use directly as features in a [Zensols Deep NLP embedding layer]
  model.
- Provides access to [cTAKES] using a dictionary-like [Stash] abstraction.
- Includes a command line program to access all of these features without
  having to write any code.


## Documentation

See the [full documentation](https://plandes.github.io/mednlp/index.html).
The [API reference](https://plandes.github.io/mednlp/api.html) is also
available.


## Installing

Install the library using a Python package manager such as `pip`:
```bash
pip3 install zensols.mednlp
```

### CUI Embeddings

To use the `cui2vec` functionality, the embeddings must be *manually*
downloaded. Start with these commands:
```bash
mkdir -p ~/.cache/zensols/mednlp
wget -O ~/.cache/zensols/mednlp/cui2vec.zip 'https://figshare.com/ndownloader/files/10959626?private_link=00d69861786cd0156d81'
```
If the download fails or the file is not a zip file (but rather the text of an
HTML error message), download the file manually by
[browsing](https://figshare.com/ndownloader/files/10959626) to it, and
then move it to `~/.cache/zensols/mednlp/cui2vec.zip`.
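
One way to tell a real archive from a saved HTML error page is to test the file with the standard library before using it. The following is a minimal sketch, not part of this library: the function name `check_cui2vec_archive` and its return values are hypothetical, and only the cache path comes from the instructions above.

```python
import zipfile
from pathlib import Path


def check_cui2vec_archive(path: Path) -> str:
    """Return 'ok', 'missing', or 'not-a-zip' for the downloaded archive.

    Hypothetical helper for illustration; it is not part of zensols.mednlp.
    """
    if not path.is_file():
        return 'missing'
    # is_zipfile looks for the zip end-of-central-directory record, which an
    # HTML error page saved under a .zip name will not contain
    if not zipfile.is_zipfile(path):
        return 'not-a-zip'
    return 'ok'


if __name__ == '__main__':
    # the location used by the download commands above
    archive = Path.home() / '.cache' / 'zensols' / 'mednlp' / 'cui2vec.zip'
    print(archive, check_cui2vec_archive(archive))
```

A result of `not-a-zip` suggests the download returned an error page, in which case the manual browser download described above is the fallback.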


## Usage

To parse text, create features, and extract clinical concept identifiers:
```python
>>> from zensols.mednlp import ApplicationFactory
>>> doc_parser = ApplicationFactory.get_doc_parser()
>>> doc = doc_parser('John was diagnosed with kidney failure')
>>> for tok in doc.tokens: print(tok.norm, tok.pos_, tok.tag_, tok.cui_, tok.detected_name_)
John PROPN NNP -<N>- -<N>-
was AUX VBD -<N>- -<N>-
diagnosed VERB VBN -<N>- -<N>-
with ADP IN -<N>- -<N>-
kidney NOUN NN C0035078 kidney~failure
failure NOUN NN C0035078 kidney~failure
>>> print(doc.entities)
(<John>, <kidney failure>)
```
See the [full example](example/features/simple.py), and for other
functionality, see the [examples](example).
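
The per-token output above can be folded into one entry per concept. The sketch below is illustrative only: `concepts_by_cui` is a hypothetical helper, it assumes tokens without a linked concept report the `-<N>-` marker as a plain string (as the printed output suggests), and it uses stand-in objects so that it runs without the library installed.

```python
from types import SimpleNamespace

# Assumption: tokens with no linked concept carry this marker string, as seen
# in the printed output above.
NO_CUI = '-<N>-'


def concepts_by_cui(tokens):
    """Map each CUI to the joined normalized text of the tokens linked to it."""
    found = {}
    for tok in tokens:
        if tok.cui_ != NO_CUI:
            found.setdefault(tok.cui_, []).append(tok.norm)
    return {cui: ' '.join(words) for cui, words in found.items()}


# Stand-in tokens mirroring the sample output; with the library installed,
# pass doc.tokens instead.
sample = [SimpleNamespace(norm=norm, cui_=cui) for norm, cui in [
    ('John', NO_CUI), ('was', NO_CUI), ('diagnosed', NO_CUI), ('with', NO_CUI),
    ('kidney', 'C0035078'), ('failure', 'C0035078')]]

print(concepts_by_cui(sample))  # prints {'C0035078': 'kidney failure'}
```

Only the `norm` and `cui_` attributes shown in the example above are used, so the helper should work on any token sequence that exposes them.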


## MedCAT Models

By default, this library uses the small MedCAT model used for
[tutorials](https://github.com/CogStack/MedCATtutorials/pull/12), which is not
sufficient for any serious project. To get the UMLS trained model, the [MedCAT
UMLS request form] must be filled out (see the [MedCAT] repository).

After you obtain access and download the new model, add the following to
`~/.mednlprc`:

```ini
[medcat_status_resource]
url = file:///location/to/the/downloaded/file/umls_sm_wstatus_2021_oct.zip
```
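
A wrong path in this file is easy to miss, so it can help to check that the `url` option points at a file that exists. The following is a rough sketch using only the Python standard library; `medcat_model_path` is a hypothetical helper, and it assumes `~/.mednlprc` can be read as a plain INI file containing the section shown above.

```python
import configparser
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def medcat_model_path(rc_text: str) -> Path:
    """Return the local path named by the ``url`` option of the section above.

    Hypothetical helper for illustration; it is not part of zensols.mednlp.
    """
    # interpolation is disabled so '%' characters in a path are taken literally
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(rc_text)
    url = config['medcat_status_resource']['url']
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError(f'expected a file:// URL, got: {url}')
    return Path(url2pathname(parsed.path))


if __name__ == '__main__':
    rc_file = Path.home() / '.mednlprc'
    if rc_file.is_file():
        model = medcat_model_path(rc_file.read_text())
        print(model, 'exists' if model.is_file() else 'is missing')
```

This only inspects the configuration text; it does not load the model or confirm that the archive is a valid MedCAT model pack.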


## Attribution

This API utilizes the following frameworks:

* [MedCAT]: used to extract information from Electronic Health Records (EHRs)
  and link it to biomedical ontologies like SNOMED-CT and UMLS.
* [cTAKES]: a natural language processing system for extraction of information
  from electronic medical record clinical free-text.
* [cui2vec]: a new set of embeddings (akin to word embeddings) for medical
  concepts learned using an extremely large collection of multimodal medical
  data.
* [Zensols Deep NLP library]: a deep learning utility library for natural
  language processing that aids in feature engineering and embedding layers.
* [ctakes-parser]: parses [cTAKES] output into a [Pandas] data frame.


## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul and
      Di Eugenio, Barbara and
      Caragea, Cornelia",
    editor = "Tan, Liling and
      Milajevs, Dmitrijs and
      Chauhan, Geeticka and
      Gwinnup, Jeremy and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}
```


## Community

Please star the project and let me know how and where you use this API.
Contributions as pull requests, feedback, and any other input are welcome.


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## License

[MIT License](LICENSE.md)

Copyright (c) 2021 - 2025 Paul Landes


<!-- links -->
[pypi]: https://pypi.org/project/zensols.mednlp/
[pypi-link]: https://pypi.python.org/pypi/zensols.mednlp
[pypi-badge]: https://img.shields.io/pypi/v/zensols.mednlp.svg
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[build-badge]: https://github.com/plandes/mednlp/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/mednlp/actions

[MedCAT]: https://github.com/CogStack/MedCAT
[MedCAT UMLS request form]: https://uts.nlm.nih.gov/uts/login?service=https:%2F%2Fmedcat.rosalind.kcl.ac.uk%2Fauth-callback

[Pandas]: https://pandas.pydata.org
[ctakes-parser]: https://pypi.org/project/ctakes-parser

[UTS]: https://uts.nlm.nih.gov/uts/
[UMLS]: https://www.nlm.nih.gov/research/umls/
[CUIs]: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
[cui2vec]: https://arxiv.org/abs/1804.01486
[cTAKES]: https://ctakes.apache.org
[word embedding model]: https://plandes.github.io/deepnlp/api/zensols.deepnlp.embed.html#zensols.deepnlp.embed.domain.WordEmbedModel
[Zensols NLP parsing API]: https://plandes.github.io/nlparse/doc/feature-doc.html
[Zensols Deep NLP library]: https://github.com/plandes/deepnlp
[Zensols Deep NLP embedding layer]: https://plandes.github.io/deepnlp/api/zensols.deepnlp.layer.html#zensols.deepnlp.layer.embed.EmbeddingNetworkModule
[Stash]: https://plandes.github.io/util/api/zensols.persist.html#zensols.persist.domain.Stash