Diff of /README.md [000000] .. [ca4dac]

a b/README.md
# Medical natural language parsing and utility library

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]
[![Build Status][build-badge]][build-link]

A natural language medical domain parsing library.  This library:

- Provides an interface to the [UTS] ([UMLS] Terminology Services) RESTful
  service with data caching (NIH login needed).
- Wraps the [MedCAT] library by parsing medical and clinical text into
  first-class Python objects reflecting the structure of the natural language,
  complete with [UMLS] entity linking with [CUIs] and other domain-specific
  features.
- Combines non-medical (such as POS and NER tags) and medical features (such as
  [CUIs]) in one API and resulting data structure and/or as a [Pandas] data
  frame.
- Provides [cui2vec] as a [word embedding model] for either fast indexing and
  access or to use directly as features in a [Zensols Deep NLP embedding layer]
  model.
- Provides access to [cTAKES] as a dictionary-like [Stash] abstraction.
- Includes a command-line program to access all of these features without
  having to write any code.


## Documentation

See the [full documentation](https://plandes.github.io/mednlp/index.html).
The [API reference](https://plandes.github.io/mednlp/api.html) is also
available.


## Installing

Install the library using a Python package manager such as `pip`:
```bash
pip3 install zensols.mednlp
```

### CUI Embeddings

To use the `cui2vec` functionality, the embeddings must be *manually*
downloaded.  Start with these commands:
```bash
mkdir -p ~/.cache/zensols/mednlp
wget -O ~/.cache/zensols/mednlp/cui2vec.zip 'https://figshare.com/ndownloader/files/10959626?private_link=00d69861786cd0156d81'
```
If the download fails, or the downloaded file is not a zip file (for example,
it contains the text of an HTML error message), download the file manually by
[browsing](https://figshare.com/ndownloader/files/10959626) to it, and then
move it to `~/.cache/zensols/mednlp/cui2vec.zip`.
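
To tell a real archive from a saved HTML error page, the downloaded file can
be checked before use. The following is a minimal sketch that uses only the
Python standard library; the helper name `is_valid_zip` is hypothetical and is
not part of this library:

```python
import zipfile
from pathlib import Path


def is_valid_zip(path: Path) -> bool:
    """Return True only if ``path`` exists and looks like a zip archive.

    Hypothetical helper for illustration; not part of zensols.mednlp.
    """
    return path.is_file() and zipfile.is_zipfile(path)


# location used by the download commands above
cache_file = Path('~/.cache/zensols/mednlp/cui2vec.zip').expanduser()
if is_valid_zip(cache_file):
    print(f'{cache_file} looks like a zip archive')
else:
    print(f'{cache_file} is missing or not a zip file; download it manually')
```

An HTML error page saved under the `.zip` name fails this check, which is the
case where the manual download is needed.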


## Usage

To parse text, create features, and extract clinical concept identifiers:
```python
>>> from zensols.mednlp import ApplicationFactory
>>> doc_parser = ApplicationFactory.get_doc_parser()
>>> doc = doc_parser('John was diagnosed with kidney failure')
>>> for tok in doc.tokens: print(tok.norm, tok.pos_, tok.tag_, tok.cui_, tok.detected_name_)
John PROPN NNP -<N>- -<N>-
was AUX VBD -<N>- -<N>-
diagnosed VERB VBN -<N>- -<N>-
with ADP IN -<N>- -<N>-
kidney NOUN NN C0035078 kidney~failure
failure NOUN NN C0035078 kidney~failure
>>> print(doc.entities)
(<John>, <kidney failure>)
```
See the [full example](example/features/simple.py), and for other
functionality, see the [examples](example).
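
The per-token output above can be post-processed to collect the tokens that
share a concept. The following is a rough sketch that runs without the library
installed: `Tok` is a hypothetical stand-in for a parsed token that carries
only the `norm`, `cui_` and `detected_name_` attributes printed above, and it
assumes the `-<N>-` string marks a token with no linked concept, as it does in
that output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

# assumed marker for "no value", copied from the printed output above
NONE_MARKER = '-<N>-'


class Tok(NamedTuple):
    """Hypothetical stand-in for a parsed token (not the library's class)."""
    norm: str
    cui_: str
    detected_name_: str


def group_by_cui(tokens: Iterable[Tok]) -> Dict[str, List[str]]:
    """Map each CUI to the normalized text of the tokens linked to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tok in tokens:
        # skip tokens that were not linked to a medical concept
        if tok.cui_ != NONE_MARKER:
            groups[tok.cui_].append(tok.norm)
    return dict(groups)


# token data transcribed from the example output above
tokens = [
    Tok('John', NONE_MARKER, NONE_MARKER),
    Tok('was', NONE_MARKER, NONE_MARKER),
    Tok('diagnosed', NONE_MARKER, NONE_MARKER),
    Tok('with', NONE_MARKER, NONE_MARKER),
    Tok('kidney', 'C0035078', 'kidney~failure'),
    Tok('failure', 'C0035078', 'kidney~failure'),
]
print(group_by_cui(tokens))  # {'C0035078': ['kidney', 'failure']}
```

With real parser output, `doc.tokens` would take the place of the hand-written
list, provided the tokens expose the same attributes.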


## MedCAT Models

By default, this library uses the small MedCAT model used for
[tutorials](https://github.com/CogStack/MedCATtutorials/pull/12), which is not
sufficient for any serious project.  To get the UMLS trained model, the
[MedCAT UMLS request form] must be filled out (see the [MedCAT] repository).

After you obtain access and download the new model, add the following to
`~/.mednlprc`:

```ini
[medcat_status_resource]
url = file:///location/to/the/downloaded/file/umls_sm_wstatus_2021_oct.zip
```
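
A mistyped path in this entry is easy to miss, so it can help to check that
the configured URL resolves to the downloaded file. The following sketch reads
the same section with Python's standard `configparser`. It assumes the file is
plain INI that `configparser` can read; the helper name
`model_path_from_config` is hypothetical, and this is not how the library
itself loads its configuration.

```python
import configparser
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def model_path_from_config(text: str) -> Path:
    """Return the local path named by the ``url`` option of the
    ``medcat_status_resource`` section.

    Hypothetical helper for illustration; not part of zensols.mednlp.
    """
    # disable interpolation so a literal '%' in a path is not misread
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    url = parser['medcat_status_resource']['url']
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError(f'expected a file:// URL, got: {url}')
    return Path(url2pathname(parsed.path))


# same content as the configuration entry shown above
sample = """
[medcat_status_resource]
url = file:///location/to/the/downloaded/file/umls_sm_wstatus_2021_oct.zip
"""
path = model_path_from_config(sample)
print(path.name, 'exists' if path.exists() else 'not found')
```

To check a real configuration, pass the contents of `~/.mednlprc`, for example
`Path('~/.mednlprc').expanduser().read_text()`, in place of `sample`.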


## Attribution

This API utilizes the following frameworks:

* [MedCAT]: used to extract information from Electronic Health Records (EHRs)
  and link it to biomedical ontologies like SNOMED-CT and UMLS.
* [cTAKES]: a natural language processing system for extraction of information
  from electronic medical record clinical free-text.
* [cui2vec]: a set of embeddings (similar to word embeddings) for medical
  concepts learned using an extremely large collection of multimodal medical
  data.
* [Zensols Deep NLP library]: a deep learning utility library for natural
  language processing that aids in feature engineering and embedding layers.
* [ctakes-parser]: parses [cTAKES] output into a [Pandas] data frame.


## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}
```


## Community

Please star the project and let me know how and where you use this API.
Contributions as pull requests, feedback, and any other input are welcome.


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## License

[MIT License](LICENSE.md)

Copyright (c) 2021 - 2025 Paul Landes


<!-- links -->
[pypi]: https://pypi.org/project/zensols.mednlp/
[pypi-link]: https://pypi.python.org/pypi/zensols.mednlp
[pypi-badge]: https://img.shields.io/pypi/v/zensols.mednlp.svg
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[build-badge]: https://github.com/plandes/mednlp/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/mednlp/actions

[MedCAT]: https://github.com/CogStack/MedCAT
[MedCAT UMLS request form]: https://uts.nlm.nih.gov/uts/login?service=https:%2F%2Fmedcat.rosalind.kcl.ac.uk%2Fauth-callback

[Pandas]: https://pandas.pydata.org
[ctakes-parser]: https://pypi.org/project/ctakes-parser

[UTS]: https://uts.nlm.nih.gov/uts/
[UMLS]: https://www.nlm.nih.gov/research/umls/
[CUIs]: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
[cui2vec]: https://arxiv.org/abs/1804.01486
[cTAKES]: https://ctakes.apache.org
[word embedding model]: https://plandes.github.io/deepnlp/api/zensols.deepnlp.embed.html#zensols.deepnlp.embed.domain.WordEmbedModel
[Zensols NLP parsing API]: https://plandes.github.io/nlparse/doc/feature-doc.html
[Zensols Deep NLP library]: https://github.com/plandes/deepnlp
[Zensols Deep NLP embedding layer]: https://plandes.github.io/deepnlp/api/zensols.deepnlp.layer.html#zensols.deepnlp.layer.embed.EmbeddingNetworkModule
[Stash]: https://plandes.github.io/util/api/zensols.persist.html#zensols.persist.domain.Stash