# Medical NLP and Utility API

This API primarily wraps others with the [Zensols Framework] to provide an
easy and reproducible way to use and experiment with medical and clinical
natural language text.  It provides the following functionality:

* [UMLS Access via UTS]
* [Medical Concept and Entity Linking]
* [Using CUI as Word Embeddings](#using-cui-as-word-embeddings)
* [Entity Linking with cTAKES](#entity-linking-with-ctakes)

The rest of this document is structured as a cookbook style tutorial.  Each
sub-section describes the examples in the [examples] directory.

**Important**: many of the examples use the [UMLS] UTS service, which
requires a key that is provided by NIH.  If you do not have a key, request
one and add it to the [UTS key file].


## Medical Concept and Entity Linking

Concept linking with [CUIs] is provided using the same interface as the
[Zensols NLP parsing API].  The resource library provided with this package
creates a `mednlp_doc_parser` as shown in the [entity example].  First we
start with the configuration file `features.conf`, which begins by telling
the [CLI] to import the [Zensols NLP package] and this (`zensols.mednlp`)
package:
```ini
[import]
sections = list: imp_conf

[imp_conf]
type = importini
config_files = list:
    resource(zensols.nlp): resources/obj.conf,
    resource(zensols.nlp): resources/mapper.conf,
    resource(zensols.mednlp): resources/lang.conf
```

Next configure the parser with a specific set of features; otherwise the
parser retains all medical and non-medical features:
```ini
[mednlp_doc_parser]
token_feature_ids = set: norm, is_ent, cui, cui_, pref_name_, detected_name_, is_concept, ent_, ent
```

Finally, declare the application, which is needed by the [CLI] glue code to
invoke the class we will write afterward:
```ini
[app]
class_name = ${program:name}.Application
doc_parser = instance: mednlp_doc_parser
```

Next comes the application class:
```python
@dataclass
class Application(object):
    doc_parser: FeatureDocumentParser = field()

    def show(self, sent: str = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        doc: FeatureDocument = self.doc_parser(sent)
        print('first three tokens:')
        for tok in it.islice(doc.token_iter(), 3):
            print(tok.norm)
            tok.write_attributes(1, include_type=False)
```
This uses the document parser to create the feature document, whose tokens
(provided by `token_iter()`) carry both the medical and linguistic features.

Use the [CLI] API in the entry point to wire the configuration and
application class together:
```python
if (__name__ == '__main__'):
    CliHarness(
        app_config_resource='features.conf',
        app_config_context=ProgramNameConfigurator(
            None, default='features').create_section(),
        proto_args='',
    ).run()
```
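
Conceptually, the harness turns the public methods of `Application` into
command-line actions and their parameters into options.  A rough pure-Python
analogue of that mapping using `argparse` (an illustration of the idea only,
not the [CLI] library's actual mechanics; the `Demo` class and its method are
made up):

```python
import argparse
import inspect

class Demo(object):
    """Stand-in for the application class; methods become CLI actions."""
    def show(self, sent: str = 'a default sentence'):
        return f'parsed: {sent}'

def build_parser(app_class) -> argparse.ArgumentParser:
    # create one sub-command per public method, one option per parameter
    parser = argparse.ArgumentParser()
    subs = parser.add_subparsers(dest='action')
    for name, meth in inspect.getmembers(app_class, inspect.isfunction):
        if name.startswith('_'):
            continue
        sub = subs.add_parser(name)
        for pname, param in inspect.signature(meth).parameters.items():
            if pname != 'self':
                sub.add_argument(f'--{pname}', default=param.default)
    return parser

args = build_parser(Demo).parse_args(['show', '--sent', 'kidney failure'])
result = getattr(Demo(), args.action)(sent=args.sent)
print(result)  # parsed: kidney failure
```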

Running the program produces token data such as:
```
...
diagnosed
    cui=11900
    cui_=C0011900
    detected_name_=diagnosed
    ent=13188083023294932426
    ent_=concept
    i=2
    i_sent=2
    idx=7
    is_concept=True
    is_ent=True
    norm=diagnosed
    pref_name_=Diagnosis
...
```
See the full [entity example] for the complete code, which also outputs both
linguistic and medical features as a [Pandas] data frame.
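
The per-token attributes shown above map naturally to data frame rows.  A toy
sketch of that tabulation using plain dictionaries (the values are hand-copied
from the output above, not computed by the parser):

```python
import csv
import io

# hand-made rows mimicking the token attributes printed above
tokens = [
    {'norm': 'He', 'is_concept': False, 'cui_': None, 'pref_name_': None},
    {'norm': 'diagnosed', 'is_concept': True, 'cui_': 'C0011900',
     'pref_name_': 'Diagnosis'},
]

# keep only concept tokens, as one might before building a data frame
concepts = [t for t in tokens if t['is_concept']]

# lay the rows out as CSV, one column per feature
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['norm', 'cui_', 'pref_name_'])
writer.writeheader()
for tok in concepts:
    writer.writerow({k: tok[k] for k in writer.fieldnames})
print(buf.getvalue())
```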


## UMLS Access via UTS

NIH provides a very rough REST client using the `requests` library, given as
an example.  This API takes that example and adds rigor and structure in an
easy-to-use class called `UTSClient`.  This is configured by first defining
paths for where fetched entities are cached:
```ini
[default]
# root directory given by the application, which is the parent directory
root_dir = ${appenv:root_dir}/..
# the directory to hold the cached UMLS data
cache_dir = ${root_dir}/cache
```
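
The `${section:option}` substitution used above follows the same style as
Python's `configparser` extended interpolation.  A standalone illustration of
the idea (the paths are made up):

```python
from configparser import ConfigParser, ExtendedInterpolation

# the same ${option} / ${section:option} substitution style as above
config_text = """
[default]
root_dir = /tmp/app/..
cache_dir = ${root_dir}/cache

[uts]
cache_file = ${default:cache_dir}/uts-request.dat
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(config_text)
print(parser['uts']['cache_file'])  # /tmp/app/../cache/uts-request.dat
```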

Next, import this package's resource library (`zensols.mednlp`).  Note we
have to refer to sections that substitute the `default` section's data:
```ini
[import]
references = list: uts, default
sections = list: imp_uts_key, imp_conf

[imp_conf]
type = importini
config_file = resource(zensols.mednlp): resources/uts.conf

[imp_uts_key]
type = json
default_section = uts
config_file = ${default:root_dir}/uts-key.json
```
The `imp_uts_key` section points to a file where you add your UTS key, which
is provided by NIH.
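
The key file itself is a small JSON document whose top-level keys are merged
into the `uts` section (per `default_section = uts` above).  The field name
below is illustrative only — use the [UTS key file] template from the
repository for the exact format:

```json
{
    "api_key": "<your-UTS-API-key>"
}
```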

Now indicate where to cache the [UMLS] data:
```ini
# UTS (UMLS access)
[uts]
cache_file = ${default:cache_dir}/uts-request.dat
```
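
The `cache_file` above persists fetched results so repeated runs do not
re-query the service.  The general pattern is memoization to disk, sketched
here with `pickle` (an illustration of the idea only, not `UTSClient`'s
actual mechanism; `fake_request` stands in for the REST call):

```python
import os
import pickle
import tempfile

def cached_fetch(cache_path: str, key: str, fetch):
    """Return the cached result for key, calling fetch(key) only on a miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            cache = pickle.load(f)
    if key not in cache:
        cache[key] = fetch(key)
        with open(cache_path, 'wb') as f:
            pickle.dump(cache, f)
    return cache[key]

calls = []
def fake_request(term):
    calls.append(term)  # track hits to the "service"
    return {'ui': 'C0018787', 'name': term.capitalize()}

path = os.path.join(tempfile.mkdtemp(), 'uts-request.dat')
first = cached_fetch(path, 'heart', fake_request)
second = cached_fetch(path, 'heart', fake_request)  # served from disk
print(first, len(calls))
```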

For brevity the [CLI] application code and configuration is omitted (see the
[Medical Concept and Entity Linking] section for that pattern).

To first search for a term and then print its entity information, we can use
the `search_term` method together with `get_atoms`:
```python
@dataclass
class Application(object):
    ...
    def lookup(self, term: str = 'heart'):
        # results are returned as a list of dictionaries of term data
        pages: List[Dict[str, str]] = self.uts_client.search_term(term)
        # get the first term's data
        terms: Dict[str, str] = pages[0]
        # get the concept unique identifier
        cui: str = terms['ui']

        # print atoms of this concept
        print('atoms:')
        pprint(self.uts_client.get_atoms(cui))
```
This yields the following output:
```
atoms:
{'ancestors': None,
 'classType': 'Atom',
 'code': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/MTH/NOCODE',
 'concept': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0018787',
 'contentViewMemberships': [{'memberUri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357/member/A0066369',
                             'name': 'MetaMap NLP View',
                             'uri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357'}],
 'name': 'Heart',
 'obsolete': 'false',
 'rootSource': 'MTH',
...
}
```

See the full [UTS example] for the complete example code.


## Using CUI as Word Embeddings

[cui2vec] was trained, and can be used, in the same way as [word2vec].  One
such use is computing the similarity between [UMLS] [CUIs].  This API
provides access to the vectors directly, along with all the functionality of
using [cui2vec] with the [gensim] package.  This example computes the
similarity between two medical concepts.  For brevity the [CLI] application
code and configuration is omitted (see [UMLS Access via UTS] for more
detail).

Let's jump right to the imports we need for the [cui2vec] example, which are
the `uts` and `cui2vec` resource libraries:
```ini
[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/cui2vec.conf
```
The UTS configuration is given as in the [UMLS Access via UTS] section and
the parser is configured as in the [Medical Concept and Entity Linking]
section.

With the high level classes given in the configuration, the application class
looks similar to what we've seen before; this time we define a `similarity`
method/[CLI] action:
```python
@dataclass
class Application(object):
    def similarity(self, term: str = 'heart disease', topn: int = 5):
```

Next, get the [gensim] `KeyedVectors` instance, which provides (among *many*
other useful methods) one to compute the similarity between two words, or in
our case, two medical [CUIs]:
```python
        embedding: Cui2VecEmbedModel = self.cui2vec_embedding
        kv: KeyedVectors = embedding.keyed_vectors
```

Next we use UTS to get the term we're searching on, use [gensim] to find
similarities, and output them:
```python
        res: List[Dict[str, str]] = self.uts_client.search_term(term)
        cui: str = res[0]['ui']
        sims_by_word: List[Tuple[str, float]] = kv.similar_by_word(cui, topn)
        for rel_cui, proba in sims_by_word:
            rel_atom: Dict[str, str] = self.uts_client.get_atoms(rel_cui)
            rel_name = rel_atom.get('name', 'Unknown')
            print(f'{rel_name} ({rel_cui}): {proba * 100:.2f}%')
```

The output contains the top `topn` (5) matches and their similarity to the
search term `heart disease`:
```
Heart failure (C0018801): 72.03%
Atrial Premature Complexes (C0033036): 71.53%
Chronic myocardial ischemia (C0264694): 69.68%
Right bundle branch block (C0085615): 69.34%
First degree atrioventricular block (C0085614): 69.09%
```
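
The percentages above are cosine similarities from `similar_by_word`, scaled
by 100.  A minimal pure-Python sketch of that ranking over toy vectors (the
CUIs and numbers here are made up, standing in for cui2vec's `KeyedVectors`):

```python
import math

def cosine(a, b):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# toy embedding table keyed by CUI
vectors = {
    'C0018799': [1.0, 0.0, 1.0],   # query concept
    'C0018801': [0.9, 0.1, 1.1],   # close neighbor
    'C0033036': [0.2, 1.0, 0.1],   # farther away
}

def similar_by_word(word, topn):
    # score every other entry against the query and keep the top N
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]

for cui, sim in similar_by_word('C0018799', 2):
    print(f'{cui}: {sim * 100:.2f}%')
```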

See the full [cui2vec example] for the complete example code.


## Entity Linking with cTAKES

This package provides an interface to [cTAKES], which primarily manages the
file system and invokes the Java program to produce results.  It then uses
the [ctakes-parser] to create a data frame of features and linked entities
from tokens of the source text.

The configuration is a bit more involved since you have to indicate where the
[cTAKES] program is installed, and provide your NIH key as detailed in the
[UMLS Access via UTS] section:
```ini
[import]
# refer to sections for which we need substitution in this file
references = list: default, ctakes, uts
sections = list: imp_env, imp_uts_key, imp_conf

# expose the user HOME environment variable
[imp_env]
type = environment
section_name = env
includes = set: HOME

# import the Zensols NLP UTS resource library
[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/ctakes.conf

# indicate where Apache cTAKES is installed
[ctakes]
home = ${env:home}/opt/app/ctakes-4.0.0.1
source_dir = ${default:cache_dir}/ctakes/source
```
For brevity the [CLI] application code and other configuration given in
previous sections is omitted (see [UMLS Access via UTS] for more detail).
See the full [cTAKES example] for the complete example code.

The pertinent snippet to get the [Pandas] data frame from the medical text is
very simple:
```python
@dataclass
class Application(object):
    def entities(self, sent: str = None, output: Path = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        self.ctakes_stash.set_documents([sent])
        df: pd.DataFrame = self.ctakes_stash['0']
        print(df)
        if output is not None:
            df.to_csv(output)
            print(f'wrote: {output}')
```
The `set_documents` method expects a list of texts, which it saves to disk,
one file per element in the list, in the directory that [cTAKES] reads when
run.  Indexing the [Stash] then retrieves the first document by its element
ID.  **Note**: the element ID has to be a string to follow the [Stash] API.
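
A [Stash] behaves like a dict-style store keyed by strings.  A minimal
in-memory sketch of the usage pattern above (not the real `zensols.persist`
API — the actual stash additionally runs cTAKES and parses its output into a
data frame):

```python
class ToyStash(object):
    """Dict-backed stand-in for the document stash usage pattern."""
    def __init__(self):
        self._docs = {}

    def set_documents(self, texts):
        # one entry per element, keyed by its string index
        self._docs = {str(i): text for i, text in enumerate(texts)}

    def __getitem__(self, key: str):
        if not isinstance(key, str):
            raise KeyError('stash keys must be strings')
        return self._docs[key]

stash = ToyStash()
stash.set_documents(['He was diagnosed with kidney failure.'])
print(stash['0'])  # He was diagnosed with kidney failure.
```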


<!-- links -->
[UMLS Access via UTS]: #umls-access-via-uts
[Medical Concept and Entity Linking]: #medical-concept-and-entity-linking

[UMLS]: https://www.nlm.nih.gov/research/umls/
[CUIs]: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
[cui2vec]: https://arxiv.org/abs/1804.01486
[word2vec]: https://arxiv.org/abs/1301.3781

[Pandas]: https://pandas.pydata.org
[gensim]: https://radimrehurek.com/gensim/
[cTAKES]: https://ctakes.apache.org
[ctakes-parser]: https://pypi.org/project/ctakes-parser

[Zensols Framework]: https://arxiv.org/abs/2109.03383
[CLI]: https://plandes.github.io/util/doc/command-line.html
[Stash]: https://plandes.github.io/util/api/zensols.persist.html#zensols.persist.domain.Stash
[Zensols NLP package]: https://github.com/plandes/nlparse
[Zensols NLP parsing API]: https://plandes.github.io/nlparse/doc/feature-doc.html

[examples]: https://github.com/plandes/mednlp/tree/master/example
[entity example]: https://github.com/plandes/mednlp/tree/master/example/features
[cTAKES example]: https://github.com/plandes/mednlp/tree/master/example/ctakes
[cui2vec example]: https://github.com/plandes/mednlp/tree/master/example/cui2vec
[UTS example]: https://github.com/plandes/mednlp/tree/master/example/uts
[UTS key file]: https://github.com/plandes/mednlp/tree/master/example/uts-key.json