# Medical NLP and Utility API

This API primarily wraps others with the [Zensols Framework] to provide an
easy and reproducible way to use and experiment with medical and clinical
natural language text. It provides the following functionality:

* [UMLS Access via UTS]
* [Medical Concept and Entity Linking]
* [Using CUI as Word Embeddings](#using-cui-as-word-embeddings)
* [Entity Linking with cTAKES](#entity-linking-with-ctakes)

The rest of this document is structured as a cookbook-style tutorial. Each
sub-section describes the examples in the [examples] directory.

**Important**: many of the examples use the [UMLS] UTS service, which requires
a key that is provided by NIH. If you do not have a key, request one and add
it to the [UTS key file].

## Medical Concept and Entity Linking

Concept linking with [CUIs] is provided using the same interface as the
[Zensols NLP parsing API]. The resource library provided with this package
creates a `mednlp_doc_parser` as shown in the [entity example]. First we start
with the configuration file `features.conf`, which begins by telling the [CLI]
to import the [Zensols NLP package] and this (`zensols.mednlp`) package:
```ini
[import]
sections = list: imp_conf

[imp_conf]
type = importini
config_files = list:
    resource(zensols.nlp): resources/obj.conf,
    resource(zensols.nlp): resources/mapper.conf,
    resource(zensols.mednlp): resources/lang.conf
```

Next, configure the parser with specific features; otherwise, the parser
will retain all medical and non-medical features:
```ini
[mednlp_doc_parser]
token_feature_ids = set: norm, is_ent, cui, cui_, pref_name_, detected_name_, is_concept, ent_, ent
```

Finally, declare the application, which is needed by the [CLI] glue code to
invoke the class we will write afterward:
```ini
[app]
class_name = ${program:name}.Application
doc_parser = instance: mednlp_doc_parser
```

Next comes the application class:
```python
import itertools as it
from dataclasses import dataclass, field
from zensols.nlp import FeatureDocument, FeatureDocumentParser


@dataclass
class Application(object):
    doc_parser: FeatureDocumentParser = field()

    def show(self, sent: str = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        doc: FeatureDocument = self.doc_parser(sent)
        print('first three tokens:')
        for tok in it.islice(doc.token_iter(), 3):
            print(tok.norm)
            tok.write_attributes(1, include_type=False)
```
This uses the document parser to create the feature document, whose tokens
(provided by `token_iter()`) carry both the medical and linguistic features.
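
Each token exposes the configured feature IDs as attributes. As a minimal
sketch (not part of the example code), the concept features configured above
can be read directly, for instance to print the linked UMLS concept of each
medical term:
```python
# print the CUI and preferred UMLS name of each token linked to a concept;
# `is_concept`, `cui_` and `pref_name_` come from `token_feature_ids` above
for tok in doc.token_iter():
    if tok.is_concept:
        print(f'{tok.norm}: {tok.cui_} ({tok.pref_name_})')
```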

Use the [CLI] API in the entry point to wire the configuration to the
application class:
```python
from zensols.cli import CliHarness, ProgramNameConfigurator

if __name__ == '__main__':
    CliHarness(
        app_config_resource='features.conf',
        app_config_context=ProgramNameConfigurator(
            None, default='features').create_section(),
        proto_args='',
    ).run()
```

Running the program produces the following data for one such token:
```
...
diagnosed
    cui=11900
    cui_=C0011900
    detected_name_=diagnosed
    ent=13188083023294932426
    ent_=concept
    i=2
    i_sent=2
    idx=7
    is_concept=True
    is_ent=True
    norm=diagnosed
    pref_name_=Diagnosis
...
```
See the [entity example] for the full example code, which also outputs
both linguistic and medical features as a [Pandas] data frame.
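
A quick way to get such a frame without the example's plumbing is to build it
directly from the token attributes; this is a minimal sketch using only the
feature IDs configured earlier (the [entity example] itself may assemble the
frame differently):
```python
import pandas as pd

# one row per token over a subset of the configured feature IDs; attributes
# missing on a token (e.g. non-concept tokens) default to None
feature_ids = ('norm', 'is_concept', 'cui_', 'pref_name_', 'ent_')
rows = [{fid: getattr(tok, fid, None) for fid in feature_ids}
        for tok in doc.token_iter()]
print(pd.DataFrame(rows))
```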


## UMLS Access via UTS

NIH provides a very rough REST client using the `requests` library given as an
example. This API takes that example and adds some "rigor" and structure in an
easy-to-use class called `UTSClient`. This is configured by first defining
paths for where fetched entities are cached:
```ini
[default]
# root directory given by the application, which is the parent directory
root_dir = ${appenv:root_dir}/..
# the directory to hold the cached UMLS data
cache_dir = ${root_dir}/cache
```

Next, import this package's resource library (`zensols.mednlp`). Note we
have to refer to sections that substitute the `default` section's data:
```ini
[import]
references = list: uts, default
sections = list: imp_uts_key, imp_conf

[imp_conf]
type = importini
config_file = resource(zensols.mednlp): resources/uts.conf

[imp_uts_key]
type = json
default_section = uts
config_file = ${default:root_dir}/uts-key.json
```
The `imp_uts_key` section points to a file where you add your UTS key, which
is provided by NIH.
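
The file itself is JSON whose entries become options of the `uts` section (per
`default_section` above). A minimal sketch of its contents, assuming the
option is named `api_key` (check the [UTS key file] template for the exact
option name used by the example):
```json
{"api_key": "<your-UTS-API-key>"}
```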

Now indicate where to cache the [UMLS] data and define the application we'll
write afterward:
```ini
# UTS (UMLS access)
[uts]
cache_file = ${default:cache_dir}/uts-request.dat
```

For brevity the [CLI] application code and configuration are omitted (see the
[Medical Concept and Entity Linking] section for more detail).

To use the API to first search for a term and then print entity information,
we can use the `search_term` method with `get_atoms`:
```python
from typing import Dict, List
from dataclasses import dataclass
from pprint import pprint


@dataclass
class Application(object):
    ...
    def lookup(self, term: str = 'heart'):
        # terms are returned as a list of pages with dictionaries of data
        pages: List[Dict[str, str]] = self.uts_client.search_term(term)
        # get the first term's dictionary from the first page
        terms: Dict[str, str] = pages[0]
        # get the concept unique identifier
        cui: str = terms['ui']

        # print atoms of this concept
        print('atoms:')
        pprint(self.uts_client.get_atoms(cui))
```
This yields the following output:
```
atoms:
{'ancestors': None,
 'classType': 'Atom',
 'code': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/MTH/NOCODE',
 'concept': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0018787',
 'contentViewMemberships': [{'memberUri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357/member/A0066369',
                             'name': 'MetaMap NLP View',
                             'uri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357'}],
 'name': 'Heart',
 'obsolete': 'false',
 'rootSource': 'MTH',
 ...
}
```

See the [UTS example] for the full example code.


## Using CUI as Word Embeddings

[cui2vec] was trained and can be used in the same way as [word2vec]. One such
example is computing the similarity between [UMLS] [CUIs]. This API provides
access to the vectors directly along with all the functionality of using
[cui2vec] with the [gensim] package. This example computes the similarity
between two medical concepts. For brevity the [CLI] application code and
configuration are omitted (see [UMLS Access via UTS] for more detail).

Let's jump right to how we import everything we need for the [cui2vec]
example, which are the `uts` and `cui2vec` resource libraries:
```ini
[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/cui2vec.conf
```
The UTS configuration is given as in the [UMLS Access via UTS] section and the
parser is configured as in the [Medical Concept and Entity Linking] section.

With the high-level classes given in the configuration, the application class
looks similar to what we've seen before; this time we define a `similarity`
method/[CLI] action:
```python
@dataclass
class Application(object):
    def similarity(self, term: str = 'heart disease', topn: int = 5):
```

Next, get the [gensim] `KeyedVectors` instance, which provides (among *many*
other useful methods) one to compute the similarity between two words, or in
our case, two medical [CUIs]:
```python
embedding: Cui2VecEmbedModel = self.cui2vec_embedding
kv: KeyedVectors = embedding.keyed_vectors
```

Next, we use UTS to get the term we're searching on, use [gensim] to find
similarities, and output them:
```python
res: List[Dict[str, str]] = self.uts_client.search_term(term)
cui: str = res[0]['ui']
sims_by_word: List[Tuple[str, float]] = kv.similar_by_word(cui, topn)
for rel_cui, proba in sims_by_word:
    rel_atom: Dict[str, str] = self.uts_client.get_atoms(rel_cui)
    rel_name = rel_atom.get('name', 'Unknown')
    print(f'{rel_name} ({rel_cui}): {proba * 100:.2f}%')
```

The output contains the top (`topn`) five matches and their similarity to the
search term (`heart disease` in this example):
```
Heart failure (C0018801): 72.03%
Atrial Premature Complexes (C0033036): 71.53%
Chronic myocardial ischemia (C0264694): 69.68%
Right bundle branch block (C0085615): 69.34%
First degree atrioventricular block (C0085614): 69.09%
```
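
The same `KeyedVectors` instance can also score a specific pair of concepts
directly. A minimal sketch (not part of the example), using the CUIs for
*Heart* and *Heart failure* shown earlier in this document:
```python
heart, heart_failure = 'C0018787', 'C0018801'
# guard against CUIs missing from the cui2vec vocabulary
if heart in kv and heart_failure in kv:
    print(f'similarity: {kv.similarity(heart, heart_failure) * 100:.2f}%')
```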

See the [cui2vec example] for the full example code.


## Entity Linking with cTAKES

This package provides an interface to [cTAKES], which primarily manages the
file system and invokes the Java program to produce results. It then uses the
[ctakes-parser] to create a data frame of features and linked entities from
tokens of the source text.

The configuration is a bit more involved since you have to indicate where the
[cTAKES] program is installed, and provide your NIH key as detailed in the
[UMLS Access via UTS] section:
```ini
[import]
# refer to sections for which we need substitution in this file
references = list: default, ctakes, uts
sections = list: imp_env, imp_uts_key, imp_conf

# expose the user HOME environment variable
[imp_env]
type = environment
section_name = env
includes = set: HOME

# import the UTS and cTAKES resource libraries
[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/ctakes.conf

# indicate where Apache cTAKES is installed
[ctakes]
home = ${env:home}/opt/app/ctakes-4.0.0.1
source_dir = ${default:cache_dir}/ctakes/source
```
For brevity the [CLI] application code and configuration are omitted, as is
other configuration given in previous sections (see [UMLS Access via UTS] for
more detail). See the [cTAKES example] for the full example code.

The pertinent snippet to get the [Pandas] data frame from the medical text is
very simple:
```python
@dataclass
class Application(object):
    def entities(self, sent: str = None, output: Path = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        self.ctakes_stash.set_documents([sent])
        df: pd.DataFrame = self.ctakes_stash['0']
        print(df)
        if output is not None:
            df.to_csv(output)
            print(f'wrote: {output}')
```
The `set_documents` method expects a list of texts, which it saves to disk,
one file per element in the list. When [cTAKES] is run, it processes the
directory where these files were written. Accessing the [Stash] by element ID
then retrieves the parsed results for the first document. **Note**: the
element ID has to be a string to follow the [Stash] API.
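
Because each element of the list becomes its own document, several texts can
be processed in one [cTAKES] run and retrieved afterward by their string
element IDs. A minimal sketch inside the same application class, assuming the
[Stash] yields one data frame per element:
```python
# parse several texts in one run; each element becomes its own document
sents = ['He was diagnosed with kidney failure.',
         'She was treated for hypertension.']
self.ctakes_stash.set_documents(sents)
for doc_id in sorted(self.ctakes_stash.keys()):
    df = self.ctakes_stash[doc_id]
    print(f'document {doc_id}: {len(df)} rows')
```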


<!-- links -->
[UMLS Access via UTS]: #umls-access-via-uts
[Medical Concept and Entity Linking]: #medical-concept-and-entity-linking

[UMLS]: https://www.nlm.nih.gov/research/umls/
[CUIs]: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
[cui2vec]: https://arxiv.org/abs/1804.01486
[word2vec]: https://arxiv.org/abs/1301.3781

[Pandas]: https://pandas.pydata.org
[gensim]: https://radimrehurek.com/gensim/
[cTAKES]: https://ctakes.apache.org
[ctakes-parser]: https://pypi.org/project/ctakes-parser

[Zensols Framework]: https://arxiv.org/abs/2109.03383
[CLI]: https://plandes.github.io/util/doc/command-line.html
[Stash]: https://plandes.github.io/util/api/zensols.persist.html#zensols.persist.domain.Stash
[Zensols NLP package]: https://github.com/plandes/nlparse
[Zensols NLP parsing API]: https://plandes.github.io/nlparse/doc/feature-doc.html

[examples]: https://github.com/plandes/mednlp/tree/master/example
[entity example]: https://github.com/plandes/mednlp/tree/master/example/features
[cTAKES example]: https://github.com/plandes/mednlp/tree/master/example/ctakes
[cui2vec example]: https://github.com/plandes/mednlp/tree/master/example/cui2vec
[UTS example]: https://github.com/plandes/mednlp/tree/master/example/uts
[UTS key file]: https://github.com/plandes/mednlp/tree/master/example/uts-key.json |