mednlp / Git / Diff of /doc/api-usage.md

Models:
philipB/
mednlp
Downloads: 1
Diff of /doc/api-usage.md [000000] .. [ca4dac]
Switch to side-by-side view

--- a
+++ b/doc/api-usage.md
@@ -0,0 +1,336 @@
+# Medical NLP and Utility API
+
+This API primarily wraps others with the [Zensols Framework] to provide easy
+way and reproducible method of utilization and experimentation with medical and
+clinical natural language text.  It provides the following functionality:
+
+* [UMLS Access via UTS]
+* [Medical Concept and Entity Linking]
+* [Using CUI as Word Embeddings](#using-cui-as-word-embeddings)
+* [Entity Linking with cTAKES](#entity-linking-with-ctakes)
+
+The rest of this document is structured as a cookbook style tutorial.  Each
+sub-section describes the examples in the [examples] directory.
+
+**Important**: many of the examples use [UMLS] UTS service, which requires a
+key that is provided by NIH.  If you do not have a key, request one and add it
+to the [UTS key file].
+
+
+## Medical Concept and Entity Linking
+
+Concept linking with [CUIs] is provided using the same interface as the
+[Zensols NLP parsing API].  The resource library provided with this package
+creates a `mednlp_doc_parser` as shown in the [entity-example].  First we start
+with the configuration with file name `features.conf`, which starts with
+telling the [CLI] to import the [Zensols NLP package] and this
+(`zensols.mednlp`) package:
+```ini
+[import]
+sections = list: imp_conf
+
+[imp_conf]
+type = importini
+config_files = list:
+    resource(zensols.nlp): resources/obj.conf,
+    resource(zensols.nlp): resources/mapper.conf,
+    resource(zensols.mednlp): resources/lang.conf
+```
+
+Next configure the parser with specific features, since otherwise, the parser
+will retain all medical and non-medical features:
+```ini
+[mednlp_doc_parser]
+token_feature_ids = set: norm, is_ent, cui, cui_, pref_name_, detected_name_, is_concept, ent_, ent
+```
+
+Finally, declare the application, which is needed by the [CLI] glue code to
+invoke the class we will write afterward:
+```ini
+[app]
+class_name = ${program:name}.Application
+doc_parser = instance: mednlp_doc_parser
+```
+
+Next comes the application class:
+```python
+@dataclass
+class Application(object):
+    doc_parser: FeatureDocumentParser = field()
+
+    def show(self, sent: str = None):
+        if sent is None:
+            sent = 'He was diagnosed with kidney failure in the United States.'
+        doc: FeatureDocument = self.doc_parser(sent)
+        print('first three tokens:')
+        for tok in it.islice(doc.token_iter(), 3):
+            print(tok.norm)
+            tok.write_attributes(1, include_type=False)
+```
+This uses the document parser to create the feature document, which has both
+the medical and linguistic features in tokens (provided by `token_iter()`) of the document.
+
+Use the [CLI] API in the entry point to use the configuration and application
+class:
+```python
+if (__name__ == '__main__'):
+    CliHarness(
+        app_config_resource='uts.conf',
+        app_config_context=ProgramNameConfigurator(
+            None, default='uts').create_section(),
+        proto_args='',
+    ).run()
+```
+
+Running the program produces one such token data:
+```
+...
+diagnosed
+    cui=11900
+    cui_=C0011900
+    detected_name_=diagnosed
+    ent=13188083023294932426
+    ent_=concept
+    i=2
+    i_sent=2
+    idx=7
+    is_concept=True
+    is_ent=True
+    norm=diagnosed
+    pref_name_=Diagnosis
+...
+```
+See the full [entity example] for the full example code, which will also output
+both linguistic and medical features as a [Pandas] data frame.
+
+
+## UMLS Access via UTS
+
+NIH provides a very rough REST client using the `requests` library given as an
+example.  This API takes that example, adds some "rigor" and structure in a
+an easy to use class called `UTSClient`.  This is configured by first defining
+paths for where fetched entities are cached:
+```ini
+[default]
+# root directory given by the application, which is the parent directory
+root_dir = ${appenv:root_dir}/..
+# the directory to hold the cached UMLS data
+cache_dir = ${root_dir}/cache
+```
+
+Next, import the this package's resource library (`zensols.mednlp`).  Note we
+have to refer to sections that substitute the `default` section's data:
+```ini
+[import]
+references = list: uts, default
+sections = list: imp_uts_key, imp_conf
+
+[imp_conf]
+type = importini
+config_file = resource(zensols.mednlp): resources/uts.conf
+
+[imp_uts_key]
+type = json
+default_section = uts
+config_file = ${default:root_dir}/uts-key.json
+```
+The `imp_uts_key` points to a file where you put add your UTS key, which is
+given by NIH.
+
+Now indicate where to cache the [UMLS] data and define our application we'll
+write afterward:
+```ini
+# UTS (UMLS access)
+[uts]
+cache_file = ${default:cache_dir}/uts-request.dat
+```
+
+For brevity the [CLI] application code and configuration is omitted (see [UMLS
+Access via UTS] for more detail).
+
+To use the API to first search a term, then print entity information, we can
+use the `search_term` method with `get_atoms`:
+```python
+@dataclass
+class Application(object):
+    ...
+    def lookup(self, term: str = 'heart'):
+        # terms are returned as a list of pages with dictionaries of data
+        pages: List[Dict[str, str]] = self.uts_client.search_term(term)
+        # get all term dictionaries from the first page
+        terms: Dict[str, str] = pages[0]
+        # get the concept unique identifier
+        cui: str = terms['ui']
+
+        # print atoms of this concept
+        print('atoms:')
+        pprint(self.uts_client.get_atoms(cui))
+```
+This yields the following output:
+```
+atoms:
+{'ancestors': None,
+ 'classType': 'Atom',
+ 'code': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/MTH/NOCODE',
+ 'concept': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0018787',
+ 'contentViewMemberships': [{'memberUri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357/member/A0066369',
+                             'name': 'MetaMap NLP View',
+                             'uri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357'}],
+ 'name': 'Heart',
+ 'obsolete': 'false',
+ 'rootSource': 'MTH',
+...
+}
+```
+
+See the full [UTS example] for the full example code.
+
+
+## Using CUI as Word Embeddings
+
+[cui2vec] was trained and can be in the same way as [word2vec].  Such examples
+is computing a similarity between [UMLS] [CUIs].  This API provides access to
+the vectors directly along with all the functionality using [cui2vec] with the
+[gensim] package.  This example computes the similarity between two medical
+concepts.  For brevity the [CLI] application code and configuration is omitted
+(see [UMLS Access via UTS] for more detail).
+
+Let's jump right to how we import everything we need for the [cui2vec] example,
+which the `uts` and `cui2vec` resource libraries:
+```ini
+[imp_conf]
+type = importini
+config_files = list:
+    resource(zensols.mednlp): resources/uts.conf,
+    resource(zensols.mednlp): resources/cui2vec.conf
+```
+The UTS configuration is given as in the [UMLS Access via UTS] section and the
+parser is configured as in the [Medical Concept and Entity Linking] section.
+
+With the high level classes given the configuration is class looks similar to
+what we've seen before, this time we define a `similarity` method/[CLI] action:
+```python
+@dataclass
+class Application(object):
+    def similarity(self, term: str = 'heart disease', topn: int = 5):
+```
+
+Next, get the [gensim] `KeyedVectors` instance, which provides (among *many*
+other useful methods) one to compute the similarity between two words, or in
+our case, two medical [CUIs]:
+```python
+        embedding: Cui2VecEmbedModel = self.cui2vec_embedding
+        kv: KeyedVectors = embedding.keyed_vectors
+```
+
+Next we use UTS to get the term we're searching on, use [gensim] to find
+similarities, and output them:
+```python
+        res: List[Dict[str, str]] = self.uts_client.search_term(term)
+        cui: str = res[0]['ui']
+        sims_by_word: List[Tuple[str, float]] = kv.similar_by_word(cui, topn)
+        for rel_cui, proba in sims_by_word:
+            rel_atom: Dict[str, str] = self.uts_client.get_atoms(rel_cui)
+            rel_name = rel_atom.get('name', 'Unknown')
+            print(f'{rel_name} ({rel_cui}): {proba * 100:.2f}%')
+```
+
+The output contains the top (`topn`) 5 matches and their similarity to the
+search term in the example `heart`:
+```
+Heart failure (C0018801): 72.03%
+Atrial Premature Complexes (C0033036): 71.53%
+Chronic myocardial ischemia (C0264694): 69.68%
+Right bundle branch block (C0085615): 69.34%
+First degree atrioventricular block (C0085614): 69.09%
+```
+
+See the full [cui2vec example] for the full example code.
+
+
+## Entity Linking with cTAKES
+
+This package provides an interface to [cTAKES], which primarily manages the
+file system and invokes the Java program to produce results.  It then uses the
+[ctakes-parser] to create a data frame of features and linked entities from
+tokens of the source text.
+
+The configuration is a bit more involved since you have to indicate where the
+[cTAKES] program is installed, and provide your NIH key as detailed in the
+[UMLS Access via UTS] section:
+```ini
+[import]
+# refer to sections for which we need substitution in this file
+references = list: default, ctakes, uts
+sections = list: imp_env, imp_uts_key, imp_conf
+
+# expose the user HOME environment variable
+[imp_env]
+type = environment
+section_name = env
+includes = set: HOME
+
+# import the Zensols NLP UTS resource library
+[imp_conf]
+type = importini
+config_files = list:
+    resource(zensols.mednlp): resources/uts.conf,
+    resource(zensols.mednlp): resources/ctakes.conf
+
+# indicate where Apache cTAKES is installed
+[ctakes]
+home = ${env:home}/opt/app/ctakes-4.0.0.1
+source_dir = ${default:cache_dir}/ctakes/source
+```
+For brevity the [CLI] application code and configuration is omitted, and other
+configuration given in previous sections (see [UMLS Access via UTS] for more
+detail). See the full [cTAKES example] for the full example code.
+
+The pertinent snippet to get the [Pandas] data frame from the medical text is
+very simple:
+```python
+@dataclass
+class Application(object):
+    def entities(self, sent: str = None, output: Path = None):
+        if sent is None:
+            sent = 'He was diagnosed with kidney failure in the United States.'
+        self.ctakes_stash.set_documents([sent])
+        df: pd.DataFrame = self.ctakes_stash['0']
+        print(df)
+        if output is not None:
+            df.to_csv(output)
+            print(f'wrote: {output}')
+```
+The `set_documents` expects a list of text, which is saved to disk.  When
+[cTAKES] is run, the directory where this list of text is saved (one file per
+element in the list).  The access to the [Stash] accesses the first document by
+element ID.  **Note**: the element ID has to be a string to follow the [Stash]
+API.
+
+
+<!-- links -->
+[UMLS Access via UTS]: #umls-access-via-uts
+[Medical Concept and Entity Linking]: #medical-concept-and-entity-linking
+
+[UMLS]: https://www.nlm.nih.gov/research/umls/
+[CUIs]: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
+[cui2vec]: https://arxiv.org/abs/1804.01486
+[word2vec]: https://arxiv.org/abs/1301.3781
+
+[Pandas]: https://pandas.pydata.org
+[gensim]: https://radimrehurek.com/gensim/
+[cTAKES]: https://ctakes.apache.org
+[ctakes-parser]: https://pypi.org/project/ctakes-parser
+
+[Zensols Framework]: https://arxiv.org/abs/2109.03383
+[CLI]: https://plandes.github.io/util/doc/command-line.html
+[Stash]: https://plandes.github.io/util/api/zensols.persist.html#zensols.persist.domain.Stash
+[Zensols NLP package]: https://github.com/plandes/nlparse
+[Zensols NLP parsing API]: https://plandes.github.io/nlparse/doc/feature-doc.html
+
+[examples]: https://github.com/plandes/mednlp/tree/master/example
+[entity example]: https://github.com/plandes/mednlp/tree/master/example/features
+[cTAKES example]: https://github.com/plandes/mednlp/tree/master/example/ctakes
+[cui2vec example]: https://github.com/plandes/mednlp/tree/master/example/cui2vec
+[UTS example]: https://github.com/plandes/mednlp/tree/master/example/uts
+[UTS key file]: https://github.com/plandes/mednlp/tree/master/example/uts-key.json