Medical natural language parsing and utility library

A natural language medical domain parsing library. This library:

  • Provides an interface to the [UTS] ([UMLS] Terminology Services) RESTful
    service with data caching (NIH login needed).
  • Wraps the [MedCAT] library by parsing medical and clinical text into
    first-class Python objects that reflect the structure of the natural
    language, complete with [UMLS] entity linking with [CUIs] and other
    domain-specific features.
  • Combines non-medical (such as POS and NER tags) and medical features (such as
    [CUIs]) in one API and resulting data structure and/or as a [Pandas] data
    frame.
  • Provides [cui2vec] as a [word embedding model] for either fast indexing and
    access or to use directly as features in a [Zensols Deep NLP embedding layer]
    model.
  • Provides access to [cTAKES] using a dictionary-like [Stash] abstraction.
  • Includes a command line program to access all of these features without
    having to write any code.

Documentation

See the full documentation. The API reference is also available.

Installing

Install the library using a Python package manager such as pip:

pip3 install zensols.mednlp

CUI Embeddings

To use the cui2vec functionality, the embeddings must be downloaded
manually. Start with these commands:

mkdir -p ~/.cache/zensols/mednlp
wget -O ~/.cache/zensols/mednlp/cui2vec.zip 'https://figshare.com/ndownloader/files/10959626?private_link=00d69861786cd0156d81'

If the download fails, or the downloaded file is not a zip file (for example,
it contains the text of an HTML error message instead), download the file
manually by browsing to it, and then move it to
~/.cache/zensols/mednlp/cui2vec.zip.
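Before using the embeddings, it can help to confirm that the downloaded file is a zip archive and not a saved error page. The following is a minimal sketch that uses only the Python standard library. The helper name `is_valid_zip` is hypothetical and is not part of this library.

```python
import zipfile
from pathlib import Path


def is_valid_zip(path: Path) -> bool:
    """Return True only if ``path`` exists and is a readable zip archive."""
    if not path.is_file():
        return False
    if not zipfile.is_zipfile(path):
        # Typically an HTML error page saved under the .zip name.
        return False
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the first corrupt member name, or None if all pass.
        return zf.testzip() is None


if __name__ == "__main__":
    # Location used by the download commands above.
    archive = Path.home() / ".cache" / "zensols" / "mednlp" / "cui2vec.zip"
    status = "ok" if is_valid_zip(archive) else "missing or not a zip file"
    print(f"{archive}: {status}")
```

Note that `testzip()` reads every member of the archive, so the check may take a while on a large file.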

Usage

To parse text, create features, and extract clinical concept identifiers:

>>> from zensols.mednlp import ApplicationFactory
>>> doc_parser = ApplicationFactory.get_doc_parser()
>>> doc = doc_parser('John was diagnosed with kidney failure')
>>> for tok in doc.tokens: print(tok.norm, tok.pos_, tok.tag_, tok.cui_, tok.detected_name_)
John PROPN NNP -<N>- -<N>-
was AUX VBD -<N>- -<N>-
diagnosed VERB VBN -<N>- -<N>-
with ADP IN -<N>- -<N>-
kidney NOUN NN C0035078 kidney~failure
failure NOUN NN C0035078 kidney~failure
>>> print(doc.entities)
(<John>, <kidney failure>)

See the full example, and for other
functionality, see the examples.
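The token attributes printed in the session above can be aggregated into a list of distinct concepts. The sketch below is illustrative only: `Tok` is a hypothetical stand-in for the library's token class, `concepts_by_cui` is a hypothetical helper, and it assumes that tokens without a linked concept carry the string `-<N>-` in `cui_`, as the printed output suggests.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Placeholder shown in the output above for tokens without a linked concept
# (assumed here to be a plain string value).
NONE_MARKER = "-<N>-"


@dataclass
class Tok:
    """Hypothetical stand-in exposing the token attributes used above."""
    norm: str
    cui_: str
    detected_name_: str


def concepts_by_cui(tokens: Iterable) -> Dict[str, str]:
    """Map each linked CUI to its detected name, in first-seen order."""
    found: Dict[str, str] = {}
    for tok in tokens:
        if tok.cui_ != NONE_MARKER:
            # Multi-token concepts repeat the same CUI; keep the first.
            found.setdefault(tok.cui_, tok.detected_name_)
    return found


# Values copied from the example session above.
tokens = [
    Tok("John", NONE_MARKER, NONE_MARKER),
    Tok("was", NONE_MARKER, NONE_MARKER),
    Tok("diagnosed", NONE_MARKER, NONE_MARKER),
    Tok("with", NONE_MARKER, NONE_MARKER),
    Tok("kidney", "C0035078", "kidney~failure"),
    Tok("failure", "C0035078", "kidney~failure"),
]
print(concepts_by_cui(tokens))  # {'C0035078': 'kidney~failure'}
```

With a real parsed document, the same helper could presumably be called as `concepts_by_cui(doc.tokens)`, since it only reads the `cui_` and `detected_name_` attributes.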

MedCAT Models

By default, this library uses the small MedCAT model used for tutorials, which
is not sufficient for any serious project. To get the UMLS trained model, the
[MedCAT UMLS request form] must be filled out (see the [MedCAT] repository).

After you obtain access and download the new model, add the following to
~/.mednlprc:

[medcat_status_resource]
url = file:///location/to/the/downloaded/file/umls_sm_wstatus_2021_oct.zip
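
A typo in the `url` entry is easy to miss, so it may be worth checking that the entry resolves to a real file. The following is a rough sketch that assumes the file is plain INI text readable by Python's `configparser`. The helper name `model_path_from_rc` is hypothetical and not part of this library.

```python
import configparser
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def model_path_from_rc(rc_text: str) -> Path:
    """Return the local path named by the ``file://`` URL in the config."""
    parser = configparser.ConfigParser()
    parser.read_string(rc_text)
    # Section and option names are taken from the snippet above.
    url = parser["medcat_status_resource"]["url"]
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got: {url}")
    return Path(url2pathname(parsed.path))


if __name__ == "__main__":
    rc_file = Path.home() / ".mednlprc"
    if rc_file.is_file():
        model = model_path_from_rc(rc_file.read_text())
        print(f"{model}: {'found' if model.is_file() else 'not found'}")
    else:
        print(f"{rc_file} does not exist")
```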

Attribution

This API utilizes the following frameworks:

  • [MedCAT]: used to extract information from Electronic Health Records (EHRs)
    and link it to biomedical ontologies like SNOMED-CT and UMLS.
  • [cTAKES]: a natural language processing system for extraction of information
    from electronic medical record clinical free-text.
  • [cui2vec]: a new set of (like word) embeddings for medical concepts learned
    using an extremely large collection of multimodal medical data.
  • [Zensols Deep NLP library]: a deep learning utility library for natural
    language processing that aids in feature engineering and embedding layers.
  • [ctakes-parser]: parses [cTAKES] output into a [Pandas] data frame.

Citation

If you use this project in your research please use the following BibTeX entry:

@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}

Community

Please star the project and let me know how and where you use this API.
Contributions as pull requests, feedback, and any other input are welcome.

Changelog

An extensive changelog is available here.

License

MIT License

Copyright (c) 2021 - 2025 Paul Landes