# 📔 Documentation
A Python Natural Language Processing Toolkit for Electronic Health Record Texts
get_similar_documents(bert_model, query_note, candidate_notes, candidates, top_k)
: find similar documents/records given the query record ID number, return the top_k
results.
Parameters:
* bert_model
: the name of the bert model
* query_note
: query note, string
* candidate_notes
: candiate note, a list of string
* candidates
: a list of candidate ID, a list of int
* top_k
: the number of return results, default value is 2
* returns a DataFrame
with candidate_note_id, similarity_score, and candidate_text
get_clusters(bert_model, notes, k=2)
: Use K-means to cluster the records using pretrained bert encoding into k
clusters.
Parameters:
* bert_model
: the name of the bert model
* top_k
: the number of clusters, default value is 2
* returns a DataFrame
object...
get_abbreviations(model, text)
: get abbreviations and their meanings of the input text.
Parameters:
* model
: model name, supports Spacy models
* text
: input text, string
* returns a list of tuples in the form (abbreviation, expanded form), each element being a str
get_hyponyms(model, text)
: get hyponyms of the recognized entities in the input text.
Parameters:
* model
: model name, supports Spacy models
* text
: input text, string
* returns a list of tuples in the form (hearst_pattern, entity_1, entity_2, ...), each element being a str
get_linked_entities(model, text)
: get linked entities in the input text.
Parameters:
* model
: model name, supports Spacy models
* text
: input text, string
* returns a dictionary in the form {named entity: list of strings each describing one piece of linked information}
get_named_entities(model, text)
: get named entities in the input text.
Parameters:
* model
: model name, supports Spacy models
* text
: input text, string
* returns a list of strings, each string is an identified named entity
get_supported_translation_languages()
: returns a list of support target language names in string.
get_translation(text, model_name, target_language)
: translate the input text into the target language.
Parameters:
* text
: input text in string
* model_name
: bert model name in string
* target_language
: target language name from the supported langauge list
* returns a string, which is the translated version of text]
get_bert_embeddings(pretrained_model, texts)
: encode the input text with pretrained bert model
Parameters:
* pretrained_model
: bert model name in string
* texts
: input text in a list of string
* returns a list of lists of sentences, each list is made up of sentences from the same document
get_denpendencies(text)
: dependency parsing result for the input text
in string, this is a wrapper of the stanza library.
get_single_summary(text, model_name="t5-small", min_length=50, max_length=200)
: single document summarization.
Parameters:
* text
: a string for the input text
* model_name
: bert model name in string, now we support the following models: bart-large-cnn
', 't5-small
', 't5-base
', 't5-large
', 't5-3b
', 't5-11b
* min_length
: min length in summary
* max_length
: max length in summary
* returns a list of summarization in string
get_multi_summary_joint(text, model_name="osama7/t5-summarization-multinews", min_length=50, max_length=200)
: multi-document summarization function. Join all the input documents as a long document, then do single document summarization.
Parameters:
* text
: a list of document in string
* model_name
: bert model name in string, now we support the following models: bart-large-cnn
, t5-small
, t5-base
, t5-large
, t5-3b
', t5-11b
* min_length
: min length in summary
* max_length
: max length in summary
* returns a list of summarization in string
get_multi_summary_extractive_textRank(text,ratio=-0.1,words=0)
: Textrank method for multi-doc summarization.
Parameters:
* text
: a list of string
* ratio
: the ratio of summary (0-1.0)
* words
: the number of words of summary, default is 50
* returns a string as the final summarization
get_word_tokenization(text)
: word tokenization using medspaCy package.
Parameters:
* text
: input string text
* returns a list of token or word in string
get_section_detection(text,rules)
: given a string as the input, extract sections, consisting of medical history, allergies, comments and so on.
Parameters:
* text
: input string text
* rule
: the personalized rules, a dictionary of string, i.e., {"category": "allergies"}, default is None
* returns a list of spacy Section object
get_UMLS_match(text)
: match the UMLS concept for the input text.
Parameters:
* text
: input string text
* returns a list of tuples, (entity_text, label, similarity, semtypes)