b/wrapper_functions/documentation.md

# 📔 Documentation

<img src="https://github.com/karenacorn99/LILY-EHRKit/blob/main/EHRLogo.png" alt="drawing" width="140"/>

A Python Natural Language Processing Toolkit for Electronic Health Record Texts

## Key Modules and Functions

### ✨multi_doc_functions.py

`get_similar_documents(bert_model, query_note, candidate_notes, candidates, top_k)`: find the documents/records most similar to the query note and return the `top_k` results.

**Parameters**:
* `bert_model`: the name of the BERT model
* `query_note`: the query note, a string
* `candidate_notes`: the candidate notes, a list of strings
* `candidates`: the candidate IDs, a list of ints
* `top_k`: the number of results to return, default value is 2
* returns a `DataFrame` with candidate_note_id, similarity_score, and candidate_text

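
The ranking step behind a function like `get_similar_documents` can be illustrated with plain cosine similarity. The sketch below is not EHRKit's implementation: it assumes the notes have already been encoded as vectors (the toy 3-dimensional vectors stand in for BERT embeddings), and the helper names `cosine` and `rank_candidates` are hypothetical. It returns a list of dicts with the same three fields as the documented `DataFrame`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidate_vecs, candidates, candidate_notes, top_k=2):
    """Return the top_k candidates sorted by descending similarity to the query."""
    rows = [
        {
            "candidate_note_id": cid,
            "similarity_score": cosine(query_vec, vec),
            "candidate_text": note,
        }
        for cid, vec, note in zip(candidates, candidate_vecs, candidate_notes)
    ]
    rows.sort(key=lambda row: row["similarity_score"], reverse=True)
    return rows[:top_k]


# Toy vectors (assumption): real BERT embeddings have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
candidate_vecs = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
candidates = [101, 102, 103]
candidate_notes = ["chest pain on exertion", "routine dental visit", "mild chest discomfort"]

top = rank_candidates(query_vec, candidate_vecs, candidates, candidate_notes, top_k=2)
for row in top:
    print(row["candidate_note_id"], round(row["similarity_score"], 3), row["candidate_text"])
```

With these toy vectors, candidates 101 and 103 rank first because they point in nearly the same direction as the query, while 102 is orthogonal to it.
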
`get_clusters(bert_model, notes, k=2)`: use K-means to cluster the records into `k` clusters based on their pretrained BERT encodings.

**Parameters**:
* `bert_model`: the name of the BERT model
* `notes`: the records to cluster
* `k`: the number of clusters, default value is 2
* returns a `DataFrame` object...

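
As a rough illustration of the K-means step that `get_clusters` describes, here is a minimal pure-Python sketch of Lloyd's algorithm. It is an assumption-laden stand-in, not the toolkit's code: the `kmeans` helper is hypothetical, the 2-dimensional points stand in for BERT encodings, and centroids are initialised from the first `k` points to keep the result deterministic.

```python
import math


def kmeans(points, k=2, iterations=20):
    """Lloyd's algorithm: return one cluster label per input point."""
    # Deterministic initialisation (assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy 2-D "encodings" (assumption): two notes near the origin, two near (5, 5).
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
labels = kmeans(points, k=2)
print(labels)
```

In practice a library implementation with better initialisation (for example k-means++) is usually preferable; the sketch only shows the assign/update loop.
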
### ✨scispacy_functions.py

`get_abbreviations(model, text)`: get the abbreviations in the input text and their meanings.

**Parameters**:
* `model`: model name, supports spaCy models
* `text`: input text, a string
* returns a list of tuples in the form (abbreviation, expanded form), each element being a str

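
To make the (abbreviation, expanded form) output shape concrete, the sketch below detects the common "long form (ABBR)" pattern with a regular expression and an initial-letter check. This is a deliberately naive heuristic written for illustration, not the scispaCy-based method the wrapper uses; `find_abbreviations` is a hypothetical helper.

```python
import re


def find_abbreviations(text):
    """Return (abbreviation, expanded form) pairs for 'long form (ABBR)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        # Take as many preceding words as the abbreviation has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        # Accept the pair only if the initials spell the abbreviation.
        if initials == abbr:
            pairs.append((abbr, " ".join(candidate)))
    return pairs


text = (
    "The patient has chronic obstructive pulmonary disease (COPD) "
    "and a history of myocardial infarction (MI)."
)
pairs = find_abbreviations(text)
print(pairs)
# [('COPD', 'chronic obstructive pulmonary disease'), ('MI', 'myocardial infarction')]
```

The heuristic misses abbreviations whose letters do not map one-to-one onto word initials, which is one reason a trained detector is the better choice for real notes.
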
`get_hyponyms(model, text)`: get the hyponyms of the recognized entities in the input text.

**Parameters**:
* `model`: model name, supports spaCy models
* `text`: input text, a string
* returns a list of tuples in the form (hearst_pattern, entity_1, entity_2, ...), each element being a str

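
Hearst patterns are lexical templates such as "X such as A, B and C" that signal a general term followed by more specific ones. The sketch below handles only that single template with string operations, purely to illustrate the tuple shape described above; the `find_such_as` helper and the `"such_as"` label are assumptions, and the real wrapper relies on spaCy models rather than this logic.

```python
import re


def find_such_as(text):
    """Return (pattern, general_term, specific_term, ...) tuples for 'X such as ...'."""
    results = []
    for sentence in re.split(r"[.;!?]", text):
        head, sep, tail = sentence.partition(" such as ")
        if not sep or not head.split():
            continue
        # The general term is taken to be the last word before 'such as'.
        general = head.split()[-1].strip(",")
        # The specific terms are separated by commas, 'and', or 'or'.
        specifics = [
            part.strip()
            for part in re.split(r",|\band\b|\bor\b", tail)
            if part.strip()
        ]
        results.append(("such_as", general, *specifics))
    return results


text = "She takes anticoagulants such as warfarin, heparin and apixaban. No known allergies."
hyponyms = find_such_as(text)
print(hyponyms)
# [('such_as', 'anticoagulants', 'warfarin', 'heparin', 'apixaban')]
```
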
`get_linked_entities(model, text)`: get the linked entities in the input text.

**Parameters**:
* `model`: model name, supports spaCy models
* `text`: input text, a string
* returns a dictionary in the form {named entity: list of strings, each describing one piece of linked information}

`get_named_entities(model, text)`: get the named entities in the input text.

**Parameters**:
* `model`: model name, supports spaCy models
* `text`: input text, a string
* returns a list of strings, each of which is an identified named entity

### ✨transformer_functions.py

`get_supported_translation_languages()`: returns a list of the supported target language names, each a string.

`get_translation(text, model_name, target_language)`: translate the input text into the target language.

**Parameters**:
* `text`: input text, a string
* `model_name`: BERT model name, a string
* `target_language`: target language name from the supported language list
* returns a string, which is the translated version of `text`

`get_bert_embeddings(pretrained_model, texts)`: encode the input texts with a pretrained BERT model.

**Parameters**:
* `pretrained_model`: BERT model name, a string
* `texts`: input texts, a list of strings
* returns a list of lists of sentences; each inner list is made up of sentences from the same document

### ✨stanza_functions.py

`get_denpendencies(text)`: returns the dependency parsing result for the input `text` (a string). This is a wrapper around the stanza library.

### ✨summarization_functions.py

`get_single_summary(text, model_name="t5-small", min_length=50, max_length=200)`: single-document summarization.

**Parameters**:
* `text`: the input text, a string
* `model_name`: model name, a string. The following models are currently supported: `bart-large-cnn`, `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`
* `min_length`: minimum length of the summary
* `max_length`: maximum length of the summary
* returns a list of summaries, each a string

`get_multi_summary_joint(text, model_name="osama7/t5-summarization-multinews", min_length=50, max_length=200)`: multi-document summarization. Joins all the input documents into one long document, then performs single-document summarization on it.

**Parameters**:
* `text`: the input documents, a list of strings
* `model_name`: model name, a string. The following models are currently supported: `bart-large-cnn`, `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`
* `min_length`: minimum length of the summary
* `max_length`: maximum length of the summary
* returns a list of summaries, each a string

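
The "join, then summarize once" design of `get_multi_summary_joint` can be sketched independently of any particular model. In the snippet below, `summarize_jointly` and the placeholder summarizer `lead_words` are hypothetical names; `lead_words` simply keeps the first few words and stands in for a real model call, which is an assumption made so the example runs without downloads.

```python
def summarize_jointly(documents, summarize, separator=" "):
    """Concatenate all documents, then run a single-document summarizer once."""
    joined = separator.join(doc.strip() for doc in documents if doc.strip())
    return summarize(joined)


def lead_words(text, max_words=12):
    """Placeholder summarizer (assumption): keep the first max_words words."""
    return " ".join(text.split()[:max_words])


documents = [
    "Patient admitted with chest pain.",
    "ECG showed ST elevation.",
    "Discharged on aspirin and statin.",
]
summary = summarize_jointly(documents, lead_words)
print(summary)
```

One consequence of this design is that the joined text can exceed a model's input limit, so documents late in the list may be truncated before the summarizer sees them.
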
`get_multi_summary_extractive_textRank(text, ratio=-0.1, words=0)`: TextRank method for multi-document summarization.

**Parameters**:
* `text`: a list of strings
* `ratio`: the ratio of the summary (0-1.0)
* `words`: the number of words in the summary, default is 50
* returns a string as the final summary

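
For readers unfamiliar with TextRank, the following is a minimal sketch of the idea: build a graph whose nodes are sentences and whose edge weights are word overlap, run PageRank-style iterations, and keep the highest-scoring sentences in their original order. It is not the implementation behind `get_multi_summary_extractive_textRank`; the `textrank_summary` helper, the damping factor, the iteration count, and the naive sentence splitter are all assumptions.

```python
import math
import re


def textrank_summary(documents, ratio=0.2, damping=0.85, iterations=50):
    """Rank sentences over a word-overlap graph and return the top ones as one string."""
    text = " ".join(documents)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    if n == 0:
        return ""

    def similarity(i, j):
        # Word overlap normalised by sentence lengths, as in the original TextRank paper.
        overlap = len(tokens[i] & tokens[j])
        if overlap == 0 or len(tokens[i]) < 2 or len(tokens[j]) < 2:
            return 0.0
        return overlap / (math.log(len(tokens[i])) + math.log(len(tokens[j])))

    weights = [[similarity(i, j) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion to edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    keep = max(1, round(n * ratio))
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))


docs = [
    "The patient reports chest pain. Chest pain worsens with exertion.",
    "The patient denies fever. An ECG was ordered for the chest pain. Follow-up is in two weeks.",
]
summary = textrank_summary(docs, ratio=0.4)
print(summary)
```

With five input sentences and `ratio=0.4`, the sketch keeps two sentences.
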
### ✨medspacy_functions.py

`get_word_tokenization(text)`: word tokenization using the medspaCy package.

**Parameters**:
* `text`: input text, a string
* returns a list of tokens (words), each a string

`get_section_detection(text, rules)`: given a string as the input, extract its sections, such as medical history, allergies, comments and so on.

**Parameters**:
* `text`: input text, a string
* `rules`: the personalized rules, a dictionary of strings, e.g., {"category": "allergies"}, default is None
* returns a list of spaCy Section objects

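
Section detection amounts to finding known header strings and slicing the note between them. The regex-based sketch below illustrates this under stated assumptions: `detect_sections` is a hypothetical helper, the header-to-category mapping is an invented rule format (not the medspaCy rule format), and the result is a list of plain `(category, body)` tuples rather than Section objects.

```python
import re

# Hypothetical default rules (assumption): header text mapped to a category name.
DEFAULT_HEADERS = {
    "past medical history": "past_medical_history",
    "allergies": "allergies",
    "comments": "comments",
}


def detect_sections(text, headers=None):
    """Split text at 'Header:' line starts and return (category, body) pairs."""
    headers = {key.lower(): value for key, value in (headers or DEFAULT_HEADERS).items()}
    pattern = re.compile(
        r"^(%s):" % "|".join(re.escape(header) for header in headers),
        flags=re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = []
    # Each section body runs from the end of its header to the start of the next one.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        body = text[current.end():end].strip()
        sections.append((headers[current.group(1).lower()], body))
    return sections


note = (
    "Past Medical History: hypertension, type 2 diabetes\n"
    "Allergies: penicillin\n"
    "Comments: follow up in two weeks\n"
)
sections = detect_sections(note)
for category, body in sections:
    print(category, "->", body)
```

Passing a custom mapping, for example `detect_sections(note, {"Allergies": "allergies"})`, restricts detection to the listed headers.
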
`get_UMLS_match(text)`: match UMLS concepts in the input text.

**Parameters**:
* `text`: input text, a string
* returns a list of tuples, (entity_text, label, similarity, semtypes)