## Comprehensive Tutorials for EHRKit

In [3]:
from EHRKit import EHRKit

### Initializations & Updates

Create an EHRKit object

In [4]:
# create empty kit
kit = EHRKit()

# inspect default values
print(f"main record: {kit.main_record}")
print(f"supporting records: {kit.supporting_records}")
print(f"scispacy model: {kit.scispacy_model}")
print(f"marianMT model: {kit.marian_model}")

main record: 
supporting records: []
scispacy model: en_core_sci_sm
marianMT model: Helsinki-NLP/opus-mt-en-ROMANCE


In [5]:
# create kit with specifications
main_record = "main record"
supporting_records = ["supporting document 1", "supporting document 2"]
kit = EHRKit(main_record, supporting_records, "en_core_sci_sm")

print(f"main record: {kit.main_record}")
print(f"supporting records: {kit.supporting_records}")
print(f"scispacy model: {kit.scispacy_model}")
print(f"marianMT model: {kit.marian_model}")

main record: main record
supporting records: ['supporting document 1', 'supporting document 2']
scispacy model: en_core_sci_sm
marianMT model: Helsinki-NLP/opus-mt-en-ROMANCE


Use update_and_delete_main_record to replace the current main record with a new one. The previous main record is discarded and supporting_records is left unchanged.

In [6]:
kit.update_and_delete_main_record("new main record")
print(kit.main_record)
print(kit.supporting_records)

new main record
['supporting document 1', 'supporting document 2']


Use update_and_keep_main_record to replace the current main record with a new one AND move the previous main record to the end of supporting_records

In [7]:
kit.update_and_keep_main_record("new new main record")
print(kit.main_record)
print(kit.supporting_records)

new new main record
['supporting document 1', 'supporting document 2', 'new main record']


Remark: updating main_record to an empty record will throw an error.
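The update semantics shown so far can be mimicked with a small stand-in class. This is a minimal sketch, not EHRKit's implementation: `RecordStore` and its method bodies are hypothetical, and raising `ValueError` for an empty record is an assumption, since the tutorial only says that an error is thrown.

```python
class RecordStore:
    """Toy stand-in that mirrors the main/supporting record updates shown above."""

    def __init__(self, main_record="", supporting_records=None):
        self.main_record = main_record
        # Copy the list so the caller's list is never mutated.
        self.supporting_records = list(supporting_records or [])

    def _check(self, record):
        # Assumption: an empty replacement is rejected with ValueError.
        if not record:
            raise ValueError("main record must not be empty")

    def update_and_delete_main_record(self, record):
        self._check(record)
        self.main_record = record  # the previous main record is discarded

    def update_and_keep_main_record(self, record):
        self._check(record)
        # The previous main record moves to the end of the supporting records.
        self.supporting_records.append(self.main_record)
        self.main_record = record


store = RecordStore("main record", ["supporting document 1"])
store.update_and_delete_main_record("new main record")
store.update_and_keep_main_record("new new main record")
print(store.main_record)         # new new main record
print(store.supporting_records)  # ['supporting document 1', 'new main record']
```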

Use replace_supporting_records to replace the entire supporting_records list

In [8]:
kit.replace_supporting_records(['new support doc 1', 'new support doc 2'])
kit.supporting_records

['new support doc 1', 'new support doc 2']

Use add_supporting_records to append new supporting records to the existing list of supporting records

In [9]:
kit.add_supporting_records(['addition 1'])
print(kit.supporting_records)
kit.add_supporting_records(['addition 2', 'addition 3', 'addition 4'])
print(kit.supporting_records)

['new support doc 1', 'new support doc 2', 'addition 1']
['new support doc 1', 'new support doc 2', 'addition 1', 'addition 2', 'addition 3', 'addition 4']


Update default models

TODO: add valid model options

In [10]:
kit.update_scispacy_model('new scispacy model')
kit.update_bert_model('new bert model')
kit.update_marian_model('new marian model')
print(kit.scispacy_model)
print(kit.bert_model)
print(kit.marian_model)

new scispacy model
new bert model
new marian model


### Functions for textual record processing

In [11]:
kit = EHRKit()

**Abbreviation detection & expansion**: returns a list of tuples in the form (abbreviation, expanded form), each element being a str

In [12]:
record = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

kit.update_and_delete_main_record(record)
kit.get_abbreviations()

Identifying abbrevations using en_core_sci_sm
Input text (truncated): Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily.
...


[('SBMA', 'Spinal and bulbar muscular atrophy'),
 ('SBMA', 'Spinal and bulbar muscular atrophy'),
 ('AR', 'androgen receptor')]
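The pair for SBMA is returned twice above, apparently once per occurrence in the text. If only distinct pairs are needed, the list can be deduplicated with plain Python. This sketch works on a copy of the output and does not call EHRKit.

```python
# Output copied from the cell above.
abbreviations = [
    ("SBMA", "Spinal and bulbar muscular atrophy"),
    ("SBMA", "Spinal and bulbar muscular atrophy"),
    ("AR", "androgen receptor"),
]

# dict.fromkeys keeps the first occurrence of each pair and preserves order.
unique_pairs = list(dict.fromkeys(abbreviations))

# Map each abbreviation to its expanded form for quick lookup.
expansions = dict(unique_pairs)

print(unique_pairs)
print(expansions["AR"])  # androgen receptor
```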

**Hyponym detection**: returns a list of tuples in the form (hearst_pattern, entity_1, entity_2, ...), each element being a str

In [13]:
record = "Keystone plant species such as fig trees are good for the soil."

kit.update_and_delete_main_record(record)
kit.get_hyponyms()

Extracting hyponyms using en_core_sci_sm
Input text (truncated): Keystone plant species such as fig trees are good for the soil.
...


[('such_as', 'Keystone plant species', 'fig trees')]
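Because each tuple has a variable number of entities, star-unpacking is a convenient way to consume the result. The sketch below assumes that the first entity is the general term and the remaining ones are its specific instances, which holds for the such_as example but is not confirmed for other Hearst patterns.

```python
from collections import defaultdict

# Output copied from the cell above.
hyponyms = [("such_as", "Keystone plant species", "fig trees")]

# Assumption: entity_1 is the general term, the rest are specific instances.
general_to_specific = defaultdict(list)
for pattern, general, *specifics in hyponyms:
    general_to_specific[general].extend(specifics)

print(dict(general_to_specific))  # {'Keystone plant species': ['fig trees']}
```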

**Entity linking**: returns a dictionary in the form {named entity: list of strings each describing one piece of linked information}

In [14]:
record = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

kit.update_and_delete_main_record(record)
kit.get_linked_entities()

Entity linking using en_core_sci_sm
Input text (truncated): Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily.
...




{Spinal: ['CUI: C0521329, Name: Spinal\nDefinition: Of or relating to the spine or spinal cord.\nTUI(s): T082\nAliases: (total: 2): \n\t Spinal, spinal',
  'CUI: C0037922, Name: Spinal Canal\nDefinition: The cavity within the SPINAL COLUMN through which the SPINAL CORD passes.\nTUI(s): T030\nAliases (abbreviated, total: 15): \n\t vertebral canal, canal spinal, Vertebral Canal, Spinal canal, NOS, Vertebral canal, NOS, neural canal, Spinal Canals, Canal, Spinal, Spinal canal structure, Spinal canal',
  'CUI: C3887662, Name: Intraspinal Neoplasm\nDefinition: A primary or metastatic neoplasm that occurs within the spinal canal including the spinal cord and surrounding paraspinal spaces.\nTUI(s): T191\nAliases (abbreviated, total: 16): \n\t Spinal Canal Tumors, neoplasm spinal, Neoplasms of the Spinal Canal and Spinal Cord, Spinal Neoplasms, Tumor of the Spinal Canal and Spinal Cord, Neoplasms of Spinal Canal and Spinal Cord, Tumor of Spinal Canal and Spinal Cord, Neoplasm of the Spinal Can
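Each linked-information string packs several fields into one block of text. A regular expression can pull them apart, as in the sketch below. `parse_linked_info` is a hypothetical helper, and it assumes every string follows the "CUI, Name, Definition, TUI(s)" layout visible in the output above.

```python
import re

# One linked-information string, copied from the output above.
info = ("CUI: C0521329, Name: Spinal\n"
        "Definition: Of or relating to the spine or spinal cord.\n"
        "TUI(s): T082\n"
        "Aliases: (total: 2): \n\t Spinal, spinal")

# Assumption: the fields always appear in this order, one per line.
PATTERN = re.compile(
    r"CUI: (?P<cui>C\d+), Name: (?P<name>[^\n]+)\n"
    r"Definition: (?P<definition>[^\n]*)\n"
    r"TUI\(s\): (?P<tuis>[^\n]*)"
)


def parse_linked_info(text):
    """Return the CUI, name, definition, and TUI(s) as a dict, or None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


parsed = parse_linked_info(info)
print(parsed["cui"], parsed["name"], parsed["tuis"])  # C0521329 Spinal T082
```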

**Named entity recognition**: returns a list of strings, each string is an identified named entity

In [33]:
record = """Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC)."""

# using scispacy
kit.update_and_delete_main_record(record)
kit.get_named_entities()

Extracting named entities using en_core_sci_sm
Input text (truncated): Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
...


['Myeloid',
 'suppressor cells',
 'MDSC',
 'immature',
 'myeloid cells',
 'immunosuppressive activity',
 'accumulate',
 'tumor-bearing mice',
 'humans',
 'cancer',
 'hepatocellular\ncarcinoma',
 'HCC']
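The entity 'hepatocellular\ncarcinoma' keeps the line break from the triple-quoted record. If that is unwanted, whitespace can be collapsed after extraction. A small sketch on a subset of the output above:

```python
# Subset of the output above.
entities = ["Myeloid", "suppressor cells", "hepatocellular\ncarcinoma", "HCC"]

# split() with no arguments breaks on any run of whitespace, including newlines.
normalized = [" ".join(entity.split()) for entity in entities]

print(normalized[2])  # hepatocellular carcinoma
```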

In [15]:
# using stanza biomed
kit.get_named_entities(tool='stanza')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-02-16 09:51:22 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| backward_charlm | mimic   |
| pretrain        | mimic   |
| forward_charlm  | mimic   |

2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/mimic.pt.
2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/mimic.pt.
2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/mimic.pt.
2022-02-16 09:51:24 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/mimic.pt.
2022-02-16 09:51:25 INFO: File exists: /home/lily/ky334/stanza_resources/en/ner/i2b2.pt.
2022-02-16 09:51:25 INFO: File exists: /home/lily/ky334/stanza_resources/en/backward_charlm/mimic.pt.
2022-02-16 09:51:26 INFO: File exists: /ho

[('Spinal and bulbar muscular atrophy', 'PROBLEM'),
 ('an inherited motor neuron disease', 'PROBLEM'),
 ('a polyglutamine tract', 'PROBLEM'),
 ('SBMA', 'PROBLEM')]
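With tool='stanza' the result is a list of (entity, label) tuples rather than plain strings. Grouping by label is a common next step. This sketch operates on a copy of the output above.

```python
from collections import defaultdict

# Output copied from the cell above.
entities = [
    ("Spinal and bulbar muscular atrophy", "PROBLEM"),
    ("an inherited motor neuron disease", "PROBLEM"),
    ("a polyglutamine tract", "PROBLEM"),
    ("SBMA", "PROBLEM"),
]

# Collect entity texts under their label.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(len(by_label["PROBLEM"]))  # 4
```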

**Translation**: returns a string, which is the translated version of the main record. The default target_language is Spanish. Use get_supported_translation_languages to get a list of supported languages.

In [16]:
kit.get_supported_translation_languages()

['Malay_written_with_Latin',
 'Mauritian_Creole',
 'Haitian',
 'Papiamento',
 'Asturian',
 'Catalan',
 'Indonesian',
 'Galician',
 'Walloon',
 'Spanish',
 'French',
 'Romanian',
 'Portuguese',
 'Italian',
 'Occitan',
 'Aragonese',
 'Minangkabau']

In [17]:
# reference: https://qbi.uq.edu.au/brain/brain-anatomy/what-neuron
record = "Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, \
the cells responsible for receiving sensory input from the external world, for sending motor commands to  \
our muscles, and for transforming and relaying the electrical signals at every step in between. More than  \
that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do \
interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  \
although it’s not really known)."

kit.update_and_delete_main_record(record)
print(kit.get_translation())
print(kit.get_translation('French'))

Translating medical note using Helsinki-NLP/opus-mt-en-ROMANCE
Input text (truncated): Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between. More than  that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).
...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-02-16 09:51:58 INFO: Downloading default packages for language: en (English)...
2022-02-16 09:52:01 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.
2022-02-16 09:52:07 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.
2022-02-16 09:52:07 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-02-16 09:52:07 INFO: Use device: gpu
2022-02-16 09:52:07 INFO: Loading: tokenize
2022-02-16 09:52:07 INFO: Done loading processors!


Neurons (tambiè neurones o cellule nervese) sono le unità fondamentali del cervello e del sistema nervoso, le cellule responsabili di ricevere input sensorial dal mondo esterno, de mandar comandos motori ai nostri muscoli, e de trasformare e di retransmissione dei segnali elettrici a ogni passo entre. Piutè, le loro interazioni definisce chi somos noi come persone. Dicho ciò, i nostri circa 100 miliardi di neuroni interagiscono strettamente con altri tipi cellulari, largamente classificati come glia (essas possono in realtà superare neuroni, anche se non è realmente noto).
Translating medical note using Helsinki-NLP/opus-mt-en-ROMANCE
Input text (truncated): Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between. More than  that, their int


Neurons (tambiè neurones o cellule nervese) sono le unità fondamentali del cervello e del sistema nervoso, le cellule responsabili di ricevere input sensorial dal mondo esterno, de mandar comandos motori ai nostri muscoli, e de trasformare e di retransmissione dei segnali elettrici a ogni passo entre. Piutè, le loro interazioni definisce chi somos noi come persone. Dicho ciò, i nostri circa 100 miliardi di neuroni interagiscono strettamente con altri tipi cellulari, largamente classificati come glia (essas possono in realtà superare neuroni, anche se non è realmente noto).


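**Sentencizer**: splits the main record into sentences and returns them as a list of strings. The tool is selected by name: pyrush, stanza, scispacy, or stanza-biomed.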

In [19]:
# using pyrush
print(kit.get_sentences('pyrush'))

# using stanza
print(kit.get_sentences('stanza'))

# using scispacy
print(kit.get_sentences('scispacy'))

# using stanza biomed
print(kit.get_sentences('stanza-biomed'))

Segment into sentences using PyRuSH
['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']



['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']
['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neu

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-02-16 09:52:38 INFO: Downloading these customized packages for language: en (English)...
| Processor | Package |
-----------------------
| tokenize  | craft   |
| pos       | craft   |
| lemma     | craft   |
| depparse  | craft   |
| pretrain  | craft   |

2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.
2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.
2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.
2022-02-16 09:52:39 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.
2022-02-16 09:52:40 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.
2022-02-16 09:52:40 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.
2022-02-16 09:52:40 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | craft   |
| pos       | craft  

['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']
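When several sentence splitters are available, it can be useful to check whether they agree on a given record. The helper below is a hypothetical sketch: `splits_agree`, the tool names, and the shortened sentence lists are all made up for illustration and are not EHRKit output.

```python
def splits_agree(splits):
    """Return True when every tool produced the same list of sentences."""
    results = list(splits.values())
    return all(result == results[0] for result in results[1:])


# Made-up outputs keyed by placeholder tool names.
splits = {
    "tool_a": ["Neurons are cells.", "They interact closely."],
    "tool_b": ["Neurons are cells.", "They interact closely."],
    "tool_c": ["Neurons are cells. They interact closely."],
}

print(splits_agree(splits))  # False

# List the tools whose output differs from the first one.
reference = splits["tool_a"]
print([tool for tool, result in splits.items() if result != reference])  # ['tool_c']
```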


**Tokenizer**: tokenizes the input document and returns a list of lists. Each inner list contains the tokens from one sentence in the document.

In [20]:
kit.get_tokens()


[['Neurons',
  '(',
  'also',
  'called',
  'neurones',
  'or',
  'nerve',
  'cells',
  ')',
  'are',
  'the',
  'fundamental',
  'units',
  'of',
  'the',
  'brain',
  'and',
  'nervous',
  'system',
  ',',
  'the',
  'cells',
  'responsible',
  'for',
  'receiving',
  'sensory',
  'input',
  'from',
  'the',
  'external',
  'world',
  ',',
  'for',
  'sending',
  'motor',
  'commands',
  'to',
  'our',
  'muscles',
  ',',
  'and',
  'for',
  'transforming',
  'and',
  'relaying',
  'the',
  'electrical',
  'signals',
  'at',
  'every',
  'step',
  'in',
  'between',
  '.'],
 ['More',
  'than',
  'that',
  ',',
  'their',
  'interactions',
  'define',
  'who',
  'we',
  'are',
  'as',
  'people',
  '.'],
 ['Having',
  'said',
  'that',
  ',',
  'our',
  'roughly',
  '100',
  'billion',
  'neurons',
  'do',
  'interact',
  'closely',
  'with',
  'other',
  'cell',
  'types',
  ',',
  'broadly',
  'classified',
  'as',
  'glia',
  '(',
  'these',
  'may',
  'actually',
  'outnumber',
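Since the result is nested per sentence, it is often flattened or summarized before further use. A short sketch on a shortened copy of the output (the second sentence is cut to its first four tokens):

```python
from itertools import chain

# Shortened from the output above.
tokens = [
    ["More", "than", "that", ",", "their", "interactions", "define",
     "who", "we", "are", "as", "people", "."],
    ["Having", "said", "that", ","],
]

# Number of tokens in each sentence.
tokens_per_sentence = [len(sentence) for sentence in tokens]

# One flat list of tokens for the whole document.
flat_tokens = list(chain.from_iterable(tokens))

print(tokens_per_sentence)  # [13, 4]
print(len(flat_tokens))     # 17
```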
  

**Part-of-speech tags and morphological features**: returns a list of lists of tuples of length 4: word, universal POS (UPOS) tag, treebank-specific POS (XPOS) tag, and universal morphological features (UFeats). Each list corresponds to a sentence in the document.

In [21]:
kit.get_pos_tags()


[[('Neurons', 'NOUN', 'NNS', '_'),
  ('(', 'PUNCT', '-LRB-', '_'),
  ('also', 'ADV', 'RB', '_'),
  ('called', 'VERB', 'VBN', '_'),
  ('neurones', 'NOUN', 'NNS', '_'),
  ('or', 'CONJ', 'CC', '_'),
  ('nerve', 'NOUN', 'NN', '_'),
  ('cells', 'NOUN', 'NNS', '_'),
  (')', 'PUNCT', '-RRB-', '_'),
  ('are', 'VERB', 'VBP', '_'),
  ('the', 'DET', 'DT', '_'),
  ('fundamental', 'ADJ', 'JJ', '_'),
  ('units', 'NOUN', 'NNS', '_'),
  ('of', 'ADP', 'IN', '_'),
  ('the', 'DET', 'DT', '_'),
  ('brain', 'NOUN', 'NN', '_'),
  ('and', 'CONJ', 'CC', '_'),
  ('nervous', 'ADJ', 'JJ', '_'),
  ('system', 'NOUN', 'NN', '_'),
  (',', 'PUNCT', ',', '_'),
  ('the', 'DET', 'DT', '_'),
  ('cells', 'NOUN', 'NNS', '_'),
  ('responsible', 'ADJ', 'JJ', '_'),
  ('for', 'SCONJ', 'IN', '_'),
  ('receiving', 'VERB', 'VBG', '_'),
  ('sensory', 'ADJ', 'JJ', '_'),
  ('input', 'NOUN', 'NN', '_'),
  ('from', 'ADP', 'IN', '_'),
  ('the', 'DET', 'DT', '_'),
  ('external', 'ADJ', 'JJ', '_'),
  ('world', 'NOUN', 'NN', '_'),
  (',',
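The 4-tuples can be unpacked directly to count tags or filter words by tag. A minimal sketch using the first eight rows of the output above:

```python
from collections import Counter

# First rows of the output above: (word, UPOS, XPOS, UFeats).
tagged = [[
    ("Neurons", "NOUN", "NNS", "_"),
    ("(", "PUNCT", "-LRB-", "_"),
    ("also", "ADV", "RB", "_"),
    ("called", "VERB", "VBN", "_"),
    ("neurones", "NOUN", "NNS", "_"),
    ("or", "CONJ", "CC", "_"),
    ("nerve", "NOUN", "NN", "_"),
    ("cells", "NOUN", "NNS", "_"),
]]

# Count universal POS tags across all sentences.
upos_counts = Counter(upos for sentence in tagged for _, upos, _, _ in sentence)

# Keep only the words tagged as nouns.
nouns = [word for sentence in tagged for word, upos, _, _ in sentence if upos == "NOUN"]

print(upos_counts["NOUN"])  # 4
print(nouns)                # ['Neurons', 'neurones', 'nerve', 'cells']
```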

**Lemmatization**: returns a list of lists of tuples, each tuple in the form (token, lemma). Each list corresponds to a sentence in the document.

In [22]:
kit.get_lemmas()


[[('Neurons', 'neuron'),
  ('(', '('),
  ('also', 'also'),
  ('called', 'call'),
  ('neurones', 'neurone'),
  ('or', 'or'),
  ('nerve', 'nerve'),
  ('cells', 'cell'),
  (')', ')'),
  ('are', 'be'),
  ('the', 'the'),
  ('fundamental', 'fundamental'),
  ('units', 'unit'),
  ('of', 'of'),
  ('the', 'the'),
  ('brain', 'brain'),
  ('and', 'and'),
  ('nervous', 'nervous'),
  ('system', 'system'),
  (',', ','),
  ('the', 'the'),
  ('cells', 'cell'),
  ('responsible', 'responsible'),
  ('for', 'for'),
  ('receiving', 'receive'),
  ('sensory', 'sensory'),
  ('input', 'input'),
  ('from', 'from'),
  ('the', 'the'),
  ('external', 'external'),
  ('world', 'world'),
  (',', ','),
  ('for', 'for'),
  ('sending', 'send'),
  ('motor', 'motor'),
  ('commands', 'command'),
  ('to', 'to'),
  ('our', 'we'),
  ('muscles', 'muscle'),
  (',', ','),
  ('and', 'and'),
  ('for', 'for'),
  ('transforming', 'transform'),
  ('and', 'and'),
  ('relaying', 'relay'),
  ('the', 'the'),
  ('electrical', 'electrical')
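To see which tokens the lemmatizer actually changed, the (token, lemma) pairs can be grouped by lemma. A small sketch on a few pairs taken from the output above:

```python
from collections import defaultdict

# A few pairs taken from the output above.
lemmas = [[
    ("Neurons", "neuron"),
    ("called", "call"),
    ("cells", "cell"),
    ("are", "be"),
    ("the", "the"),
    ("our", "we"),
]]

# Map each lemma to the surface forms that differ from it.
forms_by_lemma = defaultdict(set)
for sentence in lemmas:
    for token, lemma in sentence:
        if token.lower() != lemma:
            forms_by_lemma[lemma].add(token)

print(sorted(forms_by_lemma))  # ['be', 'call', 'cell', 'neuron', 'we']
```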

**Dependency Parser**: returns a list of lists of tuples of length 5 (word id, word text, head id, head text, deprel). Each list corresponds to a sentence in the document.

In [23]:
kit.get_dependency()


[[(1, 'Neurons', 13, 'units', 'nsubj'),
  (2, '(', 1, 'Neurons', 'punct'),
  (3, 'also', 4, 'called', 'advmod'),
  (4, 'called', 1, 'Neurons', 'acl'),
  (5, 'neurones', 4, 'called', 'xcomp'),
  (6, 'or', 8, 'cells', 'cc'),
  (7, 'nerve', 8, 'cells', 'compound'),
  (8, 'cells', 5, 'neurones', 'conj'),
  (9, ')', 13, 'units', 'punct'),
  (10, 'are', 13, 'units', 'cop'),
  (11, 'the', 13, 'units', 'det'),
  (12, 'fundamental', 13, 'units', 'amod'),
  (13, 'units', 0, 'root', 'root'),
  (14, 'of', 16, 'brain', 'case'),
  (15, 'the', 16, 'brain', 'det'),
  (16, 'brain', 13, 'units', 'nmod'),
  (17, 'and', 19, 'system', 'cc'),
  (18, 'nervous', 19, 'system', 'amod'),
  (19, 'system', 16, 'brain', 'conj'),
  (20, ',', 13, 'units', 'punct'),
  (21, 'the', 22, 'cells', 'det'),
  (22, 'cells', 13, 'units', 'appos'),
  (23, 'responsible', 22, 'cells', 'amod'),
  (24, 'for', 25, 'receiving', 'mark'),
  (25, 'receiving', 23, 'responsible', 'advcl'),
  (26, 'sensory', 27, 'input', 'amod'),
  (27, 'i
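In this format the root of a sentence is the row whose head id is 0, and the direct dependents of any word are the rows whose head id equals that word's id. The sketch below applies this to the first 13 rows of the output above.

```python
# First 13 rows of the output above: (id, text, head id, head text, deprel).
rows = [
    (1, "Neurons", 13, "units", "nsubj"),
    (2, "(", 1, "Neurons", "punct"),
    (3, "also", 4, "called", "advmod"),
    (4, "called", 1, "Neurons", "acl"),
    (5, "neurones", 4, "called", "xcomp"),
    (6, "or", 8, "cells", "cc"),
    (7, "nerve", 8, "cells", "compound"),
    (8, "cells", 5, "neurones", "conj"),
    (9, ")", 13, "units", "punct"),
    (10, "are", 13, "units", "cop"),
    (11, "the", 13, "units", "det"),
    (12, "fundamental", 13, "units", "amod"),
    (13, "units", 0, "root", "root"),
]

# The root is the only word attached to head id 0.
root_id, root_text = next((i, text) for i, text, head, _, _ in rows if head == 0)

# Direct dependents of the root, with their relation labels.
dependents = [(text, rel) for _, text, head, _, rel in rows if head == root_id]

print(root_text)   # units
print(dependents)
```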

**Clustering**: performs k-means clustering with documents represented using pre-trained transformers, returns a dataframe with 2 columns: note and assigned cluster id. Main record and supporting records are combined in clustering. Default number of clusters is 2.

In [9]:
# A document about neurons.
record = "Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, " \
         "the cells responsible for receiving sensory input from the external world, for sending motor commands to " \
         "our muscles, and for transforming and relaying the electrical signals at every step in between. More than " \
         "that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do" \
         " interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, " \
         "although it’s not really known)."

# reference: https://www.investopedia.com/terms/n/neuralnetwork.asp
# A document about neural networks.
cand1 = "A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of " \
        "data through a process that mimics the way the human brain operates. In this sense, neural networks refer to " \
        "systems of neurons, either organic or artificial in nature."

# reference: https://medlineplus.gov/druginfo/meds/a682878.html
# A document about aspirin.
cand2 = "Prescription aspirin is used to relieve the symptoms of rheumatoid arthritis (arthritis caused by swelling " \
        "of the lining of the joints), osteoarthritis (arthritis caused by breakdown of the lining of the joints), " \
        "systemic lupus erythematosus (condition in which the immune system attacks the joints and organs and causes " \
        "pain and swelling) and certain other rheumatologic conditions (conditions in which the immune system " \
        "attacks parts of the body)."

# reference: https://www.medicalnewstoday.com/articles/161255
# Another document about aspirin.
cand3 = "People can buy aspirin over the counter without a prescription. Everyday uses include relieving headache, " \
        "reducing swelling, and reducing a fever. Taken daily, aspirin can lower the risk of cardiovascular events, " \
        "such as a heart attack or stroke, in people with a high risk. Doctors may administer aspirin immediately" \
        " after a heart attack to prevent further clots and heart tissue death."

kit.update_and_delete_main_record(record)
kit.replace_supporting_records([cand1, cand2, cand3])

kit.get_clusters(k=2)

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predic

|   | note | cluster |
|---|------|---------|
| 0 | Neurons (also called neurones or nerve cells) ... | 0 |
| 1 | A neural network is a series of algorithms tha... | 0 |
| 2 | Prescription aspirin is used to relieve the sy... | 1 |
| 3 | People can buy aspirin over the counter withou... | 1 |


In this example, we see that the document about neurons and the document about neural networks are grouped into one cluster. The two documents about aspirin are grouped into a second cluster. 
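To make the clustering step concrete, here is a plain-Python sketch of k-means (Lloyd's algorithm) on toy vectors. It is not EHRKit's implementation: the 2-D "embeddings" are made up, the `kmeans` function is hypothetical, and the initial centroids are fixed by hand so the result is deterministic, whereas real implementations usually initialize them randomly.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on small lists of floats; returns one cluster id per point."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignments


# Made-up 2-D "embeddings"; real transformer vectors have hundreds of dimensions.
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = kmeans(embeddings, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(clusters)  # [0, 0, 1, 1]
```

The first two vectors point in a similar direction and end up together, as do the last two, which mirrors the neuron and aspirin grouping in the table above.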

**Similar document retrieval**: retrieves the top_k supporting records (the candidate notes) that are most similar to the main record (the query note). Returns a dataframe with candidate_id, similarity_score, and candidate_text. The default number of similar documents is 2.

In [11]:
# using the same documents as in clustering
kit.get_similar_documents(3)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-30 16:15:42 INFO: Downloading default packages for language: en (English)...
2022-01-30 16:15:43 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.
2022-01-30 16:15:48 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.
2022-01-30 16:15:48 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-01-30 16:15:48 INFO: Use device: gpu
2022-01-30 16:15:48 INFO: Loading: tokenize
2022-01-30 16:15:48 INFO: Done loading processors!
Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predic

| | candidate_id | similarity_score | candidate_text |
|---|---|---|---|
| 0 | 0 | 0.864067 | A neural network is a series of algorithms tha... |
| 1 | 2 | 0.824104 | People can buy aspirin over the counter withou... |
| 2 | 1 | 0.746512 | Prescription aspirin is used to relieve the sy... |

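The ranking behind a result like the one above can be pictured with a minimal sketch: score every candidate against the query, sort by score, and keep the first top_k rows. This is an assumption-laden stand-in rather than EHRKit's code. It uses bag-of-words cosine similarity instead of the PubMedBERT model mentioned in the log, and `top_k_similar`, `bag_of_words`, `cosine`, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_similar(query_note, candidate_notes, top_k=2):
    """Return the top_k candidates, most similar to the query first."""
    query_vec = bag_of_words(query_note)
    scored = [
        {
            "candidate_id": i,
            "similarity_score": cosine(query_vec, bag_of_words(text)),
            "candidate_text": text,
        }
        for i, text in enumerate(candidate_notes)
    ]
    scored.sort(key=lambda row: row["similarity_score"], reverse=True)
    return scored[:top_k]


# illustrative texts only, not the notes used in the tutorial
query = "neurons are nerve cells in the brain"
candidates = [
    "a neural network mimics neurons in the brain",
    "aspirin relieves pain and fever",
    "people buy aspirin for pain relief",
]
for row in top_k_similar(query, candidates, top_k=1):
    print(row["candidate_id"])  # → 0
```

The returned keys follow the column names in the dataframe above, so each dictionary corresponds to one row of the output.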

**Summarization**: summarize a single document (the main record) or multiple documents (the main record AND the supporting records)

In [14]:
# single-document summarization
kit = EHRKit()

main_record = "Paris is the capital and most populous city of France, \
with an estimated population of 2,175,601 residents as of 2018, in an \
area of more than 105 square kilometres (41 square miles). The City of \
Paris is the centre and seat of government of the region and province \
of Île-de-France, or Paris Region, which has an estimated population of \
12,174,880, or about 18 percent of the population of France as of 2017."

kit.update_and_delete_main_record(main_record)
kit.get_single_record_summary()

Your max_length is set to 200, but you input_length is only 100. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


'the city of Paris is the centre and seat of government of the region and province of Île-de-France . it has an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles)'
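
The warning printed before the summary says that the length cap (200) is larger than the input (100 tokens), and its own example suggests a cap of 50, roughly half the input length. The sketch below only computes such a cap. `suggest_max_length` is a hypothetical helper, and this tutorial does not show whether `get_single_record_summary` accepts a length argument, so treat it as an illustration of the arithmetic rather than an EHRKit feature.

```python
def suggest_max_length(input_length, default_max=200):
    """Pick a summary length cap that stays below the input length.

    Inputs at least as long as default_max keep the default cap.
    Shorter inputs get half their length, as in the warning's example.
    """
    if input_length >= default_max:
        return default_max
    return max(1, input_length // 2)


print(suggest_max_length(100))  # → 50
print(suggest_max_length(500))  # → 200
```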

In [16]:
# multi-document summarization
kit = EHRKit()

record = "Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, " \
         "the cells responsible for receiving sensory input from the external world, for sending motor commands to " \
         "our muscles, and for transforming and relaying the electrical signals at every step in between. More than " \
         "that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do" \
         " interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, " \
         "although it’s not really known)."

doc1 = "A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of " \
        "data through a process that mimics the way the human brain operates. In this sense, neural networks refer to " \
        "systems of neurons, either organic or artificial in nature."

# https://qbi.uq.edu.au/brain/brain-physiology/what-are-neurotransmitters
doc2 = "Neurotransmitters are often referred to as the body’s chemical messengers. They are the molecules used by the " \
       "nervous system to transmit messages between neurons, or from neurons to muscles. Communication between two neurons " \
       "happens in the synaptic cleft (the small gap between the synapses of neurons). Here, electrical signals that have "\
       "travelled along the axon are briefly converted into chemical ones through the release of neurotransmitters, causing "\
       "a specific response in the receiving neuron."

kit.update_and_delete_main_record(record)
kit.replace_supporting_records([doc1, doc2])

kit.get_multi_record_summary()

'– Neurons, also known as nerve cells, are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to our muscles, and for transforming and relaying the electrical signals at every step in between. More than that, their interactions define who we are as people, the New York Times reports. Neuron networks are a series of algorithms that endeavor to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either'