{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "52b36ac1",
   "metadata": {},
   "source": [
    "## Comprehensive Tutorials for EHRKit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ce111e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from EHRKit import EHRKit"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0db065a",
   "metadata": {},
   "source": [
    "### Initializations & Updates"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e035ea33",
   "metadata": {},
   "source": [
    "Create an EHRKit object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fc030fa3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "main record: \n",
      "supporting records: []\n",
      "scispacy model: en_core_sci_sm\n",
      "marianMT model: Helsinki-NLP/opus-mt-en-ROMANCE\n"
     ]
    }
   ],
   "source": [
    "# create empty kit\n",
    "kit = EHRKit()\n",
    "\n",
    "# inspect default values\n",
    "print(f\"main record: {kit.main_record}\")\n",
    "print(f\"supporting records: {kit.supporting_records}\")\n",
    "print(f\"scispacy model: {kit.scispacy_model}\")\n",
    "print(f\"marianMT model: {kit.marian_model}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1c39145c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "main record: main record\n",
      "supporting records: ['supporting document 1', 'supporting document 2']\n",
      "scispacy model: en_core_sci_sm\n",
      "marianMT model: Helsinki-NLP/opus-mt-en-ROMANCE\n"
     ]
    }
   ],
   "source": [
    "# create kit with specifications\n",
    "main_record = \"main record\"\n",
    "supporting_records = [\"supporting document 1\", \"supporting document 2\"]\n",
    "kit = EHRKit(main_record, supporting_records, \"en_core_sci_sm\")\n",
    "\n",
    "print(f\"main record: {kit.main_record}\")\n",
    "print(f\"supporting records: {kit.supporting_records}\")\n",
    "print(f\"scispacy model: {kit.scispacy_model}\")\n",
    "print(f\"marianMT model: {kit.marian_model}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "634c3932",
   "metadata": {},
   "source": [
    "Use update_and_delete_main_record to replace the current main record with a new one (the previous main record is discarded)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6544d096",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "new main record\n",
      "['supporting document 1', 'supporting document 2']\n"
     ]
    }
   ],
   "source": [
    "kit.update_and_delete_main_record(\"new main record\")\n",
    "print(kit.main_record)\n",
    "print(kit.supporting_records)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7232def6",
   "metadata": {},
   "source": [
    "Use update_and_keep_main_record to replace the current main record with a new one and append the previous main record to the end of supporting_records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "85c29ffa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "new new main record\n",
      "['supporting document 1', 'supporting document 2', 'new main record']\n"
     ]
    }
   ],
   "source": [
    "kit.update_and_keep_main_record(\"new new main record\")\n",
    "print(kit.main_record)\n",
    "print(kit.supporting_records)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0d99493",
   "metadata": {},
   "source": [
    "**Remark:** updating main_record to an empty value raises an error"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53dcd41c",
   "metadata": {},
   "source": [
    "Use replace_supporting_records to replace the entire supporting_records list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "eca10382",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['new support doc 1', 'new support doc 2']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.replace_supporting_records(['new support doc 1', 'new support doc 2'])\n",
    "kit.supporting_records"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4716da7e",
   "metadata": {},
   "source": [
    "Use add_supporting_records to append new supporting records to the existing list of supporting records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "155a716b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['new support doc 1', 'new support doc 2', 'addition 1']\n",
      "['new support doc 1', 'new support doc 2', 'addition 1', 'addition 2', 'addition 3', 'addition 4']\n"
     ]
    }
   ],
   "source": [
    "kit.add_supporting_records(['addition 1'])\n",
    "print(kit.supporting_records)\n",
    "kit.add_supporting_records(['addition 2', 'addition 3', 'addition 4'])\n",
    "print(kit.supporting_records)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f58440ff",
   "metadata": {},
   "source": [
    "Update default models\n",
    "\n",
    "TODO: add valid model options"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "09c6e716",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "new scispacy model\n",
      "new bert model\n",
      "new marian model\n"
     ]
    }
   ],
   "source": [
    "kit.update_scispacy_model('new scispacy model')\n",
    "kit.update_bert_model('new bert model')\n",
    "kit.update_marian_model('new marian model')\n",
    "print(kit.scispacy_model)\n",
    "print(kit.bert_model)\n",
    "print(kit.marian_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2e096a1",
   "metadata": {},
   "source": [
    "### Functions for textual record processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "424d1dd0",
   "metadata": {},
   "outputs": [],
   "source": [
    "kit = EHRKit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fc66486",
   "metadata": {},
   "source": [
    "**Abbreviation detection & expansion**: returns a list of tuples in the form (abbreviation, expanded form), each element being a str"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "85f5b722",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Identifying abbrevations using en_core_sci_sm\n",
      "Input text (truncated): Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily.\n",
      "...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('SBMA', 'Spinal and bulbar muscular atrophy'),\n",
       " ('SBMA', 'Spinal and bulbar muscular atrophy'),\n",
       " ('AR', 'androgen receptor')]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record = \"Spinal and bulbar muscular atrophy (SBMA) is an \\\n",
    "inherited motor neuron disease caused by the expansion \\\n",
    "of a polyglutamine tract within the androgen receptor (AR). \\\n",
    "SBMA can be caused by this easily.\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.get_abbreviations()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49fb300f",
   "metadata": {},
   "source": [
    "**Hyponym detection**: returns a list of tuples in the form (hearst_pattern, entity_1, entity_2, ...), each element being a str"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6a70e009",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting hyponyms using en_core_sci_sm\n",
      "Input text (truncated): Keystone plant species such as fig trees are good for the soil.\n",
      "...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('such_as', 'Keystone plant species', 'fig trees')]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record = \"Keystone plant species such as fig trees are good for the soil.\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.get_hyponyms()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b2e3e2e",
   "metadata": {},
   "source": [
    "**Entity linking**: returns a dictionary in the form {named entity: list of strings}, where each string describes one linked candidate concept (CUI, name, definition, TUIs, and aliases)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "47a201ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Entity linking using en_core_sci_sm\n",
      "Input text (truncated): Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily.\n",
      "...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/data/lily/ky334/LILY-EHRKit/virtenv/lib/python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.20.3 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.\n",
      "  UserWarning)\n",
      "/data/lily/ky334/LILY-EHRKit/virtenv/lib/python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.20.3 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.\n",
      "  UserWarning)\n",
      "/data/lily/ky334/LILY-EHRKit/virtenv/lib/python3.6/site-packages/scispacy/candidate_generation.py:284: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
      "  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]\n",
      "/data/lily/ky334/LILY-EHRKit/virtenv/lib/python3.6/site-packages/scispacy/candidate_generation.py:285: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
      "  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{Spinal: ['CUI: C0521329, Name: Spinal\\nDefinition: Of or relating to the spine or spinal cord.\\nTUI(s): T082\\nAliases: (total: 2): \\n\\t Spinal, spinal',\n",
       "  'CUI: C0037922, Name: Spinal Canal\\nDefinition: The cavity within the SPINAL COLUMN through which the SPINAL CORD passes.\\nTUI(s): T030\\nAliases (abbreviated, total: 15): \\n\\t vertebral canal, canal spinal, Vertebral Canal, Spinal canal, NOS, Vertebral canal, NOS, neural canal, Spinal Canals, Canal, Spinal, Spinal canal structure, Spinal canal',\n",
       "  'CUI: C3887662, Name: Intraspinal Neoplasm\\nDefinition: A primary or metastatic neoplasm that occurs within the spinal canal including the spinal cord and surrounding paraspinal spaces.\\nTUI(s): T191\\nAliases (abbreviated, total: 16): \\n\\t Spinal Canal Tumors, neoplasm spinal, Neoplasms of the Spinal Canal and Spinal Cord, Spinal Neoplasms, Tumor of the Spinal Canal and Spinal Cord, Neoplasms of Spinal Canal and Spinal Cord, Tumor of Spinal Canal and Spinal Cord, Neoplasm of the Spinal Canal and Spinal Cord, Spinal Tumors, Intraspinal Tumor',\n",
       "  'CUI: C1334264, Name: Intraspinal Meningioma\\nDefinition: A meningioma that arises from the spinal meninges.\\nTUI(s): T191\\nAliases: (total: 3): \\n\\t Meningioma of the Spinal Canal and Spinal Cord, Meningioma of Spinal Canal and Spinal Cord, Spinal Canal and Spinal Cord Meningioma',\n",
       "  'CUI: C0037925, Name: Spinal Cord\\nDefinition: A cylindrical column of tissue that lies within the vertebral canal. It is composed of WHITE MATTER and GRAY MATTER.\\nTUI(s): T023\\nAliases (abbreviated, total: 19): \\n\\t Spinal Cords, Spinal cord, Spinal Cord, SPINAL CORD, spinal cord, Spinal cord, NOS, Spinali, Medulla, medulla spinalis, Medulla Spinalis, spinal cords'],\n",
       " bulbar: ['CUI: C1947952, Name: anatomical bulb\\nDefinition: A rounded dilation or expansion in a canal, vessel, or organ.\\nTUI(s): T017\\nAliases: (total: 2): \\n\\t Bulbar, Bulb',\n",
       "  'CUI: C0032372, Name: Poliomyelitis, Bulbar\\nDefinition: A form of paralytic poliomyelitis affecting neurons of the MEDULLA OBLONGATA of the brain stem. Clinical features include impaired respiration, HYPERTENSION, alterations of vasomotor control, and dysphagia. Weakness and atrophy of the limbs and trunk due to spinal cord involvement is usually associated. (From Adams et al., Principles of Neurology, 6th ed, p765)\\nTUI(s): T047\\nAliases (abbreviated, total: 23): \\n\\t Acute bulbar polioencephalitis, Bulbar Polio, Poliomyelitis, Medullary Involvement, BULBAR POLIO, Acute paralytic poliomyelitis specified as bulbar, Polio, Bulbar, Bulbar Poliomyelitis, Anterior acute poliomyelitis, Acute infantile paralysis, Acute paralytic poliomyelitis, bulbar',\n",
       "  \"CUI: C2586323, Name: Structure of fascial sheath of eyeball\\nDefinition: Sheath of the eyeball consisting of fascia extending from the OPTIC NERVE to the corneal limbus.\\nTUI(s): T023\\nAliases (abbreviated, total: 20): \\n\\t Eyeball, sheath, Vaginal bulbi, Capsule, Tenon, Bulbar sheath, Tenon's capsule, Fascia bulbi, Tenons Capsule, Structure of fascial sheath of eyeball, tenon capsule, Sheath of eyeball\",\n",
       "  'CUI: C1744560, Name: Bulbar urethra\\nDefinition: The portion of the penile urethra that spans the bulb of the penis.\\nTUI(s): T023\\nAliases: (total: 5): \\n\\t Structure of bulbar urethra, Bulbar urethra, Bulbar Portion of the Urethra, Structure of bulbar urethra (body structure), Bulbar Urethra',\n",
       "  'CUI: C0030442, Name: Progressive bulbar palsy\\nDefinition: A motor neuron disease marked by progressive weakness of the muscles innervated by cranial nerves of the lower brain stem. Clinical manifestations include dysarthria, dysphagia, facial weakness, tongue weakness, and fasciculations of the tongue and facial muscles. The adult form of the disease is marked initially by bulbar weakness which progresses to involve motor neurons throughout the neuroaxis. Eventually this condition may become indistinguishable from AMYOTROPHIC LATERAL SCLEROSIS. Fazio-Londe syndrome is an inherited form of this illness which occurs in children and young adults. (Adams et al., Principles of Neurology, 6th ed, p1091; Brain 1992 Dec;115(Pt 6):1889-1900)\\nTUI(s): T047\\nAliases (abbreviated, total: 13): \\n\\t bulbar palsy progressive, Palsies, Progressive Bulbar, Progressive Bulbar Palsy, Bulbar Palsy, Progressive, Progressive bulbar palsy (disorder), PBP - Progressive bulbar palsy, Bulbar paralysis, Progressive Bulbar Palsies, Bulbar palsy, Progressive bulbar palsy'],\n",
       " muscular atrophy: ['CUI: C0026846, Name: Muscular Atrophy\\nDefinition: Derangement in size and number of muscle fibers occurring with aging, reduction in blood supply, or following immobilization, prolonged weightlessness, malnutrition, and particularly in denervation.\\nTUI(s): T046\\nAliases (abbreviated, total: 32): \\n\\t Muscle atrophy, NOS, ATROPHY MUSCLE, amyotrophia, Muscle wasting, NOS, Muscle Atrophy, Muscle wasting disorder, Muscular atrophy, Muscle atrophy, Atrophies, Muscle, Muscle Wasting',\n",
       "  'CUI: C0541794, Name: Skeletal muscle atrophy\\nDefinition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content, fiber diameter, force production and fatigue resistance in response to different conditions such as starvation, aging and disuse. [GOC:mtg_muscle]\\nTUI(s): T046\\nAliases: (total: 10): \\n\\t Atrophy of the skeletal muscles, Skeletal muscle atrophy, Muscular atrophy, Muscle atrophy, ATROPHY SKELETAL MUSCLE, Amyotrophy, Muscle wasting, Amyotrophy involving the extremities, skeletal muscle atrophy, Muscle hypotrophy',\n",
       "  'CUI: C0026847, Name: Spinal Muscular Atrophy\\nDefinition: A group of disorders marked by progressive degeneration of motor neurons in the spinal cord resulting in weakness and muscular atrophy, usually without evidence of injury to the corticospinal tracts. Diseases in this category include Werdnig-Hoffmann disease and later onset SPINAL MUSCULAR ATROPHIES OF CHILDHOOD, most of which are hereditary. (Adams et al., Principles of Neurology, 6th ed, p1089)\\nTUI(s): T047\\nAliases (abbreviated, total: 27): \\n\\t atrophy muscular spinal, SMA - Spinal muscular atrophy, Spinal Amyotrophies, Spinal Muscular Atrophy, atrophy muscular sma spinal, Spinal muscular atrophy (disorder), Amyotrophies, Spinal, Spinal muscle degeneration, Atrophy, Spinal Muscular, muscle spinal atrophy',\n",
       "  'CUI: C1848736, Name: Distal amyotrophy\\nDefinition: Muscular atrophy affecting muscles in the distal portions of the extremities. [HPO:curators]\\nTUI(s): T047\\nAliases (abbreviated, total: 16): \\n\\t Distal amyotrophy, especially of hands and feet, Amyotrophy, distal, Muscle atrophy, distal, Muscle atrophy, distal upper and lower limbs, Distal amyotrophy, especially of the hands and feet, Atrophy of distal muscles, Muscle atrophy, distal, upper and lower limbs, Distal muscle atrophy, Distal amyotrophy, Distal muscle atrophy, upper and lower limbs',\n",
       "  'CUI: C0043116, Name: HMN (Hereditary Motor Neuropathy) Proximal Type I\\nDefinition: The most severe form of spinal muscular atrophy. It is manifested in the first year of life with muscle weakness, poor muscle tone, and lack of motor development. The motor neuron death affects the major organ systems, particularly the respiratory system. Most patients die before the age of two secondary to pneumonia.\\nTUI(s): T047\\nAliases (abbreviated, total: 42): \\n\\t Werdnig-Hoffman disease, SPINAL MUSCULAR ATROPHY, TYPE I, Infantile spinal muscular atrophy, Muscular Atrophy, Spinal, Type I, disease werdnig hoffmans, werdnig-hoffman disease, Spinal Muscular Atrophy 1, werdnig hoffmann disease, Werdnig-Hoffmann Disease, WHD - Werdnig-Hoffmann disease'],\n",
       " SBMA: ['CUI: C1705240, Name: AR wt Allele\\nDefinition: Human AR wild-type allele is located within Xq11.2-q12 and is approximately 180 kb in length. This allele, which encodes androgen receptor protein, is involved in steroid-hormone activated transcriptional regulation. Mutations in the gene are associated with complete androgen insensitivity syndrome.\\nTUI(s): T028\\nAliases (abbreviated, total: 11): \\n\\t AIS, AR wt Allele, TFM, KD, SMAX1, HUMARA, NR3C4, Androgen Receptor (Dihydrotestosterone Receptor; Testicular Feminization; Spinal and Bulbar Muscular Atrophy; Kennedy Disease) wt Allele, SBMA, AR',\n",
       "  \"CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked\\nDefinition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.\\nTUI(s): T047\\nAliases (abbreviated, total: 39): \\n\\t Bulbospinal Muscular Atrophy, X linked, SMAX1, X Linked Spinal and Bulbar Muscular Atrophy, Bulbospinal muscular atrophy, kennedy's syndrome, Atrophy, Muscular, Spinobulbar, X Linked Bulbo Spinal Atrophy, X-Linked Spinal and Bulbar Muscular Atrophy, Bulbo Spinal Atrophy, X Linked, X-Linked Bulbo-Spinal Atrophy\"],\n",
       " inherited: ['CUI: C0439660, Name: Hereditary\\nDefinition: Transmitted from parent to child by information contained in the genes.\\nTUI(s): T169\\nAliases: (total: 10): \\n\\t Inherited, HEREDITARY, Hereditary, inherited, inherit, Hereditary (qualifier value), inheriting, Heritable, INHERITED, hereditary',\n",
       "  'CUI: C0019247, Name: Hereditary Diseases\\nDefinition: Genetic diseases are diseases in which inherited genes predispose to increased risk. The genetic disorders associated with cancer often result from an alteration or mutation in a single gene. The diseases range from rare dominant cancer family syndrome to familial tendencies in which low-penetrance genes may interact with other genes or environmental factors to induce cancer. Research may involve clinical, epidemiologic, and laboratory studies of persons, families, and populations at high risk of these disorders.\\nTUI(s): T047\\nAliases (abbreviated, total: 45): \\n\\t Genetic condition, Molecular Disease, genetic syndrome, Disorder, Genetic, genetic disorder syndrome, Diseases, Genetic, genetics disease, genetics syndrome, hereditary disease, Hereditary Diseases',\n",
       "  'CUI: C4277541, Name: Paternal Inheritance\\nDefinition: A form of inheritance where the traits of the offspring are paternal in origin due to the expression of extra-nuclear genetic material such as MITOCHONDRIAL DNA or Y chromosome genes. CENTRIOLES are also paternally inherited.\\nTUI(s): T045\\nAliases: (total: 9): \\n\\t Paternal Effects, Paternal Effect, Paternally Inherited, Effects, Paternal, Paternally, Inherited, Inherited, Paternally, Effect, Paternal, Inherited Paternally, Inheritance, Paternal',\n",
       "  'CUI: C4277511, Name: Maternal Inheritance\\nDefinition: Transmission of genetic characters, qualities, and traits, solely from maternal extra-nuclear elements such as MITOCHONDRIAL DNA or MATERNAL MESSENGER RNA.\\nTUI(s): T045\\nAliases: (total: 9): \\n\\t Effects, Maternal, Inheritance, Maternal, Maternally Inherited, Inherited, Maternally, Maternally, Inherited, Maternal Effects, Effect, Maternal, Maternal Effect, Inherited Maternally',\n",
       "  'CUI: C0598589, Name: Inherited Neuropathy\\nDefinition: A hereditary disorder that affects the sensory and/or motor nerves or the autonomic nerves.\\nTUI(s): T047\\nAliases: (total: 5): \\n\\t inherited neuropathies, neuropathy hereditary, hereditary neuropathies, hereditary neuropathy, inherited neuropathy'],\n",
       " motor neuron: ['CUI: C0026609, Name: Motor Neurons\\nDefinition: Neurons which activate MUSCLE CELLS.\\nTUI(s): T025\\nAliases (abbreviated, total: 14): \\n\\t Motor Neurons, Neurons, Motor, motoneurons, motor cell, Motor neurons, Motoneuron, motoneuron, motor neurons, Neuron, Motor, Motor neuron',\n",
       "  'CUI: C0027884, Name: Neurons, Efferent\\nDefinition: Neurons which send impulses peripherally to activate muscles or secretory cells.\\nTUI(s): T025\\nAliases: (total: 9): \\n\\t Motor Neurons, Efferent Nerve, Motoneuron, Neuron, Efferent, efferent nerve, efferent neuron, Efferent Neurons, Neurons, Efferent, Efferent Neuron',\n",
       "  'CUI: C0085084, Name: Motor Neuron Disease\\nDefinition: Diseases characterized by a selective degeneration of the motor neurons of the spinal cord, brainstem, or motor cortex. Clinical subtypes are distinguished by the major site of degeneration. In AMYOTROPHIC LATERAL SCLEROSIS there is involvement of upper, lower, and brainstem motor neurons. In progressive muscular atrophy and related syndromes (see MUSCULAR ATROPHY, SPINAL) the motor neurons in the spinal cord are primarily affected. With progressive bulbar palsy (BULBAR PALSY, PROGRESSIVE), the initial degeneration occurs in the brainstem. In primary lateral sclerosis, the cortical neurons are affected in isolation. (Adams et al., Principles of Neurology, 6th ed, p1089)\\nTUI(s): T047\\nAliases (abbreviated, total: 25): \\n\\t motor neuron diseases, Motor Neuron Diseases and Syndromes, MOTOR SYSTEM DISEASE, Neuron Diseases, Motor, Motor neuron disease, NOS, degenerative motor system disorder, Motor System Diseases, Motor Neuron Diseases, MND - Motor neurone disease, disease motor neurons',\n",
       "  'CUI: C0026610, Name: Motor Neurons, Gamma\\nDefinition: Motor neurons which activate the contractile regions of intrafusal SKELETAL MUSCLE FIBERS, thus adjusting the sensitivity of the MUSCLE SPINDLES to stretch. Gamma motor neurons may be \"static\" or \"dynamic\" according to which aspect of responsiveness (or which fiber types) they regulate. The alpha and gamma motor neurons are often activated together (alpha gamma coactivation) which allows the spindles to contribute to the control of movement trajectories despite changes in muscle length.\\nTUI(s): T025\\nAliases (abbreviated, total: 21): \\n\\t Fusimotor Neurons, Neurons, Fusimotor, Gamma-Efferent Motor Neurons, Neurons, Gamma Motor, Motor Neuron, Gamma, Neuron, Gamma Motor, Motorneurons, Gamma, Neurons, Gamma-Efferent Motor, Fusimotor Neuron, Gamma motor neuron',\n",
       "  'CUI: C4024896, Name: Motor neuron atrophy\\nDefinition: Wasting involving the motor neuron. [HPO:probinson]\\nTUI(s): T047\\nAliases: (total: 1): \\n\\t Motor neuron degeneration'],\n",
       " expansion: ['CUI: C0007595, Name: Cell Growth\\nDefinition: The process in which a cell irreversibly increases in size over time by accretion and biosynthetic production of matter similar to that already present. [GOC:ai]\\nTUI(s): T043\\nAliases (abbreviated, total: 11): \\n\\t cell expansion, cell growth, growth of cell, Cells--Growth, Cellular Expansion, cellular growth, Cell Growth, cell growths, Cellular Growth, growth cell',\n",
       "  'CUI: C4761515, Name: Cell Expansion\\nDefinition: A cell culture procedure that is designed to promote cell growth.\\nTUI(s): T059\\nAliases: (total: 3): \\n\\t Cell Culture Expansion, Cell Expansion, Cell expansion',\n",
       "  'CUI: C1654621, Name: imaginal disc-derived wing expansion\\nDefinition: The process of expanding or inflating the folded imaginal disc-derived pupal wing, and the adhering of the dorsal and ventral surfaces, to form the mature adult wing. [GOC:mtg_sensu, GOC:rc]\\nTUI(s): T042\\nAliases: (total: 2): \\n\\t wing inflation, wing expansion',\n",
       "  'CUI: C0196940, Name: Nerve Expansion\\nDefinition: Procedures that stimulate nerve elongation over a period of time. They are used in repairing nerve tissue.\\nTUI(s): T061\\nAliases (abbreviated, total: 26): \\n\\t Neurectasis, NOS, Lengthening, Nerve, Neurectasis (procedure), Neurectasis, Nerve Lengthenings, Nerve Elongation Procedure, Stretching, Nerve, Stretchings, Nerve, Elongation Procedure, Nerve, Nerve Expansions',\n",
       "  'CUI: C1516670, Name: Clonal Expansion\\nDefinition: Multiplication or reproduction by cell division of a population of identical cells descended from a single progenitor. In immunology, may refer to the clonal proliferation of cells responsive to a specific antigen as part of an immune response. (NCI)\\nTUI(s): T046\\nAliases: (total: 0): \\n\\t '],\n",
       " polyglutamine tract: ['CUI: C0032500, Name: Polyglutamic Acid\\nDefinition: A peptide that is a homopolymer of glutamic acid.\\nTUI(s): T116\\nAliases: (total: 2): \\n\\t L-Glutamic acid, homopolymer, polyglutamic acid',\n",
       "  'CUI: C0392213, Name: polyglutamates\\nDefinition: peptide that is a homopolymer of glutamic acid.\\nTUI(s): T116\\nAliases: (total: 2): \\n\\t polyglutamate, polyglutamates'],\n",
       " androgen receptor: ['CUI: C0034786, Name: Androgen Receptor\\nDefinition: Proteins, generally found in the CYTOPLASM, that specifically bind ANDROGENS and mediate their cellular actions. The complex of the androgen and receptor migrates to the CELL NUCLEUS where it induces transcription of specific segments of DNA.\\nTUI(s): T116, T192\\nAliases (abbreviated, total: 25): \\n\\t Receptor, 5 alpha-Dihydrotestosterone, Receptors, Stanolone, Androgens Receptors, 5 alpha Dihydrotestosterone Receptor, Receptors, Androgen, Receptor, Stanolone, androgen receptor, Receptor, Androgen, 5 alpha-Dihydrotestosterone Receptor, Receptors, Dihydrotestosterone',\n",
       "  'CUI: C1367578, Name: AR gene\\nDefinition: This gene plays a role in the transcriptional activation of androgen responsive genes.\\nTUI(s): T028\\nAliases (abbreviated, total: 16): \\n\\t Androgen Receptor (Dihydrotestosterone Receptor; Testicular Feminization; Spinal and Bulbar Muscular Atrophy; Kennedy Disease) Gene, NUCLEAR RECEPTOR SUBFAMILY 3, GROUP C, MEMBER 4, AIS, SMAX1, testicular feminization, androgen receptor, HUMARA, NR3C4, ANDROGEN RECEPTOR, DHTR',\n",
       "  'CUI: C1447749, Name: AR protein, human\\nDefinition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene. This protein plays a role in the modulation of steroid-dependent gene transcription.\\nTUI(s): T116, T192\\nAliases (abbreviated, total: 16): \\n\\t Dihydrotestosterone Receptor, AR protein, human, Nuclear Receptor Subfamily 3 Group C Member 4, androgen receptor, human, spinal and bulbar muscular atrophy protein, human, NR3C4, Kennedy disease protein, human, AR (androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease)) protein, human, Androgen Receptor, HUMARA protein, human',\n",
       "  'CUI: C4764257, Name: Androgen Receptor Status\\nDefinition: Refers to the presence or absence of androgen receptor molecules on the surface of a cells in a specimen.\\nTUI(s): T033\\nAliases: (total: 2): \\n\\t AR Status, Androgen Receptor Status',\n",
       "  'CUI: C1323350, Name: androgen receptor binding\\nDefinition: Interacting selectively and non-covalently with an androgen receptor. [GOC:ai]\\nTUI(s): T044\\nAliases: (total: 1): \\n\\t AR binding'],\n",
       " AR: ['CUI: C0003504, Name: Aortic Valve Insufficiency\\nDefinition: Pathological condition characterized by the backflow of blood from the ASCENDING AORTA back into the LEFT VENTRICLE, leading to regurgitation. It is caused by diseases of the AORTIC VALVE or its surrounding tissue (aortic root).\\nTUI(s): T047\\nAliases (abbreviated, total: 39): \\n\\t Aortic Insufficiency, Aortic incompetence, Aortic valve regurgitation, AORTIC REGURGITATION, AORTIC VALVE REGURGITATION, aortic insufficiency, Aortic Regurgitation, Aortic valve incompetence, Regurgitation, Aortic Valve, Aortic valve insufficiency',\n",
       "  'CUI: C0003761, Name: Country of Argentina\\nDefinition: Country located in southern South America, bordering the South Atlantic Ocean, between Chile and Uruguay.\\nTUI(s): T083\\nAliases: (total: 6): \\n\\t ARGENTINA, ARG, Argentina, argentina, Argentina (geographic location), AR',\n",
       "  'CUI: C0003790, Name: Arkansas\\nDefinition: State of the UNITED STATES OF AMERICA bounded on the north by Missouri, on the east by Tennessee and Mississippi, on the south by Louisiana, and on the west by Oklahoma and Texas.\\nTUI(s): T083\\nAliases: (total: 4): \\n\\t Arkansas, AR, arkansas, Arkansas (geographic location)',\n",
       "  'CUI: C0332284, Name: Arising in\\nDefinition: None\\nTUI(s): T080\\nAliases: (total: 5): \\n\\t Arising in (attribute), arising in, arising, Arising in, AR',\n",
       "  'CUI: C0559546, Name: Adverse reactions\\nDefinition: An unexpected medical problem that happens during treatment with a drug or other therapy. Adverse effects do not have to be caused by the drug or therapy, and they may be mild, moderate, or severe.\\nTUI(s): T046\\nAliases (abbreviated, total: 12): \\n\\t Adverse Reaction, adverse reactions, Adverse reaction (disorder), reactions adverse, Adverse reaction, ADR, Adverse Effect, adverse effect, Adverse reactions, ADVERSE REACTION'],\n",
       " SBMA: ['CUI: C1705240, Name: AR wt Allele\\nDefinition: Human AR wild-type allele is located within Xq11.2-q12 and is approximately 180 kb in length. This allele, which encodes androgen receptor protein, is involved in steroid-hormone activated transcriptional regulation. Mutations in the gene are associated with complete androgen insensitivity syndrome.\\nTUI(s): T028\\nAliases (abbreviated, total: 11): \\n\\t AIS, AR wt Allele, TFM, KD, SMAX1, HUMARA, NR3C4, Androgen Receptor (Dihydrotestosterone Receptor; Testicular Feminization; Spinal and Bulbar Muscular Atrophy; Kennedy Disease) wt Allele, SBMA, AR',\n",
       "  \"CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked\\nDefinition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.\\nTUI(s): T047\\nAliases (abbreviated, total: 39): \\n\\t Bulbospinal Muscular Atrophy, X linked, SMAX1, X Linked Spinal and Bulbar Muscular Atrophy, Bulbospinal muscular atrophy, kennedy's syndrome, Atrophy, Muscular, Spinobulbar, X Linked Bulbo Spinal Atrophy, X-Linked Spinal and Bulbar Muscular Atrophy, Bulbo Spinal Atrophy, X Linked, X-Linked Bulbo-Spinal Atrophy\"]}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record = \"Spinal and bulbar muscular atrophy (SBMA) is an \\\n",
    "inherited motor neuron disease caused by the expansion \\\n",
    "of a polyglutamine tract within the androgen receptor (AR). \\\n",
    "SBMA can be caused by this easily.\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.get_linked_entities()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49e53301",
   "metadata": {},
   "source": [
    "**Named entity recognition**: returns a list of strings, one per identified named entity. With `tool='stanza'`, each item is instead an (entity text, entity type) tuple."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "f8d34fa2",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting named entities using en_core_sci_sm\n",
      "Input text (truncated): Myeloid derived suppressor cells (MDSC) are immature\n",
      "myeloid cells with immunosuppressive activity.\n",
      "They accumulate in tumor-bearing mice and humans\n",
      "with different types of cancer, including hepatocellular\n",
      "carcinoma (HCC).\n",
      "...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['Myeloid',\n",
       " 'suppressor cells',\n",
       " 'MDSC',\n",
       " 'immature',\n",
       " 'myeloid cells',\n",
       " 'immunosuppressive activity',\n",
       " 'accumulate',\n",
       " 'tumor-bearing mice',\n",
       " 'humans',\n",
       " 'cancer',\n",
       " 'hepatocellular\\ncarcinoma',\n",
       " 'HCC']"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record = \"\"\"Myeloid derived suppressor cells (MDSC) are immature\n",
    "myeloid cells with immunosuppressive activity.\n",
    "They accumulate in tumor-bearing mice and humans\n",
    "with different types of cancer, including hepatocellular\n",
    "carcinoma (HCC).\"\"\"\n",
    "\n",
    "# using scispacy\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.get_named_entities()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "3189c27e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3d83ad2397e646b694b59e238ea5d33c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:51:22 INFO: Downloading these customized packages for language: en (English)...\n",
      "=============================\n",
      "| Processor       | Package |\n",
      "-----------------------------\n",
      "| tokenize        | mimic   |\n",
      "| pos             | mimic   |\n",
      "| lemma           | mimic   |\n",
      "| depparse        | mimic   |\n",
      "| ner             | i2b2    |\n",
      "| backward_charlm | mimic   |\n",
      "| pretrain        | mimic   |\n",
      "| forward_charlm  | mimic   |\n",
      "=============================\n",
      "\n",
      "2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/mimic.pt.\n",
      "2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/mimic.pt.\n",
      "2022-02-16 09:51:23 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/mimic.pt.\n",
      "2022-02-16 09:51:24 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/mimic.pt.\n",
      "2022-02-16 09:51:25 INFO: File exists: /home/lily/ky334/stanza_resources/en/ner/i2b2.pt.\n",
      "2022-02-16 09:51:25 INFO: File exists: /home/lily/ky334/stanza_resources/en/backward_charlm/mimic.pt.\n",
      "2022-02-16 09:51:26 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/mimic.pt.\n",
      "2022-02-16 09:51:26 INFO: File exists: /home/lily/ky334/stanza_resources/en/forward_charlm/mimic.pt.\n",
      "2022-02-16 09:51:26 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:51:26 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | mimic   |\n",
      "| pos       | mimic   |\n",
      "| lemma     | mimic   |\n",
      "| depparse  | mimic   |\n",
      "| ner       | i2b2    |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:51:26 INFO: Use device: gpu\n",
      "2022-02-16 09:51:26 INFO: Loading: tokenize\n",
      "2022-02-16 09:51:41 INFO: Loading: pos\n",
      "2022-02-16 09:51:42 INFO: Loading: lemma\n",
      "2022-02-16 09:51:42 INFO: Loading: depparse\n",
      "2022-02-16 09:51:42 INFO: Loading: ner\n",
      "2022-02-16 09:51:42 INFO: Done loading processors!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('Spinal and bulbar muscular atrophy', 'PROBLEM'),\n",
       " ('an inherited motor neuron disease', 'PROBLEM'),\n",
       " ('a polyglutamine tract', 'PROBLEM'),\n",
       " ('SBMA', 'PROBLEM')]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# using stanza biomed (the output below was produced for the SBMA record from the entity-linking cell)\n",
    "kit.get_named_entities(tool='stanza')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4738435",
   "metadata": {},
   "source": [
    "**Translation**: returns a string containing the translated text. The default `target_language` is Spanish. Use `get_supported_translation_languages` to list the supported languages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "bcdeecc1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Malay_written_with_Latin',\n",
       " 'Mauritian_Creole',\n",
       " 'Haitian',\n",
       " 'Papiamento',\n",
       " 'Asturian',\n",
       " 'Catalan',\n",
       " 'Indonesian',\n",
       " 'Galician',\n",
       " 'Walloon',\n",
       " 'Spanish',\n",
       " 'French',\n",
       " 'Romanian',\n",
       " 'Portuguese',\n",
       " 'Italian',\n",
       " 'Occitan',\n",
       " 'Aragonese',\n",
       " 'Minangkabau']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.get_supported_translation_languages()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "41074fb3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Translating medical note using Helsinki-NLP/opus-mt-en-ROMANCE\n",
      "Input text (truncated): Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between. More than  that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).\n",
      "...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "86b28affcd8544fbb6ee6639fc1d7ce2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:51:58 INFO: Downloading default packages for language: en (English)...\n",
      "2022-02-16 09:52:01 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-02-16 09:52:07 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:52:07 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-02-16 09:52:07 INFO: Use device: gpu\n",
      "2022-02-16 09:52:07 INFO: Loading: tokenize\n",
      "2022-02-16 09:52:07 INFO: Done loading processors!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Neurons (tambiè neurones o cellule nervese) sono le unità fondamentali del cervello e del sistema nervoso, le cellule responsabili di ricevere input sensorial dal mondo esterno, de mandar comandos motori ai nostri muscoli, e de trasformare e di retransmissione dei segnali elettrici a ogni passo entre. Piutè, le loro interazioni definisce chi somos noi come persone. Dicho ciò, i nostri circa 100 miliardi di neuroni interagiscono strettamente con altri tipi cellulari, largamente classificati come glia (essas possono in realtà superare neuroni, anche se non è realmente noto).\n",
      "Translating medical note using Helsinki-NLP/opus-mt-en-ROMANCE\n",
      "Input text (truncated): Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between. More than  that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).\n",
      "...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "668a85a626df42a3bb9d897bc628db0f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:52:14 INFO: Downloading default packages for language: en (English)...\n",
      "2022-02-16 09:52:15 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-02-16 09:52:20 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:52:20 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-02-16 09:52:20 INFO: Use device: gpu\n",
      "2022-02-16 09:52:20 INFO: Loading: tokenize\n",
      "2022-02-16 09:52:20 INFO: Done loading processors!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Neurons (tambiè neurones o cellule nervese) sono le unità fondamentali del cervello e del sistema nervoso, le cellule responsabili di ricevere input sensorial dal mondo esterno, de mandar comandos motori ai nostri muscoli, e de trasformare e di retransmissione dei segnali elettrici a ogni passo entre. Piutè, le loro interazioni definisce chi somos noi come persone. Dicho ciò, i nostri circa 100 miliardi di neuroni interagiscono strettamente con altri tipi cellulari, largamente classificati come glia (essas possono in realtà superare neuroni, anche se non è realmente noto).\n"
     ]
    }
   ],
   "source": [
    "# reference: https://qbi.uq.edu.au/brain/brain-anatomy/what-neuron\n",
    "record = \"Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, \\\n",
    "the cells responsible for receiving sensory input from the external world, for sending motor commands to  \\\n",
    "our muscles, and for transforming and relaying the electrical signals at every step in between. More than  \\\n",
    "that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do \\\n",
    "interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  \\\n",
    "although it’s not really known).\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "print(kit.get_translation())\n",
    "print(kit.get_translation('French'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee7d6e8e",
   "metadata": {},
   "source": [
    "**Sentencizer**: splits the main record into a list of sentences. The examples below use PyRuSH, stanza, scispacy, and stanza biomed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "c9bce87d",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Segment into sentences using PyRuSH\n",
      "['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c6e70b41f4ae4fbd85815ec3842b4f9e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:52:31 INFO: Downloading default packages for language: en (English)...\n",
      "2022-02-16 09:52:32 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-02-16 09:52:37 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:52:37 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-02-16 09:52:37 INFO: Use device: gpu\n",
      "2022-02-16 09:52:37 INFO: Loading: tokenize\n",
      "2022-02-16 09:52:37 INFO: Done loading processors!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']\n",
      "['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c0a30d42895c46379cf991bcdfb53335",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:52:38 INFO: Downloading these customized packages for language: en (English)...\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "| pretrain  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.\n",
      "2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.\n",
      "2022-02-16 09:52:38 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.\n",
      "2022-02-16 09:52:39 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.\n",
      "2022-02-16 09:52:40 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.\n",
      "2022-02-16 09:52:40 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:52:40 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:52:40 INFO: Use device: gpu\n",
      "2022-02-16 09:52:40 INFO: Loading: tokenize\n",
      "2022-02-16 09:52:40 INFO: Loading: pos\n",
      "2022-02-16 09:52:40 INFO: Loading: lemma\n",
      "2022-02-16 09:52:40 INFO: Loading: depparse\n",
      "2022-02-16 09:52:40 INFO: Done loading processors!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to  our muscles, and for transforming and relaying the electrical signals at every step in between.', 'More than  that, their interactions define who we are as people.', 'Having said that, our roughly 100 billion neurons do interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons,  although it’s not really known).']\n"
     ]
    }
   ],
   "source": [
    "# using pyrush\n",
    "print(kit.get_sentences('pyrush'))\n",
    "\n",
    "# using stanza\n",
    "print(kit.get_sentences('stanza'))\n",
    "\n",
    "# using scispacy\n",
    "print(kit.get_sentences('scispacy'))\n",
    "\n",
    "# using stanza biomed\n",
    "print(kit.get_sentences('stanza-biomed'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0ba61df",
   "metadata": {},
   "source": [
    "**Tokenizer**: tokenizes the input document and returns a list of lists, where each inner list contains the tokens of one sentence in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "5046c538",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e79a050267144bd099fd4462a52659d1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:54:00 INFO: Downloading these customized packages for language: en (English)...\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "| pretrain  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:54:00 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.\n",
      "2022-02-16 09:54:00 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.\n",
      "2022-02-16 09:54:00 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.\n",
      "2022-02-16 09:54:00 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.\n",
      "2022-02-16 09:54:00 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.\n",
      "2022-02-16 09:54:00 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:54:00 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:54:00 INFO: Use device: gpu\n",
      "2022-02-16 09:54:00 INFO: Loading: tokenize\n",
      "2022-02-16 09:54:00 INFO: Loading: pos\n",
      "2022-02-16 09:54:00 INFO: Loading: lemma\n",
      "2022-02-16 09:54:00 INFO: Loading: depparse\n",
      "2022-02-16 09:54:01 INFO: Done loading processors!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[['Neurons',\n",
       "  '(',\n",
       "  'also',\n",
       "  'called',\n",
       "  'neurones',\n",
       "  'or',\n",
       "  'nerve',\n",
       "  'cells',\n",
       "  ')',\n",
       "  'are',\n",
       "  'the',\n",
       "  'fundamental',\n",
       "  'units',\n",
       "  'of',\n",
       "  'the',\n",
       "  'brain',\n",
       "  'and',\n",
       "  'nervous',\n",
       "  'system',\n",
       "  ',',\n",
       "  'the',\n",
       "  'cells',\n",
       "  'responsible',\n",
       "  'for',\n",
       "  'receiving',\n",
       "  'sensory',\n",
       "  'input',\n",
       "  'from',\n",
       "  'the',\n",
       "  'external',\n",
       "  'world',\n",
       "  ',',\n",
       "  'for',\n",
       "  'sending',\n",
       "  'motor',\n",
       "  'commands',\n",
       "  'to',\n",
       "  'our',\n",
       "  'muscles',\n",
       "  ',',\n",
       "  'and',\n",
       "  'for',\n",
       "  'transforming',\n",
       "  'and',\n",
       "  'relaying',\n",
       "  'the',\n",
       "  'electrical',\n",
       "  'signals',\n",
       "  'at',\n",
       "  'every',\n",
       "  'step',\n",
       "  'in',\n",
       "  'between',\n",
       "  '.'],\n",
       " ['More',\n",
       "  'than',\n",
       "  'that',\n",
       "  ',',\n",
       "  'their',\n",
       "  'interactions',\n",
       "  'define',\n",
       "  'who',\n",
       "  'we',\n",
       "  'are',\n",
       "  'as',\n",
       "  'people',\n",
       "  '.'],\n",
       " ['Having',\n",
       "  'said',\n",
       "  'that',\n",
       "  ',',\n",
       "  'our',\n",
       "  'roughly',\n",
       "  '100',\n",
       "  'billion',\n",
       "  'neurons',\n",
       "  'do',\n",
       "  'interact',\n",
       "  'closely',\n",
       "  'with',\n",
       "  'other',\n",
       "  'cell',\n",
       "  'types',\n",
       "  ',',\n",
       "  'broadly',\n",
       "  'classified',\n",
       "  'as',\n",
       "  'glia',\n",
       "  '(',\n",
       "  'these',\n",
       "  'may',\n",
       "  'actually',\n",
       "  'outnumber',\n",
       "  'neurons',\n",
       "  ',',\n",
       "  'although',\n",
       "  'it',\n",
       "  '’s',\n",
       "  'not',\n",
       "  'really',\n",
       "  'known',\n",
       "  ')',\n",
       "  '.']]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.get_tokens()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e969cb09",
   "metadata": {},
   "source": [
    "**Part-of-speech tags and morphological features**: returns a list of lists of 4-tuples: (word, universal POS (UPOS) tag, treebank-specific POS (XPOS) tag, universal morphological features (UFeats)). Each inner list corresponds to a sentence in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "2bd4fcce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "98bfc433916d4aaeb330daa705daea83",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:55:22 INFO: Downloading these customized packages for language: en (English)...\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "| pretrain  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:55:22 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.\n",
      "2022-02-16 09:55:22 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.\n",
      "2022-02-16 09:55:22 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.\n",
      "2022-02-16 09:55:22 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.\n",
      "2022-02-16 09:55:22 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.\n",
      "2022-02-16 09:55:22 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:55:22 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:55:22 INFO: Use device: gpu\n",
      "2022-02-16 09:55:22 INFO: Loading: tokenize\n",
      "2022-02-16 09:55:22 INFO: Loading: pos\n",
      "2022-02-16 09:55:22 INFO: Loading: lemma\n",
      "2022-02-16 09:55:22 INFO: Loading: depparse\n",
      "2022-02-16 09:55:23 INFO: Done loading processors!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[[('Neurons', 'NOUN', 'NNS', '_'),\n",
       "  ('(', 'PUNCT', '-LRB-', '_'),\n",
       "  ('also', 'ADV', 'RB', '_'),\n",
       "  ('called', 'VERB', 'VBN', '_'),\n",
       "  ('neurones', 'NOUN', 'NNS', '_'),\n",
       "  ('or', 'CONJ', 'CC', '_'),\n",
       "  ('nerve', 'NOUN', 'NN', '_'),\n",
       "  ('cells', 'NOUN', 'NNS', '_'),\n",
       "  (')', 'PUNCT', '-RRB-', '_'),\n",
       "  ('are', 'VERB', 'VBP', '_'),\n",
       "  ('the', 'DET', 'DT', '_'),\n",
       "  ('fundamental', 'ADJ', 'JJ', '_'),\n",
       "  ('units', 'NOUN', 'NNS', '_'),\n",
       "  ('of', 'ADP', 'IN', '_'),\n",
       "  ('the', 'DET', 'DT', '_'),\n",
       "  ('brain', 'NOUN', 'NN', '_'),\n",
       "  ('and', 'CONJ', 'CC', '_'),\n",
       "  ('nervous', 'ADJ', 'JJ', '_'),\n",
       "  ('system', 'NOUN', 'NN', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('the', 'DET', 'DT', '_'),\n",
       "  ('cells', 'NOUN', 'NNS', '_'),\n",
       "  ('responsible', 'ADJ', 'JJ', '_'),\n",
       "  ('for', 'SCONJ', 'IN', '_'),\n",
       "  ('receiving', 'VERB', 'VBG', '_'),\n",
       "  ('sensory', 'ADJ', 'JJ', '_'),\n",
       "  ('input', 'NOUN', 'NN', '_'),\n",
       "  ('from', 'ADP', 'IN', '_'),\n",
       "  ('the', 'DET', 'DT', '_'),\n",
       "  ('external', 'ADJ', 'JJ', '_'),\n",
       "  ('world', 'NOUN', 'NN', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('for', 'SCONJ', 'IN', '_'),\n",
       "  ('sending', 'VERB', 'VBG', '_'),\n",
       "  ('motor', 'NOUN', 'NN', '_'),\n",
       "  ('commands', 'NOUN', 'NNS', '_'),\n",
       "  ('to', 'ADP', 'IN', '_'),\n",
       "  ('our', 'PRON', 'PRP$', '_'),\n",
       "  ('muscles', 'NOUN', 'NNS', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('and', 'CONJ', 'CC', '_'),\n",
       "  ('for', 'SCONJ', 'IN', '_'),\n",
       "  ('transforming', 'VERB', 'VBG', '_'),\n",
       "  ('and', 'CCONJ', 'CC', '_'),\n",
       "  ('relaying', 'VERB', 'VBG', '_'),\n",
       "  ('the', 'DET', 'DT', '_'),\n",
       "  ('electrical', 'ADJ', 'JJ', '_'),\n",
       "  ('signals', 'NOUN', 'NNS', '_'),\n",
       "  ('at', 'ADP', 'IN', '_'),\n",
       "  ('every', 'DET', 'DT', '_'),\n",
       "  ('step', 'NOUN', 'NN', '_'),\n",
       "  ('in', 'ADP', 'IN', '_'),\n",
       "  ('between', 'ADV', 'RB', '_'),\n",
       "  ('.', 'PUNCT', '.', '_')],\n",
       " [('More', 'ADJ', 'JJR', '_'),\n",
       "  ('than', 'ADP', 'IN', '_'),\n",
       "  ('that', 'PRON', 'DT', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('their', 'PRON', 'PRP$', '_'),\n",
       "  ('interactions', 'NOUN', 'NNS', '_'),\n",
       "  ('define', 'VERB', 'VBP', '_'),\n",
       "  ('who', 'PRON', 'WP', '_'),\n",
       "  ('we', 'PRON', 'PRP', '_'),\n",
       "  ('are', 'AUX', 'VBP', '_'),\n",
       "  ('as', 'ADP', 'IN', '_'),\n",
       "  ('people', 'NOUN', 'NNS', '_'),\n",
       "  ('.', 'PUNCT', '.', '_')],\n",
       " [('Having', 'AUX', 'VBG', '_'),\n",
       "  ('said', 'VERB', 'VBN', '_'),\n",
       "  ('that', 'SCONJ', 'IN', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('our', 'PRON', 'PRP$', '_'),\n",
       "  ('roughly', 'ADV', 'RB', '_'),\n",
       "  ('100', 'NUM', 'CD', '_'),\n",
       "  ('billion', 'NUM', 'CD', '_'),\n",
       "  ('neurons', 'NOUN', 'NNS', '_'),\n",
       "  ('do', 'AUX', 'VBP', '_'),\n",
       "  ('interact', 'VERB', 'VB', '_'),\n",
       "  ('closely', 'ADV', 'RB', '_'),\n",
       "  ('with', 'ADP', 'IN', '_'),\n",
       "  ('other', 'ADJ', 'JJ', '_'),\n",
       "  ('cell', 'NOUN', 'NN', '_'),\n",
       "  ('types', 'NOUN', 'NNS', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('broadly', 'ADV', 'RB', '_'),\n",
       "  ('classified', 'VERB', 'VBN', '_'),\n",
       "  ('as', 'ADP', 'IN', '_'),\n",
       "  ('glia', 'NOUN', 'NNS', '_'),\n",
       "  ('(', 'PUNCT', '-LRB-', '_'),\n",
       "  ('these', 'PRON', 'DT', '_'),\n",
       "  ('may', 'AUX', 'MD', '_'),\n",
       "  ('actually', 'ADV', 'RB', '_'),\n",
       "  ('outnumber', 'VERB', 'VB', '_'),\n",
       "  ('neurons', 'NOUN', 'NNS', '_'),\n",
       "  (',', 'PUNCT', ',', '_'),\n",
       "  ('although', 'SCONJ', 'IN', '_'),\n",
       "  ('it', 'PRON', 'PRP', '_'),\n",
       "  ('’s', 'AUX', 'VBZ', '_'),\n",
       "  ('not', 'PART', 'RB', '_'),\n",
       "  ('really', 'ADV', 'RB', '_'),\n",
       "  ('known', 'VERB', 'VBN', '_'),\n",
       "  (')', 'PUNCT', '-RRB-', '_'),\n",
       "  ('.', 'PUNCT', '.', '_')]]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.get_pos_tags()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0c0f5bc",
   "metadata": {},
   "source": [
    "**Lemmatization**: returns a list of lists of tuples, each tuple in the form (token, lemma), each list corresponds to a sentence in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "f690a0ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:57:11 INFO: Downloading these customized packages for language: en (English)...\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "| pretrain  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:57:11 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.\n",
      "2022-02-16 09:57:11 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.\n",
      "2022-02-16 09:57:11 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.\n",
      "2022-02-16 09:57:11 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.\n",
      "2022-02-16 09:57:12 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.\n",
      "2022-02-16 09:57:12 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:57:12 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:57:12 INFO: Use device: gpu\n",
      "2022-02-16 09:57:12 INFO: Loading: tokenize\n",
      "2022-02-16 09:57:12 INFO: Loading: pos\n",
      "2022-02-16 09:57:12 INFO: Loading: lemma\n",
      "2022-02-16 09:57:12 INFO: Loading: depparse\n",
      "2022-02-16 09:57:12 INFO: Done loading processors!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[[('Neurons', 'neuron'),\n",
       "  ('(', '('),\n",
       "  ('also', 'also'),\n",
       "  ('called', 'call'),\n",
       "  ('neurones', 'neurone'),\n",
       "  ('or', 'or'),\n",
       "  ('nerve', 'nerve'),\n",
       "  ('cells', 'cell'),\n",
       "  (')', ')'),\n",
       "  ('are', 'be'),\n",
       "  ('the', 'the'),\n",
       "  ('fundamental', 'fundamental'),\n",
       "  ('units', 'unit'),\n",
       "  ('of', 'of'),\n",
       "  ('the', 'the'),\n",
       "  ('brain', 'brain'),\n",
       "  ('and', 'and'),\n",
       "  ('nervous', 'nervous'),\n",
       "  ('system', 'system'),\n",
       "  (',', ','),\n",
       "  ('the', 'the'),\n",
       "  ('cells', 'cell'),\n",
       "  ('responsible', 'responsible'),\n",
       "  ('for', 'for'),\n",
       "  ('receiving', 'receive'),\n",
       "  ('sensory', 'sensory'),\n",
       "  ('input', 'input'),\n",
       "  ('from', 'from'),\n",
       "  ('the', 'the'),\n",
       "  ('external', 'external'),\n",
       "  ('world', 'world'),\n",
       "  (',', ','),\n",
       "  ('for', 'for'),\n",
       "  ('sending', 'send'),\n",
       "  ('motor', 'motor'),\n",
       "  ('commands', 'command'),\n",
       "  ('to', 'to'),\n",
       "  ('our', 'we'),\n",
       "  ('muscles', 'muscle'),\n",
       "  (',', ','),\n",
       "  ('and', 'and'),\n",
       "  ('for', 'for'),\n",
       "  ('transforming', 'transform'),\n",
       "  ('and', 'and'),\n",
       "  ('relaying', 'relay'),\n",
       "  ('the', 'the'),\n",
       "  ('electrical', 'electrical'),\n",
       "  ('signals', 'signal'),\n",
       "  ('at', 'at'),\n",
       "  ('every', 'every'),\n",
       "  ('step', 'step'),\n",
       "  ('in', 'in'),\n",
       "  ('between', 'between'),\n",
       "  ('.', '.')],\n",
       " [('More', 'more'),\n",
       "  ('than', 'than'),\n",
       "  ('that', 'that'),\n",
       "  (',', ','),\n",
       "  ('their', 'they'),\n",
       "  ('interactions', 'interaction'),\n",
       "  ('define', 'define'),\n",
       "  ('who', 'who'),\n",
       "  ('we', 'we'),\n",
       "  ('are', 'be'),\n",
       "  ('as', 'as'),\n",
       "  ('people', 'people'),\n",
       "  ('.', '.')],\n",
       " [('Having', 'have'),\n",
       "  ('said', 'say'),\n",
       "  ('that', 'that'),\n",
       "  (',', ','),\n",
       "  ('our', 'we'),\n",
       "  ('roughly', 'roughly'),\n",
       "  ('100', '100'),\n",
       "  ('billion', 'billion'),\n",
       "  ('neurons', 'neuron'),\n",
       "  ('do', 'do'),\n",
       "  ('interact', 'interact'),\n",
       "  ('closely', 'closely'),\n",
       "  ('with', 'with'),\n",
       "  ('other', 'other'),\n",
       "  ('cell', 'cell'),\n",
       "  ('types', 'type'),\n",
       "  (',', ','),\n",
       "  ('broadly', 'broadly'),\n",
       "  ('classified', 'classify'),\n",
       "  ('as', 'as'),\n",
       "  ('glia', 'glium'),\n",
       "  ('(', '('),\n",
       "  ('these', 'these'),\n",
       "  ('may', 'may'),\n",
       "  ('actually', 'actually'),\n",
       "  ('outnumber', 'outnumber'),\n",
       "  ('neurons', 'neuron'),\n",
       "  (',', ','),\n",
       "  ('although', 'although'),\n",
       "  ('it', 'it'),\n",
       "  ('’s', 'be'),\n",
       "  ('not', 'not'),\n",
       "  ('really', 'really'),\n",
       "  ('known', 'know'),\n",
       "  (')', ')'),\n",
       "  ('.', '.')]]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.get_lemmas()"
   ]
  },
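  {
   "cell_type": "markdown",
   "id": "c41e7a90",
   "metadata": {},
   "source": [
    "The (token, lemma) pairs are convenient for simple vocabulary statistics. As a minimal sketch (`lemma_counts` is a hypothetical helper, not an EHRKit function, and `sample` is copied by hand from pairs in the output above), the next cell counts how often each lemma occurs across sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c41e7a91",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical helper, not part of EHRKit: it only assumes the\n",
    "# list-of-lists-of-(token, lemma) layout described above.\n",
    "def lemma_counts(sentences):\n",
    "    return Counter(lemma for sentence in sentences for _, lemma in sentence)\n",
    "\n",
    "# A few pairs copied by hand from the lemmatization output above.\n",
    "sample = [\n",
    "    [('cells', 'cell'), ('are', 'be'), ('units', 'unit')],\n",
    "    [('cell', 'cell'), ('types', 'type')],\n",
    "]\n",
    "\n",
    "counts = lemma_counts(sample)\n",
    "print(counts.most_common(1))"
   ]
  },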
  {
   "cell_type": "markdown",
   "id": "f9c0f1d4",
   "metadata": {},
   "source": [
    "**Dependency Parser**: returns a list of lists of tuple of length 5 (word id, word text, head id, head text, deprel). Each list corresponds to a sentence in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "315dcca0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-02-16 09:58:34 INFO: Downloading these customized packages for language: en (English)...\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "| pretrain  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:58:34 INFO: File exists: /home/lily/ky334/stanza_resources/en/tokenize/craft.pt.\n",
      "2022-02-16 09:58:34 INFO: File exists: /home/lily/ky334/stanza_resources/en/pos/craft.pt.\n",
      "2022-02-16 09:58:34 INFO: File exists: /home/lily/ky334/stanza_resources/en/lemma/craft.pt.\n",
      "2022-02-16 09:58:34 INFO: File exists: /home/lily/ky334/stanza_resources/en/depparse/craft.pt.\n",
      "2022-02-16 09:58:35 INFO: File exists: /home/lily/ky334/stanza_resources/en/pretrain/craft.pt.\n",
      "2022-02-16 09:58:35 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-02-16 09:58:35 INFO: Loading these models for language: en (English):\n",
      "=======================\n",
      "| Processor | Package |\n",
      "-----------------------\n",
      "| tokenize  | craft   |\n",
      "| pos       | craft   |\n",
      "| lemma     | craft   |\n",
      "| depparse  | craft   |\n",
      "=======================\n",
      "\n",
      "2022-02-16 09:58:35 INFO: Use device: gpu\n",
      "2022-02-16 09:58:35 INFO: Loading: tokenize\n",
      "2022-02-16 09:58:35 INFO: Loading: pos\n",
      "2022-02-16 09:58:35 INFO: Loading: lemma\n",
      "2022-02-16 09:58:35 INFO: Loading: depparse\n",
      "2022-02-16 09:58:35 INFO: Done loading processors!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[[(1, 'Neurons', 13, 'units', 'nsubj'),\n",
       "  (2, '(', 1, 'Neurons', 'punct'),\n",
       "  (3, 'also', 4, 'called', 'advmod'),\n",
       "  (4, 'called', 1, 'Neurons', 'acl'),\n",
       "  (5, 'neurones', 4, 'called', 'xcomp'),\n",
       "  (6, 'or', 8, 'cells', 'cc'),\n",
       "  (7, 'nerve', 8, 'cells', 'compound'),\n",
       "  (8, 'cells', 5, 'neurones', 'conj'),\n",
       "  (9, ')', 13, 'units', 'punct'),\n",
       "  (10, 'are', 13, 'units', 'cop'),\n",
       "  (11, 'the', 13, 'units', 'det'),\n",
       "  (12, 'fundamental', 13, 'units', 'amod'),\n",
       "  (13, 'units', 0, 'root', 'root'),\n",
       "  (14, 'of', 16, 'brain', 'case'),\n",
       "  (15, 'the', 16, 'brain', 'det'),\n",
       "  (16, 'brain', 13, 'units', 'nmod'),\n",
       "  (17, 'and', 19, 'system', 'cc'),\n",
       "  (18, 'nervous', 19, 'system', 'amod'),\n",
       "  (19, 'system', 16, 'brain', 'conj'),\n",
       "  (20, ',', 13, 'units', 'punct'),\n",
       "  (21, 'the', 22, 'cells', 'det'),\n",
       "  (22, 'cells', 13, 'units', 'appos'),\n",
       "  (23, 'responsible', 22, 'cells', 'amod'),\n",
       "  (24, 'for', 25, 'receiving', 'mark'),\n",
       "  (25, 'receiving', 23, 'responsible', 'advcl'),\n",
       "  (26, 'sensory', 27, 'input', 'amod'),\n",
       "  (27, 'input', 25, 'receiving', 'obj'),\n",
       "  (28, 'from', 31, 'world', 'case'),\n",
       "  (29, 'the', 31, 'world', 'det'),\n",
       "  (30, 'external', 31, 'world', 'amod'),\n",
       "  (31, 'world', 25, 'receiving', 'obl'),\n",
       "  (32, ',', 13, 'units', 'punct'),\n",
       "  (33, 'for', 34, 'sending', 'mark'),\n",
       "  (34, 'sending', 25, 'receiving', 'advcl'),\n",
       "  (35, 'motor', 36, 'commands', 'compound'),\n",
       "  (36, 'commands', 34, 'sending', 'obj'),\n",
       "  (37, 'to', 39, 'muscles', 'case'),\n",
       "  (38, 'our', 39, 'muscles', 'nmod:poss'),\n",
       "  (39, 'muscles', 34, 'sending', 'obl'),\n",
       "  (40, ',', 34, 'sending', 'punct'),\n",
       "  (41, 'and', 43, 'transforming', 'cc'),\n",
       "  (42, 'for', 43, 'transforming', 'mark'),\n",
       "  (43, 'transforming', 34, 'sending', 'conj'),\n",
       "  (44, 'and', 45, 'relaying', 'cc'),\n",
       "  (45, 'relaying', 43, 'transforming', 'conj'),\n",
       "  (46, 'the', 48, 'signals', 'det'),\n",
       "  (47, 'electrical', 48, 'signals', 'amod'),\n",
       "  (48, 'signals', 43, 'transforming', 'obj'),\n",
       "  (49, 'at', 51, 'step', 'case'),\n",
       "  (50, 'every', 51, 'step', 'det'),\n",
       "  (51, 'step', 43, 'transforming', 'obl'),\n",
       "  (52, 'in', 53, 'between', 'case'),\n",
       "  (53, 'between', 51, 'step', 'nmod'),\n",
       "  (54, '.', 13, 'units', 'punct')],\n",
       " [(1, 'More', 7, 'define', 'advmod'),\n",
       "  (2, 'than', 3, 'that', 'case'),\n",
       "  (3, 'that', 1, 'More', 'obl'),\n",
       "  (4, ',', 7, 'define', 'punct'),\n",
       "  (5, 'their', 6, 'interactions', 'nmod:poss'),\n",
       "  (6, 'interactions', 7, 'define', 'nsubj'),\n",
       "  (7, 'define', 0, 'root', 'root'),\n",
       "  (8, 'who', 12, 'people', 'nsubj'),\n",
       "  (9, 'we', 12, 'people', 'nsubj'),\n",
       "  (10, 'are', 12, 'people', 'cop'),\n",
       "  (11, 'as', 12, 'people', 'case'),\n",
       "  (12, 'people', 7, 'define', 'ccomp'),\n",
       "  (13, '.', 7, 'define', 'punct')],\n",
       " [(1, 'Having', 2, 'said', 'aux'),\n",
       "  (2, 'said', 0, 'root', 'root'),\n",
       "  (3, 'that', 11, 'interact', 'mark'),\n",
       "  (4, ',', 11, 'interact', 'punct'),\n",
       "  (5, 'our', 9, 'neurons', 'nmod:poss'),\n",
       "  (6, 'roughly', 7, '100', 'advmod'),\n",
       "  (7, '100', 8, 'billion', 'compound'),\n",
       "  (8, 'billion', 9, 'neurons', 'nummod'),\n",
       "  (9, 'neurons', 11, 'interact', 'nsubj'),\n",
       "  (10, 'do', 11, 'interact', 'aux'),\n",
       "  (11, 'interact', 2, 'said', 'ccomp'),\n",
       "  (12, 'closely', 11, 'interact', 'advmod'),\n",
       "  (13, 'with', 16, 'types', 'case'),\n",
       "  (14, 'other', 16, 'types', 'amod'),\n",
       "  (15, 'cell', 16, 'types', 'compound'),\n",
       "  (16, 'types', 11, 'interact', 'obl'),\n",
       "  (17, ',', 16, 'types', 'punct'),\n",
       "  (18, 'broadly', 19, 'classified', 'advmod'),\n",
       "  (19, 'classified', 16, 'types', 'acl'),\n",
       "  (20, 'as', 21, 'glia', 'case'),\n",
       "  (21, 'glia', 19, 'classified', 'obl'),\n",
       "  (22, '(', 26, 'outnumber', 'punct'),\n",
       "  (23, 'these', 26, 'outnumber', 'nsubj'),\n",
       "  (24, 'may', 26, 'outnumber', 'aux'),\n",
       "  (25, 'actually', 26, 'outnumber', 'advmod'),\n",
       "  (26, 'outnumber', 11, 'interact', 'parataxis'),\n",
       "  (27, 'neurons', 26, 'outnumber', 'obj'),\n",
       "  (28, ',', 26, 'outnumber', 'punct'),\n",
       "  (29, 'although', 34, 'known', 'mark'),\n",
       "  (30, 'it', 34, 'known', 'nsubj:pass'),\n",
       "  (31, '’s', 34, 'known', 'aux:pass'),\n",
       "  (32, 'not', 34, 'known', 'advmod'),\n",
       "  (33, 'really', 34, 'known', 'advmod'),\n",
       "  (34, 'known', 26, 'outnumber', 'advcl'),\n",
       "  (35, ')', 26, 'outnumber', 'punct'),\n",
       "  (36, '.', 2, 'said', 'punct')]]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kit.get_dependency()"
   ]
  },
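  {
   "cell_type": "markdown",
   "id": "9a41c0de",
   "metadata": {},
   "source": [
    "The 5-tuples can be regrouped into a tree with a few lines of plain Python. As a minimal sketch (`children_by_head` is a hypothetical helper, not an EHRKit function, and `sample` is the second sentence of the output above, copied by hand), the next cell collects each word under its head id and then reads off the root and its direct dependents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a41c0df",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Hypothetical helper, not part of EHRKit: it only assumes the\n",
    "# (word id, word text, head id, head text, deprel) layout shown above.\n",
    "def children_by_head(sentence):\n",
    "    children = defaultdict(list)\n",
    "    for word_id, text, head_id, _head_text, deprel in sentence:\n",
    "        children[head_id].append((word_id, text, deprel))\n",
    "    return children\n",
    "\n",
    "# Second sentence of the dependency output above, copied by hand.\n",
    "sample = [\n",
    "    (1, 'More', 7, 'define', 'advmod'),\n",
    "    (2, 'than', 3, 'that', 'case'),\n",
    "    (3, 'that', 1, 'More', 'obl'),\n",
    "    (4, ',', 7, 'define', 'punct'),\n",
    "    (5, 'their', 6, 'interactions', 'nmod:poss'),\n",
    "    (6, 'interactions', 7, 'define', 'nsubj'),\n",
    "    (7, 'define', 0, 'root', 'root'),\n",
    "    (8, 'who', 12, 'people', 'nsubj'),\n",
    "    (9, 'we', 12, 'people', 'nsubj'),\n",
    "    (10, 'are', 12, 'people', 'cop'),\n",
    "    (11, 'as', 12, 'people', 'case'),\n",
    "    (12, 'people', 7, 'define', 'ccomp'),\n",
    "    (13, '.', 7, 'define', 'punct'),\n",
    "]\n",
    "\n",
    "tree = children_by_head(sample)\n",
    "root_id, root_text, _ = tree[0][0]  # head id 0 marks the root\n",
    "root_dependents = [text for _, text, _ in tree[root_id]]\n",
    "print(root_text)\n",
    "print(root_dependents)"
   ]
  },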
  {
   "cell_type": "markdown",
   "id": "bf38cdb8",
   "metadata": {},
   "source": [
    "**Clustering**: performs k-means clustering with documents represented using pre-trained transformers, returns a dataframe with 2 columns: note and assigned cluster id. Main record and supporting records are combined in clustering. Default number of clusters is 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ae688d70",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-01-30 16:13:02 INFO: Downloading default packages for language: en (English)...\n",
      "2022-01-30 16:13:03 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-01-30 16:13:08 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-01-30 16:13:08 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-01-30 16:13:08 INFO: Use device: gpu\n",
      "2022-01-30 16:13:08 INFO: Loading: tokenize\n",
      "2022-01-30 16:13:08 INFO: Done loading processors!\n",
      "Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']\n",
      "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>note</th>\n",
       "      <th>cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Neurons (also called neurones or nerve cells) ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A neural network is a series of algorithms tha...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Prescription aspirin is used to relieve the sy...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>People can buy aspirin over the counter withou...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                note  cluster\n",
       "0  Neurons (also called neurones or nerve cells) ...        0\n",
       "1  A neural network is a series of algorithms tha...        0\n",
       "2  Prescription aspirin is used to relieve the sy...        1\n",
       "3  People can buy aspirin over the counter withou...        1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "''' A document about neuron.'''\n",
    "record = \"Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, \" \\\n",
    "         \"the cells responsible for receiving sensory input from the external world, for sending motor commands to \" \\\n",
    "         \"our muscles, and for transforming and relaying the electrical signals at every step in between. More than \" \\\n",
    "         \"that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do\" \\\n",
    "         \" interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, \" \\\n",
    "         \"although it’s not really known).\"\n",
    "\n",
    "# reference: https://www.investopedia.com/terms/n/neuralnetwork.asp\n",
    "''' A document about neural network. '''\n",
    "cand1 = \"A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of \" \\\n",
    "        \"data through a process that mimics the way the human brain operates. In this sense, neural networks refer to \" \\\n",
    "        \"systems of neurons, either organic or artificial in nature.\"\n",
    "\n",
    "# reference: https://medlineplus.gov/druginfo/meds/a682878.html\n",
    "''' A document about aspirin. '''\n",
    "cand2 = \"Prescription aspirin is used to relieve the symptoms of rheumatoid arthritis (arthritis caused by swelling \" \\\n",
    "        \"of the lining of the joints), osteoarthritis (arthritis caused by breakdown of the lining of the joints), \" \\\n",
    "        \"systemic lupus erythematosus (condition in which the immune system attacks the joints and organs and causes \" \\\n",
    "        \"pain and swelling) and certain other rheumatologic conditions (conditions in which the immune system \" \\\n",
    "        \"attacks parts of the body).\"\n",
    "\n",
    "# reference: https://www.medicalnewstoday.com/articles/161255\n",
    "''' Another document about aspirin. '''\n",
    "cand3 = \"People can buy aspirin over the counter without a prescription. Everyday uses include relieving headache, \" \\\n",
    "        \"reducing swelling, and reducing a fever. Taken daily, aspirin can lower the risk of cardiovascular events, \" \\\n",
    "        \"such as a heart attack or stroke, in people with a high risk. Doctors may administer aspirin immediately\" \\\n",
    "        \" after a heart attack to prevent further clots and heart tissue death.\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.replace_supporting_records([cand1, cand2, cand3])\n",
    "\n",
    "kit.get_clusters(k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e745076",
   "metadata": {},
   "source": [
    "In this example, we see that the document about neurons and the document about neural networks are grouped into one cluster. The two documents about aspirin are grouped into a second cluster. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b56ca67",
   "metadata": {},
   "source": [
    "**Similar document retrieval**: retrieve top_k documents in candidate_notes that are most similar to query_note, returns a dataframe with candidate_note_id, similarity_score, and candidate_text. Default number of similar documents is 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a4fca47a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-01-30 16:15:42 INFO: Downloading default packages for language: en (English)...\n",
      "2022-01-30 16:15:43 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-01-30 16:15:48 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-01-30 16:15:48 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-01-30 16:15:48 INFO: Use device: gpu\n",
      "2022-01-30 16:15:48 INFO: Loading: tokenize\n",
      "2022-01-30 16:15:48 INFO: Done loading processors!\n",
      "Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']\n",
      "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-01-30 16:15:52 INFO: Downloading default packages for language: en (English)...\n",
      "2022-01-30 16:15:53 INFO: File exists: /home/lily/ky334/stanza_resources/en/default.zip.\n",
      "2022-01-30 16:15:58 INFO: Finished downloading models and saved to /home/lily/ky334/stanza_resources.\n",
      "2022-01-30 16:15:58 INFO: Loading these models for language: en (English):\n",
      "========================\n",
      "| Processor | Package  |\n",
      "------------------------\n",
      "| tokenize  | combined |\n",
      "========================\n",
      "\n",
      "2022-01-30 16:15:58 INFO: Use device: gpu\n",
      "2022-01-30 16:15:58 INFO: Loading: tokenize\n",
      "2022-01-30 16:15:58 INFO: Done loading processors!\n",
      "Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']\n",
      "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>candidate_id</th>\n",
       "      <th>similarity_score</th>\n",
       "      <th>candidate_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.864067</td>\n",
       "      <td>A neural network is a series of algorithms tha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0.824104</td>\n",
       "      <td>People can buy aspirin over the counter withou...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0.746512</td>\n",
       "      <td>Prescription aspirin is used to relieve the sy...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   candidate_id  similarity_score  \\\n",
       "0             0          0.864067   \n",
       "1             2          0.824104   \n",
       "2             1          0.746512   \n",
       "\n",
       "                                      candidate_text  \n",
       "0  A neural network is a series of algorithms tha...  \n",
       "1  People can buy aspirin over the counter withou...  \n",
       "2  Prescription aspirin is used to relieve the sy...  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# using the same documents as in clustering\n",
    "kit.get_similar_documents(3)"
   ]
  },
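  {
   "cell_type": "markdown",
   "id": "e07b35f2",
   "metadata": {},
   "source": [
    "The ranking step can be illustrated independently of the embedding model. The next cell is a minimal sketch that assumes cosine similarity over fixed-size document vectors; this notebook does not state which similarity measure EHRKit uses internally, and `cosine`, `rank_candidates`, and the toy vectors are hypothetical stand-ins for real embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e07b35f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hypothetical helpers, not part of EHRKit.\n",
    "def cosine(u, v):\n",
    "    dot = sum(a * b for a, b in zip(u, v))\n",
    "    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))\n",
    "    return dot / norm\n",
    "\n",
    "def rank_candidates(query_vec, candidate_vecs, top_k=2):\n",
    "    # Score every candidate against the query, then keep the top_k best.\n",
    "    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]\n",
    "    scored.sort(key=lambda pair: pair[1], reverse=True)\n",
    "    return scored[:top_k]\n",
    "\n",
    "# Toy 2-dimensional vectors standing in for document embeddings.\n",
    "query = [1.0, 0.0]\n",
    "candidates = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]\n",
    "\n",
    "ranked = rank_candidates(query, candidates, top_k=2)\n",
    "print(ranked)"
   ]
  },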
  {
   "cell_type": "markdown",
   "id": "327e6d13",
   "metadata": {},
   "source": [
    "**Summarization**: summarize a single document (main record) or multiple documents (main records AND supporting records)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "3d3115ee",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Your max_length is set to 200, but you input_length is only 100. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'the city of Paris is the centre and seat of government of the region and province of Île-de-France . it has an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles)'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# single-document summarization\n",
    "kit = EHRKit()\n",
    "\n",
    "main_record = \"Paris is the capital and most populous city of France, \\\n",
    "with an estimated population of 2,175,601 residents as of 2018, in an \\\n",
    "area of more than 105 square kilometres (41 square miles). The City of \\\n",
    "Paris is the centre and seat of government of the region and province \\\n",
    "of Île-de-France, or Paris Region, which has an estimated population of \\\n",
    "12,174,880, or about 18 percent of the population of France as of 2017.\"\n",
    "\n",
    "kit.update_and_delete_main_record(main_record)\n",
    "kit.get_single_record_summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "4b3d8784",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'– Neurons, also known as nerve cells, are the fundamental units of the brain and nervous system, the cells responsible for receiving sensory input from the external world, for sending motor commands to our muscles, and for transforming and relaying the electrical signals at every step in between. More than that, their interactions define who we are as people, the New York Times reports. Neuron networks are a series of algorithms that endeavor to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# multi-document summarization\n",
    "kit = EHRKit()\n",
    "\n",
    "record = \"Neurons (also called neurones or nerve cells) are the fundamental units of the brain and nervous system, \" \\\n",
    "         \"the cells responsible for receiving sensory input from the external world, for sending motor commands to \" \\\n",
    "         \"our muscles, and for transforming and relaying the electrical signals at every step in between. More than \" \\\n",
    "         \"that, their interactions define who we are as people. Having said that, our roughly 100 billion neurons do\" \\\n",
    "         \" interact closely with other cell types, broadly classified as glia (these may actually outnumber neurons, \" \\\n",
    "         \"although it’s not really known).\"\n",
    "\n",
    "doc1 = \"A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of \" \\\n",
    "        \"data through a process that mimics the way the human brain operates. In this sense, neural networks refer to \" \\\n",
    "        \"systems of neurons, either organic or artificial in nature.\"\n",
    "\n",
    "# https://qbi.uq.edu.au/brain/brain-physiology/what-are-neurotransmitters\n",
    "doc2 = \"Neurotransmitters are often referred to as the body’s chemical messengers. They are the molecules used by the \" \\\n",
    "       \"nervous system to transmit messages between neurons, or from neurons to muscles. Communication between two neurons \" \\\n",
    "       \"happens in the synaptic cleft (the small gap between the synapses of neurons). Here, electrical signals that have \"\\\n",
    "       \"travelled along the axon are briefly converted into chemical ones through the release of neurotransmitters, causing \"\\\n",
    "       \"a specific response in the receiving neuron.\"\n",
    "\n",
    "kit.update_and_delete_main_record(record)\n",
    "kit.replace_supporting_records([doc1, doc2])\n",
    "\n",
    "kit.get_multi_record_summary()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "lily-ehrkit",
   "language": "python",
   "name": "lily-ehrkit"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}