{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PrimeKG Loader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we show how to load PrimeKG as two dataframes: one describing the entities (nodes) and one describing the relations (edges) of the knowledge graph.\n",
    "\n",
    "Background information on PrimeKG can be found in the following repositories:\n",
    "- https://github.com/mims-harvard/PrimeKG\n",
    "- https://github.com/mims-harvard/TDC/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:\n",
    "\n",
    "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM\n",
    "\n",
    "At the time of writing, the latest version of PrimeKG (`kg.csv`) is `2.1`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\mulyadi\\Repo\\AIAgents4Pharma\\venv\\Lib\\site-packages\\pydantic\\_internal\\_fields.py:132: UserWarning: Field \"model_id\" in SysBioModel has conflict with protected namespace \"model_\".\n",
      "\n",
      "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n",
      "  warnings.warn(\n",
      "c:\\Users\\mulyadi\\Repo\\AIAgents4Pharma\\venv\\Lib\\site-packages\\pydantic\\_internal\\_fields.py:132: UserWarning: Field \"model_id\" in BasicoModel has conflict with protected namespace \"model_\".\n",
      "\n",
      "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n",
      "  warnings.warn(\n",
      "c:\\Users\\mulyadi\\Repo\\AIAgents4Pharma\\venv\\Lib\\site-packages\\pydantic\\_internal\\_fields.py:132: UserWarning: Field \"model_data\" in SimulateModelInput has conflict with protected namespace \"model_\".\n",
      "\n",
      "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "# Import necessary libraries\n",
    "import sys\n",
    "sys.path.append('../../..')\n",
    "from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load PrimeKG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `PrimeKG` dataset class downloads the data from the Harvard Dataverse server if it is not available locally.\n",
    "\n",
    "Otherwise, the data is loaded from the local directory specified by `local_dir`."
   ]
  },
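  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The load-or-download behaviour described above can be sketched as follows. This is only an illustration of the general pattern, not the actual implementation of `PrimeKG`: the helper `load_or_fetch`, the stand-in `fake_fetch`, and the file name `nodes.csv` are hypothetical.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def load_or_fetch(path, fetch):\n",
    "    # Return the CSV at `path`; call `fetch(path)` first if the file is missing\n",
    "    path = Path(path)\n",
    "    if not path.exists():\n",
    "        path.parent.mkdir(parents=True, exist_ok=True)\n",
    "        fetch(path)\n",
    "    return pd.read_csv(path)\n",
    "\n",
    "\n",
    "calls = []\n",
    "\n",
    "\n",
    "def fake_fetch(path):\n",
    "    # Hypothetical stand-in for a real download: writes a one-row CSV\n",
    "    calls.append(path)\n",
    "    pd.DataFrame({'node_index': [0], 'node_name': ['PHYHIP']}).to_csv(path, index=False)\n",
    "\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    target = Path(tmp) / 'primekg' / 'nodes.csv'\n",
    "    first = load_or_fetch(target, fake_fetch)   # file missing: fetch is called\n",
    "    second = load_or_fetch(target, fake_fetch)  # file present: read from disk\n",
    "\n",
    "print(len(calls))  # number of times the fetch step ran\n",
    "```\n",
    "\n",
    "Passing the download step in as a function keeps the sketch runnable offline; a real loader would fetch the file from the server at that point."
   ]
  },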
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define primekg data by providing a local directory where the data is stored\n",
    "primekg_data = PrimeKG(local_dir=\"../../../../data/primekg/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To load the node and edge dataframes, we call `load_data()` and then retrieve them with `get_nodes()` and `get_edges()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading node file from https://dataverse.harvard.edu/api/access/datafile/6180617\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 8.89M/8.89M [00:01<00:00, 8.06MiB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading edge file from https://dataverse.harvard.edu/api/access/datafile/6180616\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 387M/387M [00:35<00:00, 10.8MiB/s] \n"
     ]
    }
   ],
   "source": [
    "# Invoke a method to load the data\n",
    "primekg_data.load_data()\n",
    "\n",
    "# Get primekg_nodes and primekg_edges\n",
    "primekg_nodes = primekg_data.get_nodes()\n",
    "primekg_edges = primekg_data.get_edges()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check PrimeKG Dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned before, `primekg_nodes` and `primekg_edges` are the dataframes of nodes and edges, respectively.\n",
    "\n",
    "We can further analyze the dataframes to extract the information we need.\n",
    "\n",
    "For instance, we can construct a graph from the nodes and edges dataframes using the networkx library."
   ]
  },
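  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the graph construction mentioned above, the snippet below builds a `networkx` graph from two toy dataframes that mimic the PrimeKG column names. The toy rows are copied from the sample outputs in this notebook; the choice of a `MultiDiGraph` is an assumption, not something prescribed by PrimeKG.\n",
    "\n",
    "```python\n",
    "import networkx as nx\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for primekg_nodes and primekg_edges\n",
    "nodes = pd.DataFrame({\n",
    "    'node_index': [0, 8889, 1, 2798],\n",
    "    'node_name': ['PHYHIP', 'KIF15', 'GPANK1', 'PNMA1'],\n",
    "    'node_type': ['gene/protein'] * 4,\n",
    "})\n",
    "edges = pd.DataFrame({\n",
    "    'head_index': [0, 1],\n",
    "    'tail_index': [8889, 2798],\n",
    "    'display_relation': ['ppi', 'ppi'],\n",
    "})\n",
    "\n",
    "# One graph edge per dataframe row, keeping the relation label as an edge attribute\n",
    "graph = nx.from_pandas_edgelist(\n",
    "    edges,\n",
    "    source='head_index',\n",
    "    target='tail_index',\n",
    "    edge_attr='display_relation',\n",
    "    create_using=nx.MultiDiGraph(),\n",
    ")\n",
    "\n",
    "# Attach node attributes, keyed by node_index\n",
    "graph.add_nodes_from(\n",
    "    (row.node_index, {'node_name': row.node_name, 'node_type': row.node_type})\n",
    "    for row in nodes.itertuples(index=False)\n",
    ")\n",
    "\n",
    "print(graph.number_of_nodes(), graph.number_of_edges())\n",
    "```\n",
    "\n",
    "On the full dataset (about 8.1M edges) this may take a while and a fair amount of memory, so filtering the edges dataframe first may be worthwhile."
   ]
  },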
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### PrimeKG Nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`primekg_nodes` is a dataframe of nodes, which has the following columns:\n",
    "- `node_index`: the index of the node\n",
    "- `node_name`: the name of the node\n",
    "- `node_source`: the source database of the node\n",
    "- `node_id`: the id of the node in its source database\n",
    "- `node_type`: the type of the node\n",
    "\n",
    "We can check a sample of the primekg nodes to see the list of nodes in the PrimeKG dataset as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_index</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_source</th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>PHYHIP</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>9796</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>GPANK1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7918</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>ZRSR2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>8233</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>NRF1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4899</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>PI4KA</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5297</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   node_index node_name node_source node_id     node_type\n",
       "0           0    PHYHIP        NCBI    9796  gene/protein\n",
       "1           1    GPANK1        NCBI    7918  gene/protein\n",
       "2           2     ZRSR2        NCBI    8233  gene/protein\n",
       "3           3      NRF1        NCBI    4899  gene/protein\n",
       "4           4     PI4KA        NCBI    5297  gene/protein"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check a sample of the primekg nodes\n",
    "primekg_nodes.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The current version of PrimeKG has about 130K nodes in total, as the following cell shows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(129375, 5)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check dimensions of the primekg nodes\n",
    "primekg_nodes.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can break down the primekg nodes by type as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "node_type\n",
       "biological_process    28642\n",
       "gene/protein          27671\n",
       "disease               17080\n",
       "effect/phenotype      15311\n",
       "anatomy               14035\n",
       "molecular_function    11169\n",
       "drug                   7957\n",
       "cellular_component     4176\n",
       "pathway                2516\n",
       "exposure                818\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show node types and their counts\n",
    "primekg_nodes['node_type'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PrimeKG was built from various sources, as the node counts per source show."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "node_source\n",
       "GO               43987\n",
       "NCBI             27671\n",
       "MONDO            15813\n",
       "HPO              15311\n",
       "UBERON           14035\n",
       "DrugBank          7957\n",
       "REACTOME          2516\n",
       "MONDO_grouped     1267\n",
       "CTD                818\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show source of the primekg nodes\n",
    "primekg_nodes['node_source'].value_counts()"
   ]
  },
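  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two sets of counts above suggest that some sources cover several node types, and some node types draw on several sources: the `GO` count (43987) equals the sum of the `biological_process`, `molecular_function`, and `cellular_component` counts (28642 + 11169 + 4176), and the `disease` count (17080) equals `MONDO` plus `MONDO_grouped` (15813 + 1267). A cross-tabulation makes such a mapping explicit. The sketch below uses a hypothetical toy dataframe with the same column names; the same call could be applied to `primekg_nodes`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy rows using the PrimeKG column names\n",
    "nodes = pd.DataFrame({\n",
    "    'node_type': ['gene/protein', 'biological_process', 'molecular_function',\n",
    "                  'disease', 'disease'],\n",
    "    'node_source': ['NCBI', 'GO', 'GO', 'MONDO', 'MONDO_grouped'],\n",
    "})\n",
    "\n",
    "# Rows are node types, columns are sources, cells are node counts\n",
    "type_by_source = pd.crosstab(nodes['node_type'], nodes['node_source'])\n",
    "print(type_by_source)\n",
    "```"
   ]
  },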
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### PrimeKG Edges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`primekg_edges` is a dataframe of edges, which has the following columns:\n",
    "- `head_index`: the index of the head node\n",
    "- `head_name`: the name of the head node\n",
    "- `head_source`: the source database of the head node\n",
    "- `head_id`: the id of the head node in its source database\n",
    "- `head_type`: the type of the head node\n",
    "- `tail_index`: the index of the tail node\n",
    "- `tail_name`: the name of the tail node\n",
    "- `tail_source`: the source database of the tail node\n",
    "- `tail_id`: the id of the tail node in its source database\n",
    "- `tail_type`: the type of the tail node\n",
    "- `display_relation`: the display label of the edge type (e.g. `ppi`)\n",
    "- `relation`: the type of the edge (e.g. `protein_protein`)"
   ]
  },
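  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With these columns, selecting one edge type and looking up the neighbours of a node are simple dataframe filters. The sketch below uses a toy dataframe whose rows are copied from the output of `primekg_edges.head()` in this notebook; the variable names `ppi` and `neighbours` are illustrative.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for primekg_edges with a subset of its columns\n",
    "edges = pd.DataFrame({\n",
    "    'head_index': [0, 1, 2],\n",
    "    'head_name': ['PHYHIP', 'GPANK1', 'ZRSR2'],\n",
    "    'tail_index': [8889, 2798, 5646],\n",
    "    'tail_name': ['KIF15', 'PNMA1', 'TTC33'],\n",
    "    'display_relation': ['ppi', 'ppi', 'ppi'],\n",
    "})\n",
    "\n",
    "# Keep only the edges of one type\n",
    "ppi = edges[edges['display_relation'] == 'ppi']\n",
    "\n",
    "# Names of the tail nodes connected to a given head node\n",
    "neighbours = ppi.loc[ppi['head_name'] == 'PHYHIP', 'tail_name'].tolist()\n",
    "print(neighbours)\n",
    "```"
   ]
  },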
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also check a sample of the primekg edges to see the interconnections between the nodes in the PrimeKG dataset as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>PHYHIP</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>9796</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>8889</td>\n",
       "      <td>KIF15</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>56992</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>ppi</td>\n",
       "      <td>protein_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>GPANK1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7918</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>2798</td>\n",
       "      <td>PNMA1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>9240</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>ppi</td>\n",
       "      <td>protein_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>ZRSR2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>8233</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>5646</td>\n",
       "      <td>TTC33</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>23548</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>ppi</td>\n",
       "      <td>protein_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>NRF1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4899</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>11592</td>\n",
       "      <td>MAN1B1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>11253</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>ppi</td>\n",
       "      <td>protein_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>PI4KA</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5297</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>2122</td>\n",
       "      <td>RGS20</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>8601</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>ppi</td>\n",
       "      <td>protein_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   head_index head_name head_source head_id     head_type  tail_index  \\\n",
       "0           0    PHYHIP        NCBI    9796  gene/protein        8889   \n",
       "1           1    GPANK1        NCBI    7918  gene/protein        2798   \n",
       "2           2     ZRSR2        NCBI    8233  gene/protein        5646   \n",
       "3           3      NRF1        NCBI    4899  gene/protein       11592   \n",
       "4           4     PI4KA        NCBI    5297  gene/protein        2122   \n",
       "\n",
       "  tail_name tail_source tail_id     tail_type display_relation  \\\n",
       "0     KIF15        NCBI   56992  gene/protein              ppi   \n",
       "1     PNMA1        NCBI    9240  gene/protein              ppi   \n",
       "2     TTC33        NCBI   23548  gene/protein              ppi   \n",
       "3    MAN1B1        NCBI   11253  gene/protein              ppi   \n",
       "4     RGS20        NCBI    8601  gene/protein              ppi   \n",
       "\n",
       "          relation  \n",
       "0  protein_protein  \n",
       "1  protein_protein  \n",
       "2  protein_protein  \n",
       "3  protein_protein  \n",
       "4  protein_protein  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check a sample of the primekg edges\n",
    "primekg_edges.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The current version of PrimeKG has about 8.1M edges in total, as the following cell shows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(8100498, 12)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check dimensions of the primekg edges\n",
    "primekg_edges.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can break down the primekg edges by type as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "display_relation\n",
       "expression present         3036406\n",
       "synergistic interaction    2672628\n",
       "interacts with              686550\n",
       "ppi                         642150\n",
       "phenotype present           300634\n",
       "parent-child                281744\n",
       "associated with             167482\n",
       "side effect                 129568\n",
       "contraindication             61350\n",
       "expression absent            39774\n",
       "target                       32760\n",
       "indication                   18776\n",
       "enzyme                       10634\n",
       "transporter                   6184\n",
       "off-label use                 5136\n",
       "linked to                     4608\n",
       "phenotype absent              2386\n",
       "carrier                       1728\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show edge types and their counts\n",
    "primekg_edges['display_relation'].value_counts()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}