{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PrimeKG Subgraph Construction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we show how to construct a subgraph from PrimeKG and prepare the graph formats needed for further analysis.\n",
    "\n",
    "In particular, we will slice a subgraph from PrimeKG related to inflammatory bowel disease (IBD).\n",
    "\n",
    "The subgraph will contain all nodes and edges that are connected to IBD-related disease nodes, including the following relationships:\n",
    "- Disease-Protein Relationship\n",
    "- Disease-Disease Relationship (skipped as of now)\n",
    "- Protein-Protein Relationship (skipped as of now)\n",
    "- Drug-Protein Relationship\n",
    "- Pathway-Protein Relationship\n",
    "- Pathway-Pathway Relationship (skipped as of now)\n",
    "- Bioprocess-Protein Relationship\n",
    "- Molecular Function-Protein Relationship\n",
    "- Cellular Component-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, to enrich the nodes and edges, we will perform the following tasks:\n",
    "- Textual enrichment (the only task implemented so far)\n",
    "- Multi-modal enrichment (to be added)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import networkx as nx\n",
    "import pickle\n",
    "from tqdm import tqdm\n",
    "from torch_geometric.utils import from_networkx\n",
    "import sys\n",
    "sys.path.append('../../..')\n",
    "from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG\n",
    "from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG\n",
    "from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama\n",
    "from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils\n",
    "\n",
    "# Set the logging level for httpx to WARNING to suppress INFO messages\n",
    "import logging\n",
    "logging.getLogger(\"httpx\").setLevel(logging.WARNING)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PrimeKG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We utilize the `PrimeKG` class from the aiagents4pharma/talk2knowledgegraphs library.\n",
    "\n",
    "`PrimeKG` is initialized with the path to the local directory in which the PrimeKG dataset is stored and from which it is loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading nodes of PrimeKG dataset ...\n",
      "../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.\n",
      "Loading edges of PrimeKG dataset ...\n",
      "../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.\n"
     ]
    }
   ],
   "source": [
    "# Define primekg data by providing a local directory where the data is stored\n",
    "primekg_data = PrimeKG(local_dir=\"../../../../data/primekg/\")\n",
    "\n",
    "# Invoke a method to load the data\n",
    "primekg_data.load_data()\n",
    "\n",
    "# Get primekg_nodes and primekg_edges\n",
    "primekg_nodes = primekg_data.get_nodes()\n",
    "primekg_edges = primekg_data.get_edges()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### IBD-related Data Filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### IBD-related Disease Nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a first step, we filter `primekg_nodes` for disease nodes whose names contain any of the following terms (matched case-insensitively):\n",
    "- inflammatory bowel disease\n",
    "- crohn\n",
    "- ulcerative colitis\n",
    "\n",
    "This basic query is sufficient for now, but it can be replaced with a more complex query that captures more IBD-related nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_index</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_source</th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>27269</th>\n",
       "      <td>27269</td>\n",
       "      <td>IL21-related infantile inflammatory bowel disease</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>14338</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28158</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29293</th>\n",
       "      <td>29293</td>\n",
       "      <td>inflammatory bowel disease, immunodeficiency, ...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>32601</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35814</th>\n",
       "      <td>35814</td>\n",
       "      <td>Crohn ileitis and jejunitis</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>709_21207</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35815</th>\n",
       "      <td>35815</td>\n",
       "      <td>small bowel Crohn disease</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5539</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37784</th>\n",
       "      <td>37784</td>\n",
       "      <td>Crohn disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>5011_5535</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37785</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39013</th>\n",
       "      <td>39013</td>\n",
       "      <td>immune dysregulation-inflammatory bowel diseas...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>16542</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39787</th>\n",
       "      <td>39787</td>\n",
       "      <td>immune dysregulation with inflammatory bowel d...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>33967</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83770</th>\n",
       "      <td>83770</td>\n",
       "      <td>Crohn's colitis</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5532</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95279</th>\n",
       "      <td>95279</td>\n",
       "      <td>Crohn jejunoileitis</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>708</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95280</th>\n",
       "      <td>95280</td>\n",
       "      <td>gastroduodenal Crohn disease</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>710</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97088</th>\n",
       "      <td>97088</td>\n",
       "      <td>perianal Crohn disease</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5537</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99325</th>\n",
       "      <td>99325</td>\n",
       "      <td>Crohn disease of the esophagus</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>22901</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99680</th>\n",
       "      <td>99680</td>\n",
       "      <td>immune dysregulation-inflammatory bowel diseas...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>33968</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99681</th>\n",
       "      <td>99681</td>\n",
       "      <td>inflammatory bowel disease-recurrent sinopulmo...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>33969</td>\n",
       "      <td>disease</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node_index                                          node_name  \\\n",
       "27269       27269  IL21-related infantile inflammatory bowel disease   \n",
       "28158       28158                         inflammatory bowel disease   \n",
       "29293       29293  inflammatory bowel disease, immunodeficiency, ...   \n",
       "35814       35814                        Crohn ileitis and jejunitis   \n",
       "35815       35815                          small bowel Crohn disease   \n",
       "37784       37784                                      Crohn disease   \n",
       "37785       37785                       ulcerative colitis (disease)   \n",
       "39013       39013  immune dysregulation-inflammatory bowel diseas...   \n",
       "39787       39787  immune dysregulation with inflammatory bowel d...   \n",
       "83770       83770                                    Crohn's colitis   \n",
       "95279       95279                                Crohn jejunoileitis   \n",
       "95280       95280                       gastroduodenal Crohn disease   \n",
       "97088       97088                             perianal Crohn disease   \n",
       "99325       99325                     Crohn disease of the esophagus   \n",
       "99680       99680  immune dysregulation-inflammatory bowel diseas...   \n",
       "99681       99681  inflammatory bowel disease-recurrent sinopulmo...   \n",
       "\n",
       "         node_source                                            node_id  \\\n",
       "27269          MONDO                                              14338   \n",
       "28158  MONDO_grouped  9960_12845_33643_11471_12831_12875_12941_13153...   \n",
       "29293          MONDO                                              32601   \n",
       "35814  MONDO_grouped                                          709_21207   \n",
       "35815          MONDO                                               5539   \n",
       "37784  MONDO_grouped                                          5011_5535   \n",
       "37785          MONDO                                               5101   \n",
       "39013          MONDO                                              16542   \n",
       "39787          MONDO                                              33967   \n",
       "83770          MONDO                                               5532   \n",
       "95279          MONDO                                                708   \n",
       "95280          MONDO                                                710   \n",
       "97088          MONDO                                               5537   \n",
       "99325          MONDO                                              22901   \n",
       "99680          MONDO                                              33968   \n",
       "99681          MONDO                                              33969   \n",
       "\n",
       "      node_type  \n",
       "27269   disease  \n",
       "28158   disease  \n",
       "29293   disease  \n",
       "35814   disease  \n",
       "35815   disease  \n",
       "37784   disease  \n",
       "37785   disease  \n",
       "39013   disease  \n",
       "39787   disease  \n",
       "83770   disease  \n",
       "95279   disease  \n",
       "95280   disease  \n",
       "97088   disease  \n",
       "99325   disease  \n",
       "99680   disease  \n",
       "99681   disease  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Query for nodes whose lowercased name mentions an IBD-related term\n",
    "query_str = 'node_name_lower.str.contains(\"inflammatory bowel disease\")'\n",
    "query_str += ' or node_name_lower.str.contains(\"crohn\")'\n",
    "query_str += ' or node_name_lower.str.contains(\"ulcerative colitis\")'\n",
    "\n",
    "# Get the disease nodes related to IBD\n",
    "ibd_nodes_df = primekg_nodes.copy()\n",
    "ibd_nodes_df[\"node_name_lower\"] = primekg_nodes.node_name.apply(lambda x: x.lower())\n",
    "ibd_nodes_df = ibd_nodes_df[ibd_nodes_df.node_type == \"disease\"].query(query_str, engine='python')\n",
    "# Drop the helper column without mutating the filtered frame in place\n",
    "ibd_nodes_df = ibd_nodes_df.drop(columns=[\"node_name_lower\"])\n",
    "ibd_nodes_df"
   ]
  },
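  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As one possible refinement of the query above (a sketch, not part of the original pipeline), the three `str.contains` clauses can be folded into a single case-insensitive regular expression. The toy rows and the names `toy_nodes`, `ibd_pattern`, and `toy_ibd_nodes` are made up for illustration; only the column names mirror `primekg_nodes`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for primekg_nodes (made-up rows; the real data comes from PrimeKG)\n",
    "toy_nodes = pd.DataFrame({\n",
    "    'node_index': [0, 1, 2, 3],\n",
    "    'node_name': ['Crohn disease', 'ulcerative colitis (disease)', 'asthma', 'CROHN1'],\n",
    "    'node_type': ['disease', 'disease', 'disease', 'gene/protein'],\n",
    "})\n",
    "\n",
    "# One case-insensitive pattern instead of three chained str.contains clauses\n",
    "ibd_pattern = r'inflammatory bowel disease|crohn|ulcerative colitis'\n",
    "is_disease = toy_nodes.node_type == 'disease'\n",
    "matches_ibd = toy_nodes.node_name.str.contains(ibd_pattern, case=False, regex=True)\n",
    "\n",
    "# Keep only disease nodes whose name matches the pattern\n",
    "toy_ibd_nodes = toy_nodes[is_disease & matches_ibd]\n",
    "toy_ibd_nodes"
   ]
  },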
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Disease-Protein Relationship\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the IBD-related disease nodes, we next collect the edges that link them to gene/protein nodes, in either direction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5988787</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5988788</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5988789</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5988790</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5988791</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2712</td>\n",
       "      <td>CASP3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>836</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3304471</th>\n",
       "      <td>34780</td>\n",
       "      <td>IRGM</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>345611</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>35814</td>\n",
       "      <td>Crohn ileitis and jejunitis</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>709_21207</td>\n",
       "      <td>disease</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3310277</th>\n",
       "      <td>5022</td>\n",
       "      <td>ITGAM</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3684</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>35814</td>\n",
       "      <td>Crohn ileitis and jejunitis</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>709_21207</td>\n",
       "      <td>disease</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3313160</th>\n",
       "      <td>2889</td>\n",
       "      <td>TGFB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7040</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>29293</td>\n",
       "      <td>inflammatory bowel disease, immunodeficiency, ...</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>32601</td>\n",
       "      <td>disease</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3314800</th>\n",
       "      <td>9104</td>\n",
       "      <td>INAVA</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>55765</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3314949</th>\n",
       "      <td>34967</td>\n",
       "      <td>IL21</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>59067</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>27269</td>\n",
       "      <td>IL21-related infantile inflammatory bowel disease</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>14338</td>\n",
       "      <td>disease</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>620 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index                     head_name    head_source  \\\n",
       "5988787       37785  ulcerative colitis (disease)          MONDO   \n",
       "5988788       28158    inflammatory bowel disease  MONDO_grouped   \n",
       "5988789       37785  ulcerative colitis (disease)          MONDO   \n",
       "5988790       28158    inflammatory bowel disease  MONDO_grouped   \n",
       "5988791       37785  ulcerative colitis (disease)          MONDO   \n",
       "...             ...                           ...            ...   \n",
       "3304471       34780                          IRGM           NCBI   \n",
       "3310277        5022                         ITGAM           NCBI   \n",
       "3313160        2889                         TGFB1           NCBI   \n",
       "3314800        9104                         INAVA           NCBI   \n",
       "3314949       34967                          IL21           NCBI   \n",
       "\n",
       "                                                   head_id     head_type  \\\n",
       "5988787                                               5101       disease   \n",
       "5988788  9960_12845_33643_11471_12831_12875_12941_13153...       disease   \n",
       "5988789                                               5101       disease   \n",
       "5988790  9960_12845_33643_11471_12831_12875_12941_13153...       disease   \n",
       "5988791                                               5101       disease   \n",
       "...                                                    ...           ...   \n",
       "3304471                                             345611  gene/protein   \n",
       "3310277                                               3684  gene/protein   \n",
       "3313160                                               7040  gene/protein   \n",
       "3314800                                              55765  gene/protein   \n",
       "3314949                                              59067  gene/protein   \n",
       "\n",
       "         tail_index                                          tail_name  \\\n",
       "5988787        7359                                              ADCY7   \n",
       "5988788        7359                                              ADCY7   \n",
       "5988789        2874                                              PRDM1   \n",
       "5988790        2874                                              PRDM1   \n",
       "5988791        2712                                              CASP3   \n",
       "...             ...                                                ...   \n",
       "3304471       35814                        Crohn ileitis and jejunitis   \n",
       "3310277       35814                        Crohn ileitis and jejunitis   \n",
       "3313160       29293  inflammatory bowel disease, immunodeficiency, ...   \n",
       "3314800       28158                         inflammatory bowel disease   \n",
       "3314949       27269  IL21-related infantile inflammatory bowel disease   \n",
       "\n",
       "           tail_source                                            tail_id  \\\n",
       "5988787           NCBI                                                113   \n",
       "5988788           NCBI                                                113   \n",
       "5988789           NCBI                                                639   \n",
       "5988790           NCBI                                                639   \n",
       "5988791           NCBI                                                836   \n",
       "...                ...                                                ...   \n",
       "3304471  MONDO_grouped                                          709_21207   \n",
       "3310277  MONDO_grouped                                          709_21207   \n",
       "3313160          MONDO                                              32601   \n",
       "3314800  MONDO_grouped  9960_12845_33643_11471_12831_12875_12941_13153...   \n",
       "3314949          MONDO                                              14338   \n",
       "\n",
       "            tail_type display_relation         relation  \n",
       "5988787  gene/protein  associated with  disease_protein  \n",
       "5988788  gene/protein  associated with  disease_protein  \n",
       "5988789  gene/protein  associated with  disease_protein  \n",
       "5988790  gene/protein  associated with  disease_protein  \n",
       "5988791  gene/protein  associated with  disease_protein  \n",
       "...               ...              ...              ...  \n",
       "3304471       disease  associated with  disease_protein  \n",
       "3310277       disease  associated with  disease_protein  \n",
       "3313160       disease  associated with  disease_protein  \n",
       "3314800       disease  associated with  disease_protein  \n",
       "3314949       disease  associated with  disease_protein  \n",
       "\n",
       "[620 rows x 12 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD disease_protein edges: an IBD disease node on one end and a gene/protein node on the other,\n",
    "# in either direction. The node_index column is used so the filter does not rely on the DataFrame index.\n",
    "ibd_disease_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.node_index.values)) &\n",
    "                                                        (primekg_edges.tail_type == 'gene/protein')],\n",
    "                                          primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.node_index.values)) &\n",
    "                                                        (primekg_edges.head_type == 'gene/protein')]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_disease_protein_edges_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  144,   179,   192,   279,   417,   625,   657,   729,   772,\n",
       "         989,  1004,  1122,  1299,  1480,  1567,  1618,  1654,  1777,\n",
       "        1990,  2012,  2057,  2078,  2111,  2139,  2329,  2384,  2543,\n",
       "        2643,  2712,  2749,  2874,  2889,  2978,  2983,  3064,  3088,\n",
       "        3233,  3259,  3333,  3414,  3460,  3469,  3474,  3484,  3495,\n",
       "        3578,  3646,  4152,  4162,  4731,  4818,  4968,  4997,  5022,\n",
       "        5195,  5385,  5720,  5805,  5915,  6168,  6175,  6229,  6428,\n",
       "        6661,  7059,  7083,  7359,  7384,  7899,  7958,  8030,  8564,\n",
       "        9104,  9454,  9763, 10113, 10191, 10919, 11103, 11134, 11199,\n",
       "       11523, 11588, 12305, 12663, 12740, 12763, 12816, 13014, 13365,\n",
       "       21972, 22105, 34623, 34776, 34777, 34778, 34779, 34780, 34781,\n",
       "       34814, 34887, 34967, 35156])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collect the unique gene/protein node indices from both ends of the IBD edges\n",
    "ibd_protein_index = np.unique(np.concatenate([ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.head_type == 'gene/protein'].head_index.unique(),\n",
    "                                              ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.tail_type == 'gene/protein'].tail_index.unique()]))\n",
    "ibd_protein_index"
   ]
  },
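  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two-sided filter used above (seed nodes on one end, a given node type on the other) recurs for the other relationships, so it could be wrapped in a small helper. The following is a minimal sketch on a toy edge table; `edges_touching` and all `toy_*` names are hypothetical and not part of the aiagents4pharma library. When `other_type` equals the type of the seed nodes, an edge whose two ends are both seeds is selected twice, so a `drop_duplicates()` call may be needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def edges_touching(edges, node_index, other_type):\n",
    "    # Hypothetical helper: edges with a node from node_index on one end\n",
    "    # and a node of other_type on the other end, in either direction\n",
    "    forward = edges[edges.head_index.isin(node_index) & (edges.tail_type == other_type)]\n",
    "    backward = edges[edges.tail_index.isin(node_index) & (edges.head_type == other_type)]\n",
    "    return pd.concat([forward, backward])\n",
    "\n",
    "# Toy edge table (made-up rows; columns mirror a subset of primekg_edges)\n",
    "toy_edges = pd.DataFrame({\n",
    "    'head_index': [10, 7, 10, 8],\n",
    "    'head_type': ['disease', 'gene/protein', 'disease', 'drug'],\n",
    "    'tail_index': [7, 10, 11, 7],\n",
    "    'tail_type': ['gene/protein', 'disease', 'disease', 'gene/protein'],\n",
    "})\n",
    "\n",
    "# Edges between the seed disease node 10 and any gene/protein node\n",
    "toy_disease_protein = edges_touching(toy_edges, [10], 'gene/protein')\n",
    "\n",
    "# Unique gene/protein indices from both ends of the selected edges\n",
    "toy_protein_index = np.unique(np.concatenate([\n",
    "    toy_disease_protein[toy_disease_protein.head_type == 'gene/protein'].head_index,\n",
    "    toy_disease_protein[toy_disease_protein.tail_type == 'gene/protein'].tail_index,\n",
    "]))\n",
    "toy_protein_index"
   ]
  },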
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Disease-Disease Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The edges between IBD-related disease nodes and other disease nodes can be collected in the same way. This relationship is currently skipped, so the cell below is left commented out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # IBD disease_disease edges \n",
    "# ibd_disease_disease_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & \n",
    "#                                                         (primekg_edges.tail_type == 'disease')],\n",
    "#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & \n",
    "#                                                         (primekg_edges.head_type == 'disease')]])\n",
    "\n",
    "# # Check dataframe\n",
    "# ibd_disease_disease_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Protein-Protein Relationship"
   ]
  },
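  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The edges between the IBD-associated gene/protein nodes and other gene/protein nodes can be collected in the same way. This relationship is also currently skipped, so the cell below is left commented out."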
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # IBD protein_protein edges \n",
    "# ibd_protein_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_protein_index)) & \n",
    "#                                                         (primekg_edges.tail_type == 'gene/protein')],\n",
    "#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_protein_index)) & \n",
    "#                                                         (primekg_edges.head_type == 'gene/protein')]])\n",
    "\n",
    "# # Check dataframe\n",
    "# ibd_protein_protein_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Drug-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we collect the edges that link drug nodes to the IBD-associated gene/protein nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>321759</th>\n",
       "      <td>14118</td>\n",
       "      <td>Rose bengal</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB11182</td>\n",
       "      <td>drug</td>\n",
       "      <td>3233</td>\n",
       "      <td>LTF</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4057</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>carrier</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321763</th>\n",
       "      <td>14038</td>\n",
       "      <td>Fluticasone furoate</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB08906</td>\n",
       "      <td>drug</td>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>carrier</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321764</th>\n",
       "      <td>14555</td>\n",
       "      <td>Technetium Tc-99m tetrofosmin</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB09160</td>\n",
       "      <td>drug</td>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>carrier</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321765</th>\n",
       "      <td>14040</td>\n",
       "      <td>Fluticasone</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB13867</td>\n",
       "      <td>drug</td>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>carrier</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>322373</th>\n",
       "      <td>14060</td>\n",
       "      <td>Levothyroxine</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB00451</td>\n",
       "      <td>drug</td>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>enzyme</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5731639</th>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>14498</td>\n",
       "      <td>Risdiplam</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB15305</td>\n",
       "      <td>drug</td>\n",
       "      <td>transporter</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5731640</th>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>14908</td>\n",
       "      <td>Ubrogepant</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB15328</td>\n",
       "      <td>drug</td>\n",
       "      <td>transporter</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5731641</th>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>14499</td>\n",
       "      <td>Elexacaftor</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB15444</td>\n",
       "      <td>drug</td>\n",
       "      <td>transporter</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5731642</th>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>14050</td>\n",
       "      <td>Prednisolone acetate</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB15566</td>\n",
       "      <td>drug</td>\n",
       "      <td>transporter</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5731643</th>\n",
       "      <td>4152</td>\n",
       "      <td>ABCB1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5243</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>15752</td>\n",
       "      <td>Selpercatinib</td>\n",
       "      <td>DrugBank</td>\n",
       "      <td>DB15685</td>\n",
       "      <td>drug</td>\n",
       "      <td>transporter</td>\n",
       "      <td>drug_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2030 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index                      head_name head_source  head_id  \\\n",
       "321759        14118                    Rose bengal    DrugBank  DB11182   \n",
       "321763        14038            Fluticasone furoate    DrugBank  DB08906   \n",
       "321764        14555  Technetium Tc-99m tetrofosmin    DrugBank  DB09160   \n",
       "321765        14040                    Fluticasone    DrugBank  DB13867   \n",
       "322373        14060                  Levothyroxine    DrugBank  DB00451   \n",
       "...             ...                            ...         ...      ...   \n",
       "5731639        4152                          ABCB1        NCBI     5243   \n",
       "5731640        4152                          ABCB1        NCBI     5243   \n",
       "5731641        4152                          ABCB1        NCBI     5243   \n",
       "5731642        4152                          ABCB1        NCBI     5243   \n",
       "5731643        4152                          ABCB1        NCBI     5243   \n",
       "\n",
       "            head_type  tail_index             tail_name tail_source  tail_id  \\\n",
       "321759           drug        3233                   LTF        NCBI     4057   \n",
       "321763           drug        4152                 ABCB1        NCBI     5243   \n",
       "321764           drug        4152                 ABCB1        NCBI     5243   \n",
       "321765           drug        4152                 ABCB1        NCBI     5243   \n",
       "322373           drug        4152                 ABCB1        NCBI     5243   \n",
       "...               ...         ...                   ...         ...      ...   \n",
       "5731639  gene/protein       14498             Risdiplam    DrugBank  DB15305   \n",
       "5731640  gene/protein       14908            Ubrogepant    DrugBank  DB15328   \n",
       "5731641  gene/protein       14499           Elexacaftor    DrugBank  DB15444   \n",
       "5731642  gene/protein       14050  Prednisolone acetate    DrugBank  DB15566   \n",
       "5731643  gene/protein       15752         Selpercatinib    DrugBank  DB15685   \n",
       "\n",
       "            tail_type display_relation      relation  \n",
       "321759   gene/protein          carrier  drug_protein  \n",
       "321763   gene/protein          carrier  drug_protein  \n",
       "321764   gene/protein          carrier  drug_protein  \n",
       "321765   gene/protein          carrier  drug_protein  \n",
       "322373   gene/protein           enzyme  drug_protein  \n",
       "...               ...              ...           ...  \n",
       "5731639          drug      transporter  drug_protein  \n",
       "5731640          drug      transporter  drug_protein  \n",
       "5731641          drug      transporter  drug_protein  \n",
       "5731642          drug      transporter  drug_protein  \n",
       "5731643          drug      transporter  drug_protein  \n",
       "\n",
       "[2030 rows x 12 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD drug_protein edges\n",
    "ibd_drug_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'drug') & \n",
    "                                                     (primekg_edges.tail_type == 'gene/protein') & \n",
    "                                                     (primekg_edges.tail_index.isin(ibd_protein_index))], \n",
    "                                       primekg_edges[(primekg_edges.tail_type == 'drug') & \n",
    "                                                     (primekg_edges.head_type == 'gene/protein') & \n",
    "                                                     (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_drug_protein_edges_df"
   ]
  },
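  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The drug-protein selection above keeps edges in both orientations (drug as head or drug as tail). As a minimal sketch, assuming the same column names as `primekg_edges`, this pattern could be wrapped in a small helper. The names `select_edges` and `toy_edges` are hypothetical and not part of this notebook or of PrimeKG, and the toy indices are made up for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def select_edges(edges, node_type, protein_index):\n",
    "    # Hypothetical helper: edges between node_type nodes and the selected proteins\n",
    "    # Orientation 1: node_type as head, selected protein as tail\n",
    "    forward = edges[(edges.head_type == node_type) &\n",
    "                    (edges.tail_type == 'gene/protein') &\n",
    "                    (edges.tail_index.isin(protein_index))]\n",
    "    # Orientation 2: selected protein as head, node_type as tail\n",
    "    backward = edges[(edges.tail_type == node_type) &\n",
    "                     (edges.head_type == 'gene/protein') &\n",
    "                     (edges.head_index.isin(protein_index))]\n",
    "    return pd.concat([forward, backward])\n",
    "\n",
    "\n",
    "# Toy edge table with made-up indices (not PrimeKG data)\n",
    "toy_edges = pd.DataFrame({\n",
    "    'head_index': [10, 1, 11, 2],\n",
    "    'head_type': ['drug', 'gene/protein', 'drug', 'gene/protein'],\n",
    "    'tail_index': [1, 10, 2, 11],\n",
    "    'tail_type': ['gene/protein', 'drug', 'gene/protein', 'drug'],\n",
    "})\n",
    "\n",
    "# Keep only the edges that touch protein index 1\n",
    "toy_result = select_edges(toy_edges, 'drug', [1])\n",
    "print(toy_result)\n",
    "```\n",
    "\n",
    "Under these assumptions, `select_edges(primekg_edges, 'drug', ibd_protein_index)` would be expected to select the same rows as the cell above."
   ]
  },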
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pathway-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this case, we will get the records containing the relationships of pathway-protein nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6505784</th>\n",
       "      <td>62703</td>\n",
       "      <td>Adherens junctions interactions</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-418990</td>\n",
       "      <td>pathway</td>\n",
       "      <td>8030</td>\n",
       "      <td>CDH3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1001</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6506102</th>\n",
       "      <td>128079</td>\n",
       "      <td>Regulation of actin dynamics for phagocytic cu...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-2029482</td>\n",
       "      <td>pathway</td>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6506103</th>\n",
       "      <td>128183</td>\n",
       "      <td>EPHB-mediated forward signaling</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-3928662</td>\n",
       "      <td>pathway</td>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6506104</th>\n",
       "      <td>128022</td>\n",
       "      <td>RHO GTPases Activate WASPs and WAVEs</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-5663213</td>\n",
       "      <td>pathway</td>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6506105</th>\n",
       "      <td>62931</td>\n",
       "      <td>Clathrin-mediated endocytosis</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-8856828</td>\n",
       "      <td>pathway</td>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3834665</th>\n",
       "      <td>2543</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>999</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>127731</td>\n",
       "      <td>Integrin cell surface interactions</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-216083</td>\n",
       "      <td>pathway</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3834666</th>\n",
       "      <td>2543</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>999</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>127617</td>\n",
       "      <td>Apoptotic cleavage of cell adhesion  proteins</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-351906</td>\n",
       "      <td>pathway</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3834667</th>\n",
       "      <td>2543</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>999</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>62703</td>\n",
       "      <td>Adherens junctions interactions</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-418990</td>\n",
       "      <td>pathway</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3834668</th>\n",
       "      <td>2543</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>999</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>128018</td>\n",
       "      <td>RHO GTPases activate IQGAPs</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-5626467</td>\n",
       "      <td>pathway</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3834669</th>\n",
       "      <td>2543</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>999</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>129039</td>\n",
       "      <td>InlA-mediated entry of Listeria monocytogenes ...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-8876493</td>\n",
       "      <td>pathway</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>pathway_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1030 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index                                          head_name  \\\n",
       "6505784       62703                    Adherens junctions interactions   \n",
       "6506102      128079  Regulation of actin dynamics for phagocytic cu...   \n",
       "6506103      128183                    EPHB-mediated forward signaling   \n",
       "6506104      128022               RHO GTPases Activate WASPs and WAVEs   \n",
       "6506105       62931                      Clathrin-mediated endocytosis   \n",
       "...             ...                                                ...   \n",
       "3834665        2543                                               CDH1   \n",
       "3834666        2543                                               CDH1   \n",
       "3834667        2543                                               CDH1   \n",
       "3834668        2543                                               CDH1   \n",
       "3834669        2543                                               CDH1   \n",
       "\n",
       "        head_source        head_id     head_type  tail_index  \\\n",
       "6505784    REACTOME   R-HSA-418990       pathway        8030   \n",
       "6506102    REACTOME  R-HSA-2029482       pathway        2139   \n",
       "6506103    REACTOME  R-HSA-3928662       pathway        2139   \n",
       "6506104    REACTOME  R-HSA-5663213       pathway        2139   \n",
       "6506105    REACTOME  R-HSA-8856828       pathway        2139   \n",
       "...             ...            ...           ...         ...   \n",
       "3834665        NCBI            999  gene/protein      127731   \n",
       "3834666        NCBI            999  gene/protein      127617   \n",
       "3834667        NCBI            999  gene/protein       62703   \n",
       "3834668        NCBI            999  gene/protein      128018   \n",
       "3834669        NCBI            999  gene/protein      129039   \n",
       "\n",
       "                                                 tail_name tail_source  \\\n",
       "6505784                                               CDH3        NCBI   \n",
       "6506102                                              ARPC2        NCBI   \n",
       "6506103                                              ARPC2        NCBI   \n",
       "6506104                                              ARPC2        NCBI   \n",
       "6506105                                              ARPC2        NCBI   \n",
       "...                                                    ...         ...   \n",
       "3834665                 Integrin cell surface interactions    REACTOME   \n",
       "3834666      Apoptotic cleavage of cell adhesion  proteins    REACTOME   \n",
       "3834667                    Adherens junctions interactions    REACTOME   \n",
       "3834668                        RHO GTPases activate IQGAPs    REACTOME   \n",
       "3834669  InlA-mediated entry of Listeria monocytogenes ...    REACTOME   \n",
       "\n",
       "               tail_id     tail_type display_relation         relation  \n",
       "6505784           1001  gene/protein   interacts with  pathway_protein  \n",
       "6506102          10109  gene/protein   interacts with  pathway_protein  \n",
       "6506103          10109  gene/protein   interacts with  pathway_protein  \n",
       "6506104          10109  gene/protein   interacts with  pathway_protein  \n",
       "6506105          10109  gene/protein   interacts with  pathway_protein  \n",
       "...                ...           ...              ...              ...  \n",
       "3834665   R-HSA-216083       pathway   interacts with  pathway_protein  \n",
       "3834666   R-HSA-351906       pathway   interacts with  pathway_protein  \n",
       "3834667   R-HSA-418990       pathway   interacts with  pathway_protein  \n",
       "3834668  R-HSA-5626467       pathway   interacts with  pathway_protein  \n",
       "3834669  R-HSA-8876493       pathway   interacts with  pathway_protein  \n",
       "\n",
       "[1030 rows x 12 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD pathway_protein edges \n",
    "ibd_pathway_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'pathway') & \n",
    "                                                        (primekg_edges.tail_type == 'gene/protein') & \n",
    "                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
    "                                          primekg_edges[(primekg_edges.tail_type == 'pathway') & \n",
    "                                                        (primekg_edges.head_type == 'gene/protein') & \n",
    "                                                        (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_pathway_protein_edges_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 62341,  62347,  62348,  62373,  62376,  62394,  62400,  62401,\n",
       "        62404,  62405,  62414,  62448,  62449,  62462,  62465,  62467,\n",
       "        62469,  62472,  62476,  62477,  62483,  62543,  62571,  62573,\n",
       "        62575,  62583,  62588,  62596,  62603,  62606,  62628,  62644,\n",
       "        62651,  62655,  62657,  62672,  62675,  62691,  62692,  62697,\n",
       "        62702,  62703,  62711,  62717,  62733,  62734,  62768,  62770,\n",
       "        62805,  62807,  62836,  62865,  62916,  62925,  62931,  62968,\n",
       "        62976,  62987,  62996,  63041,  63064,  63071,  63076, 127601,\n",
       "       127615, 127616, 127617, 127619, 127620, 127624, 127628, 127629,\n",
       "       127639, 127640, 127649, 127659, 127662, 127682, 127683, 127688,\n",
       "       127691, 127693, 127694, 127695, 127696, 127726, 127727, 127728,\n",
       "       127729, 127730, 127731, 127732, 127733, 127791, 127797, 127810,\n",
       "       127814, 127815, 127833, 127835, 127856, 127858, 127866, 127867,\n",
       "       127869, 127886, 127891, 127908, 127917, 127918, 127921, 127928,\n",
       "       127958, 127960, 127971, 127977, 127999, 128001, 128002, 128003,\n",
       "       128008, 128010, 128015, 128018, 128022, 128025, 128034, 128058,\n",
       "       128065, 128071, 128072, 128073, 128074, 128078, 128079, 128080,\n",
       "       128086, 128111, 128113, 128116, 128117, 128129, 128137, 128138,\n",
       "       128139, 128158, 128165, 128170, 128176, 128183, 128186, 128191,\n",
       "       128198, 128199, 128204, 128208, 128209, 128224, 128227, 128242,\n",
       "       128243, 128244, 128253, 128254, 128270, 128271, 128272, 128273,\n",
       "       128299, 128302, 128341, 128348, 128349, 128350, 128351, 128353,\n",
       "       128360, 128378, 128381, 128393, 128395, 128396, 128399, 128430,\n",
       "       128440, 128453, 128460, 128470, 128472, 128473, 128477, 128478,\n",
       "       128479, 128480, 128481, 128482, 128483, 128484, 128486, 128487,\n",
       "       128497, 128498, 128499, 128500, 128501, 128503, 128527, 128535,\n",
       "       128550, 128593, 128599, 128601, 128602, 128604, 128655, 128677,\n",
       "       128715, 128759, 128766, 128767, 128779, 128781, 128782, 128783,\n",
       "       128784, 128789, 128792, 128801, 128804, 128814, 128815, 128827,\n",
       "       128828, 128829, 128830, 128832, 128835, 128837, 128838, 128841,\n",
       "       128846, 128851, 128852, 128878, 128976, 128977, 128978, 128979,\n",
       "       128980, 128981, 128988, 128990, 129007, 129015, 129016, 129021,\n",
       "       129023, 129035, 129039, 129040, 129042, 129044, 129047, 129048,\n",
       "       129052, 129099, 129110, 129124, 129125, 129126, 129127, 129128,\n",
       "       129131, 129135, 129136, 129139, 129140, 129141, 129148, 129155,\n",
       "       129167, 129181, 129183, 129190, 129195, 129196, 129197, 129198,\n",
       "       129215, 129217, 129238, 129257, 129258, 129259, 129264, 129266,\n",
       "       129289, 129294, 129296, 129302, 129303, 129310, 129355, 129360,\n",
       "       129361, 129365, 129366, 129367])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get unique protein index\n",
    "ibd_pathway_index = np.unique(np.concatenate([ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.head_type == 'pathway'].head_index.unique(),\n",
    "                                              ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.tail_type == 'pathway'].tail_index.unique()]))\n",
    "ibd_pathway_index"
   ]
  },
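  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because a pathway can appear as either the head or the tail of an edge, the indices are gathered from both columns before deduplication. The following is a minimal sketch of that step on a toy table, assuming the same column names as above. The names `unique_node_index` and `toy_edges` are hypothetical and not part of this notebook or of PrimeKG.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def unique_node_index(edges, node_type):\n",
    "    # Hypothetical helper: sorted unique indices of node_type nodes\n",
    "    # Indices where the node type appears as the head of an edge\n",
    "    heads = edges.loc[edges.head_type == node_type, 'head_index']\n",
    "    # Indices where the node type appears as the tail of an edge\n",
    "    tails = edges.loc[edges.tail_type == node_type, 'tail_index']\n",
    "    return np.unique(np.concatenate([heads.to_numpy(), tails.to_numpy()]))\n",
    "\n",
    "\n",
    "# Toy edge table (illustrative rows only)\n",
    "toy_edges = pd.DataFrame({\n",
    "    'head_index': [62703, 8030, 128079],\n",
    "    'head_type': ['pathway', 'gene/protein', 'pathway'],\n",
    "    'tail_index': [8030, 62703, 2139],\n",
    "    'tail_type': ['gene/protein', 'pathway', 'gene/protein'],\n",
    "})\n",
    "\n",
    "toy_pathway_index = unique_node_index(toy_edges, 'pathway')\n",
    "print(toy_pathway_index)\n",
    "```\n",
    "\n",
    "`np.unique` returns the values sorted, which is why the array shown above is in ascending order."
   ]
  },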
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pathway-Pathway Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As well as, a set of records containing the relationships of pathway-pathway nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # # IBD pathway_pathway edges \n",
    "# ibd_pathway_pathway_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_pathway_index)) & \n",
    "#                                                         (primekg_edges.tail_type == 'pathway')],\n",
    "#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_pathway_index)) & \n",
    "#                                                         (primekg_edges.head_type == 'pathway')]])\n",
    "\n",
    "# # Check dataframe\n",
    "# ibd_pathway_pathway_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Bioprocess-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next step is to get the records containing the relationships of biological_process-gene/protein nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6351294</th>\n",
       "      <td>112487</td>\n",
       "      <td>neutrophil degranulation</td>\n",
       "      <td>GO</td>\n",
       "      <td>43312</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>1990</td>\n",
       "      <td>FCGR2A</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>2212</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6351300</th>\n",
       "      <td>112487</td>\n",
       "      <td>neutrophil degranulation</td>\n",
       "      <td>GO</td>\n",
       "      <td>43312</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>3333</td>\n",
       "      <td>FPR2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>2358</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6351340</th>\n",
       "      <td>112487</td>\n",
       "      <td>neutrophil degranulation</td>\n",
       "      <td>GO</td>\n",
       "      <td>43312</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>2012</td>\n",
       "      <td>CXCR1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3577</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6351341</th>\n",
       "      <td>112487</td>\n",
       "      <td>neutrophil degranulation</td>\n",
       "      <td>GO</td>\n",
       "      <td>43312</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>3064</td>\n",
       "      <td>CXCR2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3579</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6351346</th>\n",
       "      <td>112487</td>\n",
       "      <td>neutrophil degranulation</td>\n",
       "      <td>GO</td>\n",
       "      <td>43312</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>5022</td>\n",
       "      <td>ITGAM</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3684</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3781707</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>51599</td>\n",
       "      <td>negative regulation of peroxidase activity</td>\n",
       "      <td>GO</td>\n",
       "      <td>2000469</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3781708</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>52358</td>\n",
       "      <td>regulation of kidney size</td>\n",
       "      <td>GO</td>\n",
       "      <td>35564</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3781710</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>109343</td>\n",
       "      <td>negative regulation of thioredoxin peroxidase ...</td>\n",
       "      <td>GO</td>\n",
       "      <td>1903125</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3781811</th>\n",
       "      <td>22105</td>\n",
       "      <td>GPBAR1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>151306</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>105254</td>\n",
       "      <td>cell surface bile acid receptor signaling pathway</td>\n",
       "      <td>GO</td>\n",
       "      <td>38184</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3781824</th>\n",
       "      <td>34779</td>\n",
       "      <td>NKX2-3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>159296</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>100699</td>\n",
       "      <td>post-embryonic digestive tract morphogenesis</td>\n",
       "      <td>GO</td>\n",
       "      <td>48621</td>\n",
       "      <td>biological_process</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>bioprocess_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6300 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index                 head_name head_source head_id  \\\n",
       "6351294      112487  neutrophil degranulation          GO   43312   \n",
       "6351300      112487  neutrophil degranulation          GO   43312   \n",
       "6351340      112487  neutrophil degranulation          GO   43312   \n",
       "6351341      112487  neutrophil degranulation          GO   43312   \n",
       "6351346      112487  neutrophil degranulation          GO   43312   \n",
       "...             ...                       ...         ...     ...   \n",
       "3781707        2111                     LRRK2        NCBI  120892   \n",
       "3781708        2111                     LRRK2        NCBI  120892   \n",
       "3781710        2111                     LRRK2        NCBI  120892   \n",
       "3781811       22105                    GPBAR1        NCBI  151306   \n",
       "3781824       34779                    NKX2-3        NCBI  159296   \n",
       "\n",
       "                  head_type  tail_index  \\\n",
       "6351294  biological_process        1990   \n",
       "6351300  biological_process        3333   \n",
       "6351340  biological_process        2012   \n",
       "6351341  biological_process        3064   \n",
       "6351346  biological_process        5022   \n",
       "...                     ...         ...   \n",
       "3781707        gene/protein       51599   \n",
       "3781708        gene/protein       52358   \n",
       "3781710        gene/protein      109343   \n",
       "3781811        gene/protein      105254   \n",
       "3781824        gene/protein      100699   \n",
       "\n",
       "                                                 tail_name tail_source  \\\n",
       "6351294                                             FCGR2A        NCBI   \n",
       "6351300                                               FPR2        NCBI   \n",
       "6351340                                              CXCR1        NCBI   \n",
       "6351341                                              CXCR2        NCBI   \n",
       "6351346                                              ITGAM        NCBI   \n",
       "...                                                    ...         ...   \n",
       "3781707         negative regulation of peroxidase activity          GO   \n",
       "3781708                          regulation of kidney size          GO   \n",
       "3781710  negative regulation of thioredoxin peroxidase ...          GO   \n",
       "3781811  cell surface bile acid receptor signaling pathway          GO   \n",
       "3781824       post-embryonic digestive tract morphogenesis          GO   \n",
       "\n",
       "         tail_id           tail_type display_relation            relation  \n",
       "6351294     2212        gene/protein   interacts with  bioprocess_protein  \n",
       "6351300     2358        gene/protein   interacts with  bioprocess_protein  \n",
       "6351340     3577        gene/protein   interacts with  bioprocess_protein  \n",
       "6351341     3579        gene/protein   interacts with  bioprocess_protein  \n",
       "6351346     3684        gene/protein   interacts with  bioprocess_protein  \n",
       "...          ...                 ...              ...                 ...  \n",
       "3781707  2000469  biological_process   interacts with  bioprocess_protein  \n",
       "3781708    35564  biological_process   interacts with  bioprocess_protein  \n",
       "3781710  1903125  biological_process   interacts with  bioprocess_protein  \n",
       "3781811    38184  biological_process   interacts with  bioprocess_protein  \n",
       "3781824    48621  biological_process   interacts with  bioprocess_protein  \n",
       "\n",
       "[6300 rows x 12 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD bioprocess_protein edges \n",
    "ibd_bioprocess_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'biological_process') & \n",
    "                                                           (primekg_edges.tail_type == 'gene/protein') & \n",
    "                                                           (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
    "                                             primekg_edges[(primekg_edges.tail_type == 'biological_process') & \n",
    "                                                           (primekg_edges.head_type == 'gene/protein') & \n",
    "                                                           (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_bioprocess_protein_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### MolFunc-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we would like to get molecular_function-gene/protein relationships."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6198264</th>\n",
       "      <td>54035</td>\n",
       "      <td>interleukin-1 binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>19966</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>1654</td>\n",
       "      <td>IL1R2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7850</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6198359</th>\n",
       "      <td>54290</td>\n",
       "      <td>enzyme binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>19899</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>3578</td>\n",
       "      <td>ECM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1893</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6198366</th>\n",
       "      <td>54290</td>\n",
       "      <td>enzyme binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>19899</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>2057</td>\n",
       "      <td>FN1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>2335</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6198442</th>\n",
       "      <td>54290</td>\n",
       "      <td>enzyme binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>19899</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>989</td>\n",
       "      <td>PPARG</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5468</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6198462</th>\n",
       "      <td>54290</td>\n",
       "      <td>enzyme binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>19899</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>772</td>\n",
       "      <td>RELA</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>5970</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3553533</th>\n",
       "      <td>6229</td>\n",
       "      <td>NOD2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>64127</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>122117</td>\n",
       "      <td>muramyl dipeptide binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>32500</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3553770</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>115199</td>\n",
       "      <td>GTP-dependent protein kinase activity</td>\n",
       "      <td>GO</td>\n",
       "      <td>34211</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3553771</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>118105</td>\n",
       "      <td>beta-catenin destruction complex binding</td>\n",
       "      <td>GO</td>\n",
       "      <td>1904713</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3553773</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>119847</td>\n",
       "      <td>peroxidase inhibitor activity</td>\n",
       "      <td>GO</td>\n",
       "      <td>36479</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3553832</th>\n",
       "      <td>22105</td>\n",
       "      <td>GPBAR1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>151306</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>116806</td>\n",
       "      <td>G protein-coupled bile acid receptor activity</td>\n",
       "      <td>GO</td>\n",
       "      <td>38182</td>\n",
       "      <td>molecular_function</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>molfunc_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1466 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index              head_name head_source head_id  \\\n",
       "6198264       54035  interleukin-1 binding          GO   19966   \n",
       "6198359       54290         enzyme binding          GO   19899   \n",
       "6198366       54290         enzyme binding          GO   19899   \n",
       "6198442       54290         enzyme binding          GO   19899   \n",
       "6198462       54290         enzyme binding          GO   19899   \n",
       "...             ...                    ...         ...     ...   \n",
       "3553533        6229                   NOD2        NCBI   64127   \n",
       "3553770        2111                  LRRK2        NCBI  120892   \n",
       "3553771        2111                  LRRK2        NCBI  120892   \n",
       "3553773        2111                  LRRK2        NCBI  120892   \n",
       "3553832       22105                 GPBAR1        NCBI  151306   \n",
       "\n",
       "                  head_type  tail_index  \\\n",
       "6198264  molecular_function        1654   \n",
       "6198359  molecular_function        3578   \n",
       "6198366  molecular_function        2057   \n",
       "6198442  molecular_function         989   \n",
       "6198462  molecular_function         772   \n",
       "...                     ...         ...   \n",
       "3553533        gene/protein      122117   \n",
       "3553770        gene/protein      115199   \n",
       "3553771        gene/protein      118105   \n",
       "3553773        gene/protein      119847   \n",
       "3553832        gene/protein      116806   \n",
       "\n",
       "                                             tail_name tail_source  tail_id  \\\n",
       "6198264                                          IL1R2        NCBI     7850   \n",
       "6198359                                           ECM1        NCBI     1893   \n",
       "6198366                                            FN1        NCBI     2335   \n",
       "6198442                                          PPARG        NCBI     5468   \n",
       "6198462                                           RELA        NCBI     5970   \n",
       "...                                                ...         ...      ...   \n",
       "3553533                      muramyl dipeptide binding          GO    32500   \n",
       "3553770          GTP-dependent protein kinase activity          GO    34211   \n",
       "3553771       beta-catenin destruction complex binding          GO  1904713   \n",
       "3553773                  peroxidase inhibitor activity          GO    36479   \n",
       "3553832  G protein-coupled bile acid receptor activity          GO    38182   \n",
       "\n",
       "                  tail_type display_relation         relation  \n",
       "6198264        gene/protein   interacts with  molfunc_protein  \n",
       "6198359        gene/protein   interacts with  molfunc_protein  \n",
       "6198366        gene/protein   interacts with  molfunc_protein  \n",
       "6198442        gene/protein   interacts with  molfunc_protein  \n",
       "6198462        gene/protein   interacts with  molfunc_protein  \n",
       "...                     ...              ...              ...  \n",
       "3553533  molecular_function   interacts with  molfunc_protein  \n",
       "3553770  molecular_function   interacts with  molfunc_protein  \n",
       "3553771  molecular_function   interacts with  molfunc_protein  \n",
       "3553773  molecular_function   interacts with  molfunc_protein  \n",
       "3553832  molecular_function   interacts with  molfunc_protein  \n",
       "\n",
       "[1466 rows x 12 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD molfunc_protein edges \n",
    "ibd_molfunc_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'molecular_function') & \n",
    "                                                        (primekg_edges.tail_type == 'gene/protein') & \n",
    "                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
    "                                           primekg_edges[(primekg_edges.tail_type == 'molecular_function') & \n",
    "                                                         (primekg_edges.head_type == 'gene/protein') & \n",
    "                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_molfunc_protein_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### CellComp-Protein Relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we retrieve the edges that connect cellular_component nodes to gene/protein nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6267848</th>\n",
       "      <td>126078</td>\n",
       "      <td>ficolin-1-rich granule lumen</td>\n",
       "      <td>GO</td>\n",
       "      <td>1904813</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>3474</td>\n",
       "      <td>MMP9</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4318</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6268120</th>\n",
       "      <td>124245</td>\n",
       "      <td>extracellular space</td>\n",
       "      <td>GO</td>\n",
       "      <td>5615</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>2384</td>\n",
       "      <td>CRP</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1401</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6268163</th>\n",
       "      <td>124245</td>\n",
       "      <td>extracellular space</td>\n",
       "      <td>GO</td>\n",
       "      <td>5615</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>5805</td>\n",
       "      <td>DEFA5</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1670</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6268164</th>\n",
       "      <td>124245</td>\n",
       "      <td>extracellular space</td>\n",
       "      <td>GO</td>\n",
       "      <td>5615</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>657</td>\n",
       "      <td>DEFA6</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1671</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6268173</th>\n",
       "      <td>124245</td>\n",
       "      <td>extracellular space</td>\n",
       "      <td>GO</td>\n",
       "      <td>5615</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>3578</td>\n",
       "      <td>ECM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>1893</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3636708</th>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126261</td>\n",
       "      <td>muscle cell projection membrane</td>\n",
       "      <td>GO</td>\n",
       "      <td>36195</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3636819</th>\n",
       "      <td>9763</td>\n",
       "      <td>ORMDL3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>94103</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126815</td>\n",
       "      <td>SPOTS complex</td>\n",
       "      <td>GO</td>\n",
       "      <td>35339</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3637211</th>\n",
       "      <td>6661</td>\n",
       "      <td>ATG16L1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>55054</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126444</td>\n",
       "      <td>vacuole-isolation membrane contact site</td>\n",
       "      <td>GO</td>\n",
       "      <td>120095</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3637234</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126938</td>\n",
       "      <td>cytoplasmic side of mitochondrial outer membrane</td>\n",
       "      <td>GO</td>\n",
       "      <td>32473</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3637328</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>125942</td>\n",
       "      <td>caveola neck</td>\n",
       "      <td>GO</td>\n",
       "      <td>99400</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1348 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         head_index                     head_name head_source  head_id  \\\n",
       "6267848      126078  ficolin-1-rich granule lumen          GO  1904813   \n",
       "6268120      124245           extracellular space          GO     5615   \n",
       "6268163      124245           extracellular space          GO     5615   \n",
       "6268164      124245           extracellular space          GO     5615   \n",
       "6268173      124245           extracellular space          GO     5615   \n",
       "...             ...                           ...         ...      ...   \n",
       "3636708        2139                         ARPC2        NCBI    10109   \n",
       "3636819        9763                        ORMDL3        NCBI    94103   \n",
       "3637211        6661                       ATG16L1        NCBI    55054   \n",
       "3637234        2111                         LRRK2        NCBI   120892   \n",
       "3637328        2111                         LRRK2        NCBI   120892   \n",
       "\n",
       "                  head_type  tail_index  \\\n",
       "6267848  cellular_component        3474   \n",
       "6268120  cellular_component        2384   \n",
       "6268163  cellular_component        5805   \n",
       "6268164  cellular_component         657   \n",
       "6268173  cellular_component        3578   \n",
       "...                     ...         ...   \n",
       "3636708        gene/protein      126261   \n",
       "3636819        gene/protein      126815   \n",
       "3637211        gene/protein      126444   \n",
       "3637234        gene/protein      126938   \n",
       "3637328        gene/protein      125942   \n",
       "\n",
       "                                                tail_name tail_source tail_id  \\\n",
       "6267848                                              MMP9        NCBI    4318   \n",
       "6268120                                               CRP        NCBI    1401   \n",
       "6268163                                             DEFA5        NCBI    1670   \n",
       "6268164                                             DEFA6        NCBI    1671   \n",
       "6268173                                              ECM1        NCBI    1893   \n",
       "...                                                   ...         ...     ...   \n",
       "3636708                   muscle cell projection membrane          GO   36195   \n",
       "3636819                                     SPOTS complex          GO   35339   \n",
       "3637211           vacuole-isolation membrane contact site          GO  120095   \n",
       "3637234  cytoplasmic side of mitochondrial outer membrane          GO   32473   \n",
       "3637328                                      caveola neck          GO   99400   \n",
       "\n",
       "                  tail_type display_relation          relation  \n",
       "6267848        gene/protein   interacts with  cellcomp_protein  \n",
       "6268120        gene/protein   interacts with  cellcomp_protein  \n",
       "6268163        gene/protein   interacts with  cellcomp_protein  \n",
       "6268164        gene/protein   interacts with  cellcomp_protein  \n",
       "6268173        gene/protein   interacts with  cellcomp_protein  \n",
       "...                     ...              ...               ...  \n",
       "3636708  cellular_component   interacts with  cellcomp_protein  \n",
       "3636819  cellular_component   interacts with  cellcomp_protein  \n",
       "3637211  cellular_component   interacts with  cellcomp_protein  \n",
       "3637234  cellular_component   interacts with  cellcomp_protein  \n",
       "3637328  cellular_component   interacts with  cellcomp_protein  \n",
       "\n",
       "[1348 rows x 12 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IBD cellcomp_protein edges \n",
    "ibd_cellcomp_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'cellular_component') & \n",
    "                                                        (primekg_edges.tail_type == 'gene/protein') & \n",
    "                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
    "                                           primekg_edges[(primekg_edges.tail_type == 'cellular_component') & \n",
    "                                                         (primekg_edges.head_type == 'gene/protein') & \n",
    "                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
    "\n",
    "# Check dataframe\n",
    "ibd_cellcomp_protein_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Merge all dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have extracted every edge type of interest, we can merge them into a single dataframe representing the IBD subgraph derived from PrimeKG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "      <th>edge_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2712</td>\n",
       "      <td>CASP3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>836</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12747</th>\n",
       "      <td>2139</td>\n",
       "      <td>ARPC2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>10109</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126261</td>\n",
       "      <td>muscle cell projection membrane</td>\n",
       "      <td>GO</td>\n",
       "      <td>36195</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "      <td>(gene/protein, interacts with, cellular_compon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12748</th>\n",
       "      <td>9763</td>\n",
       "      <td>ORMDL3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>94103</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126815</td>\n",
       "      <td>SPOTS complex</td>\n",
       "      <td>GO</td>\n",
       "      <td>35339</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "      <td>(gene/protein, interacts with, cellular_compon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12749</th>\n",
       "      <td>6661</td>\n",
       "      <td>ATG16L1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>55054</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126444</td>\n",
       "      <td>vacuole-isolation membrane contact site</td>\n",
       "      <td>GO</td>\n",
       "      <td>120095</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "      <td>(gene/protein, interacts with, cellular_compon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12750</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>126938</td>\n",
       "      <td>cytoplasmic side of mitochondrial outer membrane</td>\n",
       "      <td>GO</td>\n",
       "      <td>32473</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "      <td>(gene/protein, interacts with, cellular_compon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12751</th>\n",
       "      <td>2111</td>\n",
       "      <td>LRRK2</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>120892</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>125942</td>\n",
       "      <td>caveola neck</td>\n",
       "      <td>GO</td>\n",
       "      <td>99400</td>\n",
       "      <td>cellular_component</td>\n",
       "      <td>interacts with</td>\n",
       "      <td>cellcomp_protein</td>\n",
       "      <td>(gene/protein, interacts with, cellular_compon...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>12752 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       head_index                     head_name    head_source  \\\n",
       "0           37785  ulcerative colitis (disease)          MONDO   \n",
       "1           28158    inflammatory bowel disease  MONDO_grouped   \n",
       "2           37785  ulcerative colitis (disease)          MONDO   \n",
       "3           28158    inflammatory bowel disease  MONDO_grouped   \n",
       "4           37785  ulcerative colitis (disease)          MONDO   \n",
       "...           ...                           ...            ...   \n",
       "12747        2139                         ARPC2           NCBI   \n",
       "12748        9763                        ORMDL3           NCBI   \n",
       "12749        6661                       ATG16L1           NCBI   \n",
       "12750        2111                         LRRK2           NCBI   \n",
       "12751        2111                         LRRK2           NCBI   \n",
       "\n",
       "                                                 head_id     head_type  \\\n",
       "0                                                   5101       disease   \n",
       "1      9960_12845_33643_11471_12831_12875_12941_13153...       disease   \n",
       "2                                                   5101       disease   \n",
       "3      9960_12845_33643_11471_12831_12875_12941_13153...       disease   \n",
       "4                                                   5101       disease   \n",
       "...                                                  ...           ...   \n",
       "12747                                              10109  gene/protein   \n",
       "12748                                              94103  gene/protein   \n",
       "12749                                              55054  gene/protein   \n",
       "12750                                             120892  gene/protein   \n",
       "12751                                             120892  gene/protein   \n",
       "\n",
       "       tail_index                                         tail_name  \\\n",
       "0            7359                                             ADCY7   \n",
       "1            7359                                             ADCY7   \n",
       "2            2874                                             PRDM1   \n",
       "3            2874                                             PRDM1   \n",
       "4            2712                                             CASP3   \n",
       "...           ...                                               ...   \n",
       "12747      126261                   muscle cell projection membrane   \n",
       "12748      126815                                     SPOTS complex   \n",
       "12749      126444           vacuole-isolation membrane contact site   \n",
       "12750      126938  cytoplasmic side of mitochondrial outer membrane   \n",
       "12751      125942                                      caveola neck   \n",
       "\n",
       "      tail_source tail_id           tail_type display_relation  \\\n",
       "0            NCBI     113        gene/protein  associated with   \n",
       "1            NCBI     113        gene/protein  associated with   \n",
       "2            NCBI     639        gene/protein  associated with   \n",
       "3            NCBI     639        gene/protein  associated with   \n",
       "4            NCBI     836        gene/protein  associated with   \n",
       "...           ...     ...                 ...              ...   \n",
       "12747          GO   36195  cellular_component   interacts with   \n",
       "12748          GO   35339  cellular_component   interacts with   \n",
       "12749          GO  120095  cellular_component   interacts with   \n",
       "12750          GO   32473  cellular_component   interacts with   \n",
       "12751          GO   99400  cellular_component   interacts with   \n",
       "\n",
       "               relation                                          edge_type  \n",
       "0       disease_protein           (disease, associated with, gene/protein)  \n",
       "1       disease_protein           (disease, associated with, gene/protein)  \n",
       "2       disease_protein           (disease, associated with, gene/protein)  \n",
       "3       disease_protein           (disease, associated with, gene/protein)  \n",
       "4       disease_protein           (disease, associated with, gene/protein)  \n",
       "...                 ...                                                ...  \n",
       "12747  cellcomp_protein  (gene/protein, interacts with, cellular_compon...  \n",
       "12748  cellcomp_protein  (gene/protein, interacts with, cellular_compon...  \n",
       "12749  cellcomp_protein  (gene/protein, interacts with, cellular_compon...  \n",
       "12750  cellcomp_protein  (gene/protein, interacts with, cellular_compon...  \n",
       "12751  cellcomp_protein  (gene/protein, interacts with, cellular_compon...  \n",
       "\n",
       "[12752 rows x 13 columns]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# PrimeKG edges related to IBD\n",
    "primekg_ibd_edges_df = pd.concat([ibd_disease_protein_edges_df,\n",
    "                                #   ibd_disease_disease_edges_df,\n",
    "                                #   ibd_protein_protein_edges_df,\n",
    "                                  ibd_drug_protein_edges_df,\n",
    "                                  ibd_pathway_protein_edges_df,\n",
    "                                #   ibd_pathway_pathway_edges_df,\n",
    "                                  ibd_bioprocess_protein_edges_df,\n",
    "                                  ibd_molfunc_protein_edges_df,\n",
    "                                  ibd_cellcomp_protein_edges_df])\n",
    "primekg_ibd_edges_df[\"edge_type\"] = primekg_ibd_edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)\n",
    "primekg_ibd_edges_df.drop_duplicates(subset=['head_index', 'tail_index'], inplace=True)\n",
    "primekg_ibd_edges_df.reset_index(drop=True, inplace=True)\n",
    "primekg_ibd_edges_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get a dataframe of nodes based on the above edge dataframe as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_index</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_source</th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>144</td>\n",
       "      <td>SMAD3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4088</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>179</td>\n",
       "      <td>IL10RB</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3588</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>192</td>\n",
       "      <td>GNA12</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>2768</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279</th>\n",
       "      <td>279</td>\n",
       "      <td>HNF4A</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3172</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>417</th>\n",
       "      <td>417</td>\n",
       "      <td>VCAM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7412</td>\n",
       "      <td>gene/protein</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129360</th>\n",
       "      <td>129360</td>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-975163</td>\n",
       "      <td>pathway</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129361</th>\n",
       "      <td>129361</td>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-975110</td>\n",
       "      <td>pathway</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129365</th>\n",
       "      <td>129365</td>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983168</td>\n",
       "      <td>pathway</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129366</th>\n",
       "      <td>129366</td>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983170</td>\n",
       "      <td>pathway</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129367</th>\n",
       "      <td>129367</td>\n",
       "      <td>Kinesins</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983189</td>\n",
       "      <td>pathway</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3426 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        node_index                                          node_name  \\\n",
       "144            144                                              SMAD3   \n",
       "179            179                                             IL10RB   \n",
       "192            192                                              GNA12   \n",
       "279            279                                              HNF4A   \n",
       "417            417                                              VCAM1   \n",
       "...            ...                                                ...   \n",
       "129360      129360  IRAK2 mediated activation of TAK1 complex upon...   \n",
       "129361      129361  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...   \n",
       "129365      129365  Antigen processing: Ubiquitination & Proteasom...   \n",
       "129366      129366  Antigen Presentation: Folding, assembly and pe...   \n",
       "129367      129367                                           Kinesins   \n",
       "\n",
       "       node_source       node_id     node_type  \n",
       "144           NCBI          4088  gene/protein  \n",
       "179           NCBI          3588  gene/protein  \n",
       "192           NCBI          2768  gene/protein  \n",
       "279           NCBI          3172  gene/protein  \n",
       "417           NCBI          7412  gene/protein  \n",
       "...            ...           ...           ...  \n",
       "129360    REACTOME  R-HSA-975163       pathway  \n",
       "129361    REACTOME  R-HSA-975110       pathway  \n",
       "129365    REACTOME  R-HSA-983168       pathway  \n",
       "129366    REACTOME  R-HSA-983170       pathway  \n",
       "129367    REACTOME  R-HSA-983189       pathway  \n",
       "\n",
       "[3426 rows x 5 columns]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# PrimeKG nodes related to IBD\n",
    "# Take an explicit copy so that columns added later do not trigger pandas' SettingWithCopyWarning\n",
    "primekg_ibd_nodes_df = primekg_nodes[primekg_nodes.index.isin(np.unique(np.hstack([primekg_ibd_edges_df.head_index.unique(), \n",
    "                                                                                   primekg_ibd_edges_df.tail_index.unique()])))].copy()\n",
    "primekg_ibd_nodes_df"
   ]
  },
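  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check, the minimal sketch below verifies that every edge endpoint has a matching row in the node dataframe. It assumes, as the filter above does, that the index of `primekg_nodes` coincides with `node_index`; the variable names `endpoint_indices` and `missing_indices` are introduced here for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect every node index referenced by an edge (as head or tail)\n",
    "endpoint_indices = set(primekg_ibd_edges_df.head_index) | set(primekg_ibd_edges_df.tail_index)\n",
    "\n",
    "# Indices used by edges but absent from the node dataframe (expected to be empty)\n",
    "missing_indices = endpoint_indices - set(primekg_ibd_nodes_df.index)\n",
    "print(f\"Edge endpoints missing from the node dataframe: {len(missing_indices)}\")"
   ]
  },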
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can store the IBD-related nodes and edges as two parquet files for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store the IBD-related nodes and edges\n",
    "local_dir = '../../../../data/primekg_ibd/'\n",
    "os.makedirs(local_dir, exist_ok=True)\n",
    "primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes.parquet'), compression='gzip', index=False)\n",
    "primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges.parquet'), compression='gzip', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of IBD-related nodes: 3426\n",
      "Number of IBD-related edges: 12752\n"
     ]
    }
   ],
   "source": [
    "# Statistics over the IBD-related nodes and edges\n",
    "print(f\"Number of IBD-related nodes: {primekg_ibd_nodes_df.shape[0]}\")\n",
    "print(f\"Number of IBD-related edges: {primekg_ibd_edges_df.shape[0]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "node_type\n",
       "biological_process    1642\n",
       "cellular_component     207\n",
       "disease                  7\n",
       "drug                   835\n",
       "gene/protein           103\n",
       "molecular_function     324\n",
       "pathway                308\n",
       "dtype: int64"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the number of nodes by node type\n",
    "primekg_ibd_nodes_df.groupby('node_type').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "relation            display_relation\n",
       "bioprocess_protein  interacts with      6300\n",
       "cellcomp_protein    interacts with      1348\n",
       "disease_protein     associated with      620\n",
       "drug_protein        carrier                8\n",
       "                    enzyme                64\n",
       "                    target               776\n",
       "                    transporter         1140\n",
       "molfunc_protein     interacts with      1466\n",
       "pathway_protein     interacts with      1030\n",
       "dtype: int64"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the number of edges by relation and display_relation\n",
    "primekg_ibd_edges_df.groupby(['relation','display_relation']).size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "edge_type\n",
       "(biological_process, interacts with, gene/protein)    3150\n",
       "(cellular_component, interacts with, gene/protein)     674\n",
       "(disease, associated with, gene/protein)               310\n",
       "(drug, carrier, gene/protein)                            4\n",
       "(drug, enzyme, gene/protein)                            32\n",
       "(drug, target, gene/protein)                           388\n",
       "(drug, transporter, gene/protein)                      570\n",
       "(gene/protein, associated with, disease)               310\n",
       "(gene/protein, carrier, drug)                            4\n",
       "(gene/protein, enzyme, drug)                            32\n",
       "(gene/protein, interacts with, biological_process)    3150\n",
       "(gene/protein, interacts with, cellular_component)     674\n",
       "(gene/protein, interacts with, molecular_function)     733\n",
       "(gene/protein, interacts with, pathway)                515\n",
       "(gene/protein, target, drug)                           388\n",
       "(gene/protein, transporter, drug)                      570\n",
       "(molecular_function, interacts with, gene/protein)     733\n",
       "(pathway, interacts with, gene/protein)                515\n",
       "dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the number of edges by edge type\n",
    "primekg_ibd_edges_df.groupby(['edge_type']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enrichment (textual only, as of now)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this point onwards, we will use the pre-processed IBD-related nodes and edges to create a set of graph formats.\n",
    "\n",
    "Before that, we should perform enrichment and embedding over the IBD-related nodes and edges.\n",
    "\n",
    "As of now, we only perform textual enrichment of the records.\n",
    "\n",
    "Since StarkQA provides most of the information about the nodes, we will use it to retrieve the details of the IBD-related nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading StarkQAPrimeKG dataset...\n",
      "../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.\n",
      "Loading StarkQAPrimeKG embeddings...\n"
     ]
    }
   ],
   "source": [
    "# Define starkqa primekg data by providing a local directory where the data is stored\n",
    "starkqa_data = StarkQAPrimeKG(local_dir=\"../../../../data/starkqa_primekg/\")\n",
    "\n",
    "# Invoke a method to load the data\n",
    "starkqa_data.load_data()\n",
    "\n",
    "# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information\n",
    "# starkqa_df = starkqa_data.get_starkqa()\n",
    "starkqa_node_info = starkqa_data.get_starkqa_node_info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that not all nodes in StarkQA-PrimeKG have additional information.\n",
    "\n",
    "For such nodes, we fall back to a basic text enrichment that simply states the node name and type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "def do_enrichment_text(data, starkqa_node_info):\n",
    "    \"\"\"\n",
    "    Enrich the node with additional textual information from StarkQA-PrimeKG.\n",
    "\n",
    "    Args:\n",
    "        data (dict): The node data from PrimeKG\n",
    "        starkqa_node_info (dict): The node information from StarkQA-PrimeKG\n",
    "\n",
    "    Returns:\n",
    "        str: The node name and type, followed by any additional details found\n",
    "    \"\"\"\n",
    "    def clean(value):\n",
    "        # Render a detail field as text, mapping a missing value (NaN) to an empty string.\n",
    "        # Comparing the whole string avoids deleting 'nan' inside words such as 'maintenance'.\n",
    "        text = str(value)\n",
    "        return '' if text == 'nan' else text\n",
    "\n",
    "    # Basic textual enrichment of the node\n",
    "    enriched_node = f\"{data['node_name']} belongs to {data['node_type']} category. \"\n",
    "\n",
    "    # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which\n",
    "    # has additional information in the node_info of StarkQA-PrimeKG\n",
    "    added_info = ''\n",
    "    if data['node_type'] == 'gene/protein':\n",
    "        added_info = starkqa_node_info['details']['summary'] if 'summary' in starkqa_node_info['details'] else ''\n",
    "    elif data['node_type'] == 'drug':\n",
    "        added_info = ' '.join([clean(starkqa_node_info['details'][key])\n",
    "                               for key in ('description', 'mechanism_of_action', 'protein_binding',\n",
    "                                           'pharmacodynamics', 'indication')])\n",
    "    elif data['node_type'] == 'disease':\n",
    "        added_info = ' '.join([clean(starkqa_node_info['details'][key])\n",
    "                               for key in ('mondo_definition', 'mayo_symptoms', 'mayo_causes')])\n",
    "    elif data['node_type'] == 'pathway' and 'details' in starkqa_node_info:\n",
    "        added_info = (f\"This pathway is found in {starkqa_node_info['details']['speciesName']}. \" +\n",
    "                      ' '.join([x['text'] for x in starkqa_node_info['details']['summation']]))\n",
    "\n",
    "    # Append the additional information for enrichment\n",
    "    enriched_node += added_info\n",
    "    return enriched_node"
   ]
  },
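  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the fallback described above, here is a minimal sketch that calls `do_enrichment_text` on a node whose type has no additional details. The dictionaries `toy_node` and `toy_node_info` are hypothetical stand-ins, not records from PrimeKG or StarkQA-PrimeKG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical node whose type is not enriched with StarkQA-PrimeKG details\n",
    "toy_node = {'node_name': 'example compartment', 'node_type': 'cellular_component'}\n",
    "toy_node_info = {}  # never read for this node type\n",
    "\n",
    "# Only the basic sentence with the node name and type is produced\n",
    "print(do_enrichment_text(toy_node, toy_node_info))"
   ]
  },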
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By using the above function, we can enrich the node information from PrimeKG with additional information from StarkQA-PrimeKG as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_64662/2873064541.py:3: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_index</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_source</th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_type</th>\n",
       "      <th>enriched_node</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>144</td>\n",
       "      <td>SMAD3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>4088</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>179</td>\n",
       "      <td>IL10RB</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3588</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>IL10RB belongs to gene/protein category. The p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>192</td>\n",
       "      <td>GNA12</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>2768</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279</th>\n",
       "      <td>279</td>\n",
       "      <td>HNF4A</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>3172</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>417</th>\n",
       "      <td>417</td>\n",
       "      <td>VCAM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>7412</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129360</th>\n",
       "      <td>129360</td>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-975163</td>\n",
       "      <td>pathway</td>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129361</th>\n",
       "      <td>129361</td>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-975110</td>\n",
       "      <td>pathway</td>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129365</th>\n",
       "      <td>129365</td>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983168</td>\n",
       "      <td>pathway</td>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129366</th>\n",
       "      <td>129366</td>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983170</td>\n",
       "      <td>pathway</td>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129367</th>\n",
       "      <td>129367</td>\n",
       "      <td>Kinesins</td>\n",
       "      <td>REACTOME</td>\n",
       "      <td>R-HSA-983189</td>\n",
       "      <td>pathway</td>\n",
       "      <td>Kinesins belongs to pathway category. This pat...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3426 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        node_index                                          node_name  \\\n",
       "144            144                                              SMAD3   \n",
       "179            179                                             IL10RB   \n",
       "192            192                                              GNA12   \n",
       "279            279                                              HNF4A   \n",
       "417            417                                              VCAM1   \n",
       "...            ...                                                ...   \n",
       "129360      129360  IRAK2 mediated activation of TAK1 complex upon...   \n",
       "129361      129361  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...   \n",
       "129365      129365  Antigen processing: Ubiquitination & Proteasom...   \n",
       "129366      129366  Antigen Presentation: Folding, assembly and pe...   \n",
       "129367      129367                                           Kinesins   \n",
       "\n",
       "       node_source       node_id     node_type  \\\n",
       "144           NCBI          4088  gene/protein   \n",
       "179           NCBI          3588  gene/protein   \n",
       "192           NCBI          2768  gene/protein   \n",
       "279           NCBI          3172  gene/protein   \n",
       "417           NCBI          7412  gene/protein   \n",
       "...            ...           ...           ...   \n",
       "129360    REACTOME  R-HSA-975163       pathway   \n",
       "129361    REACTOME  R-HSA-975110       pathway   \n",
       "129365    REACTOME  R-HSA-983168       pathway   \n",
       "129366    REACTOME  R-HSA-983170       pathway   \n",
       "129367    REACTOME  R-HSA-983189       pathway   \n",
       "\n",
       "                                            enriched_node  \n",
       "144     SMAD3 belongs to gene/protein category. The SM...  \n",
       "179     IL10RB belongs to gene/protein category. The p...  \n",
       "192     GNA12 belongs to gene/protein category. Predic...  \n",
       "279     HNF4A belongs to gene/protein category. The pr...  \n",
       "417     VCAM1 belongs to gene/protein category. This g...  \n",
       "...                                                   ...  \n",
       "129360  IRAK2 mediated activation of TAK1 complex upon...  \n",
       "129361  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...  \n",
       "129365  Antigen processing: Ubiquitination & Proteasom...  \n",
       "129366  Antigen Presentation: Folding, assembly and pe...  \n",
       "129367  Kinesins belongs to pathway category. This pat...  \n",
       "\n",
       "[3426 rows x 6 columns]"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Perform node enrichment for each row in primekg_nodes\n",
    "text_enriched_nodes = primekg_ibd_nodes_df.apply(lambda x: do_enrichment_text(x, starkqa_node_info[x['node_index']]), axis=1).tolist()\n",
    "primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes\n",
    "primekg_ibd_nodes_df"
   ]
  },
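  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `SettingWithCopyWarning` in the output above indicates that `primekg_ibd_nodes_df` was derived by slicing another DataFrame. A minimal sketch of one way to avoid the warning, assuming the subset is not meant to write back to its parent, is to take an explicit copy before adding columns. The toy DataFrames below (`nodes_df`, `ibd_nodes_df`) are hypothetical and only illustrate the pattern; they are not part of the tutorial pipeline:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data standing in for the full node table\n",
    "nodes_df = pd.DataFrame({\n",
    "    'node_index': [0, 1, 2],\n",
    "    'node_type': ['disease', 'gene/protein', 'gene/protein'],\n",
    "})\n",
    "\n",
    "# An explicit copy makes the subset an independent DataFrame\n",
    "ibd_nodes_df = nodes_df[nodes_df['node_type'] == 'gene/protein'].copy()\n",
    "\n",
    "# Adding a column to the copy does not trigger SettingWithCopyWarning\n",
    "ibd_nodes_df['enriched_node'] = ['text a', 'text b']\n",
    "\n",
    "# The parent DataFrame is left untouched\n",
    "assert 'enriched_node' not in nodes_df.columns\n",
    "```\n",
    "\n",
    "Applying `.copy()` at the point where the IBD subset is first created would likely silence the later warnings in this notebook as well."
   ]
  },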
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subsequently, we can perform similar textual enrichment for the edges in PrimeKG.\n",
    "\n",
    "Since StarkQA only provides node information, we can only enrich the edges with basic information of the triples in combination with the head and tail nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_index</th>\n",
       "      <th>head_name</th>\n",
       "      <th>head_source</th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_type</th>\n",
       "      <th>tail_index</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>tail_source</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_type</th>\n",
       "      <th>display_relation</th>\n",
       "      <th>relation</th>\n",
       "      <th>edge_type</th>\n",
       "      <th>enriched_edge</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>113</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>MONDO_grouped</td>\n",
       "      <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
       "      <td>disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>639</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>MONDO</td>\n",
       "      <td>5101</td>\n",
       "      <td>disease</td>\n",
       "      <td>2712</td>\n",
       "      <td>CASP3</td>\n",
       "      <td>NCBI</td>\n",
       "      <td>836</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>associated with</td>\n",
       "      <td>disease_protein</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   head_index                     head_name    head_source  \\\n",
       "0       37785  ulcerative colitis (disease)          MONDO   \n",
       "1       28158    inflammatory bowel disease  MONDO_grouped   \n",
       "2       37785  ulcerative colitis (disease)          MONDO   \n",
       "3       28158    inflammatory bowel disease  MONDO_grouped   \n",
       "4       37785  ulcerative colitis (disease)          MONDO   \n",
       "\n",
       "                                             head_id head_type  tail_index  \\\n",
       "0                                               5101   disease        7359   \n",
       "1  9960_12845_33643_11471_12831_12875_12941_13153...   disease        7359   \n",
       "2                                               5101   disease        2874   \n",
       "3  9960_12845_33643_11471_12831_12875_12941_13153...   disease        2874   \n",
       "4                                               5101   disease        2712   \n",
       "\n",
       "  tail_name tail_source tail_id     tail_type display_relation  \\\n",
       "0     ADCY7        NCBI     113  gene/protein  associated with   \n",
       "1     ADCY7        NCBI     113  gene/protein  associated with   \n",
       "2     PRDM1        NCBI     639  gene/protein  associated with   \n",
       "3     PRDM1        NCBI     639  gene/protein  associated with   \n",
       "4     CASP3        NCBI     836  gene/protein  associated with   \n",
       "\n",
       "          relation                                 edge_type  \\\n",
       "0  disease_protein  (disease, associated with, gene/protein)   \n",
       "1  disease_protein  (disease, associated with, gene/protein)   \n",
       "2  disease_protein  (disease, associated with, gene/protein)   \n",
       "3  disease_protein  (disease, associated with, gene/protein)   \n",
       "4  disease_protein  (disease, associated with, gene/protein)   \n",
       "\n",
       "                                       enriched_edge  \n",
       "0  ulcerative colitis (disease) (disease) has a d...  \n",
       "1  inflammatory bowel disease (disease) has a dir...  \n",
       "2  ulcerative colitis (disease) (disease) has a d...  \n",
       "3  inflammatory bowel disease (disease) has a dir...  \n",
       "4  ulcerative colitis (disease) (disease) has a d...  "
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Perform textual enrichment over the edges by simply concatenating the head and tail nodes with the relation followed by the enriched node information\n",
    "text_enriched_edges = primekg_ibd_edges_df.apply(lambda x: f\"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).\", axis=1).tolist()\n",
    "primekg_ibd_edges_df['enriched_edge'] = text_enriched_edges\n",
    "primekg_ibd_edges_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Embeddings (using textual embedding as of now)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to perform embedding using the enriched nodes and edges by leveraging `EmbeddingWithOllama` class.\n",
    "\n",
    "For this purpose, we will use `nomic-embed-text`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using nomic-ai/nomic-embed-text-v1.5 model\n",
    "emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Node Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will perform node embedding for the IBD-related nodes using the Ollama model by using mini-batches of 100 nodes at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 35/35 [00:19<00:00,  1.75it/s]\n"
     ]
    }
   ],
   "source": [
    "# Since the records of nodes has large amount of data, we will split them into mini-batches\n",
    "mini_batch_size = 100\n",
    "node_embeddings = []\n",
    "for i in tqdm(range(0, primekg_ibd_nodes_df.shape[0], mini_batch_size)):\n",
    "    outputs = emb_model.embed_documents(primekg_ibd_nodes_df.enriched_node.values.tolist()[i:i+mini_batch_size])\n",
    "    node_embeddings.extend(outputs)\n",
    "# node_embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3426, 768)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the shape of the node embeddings\n",
    "len(node_embeddings), len(node_embeddings[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_64662/3470083233.py:2: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df['x'] = node_embeddings\n",
      "/tmp/ipykernel_64662/3470083233.py:5: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)\n",
      "/tmp/ipykernel_64662/3470083233.py:6: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_type</th>\n",
       "      <th>enriched_node</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>144</td>\n",
       "      <td>SMAD3</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
       "      <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>179</td>\n",
       "      <td>IL10RB</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>IL10RB belongs to gene/protein category. The p...</td>\n",
       "      <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>192</td>\n",
       "      <td>GNA12</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
       "      <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279</th>\n",
       "      <td>279</td>\n",
       "      <td>HNF4A</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
       "      <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>417</th>\n",
       "      <td>417</td>\n",
       "      <td>VCAM1</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
       "      <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     node_id node_name     node_type  \\\n",
       "144      144     SMAD3  gene/protein   \n",
       "179      179    IL10RB  gene/protein   \n",
       "192      192     GNA12  gene/protein   \n",
       "279      279     HNF4A  gene/protein   \n",
       "417      417     VCAM1  gene/protein   \n",
       "\n",
       "                                         enriched_node  \\\n",
       "144  SMAD3 belongs to gene/protein category. The SM...   \n",
       "179  IL10RB belongs to gene/protein category. The p...   \n",
       "192  GNA12 belongs to gene/protein category. Predic...   \n",
       "279  HNF4A belongs to gene/protein category. The pr...   \n",
       "417  VCAM1 belongs to gene/protein category. This g...   \n",
       "\n",
       "                                                     x  \n",
       "144  [0.026536005, 0.05420931, -0.17033643, -0.0248...  \n",
       "179  [0.024764946, 0.022782002, -0.16956052, -0.033...  \n",
       "192  [0.004795947, 0.04921528, -0.14488313, -0.0492...  \n",
       "279  [0.013905027, 0.032602787, -0.15260702, 0.0074...  \n",
       "417  [0.047299746, 0.032621186, -0.15677826, -0.021...  "
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Add them as features to the dataframe\n",
    "primekg_ibd_nodes_df['x'] = node_embeddings\n",
    "\n",
    "# Drop and rename several columns\n",
    "primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)\n",
    "primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)\n",
    "\n",
    "# Check dataframe of nodes\n",
    "primekg_ibd_nodes_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_64662/1471123717.py:2: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_type</th>\n",
       "      <th>enriched_node</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>node</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>144</td>\n",
       "      <td>SMAD3</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
       "      <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>179</td>\n",
       "      <td>IL10RB</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>IL10RB belongs to gene/protein category. The p...</td>\n",
       "      <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>192</td>\n",
       "      <td>GNA12</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
       "      <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279</th>\n",
       "      <td>279</td>\n",
       "      <td>HNF4A</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
       "      <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>417</th>\n",
       "      <td>417</td>\n",
       "      <td>VCAM1</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
       "      <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      node_id node_name     node_type  \\\n",
       "node                                    \n",
       "144       144     SMAD3  gene/protein   \n",
       "179       179    IL10RB  gene/protein   \n",
       "192       192     GNA12  gene/protein   \n",
       "279       279     HNF4A  gene/protein   \n",
       "417       417     VCAM1  gene/protein   \n",
       "\n",
       "                                          enriched_node  \\\n",
       "node                                                      \n",
       "144   SMAD3 belongs to gene/protein category. The SM...   \n",
       "179   IL10RB belongs to gene/protein category. The p...   \n",
       "192   GNA12 belongs to gene/protein category. Predic...   \n",
       "279   HNF4A belongs to gene/protein category. The pr...   \n",
       "417   VCAM1 belongs to gene/protein category. This g...   \n",
       "\n",
       "                                                      x  \n",
       "node                                                     \n",
       "144   [0.026536005, 0.05420931, -0.17033643, -0.0248...  \n",
       "179   [0.024764946, 0.022782002, -0.16956052, -0.033...  \n",
       "192   [0.004795947, 0.04921528, -0.14488313, -0.0492...  \n",
       "279   [0.013905027, 0.032602787, -0.15260702, 0.0074...  \n",
       "417   [0.047299746, 0.032621186, -0.15677826, -0.021...  "
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Duplicate a node_name as index and use it as index\n",
    "primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']\n",
    "primekg_ibd_nodes_df.set_index('node', inplace=True)\n",
    "primekg_ibd_nodes_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the embedded nodes dataframes to parquet file\n",
    "primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes_embedded.parquet'), compression='gzip', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Edge Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, we also conduct node embedding for the IBD-related edges using the Ollama model by using mini-batches of 100 edges at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 128/128 [00:48<00:00,  2.64it/s]\n"
     ]
    }
   ],
   "source": [
    "# Since the records of edges has large amount of data, we will split them into mini-batches\n",
    "mini_batch_size = 100\n",
    "edge_embeddings = []\n",
    "for i in tqdm(range(0, primekg_ibd_edges_df.shape[0], mini_batch_size)):\n",
    "    outputs = emb_model.embed_documents(primekg_ibd_edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])\n",
    "    edge_embeddings.extend(outputs)\n",
    "# edge_embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(12752, 768)"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the shape of the edge embeddings\n",
    "len(edge_embeddings), len(edge_embeddings[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_name</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>edge_type</th>\n",
       "      <th>enriched_edge</th>\n",
       "      <th>edge_attr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.061832674, 0.040013667, -0.15366873, -0.008...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>7359</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "      <td>[0.050393466, 0.030410834, -0.15008788, -0.013...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.0401622, 0.028982995, -0.15433805, 0.006565...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>28158</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>2874</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "      <td>[0.02781422, 0.01603875, -0.14870141, 0.004470...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37785</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>2712</td>\n",
       "      <td>CASP3</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.07853663, 0.050751355, -0.1470567, -0.01237...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   head_id                     head_name  tail_id tail_name  \\\n",
       "0    37785  ulcerative colitis (disease)     7359     ADCY7   \n",
       "1    28158    inflammatory bowel disease     7359     ADCY7   \n",
       "2    37785  ulcerative colitis (disease)     2874     PRDM1   \n",
       "3    28158    inflammatory bowel disease     2874     PRDM1   \n",
       "4    37785  ulcerative colitis (disease)     2712     CASP3   \n",
       "\n",
       "                                  edge_type  \\\n",
       "0  (disease, associated with, gene/protein)   \n",
       "1  (disease, associated with, gene/protein)   \n",
       "2  (disease, associated with, gene/protein)   \n",
       "3  (disease, associated with, gene/protein)   \n",
       "4  (disease, associated with, gene/protein)   \n",
       "\n",
       "                                       enriched_edge  \\\n",
       "0  ulcerative colitis (disease) (disease) has a d...   \n",
       "1  inflammatory bowel disease (disease) has a dir...   \n",
       "2  ulcerative colitis (disease) (disease) has a d...   \n",
       "3  inflammatory bowel disease (disease) has a dir...   \n",
       "4  ulcerative colitis (disease) (disease) has a d...   \n",
       "\n",
       "                                           edge_attr  \n",
       "0  [0.061832674, 0.040013667, -0.15366873, -0.008...  \n",
       "1  [0.050393466, 0.030410834, -0.15008788, -0.013...  \n",
       "2  [0.0401622, 0.028982995, -0.15433805, 0.006565...  \n",
       "3  [0.02781422, 0.01603875, -0.14870141, 0.004470...  \n",
       "4  [0.07853663, 0.050751355, -0.1470567, -0.01237...  "
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Add them as features to the dataframe\n",
    "primekg_ibd_edges_df['edge_attr'] = edge_embeddings\n",
    "\n",
    "# Drop and rename several columns\n",
    "primekg_ibd_edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'display_relation', 'relation'], inplace=True)\n",
    "primekg_ibd_edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)\n",
    "\n",
    "# Check dataframe of edges\n",
    "primekg_ibd_edges_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the embedded edges dataframe to a parquet file\n",
    "primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges_embedded.parquet'), compression='gzip', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Knowledge Graph Construction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we convert our dataframes into a networkx `DiGraph` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_64662/4233198491.py:2: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df[\"node\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
      "/tmp/ipykernel_64662/4233198491.py:3: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  primekg_ibd_nodes_df[\"node_id\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_name</th>\n",
       "      <th>node_type</th>\n",
       "      <th>enriched_node</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>node</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>SMAD3_(144)</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>SMAD3</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
       "      <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IL10RB_(179)</th>\n",
       "      <td>IL10RB_(179)</td>\n",
       "      <td>IL10RB</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>IL10RB belongs to gene/protein category. The p...</td>\n",
       "      <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GNA12_(192)</th>\n",
       "      <td>GNA12_(192)</td>\n",
       "      <td>GNA12</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
       "      <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>HNF4A_(279)</th>\n",
       "      <td>HNF4A_(279)</td>\n",
       "      <td>HNF4A</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
       "      <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>VCAM1_(417)</th>\n",
       "      <td>VCAM1_(417)</td>\n",
       "      <td>VCAM1</td>\n",
       "      <td>gene/protein</td>\n",
       "      <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
       "      <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   node_id node_name     node_type  \\\n",
       "node                                                 \n",
       "SMAD3_(144)    SMAD3_(144)     SMAD3  gene/protein   \n",
       "IL10RB_(179)  IL10RB_(179)    IL10RB  gene/protein   \n",
       "GNA12_(192)    GNA12_(192)     GNA12  gene/protein   \n",
       "HNF4A_(279)    HNF4A_(279)     HNF4A  gene/protein   \n",
       "VCAM1_(417)    VCAM1_(417)     VCAM1  gene/protein   \n",
       "\n",
       "                                                  enriched_node  \\\n",
       "node                                                              \n",
       "SMAD3_(144)   SMAD3 belongs to gene/protein category. The SM...   \n",
       "IL10RB_(179)  IL10RB belongs to gene/protein category. The p...   \n",
       "GNA12_(192)   GNA12 belongs to gene/protein category. Predic...   \n",
       "HNF4A_(279)   HNF4A belongs to gene/protein category. The pr...   \n",
       "VCAM1_(417)   VCAM1 belongs to gene/protein category. This g...   \n",
       "\n",
       "                                                              x  \n",
       "node                                                             \n",
       "SMAD3_(144)   [0.026536005, 0.05420931, -0.17033643, -0.0248...  \n",
       "IL10RB_(179)  [0.024764946, 0.022782002, -0.16956052, -0.033...  \n",
       "GNA12_(192)   [0.004795947, 0.04921528, -0.14488313, -0.0492...  \n",
       "HNF4A_(279)   [0.013905027, 0.032602787, -0.15260702, 0.0074...  \n",
       "VCAM1_(417)   [0.047299746, 0.032621186, -0.15677826, -0.021...  "
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Build a unique node identifier of the form <node_name>_(<node_id>) and use it as the index\n",
    "primekg_ibd_nodes_df[\"node\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
    "primekg_ibd_nodes_df[\"node_id\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
    "primekg_ibd_nodes_df.set_index('node', inplace=True)\n",
    "primekg_ibd_nodes_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_id</th>\n",
       "      <th>head_name</th>\n",
       "      <th>tail_id</th>\n",
       "      <th>tail_name</th>\n",
       "      <th>edge_type</th>\n",
       "      <th>enriched_edge</th>\n",
       "      <th>edge_attr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ulcerative colitis (disease)_(37785)</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>ADCY7_(7359)</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.061832674, 0.040013667, -0.15366873, -0.008...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>inflammatory bowel disease_(28158)</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>ADCY7_(7359)</td>\n",
       "      <td>ADCY7</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "      <td>[0.050393466, 0.030410834, -0.15008788, -0.013...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ulcerative colitis (disease)_(37785)</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>PRDM1_(2874)</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.0401622, 0.028982995, -0.15433805, 0.006565...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>inflammatory bowel disease_(28158)</td>\n",
       "      <td>inflammatory bowel disease</td>\n",
       "      <td>PRDM1_(2874)</td>\n",
       "      <td>PRDM1</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>inflammatory bowel disease (disease) has a dir...</td>\n",
       "      <td>[0.02781422, 0.01603875, -0.14870141, 0.004470...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ulcerative colitis (disease)_(37785)</td>\n",
       "      <td>ulcerative colitis (disease)</td>\n",
       "      <td>CASP3_(2712)</td>\n",
       "      <td>CASP3</td>\n",
       "      <td>(disease, associated with, gene/protein)</td>\n",
       "      <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
       "      <td>[0.07853663, 0.050751355, -0.1470567, -0.01237...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                head_id                     head_name  \\\n",
       "0  ulcerative colitis (disease)_(37785)  ulcerative colitis (disease)   \n",
       "1    inflammatory bowel disease_(28158)    inflammatory bowel disease   \n",
       "2  ulcerative colitis (disease)_(37785)  ulcerative colitis (disease)   \n",
       "3    inflammatory bowel disease_(28158)    inflammatory bowel disease   \n",
       "4  ulcerative colitis (disease)_(37785)  ulcerative colitis (disease)   \n",
       "\n",
       "        tail_id tail_name                                 edge_type  \\\n",
       "0  ADCY7_(7359)     ADCY7  (disease, associated with, gene/protein)   \n",
       "1  ADCY7_(7359)     ADCY7  (disease, associated with, gene/protein)   \n",
       "2  PRDM1_(2874)     PRDM1  (disease, associated with, gene/protein)   \n",
       "3  PRDM1_(2874)     PRDM1  (disease, associated with, gene/protein)   \n",
       "4  CASP3_(2712)     CASP3  (disease, associated with, gene/protein)   \n",
       "\n",
       "                                       enriched_edge  \\\n",
       "0  ulcerative colitis (disease) (disease) has a d...   \n",
       "1  inflammatory bowel disease (disease) has a dir...   \n",
       "2  ulcerative colitis (disease) (disease) has a d...   \n",
       "3  inflammatory bowel disease (disease) has a dir...   \n",
       "4  ulcerative colitis (disease) (disease) has a d...   \n",
       "\n",
       "                                           edge_attr  \n",
       "0  [0.061832674, 0.040013667, -0.15366873, -0.008...  \n",
       "1  [0.050393466, 0.030410834, -0.15008788, -0.013...  \n",
       "2  [0.0401622, 0.028982995, -0.15433805, 0.006565...  \n",
       "3  [0.02781422, 0.01603875, -0.14870141, 0.004470...  \n",
       "4  [0.07853663, 0.050751355, -0.1470567, -0.01237...  "
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Rewrite head and tail ids as <name>_(<id>) so that they match the node identifiers\n",
    "primekg_ibd_edges_df[\"head_id\"] = primekg_ibd_edges_df.apply(lambda x: f\"{x.head_name}_({x.head_id})\", axis=1)\n",
    "primekg_ibd_edges_df[\"tail_id\"] = primekg_ibd_edges_df.apply(lambda x: f\"{x.tail_name}_({x.tail_id})\", axis=1)\n",
    "primekg_ibd_edges_df.reset_index(drop=True, inplace=True)\n",
    "primekg_ibd_edges_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert dataframes to a knowledge graph as a networkx object\n",
    "kg = nx.DiGraph()\n",
    "for i, row in primekg_ibd_nodes_df.iterrows():\n",
    "    kg.add_node(row['node_id'], **row.to_dict())\n",
    "for i, row in primekg_ibd_edges_df.iterrows():\n",
    "    kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())\n"
   ]
  },
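  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the graph type used above, as a minimal sketch on toy data (the node names `A` and `B` and the relation labels are made up for illustration). `nx.DiGraph` keeps at most one edge per ordered (head, tail) pair and has no edge keys, so the `key=i` argument is stored as an ordinary edge attribute named `key`. If the same pair appeared with more than one relation, a later row would overwrite an earlier one, whereas `nx.MultiDiGraph` would keep both. Whether this matters here depends on whether the edge dataframe contains repeated (head, tail) pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only; node names and relation labels are hypothetical\n",
    "import networkx as nx\n",
    "\n",
    "toy_edges = [('A', 'B', 'relation_1'), ('A', 'B', 'relation_2')]\n",
    "\n",
    "# DiGraph: one edge per (head, tail) pair; `key` becomes a plain edge attribute\n",
    "toy_digraph = nx.DiGraph()\n",
    "for i, (head, tail, relation) in enumerate(toy_edges):\n",
    "    toy_digraph.add_edge(head, tail, key=i, edge_type=relation)\n",
    "print(toy_digraph.number_of_edges())  # 1: the second row updated the first edge\n",
    "print(toy_digraph['A']['B'])\n",
    "\n",
    "# MultiDiGraph: `key` distinguishes parallel edges, so both rows are kept\n",
    "toy_multigraph = nx.MultiDiGraph()\n",
    "for i, (head, tail, relation) in enumerate(toy_edges):\n",
    "    toy_multigraph.add_edge(head, tail, key=i, edge_type=relation)\n",
    "print(toy_multigraph.number_of_edges())  # 2"
   ]
  },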
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save graph object\n",
    "local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'\n",
    "with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'wb') as f:\n",
    "    pickle.dump(kg, f)\n",
    "\n",
    "# Load graph object\n",
    "# with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'rb') as f:\n",
    "#     kg = pickle.load(f)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#Nodes 3426\n",
      "#Edges 12752\n"
     ]
    }
   ],
   "source": [
    "print(\"#Nodes\", kg.number_of_nodes())\n",
    "print(\"#Edges\", kg.number_of_edges())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, we can convert the networkx graph to a PyG `Data` object for further processing (e.g., subgraph extraction)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert networkx graph to PyG data object\n",
    "pyg_graph = from_networkx(kg)\n",
    "\n",
    "# Save graph object\n",
    "with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'wb') as f:\n",
    "    pickle.dump(pyg_graph, f)\n",
    "\n",
    "# Load graph object\n",
    "# with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:\n",
    "#     pyg_graph = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, we prepare a textualized graph of nodes and edges, which can be used, for instance, in a RAG application.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_id</th>\n",
       "      <th>node_attr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IL10RB_(179)</td>\n",
       "      <td>IL10RB belongs to gene/protein category. The p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GNA12_(192)</td>\n",
       "      <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>HNF4A_(279)</td>\n",
       "      <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>VCAM1_(417)</td>\n",
       "      <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3421</th>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3422</th>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3423</th>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3424</th>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3425</th>\n",
       "      <td>Kinesins_(129367)</td>\n",
       "      <td>Kinesins belongs to pathway category. This pat...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3426 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                node_id  \\\n",
       "0                                           SMAD3_(144)   \n",
       "1                                          IL10RB_(179)   \n",
       "2                                           GNA12_(192)   \n",
       "3                                           HNF4A_(279)   \n",
       "4                                           VCAM1_(417)   \n",
       "...                                                 ...   \n",
       "3421  IRAK2 mediated activation of TAK1 complex upon...   \n",
       "3422  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...   \n",
       "3423  Antigen processing: Ubiquitination & Proteasom...   \n",
       "3424  Antigen Presentation: Folding, assembly and pe...   \n",
       "3425                                  Kinesins_(129367)   \n",
       "\n",
       "                                              node_attr  \n",
       "0     SMAD3 belongs to gene/protein category. The SM...  \n",
       "1     IL10RB belongs to gene/protein category. The p...  \n",
       "2     GNA12 belongs to gene/protein category. Predic...  \n",
       "3     HNF4A belongs to gene/protein category. The pr...  \n",
       "4     VCAM1 belongs to gene/protein category. This g...  \n",
       "...                                                 ...  \n",
       "3421  IRAK2 mediated activation of TAK1 complex upon...  \n",
       "3422  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...  \n",
       "3423  Antigen processing: Ubiquitination & Proteasom...  \n",
       "3424  Antigen Presentation: Folding, assembly and pe...  \n",
       "3425  Kinesins belongs to pathway category. This pat...  \n",
       "\n",
       "[3426 rows x 2 columns]"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Prepare nodes\n",
    "nodes_df = pd.DataFrame({\n",
    "    'node_id': list(pyg_graph.node_id),\n",
    "    'node_attr': list(pyg_graph.enriched_node),\n",
    "})\n",
    "nodes_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>head_id</th>\n",
       "      <th>edge_type</th>\n",
       "      <th>tail_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>(gene/protein, associated with, disease)</td>\n",
       "      <td>Crohn disease_(37784)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>(gene/protein, associated with, disease)</td>\n",
       "      <td>inflammatory bowel disease_(28158)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>(gene/protein, associated with, disease)</td>\n",
       "      <td>Crohn's colitis_(83770)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>(gene/protein, associated with, disease)</td>\n",
       "      <td>Crohn ileitis and jejunitis_(35814)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SMAD3_(144)</td>\n",
       "      <td>(gene/protein, interacts with, pathway)</td>\n",
       "      <td>Signaling by NODAL_(62373)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12747</th>\n",
       "      <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
       "      <td>(pathway, interacts with, gene/protein)</td>\n",
       "      <td>TLR4_(3259)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12748</th>\n",
       "      <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
       "      <td>(pathway, interacts with, gene/protein)</td>\n",
       "      <td>TLR9_(10113)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12749</th>\n",
       "      <td>Antigen processing: Ubiquitination &amp; Proteasom...</td>\n",
       "      <td>(pathway, interacts with, gene/protein)</td>\n",
       "      <td>HERC2_(1777)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12750</th>\n",
       "      <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
       "      <td>(pathway, interacts with, gene/protein)</td>\n",
       "      <td>ERAP2_(12763)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12751</th>\n",
       "      <td>Kinesins_(129367)</td>\n",
       "      <td>(pathway, interacts with, gene/protein)</td>\n",
       "      <td>KIF21B_(8564)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>12752 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 head_id  \\\n",
       "0                                            SMAD3_(144)   \n",
       "1                                            SMAD3_(144)   \n",
       "2                                            SMAD3_(144)   \n",
       "3                                            SMAD3_(144)   \n",
       "4                                            SMAD3_(144)   \n",
       "...                                                  ...   \n",
       "12747  IRAK2 mediated activation of TAK1 complex upon...   \n",
       "12748  TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...   \n",
       "12749  Antigen processing: Ubiquitination & Proteasom...   \n",
       "12750  Antigen Presentation: Folding, assembly and pe...   \n",
       "12751                                  Kinesins_(129367)   \n",
       "\n",
       "                                      edge_type  \\\n",
       "0      (gene/protein, associated with, disease)   \n",
       "1      (gene/protein, associated with, disease)   \n",
       "2      (gene/protein, associated with, disease)   \n",
       "3      (gene/protein, associated with, disease)   \n",
       "4       (gene/protein, interacts with, pathway)   \n",
       "...                                         ...   \n",
       "12747   (pathway, interacts with, gene/protein)   \n",
       "12748   (pathway, interacts with, gene/protein)   \n",
       "12749   (pathway, interacts with, gene/protein)   \n",
       "12750   (pathway, interacts with, gene/protein)   \n",
       "12751   (pathway, interacts with, gene/protein)   \n",
       "\n",
       "                                   tail_id  \n",
       "0                    Crohn disease_(37784)  \n",
       "1       inflammatory bowel disease_(28158)  \n",
       "2                  Crohn's colitis_(83770)  \n",
       "3      Crohn ileitis and jejunitis_(35814)  \n",
       "4               Signaling by NODAL_(62373)  \n",
       "...                                    ...  \n",
       "12747                          TLR4_(3259)  \n",
       "12748                         TLR9_(10113)  \n",
       "12749                         HERC2_(1777)  \n",
       "12750                        ERAP2_(12763)  \n",
       "12751                        KIF21B_(8564)  \n",
       "\n",
       "[12752 rows x 3 columns]"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Prepare edges\n",
    "edges_df = pd.DataFrame({\n",
    "    'head_id': list(pyg_graph.head_id),\n",
    "    'edge_type': list(pyg_graph.edge_type),\n",
    "    'tail_id': list(pyg_graph.tail_id),\n",
    "})\n",
    "edges_df"
   ]
  },
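  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of how a textualized graph like this might be passed to a RAG prompt, the two dataframes can be serialized as CSV text. The toy rows and the `textualize` helper below are hypothetical and only mirror the column names used above; this is one possible convention, not necessarily the format that downstream code in this repository expects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only; `textualize` and the toy rows are hypothetical\n",
    "import pandas as pd\n",
    "\n",
    "toy_nodes_df = pd.DataFrame({\n",
    "    'node_id': ['A_(1)', 'B_(2)'],\n",
    "    'node_attr': ['A belongs to disease category.', 'B belongs to gene/protein category.'],\n",
    "})\n",
    "toy_edges_df = pd.DataFrame({\n",
    "    'head_id': ['A_(1)'],\n",
    "    'edge_type': ['(disease, associated with, gene/protein)'],\n",
    "    'tail_id': ['B_(2)'],\n",
    "})\n",
    "\n",
    "def textualize(nodes, edges):\n",
    "    # Serialize each table as CSV text (a header row plus one line per row)\n",
    "    return {'nodes': nodes.to_csv(index=False), 'edges': edges.to_csv(index=False)}\n",
    "\n",
    "toy_text_graph = textualize(toy_nodes_df, toy_edges_df)\n",
    "print(toy_text_graph['nodes'])\n",
    "print(toy_text_graph['edges'])"
   ]
  },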
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the textualized graph (node and edge dataframes) as a pickle file\n",
    "with open(os.path.join(local_dir, 'primekg_ibd_text_graph.pkl'), \"wb\") as f:\n",
    "    pickle.dump({\"nodes\": nodes_df, \"edges\": edges_df}, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}